Gemini 3.5 Flash Is Good. That’s Not the Story
Indexed: 2026-05-22

1,081 views • 63 • 16:02 • MetalSole • Originally published: 2026-05-22

The video provides a sharp reality check by distinguishing between planning quality and intent recovery, exposing why Gemini 3.5 Flash remains unreliable for complex engineering. It correctly highlights that a model's ability to follow steps is meaningless if it loses sight of the user's original objective.

[00:00:00]So Google I/O kicked off this week and I can't help but feel like they were trying to tell us something.

[00:00:05]>> Antigravity. There's an anti use Google Antigravity gravity ID and they use Antigravity. Antigravity is the experience and Antigravity is Antigravity 2.0. Antigravity is Antigravity harness. Using the new ant the Antigravity is anti logo on it.

[00:00:17]Antigravity 2.0. You can see Antigravity 2.0. Anti 2.0 anti anti anti Antigravity doing anti Antigravity 2.0 and Antigravity.

[00:00:27]Antigravity hard Antigravity. We're bringing Antigravity. Antigravity.

[00:00:30]Antigravity is bringing Antigravity.

[00:00:31]>> Using Antigravity with Antigravity 2.0.

[00:00:34]>> That's right. Apparently, Antigravity has some release or something. I think some of you know about that. We'll talk about that in a second. But first, I want to talk about Gemini 3.5 Flash, the new model that they released. It is their Flash variant; next month, apparently, the Pro version is coming. But I believe the Flash variant is definitely worth digging into and taking a look at. They touted it as a very good, very capable model, even advising companies that they should look at it to offload 80% of their standard work and save billions of dollars. I don't know if it's going to measure up to a real frontier model that's out there right now, and I want to use my benchmark, the Care benchmark, to figure that out. So, I want to take a look at that, and then we will take a look at what happened with Antigravity, all the updates, and we'll actually run a few things there so that you can see it in action and what the difference is. Okay. So, what I want to talk about, as I said, is the brand-new Flash model that they just released, the 3.5 Flash model. It is a great model. It is very, very fast. It's worth taking a look at. It's going to be across a lot of their surface areas.

[00:01:36]They're redesigning search and doing new things with search to bring AI even more forward. The model will be doing that.

[00:01:43]It's already used in the Gemini application. You can use it there immediately. So, just go ask Gemini some questions and you'll start understanding the model itself. It's also used in the Antigravity application that we'll look at in just one second, I promise. But it really is super fast. The question is, how durable is it? So, what we're looking at here is my Care benchmark. It measures two things. The first is planning quality: if you ask for 100 things, how many of them end up in the final plan as it goes to build? The other is intent recovery, which basically means: of what you meant, how much is actually in the final plan itself?

[00:02:19]And these are not esoteric. These are very measurable metrics. Say someone was asking for a very specific thing that wasn't a feature itself but helped us understand how to build that feature, or what kind of mark to hit when we built it. That is something I am very concerned with: that these systems start to lose an understanding of why we intend to do something, or exactly what we're asking for, and are just working on task lists. And so we get very generic outcomes many times, because much of the intent that you poured into what you asked for is not making it through.
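As an illustration of the two measurements described above, here is a minimal sketch. The video does not show how the benchmark actually scores a plan, so everything here is an assumption: the `Requirement` type, the split into "feature" and "intent" items, and the matching by id are hypothetical stand-ins (a real scorer would more likely rely on human or model judgment than on exact id matches).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One thing the user asked for (hypothetical type for illustration).

    kind is "feature" for a requested item, or "intent" for a reason or
    quality bar that explains how the feature should be built.
    """
    id: str
    kind: str


def coverage(requirements, plan_ids, kind):
    """Fraction of requirements of the given kind whose id appears in the plan."""
    relevant = [r for r in requirements if r.kind == kind]
    if not relevant:
        return 0.0
    kept = sum(1 for r in relevant if r.id in plan_ids)
    return kept / len(relevant)


# Made-up request: four features and two pieces of intent.
requirements = [
    Requirement("login-form", "feature"),
    Requirement("password-reset", "feature"),
    Requirement("audit-log", "feature"),
    Requirement("export-csv", "feature"),
    Requirement("must-work-offline", "intent"),
    Requirement("target-non-technical-users", "intent"),
]

# Ids that survived into the (made-up) final plan.
plan_ids = {"login-form", "password-reset", "audit-log", "must-work-offline"}

planning_quality = coverage(requirements, plan_ids, "feature")
intent_recovery = coverage(requirements, plan_ids, "intent")
print(f"planning quality: {planning_quality:.0%}")
print(f"intent recovery:  {intent_recovery:.0%}")
```

Keeping the two scores separate is the point of the sketch: a plan can carry most of the requested features and still drop the reasons behind them.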

[00:02:52]What you're seeing here is the GPT-5.5 models, and you're seeing that they're getting 97 or 98% on the planning benchmark. That's why we had to create a new axis here for the intent recovery. And the highest intent recovery, this is the highest model altogether, is 81%. So it's doing pretty well; that's getting eight out of 10 items and bringing them through. But what we want to see here, if I add another set of models, let's add all of the Anthropic models. You'll see Opus sitting very close to GPT-5.5. Of course, that makes a lot of sense, and roughly the same numbers.

[00:03:28]And down here, you'll see the Sonnet model itself. Now, again, I want to point out we're not at 100% and we're not at zero on this chart. So, seeing Sonnet this low is roughly 50% or in that neighborhood.

[00:03:39]But what we want to add is our latest models which are the models from Google.

[00:03:44]So you can see there are quite a few models from Google. What I want to point out is this was Gemini 3.1 Flash. So this is where you would have worked with the Flash-Lite 3.1 version, and then when you move in you get into just Gemini 3 Flash and then the Gemini 3.1 Pro. So this is probably the range that most of you are thinking about with Gemini. The 3.1 Pro model's been out for a while. So, what I'll do is I'll zoom in here so that we can see better. Here's 3.1 Pro. And now what we see is the brand-new model, 3.5 Flash. So, this is where 3.5 Flash is landing. The medium model, or rather the medium effort level, is doing a little better at both planning and intent recovery. And it's really scoring at about 75%. So, three quarters of the things you ask for make it through to the plan. Not bad, but not nearly as good as something like Sonnet. And only 46%, less than half, of the reasons you're asking for something are making it through to the plan. And that's critically important as you work on something that's got a much longer duration to the execution path. If it's something that you're asking for in one shot or something like that, all of that goes through, because your request goes all the way to the model. But if you're asking the model to first take your request and turn it into a durable plan that it can work through step by step, you'd better get everything you asked for in that plan. Otherwise, it has no chance of actually being built. And so that's what we're seeing here. And I would say, in general, looking at these scores, I would be nervous about using this too much for very hard, long coding efforts. Though I have used it a fair bit on smaller coding efforts, and it really flies and does a great job.

[00:05:27]So, I don't want to say that this is some nail in a coffin. It just happens to measure one thing that I do find important and meaningful, but it's scoring on both axes below Sonnet, if you were wondering. All right, so that's enough of this. Let's get to the fun stuff. What the heck happened with Antigravity? All right, as mentioned, they updated Antigravity. I don't know if you've heard me say Antigravity a couple times, trying to match the 48 or 38 or however many were in the keynote.

[00:05:54]So, I want to say one more time, they've released Antigravity 2.0. And that's really the challenge. What I'm going to show you in a second is what Antigravity used to look like and how you can still use that version of the application. I'll show you what it looks like now. And then we'll run a couple exercises against it really quickly so you can see its speed, but also its throughput and performance. So, let's dive in and take a look at what Antigravity used to look like and what caused all the stink, because there was a little bit of frustration in the way that they released it. Okay, so this is Antigravity IDE. It is now called Antigravity IDE. This is a different application. You can still download it.

[00:06:32]I want to make that very clear. It does not look like something they will be supporting going forward, or at least not for long. They've made it very clear that they're going to move toward that Antigravity 2.0 thing. We'll see that in just a second. But this is what it used to look like: a typical kind of IDE experience that we're all used to from Codex or something like Visual Studio Code, for example. So that's what this looks like. That's what the experience is. Developers and others were used to this. But they decided to update to 2.0. Huge increase. And what they delivered to everybody was this application. So this is what Antigravity 2.0 looks like. Gone are all of the other panels and all of the other features. You can't pull up a terminal window. You can't put in plugins or extensions. All of those other things are now gone. They said that they were going unapologetically agent-centric. So, this is really only agents, only working with agents. Now, I need to show you something else for a second just so that you understand why they went this route, but I do believe this is the future route. So, it might be painful to get here, but at the same time, you're going to really be happy that you came. Okay. And for this, I've dropped into my terminal window. And not surprisingly, I'm about to show you a terminal application that looks like Claude Code or Gemini CLI or even Codex CLI. What they've released is AGY. So there is an Antigravity CLI that you can use. And this feels very much like any other CLI, very similar to the other applications out there. And this is a very important part of what they were actually pulling off. This is the engine behind Antigravity now, not the windowed application that was part of something like Visual Studio Code.

[00:08:20]They couldn't export that and kind of use it in all surfaces. Something like this becomes an actual engine. And what they're really doing with it is they're putting it in the cloud and behind all other surfaces that need some kind of agentic solution or usage of models.

[00:08:34]That's why they really built this. Now, building this gives them a lot of freedom on how to use this. It also gives users a lot of freedom on how to use this. But this replaces Gemini CLI.

[00:08:46]So it wasn't like they didn't have something before. That was a little bit of a frustration as well, as people need to migrate from Gemini CLI over here.

[00:08:54]I'll say migrate. I think they're largely being force-migrated over here. But still, this is the new Gemini CLI replacement. But let's get back to the desktop application, because that's where the real meat of this is. Okay. So you can see that this is very different. I want to show you something else real quick. I'm going to pull this up for a moment. This is Codex CLI. So this is the version from OpenAI, right? This is their version of their desktop builder, agent-centric application, and you can see how similar they are. Frankly, it's near-perfect parity. Except I think the Codex system still has some sophistication over it. Of course, Antigravity is just getting started, but there are a few more tools that you can bring up in the different panels, including something like a terminal window, that do not yet exist in Antigravity. Though you do have some capabilities. You can move around to see different files and different commands with a command window that comes up. And if I want to look at my agents file, it will pull the agents file up off to the side. So it does have some of the same aspects that you might find in the other applications, but it doesn't have a browser surface, which is unique, and it also doesn't seem to have any terminal surface. So you have to ask the agent to do those kinds of actions. So if you wanted it to start your server, you'd tell it, "Start the server." So that really is your path forward. Use the agent surface itself to deal with the shell.

[00:10:30]Might be a little bit painful. Pull up another terminal window if you're really used to that. Okay, but let's talk about how you manage some things here. The projects are down the side. One of the major values here is that you can have multiple projects running all at the same time, with different threads inside of them and different conversations going on. So you might have three or four conversations going on across multiple different projects at the same time and only have to manage their completion or questions here. So it becomes much easier to manage across projects or across requests, instead of multiple terminal windows or something like that. This is a real boon. This is a plus. I will also say, because you can do this, you can also start multiple requests within one project. Make sure that they don't cross up on the same files, of course, but other than that, that's a real win here as well. You can also create scheduled tasks. This is a place where you can set things up to come and do cron-related work. It will run the command that you're asking it to run, in a prompt style. It will work within a project folder if you need it within a specific project, and do whatever you need it to do every single day or hour or whatever you need. Those are also fantastic. So, there are a lot of reasons to really like this. But enough of that.
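The caution above about concurrent requests not crossing up on the same files can be checked by hand before starting them. The following is a minimal sketch, assuming you can list up front which files each request is expected to touch; the request names and file paths are made up, and nothing here is a feature of the application itself.

```python
from itertools import combinations


def find_file_conflicts(requests):
    """Return {(name_a, name_b): shared_files} for each pair of requests that overlap.

    `requests` maps a request name to the set of files it is expected to edit.
    """
    conflicts = {}
    for (name_a, files_a), (name_b, files_b) in combinations(requests.items(), 2):
        shared = set(files_a) & set(files_b)
        if shared:
            conflicts[(name_a, name_b)] = shared
    return conflicts


# Hypothetical requests planned for the same project.
requests = {
    "add-search": {"src/search.py", "src/routes.py"},
    "fix-login": {"src/auth.py", "src/routes.py"},
    "update-docs": {"README.md"},
}

conflicts = find_file_conflicts(requests)
for pair, shared in conflicts.items():
    print(pair, "both touch", sorted(shared))
```

Requests that show up in a conflicting pair are candidates to run one after the other instead of in parallel.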

[00:11:47]I want to show you what's cool about it.

[00:11:50]And to do that, we just have to run a couple of these. So that video that I shared at the beginning about Antigravity, Antigravity, Antigravity, that was done here. So I came up into this conversation and said, I have a video and I need to cut it down to just the words when they said Antigravity. So let's do that ourselves right here. Okay. So let's give this a shot. We're going to start from here and say: using the keynote MP4, please go through the video and find all the places to make a cut-down video where they say the word agent. So that's all I'm looking for. I'm going to do basically the same thing we did before.

[00:12:40]And you can see it very quickly working through the plan that it needed to build and all of the research that it needed to do, and now it just works, at a really breakneck speed, and surprisingly it is very, very successful at this. I'm surprised at this because it's a Flash model.
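The video does not show how the agent produced the cut-down video. One plausible approach, sketched below purely as an assumption, is to start from word-level timestamps (for example from a speech-to-text pass), collect a short span around every occurrence of the target word, and merge spans that overlap. The sample words, the `pad` value, and both function names are hypothetical.

```python
def find_cut_segments(words, target, pad=0.15):
    """Return merged (start, end) spans covering each utterance of `target`.

    `words` is a list of (text, start_sec, end_sec) tuples sorted by start time.
    A simple plural ("agents") is also counted as a match.
    """
    target = target.lower()
    spans = []
    for text, start, end in words:
        token = text.lower().strip(".,!?\"'")
        if token == target or token == target + "s":
            spans.append((max(0.0, start - pad), end + pad))

    # Merge spans that touch or overlap so the cut does not stutter.
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def select_expression(segments):
    """Build a time-selection expression in the style of ffmpeg's select filter."""
    return "+".join(f"between(t,{s:.2f},{e:.2f})" for s, e in segments)


# Made-up word timings standing in for a real transcript.
words = [
    ("These", 0.00, 0.30),
    ("agents", 0.30, 0.80),
    ("help", 0.80, 1.10),
    ("your", 5.00, 5.20),
    ("agent", 5.20, 5.60),
    ("agent.", 5.70, 6.10),
]

segments = find_cut_segments(words, "agent")
expr = select_expression(segments)
print(segments)
print(expr)
```

An expression like this is commonly passed to ffmpeg's `select` and `aselect` filters (followed by `setpts=N/FRAME_RATE/TB` and `asetpts=N/SR/TB` to close the gaps); check the ffmpeg filter documentation for exact quoting before relying on it.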

[00:12:57]Typically, those have been relatively underperforming.

[00:13:01]This one's impressive. Okay, while this one's working, let's do something else.

[00:13:06]You can start a new conversation. I wanted to show this. You don't have to be in a folder. And in fact, if you go to the bottom, you'll see that there's a "no project" down here, so that you can just have a conversation. And I wanted to show one thing that might not be obvious, because it feels like a weakness here. So if I use /browser, you'll see that they have a browser skill already installed in this application. And this is how you get back that browser surface that you might have had in something like Antigravity, the previous IDE. So, if I use this and say, "Use ChatGPT to make a really silly cartoon image of a rabbit trying to talk a squirrel into a bad idea." And let me bring up Chrome here. You'll see that it's asking me whether I want to allow remote debugging, which is how it's interacting with Chrome directly. And you'll see it comes up with a panel here. And what it'll do is open a new tab to ChatGPT and kick off our request.

[00:14:05]All right. And here it is kicking through ChatGPT, asking for the image.
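For context on what remote debugging generally means in Chrome: when Chrome is started with `--remote-debugging-port`, it exposes a local HTTP endpoint that lists the open tabs, each with a WebSocket URL that a tool can use to drive the page. The sketch below shows only that listing step, and it is an assumption that the browser skill works this way; the port number and the sample data are made up so the parsing can be tried without Chrome running.

```python
import json
from urllib.request import urlopen


def list_page_targets(raw_json):
    """Parse a DevTools target list and keep only regular tabs."""
    targets = json.loads(raw_json)
    return [
        {
            "title": t.get("title", ""),
            "url": t.get("url", ""),
            "ws": t.get("webSocketDebuggerUrl", ""),
        }
        for t in targets
        if t.get("type") == "page"
    ]


def fetch_targets(port=9222):
    """Query a locally running Chrome started with --remote-debugging-port=<port>.

    Assumed setup; this function is not called in the example below.
    """
    with urlopen(f"http://127.0.0.1:{port}/json/list", timeout=5) as resp:
        return list_page_targets(resp.read().decode("utf-8"))


# Offline sample shaped like a DevTools target list.
sample = json.dumps([
    {
        "id": "A1",
        "type": "page",
        "title": "ChatGPT",
        "url": "https://chatgpt.com/",
        "webSocketDebuggerUrl": "ws://127.0.0.1:9222/devtools/page/A1",
    },
    {
        "id": "B2",
        "type": "service_worker",
        "title": "worker",
        "url": "https://example.com/sw.js",
    },
])

tabs = list_page_targets(sample)
print(tabs)
```

The permission prompt matters because anything that can reach this endpoint can read and control the open tabs.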

[00:14:21]Opens the image up behind there.

[00:14:23]Downloads the image.

[00:14:31]Opens download history and then moves the file, and then we're done. So there we have it. All right. And let's go back and check on our previous one, which now says it's done.

[00:14:52]And it says the video is here.

[00:14:55]But of course, this does not actually have a video viewer. It's not that kind of system, and it also doesn't have a very elegant way to open this in Finder or open a default application or anything like that. These are the kinds of things that are still a little bit crunchy about it. So let me get this open and let's see how it did.

[00:15:16]>> Agents have agents that agent to agents.

[00:15:19]These are agent powered eyes agent agent agent first agent conversation agent produce agent orchestrator agent harness these agents and model agent to take agents agent really agent for agents these agents agent harness >> agents agents to agents your agent agent agents easy >> to agent agents agents to work for agent agent agent put information agent now information agents were introduced agent >> agent agent agent the agent agents work for you >> how agents are agents don't do the box agents make agents easy agents new agent your age agent 16 >> agent security agent code >> Okay. So, that's pretty much it. I hope you enjoyed seeing some of this about both the new 3.5 Flash model and Antigravity. In any case, I hope you enjoyed this. Thanks for coming along for the ride on this one and I'll see you in the next
