
How AI Agents Can Safely Ship Code to Production
Added: 2026-06-06

boundaryml · Original Release: 2026-06-05

Building automated pipelines with feature flags requires incremental development rather than trying to implement everything at once. Teams should start with simple tools like CLI or MCP that can query feature flag data, then progressively add capabilities. The goal is to eventually have automated back pressure where metric changes trigger automatic feature flag adjustments. However, teams should not attempt this from day one - they should build incrementally, learning and adapting to their specific system, team, and deployment methodology over time.

[00:00:00]I'm really excited to hear from you today about uh how you think about feature flags.

[00:00:04]>> Feature flags are a fantastic way to go ahead and deploy it to prod measure metrics on it and actually just turn it off while your team analyzes and crunches the numbers.

[00:00:13]>> Most importantly, what do you do with the outputs of that feature flag? And so now, when you deploy this, you've given the model the ability to actually get its own back pressure. A feature-flag-based deployment pipeline, with feature flags as just one ingredient in your list of ways to iterate on code, decide if it's good, decide if you've built the right solution. What's up y'all? We're back with another episode of AI That Works.

[00:00:38]We're going to be hanging out talking about feature flags today. We'll get more into that in a sec. I'm joined by Vaibhav. Vaibhav, please introduce yourself.

[00:00:48]>> Well, I make a programming language called BAML. We're imminently going to release it; we've started giving it out to a few folks for the new stuff that we're doing. But the idea is: let's build an agent-first programming language.

[00:01:00]What about you, Dexter?

[00:01:01]>> We're helping coding agents solve hard problems in complex codebases.

[00:01:05]I'm the founder of a company called HumanLayer. Vaibhav, I assume you have your team in the back there and they're all just hanging out watching TV. Is this what programming looks like in 2026? They're just running their agents overnight, no humans needed? >> For anyone looking for a job, we do let people just watch their agents on the big screen, and we all just watch it like a live sporting event.

[00:01:26]It's fantastic.

[00:01:28]>> Oh yes, the Ralph Wiggum Fishbowl as we like to call it.

[00:01:31]>> Exactly. Exactly.

[00:01:34]>> Cool, man. So I was poking around the internet, and it seems like many vibe coders have discovered the magic of feature flags and are realizing that they can skip code review and pull requests by just shipping something to production and leaving it behind a feature flag, and that is their license to ship things that are not finished or not fully baked. I'm really excited to hear from you today about how you think about feature flags, what they're for, and what they're perhaps not for. And maybe we can get a little spicy with it.

[00:02:13]>> That's it. Yeah, I mean, if anyone's ever worked at a large company, feature flags exist all over the codebase. It's impossible to write any code without feature flags; otherwise you just risk impacting millions of users. But I think what's interesting is that with the paradigm of AI, feature flags have a lot more use cases than we ever imagined before this.

[00:02:37]>> Okay. So uh I have not uh been a staff engineer at a large company. Uh but I have read a lot of Martin Fowler and I think my understanding with feature flags was they became a way to you would have some idea and you would realize that it would need like code changes over here and code changes over here and code changes over here and it wouldn't really be ready to like ship to anybody.

[00:03:03]Let's say you're building a web UI and you want to like you have the top bar and then you have the sidebar. Uh and like these are two separate features and there's a bunch of data layer stuff in between it and like you didn't want to ship this page to any customers until all of the code was in. But that meant that you had to merge this massive change, big merge. And so feature flags became this way of basically saying like when a user uses your app, you would basically let's say you have like a sidenav here with a bunch of options and only one of these options is like the link to go to the new page and you would say you know feature flag on they see the button >> otherwise they I don't see it.

[00:03:58]>> You know what I mean?

[00:03:59]>> Yeah. Exactly. Very very basic.

[00:04:01]>> So you can you could ship parts of a feature to production without actually giving it to users. And the main goal here was like you could do testing and experiments. Uh but I think the original goal was basically to avoid this like hey now I have 5,000 lines of code that I need to get in and none of it has been tested in production yet. And so you would flag it on for internal people and you would flag it on for beta users and then you would eventually flag it on for everybody and remove the feature flag.
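The staged rollout described above (internal people, then beta users, then everybody) can be sketched in a few lines of Python. This is only an illustration: the stage names, the `User` fields, and the "New Page" link are hypothetical and do not come from any particular flag library.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    is_internal: bool = False
    is_beta: bool = False


def flag_enabled(stage: str, user: User) -> bool:
    """Return True if a flag at the given rollout stage is visible to this user."""
    if stage == "everyone":
        return True
    if stage == "beta":
        return user.is_beta or user.is_internal
    if stage == "internal":
        return user.is_internal
    return False  # "off" or any unknown stage: fail closed


def sidenav_links(stage: str, user: User) -> list[str]:
    """Build the side navigation; the new page link only appears behind the flag."""
    links = ["Home", "Settings"]
    if flag_enabled(stage, user):
        links.append("New Page")
    return links
```

The code for the new page is deployed for everyone in every stage; only the visibility of the link changes as the stage widens.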

[00:04:30]>> Yeah, it's basically like I think the way I've always thought about it is uh you have some amount of automation that you want to have in your pipelines. So let's say we draw these lines and like as your actual automation curve goes up.

[00:04:45]The thing is in order to have increased automation, you actually need increased granularity in your system. I was actually going to do it on the other axis.

[00:04:53]>> Okay.

[00:04:53]>> Automation down here. Sorry.

[00:04:55]>> Yeah, you can move it, dude.

[00:04:56]>> Uh, as you want this, you actually want like granularity to go up.

[00:05:00]>> What does this mean? So if you think about how much automation you have, one way is you have staging versus release. And that is one kind of granularity that you have.

[00:05:15]>> Sorry, maybe I did the wrong axis. I don't really know. But staging versus release is one kind of granularity. And like, why do you >> Code is either pre-production and deployed somewhere, or all the code is deployed to production.

[00:05:28]>> Exactly. And that's highly useful. You don't want to break your users. You want your internal teams breaking first. You want your dev team doing it. Another level of this is dev versus staging versus release. So you can see how you can add this stuff on here. But this level of granularity stops being very useful very, very fast. There are only so many different branches you can have, and if you're actually a super megacorp, then you actually have this, and by user I mean internal user. >> Yeah, this is that idea of developer sandboxes, right? Each person gets their own environment so that people aren't fighting over who gets to test their thing on staging today. >> Exactly, exactly. So everyone gets their own environment and they can go do this. And you can see why we are doing this: the more people you add to your team, the harder it is to have automation without having more and more granularity along the way. But again, very quickly, this is the best you can do. You can have individual versus dev versus staging versus release. You really can't go any further than that. The value of feature flags is that you add another slicing of the code: I can choose what things are available at a sub-release level. In the same release I can have different configurations of the system that's running.

[00:06:49]>> Yeah. So let's say you have dev over here >> and you have staging over here.

[00:06:55]>> Yeah.

[00:06:56]>> And you have prod over here.

[00:06:59]Instead of having like this whole bucket of like okay there's one version here.

[00:07:04]Let's see, you have these different versions. Instead, I can say, okay, I'm going to take this slice of the codebase and deploy it. I don't know if this is the right way to think about it, but we have this development thing that is actually being shipped to all the environments, but it's only visible to... it's like a different dimension that you can slice along, right? I don't think the drawing of this...

[00:07:28]>> Yeah, but in theory like this this is exactly the right point. You basically get slicing control that's very different. There's also another huge benefit if you're ever shipping a really complicated feature and I think people don't think about this enough and we'll talk about why this is really important for agents and agentic development teams really quickly um which is that you can turn off features without doing a full deploy and that is highly highly highly useful >> for this goes to prod and then I can basically actually say like without redeploying prod I can actually just flip that off and there's some API somewhere or some database >> DB with feature flags.

[00:08:10]>> Exactly. That guarantees this. And you can see how this is really useful for like really risky code like let's say for example, one of the stuff that we used to do is uh when we worked at Google, we did a lot of work around performance to make like the AR system on the phone really really fast. One of the things that we did was we did a whole like refactor of um some of the what's it called? some of the core algorithms once and when we deployed that we deployed it to like super low-end phones. Now you don't want to you can test it locally you can test it uh you can test it in stage you can test it as an individual developer but you don't know what the end user is actually going to do what expectations are they going to have feature flags are a fantastic way to go ahead and deploy it to prod measure metrics on it and actually just turn it off while your team analyzes and crunches the numbers and then decide to turn it on later if it's actually good >> okay so it's not just about like hey let's try this feature out or let's test it or let's give it to a couple users and see if anybody screams. But you're actually going to measure the difference between the performance of the new version and the old version.

[00:09:18]>> Yep. And you can actually get like behavioral changes in a really nice way when you go do this. And you can measure things around this. And again, if someone screams, you might have done a full roll out and you can say, "Nope, turn this off right now."
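A minimal sketch of the "turn it off without a redeploy" idea: both code paths are deployed, and a flag row in a database decides which one runs on each call. The table layout, the flag name `new_algorithm`, and the function names are assumptions made for illustration; an in-memory SQLite database stands in for whatever shared store or flag service a real system would use.

```python
import sqlite3


def make_flag_store() -> sqlite3.Connection:
    """Create an in-memory flag table (stand-in for a shared flag database)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE flags (name TEXT PRIMARY KEY, enabled INTEGER NOT NULL)")
    return db


def set_flag(db: sqlite3.Connection, name: str, enabled: bool) -> None:
    """Flip a flag at runtime; no deploy is involved."""
    db.execute("INSERT OR REPLACE INTO flags (name, enabled) VALUES (?, ?)", (name, int(enabled)))
    db.commit()


def is_enabled(db: sqlite3.Connection, name: str) -> bool:
    row = db.execute("SELECT enabled FROM flags WHERE name = ?", (name,)).fetchone()
    return bool(row[0]) if row else False  # a missing flag means off


def run_algorithm(db: sqlite3.Connection, frame: int) -> str:
    # Both implementations ship in the same release; the flag picks one per call.
    if is_enabled(db, "new_algorithm"):
        return f"new:{frame}"
    return f"old:{frame}"
```

Because the flag is read on every call, setting it to off takes effect on the next request, which is what makes the "nope, turn this off right now" response possible.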

[00:09:33]>> Yep. Okay. How would you draw this? Because this is the scream-test version: oh, we shipped it and it's actually bad, let's hurry up and quickly turn it off, do a programmatic rollback. >> But how do you think about experimentation here? Can you try to draw that out?

[00:09:52]>> Yeah, I think Zeke talked... it's really just A/B testing. So I'm going to use simple examples because it's easier. What you're really doing is you want to have a group A and then a control group.

[00:10:10]>> Yep.

[00:10:10]>> And you can literally just have them see different slices of the app. So I'll use your little orange thing. Group A can be the only one that sees the orange feature. And because they're being measured at around the same time horizon, uh, and the time is exactly the same, you can basically guarantee that there's no like I mean there still might be depending on how you selected for group A versus the control. That's like kind of your job is select good experimentation sampling. But let's just say you sample like 2% or 5% of your traffic randomly. You likely will just get like a good deviation on if this feature is making a difference on here.

[00:10:44]Now obviously if your group is highly selective you've done you haven't done a good job of like being actually you modeling group A to be very similar to like the your control group then yes you might get different deviations but this gives you at least a good understanding of what the slices are and what you're doing and now you can measure the metrics whatever metric you define for this feature could be revenue could be click-through rate could be engagement rate could be like error rate whatever you want to define and then you can decide if you want to do a full roll out so it's actually fairly straightforward for to build this out assuming you have good metrics and really clean data to help you segment these things from different individuals.
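One common way to get the random 2% or 5% sample mentioned above is to hash the user ID, so each user lands in the same group on every request. The sketch below assumes string user IDs and a hypothetical experiment name; it illustrates the technique and is not the API of any specific experimentation product.

```python
import hashlib


def bucket(user_id: str, experiment: str) -> float:
    """Map a user to a stable number in [0, 1) for this experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 2**32


def assign_group(user_id: str, experiment: str, treatment_fraction: float = 0.05) -> str:
    """Put roughly `treatment_fraction` of users in group A, the rest in control."""
    return "A" if bucket(user_id, experiment) < treatment_fraction else "control"
```

Including the experiment name in the hash means a user who falls in group A for one experiment is not automatically in group A for every other experiment, which helps keep the groups comparable.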

[00:11:22]But I think the more interesting question is actually: why is this useful for agentic code? This was obviously useful for megacorps in the past. Very few startups actually do this. They do this for landing pages and other things, but very few startups do this for large-scale backend changes, from what I have seen. But why is this good for an agentic system?

[00:11:51]Dexter what's your top guess?

[00:11:55]>> So the most exciting thing here, and this is going to be a little bit of a tangent, but we talk a lot about harness engineering. Basically the idea is that the things agents are good at are things with back pressure, right?

[00:12:11]So if I have my agent and I'm passing in a set of specs or whatever, and specs in this case, if we look at the core Ralph Wiggum case, the spec is: here is how the programming language works and here is how various short programs should behave. Right? So the agent is going to do all kinds of really complex stuff, like build a compiler and build a lexer and a parser and all this stuff, right?

[00:12:48]Um, and it's going to build all these really complicated components. And then it's going to, you know, do a test loop where it, you know, is going to like write a program based on specs and then it's going to compile it.

[00:13:11]And then if that succeeds, it's going to run it.

[00:13:15]And at any one of these points, you can get really with no human in the loop, you can get really good uh what we'll call back pressure, which is like, okay, something didn't work. Uh the agent can inspect it and figure out what's wrong, right?
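The compile-then-run loop described above can be sketched as a list of checks that each either pass or return an error message the agent can read. In this illustration Python's built-in `compile` and `exec` stand in for the agent-built compiler and runtime, and the expected value `result == 4` is a made-up spec; all names are hypothetical.

```python
from typing import Callable, Optional

# A check returns an error message on failure, or None on success.
Check = Callable[[str], Optional[str]]


def back_pressure(program: str, checks: list[tuple[str, Check]]) -> Optional[str]:
    """Run checks in order and return the first failure as feedback for the agent."""
    for name, check in checks:
        error = check(program)
        if error is not None:
            return f"{name} failed: {error}"
    return None  # everything passed: no back pressure


def compile_check(program: str) -> Optional[str]:
    try:
        compile(program, "<agent>", "exec")
        return None
    except SyntaxError as exc:
        return str(exc.msg)


def run_check(program: str) -> Optional[str]:
    scope: dict = {}
    try:
        exec(program, scope)
    except Exception as exc:  # any runtime failure is feedback too
        return repr(exc)
    got = scope.get("result")
    return None if got == 4 else f"expected result == 4, got {got!r}"
```

Every failure message is produced without a human in the loop, which is the property that makes this class of problem a good fit for agents.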

[00:13:32]This is really good for certain classes of problems. What it is not good at is, well, I'm sure you've seen this: what is a thing that agents are really bad at assessing, in an automated way, whether it's good or not?

[00:13:46]>> Um anything that is not measurable or like doesn't have like some sort of like indicator outcome. So like uh whether my users are happy or not. So, one that I think is really bad, uh, and I'll tell you because I did this yesterday, is I was building some, um, let me just pull up, uh, this server here. Uh, so I was having an AI make some sloppy ass motion graphics.

[00:14:14]Uh, and it tried really hard to make stuff look good, but at a certain point it just has these weird issues where it like can't quite get the overlaps right.

[00:14:26]There's like gaps in these lines and stuff. It just like is not good at assessing visual things.

[00:14:32]>> Even if you give it, this is with a feedback loop to like use a browser and take screenshots and all this kind of stuff, right?

[00:14:39]And so in my mind, one of the things that AI is really bad at is UI because the agent is going to, let's say you have your specs of like how how how the website should work.

[00:14:52]Um, and so it's going to build a front end and a backend and a database and all this stuff. And then when it goes to test it, the best thing it can do like naively locally uh is you can you know take screenshots.

[00:15:10]You can like hit the API endpoints is pretty is actually pretty good, >> right? That's that's quite deterministic. It can look at the results and and make assertions about the behavior. Um, but like taking screenshots of the web app is not really a good back pressure thing because AI vision is just like not good enough to get things pixel perfect.

[00:15:32]And so the thing that you can use feature flags for that I think is really interesting is like deploy it and track the metrics. And so your model can be shipping, you know, if it has a thesis of how to make the UI better, it can ship three versions of your app.

[00:15:52]Version two. And rather than looking at the button and deciding "that's a button that looks good", basically doing a taste approximation, >> it can go and pull the data: the conversion rate on this page is 3%, >> the conversion rate on this page is 7%, >> and the conversion rate on this page is 5%. And so now, when you deploy this, you've actually given the model the ability to get its own back pressure, >> again in a way where it's actually going to be constructive: it can reflect on what was done and have something a little more quantitative.
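A small sketch of the quantitative signal described above: given page-visit events tagged with the variant each user saw, compute a conversion rate per variant. The event shape `(variant, converted)` and the variant names are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable


def conversion_rates(events: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """events are (variant, converted) pairs; returns variant -> conversion rate."""
    visits: Counter = Counter()
    conversions: Counter = Counter()
    for variant, converted in events:
        visits[variant] += 1
        conversions[variant] += int(converted)
    return {variant: conversions[variant] / visits[variant] for variant in visits}


def best_variant(rates: dict[str, float]) -> str:
    """Pick the variant with the highest observed conversion rate."""
    return max(rates, key=rates.get)
```

With small samples the observed rates are noisy, so a real pipeline would also want a minimum sample size or a significance check before acting on the winner.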

[00:16:32]>> I so freaking wish I could show you guys real production data uh of how to do this, but I can't because it's customer data. But I had a customer do the exact same thing where they took all their data coming in that they stored.

[00:16:46]>> Yep.

[00:16:46]>> And they're storing every single BAML function, and they access the API to go get it, and they literally just have an agent say, "Go analyze my call pattern.

[00:16:55]>> I'm going to be migrating to Gemini 3.0 Flash instead of 2.5 Flash, where the cost is 20% more. Look at my usage pattern of everything that's happening.

[00:17:05]Look at all the traces. Figure out how I can migrate to Gemini 3.0 Flash and then cut 20% of cost somewhere else, so that my cost doesn't change but I can use the latest model." And I was like, that's such an interesting way to go check that out. And watching that agent run was a really incredible, just a very interesting experience. I'll try and build a sample database and then show the same thing. But it's very fascinating when the agent has access to all this data.

[00:17:40]>> I think there's another element of this, though, that we're not talking about. While it's great to give agents access to this data, to me a big part of why we want feature flags with agentic code, and why people are talking about this,

[00:17:55]>> Yeah.

[00:17:56]>> actually relates to your rate of shipping code. If you scroll back up a little bit, Dexter, to the top part where we had dev versus prod versus staging.

[00:18:08]>> I want you guys to think about what what is the what is the real reason to have all these things. The reason that we often have all these things is like uh up one more the the visualization the graph.

[00:18:20]>> Yeah.

[00:18:20]>> Yeah, the reason that we add all these things is that oftentimes this is actually related to your headcount. As you add more headcount, you need more and more granularity, so every single individual member of your team can continue to ship without blocking other people or interfering with other people.

[00:18:36]>> Yep.

[00:18:37]>> And why does that matter? Well, that's because code often has conflating effects, where one piece of code almost always impacts another piece of code, and you want to be able to test and merge and do all these things separately. Well, what is an agentic engineering team? It's a team that is shipping code so freaking fast, and now you need to measure everything. And that's why feature flags to me are incredibly useful: if a team is actually measuring things well, then every single agent should be allowed to merge to prod, but it shouldn't be able to turn things on in prod. And those are two different kinds of things. If you can't merge, then you basically pay a huge tax of not being able to understand if users actually like your feature or not.

[00:19:22]>> And if you want to merge a thousand, go ahead. >> Even that feels... you said agents should be able to merge to prod but not turn things on in prod. And that seems to imply that, okay, a user is going to go look at the feature in prod, and then if it's okay they're going to turn it on. But I think the real magic is >> turn it on at like 0.01%, make sure nothing's broken, and then gradually ramp it up over the next couple of days or hours or whatever it is, and let the agent review the ramp as it goes and make sure we're not impacting metrics. So here, I'll write this down in a side-by-side thing.
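The gradual ramp described above can be sketched as a function that an agent (or a scheduled job) calls at each review point: advance one step if the guardrail metric holds, or drop to zero if it regresses. The step schedule, the use of error rate as the guardrail, and the tolerance value are all assumptions chosen for illustration.

```python
# Hypothetical ramp schedule: 0.01% -> 1% -> 10% -> 50% -> 100% of traffic.
RAMP_STEPS = [0.0001, 0.01, 0.1, 0.5, 1.0]


def next_rollout(current: float, error_rate: float, baseline: float,
                 tolerance: float = 0.002) -> float:
    """Return the next rollout fraction given the latest guardrail reading."""
    if error_rate > baseline + tolerance:
        return 0.0  # guardrail tripped: turn the flag off
    for step in RAMP_STEPS:
        if step > current:
            return step  # metric held: move to the next larger step
    return current  # already fully rolled out
```

Separating "merge" from "ramp" this way means the merge itself changes nothing for users; only calls to something like `next_rollout` do.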

[00:20:06]>> Yep.

[00:20:10]>> Right. Um and when I think about this it's like this is kind of like the engineering workflow. Um, and I'm going to make the side bigger.

[00:20:19]Deploy. So, if you just take regular software engineers, we have this loop.

[00:20:23]And if you think about what the loop ends up being, like I don't know how to draw a circle. Can you draw a circle here, Dexter? Or not a circle, like an arc.

[00:20:29]>> Yeah.

[00:20:30]>> This loop between write and test code is incredibly hot. It's an incredibly hot loop. So it's actually really easy to have an agent continuously iterate on this, because it's fast. We tried to make create-a-PR and deploy faster by doing interesting things: we added more granularity here with tools like CodeRabbit that basically try to make this loop a little bit faster, but there's still a big delta between this and deploying. And as an engineering team, what we've said is feature flags give us a little bit more granularity post-deployment. Well, okay, staging, basically deploy environments, is what we have, and then we have feature flags.

[00:21:15]>> Are you saying there's a long time between creation of the PR and the deploy?

[00:21:19]There's a lot of work that has to happen, and CodeRabbit can do some of it, but you still need manual review, and that's quite a bottleneck.

[00:21:25]>> Yeah. And this is kind of why all these things don't ship very fast. This gap over here is why I think we're really bottlenecked. This whole section is basically the bottleneck. And this is why I feel like we can write code at really fast speed.

[00:21:42]We can debug prod issues at really fast speed because like debugging is fast.

[00:21:48]This statement is actually pretty fast, because you can get logs and all this other stuff over here. This section is really fast, but we get all bottlenecked down here. And this is why everyone is like, we're shipping slop all the time: because the bottleneck of actually deploying is kind of messed up. I think the key question about feature flags is: can you do something interesting to make this no longer the bottleneck in your engineering life cycle? If you can do this, you can build really, really interesting systems around it. So what does it take to do this? Well, I think there are a couple of things you can do, and again, feature flags help a lot with this. One of the things you can do is add another step here.

[00:22:29]I'm going to move manual review and delete this one. You can add another step here that's like "run experiment with prod data".

[00:22:41]So you can imagine literally pulling uh pulling down data from prod and as a part of the code review process pulling it down running a quick little test and being like does this actually work? And I'm going to change the color here so it's like six. So >> this is like CI checks.

[00:22:55]>> Yeah. Where you literally run: hey, does this new code impact offline metrics in any interesting way with prod data?

[00:23:01]Boom. Done.
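As a rough sketch of that extra CI step: replay a sample of recorded inputs through the old and the new implementation, score both with the same offline metric, and fail the check if the new one regresses. The sample format `(input, expected)`, the function names, and the metric are hypothetical; how the sample is pulled from prod is left out.

```python
from typing import Any, Callable, Sequence


def offline_check(samples: Sequence[tuple[Any, Any]],
                  old_fn: Callable[[Any], Any],
                  new_fn: Callable[[Any], Any],
                  metric: Callable[[Any, Any], float],
                  max_regression: float = 0.0) -> dict:
    """Replay recorded inputs through both versions and compare an offline metric."""
    old_score = sum(metric(old_fn(x), expected) for x, expected in samples) / len(samples)
    new_score = sum(metric(new_fn(x), expected) for x, expected in samples) / len(samples)
    return {
        "old": old_score,
        "new": new_score,
        # Pass only if the new code is no worse than the allowed regression.
        "passed": new_score >= old_score - max_regression,
    }
```

The returned report gives a reviewer, human or agent, numbers to look at during code review instead of only a diff.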

[00:23:03]What else can you do? Well, what if you did deploy it, but you deployed it with your feature flag turned off, and then you turned on your feature flag for a very, very small amount of time? So this is different from turning it on for a small percentage of the population. You're just literally turning it on for a very, very short duration. So it's like two >> trace, basically >> or some subset of traces, exactly, >> but it's very different from 0.1% of your population. And there are two dimensions on which feature flags can be turned on: time, and number of users impacted.

[00:23:35]>> So you would turn this on for everybody for like 10 seconds or something >> or or like a small percent of your users. It doesn't really matter like something to give you a sampling.

[00:23:45]>> And there are two dimensions to how you activate your feature flags.

[00:23:49]A lot of people think of it just in terms of how many people or how much of the population, >> or maybe a specific cohort: you only want to turn it on for people who are >> the most active or the least active. But then it's also how long you turn it on for, and most importantly, what you do with the outputs of that feature flag. Turning it on has some effect, which changes the state of the world, which you can then go react to.

[00:24:17]>> Exactly. So you want to go do this, and then you want to collect a bunch of data about your system, some metrics that you defined, very similar to what you're running in CI/CD checks with prod data or the datasets that you have in your evals. >> And then based on those metrics you might want to say, okay, now let's either roll this back and undeploy it, or let's turn this on for an unbounded time period, or maybe a slightly longer time period, for a larger chunk of our users. And as you're doing this, the whole point of the system is to build a more and more granular version of your codebase.
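The two activation dimensions, share of users and time window, can be modeled together, along with the "roll back or widen" decision that follows the metric collection. Everything here (the `Activation` fields, the doubling policy, the zero threshold) is a hypothetical sketch rather than a description of a specific system.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Activation:
    percent: float   # share of users, 0.0 to 1.0
    start: float     # window start, in seconds since the epoch
    duration: float  # window length, in seconds


def active_for(flag: str, act: Activation, user_id: str, now: float) -> bool:
    """The flag is on only inside the time window AND inside the user slice."""
    in_window = act.start <= now < act.start + act.duration
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    slot = int(digest[:8], 16) / 2**32  # stable value in [0, 1) per user
    return in_window and slot < act.percent


def decide(metric_delta: float, act: Activation, threshold: float = 0.0) -> Optional[Activation]:
    """Roll back (None) on a regression; otherwise widen both dimensions."""
    if metric_delta < threshold:
        return None
    return Activation(min(1.0, act.percent * 2), act.start + act.duration, act.duration * 2)
```

A ten-second activation for everyone and a week-long activation for 1% of users are both just points in this two-dimensional space.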

[00:25:01]So everything is incremental and like as you say adds back pressure constantly in your system.

[00:25:06]>> And this is where >> undeploy and iterate. This would be you know progress.

[00:25:12]>> Exactly. And now you can see how a user can actually deploy stuff and turn this into a red loop >> because deploying stuff is no longer risky.

[00:25:22]>> There's still some risk, but there's way less risk if everything is always kind of off by default on any merge.

[00:25:28]>> Yep.

[00:25:29]>> And you can quickly turn things on and off. Now, not every feature can do this. Some features have real cost consequences. For example, if you're doing a database migration, you now have to do a dual-write system where you actually write to both databases for some time. And that might incur real cost and customer impact. There are consistency issues. You need to have two versions of your codebase running at any given time, and they both need to be merged and checked in. I'm not saying that this all comes for free, >> but I think >> it adds a lot of complexity. I actually posted something on Twitter yesterday about this. We do a lot of work with staff engineers on larger teams, with hundreds or thousands of engineers, and they spend a lot of their time running around cleaning up feature flags, because someone did an experiment and maybe the feature flag never got cleaned up and it's just sitting at 1%. The experiment never finished, and so the feature flag got left there. There are dead code paths. Or maybe it got bumped up to 1% and then got turned off, and then everyone got distracted and moved on to the next thing because it's not a fire. And then you just start accumulating more and more dead code, and as your team turns over or people move around, it becomes very hard to keep track of what's important. And it means every time you make a change, you've created more and more... this is like general refactoring, right? Have you read this Ron Jeffries post about refactoring?

[00:27:03]>> Oh yeah, I haven't read the post, but I have opinions on refactoring.

[00:27:06]>> Let me let me uh let me show you let me show you that. We'll run through this really quick. I think this is relevant.

[00:27:11]But he basically says, okay, you have no code, and then you go write code and everything is good, and then as you go you start to have these tangles of code in your codebase that are a little hard, but you just kind of work around them. It slows you down a tiny bit, but it's not that bad.

[00:27:25]And then as you go, you accumulate these like things that need to be refactored and it actually starts to slow you down because you don't want to go touch that code or you need to like move around it.

[00:27:35]And then eventually it's like every feature takes you longer and longer because you have so much mess. And what people want to do is they want to go do a big refactor where they just kind of like clear a bunch of it out at once. Uh and then like basically the idea is like this is the wrong way to do it. And the right way to do it is like the next time you build a feature, instead of just like shipping it and going around, you take twice as long and you go clean up the parts of the codebase that you touch. And then as you do this over time, you're like cutting new paths and it's slower. But then eventually like you ship a feature that gets to reuse some of that work over time.

[00:28:11]>> And so, I don't know, this is how I think about technical debt in a codebase. And I'm curious, what have you seen? What works? >> Yeah, this is kind of how we do it on our team.

[00:28:25]>> Uh but every now and then we do basically just nuke it and rewrite it from scratch >> and that's worth doing as well because it's the cost function is so cheap now.

[00:28:33]Like the cost function is actually doing this. If you have really good testable interfaces is way cheaper than ever before. Like bund did this, right?

[00:28:40]They're like we have really good testable interfaces. We have a program a piece of software that can test this and like boom problem solved.

[00:28:46]>> Yep.

[00:28:48]>> Um cool. Are there are there downsides of feature flags or did you want to say more on that?

[00:28:53]>> No, the main downside is just like I think one the key insight is like think of feature flags in like two dimensions in general. There's like the dimension of how many of your users are getting it and then the time horizon and you can leverage that to actually make a really good automated pipeline.

[00:29:05]>> And the main downside is just slop like you you accumulate dead code over time like you said. So like if you're going to accumulate dead code then like fix it. Like you just don't have a choice.

[00:29:16]It's part of the consequence of wanting to ship really fast. If you're gonna merge more code and run more experiments, you got to pay the tax of cleaning up more code and and like turning experiments on or off and committing to something.
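One way to keep that cleanup tax visible is a periodic report of flags nobody has touched in a while. The sketch below assumes each flag record carries a rollout percentage and a last-changed timestamp; the field names, the 30-day cutoff, and the suggestion strings are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FlagRecord:
    name: str
    percent: float          # current rollout, 0.0 to 1.0
    last_changed: datetime  # when the flag's state was last modified


def cleanup_candidates(flags: list[FlagRecord], now: datetime,
                       max_age: timedelta = timedelta(days=30)) -> dict[str, str]:
    """Suggest a cleanup action for every flag untouched for longer than max_age."""
    suggestions: dict[str, str] = {}
    for flag in flags:
        if now - flag.last_changed <= max_age:
            continue  # recently touched: an experiment may still be running
        if flag.percent >= 1.0:
            suggestions[flag.name] = "fully on: delete the flag and the old code path"
        elif flag.percent <= 0.0:
            suggestions[flag.name] = "fully off: delete the flag and the new code path"
        else:
            suggestions[flag.name] = "stuck partway: commit to on or off, then clean up"
    return suggestions
```

A flag sitting at 1% for months shows up as "stuck partway", which is the case described earlier where an experiment never finished.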

[00:29:29]>> Is is there a world where AI can clean up our feature flags for us? Have you guys thought about this?

[00:29:36]>> We actually do. So, here's the way I think about code bases. Um, and like when we're when we're designing some of the stuff at BAML, here's how I think about it.

[00:29:43]>> I think code should be linear at any given time. Like, code should just have a linear branch where you can actually view it, and this is like your deploy branch, right? This is how, but what you really want to say >> No, none of this tangled merging and things like that. You just think of it as linear history.

[00:30:02]>> Well, I think that's nice for your main deploy. But what you really want to say is for any given deployment at any given time, I'm also running experiments.

[00:30:14]Ah, right.

At any given deployment, and whenever I redeploy. And I'll talk about how this is possible and what you need to make this possible, but these are effectively feature flags of various versions of the code. You're kind of doing like a cross, horizontal, and then you merge into your main branch and all these experiments effectively turn off on every merge >> but a merge is committal, and then you decide which of these experiments you want to bring back up and keep live, but not every single one of them actually has to make its way over, if that makes sense, right? So you might decide that only two of these experiments have to actually make their way over into a new thing when you actually merge.

[00:30:58]>> You might have a new one as well.

[00:31:00]>> Yeah, because maybe this one actually merges in. This becomes a solid experiment and now you start running it in. Yeah, exactly. And you might add a totally new one that wasn't even in there before.

[00:31:10]Right. And I think this is how we really want to think about code. You kind of want to say you've allocated like 10% of your traffic total, or some slice of your traffic, to running experiments, and you just keep running and running experiments. But instead of merging all that code in, you kind of want to have a side channel to deploy this code. And as you deploy the side channel, you're basically bounded based on your business risk. You can choose what kind of risk tolerance you have.

[00:31:41]>> It's almost like a kanban thing where you say like, "Oh, we're only allowed to have like 10 live experiments at a time and if you want to ship a new one, you got to chuck an old one."

[00:31:51]>> Yeah. Oh, well, I mean, or they get less percentage, or you're kind of like fighting for experimentation. And the idea is once you go promote something to be in the main branch, >> that kind of decides what's happening >> that's more headroom for new experiments.

[00:32:06]Well, that basically resets the board, and now everything else has to kind of redeploy on top of that experiment.
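
The budgeting idea in this exchange (a fixed slice of traffic, a cap on live experiments, and a board that resets when something is promoted to main) can be sketched roughly as follows. This is an illustration under stated assumptions, not the system being drawn in the video: `ExperimentBoard`, its method names, and the default limits of 10% and 10 experiments are hypothetical, taken only from the numbers mentioned in conversation.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentBoard:
    """Tracks live experiments against a fixed traffic and slot budget."""
    max_traffic_percent: float = 10.0  # total traffic allowed for experiments
    max_live: int = 10                 # kanban-style cap on live experiments
    live: dict[str, float] = field(default_factory=dict)  # name -> percent

    def used_percent(self) -> float:
        return sum(self.live.values())

    def start(self, name: str, percent: float) -> bool:
        """Start an experiment only if it fits in both budgets."""
        if name in self.live or len(self.live) >= self.max_live:
            return False
        if self.used_percent() + percent > self.max_traffic_percent:
            return False
        self.live[name] = percent
        return True

    def stop(self, name: str) -> None:
        """Chuck an old experiment to free its slot and its traffic."""
        self.live.pop(name, None)

    def promote(self, name: str, keep: list[str]) -> None:
        """Merge one experiment into main and reset the board.

        Everything turns off on merge; only the experiments listed in
        `keep` are brought back up on top of the new main.
        """
        previous = self.live
        self.live = {}
        for other in keep:
            if other != name and other in previous:
                self.start(other, previous[other])
```

Promoting an experiment removes it from the board, so its traffic share becomes headroom for new experiments, and anything not explicitly kept stays off.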

[00:32:13]>> And I think that kind of >> is how I think about this. This is like the easiest way. I think this is what we need: some system that allows us to have experiments that are orthogonal to our main deployment branch. Otherwise, you end up in the world that you're talking about, which is like, I have two versions of my code, and I really have like 50 versions of my code because people are testing two database migrations all at once, and you can't really have that. >> I was going to say, Josh Joshy asks: any top-of-mind concerns when a feature flag involves DB schema changes?

[00:32:44]How do you keep the dual schemas working?

[00:32:47]>> Oh, you just do a dual-write system. You do a single write and a single read, then a dual write and a dual read, or you really carefully slice your users to make sure that they do that. And then you just have to make sure it's backwards compatible. >> Right, you can't have it so that people on the experiment cannot be moved back off the experiment, otherwise you kind of defeat the purpose. >> Exactly, otherwise it's not an experiment. You need to be able to gradually migrate people onto the experiment. Off the top of my head, one way to do this would be to do an offline migration of some pre-selected percentage of your users to the new database schema. You migrate all their historical data. You then do an online experiment where any new writes that they have get written to both the old and new databases, and then you actually turn on the feature flag and let users experience the new database schema. But database schema changes are hard. That's why I would say that you kind of have no excuse not to get your code right in the beginning for lower down the stack. Like, just spend the >> spend the extra hour prompting with Claude to get it right.
And this is kind of how you would do a schema change if you wanted to make a backwards-compatible schema change >> even if you weren't using feature flags. You would create a new table with the different schema, if it wasn't just adding a column to the table that is optionally used or not used. If you really need to change how things are looking, then it's like, okay, cool, we make a new table, and maybe even a new API endpoint, and the new endpoint uses the new path and the old endpoint uses the old path. And that way you can deploy, because basically every deployment in a large distributed system >> is an experiment. Yeah, it's a rollout.
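
The migration steps described above (offline backfill for a pre-selected slice of users, dual writes, then flipping reads with the flag while keeping a way back) can be sketched with in-memory dictionaries standing in for the two schemas. Everything here is hypothetical and for illustration only: `ProfileStore`, `Phase`, and the name-splitting schema change are invented for the example, and a real system would need transactions and failure handling that this sketch omits.

```python
from enum import Enum


class Phase(Enum):
    OLD_ONLY = 1       # single write, single read (old schema)
    DUAL_WRITE = 2     # write both schemas, still read the old one
    DUAL_READ_NEW = 3  # write both schemas, read the new one (flag on)


class ProfileStore:
    """Hypothetical store that keeps an old and a new schema in step."""

    def __init__(self) -> None:
        self.old: dict[str, str] = {}             # old schema: "First Last"
        self.new: dict[str, dict[str, str]] = {}  # new schema: split fields
        self.migrated: set[str] = set()           # users in the experiment

    def backfill(self, user_id: str) -> None:
        """Offline step: copy a user's historical data into the new schema."""
        if user_id in self.old:
            self.new[user_id] = self._to_new(self.old[user_id])
        self.migrated.add(user_id)

    def write(self, user_id: str, full_name: str, phase: Phase) -> None:
        # The old schema is always written, so a user can be moved back off.
        self.old[user_id] = full_name
        if phase is not Phase.OLD_ONLY and user_id in self.migrated:
            self.new[user_id] = self._to_new(full_name)

    def read(self, user_id: str, phase: Phase) -> str:
        if phase is Phase.DUAL_READ_NEW and user_id in self.new:
            row = self.new[user_id]
            return f"{row['first']} {row['last']}".strip()
        return self.old[user_id]

    @staticmethod
    def _to_new(full_name: str) -> dict[str, str]:
        first, _, last = full_name.partition(" ")
        return {"first": first, "last": last}
```

Because every write still lands in the old schema, switching a user's phase back to `OLD_ONLY` returns current data, which is what keeps the change an experiment rather than a one-way migration.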

[00:34:32]Exactly. So you have to be able to support clients on old versions, APIs on old versions, a mix of the current version and the previous version. That's why people love SaaS so much, because at any given time you have two versions in play: the currently deployed one and the new one. Versus if you distribute something like a programming language, where there could be 50 versions deployed out in the world, or you're shipping an on-prem app where customers upgrade on their own cadence, then your backwards compatibility challenges become really, really hard.

[00:35:02]>> Yeah, exactly. But yeah, database schemas are just a known hard problem, but that also means that there are known good solutions to them. You don't have to invent anything.

[00:35:15]Um, so I wouldn't stress too much about this.

[00:35:19]>> Cool.

[00:35:19]>> Um, >> yeah, I think that's it. Like this is fairly straightforward to go do. I highly recommend not the stuff I just drew. I think that actually requires like very interesting runtime semantics.

[00:35:28]But the stuff on the left is fairly easy to go do. You can build an automated pipeline that uses feature flags as a rollout mechanism to collect data, and then literally give Claude access to that data to merge to prod. I think it's a lot of infrastructure work, but I suspect it's maybe a week of work at most to kind of tie everything together.

[00:35:48]>> Yep. And I think one of the failure cases I see here is people try to over-automate this on day one versus building it up incrementally. It's like, okay, start with a CLI or MCP that can query this data, and, you know, launch a coding agent session and ask it, hey, how is this doing, and then walk it through making the code change, whether it's iterating or rolling it back or removing the experiment or whatever it is. And then eventually you can get to the point where you have this automated back pressure, where every time a new feature flag increases a metric, that causes a signal to go back to turn up the feature flag, and it incrementally rolls itself out. Or you even have an agent making these decisions of like, hey, here's the data.
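
As a rough idea of what "access to that data" could look like, the sketch below turns per-variant counts into a small JSON report that a coding agent could read. It is an assumption-heavy illustration: the `VariantStats` fields, the `summarize` function, and the choice of conversion and error rates as the metrics are all invented for the example, and no particular feature-flag product or agent API is implied.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VariantStats:
    variant: str      # e.g. "control" or "treatment"
    users: int        # users who were exposed to this variant
    conversions: int  # users who completed the target action
    errors: int       # users who hit an error


def summarize(flag_name: str, stats: list[VariantStats]) -> str:
    """Build a compact JSON report an agent can read before deciding what to do."""
    rows = []
    for s in stats:
        row = asdict(s)
        # Rates are left as None when a variant has no users yet.
        row["conversion_rate"] = round(s.conversions / s.users, 4) if s.users else None
        row["error_rate"] = round(s.errors / s.users, 4) if s.users else None
        rows.append(row)
    return json.dumps({"flag": flag_name, "variants": rows}, indent=2)
```

A report like this can be printed by a small command-line tool or returned by a tool call, which keeps the first version of the pipeline read-only.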

[00:36:29]Here's the code. Here's the thing that changed. What do you think we should do?

[00:36:34]>> And that's where things get really interesting. But don't try to do that from day one, because you're going to spend three months building a software factory instead of actually shipping value incrementally, and learning and adapting to make these systems work for exactly your system and your team and your deployment methodology. So today on AI That Works we talked all about feature flags: how feature flags were originally created to avoid large merges; how the amount of automation you add to a system is based on the amount of granularity you can have in shipping small slices to a test group; the difference between back pressure on deterministic systems versus using feature flags and feedback on the performance of code to feed back into the agent and improve your code and your systems, especially UIs and things that are hard for agents to actually validate themselves. And then we sketched out a full-on feature-flag-based deployment pipeline, with feature flags as just one ingredient in your list of ways to iterate on code, decide if it's good, decide if you built the right solution, etc. And some really advanced crazy stuff that Vaibhav is thinking about in terms of parallel experiments as they map onto your codebase, and budgeting how much experimentation you have at any given time. So, this was a really fun episode. Look forward to digging into some of this myself and shipping more dynamic code.
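
The automated back pressure described here, where a metric change feeds back into the flag percentage, could eventually reduce to a small decision function like the one below. This is a sketch under assumptions, not a recommended policy: `Observation`, `next_percent`, the step size, and the error-rate threshold are all hypothetical, and a real rollout would want statistical significance checks before acting on a lift.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    baseline_metric: float     # metric for users without the flag
    flagged_metric: float      # metric for users with the flag
    flagged_error_rate: float  # error rate for users with the flag


def next_percent(
    current_percent: float,
    obs: Observation,
    step: float = 10.0,
    min_lift: float = 0.0,
    max_error_rate: float = 0.01,
) -> float:
    """Return the rollout percentage for the next evaluation window."""
    if obs.flagged_error_rate > max_error_rate:
        return 0.0  # kill switch: turn the flag off entirely
    lift = obs.flagged_metric - obs.baseline_metric
    if lift > min_lift:
        return min(100.0, current_percent + step)  # signal says turn it up
    if lift < -min_lift:
        return max(0.0, current_percent - step)    # signal says back off
    return current_percent                         # inconclusive: hold
```

The same inputs could instead be handed to an agent along with the code diff, with this function acting as a guardrail on whatever the agent proposes.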

[00:37:58]>> I guess that's it, actually. This is fun. Adios, amigos.

[00:38:01]>> Good stuff. See you next week.
