Reinforcement learning environments with extremely sparse rewards (on the order of one reward per 10 million random steps) pose fundamental challenges for traditional algorithms. On randomly generated 35x35 sparse mazes, Puffer 4 has so far stayed at roughly the random solve rate, while Puffer 5 reaches about an 87% completion rate using a state-set-based method that returns the agent to high-interest states and latches on to the first few rewards. The core difficulty is that agents must explore for millions of steps without any feedback, making it nearly impossible to learn which actions are beneficial. The stream also covers a separate entity encoder for data where order does not matter: a pointwise layer mapping to dimension 16, a ReLU activation, and a fused kernel that combines a linear layer to the hidden size with a max over the entity dimension. This set encoder is permutation-invariant, treating entities as a set rather than a sequence. It is described as roughly a 10x efficiency improvement and is deliberately simple: the point is to make a traditionally expensive operation cheap, not to maximize expressiveness.
Deep Dive
Reinforcement learning research with Joseph Suarez
Watch science advance live! I am an MIT PhD and stream my research on reinforcement learning. You can also find me here: https://x.com/jsuarez https://www.twitch.tv/jsuarez5341 Want to learn reinforcement learning or contribute to research? We've helped brand new programmers get up and running within a few months. It's all free - you help us by helping advance science: https://puffer.ai/
Okay, we should be live here. Hello.
Let me go set the Twitch category software.
All right. Cool.
That works.
Nice.
Okay.
Hello, folks.
So overnight we ran a nice sweep on the maze environment with curriculum and it looks like in about 7 minutes we have it solved to 87%.
So 87% on 35 by 35 mazes.
I started a new sweep this morning just to see if the baseline could get anything. And so far the baseline has failed to get, you know, any reasonable score.
Hey Jess Plasma, I mean this is quite a um a significant thing. Now look, we probably are going to have to sweep more um for the baseline just to see if there's anything. But frankly, I would be shocked if uh you could get anything remotely reasonable out of puffer 4 on this task, right?
You literally get a reward one in 10 million steps.
So unless Puffer 4 is just way way better than expected somehow in a way that I like can't possibly fathom, I think this is safely a major breakthrough. A very major breakthrough.
The encoding layer we talked about uh was it the entity encoder?
Uh, there's some, like, crappy Codex kernel and stuff in the tests folder.
I need to integrate it properly and like write it correctly.
Uh, it's the encoder. It's specifically pointwise to dim 16, then ReLU, and then a fused kernel for, uh, linear and then max over the entity dimension, on hidden size 128 or more.
Okay, I'm a little slow here today. So, I think we're just going to take the first little bit of the stream here to figure out what we do. What do we do from this? Like what do we do next?
We have some environments done.
This is set up.
This is done.
This is going to be a bigger question.
That's the same one. Yes. Plasma.
Um, if you implement the same operation in pure PyTorch it will be slow. You need a custom kernel.
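One plausible reason an unfused implementation is slow: a framework-level linear layer followed by a max would typically write out the full per-entity hidden activations before reducing them. The rough arithmetic below is only a sketch. The batch size and entity count are hypothetical; only the intermediate width of 16 and the hidden size of 128 come from the stream.

```python
# Back-of-envelope activation sizes for the entity encoder.
# batch and entities are hypothetical; 16 and 128 come from the stream.
BYTES_PER_FLOAT32 = 4

def activation_bytes(batch: int, entities: int, width: int) -> int:
    """Bytes needed to hold a [batch, entities, width] float32 tensor."""
    return batch * entities * width * BYTES_PER_FLOAT32

batch, entities = 4096, 64          # assumed sizes, not from the source
small = activation_bytes(batch, entities, 16)    # after the pointwise layer
full = activation_bytes(batch, entities, 128)    # unfused linear output
pooled = activation_bytes(batch, 1, 128)         # after max over entities

print(f"dim-16 intermediate: {small / 2**20:.1f} MiB")
print(f"unfused dim-128 output: {full / 2**20:.1f} MiB")
print(f"pooled output: {pooled / 2**20:.1f} MiB")
```

Under these assumed sizes, the unfused dim-128 tensor is eight times larger than the dim-16 intermediate, and a kernel that fuses the last linear layer with the max would never need to store it.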
The sanity envs ended up not being super great.
I actually do not care about these as much.
Local bottom will be nice.
I think that the next thing to do will be to run a sweep on 2048.
We do have a CUDA version, Plasma.
Kernel means CUDA.
When I say kernel, I mean CUDA. I don't mean shitty Triton [ __ ] or whatever.
Let's go check 2048.
I told you, man. I just told you.
Look, I said look in tests: point linear max kernel.cu, crappy Codex FC max, right, literally right there.
Ah, we got to reboot this, uh, Docker.
2048 is a little bit like the Inferno if you squint your brain at it.
Also, speaking of which, Val should be back today.
The purpose of the encoding layer, that specific one is to handle entity data.
So you have a bunch of things in the environment that you can observe and their order doesn't matter. It's a set encoder.
Like you have a bunch of agents to observe or a bunch of points or whatever.
Mhm. Exactly. And it was unsolved before. Arch, this was an unsolved problem. How to do this efficiently and I solved it.
Okay, we got this little baseline.
Then we've got this one.
Now, the thing is I don't know if this is a particularly good test case because um this just gets zero everywhere.
I might need to just do some other sample method. We'll see.
the sphere encode. No, it's much simpler.
The new encoding layer has already been vetted. So, here, let me pull up the slide deck. I use this one for clients.
Let me pull up the client slide deck. I may use it for labs as well.
Okay, this environment here, this was solved.
This was solved with the new encoder.
So, it's the puffers have to spread out.
Like, you see that there are two blue targets and there are multiple blue puffers.
Like, they can't both go to the same target because the target goes on cooldown for a second or two every time you hit it. So, the efficient thing to do, you always want to keep all these squares, all the stars, dark. And you can see they're doing a pretty good job of just keeping all the squares dark, with the exception being if a star spawns super far away, it's not worth it to go there. You just wait for the other one. Um, but this is like a pretty effective coordination thing.
And this did not work no matter what I tried before until I did this architecture.
And then it worked instantly. This is the whole env. So if you want something to play with the architecture, the env is in 40 and it's called minimal.
It's a very simple encoder. The thing is it's not meant to be like fancy. The point is that it takes an operation that is traditionally very expensive to compute and it makes it very efficient.
It's like a 10x improvement.
Uh no, Plasma. So I did this as a one-off test and I did not have time to integrate the kernel generally in a reasonable way.
I will do that though.
I mean, this is literally the type of thing where you can just get Codex to do it temporarily on your branch until you get a clean version from me.
Where can I read about it?
I mean, this is in-progress research.
I don't have reports instantly on everything I do. It's literally just like fully connected to dim 16 ReLU fully connected to your actual hidden dim and then max over entities. And that last layer is fused kernel. It's a fused max. So, the key here is it gives you a two-layer encoder with a very small intermediate layer that's not super expensive to materialize.
And it gives you the full uh the full encoding to your hidden size on the last layer, but it's efficient with the max.
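The encoder described above (fully connected to dim 16, ReLU, fully connected to the hidden dim, max over entities) can be sketched as an unfused reference in plain Python. This is a sketch under stated assumptions, not the kernel from the stream: the per-entity feature size, the random weights, and all function names are hypothetical, and the real version fuses the last linear layer with the max in CUDA.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight is [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def encode_set(entities, w1, b1, w2, b2):
    """Pointwise FC -> ReLU -> FC, then elementwise max over entities."""
    per_entity = [linear(relu(linear(e, w1, b1)), w2, b2) for e in entities]
    # Max over the entity dimension makes the result order-independent.
    return [max(column) for column in zip(*per_entity)]

def random_layer(rng, n_out, n_in):
    """Hypothetical random weights, only so the sketch can run."""
    weight = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weight, bias

rng = random.Random(0)
FEATURES, SMALL, HIDDEN = 5, 16, 128   # FEATURES is an assumed per-entity size
w1, b1 = random_layer(rng, SMALL, FEATURES)
w2, b2 = random_layer(rng, HIDDEN, SMALL)

entities = [[rng.uniform(-1, 1) for _ in range(FEATURES)] for _ in range(8)]
shuffled = entities[:]
rng.shuffle(shuffled)

out = encode_set(entities, w1, b1, w2, b2)
out_shuffled = encode_set(shuffled, w1, b1, w2, b2)
print(len(out), out == out_shuffled)
```

Shuffling the entities leaves the output unchanged, which is the set-encoder property: the max does not care about entity order.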
It's in tests, Jess. It's in tests on even 40.
I don't know if the kernel's any good to be fair. I just like it's the type of thing where even a super basic kernel should just be way better than the default.
Okay, so let me think of what we want to actually do today. I'm a little off today. Um the run this morning was just super hard and then I got kind of sloppy on my deadlift form. I hurt my back a little bit. I'm feeling a little bit off.
Oh, you know what? Let me go grab one thing. Let me go grab one thing real quick.
Okay.
See if it works.
Uh, it does work on minimal plasma.
If you want to integrate it and try it on that first, then by all means, but it does work.
So the decision tree here, provided we don't somehow get a good solve for 35 by 35s with Puffer 4 defaults, um, which I think is next to impossible,
we have full confidence we've made a major breakthrough. Right?
Then the question is practically speaking this lets you solve a qualitatively new type of problem. But where does this actually help?
So we will sweep this instead of prioritized experience replay on 2048.
We will compare that to the Puffer 4 curve.
That will take the whole day for that sweep. We will not bottleneck on that.
Uh in the meanwhile, we will use mazes as our confirmation check that we haven't broken anything.
And we clean up the kernels, I suppose.
What do we do about PER though?
I suppose we keep PER for now.
We can consider dropping it in the future.
You know, I really wouldn't be surprised actually if they're both helpful, right?
I wouldn't be surprised if you do actually just want two layers of PER.
In fact, we should just sweep with both.
Yeah, Jess. I mean, I do my best. I actually I get some people in academia like saying, "Oh, you know, you don't control this or that or you know, you don't run these experiments or that."
But like, I actually think that overall my method produces way more consistent results. Well, I don't just think it; it does produce way more consistent results than you see in academia, because my stuff actually replicates.
Um, I suppose it's a little bit more difficult in my case because I don't consider, like, two experiments, uh, sufficient evidence for anything. But like, when I want to test if a new feature works, I have to run a full hyperparameter sweep, and then if I have like a bunch of new features in combination, you get a massive blow-up of the compute requirements.
I'm gonna let this finish. I'm going to launch the new one and then we're going to do kernel stuff.
Val, my man, I hope your interview went well.
We We have a major breakthrough here.
Um, I don't know if you were around for the super sparse maze result yesterday.
We do have a major breakthrough.
Uh, I'm going to try this on 2048 next.
If you squint, like 2048 is kind of like Inferno in the sense that it's a very long horizon problem that introduces new mechanics um as you get deeper into it. So, I think that'll be a nice test environment, but we live and learn. Did it drop a message about interview happily implement an inferno.
Yeah, honestly, like, you don't need to change the Inferno any more than you already did, Volo. If you already made it state-settable, that piece is the same.
You can even get the state-settable one PR up. It went terrible.
Rep. That sucks, man.
Interviews are kind of a pain in the ass.
If you want to not feel stupid about that, um, I had a DeepMind interview during my PhD where, like, they asked me to graph sine of x squared and I just looked at the dude dumbfounded, like, what in the [ __ ] And you know, if I'd sat down and thought about it for a couple minutes I'd probably be like, oh, duh. But like, what the [ __ ] Weird trivia questions. Yeah, I've had that before. If a company is too dumb to interview properly, you don't want to be there.
The interviewers are very dumb.
CUDA version is in five. No, it's not.
This is branch 40 tests point linear max kernel.cu.
Is this not it?
Wrong fork.
goal for today.
We're going to chuck this on to 2048.
Uh we're probably going to start a sweep on this in a little bit once this runs a little longer. And I think while we're doing this, we're going to start messing with kernels.
That's the most likely um the most likely thing. We're going to start refining the approach a little bit.
Yeah, I think that's what we're going to end up doing. Um, this is running nice.
So, I'm not going to let it just clean everything up on its own, but I am going to let it write a benchmark um for this the logs of feature changes.
Wait, how are you handling the logs of beep? Oh yeah, Spencer. I'm just like I'm just renaming the folder.
If you literally just rename the folder, like constellation just picks up whatever folders you have in the logs directory. So it's very very easy.
So, like, I've just moved all our default logs out and then I just have logs for, like, a few sweeps' worth of experiments on different versions.
That's pretty nice.
Won't it write diff folders of same end with on visual tell of what's diff and con?
It uses the folder name.
I'm pretty sure. Yeah, it uses the folder name um in constellation.
So, I've been able to do this just fine here. Let's see here. Like this is one that I had before. So, this is for this password.
And if you see these are just the folder names and you get the the name when you right click. It's super laggy under Windows for some reason.
But that's probably Winders.
Okay. So, I basically need to know um how much time we're going to waste on the sampling op.
works for me. Yeah, it's not that bad, right?
Robocode can look very cool as well. Um, it's not surprising me that it figured out like a basic ramming strategy when the opponent has no evasion. There are bots that just try to ram into each other. You do damage when you ram.
which I think I implemented. I don't remember.
What is this? Oh, this is profile code.
That's fine.
We'll get the timings first to see if it's fast and then I will clean up the code doing here.
Not going super well.
But I mean, these are just guessing hypers, so this is fine.
I'm going to be right back. I'm going to go use the restroom real quick and grab a couple things and, uh, we will do kernels. Little slow start this morning, but it's okay. Don't I have my be right back screen?
Yeah, there we go.
All right.
Damn, my back is really messed up.
Let's see what we got.
Oh, good timing here, but this is terrible score.
We'll keep running some small things in the background.
All right. Yeah, this is just a profile.
Cool.
Add all sorts of [ __ ] that we're going to end up deleting a lot of. Is there anything better I should be using than a priority queue?
I think priority queue is the right structure here, right?
Everything else would be like strictly based on the history.
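A bounded priority queue of the kind being discussed can be sketched with Python's heapq. This is a minimal illustration, not the stream's implementation: the class name, the capacity, and the way priorities are assigned are all hypothetical, since the stream does not say how states are scored.

```python
import heapq
import itertools

class StateBuffer:
    """Keep the `capacity` highest-priority states seen so far."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []                      # min-heap of (priority, tiebreak, state)
        self._counter = itertools.count()    # tiebreak so states are never compared

    def push(self, priority: float, state) -> None:
        item = (priority, next(self._counter), state)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif priority > self._heap[0][0]:
            # Replace the current lowest-priority entry.
            heapq.heapreplace(self._heap, item)

    def best(self):
        """Return the stored state with the highest priority."""
        return max(self._heap)[2]

    def states(self):
        return [state for _, _, state in self._heap]

    def __len__(self):
        return len(self._heap)

buffer = StateBuffer(capacity=3)
for priority, state in [(0.1, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d"), (0.2, "e")]:
    buffer.push(priority, state)
print(sorted(buffer.states()), buffer.best())
```

A min-heap keeps the weakest entry at the root, so evicting it when a better state arrives costs O(log n), which matters if the buffer is updated on every step.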
Yeah, I saw that plasma.
Yeah, I should probably shouldn't make it the season and the year, you know.
We can also test this on breakout for speed, can't we?
Just do that.
Yeah, this is better already, I think.
Or should she?
Okay, this still has not found anything remotely better.
Yeah, I don't think that's going to find anything.
Okay, we can probably get a 2048 sweep going then.
The idea here is to get uh 2048 sweeping while we're doing other stuff today.
something like Yes.
Yeah.
I think this is fine.
And um what was it? The mini batch. I think we did an inal and total agents. Is that right?
Yeah.
Oops.
Why don't I work at a frontier lab?
They're not doing this stuff.
The frontier labs have become very large and bloated. They're basically doing products now.
Um, I think that you have very very little impact.
Uh, if you're the type that like can just kind of do crazy out there stuff, you can do way more cool stuff just doing crazy out there stuff on your own.
Plus, this is more fun.
I do genuinely think we're going to get some nutty nutty progress out of reinforcement learning in the next year.
Something happened with the mini batch size clearly.
Let me double check this.
I don't want to have a restricted sweep.
Okay, we did sweep mini batch size and we did sweep total agents.
So, let's undo these these changes.
Yeah, I saw that, Ethan.
I mean, in a sense, like why don't you work at a frontier lab? Like, if you think about it, Puffer is the frontier lab for um this type of stuff.
But actually, The snarky answer to that question is that I Hell.
Next breakthrough in RL.
Pepper. We just made a major one yesterday. Like give a guy a break.
Let me actually clean this one up and test it and then we'll think about the next one.
All right.
What did Puffer do yesterday? So, imagine you have an environment where you get a reward one in 10 million times. How the hell do you solve something like that? One in 10 million actions that you take, you'll get a reward. Well, what you do is you run Puffer 5 and you just insta-solve it.
We got some, like, pretty ridiculous exploration with state sets. Um, I refer to this as the latching problem, which is: when you get a very rare reward, how quickly do you latch on to it? We solved that problem. This is a problem that I've thought about for seven or eight years now, and we solved it yesterday.
Yeah, I know. They make vector databases. It's funny.
There's another puffer alpha, beta in PER. No, PER in epoch, PER alone, at least from the initial sweep, gets zero score on the problem we solved. And it would be very surprising if anything not using this exact type of method, uh, were able to solve it.
It's like just it's ridiculously unlikely.
It's a way of build. Uh, no, it's not part of the environment, Pepper.
It's It's not part of the environment.
You have to have a separate encoder.
Let's go look at this.
What in the hell?
22 ms. Are you kidding me?
This is garbage mode.
Uh the problem that I just solved would be impossible to solve with scaling compute.
Like this is incredibly obvious if you actually sit down for five minutes and think about it.
There's just very little of that going on lately.
Okay.
Yeah. So the problem is uh sparse mazes.
So you get randomly generated 35x35 mazes. the probability of solving uh a maze by taking random actions.
So like the number of steps I should say it takes to solve a maze by random actions is approximately 10 million. I benchmarked it.
And since you get zero reward until you solve the maze, all you can do is take random actions.
Um, Puffer 5 solves this problem. We get an 87% completion rate on mazes.
Uh, anything that I've done so far with Puffer 4 gets 0%. Like 0 point whatever it is, 006 or something, which corresponds to roughly the 1 in 10 million, um, solve rate, like the 1 in 10 million random chance.
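To get a feel for how rarely a random policy would see any reward at this sparsity, here is a small calculation. It is a simplification: it treats "one in 10 million" as an independent per-step success chance, which is an assumption made only for this sketch and not something stated in the stream.

```python
import math

P_SOLVE_PER_STEP = 1e-7   # "one in 10 million", treated as a per-step chance

def prob_at_least_one_solve(steps: int, p: float = P_SOLVE_PER_STEP) -> float:
    """Chance of seeing at least one solve in `steps` independent tries."""
    # log1p/expm1 keep precision when p is tiny.
    return -math.expm1(steps * math.log1p(-p))

for steps in (100_000, 1_000_000, 10_000_000, 100_000_000):
    print(f"{steps:>11,} steps: {prob_at_least_one_solve(steps):.4f}")
```

Under this simplified model, even a full 10 million steps gives only about a 63% chance of having seen a single reward, so an algorithm has to make use of the first one or two rewards it ever sees.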
Maybe very slightly above Yes, but like that's not the point, right?
This is a fully general algorithm.
In particular, this does not use the fact that there are small number of total states. Um, it doesn't even rely on the environment being deterministic.
It's a fully general setup.
The only assumption is that you can set the state of the environment. And the uh the key insight in puffer 5 is that we can always write our environment to be like that.
Yeah, this is too many stupid kernels.
Would model based RL solve it?
Um, no, not on its own because you're still getting zero reward.
Breakthrough from yesterday: it's a state-set-based algorithm. It's a general way to return to high-interest states.
It's a combination of prioritized experience replay, though not that much of that anymore actually. I guess it's kind of no longer that. Well, it was starting with prioritized experience replay and then we optimized it, but then over state sets. So, Go-Explore style.
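The general mechanism of returning to a saved state can be illustrated on a toy problem. This is only a sketch of the idea, not the Puffer 5 algorithm: the corridor environment, the get_state/set_state names, the single-slot buffer, and the use of corridor position as the "interest" score are all hypothetical stand-ins. Position is domain information, which the stream says the real method does not rely on.

```python
import random

class Corridor:
    """Toy env: walk from cell 0 to the last cell; reward only at the end."""

    def __init__(self, length: int):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0

    def get_state(self):
        return self.pos

    def set_state(self, state):
        self.pos = state

    def step(self, action: int) -> float:
        self.pos = min(max(self.pos + action, 0), self.length - 1)
        return 1.0 if self.pos == self.length - 1 else 0.0

def run(env, episodes, horizon, use_state_sets, seed=0):
    """Count rewarded episodes under a uniformly random policy."""
    rng = random.Random(seed)
    best_state = 0                            # single-slot "buffer"
    solved = 0
    for _ in range(episodes):
        env.reset()
        if use_state_sets:
            env.set_state(best_state)         # return to the saved state
        for _ in range(horizon):
            reward = env.step(rng.choice((-1, 1)))
            if reward > 0:
                solved += 1
                break
            # Hypothetical interest score: distance along the corridor.
            if env.get_state() > best_state:
                best_state = env.get_state()
    return solved

baseline = run(Corridor(20), episodes=2000, horizon=10, use_state_sets=False)
restored = run(Corridor(20), episodes=2000, horizon=10, use_state_sets=True)
print(baseline, restored)
```

With a 10-step horizon, the far end of a 20-cell corridor is unreachable from the start, so the baseline can never be rewarded. Restarting from the furthest saved state lets progress accumulate across episodes.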
No, you still can't.
You can't solve this problem otherwise. Like, you can go check the literature. Like, this is more than an order of magnitude harder than problems that you'll see in the lit.
The only ways that I've seen that solve this use domain information.
Raw state sets. No action masks.
Yep.
It's pretty nuts. You basically have to get the reward like once or a single-digit number of times and, uh, it just latches on to it perfectly.
Okay, you know what we should do right now is we should we should pull the results so we can tell which of these things actually matter.
Um I can guarantee you like there is no form of Q-learning that is reasonable for this.
If you want to look at my, um, in fact I can make a stronger statement, which is I don't think there's any form of an off-policy method that's even reasonable. You can check my article from last summer, why RL failed to bootstrap.
I have what I consider pretty convincing evidence that, like, off-policy RL is just doomed.
It's just a dead end.
Again, tabular relies on having a small number of states.
It's irrelevant, right? The goal is not to gimmick the problem by exploiting domain information.
See, I actually have been meaning to write an article about this, um, about the usage of, like, toy tasks in reinforcement learning.
And it's kind of funny because I used to hate this type of stuff where it's like, oh, you like a relatively simple looking task that you can solve in a number of ways. But uh as I've gained experience, I found that actually tasks like this are incredibly useful if you just approach them correctly. Like you just have to be very careful that you're not somehow exploiting domain information that trivializes it.
How many states did you have?
Well, Arch, I can add a little tiny bit of noise for instance, and then you have an infinite number of states.
But the thing is, it's not even fully observed. It's partially observed anyways. And actually, the mazes are randomized. So like you basically have unlimited states anyways.
It's moot.
I need to get something for my back.
Really messed up my back this morning.
It's better.
Okay. So, this is this is our data here.
Yeah, it's a general solver.
It's pretty cool. I mean, look, it's very odd because this problem you can make it dramatically easier by changing a few very small things here and there.
Uh, interesting. Actually, it's using half the data as CL.
Oh, wait. No, this is log scale. Let's do lin scale.
Interesting. It's actually using a small amount of CL data.
Hm.
That's a very interesting result.
It actually uses a very small state buffer. It seems the best results are actually all the way at like 100.
Yeah, it's random. It's a random maze.
Technically, it's a set of 8,000 mazes, but I guarantee you it won't change if you just make it random every time. We just do that for perf.
320. This one's the nicest one, I think, right here.
Oh, I'm also interested to see the uh the exploration decay. Let's do that real quick.
How does it have memory to solve the maze?
Uh, it doesn't. It has to explore. It has to learn to explore.
It has memory, but not enough to solve the maze.
So, uh, Jess, I don't know that that is so much of, um, the problem with that is just that the longer runs, like it's actually probably just related to the learning rate in I would imagine.
We can probably figure something out with it.
It's a little tricky Oh, they all crashed.
Yeah. No.
Weird.
Few of them crashed.
Okay, this is what we wanted, right?
This is the exploration decay.
So, this in my mind, this shows us that um this was a good idea.
Yeah, right here. at this one.
Debug crap in a moment as well.
This Yeah, that's good, right?
Not great.
Well, I suppose we should just like clean this all up First very small buffer as well, right?
But that's going to be task dependent.
Okay, what we're going to do going to take these parameters.
Huh?
Wait, what?
I thought I added in better params.
What?
Oh, I know what happened.
Uh yeah, we're working on maze a little bit, but we're going to start to clean up the stuff generally.
like This be much faster.
See, loops.
Yeah. Um, it's not the result is not Oh, yay. It's a general maze. Why is it so [ __ ] slow?
The hell.
I think this thing [ __ ] with it.
Unless I didn't read the time scale correctly.
Of course.
Yeah, no, this should not be that slow.
Something's [ __ ] up.
Oh, wait. What? How is this using Why does this have 36% memory usage on Huh?
What's using this?
Oh, am I on the wrong Damn it. Wrong machine.
the right speed.
Yeah, that's way better.
Let's see if this works.
here. So, this is the key, right? So, that's the first time it got the reward.
It took it like millions of steps to get the reward the first time. And do you see this? It actually latched on to that initial reward. And this score is starting to go up.
That's the key.
We'll let this one run and then we'll do um we'll test breakout and stuff like a sparse reward. Yes, very sparse reward.
One in 10 million to get a reward by random chance back. Good.
Make sure that this replicates This I'm going to try this. What I'm going to do is after this, we're going to throw it on breakout and we're going to see how much speed it costs us on breakout and we're going to use that to optimize.
So, we have our fastest, um, env as optimization test.
When you need to offset: uh, so it's if you have envs that are going to all finish in the same number of time steps, right? So, like, let's say you have an env that's like always going to run 5k time steps.
Then you probably what you want to do is for the first time you initialize them, initialize them already like between one and 5,000 steps through randomly so that you break the um so that they desynchronize essentially.
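The desynchronization trick described here can be sketched in a few lines. This is a minimal illustration under assumptions: the function names and the number of environments are hypothetical, and a real environment would need to actually fast-forward or otherwise account for the initial offset.

```python
import random

HORIZON = 5000      # every episode lasts exactly this many steps
NUM_ENVS = 8        # assumed count, for illustration only

def initial_offsets(num_envs, horizon, seed=0):
    """Random starting step counts so fixed-length episodes end at different times."""
    rng = random.Random(seed)
    return [rng.randrange(horizon) for _ in range(num_envs)]

def steps_until_first_reset(offsets, horizon):
    """How many more steps each env runs before its first episode ends."""
    return [horizon - offset for offset in offsets]

synced = steps_until_first_reset([0] * NUM_ENVS, HORIZON)
desynced = steps_until_first_reset(initial_offsets(NUM_ENVS, HORIZON), HORIZON)
print(synced)
print(desynced)
```

Without offsets, every environment finishes on the same step, so episode-end statistics arrive in bursts. With random offsets, the resets are spread across the horizon and the logged score is smoother.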
With chess though, this shouldn't be an issue, right? Because the games are different lengths.
I don't know what env you are on.
I mean, you should notice this quite quickly because you'll get very noisy logs. Like the telltale of that is your score just jumping around a ton.
I see.
Yeah, that'll do it.
Uh yes, the model does have memory.
Our standard model has memory.
Okay.
So that's like okay at least This is what we were going to try here.
And let's do this is mostly just a speed test.
Anything that we get above just like doing speed tests here is bonus.
Okay, no perf hit. Oh, obviously score's way worse.
With the 100 states, at least, not really hurting our perf at all. Eight microseconds.
Let's do this.
Way better.
Uh yes, cla zero would be the normal algorithm.
It's actually interesting that um it didn't need very much CL data in order to do very good.
We also have not yet swept with both.
So I think the, um, the prioritized replay, so our normal PER, gets much weaker with horizon typically.
The state set one doesn't.
Okay.
The thing I wanted to test with this, so this is going to be warm-up states and state buffer size.
Oh, it doesn't hurt at all.
That's nice.
Oh, what? And it does way better.
H.
Okay. I mean, this seems like there's no overhead now.
Probably if I do bigger it gets um still very minimal.
This is all very new, Ethan. We have to, um, there's a lot of, like, experimental work to be done to validate and tune this. The initial result is there, but it's an initial result on an env where, like, this should absolutely work.
So, it's still a breakthrough in the sense that we solved an impossible problem.
I'm very happy with it. But to get this thing to be broadly useful, we're probably going to have to tune it a fair bit. I wouldn't be surprised.
Yeah, just two weeks.
Two weeks for the core results and like cleaned code um like already like packaged up some results. Yeah. Well, I mean the thing I have the main question is going to be like what does this do on other environments and such. Um, I think if anything, we'll be early on this. like we'll probably be done with the the core stuff near the start of next week and then I'll probably work on the various other smaller things that we wanted to package into 5.
Um, I'm not going to need, like, a crazy month-long break for Puffer 5. No, I have some stuff towards the end of the summer, Jess. So, I guess it depends how much we do in Puffer 5, you know, like there's the self-play thing. There's all sorts of little topics, but still, Puffer 5 is going to be a fairly short and nicely contained update.
What problem did it solve? So the problem is 35x35 sparse mazes, randomly generated. You have a 1 in 10 million chance to get a lucky solve. So it takes 10 million steps, rather, um, to get a single solve by random chance on average.
That makes the problem virtually impossible for any of your typical algorithms. We are able to latch on to that single reward basically instantly and we can solve that problem effectively.
client stuff. Yeah, lots of client stuff, Jess.
In fact, you know, I probably do need to spend some more time as well on the business. Um, you know, we could use I I think Puffer is now ready for some larger clients and I'm going to have to put some work into making sure that that happens.
all this debug crap gone.
I'm just doing the rough initial cleanups with this first.
um both E10.
Yeah, both.
Ideal client for us, like, either is sort of doing some RL now, or tried RL and, like, sees that it would be really valuable but it was hard, and wants to get their thing super high-perf onto all our tools. Um, that's kind of what we look for.
Uh the only one of those that I am allowed to name um is the startup archive because uh their environment well yeah the environment that we work on for them is currently open source that is like an automated um it's like an it's a startup working on automating commerce and marketplaces and we build a marketplace in for and tired.
I think I'm just tired because my back is killing me.
We'll retire a little bit early this evening and um spend some time in the hot tub. See if that helps.
Stupid man. I just got lazy on my form with the deadlifts this morning because I was super tired from the run.
Yeah, I don't really want to do that at the moment, Jess.
I mean, that'll dull me as well for um the work.
So, now we can go source.
Am I on the right machine?
Uh, no.
Okay. Chunky pile.
Thank you.
Make sure this still works.
The env maze. Oh, the, uh, the client env. Hang on.
I'll show you.
Oh, yeah. Star the puffer if you haven't already, folks. Um it's free and all our stuff is free. It really helps me spend less time advertising for companies and then they come to us and I get to spend more of my time on research.
Here it is. Archive Puppar.
And then there's um there's some branches.
Mhm.
Okay. So, there's some duping GPU. This box I'm training on a 5090.
This box is a multi-GPU one with six 4090s.
We have a few other boxes as well.
I need to get some more. So, if anybody could get the um market price of 5090s back down to something that's not total horseshit, that'd be great.
Cloud, um, in our facility, Puffer training facility, we just got a big rack with a whole bunch of machines on them.
See this Mhm.
It's funny that 4090s are about the same speed as H100s for us, but even like 21 nodes of 4090s is plenty centers. Yeah, but it's mostly the RAM, man, because the data centers aren't putting 5090s in them.
Not that bad spirit.
The boxes cost um the six GPU boxes only are like 2 kilowatts or whatever.
So that's like a quarter per hour.
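The power figures above imply an electricity rate, which a short calculation makes explicit. The two inputs are the rough numbers quoted on stream; the per-GPU split is added here only for illustration.

```python
BOX_POWER_KW = 2.0          # "like 2 kilowatts" for a six-GPU box
COST_PER_HOUR_USD = 0.25    # "a quarter per hour"

# Electricity rate implied by those two figures.
implied_rate = COST_PER_HOUR_USD / BOX_POWER_KW      # USD per kWh

def hourly_cost(power_kw: float, rate_usd_per_kwh: float) -> float:
    """Energy cost of running a box for one hour."""
    return power_kw * rate_usd_per_kwh

print(f"implied rate: ${implied_rate:.3f}/kWh")
print(f"per GPU-hour (6 GPUs): ${hourly_cost(BOX_POWER_KW, implied_rate) / 6:.3f}")
```

This covers electricity only. Hardware cost, cooling, and hosting are not included.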
And renting multiGPU boxes is way more cluster nodes.
uh I actually really didn't have access to a ton spirit and we had a ton of CPUs actually for a lot of it. Like the RL code was so crap back then.
The hell?
Well, this is not worth it anymore at all. It still wasn't. I It was already ridiculous at 45.
Oh, wait, wait, wait, wait. This is the um No, no. These are the 6000s, right?
Yeah, yeah, yeah. Hang on. I'm dumb.
Those are the That's the other ones.
Yeah, this is the other one.
I'll be good in a day or two, Ethan.
Uh, I don't think that's remotely true.
Spirit This is just merging a few small things.
Yep.
This What do you mean? Can you elaborate?
They're not just like magically faster at all types of like at all workloads.
this run this.
I have to start getting my hands dirty with this soon, but And this wasn't that big of a This also wasn't that big of a a fix.
It's a kind of a mess, honestly.
We're going to have to just figure out clean some stuff up.
Don't do this [ __ ] with macros.
mouse.
He cuz it did this [ __ ] right?
Conditional crap on macros is just awful.
Why is it doing this?
12 all the streams.
I I actually have to like think about this for cleanups now, don't I?
Yeah. I think the truth, Liberty, is like very few people actually understand the technicals of what the hell's going on with quantum stuff and how much stuff is actually legitimate.
Um, and the things are expensive and fiddly enough that it's not going to be relevant for a while.
[ __ ] is this?
just does dumb [ __ ] Uh, okay. That's why it was there.
summary of where we are at the moment.
Uh we have made a major core research breakthrough yesterday.
We're able to solve tasks of ridiculous, ridiculous sparsity.
There's a whole bunch of crap code that needs to be cleaned up now because that's what happens when you test a whole bunch of methods within two or three days.
Okay, that's way better. That'll get rid of a whole bunch of jank.
All this macro garbage will be gone.
And then soon we'll get this clean enough that I'll actually be able to think about it. But I want to just like rip away the obvious jank pieces first and then we'll be able to do that.
Yeah, like look at this stupid thing, right?
Gone, gone, gone, gone.
GitHub. Um, all the stuff's at puffer.ai.
Demos are on the website. GitHub is linked there. Star Puffer to support. It's free.
All right, cool. So, this is fine. Now, um, obviously the prioritized replay buffer is garbage, garbage.
We will figure out what to do about that shortly.
We also have to figure out this stupid state refresh method because I don't want that to be a thing.
That might require a larger change.
And also figure out why we're CUDA OOMing here.
Yeah, we only have two GPUs going.
Um, mostly it's mostly ready to be tested. I mean, this was the goal for two weeks, right? So, the code's a mess.
It's 12:30. I think I'll have it better by the end of the day. But what you're seeing here, this is like the automated pass where I'm just kind of going through and using Codex to fix obvious things. And then you're going to see me reading more of the code and having Codex fix less obvious things. And then you're going to see me go hand by hand, line by line, and do that, most likely.
See, like look, all this crap gone, right?
This is stuff that I just let Codex toss in there because, whatever, you know, we'll fix it later. And now it's later. The thing about the state refresh thing...
I can probably go do that, right?
Yeah, this as well.
Yep. This is also like, we can clean a lot of stuff up in here for 5.0.
We'll be good.
Good question, Liberty. Um, I think that's going to be more so like, we'll see. Basically, we will see based on the results of sweeping this on existing envs how much of a difference it makes qualitatively. There is a class of problems, though, that we could not solve at all before that now are solved efficiently.
I honestly don't know man on the quantum stuff and I think most of the people giving you answers don't know either.
Um, so again, so the task that we used, it's 35 by 35 mazes.
Um, that you only get a reward at the very end if you solve the maze.
That is so sparse that it would take 10 million steps on average to get a reward a single time. The mazes are also randomly generated, so you can't just memorize it.
The new algorithm we have is able to solve this efficiently.
And it does it in a way that doesn't exploit any of the specific properties of the domain. It's fully general.
I see. Okay. So, this is a little annoying.
This is literally a problem I've thought like this is a problem I first thought about like seven years ago and um we solved it today.
We solved it with a method that I probably wouldn't have considered legitimate seven years ago, mind you, but the method has been made legitimate by engineering advancements in PufferLib.
Okay, it's based on setting the simulator state to a previously seen... Hey. You have too many buffers as well.
Yes.
Setting the simulator to a previously seen state.
It's not exactly data augmentation.
It's based vaguely on Go-Explore, but it relies on reinforcement learning more deeply, and it replaces heuristics with general learned metrics.
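As a rough illustration of what "setting the simulator to a previously seen state" means, here is a minimal Python sketch. The `ToyCorridor` class and its `get_state` / `set_state` methods are hypothetical names made up for this example. They are not PufferLib's API, and the sketch leaves out the learned metrics that decide which states to keep.

```python
import copy


class ToyCorridor:
    """Hypothetical 1-D corridor: reward only when the agent reaches the far end."""

    def __init__(self, length=20):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); position is clamped to the corridor
        self.pos = max(0, min(self.length, self.pos + action))
        done = self.pos == self.length
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

    def get_state(self):
        # Snapshot everything needed to resume the simulation later
        return copy.deepcopy(self.__dict__)

    def set_state(self, state):
        # Restore a snapshot taken earlier with get_state()
        self.__dict__.update(copy.deepcopy(state))


env = ToyCorridor()
env.reset()
for _ in range(5):
    env.step(+1)
saved = env.get_state()   # a state the agent has already visited

env.reset()               # a normal episode would restart at position 0
env.set_state(saved)      # instead, resume from the previously seen state
print(env.pos)
```

The point of the sketch is only that a restored episode starts five steps closer to the sparse reward than a fresh reset does.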
Welcome. Yes. Uh we have made major breakthrough on the curriculum yesterday and I'm now cleaning it up and testing it.
730 lines.
And keep in mind that a lot of this code was from before, right? This because this is also our original prioritized experience replay.
Are you so many games and tasks is unified framework writing the code for other people?
So, it's both. Um, I write some tasks and I write some of the envs. We have contributors that write other envs. I do a lot of the core algorithm work, though we have a couple people helping with that now as well.
It's a combination of things.
Okay, here we are. This is reasonable.
this CDF sampling. Uh, this is new.
A lot of code.
I think I know. I'm just double-checking stuff.
I'm trying to get a sense of how much code is being added versus how much I think should be being added.
This is 5.0. Yeah, I'm cleaning the code up quite a bit today and then we'll need to run some experiments. I mean, this branch will break a lot of envs until I update them.
So, we're mostly playing with mazes, 2048, and a couple other small things.
Let's build the Let's just make sure the current thing works.
We'll just rerun training and then we'll keep thinking about this.
So look, millions of steps, no score, right?
You get the score once after 10 million steps and then watch how quickly it'll learn from that. Boom. Now it's starting to learn already.
You see that?
It takes 10 million steps to see the reward once and then within 5 million steps after that it's already learning non-trivially.
Uh, 2048 should work, yes, though we have not yet seen improvements on 2048. But I think that once we have this nice and fast and cleaned up, I suspect that we will be able to do better at 2048 using this.
No, an environment doesn't have to be solvable.
Um, if you have an environment on which you get literally zero score though, you'd better have a motivation for... Uh, there are some improvements to, um, Protein. Lucas apparently sent some improvements. I need to review those, and that's going to be like a bigger thing. I think we will do that as part of 5.0. It's like that and a few other small algorithmic tweaks. The main thing in 5, though, is going to be state-set-based learning.
This is good.
as you expect. Congrats.
Oops.
I don't suppose Robocode is done after all this time.
Uh, well, actually, Spencer just sent me I actually just wanted him to start working on that, believe it or not. And he's been sending me videos of um of training runs. So, actually, Robocode is still a a super useful Oh, I didn't even realize that was you as well. Oh, yeah.
To um congrats, man.
That's great.
Puffer has been progressing very, very well. PhD at Nvidia?
I think that most of those are um they're joint. It's usually like joint with a university.
And yeah, Nvidia's cool. Nvidia is very cool.
I think that's top of list for companies we'd be uh interested in working with given that they have uh real engineers.
I think Nvidia is probably the only company that has a significant number of real engineers at this point.
I think a few folks at Nvidia are using like bits of Puffer. Um, I've got a couple contacts there. I need to spend more time on business side. As you can see, I spend a ton a ton of time just like cranking out next version, next version.
Well, I mean, that's what the streams are, right?
I mean, if you go there, you'll have plenty of people to tell about Puffer and all the things here, right?
That's the easiest thing is we just see uh you know people see puffer, people start using it and then they start using it at the companies that they're at and then they think wow it would be much much more efficient if we just had these guys on our Yeah, but anything you all want to do to help promote is of course appreciated.
Hello, Jess. Uh, we're mostly working on cleaning up the curriculum implementation. It is fast now; it basically does not cost any time.
We can still run Breakout at, I think, 19-something mil. We lost a couple percent perf that we're going to have to get back.
A puffer illustrated website.
RL researcher RL marketing.
Nah, not my style. Alo my style is just both doing research.
This marketing GF would be awful.
Oh, thank Do visualizations. Cool.
I mean, like I guess I'd want to know what's the type of stuff that you want to see that's not on the website.
We are going to need a good architecture diagram for Puffer at some point. Um, but that's kind of different that I'll go AR. Yeah. So that's a diagram.
That's like a big diagram. I was thinking about that. Honestly, you're not going to be able to go kernel by kernel because like we honestly just need to clean all that stuff up and that'll change a lot. I do think it Puffer would benefit from a nice architecture diagram.
Something that I can like easily edit and update.
That would be a really cool poster actually. You know, I'm thinking of just like a big poster for the puffer architecture.
Yeah, with like the CPU, the GPU layout, like all that off.
Walk through 26 points in has gotten to 94 score.
Nice, Spencer.
I am currently cleaning up all of the uh curriculum crap, all the messy kernels and things. It's a little bit annoying because we need a priority queue and um that just adds a bunch of code.
Uh Jess, I mean it's just faster is better, right?
But it's no longer just going to be the env time. It also includes copy time in that now.
Yeah. protein as well. Protein is quite fancy.
See, the thing is making that poster sound super cool and you could end up with something really awesome, but I just know that based on how I think, doing something like that would be incredibly difficult for me.
Yeah, I don't know how useful 60 minute read.
I would like to have like a nice diagram for the thing.
I'm almost tempted to have that thing like the thing is a ray demo um where it just like you know you can just it runs the data flow and everything live.
That'd be cool.
You write about the new breakthrough in a research paper. It'll be a blog post, not a paper.
It'll probably be multiple blog posts.
I'm really sick of writing research papers. If I do write any more, it's going to be way less frequently.
Okay, that's a little bit of code gone.
More stuff gone.
Didn't really remove very much.
Uh, I bet it's the historical snapshot thing.
Yeah, Volo, it's a pain, which is why you see me spending all this time.
Oh, this is cool.
This is very... That'd be very nice to have that env in Puffer.
That's cool.
That's awesome.
Huh? Pretty cool.
Yeah, we'd be very, very happy to have that.
This still works.
All right, what do we have in here? We got 700 lines. This is not 700 lines for uh new curriculum, by the way. This is 700 lines for all curriculum including the prioritized replay that is in tomorrow.
Okay. Bug. Lovely.
Let me see.
Seems like 5v5. Uh, if you want to mess with the MOBA env, then yes.
Uh, actually, yeah, there's one possible thing that can get screwed up here.
Again, if anybody wants to work on all these envs, by all means. Like, we're actually very happy now to be taking larger env project PRs, especially if stuff is fast and stable, cuz Puffer is getting quite good. And, um, we need harder problems to throw it at. There's one bug I can potentially see here.
hard to measure. The reward is win. It's hard to measure because you need ELO, but we have ELO as part of um as part of our self like the R&D selfplay like that's already there.
It's annoying because ELO is a relative metric though, which is why scripted opponents are really nice because then you can at least do win rate versed.
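To make the "relative metric" point concrete, here is a small Python sketch of the standard Elo expected-score and update formulas. The function names and the K-factor of 32 are illustrative choices for this example, not anything taken from Puffer's self-play code.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return A's new rating after one game; score_a is 1 (win), 0.5 (draw) or 0 (loss)."""
    return rating_a + k * (score_a - elo_expected(rating_a, rating_b))


# Only the rating difference matters, so these two matchups look identical:
print(elo_expected(1600, 1500))
print(elo_expected(2600, 2500))
```

Because only differences enter the formula, a pool of self-play agents can drift together without the numbers saying anything about absolute strength. A fixed scripted opponent gives an anchor, since win rate against it is comparable across runs.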
fix checkpoints work. Um, just a little annoying Okay, lovely.
Make sure that still works.
Okay. Well, this still runs. We'll see if it's as good.
That's not the point, E10.
The point is that it's being done like just selfplay.
You know, there's lots of work out there on chess already. You already have superhuman chess. It's not the point.
Jesus. This Okay, this still works.
You're using scripted. No, so scripted is for eval man. Like if you have a scripted opponent, it's nice to be able to evaluate against it.
See what we got here.
Ah, dated buffer. Cool.
Okay, we're down from 800 some odd to 664.
open.
at 61.
And then I'm going to have to start looking through an actual like some real amount of detail after this. I think stuff that stands out to me is the number of things being allocated looks suspicious. Um, yeah, this is a lot of crap. So, we can probably save a little bit there.
functions with a [ __ ] ton of arguments.
Actually, I think it's just this stupid data type There.
633.
That's much better at least.
And then I have to actually start reading it once this patch is done.
Make sure this still trains.
And I'm going to go get vitamins and such in a moment. Yeah. While this trains, I'll go grab my vitamins and I will be back in a few minutes. Thank you, folks.
The puffer on GitHub. back in a few.
All right.
this. We didn't break anything. You see, still the same perf 629 lines. Good.
In the meantime, Is it 32K? No, it's not 32K.
not use dense rewards.
Um, dense rewards are not necessarily bad, Jess, but you need to know what you're doing.
Okay, now I get to review this manually.
I am actually wondering if I should go get some Advil if it would help because like this is just like constant [ __ ] I'm thinking I just tough it out for a few more hours and then um I like I sit in the hot tub for an hour after stream and see if that helps.
I don't think I like slipped anything or whatnot cuz I didn't get like any sharp pain during lifting. I think I just taxed the hell out of it.
Yeah, probably That's okay. Here's the pryobuffer, right?
I'm going to stick this here.
This has a bunch of crap doing.
This is just doing the thing I told it to.
Okay. So, why do we have this much stuff?
States candidate states priorities on the host. M scores on the host.
E pause.
A lot of crap.
Mostly the priority buffer.
Yeah, this kind of crappy kernel. Sure.
Curriculum checkpoint scores.
Compute curriculum checkpoint scores.
Actually, I do wonder if this is um well, there's potential logic change in here.
It's just a little bit here, a little bit there, a little bit here, a little bit there.
Yeah, a lot of this is this block scan.
um allocations and memory management. Total mess.
Allocator doesn't super help, does it?
This is all the PER, or, I know, the priority queue.
CUB scan.
Yeah, we don't need to preserve current sampling.
Okay. So, this is this right here.
This is not bad.
Wait, but you still need the CDF, huh?
Yeah, it's getting it's getting lost in details.
Actually, this operation has got to be in Um, Uh, yeah, Val, I saw that. That's the A6. That's the R six000s, though. That's not the 5090s. The 5090s on there are already overpriced.
They're literally going to raise the price to 75,000 for that four GPU box.
It's ridiculous.
Yeah, 45 is nuts for the 5090 box. Like now like I at 25. Okay.
Um because the prices are [ __ ] at the moment. Like 35 maybe. I wouldn't even pro I probably wouldn't even get it at 35 for a 4GPU box.
Um, you know, maybe a good 8GPU box at 60.
Yeah, exactly. So, I'm not going to be buying this [ __ ] when the prices are going up, right?
We'll just be patient.
I'm actually surprised we're not seeing um you know people selling cheap H servers as they're replacing those or 800 servers.
So they'd have to be really cheap because in my mind those cards are only as valuable as like 4090s for 800s.
Yeah. Um, if you don't need the memory, it's a terrible deal compared to like literally just buying 4090s will be faster.
pretty much pepper.
What was it for? Um, was it 25K?
It was like expensive, I think, at 25K for that.
Yeah. 6K each.
One B200.
Well, the other funny thing about the B200s is the B200 is slower than a 5090 um with our current setup on a lot of the tasks. So, um you can just spend all your money and have slower than your [ __ ] gaming desktop. up.
a little bit of speed.
And this is a normal ass priority queue, isn't it?
Also, I think that the state heap is over complicated.
Yeah, I think this can be done faster.
Can we get a normal priority queue from anywhere?
Huh?
Does this get us down to below 600? Very nice. And that's a perf improvement.
This Oh, wait. Hang on.
This, this... Actually, that is funny. This would be a radix sort, wouldn't it?
It's fun.
Okay.
That's good.
And then I don't like the way this thing is done. I think this has done a stupid job with the priority queues.
I need to get a Tylenol because I'm like losing the ability to function Here.
Okay, you can go run this and then I will run I will rerun training.
It's fine.
This is a little faster. You see this actually is a little faster than before.
You going to get anything here? Any rewards?
Uh, that seems as bad.
Oh, wait. There.
There it goes. Okay. So, we got unlucky at the start. You can get unlucky and not see it for a few times um longer.
Okay, it is already then I misread it.
I think it has to be it has to be this way, doesn't it?
Yeah, cuz we can't compute the advantage ahead of time.
buffer. This also took a big hit to perf.
I want to see if this was an unlucky seed because it seemed to me that we got very very unlucky on um the first sample.
It's 1 in 10 million. And it took 30 million before I saw a um any score at all.
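A quick back-of-the-envelope check on how unlucky that is. This sketch assumes, as a simplification, that each step independently has a 1 in 10 million chance of producing the reward. The real maze is not memoryless like this, so treat the numbers as rough.

```python
import math

P_REWARD_PER_STEP = 1e-7  # assumed: one reward per 10 million steps on average


def prob_no_reward(steps, p=P_REWARD_PER_STEP):
    """Probability of seeing zero rewards in `steps` independent steps."""
    # (1 - p) ** steps, computed via log1p for numerical stability
    return math.exp(steps * math.log1p(-p))


for steps in (10_000_000, 20_000_000, 30_000_000):
    print(f"{steps:>11,d} steps: P(no reward yet) = {prob_no_reward(steps):.3f}")
```

Under that assumption there is about a 37% chance of still having no reward after 10 million steps, about 14% after 20 million, and about 5% after 30 million. So a 30 million step dry spell is unlucky but not shocking.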
Okay, that's better.
We'll see what this converges to.
Is there anything else we could do algorithmically instead of this?
Okay, we're going to have to see if there was a perf regression here first.
Well, this isn't so much check the list, Jess. This is like, you know, do I really need this? Is there any other way to do the same thing?
But I think you do kind of want a priority queue here, don't you?
Cuz don't you just want to keep around the top the top K most informative states?
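A minimal sketch of the "keep the top K most informative states" idea, using a fixed-size min-heap from Python's standard library. `TopKStates` is a hypothetical name for this example, and the priorities here are just made-up numbers. The actual code on stream is GPU-side and scores states with learned quantities.

```python
import heapq


class TopKStates:
    """Keep the k highest-priority states seen so far, using a min-heap of size k."""

    def __init__(self, k):
        self.k = k
        self._heap = []   # entries are (priority, insertion_count, state)
        self._count = 0   # tie-breaker so the states themselves are never compared

    def push(self, priority, state):
        entry = (priority, self._count, state)
        self._count += 1
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif priority > self._heap[0][0]:
            # New state beats the worst one we are holding: evict the minimum
            heapq.heapreplace(self._heap, entry)

    def states(self):
        """Return the kept states, highest priority first."""
        return [state for _, _, state in sorted(self._heap, reverse=True)]


buf = TopKStates(k=3)
for prio, name in [(0.1, "a"), (0.9, "b"), (0.4, "c"), (0.7, "d"), (0.05, "e")]:
    buf.push(prio, name)
print(buf.states())  # ['b', 'd', 'c']
```

The heap adapts to whatever scale the priorities happen to have, since it only compares them against each other.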
I mean, you could also just set a threshold, but I don't like the idea of setting a threshold because the distribution of advantage is different for every env. Where's my asymmetry?
That's funky.
Okay, this is looking much better. I think there is not a performance regression and there's just, um, well, you expect this thing to be seed-unstable, right? Because you lose 30% of your samples or 40% of your samples just by getting unlucky on when you get the first reward.
Yeah, we're above 70. We're good.
Yeah.
Fine.
importance.
Like why do we need to destroy that?
Because we have two different things, right? We have the computation of priority and we have the sampling op.
Possibly we could merge these I don't know if you can reduce the number of passes here.
I think there are a few global norms in the way.
Maybe You're not doing anything wrong. It's a language for idiots.
Like the Rust community is to programming languages what the Arch community is to distros. It's like a bunch of people that are supposed to be smart for whatever reason having focused in on completely the wrong [ __ ] thing.
Let's see.
Compute raw absolute advantage priority. Normalize into prob. CUB scan prob into CDF. Multinomial sample. Compute IS weight.
Right? Because the thing is, you're already going to do the normalization anyways with the multinomial sample, right?
So, I'm going to admit I didn't fully understand this, but I kind of have the idea that because you have two normalization operations, you can probably just not normalize the first time around and rely on the second norm.
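Here is a small CPU-side Python sketch of that idea: build the CDF from unnormalized priorities and scale the random draw by the total instead of normalizing first. The function names are hypothetical, the importance weight uses the plain form 1 / (N * P(i)) with no exponent as an assumption, and the real implementation on stream is a set of CUDA kernels rather than Python.

```python
import bisect
import itertools
import random


def sample_indices(priorities, num_samples, rng):
    """Sample indices with replacement, proportional to unnormalized priorities."""
    cdf = list(itertools.accumulate(priorities))  # inclusive prefix sum (the "scan")
    total = cdf[-1]
    indices = []
    for _ in range(num_samples):
        u = rng.random() * total                   # scale the draw instead of normalizing
        idx = bisect.bisect_right(cdf, u)          # first bucket whose CDF value exceeds u
        indices.append(min(idx, len(cdf) - 1))     # guard against float edge cases
    return indices, total


def importance_weights(priorities, indices, total):
    """w_i = 1 / (N * P(i)) with P(i) = priority_i / total."""
    n = len(priorities)
    return [total / (n * priorities[i]) for i in indices]


rng = random.Random(0)
abs_advantage = [0.0, 2.0, 0.5, 1.5]               # hypothetical |advantage| per row
idx, total = sample_indices(abs_advantage, 1000, rng)
weights = importance_weights(abs_advantage, idx, total)
print(total, idx[:10])
```

The only place the total is needed is the scaled draw and the weight, so the separate normalize-into-probabilities pass disappears. Rows with zero priority are never selected, because their CDF bucket has zero width.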
Y I still don't like how we're doing the memory management stuff here.
This This is actually kind of a cool insight, isn't it?
It should get shorter.
Little bit shorter.
It's actually it's about the same, but I believe it is should be fewer kernels.
Yeah. So, here's the um Okay, more manageable.
Still not done yet.
We'll be able to commit it though. Nice.
I'll probably also run this against 2048 when I go to dinner in a couple hours, few hours.
It'll be good.
Okay, this is throwaway test. We don't care about this.
Starship flight 12. Cool.
Okay. Temple H.
Okay.
Nice.
Check breakout as well.
Uh if there's any speed up here, it's not discernable for a run that's this low. We might see it on breakout.
I believe that could be like a couple percent.
Okay, so this is annoying. Um, it can't learn until it gets any nonzero score.
And you can just get really unlucky with that. Okay, there it goes. We got unlucky in the sense that it was supposed to take 20 mil and it uh 10 mil. It took 20 mil.
Uh, do we latch on or not?
Okay, we do.
nice.
It latches on I'm going test this on on breakout in a second after we test this on.
So far so good. We got to do a breakout as well just to be sure.
Okay, we got a couple things here and yep, we preserve Perf.
and try to break out before we do this because this is a lot of code change.
pass.
Okay, we still train the 20.8.
Ah, look at that. We recover a little bit of the perf as well. That's better. I think we got back like 1 or 2% perf.
Yeah, we don't want this heat move to the GPU.
That's a lot of data. We don't ever want this data on GPU.
high. Yeah. Okay.
So, I mean that's not unexpected, right?
You shouldn't do it on CPU probably. I don't know why it suggested that, but I mean it's possible it could have been faster. I guess we checked it. It's not.
That's fine. We just get the small change.
Still 575 lines.
Yeah, this is real bad, right?
16k.
Not very bad.
All right.
Uh, no idea, man.
Let me just keep going through this.
I still don't like the way we're doing memory.
This state buffer seems bloated to me, but I can't find anything initially to cut out of it today versus yesterday. Just this is mostly implementation side stuff today.
Um so some speed improvements, efficiency improvements for this. Uh it's now basically no no overhead compared to puffer 4.
And, uh, we are cleaning up and reducing the code size for it. I think we can probably get the code size to be about the same as Puffer 4, maybe 200 or 300 lines longer, because we're also able to fuse some of this stuff in with the PER from before.
I mean, Puffer 5 really isn't going to be a ton of code, right?
It's the It's an algorithmic change.
Like, this is an algorithmic update for the most part.
I think it'd be great if we have puffer five shorter than puffer four.
Are we back?
I think the internet just blipped. I think we're good though.
565 lines.
The big one is around how curriculum integrates with rollouts.
So look, if we looking at what's taking up code now, it's a lot cleaner, right?
So here are your buffers. You got your prioritized replay buffer.
You got a state buffer. They have their allocation functions as your first 90 lines of code.
Here is the uh calculation of the prioritized replay coefficients. Right there.
Here is your computation of um like summed ab was it sum of abs of value function.
I don't know why we have a separate importance weight score function.
I I I'll look at that.
Okay, here's your multinomial path. All this is sampling multinomial. Uh, init state buffer here. Actually, this could all go up here, right? Because this is with your state buffer.
So that should be here.
And then what are we left with?
Oh yeah. So this stuff here, I think, yeah, okay, there's multinomial, and then here this is all of the state buffer stuff: the heap, the state heap. This is your priority queue, this is your checkpointing logic, and then this stuff at the bottom is the last thing that we need to go through properly. Okay.
chunks of it for sure. Coder Yeah.
Like for instance, right, some of this stuff here, like this multinomial kernel, is an iteration on prior sampling code.
So this is taken partially from puffer 4 and then we realize that the same operation has to happen in a slightly different context for puffer 5.
We can merge the code paths. Now, I haven't done the line-by-line edits for stuff yet because, um, these models are very defensive and [ __ ]. So, I'm sure that when I go through line by line, I'm going to find lots of stuff like, I don't know, should this ever be null, right? Things like this.
But the goal is to have 500 lines of code to review manually, and 500 lines of code that are already split up correctly. Like, I can't manually review 800 lines of crap that's just written in a way that doesn't make any sense. So first this is a combination of read code, go to Codex, read code, back and forth until I massage it into roughly what I think should be the right shape, and then I manually fix [ __ ]. Like, I'm still not happy, for instance, right here with the memory allocations.
I don't know if I want to use our allocator on this or what, or if I just like if this actually needs this many different buffers.
It probably does, but this is very like this is very obnoxious.
Yeah. Okay. This didn't need puffer all anyways.
Runs.
Okay, that goes there.
Yeah, it doesn't need puffer. That's fine.
So, this compute curriculum checkpoint scores Okay, this has this has to exist.
This makes sense.
State importance weights.
Where are we using this thing?
Okay, this gets used exactly once and this gets used in Why is this a separate function?
Let me make sure that this actually makes sense what I'm thinking here.
It's computing both of them at once is why I See?
Hey per roll out row and agent. I see.
Hang on.
So, the thing I suggested makes sense, but there were ramifications I didn't consider.
I'm use a restroom first real quick and then we're going to figure out whether this makes sense. I'll be right back.
Hey, bet.
All right, let's figure out if this thing makes sense.
So first of all this entire thing is sketch but regardless of whether this should be here you compute prioritization sampling.
You compute the samples.
Even this might be able to be fused but okay. So we compute prioritization.
We sample the indices.
We move them to the GPU.
We copy over states.
Then we copy the observations back to GPU.
I don't understand actually. Yeah, this doesn't need to be here, right?
Why can't we compute these?
Why can't we compute these when we sample? This only ever gets called here as well, but let me see if I agree with this: we can fuse state set importance sampling into the state sampling kernel only if the sampler writes the per-agent importance array for the restored envs at the same time.
Yes, that's fine.
This removes intermediates.
Okay.
Yeah, that removes that removes um quite a bit actually. That's nice.
This is my PR. What do you mean this is your PR?
Yeah, to solve the problem.
Um, yeah. So, it turns out that that wasn't really a big issue.
And also um it's quite a bit easier to sample quickly uh with replacement.
So I think this is not necessarily a bad thing. And that wasn't the the uh Perf gap.
You seen the Actually, I don't know.
Have you seen the maze results yet?
Okay, perfect. Look, this is exactly what we wanted. Mine. Get rid of this.
Get rid of this. Let's get rid of this like too many data structures.
We get rid of this whole kernel. We fuse it in. This is good.
Maze results. So, what do you do? Bet if um you have a maze and you only get a reward by random exploration every 10 million steps.
Yeah, you'll only get if every 10 million steps of random exploration, you'll solve it once. So, you'll get a reward once. That's it. No other reward.
What do you do?
This is so nice. That's so much easier.
probably explore more. Ow. You literally get one signal to go off of every um 10 million steps.
I'll save you the trouble and I'll give you the answer. Um you run Puffer 5 and you solve it instantly.
State sets are really [ __ ] good.
So yesterday I solved a um I mean this is a problem I thought about for like seven years now honestly. I call it the latching problem. It's the problem of once you get a reward once, how do you efficiently latch on to it?
And um as far as I can tell, there's essentially no reasonable way to do this that doesn't exploit either domain information, [ __ ] tons of compute, or state sets. and state sets are the cleanest one.
But look, it's a very clean and general way of doing what I just told you. Like that's not tied to mazes in any way.
That's not tied to the structure of it just having one reward at the end versus something else. Like so yeah, this this pretty much rocks.
And now um I am cleaning up a bunch of code for all of that.
Minus 40 lines. Lovely.
Uh, the best I've seen is like 80 to 90% solve rate on 35 by 35 sparse mazes. The best I've seen with puffer 4 is 0%.
Or I guess 0.006 or whatever. Like I think that's the random solve rate.
Yeah, 35 by 35. And this is with no smaller mazes to be clear. It's dramatically easier with smaller mazes as scaffolding. This is purely all the mazes are 35 by 35.
Yeah.
What's the speed here?
Okay. So, it's about the same speed as before. Um cuz we lost a little bit of speed with one compilation change and then we got it back with some kernel improvements.
Yeah. And um also the new curriculum stuff will have essentially no uh no performance price tag associated to it.
It's just free.
Exception being for Ms that have absolutely massive state, but even then it's not that bad.
Uh, I have to push this actually. In fact, let me just make a commit cuz I haven't done that in a Oh.
Yeah, you can see that. That's a nice nice diff though, right?
And yes, I'm using Codex for some stuff, but if you can't tell the difference between what I'm doing and just vibe coding the crap out of everything, that's on you.
Okay, let's try this.
All right, so I at least want to get this into the um like these are the correct kernels and these are the correct operations to run before I start going through the line by line. Now look, we lose 100 some odd lines here to the state stuff just because we needed a priority queue.
Uh what else we got?
Compute pryo abs.
I don't think we can.
Uh, I see.
Damn.
Yeah, this is the annoying [ __ ] with um Yeah, this is the annoying [ __ ] with Uh, I mean just just like we haven't really tested this well outside of mazes yet. So, I don't have any idea of what tweaks are needed to get this working with other stuff. I'm kind of trying to make this efficient and works well on the mazes. And then I'm going to start doing the broader sweeps.
No, but not like password was screwy.
Um, better environment is the sparse mazes.
Let me try multi- agent. It shouldn't be, Jess. As long as it's the two agents perm, it shouldn't Okay.
Wait, how's this work?
Oh, okay. That's actually pretty clever.
Yeah, cuz it captures all the checkpoints separate memory. Yeah, that's nice. So, that actually works perfectly. This is very good.
And then we've got what? Curriculum update advantages.
Actually, that's a good question. When do you call this?
Ah.
Oops.
Stupid thing.
You guys realize that I didn't switch models because I thought one was better than the other, right? I literally just switched models because, um, I was sick of giving Anthropic money.
Like these companies are all building the same bloody product.
They don't have personalities. They're [ __ ] non-sentient token predictors.
Damned RL agents have more more personality than this thing does.
Okay, look. There was the first score and now boom latches.
Okay, that is actually true.
You get like the residual moral grandstanding horseshit shrimp spa welfare whatever from the uh Not really, Steve.
Especially not for learning stuff.
Like, you'll honestly learn way more just jamming some low-level dev projects on your own.
That is, provided you at least have a reasonable toolchain set up. Like, you know, if you're deving C, you'd better be deving C with address sanitizer and such, and with GDB. Yeah, but Volo, I also was the type to not just paste [ __ ] from Stack Overflow, you know.
Let me go double check the sampling logic.
Oh, yeah. On the bright side, Volo, uh, you got some time to do Inferno things now. Yeah, that's going to be such a [ __ ] cool environment.
I think CL should help on it as well.
Buffer contacts.
I mean, it's kind of an open thing that if you've done stuff around Puffer, like enough to show that you're competent and you can also find Puffer a client, like you will go on the contract.
So, that is an open offer.
All All right. So, MB Wait, hang on buffer.
I'm embedding these in my blog now with WASM.
You can add it if you want. Wait, you're embedding what in your blog?
Current policies can be dropped in any wave and it will just clear. Okay, that's freaking awesome.
Yeah, I think we will be able to put it on um on the website.
Volo, I got to look up like, you know, the inferno trainer doesn't get taken down, right?
Um so, I got to double check that, but I think we're fine. I think that the worst is we get a cease and desist and we just take it down. As long as we don't have the raw assets in the puffer repo and the website just has like the uh the binary, I think it's fine.
There's so much more we can do with Inferno, though, right? Like, it's practically already doing no pillar inferno, right? We could literally just have it do no pillar inferno. Um, there's also pure inferno.
See the problem is Yeah. If you The problem is if you ask right like Dev will ask legal to cover their asses and then legal will say no because that's all that legal knows how to do.
like for literally anything.
But like if you just put something out, right, and it's not against their whatever, like they're not going to go out of their way to take stuff down.
That's cool. It's mainly that they have an obligation, I think, to ask legal if you ask them directly. So, there's some stuff to do there.
Yeah, I think Spencer's right.
Like this is not us trying to weasel out of anything. This is just like the dumb office politics.
Yeah, we have the same one. Volo.
Um, for reference, it implemented this section of the algorithm completely wrong. So hopefully it gets better now.
I only caught it in manual review.
So, it was updating at the wrong time.
It should be updating right here at the end.
Let's see how this goes.
We're going to see what percent perf we get on maze. Okay, that's the first reward. Now it immediately latches on.
You see that 88.2 is very good. Um, if you pass --seed and you pass a different seed, I'd be interested to see if you still get the same result.
Um, puffer is not seed sensitive, but the exploration part here is very seed sensitive, like just intrinsically.
It's a rare event.
If this does worse, it'll be weird, though. Actually, worse here is hard to pin down because it's so sparse.
I mean, I guess we'll just wait and see.
We're making good progress here, though.
I mean, this is good progress.
Every little additional fix we do takes us a bit closer.
I really like this Terraform env as well.
Mostly that was Spencer.
Okay, this is doing well so far.
Oo, this is cool.
Made with raylib.
And this isn't raylib. That's cool.
Oh, this is very good. Yeah, this is very good right here. This is This made a big difference.
Sent another vid.
Um, I actually had something I was planning to do with SendCutSend, and then I got distracted.
Ooh.
All right. Uh, are you satisfied with this environment, Spencer? Like, do you see that this was a good selfplay one or not really yet?
Because it's starting to get to like some of the crazy things you see.
I think that the ram thing is just like once you've secured an advantage, ramming makes it impossible. Like it just collapses the strategy space.
But like as soon as you've secured an advantage, yeah, that's probably what it is. Okay, this made a big difference. So this was good.
High variance. 0.56.
All right. Hm, shouldn't have zero ever. I guess it is technically possible.
Yeah, but the thing is it's not necessarily that the algorithm is sensitive to seed. It's that you need to get like one to three instances of the reward, and the reward's one in 10 million. So, like, you can actually just get unlucky and basically not see a reward until the learning rate's already decayed by half.
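To put rough numbers on that, here is a minimal sketch. It assumes each step under random exploration independently hits the reward with probability 1 in 10 million (the figure quoted above) and uses a Poisson approximation. The function name and the step budgets are made up for illustration; they are not Puffer settings.

```python
import math

def prob_at_least(k, steps, p=1e-7):
    """P(at least k rewards in `steps` independent steps), Poisson approximation.

    p is the assumed per-step chance of stumbling on the reward.
    """
    lam = steps * p  # expected number of rewards seen
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below

# Hypothetical step budgets, chosen only to show the trend.
for steps in (10_000_000, 50_000_000, 150_000_000):
    chances = [round(prob_at_least(k, steps), 3) for k in (1, 2, 3)]
    print(f"{steps:>11,} steps -> P(>=1), P(>=2), P(>=3) = {chances}")
```

Under these assumptions, a budget equal to one expected reward still leaves roughly a 37% chance of seeing no reward at all, which is the kind of run-to-run luck being described here.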
I think that this should help what I just did though. This should help a little bit here.
like yeah see it just hasn't gotten any reward.
Uh okay so this is what's happened actually.
So um this is hypers actually. Yeah this is hypers. You see this? So it's just collapsed the entropy too early.
I think that's a bug that we can work on though. Like this it's not a bug. It's an algorithmic quirk.
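For anyone unfamiliar with the term, "collapsed the entropy" means the action distribution has become nearly deterministic, so the agent effectively stops exploring. A minimal sketch of the quantity, assuming a discrete (categorical) policy; `categorical_entropy` is an illustrative helper, not a Puffer function, and the logits are invented.

```python
import math

def categorical_entropy(logits):
    """Entropy (in nats) of the softmax distribution over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.0, 0.0, 0.0, 0.0]    # still exploring: all actions equally likely
collapsed = [10.0, 0.0, 0.0, 0.0]  # one action dominates

print(categorical_entropy(uniform))    # equals log(4), the maximum for 4 actions
print(categorical_entropy(collapsed))  # close to zero
```

If the entropy gets near zero before the first rare reward has been seen, there is nothing left to push the policy toward the goal, which is consistent with the run above that never got any reward.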
Uh, it's templated. Good.
Bullet bullet collisions.
Uh, I forgot about that.
Yeah, I forgot about that.
don't like this.
Maybe it's fine for now like it I do think we have the correct structure now for the most part.
Yeah, we do have this correct structure.
How much code do we have?
500 lines. So, not bad.
It should the allocator should be catching this.
Yes, Spence. There's 500 lines now for the new stuff and it should be basically free per like overhead wise.
Um, we still need to do a bunch of testing to see if it actually helps on other envs, but it does add a fundamentally new capability, it seems.
Yeah, you don't need it for you don't need it for 1 v one. You can't see the bullets.
Uh to be clear, you would need it if you could see the bullets, but part of the design of that environment is that you Can't.
This should be good enough to use.
have to get all the sample all the uh the kernel names correct next as well.
We're almost done for the day. I think I think with a good day tomorrow, we can fully finish we can fully finish this code. Hopefully get initial results.
um even clean up Vec and then next week we get to start on all of the other Puffer 5 things. We could literally just do Puffer 5 in two weeks which would be pretty cool.
So obviously the experiments will drag on longer and then I wouldn't mind taking more time in um in Puffer 5 if we do get it all done that quickly. Um I get to just take some time. Obviously, I need to do some stuff for the business as well, but then I'm also going to have some time to just like learn a bit more proper kernel dev. Um, you know, really understand and clean up some of our kernel paths a little bit better. Um, dial in Perf a little bit more.
It'll be good.
Terraform might be a useful env for this. I don't know. We'll see.
Oh, yeah. For all the folks who are uh still watching this, don't forget to star Puffer on GitHub.
It's free. It helps us out a ton. Can also join Discord to get involved with all this stuff and follow on X. Um if you want more reinforcement learning content, all my articles are right here.
Yeah, Terra will be really cool.
No, I don't think I'm going to do state coverage this update. Yes, I think like my tendency would be to want to do that, but um I think I'd rather just make sure everything is nice and stable with this in it first and make sure the company business-wise is fine and then take some time to like think about really what's remaining on the algo side.
This is like this is enough of a win for this update already.
I mean, of course, if we sweep this on every on every environment, it only helps on mazes, then okay, maybe not.
Um, but I it should help somewhere.
What the [ __ ] is this?
else can get clients first.
Well, it helps.
I mean, look, we've gotten people before where it's like somebody's very, very good in an area and then we happen to find a client that's in that exact area.
But yeah, the easiest way is to also be involved with finding of clients.
It's a pretty good gig by default.
The uh contracts are 50/50 unless there's a compelling reason for them not to be.
Fine.
Yeah. I know Jess, that's that is assuming that you've done some work on both the um obtaining of the contract and on the developing of the environment.
Okay.
There we go.
I mean, if I were to try another language and like have to use it, it would probably be either um Zig or Go, just from what I've heard.
No, Zig and Rust are very different.
Rust is like [ __ ] shoot me.
Yeah, I think I probably would find Go to be too high uh high level.
Does Go have garbage collection or RAII?
It has a [ __ ] garbage collector. [ __ ] off.
No thank you.
No thank you. [ __ ] that language.
No thanks.
I'm good.
I'm just good.
Whoops.
This is fine.
Does Zig have um RAII? Nope. Okay, good.
I might be fine then. The thing is like you need to give me a compelling reason to not just use C. And the thing is that most of these languages give you a couple reasons to not just use C, but then uh they [ __ ] it up by like adding too much [ __ ] I think we're good, right?
Okay, this is a very nice PR.
So, you see that we deleted like a thousand lines of code from Puffer, but then we only ended up with 500 lines here. I think a lot of this was debug.
I think a lot of it was debug.
We'll see.
I think I pushed the check results thing.
Yeah, like I said, Jess, Zig seems fine.
Like of all the things it seems mostly fine there.
Huh?
What?
What the [ __ ] Is this Oh, okay.
Yeah.
Just need to change some names is fine.
and slightly irritating, but whatever.
Good progress though.
Like, why the [ __ ] would you ever do this?
No.
All right, that ought to do it.
Okay, cool.
Seems fine.
Not do that.
I think I'm mostly done for today. Um, we can do a very quick I suppose we can very quickly Maze sweep status.
Wait, do you have full solves of mazes?
Huh?
Wait, what branch did you get a full solve? Is this 35 by 35? Because I've never seen a full solve.
That would actually be kind of significant. I can see that you changed something cuz it's 150 mil.
What did you do, B?
And is this like a recent version of the code? because I didn't re sweep.
H.
What do we have here?
What's this?
Dang.
That actually tell me if this works.
It's always sketchy when it's like one experiment.
Why would you ever do this?
You've got to be [ __ ] me.
You are [ __ ] slower than ViZDoom.
You're using the wrong render settings.
Reinforcement learning Twitter is just [ __ ] stupid. I swear Doom literally can get over 100K. So, good job on CPU.
All right. Why are we sticking this anywhere?
What are we doing?
Let's see.
I like the ve opaque.
pushing code. Can you just tell me what you did, Bet, to get it to full solve? CPU to 1 mil?
Uh, I don't want to commit to doing that, Jess. It's like return on time is pretty low.
I basically I want to see once we have better cleaned up kernels how much of a pain it would be.
Good.
All right. No more cool research on here. It seems if anybody wants to go look at the schedule free stuff. Last time I checked it, it didn't work. Maybe it works now.
I mean, that would prevent us from having to do muP.
But I um I don't know about that because I think it is for one size maybe.
Okay, give away this stuff. Go away.
Nice little PR.
All gone.
And this has to go into CPU again.
What are we [ __ ] doing?
Oh, we're just busting [ __ ] into this.
Lovely.
Okay, this is still dumb, but I don't particularly care for now.
Oh, I'll look at your PR in a second.
Can you just tell me what you did, or is it a lot of stuff?
explanation is stupid.
All right.
Well, none of the GPUs have died yet, so that's probably better.
Not doing very well yet, but we'll see what it does.
State Q Wait, what?
Wall prob? What do you mean wall prob?
You changed the [ __ ] mazes.
What?
Wall? Wait, what the [ __ ] Wall prob.
probably because his father made like solved the task by removing all of the walls.
I didn't even know I exposed that. Wait, do I even expose that?
Where do you see wall prob there? Literally, that variable doesn't even exist. Dude, what did you do?
I think I'm going to end up reviewing this PR and you're going to not be super happy that I do.
Huh?
Wrong GPU.
Obstacle for density knob.
Yeah. Okay. Bet. You literally just solved the mazes by deleting the walls.
Good job.
Ensure mazes are solvable. Uh-huh.
The mazes are always solvable.
I mean, dude, you [ __ ] deleted the obstacles.
Are you Are you [ __ ] me, dude? Like, are you [ __ ] me?
You know better than this. You do. I know you know better than this.
Holy crap, man.
There's no curriculum at all.
That's why this result is impressive.
Like the env itself has no curriculum. It's like a stupid, absolutely impossible sparse task.
That's why it's cool.
Oh, well that's nice. Cool. We're already up to 10k.
Uh in okay long run is 10k. Cool. So huh.
Cool. Well, I suppose we will uh we'll let this run. And we have all of our GPUs active.
Hopefully it doesn't crash anything.
And we should be good glass ceiling.
Yeah, there is no moat doesn't literally mean delete the moat. And look now, now you can cross.
Um, I think for the time being this is like pretty solid progress, right?
Let's maybe word count.
So yeah, 2700 lines for these 700 plus 3K.
Yeah, it's like less than 4K lines of real source. This is good. I'm happy with this at least for now, right? Like obviously this is not finished code. I think we can get it to that point tomorrow maybe, but 529 lines for curriculum. Also, I want to see how many lines is the 4.0 Puffer? 2253.
Wait, 2253 lines.
Yeah. Yeah. So, we add less than 500 lines. Um, we add less than 500 lines of code to do curriculum.
And we made it faster.
Do I dare open this on? I don't dare open this on stream.
Oh my god.
Yeah, good job, man. You really solved exploration there, dude. It's literally a straight shot.
You literally go one direction and then you go to the right and then you go down and you solve the maze. Yeah. Okay.
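To illustrate why removing the walls trivializes the task, here is a minimal sketch using breadth-first search on a grid. It assumes a 0/1 grid with the start and goal in opposite corners; that layout, the helper name `shortest_path_len`, and the small walled example are illustrative only and are not taken from the Puffer maze code.

```python
from collections import deque

def shortest_path_len(grid, start, goal):
    """BFS over a grid of 0 (free) / 1 (wall). Returns steps, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

# With no walls, the shortest route is just the Manhattan distance.
n = 35
empty = [[0] * n for _ in range(n)]
print(shortest_path_len(empty, (0, 0), (n - 1, n - 1)))

# A tiny 5x5 grid with two walls forces a serpentine detour.
walled = [[0] * 5 for _ in range(5)]
for c in range(0, 4):
    walled[1][c] = 1  # wall across row 1, gap at the right edge
for c in range(1, 5):
    walled[3][c] = 1  # wall across row 3, gap at the left edge
print(shortest_path_len(walled, (0, 0), (4, 4)))
```

In the empty grid, any policy that drifts toward the goal corner gets there without backtracking; in the walled grid the route is twice the Manhattan distance and requires moving away from the goal, which is what makes exploration hard.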
Good job, B.
You know, I don't know what the [ __ ] you're doing that you even get it to mess with that. Like, how did you manage to do that and not notice?
Like the worst thing I didn't catch today was um it setting advantages on the first iteration of a loop instead of at the end of the loop.
Holy crap. That's a whole other level.
Okay.
Um, so I think we are set though. Like let me see if there's anything else we can do on this list.
Oops. Nope. So, refactor environments.
We have a bunch of environments, but this is easy. Literally, Codex can go do all of these. I don't want to do it now. I'm going to do it when I'm fresh because otherwise I'll miss stupid things.
Multi-agents done. Testms are done. Good enough.
Starting these now.
This is done.
We cleaned up a bunch of kernels.
There's much more to do. I think this is the thing we'll spend some time on. This we don't care about. Uh we're going to play with Soon, I think. But other than just cleaning up the kernel code a little bit more, which to be fair, I could just say it's done for now and we come back to that. Um, if we get some reasonable results at least.
So, this one is super easy to do and possibly useful. Very easy to test.
So, is this this I have to think about and protein changes from Luke. I think this is like a Monday or Tuesday type thing.
I actually do want to spend some time on this, but I think we can very very reasonably have um a nice like stable puffer five um sometime next week.
What am I doing tomorrow? Friday.
Well, we're going to see what we get results wise from this initial sweep.
Hopefully, we get something out of this.
We'll see. I do hope we get something out of this. Um, we'll also try Sokoban if this doesn't work.
Uh, and I guess tomorrow we'll see if I feel like cleaning up more kernels or if I feel like porting all the envs over, maybe getting it on Volo's Inferno if he gets the environment uh submitted. I think we're making very good time here.
Like that initial solve was pretty good.
Pretty pretty good.
Well, it's 4:36. Um, I thought I was going to be done by 5, but if there's nothing else to look at, I don't see any reason not to be done. A little bit early today.
I will remind folks if you're new around here, I stream pretty much all of my dev. You can find all my stuff here. All the writing that I do, all the educational stuff is at jsuarez on X.
Uh, I do not have the mental bandwidth left to review Jonah's kernels at the moment, Jess. I have several like very important kernel PRs to review. Those need to be done fresh.
Um yeah, for folks watching though, X for articles, Discord to get involved.
That's the whole community. And uh just star the GitHub. Like this number really really helps me out a lot. And every time I do um like these long dev grinds, it kind of starts to flatten out a little bit. Really, really helps when we keep getting this consistent growth.
Um it's a really good signal to companies, we get more interested clients, and it just helps us out a lot.
So go ahead and do that.
PR.
Did you PR this crap?
I see you closed it. I assume. Oh, Aurora imple.
Oh.
Wait, is this the Hey, Jess, is this um the efficient version right here?
Aurora optimizer. Small change to Muon.
If Jess is still here cuz Jess said um he had like a fast version.
I just want to know if I'm looking at the correct code because I could very I could see myself playing with this tomorrow as well.
I've wanted to play with this. Now, there are a few simple things I could do that I think would like improve the breakout speedrun record and that' kind of be on.
I uh I gave a talk at this lab um last week.
When am I streaming games?
Not really planning on it.
Okay. Well, I think Jess is out. So, um, if anybody wants to figure out whether or not that Aurora PR is the fast version, by all means. Uh, I'll probably look at that either tomorrow or more likely early next week among the sort of cleanup things that I'm doing. Uh, thank you, everyone, for tuning in. Star Puffer on GitHub. I will be streaming tomorrow morning starting probably sometime between 10:00 a.m. and 11 a.m. Pacific.
See you all around.