Hooker provides a timely reality check on the diminishing returns of brute-force scaling, correctly identifying that the future of AI lies in efficiency and real-time adaptation rather than just model size. This shift from pre-training dominance to post-training optimization marks a necessary evolution for the next generation of intelligent systems.
Deep Dive
On the slow death of Scaling (birth of Adaption Labs) | Sara Hooker | HF ML Club India EP2
This video features Dr. Sara Hooker, co-founder of Adaption Labs, discussing the transition from monolithic model scaling to an era of adaptive intelligence. Addressing the Hugging Face ML Club India, she explores why the "bigger is better" race in compute is reaching an inflection point and what this means for the future of research.
Timestamps:
* 00:00 - Introduction of Dr. Sara Hooker
* 02:37 - Personal anecdote: Using ChatGPT to generate presentation slides
* 04:03 - The problem with monolithic AI and "one-size-fits-all" models
* 05:27 - Analyzing the "Bitter Lesson" and the limits of scaling
* 07:08 - How the belief in scaling has shaped the AI ecosystem
* 09:28 - Evidence against monolithic models: Performant small models and weight redundancy
* 11:00 - Recent disappointments in massive model scaling
* 12:20 - Moving toward post-training and test-time scaling
* 14:14 - Optimization in the data space and Adapt Data
* 15:38 - Auto Scientist: Automating end-to-end model adaptation
* 19:26 - The pillars of research at Adaption Labs
* 21:30 - Q&A: Can you explicitly optimize for adaptability?
* 23:44 - Q&A: Relationship between pre-training, post-training, and test-time scaling
* 28:46 - Q&A: Defining adaptive intelligence vs. continual learning
* 31:32 - Q&A: Efficiency as a pillar for data and compute
* 35:12 - Q&A: Why do labs still invest in LLMs over alternatives?
* 37:20 - Q&A: The impact of the "Hardware Lottery"
* 40:30 - Q&A: The future of adaptive user interfaces
* 43:03 - Q&A: Accuracy and limitations of public "scaling laws"
* 44:28 - Q&A: Undervalued research domains (Sparsity and alternative architectures)
* 48:50 - Q&A: Evolution of advice for machine learning beginners
* 53:16 - Q&A: The "Strawberry" problem and tokenization issues
* 56:44 - Q&A: The importance of research communities
* 57:50 - Q&A: Pertinent theoretical research in optimization
Okay, >> this is nice. Okay, uh I I'll just quickly introduce and then you could take it from there. So, hello everybody.
Welcome to the second iteration of Hugging Face ML Club India. And I am very excited to be here. Uh, so is Sayak, I'm pretty sure.
Um, we have with us the very talented and influential Dr. Sara Hooker, who is the co-founder of Adaption Labs, and, uh, we, as in me and Sayak, and I'm pretty sure a lot of you folks, know her from the Google Brain days, so that's really interesting. And pardon me if I mess up, which I have, uh, due to being starstruck today. And, uh, I wanted to take a moment and not only introduce her as a researcher but also as a very kind human being, because of the chats that we have shared in Slack which led to this event. Uh, so without any further ado, I give to you Dr. Sara Hooker. So Sara, please take it away.
>> That's fantastic. And this is so punctual. We're starting right at 12:30.
This is great. Um I'm happy to make the best use of this hour. I know this is a more informal um group. So I I can present some slides, but I'm also happy to do much more open discussion about what's interesting and what people find important. Um when I was invited to this talk, I shared um I asked what would be interesting to talk about and I think the response was well to share some of why we started adaption and maybe like what questions motivate me right now. So I'll share some slides. I'm actually in Singapore right now which is why we were able to make this timing work. Um and I as a part of that I was putting together some slides for the presentation I'm giving here today. So, I'll briefly go through those, but I think different from how I did the presentation here, um, I would love to keep this much more informal and maybe more conversational and so we can hopefully at some point we can liberate people to unmute. Sounds like those permissions have been removed. Um, but we can give those back and we can have much more open. I would love to hear what's on your minds and what's interesting. But the topic for this that was put to me was like, you know, why did what drives me now? like what's the problem that I consider most important? I'll share like I arrived in Singapore um earlier this week and I had to put together a bunch of slides and I had 17 hours on the plane which was very convenient. So I said okay I'm going to put together these slides and I thought okay well let me do something handy.
I'll just, uh, ask, uh, ChatGPT to do it.
And so I said I need an opening slide that speaks to why we need adaptive intelligence. Um, this was the result.
It's pretty bombastic. It's got a very high amount of flair. You can see here we have, which is quite fun, clearly some evolution influence with the lizard.
Um, we have maybe some Darwinian references. Um, so I said, okay, interesting. Let me see if I can do an intro slide. For reference, this is my normal intro slide. So I've done work at different frontier labs. Um, I was at Google DeepMind, Google Brain for a long time, and then I led Cohere's research arm. Um, this was what ChatGPT produced.
Um, there's only one small problem if you take a close look. So not quite.
Um, and I actually think that this is probably very typical of how we experience AI today. So I could have given a thumbs up, thumbs down, and then it would go sit in some researcher's pile of, uh, RLHF preferences for a few months, and then maybe it would improve the next version. Or I could become a prompt engineer, and I could have spent a few more hours getting the results I wanted. And I think this illustrates what I consider to be the biggest problem, which is that current AI is very monolithic. And most of the last decade of progress, especially in frontier labs, has been: how do you build the best model, and then how do you just kind of throw it over the fence and hope for the best? And, um, this has been shipping the same model to everyone. And I think intuitively it has two downsides. It puts immense pressure on end users, especially end users around the world, to make it work somehow for their use case, and that often looks like doing acrobatics with prompt engineering. But it's also very inefficient, in my opinion. We spend the same amount of compute irrespective of problem, because we apply the same type of model to every problem. And I would say this is really the cost of static intelligence, um, and it's really the cost of a one-size-fits-all world. So today, um, I will talk about a paper called the slow death of scaling, and I think that's very much what's motivated me: thinking that we're in an interesting moment where there's an inflection point in this recipe, and it's not enough to just build the biggest model anymore. Now we're in the age of interaction, which is very fun. Uh, and we have an hour together, so I'll go through these slides fairly quickly. Um, but then we can open up, uh, and spend probably 35 to 40 minutes in open discussion. Um, so how did we get here? So, the limits of scaling: this is the paper that I suggested we kick off with today.
It's called the slow death of scaling.
And honestly the slow death of scaling says, well, you know, on the one hand, for the last decade we've done this bigger-is-better race in the amount of compute.
This is captured by Rich Sutton as the Bitter Lesson. So what Rich is saying is that the only thing that matters in the long run is the leveraging of compute.
It's kind of a punch to the ego of every computer scientist out there. It's saying you can have this intricate idea and it can be so elegant, but all that really matters is if it can leverage compute or not. And so, um, he says also that this is really the lesson that, you know, the history of computer science teaches us. So it's worth asking: is Sutton right? So here's the first question. I'm curious, like, what do people think? Um, is Sutton right? And you could do thumbs up, thumbs down. Uh, you know, who thinks that Sutton is right?
No one. Okay. Yes. Thank you. Thank you, brave souls. Great. We have a few.
I mean, someone better think he's right.
He did win a Turing. He's a well-established researcher. Um, so this is great. Who thinks that he's wrong?
Okay. I have Ardash is saying yes. No, that he's wrong. Sashan. Okay. Yeah.
Very good. So, we have a few rogue contrarian believers in the audience today. Um, I actually think that this is a very important question to ask because in some ways this has determined so much of our ecosystem, right? So much of how we organize and do innovation has been about this question: is the most critical ingredient of progress scaling compute and scaling model size? It's resulted in jokes about being GPU rich or GPU poor. It's resulted in Michael Jordan, the researcher, not the basketball player, saying today we can't think without holding a piece of metal. Uh, it's resulted in researchers like myself, who would traditionally be in academia, spending the last decade in industry labs, right? So this massive shuffling of talent and resources, and it's determined who gets to participate and who doesn't. Uh, it's also resulted in compute being seen as a national priority. I think that this is really interesting because this question is so important to ask, and it's still quite contrarian to suggest that it's not the most important ingredient for progress. Throwing compute at the problem is still widely favored. It's seen as more derisked than any type of algorithmic improvement or architecture improvement. It also fits very nicely into industry planning cycles, which is a terrible reason to do something. But, you know, often very basic organizational reasons are why people keep on pursuing certain types of ideas. And I think the other thing is that even when people raise a lot of money, you know, they say, well, we need this because of compute. So it's a bit awkward afterwards to say, well, maybe we didn't need this. And so for all those reasons, this is a very sticky recipe. This has also had massive implications for determining who gets to shape the technology we build. So, you know, we can poll this room about who company A, company B, and company C is. You will likely say some of the same companies, and that's who's determining so much of who gets to shape and own AI. And that's very wild, but it speaks to this fact that if you believe Sutton is right and all that matters is scale, you've seen this concentration of both talent and resources within a handful of labs. So it is controversial to suggest it's over. Thank you to the two people who gave a thumbs up when I asked whether Sutton is wrong. So we're in the minority, and let me see if I can convince you by the end of this presentation.
Um, I would say actually there's a lot of evidence to contradict the view that monolithic large models are the future. Um, one is that AI models have been getting far more performant at the same size over time. So this is actually from the Hugging Face Open LLM Leaderboard. It's been retired, but it was very helpful because it had historical data over the last few years, and basically it shows that models under 13B, um, have been steadily improving. But probably more interesting is that small models frequently now outperform large ones. So this is the same historical data. You have these really interesting daily leaderboard submissions. This is the best small model versus all the models that underperform it, submitted to the leaderboard every single day. And this is interesting because what it tells us is you can't have a predictable recipe from size alone. There are also just severe redundancies between weights.
This is such a classic paper and I really enjoy simple classic papers.
Nando is now um at Microsoft AI but uh he and his collaborators found that you can use a small set of weights to predict 95% of the weights in the network. What this tells us is that most of these weights are redundant. They're just doing the same thing. You can confirm that by just removing them afterwards. So although you appear to need a lot of weights to converge in an appropriate way, you can actually remove them after the training is finished and you only see minimal degradation in performance.
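The claim that many weights can be removed after training with little loss can be sketched with magnitude pruning. This is a minimal illustration, not the weight-prediction method from the paper mentioned above: the function names, the toy weight vector, and the 90% sparsity level are all assumptions chosen for the example.

```python
import random

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def dot(w, x):
    """Output of a single linear unit."""
    return sum(wi * xi for wi, xi in zip(w, x))

rng = random.Random(0)
# Hypothetical "trained" weights: a few large values carry the signal,
# the remaining ones are close to zero.
weights = [rng.gauss(0, 1.0) if i < 10 else rng.gauss(0, 0.01) for i in range(100)]
x = [rng.uniform(-1, 1) for _ in range(100)]

pruned = magnitude_prune(weights, sparsity=0.9)
n_zero = sum(1 for w in pruned if w == 0.0)
gap = abs(dot(weights, x) - dot(pruned, x))
print(f"zeroed weights: {n_zero}, change in output: {gap:.4f}")
```

In this toy setup the output moves only slightly even though 90 of 100 weights are removed. Real networks need the same check on held-out data, since the tolerable sparsity depends on the model and the task.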
The other striking thing is you can get away with, um, much less, uh, in terms of capacity if you increase the quality of your data. Um, in general, probably the most damning evidence that size is not everything is just how disappointing recent releases have been that have drastically increased the size. So, it's been widely regarded that GPT-4.5 was not seen as a dramatic step-wise increase in performance despite being much more sizable. And in fact, it was only briefly productionized and then it was replaced with routing, because it was seen as expensive to serve and not worth the disproportionate cost of serving. We also see similar responses to Llama 4, and also even with Mythos, it's very interesting.
I very much doubt that it will ever be served at scale, because it's helping with a very specialized part of the distribution, but it's also being widely perceived as too expensive to serve. And so this is super interesting, right?
Because while we're seeing certain improvements in certain parts of the distribution, the cost of scaling size, I think, has been regarded by many as not worth it. And so within frontier labs, you're unlikely to see a full X of size again this year. Um, and I think that says everything, because what it says is, uh, the rate of return no longer makes sense for pre-training. So increasing the size of your model, um, or applying compute there doesn't. But I think what's interesting is that, you know, this goes to what happens after the slow death of scaling model size. We have these new eras of optimization. The rate of return is now relatively much better for post-training, test-time scaling, and adaptive compute. And this actually means all bets are off, which I find very interesting to think about.
The next year of intelligence will require much more than brute force. And it also means that the era of research is back. Like, how do we actually combine these ingredients? Post-training and test-time scaling are very different phenomena. The idea of how you use sequential processing, how you interact: it's a different set of skills than just colocating and training dynamics. You need to think about serving fast. You need to think about offloading from GPU to CPU and back. Um, and you really need to think about interface, because if your model's interacting with the world, you need to think about how it gets feedback so it can continue to adapt. And this is really why I think it's so critical to work on adaption right now.
So our focus is really on continuous learning and how a model interacts with the world, and this is a big departure, right? Because our pursuit as a field has been very much, since the 1950s, around the algorithm.
The Dartmouth conference in 1957, which was the first conference where AI was coined, was a conference where the mission was in part to give a model, a single model, skills normally reserved for humans. The idea that you're not just an algorithm, that we're in an interesting time where optimization is around new spaces, is fundamentally different. But this is what I find very exciting, and I'll share, because I think this is good, why I consider these important directions. One is optimization in the data space. So we for the first time have a data space that is much cheaper to optimize in and to steer in, and that changes everything, and I'll share why.
Um, I think one of the first things we did was release Adaptive Data, and we partnered with Hugging Face, um, which has also been super meaningful.
But I think that this is really: hey, we should actually be using the same techniques used in frontier labs to shape datasets that target model behavior, and expand and target different parts of the distribution that are rare. Why do I say that? Because, let's say, you know, in the first part the real conclusion, right, is that transformers are saturated, so there's a limit to scaling model size, and in fact you should be leveraging capacity better.
One way to do that is to optimize in the data space and basically work around the limitations of deep neural networks, which are that it's very expensive and it's hard to learn a long tail: make your data space a long tail. This is a fundamental paradigm shift. If you think about all the statistics, all the machine learning, the assumption is you have a random sample of your... >> Oh, excellent. I think, I guess, someone has unmuted. So, someone got power.
Excellent. Um, okay, they've been muted.
So, at least we've tested it worked. Um, yeah, and I think that it's very interesting because this is a departure: it suggests we can target parts of the distribution, which is quite powerful.
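One simple way to picture "targeting parts of the distribution" instead of taking a random sample is to reweight sampling toward rare groups. The sketch below is only an illustration of that idea under stated assumptions: the function name, the `temperature` parameter, and the toy two-language corpus are hypothetical and are not taken from the Adaptive Data release.

```python
import random
from collections import Counter

def tail_weighted_sample(examples, key, k, temperature=1.0, seed=0):
    """Sample k examples with replacement, upweighting rare groups.

    Each example gets weight (1 / group_count) ** temperature, so
    temperature=0 is plain uniform sampling and temperature=1 gives
    every group the same total probability mass.
    """
    counts = Counter(key(e) for e in examples)
    weights = [(1.0 / counts[key(e)]) ** temperature for e in examples]
    rng = random.Random(seed)
    return rng.choices(examples, weights=weights, k=k)

# Hypothetical corpus: 950 rows of a common language, 50 rows of a rare one.
corpus = [("en", i) for i in range(950)] + [("yo", i) for i in range(50)]

uniform = tail_weighted_sample(corpus, key=lambda e: e[0], k=2000, temperature=0.0)
balanced = tail_weighted_sample(corpus, key=lambda e: e[0], k=2000, temperature=1.0)

uniform_counts = Counter(lang for lang, _ in uniform)
balanced_counts = Counter(lang for lang, _ in balanced)
print("uniform: ", dict(uniform_counts))
print("balanced:", dict(balanced_counts))
```

Intermediate temperatures trade off between following the raw distribution and fully balancing it, which matters because heavy oversampling of a small group repeats the same rows many times.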
Um, and that has been quite special because you can create AI-ready data very quickly. I'll share this finally because I think it's very related. How do we create models that continuously learn? One thing I care about a lot is speeding up innovation and the acceleration of, like, how do you adapt to the world? One version of this is, uh, Auto Scientist. So most fine-tunings outside of frontier labs fail, and they fail because people don't have the right data. Then they find it too expensive to fine-tune, um, and they don't know how, and so much of that know-how is locked in frontier labs. But I think now this idea that you can automate the end-to-end, that you have these longer sequence processes and you're optimizing across multiple steps, is very possible. So I find this very exciting because again it's a change to interaction: you optimize first your data, then you learn from that to do the next step. But also it's a movement, actually, you know, even beyond Auto Scientist, towards: how at the end of the day do you not need data at all? How do you just optimize and learn from the type of task and directly adapt in real time? Um, this was fun. We just released this, and we're actually releasing a technical report in a month. Um, but we benchmarked against our researchers. And what's interesting is that Auto Scientist outperformed researcher-set configurations. I attribute this in part to the fact that most research, uh, specialists are trained to optimize a single model family.
So for example, when I was working on Aya languages, we knew exactly how to configure that stack and that architecture. You develop a lot of domain knowledge. But Auto Scientist has a much wider search space, right? It's any model type, any frontier open-weights model type, and that is much harder for an AI researcher to get right without past experience. And so it's quite interesting. That's why I think it is sizable despite our research staff having a lot of experience training frontier models, but I think that makes it cooler. And this is another fun fact maybe I'll share. So you'll see here these are the lists for Auto Scientist, but you'll see they're all in the 60s. That's actually because we limited the search budget to stopping if the performance ended up over 60. So we just removed that limit. So I'm actually excited to see what we get to next. But that's quite fun. So now we've expanded the search budget, which is really cool.
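The effect of a stop-at-60 rule on a search budget can be sketched with a toy random search. This is a hedged illustration only: it is not how Auto Scientist works, and `budgeted_search`, `toy_score`, and `sample_lr_exponent` are hypothetical names with a made-up objective.

```python
import random

def budgeted_search(evaluate, sample_config, max_trials, stop_score=None, seed=0):
    """Random search that ends when the trial budget is spent or, if
    stop_score is set, as soon as the best score exceeds it."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    trials = 0
    for trials in range(1, max_trials + 1):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
        if stop_score is not None and best_score > stop_score:
            break
    return best_config, best_score, trials

def sample_lr_exponent(rng):
    """Hypothetical search space: learning rate 10**e with e in [-5, -1]."""
    return rng.uniform(-5.0, -1.0)

def toy_score(e):
    """Made-up objective that peaks at 100 when e == -3."""
    return 100.0 - 25.0 * (e + 3.0) ** 2

# Same seed, so both runs see the same sequence of candidate configurations.
capped = budgeted_search(toy_score, sample_lr_exponent, max_trials=50, stop_score=60.0)
full = budgeted_search(toy_score, sample_lr_exponent, max_trials=50)
print("capped:", capped[1], "after", capped[2], "trials")
print("full:  ", full[1], "after", full[2], "trials")
```

Because the capped run halts at the first configuration that clears the threshold, its reported score says more about the stopping rule than about what the search could reach with the full budget.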
Um and so I think we'll see even bigger gains there which is nice. Um so I'm just going to end here. I want to share a little bit. So, a lot of what I talked about was really why I think this is an interesting time to to have an inflection point with the type of research we do. So, hopefully I've convinced you that this is not great.
Maybe maybe there's still some people who think it's okay, but I probably and I hope I've also tried to convince you that we're now in a period of decreasing returns to compute, especially that applied to model size. Um, and instead, regardless of whether I've convinced you that transformers are saturated, I hope I've convinced you that it's very expensive to scale them. That basically we end up paying a lot for the long tail. Um, and so what matters most now is the cost of adaption. And like who makes adaption, learning from new and coming information as as efficient as possible. I think compute is currently the least interesting idea to throw at a problem at least training time compute.
Test time compute is quite interesting.
And there the question is how do you leverage it and like how do you make it um much more adaptive based on the task.
But increasingly we should justify additional complexity and then scaling occurs by focusing on efficiency.
Efficiency is one of our core values and I'll share why because if a model is interacting with the world the speed at which it's learning from new information matters the most. Who gets to shape adaptation is going to be someone who makes that as efficient as possible. and how quickly you can explore your environment is very very much dictated by the efficiency of your learning. And so for us, this is one of the principal things that has to happen. Um, and yes, we're doing a lot of research on these pillars. We're actually very public about these pillars on our website, but basically that is um I'm a big fan that you take very clear, dedicated bets and you throw a lot of talented uh people behind them. And so these are really we have a single northstar which is to make the whole stack adaptable and all of our work is focused around these pillars. So if any of them are interesting I'm happy to chat. Um but why don't I share Oh yeah I did have this thank you. So because I was giving this talk in Singapore um I did ask them to tailor it and this was a bit better. I thought this was lovely. You can kind of see the backdrop of Singapore. Um but maybe I'll leave this here and then why don't we open up for discussion. and I would love to just make it much more dynamic and we can talk about whatever.
>> Um I had two questions to get the ball rolling.
>> Oh great.
>> Um um one question is uh would the kind of models that we are talking about and compute being the least exciting bit which I also uh agree to. My first question is would the models have to go through any kind of technical changes so that they become more adaptable uh to specificity and stuff like that like do those would those models require changes for uh being more controllable and steerable? Uh my second question is so we mentioned training time scaling we also mentioned test time scaling but few of us are also aware of a middle ground which is test time training.
>> Yes.
>> Um and I also wanted to get your thoughts on what you think about test time training. So those are my two questions.
>> Excellent. That was cheeky, getting in two at the beginning, but yeah, let's go through both of them. I think they're both really good questions. So one is, um, I think your question amounts to: can you explicitly optimize for adaptability? And I actually think that you should. I mean, there's two aspects to that. One is if you're hosting a model, and you're hosting a model 24/7, you can automatically infer characteristics of a task, and you should leverage that information to change model behavior. And you can optimize a model to be more flexible at test time. And I actually think that's a key part of it. So training techniques, post alignment, um, should be used in conjunction with gradient-free test-time techniques to more powerfully, uh, adjust your model.
And I think that's a very important research bet, which is: you know ahead of time that you want your model behavior to change. How do you make the model much more flexible at changing that behavior, um, through your alignment and through understanding the variety of past distributions of tasks? Um, which is really interesting. Um, I think that the goal of that, and all the compute applied there, should be to make it as fast as possible to adapt in real time. Like, that's the interesting part. Um, and the second question is about test-time training, which, yes, I also think is a very important, um, axis, right? It's really about how do we leverage, um, all these, uh, trajectories and all the signal that we're receiving as we explore an environment, and how do we leverage a combination of parametric and non-parametric knowledge. And I think this is actually one of the most crucial questions: how to get that balance right. What do you store in the parameters versus what do you store in external knowledge, or what do you store in context, um, or what do you store in terms of guiding your search budget? You know, Auto Scientist is very interesting because that's automating the process of training itself, but I actually think Auto Scientist is most interesting for future problems where you also automate, you know, how do you set up your harness to explore, how do you dynamically set your search budget for some of these problems based on the type of problem it is.
>> Uh, yeah, a follow-up and I'll just leave it there. Uh, one quick follow-up to my first question is, uh, do you think there is a good study? I know this is fairly open-ended, but I have always been interested in studying how things like pre-training, post-training, and test-time scaling for a set of tasks complement each other. I think there has to be a complementary benefit. Like, till what point in time should I post-train my way out of things, and when exactly should I start incorporating test-time scaling? I am not looking for any absolutes, but relative, uh, boundaries and relative quantities also help drive some of the decisions quite a bit. And this is something that I've been trying to work on for the past at least two years, and so far I do not have any clear directions, so...
>> Oh really? No, that's not true, you've done great work in this area. Um, but I think there's a few things there. So there's a few things we know.
One is, for example, um, that you get a lot of return from test-time compute, especially if you place it on examples that you're most uncertain about. So this is basically the premise of adaptive compute. You should spend more time on high uncertainty, and how you do decoding, for example, is a very interesting example of test-time compute. Like, how many samples should you draw in parallel? How do you ensemble those examples? I think this is all really interesting. The relationship between all of them is quite fascinating, because for me one of the key things to, um, draw out is we know some things about the relationship between pre-training, continued pre-training, post-training, and RLHF. We know, for example, that basically you shouldn't have too much repetition of your same data between continued pre-training, um, post-training, and RLHF. You need to be injecting new data at different stages to keep it fresh and dynamic. We also know that you should be introducing certain types of data in critical ways at certain points of, um, training, right? So continued pre-training has now become very associated with introducing the first of reasoning, the first of math and your code. And I think that is quite interesting because, um, how do we optimize that mix? I think that's a perfect candidate for auto research, and we should be doing more work there to automatically learn those, as well as how we ramp up those, uh, components. I think something that's desperately missing as a separate field of study is the relationship between those mixes and test time, which I think you're getting at. For me, it's not quite a question of diversity, which is often the core question in terms of don't repeat data between these different stages. It's actually what type of data should be stored. So, for example, the idea of facts, which are fairly stable and never change.
It's not clear to me that's a good use for, like, yeah, I don't know. Some of those are very easy to retrieve, right? They're very simple, um, and it's not clear that that's a good use of capacity. In some ways, skills or the ability to navigate have to be taught in a very general way, otherwise you get locked into certain toolkits. That's one of the other lessons from all these pre-trainings: if you codify exactly what your tool is, you basically create a very brittle model that breaks when you introduce new tooling. So the dynamics of how you introduce knowledge, and what is stored in the weights versus what is stored outside, I think is actually one of the core questions, and I think it relates to the nature of knowledge as well: there are certain things that you basically want to just enable knowledge of. Like, a model should be aware that there can be tools it can leverage, but maybe it shouldn't be aware of what exact tools there are, because then you can introduce more flexibility at test time. So things like that I think need more work, and that is probably very important and doesn't get talked about enough: what do we reserve for certain types of knowledge?
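The adaptive-compute premise from this answer, drawing more samples where the model is uncertain and then ensembling them, can be sketched as a vote with early stopping. This is a minimal sketch under assumptions: `adaptive_vote`, its thresholds, and the two stand-in samplers are hypothetical, and a real system would call a model's decoder instead of the stubs.

```python
from collections import Counter
from itertools import cycle

def adaptive_vote(sample_answer, min_samples=3, max_samples=15, agree=0.8):
    """Draw answers until the leading answer holds at least an `agree`
    share of the votes (after min_samples) or max_samples is reached.
    Returns (majority_answer, samples_used)."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    votes = Counter()
    for n in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        top_answer, top_count = votes.most_common(1)[0]
        if n >= min_samples and top_count / n >= agree:
            break
    return top_answer, n

# Stand-ins for a decoder: an "easy" prompt where every sample agrees,
# and a "hard" prompt where samples alternate between two answers.
easy_sampler = lambda: "4"
hard_answers = cycle(["a", "b"])
hard_sampler = lambda: next(hard_answers)

easy_result = adaptive_vote(easy_sampler)
hard_result = adaptive_vote(hard_sampler)
print("easy prompt:", easy_result)  # stops at the minimum number of samples
print("hard prompt:", hard_result)  # uses the whole sample budget
```

Agreement among samples is only one proxy for uncertainty. The same loop works with any confidence signal, and the draws could run in parallel batches instead of one at a time.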
>> Yeah.
>> Yeah. Um, no, thanks for being, uh, so generous with the answers. I have taken plenty of notes, and this is all recorded, and Jim is also taking notes. So I'm sure that I'm going to be, um, referring to these things quite a bit. Um, I think with that, Aritra would like to take some questions from the audience, because we already, uh, have quite a few. So...
>> Yeah, but before that I would like to, uh, you know, also get one of my questions in. Uh, so Sara, thank you for the great talk. I just wanted to ask a very nice question wherein I wanted to know how adaptive intelligence relates to continual or continued learning. Are the two related? Are the two very far away from each other? And how might drift in data change both of them?
>> Yeah, I mean I think that they're very related in the sense that you typically want to add a capability while preserving the rest of your knowledge.
Continual learning typically refers to, um, a time factor. So the reason why continuous learning has come so much to the fore with urgency again, why it's, um, such a pressing topic, is mainly because we're moving towards, uh, these long-horizon tasks, right? If a model is interacting with the world, it's having to absorb information, make a decision, and proceed to the next choice. That's where continuous learning comes into play. My strong belief is that for you to be able to do that successfully, at every step you need to be adapting to the new information and learning. And so adaption, and specifically efficient adaption, really makes that whole sequence much more efficient, because it makes your search space much more controlled.
Whereas what happens now, what's the state of the world right now, is that people just do these massive rollouts, and it's so expensive because basically they're paying for searching the space in the most, um, inefficient way possible: they're only getting the signal at the very end of the rollout. And that's where I see the two as very intertwined. So our mandate is to adapt, and that means at every single step you should be changing model behavior and, uh, incorporating new information. Continuous learning refers to this very important task of essentially a long time horizon, and that you don't want to forget things. So you want to add capabilities but not detract. Um, so yeah, both very important parts of the puzzle.
>> Right. Thanks for that. A little follow-up: when you talk about adaptive intelligence, do you particularly mean gradient-based optimization, or might the optimization be gradient-free? And what about it in general?
>> Yeah. So actually, for us, our main north star is that adaption should be real time. As I was saying earlier, we do have a lot of work on post-training optimization to make a model more flexible at test time, but the goal is no gradient updates at test time. I think this is very important, because as we think forward to how we combine these training techniques with test-time techniques, really it's that we should be training models with the idea that they're going to have to adapt and incorporate new information. But once we have that trained model, it's not allowed to train anymore. It basically needs to incorporate new information using all these additional techniques, which are gradient-free.
>> Thanks for the great answer. Now moving on to the questions from the audience.
So the first is about efficiency being a pillar. Is this about compute efficiency only, or would it also include data efficiency? So this is... Yeah. Go ahead.
>> Yeah, both. I think that's a great question. So, it's interesting.
I've been at a lot of frontier labs, and people ask me, why adaption? I think to do continual learning well you actually need to position teams differently. In traditional frontier labs you have a modeling team who are responsible for the model, then they hand it to another team who's responsible for serving, and then there's another team that's responsible for the front end. I actually think it's crucial to continual learning that, first of all, the whole stack is adaptable.
So you need to be able to change your data on the fly and change your model behavior, and actually I think the interface is also important. But to do that, especially efficiently, I think you have to care about co-designing the model algorithm with serving. Sudip, who's my co-founder, I've known for 10 years. He was at Google and he was also at Cohere, and he did the Gemini Pathways. So he's infrastructure and systems, and I actually think this combination of co-designing is going to be crucial for doing very fast adaptation. But feedback from the environment also matters, so the interface has to matter. This is actually a very good question, because I think it also relates to the different set of skills now needed to innovate at the edge of what's possible. But yes, data is certainly part of it. Adaptive Data was intentionally our first release to the wider ecosystem, because we see scaling and changing your data, and targeting parts of the data distribution very quickly, as very critical to this.
>> Great. So moving on. Replying to your point about adapting data curation pipelines used by frontier labs: what do data curation and data generation techniques look like in the compute- and data-efficient paradigm Adaption Labs is pursuing, especially in pre-, mid-, and post-training?
>> Oh, interesting. Wait, that's such a long question. So, let me read it.
>> So, sorry, what was the first part and what was the second part?
>> Okay, I'll just type it again in the chat.
>> Okay, so: re your point about adapting data curation pipelines. What do data curation and data generation techniques look like in the compute- and data-efficient paradigm?
Okay. So I mentioned this, and I think it's quite important: it is now cheap enough to optimize in the data space. That's pretty profound, because traditionally, remember, you spend a lot of time, effort, and cost collecting data. Then you may annotate it, but that also takes so much time. So you'd better get the categories right, because you're not going to go back and redo it. That's most of computer science history. Now, for the first time, I think we have the ability to steer in the data space, and we've actually released papers about this: you can even steer towards non-differentiable objectives, because you can penalize based on the characteristics of the data and what objectives you want it to represent.
This is profound because it means we should be leveraging that space a lot more, and that's one of the most efficient levers to get the behavior you want, whether you use it for gradient-free techniques, like adding it in context or in retrieval or many other techniques that are gradient-free, or leverage it in the gradient sense. And so the ability to specify, scale, and create data on the fly is one of the most important. So with Adaptive Data we will also be sharing something shortly, which is "invent a dataset", which allows you to summon parts of the distribution that are missing. But this is quite key: to be able to target properties and do it very flexibly.
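One simple way to picture steering in the data space toward a non-differentiable objective is to score candidate samples with an arbitrary function and keep the best ones. The sketch below is only an illustration under stated assumptions: `lexical_diversity` is a hypothetical stand-in objective, the candidate pool is made up, and plain top-k selection is a swapped-in technique, not the method from the papers mentioned in the talk.

```python
from typing import Callable, List


def select_by_objective(
    candidates: List[str], score: Callable[[str], float], keep: int
) -> List[str]:
    """Keep the `keep` highest-scoring candidates.

    `score` can be any Python function; it does not need to be
    differentiable because no gradients flow through it.
    """
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:keep]


def lexical_diversity(text: str) -> float:
    """Hypothetical objective: fraction of distinct words in the text."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


# Made-up candidate pool, purely for illustration.
pool = [
    "the cat sat on the mat",
    "good good good good",
    "sparse models need new hardware",
    "",
]

selected = select_by_objective(pool, lexical_diversity, keep=2)
print(selected)
```

The selected samples could then be used either gradient-free (for example, placed in context) or as training data, which mirrors the two uses described in the answer.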
>> I see. This is going to be a loaded question. If scaling laws are becoming less reliable, why is frontier investment still going into further optimizing LLMs rather than exploring alternatives like LeCun's JEPA, for example?
>> Because I think it's very difficult to stray. Just because scaling model size is plateauing doesn't mean there aren't still other areas of optimization around LLMs that you can exploit. For example, our current architectures tend to be quite expressive. They can generate and collaborate with other models in powerful ways. So we see the shift from pre-training compute towards test-time and agentic collaboration. I think it means, more than anything, that the system matters more. You can't get away with just scaling model size. You have to think about how you leverage context and how you interact with other model outputs in a more powerful way.
What does that mean for the post-transformer world? Typically it's very hard to design a different paradigm, because hardware and model type tend to be so intertwined. I wrote a grumpy paper about this called "The Hardware Lottery". It really speaks to the fact that the hardware lottery has, in my mind, even gotten worse. I just saw a talk by Bill Dally, who's the chief scientist at NVIDIA. He was talking about how much of the optimization of GPUs in particular has just been for matrix multiplies. Matrix multiplies make up 99% of most modern neural networks.
The same way we're made up primarily of water, deep neural networks are primarily made up of matrix multiplies.
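A back-of-the-envelope count shows why matrix multiplies dominate. The sketch below counts floating-point operations for one hypothetical feed-forward block (two dense layers with biases and one elementwise activation). The layer sizes, and the choice to count each elementwise operation as one FLOP, are assumptions for illustration; it does not verify the 99% figure for any real network.

```python
from typing import Tuple


def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * k * n


def ffn_flop_breakdown(tokens: int, d_model: int, d_hidden: int) -> Tuple[int, int]:
    """Return (matmul FLOPs, elementwise FLOPs) for one feed-forward block."""
    matmul = matmul_flops(tokens, d_model, d_hidden) + matmul_flops(
        tokens, d_hidden, d_model
    )
    # Assumption: bias adds and the activation cost one FLOP per element.
    elementwise = (
        tokens * d_hidden  # first bias add
        + tokens * d_hidden  # activation
        + tokens * d_model  # second bias add
    )
    return matmul, elementwise


# Hypothetical sizes chosen only to make the ratio concrete.
matmul, elementwise = ffn_flop_breakdown(tokens=1024, d_model=1024, d_hidden=4096)
fraction = matmul / (matmul + elementwise)
print(f"matmul share of FLOPs: {fraction:.4%}")
```

Matmul cost grows with the product of the layer widths while elementwise cost grows only with their sum, so the matmul share rises as layers get wider.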
And so what's difficult about that is, if your hardware is overly optimized for a single type of architecture, it's very hard to make other ideas empirically successful. So I actually welcome this new wave of what I think are colloquially referred to as neolabs, who are really betting on the next era of intelligence, because we need more diversity in the approaches. Will that look like a different architecture? I think most people in this field know there are severe limitations to transformers, particularly their inefficiency: batch size averaging means you pay a lot to learn the long tail, and it takes ages in training. So we know this is not the final answer. But when will an alternative emerge? I think that's slowed down a bit by the fact that our hardware is so overfit to it.
>> That's a lovely answer. The next one is quite close to my heart as well. In the context of your paper on the slow death of scaling, what do you think is the best path for a small lab with a strong data and training strategy but limited model size to stay competitive?
>> Oh, I think it's now fun again. Think about it: for the last 10 years, every year there would basically just be a doubling, quadrupling, 10x in model size, and it would just be a much more performant model, right? And no frontier lab is 4x-ing or even doubling the size of the model this year, which means everything is about the post-training, and also the innovation, the recipe, and how you leverage context and interaction. The characteristics of that type of compute are very different. For pre-training, which is what drove progress in size over the last 10 years, you basically need co-located compute: you need enough GPUs in the same data center, and you need stable and reliable connectivity. With this type of compute, which is, you know, an agentic harness, you can have way more redundancy, you can have distributed data centers, and you can leverage different providers in a meaningful way. What's interesting is that it means the recipe matters the most: the algorithmic component. And that means all bets are off, right?
It makes it very interesting in terms of who is going to dictate and shape who does this both most efficiently but also what's the right way of leveraging this.
Even something like auto research or auto R&D, which I think is very critical for harnesses, because what's happening right now is that people are trying to use general harnesses, and it wastes so many tokens and is so computationally inefficient. So I think those are very core questions, which are very promising for a lab that doesn't have as many resources.
Does this mean that compute is not important? No, I wouldn't say that. You certainly still benefit a lot from having compute. It's just that the rate of return is much higher with test-time and dynamic post-training compute, which doesn't require as much dedicated compute in the same place, and that can be an advantage.
So I don't know, maybe that was an imperfect answer, but I actually see it as quite a promising time for distributed innovation, and particularly for global innovation. That is quite important to me personally. We're global-first from day one as a company, but it also matters in terms of the dynamics of who gets to contribute to the frontier.
>> Right. One of my questions was also: when we talk about adaptive intelligence, the inputs matter a lot for the model to adapt. So what do you think the next, let's say, 6 months are going to look like for the user experience of talking to a model, or just interacting with a model?
>> Oh, this is such a fun question. Yeah, so our third pillar is actually adaptive interfaces, for a reason. I love this question because I think code accelerated a lot because it used an interface that engineers were very comfortable giving feedback in. I think code and design both fall in that category. Design interfaces for AI were also very rich early on, and designers give a lot of feedback because they're very opinionated.
You see a complete absence of that for basically all other tasks, and all those other tasks were basically relegated to a chat interface with thumbs up, thumbs down. I find it super interesting to think about how we can impart the same interfaces that were valuable for code to those tasks, and how you do that much more dynamically. I'll back out: the field has different solutions for this. You see browser agent solutions, which are all about how you impart agents with the knowledge of how humans traverse the world. I find this less interesting because I don't think you're going to get that great data, because really those are mimicking how a human traverses, but they're not in collaboration with the human. I think it's much more interesting to ask: can you create useful interfaces for humans doing their tasks every day, where you're collaborating with a human? I see these as differing bets, and you can see different labs making different bets. With the browser agent one, you get a lot of data, but the idea that you're going to be able to see how someone interacts or what works, and also the idea that there's going to be a brand new paradigm for how people want to interact, you're not going to get that from browser agents, because that's just leveraging the internet as it is.
Whereas I actually think the internet of the future should probably look different and we should be able to summon to us the right interface at the right time for a given task. Yeah, but this is a very fun question. So I'm curious what other people think too.
>> That makes sense. If people want to talk about this, please send us your replies in the chat and I'll be sure to pick them up. Another question is: if pure scale is helping only at marginally better levels, do you think this is logarithmic, so that we will have to have a foundational rewire, or is it linear enough for it to be hard to switch? I'm not sure whether I get...
>> I guess that's about the rate of return for compute. So I think what my paper suggests is that it's marginal. That would be much more like a log scale, but I think even log scale might be too generous in a bit. This is a big, active subfield, right? The idea of scaling laws: some people really strongly support them. I'll distinguish two groups. Scaling laws are often used within frontier labs to predict the next training run. They're incredibly useful there, because you're controlling for most things, same architecture, same data, and you're just saying, hey, if I scale based on this historical data, extrapolate out. Very reasonable use case. Scaling laws are also often used in the public discourse, in conversations like this: how does compute influence performance, capability, safety? And there, almost all scaling laws have been shown to be inaccurate, and painfully so. A lot of that is, again, that you have these decreasing returns over time, because the thing that most locks in your dynamics of scaling is your architecture. That's the heaviest prior that controls it. For test-time scaling, on the other hand, we see this massive slope right now. We're seeing massive returns for test time: when you apply more compute for exploration, that's still a very nice return rate. And that's why I think what you'll see is most compute shifting towards this inference-based kind of search space.
>> All right. Another question is: what research domains are currently undervalued because they don't fit the scaling narrative?
>> Oh, that's very interesting. In what sense? So, if they don't fit the scaling narrative... well, okay, a great example is alternative architectures, right? So, capsule networks. There was a paper led by Sara Sabour, who was working with Geoffrey Hinton at the time. I want to say it came out in 2015. No, it would have been 2017, or 2018. Someone's going to fact check me. But I think what's interesting is that that paper was actually proposing something quite interesting, right? Whether you agree with it or not, it was different.
At the time it was an alternative to convolutional neural networks, and convolutional neural networks had all these hacks. They had max pooling, which discards most information. And convolutional networks were also interesting because you could have, for example, a jumbled face and the model would still recognize it as a face, because it was invariant to position.
And so Geoffrey Hinton and Sara Sabour basically said, we need something more sensible that takes structure into account. That was almost impossible to get working on hardware optimized for deep networks, because it involved operations like squashing.
So that's a great example of the penalty of the hardware lottery. Sparsity is another one. Structured sparsity works well, but the best compression results you can get are with unstructured sparsity, and that doesn't play well at all with current hardware, so no one uses it in practice. But that's a terrible penalty, right? It just doesn't scale because it doesn't work. Which are very fun examples.
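The structured versus unstructured distinction can be made concrete with a toy magnitude-pruning sketch. Unstructured pruning zeroes the smallest individual weights wherever they sit, while structured pruning (here, whole rows) removes entire blocks that hardware can skip easily. The matrix values are made up, and "retained magnitude" is only a crude proxy assumed for illustration, not a measure of model accuracy.

```python
from typing import List

Matrix = List[List[float]]


def prune_unstructured(w: Matrix, sparsity: float) -> Matrix:
    """Zero the smallest-magnitude individual entries, anywhere in the matrix."""
    cells = sorted((abs(x), i, j) for i, row in enumerate(w) for j, x in enumerate(row))
    n_drop = int(len(cells) * sparsity)
    drop = {(i, j) for _, i, j in cells[:n_drop]}
    return [
        [0.0 if (i, j) in drop else x for j, x in enumerate(row)]
        for i, row in enumerate(w)
    ]


def prune_rows(w: Matrix, sparsity: float) -> Matrix:
    """Zero whole rows with the smallest L1 norm (one form of structured pruning)."""
    n_drop = int(len(w) * sparsity)
    order = sorted(range(len(w)), key=lambda i: sum(abs(x) for x in w[i]))
    drop = set(order[:n_drop])
    return [[0.0] * len(row) if i in drop else row[:] for i, row in enumerate(w)]


def retained_magnitude(w: Matrix) -> float:
    """Sum of absolute values of the surviving weights."""
    return sum(abs(x) for row in w for x in row)


def count_zeros(w: Matrix) -> int:
    return sum(1 for row in w for x in row if x == 0.0)


# Made-up weights: every row mixes large and small entries.
weights = [
    [0.9, 0.01, 0.02, 0.8],
    [0.03, 0.7, 0.6, 0.04],
    [0.05, 0.06, 0.5, 0.4],
    [0.3, 0.07, 0.08, 0.2],
]

unstructured = prune_unstructured(weights, sparsity=0.5)
structured = prune_rows(weights, sparsity=0.5)
print(retained_magnitude(unstructured), retained_magnitude(structured))
```

At the same sparsity, the unstructured result keeps more of the large weights, but its zeros are scattered irregularly, which is the pattern that dense matrix-multiply hardware handles poorly.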
>> That's a very nice answer. Okay, so do you think domain-specific models will come to replace frontier general-purpose foundation models?
>> I think that we're seeing a pendulum swing again. Firstly, I want to share my thoughts on why customization over the last few years has largely failed. If you think about it, most providers offered fine-tuning, and I think that was largely considered a failure. For most people who tried it, it took a long time. They had to prepare data. They didn't like the results. They went back to being prompt engineers. I think that was also because, during the era of predictable scaling gains, you would come out with a new bigger model and it would erase a lot of those gains.
I guess there are three trends now that I think are changing that completely. One is that there's a switch towards usage-based billing instead of all-you-can-eat subscriptions. So everyone is finally internalizing the cost of using APIs, and I think that is really driving people to say again, whoa, I need an alternative that is lower latency and that I can leverage in a more economical way. The second is that, with the auto research and auto R&D efforts, for the actual practice of how you customize successfully, the speed of innovation is much faster. So you can try things faster and you will fail less often. But maybe the third one is that agent workflows typically compound error in more unexpected ways.
And so people are feeling the shortcomings of whatever they use as an off-the-shelf model. Anything that wasn't quite working for your particular slice of the distribution, whether that's language or context or the database structure or your very specific tone, is all compounded when you do a sequential process. And so again, I think that means we're going to see pools of models and more dynamic switching, and people are much more interested again, saying, okay, I'm more open to not just doing prompt engineering, because I need stronger levers of control.
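The compounding effect in a sequential process can be sketched with simple arithmetic. Assuming, purely for illustration, that each step of an agent workflow succeeds independently with the same probability and that every step must succeed, the end-to-end success rate is that probability raised to the number of steps. Real workflows may recover from errors or fail in correlated ways, so this is a simplified model, not a measurement.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """End-to-end success rate when every one of `steps` independent steps must succeed."""
    return per_step_success**steps


# Hypothetical per-step success rates and workflow lengths.
for p in (0.99, 0.95, 0.90):
    for n in (1, 5, 20):
        print(f"per-step {p:.2f}, {n:2d} steps -> {chain_success(p, n):.3f}")
```

Under these assumptions, a model that is right 95% of the time per step completes a 20-step chain only about 36% of the time, which is why a small mismatch with a particular slice of the distribution becomes much more visible in agentic use.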
>> I see. Another question, one I really like asking researchers who have been around for a long time now: did you see a shift or evolution over time in what a beginner should try to educate themselves on to get into machine learning or deep learning research?
>> I think the difficulty that I mostly see, and it's very interesting, is with the cutting-edge techniques. When I started and when I joined Google Brain, it was so special because our mandate there was sharing with the world exactly how you do science, right? My first year at NeurIPS was like a pilgrimage, because I was meeting all these researchers and I had read their papers, and that for me was very profound. At the time I carried around the deep learning textbook, and I literally remember Aaron Courville, who ended up being my PhD adviser, and I was just so in awe, because these are the things that inform you and how you first start to get to know the field. Whereas now, unless you're part of a frontier AI lab, the sense is you don't actually know the secret sauce, or there are some things that are not shared. And it's weird, because now when I go to NeurIPS, all the frontier labs will meet in a hotel room. Which is weird, but I don't know, for some reason it's always someone's hotel room, and we'll do this game called underrated, overrated.
And what's interesting about that is you're trying to figure out, from what someone says is underrated or overrated, what they are actually doing. And it's the weirdest thing. It reminds me of how someone described the Apple technical teams: when you get invited to an Apple party, the technical teams spend as much time talking to each other because they're trying to figure out what each other is doing. So I think it's much harder in some ways for a beginner to get to the frontier of the field. On the other hand, it's much easier for someone to get started and to use something. Think about code generation. Think about a lot of what we're working on now, where we enable people to train and own and do that auto research much faster. The iteration cycle is a lot faster, and that's pretty profound because it changes who can participate. So I see the extremes as just more pointed now: it's harder to get to the frontier, but it's much easier to start to make an impact in your community now than it was when I started.
>> Great answer, by the way, but a follow-up would be: does it also mean that people should not make it their goal to get into a foundational lab at this point in time, or do you see it a little differently?
>> Oh, I actually think it's very sad that people have to aim and hope to have an opportunity at a handful of labs. I grew up in Mozambique. I spent my childhood there, and I got a scholarship to the US.
I was very lucky. It opened so many doors. But I think about that sequence and like, wow, so few people have that same path. And it's very clear to me.
Then you look at the handful of institutions that all my colleagues come from, like the Stanfords, the Berkeleys. It's just a handful of places. And I think, wow, that's really unfortunate, right? The idea that there are only a few hundred people who can create this technology is kind of absurd. And so that's actually one of the reasons I care a lot about making adaption as fast as possible. I think it changes the dynamics of who can do the research, and automating R&D is very powerful in the same way that code generation acceleration has been very powerful. I think both mean a lot for changing the dynamics: the best idea starts to win at that point, not where you went to school or whether you were lucky enough to get a scholarship at the right time.
>> That makes sense. We have 8 minutes more. Do you want to ask any questions, or do I scour through the comments and look for other questions?
>> I think it might make sense to take other questions. I think I really asked what I wanted to ask.
There's already a new question. So maybe uh we could take that.
>> All right. So the question is a big one.
How can the asymmetrical learning of current LLMs be resolved, such as their behavior on basic questions like the number of r's in strawberry, or on summing tasks, while being good at difficult maths? That's interesting.
>> Yeah, that's a very good question, actually. Okay, so the strawberry question drives me nuts, I think for two reasons. One, the reason so many models struggle with it is a tokenization issue: the r's just collapse. The way some frontier model providers have actually solved this is that they just have a rule on top now, and then they isolate it, which is funny because it really is acknowledging a shortcoming. So I guess here's the thing: does this matter as an example? Is this knowledge that a model should learn? It's kind of interesting to think about. The number of r's in strawberry is a very esoteric example that people like because it's an example of something very obvious to us that these models are really bad at. So yes, in some way, if the goal is to make models as smart as humans, yes. But how much compute do you want to spend on that in pre-training?
Because it's a tokenization issue, you basically have to solve it from the very beginning. So I think that's where it gets to the heart of it: what do you leverage as your parametric knowledge that needs to be solved for, versus what do you resolve at test time? I think that's quite interesting to think about. It does speak to a limitation, right? These models have strengths relative to ours.
They're already much better at math than most of us. Computer scientists traditionally are very good at math, but I would say humans in general are not very good at sequential processing. And so even those of us who think we're very good at math are probably not as good as an LLM or an automated process. But there are probably also some things that we're better at, right? The clearest example is that we're still way more efficient at processing new information. And so I think this is worth thinking about, because often in computer science history the goal has been portrayed as: we want to build human intelligence.
But I actually think it's not clear that should be the case. In fact, there are some things that models will do better because it's a different way of learning than ours, and there are some things we will inherently do better. For example, global updates are so cheap for us because they're mostly social-based. We show up at the same time. We care deeply about the respect of others, which means things like COVID are fascinating, because we all agreed globally, simultaneously, that we were going to lock down. Think about how crazy that is across so many humans. It speaks to the fact that our global updates are so cheap because they're based on us wanting the respect of others. So, super fascinating. I think about that a lot. That was a very long answer to that question, but I think it's important, because we should think about what our true north star should be with these models. And oftentimes, I'm not sure it should always be exactly like us. It should be about how we make these actually useful in the real world, and so about problems that we're particularly not good at, but that might be better done in complement with us.
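The tokenization point from the strawberry answer can be illustrated with a toy example. Counting letters is trivial when the input is a sequence of characters, but a model typically receives a sequence of subword token ids, and the ids themselves carry no spelling. The subword split and the vocabulary ids below are hypothetical, chosen only for illustration; real tokenizers may split the word differently.

```python
word = "strawberry"

# Character-level view: counting is a direct operation on the string.
char_count = word.count("r")

# Hypothetical subword split and made-up vocabulary ids.
tokens = ["str", "aw", "berry"]
vocab = {"str": 101, "aw": 102, "berry": 103}
token_ids = [vocab[t] for t in tokens]

# The model-facing input is just these integers. Recovering the letter
# count requires knowing the spelling behind every id, i.e. a lookup
# that has to be learned or supplied from outside.
r_per_token_id = {vocab[t]: t.count("r") for t in tokens}
count_from_tokens = sum(r_per_token_id[i] for i in token_ids)

print(token_ids)
print(char_count, count_from_tokens)
```

In this toy setup the count is only recoverable because the per-token spelling table is given explicitly, which is one way to picture the choice between baking such knowledge into the parameters and resolving it at test time with an external rule or tool.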
>> That's a brilliant answer. Um I we have four minutes. I'm just going to uh throw in a little question of mine so that it ends uh in convergence to the talk. Um I would like to know what your advice would be for a community like this um where people could take something away from you and also continue doing what they are doing.
>> I think communities are very important.
So before I got to Google Brain, and now it's DeepMind, I actually mostly studied by myself.
I got up at 4:00 a.m. every day and I would study, and it was incredibly lonely. And I think for most people who are chasing something like wanting to contribute at the frontier, it can feel extremely lonely, because typically you're balancing it with everything else in your life. And so I think ecosystems like this are so important, because they provide the sense of making progress in a powerful way. And typically, for your acceleration, I always say that the most important thing is choosing good problems and choosing people who are better than you to work with. Figuring out who's better than you and then throwing yourself into it is pretty critical. So I think ecosystems like this are really, really important. I'll just say, Susan asked a question about current theoretical research that you think is pertinent. So I'll just say one thing here. The fact that we need large models but can remove weights afterwards tells us we need large models because optimization is unstable. You need more parameterization to converge. That's a super critical optimization problem. If we could start small, we would have a fundamentally different paradigm and return for compute, and that's an optimization problem.
It's about stability. Optimizers tend to be very difficult areas, and theoretical areas, to work on. I'm not saying you should; there are many people who have sacrificed themselves to trying to invent a new optimizer and failed. I'm just saying that it is an area of massive return if you can do it right. Yeah.
>> Great. I think we can close with this optimization problem to think about, and thank you once again for joining us.
I'm pretty sure you're busy, and 1 hour of your time means a lot to us. So Shahik, do you have anything else to close with, and then we could...
>> No, I share the same sentiment. It means a lot to us that, as a founder, you took out the time while being in Singapore, rather than going out to explore the beautiful country of Singapore, which we both love going to.
>> I'm in a very official room right now.
I'm in between other tours. So it's very fun. This is a fun part of my day, actually, because after this I have to go meet dignitaries.
So it's very fun to have a break and talk about research and what matters next.
>> Cool. Cool. Uh, no, that uh that sounds that sounds interesting and funny at the same time. Um, but yeah, I mean it's been it's been a uh it's been great uh hosting you uh and I hope we can stay in touch. And for the participants, thanks for joining us and asking cool questions, questions that will matter.
Um uh and as some logistical stuff, we will share the recordings uh and the notes when we have them. So yeah, please be on the lookout. And yeah, that's it.
Thanks for joining.
>> Amazing. Lovely to meet everyone. Chat soon. Bye.