This presentation provides a lucid synthesis of the shift from brute-force scaling to architectural refinement and physical embodiment. It effectively bridges the gap between theoretical compute limits and the pragmatic necessities of edge deployment and multimodal reasoning.
The Future of AI – Key Trends Shaping What’s Next • Ekaterina Sirazitdinova • YOW! 2025
This presentation was recorded at YOW! Australia 2025. #GOTOcon #YOW https://yowcon.com

Ekaterina Sirazitdinova - Senior Developer Advocate at NVIDIA @TechWithKatja

RESOURCES
https://twitter.com/katjasrz
https://github.com/katjasrz
https://www.linkedin.com/in/katjasrz
https://katjasrz.github.io

Links
https://youtu.be/P-aHuUJfAMM
https://build.nvidia.com/nvidia/video-search-and-summarization
https://github.com/NVIDIA/TensorRT-LLM
https://github.com/ai-dynamo/dynamo

ABSTRACT
AI is evolving faster than ever, with groundbreaking advancements in multimodality, agentic workflows, and reasoning capabilities redefining what’s possible. In this talk, we’ll take a high-level look at the latest trends shaping the future of AI, from scaling laws to the rise of more autonomous and adaptable systems. This session will provide a fast-paced overview of where AI is headed and what these innovations mean for the next wave of intelligent technology. [...]

TIMECODES
00:00 Intro
03:37 How far AI has come
13:11 Reasoning token budget
15:29 Learn & understand everything
23:20 Physical AI is hard to develop
30:07 Low inference latency is a must
32:02 Smaller model footprint required for local deployment
33:44 AI model distillation
35:32 Speculative decoding
36:16 Quantization of weights & activations
38:12 NVIDIA TensorRT-LLM
42:11 NVIDIA Dynamo
42:28 Key takeaways
43:05 Outro

Download slides and read the full abstract here: https://yowcon.com/brisbane-2025/sessions/3808
Thank you so much for the lovely introduction. Can you hear me well?
Perfect. I'm super excited to be here. I flew here from Dubai, which, compared to other speakers, was probably the shortest journey.
It took me just 13 hours to get here.
It's my second time in Australia, and every time it is super exciting to be here.
So, today I'm speaking about the future of AI, key trends, and what's hottest right now in AI research and development.
I usually say there are three categories of people. The first ones say that the future with AI is bright, and I certainly cannot blame them. The second category of people know that AI is already here, and they actively use it and take full advantage of it.
But there is also this category of people who say, "AI will not affect my workflows. It's not relevant to me.
I'll probably be fine without using it." I had a conversation with a person like that just a month ago, and this totally inspired me
to generate this image.
So, while some of us are already fully benefiting from AI and using it for vibe coding, for travel itinerary prediction, and so on, some people still remain very resistant to it.
I have given this talk a few times already, and usually afterwards some people ask, "Could you tell me which jobs will be most affected by AI?" Coincidentally, a few months ago I read a nice paper from Microsoft Research where they developed a methodology for estimating and analyzing which jobs have the highest AI applicability score. So, I'll share it with you, and if you're curious, you can check. In this paper, they describe the methodology, and they also share the top and bottom 40 jobs by AI applicability score. So, good news for you: if you work as a dredge operator, a bridge and lock tender, a water treatment plant and system operator, and so on, your workflows won't be affected by AI in the nearest future. But if you're a developer, and we are at a developer conference, I guarantee you are in the top 40. So, you're not in the top 15.
And that doesn't mean that AI will replace you, but the people who are using AI will replace you if you are not. So, my biggest advice here: if you haven't tried it yet, try using AI in your work. I've tried it. It is really great, so I can highly recommend it.
Now let's take a look back at where it all started.
So, going back to the year 2012: that's the year when a competition in image processing was held. It had actually been held annually for many years before, but in 2012 there was a group of researchers from Toronto who revolutionized the field. The competition was quite simple. You would be shown images, and you needed to identify not only cats and dogs, but common objects including cats and dogs. The task was to classify what is in the image. Different groups were using different approaches, like Canny edge detection, support vector machines, and so on.
But there was this one group from Toronto who proposed using a convolutional neural network for that. And they put this neural network on a GPU, which made running it practical. Neural networks actually date back to the '70s or '80s. So, the technology was already there, but it was hardly usable, because running a neural network on a CPU was quite slow.
So, the year 2012 was pivotal in the field of deep learning, or, as we all call it now, artificial intelligence.
This sparked the emergence of many new techniques, like perception for robotics, computer vision, and image recognition, but also speech processing and recommender systems.
Later, in the year 2017, another famous paper was published. It's called Attention Is All You Need, and this paper brought to the world the architecture called the Transformer, which has now become the basis of modern generative AI.
And then what we call the generative AI revolution happened: a small company in Silicon Valley, OpenAI, basically opened their model to the world, and they called it ChatGPT.
The technology had been there since 2017, but OpenAI made it so accessible to everyone that the whole world instantaneously knew that this exists and that it is actually very useful.
At first, all these chatbots were very useful, but they had many drawbacks: the systems hallucinated, and they were not explainable.
So, you could not fully rely on them.
The good thing is that the whole thing became so popular that the research community mobilized, and in just a few years we had most of the problems solved already. So, for explainability we have retrieval-augmented generation, RAG. And we have different ways to test whether the system is hallucinating or not.
And this led to the emergence of the so-called agentic AI, where we rely on these Transformer architectures, but we also use different tools and find ways of connecting these models to databases of different kinds. So now, we can say, is the time of agentic AI. But we are also talking here about the future, and I think the next big thing, if you ask me, is physical AI.
So, in physical AI we are not only relying on chatbots, but we are actually using AI to handle objects. We're using AI to basically coexist with us in the physical space.
In the rest of my talk, I'll dedicate some time to talk a little bit more about how physical AI works and what it means.
But let's also take a look at what led to this progress in physical AI and reasoning. We call it scaling laws.
A couple of years ago, OpenAI published this interesting paper where they were able to show that the more you scale your AI system in pre-training, the more intelligent it becomes. So, what does it mean?
Scaling the AI model in pre-training means adding more and more parameters to the model,
adding more data to train this model, and adding more time to train this model.
So, basically, the motto is "the more, the better." The more time and the more compute you put into your AI system, the more intelligent it becomes. But then, of course, researchers started thinking: if we keep doing that, this is not sustainable at all, right? We cannot indefinitely increase the number of parameters in our models. Yes, we can build huge data centers, but maintaining these data centers is costly, and the amount of energy they consume is enormous. So, this led the community to realize the urgency of optimizing these models and of understanding how to make them smarter and more intelligent without increasing the parameter count. And this led to the next scaling law, the post-training scaling law. In post-training scaling, we take a model, which we call a foundation model, a model which has already been trained on internet-scale data, and we optimize it. We make it smarter by applying techniques like reinforcement learning from human feedback, by adding RAG, and by fine-tuning this model to specific cases. So, this is called post-training scaling.
But the progress hasn't stopped there.
And recently, about a year ago, we observed the emergence of a new scaling law, the so-called test-time scaling, or reasoning, law.
We found out that if you don't run your inference just once, but let the model think and talk with itself, you can actually achieve a better, more intelligent response.
So, reasoning has led to a very active evolution of agentic AI and agentic reasoning.
So, in this image I illustrate how reasoning works in a nutshell. We have our user who communicates with the system, and in the front end of the system you have some UI which communicates with the AI model in the back end. This should be some large model which understands lots of things. This model should be capable of analyzing the user's ask and then deciding on its own which tools to call and which processes to execute. To give you an example, there is the famous strawberry test, as we call it: you ask your large language model to count the letters R in the word strawberry. I encourage you to actually test it on your own in your free time with different models to see the evolution of reasoning.
So, the older models with reasoning capabilities would just go and write a whole essay, like 10 pages, in order to come to a conclusion about how many letters R are in the word strawberry. If you were lucky, they'd get it right. These days, they actually say, "Okay, it's a very simple task actually. I need to decompose the word into letters, do the arithmetic operation, and count the R's." So, it's getting better, right? Similarly, 2 + 2 is a very simple task for us as humans, but if it's just a retrieval system, it can get it wrong.
However, if the network is smart enough to understand, "Oh, that's just a simple calculator task," it can solve it much faster. So, this is reasoning. You can also do web search. You can call all sorts of tools: you can communicate with SQL databases and, of course, with RAG databases.
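The idea of recognizing a "simple calculator task" and handing it to a tool can be sketched in a few lines of Python. This is only an illustration: the `route` function and its regular expressions are hypothetical stand-ins for the decision a reasoning model makes on its own, and the two tools are deliberately tiny.

```python
import re


def count_letter(word: str, letter: str) -> int:
    """Deterministic tool: count occurrences of a letter, ignoring case."""
    return word.lower().count(letter.lower())


def calculator(expression: str) -> int:
    """Deterministic tool: add or multiply two integers, e.g. '2 + 2'."""
    match = re.fullmatch(r"\s*(-?\d+)\s*([+*])\s*(-?\d+)\s*", expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    left, op, right = int(match.group(1)), match.group(2), int(match.group(3))
    return left + right if op == "+" else left * right


def route(request: str) -> str:
    """Hypothetical router: pick a tool for the request (assumption, not a real API)."""
    letter_query = re.search(r"count the letter (\w) in (\w+)", request.lower())
    if letter_query:
        return str(count_letter(letter_query.group(2), letter_query.group(1)))
    arithmetic = re.search(r"-?\d+\s*[+*]\s*-?\d+", request)
    if arithmetic:
        return str(calculator(arithmetic.group(0)))
    return "no tool matched; fall back to the model's own answer"


print(route("Count the letter r in strawberry"))
print(route("What is 2 + 2?"))
```

In a real agent the routing decision comes from the model itself, typically through a tool-calling interface, rather than from hand-written patterns.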
So, modern reasoning is used in most of the modern chatbots.
The downside of reasoning is that if you let the model think for a very long time, it can easily get expensive.
Every reasoning call is an inference call, and an inference call translates into data center use and energy use. So, it can easily get very expensive if you go very deep.
There is a solution to that. As a developer who deploys this LLM, you can set the so-called reasoning token budget. In your config file for the LLM, if the LLM allows for that, you can say, "Okay, I want you to go to this depth of thinking and no further." This is very useful because there is no one-size-fits-all solution here, right? If you know that the users of your model are hardcore researchers who need depth of thinking, you can give them a larger reasoning token budget. But if you know that the users will just be using the model for fun and for some quick tasks, maybe you can make it more shallow. So, by doing that, you can limit the time the model thinks, but also adjust the quality of the model.
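A minimal sketch of what a reasoning token budget does, assuming a model that marks the end of its thinking phase with a special token. Everything here is hypothetical: `toy_model`, the `</think>` marker, and the `generate` loop are illustrative stand-ins, not the configuration interface of any particular LLM.

```python
END_OF_THINKING = "</think>"  # hypothetical marker; real models define their own
EOS = "<eos>"


def toy_model(context: list[str]) -> str:
    """Hypothetical stand-in for an LLM that likes to think for 10 steps."""
    if END_OF_THINKING not in context:
        return "step" if len(context) < 10 else END_OF_THINKING
    return "answer" if context[-1] == END_OF_THINKING else EOS


def generate(budget: int, max_tokens: int = 64) -> list[str]:
    """Generate tokens, forcing the thinking phase to close after `budget` tokens."""
    context: list[str] = []
    while len(context) < max_tokens:
        still_thinking = END_OF_THINKING not in context
        if still_thinking and len(context) >= budget:
            token = END_OF_THINKING  # budget exhausted: close the thinking phase
        else:
            token = toy_model(context)
        if token == EOS:
            break
        context.append(token)
    return context


print(generate(budget=3))   # shallow thinking
print(generate(budget=50))  # the model stops thinking on its own after 10 steps
```

A smaller budget trades answer quality for latency and cost; a larger one lets the model stop whenever it decides it has thought enough.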
We have enabled it in our Nemotron model. Many people know NVIDIA as a hardware company, but we actually have lots of software, and we develop our own models, which we are gladly open-sourcing. So, this NVIDIA Nemotron Nano 9B model is available on Hugging Face, and I have developed a small demo. I've vibe coded a vibe coding application, which I'm sharing here in this YouTube video, where I'm basically showing you how to set this reasoning token budget and use it correctly.
The slides will be available afterwards, so please feel free to experiment with it.
Another fundamental technology which is enabling modern AI and modern agentic systems is multimodal generative AI. These days we're not only talking to chatbots, but we are also using images, and the systems are generating images, for example. But not only images: we're generating code, of course, we're developers. We're generating videos, and we are understanding videos. So, there are so many applications you can think of where you can apply AI. And I usually say that you can basically apply AI to any modality, because in essence AI works on embeddings. So, tokens, embeddings. And what are they? They're just numbers, right? If we can turn information into numbers, this means that we can automatically teach an AI system to get patterns from this information. And we can basically turn any information into numbers, right? Images are RGB codes, so through tokenizers we convert RGB codes into tokens and embeddings. Code is just text, so we transform code into numbers and embeddings, and you can apply this to any data. And then, in turn, you can generate any kind of data.
You can generate 3D coordinates of objects. You can generate actuator movements for your robots. So the possibilities here are really endless.
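To make "turning information into numbers" concrete, here is a toy sketch that tokenizes text and RGB pixels and looks the tokens up in an embedding table. The tokenizers and the random embedding table are assumptions chosen for simplicity; production systems use learned tokenizers and learned embeddings.

```python
import random


def text_to_tokens(text: str) -> list[int]:
    """Byte-level tokenization: every UTF-8 byte becomes a token id in [0, 255]."""
    return list(text.encode("utf-8"))


def pixels_to_tokens(pixels: list[tuple[int, int, int]], levels: int = 4) -> list[int]:
    """Toy image tokenizer: quantize each RGB channel to `levels` bins, one id per pixel."""
    step = 256 // levels
    return [
        (r // step) * levels * levels + (g // step) * levels + (b // step)
        for r, g, b in pixels
    ]


def make_embedding_table(vocab_size: int, dim: int, seed: int = 0) -> list[list[float]]:
    """Random table standing in for learned embeddings (illustrative assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(tokens: list[int], table: list[list[float]]) -> list[list[float]]:
    """Look up one embedding vector per token id."""
    return [table[token] for token in tokens]


text_tokens = text_to_tokens("AI")
image_tokens = pixels_to_tokens([(255, 0, 0), (0, 0, 0)])
vectors = embed(text_tokens, make_embedding_table(vocab_size=256, dim=4))
print(text_tokens, image_tokens, len(vectors), len(vectors[0]))
```

Once both modalities are sequences of vectors, the same model machinery can consume either of them.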
In this video I'd like to show you a few demos my colleagues developed using multimodal generative AI. Basically, you can see here that we are building advanced reasoning agents which rely on multimodal generative AI.
And these are just a few applications which can benefit from reasoning agents combining computer vision, text, and speech.
So, you can see that it's not only recognizing cats and dogs anymore, but you can actually follow instructions. You can get the sequence of steps needed to do a task. And you can do task planning and optimization in the warehouse. So, imagine you have
>> Artificial intelligence has made extraordinary progress.
It has only been 10 years.
Now, we've been talking about AI for a little longer than that.
So, speech recognition is also another example of multimodal generative AI.
And we can do it now in real time.
>> So, lots and lots of potentially useful applications. We share here the agentic system. It's open source. We call it the video search and summarization agent, and it's basically a blueprint which you can modify for your own use cases. In the background, it has different sorts of RAGs. So, it has a graph RAG for understanding your videos. You can basically upload a video, generate embeddings from this video, and then you can talk to this video. You can say, "Okay, return me a timestamp of event A or B happening."
For example, if there was a road accident, you can ask, "Has the ambulance already arrived?" or, "Were people hurt in this accident?" Or if you have a construction site, you can ask whether the workers are wearing hats or helmets, and so on. So, this is very useful.
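The "talk to your video" idea can be illustrated with a toy retrieval step: each video segment has a timestamp and an embedding, and a query embedding is matched against them by cosine similarity. The segments, captions, and three-dimensional vectors below are made up for illustration, and this sketch does not reflect how the blueprint itself is implemented.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_timestamp(query_vec: list[float], segments: list[dict]) -> tuple[int, str]:
    """Return (start_seconds, caption) of the segment most similar to the query."""
    best = max(segments, key=lambda seg: cosine(query_vec, seg["embedding"]))
    return best["start"], best["caption"]


# Hypothetical segment embeddings; a real system would compute them with a model.
segments = [
    {"start": 0, "caption": "traffic flows normally", "embedding": [1.0, 0.0, 0.0]},
    {"start": 42, "caption": "two cars collide", "embedding": [0.0, 1.0, 0.0]},
    {"start": 95, "caption": "ambulance arrives", "embedding": [0.0, 0.2, 1.0]},
]

# Hypothetical embedding of the question "Has the ambulance already arrived?"
query = [0.0, 0.1, 0.9]
print(find_timestamp(query, segments))
```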
And multimodal generative AI also serves as a basis for physical AI.
So, in physical AI, as I already said, our goal is to make AI handle objects, right? A good example application, of course, is robotics. We all want to have a helper at home which will do the chores and maybe cook for you, maybe clean for you. So, these are the first steps toward making this possible, right? By using multimodal reasoning, you can basically show the robot the object and say, "Okay, I need you to make stuff happen with this object." So, thanks to reasoning AI, you can get a set of instructions to follow, and then, by generating the actuator movements, you can get it executed.
Developing robots is hard. Imagine you want to build a robot with the expectation of it being a helper robot which comes and saves the world. But then things like that happen. Or you want to develop a robot which will serve you a cup of coffee. I've noticed that here in Australia you have a very strong coffee culture, so maybe this is not the best use here in Australia, but some countries might benefit from having robots like that.
And situations like this can also happen, unfortunately.
I don't want to be at that table.
Or another one, my favorite, a ketchup-serving robot. What could possibly go wrong, right?
Oh god.
This one is particularly painful for me because it happened in my own apartment, and believe me, cleaning all the small pebbles out of this device was really painful. Why does it happen?
The reason is that there is automation and there is autonomy. All these robots I showed you are automated, meaning that there was a developer sitting there, writing commands and saying, "Okay, in situation A, do B.
If this, else that, and so on." But what we are actually trying to achieve here with physical AI is true autonomy. We want the robots to be able to act and react in the physical world on their own, making their own decisions and correcting their own actions. This is the goal we are trying to achieve here.
Physical AI is also hard to develop for basically two main reasons. The first is that in order to train an AI system, you need a lot of data, right? So, how do you actually collect this sort of data? How do you train robots?
And the second reason is that robots also need testing. Imagine you've developed a huge humanoid robot. I don't know how much they weigh, like 300 kilos easily.
And you just don't want it running through your warehouse smashing the shelves, you know, just because you wanted to test it in a real-world environment.
So, the answer here is simulation.
First of all, to train your robot, you can simulate your environment. You can create a digital twin of your environment and then create synthetic data to train your robot.
And you can also use this simulation environment to test your robot, which is a safer and cheaper way to do that.
So, we are developing a series of what we call world foundation models, designed to help users with developing physical AI.
We are developing these models and libraries for developers who want to build their own robots, their own physical AIs. And there we are focusing, first of all, on synthetic data creation.
We have two models for that. The first is Cosmos Predict, which is an autoregressive model, meaning that if you provide this model with an image or a sequence of images, it predicts what comes next in these images or videos.
With the second, Cosmos Transfer, you can take images of renders, let's say from your digital twin, and turn them into a multiverse of photorealistic examples for the robot to be trained on. This is possible thanks to diffusion models, right? If you know diffusion models, they're basically text-to-image or image-to-image AI models which allow you to create variations of images. We call it domain randomization, right? If you randomize textures, shapes, lighting, and there are many parameters you can randomize here, you can actually train your network to know the multiverse of examples. And even if your real scenario does not cover the whole multiverse, you can be confident that it falls within the subset of knowledge the network has.
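Domain randomization itself is simple to sketch: start from one base scene description and sample many variations of its parameters. The parameter names and value ranges below are hypothetical, and the sketch only produces scene descriptions; rendering them photorealistically is the part a model such as Cosmos Transfer would take on.

```python
import random


def randomize_scene(base_scene: dict, rng: random.Random) -> dict:
    """Return a copy of the scene with randomized appearance parameters."""
    scene = dict(base_scene)
    scene["texture"] = rng.choice(["wood", "metal", "concrete", "plastic"])
    scene["weather"] = rng.choice(["clear", "fog", "rain"])
    scene["light_intensity"] = rng.uniform(0.2, 1.5)
    scene["object_scale"] = rng.uniform(0.8, 1.2)
    return scene


def generate_variations(base_scene: dict, count: int, seed: int = 0) -> list[dict]:
    """Sample `count` randomized variations; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [randomize_scene(base_scene, rng) for _ in range(count)]


base = {"object": "pallet", "camera": "front"}  # hypothetical digital-twin scene
for variation in generate_variations(base, count=3):
    print(variation)
```

The properties that matter for the task (here, the object and camera) stay fixed, while everything incidental is varied so the trained network does not latch onto it.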
And then the last one, and probably the most important one in the collection of Cosmos models, is the reasoning capability. With Cosmos Reason, you can provide an image or video of an object, as I described when showing you the image of the toaster, right? Then the reasoning AI model can analyze the image and understand what it is actually expected to do in this environment. The next step, of course, will be enabling the generation of actuator movements. Those of you who have worked with computer graphics know the term quaternions, for example. Generating these values will be extremely useful and important in order to enable robot autonomy.
These are just a few examples from our synthetically generated data set, which we are using to train autonomous driving vehicles. You see that there are some corner cases shown with extreme weather conditions. Some images are foggy. Some images have very sharp, high-contrast light. Usually, when developing a system, these rare cases are called long-tail anomalies. So, the benefit of synthetic data is that it gives you the power to cover these long-tail anomalies and basically make sure that your AI system doesn't miss cases like that. These cases could also be deer running onto the road. Probably not applicable in Australia, but let's say you want to scale your system up, and you're training the system in Australia, right? You don't have deer images here. You have kangaroos.
But it will still be helpful for you to have deer images, so you can synthetically generate them if you want to sell your system in San Francisco, for example, or, I don't know if they have deer, but let's say New Jersey.
The systems are growing in size, right? So, with all these billions of parameters, it is becoming more and more important to optimize the systems.
And when we talk about optimizing the systems, there are many parameters to optimize. We can talk about optimizing for latency, the time from when you send the prompt to when you get the response from your model. Or we can optimize for throughput. Let's say the AI application you developed became super popular, and now thousands of users want to use it. How do you make sure that you don't create bottlenecks and that you serve every user within a reasonable time frame? Or you can optimize for the model footprint. The models grow bigger, but let's say you want to use the system in a constrained environment. You optimize models for that.
There are some cases when it's okay not to optimize for latency, for example. If you're using chatbots, you may have noticed that they do not output all the information at once. Instead, they let you read it line by line, which is totally okay because we as humans cannot read lots of information at once, unless you have super abilities. On average, you read information line by line, and that means that it's okay to allow for some delay here.
But there are cases when low inference latency is a must. The most obvious one is autonomous driving. You cannot wait for the reasoning cycle to finish if there is a child or an adult running next to your autonomous vehicle. You need to stop immediately.
In applications like trading, every millisecond also counts. So, if you're using some AI decision making, you want it to respond immediately.
In real-time broadcasting and gaming, those of you who have ever played online games know that network connectivity lag is very annoying. And if it persists, you don't want to use the application. So, it is also crucial to optimize for that in these applications. In telemedicine, this might sound like science fiction, but in the future, surgeries will be done by robots. Well, this is already happening now, but there will be more and more of it. And imagine a situation where the doctor is on the other side of the Earth, and it will still be possible to perform a surgery here in Australia, for example. Here, human life matters. So, we cannot afford any delay.
And then, theft prevention in retail: imagine you have a retail shop with a bunch of CCTV cameras around. If there is a robbery happening, you want to track it immediately. So, low inference latency is a must in these kinds of scenarios. The list can go on and on, but it is clear that we need to optimize our systems.
Also, a smaller model footprint, as I mentioned, is a requirement in some cases. One good example is autonomous and portable devices. They're usually small in size, and they're usually not plugged into power outlets 24/7, meaning that if you're developing an application which is meant to run on the edge or on an embedded system, you need to optimize for all possible parameters to make the model more energy efficient, smaller, and faster. Then there is also lack or absence of connectivity.
Let's say you have a robot supporting mining operations, not crypto mining, but real mining. Usually in these environments you don't have good network connectivity, so you cannot rely on a data center in the cloud there. You need to use AI on the edge, and for that you also need to optimize your model.
And then there is also privacy and security of data. Some friends of mine working for big corporations say, "Oh, we're not allowed to use external tools like ChatGPT because of privacy reasons. The company just doesn't want us to send our data to some American server." In this scenario, you could build your own small cloud infrastructure and then deploy your own model. Or, if the model is small enough, you can even deploy it on your own machine and still benefit from the power of vibe coding, for example.
Some small models with 4 billion parameters can do a great job already here.
So, what are the ways of optimizing your model? One popular technique these days is distillation.
Basically, in distillation, we have one big teacher model and a smaller student model. They usually have the same architecture, and the knowledge from the teacher model is passed on, distilled, into the student model.
The straightforward, classical way of doing that is to produce the so-called soft labels with the teacher model and then train the student model using these soft labels.
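As a minimal sketch of soft-label distillation, the snippet below softens teacher and student logits with a temperature and measures their mismatch with a KL divergence, which is the quantity a student would be trained to minimize. The logits and the temperature value are illustrative assumptions, and a real training loop would compute gradients of this loss with a deep learning framework.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; higher temperature gives softer labels."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: list[float], student_logits: list[float], temperature: float = 2.0
) -> float:
    """KL(teacher || student) between the temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]  # hypothetical logits for classes cat / dog / fox
print(softmax(teacher, temperature=1.0))  # near-hard label
print(softmax(teacher, temperature=4.0))  # soft label: keeps class similarities
print(distillation_loss(teacher, [2.0, 2.0, 2.0]))  # untrained student: high loss
print(distillation_loss(teacher, [3.9, 1.1, 0.4]))  # student close to teacher: low loss
```

The soft labels carry more information than a single correct class, because they also tell the student which wrong answers the teacher considers nearly right.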
Here the teacher is not a black box, so you need access to the large teacher model, and sometimes you can even train them simultaneously. Many companies, as you may have noticed, publish a range of models in the so-called t-shirt sizes:
a large model with hundreds of billions of parameters, a medium-size model, and then a small language model, an SLM. These are usually trained simultaneously, and the smaller one learns knowledge from the teacher.
But these days, a technique called behavior distillation is also becoming popular. The essence here is that you use the teacher as a black box. It could be a model from another provider: you basically feed in your input prompts in order to generate outputs, and you use these as synthetic data to train your student model. We can argue about the ethical aspect of this, but I just wanted to share with you that it's possible, and it's done by some model providers.
Another technique which is used to optimize your models is called speculative decoding. Here you also rely on two models. One model is fast, but it's coarse, and another model is slower, but it's more accurate, and by combining the two models you can achieve faster performance. The draft model proposes lots of results, and then the target model verifies them and keeps the correct predictions.
Another important and popular technique these days is quantization. Historically, we would train models in floating-point 32 precision, which was good, but then some researchers thought, "What if we actually cut the number of digits and go down to floating-point 16?" It worked really well: it led to the model being executed faster and the model footprint being reduced by half.
But then some researchers also thought, "Oh, what if we actually quantize it to integer eight?" That wasn't that easy. You cannot just round to an integer without losing lots of precision, right?
So, for that you actually need to find a mapping based on the minimum and maximum values in your model, and in order to do that you have two techniques. The first one is called calibration, where you basically run your model on some test data set and find these minimum and maximum intervals. The second one, which is a little bit more accurate, is called quantization-aware training. This is when you use an extra loss when training your model, which already helps you find these mapping intervals during training.
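The min/max mapping found by calibration can be sketched as a standard affine quantization scheme: a scale and a zero point map the observed floating-point range onto the integer range [-128, 127]. The sample values are made up, and the function names are illustrative rather than taken from any library.

```python
def calibrate(samples: list[float]) -> tuple[float, float]:
    """Calibration: observe the minimum and maximum values on sample data."""
    return min(samples), max(samples)


def quant_params(lo: float, hi: float, qmin: int = -128, qmax: int = 127) -> tuple[float, int]:
    """Derive the scale and zero point that map [lo, hi] onto [qmin, qmax]."""
    if hi <= lo:
        raise ValueError("calibration range must be non-empty")
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int, qmin: int = -128, qmax: int = 127) -> int:
    """Map a float to an int8 value, clipping anything outside the calibrated range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an int8 value back to an approximate float."""
    return (q - zero_point) * scale


activations = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical calibration samples
scale, zero_point = quant_params(*calibrate(activations))
for value in activations:
    q = quantize(value, scale, zero_point)
    print(value, "->", q, "->", round(dequantize(q, scale, zero_point), 4))
```

Each value now fits in one byte instead of four, at the price of a rounding error bounded by roughly one quantization step.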
I mentioned integer eight, but these days we are observing the emergence of new techniques which allow you to quantize to even lower precision. I've already seen this happening. Integer eight is also becoming the new standard.
But I've also read some research papers showing how to quantize a model to integer one precision. This is still very experimental and works in just a few cases, but this is also the future. We enable these optimization techniques in NVIDIA TensorRT-LLM. This is an open-source product available on GitHub, and it basically saves you some time if you don't want to develop something like speculative decoding from scratch.
It would be unfair not to mention the emergence of new architectures. The Transformer is not the end game. It has some bottlenecks, and different research groups are working hard on developing new variations and new architectures. Of course I wouldn't have time to go through them all, but I just wanted you to know that this field is also developing. Another big trend in deployment optimization is so-called disaggregated serving. In disaggregated serving, we basically separate the prefill and decode phases of running a Transformer model. So, who of you knows how the KV cache works?
The KV cache basically holds the values we compute for your tokens when your prompt is fed into the model.
Basically, the KV cache is the numeric representation of what the user wants to say: the attention keys and values for each token. The Transformer is autoregressive by nature.
So, when we want to make a prediction with a Transformer, we do it token by token, in a sequential manner, right? But in order to start this sequential process, we need to get the numerical representations of the user prompt.
We call this process caching. Instead of computing these cached values one by one for each new word, we can process the whole prompt in parallel. This process is called prefill.
And then we are ready to start decode.
After every new token is generated, we append the new numeric values to our KV cache, which allows us to do the next iteration and predict the next word in the sequence.
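The prefill and decode phases can be sketched with a toy, single-head attention loop. This is only an illustration under strong assumptions: token embeddings are plain vectors, the learned key and value projections are replaced by hypothetical scalar multipliers, and the attention output is fed back in as the next "token". None of the names below come from a real library.

```python
import math
from typing import List, Tuple

Vector = List[float]


def project(x: Vector, weight: float) -> Vector:
    """Hypothetical stand-in for a learned key or value projection matrix."""
    return [weight * xi for xi in x]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all cached keys and values."""
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
        for key in keys
    ]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


def prefill(prompt: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Build the KV cache for the whole prompt.

    Each token's key and value depend only on that token here, so this
    work could be done for all prompt tokens in parallel.
    """
    keys = [project(x, 0.5) for x in prompt]
    values = [project(x, 2.0) for x in prompt]
    return keys, values


def decode_step(x: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Append one new token's key and value to the cache, then attend over it."""
    keys.append(project(x, 0.5))
    values.append(project(x, 2.0))
    return attend(x, keys, values)


def generate(prompt: List[Vector], steps: int):
    keys, values = prefill(prompt)
    x = attend(prompt[-1], keys, values)  # first new token comes out of prefill
    outputs = [x]
    for _ in range(steps - 1):  # decode is inherently sequential
        x = decode_step(x, keys, values)
        outputs.append(x)
    return outputs, keys, values


prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, keys, values = generate(prompt, steps=4)
```

The cache grows by one entry per decode step, and each step reuses everything computed before it instead of reprocessing the whole sequence.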
So, if you look at these two, prefill can be done in parallel, but decode is sequential, right? Having this different nature means that we can actually optimize the deployment.
We can put the parallelizable prefill operation on the larger chip, which allows more parallelization. And then for decode, we can optimize for sequential operations. This is also a new thing in AI, and it is being actively used lately by model providers.
We have different tasks. We have prefill-heavy tasks, for example document summarization.
We can document code. We can do retrieval-augmented generation or long-context chat. Basically, all the operations which require a lot of input, a large prompt, a large context window, are prefill-heavy tasks. And we have decode-heavy tasks. You can basically say, "Hey, chatbot, generate me a long story." The prompt is very short, but the generated story can be long. That is how we distinguish between prefill-heavy and decode-heavy tasks. And knowing the nature of your task, you can deploy your system accordingly using disaggregated serving, by putting the prefill node on the heavier worker and by using a slower worker, for example a smaller worker but with a stronger CPU, for the decode phase.
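As a rough illustration of the prefill-heavy versus decode-heavy distinction, the sketch below labels a request from its prompt length and expected output length. The `Request` type, the `classify` function, and the 4x threshold are all hypothetical choices made for this example; this is not how NVIDIA Dynamo decides anything.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens: int   # input processed during prefill
    output_tokens: int   # tokens expected to be generated during decode


def classify(request: Request, ratio: float = 4.0) -> str:
    """Label a request by which phase dominates (ratio is an assumed threshold)."""
    if request.prompt_tokens >= ratio * request.output_tokens:
        return "prefill-heavy"
    if request.output_tokens >= ratio * request.prompt_tokens:
        return "decode-heavy"
    return "balanced"


workload = [
    Request("document summarization", prompt_tokens=8000, output_tokens=200),
    Request("long story from a short prompt", prompt_tokens=20, output_tokens=2000),
    Request("short chat turn", prompt_tokens=300, output_tokens=300),
]
labels = {request.name: classify(request) for request in workload}
```

A profile like this over real traffic is one input you could use when deciding how much capacity to give the prefill workers versus the decode workers.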
We implement these techniques in NVIDIA Dynamo. Disaggregated serving and many other new serving inventions are integrated in this open-source product.
It's also on GitHub, so I highly recommend you take a look.
This brings me to the end of my talk. To summarize, I've covered scaling laws with you and how they are driving the research field in agentic AI and reasoning. I've talked about multimodal generative AI and how it serves as the basis for physical AI, which is arguably the next big thing we're going to witness soon. And I've covered the latest and greatest in optimization techniques, which are being actively used by the research community. With that, I'd like to thank you, and I'd love to connect. So, please feel free to find me on social networks. And if you have any questions, I'm always available.