Dwarkesh Podcast

Sergey Levine on the Dwarkesh Podcast: How Robots Learn on the Job

How spoken language instructions during the π0.5 project sped up robot training; Physical Intelligence expects a flywheel effect within five years.

Dwarkesh Patel (host) · Sergey Levine (guest)
Sep 12, 2025 · 1h 28m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 – 17:25

    Timeline to widely deployed autonomous robots

    1. DP

      Today, I'm chatting with Sergey Levine, who is a co-founder of Physical Intelligence, which is a robotics foundation model company, and also a professor at UC Berkeley. And just generally, one of the world's leading researchers in robotics, RL, and AI. Sergey, thank you for coming on the podcast.

    2. SL

      Mm-hmm. Thank you. And thank you for the kind introduction.

    3. DP

      (laughs) Let's talk about robotics. So before I pepper you with questions, I'm wondering if you can give, uh, the audience a b- a summary of where Physical Intelligence is at right now. You guys started a year ago.

    4. SL

      Yeah.

    5. DP

      And what does the progress look like? What are you guys working on?

    6. SL

      Yeah. So Physical Intelligence aims to build robotic foundation models, and that basically means general-purpose models that could, in principle, control any robot to perform any task. Uh, we care about this because we, we see this as a very fundamental, uh, aspect of the AI problem. Like, the robot is essentially, uh, encompassing all A- AI technology, so if you can get a robot that's truly general, then you can, uh, do, uh, you know, hopefully a- a large chunk of what people can do. And where we're at right now is I think we've kind of gotten to the point where we've, uh, built out a lot of the basics. (laughs)

    7. DP

      (laughs)

    8. SL

      And, you know, I think those basics actually are pretty cool. Like, they work pretty well. We can get a robot that will, like, fold laundry and that will go into a new home and, like, try to clean up the kitchen. But in my mind, what we're doing at Physical Intelligence right now is really the very, very early beginnings, just, like, putting in place the basic building blocks on top of which we can then tackle all these, like, really tough problems.

    9. DP

      And what's the year-by-year vision? So, um, one year in, now I got a chance to watch some of the robots, and they can do pretty dexterous tasks like folding a box using grippers, and it's, like, I don't know. It's, like, uh, pretty hard to fold a box even with, like, my hands. Um, if you gotta go year by year until we get to the full, like, robotics explosion, wh- what is happening every single year? What is a thing that needs to be unlocked, et cetera?

    10. SL

      So there are a few things that we need to get right. Uh, I mean, dexterity, obviously, is one of them, and in the beginning, we really wanted to make sure that we, um, understand whether the methods that we're developing have the ability to tackle, like, the kind of intricate tasks that people can do.

    11. DP

      Yeah.

    12. SL

      So as you mentioned, like, folding a box, uh, folding different articles of laundry, cleaning up a table-

    13. DP

      Yeah.

    14. SL

      ... uh, making a coffee, that sort of thing, and that's, like, that's good. Like, that works. Uh, you know, I think that the results we've been able to show are pretty cool, but again, like, the end goal of this is not to fold a nice T-shirt. The end goal is to just, like, confirm our initial hypothesis that, like, the basics are kinda solid.

    15. DP

      Yeah.

    16. SL

      But from there, there are a number of really major challenges, and I think that, you know, sometimes when, um, results get abstracted to the level of, like, a three-minute video, someone can look at those videos, like, it's like, "Oh, that's cool. Like, that's what they're doing." But it's not. Like, it's a very simple and, uh, basic version of what I think is to come. Like, what you really want from a robot is not to tell it, like, "Hey, please fold my T-shirt." What you want from a robot is to tell it, like, "Hey, robot, like, you're now doing all sorts of, uh, home tasks for me. Uh, I like to have dinner made at 6:00 PM. Uh, I wake up and go to work at 7:00 AM. Uh, I'd like... You know, I like to do my laundry on, on Saturday, so make sure that's ready, this and this and this. Uh, and by the way, check in with me, like, every Monday to see, like, w- you know, wha- what I want you to do to pick up when you do the shopping."

    17. DP

      Right.

    18. SL

      Right? Like, that's the prompt, and then the robot should go and do this for, like, you know, six months, a year. Like, that's the duration of the task.

    19. DP

      Mm-hmm.

    20. SL

      So it's a, it's... Ultimately, if, if this stuff is successful, it should be a lot bigger, and it should have that ability to learn continuously. It should have the, uh, understanding of the physical world, the common sense, the ability to go in and pull in more information if it needs it. Like-

    21. DP

      Yeah.

    22. SL

      ... if I ask it, like, "Hey, um, tonight, like, uh, you know, can you, uh, can you make me this type of salad?" It says, "Okay, you should, like, figure out what that entails, like, look it up, go and buy the ingredients." So there's a lot that goes into this. It requires common sense. It requires understanding that there's certain edge cases that you need to handle intelligently, cases where you need to think harder. Uh, it requires the ability to improve continuously. It requires understanding safety, being reliable at the right time, being able to fix your mistakes when you do make those mistakes.

    23. DP

      Yeah.

    24. SL

      So there's a lot more that goes into this. Um, but the principles there are you need to leverage prior knowledge, and you need to have the right representations.

    25. DP

      So, so this grand vision, what year, if you had to give an esti- uh, median estimate?

    26. SL

      Yeah.

    27. DP

      Or 25th percentile, uh, 50-

    28. SL

      (laughs)

    29. DP

      ... 75?

    30. SL

      I think it's something where it's not going to be a case where we develop everything in the laboratory and then it's done, and then, you know, come 2030-something, you get a, a robot in a box. I think it'll be the same as what we've seen with AI assistants, that, uh, once we reach some basic level of competence where the robot is delivering something useful, it'll go out there in the world. The cool thing is that once it's out there in the world, it can collect experience and leverage that experience to get better.

  2. 17:25 – 27:28

    Why robotics will scale faster than self-driving cars

    1. DP

      (laughs)

    2. SL

      (laughs)

    3. DP

      In terms of robotics progress, why won't it be like self-driving cars where we, you know, m- it's been more than 10 years since Google launched its, um... Wasn't it 2009 that they launched the self-driving car initiative? And then I remember when I was a teenager, like, watching demos where we would go to a Taco Bell, uh, um, and drive back, and only now do we have them actually deployed. And even then, you know, they make- make mistakes, et cetera, and so maybe it'll be many more years before most of the cars are self-driving. So why won't robotics... You know, you're saying five years to this, like, quite robust thing, but actually it'll just feel like 20 years of just, like, once we get the cool demo in five years, then it'll be another 10 years before, like, we have the Waymo or the Tesla FSD working.

    4. SL

      Yeah, that's a really good question. So one of the big things, uh, that is different now than it was in 2009, uh, actually has to do with-... the technology for machine learning systems that understand the world around them. Uh, principally, for autonomous driving, this is perception. Uh, for robots, it can mean a few other things as well. Uh, and perception certainly was not in a good place in 2009. The trouble with perception is that it's one of those things where you can nail a really good demo with a somewhat engineered system, but hit a brick wall when you try to generalize it. Now, at this point in 2025, we have much better technology for generalizable and robust perception systems, and more generally generalizable and robust systems for understanding the world around us. Like, when you say that the system is scalable, in machine learning, scalable really means generalizable. Um, so that gives us a much better starting point, uh, today. So that's not an argument about robotics being easier than autonomous driving. It's just a- an argument for 2025 being a better year than 2009.

    5. DP

      Mm-hmm.

    6. SL

      But there's also other things about robotics that are a bit different than driving. Like, in some ways, robotic manipulation is a much, much harder problem, but in other ways, it's a- it's a problem space where it's easier to get rolling, to start that flywheel with a more limited scope. Um, so to give you an example, if you're learning how to drive, you'd probably be pretty crazy to learn how to drive on your own without somebody helping you. Uh, like, you- you would not trust your- your- your teenage, uh, child to learn to drive just on their own, just drop them in the car and say, like, "Go for it."

    7. DP

      Mm-hmm.

    8. SL

      Uh, and that's like a, you know, a 16-year-old who's had, uh, a significant amount of time to learn about the world. You would never-

    9. DP

      Right.

    10. SL

      ... even dream of putting a five-year-old in a car-

    11. DP

      Yeah.

    12. SL

      ... and tell him to get started. But if you want somebody to, like, clean the dishes-

    13. DP

      Yeah.

    14. SL

      ... like, dishes can break too, but you would probably be okay with a child trying to do the dishes, uh, without somebody constantly, like, you know- (laughs)

    15. DP

      Mm-hmm.

    16. SL

      ... sitting next to them with a- (laughs) with a- with a brake, so to speak. So for a lot of tasks that we want to do with robotic manipulation, there's potential to make mistakes and correct those mistakes, and when you make a mistake and correct it, well, first, you've- you've achieved the task because you've corrected, but you've also gained knowledge that allows you to avoid that mistake in the future. With driving, because of the dynamics of how it's set up, it's very hard to make a mistake, correct it, and then learn from it because the mistakes themselves have significant ramifications. Um, now, not all manipulation tasks are like that. There are truly some, like, very, uh, safety critical stuff, and this is where the next thing comes in, which is common sense. Uh, common sense meaning the ability to make inferences about what might happen, uh, that are reasonable guesses, but that do not require you to experience that mistake and le- and learn from it in advance.

    17. DP

      Right.

    18. SL

      That's tremendously important, and that's something that we basically had no idea how to do, uh, about five years ago.

    19. DP

      Mm-hmm.

    20. SL

      But now, uh, you- w- we can actually use LLMs and VLMs, ask them questions, and they will make reasonable guesses. Like, they will not give you expert behavior, but you can say, like, "Hey, there's a sign that says slippery floor. Like, what's going to happen when I walk over that?" It's kind of pretty obvious.

    21. DP

      Right.

    22. SL

      Right? Uh, and no autonomous car in 2009 would have been able to answer that question. (laughs) So common sense plus the ability to make mistakes and correct those mistakes, like, that's sounding like o- o- o- o- an awful lot like what a- what a person does when they're trying to learn something.

    23. DP

      Mm-hmm.

    24. SL

      All of that doesn't make robotic manipulation easy necessarily, but it allows us to get started with a smaller scope and then grow from there.

    25. DP

      So for years, m- m- using, I mean, not since 2009, but we've had lots of video data, language data, and transformers for five, seven, eight years.

    26. SL

      Mm-hmm.

    27. DP

      And lots of companies have tried to build transformer, uh, based robots with lots of training data, including Google, Meta, et cetera, and what is the reason that they've been hitting roadblocks? What has changed now?

    28. SL

      Yeah, that's a really good question. So I'll start out with, uh, maybe, uh, a slight modification to your comment is I think they- they've made a lot of progress.

    29. DP

      Mm-hmm.

    30. SL

      And in some ways, a lot of the work that, uh, we're doing now at Physical Intelligence is built on the backs of lots of other great work that was done, uh, for example, at, uh, at Google. Like, many of us were actually at Google before.

  3. 27:28 – 45:37

    How vision-language-action models work

    1. SL

      and fully autonomous robots.

    2. DP

      Yeah. Okay, and how does the PI model work?

    3. SL

      Yeah. So, the current model that we have, uh, basically is a vision-language model that has been adapted for motor control. So, uh, to give you a li- a little bit of like a fanciful brain analogy, a VLM, a vision-language model, is basically an LLM that has had a- a little, like, pseudo visual cortex grafted to it, a vision encoder. All right? So our models, they have a vision encoder, but they also have an action expert, an action, uh, decoder essentially. So, it has, like, a little visual cortex and notionally a little motor cortex. And the way that the model actually makes decisions is it reads in the sensory information from the robot, it does some internal processing, and that could involve actually, uh, outputting intermediate steps, like you might tell it to clean up the kitchen and it might think to itself, like, "Hey, to clean up the kitchen, I need to pick up the dish and I need to pick up the sponge and I need to put th- this and this." And then eventually, it works its way through that chain of thought generation down to the action expert which actually produces continuous actions. And that- that has to be a different module because the actions are continuous, they're high frequency, so they ha- have a different data format than, uh, text tokens. But structurally, it's still, uh, an end-to-end transformer and roughly speaking, uh, technically, it- it corresponds to a kinda mixture of experts architecture.
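To make the architecture Sergey describes concrete, here is a minimal, illustrative sketch of a vision-language-action model: a shared transformer backbone over image patches and instruction tokens, a text head for intermediate reasoning steps, and a separate action expert that decodes a chunk of continuous actions. All module names, sizes, and the single-encoder layout are simplifying assumptions for illustration, not Physical Intelligence's actual π implementation.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Toy vision-language-action model: shared backbone + text head + action expert."""
    def __init__(self, hidden_dim=256, vocab_size=1000, action_dim=14, horizon=50):
        super().__init__()
        # "Visual cortex": a patch embedding standing in for a pretrained vision encoder.
        self.patchify = nn.Conv2d(3, hidden_dim, kernel_size=16, stride=16)
        # Embedding for instruction / chain-of-thought tokens.
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        # Shared transformer backbone over [image patches, text tokens].
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Text head: predicts the next token of the intermediate reasoning
        # ("clean the kitchen" -> "pick up the dish", "pick up the sponge", ...).
        self.text_head = nn.Linear(hidden_dim, vocab_size)
        # "Motor cortex": action expert decoding a chunk of continuous, high-frequency actions.
        self.action_expert = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, image, instruction_tokens):
        patches = self.patchify(image).flatten(2).transpose(1, 2)    # (B, P, H)
        text = self.token_emb(instruction_tokens)                    # (B, T, H)
        feats = self.backbone(torch.cat([patches, text], dim=1))     # (B, P+T, H)
        next_token_logits = self.text_head(feats[:, -1])             # next reasoning token
        actions = self.action_expert(feats.mean(dim=1))              # pooled -> action chunk
        return next_token_logits, actions.view(-1, self.horizon, self.action_dim)

model = ToyVLA()
logits, actions = model(torch.randn(1, 3, 64, 64), torch.randint(0, 1000, (1, 8)))
print(actions.shape)  # torch.Size([1, 50, 14])
```

In the real system the backbone would be a pretrained VLM and the action expert a flow-matching decoder (discussed just below), but the division of labor is the same.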

    4. DP

      Hmm. And, like, what is actually happening is that it's, like, deco- like, it's, like, predicting I should do X thing, then it's like there's an image token, then some action tokens, like, what it actually ends up doing, and then more image, more, uh, text description, more a- more action tokens. Basically, I'm, like, looking at what- what stream is going on.

    5. SL

      Yeah. That- that's right. Uh, with the- with the exception of the actions are actually not represented as discrete tokens, it's- it actually uses a flow matching kind of diffusion-

    6. DP

      Mm-hmm.

    7. SL

      ... because they're continuous and you need to be very precise with your actions for dexterous tasks.
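Because the actions are continuous rather than discrete tokens, the decoder is trained with flow matching. The sketch below shows the core idea under toy assumptions (a tiny MLP velocity network, random "demonstration" actions, 14 action dimensions, names invented here): train a velocity field that carries noise to expert actions along straight-line paths, then integrate that field at inference time to sample an action. None of the sizes or details come from the π model itself.

```python
import torch
import torch.nn as nn

action_dim = 14                                # e.g. joint + gripper commands (assumed)
velocity_net = nn.Sequential(                  # v_theta(x_t, t): predicts a velocity
    nn.Linear(action_dim + 1, 128), nn.GELU(),
    nn.Linear(128, action_dim),
)
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def flow_matching_loss(expert_actions):
    """Regress the straight-line velocity that carries noise toward expert actions."""
    noise = torch.randn_like(expert_actions)            # x_0 ~ N(0, I)
    t = torch.rand(expert_actions.shape[0], 1)          # random interpolation time in [0, 1]
    x_t = (1 - t) * noise + t * expert_actions          # point on the straight path
    target_velocity = expert_actions - noise            # constant velocity of that path
    pred = velocity_net(torch.cat([x_t, t], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

# One toy training step on fake "demonstration" actions.
demo_actions = torch.randn(64, action_dim)
loss = flow_matching_loss(demo_actions)
loss.backward()
opt.step()

@torch.no_grad()
def sample_actions(n_steps=10):
    """Euler-integrate the learned velocity field from noise to an action sample."""
    x = torch.randn(1, action_dim)
    for i in range(n_steps):
        t = torch.full((1, 1), i / n_steps)
        x = x + velocity_net(torch.cat([x, t], dim=-1)) / n_steps
    return x

print(sample_actions().shape)  # torch.Size([1, 14])
```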

    9. DP

      Right. I find it super interesting that, so you are u- I think you're using the open source Gemma model which is like Google's, uh, LLM, uh, that they released open source and then adding this action expert on top. And I find it super interesting that the progress in different areas of AI is just based on s- this- not only the same techniques, but literally the same model, that you can just use an open source LLM and then add this action expert on top. It is notable that, like, you naively might think that, oh, there's this, like, separate area of research that is robotics and there's a separate area of research called LLMs and, uh, n- natural language processing, and no. It's like, it's literally the same. It's like the considerations are the same, the, um, the architectures are the same, even the weights are the same. I know you do more training on top of these model- open source models, but, y- that I find super interesting.

    10. SL

      Yeah. So one theme here that, like, uh, I think is important to keep in mind is that the reason that th- those building blocks are so valuable is because the AI community has gotten a lot better at leveraging prior knowledge.

    11. DP

      Mm-hmm.

    12. SL

      And a lot of what we're getting from the pretrained LLMs and VLMs is prior knowledge about the world, and it's kind of like, it's a little bit abstracted knowledge, like, you know, you can identify objects, you can figure out, uh, like, roughly where things are in an image, that sort of thing. But I think if- if I had to, like, summarize in one sentence-... the big benefit that recent innovations in AI give to robotics is really that prior, the ability to leverage prior knowledge. And, uh, I, I think the fact that the model is the same model, that's like, that's kind of always been the case in deep learning.

    13. DP

      Mm-hmm.

    14. SL

      But it's, it's that ability to pull in that prior knowledge, that abstract knowledge that has, that can come from many different sources. That's, uh, really powerful.

    15. DP

      Yeah. Today, I'm here with Mark, who is a senior researcher at Hudson River Trading. He has prepared for us a big data set of market prices and historical market data, and we're gonna try to figure out what's going on and whether we can f- predict future prices from historical market data. Mark, let's dig in.

    16. NA

      Happy to do it.

    17. DP

      So, it sounds like the first fun thing to do is probably to start looking at what an order book actually looks like.

    18. NA

      Yeah. I think so.

    19. DP

      Mm-hmm.

    20. NA

      So I've given you, like, real order book data that is snapshots of the top five levels of the order book, both in the bid and ask side for a couple of different tech stocks: NVIDIA, Tesla, AMD, et cetera.

    21. DP

      What is the shape of the prediction? Are, are, are we predicting-

    22. NA

      Why don't you, uh, take the dataframe, look at its Y values and just kind of, like-

    23. DP

      Oh, that's great.

    24. NA

      ... histogram it?

    25. DP

      They are centered at zero.

    26. NA

      They are roughly centered at zero.

    27. DP

      Yeah. But target of what exactly?

    28. NA

      So these things are changes in the mid-price from, like, now to some short period-

    29. DP

      Mm-hmm.

    30. NA

      ... of time in the future.
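As a rough illustration of the inspection Mark suggests, the sketch below builds a synthetic top-of-book dataframe, computes the mid-price, and histograms its change over a short future horizon. The column names, horizon, and synthetic random walk are hypothetical stand-ins; the actual HRT dataset isn't shown in the episode.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the order book snapshots described above (top of book only).
rng = np.random.default_rng(0)
n = 10_000
mid_walk = 100 + np.cumsum(rng.normal(0, 0.01, n))      # synthetic mid-price random walk
spread = 0.02
book = pd.DataFrame({
    "bid_px_0": mid_walk - spread / 2,                   # best bid (level 0 of 5)
    "ask_px_0": mid_walk + spread / 2,                   # best ask (level 0 of 5)
})

horizon = 50                                             # "some short period of time", in ticks
book["mid"] = (book["bid_px_0"] + book["ask_px_0"]) / 2
book["y"] = book["mid"].shift(-horizon) - book["mid"]    # change in mid-price, now -> future

print(book["y"].describe())                              # roughly centered at zero
book["y"].hist(bins=100)                                 # the histogram DP plots (needs matplotlib)
```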

  4. 45:37 – 57:59

    Changes needed for brainlike efficiency in robots

    1. SL

      too.

    2. DP

      And how physically will... So you have, you have this, like, trilemma. You have three different things which all take more compute during inference that you wanna opt, uh, you wanna increase at the same time. You have the inference speed, and so humans are processing 24 frames a second or whatever it is. That we're just like... We can react to things extremely fast. Then you have the context length, and for, I think, the kind of robot which is just, like, cleaning up your house, I think it has to kind... It has to be aware of, like, things that happened minutes ago or hours ago and how that influences its plan about the next task it's doing. And then you have the model size, and espe- I guess at least with LLMs we've seen that there's gains from increasing the amount of, uh, parameters. And I think currently you have 100 millisecond, uh, inference speeds. You have a second long context, and then the model is what? A co- couple billion parameters? How many?

    3. SL

      Mm-hmm.

    4. DP

      Okay. And so each of these, at least two of them are many orders of magnitude smaller than what seems to be the human equivalent, right? Like the model... If a human brain has like trillions of parameters and this has like two billion parameters, and then if humans are processing at least as fast as a model, like, uh, uh, actually a decent bit faster, and we have hours of context

    5. NA

      (clears throat)

    6. DP

      ... depends on how you define human context, but hours of context, minutes of context.

    7. SL

      Sometimes decades of context.

    8. DP

      Yeah, exactly. So you have to have many order of magnitude improvements across all this thing, all of these three things which seem to oppose each other, or, like, increasing one reduces the amount of, um, r- reduces the amount of compute you can dedicate towards the other one in inference. So how are we gonna, yeah, how are we gonna solve this? (laughs)
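A rough back-of-envelope, using the figures Dwarkesh quotes plus an assumed per-step token count, shows why these three axes fight each other: per-second compute scales multiplicatively in model size, tokens processed per step, and control frequency.

```python
# Figures from the question plus assumed token counts; all illustrative, not PI's numbers.
params = 2e9                  # "a couple billion" parameters
control_hz = 10               # one forward pass every ~100 ms
tokens_per_step = 300         # image patches + instruction + action outputs (assumed)

# Dense-transformer rule of thumb: ~2 FLOPs per parameter per token processed.
flops_per_second = 2 * params * tokens_per_step * control_hz
print(f"{flops_per_second:.1e} FLOP/s")    # ~1.2e13, a small fraction of one modern GPU

# Scaling any one axis 10x scales the bill linearly; scaling all three is 1000x.
print(f"{flops_per_second * 1000:.1e} FLOP/s")
```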

    9. SL

      Yeah. Uh, well, that's a very big question.

    10. DP

      (laughs)

    11. SL

      Um, yeah. Let- let's, let's try to unpack this a little bit. I think there's, there's a lot going on in there. One thing that, um, I would say is a really interesting technical problem, and I think that it's something where we'll see perhaps a lot of really interesting innovation over the next few years, is the question of representation for context.

    12. DP

      Mm-hmm.

    13. SL

      So, um, if you imagine the... Like s- some of the examples you gave, like if you, if you have a home robot that's doing something and needs to keep track, as a person, there are certainly some things where you keep track of them very symbolically, like almost in language. Like, you know, I have my checklist. Like, I'm going shopping and I, you know, at least for me, I can, like, literally visualize in my mind, like, my checklist. Like, you know, pick up the, the yogurt, pick up the milk, pick up whatever. And that, and I'm not, like, picturing the milk shelf with the milk sitting there. I'm just thinking, like, milk.

    14. DP

      Right.

    15. SL

      Right? But then there's other things that are much more spatial, almost visual. Uh, you know, when, when I was, uh, trying to get to your, to your studio, I was thinking like, "Okay, uh, here's the, what this street looks like. Here's what that street looks like."

    16. DP

      Right.

    17. SL

      "Here's the, you know, what I expect the doorway to look like."

    18. DP

      Yeah.

    19. SL

      So representing your context in the right form that captures what you really need to achieve, uh, your goal, uh, and otherwise kind of discards all the unnecessary stuff, I think that- that's like, that's a really important thing.

    20. DP

      Yeah.

    21. SL

      And I think we're, we're seeing the beginnings of that with multimodal models, but I think that multimodality has so much more to it than just, like, image plus text, and I think that that's a place where there's a lot of room for really exciting innovation.

    22. DP

      Ooh, do you mean in terms of, um... how we represent?

    23. SL

      Mm-hmm.

    24. DP

      Okay.

    25. SL

      Yeah, how we represent both context, both what happened in the past, and also plans or reasoning-

    26. DP

      Yeah.

    27. SL

      ... as you can call it in the LLM world, uh, which is what we would like to happen in the future, or intermediate processing stages in solving a task. I think do- doing that in a variety of modalities, including potentially learned modalities that are suitable for the job is something that has, I think, enormous potential, uh, to overcome some of these challenges.

    28. DP

      Interesting. Another question I have as we're dis- as- as we're discussing these, like, um, tough trade-offs in terms of, um, uh, inference is comparing it to the human brain and figuring out how the human brain is able to have hours, decades of context while being able to act on the order of 10 milliseconds while having 100 trillion parameters or however you wanna count it. And I wonder if the best way to understand what's happening here is that human brain hardware is just way more advanced than the hardware we have in GPUs, or that the algorithms for encoding video information are, like, way more efficient-

    29. SL

      Mm-hmm.

    30. DP

      ... uh, and maybe it's, like, some crazy mixture of experts where-

  5. 57:59 – 1:09:18

    Learning from simulation

    1. SL

      and then at, at a lower frequency, I sort of gauge where I am in traffic.

    2. DP

      And then so you have a couple lectures from a few years back where you say like, even for robotics, RL is, in many cases, better than imitation learning. But f- so far, the models are exclusively doing imitation learning. So I'm curious how your, ho- how your thinking on this has changed, or maybe it's not changed, but then you, you need to do this for the RL. Like, why, w- why can't you do RL yet?

    3. SL

      Yeah. So the key here is prior knowledge.

    4. DP

      Yeah.

    5. SL

      Uh, so in order to effectively learn from your own experience, it turns out that it's really, really important to already know something about what you're doing. Otherwise, it takes far too long. Uh, it's just like it, it, it takes, uh, uh, a person, when they're a child, a very long time to learn very basic things.

    6. DP

      Yeah.

    7. SL

      To learn to write for the first time, for example. Once you already have some knowledge, then you can learn new things very quickly. So the purpose of training the models with supervised learning now is to build out that foundation that provides the prior knowledge so they can figure things out much more quickly later.

    8. DP

      Mm-hmm.

    9. SL

      And again, this is not a new idea. This is exactly what we've seen with, uh, LLMs, right? LLMs started off, uh, being trained purely with next token prediction, and that provided an excellent starting point, first for all sorts of synthetic data generation and then, uh, for RL.

    10. DP

      Hmm.

    11. SL

      So I, I, I think it, it makes total sense that we would expect basically any foundational effort to follow that same trajectory, where we first build out the foundation, essentially in like a somewhat brute force way.

    12. DP

      Right.

    13. SL

      And the stronger that foundation gets, the easier it is to then make it even better with much more accessible-

    14. DP

      Right.

    15. SL

      ... training.

    16. DP

      In, um, in 10 years, will the best model for knowledge work also be a robotics model or have like a r- action expert attached to it? And the reason I ask is like-

    17. SL

      Mm-hmm.

    18. DP

      ... so far, we've seen advantages from using more general models for things.

    19. SL

      Yeah.

    20. DP

      And will robotics fall into this bucket of we will just have the model which does everything, including physical work and knowledge work? Or do you think they'll continue to stay separate?

    21. SL

      I really hope that they will actually be the same. And, um, you know, obviously, I'm extremely biased. I, I love robotics. I think it's like, it's very fundamental to AI. But I think, optimistically, that it's actually the other way around, that the robotics, uh, element of the equation will make all the other stuff better. And there are two, uh, reasons for this that I could, that I, that I could tell you about. One has to do with representations and focus. So what I said before, with, uh, video prediction models, if you just want to predict everything that happens, it's very hard to figure out what's relevant.

    22. DP

      Yeah.

    23. SL

      If you have the focus that comes from actually trying to do a task, now that acts to structure how you see the world in a way that, uh, allows you to more fruitfully utilize the other signals. That could be extremely powerful.

    24. DP

      Yeah.

    25. SL

      The second one is that understanding the physical world at a, at a very deep, fundamental level, at a level that goes beyond just what we can articulate with language, can actually help you solve other problems.

    26. DP

      Mmm.

    27. SL

      Uh, and we, we see, we, we experience this all the time. Like, when we talk about abstract concepts, we say like, "This company has a lot of momentum."

    28. DP

      Yeah.

    29. SL

      (laughs) Uh, right? I, I, I, like you, w- we'll use like social metaphors to describe inanimate objects, like, "My computer hates me." (laughs) Right?

    30. DP

      Right.

  6. 1:09:18 – 1:18:01

    How much will robots speed up AI buildouts?

    1. SL

      not necessarily to do really good simulation. The key is to figure out how to answer counterfactuals.

Episode duration: 1:28:28
