Dwarkesh Podcast: Andrej Karpathy on Why Agents Take a Decade
Why pre-training and gradient descent produce ghosts rather than agents: Karpathy maps the biological gaps that make the decade of agents the honest frame.
- 0:00 – 30:33
AGI is still a decade away
- AKAndrej Karpathy
Reinforcement learning is terrible. (laughs)
- DPDwarkesh Patel
(laughs)
- AKAndrej Karpathy
It just so happens that everything that we had before it is much worse. (laughs)
- DPDwarkesh Patel
(laughs)
- AKAndrej Karpathy
I'm actually optimistic. I think this will work. I think it's tractable. I'm only sounding pessimistic because when I go on my Twitter timeline-
- DPDwarkesh Patel
(laughs)
- AKAndrej Karpathy
... I see all this stuff that makes no sense to me. A lot of it is, I think, honestly just, uh, fundraising. We're not actually building animals. We're building ghosts. These are like sort of ethereal spirit entities because they're fully digital and they're kind of like mimicking humans, and it's a different kind of intelligence. It's business as usual because we're in an intelligence explosion already and have been for decades. Everything is gradually being automated, has been for hundreds of years. Don't write blog posts, don't do slides, don't do any of that.
- DPDwarkesh Patel
(laughs)
- AKAndrej Karpathy
Like, build the code, arrange it, get it to work. It's the only way to go, otherwise you're missing knowledge. If you have a perfect AI tutor, maybe you can get extremely far. The geniuses of today are barely scratching the surface of what a human mind can do, I think.
- DPDwarkesh Patel
Today, I'm speaking with Andrej Karpathy. Andrej, why do you say that this will be the decade of agents and not the year of agents?
- AKAndrej Karpathy
Mm-hmm. Uh, well, first of all, thank you for having me here. I'm excited to be here. So the quote that you just mentioned, "It's the decade of agents," is actually a reaction to a preexisting quote. I'm not actually sure who said it, but some of the labs were alluding to this being the year of agents-
- DPDwarkesh Patel
Hmm.
- AKAndrej Karpathy
... uh, with respect to LLMs and, uh, how they were gonna evolve. And I think, um, I was triggered by that-
- DPDwarkesh Patel
(laughs)
- AKAndrej Karpathy
... because I feel like there's some over-predictions going on in the industry.
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
And, uh, in my mind, this is really a lot more accurately described as the decade of agents.
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
And we have some very early agents that are actually like extremely impressive and that I use daily. Uh, you know, Claude and Codex and so on. But I still feel like there's, uh, so much work to be done. And so I think my, like my reaction is like, we'll be working with these things for a decade. They're gonna get better, uh, and, uh, it's gonna be wonderful. But I think I was just reacting to the timelines, I suppose, of the, of the, uh, implication.
- DPDwarkesh Patel
And w- what do you think it will take a decade to accomplish?
- AKAndrej Karpathy
Yeah.
- DPDwarkesh Patel
What are the bottlenecks?
- AKAndrej Karpathy
Well, um, actually make it work.
- DPDwarkesh Patel
Mm-hmm.
- AKAndrej Karpathy
So in my mind, I mean, when you're talking about an agent, I guess, or what the labs have in mind and what maybe I have in mind as well, is it's, uh, you should think of it almost like an employee or like an intern that you would-
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
... hire to work with you. Uh, so for example, you work with some employees here.
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
Um, when would you prefer to have an agent like Claude or Codex, uh, do that work?
- DPDwarkesh Patel
Yeah.
- 30:33 – 40:53
LLM cognitive deficits
- DPDwarkesh Patel
You tweeted out that coding models were actually of very little help to you in assembling this repository, and I'm curious why that was.
- AKAndrej Karpathy
Yeah. Uh, so the repository, I guess I built it over a period of a bit more than a month, and I would say there's, like, three major classes of how people interact with code right now. Some people completely reject all of the LLMs, and they are just writing from scratch. I think this is probably not the right thing to do anymore. Um, the intermediate part, which is where I am, is you still write a lot of things from scratch, but you use the autocomplete that's basically available now from these models. So, when you start writing out a little piece of it, it will autocomplete for you and you can just tab through-
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
... and most of the time, it's correct. Sometimes it's not and you edit it. But you're still very much the, um, sort of architect of what you're writing. And then there's the, you know, vibe coding. Uh, you know, "Hi, please implement this or that. Uh, you know, enter." And then let the model do it.
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
And that's the agents. Um, I do feel like the agents work in very specific settings, and I would use them in specific settings. But again, these are all tools available to you, and you have to, like, learn what they, what they're good at-
- DPDwarkesh Patel
Right.
- AKAndrej Karpathy
... and what they're not good at, and when to use them. So, the agents are actually pretty good, for example, if you're doing boilerplate stuff.
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
Boilerplate code that's like just cop- you know, just copy-paste stuff.
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
They're very good at that. They're very good at stuff that occurs very often on the internet, um, because there's lots of examples of it in the training sets of these models. So there are features of tasks where the models will do very well. I would say NanoChat is not an example of this, because it's a fairly unique repository. There's not that much code, I think, in the way that I've structured it. And it's not boilerplate code. It's actually intellectually intense code, almost.
- DPDwarkesh Patel
Mm-hmm.
- AKAndrej Karpathy
And everything has to be very precisely arranged. And the models were always trying to... They kept trying to... I mean, they have so many cognitive deficits, right?
- DPDwarkesh Patel
Mm-hmm.
- AKAndrej Karpathy
So, one example, they keep trying to... They keep misunderstanding the code, um, because they, they have too much memory from all the typical ways of doing things-
- DPDwarkesh Patel
Mm-hmm.
- AKAndrej Karpathy
... on the internet that I just wasn't adopting. Uh, so the models, for example... I mean, I don't know if I wanna get into the full details, but they keep thinking I'm writing normal code, and I'm not. (laughs)
- DPDwarkesh Patel
May- maybe one example. I think it's-
- AKAndrej Karpathy
Maybe one example is-
- DPDwarkesh Patel
... quite interesting.
- AKAndrej Karpathy
Uh, so the way to synchronize... So, you have eight GPUs-
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
... that are all doing forward-backwards. The way to synchronize gradients between them is to use a distributed data parallel container of PyTorch, which automatically does all the... As you're doing the backward, it will start communicating-
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
... and synchronizing the gradients. I didn't use DDP because I didn't want to use it, because it's not necessary. So, I threw it out, and I basically wrote my own synchronization routine that's inside the step of the optimizer. And so the models were trying to get me to use the DDP container-
- DPDwarkesh Patel
(laughs) Yeah.
- AKAndrej Karpathy
... and they were very concerned about... Okay, this gets way too technical, but I wasn't using that container because I don't need it, and I have a custom implementation of-
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
... something like it.
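The alternative Karpathy describes can be sketched roughly as follows. This is a toy, plain-Python simulation, not his actual nanochat routine: in real PyTorch code, each parameter's gradient would be averaged across ranks with a `torch.distributed.all_reduce` on `param.grad` inside the optimizer step, instead of wrapping the model in `DistributedDataParallel`.

```python
def sync_gradients(grads_per_rank):
    """Toy stand-in for a manual gradient sync done inside the
    optimizer step. Each inner list holds one rank's gradients for
    the same parameters, computed on that rank's shard of the batch;
    summing and dividing by the number of ranks recovers the
    full-batch gradient, with no DDP wrapper involved.
    """
    world_size = len(grads_per_rank)
    n_params = len(grads_per_rank[0])
    return [
        sum(rank_grads[i] for rank_grads in grads_per_rank) / world_size
        for i in range(n_params)
    ]
```

For example, with two simulated ranks holding gradients `[2.0, 4.0]` and `[4.0, 8.0]`, the synced result is `[3.0, 6.0]`.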
- 40:53 – 50:26
RL is terrible
- DPDwarkesh Patel
Let's talk about RL a bit.
- AKAndrej Karpathy
Mm-hmm.
- DPDwarkesh Patel
Uh, you tweeted some very interesting things about this. Um, conceptually, how should we think about the way that humans are able to build a rich world model just from interacting with our environment, in ways that seem almost irrespective of the final reward at the end of the episode?
- AKAndrej Karpathy
Mm-hmm.
- DPDwarkesh Patel
If somebody starts a business, and at the end of 10 years she finds out whether the business succeeded or failed, we say that she's earned a bunch of wisdom and experience.
- AKAndrej Karpathy
Mm-hmm. Yeah.
- DPDwarkesh Patel
But it's not because, like, the log probs of every single thing that happened over the last 10 years-
- AKAndrej Karpathy
Yeah.
- DPDwarkesh Patel
... are upweighted or downweighted. Something much more deliberate and, uh, rich is happening.
- AKAndrej Karpathy
Yeah.
- DPDwarkesh Patel
How... What is the ML analogy a- and how does that compare to what we're doing with LLMs right now?
- AKAndrej Karpathy
Yeah, maybe the way I would put it is, uh, humans don't use reinforcement learning is maybe what I've- (laughs)
- DPDwarkesh Patel
Hmm.
- AKAndrej Karpathy
... as I've said it. I think they do something different. So reinforcement learning is a lot worse than I think the average person thinks. (laughs)
- DPDwarkesh Patel
(laughs) Reinforcement learning is terrible. (laughs)
- AKAndrej Karpathy
(laughs)
- DPDwarkesh Patel
(laughs)
- AKAndrej Karpathy
It just so happens that, uh, everything that we had before it is much worse. (laughs)
- DPDwarkesh Patel
(laughs)
- AKAndrej Karpathy
Uh, because previously we were just imitating people, so it has all these issues. Um, so in reinforcement learning, say you're solving a math problem, because it's a very simple setting. You're given a math problem and you're trying to find the solution. Now in reinforcement learning, you will try lots of things in parallel first. So you're given a problem, and you try hundreds of different attempts. And these attempts can be complex, right? They can be like, "Oh, let me try this. Let me try that. This didn't work. That didn't work," et cetera. And then maybe you get an answer. And now you check the back of the book and you see, okay, the correct answer is this. And then you can see that, okay, this one, this one, and that one got the correct answer, but the other 97 didn't. So literally what reinforcement learning does is it goes to the ones that worked really well, and every single thing you did along the way-
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
... every single token gets upweighted: do more of this. The problem with that is, I mean, people will say that your estimator has high variance, but... I mean, it's just noisy. (laughs) Uh, so basically, it almost assumes that every single little piece of the solution that arrived at the right answer was the correct thing to do, which is not true. You may have gone down the wrong alleys until you arrived at the right solution. Every single one of those incorrect things you did, as long as you got to the correct solution, will be upweighted as "do more of this." It's terrible.
- DPDwarkesh Patel
... yeah.
- AKAndrej Karpathy
It's noise. You've done all this work, and only at the end do you get a single number of like, oh, you did it correctly. And based on that, you weigh that entire trajectory as upweight or downweight. And so the way I like to put it is, you're sucking supervision through a straw. Because you've done all this work, which could be minutes of rollout, and you're sucking the bits of supervision of the final reward signal-
- DPDwarkesh Patel
(laughs)
- AKAndrej Karpathy
... through a straw, and (laughs) you're basically broadcasting that across the entire trajectory, and using that to upweight or downweight the whole trajectory. It's just stupid and crazy.
- DPDwarkesh Patel
Uh-
- AKAndrej Karpathy
A human would never do this. Number one, a human would never do hundreds of rollouts, right?
- DPDwarkesh Patel
Right.
- AKAndrej Karpathy
Uh, number two, when a person finds a solution, they will have a pretty complicated process of review: "Okay, I think I did these parts well. These parts I did not do that well. I should probably do this or that." And they think through things. There's nothing in current LLMs that does this. There's no equivalent of it. Um, but I do see papers popping up that are trying to do this, because it's obvious to everyone in the field.
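The "supervision through a straw" update Karpathy is criticizing can be made concrete as a minimal REINFORCE-style loss. This is a plain-Python toy with a 0/1 terminal reward; the names and shapes are illustrative, not any lab's actual training code:

```python
def reinforce_loss(token_logps, rewards):
    """token_logps: one list per rollout, holding the log-probs of the
    tokens that were actually sampled. rewards: one scalar per rollout,
    e.g. 1.0 if the final answer matched the back of the book, else 0.0.
    The single end-of-episode reward is broadcast onto every token of
    its trajectory -- wrong alleys included -- so minutes of rollout
    are credited or blamed from one terminal bit of signal.
    """
    total, n_tokens = 0.0, 0
    for logps, r in zip(token_logps, rewards):
        for lp in logps:
            total += r * lp  # same scalar reward applied to every token
            n_tokens += 1
    # minimizing this upweights every token of rewarded rollouts
    return -total / n_tokens
```

With two rollouts of two tokens each (log-probs of -1.0 and -2.0) and rewards `[1.0, 0.0]`, only the rewarded rollout contributes, giving a loss of 0.5; nothing distinguishes its good steps from its wrong turns.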
- 50:26 – 1:07:13
How do humans learn?
- DPDwarkesh Patel
So I guess, like... it's not easy, but I can conceptualize how you would be able to train on synthetic examples.
- AKAndrej Karpathy
Mm-hmm.
- DPDwarkesh Patel
Or synthetic problems that you have made for yourself.
- AKAndrej Karpathy
Mm-hmm.
- DPDwarkesh Patel
But there seems to be another thing humans do, maybe sleep is this, maybe daydreaming is this.
- AKAndrej Karpathy
Mm-hmm.
- DPDwarkesh Patel
Which is not necessarily come up with fake problems, but just, like, reflect.
- AKAndrej Karpathy
Yeah.
- DPDwarkesh Patel
And I'm not sure what the ML analogy is for, you know, daydreaming or sleeping.
- AKAndrej Karpathy
Mm-hmm. Yeah.
- DPDwarkesh Patel
But just, like, just reflecting, without having come up with any problem.
- AKAndrej Karpathy
Yeah, yeah.
- DPDwarkesh Patel
I mean, obviously, the very basic analogy would just be, like, fine-tuning on reflection bits. But I feel like in practice, that probably wouldn't work that well.
- AKAndrej Karpathy
Yeah.
- DPDwarkesh Patel
So I don't know if you have some take on what, what the analogy of, like, th- this thing is.
- AKAndrej Karpathy
Yeah, I do think that, that we're missing some aspects there. So as an example, uh, when you're reading a book-
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
... um, I almost feel like currently, when LLMs are reading a book, uh, what that means is we stretch out the sequence of text and the model is predicting the next token.
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
And it's getting some knowledge from that. Uh, that's not really what humans do, right?
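What "an LLM reading a book" reduces to can be shown with a toy sketch: the text becomes one long sequence, and every position yields a training example of (context so far, next token). This is an illustration of the objective, not any particular training pipeline:

```python
def next_token_pairs(tokens):
    """Lay out a text as next-token-prediction training examples:
    each pair is (all tokens seen so far, the token to predict).
    This is the entirety of what 'reading' means under pre-training.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
```

For example, `next_token_pairs(["the", "cat", "sat"])` yields `[(["the"], "cat"), (["the", "cat"], "sat")]`.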
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
So when you're reading a book, I almost don't even feel like the book is like exposition I'm supposed to be attending to and training on.
- DPDwarkesh Patel
Mm-hmm.
- AKAndrej Karpathy
The book is a, is a set of prompts for me to do-
- DPDwarkesh Patel
Mm-hmm.
- AKAndrej Karpathy
... synthetic data generation, or for you to get into a book club and talk about it with your friends.
- DPDwarkesh Patel
Yeah.
- AKAndrej Karpathy
And it's by manipulating that information that you actually gain that knowledge.
- DPDwarkesh Patel
Yeah.
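The "book as a set of prompts" idea above can be sketched as code. This is a hypothetical illustration, not a proposed system; the `ask_model` argument is a stand-in for a real LLM call:

```python
def study_passage(passage, ask_model):
    """Treat a book passage as prompts to manipulate, rather than as
    tokens to imitate: restate it, relate it to prior knowledge, and
    question it, the way a reader or a book club would. The resulting
    generations, not the raw text, are the material you'd train on.
    """
    prompts = [
        f"Restate this in your own words: {passage}",
        f"How does this relate to what you already know? {passage}",
        f"Pose a question this passage answers, then answer it: {passage}",
    ]
    return [ask_model(p) for p in prompts]
```

Any callable works as the model here; for instance, `study_passage(text, lambda p: p.upper())` just echoes the three prompts back uppercased, while a real LLM call would return the synthetic reflections.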
Episode duration: 2:26:07
Transcript of episode lXUZvyajciY