Dwarkesh Podcast

Jeff Dean & Noam Shazeer — 25 years at Google: from PageRank to AGI

This week I welcome two of the most important technologists in any field. Jeff Dean is Google's Chief Scientist, and through 25 years at the company, has worked on basically the most transformative systems in modern computing: from MapReduce, BigTable, and TensorFlow to AlphaChip and Gemini. Noam Shazeer invented or co-invented all the main architectures and techniques that are used for modern LLMs: from the Transformer itself, to Mixture of Experts, to Mesh TensorFlow, to Gemini and many other things. We talk about their 25 years at Google, going from PageRank to MapReduce to the Transformer to MoEs to AlphaChip – and soon to ASI.

EPISODE LINKS
* Transcript: https://www.dwarkesh.com/p/jeff-dean-and-noam-shazeer
* Apple Podcasts: https://podcasts.apple.com/us/podcast/jeff-dean-noam-shazeer-25-years-at-google-from-pagerank/id1516093381?i=1000691556147
* Spotify: https://open.spotify.com/episode/4atx1POpKIL8WGvdVfdnbb?si=DLn5uQYMQMWKPTTkj5pt_A

SPONSORS
* Meter wants to radically improve the digital world we take for granted. They're developing a foundation model that automates network management end-to-end. To do this, they just announced a long-term partnership with Microsoft for tens of thousands of GPUs, and they're recruiting a world-class AI research team. To learn more, go to https://meter.com/dwarkesh
* Scale partners with major AI labs like Meta, Google DeepMind, and OpenAI. Through Scale's Data Foundry, labs get access to high-quality data to fuel post-training, including advanced reasoning capabilities. If you're an AI researcher or engineer, learn about how Scale's Data Foundry and research lab, SEAL, can help you go beyond the current frontier at https://scale.com/dwarkesh
* Curious how Jane Street teaches their new traders? They use Figgie, a rapid-fire card game that simulates the most exciting parts of markets and trading. It's become so popular that Jane Street hosts an inter-office Figgie championship every year. Download from the app store or play on your desktop at https://www.figgie.com/

To sponsor a future episode, visit https://www.dwarkesh.com/p/advertise

TIMESTAMPS
00:00:00 - Intro
00:03:29 - Joining Google in 1999
00:06:20 - Future of Moore's Law
00:11:04 - Future TPUs
00:13:56 - Jeff's undergrad thesis: parallel backprop
00:15:54 - LLMs in 2007
00:25:09 - "Holy shit" moments
00:27:28 - AI fulfills Google's original mission
00:32:00 - Doing Search in-context
00:36:12 - The internal coding model
00:37:29 - What will 2027 models do?
00:43:20 - A new architecture every day?
00:49:10 - Automated chips and intelligence explosion
00:53:07 - Future of inference scaling
01:02:38 - Already doing multi-datacenter runs
01:08:15 - Debugging at scale
01:12:41 - Fast takeoff and superalignment
01:20:51 - A million evil Jeff Deans
01:24:22 - Fun times at Google
01:27:51 - World compute demand in 2030
01:34:37 - Getting back to modularity
01:44:48 - Keeping a giga-MoE in-memory
01:49:35 - All of Google in one model
01:57:59 - What's missing from distillation
02:03:10 - Open research, pros and cons
02:09:58 - Going the distance

Noam Shazeer (guest) · Jeff Dean (guest) · Dwarkesh Patel (host)
Feb 12, 2025 · 2h 15m

EVERY SPOKEN WORD

  1. 0:00–3:29

    Intro

    1. NS

      Organizing information is clearly like a trillion-dollar opportunity, but a trillion dollars is not cool anymore. What's cool is a quadrillion dollars. (laughing)

    2. JD

      (laughing)

    3. NS

      The world GDP is almost certainly going to go way, way up to, like, orders of magnitude higher than it is today-

    4. JD

      (laughs)

    5. NS

      ... due to the fact that we have all of these artificial engineers.

    6. JD

      25% of the characters that we're checking into our code base these days are generated by our AI-based coding models.

    7. NS

      We're going to need, like, a million automated researchers to invent all of this stuff. (laughs)

    8. JD

      (laughs) Yeah.

    9. DP

      (laughs) If this is where things go, this is actually like getting, like, Noam on a podcast in 2018 and being, like, "Yeah, so I think like, you know, language models will be a thing."

    10. NS

      I'm guessing that the amount of compute being used for AI to help each person will be astronomical.

    11. JD

      (laughs)

    12. DP

      (laughs) Today, I have the honor of chatting with Jeff Dean and Noam Shazeer. Jeff is Google's chief scientist, and through his 25 years at the company, he has worked on basically the most transformative systems in modern computing, from MapReduce, BigTable, TensorFlow, AlphaChip. Genuinely, the list doesn't end. Uh, Gemini now. And Noam is the per- single person most responsible for the current AI revolution. He has been the inventor or the co-inventor of all the main, uh, ar- architectures and techniques that are used for modern LLMs, from the transformer itself, to mixture of experts, uh, to mesh-TensorFlow, to many other things. Um, and they are two of the three co-leads of Gemini at Google DeepMind. Awesome. Thanks so much for coming on.

    13. JD

      Thanks for having us.

    14. DP

      (laughs)

    15. JD

      (laughs) Thank you. Super excited to be here.

    17. DP

      Okay, first question. Uh, both of you have been at Google for 25 or close to 25 years. At some point early on in the company, you probably understood how everything worked.

    18. JD

      Mm-hmm.

    19. DP

      When did that stop being the case? Do you feel like there was a clear moment that happened?

    20. NS

      I mean, I- I know I joined and, like, at that poi- this was, like, end of 2000, and, uh, they had this thing, everybody gets a mentor. And, you know, so, you know, I knew nothing. I would just ask my mentor everything, and my mentor knew everything. It turned out my mentor was Jeff. (laughs)

    21. DP

      (laughs)

    22. JD

      (laughs)

    24. NS

      And it was not the case that everyone at Google knew everything.

    25. JD

      (laughs)

    26. NS

      It was just the case that Jeff knew everything (laughs) 'cause he, 'cause he had basically written everything.

    27. JD

      (laughs) You're- you're very kind. I mean, I think, uh, as companies grow, you- you kind of go through these phases. Like, when I joined, you know, we were 25 people, 26 people, something like that. And so you eventually learned everyone's name, and even though we were growing, you kept track of all the people who were joining. Uh, at some point, then you kind of lose track of everyone's name in the company, but you still know everyone working on, you know, software engineering things. Uh, then you sort of lose track of, you know, all the names of people in the software engineering group, but, you know, you at least know all the different projects th- uh, that everyone's working on. And then, at some point, the company gets big enough that, you know, you get an email that Project Platypus-

    28. DP

      (laughs)

    29. NS

      (laughs)

    30. JD

      (laughs)

  2. 3:29–6:20

    Joining Google in 1999

    2. DP

      Um, ho- how did Google recruit you, by the way?

    3. JD

      Um, I kind of reached out to them, actually.

    4. DP

      And- and Noam, how did you get recruited? Wha- what was it that you did there?

    5. NS

      Yeah. I mean, I, um, I actually saw i- uh, Google at a job fair in, like, 1999, and I assumed that it was, like, already this huge company, so there was no point in joining-

    6. DP

      (laughs)

    7. NS

      ... um, because everyone I knew used Google. I guess that was because I was a grad student at Berkeley at the time.

    8. DP

      (laughs)

    9. JD

      (laughs)

    10. NS

      I- I guess I've dropped out of grad programs a few times, but, um, but, you know, it turns out that, like, actually it wasn't really that, uh, that large (laughs) . So-

    11. DP

      (laughs)

    12. NS

      ... so it tur- turned, turns out I did not apply in 1999, but, like, just kind of sent them a resume on a whim in 2000 'cause I figured I should let, it was, like, my favorite search engine and figured I should apply to multiple places for a job. Um, but then, yeah, turned out to be, uh, be, uh, really, uh, really fun. Looked like a bunch of smart people, uh, doing good stuff, and they had this really nice crayon chart on the wall of the daily, uh, number of search queries-

    13. DP

      I love that chart. Uh-huh.

    14. NS

      ... uh, that, you know, somebody had just been maintaining, uh. And, um, yeah, it looked very exponential. (laughs)

    15. DP

      (laughs)

    16. JD

      (laughs)

    17. NS

      I was like, "These guys are going to be very successful." And it looks like they have a lot of good problems to work on. So I was like, "Okay, maybe I'll, uh, yeah, go work there for a little while and then have enough money to just go work on AI for as long as I want after that." (laughs)

    18. DP

      Yeah, yeah. In a way, you did that, right?

    19. JD

      (laughs)

    20. DP

      (laughs)

    21. NS

      Yeah, yeah. It, uh, yeah, it- it totally worked out exactly according to plan.

    22. DP

      Right. So you were thinking about AI in 1999?

    23. NS

      Um, yeah, this was like 2000. Yeah, I- I- I remember in, um, yeah, in grad school, uh, a- a friend of mine at the time had, um, told me that his, um, uh, New Year's resolution for 2000 was to live to see the year 3000 and that he was going to achieve this by inventing AI.

    24. DP

      Uh-huh.

    25. NS

      Um, so I was like, "Huh, that's- that sounds like a good idea." Um, but, you know, then, you know, I- I- I didn't get the idea at the time that, oh, like, you could go do it at- at a big company, but, you know, I figured, hey, you know, a bunch of people seem to be making a ton of money at startups. Maybe I'll just make some money and then I'll have, uh, you know, enough to live on, just work on AI research, uh, for- for a long time.

    26. DP

      Yeah.

    27. NS

      Um, but yeah, it actually turned out that Google was a terrific place to work in AI.

    28. JD

      I mean, one of the things I like about Google is our ambition has always been sort of something that would kind of require pretty advanced AI.

    29. DP

      Mm-hmm.

    30. JD

      You know, organizing the world's information and making it universally accessible and useful. Like, actually, there's ton- there's a really broad mandate in there, uh, so it's not like the company was gonna do this one little thing and stay in- stay doing that. And also, you could see that what we were doing initially was...... in that direction, but you could do so much more in that direction.

  3. 6:20–11:04

    Future of Moore's Law

    1. DP

      Has Moore's Law over the last two, three decades changed the kinds of considerations you have to take on board when you design new systems, when you figure out what projects are feasible? What has stayed, you know, like, what, what are, what are still the limitations? What are things you can now do that you obviously couldn't do before?

    2. JD

      I mean, I, I, I think of it as actually changing quite a bit in the last couple decades.

    3. DP

      Mm-hmm.

    4. JD

      So like the two decades ago to one decade ago, it was awesome 'cause you'd just, like, wait and, like, 18 months later, you'd get much faster hardware and you don't have to do anything (laughs) . Um, and then more recently, you know, I feel like the general purpose CPU-based s- uh, machines, scaling has not been as good. Like, the fabrication processes improvements are now taking three years instead of every two years. The, the architectural improvements in, you know, multi-core processors and so on are, you know, not giving you the same boost that you, that we were getting, you know, the pre-, you know, 20 to 10 years ago. Um, but I think at the same time you were seeing, uh, much more specialized computational devices, uh, like machine learning accelerators, TPUs, um, very ML-focused GPUs more recently, um, are making it so that we can actually get, you know, really high performance a- and good efficiency out of the more modern kinds of computations we wanna run that are, that are different than, you know, a twisty pile of C++ code trying to run Microsoft Office or something.

    5. DP

      Yeah, yeah. (laughs)

    6. JD

      (laughs)

    7. NS

      I mean, it feels like the, the, um, the algorithms are following the hardware. Basically, like, what's happened is that at this point arithmetic is very, very cheap and moving data around is comparatively, like, much more expensive.

    8. DP

      Right.

    9. NS

      So pretty much all of deep learning has taken off roughly because of that, because it, you can build it-

    10. DP

      Interesting.

    11. NS

      ... out of matrix multiplications that are, you know, n, n cubed operations and n squared bytes of, uh, data communication basically.
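
Noam's point about matrix multiplications — roughly n-cubed arithmetic against n-squared data movement — can be sketched directly. A minimal illustration (the function name and the fp32 assumption are mine, not anything from the episode):

```python
def matmul_intensity(n: int) -> float:
    """Arithmetic intensity (flops per byte) of an n x n matrix multiply.

    Multiplying two n x n matrices takes about 2*n**3 flops (one multiply
    and one add per inner-product term) while moving about 3*n**2 values
    (two input matrices plus the output). Assuming 4-byte floats, intensity
    grows linearly with n, so big matmuls keep cheap arithmetic units busy
    instead of waiting on comparatively expensive data movement.
    """
    flops = 2 * n ** 3
    bytes_moved = 3 * n ** 2 * 4  # two operands + one result, fp32
    return flops / bytes_moved

# Quadrupling n quadruples the flops available per byte moved.
assert matmul_intensity(2048) == 4 * matmul_intensity(512)
```

In practice, tiling and caches change the constants, but the linear-in-n scaling is the reason deep learning "follows the hardware" here.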

    12. JD

      Well, I would say that the, the pivot to hardware oriented around that was an important transition. 'Cause before that we had CPUs and-

    13. NS

      Right, yeah.

    14. JD

      ... GPUs that were not, you know, especially well-suited for deep learning. And then, um, you know, we started to build, say, TPUs at Google, uh, that were really just reduced precision linear algebra machines.

    15. DP

      Yeah.

    16. JD

      And then once you have that, then you wanna-

    17. NS

      R- right.

    18. JD

      ... you know, exploit it.

    19. NS

      You have to see the insight that seems like it's all about, uh, all about kind of identifying opportunity cost. Like, okay, thi- this is something like Larry Page, I think used to always say, like, uh, "Our second biggest cost is taxes, and our biggest cost is opportunity cost."

    20. DP

      (laughs)

    21. JD

      (laughs)

    22. NS

      And if he didn't say that, then I've been misquoting him for years, but-

    23. DP

      (laughs)

    24. NS

      ... uh, but, but, uh, but basically it's like, you know, what, what, what is the opportunity that you have that-

    25. DP

      Right.

    26. NS

      ... that you're missing out on? Um, and like in this case, I guess it was that, okay, you've got all of this chip area and you're putting a very small number of, uh, of arithmetic units on it. Like, fill the thing up with arithmetic units, you could have orders of magnitude more arithmetic getting done. Now what else has to change? Okay, the algorithms and the data flow and everything else-

    27. JD

      And oh, by the way-

    28. NS

      ... about the chip.

    29. JD

      ... the arithmetic can be, like, really low precision, so then you can squeeze even more multiplier units in.

    30. DP

      Um, Noam, I wanna follow up on what you said, that the algorithms have been following the hardware. If you imagine a counterfactual world where suppose that the cost of memory had declined more than arithmetic, or, uh, just like in- invert the dynamic you saw with the last decade.

  4. 11:04–13:56

    Future TPUs

    2. DP

      Wh- wh- what are the trade-offs that you're considering changing for future, uh, versions of TPU to s- to integrate how, how are you thinking about algorithms differently?

    3. JD

      Um, I mean, I think one thing, one general trend is we're getting better at quantizing or having much re- more reduced precision models.

    4. DP

      Yeah.

    5. JD

      Uh, you know, we started with TPU v1. We weren't even quite sure we could quantize a model for serving with eight-bit integers. But we sort of had some early evidence that seemed like it might be possible, so we're like, "Great, let's build the whole chip around that." (laughs)

    6. DP

      (laughs)

    7. JD

      And then over time, I think you've seen people, uh, able to use much lower precision for training as well, but also f- the inference precision has, you know, gone... people are now using INT4 or FP4, which sounded like if you said to someone, like, "We're gonna use FP4,"-

    8. DP

      (laughs)

    9. JD

      ... to s- like a super computing floating point person 20 years ago would be like, "What? That's crazy. We like 64 bits in our floats."

    10. DP

      (laughs)

    11. JD

      Um, or, you know, even below that. You know, some people are quantizing models to two bits or one bit, uh, and I think that's a, that's a trend to definitely pay attention to.

    12. DP

      One bit, just like a zero or one? (laughs)

    13. JD

      Yeah, just a zero or one. Um, and then you have, like, a, a sign bit for a group of bits or something.
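
The group-wise scheme Jeff gestures at — small signed integers plus one shared scale per group — can be sketched in a few lines. This is a hedged toy, not Google's actual TPU quantization; the names and the round-to-nearest choice are mine:

```python
def quantize_group(values, bits=4):
    """Symmetric round-to-nearest quantization: one shared fp scale per
    group, each value stored as a signed integer in [-qmax, qmax].
    At bits=1, qmax is 1 and each value collapses to roughly its sign."""
    qmax = 2 ** (bits - 1) - 1 if bits > 1 else 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # 1.0 if all zeros
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize_group(q, scale):
    return [qi * scale for qi in q]

weights = [0.31, -0.82, 0.05, 0.44]
q, scale = quantize_group(weights, bits=4)
approx = dequantize_group(q, scale)
# Round-to-nearest keeps each value within half a quantization step.
assert all(abs(w - a) <= scale / 2 for w, a in zip(weights, approx))
```

The co-design point from the conversation shows up in the error bound: halving `bits` roughly doubles `scale`, trading reconstruction error for throughput.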

    14. DP

      Ah.

    15. NS

      (laughs)

    16. DP

      (laughs)

    17. NS

      Really has to be a co-design thing because, you know, if-... you know, if, if, if the, um... You know, if the algorithm designer doesn't realize that he can get greatly improved, uh, performance, you know, uh, throughput with the lower precision, of course the algorithm designer's going to say, "Of course I don't want low precision." (laughs) Like, that-

    18. JD

      Right.

    19. NS

      ... that, that introduces risk, and then, uh, you know, it adds irritation. And then the, uh... Then if you ask the chip designer, "Uh, okay. Uh, you know, what, what are, what do you wanna build?" And then th- they'll ask the person who's w- who's writing the algorithms today, who's gonna say, "No, I don't like, uh, I don't like quantization. It's irritating."

    20. JD

      (laughs)

    21. NS

      So, y- you actually, you know, need to basically see the whole picture and-

    22. JD

      Right.

    23. NS

      ... figure out, "Oh, wait a minute. We can, you know, we can incr- increase our, uh, throughput to cost ratio by a lot by, uh, you know-

    24. JD

      Right.

    25. NS

      ... by quantizing."

    26. JD

      Then you're like, "Yes, quantization is irritating." (laughs) But your model's gonna be three times faster, so you're gonna have to deal. (laughs)

    27. NS

      (laughs)

    28. DP

      Uh, through your careers, at various times, you've had sort of an uncanny... You worked on things that have an uncanny resemblance to what is actually, what we're actually using now for generative AI. In 1990, Jeff, your senior thesis was about, uh, uh, back propagation. And in 2007... So, th- this is, this is the thing I didn't realize until I was prepping for this episode. In 2007, you guys trained a two trillion token n-gram model for language modeling. Um, I, I... Just walk me through, when you were developing that model, what, what... Like, was this kind of thing in your head? What, what did you think you guys were doing at the

  5. 13:56–15:54

    Jeff’s undergrad thesis: parallel backprop

    1. JD

      time? Yeah. So, um, I mean, let, let me start with the undergrad thesis. Yes. So I kind of got introduced to neural nets in one section of one class on parallel computing that I was taking in my senior year, um, and I needed to do a thesis to graduate, like an honors thesis. And so I approached the professor and I said, "Oh, it'd be really fun to, like, do something around neural nets." So he, he and I decided we would, I would, uh, sort of implement a couple different ways of parallelizing, uh, back propagation training for neural nets in, in 1990, and I called them something funny in my thesis, like, like, uh, pattern partitioning or something. But really, I implemented a, you know, model, model parallelism and data parallelism-

    2. NS

      Mm-hmm.

    3. JD

      ... on a 32-processor hypercube machine. Um, you know, in one, you split all the examples into different batches, and every model, every CPU has a copy of the, the model, and in the other one, you kind of pipeline a bunch of examples, uh, along to processors that have different, um, different, uh, parts of the model. And, you know, com- I compared and contrasted them. And, uh, it was interesting, you know. I was really excited about the abstraction, 'cause it felt like neural nets were the right abstraction that could solve tiny toy problems that no other approach could solve at the time. Um, but... And I thought, "Oh," you know, naive me, "Oh, 32 processors, we'll be able to train, like, really awesome neural nets." Uh, but it turned out, you know, we needed about a million times more compute before they-
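
The two parallelization strategies Jeff describes reduce to something like this hedged toy (all names are mine; a real system runs the workers concurrently and overlaps communication):

```python
def data_parallel_gradients(examples, grad_fn, n_workers):
    """Data parallelism: every worker holds a full copy of the model,
    computes gradients on its shard of the batch, and the per-worker
    gradients are averaged (sequentially here; concurrently for real)."""
    shards = [examples[i::n_workers] for i in range(n_workers)]
    return sum(grad_fn(s) for s in shards) / n_workers

def model_parallel_forward(x, stages):
    """Model parallelism: each worker owns different layers ("stages");
    activations flow stage to stage, so a stream of examples can be
    pipelined through the stages at once."""
    for stage in stages:
        x = stage(x)
    return x

# Toy stand-ins: the "gradient" is just the shard mean, and each stage
# is a simple function of the activation.
mean = lambda s: sum(s) / len(s)
assert data_parallel_gradients([1.0, 2.0, 3.0, 4.0], mean, n_workers=2) == 2.5
assert model_parallel_forward(1.0, [lambda x: x + 1, lambda x: x * 2]) == 4.0
```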

    4. NS

      (laughs)

    5. JD

      ... really started to work for real problems. But then starting, you know, in the, you know, late 2008, 2009, 2010 timeframe, we started to have enough compute, uh, thr- thanks to Moore's law, uh, to actually make neural nets work for real things. And that was kind of when I sort of reentered, uh, looking at neural nets. But prior to that, in 2007-

    6. NS

      Sorry, can I ask about this for just-

    7. JD

      Oh, yeah. Sure, yeah.

    8. NS

      ... um, uh, um... First of all, uh, unlike other artifacts of academia-

    9. JD

      Yeah.

    10. NS

      ... it's actually, like, a really... Like, it's, like, four pages, and you can just, like, read it. And, um, and it's-

    11. JD

      Yeah. It was four pages and then, like, 30 pages of C code. (laughs)

    12. NS

      Sure. (laughs) But it's,

  6. 15:54–25:09

    LLMs in 2007

    1. NS

      like, just, like, a well-produced sort of artifact. Um, and then, yeah, tell me about how the 2007 paper came together.

    2. JD

      Oh, yeah. So that, we had a, we had a machine translation research team at Google, um, uh, led by Franz Och, uh, who had joined Google maybe a year before, um, and a bunch of other people. And every year, they competed in a, uh, uh, uh, I guess it's a DARPA contest on translating a couple of different languages to English, I think. Chinese to English and Arabic to English, I think. Um, and, uh, the Google team had submitted an entry. And the way this works is you get, like, I don't know, 500 sentences on Monday, and you have to submit the answer on Friday. Um, and so the... I saw the results of this, and we'd won the contest, um, and, uh, by a pretty substantial margin measured in BLEU score, which is like a measure of translation quality. And so I reached out to Franz, uh, the, the head of this winning team. I'm like, "This is great. When are we gonna launch it?" And he's like, "Oh, well, we can't launch this. It's not really very practical, 'cause it takes 12 hours to translate a sentence."

    3. NS

      (laughs)

    4. JD

      (laughs) I'm like, "Well, well, that seems like a long time." Um, "How could we fix that?" So it turned out, you know, they'd not really designed it for high throughput, obviously. (laughs)

    5. NS

      (laughs) Sure.

    6. JD

      And so it was doing like 100,000 disk seeks to... In a, in a large language model that they, they, uh, tra- uh, sort of computed statistics over. I wouldn't say trained, really. Um, and, you know, for each word that it wanted to translate. So, like, obviously doing 100,000 disk seeks is not-

    7. NS

      Yeah. (laughs)

    8. JD

      ... super speedy. But I said, "Okay, well, let's, let's dive into this." And so I spent about two or three months with them, uh, designing an in-memory compressed representation of n-gram, uh, data. And, and we were using n... An n-gram is basically statistics for how often every N-word sequence occurs in a large corpus. So you basically have... You know, in this case, we had like two trillion words, and most n-gram models of the day were, like, using two grams or maybe three grams. Uh, but we decided we would use five grams.

    9. NS

      Mm-hmm.

    10. JD

      So how often every five-word sequence-

    11. NS

      Right.

    12. JD

      ... occurs in basically as much of the web as we could, uh, process that, that... In that day. Um, and then you have a data structure that says, "Okay, you know, 'I really like this restaurant,'" uh, occurs, you know, 17 times in the web or something. Um, and-... and so, uh, I built, like, a data structure that would let you, uh, store all those in memory on 200 machines, and then have sort of a batched API where you could say, "Here are the 100,000 (laughs) things I need to look up in this round for this word." And it would give you them all back in pa- in parallel. Um, and that enabled us to go from taking a night to translate a sentence to basically doing something in, you know, 100 milliseconds or something. (laughs)
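
The scheme Jeff describes — count every five-word sequence, then serve lookups in batches rather than one disk seek at a time — boils down to something like this toy (single-machine and in-memory; the sharding across 200 machines is elided):

```python
from collections import Counter

def count_ngrams(tokens, n=5):
    """Count how often every n-word sequence occurs in a corpus."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def batched_lookup(counts, queries):
    """Answer many n-gram queries in one call: a batched API amortizes
    one round trip over all queries instead of paying a seek per query."""
    return [counts.get(tuple(q), 0) for q in queries]

corpus = "i really like this restaurant and i really like this bakery".split()
counts = count_ngrams(corpus, n=5)
hits = batched_lookup(counts, [
    "i really like this restaurant".split(),
    "i really like this bakery".split(),
])
assert hits == [1, 1]
```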

    13. DP

      There's, um, there's this- there's this, uh, list of, uh, Jeff Dean facts, like Chuck Norris facts.

    14. JD

      Facts. (laughs)

    15. NS

      (laughs)

    16. DP

      Um, like, for example, that, uh, for Jeff Dean, NP equals no problemo.

    17. NS

      (laughs)

    18. JD

      (laughs)

    19. DP

      Um, and one of them-

    20. JD

      No.

    21. DP

      ... uh, it- it's funny 'cause, uh, now that I hear you say it, it's like, actually, I think it's kind of true.

    22. JD

      (laughs)

    23. DP

      Um, one of them is, um, the speed of light was 35 miles an hour-

    24. JD

      (laughs)

    25. DP

      ... until Jeff Dean decided to optimize it over a weekend. (laughs)

    26. JD

      (laughs)

    27. NS

      (laughs)

    28. DP

      Just going from 12 hours to, uh, like, 100 milliseconds or whatever is like-

    29. JD

      Yeah.

    30. DP

      (laughs) I gotta do the orders of magnitude there, but- (laughs)
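
The orders of magnitude Dwarkesh alludes to are easy to pin down: 12 hours per sentence down to roughly 100 ms is about a 432,000x speedup, a bit over five and a half orders of magnitude.

```python
import math

before_ms = 12 * 60 * 60 * 1000  # 12 hours per sentence, in milliseconds
after_ms = 100                   # ~100 ms per sentence
speedup = before_ms / after_ms
assert speedup == 432_000
assert 5.6 < math.log10(speedup) < 5.7  # ~5.6 orders of magnitude
```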

  7. 25:09–27:28

    “Holy shit” moments

    1. DP

      Were there, uh, are there key moments that stand out to you where you, looking at a, a research area and you come up with this idea, and you have this feeling of like, "Holy shit, I can't believe that worked"? Um. (laughs)

    2. JD

      One, one thing I remember was, you know, we'd been f- in the early days of the Brain Team, we were focused on, "Let's see if we can build some infrastructure that lets us train really, really big neural nets."

    3. DP

      Yeah.

    4. JD

      And at that time, we didn't have GPUs in our data centers. We just had CPUs. But, you know, we know how to make lots of CPUs work together. (laughs) So, we built a system that enabled us to train, you know, uh, pretty large neural nets through both model and data parallelism. So we had a, a system for unsupervised learning on, uh, uh, actually 10 million randomly selected, uh, YouTube frames. Uh, and it was kind of a, you know, a, a, a spatially local representation, so it would build up unsupervised representations, uh, based on trying to reconstruct the, the thing from the high-level representations. And, um, so we got that working and training on 2,000 computers using 16,000 cores. Uh, and, um, you know, after a little while, that model was actually able to build a representation at the highest level where one neuron would get excited by, uh, you know, uh, images of cats that-

    5. DP

      Mm-hmm.

    6. JD

      ... you know, it had never been told what a cat was-

    7. DP

      Right.

    8. JD

      ... but it sort of had seen enough, uh, examples of them in the training data of head-on facial views of cats-

    9. DP

      Yeah.

    10. JD

      ... that that neuron would turn on for that and not for much else.

    11. DP

      Yeah.

    12. JD

      And similarly, you'd have other ones for human faces, and, you know, uh, backs of pedestrians and this kind of thing. Um, and so that was kinda cool 'cause it's sort of from unsupervised learning principles building up these really high-level representations. And then we were able to get, you know, very good results on the supervised ImageNet 20,000 category, uh, challenge that, like, advanced the state of the art by, like, 60% relative improvement, which was quite good at the time. So, that to m- and that neural net was probably 50X bigger than one that had been trained, uh, previously.

    13. DP

      Mm.

    14. JD

      Um, and it got good results. So that sort of said to me, "Hey, actually scaling up neural nets seems like a (laughs) ... I thought it would be a good idea, and it seems to be, so we should keep pushing on that."

  8. 27:28–32:00

    AI fulfills Google’s original mission

    1. DP

      So, the... these examples illustrate how these AI systems fit into what you were just mentioning, that Google is sort of a company that organizes information fundamentally. And then you can... basically what AI is doing in this context is finding relationships between information, between concepts to help get ideas to you faster, information you want to you faster. Um, now we're moving with current AI models, like obviously they're very... you know, you can use BERT in Google Search, and you can ask these things questions, and they obviously are still good at information retrieval. But more fundamentally, you know, like, they're like... uh, they can, like, write your entire code base for you and do all... (laughs) you know, like-

    2. JD

      That seems

    3. NS

      ... actually useful.

    4. DP

      ... more like an actual worker-

    5. JD

      Yeah, yeah.

    6. DP

      ... um, which is going beyond the, uh, just, like, information retrieval. So-

    7. JD

      Yeah.

    8. DP

      ... has, um... yeah, has your... how are you thinking about, like, is Google still an information retrieval company if you're, like, building an AGI? Like, AGI can do information retrieval, but it can do many other things as well, right?

    9. JD

      I think we're an organized inf- inf- the world's information company, and that's broader than information retrieval, right?

    10. DP

      Sure.

    11. JD

      That's maybe organizing and creating new information from, you know, some guidance you give it. "Can you help me write a, a letter to my, to my veterinarian about my dog? It's got these symptoms," and it'll draft that. Or, "Can you feed in this video, and, you know, can you produce a summary of like what's happening in the video every few minutes?" And, you know, I think our sort of multimodal capabilities are showing that it's more than just text. It's about, you know, understanding the world and all the different kind of modalities that, that information exists in. Um, both kind of human ones, but also, uh, kind of non-human-oriented ones, like weird, uh, LiDAR sensors on autonomous vehicles or, you know, genomic information or health information. And then helping... how do you extract and transform that into useful insights for people and make use of that in, in helping them do all kinds of things they wanna do? And that's, you know, sometimes it's, "I wanna be entertained by chatting with a chatbot." Sometimes it's, "I want answers to this really complicated question." There is no single s- source to retrieve from. It's... you need to pull information from, like, 100 web pages and, like, figure out what's going on and make an organized, synthesized version of that data. And, uh, and then dealing with, you know, multimodal things or coding-related problems. I think it's super exciting what these models are capable of, and they're improving fast. So, I'm excited to see where we go. I don't know about you.

    12. NS

      I am also excited to see where we go.

    13. JD

      (laughs)

    14. NS

      And, you know, yeah, I think, uh, definitely the, uh, uh, or- organizing, organizing information, you know, is, you know, is, is, is clearly like a, uh, you know, trillion-dollar opportunity. But, you know, a trillion dollars is not cool anymore. What's cool is a quadrillion dollars.

    15. DP

      (laughs)

    16. JD

      (laughs)

    17. NS

      (laughs) I mean, obviously the idea is not to just pile up some giant pile of money, it's to create value in the world-

    18. DP

      Right.

    19. NS

      ... and so much more value can be created when these systems can actually go and do something for you: write your code, or figure out problems that you wouldn't have been able to figure out yourself, and do that at scale. So I mean... we're going to have to be very, very flexible and dynamic as we improve the capabilities of these models, to, you know-

    20. JD

      Yeah, I guess I'm pretty excited about a lot of fundamental research questions-

    21. NS

      Yeah.

    22. JD

      ... that come about because you see that something we're doing could be substantially improved if we-

    23. NS

      Mm-hmm.

    24. JD

      ... tried this approach, or things in this rough direction. And maybe that'll work, maybe it won't. But I also think there's value in seeing what we could achieve for end users and then working backwards from that to actually build systems that are able to do it. As one example: organizing information should mean any information in the world is usable by anyone, regardless of what language they speak.

    25. NS

      Yeah.

    26. JD

      And that I think-

    27. NS

      Yeah.

    28. JD

      ... you know, we've done some amount of that, but it's not nearly the full vision: no matter which of thousands of languages you speak, we can make any piece of content available to you and make it usable by you, and any video could be watched in any language. I think that would be pretty awesome. We're not quite there yet, but that's definitely something I see on the horizon that should be possible.

    29. DP

      The,

  9. 32:00–36:12

    Doing Search in-context

    1. DP

      the, speaking of different architectures you might try: I know one thing you're working on right now is longer context. If you think of Google Search, it's got the entire index of the internet in its context, but it does a very shallow search over it. Language models obviously have limited context right now, but they can really think about what they're seeing. In-context learning is like dark magic. How do you think about what it would be like to merge something like Google Search and something like in-context learning?

    2. JD

      Yeah, maybe I'll take a first stab at it, 'cause I've thought about this for a bit. I think one of the things you see with these models is they're quite good, but they do hallucinate and have factuality issues sometimes.

    3. DP

      Yeah.

    4. JD

      And part of that is you've trained on, say, tens of trillions of tokens, and you've stirred all that together in your tens or hundreds of billions of parameters.

    5. DP

      Yeah.

    6. JD

      But it's all a bit squishy 'cause you've, like, (laughs) churned all these-

    7. DP

      Right.

    8. JD

      ... tokens together. And so the model has a reasonably clear view of that data, but it sometimes gets confused and will give the wrong date for something.

    9. DP

      Right.

    10. JD

      Whereas information in the context window, in the input of the model, is really sharp and clear-

    11. DP

      Yeah.

    12. JD

      ... because we have this really nice attention mechanism in transformers, so the model can pay attention to things-

    13. DP

      Yeah.

    14. JD

      ... and it knows kind of the exact text or the exact-

    15. DP

      Right.

    16. JD

      ... frames of the video or audio or whatever it's processing. And so right now we have models that can deal with millions of tokens of context, which is quite a lot. It's hundreds of pages of a PDF, or-

    17. DP

      Right.

    18. JD

      ... you know, 50 research papers, or hours of video, or tens of hours of audio, or some-

    19. DP

      Yeah.

    20. JD

      ... combination of those things, which is pretty cool. But it would be really nice if the model could attend to trillions of tokens, right? Could it attend to the entire internet and find the right stuff for you?

    21. DP

      Yeah.

    22. JD

      Could it attend to all your personal information for you, right?

    23. DP

      Yeah.

    24. JD

      Like, I would love a model that has access to all my emails and all my documents and all my photos, and when I ask it to do something, it can make use of that-

    25. DP

      Right.

    26. JD

      ... with my permission, to sort of help solve what-

    27. DP

      Yeah.

    28. JD

      ... it is I'm wanting it to do. But that's gonna be a big computational challenge, 'cause the naive attention algorithm is quadratic. You can kind of barely make it work on a fair bit of hardware for millions of tokens, but there's no hope of making that naively go to trillions of tokens. So we need a whole bunch of interesting algorithmic approximations to what you would really want: a way for the model to attend, conceptually, to lots and lots more tokens, trillions of tokens. You know, maybe we can put all of the Google codebase in context for every Google developer, all the world's source code in context for any open-source developer. That would be amazing.
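
A rough back-of-envelope, with assumed numbers rather than measurements of any real system, shows why naive quadratic attention can't simply be pushed from millions to trillions of tokens. The sketch counts only the FLOPs to form the n×n score matrix of a single attention head; the head dimension of 128 and the 1 PFLOP/s of sustained compute are both illustrative assumptions:

```python
def attention_flops(n_tokens: int, d_head: int = 128) -> float:
    """FLOPs to form the n x n attention-score matrix for one head."""
    return 2.0 * n_tokens**2 * d_head

SUSTAINED_FLOPS = 1e15  # assume ~1 PFLOP/s of sustained compute

for n in (10**6, 10**9, 10**12):
    seconds = attention_flops(n) / SUSTAINED_FLOPS
    print(f"{n:>16,d} tokens -> {seconds:.2e} s per layer per head")
```

At a million tokens this is a fraction of a second, but at a trillion tokens it is on the order of 10^11 seconds (thousands of years) per layer per head, which is why the algorithmic approximations Jeff mentions are unavoidable.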

    29. NS

      It would be, it would be incredible.

    30. DP

      (laughs)

  10. 36:12–37:29

    The internal coding model

    1. NS

      Uh-

    2. DP

      I wanna talk more about the thing you mentioned: Google is a company with lots of code and lots of examples, right? Just think about that one use case and what it implies. You've got the Google monorepo, and maybe you've figured out the long-context thing, so you could put the whole thing in context, or you fine-tune on it. Basically, why hasn't this already been done? Because you can imagine the amount of code that Google has proprietary access to, even if you're just using it internally-

    3. JD

      Yeah.

    4. DP

      ... to make your developers more efficient and productive.

    5. JD

      Oh, to be clear, we have actually already done further training of a Gemini model on our internal codebase for our internal developers.

    6. DP

      Yeah.

    7. JD

      But that's different than attending to all of it-

    8. DP

      Right.

    9. JD

      ... uh, because it sort of stirs together the code base into a bunch of parameters.

    10. DP

      Mm-hmm.

    11. JD

      And I think having it in context makes things clearer. But even the further-trained internal model is incredibly useful. Like, Sundar, I think, has said that 25% of the characters we're checking into our codebase these days are generated by our AI-based-

    12. DP

      Right.

    13. JD

      ... coding models, with kind of human-

    14. DP

      H- how do you imagine-

  11. 37:29–43:20

    What will 2027 models do?

    1. JD

      ... driving.

    2. DP

      ... in a year or two, based on the capabilities you see on the horizon and your own personal work, what will it be like to be a researcher at Google? You have a new idea or something; with the way you're interacting with these models in a year, what does that look like?

    3. JD

      (laughs)

    4. NS

      Well, I mean, I assume (clears throat) we will have these models a lot better, and hopefully be able to be much, much more productive.

    5. DP

      (laughs)

    6. JD

      (laughs) Yeah. I mean, in addition to research-y contexts, any time you see these models used, I think they're able to make software developers more productive, 'cause they can take a high-level spec or sentence description of what you want done and give a pretty reasonable first cut at it. And so from a research perspective, maybe you can say, "I'd really like you to explore this kind of idea, similar to the one in this paper, but maybe let's try making it convolutional or something." If you could do that and have the system automatically generate a bunch of experimental code, and maybe you look at it and you're like, "Yeah, that looks good. Run that," that seems like a nice dream direction-

    7. DP

      Yeah.

    8. JD

      ... to go in and seems plausible in the next year or two years that you might make a lot of progress on that. And-

    9. DP

      Seems under-hyped, 'cause you could have, like-

    10. JD

      (laughs)

    11. DP

      ... literally millions of extra employees, and you can immediately check their output, and the employees can check each other's output. They immediately stream tokens. You could say something-

    12. JD

      Yeah. Sorry, I didn't mean to under-hype it.

    13. DP

      Yeah.

    14. JD

      I think it's super exciting. (laughs)

    15. DP

      (laughs)

    16. NS

      (laughs) Uh, yeah, I'm with you.

    17. JD

      I just don't like to hype things that aren't done yet. (laughs)

    18. DP

      (laughs)

    19. NS

      (laughs)

    20. DP

      Um, yeah, so I do wanna play with this idea more, 'cause it seems like a big deal: something like an autonomous software engineer, especially from the perspective of a researcher who's like, "I want to spec out and build this system." Okay, so let's run with this idea. As somebody who has worked on developing transformative systems through your careers: the idea that instead of having to code something like today's equivalent of MapReduce or TensorFlow, you just say, "Here's how I want a distributed AI library to look. Write it up for me." Do you imagine you could be, like, 10X more productive? 100X more productive?

    21. JD

      I was pretty impressed by something I think I saw on Reddit. We have a new experimental coding model that's much better at coding and math and so on. Someone external tried it, and they basically prompted it and said, "I'd like you to implement a SQL processing database system with no external dependencies. And please do that in C."

    22. NS

      (laughs)

    23. JD

      (laughs) And from what the person said, it actually did a quite good job. It generated a SQL parser and a tokenizer and a query planning system and some storage format for the data on disk-

    24. DP

      Mm-hmm.

    25. JD

      ... and was actually able to handle simple queries. So from that prompt, which is a paragraph of text or something, to get even an initial cut at that seems like a big boost in productivity for software developers.

    26. DP

      Yeah.

    27. JD

      And I think you might end up with other kinds of systems that maybe don't try to do that in a single, semi-interactive, respond-in-40-seconds kind of way, but might go off for 10 minutes, and might interrupt you after five minutes saying, "I've done a lot of this, but now I need some input. Do you care about handling video, or just images?" And that seems like you'll need ways of managing the workflow if you have a lot of these kinds of background activities happening.

    28. DP

      Yeah. Actually, can you talk more about that? What interface do you imagine we might need if you could literally have millions of employees, hundreds of thousands of employees, you could spin up on command, who are able to type incredibly fast? It's almost like going from 1930s trading with paper tickets to a modern trading setup or something. You need some interface to keep track of everything that's going on: for the AIs to integrate into this big monorepo and leverage their own strengths, and for humans to keep track of what's happening. Basically, what is it like to be Jeff or Noam in three years, working day to day?

    29. NS

      It might be kind of similar to what we have now, 'cause we already have parallelization as a major issue. You know, we have lots and lots of really brilliant machine-learning researchers, and we want them to all work together and build AI. So actually, the parallelization among people might be similar to-

    30. DP

      Mm-hmm.

  12. 43:20–49:10

    A new architecture every day?

    1. NS

      uh ...

    2. DP

      Yeah. Actually, that's a really interesting idea. Suppose in the world today there are on the order of 10,000 AI researchers, and this-

    3. NS

      Mm-hmm.

    4. DP

      ... community coming up with a breakthrough every-

    5. JD

      Probably more than that. There were 15,000 at NeurIPS last year. (laughs)

    6. NS

      (laughs) Oh, wow. (laughs)

    7. DP

      100,000? I don't know. Um ...

    8. JD

      Yeah, maybe. (laughs)

    9. DP

      Um ...

    10. JD

      Sorry. (laughs)

    11. DP

      No, no. It's good to have (laughs) the correct order of magnitude. And say the odds that this community comes up with a breakthrough on the scale of the Transformer in a given year are, let's say, 10%. Now suppose this community is 1,000 times bigger, and it's in some sense doing this parallel search for better architectures-

    12. NS

      Yeah.

    13. DP

      ... better techniques. Do we just, like, get, like, Transformer-sized-

    14. NS

      A breakthrough a day? Yeah.

    15. DP

      ... breakthroughs every year or every day?

    16. NS

      Maybe. Sounds-

    17. JD

      Maybe.

    18. NS

      ... uh, sounds potentially good, you know?

    19. DP

      (laughs)

    20. NS

      (laughs)

    21. DP

      But is that exactly what ML research is like, if you're able to try all these experiments?

    22. NS

      It's a good question, 'cause, you know, I don't know that folks have been doing that as much. I mean, we definitely have lots of great ideas coming along. Everyone seems to want to run their experiment at maximum scale, but I think that's, (laughs) you know, that's a human problem. (laughs)

    23. DP

      (laughs)

    24. JD

      Yeah, yeah. It's very helpful to have a 1/1000th-scale problem, then vet, like, 100,000 ideas on that and scale up the ones that seem promising.
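
The winnowing loop Jeff describes can be sketched in a few lines. Everything here is hypothetical: `run_experiment` is a stand-in for actually training an idea at some fraction of full scale, with a hidden per-idea quality and evaluation noise that shrinks as the scale grows:

```python
import random

def run_experiment(idea: int, scale: float, rng: random.Random) -> float:
    # Hypothetical stand-in for "train this idea at this scale, score it":
    # a hidden per-idea quality plus noise that shrinks at larger scale.
    true_quality = (idea * 2654435761 % 1000) / 1000.0
    noise = rng.gauss(0.0, 0.05 / scale ** 0.5)
    return true_quality + noise

def winnow(n_ideas: int = 1000, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    survivors = list(range(n_ideas))
    for scale in (0.001, 0.01, 0.1):  # from 1/1000th scale upward
        scored = sorted(survivors,
                        key=lambda i: run_experiment(i, scale, rng),
                        reverse=True)
        survivors = scored[: max(1, len(scored) // 10)]  # keep top 10%
    return survivors

print(f"{len(winnow())} idea(s) promoted to full-scale training")
```

Each round spends more compute per surviving idea, so most of the budget goes to the few ideas that kept looking good at every scale.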

    25. DP

      Yeah. So, I think one thing the world might not be taking seriously: people are aware that it's exponentially harder to scale, that making a model 100X bigger takes 100X more compute, right? So people understand it's an exponentially harder problem to go from Gemini 2 to 3 and so forth. But maybe people aren't aware of this other trend, where Gemini 3 is coming up with all these different architectural ideas and trying them out, you see what works, and you're constantly making algorithmic progress that makes training the next one easier and easier.

    26. JD

      Yeah, I mean-

    27. DP

      How far could you take that feedback loop?

    28. JD

      I mean, I think one thing people should be aware of is that the improvements from generation to generation of these models are often partially driven by hardware and larger scale, but equally, and perhaps even more so, driven by major algorithmic improvements and major changes in the model architecture and the training data mix, things that really make the model better per flop applied to it. So I think that's a good realization. And then I think if we have automated exploration of ideas, we'll be able to vet a lot more ideas and bring them into the actual production training for next generations of these models.

    29. DP

      Mm-hmm.

    30. JD

      And that's gonna be really helpful, 'cause that's sort of what a lot of machine-learning research currently is: brilliant machine-learning researchers looking at lots of ideas, winnowing the ones that seem to work well at small scale, seeing if they work well at medium scale, bringing them into larger-scale experiments, and then settling on adding a whole bunch of new and interesting things to the final model recipe. And I think if we can do that 100 times faster, through those machine-learning researchers just gently steering a more automated search-

  13. 49:10–53:07

    Automated chips and intelligence explosion

    1. JD

      I've been pretty excited lately about how could we dramatically speed up the chip design process.

    2. DP

      Yeah.

    3. JD

      'Cause as we were talking about earlier, the current way you design a chip takes roughly 18 months to go from "we should build a chip" to something you hand over to TSMC, and then TSMC takes about four months to fab it, and then you get it back and put it in your data centers. So that's a pretty lengthy cycle, and the fab time in there is a pretty small portion of it today. But if you could make fab time the dominant portion, so that instead of taking 12 to 18 months and 100 to 150 people to design the chip, you could shrink that to a few people with a much more automated search process exploring the whole design space of chips, getting feedback from all aspects of the chip design process on the high-level choices the system is trying to explore.

    4. DP

      Yeah.

    5. JD

      Then I think you could get perhaps much more exploration and more rapid design of something you actually want to give to a fab. And that would be great, 'cause you can shrink the design time and the deployment time, by designing the hardware in the right way so that you just get the chips back and plug them into some system. And that will then, I think, enable a lot more specialization. It will enable a shorter timeframe for the hardware design, so that you don't have to look out quite as far into what kind of ML algorithms-

    6. DP

      Yeah.

    7. JD

      ... would be interesting. Instead it's like you're looking at, you know, six to nine months from now-

    8. DP

      Yeah.

    9. JD

      ... what should it be, rather than two, two and a half years out. And that would be pretty cool. I do think that if fabrication time is in your inner loop of improvement, you're gonna, like-

    10. DP

      How long is it?

    11. JD

      The leading-edge nodes unfortunately are taking longer and longer, 'cause they have more metal layers-

    12. DP

      Yeah.

    13. JD

      ... than previous, older nodes. So that tends to make it take anywhere from three to five months.

    14. DP

      Okay. But that's how long training runs take anyways, right? So you could potentially do both at the same time?

    15. JD

      Yeah, potentially.

    16. DP

      Okay, so I guess you can't get sooner than three to five months, but the idea is that you could get... But also, yeah, you're rapidly developing new algorithmic ideas-

    17. JD

      Mm-hmm.

    18. DP

      ... between this time.

    19. NS

      Right. That- that can move fast. It'll- it'll-

    20. JD

      That can move fast. That can run on, like, existing chips and explore lots of cool ideas.

    21. DP

      Right.

    22. NS

      Yeah.

    23. DP

      So does... Isn't that, like, a situation in which you're... Like, I think people-

    24. JD

      Yeah.

    25. DP

      ... sort of expect, like, ah, there's gonna be a sigmoid. Um, again, this is not a sure thing, but just like is this a possibility, the idea that you have like sort of an explosion of capabilities, uh, very rapidly towards the tail end of human intelligence that, you know, gets like s- m- uh, smarter and smarter at a more and more rapid rate? Um...

    26. NS

      Quite possibly. Yeah, yeah.

    27. JD

      Yeah. I mean, I like to think of it like this. Right now we have models that can take a pretty complicated problem and break it down internally into a bunch of steps, puzzle together the solutions for those steps, and often give you a solution to the entire problem you're asking. But it isn't super reliable, and it's good at breaking things down into five to 10 steps, not 100 to 1,000 steps. So if you could go from 80% of the time giving you a perfect answer on something that's 10 steps long, to 90% of the time giving you a perfect answer on something that's 100 to 1,000 steps-

    28. DP

      Yeah.

    29. JD

      ... of subproblems long, that would be an amazing improvement in the capability of these models. And, you know, we're not there yet, but I think that's what we're aspirationally trying to get to-
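
Treating each step as an independent coin flip makes that gap concrete. This is only a toy model of reliability, but it shows how sharply the per-step success rate has to rise to move from the first regime Jeff describes to the second:

```python
def per_step_reliability(end_to_end: float, n_steps: int) -> float:
    """Per-step success rate p such that p**n_steps == end_to_end."""
    return end_to_end ** (1.0 / n_steps)

today = per_step_reliability(0.80, 10)     # 80% on 10-step problems
target = per_step_reliability(0.90, 1000)  # 90% on 1,000-step problems

print(f"per-step reliability implied today : {today:.5f}")
print(f"per-step reliability needed        : {target:.5f}")
```

Roughly 97.8% per step suffices for the first, but the second requires about 99.99% per step: a couple of percent per step compounds into a qualitatively different capability.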

    30. NS

      Right. Yeah, we don't need new hardware for that. We-

  14. 53:07–1:02:38

    Future of inference scaling

    1. NS

      One of the big areas of improvement in the near future, I think, is inference-time compute: applying more compute at inference time. And the way I've liked to describe it is that even some giant language model, even if you're doing, say, a trillion operations per token, which is more than most people are doing these days... operations cost something like 10 to the negative 18 dollars, and so you're getting like-

    2. JD

      (laughs)

    3. DP

      (laughs)

    4. NS

      ... a million tokens to the dollar, right? Compare that to a relatively cheap pastime: you go out and buy a paperback book and read it, and you're paying like 10,000 tokens to the dollar. So talking to a language model is like a hundred times cheaper than reading a paperback. So there is a huge amount of headroom there to say, "Okay, if we can make this thing more expensive, but-"

    5. DP

      Ah, right.

    6. NS

      "... but smarter? 'Cause we're like 100X cheaper than reading a paperback, we're like 10,000 times cheaper than talking to a customer support agent, we're like a million times or more cheaper than hiring a software engineer or talking to your doctor or lawyer. Like, can we add, you know, the-"

    7. DP

      Yeah.

    8. NS

      "... add computation and make it smarter?" So I think a lot of the takeoff that we're going to see in the very near future is of this form. We've been exploiting and improving pre-training a lot in the past, and post-training, and those things will continue to improve, but taking advantage of thinking harder at inference time is going to be an explosion.
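
Noam's arithmetic, written out with the rough numbers from the conversation. All of these are his stated estimates, not measured prices:

```python
OPS_PER_TOKEN = 1e12     # ~a trillion operations per token
DOLLARS_PER_OP = 1e-18   # ~$1e-18 per operation

dollars_per_token = OPS_PER_TOKEN * DOLLARS_PER_OP  # ~$1e-6
tokens_per_dollar = 1.0 / dollars_per_token         # ~1e6

PAPERBACK_TOKENS_PER_DOLLAR = 1e4  # ~10,000 tokens to the dollar

print(f"LLM tokens per dollar : {tokens_per_dollar:,.0f}")
print(f"vs. a paperback       : "
      f"{tokens_per_dollar / PAPERBACK_TOKENS_PER_DOLLAR:.0f}x cheaper")
```

The headroom he points to is exactly this ratio: you could spend roughly 100x more compute per token and still only match the cost of reading a book.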

    9. JD

      Yeah, and an aspect of inference time is that I think you want the system to be actively exploring a bunch of different potential solutions. Maybe it does some searches on its own, gets some information back, consumes it, and figures out, "Oh, now I would really like to know more about this thing," so it iteratively explores how to best solve the high-level... problem you pose to the system. And I think having a dial where you can make the model give you better answers with more inference-time compute... we have a bunch of techniques now that seem like they can do that, and the more you crank up the dial, the more it costs you in compute, but the better the answers get. That seems like a nice trade-off to have, 'cause sometimes you wanna think really hard, 'cause there's a super important problem. Sometimes you probably don't wanna spend-

    10. DP

      Right.

    11. JD

      ... enormous amounts of compute to compute the answer to one plus one. (laughs) Maybe the system should decide to us-

    12. DP

      You, you take that to 100 and it comes up with, like, new axioms of set theory or something. (laughs)

    13. JD

      Should decide to use the calculator tool or something instead of, (laughs) you know, a very large language model.

    14. DP

      Are there any impediments to scaling up inference-time compute? Do we have some way to just linearly scale it, or is this basically a solved problem, where we know how to throw 100X or 1,000X compute at it and get correspondingly better results?

    15. JD

      Yeah.

    16. NS

      Well, we're, we're working out the algorithms as we speak-

    17. DP

      (laughs)

    18. NS

      ... so I believe, you know, we'll see better and better solutions to this as these many more than 10,000 researchers (laughs) are-

    19. DP

      (laughs)

    20. NS

      ... are, are hacking at it, uh, many of them-

    21. JD

      Yeah.

    22. NS

      ... at Google.

    23. JD

      I mean, we do see some examples in our own experimental work where, if you apply more inference-time compute, the answers are better: if you apply 10X, you can get better answers than with X amount of inference-time compute, and that seems useful and important. But I think what we would like is for 10X to get an even bigger improvement in the quality of the answers than we're getting today. And so that's about designing new algorithms, trying new approaches, figuring out how best to spend that 10X instead of X to improve things.

    24. DP

      Does it look more like search, or does it look more like just keeping going in that linear direction for a longer time?

    25. JD

      I mean, I really like Rich Sutton's paper that he wrote about the-

    26. DP

      Mm-hmm.

    27. JD

      ... the bitter lesson. The bitter lesson is effectively this nice one-page paper, and the essence of it is: you can try lots of approaches, but the two techniques that are incredibly effective are learning and search. (laughs) And you can scale those algorithmically and computationally, and you will often then get better results than with any other kind of approach, across a pretty broad variety of problems.

    28. DP

      Yeah.

    29. JD

      And so I think search has gotta be part of the solution to spending more inference time. You wanna maybe explore a few different ways of solving a problem and go, "Oh, that one didn't work, but this one worked better, so now I'm gonna explore it a bit more."
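
A minimal sketch of spending inference compute on search: sample several candidates, score them, keep the best. `propose` and `score` are hypothetical stand-ins for model sampling and for a verifier or reward model, and the toy "problem" here is just guessing a number:

```python
import random

def problem_answer(problem: str) -> float:
    return float(len(problem))  # toy ground truth for scoring

def propose(problem: str, rng: random.Random) -> float:
    # hypothetical stand-in for sampling one candidate solution
    return rng.gauss(problem_answer(problem), 1.0)

def score(problem: str, candidate: float) -> float:
    # hypothetical stand-in for a verifier / reward model
    return -abs(candidate - problem_answer(problem))

def search(problem: str, budget: int, seed: int = 0) -> float:
    # the "dial": a bigger budget means more candidates considered
    rng = random.Random(seed)
    candidates = [propose(problem, rng) for _ in range(budget)]
    return max(candidates, key=lambda c: score(problem, c))

small = score("hello world", search("hello world", budget=4))
large = score("hello world", search("hello world", budget=64))
print(f"best score, budget 4 : {small:.3f}")
print(f"best score, budget 64: {large:.3f}")
```

With a fixed seed, the larger budget's samples contain the smaller budget's, so turning up the dial can only improve the best score found; richer versions would expand the promising candidates further rather than sampling independently.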

    30. DP

      How does this change your plans for future data center planning and so forth? Can this kind of search be done asynchronously? Does it have to be online or offline? How does that change how big of a campus you need, and those kinds of considerations?

  15. 1:02:38–1:08:15

    Already doing multi-datacenter runs

    1. DP

      So a big discussion has been about how we're already tapping out nuclear power plants in terms of delivering power to one single campus. So do we have to have even more in one place, like two gigawatts, five gigawatts, or can it be more distributed and still be able to train a model? Does this new regime of inference scaling make different considerations there plausible? How are you thinking about multi-data-center training now?

    2. JD

      Uh, I mean, we're already doing it, so-

    3. DP

      Yeah.

    4. JD

      ... we're pro multi-data-center training.

    5. DP

      (laughs)

    6. JD

      I think in the Gemini 1.5 tech report we said we used multiple metro areas-

    7. DP

      Mm.

    8. JD

      ... and trained with some of the compute in each place, with a pretty long-latency but high-bandwidth connection between those data centers, and that works fine.

    9. DP

      Yeah.

    10. JD

      You know, it's great. Actually, training is kind of interesting, 'cause each step in a training process for a large model usually takes at least a few seconds, so the latency of being 50 milliseconds away doesn't matter that much.

    11. DP

      Mm.

    12. JD

      Um-

    13. NS

      Just the bandwidth, you know?

    14. JD

      Yeah, just bandwidth.

    15. NS

      As long as you can sync all of the parameters of the model across the different data centers and accumulate all the gradients in the time it takes to do one step, you're pretty good.
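
The "as long as you can sync within a step" constraint is easy to put numbers on. This is pure back-of-envelope arithmetic with made-up model size and step time, not figures from the conversation:

```python
def cross_dc_bandwidth_gbps(n_params, step_time_s, bytes_per_value=2):
    """Bandwidth needed so each datacenter can send its gradients and
    receive updated parameters within a single training step.
    Assumes one gradient exchange and one parameter sync per step."""
    total_bytes = n_params * bytes_per_value * 2  # gradients out + params in
    return total_bytes * 8 / step_time_s / 1e9   # gigabits per second

# Hypothetical example: a 500B-parameter model in bf16 with 5-second steps
# needs roughly 3.2 Tb/s between sites, while a 50 ms one-way latency
# costs only ~1% of each step, which is why "it's just the bandwidth."
needed = cross_dc_bandwidth_gbps(500e9, 5.0)  # 3200.0 Gb/s
```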

    16. JD

      Yeah, and then we have a bunch of work from even the early Brain days, when we were using CPU machines that were really slow, so we needed to do asynchronous training to help scale-

    17. DP

      Yeah.

    18. JD

      ... uh, where each copy of the model would kind of do some local-

    19. DP

      Right.

    20. JD

      ... computation and then send gradient updates to a centralized system, which would apply them asynchronously while another copy of the model was doing the same thing. It makes your model parameters kind of wiggle around a bit, and it makes people uncomfortable with the theoretical guarantees, but it actually seems to work-

    21. DP

      (laughs)

    22. NS

      (laughs)

    23. JD

      ... in practice. And our research has generally-

    24. NS

      In practice, it works. It was so pleasant-

    25. JD

      (laughs)

    26. NS

      ... to go from async to sync-

    27. JD

      Yeah.

    28. NS

      ... because, because your experiments are now replic-

    29. JD

      (laughs)

    30. NS

      ... replicable, like-
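
The asynchronous scheme Jeff describes, replicas pushing gradients computed against stale parameters, can be simulated deterministically. This is a toy sketch of the general idea, not Google's implementation; the objective and constants are made up:

```python
from collections import deque

def async_sgd(x0, grad, lr=0.1, workers=4, steps=20):
    """Simulate asynchronous SGD: each 'worker' pushes a gradient that was
    computed from the parameters as they were when it last fetched them,
    so every update is applied against a slightly stale snapshot."""
    x = x0
    # Each worker starts with a gradient computed from the initial params.
    pending = deque(grad(x0) for _ in range(workers))
    for _ in range(workers * steps):
        g = pending.popleft()    # oldest outstanding (stale) gradient
        x -= lr * g              # applied to the *current* parameters
        pending.append(grad(x))  # worker re-fetches, computes its next one
    return x

# Toy objective f(x) = x^2, so grad = 2x. The iterates "wiggle around"
# (they overshoot past zero) but still converge toward 0 in practice.
x_final = async_sgd(4.0, lambda x: 2 * x)
```

With `workers=1` the queue always holds the gradient of the current parameters, so the same loop is plain synchronous SGD, which hints at why going back to sync made experiments replicable: the update order is fixed.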

  16. 1:08:15 - 1:12:41

    Debugging at scale

    1. NS

      but, uh...

    2. DP

      What does it practically look like to debug this? You've got these things, some of which are making the model a lot better, some of which are making it worse. When you go into work tomorrow, you're like, "All right, what's going on here?" How do you figure out what the most salient inputs are?

    3. NS

      Right. I mean, at small scale, you do lots of experiments-

    4. DP

      Mm-hmm.

    5. NS

      ... there's one part of the research that involves inventing these improvements or breakthroughs in isolation, in which case you want a nice simple code base that you can fork and hack, plus some baselines. My dream is: I wake up in the morning, come up with an idea, hack it up in a day, run some experiments, and get some initial results in a day, like, "Okay, this looks promising. These things worked and these didn't." And I think that is very achievable-

    6. DP

      At small scale.

    7. NS

      ... at small scale.

    8. DP

      Right.

    9. NS

      As long as you keep a nice experimental code base-

    10. DP

      Maybe an experiment takes an hour to run or two hours-

    11. NS

      Yeah.

    12. DP

      ... or something.

    13. NS

      Yeah.

    14. DP

      Not two weeks.

    15. NS

      It's great. It's great.

    16. DP

      Yeah.

    17. NS

      So there's that part of the research, then there's some amount of scaling up, and then you have the integrating part, where you want to stack all the improvements-

    18. DP

      Mm-hmm.

    19. NS

      ... on top of each other and see if they work at large scale and all in conjunction-

    20. DP

      Right, how do they interact?

    21. NS

      ... with each other.

    22. DP

      Right. You think maybe they're independent, but actually, maybe there's some funny interaction between-

    23. NS

      Mm-hmm.

    24. DP

      ... you know, improving the way in which we handle video data input and the way in which we update the model parameters, or something, and that interacts more for video data than some other thing. There are all kinds of interactions that can happen that you maybe don't anticipate, so you want to run experiments where you put a bunch of things together and periodically make sure that all the things you think are good are good together, and if not, understand why they're not playing nicely. Two questions. One, how often does it end up being the case that things don't stack up well together? Is it a rare thing, or does it happen all the time?

    25. NS

      It happens all the time.

    26. DP

      50%? Okay. (laughs)

    27. NS

      (laughs)

    28. JD

      Yeah, I mean, I think most things you don't even try to stack, because the initial experiment didn't work that well or showed results that aren't that promising relative to the baseline. And then you take the things that did work and try to scale them up individually.

    29. NS

      Yeah.

    30. JD

      And then you're like, "Oh, yeah, these ones seem really promising, so I'm gonna include them in something I bundle together, combined with other things that seem promising, and try to advance things." And then you run the experiments, and you're like, "Oh, well, they didn't really work that well. Let's try to debug why."
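
The "do improvements stack?" check has a simple shape: compare each pair's combined gain against the sum of its individual gains. The improvement names and scores below are invented for illustration, and `evaluate` stands in for what is really a very expensive training run:

```python
from itertools import combinations

def find_bad_interactions(improvements, evaluate, baseline_score, tol=0.01):
    """Toy ablation sweep: flag pairs of individually-good improvements
    whose combined gain falls short of the sum of their solo gains."""
    solo_gain = {name: evaluate({name}) - baseline_score
                 for name in improvements}
    suspects = []
    for a, b in combinations(improvements, 2):
        pair_gain = evaluate({a, b}) - baseline_score
        expected = solo_gain[a] + solo_gain[b]
        if pair_gain < expected - tol:   # they interact destructively
            suspects.append((a, b))
    return suspects

# Hypothetical eval scores: "video_pipeline" and "optimizer_tweak" each
# help alone but interfere when combined, as in the example discussed.
scores = {
    frozenset(): 0.70,
    frozenset({"video_pipeline"}): 0.74,
    frozenset({"optimizer_tweak"}): 0.73,
    frozenset({"data_mix"}): 0.72,
    frozenset({"video_pipeline", "optimizer_tweak"}): 0.74,  # not additive
    frozenset({"video_pipeline", "data_mix"}): 0.76,
    frozenset({"optimizer_tweak", "data_mix"}): 0.75,
}
evaluate = lambda s: scores[frozenset(s)]
bad = find_bad_interactions(["video_pipeline", "optimizer_tweak", "data_mix"],
                            evaluate, baseline_score=scores[frozenset()])
```

In practice you can't afford the full combinatorial sweep at large scale, which is why things get bundled into periodic integration runs and debugged when the bundle underperforms.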

  17. 1:12:41 - 1:20:51

    Fast takeoff and superalignment

    1. DP

      So then, going back to the dynamic where you find better and better algorithmic improvements and the models get better over time, even taking the hardware part out of it: should the world, and should you guys, be thinking more about this? There's one world where AI takes two decades to slowly get better, and if you've messed something up, you fix it, and it's not that big a deal, because each release is not that much better than the previous version. There's another world where you have this big feedback loop, which means the two years between Gemini 4 and Gemini 5 are the most important years in human history, because you go from a pretty good ML researcher to superhuman intelligence. To the extent that you think that second world is plausible, how does that change how you approach these greater and greater levels of intelligence?

    2. NS

      I've stopped cleaning my garage 'cause I'm waiting for the robots, you know.

    3. DP

      (laughs)

    4. NS

      (laughs) So probably I'm more in the second camp, that we're gonna see a lot of acceleration.

    5. JD

      Yeah, I mean, I think it's super important to understand what's going on and what the trends are. Right now, the trends are that the models are getting substantially better generation over generation, and I don't see that slowing down in the next few generations. So that means the models...

    6. JD

      ... say, two to three generations from now are gonna be capable of, let's go back to the example, going from breaking a simple task down into 10 sub-pieces and doing it 80% of the time, to something that can break a very high-level task down into a hundred or a thousand pieces and get that right 90% of the time. That's a major, major step up in what the models are capable of. So I think it's important for people to understand what is happening in the progress of the field, because those models are gonna be applied in a bunch of different domains. And I think it's really good to make sure that we, as a society, get the maximal benefits from what these models can do; I'm super excited about areas like education and healthcare, and making information accessible to all people. But we also realize they could be used for misinformation, or for automated hacking of computer systems, and we want to put in as many safeguards and mitigations as we can, and understand the capabilities of the models where we can. I think Google as a whole has a really good view of how we should approach this: our responsible AI principles are actually a pretty nice framework for thinking about the trade-offs of making better and better AI systems available in different contexts and settings, while also making sure that we're doing the right thing in terms of making sure they're safe and-
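
Jeff's 10-sub-pieces-at-80% versus a-thousand-pieces framing is about reliability compounding across steps. Under a naive independence assumption (my simplification, not his), per-step reliability dominates everything at long horizons:

```python
def end_to_end_success(p_step, n_steps, retries=0):
    """Probability an n-step decomposed task fully succeeds, assuming
    steps succeed independently and a detected failure can be retried."""
    p_one = 1 - (1 - p_step) ** (retries + 1)
    return p_one ** n_steps

# 10 sub-pieces at 80% each: the whole task succeeds only ~11% of the time.
today = end_to_end_success(0.80, 10)
# 1,000 sub-pieces needs ~99.9% per step, plus a retry, to stay above 99%.
future = end_to_end_success(0.999, 1000, retries=1)
```

This is why a jump from "10 pieces, 80%" to "a thousand pieces, 90%" is such a large capability step: getting long decompositions right requires per-step reliability far beyond what short ones need.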

Episode duration: 2:15:35


Transcript of episode v0gjI__RyCY
