60 min read · 12,410 words
- SPSpeaker
Thank you so much for joining us, Amin. Please give me a round of applause for Amin Vahdat. [audience applauding] You guys have no idea how hard it was to get Amin to show up. Seriously, this is the one lecture that, um, I've been super excited about, and Sebastian, who many of you know, um, who is my co-founder in Amp, wanted to be here, and he's so bummed that he couldn't because he's busy working on the cluster for your guys' final projects. Uh, Sebastian worked on the Borg, X-Borg GQM scheduler. De-designed that, too. So we are very much, uh, a, a Google family over at Amp, and so Amin's a bit of a rock star in, in our kind of lore. Um, so to give you guys some context, Amin, well, is the head of... basically in charge of the internal infrastructure at Google. The TPUs that make Gemini possible really would not be at anywhere close to the scale they are at if it wasn't for Amin. Okay? So pay attention to every word he says. Like, think about him as the opposite of Jensen. You know, Jensen, like, is a rapid fire, high throughput LLM. Um, think about Amin kind of as, like, the distillation of, like, three frontier models who have been trained on, like, frontier in... like, the inf... practice and discipline of infrastructure for the last... How long you been doing this, Amin?
- SPSpeaker
Coming up on 30 years, I'm sad to say.
- SPSpeaker
30 years.
- SPSpeaker
Mm-hmm.
- SPSpeaker
And so every word Amin speaks has, like... every token that he produces as an LLM has, like, universes contained in them, okay? And we, we, we will probably not understand what he actually means for years, so I'm glad this is gonna be recorded and put up on YouTube, 'cause I think years from now, people will look back at his lecture and realize how profound his influence was on the, on the industry. Um, you know, to concretize that, uh, how much, uh, compute does the internal pool at Google have today, Amin?
- SPSpeaker
I'll start off with the easy question that I can't answer.
- SPSpeaker
Yeah. [laughs]
- SPSpeaker
Um, I, I've seen some Twitter posts that say we have among the largest computing infrastructures in the whole planet, and I think that... I'm, I'm willing to s-stand up behind that one.
- SPSpeaker
Okay.
- SPSpeaker
Yeah.
- SPSpeaker
Would you say it's in the tens of gigawatts?
- SPSpeaker
Tens of gigawatts. Um, I will say that, uh, we are aiming for tens of gigawatts.
- SPSpeaker
Over the next four years-
- SPSpeaker
Yeah
- SPSpeaker
... it'll be well in the-
- SPSpeaker
Oh-
- SPSpeaker
... north of tens of gigawatts.
- SPSpeaker
Over some, some time period, yeah.
- SPSpeaker
Yeah.
- SPSpeaker
Yeah.
- SPSpeaker
So we crunched the numbers this morning. We think about one gigawatt to build out is about how much? Okay, so o-one gigawatt is about $40 billion of infrastructure. Do the math. Okay? And as much as I hate to say it, Amin's infrastructure org is literally one of the most efficient on the planet because, you know, there was a time when I was starting out Amp and we were looking at how much single cluster utilization was across the industry and, uh, some of our portfolio companies, you know, some of the speakers here were running them at 70, 80% utilization, and some of the other big tech companies were similar, in fact, worse. I'm sure you saw that, um, you know, the Colossus cluster is not running at peak utilization, and I think it's at 11% MFU, which is real... honestly, MFU is kind of hard to get up. But at Google, my understanding is if the, if the node allocation is less than 96%, it's considered a major outage.
- SPSpeaker
Yeah, so I think-
- SPSpeaker
Is that right?
- SPSpeaker
... what, what this, uh, really points to is when you hear numbers like, uh, $40 billion, uh, per gigawatt, and I've heard numbers like $50 billion a gigawatt from, uh, other sources, the numbers are going up. Things are getting more expensive. I think the, the most important consideration isn't how many gigawatts you have, it's how much capability and value you're delivering to your users, and this is something to really be aware of. In other words, if I've got a gigawatt here and a gigawatt there, they're not the same. How much reliability you have actually really, really matters. Like, I could go spend $40, $50 billion on a gigawatt, and if I don't do the work to make sure that every one of those nodes is super reliable... So a gigawatt, it's... let's say that's 150, 200,000 TPUs, GPUs. It could be whatever you want. One of those go-goes down, maybe your whole computation stops. If you're not, A, making sure it doesn't fail, B, when it does fail, figuring out which one it is and getting it repaired really fast, you just wasted a lot of money because your utilization and what we call your goodput-
- SPSpeaker
Mm
- SPSpeaker
... is nowhere near what it, it needs to be. If you have the T-TPUs deployed, but no one can schedule a job on them, it doesn't matter how much money you spent on them. So I think that a lot of these measures are actually broken. The measure isn't how much money you spent per gigawatt, it's actually how much value you deliver per dollar. And if I can spend half the money, deploy half the capacity, and give you the same capability, awesome. Better, if I can deliver twice the value from that gigawatt, I now need to s- build fewer gigawatts.
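The arithmetic behind this exchange can be sketched roughly as follows. The $40 to $50 billion per gigawatt and 150,000 to 200,000 accelerators per gigawatt are the figures quoted in the conversation. The goodput fractions and the per-node failure probability are hypothetical values chosen only for illustration, and the function names are made up for this sketch.

```python
# Back-of-envelope sketch of "value per dollar" rather than "dollars per gigawatt".
# Figures from the conversation: $40-50B per gigawatt, 150,000-200,000 accelerators
# per gigawatt. Everything else (goodput fractions, failure probability) is a
# hypothetical input for illustration, not a number from the talk.

def cost_per_accelerator(capex_per_gw_usd: float, accelerators_per_gw: int) -> float:
    """Capital cost attributed to each deployed accelerator."""
    return capex_per_gw_usd / accelerators_per_gw


def cost_per_useful_accelerator(
    capex_per_gw_usd: float, accelerators_per_gw: int, goodput: float
) -> float:
    """Capital cost per accelerator's worth of useful work.

    `goodput` is treated here as the fraction of deployed capacity that is
    doing useful work (an assumed, simplified definition).
    """
    if not 0.0 < goodput <= 1.0:
        raise ValueError("goodput must be in (0, 1]")
    return capex_per_gw_usd / (accelerators_per_gw * goodput)


def prob_no_failure(per_node_failure_prob: float, nodes: int) -> float:
    """Probability that none of `nodes` independent nodes fails in some window.

    Models the case where a single node failure stops the whole computation.
    Independence of failures is an assumption.
    """
    return (1.0 - per_node_failure_prob) ** nodes


if __name__ == "__main__":
    capex = 40e9          # quoted: about $40B per gigawatt
    chips = 200_000       # quoted: upper end of 150,000-200,000 per gigawatt

    print(f"cost per deployed accelerator: ${cost_per_accelerator(capex, chips):,.0f}")

    # Hypothetical goodput levels: halving goodput doubles the effective cost.
    for goodput in (0.9, 0.5):
        effective = cost_per_useful_accelerator(capex, chips, goodput)
        print(f"goodput {goodput:.0%}: ${effective:,.0f} per useful accelerator")

    # Hypothetical per-node failure probability over some window.
    survive = prob_no_failure(1e-5, 150_000)
    print(f"chance a 150,000-node job sees no failure: {survive:.1%}")
```

Under these assumptions, the same gigawatt costs twice as much per unit of useful work when goodput halves, which is the sense in which two gigawatts "are not the same."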
- SPSpeaker
Okay.
- SPSpeaker
Or I can only get so many gigawatts. Energy's a massive problem.
- SPSpeaker
And, um, you know, we had Jensen here last week, and one of the questions I asked him is, "How do you..." He said something similar, which was, honestly-
- SPSpeaker
Is this why everybody's laptop is signed by Jensen?
Episode duration: 1:04:22
Transcript of episode VeTqsCpcDgg
