Stanford Online

Stanford CS153 Frontier Systems | The Discipline of Delivering Value per Gigawatt

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai

To follow along with the course schedule and syllabus, visit: https://cs153.stanford.edu/

In this CS153 Frontier Systems lecture, the class returns to the upstream infrastructure stack with Amin Vahdat, who leads Google's internal compute infrastructure and the TPU program powering Gemini, and who frames his nearly 30-year career as the discipline of building reliable, balanced supercomputers at planetary scale. Vahdat argues the industry is over-fixated on gigawatts and flops as headline metrics: at roughly $40 to $50 billion per gigawatt, the question that matters is value delivered per dollar, measured in happy daily active users and paying enterprise customers, not raw capacity.

He walks through three constraints that govern utility. The first is reliability: 99 percent uptime allows about 3.65 days of downtime a year, while 99.9 percent cuts that to under nine hours, and frontier labs are newly willing to trade five-nines availability for double the capacity. The second is system balance, invoking Amdahl's 1967 law that every million instructions per second needs a megabyte per second of I/O, now stretched across 100,000-node synchronous training jobs where a single failed node halts the entire computation. The third is procurement lead times of two to three years for net-new gigawatts, where land permitting, utility contracts, and 20-year take-or-pay power agreements have replaced the slack capacity that once let hyperscalers ask for ten megawatts on a handshake.

He details Google's optical circuit switch architecture, which uses 136 MEMS-controlled mirrors per chip to programmatically rewire the 3D torus topology connecting TPU racks, allowing failed racks to be virtually swapped in seconds and bandwidth redirected to distant storage clusters for the duration of a five-hour Borg job. Vahdat closes on responsibility: data centers should be a net uplift to local grids and communities, citing Google's choice to accept 10 percent worse power efficiency in water-scarce regions and its gigawatt-scale demand response program that returns capacity to utilities during peak residential load.

Amin Vahdat is a Fellow and Chief Technologist for AI Infrastructure at Google, where his team is responsible for delivering industry-leading infrastructure that spans custom silicon, data centers, network, and supply chain and operations. This infrastructure serves Alphabet, Google, and the world, and supports the artificial intelligence technologies that empower ML developers and solve customers’ most pressing business challenges. In the past, he was Vice President and General Manager for Google's compute, storage, and network hardware and software infrastructure. Until 2019, he was the Technical Lead and Vice President for the Networking organization at Google. Before joining Google, Amin was the Science Applications International Corporation (SAIC) Professor of Computer Science and Engineering at UC San Diego (UCSD). He received his doctorate in computer science from the University of California, Berkeley, and is a Fellow of the Association for Computing Machinery (ACM). Amin has been recognized with a number of awards, including the National Science Foundation (NSF) CAREER award, the UC Berkeley Distinguished EECS Alumni Award, the Alfred P. Sloan Fellowship, the Association for Computing Machinery's SIGCOMM Networking Systems Award, and the Duke University David and Janet Vaughn Teaching Award. Amin was awarded the SIGCOMM lifetime achievement award for his contributions to data center and wide area networks. He was inducted into the National Academy of Engineering in 2023 for his contributions to the design and implementation of datacenter and planet-scale networks that power cloud computer systems.
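The availability figures quoted in the summary can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `annual_downtime_hours` is hypothetical, and it assumes a 365-day year with downtime spread evenly, which is a simplification rather than anything stated in the lecture.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Hours per year a system is unavailable at a given availability (0 to 1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Compare two, three, and five nines of availability.
for label, availability in [("99%", 0.99), ("99.9%", 0.999), ("99.999%", 0.99999)]:
    hours = annual_downtime_hours(availability)
    print(f"{label:>8}: {hours:8.2f} hours/year ({hours / 24:.3f} days)")
```

Under these assumptions, 99 percent availability corresponds to 87.6 hours (3.65 days) of downtime a year, and 99.9 percent to 8.76 hours.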
Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

May 27, 2026 | 1h 4m

EVERY SPOKEN WORD

  1. SP

    Thank you so much for joining us, Amin. Please give me a round of applause for Amin Vahdat. [audience applauding] You guys have no idea how hard it was to get Amin to show up. Seriously, this is the one lecture that, um, I've been super excited about, and Sebastian, who many of you know, um, who is my co-founder in Amp, wanted to be here, and he's so bummed that he couldn't because he's busy working on the cluster for your guys' final projects. Uh, Sebastian worked on the Borg, X-Borg GQM scheduler. De-designed that, too. So we are very much, uh, a, a Google family over at Amp, and so Amin's a bit of a rock star in, in our kind of lore. Um, so to give you guys some context, Amin, well, is the head of... basically in charge of the internal infrastructure at Google. The TPUs that make Gemini possible really would not be at anywhere close to the scale they are at if it wasn't for Amin. Okay? So pay attention to every word he says. Like, think about him as the opposite of Jensen. You know, Jensen, like, is a rapid fire, high throughput LLM. Um, think about Amin kind of as, like, the distillation of, like, three frontier models who have been trained on, like, frontier in... like, the inf... practice and discipline of infrastructure for the last... How long you been doing this, Amin?

  2. SP

    Coming up on 30 years, I'm sad to say.

  3. SP

    30 years.

  4. SP

    Mm-hmm.

  5. SP

    And so every word Amin speaks has, like... every token that he produces as an LLM has, like, universes contained in them, okay? And we, we, we will probably not understand what he actually means for years, so I'm glad this is gonna be recorded and put up on YouTube, 'cause I think years from now, people will look back at his lecture and realize how profound his influence was on the, on the industry. Um, you know, to concretize that, uh, how much, uh, compute does the internal pool at Google have today, Amin?

  6. SP

    I'll start off with the easy question that I can't answer.

  7. SP

    Yeah. [laughs]

  8. SP

    Um, I, I've seen some Twitter posts that say we have among the largest computing infrastructures in the whole planet, and I think that... I'm, I'm willing to stand up behind that one.

  9. SP

    Okay.

  10. SP

    Yeah.

  11. SP

    Would you say it's in the tens of gigawatts?

  12. SP

    Tens of gigawatts. Um, I will say that, uh, we are aiming for tens of gigawatts.

  13. SP

    Over the next four years-

  14. SP

    Yeah

  15. SP

    ... it'll be well in the-

  16. SP

    Oh-

  17. SP

    ... north of tens of gigawatts.

  18. SP

    Over some, some time period, yeah.

  19. SP

    Yeah.

  20. SP

    Yeah.

  21. SP

    So we crunched the numbers this morning. We think about one gigawatt to build out is about how much? Okay, so one gigawatt is about $40 billion of infrastructure. Do the math. Okay? And as much as I hate to say it, Amin's infrastructure org is literally one of the most efficient on the planet because, you know, there was a time when I was starting out Amp and we were looking at how much single cluster utilization was across the industry and, uh, some of our portfolio companies, you know, some of the speakers here were running them at 70, 80% utilization, and some of the other big tech companies were similar, in fact, worse. I'm sure you saw that, um, you know, the Colossus cluster is not running at peak utilization, and I think it's at 11% MFU, which is real... honestly, MFU is kind of hard to get up. But at Google, my understanding is if the, if the node allocation is less than 96%, it's considered a major outage.

  22. SP

    Yeah, so I think-

  23. SP

    Is that right?

  24. SP

    ... what, what this, uh, really points to is when you hear numbers like, uh, $40 billion, uh, per gigawatt, and I've heard numbers like $50 billion a gigawatt from, uh, other sources, the numbers are going up. Things are getting more expensive. I think the, the most important consideration isn't how many gigawatts you have, it's how much capability and value you're delivering to your users, and this is something to really be aware of. In other words, if I've got a gigawatt here and a gigawatt there, they're not the same. How much reliability you have actually really, really matters. Like, I could go spend $40, $50 billion on a gigawatt, and if I don't do the work to make sure that every one of those nodes is super reliable... So a gigawatt, it's... let's say that's 150, 200,000 TPUs, GPUs. It could be whatever you want. One of those goes down, maybe your whole computation stops. If you're not, A, making sure it doesn't fail, B, when it does fail, figuring out which one it is and getting it repaired really fast, you just wasted a lot of money because your utilization and what we call your goodput-

  25. SP

    Mm

  26. SP

    ... is nowhere near what it, it needs to be. If you have the TPUs deployed, but no one can schedule a job on them, it doesn't matter how much money you spent on them. So I think that a lot of these measures are actually broken. The measure isn't how much money you spent per gigawatt, it's actually how much value you deliver per dollar. And if I can spend half the money, deploy half the capacity, and give you the same capability, awesome. Better, if I can deliver twice the value from that gigawatt, I now need to build fewer gigawatts.
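The goodput argument in this exchange can be illustrated with a rough model. The sketch below is not Google's method: it assumes independent node failures, that any single failure halts the whole synchronous job, and that each failure costs a fixed recovery time. The 150,000 accelerators and $40 billion per gigawatt come from the conversation; the per-node mean time between failures (`NODE_MTBF_HOURS`), both recovery times, and the function names are hypothetical values chosen for illustration.

```python
def job_goodput(num_nodes: int, node_mtbf_hours: float, recovery_hours: float) -> float:
    """Fraction of wall-clock time a synchronous job spends making progress.

    Assumes independent node failures, that one failure stops the whole job,
    and that every failure costs `recovery_hours` to detect, repair, and restart.
    """
    # With N nodes, failures arrive N times as often as on a single node.
    job_mtbf_hours = node_mtbf_hours / num_nodes
    return job_mtbf_hours / (job_mtbf_hours + recovery_hours)


def effective_accelerators(num_nodes: int, goodput: float) -> float:
    """Accelerators' worth of useful work actually delivered."""
    return num_nodes * goodput


NUM_NODES = 150_000         # low end of the per-gigawatt range quoted in the talk
CAPEX_BILLIONS = 40.0       # dollars per gigawatt quoted in the talk, in billions
NODE_MTBF_HOURS = 50_000.0  # hypothetical per-node mean time between failures

slow = job_goodput(NUM_NODES, NODE_MTBF_HOURS, recovery_hours=1.0)     # hypothetical slow repair
fast = job_goodput(NUM_NODES, NODE_MTBF_HOURS, recovery_hours=5 / 60)  # hypothetical fast repair

for name, goodput in [("1 h recovery", slow), ("5 min recovery", fast)]:
    useful = effective_accelerators(NUM_NODES, goodput)
    print(f"{name:>15}: goodput {goodput:.0%}, "
          f"{useful / CAPEX_BILLIONS:,.0f} useful accelerators per $1B")
```

With these made-up inputs the same gigawatt delivers roughly three times more useful work when failures are found and repaired in minutes instead of an hour, which is the sense in which two gigawatts "are not the same."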

  27. SP

    Okay.

  28. SP

    Or I can only get so many gigawatts. Energy's massive problem.

  29. SP

    And, um, you know, we had Jensen here last week, and one of the questions I asked him is, "How do you..." He said something similar, which was, honestly-

  30. SP

    Is this why everybody's laptop is signed by Jensen?

Episode duration: 1:04:22


Transcript of episode VeTqsCpcDgg
