
Stanford CS153 Frontier Systems | The Discipline of Delivering Value per Gigawatt

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai

To follow along with the course schedule and syllabus, visit: https://cs153.stanford.edu/

In a CS153 Frontier Systems lecture, the class returns to the upstream infrastructure stack with Amin Vahdat, who leads Google's internal compute infrastructure and the TPU program powering Gemini, framing his nearly 30-year career as the discipline of building reliable, balanced supercomputers at a planetary scale. Vahdat argues the industry is over-fixated on gigawatts and flops as headline metrics: at roughly $40 to $50 billion per gigawatt, the question that matters is value delivered per dollar, measured in happy daily active users and paying enterprise customers, not raw capacity.

He walks through the three constraints that govern utility. Reliability, where moving from 99 percent to 99.9 percent uptime closes a 3.65-day annual gap, and where frontier labs are newly willing to trade five-nines for double the capacity. System balance, invoking Amdahl's 1967 law that every million instructions per second needs a megabyte per second of I/O, now stretched across 100,000-node synchronous training jobs where a single failed node halts the entire computation. And procurement lead times of two to three years for net-new gigawatts, where land permitting, utility contracts, and 20-year take-or-pay power agreements have replaced the slack capacity that once let hyperscalers ask for ten megawatts on a handshake.

He details Google's optical circuit switch architecture, which uses 136 MEMS-controlled mirrors per chip to programmatically rewire the 3D torus topology connecting TPU racks, allowing failed racks to be virtually swapped in seconds and bandwidth redirected to distant storage clusters for the duration of a five-hour Borg job.

Vahdat closes on responsibility: data centers should be a net uplift to local grids and communities, citing Google's choice to accept 10 percent worse power efficiency in water-scarce regions and its gigawatt-scale demand response program that returns capacity to utilities during peak residential load.

Amin Vahdat is a Fellow and Chief Technologist for AI Infrastructure at Google, where his team is responsible for delivering industry-leading infrastructure which spans custom silicon, data centers, network, and supply chain and operations. This infrastructure serves Alphabet, Google and the world, and Artificial Intelligence technologies that empower ML developers and solve customers’ most pressing business challenges. In the past, he was Vice President and General Manager for Google's compute, storage, and network hardware and software infrastructure. Until 2019, he was the Technical Lead and Vice President for the Networking organization at Google.

Before joining Google, Amin was the Science Applications International Corporation (SAIC) Professor of Computer Science and Engineering at UC San Diego (UCSD). He received his doctorate from the University of California Berkeley in computer science, and is a Fellow of the Association for Computing Machinery (ACM). Amin has been recognized with a number of awards, including the National Science Foundation (NSF) CAREER award, the UC Berkeley Distinguished EECS Alumni Award, the Alfred P. Sloan Fellowship, the Association for Computing Machinery's SIGCOMM Networking Systems Award, and the Duke University David and Janet Vaughn Teaching Award. Amin was awarded the SIGCOMM lifetime achievement award for his contributions to data center and wide area networks. He was inducted into the National Academy of Engineering in 2023 for his contributions to the design and implementation of datacenter and planet-scale networks that power cloud computer systems.
Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

May 27, 2026 · 1h 4m

EVERY SPOKEN WORD

60 min read · 12,410 words

0:09 – 5:44
Why “value per gigawatt” beats raw capacity metrics
1. SPSpeaker
  Thank you so much for joining us, Amin. Please give me a round of applause for Amin Vahdat. [audience applauding] You guys have no idea how hard it was to get Amin to show up. Seriously, this is the one lecture that, um, I've been super excited about, and Sebastian, who many of you know, um, who is my co-founder in Amp, wanted to be here, and he's so bummed that he couldn't because he's busy working on the cluster for your guys' final projects. Uh, Sebastian worked on the Borg, X-Borg GQM scheduler. De-designed that, too. So we are very much, uh, a, a Google family over at Amp, and so Amin's a bit of a rock star in, in our kind of lore. Um, so to give you guys some context, Amin, well, is the head of... basically in charge of the internal infrastructure at Google. The TPUs that make Gemini possible really would not be at anywhere close to the scale they are at if it wasn't for Amin. Okay? So pay attention to every word he says. Like, think about him as the opposite of Jensen. You know, Jensen, like, is a rapid fire, high throughput LLM. Um, think about Amin kind of as, like, the distillation of, like, three frontier models who have been trained on, like, frontier in... like, the inf... practice and discipline of infrastructure for the last... How long you been doing this, Amin?
2. SPSpeaker
  Coming up on 30 years, I'm sad to say.
3. SPSpeaker
  30 years.
4. SPSpeaker
  Mm-hmm.
5. SPSpeaker
  And so every word Amin speaks has, like... every token that he produces as an LLM has, like, universes contained in them, okay? And we, we, we will probably not understand what he actually means for years, so I'm glad this is gonna be recorded and put up on YouTube, 'cause I think years from now, people will look back at his lecture and realize how profound his influence was on the, on the industry. Um, you know, to concretize that, uh, how much, uh, compute does the internal pool at Google have today, Amin?
6. SPSpeaker
  I'll start off with the easy question that I can't answer.
7. SPSpeaker
  Yeah. [laughs]
8. SPSpeaker
  Um, I, I've seen some Twitter posts that say we have among the largest computing infrastructures in the whole planet, and I think that... I'm, I'm willing to s-stand up behind that one.
9. SPSpeaker
  Okay.
10. SPSpeaker
  Yeah.
11. SPSpeaker
  Would you say it's in the tens of gigawatts?
12. SPSpeaker
  Tens of gigawatts. Um, I will say that, uh, we are aiming for tens of gigawatts.
13. SPSpeaker
  Over the next four years-
14. SPSpeaker
  Yeah
15. SPSpeaker
  ... it'll be well in the-
16. SPSpeaker
  Oh-
17. SPSpeaker
  ... north of tens of gigawatts.
18. SPSpeaker
  Over some, some time period, yeah.
19. SPSpeaker
  Yeah.
20. SPSpeaker
  Yeah.
21. SPSpeaker
  So we crunched the numbers this morning. We think about one gigawatt to build out is about how much? Okay, so o-one gigawatt is about $40 billion of infrastructure. Do the math. Okay? And as much as I hate to say it, Amin's infrastructure org is literally one of the most efficient on the planet because, you know, there was a time when I was starting out Amp and we were looking at how much single cluster utilization was across the industry and, uh, some of our portfolio companies, you know, some of the speakers here were running them at 70, 80% utilization, and some of the other big tech companies were similar, in fact, worse. I'm sure you saw that, um, you know, the Colossus cluster is not running at peak utilization, and I think it's at 11% MFU, which is real... honestly, MFU is kind of hard to get up. But at Google, my understanding is if the, if the node allocation is less than 96%, it's considered a major outage.
22. SPSpeaker
  Yeah, so I think-
23. SPSpeaker
  Is that right?
24. SPSpeaker
  ... what, what this, uh, really points to is when you hear numbers like, uh, $40 billion, uh, per gigawatt, and I've heard numbers like $50 billion a gigawatt from, uh, other sources, the numbers are going up. Things are getting more expensive. I think the, the most important consideration isn't how many gigawatts you have, it's how much capability and value you're delivering to your users, and this is something to really be aware of. In other words, if I've got a gigawatt here and a gigawatt there, they're not the same. How much reliability you have actually really, really matters. Like, I could go spend $40, $50 billion on a gigawatt, and if I don't do the work to make sure that every one of those nodes is super reliable... So a gigawatt, it's... let's say that's 150, 200,000 TPUs, GPUs. It could be whatever you want. One of those go-goes down, maybe your whole computation stops. If you're not, A, making sure it doesn't fail, B, when it does fail, figuring out which one it is and getting it repaired really fast, you just wasted a lot of money because your utilization and what we call your goodput-
25. SPSpeaker
  Mm
26. SPSpeaker
  ... is nowhere near what it, it needs to be. If you have the T-TPUs deployed, but no one can schedule a job on them, it doesn't matter how much money you spent on them. So I think that a lot of these measures are actually broken. The measure isn't how much money you spent per gigawatt, it's actually how much value you deliver per dollar. And if I can spend half the money, deploy half the capacity, and give you the same capability, awesome. Better, if I can deliver twice the value from that gigawatt, I now need to s- build fewer gigawatts.
27. SPSpeaker
  Okay.
28. SPSpeaker
  Or I can only get so many gigawatts. Energy's massive problem.
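The point about goodput can be made concrete with a little arithmetic. The sketch below is illustrative only: the $40 billion per gigawatt and the 150,000 to 200,000 accelerators per gigawatt come from the conversation, while the goodput fractions and both function names are assumptions introduced here. Note that dividing a gigawatt by the accelerator count gives a facility-level power budget per accelerator (cooling, CPUs, storage, and networking included), not the power draw of a chip.

```python
def cost_per_effective_gw(cost_per_gw_usd: float, goodput: float) -> float:
    """Dollars spent per gigawatt that is actually doing useful work."""
    if not 0.0 < goodput <= 1.0:
        raise ValueError("goodput must be in (0, 1]")
    return cost_per_gw_usd / goodput


def kilowatts_per_accelerator(gigawatts: float, accelerators: int) -> float:
    """Facility power budget per accelerator, all overheads included."""
    return gigawatts * 1_000_000 / accelerators  # 1 GW = 1,000,000 kW


if __name__ == "__main__":
    # Goodput fractions are hypothetical; the talk gives no figures.
    for goodput in (0.5, 0.9):
        dollars = cost_per_effective_gw(40e9, goodput)
        print(f"goodput {goodput:.0%}: ${dollars / 1e9:.1f}B per effective GW")
    for count in (150_000, 200_000):
        kw = kilowatts_per_accelerator(1.0, count)
        print(f"{count:,} accelerators per GW: {kw:.2f} kW each")
```

Under these assumptions, halving goodput doubles the effective price of every gigawatt, which is the sense in which two gigawatts "are not the same."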
29. SPSpeaker
  And, um, you know, we had Jensen here last week, and one of the questions I asked him is, "How do you..." He said something similar, which was, honestly-
30. SPSpeaker
  Is this why everybody's laptop is signed by Jensen?
5:44 – 7:03
Measuring outcomes: DAUs, revenue, and “intelligence per dollar”
1. SPSpeaker
  So how do you measure... intelligence, you know, o-output per unit of input, right? It's ultimately what as a systems person we're trying to optimize. And if the output is this very heterogeneous output, which is to- coding tokens, image tokens, and so on, like, but the input is this generalizable input called compute-
2. SPSpeaker
  Yeah
3. SPSpeaker
  ... or flop, so to speak. What, how, how do we reconcile the fact that the, the evals are just different?
4. SPSpeaker
  We're, we're... It's a, a tough, uh, close to impossible question to answer. We are working on benchmarks that measures intelligence per dollar, actually, and we've published some things externally, and I can send f- uh, folks, uh, references out, out of Google-
5. SPSpeaker
  That'd be great
6. SPSpeaker
  ... uh, broadly, uh, that, uh, captures this question of intelligence and then its intelligence per dollar. But what I really want to emphasize, though, is that it is, um, how much you're actually getting out of it. And so another way to look at it is per gigawatt, how much revenue are you generating? Maybe revenue is not the right measure. How many daily active users do you have for your service? So it's not how many gigawatts do you have. It's-
7. SPSpeaker
  How many daily active users per sh-- Okay, got it.
8. SPSpeaker
  Right. Like, if I'm doing Gemini App and I have a gigawatt behind it, no one cares that I have a gigawatt behind it, or two, or four, or half. It's how many daily active users do I have who are happy? Right? How many-
9. SPSpeaker
  I see
10. SPSpeaker
  ... and then how is that growing?
11. SPSpeaker
  Okay.
12. SPSpeaker
  Right? And if-- And now the question is: how do I deliver... So this is where the efficiency part comes in.
13. SPSpeaker
  Yeah.
7:03 – 9:43
Infrastructure is an orchestration problem (compute + storage + network + CPUs)
1. SPSpeaker
  I wanna make sure that every TPU is up. But by the way, if I have a bunch of TPUs and I don't have the compute, and the storage, and the networking to go along with it, then it doesn't matter how many TPUs I have, especially in the age of agents. You-- Actually, it's a orchestration of the whole. Right? Because if I'm having all my expensive TPUs sitting around idle, waiting for an agent to finish running its simulation through a CPU that had to go get some data from the storage that might be in a whole another region, that's a problem.
2. SPSpeaker
  Okay.
3. SPSpeaker
  So it's the inc-- orchestration as a whole. I think there's too much fixation on how many gigawatts of capacity we have. By the way, I spend a lot of time making sure that we have a lot, a lot of megawatts, a lot of gigawatts of capacity, so I get it. But there isn't enough on how much value are you getting out of it. Are you extracting the most utility out of every machine that you build and deploy?
4. SPSpeaker
  And so if the, if you've closed the loop to say... I think what I'm hearing you say is the, the eval is the business metric-
5. SPSpeaker
  Exactly
6. SPSpeaker
  ... that matters. In the case of Google, it's daily active users or whatever for the Gemini app. But the challenge as an infrastructure person is, w-which you have a extraordinary history and background doing, is you're always trying to gener- design general primitives, right, that are not overspecified for, for a particular output.
7. SPSpeaker
  Yeah.
8. SPSpeaker
  And if intelligence is a humanity scale measure, then how do you reconcile the difference between designing an infrastructure primitive that's general for all of humanity, but that might not align with the specific measure of intelligence that matters to Google? Does that question make sense?
9. SPSpeaker
  It, it makes sense. I think it's a, a phil- great philosophical question. The good news is, in practice, what we do care about are the, uh, business outcomes, because w-we have to believe, and it turns out to be accurate, that people are gonna vote with their feet and use the services that are giving them value. In other words, if we have Gemini DAUs and they're growing at a certain rate, uh, f-for whatever reason, if it's competing against ChatGPT, or Claude, or, uh, Grok, or whatever else, if people are using it-
10. SPSpeaker
  Uh
11. SPSpeaker
  ... they're voting with their feet, they must be getting the intelligence and the utility that they need. If they're using, uh, coding in one scheme versus another, if we're delivering the value... Now, a lot of this does come down to how many flops do you have, how much HBM bandwidth do you have, how much ICI or NVLink or whatever else bandwidth do you have.
12. SPSpeaker
  Yeah.
13. SPSpeaker
  All these low-level measures matter, but in the end, where it rolls up to is happy users, paying enterprise customers, uh, developers who are getting their work done. That's what we're trying to maximize. And so if we have capacity that is sitting around idle, that's a bug.
14. SPSpeaker
  Right. Okay. Got it.
9:43 – 12:25
Reliability vs capacity: relaxing ‘five nines’ for frontier training throughput
1. SPSpeaker
  The value that's delivered is a great metric. And so what we have to now make sure is when we have these gigawatts of capacity, it-- the infrastructure layer is fascinating because there are thousands, millions of things that can go wrong. You, you know this very well. And each of them, unfortunately, matter. And so it's about systematically going after it. And, and so in other words, there is no major breakthrough when we say, "Hey, um, if-- in going from ninety-nine percent availability to ninety-nine point nine percent availability, super hard." Uh, one would think ninety-nine percent reliability, that's pretty good. If you think about it, though, that means that three point six five days of the year you're down. That's not good.
2. SPSpeaker
  Right.
3. SPSpeaker
  That's, in fact, might be unacceptable. Now, though, I wanna go, come back to power for a second, because power oftentimes is your, uh, biggest constraint. You talked about eleven percent, um-
4. SPSpeaker
  MFU
5. SPSpeaker
  ... MFU. If you look across all the fleets, I won't tell you what the numbers are, but if you look at the amount of power provisioned at the edge of a data center region and how much power is actually used by the compute, it's probably a lot lower than you want it to be. Reason number one, overprovisioning for reliability. So in other words, to really get to what the power, uh, service wants, which is five nines of availability, which means thirty seconds of downtime a year, you basically have to have 2N, one-plus-one redundant feeds. One goes away, the other basically switches over immediately. That means that half your power capacity is not being used at any given point in time. That's what it takes to deliver five nines of reliability. Now, though, what if you go to your customers and say, "Hey, would you rather have ninety-nine point, let's say, nine percent reliability and double the capacity or ninety-nine point nine nine nine percent reliability and half the capacity?" Historically, the answer would have been, "Give me the five nines. I can't take the outage." Today, though, if you go to the frontier labs and say, "Would you rather have twice the capacity, but then three point six five days of the year or point three six five days of the year you don't get any of it?" They'll say, "Oh, yeah, sign me up." Give me more capacity. I'll take the downtime.
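The availability figures in this exchange follow from simple arithmetic, sketched below. The function names and the "relative capacity" framing are assumptions made for illustration. Run as written, 99% comes to about 3.65 days of downtime a year and 99.9% to about 0.365 days, matching the numbers quoted; by the same formula 99.999% comes to roughly 5.3 minutes a year, and a figure near thirty seconds corresponds to six nines (99.9999%).

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def delivered_capacity(relative_capacity: float, availability: float) -> float:
    """Capacity averaged over the year, counting downtime as zero output."""
    return relative_capacity * availability


if __name__ == "__main__":
    for label, a in [("99%", 0.99), ("99.9%", 0.999),
                     ("99.999%", 0.99999), ("99.9999%", 0.999999)]:
        hours = downtime_hours_per_year(a)
        print(f"{label:>9}: {hours / 24:.4f} days = {hours * 60:.2f} minutes")

    # The trade described above: redundant feeds at five nines versus
    # using both feeds for twice the capacity at three nines.
    print(delivered_capacity(1.0, 0.99999), delivered_capacity(2.0, 0.999))
```

For a throughput-bound workload such as training, the second option delivers nearly twice the yearly capacity, which is consistent with the preference the speaker reports.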
6. SPSpeaker
  Is that a new phenomenon, or is that-
7. SPSpeaker
  It's a recent phenomenon. It's a recent phenomenon, right? Because again, now if you're, if you're delivering... historically, if you're delivering an enterprise-grade service, it's five-nines. Can't be down. But training a frontier model, it's about throughput. You'll take the downtime for a day or two days or three days a year if the other 362 days of the year... I'm not speaking for everyone, I'm just, just-
8. SPSpeaker
  But by, by and large, your customers are telling you, your internal customers are saying-
9. SPSpeaker
  Yeah, internal and some external
10. SPSpeaker
  ... we will take access over reliability.
12:25 – 14:06
Why ML clusters break ‘internet-style’ resiliency assumptions
1. SPSpeaker
  Yes. And this is a fascinating new development, but now even getting to that 99.9%, thousand things can go wrong. Because the thing I wanna emphasize is, if we're serving a frontier model, that's hundreds, perhaps thousands of TPUs or GPUs. Doesn't matter. If we're training, it's tens of thousands, perhaps more, of the same accelerators, but the computation is synchronous. What this means is that basically all of the TPUs, all the GPUs are talking to each other synchronously. They're distributing data, all reduce, all gather, uh, whatever else it is. One of the nodes goes down, everything goes down. So literally, it's... And again, how, how do we build internet scale web services to this day? To da- to date, if you're building web search, it's designed basically to have any rack go away at any point in time and no one notices. We barely notice. We do notice. We, we'll go get it fixed, but there is no, uh, downtime, no outage. Why? Because we have a backup for all the data on that rack in at least one other place in that same cluster and we have spare compute capacity and it's fungible. So if you think about TPU or GPU training inference, every node is special. Every node has a spec- specific expert, whatever layer in the overall model that it's serving. If it goes away, propagation stops, serving stops. So how you manage these things to actually deliver the value at scale completely changes. And so everything that we developed over the past 20, 25 years that said, loose coupling, don't worry about individual failures, all that's gone out the window, too.
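The contrast between a replicated web service and a synchronous training job can be sketched with two formulas. This is a simplified model, assuming independent node failures, no hot spares, and no checkpoint recovery; the per-node availability and the function names are hypothetical. A synchronous job needs every node up at once, so its availability is the per-node availability raised to the number of nodes, while a replicated service is down only when every replica is down.

```python
def synchronous_job_availability(node_availability: float, nodes: int) -> float:
    """All nodes must be up simultaneously for the job to make progress."""
    return node_availability ** nodes


def replicated_service_availability(node_availability: float, replicas: int) -> float:
    """The service is unavailable only if every replica is down at once."""
    return 1.0 - (1.0 - node_availability) ** replicas


if __name__ == "__main__":
    per_node = 0.9999  # hypothetical per-node availability
    for n in (100, 1_000, 10_000, 100_000):
        up = synchronous_job_availability(per_node, n)
        print(f"{n:>7} nodes: job can run {up:.2%} of the time")
    print(f"2 replicas at 99%: {replicated_service_availability(0.99, 2):.4%}")
```

Even with four nines per node, the model puts a 10,000-node synchronous job below 40% availability, which is why fast detection and repair matter more here than in loosely coupled services.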
14:06 – 18:44
System balance and Amdahl’s Law: compute is easy, coordinated supercomputers aren’t
1. SPSpeaker
  Do you believe flops should flow like megawatts?
2. SPSpeaker
  Well, they're, they're closely related, but as you said, um, what, what I really believe is, uh, system balance is what matters most. And so if you are overfixated on flops and you don't have enough HBM bandwidth, or if you don't have enough SRAM or if you don't have enough network bandwidth-
3. SPSpeaker
  I see
4. SPSpeaker
  ... then it doesn't matter how much flops you have. Like, we, we, we can build infinite flops and not connect it, and connect it via, via thin pipes to one another or put very little HBM bandwidth or, and very little HBM capacity. That's easy. Scaling flops is easy. Building a coordinated supercomputer that scales out to 10,000, 100,000-ish TPUs that has the right balance point, super hard, and this balance point is the, the key, key insight. So, um, I'll share with you all, um... I, I, I used to be a professor. I, I love this room by the way. Uh, s- seeing, seeing this room, I, I ta- I, I took undergraduate classes in a room like this up at a, a another school up the road at Berkeley. Uh-
5. SPSpeaker
  We're, we're, we're, we're equal opportunity-
6. SPSpeaker
  Yeah
7. SPSpeaker
  ... systems people, right guys?
8. SPSpeaker
  Y- yes.
9. SPSpeaker
  Yes.
10. SPSpeaker
  Well, Berk- it turns out Berkeley does pretty work- good work in systems as well.
11. SPSpeaker
  Yes.
12. SPSpeaker
  Uh-
13. SPSpeaker
  Yeah, it's great
14. SPSpeaker
  ... but, uh, one of the things that I loved learning most about and that has really stayed with me, I'll share it with you all in case you don't know it, is, uh, Amdahl's law. Who here knows about Amdahl's law?
15. SPSpeaker
  Oh, no. No. Amdahl's law?
16. SPSpeaker
  Amdahl's law. Okay, good.
17. SPSpeaker
  Sorry.
18. SPSpeaker
  Good.
19. SPSpeaker
  I failed as a professor.
20. SPSpeaker
  Yeah, you did.
21. SPSpeaker
  Please go ahead. [laughs]
22. SPSpeaker
  Okay. So the Amdahl's law of system balance. Basic- this is late '60s, so before I was born. Um, he came up with this law that said for every million instructions per second that you built into your parallel system, your, your distributed computation, you would need a megabyte per second of I/O. So in other words, if you're going to provision a million instructions per second, think of it as flops today, you better have that I/O to back it up because compute without data is useless, and you have to be able to feed it. And now shockingly, over just- it was 1967 he came up with this, so almost 60 years. This has held. Now, he was building small scale in the late '60s. Now we're talking about 10,000, 100,000, sometimes spread across even a wide area network. You have to provision a network because a- almost all your data is across a network today. So your I/O is networked I/O. You have to provision for every some number of flops, some amount of HBM bandwidth-
23. SPSpeaker
  Mm
24. SPSpeaker
  ... some amount of network bandwidth, or you're going to starve, you're going to, uh, basically waste your money. If you don't build to this ratio, you'll have huge amount of flops that aren't doing anything. To some extent, this is what's happening today with the very low MFU utilization that, uh, we have. Why? Because with the m- with the move to mixture of experts, sparse computation, actually the hardware today, all of it actually, isn't built at the right system balance point to manage the fact that actually you now need a lot more memory bandwidth relative to the computation ratios.
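One common way to reason about this kind of balance is a roofline-style bound: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by the work done per byte moved. The sketch below uses that model as an illustration, not as the method described in the talk; the chip numbers are hypothetical and the function names are introduced here. The balance rule is encoded as quoted in the conversation, one megabyte per second of I/O per million instructions per second.

```python
def balanced_io_bytes_per_s(instructions_per_s: float) -> float:
    """Balance rule as quoted in the talk: 1 MB/s of I/O per 1 MIPS,
    which works out to one byte of I/O per instruction."""
    return instructions_per_s * 1.0


def attainable_flops(peak_flops: float, bandwidth_bytes_per_s: float,
                     flops_per_byte: float) -> float:
    """Roofline bound: limited by compute or by how fast data arrives."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


def utilization(peak_flops: float, bandwidth_bytes_per_s: float,
                flops_per_byte: float) -> float:
    """Fraction of peak compute the workload can use under the bound."""
    return attainable_flops(peak_flops, bandwidth_bytes_per_s,
                            flops_per_byte) / peak_flops


if __name__ == "__main__":
    peak = 1e15       # hypothetical accelerator: 1 PFLOP/s
    bandwidth = 2e12  # hypothetical memory bandwidth: 2 TB/s
    for intensity in (50, 100, 500, 1000):  # FLOPs per byte moved
        u = utilization(peak, bandwidth, intensity)
        print(f"{intensity:>5} FLOP/byte -> {u:.0%} of peak")
```

In this model a workload that does fewer FLOPs per byte, as sparse mixture-of-experts computation tends to, falls on the bandwidth-limited side and leaves most of the peak compute idle.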
25. SPSpeaker
  Mm. Mm.
26. SPSpeaker
  So when you think about evaluating your systems utilization, super key. Uh, the reliability part, I really want to get this across, but then system balance is also super key. If you don't have the right system balance, you're wasting your money. So when you say $40, $50 billion per gigawatt, yes, but if you had to spend $55 billion and make sure that that gigawatt was balanced or reliable, you'd do it. So I think-
27. SPSpeaker
  I see
28. SPSpeaker
  ... the, the key here is because otherwise you're not gonna get the value out of it. If you say, "Hey, I've... When with my gigawatt, I got all these gigaflops, teraflops, petaflops, exaflops, yottaflops," whatever it is, awesome. But what do you actually get out? And what you get out depends on system balance, and it depends on reliability. But now, going back to the agents, system balance isn't just for your TPUs and GPUs, it's the balance to the CPUs that are sitting next door, the storage that's sitting next door or in the next rack, the network that connects it all together, right? The da- not, not the high-speed NVLink or ICI network, but the data center network that connects it all together.
29. SPSpeaker
  It breaks my brain a little bit to try to figure out how do you decouple the individual bottlenecks in the memory storage bandwidth, uh, supply chains and align that in a predictable fashion to accomplish system balance? H- how does one even-
30. SPSpeaker
  Yeah
18:44 – 20:08
Balance at 100,000-node scale: why 100% MFU is impossible
1. SPSpeaker
  So this is, um, for those of you who took whatever architecture, undergraduate or graduate architecture, you've got your, uh, seven-stage pipeline, right? With the, uh, instruction fetch and decode and, and access, right? And how do you actually... And h- that's how we got superscalar performance.
2. SPSpeaker
  Right.
3. SPSpeaker
  Seven stages, super complicated, uh, within the core. Now, we've got like 127 stages.
4. SPSpeaker
  Right.
5. SPSpeaker
  Within a CPU, it's possible to get that, uh, microarchitecture more or less balanced, but even there, h- getting the right balance point is, is super tough. That's why you get pipeline bubbles, that's why you, you say, "Okay, how many cycles per instruction do I really have, and how do I drive that, uh, down, actually?"
6. SPSpeaker
  Okay.
7. SPSpeaker
  So now extend this out across 100,000 nodes. It is an impossibility. 100% MFU is not possible. So that, that should be like the... I mean, you could with a toy, uh, just like chart it out and say go, but in general, for a real computation, you're not gonna get perfect balance because there's... Like let's say there's just little micro variation in one cache hit rate of one TPU, GPU versus another. That would cause a pipeline bubble, right? So now your MFU, because now you're waiting for the data to come from another-
8. SPSpeaker
  Right
9. SPSpeaker
  ... node, your MFU just went down.
10. SPSpeaker
  So you have this compounding-
11. SPSpeaker
  Yep, and it'll multiply.
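The straggler effect described here can be sketched with a toy model. It assumes each synchronous step lasts as long as the slowest node, so every other node idles for the difference; the timings and the function name are hypothetical, and the result is an upper bound on utilization from this one effect, not an MFU measurement.

```python
def synchronous_step_utilization(busy_times: list[float]) -> float:
    """Fraction of node-time spent busy when a step lasts as long as
    the slowest node and everyone else waits at the barrier."""
    if not busy_times:
        raise ValueError("need at least one node")
    step_time = max(busy_times)
    return sum(busy_times) / (len(busy_times) * step_time)


if __name__ == "__main__":
    # 99 nodes finish in 1.0 time units; one node is 10% slower.
    times = [1.0] * 99 + [1.1]
    print(f"one slow node out of 100: {synchronous_step_utilization(times):.1%}")
    print(f"perfectly balanced:       {synchronous_step_utilization([1.0] * 100):.1%}")
```

A single node running 10% slow costs the whole job about 9% of its time in this model, and the larger the job, the more likely some node is slow on any given step.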
12. SPSpeaker
  And let's talk for a sec, 'cause what you described is the computational bottleneck. I'm talking about now you add on-
13. SPSpeaker
  Network
14. SPSpeaker
  No, no, procurement.
15. SPSpeaker
  Oh. [laughs]
16. SPSpeaker
  Like, how do you dec- like...
17. SPSpeaker
  Procurement
20:08 – 24:08
Procurement and lead times as a technical constraint (not just business ops)
1. SPSpeaker
  I mean, literally the, the world can't produce enough memory. I-
2. SPSpeaker
  Yeah.
3. SPSpeaker
  Uh-
4. SPSpeaker
  Yes
5. SPSpeaker
  ... I'll ask you if this is true or not.
6. SPSpeaker
  We can talk about that, yeah.
7. SPSpeaker
  There's, there's reports that one of the frontier labs cornered the market on memory recently through buying a bunch of call options, and then the rest of the industry revolted. Is that true?
8. SPSpeaker
  I don't know if it's true or not. I, I read the same, uh, f- or I, I, I can't, uh, keep up with the X, so I have the same feed and whatever things-
9. SPSpeaker
  This is from a group chat this morning, but okay. [laughs]
10. SPSpeaker
  Group chat this morning. Okay, this was, this actually came out, um, three or four months ago.
11. SPSpeaker
  Oh, then this, the group chats are behind.
12. SPSpeaker
  Yeah. It's, um, this one, in this particular case, the group chat is behind. You know, these things, uh, yes, uh, the supply chain is a massive, massive issue. I'm now responsible for the supply chain and, and procurement. Uh, the problem is that things just continue to go up and up and up every month, and the lead time is years. So in other words, basically, if you wanna say, "I want a gigawatt of capacity," if I want a net new gigawatt of capacity, my lead time is somewhere around two or three years. Like, doesn't matter if I've got my $40 or $50 billion. Just for buying everything and building it, it's a very physical process. So gigawatt end-to-end, I gotta go get that, uh, capacity of power somewhere.
13. SPSpeaker
  We have a final project here, which is the one person Frontier lab.
14. SPSpeaker
  Okay.
15. SPSpeaker
  And they have increasingly less time, but look, the, the project is a microcosm of life, and what you just heard is Amin saying there's a bottleneck he can't throw more money at to clear.
16. SPSpeaker
  For sure.
17. SPSpeaker
  So if you could prompt them to solve it from a technological perspective, what could they do to help to unblock that bot- that bottleneck?
18. SPSpeaker
  And we're going after it, uh, on, on multiple fronts because pulling that in... In other words, if I had the ability, so many times, uh, so many times, actually, if I had the ability to go spend more money and get more capacity tomorrow, it'd be an easy decision. But if you're saying, "Hey, you now have to commit to how much capacity you want in two years' time. Commit. Like you can't-- no going back. Today, you have to say exactly how much capacity you need in two years' time." Okay. Basically, there's gonna be one of two outcomes. There's a third that's infinitesimally small probability. Outcome number one is you predict too little, and then you're gonna be really upset that you're leaving opportunity on the floor. Outcome two is you over-predicted, and now you wasted a bunch of money. There's some other possibility, which says you predicted perfectly, which never happens. So if you could pull that in and now you said, "Okay, how much capacity do you need tomorrow?" You're probably gonna nail it. Or if you over-predict, you over-predict by, you know, 0.05% or something.
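The two outcomes described above can be written as a small cost function. This is a deliberately simple sketch: the build cost per gigawatt is the $40 billion figure from earlier in the conversation, while the value lost per missing gigawatt, the forecast errors, and the function name are hypothetical assumptions.

```python
def misprediction_cost(actual_gw: float, committed_gw: float,
                       build_cost_per_gw: float,
                       lost_value_per_gw: float) -> float:
    """Cost of committing to the wrong amount of capacity.

    Over-commit: the surplus capacity was paid for but is not needed.
    Under-commit: the shortfall is opportunity left on the floor.
    """
    if committed_gw >= actual_gw:
        return (committed_gw - actual_gw) * build_cost_per_gw
    return (actual_gw - committed_gw) * lost_value_per_gw


if __name__ == "__main__":
    build = 40e9  # dollars per gigawatt, as quoted in the talk
    lost = 60e9   # hypothetical value of a gigawatt you could not serve
    # Hypothetical relative errors: a long lead time versus the
    # "0.05% or something" the speaker mentions for a next-day forecast.
    for error in (0.5, 0.0005):
        over = misprediction_cost(1.0, 1.0 + error, build, lost)
        under = misprediction_cost(1.0, 1.0 - error, build, lost)
        print(f"error {error:.2%}: over ${over / 1e9:.2f}B, under ${under / 1e9:.2f}B")
```

Because the penalty scales with the size of the forecast error, shortening the lead time shrinks the cost in both directions.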
19. SPSpeaker
  Mm-hmm.
20. SPSpeaker
  How do you pull that lead time in? And actually, the, this is a technical problem. This is truly a technical problem, where from procurement to manufacturing... Like right now, if I wanted to have a gigawatt, I'd have to go build a new building. A big building. Probably multiple buildings, actually. I ha- what does that mean? I have to go now and get some land. Maybe I've got some land buffered up, but if I don't, I'm in trouble because I now have to go do permitting.
21. SPSpeaker
  That'll-
22. SPSpeaker
  Six months.
23. SPSpeaker
  Right. Indeterminate, yeah.
24. SPSpeaker
  Right. Who, who knows, et cetera. So net, but by saying, "Okay, well, you know what? The land is kinda cheap, so let me have a bunch of land on the side." Okay, now is the land prepared for a building to go down? Actually, you probably have to grade it. Okay, let's, let's go ahead and spend the money to grade it ahead of time, too. Now I'm ready. But now I, I put down the pad. Do I go procure the power? That starts getting expensive. Do I go to the utility? The utility now, everybody's going to the utility saying, "I want a gigawatt. I want five gigawatts. I want 10 gigawatts." They'll say, "Sure, I'll get you that, but you have to agree to pay me for all of that for the next 20 years. You want a gigawatt? Sign this contract that says you will pay me for a gigawatt 24/7 for 20 years." Why? Because there's no capacity to back it on the grid anymore. It used to be if I went to a utility and said, "I want a gigawatt," they'd say, "Sure, I've got a gigawatt." Well, I wouldn't go for a gigawatt. I'd say, "Give me ten megawatts," and they'd say, "Sure, ten megawatts, no problem. I've got that. You don't even have to, you don't need to sign a contract. It's so much slack capacity, I'll get you ten megawatts." No, no longer true.
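The size of the commitment described here, a gigawatt around the clock for 20 years, is easy to work out. In the sketch below the energy figure follows directly from the terms quoted; the price per megawatt-hour is a hypothetical placeholder, since the talk gives no price, and the function names are introduced for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # ignores leap days


def committed_energy_gwh(gigawatts: float, years: int) -> float:
    """Energy paid for under a 24/7 take-or-pay contract, in GWh."""
    return gigawatts * HOURS_PER_YEAR * years


def committed_payment_usd(gigawatts: float, years: int,
                          usd_per_mwh: float) -> float:
    """Total payment owed whether or not the power is actually used."""
    return committed_energy_gwh(gigawatts, years) * 1_000 * usd_per_mwh


if __name__ == "__main__":
    energy = committed_energy_gwh(1.0, 20)
    print(f"1 GW for 20 years: {energy:,.0f} GWh ({energy / 1_000:.1f} TWh)")
    # $50/MWh is a made-up price used only to show the scale.
    print(f"at $50/MWh: ${committed_payment_usd(1.0, 20, 50.0) / 1e9:.2f}B")
```

The obligation is fixed by the contract rather than by consumption, so any stretch of low utilization is paid for at the full rate.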
24:08 – 25:34
Stranded grid capacity and why serving may ‘unstrand’ smaller sites
1. SPSpeaker
  But it-- my understanding is the reason grid-connected capacity is so acutely under-supplied is because hyperscalers are saying, "Well, we only want sites that are expandable."
2. SPSpeaker
  Mm-hmm.
3. SPSpeaker
  And so everything under a hundred megawatts is just stranded.
4. SPSpeaker
  Yes.
5. SPSpeaker
  But that's a bunch of stranded, unutilized capacity in America. What-- If you could... If you were the chief energy officer of America, and you were trying to, you know, drive up utilization of those stranded assets, what, what would you do?
6. SPSpeaker
  Um, so I, I think the hundred megawatts, if, if you look at it, it'll add up to something, but it's not going to add up to the majority of the, of the demand. I think that just from a scale and operations perspective, if we really wanna go after this, we actually would-- we should unstrand some of those hundred-megawatt sites for sure. I think that as serving takes off, that will happen naturally.
7. SPSpeaker
  I see.
8. SPSpeaker
  So in other words, uh, we, we are, up until recently, in a place where most demand was for training, and training does need large contiguous chunks of infrastructure. As we move to more and more of the demand going to serving, that's gonna shift naturally.
9. SPSpeaker
  Because the serving is more fungible.
10. SPSpeaker
  It's more fungible. It's smaller. I don't need a gigawatt to do training.
11. SPSpeaker
  Right.
12. SPSpeaker
  I don't need five hundred megawatts to do training. I can serve some number of tokens, uh, per minute coming from a small-ish deployment. So I think we're gonna unstrand that somewhat naturally, but I don't think that's going to fulfill the needs because there is gonna be benefit, uh, to scale. Uh, and we are gonna have to figure out how we get larger amounts of power concentrated, delivered to some number of locations.
25:34 – 27:49
Career advice: don’t chase predictions—choose intrinsic motivation
1. SPSpeaker
  Yep. Makes sense. I could go on for hours, Amin, uh, but we should get-- switch to questions. The question is, if you were a Stanford student again, what technical problem would you obsess over?
2. SPSpeaker
  You know, I, I will say that, uh, I get a, a... This, I think, is a really good question, but the answer I'll give is to go-- The-- All of them really, really matter, honestly. In other words, there is no one bottleneck. And predicting the future is really hard. So let me give you an example. When I was a graduate student, uh, what, uh, everyone said is, "Absolutely, positively don't work in artificial intelligence." Like, it's just the worst thing to work, work in. And that, that was true again after ten years, and then true after another ten years, and now look what's, what's happened. Trying to predict the future, really, really hard. I, I would say pick the problem domain that you are most intrinsically excited about because that, that passion for it, that's, that's what's gonna carry you forward. And then, um, in this model, I would say everything from algorithms to hardware engineering to chip design to operating systems to model architecture to... It all matters, which, which is really good. So probably pretty good chance that, uh, what you pick is gonna be really, really important. And so if you-- and if you pick something solely because your prediction is that it's gonna be the most important one, but you don't like it, th- I think that outcome will be the bad outcome because also pretty good chance that you'll have mispredicted.
3. SPSpeaker
  I have a qui- I have a quick que-question based on that. H- you know, many of you submitted your, your project ideas, and there were five hundred, so it's taken me a while, but I'm steadily reading all of them because I don't want to have Claude hallucinate. How many people here feel like you picked a project idea because you were truly intrinsically motivated by it?
4. SPSpeaker
  Good number.
5. SPSpeaker
  Okay. That's actually very helpful. Wasn't clear to me based on the readings because there's surprising similarity between many of the problems you guys are interested in, and I wish we were seeing more diversity in, in those problems, but that's, that's for another time. Next question. The question is, what's your favorite story from your time at Google?
27:49 – 31:33
Learning at Google: the TPUv2 networking debate and being wrong productively
1. SPSpeaker
  Oh my gosh. There are a lot of, uh, favorite stories and, um, yeah, thank-thanks for reminding me of the great time I had at Duke as a, as a professor. You know, the, the stories that are... I mean, we've had, of course, many joyous moments, many funny moments, um, but I think that for me, um, the moments that are best are the ones where you learn the most. And so the one that actually comes to mind, just top, top of mind, is when the original TPUv2 design was happening, and we were going to go build this supercomputer, at the time two hundred and fifty-six nodes. It's gotten much bigger, over nine thousand nodes now. And we were debating what, um, network to use. This was around twenty fifteen, what network technology to use. And, uh, you know, my primary area of research understanding at the time was networking. And the conventional wisdom from forty-five years or whatever at the time of networking was any-- whatever you were gonna do in networking, you were gonna use Ethernet. And some really smart, uh, folks said, "No. Th-this domain, we want a distributed shared memory system, read-write semantics, point to point, not switched, m-and Ethernet is the wrong solution." You know, I, I was like, "What, what the heck? I mean, look, I, I have forty years of history behind me and always been right. Me and, me and a thousand other people have always been right." But then when I-- when we dug into it back and forth, and it was one of these super spirited debates, not a, not an angry debate, to be clear, right? But it was a, you know, s-smart people, uh, whatever you wanna say, really going at it and really convinced, um, uh, that they were right. So it turned out I was wrong. It turned out that actually you don't want to use Ethernet for a TPU supercomputer, and that has stood the test of time for the past decade. Um, I got it wrong. I, I learned something. And, and so the best thing about Google actually, I would say, is, uh, how often I get to learn something.
2. SPSpeaker
  In that story, who was the person who was the first principles thinker that came to that conclusion first and then evangelized that standard?
3. SPSpeaker
  Hard to say, but probably Norm Jouppi. Stanford PhD, so yeah.
4. SPSpeaker
  Yeah. Norm is-
5. SPSpeaker
  May-maybe he learned something.
6. SPSpeaker
  Um, next question. The question, what was it like doing the ChatGPT code red?
7. SPSpeaker
  You know, I think th-this was-- it was a great time, and I think it remains. I think that Google has, uh, changed as a company. I was, uh, this is, um- When I really first started seeing Sundar in action, uh, up close, I, I now report to him. I didn't at the time, but one of the things that he did in that moment was he did a fairly big reorg. Uh, the biggest part of it was bringing Brain and DeepMind together. Probably many of you have h-heard of that. Uh, it was a fantastic move. He also, uh, brought different infrastructure teams together, uh, u-under my leadership. That was the, the lower headline, but I think also turned out to be a good move, not because of me, but because it allowed us to move with, uh, sort of more speed and more, um, unification. Uh, I, I would say that seeing how the people came together was, was really fantastic. The culture at Google is different than it was, uh, three and a half years ago. Uh, I would say it's been a reinvention. I think that we're actually through that now. You know, if you'd asked me a year ago, would I say that we were through it? Probably not. I think we're now at this point through it. Uh, Sundar deserves a lot of credit. Demis Hassabis, uh, and Jeff Dean deserve a lot of credit for, uh, for it as well. But, uh, really I, I'm, I, I speak of November 2022 often, actually internally, and frankly fondly. I, I can repeat the question if you'd like.
8. SPSpeaker
  Yeah. Please do. Yeah.
31:33 – 36:49
Optical circuit switching: programmable topology for availability and bandwidth steering
1. SPSpeaker
  Um, so I think, um, one, the premise is networking is a bottleneck at all, all layers. We at Google have been leveraging, uh, optical circuit switches to remove that bottleneck. And so are, is, are you worried or am I worried that we're going to limit ourselves given the fact that we can't reconfigure these optical circuit switches at per-packet granularity? Um-
2. SPSpeaker
  Is that assumption... Sorry, I interrupted you. Go ahead. Yeah, go ahead.
3. SPSpeaker
  Uh, good question. So we, we, um, don't restrict ourselves to, uh, optical circuit switching. Optical circuit switching plays a role in our networking, but I mean, the, the lecture which you're referring to, the, the presentation I made in terms of all layers, for example, you would not use optical circuit switching for the on-chip network. No way. Not, not applicable. And you would not use optical circuit switching for portions to large portions of the WAN. But even within the data center, where we do use it extensively, it's not the sole technology. It's an augment. In other words, we have a lot of electrical packet switches, a lot of electrical packet switches. And if you look at the TPU, um, within a rack, it is a point-to-point network, but every connection today between TPUs within a rack is copper. Like, there's a direct co- because that is the right technology. Between racks, we have optical circuit switches, but the optical circuit switches essentially create today a three-dimensional torus.
4. SPSpeaker
  Mm.
5. SPSpeaker
  Why do we do this? The reason is reliability. So if you think about it, if I lose a TPU, I now have again lost my entire, um, lattice. If information is flowing through this torus by pairwise connectivity, I lose that one TPU, everything's gone away. What I can now do with my optical circuit switch is I can remove that rack wholly. I can plug in another rack and those... Within a rack today, we have sixty-four TPUs. Those sixty-four TPUs can take in the exact position of the sixty-four TPUs that I took out. But what does the optical circuit switch do? And this would require some pictures and, um, uh, some slides probably.
6. SPSpeaker
  Sure.
7. SPSpeaker
  Basically, what it then says is imagine that I have the ability to take fiber, unplug it, replug it to another rack without any humans. That's what the optical circuit switch does. Essentially, so what is an optical circuit switch? It's a, um, chip about this big square. It has 136 mirrors on it. More, could be more, could be less. Each mirror can be rotated in three dimensions. Essentially, what we do is we take every rack and all the fiber that's coming out of that rack, we connect it to the optical circuit switch. The fiber, now it's light shining out to the fiber, comes into the op-optical circuit switch, shining down on those mirrors. So light comes in, hits a mirror, gets reflected in a particular direction depending on how I rotate the mirror under MEMS control. These tiny mirrors-
8. SPSpeaker
  Yes
9. SPSpeaker
  ... and tiny motors. It'll get reflected precisely to go out an output port. But I can program what output port it goes out. So in other words, essentially what it gives me is a programmable topology, so that if I decide that a rack needs to be virtually removed, virtually removed, this is all under software control. And then another rack gets plugged in, in the exact same position that that other rack got removed. I now can maintain my topology. The torus becomes whole again. And I can do this in, let's say, seconds. So essentially what the real differentiator has been for TPUs is the ability to have much higher levels of availability. I, I can now recover from failures instantaneously, right? As long as I have a few spare racks, quote-unquote, "lying, lying around." And the spare racks, by the way, could be doing smaller computations. They don't have to be doing the gigantic computation. That's place one. Place two that it becomes useful is, let's say, I told you about the compute problem and the storage problem, right? We're doing agents. I now, one more level above that, have a different optical circuit switching layer where I can say, "Point the mirrors to that cluster over there where the storage that I need is located." I now can short-circuit many layers of a general purpose electrical packet switch that I would have had to have normally provisioned and built to go to that distant cluster and basically create a direct connect. So th- really think of an op-- So do I-- I still have lots of electrical packet switches, but I now have many fewer than I would have needed t- where I can program which cluster I can talk to. This is, you're right, it's not per packet. But if I know that I'm gonna run this five-hour job, and th-this five-hour job needs the storage over there, point the mirrors over there.
10. SPSpeaker
  Mm.
11. SPSpeaker
  The next five-hour job needs the storage over there. Okay, as part of Borg scheduling the job, it would say, "Point the mirrors over there for the next five hours."
12. SPSpeaker
  I see.
13. SPSpeaker
  That, that saves me from provisioning layer upon layer upon layer of network and miles and miles of fiber. Essentially allowing me to now have infinite bandwidth wherever I want it. It's not fully fungible because you're right, if at, at a second granularity I said, "Oh, wait a second, I wanna go over there." It's not that I can't, I still have electrical packet switches over there, just not with the full bandwidth. The full bandwidth is pointed over there for the next five hours or however long I decide I need to move back over here. It's a kind of a deep question, but so optical circuit switches, they have their role. They're not a, a magic bullet that solves all problems. We use a lot of electrical packet switches.
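The rack-swap idea described above can be sketched as a toy model: a logical 3D torus of slots, plus a software-controlled mapping from each slot to a physical rack, standing in for the mirror settings. This is only an illustration under stated assumptions; the class name, method names, and rack identifiers are hypothetical, and nothing here reflects Google's actual control software.

```python
from itertools import product


def torus_neighbors(coord, dims):
    """Return the wraparound neighbors of a slot along each axis of a 3D torus."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


class ProgrammableTopology:
    """Maps logical torus slots to physical racks (a stand-in for mirror positions)."""

    def __init__(self, dims, rack_ids, spare_ids):
        slots = list(product(*(range(d) for d in dims)))
        if len(rack_ids) != len(slots):
            raise ValueError("need exactly one rack per torus slot")
        self.dims = dims
        self.slot_to_rack = dict(zip(slots, rack_ids))
        self.spares = list(spare_ids)

    def links(self):
        """Rack-to-rack links implied by the current slot mapping."""
        out = set()
        for slot, rack in self.slot_to_rack.items():
            for n in torus_neighbors(slot, self.dims):
                out.add(frozenset((rack, self.slot_to_rack[n])))
        return out

    def replace_failed(self, failed_rack):
        """Put a spare rack into the failed rack's slot; the logical torus is unchanged."""
        slot = next((s for s, r in self.slot_to_rack.items() if r == failed_rack), None)
        if slot is None:
            raise ValueError(f"{failed_rack} is not part of the torus")
        if not self.spares:
            raise RuntimeError("no spare racks available")
        spare = self.spares.pop(0)
        self.slot_to_rack[slot] = spare
        return slot, spare


topo = ProgrammableTopology((2, 2, 2), [f"rack{i}" for i in range(8)], ["spare0"])
links_before = len(topo.links())
slot, spare = topo.replace_failed("rack3")
print(f"{spare} now occupies slot {slot}; link count {links_before} -> {len(topo.links())}")
```

In this sketch the swap is a single dictionary update, which mirrors the point made above: the torus shape stays fixed and only the binding of positions to physical racks changes.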
36:49 – 37:49
Topology choices: torus for all-reduce, switches for all-to-all, and co-design with models
1. SPSpeaker
  Why is the torus the s- to- topology you settled on versus others?
2. SPSpeaker
  Originally for, uh, ML training, the number one collective was, uh, all reduce rather than all to all. And for an all reduce, actually, you... The torus is the perfect, um, topology because you essentially are disseminating parameters to everyone with potentially a little bit of computation, a little tiny bit of computation on each-
3. SPSpeaker
  I see
4. SPSpeaker
  ... distribution. So the best and fastest way to do dissemination of data for this particular style is with an all reduce.
5. SPSpeaker
  I see.
6. SPSpeaker
  Now, if you are doing an all to all, turns out the switch topologies have, uh, have their benefits as well.
7. SPSpeaker
  For, um, that regime, what is the optimal topology?
8. SPSpeaker
  Optimal... If, if you truly need to do all reduce... Uh, sorry, all to all, uh, with arbitrary communication, a switch topology-
9. SPSpeaker
  Would be good
10. SPSpeaker
  ... a standard fat-tree Clos topology would be the best, but it winds up that model designers can work around the topology in very clever ways, and they do.
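To make the all-reduce point concrete, here is a minimal simulation of the textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase), in which each node only ever talks to its next neighbor on a ring. The lecture does not say which algorithm runs on TPU pods, so treat this as an assumption-laden sketch of why wraparound neighbor links, as along one dimension of a torus, suit this collective; the function name is hypothetical.

```python
def ring_all_reduce(node_data):
    """Sum-all-reduce over a ring of n nodes, each holding n scalar chunks.

    Every transfer goes from node i to node (i + 1) % n, so only
    nearest-neighbor links on the ring are used.
    """
    n = len(node_data)
    if any(len(chunks) != n for chunks in node_data):
        raise ValueError("each node must hold exactly n chunks")
    data = [list(chunks) for chunks in node_data]

    # Phase 1, reduce-scatter: after n - 1 steps, node i holds the full
    # sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing messages first so all sends happen "simultaneously".
        msgs = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(msgs):
            data[(i + 1) % n][chunk] += value

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        msgs = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(msgs):
            data[(i + 1) % n][chunk] = value

    return data


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Each node sends and receives 2 × (n − 1) chunks in total, all over neighbor links, whereas an all-to-all exchange needs traffic between arbitrary pairs, which is where switched topologies help.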
37:49 – 46:26
Planning, depreciation, and dynamic replanning under uncertainty
1. SPSpeaker
  Yep. Uh, next question. The question... I'm not going to take your assumpt- your assumption was all chips are becoming obsolete. That is not true. However, your question was: how does Google think about hardware depreciation? Correct? Okay, let's take that.
2. SPSpeaker
  Yeah. So all chips are, um, not becoming obsolete. There's so much demand that our, um, older generation chips continue to see very heavy use at Google, and this is true at, uh, whether it's older generation TPUs or GPUs. It's true across the industry. H100s are in massive demand despite the fact that, uh, Rubin has been announced, et cetera. Fantastic chips as well, H100s and H200s and B200s and GB200s, et cetera, as well. So, uh, we depreciate our hardware, uh, our compute hardware over six years at Google. I think that is more or less standard across the industry. I think some people, a few people might do five, uh, but six years, I believe, is standard. We are seeing use at least for that period of time and, uh, typically longer for, for our hardware. So it works, uh, works out well. How do we plan? This was the problem that, uh, we were talking about earlier. It's, it's very, very hard to plan for the future because we're having to make these predictions fairly far in advance for capa- One saving grace is when we're provisioning watts and data center space, that's fungible. In other words, it could be generation X, it can be generation X plus one, it can be generation X plus two, it can be generation X minus one. So we first need to have an envelope for watts. But the lead time for these chipsets is also significant. You gotta get your orders in early, and you have to plan, plan for those as well. I can tell you that, um, uh, we have a... Planning is a massive effort, massive and complicated effort, and fast-changing because let's say that, uh, I have a plan and then a new use case comes up. There's a new invention internally at Google, a new product launch, and it needs a particular kind of capacity. Now, I have to figure out how to fit that in. I have to replan. 
So essentially, by the way, another very interesting domain is how do you plan under uncertainty, and how do you dynamically replan quickly based on all the new information that you have, demands that you have, customers that come in? A new cluster, a cloud customer comes in and wants to buy a bunch of GPUs, but it, it's not the GPUs that I ordered. It's a different kind of GPU. How do I order these new ones, get them? And by the way, they want, they want it built close to their cluster in Minnesota. I'm making all this up. But so, like, all these constraints come in, and now we have to replan dynamically and essentially daily based on the new information that we get.
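For the six-year versus five-year depreciation comparison mentioned above, a small sketch of the arithmetic follows. The transcript only gives the periods; the straight-line method and the $600 purchase cost are assumptions made for illustration.

```python
def straight_line_schedule(cost, years):
    """Remaining book value at the end of each year, from year 0 through `years`.

    Assumes straight-line depreciation to zero salvage value (an assumption,
    not something stated in the lecture).
    """
    annual = cost / years
    return [round(cost - annual * y, 2) for y in range(years + 1)]


COST = 600.0  # hypothetical purchase cost, in any currency unit

six_year = straight_line_schedule(COST, 6)
five_year = straight_line_schedule(COST, 5)
print("6-year:", six_year)
print("5-year:", five_year)
print("annual expense:", COST / 6, "vs", COST / 5)
```

Under these assumptions the six-year schedule expenses a sixth of the cost each year instead of a fifth, which only makes sense if the hardware really stays in use that long, as the answer above says it does.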
3. SPSpeaker
  Awesome. Next question. How do you see robotics capabilities being unblocked-
4. SPSpeaker
  Yeah
5. SPSpeaker
  ... roughly?
6. SPSpeaker
  So I think a really exciting domain, and I think that this is... You know, to me, if I think about the internet revolution, it, um, really was the coupling with the mobility revolution that, um, made it truly the impact that it was, right? Basically taking the internet into the real world, making it mobile. Uh, I, I, I think I'm biased, so you, you all can check this, but I think that the best example that we have of, uh, really advanced robotics out there in the world working, uh, in very complex scenarios is Waymo. And so I think that's a good example of this, uh, scaling approaches. In robotics, I think in many cases you're gonna find that latency really matters, but safety is the primary consideration. And I think you're gonna have very similar scaling requirements, but safety, reliability will just shoot through the roof in terms of your, uh, considerations, and that's going to then argue for, um, locality and essentially, whatever you wanna call it, uh, single-threaded programming. I don't mean single-threaded as in, okay, there's only one, one core on the CPU or whatever on the TPU, but essentially, you, you can't have variability. Like if, if there's a safety question, you can't say, "Oh, wait, I had a context switch of ten milliseconds, and I wasn't running when the safety, uh, whatever algorithm needed to be running." So I, I do think that the similar scaling laws are going to apply, but the scale that you can count on for robotics is going to be much, much less. If you're counting on 20,000 TPUs in a data center a thousand miles away for, for your robotics application to work, probab- depending on the robotics application, uh, may or may not work.
7. SPSpeaker
  The question is, um: Are there... Do you have any thoughts on the SpaceX-Anthropic partnership that was announced today, where they're gonna u- Anthropic is gonna be able to do some compute from the former xAI Colossus cluster?
8. SPSpeaker
  Similar, uh, announcement on Cursor. So, uh, Cursor is gonna be leveraging a bunch of capacity on SpaceX xAI. Um, and I think what you're seeing here is massive demand for inference compute today. And so really, if you think about it, you'd have to say that coding agents really exploded. They've been around for quite some time, so I, I do know that, but they really exploded four or five months ago. And nobody, nobody predicted it at this level, and so nobody essentially had enough lead time to say, "I need more GPUs, more TPUs to handle this explosive demand for serving." People are now, um, looking around and saying, "What capacity can I get where?" And I don't know the inside story of whatever Elon and, um, Dario discussed, or whoever, but, you know, clearly a good opportunity for Anthropic to leverage a bunch of available capacity that, uh, um, SpaceX had less use for.
What got me into this field, uh, and what, um, convinced me to switch from being a professor to, um, my, my job at Google. Uh, I was lucky in that for whatever reason, I was, uh... I remember I was six years old. I was in, um, uh, Iran at the time, actually. My family moved to the US when I was six, so it was right before we moved. I saw a magazine cover, and it had a computer on the magazine cover. And somehow I, I decided I was going to become a computer programmer. Never seen or touched a computer, but I decided that. Um, I think my defining characteristic is I'm very stubborn. I never change my mind [coughs]. And fortunately, I loved it. So when I was in high school, I was, I was the kid, I was that kid, right, and this was a w- a while ago. I was in the lab programming, uh, all the time. So boring story. I, I, I still love it, uh, to, to this day. Uh, and I, I loved it so much that I really, um, decided I had to get a PhD. I re- I needed to, uh, understand the material. It wasn't about, um, anything other than really love for the material. Becoming a professor was natural. 
I came to Google because I was-- I had been a professor for 12, 13 years, and actually never had a real job. I had jobs in research labs, but that didn't count. So I said, "You know, if I'm teaching all these people, I better know something about what it's like to be in industry." So I came to Google on a one-year sabbatical. I loved being a professor, and actually, I was quite, um, haughty about, um, people and working in industry. Meaning, I couldn't understand why anyone would wanna work in industry. Uh, no, no offense to, uh, anyone here. Because I was so biased. I, I admit I was biased. Um, I got to Google very, very fortunate. So at Google at the time I joined, 2010, there were seven people between me and the CEO. All seven of them, uh, including Eric Schmidt, the CEO at the time, had a PhD in computer science. So here's this guy who knew nothing about, um, industry. Literally nothing. Uh, any other place I would've gone, I, I think that there would've been like organ rejection, or I would've been like, "Oh, I w- I was so right. Industry's terrible." Uh, Google was a match to me. And, uh, it took me a while, probably three years, to figure out that, uh, I was having so much fun that I didn't-- I wouldn't go back to being a professor. But I, I miss it, actually, and I love it. A fantastic job. Uh, one of the best jobs ever. Uh, but the opportunity to really put ideas into practice, and Google is the kinda place where, yes, it is about tech-- uh, business impact, and it's about the outcomes, but it's also about, um, doing the right thing for people, our users, and doing the right thing about-- for technology. In other words, it's like solving hard technical problems. Really valued at the company.
Good question, and I think there are a lot of good firms, uh, out there, honestly. So I, I think it really-- Uh, I'm very optimistic about the space, and I think there are a number of, uh, strong firms. 
Really, it was, um, uh, evaluation of their technology, evaluation of their people, how far along we were with them relative to others. Uh, it, it really-- I wouldn't read too much into it about, um, this one is the very best, or this one is the second-best, et cetera. Kairos is fantastic. We're big believers, obviously. Uh, but I think there's gonna be a, a number of winners in this area.
46:26 – 1:00:00
What’s next: specialization (TPU 8i/8t), endless hardware bottlenecks, and energy as the hardest one
1. SPSpeaker
  Quick question. What do you, what do you see as next for TPUs to beat, um, GPUs? Or is that even a, is that even a goal?
2. SPSpeaker
  Not even a goal. I, I mean, I do get this, uh, question fairly frequently. I think it's a good and reasonable question. But I think that the good news is that the market is expanding so dramatically that there is no beating or there's no competing per se. In other words, there's no winning and, uh, losing. I think it's about driving impact. So I mean, we, we buy and sell a huge number of GPUs. We use a l- a huge number of GPUs. GPUs are fantastic products, and I think they're gonna... And I have, by the way, all the respect in, in the, uh, world for, uh, Jensen. Would, uh, uh, would, would call him for advice, uh, on, on a number of things, for sure. He's, he's amazing. His company's amazing. But I would say that we're going after different, uh, domains and, uh, different, uh, customer use cases, et cetera. What I'll say broadly is for TPUs, uh, we just, a couple weeks ago, announced our latest eighth generation TPUs, 8i. I stands for inference. And 8t, t stands for training. And so for the first time, we're launching two chips in one year. Why am I s- mentioning these two? It's because we're, we for the first time are specializing the TPU line. In other words, previously, we had one chip for both serving and training, and that was the right decision based on everything we could see. Because we could have probably-- We, we always could've built two chips. But if one chip is five percent better for one, and the other chip is five percent better for the other, it's actually better to have the one fungible chip. Right now, the needs are diverging so much that we're actually seeing big uplift, major uplift in specializing for inference and training. What I see coming, um, moving forward is a further increase in specialization. Why? Because general purpose CPUs, they've for many years, a decade plus, have slowed in their rate of performance efficiency improvement year over year. 
And so what that means is that now you actually have to pick the workloads that, um, are large, and you can't necessarily say, "Hey, just wait a year and your CPUs will get twice as fast," because that won't be good enough to keep up with the demand. We have to pick our big workloads. Inference and training are two great examples, where we can now say, "Hey, we can actually do something, let's say, twice as good because we specialize." The lesson in hardware design is, the more you specialize, the better performance you can get for the subset of workloads that you can run. CPUs of co-- By the way, CPUs aren't going away. Like, they're, they're general purpose. They can do anything. Uh, a TPU can't do anything. But for the domains where it runs, it's literally 100X more efficient than, let's say, a CPU. So we're, we're, we're in the process of finding those use cases one by one and saying, "Okay, now..." And maybe it won't even be a TPU, right? Maybe there's gonna be some other big workload that doesn't require tensors, matrix algebra. Maybe. Or there'll be some other one that needs a different system balance point. By the way, that's the key observation between 8i and 8t. The memory to compute to networking ratios are different. Right? So you, you actually would design the chip differently because that's what that application needs. We're gonna keep looking and specializing for the different domains.
3. SPSpeaker
  The, the question's on un- unblocking your own production bottlenecks from, from provid- vendors and suppliers like TSMC.
4. SPSpeaker
  Yeah. We're, we're deeply engaged across the, the supply chain and, um, and so I, I, I'll say it's, um, the simple answer is it's a domain that we're comfortable with. Uh, you know, my team right now is in, um, Taiwan and, uh, South Korea and, uh, Thailand, uh, et cetera, um, as well as, as we speak. So it, it is a complex issue, but I, I'm actually not worried, uh, about being able to secure supply, uh, our fair share of supply at, uh, Google. I think the challenge is, again, it comes down to the efficient use of that capacity. That's gonna be as key as anything. Now, the total demand in, in the world is gonna be significant, but I think from a supply chain perspective, if you... I mean, maybe I'll just give a generic answer. If you are a vendor for, um, a component, let's say it's, let's say it's a capacitor. Do you wanna have one customer? I, I'll leave it as a, as a hypothetical. Right. And, and let's say that customer was gonna say, "I'm gonna buy you out for th- three years, all your capacitors, whatever you got. I'll, I'll buy it all out." I, I would say that's not good for the vendor, actually, even if they might make more money in one or two or three years, right? Because so, so the flip side of it is, um, as component vendors, they wanna have some diversity. And, you know, again, for, uh, whatever it is, SEC filings, who's your-- How many, uh, how many customers make up 90% of your revenue? If that answer is one or two, investors aren't so- super happy because now you're beholden to exactly one or two customers.
5. SPSpeaker
  I, I think, um, this is a sort of misunderstood point, and I'm gonna try to connect two different questions here just to help synthesize, because we are lucky enough to have a professor who's better than me. But if you've noticed, many times when you guys ask questions, you place some context, and there's an assumption in there about the industry, and then you ask the question. And many times I've noticed over the course of the quarter, you guys use these words like winner, loser. You know? Th- there's a sort of embedded zero-sum mindset that I've picked up in this class, and I don't know why that is. But i- it's a, it's a tra-- it's a constraint of your own making. There's no such thing as winners and losers in the real world. There are just people who get shit done and who don't. People who have impact and who don't. And so I would encourage you guys to really, uh, think first principles about some of these assumptions. I mean, just here we've had somebody who's, who-- he, his answer just demonstrated that, right? He said... I think the question had some assumption like, "Oh, you know, Nvidia is locking up all the bo- all the production at TSMC. What are you gonna do about it? You're gonna lose." He's like, "Well, actually, you know, turns out vendors don't want concentration risk." If you break down from first principles how their business works, then you can see they actually want Google to have some percentage of their production demand. And in infrastructure and mission critical supply chains, you need to have redundancy built in because earthquakes happen, geopolitics happens, and if you want to be a reliable, stable partner to your customers, you plan for that. So generally I would-- Uh, just let's tone it down a little bit on the whole competition stuff because it, it only holds you back. You know? Have an... I, I don't know if you'd agree with this, but-
6. SPSpeaker
  No, it's good.
7. SPSpeaker
  You're the, the-
8. SPSpeaker
  I'm fine with the questions, by the way, but I think the advice back is great, in that I view what we're doing at Google as participating in an ecosystem to lift the entire industry, but also lift all the users. It's not gonna happen on the back of any one company. There's no one company that's going to come out of this as the winner, for sure. There's gonna be many winners. And by the way, the other thing that is true is a huge number of the winners haven't even been invented yet. Like, some number of you in this room are gonna start-
9. SPSpeaker
  Yeah
10. SPSpeaker
  ... some of the winners, no doubt, over the next several years. There's gonna be use cases and opportunities that none of us, certainly not me, can predict, that you all are gonna invent. There's gonna be a lot of winners. One caution I wanna say, though, is we are also going through, and this is not about companies, a time of societal transformation. So if I may... I know this isn't the topic of this conversation, but it's top of mind for me. I would also encourage this group, who is thinking about technology, to also think about our responsibility as technologists to make sure that we are building in guardrails and safety as we deploy our inventions, in terms of how we help drive the societal transformation. I mean, I think five years from now, ten years from now, how we work, how we live, how we learn is gonna look a lot different. And we do want it to also be better as a whole, maybe hopefully significantly better.
11. SPSpeaker
  And in the ecosystem, as this transition is happening, it's stressful for a lot of people.
12. SPSpeaker
  Stressful.
13. SPSpeaker
  There's fog of war. People don't know... You know, information is not being disseminated out. What are some areas of misalignment across the ecosystem that you would encourage not just them, but other speakers in this class who are watching each other's lectures, to think about?
14. SPSpeaker
  Oh, it's a great question. And by the way, congratulations to you and Michael in terms of this class and all these students and this thoughtful question. I'm just blown away, honestly, by the-
15. SPSpeaker
  It's all Mike, uh, behind the scenes
16. SPSpeaker
  ... the quality of the discourse here and the fantastic questions here. Blind spots: frankly, I thought that your feedback to the room here is fantastic. Probably across the ecosystem, there is a notion of a single winner a bit too much.
17. SPSpeaker
  Yeah.
18. SPSpeaker
  And probably also a bit focused on individuals winning and losing, so that sort of pairwise fight. I won't name any names. You all know what the names are. But person X is out against person Y, and I don't know how much value that's adding to anybody.
19. SPSpeaker
  From your perspective, what do you think true bottlenecks are?
20. SPSpeaker
  Yeah, it goes back to the question of what you would study if you were coming out of Stanford. There is no one bottleneck. What is the primary bottleneck? Honestly, it shifts daily, weekly. Yes, I hear about memory getting locked up in the supply chain, or some other issue that might be coming up. And on a particular day, the bottleneck might be the reliability of a particular cluster for training our next foundation models. That might be the bottleneck. I would say the one where I have the least understanding of the solution is energy. In other words, I can roughly make up answers that I have some confidence in for most topics, but for us to scale energy to the level that we need to across the planet, there are ways to do it. A lot of them are brute force and expensive, and expensive not just in dollars. So the biggest innovation bottleneck, I would say, in terms of really getting what we need, an energy abundance, which also means affordability, is, yeah, it's probably energy.
21. SPSpeaker
  And in the energy space, which solutions do you think are being underexplored, or which vectors could be more systematically explored?
22. SPSpeaker
  I think that here in the US, we could look a lot more at wind, solar, batteries. We are at Google, for sure, but this is a manufacturing and scaling process that has some physics involved with it, and physics meaning just some time. So this is an area where we're probably under-invested, again, as a community.
23. SPSpeaker
  Two days ago, there was a company that just announced some money they'd raised to build data centers as a network of distributed floating pods.
24. SPSpeaker
  Mm-hmm.
25. SPSpeaker
  Is, is that a promising vector? How would you analyze that solution?
26. SPSpeaker
  Yeah, and of course, we and others are looking at data centers in space. In space, energy is five X more efficient, and if you get into a sun-synchronous orbit, it's twenty-four by seven, with no or very little battery needed. I would say there are a number of promising directions like this. They're all fairly far out and all carry some risk, so for me it would be a portfolio. The proven technique elsewhere in the world is solar, wind, battery: pretty affordable, pretty fast to manufacture, pretty fast to stamp out significant capacity deployments in short amounts of time.
27. SPSpeaker
  But when you say far out, you're talking roughly a decade or so, right, of our-
28. SPSpeaker
  Five to 10 years, we can argue. Five to 10 years.
29. SPSpeaker
  That's pretty short.
30. SPSpeaker
  It's pretty short, but with some risk, and we have a lot to do over the next five or 10 years.
1:00:00 – 1:04:17
Community and grid impact: water, demand response, and ‘optimal scaling’ responsibility
1. SPSpeaker
  The question is, how are you thinking about infrastructure equity of access and the impact on the environment?
2. SPSpeaker
  Yeah, I love this question, appreciate this question. Our goal at Google, my goal at Google, is that our data centers should be an uplift for the local community and an uplift for the grid. So whether it's noise, water, or power, across the board, the goal is that these should all be viewed as positives. Of course, jobs and access to the technology, but we should be, in my opinion must be, coming with uplift to the community. There are concerns, by the way. I don't want to understate them, across the country, across the world, but we really are working proactively. Let me give an example here: PUE, power usage efficiency. Historically at Google, up until the last few years, we had two designs that we considered for how we build our data centers. One was more power efficient by 10%. 10% is a lot. That says, if you have a gigawatt and you're 10% more efficient, that's 100 megawatts that you now get to use that you otherwise wouldn't be able to use. Two designs: one that used more water, and one that used essentially no water. The one that uses no water is 10% less power efficient. As a whole, okay, maybe that makes sense. Maybe it makes sense from our bottom line to say, "Well, go use more water, but you get 10% power efficiency." But in a particular community, that could make zero sense. Right? That would be a net negative, a huge net negative for the community. So what we've done, what I've done, is we've said, "You know what? Unless there is abundant water in a particular community, where the community would say, 'Actually, we'd rather you use less power,' we're gonna go in with the less power efficient design, but the one that uses almost no water." That needs to apply across the board. In other words, this needs to be an asset.
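The 10% figure in this answer is easy to put in numbers. The following is a minimal sketch of that arithmetic only, assuming the 1 gigawatt site from the speaker's example; the function and variable names are hypothetical and do not come from the talk.

```python
def extra_usable_mw(site_mw: float, efficiency_gain_percent: float) -> float:
    """Power (in MW) freed up for compute when a design is more efficient by the given percent."""
    return site_mw * efficiency_gain_percent / 100


# Figures from the talk: a 1 GW site and a 10% efficiency difference
# between the water-using design and the near-zero-water design.
site_mw = 1000           # 1 gigawatt expressed in megawatts
gain_percent = 10        # the water-using design's efficiency advantage

freed_mw = extra_usable_mw(site_mw, gain_percent)
print(f"Power given up by choosing the low-water design: {freed_mw:.0f} MW")
```

Read this way, choosing the near-zero-water design at a gigawatt site means forgoing roughly the 100 megawatts the speaker mentions.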
Another example here is we've recently developed technologies to have a gigawatt of demand response across the country. What this means is, I told you about how the grid overprovisions. They overprovision for the homes, the communities, for that one week of the year, whether it's the coldest or the hottest, where they have to have the most power available for people's homes. What we wanna be able to do is say, "Okay, we'll take power the one week of the year, the two days of the year that you need it. You tell us, and we'll give you back 100 megawatts. We'll power things down in our data centers." This goes back to the 99% reliability bit.
3. SPSpeaker
  That's cool.
4. SPSpeaker
  Right? So we'll work with the utility, where actually now they can provision less while guaranteeing that the homes in the community that need it have the power that they need, without having to have two X the provisioning-
5. SPSpeaker
  Mm
6. SPSpeaker
  ... for the bad two days or the bad week of the year. We're happy to take that downtime. We're happy to be, again, an asset to the grid, an asset to the community. We have to do more, by the way. I'm not at all suggesting that this is done, but I'm very much taking this super seriously.
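The link between demand response and the "99% reliability" point can be sketched numerically. This is a rough illustration, not Google's actual program: it assumes the curtailed capacity is fully powered down for whole days, and the function names are hypothetical. The 100 megawatts, the "two days" and the "one week" come from the talk.

```python
DAYS_PER_YEAR = 365


def downtime_budget_days(availability_percent: float) -> float:
    """Days per year that may be unavailable at the given availability target."""
    return (100.0 - availability_percent) / 100.0 * DAYS_PER_YEAR


def energy_returned_mwh(returned_mw: float, days: float) -> float:
    """Energy handed back to the utility if `returned_mw` is shed for `days` full days."""
    return returned_mw * days * 24


budget = downtime_budget_days(99.0)
print(f"Downtime budget at 99% availability: {budget:.2f} days/year")

# Compare the two curtailment windows mentioned in the talk against that budget.
for days in (2, 7):
    verdict = "fits within" if days <= budget else "exceeds"
    mwh = energy_returned_mwh(100, days)
    print(f"{days} days of curtailment {verdict} the budget; 100 MW returned = {mwh:.0f} MWh")
```

Under these assumptions, a two-day curtailment sits inside a 99% availability target, while a full week would need a looser target for the affected capacity.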
7. SPSpeaker
  And we'll wrap on this note. Let's say that's what you're doing there. What should other infrastructure folks who are scaling capacity in the ecosystem be doing more of, that you think we're not doing enough of?
8. SPSpeaker
  I'll cast it in the positive. What I'm proud of, and this goes back to even the first question, our first discussion point that you raised, is that we're not trying to figure out how to build capacity at any cost. It's not, "Hey, we need a gigawatt. We got to go spend $40 billion," or whatever the number is.
9. SPSpeaker
  It's optimal scaling is the goal.
10. SPSpeaker
  It's optimal scaling, and that is efficient delivery of that capacity for our users, our customers, but it's also how do we make sure that actually we're a grid asset, a community asset, and welcome. Like, that gigawatt is not just an abstract gigawatt in somebody's spreadsheet. It's a massive deployment in the state of Utah, and it needs to be an asset for them, and that check mark needs to be there. So I would encourage all hyperscalers, all builders of capacity, to be thinking of it end to end: not just, "Go get me a gigawatt," but use it efficiently, deliver it effectively, have it be an asset for the community.
11. SPSpeaker
  Thank you. We might need some of your professorial insights on how we do that. You know, we're... Anyway. Thank you so much, Amin.

Episode duration: 1:04:22

Transcript of episode VeTqsCpcDgg

