Stanford CS153 Frontier Systems | The Discipline of Delivering Value per Gigawatt

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai

To follow along with the course schedule and syllabus, visit: https://cs153.stanford.edu/

In a CS153 Frontier Systems lecture, the class returns to the upstream infrastructure stack with Amin Vahdat, who leads Google's internal compute infrastructure and the TPU program powering Gemini, framing his nearly 30-year career as the discipline of building reliable, balanced supercomputers at planetary scale. Vahdat argues the industry is over-fixated on gigawatts and flops as headline metrics: at roughly $40 to $50 billion per gigawatt, the question that matters is value delivered per dollar, measured in happy daily active users and paying enterprise customers, not raw capacity.

He walks through the three constraints that govern utility. The first is reliability: moving from 99 percent to 99.9 percent uptime cuts annual downtime from about 3.65 days to under 9 hours, and frontier labs are newly willing to trade five-nines for double the capacity. The second is system balance: he invokes Amdahl's 1967 law that every million instructions per second needs a megabyte per second of I/O, now stretched across 100,000-node synchronous training jobs where a single failed node halts the entire computation. The third is procurement lead time: net-new gigawatts take two to three years, because land permitting, utility contracts, and 20-year take-or-pay power agreements have replaced the slack capacity that once let hyperscalers ask for ten megawatts on a handshake.

He details Google's optical circuit switch architecture, which uses 136 MEMS-controlled mirrors per chip to programmatically rewire the 3D torus topology connecting TPU racks, allowing failed racks to be virtually swapped in seconds and bandwidth redirected to distant storage clusters for the duration of a five-hour Borg job.

Vahdat closes on responsibility: data centers should be a net uplift to local grids and communities, citing Google's choice to accept 10 percent worse power efficiency in water-scarce regions and its gigawatt-scale demand response program that returns capacity to utilities during peak residential load.

Amin Vahdat is a Fellow and Chief Technologist for AI Infrastructure at Google, where his team is responsible for delivering industry-leading infrastructure spanning custom silicon, data centers, networking, and supply chain and operations. This infrastructure serves Alphabet, Google, and the world, and supports the artificial intelligence technologies that empower ML developers and solve customers’ most pressing business challenges. In the past, he was Vice President and General Manager for Google's compute, storage, and network hardware and software infrastructure. Until 2019, he was the Technical Lead and Vice President for the Networking organization at Google. Before joining Google, Amin was the Science Applications International Corporation (SAIC) Professor of Computer Science and Engineering at UC San Diego (UCSD). He received his doctorate in computer science from the University of California, Berkeley, and is a Fellow of the Association for Computing Machinery (ACM).

Amin has been recognized with a number of awards, including the National Science Foundation (NSF) CAREER award, the UC Berkeley Distinguished EECS Alumni Award, the Alfred P. Sloan Fellowship, the Association for Computing Machinery's SIGCOMM Networking Systems Award, and the Duke University David and Janet Vaughn Teaching Award. He was awarded the SIGCOMM lifetime achievement award for his contributions to data center and wide area networks, and was inducted into the National Academy of Engineering in 2023 for his contributions to the design and implementation of datacenter and planet-scale networks that power cloud computer systems.
Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

May 27, 2026 · 1h 4m

CHAPTERS

0:09 – 5:44
Why “value per gigawatt” beats raw capacity metrics
The conversation opens by challenging the industry fixation on total gigawatts and capex. The central thesis is that a gigawatt is only meaningful insofar as it delivers reliable, usable capability—what matters is value delivered per dollar (and per unit energy), not headline capacity.
- •Gigawatt buildouts are enormously expensive, and costs are rising
- •Utilization and “goodput” determine whether capex turns into real capability
- •A gigawatt in one environment is not equivalent to a gigawatt in another
- •Reliability and repair speed are core to extracting value
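One way to make the "value per dollar" framing concrete is to divide capital cost by the fraction of capacity that does useful work. The sketch below is illustrative only: the $45 billion figure is simply the midpoint of the $40 to $50 billion per gigawatt quoted in the video description, while the function name `cost_per_useful_gw` and the goodput fractions are hypothetical choices made for this example, not numbers from the talk.

```python
def cost_per_useful_gw(capex_per_gw_usd: float, goodput_fraction: float) -> float:
    """Capital cost per gigawatt that actually does useful work.

    goodput_fraction is the share of provisioned capacity delivering useful
    output (an assumed, illustrative quantity; not a figure from the talk).
    """
    if not 0.0 < goodput_fraction <= 1.0:
        raise ValueError("goodput_fraction must be in (0, 1]")
    return capex_per_gw_usd / goodput_fraction


CAPEX_PER_GW = 45e9  # midpoint of the quoted $40-50 billion per gigawatt

for goodput in (0.9, 0.6, 0.3):  # hypothetical goodput levels
    effective = cost_per_useful_gw(CAPEX_PER_GW, goodput)
    print(f"goodput {goodput:.0%}: ${effective / 1e9:.0f}B per useful GW")
```

Under these assumptions, halving goodput doubles the effective cost of every useful gigawatt, which is why utilization and repair speed matter as much as headline capacity.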
5:44 – 7:03
Measuring outcomes: DAUs, revenue, and “intelligence per dollar”
They discuss how to measure AI progress when outputs are heterogeneous (tokens, images, code). Amin argues the evaluation ultimately rolls up to business/user outcomes, while acknowledging ongoing work on benchmarks like intelligence-per-dollar.
- •Hard to reconcile heterogeneous outputs with compute inputs in a single metric
- •Google and others explore benchmarks for intelligence per dollar
- •Outcome metrics (happy users, DAUs, paying customers) are what count
- •Idle capacity is treated as a defect in the system
7:03 – 9:43
Infrastructure is an orchestration problem (compute + storage + network + CPUs)
Amin emphasizes that accelerators alone don’t define capability—system-level orchestration does. With agentic workloads, bottlenecks frequently shift to CPUs, storage locality, and data-center networking, making holistic balance essential.
- •Accelerators without supporting CPU/storage/network are ineffective
- •Agents introduce end-to-end pipeline waits and new latency/throughput bottlenecks
- •Data locality and cross-region storage access can stall expensive compute
- •Stop optimizing a single component; optimize the full stack
9:43 – 12:25
Reliability vs capacity: relaxing ‘five nines’ for frontier training throughput
Power provisioning and redundancy drive large amounts of stranded capacity. Amin describes a new phenomenon: frontier training customers may prefer more capacity with more downtime rather than extreme availability with less capacity.
- •Going from 99% to 99.9% availability is disproportionately hard
- •Five-nines power requires heavy redundancy, reducing usable power
- •Frontier training often prefers throughput and capacity over perfect uptime
- •Customer requirements are shifting: access can trump reliability
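The availability figures in this chapter translate directly into downtime per year. A minimal sketch of that arithmetic follows, assuming a 365-day year; the function name `annual_downtime_hours` is introduced here for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


for label, availability in [("99%", 0.99), ("99.9%", 0.999), ("99.999%", 0.99999)]:
    hours = annual_downtime_hours(availability)
    print(f"{label:>8}: {hours:8.3f} h/year ({hours * 60:9.1f} minutes)")
```

At 99% the budget is about 87.6 hours (3.65 days) per year, at 99.9% about 8.76 hours, and at five nines roughly 5 minutes. Each extra nine shrinks the allowance tenfold, which is part of why the redundancy needed for five-nines power is so costly.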
12:25 – 14:06
Why ML clusters break ‘internet-style’ resiliency assumptions
The discussion contrasts classic web services (designed for frequent component failure) with synchronous distributed training. In training, one node failure can halt the whole job, invalidating decades of loose-coupling resilience patterns.
- •Synchronous training tightly couples thousands of nodes
- •Single-node failures can stop the entire computation
- •Web services tolerate rack loss via replication and fungible compute
- •Frontier ML changes the fault model and operations playbook
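The contrast between the two fault models can be sketched with basic probability. The example assumes node failures are independent and uses made-up failure probabilities; both function names are hypothetical and exist only for this illustration.

```python
def sync_job_survival(n_nodes: int, node_failure_prob: float) -> float:
    """Probability a synchronous job sees no failure in an interval.

    Every node must stay up, so the per-node survival probabilities multiply.
    Assumes independent failures (a simplification).
    """
    return (1.0 - node_failure_prob) ** n_nodes


def replicated_service_survival(n_replicas: int, node_failure_prob: float) -> float:
    """Probability a replicated service stays up in the same interval.

    The service only goes down if every replica fails at once.
    """
    return 1.0 - node_failure_prob ** n_replicas


# Hypothetical numbers: a 1-in-100,000 chance that a node fails in the interval.
p = 1e-5
print(f"100,000-node synchronous job: {sync_job_survival(100_000, p):.3f}")
print(f"3-replica web service:        {replicated_service_survival(3, p):.15f}")
```

With these assumed inputs the synchronous job survives the interval only about 37% of the time, while the replicated service is effectively always up. Adding nodes hurts the first model and helps the second.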
14:06 – 18:44
System balance and Amdahl’s Law: compute is easy, coordinated supercomputers aren’t
Amin argues the key is system balance—compute must be matched with I/O, memory bandwidth, and network capacity. He revisits Amdahl’s system balance principle and connects imbalance to low MFU and MoE-era bandwidth pressure.
- •Amdahl’s system balance: compute must be fed by proportional I/O
- •Modern I/O is largely networked; bandwidth provisioning is essential
- •MoE/sparse workloads increase memory-bandwidth pressure relative to flops
- •Spending more to achieve balance/reliability can outperform cheap flops
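The balance principle can be read as a simple bottleneck calculation: compute can only run as fast as I/O feeds it. The sketch below uses the ratio as quoted in the video description (one megabyte per second of I/O per million instructions per second) as a default parameter; the function name and the example machine sizes are assumptions made for illustration.

```python
def io_limited_utilization(
    compute_mips: float,
    io_mb_per_s: float,
    io_mb_per_s_per_mips: float = 1.0,  # balance ratio as quoted in the description
) -> float:
    """Fraction of peak compute usable given the available I/O bandwidth."""
    if compute_mips <= 0:
        raise ValueError("compute_mips must be positive")
    required_io = compute_mips * io_mb_per_s_per_mips
    return min(1.0, io_mb_per_s / required_io)


# Hypothetical systems: same compute, different I/O provisioning.
for io in (250.0, 500.0, 1000.0, 2000.0):
    util = io_limited_utilization(compute_mips=1000.0, io_mb_per_s=io)
    print(f"I/O {io:6.0f} MB/s -> usable compute {util:.0%}")
```

In this toy model, a system with half the required I/O can use at most half of its compute, and extra I/O beyond the balance point adds nothing. Cheap flops that cannot be fed are wasted.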
18:44 – 20:08
Balance at 100,000-node scale: why 100% MFU is impossible
They explore why perfect utilization collapses under scale: tiny variations compound into bubbles and stalls across distributed pipelines. The takeaway is to design for inevitable variance and manage compounding inefficiencies rather than expecting perfection.
- •Micro-variation (cache hit rates, timing) multiplies across nodes
- •Distributed synchronization amplifies small stalls into large utilization loss
- •Perfect balance is unattainable in real workloads at massive scale
- •Performance engineering becomes variance management at system scale
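A small simulation illustrates why variance compounds at scale. It assumes each node's step time is normally distributed around 1.0 with a small jitter, and that a synchronous step finishes only when the slowest node does. The model, the jitter value, and the name `step_efficiency` are illustrative assumptions, not a description of any real training system.

```python
import random


def step_efficiency(n_nodes: int, jitter: float, steps: int = 50, seed: int = 0) -> float:
    """Average useful work per node divided by wall-clock time per step.

    Each node takes roughly 1.0 time units per step plus Gaussian jitter.
    The step's wall-clock time is set by the slowest node, so every other
    node idles for the difference.
    """
    rng = random.Random(seed)
    useful = 0.0
    wall = 0.0
    for _ in range(steps):
        times = [max(0.0, rng.gauss(1.0, jitter)) for _ in range(n_nodes)]
        useful += sum(times) / n_nodes
        wall += max(times)
    return useful / wall


for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} nodes: efficiency {step_efficiency(n, jitter=0.05):.3f}")
```

Even with identical hardware and only 5% jitter, efficiency in this model falls as the node count grows, because the expected slowest node gets slower as more nodes are sampled.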
20:08 – 24:08
Procurement and lead times as a technical constraint (not just business ops)
Amin explains that the limiting factor is often physical: chips, memory, buildings, land, permits, and utility contracts impose multi-year lead times. He frames lead-time compression as a deeply technical systems problem spanning manufacturing and deployment.
- •Net-new gigawatt lead time can be 2–3 years regardless of budget
- •Scaling requires land, permitting, construction, and grid power commitments
- •Utilities increasingly demand long-term take-or-pay contracts
- •Planning errors are inevitable: underbuilding leaves opportunity on the table; overbuilding wastes capex
24:08 – 25:34
Stranded grid capacity and why serving may ‘unstrand’ smaller sites
They discuss the idea that sites with less than 100 MW of capacity are stranded because hyperscalers want expandable sites. Amin suggests the training-heavy era favors large contiguous power, but serving growth may naturally distribute workloads and make smaller sites viable, though not sufficient on their own.
- •Training prefers large contiguous infrastructure; serving is more fungible
- •Rising inference demand can make smaller deployments economical
- •Unstranding helps but won’t meet total demand at frontier scale
- •Long-term need remains for concentrated power delivery at scale
25:34 – 27:49
Career advice: don’t chase predictions—choose intrinsic motivation
Asked what he’d obsess over as a student, Amin argues there is no single bottleneck and forecasting is unreliable (including historical AI winters). He advises picking domains you’re intrinsically excited about because sustained passion beats trend-chasing.
- •No single bottleneck; what matters shifts over time
- •Tech forecasting is brittle (AI’s repeated ‘don’t work on it’ eras)
- •Work across the stack can be impactful: algorithms to OS to hardware
- •Intrinsic motivation is the best hedge against misprediction
27:49 – 31:33
Learning at Google: the TPUv2 networking debate and being wrong productively
Amin’s favorite story highlights a pivotal architectural decision: rejecting Ethernet for TPU supercomputers. The episode underscores first-principles thinking, spirited technical debate, and the value of being proven wrong when a different design fits the domain better.
- •TPUv2-era choice: Ethernet wasn’t optimal for TPU supercomputer fabric
- •Domain required different semantics/topology than conventional wisdom
- •Norm Jouppi cited as a key first-principles voice
- •Cultural lesson: learning happens when assumptions get challenged
31:33 – 36:49
Optical circuit switching: programmable topology for availability and bandwidth steering
Amin clarifies how optical circuit switches fit into Google’s network stack as a complement to electrical packet switching, not a universal replacement for it. He explains how MEMS mirror-based switching enables rapid rack-level reconfiguration for reliability and longer-timescale bandwidth ‘pinning’ for jobs.
- •OCS is used selectively; electrical packet switching remains extensive
- •Within-rack TPU links are copper point-to-point; links between racks use OCS
- •OCS enables fast topology repair by swapping racks logically in seconds
- •Higher-level OCS can steer bulk connectivity for multi-hour jobs to needed storage/clusters
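The idea of logically swapping a failed rack can be sketched as a remapping from logical topology slots to physical racks. This is a toy model under stated assumptions: `RackFabric`, its fields, and the rack names are hypothetical, and nothing here reflects Google's actual control software or any real OCS API.

```python
from dataclasses import dataclass, field


@dataclass
class RackFabric:
    """Toy model: logical torus slots are mapped onto physical racks.

    Rewiring a slot stands in for the optical switch redirecting light to a
    different rack; the logical topology seen by the job does not change.
    """

    slot_to_rack: dict[tuple[int, int, int], str]
    spares: list[str] = field(default_factory=list)

    def swap_failed_rack(self, failed_rack: str) -> str:
        """Point the failed rack's logical slot at a spare and return the spare."""
        if not self.spares:
            raise RuntimeError("no spare racks available")
        for slot, rack in self.slot_to_rack.items():
            if rack == failed_rack:
                replacement = self.spares.pop(0)
                self.slot_to_rack[slot] = replacement
                return replacement
        raise KeyError(f"rack {failed_rack!r} is not part of the topology")


fabric = RackFabric(
    slot_to_rack={(0, 0, 0): "rack-a", (1, 0, 0): "rack-b"},
    spares=["rack-spare-1"],
)
replacement = fabric.swap_failed_rack("rack-b")
print(f"slot (1, 0, 0) now served by {replacement}")
```

The design point the sketch tries to capture is that the job addresses logical slots, so repair becomes a mapping update with no physical recabling.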
36:49 – 37:49
Topology choices: torus for all-reduce, switches for all-to-all, and co-design with models
They examine why TPU training networks settled on a torus and when that’s suboptimal. Amin notes all-reduce patterns map well to torus, while arbitrary all-to-all benefits from switch-based fabrics—yet model designers often adapt to hardware constraints.
- •Torus is well-suited to parameter dissemination and all-reduce
- •All-to-all communication favors switch/fat-tree-like topologies
- •Workloads and collectives drive optimal interconnect design
- •Model/hardware co-design can mitigate topology limitations
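The trade-off between a torus and a switched fabric shows up in simple graph arithmetic. The sketch below computes neighbors and hop counts in a 3D torus with wraparound links; the 4×4×4 size and the function names are assumptions chosen for illustration, not the dimensions of any real TPU pod.

```python
from itertools import product


def torus_neighbors(coord: tuple[int, ...], dims: tuple[int, ...]) -> set[tuple[int, ...]]:
    """Directly linked nodes: one step along each axis, with wraparound."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for delta in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + delta) % size
            neighbors.add(tuple(nxt))
    neighbors.discard(tuple(coord))  # only matters for axes of size 1
    return neighbors


def torus_hops(a: tuple[int, ...], b: tuple[int, ...], dims: tuple[int, ...]) -> int:
    """Minimum hop count between two nodes, taking the shorter way around each ring."""
    return sum(min(abs(x - y), size - abs(x - y)) for x, y, size in zip(a, b, dims))


dims = (4, 4, 4)  # hypothetical torus size
origin = (0, 0, 0)
nodes = list(product(*(range(size) for size in dims)))
print("direct links per node:", len(torus_neighbors(origin, dims)))
print("peers in an all-to-all:", len(nodes) - 1)
print("worst-case hops:", max(torus_hops(origin, n, dims) for n in nodes))
```

In this example each node has 6 direct links but 63 peers. Ring-style collectives such as all-reduce only need the neighbor links, whereas all-to-all traffic must cross up to 6 hops, which is where a switch-based fabric has the advantage.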
37:49 – 46:26
Planning, depreciation, and dynamic replanning under uncertainty
Amin explains Google’s six-year hardware depreciation schedule and why older chips still see heavy use due to demand. He emphasizes that capacity planning is continuous replanning as new products, customers, regions, and chip mixes introduce shifting constraints.
- •Compute hardware depreciation at Google: ~6 years; often used longer
- •Older TPUs/GPUs remain valuable because demand is overwhelming
- •Watts and space are more fungible than specific chip generations
- •Core challenge: dynamic replanning as new use cases and constraints appear
46:26 – 1:00:00
What’s next: specialization (TPU 8i/8t), endless hardware bottlenecks, and energy as the hardest one
Amin argues ‘TPU vs GPU’ is not the right framing in a rapidly expanding market; specialization is increasing as inference and training diverge in balance needs. He predicts hardware remains a bottleneck for years (even with algorithmic breakthroughs) and identifies energy abundance/affordability as the least-solved constraint.
- •TPU line splitting: 8i (inference) vs 8t (training) reflects diverging system balance needs
- •Specialization yields big gains as general-purpose CPU improvements slow
- •Even a ‘next transformer’ efficiency jump likely won’t remove compute constraints
- •Energy is the hardest bottleneck: scaling power affordably and abundantly is under-solved
1:00:00 – 1:04:22
Community and grid impact: water, demand response, and ‘optimal scaling’ responsibility
The closing focuses on responsible scaling: data centers should be net assets to communities and the grid. Amin describes design tradeoffs (water vs power efficiency) and demand-response capability, urging the industry to optimize end-to-end rather than ‘capacity at any cost.’
- •Community uplift as a design goal: noise, water, power, jobs, and access
- •Choosing low-water designs even at a power-efficiency penalty where appropriate
- •Demand response: powering down to support grid peak events and reduce overprovisioning
- •Industry call: optimize scaling holistically—value, efficiency, and social responsibility
