Stanford Online

Stanford CS153 Frontier Systems | The Discipline of Delivering Value per Gigawatt

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai

To follow along with the course schedule and syllabus, visit: https://cs153.stanford.edu/

In a CS153 Frontier Systems lecture, the class returns to the upstream infrastructure stack with Amin Vahdat, who leads Google's internal compute infrastructure and the TPU program powering Gemini, framing his nearly 30-year career as the discipline of building reliable, balanced supercomputers at a planetary scale. Vahdat argues the industry is over-fixated on gigawatts and flops as headline metrics: at roughly $40 to $50 billion per gigawatt, the question that matters is value delivered per dollar, measured in happy daily active users and paying enterprise customers, not raw capacity.

He walks through the three constraints that govern utility. The first is reliability, where moving from 99 percent to 99.9 percent uptime cuts annual downtime from 3.65 days to under 9 hours, and where frontier labs are newly willing to trade five-nines for double the capacity. The second is system balance, invoking Amdahl's 1967 law that every million instructions per second needs a megabyte per second of I/O, now stretched across 100,000-node synchronous training jobs where a single failed node halts the entire computation. The third is procurement lead times of two to three years for net-new gigawatts, where land permitting, utility contracts, and 20-year take-or-pay power agreements have replaced the slack capacity that once let hyperscalers ask for ten megawatts on a handshake.

He details Google's optical circuit switch architecture, which uses 136 MEMS-controlled mirrors per chip to programmatically rewire the 3D torus topology connecting TPU racks, allowing failed racks to be virtually swapped in seconds and bandwidth redirected to distant storage clusters for the duration of a five-hour Borg job.

Vahdat closes on responsibility: data centers should be a net uplift to local grids and communities, citing Google's choice to accept 10 percent worse power efficiency in water-scarce regions and its gigawatt-scale demand response program that returns capacity to utilities during peak residential load.

Amin Vahdat is a Fellow and Chief Technologist for AI Infrastructure at Google, where his team is responsible for delivering industry-leading infrastructure that spans custom silicon, data centers, network, and supply chain and operations. This infrastructure serves Alphabet, Google, and the world, along with Artificial Intelligence technologies that empower ML developers and solve customers’ most pressing business challenges. In the past, he was Vice President and General Manager for Google's compute, storage, and network hardware and software infrastructure. Until 2019, he was the Technical Lead and Vice President for the Networking organization at Google. Before joining Google, Amin was the Science Applications International Corporation (SAIC) Professor of Computer Science and Engineering at UC San Diego (UCSD). He received his doctorate in computer science from the University of California, Berkeley, and is a Fellow of the Association for Computing Machinery (ACM).

Amin has been recognized with a number of awards, including the National Science Foundation (NSF) CAREER award, the UC Berkeley Distinguished EECS Alumni Award, the Alfred P. Sloan Fellowship, the Association for Computing Machinery's SIGCOMM Networking Systems Award, and the Duke University David and Janet Vaughn Teaching Award. Amin was awarded the SIGCOMM lifetime achievement award for his contributions to data center and wide area networks. He was inducted into the National Academy of Engineering in 2023 for his contributions to the design and implementation of datacenter and planet-scale networks that power cloud computer systems.
Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

May 27, 2026 | 1h 4m

WHAT IT’S REALLY ABOUT

Optimizing AI infrastructure: reliability, balance, and value per gigawatt

Raw capacity metrics like gigawatts or FLOPs are misleading; what matters is value delivered per dollar, such as happy daily active users or business outcomes.
Frontier training and large-scale inference change reliability tradeoffs because synchronous, tightly coupled jobs can fail if a single node fails, making availability and fast recovery central to “goodput.”
Power is often the binding constraint, and “five nines” electrical redundancy can halve usable power, pushing a new industry trend where some customers prefer more capacity with slightly less reliability.
System balance (Amdahl-style ratios across compute, memory bandwidth/capacity, storage, and networking) is critical; excess FLOPs without balanced I/O and memory bandwidth leads to low utilization and wasted capital.
Scaling is limited by long lead times and physical realities (land, permitting, utilities, manufacturing), so planning under uncertainty and dynamic replanning become core technical problems.

IDEAS WORTH REMEMBERING

5 ideas

Measure infrastructure by outcomes, not gigawatts.

A gigawatt is not “equal” across operators; reliability, schedulability, and end-to-end system design determine whether users get useful capability. Metrics like DAUs, satisfied customers, or revenue per deployed capacity better reflect real performance than capital spent or nameplate power.

Reliability is a utilization multiplier, not a checkbox.

At large synchronous scales, one accelerator failure can halt an entire run, so preventing failures and repairing or isolating failed components quickly directly increases goodput. Spending more to improve availability can be rational if it unlocks far higher realized throughput from the same fleet.
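
To make the scaling effect concrete, here is a rough sketch under simplifying assumptions that are not from the talk: node failures are independent and exponentially distributed, any single failure stops the whole synchronous job, and each failure costs a fixed recovery time. The function names and all numbers are hypothetical and chosen only for illustration.

```python
import math


def job_mtbf_hours(nodes: int, node_mtbf_hours: float) -> float:
    """Mean time between failures for the whole job: any one node failing stops it."""
    return node_mtbf_hours / nodes


def job_survival_probability(nodes: int, node_mtbf_hours: float, job_hours: float) -> float:
    """Probability that no node fails during the run (exponential failure model)."""
    return math.exp(-job_hours / job_mtbf_hours(nodes, node_mtbf_hours))


def goodput_fraction(nodes: int, node_mtbf_hours: float, recovery_hours: float) -> float:
    """Share of wall-clock time doing useful work if each failure costs recovery_hours."""
    uptime = job_mtbf_hours(nodes, node_mtbf_hours)
    return uptime / (uptime + recovery_hours)


if __name__ == "__main__":
    # Illustrative inputs only; these are assumptions, not figures from the lecture.
    nodes, node_mtbf = 100_000, 50_000.0
    print("job MTBF (hours):", job_mtbf_hours(nodes, node_mtbf))
    print("P(5-hour run sees no failure):", job_survival_probability(nodes, node_mtbf, 5.0))
    print("goodput, 0.5 h recovery:", goodput_fraction(nodes, node_mtbf, 0.5))
    print("goodput, 0.1 h recovery:", goodput_fraction(nodes, node_mtbf, 0.1))
```

With these made-up inputs the job as a whole fails about every half hour on average, and cutting recovery time from 0.5 hours to 0.1 hours raises the goodput fraction from 0.5 to roughly 0.83. That is the sense in which fast repair acts as a utilization multiplier.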

Power architecture choices embed business tradeoffs.

Achieving “five nines” power availability often requires 2N/1+1 redundancy, leaving large fractions of provisioned power idle. Some frontier workloads now prefer more capacity with tolerable downtime, a reversal from classic enterprise expectations.
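
The arithmetic behind this tradeoff can be sketched as follows. This is a minimal illustration, assuming a 365-day year and modeling 2N redundancy as a simple division of provisioned power by two; `annual_downtime_hours` and `usable_power_mw` are hypothetical helper names, and the 1,000 MW site is an invented example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected downtime per year for a given availability, e.g. 0.999."""
    return (1.0 - availability) * HOURS_PER_YEAR


def usable_power_mw(provisioned_mw: float, redundancy_factor: float) -> float:
    """Power available to load; redundancy_factor=2.0 is a simplified model of 2N."""
    return provisioned_mw / redundancy_factor


if __name__ == "__main__":
    for availability in (0.99, 0.999, 0.99999):
        hours = annual_downtime_hours(availability)
        print(f"{availability}: {hours:.4f} hours of downtime per year")
    # Hypothetical 1,000 MW site: fully duplicated power path vs none.
    print("2N usable MW:", usable_power_mw(1000.0, 2.0))
    print("N usable MW:", usable_power_mw(1000.0, 1.0))
```

Under these assumptions, five nines allows only about five minutes of downtime per year, but the duplicated power path leaves half of the provisioned megawatts unavailable to load.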

System balance beats raw FLOPs; imbalance explains low MFU.

Amdahl’s balance principle generalizes to modern AI: accelerators need sufficient HBM bandwidth/capacity plus network and storage bandwidth, or compute sits idle. Mixture-of-experts and sparse methods can shift the optimal balance toward more memory bandwidth relative to compute.
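
One common way to reason about this kind of imbalance is the roofline model, which is a standard textbook approximation and not necessarily the framing used in the lecture. The sketch below assumes a single memory-bandwidth limit and ignores network and storage; the function names and hardware numbers are hypothetical.

```python
def attainable_flops(peak_flops: float, mem_bytes_per_s: float, flops_per_byte: float) -> float:
    """Roofline bound: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, mem_bytes_per_s * flops_per_byte)


def compute_utilization(peak_flops: float, mem_bytes_per_s: float, flops_per_byte: float) -> float:
    """Fraction of peak compute the workload can reach under the roofline bound."""
    return attainable_flops(peak_flops, mem_bytes_per_s, flops_per_byte) / peak_flops


if __name__ == "__main__":
    # Invented accelerator: 1e15 FLOP/s peak, 1e12 bytes/s of memory bandwidth.
    peak, bandwidth = 1e15, 1e12
    # A low arithmetic intensity workload is bandwidth-bound.
    print("100 FLOP/byte:", compute_utilization(peak, bandwidth, 100.0))
    # A high arithmetic intensity workload reaches the compute ceiling.
    print("2000 FLOP/byte:", compute_utilization(peak, bandwidth, 2000.0))
```

In this toy setup, a workload doing 100 floating-point operations per byte moved can use only a tenth of peak compute, which illustrates why adding FLOPs without adding memory bandwidth leaves accelerators idle.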

Agents increase the ‘whole-system’ bottleneck surface area.

Even if TPU/GPU fabrics are fast, agents can stall on CPUs, storage locality, and data-center networking, so orchestration across all tiers becomes critical. End-to-end latency and contention can dominate while expensive accelerators wait.

WORDS WORTH SAVING

5 quotes

The measure isn't how much money you spent per gigawatt, it's actually how much value you deliver per dollar.

— Amin Vahdat

If we have capacity that is sitting around idle, that's a bug.

— Amin Vahdat

Scaling flops is easy. Building a coordinated supercomputer that scales out to 10,000, 100,000-ish TPUs that has the right balance point, super hard, and this balance point is the, the key, key insight.

— Amin Vahdat

Pick the problem domain that you are most intrinsically excited about because that, that passion for it, that's, that's what's gonna carry you forward.

— Amin Vahdat

No point in the future that I can see does the hardware bec- stop being a bottleneck.

— Amin Vahdat

Value per gigawatt vs capacity bragging rights
Reliability vs throughput tradeoffs (99.9% vs 99.999%)
Synchronous training failure domains and “goodput”
Amdahl’s law and system balance (HBM, network, storage)
Power provisioning, redundancy, and demand response
Optical circuit switching for programmable topology and recovery
Supply-chain lead times, procurement planning, and uncertainty
TPU specialization for inference vs training
Ecosystem mindset: many winners, societal guardrails
Community/environmental constraints: water, PUE, grid uplift

High quality AI-generated summary created from speaker-labeled transcript.
