OpenAI: Why AI needs a new kind of supercomputer network — the OpenAI Podcast Ep. 18
CHAPTERS
Why GPU supercomputers are a networking problem now
The episode sets up why training frontier AI models increasingly depends on data center networking, not just faster GPUs. As clusters scale, network behavior becomes a first-order limiter of training speed and reliability.
Greg Steinbrecher’s route from physics to AI training systems
Greg explains how his interest in complex systems led from physics and quantum computing toward data center networking and GPU cluster simulation. He ultimately shifted from modeling systems to building the software that keeps GPU training efficient and resilient.
Mark Handley’s networking background and the value of standardization
Mark describes decades of networking research, including early video conferencing work that influenced cellular standards. He contrasts slow-moving global internet standards with the faster iteration possible inside data centers.
Why AI training stresses networks differently than the internet
They explain the core mismatch: internet traffic averages out via many independent flows, while AI training synchronizes thousands of GPUs into one tightly coupled workload. The network must deliver not “good average” performance, but consistently strong worst-case performance.
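A minimal sketch of that lockstep effect, with made-up latency numbers: because every worker must finish its exchange before any can proceed, step time is the maximum over all workers, so the tail of the latency distribution sets the pace for the whole cluster.

```python
import math
import random

# Toy model: synchronized training is gated by worst-case, not average,
# network performance. All numbers below are illustrative, not measured.
random.seed(0)
N_WORKERS = 4096
STEPS = 200

total_ms = 0.0
for _ in range(STEPS):
    # Per-worker transfer time: mostly ~10 ms, occasionally much slower
    # (a congested path, a flaky optic). lognormvariate gives a heavy tail.
    times = [random.lognormvariate(2.3, 0.3) for _ in range(N_WORKERS)]
    total_ms += max(times)  # lockstep: the slowest worker sets the step time

avg_worker = math.exp(2.3 + 0.3**2 / 2)  # mean of the lognormal, ~10.4 ms
print(f"average worker transfer: ~{avg_worker:.1f} ms")
print(f"average *step* time (max of {N_WORKERS}): {total_ms / STEPS:.1f} ms")
# The gap between the two is pure tail latency: the cluster runs at the
# speed of its unluckiest worker, every step.
```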
Bandwidth, multi-hop fabrics, and path collisions create tail bottlenecks
Large GPU clusters require massive bandwidth, forcing networks to become deep hierarchies with thousands of switches and many possible paths. Random or poorly balanced path selection creates hotspots that dominate overall step time because training proceeds in lockstep.
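A toy simulation of the collision problem, using plain random path hashing (ECMP-style) and illustrative sizes: even when flows and paths are perfectly matched in count, some paths end up several times busier than others.

```python
import random
from collections import Counter

# Sketch: why random path selection creates hotspots even when total
# capacity is sufficient. Parameters are illustrative.
random.seed(1)
N_PATHS = 256   # equal-cost paths through the fabric
N_FLOWS = 256   # one flow per path's worth of traffic: 1.0 = perfect balance

loads = Counter(random.randrange(N_PATHS) for _ in range(N_FLOWS))
worst = max(loads.values())
print(f"ideal load per path: 1, worst actual path: {worst} flows")
# With 256 flows hashed onto 256 paths, the busiest path typically carries
# 4-5 flows while others sit idle. In lockstep training, that one hot path
# sets the step time for the entire cluster.
```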
Failures are constant at scale—and waiting is extremely expensive
As clusters grow, component failures become routine rather than exceptional, especially with millions of optical links. Traditional routing reconvergence and transient issues can pause work, trigger retries, or even crash jobs—wasting enormous GPU time and delaying model training.
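Rough arithmetic, with assumed fleet sizes and reliability figures, shows why failure stops being exceptional:

```python
# Back-of-the-envelope: why failure is the steady state at scale.
# Numbers are illustrative assumptions, not fleet data from the episode.
n_links = 2_000_000        # optical links in a large cluster (assumed)
mtbf_hours = 5_000_000     # per-link mean time between failures (assumed)

failures_per_hour = n_links / mtbf_hours
print(f"expected link failures: {failures_per_hour:.1f} per hour, "
      f"{failures_per_hour * 24:.0f} per day")
# Even with each link averaging ~570 years between failures, a fleet of
# millions sees failures every few hours -- so recovery must be routine
# and fast, not an exceptional event that pauses the job.
```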
MRC’s core idea: multipath spraying plus fast loss clarity
MRC combines multiple techniques to avoid hotspots and keep utilization high. Packets are sprayed across many paths for load balancing, while companion mechanisms handle the resulting reordering and make congestion and loss explicitly detectable rather than inferred from timeouts.
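A toy model of the spraying side (not MRC's actual wire format): packets carry sequence numbers, the sender round-robins them over paths, and the receiver reassembles in any arrival order and can name the exact gaps to re-request.

```python
from dataclasses import dataclass, field

# Toy model of per-packet spraying plus receiver-side reassembly.
@dataclass
class Packet:
    seq: int
    path: int
    payload: bytes

def spray(message: bytes, mtu: int, n_paths: int) -> list[Packet]:
    """Split a message into packets and spread them across all paths."""
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)]
    return [Packet(seq=i, path=i % n_paths, payload=c)
            for i, c in enumerate(chunks)]

@dataclass
class Receiver:
    expected: int
    got: dict = field(default_factory=dict)

    def receive(self, pkt: Packet) -> None:
        self.got[pkt.seq] = pkt.payload   # arrival order doesn't matter

    def missing(self) -> list[int]:
        """Precise gap list -> retransmit requests, no timeout guesswork."""
        return [s for s in range(self.expected) if s not in self.got]

pkts = spray(b"x" * 40_000, mtu=4096, n_paths=8)
rx = Receiver(expected=len(pkts))
for p in pkts:
    if p.seq != 3:                        # simulate one lost packet
        rx.receive(p)
print("lost packets to re-request:", rx.missing())   # -> [3]
```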
Packet trimming: preventing ambiguity by delivering the header even under congestion
Instead of dropping packets when queues overflow, MRC trims payloads and forwards tiny headers so the receiver can immediately request retransmission. This preserves signal about what happened, enabling faster recovery even when packets take different paths.
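A sketch of the trimming idea; the queue and header sizes here are invented for illustration. When a queue can't take the full packet, the switch keeps the tiny header, so the loss is visible to the receiver rather than silent.

```python
from collections import deque

# Toy model of packet trimming at a congested switch queue -- not
# switch firmware, just the idea.
HEADER_BYTES = 64

class TrimmingQueue:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.q = deque()

    def enqueue(self, header: bytes, payload: bytes) -> str:
        size = len(header) + len(payload)
        if self.used + size <= self.capacity:
            self.q.append((header, payload))
            self.used += size
            return "forwarded"
        # Queue full: trim the payload, forward the header alone.
        # Headers are tiny, so they almost always still fit.
        if self.used + HEADER_BYTES <= self.capacity:
            self.q.append((header, b""))
            self.used += HEADER_BYTES
            return "trimmed"        # receiver can request retransmit at once
        return "dropped"            # rare: not even room for a header

q = TrimmingQueue(capacity_bytes=1200)
hdr, data = b"h" * 64, b"d" * 1000
print([q.enqueue(hdr, data) for _ in range(4)])
# -> ['forwarded', 'trimmed', 'trimmed', 'dropped'] under this toy sizing.
# For every trimmed packet the receiver knows exactly what to re-request,
# instead of waiting on a retransmission timeout to infer a silent drop.
```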
Routing around failures in milliseconds—without waiting for network convergence
They contrast conventional distributed routing updates (e.g., BGP-like convergence) with MRC’s endpoint-driven avoidance. Endpoints quickly infer bad paths and stop using them, so link flaps don’t stall training jobs.
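One way an endpoint could implement this, shown as an illustrative policy rather than MRC's actual algorithm: score each path from its own acks and trims, and spray only over paths that haven't accumulated strikes. Recovery takes a few round trips instead of a routing-protocol convergence.

```python
# Sketch: endpoint-driven failure avoidance with a simple strike counter.
class PathSelector:
    def __init__(self, n_paths: int, max_strikes: int = 3):
        self.strikes = [0] * n_paths
        self.max_strikes = max_strikes
        self.rr = 0

    def healthy(self) -> list[int]:
        return [p for p, s in enumerate(self.strikes) if s < self.max_strikes]

    def next_path(self) -> int:
        live = self.healthy()        # assumes at least one path stays healthy
        self.rr = (self.rr + 1) % len(live)
        return live[self.rr]

    def on_loss(self, path: int) -> None:
        self.strikes[path] += 1      # a few strikes: stop using this path

    def on_ack(self, path: int) -> None:
        self.strikes[path] = 0       # path proved itself healthy again

sel = PathSelector(n_paths=8)
for _ in range(3):
    sel.on_loss(5)                   # path 5's link starts flapping
print("paths still in use:", sel.healthy())   # 5 is avoided within ~3 RTTs
```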
Turning off dynamic routing: static cores, smarter edges, simpler operations
A major operational shift: because MRC can find working paths itself, OpenAI can run the fabric with static routing tables. This reduces control-plane complexity and removes a failure-prone class of network behaviors.
How it’s implemented: source routing with IPv6 segment routing
They describe pushing intelligence to endpoints and making switches “dumber” via source routing. Using IPv6 segment routing, each packet carries the sequence of switches it should traverse, aiding deterministic load balancing and simpler core behavior.
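A sketch of what a source-routed packet looks like, built with scapy's SRv6 support; the switch addresses and payload are invented, and the exact encoding used in production isn't specified in the episode.

```python
# Sketch: "the packet carries its own path" with an IPv6 segment-routing
# header. Addresses are hypothetical switch IDs.
from scapy.layers.inet6 import IPv6, IPv6ExtHdrSegmentRouting
from scapy.layers.inet import UDP

# The endpoint picks the path: an explicit list of switch hops. Core
# switches only pop/forward segments -- no dynamic route computation.
hops = ["2001:db8::a1",   # leaf switch
        "2001:db8::b7",   # spine switch
        "2001:db8::c2"]   # destination leaf

pkt = (IPv6(dst=hops[0])
       / IPv6ExtHdrSegmentRouting(addresses=list(reversed(hops)),
                                  segleft=len(hops) - 1)
       / UDP(dport=4791)            # e.g. RoCEv2's UDP port
       / b"gradient shard ...")

pkt.show()
# Every packet of the same transfer can name a different hop list, giving
# the sender deterministic, per-packet control over load balancing.
```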
Why MRC becomes an open standard (and who helped build it)
OpenAI is publishing MRC through the Open Compute Project (OCP) because open standards accelerate the ecosystem and reduce fragmented, incompatible approaches. They highlight collaboration with Microsoft and major silicon and networking vendors to bring it into production hardware.
Efficiency and cost: flatter networks, fewer devices, better work-per-watt
Beyond reliability, MRC enables simpler and flatter network designs with fewer switch layers, cutting cost and power. More of the power budget can go to GPUs doing useful work, improving overall system efficiency.
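Back-of-the-envelope numbers using standard fat-tree capacity formulas, with an assumed switch radix and power draw, illustrate why dropping a tier matters:

```python
# Sketch: fewer tiers -> fewer switches -> less power spent on the network.
# Standard folded-Clos (fat-tree) formulas; radix and wattage are assumed.
radix = 64               # ports per switch (assumed)
watts_per_switch = 2000  # assumed, optics included

# Hosts supported at full bisection bandwidth:
hosts_2tier = radix**2 // 2      # leaf-spine
hosts_3tier = radix**3 // 4      # three-tier fat-tree

# Switch counts for those fabrics:
switches_2tier = radix + radix // 2      # leaves + spines
switches_3tier = 5 * radix**2 // 4       # edge + aggregation + core

for tiers, hosts, sw in [(2, hosts_2tier, switches_2tier),
                         (3, hosts_3tier, switches_3tier)]:
    print(f"{tiers}-tier: {hosts:>6} hosts, {sw:>5} switches, "
          f"{sw * watts_per_switch / hosts:.0f} W of network per host")
# Per-host network power grows with each added tier. If endpoint
# multipathing lets a design drop or thin a tier, the saved switch and
# optics power can go to GPUs doing useful work instead.
```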
Future limits and the ‘compute in space’ question
They close by discussing scaling constraints like the speed of light and the inevitability of ongoing networking work as hardware evolves. On space-based compute, they argue training in orbit is impractical due to latency, failures, and maintenance challenges, even if it’s conceptually intriguing.
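The speed-of-light constraint behind both points is easy to quantify; the distances here are illustrative:

```python
# Sketch: latency floors set by physics, before any queuing or failures.
C_FIBER_KM_S = 200_000     # light in fiber, roughly 2/3 of c
C_VACUUM_KM_S = 300_000

for label, km, speed in [
    ("across a data center hall", 0.5, C_FIBER_KM_S),
    ("between two nearby sites", 100, C_FIBER_KM_S),
    ("ground to low-earth orbit (~550 km, straight down)", 550, C_VACUUM_KM_S),
]:
    rtt_ms = 2 * km / speed * 1000
    print(f"{label}: >= {rtt_ms:.3f} ms round trip")
# In-hall hops cost microseconds; orbit costs milliseconds each way --
# a hard floor that no protocol design can engineer away.
```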