OpenAI: Why AI needs a new kind of supercomputer network — the OpenAI Podcast Ep. 18
CHAPTERS
Why GPU supercomputers are a networking problem now
The episode sets up why training frontier AI models increasingly depends on data center networking, not just faster GPUs. As clusters scale, network behavior becomes a first-order limiter of training speed and reliability.
Greg Steinbrecher’s route from physics to AI training systems
Greg explains how his interest in complex systems led from physics and quantum computing toward data center networking and GPU cluster simulation. He ultimately shifted from modeling systems to building the software that keeps GPU training efficient and resilient.
Mark Handley’s networking background and the value of standardization
Mark describes decades of networking research, including early video conferencing work that influenced cellular standards. He contrasts slow-moving global internet standards with the faster iteration possible inside data centers.
Why AI training stresses networks differently than the internet
They explain the core mismatch: internet traffic averages out via many independent flows, while AI training synchronizes thousands of GPUs into one tightly coupled workload. The network must deliver not “good average” performance, but consistently strong worst-case performance.
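A minimal sketch of that lockstep effect, with made-up latency numbers: because every worker must finish its exchange before any can proceed, step time is the maximum over all workers, so the tail of the latency distribution sets the pace for the whole cluster.

```python
import math
import random

# Toy model: synchronized training is gated by worst-case, not average,
# network performance. All numbers below are illustrative, not measured.
random.seed(0)
N_WORKERS = 4096
STEPS = 200

total_ms = 0.0
for _ in range(STEPS):
    # Per-worker transfer time: mostly ~10 ms, occasionally much slower
    # (a congested path, a flaky optic). lognormvariate gives a heavy tail.
    times = [random.lognormvariate(2.3, 0.3) for _ in range(N_WORKERS)]
    total_ms += max(times)  # lockstep: the slowest worker sets the step time

avg_worker = math.exp(2.3 + 0.3**2 / 2)  # mean of the lognormal, ~10.4 ms
print(f"average worker transfer: ~{avg_worker:.1f} ms")
print(f"average *step* time (max of {N_WORKERS}): {total_ms / STEPS:.1f} ms")
# The gap between the two is pure tail latency: the cluster runs at the
# speed of its unluckiest worker, every step.
```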
Bandwidth, multi-hop fabrics, and path collisions create tail bottlenecks
Large GPU clusters require massive bandwidth, forcing networks to become deep hierarchies with thousands of switches and many possible paths. Random or poorly balanced path selection creates hotspots that dominate overall step time because training proceeds in lockstep.
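A toy simulation of the collision problem, using plain random path hashing (ECMP-style) and illustrative sizes: even when flows and paths are perfectly matched in count, some paths end up several times busier than others.

```python
import random
from collections import Counter

# Sketch: why random path selection creates hotspots even when total
# capacity is sufficient. Parameters are illustrative.
random.seed(1)
N_PATHS = 256   # equal-cost paths through the fabric
N_FLOWS = 256   # one flow per path's worth of traffic: 1.0 = perfect balance

loads = Counter(random.randrange(N_PATHS) for _ in range(N_FLOWS))
worst = max(loads.values())
print(f"ideal load per path: 1, worst actual path: {worst} flows")
# With 256 flows hashed onto 256 paths, the busiest path typically carries
# 4-5 flows while others sit idle. In lockstep training, that one hot path
# sets the step time for the entire cluster.
```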
Failures are constant at scale—and waiting is extremely expensive
As clusters grow, component failures become routine rather than exceptional, especially with millions of optical links. Traditional routing reconvergence and transient issues can pause work, trigger retries, or even crash jobs—wasting enormous GPU time and delaying model training.
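Rough arithmetic, with assumed fleet sizes and reliability figures, shows why failure stops being exceptional:

```python
# Back-of-the-envelope: why failure is the steady state at scale.
# Numbers are illustrative assumptions, not fleet data from the episode.
n_links = 2_000_000        # optical links in a large cluster (assumed)
mtbf_hours = 5_000_000     # per-link mean time between failures (assumed)

failures_per_hour = n_links / mtbf_hours
print(f"expected link failures: {failures_per_hour:.1f} per hour, "
      f"{failures_per_hour * 24:.0f} per day")
# Even with each link averaging ~570 years between failures, a fleet of
# millions sees failures every few hours -- so recovery must be routine
# and fast, not an exceptional event that pauses the job.
```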
MRC’s core idea: multipath spraying plus fast loss clarity
MRC combines multiple techniques to avoid hotspots and keep utilization high. Packets are sprayed across many paths for load balancing, while companion mechanisms handle the resulting reordering and make congestion and loss explicitly detectable rather than inferred from timeouts.
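A toy model of the spraying side (not MRC's actual wire format): packets carry sequence numbers, the sender round-robins them over paths, and the receiver reassembles in any arrival order and can name the exact gaps to re-request.

```python
from dataclasses import dataclass, field

# Toy model of per-packet spraying plus receiver-side reassembly.
@dataclass
class Packet:
    seq: int
    path: int
    payload: bytes

def spray(message: bytes, mtu: int, n_paths: int) -> list[Packet]:
    """Split a message into packets and spread them across all paths."""
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)]
    return [Packet(seq=i, path=i % n_paths, payload=c)
            for i, c in enumerate(chunks)]

@dataclass
class Receiver:
    expected: int
    got: dict = field(default_factory=dict)

    def receive(self, pkt: Packet) -> None:
        self.got[pkt.seq] = pkt.payload   # arrival order doesn't matter

    def missing(self) -> list[int]:
        """Precise gap list -> retransmit requests, no timeout guesswork."""
        return [s for s in range(self.expected) if s not in self.got]

pkts = spray(b"x" * 40_000, mtu=4096, n_paths=8)
rx = Receiver(expected=len(pkts))
for p in pkts:
    if p.seq != 3:                        # simulate one lost packet
        rx.receive(p)
print("lost packets to re-request:", rx.missing())   # -> [3]
```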
Packet trimming: preventing ambiguity by delivering the header even under congestion
Instead of dropping packets when queues overflow, MRC trims payloads and forwards tiny headers so the receiver can immediately request retransmission. This preserves signal about what happened, enabling faster recovery even when packets take different paths.
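A sketch of the trimming idea; the queue and header sizes here are invented for illustration. When a queue can't take the full packet, the switch keeps the tiny header, so the loss is visible to the receiver rather than silent.

```python
from collections import deque

# Toy model of packet trimming at a congested switch queue -- not
# switch firmware, just the idea.
HEADER_BYTES = 64

class TrimmingQueue:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.q = deque()

    def enqueue(self, header: bytes, payload: bytes) -> str:
        size = len(header) + len(payload)
        if self.used + size <= self.capacity:
            self.q.append((header, payload))
            self.used += size
            return "forwarded"
        # Queue full: trim the payload, forward the header alone.
        # Headers are tiny, so they almost always still fit.
        if self.used + HEADER_BYTES <= self.capacity:
            self.q.append((header, b""))
            self.used += HEADER_BYTES
            return "trimmed"        # receiver can request retransmit at once
        return "dropped"            # rare: not even room for a header

q = TrimmingQueue(capacity_bytes=1200)
hdr, data = b"h" * 64, b"d" * 1000
print([q.enqueue(hdr, data) for _ in range(4)])
# -> ['forwarded', 'trimmed', 'trimmed', 'dropped'] under this toy sizing.
# For every trimmed packet the receiver knows exactly what to re-request,
# instead of waiting on a retransmission timeout to infer a silent drop.
```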
Routing around failures in milliseconds—without waiting for network convergence
They contrast conventional distributed routing updates (e.g., BGP-like convergence) with MRC’s endpoint-driven avoidance. Endpoints quickly infer bad paths and stop using them, so link flaps don’t stall training jobs.
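One way an endpoint could implement this, shown as an illustrative policy rather than MRC's actual algorithm: score each path from its own acks and trims, and spray only over paths that haven't accumulated strikes. Recovery takes a few round trips instead of a routing-protocol convergence.

```python
# Sketch: endpoint-driven failure avoidance with a simple strike counter.
class PathSelector:
    def __init__(self, n_paths: int, max_strikes: int = 3):
        self.strikes = [0] * n_paths
        self.max_strikes = max_strikes
        self.rr = 0

    def healthy(self) -> list[int]:
        return [p for p, s in enumerate(self.strikes) if s < self.max_strikes]

    def next_path(self) -> int:
        live = self.healthy()        # assumes at least one path stays healthy
        self.rr = (self.rr + 1) % len(live)
        return live[self.rr]

    def on_loss(self, path: int) -> None:
        self.strikes[path] += 1      # a few strikes: stop using this path

    def on_ack(self, path: int) -> None:
        self.strikes[path] = 0       # path proved itself healthy again

sel = PathSelector(n_paths=8)
for _ in range(3):
    sel.on_loss(5)                   # path 5's link starts flapping
print("paths still in use:", sel.healthy())   # 5 is avoided within ~3 RTTs
```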
Turning off dynamic routing: static cores, smarter edges, simpler operations
A major operational shift: because MRC can find working paths itself, OpenAI can run the fabric with static routing tables. This reduces control-plane complexity and removes a failure-prone class of network behaviors.
How it’s implemented: source routing with IPv6 segment routing
They describe pushing intelligence to endpoints and making switches “dumber” via source routing. Using IPv6 segment routing, each packet carries the sequence of switches it should traverse, aiding deterministic load balancing and simpler core behavior.
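A sketch of what a source-routed packet looks like, built with scapy's SRv6 support; the switch addresses and payload are invented, and the exact encoding used in production isn't specified in the episode.

```python
# Sketch: "the packet carries its own path" with an IPv6 segment-routing
# header. Addresses are hypothetical switch IDs.
from scapy.layers.inet6 import IPv6, IPv6ExtHdrSegmentRouting
from scapy.layers.inet import UDP

# The endpoint picks the path: an explicit list of switch hops. Core
# switches only pop/forward segments -- no dynamic route computation.
hops = ["2001:db8::a1",   # leaf switch
        "2001:db8::b7",   # spine switch
        "2001:db8::c2"]   # destination leaf

pkt = (IPv6(dst=hops[0])
       / IPv6ExtHdrSegmentRouting(addresses=list(reversed(hops)),
                                  segleft=len(hops) - 1)
       / UDP(dport=4791)            # e.g. RoCEv2's UDP port
       / b"gradient shard ...")

pkt.show()
# Every packet of the same transfer can name a different hop list, giving
# the sender deterministic, per-packet control over load balancing.
```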
Why MRC becomes an open standard (and who helped build it)
OpenAI is publishing MRC through the Open Compute Project (OCP) because open standards accelerate the ecosystem and reduce fragmented, incompatible approaches. They highlight collaboration with Microsoft and major silicon and networking vendors to bring it into production hardware.
Efficiency and cost: flatter networks, fewer devices, better work-per-watt
Beyond reliability, MRC enables simpler and flatter network designs with fewer switch layers, cutting cost and power. More of the power budget can go to GPUs doing useful work, improving overall system efficiency.
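Back-of-the-envelope numbers using standard fat-tree capacity formulas, with an assumed switch radix and power draw, illustrate why dropping a tier matters:

```python
# Sketch: fewer tiers -> fewer switches -> less power spent on the network.
# Standard folded-Clos (fat-tree) formulas; radix and wattage are assumed.
radix = 64               # ports per switch (assumed)
watts_per_switch = 2000  # assumed, optics included

# Hosts supported at full bisection bandwidth:
hosts_2tier = radix**2 // 2      # leaf-spine
hosts_3tier = radix**3 // 4      # three-tier fat-tree

# Switch counts for those fabrics:
switches_2tier = radix + radix // 2      # leaves + spines
switches_3tier = 5 * radix**2 // 4       # edge + aggregation + core

for tiers, hosts, sw in [(2, hosts_2tier, switches_2tier),
                         (3, hosts_3tier, switches_3tier)]:
    print(f"{tiers}-tier: {hosts:>6} hosts, {sw:>5} switches, "
          f"{sw * watts_per_switch / hosts:.0f} W of network per host")
# Per-host network power grows with each added tier. If endpoint
# multipathing lets a design drop or thin a tier, the saved switch and
# optics power can go to GPUs doing useful work instead.
```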
Future limits and the ‘compute in space’ question
They close by discussing scaling constraints like the speed of light and the inevitability of ongoing networking work as hardware evolves. On space-based compute, they argue training in orbit is impractical due to latency, failures, and maintenance challenges, even if it’s conceptually intriguing.
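The speed-of-light constraint behind both points is easy to quantify; the distances here are illustrative:

```python
# Sketch: latency floors set by physics, before any queuing or failures.
C_FIBER_KM_S = 200_000     # light in fiber, roughly 2/3 of c
C_VACUUM_KM_S = 300_000

for label, km, speed in [
    ("across a data center hall", 0.5, C_FIBER_KM_S),
    ("between two nearby sites", 100, C_FIBER_KM_S),
    ("ground to low-earth orbit (~550 km, straight down)", 550, C_VACUUM_KM_S),
]:
    rtt_ms = 2 * km / speed * 1000
    print(f"{label}: >= {rtt_ms:.3f} ms round trip")
# In-hall hops cost microseconds; orbit costs milliseconds each way --
# a hard floor that no protocol design can engineer away.
```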