
Why AI needs a new kind of supercomputer network — the OpenAI Podcast Ep. 18

Training frontier models isn’t as simple as adding more GPUs—one small problem and the whole coordinated dance falls apart. OpenAI’s Mark Handley and Greg Steinbrecher discuss how a new supercomputer network design, used to train some of the company’s latest models, keeps the whole system moving in lockstep, even with record numbers of GPUs. They break down Multipath Reliable Connection, a new protocol OpenAI developed with AMD, Broadcom, Intel, Microsoft, and Nvidia, and why they’re making it available for the whole industry to use.

Chapters

00:00 Intro
00:39 Greg and Mark's paths to OpenAI
04:34 Why training AI stresses networks differently
10:05 Bottlenecks, failures, and the cost of waiting
15:19 How Multipath Reliable Connection works
18:59 A protocol to route around failures
25:05 Why OpenAI is making MRC an open standard
35:09 Could AI compute move to space?

Host: Andrew Mayne · Guests: Greg Steinbrecher, Mark Handley
May 5, 2026 · 37m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

OpenAI’s MRC networking makes massive GPU training faster, more resilient, and simpler to operate

  1. AI training workloads stress networks differently than traditional internet/web traffic because thousands of GPUs must communicate in lockstep, making worst-case latency and congestion the true limiter.
  2. MRC improves throughput and stability by spraying traffic across many paths while eliminating packet-loss ambiguity via packet trimming and rapid retransmission signaling.
  3. At large scales, frequent component failures are inevitable; MRC routes around failures in milliseconds at endpoints rather than waiting seconds for routing-protocol convergence.
  4. By pushing intelligence to the network edge and simplifying the core (including static routing and source routing), OpenAI can reduce operational complexity and potentially build flatter, lower-power networks.
  5. OpenAI is publishing MRC via OCP as an open standard, aiming to prevent ecosystem fragmentation and accelerate industry-wide infrastructure progress for AI compute.

IDEAS WORTH REMEMBERING

5 ideas

AI training turns the network into part of the compute.

Because GPUs must exchange data every step to agree on results, any slow link, congestion hotspot, or delay forces all GPUs to wait, wasting expensive compute.

Average network performance matters less than the worst-case link.

With synchronized workloads, the most congested link (the “tail of the tail,” or P100 behavior) sets the pace for the entire training step, changing design goals versus typical web workloads.
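This "worst link sets the pace" effect is easy to see numerically. The sketch below (a hypothetical illustration, not MRC itself; the delay figures are made up) shows how one congested link among thousands of healthy ones dictates the duration of a synchronous training step:

```python
import random

def step_time(per_link_ms):
    # Every GPU waits at the synchronization barrier for the slowest
    # exchange, so the step is paced by the worst link, not the average.
    return max(per_link_ms)

random.seed(0)
# Hypothetical delays: thousands of healthy links plus one congested one.
delays_ms = [random.uniform(1.0, 1.2) for _ in range(9999)] + [8.0]

print(f"mean link delay:    {sum(delays_ms) / len(delays_ms):.2f} ms")
print(f"training-step pace: {step_time(delays_ms):.2f} ms")
```

Even though the mean delay barely moves, the step runs at the pace of the single 8 ms link, which is why design goals shift from average-case to P100 behavior.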

Multipath alone isn’t enough; you must remove reordering/loss ambiguity.

Spraying packets across many paths improves load balance but increases reordering; MRC’s packet trimming forwards headers even when payloads would be dropped, enabling immediate retransmission requests.
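The value of trimming over silent drops can be sketched in a few lines. This is a simplified illustration under assumed semantics (the queue limit, field names, and `enqueue`/`on_receive` helpers are invented for the example, not taken from the MRC specification):

```python
def enqueue(data_queue, ctrl_queue, packet, limit=4):
    """Hypothetical trimming switch port: when the data buffer is full,
    don't drop the packet silently; forward just its header on a small
    control queue so the receiver learns which payload was lost."""
    if len(data_queue) < limit:
        data_queue.append(packet)
        return "forwarded"
    ctrl_queue.append({"seq": packet["seq"], "trimmed": True})
    return "trimmed"

def on_receive(packet, nacks):
    # Receiver side: a trimmed header pinpoints the missing payload, so
    # it can request a retransmit at once instead of waiting for a
    # timeout to infer a silent drop.
    if packet.get("trimmed"):
        nacks.append(packet["seq"])

data_q, ctrl_q, nacks = [], [], []
for seq in range(6):
    enqueue(data_q, ctrl_q, {"seq": seq, "payload": b"tensor-chunk"}, limit=4)
for hdr in ctrl_q:
    on_receive(hdr, nacks)
print(nacks)  # → [4, 5]: the receiver knows exactly which packets to re-request
```

With plain drops, the receiver would only discover packets 4 and 5 were lost after a retransmission timer expired; trimming converts that ambiguity into an immediate, explicit signal.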

MRC makes failures routine rather than catastrophic.

Instead of waiting for distributed routing protocols to converge after a link failure (seconds to tens of seconds), endpoints quickly stop using bad paths within milliseconds, minimizing training disruption.
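The endpoint-side view of this can be sketched as follows. This is a minimal illustration of the idea, assuming a simple ACK-timeout heuristic; the class, timeout value, and path names are invented for the example and are not the actual MRC mechanism:

```python
class MultipathSender:
    """Hypothetical endpoint that sprays packets across many paths and
    stops using any path whose packets stop being acknowledged, within
    milliseconds, rather than waiting seconds for the network's routing
    protocol to reconverge after a failure."""

    def __init__(self, paths, timeout_ms=5.0):
        self.timeout_ms = timeout_ms
        self.last_ack_ms = {p: 0.0 for p in paths}

    def on_ack(self, path, now_ms):
        self.last_ack_ms[path] = now_ms

    def usable_paths(self, now_ms):
        # A path with no recent ACK is presumed broken and skipped.
        return [p for p, t in self.last_ack_ms.items()
                if now_ms - t <= self.timeout_ms]

sender = MultipathSender(paths=["p0", "p1", "p2", "p3"])
for p in ["p0", "p1", "p3"]:
    sender.on_ack(p, now_ms=10.0)  # p2's ACKs have gone missing
print(sender.usable_paths(now_ms=12.0))  # → ['p0', 'p1', 'p3']
```

Because the decision is local to the endpoint, no distributed convergence is needed: the bad path simply stops receiving traffic while the rest of the spray continues.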

Simplifying the network core increases reliability at scale.

MRC can operate with static routing and source-routed packets (IPv6 segment routing), reducing reliance on complex switch control planes that can themselves fail or add operational burden.
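The source-routing idea can be illustrated with a toy model. This sketch is in the spirit of IPv6 segment routing as described above; the hop names and helper functions are hypothetical, not MRC's wire format:

```python
def make_packet(payload, hops):
    """Source-routing sketch: the sending endpoint chooses the whole
    path and embeds it in the packet, so core switches need no dynamic
    routing state of their own."""
    return {"route": list(hops), "payload": payload}

def switch_forward(packet):
    # A core switch only pops the next hop from the packet's own route;
    # there is no route computation or control-plane state to converge.
    return packet["route"].pop(0)

pkt = make_packet(b"gradients", hops=["tor-1", "spine-4", "tor-9", "gpu-42"])
path_taken = [switch_forward(pkt) for _ in range(4)]
print(path_taken)  # → ['tor-1', 'spine-4', 'tor-9', 'gpu-42']
```

Moving path selection to the sender is what lets the core stay static and simple: a failed path is avoided by the endpoint choosing different hops, not by switches recomputing routes.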

WORDS WORTH SAVING

5 quotes

We're talking about a lot of the world's fastest GPUs and making them all work together on a single task, which is why this stuff gets hard.

Mark Handley

A key thing here is that the communication between the GPUs is actually part of the computation.

Mark Handley

That's just about the worst possible workload you could think to put onto a network.

Mark Handley

We know we've won when researchers stop needing to know what network protocol this particular cluster is using.

Greg Steinbrecher

We did not care. We didn't even notice. MRC just took care of it.

Greg Steinbrecher

Synchronous GPU training and P100 (worst-case) performance
Network bottlenecks, tail latency, and wasted GPU time
Multipath packet spraying and deterministic load balancing
Packet trimming vs. packet drops for fast loss detection
Failure recovery without routing reconvergence (endpoint adaptation)
Static routing and IPv6 segment routing (source routing)
Open standards (OCP) and multi-vendor collaboration

High quality AI-generated summary created from speaker-labeled transcript.
