OpenAI: Why AI needs a new kind of supercomputer network (The OpenAI Podcast Ep. 18)
At a glance
WHAT IT’S REALLY ABOUT
OpenAI’s MRC networking makes massive GPU training faster, more resilient, and simpler to operate
- AI training workloads stress networks differently than traditional internet/web traffic because thousands of GPUs must communicate in lockstep, making worst-case latency and congestion the true limiter.
- MRC improves throughput and stability by spraying traffic across many paths while eliminating packet-loss ambiguity via packet trimming and rapid retransmission signaling.
- At large scales, frequent component failures are inevitable; MRC routes around failures in milliseconds at endpoints rather than waiting seconds for routing-protocol convergence.
- By pushing intelligence to the network edge and simplifying the core (including static routing and source routing), OpenAI can reduce operational complexity and potentially build flatter, lower-power networks.
- OpenAI is publishing MRC via OCP as an open standard, aiming to prevent ecosystem fragmentation and accelerate industry-wide infrastructure progress for AI compute.
IDEAS WORTH REMEMBERING
5 ideas
AI training turns the network into part of the compute.
Because GPUs must exchange data every step to agree on results, any slow link, congestion hotspot, or delay forces all GPUs to wait, wasting expensive compute.
Average network performance matters less than the worst-case link.
With synchronized workloads, the most congested link (the “tail of the tail,” or P100 behavior) sets the pace for the entire training step, changing design goals versus typical web workloads.
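A minimal sketch of why the worst-case link dominates: in a synchronized step, every GPU waits for the slowest link, so step time is set by the maximum (P100) latency, not the average. The numbers here are illustrative, not measurements from the episode.

```python
import random

def step_time(link_latencies_ms):
    # All GPUs must hear from all peers before the step can complete,
    # so the step is paced by the single worst link.
    return max(link_latencies_ms)

# 1,000 links: most are fast, a handful are congested.
random.seed(0)
links = [1.0] * 990 + [random.uniform(5.0, 20.0) for _ in range(10)]

avg = sum(links) / len(links)
worst = step_time(links)
print(f"average link latency: {avg:.2f} ms, step time (P100): {worst:.2f} ms")
```

Even though 99% of links are fast, the whole cluster runs at the pace of the slowest one, which is why MRC targets tail behavior rather than averages.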
Multipath alone isn’t enough; you must remove reordering/loss ambiguity.
Spraying packets across many paths improves load balance but increases reordering; MRC’s packet trimming forwards headers even when payloads would be dropped, enabling immediate retransmission requests.
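A hedged sketch of the trimming idea (not OpenAI's implementation): when a switch queue is full, the switch forwards the packet's header with the payload stripped instead of silently dropping it. The receiver then knows exactly which sequence number to re-request, with no timeout and no ambiguity about whether reordering or loss occurred. Headers are assumed small enough to always fit.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    seq: int
    payload: bytes
    trimmed: bool = False

def switch_forward(pkt, queue, capacity):
    if len(queue) < capacity:
        queue.append(pkt)  # normal forwarding
    else:
        # Queue full: trim to header-only and forward anyway,
        # so the receiver learns the payload was cut.
        queue.append(Packet(pkt.seq, b"", trimmed=True))

def receiver(queue):
    # Trimmed headers become immediate retransmission requests.
    return [pkt.seq for pkt in queue if pkt.trimmed]
```

For example, sending four packets through a queue of capacity three trims the fourth, and the receiver instantly requests a resend of that one sequence number.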
MRC makes failures routine rather than catastrophic.
Instead of waiting for distributed routing protocols to converge after a link failure (seconds to tens of seconds), endpoints quickly stop using bad paths within milliseconds, minimizing training disruption.
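The endpoint-side failover described above can be sketched as a path selector the sender owns: the moment a path shows loss, the sender stops spraying over it, with no routing-protocol convergence in the loop. This is an illustrative model, not MRC's actual data structures.

```python
class PathSelector:
    def __init__(self, paths):
        self.all_paths = list(paths)
        self.healthy = set(paths)
        self.rr = 0  # round-robin index over healthy paths

    def report_failure(self, path):
        # Takes effect on the very next packet -- milliseconds,
        # not the seconds a distributed protocol needs to converge.
        self.healthy.discard(path)

    def next_path(self):
        live = [p for p in self.all_paths if p in self.healthy]
        self.rr = (self.rr + 1) % len(live)
        return live[self.rr]
```

Because the decision lives at the endpoint, a bad link is avoided as soon as the sender observes trouble on it, and traffic keeps spraying across the remaining paths.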
Simplifying the network core increases reliability at scale.
MRC can operate with static routing and source-routed packets (IPv6 segment routing), reducing reliance on complex switch control planes that can themselves fail or add operational burden.
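A simplified sketch of the source-routing idea: the sender encodes the full hop list in the packet (as IPv6 segment routing does with a segment list), so a core switch's job shrinks to popping the next hop and forwarding. The hop names here are hypothetical.

```python
def build_packet(segment_list, payload):
    # The endpoint chooses the whole path up front.
    return {"segments": list(segment_list), "payload": payload}

def core_switch_forward(pkt):
    # A core switch runs no routing computation of its own:
    # it just consumes the next segment from the packet.
    return pkt["segments"].pop(0)
```

With the path fixed by the sender, the core needs no dynamic control plane, which is what lets MRC run over static routing.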
WORDS WORTH SAVING
5 quotes
We're talking about a lot of the world's fastest GPUs and making them all work together on a single task, which is why this stuff gets hard.
— Mark Handley
A key thing here is that the communication between the GPUs is actually part of the computation.
— Mark Handley
That's just about the worst possible workload you could think to put onto a network.
— Mark Handley
We know we've won when researchers stop needing to know what network protocol this particular cluster is using.
— Greg Steinbrecher
We did not care. We didn't even notice. MRC just took care of it.
— Greg Steinbrecher
High-quality AI-generated summary created from a speaker-labeled transcript.