OpenAI

Why AI needs a new kind of supercomputer network — the OpenAI Podcast Ep. 18

Andrew Mayne and Greg Steinbrecher on how OpenAI’s MRC networking makes massive GPU training faster, more resilient, and simpler.

Andrew Mayne (host) · Greg Steinbrecher (guest) · Mark Handley (guest)
May 5, 2026 · 37m · Watch on YouTube ↗
Synchronous GPU training and P100 (worst-case) performance · Network bottlenecks, tail latency, and wasted GPU time · Multipath packet spraying and deterministic load balancing · Packet trimming vs. packet drops for fast loss detection · Failure recovery without routing reconvergence (endpoint adaptation) · Static routing and IPv6 segment routing (source routing) · Open standards (OCP) and multi-vendor collaboration
AI-generated summary based on the episode transcript.

In this episode of the OpenAI Podcast (Ep. 18), Andrew Mayne talks with Greg Steinbrecher and Mark Handley about why AI needs a new kind of supercomputer network. OpenAI’s MRC networking aims to make massive GPU training faster, more resilient, and simpler: AI training workloads stress networks differently than traditional internet/web traffic because thousands of GPUs must communicate in lockstep, making worst-case latency and congestion the true limiter.

At a glance

WHAT IT’S REALLY ABOUT

OpenAI’s MRC networking makes massive GPU training faster, more resilient, and simpler.

  1. AI training workloads stress networks differently than traditional internet/web traffic because thousands of GPUs must communicate in lockstep, making worst-case latency and congestion the true limiter.
  2. MRC improves throughput and stability by spraying traffic across many paths while eliminating packet-loss ambiguity via packet trimming and rapid retransmission signaling.
  3. At large scales, frequent component failures are inevitable; MRC routes around failures in milliseconds at endpoints rather than waiting seconds for routing-protocol convergence.
  4. By pushing intelligence to the network edge and simplifying the core (including static routing and source routing), OpenAI can reduce operational complexity and potentially build flatter, lower-power networks.
  5. OpenAI is publishing MRC via OCP as an open standard, aiming to prevent ecosystem fragmentation and accelerate industry-wide infrastructure progress for AI compute.

IDEAS WORTH REMEMBERING

5 ideas

AI training turns the network into part of the compute.

Because GPUs must exchange data every step to agree on results, any slow link, congestion hotspot, or delay forces all GPUs to wait, wasting expensive compute.

Average network performance matters less than the worst-case link.

With synchronized workloads, the most congested link (the “tail of the tail,” or P100 behavior) sets the pace for the entire training step, changing design goals versus typical web workloads.
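
To make this concrete, here is a minimal illustrative Python sketch (the link counts and timings are invented for illustration, not taken from the episode): even when almost every link finishes near its nominal time, a lockstep training step only completes when the slowest transfer does, so the P100 link sets the pace.

```python
# Illustrative only: why the worst link, not the average, paces a synchronous step.
import random

random.seed(0)

NUM_LINKS = 1024   # hypothetical number of links a collective touches
BASE_MS = 10.0     # hypothetical uncongested transfer time per link

# Most links finish near BASE_MS, but a few hit congestion hotspots.
link_times = [BASE_MS * (1.0 + random.random() * 0.05) for _ in range(NUM_LINKS)]
for hot in random.sample(range(NUM_LINKS), 3):
    link_times[hot] *= 4   # a handful of congested links

avg = sum(link_times) / NUM_LINKS
p100 = max(link_times)

# In a lockstep collective, every GPU waits for the last transfer to finish.
print(f"average link time: {avg:.1f} ms")
print(f"worst-case (P100): {p100:.1f} ms  <- this paces the whole step")
```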

Multipath alone isn’t enough; you must remove reordering/loss ambiguity.

Spraying packets across many paths improves load balance but increases reordering; MRC’s packet trimming forwards headers even when payloads would be dropped, enabling immediate retransmission requests.

MRC makes failures routine rather than catastrophic.

Instead of waiting for distributed routing protocols to converge after a link failure (seconds to tens of seconds), endpoints quickly stop using bad paths within milliseconds, minimizing training disruption.

Simplifying the network core increases reliability at scale.

MRC can operate with static routing and source-routed packets (IPv6 segment routing), reducing reliance on complex switch control planes that can themselves fail or add operational burden.

WORDS WORTH SAVING

5 quotes

We're talking about a lot of the world's fastest GPUs and making them all work together on a single task, um, which is why this stuff gets hard.

Mark Handley

A key thing here is that the communication between the GPUs is actually part of the computation.

Mark Handley

That's just about the worst possible workload you could think to put onto a network.

Mark Handley

We know we've won when researchers stop needing to know what network protocol this particular cluster is using.

Greg Steinbrecher

We did not care. We didn't even notice. MRC just took care of it.

Greg Steinbrecher

QUESTIONS ANSWERED IN THIS EPISODE

5 questions

What specific training communication patterns (e.g., all-reduce style collectives) were the biggest drivers for needing MRC rather than existing Ethernet/RDMA approaches?

AI training workloads stress networks differently than traditional internet/web traffic because thousands of GPUs must communicate in lockstep, making worst-case latency and congestion the true limiter.

How does packet trimming work in practice—what device performs the trim, and what guarantees ensure the tiny header always gets through under congestion?

MRC improves throughput and stability by spraying traffic across many paths while eliminating packet-loss ambiguity via packet trimming and rapid retransmission signaling.

What trade-offs did you accept by turning off dynamic routing (BGP-like convergence) and relying on static routing plus endpoint adaptation—are there failure modes where this is worse?

At large scales, frequent component failures are inevitable; MRC routes around failures in milliseconds at endpoints rather than waiting seconds for routing-protocol convergence.

You mentioned deterministic routing to control tail behavior; how are paths assigned to avoid the “balls into bins” tail problem when using many parallel paths?

By pushing intelligence to the network edge and simplifying the core (including static routing and source routing), OpenAI can reduce operational complexity and potentially build flatter, lower-power networks.

How does MRC interact with existing congestion control schemes, and what fairness definition matters most inside a single training fabric?

OpenAI is publishing MRC via OCP as an open standard, aiming to prevent ecosystem fragmentation and accelerate industry-wide infrastructure progress for AI compute.

Chapter Breakdown

Why GPU supercomputers are a networking problem now

The episode sets up why training frontier AI models increasingly depends on data center networking, not just faster GPUs. As clusters scale, network behavior becomes a first-order limiter of training speed and reliability.

Greg Steinbrecher’s route from physics to AI training systems

Greg explains how his interest in complex systems led from physics and quantum computing toward data center networking and GPU cluster simulation. He ultimately shifted from modeling systems to building the software that keeps GPU training efficient and resilient.

Mark Handley’s networking background and the value of standardization

Mark describes decades of networking research, including early video conferencing work that influenced cellular standards. He contrasts slow-moving global internet standards with the faster iteration possible inside data centers.

Why AI training stresses networks differently than the internet

They explain the core mismatch: internet traffic averages out via many independent flows, while AI training synchronizes thousands of GPUs into one tightly coupled workload. The network must deliver not “good average” performance, but consistently strong worst-case performance.

Bandwidth, multi-hop fabrics, and path collisions create tail bottlenecks

Large GPU clusters require massive bandwidth, forcing networks to become deep hierarchies with thousands of switches and many possible paths. Random or poorly balanced path selection creates hotspots that dominate overall step time because training proceeds in lockstep.
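
A small, self-contained simulation (the path and flow counts are assumptions for illustration, not OpenAI’s configuration) shows the “balls into bins” effect: hashing whole flows onto random paths leaves some paths far hotter than average, while spreading each flow’s packets across all paths keeps the hottest path near the ideal.

```python
# Illustrative comparison: flow hashing vs. per-packet spraying across equal paths.
import random
from collections import Counter

random.seed(1)

NUM_PATHS = 64          # hypothetical number of equal-cost paths in the fabric
NUM_FLOWS = 64          # e.g. one large flow per GPU pair
PACKETS_PER_FLOW = 1000

# ECMP-style: hash each flow onto one random path; all its packets follow it.
ecmp_load = Counter()
for flow in range(NUM_FLOWS):
    ecmp_load[random.randrange(NUM_PATHS)] += PACKETS_PER_FLOW

# Spray-style: spread each flow's packets round-robin across all paths.
spray_load = Counter()
for flow in range(NUM_FLOWS):
    for pkt in range(PACKETS_PER_FLOW):
        spray_load[(flow + pkt) % NUM_PATHS] += 1

ideal = NUM_FLOWS * PACKETS_PER_FLOW / NUM_PATHS
print(f"ideal per-path load: {ideal:.0f} packets")
print(f"ECMP  hottest path:  {max(ecmp_load.values())} packets")
print(f"Spray hottest path:  {max(spray_load.values())} packets")
```

Because training proceeds in lockstep, the hottest path in the first case is exactly the link that drags out the whole step.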

Failures are constant at scale—and waiting is extremely expensive

As clusters grow, component failures become routine rather than exceptional, especially with millions of optical links. Traditional routing reconvergence and transient issues can pause work, trigger retries, or even crash jobs—wasting enormous GPU time and delaying model training.

MRC’s core idea: multipath spraying plus fast loss clarity

MRC combines multiple techniques to avoid hotspots and keep high utilization. Packets are spread across many paths for load balancing, while mechanisms address the resulting reordering and make congestion/loss detectable without guesswork.

Packet trimming: preventing ambiguity by delivering the header even under congestion

Instead of dropping packets when queues overflow, MRC trims payloads and forwards tiny headers so the receiver can immediately request retransmission. This preserves signal about what happened, enabling faster recovery even when packets take different paths.
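
The sketch below is a toy model of the trimming idea as described here (the class names and queue behavior are my own simplification, not MRC’s wire format): when a queue overflows, the switch forwards a header-only copy instead of silently dropping, so the receiver knows exactly which sequence numbers to request again, without waiting for a timeout.

```python
# Toy model of payload trimming and immediate retransmission requests.
from dataclasses import dataclass

@dataclass
class Packet:
    seq: int
    payload: bytes | None  # None means "trimmed": header only

class Switch:
    def __init__(self, queue_capacity: int):
        self.capacity = queue_capacity
        self.queue: list[Packet] = []

    def enqueue(self, pkt: Packet) -> None:
        if len(self.queue) < self.capacity:
            self.queue.append(pkt)                    # normal forwarding
        else:
            self.queue.append(Packet(pkt.seq, None))  # trim: forward header only
            # (A real switch would keep a small reserved queue for trimmed headers.)

class Receiver:
    def __init__(self):
        self.retransmit_requests: list[int] = []

    def receive(self, pkt: Packet) -> None:
        if pkt.payload is None:
            # The header arrived, so there is no ambiguity about what was lost:
            # ask for that sequence number again right away.
            self.retransmit_requests.append(pkt.seq)

# Toy run: a burst of 10 packets into a queue that only holds 6.
switch, rx = Switch(queue_capacity=6), Receiver()
for seq in range(10):
    switch.enqueue(Packet(seq, payload=b"x" * 4096))
for pkt in switch.queue:
    rx.receive(pkt)
print("immediate retransmit requests for seqs:", rx.retransmit_requests)
```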

Routing around failures in milliseconds—without waiting for network convergence

They contrast conventional distributed routing updates (e.g., BGP-like convergence) with MRC’s endpoint-driven avoidance. Endpoints quickly infer bad paths and stop using them, so link flaps don’t stall training jobs.
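
As a rough illustration of endpoint-driven avoidance (the selector logic and the millisecond penalty window are assumptions, not the MRC specification), an endpoint can keep a tiny per-path health record and simply skip paths that recently lost packets, with no routing-protocol convergence in the loop.

```python
# Sketch of an endpoint that stops spraying onto recently bad paths.
import itertools
import time

class PathSelector:
    def __init__(self, num_paths: int, penalty_window_s: float = 0.005):
        self.num_paths = num_paths
        self.bad_until = [0.0] * num_paths   # per-path "avoid until" timestamp
        self.window = penalty_window_s       # a few milliseconds, for illustration
        self._rr = itertools.cycle(range(num_paths))

    def next_path(self) -> int:
        now = time.monotonic()
        for _ in range(self.num_paths):
            path = next(self._rr)
            if self.bad_until[path] <= now:
                return path                  # healthy path: use it
        return path                          # everything marked bad: fall back

    def report_loss(self, path: int) -> None:
        # A trimmed header or missing ack on this path: avoid it briefly.
        self.bad_until[path] = time.monotonic() + self.window

selector = PathSelector(num_paths=8)
selector.report_loss(3)                      # e.g. a link on path 3 just flapped
chosen = [selector.next_path() for _ in range(8)]
print("paths chosen while path 3 is avoided:", chosen)
```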

Turning off dynamic routing: static cores, smarter edges, simpler operations

A major operational shift: because MRC can find working paths itself, OpenAI can run the fabric with static routing tables. This reduces control-plane complexity and removes a failure-prone class of network behaviors.

How it’s implemented: source routing with IPv6 segment routing

They describe pushing intelligence to endpoints and making switches “dumber” via source routing. Using IPv6 segment routing, each packet carries the sequence of switches it should traverse, aiding deterministic load balancing and simpler core behavior.
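
Here is a deliberately simplified toy model of the source-routing idea (it is not a faithful SRv6 segment routing header implementation, and the hop names are made up): the sending endpoint writes the full hop sequence into the packet, and each hop only has to read the next segment, so the core needs no per-flow routing decisions.

```python
# Toy model of a source-routed packet carrying its own segment list.
from dataclasses import dataclass, field

@dataclass
class SourceRoutedPacket:
    segments: list[str]              # hop sequence chosen by the sending endpoint
    segments_left: int = field(init=False)
    payload: bytes = b""

    def __post_init__(self):
        self.segments_left = len(self.segments)

    def next_hop(self) -> str:
        return self.segments[len(self.segments) - self.segments_left]

    def advance(self) -> None:
        self.segments_left -= 1

# The endpoint picks the whole path up front; deterministic load balancing
# lives here, at the edge, rather than in the switches.
pkt = SourceRoutedPacket(segments=["leaf-1", "spine-7", "leaf-42", "gpu-host-9"])
while pkt.segments_left:
    print("forwarding via", pkt.next_hop())
    pkt.advance()
```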

Why MRC becomes an open standard (and who helped build it)

OpenAI is publishing MRC through OCP because open standards accelerate the ecosystem and reduce fragmented, incompatible approaches. They highlight collaboration with Microsoft and major silicon/network vendors to bring it into production hardware.

Efficiency and cost: flatter networks, fewer devices, better work-per-watt

Beyond reliability, MRC enables simpler and flatter network designs with fewer switch layers, cutting cost and power. More of the power budget can go to GPUs doing useful work, improving overall system efficiency.

Future limits and the ‘compute in space’ question

They close by discussing scaling constraints like the speed of light and the inevitability of ongoing networking work as hardware evolves. On space-based compute, they argue training in orbit is impractical due to latency, failures, and maintenance challenges, even if it’s conceptually intriguing.
