
Why AI needs a new kind of supercomputer network — the OpenAI Podcast Ep. 18

Training frontier models isn’t as simple as adding more GPUs—one small problem and the whole coordinated dance falls apart. OpenAI’s Mark Handley and Greg Steinbrecher discuss how a new supercomputer network design, used to train some of the company’s latest models, keeps the whole system moving in lockstep, even with record numbers of GPUs. They break down Multipath Reliable Connection, a new protocol OpenAI developed with AMD, Broadcom, Intel, Microsoft, and Nvidia, and why they’re making it available for the whole industry to use.

Chapters

  • 00:00 Intro
  • 00:39 Greg and Mark's paths to OpenAI
  • 04:34 Why training AI stresses networks differently
  • 10:05 Bottlenecks, failures, and the cost of waiting
  • 15:19 How Multipath Reliable Connection works
  • 18:59 A protocol to route around failures
  • 25:05 Why OpenAI is making MRC an open standard
  • 35:09 Could AI compute move to space?
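The "route around failures" idea behind a multipath reliable connection can be sketched in a few lines. This is an illustrative toy model only, not the actual MRC wire protocol: the connection keeps several network paths open, sprays packets across them, and when a path fails to acknowledge, it is marked unhealthy and the packet is retried on a remaining path, so one dead link never stalls the transfer.

```python
# Toy sketch of multipath reliable delivery -- illustrative only,
# not the real MRC protocol. The class names and API are invented.
import itertools


class Path:
    def __init__(self, name, up=True):
        self.name, self.up = name, up

    def transmit(self, packet):
        # A real path would send and wait for an ack; here "up" decides.
        return self.up


class MultipathConnection:
    def __init__(self, paths):
        self.healthy = set(paths)          # paths believed usable
        self.rr = itertools.cycle(paths)   # spray packets round-robin

    def send(self, packet):
        """Try paths until the packet is delivered; drop failed paths."""
        for _ in range(len(self.healthy) + 1):
            path = next(self.rr)
            if path not in self.healthy:
                continue
            if path.transmit(packet):      # ack received
                return path
            self.healthy.discard(path)     # mark failed, retry elsewhere
        raise ConnectionError("all paths down")


# Path "a" is dead, but the packet still gets through via "b" or "c".
conn = MultipathConnection([Path("a", up=False), Path("b"), Path("c")])
print(conn.send("gradient-chunk-0").name)
```

The key property the episode highlights is visible even in this sketch: delivery depends on *some* path being healthy, not on any particular one.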

Andrew Mayne (host) · Greg Steinbrecher (guest) · Mark Handley (guest)
May 6, 2026 · 37m · Watch on YouTube ↗

Episode Details

EPISODE INFO

Released: May 6, 2026
Duration: 37m
Channel: OpenAI
Watch on YouTube ↗


SPEAKERS

  • Andrew Mayne

    host

    Host of the OpenAI Podcast.

  • Greg Steinbrecher

    guest

    OpenAI workload/infrastructure systems engineer focused on efficient large-scale GPU training and reliability.

  • Mark Handley

    guest

    Networking researcher at OpenAI and professor at University College London specializing in large-scale network design and protocols.

EPISODE SUMMARY

In this episode, host Andrew Mayne talks with Greg Steinbrecher and Mark Handley about how OpenAI’s MRC networking makes massive GPU training faster, more resilient, and simpler. AI training workloads stress networks differently than traditional internet/web traffic: thousands of GPUs must communicate in lockstep, so worst-case latency and congestion become the true limiter.

RELATED EPISODES

What happens now that AI is good at math? — the OpenAI Podcast Ep. 17

Sam Altman on AGI, GPT-5, and what’s next — the OpenAI Podcast Ep. 1

How AI is accelerating scientific discovery today and what's ahead — the OpenAI Podcast Ep. 10

Inside ChatGPT, AI assistants, and building at OpenAI — the OpenAI Podcast Ep. 2

Episode 15 - Inside the Model Spec

ChatGPT Atlas and the next era of web browsing — the OpenAI Podcast Ep. 9
