
Why AI needs a new kind of supercomputer network — the OpenAI Podcast Ep. 18

Training frontier models isn’t as simple as adding more GPUs—one small problem and the whole coordinated dance falls apart. OpenAI’s Mark Handley and Greg Steinbrecher discuss how a new supercomputer network design, used to train some of the company’s latest models, keeps the whole system moving in lockstep, even with record numbers of GPUs. They break down Multipath Reliable Connection, a new protocol OpenAI developed with AMD, Broadcom, Intel, Microsoft, and Nvidia, and why they’re making it available for the whole industry to use.

Chapters
00:00 Intro
00:39 Greg and Mark's paths to OpenAI
04:34 Why training AI stresses networks differently
10:05 Bottlenecks, failures, and the cost of waiting
15:19 How Multipath Reliable Connection works
18:59 A protocol to route around failures
25:05 Why OpenAI is making MRC an open standard
35:09 Could AI compute move to space?

Andrew Mayne (host), Greg Steinbrecher (guest), Mark Handley (guest)
May 6, 2026 · 37m · Watch on YouTube

EVERY SPOKEN WORD

  1. 0:00–0:39

    Intro

    1. AM

      Hello, I'm Andrew Mayne, and this is the OpenAI Podcast. On today's episode, we're discussing how to make supercomputers better at training models. Joining me are Mark Handley from the Core Networking Team, and Greg Steinbrecher from Workload Systems. They'll discuss how a breakthrough has made training more efficient so everyone gets smarter models faster.

    2. GS

      [upbeat music] This has really allowed us to remove one of the key barriers to continuing to scale.

    3. MH

      We're talking about a lot of the world's fastest GPUs and making them all work together on a single task.

    4. GS

      We know we've won when researchers stop needing to know what network protocol this particular cluster is using.

  2. 0:39–4:34

    Greg and Mark's paths to OpenAI

    1. AM

      So tell me a bit about your background.

    2. GS

      I started out, uh, doing physics and math in undergrad, um, wanting to basically, you know, understand how complex systems work. Uh, I always liked the part of physics that's about, like, how do you take this thing that is, like, unknowably complicated and build a simple model that is a complete and utter lie but tells you something about that system, uh, and then build your intuition on that and kind of build more complex models. Uh, I ended up doing a PhD trying to build quantum computers.

    3. AM

      Ambitious PhD.

    4. GS

      You know, little things-

    5. AM

      I hear that

    6. GS

      ... little things. Um, unfortunately, what I liked is big, complicated systems, and you'll note that quantum computers don't work-

    7. AM

      [laughs]

    8. GS

      ... and therefore they don't scale-

    9. AM

      Yeah

    10. GS

      ... yet. They will someday, but they don't yet work. Um, so I kind of took a look at the chips we were designing to control light for quantum computers, and I went, "Huh, that kind of looks like a network switch. What if we use this as a network switch?" And what I found out pretty quickly was that academia does not know a whole lot about what real, like, data center workloads look like. You get a whole bunch of, uh, [tsking] kind of very toy models, but they're not very informative. And so I ended up kind of pitching an industry company to get a fellowship. They paid for the last two years of my PhD. I ended up working there for a while, um, building out, kind of doing initial network hardware just to try to understand, like, what is it that we actually need from ne- data center networks. What I found out was that, like, there's a huge amount of headroom on just, like, conventional data center networking hardware and lots of room for optimization. We did not need my little, like, optical chip. We did not need to do anything fancy like that. Um, but that there's all sorts of really fun problems.

    11. AM

      Mm.

    12. GS

      Um, and then around that time, the whole AI boom started to kick off. We decided we needed to build big GPU clusters, and in particular, we needed to build networks for those GPU clusters. And so I got roped in on trying to build kind of simulations of those so that we can figure out what to build. And then in the process of, like, trying to build a simulation of these systems, you learn a lot about how they have to work.

    13. AM

      Mm-hmm.

    14. GS

      And at some point I said, "Well, why don't I just go build the thing, the actual thing?" And so I transitioned from writing software to build simulations to just writing the software that allows GPUs to communicate with each other. Um, and then a little over a year ago, ended up coming here to OpenAI-

    15. AM

      Mm

    16. GS

      ... uh, to do some of the same stuff, but to get even closer to the actual model training.

    17. AM

      Mm.

    18. GS

      Uh, so the team I'm on is responsible for more or less making sure that we'd use the GPUs efficiently. So are we, you know, are the models training as quickly as they can be? Are we not bottlenecked on the network? What do we do when something fails? Are we restarting efficiently? How do we kind of work around quirks in the hardware? Um, and yeah, now I get to play with some of the most fun hardware in the world and, like, try to make it, try to kind of squeeze every last ounce of performance out of it.

    19. AM

      Mark, you've had a considerable amount of experience trying to get computers to talk to each other and do meaningful work. Uh, tell me more about your background.

    20. MH

      Um, so when I'm not at OpenAI, I'm a professor at University College London, and I've been doing networking research for m- more decades than I care to think about. Um, originally I started working out on, on trying to make the internet do video conferencing.

    21. AM

      Oh, wow.

    22. MH

      And you know, back then that was a really difficult thing because the computers were so slow.

    23. AM

      So slow.

    24. MH

      Um, and then we c- we came up with a way for, for doing that, that suddenly the rest of the world got interested in. And so the standards we wrote for doing that are now the things that your phone uses to communicate with 4G and 5G networks. Um, so that was, like, the first part of my, of my life, was trying to standardize all of that sort of stuff that everybody could actually take advantage of that. Um, and the problem with standardizing things, everybody needs to agree, so it just, it takes a long time to actually get everybody to agree on anything. Um, a while back, I got interested in, in what was happening in the data center world, and it had a, the big advantage that, uh, you could actually do something different because you only needed to agree with it while building, not the whole world.

    25. AM

      Mm-hmm.

    26. MH

      And so that was how I got into thinking, well, this data center networking stuff is a really interesting

  3. 4:34–10:05

    Why training AI stresses networks differently

    1. MH

      place to be.

    2. AM

      Well, it's an interesting, uh, problem because I think that the, just the idea that we'd be scaling as many GPUs as we are now, uh, it just, it's happened so fast, happened so quickly and often. You know, we're still using GPUs, which are graphical processing units. You know, we're just now starting to use next generation chips to do this. So how much has been the work required just to update the way we think about it and update the sort of the, what the tools we're using for this, which is here, what we're here to talk about with, uh, Multipath Reliable Connection.

    3. MH

      So from a network, um, point of view, the, the data centers that we traditionally used to build, they, they kind of got de- derived from what we used to do in the internet.

    4. AM

      Mm-hmm.

    5. MH

      And, and when you, when you do communication in the internet, you have lots and lots of people communicate. It's like there's an awful lot of data moving around, but you, they're all doing their own separate conversations. And so generally speaking, the, the, if the more communications you add onto the same shared network, the more things smooth out and become even.

    6. AM

      Mm-hmm.

    7. MH

      And so that's great. You can take advantage of the statistics of large numbers. Unfortunately, when you look at what we're doing when we're training things, it's exactly the opposite from that. We're talking about a lot of the world's fastest GPUs and making them all work together on a single task, um, which is why this stuff gets hard.

    8. GS

      That process of basically ingesting all of this data and teaching the, allowing the AI model to learn from that data in parallel is what allows us to build smarter, better models. But let's say one GPU goes a little slow. Well, now all the other GPUs have to wait for it. That's all wasted time. Or one of the GPUs, you know, a cosmic ray hits it and some bits flip and it stops. Well, okay, now that whole step is maybe not useful and we have to maybe roll back or kind of stop and take stock of what has happened. And while we stop and take stock, like, all the GPUs are not doing useful work.
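
The lockstep cost Greg describes can be sketched as a toy calculation (the GPU counts and timings below are illustrative, not from the episode): in a synchronous step, every GPU waits for the slowest one, so a single straggler stalls the entire cluster.

```python
import random

def lockstep_step_time(gpu_times):
    """In a synchronous training step, every GPU must wait for the
    slowest one, so the step takes the max of all per-GPU times."""
    return max(gpu_times)

random.seed(0)
n_gpus = 10_000
# Most GPUs finish a step in ~100 ms, with a little jitter.
times = [0.100 + random.uniform(0, 0.005) for _ in range(n_gpus)]
times[42] = 0.250  # one hypothetical slow GPU (thermal throttling, bad link, ...)

step = lockstep_step_time(times)
mean = sum(times) / n_gpus
print(f"mean GPU time: {mean*1000:.1f} ms, actual step time: {step*1000:.1f} ms")
# One straggler out of 10,000 more than doubles the step time for everyone.
```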

    9. MH

      Yeah. A key thing here is that the communication between the GPUs is actually part of the computation.

    10. GS

      Mm-hmm.

    11. MH

      It's like they're, they're doing one big computation across all of them. It's not they're doing different things. We have to actually have them communicate with each other in order to agree on what the, the result of that step of computation is. And that's just about the worst possible workload you could think to put onto a network. And so the, the way the industry has evolved over the last couple of decades has been slowly coming up with, with improvements on how we actually do that. But until recently, the scale wasn't enough that it really mattered.

    12. GS

      Hmm.

    13. MH

      You could get away with doing the same sort of things we did on the internet, just bigger.

    14. GS

      Mm-hmm.

    15. MH

      Um, but you can't get away with that anymore, and that's why we tried to think of different ways to actually solve these problems, to cope with these very synchronized workloads as we scale things up.

    16. GS

      AI has upended a lot of things in the world. Um, it has definitely upended the way that technology companies have to think about the data centers that they're building. So conventional hyperscalers for kind of, uh, the web era, um, the teams that built this were very kind of disconnected from the individual workloads, you know. The, the goal was to basically just provide an ocean of compute. AI has forced us to think very differently. Um, and OpenAI in particular has been kind of at the forefront of realizing that the systems and the design of the systems is integral to the training of the models.

    17. MH

      Mm-hmm.

    18. GS

      And that you can't just have your infrastructure team sit over here in one building and kind of just deliver an ocean of compute, and your model team sit over here and, like, try to make the best, uh, you know, model that they can on that compute. You really have to do kind of a co-design across these whole things. Um, and so, you know, we, on my team, sit literally next to the researchers-

    19. MH

      Hmm

    20. GS

      ... and talk with them every day about, you know, how they can best... You know, how their workloads fit best onto the existing servers. And in doing that, we learn a lot about where the pain points are.

    21. MH

      Mm-hmm.

    22. GS

      And, you know, we are on call for the big training runs. You know, we get woken up in the middle of the night if something is broken and can't be fixed. And so in, through that process, you know, and you start to go, "Oh, well, what if in the next generation we fix these problems? What if we didn't build data centers with the same properties as we did for web scale workloads? What if we instead, you know, fix this piece or this piece or this piece?" And, uh, I think the network has been a real source of pain for us.

    23. MH

      Yeah. So when you're building a data center network, all, all of these GPUs, because they do this computation and they all need to talk at the same time, you need a lot of bandwidth.

    24. GS

      Mm-hmm.

    25. MH

      And the problem with that is you can't build that with, with a single switch or even a hierarchy of switches. You have to build hierarchies of hierarchies of switches. And so that means that when you communicate from one GPU to a different GPU, there are many different paths your traffic could take through that network, thousands of different paths your traffic could take through the network, because we build so many different switches in there. There's, there's, you know, several thousand switches in one of these buildings. Um, now that gives us an interesting problem, which is which path do you take from one place to the other? And if you, if you basically have the requirement that I want to be able to send from one GPU to another GPU as fast as possible, and I choose, say, a random path through the network, if I get lucky and nobody else chooses the same path, then great.

    26. GS

      Mm-hmm.

    27. MH

      But if I get unlucky and two people choose the same path, we go slow, and if 10 people choose the same path, we go really slow. And so this, this statistical multiplexing that we used to have when we designed the internet just doesn't work out very well when we're trying to build these networks for our data centers. And so that's where we came in to try and design things somewhat differently.
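
The "two people choose the same path" problem Mark describes is a balls-in-bins effect, and a quick simulation shows it (path and flow counts here are made up for illustration): even with as many paths as flows, random selection reliably overloads some path.

```python
import random

def max_path_load(n_flows, n_paths, rng):
    """Each flow independently picks a random path; return the number of
    flows sharing the most heavily used path."""
    load = [0] * n_paths
    for _ in range(n_flows):
        load[rng.randrange(n_paths)] += 1
    return max(load)

rng = random.Random(1)
n_flows = n_paths = 1024
# Perfect balance would put exactly 1 flow per path...
worst = max(max_path_load(n_flows, n_paths, rng) for _ in range(100))
print(f"worst flows sharing one path over 100 trials: {worst}")
# ...but random choice reliably stacks several flows on some path,
# and in a lockstep workload those flows set the pace for everyone.
```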

    28. GS

      Maybe to put a fine point on that, the synchronous nature of the workload that we talked about earlier is why this becomes such a problem.

    29. MH

      Mm-hmm.

  4. 10:05–15:19

    Bottlenecks, failures, and the cost of waiting

    1. GS

      It is not about how fast can the average pair of GPUs talk to each other.

    2. MH

      Yeah.

    3. GS

      It's always what is the absolute worst case that occurs.

    4. MH

      Mm-hmm.

    5. GS

      So if you think about you've got thousands of GPUs that are trying to talk to each other, so there's tens of thousands or hundreds of thousands of network flows on tens of thousands of links, what you have to do is you have to look at that entire network and go, "What is the link that got most bottlenecked here?" That one link is gonna set how fast all of your GPUs are able to work and how much time it takes for you to move data through this, 'cause everything is proceeding in lockstep. And so whereas previously we might have been subject to or kind of taken advantage of average statistics, we don't have that luxury anymore. We instead are subject to, like, the tail of the tail, uh, we call P100, the-

    6. MH

      Mm-hmm

    7. GS

      ... 100th percentile statistics. And that leads to very different systems requirements than when you can kind of rely on the law of large numbers to take care of you.
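
Greg's point about P100 statistics can be made concrete with a small sketch (link speed, transfer size, and flow placement below are assumptions for illustration): the single most congested link, not the average, sets the step's communication time.

```python
def step_transfer_time(bytes_per_flow, flows_per_link, link_gbps=400):
    """Time for the slowest flow to finish: the most congested link
    divides its bandwidth among all the flows crossing it."""
    worst_sharing = max(flows_per_link)
    per_flow_bw = link_gbps * 1e9 / 8 / worst_sharing  # bytes/s per flow
    return bytes_per_flow / per_flow_bw

# Nine links carry one flow each; one unlucky link carries eight.
flows_per_link = [1] * 9 + [8]
hotspot = step_transfer_time(100e6, flows_per_link)  # 100 MB per flow
ideal = step_transfer_time(100e6, [1])
print(f"ideal: {ideal*1000:.1f} ms, with one hotspot link: {hotspot*1000:.1f} ms")
# Every GPU in the lockstep step waits for the 8x-slower hotspot flows.
```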

    8. MH

      And so the other problem with this is that when you build your networks, we build the best possible networks we can. We go to the, the best equipment vendors. We use the best optics, and so forth. But when you really scale things really big, things are always gonna be failing. Um, links will fail. Switches will get confused and will have to be rebooted, and so forth. Any one of these failures is going to affect the traffic running over the network. And so if you've got this problem where we only care about the, the 100th percentile, and then a link fails in the network-

    9. GS

      Hmm

    10. MH

      ... what happens? Well, stuff will fail. We may have, uh, time before the routing reconverges, and then we move the traffic onto a different path. We take a glitch there. That can be quite a long glitch. That can cost us. Or worse, we can actually cause one of these communication transfers to fail. A single transfer fails, and you could end up with the whole job crashing. And so that's... We really want to avoid that kind of problem. We want to build a, a way of using these networks that is resilient to not only the potential transient congestion we might build in the network, but also when things fail, we just want to be able to carry on and basically not notice. But that requires that we actually design the network protocols differently from the start. You can't retrofit this onto existing network protocols.

    11. AM

      Yeah, it's an, it's a very interesting problem as you describe it, because I can see that where you could say, like, "Oh, well, if we have 1,000 GPUs, there's only one chance out of 10 that they're gonna fail," or whatever. And now I have 100,000 GPUs. Well, guess what? I'm gonna have a failure all the time, and that's what you have to sort of solve for each time you scale it. So where does this break?

    12. GS

      Everywhere [laughs]

    13. MH

      Everywhere

    14. AM

      Yeah. There we go.

    15. MH

      If you think about sort of the mean time between failure of the equipment, so there's, for any particular range of equipment, you've got some, some time between which something will fail somewhere in your building.

    16. AM

      Yeah.

    17. MH

      And of course, the bigger we get, if we, with the same cost of equipment, th- that time comes down further and further. And eventually you get to a point where actually something is failing sufficiently often that you don't get any work done on a large-

    18. AM

      Mm

    19. MH

      ... synchronous workload. And, you know, that's... We can't have that happen, so we have to do things differently to make that work.

    20. GS

      Maybe... Yeah, so the, the, the very simple math here is, like, you can basically assume that if failures are independent and you double the size of your system, you're going to have half the time between failures, right? Your mean time to failure goes down by half. The important thing to think about here for the network is for every GPU, we have tens if not hundreds of network components. So even just, like, say you've got one GPU connected to one network adapter. In that network adapter, if it has an optical transceiver in it, maybe you'll have four lasers. On the other end of that transceiver, you'll have another four lasers. And so already just connecting that one GPU just to its first hop switch, you've got an order of magnitude more lasers than you have GPUs. Now, add in multiple layers of switching and you start to get into, you know, several orders of magnitude more components in the network than you have kind of at the edge of your network. Because we need to have so much bandwidth here, so we have to build these networks. We can't, you know, kind of taper them down and only have a couple of components in the network, 'cause then we'd be starving the GPUs. They wouldn't be able to actually kind of use their full capability to do math as quickly as possible. Instead, we would ha- kind of, we would just be letting them sit idle and wasting time, money, energy. Uh, we would get trained models more slowly. It would be bad.
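
The failure arithmetic Greg walks through fits in a few lines (the per-link MTBF below is a hypothetical number, not a figure from the episode): with independent failures, doubling the component count halves the time between failures, and a network with millions of optical links sees failures continuously.

```python
def cluster_mtbf(component_mtbf_hours, n_components):
    """If failures are independent, the system-wide mean time between
    failures is the per-component MTBF divided by the component count."""
    return component_mtbf_hours / n_components

# Hypothetical: each optical link averages ~5 years between failures.
link_mtbf = 5 * 365 * 24  # ~43,800 hours
for n_links in (100_000, 1_000_000, 2_000_000):
    mtbf = cluster_mtbf(link_mtbf, n_links)
    print(f"{n_links:>9,} links -> something fails every {mtbf*60:.1f} minutes")
```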

    21. AM

      Mm.

    22. GS

      And so we build a very big network, but now you have many, many, many more components in that network than you have maybe at the edge of your network.

    23. MH

      Yeah, you have literally millions of optical links within the same building.

    24. AM

      Mm.

    25. MH

      So it's huge scale.

    26. AM

      You know, you mentioned the data centers originally were, I go do something to search the cloud, to get my email, whatever. I may be talking to one server and there might be a backup there. The idea that we would be having more computers inside a data center talking to each other than we did just a few years ago, people trying to connect to it. And so how have these protocols evolved?

    27. GS

      One of the things that got me into networking in the first place, I have a very distinct memory of... I was at a conference, uh, OFC, it's an optical communications conference, back in 2017, and they had a presenter from Facebook. And he put a chart up, uh, that had a, it was a stacked line chart, of the traffic that they serve to their end users and the traffic that goes inside of their data centers.

    28. AM

      Hmm.

    29. GS

      And the traffic inside of their data centers was just exploding, even while the kind of amount of traffic that they were sending to end users was staying constant. And this is way before GPU clusters and AI. So this is... What AI does is it takes all of the system's challenges that people were having previously, and it cranks them up to

  5. 15:19–18:59

    How Multipath Reliable Connection works

    1. GS

      11.

    2. AM

      So to address this, you've been working on a method, Multipath Reliable Connection.

    3. MH

      The insight was basically that you have... Well, you're trying to manage congestion in networks.

    4. AM

      Mm.

    5. MH

      And so there are several pieces that you can pull together and do this. And the first part of the insight was if you spray the packets across many paths, you can load balance those paths through the network really equally. And if you do that, and you build a network topology that has enough capacity, then you don't cause hotspots in the network. It leaves you with just one place where you have congestion, which is if multiple people try to send to the same destination at the same time. But it also leaves you some problems, because the packets can get jumbled up in transit because they're taking different paths. And so if you do manage to cause congestion and cause loss, it's a little bit difficult to figure out whether you got loss or whether you should still be waiting for packets because, you know, they got jumbled up in transit. And so the second piece of this, um, is a, a technique we call packet trimming.
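
The spraying idea Mark describes can be sketched as follows (the path latencies and jitter are invented for illustration): sending consecutive packets down different paths balances load perfectly, but differing path delays mean packets arrive out of order, which is exactly the loss-versus-reordering ambiguity he goes on to explain.

```python
import random

def spray(packets, paths, rng):
    """Spray packets round-robin across all paths; per-path latency
    jitter means they can arrive at the receiver out of order."""
    arrivals = []
    for seq, _payload in enumerate(packets):
        path = seq % len(paths)                # perfectly even load balancing
        latency = paths[path] + rng.uniform(0, 0.2)
        arrivals.append((latency, seq))
    arrivals.sort()                            # order of actual arrival
    return [seq for _, seq in arrivals]

rng = random.Random(7)
paths = [1.0, 1.1, 0.9, 1.3]                   # relative path latencies
order = spray(range(12), paths, rng)
print("arrival order:", order)
# Sequence numbers arrive jumbled, so after a plain drop the receiver
# cannot tell "lost" from "still in flight on a slower path".
```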

    6. AM

      Mm.

    7. MH

      Which is if you would, if you, if you're causing congestion in the network and you would overflow a queue, normally we'd just drop the packet.

    8. AM

      Mm-hmm.

    9. MH

      And then we've got ambiguity. Did it get lost? Has it got lo- how long do I need to wait for it? Um, but what we do instead is we will trim off the payload of the packet and just forward the little tiny packet header to the destination-

    10. AM

      Hmm

    11. MH

      ... which can immediately request a retransmission, and we can retransmit that packet. And that re- totally removes the sense of ambiguity as to whether we lost packets due to congestion or whether we should still be waiting for them because they got reordered.
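
A minimal sketch of the trimming behavior Mark describes (queue capacity and packet format are invented for illustration): instead of silently dropping an overflowing packet, the switch forwards just its header, so the receiver knows immediately which packets to ask for again rather than waiting out a timeout.

```python
def forward_through_switch(packets, capacity=4):
    """Sketch of packet trimming: the first `capacity` packets fit in the
    queue and are forwarded whole; anything beyond that would normally be
    dropped, but a trimming switch forwards the header with no payload."""
    out = []
    for i, (header, payload) in enumerate(packets):
        out.append((header, payload if i < capacity else None))
    return out

packets = [(seq, f"payload-{seq}") for seq in range(6)]
received = forward_through_switch(packets)
# A header with no payload is an unambiguous signal: this packet was
# trimmed due to congestion, request retransmission right away.
trimmed = [h for h, p in received if p is None]
print("receiver immediately requests retransmission of:", trimmed)  # [4, 5]
```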

    12. AM

      Interesting. So just, just making sure that the part saying, "Are you there?" goes through, and then you can figure that out and then send the rest.

    13. MH

      Yes. You really need to know that you should still be waiting for it or you should not still be waiting for it.

    14. AM

      Yeah. What does this mean for the end user?

    15. GS

      The biggest thing that this means is that you're gonna get better models, more intelligent models, faster from OpenAI. So MRC allows us to accelerate every part of our research and deployment pipeline. Um, it allows individual users to not worry about their jobs failing, not worry about how their job has gotten scheduled and whether the performance of it is gonna be different because they're, you know, placed on the same rack as someone else's job. Uh, it allows us to train frontier models, um, much faster, more reliably, uh, and really just to turn the entire crank of that pipeline much faster and much more reliably. So you should expect to see, uh, an ever-increasingly exciting pipeline of releases from us.

    16. AM

      The vibes are good.

    17. GS

      The vibes are good.

    18. MH

      Vibes are very good. The idea, um, came out of a lot of research work that we've had over, over the last, uh, few decades. We're, we're not fundamentally inventing anything new. We're just doing things that other people have invented, but pulling the combination together into a set of features. So the, um, we, we formed this, this group of people who are all interested in doing this. And, um, last year, we finally got to the point where we were, um, able to deploy this, and we went from, in a few months, from the, the first hardware available to actually running and training models, and it actually all operating. So the, this has the result that we don't cause that congestion that we talked about before.

    19. AM

      Mm.

    20. MH

      Um, the, the second really nice property is that if something fails in the network, every single one of these flows that go through there will probably be affected by the failure. But it'll only be affected a little bit.

    21. AM

      Hmm.

    22. MH

      And within, uh, a few round trip times across the network, we've stopped using a failed link. And so this problem of links failing, bringing down the, uh, the network just goes away. All of the flows themselves from the network interface at one side to network interface are just avoiding those failures

  6. 18:59–25:05

    A protocol to route around failures

    1. MH

      as we go through the network.

    2. AM

      It's like self-annealing.

    3. MH

      Exactly, yes.

    4. GS

      So I- I think Mark has mildly undersold this.

    5. MH

      [laughs]

    6. GS

      Uh, but maybe we should, uh-

    7. AM

      Mark, come on, man. [laughs]

    8. GS

      So conventionally, like when a link goes down on a network, what happens is, you know, one side of that link, the switch at the one side of that link, or maybe both sides of that link notices. But then it has to tell all of its neighbors that that link went down, and then they have to tell all of their neighbors that that link went down. And so you have a distributed systems problem, uh, that's conventionally solved with a technology called BGP, uh, Border Gateway Protocol, which is basically just like a gossip protocol-

    9. AM

      Mm-hmm

    10. GS

      ... that allows, you know, one link over here to eventually tell this switch all the way over here, maybe through five or seven hops through the network, that, "Hey, mate, you can't get to this destination if you take this link. You have to use th- these other links." That's a distributed systems problem that has a convergence time. What MRC has done is it has taken that, and it has broken the need to coordinate.

    11. AM

      Hmm.

    12. GS

      Every endpoint independently very quickly detects, "Hey, that pa- w- I shouldn't use that path," and just stops using it. And this is maybe counterintuitive because you would think, "Oh, it's easier if I just have some central authority that tells me that, you know, this link is down, and, and then I... and that central ch- central authority can distribute that information."

    13. AM

      Anybody who's waited for a website to update knows that's not gonna work.

    14. GS

      Right.

    15. AM

      [laughs]

    16. GS

      Uh, central authorities are, generally speaking, also known as single points of failure.

    17. AM

      Yeah.

    18. GS

      Um, and so instead, what we've done here is we no longer have to w- wait for this whole convergence process to d- to occur, which can take seconds, or in the tail-

    19. AM

      Hmm

    20. GS

      ... tens of seconds. Instead, everyone within, generally speaking, milliseconds notices and just stops using that link. And so this is a very big deal 'cause previously, you know, the link goes down, and the whole job stops for a few seconds as we wait for kind of the network to stabilize. That's time, again, that the GPUs aren't doing useful work. And as you, again, scale up, you're gonna have more and more and more of those individual little seconds. And here now, what we've observed is that, like, you know, we turned this on as we were basically... as the data center was being built. Uh, as Mark said, we were able to, you know, get jobs up and training within months of hardware arriving. There's a lot of manual labor that goes into building one of these buildings. There's a lot of kind of shared points where fibers from one data hall are coming in, and technicians are trying to assemble another data hall, or things like that. And so what we saw was that because of all of this kind of manual effort that was going on, links were going up and down all the time.
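
The contrast Greg draws can be sketched in a few lines (the class name, threshold, and API are invented for illustration, not from the MRC specification): each sender tracks path health on its own and sidelines a path after a few consecutive timeouts, with no routing-protocol convergence involved.

```python
class MultipathSender:
    """Sketch of endpoint-driven failure handling: the sender scores each
    path itself and stops using one after a few consecutive timeouts,
    instead of waiting for a distributed routing protocol to converge."""
    FAIL_THRESHOLD = 3

    def __init__(self, n_paths):
        self.timeouts = [0] * n_paths  # consecutive timeouts per path

    def live_paths(self):
        return [p for p, t in enumerate(self.timeouts)
                if t < self.FAIL_THRESHOLD]

    def on_ack(self, path):
        self.timeouts[path] = 0        # evidence the path is healthy

    def on_timeout(self, path):
        self.timeouts[path] += 1       # a few RTTs of this and we move on

sender = MultipathSender(n_paths=8)
for _ in range(3):
    sender.on_timeout(5)               # a link on path 5 went down
print("paths still in use:", sender.live_paths())
# Path 5 is sidelined within ~3 round trips; the other paths keep
# carrying traffic, so the training job never notices the failure.
```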

    21. AM

      Yeah.

    22. GS

      Like way, even way more often than you would kind of hope in just due to natural failures. We did not care. We didn't even notice.

    23. AM

      Hmm.

    24. GS

      MRC just took care of it. It just kind of would detect, "Hey, can't use that path. Move on to the o- next one." Didn't care. It was incredible.

    25. MH

      So the oth- the other thing this gives you, um, beyond just that is because once it's handling that by itself, o- once MRC is already working around the failures, the... Traditionally, we would probably still have been running a routing protocol in the network to find the paths that actually work, but routing protocols themselves are complicated, and switches are complicated, and switch software is complicated, and these are all things that can fail. And we realized that actually MRC itself was able to figure out which paths were still working. And so actually we just decided that we would turn off the routing protocols.

    26. AM

      Wow.

    27. MH

      We'd use completely static routing in the switches at the largest possible scale. And oh, so some paths are broken. Who cares? MRC will find the broken w- ones that still work and, and keep going.

    28. AM

      Hmm.

    29. MH

      And that just removes a whole s- set of complexity out of our network management that we just don't need anymore. We don't care about whether the switch control plane is converged because it doesn't need to. It's entirely static. They have a configuration that they have at boot time. They boot up, and they never change their routing tables from then onwards.

    30. AM

      So this is a very big effort working with a bunch of different people. Care to talk about some of the partners?

  7. 25:05–35:09

    Why OpenAI is making MRC an open standard

    1. AM

      And you have all decided to make this open to everybody to use.

    2. MH

      Yeah, so the, the specification is due out through, uh, through OCP as an open standard, and, uh, we've, as you say, decided to open this up for everybody to use.

    3. AM

      Mm-hmm.

    4. MH

      Um, we're big believers in, in open standards and open source. Uh, we're building all of our networks on top of Ethernet, which itself is an open standard. And we benefit when the industry has velocity.

    5. AM

      Mm-hmm.

    6. MH

      When the industry can keep up with the things that we're trying to do on the challenging side of things. And so it's, it's in everybody's interest if we're all trying to actually deploy what we think are the best solutions in this space.

    7. GS

      There's no shortage of coverage of the scale of the AI build-out.

    8. AM

      Mm-hmm.

    9. GS

      Um, I think pers- on a personal level, I think it would be a real shame if that supply chain was fractured.

    10. AM

      Mm-hmm.

    11. GS

      Right? You have people investing in totally different, um, technologies and underlying hardware just because they're, you know, trying to get some small advantage. I think it is... I mean, I'm really excited that this is gonna be an open standard. I think it will really benefit other people outside of OpenAI.

    12. AM

      Mm-hmm.

    13. GS

      It also benefits all of us if we are kind of all pushing in the same direction. Uh, infrastructure is kind of this, like, shared fate of the whole industry, and I think it is a very good thing that we are open sourcing this and kind of bringing everyone along.

    14. AM

      It seems like it's beneficial too, because everything is becoming very collaborative. You take a project like Stargate, which spans multiple locations and many partners across the world, and Microsoft, uh, Fairwater, and this idea that compute is a thing there's never gonna be enough of. The more we work with each other to figure out how to maximize it and keep building it, the better it's gonna be for everybody involved. I mean, it is a very limited resource, but these protocols, like you said before with Ethernet and whatnot, that's really what gave us what we have, things like the World Wide Web and a lot of the cool things, and we realized, oh, share this, because what we all gain is gonna be so much better.

    15. MH

      Yeah, what we're trying to do is, is hard enough without everybody having to reinvent the wheel all the time.

    16. AM

      Yeah.

    17. MH

      Um, we think this is the right way to go, and, um, we'd like everybody else to go in the same direction as us.

    18. AM

      Where, where are the limits of this?

    19. GS

      MRC is a flexible standard.

    20. AM

      Okay.

    21. GS

      Um, it builds on top of Ethernet, so as Ethernet scales, so will MRC.

    22. AM

      Mm-hmm.

    23. GS

      Um, you can think of Ethernet as kind of the protocol that y- that individual devices use to talk to each other.

    24. AM

      Mm-hmm.

    25. GS

      MRC sits on top of that. Um, it incorporates the static routing that Mark is talking about. It also incorporates, uh, what we call congestion control, which is basically: if we do end up in bad situations, because of failed links or choices that we make about how we send traffic, how should the endpoints react to make sure that we use the network fairly and efficiently.
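The classic way endpoints "react to make sure we use the network fairly" is additive-increase / multiplicative-decrease (AIMD). To be clear, this is the textbook technique, not the MRC spec's actual algorithm; the parameter values below are assumptions for illustration.

```python
# Illustrative sketch (hypothetical, not the MRC spec): AIMD rate control.
# The endpoint gently probes for spare capacity while things go well, and
# backs off sharply the moment it sees congestion, so competing senders
# converge toward a fair share of the network.

def aimd_window(events, initial=10.0, increase=1.0, decrease=0.5, floor=1.0):
    """Return the congestion window after a sequence of per-round events.

    events: iterable of "ack" (round went fine) or "loss" (congestion seen).
    """
    window = initial
    for event in events:
        if event == "loss":
            # Multiplicative decrease: cut the sending rate sharply.
            window = max(floor, window * decrease)
        else:
            # Additive increase: gently probe for spare capacity.
            window += increase
    return window


# Five smooth rounds grow the window linearly; one loss then halves it.
print(aimd_window(["ack"] * 5))             # 15.0
print(aimd_window(["ack"] * 5 + ["loss"]))  # 7.5
```

The asymmetry is the point: slow additive growth plus fast multiplicative backoff is what lets many independent endpoints share a link fairly without any central coordination.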

    26. AM

      Mm-hmm.

    27. GS

      Um, my experience with networking is that there will always be more work to do.

    28. AM

      Mm-hmm.

    29. GS

      Uh, there's always gonna be ways we can improve that, make the network more fair. There are fundamental limits on networks, though; specifically, the speed of light is a known speed limit.

    30. AM

      Mm-hmm.

  8. 35:0937:37

    Could AI compute move to space?

    1. AM

      And there's been talk about like, well, the next step is we're gonna put things into space, and, and my, my question's always been like, I can get, like I have a satellite with some GPUs, it's doing inference, but when you're literally spreading things out across th- thousands or hundreds of thousands of miles on a big network, it seems like you lose all of your advantage for speed. The speed of light is your enemy, versus being in one center.

    2. MH

      It's hard to envisage doing the sort of training that we do in, in our, um, Stargate data centers in space. Um, just the latency would be a, a huge problem, and just the background rate of failures would be a problem.

    3. AM

      Hmm.

    4. MH

      And just, you know, we have technicians from Microsoft and Oracle who have to go in and fix things all the time, every day. Hard to do that in orbit.

    5. AM

      Yeah.

    6. GS

      Yeah. I think you can make all sorts of arguments, and I have gotten, I've gone very deep on this. As I said, I have a physics background.

    7. AM

      Yeah.

    8. GS

      I'm very interested in, you know, I, I worked with people who designed satellites. Uh-

    9. AM

      Yeah. You know people doing LISA and things like that, so.

    10. GS

      Yeah. But I think a, a lot of smart people have had-

    11. AM

      Hmm

    12. GS

      ... very reasonable arguments both ways on that dire-

    13. AM

      Yeah

    14. GS

      ... on that dimension. The major barrier, I think, is the, the rate of failure. I mean, with every generation of GPU, the GPU itself gets more powerful and more expensive and more important. Um, I think we are doing incredible work here on Earth to try to kind of route around failures automatically, but I think that you would find yourself with a lot of hardware that you couldn't use very quickly-

    15. AM

      Hmm

    16. GS

      ... if you start shipping these things into space. Now, is there a world in which you can also put technicians in space?

    17. AM

      Mm-hmm.

    18. GS

      Maybe. You can do all sorts of things. The dreamer side of me says that would be really, really cool.

    19. AM

      Yeah.

    20. GS

      The practical side of me says it's really, really hard to do this stuff on Earth.

    21. AM

      Yeah.

    22. GS

      Like, every day we are trying to push limits on all sorts of dimensions. Even just spinning up MRC-

    23. AM

      Yeah

    24. GS

      ... was a huge effort that required very close collaboration with us and engineers at a number of other companies. Um, and it required, in some cases, you know, hands on machines to fix things, to test things, et cetera. These systems are hard enough to build and make work and make perform-

    25. AM

      Mm-hmm

    26. GS

      ... here on Earth. I think trying to push the boundaries of that and also adding additional complications, uh, you have to make a really strong case for why it makes sense to do it in space.

    27. AM

      So build more terrestrial compute centers.

    28. GS

      Please.

    29. AM

      Yeah. [laughs]

    30. GS

      I mean, that is, that is what we are trying to do here-

Episode duration: 37:38


Transcript of episode TiW96H5HmAw
