
No Priors Ep. 77 | With Foundry CEO and Founder Jared Quincy Davis

In this episode of No Priors, hosts Sarah and Elad are joined by Jared Quincy Davis, former DeepMind researcher and the Founder and CEO of Foundry, a new AI cloud computing service provider. They discuss the research problems that led him to start Foundry, the current state of GPU cloud utilization, and Foundry's approach to improving cloud economics for AI workloads. Jared also touches on his predictions for the GPU market and the thinking behind his recent paper on designing compound AI systems.

Sign up for new podcasts every week. Email feedback to show@no-priors.com. Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @jaredq_

Show Notes:
00:00 Introduction
02:42 Foundry background
03:57 GPU utilization for large models
07:29 Systems to run a large model
09:54 Historical value proposition of the cloud
14:45 Sharing cloud compute to increase efficiency
19:17 Foundry's new releases
23:54 The current state of GPU capacity
29:50 GPU market dynamics
36:28 Compound systems design
40:27 Improving open-ended tasks

Sarah Guo (host) · Jared Quincy Davis (guest) · Elad Gil (host)
Aug 22, 2024 · 42m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–2:42

    Introduction

    1. SG

      (music plays) Hi, listeners. Welcome to No Priors. Today we're talking to Jared Quincy Davis, the founder and CEO of Foundry. Jared worked at DeepMind and was doing his PhD with Matei Zaharia at Stanford before he began his mission to orchestrate compute with Foundry. We're excited to have him on to talk about GPUs and the future of the cloud. Welcome, Jared.

    2. JD

      Thanks, Sarah, and great to see you. Thanks a lot as well.

    3. EG

      Yeah, great seeing ya.

    4. SG

      The mission at Foundry is directly related to some problems that you, uh, had seen in, in research and at DeepMind. Can you talk a little bit about the genesis?

    5. JD

      A couple of the most inspiring events I've witnessed in my career so far were the release of AlphaFold 2 and also ChatGPT. Um, you know, I think that one of the things that was so remarkable to me about AlphaFold 2 is initially it was a really small team, you know, three and then later 18 people or so, and they solved what was kind of a 50-year grand challenge in biology, which is a pretty remarkable fact that, you know, every university, every pharma company hadn't solved. And similarly with, with ChatGPT, a pretty small team, OpenAI was 400 people at the time, you know, released a system that really shook up the entire global business landscape. You know, that's a pretty remarkable thing, you know, and I think it's kind of intriguing to think about what would need to happen for those types of events to be a lot more common in the world. And, you know, although those events are really amazing 'cause of the small numbers of people working on them, I think, you know, it's not quite the David and Goliath story, neither are quite the David and Goliath story that they appear to be when you, when you double-click. In OpenAI's case, you know, there are only 400 people, but had $13 billion worth of compute, you know, which is, uh, quite a bit of computational scale there. And in DeepMind's case, it was a small team, you know, but obviously they were standing on the shoulders of giants in some sense with Google, right? And the leverage that they had via Google. And so, one thing I think that, you know, we thought about is, you know, what can we do to make the type of computational leverage and tools that are currently exclusively the domain of OpenAI and DeepMind kind of available to a much broader class of people? And so that's a lot of what we worked on with Foundry, saying, "C- can we build a public cloud, y- built specifically for AI workloads, where we reimagine a lot of the components that constitute the cloud end-to-end for first, from first principles? And in doing that, can we make things that currently cost a billion dollars cost 100 million, then 10 million, you know, over time?" And that'd be a pretty massive contribution. I think it would increase the frequency of events like AlphaFold 2 by 10X, 100X, or maybe even more super linearly. Um, and we're already starting to see the early signs of that, um, but quite a, you know, quite a lot of room left to push this agenda. So, really exciting. Um, so that's kind of maybe an initial introduction, preamble to how we thought about it, and I can trace that line of reasoning a bit more, but, um, that's kind of part of what we've done and...

    6. SG

      Jared,

  2. 2:42–3:57

    Foundry background

    1. SG

      what, uh... For, for anybody who, uh, hasn't heard of Foundry yet, what is the product offering?

    2. JD

      Yeah, so Foundry, we're essentially a public cloud built specifically for AI. And what we've tried to do is really reimagine all of the systems undergirding what we call the cloud end-to-end from first principles for AI workloads, and we've tried to do this in a bit of a new way. I think the AI offerings from the existing major public clouds and kind of some new GPU clouds haven't really re-envisioned things, and by thinking about a lot of these things a bit anew, we've been able to improve the economics by, in many cases, 12 to 20X, um, over lower tech GPU clouds and existing public clouds, and, you know, we'll... Partially based on some of these products that we'll talk about today, um, that we're releasing and a lot of new things that we're working on, um, we think we can push that quite a bit further as well. Um, and so, you know, our, our primary products are essentially infrastructure as a service, so our customers come to us for elastic and really economically viable access to state-of-the-art systems, um, and also a lot of tools to make leveraging those systems really seamless and easy. And we've invested quite a bit in things like reliability, security, elasticity, and just the core price performance.

    3. EG

      How underutilized

  3. 3:57–7:29

    GPU utilization for large models

    1. EG

      are most GPU clouds today? And I think there's almost three versions of that. There's things on hyperscalers like AWS or Azure. There's large clusters or clouds that people who are doing large-scale model training or inference run for themselves, and then there's more just, like, everything else. It could be a hobbyist. It could be a research lab. It could be somebody with just, you know, some GPUs that they're messing around with. I'm sort of curious for each one of those types of, or categories of users, like, what, what is the likely utilization rate and how much more do you think it could be optimized? Is it 10%? Is it 50%? Like, I'm just very curious.

    2. JD

      One of the most, I'd say, positive cases with the highest utilization is the case where you're running kind of an end-to-end pre-training job, right? And so that's a case where you've done a lot of work upfront. You've designated a time that you're gonna run this pre-training workload for, and you're really trying to get the most utilization out of it. And for a lot of companies, utilization even during this phase, you know, is sub 80%. So, why? One reason is actually that these GPUs, particularly the newer ones, actually do fail a lot, um, as, you know, a lot of practitioners would know. And so one of the consequences of that is that it's very common now to hold aside 10 to 20% minimum of the GPUs that a team has as buffer, as healing buffer in case of a failure, so you can slot something else in to keep the training workload running, right? And so even for a lot of the most sophisticated orgs running large pre-trainings at scale, the utilization is sub 80%, sometimes less than 50% actually, depending on how bad of a batch, uh, they have and the frequency of failure in their cluster. Um, and so even in that case... Now, there are often, though, also large gaps and intermissions between training workloads, um, even if the GPUs are dedicated to a specific entity, you know? And so even in those most conservative cases, and I'll come back to the less conservative extreme cases, utilizations can really be, uh, quite a bit lower than people would imagine. Um, so we can pull on that case a bit more 'cause I think it's actually quite counterintuitive and really interesting. I think there's a really fundamental disconnect between people's mental image of what GPUs are today and what they actually are. I think that in most people's minds, you know, GPUs are chips, right? And we, we call, talk about them as chips. But actually, the H100 systems are truly systems. You know, they're 70 to 80 pounds, 35,000 plus individual components. They're really kind of monstrosities in some sense. And the remarkable thing that I think Jensen and NVIDIA have done, one of, one of many, is they've basically taken an entire data center's worth of infrastructure and compressed it down into a single box. And so when you look at them from that perspective, the fact that it's 70 pounds isn't quite as alarming. Uh, but it is, these are really gnarly systems. And when you end up composing these individual systems, these DGXs or HGXs into large supercomputers, what you're often doing is you're interconnecting thousands, tens of thousands, hundreds of thousands of them. And so the failure probability kind of multiplies. And so because you have millions, perhaps, of individual components in this super comp- supercomputer, the probability that it will run for weeks on end, and this is basically a verbatim quote from Jensen's keynote, um, is basically zero. That's a, a little bit of a challenge. And funnily enough, I think the AI infrastructure world is still somewhat incipient, and so doesn't have perfect tooling, broadly speaking, to deal with these types of things. Um, and so one of the compensatory measures I think people take is reserving this healing buffer, for example. I think that disconnect maybe helps explain wh- why these things fail, and actually, funny en- funnily enough, the newer, more advanced systems anecdotally fail a lot more than historical systems, um, that were, that were

  4. 7:29–9:54

    Systems to run a large model

    1. JD

      worse in some ways.
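
To make the healing-buffer logic above concrete, here is a rough back-of-envelope sketch. The failure rate and cluster size below are made-up placeholders, not measured numbers; the point is only that per-component failure probabilities compound across a large cluster and a long run.

```python
# Illustrative sketch (not Foundry's model): why long runs on large clusters "always" see failures.
# Assumed, made-up numbers: per-server daily failure probability and cluster size are placeholders.

def p_run_completes(num_servers: int, days: float, p_fail_per_server_day: float) -> float:
    """Probability that no server fails for the entire run, assuming independent failures."""
    p_server_survives = (1.0 - p_fail_per_server_day) ** days
    return p_server_survives ** num_servers

# A 2-week pre-training run on a 2,000-server cluster with a 0.1%/server/day failure rate:
print(p_run_completes(2000, 14, 0.001))   # ~7e-13, i.e. essentially zero

# Which is one reason teams hold back a 10-20% "healing buffer" of spare GPUs,
# capping utilization before any other losses:
buffer_fraction = 0.15
print(1.0 - buffer_fraction)              # <= ~85% utilization ceiling during training
```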

    2. EG

      And do you think that's just like a quality and, uh, quality control issue for the systems or do you think it's just some form of complexity with some failure rate per component that's rising? Or like what do you think is sort of the driver of that?

    3. JD

      I think it's more the complexity has grown, right? And we're in a different regime now. You know, I think that it's fair to say, so maybe stepping back again to definitions, we throw the term large around a lot in the ecosystem. I guess one question is, what does large mean? And one useful definition of large that I think roughly corresponds to what people mean when they invoke the term is that, for a large language model, you enter the large regime when, essentially, the memory necessary to contain even just the model weights starts to exceed the capacity of even a state-of-the-art single GPU or single node. I think it's fair to say you're in the large regime when you need multiple, you know, state-of-the-art servers, um, from NVIDIA or from someone else to even just contain the model, you know, to basically run any training or definitely even just to contain the model. That's definitely the large regime. And so the key characteristic of the large regime is that you have to somehow orchestrate a cluster of GPUs to perform a single synchronized calculation. Right? And so then it becomes a bit of a distributed systems problem. I think that's one way of characterizing the large regime. Now, a consequence of that is that you have many components that are all kind of collaborating to perform a single calculation. And so any one of these components failing can actually potentially, you know, lead to some degradation or challenge downstream and, I mean, stop the entire workload. Right? You know, you've probably heard, people have talked a lot about InfiniBand and the fact that part of NVIDIA's advantage comes from the fact that they do build systems that are also state-of-the-art from a networking perspective, right? And their acquisition of Mellanox was arguably one of the better acquisitions of all time from a, you know, market cap creation perspective. And, uh, the reason that they did this is because they realized that it would be really valuable to connect many, many machines into a single almost contiguous supercomputer that almost acts as one unit. Yeah, and the challenge of that, though, is that there now are many, many more components and many more points of failure. And these things kind of, you know, the points of failure kind of multiply, so to speak.
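
As a rough illustration of the "large regime" definition above, here is a small back-of-envelope sketch. The 80 GB per GPU and 8 GPUs per node are assumptions about a typical state-of-the-art H100 node, and counting only the weights is deliberately the most charitable case (activations, optimizer state, and KV cache only make it worse).

```python
import math

def min_gpus_for_weights(params_billions: float, bytes_per_param: int = 2,
                         gpu_mem_gb: float = 80.0) -> int:
    """GPUs needed just to hold the weights, ignoring activations and optimizer state.
    Assumes fp16/bf16 (2 bytes/param) and an 80 GB GPU by default."""
    weight_gb = params_billions * bytes_per_param   # 1e9 params * bytes / 1e9 bytes-per-GB
    return math.ceil(weight_gb / gpu_mem_gb)

print(min_gpus_for_weights(70))    # 70B params -> ~140 GB -> 2 GPUs: already beyond one GPU
print(min_gpus_for_weights(405))   # 405B params -> ~810 GB -> 11 GPUs: beyond a single 8-GPU node
```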

    4. SG

      You, um, implied that, like, GPUs ... Well, you described GPUs as this unique asset that is more

  5. 9:54–14:45

    Historical value proposition of the cloud

    1. SG

      CapEx than OpEx and that the hyperscalers, like an Amazon, are making a certain assumption about what that, uh, depreciation, um, cycle is. Like, where do you think the assumptions for Foundry versus, uh, those hyperscalers or versus, let's say, like, a CoreWeave are different?

    2. JD

      This opens up a pretty interesting conversation around, like, what is cloud? I think we've kind of forgotten in this current moment what cloud was originally supposed to be and what its value proposition was intended to be. I think currently AI cloud is not cloud in the originally intended sense by any means. Um, so we should pull on that thread. But I'd say right now, um, it's basically co-location. Yeah, it's basically co-location, right? It's not really cloud.

    3. SG

      Yeah.

    4. JD

      Maybe it ... Yeah. It's, it'll be, it's definitely worth pulling on that thread a little bit.

    5. EG

      Yeah, do you wanna break that down for sort of our listeners in terms of what you view as the differences?

    6. JD

      First, I guess as a little bit of context for people, the cloud as we currently know it is arguably one of the, you know, most important business categories in the world. That's, I think, pretty clear. The biggest companies in the world are either clouds, they're Azure, AWS, GCP, um, you know, a core component of the biggest companies in the world, or NVIDIA who sells to the clouds, obviously. Um, so it's clearly an important category. AWS is arguably a trillion dollars plus of market cap if you broke it out from Amazon. Um, so, you know, at one point relatively recently, it was all of Amazon's profit and more. So it's an important category to say the least. Cloud as we know it today really started in 2003, which is when Amazon, you know, assigned about 50 people initially, I believe, to start working on AWS. Um, and they worked on this for three years within Amazon, launched in March 2006 with S3. Later that year in September, they launched EC2, um, and that kind of was the beginning of the cloud. It took quite a while for this model to catch on and people were not quite clear on why this would be useful, um, even until quite recently. And so in 2009, 2010, you know, Matei, um, who you mentioned, um, wrote this paper called Above the Clouds: A Berkeley View of Cloud Computing with some collaborators at Berkeley, and they talked about why cloud would be a big deal. And I think it was, to say the least, not appreciated that the points they were making were valid at the time. I think it was kind of, you know, not clear to people, to say the least. You know, fast-forward a few years, Bezos in 2015 said AWS was, for all intents and purposes, market size unconstrained, and he was kind of laughed at by many. Uh, that was seen as a really ludicrous statement to make. You know, fast-forward and no one's laughing. Um, in 2019, I remember people saying, like, "It's not clear if cloud will be a big deal." Snowflake hadn't yet scaled. You know, you know, Databricks hadn't yet scaled.

    7. EG

      I wonder if it's worth making a distinction on these things, 'cause, you know, I agree with parts of what you're saying, but, you know, I was building startups in that era. And at the time, um, there was a pretty strong belief that, at least for startups, these clouds were incredibly useful. Because it used to be you'd spend a lot of money and effort on setting up your server f- your server racks and wiring everything up and all the rest. And then, you know, the emergence of things like AWS, and nobody expected it to be Amazon, right? It just didn't fit what people thought of as basically an e-commerce marketplace company, right? They were suddenly building infrastructure and providing it, and it w- it felt very natural for somebody like Google to provide that. But I think for startups, 'cause I was doing a startup in 2007, you know, we thought that it was magic, right? 'Cause suddenly... And not all the services were there yet and everything else, but suddenly you could... You didn't have to deal with all the infra. And there was a number of companies that started before that, like Twitter and others, who kind of ended up having to continue to build and maintain their own clouds, and it was really brutal. And then to your point, I think the big transition was the degree to which enterprises, particularly in regulated services or financial industries, kind of fought it initially, and then they, they started opting in. Um, so I, I, I, I, I agree with a lot of what you're saying. I wouldn't say that it was one of those things that nobody believed in, though. You know, like, I actually thought there was a lot of buy-in and a lot of belief in, you know, adoption.

    8. JD

      Definitely not nobody. I think, you know, proficient people and a lot of builders really saw the value really early, particularly startups and the VC community. You know, I think Berkeley and some of the researchers there really got it early, like 2009 and, and earlier. You know, by 2015, I think it was already around 3 billion of run rate. So it was not small. It was very far from what it is today, but it was not small. They didn't break it out yet, but it was, you know, not small at all. It was a meaningful thing. I think that people didn't recognize it would become anything like it is today, and it still wasn't clear to people what the value proposition was. You may recall, but there was, you know, Dropbox, for example, famously exited the cloud and went back on-prem. And that story, they published why they did that and the economics of it, and that caught a lot of attention, and people were saying, "Uh, I'm not sure if cloud actually makes sense anymore," et cetera. And I think people kind of lost track of that. And I think one of the insightful things that some of the early cloud soft- systems companies, like Databricks and, and Snowflake recognized was one of the value propositions of the cloud that was really special. I'll actually use o- one of the quotes from, well, Snowflake founders to kind of illustrate this, was the thing... They, they were trying to think about what was unique about the cloud, what you could do in the cloud that you couldn't do anywhere else. And the killer idea that they converged on, fundamentally, the cloud made fast free. Quote-unquote, "Fast was free in the cloud," is how they put it

  6. 14:45–19:17

    Sharing cloud compute to increase efficiency

    1. JD

      explicitly. And the idea was if you have a workload that was designed to run for 10 days on 10 machines, in the cloud, you could theoretically run it on 100 machines for one day, you know, or 10,000 machines for 15 minutes. And that would cost the exact same. And so you could run it 1,000 times faster for the same cost, theoretically. And it's like, well, that's kind of a big deal, right? You can run something 1,000 times faster for the same cost, you know, and then give the compute back. Now, that would require, though, a number of things to make it actually work. One being that you'd have to be able to kind of re-shape a workload that was designed to run on 10 machines to run on 10,000 machines. Not trivial. Another is that you actually have to have the 10,000 machines worth of capacity in the cloud and make sure that was utilized to make the economics kind of work. You know. Um, but yeah, if you could do those two things, it'd be a really, really big deal. That's fast being free in the cloud. And so one of the key ideas was this elasticity, right? And I think that's one of the things that's really absent in AI cloud today. In AI cloud today, you're kind of forced to get really long-term reservations. Um, often three years for a fixed amount of capacity. No one really wants 64 GPUs for three years, or 1,000 GPUs for three years. You want 10,000 for, you know, a couple of months, maybe nothing for a little while. Uh, maybe you don't know how much you need because you're launching a new product and you're not sure how much demand there will be for it, um, and how much inference capacity you'll need, et cetera. It's very, very challenging if you have to reserve the total amount that you may need upfront, um, for a long duration.
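
As a toy illustration of the "fast is free" arithmetic described above: the total machine-hours, and therefore the bill, stay the same whether the job runs narrow-and-long or wide-and-short, assuming the workload reshapes perfectly and the capacity exists. The hourly price below is a made-up placeholder.

```python
# Hypothetical price; the point is the invariance, not the number.
PRICE_PER_MACHINE_HOUR = 3.00

def job_cost(machines: int, hours: float) -> float:
    """Total cost of a job that keeps `machines` busy for `hours`."""
    return machines * hours * PRICE_PER_MACHINE_HOUR

total_machine_hours = 10 * 10 * 24               # 10 machines for 10 days
print(job_cost(10, 10 * 24))                     # narrow and slow
print(job_cost(100, total_machine_hours / 100))  # 100 machines for 1 day  -> same bill
print(job_cost(10_000, total_machine_hours / 10_000))  # ~14-15 minutes    -> same bill, ~1,000x faster
```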

    2. SG

      I wanna go back to this idea of like, um, uh, it's actually hard for, uh, a lot of engineering teams, especially younger ones, to picture, like, a pre-cloud world. Because, y- you know, their experience with it is like, you know-

    3. JD

      (laughs)

    4. SG

      ...I have virtual machi- machines on Amazon. They're limitless.

    5. JD

      Yeah.

    6. SG

      Like, maybe it's serverless. And, like, if you go all the way back to what Elad was talking about, you gave us a great historical view, but if you look at, like, functionally. Like, I had, you know, I had my servers in my closet on-prem, right? And then I had colocation, which is like, I have my servers. They're still my servers. I control them and I manage them, but I, they physically live in somebody's data center where they're offering me, like, real estate, like cooling and power, right? And then you had hosting, which is like, I'm buying a machine in a data center. Like reservations for a long time, essentially, um, or for the life cycle of the machine. Uh, and then you had virtualization and containerization and all of these services that came out of this, like, you know, separation into cloud services, where you have higher level functions with scheduling orchestration and, you know, serverless is like, uh, I'm just gonna write, like, the logic. You deal with it, um, and place that workload. And it, like, that, w- you know, we haven't obviously gotten to an end point in non-AI computing, but I feel like the engineering world is so used to being over here, whereas the AI, um, the AI hardware resource world, there's like, we're somewhere between colo and, you know, colo and hosting still.

    7. JD

      That's right, yeah. And that's challenging. That means that you have to raise all the capital you may need upfront. You know, um, that's another challenging model. And it means also that, you know, you can't quite, you know, grow the product elastically. As demand and interest in it grows, you're kind of bottlenecked by the supply chains in a way that I think developers haven't experienced in quite a while. Um, so it's a pretty challenging state of affairs, I think, and it's also very challenging from a risk management perspective for these companies, because they're making these big commitments, um, you know, that are potentially, you know... if they don't work out, are pretty catastrophic for them, um, on this hardware, paying all upfront, um, paying for these long-duration contracts, et cetera. It's a very challenging thing. And there's no analog yet. The market's not mature enough that there's any analog similar to what we have in other domains, like in commodities markets like wheat, oil, et cetera, where you can buy options, and futures, and hedge, and sell back, and things like that. We're still pretty incipient in terms of where the market's at. And yeah, I think that's leading to a challenged state of affairs that's going to, you know, continue to bring a lot of pain for people. Um, and so, you know, there are several things I think we've employed to do this better. Some are, you know, mixes of business model innovations and technical innovations at the same time. Um, you know, but I think we're making a pretty substantial dent in this, but also in a way that's really viable economically for us, um, and, you know, doesn't involve buying a lot of GPUs and taking undue risks on them, per se. Um, and so that's kind of a lot of what we've tried to do is, can we do something a lot more efficient? Um, you know, can we find some leverage, some points of leverage, um, to address this problem?

    8. SG

      Uh, you guys have some new releases, um, as of, I think,

  7. 19:17–23:54

    Foundry’s new releases

    1. SG

      a day or two ago. Um, uh, can you describe, uh, what's just come out from Foundry?

    2. JD

      So I think right now, AI cloud is... the AI cloud business is very much like a parking lot business. And that sounds really funny because cloud is supposed to be high tech, um, and you can hardly conceive of a less sophisticated business, at least on the surface, than parking lots. And what do I mean by that? Well, there are fundamentally two models in the parking lot business. One is pay as you go. For pay as you go, the rates are usurious. You know, you may or may not find a space. You... I'm sure many of us have the experience of driving through SF and seeing a lot cl- a lot full sign, you know, for lot after lot as we drive around trying to park. Um, and if you do get a spot, you might pay, you know, the $12 an hour, you know, rate or something like that. I'm choosing that rate because it's the rate of AWS, um, you know, for on-demand. On the other hand, if you want to kind of guarantee that you'll have a spot and also have a better rate, you can basically buy a spot reserved. Um, so you can have your own reserved parking spot in your building. Maybe you pay, you know, $4 an hour, um, so you're getting a massive discount, but it's 3K a month effectively, right? Which is actually pretty substantial. And if you're only using it 40 hours a week when you're in the office as a typical worker, it actually might be effectively $16 an hour, um, as opposed to $12, so it's actually worse. And so I think one kind of funny analogy for one of the things that we want to do with a couple of these products is kind of create the... enable the equivalent of allowing people to park, pay as you go, in someone else's reserved spot. Y- and that sounds kind of funny, but you can imagine that, okay, that'd be actually an interesting thing. And if you could do that, then depending on the percentage of the spots that are typically reserved, you might have 10X the effective capacity in a lot. Um, you know, and then also it can kind of be a win-win. Instead of the pay as you go person paying 12, you know, they can pay something much lower, in this case, I'll say seven, but it could actually be a lot lower. You know, the person who owns the spot instead of paying four can make five equivalent, and then the lot can also make a couple of bucks. And so it's kind of a win-win-win for everyone. Um, and you're kind of double-booking the lot and it's, you know, really, really efficient. Now, that sounds great, but it wouldn't quite work if you showed up to your reserved spot and there was someone parked there. That might be a little bit aggravating. It also might not work if we were forced to call two hours in advance and say, "Hey, I'm coming." Um, and then the person who was parked in your spot had to leave their dinner reservation to move their car. That'd be... that wouldn't be a fun, a fun model. And so, and I think one thing that we had to do was kind of create the analog of a system to make this all really convenient. And so, you know, the- maybe the V1 of this system was you came into the lot and a sensor was triggered, um, saying, you know, you're here to pick up... to go to your reserved spot, and then some valet ran to the car that was parked there and moved it, and they maybe moved it from the second floor to the tenth floor. And then the person who had parked there previously now comes to the counter, asks the valet where their car is, and gets some ticket saying it's on the tenth floor, and then... 
but there's no elevator, so they have to walk up the stairs and get it. You know, I'm, I'm stretching this analogy, but you get the idea. It'd be kind of inconvenient. And so part of what we did was added more and more convenience features, which we broadly call spot usability. Um, and these are a lot of things that we're continuing to add. And so, you know, the scenario that we're now at is basically, you know, you let someone else use your spot; you show up, the sensor just kind of knows, okay, you're here, and then the car in your spot is automatically moved, um, via conveyor to another spot. We're managing the spaces to ensure we can move it somewhere. And then when the person comes to pick up their car, it's kind of brought to them. Now, yeah, so it's all really convenient, kind of seamless for everyone involved, um, and creates a ton more effective space, allows us to get much better economics out of the machines, um, and is really, really helpful for companies. That's one really powerful, I think, thing that we've done, and this is kind of offered in the context of a spot product. I think people are somewhat familiar with spot usage in the typical cloud context. It's a lot more challenging to do with GPUs for a few different reasons, which is why there are ver- not very many GPUs available on spot, definitely not at scale and definitely not with interconnect. Um, so one of the things we want to do is enable this, and it makes a lot of other things possible that are pretty neat. And so it's a mechanism we're employing in a few different ways. So that's, I think, one thing, and I'll give some more analogies to explain why that's powerful. What we found is that companies are using this type of mechanism quite a bit for everything from training, which is classically seen as a workload that's difficult for spot, but also especially for things like inference, batch inference especially. And you know, that actually opens up another interesting conversation about the different classes of workloads, um, and what each workload needs and cares about, and how this might evolve over time. That, you know... and actually that ties a little bit to the compound AI systems concept, um, and also to the Llama 3.1 release in a, in a funny tangential way. Um, but yeah, that's kind of one analogy for the product that we launched on the Foundry Cloud platform around spot.
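
Here is a quick sketch of the reserved-versus-on-demand arithmetic in the parking-lot analogy above. The rates mirror the numbers quoted in the conversation; the 730-hours-per-month and 40-hours-per-week figures are just illustrative assumptions.

```python
HOURS_PER_MONTH = 730           # assumed average month
WEEKS_PER_MONTH = 4.33          # assumed average month

def effective_hourly_rate(reserved_rate: float, hours_used_per_month: float) -> float:
    """What a reserved resource really costs per hour actually used."""
    monthly_cost = reserved_rate * HOURS_PER_MONTH          # $4/hr reserved -> ~$2,900/month
    return monthly_cost / hours_used_per_month

on_demand_rate = 12.0                                        # $/hr, pay-as-you-go
used_hours = 40 * WEEKS_PER_MONTH                            # only in the "office" 40 hr/week
print(round(effective_hourly_rate(4.0, used_hours), 2))      # ~$16-17 per used hour

# The "cheap" reservation is effectively more expensive than on-demand unless it stays busy;
# letting others pay to use the idle hours (the spot model) is what closes that gap.
```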

    3. SG

      Yeah, I, I think that, you know, spot usability, um, increasingly, uh, deep and flexible and automated is like a really powerful primitive.

  8. 23:54–29:50

    The current state of GPU capacity

    1. SG

      Um, uh, why don't... uh, changing tacks a little bit to just something I think the entire industry, like the tech industry is very interested in. We did a little bit of work together a- a while back, just understanding like where, you know, uh, where is the GPU capacity in the world today, right? Of all of the different types, how much is it? Like, how consolidated is it? And, uh, obviously this is, you know, near and dear to your business. Can you just describe a little bit, like, where, where you think we are and then, like, what caused, what caused the shortage, um, sort of last year?

    2. JD

      One kind of funny bit of trivia that I've, I've posed to a few people that I think, you know, reveals how off-base a lot of our priors are is kind of what percentage of the world's GPU petaflop capacity or exaflop capacity is kind of owned by the major public clouds. And I've asked this to many people, and I typically have gotten guesses, you know, in the high tens of percents. And the only time I got a lower guess was from Satya at Microsoft, who guessed basis points, um, which actually is, is correct. Um, you know, it's a very small, small amount. And maybe as one anecdote to illustrate how this looked at least a couple of years ago, this is an evolving thing, but how it's looked, you know, the example of GPT-3 and its training is kind of an interesting one. Um, it's a bit dated, but I'll use it just because the types of machines and numbers there are public, as is not the case for some of these other systems. Um, and so GPT-3 was trained on 10,000 V100 GPUs in an interconnected cluster in Azure for about 14.6 days. You know, to put that in perspective, it was a state-of-the-art system. You know, it was kind of, by many estimates, eight figures for a single run at the time. Um, so it's a pretty substantial investment by OpenAI. And that tells you that 10,000 V100s running continuously for 14.6 days is quite a bit of compute. I think one interesting kind of maybe trivia question then is how many equivalent GPUs, normalized in terms of the number of flops, you know, they weren't fully interconnected, but just an interesting, you know, proxy measure anyway, were there in the Ethereum network at the peak of Ethereum? And so I kind of ask people this question, and, um, it's fun to see people guess. Um, can I solicit a guess actually, Elad? Can you, do you have a guess there? You might know. So, um, 'cause we've talked about this a little bit.

    3. SG

      We had this conversation, so I'm not allowed to guess anymore.

    4. JD

      Yeah, we had this conversation.

    5. SG

      Yeah.

    6. JD

      So, Elad, do you know by chance?

    7. EG

      For Ethereum?

    8. JD

      Yeah, for Ethereum. And don't, don't think too hard. Just make a guess based on priors.

    9. EG

      When? So when it first launched, or...

    10. JD

      The very peak, the tippy top of Ethereum. How many V100 equivalents were there, given there were 10,000 for two weeks for GPT-3? How many were there in Ethereum? Noting, by the way, that these were running 24/7 in Ethereum, so you can modulate your guess based on that.

    11. EG

      I would guess a few hundred thousand, a few million.

    12. JD

      That's an aggressive guess. Um, yeah, that's an aggressive guess, and you're, you're actually very correct. It was about 10 to 20 million.

    13. EG

      Yeah.

    14. JD

      Um, which is quite a substantial scale. (laughs) And you can, by the way, construct this really easily by looking at, you know, basically the hash power in Ethereum at the peak, which was around 900 terahashes per second, I believe, about a petahash per second. So quite a bit of peak power. And a typical V100 will give you between, I believe, 45 to 120 megahashes per second if you really know what you're doing. Um, so that's kind of one way to estimate it. There, there are tens of millions of GPUs.
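
A rough reconstruction of that comparison, using only the approximate figures cited in the conversation (peak Ethereum hash rate of about a petahash per second, roughly 90 MH/s per well-tuned V100, and GPT-3's ~10,000 V100s for ~14.6 days). This is a loose normalization, not a precise flop accounting.

```python
# Approximate figures from the conversation, not precise measurements.
eth_peak_hashrate_mh_s = 1e9          # ~1 PH/s expressed in MH/s
v100_hashrate_mh_s = 90.0             # a well-tuned V100 mined roughly 45-120 MH/s

v100_equivalents = eth_peak_hashrate_mh_s / v100_hashrate_mh_s
print(f"{v100_equivalents:,.0f}")     # ~11 million V100-equivalents, in the 10-20M ballpark

# Versus GPT-3's training footprint: ~10,000 V100s for ~14.6 days.
gpt3_v100_days = 10_000 * 14.6
print(v100_equivalents / gpt3_v100_days)   # Ethereum ran ~75+ GPT-3-runs' worth of GPU-days per day
```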

    15. EG

      Yeah, 'cause it's funny. I remember, um, Bitcoin, even years ago, all the CPU dedicated to it at the time was, um, like larger than all of Google's data centers.

    16. JD

      Yeah. Bitcoin, though, used a lot of, um, ASICs in particular. E- Ethereum actually had a higher relative ratio of GPUs. And some of the larger GPU, y- sorry, Ethereum mining providers like, like Hive, for example, had, you know, less than 1% of global hash power. Um, you know, and so you can actually start to extrapolate, you know, and, you know, they had quite a few GPUs, like tens of thousands of NVIDIA GPUs. It starts to give you a sense of the scale of the capacity, and then all that other, you know, mining equipment, for way less than 1% of the total hash power, about 0.1%. Um, so quite a bit of hash power in Ethereum. And I think that's kind of one proxy, but to give you one more anecdote on that line, actually an iPhone 15 Pro now is actually stronger than a V100, as a funny pro- as a funny example. It has about 35 teraflops in FP16, I believe, um, where a V100 is around 30. And so there's actually quite a bit of compute in the world broadly speaking, is I guess the point I'm making. Now, it's not all useful, it's not all interconnected, it's not all accessible, it's not all secure, but this is one point to make, that there's a lot of compute in the world. And even for the high-end GPUs, there's a lot more than people think. Um, you know, by many measures, utilization of even these H100 systems, which are kind of state-of-the-art, the most valuable, the most precious, et cetera, is in many cases 20%, 25% or lower according to some, um, you know, pretty high-quality data I- I've seen from, uh, some great sources here. Yeah, so quite, quite low. And as I mentioned, even during these pre-training runs, it's often 80% or lower because of the healing buffer partially. That actually ties to another product that we've launched which is, um, this product that we built actually largely for ourselves called MARS. Um, it's kind of a funny name, but it's Monitoring, Alerting, Resiliency and Security. It's basically a suite of tools that we've invested a lot of IP in to really boost and magnify the availability and uptime of GPUs for our own platform. It was actually something that we planned to make available to other people just to use, um, in their own clusters, um, as well. Actually, one of the reasons why we invested in spot is because we reserve healing buffer very aggressively ourselves, um, so that if there's a GPU failure, we can automatically swap in another GPU and a user won't perceive a disruption. And so we actually, we maintain buffer for that reason. And so actually being able to pack that buffer with preemptible nodes is actually a really useful thing. But now we're allowing other people to do this, including third-party partners who want to, for example, make their healing buffer available to others, um, through Foundry. It really helps offset their economics and the cost of the cluster for them. Um, so it's a really, really powerful thing. And so between MARS and spot, you can kind of see how these things are really interconnected and in a nice way. But yeah, the number of GPUs, uh, available, there's, there's quite a few, um, particularly if you look at it more broadly in terms of kind of total AI compute capacity. Um, the percentage that's accessible, useful and used, um, is a pretty de minimis fraction of the, of the total.

  9. 29:50–36:28

    GPU market dynamics

    1. SG

      How do the GPU market dynamics and your prediction of them factor into Foundry's strategy going forward, right? Because, um, especially for anybody doing large-scale training jobs, there is definitely a, uh, you know, a significant effort to be at the, um, leading edge, right? Access to H100s and beyond is at a premium. And then access to, you know, the largest possible interconnected cluster with sufficient power is also a fight now. It sounds like you, you know, see the opportunity differently or you feel like there are resources that, like, can be used, um, that don't require just building new data centers.

    2. JD

      I think it's a little bit of all of the above, to be clear. I think two things can be true at once, and that's definitely the case here. I think there will be many workloads and use cases for which having state-of-the-art, extremely large interconnected clusters is a really valuable thing. You know, part of what we're noticing, though, and also trying to promulgate further, are basically paradigms that don't require this as well. And so here's where I'd say kind of two things can be true at once. And so I think there is a massive shortage of power, space, and interconnect for the kind of largest of clusters. It's actually very hard to come by and to construct, or to find, a really large interconnected cluster. You know, this kind of starts to vanish the larger the cluster gets. Like, there are a lot more 1K clusters than 10K clusters than 20K clusters, and you, you can keep going. And it'll get harder and harder to keep the scaling going from there. You know, I think there's one question which is, how will we continue to push the scaling laws? You know, one fact about the scaling law curves is that they're all plotted on logarithmic graphs. And, you know, things get better predictably, but it requires a continued 2X-ing or 10X-ing to get that next bump in performance. And it's quite a bit harder to get the next 10X-ing. And so it starts to become kind of intractable pretty quickly. And so it's already prompted I think a lot of innovation. So Google, for example, has been doing a lot with, you know, training across facilities, for example, um, across data centers, interconnecting them, for these models like PaLM 2, something that previously would've seemed a bit inconceivable, or things like DiPaCo or DiLoCo, these approaches that they released for training across facilities. I think that's one innovation, but I think actually an even slightly more radical thing is we're starting to see a shift towards a pretty different paradigm. You know, I think myself and a number of my collaborators and Matei and others have kind of termed this compound AI systems, and I think you actually see it with these most recent models, like AlphaGeometry, AlphaCode, and Llama 3. And so I think this actually points the way towards, like, what the AI infrastructure of the future might look like, and I think it looks a lot less like everything requiring these big clusters, and it's a little bit more interesting. And so maybe I'll use Phi-3 as an example. With Phi-3 they took a little bit of a different approach, where they trained a really high... Microsoft trained a really high-quality small model on high-quality data, you know? And the small model did not need the kind of big interconnected cluster. You can train it on a pretty small cluster. However, it was still, you know, a non-trivial endeavor because they had to curate and obtain this kind of high-quality data. And so one of the things I think you're seeing is for these models, like the Llama 3.1 8B and 70B variants, those models are really small, but they're extremely smart. They're smarter than, you know, much, much larger systems, you know, like the prior generation from OpenAI. And the way that they trained the Llama 3.1 8B and 70B looks a little bit different. So what they did was they did... They generated a ton of synthetic data, it seems, with Llama 3.1 405B, and they distilled that larger model into the 70B and 8B variants, right? 
And so they got a very, very high-quality small variant. Another example is AlphaCode 2, which was able to achieve extremely high code proficiency and win competitions with really, really small language models. But what they did was they called the model a million times for every question. That's an embarrassingly parallelizable workload. You can scale it horizontally infinitely. They called it a million times per query and then had a nice, kind of pretty elegant regime, um, to filter down to the top 10 responses, which they then tried one by one. And so this is basically what they did. They generated a million candidate responses and basically filtered down to the best one as a way to solve coding, so to speak. Um, so that's pretty interesting, and I think you're seeing that type of regime a bit more and more. You know, same with AlphaGeometry, a really powerful system that was just announced recently. You know, it won, you know, the silver medal level in, you know, in the IMO, um, broadly. Not just geometry, for a broader class of problems. And, you know, these are kind of compound systems with major kind of synthetic data generation pieces, and so I think you're seeing people kind of move computation around, interpolate between training and inference for example, to make the best use of the infra they have. And this is actually a kind of a funny, I think, reframing of the scaling laws that we hold near and dear as an ecosystem. I think that, you know, the Chinchilla paper scaling laws that DeepMind uncovered have fueled a lot of the scaling effort. But actually one kind of funny way of looking at those results is they show that if you want to make a monolithic model as smart as possible, there is an ideal way to distribute parameters, compute, and basically training iterations. One funny thing, though, that I think some people have done, like Mistral, is actually choose to maybe inefficiently train a small model to be smarter than it should be, wasting money, but then that small model is actually really cheap to inference because it's small, um, and it's way smarter than it should be for its size. And so I think people are getting more sophisticated at thinking about cost in more of a lifecycle way, and that's actually leading to the workload shifting from large pre-training more and more towards things like batch inference, uh, which is actually a really, really horizontally, you know, scalable workload that you can parallelize. You don't need interconnection the same way. You don't even need state-of-the-art systems in the same way. And I think you're seeing that type of workload maybe grow in prominence. And so just to unpack one more statement on that, one thing you can do that people are doing sometimes is you unroll the current state-of-the-art model many, many times, basically doing chain of thought on the current state-of-the-art model. And then you take what required six steps with the previous state of the art, and that becomes your training data for the next state of the art, right? That type of approach to generating data, and then filtering it down to high-quality examples and then training models on it, looks very different than just throwing more poor-quality data into a massive supercomputer to get the next generation.
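
One way to see the "lifecycle cost" framing described above is with the standard rough approximations of ~6ND FLOPs to train an N-parameter model on D tokens and ~2N FLOPs per token served at inference. The sketch below uses those approximations; the token counts are hypothetical, illustrative numbers, not figures from any real model.

```python
def lifecycle_flops(params: float, train_tokens: float, served_tokens: float) -> float:
    """Rough total compute over a model's life: ~6ND to train, ~2N per token to serve."""
    train = 6 * params * train_tokens
    inference = 2 * params * served_tokens
    return train + inference

served = 1e13   # hypothetical lifetime inference volume: 10 trillion tokens

roughly_chinchilla_70b = lifecycle_flops(70e9, 1.4e12, served)   # "compute-optimal" large model
overtrained_8b = lifecycle_flops(8e9, 15e12, served)             # small model, trained far past optimal

print(f"{roughly_chinchilla_70b:.2e} vs {overtrained_8b:.2e}")
# The small model "wastes" training compute per point of quality, but once inference volume
# dominates, its total lifecycle cost can come out lower -- hence the shift toward batch inference.
```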

    4. EG

      Yeah. It seems like that bootstrapping is really sort of under-discussed: as you hit a certain threshold of model, the rate at which you can improve the next model just kind of accelerates.

  10. 36:28–40:27

    Compound systems design

    1. EG

      Um, o- one other thing that would be great to cover, I know we only have a couple minutes left of your time, um, is, uh, the recent paper that you authored, um, which I thought was super interesting around compound AI system design and sort of, uh, related topics to that. So would you mind telling us a little bit about that paper and sort of what-

    2. JD

      ... uh, what you all showed. So I think it's kind of in this regime that we were just talking about, where more and more often, to kind of g- go beyond the capability frontier accessible to today's state-of-the-art models, and kind of get GPT-5 or GPT-6 early, practitioners are starting to do these things, oftentimes implicitly, where they'll call the current state-of-the-art model many, many times. There are many scenarios where maybe you're willing to expend a bit of a higher budget, maybe it's code or something, and if I said that I can give you a 10% better model, you know, for code, many developers might pay 10X for access to that. Instead of $20 a month, they might be very willing to pay $200 a month, right, for obvious reasons. Um, and so there's a question of, what do you do in that setting? And so, if you're willing to call the model many times, you can compose those many calls into almost a network of network calls, right? And, you know, I guess one of the questions is, how then should you compose these networks of networks? What principles should guide their architecture? We kind of know how to construct neural networks, but we haven't yet elucidated the principles for how to construct networks of networks, so to speak. Um, these compound AI systems, so to speak, where you have many, many calls, maybe external components. And so one principle that we started to explore was, maybe one thing you can probe to figure out how to compose these calls, or whether composing many calls will help you, is how verifiable the problem is. And if it's verifiable, you can actually bootstrap your way to really high performance. So what does this mean? Well, verifiable means that it's kind of easier to check an answer than it is to generate an answer, and there are a lot of cases where this is true. Most software engineering and computing tasks kind of classically have this property. You know, we looked at things like prime factorization or a lot of math tasks. Classically, it can take someone years of suffering to write a proof, and you can, like, read the proof in a couple of hours, right? I think we've all, you know, had that experience, y- you know, with some training. And so there are many examples like this, and so one thing you can do is you can have models, embarrassingly parallel, uh, you know, horizontally scaled out, generate many, many candidate responses, and then relatively cheaply check those candidate responses and kind of do a best-of-K type of approach. And it turns out the judge model or the verifier might actually have a lot higher accuracy at selecting the best candidate response from the set. And so you can kind of repeat this as a, you know, as a pro- as a procedure to actually bootstrap your way to really, really high performance, um, in many cases. And so, you know, we did kind of some preliminary investigations here, and we were able to, in one case of prime factorization, you know, kind of 10X the performance, go from 3.7% to 36.6%, um, on prime factorization, which is pretty hard, you know, taking a number that's a composite of two three-digit primes and factoring it into the constituent primes. It's kind of a classic problem that pops up a lot in cryptography. 
And then also, we looked at subjects in the MMLU and found that for kind of the subjects you would expect, math, physics, electrical engineering, this type of approach was really helpful. Um, now, we used language models, but it doesn't have to be a language model. It could be a simulator. These could be unit tests as your verifier, et cetera. But we think this type of approach kind of points towards maybe a very different paradigm for getting better performance, um, than just kind of scaling the models and doing a whole new pre-training from scratch. The MMLU performance bump was about 3%, and to put that in perspective, the gap between some of the previous best models is often less than 1%, between, for example, Gemini 1.5 and Llama 3.1 and things like that. So actually 2.8% or 3% is actually a pretty major gap on MMLU. So pretty intriguing, and yeah, I think a lot of practitioners are hopefully gonna explore this setting a lot more.
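
A toy, self-contained sketch of the generate-then-verify loop described above, using the prime-factorization task as the example. The `propose_factors` sampler is a random stand-in for a language model call (so its hit rate is meaningless); the point is the shape of the idea: candidates are generated in an embarrassingly parallel way, and each one is cheap to check by multiplying back.

```python
import random

def propose_factors(n: int) -> tuple[int, int]:
    """Stand-in for one model call: guess a factor pair of n (usually wrong)."""
    a = random.randrange(2, int(n ** 0.5) + 2)
    return a, n // a

def verify(n: int, a: int, b: int) -> bool:
    """Checking is far cheaper than generating: just multiply back."""
    return a * b == n and a > 1 and b > 1

def best_of_k(n: int, k: int = 10_000) -> tuple[int, int] | None:
    # Each candidate is independent, so the generation step parallelizes horizontally.
    for _ in range(k):
        a, b = propose_factors(n)
        if verify(n, a, b):
            return a, b
    return None

print(best_of_k(101 * 113))    # factor a composite of two three-digit primes; more samples -> higher hit rate
```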

    3. SG

      It's a, it's a super cool paper. Do you have any creative ideas about how you could apply some of the ideas here to improve performance on more open-ended tasks?

  11. 40:27–42:41

    Improving open-ended tasks

    1. JD

      In many ways, we're not originating these. I think some of these are baked largely into systems like AlphaCode and, uh, AlphaGeometry already. You know, I was pretty inspired to see the AlphaGeometry results, um, you know, recently as well. Yeah, I think that what you'll see people doing is kind of composing... This will s- this sounds funny, but massive networks where maybe each stage in the network will basically be some best-of-K component, with many, many calls to different language models. You know, Claude, Gemini, GPT-4, each with their own, you know, spikes in terms of capabilities, and you'll kind of throw multiple of them at questions in many cases, and then kind of choose the best response. You might, you know, also ensemble that with other components, like classical heuristic-based systems and, you know, simulators, et cetera, and kind of compose large networks that may make millions of calls, um, to answer a question. And I think that type of approach, it sounds kind of farcical right now, but I think it'll seem common sense, um, pretty soon. You know, we think it's a really interesting approach and we've seen a lot of interesting evidence that we'll, um, speak more about pretty soon, um, for things like code generation, um, and agentic tasks in the code regime, you know, for things like design, chip design, you know, for things like actual neural network design, um, or network of network design even, funnily enough, in recursive ways. Um, so it's actually a really good... You know, it turns out a lot of these problems that we care about have that property where they're verifiable, and you can compose these systems and bootstrap your way to, you know, much higher performance than people might have imagined. Um, so it seems pretty applicable downstream. Uh, but there are a lot of open questions, a lot of work to do further, and I think, you know, part of our hope is that the community will explore this more, and that these types of workloads that are a bit more parallelizable will become more and more common. There'll be a lot more batch inference, a lot more synthetic data generation, um, and, you know, you won't necessarily need the big interconnected cluster that maybe only OpenAI can afford, um, to do kind of cutting-edge work in the future.

    2. SG

      Yeah, a really cool set of ideas, uh, and overall a great conversation. Thanks so much for doing this, Jared.

    3. JD

      No, thank you Sarah, and thank you Elad. Great to see you. Yeah, great to see you too.

    4. SG

      Find us on Twitter at No Priors Pod. Subscribe to our YouTube channel if you wanna see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.

Episode duration: 42:41
