The Twenty Minute VC

Steeve Morin: Why Google Will Win the AI Arms Race & OpenAI Will Not | E1262

Steeve Morin is the Founder & CEO @ ZML, a next-generation inference engine enabling peak performance on a wide range of chips. Prior to founding ZML, Steeve was the VP Engineering at Zenly for 7 years, leading engineering to millions of users and an acquisition by Snap.

In Today's Episode We Discuss:

(00:00) Intro
(00:59) How Will Inference Change and Evolve Over the Next 5 Years
(06:24) Challenges and Innovations in AI Hardware
(14:07) The Economics of AI Compute
(16:57) Training vs. Inference: Infrastructure Needs
(24:56) The Future of AI Chips and Market Dynamics
(36:25) Nvidia's Market Position and Competitors
(40:47) Challenges of Incremental Gains in the Market
(41:39) The Zero Buy-In Strategy
(42:18) Switching Between Compute Providers
(43:23) The Importance of a Top-Down Strategy for Microsoft and Google
(44:49) Microsoft's Strategy with AMD
(49:35) Data Center Investments and Training
(52:20) How to Succeed in AI: The Triangle of Products, Data, and Compute
(52:48) Scaling Laws and Model Efficiency
(54:34) Future of AI Models and Architectures
(01:03:38) Retrieval Augmented Generation (RAG)
(01:07:51) Why OpenAI's Position is Not as Strong as People Think
(01:15:20) Challenges in AI Hardware Supply

Steeve Morin (guest), Harry Stebbings (host)
Feb 24, 2025 | 1h 18m | Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–0:59

    Intro

    1. SM

      The thing with NVIDIA is that they spend a lot of energy making you care about stuff you shouldn't care about. And they were very successful. Like, who gives a shit about CUDA? OpenAI is amazing, but it's not their compute. Ultimately, if you don't own your compute, you're starting with, you know, something at your ankle. In five years, I would say 95% inference, 5% training. You have the products, the data, and the compute. Who has all three? Google has, like, you know, Android, Google Docs, whatever. They have everything. They can sprinkle everywhere. This is the sleeping giant in my mind.

    2. HS

      Ready to go? Steve, dude, I am so grateful to you for joining me today. I've wanted to make this one happen for a while, but when we were discussing who'd be best for this topic, I was like, "We've got to have Steve on." So thank you for joining me today, man.

    3. SM

      Man, well, well thank you. I feel, uh, humbled. I appreciate it. Thank you.

    4. HS

      Dude, I want

  2. 0:59–6:24

    How Will Inference Change and Evolve Over the Next 5 Years

    1. HS

      to start. Can you just give us a quick overview of ZML, and specifically your role in the infrastructure strategy today and where you sit?

    2. SM

      At the very bottom of things, um, ZML is an ML framework that runs any model on any hardware. Uh, and it does so without compromise. So we sit ultimately at, at the, um, at the infrastructure layer. Uh, we enable anybody to run their model better, faster, more reliably, uh, but on any compute whatsoever. It doesn't really matter, it could be NVIDIA, it could be AMD, it could be TPU, and whatnot, uh, and we do all that without compromise. That's the key point, because if there's a compromise, then it's not really, you know, agnostic, right?

    3. HS

      Can I ask you then, if we think about sitting between any model and any provider, there in terms-

    4. SM

      Right.

    5. HS

      ... of AMD, NVIDIA-

    6. SM

      Right.

    7. HS

      ... do you think then we will be existing in a world where people are using multiple models simultaneously and that is-

    8. SM

      Yeah.

    9. HS

      ... concurrently running?

    10. SM

      Yes, um, you, you actually can see it. It's been happening for a while. Models now are, are not the right abstractions. At least if you look at closed-source models, they're not really models. They're more like backends, right? Uh, and there are a lot of tricks that you feel like you're talking to one model, but ultimately you're talking to a constellation, an assembly of backends that produces, you know, a response. Uh, probably the number one, you know, I would say obvious thing, would be that if you ask a model to generate an image, then it will, you know, switch to a diffusion model, right, not an LLM. So, and there's many, many more tricks, the turbo models at OpenAI do that. There's a lot of tricks. So definitely, uh, models in the sense of, you know, getting, you know, weights and running them is something that is ultimately going away, you know, in favor of, like, full-blown backends, right? You feel like you're talking to a model, but ultimately you're talking to an API. The thing is that API will be running locally, right? Or locally I mean in your own, you know, uh, cloud, you know, instances and so on.
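A hypothetical sketch of the "constellation of backends" idea he describes: one entry point that routes each request to a specialized backend. Every name below is invented for illustration; real systems route with trained classifiers, not keyword matching.

```python
# Hypothetical sketch: one API entry point dispatching each request
# to a specialized backend (LLM, diffusion model, etc.). All names
# here are illustrative, not any real product's API.

def generate_image(prompt: str) -> str:
    return f"<image for: {prompt}>"          # stand-in for a diffusion model

def generate_text(prompt: str) -> str:
    return f"<text answer to: {prompt}>"     # stand-in for an LLM

def route(prompt: str) -> str:
    """Pick a backend; real systems use a classifier, not keywords."""
    if any(w in prompt.lower() for w in ("draw", "image", "picture")):
        return generate_image(prompt)
    return generate_text(prompt)

print(route("Draw me a picture of a cat"))  # goes to the diffusion backend
print(route("Summarize this article"))      # goes to the LLM backend
```

The user sees a single "model", but the response comes from whichever backend the router chose.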

    11. HS

      Okay. So we will have a world where we're switching between models and there's kind of this kind of trick-

    12. SM

      Oh, yeah.

    13. HS

      ... trickery around-

    14. SM

      Absolutely.

    15. HS

      Okay, perfect. So we've got that at the top, then we've got ZML-

    16. SM

      (laughs)

    17. HS

      ... in the middle, and then you said, "And then on any hardware."

    18. SM

      Yeah.

    19. HS

      So will we be using multiple hardware providers at the same time or will we be more rigid in our hardware usage?

    20. SM

      No, absolutely. You can get, like, probably an order of magnitude more efficiency depending on the hardware you run on. That is substantial. Um, not a lot of people have that problem at the moment, um, because, you know, things are getting built as we speak. Um, but, you know, a simple example is if you, you know, switch from NVIDIA to AMD, uh, on a 70B model, you can get four times better efficiency, right, in, in terms of spend. So that is substantial. (laughs) That is very much substantial. Now the problem is getting some AMD GPUs, right?

    21. HS

      I'm really sorry. If there is such a cost efficiency, four times, why does everyone not do that?

    22. SM

      So there's a few reasons. The, probably the most important one is the PyTorch-CUDA, um, I would say, duo, and that's very, very hard to break. These two are, these two are very much intertwined. And if you-

    23. HS

      Can you just explain to us what, what PyTorch and CUDA are?

    24. SM

      Oh, yes, absolutely. Yeah, yeah. Um, PyTorch is the, uh, ML framework that people use to build, actually train models, right? You can do inference with it, but by far the most, um, successful, uh, framework for training is, is PyTorch. And PyTorch was very much built on top of CUDA, which is the, um, NVIDIA software, right? Uh, and the way, uh, let's just say the strengths of PyTorch make it ultimately, um, very, very bound to CUDA. So of course it runs on, you know, it runs on AMD, it runs on, you know, even Apple and so on, but there were always, you know, the, the tens of little details that don't exactly run like, you know, you would expect, and there's work involved, but then also there's supply. Um, so probably that's the number one thing. The second thing is there's a lot of, um, GPUs on the market, and all of them, pretty much all of them are, are NVIDIA. The reason being that if you think, you know, in layers and you say, "All right, I'm going to buy, let's say GPUs, and I'm going to sell them to folks to maybe not even do training, right, just do inference," then most likely if you look at it that way, you'll end up buying NVIDIA, because everybody will want to run on NVIDIA, because nobody knows really how to do whatever, and they've trained on NVIDIA so they're like, "I can just reuse my code," and so on. Um, so there's, like, this self-perpetuating, you know, uh, circle of people just buy NVIDIA because they want to resell, and people just use NVIDIA because it's there, right?... um, but it's by far not the, (clears throat) not the most efficient, uh, platform. And arguably, even in terms of software, it's not the best software, uh, platform. So, that is probably two of the most... um, I would, I, I'd wager the most important

  3. 6:24–14:07

    Challenges and Innovations in AI Hardware

    1. SM

      reasons.

    2. HS

      Can I s- before, uh, you know, uh, we were chatting about NVIDIA-

    3. SM

      Right.

    4. HS

      ... and, uh, AMD when DeepSeek obviously happened and the stock crash-

    5. SM

      Right.

    6. HS

      ... that happened. Wh- why did NVIDIA rebound, do you think, in a way that AMD didn't?

    7. SM

      Because the chips are there. Uh, so (sighs) there's a lot of things. But in, in my opinion, there's g- there's al- there's going to be a need for inference. Very hard to say whether it will be worth, you know, everybody's money to do it on H100. That is a, a bubble that I think will blow sometime. Uh, I'm kind of afraid of that, to be honest.

    8. HS

      Wh- why do you think that's a bubble that will blow sometime? Why is that not legitimate?

    9. SM

      Because, um, it was built on the A100 mo- uh, I would say, financial model, which was: at generation zero, we do training. Uh, but when it's last generation, we do inference. And it worked beautifully, right? Uh, for A100. Then H100 comes along and inference is, it's worth five times the price, and, and it maybe runs twice, uh, in terms of performance, on inference, that is. On training it's a lot better, but on inference, it's like maybe twice as fast. When it, actually when it came out, it ran at the same speed as the A100. So there's a money gap that's going to have to, you know, be bridged sometime, right? And the, the part that worries me is that I see, you know, amortization plans on, in like, you know, six, seven years, with the GPUs as the collateral. And I'm like, "Well, I'm not sure how it's going to work." Because, you know, they're worth, at least when they came out, they were worth five times the price, and they're just two times, you know, faster. So something's going to (laughs) ... something has got to give. Um-
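The economics he is describing fit in one small calculation; the 5x price and roughly 2x inference-speed figures are the ones quoted above, with normalized units standing in for real prices.

```python
# Back-of-envelope version of the A100 -> H100 inference economics
# described above. The 5x price and 2x inference speed figures are
# from the conversation; absolute units are normalized.

a100_price = 1.0          # normalized
h100_price = 5.0          # "worth five times the price"
a100_inference = 1.0      # normalized throughput
h100_inference = 2.0      # "maybe twice as fast" on inference

perf_per_dollar_a100 = a100_inference / a100_price
perf_per_dollar_h100 = h100_inference / h100_price
print(perf_per_dollar_h100 / perf_per_dollar_a100)  # 0.4: a 60% regression
# That 0.4x is the "money gap" that amortization plans using the
# GPUs as collateral quietly have to bridge.
```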

    10. HS

      Is speed of development trumping chip development speeds, where it's now becoming a real problem, where as we say, models are far outpacing-

    11. SM

      Right.

    12. HS

      ... the speed of chip deployment?

    13. SM

      Um, not much, ultimately. Um, the two things that could really m- very much shake, uh, the industry, the chip industry, in my opinion is, uh, are agents and, uh, reasoning, uh-

    14. HS

      Go for it. I'm-

    15. SM

      ... because (laughs) -

    16. HS

      Wh- what, wh- Number one, agents. Why does that change things?

    17. SM

      To be honest, this is, this is where we... I think this is where NVIDIA can be, uh, can be attacked. I mean, why agents and why reasoning? The, the, the difference is for agents and reasoning, you need to wait until the end of the request to get whatever it is you came for, right? So you're not... you don't really care about the speed at which the text, you know, outputs, which is what you want in a chat, right? Um, you only care about, "How much time does it take between the beginning of my request and the end?" And so, that fundamentally changes the, uh, the incentives from throughput-bound to latency-bound. And so GPUs, if... let's say you're, you're running a GPU at, let's say, 10,000 tokens per second, um, you very much like to do it, you know, 100 times 100, right? Um, and they can do that, but they cannot do... they cannot give you 10,000 tokens per second only on you, right? Per stream, as we say. But in terms of, you know, agents or reasoning, this is exactly what you want. Because you don't wanna wait, like, you know, 50 seconds for whatever thinking, right? Uh, and agents, it's, it's the same. So these two, I think, are the shot that might, you know, at least make NVIDIA change its course with respect to chips. I mean, they're not idiots, right?
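To make the throughput-versus-latency point concrete, here is a quick back-of-envelope sketch; the 10,000 tokens per second and the ~50-second wait are the figures from the answer above, while the 100-stream batch and token counts are illustrative assumptions.

```python
# Why agents and reasoning flip the incentive from throughput-bound
# to latency-bound, in numbers. Figures echo the answer above and
# are illustrative, not measurements.

aggregate_tps = 10_000     # what the GPU delivers across all users
concurrent_streams = 100   # batched together to reach that number
per_stream_tps = aggregate_tps / concurrent_streams  # 100 tok/s each

chat_reply = 300           # tokens a user reads as they stream in
reasoning_trace = 5_000    # tokens an agent must finish before answering

print(chat_reply / per_stream_tps)        # 3 s: fine for chat
print(reasoning_trace / per_stream_tps)   # 50 s: the wait he objects to
# An agent only cares about request-start to request-end, so what it
# needs is high tokens/s *per stream*, not high aggregate throughput.
```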

    18. HS

      How should agents change NVIDIA's strategy?

    19. SM

      Because they're, they're, they're the same. They are bound by the, the latency of the beginning of the request and the end of the response. So you don't want your agents to take, you know, 30 seconds to generate, you know, whatever its response will be, right? Uh, especially if you pipeline them.

    20. HS

      So how should, how, how should they change then their response? Like, how should they change what-

    21. SM

      Oh, you mean NVIDIA?

    22. HS

      Yeah.

    23. SM

      (tuts) Hard to say, 'cause NVIDIA is, has a very, very vertical approach. Uh, they do more of, of more, right? (laughs) Um, like if you look at Blackwell, it's actually crazy, the, what they did. Uh, for Blackwell, they assembled two chips, uh, but the surface was so big that the chips started to, you know, uh, uh, wave. Like, I don't know the, the English word, but like, you know, started to, to bend a bit, which further perpetuated the problem, because it then didn't make contact with the, with the heat sink and so on. So they are very much at, you know, the power envelope. They push it to 1,000 watts, it requires liquid cooling and so on. So they are very much in a very vertical, uh, you know, foot to the, you know, uh, foot to the pedal, um, in terms of GPU scaling. But the thing is, GPUs are, you know, are a good trick for AI, but they're not built for AI. It's not a specialized chip. It is a specialization of a GPU, but it is not, you know, an AI, you know, chip.

    24. HS

      Well, I, I... Forgive me for continuously asking stupid questions.

    25. SM

      No, no, no, please.

    26. HS

      Why, why, why are GPUs not built for AI? And if not, what is better?

    27. SM

      So the, the way it worked is that a screen is... You can think of a screen as a matrix, right? And if you have to render, you know, pixels on a screen, there's a lot of pixels and everything has to happen in parallel, right? So that you don't waste time. Uh, turns out, you know, matrices are, uh, are a very (laughs) important thing in AI. So, there was this cool trick, that was like probably 20 years ago, we would trick the GPU into believing it was doing graphics rendering, where actually we were making it do parallel work, right? It was called GPGPU at the time, right? So it was always a cool trick. ... right? And very cool and very successful at that, mind you. Um, but it was not dedicated for this. Uh, the pioneers probably were, uh, of course Google with TPUs, uh, which are very... much more advanced, uh, on the architectural level. But essentially, the way they work, um, it kind of works for, for AI, but for LLMs that starts to, you know, to crack, because they're so big and there's a lot of memory transfers and so on. Actually, that's why Groq achieves, uh... Not Grok, but, like, Groq, Cerebras, and all these folks, uh, they achieve very high performance single stream, is because the data is right in the chip, they don't have to get it from memory, which is slow, which a GPU has to do. So there's a lot of these things that ultimately make it a good trick, um, but not, I would say, a dedicated solution per se. Um, that said though, the reason probably NVIDIA won, at least in the training space, is because of Mellanox, right? Not because of the raw compute. Because you need to run, you know, lots of GPUs in parallel. So the interconnect between them is ultimately what matters, right? So how fast can they exchange data? Because remember, when you do a matrix multiplication, let's say, you read, you know... The, the matrix is read, like, hundreds of times during the multiplication, so there's a lot of transfers going on. Um, and so far, Mellanox would... You know, InfiniBand had the, the best technology, so that's why, you know, a lot of people... And when you do training, by the way, it is the name of the game, the interconnect. When you do inference, meh, not so much. You don't care when you do inference.

  4. 14:07–16:57

    The Economics of AI Compute

    1. SM

      (laughs)

    2. HS

      Before, before we move to inference, I, I do just want us to stay on chips and just say, okay-

    3. SM

      Right.

    4. HS

      ... so we have TPUs, we have NVIDIA, we have, um, AMD.

    5. SM

      Right.

    6. HS

      Is this... In terms of distribution of gains, is this a winner take all market? Is this cloud where you have several providers who are dominant?

    7. SM

      Mm.

    8. HS

      What does the distribution of gains look like in the chip market today?

    9. SM

      So y- I would divide it in two categories. Well, three categories. Um, the GPUs you can buy or rent, uh, the dedicated chips you can rent, and the dedicated chips you can buy. This is how the market is structured today, right? Right now, if you want to go dedicated, a- at least in the cloud, there's two options: TPUs and Trainium. Uh, TPUs on Google, Trainium on Amazon. Uh, so these are, you know, available chips. You can rent them today. If you want to buy, um, GPUs or rent GPUs, you know, they're GPUs. We, we, we know them all too well. And there's this new wave of com- of computing, which are dedicated, you know, chips you can actually buy. The Tenstorrent, the Etched, the VSORA. So I think it will be a mix of, you know, whoever, you know, whatever you get. For instance, let's say you are in Google Cloud, of course you don't want to do NVIDIA. You'll get ripped off. Because here's the, here's the dirty secret, is that NVIDIA... Like, TSMC sells you at 60% margin, NVIDIA sells you at, you know, 90% margin. And on top of that there's Amazon that takes, let's say, a 30% margin. So you are a very thin crust on a very big cake. And so that's why, to me, it's, it's, it's not really... It's a bit of a losing game if you, you know, go all in on one provider. You want, you know, optionality.
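Here is the margin-stacking arithmetic behind the "thin crust on a very big cake" line, as a minimal sketch; the 60%, 90%, and 30% gross margins are the figures he quotes, and the base silicon cost is an arbitrary illustrative number.

```python
# Margin stacking from fab to rented compute, using the gross-margin
# figures quoted above. The base cost is illustrative.

def sell_price(cost: float, gross_margin: float) -> float:
    """Price such that margin = (price - cost) / price."""
    return cost / (1.0 - gross_margin)

silicon_cost = 100.0                              # illustrative fab cost
tsmc_price   = sell_price(silicon_cost, 0.60)     # 250
nvidia_price = sell_price(tsmc_price, 0.90)       # 2500
cloud_price  = sell_price(nvidia_price, 0.30)     # ~3571

print(cloud_price / silicon_cost)  # ~36x from silicon to rented compute
# Whatever profit you make on top of that rented price is the
# "very thin crust" he is describing.
```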

    10. HS

      With increasing competitiveness within each of those layers, do we not see margin reduction?

    11. SM

      Absolutely, yes. Yeah, yeah, yeah. But here's the, here's the problem though. So let's say you are on Google Cloud and you're on TPUs, right? Suddenly you just removed that 90% chunk on, on, on the, you know, on the spend. Uh, the problem is, is that for multiple software reasons, which are, you know, which we are solving, um, at ZML, they're not really, uh, I would say, a commercial success. They are very much successful inside of Google, but not much outside of Google, let's say, right? Uh, Amazon, same, is pushing very, very hard for their, uh, you know, Trainium chips. So my... I would say the future I see is that you use whatever, you know, your provider has, because you don't want to pay, you know, a 90% outrageous margin, uh, and try to make, you know, a profit out of that.

  5. 16:57–24:56

    Training vs. Inference: Infrastructure Needs

    1. SM

    2. HS

      Totally get you there. Okay, so when we move to actually inference and training-

    3. SM

      Right.

    4. HS

      I, I mean, everyone's focused so much on training. I'd, I'd love to understand, what are the fundamental differences in infrastructure needs when we think about training versus inference?

    5. SM

      These two obey fundamentally different, I would say, tectonic forces, if you will. So in training, more is better. You want more of everything, essentially. Uh, and you... And the, the recipe for success is the speed of iteration. You change stuff, you'd see how it works, and you do it again. Hopefully it converges, and it's like, you know, changing the wheel of a, of a moving car, so to speak. Uh, some training runs are, that is. Uh, so that is training. On inference, this is a complete reverse. Less is better. Uh, you want less headaches, you don't want to be woken up at night because inference is production, right? You could say that training is research and inference is production, and it's fundamentally different. In terms of infra, the... probably the number one thing that is... the number one difference between these two is the need for interconnect. So if you do, you know, production, you... if you can avoid to have interconnect between, you know, let's say a cluster of GPUs, of course you will not go... you, you will, you know, avoid that, right? If you can. And this is why models...... have the sizes they have. It's so that people can run them without the need to connect multiple machines together. It, it's, it's very constraining in terms of the environment. So that is probably the fundamental diff- difference, the need for interconnect. And number two is, ultimately, do you really care about what your model is running on as long as it's outputting whatever you want it to output? So, yeah.

    6. HS

      Can you just help me understand, sorry, why is training more is more and that's great, and in inference less is more? Why do we have that difference?

    7. SM

      Yeah. Think of it, think of it like, you know, doing a painting and doing a million paintings, right?

    8. HS

      Mm-hmm.

    9. SM

      The tools you will use, the process you will do, if you do one painting what you favor is the speed at which you can do a stroke and do some iteration. If you do a million, what you want is a process, a process that is reliable that can deliver you efficiently a million paintings, right? Um, so that is the same for, for, for training versus inference. Um, if you wanna run, you know, millions of instances of a model, you cannot, you know, hack your way to do that. By the way, people do hack their way, (laughs) uh, today, um, but this is probably the fundamental difference.

    10. HS

      How do people then put inference in production today? You know, we've seen with training, that's really where NVIDIA have dominated so heavily.

    11. SM

      Right.

    12. HS

      How do people put inference in production?

    13. SM

      Uh-huh, (laughs) there's a lot of duct tape. Um, so here's also probably one of the problems, is that training on first principles is actually two passes, forward and backward, right? It's called forward pass and backward pass, right? Inference is running only the forward pass. So that's how things are today, um, uh, mostly. Uh, there- there are people who are spec- trying to specialize a bit, um, because, you know, at some point duct tape doesn't really work out, and when you're at big scale, that becomes a problem. And it's a bi- and it's a problem that's growing, because a lot of people are coming on the market with needs for inference. That wasn't the case, you know, a year and a half ago, or a year ago. Uh, OpenAI had this problem, right? Maybe Anthropic had this problem. But it wasn't a universal problem yet, and now it's becoming a universal problem.
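A minimal PyTorch sketch of the forward/backward distinction he describes; the model and data here are toys, purely to show which pass inference skips.

```python
# Training runs a forward *and* a backward pass; inference runs the
# forward pass only. Toy model, illustrative data.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
x = torch.randn(8, 16)
target = torch.randn(8, 4)

# Training step: forward pass, then backward pass to get gradients.
loss = nn.functional.mse_loss(model(x), target)
loss.backward()                      # the extra pass inference never needs

# Inference: forward pass only, no gradients kept at all.
with torch.no_grad():
    prediction = model(x)
# Much serving today is effectively "training code running only the
# forward pass" -- the duct tape he mentions -- rather than
# purpose-built inference infrastructure.
```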

    14. HS

      Can you- can you articulate what problem did OpenAI and Anthropic have with regards to inference?

    15. SM

      So for instance, probably the number one thing, um, depending on how you deploy, but if you're deploying inference, the number one thing that will get you is what's called auto-scaling. So as your, you know, systems get more and more loaded, you want to provision, because, you know, these things are tremendously expensive. You want to provision them as you scale, right? So you don't wanna say, "I have 1,000 GPUs, you know, 24 hours, even, like, if there's, like, nobody on production, I will pay for them." Which is, mind you, what people are doing today. This is crazy. Um, so what you wanna do is, you want to, you know, provision compute as you grow your needs, right? And you wanna do it up and you wanna do it down. Um, and that's number- probably the number one thing that, you know, gives you a lot of efficiency, uh, in terms of spend. Like, we're talking, you know, multiples, like, you know, five, you know, sometimes 10X, you know, improvement. The thing is, this is a problem, at least in, I would say, regular backend engineering, this is a problem everybody knows, right? Everybody's doing it because the- the- the savings are so huge, um, but in AI, nobody really had the problem, so now they're coming up to it. So this is one example.
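A rough sketch of where the auto-scaling savings come from; the hourly price and the daily load curve are illustrative assumptions, not his figures.

```python
# Always-on fleet vs. provisioning that follows a daily traffic curve.
# Price and load profile are illustrative.

hourly_price = 2.0        # $/GPU-hour, illustrative
peak_gpus = 1_000

# A toy daily load profile as a fraction of peak, one value per hour.
load = [0.1]*6 + [0.4]*4 + [0.9]*6 + [0.6]*4 + [0.2]*4

always_on = peak_gpus * hourly_price * 24
autoscaled = sum(peak_gpus * f * hourly_price for f in load)

print(always_on / autoscaled)  # ~2.2x on this mild curve
# Spikier, burstier traffic widens the gap toward the 5-10x
# improvement he cites.
```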

    16. HS

      So- so the problem is that they're not doing provisioning, they're paying a shit ton more-

    17. SM

      (laughs)

    18. HS

      ... because they are fully in production all the time versus provisioned as needed?

    19. SM

      That's one example, yeah. Uh, another one is choosing the right compute. Uh, it's, like, kind of, I would say, a vicious circle, because provisioning compute is very hard. So if you lose compute, it's very bad. So you are essentially incentivized to overbuy, so in- in the case of, you know, uh, Amazon or Google, that would be buying reserved compute, which you're not gonna use, because if you buy it on demand, you will get tremendously ripped off. So that creates this, like, fake scarcity of compute, because people buy preemptively, because it's a shit ton of money, um, and they're not using it, right? (laughs) So this is a major problem too.

    20. HS

      When you buy compute preemptively, does it not become outdated by the time you use it, though?

    21. SM

      It might well be, yes. Uh, I mean, judging at the pace, it might well be now. We have a bit of a, um, we are being spared a bit because Blackwell is late and all the others are getting canceled, and, uh, so H-Series, I would say, are still, you know, active. But yes, absolutely. But, you know, what choice do you have? (laughs) This is the thing.

    22. HS

      Will- will we have a moment in time where there is this massive overhang or oversupply of compute which we proactively bought ahead of time, but then actually the hyperscalers go, "We'd rather just burn it and buy fresh, and we have the money to do that"?

    23. SM

      So, um, I might tell you that I think they already started, uh, 'cause I'm- I'm getting cold emails for, you know, discounts, you know, from services I never heard about. Uh, and- and it- and I started getting these emails probably around October, November. So some people are left with a lot of capex that they don't know what to do with. It's very hard, you know, it's a different thing to- to build a cluster and do a training run than it is to build literally a cloud, you know, provider, right? Uh, or hyperscaler or, you know, whatever you wanna call it. So there are a lot of people who do their training runs on, uh, the regular, you know, providers, but then move to a regular hyperscaler when they do production. So I very much worry there will be, um, an oversupply of these chips. The problem is, is that, you know, remember... the chips are the collateral. So, you know, somewhere, you know, in the US or whatever, there's going to be a data center with, like, 1,000 GPUs that people may buy at, you know, 30 cents on the dollar, you know. Uh, I don't know, but, um, this is what I, you know, this is what might happen.

  6. 24:56–36:25

    The Future of AI Chips and Market Dynamics

    1. SM

    2. HS

      What is the timeframe for that might happening?

    3. SM

      Probably this year.

    4. HS

      Jensen has made it very clear that inference opens up more revenue opportunity for NVIDIA. He said that 40% of their revenues today comes from inference.

    5. SM

      Right.

    6. HS

      To what extent is that correct? Or actually, as Jonathan at Groq said on the show-

    7. SM

      (laughs)

    8. HS

      ... you know, "NVIDIA is not meant for inference. Definitely not. And actually, that market won't be won by NVIDIA."

    9. SM

      I mean, technically speaking, he's right, but realistically speaking, I'm not sure I agree. The thing is, these chips are on the market. They're here. I can, you know, Alt + Tab on Chrome and get one. That is something that, you know, I don't take lightly. Um, uh, availability, that is, right? So I think NVIDIA is going to stay, uh, at least if not for the H100, you know, bubble bust, uh, because these chips are going to be on the market, and people will buy them and do inference, uh, with them. It remains to be seen, you know, the, um, the, the, the OpEx and the electricity, et cetera, but that is a complicated question. The thing is, uh, as far as I know, the, the, the, the only chips that are really, you know, um, uh, frontier in that sense are probably TPUs and then the upcoming chips. But the thing is, they're great chips, but they're not on the market, or, like, they're at outrageous prices, like, millions of dollars, you know, to run a model.

    10. HS

      So what, what chips are great, and why aren't they on the market?

    11. SM

      I mean, if you look at, you know, let's say for instance, Cerebras, right? Uh, incredible technology, incredibly expensive. (laughs) So, uh, will, uh, you know ... How will the market value the premium of having single-stream very high tokens per second? Uh, there is a value in that, right? As we saw with Mistral and Perplexity, but I'm not sure, you know... That was done, I think, at a loss. I don't know. I don't have the details, but I think it was at a loss that Cerebras, you know, put it out. Um, so today, there's three actors on the market that can, you know, deliver this. I think this will be, I would say, the, uh, the, the pushing force for change in the inference, uh, landscape, uh, agents and reasoning, so that is, you know, very high tokens per second only for you, and not for a, you know, an aggregate of people.

    12. HS

      Wh- what is forcing the price of a Cerebras to be so high? And then you heard Jonathan at Groq on the show-

    13. SM

      Yeah.

    14. HS

      ... say that, "Hey, they're 80% cheaper than NVIDIA."

    15. SM

      Ah, it's ... So there's this trick, 'cause here's, here's the thing, uh, there's no magic. This little trick is called SRAM. Uh, SRAM is memory on the chip directly, so that it's very, very fast memory. But here's the problem with SRAM, is that SRAM consumes, you know, surface on the chip, right, which makes it a bigger chip, which, you know, is very bad in terms of yield, right? Because the chances of, like, problems are higher and so on. SRAM is, I would say, very, very, very fast memory, which gives you a lot of advantage when you do very, very high speed inference, uh, but it's terribly expensive. And if you look at, for instance, Groq, they have, uh, on their generation, this generation, they have 230 megabytes of SRAM per chip. A 70B model in, you know, uh, what's called BF16, is 140 gigabytes. So you do the math, right? Uh, uh, Cerebras has 44 gigabytes of SRAM in what they're calling their Wafer-Scale Engine, which is a chip the size of a wafer. I mean, most likely, it's interconnected, but it's huge, right? And it has to be water-cooled. They have, uh, copper, you know, n- I would say, needles that touch the chip to... It's crazy stuff, um, uh, very, very impressive technology by the way, mind you, but very, very expensive. So my bet is I think there will be, you know, chips on the market that do that at, at much lower price. Um, and there's two companies I see going in that direction. One is called Etched, um, and the other one is called VSORA. That's the two I see that will be ... Because if you can deliver this at the, I would say, the price that is comparable to GPUs, (claps hands) you've won.
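"So you do the math" invites exactly this arithmetic; the 230 MB (Groq, per chip), 44 GB (Cerebras Wafer-Scale Engine), and 140 GB (70B model at BF16) figures are the ones quoted above.

```python
# Fitting a 70B-parameter model entirely in SRAM, with the figures
# quoted above. Everything here is straight arithmetic.

model_params = 70e9
bytes_per_param = 2                            # BF16
model_bytes = model_params * bytes_per_param   # 140 GB, as stated

groq_sram = 230e6                              # 230 MB SRAM per chip
cerebras_sram = 44e9                           # 44 GB per Wafer-Scale Engine

print(model_bytes / groq_sram)      # ~609 chips to hold the weights
print(model_bytes / cerebras_sram)  # ~3.2 wafer-scale engines
# That chip count is why all-SRAM designs are so fast per stream,
# and why they are so expensive to stand up for a single model.
```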

    16. HS

      Is minimizing SRAM the only way to reduce unit cost on these chips?

    17. SM

      It's hard to say. If you can... I mean, you need some SRAM, but if you can, you know, have a smaller process node, uh, but if you can hook yourself with, uh, excellent memory, then yes, you, you can do that a lot better. But the thing is, if you go, like, full-blown SRAM, then you know, there's no magic. You will have to pay the price.

    18. HS

      (laughs) Um, I- I'm so enjoying this with you. I'm also learning.

    19. SM

      (laughs)

    20. HS

      My, my, my notes here are just expanding by the day. Um, how do you think the inference market then evol- if that's today, how do you think the inference market evolves over the next three to five years?

    21. SM

      Pushed by reasoning, uh, so reasoning not in the sense that you see on DeepSeek and whatever, right? Um, uh, reasoning in what's called, uh, latent space reasoning. Uh, latent space reasoning, uh, and agents will push the market towards different types of compute.

    22. HS

      Can I just ask, w- what's latent space reasoning?

    23. SM

      So the way models reason today is they reason in what's in, in, in tokens. So it's as if, if you think to yourself, you would, you know, say out loud what you're thinking-

    24. HS

      Mm-hmm.

    25. SM

      So, yes it works, but it is a bit, you know, inefficient, right? And you lose, uh, information doing this. Latent space reasoning is this without going, I would say, to English or whatever, right? So staying, you know, in what's called the latent space, which is where all the information of an LLM, let's take an LLM, an LLM lives. So this is very much how we, uh, how we, you know, work as humans, uh, and we move toward what, what Yann LeCun calls an energy-based model, in which we have different types of, um, uh, longer or shorter, I would say, thinking times, if you will. And that, fundamentally, GPUs cannot deliver, plain and simple, at scale. So these chips can deliv- the ch- the chips-

    26. HS

      Why, why, why can't GPUs deliver it?

    27. SM

      Because the access to external memory prevents it. So HBM is all the rage, right? But HBM compared to SRAM is absolutely, you know, dog slow, right? So, so this is the problem you get. So HBM is, like, the best we can do, but it's still slow versus, uh, versus SRAM. Uh-
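A hedged sketch of why this caps single-stream speed: at batch size 1, decoding is roughly memory-bound, so tokens per second per stream is about memory bandwidth divided by model size. The HBM figure is a ballpark public number for an H100-class part; the SRAM figure is purely illustrative of on-chip bandwidths.

```python
# Memory-bound decoding: every output token streams all the weights
# past the compute units, so per-stream tokens/s ~ bandwidth / bytes.
# Bandwidth numbers are approximate/illustrative, not measurements.

model_bytes = 140e9            # 70B params in BF16, as above

hbm_bandwidth = 3.35e12        # ~3.35 TB/s, H100-class HBM (approximate)
sram_bandwidth = 80e12         # tens of TB/s on-chip; illustrative

print(hbm_bandwidth / model_bytes)   # ~24 tokens/s per-stream ceiling
print(sram_bandwidth / model_bytes)  # ~570 tokens/s: the SRAM advantage
# Batching restores aggregate throughput on HBM parts, but the single
# stream an agent waits on is capped by this ratio.
```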

    28. HS

      So, so when I had Jonathan on, he was like, "Actually, NVIDIA have such a stronghold because they're one of the only buyers of HBM, and that gives them this unique position." Actually-

    29. SM

      (laughs) .

    30. HS

      ... is being a sole buyer of HBM irrelevant if the world needs SRAM instead?

  7. 36:25–40:47

    Nvidia's Market Position and Competitors

    1. SM

    2. HS

      With that realization, do you think we'll see NVIDIA move up stack and also move into the cloud and model?

    3. SM

      They are. They are. Uh, they have a product called NIM that does, you know, uh, sort of does that. Um, they are... The thing with NVIDIA is that they spend a lot of energy making you care about stuff you shouldn't care about.

    4. HS

      (laughs) .

    5. SM

      ... s- and they were very successful. Like, who gives a shit about CUDA? I'm sorry, but, uh, I don't want to, I don't want to care about that, right? I want to do my stuff, um, uh, and NVIDIA got me into saying, "Hey, you should care about this because there's nothing else on the market." Well, that's not true, but ultimately this is the GPU I have in my machine, so, you know, off I go. If tomorrow that changes, why would I pay 90% margin on my compute? That's insane. This is why I believe it ultimately goes through the software, uh, because the software... Like, if my entry... This is my entry point to the ecosystem. So if the software, um, you know, abstracts away those idiosyncrasies, as they do on CPUs, right, then the, the providers will compete on specs, and not on fake moats, uh, or circumstantial moats. So this is where I think, you know, the market is going, and of course there's, there's the availability, availability problem. There is, you know, if you, you know, piss off Jensen, you might need to kiss the ring, you know, uh, to get back in line, right? Uh, um, but I mean, ultimately this isn't... I don't see this as being sustainable.

    6. HS

      Can I ask, when we chatted before, you said about AMD, and I said-

    7. SM

      Mm-hmm.

    8. HS

      ... "Hey, you know, I bought NVIDIA and I bought AMD. And NVIDIA, thanks to Jensen, I made a ton of money. And AMD-

    9. SM

      (laughs)

    10. HS

      ... I, I think I'm up 1% (laughs) , uh, versus the 20% gain I had-

    11. SM

      (laughs)

    12. HS

      ... on NVIDIA." My question should... You said that NV- uh, AMD basically sold everything to Microsoft and Meta, and had a GTM problem. Can you just unpack that for me?

    13. SM

      So all, I would say, chip makers have a GTM problem. Uh, it's all of them, whether, you know, it's Google, whether it's AMD, whether it's Tenstorrent, right? Uh, the problem is, is that there's, I would say, probably two fundamental problems. The, the number one is, if you... Maintaining multiple stacks today is very, very, very hard. So you don't. So let's say I buy, you know, AMD. I wanna buy AMD, right? Uh, that means I'm going to abandon NVIDIA. Oh, crap, you know, I have a six-year amortization plan on that. Oh man, what do I do? So do I need to support both stacks? Uh, unclear. Uh, maybe, until AMD tells me, "Hey, you know, you have, I don't know, let's say 1,000 NVIDIA GPUs, um, you're about to buy 100,000 of AMD." I mean, come on, right? (laughs) And I'm like, okay, that is, you know, makes it worth my while, right? Um, so but that is ultimately the fundamental problem, is that the stakes are very high, right? I need to have a lot of incentives to buy into that ecosystem. So I need to buy a lot of them, right? So if you're AMD, that is already a problem, right? Uh, but then Microsoft comes along and buys it all, makes, by the way, OpenAI, or at least on the inference side, puts OpenAI in the green because of the efficiency gains, um-

    14. HS

      Can I just try and understand?

    15. SM

      Sure.

    16. HS

      So are you saying the switch- the switching costs are really high from one provider to another-

    17. SM

      Oh, yeah, absolutely.

    18. HS

      ... which is why you don't? Or are you saying that to get into one of these buy processes, you have to buy so much that it prohibits you, which-

    19. SM

      It's act- it's actually both. (laughs)

    20. HS

      (laughs)

    21. SM

      It's... The buy-in is very high, so to make it worth it, you have to buy a lot. And if you buy a lot, this is, you know, what every... We talked to all of them, they, th- they always have the same questions, and it's completely understandable. They say, "This is great, but who's the customer?" Because on the other side, let's take Amazon, for instance, with Trainium. Apple just came and said, "Hey, we're gonna buy 100,000 of them." "Oh, so you wanna buy, you know, 10,000, you feel like the big shot, right?" "Yeah, but yeah, take, you know, go back to the queue because there's Apple before you," right? So

  8. 40:47–41:39

    Challenges of Incremental Gains in the Market

    1. SM

      they have to have very high commitments to make it, you know... You, you cannot be incrementally better. It's very hard. And also very hard, I can give you, uh, I can give you one metric if you want. Um, I know for a fact that s- being seven times better, on whatever, take whatever metric you want, uh, whether it's spend, whether it's whatever, is not enough to get people to switch. People will choose nothing over something, right? I've s- like, I have stories. So this is a very hard market to enter into, because you cannot compete on incremental gains, right? It's very hard, right? So yeah, you have to convince a lot of people. Um, maybe you can go the, um, Middle East route, in which, you know, they sprinkle everything and they, you know, evaluate everything. But, you know, that's not, you know, a very sustainable, I would say, strategy

  9. 41:39–42:18

    The Zero Buy-In Strategy

    1. SM

      in the long term, or at least in the mid-term.

    2. HS

      Wh- what is, what is the right sustainable strategy then? You don't want to go so heavy that you can't ever get out, and you have that switching cost.

    3. SM

      Right.

    4. HS

      But you also don't want to sprinkle it around and do, as you said, multiple.

    5. SM

      Absolutely.

    6. HS

      What's the right approach?

    7. SM

      The right approach to me is making the buy-in zero. If the buy-in is zero, you don't need to... You don't worry about this. You just buy whatever is best today.

    8. HS

      How do you do that? By renting?

    9. SM

      Oh, because this is what we do. This is our whole, uh, promise. Our whole, uh... at least, you know, our thesis is that, if the buy-in is zero, then, you know, you completely unlock that value,

  10. 42:18–43:23

    Switching Between Compute Providers

    1. SM

      because you're free-

    2. HS

      What you're saying... And when you say the buy-in is zero, what does that actually mean?

    3. SM

      It means that you can freely switch, you know, compute... (clears throat) Sorry, compute to compute, uh, like, freely, right? You just say, "Hey, now it's AMD," boom, it runs. You just say, "Oh, it's Tenstorrent," boom, it runs, right? If-

    4. HS

      Ho- how do you do that then? Do you have agreements with all the different providers?

    5. SM

      Oh, yeah, yeah, yeah. Not agreements, but, like, we, we work with them, uh, to support their, their chips. But the, the thing is, my... at least as, you know, uh, I would say a user myself of, you know, our, of our tech, is that if I can, you know, if it's free for me to switch or to choose whichever provider I want in terms of compute, right, uh, AMD, NVIDIA, whatever, uh, then I can take whatever is best today-... and I can take whatever is best tomorrow, and I can run both. I can run, you know, three different platforms at the same time. I don't care. I only run, you know, what is good at the moment. And that unlocks, to me, uh, a very cool thing, which is incremental, you know, improvement. If you are 30% better, I'll switch

  11. 43:23–44:49

    The Importance of a Top-Down Strategy for Microsoft and Google

    1. SM

      to you.

    2. HS

      So are you taking the risk on those, on that hardware then, if you're the one providing them to turn off and on, on demand, provisioning, you name it?

    3. SM

      Oh.

    4. HS

      Who's taking the... Who takes the risk?

    5. SM

      So I... This is a- this is actually a great question. Um, I think that if you are doing it bottom up, infra to applications, you will lose because nobody will care. Uh, as they, as they don't today, right? If you look at TPUs, they're available, they're great. Nobody cares. So my approach is-

    6. HS

      Why did- why does nobody care about TPUs? Sorry.

    7. SM

      Because the cost of buy-in.

    8. HS

      They are.

    9. SM

      It's always the same, right? You have to spend six months of engineering to switch to TPUs. And mind you, TPUs do training. They are the only ones with Trainium now, but AMD can do training, but it's so-so. But in terms of maturity, the... by far the most mature software and compute is TPUs, and then it's NVIDIA. The buy-in is so high that people are like, "Uh, uh, well, we'll see," right? "I'm not on Google Cloud, I have to, you know, sign up. Oh my God," right? So these are tremendous, you know, chips. These are tremendous, uh, assets. Now, in terms of the risk, I think if you want to do it, you have to do it, you know, bottom up... uh, top to bottom. You have to start with, you know, whatever it is you're going to build and then, you know, permeate downwards into the infrastructure.

  12. 44:49–49:35

    Microsoft's Strategy with AMD

    1. SM

      Take, for example, um, Microsoft with OpenAI. They just bought all of AMD's supply, and they're running, you know, ChatGPT on it. That's it. And that puts them in the green. That's actually w- what makes them, you know, profitable on inference. So... or at least, let's say, not lose money, right? Um-

    2. HS

      I'm sorry, how does Microsoft buying all of AMD's supply make them not lose money on inference? Just help me understand that.

    3. SM

      Because I can give you, like, actual numbers. If you run eight H100s, you can put two 70B models on them because of the RAM. That's number one. Number two is, if you go from one GPU to two, you don't get twice the performance. Maybe you get 10% better performance. Yeah, that's the dirty secret nobody talks about. Uh, on the, on inf- I'm talking inference, right? So, so you go from, let's say, 100 to 110 by doubling the amount of GPUs. That is insane. So you'd rather have two by one than one by two. With one machine of eight, uh, H100s, you can run two 70B models if you do, you know, um, w- uh, four GPUs and four GPUs, right? Um, that's number one. If you run on AMD, well, there's enough memory inside the GPU to run one model per card. So you get, you know, eight GPUs, eight times the throughput. Where on the other hand, you get eight GPUs, two... you know, two, maybe two and a half times the throughput. So that is, (snaps fingers) you know, a 4X right there, right? Uh, just, you know, by virtue of this, uh, of this. Um, and so that is, you know, the, the, the compute path, but if you look at all of these things, there are tremendous amounts of... You know, we talked to companies who have chips coming with almost 300 gigabytes of memory on them, right? So that is, uh, you know, a model, like, one chip per model. This is the best thing you want, right? Uh, if you're on 70Bs, right? Um, so which is, I would say, not the state of the art, but this is the, uh, the reg- the regular stuff people will use for serving. There... Like, if you look, you know, top to bottom, and you know what you're going to build with them, then it's a lot better to do the efficiency gains, because four times is a big deal, right? It's like you get... A- and mind you, these chips are 30% cheaper than NVIDIA's. It's like a no-brainer. But if you go bottom up and say, "I'm going to rent them out," no, but b- people will not rent them. Simple. (laughs) So that's why, you know, I think it's, it's a good way to attack it from the software, because ultimately, do you really care whether your MacBook, let's say, is an M2 or an M3? It's like, oh, it's the better one. All right? (laughs) And that's it, right? And imagine if you had to care about these things. That would be insane.
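Putting his numbers in one place; the 4+4 GPU split, the ~10% gain from doubling GPUs, and the "two, maybe two and a half times" figure are all from the answer above, and the AMD per-card capacity (enough memory for a whole 140 GB model) is the premise of his comparison.

```python
# The "4x right there" arithmetic, using the figures he quotes.
# 8x H100 box: the 140 GB model is sharded across 4 GPUs, so the box
# holds 2 replicas, and sharding scales badly (1 -> 2 GPUs buys ~10%),
# netting "two, maybe two and a half times" one replica's throughput.
nvidia_box_throughput = 2.25     # midpoint of his 2-2.5x figure

# AMD box with enough memory per card for the whole model: 8
# independent replicas, each at full single-card speed.
amd_box_throughput = 8.0

print(amd_box_throughput / nvidia_box_throughput)  # ~3.6x, his "4x"
# Add the ~30% cheaper cards he mentions and the inference cost per
# token improves by another ~1.4x on top of the throughput gap.
```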

    4. HS

      When I listen to you now, I'm like, "Shit, I should sell my NVIDIA and buy more AMD."

    5. SM

      (laughs)

    6. HS

      Okay? I ... If you were-

    7. SM

      Uh...

    8. HS

      If you were forced to buy one, I'm not saying sell the other, I'm not saying like this and the other, but buy one, which would you buy and why?

    9. SM

      Stock?

    10. HS

      Yeah.

    11. SM

      The thing is, you know, I, I'm a long, uh... I used to think the market was efficient. (laughs) So this is not investment advice. Um, probably I would go, today at least, I would go with NVIDIA still. Uh-

    12. HS

      Because?

    13. SM

      Because the supply. But, you know, if we play our cards right, we ship our stuff, hopefully I will come back and tell you to buy AMD as much as you can, or Tenstorrent, you know, if they go public, or, um, whoever else. Um, these chips are amazing, by the way.

    14. HS

      What does everyone think they know about inference that they actually don't? Or what does everyone get wrong about inference?

    15. SM

      Probably not a lot of people are accustomed to what it entails to, to run production. So that inference is production, and production is hard. Somebody has to wake up at night. And I used to be that guy, right? I don't want to do it again. Um, so production is hard. Thankfully, we have a lot of, uh, uh, software nowadays to do that a lot better, but there's not a lot of reuse, because the AI field at least is not really accustomed to that yet. It's changing.... uh, but, you know, the discussions I had, you know, a year ago, and th- the, the discussions I had, you know, uh, today are not the same. They're going in the right direction, but they are not there exactly, yet. S- so probably that would be the number one thing. There is only, you know, um, uh, training code running only the forward pass, right? This is not what

  13. 49:35–52:20

    Data Center Investments and Training

    1. SM

      it is. (laughs)

    2. HS

      Can I ask, how do you evaluate the data center investment that we're seeing being made? You know, when you look at Facebook doing 60 to 65 billion, Microsoft doing 80 billion, and some of the intense capex expenditures that you're seeing. How do you think about that, on the data center side?

    3. SM

      Hmm. I mean, they're still going after training. So there's still this frontier. Probably it's why also NVIDIA is the better buy right now. Uh, because on the NVIDIA side, if you do training, it's incremental, right? If you have bought 1,000 NVIDIA GPUs, and you buy 1,000 new NVIDIA GPUs, that gives you 2,000 GPUs, right? But if you buy 1,000 NVIDIA and 1,000 AMD, that gives you twice 1,000, right? (laughs) So it's- it's- it's a bit different. Um, so they're still going after training, uh, definitely, and they're very pragmatic in doing so. But, I mean, they have the capex to spend. Uh, they're not making their money out of it probably, is ... The only one, by the way, that owns their compute is Google. There's, like, this triangle of- of, I would say, of- of win that I ... This is my mental model, mind you. You have the products, the data and the compute. Who has all three? And, you know, everything flows from there.

    4. HS

      Product, data, compute. Who has all three? Google?

    5. SM

      Google.

    6. HS

      Amazon?

    7. SM

      Amazon, they don't have products. They have Amazon, right? They have it in the US, but they don't have actual products. Google has like, you know, Android, uh, Google Docs, whatever. They have everything. They can sprinkle everywhere. Uh, this is the sleeping giant, in my mind. If they're not busy doing a reorg, (laughs) they might-

    8. HS

      It- i- it's fascinating-

    9. SM

      (laughs)

    10. HS

      ... 'cause everyone ... If you're a shallow thinker, you think that OpenAI challenges their golden goose, which is search, and Google is threatened-

    11. SM

      Hmm.

    12. HS

      ... more than ever now.

    13. SM

      I mean, O- OpenAI is amazing, but it's not their compute. So that, that is Microsoft's compute.

    14. HS

      And if you own your compute, you own your margin, is essentially what you're saying.

    15. SM

      Yeah. Micros-, even Microsoft, they- they bought, you know, when they were running NVIDIA, they bought NVIDIA at, you know, some outrageous margin. I talked to a lot of people that build data centers, and I tell them, you know, they're ... Mind you, these people, like, buy tens of thousands of- of- of GPUs. Uh, and I asked them, "Hey, do you get, you know, at least a discount or something?" And they're like, "No. The only thing we get is, you know, uh, s- the supply." I mean, ultimately if you don't own your compute, yeah, you're starting with, you know, something at your ankle,

  14. 52:20–52:48

    How to Succeed in AI: The Triangle of Products, Data, and Compute

    1. SM

      definitely. And so this is why I like to think in this like, this triptych, or at least this triangle, product, data, compute, and you can see where everybody's i- is, uh, you know, sits, and their weaknesses and their strengths.

    2. HS

      Can I ask you, if we move a little bit, you said that it's totally rational that everyone's focusing on training still.

    3. SM

      Yeah.

    4. HS

      When we think about that, it's rational if you think that efficiency and scaling laws continue-

    5. SM

      Mm.

    6. HS

      ... to continue, to place such emphasis

  15. 52:48–54:34

    Scaling Laws and Model Efficiency

    1. HS

      on it. How do you think about model scaling and scaling laws coming into place?

    2. SM

      There's, like, a brute force approach to this. It is a very American approach, more and more and more. But the thing is, you look at, for instance, the, uh, uh, the xAI cluster. It's not 100,000 GPUs. It is four times 25,000. So you're starting, you know, to see some ... Because InfiniBand, and in their case, uh, uh, RoCE, which is, um, anyways, the technology they use to bridge their, uh, GPUs together, they ... you have upper bounds, right? At- at some point, you are fighting physics. So you can push ... It's like, you know, trying to get to the speed of light. As- as you approach it, the- the- the amount of energy you need, you know, is a lot higher and a lot higher, and it grows and grows. So there's two approaches to that. Two, uh, sorry, two, I would say, counters to that would be: number one is, we still scale, but there's a lot of waste and, uh, excess, you know, spending, uh, um, on the engineering side, which is the DeepSeek approach, right? Very successful at that, mind you. They said, "Yeah, if we do this and this differently, then we get, you know, multiples sometimes, right?" So virtually, you increase your compute capacity because you're more efficient. And the other approach is Yann's, uh, Yann LeCun's approach, which is, this is not scaling, and at some point we need to look the problem in the face and, you know, do something better, right? So, of course we push and push and push because there's capital still, but between these two approaches, I'm more of the... I think you can do more with less. Uh-

    3. HS

      At what,

  16. 54:34–1:03:38

    Future of AI Models and Architectures

    1. HS

      at what point do we stop and say, "Hey, there is a lot of wastage and we could do more better?"

    2. SM

      At least up to-

    3. HS

      How far away are we from that?

    4. SM

      ... I think until somebody does it. DeepSeek was a good, you know, wake-up call, right? Uh, suddenly efficiency is in, right? Um, that's number one. And number two is, until there's a new architecture that comes out and changes the game. So in the case of LLMs for instance, you have these what's called non-transformer models that change fundamentally the compute requirements. So that might be a frontier that, you know, completely obsoletes the transformers, right? And the trans- the trans-, sorry, the transformers are the, um, I would say, the building block by which current models work, right? So, the way they work is that for each, you know, token or syllable, if you will, the model will look at everything behind it. So, you can see that as you add more text, you have more work to do, right? Uh, so there are these new architectures, uh, um, that do not require this, uh, that might change, you know, these things, uh, and probably shift the amount of compute needed to do training or to do inference. And then there's the new thing which is, uh, Yann's, um, uh, uh, thesis, which is the, um, uh, world model, right? As in, LLMs are (thunder) at a dead end; what we need is something that understands the world fundamentally, and this is the- it's, uh, the JEPA thesis, it's called. I'm very bullish on this, but it's- it's very frontier.
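A tiny sketch of the per-token lookback he describes, to show why attention cost grows quadratically with context; the token counts are arbitrary.

```python
# "For each token, the model looks at everything behind it": counting
# the pairwise lookups makes the scaling concrete.

def attention_lookups(n_tokens: int) -> int:
    # token i attends to tokens 0..i -> n*(n+1)/2 lookups in total
    return n_tokens * (n_tokens + 1) // 2

for n in (1_000, 10_000, 100_000):
    print(n, attention_lookups(n))
# 10x more context -> ~100x more work, which is why non-transformer
# architectures that drop this requirement could shift how much
# compute training and inference need.
```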

    5. HS

      Why are you bullish on it? And why is it so frontier?

    6. SM

      'Cause it's Yann LeCun. (laughs) It's hard to argue. Um, he's no bullshit, right? So, um, he explained to me how it worked, and I was blown away. Simple as this. Uh, but it makes a lot of sense. Um, you know, we are creeped out because the machine talks back to us. That's it. But it's not a new thing, right? You know, when it came out or, like, when it- when it exploded, it wasn't new technology. But suddenly, it was talking back and that freaked us out, uh, and we got crazy on it, right? But language is one form of communication, uh, but it, uh, it is ultimately a very narrow, uh, window into, you know, the world. Uh, we use it to describe the world, arguably with some loss, right? Um, and so there's this, um, the JEPA approach is, long story short, is that you have essentially two things you wanna do, and you try and minimize the energy to do them. And- and from this, understanding emerges, physics emerges, et cetera, because you're trying to minimize the amount of energy to go from one state to the other, and that actually makes sense. Like, if you try and, you know, pick up this AirPods case, I'm not gonna go round trip around the block to get it, right? I just get it, and then my brain is wired to just do the, uh, the thing. If I- if I go and, you know, talk to myself out loud, you know, put the- the- the- the hand down, move to the left and whatever, that feels very, you know, inefficient. Probably this will be something, uh, that- that changes, and- and in the case of LLMs, there is- there is good work also on what's called diffusion-based LLMs, which i- which means, like, instead of thinking, you know, what's called autoregressively, that means you get a new token, you reinject, and you redo, et cetera, they think more, like, what we do, which is in patches, right? And, you know, it's like im- imagine a paragraph of text and words appear, you know, until it's done, right?

    7. HS

      What was it... Was it not diffusion that DeepSeek did on OpenAI's models? Basically copying-

    8. SM

      Oh, distillation, they did.

    9. HS

      Distillation, sorry. So let me just get rid of that.

    10. SM

      (laughs)

    11. HS

      Uh, is distillation wrong? And if we're all progressively moving towards a better future for hu- humanity, more efficient models, is distillation not effectively open sourcing in another-

    12. SM

      (laughs)

    13. HS

      ... wrapper?

    14. SM

      I think it's fair game, to be honest. Honestly, I will not shed a tear. It's fair game. There were some people who tried this... I don't remember if it was an OpenAI model, but it was a diffusion model, an image model, right? They asked it to generate an image from a Star Wars movie at whatever timestamp, and it came out with the Star Wars movie screenshot, right? So obviously it was trained on it. I think it's fair game because, I mean, there's no free lunch, right? It was trained with data. You had a good ride. Somebody was sneaky and took it, but you took it from the beginning too. So let's just accept that it's fair game. Um...

    15. HS

      And you- and you also learn from their advancements.

    16. SM

      Absolutely. Absolutely. You know, I take my cup and enjoy that movie very much, every single day. (laughs)

    17. HS

      (laughs)

    18. SM

      'Cause-

    19. HS

      Can I ask you...

    20. SM

      Yeah.

    21. HS

      You- you mentioned the training there, you know, obviously data and data quality dictates-

    22. SM

      Right.

    23. HS

      ... a lot of training ability. When you think about the future of data that feeds into training-

    24. SM

      Mm-hmm.

    25. HS

      ... how do you think about how that will be between synthetic data versus, like, real data?

    26. SM

      I'm a bit split on this. There's a part of me that says that if you re-inject generated data into the system, the system deteriorates. That feels a bit intuitive, I would say. But if you look at AlphaGo, for instance, the moment it ramped up in its skills is when they started generating games, synthetic games, right? So I'm a bit split, but there are some verticals that very much benefit from this, you know, code LLMs, for instance. We can run code, right? So this is the Poolside thesis.
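
      The reason code is the favorable vertical, sketched out: the machine, not the model, supplies the correctness signal. In the sketch below, `sample_program` stands in for a hypothetical LLM call; candidates are actually executed against a test, and only the ones that pass become synthetic training pairs.

      ```python
      import subprocess, sys, tempfile

      def passes(program: str, test: str) -> bool:
          """Run a candidate program plus its test in a subprocess; the
          exit code, not a model's opinion, is the ground truth."""
          with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
              f.write(program + "\n" + test)
              path = f.name
          try:
              result = subprocess.run([sys.executable, path],
                                      capture_output=True, timeout=10)
          except subprocess.TimeoutExpired:
              return False
          return result.returncode == 0

      def make_synthetic_pairs(tasks, sample_program, k=8):
          """tasks: iterable of (prompt, test) pairs. sample_program is a
          hypothetical LLM call, prompt -> candidate source code. Keep
          only candidates that actually execute correctly."""
          pairs = []
          for prompt, test in tasks:
              for _ in range(k):
                  candidate = sample_program(prompt)
                  if passes(candidate, test):
                      pairs.append((prompt, candidate))
          return pairs
      ```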

    27. HS

      Just so I understand, why does it work for coding and not for other things?

    28. SM

      Because you don't use the AI model to generate the output; you use the machine. You just run the code, right? And you see what it produces, and you run all this code and you create data out of it. Whereas if you ask an LLM, "All right, generate me two trillion tokens of text," it will do it with its own, you know... So you may inject things, there are a lot of tricks, but ultimately my gut tells me it feels wrong, right? (laughs) Because you re-inject data that was already there, and so it will deteriorate. There's loss, you know. So yeah, I'm a bit bullish. I'm not sure exactly on which verticals. Code is one. But we'll see. Distillation is, in some sense, a bit like that, right? You distill, you know, you create synthetic data from a bigger model and train a smaller one on it. Probably the most mind-blowing thing about distillation is that sometimes the smaller models become better than the bigger model through distillation. So we'll see, uh, but I love this-
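
      Distillation itself fits in one loss function: the student is trained to match the teacher's full softened token distribution, which carries more signal per example than a one-hot label. A minimal PyTorch-style sketch; the temperature and blending weight are illustrative defaults, not figures from the episode.

      ```python
      import torch
      import torch.nn.functional as F

      def distillation_loss(student_logits, teacher_logits, labels,
                            T=2.0, alpha=0.5):
          """Blend of soft-target KL (teacher -> student) and ordinary
          cross-entropy on the true labels. T softens the distributions;
          alpha balances the two terms."""
          soft = F.kl_div(
              F.log_softmax(student_logits / T, dim=-1),
              F.softmax(teacher_logits / T, dim=-1),
              reduction="batchmean",
          ) * (T * T)  # standard temperature-squared scaling
          hard = F.cross_entropy(student_logits, labels)
          return alpha * soft + (1 - alpha) * hard
      ```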

    29. HS

      And smaller models become better than bigger models-

    30. SM

      Sometimes, yeah.

  17. 1:03:381:07:51

    Retrieval Augmented Generation (RAG)

    1. SM

      (laughs)

    2. HS

      What is retrieval augmented generation, first?

    3. SM

      It's a very clever trick. What you do is represent knowledge in what's called a vector space, or latent space, and then you do what's called vector search. Imagine you have, let's say, a 3D space that represents all knowledge, all of everything. A cat sits here; a dog sits close by, because it's also an animal, but far away along some other property, and so on. So you run the user's request through the same system, that's called an embedding, and it gives you a vector, and you take whatever is closest to it, right? That is what's called semantically close. And then, and this is the actually very clever part, you insert those pieces of text before the request. So it's as if you said, "Knowing the following," and you give the data, let's say it's law or whatever, "please answer my request," and that's it, right? So it's a bit of a clever trick. It's a bit dirty because, of course, you are limited by the amount of data you can input, right? So there's this problem of how you chunk the data you input. Like, let's say you're putting, I don't know, a-
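
      The description above maps almost line for line onto code: embed the corpus, embed the query, take the nearest chunks by cosine similarity, and paste them in front of the question as the "knowing the following" preamble. A minimal sketch, where `embed` and `llm` are hypothetical stand-ins for an embedding model and a chat model.

      ```python
      import numpy as np

      def cosine_top_k(query_vec, doc_vecs, k=3):
          """Vector search: nearest neighbours in the latent space."""
          q = query_vec / np.linalg.norm(query_vec)
          d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
          return np.argsort(-(d @ q))[:k]

      def rag_answer(question, chunks, embed, llm, k=3):
          """embed: text -> vector; llm: prompt -> text (both hypothetical)."""
          doc_vecs = np.stack([embed(c) for c in chunks])
          top = cosine_top_k(embed(question), doc_vecs, k)
          context = "\n\n".join(chunks[i] for i in top)
          prompt = (f"Knowing the following:\n{context}\n\n"
                    f"Please answer my question: {question}")
          return llm(prompt)
      ```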

    4. HS

      Uh, are a lot of the things we do not retrieval augmented generation though, where we say, "Here's a link. Um, summarize it to the key points." Is that not RAG? 'Cause we're inputting the data, which is that, and then we're-

    5. SM

      It is. It is. It depends on how it works, but yes, sometimes it is. But think of it as a preamble to your question: "Knowing the following," and the following is a tiny window into the content, "please answer my question." And of course, as you talk more and more, it will forget, because that window is fixed.

    6. HS

      How does that shift the movement from large generalized model to smaller, more advanced models?

    7. SM

      What pushes smaller models is roughly efficiency, speed. You know, less is better. So if we can do it with less, then less it is, simple as that, right? In terms of RAG, the key frontier is what we call attention-level search, but this is something we're working on. You know, you have the exclusivity now; I'm putting it out there. But it doesn't really push model sizes. What really pushes model sizes is efficiency rather than specialization.

    8. HS

      Okay. Gotcha.

    9. SM

      So meaning that if you can get the same performance with a smaller model that is fine-tuned or uses RAG or whatever, then you'll do it with the smaller one, because again, less is better.

    10. HS

      Can I ask you? Before we move into a quick fire round, I do just wanna ask you, when we had DeepSeek, as we mentioned, to what extent were you surprised that such innovation, I would argue, and I think many would agree with me, came from a Chinese competitor, not from a Western competitor?

    11. SM

      Oh, I love it. Constraint is the mother of innovation. So yes, you know, we can (laughs) troll a bit about the Singapore gray market and all of these things. But ultimately, they had no choice. Here's the thing: if you can buy more, why would you give a damn, right? You can just buy more. But if you are pushed to efficiency, then you will deliver efficiency. And these are very, very skilled people. This is the coolest thing to me about AI, honestly: geography doesn't matter anymore, you know? You can just do things. You appear out of nowhere, boom, you're on the map. So I'm very, very glad that they did. I found the reaction very entertaining, to be honest. So yeah, constraint is a very good driver of efficiency.

  18. 1:07:511:15:20

    Why OpenAI’s Position is Not as Strong as People Think

    2. HS

      Do you think it is a meaningful threat to OpenAI and ChatGPT? Bluntly, they still have the consumer loyalty, the consumer brand. To what extent is it actually a long-term threat?

    3. SM

      I'm not sure who is a threat to OpenAI at the moment. Here's why. You look at the numbers... I mean, we live in a bubble, right? We follow every new episode, the whatever new model, who said what, and so on. But I go to my mother and I ask her, "Do you know ChatGPT?" And she says, "Yes." And, I don't wanna dunk on anybody, but, "Do you know some other model?" And she says, "What is it?" Even Gemini, right? Like, Google, right? So they have a strong brand, they have a strong product. There's a balance between the product and the models, honestly. It's Gary from FluidStack, actually, who told me his mental model for model providers: they'll be like car makers, right? There's no winner-take-all; everybody will have their own. Because ultimately, human knowledge is... I mean, everybody has everything, so we're converging. I like that analogy. Yes, DeepSeek made waves, but they were waves amplified by the media and the narrative and the drama, right?

    4. HS

      Do you think export regulations inhibit China's ability to compete in any way?

    5. SM

      Today, maybe. Tomorrow, I'm not sure. They're a bit late in terms of ASICs; they are at, like, A100 level. But probably one of their unfair advantages is that, you know, it's like when you do exercise in the water, right? It's like this. That's their state: they are constrained, so they are bound to do better. They cannot just buy their way into better compute. So I think it hinders their success, but I think it's short-term to think that way.

    6. HS

      Are you fearful that we in Europe are gonna regulate ourselves into constraints in a world of AI?

    7. SM

      No, I don't care. I have zero... This is something that makes me wonder sometimes, and I understand the narrative and so on, but I am absolutely not fearful. Let's be successful first, and then we'll talk about the politics. But again, I'm not Mistral; I'm not building gigawatt data centers and so on. If you build gigawatt data centers, you run into these problems... maybe you run into these problems. But the thing is, if you're successful, everything flows from there.

    8. HS

      Steve, I'm being direct here, but I'm asking you for the pros. Everyone says Mistral just doesn't have enough money to compete. That is kind of the word on the street. To what extent is that fair?

    9. SM

      They are very competent, so I don't know. I think it's easy to spread FUD. There's a lot of FUD going around, especially about regulation and everything. But here's the thing: I look around me and I don't see what I read, so I am hardly convinced by it. You know, everybody was saying they were dead, and boom, they came out with their release and it was insane. I don't know. What I know is that I hope they don't have too much money, that's for sure. You wanna be clever, right? (laughs)

    10. HS

      Final one before we do a quick fire. I've so enjoyed this, Steve. Stargate was, you know, a $500 billion announcement. How did you evaluate that?

    11. SM

      My first impression was that I don't buy it. It's, I would say, American style, right? You start with the claim and figure it out later. (laughs) I don't buy it, and ultimately, I'm not sure I cared that much about it. The reason being: let's imagine it's true, right? To be clear, I don't know if it is or not. But let's imagine it's true. You know, congratulations, amazing. But it is more of the same. It is vertical scaling, and as you know, my days are spent on efficiency. So I look at these things and think, all right, this is the American car of AI. It's big, it consumes a lot of gas, but ultimately, it's not a good car, right? (laughs) Or at least not to my liking. So I think there has to be sufficient capital, but at some point, I'm not sure it is really a differentiator. And that was prior to DeepSeek; then DeepSeek came. That was always my thesis, but, you know, you need money, right? You need infrastructure. But probably the two limiting factors today are talent and energy. That's it. The rest... yes, of course you can buy $500 billion of GPUs. By the way, 90% margin, right? So if we work on that margin, we can probably shrink that number. That's my... I would say probably I'm wrong, right? But that's my view of the world. I'm not easily entertained by these numbers; I've seen how the sausage is made way too many times.

    12. HS

      Dude, I wanna do a quick fire with you. So I say a short statement, you give me your immediate thoughts.

    13. SM

      Sure.

    14. HS

      If you had to bet on one major shift in AI infrastructure over the next five years, what would it be?

    15. SM

      Oh, yeah, latency, reasoning, definitely. This year.

    16. HS

      What does that mean in...

    17. SM

      So, you know, as I was saying, the shift from throughput, so how fast my answer streams, to latency, how long it takes for my complete answer to appear. That is probably one of the fundamental shifts, like, this year, right? Longer term, I'm really rooting for non-transformer models, which will also change the compute landscape, and of course, world models, right? And/or energy-based models.
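
      In concrete terms: throughput is tokens per second once generation is rolling, while the latency pointed at here is time-to-first-token and total wall-clock for the complete answer. A toy harness for measuring both, assuming a hypothetical streaming `generate` that yields tokens one at a time.

      ```python
      import time

      def measure(generate, prompt):
          """generate: hypothetical streaming API yielding tokens."""
          start = time.perf_counter()
          ttft = None
          n_tokens = 0
          for _ in generate(prompt):
              if ttft is None:
                  ttft = time.perf_counter() - start  # time to first token
              n_tokens += 1
          total = time.perf_counter() - start
          throughput = n_tokens / total if total else 0.0
          return {"ttft_s": ttft, "total_s": total, "tokens_per_s": throughput}
      ```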

    18. HS

      What's one piece of advice you'd give to AI startups navigating the changing landscape of training, inference and hardware?

    19. SM

      Probably the number one thing I would say is: do not resell compute if you can avoid it. A lot of AI startups that are building on top of AI are trying to make a margin on top of a very big cake, and ultimately, what they sell is compute. If you look at a dollar of spend, maybe 98% of it goes to somebody else's margin. So if you do AI, as much as you can, try to verticalize on the product, not on the compute. If your business model implies buying a lot of tokens, it's a very hard circle to square to fit that into $20 a month. So I always say: please look at it from that angle and, if you can, try to avoid it.
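
      The squeeze, as back-of-the-envelope arithmetic using the episode's own illustrative figures (a $20/month product where 98% of each dollar goes to somebody else's compute and margin):

      ```python
      # Rough unit economics for a $20/month AI subscription, using the
      # episode's illustrative "98% goes to somebody else" figure.
      price_per_month = 20.00
      compute_share = 0.98

      compute_cost = price_per_month * compute_share
      gross_left = price_per_month - compute_cost
      print(f"compute: ${compute_cost:.2f}, left for everything else: ${gross_left:.2f}")
      # compute: $19.60, left for everything else: $0.40
      ```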

  19. 1:15:201:18:09

    Challenges in AI Hardware Supply

    2. HS

      What's the biggest challenge that Jensen Huang faces today?

    3. SM

      You know, the highs are very high, but they don't last forever. We're seeing it, actually. So probably it's how to navigate the down slope. Blackwell is probably something that keeps him awake at night.

    4. HS

      Why would that keep him awake at night? Would that not re-energize him, more orders, new enthusiasm, new product maybe?

    5. SM

      Because orders are getting canceled.

    6. HS

      Why are they getting canceled?

    7. SM

      They have a lot of problems with these chips. So a lot of people are canceling their orders. These chips are on the frontier of scaling, and they were supposed to come out last summer, right? But they have heat dissipation and, you know, metal bending problems. People who are very privy to silicon told me, "This is what we call a pretty big fucking problem," right? (laughs) End quote. So, probably how to navigate the down slope. Maybe you don't know this, but the supply of H100 was actually smoothed out over the year, so that they didn't have a big spike in deliveries and then, a quarter later, less, right? Which pissed off a lot of people, mind you, who bought a lot of them. Some of them still haven't received their order from last year, and they already see the new chip, the B200, and then the one after, and they're super pissed. There will be a down slope at some point; the question is when, and how. If the H100 bubble bursts, of course it will impact NVIDIA. But Blackwell... I'm probably gonna get a lot of flak for this, but I've seen some very worrying numbers about it, and varying testimonies from people who operate these things. So I don't know, maybe that ride will stop (laughs) or at least, you know, slow down.

    8. HS

      Yeah. Steve, I'm not sure I've ever learned quite as much in one episode. Seriously. We said before, like-

    9. SM

      Oh, wow!

    10. HS

      No, like, I love what I do because I'm able to ask anything of the smartest people in their business, and I so appreciate you unpacking so much of it for me today, man. I'm thrilled to say that I actually finally get what you do-

    11. SM

      (laughs)

    12. HS

      ... after years I've been asking you. (laughs)

    13. SM

      (laughs)

    14. HS

      Um, but you've been a star, so thank you, man.

    15. SM

      Thank you. Appreciate it. Thank you.

Episode duration: 1:18:09
