No Priors

No Priors Ep. 31 | With Cerebras CEO Andrew Feldman

The GPU supply crunch is causing desperation amongst AI teams large and small. Cerebras Systems has an answer, and it’s a chip the size of a dinner plate. Andrew Feldman, CEO and Co-founder of Cerebras and previously SeaMicro, joins Sarah Guo and Elad Gil this week on No Priors. They discuss why there might be an alternative to Nvidia, localized models, and predictions for the accelerator market. As the CEO of Cerebras, Andrew is building a new class of computer to accelerate AI work beyond the current state of the art. He is an entrepreneur dedicated to pushing boundaries in the computer space, and his passion is building teams that solve hard problems.

00:00 - Cerebras Systems CEO Discusses AI Supercomputers
07:03 - AI Advancement in Architecture and Training
16:58 - Future of AI Accelerators and Chip Specialization
26:38 - Scaling Open Source Models and Fine-Tuning

Elad Gil (host) · Andrew Feldman (guest) · Sarah Guo (host)
Sep 7, 2023 · 30m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 - 7:03

    Cerebras Systems CEO Discusses AI Supercomputers

    1. EG

      The shortage of compute for AI, also known as the GPU crunch, is increasingly impacting AI companies big and small. It is delaying training runs and launches for multiple players in the generative AI world. One company coming to the rescue is Cerebras Systems, which is developing the largest computer chip and one of the fastest purpose-built AI processors ever. This week, Sarah and I are joined by Andrew Feldman, CEO of Cerebras Systems. Andrew is one of the few entrepreneurial veterans in the semiconductor world. He previously started SeaMicro, a pioneer of energy-efficient, high-bandwidth micro servers. SeaMicro was acquired by AMD in 2012. Andrew, thank you so much for joining us today.

    2. AF

      Elad and Sarah, thank you so much for having me.

    3. EG

      I, I think you all recently announced that Cerebras has closed a $100 million deal with G42 to develop one of the largest AI supercomputers in the world.

    4. AF

      We, we did announce a strategic partnership with a group called G42, (clears throat) and we announced that we were building nine, uh, supercomputers. Each supercomputer would be four exaflops of AI compute, so (laughs) in total, 36, uh, exaflops of AI compute. Th- th- that was extraordinarily exciting. When you encounter a partner that, that shares your vision and w- wants to build with you (laughs) a- and, uh, you get to start building the biggest computer on earth. I mean, it had (laughs) ... It doesn't get better than that.

    5. EG

      And I think, in general, you all were very forward-thinking and early to identifying AI as a really important market for custom semiconductors. Could you tell us a little bit more about your thinking early on as you started Cerebras and why you focused on AI many years ago?

    6. AF

      In, in late 2015, five of us started meeting regularly. I think we were meeting in, in Sarah's offices, actually, at that point. All of us had worked together at our previous company, and we began w- working on, on ideas. And, uh, one day our CTO, Gary, leaned back and he said, "Why would a machine built for pushing pixels to a monitor be ideal for AI?" Then he said, "Wouldn't it be serendipitous if 20 years of optimizing a part for one job left it really well-suited for another?" And that got us excited. We, we began looking at GPUs, looking at AI work, and by, uh, early 2016, we, we'd decided that w- we could build a better part for this work. And our, our strategy was not to build a little bit better, but it, it was to try and do something vastly better. And we went to a technology that had never worked before, called wafer-scale, and we built a chip that's, uh, sort of the size of a dinner plate, whereas most chips are the size of a postage stamp. And we, we, we did that because we knew that this workload would go big (laughs) and that the problems of memory bandwidth and problems with breaking up work and spreading it over lots of little machines would be daunting. This is my fifth startup. This is the first time I was wrong on, uh, the market size on the low side, uh, had no idea it was gonna be this big. And I, I think very few people saw, even those of us who were in it, how big this was gonna be. And so ear- early 2016, we went out, we did eight pitches. We ra- we got eight term sheets. We, we raised money and we're building multi-exaflop AI supercomputers for, for customers around the world.

    7. SG

      It's really incredible thinking about the foresight. I actually went back and looked at my notes from, like, our end of 2015 meeting, and if... I'm sure you remember very well, but for our listeners, uh, Andrew had a slide in there of how the top seven problems for deep learning were all long training times, and, like, the whole industry was bottlenecked on compu- This is even before the scale-up of transformers that happened. It's kind of wild how correct you were, at least on, (laughs) you know, the depth of this problem.

    8. AF

      I, I, I think what you need to do is save that snippet and send it to my wife.

    9. SG

      (laughs)

    10. AF

      (laughs) Yeah, we got that right. I think also, we, we got the, the fundamental architecture right, in a, in a large sense, you know? We laid down the architecture before transformers existed, and we're the fastest at transformers by a lot. So when you break up a problem and, and it encounters things it has never seen before and it's still really good at them, tha- that's a sign in, in hardware architecture that you really got the architecture right.

    11. EG

      What are some of the benchmarks that you use in order to assess the performance of your chip versus others, and how have you all performed relative to GPUs and other sort of more standard...

    12. AF

      W- we are not big believers in canned benchmarks. When I was at AMD, we had a team of 30 whose job it was to game benchmarks. When our CTO was at Sun, they had a team of 70 whose job (laughs) it was to game benchmarks. How long it takes your customer to train a model is the answer. And usually, a collection of models. If you're trying to do a 13 billion parameter model, mo- medium-sized, right, almost always, you begin at 100 million. And you do 100 million, you do a series of suites, right? You're trying to do your hyper-parameter tuning. So you do 100 to 300, 600, maybe a billion, and you're, you're trying to find the, the right optimization points. And I think how you do on all of those (laughs) is the answer. How long it takes you to move from customer model to distributed compute to finished run. And if you have to spend months doing distributed compute, doing tensor model parallel distributed compute, I mean, if you look at the back of some of these papers, they're crediting 20 or 30 people, sometimes more, who helped on the distributed compute. And if, if you don't need to do that, like you don't with our, our equipment, because we run strictly data parallel, um, that's all time that you get back (laughs) as a customer, right? That's all time that you get to train models and do interesting work rather than trying to think about how to... break up a large matrix multiply and distribute it over a collection of, of GPUs.
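      The sweep Andrew describes — tuning at 100 million, 300, 600, maybe a billion parameters before committing to the 13-billion run — can be sketched roughly as below. The sizes mirror the ones he names; the `pick_lr` heuristic is an invented stand-in for whatever hyper-parameter the team is actually tuning, not Cerebras's recipe.

```python
# Hypothetical sketch of a scaling sweep: tune hyper-parameters on a
# suite of small models before committing to the full-size run.
SWEEP_SIZES = [100e6, 300e6, 600e6, 1e9]   # parameters: 100M -> 1B
TARGET_SIZE = 13e9                          # the "medium-sized" 13B model

def pick_lr(n_params: float) -> float:
    """Toy heuristic: scale the learning rate down with model size."""
    return 6e-4 * (100e6 / n_params) ** 0.5

def run_sweep(sizes):
    """Return the candidate (size, lr) pairs a team might evaluate."""
    return [(int(n), pick_lr(n)) for n in sizes]

candidates = run_sweep(SWEEP_SIZES)
final_lr = pick_lr(TARGET_SIZE)   # carried forward to the big run
```

Each small run is cheap relative to the final one, so the suite amortizes the cost of finding the right optimization points.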

    13. EG

      Can you actually talk a little bit about the architecture of your chip relative to others?

    14. AF

      All right. It's a data flow architecture. It's comprised of about 850,000 identical tiles. Each tile is a processor and memory. And so, uh, it's a fully distributed memory machine. And so you have huge amounts of memory bandwidth because the memory is speaking to a, a, a processor that's one clock cycle away. So by data flow, it means that the cores, uh, wait until they get a token or a, uh, a flit that arrives. It tells them what to do, "Do work on this and where to send it," and then it sends it forward. And this is a, a, a particularly good architecture for, uh, the type of work AI is. Um, AI is a, is a flow, right? It, it's a learning system.
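      A minimal sketch of the data-flow idea as described: a core sits idle until a flit arrives telling it what to do and where to send the result. The `Flit` fields, the single `"mac"` opcode, and the two-tile chain are all invented for illustration; the real wafer has on the order of 850,000 tiles, each with its own memory one clock cycle away.

```python
# Toy data-flow tile: work is triggered by an arriving flit, and the
# result is forwarded onward rather than written back to shared memory.
from dataclasses import dataclass

@dataclass
class Flit:
    op: str        # what to do ("mac" = multiply-accumulate)
    value: float   # operand carried with the message
    dest: int      # which tile should receive the result

class Tile:
    def __init__(self, tile_id: int, weight: float):
        self.tile_id = tile_id
        self.weight = weight   # local memory, one clock cycle away
        self.acc = 0.0

    def receive(self, flit: Flit) -> Flit:
        """Do the work the flit names, then send the result forward."""
        if flit.op == "mac":
            self.acc += self.weight * flit.value
        return Flit(op="mac", value=self.acc, dest=flit.dest)

# Two tiles chained: tile 0's output flows into tile 1.
t0, t1 = Tile(0, weight=2.0), Tile(1, weight=3.0)
out = t1.receive(t0.receive(Flit(op="mac", value=1.5, dest=1)))
```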

  2. 7:03 - 16:58

    AI Advancement in Architecture and Training

    1. AF

      It's interesting, it was developed in, in the '80s at MIT by a guy named Arvind, and they didn't have a good workload for it, so it just sat there (laughs) , r- with nobody-

    2. EG

      Yeah.

    3. AF

      And then we got a good workload for it, and it's phenomenal for this. And that, that's a really interesting part of computer architecture, where a really cool architecture can, can have no interesting workloads for it. And, and then you get a good workload and, and now it's ready. Now it's good at this particular problem. We keep a huge amount of SRAM on the wafer, all right? And so there are no memory bandwidth problems ever. That also allows us to harvest sparsity, which is something that, that others really struggle with. The cluster we, we build keeps the, the parameters, uh, off-chip in a perimeter store, and it streams them in. And the result of, of this architecture is that, that we run strictly data parallel, whi- which means even in a 64-node cluster, you run the exact same configuration on each machine. Each see- machine sees the exact same weights, but they see a different portion of the data set. And it scales linearly, which is unheard of in compute.
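      The strictly data-parallel scheme described here — identical weights on every node, each node seeing a different slice of the data set — might be sketched like this. The one-weight model and toy `local_gradient` are illustrative only; a real system would exchange gradients with an all-reduce rather than a Python loop.

```python
# Sketch of strictly data-parallel training: every node runs the exact
# same configuration; only the data shard differs.

def shard(data, n_nodes):
    """Split the data set so each node sees a different portion."""
    return [data[i::n_nodes] for i in range(n_nodes)]

def local_gradient(weights, batch):
    """Toy gradient: mean difference between batch values and weights."""
    return [sum(x - w for x in batch) / len(batch) for w in weights]

def data_parallel_step(weights, data, n_nodes, lr=0.1):
    shards = shard(data, n_nodes)
    grads = [local_gradient(weights, s) for s in shards]  # same weights everywhere
    avg = [sum(g[i] for g in grads) / n_nodes for i in range(len(weights))]
    return [w + lr * g for w, g in zip(weights, avg)]

new_w = data_parallel_step([0.0], [1, 2, 3, 4], n_nodes=2)
```

Because each node does identical work on an equal slice of data, adding nodes divides the work per node, which is the intuition behind the linear scaling claim.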

    4. SG

      Can you, for a broader audience, describe, like, how this addresses some of the challenges of using traditional GPUs for large-scale machine learning?

    5. AF

      Sure. One of the real challenges of a traditional GPU is they fix the amount of memory on, on the interposer, right? So you can buy a GPU with either 40 or 80 gig. What if you want 120? What if you want 20?

    6. EG

      Oh.

    7. AF

      What if you want a bigger parameter store (laughs) , right? You gotta buy more GPUs even though you don't want those. And so what this strategy does is it allows you to support extraordinarily large models by disaggregating the memory from the compute. And in all GPUs, they're tightly coupled, and that means if you want more compute without memory, you'll still have to pay for the memory. If you want more memory without compute, you still have to pay for the compute. And th- that was an idea that, that came from supercomputing, uh, that we knew really well, th- that we could organize this so you could run an arbitrarily large net- a trillion-parameter network on a single system. Now, it would be slow. That's a very big network. But you could run it on one system. You can identify layers. You, you can debug. You can do all this. And you could never do that on a GPU, right? Because there's nowhere to store the parameters. And, and so by, by separating, uh, parameter storage from compute, we allowed you to mix the various ratios at your will, rather than at, at the GPU vendor's will.
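      The disaggregation idea — parameters in an external store, streamed to the compute one layer at a time — can be sketched as below. The `ParameterStore` class, layer names, and toy "layer" math are invented; the point is only that resident memory never exceeds one layer, however large the full model is.

```python
# Sketch of separating parameter storage from compute: weights live
# off-chip and stream in on demand, one layer at a time.

class ParameterStore:
    """External memory: holds every layer's weights off the compute die."""
    def __init__(self, layers):
        self.layers = layers               # layer name -> weight list

    def stream(self, name):
        return self.layers[name]           # stream one layer on demand

def forward(store, layer_order, x):
    """Run layer by layer; only one layer's weights are resident at once."""
    for name in layer_order:
        w = store.stream(name)             # bring this layer's weights in
        x = sum(wi * x for wi in w)        # toy stand-in for layer compute
    return x                               # weights for `name` now discardable

store = ParameterStore({"l0": [0.5, 0.5], "l1": [2.0]})
y = forward(store, ["l0", "l1"], 3.0)
```

Scaling the parameter store and scaling the compute are then independent purchasing decisions, which is the contrast being drawn with fixed-memory GPUs.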

    8. EG

      So I guess in terms of the relationship that you have with G42, there's a variety of different applications that you've looked at across healthcare, across a variety of different verticals. What do you view as some of the first use cases or best applications for your architecture?

    9. AF

      So the, the first thing th- that they enabled is a, a model called BTLM, which is the number one 3-billion-parameter model on Hugging Face right now with more than a million downloads. We have Fortune 100 companies running on it right now. We have them doing their own internal work on it. On Wednesday, we're announcing that we're open-sourcing, uh, with G42's group called Inception and MBZUAI, the largest Arabic LLM. So we're, we're putting that into the open source community. That was a collaborative effort with, with our strategic partners there. When you have a lot of compute (laughs) , the, the mind boggles at all the cool stuff you can do.

    10. SG

      Why did you guys get into the, um, training and open source model game?

    11. AF

      We had so much compute, and it, it was a way we could prove to the world how easy it was to train on us. W- we felt it was evidence th- that we could build and our systems could train the, the biggest and fastest models, the most accurate models in the space. In March, we put seven GPT models in the open source community. Everybody else was putting one. Why? Because it's really hard to redistribute work across a GPU cluster. For us, it's one keystroke, so we put seven. People were coming to us with extraordinary ideas. We had a customer who came and they said, "Look, we'd like you to, to design for us, uh, um, a model at 3 billion th- that we could prune and quantize such that it could be served off a cell phone." Really cool.

    12. EG

      What are the biggest milestones that you're thinking about going forward?

    13. AF

      There's been a lot of back and forth. Everybody thought we'd keep getting bigger. You know, north of 175 billion is a giant network. It's expensive to train, and maybe even more important, it's really expensive to serve... right? 17 and a half times more expensive to serve than a 10 billion parameter model. And so, I think a lot of the models in production are 13 billion and smaller. So, as an effort to keep the cost of production down, we are training, right now, a whole collection in, in the 30 to 175 category. I think there's going to be a spike in multilingual models, right? Stability just put out one in Japanese, there was one in Spanish that came out, uh, Inception, MBZUAI and, and Cerebras put out one in Arabic. I, I think these are underrepresented languages, and I, I think we have a challenge there, in that most of the big data sets are giant internet scrapes and most of the internet's in English. And, and so it's really hard to build big data sets in, in languages that aren't English. We don't have enough tokens. And so techniques to, uh, carefully think about how you can do that I think will be l- very much in demand. I think every n- nation wants to feel like its language and culture is, is captured fairly I- i- in an LLM.

    14. EG

      That's interesting because a lot of the training also seems to capture social norms versus data. And one could imagine as well, as you think about different regional models or different language-specific models, people want their own set of cultural norms to be part of the model in some sense, versus something that's imposed.

    15. AF

      That's exactly right. Language captures experience. I mean, there, there's words that, that are different and have very unique... I mean, it's... Just like the, the New York experience includes Yiddish, right? (laughs) Right?

    16. SG

      Mm-hmm.

    17. AF

      It will be important that, that we find ways to, to capture these cultural experiences, and we do that in language. And i- if you overwhelm the minority language with English, all that gets sort of sifted out.

    18. EG

      I think the Yiddish, uh, LLM would just complain all the time.

    19. AF

      That would be hilarious. (laughs)

    20. SG

      (laughs)

    21. AF

      That would be hilarious.

    22. EG

      (laughs)

    23. AF

      Yiddish LLM is just like, "What? Your brothers are doctors." (laughs)

    24. EG

      Yeah. Yeah. Every response is, "Why aren't you a doctor yet?"

    25. AF

      (laughs) Are you gonna finish your PhD, Andrew?

    26. EG

      (laughs)

    27. AF

      (laughs)

    28. SG

      (laughs) Oh, my God. No, don't worry, the Chinese LLM says that too.

    29. EG

      (laughs)

    30. AF

      Oh, it will. Th- th- that's right. Yeah, it's exact same, same mothers.

  3. 16:58 - 26:38

    Future of AI Accelerators and Chip Specialization

    1. AF

      and that's because a bug can cost you $20 or $30 million. (laughs) It's not like, "Yeah, I'll fix that in, in 4.1.1." You're gonna have to re-spin the chip if, if you have a big, uh... An- and so the commitment to, to tooling, to QA, to simulation is profoundly different in, in chip, in the chip world than, than when you're building, uh, software. Now, we still have to build a huge amount of software, and we're about 75% software. When you build a chip, the chip is the muscle and the software's the brain, and you still need a, a huge amount of, of software. You need low-level software, software that runs right against the hardware, and you need a compile stack that allows the, the user to write in PyTorch and just run as if, as if it were a GPU. And y- you end up with, uh, a lot of different types of engineers in a system company.

    2. SG

      Andrew, you have such a unique expertise and depth on a question that everybody is thinking about right now. Can you give our listeners an overview of, um, the AI accelerator market right now? Like, who are the big buyers and then how might they decide to do anything but NVIDIA?

    3. AF

      N- NVIDIA's made hay when the sun shines, and they're, they've done an extraordinary job and you have to take your hat off to them, right? I think they're now in a situation where they're, uh, extorting customers, they're extremely expensive, they're unable to ship. And that has, among other things, opened the door for many of us who, who, who have alternatives. Nobody likes being dependent on their vendor. You can ask the guys at, at Google and Facebook who were dependent on Intel for years, how much they hated that. They disliked that intensely. And so I, I think there's a, a battle between people's sort of dislike of being dependent, and the need just to keep running forward. I think large enterprises, uh, continue to be a good part of our business. I, I think often wh- where you can provide a little consulting to help them accelerate their model deployment i- is something that, that provides an opportunity for, for smaller companies to get in the door. And when you show them how, how much easier it is. We had a, a situation where they were trying to train on a GPU cluster, and they were at 60 days, and it wasn't converging, and we stood it up, and three and a half days later, their, their model converged. And they were like, "Holy cow." We have an internal cloud so customers can just jump on our cloud. They can begin training right away. If they want to try before they buy, they can begin with a little training run and, uh, go from there. We have customers in large enterprises and, and generative AI startups. We have customers in the military and the intelligence communities. We have customers in, uh, the super compute market and world. We sell up and down organizations of different sizes.

    4. EG

      H- how much specialization do you think there'll be in the semiconductor layer going forward? In other words, do you think there'll be increasingly different chip sets for inference versus training, for transformer-based architectures versus diffusion models, for multimodal versus not? I'm just sort of curious how you view all that.

    5. AF

      The fundamental question in computer architecture is, what do you make special and what do you make general? And the second fundamental question is, what are you good at and what are you not good at (laughs) ? What do you choose not to be good at, right? And those things are, are really important, you know. If you put in circuitry for transformers, that has a cost. There's a penalty every time you do a convolutional network. And so, uh, these trade-offs aren't free. You, you put in circuitry for one thing. If you're not doing that thing, that circuitry just sits idle, and you have wasted power and space. And so I, I think, uh, it, it is a, a TBD question on whether we will continue to see, um, more specialization at the chip level in training. We do not do it. Uh, we crack the problem in a different way. All of the problems are linear algebra and are predominantly sparse linear algebra. And so we focused on underneath, uh, the transformer. Now, whether you'll have different silicon for training and for inference, I think you will.

    6. SG

      So your view on this, Andrew, is less a strong point of view that there will be something beyond transformers that is very important in the near term and more, you've already solved it in a more general way.

    7. AF

      That's right. The trick in architecture is to, to, to solve hard problems in a general way so you don't have to rely on product management to sort of guess wh- what's the next cool layer type, right? If we go to a, a type of model that doesn't require very much data, if we go to single-shot learning, right, NVIDIA's totally out of luck (laughs) , right? Um, us too. Everybody. These machines are designed for extremely data-intensive, transformer-based models. But whether you're doing convolutional neural networks, or transformers, or diffusion models, or... Certainly, new things will be invented. I mean, we haven't run out of ideas. We haven't even run through the ideas that the guys put together in the '80s (laughs) , right? We still have plenty of work to do to improve our model-making. And a- as the, the, the industry matures, and we're still in the early innings, you, you might see more customization for a particular model type. But that's not my preferred approach.

    8. SG

      Are you making bets internally on separating training and inference?

    9. AF

      We make all sorts of bets, Sarah. We lose frequently, and we win sometimes (laughs) . And, uh, that's the, the real fun of building companies, uh, in a dynamic environment, right? I think inference, especially generative inference, right, is a very different problem than all other forms of inference, right? Basic classification is the exact same as the first pass in, in a training run. But when you're doing, uh, I- I- inference on generative AI, uh, it's not. And you have to hold a ton more information at the, the inference point. And so I, I think there will be... We will discover. We will invent new and different ways to attack that problem. Uh, it's extremely expensive right now. I mean, people using eight H100s to do inference on a big model, I mean, that's half a million dollars.

    10. EG

      You know, there is this ongoing, uh, crunch to supply for GPU and other sort of AI-centric compute. And that's causing different people to delay launching their companies or training their models, and those are often the same thing for different types of AI companies. What do you view as the main cause of this supply crunch, and when do you think it loosens up? Or when do you think things loosen in the market?

    11. AF

      The chip market has a profoundly inflexible supply chain. And so TSMC, their fab is a pyramid of modern society. It is the ab- absolute peak of manufacturing capability. It's a $20 billion building in which nobody works. And ideas come in, and out come chips that cost $100. I mean, the- these are unbelievably... U- un- unbelievably amazing things. But they can't turn on a dime. And so when somebody misses their forecast by a lot, and they come back to the fab and say, "I need twice as much or three times as much," often that takes six or eight months to, to sift through. And that's the problem that's coming in right now. I mean, it's, it's not just that Wall Street missed, uh, what NVIDIA would sell. NVIDIA missed it. They, they missed the forecast. And, uh, you can go to TSMC and you can ask them for wafers, but many of their wafers are already allocated to AMD. They're allocated to Qualcomm. They're allocated to other giant companies, right? And so you struggle forward. And so in addition to taking a long time to get to, to, to first customer learning, there's a huge premium placed on the ability to forecast accurately in the chip business.

    12. EG

      Given that we're going through this big discontinuity from a technology perspective, adoption perspective, et cetera, do you think we'll be accurate in forecasts going forward anytime soon, or do you think there's likely to be a crunch for the next couple years?

    13. AF

      No. We, we're gonna continue to get it wrong. We're gonna continue to get it wrong for a while.

    14. EG

      Do you think we're largely gonna underestimate it?

    15. AF

      Oh, yeah.

    16. EG

      So you, basically the market is accelerating rapidly. Everybody's underestimating it. And, you know, fundamentally, we all think that this is gonna be many times bigger than anybody thinks two, three years from now.

    17. AF

      Yeah. And, and the dynamics of our supply chain are when we get it wrong and you order too much, you still have to take it, right? You don't get to say, "Oh, guess what, guys? I, I really don't want those wafers anymore." And they're like, "See this $10 billion worth of wafers? Yours." (laughs)

    18. EG

      Yeah. That happened during crypto, right? So I think NVIDIA had a really big miss during crypto.

    19. AF

      It happened during crypto. Y- it happened at, it happened to Broadcom. I mean, Arista today is at 52-week lead times on switches.

    20. EG

      Mm-hmm. Given that you've been involved with training some incredibly performant open source models, and you mentioned that as you scale up a model, inference cost starts to really kick in as you get to bigger and bigger models from a parameter perspective, what do you view as the limits to scaling these models? Like, at what point do they get too big, or do you think we just keep scaling them until we have these sort of hyper models?

    21. AF

      It, it's a really interesting optimization problem. They get big. They get harder to work with. It's harder to retrain, right? Even when GPT-4 came out, it didn't have

  4. 26:38 - 30:02

    Scaling Open Source Models and Fine-Tuning

    1. AF

      any insight beyond 2021, and they had to race. I mean, that was 'cause it was so big, right? And, and so I think there are trade-offs. There are trade-offs between accuracy and size. There are trade-offs between size and cost to do inference. I mean, I, maybe I want a larger model for radiology files, right? Or for my doctors. Maybe I'll take a smaller model for my, for my chat or, or, or my customer service bot, right? We're gonna have to think about these in, in terms of w- what they're delivering, what the cost of being wrong is, what the cost of serving is, and, and, and think about this as a, as a business decision. And at first, everyone's just running, trying to say, "I can make a bigger model. I can make a bigger model." And then the guys who are trying to run businesses will be like, "Well, I can't afford t- to give that away. And so we're gonna be down at 3 and at 6 billion and at 13 billion, because that's... I can get pretty good, pretty darn good, and not break the bank with free inference." And so I, I think that there are these trade-offs that are, are really challenging and we're just beginning to grapple with now, uh.
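      The serving-cost arithmetic behind "17 and a half times more expensive" follows if you assume, as a rough simplification, that inference cost scales linearly with parameter count; batching, caching, and hardware choice all move the real numbers.

```python
# Rough cost comparison under a linear cost-per-parameter assumption.

def relative_serving_cost(params_a: float, params_b: float) -> float:
    """How much more model A costs to serve per token than model B."""
    return params_a / params_b

ratio = relative_serving_cost(175e9, 10e9)   # 175B vs. 10B parameters -> 17.5
```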

    2. EG

      I guess OpenAI, related to that, recently announced a partnership with Scale around fine-tuning. How important do you think fine-tuning will be going forward?

    3. AF

      I think fine-tuning will be really important. We're still just learning about the capacity of these models to, to hold information and to hold insight. And fine-tuning is an extremely exciting approach. I think we've seen again and again that, that, that human feedback makes these models vastly better. And that, that's not surprising, right? I mean, that's how you learned to dance. That's how you learned to do everything, soccer, gymnastics, ballet. Somebody saying, "No, that's wrong. Nope. Do it again." That correction as a mechanism for improvement s- somehow doesn't seem very surprising to me. I think w- we don't yet know how far fine-tuning is gonna take us and who, who will benefit from a model trained from scratch w- versus continuous training versus taking a model, uh, off the shelf and, and adding to it their unique data.

    4. SG

      You know, th- there is this point of view from some of the large labs that very few people will do training, and it will get quite concentrated. But I've heard you speak about how you think that there are really interesting data sets in different large enterprises. Tell us about that.

    5. AF

      You know, at, at first, everybody scraped the internet, but then everybody had that. And now, the, the companies that have extraordinary data sets, Reuters, Bloomberg, GlaxoSmithKline, AbbVie, right? Th- these companies that have years of research and exceptional data are, are gonna, I think, step into the fore. The AI has made progress to the point where people are looking around for exceptional data sets, unique data sets. The rise of, of data as an asset, I think it's the new gold. A- and I, I think that's something that, that isn't talked about enough and, and will be very much top of mind going forward.

    6. SG

      Find us on Twitter @nopriorspod. Subscribe to our YouTube channel if you wanna see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way, you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.

Episode duration: 30:03


Transcript of episode XDXv9nRN2bQ
