No Priors

No Priors Ep. 11 | With Matei Zaharia, CTO of Databricks

If you have 30 dollars, a few hours, and one server, then you are ready to create a ChatGPT-like model that can do what’s known as instruction following. Databricks’ latest launch, Dolly, foreshadows a potential move in the industry toward smaller, more accessible, but still highly capable AIs. Plus, Dolly is open source and requires less computing power and far fewer parameters than its counterparts. Matei Zaharia, Cofounder & Chief Technologist at Databricks, joins Sarah and Elad to talk about how big datasets actually need to be, why manual annotation is becoming less necessary to train some models, and how he went from a Berkeley PhD student with a little project called Spark to the founder of a company that is now critical data infrastructure increasingly moving into AI.

00:00 - Introduction
01:29 - Origin of Databricks
04:30 - Work at Stanford Lab
05:29 - Dolly and Role of Open Source
12:30 - Industry focus on high parameter count, understanding reasoning at small model scale
18:42 - Enterprise applications for Dolly & chat bots
25:06 - Making bets as an academic turned CTO
36:23 - The early stages of AI and future predictions

Sarah Guo (host) · Matei Zaharia (guest) · Elad Gil (host)
Apr 25, 2023 · 40m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 - 1:29

    Introduction

    1. SG

      Welcome to the podcast, Matei.

    2. MZ

      Thanks a lot. Excited to be here.

    3. SG

      Can you, um, start by telling us a little bit about the origins of Databricks and, um, how it led you to where you are today?

    4. MZ

      Sure, yeah. So, so Databricks started, uh, you know, from a group of seven researchers at UC Berkeley, uh, back in 2013 and, um, we were, um, really excited about, um, uh, democratizing, uh, basically the use of large datasets and of machine learning. So, uh, we had seen, um, you know, the web companies at the time were-were very successful with these things, but most other companies, you know, most other organizations, things like scientific labs and so on, uh, weren't. And we were really excited to look at making it easier to do computation on large amounts of data and also to do machine learning, uh, at scale with the latest algorithms. So we had started, um, you know, doing our research. We worked with some of the web companies. We also started open source projects, like most notably Apache Spark, which, you know, was essentially, you know, the first version of it was my PhD thesis and, uh, we had seen a lot of interest in these and we thought, um, you know, it would be great to start a company to really reach enterprises and- and make this type of thing much better and, um, you know, actually a- allow other companies to- to use this stuff.

    5. SG

      Can you just give us a sense of what Databricks looks like today from, like, a, you know, scale and product suite perspective?

    6. MZ

      Sure, yeah. So Databricks,

  2. 1:29 - 4:30

    Origin of Databricks

    1. MZ

      um, offers a pretty, um, you know, comprehensive data and ML platform in the cloud. It runs on top of the three, uh, major cloud providers, um, Amazon, Microsoft, and Google, and, uh, it includes support for, you know, data engineering, data warehousing, uh, machine learning, and y- most interestingly, all this is integrated into one product. So for example, you can have one definition of your business metric that you use in your BI dashboards and the same exact definition is used as a feature in machine learning and you- you don't have this drift or copying data, um, and, uh, you can just kind of go back and forth between these worlds. Um, the company has about s- um, 6,000 employees now and, uh, I th- last year we said that we crossed, uh, a billion dollars in ARR and we're continuing to grow. It's a, you know, it's a consumption-based cloud model where, you know, customers that are successful can- can grow over time and bring in new use cases and so on.

    2. SG

      Did you think the opportunity was as big as it has been when you started the company?

    3. MZ

      Well, yeah, we- we- we di- well, we definitely didn't, um, you know, anticipate necessarily to- to go to this size, right? It's y- uh, a lot of things can go wrong. But we were excited about the, um, the confluence of a few trends. So first of all, uh, you know, it's so easy to collect large amounts of data and people are doing a- automatically in, you know, many industries. Um, and second, uh, cloud computing makes it possible to scale up very quickly, do experiments, scale down and so on, which enables more companies to- to work with this kind of thing. And then the third one was machine learning. So we thought, you know, these are powerful trends and the exciting thing for, you know, us as a company is we- we didn't, like, we didn't invent cloud computing. We didn't, um, necessarily invent big data or anything, but we were able to start at a point in time when- when many companies were thinking to move, uh, into this space and just provide a great platform for that. And th- there's this migration already happening, um, and, you know, if you provide the best platform as people are migrating to the cloud, they'll- they'll consider it.

    4. SG

      You, uh, still keep roots in research. You have a research group at Stanford. Can you talk about that?

    5. MZ

      Yeah, yeah. So, um, I'm a computer science professor there, so I s- split my time between that and Databricks and, uh, we work on a bunch of things. We, uh, you know, usually like looking farther ahead into- into the future, um, and, uh, we've worked a lot on scalable systems for machine learning, how to do efficient training on lots of GPUs and- and stuff like that or how to do efficient serving. And then another thing I'm really excited about that we started about three years ago is looking at knowledge in terms of, um, applications where you combine a language model with, uh, something like a search engine or an API you call or something like that, and you try to- to produce a correct result, maybe for a complicated task, like do a literature survey and then, like, tell me, you know, what you found about this thing with- with a bunch of references or counterarguments or whatever. And I have a great group of PhD students that are working on that and, you know, are exploring

  3. 4:30 - 5:29

    Work at Stanford Lab

    1. MZ

      different ways to do it.

    2. EG

      How did, um, Databricks decide to start working on Dolly? Like, what- what- what sparked that and, you know, how did you first get going on that?

    3. MZ

      Yeah, so- so we- we'd had customers working with, um, uh, large language models of various forms, you know, e- even before ChatGPT came out and, uh, you know, but they were doing the more standard things like, um, translation or sentiment analysis or things like that. A lot of them were tuning models for their specific domains. I think we had, like, almost 1,000 customers that were using these in- in some form, but then when ChatGPT came out in November, it got people interested in, you know, using these for a lot more than just analyzing a bit of data and instead creating entire new interfaces or new types of, uh, computer applications, new experiences in them. Um, and so there was an intense interest in this, even at a time when, you know, the industry in general is being conscious about spending and, like, which things are really required and so on. This- this- this was an exciting

  4. 5:29 - 12:30

    Dolly and Role of Open Source

    1. MZ

      one. And the really exciting thing about, um, ChatGPT, as- as you both know, is the instruction following or basically the- the ability of it to kind of carry on a conversation and, like, you know, listen to the things you're telling it to do and do those as opposed to just completing text or just telling you a, you know, small amount of information, like, "This is a positive or negative sentiment." So we really wanted to see whether it's possible to democratize this and to let people build their own models, you know, with their own data without sending it to some- some centralized provider that's trying to sort of learn from everyone's data and, uh, you know, kind of control their- their destiny in- in this space. We were exploring different ways of doing it and, um, in particular, like, Dolly is, uh, partly based on this great result from, um, some- some other faculty members at Stanford called Alpaca where they tested a- a way to, you know, basically they- they used the model to generate a bunch of realistic conversations and then they used this to train another model that can now carry on conversation on its own. And so, uh, we tried essentially cloning that approach, but starting with an open source model, um, and it actually worked pretty well, and so that's- that's how it became Dolly. But yeah, we- we've been looking at this space for a while and seen, you know, incredible demand for, uh, these kinds of applications.
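
The recipe described here has two halves: generate realistic instruction/response pairs with an existing model, then fine-tune an open model on those pairs serialized into a fixed prompt format. A minimal sketch of the data-preparation half (the prompt template and field names are illustrative, not Databricks' actual format):

```python
# Sketch of Alpaca/Dolly-style instruction-tuning data prep.
# Each example pairs an instruction with the desired response;
# fine-tuning teaches the base model to continue the prompt
# with everything after "### Response:".

PROMPT_TEMPLATE = (
    "Below is an instruction. Write a response that completes it.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

def format_example(instruction: str, response: str) -> str:
    """Serialize one instruction/response pair into a training string."""
    return PROMPT_TEMPLATE.format(instruction=instruction, response=response)

def build_dataset(pairs):
    """Turn (instruction, response) pairs into fine-tuning records."""
    return [{"text": format_example(i, r)} for i, r in pairs]

pairs = [
    ("Summarize: Spark started as a Berkeley PhD project.",
     "Spark began as Matei Zaharia's PhD work at UC Berkeley."),
]
dataset = build_dataset(pairs)
print(dataset[0]["text"])
```

Each serialized string becomes one fine-tuning example; the generated pairs replace the expensive human-annotated conversations that instruction tuning originally relied on.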

    2. EG

      Yeah, the, I think the industry has really been, uh, very focused on scaling data, parameter size, and flops, and I think you all really have showcased the power of instruction following, even at, you know, something that's relatively smaller scale. Could you explain that and how that all works?

    3. MZ

      It's very interesting and I think there's actually a lot of research still to be done here because, um, these models have been mostly locked up in these, these, these very large companies for a while and everyone thought it's too hard to reproduce them. Um, so the- the interesting thing is language models had existed for a while. You basically trained them to- to complete words, you know, "Here's a missing word in the text, can you fill it in?" And then at the beginning when people tried to apply them to real applications, not just, you know, I- I erased a word on my homework, like, fill it back in, but, like, actual applications, um, they had always done various ways of, you know, training something else on top of, you know, say the feature representation in these. Um, and so there was a lot of domain specific work but you could build, like, a sentiment classifier or- or stuff like that, is it positive or negative, probably like three years ago now. Uh, OpenAI published a GPT-3 paper which is called Language Models are Few-shot Learners, and they said, number one, like, "We- we trained a language model on 170- 175 billion parameters and we- we trained it on, I think it's like 45 terabytes of text." So lots of data, lots of parameters, um, and it's, like, pretty good at language modeling. And then number two they said, "You can actually kind of prompt this with a few examples of a task and it picks up on the task and does it." Um, so lots of people were working on that, you know, how do you prompt it? What's the best example to show? Um, but everyone assumed that for that capability you need a giant model to begin with. So even the researchers in academia were calling into- into GPT-3 and trying to build, you know, stuff based on it and study this phenomenon. 
      And then last year, 2022, um, OpenAI published this other paper which was, um, sort of instruction tuning, uh, these models where they said, "Hey, we- we used some human feedback and then some reinforcement learning and we got this GPT-3 model to, um, actually just listen to one instruction. It doesn't need a complicated prompt with lots of examples," and it kind of works, and then they released a version of this as ChatGPT. So I think in a lot of people's minds, the- the scientific, you know, view of it was first you need a giant model and then you need this reinforcement learning thing and only then do you get this conversational ca- capability and broad world knowledge. So it's actually very surprising in Dolly, we just had a larger dataset of, you know, human-like conversations and we had this, um, you know, very kind of modest sized open source model, uh, that's only six billion parameters, only trained on, uh, less than one terabyte of text, so, like, 50 times less data than GPT-3, and it still has this behavior. It's, uh, I think it's been pretty surprising to a lot of, um, you know, researchers the size of model that still gets you this kind of instruction following ability. So I think, uh, there's kind of an open research problem, like, what exactly about these datasets is it that makes them good at this? What are the limitations? You know, are there tasks that these are clearly worse at or better at? It's actually kind of hard to evaluate with long answers 'cause it's hard to, like, automatically score them and say, you know, like, "This is a good Seinfeld skit that you generated and this is, like, a bad, you know, Barack Obama speech." So, but I think we'll figure this out, yeah.
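
The contrast between GPT-3-style few-shot prompting and single-instruction prompting can be made concrete by comparing the two prompt shapes. A hypothetical illustration (neither template is taken from OpenAI's papers):

```python
# Two prompting styles for the same task. A base language model
# needs worked examples in the prompt (few-shot); an
# instruction-tuned model can follow one bare instruction.

def few_shot_prompt(examples, query):
    """GPT-3 style: show several solved examples, then the new input."""
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{demos}\nInput: {query}\nOutput:"

def instruction_prompt(instruction, query):
    """Instruction-tuned style: a single natural-language instruction."""
    return f"{instruction}\n\n{query}"

examples = [("great movie!", "positive"), ("terrible plot", "negative")]
print(few_shot_prompt(examples, "loved the soundtrack"))
print(instruction_prompt("Classify the sentiment of this review.",
                         "loved the soundtrack"))
```

The surprise Matei points to is that the second, much more convenient style turned out not to require a GPT-3-sized model once the training data contained instruction-shaped conversations.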

    4. EG

      Were there any things, uh, that emerged from the model, um, that you also found surprising? Like, you mentioned one aspect of it just in terms of the approach you took and, you know, with, uh, dramatically more limited data and approach you ended up with really performant behavior. Were there other things that were unexpected properties of- of what you did with Dolly?

    5. MZ

      Mm. Yeah, I think to me the- the most interesting thing is, um, it's, um, it's surprisingly good at just free form, like, kind of fluent text generation. So, um, you can tell it to, like, create a story or create a tweet or create a scientific paper abstract and it does a pretty good job at that, and before that, whenever I talked to my, you know, NLP, like, researcher friends, they thought that that creativity was the thing that required a lot of parameters from something like GPT-3.

    6. EG

      Mm-hmm.

    7. MZ

      Like, they actually told me, "Oh, the knowledge-intensive stuff like remembering facts, telling me the capital of, like, France and whatever," that's not surprising that a small model with a few parameters can do it, but the- the creativity, that's, like, really hard. So this one is actually pretty good at- at the creativity and generation, it's less good at remembering lots of facts, which kind of makes sense given the parameters. So if you ask it about common topics, you know, it'll be good. If you ask it, like, the author of a book, you know, it might give the wrong one. I think we- we had an example 'cause we've actually been building a- a slightly bigger version of this too and we had this, um, this question with, like, who is the author of, um, Snow Crash which is, uh, Neal Stephenson, and the initial Dolly model said Neil Gaiman. So, you know, it's still a Neil, it's still a- a-

    8. EG

      It's close. A sci-fi writer.

    9. MZ

      ... an author but it's the wrong Neil.

    10. EG

      It's still sci-fi, yeah, yeah. (laughs)

    11. MZ

      Yeah, it's still kind of-

    12. EG

      Yeah.

    13. MZ

      Yeah. So- so- so it's less good at remembering facts but pretty good at coherent, um, uh, sort of generation.

    14. EG

      Yeah. The name Dolly basically references the first cloned mammal, Dolly the sheep.

    15. MZ

      Mm-hmm.

    16. EG

      Um, can you explain the reference within the AI space?

    17. MZ

      Yeah. So it's- it's based on, you know, cloning this other, uh, model f- from Stanford called Alpaca but doing it with an open dataset, so, uh, and- and that itself was based on something that, uh, Meta released I think maybe three weeks ago or less, uh, called LLaMA which is ... They took a modest sized model, seven billion parameters,

  5. 12:30 - 18:42

    Industry focus on high parameter count, understanding reasoning at small model scale

    1. MZ

      and they trained it on, um, a ton of data. I think, um, they said 1.4 trillion tokens or something like that which is, um ... I don't know how many bytes of data it was but it was i- multiple terabytes of data basically. Um, and they said, "Hey, by just training this for longer we got a small model that's actually producing pretty high quality content for its size." Um, so there were all these kind of wooly sort of animals out there and we thought it's just too perfect to, like, clone it and th- there are all these other things like, you know, it's, uh, like the Dolly Llama, I don't know, there- there are all these, like, things.
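
As a rough sanity check on that "multiple terabytes" figure, assuming about 4 bytes of English text per token (a common rule of thumb, not a number from Meta's paper):

```python
# Back-of-envelope conversion from training tokens to raw text size.
tokens = 1.4e12          # LLaMA's reported training token count
bytes_per_token = 4      # rough rule of thumb for English text
terabytes = tokens * bytes_per_token / 1e12
print(f"{terabytes:.1f} TB")  # 5.6 TB
```

So 1.4 trillion tokens is indeed on the order of several terabytes of text, consistent with the estimate in the conversation.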

    2. EG

      Yeah, that is a great name. That's a good name.

    3. MZ

      Yeah.

    4. EG

      Yeah.

    5. MZ

      So ...

    6. EG

      And then, um, are there other things that you can share that you all have, uh, coming in the background at Databricks or at your Stanford lab in terms of this more general area of language models?

    7. MZ

      Yeah. I mean, at Databricks, definitely, you know, we're using everything we, we learned from Dolly and we're learning from our customers to, you know, to just offer a great suite of tools for training and, and operating LLM applications. We already have a popular, um, ML ops, um, platform, and we, we also have this open source project called MLflow that, uh, integrates with a lot of tools out there that our, our offering is built around. Um, so you can expect some, some nice integrations into that. Um, you know, separately, we're also working on Databricks product features that use language models internally, um, and learning a lot from developing those and, and, you know, feeding that into our products. So I think in the, in the next few months, you can expect it, and we also have this big user conference, um, Data + AI Summit, coming up, uh, in June that will probably have, um, you know, a lot of stuff about this. Um, and I would say as, um, you know, as a researcher and also kind of with my Databricks hat on, the, the thing I'm most excited about is really connecting these models with, um, reliable data sources and, and making them really produce reliable results. 'Cause if you, you know, if you use ChatGPT or GPT-4, the two big problems with it are, number one, like the knowledge is not up to date, you know. It's, i- it's only, it only knows stuff it was trained on. And number two, a lot of the things it says are inaccurate, and it's confident but, like, wrong in various ways. And I think you can tackle both of these by combining some kind of language model with, um, you know, a system that, that, you know, pulls out, like, vetted data, either from documents, like a search engine, or from, um, you know, APIs and tables and stuff like that inside your company. You know, like for example, when I talk to the chatbot in my bank, it should know my latest bank account balance and transactions and stuff.
You know, if I'm like, "Can you, can you cancel the payment I made 'cause I unsubscribed?" it should just know what that means. So cracking how exactly to do that isn't easy. Um, uh, uh, it may actually be easier with small models than with big ones to, to, to reduce hallucination from them, but it, you know, I think it's still an open question. But I think if we can figure this out, then these become a much more reliable component in a, in an application.
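
The pattern sketched here, grounding the model in vetted data pulled at query time, is usually called retrieval augmentation. A toy sketch with a keyword-overlap retriever standing in for a real search engine (the documents and scoring function are purely illustrative):

```python
# Minimal retrieval-augmented prompting sketch: fetch the most
# relevant vetted documents for a question, then ground the model's
# prompt in them instead of relying on stale trained-in knowledge.

def score(doc: str, query: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(doc.lower().split()) & set(query.lower().split()))

def retrieve(docs, query, k=2):
    """Return the k documents most relevant to the query."""
    return sorted(docs, key=lambda d: score(d, query), reverse=True)[:k]

def grounded_prompt(docs, query):
    """Assemble a prompt that instructs the model to answer from context."""
    context = "\n".join(f"- {d}" for d in retrieve(docs, query))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

docs = [
    "Account balance as of today: $1,204.33",
    "Subscription payment of $9.99 made on April 2",
    "Branch hours: 9am-5pm weekdays",
]
print(grounded_prompt(docs, "cancel the subscription payment I made"))
```

A production system would swap the word-overlap scorer for embeddings or a search index, but the shape is the same: retrieved facts go into the prompt, so the answer can cite current data rather than hallucinate.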

    8. SG

      Maybe we'll go from there to just, like, projecting a little bit about, like, architecture and research. Um, you know, so much of the industry is focused on model scaling, right, improving reasoning-

    9. MZ

      Mm-hmm.

    10. SG

      ... that way. Like, how much do you think that matters i- in terms of, I guess, like, real world usage and production with your customers in the near term?

    11. MZ

      Mm-hmm. Yeah. Great question. So, to me at least, the relationship between scale of the model versus, um, you know, quality of the data and supervision you put in, um, versus, like, design of an application around it, uh, and those things and, like, overall quality, I think the relationship is not 100% clear yet. Like, to get a really reliable, uh, model that say, I don't know, can, can, um, you know, like, make a pharmacy prescription or something like that, maybe you need, um, a trillion parameters, you know. Maybe you actually need a, a really carefully designed dataset and, like, supervision process, which is kind of tr- traditional sort of ML engineering type work. Um, or maybe you actually need a c- clever application where, like, you're, you're chaining together a couple of models and things and you're saying, "Well, does this make sense? Can I find a reference? Um, can I show this example to a human if it's really hard?" Um, so I think it's, it's a little bit open. The, the thing I can say for sure, e- especially and, and Dolly and, like, other, you know, results like this really, um, highlighted is it does seem that the core tech w- is getting commoditized very quickly. So just, if you just wanna run, you know, something like today's ChatGPT, um, it will be a lot cheaper 'cause all these hardware manufacturers are building devices that are, that are specialized and much cheaper. Um, and an- another thing that's making it less expensive is we're figuring out ways to get a smaller model with less data, fewer parameters and stuff to get similar performance. So that I think is happening, uh, faster than at least I would have thought, um, you know, a few months ago. Um, so, uh, at least to get something with today's capabilities, I think it'll be, uh, you know, it'll be very affordable and you might just be able to run it locally on, you know, your phone or something. Uh, the question of how large can... 
You know, if you make a much larger model, is it gonna be a lot smarter? I think it's still a bit unknown. I mean, there are people who argue it's going to be very good at reasoning, but at the same time, this kind of token by token generation we're doing now is not an amazing format for reasoning because you have to, like, linearly, like, do one, say one thing at a time. Um, so it's not really good for, like, making plans or comparing versions. I think to get a really smart application you'll need to combine today's language modeling with some, some other sort of framework around it that, um, uh, you know, uses it multiple times or explores a planned space or whatever, um, and then you might get something good. And it's also possible that the very largest models are simply memorizing more stuff, so like, they're impressive in terms of trivia, like, I can ask it about some random topic and it'll know, but they're not really, like, smarter at solving even a basic, um, you know, word problem. Um, so yeah. I- I'm not sure. It- un- unfortunately, especially with training from the web, it's often very hard to tell apart, like, reasoning from, uh, memorization essentially, did

  6. 18:42 - 25:06

    Enterprise applications for Dolly & chat bots

    1. MZ

      it see that thing before. So it's, um, I- I think actually being able to do experiments where you train these on carefully selected data and... will, will lead to better understanding of, like, what they can do.

    2. SG

      Yeah. Yeah. That makes sense. Um, maybe if we think a little bit, just 'cause you have great visibility from your, your role at Databricks, like, what other tooling do companies need, like your enterprise customers or just generally c- um, enterprises need to make use of these models? 'Cause you said, you know, we believe the core technology, the models themselves are getting commoditized.

    3. MZ

      Yeah. So I think, so definitely w-... the first piece is you need a, a data platform that could actually build, you know, reliable data, right? So we think that's, that's like the, uh, you know, the, the bread and potatoes of, like, getting anything. You, you, you need some (laughs) , you know, a basis to, like, sort of build on. So we think that will become really important and, you know, maybe data platforms will have to, uh, evolve a little bit to, to be better at supporting unstructured data like text and images and so on, um, and, and to do quality assessment and stuff like that for it. Uh, that's one piece. Um, I think a- another piece you need is you need the, the ML ops piece of, like, being able to experiment with things, deploy them, um, A/B test them and so on, and, and see what does better, and i- and improve it incrementally. Um, and I also think these models will need a good connection to, um, operational systems inside the company to do really powerful things with, like, the latest data. So, you know, th- th- you saw, probably, the, the support for tools in, in ChatGPT. Uh, you know, before that, there were lots of groups working on at least models integrated into search engines, sometimes calling other tools as well, like calculators. Um, I think it's still a little bit, uh, open-ended. There's one extreme where people say the model will figure out what tools to use on its own. I think for, like, enterprise use cases, that's a little bit, like, more than you really need. You know, you can kind of give it some tools and feed it stuff and it doesn't have to discover and, like, read the manual to figure out which one to use. Uh, but yeah, I think that's another piece you'll need for, like, really powerful applications. And then I do think infrastructure, like just basic training and, and serving infrastructure is important too when you start to care about performance, like about latency and speed. 
And you can see some of the, um, you know, new search engines using these, um, models are n- not that fast, right? Uh, like a little bit slow. You know, it, it would be nice to have it faster. And for automated analytics, it's even more important that it's efficient, so there could be... I think there'll be a lot of activity there. Yeah.
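
The "give it some tools" approach can be sketched as a fixed registry that the application controls, with the model's structured output choosing among the entries. The model call is stubbed out below, and the tool names and APIs are hypothetical:

```python
# Sketch of enterprise-style tool use: instead of letting the model
# discover tools on its own, hand it a fixed registry and execute
# whichever tool its structured output names. A real system would
# parse the LLM's response into the `tool_call` dict shown here.

def get_balance(account_id: str) -> str:
    return f"Balance for {account_id}: $1,204.33"  # stand-in for a real API

def cancel_payment(payment_id: str) -> str:
    return f"Payment {payment_id} cancelled."      # stand-in for a real API

TOOLS = {"get_balance": get_balance, "cancel_payment": cancel_payment}

def dispatch(tool_call: dict) -> str:
    """Run the tool the model asked for, with the arguments it supplied."""
    name = tool_call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**tool_call["args"])

# Pretend the model responded with this structured tool call:
model_output = {"name": "cancel_payment", "args": {"payment_id": "p-42"}}
print(dispatch(model_output))  # Payment p-42 cancelled.
```

Keeping the registry closed, as suggested above, sidesteps the harder open-ended problem of tool discovery: the application decides what the model may touch, and the model only fills in which tool and which arguments.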

    4. NA

      Where do you see enterprises getting the most value from investing in, I guess, more traditional ML and then, like, um, some of-

    5. MZ

      Mm-hmm.

    6. NA

      ... the language model stuff today?

    7. MZ

      Yeah, great question. So traditional ML, we're seeing actually virtually all, uh, major enterprises, you know, and all industries are using it. Um, it's, it's changed a lot in the past decade actually, so, um, and, um, it's, it's very good for forecasting things in general and for, um, automating certain types of decisions. So for example, optimizing your supply chain, right? You, you don't have time to look at, like, exactly everything that's going on but, um, and, and, you know, think about it and have a meeting. But, um, you know, if you do order, like, the right amount of, like, parts to meet your demand this week or if you minimize the amount of time, you know, an agricultural product, like, sits in a warehouse and, like, you lose s- you know, it degrades in quality or stuff like that, um, it matters a lot, and it can, it can have a huge impact on, um, you know, on, uh, the profitability of a company. So we're seeing a lot of that, people applying it to automate, you know, uh, supply chain and, and to, to automate basically their, their operations in various ways. And, and then there are more classic use cases like fraud detection and stuff like that where also, you know, it's always an arms race and, like, you're trying to, to do the best you can because every percent of, like, accuracy you do better in can, can translate into a, you know, huge impact. Um, with, um, with language models specifically, um, and especially with kind of conversational ones, um, the really exciting thing is interfaces to people, and I think customer support is a very obvious one. Uh, maybe things like recommendations or asking questions on a product page, you know, in retail, uh, things like search augmented with stuff is one. And we've also found that just internal apps in a company that have a lot of internal data can benefit from this kind of thing. 
So, like, one of the things, you know, we've, we've built, for example, is inside Databricks, we have all these resources for, you know, engineers to understand the, the s- you know, how different parts of the product work, how to operate it, like, all the APIs. And, you know, people used to just ask each other questions in these Slack channels for each team, um, and we could use that data, like, the questions and answers plus the, the data, you know, and the actual documentation to, you know, essentially automatically answer many, many such questions and just save people a lot of time. Um, so the... uh, I do think that any app that has kind of business data or, like, stuff written by humans in it, like, um, you know, like your, um, issue tracker for your software development or, like, your sales force or something like that, um, could benefit from, you know, these, these kind of interfaces. Yeah.

    8. EG

      Yeah, it seems like any type of forum or anything else instantly becomes, like, data that you can use to fine-tune or train a model that's specific to your sp- your, your customer support use case, or you could use an embedding or something to, to do interesting things with it, so it seem, it seems like there's some really cool stuff to do. Are there any specific areas that, um, Databricks is not focused on that you think would be especially interesting for somebody to build from a tooling perspective for, um, enterprises trying to-

    9. MZ

      Mm-hmm.

    10. EG

      ... use some of these technologies?

    11. MZ

      Yeah, I'm, I, I think there are a lot of these. I think it's very early on. Um, so, uh, probably one of the most obvious ones is, um, is just the domain or vertical-specific models and tools, and I, I think, I actually think, um, even a lot of the, the enterprises that, like, have a lot of the data in various domains might turn more into data or model vendors of some form in the future, uh, you know, as, as, as they use this to, like, build something that no one else can. So I wouldn't be surprised at all if you see, like, the next, you know, wave of companies for say, um, security analytics or, like, you know, biotech or, or, you know, uh, analyzing financial data or stuff like that, um, really built around, um, LLM, uh, technology in there.

    12. EG

      Mm-hmm.

    13. MZ

      Um, and I also

  7. 25:06 - 36:23

    Making bets as an academic turned CTO

    1. MZ

      think j- in general in the app development space, like, how do you develop apps that incorporate these tools? Um, it's, uh, it's very open. It's not clear what the best way to do it is, and, you know, you might end up with, like, really good programming tools that, um, that focus on this problem. Uh, I would say, you know, for people thinking about startups and so on, like, you, you want your startup to have, um, you know, a long-term defensible moat, ideally something that grows over time also. So anything around a unique dataset, for example, or a unique, like, feedback interaction you have is, um, is always good, right? Like, ho- honestly, even something like adding ML features in your product that just kind of learn from your users and, you know, do better recommendation and so on could eventually become a moat where, like, you know, others just can't easily catch up. Um, but I think the, you know, anything that's around custom data sets is sort of safest.

    2. EG

      When, when you were working on, um, Spark for your, for, uh, your PhD, did you think you'd become a founder? Was your intention to start a company, or did you just think it was interesting research to do, or both?

    3. MZ

      No, i- it really wasn't. Yeah, I mean, wha- as a grad stu- you know, I've always been interested in just, like, doing, you know, things that help people, that have, have an impact, help people do cool things. And, um, you know, I, I, I had seen these open source technologies out there for distributed data processing. I thought, "Okay, well, I'll try to start one and see how it goes." You know, wha- I wasn't sure that people would really pick it up and use it. But I wasn't looking to be a, a founder necessarily. I was just looking to do something useful in this, like, emerging space and... Honestly, I thought, like, "Hey, if I'm..." you know, I, I wanted, I was at least considering to be a computer science professor and I thought, "If I'm gonna be a professor and all the most exciting computing is happening in data centers today and, like, I don't know how that works, how am I gonna teach, you know, computer science to, to people? Um, so I better learn about that stuff." Um, uh, but it turned out to be something, you know, more wildly interesting. Yeah.

    4. EG

      What, what was the most unexpected thing about being a founder?

    5. MZ

      There are a lot of, uh, challenges along the way. I think just being able to learn about all the e- aspects of a business and, and how much complexity there is in each one, you know, starting out as a more technical person, at first I, you know, I didn't really know what to expect that, but there's a ton of depth in each one and if you understand them, if you, like, really try to understand them, get to know the culture of people there, like, really get to know what they're thinking about, you can make, uh, much better decisions across, you know, uh, mult- multiple aspects of your company.

    6. EG

      Mm-hmm. Is there anything that you would advise people coming from a similar background to yours? I, I have a PhD as well, although it's in biology and-

    7. MZ

      Mm-hmm.

    8. EG

      ... I feel like there's certain things that I learnt in academia that was really valuable and then there's a bunch of stuff I really needed to unlearn as I, as s- as I went into-

    9. MZ

      (laughs)

    10. EG

      ... industry. Are there specific pieces of advice you'd give to technical founders or PhD founders in terms of things that they should unlearn?

    11. MZ

      Hm. Well, I think you, you should, um, th- unlearn, like, a lot of research, at least in computer science, the, the kind of stuff that I've worked on, a lot of research is basically, it's mostly prototyping. It's, like, can we showcase an idea-

    12. EG

      Yeah.

    13. MZ

      ... but it's not really software engineering of, like, we'll build a thing that can be maintained and, like, runs flawlessly in the future and, like, supports, you know

    14. SG

      Mm-hmm.

    15. MZ

      ... problems. So, I think you should kind of unlearn just the focus on short-term stuff and think about how is this going to go over time. Eventually, right? There is a phase of the company where you're just prototyping to get a good fit, but you should design things so they can evolve into, you know, into something that's very reliable long term. The other thing is, um, you know, I, I think unlearn trying to invent everything from scratch. You, you should look, you should really be careful about, like, "Hey, where am I doing something unique?" Or, or, "If I'm doing something different from others, like, why is it, right?" Uh, uh, uh, you know, don't do it just for kicks. So, yeah, 'cause in research, it's very tempting to say, you know, "I did this new thing. I'm gonna, you know, I'm gonna try all the fanciest, like, new ideas in each component of it."

    16. SG

      Was there something that you guys, like, experimented with being, like, you know, first principles unique about that you then said, you know, there are systems for this?

    17. MZ

      Mm-hmm. A good one early on was, um, was just deployment infrastructure for, like, how do we deploy and, and update our software across, you know, all the clouds and so on?

    18. SG

      (laughs)

    19. MZ

      And we soon realized-

    20. SG

      Mm-hmm.

    21. MZ

      ... it's better to, to go with really standard things like, uh, Kubernetes and tools like that th- than, than to try to do something custom because they're evolving very quickly. Um, so yeah, tha- that's kind of a good example where, like, you, at the beginning you say, "Ah, how hard can it be? You know, let's just build something." Uh, but then you realize, wait, every, every month there's, like, new stuff coming out and maybe this isn't where we wanna focus on.

    22. SG

      So, maybe just thinking about being, like, CTO now of-

    23. MZ

      Mm-hmm.

    24. SG

... a very large company, like, how has your lens as a researcher, a computer science researcher, informed your thinking as a CTO?

    25. MZ

      Yeah, I think first of all, uh, uh, as a researcher, like, you do learn to, um... You, you, you, you think a lot about the long-term trends, like, what, you know, what could things look like five or 10 years from now or what's the, what's kind of the fundamental things here. So for example, this thing about LLMs being commoditized and, um, uh, or honestly, the, the thing about them kind of maxing out on more parameters, I think many, many people hadn't really thought about that. But if, if you think back, like, you know, um, i- there's, there is a lot of room to improve efficiency usually in hardware and software for an application. And this particular application is kind of simple, uh, because it is all basically like, you know, two or three different types of matrix operations. So, like, it's sort of the hardware designer's dream to do this stuff. Um, and it... And also, there's usually, there are usually diminishing returns from scale, um, in, in terms of quality of, of models in general. And you can also kind of see it in other areas, like in, in computer vision for example, we don't have, you know, trillion parameter models. You got, you know, actually pretty small models that you can train for specific tasks that are good. Um, and self-driving cars is another example. Uh, you know, they rapidly improved in quality up to a point and then they kind of plateaued and they're still not really, you know, ready for prime time. Eventually you hit some limits.

    26. SG

      You know, there are plenty of people who, um, are researchers in the field who, uh, don't really see an asymptote, right?

    27. MZ

      Mm-hmm.

    28. SG

      Um, with scaling and so where do you believe that limit comes from? Like parameters, compute, data? Something else?

    29. MZ

      I just think a lot of things, like, scale, um, sublinearly in, in general. Now, it's hard to tell for, you know, things like reasoning and so on, but certainly in, in kind of w-... classical machine learning, like for example, if, if you're trying to learn, um, a function that, like, separates positive and negative examples, then the- as you add more data, like, your, your accuracy doesn't really improve linearly. Like with, uh, you know, with a few examples, you get a pretty good estimate of that boundary and then with more of them, it gets a little bit better but, uh, it doesn't get, like, that much better. So, it's just, I think it's common. That would be my main, um, my main reason. Um, now I think the one thing, so with, with language models specifically, I think the part that does go linearly with more parameters, or should, is the ability to just memorize more stuff. So, if you want it to tell you, like, who was on the fifth episode of, like, Friends and, like, what was the second line they said and stuff like that, like yeah, more parameters will get you a neural network that has, that just by putting that input, like, can tell you that stuff. Uh, but that wasn't that interesting to me because I think the right solution for that is look things up in a, in a database. Like, do, do retrieval, right? Do a search index. I think actually, I think from a computation perspective it's very inefficient to have like a trillion parameters and have to actually load them all and add and multiply by them each time you make an inference, 'cause they're just encoding knowledge most of which you don't need for that inference. So, so that one I wasn't as excited about, but I think other people, there are people who are just excited about neural networks, like how do... 
You know, it's the same kind of people who wonder, like, how do brains work, like, how do animals learn, who are just excited about, "Wait, I only had some neurons and I put in the stuff and it remembered it." Um, but a- as an engineer I'm not that excited 'cause I'm like, "Yeah, I could've built a database that did that." But in terms of like-
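[Editor's note: the sublinear-scaling point above can be illustrated with a toy experiment. Everything here is illustrative, not from the episode: we learn a 1-D decision boundary (true threshold at 0.5) from labelled examples, the kind of classical setup MZ describes. Error shrinks as data grows, but each tenfold increase in data buys a much smaller absolute accuracy gain than the last.]

```python
import random

random.seed(0)

def estimate_boundary(n):
    """Learn a 1-D decision boundary (true boundary at 0.5) from n
    labelled uniform samples: take the midpoint between the largest
    negative example and the smallest positive example."""
    xs = [random.random() for _ in range(n)]
    pos = [x for x in xs if x >= 0.5]
    neg = [x for x in xs if x < 0.5]
    if not pos or not neg:
        return 0.5  # degenerate sample (all one class); fall back
    return (min(pos) + max(neg)) / 2

def mean_error(n, trials=2000):
    """Average distance between the learned and true boundary."""
    return sum(abs(estimate_boundary(n) - 0.5) for _ in range(trials)) / trials

for n in (10, 100, 1000):
    print(f"n={n:5d}  mean boundary error={mean_error(n):.4f}")
```

The error falls roughly like 1/n, so the jump from 10 to 100 examples removes almost all of the remaining error and the next tenfold increase buys comparatively little, which is the diminishing-returns shape MZ is pointing at.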

    30. SG

      (laughs)

  8. 36:2340:25

    The early stages of AI and future predictions

    1. MZ

      we always think when we have an idea it's sort of a race to, like, figure out is it a good idea and can I publish it? Because the research community values novelty a lot, being the first to do something. You know, for better or worse. It's not amazing but if you just reproduce a thing that someone else did, unfortunately you don't get as much credit. So, uh, so we do think about, "How can we quickly validate something?" Um, but at the same time, and even in research I had, you know, the same thing. I, I, you try to pick topics that will matter. Like, for example, in, when I was doing my PhD I didn't do a ton with machine learning and, you know, I knew, I knew people who did it, I helped them out, I built infrastructure, but I didn't do ML research myself. And then, uh, you know, later I kind of decided, like, "Yeah, I am going to do some things especially around this, like, you know, connecting machine learning to external data sources like search engines." Um, and I know it's gonna take a while to, like, really learn about it and get an intuition and stuff, but I think this is gonna matter long term because I think the local, like, you know, parsing semantics of what the sentence means is kind of solved already, and the interesting thing will be, like, you know, doing this in a, in a bigger system.

    2. SG

Yeah. I have four degrees and no PhD. I've never contributed anything to the, uh, corpus of the, um, the world's knowledge. Uh, Elad, got to ask. Uh, does it affect how you do investing?

    3. EG

      No, not really.

    4. SG

      The PhD-

    5. EG

      No, not really.

    6. SG

      ... moved for you? Nice. (laughs)

    7. EG

      Yeah. I don't know. I, I have a math degree as well, and I feel like that actually was a thing that forced me to think slightly differently, or at least it forced a way of very logic... I mean, it felt like there's a groove in your brain for logic that gets carved.

    8. MZ

      Mm-hmm.

    9. SG

      Yep.

    10. EG

      So that, that probably helped, but who knows? I don't know.

    11. SG

      You've been working in data machine learning for a long time. Like, uh, where do you think we are in this generation of, uh, of AI?

    12. MZ

      Mm-hmm. Yeah, I think we're still at the early stages of, um, um, AI on, on unstructured data, so things like text and, and images and so on, really having an impact in applications. So I think, you know, ChatGPT-related features that every application is going to add will, will change the way we, um, you know, w- we work with computing. Um, and they'll also change data analytics to some extent because you'll be able to use this data. Um, and honestly, I also think that in terms of building, like just basic, you know, data infrastructure and ML infrastructure, we're still pretty early also. It's still, um, uh, you know, ma- many different tools you have to hook together, a lot of complex integration, um, and you need a lot of sort of specialized people to do it. And I think, uh, over time, like, I increasingly think that basically, e- especially because of the capabilities of a- these AI models, every software engineer will need to become an ML engineer and a data engineer also as they build their application. And we'll, we'll figure out ways of doing them, recipes or abstractions or whatever, that are actually easy enough for everyone to do. And, uh, one analogy I, I like is, um, you know, when I was learning programming, which was sort of like, you know, mid, late '90s, um, I, I got these books on, you know, w- web applications, and it was very complicated. There was a book on MySQL, there was a book on Apache web server. Like, CGI bin, all these things you have to hook together. And now m- you know, most developers can make a web application in, like, one function, and even non-programmers can make something like Google Forms or Salesforce or whatever that's sort of, you know, basically it is a custom application. So I think we're far away from that in data and ML, but i- it could sort of look like that. It's a- it's harder because it depends on the sort of static data that you've got sitting around. 
But, um, um, I do think there's a- you know, th- there are gonna be a lot more of these applications, yeah. (instrumental music plays)
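[Editor's note: as a rough sketch of the "one function" web application MZ contrasts with the '90s MySQL/Apache/CGI-bin stack, Python's standard library alone suffices today. The handler name and response text are made up for illustration.]

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Hello(BaseHTTPRequestHandler):
    """A complete web application in one handler method: no separate
    web server, database, or CGI layer to install and configure."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello from one function")

    def log_message(self, *args):
        pass  # silence per-request logging for the demo
```

Running `HTTPServer(("", 8000), Hello).serve_forever()` serves it. The point is the contrast with the shelf of books MZ mentions, not the handler itself; the analogous collapse in data/ML tooling is what he is predicting.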

    13. SG

      Matei, this is a great conversation. Thanks for joining us on No Priors.

    14. MZ

      Thanks a lot, Sarah and Elad.

    15. EG

      Thanks so much.

Episode duration: 40:25

Transcript of episode sCHGWRlydJ8
