
No Priors Ep. 11 | With Matei Zaharia, CTO of Databricks
Sarah Guo (host), Matei Zaharia (guest), Elad Gil (host), Narrator
In this episode of No Priors, hosts Sarah Guo and Elad Gil talk with Matei Zaharia, CTO of Databricks, about democratizing data, open LLMs, and AI's future.
Databricks CTO on democratizing data, open LLMs, and AI’s future
Matei Zaharia traces Databricks’ origins from UC Berkeley research and Apache Spark to a billion‑dollar cloud data and ML platform unifying data engineering, warehousing, and machine learning. He explains how Databricks and his Stanford lab are pushing scalable systems and language-model applications that combine LLMs with search, APIs, and other reliable data sources. A major focus is democratizing instruction-following models like Dolly using open-source foundations, challenging assumptions that only giant proprietary models can be conversational and useful. Zaharia also discusses enterprise needs, tooling gaps, where traditional ML still drives ROI, and why he believes model scale will commoditize while data quality, application design, and domain-specific systems become the real moat.
Key Takeaways
Unified data and ML platforms reduce friction and unlock more value.
Databricks integrates data engineering, warehousing, BI, and ML so the same metrics and features are reused across analytics and models, avoiding data drift, duplication, and fragmented pipelines.
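The "define once, reuse everywhere" idea can be sketched in plain Python. All names below are illustrative, not Databricks APIs: a single metric function feeds both a dashboard aggregate and an ML feature vector, so the two can never drift apart.

```python
# A single metric definition shared by analytics and ML.
# Everything here is an illustrative sketch, not a Databricks API.

def gross_margin(revenue: float, cost: float) -> float:
    """One canonical business-metric definition."""
    return (revenue - cost) / revenue

orders = [
    {"revenue": 120.0, "cost": 90.0},
    {"revenue": 200.0, "cost": 140.0},
]

# BI dashboard: aggregate the metric across orders.
dashboard_value = sum(
    gross_margin(o["revenue"], o["cost"]) for o in orders
) / len(orders)

# ML pipeline: the exact same function produces a per-order feature,
# so the dashboard and the model can never disagree on the definition.
features = [[gross_margin(o["revenue"], o["cost"])] for o in orders]
```

Because both consumers call the same function, updating the metric's definition updates dashboards and model features in one place.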
Instruction-following behavior does not require massive proprietary models.
Dolly, built by fine-tuning a modest 6B-parameter open-source model on high-quality instruction data, exhibits strong conversational and generative abilities, challenging the belief that GPT‑scale models are mandatory.
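Instruction tuning of the kind Dolly used comes down to rendering (instruction, response) pairs into single training strings before fine-tuning. A minimal sketch of such a template follows; the exact wording and delimiters are assumptions for illustration, not the format the Dolly team actually used.

```python
# Minimal instruction-tuning data formatting, in the spirit of
# Dolly-style fine-tuning. The template is illustrative only.

RECORDS = [
    {
        "instruction": "Summarize: Spark began as a UC Berkeley research project.",
        "response": "Spark originated in research at UC Berkeley.",
    },
]

TEMPLATE = (
    "Below is an instruction. Write a response that completes it.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

def to_training_example(record: dict) -> str:
    """Render one (instruction, response) pair as a training string."""
    return TEMPLATE.format(**record)

examples = [to_training_example(r) for r in RECORDS]
```

A few thousand such strings, fed to a standard causal-LM fine-tuning loop over an open 6B-parameter base model, is roughly the recipe the episode describes.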
Data quality and task-specific supervision may matter more than sheer scale.
Zaharia argues the relationship between model size, data curation, and application design is not fully understood, and results like Dolly suggest carefully chosen and weighted data can unlock capabilities in smaller models.
Linking LLMs to reliable, up-to-date data sources is critical for production use.
To fix outdated knowledge and hallucinations, he emphasizes architectures where LLMs are coupled with search, documents, tables, and APIs. ...
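The coupling he describes is essentially retrieval-augmented generation: fetch trusted, current documents first, then constrain the model to answer only from them. A toy sketch follows, with naive keyword overlap standing in for a real search index; the scoring function and prompt wording are assumptions for illustration.

```python
# Toy retrieval-augmented prompt assembly: ground the LLM in
# retrieved documents instead of its frozen training knowledge.
# The keyword scorer below stands in for a real search index.

DOCS = [
    "Databricks crossed one billion dollars in ARR last year.",
    "Apache Spark began as Matei Zaharia's PhD thesis at UC Berkeley.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by naive word overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(words & set(d.lower().split())))
    return scored[:k]

def build_prompt(query: str) -> str:
    """Assemble a prompt that restricts the model to retrieved context."""
    context = "\n".join(retrieve(query, DOCS))
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_prompt("Where did Apache Spark begin?")
```

Because the context is fetched at query time, updating the document store updates the model's effective knowledge without retraining.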
Traditional ML still drives substantial enterprise ROI in operations.
Forecasting, supply chain optimization, fraud detection, and other structured-data problems are widely deployed and often yield large financial gains, independent of newer conversational LLM capabilities.
Core LLM tech will commoditize; durable moats come from data and domain focus.
With faster specialized hardware and more efficient training methods, ChatGPT-level capabilities will become cheap and local; long-term defensibility will likely come from unique datasets, domain-specific models, and feedback loops.
Every software engineer will increasingly need ML and data skills.
Zaharia predicts data and ML will become as standard in application development as web stacks did, with higher-level abstractions making it accessible but still requiring engineers to think like data and ML practitioners.
Notable Quotes
“We really wanted to see whether it's possible to democratize this and to let people build their own models with their own data without sending it to some centralized provider.”
— Matei Zaharia
“It's been pretty surprising to a lot of researchers the size of model that still gets you this kind of instruction-following ability.”
— Matei Zaharia
“The two big problems with [current LLMs] are that the knowledge is not up to date and a lot of the things it says are inaccurate, and it's confident but wrong.”
— Matei Zaharia
“The core tech is getting commoditized very quickly… if you just want to run something like today's ChatGPT, it will be a lot cheaper.”
— Matei Zaharia
“Anything around a unique dataset or a unique feedback interaction you have is always good… that could eventually become a moat where others just can't easily catch up.”
— Matei Zaharia
Questions Answered in This Episode
How should an enterprise decide when to fine-tune its own smaller model versus calling a large proprietary API like GPT‑4?
What concrete architectures work best today for combining LLMs with search indexes and APIs to minimize hallucinations?
Where does Zaharia expect model scaling to truly plateau in terms of reasoning ability, and what empirical evidence would convince him otherwise?
How can a startup systematically build and protect a unique dataset that becomes a durable moat in an era of commoditized models?
What practical steps can traditional software engineers take to transition into the kind of data/ML-centric engineering role Zaharia envisions?
Transcript Preview
Welcome to the podcast, Matei.
Thanks a lot. Excited to be here.
Can you start by telling us a little bit about the origins of Databricks and how it led you to where you are today?
Sure, yeah. So Databricks started from a group of seven researchers at UC Berkeley back in 2013, and we were really excited about democratizing the use of large datasets and of machine learning. We had seen that the web companies at the time were very successful with these things, but most other companies and organizations, things like scientific labs and so on, weren't. And we were really excited to look at making it easier to do computation on large amounts of data and also to do machine learning at scale with the latest algorithms. So we had started doing our research, we worked with some of the web companies, and we also started open source projects, most notably Apache Spark, whose first version was essentially my PhD thesis. We had seen a lot of interest in these, and we thought it would be great to start a company to really reach enterprises, make this type of thing much better, and actually allow other companies to use this stuff.
Can you just give us a sense of what Databricks looks like today from, like, a, you know, scale and product suite perspective?
Sure, yeah. So Databricks offers a pretty comprehensive data and ML platform in the cloud. It runs on top of the three major cloud providers, Amazon, Microsoft, and Google, and it includes support for data engineering, data warehousing, and machine learning, and most interestingly, all of this is integrated into one product. So for example, you can have one definition of your business metric that you use in your BI dashboards, and the same exact definition is used as a feature in machine learning, so you don't have this drift or copying of data, and you can just go back and forth between these worlds. The company has about 6,000 employees now, and last year we said that we crossed a billion dollars in ARR, and we're continuing to grow. It's a consumption-based cloud model where customers that are successful can grow over time and bring in new use cases and so on.
Did you think the opportunity was as big as it has been when you started the company?
Well, we definitely didn't anticipate necessarily growing to this size, right? A lot of things can go wrong. But we were excited about the confluence of a few trends. First of all, it's so easy to collect large amounts of data, and people are doing it automatically in many industries. Second, cloud computing makes it possible to scale up very quickly, do experiments, scale down, and so on, which enables more companies to work with this kind of thing. And the third one was machine learning. So we thought these are powerful trends, and the exciting thing for us as a company is that we didn't invent cloud computing, and we didn't necessarily invent big data or anything, but we were able to start at a point in time when many companies were thinking of moving into this space, and just provide a great platform for that. There's this migration already happening, and if you provide the best platform as people are migrating to the cloud, they'll consider it.