No Priors Ep. 11 | With Matei Zaharia, CTO of Databricks
At a glance
WHAT IT’S REALLY ABOUT
Databricks CTO on democratizing data, open LLMs, and AI’s future
- Matei Zaharia traces Databricks’ origins from UC Berkeley research and Apache Spark to a billion‑dollar cloud data and ML platform unifying data engineering, warehousing, and machine learning. He explains how Databricks and his Stanford lab are pushing scalable systems and language-model applications that combine LLMs with search, APIs, and other reliable data sources. A major focus is democratizing instruction-following models like Dolly using open-source foundations, challenging assumptions that only giant proprietary models can be conversational and useful. Zaharia also discusses enterprise needs, tooling gaps, where traditional ML still drives ROI, and why he believes model scale will commoditize while data quality, application design, and domain-specific systems become the real moat.
IDEAS WORTH REMEMBERING
Unified data and ML platforms reduce friction and unlock more value.
Databricks integrates data engineering, warehousing, BI, and ML so the same metrics and features are reused across analytics and models, avoiding data drift, duplication, and fragmented pipelines.
Instruction-following behavior does not require massive proprietary models.
Dolly, built by fine-tuning a modest 6B-parameter open-source model on high-quality instruction data, exhibits strong conversational and generative abilities, challenging the belief that GPT‑scale models are mandatory.
Data quality and task-specific supervision may matter more than sheer scale.
Zaharia argues the relationship between model size, data curation, and application design is not fully understood, and results like Dolly suggest carefully chosen and weighted data can unlock capabilities in smaller models.
Linking LLMs to reliable, up-to-date data sources is critical for production use.
To fix outdated knowledge and hallucinations, he emphasizes architectures where LLMs are coupled with search, documents, tables, and APIs (e.g., banking systems) so the model grounds answers in vetted information.
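The grounding architecture described above can be sketched in a few lines. This is a minimal illustration, not Databricks' actual implementation: the retriever is a toy keyword-overlap scorer standing in for a real search index or vector store, the documents are invented examples, and the prompt template is a hypothetical design.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query.
    (Toy scorer; a production system would use a search or vector index.)"""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(query: str, documents: list[str]) -> str:
    """Couple the LLM with retrieved context so it answers from vetted
    data rather than stale or hallucinated internal knowledge."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, say you don't know.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )

# Hypothetical vetted documents, e.g. from a bank's internal systems.
docs = [
    "The mobile app supports wire transfers up to $10,000 per day.",
    "Checking account fees are waived for balances above $1,500.",
    "Branch hours are 9am to 5pm on weekdays.",
]
prompt = build_grounded_prompt("What is the daily wire transfer limit?", docs)
```

The resulting `prompt` would then be sent to whatever model API is in use; the key point is that the model is instructed to answer only from retrieved, up-to-date sources, addressing both stale knowledge and confident-but-wrong answers.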
Traditional ML still drives substantial enterprise ROI in operations.
Forecasting, supply chain optimization, fraud detection, and other structured-data problems are widely deployed and often yield large financial gains, independent of newer conversational LLM capabilities.
WORDS WORTH SAVING
We really wanted to see whether it's possible to democratize this and to let people build their own models with their own data without sending it to some centralized provider.
— Matei Zaharia
It's been pretty surprising to a lot of researchers the size of model that still gets you this kind of instruction-following ability.
— Matei Zaharia
The two big problems with [current LLMs] are that the knowledge is not up to date and a lot of the things it says are inaccurate, and it's confident but wrong.
— Matei Zaharia
The core tech is getting commoditized very quickly… if you just want to run something like today's ChatGPT, it will be a lot cheaper.
— Matei Zaharia
Anything around a unique dataset or a unique feedback interaction you have is always good… that could eventually become a moat where others just can't easily catch up.
— Matei Zaharia
High quality AI-generated summary created from speaker-labeled transcript.