The Twenty Minute VC

Douwe Kiela: Why Data Size Matters More Than Model Size; Why Open Source Isn't Going to Win | E1032

Douwe Kiela is the CEO of Contextual AI, building the contextual language model to power the future of businesses. Last month Contextual closed a $20M funding round including Bain Capital, Sarah Guo, Elad Gil and 20VC. He is also an Adjunct Professor in Symbolic Systems at Stanford University. Previously, he was Head of Research at Hugging Face, and before that a Research Scientist at Facebook AI Research.

In Today’s Episode with Douwe Kiela We Discuss:

  1. Founding a Foundation Model Company in 2023: How did Douwe make his way into the world of AI and ML over a decade ago? What are some of his biggest lessons from his time working with Yann LeCun at Meta? How does Douwe’s background in philosophy help him in AI today?

  2. Foundation Model Providers: Challenges and Alternatives: What are the biggest problems with existing foundation models? Will there be one to rule them all, and how does the landscape play out? Why does Douwe believe OpenAI’s data acquisition strategy has been the best?

  3. Data and Models: Size and Structure: Why does Douwe believe it is naive to think the open approach will beat the closed approach? What are the biggest downsides to the open approach? Does model size still matter today, and what matters more? How important is access to proprietary data, and are VCs naive to turn down founders for lacking it?

  4. Regulation and the World Around Us: How does Douwe expect the regulatory landscape around AI to play out? Why is Europe the worst when it comes to regulation, and will this be different this time? How does Douwe analyse Elon Musk’s petition to pause the development of AI for six months? Do founders building AI companies have to be in the Valley?

Douwe Kiela (guest) · Harry Stebbings (host)
Jun 29, 2023 · 53m

At a glance

WHAT IT’S REALLY ABOUT

Data Over Parameters: Rethinking AI Models, Moats, Risk and Regulation

  1. Douwe Kiela traces his path through Meta FAIR, Hugging Face, and into founding Contextual, using that journey to frame how modern language models are built, evaluated, and deployed.
  2. He argues that data scale and quality now matter more than raw model size, giving an advantage to players like OpenAI with superior data flywheels, while still leaving room for specialized startups.
  3. The conversation dissects why current LLMs are not yet enterprise-ready (hallucinations, lack of attribution, compliance, privacy, and latency) and why architectures like retrieval-augmented generation (RAG) are the next step.
  4. They also tackle open vs closed models, AI moats, existential risk narratives, regulation (especially in the EU), and how and when enterprises will truly adopt AI at scale.

IDEAS WORTH REMEMBERING

5 ideas

Data scale and quality now trump raw model size for performance.

Kiela points to results like LLaMA, which showed that smaller models trained longer on more and better data can outperform larger under-trained models, making data acquisition and curation the primary competitive lever.
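
As a rough, back-of-the-envelope illustration of this point (the approximate figures below come from the public GPT-3, Chinchilla, and LLaMA papers, not from the episode): comparing training tokens seen per parameter makes the "under-trained" gap visible.

```python
# Back-of-the-envelope comparison of training-data scale per parameter.
# Numbers are approximate public figures from the GPT-3 and LLaMA papers;
# this illustrates the "data over parameters" point, it is not a benchmark.
models = {
    "GPT-3 175B": {"params": 175e9, "train_tokens": 300e9},
    "LLaMA 13B":  {"params": 13e9,  "train_tokens": 1.0e12},
}

for name, m in models.items():
    ratio = m["train_tokens"] / m["params"]
    print(f"{name}: {ratio:.1f} training tokens per parameter")

# Output:
#   GPT-3 175B: 1.7 training tokens per parameter
#   LLaMA 13B: 76.9 training tokens per parameter
# The Chinchilla scaling results suggest roughly 20 tokens per parameter is
# compute-optimal, so GPT-3 was heavily under-trained for its size, and the
# far smaller but data-rich LLaMA 13B outperformed it on most benchmarks.
```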

The biggest moats belong to players with superior data flywheels, notably OpenAI.

OpenAI’s access to unique datasets (e.g., transcribed audio, ChatGPT interaction logs) plus economies of scale in serving models constitute a deep moat, contradicting claims that “OpenAI and Google have no moat.”

Current LLMs have structural blockers for serious enterprise use.

Hallucinations, opaque reasoning (no attribution), inability to update or delete knowledge, privacy concerns over sending proprietary data to third‑party servers, and latency/inefficiency all limit safe deployment in regulated environments.

Architectures like Retrieval Augmented Generation (RAG) are key to enterprise‑grade AI.

By decoupling memory (retrieval over external data) from generation, RAG reduces hallucinations, gives explicit attribution, allows dynamic updates and deletions, improves efficiency, and cleanly separates the data plane from the model plane for better privacy.
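To make that decoupling concrete, here is a minimal, self-contained sketch of the RAG pattern (an illustration of the general idea, not Contextual AI's architecture; the toy corpus, bag-of-words scoring, and prompt format are all assumptions for the example): documents live in an external store that can be updated or deleted at any time, a retriever selects the most relevant passages, and the generator only ever sees those passages together with their source ids, which is what enables attribution.

```python
# Minimal sketch of retrieval-augmented generation (RAG): memory lives in an
# external, updatable document store; the generator only sees retrieved
# passages, so answers carry attribution and knowledge can be added or
# deleted without retraining. Illustrative only, not Contextual AI's system.
import math
from collections import Counter

# The "data plane": documents can be inserted or deleted at any time.
documents = {
    "doc1": "The 2023 travel policy caps hotel spend at $250 per night.",
    "doc2": "Expense reports must be filed within 30 days of travel.",
}

def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[tuple[str, str]]:
    # Rank documents by similarity to the query; a real system would use
    # learned embeddings instead of this toy bag-of-words scorer.
    q = bag_of_words(query)
    ranked = sorted(documents.items(),
                    key=lambda kv: cosine(q, bag_of_words(kv[1])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    # Each passage is cited by id, giving the generator explicit attribution.
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query))
    return (f"Answer using only the sources below, citing them by id.\n"
            f"{context}\nQuestion: {query}")

print(build_prompt("What is the hotel spending limit?"))
# Deleting doc1 removes that knowledge immediately, with no retraining:
# del documents["doc1"]
```

Swapping the toy scorer for learned embeddings and passing the prompt to an actual language model yields the production version of the same separation between the data plane and the model plane.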

Open source will thrive, but is unlikely to lead at the frontier.

Kiela describes a pyramid: cheap open models at the bottom, expensive frontier models at the top, and commercially interesting mid‑sized models in the middle—arguing open models won’t reach the very top because training frontier systems is simply too costly.

WORDS WORTH SAVING

5 quotes

Data size matters even more than model size.

Douwe Kiela

OpenAI and Google have a giant moat because it’s really all about data.

Douwe Kiela

Hallucinations are fine if you’re doing creative writing; they’re unacceptable if you’re running an enterprise‑critical system.

Douwe Kiela

It’s still very early innings. We haven’t settled on a lot of things that need to be solved before this technology is really ready.

Douwe Kiela

I would like it to be true that open source could keep up with the frontier, but I think that’s just incredibly naive.

Douwe Kiela

TOPICS

Douwe Kiela’s background at Meta FAIR, Hugging Face, and founding Contextual
Enterprise limitations of current LLMs: hallucinations, attribution, compliance, privacy, latency
Data vs model size: scaling laws, data flywheels, and competitive advantage
Proprietary data, pretraining, fine-tuning, and RLHF in building systems like ChatGPT
Open source vs closed frontier models and the emerging “pyramid” of AI models
Evaluation, data contamination, and adversarial testing of language models
Regulation, existential risk narratives, and the trajectory of enterprise AI adoption

High-quality AI-generated summary created from a speaker-labeled transcript.
