Douwe Kiela: Why Data Size Matters More Than Model Size; Why Open Source Isn't Going to Win | E1032

The Twenty Minute VC · Jun 30, 2023 · 53m

Douwe Kiela (guest), Harry Stebbings (host)

- Douwe Kiela’s background at Meta FAIR, Hugging Face, and founding Contextual
- Enterprise limitations of current LLMs: hallucinations, attribution, compliance, privacy, latency
- Data vs model size: scaling laws, data flywheels, and competitive advantage
- Proprietary data, pretraining, fine‑tuning, and RLHF in building systems like ChatGPT
- Open source vs closed frontier models and the emerging “pyramid” of AI models
- Evaluation, data contamination, and adversarial testing of language models
- Regulation, existential risk narratives, and the trajectory of enterprise AI adoption

In this episode of The Twenty Minute VC, Harry Stebbings talks with Douwe Kiela about why data size matters more than model size, why open source isn’t going to win at the frontier, and how AI models, moats, risk, and regulation are evolving.

Data Over Parameters: Rethinking AI Models, Moats, Risk and Regulation

Douwe Kiela traces his path through Meta FAIR, Hugging Face, and into founding Contextual, using that journey to frame how modern language models are built, evaluated, and deployed.

He argues that data scale and quality now matter more than raw model size, giving an advantage to players like OpenAI with superior data flywheels, while still leaving room for specialized startups.

The conversation dissects why current LLMs are not yet enterprise‑ready—hallucinations, lack of attribution, compliance, privacy, and latency—and why architectures like retrieval‑augmented generation (RAG) are a next step.

They also tackle open vs closed models, AI moats, existential risk narratives, regulation (especially in the EU), and how and when enterprises will truly adopt AI at scale.

Key Takeaways

Data scale and quality now trump raw model size for performance.

Kiela highlights results like LLaMA showing that smaller models trained longer on more and better data can outperform larger under‑trained models, making data acquisition and curation the primary competitive lever.
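The compute-optimal trade-off behind this point can be sketched numerically. The snippet below uses the Chinchilla heuristic of roughly 20 training tokens per model parameter (an approximation from the scaling-laws literature, not a figure Kiela gives in the episode) to show why LLaMA-style training on ~1T tokens goes far beyond the "compute-optimal" budget for a 7B model, trading extra training data for a smaller, cheaper-to-serve model.

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rough compute-optimal token budget: ~20 tokens per parameter
    (Hoffmann et al. heuristic; a sketch, not an exact law)."""
    return n_params * tokens_per_param

# A 7B-parameter model is "compute-optimal" at roughly 140B tokens,
# yet LLaMA-7B was trained on ~1T tokens -- far past that point --
# which is exactly the "smaller model, more data" trade Kiela describes.
llama_7b_params = 7e9
print(chinchilla_optimal_tokens(llama_7b_params))  # → 140000000000.0 (~140B tokens)
```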

The biggest moats belong to players with superior data flywheels, notably OpenAI.

OpenAI’s access to unique datasets (e. ...

Current LLMs have structural blockers for serious enterprise use.

Hallucinations, opaque reasoning (no attribution), inability to update or delete knowledge, privacy concerns over sending proprietary data to third‑party servers, and latency/inefficiency all limit safe deployment in regulated environments.

Architectures like Retrieval Augmented Generation (RAG) are key to enterprise‑grade AI.

By decoupling memory (retrieval over external data) from generation, RAG reduces hallucinations, gives explicit attribution, allows dynamic updates and deletions, improves efficiency, and cleanly separates the data plane from the model plane for better privacy.
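The RAG properties listed above can be made concrete with a minimal sketch. This is not Contextual's architecture, just an illustration of the pattern: documents live in a store that can be updated or deleted without retraining, retrieval uses a toy word-overlap score (real systems use dense embeddings), and the retrieved passages give the answer explicit attribution.

```python
def relevance(query: str, doc: str) -> int:
    """Toy relevance score: number of shared lowercase words.
    A real system would use vector similarity instead."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

class RagStore:
    """The 'data plane': documents kept outside the model."""

    def __init__(self):
        self.docs = {}  # doc_id -> text

    def upsert(self, doc_id: str, text: str):
        self.docs[doc_id] = text  # dynamic updates, no retraining

    def delete(self, doc_id: str):
        self.docs.pop(doc_id, None)  # GDPR-style deletion

    def retrieve(self, query: str, k: int = 2):
        ranked = sorted(self.docs.items(),
                        key=lambda kv: relevance(query, kv[1]),
                        reverse=True)
        return ranked[:k]

def build_prompt(store: RagStore, query: str) -> str:
    """Grounds a (hypothetical) LLM call in retrieved passages;
    citing doc IDs in the prompt is what enables attribution."""
    hits = store.retrieve(query)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return f"Answer using only these sources:\n{context}\nQ: {query}"
```

Because the store is separate from the model, adding, revising, or deleting a document takes effect on the very next query, which is the compliance property Kiela argues pure parametric LLMs lack.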

Open source will thrive, but is unlikely to lead at the frontier.

Kiela describes a pyramid: cheap open models at the bottom, expensive frontier models at the top, and commercially interesting mid‑sized models in the middle—arguing open models won’t reach the very top because training frontier systems is simply too costly.

Evaluation and security around models are underdeveloped and represent major opportunities.

We lack robust, non‑contaminated evaluation methods and increasingly rely on GPT‑4 to grade other models; Kiela sees room for “Moody’s/S&P for AI” and for a new generation of security companies focused on prompt injection, data poisoning, and model‑driven actions.
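The contamination problem mentioned here can be illustrated with the simplest common check: flagging an evaluation example whose word n-grams appear verbatim in the training corpus. Real decontamination pipelines are far more sophisticated (normalization, fuzzy matching, scale), so treat this as a sketch of the idea only.

```python
def ngrams(text: str, n: int = 8):
    """Set of word n-grams; 8-grams are a common unit for contamination checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(train_text: str, eval_text: str, n: int = 8) -> bool:
    """Flag an eval example if any of its n-grams occur verbatim in training data."""
    return bool(ngrams(train_text, n) & ngrams(eval_text, n))
```

A benchmark item that fails this check tells you nothing about generalization, which is why Kiela argues evaluation itself is an open opportunity.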

AI risk and regulation narratives are often shaped by incumbent self‑interest.

He views existential‑risk talk as a very low‑probability but heavily amplified concern that can justify regulation benefitting big labs, while warning that EU‑style overregulation could stifle innovation and entrench current leaders.

Notable Quotes

Data size matters even more than model size.

Douwe Kiela

OpenAI and Google have a giant moat because it’s really all about data.

Douwe Kiela

Hallucinations are fine if you’re doing creative writing; they’re unacceptable if you’re running an enterprise‑critical system.

Douwe Kiela

It’s still very early innings. We haven’t settled on a lot of things that need to be solved before this technology is really ready.

Douwe Kiela

I would like it to be true that open source could keep up with the frontier, but I think that’s just incredibly naive.

Douwe Kiela

Questions Answered in This Episode

If data is the core moat, how can smaller startups practically build defensible data flywheels against giants like OpenAI and Google?

What concrete standards or frameworks should emerge for trustworthy, non‑contaminated evaluation of language models across different use cases?

Where is the natural boundary between what should be open‑source in AI and what should remain proprietary for safety, economic, or competitive reasons?

How should regulators distinguish between realistic, near‑term AI risks (e.g., security, bias, misuse) and low‑probability existential scenarios when crafting policy?

In an enterprise context, how do you decide when to rely on a closed frontier model, a specialized mid‑size model, or an open‑source model you run yourself?

Transcript Preview

Douwe Kiela

So there are a couple of just really big issues. Hallucination, these models make things up with very high confidence. Attribution, we don't know why they're saying what they're saying. We can't really trace it back to anything. There's compliance issues, so we can't really remove information from them, uh, which is kind of tricky from a GDPR perspective for example. We can't revise information. We can't keep it up to date. There's massive data privacy issues where you have to send your very valuable company data. If you're an enterprise, you have to send that through somebody else's servers.

Harry Stebbings

(digital music) Douwe, I am excited for this. Listen, we've chatted before. We've known each other for a while, but thank you so much for joining me today.

Douwe Kiela

Yeah. Thanks very much for having me on the show. I'm a big fan.

Harry Stebbings

Oh, it's very, very kind of you. I literally paid you $25,000 to say that but, um... (laughs)

Douwe Kiela

(laughs)

Harry Stebbings

Um, but my question to you is, it's such a hot space and there's very few people who've actually been in it for a while. You are one of them. How did you first make your way into the world of ML and NLP?

Douwe Kiela

Yeah. My, my journey has been a little bit unusual actually. So, um, uh, when I was in high school in the Netherlands, um, I wanted to be a cool kid during the day but at night I was secretly fascinated by computers. Um, so I started off as a script kiddie wanting to hack other people's computers and I figured out that if I really wanted to do that, I had to learn to code. So I taught myself to code. Uh, figured I needed to understand operating systems, so I made my own operating system, uh, with bootloader and everything, um, when I was 16. Uh, so then by the time, uh, I had to go to college and, and go study something, I, I thought I already knew everything about computer science. Uh, so I decided to study philosophy instead, um, and, and so that was really a, a very, uh, radical departure from what I had been interested in at the time, um, but it was fascinating, uh, learning a lot about the mind and language and things like that. It, it's... I, I use it still every day, I think. Um, but then, uh, at some point in my career they... it became clear that I had to start making money so I needed a real job, and philosophy is not really a real job. And I did some logic in between foundations of math which is also not really a real job. So I decided to study computer science after all. Uh, so I went to Cambridge, uh, in the UK. I had a fantastic time there, um, and, and so that's really where I started doing NLP, natural language processing, and, um, one of my internships was done at Microsoft Research in New York with a very famous researcher called Leon Bottou, who is Yann LeCun's kind of, um, I wouldn't say sidekick because that doesn't really do justice to, to what he's done. He's invented like stochastic gradient descent and things like that, so one of the godfathers of deep learning. And I had the opportunity to work with him, um, and, uh, that was really an amazing time. 
So afterwards when Yann and Leon started FAIR, Facebook AI Research, uh, I joined that out of my PhD and that really kicked off my career.
