Alex Wang: Why Data Not Compute is the Bottleneck to Foundation Model Performance | E1164

The Twenty Minute VC · Jun 12, 2024 · 1h 6m

Alexandr (Alex) Wang (guest), Harry Stebbings (host)

- Data vs. compute as the true bottleneck in foundation model performance
- The 'data wall', frontier data, and synthetic/human-in-the-loop data generation
- Enterprise data strategy, on-prem deployments, and data as a competitive moat
- Economic implications: model commoditization, AI services, pricing, and the 'end of software' thesis
- Global competition: China's data posture, industrial policy, and AI as a military asset
- Regulation and pro-data policy frameworks in democracies
- Company-building lessons: hiring for 'people who give a shit', controlled headcount, and media strategy

In this episode of The Twenty Minute VC, Harry Stebbings sits down with Scale AI CEO Alexandr (Alex) Wang to discuss why data, not compute, is now the bottleneck to foundation model performance.

Alex Wang: Data, Not Compute, Now Limits AI’s Next Breakthroughs

Scale AI CEO Alex Wang argues that foundation model progress has plateaued not because of insufficient compute, but because the industry has hit a ‘data wall’ after exhausting most high-quality internet text. He introduces the concept of “frontier data” — rich, task- and reasoning-centric datasets drawn from enterprises, experts, agents, and synthetic generation — as the core bottleneck and main moat for future AI systems.

Wang predicts data strategy will define durable competitive advantage among model labs and enterprises, drive a shift toward on‑prem and highly customized deployments, and reshape software pricing away from per‑seat to consumption- and outcome-based models. He also warns that China’s centralized industrial policy and permissive data regime could let it overtake the U.S. in AI, with profound military and geopolitical implications.

On the business side, he discusses why the real value may accrue above and below the model layer, why hypergrowth in headcount was a mistake, and how Scale maintains a ‘Navy SEALs, not Navy’ talent bar even at 800 employees. He closes with views on media, founder brand, regulation, open vs closed models, and why building data infrastructure will remain a non‑commoditized, long‑term opportunity.

Key Takeaways

Compute alone no longer delivers step-change gains; data quality and breadth are now the main constraint.

Despite a massive surge in GPU spending since GPT‑4, no dramatically superior base model has appeared, suggesting that simply scaling compute without new data and algorithms hits diminishing returns.

The internet is ‘used up’ for pretraining; future gains require frontier data, not more web crawl.

Most easily crawlable text is already in large models, but real-world economic reasoning — internal workflows, expert thought processes, complex problem-solving — rarely gets written online and must be explicitly captured.

Enterprises’ proprietary data will be their only defensible edge in an AI-first world.

Internal datasets dwarf public corpora and encode unique processes; companies will mine existing data once, then continually generate new, high-value data while keeping it on‑prem or tightly controlled to avoid arming competitors.

AI’s ‘reasoning gap’ can be narrowed either by new algorithms or by overwhelming scenario-specific data.

Current models reason well where they’ve seen enough examples; for each domain where robust reasoning is needed, organizations must supply rich, contextual data and agentic traces rather than expecting ‘general intelligence’ to emerge for free.

Data will be the primary durable moat for model labs, driving exclusive content deals and bespoke datasets.

Algorithms diffuse and compute can be bought, but unique training data (e. ...

Regulation and national data policy will shape geopolitical AI leadership as much as chips do.

Wang argues liberal democracies must adopt ‘pro‑data’ stances—e. ...

For startups, hypergrowth in headcount is a trap; talent density beats scale.

Scale’s rapid hiring from ~150 to 700 people diluted quality and execution; Wang now personally approves every hire and keeps the team roughly flat, favoring a small ‘Navy SEALs’ workforce over a large but average organization.

Notable Quotes

A lot of AI progress at this point is fundamentally more data bottleneck.

Alex Wang

We’ve used up all the easy data. We’ve used up all of the internet data.

Alex Wang

All of the reasoning and thinking that is powering the economy today, none of that gets written down on the internet.

Alex Wang

Data is one of the few areas where you can produce a long-term, sustainable competitive advantage in this foundation model game.

Alex Wang

This AI technology has the potential to be one of the greatest military assets that humanity has ever seen, potentially even more of a military asset than nukes.

Alex Wang

Questions Answered in This Episode

If we’re truly data-bottlenecked, what practical steps can smaller organizations take to create meaningful frontier data with limited resources?

How should enterprises balance the competitive advantage of proprietary AI training data against potential societal benefits of wider data sharing?

At what capability threshold does an open model become too powerful to be safely released, and who should decide where that line is?

Could aggressive data collection (e.g., longitudinal life-logging or pervasive workplace monitoring) backfire culturally or legally, even if it accelerates AI progress?

What early warning signs should we watch for that AI promises are diverging from technical reality, risking an ‘autonomous vehicles-style’ hype crash in generative AI?

Transcript Preview

Alexandr (Alex) Wang

At its core, this AI technology has the potential to be one of the greatest military assets that humanity has ever seen. Potentially even more of a military asset than nukes. Let's say China or Russia had AGI today and the United States didn't, I would imagine they would use that to conquer. The CCP system is incredibly good at taking very aggressive centralized action and centralized industrial policy to drive forward critical industries. They have a clear shot at racing forward.

Harry Stebbings

Ready to go? Alex, I am thrilled that we could do this in person. Thank you so much for joining me today.

Alexandr (Alex) Wang

Yeah, great to be here.

Harry Stebbings

Now, listen, it's funny. I told you, I tweeted before like, we should skip the founding stories because there are many, many great times you've told it before. But I want to dive straight in, and I want to ask you the question of when we look at model performance today, let's just start high level, do you think we're seeing a case of diminishing returns where more compute doesn't lead to better performance?

Alexandr (Alex) Wang

Yeah, I think it's pretty fascinating. This is especially coming up now because OpenAI has had GPT-4 since fall of 2022, and since that timeframe we haven't yet seen a new base model that's jaw-droppingly better than GPT-4. We haven't seen the GPT-4.5 or the GPT-5, and the other labs haven't yet come out with models that are leagues and leagues better than GPT-4, despite way, way more compute expenditure. Since ChatGPT came out, you can look at the graph of NVIDIA's revenue and it just inflects; it goes straight up. NVIDIA's data center revenue went from roughly five billion a quarter to north of $20 billion a quarter, so there's been tens of billions, going on more than a hundred billion, of spend on high-end NVIDIA GPUs in that same timeframe. And we haven't yet seen the big breakthrough since GPT-4, a model that actually came out before this huge inflection in NVIDIA expenditure. So overall it's this interesting thing where investment into compute is going up dramatically, going up exponentially right now, but as a community, as an industry, we're still kind of waiting for the next great model.

Harry Stebbings

Do you think we've reached this kind of asymptote, where we'll actually see a plateau in performance while we wait for that? And is that a matter of months, or is it more like self-driving? Remember, with self-driving we saw a plateau in performance for several years, and it was only recently that we saw it inflect again.
