The Twenty Minute VC | Alex Wang: Why Data, Not Compute, Is the Bottleneck to Foundation Model Performance | E1164
At a glance
WHAT IT’S REALLY ABOUT
Alex Wang: Data, Not Compute, Now Limits AI’s Next Breakthroughs
- Scale AI CEO Alex Wang argues that foundation model progress has plateaued not because of insufficient compute, but because the industry has hit a “data wall” after exhausting most high-quality internet text. He introduces the concept of “frontier data” — rich, task- and reasoning-centric datasets drawn from enterprises, experts, agents, and synthetic generation — as the core bottleneck and main moat for future AI systems.
- Wang predicts data strategy will define durable competitive advantage among model labs and enterprises, drive a shift toward on‑prem and highly customized deployments, and reshape software pricing away from per‑seat to consumption- and outcome-based models. He also warns that China’s centralized industrial policy and permissive data regime could let it overtake the U.S. in AI, with profound military and geopolitical implications.
- On the business side, he discusses why the real value may accrue above and below the model layer, why hypergrowth in headcount was a mistake, and how Scale maintains a “Navy SEALs, not Navy” talent bar even at 800 employees. He closes with views on media, founder brand, regulation, open vs. closed models, and why building data infrastructure will remain a non‑commoditized, long‑term opportunity.
IDEAS WORTH REMEMBERING
5 ideas
Compute alone no longer delivers step-change gains; data quality and breadth are now the main constraint.
Despite a massive surge in GPU spending since GPT‑4, no dramatically superior base model has appeared, suggesting that simply scaling compute without new data and algorithms hits diminishing returns.
The internet is “used up” for pretraining; future gains require frontier data, not more web crawl.
Most easily crawlable text is already in large models, but real-world economic reasoning — internal workflows, expert thought processes, complex problem-solving — rarely gets written online and must be explicitly captured.
Enterprises’ proprietary data will be their only defensible edge in an AI-first world.
Internal datasets dwarf public corpora and encode unique processes; companies will mine existing data once, then continually generate new, high-value data while keeping it on‑prem or tightly controlled to avoid arming competitors.
AI’s ‘reasoning gap’ can be narrowed either by new algorithms or by overwhelming scenario-specific data.
Current models reason well where they’ve seen enough examples; for each domain where robust reasoning is needed, organizations must supply rich, contextual data and agentic traces rather than expecting “general intelligence” to emerge for free.
Data will be the primary durable moat for model labs, driving exclusive content deals and bespoke datasets.
Algorithms diffuse and compute can be bought, but unique training data (e.g., publisher archives, domain-specific corpora, expert-generated frontier data) is hard to copy and will become the main basis of differentiation between OpenAI, Anthropic, etc.
WORDS WORTH SAVING
5 quotes
A lot of AI progress at this point is fundamentally more data bottleneck.
— Alex Wang
We’ve used up all the easy data. We’ve used up all of the internet data.
— Alex Wang
All of the reasoning and thinking that is powering the economy today, none of that gets written down on the internet.
— Alex Wang
Data is one of the few areas where you can produce a long-term, sustainable competitive advantage in this foundation model game.
— Alex Wang
This AI technology has the potential to be one of the greatest military assets that humanity has ever seen, potentially even more of a military asset than nukes.
— Alex Wang
High quality AI-generated summary created from speaker-labeled transcript.