The Twenty Minute VC

Alex Wang: Why Data Not Compute is the Bottleneck to Foundation Model Performance | E1164

Alexandr Wang is the Founder and CEO @ Scale.ai, the company that allows you to make the best models with the best data. To date, Alex has raised $1.6BN for the company with a last reported valuation of $14BN earlier this year. Scale tripled their ARR in 2023 and is expected to hit $1.4BN in ARR by the end of 2024. Their investors include Accel, Index, Thrive, Founders Fund, Meta and Nvidia to name a few.

Timestamps:
(00:00) Intro
(01:05) Diminishing Returns in AI Compute
(09:08) Solving Reasoning to Overcome Limits
(10:56) From Data Scarcity to Abundance
(14:37) Challenges in Structuring Massive Enterprise Data
(18:59) Fair Access to Proprietary Data for Models
(22:02) Model Commoditization
(26:51) Value Extraction Challenges in AI Commoditization
(32:55) Navigating Data Regulatory Challenges for Innovation
(36:53) A Military Asset in Global Conflict: China & Russia
(42:49) The Future Landscape of Foundation Models
(44:52) About Founder Brand & PR & Media
(52:11) Hiring
(01:00:41) Quick-Fire Round

In Today’s Show with Alex Wang We Discuss:

1. Foundation Models: Diminishing Returns:
    • What are the three core pillars that can meaningfully improve foundation model performance?
    • Why is data the single largest bottleneck to the performance of models today?
    • What data do we need to capture that we do not currently, that will have the biggest impact on model performance moving forward?
    • Will we see the largest companies in the world revert back to on-prem with the increasing security challenges of migrating all customer data to foundation models?

2. AI: A Military Asset in Global Conflict: China + Russia
    • Why does Alex believe that AI has the potential to be an even more powerful military asset than nuclear weapons?
    • If this is the case, should we have open systems? Do we not have to have closed systems?
    • Why does Alex believe that the CCP’s approach to industrial policy is better than anyone else’s?
    • How does Alex evaluate the rise of Chinese EV car manufacturers in the last few years?
    • Does Alex really believe that China is two years behind the US in the AI race?

3. “I Get Fairer Treatment in Congress than in the Press”:
    • Why does Alex believe that the best PR is no PR?
    • Why does Alex believe that he got fairer treatment in Congress than he does in the media?
    • Why does Alex believe that all founders should look to own their own distribution channels today?

4. Alex Wang: AMA:
    • What are some of Alex’s biggest lessons from Patrick Collison on the impact that a hot company brand has on the ability for that company to hire the best?
    • Does Alex think Trump is going to win? What would be the impact if he were to?
    • Why does Alex believe that enterprise software will be changed forever in the next few years?
    • What question is Alex never asked that he thinks he should be asked?

Subscribe on Spotify: https://open.spotify.com/show/3j2KMcZTtgTNBKwtZBMHvl?si=85bc9196860e4466
Subscribe on Apple Podcasts: https://podcasts.apple.com/us/podcast/the-twenty-minute-vc-20vc-venture-capital-startup/id958230465
Follow Harry Stebbings on Twitter: https://twitter.com/HarryStebbings
Follow Alexandr Wang on Twitter: https://twitter.com/alexandr_wang
Follow 20VC on Instagram: https://www.instagram.com/20vchq
Follow 20VC on TikTok: https://www.tiktok.com/@20vc_tok
Visit our Website: https://www.20vc.com
Subscribe to our Newsletter: https://www.thetwentyminutevc.com/contact

#20vc #harrystebbings #alexandrwang #scaleai #openai #venturecapital #founder #chatgpt #ai #foundationmodels #china #military

Alexandr (Alex) Wang (guest), Harry Stebbings (host)
Jun 12, 2024 · 1h 6m

CHAPTERS

  1. Compute spending surges, but GPT-4-level progress stalls

    Alex and Harry open with the observation that GPU spend has exploded since GPT-4, yet the industry hasn’t seen a “jaw-dropping” base-model leap beyond it. They frame the core puzzle: why scaling compute hasn’t translated into obvious capability gains.

    • GPT-4 has been out since March 2023 without a clear successor leap
    • NVIDIA data-center revenue and industry compute spend have inflected sharply upward
    • More compute alone doesn’t guarantee step-function model improvements
    • The ecosystem is “waiting for the next great model” despite massive investment
  2. The three pillars: compute, data, and algorithms—and the emerging ‘data wall’

    Alex explains that AI progress requires compute, algorithmic breakthroughs, and data scaling together. He argues the recent plateau is largely explained by exhausting ‘easy’ internet-scale data while continuing to scale compute.

    • AI progress historically comes from compute + data + algorithms in tandem
    • Algorithmic jumps (transformers, RLHF) matter as much as hardware
    • GPT-4 effectively trained on ‘nearly all of the internet’
    • The industry has scaled compute faster than it has scaled data
  3. Why internet data can’t deliver agents and real-world reasoning

    They unpack why pretraining on open-web text hits limits: much of the reasoning that powers real work isn’t written down publicly. Alex uses the fraud analyst example to show that internal decision processes and multi-step deductions don’t appear on the internet.

    • “Easy data” = crawlable/torrented public web content
    • Pretraining makes models great at emulating the internet, not doing work
    • Workplace reasoning chains and decision criteria are rarely documented publicly
    • To build capable agents, models need data that reflects real task execution
  4. Frontier data: capturing reasoning chains, tool use, and agentic behavior

    Alex introduces ‘frontier data’ as the missing ingredient: complex reasoning traces, multi-step agent workflows, and tool-usage examples that train models to execute tasks. This becomes his central thesis for moving from data scarcity to data abundance.

    • Frontier data includes complex reasoning chains and multi-step workflows
    • Agent behaviors: lookup, reasoning, correction, and tool use
    • Frontier data is required to move from GPT-4 to much more capable systems
    • The bottleneck shifts from compute to producing the right kinds of data
  5. Enterprise data at massive scale—and why it stays proprietary

    Alex argues that enormous troves of valuable data sit inside enterprises and dwarf typical internet corpora. But because it’s sensitive and competitively differentiating, it must be mined and used largely within each enterprise rather than broadly open-sourced.

    • Enterprise datasets can be enormous (e.g., Alex's claim that JP Morgan holds ~150PB of data, versus less than 1PB for an internet-scale training dataset)
    • Most valuable enterprise data won’t go on the open internet for good reasons
    • Mining enterprise data is a company-by-company effort tied to internal problems
    • This unlocks value but doesn’t automatically create universal public improvements
  6. ‘Solving reasoning’ as either a breakthrough—or a data coverage problem

    They discuss whether better reasoning requires fundamentally new capabilities or simply broader scenario coverage via data. Alex suggests today’s models reason well where they’ve seen enough examples; absent that, they struggle to generalize like humans.

    • Machine intelligence differs from human general intelligence and transfer
    • Models perform best in domains with abundant relevant data
    • Two paths: true general reasoning breakthroughs vs overwhelming scenario-specific data
    • Data breadth can compensate for limited generalization in many settings
  7. From data scarcity to abundance: mining vs forward data production

    Alex distinguishes between one-time benefits from mining existing datasets and the ongoing need for ‘forward’ data production. He analogizes data production to chip fabs: the industry must invest in scalable “means of production” for frontier data.

    • Two tracks: (1) data mining existing sources, (2) forward data production
    • Mining yields meaningful but finite, one-time gains
    • Long-term progress requires continuous creation of new frontier data
    • Compute and data must scale in lockstep to unlock major capability gains
  8. How new data gets made: longitudinal collection + human-in-the-loop synthesis

    Harry probes concrete mechanisms to increase data supply. Alex outlines longitudinal data capture (workplace process mining and consumer life-logging devices) but argues the real frontier comes from expert-guided, human+synthetic collaboration to produce high-complexity training data.

    • Longitudinal data collection: workplace telemetry, process mining, and consumer devices
    • RPA-like traces can record actions across tools and workflows
    • But frontier advancement needs complex expert-level data (code, science, reasoning chains)
    • Human experts act like ‘safety drivers’ to correct/steer synthetic generation
  9. The ‘AI trainer’ as a high-leverage job and societal impact multiplier

    Alex argues that experts contributing training data may be among the highest-leverage human roles. Improving a model even slightly can compound across billions of downstream uses, amplifying an individual’s impact.

    • Roles: AI trainers/contributors who guide model improvement via data
    • Small expert improvements multiply across all future model calls/users
    • Appeal to scientists, doctors, mathematicians: encode expertise into systems
    • Human oversight raises quality and resolves edge cases where synthetic generation gets stuck or makes factual errors
  10. Enterprise readiness: structuring messy data and the case for on-prem models

    They turn to practical enterprise hurdles: data cleanliness, governance, and operationalizing AI on sensitive datasets. Alex predicts sophisticated companies will mine their data, and many will prefer on-prem/open models to ensure their data doesn’t become competitors’ advantage.

    • Enterprise data is often unstructured; extracting value requires significant effort
    • The most sophisticated firms will mine their internal data; others will lag
    • Enterprise data may be the last durable differentiator, increasing caution
    • On-prem/open models (LLaMA, Mistral) fit demands for strong data isolation
  11. Model commoditization and where value accrues in the AI stack

    Alex and Harry debate whether models themselves become commoditized and how value is captured. Alex suggests value may accrue more above (apps/services) and below (infrastructure) the model, referencing NVIDIA’s position and shifting value capture dynamics.

    • Intense competition may limit durable value capture at the base-model layer
    • Infrastructure ‘below the model’ already captures huge value (e.g., NVIDIA)
    • Apps/services ‘above the model’ are likely to capture significant value
    • Value capture may migrate across layers as the market structure evolves
  12. Data as the durable moat: exclusive access, differentiated strategies, and pricing shifts

    Alex frames data as the most sustainable competitive advantage among the three pillars, since algorithms diffuse and compute can be purchased. They discuss exclusive data deals, differentiated data strategies by lab, and how agent-driven work pushes software toward consumption-based pricing over per-seat.

    • Data is the strongest source of durable competitive advantage for labs
    • Exclusive/strategic data deals (e.g., publishers) hint at future competition
    • Labs must develop differentiated data-production strategies aligned to their distribution/use cases
    • Agentic work makes per-seat pricing less logical; consumption/value-based pricing rises
  13. Regulation and national strategy: ‘pro-data’ policy, pooling, and healthcare constraints

    Harry raises fears that regulation will stifle innovation, especially in Europe. Alex argues democracies can be pro-data, advocating for industry data pooling where it benefits safety and fraud prevention, and for frameworks that enable anonymized medical data use despite HIPAA/PII barriers.

    • EU approach is seen as restrictive; the US/UK need a more proactive stance
    • Pro-data policy could include centralized, shared datasets for sector-wide benefit
    • Examples: aerospace safety data and financial fraud/compliance data pooling
    • Healthcare: find anonymization/legal pathways to use patient data for better outcomes
  14. Geopolitics and security: China’s pace, AGI as a military asset, and open vs closed models

    They shift to global competition, with Alex arguing China has rapidly closed the gap and could surpass the US given centralized industrial policy. He emphasizes AGI’s military implications and argues for a threshold: the most advanced systems should be closed, while less advanced open models can still be beneficial.

    • China has meaningfully caught up in model performance; trajectory is concerning
    • Centralized industrial policy excels at ‘turning the crank’ once a paradigm is known
    • AGI could be an unprecedented military asset—potentially beyond nukes in impact
    • A split future: keep cutting-edge systems closed; allow openness below a capability line
  15. Where foundation models go next: consolidation into giants, plus founder brand, hiring, and lessons learned

    Alex predicts foundation model development will become so expensive that only hyperscalers and nation-states can underwrite it, pushing consolidation/partnership-driven outcomes. The conversation closes with company-building themes: distrust of traditional PR, the importance of direct channels and founder brand, and hiring for ‘people who give a shit’—including lessons from hypergrowth and maintaining a high bar.

    • Model training costs will rise to tens/hundreds of billions, favoring giants and nation-states
    • Smaller labs likely get acquired or deeply partnered; outcomes depend on major alliances
    • Traditional media incentives can distort narratives; founders should use direct channels
    • Hiring: maintain elite bar at scale; hypergrowth in headcount can erode talent density
