The Twenty Minute VC
Alex Wang: Why Data Not Compute is the Bottleneck to Foundation Model Performance | E1164
CHAPTERS
Compute spending surges, but GPT-4-level progress stalls
Alex and Harry open with the observation that GPU spend has exploded since GPT-4, yet the industry hasn’t seen a “jaw-dropping” base-model leap beyond it. They frame the core puzzle: why scaling compute hasn’t translated into obvious capability gains.
- GPT-4 has been out since 2023 without a clear successor leap
- NVIDIA data-center revenue and industry compute spend have inflected sharply upward
- More compute alone doesn’t guarantee step-function model improvements
- The ecosystem is “waiting for the next great model” despite massive investment
The three pillars: compute, data, and algorithms—and the emerging ‘data wall’
Alex explains that AI progress requires compute, algorithmic breakthroughs, and data scaling together. He argues the recent plateau is largely explained by exhausting ‘easy’ internet-scale data while continuing to scale compute.
- AI progress historically comes from compute + data + algorithms in tandem
- Algorithmic jumps (transformers, RLHF) matter as much as hardware
- GPT-4 effectively trained on ‘nearly all of the internet’
- The industry has scaled compute faster than it has scaled data
Why internet data can’t deliver agents and real-world reasoning
They unpack why pretraining on open-web text hits limits: much of the reasoning that powers real work isn’t written down publicly. Alex uses the fraud analyst example to show that internal decision processes and multi-step deductions don’t appear on the internet.
- “Easy data” = crawlable/torrented public web content
- Pretraining makes models great at emulating the internet, not doing work
- Workplace reasoning chains and decision criteria are rarely documented publicly
- To build capable agents, models need data that reflects real task execution
Frontier data: capturing reasoning chains, tool use, and agentic behavior
Alex introduces ‘frontier data’ as the missing ingredient: complex reasoning traces, multi-step agent workflows, and tool-usage examples that train models to execute tasks. This becomes his central thesis for moving from data scarcity to data abundance.
- Frontier data includes complex reasoning chains and multi-step workflows
- Agent behaviors: lookup, reasoning, correction, and tool use
- Frontier data is required to move from GPT-4 to much more capable systems
- The bottleneck shifts from compute to producing the right kinds of data
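To make the idea of a frontier-data record concrete, here is a minimal Python sketch of one possible shape for a single trace. It is not a format described in the episode: the class names, fields, and the device-fingerprint service are all hypothetical, and the fraud-review task simply borrows the fraud analyst example from the previous chapter. The only element taken from the summary is the set of four step kinds (lookup, reasoning, correction, tool use).

```python
from dataclasses import dataclass, field
from typing import List

# Step kinds taken from the chapter's list of agent behaviors.
STEP_KINDS = ("lookup", "reasoning", "correction", "tool_use")


@dataclass
class Step:
    kind: str     # one of STEP_KINDS
    content: str  # what was looked up, concluded, fixed, or called

    def __post_init__(self) -> None:
        # Reject step kinds outside the four named behaviors.
        if self.kind not in STEP_KINDS:
            raise ValueError(f"unknown step kind: {self.kind}")


@dataclass
class FrontierTrace:
    """Hypothetical record of one multi-step task execution."""
    task: str
    steps: List[Step] = field(default_factory=list)
    outcome: str = ""

    def kinds(self) -> List[str]:
        # Ordered list of step kinds, useful for checking coverage.
        return [step.kind for step in self.steps]


def example_trace() -> FrontierTrace:
    # Illustrative content only; no real data or service is implied.
    return FrontierTrace(
        task="Review a flagged card transaction",
        steps=[
            Step("lookup", "Pull the account's last 30 days of transactions"),
            Step("reasoning", "Amount is 20x the usual spend, in a new country"),
            Step("tool_use", "Query the (hypothetical) device-fingerprint service"),
            Step("correction", "Device matches the owner; earlier suspicion was too strong"),
        ],
        outcome="approve",
    )
```

The point of the sketch is that the intermediate steps, not just the final decision, are what the record preserves. That is the part of the work that rarely appears in public web text.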
Enterprise data at massive scale—and why it stays proprietary
Alex argues that enormous troves of valuable data sit inside enterprises and dwarf typical internet corpora. But because it’s sensitive and competitively differentiating, it must be mined and used largely within each enterprise rather than broadly open-sourced.
- Enterprise datasets can be enormous (e.g., Alex cites ~150 PB at JP Morgan versus under 1 PB for an internet training dataset)
- Most valuable enterprise data won’t go on the open internet for good reasons
- Mining enterprise data is a company-by-company effort tied to internal problems
- This unlocks value but doesn’t automatically create universal public improvements
‘Solving reasoning’ as either a breakthrough—or a data coverage problem
They discuss whether better reasoning requires fundamentally new capabilities or simply broader scenario coverage via data. Alex suggests today’s models reason well where they’ve seen enough examples; absent that, they struggle to generalize like humans.
- Machine intelligence differs from human general intelligence and transfer
- Models perform best in domains with abundant relevant data
- Two paths: true general reasoning breakthroughs vs overwhelming scenario-specific data
- Data breadth can compensate for limited generalization in many settings
From data scarcity to abundance: mining vs forward data production
Alex distinguishes between one-time benefits from mining existing datasets and the ongoing need for ‘forward’ data production. He analogizes data production to chip fabs: the industry must invest in scalable “means of production” for frontier data.
- Two tracks: (1) data mining existing sources, (2) forward data production
- Mining yields meaningful but finite, one-time gains
- Long-term progress requires continuous creation of new frontier data
- Compute and data must scale in lockstep to unlock major capability gains
How new data gets made: longitudinal collection + human-in-the-loop synthesis
Harry probes concrete mechanisms to increase data supply. Alex outlines longitudinal data capture (workplace process mining and consumer life-logging devices) but argues the real frontier comes from expert-guided, human+synthetic collaboration to produce high-complexity training data.
- Longitudinal data collection: workplace telemetry, process mining, and consumer devices
- RPA-like traces can record actions across tools and workflows
- But frontier advancement needs complex expert-level data (code, science, reasoning chains)
- Human experts act like ‘safety drivers’ to correct/steer synthetic generation
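The ‘safety driver’ pattern above can be sketched as a simple loop: a model drafts each example, and a human expert steps in only when the draft looks unreliable. The Python below is an illustrative sketch under stated assumptions, not a pipeline described in the episode. The function names, the confidence threshold of 0.8, and the toy arithmetic data are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Tuple

Draft = Tuple[str, float]        # (answer, model confidence in [0, 1])
Example = Tuple[str, str, str]   # (prompt, answer, provenance label)


def synthesize_with_expert(
    prompts: List[str],
    generate: Callable[[str], Draft],
    expert_fix: Callable[[str, str], Optional[str]],
    threshold: float = 0.8,  # assumed cutoff for asking the expert
) -> List[Example]:
    dataset: List[Example] = []
    for prompt in prompts:
        answer, confidence = generate(prompt)
        if confidence >= threshold:
            # Model drives on its own; keep the synthetic draft.
            dataset.append((prompt, answer, "synthetic"))
            continue
        # Low confidence: the expert takes the wheel and corrects or rejects.
        fixed = expert_fix(prompt, answer)
        if fixed is not None:
            dataset.append((prompt, fixed, "expert_corrected"))
    return dataset


# Toy stand-ins for a model and an expert (hypothetical data).
_DRAFTS = {"2+2": ("4", 0.95), "3+5": ("9", 0.40), "7*6": ("42", 0.90)}
_EXPERT = {"3+5": "8"}


def toy_generate(prompt: str) -> Draft:
    return _DRAFTS[prompt]


def toy_expert_fix(prompt: str, draft: str) -> Optional[str]:
    # Returns None when the expert has no correction, which drops the example.
    return _EXPERT.get(prompt)


if __name__ == "__main__":
    for row in synthesize_with_expert(["2+2", "3+5", "7*6"], toy_generate, toy_expert_fix):
        print(row)
```

Recording a provenance label on every example is a design choice of this sketch. It keeps expert-corrected data distinguishable from purely synthetic data, so the two can be weighted or audited separately.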
The ‘AI trainer’ as a high-leverage job and societal impact multiplier
Alex argues that experts contributing training data may be among the highest-leverage human roles. Improving a model even slightly can compound across billions of downstream uses, amplifying an individual’s impact.
- Roles: AI trainers/contributors who guide model improvement via data
- Small expert improvements multiply across all future model calls/users
- Appeal to scientists, doctors, mathematicians: encode expertise into systems
- Human oversight raises quality and resolves edge cases where the model gets stuck or gets facts wrong
Enterprise readiness: structuring messy data and the case for on-prem models
They turn to practical enterprise hurdles: data cleanliness, governance, and operationalizing AI on sensitive datasets. Alex predicts sophisticated companies will mine their data, and many will prefer on-prem/open models to ensure their data doesn’t become competitors’ advantage.
- Enterprise data is often unstructured; extracting value requires significant effort
- The most sophisticated firms will complete internal data mining; others will lag
- Enterprise data may be the last durable differentiator, increasing caution
- On-prem/open models (LLaMA, Mistral) fit demands for strong data isolation
Model commoditization and where value accrues in the AI stack
Alex and Harry debate whether models themselves become commoditized and how value is captured. Alex suggests value may accrue more above (apps/services) and below (infrastructure) the model, referencing NVIDIA’s position and shifting value capture dynamics.
- Intense competition may limit durable value capture at the base-model layer
- Infrastructure ‘below the model’ already captures huge value (e.g., NVIDIA)
- Apps/services ‘above the model’ are likely to capture significant value
- Value capture may migrate across layers as the market structure evolves
Data as the durable moat: exclusive access, differentiated strategies, and pricing shifts
Alex frames data as the most sustainable competitive advantage among the three pillars, since algorithms diffuse and compute can be purchased. They discuss exclusive data deals, differentiated data strategies by lab, and how agent-driven work pushes software toward consumption-based pricing over per-seat.
- Data is the strongest source of durable competitive advantage for labs
- Exclusive/strategic data deals (e.g., publishers) hint at future competition
- Labs must develop differentiated data-production strategies aligned to their distribution/use cases
- Agentic work makes per-seat pricing less logical; consumption/value-based pricing rises
Regulation and national strategy: ‘pro-data’ policy, pooling, and healthcare constraints
Harry raises fears that regulation will stifle innovation, especially in Europe. Alex argues democracies can be pro-data, advocating for industry data pooling where it benefits safety and fraud prevention, and for frameworks that enable anonymized medical data use despite HIPAA/PII barriers.
- EU approach is seen as restrictive; the US/UK need a more proactive stance
- Pro-data policy could include centralized, shared datasets for sector-wide benefit
- Examples: aerospace safety data and financial fraud/compliance data pooling
- Healthcare: find anonymization/legal pathways to use patient data for better outcomes
Geopolitics and security: China’s pace, AGI as a military asset, and open vs closed models
They shift to global competition, with Alex arguing China has rapidly closed the gap and could surpass the US given centralized industrial policy. He emphasizes AGI’s military implications and argues for a threshold: the most advanced systems should be closed, while less advanced open models can still be beneficial.
- China has meaningfully caught up in model performance; trajectory is concerning
- Centralized industrial policy excels at ‘turning the crank’ once a paradigm is known
- AGI could be an unprecedented military asset, potentially beyond nukes in impact
- A split future: keep cutting-edge systems closed; allow openness below a capability line
Where foundation models go next: consolidation into giants, plus founder brand, hiring, and lessons learned
Alex predicts foundation model development will become so expensive that only hyperscalers and nation-states can underwrite it, pushing consolidation/partnership-driven outcomes. The conversation closes with company-building themes: distrust of traditional PR, the importance of direct channels and founder brand, and hiring for ‘people who give a shit’—including lessons from hypergrowth and maintaining a high bar.
- Model training costs will rise to tens or hundreds of billions of dollars, favoring giants and nation-states
- Smaller labs likely get acquired or deeply partnered; outcomes depend on major alliances
- Traditional media incentives can distort narratives; founders should use direct channels
- Hiring: maintain elite bar at scale; hypergrowth in headcount can erode talent density