Jonathan Ross, Founder & CEO @ Groq: NVIDIA vs Groq - The Future of Training vs Inference | E1260

Jonathan Ross is the Founder & CEO of Groq, the creator of the world’s first Language Processing Unit (LPU™). Prior to Groq, Jonathan began what became Google’s Tensor Processing Unit (TPU) as a 20% project where he designed and implemented the core elements of the first-generation TPU chip. Jonathan next joined Google X’s Rapid Eval Team, the initial stage of the famed “Moonshots Factory”, where he devised and incubated new Bets (Units) for Google’s parent company, Alphabet.

----------------------------------------------

In Today’s Episode We Discuss:
(00:00) Intro
(01:29) Scaling Laws and AI Model Training
(06:48) Synthetic Data and Model Efficiency
(09:00) Inference vs. Training Costs: Why NVIDIA Loses Inference
(15:12) The Future of AI Inference: Efficiency and Cost
(16:35) Chip Supply and Scaling Concerns
(19:40) Energy Efficiency in AI Computation
(25:37) Why Most Dollars Into Datacenters Will Be Lost
(31:41) Meta, Google, and Microsoft's Data Center Investments
(43:24) Distribution of Value in the AI Economy
(44:17) Stages of Startup Success
(45:46) The AI Investment Bubble
(47:45) The Keynesian Beauty Contest in VC
(51:52) NVIDIA's Role in the AI Ecosystem
(57:30) China's AI Strategy and Global Implications
(01:02:25) Europe's Potential in the AI Revolution
(01:17:13) Future Predictions and AI's Impact on Society

----------------------------------------------

Subscribe on Spotify: https://open.spotify.com/show/3j2KMcZTtgTNBKwtZBMHvl?si=85bc9196860e4466
Subscribe on Apple Podcasts: https://podcasts.apple.com/us/podcast/the-twenty-minute-vc-20vc-venture-capital-startup/id958230465
Follow Harry Stebbings on Twitter: https://twitter.com/HarryStebbings
Follow Jonathan Ross on Twitter: https://twitter.com/JonathanRoss321
Follow 20VC on Instagram: https://www.instagram.com/20vchq
Follow 20VC on TikTok: https://www.tiktok.com/@20vc_tok
Visit our Website: https://www.20vc.com
Subscribe to our Newsletter: https://www.thetwentyminutevc.com/contact

----------------------------------------------

#20vc #harrystebbings #jonathanross #groq #CEO #venturecapital #founder #ai #nvidia #modeltraining #inference

Jonathan Ross (guest), Harry Stebbings (host)

Feb 17, 2025 · 1h 25m

CHAPTERS

0:00 – 1:26
Groq’s thesis in one minute: win inference, let NVIDIA own training
Jonathan opens by correcting headlines about Groq’s “$1.5B raise,” framing it as revenue and tying it to a broader strategy: carve out low-margin, high-volume inference while NVIDIA continues to dominate high-margin training. He emphasizes scale over near-term profit when growth is “faster than exponential.”
- $1.5B is positioned as revenue, not fundraising
- Groq aims to take inference volume while NVIDIA keeps training margins
- Inference seen as a distinct product category (tokens/$, tokens/watt)
- Scale and market relevance matter more than profit at extreme growth rates
1:26 – 2:39
Are scaling laws actually hitting limits? Why people misread the curves
The conversation starts with what scaling laws really claim and why the common “we’re at the limit” narrative is misleading. Jonathan argues the usual interpretation assumes uniform data quality, which isn’t how real capability improvements happen.
- Scaling laws: more parameters/tokens generally improve performance
- Improvements look asymptotic when data quality is treated as uniform
- Current training often mixes trivial and advanced tasks without curriculum
- Token-count growth (trillions) is partly a response to logarithmic gains
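As a rough illustration of the last point, suppose (purely as an assumption, not a figure from the episode) that capability grows with the logarithm of the training token count. Every fixed step up in capability then costs ten times more tokens, which is one way to read the push toward trillions of tokens. The function names and the `gain_per_decade` constant are hypothetical.

```python
import math


def capability(tokens: float, gain_per_decade: float = 1.0) -> float:
    """Toy score that rises by `gain_per_decade` for every 10x more tokens."""
    return gain_per_decade * math.log10(tokens)


def tokens_for(target: float, gain_per_decade: float = 1.0) -> float:
    """Invert the toy curve: tokens needed to reach a target score."""
    return 10.0 ** (target / gain_per_decade)


if __name__ == "__main__":
    for exponent in (11, 12, 13):
        tokens = 10.0 ** exponent
        print(f"{tokens:.0e} tokens -> score {capability(tokens):.1f}")
```

Under this assumed curve, moving from 10^12 to 10^13 tokens buys the same score increase as moving from 10^11 to 10^12, at ten times the token count.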
2:39 – 4:07
Synthetic data loops: how models bootstrap their own training quality
Jonathan explains why synthetic data can outperform raw internet data when produced by a strong model and then filtered. He describes an iterative flywheel—generate, prune, retrain—that can change the apparent shape of scaling curves.
- Synthetic data can be higher quality because the generator model is “smarter”
- Offline pruning/filtering keeps the best samples and removes errors
- Iterative self-improvement resembles AlphaGo Zero’s self-play dynamic
- This loop can reduce the sense of asymptotic ceilings in practice
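The generate, prune, retrain loop described above can be sketched as a toy simulation. This is not how any real lab trains models: quality is collapsed to a single number, the "generator" is a Gaussian sampler, and "retraining" simply moves the model to the mean of the kept samples. All function names and parameters are hypothetical.

```python
import random
import statistics


def generate(model_quality: float, n: int, rng: random.Random) -> list[float]:
    """Hypothetical generator: sample qualities scatter around the model's level."""
    return [rng.gauss(model_quality, 1.0) for _ in range(n)]


def prune(samples: list[float], keep_fraction: float) -> list[float]:
    """Keep only the best-scoring fraction (a stand-in for offline filtering)."""
    keep = max(1, int(len(samples) * keep_fraction))
    return sorted(samples, reverse=True)[:keep]


def retrain(kept: list[float]) -> float:
    """Toy assumption: the new model lands at the mean quality of its data."""
    return statistics.fmean(kept)


def flywheel(rounds: int = 5, n: int = 1000,
             keep_fraction: float = 0.2, seed: int = 0) -> list[float]:
    """Run generate -> prune -> retrain and record model quality per round."""
    rng = random.Random(seed)
    quality = 0.0
    history = [quality]
    for _ in range(rounds):
        quality = retrain(prune(generate(quality, n, rng), keep_fraction))
        history.append(quality)
    return history


if __name__ == "__main__":
    print([round(q, 2) for q in flywheel()])
```

In this toy, quality climbs every round because the filter only passes samples better than the current model's average. In practice the filter (a verifier, reward model, or human check) is the hard part.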
4:07 – 7:07
Efficiency ceilings, Big-O, and why reasoning compute is different from intuition
They move from data to algorithmic limits: some tasks have inherent step complexity. Jonathan distinguishes intuitive “System 1” capability (trained intuition) from “System 2” reasoning (test-time compute), and argues the next gains come from combining both.
- Big-O complexity explains why some computations require intermediate steps
- Models can memorize more cases to reduce steps, but reasoning remains necessary
- System 1 vs System 2 framing: intuition vs deliberate runtime compute
- Combining better training + more test-time compute yields geometric-like gains
7:07 – 9:00
What’s the real bottleneck: data, algorithms, or compute? (It’s a ‘soft bottleneck’)
Harry presses on constraints beyond data abundance, and Jonathan argues “bottleneck” is often misused. Compute can compensate for weaknesses elsewhere, but the best outcomes come from improving all three levers simultaneously.
- Compute is fungible and can overpower slower progress in data/algorithms
- Bottlenecks are often “soft” rather than hard constraints
- Algorithm improvements can unlock more effective data generation/training
- DeepSeek is framed as primarily an algorithmic improvement story
9:00 – 16:36
Why inference becomes the real infrastructure: costs, deployment, and a new mental model
Jonathan describes why inference can dwarf training in total compute usage and how that changes company-building. He proposes thinking of accelerators as “employees,” enabling startups to substitute compute for headcount—and shares Groq’s rapid scaling trajectory.
- At Google, inference could consume 10–20× the compute of training
- Founders should position for the coming wave rather than chase it late
- Compute-as-employees framing: CapEx/OpEx can substitute for hires
- Groq scaling: ~640 chips to 40,000 in 2024; ambitions to reach far higher
16:36 – 18:08
Chip supply, HBM chokepoints, and why NVIDIA has a ‘cornered resource’
They dig into supply chain realities and why some constraints are structural. Jonathan argues NVIDIA effectively controls scarce components like HBM and CoWoS capacity, while Groq’s choices aim to avoid those scaling limits.
- HBM (high-bandwidth memory) is scarce and hard to ramp (SK hynix/Samsung/Micron)
- GPUs need huge memory bandwidth; standard memory becomes a ‘martini straw’
- NVIDIA described as a monopsony buyer for key packaging/memory elements
- Avoiding HBM is positioned as a path to fewer scale ceilings for inference
18:08 – 21:52
Groq’s LPU architecture: pipeline the model across many chips to cut energy per token
Jonathan explains the architectural bet: if chip counts scale rapidly, redesign inference to use many chips with parameters kept “live,” letting computation flow like an assembly line. This reduces costly memory movement and improves tokens-per-watt despite a bigger footprint.
- Observation: capability rose faster than Moore’s Law due to chip-count scaling
- LPU approach: keep parameters on-chip and pipeline compute across many chips
- Uses hundreds/thousands of chips per model vs a handful in typical GPU setups
- Energy rationale: shorter/thinner on-chip wires reduce charge/discharge energy
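The assembly-line idea can be illustrated with the textbook pipeline timing model. This is a generic sketch, not Groq's actual scheduling: it assumes every stage (think: one chip holding one slice of the model) takes the same time, ignores communication overhead, and assumes there are enough independent work items to keep every stage busy. It models throughput only, not energy.

```python
def sequential_time(items: int, stages: int, stage_time: float) -> float:
    """Each item passes through all stages before the next item starts."""
    return items * stages * stage_time


def pipelined_time(items: int, stages: int, stage_time: float) -> float:
    """Stages overlap: after the pipeline fills, one item finishes per stage_time."""
    if items == 0:
        return 0.0
    return (stages + items - 1) * stage_time


if __name__ == "__main__":
    items, stages, stage_time = 1000, 4, 1.0  # hypothetical numbers
    print("sequential:", sequential_time(items, stages, stage_time))
    print("pipelined: ", pipelined_time(items, stages, stage_time))
```

With these hypothetical numbers the pipelined run takes 1003 time units against 4000 for the sequential one. The latency of a single item is unchanged, but steady-state throughput approaches one item per stage time.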
21:52 – 24:11
GPU + LPU coexistence: why training stays on GPUs and LPUs can ‘nitro-boost’ inference
Jonathan argues GPUs remain the best tool for training and will keep selling out, but inference economics push toward specialized hardware. He also describes hybrid deployments where LPUs accelerate portions of workloads to make existing GPU fleets more economical.
- Training is expected to remain GPU-centric; inference share is up for grabs
- More inference capacity can increase training demand (feedback loop)
- Hybrid idea: run parts of a model on LPUs to speed up GPU deployments
- Operational advantage: simpler deployment, fewer networking components, predictability
24:11 – 29:03
NVIDIA vs Groq in practice: stop selling specs, sell tokens per dollar (and why competition is ‘different’)
The discussion shifts to go-to-market narratives and “specsmanship.” Jonathan dismisses headline metrics in favor of tokens-per-dollar and tokens-per-watt, argues Groq is not solving the same problem as NVIDIA, and frames the market as big enough for both—especially if customers buy GPUs for training and LPUs for inference.
- Critique of specsmanship: teraflops and benchmark theatrics vs real economics
- Key metrics: tokens/$ and tokens/watt
- Groq claims it’s not direct competition because the products target different jobs
- Customers should still buy GPUs for training even if they shift inference to LPUs
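For readers who want to compute the two metrics named above, here is a minimal sketch. The `Deployment` class and every number in it are hypothetical. "Tokens/watt" is interpreted here as tokens per second per watt, which is the same as tokens per joule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of one inference deployment."""
    tokens_per_second: float
    power_watts: float
    cost_per_hour_usd: float  # all-in hourly cost, however you choose to define it

    def tokens_per_dollar(self) -> float:
        # Tokens produced in one hour divided by what that hour costs.
        return self.tokens_per_second * 3600 / self.cost_per_hour_usd

    def tokens_per_joule(self) -> float:
        # A watt is a joule per second, so this is tokens/s divided by J/s.
        return self.tokens_per_second / self.power_watts


if __name__ == "__main__":
    example = Deployment(tokens_per_second=1000.0, power_watts=500.0,
                         cost_per_hour_usd=3.6)  # made-up values
    print(f"tokens per dollar: {example.tokens_per_dollar():,.0f}")
    print(f"tokens per joule:  {example.tokens_per_joule():.2f}")
```

Comparing two systems on these numbers only makes sense if both are measured on the same model, at the same quality, with the same cost definition.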
29:03 – 32:29
Cost, margins, and the Aramco/Saudi deal: revenue-share financing instead of Groq-funded CapEx
Jonathan details why Groq can be materially cheaper on inference and how its business model scales via partners funding deployment. He clarifies the Saudi announcement as revenue (not fundraising), and describes structures that flip economics after partners achieve a target IRR.
- Claimed inference cost advantage: >5× lower; ~3× less energy per token
- Framing: GPU inference OpEx alone can approach Groq’s total cost (CapEx + OpEx)
- Partner-funded deployments: Groq pays back with upside-sharing until IRR is met
- Saudi/Aramco-style structure described as revenue with upfront profit, not a $1.5B raise
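The "flip after a target IRR" structure can be sketched in a few lines. The actual deal terms are not given in the summary, so the shares, rates, and cash flows below are invented for illustration. The sketch relies on a standard fact: for an upfront outlay followed by inflows, the partner's IRR has reached the target once the net present value of its cash flows, discounted at the target rate, is no longer negative.

```python
def partner_payouts(capex: float, revenues: list[float], target_irr: float,
                    share_before: float, share_after: float) -> list[float]:
    """Per-period payouts to a partner who funded `capex` up front.

    The partner receives `share_before` of revenue until its target IRR is
    met, then `share_after`. Simplification: the flip is checked once per
    period, at the start of the period.
    """
    npv = -capex  # partner's NPV at the target rate; the outlay is at period 0
    payouts = []
    for period, revenue in enumerate(revenues, start=1):
        share = share_before if npv < 0 else share_after
        payout = share * revenue
        npv += payout / (1 + target_irr) ** period
        payouts.append(payout)
    return payouts


if __name__ == "__main__":
    # Hypothetical deal: 100 of CapEx, 10% target IRR, 90/10 split that flips.
    print(partner_payouts(100.0, [110.0, 110.0, 110.0], 0.10, 0.9, 0.1))
```

In this made-up example the partner takes 90% of revenue for two periods, crosses the 10% hurdle, and drops to 10% in the third.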
32:29 – 42:18
Data center & power reality check: fake supply, echo-chamber demand, and a coming hard bottleneck
Jonathan warns of widespread misunderstanding: data centers aren’t just real estate, and power availability will become the true limiter. He describes how hyperscaler requests can create “echo” demand signals, leading to misallocated builds today and a power crunch later as compute scales.
- Mismatch: chips vs power vs data center readiness (generators, water, uptime)
- Echo effect: one request replicated across many builders inflates perceived demand
- Many announced projects may never be developed (‘fake data centers’)
- Power becomes a hard bottleneck in ~3–4 years as chip capacity keeps doubling
42:18 – 57:30
Where value accrues: power laws, startup life-cycle, and the AI investment bubble dynamics
They broaden to market structure: massive CapEx spend by hyperscalers, uneven value distribution, and why lots of capital will be incinerated even if the category wins overall. Jonathan introduces the Keynesian beauty contest as a lens on VC behavior and explains what changes when multiple competitors are all well-capitalized.
- Hyperscaler spend (Meta/Google/Microsoft) reflects a race to capture AI value
- Value distribution is a power law; concentration risk grows with market size
- Startup stages: solve unsolved problem → marketing war → durable ‘powers’
- Beauty contest in VC breaks when many players can raise billions; talent gets split
57:30 – 1:07:11
Geopolitics and policy: China’s constraints, Europe’s opportunity, and ‘risk-on’ ecosystems
Jonathan compares the US, China, and Europe through the lens of innovation incentives, censorship, and infrastructure scale. He argues China can compensate with brute-force power buildout but may be constrained by speech/privacy controls, while Europe should focus less on regulating “what doesn’t exist” and more on building concentrated entrepreneurial hubs.
- China: can deploy power/infrastructure at scale, but chip efficiency and openness matter
- Censorship/free speech constraints may stifle innovation and model usefulness
- Europe: talent exists but needs risk-on density and faster labor mobility
- Policy idea: special economic zones (“City F”) to concentrate startup formation
1:07:11 – 1:25:52
Long-term societal impact & rapid-fire: human agency, longevity ‘Mounjaro moments,’ and what comes after hallucinations
The closing section turns to safety and society—especially preserving human agency as AI makes decision-making effortless and abundance grows. In rapid-fire, Jonathan shares contrarian management beliefs, a bold longevity prediction, and a roadmap of breakthrough company archetypes (hallucination, agents, invention, proxies).
- Core safety concern: people voluntarily outsourcing decisions (agency loss)
- Abundance risks ‘financial diabetes’—comfort reducing drive and meaning
- Prediction: if longevity breakthroughs are possible, they may arrive suddenly
- Defining AI-era companies: solve hallucinations → enable robust agents → unlock invention → decision proxies
