The Twenty Minute VC
Jonathan Ross: DeepSeek Special - How Should OpenAI and the US Government Respond | E1253
CHAPTERS
- 0:00 – 1:39
DeepSeek as “Sputnik 2.0”: why this moment matters
Harry and Jonathan open with a blunt claim: DeepSeek is a geopolitical and industry inflection point comparable to Sputnik. Jonathan previews the core themes—true cost vs headline cost, distillation from OpenAI, and why “open” changes the competitive landscape.
- •DeepSeek framed as a wake-up call for the US AI ecosystem
- •Training cost headlines vs the broader reality of what it took to get results
- •Early thesis: OpenAI may need to respond by open-sourcing to keep users
- •Why this is bigger than a normal model release (speed, surprise, adoption)
- 1:39 – 2:36
Jonathan Ross’s vantage point: Google TPU → Groq and inference-first thinking
Jonathan explains his background in AI hardware, from Google’s TPU work to founding Groq. This context sets up his emphasis on inference economics and how hardware constraints shape real-world AI competition.
- •Experience building accelerators (TPU) and founding Groq (LPUs)
- •Why hardware/inference realities matter as much as model quality
- •Positioning to comment on compute, efficiency, and deployment constraints
- 2:36 – 4:29
Compute, data, and the marketing narrative behind the “$6M model”
They unpack why DeepSeek surprised people: it appears to reach frontier-like performance with far fewer GPUs and a far smaller budget. Jonathan argues the $6M number is partly marketing, and that data quality and post-training/distillation spend are the real story.
- •Most labs can access similar raw data; compute and data quality drive deltas
- •Scaling laws: more tokens/compute generally improves models, but not uniformly
- •DeepSeek’s reported 2,000 GPUs for 60 days is a plausible but incomplete framing
- •Key nuance: high-quality training signals can beat “more tokens of average data”
- 4:29 – 7:29
Distillation explained: using OpenAI outputs as high-quality training signal
Jonathan explains distillation as ‘learning from a smarter tutor’ and ties it to scaling-law assumptions about uniform data quality. He uses AlphaGo/AlphaGo Zero as an analogy for iterative self-improvement and shows how ‘jumping’ to a better teacher accelerates progress.
- •Distillation: train a student model on outputs from a stronger teacher model
- •Scaling laws assume uniform data quality; better data reduces required tokens
- •Synthetic data flywheel: model generates data → retrain → improved model
- •DeepSeek likely spent far more on generating/scraping high-quality outputs than on the base training run
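The teacher-student idea in this chapter can be illustrated with a toy sketch. It is not DeepSeek's or OpenAI's pipeline: the `teacher` function, the linear student, and every hyperparameter are assumptions chosen so the example runs with only the standard library. What it shows is the shape of distillation: the student never sees ground-truth labels, only the teacher's outputs.

```python
import random


def teacher(x: float) -> float:
    """Hypothetical stand-in for a stronger model: returns its answer for input x."""
    return 3.0 * x + 1.0


def distill(num_examples: int = 200, epochs: int = 500, lr: float = 0.1, seed: int = 0):
    """Fit a tiny linear student (w * x + b) to the teacher's outputs."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-1.0, 1.0) for _ in range(num_examples)]
    # The training targets are teacher outputs, not human-written labels.
    targets = [teacher(x) for x in inputs]

    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(inputs, targets):
            err = (w * x + b) - y
            grad_w += 2.0 * err * x
            grad_b += 2.0 * err
        # Full-batch gradient descent on mean squared error.
        w -= lr * grad_w / num_examples
        b -= lr * grad_b / num_examples
    return w, b


student_w, student_b = distill()
```

After training, the student's parameters sit very close to the teacher's (3.0 and 1.0). In a real setting the teacher's outputs would be generated text and the student a language model, but the data flow is the same.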
- 7:29 – 9:31
What DeepSeek innovated beyond copying: automated RL-style checking
They address the ‘just copying’ critique. Jonathan argues DeepSeek also introduced genuinely clever reinforcement/automation tricks, especially replacing human grading with programmatic, deterministic checks where possible.
- •Not merely duplication—some reinforcement learning techniques are novel and simple
- •Replacing human preference labeling with automated ‘answer-in-the-box’ verification
- •Why automation reduces friction, cost, and ambiguity in training signals
- •Limits: Jonathan notes he hasn’t dug deeply into all reward-model details
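A programmatic check of the kind described above might look like the following sketch. It assumes the model is instructed to put its final answer inside a LaTeX-style `\boxed{...}` marker and that the reference answer is a plain string. The marker format, the function names, and the exact-match rule are illustrative assumptions, not DeepSeek's actual reward code.

```python
import re
from typing import Optional

# Matches \boxed{...} with no nested braces inside the box.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(completion: str) -> Optional[str]:
    """Return the contents of the last boxed answer in the completion, if any."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def reward(completion: str, reference: str) -> float:
    """Deterministic reward: 1.0 if the boxed answer equals the reference, else 0.0."""
    answer = extract_boxed(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0


correct = reward(r"Adding the terms gives \boxed{42}.", "42")
wrong = reward(r"Adding the terms gives \boxed{41}.", "42")
unboxed = reward("I think the answer is 42.", "42")
```

Because the check is a string comparison rather than a human judgment, it is cheap, repeatable, and unambiguous, which is the point made in the bullets above. It only works for tasks with a checkable final answer.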
- 9:31 – 11:41
Export controls loopholes and OpenAI’s accidental subsidy via API usage
Discussion shifts to how DeepSeek could get the compute and data it needed despite restrictions. Jonathan argues cloud access creates a major enforcement gap, and that heavy OpenAI API usage can effectively subsidize competitors if tokens are not fully profitable.
- •Why smuggling GPUs may be unnecessary if cloud GPUs are rentable from abroad
- •IP-blocking is porous; identity/location verification is hard to enforce
- •If API tokens are subsidized, large-scale scraping/distillation shifts cost to OpenAI
- •OpenAI may retain logs/data that could theoretically be used for training
- 11:41 – 13:08
The biggest risk: US customer data exposure and the reality of “delete”
They focus on data security and why users underestimate retention and government access. Jonathan argues ‘delete’ often means ‘marked deleted,’ and that even indirect data (neighbors, spouses, health info) can create vulnerabilities.
- •Data retention practices: deletion frequently means soft-delete, not true erasure
- •National security angle: state access to aggregated user data is the key concern
- •Second-order exposure: others can leak information about you unintentionally
- •Why AI apps amplify risk due to sensitive prompts and habitual usage
- 13:08 – 15:41
CCP influence, content shaping, and the TikTok analogy—now with open source
Harry asks directly whether DeepSeek could be used to increase CCP control; Jonathan says the structural issue is that any China-based operator can be compelled. They discuss censorship behavior (e.g., Tiananmen responses) and the scarier prospect of subtle persuasion on contested topics.
- •Operating in China/HK can imply compulsory data access and answer constraints
- •Demonstrations of selective topic sensitivity and controlled outputs
- •Risk evolves from censorship to persuasion: “cogent” biased arguments at scale
- •Open source complicates traditional responses like bans or forced divestiture
- 15:41 – 17:21
Why Groq chose to host DeepSeek: offering a ‘no-data-retention’ alternative
Jonathan explains Groq’s difficult decision to run DeepSeek after it became the #1 app. The goal: let users access the model without sending data to China, leveraging Groq’s claim of storing nothing (memory-only).
- •Strategic shift: refusing Chinese models initially, then adding DeepSeek due to adoption
- •User protection framing: provide an alternative where prompts aren’t retained
- •Prediction: CCP may shift strategy after seeing success and seek data capture
- •DeepSeek as hedge-fund-origin project, but ‘influenced’ and potentially leveraged by the state
- 17:21 – 19:09
Models are now commoditized: seven powers, moats, and OpenAI’s open-source dilemma
Jonathan argues DeepSeek makes commoditization undeniable and shifts focus to defensibility (Hamilton Helmer’s seven powers). He claims OpenAI’s strongest current moat is brand, and suggests open-sourcing could be the best strategic response to preserve distribution and goodwill.
- •Commoditization reduces switching costs and erodes pricing power
- •Seven powers applied: brand, scale, network effects, switching costs, etc.
- •OpenAI’s choice: protect proprietary advantage vs win users via openness
- •Distribution advantage is weakening as alternatives spread quickly
- 19:09 – 37:23
Why $500B ‘Stargate’ isn’t absurd: inference dwarfs training
They debate whether efficiency gains make massive infrastructure spending look ridiculous. Jonathan argues the opposite: training breakthroughs trigger far larger inference spend, especially with test-time compute, so total compute demand can grow even as unit costs fall.
- •Google TPU origin story: ML ‘works’ but production cost explodes
- •Inference historically ran 10–20x training at Google; Jonathan predicts inference at ~95% of the total long-term
- •Test-time compute increases inference tokens dramatically for some queries
- •Efficiency doesn’t necessarily reduce spend; it often expands usage
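The 10–20x ratio and the ~95% figure in this chapter are two ways of stating the same kind of quantity, and a few lines of arithmetic connect them. This sketch assumes both numbers refer to the split of total compute between training and inference, which the notes do not state explicitly; the function name is illustrative.

```python
def inference_share(inference_to_training_ratio: float) -> float:
    """Fraction of total (training + inference) compute that goes to inference."""
    return inference_to_training_ratio / (1.0 + inference_to_training_ratio)


share_at_10x = inference_share(10.0)  # 10 / 11
share_at_20x = inference_share(20.0)  # 20 / 21
```

Under that assumption, a 10x ratio corresponds to roughly 91% of compute going to inference and a 20x ratio to roughly 95%.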
- 37:23 – 41:01
Big Tech stocks, Jevons paradox, and why cheaper tokens increase demand
Harry asks why markets punished AI/hardware names; Jonathan says investors are over-indexing on training demand. He argues Jevons paradox and price elasticity mean lower costs create more applications, more developers, and ultimately more inference—and then more training again.
- •Market confusion: assuming efficiency implies fewer chips needed
- •Jevons paradox: lower cost → higher consumption → higher total spend
- •Developer adoption spikes when token costs drop and quality rises
- •Positive feedback loop: more inference demand motivates more training
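The Jevons argument depends on how strongly demand responds to price. A minimal sketch, assuming a constant-elasticity demand curve: the elasticity values and the 10x price drop are illustrative assumptions, not figures from the episode.

```python
def total_spend(price: float, elasticity: float,
                base_price: float = 1.0, base_quantity: float = 1.0) -> float:
    """Total spend under constant-elasticity demand: Q = Q0 * (P / P0) ** -elasticity."""
    quantity = base_quantity * (price / base_price) ** (-elasticity)
    return price * quantity


spend_before = total_spend(price=1.0, elasticity=1.5)
spend_after = total_spend(price=0.1, elasticity=1.5)      # tokens become 10x cheaper
spend_inelastic = total_spend(price=0.1, elasticity=0.5)  # weakly responsive demand
```

With elasticity above 1, the 10x price cut raises total spend (here by a factor of about 3.2); with elasticity below 1, total spend falls. Jonathan's claim amounts to saying that demand for tokens is in the first regime.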
- 41:01 – 42:55
Nvidia’s margins and the emerging split: high-margin training vs high-volume inference
They discuss whether Nvidia’s high margins invite disruption. Jonathan frames training as a premium niche and inference as the larger market, suggesting inference-specialized providers can absorb lower-margin volume while Nvidia preserves high-margin positioning.
- •Training as ‘mainframe-like’ high-margin business; inference as the larger TAM
- •Inference-focused chips/services can complement Nvidia’s margin structure
- •Many investors still misunderstand inference’s dominance despite years of signals
- •Clear definitions: training builds the model; inference uses it at scale
- 42:55 – 48:19
Next efficiency wave: Mixture-of-Experts, sparse compute, and what competitors will copy
Jonathan explains MoE architectures and how DeepSeek uses many experts but activates only a subset per token, reducing compute while maintaining capacity. He predicts widespread adoption of these ideas and intensified synthetic-data generation using vast GPU fleets.
- •MoE basics: dense models use all parameters; MoE routes to a subset of experts
- •DeepSeek’s large expert count enables sparsity and efficiency
- •More parameters can help retain information; sparsity controls runtime cost
- •Competitors will replicate architecture and scale synthetic data + training
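The routing idea behind Mixture-of-Experts can be sketched in a few lines. This is a toy illustration, not DeepSeek's architecture: the experts are simple scaling functions, the router scores are hard-coded rather than produced by a learned gate, and `moe_forward`, the expert count, and `top_k=2` are all assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float, router_scores: Sequence[float],
                experts: Sequence[Callable[[float], float]],
                top_k: int = 2) -> Tuple[float, List[int]]:
    """Run only the top_k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)
    active = ranked[:top_k]
    # Renormalize the gate weights over the selected experts only.
    weights = softmax([router_scores[i] for i in active])
    output = sum(w * experts[i](x) for w, i in zip(weights, active))
    return output, active


# Eight toy experts; the expert at index i multiplies its input by i + 1.
experts = [lambda x, k=k: k * x for k in range(1, 9)]
router_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]
output, active = moe_forward(3.0, router_scores, experts, top_k=2)
```

Only two of the eight experts execute for this input, so runtime cost scales with `top_k` while total capacity scales with the number of experts. That is the trade-off the bullets above describe.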
- 48:19 – 1:00:01
Compute bottlenecks, AI arms-race nerves, and where value accrues in apps
They note that DeepSeek limited signups (Chinese phone numbers), which they read as a sign of inference scarcity. The conversation closes on both fear and optimism: AI-enabled cyber offense and deniable escalation on one side, and on the upside rapid product creation, with value accruing to polished, crafted user experiences even in a ‘wrapper’ world.
- •Inference scarcity is real: serving capacity scales with the number of end users, not the number of researchers
- •AI security risk: LLMs finding exploits and automating offensive cyber operations
- •Arms-race dynamics: deniability and low-friction attacks increase escalation risk
- •Value in the stack: craftsmanship, reliability, and ‘details’ differentiate products