Kevin Weil: Why evals are the new core skill in AI products

Name: Kevin Weil: Why evals are the new core skill in AI products
Uploaded: 2025-04-10T00:00:00Z
Duration: 1 h 31 min 40 s

OpenAI builds at the edge of model capabilities, relying on fine-tuning runs and evals written against fuzzy outputs, and betting that better models arrive every two months.

Kevin Weil (guest), Lenny Rachitsky (host)


CHAPTERS

5:27 – 11:23
Image model launch: why it went viral and what the model can really do
Kevin and Lenny start with OpenAI’s newly launched image generation model and the unexpectedly huge social reaction (e.g., “Ghibli”/style trends). Kevin explains the internal signal he watches for (employee usage exploding) and highlights the model’s strength in instruction-following and complex visual rearrangements.
- Internal adoption as a predictor of external product virality
- Why style trends (like Ghibli) emerge organically
- Image model capabilities: complex multi-image instruction-following
- Examples: arranging objects in a room, compositional constraints
- Excitement about emergent user creativity and new use cases
11:23 – 15:57
Joining OpenAI: recruitment story, fast process, and the ‘nine days of silence’
Kevin recounts how he went from planning a break after leaving Planet to quickly interviewing for the CPO role at OpenAI. A brief period of non-response after interviews leads to intense second-guessing—until he learns the team was simply overwhelmed with internal events.
- Timing: leaving Planet and planning time off
- Connection points: Sam Altman, Vinod Khosla, prior relationships
- Rapid interview loop with leadership
- The anxiety of delayed follow-up (and the lesson: don’t assume)
- Decision to join and how quickly it ultimately came together
15:57 – 18:45
What’s most different at OpenAI: pace, shifting tech substrate, and probabilistic products
Kevin explains how building at OpenAI differs from prior companies: the underlying technology changes dramatically every couple of months. He contrasts deterministic software (fixed inputs/outputs) with LLMs’ fuzzy behavior and why product design must account for varying reliability levels.
- Pace as the defining cultural difference
- AI changes the ‘platform’ faster than traditional infrastructure
- LLMs: fuzzy inputs and non-deterministic outputs
- Product design changes radically at 60% vs 95% vs 99.5% correctness
- Why this forces deeper measurement and iteration loops
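To make the 60% vs 95% vs 99.5% gap concrete, here is a small arithmetic sketch. The helper name `expected_failures` and the volume of 1,000 requests are illustrative assumptions, not figures from the episode.

```python
def expected_failures(accuracy: float, requests: int = 1000) -> float:
    """Expected number of wrong answers out of `requests` at a given accuracy."""
    return (1.0 - accuracy) * requests


# The three correctness levels named in the chapter summary.
for acc in (0.60, 0.95, 0.995):
    wrong = expected_failures(acc)
    print(f"{acc:.1%} correct: about {wrong:.0f} wrong answers per 1000 requests")
```

At 60% a user hits a wrong answer constantly, so the product likely needs review and fallback paths built in; at 99.5% errors are rare enough that a lighter-touch design may be reasonable.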
18:45 – 24:40
Evals as a must-have skill: ‘unit tests’ for models and products
Kevin defines evals as quizzes/benchmarks for model behavior and explains why they’re foundational for shipping AI products. Using Deep Research as an example, he describes building evals alongside product definition, then “hill climbing” performance through fine-tuning and iteration.
- Evals = tests/benchmarks for model capability (unit-test analogy)
- Model performance tiers drive different product architectures
- Deep Research: long-running answers that replace hours/days of work
- Creating hero use cases, turning them into evals
- Using eval improvements as evidence the product is ready
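The unit-test analogy can be sketched as a tiny eval harness. This is a minimal illustration, not OpenAI's tooling: `EvalCase`, `run_eval`, and `toy_model` are hypothetical names, and `toy_model` is a canned stand-in where a real harness would call a model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    prompt: str
    passes: Callable[[str], bool]  # grader: does this output count as correct?


def run_eval(model: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Return the fraction of cases whose model output the grader accepts."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for case in cases if case.passes(model(case.prompt)))
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in with canned answers; not a real model call.
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris is the capital."}
    return canned.get(prompt, "I don't know")


# "Hero use cases" turned into checkable cases. Graders are deliberately
# loose because model outputs are fuzzy rather than exact strings.
CASES = [
    EvalCase("2 + 2 = ?", lambda out: out.strip() == "4"),
    EvalCase("Capital of France?", lambda out: "paris" in out.lower()),
    EvalCase("Largest planet?", lambda out: "jupiter" in out.lower()),
]

score = run_eval(toy_model, CASES)
print(f"pass rate: {score:.2f}")
```

Re-running the same cases after each fine-tuning or prompt change gives a number to "hill climb" against, which is the role the summary describes evals playing for Deep Research.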
24:40 – 26:24
Startup opportunities and moats: where OpenAI won’t (and can’t) build everything
Lenny asks where founders should build without being ‘squashed’ by foundation model labs. Kevin argues there will always be far more opportunity outside any single company—especially where data, workflows, and expertise are industry-specific—so OpenAI focuses on empowering developers via the API.
- ‘More smart people outside your walls than inside’ (Ev Williams)
- OpenAI’s focus on being a platform/API for millions of developers
- Vertical, workflow, and company-specific data create defensible opportunities
- Why OpenAI doesn’t want to (and can’t) pursue most niche use cases
- Building with proprietary data/process = durable differentiation
26:24 – 32:50
How OpenAI ships fast: lightweight planning, bottoms-up teams, and iterative deployment
Kevin breaks down the operating model behind OpenAI’s shipping cadence: plan lightly, expect change, and empower teams. He explains the “iterative deployment” philosophy—launch early, learn publicly—and a “model maximalism” mindset that avoids over-scaffolding because models rapidly improve.
- Quarterly planning for alignment, not prediction (Eisenhower quote)
- Bottoms-up execution with directional themes
- Tolerance for mistakes, rollbacks, and imperfect polish (e.g., naming)
- Iterative deployment: ship, observe, iterate with society
- Model maximalism: bet that model limitations shrink quickly
32:50 – 36:04
Competition, coding models, and why ‘ChatGPT’ won consumer awareness
Kevin discusses how the model landscape is now multi-dimensional, with different labs leading in different areas like coding. He also explains how ChatGPT won mindshare: being early, moving fast, and becoming a ‘one-stop shop’ for many modalities and tools (voice, video, deep research, agents).
- Anthropic’s strength in coding; intelligence is multi-dimensional
- OpenAI’s lead is smaller now; competition accelerates progress
- Why being first matters in consumer adoption
- ChatGPT as a unified hub: multimodal + tools + agents (e.g., Operator)
- Strategy: stay the most useful destination by shipping quickly
36:04 – 40:57
Designing AI experiences: reasoning UIs, chain-of-thought tradeoffs, and ‘human-like’ intuition
Kevin shares a counterintuitive lesson: you can often design AI product behavior by reasoning about how humans behave in similar situations. He describes UI decisions for reasoning models (10–25 seconds of ‘thinking’) and how OpenAI adapted after observing DeepSeek’s full chain-of-thought display.
- Reasoning models introduce awkward ‘wait time’ UX challenges
- Why a spinner is bad, but showing full thoughts can be too much
- Human analogy: give small status updates while thinking
- Learning from DeepSeek: curiosity vs scalability of long chain-of-thought
- Settling on summarized thinking updates as a middle ground
40:57 – 45:08
Chat as the interface: universal bandwidth, plus when to use more structured experiences
Kevin argues that chat is underrated as an interface because it matches how humans communicate—unstructured and high-bandwidth. He also notes that for high-volume, well-scoped tasks, more prescribed non-chat interfaces can be better, with chat serving as the universal fallback layer.
- Chat’s versatility mirrors human communication modes
- LLMs finally make unstructured language interfaces viable
- Chat as a ‘catchall’ for anything outside rigid workflows
- Structured UX can outperform chat for repetitive, constrained tasks
- Voice and multimodal communication as extensions of ‘chat’
45:08 – 48:07
Researchers + product teams: building as one unit, not ‘research hands off to product’
Kevin explains OpenAI’s evolution from a research-first org to a dual research-and-product company. He emphasizes that the best outcomes come when research, engineering, design, and product iterate together—using evals and fine-tuning loops—rather than treating internal models like an external API.
- ChatGPT started as a low-key research preview
- OpenAI must remain world-class at both research and product
- Avoiding a ‘handoff’ model where product just consumes research outputs
- Tight feedback loops: evals → data → fine-tuning → better product
- Cultural/muscle-building: improving collaboration over the last ~6 months
48:07 – 53:37
Hiring PMs at OpenAI: PM-light teams, high agency, ambiguity tolerance, and decisiveness
Kevin shares that OpenAI has relatively few PMs and intentionally stays “PM-light” to avoid slowing execution. He outlines what he looks for—high agency, comfort with ambiguity, strong influence skills, and decisiveness—especially in an environment where problems are ill-formed and fast-moving.
- Approximate PM count and the philosophy behind staying PM-light
- Why too many PMs can produce decks vs outcomes
- High-agency and ambiguity-tolerant PM profile
- Leading through influence across product/eng/research
- Decisiveness as a core PM value in uncertain environments
53:37 – 1:04:34
How OpenAI uses AI internally: vibe coding, fine-tuning, ensembles, and automated support
Kevin describes day-to-day AI usage: drafting/summarizing docs, generating specs, and writing evals with models. He highlights “vibe coding” (rapid prototyping with tools like Cursor/Windsurf), predicts more fine-tuning everywhere, and explains how OpenAI uses model ensembles and automation (e.g., customer support) to operate at massive scale with small teams.
- Internal AI use: summarize/write docs, specs, and eval assistance
- Vibe coding explained: fast ‘hands off the wheel’ prototyping
- Expectation: every product team will include ML/research capability
- Fine-tuning via examples; breaking problems into smaller tasks
- Ensembles of models by cost/latency needs; automation in support workflows
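One possible reading of "ensembles of models by cost/latency needs" is routing each subtask to the cheapest model that is capable and fast enough. The sketch below assumes that reading; the model names, costs, latencies, and capability scores are all made up for illustration and do not describe any real model lineup.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_call: float  # arbitrary units (assumed values)
    latency_s: float      # typical response time in seconds (assumed values)
    capability: int       # 1 = simple tasks only, 3 = hardest tasks


# Hypothetical lineup, not real products or prices.
OPTIONS = (
    ModelOption("small-fast", 0.001, 0.3, 1),
    ModelOption("medium", 0.01, 1.5, 2),
    ModelOption("large-reasoning", 0.10, 20.0, 3),
)


def pick_model(
    difficulty: int,
    max_latency_s: float,
    options: Sequence[ModelOption] = OPTIONS,
) -> Optional[ModelOption]:
    """Cheapest option that meets the difficulty and latency budget, else None."""
    eligible = [
        m for m in options
        if m.capability >= difficulty and m.latency_s <= max_latency_s
    ]
    return min(eligible, key=lambda m: m.cost_per_call, default=None)


# A support workflow might classify a ticket quickly, then draft a reply
# with a stronger model that is allowed to take longer.
classify = pick_model(difficulty=1, max_latency_s=1.0)
draft = pick_model(difficulty=3, max_latency_s=60.0)
print(classify.name, draft.name)
```

Returning `None` when nothing fits makes the trade-off explicit: the caller has to relax the latency budget or simplify the subtask rather than silently getting an unsuitable model.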
1:04:34 – 1:08:08
Raising kids with AI and the promise of personalized tutoring
Kevin discusses how AI-native his kids already are and what skills he’s emphasizing: curiosity, independence, confidence, and learning how to think. He argues personalized tutoring may be one of AI’s most world-changing applications and is surprised a truly massive, globally adopted AI tutoring experience isn’t already ubiquitous.
- AI-native childhood: self-driving cars and AI conversations feel normal
- Future-proof skills: curiosity, independence, self-confidence, thinking
- Personalized tutoring as a major lever for learning gains
- Access and equity: free tools + widespread smartphones could scale globally
- Why the tech seems ready, but adoption/productization still lags
1:08:08 – 1:17:59
Why Kevin is optimistic: technology’s long arc, reskilling, and ‘the worst model you’ll ever use’
Kevin makes an optimistic case: technology historically improves quality of life, though transitions create real dislocations that require support. He predicts rapid compounding: models get smarter, cheaper, faster, and safer, and emphasizes the mindset shift that today’s AI is the worst it will ever be—so builders should ‘skate to where the puck is going.’
- Long-run optimism grounded in historical tech progress
- Acknowledging short-term dislocation and the need for policy/support
- ChatGPT as a reskilling/learning tool
- Compounding trends: better + cheaper + faster + safer models
- Core mantra: today’s model is the worst you’ll ever use; build ahead of capabilities
1:17:59 – 1:22:01
Libra reflections: the vision, what went wrong, and why it still should exist
Kevin reflects on Libra (Meta’s crypto project) as a major career disappointment because it targeted a real, regressive problem: expensive, slow remittances. He explains how the ambition and timing (too many changes at once, Meta’s reputation, regulatory fears) contributed to failure, and notes the tech lives on via open-source descendants like Aptos and Mysten.
- Libra’s goal: instant, near-free payments in WhatsApp/Messenger
- The remittance problem: high fees and poor user experience
- What went wrong: too much change at once + reputational/regulatory headwinds
- What he’d do differently: introduce change more gradually
- Legacy: open-source tech enabling Aptos and Mysten; renewed feasibility today
1:22:01 – 1:31:40
Lightning round: books, products, mottos, and prompting tips
Kevin shares favorite book recommendations, products he loves (Waymo, vibe coding tools), and a personal motto about consistent work compounding over time. He also offers practical prompting advice: give examples (poor man’s fine-tuning) and use role framing (‘be Einstein’) while noting prompting should become less of a specialized skill over time.
- Book recs: Co-Intelligence, The Accidental Superpower, Cable Cowboy
- Favorite products: Windsurf/Cursor vibe coding; Waymo
- Motto: good work consistently over a long period of time
- Prompting tip: include examples; role prompting can shift model behavior
- Vision: prompt engineering should fade as models get more usable
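The two prompting tips (give examples, add a role) can be combined in a small prompt builder. This is a sketch under assumptions: `build_prompt`, the `Input:`/`Output:` layout, and the sample data are illustrative choices, not a format recommended in the episode.

```python
from typing import Sequence, Tuple


def build_prompt(role: str, examples: Sequence[Tuple[str, str]], question: str) -> str:
    """Assemble a role line, worked examples, and the new question into one prompt."""
    lines = [f"You are {role}."]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
    # The final "Output:" is left open for the model to complete.
    lines.append(f"Input: {question}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_prompt(
    role="a concise release-notes editor",
    examples=[
        ("fixed bug where login page crashed on empty password", "Fixed a crash on the login page."),
        ("added dark mode toggle in settings", "Added a dark mode toggle."),
    ],
    question="made search results load faster by caching",
)
print(prompt)
```

The examples do the work the summary calls "poor man's fine-tuning": they show the desired style directly instead of describing it.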
