No Priors Ep. 123 | With ReflectionAI Co-Founder and CEO Misha Laskin
CHAPTERS
- 0:00 – 0:53
Show setup: RL’s resurgence and ReflectionAI’s new agent release
Sarah Guo frames the episode around reinforcement learning’s renewed momentum and introduces ReflectionAI co-founder/CEO Misha Laskin. She previews topics ranging from post-training and reward modeling to product strategy and the broader AI landscape.
- •RL is “back with a vengeance” as a driver of new agent capabilities
- •ReflectionAI team background (DeepMind: AlphaGo/AlphaZero/Gemini)
- •Episode roadmap: reward modeling, distributions, RL vs robotics, market dynamics
- •Introduces the new code comprehension agent release
- 0:53 – 3:25
Superintelligence vs. autonomous superintelligent systems (product–research co-design)
Misha distinguishes between pursuing superintelligence as benchmark-driven lab research versus building deployable autonomous systems by working backward from real user needs. He argues startups can win by focusing on an ASI-complete product category and tightly coupling product and research.
- •Narrow superintelligence already existed (e.g., AlphaGo)
- •Two approaches: benchmark-maxing lab work vs. deployment-first design
- •Co-designing product + research forces focus but targets real problems
- •Choosing an “ASI-complete” category is crucial for capability depth
- 3:25 – 6:34
Misha’s path from physics to RL and DeepMind-era inspiration
Misha recounts how early exposure to the Feynman Lectures sparked a physics trajectory, but AlphaGo shifted his view toward AI as the “root science of our time.” He sought out a top reinforcement learning research group, landing in Pieter Abbeel’s lab.
- •Physics felt like the foundational discipline behind major technologies
- •AI felt like a fast-moving frontier vs. physics’ more crystallized core
- •AlphaGo demonstrated startling learned “reasoning” in a domain
- •Decision: join a top RL lab (Pieter Abbeel) to pivot into AI
- 6:34 – 7:36
From Gemini RL work to the thesis: scale RL on top of LLMs
Misha explains working with Yannis Antonoglou on Gemini-era RL, seeing large language model training at scale firsthand. They concluded the missing paradigm for reaching AGI/ASI is scaling reinforcement learning on top of LLMs—and that the field is still early despite recent progress.
- •Yannis led RL for Gemini 1/1.5; Misha worked closely on the team
- •Exposure to LLM training at scale clarified “what’s to come”
- •Thesis: scalable RL over LLMs is the next (and pivotal) paradigm
- •Progress has begun, but Misha believes the field is earlier than it appears
- 7:36 – 11:46
Asimov launch: a codebase ‘deep research’ comprehension agent
Misha introduces Asimov as a code comprehension-focused agent designed to feel like having a principal engineer who understands the organization. He argues the enterprise pain is not writing code but understanding complex systems and scattered organizational knowledge.
- •Asimov positioned as best-in-class code research/comprehension agent
- •Targets enterprise reality: productivity gains from code-gen tools are often low or negative
- •Key bottleneck: knowledge is distributed across code, chats, docs, and people
- •“Omniscient oracle” vision enables more reliable downstream action agents
- 11:46 – 16:16
Why Asimov differs: system design, tool use, and long-context reasoning
The differentiation is not one trick but an integrated system: product features that pull in non-code knowledge plus research advances in agent design and post-training. Misha emphasizes training for long-context reasoning, multi-hop/tool use, and task-relevant capabilities rather than abstract benchmarks.
- •Comprehension requires ingesting multiple knowledge sources beyond the repo
- •Team “tribal knowledge” can be taught to the system for future reuse
- •Post-training focuses on long-context reasoning and neural-retrieval-like behavior
- •Tool-use training must match real tools (e.g., Jira) and real user workflows
- 16:16 – 21:52
Eval philosophy: customer-coupled evaluation as the startup advantage
Misha describes evaluations as the core leverage point for a focused startup competing with incumbents that run hundreds of diffuse evals. Reflection builds evals from real customer prompts and pain points (like onboarding), then works backward to the model and system capabilities needed to solve them.
- •Startups win via focus and iteration speed, not pretraining budgets
- •Incumbents: many evals → generality, but spread thin and far from product
- •Reflection: build evals directly from customer behavior and needs
- •Critical loop: customer need → required capability → agent/product/model changes
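The loop above can be sketched as a tiny eval harness. This is a minimal illustration and not Reflection's actual tooling: the `EvalCase` type, the capability tags, the prompts, and `stub_agent` are all hypothetical names made up for the example. It shows how evals drawn from customer prompts can be grouped by the capability they exercise, so that failure rates point at what to improve next.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    prompt: str                     # stands in for a real customer prompt (illustrative)
    capability: str                 # hypothetical tag for the capability being exercised
    check: Callable[[str], bool]    # returns True when the agent's answer is acceptable


def failure_rate_by_capability(
    cases: List[EvalCase], agent: Callable[[str], str]
) -> Dict[str, float]:
    """Run every case through the agent and report the failure rate per capability."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.capability] += 1
        if not case.check(agent(case.prompt)):
            failures[case.capability] += 1
    return {cap: failures[cap] / totals[cap] for cap in totals}


def stub_agent(prompt: str) -> str:
    # Placeholder agent: only knows where auth code lives.
    return "see services/auth" if "auth" in prompt else "unknown"


cases = [
    EvalCase("Where is auth token refresh handled?", "code_navigation",
             lambda out: "auth" in out),
    EvalCase("Why did the nightly job slow down?", "cross_system_debugging",
             lambda out: out != "unknown"),
]

rates = failure_rate_by_capability(cases, stub_agent)
```

Sorting `rates` by value gives a rough priority list: the capability with the highest failure rate is the one to work backward from.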
- 21:52 – 24:35
Where Asimov shines: semantic, cross-system debugging and “unknown unknowns”
Asimov is optimized for queries where engineers don’t know where to look—issues spanning systems, teams, and sources. Misha contrasts quick file-level questions (where lightweight tools suffice) with slow-to-answer semantic problems like flaky tests, infra slowdowns, and cross-PR interactions.
- •Best fit: semantic queries where the user lacks function/file pointers
- •Not ideal for simple, local file-level Q&A that should be instant
- •Examples: flaky tests, unexplained job slowdowns, cross-team race conditions
- •Value resembles “deep research” mode: compile evidence across many sources
- 24:35 – 28:38
Team-wide memory: building a Git-like system for organizational knowledge
They discuss designing collaborative memory that captures meta-knowledge around the codebase—who can edit it, who can view it, and how authority works. Misha anticipates workflows resembling pull requests and code ownership, effectively versioning shared knowledge rather than code.
- •Team memory needs permissions, edit/view controls, and governance
- •Customers start with trusted senior engineers as memory “gatekeepers”
- •Likely PR-style review and ownership-based approvals (Git-like)
- •Memory becomes “GitHub++”: versioning meta-knowledge around systems
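A rough sketch of what PR-style, ownership-gated team memory could look like. The episode describes the idea only at a high level, so everything here is an assumption for illustration: the `TeamMemory` and `Proposal` types, the topic-to-owner mapping, and the rule that only a listed owner can approve a change.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Proposal:
    author: str
    topic: str
    text: str
    approved_by: Optional[str] = None


@dataclass
class TeamMemory:
    # topic -> people allowed to approve changes (analogous to code owners)
    owners: Dict[str, Set[str]]
    # topic -> every approved version, oldest first (analogous to commit history)
    history: Dict[str, List[str]] = field(default_factory=dict)

    def approve(self, proposal: Proposal, reviewer: str) -> bool:
        """Merge a proposal only if the reviewer owns the topic."""
        if reviewer not in self.owners.get(proposal.topic, set()):
            return False
        proposal.approved_by = reviewer
        self.history.setdefault(proposal.topic, []).append(proposal.text)
        return True

    def current(self, topic: str) -> Optional[str]:
        """Return the latest approved version for a topic, if any."""
        versions = self.history.get(topic, [])
        return versions[-1] if versions else None


memory = TeamMemory(owners={"ci": {"alice"}})
note = Proposal(author="bob", topic="ci", text="Nightly job is slow when cache is cold.")
rejected = memory.approve(note, reviewer="bob")     # bob does not own "ci"
accepted = memory.approve(note, reviewer="alice")   # alice is a listed owner
```

Keeping the full `history` list rather than overwriting the latest entry is what makes the store versioned: earlier statements of team knowledge stay recoverable, much as earlier commits do.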
- 28:38 – 32:48
Using pre-trained/open models: why post-training is the wedge (for now)
Misha argues that pretraining is converging, while open-weight models have become surprisingly strong—making it feasible for a new lab to compete via post-training and RL with far less compute. He also notes long-term compute/capital needs remain high, but revenue can sustainably fund frontier work.
- •Bet: open-weight models would be “good enough” to start; they exceeded expectations
- •Pretraining extractable signal is limited without extreme scaling costs
- •RL/post-training compute is orders of magnitude lower than pretraining compute
- •Goal: build a revenue-generating frontier business without cloud-provider dependence
- 32:48 – 37:21
Why scaling RL is hard: reward models, hacking, and weak credit assignment
Misha locates the primary bottleneck in reward structure rather than raw compute: we lack clean rewards for arbitrary tasks, so noisy proxies get exploited. He also critiques current RL algorithms for poor exploration and credit assignment, producing meandering reasoning chains unlike human structured thinking.
- •Core constraint: accurate reward modeling for arbitrary tasks is ASI-complete
- •Noisy rewards (e.g., LLM-as-judge) get hacked; ground-truth rewards are scarce
- •Field relies on shortcuts (rubrics, synthetic pipelines) to make progress
- •Algorithm gaps: weak exploration and minimal atomic-level credit assignment
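A toy illustration of the reward-hacking point above, under loudly stated assumptions: the `Candidate` type, the length bonus, and the weights are invented for the example and do not describe any real judge model. It shows how selecting against a flawed proxy reward can prefer an answer that the ground-truth reward would reject.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    name: str
    correct: bool   # ground truth, which is usually unavailable during training
    length: int     # a surface feature that a judge can be biased toward


def ground_truth_reward(c: Candidate) -> float:
    return 1.0 if c.correct else 0.0


def proxy_judge_reward(c: Candidate) -> float:
    # Hypothetical flawed judge: partial credit for correctness plus a length bonus.
    return 0.5 * float(c.correct) + 0.01 * c.length


def pick_best(cands: List[Candidate], reward: Callable[[Candidate], float]) -> Candidate:
    """Stand-in for optimization pressure: keep whatever scores highest."""
    return max(cands, key=reward)


candidates = [
    Candidate("short_correct", correct=True, length=20),
    Candidate("long_wrong", correct=False, length=90),
    Candidate("short_wrong", correct=False, length=10),
]

chosen_by_proxy = pick_best(candidates, proxy_judge_reward)
chosen_by_truth = pick_best(candidates, ground_truth_reward)
```

Here the proxy selects `long_wrong` while the ground truth selects `short_correct`. The stronger the optimization against the proxy, the more reliably flaws like this length bias get exploited.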
- 37:21 – 38:25
Training in “copycat” software environments + a provocative generalization take
Asked about synthetic replicas of popular software apps to train agents, Misha is bullish and argues “generalization” often means moving test scenarios into the training distribution. He suggests users experience generalization when training environments sufficiently resemble real evaluation conditions.
- •Bullish on synthetic/copycat environments as a practical training path
- •Hot take: “no such thing as generalization—only bringing test into train”
- •Users perceive generalization when distributions match real-world tasks
- •Implication: environment design becomes central to capability gains
- 38:25 – 44:27
When ASI arrives: jagged superintelligence, economics, and real-world deployment gaps
Misha maintains that within a couple of years we’ll see definitive superhuman performance in meaningful slices of work (e.g., parts of coding), even if not universally. He argues benchmarks can saturate without transforming GDP because deployment, meaning integration into organizations and workflows, is half the problem.
- •Prediction: superintelligence in “slivers” of coding within a couple of years
- •Once a recipe exists, expansion becomes an economic ROI decision
- •“Jagged superintelligence”: uneven capability across domains and tasks
- •Benchmarks ≠ enterprise productivity; deployment determines real impact
- 44:27 – 48:10
Windsurf non-acquisition and the push toward verticalization (owning intelligence)
Misha interprets the Windsurf outcome as evidence that frontier labs must verticalize into high-leverage categories like coding and search. He warns product-only startups built on third-party models face existential risk if frontier labs successfully integrate model + product and subsidize pricing.
- •Verticalization is accelerating in critical categories (search, coding)
- •Owning the model doesn’t guarantee best product—org distance still matters
- •But successful lab verticalization can overwhelm wrapper products
- •Startups without in-house intelligence face margin/subsidy disadvantages
- 48:10 – 55:12
Beyond RL: alternative datasets, RL vs robotics, and why language is different
Sarah asks about leveraging “non-RL” datasets (e.g., recording org conversations) and Misha agrees such data could be valuable. He contrasts robotics and language: robotics rewards are often more hackable due to noisy sensory signals, making RL harder beyond areas with clean rewards (like locomotion), while language benefits from internet-scale imitation and verifiable outcomes via synthetic/RL loops.
- •Interest in diverse, real-world organizational process datasets
- •Robotics RL is harder: VLM/pixel rewards are highly hackable and noisy
- •RL works best in robotics when rewards are clean (e.g., locomotion)
- •Language: internet imitation + scalable synthetic/RL is the practical path
- 55:12 – 1:02:54
Expanding beyond engineering: “contextual core” strategy and multi-decade deployment arc
Misha frames coding as foundational because agents interact with software through function calls, making coding reasoners operationally transferable to other roles (PM, support, sales). He predicts we’re earlier than most think in deployment; even with a blueprint for building ASIs, category-specific product and research work will make adoption a multi-decade effort.
- •Coding is a gateway: software interaction often reduces to code/function calls
- •Strategy: build deep comprehension first, then expand to adjacent enterprise roles
- •Expect many category-specific “snowflake” environments and post-trains
- •Blueprints may arrive soon, but deployment and impact will unfold over decades