Self-Play for LLMs, AI for Biology, Formal Verification, and More | YC Paper Club

It's hard to keep up with the latest AI research. That's why we started YC Paper Club — a small group of researchers, engineers, and founders who meet every two weeks at our Mountain View office to present and discuss new papers together. In this session, we cover whether scaling laws hold for protein biology, AlphaZero-style self-play for language models, streaming RAG for real-time voice agents, formal verification with Lean, and why one founder thinks programming with agents is exactly like playing a real-time strategy game. Stay tuned for more. Interested in joining a future Paper Club? Apply here: https://events.ycombinator.com/ycpaperclub Apply to Y Combinator: https://www.ycombinator.com/apply Work at a startup: https://www.ycombinator.com/jobs 00:00 — Introduction by Francois Chaubard 05:47 — Yasa Baig: A World Model of Protein Biology (https://biohub.ai/esm/protein/about) 25:38 — Luke Bailey: Scaling Self-Play with Self-Guidance (https://arxiv.org/pdf/2604.20209) 37:51 — Arnab Maiti: Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage (https://arxiv.org/pdf/2510.02044) 47:40 — Robert George: Lean for Science: How Formal Proofs Can Change Mathematics, AI, and Scientific Computing (https://arxiv.org/abs/2602.22631) 58:52 — Lukens Orthwein: Founder AI Hacks: Programming is an RTS Game Now 1:16:07 — Closing Remarks

Francois ChaubardhostYasa BaigguestArnab MaitiguestRobert GeorgeguestLukens Orthweinguest

Jun 12, 20261h 16mWatch on YouTube ↗

CHAPTERS

0:08 – 1:09
Paper Club kickoff: applied focus, upcoming talks, and what the group is exploring next
Francois opens by outlining the session’s applied theme and previewing the lineup: biology foundation models, LLM self-play, streaming voice RAG, formal verification, and founder workflows. He also invites attendees to propose future presentations, benchmarks, and open-source projects.
- •Session overview and speaker lineup
- •Call for future presenters and topics the club wants more of
- •Invitation to propose club challenges, benchmarks, and open-source collaborations
- •Emphasis on practical, applied research directions
1:09 – 2:40
Limits of human-solution training: why sampling beyond the “human subspace” matters
Francois frames a core concern: training on human-generated solutions may confine models to a typical-set region (H) and make it unlikely to discover the broader space of solutions (F−H) with finite compute. He contrasts AlphaGo-like human-anchored learning with AlphaZero-like self-play as a path to less biased exploration.
- •Argument that training on human solutions constrains exploration to a typical set (H)
- •Finite compute/test-time search may not reach enough of (F−H)
- •AlphaGo vs. AlphaZero as an analogy for bias vs. unbiased exploration
- •Motivation for self-play as a way to escape human priors
2:40 – 5:47
Intelligence per sample & per watt: streaming learning, ICL vs LoRA behavior, and alternatives to backprop
Francois discusses two “remaining problems”: improving learning efficiency per new sample and efficiency per watt. He notes non-monotonic performance in in-context learning with more examples, compares this to LoRA adaptation dynamics, and argues humans/algorithms should improve more steadily with experience—implying missing learning procedures. He also flags interest in non-backprop learning approaches inspired by neuroscience.
- •“Intelligence per sample” and continuous learning as a core research target
- •ICL performance can degrade or cliff at training context limits
- •LoRA can help at small sample sizes but also saturates; different optima over time
- •Interest in intelligence-per-watt framing and smaller models’ efficiency
- •Skepticism about biological plausibility of backprop; curiosity about alternatives (e.g., SPSA)
5:47 – 11:51
Protein foundation models: the “Bitter Lesson” applied to biology and protein sequences
Yasa introduces the Biohub/ESM work and positions protein modeling as an analogue to language modeling: proteins are sequences over a 20-token alphabet, and masked-language-model training can extract structure/function signals. He frames the question: do scaling laws and general pretraining recipes transfer to protein biology, or does the domain break them?
- •Proteins as 20-token sequences whose 3D structure drives function
- •Masked token prediction as the core pretraining objective (BERT-style)
- •Mapping NLP concepts to protein modeling (tokens/data/emergence/interpretability)
- •Central hypothesis: scaling + general methods can outperform handcrafted biological pipelines
11:51 – 13:52
Do protein scaling laws hold? Contact prediction as an unsupervised readout of structural knowledge
Yasa explains how the paper measures emergent structural understanding via long-distance contact prediction derived from internal representations. Results show smooth, predictable improvements with compute/model scale for the newer ESM Cambrian family, suggesting language-like scaling behavior in proteins.
- •Long-range contact precision as a proxy for structural understanding
- •Compute-optimal curves extrapolate well to large training runs
- •New model family shows continued scaling instead of plateauing
- •Evidence that sequence-only pretraining can recover nontrivial structural signal
13:52 – 15:23
Why earlier protein models plateaued: data scaling via metagenomics
The talk highlights a key twist: prior ESM2 runs showed diminishing returns with scale, but ESM Cambrian continues improving. The primary change is dramatically more data, especially metagenomic sequences—implying the bottleneck was coverage/diversity rather than architecture.
- •ESM2 plateau vs. ESM Cambrian continuing gains
- •Major driver: dataset expansion from tens of millions to billions of sequences
- •Metagenomic data as a huge source of unexplored protein diversity
- •Parallel to the “data wall” debate in LLMs, but with evolution-generated data
15:23 – 20:26
Bitter Lesson showdown: sequence-only embeddings vs AlphaFold’s MSA-heavy inductive bias
Yasa contrasts AlphaFold-style pipelines that rely on costly multiple sequence alignments (MSAs) with ESMFold-style approaches that use learned embeddings from a protein LM. The paper’s results show near-parity (and sometimes better performance) without MSAs, especially in antibody-related settings where alignments are weak or unavailable.
- •AlphaFold’s strength: handcrafted MSA features capturing evolutionary covariation
- •ESMFold approach: single-sequence inputs + learned per-residue embeddings
- •Competitive docking/complex prediction without MSAs; strong antibody results
- •MSA benefits correlate with availability—often weakest where designers most need help
- •Latency/throughput advantages when skipping MSA construction
20:26 – 25:38
Mechanistic interpretability for proteins: sparse features, biological motifs, and a protein “map”
Yasa describes applying sparse autoencoder-style interpretability to protein LMs, yielding monosemantic features aligned with biological concepts—from amino acids to domains and functional sites. He shows an example motif (nucleophilic elbow) and ends with the idea of building a massive protein atlas—representations organized into a global map of protein families and functions.
- •Sparse coding reveals interpretable, hierarchical protein features
- •Features span residue-level signals to domains and functional sites
- •Example: motif detectors generalize across diverse proteins
- •Large-scale atlas: folding/embedding billions of proteins into a navigable space
- •Takeaway: simple objectives + scale yield rich biological representations
25:38 – 28:41
Self-play for LLMs: why post-training RL dominates compute and where task generation fits
Luke frames modern LLM training as pretraining plus increasingly large RL post-training on curated tasks. He motivates self-play as a way to automatically generate new RL tasks, enabling continued improvement beyond the bounds of human-authored datasets.
- •Post-training RL now consumes huge compute budgets
- •Scaling tasks/steps yields smooth downstream gains (coding/maths)
- •Manual task collection doesn’t scale indefinitely
- •Self-play goal: model generates tasks and learns to solve them
28:41 – 34:47
Symmetric vs asymmetric self-play, and why “hardness rewards” can fail
Luke distinguishes classic symmetric self-play (e.g., AlphaGo using old versions as opponents) from asymmetric self-play where a “conjecturer” creates tasks for a “solver.” He explains the standard hardness reward (favor tasks the solver fails) and shows that it often produces pathological, useless tasks rather than skills that transfer.
- •Symmetric self-play: same role on both sides (agent vs older agent)
- •Asymmetric self-play: conjecturer generates tasks; solver attempts them
- •Baseline reward encourages tasks at the solver’s frontier
- •Failure mode: conjecturer generates messy, adversarially complex statements
- •Empirical result: vanilla self-play matches RL plateau rather than surpassing it
34:47 – 36:18
Self-Guided Self-Play (SGS): grounding synthetic tasks and adding a “guide” to enforce relevance
Luke presents their fix: generate synthetic problems related to unsolved target problems, and introduce a third model role—a guide—to score relatedness and complexity. Training then optimizes for both difficulty and relevance, reducing degenerate problem generation.
- •Generate synthetic tasks conditioned on unsolved real targets (distribution grounding)
- •Add a guide model to evaluate whether tasks are truly related and not junk
- •Conjecturer reward = hardness × guide score
- •Preserves learning signal while preventing adversarial complexity spirals
36:18 – 37:51
SGS results in Lean theorem proving: gains over RL, but self-play still doesn’t fully solve everything
Luke shows SGS improves pass rates over RL and naive self-play on a Lean benchmark, with a small model approaching larger-model performance given more compute. He emphasizes the promise remains incomplete: performance still plateaus below 100%, leaving significant open research questions.
- •SGS outperforms RL and parallel sampling baselines
- •7B model approaches the performance of much larger models with extra compute
- •Demonstrates task quality—not just quantity—is a bottleneck for self-play
- •Open problem: avoiding plateaus and reaching full coverage remains unsolved
37:51 – 41:27
Stream RAG for voice agents: reducing latency by retrieving while the user is still speaking
Arnab motivates streaming retrieval-augmented generation for spoken dialogue systems, where standard RAG adds unacceptable latency. The core idea is to run retrieval incrementally on partial utterances and decide when enough information has arrived to trigger or finalize the full RAG pipeline without hurting accuracy.
- •Voice UX makes latency and hallucinations especially costly
- •Standard RAG improves factuality but increases response delay
- •Streaming approach: process partial speech chunks and retrieve early
- •Key decision problem: when to trigger retrieval and when to commit to results
41:27 – 47:40
Two streaming approaches: fixed-interval retrieval vs learned triggering based on retrieval quality
Arnab outlines the paper’s strategies: (1) run retrieval on fixed audio/text intervals and detect stability across chunks, and (2) train a model to decide when new chunks add critical information. The emphasis is on the broader research agenda: robust triggers, semantic sufficiency tests, and production-ready heuristics.
- •Fixed-interval streaming: retrieve per chunk; look for top-doc stability
- •Triggering based on partial-vs-final retrieval agreement as a heuristic
- •Learned trigger: predict whether new chunk changes the required retrieval
- •Opportunities: semantic sufficiency detection beyond retrieval metrics
- •Reported outcomes: latency reductions with comparable accuracy
47:40 – 54:18
Lean for Science: formal verification as a foundation for math, code, and scientific computing
Robert argues that recent AI math breakthroughs increasingly benefit from formal verification, which makes proofs checkable and explicit rather than hand-wavy. He positions Lean as both an interactive theorem prover and a functional programming language, supported by a large community library (Mathlib) and accelerating AI-driven formalization progress.
- •Informal vs formal math: explicitness and machine-checkable correctness
- •Lean as a fast theorem prover + functional programming language
- •Mathlib as a large, high-quality formalized mathematics ecosystem
- •Rapid progress in AI theorem proving (benchmarks and public claims)
- •Vision: “verified intelligence” across math and beyond
54:18 – 58:52
Verified coding & verified ML: program specifications, TorchLean, and proving properties of attention
Robert connects formal methods to software reliability and ML systems: specifications + proofs can guarantee behavior, which matters as AI-generated code scales. He introduces TorchLean, a Lean-native framework to define and verify neural networks and numerical kernels, including proofs about attention equivalences and invariances and investigations into floating-point nondeterminism.
- •Program verification: specs + code + proofs to satisfy intent
- •Shift from “vibe coding” to “vericoding” for high-assurance software
- •TorchLean: writing neural nets in Lean with a PyTorch-like tensor system
- •Proving properties (e.g., FlashAttention equivalence; attention invariances)
- •Formalizing floating-point nondeterminism effects in inference pipelines
58:52 – 1:10:02
Founder AI workflow hacks: agentic programming as an RTS (parallelism, visibility, and knowledge bases)
Lukens shares a philosophy for engineering in an agentic coding era: treat development like a real-time strategy game where success comes from parallelization, rapid iteration, and high visibility into many agents. He describes practical tooling—worktrees, tmux portability, orchestrator/worker patterns, always-to-PR workflows, and aggressive documentation that makes future agents faster.
- •Programming with agents resembles RTS: manage many threads of work simultaneously
- •Git worktrees + tmux + orchestrator/worker agents for parallel execution
- •“Workers go to PR” with minimal human keystrokes; sandbox for autonomy
- •High-visibility monitoring and early course correction over perfect upfront design
- •Knowledge-base docs as cheaper context than code; keep agents aligned over time
1:10:02 – 1:16:06
Operational metrics and cues: APM-style tracking, audio/visual signals, and scaling team output
He closes with tactics for keeping throughput high: track tool-call “APM,” use strong cues (icons/colors/audio) to manage attention, keep tokens/resources fully utilized, and mix macro/micro work. He claims significant increases in PR throughput per engineer by adopting these agent-first practices.
- •APM reinterpreted as agent tool calls per minute (not clicks)
- •Attention management via sound/visual cues modeled after competitive RTS design
- •Always spend resources (tokens/agents) to avoid idle capacity
- •Satisficing and parallel ticket sizing to keep momentum
- •Reported impact: substantial gains in PR throughput per engineer
1:16:06 – 1:16:55
Wrap-up: refreshments, feedback, and recruiting future presenters
Francois concludes with logistics, thanks the speakers, and reiterates the feedback request. He encourages attendees to propose future papers and notes the schedule for upcoming sessions is filling quickly.
- •Event closing and thanks
- •Feedback form request
- •Call for future presentations and paper suggestions
- •Upcoming session scheduling notes

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.

iOS

Android

Claude

Chrome

Paper Club kickoff: applied focus, upcoming talks, and what the group is exploring next

Limits of human-solution training: why sampling beyond the “human subspace” matters

Intelligence per sample & per watt: streaming learning, ICL vs LoRA behavior, and alternatives to backprop

Protein foundation models: the “Bitter Lesson” applied to biology and protein sequences

Do protein scaling laws hold? Contact prediction as an unsupervised readout of structural knowledge

Why earlier protein models plateaued: data scaling via metagenomics

Bitter Lesson showdown: sequence-only embeddings vs AlphaFold’s MSA-heavy inductive bias

Mechanistic interpretability for proteins: sparse features, biological motifs, and a protein “map”

Self-play for LLMs: why post-training RL dominates compute and where task generation fits

Symmetric vs asymmetric self-play, and why “hardness rewards” can fail

Self-Guided Self-Play (SGS): grounding synthetic tasks and adding a “guide” to enforce relevance

SGS results in Lean theorem proving: gains over RL, but self-play still doesn’t fully solve everything

Stream RAG for voice agents: reducing latency by retrieving while the user is still speaking

Two streaming approaches: fixed-interval retrieval vs learned triggering based on retrieval quality

Lean for Science: formal verification as a foundation for math, code, and scientific computing

Verified coding & verified ML: program specifications, TorchLean, and proving properties of attention

Founder AI workflow hacks: agentic programming as an RTS (parallelism, visibility, and knowledge bases)

Operational metrics and cues: APM-style tracking, audio/visual signals, and scaling team output

Wrap-up: refreshments, feedback, and recruiting future presenters

Get more out of YouTube videos.