Self-Play for LLMs, AI for Biology, Formal Verification, and More | YC Paper Club

It's hard to keep up with the latest AI research. That's why we started YC Paper Club, a small group of researchers, engineers, and founders who meet every two weeks at our Mountain View office to present and discuss new papers together. In this session, we cover whether scaling laws hold for protein biology, AlphaZero-style self-play for language models, streaming RAG for real-time voice agents, formal verification with Lean, and why one founder thinks programming with agents is exactly like playing a real-time strategy game. Stay tuned for more.

Interested in joining a future Paper Club? Apply here: https://events.ycombinator.com/ycpaperclub
Apply to Y Combinator: https://www.ycombinator.com/apply
Work at a startup: https://www.ycombinator.com/jobs

00:00 - Introduction by Francois Chaubard
05:47 - Yasa Baig: A World Model of Protein Biology (https://biohub.ai/esm/protein/about)
25:38 - Luke Bailey: Scaling Self-Play with Self-Guidance (https://arxiv.org/pdf/2604.20209)
37:51 - Arnab Maiti: Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage (https://arxiv.org/pdf/2510.02044)
47:40 - Robert George: Lean for Science: How Formal Proofs Can Change Mathematics, AI, and Scientific Computing (https://arxiv.org/abs/2602.22631)
58:52 - Lukens Orthwein: Founder AI Hacks: Programming is an RTS Game Now
1:16:07 - Closing Remarks

Francois Chaubard (host), Yasa Baig (guest), Arnab Maiti (guest), Robert George (guest), Lukens Orthwein (guest)

Jun 12, 2026 · 1h 16m

WHAT IT’S REALLY ABOUT

YC Paper Club explores AI for biology, LLM self-play, streaming RAG, formal verification, and agentic coding hacks

Protein language modeling is shown to follow scaling-law-like behavior when training data is massively expanded (e.g., metagenomics), enabling strong structure/function signals from sequence-only pretraining and even interpretable biological features.
LLM self-play for formal math can plateau because “make it hard” rewards incentivize adversarially messy tasks; adding self-guidance and a learned/judged relatedness signal yields better gains than vanilla self-play or standard RL alone.
Streaming RAG targets voice-agent latency by running retrieval during a user’s ongoing utterance, raising open research questions around when to trigger retrieval and how to decide partial-query sufficiency without wasting compute.
Lean is presented as a core substrate for “verified intelligence,” spanning formal math, program verification, and even verified ML/compute kernels, with growing ecosystems (Mathlib) and practical tooling for correctness guarantees.
Agentic coding workflows are reframed as real-time strategy: maximize parallelism via worktrees/agents, minimize human keystrokes/approval overhead, and build strong internal knowledge bases to compound speed gains despite higher token use.
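The streaming RAG point above (retrieving while the user is still speaking) can be sketched in a few lines. This is a minimal illustration, not the method from the Stream RAG paper: the function name `stream_retrieve`, the `min_new_words` trigger rule, and the fake retriever are all assumptions made for the example. The summary itself notes that when to trigger retrieval is an open research question.

```python
from typing import Callable, List, Tuple


def stream_retrieve(
    partials: List[str],
    retrieve: Callable[[str], str],
    min_new_words: int = 2,  # hypothetical trigger threshold
) -> Tuple[str, int]:
    """Run retrieval on growing partial transcripts instead of waiting for the end.

    Returns the retrieval result for the final transcript and the number of
    retrieval calls that were issued along the way.
    """
    if not partials:
        raise ValueError("need at least one partial transcript")

    cache = {}
    words_at_last_call = 0
    for text in partials:
        n_words = len(text.split())
        # Assumed heuristic: only re-query once enough new words have arrived.
        if n_words - words_at_last_call >= min_new_words:
            cache[text] = retrieve(text)
            words_at_last_call = n_words

    final = partials[-1]
    if final not in cache:
        # The speculative calls missed the final query, so pay for one more.
        cache[final] = retrieve(final)
    return cache[final], len(cache)


# Stand-in retriever that records what it was asked (not a real search backend).
calls: List[str] = []


def fake_retrieve(query: str) -> str:
    calls.append(query)
    return "docs for: " + query


partials = [
    "what",
    "what is the",
    "what is the weather",
    "what is the weather in Paris",
]
result, n_calls = stream_retrieve(partials, fake_retrieve)
```

The threshold makes the trade-off visible: a lower `min_new_words` issues more speculative queries (more wasted compute), while a higher one risks having nothing ready when the user stops talking.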

IDEAS WORTH REMEMBERING

5 ideas

Protein models can exhibit clean scaling behavior—if data scale is sufficient.

The talk highlights that earlier protein LMs plateaued, but ESM Cambrian/ESMC regained smooth improvements by expanding training data dramatically (e.g., from tens of millions to billions of sequences), suggesting “data walls” are often domain-specific rather than fundamental.
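One common way to check for "smooth" scaling is to fit a power law in log-log space and see whether the points stay on the line. The sketch below assumes a loss of the form a * n^(-b). The data points are synthetic placeholders, not numbers from the talk, and `fit_power_law` is a hypothetical helper name.

```python
import numpy as np


def fit_power_law(n, loss):
    """Fit loss ~ a * n**(-b) by ordinary least squares on log-transformed data."""
    slope, intercept = np.polyfit(np.log(n), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic points only: these are NOT measurements from ESM or any real model.
dataset_sizes = np.array([1e7, 1e8, 1e9, 1e10])
synthetic_loss = 5.0 * dataset_sizes ** -0.1

a, b = fit_power_law(dataset_sizes, synthetic_loss)
```

A plateau of the kind described for earlier protein LMs would show up as the largest-scale points sitting above the fitted line.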

Sequence-only pretraining can nearly match structure-heavy pipelines in key settings.

ESMFold-style approaches that discard MSAs can approach AlphaFold-level performance on some complex prediction tasks, and can be especially competitive where MSAs are sparse or unhelpful (e.g., certain antibody contexts), while also improving throughput by avoiding alignment costs.

Inference-time compute is becoming a lever in biology too.

Looped/refinement architectures allow repeated passes at inference to improve structure predictions, paralleling test-time compute/sampling ideas in LLMs and reinforcing the general “scale compute” theme beyond pure training.

Interpretability tools from LLMs can transfer to protein models with meaningful biological features.

Sparse-coding/SAE analyses reportedly yield monosemantic-like directions spanning amino acids → motifs → domains → functions, enabling an “atlas” view of protein space and suggesting shared representation phenomena across modalities.
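For readers unfamiliar with sparse autoencoders, the following is a minimal sketch of the standard objective: reconstruct model activations through an overcomplete ReLU dictionary while penalizing feature magnitude. It is not confirmed to be the setup used in the talk. The sizes, the random weights, and the random "activations" are all placeholders, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the dictionary is wider than the activation space.
d_model, d_dict = 16, 64
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)


def sae_forward(x):
    """Return (features, reconstruction) for a batch of activations x."""
    # ReLU keeps features non-negative, so each one reads as "how active".
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction


def sae_loss(x, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    features, reconstruction = sae_forward(x)
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * l1


# Random stand-in for per-residue activations from a protein model.
activations = rng.normal(size=(8, d_model))
features, reconstruction = sae_forward(activations)
loss = sae_loss(activations)
```

After training, each dictionary column is a candidate "direction" whose top-activating inputs can be inspected, which is how motif- or domain-level features would be identified.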

Vanilla self-play reward design can create ‘adversarial curricula’ that don’t teach useful skills.

If the conjecturer is rewarded for producing tasks that are merely hard, it can generate pathological, overly complex formal statements that are difficult without being useful; the solver's learning then stagnates much as it does under standard RL, despite the large volume of synthetic tasks.
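The failure mode can be shown with a toy comparison of two conjecturer rewards. This is an illustrative sketch only: the formulas, the function names, and the example numbers are assumptions, not the reward used in the self-guidance paper. It shows why a hardness-only reward prefers messy tasks and how a relatedness term can reverse that preference.

```python
def hardness_only_reward(solve_rate: float) -> float:
    """Vanilla-style reward: the more often the solver fails, the better."""
    return 1.0 - solve_rate


def guided_reward(solve_rate: float, relatedness: float) -> float:
    """Hypothetical shaped reward: hardness only counts for related, solvable tasks.

    relatedness is assumed to be a score in [0, 1] from a judge or learned model.
    """
    if solve_rate == 0.0:
        return 0.0  # never solved, so the solver gets no learning signal
    return (1.0 - solve_rate) * relatedness


# Made-up example tasks as (solver success rate, relatedness to target problems).
messy_task = (0.05, 0.1)   # nearly unsolvable and off-topic
useful_task = (0.5, 0.9)   # moderately hard and on-topic

vanilla_scores = (
    hardness_only_reward(messy_task[0]),
    hardness_only_reward(useful_task[0]),
)
guided_scores = (
    guided_reward(*messy_task),
    guided_reward(*useful_task),
)
```

Under the hardness-only reward the messy task scores higher, so the conjecturer drifts toward it; under the shaped reward the useful task wins.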

WORDS WORTH SAVING

5 quotes

If the full solution space F is F, training on known human solutions will limit you to some typical set H despite any feasible amount of test time compute or recursive self, um, improvement.

— Francois Chaubard

The actual paper title is right below that. But I mean, just a quick refresher, I'm sure everyone in this specific audience probably read Richard Sutton's famous article.

— Yasa Baig

So the promise for LLMs is I can take some-- I can train on a bunch of human data, I get to, like, human level, and then I can run loads of self-play and go far beyond that and hopefully solve really interesting problems with, with our models. But unfortunately, this is not how it works.

— Luke Bailey

Especially in voice, we care about this even more because from a human perspective, it's difficult to kind of actively catch hallucinations when you're listening to it compared to like when you're reading it over text.

— Arnab Maiti

You cannot fool this theorem prover.

— Robert George

Protein foundation models (ESM/ESMC) and scaling laws
Metagenomic data scaling for biology
Sequence-only structure prediction vs MSA-based methods
Mechanistic interpretability with sparse autoencoders in protein models
LLM self-play (symmetric vs asymmetric) for formal reasoning
Self-guided self-play and reward design failures
Streaming RAG for low-latency voice agents
Lean theorem proving, Mathlib, and formal verification for code/ML
Agentic coding workflows, worktrees, orchestration, knowledge bases

High quality AI-generated summary created from speaker-labeled transcript.
