Self-Play for LLMs, AI for Biology, Formal Verification, and More | YC Paper Club

It's hard to keep up with the latest AI research. That's why we started YC Paper Club — a small group of researchers, engineers, and founders who meet every two weeks at our Mountain View office to present and discuss new papers together. In this session, we cover whether scaling laws hold for protein biology, AlphaZero-style self-play for language models, streaming RAG for real-time voice agents, formal verification with Lean, and why one founder thinks programming with agents is exactly like playing a real-time strategy game. Stay tuned for more. Interested in joining a future Paper Club? Apply here: https://events.ycombinator.com/ycpaperclub Apply to Y Combinator: https://www.ycombinator.com/apply Work at a startup: https://www.ycombinator.com/jobs 00:00 — Introduction by Francois Chaubard 05:47 — Yasa Baig: A World Model of Protein Biology (https://biohub.ai/esm/protein/about) 25:38 — Luke Bailey: Scaling Self-Play with Self-Guidance (https://arxiv.org/pdf/2604.20209) 37:51 — Arnab Maiti: Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage (https://arxiv.org/pdf/2510.02044) 47:40 — Robert George: Lean for Science: How Formal Proofs Can Change Mathematics, AI, and Scientific Computing (https://arxiv.org/abs/2602.22631) 58:52 — Lukens Orthwein: Founder AI Hacks: Programming is an RTS Game Now 1:16:07 — Closing Remarks

Francois ChaubardhostYasa BaigguestArnab MaitiguestRobert GeorgeguestLukens Orthweinguest

Jun 12, 20261h 16mWatch on YouTube ↗

EVERY SPOKEN WORD

80 min read · 15,649 words

0:00 – 5:47
Introduction by Francois Chaubard
1. FCFrancois Chaubard
  [on-hold music] Thank you guys so much for coming. This one will have much, a much more applied bent based on the feedback. We have a bunch of really cool people that I'll introduce in a second, but we're covering AI for, uh, biology by my favorite, one of my favorite, uh, co-researchers, Yasa Baig. We have Luke, um, out of Tatsu's lab talking about self-play, AlphaZero-style self-play for LLMs. Super excited about that. Arnab will be, uh, presenting. He's a researcher at Giga on the Stream RAG. Uh, super different application for, you know, thinking about, uh, real, real-time voice, uh, agents. Uh, Robert George working on Lean for Science, super exciting. And then the AI token maxer himself, Luke Worthwein. Cool. So I wanna introduce some, like, call for presentations, you know, maybe inspire some of my interests a- and maybe inspire some, some of you guys to jump up and, and ask for a presentation on this stuff. I think memory has been, like, the hot topic for at least the last year and a half. There's been so many papers from MemZero to recursive language models, cartridges out of, uh, our lab, HNet, you know, dynamic chunking stuff. There's so many different ideas, and so I'm definitely interested in that area if you guys wanna present on that. I did this Noam Brown podcast, I think a couple weeks ago, launched, and, like, he's still of the view that this human-generated subspace H is still, if we train it on that, we can test time compute our way out of it, and recursive self-improve out of it all the way to get to this F minus H. And I just, like, really struggle with this, and I really, really don't see how it's probable. Not that it won't-- it's not possible, but it's just not probable that we'll sample to all of that. So I'm really interested in that, and that's definitely in Luke's. We were talking about that a bunch, and I think that basically the left side is AlphaGo, the right side is AlphaZero. And I think that AlphaZero, unbiased by, um, humans' meandering, is, uh, the way we'll get to much more intelligent systems, maybe even dare say AGI. And the right way to say this, um, would be, um, I wanna be very careful. If the full solution space F is F, training on known human solutions will limit you to some typical set H despite any feasible amount of test time compute or recursive self, um, improvement. You won't feasibly sample F minus H, um, a-and especially all of it. If, if it's infinite recursive self-improvement or infinite test time compute, maybe, but we don't have infinite. So life's a POMDP, and this is a, we're a finite horizon MDP. So intelligence per sample, I think this is, like, the two major problems left, in my opinion, are intelligence per sample, intelligence per watt, and intelligence per sample, I always think about this, like, as I get one new sample, I do this continuous learning. What is the right thing to do if I'm try- my goal is to maximize performance conditioned upon that N? Most people's answer to this right now in practice is ICL, and I actually played with this. As you increase the number of samples in ICL, it is not monotonically improving in performance. And so it actually starts to bob and weave. It gets worse. It gets better sometimes. Um, and then it hits a cliff, which is the context, um, uh, length, the co- context length that, that the model was trained on, and it literally just stops. So it clearly can't go on forever. It doesn't mono- not monotonically improve. And I started playing around with LoRA. I think LoRA at higher, at lower ranks for, uh, lower amounts of sample size actually does impressively well. And then it, it has this kind of arc. They both peter out pretty quickly as you increase number of samples all the way until you do SFT group O all the way at the end. And if you look at this, it's kinda weird that, like, you have this, like, ICL is the optimal thing to do in the beginning, and then training the whole thing. If I get one new sample, I wanna train, retrain with LoRA at some rank on N plus one samples just to get that little bump in performance. And there's a different optimal thing to do all along the way as I stream and I get more and more samples. And that's just not how we are. So we're kinda like monotonically improving, and the more chess games that M- Magnus Carlsen plays, he just keeps getting better. The, you know, ten thousand hour rule, et cetera, et cetera. We just keep getting better, and it's the same in Algo. And so I think there's just something really different happening in us. And so there's, must exist some learning procedure that is, has a much higher intelligence per sample. And then intelligence per watt out of my lab, uh, uh, Ivanika and John, who will hopefully come to the next one, uh, and give a talk on, on this, and I just think it's the right way to think about it, arguing that having smaller models, um, are sometimes actually better from intelligence per watt perspective. Alternative to backprop for those that know me, I'm very hot on this, back to the brain, um, and how we learn. There's very little evidence that the brain is taking the transpose of the weight matrix, and there must exist some other learning procedure. I'm highly interested in SPSA. But if there's alternatives that I'm not aware of, please, like, recommend them. And I'm really interested in novel breakthroughs. Yasa is one of my favorite AI researchers, but he's mostly focused on bio, and he always, um, sends me bio papers, and it's super interesting, whether it's about how birds navigate the world via iron in their liver, apparently. That's how they, they actually navigate. [chuckles] Crazy. Robotics, uh, speech, other things like that as well. And then, of course, unhinged founder hacks. Very interested in that as well. Call for ideas on way to ma- ways to make the club better. So better ways to meet you, if it's a lightning round, if it's not. Um, some people talked about some AI benchmarks that we could actually launch together. That'd be kind of fun. Club challenges to challenge each other. And then, uh, any open source ideas that you wanna hack on together for this club or something or otherwise would be very interesting. All right. That's all I got. Thank you so much. [audience applauding]
5:47 – 25:38
Yasa Baig: A World Model of Protein Biology (https://biohub.ai/esm/protein/about)
1. YBYasa Baig
  Hi. Yeah. Uh, thanks, Francois, for that introduction. We've been lab mates now for, like, two years, something like that? Yeah. Um, I think Francois is a great example of someone who brings very creative and very out-of-distribution ideas to our group all the time, even if I sometimes, like, have no idea where he gets them from. Um- But, um, that being said, uh, he asked me to give a talk on some bio AI things, and I thought, why not? Um, so I'll be presenting on this paper that came out just last week from Biohub folks, um, here in California. I mean, not too far, actually, in the Bay Area. I think they just moved to the city, actually. Um, so I am a second-year PhD student with Francois, but I'm also co-advised by Steve Quake over at Stanford. Sort of anyone in biology probably will know Steve, at least tangentially. Done a lot of work in bioengineering and all kinds of applications. Was director of the Biohub, where this work came out, um, just until recently. So, uh, a lot of, uh, a lot of overlap. But, uh, the high-level pitch for this work is that I know most of this audience is probably more like AI, ML types. I want to talk a little bit about how sort of a lot of these ideas from sort of that's motivating a lot of progress in language modeling and AI very broadly have been sort of recently been translating into biology with a focus on this recent paper, 'cause I think it really does a really excellent job of interrogating how scale, which, you know, at some level has been like the fundamental primitive in terms of assumptions that we as a community have in terms of how to make things better, um, has actually been playing out for a lot of these biological problems, particularly protein biology. So, um, there won't be, like, much bio in this talk. I'll try to focus more on the ML, but feel free to ask questions. Um, so yeah, like I ta-- I called this talk The, uh, Bitter Lesson Comes from Biology. The actual paper title is right below that. But I mean, just a quick refresher, I'm sure everyone in this specific audience probably read Richard Sutton's famous article. You know, basic premise here is that like, you know, across the past seventy years of AI, methods that win are methods that are general, that like sort of exploit really fundamentals of like scaling compute and data, as opposed to methods that sort of hand engineer human domain know- uh, human domain knowledge. And Sutton always cites his work in or like, you know, a lot of the work that was earlier in the field, so AlphaGo, AlphaGo Zero, that sort of just inordinately scaled compute. And then for a long time, they were far worse than sort of expert systems until they eventually overtook and then exponentially improved past them, right? Knowledge systems win at first, but then eventually sort of these like big, large, dumber models will like, um, you know, win in the long run. This is sort of a new goal in biology is to what extent can we study, like this is actually also true for a lot of these sort of biological AI problems, right? Um, the bet sort of behind this paper, and they do a really good job of exploring, is will the same pattern basically also solve protein biology, right? Um, can we take like, you know, a lot of these ideas and scaling law analysis, here is from the, um, uh, famous, you know, neural scaling laws paper, and then translate them for all these problems that we care about, per se, designing a drug or like, you know, trying to understand how like a cell works, right? So on the left is something that we trust. It's like a language model. We have like these nice smooth log linear scaling laws that we see that loss like predictably falls as a function of compute and data. Uh, the question on the right is whether this same curve will exist for bio spec- more generally, but proteins specifically in this paper. You know, sort of does our LLM recipe transfer, or does biology really out-of-distributional domain relative to language sort of break it, right? You know, that's the bet. And like in this co- talk, I'll basically chat about like three like vignettes from this paper that sort of interrogate to what extent this is true. So this slide is all the biology you'll ever need, and it's only about a quarter of the slide, at least for this presentation. Um, let's talk about proteins. So in your body, there's like broadly three major classes of macromolecules. There's lipids, carbohydrates, and proteins. Um, a protein is just a string of amino acids, a special type of biomolecule. There's twenty varieties of amino acids. If you put them together into a sequence, you can have like a virtually infinite number of possible molecules they'll then fold into. So you can think about it largely as just every single protein is this twenty-letter alphabet, and that string specifically determines a unique 3D shape, and by virtue of the shape of that protein, what job it does in your cell, like for instance, catalyzes a reaction, keeps pathogens out, et cetera. Um, the work done in this paper is that their goal is to train, um, ESMC, sort of their f- like third major or fourth major iteration in a long series of models of the group Ev- of the Evolutionary Scale group originally at Meta, then their own company now at Biohub, have been training for a few years now, where the cell is very similar to language, where we just, let's just take hundreds of millions of years of, of sort of evolved sequences that we've sort of gone out and found across biology, both in humans, but also across bacteria and in our environment, and just go train a big masked language model on them, right? So I mean, nowadays, we mostly train NTP models, but the pitch here is that if I take some protein represented as like, you know, these strings of twenty tokens, like hide a few, and then can I train a really big BERT-style transformer to predict masked positions as a function of the other ones nearby. And the crucial part is that we never tell it anything about the protein beyond just the sequence, right? So all this guy's got access to is just this string, and it's being asked to like basically learn things about the grammar of that protein as a function of which protein other amino acids tends to co-occur with, right? So like I think there's like this old saying in natural language processing, it's like you'll know a word by the company that it keeps. And here the idea is that you'll know a protein by the sort of amino acids it keeps. And the bet is that if we do this at scale, just on this simple sequence task, we will eventually get all of these sort of other properties of a protein, say like structure, that we do care about sort of for free. And, uh, yeah, like I said before, there's been a lot of prior work on this, like largely from, um, Evolutionary Scale, but a few other comp- or a few other sort of groups working largely on this bet. Uh, this table will be a map for the rest of the talk. So like every row is a concept you probably already know from natural language and then analog onto the protein context. So tokens become amino acids. The internet becomes sort of all of evolution's sequence databases, all the proteins we can actually go out and measure. Masked token prediction stays as masked token prediction. And sort of emergent capabilities, we talk about language model having, like become emergent structure and function within like basically understanding of a protein. And then there's also this sort of like, this really fun stuff at the bottom. So like recently, there's been a lot of advancements in sort of these interpretability toolkits from the MechAI folks, you know, things like sparse autoencoders. Some of the earliest work basically in really trying to interrogate, um, using the sort of the toolkit the language modeling community has built to understand language models now in a protein language model setting. So I'm gonna fill in the right-hand column for the rest of this talk with sort of evidence and three questions, which is that, do these models learn with scale? Um, can they basically substitute for a lot of these hand-built features? Sort of does the bitter lesson hold? And like what do these representations actually encode interpretably? So question one is, do scaling laws even hold in the protein context in the way that we see them in the language context? First, let me just try to talk a little bit what we measure when we're talking about sort of emergent properties, sort of how do we actually, like study the model, see if it's learning anything, right? We need a proxy for does the model understand protein structure, for instance, right? Um, the one the authors use in this paper is that they look at the internal, or they basically take the model representations during training, and they use this to predict, um, long-distance protein contacts. So the idea here is that proteins have a one-dimensional sequence, but they fold into complex three-dimensional shapes. And if the model has sort of understood something complex about the protein structure or something emergent about the protein structure, it should be able to predict, um, contacts that occur over long distances, sort of nearby contacts are rather kinda obvious. And this is like a really challenging object for it to get just sort of de novo purely from sequences alone. They call this P at L, right? Um, sort of a long contact precision at some given length. And it's sort of just a clean, unsupervised readout, sort of structural knowledge inbuilt in the model that's learned during this language modeling objective. Um, on the right, I plot the performance of this, or the authors plot the performance of this, I should say, against training compute for the, for basically this new model family the authors have built recently called ESM Cambrian at three hundred million, six hundred million, six billion parameter scales. Uh, interestingly, this, and they have this fit line, which is basically this predict compute optimality curve, which they, um, estimated just from sort of low-end training runs, so relatively low computational budget. And they find that it actually extrapolates very cleanly to real model training runs. Meaning so the answer is like, do these models learn with scale? And this data at least suggests that the answer is like, yes, right? Like you do see this nice log linear curve, right? If you keep investing more and more compute, you keep training on more and more protein data with larger and larger models, um, you see the same exact same broad qualitative shape as the LLM scaling res-- or sort of LLM setting. And it res-- re-transfers cleanly, meaning that without any kind of like predisposed part of the model that we've taught to look at protein structure or even given it any protein structures, it does a good job of sort of just picking these out just from sequence co-occurrence patterns. Um, there's like one interesting twist, though, is that I said before, there's been a lot of prior work from this group as well as others in trying to answer these scale questions, so not the first ones to look at this. But previous models, um, sort of the, sort of their prior generation ESM2 model shown in, um, purple here, had actually not shown the same behavior. They sort of hit this wall where they kept adding more parameters, and they got diminishing returns, and you get this sort of flattening out of the scaling curve. This ESMC or ESM Cambrian model, sort of the green line, keeps climbing with no plateau. And their fix for this wasn't really like they come up with like a really clever inductive bias in the architecture. Not to say there isn't a lot of excellent engineering work in this paper, but really, it was just data scaling, right? They, um, had about fifty million training samples in their original ESM2 paper, and here they've just pushed that to two point eight billion by pulling largely in metagenomic data. So essentially, amino acids or protein sequences that have been found from sequencing DNA actually out in like dirt and oceans and like human guts from like organisms that nobody has like really ever cultured or even has really, really elucidated. And their conclusion is that more data ends up being really important and keep getting sort of a, a basically justifying the cost for increasing compute. So it's like the protein version of LLM data wall conversation, right? Like except here in biology, evolution has been generating this training data for, for four billion years and not humans in like the past thirty or so. And, you know, compared to tokens in natural language, like, I mean, we've only sampled like less than one percent of all known protein sequence diversity, and that's like only currently at this moment in time, let alone like all of the sequence diversity that evolution has sampled since the beginning of life on Earth. Um, sort of question two in this paper I think is interesting is that, um, it's sort of the most bitter lesson part, and they really try to evaluate to what extent their paper can do or how well their model trained purely on mass language modeling objectives can compete against a structure-based model with sort of hand-tuned inductive components. So I'm sure you're all familiar with AlphaFold, won the Nobel Prize a few years ago, it was sort of a landmark moment in bio AI. It really showed that these computational tools could have a lot of value in the biology sphere. AlphaFold was brilliant, but its power comes from basically building hand-put in-- or hand-built input, sort of a manual feature curation called a multiple sequence alignment, or an MSA. So to fold a protein, it goes and finds hundreds of evolutionary cousins of that protein and stacks them up. Um, these patterns of sort of covariation across a family are essentially, this just encodes the statistical information you need to do to be able to get structure. This is like a beautiful domain engineering application, and it's the sort of like really good human-crafted inductive bias that the bitter lessons, at least claims, should eventually lose, right? Think like hog features in CV. And compared to sort of the things we used to do before, this is actually like far more bitter lesson than like, say, building a whole physics simulator for a protein. But it's also really slow to do this, right? We need to build these huge databases. The sequence alignment takes time, right? And it's absent precisely where you often want it. For instance, the antibody design tasks we'll come back to at the end. Um, ESMFold just throws this away. All it says is it just takes the input sequence and instead of an alignment, it, it just feeds in the model's representations as the input to their structure predictor. And these are just like per residue embeddings. So take your input sequence, you get a set of like per amino acid, just like some numeric core representation. And we just train the specialized module to do breaking the like large statistical protein structure, right? So this folding network, right, like a projection into, uh, three-coordinate space. So same target, same output, no hand-built features. And the question becomes, can this general model representation like match the sort of specialist model in getting that MSA value? One interesting architectural note, though, for sort of the more ML folks in the crowd, um, the one, their, in their projection network, sort of the part that converts this representation to structure, there is actually one really interesting feature that actually builds off some of the work from our lab alum, Dan Fu. Um, and they have a, actually a looped model, right? I mean, there's a lot of excitement about these recently for parameter sharing, and I just think they're cool algorithmically for a number of reasons. And this is, gives them basically a lever by which they can scale inference time compute, right? So essentially, they have a model that predicts structures, and they have a procedure by which representations can be fed through a series of layers and sort of refine their, uh, structure predictions without necessarily retraining or any kind of fine-tuning. This is like our test time compute access and some of these diffusion steps could be, or like test time sampling from an LLM. It'll, keep this in mind just for later results. And the sort of like headline figure I would say pointing out here is that, um, they basically show that, yeah, their technique works really well. Um, just to quick definitions, on the left, we have this thing called DocQ pass rate. This is just a metric for how good your structure prediction was, essentially. It's a measure of the fraction of test cases where the predicted shape of two proteins stick, that stick close together is close enough to be like really useful to realistic settings. And there are two groups in each panel. One is for single sequence with no MSA, and the other is, um, a single sequence plus an optional MSA you can also feed to the model or is required for a competitor model. And when we look at the sort of outcomes from this, what we see is that for general protein-protein complexes, ESMFold2, their sort of new projection model from a single sequence with no MSA lands within about three points of AlphaFold3, which does take these handcrafted features. So we get near parity without the crutch. And, but if we look at the antibody applications, which are on the right hand, on the left-hand side here, right? Um, the modality b- is like, uh, you know, essentially behind like all modern MC or mAb-based drugs or monoclonal antibodies, tons and tons of applications in human biology and biotech. Um, we are actually winning, or are comparably winning, or the authors are comparably winning. So broadly, like single sequence ESMFold2 does actually build AlphaFold3 sort of fifty versus forty-seven on this really specific design task that people really do care about. And biologically, this makes a lot of sense. Compared to, say, like, um, other classes of proteins, the amount of sort of, uh, sequence variation that's been sampled in the space of all known antibodies relative to structure is considerably smaller considering their enormous diversity. So the headline isn't that MSAs are dead yet, right? It's that handle features only help where it's abundant, and basically where dr- designers really need it, often it does go away. And this general method still basically does save a lot from pre-training across all known evolutionary contacts. So we're not quite there yet. And one other thing though worth flagging is that as second point says, give it MSA, we can also scale the amount of test time compute, so how many loops we run in this recursive model in towards proof performance. And we do see, um, basically returns on this, meaning that like the bitter lesson, at least at inference time, also seems to broadly hold. And it's not just accurate, just as an aside, this is just quick. Um, it's also just much faster. MSA construction just takes a lot of classical computational biology time. So at least if throughput is your concern or latency is your concern, you can with this sort of single representation, like get quicker results. Though the wall clock times here are like, you know, well within like, at least I would consider to be pretty good to start off with. Um, and the last bit, I'll get through this a little bit quicker, is just they did a really interesting analysis of like sort of mechanistic interpretability, like what are these models actually learning and sort of can we find features that are interpretable as, as humans in the same way that sort of language modeling folks in the MechI community have found in, uh, language models like Anthropic has. Um, here they sort of just apply the same tool, or they borrow a lot of the tools for like sparse coding analysis here, where they look at activations from these models and try to see if they can decouple them, find these like mono semantic activating directions inside their feature spaces. And they ask, is this also gonna be a property in protein models? And their answer is largely yes. Um, right. So from like pure fill-in-the-blank pre-training, the model's latent space decomposes into clean features that correspond to real biological concepts. Here the, these concepts have been annotated by LLM agents, and they're organized actually quite interestingly in a nice hierarchy. So you have like features that correspond to, say, individual amino acids at the bottom, um, then like structural motifs, then like whole protein domains, right? So you can look in longer or larger portions of the individual protein molecule up to like functional sites and whole protein roles, right? And none of this was supervised, right? The model like learned to organize its latent space purely just through MLM, which is like crazy. Um, I'll just with one example maybe to close things out before I finish everything, and I think I actually have one more slide after this. Um, this is a instance of a feature activation that corresponds to a really specific well-known protein motif called a nucleophilic elbow. This is a type of catalytic domain that's used in a lot of enzyme catalysis. It's really interesting because it's evolved multiple times in multiple different proteins unrelated to each other. So it's a, it's a motif biology keeps coming back to, and the model has basically learned to identify it in the four quite structurally diverse proteins from like both evolutionary distance as well as the rest of the protein. So it's like found a consistently occurring motif in very different backgrounds. So it's like look, it's basically learned to look at the right thing, not just sort of memorizing like, you know, broad, similarly s- uh, comparable sequences. It's like a deeper level of intuition. And if you look at the sort of the whole SAE activation space, you can find like nice structures that sort of correspond to like various known aspects of biology, right? This organization isn't just local, it scales to all of life, right? So they, um, ended up building an actually huge atlas of their pro-- uh, with their model afterwards, sort of just folding and analyzing, um, millions of, or up to, I think seven billion proteins. This is the largest atlas, I think, out there now for protein structure databases, more than AlphaFolds even. And they've predicted like, you know, O of a billion apiece, as I mentioned before, and laid them out here by their representations in SAE space. And you get like a really nice, interesting, like protein space family map, right? You can find that there's clear families that'll cross clear here, or like for instance, CRISPR-Cas9 enzymes, which if you're not a biologist, maybe you still probably have heard of and really important for a lot of biotechnology applications. It's kind of like a Google Maps for proteins, and it's produced all as a byproduct of a model, right? Like just naturally, it's like picked up evolutionary relationships as well as functional ones just de novo for free, which I think is like, I don't know, if you're maybe not a protein nerd like me, I just think this is like utterly crazy, right? Um, so like just to finish, like does the bitter lesson scale to biology? Not perfectly yet. I mean, some of this analysis still requires a lot of handcrafted features, and it's not fully competitive, but we're getting very close. Um, but even if we just don't care about one specific downstream, the model, just from a relatively quite small amount of da-- or like a relatively quite simple pre-training objective on a lot of data has like learned enormous amount of bio that we can reverse interrogate after the fact. Um, and just for the record, like they found that data scaling does keep improving. Um, I want to just point out, you know, partially as a proselytization, like, or, or partially just like try to convert a lot of smart people like we have in the audience. There's lots of folks who work on ML for a lot of application software. Um, biology is a great place to work in ML because the models are still really young. And the other thing is that the data is increasing exponentially per year, and that rate of increase is also going up, meaning that like we're not data limited. It's a great time to work in this space, and we need a lot of these tools. Uh, and any of the audience members watching this on YouTube, similar pitch. Um, just as one last thing, um, I didn't get to talk into detail, but the one application they use for their models for inverse design, so they actually develop a lot of potential protein drugs, and then they evaluate a lot of use, at least in, um, wet lab settings, to show that these are potential, like, proteins that you can design using this model purely in sequence space for the most part, by the way, um, with the exception of, like, one structure head at the end, um, that bind various, like, known molecules that have therapeutic effect, right? So for instance, uh, this PDL-1 binder is basically the most... or is, like, basically a medication that is now the sort of big success of immunotherapy. It's helped plenty of patients with cancers in ways that historically we've never been able to tackle before, right? And developing medications that sort of targeted this protein was immensely challenging and, like, if we can basically reduce the costs for developing such future drugs for future targets, it would have enormous human impact. So, like, even if the data scale doesn't sell you, then maybe some of the human impact will. But broadly speaking, it's a really exciting time and it's wonderful to see that a lot of these lessons are at least translating and people are really making steady progress.
2. SPSpeaker
  [audience applauding] Okay, next we have Luke, um, second-year PhD out of, uh, Tatsu and Tengyu's lab. Uh, fresh from the UK, then he went to Harvard CS, uh, worked on a-adversarial robustness and now post-training self-play, and is directly, uh, uh, in the spirit of this, um, AlphaZero kind of mindset. And so we've been ta-chatting with that-- about that a lot. All right, please welcome Luke. [audience applauding]
25:38 – 37:51
Luke Bailey: Scaling Self-Play with Self-Guidance (https://arxiv.org/pdf/2604.20209)
1. SPSpeaker
  Okay. Hi, everyone. Um, yeah, I'm Luke. Um, I guess I'll be presenting on this paper we put out a, a few months ago called Scaling Self-Play with Self-Guidance. I guess more generally, I'll be talking about self-play for LLMs. Um, this work was with some great co-authors, Kaihua, Kerfane, and my two advisors, Tatsu and Tengyu. Okay. So, um, what does the current training stack look like for big LLMs? Two simple parts. Basically, we pre-train the model on web text, and then we post-train it. And interestingly, recently, the post-training, we've ended up spending, you know, a huge amount of compute on doing large-scale, long reinforcement learning runs. And what does that reinforcement learning look like? You collect a huge number of tasks, coding tasks, maths tasks, tasks interacting with different bits of software, and you just have the agent take a bunch of actions in those environments. You get some reward back, and we train the model on that data, up-weighting the good rollouts, down-weighting the bad rollouts. And like I said, the interesting change that's happened is we're now approaching the amount or even surpassing that we're spending on pre-training actually on this very long-running RL post-training. And I've, hmm, swept some things under the rug that we do at post-training as well, like a bit of instruction tuning and, and, uh, alignment. But really, most of the compute is spent on these long RL runs. Okay. So we also know that as we increase the number of, uh, RL tasks during post-training and we increase the amount of compute, we get better downstream performance. And I think this is best illustrated by this, like, really beautiful plot from the Composer 2 technical report from Cursor, where what they're most basically showing is they have loads of RL tasks such that they only ever-- the model only ever sees each task once. And so on the X-axis, scaling training steps is basically each training step I'm putting in some compute and a new RL task. And what they show is nice smooth line. As you increase the amount of tasks and compute you put in, you get this reliable improvement. And I guess they have this nice eval set on the left, but they also have a downstream benchmark on actual coding on the right, and that's also, like, increasing reliably in a really nice way. This recipe tells us, great, just collect more and more RL tasks, put it in a loop, and my model's gonna keep getting better and better. But generally, we're gonna have to collect these RL tasks by hand, which might be a problem if you wanna keep on feeding. You'll notice log scale on the X-axis there. And I guess there's another problem where you might think that, um, eventually we'd like the model to surpass any of the problems we can give it. So I guess the question that self-play asks is: how can we automatically generate new RL tasks to the model, train on those, and repeat? Okay. So like I said, in traditional RL, we'll have a predefined task, and we train the model on that predefined environment and task. But in self-play, we do something slightly different, where the model does two things. It's gonna generate RL tasks, and it's gonna attempt to solve those tasks. And crucially, we train it to be better at both of these things. So we train it to be better at, in inverted commas, we'll go through what that means to be better to generate tasks, and then also to get high reward in those tasks. So how do we fit, I guess, some papers we've likely seen from the past into this description of self-play? 'Cause you might be thinking, "This doesn't look exactly like what I thought of when I read the AlphaGo or AlphaZero paper." So those traditional works we'd call symmetric self-play. And in this case, uh, let's say in AlphaGo, how you train the model is you have the Go agent, and then you have the rules of Go, and you have the Go board. That's great, but that is a non-RL environment I can interact with. Like, I need an opponent to play against. And so this generate RL task part, they have an older version of the agent take the role of the opponent. So in this case, generating the task, 'cause I just put an older version of myself in there, and now I have a nice RL task. It's a Go board with an opponent. So this would traditionally be called symmetric self-play 'cause the model is taking on the same role twice, a Go player. More recently, however, uh, in the LLM space, there's been the rise of asymmetric self-play. This actually hails from a lot of older work on control problems and things like this. But in asymmetric self-play, we instead more generally just have a model that I will call the, in this talk, a conjecture that will just generate entire RL tasks for the solver to then operate in. The solver is the equivalent of the agent here. So the conjecture might come up with a coding problem, and they come up with a bunch of unit tests, and then it'll go into that environment to do a bunch of rollouts, get reward, and train on that. Great. So, so, so why do-- why am I excited about self-play? Why do I think you should be excited about self-play? So I guess this first point is-- the, the first point I'll have to go, go in some, some depth. So, so in principle, nothing bounds learning. And what do I mean by that? So if I take a bunch of demonstrations from humans, and I train a model on that, I think it's clear that the model will never get better than those demonstrations. So the next step is, okay, I'm not-- I'm gonna create a bunch of environments to have the model learn those environments. That's regular RL. We have two problems there. One, if you ace all of the environments, you'll never get any better. Or the second problem is, if I can't even get any reward in those environments, I will also never get any better. So self-play, on the other hand, is gonna say, "I'm gonna keep on generating new learning signal with new tasks, learn it, and just keep on improving, hopefully forever." And indeed, we saw this was the case with two-player games like Go. It just kept on getting better and better beyond human, uh, performance and kept improving. So the promise for LLMs is I can take some-- I can train on a bunch of human data, I get to, like, human level, and then I can run loads of self-play and go far beyond that and hopefully solve really interesting problems with, with our models. But unfortunately, this is not how it works. [chuckles] So in practice, if I run, which we'll get into this talk, if I run self-play for a long time, it plateaus. Either model stops improving at some point, which is the exact same thing that happens when you run RL. Like, as much as I'm trying to tell you there's a bunch of secret sauce going on, that it doesn't actually play out. So basically, this paper, we try to figure out, like, why is this happening, and then, like, do one step to solving the problem, but by no mean-- by no means completely solve it. Okay, so to begin with, we need to understand, like, the baseline LLM self-play algorithm. Pretty simple. We're gonna sample synthetic tasks from the conjecturer, which is just our model. Conjecture and solver, same model, just giving it two different names. The model will then-- the solver then attempts them, and we verify the correctness using some reward signal somehow, like perhaps the conjecturer wrote unit sta- unit tests for us to check. And then we're gonna update the solver just on all the correct roll-outs. And then this is the key part. The conjecturer gets updated on this reward, which is zero if the prover-- if the solver could not solve the problem, and one minus the solve rate otherwise. Okay, what is that actually doing? That is basically saying all the conjecturer must do is produce problems that are hard for the solver model, and I think in principle, that makes a lot of sense. The idea is if the conjecturer can ace this, I will keep on giving you problems at the frontier of your capability. You will be able to solve them and learn from them, and we'll just keep on expanding and expanding, expanding, and get better and better and better. Okay, so let's see how this recipe does. So we take in our paper like three thousand formal math problems, so this is just, uh, in Lean for-- you can write out the problem statement in this coding language in math. So you write out a math problem in this coding language. You can write the proof in the coding language. Then you can automatically verify if it's correct. So we take three thousand problems, and we run, like, our best RL baseline on it, and this is the amount of compute we put in here, and on the Y-axis, we have how many problems you solve. And you can see it plateaus out, and we fit a law, and it asymptotes at, like, sixty percent. And if we-- on the right-hand side, we're gonna say how much synthetic new tasks did we generate. RL generates no synthetic tasks, so by construction, this stays at zero. Now I'm gonna fill in the vanilla self-play with that solve rate rewa-reward, and I'm not gonna show the left for now. We see as time goes on, the conjecturer gets better and better at its job. It keeps on generating more and more tasks on the frontier of the solver's capabilities, which seems really good, and yet these tasks are completely useless. The self-play does no better than regular RL. So this is not very promising, so now we d- need to understand why. And here what I'm visualizing or I'm literally showing you is one of the problems the conjecturer generates later on in training, and we don't really understand this. I've highlighted in blue the, the conclusion to this statement in Lean, and if anyone's using that, this is horrific. This is an incredibly complicated, overly complex disaster of a statement. And so what is basically happening is we reward the conjecturer for, for producing tricky problems. But the easiest way to produce tricky problems is produce these basically messy, artificially complex and inelegant problems. It is the exact equivalent if, if I wanted you to get, like, fifty percent solve rate at a problem, I could just give you, like, a three-page long high school calculus problem, and you would make some little mistake somewhere. But that was a completely useless synthetic problem for, like, other tasks we care about in maths, for example. Great, so how do we fix this in a minute? 'Cause I've been talking too slowly. So we diagnose this problem. Here is, like, roughly at a high level how we try-- a-attempt to solve it. So there are two parts of our algorithm, SGS, self-guided self-play. We're gonna take the set of problems we cannot solve, the three thousand problems, and we're gonna do two things. One, for each of those problems we cannot solve, we're gonna get the conjecturer to produce a related problem to it. So we literally prompt it to produce a synthetic problem that is related. So th-this way, we're trying to ground the synthetic data distribution in a distribution of problems that we think is good at least. And next, if you still just trained on the solve rate reward, you would eventually ignore this prior and still produce that junk. So we're gonna introduce a new reward signal, which is the model takes on a third role, and it will literally judge. It looks at the synthe-synthetic problem and the target problem it came from and decide if these two things are actually related and not overly complex. So we call this third component a guide. Okay, so the algorithm looks like this. It's very similar. We-- for every target problem we haven't solved, we'll sample a conjecture f- uh, from the conjecturer that is related to it. We then will attempt to solve them, and then what changes here is when we update the conjecturer, we now have this dual reward. One, we still want the problem to be tricky. That is important so we can get RL signal on it for the solver, and we'll multiply it by this guide score. Great. Okay, there's a bunch of kind of subtleties we cover in the paper that I will skip over 'cause we don't have loads of time. If you wanna talk about, I'm gonna say large-ish scale RL infra, the academic side, that's what I spent most of my time doing, so I would like to talk about that, but there is not time. So le- let's just look at the head- headline results here. Here is basically the same type of plot. I've put the RL baseline on here. Recall that like standard self-play is exactly in line with that. We also put parallel sampling down here just to show you that indeed RL at least gives you a boost, and I guess I wouldn't be here unless our method works better. So the method indeed does work better. Um, to like ground how much better it's doing, we, we, we're using a seven billion parameter model here, and this is its like six hundred and seventy B, uh, like big brother. And we spend eight times as much compute doing the self-play, uh, as you-- Yeah, we, we do eight times much compute the self-play, but we get like to the ability of that larger model, at least its pass at four ability. So you spend a lot more compute, but we are able to get this like little seven B guy to do as well as the bigger model. But very sadly, you will notice like this is not at a hundred percent. So like the work is, is by far not done. The promise of self-play is like you would just ate all the problems here, and so there's a bunch of... Well, there, there's lots of future work, but luckily a PhD is very long, so I'll be able to work on that. [chuckles] Um, but yeah, that's the summary. [clapping]
2. FCFrancois Chaubard
  Awesome. Thank you so much, Luke Bailey. Okay, uh, next one we have Arnab Maiti. Is that the right way to say it? Maiti. Um, who is a researcher currently at Giga, one of YC's fastest-growing companies. I think market cap is like four hundred million, three hundred million, something like that now. So really fast-growing YC company. Uh, PhD, University of Washington, focused on bandit learning. Um, yeah, please, let's-- tell us about Stream RAG.
37:51 – 47:40
Arnab Maiti: Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage (https://arxiv.org/pdf/2510.02044)
1. AMArnab Maiti
  So, um, this a paper by the group at Meta, and I kind of chose this paper to kind of maybe highlight some of the new emerging challenges that are coming up, especially in a voice AI kind of setup. Um, my goal with this talk is not more like about talking specific details about this paper, but more like to highlight the good problems that they have identified, and I feel like there's a lot of research that is to be done here. And it also kind of closely mirrors what at least I do in my production, uh, setup. Like, I look at these kind of problems, I do the research, and then I try to come up with a method that will work probably in production. So yeah, let's get started. This is a very classical setup where, uh, you probably ask an input question to an LLM, and it gives an output. And if you remember maybe from two thousand twenty-three, maybe there was a lot of hallucinations, but o- especially like, say, around citations and all. But over time, maybe the hallucinations went down, and a big role, uh, was, uh, RAG. Like, you kind of give the input query to a RAG system, it kind of goes and figures out relevant information that needs to be provided to an LLM, and then the LLM probably gives you an output which is hopefully not hallucinated. Now, uh, a lot of voice AI, uh, startups are also coming up, and a na-natural expectation with the voice AI is that, okay, you're having like a conversation. Like, oh, you can ask, "Oh, what's the weather like?" And the agent would reply like, "Hey, the weather currently is like twenty-two degree Celsius," and so on. And maybe you can ask a follow-up question. So it's more like conversational in nature. And so even here as well, you would like the output to be like there shouldn't be any hal-hallucination. Especially in voice, we care about this even more because from a human perspective, it's difficult to kind of actively catch hallucinations when you're listening to it compared to like when you're reading it over text. So one might ask, okay, what's the issue with just using RAG here? Like, can't you just take the input query, a-apply RAG, give the relevant information to the voice agent, and get the output? The issue is that RAG would add a lot of latency. Um, like for example, if I ask a voice agent some question, and the voice agent takes ten seconds to reply, that's not at all natural, especially if you want to have some sort of natural conversation. So that's where this paper kind of looks at a very clever idea, I would say. Uh, which is like instead of like waiting for the question to end and then activate your RAG pipeline, you kind of start analyzing the words that are being spoken by the user and somehow figure out a way to run the RAG system while the qu-question is being spoken. Like for example, uh, like you might ask like, "Hey, what's the weather today like? I'm-- I want to decide based on that whether I want to go out or not." The main question is in the first part, so the second part of the question might be irrelevant. So we want, um, some sort of, uh, mechanism via which we can figure out, okay, uh, when to call this RAG system and appropriately get the right, uh, information. So this particular paper focuses on two approaches. Uh, the first one is fairly simple. Um, so it's called fixed RAG, uh, fixed interval streaming RAG. So the idea is like you divide the audio into certain blocks, and after each block arrives, you can like run RAG on every block. So, uh, so a-after-- when the block B arrives, you run a RAG, get the results for the RAG RB, and you keep on doing till this, uh, till the end probably. Now- The question, the main question here is like, which block to consider? Because you ideally cannot, like, wait-- uh, the entire goal was you cannot wait till the end and then run the RAG. So what do you do? So the, the main, uh, maybe idea here is that RAG pipeline has lot of mini-components, so maybe some of the components are, like, easy to run or, like, more faster to run. So for example, uh, you can kind of get some documents very quickly, and you can say, "Okay, for the entire query, what were the top documents, and for the intermediate query, what were the top documents, and are they matching or not?" This is just one of the ideas, which is from the paper. Uh, and then based on that, you can decide, okay, should I go ahead with the intermediate query and, uh, just do the entire RAG pipeline on that? Um, so the thing I want to stress is not the method per se, but the point that, okay, when you are getting this, uh, input in chunks, at what point can you stop and say that, okay, like, this chunk is, like, super relevant for me. Uh, so th- this is, like, an active question, I would say. Like, how would you do that? This, uh, paper does it in a very simple manner, which is just to maybe look at the initial part of the RAG pipeline and if they kind of match, like, if the end part matches the intermediate part, then you go ahead with the intermediate and let the full RAG pipeline complete. Another approach could be, like, you, you can probably fine-tune a model to kind of trigger on its own, like, when to call the RAG, because in the previous approach, you were calling RAG on every single chunk. So maybe that's computationally wasteful. So what you can probably do is when a particular chunk arrives, you can maybe fine-tune some model and ask it to decide whether, uh, this chunk, uh, is, like, a critical new information and you should generate the new query or the query that you generated based on the past chunks are good enough for you to just answer the question. And, uh, based on that, you can generate the final audio. Yeah, so in, in the paper, they kind of describe a post-training pipeline. Uh, what they do is, like, they kind of, uh, f- for the partial, uh, spoken, uh, question, they kind of generate some pseudo-queries using some LLM, and then, uh, they run a RAG on that, and they look at the retrieved documents. And based on the retrieved documents, they kind of decide, okay, sh- is this, uh, partial query like, uh, something new or is it already like we already have the useful material? So in this-- okay, um, in this paper, essentially they are kind of basing their decision based on the retrieval quality of the partial question so far. That's, that would be my takeaway. But maybe there are different ways in which you can do this assessment. Maybe you can look into the semantic of the question so far, like is the partial question so far good enough for me to answer this question just by looking at the question? No, no need to do this entire RAG pipeline. So there are-- my, my point is like you need not, uh, this need not be the only way. There might be so many different ways, and that's where the research maybe, uh, is required. Like, while a user is speaking their question, how do we like on, like, why-- instead of waiting till the end, how do we, like, figure out, okay, this part of that question is good enough for us to go and do the retrieval? Yeah, so that, that's what they do. Probably I'll just, uh, quickly give a glimpse of the results from the paper. Um, uh, the, uh, this paper is a year old, so they were like, yeah, looking at some, uh, smaller open source models. Um, so they were, they kind of considered the RAG benchmark converted into audio and, uh, showed that the latency kind of decreases for the synthetic datasets by point five seconds and for human datasets, like human spoken datasets by, uh, almost like one point five seconds. And, uh, the accuracy, uh, comparison, like, uh, if there was RAG, uh, after the final query and the streaming RAG, it kind of remains the same. Like, um, yeah. So yeah, so that's what the paper is about. So, like, the key takeaway is, like, there are some in-interesting small problems here, like, but if you can crack the small problems, it can lead to huge gains in the production. Yeah. Thank you.
2. FCFrancois Chaubard
  [audience applauding] Okay. Next up, we have Robert George. Um, come on up. Uh, third year PhD at Caltech.
3. RGRobert George
  Yes.
4. FCFrancois Chaubard
  Okay. My brother got his PhD at Caltech. Um, and, uh, you work on AI for math and science.
5. RGRobert George
  Yeah.
6. FCFrancois Chaubard
  Um, and what are you gonna tell us about?
47:40 – 58:52
Robert George: Lean for Science: How Formal Proofs Can Change Mathematics, AI, and Scientific Computing (https://arxiv.org/abs/2602.22631)
1. RGRobert George
  I'm gonna tell us about Lean, basically. Luke already told a little bit, but I wanna go more in depth. So I'm gonna be talking about Lean and what I think is this new era of verified intelligence. Um, so let's get into it. So again, there's bunch of breakthroughs in the past, like, couple of weeks itself. Like, first, I wanna go back, like, two years before, you know, like, we said that IMO at OpenAI and even DeepMind actually at the 2024 IMO got the gold medal. Then, you know, there's this very famous Erdős problem list, which is very famous right now, where people are trying to kind of solve new open unsolved Erdős problems. And, you know, you can see that it's keep on, keep on increasing with The new models from like OpenAI, DeepMind, and all. Um, just two weeks ago, OpenAI claimed to solve another big breakthrough or the eighty-year-old Erdős problems. You know, Terry Tao was... has this promotional video at OpenAI, which he showed really well about these kind of things. And then last week, DeepMind released something also solving bunch of new, not only Erdős problems, but problems in like other different fields, right? But this paper's cool because they also use some kind of formal verification in the loop. So I wanna say that, you know, we all took like high school calculus, we took undergrad college math courses, and all this. You know, informal math is very, very flexible, right? Um, your, your pr-professor say sometimes, you know, proof by QED, like, you know, sometimes it's like proof by intimidation or something like that, right? There's many of the steps are not fully written down. But this is where I believe that, you know, formal world is like you have to be fully explicit, right? And I'll talk and introduce the language Lean. Again, before Lean, past couple of hundred years, you know, people have been doing formal math a lot, but you know, Lean has this kind of, this really good design language which has kind of taken off, right? So again, first thing is it's very easy to check if a proof is correct or not. You cannot fool this theorem prover. Secondly, uh, it's scalable. Again, there's bunch of issues over there, but I can talk more about that soon. So before that, I just wanna give you like a precursor, so people do know about like, um, there was a thing previously, like in the 1990s, even right now, actually 2020s and all this, is there's this thing called automatic theorem provers, which are basically like SMT solvers. Um, they are basically, um, minimal effort from humans, you know, but they're very e-limited expressivity in what type of mathematics they can encode in some sense, right? And on the far right-hand side, you can see interactive theorem provers like Lean, Roce, Isabelle, which are very... have a much more stricter like expressive logic system, so it's based, at least some of them are based on dependent type theory, but much more effort from humans to kind of write down these proofs, right? Like, if you're talking about like ten years people have been r- contributing to this very famous library called Mathlib in Lean, um, there's a lot of human effort to kind of pick premises and all this. And again, we all know how good our LLMs are right now at kind of combining with these kind of theorem provers to kind of do proofs checking for like research-level math, right? And it's so much news that I, you know, if I go on Twitter right now, I can open up a bunch of posts saying how much progress past couple of are probably in some sense, right? So first thing is I wanna introduce why Lean, you know, Luke mentioned this formal, very messy language, but I actually think it's a very beautiful language. Again, one can argue no. But, um, it's a very fast language. Again, it's also... people c-think of it as only a theorem prover, but it's actually also a functional programming language, right? You can use it as a programming language itself. So it's compile checking. Um, it's very good unified, so this is more like the proofs and programs. Um, you can do like meta programming, you can do macros, custom automation. You know, you c- I've seen people trying to even create like games on with using Lean, right? It's actually super cool. So Lean has something called a foreign phase interface where you can do like external library bindings like you can do on the CUDA or something like that. Um, I wanna point out the Mathlib library. I think that is the coolest, biggest formalized math library out there. Um, I forgot how many number of lines, probably at least in a million or so. But all of these are really high-quality math, right? From like say topology to algebra, geometry, and all this. And again, it's an interactive theorem prover, so you always have to, you know, the human can sometimes be in a loop. But it is also a very scalable language because, you know, not only frontier labs are pushing a lot of money into it and also the world is, but, um, there's more data being generated either through synthetic or like a lot of people, like even myself, I do manual formalizations. Um, so just very short, I don't wanna take time, but this is how a simple Lean code looks like. Like in VS Code, you have like an info goal view which shows like what are the current kind of subgoals. Our goal is basically like what are you trying to prove at this step. So the first theorem is like you're basically showing associativity of addition of like natural numbers, right? Like A plus B plus C is equal to A plus C plus B. And each line in the proof is usually called like a tactic. So usually it's when people talk about like proof search, they mean like, you know, you can search over this kind of tactic space. There are methods where you do full proof generation, but, you know, these are the two different axes. So this is how a Lean code looks. It's not as bad as it seems. It's a steep learning curve. I think it's much better than even C++ in some sense, like learning. But, um, you at least get really, um, at least for me, I get very happy when I see, oh, I've fully proven this theorem, right? There's no assumptions. Like, I cannot like hand wave or fool a Lean kernel, basically. Like, you have to be fully 100% sure. Um, now I wanna talk about the formalization breakthroughs, right? I talked about informal. But actually the first work was actually in twenty-twenty, Ilya and S- uh, Stan was from OpenAI. They released something called GPTS. Um, this was the first generative language model for automatic theorem proving. MiniF2F is just like a Olympia-level kind of competition, but you see the amount of progress, like it's kind of exponential, like from open source models, big players in China, in the US, Canada, like across the entire world. Um, last year's IMO, you know, again, DeepMind claimed to not have used Lean. I, uh... If you see the OpenAI solution, some kind of DSL of Lean kind of stuff in the solutions. Um, but even Seed Prover from China also got the IMO gold. And then obviously there's a bunch of like Axiom Prover, there's Harmonic AI, like they got recently in the practicum, they got all the 12 problems solved. Most of the Erdős problems now when people are saying they kind of, um, claim to have a solution using AI, they also prove it, um, using like say Aristotle from Harmonic. Um, and then an-another amazing work was kind of this Fields Medal work from MathINC, and obviously the Google, Google DeepMind stuff in some sense, right? And again, I love the fact that, you know, everyone is talking about math and all, but you know For me personally, there's also these two other bubbles, right? Like there's also code. Now one can argue what is program verification as well. You know, bugs are really expensive. It's like a huge billion-dollar industry. Vibe coding is all of a sudden really great, like everyone is generating. But we-- I want code that needs guarantees, right? I think that's like something which I'm very interested in. And also AI for science matters, like there's, uh, reprodu- uh, reproducibility and all this kind of stuff. So I wanna go through this really fast, but, um, LLMs can write code, but can they prove it's correct? Um, you know, there's scale of generated code, there's type of bugs. Uh, how can you kind of capture human intent and the verification language? And again, in short, I wanna talk about like program verification is like there's these three concepts where humans actually always have some kind of like specification about like what they want their code to do. So a proof is basically saying that the code kind of satisfies that specification. Um, there's this work which I introduced called Bridge, where you can use this Lean as a functional the- programming language to kind of elicit the LLMs to kind of prove this kind of code better. Um, so I like this quote from Max Tegmark at MIT, where they say that we should shift from actually vibe coding to like vericoding, right? Um, verifiable coding will be like definitely, I think, a much more better way. Um, and you should contribute to CS Lab. This is started from Clark Barrett's group at Stanford. There's a bunch of from DeepMind and all. But if you wanna contribute to CS concepts and all, you should definitely contribute to CS Lab. Um, I wanna go through quickly just about one last work about, uh, TorchLean, which I recently introduced. This is the first unified framework for actually writing down neural networks in Lean. So you have this kind of full like PyTorch style like tensor system. Everything compiles down to a shared intermediate representation. You can kind of prove properties of specs, like I can show you some examples. You have like verified floating-point arithmetic. You can kind of do even like neural network verification, like certified robustness kind of stuff, right? And again, there's a bunch of applications which I show, but I think one cool thing that I'll show this and the li- next slide is that, you know, you can show that the flash attention is equal to, like at least in the spec level sh- is equal to the, uh, normal standard attention, right? Again, we don't worry about like IO and all this processing. Also, you can s- a very standard fact is like the attention mechanism is permutation invariant if you don't have position like coordinates. So I actually kind of trained a GPT-2 style like Karpathy's thing in TorchLean itself, fully natively in Lean, right? And you can kind of prove properties about it and all this. Um, one thing I think I can end with this slide is that Thinking Machines Lab last year released something about, um, this kind of nondeterminism, even when you have like temperature zero, um, when you put it into your LLM inference. I actually kind of formalize this whole system in TorchLean, all the way down to like almost a GPU kind of like small CUDA level kernel verification. Because the whole goal in this blog was saying that the f- tiny floating-point arithmetics can flip the final argmax in the kind of the batch thing. So again, uh, there's a blog, you can check it out on my website, but, um, I was very, very cool that you can kind of do real-life software verification, um, in some sense. And, uh, again, there's a bunch of different slides I have, but I kind of wanna end on this note just for the sake of time. But, you know, I see a future where, uh, science, like even code, can be formally verified through a lot of building blocks which people are putting a lot of effort in. And this is one of the examples that I think is like my fused metal like kind of contribution to the ML world in some sense. [audience applauding]
2. FCFrancois Chaubard
  All right. Great job. Okay. For our n- last presentation, it's gonna be the antithesis of Lean [laughs] and token maxing to the max. Um, very excited, uh, to introduce Luke Orthwein, his close friend. Um, we're friends in, in, uh, in Woodside together. Um, and did his, uh, CS degree at Harvard, uh, then ran growth at WeChat from twenty twelve till twenty fifteen, uh, wh- which is why we call him the Lion of Hong Kong. [laughs] Um, and now has been running his startup, Channel AI, and is probably the most unhinged technical CEO that I know, so...
58:52 – 1:16:07
Lukens Orthwein: Founder AI Hacks: Programming is an RTS Game Now
1. LOLukens Orthwein
  Thank you, Francois. Um, yeah, so the, the, the idea behind this talk is sort of, um, what we, uh, at Channel have done to try to take the, the best advantage of sort of rethinking how you should do software engineering in this world of agentic programming assistants, Claude Codex, et cetera. Um, and really, you know, the, the ways in which, uh, I think many assumptions about what good programming is are now sort of the opposite of what you should be doing. Uh, and these are sort of what we have, have worked through ourselves and found very useful and wanted to share with all you guys. To give some context, Channel AI, we're a consumer entertainment, uh, AI business. Uh, and we're really focused on the problem of automating as much as possible of not just software development, but content development. How do you really create like an end-to-end system, uh, that is pure AI that, uh, gets people to pay you money, uh, and, uh, stay engaged, et cetera? Uh, we've had pretty solid success with that so far. Um, and it's inspired us to think in our own workflows, how can we just sort of max this and, and be as far ahead of the curve as possible. Um, and it just is an imperfect analogy to what programming used to be like, but I think the ways that it, uh, is useful is like, uh, maybe programming before you wanted to be very linear, you wanted to predict the future, you wanted to design very thoughtfully systems that would be like robust and work well, uh, and, and be correct. Um, and even if you're trying to do something sloppily, it's still like a, a single-threaded process where you only are worrying at a given moment about what's in front of you. Um, and To me, I'm a big fan of real-time strategy games. Using agentic systems feels exactly like playing real-time strategy games to me. Uh, and there are a lot of properties of those games that are very different from chess. Um, one thing, and especially if you look at like high-level play, uh, there is no single aspect that you can do perfectly and like succeed. You have to be balancing many different things at once. You have to always have your economy running, your production running, your units doing something productive. You need to be engaging. And so this notion of like, how do you maximally parallelize both what your systems are doing, but also your attention, so that you are adding the corrective, uh, feedback that's necessary as you learn new things, as the map is exposed, all this kind of stuff. Um, anyway, this to me feels like exactly what like coding with agents is like. Um, and so this is what we'll talk about. Um, so in terms of like tools we've built just to like ground this in a very simple thing, this is the LW stuff is just like our linear work trees. Um, a lot of people early on started using, realizing how useful Git work trees are when you do coding development, having separate, uh... I assume everybody kind of knows what they are, but in case not, like, you know, it was fine to have one repo on your machine when you were the only one doing development. Now you need to have like lots and lots of repos on your machine, all doing development in parallel, uh, all compiling separately and like not stepping on each other's toes. Um, and so the combination of like, uh, using work trees, using task management software, uh, having the actual work itself be portable, um, which is what the Tmux bit comes in, and then, then like sticking in autonomous agents, one or many different ones on a given workflow. Um, the way that we basically ship stuff, the way I ship stuff, uh, is I have an orchestrator agent that's run by Claude usually, but could be Codex too. Uh, I try to have as minimal a number of keystrokes as possible to go from like a here's an idea of something that needs to be fixed to work being started on it, because I can course correct that work later. Think like grabbing a unit and just like clicking across the map, and you'll come back later to like make it work effectively. Um, status tracking, watching your mini-map, it's the RTS equivalent, uh, from the orchestrator of all the different, uh, spawned workers that you have working. Uh, and then all those workers being instructed basically to try to go as far as they can, really put like a really low premium on their time and effort and a high premium on yours. So even if they're gonna be wrong, even if they're gonna need to be corrected later, it's better for them to push as far as they can before they ask for feedback, uh, so that you can just have a lot of them running in parallel, even if it's wasteful from like a per, per token standpoint. It's like saving you a lot of time or letting you do more things at once. Anyway, so they try and take everything all the way to a PR, uh, not just a PR, but also like a summary that's-- Well, I'll get into that later. Anyway, so, uh, uh, and then like how do you take each, the results of every worker who completes something and like feed it back into the system so that the system learns and becomes better. Again, like without the human having to type a lot of things or doing minimal work, so they can do a lot of these things at once. Uh, and then other pieces like how do you tag in other teammates, we'll also get into. Um, but anyway, this is very much like an RTS where you're like producing units, trying to move them around, trying to constantly adapt to stuff, but also with really high visibility, not just like spawning 20 agents and like hoping that you'll, you know, solve this problem for me, make no mistakes, and it'll just work in the end, 'cause that doesn't actually happen in production. Um, so like some general guidelines or, or, or practices, uh, that, that, that we use, that I use, uh, at least, um, but, but that we've, we've, uh, spread through our team is like trying to run almost everything, including scripts that you run, 'cause sometimes scripts are a lot better and save on context space than, than just like doing everything by the LLM obviously. But running everything from the Claude instances always, like never typing anything outside of it if you can avoid it. Uh, having this portability, because a lot of times you start work on a ticket, you start work on something, and actually the reason you're stuck on it is 'cause someone else on your team or even maybe another machine, maybe you're running it locally on your computer and then you're like, "Oh shit, like I gotta go home now, but I want this to run overnight." And making it really easy to move it elsewhere, uh, and let other people pick it up. Uh, maybe it needs more compute to do something, whatever, it needs more memory. Um, and, uh, and then also just like always running in dangerously skip permissions mode, like whenever possible. Uh, if you can't be running in dangerously skip permissions mode, do what you need to do to like make a sandbox so you can. But if you're having to give feedback at any regular pace, like you're gonna go really slow. Uh, and then like, so what do, yeah, what do the workers do? As I mentioned before, they're always trying to go to PR. Uh, they are not rigorously adhering to like the given spec you do. They're trying to learn and adapt to it as they go because your specs will be wrong. Uh, and it's okay for them to make assumptions 'cause you can correct them, uh, as you catch them. Um, and then, you know, for like, for example, front-end development, doing every- everything is like pre-baked into the worker spawn. So boot the local dev server, run tests yourself on it, have it ready and waiting so that the human can just come and open a browser tab pointing to the right port, and they can just test the thing as quickly as possible, minimizing the number of human steps that need to be taken, and like the clicks to just move something forward to the next sta- uh, step. Um, and also just like lots of things baked in that are like, what are things that we know really reliably the agent's gonna be bad about? How do we learn about those things, bake them in, put them in, uh, to not just like the Claude MD file, but also like broader reaching graphs that you have of MD files, uh, which I'll get to later, uh, to make those things, uh, less of a problem. So for example, one of like the really obvious things that Claude is super bad at today is predicting how long it'll take to do something. If you ask it like How long is it gonna take to solve this problem? Be like, "Oh, maybe like two weeks of like, you know, one engineer's work." And in practice it takes like one prompt, then it can do it in 20 minutes 'cause it's trained on what it would've taken a human to do those things. That's all of its like basis for training data. The, these systems haven't been around long enough for that to be updated, and I think they'll like always be behind. Anyway, so you can take all these things and be like, "No, no, never trust yourself in these ways." Uh, and uh, and then also like, you know, people think a lot, and a lot of times it's kind of true that like the code is the source of truth, but the code is often like a really expensive source of truth for the agents to pull context out of. And it's actually really cheap, especially when you have all the context loaded in memory to like aggressively document things in a way that benefit future agents. So, uh, not just like writing comments in the code, but also structured, linked, uh, um, sort of wiki style knowledge, knowledge-based files that will m-make future agents have an easy time, um, basically take advantage of the context as much as you can. Uh, and also helps the visibility to humans and, and audit-auditability of what you do. Uh, so macro by default, micro when it counts is another RTS principle. Like you can't win a game of RT- uh, like RTS game usually if you're just really good at moving your individual units, because if you didn't make any units, you're just gonna lose. Uh, so yes, it's important to like deep dive and tunnel vision into certain things that are really critical. Some tickets for sure take a long time, but anytime you're like tunnel visioned into something, you should always be thinking, "How do I spawn as many other little things that don't take my cognitive bandwidth as much and just like move those things forward?" Um, so that always you're basically like maxing out your cognitive capacity. Um, and again, like things can wait. You can come back to them like three days later. It's not that expensive and you can just ask Claude, like, "Remind me what the hell I was doing with this thing." All this stuff is really cheap. What's expensive but doesn't feel expensive is like not doing these things at the same time. Um, anyway, so macro necessary, micro useful, but you can win honestly in RTS games and I think in a lot of things, including in programming, if you just macro enough, if you just do enough things, you'll kind of, uh, stupidly adjust your way toward something that's good if you're just always really quickly identifying problems and solving them. Um, and yeah, this is-- gets back to like the high visibility thing. So one of the things that I really like about, you know, like how I set things up is it's not like a lot of agents that are kind of tucked away and that you have to like dig in hard to actually read what their ongoing stream is and what they're actually doing. Like, like in an RTS game, like you click buttons to immediately jump to different key points in the map, so you can always be auditing stuff and always like catch it and correct it quickly if it's a critical thing. That true I-- uh, that too I find is like super useful in programming. Uh, because again, like they're gonna make mistakes all the time. They're gonna like go in wrong directions and you definitely save time and value if you catch them early, fix them, course correct. Uh, so you should be kind of like looking around between your different agents, monitoring them while you are also trying to have as many as you can. Um, another thing to this point that I personally like a lot, uh, and is like a big thing in RTS games is audio. So like the only way that you can manage a big army across the whole map is to have lots of audio cues where it's like your base is under attack or, you know, this guy's moving or whatever thing is happening. You don't have to be looking at, you can hear and it's like, "Okay, shit, I need to put my attention to this thing." And you know based on like a lot of variety of these audio cues that you can learn, and they're good like pneumatic devices, uh, what's important, what do I need to act on right away, what don't I? So like the way I run my personal setup is I actually have all of my individual agent, uh, like tmux sessions mapped to different Warcraft and StarCraft units, uh, that are color-coded and themed based on the type of ticket it is. And then they play the actual sound effects from Warcraft and StarCraft units. So I immediately know and like visually identify, I don't even have to read, like this tab needs my attention, this thing's going on. Anyway, like to me it just seems like a natural way of like taking advantage of these and, and again, like Claude just made all these things for me really quickly as like a side ticket that I was working on over time while I worked on eight other things. So it's like why not do these things and these, these devices, pneumatic devices, uh, or, or whatever, like cues for people are really optimized in gaming and they like know what good sound design is to like be memorable and otherwise catch your attention in different ways. Um, yeah, and like cult use of color, icons, anything that's just like quicker to read and process 'cause I actually do think like these things matter a lot, especially if you're trying to, uh, really aggressively get a lot of stuff done, and the sky is kind of the limit in how you can do that stuff. Another thing we built internally is like an APM tracker, uh, and I'll just show quickly here. Um... So, and th-this is Warcraft 3, which is like one of the lower APM requiring professional RTS games, but this is what it looks like to actually play this game well, uh, at the, at the top level. And one of the things that you'll notice is like, no, APM is not the, uh, the thing that like if you max it, you're the best player in the world, but nobody is good who doesn't have high APM. And so you can just kind of take that as a mental rubric, like, shit, if I'm like thinking and like typing slowly and like am I r- if, if this was a competition, would I really be the best? Like, do I really need to take that much time in everything I'm doing? And how much can I just take like lots of little micro decisions and, you know, fall toward the right, uh, the right goal or toward making things better? Um, anyway, so this is just something like we, we, you know, each of us run like personally on our computers and keep an eye on, and it's just like, just keep track of like- Are, are things moving? And this, this APM is not like clicks you have 'cause I, I don't think that's like a great tracker for, for, for agent use. We use tool u- tool calls. So like how many tool calls are your agents doing per minute? Uh, this minute, this five minutes, this hour, this day, this seven days. Like how do you max all those things and have high numbers? Um, and again, it's like, it's, it's one metric among many, but it's how are you actually being really productive or are you really doing the most you could be doing if you have a low APM? Uh, probably not. So otherwise, like things probably a lot of people know, uh, easy way to, to use tokens more effectively is just like do a lot of things in parallel. Do different things with the same agent. Do different agents in parallel. It will, uh, invariably like four complex tasks usually give you a better outcome than if you did it by yourself. And just like in an RTS, like you should be spending your resources. You should never have your Claude tokens like sitting unused. That's really inefficient economy. Like use them all every hour period that you can. Um, knowledge base, this is like a really big thing that, that for us I think is still like somewhat early on, but, uh, this whole presentation I made and started the exact same way that, uh, I'm just describing how I do tickets, which is I went to Claude, I took what Francois asked me to talk about, I pasted it in. I said, "Look at our knowledge base and how we do stuff," and put together a PowerPoint presentation based on the philosophies embedded in there and like what I've told you before. And he didn't like one-shot it, but it's like I maybe did like 15 edits to it, you know, and, and got to this presentation. Uh, and then I refed it all back into the knowledge base and said like, "Learn everything that I've said and all the, the, the, the advice I've given and corrections I've given, and like make those better instilled in the knowledge base." And this knowledge base is basically just 'cause like linked docs are much faster to traverse by LLMs. And so, uh, and you can encode everything including business knowledge and indeed like Claude and, and Codex are really good at coming up with features and stuff if they have enough knowledge about your business. Uh, so trying to build this up in an automated way is super useful. People come up with their own tickets, uh, because if you have something you could do, everybody, you should just like do it. Everybody should be full stack all the time. Uh, be reactive. Uh, and, uh, even if agent does it way worse than you or slower than you, it's still better to have it do it. And, uh, it's easy to change things when they're screwed up. Satisficing is a word from economics, is like do things satisfa- like enough but not perfect. Uh, really, really key principle for like everything. Uh, mix different ticket sizes at the same time. Uh, you know, in like f- we, we've three and a half X'd our output, uh, PRs per engineer per month, uh, both because LLMs have made ourselves better, but like when we like really adopted this stuff broadly with everyone on the team this last month, we grew another 60% in our PRs per engineer per month. So like you're not gonna get a lot smarter, but the thing you can train on yourself is like, how do I act like people who are good at doing these kinds of things really well, like RTS pro players? What does it look like to be like optimal in this, and how can I learn the methods of doing it? Just like program like an RTS pro. Thank you. [audience applauding]
2. FCFrancois Chaubard
  Okay.
1:16:07 – 1:16:52
Closing Remarks
1. FCFrancois Chaubard
  I think that's all we have. Um, now I think, Vika, we have cookies, ice cream, and popsicles? And mochi donuts. Okay, what is a mochi donut? It's delicious. Okay. Um, yeah. So thank you guys so much for coming. It was, uh, a lot of fun. Uh, I will send out a feedback form. Please review it and give me, give me back your thoughts. Um, think about those, uh, call for presentations and calls for ideas. If you guys have ideas, let's, let's definitely hear them. Um, and, uh, looking for more papers coming up probably in, in two weeks. I think we're already fully slated. Um, and then basically the first one in July, uh, you know, we're looking to fill out as well. So if you wanted to present, please let me know. That's all I got. [audience applauding] Thank you everyone.

Episode duration: 1:16:55

Install uListen for AI-powered chat & search across the full episode — Get Full Transcript

Transcript of episode 3rWSvrFahIY

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.

iOS

Android

Claude

Chrome