Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447
EVERY SPOKEN WORD
150 min read · 30,038 words
- 0:00 – 0:59
Introduction
- LFLex Fridman
The following is a conversation with the founding members of the Cursor team, Michael Truell, Sualeh Asif, Arvid Lundmark, and Aman Sanger. Cursor is a code editor based on VS Code that adds a lot of powerful features for AI-assisted coding. It has captivated the attention and excitement of the programming and AI communities, so I thought this is an excellent opportunity to dive deep into the role of AI in programming. This is a super technical conversation that is bigger than just about one code editor. It's about the future of programming, and in general, the future of human-AI collaboration in designing and engineering complicated and powerful systems. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Michael, Sualeh, Arvid, and Aman.
- 0:59 – 3:09
Code editor basics
- LFLex Fridman
All right, this is awesome. We have Michael, Aman, Sualeh, Arvid here from the Cursor team. First up, big ridiculous question, what's the point of a code editor?
- MTMichael Truell
So the- the code editor is largely the place where you build software, and today, or for a long time, that's meant the place where you text edit, uh, a formal programming language. And for people who aren't programmers, the way to think of a code editor is, like, a really souped-up word processor for programmers, where the reason it's- it's souped up is code has a lot of structure. And so the- the, quote unquote, "word processor," the code editor can actually do a lot for you that word processors, you know, sort of in the writing space haven't been able to do for- for people editing text there. And so, you know, that's everything from giving you visual differentiation of, like, the actual tokens in the code to, so you can, like, scan it quickly, to letting you navigate around the code base, sort of like you're navigating around the internet with, like, hyperlinks. You're going to sort of definitions of things you're using, to error checking, um, to, you know, to catch rudimentary bugs. Um, and so traditionally, that's what a code editor has meant, and I think that what a code editor is, is going to change a lot over the next 10 years, um, as what it means to build software maybe starts to look a bit different.
- LFLex Fridman
I th- I think also a code editor should just be fun.
- ASAman Sanger
Yes. That is very important. That is very important, and it's actually sort of an underrated aspect of how we decide what to build. Like, a lot of the things that we build, and then we- we try them out, we do an experiment, and then we actually throw them out because they're not fun. And- and so a big part of being fun is, like, being fast a lot of the time. Fast is fun.
- LFLex Fridman
Yeah, fast is... (laughs)
- ASAman Sanger
Yeah. (laughs)
- LFLex Fridman
Oh yeah, that should be a T-shirt.
- ASAman Sanger
(laughs)
- MTMichael Truell
But, like, fundamentally, I think one of the things that draws a lot of people to- to building stuff on computers is this, like, insane integration speed, where, you know, in other disciplines, you might be sort of gatekept by resources or the ability, even the ability, you know, to get a large group together, and coding is this, like, amazing thing where it's you and the computer, and, uh, that alone, you can- you can build really cool stuff really quickly.
- 3:09 – 10:27
GitHub Copilot
- LFLex Fridman
So for people who don't know, Cursor is this super cool new editor that's a fork of VS Code. It would be interesting to get your kind of explanation of your own journey of editors. How did you... I think all of you are f- were big fans of VS Code with Copilot. How did you arrive to VS Code, and how did that lead to your journey with Cursor?
- MTMichael Truell
Yeah, um, so I think a lot of us, well a- all of us were originally Vim users.
- ASAman Sanger
Pure- Pure Vim.
- MTMichael Truell
Pure Vim, yeah.
- ASAman Sanger
(laughs)
- MTMichael Truell
No Neovim, just pure Vim in a terminal. And at le- at least for myself, it was around the time that Copilot came out, so 2021, that I really wanted to try it. So I went into VS Code, the only platform, uh, the only code editor in which it was available, and even though I, you know, really enjoyed using Vim, just the experience of Copilot with- with VS Code was more than good enough to convince me to switch. And so that kind of was the default until we started working on Cursor.
- LFLex Fridman
And, uh, maybe we should explain what Copilot does. It's like a really nice autocomplete. It suggests, as you start writing a thing, it suggests one or two or three lines, how to complete the thing. And there's a fun experience in that, y- you know like when you have a close friendship and your friend completes your sentences?
- ASAman Sanger
(laughs)
- MTMichael Truell
(laughs) Yeah.
- LFLex Fridman
Like, when it's done well, there's an intimate feeling. Uh, there's probably a better word than intimate, but there's a, there's a cool feeling of like, "Holy shit, it- it gets me."
- MTMichael Truell
Yeah. (laughs)
- ASAman Sanger
For real. Yeah.
- LFLex Fridman
You know? Now, and then there's an unpleasant feeling when it doesn't get you. Uh, and so there's that- that kind of friction, but, uh, I would say for a lot of people, the feeling that it gets me overpowers the it doesn't.
- ASAman Sanger
And I think actually one of the underrated aspects of GitHub Copilot is that even when it's wrong, it's- it's, like, a little bit annoying, but it's not that bad, because you just type another character, and then maybe then it gets you, or you type another character and then- then it gets you. So even when it's wrong, it's not that bad.
- MTMichael Truell
Yeah, you- you can sort of iterate- iterate and fix it.
- ASAman Sanger
Yeah.
- MTMichael Truell
I mean, the other underrated part of Y- uh, Copilot for me sort of was just the first real- real AI product. So the first language model consumer product.
- LFLex Fridman
So Copilot was kind of like the first killer app for, uh, LLMs.
- MTMichael Truell
Yeah. Yeah.
- ASAman Sanger
Yeah.
- MTMichael Truell
And, like, the beta was out in 2021.
- LFLex Fridman
Right. Okay.
- MTMichael Truell
(laughs)
- LFLex Fridman
Uh, so what's the- the origin story of Cursor?
- MTMichael Truell
So around 2020, the scaling laws papers came out from- from OpenAI, and that was a moment where this looked like clear, predictable progress for the field, where even if we didn't have any more ideas, it looked like you could make these models a lot better if you had more compute and more data.
- LFLex Fridman
Uh, by the way, we'll probably talk, uh, for three to four hours on- on the topic of scaling laws.
- MTMichael Truell
(laughs)
- ASAman Sanger
(laughs) Yes. Yes.
- MTMichael Truell
(laughs)
- 10:27 – 16:54
Cursor
- LFLex Fridman
Okay, so can we take it all the way to Cursor?
- MTMichael Truell
Mm-hmm.
- LFLex Fridman
And what is Cursor? It's a fork of VSCode, and VSCode is one of the most popular editors for a long time. Like, everybody fell in love with it. Everybody left Vim. I left Emacs for it. Sorry.
- MTMichael Truell
(laughs)
- ALArvid Lunnemark
(laughs)
- LFLex Fridman
Uh, so it unified, in some fun- fundamental way, the, uh, the- the developer community. And then, the... you look at the space of things, you look at the Scaling Laws, AI is becoming amazing, and you decided, "Okay, it's not enough to just write an extension for your VSCode because there's a lot of limitations to that. We're... we need... if AI is going to keep getting better, better and better, we need to really, like, rethink how the- the AI is going to be part of the editing process." And so you decide to fork VSCode-
- MTMichael Truell
Yeah.
- LFLex Fridman
... and start to build a lot of the amazing features we'll be able to- to, uh, to talk about. But what was that decision like? Because there's a lot of extensions-
- MTMichael Truell
Mm-hmm.
- LFLex Fridman
... including Copilot of VSCode that are doing sort of AI type stuff. What was the decision like to just fork VSCode?
- MTMichael Truell
So the decision to do an editor seemed kind of self-evident to us for at least what we wanted to do and achieve. Because when we started working on the editor, the idea was, "These models are going to get much better. Their capabilities are gonna improve, and it's going to entirely change how you build software," both in a, "You will have big productivity gains," but also radical in how, like, the act of building software is going to change a lot. And so you're very limited in the control you have over a code editor if you're a plug into an existing coding environment. Um, and we didn't want to get locked in by those limitations. We wanted to be able to, um, just build the most useful stuff.
- LFLex Fridman
Okay, well then the natural question is...You know, VS Code is kind of, with Copilot, a competitor. So how do you win? Is it- is it basically just the speed and the quality of the features?
- MTMichael Truell
Yeah, I mean, I think this is a space that is quite interesting, perhaps quite unique, where if you look at previous tech waves, maybe there's kind of one major thing that happened and it unlocked a new wave of companies. But every single year, every single model capability, uh, or jump you get in model capabilities, you now unlock this new wave of features, things that are possible, especially in programming. And so, I think in AI programming, being even just a few months ahead, let alone a year ahead, makes your product much, much, much more useful. I think Cursor, a year from now, will need to make the Cursor of- of today look obsolete. And I think, you know, Microsoft has done a number of, like, fantastic things, but I don't think they're in a great place to really keep innovating and pushing on this in the way that a startup can.
- LFLex Fridman
Just rapidly implementing features.
- MTMichael Truell
A- and push- Yeah, like, and- and kind of doing the research experimentation necessary, um, to really push the ceiling.
- ALArvid Lunnemark
I don't- I don't know if I think of it in terms of features as I think of it-
- MTMichael Truell
Mm-hmm.
- ALArvid Lunnemark
... in terms of, like, capabilities for- for programmers. It's that, like, you know, as, you know, the new o1 model came out, and I'm sure there are gonna be more- more models of different types, like longer context and maybe faster. Like, there's all these crazy ideas that you can try, and hopefully 10% of the crazy ideas will make it into something kinda cool and useful. And, uh, we want people to have that sooner. To rephrase, it's like, an underrated fact is we're making it for ourself. When we started Cursor, you really felt this frustration that, you know, models, you could see models getting better, uh, but the Copilot experience had not changed. It was just like, "Man, these- these guys, like, the ceiling is getting higher. Like, why are they not making new things? Like, th- they should be making new things. They should be like," You know, like- like, "Where's- where's- where's all the alpha features?" There- there were no alpha features. It was like, uh, I- I'm sure it w- it was selling well. I'm- I'm sure it was a great business, but it didn't feel... I- I'm- I'm one of these people that really want to try and use new things, and it was like just there was no new thing for, like, a very w- long while.
- LFLex Fridman
Yeah, it's interesting. Uh, I don't know how you put that into words, but when you compare Cursor with Copilot, Copilot pretty quickly became, started to feel stale for some reason.
- ASAman Sanger
Yeah, I- I think one thing that I think, uh, helps us is that we're sort of doing it all in one.
- LFLex Fridman
Mm-hmm.
- ASAman Sanger
We're- we're developing the- the UX and the way you interact with the model, um, at the same time as we're developing, like, how we actually make the model give better answers. So, we're like, how you build up the- the prompt or- or, like, how do you find the context, and for a Cursor tab, like, how do you train the model? Um, so I think that helps us to have all of it, like, sort of like the same people working on the entire experience end-to-end.
- ALArvid Lunnemark
Yeah, it's like the- the person making the UI and the person training the model, like, sit to, like, 18 feet away. So-
- ASAman Sanger
Mm-hmm. Often the same person, even.
- ALArvid Lunnemark
Yeah, often- often even the same person. So, you- you can- you can create things that are- that are sort of not possible if you're not- you're not talking, you're not experimenting.
- LFLex Fridman
And you're using, like you said, Cursor to write Cursor.
- ALArvid Lunnemark
Of course.
- ASAman Sanger
Oh, yeah.
- ALArvid Lunnemark
Yeah.
- 16:54 – 23:08
Cursor Tab
- ALArvid Lunnemark
One of the things we really wanted was we wanted the model to be able to edit code for us. Uh, that was kind of a wish, and we had m- multiple attempts at it before- before we had a sort of a good model that could edit code for you. Um, then after- after we had a good model, I think there- there have been a lot of effort to, you know, make the inference fast for, you know, uh, having- having a good- good experience. And, uh, we've been starting to incorporate... I mean, Michael sort of mentioned this, like, ability to jump to different places, and that jump to different places, I think, came from a feeling of, you know, once you- once you accept an edit, um, it's like, "Man, it should be just really obvious where to go next." It's like- it's like I- I'd made this change, the model should just know that, like, the next place to go to is, like, 18 lines down. Like, uh, if you're- if you're a Vim user, you could press 18jj or whatever. But, like, why- why even- why am I doing this? Like, the model- the model should just know it. And then so- so the idea was, yo, you just press tab, it would go 18 lines down and then make... It show you- show you the next edit and you would press tab. So it was just you, as long as you could keep pressing tab. And so the internal competition was, how many tabs can we make someone press? Once you have, like, the idea, uh, more- more, uh, sort of... abstractly, the, the thing to think about is sort of like once... How m- how, how are the edits sort of zero, zero entropy? So once you've sort of expressed your intent and the edit is... there's no, like, new bits of information to finish your thought, but you still have to type some characters to, like, make the computer understand what you're actually thinking. Then maybe the model should just sort of read your mind and, and all the zero entropy bits should just be, like, tabbed away.
- SASualeh Asif
Yeah. There's-
- ALArvid Lunnemark
That, that was, that was sort of the abstract version.
- SASualeh Asif
There, there's this interesting thing where if you look at language model loss on, on different domains, um, I believe the bits per byte, which is kind of character normalized loss for code, is lower than language, which means in general, there are a lot of tokens in code that are super predictable, a lot of characters that are super predictable. Um, and this is, I think even magnified when you're not just trying to auto-complete code, but predicting what the user's going to do next in their editing of existing code. And so, you know, the goal of Cursor Tab is, let's eliminate all the low entropy actions you take inside of the editor. When the intent is effectively determined, let's just jump you forward in time. Skip you forward.
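To make the bits-per-byte point concrete, here is a minimal sketch of the conversion: take the model's average per-token cross-entropy, convert nats to bits, and normalize by the number of UTF-8 bytes rather than by tokens. The numbers below are made up purely for illustration.

```python
import math

def bits_per_byte(nats_per_token: float, num_tokens: int, num_bytes: int) -> float:
    """Character/byte-normalized loss: total loss in bits divided by total bytes.

    Normalizing by bytes instead of tokens removes the tokenizer from the
    comparison, so losses on code and on prose become directly comparable.
    """
    total_bits = nats_per_token * num_tokens / math.log(2)  # nats -> bits
    return total_bits / num_bytes

# Hypothetical numbers, only to illustrate the comparison: code tends to come
# out lower (more predictable per character) than natural language.
print(bits_per_byte(0.45, num_tokens=1_000, num_bytes=3_500))  # ~0.19, code-like
print(bits_per_byte(0.80, num_tokens=1_000, num_bytes=4_300))  # ~0.27, prose-like
```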
- LFLex Fridman
Well, well g- what's the intuition and what's the technical details of how to do next Cursor prediction? The, the, that jump, that's not, that's not so intuitive, I think, to people.
- SASualeh Asif
Yeah. I think I can speak to a few of the details on how, how to make these things work. They're incredibly low latency, so you need to train small models on this, on this task. Um, in particular, they're incredibly pre-fill token hungry. What that means is they have these really, really long prompts where they see a lot of your code and they're not actually generating that many tokens. And so the perfect fit for that is using a sparse model, meaning an MoE model. Um, so that was kind of one, one breakthrough, one, one breakthrough we made that substantially improved performance at longer context. The other being, um, a variant of speculative decoding that we, we kind of built out called speculative edits. These are two, I think, important pieces of what make it quite high quality, um, and very fast.
- LFLex Fridman
Okay. So MoE Mixture of Experts, the input is huge, the output is small.
- SASualeh Asif
Yeah.
- LFLex Fridman
Okay. So, like, what, what, what else can you say about how to make... Is, like, is caching play a role in this particular-
- SASualeh Asif
Oh, caching. Caching plays a huge role.
- ALArvid Lunnemark
Mm-hmm.
- SASualeh Asif
Um, because you're dealing with this many input tokens, if every single keystroke that you're typing in a given line you had to rerun the model on all of those tokens passed in, you're just going to, one, significantly degrade latency, two, you're gonna kill your GPUs with load. So you need to, you, you need to design the actual prompts used for the model such that they're cache, caching aware, and then, yeah, you need to, you need to reuse the KV cache across requests just so that you're spending less work, less compute.
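As a rough illustration of what "caching-aware" prompt design can mean (this is a sketch, not Cursor's actual prompt code, and all names are hypothetical): keep everything that is stable across keystrokes in a fixed, deterministic prefix, and put only the part that changes with each keystroke at the end, so consecutive requests share the longest possible prefix and the server can reuse its KV cache for it.

```python
# Sketch of a cache-friendly prompt layout.

def build_tab_prompt(file_header: str, context_snippets: list[str],
                     lines_before_cursor: list[str], current_line_partial: str) -> str:
    stable_prefix = "\n".join([
        file_header,                # e.g. path + language; rarely changes
        *sorted(context_snippets),  # deterministic order -> identical prefix across requests
        *lines_before_cursor,       # unchanged until the user moves to another line
    ])
    volatile_suffix = current_line_partial  # changes on every keystroke
    return stable_prefix + "\n" + volatile_suffix

p1 = build_tab_prompt("main.py (python)", ["def helper(): ..."], ["x = 1"], "pri")
p2 = build_tab_prompt("main.py (python)", ["def helper(): ..."], ["x = 1"], "prin")
shared = next((i for i, (a, b) in enumerate(zip(p1, p2)) if a != b), min(len(p1), len(p2)))
print(f"{shared}/{len(p2)} characters are shared prefix, so their KV cache can be reused")
```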
- LFLex Fridman
Uh, again, what are the things that tab is supposed to be able to do kinda in the near term? Just to, like, sort of linger on that. Generate code, like fill empty space, also edit code across multiple lines?
- SASualeh Asif
Mm-hmm.
- ALArvid Lunnemark
Yeah.
- LFLex Fridman
And then jump to different locations inside the-
- ALArvid Lunnemark
Mm-hmm.
- LFLex Fridman
... same file?
- SASualeh Asif
Yeah.
- LFLex Fridman
And then, like, laun-
- ALArvid Lunnemark
Hopefully jump to different files also. So if you make an edit in one file and maybe, maybe you have to go, maybe you have to go to another file to finish your thought, it should, it should go to the second file also.
- SASualeh Asif
Yeah.
- ALArvid Lunnemark
And then in the-
- SASualeh Asif
The full, the full generalization is, like, next, next action prediction. Like, sometimes you need to run a command in the terminal and it should be able to suggest the command based on the code that you wrote too. Um, or sometimes you actually need to... Like, it suggests something, but you, you... It's hard for you to know if it's correct because you nee- actually need some more information to learn. Like, you need to know the type to be able to verify that it's correct. And so maybe it should actually take you to a place that's like the definition of something and then take you back so that you have all the requisite knowledge to be able to accept the next completion.
- LFLex Fridman
So providing the human the knowledge.
- SASualeh Asif
Yes.
- LFLex Fridman
Right.
- ALArvid Lunnemark
Mm-hmm. Yeah.
- LFLex Fridman
Can you integrate, like... I just, uh, got to know a guy named Primeagen who I believe has an SS... You can or- order coffee via SSH?
- 23:08 – 31:20
Code diff
- LFLex Fridman
As we're talking about this, I should mention, like, one of the really cool and noticeable things about Cursor is that there's this whole diff interface situation going on. So, like, the model suggests with, uh, with the red and the green of like, "Here's how we're gonna modify the code." And in the chat window you can apply and it shows you the diff and you can accept the diff. So maybe can you speak to whatever direction of that?
- ALArvid Lunnemark
We'll probably have like four or five different kinds of diffs. Uh, so we, we have optimized the diff for, for the auto-complete. So that has a different diff interface than, uh, than when you're reviewing larger blocks of code. And then ho- we're trying to optimize, uh, another diff thing for when you're doing multiple different files, uh, and, and sort of at a high level, the difference is for when you're doing auto-complete, it should be really, really fast to read. Uh, actually, it should be really fast to read in all situations. Um, but in auto-complete it's sort of you're, you're really... Like, your eye is focused in one area and you, you can't be in too many... You... The humans can't look in too many different places.
- LFLex Fridman
So you're talking about on the interface side, like the-
- ALArvid Lunnemark
On the interface side. So it currently has this box on the side. So we have-
- LFLex Fridman
Mm-hmm.
- ALArvid Lunnemark
... the current box, and if it tries to delete code in some place and tries to add other code, it tries to show you a box on the side.
- SASualeh Asif
You could maybe show it if we pull it up in Cursor.com. This is what we're talking about.
- ASAman Sanger
Go on.
- SASualeh Asif
Exactly, yeah.
- ASAman Sanger
So that- that box, it was, like, three or four different attempts at trying to make this- this thing work, where first, the attempt was, like, these blue crossed-out lines. So before, it was a box on the side that used to show you the code to delete by showing you, like, uh, like Google Doc style, you would see, like, a line through it.
- SASualeh Asif
Oh, yeah.
- ASAman Sanger
And then you would see the- the new code. That was super distracting. And then we tried many different var- uh, you know, there was, there was sort of deletions, there was trying to do red highlight. Then the next, uh, iteration of it, which is sort of funny, would y- you would hold the, on Mac, the option button, so it would, it would sort of highlight a region of code to show you that there might be something coming. Uh, so maybe in this an- example, like, the input and the value, uh, would get, would all get blue, and the blue would highlight that the AI had a suggestion for you. Uh, so instead of directly showing you the thing, it would show you that the AI, it would just hint that the AI had a suggestion, and if you really wanted to see it, you would hold the option button, and then you would see the new suggestion.
- SASualeh Asif
Mm-hmm.
- ASAman Sanger
Then if you release the option button, you would then see original code.
- SASualeh Asif
Mm-hmm. So that's, by the way, that's pretty nice, but you have to know to hold the option button. Yeah.
- ASAman Sanger
Uh, so it wa- it was-
- SASualeh Asif
And by the way, I'm not a Mac user but I got it.
- ASAman Sanger
(laughs)
- SASualeh Asif
Option, op- it's a button, I guess- (laughs) ... you people have.
- ASAman Sanger
It's the, you know, it's, again, it's just, it's just non-intuitive. I think that's the, that's the key thing.
- SASualeh Asif
And- and there's a chance this- this is also not the final version of it.
- ASAman Sanger
I am personally very excited for, um, making a lot of improvements in this ar- area. Like, uh, we- we often talk about it as the verification problem, where, um, these diffs are great for small edits. Uh, for large edits, w- uh, even, or, like, when it's multiple files or something, it's, um, actually a little bit prohibitive to- to review these diffs. And, uh, uh, so there are, like, a couple of different ideas here. Like, one idea that we have is, okay, you know, like, parts of the diffs are important. They have a lot of information. And then parts of the diff, um, are just very low entropy. They're like exam- like, the same thing over and over again. And so maybe you can highlight the important pieces and then gray out the- the not-so-important pieces. Or maybe you can have a model that, uh, looks at the diff and- and sees, "Oh, there is a likely bug here. I will, like, mark this with a little red squiggly and say, like, 'You should probably, like, review this part of the diff.'" Um, and ideas in- in that vein, I think, are exciting.
- SASualeh Asif
Yeah, that's a really fascinating space of, like, UX design engineering.
- ASAman Sanger
Yeah.
- SASualeh Asif
So you're basically trying to guide the human programmer through all the things they need to read and nothing more.
- ASAman Sanger
Yeah.
- SASualeh Asif
Like, optimally.
- ASAman Sanger
Yeah, and you want a- an intelligent model to do it. Like, currently, diffs algor- diff algorithms are, they're like alg- like, they're just like normal algorithms. Uh-
- SASualeh Asif
(laughs)
- 31:20 – 36:54
ML details
- LFLex Fridman
I'm really feeling the AGI with this editor.
- ASAman Sanger
(laughs)
- NANarrator
(laughs)
- LFLex Fridman
Uh, it feels like there's a lot of machine learning going on underneath. Tell me about some of the ML stuff that makes it all work.
- SASualeh Asif
Well, Cursor really works via this ensemble of custom models that, that we've trained alongside, you know, the frontier models that are fantastic at the reasoning intense things. And so Cursor Tab, for example, is a, is a great example of where you can specialize this model to be even better than even frontier models if you look at evals on, on the, on the task we set it at. The other domain, which it's kind of surprising that it requires custom models, but, but it's kind of necessary and works quite well, is in Apply. Um, so I think these models are, like the frontier models are quite good at sketching out plans for code and generating, like, rough sketches of, like, the change. But actually creating diffs is quite hard, um, for frontier models, for any pretrained model. Um, like, you try to do this with Sonnet, with o1, any frontier model, and it, it really messes up stupid things like counting line numbers, um, especially in super, super large files. Um, and so what we've done to alleviate this is we let the model kind of sketch out this rough code block that indicates what the change will be, and we train a model to then apply that change to the file.
- LFLex Fridman
And we should say that Apply is, the model looks at your code. It gives you a really damn good suggestion of what new things to do, and the seemingly, for humans, trivial step of combining the two, you're saying is not so trivial.
- ASAman Sanger
Contrary to popular perception, it is not a deterministic algorithm.
- SASualeh Asif
Yeah. I, I, I think, like, you see shallow copies of Apply, um, elsewhere, and it just breaks, like, most of the time because you think you can kind of try to do some deterministic matching, and then it fails, you know, at least 40% of the time. And that just results in a terrible product experience. Um, I think in general, this, this regime of you are going to get smarter and smarter models, and like, so one other thing that Apply lets you do is it lets you use fewer tokens with the most intelligent models. Uh, this is both expensive in terms of latency for generating all these tokens, um, and cost. So you can give this very, very rough sketch and then have your smaller models go and implement it because it's a much easier task to implement this very, very sketched out code. And I think that this, this regime will continue where you can use smarter and smarter models to do the planning, and then maybe the implementation details, uh, can be handled by the less intelligent ones. Perhaps you'll have, you know, maybe o1, maybe it'll be even more ca- capable models given an e- an even higher level plan that is kind of recursively, uh, applied by Sonnet and then the Apply model.
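A minimal sketch of the division of labor being described here, with hypothetical stand-in functions rather than any real API: a strong model produces a rough code block (with gaps for unchanged code), and a smaller, specialized apply step is responsible for merging that sketch into the full file, instead of asking the big model to emit an exact, line-numbered diff.

```python
# Both functions below are placeholders, not real model calls.

def call_frontier_model(prompt: str) -> str:
    # Imagine a strong model returning a *rough* edit, with markers for
    # untouched regions instead of exact line numbers:
    return (
        "# ... existing code ...\n"
        "def total(items):\n"
        "    return sum(item.price for item in items)\n"
        "# ... existing code ...\n"
    )

def call_apply_model(original_file: str, rough_sketch: str) -> str:
    # Stand-in for the trained apply model, which rewrites the whole file with
    # the sketch merged in (deterministic matching breaks too often to rely on).
    return original_file + "\n# (apply model would merge the sketch here)\n" + rough_sketch

def edit_file(original_file: str, instruction: str) -> str:
    sketch = call_frontier_model(f"{instruction}\n\nFile:\n{original_file}")
    return call_apply_model(original_file, sketch)  # the cheap model does the merge
```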
- ASAman Sanger
Maybe we should, we should talk about how to, how to make it fast.
- SASualeh Asif
Yeah.
- ASAman Sanger
I feel like-
- SASualeh Asif
Yeah.
- ASAman Sanger
... fast is always an interesting
- NANarrator
(laughs)
- ASAman Sanger
... detail. Fast is good.
- NANarrator
Yeah.
- LFLex Fridman
How do you make it fast?
- SASualeh Asif
Yeah. So one big component of making it fast is speculative edits. So speculative edits are a variant of speculative decoding. And maybe it'd be helpful to briefly describe speculative decoding. Um, with speculative decoding, what you do is you, you can kind of take advantage of the fact that, you know, most of the time, and I'll, I'll add the caveat that it would be when you're memory bound in, in language model generation, um, if you process multiple tokens at once, um, it is faster than generating one token at a time. So this is like the same reason why if you look at tokens per second, uh, with prompt tokens versus generated tokens, it's much, much faster for prompt tokens. Um, so what we do is instead of using what speculative decoding normally does, which is using a really small model to predict these draft tokens that your larger model will then go in and, and verify, um, with code edits, we have a very strong prior of what the existing code will look like. And that prior is literally the same exact code. So what you can do is you can just feed chunks of the original code back into the, into the model, um, and then the model will just pretty much agree most of the time that, "Okay, I'm just gonna spit this code back out." And so you can process all of those lines in parallel. And you just do this with sufficiently many chunks, and then eventually you'll reach a point of disagreement where the model will now predict text that is different from the ground truth original code. It'll generate those tokens, and then we kind of will decide after enough tokens match, uh, the original code to restart speculating in chunks of code. What this actually ends up looking like is just a much faster version of normal editing code. So it's just like, it looks like a much faster version of the model rewriting all the code. So just, we, we can use the same exact interface that we use for, for diffs, but it will just stream down a lot faster.
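Here is a toy, line-based sketch of the speculative-edits loop just described. The real system works on tokens and verifies a whole draft chunk in one batched forward pass; the two callbacks below are stand-ins for the model. The structure is the point: the original file itself is the draft, accepted prefixes are copied through cheaply, and normal generation only kicks in at the first point of disagreement before speculation resumes.

```python
def speculative_edit(original: list[str], accept_prefix, generate_insertion,
                     chunk: int = 8) -> list[str]:
    """original: current file, line by line (used as the draft).
    accept_prefix(out, draft) -> how many draft lines the model reproduces verbatim.
    generate_insertion(out) -> (new_lines, num_original_lines_replaced), the slow path.
    """
    out: list[str] = []
    i = 0
    while i < len(original):
        draft = original[i:i + chunk]
        k = accept_prefix(out, draft)   # verified in parallel in the real system
        out += draft[:k]
        i += k
        if k < len(draft):              # disagreement: the model wants an edit here
            new_lines, skipped = generate_insertion(out)
            out += new_lines            # generated token-by-token (the slow path)
            i += skipped                # realign with the original, resume speculating
    return out

# Tiny fake "model" that rewrites one line of a 20-line file:
original = [f"line {n}" for n in range(20)]
def accept_prefix(out, draft):
    return next((k for k, line in enumerate(draft) if line == "line 7"), len(draft))
def generate_insertion(out):
    return ["line seven (edited)"], 1
print(speculative_edit(original, accept_prefix, generate_insertion))
```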
- ASAman Sanger
And then, and then the advantage is that whi- while it's streaming, you can just also be reviewing, start reviewing the code-
- SASualeh Asif
Exactly.
- ASAman Sanger
... before, before it's done so there's no, no big loading screen. Uh, so may- maybe that, that is part of the, part of the advantage.
- LFLex Fridman
So the human can start reading-
- SASualeh Asif
Yeah.
- LFLex Fridman
... before the thing is done.
- ASAman Sanger
I think the interesting riff here is something like, like speculation is a fairly common idea nowadays. It's like not only in language models. I mean, there's, there's obviously speculation in CPUs, and there's p- like speculation for databases and-
- ALArvid Lunnemark
... speculation all over the place.
- 36:54 – 43:28
GPT vs Claude
- LFLex Fridman
Let me ask the sort of, the ridiculous question of, uh, which LLM is better at coding? GPT, Claude? Who wins in the context of programming? And I'm sure the answer is much more nuanced because it sounds like every single part of this involves a different model.
- SASualeh Asif
Yeah. I think there, there's no cl- model that predominates, uh, others, meaning it is better in all categories that we think matter. The categories being speed, um, ability to edit code, ability to process lots of code, long context, you know, a couple of other things and, and, kind of, coding capabilities. The one that I'd say right now is just kind of net best is Sonnet. I think this is a consensus opinion. o1's really interesting and it's really good at reasoning. So if you give it really hard, uh, programming interview style problems or LeetCode problems, it can do quite, quite well on them. Um, but it doesn't feel like it kind of understands your rough intent as well as Sonnet does. Like, if you look at a lot of the other frontier models, um, one qualm I have is it feels like they're not necessarily overfit... I'm not saying they, they train on benchmarks. Um, but they perform really well in benchmarks relative to, kind of, everything that's kind of in the middle. So if you try it in all these benchmarks and things that are in the distribution of the benchmarks they're evaluated on, you know, they'll do really well, but when you push them a little bit outside of that, Sonnet's, I think, the one that, that kind of does best at, at, at kind of maintaining that same capability. Like you kind of have the same capability in the benchmark as when you try to instruct it to do anything with coding.
- LFLex Fridman
What... Another ridiculous question is the difference between the normal programming experience versus what benchmarks represent. Like where do benchmarks fall short, do you think, when we're evaluating these models?
- ALArvid Lunnemark
By the way, that's like a really, really hard... It's like, like critically important detail of like how, how different, like, benchmarks are versus, versus like real coding. With real coding, it's not interview style coding. It's you're, you're doing these... You know, humans are saying like half broken English sometimes and sometimes you're saying like, "Oh, do what I did before." Sometimes you're saying, uh, you know, "Go add this thing and then do this other thing for me and then make this UI element." And then, you know, it's, it's just a, like a lot of things are sort of context dependent. You really want to, like, understand the human and then do, do what the human wants as opposed to sort of this... Maybe the, the way to put it is sort of abstractly is, uh, the interview problems are very well specified. They f- lean a lot on specification while the human stuff is less specified.
- MTMichael Truell
Yeah. No, I think that this, this benchmark question is both complicated by what, um, Sualeh just mentioned and then also to, uh... What Aman was getting into is that even if you, like, you know, there's this problem of, like, the skew between what can you actually model in a benchmark versus, uh, real programming and that can be sometimes hard to encapsulate because it's like real programming's, like, very messy and sometimes things aren't super well specified what's correct or what isn't. But then, uh, it's also doubly hard because of this public benchmark problem. And that's both because public benchmarks are sometimes kind of hill climbed on, but then it's like really, really hard to also get the data from the public benchmarks out of the models. And so for instance, like one of the most popular, like, agent benchmarks, SWE-bench, um, is really, really contaminated in the training data of, uh, these foundation models. And so if you ask these foundation models to do a SWE-bench problem, but you actually don't give them the context of a code base, they can like hallucinate the right file paths, they can hallucinate the right function names. Um, and so the, the... It's, it's also just the public aspect of these things is tricky.
- SASualeh Asif
Yeah. Like in that case it could be trained on the literal issues or pull requests themselves and, and maybe the labs will start to do a better job, um, or they've already done a, a good job at decontaminating those things. But they're not going to omit the actual training data of the repository itself. Like these are all, like some of the most popular Python repositories, like SymPy is one example. I don't think they're going to handicap their models on SymPy and all these popular Py- Python repositories in order to get, uh, true evaluation scores in these benchmarks.
- MTMichael Truell
Yeah. I think that given the dearth in benchmarks, um, there have been like a few interesting crutches that, uh, places that build systems with these models or build these models actually use to get a sense of are they going in the right direction or not? And, uh, in a lot of places, uh, people will actually just have humans play with the things and give qualitative feedback on these. Um, like one or two of the foundation model companies, they, they have people who that's, that's a big part of their role and you know, internally we also, uh, you know, qualitatively assess these models and actually lean on that a lot in addition to like private evals that we have.
- ALArvid Lunnemark
It's like the VIBE. (laughs)
- LFLex Fridman
The VIBE, yeah. The VIBE.
- MTMichael Truell
(laughs)
- ALArvid Lunnemark
It's like a VIBE.
- LFLex Fridman
(laughs) The VIBE benchmark, human benchmark.
- MTMichael Truell
Yeah.
- LFLex Fridman
The humans, you pull in the humans to do a VIBE check.
- MTMichael Truell
Yeah.
- LFLex Fridman
Okay. I mean that's, that's kind of what I do, like just, like reading online forums and Reddit and X just like... Well, I don't know how to properly load in people's opinions 'cause they'll say things like, "I feel like Claude or GPT has gotten dumber or something." They'll say, "I feel like..." And then I sometimes feel like that too, but I wonder if it's the model's problem or mine.
- SASualeh Asif
Yeah. With Claude there is an interesting take I heard where I think AWS has different chips, um, and I, I suspect they have slightly different numerics than, uh, NVIDIA GPUs and-Someone speculated that Claude's degre- de- degraded performance had to do with maybe using the quantized version that existed on AWS Bedrock versus, uh, whatever was running on, on Anthropic's GPUs.
- LFLex Fridman
I interview a bunch of people that have conspiracy theories, so I'm glad-
- ASAman Sanger
(laughs)
- SASualeh Asif
(laughs)
- LFLex Fridman
... you spoke, you spoke to this conspiracy theory.
- ASAman Sanger
Well, it's-
- SASualeh Asif
(laughs)
- ASAman Sanger
... it's not, not, like, a conspiracy theory as much as-
- SASualeh Asif
(laughs)
- LFLex Fridman
(laughs)
- ASAman Sanger
... it is they're just, they're, like, they're, you know, humans, humans are humans and there's, there's these details.
- LFLex Fridman
Yes.
- ASAman Sanger
And, you know, you're doing, like, this crazy amount of flops. And, you know, chips are messy and, man, you can just have bugs. Like, bugs are... It's- it's hard to overstate how, how hard bugs are to avoid.
- 43:28 – 50:54
Prompt engineering
- LFLex Fridman
What's, uh, the role of, uh, a good prompt in all this? So you s- you, you mentioned that benchmarks have really, uh, structured, well-formulated prompts. Wh- what, what should a human being doing to maximize success? And what's the importance of what the humans... You wrote a blog post on, you called it, uh, Prompt Design.
- ASAman Sanger
Yeah. Uh, I think it depends on which model you're using, and all of them are slightly different and they respond differently to different prompts. But, um, I think the original GPT-4, uh, and the original sort of pre-dable models last, last year, they were quite sensitive to the prompts. And they also had a very small context window. And so we have all of these pieces of information around the code base that would maybe be relevant in the prompt. Like, you have the docs, you have the files that you add, you have the conversation history. And then there's a problem, like, how do you decide what you actually put in the prompt and when you have a, a limited space? And even for today's models, even when you have long context, filling out the entire context window means that it's slower. It means that sometimes the model actually gets confused, and some models get more confused than others. And we have this one system internally that we call Preempt, which helps us with that a little bit. Um, and I think it was built for the era before where we had 8,000, uh, token context windows. Uh, and it's a little bit similar to when you're making a website. You, you sort of, you, you want it to work on mobile, you want it to work on a desktop screen, and you have this, uh, dynamic information which you don't have, for example, if you're making, like, designing a print magazine. You have, like, you know exactly where you can put stuff. But when you have a website or when you have a prompt, you have these inputs, and then you need to format them to always work. Even if the input is really big, then you might have to cut something down. Uh, and, and, and so the idea was, okay, like, let's take some inspiration. What's the best way to design websites? Well, um, the thing that we really like is, is React and the declarative approach where you, um, you use JSX in, in, in JavaScript. Uh, and then you declare this is what I want, and I think this has higher priority or, like, this has higher Z index than something else. Um, and then you have this rendering engine. Uh, in web design it's, it's, like, Chrome, and, uh, in our case it's the Preempt Renderer, uh, which then fits everything onto the page. And as you declaratively decide what you want and then it figures out what you want. Um, and, and so we have found that to be, uh, quite helpful. And I think the role of it has, has sort of shifted over time, um, where it initially was to fit to these small context windows. Now it's really useful because, you know, it helps us with splitting up the data that goes into the prompt and the actual rendering of it. And so, um, it's easier to debug because you can change the rendering of the prompt and then try it on old prompts because you have the raw data that went into the prompt. And then you can see did my change actually improve it for, for, like, this entire eval set.
- LFLex Fridman
So do you literally prompt with JSX?
- ASAman Sanger
Yes.
- SASualeh Asif
Yeah.
- ASAman Sanger
Yes. So it kind of looks like React. There are components. Like, we have one component that's a file component and it takes in, like, the cursor. Like, usually there's, like, one line where the cursor is in your file, and that's, like, probably the most important line because that's the one you're looking at. And so then you can give priorities. So, like, that line has the highest priority, and then you subtract one for every line that, uh, is farther away. And then eventually when it's rendered, it figures out how many lines can actually fit, and it centers around that thing.
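A toy version of the priority scheme being described, not the actual Preempt renderer: the cursor line gets the highest priority, priority drops by one per line of distance, and the renderer keeps however many of the highest-priority lines fit the budget, re-emitting them in file order so the result is a window centered on the cursor.

```python
def render_file_component(lines: list[str], cursor_line: int, max_lines: int) -> str:
    # priority = -(distance from cursor); higher means more important
    prioritized = [(-abs(i - cursor_line), i, line) for i, line in enumerate(lines)]
    keep = sorted(prioritized, reverse=True)[:max_lines]  # best-scoring lines that fit
    keep.sort(key=lambda t: t[1])                         # restore original file order
    return "\n".join(line for _, _, line in keep)

file_lines = [f"code line {i}" for i in range(200)]
print(render_file_component(file_lines, cursor_line=120, max_lines=11))
# -> the 11 lines centered on the cursor (lines 115..125)
```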
- LFLex Fridman
That's amazing.
- ASAman Sanger
Yeah.
- SASualeh Asif
And you can do, like, other fancy things where if you have lots of code blocks from the entire code base, you could use, uh, retrieval, um, and things like embedding and re-ranking scores to add priorities for each of these components.
- LFLex Fridman
So should humans, when they ask questions, also use, try to use something like that? Like, would it be beneficial to write JSX in the, in the prompt? Or the whole idea is it should be loose and messy?
- ASAman Sanger
I, I think our goal is kind of that you should just, uh, do whatever is the most natural thing for you.
- SASualeh Asif
Yeah.
- LFLex Fridman
Well-
- ASAman Sanger
And then we, our job is to figure out-
- SASualeh Asif
Mm-hmm.
- LFLex Fridman
Mm-hmm.
- ASAman Sanger
... how do we actually, like, retrieve the relevant things so that your thing actually makes sense.
- LFLex Fridman
Well, this is, uh, sort of the discussion I had with, uh, Aravind of Perplexity is, like, his whole idea is, like, you should let the person be as lazy-
- ASAman Sanger
Yes.
- SASualeh Asif
Mm-hmm.
- LFLex Fridman
... as he wants to be.
- SASualeh Asif
Mm-hmm.
- ASAman Sanger
Mm-hmm. Mm-hmm.
- LFLex Fridman
But, like, yeah, that's a beautiful thing, but I feel like you're allowed to ask more of programmers, right?
- SASualeh Asif
Yes. Yes.
- LFLex Fridman
So, like, if you say, "Just do what you want," I mean, humans are lazy. There's a kind of tension between-
- ASAman Sanger
Yes.
- LFLex Fridman
... just being lazy-
- ASAman Sanger
Yeah.
- 50:54 – 1:04:51
AI agents
- LFLex Fridman
To what degree do you use, uh, agentic approaches? How useful are agents?
- ASAman Sanger
We think agents are really, really cool.
- LFLex Fridman
(laughs)
- ASAman Sanger
Like, I, I, I think-
- LFLex Fridman
Okay.
- ASAman Sanger
... agents is like, um... It's like it resembles sort of like a human. It's, it's sort of like the things... Like, you can kind of feel that it... like, you're getting closer to AGI because you see a demo where, um, it acts as, as a human would. And, and it's really, really cool. I think, um, agents are not yet super useful for many things. They... I think we're, we're getting close to where they will actually be useful. And, and so I think, uh, there are certain types of tasks where having an agent would be really nice. Like, I would love to have an agent. For example, if... Like, we have a bug where you, you sometimes can't Command+C and Command+V, uh, inside our chat input box, and that's a task that's super well specified. I just want to say, like, in two sentences, "This does not work. Please fix it." And then I would love to have an agent that just goes off, does it, and then, uh, a day later, I, I come back and I review the, the thing.
- LFLex Fridman
You mean, it goes, finds the right file and-
- ASAman Sanger
Yeah. It finds the right files. It, like, tries to reproduce the bug. It, like, fixes the bug, and then it verifies that it's correct. And this is... could be a process that takes a long time. Um, and so I think I would love to have that. Uh, and then I think a lot of programming... Like, there is often this belief that agents will take o- over all of programming. Um, I don't think we think that that's the case because a lot of programming, a lot of the value is in iterating or you don't actually want to specify something upfront because you don't really know what you want until you've seen an initial version and then you want to iterate on that, and then you provide more information. And so for a lot of programming, I think what you actually want is a system that's instant that gives you an initial version instantly back, and then you can iterate super, super quickly.
- LFLex Fridman
Uh, what about something like that, um, Replit Agent, that also does, like, setting up the development environment, installing software packages, configuring everything, configuring the databases and actually deploying the app?
- ASAman Sanger
Yeah.
- LFLex Fridman
Is that also, in, in the set of th- things you dream about?
- ASAman Sanger
I think so. I think that would be really cool. I... For, for certain types of programming, uh, it, it would be really cool.
- LFLex Fridman
Is that within scope of Cursor?
- ASAman Sanger
Yeah. We aren't actively working on it right now, um, but it's definitely, like, we want to make the programmer's life easier and more fun. And some things are just really tedious and you need to go through a bunch of steps and you want to delegate that to an agent. Um, and then some things, you can actually have an agent in the background while you're working. Like, let's say, you have a PR that's both backend and frontend and you're working on the frontend and then you can have a background agent that does some work and figure out kind of what you're doing and then when you get to the backend part of your PR, then you have some, like, initial piece of code that you can then iterate on. Um, and, and so that, that would also be really cool.
- LFLex Fridman
One of the things we already talked about is speed. But I (laughs) I wonder if we can just, uh, linger on that some more and the, the various places that, uh... the technical details involved in making this thing really fast. So every single aspect of, uh, Cursor, most aspects of Cursor feel really fast. Like I mentioned, the apply is probably the slowest thing and for me personally.
- ASAman Sanger
Yeah.
- LFLex Fridman
I'm sorry. The pain on Sualeh's (laughs) face as I say that.
- ASAman Sanger
(laughs) I know. It's a, it's a pain. It's a pain that we're feeling and-
- LFLex Fridman
Well...
- ASAman Sanger
... we're working on fixing it. (laughs)
- LFLex Fridman
Uh... (laughs) Yeah. I mean, it says something that something that feels... I don't know what it is, like one second or two seconds, that feels slow, that means... that's actually, uh, shows that everything else is just really, really fast. Um, so is there some technical details about how to make some of these models, how to make the chat fast, how to make the diffs fast? Is there something that just jumps to mind?
- ASAman Sanger
Yeah. I mean, so we can go over a lot of the strategies that we use. One interesting thing is cache warming. Um, and so what you can do is if... as the user is typing, you can have...
- SASualeh Asif
Yeah. You're, you're probably going to use, uh, some piece of context, and you can know that before the user's done typing. So, you know, a- as we discussed before, reusing the KV cache results in lower latency, lower cost, uh, across requests. So as the user starts typing, you can immediately warm the cache with, like, let's say, the current file contents. And then when they press Enter, uh, there's very few tokens it actually has to pre-fill and compute before starting the generation. This will significantly lower TTFT (time to first token).
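A minimal sketch of cache warming (the client function here is hypothetical, and real inference APIs differ): while the user is still typing, fire a throwaway request containing the context you expect to send, so the server's KV cache for that prefix is already hot, and the real request only has to prefill the short new suffix.

```python
import threading

def llm_complete(prompt: str, max_tokens: int) -> str:
    return ""  # placeholder for a real inference call

def warm_cache(expected_context: str) -> None:
    # Generate essentially nothing: the point is only to prefill the prompt so
    # its KV cache entries exist server-side for the upcoming real request.
    llm_complete(expected_context, max_tokens=1)

def on_user_starts_typing(current_file: str) -> None:
    threading.Thread(target=warm_cache, args=(current_file,), daemon=True).start()

def on_user_presses_enter(current_file: str, user_message: str) -> str:
    # Same prefix as the warm-up request, so only the short user message needs
    # a fresh prefill, which lowers the time to first token (TTFT).
    return llm_complete(current_file + "\n\n" + user_message, max_tokens=512)
```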
- LFLex Fridman
Can you explain how KV cache works?
- SASualeh Asif
Yeah. So the way transformers work, um-
- LFLex Fridman
(laughs)
- ALArvid Lunnemark
(laughs)
- SASualeh Asif
... uh-
- LFLex Fridman
I like it.
- 1:04:51 – 1:09:31
Running code in background
- LFLex Fridman
Arvid, you wrote a blog post, Shadow Workspace.
- ASAman Sanger
Yes.
- LFLex Fridman
Iterating on code in the background.
- ASAman Sanger
Yeah.
- LFLex Fridman
So what's going on behind the scenes?
- ASAman Sanger
Uh, so to be clear, we want there to be a lot of stuff ha- stuff happening in the background, and we're experimenting with a lot of things. Uh, right now, uh, we don't have much of that happening other than like the, the cache warming or like, you know, uh, figuring out the right context, too, that goes into your Command+K prompts, for example. Uh, but the idea is if you can actually spend computation in the background, then you can help, um, help the user maybe like at a slightly longer time horizon than just predicting the next few lines that you're gonna make. But actually like in the next 10 minutes, what are you going to make? And by doing it in the background, you can spend more com- computation doing that. And so the idea of the shadow workspace that, that we implemented, and we use it internally for like experiments, um, is that to actually get advantage of doing stuff in the background, you want some kind of feedback signal to gi- give back to the model. Because otherwise, like you can get higher performance by just letting the model think for longer. Um, and, and so like o1 is a good example of that. But another way you can improve performance is by letting the model iterate and get feedback. And, and, and so one very important piece of feedback when you're a programmer is, um, the language server, which is, uh, this thing, it exists, uh, for most different languages and there's like a separate language server per language. And it can tell you, you know, you're using the wrong type here, and then gives you an error. Or it can allow you to go to definition and sort of understands the structure of, of your code. So language servers are extensions developed by, like there is a TypeScript language server developed by the TypeScript people, a Rust language server developed by the Rust people, and then they all int- interface over the language server protocol to VS Code. So that VS Code doesn't need to have all of the different languages built into VS Code, but rather, uh, you can use the existing compiler infras- infrastructure.
- LFLex Fridman
For linting purposes? What, what-
- ASAman Sanger
It's for, it's for linting, it's for going to definition, uh, and for like seeing the, the right types that you're using. Um-
- LFLex Fridman
So it's doing like type checking also?
- ASAman Sanger
Yes. Type checking and, and going to references. Um, and that's like when you're working in a big project, you, you kind of need that. If you, if you don't have that, it's like really hard to, to code in a big project.
- LFLex Fridman
Can you say again how that's being used inside Cursor, the-... the language s-server protocol communication thing?
- ASAman Sanger
Yes. So it's being used in Cursor to show to the programmer, just like in VS Code. But then the idea is you want to show that same information to the models, the AI models. Um, and you want to do that in a way that doesn't affect the user, because you want to do it in background. And so, uh, the idea behind the shadow workspace was, okay, like, one way we can do this is, um, we spawn a separate window of Cursor that's hidden, and so you, you can set this flag in Electron as hidden. There is a window, but you don't actually see it. And inside of this window, uh, the AI agents can modify code however they want, um, as long as they don't save it, because it's still the same folder. Um, and then can get feedback from, from the linters and go to definition and, and iterate on their code.
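A bare-bones sketch of the iterate-with-feedback loop this enables, under the assumption that the model call is supplied by the caller and that a real implementation would query a language server over LSP rather than the toy syntax check used here; crucially, the candidate text only ever lives in memory and is never saved to disk.

```python
def get_diagnostics(file_text: str) -> list[str]:
    # Toy stand-in for a language server: just check that (Python) text parses.
    # A real shadow workspace would collect type errors, unresolved names, etc.
    try:
        compile(file_text, "<shadow-workspace>", "exec")
        return []
    except SyntaxError as e:
        return [f"line {e.lineno}: {e.msg}"]

def iterate_in_shadow_workspace(file_text: str, task: str, propose_edit,
                                max_iters: int = 5) -> str:
    """propose_edit(text, task, errors) -> new text; stands in for an LLM call."""
    candidate, errors = file_text, []
    for _ in range(max_iters):
        candidate = propose_edit(candidate, task, errors)  # edit in memory only
        errors = get_diagnostics(candidate)                # feedback signal
        if not errors:
            return candidate                               # clean: hand back to the user
    return candidate                                       # surface after max_iters anyway
```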
- LFLex Fridman
So, like, literally run everything in the background. Like, as if... Right.
- ASAman Sanger
Yeah.
- LFLex Fridman
May- maybe even run the code? Is that-
- ASAman Sanger
Uh, so that's the, uh, eventual version.
- LFLex Fridman
Okay.
- ASAman Sanger
That's what you want. And a lot of the blog post is actually about, how do you make that happen? Because it's a little bit tricky. You want it to be on the user's machine so that it exactly, uh, mirrors the user's environment. And then on Linux, you can do this cool thing where you can actually mirror the file system and have the AI make changes to the files and, and it thinks that it's operating on the file level, but actually that's stored in, in memory and you, you can, uh, uh, create this kernel extension to, to make it work. Um, whereas on Mac and Windows, it's a little bit m- more difficult, uh, and, and, uh, but it's, it's a fun technical problem, so that's why - Yeah.
- LFLex Fridman
... right at the end.
- SASualeh Asif
One, one maybe hacky but interesting idea that I like is holding a lock on saving. And so basically, you can then have the language model kind of hold the lock on, on saving to disc, and then instead of you operating in the ground truth version of the files, uh, that are saved to disc, you, you actually are operating on what was the shadow workspace before and these unsaved things that only exist in memory that you still get linter errors for and you can code in. And then when you try to maybe run code, it's just, like, there's a small warning that, that there's a lock, and then you kind of will take back the lock from the language server if you're trying to do things concurrently. Or from the, the shadow workspace if you're trying to do things concurrently.
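One way to picture the "hold a lock on saving" idea, purely as an illustration and not Cursor's implementation: the AI holds a lock while it iterates on the in-memory text (which linters can still see), and a user-initiated save or run blocks until that lock is taken back.

```python
import threading

class SaveLockedBuffer:
    def __init__(self, path: str, text: str):
        self.path, self.text = path, text
        self._save_lock = threading.Lock()

    def ai_holds_save_lock(self):
        return self._save_lock      # the AI acquires this while iterating in memory

    def edit_in_memory(self, new_text: str) -> None:
        self.text = new_text        # visible to the language server, not written to disk

    def save(self) -> None:
        # Blocks until the AI releases the lock (the user "takes it back"),
        # then flushes the in-memory state to the ground-truth file on disk.
        with self._save_lock:
            with open(self.path, "w") as f:
                f.write(self.text)
```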
- 1:09:31 – 1:14:58
Debugging
- LFLex Fridman
That's such an exciting future, by the way. It's a bit of a tangent, but, like, to allow a model to change files. It's scary for people, but, like, it's really cool. To be able to just, like, let the agent do a s- a set of tasks and you come back the next day and kind of observe, like it's a colleague or something like that. Yeah.
- SASualeh Asif
Yeah. And I think there may be different versions of, like, runability, where for the simple things, where you're doing things in the span of a few minutes on behalf of the user as they're programming, it makes sense to make something work locally on their machine. I think for the more aggressive things, where you're making larger changes that take longer periods of time, you'll probably want to do this in some sandbox remote environment. And that's another incredibly tricky problem of how do you exactly reproduce or mostly reproduce to the point of it being effectively equivalent for running code the user's environment with this remote, r- remote sandbox?
- ASAman Sanger
I'm curious what kind of agents you want for, for coding.
- LFLex Fridman
(laughs)
- SASualeh Asif
Oh, for-
- ASAman Sanger
Do you, do you want them to find bugs? Do you want them to, like, implement new features? Like, w- what, what agents do you want?
- LFLex Fridman
So by the way, when I think about agents, I don't think just about coding. Uh, I think... So for t- pr- this, th- this particular podcast, there's video editing, and a lot of... If you look in Adobe, a lot of... There's code behind. Uh, it's very poorly documented code. But you could interact with Premiere, for example, using code, and, uh, basically all the uploading, everything I do on YouTube, everything, as you could probably imagine, I do all of that through code. And s- and including translation and overdubbing, all of this. So I envision th- all of those kinds of tasks, so automating many of the tasks that don't have to do directly with the editing. So th- that. Okay. That's what I was thinking about. But in terms of coding, I would be s- fundamentally thinking about bug finding. Like, many levels of kind of b- bug finding, and also bug finding, like, logical bugs... Not logical, like spiritual bugs or something.
- ASAman Sanger
(laughs)
- SASualeh Asif
(laughs)
- LFLex Fridman
Ones, like, sort of big directions of implementation, that kind of stuff.
- ASAman Sanger
What's your opinion on bug finding?
- SASualeh Asif
Yeah. I mean, it's really interesting that these models are so bad at bug finding, uh, when just naively prompted to find a bug. They're incredibly poorly calibrated.
- ASAman Sanger
Even the, the smartest models.
- LFLex Fridman
Mm-hmm.
- SASualeh Asif
Exactly. Even o- even o1.
- LFLex Fridman
How do you explain that? Is there a good intuition?
- SASualeh Asif
I think these models are a really strong reflection of the pre-training distribution. And, you know, I do think they, they generalize as the loss gets lower and lower, but I don't think the loss and the scale are quite there... or the loss is low enough such that they're, like, really fully generalizing on code. Like, the things that we use these things for, uh, the frontier models, that, that they're quite good at are really code generation and question answering. And these things exist in massive quantities in pre-training, with all of the code on GitHub on the scale of many, many trillions of tokens, and questions and answers on things like Stack Overflow and maybe GitHub Issues. And so when you try to push one of these things that really don't exist, uh, very much online, like, for example, the Cursor Tab objective of predicting the next edit given the edits done so far, uh, the brittleness kind of shows. And then bug detection is another great example, where there aren't really that many examples of, like, actually detecting real bugs and then proposing fixes. Um, and the models just kind of, like, s- really struggle at it. But I think i- i- it's a question of transferring the model. Like, in the same way that you get this fantastic transfer, um, from pretrained models, uh, just on code in general to the Cursor Tab objective, uh, you'll see a very, very similar thing with generalized models that are really good at code to bug detection. It just takes, like, a little bit of kind of nudging in that direction.
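One way to picture the gap being described: next-edit prediction and bug detection are both objectives layered on the same pretrained code model, but the web supplies far more raw material for the former than the latter. The record shapes below are purely illustrative, not Cursor's training format.

```typescript
// Illustrative only: hypothetical training-example shapes for two objectives
// built on top of the same pretrained code model.

// Next-edit prediction: plentiful signal, since every commit or diff online
// is effectively an example of "given these edits, what comes next?"
interface NextEditExample {
  fileBefore: string;     // file contents before the edit
  recentEdits: string[];  // edits the user has made so far in this session
  targetEdit: string;     // the edit the model should propose next
}

// Bug detection: scarce signal, since few clean "here is the bug, here is the
// fix" pairs exist online, so this objective needs the extra nudging described above.
interface BugDetectionExample {
  code: string;                       // a snippet that may or may not contain a bug
  bugSpan: [number, number] | null;   // character range of the bug, if any
  suggestedFix?: string;              // optional proposed fix
}
```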
- ASAman Sanger
Like, to be clear, I think they sort of understand code really well. Like, while they're being pre-trained, like, the representation that's being built up, like, almost certainly, like, you know, somewhere in the stream there's... the model knows that maybe there's- there's some sket- something sketchy going on, right? It sort of has some sketchiness. But actually eliciting this, the sketchiness to, uh... Like actually, like, part- part of it is that humans are really calibrated on which bugs are really important.
- LFLex Fridman
Mm-hmm. Yeah. Yeah.
- ASAman Sanger
It's not just actually, it's not just actually saying, like, "There's something sketchy." It's, like, is it just sketchy-but-trivial? Or is it sketchy, like, "You're gonna take the server down"?
- LFLex Fridman
Yeah. Yeah.
- ASAman Sanger
It's like, like, part of it is maybe the cultural knowledge of, uh... Like, why is the staff engineer a staff engineer? A staff engineer is, is good because they know that three years ago, like, someone wrote a really s- you know, sketchy piece of code that took- took the server down. And as opposed to, like-
- LFLex Fridman
(laughs)
- ASAman Sanger
... as opposed to maybe there's like, you know, you just, this thing is like an experiment, so like a few bugs are fine. Like, you're just trying to experiment and get the feel of the thing. And so if the model gets really annoying when you're writing an experiment, that's really bad. But if you're writing something for s- super production, you're like writing a database, right? You're, you're writing code in Postgres or Linux or whatever, like you're Linus Torvalds. It's- it's sort of unacceptable to have even an edge case. And just, just having the calibration of like, how paranoid is the user? Like...
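A toy illustration of that calibration point, not anything Cursor actually does: the same bug-finding request could carry an explicit "how paranoid should you be" dial that depends on whether the code is a throwaway experiment or production-critical infrastructure. The function and prompt wording here are made up for illustration.

```typescript
// Toy sketch: dial the model's paranoia to the stakes of the code.
type Stakes = "experiment" | "normal" | "production-critical";

function buildBugFindingPrompt(code: string, stakes: Stakes): string {
  const guidance: Record<Stakes, string> = {
    "experiment":
      "This is throwaway experimental code. Only flag bugs that break the experiment.",
    "normal":
      "Flag bugs that would matter in code review; ignore purely stylistic issues.",
    "production-critical":
      "This is database/infrastructure code. Flag every edge case, however unlikely.",
  };
  return `${guidance[stakes]}\n\nFind bugs in the following code:\n${code}`;
}
```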
- LFLex Fridman
Yeah. But even then, like, if you're putting in maximum paranoia, it still just, like, doesn't quite get it.
- ASAman Sanger
Yeah,
- 1:14:58 – 1:26:09
Dangerous code
- ASAman Sanger
yeah, yeah.
- LFLex Fridman
I mean, uh, but this is hard for humans too, to understand which line of code is important and which is not. Like, you, I think one of your principles on the website says if, if a piece of code can do a lot of damage, one should add a comment that says, "This- this- this line of code is, is dangerous."
- ASAman Sanger
And, uh, all caps.
- LFLex Fridman
All caps. (laughs)
- ASAman Sanger
And repeat it 10 times.
- LFLex Fridman
Read it 10... No, you say like, "For every single line-"
- ASAman Sanger
Yes.
- LFLex Fridman
... of code inside the function, you have to a- and that- that's quite profound. That says something about human beings, because the, uh, the engineers move on. Even the same person might just forget how it can sink the Titanic, a single function. Like you don't-
- ASAman Sanger
Yeah, it-
- LFLex Fridman
You might not intuit that quite clearly by looking at the single piece of code.
- ASAman Sanger
Yeah, and I think that, that one is also, uh, partially also for e- today's AI models, where, uh, if you actually write "dangerous, dangerous, dangerous" in every single line, like, uh, the models will pay more attention to that and will-
- LFLex Fridman
(laughs)
- ASAman Sanger
... be more likely to find bugs in that region.
- LFLex Fridman
That's, uh, actually just straight up a really good practice of labeling code, of how much damage this can do.
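The practice being discussed might look roughly like the snippet below; the function and comment text are invented purely to illustrate the convention, not quoted from anyone's codebase.

```typescript
// Illustration of the convention: loud, repeated warnings that both future
// humans and today's models are more likely to attend to.

// Hypothetical destructive helper, stubbed out for the example.
declare function deleteShardFromAllReplicas(shardId: string): void;

// DANGEROUS: this function permanently deletes customer data.
// DANGEROUS: there is no undo. DANGEROUS: read every line before editing.
function dropOldShards(shardIds: string[]): void {
  for (const id of shardIds) {
    // DANGEROUS: a wrong id here cannot be recovered.
    deleteShardFromAllReplicas(id);
  }
}
```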
- ASAman Sanger
Yeah, I mean, it's controversial.
- LFLex Fridman
(laughs)
- ASAman Sanger
Like, it, it's, some people think it's ugly. Uh, Sualeh does not like it. Well, I- I actually think it's, it's, uh, like-
- LFLex Fridman
(laughs)
- ASAman Sanger
In fact, I- I actually think this is one of the things I learned from Arvid is, you know, like, I, uh, sort of aesthetically, I don't like it.
- LFLex Fridman
Mm-hmm.
- ASAman Sanger
But I think there's certainly something where like, it's- it's useful for the models and- and humans just forget a lot, and it's really easy to make a small mistake and cause like, bring down the... You know, like, just bring down the server and like, y- like, uh, of course we t- we like test a lot and whatever, but there- there's always these things that you have to be very careful about.
- LFLex Fridman
Yeah, like with just normal doc strings, I think people will often just skim it when making a change and think, "Oh, this, I- I know how to do this." Um, and you kind of really need to point it out to them so that that doesn't slip through. Yeah, you have to be reminded that you could do a lot of damage. That's like, we don't really think about that. Like...
- ASAman Sanger
Yeah.
- LFLex Fridman
You think about, okay, m- how do I figure out how this works so I can improve it? You don't think about the other direction that it could
- MTMichael Truell
Yeah.
- LFLex Fridman
... le- so...
- ASAman Sanger
Until, until we have formal verification for everything. Then you can do whatever you want and you, you know for certain that you have not introduced a bug if the proofs pass.
- MTMichael Truell
But concretely, what do you think that future would look like?
- ASAman Sanger
I think, um, people will just not write tests anymore, and, um, the model will suggest, like you write a function, the model will suggest a spec, and you review the spec. And, uh, in the meantime, a smart reasoning model computes a proof that the implementation follows the spec. Um, and I think that happens for, for most functions.
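As a tiny illustration of that workflow in a proof assistant (Lean 4 here, chosen just for the example; the function and theorem are invented): the "spec" is a theorem about the function, and the proof is the machine-checked evidence that the implementation satisfies it.

```lean
-- Tiny illustrative sketch: the spec is a theorem, the proof is the
-- certificate that the implementation satisfies it.
def double (xs : List Nat) : List Nat :=
  xs.map (· * 2)

-- Spec a model might suggest and a human would review:
-- doubling every element never changes the length of the list.
theorem double_preserves_length (xs : List Nat) :
    (double xs).length = xs.length := by
  simp [double, List.length_map]
```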