Edwin Chen: Why optimizing for benchmarks creates AI slop

How Surge bootstrapped past $1B revenue with fewer than 100 people; Chen argues benchmark gaming pushes AI toward dopamine, emojis, and slop, not truth.

Lenny Rachitsky (host), Edwin Chen (guest)

Dec 7, 2025 · 1h 10m

EVERY SPOKEN WORD

150 min read · 29,554 words

0:00 – 4:48
Introduction to Edwin Chen
1. LRLenny Rachitsky
  You guys hit a billion in revenue in less than four years with around 60 to 70 people. You were completely bootstrapped, haven't raised any VC money. I don't believe anyone has ever done this before.
2. ECEdwin Chen
  We basically never wanted to play the Silicon Valley game. That always sounds ridiculous. I used to work at a bunch of the big tech companies and I always felt that we could fire 90% of people and we would move faster because the best people wouldn't have all the distractions. So when we started Surge, we wanted to build it completely differently with a super small, super elite team.
3. LRLenny Rachitsky
  You guys are by far the most successful data company out there.
4. ECEdwin Chen
  We essentially teach AI models what's good and what's bad. People don't understand what quality even means in this space. They think you can just throw bodies at a problem and get good data. That's completely wrong.
5. LRLenny Rachitsky
  To a regular person, it doesn't feel like these models are getting that much smarter constantly.
6. ECEdwin Chen
  Over the past year, I've realized that the values that the companies have will shape the models. I was asking Claude to help me draft an email the other day and after 30 minutes, yeah, I think it really crafted me the perfect email and I sent it. But then I realized that I spent 30 minutes doing something that didn't matter at all. If you could choose the perfect model behavior, which model would you want? Do you want a model that says, "You're absolutely right. There are definitely 20 more ways to improve this email," and it continues for 50 more iterations? Or do you want a model that's optimizing for your time and productivity and just says, "No, you need to stop. Your email's great. Just send it and move on."
7. LRLenny Rachitsky
  You have this hot take that a lot of these labs are pushing AGI in the wrong direction.
8. ECEdwin Chen
  I'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, we are optimizing for AI slop instead. We'll be optimizing our models for the types of people who buy tabloids at the grocery store. We're basically teaching our models to chase dopamine instead of truth.
9. LRLenny Rachitsky
  (instrumental music) Today my guest is Edwin Chen, founder and CEO of Surge AI. Edwin is an extraordinary CEO and Surge is an extraordinary company. They're the leading AI data company powering training at every frontier AI lab. They are also the fastest company to ever hit $1 billion in revenue in just four years after launch with fewer than 100 people and also completely bootstrapped. They've never raised a dollar in VC money. They've also been profitable from day one. As you'll hear in this conversation, Edwin has a very different take on how to build an important company and how to build AI that is truly good and useful to humanity. I absolutely loved this conversation and I learned a ton. I'm really excited for you to hear it. If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or YouTube. It helps tremendously. And if you become an annual subscriber of my newsletter, you get a ton of incredible products for free for an entire year, including Devin, Lovable, Replit, Bolt, n8n, Linear, Superhuman, Descript, Wispr Flow, Gamma, Perplexity, Warp, Granola, Magic Patterns, Raycast, Shipyard, Emob and PostHog and Stripe Atlas. Head on over to lennysnewsletter.com and click Product Pass. With that, I bring you Edwin Chen after a short word from our sponsors. My podcast guests and I love talking about craft and taste and agency and product market fit. You know what we don't love talking about? SOC 2. That's where Vanta comes in. Vanta helps companies of all sizes get compliant fast and stay that way with industry-leading AI, automation and continuous monitoring. Whether you're a startup tackling your first SOC 2 or ISO 27001, or an enterprise managing vendor risk, Vanta's Trust Management Platform makes it quicker, easier and more scalable. Vanta also helps you complete security questionnaires up to five times faster so that you can win bigger deals sooner. The result? 
According to a recent IDC study, Vanta customers slashed over $500,000 a year and are three times more productive. Establishing trust isn't optional. Vanta makes it automatic. Get $1,000 off at vanta.com/lenny. Here's a puzzle for you. What do OpenAI, Cursor, Perplexity, Vercel, Plat, and hundreds of other winning companies have in common? The answer is they're all powered by today's sponsor, WorkOS. If you're building software for enterprises, you've probably felt the pain of integrating single sign-on, SCIM, RBAC, audit logs and other features required by big customers. WorkOS turns those deal blockers into drop-in APIs with a modern developer platform built specifically for B2B SaaS. Whether you're a seed stage startup trying to land your first enterprise customer or a unicorn expanding globally, WorkOS is the fastest path to becoming enterprise ready and unlocking growth. They're essentially Stripe for enterprise features. Visit workos.com to get started or just hit up their Slack support where they have real engineers in there who answer your questions super fast. WorkOS allows you to build like the best with delightful APIs, comprehensive docs and a smooth developer experience. Go to workos.com to make your app enterprise ready today.
4:48 – 7:08
AI’s role in business efficiency
1. LRLenny Rachitsky
  (instrumental music) Edwin, thank you so much for being here and welcome to the podcast.
2. ECEdwin Chen
  Thanks so much for having me. I am super excited.
3. LRLenny Rachitsky
  I want to start with just how absurd what you've achieved is. A lot of people and a lot of companies talk about scaling massive businesses with very few people as a result of AI and you guys have done this in a way that is- is unprecedented. You guys hit a- a billion in revenue in less than four years with less than 60... around 60 to 70 people. You were completely bootstrapped, haven't raised any VC money. I don't believe anyone has ever done this before. So you guys are actually achieving the dream of what people are describing will happen with AI. I'm curious just do you think this will happen more and more as a result of AI? And a- also just where has AI most helped you, uh, find leverage to be able to do this?
4. ECEdwin Chen
  Yeah, so we hit over a billion in revenue last year with under 100 people and I think we're going to see companies with even crazier ratios, like 100 million per employee in the next few years. AI is just going to get better and better and make things more efficient so that ratio just becomes inevitable. Like I- I used to work at a bunch of the big tech companies and I always felt that we could fire 90% of people and we would move faster because the best people wouldn't have all the distractions. And so when we started Surge, we wanted to build it completely differently with a super small, super elite team. And yeah, what's crazy is that we actually succeeded. And so I think two things are colliding. One is that people are realizing that you don't have to build giant organizations in order to win. And two, yeah, all these efficiencies from AI, and they're just gonna lead to a really amazing time in company building. Like, the thing I'm most excited about is that the types of companies are going to change too. It won't just be that they're smaller. We're gonna see fundamentally different companies emerging. Like, if you think about it, fewer employees means less capital. Less capital means you don't need to raise. So instead of companies started by founders who are great at pitching and great at hyping, you will get founders who are really great at technology or product. And instead of products optimized for revenue and what VCs want to see, you'll get more interesting ones built by these tiny, obsessed teams, so people building things they actually care about, real, real technology and real innovation. So I'm actually re- really hoping that the Silicon Valley startup scene will actually go back to being a place for, for hackers
7:08 – 8:55
Building a contrarian company
1. ECEdwin Chen
  again.
2. LRLenny Rachitsky
  You guys have done a lot of things, uh, in a very contrarian way. And one was actually just not being, like, on LinkedIn posting viral posts, not on Twitter, constantly promoting Surge. I think most people hadn't heard of Surge until just recently, and then you just came out and like, okay, the fastest growing company at a billion dollars. Why would you do that? I imagine that was very intentional.
3. ECEdwin Chen
  We basically never wanted to play the Silicon Valley game. And like I always thought it was ridiculous. Like, w- what did you dream of doing when you were a kid? Was it building a company from scratch yourself and getting in the weeds of your code in your product every day? Or was it explaining all your decisions to VCs and getting on this giant PR enhan- fundraising hamster wheel? And it definitely made things more difficult for us, because yeah, when you fundraise, you just naturally get part of this kind of Silicon Valley industrial complex where people will... Your VCs will tweet about you. You'll get the TechCrunch headlines. You'll get announced in all of the newspapers because you raised at this massive valuation. And so it made things more difficult for us, because the only way we were gonna succeed was by building a 10 times better product and getting word of mouth from researchers. But I think it also meant that our customers were people who really understood data and really cared about it. Like, I always thought it was really important for us to have customers, early customers who are really aligned with what we were building and who really cared about having really high quality data and really understood how that data would make their AI models so much better, because they were the ones helping us. They were the ones giving us feedback on what we're producing, and so just having that kind of, like, very- very close mission alignment with our customers actually helped us early on. So these were people who were basically just buying our product because they knew how different it was and because it was helping them rather than because they saw some random TechCrunch headline. So it made things harder for us, but I, I think in
8:55 – 9:36
An explanation of what Surge AI does
1. ECEdwin Chen
  a really good way.
2. LRLenny Rachitsky
  It's such an empowering story to hear this journey for a... for founders, that they don't need to be on Twitter all day promoting what they're doing. They don't have to raise money. They can just kind of go heads down and build. So I, I love so much about, uh, the story of Surge. For people that don't know what Surge does, just give us a quick explanation of what Surge is.
3. ECEdwin Chen
  We essentially teach AI models what's good and what's bad. So we train them using human data and just a lot of different products that we have, like SFT, RLHF, rubrics, verifiers, RL environments, and so on and so on. And then we also measure how well they're progressing. So essentially, we're, we're, we're, we're a data
9:36 – 13:31
The importance of high-quality data
1. ECEdwin Chen
  company.
2. LRLenny Rachitsky
  What you always talk about is the quality has been the big reason you guys have been so successful, the quality of the data. What does it take to create higher quality data? What do you all do differently? What are people missing?
3. ECEdwin Chen
  I think most people don't understand what quality even means in this space. They think you can just throw bodies at a problem and get good data, and that's completely wrong. Let, let, let me give you an example. So imagine you wanted to train a model to write an eight-line poem about the moon. What makes it a good, high-quality poem? If you don't think deeply about quality, you'll be like, "Is this a poem? Does it contain eight lines? Does it contain the word moon?" You check all of these boxes, and if so, sure, yeah, you say it's a great poem. But that's completely different from what we want. We are looking for Nobel Prize-winning poetry. Like, is this poetry unique? Is it full of subtle imagery? Does it surprise you and tug at your heart? Does it teach you something about the nature of moonlight? Does it play with your emotions and does it make you think? That's what we are thinking about when we think about high-quality poem. So it might be like a haiku about moonlight on water. It might use internal rhyme and meter. There are a thousand ways to write a poem about the moon, and, and each one gives you all these different insights into language and imagery and human expression. And I think thinking about quality in this way is really hard. It's hard to measure. It's really subjective and complex and rich, and it sets a really high bar. And so we have to build all of this technology in order to measure it, like thousands of signals on all of our workers, thousands of signals on every project, every task. Like, we know at the end of the day if you are good at writing poetry versus good at writing essays versus good at writing technical doc- documentation. And so we have to gather all these signals on what your background is, what your expertise is, and not just that, like how you're actually performing when you're- when you're writing all these things. 
And we use those signals to inform whether or not you are a good enough worker for these projects and whether or not you are improving the models. And it's really hard, and so we had to build all this technology to measure it, but I think that's exactly what we want AI to do. And so we have these very, very deep notions about quality that we're always trying to, try to achieve.
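A minimal sketch of the checkbox-style grading Chen describes for the moon poem, assuming a grader that only counts lines and looks for a keyword. The function name `passes_checklist` and the sample poem are hypothetical illustrations, not anything Surge is stated to use; the point is only that a lifeless poem can tick every box.

```python
def passes_checklist(poem: str, required_lines: int = 8, keyword: str = "moon") -> bool:
    """Hypothetical checkbox grader: right line count and mentions the keyword."""
    # Count only non-empty lines so trailing blank lines do not matter.
    lines = [ln for ln in poem.splitlines() if ln.strip()]
    return len(lines) == required_lines and keyword in poem.lower()


# A deliberately dull poem: the same sentence repeated eight times.
dull_poem = "\n".join(["The moon is in the sky."] * 8)

# The checklist accepts it, even though nobody would call it good poetry.
print(passes_checklist(dull_poem))
```

Nothing in this check can tell the repeated sentence apart from a poem with imagery or surprise, which is the gap the richer, subjective notion of quality is meant to cover.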
4. LRLenny Rachitsky
  So what I'm hearing is there's kind of a just going much deeper in, uh, understanding what quality is within the verticals that you are selling data around, so you... And is this, like, a person you hire that is incredibly talented at poetry plus, uh, evals that they, I guess, help write that tell them that this is great? How, what's the, the mechanics of that?
5. ECEdwin Chen
  The way it works is we essentially gather thousands of signals about everything that you're doing when you're working on platform. So we are looking at your keyboard strokes. We are looking how fast you answer things. We are using reviews, we are using gold standards. We are using, like, we're training models ourselves-
6. LRLenny Rachitsky
  Mm-hmm.
7. ECEdwin Chen
  ... on the outputs that you create, and then we're seeing whether they improve the model's performance. And so in a very similar way to how Google Search, like when Google Search is trying to determine what is a good webpage, there's almost two aspects of it. One is you want to remove all of the worst-of-the-worst webpages. So you want to remove all the spam, all the, uh, just, like, low-quality content, all the pages that lo- don't load. And so there's like a... It's almost like a content moderation problem. You just want to remove the worst-of-the-worst. But then you also want to discover the best of the best. Okay, like, this is the best webpage or, you know, this is the best person for this job. They are not just somebody who writes the equivalent of high school level poetry. Again, like they're not just robotically writing poetry that checks all these boxes, checks all these explicit instructions, but rather, yeah, they're, they're writing poetry that makes you emotional. And so we have all these signals as well that, again, like completely differently from removing the worst-of-the-worst, we are finding the best of the best. And so we have all these signals. Again, just like Google Search uses all these signals and feeds them into their ML algorithms and uses them to predict certain types of things, we- we do the same with all of our, with all of our workers and all of our tasks and all of our projects. And so it's almost like a complicated machine learning problem at the end of the day. And, uh, that- that- that's actually how it works.
8. LRLenny Rachitsky
  That is incredibly interesting.
13:31 – 17:37
How Claude Code has stayed ahead
1. LRLenny Rachitsky
  I want to ask you about something I've been very curious about over the past couple years. If you look at Claude, it's been so much better at coding and at writing than any other model for so long, and it's really surprising just how long it took other companies to catch up, considering just how much economic value there is there, just like every AI coding product sat on top of Claude because it was so good at code, and writing also. What is it that made it so much better? Is it just the quality of the data they trained on, or is there... is there something else?
2. ECEdwin Chen
  I think there are multiple parts to it. So a big part of it certainly is the data. Like, I think people don't realize that there- there's almost, like, this infinite amount of choices that all the frontier labs are deciding between when they're choosing what data goes into their models. It's like, okay, are you purely using human data? Are you gathering the human data in XYZ way? When you are gathering the human data, what exactly are you asking the people who are creating it to create for you? Like, maybe you create... Maybe you care more... For example, in- in the coding realm, maybe you care more about front-end coding versus back-end coding. Maybe when you're doing front-end coding, you care a lot about the visual design of the front-end, uh, applications that you're creating, or maybe you don't care about it so much and you care more about, I don't know, the efficiency of it or the pure correctness over that, like, visual design. And then other questions like, okay, are you carrying both... Are you... Like, how much synthetic data are we throwing into the mix? How much do you care about these 20 different benchmarks? Like, some companies, they see these benchmarks and they're like, "Okay, for PR purposes, even though we don't think that these academic ben- benchmarks matter all that- all that much, maybe we just need to optimize for them anyways, because our marketing team needs to show certain progress on certain standard evaluations that every other company talks about. And if we don't show good performance here, it's just going to be bad for us, even if- even if, like, ignoring these academic benchmarks makes us better at real tasks." Other companies are going to be principled and be like, "Okay, yeah, no, I- I don't care about marketing. (laughs) I just care about how my model performs on these real-world tasks at the end of the day, and so I'm going to optimize for that instead." 
And it's almost like there's a trade-off between all of these different things, and there's like a... Like, one of the things that I often think about is that there's a... It's almost like there's an art to post-training. It's not purely a science. Like, when you are deciding what kind of model you're trying to create and what it's good at, there's this notion of taste and sophistication. Like, okay, do I think that these... 'Cause, so going back to the example of how good the model is at visual design. Like, okay, maybe you have a different notion of visual design than what I do. Like, maybe you care more about minimalism, and you care more about, I don't know, uh, like 3D animations than- than I do. Maybe, uh, maybe this other person prefor- prefers things that look a little bit more baroque. Like, there's all these notions of t- taste and sophistication that you have to decide between when you're- when you're designing your post-training mix, and so that matters as well. So long story short, I think there's all these different factors, and certainly the data is a big part of it, but it's also, like, what is, like, what is the objective function that you're trying to optimize your model towards?
3. LRLenny Rachitsky
  That is so interesting. Like, the taste will... The taste of the person leading this work will inform what data they ask for, what data they feed it.
4. ECEdwin Chen
  Yeah.
5. LRLenny Rachitsky
  But it just... It's wild. It shows the value of great data. Anthropic got so much growth and win from essentially, uh, better data.
6. ECEdwin Chen
  Yeah. Yeah, exactly.
7. LRLenny Rachitsky
  And I could see why companies like yours are growing so fast. There's just so much, and that's just one vertical, that's just coding, and then there's probably a similar area for writing. Uh, I love that it's- it's interesting that AI, you know, it feels like this artificial computer binary thing, but it's, like, taste, human judgment is still such a key factor in these things being successful.
8. ECEdwin Chen
  Yep, yep. Yep, exactly. Like, again, going back to the example I said earlier, certain companies, if you ask them, "What is a good poem?" They will simply robotically check off all of these, uh, all these instructions on a list. But again, I don't think that makes for good poetry. So certain frontier labs, the ones with more taste and sophistication, they will realize that it doesn't reduce to this fixed set of checkboxes and they'll consider all of these kind of implicit, very subtle qualities instead, and I think that's what makes them better- better at
17:37 – 21:54
Edwin’s skepticism toward benchmarks
1. ECEdwin Chen
  ............................
2. LRLenny Rachitsky
  You mentioned benchmarks. This is something a lot of people worry about, is there's all these models that are always... Like, basically, it feels like every model is, uh, better than humans at kind of every STEM field at this point. Uh, but to a regular person, it doesn't feel like these models are getting that much smarter constantly. What's your just sense of how much you trust benchmarks and just how correlated those are with actual AI advancements?
3. ECEdwin Chen
  Yeah. So I don't trust the benchmarks at all, and I think that's for two reasons. So one is, I think a lot of people don't realize, even researchers within the community, they don't realize that the benchmarks themselves are often honestly just wrong. Like, they have wrong answers. They're full of all this, uh, kind of messiness. And people trust them. Like, for the, for the popular ones, um, people have maybe realized this to some extent, but the vast majority just have all these flaws that people don't realize. So that's one part of it. And the other part of it is, these benchmarks, at the end of the day, they are often... They often have well-defined objective answers that make them very easy for models to hill climb on, in a way that's very, very different from the messiness and ambiguity of the real world. Like, I, like, I think one thing they often say is that it's kind of crazy that these models can win IMO gold medals, but they still have trouble parsing PDFs and that's because, yeah, even though IMO gold medals seem hard to the average person, yeah, like, they are hard at the end of the day. But they have this notion of objectivity that, okay, yeah, parsing a PDF sometimes doesn't, doesn't have, and so it's easier for a model for, for the frontier labs to hill climb on all these than to solve all the, all these mis- messy ambiguous problems in the real world. So I think there's a lack of direct correlation there.
4. LRLenny Rachitsky
  It's so interesting the way you described it as, uh, hitting these benchmarks is kind of like a marketing piece when you launch. Say Gemini 3 just launched and it's like, "Cool, number one at all these benchmarks." Is that, is what happens? They just kind of train their models to get good at these very specific things?
5. ECEdwin Chen
  Yeah, so there's, uh, again, maybe two parts to this. So one is, sometimes, yeah, these benchmarks, they accidentally leak in certain ways or the frontier labs will tweak the way they evaluate their models on these benchmarks. Like, they'll tweak their system prompt or they'll tweak the number of times they, they run a model and so on and so on, in a way that games these benchmarks. The other part of it though is, it's like by optimizing for the benchmark instead of optimizing for the real world, you will just naturally climb on the benchmark and, yeah, it's basically another form of gaming it.
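A worked sketch of why "the number of times they run a model" can move a benchmark score, assuming (this is an illustration, not a claim about any lab's actual evaluation setup) that attempts are independent and that a question counts as solved if any attempt is correct. The function name `best_of_k_rate` and the 30% single-attempt rate are hypothetical.

```python
def best_of_k_rate(p_single: float, k: int) -> float:
    """Chance that at least one of k independent attempts is correct."""
    # Probability all k attempts fail is (1 - p)^k; take the complement.
    return 1.0 - (1.0 - p_single) ** k


# Same hypothetical model, same questions, different number of attempts.
for k in (1, 4, 16):
    print(f"attempts={k:2d}  reported solve rate={best_of_k_rate(0.30, k):.3f}")
```

Under these assumptions the underlying model never changes, yet the reported number rises with every extra attempt, so two scores are only comparable when the evaluation protocol is the same.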
6. LRLenny Rachitsky
  Knowing that, with that in mind, how do you kind of get a sense of if we're heading in a, towards AGI? How do you measure progress?
7. ECEdwin Chen
  Yes. So the way we really care about measuring model progress is by running all these human evaluations. So for example, what we do is, yeah, we will take our human annotators and we'll ask them, "Okay, go have a conversation with the model." And maybe you're having this small conversation with the model across all of these different topics. So okay, you are a Nobel Prize winning physicist, so you go have a conversation about pushing the frontier of your own research. You are a teacher and you're trying to create lesson plans for your students, so go talk to the model about these things. Or you are a, uh, yeah, you're, you're a coder and you're working at one of these big tech companies and you have these problems every day, so go talk to the model and see how much it helps you. And because our Surgers, our annotators, they are experts at the top of their fields and they are not just skimming your responses, they're actually working through the responses deeply themselves. They are, yeah, they're going to evaluate the code that it writes. They're gonna double-check the physics equations that it writes. They're going to evaluate the models in very, very deep ways. They're going to pay attention to accuracy and instruction following and all these things that casual users don't when you suddenly get a pop-up on your G- ChatGPT response asking you to compare these two different responses. Like, uh, people like that, they're not evaluating the models deeply. They're just vibing and picking whatever response looks flashiest. Our annotators are looking closely at responses and evaluating them for all of these different dimensions, and so I think that's a much better approach than, uh, than, than these benchmarks or kind of these random online A/B tests.
8. LRLenny Rachitsky
  Again, I love just how central humans continue to be (laughs) in all this work, that we're not totally done
21:54 – 28:33
AGI timelines and industry trends
1. LRLenny Rachitsky
  yet. Is there gonna be a point where we don't need these people anymore, that AI is so smart that, "Okay, we're good. We got everyth- everything out of your heads"?
2. ECEdwin Chen
  Yeah, I think that will not happen until we reach AGI. Like, it's almost like by definition, if we haven't reached AGI yet, then there's more for the models to learn from, and so yeah, I don't think that's gonna happen anytime soon.
3. LRLenny Rachitsky
  Okay, cool. So (laughs) more reason to stress about AGI. (laughs) We don't-
4. ECEdwin Chen
  Yeah.
5. LRLenny Rachitsky
  We don't need these folks anymore. What's your, uh... I can't not ask. Just, and as people that work closely with this stuff, uh, I'm always just curious, what's your AGI timelines? How far do you think we are from this? Do you think we're in, like, a couple years or is it, like, decades?
6. ECEdwin Chen
  So I'm certainly on the longer time horizon front. Like, I think people don't realize that there's a big difference between moving from 80% performance to 90% performance to 99% performance to 99.9% performance and so on and so on. And so, like, in my head, uh, I probably bet that within the next one or two years, yeah, the models are going to automate 80% of, you know, the average L6 software engineer's job. But it's gonna take another few years to move to 90% and another few years to 99% and so on and so on. So I think we're closer to a decade or decades away, um, than, than now, folks.
7. LRLenny Rachitsky
  You have this hot take that a lot of these labs are kind of pushing AGI in the wrong direction, uh, and this is based on your work at, at Twitter and Google and Facebook. Can you just talk about that?
8. ECEdwin Chen
  I'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, all these big grand questions, we are optimizing for AI slop instead. Like, we're basically teaching our models to chase dopamine instead of truth, and I think this relates to what we were talking about regarding these, uh, these benchmarks. So let me, let me give you a couple examples. So right now, the industry is plagued by these terrible leaderboards like LMArena. It's this popular online leaderboard where random people from around the world vote on which AI response is better. But the thing is, like I was saying earlier, they're not carefully reading or fact-checking. They're skimming these responses for two seconds and picking whatever looks flashiest. So a model can hallucinate everything. It can completely hallucinate, but it will look impressive because it has crazy emojis and bolding and Markdown headers and all these superficial things that don't matter at all, but they catch your attention, and these LMArena users love it. It's literally optimizing your models for the types of people who buy tabloids at the grocery store. Like, we've seen this in every day ourselves. The easiest way to climb LMArena? It's adding crazy bolding, it's doubling the number of emojis, it's tripling the length of your model responses, even if your model starts hallucinating and getting the answer completely wrong. And the problem is, again, because all of these frontier labs, they kind of have to pay attention to PR, because their sales team, when they're trying to sell to all these enterprise customers, those enterprise customers will say, "Oh, well, but your model's only number five on LMArena, so why should I buy it?" They have to, in some sense, pay attention to the, these leaderboards. 
And so what researchers are all telling us is, they just say, "The only way I'm gonna get promoted at the end of the year is if I climb this leaderboard, even though I know that climbing is probably going to make my model worse in accuracy and instruction following." So I think there's all these negative incentives that are pushing, pushing work in, in the wrong direction. S- I'm also worried about this trend towards optimizing AI for engagement. Like, I used to work on social media, and every time we optimized for engagement, terrible things happened. You'd get clickbait and pictures of bikinis and Bigfoot and horrifying s- skin diseases just filling your feeds. And I think I worry that the same thing's happening with AI. Like, if you think about all the sycophancy issues with ChatGPT. "Oh, you're absolutely right. What an amazing question." Like, the easiest way to hook users is to tell them how amazing th- they are. And so these models, they constantly tell you, "You're a genius." They'll feed into delusions and conspiracy theories. They'll pull you down these rabbit holes because Silicon Valley loves maximizing time spent and just increasing number of conversations you're having with it. And so, yeah, companies are spending all their time hacking these leaderboards and benchmarks, and the scores are going up, but I think it actually masks that the models with the best scores, they are often the worst or just have all these fundamental failures. So I, I think I'm really worried that all of these negative incentives are pushing AGI in the wrong direction.
9. LRLenny Rachitsky
  So what I'm hearing is, uh, AGI is being slowed down by these, basically the wrong objective function, these labs paying attention to the wrong, basically, benchmarks and evals.
10. ECEdwin Chen
  Yep.
11. LRLenny Rachitsky
  Is... I know you probably can't play favorites since you work with all the labs. Is there anyone doing better at this and maybe kind of realizing this is the wrong direction?
12. ECEdwin Chen
  I would say I've always been very, very impressed by Anthropic. Like, I think Anthropic takes a very principled view about what they do and don't care about, and how they want their models to behave in a way that feels a lot more, a lot more principled to me.
13. LRLenny Rachitsky
  Interesting. Are there any other mistakes, big mistakes you think labs are making just that are kind of slowing things down or heading in the wrong direction? What we've heard just, uh, you know, chasing benchmarks, this, uh, engagement focus. Is there anything else you're seeing of just like, "Okay, we should... We got to work on this because it'll, it'll speed everything up"?
14. ECEdwin Chen
  I mean, I think there is a question of what products they're building and whether those products themselves are something that kind of help or hurt humanity. Like, I, I think a lot about Sora, (laughs) and-
15. LRLenny Rachitsky
  (laughs) I was thinking that since you're imagining-
16. ECEdwin Chen
  ... what, what it... Yeah, what, when what it entails. And so it's like, it's kind of interesting, it's like, which companies would build Sora and which wouldn't? And I think that answer to that qu- I mean, I don't know what the answer is myself. I, I have an idea in my head, but I think the answer to that question maybe reveals certain things about what kinds of AI models those companies want to build and what direction and what future they, they want to, want to achieve. Um, yeah, so, so I think about that a lot.
17. LRLenny Rachitsky
  The steel man argument there is, you know, it's like fun, people want it, it's an, uh, it- it'll help them generate revenue to grow this thing and build better models. Uh, it'll train data in an interesting way. It's also just like, you know, really fun.
18. ECEdwin Chen
  Yeah, it, it... (sighs) I think it's almost like, do you care about how you get there? And in the same way, so, so I made this tabloid analogy earlier, but-
19. LRLenny Rachitsky
  Mm-hmm.
20. ECEdwin Chen
  ... like, would you sell tabloids in order to fund, I don't know, some, some other newspaper? Like, sure, like, in, in some sense, uh, if you don't care about the path, then you will just do whatever it takes. But it's possible that it has negative consequences in and of itself that will harm the long-term direction of what you're trying to achieve, and maybe it will distract you from, from all the more important things. So yeah, I, I think the, the path you take matters a lot as well.
28:33 – 33:07
The Silicon Valley machine
2. LRLenny Rachitsky
  Along these lines, we talked a bunch about this, of just Silicon Valley and kind of the, the downsides of raising a lot of money, being in the, the echo chamber. What do you call it? The Silicon Valley machine. You talk about how, uh, s- it's hard to build important companies in this way, and that you might actually be much more successful if you're not going down the VC path. Can you just talk about what you've seen in your experience and your advice essentially to founders? 'Cause they're always hearing, you know, raise money from fancy VCs, move to Silicon Valley. What's kind of the, the countertake?
3. ECEdwin Chen
  Yes. So I've always really hated a lot of the Silicon Valley mantras. The standard playbook is to get product market fit by pivoting every two weeks, and to chase growth and chase engagement with all of these dark patterns, and to blitzscale by hiring as fast as possible. And I've always disagreed. So yeah, I, I would say don't pivot, don't, don't blitzscale. Don't hire that Stanford grad who simply wants to add a hot company to their resume. Just build the one thing only you could build, the thing that wouldn't exist without the insight and expertise that only you have. And you see these buyable companies everywhere now. Some founder who was doing crypto in 2020 and then pivoted to NFTs in 2022 and now they're an AI company. There's no consistency. There's no mission. They're just chasing valuations. And I've always hated this because Silicon Valley loves to scorn Wall Street for focusing on money. But honestly, most of Silicon Valley is chasing the same thing. And so we stayed focused on our mission from day one, pushing that frontier of high-quality complex data. And I've always thought that because I think startups, I have this very romantic notion of startups. Like, startups are supposed to be about taking big risks to build something that you really believe in. But if you're constantly pivoting, you're not taking any risks. You're just trying to make a quick buck. And if you fail because the market isn't ready yet, I actually think that's way better. At least you took a swing at something deep and novel and hard instead of pivoting into another LLM wrapper company. See, I like, I, I think the only way you build something that matters and that's going to change the world is if you find a big idea you believe in and you say no to everything else. So you don't keep on pivoting when it gets hard. You don't hire a team of 10 product managers because that's what every other cookie cutter startup does.
You just keep building that one company that wouldn't exist without you. And I, I think there are a lot of people in Silicon Valley now who are sick of all the grift, who want to work on big things that matter with people who actually care. And I'm, I'm hoping that, that will be the future of how we, how we pick to go with technology.
4. LRLenny Rachitsky
  I'm actually working on a post right now with, uh, Terrence Rohan, this is a VC that I really like to work with, and we interviewed five people who picked really, uh, successful generational companies early and joined them as really early employees. Like, they joined OpenAI before anyone thought it was awesome, Stripe before anyone knew it was awesome. And so we're looking for patterns of how people find these generational companies before anyone else. And there's, uh, it, it aligns exactly with what you just described, which is, uh, ambition. They have, uh, wild ambition with what they want to achieve. They're not, as you said, just kind of looking around for product market fit no matter what it ends up being. Uh, and so I love that what you described very much aligns with what we're seeing there.
5. ECEdwin Chen
  Yep, yep. Yeah, I absolutely think that you have to have huge ambitions, and you have to have a huge belief in your idea that it's going to change the world. And you have to be willing to double down and keep on doing whatever it takes to, to make it happen.
6. LRLenny Rachitsky
  Mm-hmm. I, I love how counter your, uh, narrative is to so many other things people hear, and so I love that we're doing this, I love that we're sharing this story. Today's episode is brought to you by Coda. I personally use Coda every single day to manage my podcast and also to manage my community. It's where I put the questions that I plan to ask every guest that's coming on the podcast, it's where I put my community resources. It's how I manage my workflows. Here's how Coda can help you. Imagine starting a project at work and your vision is clear, you know exactly who's doing what and where to find the data that you need to do your part. In fact, you don't have to waste time searching for anything, because everything your team needs, from project trackers and OKRs, to documents and spreadsheets, lives in one tab, all in Coda. With Coda's collaborative all-in-one workspace, you get the flexibility of docs, the structure of spreadsheets, the power of applications, and the intelligence of AI, all in one easy-to-organize tab. Like I mentioned earlier, I use Coda every single day, and more than 50,000 teams trust Coda to keep them more aligned and focused. If you're a startup team looking to increase alignment and agility, Coda can help you move from planning to execution in record time. To try it for yourself, go to coda.io/lenny today and get six months free of the team plan for startups. That's coda.io/lenny to get started for free and get six months of the team plan. Coda.io/lenny.
33:07 – 39:37
Reinforcement learning and future AI training
1. LRLenny Rachitsky
  Slightly different direction, but something else that was maybe a, a counter-narrative. Um, I imagine you watched the Dwarkesh and Richard Sutton podcast episode. And even if you didn't, there's a, they basically had this conversation with Richard Sutton. He was a, a famous AI researcher, had this whole bitter, the bitter lesson, uh, meme, and he talked about how LLMs almost are kind of a dead end, and he thinks we're gonna really plateau around LLMs because of the way they learn. Uh, what's your take there? Do you think LLMs will get us to AGI or beyond, or do you think there's gonna be something new or a big breakthrough that needs to get us there?
2. ECEdwin Chen
  I'm in a camp where I do believe that something new will be needed. Like, the way I think about it is, when I think about training AI, I take a very, I don't know if I would say biological point of view, but I believe that in the same way that there's a million different ways that humans learn, we need to build models that can mimic all of those ways as well. And maybe it'll have a different distribution of the focuses that they have. You know, I know they'll be different for humans, so maybe it'll have a different distribution. But we want to be able to mimic the learning abilities of humans and make sure that we have the algorithms and the data for, for models to learn in the same way. And so to the extent that LLMs have different ways of learning from humans, then, uh, then yeah, I, I think something, something new will be needed.
3. LRLenny Rachitsky
  This connects to, um, reinforcement learning. That's something that you're, you're big on and something I'm hearing more and more is just becoming a big deal in the world of post-training. Can you just help people understand what is reinforcement learning and reinforcement learning environments, and why they're so, they're gonna be more and more important in the future?
4. ECEdwin Chen
  Reinforcement learning is essentially training your model to reach a certain reward. And let me explain what an RL environment is. An RL environment is essentially a simulation of the real world. So think of it like building a video game with a fully fleshed-out universe. Every character has a real story, every business has tools and data you can call, and you have all these different entities interacting with each other. So for example, we might build a world where you have a startup with Gmail messages and Slack threads and Jira tickets and GitHub PRs and a whole code base. And then suddenly AWS goes down and Slack goes down. And so okay, model, w- what do you do? Like, the model needs to figure out, figure it out. So we give the models tasks in these environments, we design interesting challenges for them, and then we run them to see how they perform. And then we teach them, we give them these rewards when they're doing a good job or a bad job. And I think one of the interesting things is that these environments really showcase where models are weak at end-to-end tasks in the real world. You have all these models that seem really smart on isolated benchmarks. Like, they're good at single step tool calling, they're good at single step instruction following. But suddenly you dump them into these messy worlds where you have confusing Slack messages and tools they've never seen before, and they need to perform right actions and modify the databases and interact over longer time horizons, where what they do in step one affects what they do in step 50. And that's very, very different from these kind of academic single step environments that they've been in before. And so the model just fails catastrophically in all these crazy ways. So I think these RL environments are going to be really interesting playgrounds for them also to learn from, that will essentially be simulations and mimics of the real world.
And so they'll hopefully get better and better at, at real tasks, uh, compared to all these contrived environments.
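Editor's sketch: the outage scenario described above can be pictured as a tiny environment loop. This is a toy illustration only, not Surge's tooling; the class name OutageEnv, the action format, and the all-or-nothing reward are assumptions made up for this example.

```python
class OutageEnv:
    """Toy RL-style environment (hypothetical): two services are down and
    the agent is rewarded only once both are back up."""

    def __init__(self):
        self.reset()

    def reset(self):
        # False means the service is down at the start of the episode.
        self.services = {"aws": False, "slack": False}
        self.steps_taken = 0
        return dict(self.services)

    def step(self, action):
        """action is a (kind, target) tuple, e.g. ("restart", "aws").
        Returns (observation, reward, done)."""
        self.steps_taken += 1
        kind, target = action
        if kind == "restart" and target in self.services:
            self.services[target] = True
        done = all(self.services.values())
        reward = 1.0 if done else 0.0
        return dict(self.services), reward, done


env = OutageEnv()
first = env.step(("restart", "aws"))     # one service still down
second = env.step(("restart", "slack"))  # everything restored
```

A real environment of the kind described would expose messy tools (mail, chat, tickets, a code base) rather than a single restart action; the point here is only the loop of act, observe, and get rewarded.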
5. LRLenny Rachitsky
  So, what I'm trying to imagine what this looks like, essentially it's like a virtual machine with, I don't know, a browser or spreadsheet or something in it, with, uh, like, I don't know, uh, surge.com. Is that, is that your website, surge.com? Let's make sure we get that right.
6. ECEdwin Chen
  (laughs) So we are, we are actually surgehq.ai.
7. LRLenny Rachitsky
  Surgehq.ai, check it out. Uh, we're hiring here. (laughs)
8. ECEdwin Chen
  (laughs) Yeah.
9. LRLenny Rachitsky
  I imagine. Yes. Okay. So, uh, so it's like, well, here's surgehq.ai. Uh, your job, here's your job as an agent, let's say, is to make sure it stays up. And then all of a sudden, it goes down, and the objective function is, uh, figure out why. Is that, is that an example?
10. ECEdwin Chen
  Yeah. So the objective function might be, um... Or the goal of the task might be-
11. LRLenny Rachitsky
  The goal.
12. ECEdwin Chen
  ... "Okay, go figure out why and fix it."
13. LRLenny Rachitsky
  And fix it.
14. ECEdwin Chen
  And so the objective function might be, it might be passing a series of unit tests. It might be writing a document, like maybe a retro containing certain information that matches exactly what happened. Uh, there's, there's all these, like, different rewards that we might give it that determine whether or not it's succeeding, and so the model is, we're basically teaching the models to achieve that reward.
15. LRLenny Rachitsky
  So essentially, it's, like, running. It's off and running. Here's your goal. Uh, figure out why the site went down and fix it, and it just starts trying stuff, with using every- all the intelligence it's got. It makes mistakes. You kind of help it along the way, reward it if it's doing the right sort of thing. And so what you're describing here is this is, uh, where model... This is the next phase of models becoming smarter, more RL environments focused on very specific tasks that are, uh, economically valuable, I imagine.
16. ECEdwin Chen
  Yeah, yeah. So just in the same way that there were all these different methods for models learning in the past, like originally we had SFT and RLHF, and then we had rubrics and verifiers. This is the next stage, and it's not the case that the previous methods are obsolete. This is, again, just a different form of learning that complements all the previous types. So it's just like a different skill that models learn how to do.
17. LRLenny Rachitsky
  And so in this case, it's less, um, some physics PhD sitting around, uh, talking to a model, correcting it, giving it evals of here's what the correct answer is, creating rubrics and things like that. More it's like this person now designing an environment. So another example I've heard is, like, a financial analyst, just like, "Here's an Excel spreadsheet. Here's your goal. Figure out our profit and loss," or whatever. Uh, and so this expert now is, instead of just sitting around writing rubrics, they're designing this RL environment.
18. ECEdwin Chen
  Yeah, exactly. So that financial analyst might create a spreadsheet. They may create certain tools that the model needs to call in order to help fill out a spreadsheet. Like, it might be, okay, the, the model needs to access Bloomberg Terminal and needs to learn how to use it, and it needs to learn how to use this calculator, and it needs to learn how to perform this calculation. So it all, it has all these tools that it has access to. And then the reward might be, okay, it's like maybe I will download that spreadsheet and I'm going to see does cell B22 contain the correct profit to- profit and loss number. Um, or does tab number two contain this piece of information?
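Editor's sketch: the "does cell B22 contain the correct number" reward described above could look roughly like this. The workbook is modeled as plain nested dictionaries rather than a real spreadsheet file, and CellCheck, spreadsheet_reward, the sheet names, and the figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellCheck:
    sheet: str
    cell: str
    expected: float
    tolerance: float = 0.01


def spreadsheet_reward(workbook, checks):
    """Return the fraction of cell checks that pass, from 0.0 to 1.0.

    workbook maps sheet name -> {cell reference -> value}.
    """
    if not checks:
        return 0.0
    passed = 0
    for check in checks:
        value = workbook.get(check.sheet, {}).get(check.cell)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and abs(value - check.expected) <= check.tolerance:
            passed += 1
    return passed / len(checks)


# Made-up output produced by a model run, plus the analyst's checks.
workbook = {"P&L": {"B22": 1250000.0}, "Summary": {"A1": 0.0}}
checks = [
    CellCheck("P&L", "B22", 1250000.0),  # correct profit and loss number
    CellCheck("Summary", "A1", 42.0),    # second tab is wrong in this run
]
reward = spreadsheet_reward(workbook, checks)
```

Returning a fraction instead of a strict pass or fail is a design choice for this sketch; the conversation does not say how partial credit is handled.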
39:37 – 41:11
Understanding model trajectories
1. LRLenny Rachitsky
  And this, what's interesting is this is a lot closer to how humans learn. We just try stuff, uh, figure out what's working and what's not. You, um, y- you talk about how trajectories are really important to this. It's not just here's the goal and here's the end. It's, like, every step along the way. Can you just talk about what trajectories are and why that's important to this?
2. ECEdwin Chen
  I think one of the things that people don't realize is that sometimes even though the model reaches the correct answer, it does so in all these crazy ways. So it may have in the intermediate trajectory, it may have tried 50 different times and failed, but eventually it just kind of, like, randomly lands on the correct number. Or maybe it, um, it, it... Sometimes it just does things very, very inefficiently or it almost reward hacks a way to get at the correct answer. And so I think paying attention to trajectory is actually really, really important. And I think it's also really important because some of these trajectories can be very, very long. And so if all you're doing is checking whether or not the model reaches the final answer, it's like there's all this information about how the model behaved in the intermediate steps that's missing. Like, sometimes you want models to get to the correct answer by reflecting on what it did. Sometimes you want it to get to the correct answer by just one-shotting it. And if you ignore all of that, it's just, it's just like teaching, teaching the, uh... It's just missing a lot of the information that you could be teaching the model to, to do.
3. LRLenny Rachitsky
  I love that. Like, it just, yeah, it tries a bunch of stuff and eventually gets it right. You don't want it to learn this is the way to get there.
4. ECEdwin Chen
  Yeah.
5. LRLenny Rachitsky
  There's often a much more efficient way of doing it.
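Editor's sketch: one way to picture the difference between checking only the final answer and looking at the trajectory. The Step type, the penalty size, and the floor value are assumptions chosen for illustration, not a method described in the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str
    succeeded: bool


def outcome_reward(final_correct):
    """Outcome-only reward: ignores how the answer was reached."""
    return 1.0 if final_correct else 0.0


def trajectory_reward(steps, final_correct, failure_penalty=0.01, floor=0.1):
    """Trajectory-aware reward: a correct answer still earns credit,
    but every failed intermediate attempt reduces it (never below floor)."""
    if not final_correct:
        return 0.0
    failures = sum(1 for step in steps if not step.succeeded)
    return max(floor, 1.0 - failure_penalty * failures)


# Two made-up runs that both end on the right answer.
efficient = [Step("read logs", True), Step("apply fix", True)]
flailing = [Step("guess", False)] * 50 + [Step("guess", True)]

same_outcome = outcome_reward(True)
efficient_score = trajectory_reward(efficient, True)
flailing_score = trajectory_reward(flailing, True)
```

Under the outcome-only reward both runs look identical; the trajectory-aware version separates the run that got there cleanly from the one that landed on the answer after 50 failed tries.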
41:11 – 42:55
How models have advanced and will continue to advance
1. LRLenny Rachitsky
  You mentioned all the, kind of the steps we've taken along the journey of getting... of helping AI models get smarter. Since you've been so close to this for so long, I, I, I think this is gonna be really helpful for people. What's kind of, like, been the steps along the way from the first, uh, of post-training that has most helped models advance? Like, where do you evals fit in, the RL environments? Just, like, what's been, like, the steps in, and now we're heading towards RL environments?
2. ECEdwin Chen
  Originally, the way models started getting post-trained was purely through SFT. And-
3. LRLenny Rachitsky
  And what does that stand for?
4. ECEdwin Chen
  So SFT is, stands for supervised fine-tuning. And it's a lot like... So, so again, I think often in terms of these human analogies, and so SFT is a lot like mimicking a master and copying what they do. And then RLHF became very dominant and the analogy there would be like sometimes you learn by writing 55 different essays and someone telling you which one they like the most. And then I think over the past year or so, rubrics and verifiers have, uh, have become very important. And rubrics and verifiers are, like, learning by being graded and getting detailed feedback on where you went wrong.
5. LRLenny Rachitsky
  And those are evals, th- another word for that.
6. ECEdwin Chen
  Yeah, yeah. So, uh, I think evals often covers two terms. One is you are using the evaluations for training because you're evaluating whether or not the model did a good job, and when it does do a good job, you're rewarding it. And then there's this other notion of evals where you're trying to measure a model's progress. Like, okay, yeah, I have five different candidate checkpoints, and I want to pick the one that's best in order to release it to the public. So we're going to run all these evals on these five different checkpoints in order to decide which one, which one is best.
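Editor's sketch: the second sense of evals, picking a release candidate among checkpoints, can be illustrated with a simple average over eval scores. The checkpoint names, eval names, scores, and the choice of an unweighted mean are all hypothetical.

```python
from statistics import mean


def pick_best_checkpoint(scores):
    """scores maps checkpoint name -> {eval name -> score}.
    Returns the checkpoint with the highest unweighted mean score."""
    if not scores:
        raise ValueError("no checkpoints to compare")
    return max(scores, key=lambda name: mean(list(scores[name].values())))


# Made-up results for three candidate checkpoints on two evals.
eval_scores = {
    "ckpt-a": {"instruction_following": 0.8, "accuracy": 0.6},
    "ckpt-b": {"instruction_following": 0.9, "accuracy": 0.7},
    "ckpt-c": {"instruction_following": 0.5, "accuracy": 0.9},
}
best = pick_best_checkpoint(eval_scores)
```

In practice a lab would presumably weight evals differently and look at failure cases rather than one number, which is the concern raised earlier about leaderboard scores.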
7. LRLenny Rachitsky
  Awesome.
8. ECEdwin Chen
  Yeah. And yeah, now, uh, now, now we have RL environments, so that's kind of like the hot new thing.
42:55 – 44:39
Adapting to industry needs
2. LRLenny Rachitsky
  Awesome. So what I love about this business you're in is just there's always something new. There's always this like, okay, uh, we're getting so good at just all this beautiful data for companies, and now they need something completely different. Now we're setting up all these virtual machines for them in all these different use cases.
3. ECEdwin Chen
  Yep.
4. LRLenny Rachitsky
  And it feels like that's a big part of this industry you're in, is just ada- adapting to what labs are asking for.
5. ECEdwin Chen
  Yeah, yeah. So I mean, I really do think that we are gonna need to build a suite of products that reflect the million different ways that humans learn. And like, like for example, think about becoming a great writer. You don't become great by memorizing a bunch of grammar rules. You become great by reading great books, and you practice writing, and you get feedback from your teachers and from the people who buy your books in the bookstore and leave reviews. And you notice what works and what doesn't, and you develop taste by being exposed to all these masterpieces and also just terrible writing. So you learn through this endless cycle of practice and reflection and each type of learning that you have. Again, like these are all very, very different methods of learning to become a great writer. So just in the same way that there's a thousand different ways that a great writer becomes great, I think there's going to be a thousand different ways that AI models need, need to learn.
6. LRLenny Rachitsky
  It's so interesting this just ends up being like just like humans in so many ways. It makes sense because in a sense, neural networks, deep learning is modeled after how humans have learned and how our brains operate. But it's interesting just to make them smarter. It's how do we come closer to how humans learn more and more.
7. ECEdwin Chen
  Yeah, it's almost like maybe the end goal is just throwing you in, into the environment and-
8. LRLenny Rachitsky
  Yeah.
9. ECEdwin Chen
  ...just seeing how you evolve. Um, but within that, within that evolution, there's all these different sub-learning mechanisms.
10. LRLenny Rachitsky
  Yeah, which is kind of what we're doing now, so that's really interesting. This might be the last step up until we hit AGI.
44:39 – 48:07
Surge’s research approach
1. LRLenny Rachitsky
  Along these lines, something that's really unique to Surge that, uh, I learned is that you guys have your own research team, which I think is pretty rare. Um, talk about just why that's something you guys have invested in and what has come out of that investment.
2. ECEdwin Chen
  Yeah. So I think that stems from my own background. Like, my own background is as a researcher, and so I've always cared fundamentally about pushing the industry and pushing the research community and not just about revenue. And so I think what our research team does is a couple different things. So we almost have two types of researchers at our company. One is our forward-deployed researchers who are often working hand-in-hand with our customers to help them understand their models. So we will work very closely with our, with our customers to help them understand, okay, this is where your model is today. This is where you're lagging behind all the competitors. These are some ways that you could be improving in the future, given, given your goals. And we're gonna design these datasets, these evaluation methods, these training techniques, to make your models better. So it is like very, very notion... Uh, it's very, very, um, uh, kind of collaborative notion of working with our customers, like being researchers by themselves, just a little bit more focused on the data side, and we're here hand-in-hand with them to, to do whatever it takes to, to make them the best. And then we also have our internal researchers. So our internal researchers are focused on slightly different things. So they are focused on building better benchmarks and better lead- leaderboards. So I've talked a lot about how I worry that the leaderboards and benchmarks out there today are steering models in the wrong direction, so then yeah, so the question is how do we, how do we fix that? (laughs) And so that's what our research team is focused on really, really heavily on... really foc- focused really heavily on right now. So they're working a lot on that. And they're also working on these other things like, okay, we need to train our own models to see what types of data performs, uh, performs the best. What types of people pre- perform the best. 
And so they are also working on all these, uh, kind of like training techniques and evaluation of our own datasets to improve, um, improve our, our data operations and the internal data products that we have that determine what, what makes something good quality.
3. LRLenny Rachitsky
  It's such a cool thing because I don't think... Like, basically the labs have researchers helping them advance AI. Uh, I imagine it's pretty rare for a company like yours to have researchers actually doing p- primary research on AI.
4. ECEdwin Chen
  Yeah, yeah. I think it's just because it's something I've fundamentally always cared about.
5. LRLenny Rachitsky
  Mm-hmm.
6. ECEdwin Chen
  Like, I often think about us more like a research lab than a startup because that is my goal. Like, like it's, it's kind of funny, but I've always said I would rather be Terence Tao than, than Warren Buffett. So that notion of creating research that pushes the frontier forward and not just getting some valuation, like that, that's always been what drives me.
7. LRLenny Rachitsky
  And it's worked out. That's the beautiful thing about this. You mentioned that you were hiring researchers. Is there anything there you wanna share, folks you're looking for?
8. ECEdwin Chen
  So we look for people who are just fundamentally interested in data all day. So types of people who could literally spend 10 hours digging through a dataset and playing around with models and thinking, okay, yeah, this is where I think the model's failing. This is a kind of a behavior you want the model to have instead. And just this aspect of being very, very hands-on and thinking about the, the qualitative aspects of models and not just the quantitative parts. So again, it's like this aspect of being hands-on with data and not just caring about these kind of abstract algorithms.
48:07 – 50:43
Predictions for the next few years in AI
2. LRLenny Rachitsky
  Awesome. I want to ask a couple broad AI kind of market questions. What else do you think is coming in the next couple years that people are maybe not thinking enough about or not expecting in terms of where AI is heading, what's gonna matter?
3. ECEdwin Chen
  I think one of the things that's going to happen in the next few years is that the models are actually going to become increasingly differentiated because of the personalities and behaviors that the different labs have and the kind of objective functions that they are optimizing their models for. Like, I think it's one thing I didn't appreciate a year or so ago. Like, a year or so ago, I thought that all of the AI models would essentially become very, very commoditized. They would all behave like each other and, sure, one of them might be slightly more intelligent in one way today, but, sure, the other ones would catch up in the next few months. I think over the past year, I've realized that the values that the companies have will shape the, the, the model. So, let, let me give an example. So I was asking Claude to help me draft an email the other day, and it went through 30 different versions. And after 30 minutes, yeah, I think it really crafted me the perfect email and I sent it. But then I realized I spent 30 minutes doing something that didn't matter at all. Like, sure, now I got a perfect email, but I spent 30 minutes doing something I wouldn't have worried at all before, and this email probably didn't even move the needle on anything anyways. So I think there's a deep question here, which is, if you could choose the perfect model behavior, which model would you want? Do you want a model that says, "You're absolutely right. There are definitely 20 more ways to improve this email," and it continues for 50 more iterations and it sucks up all your time and engagement? Or do you want a model that's optimizing for your time and productivity and just says, "No. You need to stop. Your email's great. Just send it and move on with your day."
4. LRLenny Rachitsky
  (laughs)
5. ECEdwin Chen
  And again, like, again, just be- like...
6. LRLenny Rachitsky
  I love it.
7. ECEdwin Chen
  ... in the same way there's like a, kind of like a fork in the road between how you could choose how your model behaves for this question. It's like, for every other question that models have, the kind of behavior that you want will fundamentally affect it. It's almost like, in the same way that when Google builds a search engine, it's very, very different from how Facebook would build a search engine, which is very, very different from how Apple would build a search engine. Like, they all have their own principles and values and things that they're trying to achieve in the world that shape all the products that they're going to build. And in the same way, I think all the LLMs will start behaving very, very differently too.
8. LRLenny Rachitsky
  That is incredibly interesting.
50:43 – 52:55
What’s underhyped and overhyped in AI
1. LRLenny Rachitsky
  Uh, you already see that with Grok. It's got, like, a very different personality and a very different approach to answering questions. And so what I'm hearing is you're gonna see more of, of this differentiation.
2. ECEdwin Chen
  Yep.
3. LRLenny Rachitsky
  Kind of a- another question along these lines. What do you think is most under-hyped in AI that you think maybe people aren't talking enough about that is really cool, and what do you think is over-hyped?
4. ECEdwin Chen
  So I think one of the things that's under-hyped is the built-in products that all of the chatbots are going to start having. Like, I've always been a huge fan of Claude's Artifacts, and I think it just works really, really well. And actually, the other day, I don't know if it's a new feature or not, but I was asking it to help me create a, uh, like an email, and then it just cre-... So it didn't quite work because it, uh, didn't allow me to send email. But what it created instead was like a little, I don't know what you call it, like a little box where I could click on it and it would just text someone (laughs) this message. And I think that concept of taking artifacts to the next level where you just have these like mini apps, mini UIs within the chatbots themselves, I, I feel like people aren't talking enough about that. So I think that that's one under-hyped area. And in terms of over-hyped areas, I definitely think that vibe coding is overhyped. I think people don't realize how much it's going to make your systems unmaintainable in the long term if you simply dump this code into your codebases, even if it seems to work right now. So I, uh, kind of, yeah, kind of, kind of worry about future coding (laughs) , if this keeps on happening.
5. LRLenny Rachitsky
  These are amazing answers. On that, on that first, uh, point, there's something I actually asked. I had the chief product officer of Anthropic and OpenAI, Kevin Weil and Mike Krieger on the podcast, and I asked them just like, as a product team, like you have this giga-brain intelligence. How long do you even need product teams? H- uh, you think this is, this AI will just create the product for you. Here's what I want. Well, it's like, it's like the next level of vibe coding. It's just, just like tell it, "Here's what I want," and it's just building the product and evolving the product as you're using it. And it feels like that's what you're describing is where we might be heading.
6. ECEdwin Chen
  Yeah, yeah. I think there's a very, very powerful notion where it helps people just achieve their ideas in a, in a magical
52:55 – 1:02:18
The story of founding Surge AI
1. ECEdwin Chen
  way.
2. LRLenny Rachitsky
  Something we haven't gotten into that I think is really interesting is just the story of how you got to starting Surge. You had, uh, uh... You have a really unique background. I always think about these... Brian, Brian, uh, Armstrong, the founder of Coinbase, once ha- gave this talk that has really stuck with me where he kind of talked about how his very unique background allowed him to start Coinbase. He had like an economics background, he had cryptography experience, and then he was an engineer. And so it was like the perfect Venn diagram for starting Coinbase, and I feel like you have a very similar story with Surge. Talk about that, your background there, and how you led... how that led to Surge.
3. ECEdwin Chen
  Going way back, I was always fascinated by math and language when I was a kid. Like, I went to MIT because it's obviously one of the best places for math and CS, but also because of Noam Chomsky. My dream in school was actually to find some underlying theory connecting all these different fields. And then I became a researcher at Google and Facebook and Twitter, and I just kept running into the same problem over and over again. It was impossible to get, get the data that we needed to train our models. So I was always this huge believer in the need for high quality data, and then GPT-3 came out in 2020, and I realized that, yeah, if we want to take things to the next level and build models that could code and use tools and tell jokes and write poetry and solve the Riemann hypothesis and cure cancer, then yeah, we were going to need a completely new solution. Like, the thing that always drove me crazy when I was at all these companies was we had the full power of the human mind in front of us, and all the solutions out there were focused on really simple things like image labeling. So I wanted to build something focused on all these advanced complex use cases instead that would really help us build the next-generation models. So yeah, I think my background kind of across math and computer science and linguistics really, really informed what I always wanted to do. And so I started Surge a month later with, with our, with our one mission, to basically build the use cases that I thought were going to ne- be needed to push the frontier of AI.
4. LRLenny Rachitsky
  And you said a month later. A month later after what?
5. ECEdwin Chen
  After GPT-3 launched in '20.
6. LRLenny Rachitsky
  Oh, okay. Wow. Okay. (laughs) Yeah. A great decision.
7. ECEdwin Chen
  Yeah.
8. LRLenny Rachitsky
  What, uh, what just kind of drives you at this point, of, other than just the epic success you're having, what keeps you motivated to keep building this and, and, you know, building something in this space?
9. ECEdwin Chen
  I think I'm a scientist at heart. I always thought I was going to become this math or CS professor and work on trying to understand the universe and language and the nature of communication. Like, it's kind of funny, but I always had this fanciful dream where if aliens ever came to visit Earth and we need to figure out how to communicate with them, I wanted to be the one the government would call, and I'd use all this fancy math and computer science and linguistics to decipher it. So even today, what I love doing most is, every time a new model is released, we'll actually do a really deep dive into the model itself. I'll play around with it. I'll run evals. I'll compare where it's improved, where it's regressed. I'll create this really deep dive analysis that we send our customers. And it, it's actually kind of funny because a lot of times we will say it's from our data science team, but often it's actually just from me. And I think I could do this all day. Like, I have a very hard time being in meetings all day. I'm terrible at sales. I'm terrible at doing the typical CEO things that people expect you to do. But I love writing these analyses. I love jamming with our research team on what they're seeing. Sometimes I'll be, like, up until 3:00 AM just, just t- just talking on the phone with somebody on our research team and
10. NANarrator
  (laughs)
11. ECEdwin Chen
  ... tweaking a model. So I love that I still get to be really hands-on working on the data and the, and the science all day. And I think what drives me is that I want Surge to play this critical role in the future of AI, which I think is also the future of humanity. Like, we have these really unique perspectives on data and language and quality and how to measure all this and how to ensure it's all going on the right path, and I think we're uniquely unconstrained by all of these influences that can sometimes steer companies in a dir- negative direction. Like what I was saying earlier, we built Surge a lot more like a research lab than a typical startup. So we care about curiosity and long-term incentives and intellectual rigor, and we don't care as much about quarterly metrics and what's going to look good in a board deck. And so my goal is to take all these unique things about us as a company and use that to make sure that we're shaping AI in a way that's really beneficial for our species in the long term.
12. LRLenny Rachitsky
  What I'm realizing in this conversation is just how much influence you have and companies like yours have on where AI heads, the fact that you help labs understand where they have gaps and where they need to improve. And it's not just... You know, everyone looks at just, like, the heads of OpenAI, Anthropic, and a- all these companies as they're the ones ushering in AI, but what I'm hearing here is you, you have a lot of influence on where things head to.
13. ECEdwin Chen
  Yeah, I think there's this really powerful ecosystem where, honestly, people just don't know where models are headed and how they want to shape them yet and how they want humanity to kind of play a role in- in the future of all this. And so I think there's a lot of opportunity to just continue shaping this dis- discussion.
14. LRLenny Rachitsky
  Along that thread, I know you have a very strong thesis on just why this work matters to humanity and why this is so important. Talk about that.
15. ECEdwin Chen
  I'll get a bit philosophical here, but I think the question itself is a bit philosophical, so just bear with me. So the most straightforward way of thinking about what we do is we train and evaluate AI, but there's a deeper mission that I often think about, which is helping our customers think about their dream objective functions. Like, yeah, what kind of model do they want their model to be? And once we help them do that, we'll help them train their model to reach their North Star, and we'll help them measure their progress. But it's really hard because objective functions are really rich and complex. It's kind of like the difference between having a kid and asking them, "Okay, what test do you want to pass? Do you want to get a high score on the SAT and write a really good college essay?" Like, that's the simplistic version, versus, "What kind of person do you want them to grow up to be? Will you be happy if they're happy no matter what they do? Or are you hoping they'll go to a good school and be financially successful?" And again, if you take that notion, it's like, okay, how do you define happiness? How do you measure whether they're happy? How do you measure whether they're financially successful? Like, it's a lot harder than simply measuring whether or not you're getting a high score on the SAT. And what we're doing is we want to help our customers reach, again, their- their dream North Stars and figure out how to measure them. And so I- I get... I talked about this example of what you want models to do when you're asking them to write 50 different email iterations. Do you just continue them for 50 more or do you just say, "No, just- just move on with your day because this is perfect enough"? And the broader question is, are we building these systems that actually advance humanity? And if so, how do we build a sys- da- the datasets to train towards that and measure it? 
Are we optimizing for all these wrong things, just systems that suck up more and more of our time and make us lazier and lazier? And yeah, I think that's really relevant to what we do because it's very hard and difficult to measure and define whether something is genuinely advancing humanity. It's very easy to measure all these proxies instead, like clicks and likes. But I think that's why our work is so interesting. We want to work toward hard, important metrics that require the hardest types of data and not- not just the easy ones. So I think one of the things I often say is, "You- you are your objective function." So we want the rich, complex objective functions and not these simplistic proxies. And our job is to figure out how to get the data to match this. So yeah, we want data, we want metrics that measure whether AI is, like, making our life richer. We want to train our systems this way, and we want tools that make us more curious and more creative, not just lazier. And it's hard because, yeah, humans are kind of inherently lazy. So AI slop feeds are the easiest way to get engagement and make all your metrics go up. So I think this question about choosing the right objective functions and making sure that we're optimizing towards them and not just these easy proxies is really, really important for our future.
16. LRLenny Rachitsky
  Wow. I love how what you're sharing here gives you so much more appreciation of the nuances of... building AI, training AI, the work that you're doing. You know, from the outside, people could just look at Surge and companies in the space like, okay, they're just creating all this data, feeding it to AI, but clearly there's so much to this that, uh, people don't realize, and, uh, I love knowing that you're at the head of this, that someone like you is thinking through this so deeply. Maybe one more question. Is there something you wish you'd known before you started Surge? A lot of people start companies. They don't know what they're getting into. Is there something you wish you could tell your earlier self?
17. ECEdwin Chen
  Yeah, so I definitely wish I'd known that you could build a company by being heads down and doing great research, and simply building something amazing, and not by constantly tweeting and hyping and fundraising. It's kind of funny, but I never thought I wanted to start a company. Like, I loved doing research, and I was obviously always a huge fan of DeepMind because they were this amazing research company that got bought and still managed to keep on doing amazing science. But I always thought that it was... They were this magical IRL unicorn. So I thought if I started a company, I'd have to become a business person, looking at financials all day, and building, being in meetings all day, and doing all this stuff that sounded incredibly boring and I always hated. So I think it's crazy that didn't end up being true at all. Like, I'm still in the weeds in the data every day, and I love it.
18. LRLenny Rachitsky
  That's amazing.
19. ECEdwin Chen
  Like, I love that I get to do all these analyses and talk to researchers, and it's basically applied research where we're building all these amazing data systems that really push the frontier of AI. So yeah, I wish I'd known that you don't need to spend all your time fundraising. You don't need to constantly generate hype. You don't need to become someone you're not. You can actually build a successful company by simply building something so good that it cuts through all that noise, and I think if I'd known this was possible, I would have started even sooner. So I
20. LRLenny Rachitsky
  And
1:02:18 – 1:10:31
Lightning round and final thoughts
1. LRLenny Rachitsky
  that is such an amazing, uh, place to end. I feel like this is exactly what founders need to hear, and I think this conversation is going to inspire a lot of founders, uh, and especially a lot of founders that want to do things in a different way. Before we get to our very exciting lightning round, is there anything else you wanted to share? Anything else you want to leave us- listeners with? We covered a lot of ground. It's totally okay to say no as well.
2. ECEdwin Chen
  So I think the thing I would end with is, I think a lot of people think of data labeling as really simplistic work, like labeling cat photos and drawing bounding boxes around cars. And so I've actually always hated the word data labeling because it just paints this very simplistic picture when I think what we're doing is completely different. Like, I think a lot about what we're doing as a lot more like raising a child. You don't just feed a child information. You're teaching them values and creativity and what's beautiful and these infinite subtle things about what makes somebody a good person, and that's what we're doing for AI. So I... Yeah, I just often think about what we're doing as almost, like, the future of humanity or how, how we're raising humanity's children (laughs) . Uh, so I'll leave it at that.
3. LRLenny Rachitsky
  Wow. I love just how much philosophy there is in this whole conversation that I was not expecting. With that, Edwin, we've reached our very exciting lightning round. I've got five questions for you. Are you ready?
4. ECEdwin Chen
  Yep, let's go.
5. LRLenny Rachitsky
  (laughs) Here we go. What are two or three books that you find yourself recommending most to other people?
6. ECEdwin Chen
  Yeah. So three books I often recommend are, first, Story of Your Life by Ted Chiang. It's my all-time favorite short story, and it's about a linguist learning an alien language, and I basically re-read it every couple years.
7. LRLenny Rachitsky
  And that's what the Interstellar was about, is that... Is that what it was-
8. ECEdwin Chen
  Oh, yeah.
9. LRLenny Rachitsky
  ... the story?
10. ECEdwin Chen
  So there's a movie called Arrival-
11. LRLenny Rachitsky
  Arrival, okay.
12. ECEdwin Chen
  ... which is, which was based off the story-
13. LRLenny Rachitsky
  Yes, okay.
14. ECEdwin Chen
  ... which I, which I love as well.
15. LRLenny Rachitsky
  Great. Okay, keep going.
16. ECEdwin Chen
  And then second, Myth of Sis- Sisyphus by Camus. I actually can't really explain why I love this, but I always find the final chapter somehow really inspiring. And then third, Le Ton beau de Marot by Douglas Hofstadter. And so I think Gödel, Escher, Bach is his for- is his more famous book, but I've actually always loved this one better. It basically takes a single French poem and translates it 89 different ways and discusses all the motivations behind each translation. And so I've always loved the way it embodies this idea that translation isn't this robotic thing that you do. Instead, there's a million different ways to think about what makes a high-quality translation, which mimics a lot of ways I think about data and quality in LLMs.
17. LRLenny Rachitsky
  All these resonate so deeply with the way... With all the things we've been talking about, especially that first one, if that was your goal after school is, like, "I want to help translate, uh, alien language."
18. ECEdwin Chen
  (laughs)
19. LRLenny Rachitsky
  I'm not surprised you love that short story. Next question. Do you have a favorite recent movie or TV show you've really enjoyed?
20. ECEdwin Chen
  One of my new all-time favorite TV shows is something I found recently. It's called Travelers. It's basically about a group of travelers from the future who are sent back in time to prevent an apocalypse. Sorry, I just realized that was science fiction. And then I actually just rewatched Contact, which is all, one of my all-time favorite movies. So yeah, I think one of the things you'll notice about me is that, yeah, I love any kind of book or film that involves scientists deciphering alien communication.
21. LRLenny Rachitsky
  Hmm.
22. ECEdwin Chen
  Again, just this dream I always had as a kid.
23. LRLenny Rachitsky
  That's so funny. I love that (laughs) . Okay, hmm. Is there a product you've recently discovered that you really love?
24. ECEdwin Chen
  So it's funny, but I was in SF earlier this week, and I finally took a Waymo for the first time. Honestly, it was magical and it really felt like living in the future.
25. LRLenny Rachitsky
  Yeah, it's like the thing that you can... People hype it like crazy, but it always exceeds your expectations.
26. ECEdwin Chen
  Yeah, it deserves the hype. It was crazy.
27. LRLenny Rachitsky
  Yeah, it's absurd. It's like, holy moly. Like, if you're not in SF, you don't realize just how common these things are. They're just, like, all over the place, just driverless cars constantly going about, and when you, like, go to an event at the end, there's just, like, all these Waymos lined up picking people up.
28. ECEdwin Chen
  Yeah.
29. LRLenny Rachitsky
  Yeah, Waymo, good job. Good job over there. Uh, do you have a favorite life motto that you find yourself coming back to in work or in life?
30. ECEdwin Chen
  So I think I mentioned this idea that founders should build a company that only they could build, almost like it's this destiny that their entire life and experiences and interests shape them towards. And so I think that principle applies pretty broadly, not just to founders, but to people creating anything.

Episode duration: 1:10:31

Transcript of episode dduQeaqmpnI
