No Priors Ep. 124 | With SurgeAI Founder and CEO Edwin Chen
EVERY SPOKEN WORD
80 min read · 15,696 words
- 0:00 – 0:41
Edwin Chen Introduction
- Sarah Guo
(instrumental music plays) Hi, listeners. Welcome back to No Priors. Today, Elad and I are here with Edwin Chen, the founder and CEO of Surge, the bootstrapped human data startup that surpassed a billion in revenue last year and serves top-tier clients like Google, OpenAI, and Anthropic. We talk about what high-quality human data means, the role of humans as models become superhuman, benchmark hacking, why he believes in a diversity of frontier models, the Scale-Meta not-M&A deal, and why there's no ceiling on environment quality for RL, or the simulated worlds that labs want to train agents in. Edwin, thanks for joining us.
- Edwin Chen
Great, great seeing you guys today.
- Sarah Guo
Surge has been
- 0:41 – 2:28
Overview of SurgeAI
- Sarah Guo
really under the radar until just about now, um, can you give us a little bit of, uh, color on sort of the scale of the company and what the original founding thesis was?
- Edwin Chen
So we hit over a billion in revenue last year. We are kinda like the biggest human data player in this space, and we're about 100, a little over 100 people. And our original thesis was... We just really believed in the power of human data to advance AI, and we just had this really big focus from the start of making sure that we g- we had the highest quality data possible.
- Elad Gil
Can you give people context for how long you've been around, how you got going, et cetera? I think, uh, again, you all have accomplished an enormous amount in a short period of time, and I think, you know, you've been very quiet about some of the things you've been doing, so it'd be great to just get a little bit of history and, you know, when you started, how you got started, and how long you've been around.
- Edwin Chen
Uh, yeah, so we've been around for five years. I think we just hit our five-year anniversary, so we started in 2020. So before that... so I can give some context. Before that, I used to work at Google, Facebook, and Twitter, and one of the... Like, basically the reason we started Surge was... I just used to work on ML at a bunch of these big companies, and just the problem I kept running into over and over again was that it really was impossible getting the data that we needed to train our models. So it was just this big blocker that we faced over and over again, and there was just, like, so much more that we wanted to do. Like, uh, even just the basic things that we wanted to do, we struggled so hard to get the data. It was really just the- the big blocker. But then simultaneously, there were all these more futuristic things that we wanted to build. Like, if we thought of the next generation of AI systems, if we could barely get the data that we needed at the time to solve, like, just building a simple sentiment analysis classifier, we, if we could barely do that, then (laughs) like, how would we ever advance, advance beyond that? So that, that really was the, the biggest problem. I can go into more of that, but that, that, that was essentially what we faced.
- Elad Gil
Uh, and
- 2:28 – 7:59
Why SurgeAI Bootstrapped Instead of Raising Funds
- Elad Gil
then you guys are also known for having bootstrapped the company versus raising a lot of external venture money or things like that. Do you want to talk about that choice in terms of going profitable early and then scaling off of that?
- Edwin Chen
In terms of why we didn't raise, so I think (laughs), I mean, a big part of it was obviously just that we didn't need the money. I think we were very, very lucky to be profitable from, from the start. So we didn't need the money. It always felt weird to give up control, and, uh, like, one of the things I've always hated about Silicon Valley is that you see so many people raising for the sake of raising. Like, I think one of the things that I often see is that a lot of founders that I know, they don't, they don't have some big dream of building a product that solves some idea that they really believe in. Like, if you talk to a bunch of YC founders or whoever it is, like, what, what is their goal? It really is to tell all their friends that they raised $10 million and show their parents they got a headline on TechCrunch. Like, that is their goal. Like, I, I think of, like, my friends at Google. They, they often tell me, "Oh yeah, I've, you know, I've been at Google or Facebook for 10 years, and I want to start a company." I'm like, "Okay, so what problem do you want to solve?" And they don't know. (laughs) They're like, "Yeah, I just want to start something new. I'm bored." And it's weird because th- they can, like, pay their own salaries for a couple of months. Again, they've been at Google and Facebook for 10 years, they're not just, like, fresh out of school. They, they can pay their own salaries, but the first thing they think about is just going out and raising money. And I've always just thought it weird because they, like, might try talking to some users and they might try building an MVP, but they kind of just do it in this throwaway manner where the only reason they do it is to check off a box on a startup accelerator application. And then they'll just pivot around these random product ideas and they'll, they happen to get a little bit of traction so then a VC DMs them, and so they spend all their time tweeting and they go to these VC dinners, and it's all just so that they can show the world that they raised a big amount of money. And so I think raising immediately always felt silly to me. Like, everybody's default is to just immediately raise, but if you were to think about it from first principles, like, if you didn't know how Silicon Valley worked, if you didn't know that raising was a thing, like, why, why would you do that? Like, what is money really going to solve for 90% of these startups where the founders are lucky to have some savings? I really, really think that your first instinct should be to go out and build whatever you're dreaming of. And sure, if you ever run into financial problems, then sure, think about raising money then, but don't waste all this effort and time upfront when you don't even know what you'd do with it.
- Elad Gil
Yeah, it's funny. I feel like I'm one of the few investors that actually tries to talk people out of fundraising often.
- Edwin Chen
Right. (laughs)
- Elad Gil
Like, I actually had a conversation today where the founder was talking about doing a raise, and I'm like, "Why?" You know? "You don't have to. You can maintain control," et cetera. And then the flip side of it is, I would actually argue outside of Silicon Valley, too few people raise venture capital when the money can actually help them scale, and so I feel like in Silicon Valley there's too much, and outside of Silicon Valley there's too little. So it's this interesting, you know, spread of, uh, different models that sort of stick.
- Sarah Guo
Edwin, what would you, um, what would you say to founders who, uh, feel like there's some external validation necessary to, uh, especially, like, uh, hire a team or scale their team? This is a very, like, common complaint or rationale for going and raising more capital.
- Edwin Chen
I think about it in a couple ways. So I guess it depends on what you mean by external validation. Like, in my mind, again, like, I often think about things from the pers- perspective of, are you trying to build a startup that's actually going to change the world? Like, do you have this big thing that you're dreaming of? And if you have this big thing that you're dreaming of, you... Like, why do you care? (laughs)
- Elad Gil
Maybe the way to, to think about it is in, in Sarah's context, like, if you haven't... Say you're a YC founder. You haven't been at Google, you haven't been at Meta, you haven't been at Twitter, you don't have this network of engineers, you're a complete unknown, you haven't worked with very many people, you're straight out of school.
- Edwin Chen
Yup.
- Elad Gil
How do you then attract that talent? And to your point, you can tell a story of how you're going to build things or what you're going to do, right? But it is a harder, um, obstacle to basically convince others to join you or for others to come on board or to have money to pay them if you haven't, if you don't have a long work h- history. So I think maybe that's the point Sarah was making.
- Edwin Chen
Uh, yeah. So I mean, I think I would differentiate between maybe two... two things. Like, one is, do you need the money? So first of all, like, there is a difference between people who are, yeah, like, literally fresh out of school, or maybe, you know, have never gone to school in the first place, and so maybe they don't have any savings, and so they literally need some money in order to, in order to live. And then there's others who, okay, l- like, let's assume that you don't necessarily need money because, uh, again, you've been working at Google or Facebook for 10 years, like, or, you know, five years, whatever it is. You have some savings. So, I would say one of the questions is, again, like, it kind of, uh, the path kind of differs depending on, depending on those, those two choices, or those two, um, scenarios. But I think one of the questions is, well, do you really need to go out and hire all these people? Like, one of the things that I often see, again, like, I am... I'm curious what you guys see, but one of the things that I often see is founders will tell me, like, uh, "Okay, so I'm trying, I'm trying to think about the first few hires I'm going to make." And they're like, "Yeah, I'm gonna hire a PM. I'm gonna hire a data scientist. Yeah, these are among my first five to ten hires." I'm like, "What?" (laughs) Like, this is just wild to me. Like, I would never hire a data scientist as one of the first three people at a company. And I, and I say this because I used to be a data scientist. Like, data scientists are great when you want to optimize your product by 2% or 5%, but that's definitely not what you want to be doing when you start a company. You're, you're trying to swing for 10X or 100X changes, not worrying and nitpicking about small percentage points that are just noise anyways. Or someone who is a product manager. It's like, product managers are great when your company gets big enough, but at the beginning, you should be thinking yourself about what product you want to build. And your engineers should be hands-on. They should be having great ideas as well. And so product managers are this weird conception that big companies have for when your engineers don't have time to be in the weeds on the, on the details and try things themselves, and it's not a role that you need at an early-stage company.
- Elad Gil
So I guess, um, with the initial Surge team, it sounds like you had sort of a small initial tight engineering team. You guys started building product. You were bootstrapping off of revenue.
- Edwin Chen
Yeah.
- Elad Gil
You know, at this point, you're at over a billion dollars in revenue, which is amazing. Um, how do you think about the future of how you want to shape the organization, how big you want to get, the different products you're, you're launching and introducing? Like, what, what do you view as sort of the future of Surge and how that's all going to evolve?
- Sarah Guo
Before we do that,
- 7:59 – 9:39
Explaining SurgeAI’s Product
- Sarah Guo
can you just explain, like, what the, uh, at whatever level of detail makes sense here, like, what the billion dollars of revenue is? Maybe, like, how product supports the company, where your data comes from, who your humans are, because I think there's just very little visibility into, into all of that.
- Edwin Chen
So in terms of what our product is, I mean, at the end of the day, our product is our data.
- Sarah Guo
Mm-hmm.
- Edwin Chen
Um, like, we literally deliver data to companies, and that is what they use to train and evaluate their models. So imagine, you know, you're one of these frontier labs, and you want to improve your model, uh, your model's coding abilities. What we will do on our end is we will gather a lot of coding data, and so this coding data may come in different forms. It may be SFT data, where we are literally writing out coding solutions. Or maybe unit tests. Like, these are the tests that a good, uh, that a good piece of code must pass. Maybe it's preference data where it's, okay, like, here are two pieces of code or here are two coding explanations. Which one's better? Or these might be, like, verifiers. Like, okay, um, here's a web app that I created. I want to make sure that in the top right-hand of the screen there's, like, a, there's a login button, or I want to make sure that when you click this button, something else happens. Like, there's a bunch of different forms that this data may take, but at the end of the day, what we're doing is we're delivering data that will basically help the models improve on these capabilities. Very, very related to that is this notion of evaluating the models. Like, you also want to know, yeah, is this a good coding model? Uh, is it better than this other one? What are the areas in which this model is weak and this, this model is worse? Like, what insights can we get from that? And so in addition to the data, oftentimes we're delivering insights to our customers. We're delivering loss patterns, we're delivering failure modes. So there may be, like, a lot of other things, um, like, related to the data, but at the end of the day, it's, like, this, this universe of, uh, of, like, applications or this, like, just universe around the data that we deliver and that, that
- 9:39 – 11:27
Differentiating SurgeAI from Competitors
- Edwin Chen
is our product.
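To make those formats concrete, here is a minimal sketch of what such records and checks might look like. The field names and the toy verifier are illustrative assumptions, not Surge's actual schema or tooling:

```python
# Illustrative sketches of the data formats described above; all field
# names here are assumptions, not Surge's actual schema.

# SFT data: a prompt paired with a human-written solution.
sft_example = {
    "prompt": "Write a function that reverses a linked list.",
    "response": "def reverse(head): ...",  # a full human-written solution
}

# Preference data: two candidate responses plus a human judgment.
preference_example = {
    "prompt": "Explain what this piece of code does.",
    "response_a": "...",
    "response_b": "...",
    "preferred": "a",  # chosen by a human annotator after careful review
}

# Verifier: a programmatic check that generated output meets a spec,
# e.g. "the web app has a login button in the top right of the screen".
def has_login_button(page_html: str) -> bool:
    """Crude check that the rendered page contains a login button."""
    return "login" in page_html.lower()

print(has_login_button('<button id="login" class="top-right">Login</button>'))
```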
- Sarah Guo
Yeah. And maybe going back to Elad's question, um, uh, maybe like product isn't actually the right word here, but what's r- what's like repeatable about the company? Or what are like core capabilities that you guys have that you would say your competitors, you know, um, fail to meet the mark?
- Edwin Chen
The way we think about our company is that... And the way we differentiate from others is that a lot of other companies in this space, they are essentially just body shops. What they are delivering is not data. They are literally just delivering warm bodies to, um, to, uh, to companies. And so what that means is, like, at the end of the day, they don't have any technology. And one of our fundamental beliefs is that, again, quality is the most important thing at the end of the day. Like, is this high-quality data? Is this a good coding solution? Is this a good unit test? Is this mathematical problem solved correctly? Is this a great poem? And basically, a lot of companies in this space, like, uh, just, just as, like, how things have worked out historically, it's that... Like, historically, a lot of companies, they, uh, uh, they've treated quality and data as a commodity. Like, one of the ways we often think about it is, imagine you were trying to draw a bounding box around a car. Like, Sarah (laughs), you and I, we're probably going to draw the same bounding box. Like, ask Hemingway and ask a second grader. Well, at the end of the day, we're all gonna draw the same bounding box. There's not much difference that we can do, so there's a very, very low ceiling on the bar of quality. But then take something like writing poetry. Well, I, I suck at writing poetry. Hemingway is definitely going to write a much better poem than I am. Or imagine a, I don't know, a VC pitch deck. You're gonna create a much better (laughs) pitch deck than I will. And so there's almost an unlimited ceiling in this gen AI world on the type of quality that you can, that you can build. And so the way we think of our product is, like, we have a platform. We have actual technology that we're using to measure the quality that our workers, our annotators are generating. If you don't have that technology, if you don't have any way of measuring
- 11:27 – 12:25
Measuring the Quality of SurgeAI’s Output
- Edwin Chen
it-
- Elad Gil
Is the measurement through human evaluation? Is it through model-based evaluation? I'm, I'm a little bit curious like how you create that feedback loop since, to some extent, it's a little bit this question of how do you have enough evaluators to evaluate the output relative to the people generating the output? Or do you use models or how, how do you approach it?
- Edwin Chen
Like, I think one analogy that we often make is... Think about something like Google Search, or think about something like YouTube. Like, you have, you know, millions of search results. You have millions of web pages, you have millions of videos. How do you evaluate the quality of these videos? Like, is this a high-quality, like... Is this a high-quality web page? Is it informative? Or is it really spammy? Like, in... The way you do this is, like, you just need... I mean, you gather so many signals. You gather, like, page-dependent signals. You gather, like, user-dependent signals. You gather activity-based signals, and all of these feed into, you know, a giant ML algorithm at the end of the day. And so, in the same way, we, we gather all these signals about our annotators, about the work that they're performing, about, like, their activity on the site, and we just feed it into, um, a lot of these different, uh... Like, we basically have an ML team internally that builds a lot of these algorithms to, to measure all this.
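A minimal sketch of that idea, assuming made-up features and a simple learned model (the real system is surely much richer; nothing here is Surge's actual pipeline): many per-annotator signals feed a classifier that estimates the quality of the work.

```python
# Sketch: combine per-annotator signals into a learned quality estimate.
# Features, data, and model choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [agreement_with_gold_tasks, seconds_on_task, revision_count]
signals = np.array([
    [0.95, 340.0, 2.0],
    [0.60,  45.0, 0.0],
    [0.88, 410.0, 5.0],
    [0.55,  60.0, 1.0],
])
labels = np.array([1, 0, 1, 0])  # 1 = work passed an expert spot check

quality_model = LogisticRegression().fit(signals, labels)

# Estimated probability that a new annotator's work is high quality.
print(quality_model.predict_proba([[0.90, 300.0, 3.0]])[:, 1])
```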
- 12:25 – 14:02
Role of Scalable Oversight at SurgeAI
- Sarah Guo
What is changing or breaking as you are, like, scaling increasingly sophisticated, like, annotations, right? Like, if, you know, the model quality baseline is going up, um, every couple of months, then the expectation is it, like, exceeds, you know, what might have been, um, a random human at some point, as you said. Like, from "can draw a bounding box" into all of these different fields, um, where, uh, you know, we, we, we have models better than the 90th percentile at some point.
- Edwin Chen
So, this is actually something that we do a lot of internal research on ourselves as well. So, there's basically this field of, uh, AI alignment called scalable oversight, which is basically this question of, how do you, how do you, like, have models and humans working together, hand-in-hand, to produce data that is better than either one of them can achieve on their own? And so even, like, even today, something like writing a short story from scratch, even today, like... Or a couple of years ago, we might have written that story completely from scratch ourselves. But today, it's just, like, not very efficient, right? Like, you might start with a story that the model, model created, and then you would edit it. You might e- edit it in a very substantial way, like, maybe just the core of it is very vanilla, very generic. But there's just so much, like, kind of, like, cruft that is just inefficient for a human to do, and doesn't really benefit from, like, the human creativity and human ingenuity that we're trying to add into the response. And so you can just start with, like, this bare-bones, uh, structure that you're basically just layering on top of. And so again, like, there's, like... Th- there are more sophisticated ways of thinking about scalable oversight, but just this question of, how do you build the right interfaces? How do you build the right tools? How do you just combine people i- with, with AI in the right ways to, like, make them, uh, make them more efficient? Um, it, it is something that we build u- build a lot of technology for.
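In its simplest form, the loop Edwin describes looks something like the sketch below; `generate_draft` stands in for any text-generation API call and is purely an assumption for illustration:

```python
# A bare-bones sketch of the model-drafts-then-human-edits loop.
def generate_draft(prompt: str) -> str:
    # Placeholder for a real model call; returns a generic scaffold.
    return "The moon rose over the water. It was bright. The end."

def human_edit(draft: str) -> str:
    # In a real pipeline this is an annotator in an editing interface,
    # adding the creativity the generic draft lacks; simulated here.
    return draft.replace(
        "It was bright.",
        "Its light broke into a thousand silver threads on the tide.",
    )

prompt = "Write a short story about the moon."
draft = generate_draft(prompt)   # the model handles the cruft
final = human_edit(draft)        # the human supplies taste and ingenuity
training_example = {"prompt": prompt, "response": final}
print(training_example["response"])
```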
- 14:02 – 16:39
Challenges of Building Rich RL Environments
- Sarah Guo
A lot of the discussion in terms of what human data the labs want has moved to, um, RL environments and reward models in, you know, recent months. Um, what is hard about this, or, you know, what are you guys working on here?
- Edwin Chen
So, we do a lot of work building up RL environments, and I think one of the things that people really underestimate is how... (laughs) It is how complicated it is, that you can't just synthetically generate it. Like, for example, you need a lot of tools because these are massive environments that people want.
- Sarah Guo
Can you give an example of, like... Just to make it more real?
- Edwin Chen
Like, imagine you are a salesperson, and when you are a salesperson, you need to be interacting with Salesforce, you need to be getting leads through Gmail. You're going to be talking to customers in Slack. You're going to be creating Excel sheets tracking your leads. You're going to be, uh, I don't know, writing Google Docs and making PowerPoint presentations to present things to customers. And so you want these, basically these very rich environments that are literally simulating your entire world as a salesperson. Like, it, it literally is just like... Imagine, like, your entire world, so with everything on your desktop, and then, in the future, everything that is, you know, not on your desktop, uh, as well. Like, maybe you have a calendar. Maybe there's... Maybe you need to travel to a meeting to meet a customer, and then you want to simulate a car accident happening, and you're getting notified of that, so you need to, like, leave a little bit earlier. Like, all these things are things that we actually want to model in these very, very rich RL environments. And so the question is, how do you generate all of the data that go into this? Like, okay, you're gonna need to generate, like, thousands of Slack messages, hundreds of emails. You need to make sure that these are all consistent with each other. You need to make sure that, uh, like, going back to, like, my car example, you need to make sure that time is evolving in these environments and, like, certain, like, external events happen. Like, how do you do all of this and then, um... like, in a way that is actually kind of, like, interesting and creative, but also realistic and not, like, incongruent with each other? Um, like, there, there's just, like, a lot of thought that needs to go into these environments, um, to, to make sure that they're, again, like, rich, creative environments that the models can learn interesting things from. And so, uh, yeah, you basically need, like, a lot of tools and kind of sophistication for, for creating these.
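As a toy illustration of the consistency problem he's pointing at: the simulated emails, Slack messages, and calendar entries in such an environment all have to agree with one shared timeline. Everything in this sketch is made up:

```python
# Toy sketch: simulated events in an RL environment share one timeline,
# and the generator must keep them mutually consistent.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SimEvent:
    timestamp: datetime
    channel: str  # "email", "slack", "calendar", ...
    content: str

events = [
    SimEvent(datetime(2025, 3, 1, 9, 0), "email", "Inbound lead: Acme Corp"),
    SimEvent(datetime(2025, 3, 1, 9, 30), "slack", "Acme wants a demo this week"),
    SimEvent(datetime(2025, 3, 1, 14, 0), "calendar", "Demo call with Acme"),
]

def chronologically_consistent(events: list) -> bool:
    """Simulated time should only move forward through the event log."""
    stamps = [e.timestamp for e in events]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))

assert chronologically_consistent(events)
```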
- Sarah Guo
Is there any intuition for, like, how real or how complex is enough, or is it just, like, you know, there's a, there's no ceiling on the, um, realism that is useful here, or the complexity of environment that is useful here?
- Edwin Chen
I think there's no ceiling. Like, at the end of the day, you just want as, as, like, much diversity and richness as you can get. Because the more richness that you have, yeah, the more the models can learn from. The, like... The longer the time horizons, the more the models can learn on and improve on. So, I think there's almost, almost an unlimited ceiling here.
- 16:39 – 17:29
Predicting Future Needs for Training AI Models
- Sarah Guo
If you were to make a five or 10-year bet on, like, what scales most in terms of demand from people training AI models and, and types of data, is it RL environments or is it traces of, like, expert reasoning, or, or what other areas do you think there's going to be a really large demand for?
- Edwin Chen
I mean, I think it will be all of the above. Like, I don't think RL environments alone will suffice, just because... I mean, it depends on everything you put in RL environments, but oftentimes these are very, very rich trajectories that are very, very long. And so it's almost, like, inconceivable that a single reward... (laughs) Um, I mean, I think even today, we often think about things in terms of multiple rewards, not just a single reward. But a sing- like, a single reward may just, may not be, like, rich enough to capture, um, all of the, all of the work that goes into, like, the model solving some, like, very, very complicated goal. Um, so I think there'll probably be a combination of,
- 17:29 – 21:27
Role of Humans in Data Generation
- Edwin Chen
of all those.
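One toy way to picture the multiple-rewards point: score a long agent trajectory on several axes and combine them, rather than relying on one scalar. The axes and weights below are illustrative assumptions, not any lab's actual reward design:

```python
# Toy illustration of multiple rewards for one long trajectory.
WEIGHTS = {
    "task_completed": 1.0,
    "efficiency": 0.2,
    "constraint_violations": 0.5,
}

def combined_reward(trajectory: dict) -> float:
    parts = {
        "task_completed": 1.0 if trajectory["goal_reached"] else 0.0,
        "efficiency": 1.0 / max(len(trajectory["actions"]), 1),
        "constraint_violations": -float(trajectory["violations"]),
    }
    return sum(WEIGHTS[k] * v for k, v in parts.items())

print(combined_reward(
    {"goal_reached": True, "actions": ["open_crm", "send_email"], "violations": 0}
))  # 1.0 + 0.2 * 0.5 + 0 = 1.1
```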
- Elad Gil
If you assume eventually, um, some form of superhuman performance across different model types relative to human experts, how do you think about the role of humans relative to data and data generation versus synthetic data or other approaches? Like, at what point does human input sort of run out as a, as a useful, um... point of either feedback or data, data generation.
- Edwin Chen
So, I think human feedback will never run out, and that's for a couple reasons. So, I mean, even if I think about the landscape today, I think people often overestimate the role of synthetic data. I, I personally, I think synthetic data actually is very, very useful. Like, we use it, um, like, a ton ourselves in order to supplement what the humans do. Like, again, like, like I said earlier, there's, like, a lot of cruft that simply isn't worth a human's time. But what we often find is that... Like, for example, a lot of the time customers will come to us and they'll be like, "Yeah, for the past six months, I've been experimenting with, with synthetic data. I've gathered 10 to 20 million pieces of synthetic data. Actually, (laughs) yeah, we finally realized that 99% of it just wasn't useful, and so right now we're trying to curate the 5% that is useful, but we are literally going to throw out nine million of it." And oftentimes you'll find out that, yeah, like, actually 1,000, even 1,000 pieces of high-quality human data, uh, highly curated, really, really high-quality human data is actually more valuable than those 10 million points. So, that is, that is one thing I'll say. Another thing I'll say is that it's almost like sometimes you need an external signal to the models. Like, the models just think so differently from humans that you always need to make sure that they're kind of aligned with the actual objectives that you want. Let me give two examples. So, one example is that it, it's kind of funny. If you, sometimes if you try... And so, o- one of the frontier models, (laughs) let me just say that one of them, if you go use the frontier model, it's, like, one of the top models or one of the models everybody thinks is one of the top. If you go use it today, like, maybe 10% of the time when I use it, it will just output random Hindi characters and random Russian characters in
- Sarah Guo
Yeah.
- Edwin Chen
one of the responses. So, I'm like, "Tell me about Donald Trump. Tell me about Barack Obama." And just, like, in the middle of it, it will just output Hindi and Russian. It's like, what is this? (laughs) And the model just isn't, like, self-consistent enough to be aware of this. It's almost like you need a, uh, like, an external human to tell the model that, yeah, this is wrong. One of the things I think is a giant plague on AI is LMSYS, LM Arena. And I'll, I'll, I'll skip the details for now, but I think right now people will often... It's, like, people training their models on the wrong objectives. So, like, the, the mental model that you should have of LM- LMSYS, LM Arena is that people are writing prompts, they'll get two responses, and they'll spend, like, five, 10 seconds looking at th- looking at the responses, and they'll just take whichever one looks better to them. So, they're not evaluating whether or not the model hallucinated, they're not evaluating the factual accuracy and whether it followed any instructions. They're literally just vibing with the model and like, "Okay, yeah, like, th- th- this one seemed better because it had a bunch of formatting, it had a bunch of emojis, it just looks more impressive." And people will train on, like, basically an LMSYS objective, and they won't realize all the consequences of it. And again, like, the, the model itself doesn't, doesn't, like, know what its objective is. It's like you almost need, like, an external, like, quality signal in order to tell it what the right objective should be. And if you don't have that, then the model will just go in all these crazy directions. Again, like, you might, you may have seen some of the results with, like, the, with LLaMA 4, but it would just go in all these crazy directions that, um, kind of, kind of mean you need these external, external validators.
- Elad Gil
This also happens actually when you do different forms of, like, protein evolution or things like that, where you select a protein against a catalytic function or something else and you just kind of randomize it and have, like, a giant library of them, and you end up with the same thing where you have these really weird activities that you didn't anticipate actually happening. And so I sometimes think of model training as almost this odd evolutionary landscape that you're effectively evolving and selecting against, and you're kind of shaping the model into that local, uh, maxima or something. And so it's kind of this really interesting output of anything where you're effectively evolving against a feedback, uh, signal. And, uh, depending on what that feed- feedback signal is, you just end up with these odd results. So, it's interesting to see how it kind of transfers across domains.
- 21:27 – 22:51
Importance of Human Evaluation for Quality Data
- Sarah Guo
These, you know, coarse, uh, as you said, five-second-reaction academic benchmarks, or even non-academic industrial benchmarks, are, uh, easily hacked or, like, not the right gauge of performance against any given task. They are very popular. What is the alternative, um, for somebody who's trying to, like, choose the right model or understand model capability?
- Edwin Chen
So, the alternative that I think all the frontier labs view as the gold standard is basically human evaluation. So again, proper human evaluation where you're actually taking the time to look at the response, you're gonna fact-check it, you're going to see whether or not it followed all the instructions. You have good taste, so you know whether or not the, the model has good writing quality. Like, this concept of, like, doing all that and spending all the time to do that, as opposed to just vibing for five seconds, I think actually is really, really important, because if you don't do this, you basically, you're basically just training your models on an analog of clickbait. Um, so I, I think it actually is really, really important for model progress.
- Sarah Guo
If it's not LMSYS, like, how should people, um, actually evaluate model capability for any given task?
- Edwin Chen
What all the frontier labs find is that human e- evals really are the gold standard. Like, you really need to take a lot of time to fact check these responses, to verify they're, they're following instructions. You need people with good taste to evaluate the writing quality, uh, and so on and so on. And if you don't do this, you're basically training your models on the analog of clickbait. And so I think, I think that really, really harms model
- 22:51 – 23:37
SurgeAI’s Work Toward Standardization of Human Evals
- Edwin Chen
progress.
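As a rough picture of what "proper human evaluation" might record, here's a sketch of one evaluation entry along the axes he names; the rubric fields are illustrative assumptions, not any lab's actual form:

```python
# One hypothetical record from a careful human evaluation; the fields
# are illustrative assumptions along the axes named above.
evaluation = {
    "prompt": "Summarize this paper in 200 words.",
    "response_id": "model_a/run_17",
    "factual_accuracy": 4,       # 1-5, after actually fact-checking claims
    "instruction_following": 5,  # e.g., did it respect the 200-word limit?
    "writing_quality": 3,        # judged by an evaluator with good taste
    "seconds_spent": 1240,       # careful review, not a five-second vibe
    "notes": "One unsupported claim in the second paragraph.",
}
print(evaluation["factual_accuracy"])
```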
- Sarah Guo
Is there work that Surge is doing in this domain of, like, trying to standardize human eval or make it more transparent to end consumers of the API or even users?
- Edwin Chen
So internally, w- we do a lot of work actually today with working with all the frontier labs to help them understand their models. So again, we're constantly evaluating them, um, we're constantly surfacing loss areas for them to improve on, and so on and so on. And so right now, a lot of this is internal, but one of the things that we actually want to do is start external forms of this as well, where we're helping educate people on, "Yeah, like, these are the different capabilities of all these models. Here, these models are better at coding. Here, these models are better at instruction following. Here, these models are actually hallucinating a lot, so you shouldn't trust them as much." So, we, we actually do want to start a lot of, uh, external work to, to help educate the, the broader landscape on this.
- 23:37 – 24:35
What the Meta/ScaleAI Deal Means for SurgeAI
- Sarah Guo
If we can zoom out and talk just about the, um, the larger, like, competitive landscape and what happens with frontier models over time, what does the Meta-Scale deal mean for you guys? Or what do you make of it?
- Edwin Chen
So, I think it's kind of interesting in that, so we're already the number one player in the space. It's been beneficial because, yeah, there were still some legacy teams using Scale, like, they just didn't know about us (laughs) because we were still pretty under the radar. I think it's been beneficial because one of the things that we've always believed is that... Sometimes when you use these low-quality data solutions, people kind of get burned on human data, and so they have this negative experience, and so then they don't want to use human data again. And so they'll try these other methods that are honestly just a lot slower and don't come with the right objectives, and so I- I think it just harms model progress overall. And so, like, just, like, the more and more we can get, uh, all these frontier labs using high-quality data, I think it actually really, really is beneficial for the industry as a whole. So I- I am... I- I- I think overall it was a, like, a
- 24:35 – 24:50
Edwin’s Underdog Pick to Catch Up to Big AI Companies
- Edwin Chen
good thing to happen.
- Sarah Guo
If you were to, um, make a bet that an underdog catches up to OpenAI, Anthropic, and DeepMind, who would it be?
- Edwin Chen
So I would bet on xAI. (laughs) I think they're just very hungry and mission-oriented in a way that gives them a lot of really unique advantages.
- 24:50 – 26:25
The Future Frontier Model Landscape
- Sarah Guo
I guess maybe a- another, um, sort of broader question is, do you think there's three competitive frontier models, ten competitive frontier models, uh, a couple years from now? And are... is any of those open-source?
- Edwin Chen
Yeah. So I actually see more and more frontier models, uh, opening up over time, because I actually don't think that the models will be commodities. Like, I think one of the things that we've... I mean, w- I think one of the things that has actually been surprising the past couple of years is that you actually see all of these models have their own focuses that give them unique strengths. Like, for example, I think Anthropic's obviously been really, really amazing at coding and enterprise, and OpenAI has this big consumer focus because of ChatGPT. Like, I- I actually really love its... (laughs) its model's personality. And then Grok, you know... (laughs) it just has a different set of things that it's willing to say and to build. And so it's almost like every company has... it's almost like a different set of principles that they care about. Like, they're... like, some will just never do one thing. Others are totally willing to do it. Others have... just have different... Like, models will just have so many different facets to their personalities, so many different facets to the type of skills that they will be good at. And sure, like, eventually AGI will maybe encompass this- this all, but in the meantime you just kind of need to focus. Like, there's only so many focuses that you can have as a company. And so I think that just will lead to, uh, like, different strengths for all of the model providers. So, I mean, I think today, you know, we already see, like, a lot of people, including me... (laughs) we will switch between all the different, uh, models just depending on what we're doing. And so in the future, I think that will, um, just happen even more as, uh, as- as people are just using more and more models for... or, using models for different aspects of their lives, like with their personal and their,
- 26:25 – 29:29
Future Directions for SurgeAI
- Edwin Chen
like, professional lives.
- Sarah Guo
Going back to something Elad mentioned, like, where should we expect to see, like, Surge investing over time? Like, what do you think you guys will do a few years from now that you don't do today?
- Edwin Chen
Again, I think I'm really excited about this more kind of, like, public research push that we're- we're starting to have. Like, I think it is really interesting in that a lot of the... like, for o- obvious reasons, a lot of frontier labs, they're just not publishing anymore. And as a result of that, I think it's almost like the- the industry has fallen into kind of a trap that I worry about. So, like, maybe to dig into some of the- some of the things I said earlier, um, with some of the negative incentives of- of the industry and some of the kind of concerning trends that- that- that we've seen. So, like, going back to LMSYS. One of the things that we'll see is, like, a lot of researchers, they'll tell us that their VPs make them focus on increasing their rank on LM- LMSYS. And so I've had researchers explicitly tell me that they're okay with making their models worse at factuality, worse at following instructions, as long as it improves their ranking, because their leadership just wants to see these metrics go up. And again, that is, like, something that literally- literally happens because the people ranking these things on LMSYS, they don't care whether the models are (laughs) good at instruction following. They don't care whether the models are- are, um, emitting, like, factual responses. They... what they care about is, okay, did this model emi- emit a lot of emojis? Did it emit a lot of bold words? Did it have really long responses? Because that's just going to look more impressive to them. Like, one of the things that we found is that the easiest way to improve your rank on LM Arena is literally to make your- make your model responses longer. And so what happens is, like, there are a lot of companies who are trying to improve their leaderboard rank. So they'll see progress for six months because all they're doing is unwittingly making their model responses longer and adding more emojis. And they don't realize that all they're doing is training the models to produce better clickbait. And they might finally realize six months or a year later... like, again, you- you may have seen some of these things in industry, but it basically means that they spent the past six months making zero progress. In- in a similar way, I think, you know, besides LMSYS, you have all these a- academic benchmarks. And they're completely divorced from the real world. Like, a lot of teams are focused on improving these SAT-style scores instead of real-world progress. Like, I'll give an example. There's a benchmark- benchmark called IFEval, and if you look at IFEval... so it- so it stands for Instruction Following Eval. If you look at IFEval, like, some of the instructions they're trying to check whether the models can do, it's like, "Hey, can you, uh, write an essay about Abraham Lincoln? And every time you, uh, like, mention the word 'Abraham Lincoln,' make sure that five of the letters are capitalized and all the other letters are uncapitalized." It's like, what is this? (laughs) And sometimes we'll get customers telling us, like, "Yeah, like, we really, really need to improve our, like, our score on, uh, on IFEval." And what this means is, again, like, you have all these companies or all these researchers who, instead of focusing on real-world progress, they're just, like, optimizing for these silly SAT-style benchmarks.
And so one of the things that we- we really want to do is just think about ways to educate the industry, think about ways of publishing on our own, just, like, think about ways of steering the industry into, like, hopefully a better direction. And so I think that's just one- one big thing that we're- we're really excited about and could be- could be really big in the next five years.
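Edwin's length claim suggests a simple sanity check anyone can run on their own preference data: how often does the longer response win? The numbers below are made up for illustration:

```python
# Diagnostic for the failure mode described above: if the winning
# response in your preference data is usually just the longer one, you
# may be training on length, not quality.
pairs = [  # (length_a, length_b, winner)
    (120, 480, "b"), (200, 650, "b"), (500, 210, "a"),
    (90, 700, "b"), (640, 300, "a"), (150, 500, "b"),
]

longer_wins = sum(
    (la > lb and w == "a") or (lb > la and w == "b")
    for la, lb, w in pairs
)
print(f"longer response wins {longer_wins}/{len(pairs)} comparisons")
# A ratio far above 0.5 suggests length, not quality, drives the ranking.
```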
- 29:29 – 32:26
What Does High Quality Data Mean?
- Elad Gil
Okay. Yeah. I mean, so Sarah brought up earlier, um, how everybody kind of wants high-quality data. Uh, w- what does that mean? How do you think about that? How do you generate it? Can you tell us a little bit more about your thoughts on that?
- Edwin Chen
So let's- let's say you wanted to train a model to write an eight-line poem about the moon. And so the way most companies think about it is, "Well, let's just hire a bunch of people from Craigslist or through some recruiting agency, and let's ask them to write poems." And then the way they think about quality is, "Well, is this a poem? Is it eight lines? Does it contain the word 'moon'?" If so, like, okay, yeah, I hit these three checkboxes, so yeah, sure, this is a great poem because it follows all these instructions. But if you think about it, like, the reality is you'd get these terrible poems. Like, sure, it's eight lines and it has the word "moon," but they feel like they're written by kids from high school. And so other companies would be like, "Okay, sure. These people on Craigslist don't have any poetry experience. So what I'm going to do instead is hire a bunch of people with PhDs in English literature." But this is also terrible. Like, a lot of PhDs, they are actually not good writers or poets. Like, if you think of, uh, like, think of Hemingway or Emily Dickinson. They definitely didn't have a PhD. I, I don't think they even completed college. And like, one of the things that I'll say is, like, yeah, I- I- I went to MIT, I think, Elad, you went, you went there too. And a lot of people I knew from MIT who graduated with a CS degree, they're terrible coders. And so we think about quality completely differently. Like, what we want isn't poetry that checks some boxes. And like, okay, yeah, it checked these few boxes and used some complicated language. We want the type of poetry that Nobel Prize laureates would write. And so what, what we want is, like, okay, we want to recognize that poetry's actually really subjective and rich. Like, maybe one poem, it's a haiku about moonlight on water. And there's another poem that's, like, it has a lot of internal rhyme and meter. And another one that, f- I don't know, focuses on the emotions behind the moon rising at night. And so you actually want to capture that there's thousands of ways to write a poem about the moon. There isn't a single correct way, and each one gives you all these different insights into language and imagery and- and poetry. And if you think about it, it's not just poetry. It's like math. There's probably a thousand ways to prove the Pythagorean theorem. And so I think the, the difference is that when you think about quality the wrong way, you kind of get commodity data that optimizes for things like interrater agreement, and again, checking boxes off of some list. But one of the things that we try to teach all of our customers is that high-quality data actually really embraces human intelligence and creativity. And when you train the models on this, like, richer data, they don't just learn to follow instructions, they really learn all these deeper patterns about all the stuff that makes language in the world really compelling and meaningful. And so I think a lot of companies, they just throw humans at the problem and they think that you can get good data that way, but I think you really need to think about quality from first principles and what it means, and you need a lot of technology to identify, yeah, that these are amazing poems, and these are creative math problems, and these are games and web apps that are beautiful and fun to play, and these ones are terrible to use. It's like, you, you really need to build a lot of technology and think about quality in the right way.
Otherwise, you're basically just like scaling up mediocrity.
- Sarah Guo
That sounds very domain specific. So do you, like, in every domain, are you building a lens of what quality looks like along with your partners?
- Edwin Chen
Yeah, I mean, I think we have kind of like holistic quality principles, but then oftentimes there are differences per domain. So it's like a com- combination
- 32:26 – 32:58
Conclusion
- Edwin Chen
of both.
- Sarah Guo
I think we got all the, uh, core topics. Nice work on podcast number two, Edwin-
- Edwin Chen
(laughs)
- Sarah Guo
... and thanks for doing this. Congrats on all the progress with the business.
- Elad Gil
Yeah, no, thanks so much for having us.
- Edwin Chen
Yeah, it was great, great meeting you guys.
- Sarah Guo
Find us on Twitter @nopriorspod. Subscribe to our YouTube channel if you wanna see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.
Episode duration: 32:58