Lenny's Podcast

Chip Huyen: Why RAG wins come from data prep, not vector DBs

Preparing data and talking to users beats agonizing over which vector database; Huyen says post-training, not new models, drives real AI product wins.

Chip HuyenguestLenny Rachitskyhost
Oct 23, 2025 · 1h 22m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–4:28

    Introduction to Chip Huyen

    1. CH

      One question I get asked a lot is: how do we keep up to date with the latest AI news? Why? Why do you need to keep up to date with the latest AI news? If you talk to the users and understand what they want, what they don't want, and look into the feedback, then you can actually improve the application way, way more.

    2. LR

      A lot of companies are building AI products. A lot of companies are not having a good time building AI products.

    3. CH

      We are in an idea crisis. Now we have all these really cool tools. We can do everything from scratch: they can help you design, they can help you write code, they can help you build websites. So in theory, we should see a lot more. But at the same time, people are somehow stuck. They don't know what to build.

    4. LR

      With all this AI hype, the data is actually showing most companies try it, it doesn't do a lot, and they stop. What do you think is the gap here?

    5. CH

      It's really hard to measure productivity. So I do ask people to ask their managers: would you rather give everyone on a team very expensive coding agent subscriptions, or get an extra headcount? Almost all the managers will say headcount. But if you ask the VP level, or someone who manages a lot of teams, they would say the AI assistant. Because as a manager, you are still growing, so for you, having one extra headcount is big. Whereas as an executive, maybe you have more business metrics that you care about, so you actually think about what actually drives productivity metrics for you.

    6. LR

      Today my guest is Chip Huyen. Unlike a lot of people who share insights into building great AI products and where things are heading, Chip has built multiple successful AI products, platforms, tools. Chip was a core developer on NVIDIA's NeMo platform, an AI researcher at Netflix. She taught machine learning at Stanford. She's also a two-time founder and the author of two of the most popular books in the world of AI, including her most recent book called AI Engineering, which has been the most read book on the O'Reilly platform since its launch. She's also gotten to work with a lot of enterprises on their AI strategies, and so she gets to see what's actually happening on the ground inside a lot of different companies. In our conversation, Chip explains a lot of the basics, like what exactly does pre-training and post-training look like? What is RAG? What is reinforcement learning? What is RLHF? We also get into everything she's learned about how to build great AI products, including what people think it takes and what it actually takes. We talk about the most common pitfalls that companies run into, where she's seeing the most productivity gains, and so much more. This episode is quite technical, more technical than most conversations I've had, and is meant for anyone looking for a more in-depth conversation about AI. If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or YouTube. And if you become an annual subscriber of my newsletter, you get a year free of 16 incredible products, including Devin, Lovable, Replit, Bolt, Innate and Linear, Superhuman, Descript, Wispr Flow, Gamma, Perplexity, Warp, Granola, Magic Patterns, Recast, ChatPRD, and Mobbin. Head on over to LennysNewsletter.com and click Product Pass. With that, I bring you Chip Huyen after a short word from our sponsors. This episode is brought to you by Dscout. Design teams today are expected to move fast, but also to get it right. That's where Dscout comes in. Dscout is the all-in-one research platform built for modern product and design teams. Whether you're running usability tests, interviews, surveys, or in-the-wild field work, Dscout makes it easy to connect with real users and get real insights fast. You can even test your Figma prototypes directly inside the platform. No juggling tools, no chasing ghost participants. And with the industry's most trusted panel plus AI-powered analysis, your team gets clarity and confidence to build better without slowing down. So if you're ready to streamline your research, speed up decisions, and design with impact, head to Dscout.com to learn more. That's D-S-C-O-U-T.com. The answers you need to move confidently. Did you know that I have a whole team that helps me with my podcast and with my newsletter? I want everyone on that team to be super happy and thrive in their roles. Justworks knows that your employees are more than just your employees. They're your people. My team is spread out across Colorado, Australia, Nepal, West Africa, and San Francisco. My life would be so incredibly complicated to hire people internationally, to pay people on time and in their local currencies, and to answer their HR questions 24/7. But with Justworks, it's super easy. Whether you're setting up your own automated payroll, offering premium benefits, or hiring internationally, Justworks offers simple software and 24/7 human support from small business experts for you and your people.
They do your human resources right so that you can do right by your people. Justworks, for your people.

  2. 4:28–7:05

    Chip’s viral LinkedIn post

    1. LR

      Chip, thank you so much for being here and welcome to the podcast.

    2. CH

      Hi, Lenny. I've been a big fan of the podcast for a while, so I'm really excited to be here. Thank you for having me.

    3. LR

      I want to start with this table/chart that you shared on LinkedIn a while ago that went super viral, and I think it went super viral 'cause it hit a nerve with a lot of people. And let me just read this, and we'll show this on YouTube for people that are watching. So it's this very simple table you shared of what people think will improve AI apps and what actually improves AI apps. What people think will improve AI apps: staying up to date with the latest AI news, adopting the newest agentic framework, agonizing over what vector databases to use, constantly evaluating what model is smarter, fine-tuning a model. And then you have what actually improves AI apps: talking to users, building more reliable platforms, preparing better data, optimizing end-to-end workflows, writing better prompts. Why do you think this hit such a nerve with people? And just what... If you had to boil it down, what do you think is... What do you think people are missing about building successful AI apps?

    4. CH

      One question I get asked a lot is, how do we keep up to date with the latest AI news? And I'm like, "Why? Why do you need to keep up to date with the latest AI news?" I know it sounds counterintuitive, but there's so much news out there. A lot of people also ask me questions like, "How do I choose between two different technologies?" Maybe recently it's MCP versus, like, agent protocols. Which one is better, this or that? A similar question I usually ask them is: first, how much improvement could you get from the optimal solution versus the non-optimal solution? And sometimes people are like, "Actually, it's not much." And I'm like, "Okay, if it's not much improvement, then why do you want to spend so much time debating something that doesn't make that much difference to your performance?" Another question I ask is, "If you adopted a new technology, how hard would it be to switch it out for another?" And sometimes people are like, "Oh, I think it would be a lot of work switching it out." And I'm just like, "Hmm, here's a new technology. It hasn't been tested by a lot of people, and if you adopt it, you will be stuck with it forever. Do you actually want to adopt it?" Maybe you want to think twice about over-committing to new technologies that haven't been battle-tested.

    5. LR

      I love that your broader advice is just simple: to build successful AI apps, talk to users, build better data, write better prompts, optimize the user experience, versus just, what is the latest and greatest? What's the best model to use right now? What's happening in AI?

  3. 7:05–8:50

    Understanding AI training: pre-training vs. post-training

    1. LR

      Let me follow this thread of fine-tuning and, basically, post-training. There are all these terms that people hear in AI, and I think this is going to be a really good opportunity for people to learn what we're actually talking about, since you actually do these things, you build these things, you work with companies doing these things. There are a few terms I want to sprinkle in through the conversation, but let's start with this one. What's the simplest way for someone to understand the difference between pre-training and post-training, and then how fine-tuning fits into that, what fine-tuning actually is?

    2. CH

      So, disclaimer: I don't have full visibility into what the big, secretive frontier labs are doing. But from what I've heard, one piece is supervised fine-tuning, where you have demonstration data: you have a bunch of experts, and for each prompt, they write what the answer should look like, and you train the model to emulate what the human expert would do. That's also what a lot of people working with open source models are doing, except they do it by distillation. Instead of having human experts write really good answers to the prompts, they get very popular, famous, good models to generate the responses, and train a smaller model to emulate them. I really appreciate the open source community, by the way, but being able to train a model that emulates an existing good model is very different from being able to train a good model that outperforms existing good models; there's a big step between the two. So yeah, we have supervised fine-tuning. And another thing that's very big, I'm not sure if you have had guests talking about it already,

  4. 8:50–13:55

    Language modeling explained

    1. CH

      but reinforcement learning is everywhere.

    2. LR

      Let's pause on that, 'cause I definitely want to spend time on it; it's such a cool topic that's emerging more and more in my conversations. But just to summarize the things you just shared, which I think is really, really important stuff: the idea here is that a model is essentially this algorithm, a piece of code that someone writes, and say the frontier labs are feeding it, like, the entire internet of content. And basically it's trying to test itself, across all that data, on predicting the next word, essentially. Token is the correct way of thinking about it, but a simpler way to think about it is the next word in a text. And as it gets it wrong, it adjusts these things called weights. Is that a simple way to think about it, even though that's very surface level?

    3. CH

      So I think of language modeling as a way of encoding statistical information about language. Let's say that we both speak English, so we kind of get a sense of what is more statistically likely. If I say "my favorite color is," then you would expect another color: the word "blue" would be much more likely to appear than the words "end of table," right? Because statistically, blue is more likely to come after

    4. LR

      Mm-hmm.

    5. CH

      ... "your favorite color is." So it's a way of encoding statistical information. When you train on a large amount of data, the model sees a lot of language, a lot of domains, so given a prompt, it can come up with the next most likely token. By the way, it's not a new idea. Actually,

    6. LR

      Mm-hmm.

    7. CH

      ... this idea is very, very old, from a 1951 paper, "Prediction and Entropy of Printed English," I think by Claude Shannon. It's a great paper, and it tells a story I really like. Did you read Sherlock Holmes, by the way?

    8. LR

      Uh, yeah, I read a few Sherlock Holmes books. Yeah.

    9. CH

      Yeah. So there's a story where Sherlock Holmes uses this statistical information to help solve a case. In the story, somebody left a message written with a lot of stick figures. Sherlock Holmes knows that in English the most common letter is E, so the most common stick figure must be E, right? And then he keeps going like that

    10. LR

      Mm-hmm.

    11. CH

      ... and he eventually solved the code. So in a way it's just simple language modeling, right? But instead of doing it at the word level, he does it at the character level. And a token is something in between: a token is not quite a word, but it's bigger than a character. We use tokens because they help us reduce the vocabulary, because characters give the smallest vocabulary, right? So the

    12. LR

      Mm-hmm.

    13. CH

      ... alphabet has only 26 characters, but there can be millions and millions of words, whereas tokens give you a sweet spot between the two. So let's say we have a new word, like "podcasting." It's a new word, but it can be divided into "podcast" and "ing." We know the meaning of "podcast," and we know that "ing" makes it a gerund, whatever it is, so we understand the word "podcasting." That's where tokens come in. So yeah, pre-training is basically encoding statistical information about language to help you predict what is most likely. Although "most likely" is the simplest way of putting it, because it's really building a distribution: the next token could be a color, like, 90% of the time, and something else 10% of the time. So you get a distribution over the language, and you pick from it depending on your sampling strategy. Do you want the model to always pick the most likely token, or do you want it to pick something more creative? So I think sampling strategy is something extremely important. It can boost your performance in a huge way, and it's very, very underrated.
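      To make the sampling-strategy point concrete, here is a minimal sketch, in Python, of greedy decoding versus temperature sampling over a next-token distribution. The toy probabilities are invented for illustration; a real model would produce a distribution over its whole token vocabulary.

      ```python
      import math
      import random

      # Toy next-token distribution for the prompt "my favorite color is".
      # The probabilities are invented for illustration only.
      next_token_probs = {"blue": 0.55, "green": 0.25, "red": 0.15, "table": 0.05}

      def greedy(probs):
          """Always pick the single most likely token."""
          return max(probs, key=probs.get)

      def sample_with_temperature(probs, temperature=1.0):
          """Re-scale the distribution, then sample from it.
          Low temperature: close to greedy; high temperature: more creative/random."""
          scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
          total = sum(scaled.values())
          r = random.random() * total
          cumulative = 0.0
          for tok, weight in scaled.items():
              cumulative += weight
              if r <= cumulative:
                  return tok
          return tok  # fallback for floating-point edge cases

      print(greedy(next_token_probs))                        # always "blue"
      print(sample_with_temperature(next_token_probs, 0.3))  # almost always "blue"
      print(sample_with_temperature(next_token_probs, 1.5))  # sometimes "green", "red", even "table"
      ```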

    14. LR

      Okay, awesome. So essentially, a model is just code with this whole set of weights, essentially a statistical model that has learned to predict what comes next after certain words and phrases.

    15. CH

      Yeah.

    16. LR

      And then post-training, and fine-tuning specifically, is doing that same thing. So with pre-training you get, like, GPT-5; fine-tuning is someone taking GPT-5 and doing the same sort of thing, adjusting these weights a little bit for specific use cases, on data that they find is necessary for their very specific use case. Is that a simple way to think about it?

    17. CH

      Yeah. I think of the weights as part of a function. Let's say there's a function for Lenny's height: maybe it's 1 times X plus something, or 2 times X. That "one" and that "something" are the weights, right? You change them until the function fits the correct data, which is my height and your height. So you can think of the weights as the adjustable parts of a function, and training adjusts the weights so the function fits the data, which is the training data.
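      A minimal sketch of the "weights as a function you adjust to fit data" idea: a one-weight model trained by gradient descent on two made-up data points. The numbers and the single-weight setup are invented for illustration; real models have billions of weights but follow the same logic.

      ```python
      # Model: predicted_height = w * x, where x is some input feature.
      # Training nudges the weight w until predictions fit the (made-up) data.
      data = [(1.0, 1.8), (2.0, 3.6)]  # (x, true height in meters) -- invented numbers
      w = 0.0                          # start with an arbitrary weight
      learning_rate = 0.1

      for step in range(200):
          # Mean squared error gradient with respect to w.
          grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
          w -= learning_rate * grad    # adjust the weight to fit the data better

      print(round(w, 3))  # converges to ~1.8, the weight that fits both points
      ```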

    18. LR

      Awesome.

  5. 13:55–15:20

    The importance of post-training

    1. LR

      Okay. So we were talking about pre-training, post-training, fine-tuning. Is there anything else here that's important to share? Just, what exactly do people need to understand about these parts of training?

    2. CH

      So the vast majority of the time, we don't touch the pre-trained model. Like, as users, we don't-

    3. LR

      Right.

    4. CH

      ... use them at all.

    5. LR

      It's already done for us.

    6. CH

      Yeah. So it's actually a bit of a fun process: when my friends train models, I try to play with their pre-trained models, and they're horrendous. They say things where you're just like, "Oh my gosh." It's crazy. So it is really interesting to see how much post-training can change the model's behavior. And I think that's where a lot of people in the frontier labs are spending their energy nowadays: on post-training. Because pre-training has been used to increase the general capabilities of a model, and it needs a lot of data and model size to increase those capabilities, and at some point we have actually maxed out all our internet data, right? Text data, we've maxed out. A lot of people are working with other data, like audio and video, and everyone's trying to think of what the new source of data is. Whereas with post-training, everyone can have very similar pre-training data, and post-training is where they make a big difference

  6. 15:20–22:23

    Reinforcement learning and human feedback

    1. CH

      nowadays.

    2. LR

      This is a good segue: you talked about supervised learning versus unsupervised learning. I love that we're getting into this, by the way. This is super interesting. So you're talking about labeled data. Basically, supervised learning is AI learning on data that somebody has already labeled and told it, here's correct versus incorrect. For example, this is spam versus not spam; this is a good short story, this is not a good short story. We've had the CEOs of a lot of the companies that do this for labs: Mercor and Scale, Handshake, uh, there's Micro, uh, there's a few others. So is that essentially what these companies are doing for labs, giving them labeled data, high-quality data to train on?

    3. CH

      It is, in a way, but I think it's more like a part of a bigger equation. There are a lot more components than that, and that's why I was talking about reinforcement learning. I'm not sure if the CEOs you interviewed brought up that term. So the idea is that you have a model, you give the model a prompt, and it produces an output. You want to reinforce or encourage the model to produce outputs that are better. So now the question is, how do we know that an answer is good or bad? Usually people rely on signals. One way to tell good from bad is human feedback: if we have two responses, a human can say, "Okay, this one's better than the other." And we do that because, as humans, it's very hard to give a concrete score, but it's easier to do comparisons. If you ask me, "Okay, give this song a score," I'm not a musician, I don't know how hard it is; out of 10 I'm going to go six, you know? And if you ask me again a month from now and I've completely forgotten, maybe now it's a seven, or a four. I don't know. But if you ask me, "Here are two songs; which one would you prefer to play for the birthday party?" I'd say, "Okay, I think I'd probably prefer this song." Comparison is a lot easier. So you have human feedback, and then you use this human feedback to train a reward model. Then, when the model produces a response, the reward model can score whether it's good or bad, and you bias the model toward producing the better responses. Another way is that instead of using humans, you can use AI to look at the response and say good or bad. Or, the thing people are very big on nowadays: verifiable rewards, which is kind of natural. Basically, you give it a math problem with a known solution; the model outputs a solution, and if the expected answer is 482 and it doesn't produce 482, then it's wrong, it's not a good response. So a lot of the time, people are using human labelers to produce expert questions and expected answers, in ways that are verifiable, so that the models can be trained on them. Yeah.
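      A rough sketch of the two kinds of reward signals described here, with invented example data: a verifiable reward (exact match against a known answer) and a pairwise preference signal of the kind used to fit a reward model (a Bradley-Terry style comparison). This is illustrative only, not any lab's actual training code.

      ```python
      import math

      def verifiable_reward(model_answer: str, expected: str = "482") -> float:
          """Reward 1.0 if the model's final answer matches the known solution, else 0.0."""
          return 1.0 if model_answer.strip() == expected else 0.0

      def preference_probability(score_preferred: float, score_other: float) -> float:
          """Bradley-Terry style: probability the reward model ranks the preferred
          response above the other one, given its scalar scores."""
          return 1.0 / (1.0 + math.exp(-(score_preferred - score_other)))

      def pairwise_loss(score_preferred: float, score_other: float) -> float:
          """Loss is low when the reward model already scores the human-preferred
          response higher; training nudges the scores to reduce this loss."""
          return -math.log(preference_probability(score_preferred, score_other))

      # Verifiable reward: the grader only needs the expected answer, no human judgment.
      print(verifiable_reward("482"))   # 1.0
      print(verifiable_reward("480"))   # 0.0

      # Human comparison data ("response A is better than response B") fits the reward model.
      print(round(pairwise_loss(score_preferred=2.1, score_other=0.3), 3))  # small loss, already ranked correctly
      print(round(pairwise_loss(score_preferred=0.3, score_other=2.1), 3))  # large loss, needs training
      ```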

    4. LR

      Okay, I'm really glad you went there. This is essentially RLHF, reinforcement learning from human feedback, which is exactly what I wanted to also talk about. Right?

    5. CH

      Yeah. I think it's more general; it's a way of learning. Training is ****** learning, and whether it learns from human feedback, AI feedback, or verifiable rewards, those are just different ways of collecting signals.

    6. LR

      Awesome. Yeah. We had the CEO of Anthropic on the podcast, and he talked about their version of RLHF, which is AI-driven reinforcement learning. I love the way he phrased it: you want to help the model, you want to reinforce correct behavior and correct answers, and this is the method to do it. Whether it's, say, an engineer seeing an output from a model and being like, "No, here's how I would code it differently," and then training on that. And it's training a different model that the original model works with, to tell it, "Am I correct or not correct?" Is that right?

    7. CH

      Yeah.

    8. LR

      Roughly? Okay.

    9. CH

      I think that's a way of looking at it, and I think that space is so exciting nowadays because there are so many domain-expert tasks that model developers want their models to do well on. Let's say you're an accountant, and I want to use a model for accounting tasks. You need a lot of accounting data, examples from accountants, so you need to hire a lot of them to produce it. Or you want physics problems, or legal questions, or engineering questions. Or somebody was telling me they want to use coding to solve scientific problems, not just coding to ****** product, which is a whole different realm of things, and also using very specific tooling. I'm not sure what apps you use, but maybe a ******* app, or QuickBooks, or Excel. There's very specific, tool-specific expertise that you want the model to learn. So they need a lot of human experts in these areas to create data to train on, and it's a massive thing, because everyone wants a lot of data and ****** has, like, unlimited budget. But the ******* is also a little bit of low-key interesting economics. I'm not sure if you've talked to your guests about it, but I think it's very interesting to think about, because it's very lopsided. There are only a very small number of frontier labs, and they want a lot of data, and there's a massive number of startups or companies that provide that data. So you see these startups doing data labeling that have maybe massive ARR, but if you ask them, "Okay, so how many customers do you have?" it could be, "Oh, a very small number." I'm not sure. I saw you smiling. Um, so, so-

    10. LR

      Yeah, yeah. We chatted, we chatted about that.

    11. CH

      Yeah. So I'm a little bit ****** uneasy about it. Here we have, like-

    12. LR

      Mm-hmm.

    13. CH

      ... a company growing like crazy, but it's heavily dependent on, like, two or three companies. And at the same time, if I were one of these frontier labs, what would be the economically right thing for me to do? I'd want a lot of startups, a lot of providers, so I can pick and choose, and these providers also compete with each other to lower the price, and it's so dependent on ****** regardless. So I don't know. This whole economics is very interesting to me, and I'm curious to see how it plays out.

    14. LR

      What I'm hearing is you're bearish on the future of these data labeling companies, because, as you said, they don't have a lot of leverage over pricing: they have so few customers, and there are so many people getting into the space. So basically, even though they're some of the fastest growing companies in the world, you're feeling like there's a challenge up ahead.

    15. CH

      I'm not sure if I'm bearish on it. I think I'm curious, because things have a way of working out in ways that I don't expect. So maybe these companies, because they have a lot of data, will be able to use that to get insights that help them stay ahead of the curve, you know? So I don't know.

    16. LR

      A very fair

  7. 22:23–31:55

    The importance of evals in AI development

    1. LR

      answer. (laughs) Okay, while we're on this topic, I want to chat about evals, which is a very recurring topic on this podcast. This is the other piece of data content these companies share that AI labs really need. Can you just talk about what an eval is, the simplest way to understand it, and then how this helps models get smarter?

    2. CH

      So I think when people approach evals, there are two very different problems. One is as an app builder: let's say I have an app that does, like... maybe a chatbot. Very simple, I know; it's the first thing that came to my mind. And I want to know if the chatbot is good or bad, so I need to come up with a way to evaluate the chatbot. The other is what I think of as task-specific eval design. Let's say I'm a model developer and I want to make my model better at, say, creative writing. Okay, but how do you measure creative writing? I would need someone who understands creative writing to think about what makes a story good, and then design the whole dataset and the criteria to evaluate creative writing. So I think it's that eval design that is very interesting: come up with criteria, come up with guidelines for how to do it, and then also train people to do it effectively. So I guess, in a ******, I think eval is really, really fun because it's extremely creative. I was looking at different evals people have built and was like, "Wow." It's not dry at all. It's just super, super fun.
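      As a tiny sketch of what the app-builder side of this can look like in code: a hypothetical chatbot checked against hand-written criteria. The prompts, expected facts, and the fake `chatbot` function are all invented for illustration; a real harness would call your actual application and often an LLM judge as well.

      ```python
      # A hypothetical chatbot under test; in practice this would call your real app.
      def chatbot(prompt: str) -> str:
          canned = {
              "What is your refund window?": "You can request a refund within 30 days.",
              "Do you ship internationally?": "Yes, we ship to most countries worldwide.",
          }
          return canned.get(prompt, "I'm not sure.")

      # Each eval case pairs a prompt with simple programmatic criteria.
      eval_cases = [
          {"prompt": "What is your refund window?", "must_contain": ["30 days"], "max_words": 25},
          {"prompt": "Do you ship internationally?", "must_contain": ["ship"], "max_words": 25},
      ]

      def run_evals(app, cases):
          results = []
          for case in cases:
              answer = app(case["prompt"])
              contains_facts = all(fact.lower() in answer.lower() for fact in case["must_contain"])
              concise = len(answer.split()) <= case["max_words"]
              results.append({"prompt": case["prompt"], "pass": contains_facts and concise})
          passed = sum(r["pass"] for r in results)
          print(f"{passed}/{len(results)} eval cases passed")
          return results

      run_evals(chatbot, eval_cases)
      ```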

    3. LR

      We had a whole podcast on evals with Hamel and Shreya, and that's exactly what they talked about: it's actually really fun to create evals, especially for companies. So let's dig into that one a little bit more. There's this kind of debate online... I don't know how big of a deal this debate is, but it feels like people spend a lot of time thinking about this idea of, do we need evals for AI products? Some of the best companies say they don't really do evals; they just go on vibes. They're just like, "Is this working well? Can I feel it or not?" What's your take on the importance of building evals, and the skill of evals, for AI apps, not the model companies?

    4. CH

      You don't have to be absolutely perfect at things to win; you just need to be good enough and be consistent about it. Okay, this is not a philosophy I follow, but I have worked with enough companies to see it play out. So when I say why a company might not need evals: let's say you are an executive and you want a new use case. Here's a use case you started, you built, and it works well. The customers are somewhat happy. You don't have the exact metric for it, but the traffic keeps increasing, people seem happy, people keep buying stuff. And now here's an engineer coming in: "Okay, we need evals for it." And as the executive, you ask, okay, how much effort do we need to put into evals? And they say, "Okay, maybe two engineers," and you ask, okay, so how much expected gain can I get from it? And the engineer says, "Oh, maybe you can improve it from, like, 80% to 82%, 95%." And you think, okay, but if we were to take those two engineers and launch a new feature, that could give me so much more improvement, right? So sometimes you put eval work in and it's like, okay, this is good enough, just don't touch it. If you spend a lot of energy on evals, you only get incremental improvement, whereas spending that energy on another use case, when the existing one is good enough, might get you more. So I do think that's maybe what the debate is about. A lot of the time people just get things to a place where it's like, okay, good enough, and they move on. But of course, there's a lot of risk associated with it, because if you don't have a clear metric, if you don't have good visibility into how the application and the model are performing, it might do something very dumb, or something crazy can happen. So I do think evals are very, very important if you operate at scale, where failures can have catastrophic consequences. Then you do need to be very tyrannical about what you put in front of users, and understand the different failure modes, what could go wrong. And also, if the feature is core to the product, a competitive advantage, you want to be the best at it, so you want a very strong understanding of where you are and where you are relative to the competitors. But if it's something that's more low-key, not so core, but it helps your users, then maybe you don't need to be so obsessed or tyrannical about it, and you say, "Okay, that's good enough for now." And if it fails, then it fails. (laughs) I know it sounds terrifying, but yeah. It's all about the question of return on investment. I'm a big fan of evals. I love writing evals. And at the same time, I understand why some people would choose to not focus on evals right away and choose to bring on new functionality instead.

    5. LR

      Awesome. That is a really pragmatic answer. What I'm hearing is evals are great and very important, especially if you're operating at scale, but pick your battles; you don't need to write evals for every little feature. Something that Hamel and Shreya shared is that people need just, like, five or seven evals for the most important elements of their product. Is that what you see, or do you see a lot more in production that people build and need?

    6. CH

      Mm, I don't think... (sighs) There's no fixed number of evals. What is the goal of an eval, right? The goal of evals is to guide product development. The reason I'm a big fan of evals is that they help you uncover opportunities, where the product is and isn't doing well. Sometimes it's something that seems very obvious: you look at the eval and realize, okay, it performs really poorly on this specific segment of users. And then you look into it: what's wrong? And it turns out we just don't have good messaging for it. So maybe we should focus on the things we're doing poorly on and can improve significantly. So the number of evals really depends. We have seen products with, like, hundreds of different metrics, right-

    7. LR

      Oh, wow.

    8. CH

      ... that can go, like, crazy. That's because some products are general, so they have different types of metrics: one eval for, I don't know, verbosity, one eval for user-sensitive data, another for length. But okay, let's pick a concrete example: deep research. You have an application, you have built a model to do deep research for you. You have a prompt, and it might say, "Okay, do me comprehensive research on all of Lenny's podcasts and propose... show me a report on what kinds of topics he's interested in, what kinds of videos get the most views, or what topics he's missing that he should be covering." You have that kind of prompt. Then how do you evaluate the result? I don't think there's one metric that would help. Maybe somebody builds a benchmark where they get a hundred experts to write a bunch of prompts and go through all the AI's answers, but that's extremely costly and slow, right? So maybe you do something else. One way I was thinking about it, I was talking to a friend about this, is to look at how the result, the summary, gets produced. First, you need to gather information, and to gather information you need to do a lot of search queries. You grab the search results, and from the search results you aggregate, and then maybe say, "Okay, I'm still missing this," and you do another round, and another round, until you have the summary. So every step of the way, you need evaluations; you don't only need to do it at the end. Maybe for the first step, the search queries: say it writes five search queries. I might look into how good these search queries are. Are they very similar to each other? Because you don't want five search queries that are basically the same, like "Lenny podcast," "Lenny podcast last month," "Lenny podcast two months ago." That's not going to be very exciting; it's better if the keywords are more diverse. And then you look at the results of the search queries. Let's say you enter the search query "Lenny podcast data labeling," and it comes up with 10 pages, 10 results. And then you enter "Lenny podcast frontier labs," and you get 10 results. You might look at the different webpages: how much do they overlap? Are we getting breadth, a lot of pages, but also depth? And do we have relevance? Because the search queries could come up completely irrelevant to the original prompt. So I feel like every aspect of it needs a way of evaluating. So I don't think it's just, how many evals should I have? It's, how many evals do I need to get good coverage and high confidence in my application's performance, and also to help me understand where it is not performing well so that I can fix it?
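      A rough sketch of the kind of per-step checks described for a deep-research pipeline: query diversity and overlap between result sets, computed with simple set math. The example queries and result URLs are invented; a real system would plug in its actual generated queries and retrieved pages, and likely add a relevance check against the original prompt.

      ```python
      from itertools import combinations

      def jaccard(a: set, b: set) -> float:
          """Overlap between two sets: 1.0 means identical, 0.0 means disjoint."""
          return len(a & b) / len(a | b) if a | b else 0.0

      def query_diversity(queries: list[str]) -> float:
          """Average dissimilarity between generated search queries (word-level).
          Near-duplicate queries like 'Lenny podcast' / 'Lenny podcast last month' score low."""
          pairs = list(combinations(queries, 2))
          if not pairs:
              return 1.0
          sims = [jaccard(set(q1.lower().split()), set(q2.lower().split())) for q1, q2 in pairs]
          return 1.0 - sum(sims) / len(sims)

      def result_overlap(result_sets: list[set]) -> float:
          """How much the retrieved pages for different queries repeat each other
          (high overlap suggests a breadth problem)."""
          pairs = list(combinations(result_sets, 2))
          return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

      # Invented example data for illustration.
      queries = ["Lenny podcast data labeling", "Lenny podcast frontier labs", "Lenny podcast evals"]
      results = [{"url1", "url2", "url3"}, {"url3", "url4", "url5"}, {"url6", "url7"}]

      print(round(query_diversity(queries), 2))  # closer to 1.0 = more diverse queries
      print(round(result_overlap(results), 2))   # closer to 0.0 = better breadth
      ```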

    9. LR

      Awesome. And what I'm hearing also is that the very core use case, the most common path people take in your product, is where you want to focus.

    10. CH

      Yeah. So, yeah.

    11. LR

      Okay.

  8. 31:55–38:50

    Retrieval augmented generation (RAG) explained

    1. LR

      Let me... there's one more term I want to cover, and then I want to go in a somewhat different direction. RAG. People see this term a lot, R-A-G. What does it mean?

    2. CH

      So RAG stands for retrieval-augmented generation, and it's not specific to gen AI. The idea is that for a lot of questions, we need context to answer. I think it came from a paper around 2017. Someone realized that for a bunch of question-answering benchmarks, if you give the model information about the question, the answers get much, much better. So what does it do? It retrieves information from Wikipedia: for a question or topic, it retrieves the relevant page, puts it into the context, and the answers are just much better. It sounds like a no-brainer, right? Obviously. So that's what RAG is in the simplest sense: providing the model with the relevant context so that it can answer the question. And that's where things get more interesting. Traditionally, when it started out, RAG was mostly text, so we talk about a lot of ways to prepare data so that the model can retrieve effectively. Not everything is a Wikipedia page, right? A Wikipedia page is pretty self-contained, and everything in it is about one topic. But a lot of the time we have documents that are extremely long and have weird structures. Let's say you have a document about Lenny's podcast, and at the beginning the document says, "From now on, 'the podcast' refers to Lenny's podcast." Now somebody asks, "Okay, tell me about Lenny, Lenny's work." Because the rest of the document does not have the term "Lenny," you might not retrieve it. And the document is long enough that it gets chunked into different parts, so the second part doesn't have the word "Lenny," so you cannot reach it. So you have to find a way to process the data to make sure you can retrieve the information that's relevant to the query, even when it's not immediately obvious that it's related. So people come up with things like contextual retrieval: giving a chunk of data relevant context, maybe a summary or metadata, so the system knows what it's about. Or some people use hypothetical questions, which is a very interesting approach: given this chunk of a document, generate a bunch of questions that the chunk can help answer, so that when a query comes in, you check whether it matches any of the hypothetical questions and fetch the chunk. Okay, so maybe before I go to the next thing, I just want to say this: data preparation for RAG is extremely important. In a lot of the companies I have seen, the biggest performance gains in their RAG solutions came from better data preparation, not from agonizing over which vector database to use. Of course, the vector database matters for things like latency, or if you have very specific access patterns, read-heavy or write-heavy.
      But in terms of pure answer quality, I think data preparation wins hands down.
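      A bare-bones sketch of the RAG loop as described here: chunks carrying the contextual metadata and hypothetical questions mentioned above are scored against the query by simple word overlap, and the best chunk is stuffed into the prompt. Real systems typically use embeddings and a vector store; all the chunk text below is invented toy data.

      ```python
      # Each chunk carries contextual metadata and hypothetical questions it can answer,
      # so a query about "Lenny" can still find a chunk whose body never repeats the word.
      chunks = [
          {
              "text": "The podcast covers product, growth, and AI engineering topics.",
              "context": "From Lenny's podcast; 'the podcast' refers to Lenny's podcast.",
              "hypothetical_questions": ["What topics does Lenny's podcast cover?"],
          },
          {
              "text": "Episodes are released weekly and run about an hour.",
              "context": "From Lenny's podcast; publishing schedule.",
              "hypothetical_questions": ["How often does Lenny's podcast publish episodes?"],
          },
      ]

      def score(query: str, chunk: dict) -> int:
          """Count query words that appear anywhere in the chunk, its context, or its questions."""
          haystack = " ".join(
              [chunk["text"], chunk["context"], *chunk["hypothetical_questions"]]
          ).lower()
          return sum(word in haystack for word in query.lower().split())

      def retrieve_and_build_prompt(query: str) -> str:
          """Pick the best-matching chunk and place it in the model's context."""
          best = max(chunks, key=lambda c: score(query, c))
          return f"Context: {best['context']}\n{best['text']}\n\nQuestion: {query}\nAnswer:"

      print(retrieve_and_build_prompt("What topics does Lenny cover?"))
      ```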

    3. LR

      When you say data preparation, what's an example to make that real and concrete for us to understand?

    4. CH

      So one way, as I just mentioned, is that you have chunks of data, so you can think about how big a chunk should be. Here's a very simple example: say you want to retrieve about a thousand words of context. If a chunk is long, it's more likely to contain the relevant information, so a retrieved chunk gives you more. But if it's too long, say the chunk itself is a thousand words, you can only retrieve one chunk, which is not very useful. If chunks are shorter, you can retrieve a wider range of documents and chunks, but a chunk that's too small may not contain enough relevant information on its own. So there's a real design question in how big a chunk should be. You also add contextual information: summaries, metadata, hypothetical questions. Somebody was telling me that a very big performance gain they got came from rewriting their data in a question-answer format. They have a podcast, right? Instead of just chunking the podcast, they rewrite it as, here's a question, here's the answer, and produce a lot of those; they can use AI for that as well. So that's one example of data processing. Another example I see a lot is people using AI with specific tools and documentation. A lot of documentation today is written for human readers, and AI reading is different, because humans have common sense and context that AI doesn't quite have, even context that only human experts have. So somebody told me a big change for them was this: say you have documentation for a function, maybe in a library, and the documentation says the output, maybe a temperature or something on a graph, should be, like, one, zero, or minus one. A human expert understands the scale and what a one on that scale means, but the AI really doesn't understand what that means. So they actually add another annotation layer for the AI, saying, "Okay, this temperature value means this"; it's not an absolute temperature, it's relative to a scale over there. So it's all this data processing to make it easier for the AI to retrieve the relevant information to answer the questions.
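      A small sketch of two of the data-preparation ideas mentioned: fixed-size chunking with document-level context attached, and rewriting content into question-answer pairs. The splitting rule, chunk size, and example text are simplified stand-ins, not a recommendation of specific settings.

      ```python
      def chunk_document(text: str, doc_title: str, chunk_words: int = 100) -> list[dict]:
          """Split a long document into fixed-size chunks and prepend document-level
          context to each, so a chunk still 'knows' what it is about after splitting."""
          words = text.split()
          chunks = []
          for start in range(0, len(words), chunk_words):
              body = " ".join(words[start:start + chunk_words])
              chunks.append({"context": f"Document: {doc_title}", "text": body})
          return chunks

      def to_qa_pairs(transcript_turns: list[tuple[str, str]]) -> list[dict]:
          """Rewrite (question, answer) turns from a transcript into retrieval-friendly
          QA records, instead of chunking the raw transcript text."""
          return [{"question": q, "answer": a} for q, a in transcript_turns]

      doc = "word " * 250  # stand-in for a long document
      print(len(chunk_document(doc, "Lenny's podcast episode notes")))  # 3 chunks of <=100 words

      qa = to_qa_pairs([("What improves RAG quality most?", "Better data preparation.")])
      print(qa[0])
      ```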

    5. LR

      This episode is brought to you by Persona, the verified identity platform helping organizations onboard users, fight fraud, and build trust. We talk a lot on this podcast about the amazing advances in AI, but this can be a double-edged sword. For every wow moment, there are fraudsters using the same tech to wreak havoc, laundering money, taking over employee identities, and impersonating businesses. Persona helps combat these threats with automated user, business, and employee verification. Whether you're looking to catch candidate fraud, meet age restrictions, or keep your platform safe, Persona helps you verify users in a way that's tailored to your specific needs. Best of all, Persona makes it easy to know who you're dealing with without adding friction for good users. This is why leading platforms like Etsy, LinkedIn, Square, and Lyft trust Persona to secure their platform. Persona is also offering my listeners 500 free services per month for one full year. Just head to withpersona.com/lenny to get started. That's withpersona.com/lenny. Thanks again to Persona for sponsoring this

  9. 38:50–43:19

    Challenges in AI tool adoption

    1. LR

      episode. Awesome. Okay. So you've talked a bit about how you work with companies on these sorts of things: on their AI strategies, on their AI products, how they build, which tools they build, all these things. I want to spend a little time here, 'cause a lot of companies are building AI products, and a lot of companies are not having a good time building AI products. Let me ask a few questions along the lines of what you've learned working with companies that are doing this well. One is, in terms of AI tool adoption, and adoption generally within companies, there's all this talk recently of, like, all this AI hype, but the data is actually showing most companies try it, it doesn't do a lot, and they stop. And so there's all this, "Maybe this isn't going anywhere." So in terms of adoption of tools and AI within companies, what are you seeing?

    2. CH

      For gen AI in companies, I think there are two types of gen AI tooling that I have seen. One is internal productivity: coding tools, chatbots, internal knowledge. A lot of big enterprises have some kind of wrapper around a model, with access to some kind of RAG solution. I think we talked about text-based RAG; I haven't talked about agentic RAG or multimodal RAG yet, but there's a whole very exciting area around that. Basically, it allows employees to access internal documents. For example, somebody asks, "Okay, I'm having a baby. What is the maternal or paternal leave policy?" Or, "I'm having this operation. Does the health benefit cover that?" Or, "I want to refer my friend. What is the process for that?" So a lot of this is internal chatbots to help with internal operations. The other category is more customer-facing, or partner-facing. For customers, a chatbot is a big one. If you're a hotel chain, you might have a booking chatbot, which is somehow massive: there are a lot of booking chatbots, because, I guess... I do have this theory that companies pursue applications where they can measure a concrete outcome. And booking or sales chatbots are very clear, right? What's the conversion rate right now with human operators, and what would the conversion rate be with a chatbot? Somehow that's a very clear outcome, and companies find it easier to buy into those solutions. So a lot of companies have that customer-facing chatbot. So that's another category of tool. And I think that for customer- or external-facing tools, because people are driven to choose applications with clear outcomes, the question of adopting them is really based on whether they see the outcome or not. Of course, it's not perfect, because sometimes the outcome can be bad not because the application idea itself is bad, but just because the process of building it is not that great. (laughs) So it's tricky. For internal adoption of tooling and internal productivity, that's where it gets tricky. I would say a lot of companies, when they think of AI strategy, think of it as having two key aspects: one is use cases, and the second is talent. You might have great data for great use cases, but if you don't have the talent, then you cannot do it. So a lot of the time, at the beginning with gen AI, and sometimes I really admire a lot of companies for this, the executives say, "Okay, we need our employees to be very gen AI aware, very AI literate." So what they do is they start adopting a bunch of tools for the team to use. They run a lot of upskilling workshops. They encourage learning.
      And I think that's a really, really good thing, and they're also willing to spend a lot of money on adoption: giving people ChatGPT subscriptions, personal subscriptions, Claude subscriptions, to get the employees to be more AI literate. And then the other thing is that a lot of the executives say, "Okay, we spend a ton of money on this tooling, but then we don't see it," because you can see the usage, and people don't seem to use the tools as much. And what is the issue? So yeah, that is tricky.

  10. 43:19–45:20

    Challenges in measuring productivity

    1. CH

      Yeah.

    2. LR

      What do you think is the issue? Is it just that they don't know how to use them? Like, what do you think is the gap here? Do you think we'll get to a place of just, like, wow, work is completely different because of AI for a lot of companies?

    3. CH

      The main thing is that it's really hard to measure productivity gains. So I talk to a lot of people, and... First of all, an obvious example is coding. A lot of companies are using coding agents, or AI-assisted coding. And I was asking, "Do you think it helps with your productivity?" And a lot of the time the answers were very hand-wavy. They were like, "Okay, I feel like it's been better, because we have more PRs, we see more code." Okay, but of course lines of code is not a good metric for that, right? So it's really, really tricky. And there's something funny. I do ask people to ask their managers, because I usually work with VP levels, so they have multiple teams under them. So I ask them: ask your managers, would you rather give everyone on the team a very expensive coding agent subscription, or get an extra headcount? And almost every manager would say headcount. But if you ask the VP level, or someone who manages a lot of teams, they would rather have the AI assistant tools. And the reason, people say, is that as a manager you are still growing; you're not at the level where you manage hundreds or thousands of people. So for you, having one extra headcount is big. You want that not for productivity reasons, but because you just want to have more people working for you. Whereas as an executive, you care more about, maybe, business metrics, so you actually think about what actually drives productivity

  11. 45:20–49:10

    The three-bucket test

    1. CH

      metrics for you. So yeah, it is tricky. And I think the question of productivity... I'm not sure whether, fundamentally, some people are more productive with it; it's just that we don't have a good way of measuring productivity improvement. Another thing is that this varies widely. People tell me they notice different buckets of employees having different reactions to AI-assisted tools. I keep going back to coding because it's big and it's easier to reason about. So I get different reports. One person told me that among all his engineers, he thinks the senior engineers get the most output, are the most productive with it. That person is very interesting: he actually divided his team into three buckets. He didn't tell them, obviously. It was, okay, here are the currently best performing, average performing, and lowest performing. And then he ran a randomized trial: they gave half of each group access to Cursor. And over time he noticed something funny: the group that got the biggest performance boost, in his opinion, and he is very close to his team, was the senior engineers, the highest performing. The highest performing engineers got the biggest boost out of it, and the second group was the average performing. So his opinion is that the highest performing engineers are more proactive; they already know how to solve problems, so AI helps them solve problems better. Whereas the people who are typically lowest performing often don't care as much about work, right? So it's easier to just go on autopilot, get it to generate bad code, and just ship it, and they still don't know how to do it. Another company, however, told me that actually the senior engineers are the ones most resistant to using AI-assisted tooling, because they are more opinionated and have very high standards. They're like, "Okay, but AI-generated code just sucks." So they're very, very resistant to using it. So I don't know. I haven't quite been able to reconcile these very different reports yet.
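      A small sketch of the kind of assignment the friend's company reportedly did: stratify engineers by current performance bucket, then randomly give half of each bucket the tool. The roster names and buckets are invented; the point is only the stratified random split.

      ```python
      import random

      # Invented roster: engineer -> current performance bucket.
      engineers = {
          "a": "high", "b": "high", "c": "high", "d": "high",
          "e": "mid", "f": "mid", "g": "mid", "h": "mid",
          "i": "low", "j": "low", "k": "low", "l": "low",
      }

      def stratified_assignment(roster: dict[str, str], seed: int = 0) -> dict[str, bool]:
          """Within each performance bucket, randomly give half the engineers the coding
          tool (True) and half not (False), so the buckets stay comparable."""
          random.seed(seed)
          assignment = {}
          buckets = {b: [e for e, bucket in roster.items() if bucket == b] for b in set(roster.values())}
          for members in buckets.values():
              random.shuffle(members)
              half = len(members) // 2
              for i, engineer in enumerate(members):
                  assignment[engineer] = i < half  # first half of each bucket gets the tool
          return assignment

      assignment = stratified_assignment(engineers)
      print(sum(assignment.values()), "of", len(assignment), "engineers get the tool")
      ```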

    2. LR

      This is so interesting. So just to make sure I'm hearing the story right: there's a company you work with that did a three-bucket test with their engineering team, where they created three groups, the highest performing engineers, mid-performing engineers, and lowest performing engineers, and gave some of them access to, say, Cursor. Was it Cursor?

    3. CH

      Yeah.

    4. LR

      Or what did they give them access to? It was Cursor?

    5. CH

      I think they said it was Cursor.

    6. LR

      Okay, cool. And so within-

    7. CH

      Oh, I didn't work with them. This is more like a friend company.

    8. LR

      Okay, it's a friend's company.

    9. CH

      Yeah.

    10. LR

      So did they give like half of the higher performing engineers Cursor and half not? Or how did they do the split here?

    11. CH

      Yeah. They gave it to half of the entire company, but half of each bucket. Yeah. And then-

    12. LR

      Whoa.

    13. CH

      ... they observed the difference in productivity.

    14. LR

      I see.

    15. CH

      Yeah.

    16. LR

      So how do they even do that? They're just like, okay, you get Cursor, you don't get Cursor. Is that (laughs) how do they do that? That's so interesting.

    17. CH

      Yeah, I didn't get into the mechanics of it. But I was like, I respect you for doing a randomized trial.

    18. LR

      That is so cool.

    19. CH

      Like, 100%. Yeah. Yeah.

    20. LR

      Okay. Wow. How large was this engineering team? Was it like hundreds of people?

    21. CH

      It's not that large. Maybe about 30 to 40. Yeah.

    22. LR

      30 to 40. Okay.

    23. CH

      Yeah.

    24. LR

      Wow. Okay. So they found that the highest performing engineers had the most benefit from using AI tools, then behind them were the middle-tier engineers, and the worst performers-

    25. CH

      Yeah.

    26. LR

      ... were the lowest performers. Okay.

    27. CH

      Yeah. But also not the same everywhere. Um, like some companies-

    28. LR

      Right, right, right, right.

    29. CH

      Yeah, yeah, different.

    30. LR

      Right. This other example you shared of just-

  12. 49:10–55:31

    The future of engineering roles

    1. LR

      they work, which I get. I do feel like the most valuable people right now, other than ML researchers and AI researchers like yourself, are senior engineers, because it feels like so much of what junior engineers do is now done by AI. But an engineer who knows what they're doing, who understands how things work at a large scale, with AI tools giving them basically infinite junior engineers doing their bidding, feels like an extremely valuable and powerful asset.

    2. CH

      Yeah. I definitely see companies appreciating engineers who have a good understanding of the whole system and good problem-solving skills, who think holistically instead of locally. One company told me the way they work is completely different now. They actually restructured the engineering org so that senior engineers are more in the review phase: they write guidelines on what good engineering practice is and what the process should look like, a lot of processes on how to work well. And then the more junior engineers produce code and submit PRs, while the senior engineers do the reviewing. It might be preparing for the future. Another company actually told me something very similar: they're preparing for a future where they only need a very small group of very strong engineers to create processes and review code before it goes into production, and get AI, or junior engineers, to produce the code. But then the question becomes: how does one become a very strong senior engineer?

    3. LR

      Right. That's right. That's right. That's-

    4. CH

      Yeah. I feel like, yeah-

    5. LR

      ... that's the problem.

    6. CH

      Yeah. So I don't know what the process is. I was thinking about that, yeah...

    7. LR

      No one's thinking about it.

    8. CH

      Anybody that has-

    9. LR

      It's just a problem (laughs). We won't have any more in 10, 20 years. There'll be no more engineers because no one's hiring junior engineers. Although I could make the case that junior engineers, people just getting into computer science right now, are AI native. And in theory, you could argue they will become really good really fast if they're curious and aren't just delegating learning and thinking to AI, but are using it to learn how to code well and architect correctly. You could argue they will be the most successful engineers in the future.

    10. CH

      I do think that what you mentioned about learning to architect, I group under system thinking, and I do think it's a very important skill, because AI can help automate a lot of disjointed skills, but knowing how to use those skills together to solve a problem is hard. There's a webinar between Mehran Sahami, who is one of my favorite professors (he was the chair of the curriculum at the CS department at Stanford, so he spent a lot of time thinking about CS education and what students should learn nowadays in the era of AI coding), and Andrew Ng, who is, of course, a legend in the AI space. And Professor Sahami said something very interesting: a lot of people think that CS is about coding, but it's not. Coding is just a means to an end. CS is about system thinking, about using coding to solve actual problems. And problem-solving will never go away, because as AI automates more stuff, the problems just get bigger. The process of understanding what causes an issue and designing a step-by-step solution to it will always be there. As an example, I actually have a lot of issues with AI when it comes to debugging. I'm not sure how much you use AI for coding, but something I've noticed, and also seen from my friends, is that AI is pretty good when you have a very clear, well-defined task: maybe write documentation, fix a specific feature, or build an app from scratch that doesn't have to interact with a large existing code base. But if you're adding something a bit more complicated, maybe requiring interaction with other components, it's usually not that good. For example, I was using AI to deploy an application, and I was testing out a new hosting service I was not familiar with. What AI gives me is the confidence to try new tools; before, trying a new tool meant reading the documentation from the beginning, and with AI I was like, "Okay, just try it out and learn." So I was testing this new hosting service and I kept getting bugs. It was very, very annoying. I asked Claude Code to fix it, and it kept changing things: maybe change an environment variable, fix the code, change from this function to that function, change the language, maybe it doesn't process JavaScript well, I don't know, whatever. It didn't work, and I was like, "Okay, that's it. I'm just going to read the documentation myself and see what's wrong." And it turned out I was on another tier; the feature I wanted is not available on that tier, right? So the issue with Claude Code is that it kept trying to fix things in one component, whereas the issue was in a different component. To understand how different components work together and where the source of an issue might come from, you need a holistic view of it.
      And this made me think: how do we teach AI system thinking? One way could be to have human experts write scaffolds, like, for this kind of problem, look into this, look into that, and so on. But it also made me think: how do we teach humans system thinking? So yeah, I think it's a very interesting skill, and I do think it's very important.
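
      As a rough sketch of that scaffold idea, here is one way a human expert could encode a debugging checklist for a coding agent to walk through before it starts changing things. The checklist items, constant name, and helper function are invented for illustration, loosely inspired by the hosting-tier story above.

      ```python
      # Hypothetical "scaffold": an expert-written checklist the agent must work
      # through in order before proposing fixes, instead of flailing between
      # environment variables, code changes, and language swaps.
      DEPLOY_DEBUG_SCAFFOLD = [
          "Confirm which pricing tier the account is on and whether the feature is available there.",
          "Check the service's status page and quota limits before touching code.",
          "Reproduce the error with the smallest possible example.",
          "Only then consider changes to environment variables, code, or runtime.",
      ]

      def build_debug_prompt(error_log: str) -> str:
          """Wrap an error log in the scaffold so the agent reports what it ruled
          out at each step before suggesting a fix."""
          steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(DEPLOY_DEBUG_SCAFFOLD))
          return (
              "You are debugging a deployment failure. Work through this checklist "
              "in order, and report what you ruled out at each step before proposing a fix.\n"
              f"{steps}\n\nError log:\n{error_log}"
          )
      ```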

    11. LR

      That's exactly the same insight Bret Taylor shared on the podcast. He's the co-founder of Sierra, he created Google Maps, he was CEO of Salesforce, Quip, a few other things. And I asked him, "Should people learn to code?" And his point is exactly what you said: taking computer science classes is not about learning Java and Python. It's learning how systems work, how code operates, and how software works broadly, not just here's, like, a function to do a

  13. 55:31–57:12

    ML Engineers vs. AI engineers

    1. LR

      thing. One thing that I wanted to help people understand: you wrote this book called AI Engineering, which is essentially helping people understand this new genre of engineer. And you have this really simple way of thinking about the difference between an ML engineer and an AI engineer, which has a really good corollary for product managers now, like an AI product manager versus a non-AI product manager. The way you describe it, and fill in what I'm missing, is that ML engineers build models themselves; AI engineers use existing models to build products. Anything you want to add there?

    2. CH

      One thing I really dislike about writing books is that it forces you to define things like this, and no definition can be perfect; there will always be edge cases. But yeah, in general, yes: it's AI as a service. Somebody else builds the models for you, and the base model performance is pretty strong. So it enables people to say, "Okay, I can integrate AI into my product without needing to learn how to create all of this," even though knowing that could really help. It makes the entry barrier really low for people who want to use AI to build products. And at the same time, AI capabilities are so strong that they've also increased the possibilities, the types of applications AI can be used for. So the entry barrier is super low and the demand for AI applications is a lot bigger. I feel this is very, very exciting. It opens up a whole new world of possibilities.
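
      To make the "AI engineer" side of that distinction concrete, here is a minimal sketch of building on an existing hosted model rather than training one. It assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment purely as an example; any hosted model provider would look similar, and the prompt content is invented.

      ```python
      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      # An "AI engineer" workflow: no model training, just calling a model
      # somebody else built and wiring it into product logic.
      response = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": "You summarize customer feedback in one sentence."},
              {"role": "user", "content": "The app is great but exporting reports is painfully slow."},
          ],
      )
      print(response.choices[0].message.content)
      ```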

    3. LR

      Yeah. It's like now you don't even have to spend time building this AI brain. You can just use it to do stuff. Uh-

    4. CH

      Yeah.

    5. LR

      ... such a, such an unlock.

  14. 57:12–1:05:48

    Looking forward: the impact of AI

    1. LR

      Okay, maybe just a final question. You get to see a lot of what's working, what's not working, where things are heading. I'm curious, if you had to think about where things are heading in the next two or three years, how do you think building products will be different? How do you think the way companies work will be different? What's maybe the biggest change you expect to see in the next few years in terms of how companies work?

    2. CH

      I think a lot of organizations don't move that fast, right? But at the same time, they also move faster than I expected (laughs). Again, I think I'm biased: I don't work with dinosaur companies, and the executives who come to me are usually very forward-looking, so maybe I'm biased toward organizations that move fast. One big change I see is in organizational structure. Before, we had a lot of disjointed teams: a very clear engineering team, product team. But then there's the question of who should write the evals, right? Who should own the metrics? And it turns out eval is not a separate problem; it's a system problem. You need to look into the different components and how they interact with each other. You need to look at user behavior, because you need to know what users care about so you can write evals that reflect what users care about. Some of that you can sort out by looking at the component architecture and placing guardrails, which is engineering, but understanding users is a product question, right? So because evals are so important, that kind of brings the product team, the engineering team, even marketing and user acquisition, very close to each other. So I see changes in how people structure teams, with more communication between previously very distinct functions. Another thing I see is teams, of course, thinking about what can be automated in the next few years and what cannot be automated. And I've seen people already sharing... actually, it's a little bit scary to think about, but I also think the teams

    3. LR

      (?) .

    4. CH

      It's like, "Okay, this is between you and me, but we have really gotten rid of these functions," right? A lot of things that were previously outsourced, for example traditional business process outsourcing that's not core to the company, can be even more systematized, and with that you can actually use AI to automate a lot of it. There's also the question of what the value of junior versus senior engineers is, and how to restructure the engineering org around that. So I do think one big change is that inside organizations, people are moving pieces around: thinking about use cases, whether to spin out new use cases, and who should lead the new effort. That is one big change. Another thing, in terms of the AI itself: I'm not sure how true this is, but I'm also in the camp of thinking it has merit, the camp that says the base models are probably not quite maxed out, but we're unlikely to see really, really, crazily strong new models. Remember when we had GPT, and then GPT-2 was a big step up, an order of magnitude better? Then GPT-3 was much, much bigger, GPT-4 much, much bigger. And of course there's GPT-5, but is GPT-5 that same scale of step jump compared to the previous one? I think that's debatable. So I think base model performance improvement is not going to be as mind-blowing as it was in the last three years, and we're going to see a lot of the improvement in the post-training phase and in the application-building phase. That's where I expect to see a lot of improvement. I'm also very interested in multimodality. We've seen a lot of text-based use cases, but I think there are a lot of audio and video use cases that are very, very exciting. And I think audio is not as solved as people think, because I do work with a couple of voice startups, and voice is an entirely different beast. Say you have a chatbot and you go from a text chatbot to a voice chatbot: the concerns are completely different, because now you need to think about latency. There are multiple steps, first voice to text, then the text question to a text answer, and then text to a voice answer. So you have multiple hops, and latency becomes very important. And there's the question of what makes it sound natural. For example, when humans talk to each other, if you try to interrupt me and say, "Chip, that's right," I would pause and try to hear you out. But sometimes you just say a word to acknowledge me, "Mm-hmm, mm-hmm," and I shouldn't stop; I should just continue. So the question of false interruptions, whether I should stop or not, is a big part of what's perceived as a natural conversation.
      And there are also regulations, right? A lot of the time people want to build voice chatbots that sound like humans, maybe even to trick users into thinking they're talking to a human. But there may also be regulations saying you have to disclose to users whether the bot talking to them is a human or an AI. So I think the whole space is not quite as solved as you'd think. But it's also not purely a foundation model problem, because human interruption detection is actually a classical machine learning problem; it's a different framing, but you can build a classifier for that. Or the question of latency: that's actually a massive engineering challenge, not an AI challenge. Of course it can be an AI challenge too, because people are trying to build voice-to-voice models: instead of first transcribing my voice into text, then getting a model to generate a text answer, then getting another model to turn that text into speech, you send voice to voice directly. That is something being worked on, but it's very hard. So yeah, even audio, which I think of as easier than video, since video has both image and voice, is already pretty hard. So I think there are a lot of challenges in that space.
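
      As a rough illustration of the multi-hop pipeline and why latency stacks up, here is a minimal sketch. The functions `transcribe`, `generate_reply`, and `synthesize_speech` are hypothetical placeholders for whatever speech-to-text, LLM, and text-to-speech services you wire in; nothing here names a specific product from the episode.

      ```python
      import time

      # Hypothetical placeholders for the three hops in a cascaded voice agent.
      def transcribe(audio_chunk: bytes) -> str: ...        # speech -> text
      def generate_reply(user_text: str) -> str: ...        # text -> text (the LLM)
      def synthesize_speech(reply_text: str) -> bytes: ...  # text -> speech

      def handle_turn(audio_chunk: bytes) -> bytes:
          """One conversational turn; every hop adds latency the user can feel."""
          t0 = time.perf_counter()
          text = transcribe(audio_chunk)
          t1 = time.perf_counter()
          reply = generate_reply(text)
          t2 = time.perf_counter()
          audio_out = synthesize_speech(reply)
          t3 = time.perf_counter()
          print(f"stt={t1 - t0:.2f}s llm={t2 - t1:.2f}s tts={t3 - t2:.2f}s total={t3 - t0:.2f}s")
          return audio_out

      # Interruption handling is the separate, classical ML problem mentioned above:
      # given features of an incoming utterance (length, overlap with the bot's speech,
      # backchannel words like "mm-hmm"), a classifier decides whether to keep talking
      # or stop and listen. Such a classifier would sit in front of handle_turn().
      ```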

    5. LR

      That was an awesome list of things. Let me mirror it back real quick. So these are things you're predicting will change in the way we work over the next few years, and they resonate with so many conversations I've had on this podcast, so this is doubling down on where things are heading. One is the blurring of lines between different functions; instead of just design or engineering, everyone's going to be doing a lot of different things now. Two is more work being automated with agents and all these AI tools, and, in theory, productivity going up. Third is shifting from pre-training models to post-training, fine-tuning and things like that, because, to your point, models maybe are slowing down in how smart they're getting. Although I'll point folks to the chat with the co-founder of Anthropic; he made a really good point here. He's like, "We're really bad at understanding what exponentials feel like when we're in the middle of one. And models are also being released more often, so we may not notice the difference between them because releases just happen more often, versus GPT-3 coming out, I don't know, a year after GPT-2." So maybe true, maybe not. And then the fourth point you made is this idea of investing in multimodal experiences. I cannot wait for ChatGPT voice mode to get better at interruption. Exactly what you're saying: I'm just talking to it, and then someone makes a little sound and it's like, "Oh, okay," and it stops talking. It's so annoying.

    6. CH

      I'm shocked that we don't have better voice assistants at home yet. I have been testing out a bunch. Honestly, I keep hoping, "Oh my God, this could be the one," and then I don't know how many of them I've had to abandon because

  15. 1:05:48–1:08:23

    Model capabilities vs. perceived performance

    1. CH

      they're not that good.

    2. LR

      I think it's coming. I hear it's coming. Anthropic's working with someone; I don't know if it's launched yet or not.

    3. CH

      Yeah. So I want to come back to what your guest from Anthropic mentioned about performance improvement. I think there's a big distinction there: there's a difference between a model's base capability, so I'm still talking about the pretrained model, and its perceived performance. Are you familiar with the term test-time compute?

    4. LR

      Uh, I don't think so.

    5. CH

      Yeah.

    6. LR

      Help us understand.

    7. CH

      So the idea is that you have a fixed amount of compute. You spend a lot of compute on pre-training, or training the model, and then some compute on fine-tuning, and the ratio of pre-training to post-training compute varies wildly between models. And then you also spend compute on inference: once you have trained and fine-tuned the model, you want to serve it to users, so I type a question or a prompt and it generates an answer, and that inference requires compute. So there's a discussion of whether you should spend more compute on pre-training, fine-tuning, or inference. And people found out that test-time compute, meaning spending more compute on inference, as a strategy of allocating more compute resources to generation, can actually bring better performance. How does that work? Let's say you have a math question. Instead of generating just one answer, the model can generate four different answers and pick whichever is best according to some standard. Or it generates four answers, and maybe three of them say 482 and one of them says 20; three of them are in agreement, so the answer should be 482, right? So you can just generate a bunch of answers. Another approach is reasoning, or thinking: you get the model to generate more thinking tokens and spend more time thinking before showing the final answer. It requires more compute, but it also gives better performance. So from the user's perspective, when the model spends more time exploring different potential answers and thinking longer, it can give you much better final answers, but the base model itself does not change.
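
      Here is a minimal sketch of the majority-vote flavor of test-time compute described above, sometimes called self-consistency. The `sample_answer` helper is a hypothetical stand-in for one model call with nonzero temperature; the function name and the arithmetic example are illustrative, not from the episode.

      ```python
      from collections import Counter

      def sample_answer(prompt: str) -> str:
          """Hypothetical helper: one model call with temperature > 0,
          returning only the final answer string."""
          raise NotImplementedError

      def self_consistent_answer(prompt: str, n_samples: int = 4) -> str:
          """Spend extra inference compute: sample several answers, majority-vote.
          E.g. if three samples say '482' and one says '20', return '482'."""
          answers = [sample_answer(prompt) for _ in range(n_samples)]
          best, _count = Counter(answers).most_common(1)[0]
          return best

      # Example usage (once sample_answer is wired to a real model):
      # self_consistent_answer("What is 2 * 241?", n_samples=4)
      ```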

    8. LR

      Awesome.

    9. CH

      Does it make sense? Yeah. Yeah.

    10. LR

      Yes. That does. Absolutely.

    11. CH

      Yeah.

    12. LR

      That is a good corollary to Ben Mann's point.

  16. 1:08:23–1:11:32

    Lightning round and final thoughts

    1. LR

    2. CH

      Yeah.

    3. LR

      Chip, we covered a lot of ground. I've gone through everything I was hoping to learn and more. Before we get to our very exciting lightning round, is there anything else that you wanted to share? Anything else you want to leave listeners with?

    4. CH

      So, I do work with a few companies that want employees to come up with ideas. There's a big debate on what the better way to set strategy is: should it be top-down or bottom-up? Should executives come up with one or two killer use cases and have everyone allocate resources to those, or should you let engineers, PMs, and smart people come up with ideas? And I think it's a mixture of both. So some companies say, okay, we hired a bunch of smart people, let's see what they come up with, and they organize hackathons or internal challenges to get people to build products. And one thing I noticed is that a lot of people just don't know what to build. And it shocked me. I feel like we are in some kind of idea crisis, right? Now we have all these really cool tools that let you do everything from scratch: they can help you design, they can help you write code, they can help you build websites. In theory, we should see a lot more. But at the same time, people are somehow stuck; they don't know what to build. And I think maybe a lot of it has to do with societal expectations, because we have gone into this phase of specialization. People are highly specialized and are supposed to focus on doing one thing really well instead of seeing the big picture, and when you don't have a big-picture view, it's hard to come up with ideas of what to build. So when I worked with one of these companies on their hackathon, we came up with a guideline for how to come up with ideas. One tip is: look back at your last week. For a week, just pay attention to what you do and what frustrates you, and when something frustrates you, think about whether it could be done a different way so it's not frustrating. And people can swap notes across teams, and if you see common frustrations, maybe that's something to build around. So I feel like noticing how we work, constantly asking, "How can this be better?", and then building something to address the frustrations is a good way to learn and adopt AI.

    5. LR

      I think people have felt exactly what you're describing every time they open up one of these vibe coding tools, where you can describe anything you want, and you're like, "I don't know. What do I want?"

    6. CH

      (laughs)

    7. LR

      And I love this very tactical piece of advice: just pay attention to what frustrates you. For example, I just built a very cool little vibe-coded app. I was working on a newsletter post inside Google Docs, and I pasted all these images into the Google Doc from screenshots and stuff, and then I remembered, oh yeah, you can't take images out of Google Docs. It's like this Hotel California experience: you can paste stuff into it, but it's very hard to get images back out. So I just went to the vibe-coding tools and built an app where I can give it a Google Doc URL and it lets me download all the images automatically, and it worked amazingly well, and I made it really cute, and I'll link to it in the show notes.

Episode duration: 1:22:35
