
Stanford CS230 | Autumn 2025 | Lecture 6: AI Project Strategy

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai

October 28, 2025

This lecture provides walkthroughs of examples of AI projects and making day-to-day decisions in building AI systems.

To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning
To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/
More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X

NOTE: There was no class on November 4, 2025 (Lecture 7). The next lecture is Lecture 8.

Andrew Ng
Founder of DeepLearning.AI
Adjunct Professor, Stanford University’s Computer Science Department

Kian Katanforoosh
CEO and Founder of Workera
Adjunct Lecturer, Stanford University’s Computer Science Department

Andrew Ng (host)

Nov 5, 2025 · 1h 15m

EVERY SPOKEN WORD

65 min read · 13,292 words

0:05 – 2:37
Why AI project strategy is about development speed, not just algorithms
1. ANAndrew Ng
  So what I want to do today is, um, continue our discussion on AI project strategy. So if you're building a deep learning system for some task, and today, um, for the first part of today, I'm gonna use a speech recognition, uh, voice-activated device example. And then for the second half I'm gonna use a kind of AI deep researcher example. But what I want to do is, um, walk you through a couple concrete examples of, you know, projects you might work on, um, and let you understand what it feels like to be in the thick of building an AI system, making day-to-day decisions on what to do next. Uh, I find that, I, I think as you've heard me say before, understanding the algorithms is important. So in this class you learn a lot, you know, from the online videos about the, uh, deep learning algorithms, how to build pipelines. But even beyond understanding how the algorithms work, what really drives performance is a team's ability to have an efficient development process. How do you tune the parameters? How do you collect data? When you try something and it doesn't work the first time, which it often doesn't, what do you do next? The skill in making those decisions is what often makes a massive literally 10X difference in productivity. And as I was reflecting, was preparing for, you know, what to say to you today, I was reflecting on quite a few projects where, um, that 10X difference in productivity is, is, is, is really not an exaggeration, right? Maybe more than 10X. But I've literally seen many teams in many, you know, well-known companies with good brands, um, spend a year working on a project that I will see a more skilled team execute in, like, a month, right? So these differences in skill are real. And one of the challenges, um, finding people learning this is if you work for some company, maybe you work on a different project every year or two, right? But so it takes you, like, you know, two years of your... 
Two years or a year of your life to gain experience on one more project, and then after, I don't know, 10 years, you've finally seen 10 projects and are pretty experienced. Uh, but what I want to do is in today's class walk you through a few concrete examples of projects that are similar to the ones... They're kind of simplified versions of stuff that I've seen myself, to try to accelerate your hands-on experience looking at these projects and thinking through if you are the one in the hot seat building a system and it works or doesn't work or this problem or whatever, making those decisions for what you would do. So try to get
2:37 – 7:12
Motivating product: an offline voice-controlled device with named wake phrases
1. ANAndrew Ng
  you through that, um, today with a couple examples rather than you having to spend years and years of your life to finally see a small number of examples of how these projects can be driven. Right? So the first multi... The first of two motivating examples I'm gonna use today is, um, building a voice-activated device. Uh, in my house I have, you know, like a, a Amazon Echo, right? I have a lot of them, actually, and I think it's a delightful experience. But those devices like that, uh, Amazon Echo or Google Home or the, um, Apple Siri home pod, uh, they require quite a bit of setup, right? They require some set... Connect to the Wi-Fi, figure out a way to connect to your phone, blah. And so, you know, uh, actually for a long time, even though I, I built smart speakers, um, for a long time in my house I had one light bulb connected to my home Wi-Fi internet, you know, because it's just so much of a hassle to set things up. I, I think now I have two light bulbs connected in my house. I guess I should, I should connect more stuff. Um, but so for this motivating example I want to talk about if you are, uh, part of a startup building a new product that, um, makes it much easier to get these voice control devices without the user needing to do this whole Wi-Fi setup process. So, um, you know, I'm not, like, very good at drawing, but if you could go to some store and buy a desk lamp and the desk lamp already has a name, I'll just call this lamp Robert. And you could just take it, plug it into the electricity, plonk it on your desk, and then say, "Robert, turn on," and then it turns on. Say, "Robert, turn off," then it turns off without needing to be connected to the internet, using cloud access and all that. Then that would give users a easier setup experience. Um, there's a project that my friends and I actually discussed a few years ago that we thought would actually be a decent startup idea. 
We decided not to do it because there were too many other ideas that we were even more excited about, but we felt that it was actually, you know, like, a reasonable, reasonable startup idea to build a little IC circuit, a little integrated circuit, to sell to, say, lamp manufacturers and other device manufacturers to make it really easy if, if, say, some company sells lamps, to build little things so they can very quickly make their devices voice-enabled. And if you have a few pre-built names, maybe give the users a choice. You can call your lamp Robert or Lena, you know, or, or, or I don't know, or Johnny or, I don't know, Alice or whatever. Um, um, have a little switch, then users could, um, just buy a lamp, put it down, and then immediately have it be voice-controlled without needing to worry about how do you get this onto my Wi-Fi network and if my internet is down then my whole house stays dark as well because of things like that, right? Um, and, and I actually did once have an... I was building a lot of voice assistants. I actually set up my office to have a lot of voice control devices. So had different names of different lamps. Standing desk had its name as well. So I'd say... I forget what my desk name was. You know, like, uh, you know, "Jonathan, go higher," or whatever, and then my standing desk would go up and down. It was actually pretty cool. Um, so what I want you to... So, uh, just for today's illustration... Probably we should give the different, uh, devices different names because you can't have every device in your house called Robert. Otherwise, you say, "Robert, turn on," the whole house illuminates. Then, "Robert, turn off," whole house is plunged into the darkness. So we found out you do need different devices to have different names. Uh, but just for today, I'm going to use, um, as illustration the task of training a neural network or building a system that detects when someone says, "Robert, turn on."
And you need kind of, "Robert, turn on," "Robert, turn off." If you have a choice of names, it'd be "Lena, turn on," "Lena, turn off," and a handful of options of names. But just for simplicity, I'm gonna worry only about detecting the phrase, um, "Robert, turn on." Right? And whatever we do for this, you can rinse and repeat to get the turn off command and a handful of other names to make it user selectable what, you know, what name they want to give this thing. Okay? So, um, this is something that'll need to run on device, uh, small IC circuits. And my... And, and I'm gonna ask you a question, right? Um, when
7:12 – 12:38
How to start as CTO: pick a fast first build, then iterate
1. ANAndrew Ng
  you've graduated from CS230 or maybe when you graduate from Stanford, if you are the CTO of a startup responsible for building this, um, what would you do, right? So call out. So i-imagine you just graduated from CS230 or just graduated from Stanford, and you're the CTO of a startup, and you want to build this lamp that can turn on when anyone says, "Robert, turn on." How would you approach this problem? And I know this is an incredibly open-ended question, and it turns out life is incredibly open-ended, right? You graduate from CS230, you have to decide what to do. So if this is what you're doing, uh, feel free to raise your hand and call out. If you want to build this product, what's the first thing you'd do? How would you think about it? Go for it.
2. SPSpeaker
  I think to take it from here, uh, it speaks to the exponent and then it speaks to the exponent to have the actual work and then kind of operate the work with proper knowledge. Like, uh, like bring up, um, to other ways.
3. ANAndrew Ng
  Yeah. Cool. Yeah, great. So get some, like, uh, open source speech-to-text model or something, and then see if it runs. Yeah, that'd be a good start. Anything else? Yeah, go ahead.
4. SPSpeaker
  It's created with three models. Um, increasing the complexity, the first one would detect audio, uh, just going audio. The second one would detect the word Robert, and the third one would try to parse the phrase that comes after that.
5. ANAndrew Ng
  Okay. Sorry. Three models, one to detect Robert, and then one that-
6. SPSpeaker
  Well, the first one detects the word sound.
7. ANAndrew Ng
  I see.
8. SPSpeaker
  The second one detects the word Robert, and the third one tries to understand what the sentence after the word Robert.
9. ANAndrew Ng
  I see. Cool. Right. Okay. Right. So, so, uh, uh, three models to detect Robert, understand the rest sentence, or detecting the sound. Cool. Go for it.
10. SPSpeaker
  I think it would be like-
11. ANAndrew Ng
  Oh, sorry. Say again.
12. SPSpeaker
  Like, you... I would do it a bit differently because like if we have a model that we detect Robert, then let's say you want to expand the company and have different services and things.
13. ANAndrew Ng
  Mm-hmm.
14. SPSpeaker
  Then you have to repeat like the same process for Jane, Alice, or... So instead, I think what you could do is take a model that takes, uh, given like those strings auto prompt and another auto prompt and see like train it to identify like the two policy the same as well.
15. ANAndrew Ng
  Okay. Oh, I see. Oh, like a... It's like a Siamese network. We actually teach a Siamese network later in this course where something that inputs two audio files and decides they're saying the same words, so you can more easily generalize to new words than, than Robert. Cool. No, that's actually pretty interesting. Yeah. Go, go for it.
16. SPSpeaker
  I, I don't know how to make, uh, these devices work together, but I think it may be simpler way to just have some plugin on the phone and like have it run remotely through that. So like, kind of like it's like Siri. Uh, so if I could go like image, uh, like activate it on my phone, my phone would just then call like turn on, turn on to a device.
17. ANAndrew Ng
  Oh, oh, sorry. You mean just, uh, connect up your device to Siri?
18. SPSpeaker
  Uh, yes. Kind of like a neural network on my phone and the iPhone would send device, uh, the signal to the device to turn it on.
19. ANAndrew Ng
  I see. Okay, cool. Yeah. Right. Yeah. That sounds interesting. It sounds like a different product than this if, if we need to connect to your cell phone and all that though. Yeah. Cool. Right. So l-let me just, um... Lots of interesting ideas. Uh, let me just make some observations. Um, so I find that when building software products, um, uh, there are actually lots of good ideas or lots of reasonable things you could try. But, you know, as, as you heard me mention a few weeks ago, I think one of the strongest predictors for the odds of you building something compelling is speed. So I find that, um, of all of these ideas, I think some are better than others, but it doesn't... But, but whether the, the idea is, you know, a bit better or a little bit worse, it is important, but it's actually secondary to how quickly you can just get something built. So if you're actually the CTO of a startup like this, I would encourage you to look and say, "All right. What can we build, you know, today?" Or, "What can we build maybe in a week?" And, um, try out any of these architecture choices and build it and see what happens. Because even if what you build is a little bit less, you know, good, um, you can find that out in two days, you know, then you course correct very quickly. Um, I've actually wor- I've actually built a lot of smart speakers, um, so I have maybe firsthand experience of this, and I'll just share some things that I happen to know that there's no reason you would know. But it, it turns out that, um, let's see. At least today, general purpose, um, speech recognition is still a little bit heavyweight. Uh, takes quite a lot of, you know, processing power. Uh, is a bit expensive to run on an edge device if you wanna make this, like, a just a few dollars. Um, but it turns out that if you look at the, um, smart speakers, uh, there's usually, um, uh, uh...
If you want to train a neural network just to detect one phrase, uh, be it a phrase like, you know, uh, "Okay, Google," or, "Hey, Siri," or, "Alexa," or whatever, any of the smart speaker trigger words, that can be done with a fairly small
12:38 – 18:13
Literature search tactics and leveraging experts for acceleration
1. ANAndrew Ng
  neural network. Although to your point, if we want to do different neural networks, different words, we'll need to swap out different neural networks and rinse and repeat that, that, uh, uh... But if you have only a small handful of names, phrases we want to detect, I think, I think that'd be okay. And then one other piece of advice I would give, um, if you're embarking on this is the first thing I would do, actually, if I was working on this for the first time is, is actually a literature search. Um, and it turns out that, uh, you know, we're... I think we're fortunate that the AI world has a ton of open, uh, source software and a ton of, um, uh, open research papers. Somewhat surprisingly, despite smart speakers having been around for a long time, there still to this day isn't, like, a single architecture that everyone's agreed on on the best way to do this. If you look at the literature, there is still actually a diversity of opinions on how to do this type of, uh, wake word or trigger word, uh, we sometimes just, uh... Uh, so when you say something like, you know, uh, "Okay, Google," or, "Hey, Siri," or, "Alexa," that's sometimes called a wake word 'cause it wakes up the device, or a trigger word triggers the device's wake activity. So somewhat surprisingly, even though we've had smart speakers for a long time now, for a lot more than a decade, um, uh, there still isn't a single agreed on unified architecture that the community has agreed on on what's the best algorithm to do this. But I feel like if you are embarking on this for the first time, um, the number one boost in your speed of learning, it could be implementing something, but I would say doing a literature search and trying open source software would be the even faster accelerator. And I wanna just give you a few tips for that, right? Um, real quick. So, you know, today, if, if this is, um, uh... 
There are a lot of research articles and blog posts and open GitHub repos on, uh, m-many topics, certainly wake word detection. And what I find is that, um, if this is research paper one, research paper two, research paper three, research paper four, I find that, you know, sometimes people will, um, spend a lot of time reading research paper one until you're done, right? It's just 0% complete and then it's 100% complete. And they spend a lot of time reading research paper two, spend a lot of time reading research paper three. Um, and I just recommend you not do that. Instead, when I'm doing a literature search, what it often feels like is, you know, do a few web searches for a handful of resources, skim all of them, 0% complete, 100% complete, right? Based on your initial reading, you may decide to go back to paper C and spend more time to really read and understand that, but that'll help you find additional references. You can skim, and maybe you find a paper seven that's really seminal. Spend a lot of effort. But this is what doing a very broad survey of the literature will feel like, where, you know, you, you, you really put in the time to finish only a very small number of resources, but spend a lot more time skipping around and getting a cursory level understanding of a broader set of papers, which can also point you to, um, also point you to the more useful resources to focus attention on. And just one other thing I've seen among Stanford students, um, uh, there's one other thing that I find people tend to underuse, which is, um, trying to talk to experts. So if you are actually a CTO of a startup trying to build this for the first time, um, I feel like, you know, we all wanna do our own work and not bother other people, which is good. But I just encourage you to consider, um, after you've done your own work, if reaching out to an expert can really accelerate your learning. So something I've often done is, you know, I will do my own work, right?
I don't wanna call random experts to try to bother them before I've at least done my work. But if you've, uh... Sometimes I'm reading a paper, I'm really struggling to understand it, and I find that instead of me struggling for another, like, four hours, um, if I send the authors a respectful email, uh, and say, "Hey, I read your paper. Tried to understand this. I'm still confused. Can you help me out? Can you explain this to me?" A lot of the time, um, uh, i-including when, you know, I, I was less well-known, right? But I, I think a lot of people, if they see that you're doing your work and not just reaching out to them before you've even done anything, a lot of them, not everyone, but a lot of, um, research authors will actually be quite, you know, understanding and, and, and try to help you out. And so I find that for a lot of projects I did, um, finding that one expert, sometimes a, sometimes a Stanford professor actually. We have a lot of speech faculty and Stanford faculty as well. But, um, when I had problems with my speech recognition systems, you know, sometimes I call, like, oh, Dan Jurafsky, whatever, and then, like, a half-hour conversation or even a 10-minute conversation really accelerates, um, what I've been able to do. So I just encourage you to, um... It, it takes you, like, 10 minutes to send an email, and maybe there's a 50% chance of a response. I don't know, right? Not 100% chance, but sometimes that tends to be really high ROI, right? And then what I think you find is that, um, if you do a literature search, you will likely discover that, uh, most of the robust enterprise-grade smart speaker systems all have a specialized system trained to detect the wake word or the trigger word, right? And again, there's a variety of architectures. You probably come up with some good neural network architectures for this. Um, and, and, and-- Oh, it turns out that, uh, there is no data set on the internet with lots of people saying, "Robert, turn on,"
18:13 – 21:59
Data acquisition: collecting “Robert, turn on” (and respecting consent)
1. ANAndrew Ng
  right? That's just not a thing, right? So if you decide that the names are Robert and Lena and, you know, I don't know, Jeremy and Alicia and whatever. Sorry if they're people with those real names. It was actually kind of weird when we chose people's names, uh, but may- but we just did that for some reason. But it turns out that there is no large data set on the internet of lots of people saying, "Robert, turn on." So if you want to train a neural network to detect if someone has said this phrase, you need to collect that data yourself, right? Um, so let me then ask you another question. Um, let's say you've done a literature search, found some open source code you can try, but that has led you, say, to conclude that you need, uh, data sets of people saying, "Robert, turn on." Um, uh, you, you need a data set to train to distinguish between someone saying, "Robert, turn on," versus not someone saying, "Robert, turn on." Where would you-- How, how would you approach getting a data set like that? So yeah, go for it.
2. SPSpeaker
  Text to speech.
3. ANAndrew Ng
  Text to speech. Mm-hmm.
4. SPSpeaker
  Use a data
5. ANAndrew Ng
  Yeah. Cool, yeah. Yep, you use text to speech as one method. Um, this-- uh, I'll come back to that later, so you need that notes. Go, go.
6. SPSpeaker
  Just take all those who visit your campus and ask people to speak.
7. ANAndrew Ng
  Yeah.
8. SPSpeaker
  And record it.
9. ANAndrew Ng
  Yeah. Yep, cool. Yep, I like that. Yeah, so walk around and ask people, tell them what you're doing, get their permission to record and use their voice, and then just if they're okay giving permission, just record their voices. Yeah, I like that. Anything else? Cool. Um, yeah.
10. SPSpeaker
  You could turn on like samples where if you have a basic like maybe like parent speech, like the phrase like trigger to offset is set and then that was procedural. But in a training set where it triggers-
11. ANAndrew Ng
  Oh, sorry. Can you say that again? Uh-
12. SPSpeaker
  So like you mentioned use that rule, like you should have samples where they're all the turn offsets and then there's like also samples where they're not.
13. ANAndrew Ng
  Oh, yep. Cool.
14. SPSpeaker
  Like indication.
15. ANAndrew Ng
  Yeah. Yep, cool. Yep, right. So samples of people saying not Robert, turn on. Not people... Yeah, yep. Agree. Yeah, cool. Go ahead.
16. SPSpeaker
  Um, kind of in addition to that, so you have like going around and asking people to say, "Robert, turn on," I would also have created sentences with the word Robert in it and then classify those as not, uh, as in zero. Like I don't want it to activate.
17. ANAndrew Ng
  I see. Cool. Yeah, actually, actually, you're right. So the, the data is what people are saying, Robert, but you don't want that to be a signal. That's actually a good point. I'll come back to that later. Um, yes. Did you say something?
18. SPSpeaker
  Oh, yeah, same.
19. ANAndrew Ng
  Oh, same thing. Oh, interesting. Cool. Awesome. Great. All right. So, um, one of the, uh... Yes. Uh, so I really like the idea of, you know, just going around and asking people for permission to record their voices. And then by the way, in today's world, privacy is important. Consent is important. So, you know, don't do anything sneaky or weird, right? Just tell people clearly what you're doing, ask them for permission, make sure any permission is freely given, and if they don't give permission, it's fine. Move on. I really respect people's privacy. That, that is really important. Having said that, I find that a lot of people in the world are very nice. Not everyone, but the vast majority of people in the world seem very nice, and if you ask them nicely, you know, I've g-- I, I know this because I've done this myself. They will give permission for you to rec-- you know, provide their data, right? Uh, um, so and, and then just one other framing I encourage you to think about as always is, is the speed. So how long will it take you to wander around campus or wander around San Francisco, um, and collect a sample of voices? And I think you actually get a lot done. You'll get maybe dozens of samples,
21:59 – 27:03
Synthetic speech data: useful, but usually not the first step
1. ANAndrew Ng
  maybe hundreds of samples easily in a, in a day. So I think those are good tactics. Um, and then, uh, let me s-share with you some things. And, and then, um, it turns out synthetic data using text to speech is an interesting tactic. Um, I would usually not use that as the, the first thing I do, um, mainly because it turns out, um, uh, I just-- It's hard to know, um, how accurate synthetic data is compared to, um, true, you know, natural data. And maybe just share one thing I've run into. I've actually, you know, done a lot of synthetic data for speech recognition, right? One thing you have to watch out for is if you go to a lot of the synthetic, you know, sources of data, um, how many voices, how many different voices does it provide? And is it gonna be a pain to get enough, you know, diversity in different people's voices? So it turns out, um, I don't know, like for example, I, I often, you know, talk to, um, OpenAI voice on my phone, right, instead of typing, but the number of voices there is limited. And if you go to a TTS provider, I guess there are now some services with a larger number of voices, but these are things you end up worrying about. Uh, and it's all solvable. It's, it's-- Sy-synthetic data does work for speech recognition, right? But there-- it turns out that there are often enough details associated with fiddling with the synthetic data generation process that that ends up taking longer. So for a lot of, um, machine learning applications, you know, using synthetic data is a good idea. And eventually you might get around to it, but using synthetic data is usually not the first type of data I would collect, uh, because usually there are just too many hyperparameters and too many knobs you have to worry about. And then at the back of your head, you're wondering whether there's something, you know, weird about the synthetic data that I had not thought of before. 
Uh, and, and, and once you collect natural data, collect real data, it's just one fewer-- one, one less thing to worry about. Um... Maybe just to, just to tell one more story. Um, not a speech, but, uh, self-driving cars, right? So if you're building self-driving cars, you want to detect other cars. Um, where do you get pictures of cars? Uh, it turns out that a lot of people will have the idea of, oh, there are lots of video games with cars driving around. Why don't we use video games to get pictures of cars out of the video game? Um, but it turns out a problem with a lot of video games is there could be like twenty different cars in the entire video game, depends on the video game. But it turns out that to have a realistic video game, you know, you don't need a thousand different cars, and there are tons of different car designs on the road, but you need only a very narrow set of cars. And so to a human, you know, seeing the same twenty cars over and over it looks like the road looks fine, the video game plays fine. But for a lot of video games, you know, the data just isn't rich enough to capture anywhere near the richness of the real world. Whereas in contrast, if you've got real pictures, that's just one less thing to have to worry about, right? So I find that for synthetic data, it works, it's very valuable to use it a lot, but I usually get to it only later in the process. That make sense? Cool. So, um, and then in the interest of building these things quickly, let me share with you the types of things that, um, you know, my teams have done to collect data for this, right? And, and I, I, I feel like, you know, when, when you read research papers or you take, you know, courses, often you get a very clean view of data. I'm gonna tell you, you know, uh, about one of the weird random hacks that one of my teams has used to build like a very serious working really well commercial system, right? And at least, uh, at least that's part of the journey to do so.
Which is, um, let's see. So, um, right. So collected hundred training audio clips and, uh, twenty-five development set, uh, to tune to and, uh, zero tests. Right? One thing I'll often do if my main goal is to just build a system that works as opposed to publish a research paper, is to not have a test set. We just have a dev set that we will tune the parameters to. And if I want to publish a research paper, I probably need a clean, unbiased test set. But if my goal is to just build something, you know, and have it work and ship it, sometimes I just say, "I'm not gonna bother to collect any test data," just a training set and a dev set, and I just unapologely-- unapologetically tune my system to the dev set. Right? Um, and one thing you could do is, um, let's see. So audio clips. So audio is, um...
27:03 – 30:12
Windowing trick: turning long audio clips into many labeled examples
1. ANAndrew Ng
  Let's see. So you, you may have seen the audio waveforms, right? Where, you know, the x-axis is time, and audio waveforms kind of are these wiggly time series. Right? Uh, and, and sound is, um, very rapid vibrations in the air that your ear perceives as sound. And what a microphone does is it records these very rapid changes in air pressure, right? So that's why you see audio waveforms that look like this. And, um, so one thing that one of my teams once did was collect a hundred samples of, you know, audio waveforms, where somewhere in the middle of it, um, is someone saying, "Robert, turn on." So, you know, "Hey, how are you doing?" "Oh, yep." And let's, let's say, "Robert, turn on," and we now talk about some other things. Right? And so, uh, the phrase "Robert, turn on," takes about one second to for-- to, to say. And one way to collect a-- construct a training set would be to take a three-second clip. Uh, so this is where "Robert, turn on" was said. This is where they just finished saying, "Robert, turn on." And so if you collect a hundred audio clips like this, one way to turn this into a bigger training set is, um, to take these long audio clips, and then this becomes a training example with a label one because that's an utterance where the end of the utterance corresponds to when someone just finished saying, "Robert, turn on." And in contrast, this would be an example, you know, would be a negative example. Um, and this too is a negative example, you know, and this too is a negative example. Does it make sense? So given a, say, ten-second clip, given the ten seconds of audio, we will have cut out, um, a phrase where you get a positive label if that three-second audio corresponds to someone just finishing saying, "Robert, turn on," which is when you should turn on the lamp. Uh, and anything else is labeled zero, right? And so this is a way to take, say, a hundred training examples and turn that into three thousand binary examples.
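A minimal sketch of the windowing idea described above, under stated assumptions: clips are ten seconds long, windows are three seconds long, thirty evenly spaced windows are cut from each clip, and the one window whose end is closest to the moment the phrase finished gets label 1. The function names, the `clips` list, and the "closest window" rule are hypothetical choices for illustration; the lecture does not say exactly how the windows were placed or labeled.

```python
def window_end_times(clip_s=10.0, window_s=3.0, n_windows=30):
    """Evenly spaced end times (in seconds) for fixed-length windows inside one clip."""
    step = (clip_s - window_s) / (n_windows - 1)
    return [window_s + i * step for i in range(n_windows)]


def label_windows(phrase_end_s, clip_s=10.0, window_s=3.0, n_windows=30):
    """Return (start_s, end_s, label) triples for one clip.

    Assumption: only the window whose end is closest to the moment the
    phrase finished is labeled 1; every other window is labeled 0.
    """
    ends = window_end_times(clip_s, window_s, n_windows)
    closest = min(range(n_windows), key=lambda i: abs(ends[i] - phrase_end_s))
    return [(end - window_s, end, 1 if i == closest else 0)
            for i, end in enumerate(ends)]


# 100 hypothetical clips; phrase_end_s marks when "Robert, turn on" finished.
clips = [{"clip_id": k, "phrase_end_s": 4.0 + (k % 50) * 0.1} for k in range(100)]

# Each example is (clip_id, start_s, end_s, label).
examples = [(c["clip_id"], *w) for c in clips for w in label_windows(c["phrase_end_s"])]
n_pos = sum(label for *_, label in examples)

print(len(examples), n_pos)  # 3000 100
```

With these assumptions, 100 clips become 3,000 labeled windows, of which only 100 are positive, which is the imbalance the lecture turns to next.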
So a hundred, a hundred audio clips, sorry. I should say a hundred audio clips. Right? And if we take thirty windows out of this, then you can turn this into, um, say, three thousand training examples, each with a binary zero-one label. The label says whether this is the moment in time when someone just finished saying the phrase, "Robert, turn on." Right? Um, so it turns out that if you do this, and we did do this, then we wound up, um... Let's see. We
30:12 – 35:38
The 97% accuracy trap: diagnosing class imbalance and metric misuse
1. ANAndrew Ng
  wound up with a system-- So we ran this and, uh, tested this on the dev set, um, and we wound up with a system that when trained to predict binary classification was ninety-seven percent accurate. Right? Uh, but it turns out it did this by outputting zero all the time, that, that had zero detections, right? And so, um, with this training set, we basically trained a very large neural network where I would have gotten exactly the same result with that one line of Python code, right? So this is the kind of stuff that happens in real life, right? And, and by the way, I'm sharing these stories not, you know, just to entertain you, though hopefully you're entertained, but because I think it's by living these experiences that you, you know, go, "Oh, I could see this problem. This is why I do it. I could see..." So my question to you is, if you collected this data, trained the system, ninety-seven percent accuracy, isn't that fantastic? But you realize that you just implemented via a huge neural network a print zero statement or the equivalent of print zero that is never finding the phrase "Robert, turn on." Like, what, what, what do you do next? What's going on? What do you do next? Oh, yes.
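To make the arithmetic behind that ninety-seven percent concrete, here is a rough sketch. It assumes the dev set is windowed the same way as the training set (25 clips, thirty windows each, one positive window per clip), which the lecture does not state explicitly. The helper functions are hypothetical and written in plain Python rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives that the model actually detects."""
    positives = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives if positives else 0.0


# Assumed dev set: 25 clips x 30 windows, one positive window per clip.
y_true = ([1] + [0] * 29) * 25

# The "print zero" model: it never reports a detection.
y_pred = [0] * len(y_true)

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # accuracy = 0.967
print(f"recall   = {recall(y_true, y_pred):.3f}")    # recall   = 0.000
```

An always-zero predictor scores 29/30, about 97% accuracy, while detecting none of the positive windows, so a measure such as recall on the positive class exposes the failure that accuracy hides.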
2. SPSpeaker
  [background noise]
3. ANAndrew Ng
  Yeah. Cool. Awesome. Right. So, so, um, increase the number of "Robert, turn on"s in the training set. This is a very unbalanced training set, and it-- did great by just saying, "I never hear this phrase." Right? How would you go about, and you or anyone else, how would you go about increasing the number of positive examples? Oh, go for it.
4. SPSpeaker
  [background noise]
5. ANAndrew Ng
  Oh, but how, how, how do you do that?
6. SPSpeaker
  Oh, like, like, you would use like audio, like audio editors.
7. ANAndrew Ng
  Oh, audio editors. I see. Wow. I see. Cool. Yeah. Interesting. Okay. Yes. That would work. It get-- In that case, the synthetic data again, right? Which, which is actually-- which actually works, and, and it also has all the complexities of synthetic data. But I-- it turns out it does work, so I know it works because I've done that too, but yeah. Like I see... Oh, go ahead.
8. SPSpeaker
  You can always take your example and duplicate it.
9. ANAndrew Ng
  Yeah. Yep, that works too. So, um, one thing you could do is, uh, take your positive examples, the examples of "Robert, turn on," and just duplicate those examples or, or maybe just give those examples more weight in the training objective. But yeah. Any other ideas? Did you want to say something?
10. SPSpeaker
  I think similar response. Like an example which will show the increase announcement [clears throat] or like signal.
11. ANAndrew Ng
  Oh, increase the noisy example.
12. SPSpeaker
  Yeah.
13. ANAndrew Ng
  Yeah. Yeah, cool. Yeah. That, that, that, yeah. So a few variations of synthetic generation. Yeah. Yep. So I think the, the, the easy things to try, one would be, um, to take the examples and duplicate them, right? So, you know, mat-mathematically, this is equivalent to taking your positive examples and just making multiple copies of that in your training set. It turns out that will work. That alone should solve it. And by the way, un-unbalanced data sets are a common issue in training neural networks. I'll, I'll tell you the rule of thumb I use and, you know, is, um, many neural networks are pretty good at handling up to like a one-to-ten ratio of unbalanced data sets. So people often worry, what if I have an unbalanced data set? In this example, we have like a one-to-thirty ratio, and your mileage may vary. Sometimes it works, sometimes it won't. In z-- uh, in this, in this case, it didn't work. But, um, I find that if you have a, you know, like a one-to-two ratio, I'm just usually not that worried about it. Neural network is fine training like that. Maybe up to one to ten is when I start to worry about it being a little bit too unbalanced for, you know, standard training procedures and when I might do something to make it a little bit more balanced. Um, but duplicating the examples would be one tactic. Oh, go ahead.
14. SPSpeaker
  What if you also penalize false negatives?
15. ANAndrew Ng
  Uh, yes. So, uh, penalize, uh, false negatives. Yes, that would work too. So I think, um, there are a few ways to train the-- to change the cost function. You can give the positive examples more weight or, um, uh... Yes. Or penalizing false negative would be another way to, to, to, uh, change the cost function. That will work. Yes, that will work too. Okay. Um, and yeah, go ahead.
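One way to read "give the positive examples more weight" is a weighted binary cross-entropy. The sketch below is an assumption about what that could look like, not the cost function the team used; `weighted_bce` and `pos_weight` are names made up for the example. Weighting a positive by k adds the same total loss as including k copies of it, so this is closely related to the duplication idea, up to the normalization constant.

```python
import math


def weighted_bce(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with extra weight on positive examples.

    A missed positive (a false negative) costs pos_weight times more than
    it would in the unweighted loss.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


# One positive among thirty windows, and a model that always says "almost surely 0".
y_true = [1] + [0] * 29
always_low = [0.01] * 30

plain = weighted_bce(y_true, always_low)                        # small loss
reweighted = weighted_bce(y_true, always_low, pos_weight=29.0)  # much larger loss
```

With the reweighted loss, always predicting near zero is no longer a cheap solution, which is the effect the suggestions in this exchange are after.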
16. SPSpeaker
  You can also decrease the, um, amount of the, um, negative examples so it kind of creates a very normal distributed
17. ANAndrew Ng
  Yeah. Yes. So you can also decrease the number of negative examples. That would work too. The one downside of that is, um, like it's okay, I think it's fine to do what you said. The one slight downside is if you are reducing the diversity of the negative examples, then the neural network has just a little bit less information to learn from. Um, I think this is small. What I just said is a small difference though. I think that would also work. It's a quick, it's a quick one to try. Yeah. Um, and I, I'll just tell you another-- I'll, I'll tell you what my
35:38 – 37:41
A commercial ‘hack’: widen the positive window to create richer positives
1. ANAndrew Ng
  team actually did. And I'm telling you this not because it's a brilliant technique I'm proud of, but because I just wanna tell you the examples of the types of hacks that actual commercial machine learning teams do that actually work, right? So I'll tell you what we did. It was very close to duplicating the positive examples, but we had just a slightly different variation, which was, um, we said that instead of the positive example being the one window where "Robert, turn on" just finished. So here we're, we're having a sequence of labels, right? Where the zero-one label corresponds to, um, is this the moment in time when someone just finished "Robert, turn on." And in the architecture I described, we are detecting a very, very narrow window in time where someone just finished "Robert, turn on," and they got a turn on. But the hack that we used was actually just extend this out a little bit. So instead, instead of saying, did someone just finish saying "Robert, turn on" in the last, you know, one hundred milliseconds, we extended that out to half a second or a second. And so if someone finished saying "Robert, turn on" any time in the last half second, let's turn on the light, right? And so this actually generates a few more training examples. Now, the reason we did that, and again, this is a small difference, is this actually creates a little bit more diversity in the positive examples than just duplicating it. Because now this red rectangle is a positive example, and so is this one, and so is this one, and so is this one. So it just creates a little bit more diversity in the positive examples. Um, I expect this will make a very small difference in the training-- in the learning algorithm's performance, but it's just, you know, maybe just slightly better to have slightly, um, yes, covers the, the space of positive examples just a little bit. Just very, very s- little bit better. Make sense? Um, so okay, just to keep going with the story, um, let's say you do this,
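The widened-label hack can be sketched on a sequence of per-step labels. This is only an illustration: the 100 ms step size is borrowed from the "last one hundred milliseconds" remark, and the function name and the ten-second sequence length are assumptions.

```python
def label_sequence(n_steps, phrase_end_step, positive_steps=1):
    """Return a 0/1 label per time step.

    Steps from phrase_end_step up to (but not including)
    phrase_end_step + positive_steps are labeled 1, meaning "the phrase
    finished within the last positive_steps steps".
    """
    labels = [0] * n_steps
    last = min(phrase_end_step + positive_steps, n_steps)
    for t in range(phrase_end_step, last):
        labels[t] = 1
    return labels


# 10 seconds at 100 ms per step; the phrase finishes at step 60.
narrow = label_sequence(100, 60, positive_steps=1)  # only the exact moment
wide = label_sequence(100, 60, positive_steps=5)    # any time in the last 0.5 s

print(sum(narrow), sum(wide))  # 1 5
```

Each of the five positive steps in `wide` sees a slightly different slice of audio, which is where the extra diversity compared with plain duplication comes from.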
37:41 – 41:01
Overfitting after imbalance fixes: regularization and more data
1. ANAndrew Ng
  and now, um, you still do well on the training set. Ninety-five percent accuracy on the training set, but, um, fifty percent accuracy, so not good enough on your dev set. Right? So you fix one problem and another one comes up, and so what do you do next? Go for it.
2. SPSpeaker
  [background noise]
3. ANAndrew Ng
  Yeah. Cool. Awesome. Yep. Overfitting and use regularization. Go ahead.
4. SPSpeaker
  [background noise]
5. ANAndrew Ng
  Um, is it-- So right. Is it that training and test set distributions are not the same? In this example, uh, because we collected the training and the test sets, if we randomly shuffle between train and dev, then they would be the same distribution, actually. Yeah. But some-sometimes, um, there's one other thing I do see, which is, uh, if you-- I-i-it turns out early in the history of machine learning, there was always this assumption or obsession with making the training set and the dev set and the test set have the same distribution. I think it's because if the training set and the test set are the same distribution, it's easy to prove theorems, easy to give guarantees, whatever. So in kind of academic machine learning, there was always this, you know, theoretical assumption, which makes the theory work, well, way better, that the training set and your test set come from the same distribution, right? It is just like from a, from a paper-publishing point of view that makes life much better. From a practical point of view, what I see is a lot of the time your training set distribution is just different than your test set distribution. Um, that's just life because, you know, it, it's hard to demand data of a certain sort. And so, for example, one common thing is if you do use synthetic data generation, uh, you can come up with really clever ways to generate synthetic training data by using TTS, um, or editing audio or whatever. These techniques let you generate a massive training set, but the price of that is, you know, your synthetic data, I mean, that's not how users talk, right? Users don't speak synthetic data, they speak raw data. And so to make sure that your test metric truly reflects how users will perceive your product, you know, I would put true data in your test set. 
And so you end up with a lot of systems where your training set distribution is synthetic data or other things you fiddle with, and then your test set is what the world actually cares more about, and the two distributions are very different. Um, and then I think the question you just asked is, is the training set distribution too different from your test set distribution that it's a problem? Those would be good questions to ask. Right? All right. So, um, if you see this, this is overfitting because we're doing great on training set, not so well on dev. So one thing to try would be-- The first thing to try is regularization, um, so that'd be a good thing to try. And then beyond using regularization, I find that, um, for a lot of speech problems, uh, if you're overfitting, getting more data is
41:01 – 48:12
Noise mixing for speech: scalable synthetic augmentation (and its pitfall)
1. ANAndrew Ng
  nice. And so just to share with you, I know, I know you suggested synthetic data earlier. I'll, I'll, I'll share with you one thing for synthetic data that does work, uh, which we eventually wound up using, which is, um, it turns out that, um... Sorry, this is audio stream, right? It turns out that there are a lot of, uh, uh, if you can get audio clips of background noise. So for example, in this room, you know, if I'm not-- if I'm quiet, we can hear a little bit of air conditioning noise, right? So a lot of rooms have a little background noise. Or if you're near a highway, there's cars in the background. So there are-- Actually, most, most places where people may use the lamp, you know, there's a little bit of background noise. So if you're able to get audio of background noise, and then additionally record some very clean audio clips of "Robert, turn on," uh, it turns out that if you take two audio waveforms and sum them together, you end up with an audio waveform that sounds like someone saying this in the presence of the background noise, right? So it's the-- I think it's called the superposition property of sound. But basically sound adds, which is why if you take two audio clips and you just add the audio waveforms together, then you end up with an audio recording that sounds like both sounds going on at the same time. So what we can do is take, you know, background noise. Um, there's actually a lot of, uh, uh, uh, audio clips of background noise on YouTube as well and, and so on that... Check the licensing terms before you use stuff like that, right? But there are actually quite a lot of, uh, openly licensed audio clips of, you know, just someone like a quiet... like a coffee shop noise or, or someone sitting in a house studying or whatever. Um, and so if you can take some background noise audio clips and then take a clean voice, uh, someone saying, "Robert, turn on." Right? 
And if you add these two together, then you end up with background noise, "Robert, turn on," then more background noise, and this becomes a positive example using the process that we talked about just now. And if you have a handful of clips of "Robert, turn on" and a lot of clips of background noise, you can synthesize the phrase "Robert, turn on" against a lot of different types of background noise and create, you know, a pretty large training set. Um, it turns out one problem with what I just described is if you do exactly what I just said, you won't actually get a "Robert, turn on" detector. You end up with a voice activity detector because you have a lot of background noise, and anytime anyone says anything, the only thing people say is, "Robert, turn on." It's much easier to just decide, is there a loud sound of someone talking, than to actually recognize things. So the other thing we should do is, um, instead of only adding "Robert, turn on," you know, pick a dictionary or find other types of audio. There's-- I, I think that's why you had the comment just now of make sure it's-- don't confuse all the things and add, I don't know, get someone to say the word, you know, I don't know, cardinal or whatever, other words. So you have a bunch of examples of people saying words other than "Robert, turn on," and then also a bunch of examples of people saying "Robert, turn on." And, um, by having a handful of clean recordings of "Robert, turn on," uh, and a handful of clean recordings of people saying other stuff and synthesizing a data set like this, you can get a data set with maybe thousands of examples, um, of "Robert, turn on" pretty efficiently. Uh, and if you train a neural network on this, you, you get like a decent, uh, wake word or trigger word detector. That make sense? So, um... Yeah, right. So, uh, with a process like this, um... So this is what working on a machine learning project feels like, right? You try something, doesn't work. 
Like, you find out that, um, you have a skewed data set. Be creative in how you create-- collect data. You may find that you have a skewed data set, it just doesn't work. You may find it's overfitting. And, and I find that it is these, um... The ability to drive iterations on a system that determines how quickly you can get something to work. Uh, yeah.
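The noise-mixing recipe above can be sketched with plain lists of samples. This is a toy illustration under stated assumptions: waveforms are Python lists of floats at a shared sample rate, and the helper names `mix` and `synthesize_example` and the 50/50 split between trigger phrase and other words are made up for the example. A real pipeline would also need to handle loudness and clipping, which this sketch ignores.

```python
import random


def mix(background, clip, offset, gain=1.0):
    """Sum clip into a copy of background, starting at sample index offset."""
    if offset < 0 or offset + len(clip) > len(background):
        raise ValueError("clip does not fit inside background at this offset")
    mixed = list(background)
    for i, sample in enumerate(clip):
        mixed[offset + i] += gain * sample
    return mixed


def synthesize_example(background, trigger_clips, other_clips, rng, p_positive=0.5):
    """Overlay either a trigger-phrase clip or some other spoken word.

    Returns (audio, contains_trigger, end_index). Including other words as
    negatives is what keeps the model from learning "any speech" as the cue.
    """
    is_positive = rng.random() < p_positive
    clip = rng.choice(trigger_clips if is_positive else other_clips)
    offset = rng.randrange(0, len(background) - len(clip) + 1)
    return mix(background, clip, offset), int(is_positive), offset + len(clip)


# Tiny stand-in waveforms; real clips would be thousands of samples long.
noise = [0.01] * 10
trigger = [[0.5, -0.5, 0.5]]
other_words = [[0.3, 0.3], [-0.2, 0.4, -0.2, 0.1]]

rng = random.Random(0)
audio, contains_trigger, end_index = synthesize_example(noise, trigger, other_words, rng)
```

For positive examples, `end_index` marks where the phrase finishes, which is the moment the labeling scheme from earlier needs.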
2. SPSpeaker
  [faint speaking]
3. ANAndrew Ng
  Uh, um, so for the "Robert, turn on" example.
4. SPSpeaker
  Yeah.
5. ANAndrew Ng
  So if I found that there were a lot of users that are listening to music, then, um, I would probably synthesize data with music in the background and then have someone say, "Robert, turn on." So if, if you think a lot of users listen to music and there's our test lab, we want to make sure we catch them say, "Robert, turn on," um, uh, then I would probably synthesize more training data to include, you know, loud-ish music in the background.
6. SPSpeaker
  [faint speaking]
7. ANAndrew Ng
  Sorry, what are the genre of music?
8. SPSpeaker
  [faint speaking] Like if it's like classical music or, or heavy metal, would you-- don't know, like, the-- what they're usually listening to and then you train, um, different genre of music through the model if it's extremely different.
9. ANAndrew Ng
  Oh. Yeah. So, um, it, it, it turns out that, uh, if you have a... I, I feel like, uh, as a rule of thumb, if you can have a more diverse set of training data, it's usually better, right? So, you know, um, so maybe I'm not sure what music my user base likes to listen to the most. Is it classical? Is it rock? Is it EDM? Is it whatever? Um, if you could collect a very rich training set that includes all of the above and even more, then, um, usually that'll do better than going too narrow and risking picking wrong. Um, yeah. And, and, and the one asterisk to what I said is, uh, so long as you can train a neural network that's big enough, usually more diverse data, more data, more diverse data is better. Uh, if your neural network is too small, then it may lack the capacity or the intelligence to memorize all this stuff, uh, whi-which could be a problem if you need a very small network that runs at the edge. But usually, um, usually, uh, if you have the capacity to get more rich, more diverse training data, that ends up delivering better results. That make sense? Cool. Cool. Um, all right. Now, I wanna share with you... So it turns out, um, when
48:12 – 55:21
Iteration cadence: ML feels like debugging, powered by tight eval loops
1. ANAndrew Ng
  you're building a machine learning system, very common experience is you try something and it doesn't work, and then you have to figure out all the wonderful ways it could be not working. And then go and fix whatever is not working, right? And even in this, you know, example we went through, is the training data too skewed? Is it overfitting, so you need either regularization or more data, or is it, like, something else? Or is the synthetic data distribution not matching real data distribution? It turns out that when you're training a neural network, when you're training a... When you're building an AI system, um, it's really difficult to know in advance what's gonna go wrong next. Uh, I find that in software development, if I'm writing traditional software, you know, you kind of control all the code. And so it's more okay to write a spec, and then you just build it, and then it, you know, kind of works. You still need to debug it, but the bugs are, you know, it's like my own bugs, right? In contrast, when you're building a machine learning system, it's much more like I don't know what's gonna happen next, right? Maybe because I don't know what the data will give me, hard to predict how the, how the algorithm will perform. And so the workflow of machine learning feels much more like debugging than development. Um, and by debugging, I mean it's this... There is a process where you build a system and you're just repeatedly trying to find how it doesn't work and fix it. Uh, and, and, and if you're trying to do a task that humans can do, then the bugs or the gaps in performance are often whatever a human can clearly do but the AI system is unable to do. So for many machine learning teams that, you know, I've led or worked in, um, if you can get the team into a healthy rhythm of this debugging cycle, you can make really rapid progress. So for example, it turns out... Uh, let, let me give one example. Um, I've been on teams where we would do the following, right? 
Let me say morning, afternoon, evening, night. So we run our training jobs at night, right? And for example, it turns out for some of these models we're training for speech recognition or it takes... Let's say it takes four hours, right? So the training job takes about four hours to train the neural network. So training night, in the morning, um, we look at results. You look at results through error analysis, try to figure out what's wrong. Um, in the afternoon, you know, we write code, right? To figure out how we're gonna fix whatever thing we discovered the day before. And in the e-evening, you know, right before we go home, we launch the training job. And then it runs overnight, and the next morning we do it again, and again, and again. And it turns out that if you can fix one problem a day, that's actually pretty good. And within, you know, like, huh... And then you, you, you saw me walk through three, three or four problems. Uh, when, when I built this for real, there are slightly more problems than that, right? But I find if you can get a team into cadence where you train stuff, look at results, figure out is this bias, variance, data mismatch... Oh, actually, this should be code or get data. Fix it, you know, and in the evening before you go home, launch the job. Maybe if you have time in the evening, just baby the training job a little bit, make sure it's still running. But then come back in the m- next morning to see how it did. Then you just do this over and over, and you can make really rapid progress. Um, and I find it's often this discipline, right, that, that lets you make progress. In contrast, the teams that kind of wake up and they go, "Huh, what do we do next? Oh, let's call a meeting this afternoon. Wait, where's the data? Okay, I guess we'll meet tomorrow to look at the data." And then, and then like, "Oh, infra is down." And teams like that move much slower than the really disciplined teams that just keep on rolling forward. 
And one more observation. It turns out the iteration cycle, um, sometimes it is driven by this. So I've worked on machine learning jobs where it took us about four hours to train a model, and four hours is long enough, don't wanna wait for it, but, you know, running overnight is just fine. Um, I've also worked on teams where a typical training job took about three weeks on average, right? And so if it takes three weeks to train a neural network for the particular type of model we're working on, then the pacing is very different. So how long it takes to train a neural network really drives the pacing of this, where sometimes we would, you know, launch a training job and then, like, hope for the best. We, we would monitor it during those three weeks. It's not that we do nothing for those three weeks, but you just gotta launch a training job and then frankly, you know, hope that it works three weeks later. But we take three weeks, and after that, tons of analysis and debugging and whatever work that will take us one to two weeks to set us up for the next training job that are gonna take us another three weeks to get a result. And then we're also parallelizing. While we're launching this job, we're analyzing the job from a previous job, and we launch a few jobs asynchronously. But I find that how long it takes, this is a huge driver for how you do this. The other end of the spectrum is if it takes ten minutes to train a neural network, then that's wonderful. You just train it, get a result, train it, get a coffee, come back, you know, then do the analysis. And then the bottleneck is how quickly can you analyze the data and, and get more data. So this really drives the design of this process. Um, and one other thing that happens is, uh, for a lot of projects, you start off with a small thing that takes ten minutes to train. But as you get more and more data, you know, then, like, you're bigger than... But now it takes four hours to train. It goes even bigger. 
Then it takes two weeks to train. It's like, oh, now this thing takes a month and a half to train. So I've experienced that as well, where unfortunately, as the performance climbed, we decided we had to train bigger and bigger models with more and more data. And then it went from these really fantastic ten-minute iterations to these, like, months-long iterations. Uh, but they just had to be done because we were training bigger and bigger networks, right? Yeah.
2. SPSpeaker
  What types of things that you could do to change, like, while the model is training, like, uh, as you're training the model and you see it lagging, like, do you have to stop the model or anything?
3. ANAndrew Ng
  Um, we probably didn't stop the model that much. Uh, so sometimes you can look at checkpoints, um, and if you see that for some reason, um... So a- a- actually, if you're training like a three-week job, you kind of expect the performance to be at a certain level after a couple of days or after a week. And if it's way off, then before burning another two weeks, you're gonna ask, "Is my learning rate clearly wrong?" Or, or the-- or maybe this new dataset we tried is clearly wrong. So you can actually start to run some analysis in the checkpoints, and sometimes we would yank the job and just kill it. I would say that happened very rarely. Uh, uh, yeah.
4. SPSpeaker
  [background noise]
55:21 – 57:55
Why speed compounds into competitive advantage
1. ANAndrew Ng
  Yeah. Yes. So one, one thing you learn about in the online videos is, uh, transfer learning. We train a large dataset and then, uh, maybe fine-tune on just a much smaller dataset. And then if you process the fine-tuning on much smaller dataset, takes, you know, half an hour, right, then that's great. Then you can also drive much faster iterations in that based on that half hour as the limiting factor for the training time. Cool. All right. Um, one reason I obsess about speed is because, um, it, it, it, it turns out that, uh, if the X-axis... I- imagine you're building a startup to launch this product, right? Um, I find that, um, if a team takes, you know, twice as long to do it, they're just much less competitive in the marketplace, right? So here's what I mean. So if this is error and this is your months, right? There's some machine learning systems that we work on for months to, to keep on trying to improve it. And if, um, this is you and if a competitor, say, takes twice as long to reach the same level point. So instead of taking this long to get here, they take this long to get here. Instead of taking this long to get here, they take twice as long to get here. So the competitor kind of does that, right? They just take twice as long. You know, they take two days instead of one day to do something. But if they always take two days to do what you would take one day to do, then the performance over time looks like this. And what the customer cares about at a certain moment in time is you are so much better than them, right? So these two differences in speed really translates into a massive difference in the performance of your system versus someone else's system in the marketplace. And, uh, and, and I find that the fast-moving teams are just so much more effective. And, and you might think, "Yeah, you know, I took two days, they took one day. What's the big deal?" Like, the big deal is not that you're a day slower. 
The, the big deal is you are two times slower, and it's very not... Just in the marketplace, it's just not competitive if, if you are-- if you take twice as long. So maybe for, for, for some applications. Okay. All right. There's one other example I wanna cover. Let me try to do that quickly. Um, so
57:55 – 1:06:30
Second example: building an LLM ‘deep researcher’ pipeline
1. ANAndrew Ng
  what I've talked about so far is, um, speech recognition, uh, you know, wake word detection, which is more of an end-to-end deep learning system where, um, you know, your input audio goes to a neural network, and then this outputs, you know, is this R2 or "Robert, turn on," right? So, so the entire system you're trying to build is just a single neural network. For a lot of applications you build, you end up building pipelines or sometimes call them cascades. I'm gonna use the term pipelines. We saw one example of this last time where, um, to detect people coming up to unlock a door with face recognition, we had video that fed to visual activity detection to see if anyone is even in front of it. And then that fed to neural network to recognize, you know, is this an authorized person? And then zero one to say, "Is this a person that we should unlock the door for?" Right? Um, what I want to do is use a different example of a pipeline, of an AI deep researcher. So it turns out, right, all, all of the, um, all of the leading LLM, well, almost all of the leading LLM providers, uh, uh, have deep researchers, uh, where you can ask a query and it will go and search the internet, look at a lot of web pages, and come back and synthesize a very thoughtful report, right? Um, and so... Yeah. But I, I actually use the OpenAI deep researcher quite a lot. I think that... I, I, I know some of the team that built it, and it's really well-built. Uh, well, I think someone has also did a good job. But so, you know, you input a query like, um, uh, uh, this, this example I'm taking from the, uh, agentic AI course online. Um, but, um, input a query like, "Show me the latest research on black holes," right? Some, some query like that. And then, um, one thing you might do is take a query like, whatever, right? Black hole research. And then use an LLM, use a large language model to generate search terms to feed to a web search engine. 
Um, so it may generate a few terms like, you know, black hole research, latest in, I don't know, astronomy and black holes or whatever. So generate a handful of search terms. This then calls out to a web search engine, uh, be it Serper, Google, DuckDuckGo, Bing, uh, I don't know. I used to really quite a lot, uh, but there are multiple... Actually, there are actually more and more web search engines designed for AI rather than for humans, which I think are pretty neat. Um, so this can, uh, call, call the web search engine. And then fetch the top URLs. Oh, sorry. I mean, uh, fetch the top pages, I think it's top URLs. So the web search engine returns ten pages. You may not wanna fetch all ten of them. You can read the snippets, decide which one is the most relevant. Uh, again, maybe with an LLM, with a large language model, to decide what are the pages you wanna download. So just like a human, you know, I'll do a web search. I won't click every single link. I'll take a glance and then decide which ones I'm gonna click. Um, so identify and fetch the top URLs, and then feed all that into writing, and this gives the final output. Right. So this is a... By the way, as a-- this is a more traditional deep researcher article. Uh, sorry, deep researcher architecture. The more modern deep research articl- architectures let the system decide when to do more web search, when to, uh, uh, fetch more web pages, and they are more, you know, autonomous, more agentic. But this is what the early deep researcher architectures look like. There's more of a linear pipeline. The, the... There are more modern architectures where system will fetch some pages and autonomously decide, "Do I need to go back and do more research?" and obviously what topics and iterate a few times. But this would be like a, you know, pretty decent basic deep researcher. Right. 
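The linear pipeline just described (generate search terms, search, pick URLs, fetch, write) can be sketched as a function that takes each step as a callable. Everything here is hypothetical scaffolding: `deep_research` and the `stub_*` functions are invented for illustration, the stubs stand in for real LLM calls, a real search API, and a real page fetcher, and the two domain names are just the ones used as examples in the lecture. Returning every intermediate output is a design choice that makes each step inspectable later.

```python
def deep_research(query, generate_search_terms, web_search, select_urls,
                  fetch_page, write_report, max_pages=3):
    """Run the linear pipeline and keep every intermediate output."""
    trace = {"query": query}
    trace["search_terms"] = generate_search_terms(query)
    results = []
    for term in trace["search_terms"]:
        results.extend(web_search(term))
    trace["search_results"] = results
    trace["selected_urls"] = select_urls(query, results)[:max_pages]
    trace["pages"] = [fetch_page(url) for url in trace["selected_urls"]]
    trace["report"] = write_report(query, trace["pages"])
    return trace


# Stubs standing in for an LLM, a search engine, and an HTTP fetch.
def stub_terms(query):
    return [query, "latest " + query]


def stub_search(term):
    return [
        {"url": "bobsbackyardastronomyblog.com", "snippet": "My thoughts on " + term},
        {"url": "nasa.gov", "snippet": "Agency overview of " + term},
    ]


def stub_select(query, results):
    # Toy rule in place of an LLM judgment: prefer .gov, drop duplicates.
    urls = []
    for r in sorted(results, key=lambda r: not r["url"].endswith(".gov")):
        if r["url"] not in urls:
            urls.append(r["url"])
    return urls


def stub_fetch(url):
    return "contents of " + url


def stub_write(query, pages):
    return "Report on " + query + " from " + str(len(pages)) + " page(s)"


trace = deep_research("black hole research", stub_terms, stub_search,
                      stub_select, stub_fetch, stub_write, max_pages=1)
```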
And it turns out that if you actually build this, um, there-- one of the s- in, in the speech recognition system, uh, you know, Robert Turn on LAMP example, we talk a lot about how to improve one component, which is the neural network. If you have a pipeline like this, the other thing you need to do, which is critically important, is to decide of all the different components in the pipeline, which one do you wanna focus your attention on, right? So I find that, um, one thing that makes a huge difference in a team's performance is, again, being able to drive a disciplined evaluation and error analysis process to decide what to work on. Uh, I find that a lot of machine learning is not, you know, wildly doing things to see what works. There's actually a very thoughtful, very disciplined process where, well, the system maybe is not doing as well as we wish, so we should look at it to decide, um, it turns out lots of things could be wrong. Is it generating search terms that aren't quite right, right? Or maybe we're using a web search engine that is returning results that aren't that good. So for example, is, um, uh, is, is, is my web search engine comprehensive enough, or is it just, you know, maybe I picked a lower cost web search service or something that's just not returning the latest materials? Um, it turns out for the-- for, for, for some internet articles, many web search engines will do really well. But if you ever wanna fetch news, really fresh news, there's actually a lot of difference in the performance. Some web search engines are much better at having really fresh content. Some just don't update as frequently. So do I need to switch web search engines? Or am I successfully identifying the best web pages to fetch? So if I have black hole science, you know, I think nasa.gov is a very, very authoritative web page. 
But if a web page-- web search engine returns, you know, like, um, uh, I don't know, bobsbackyardastronomyblog.com, right, uh, that's less authoritative than nasa.gov, am I correctly fetching the most authoritative scientific articles versus whatever is hyped up, you know, or maybe like, uh, uh, I don't know. Am I fetching a lot of, like, random TikTok videos, you know, rather than really scientific, authoritative things? Um, and then lastly, given all this information, am I writing a thoughtful article, right, with an LLM from, from the final research output? So it turns out that with a pipeline like this, there are lots of steps that could go wrong, and your ability to decide what is the components you should focus your effort on, that, that's a massive driver of, um, your productivity in improving a system like this. And the good news is, um, there, there's actually one other thing I've seen. Um, sometimes I've seen a few experienced machine learning people... So actually one, one thing I've seen, um, sometimes a team will build a system like this, uh, and it is a less experienced team. You know, they build a system, it's not quite working, trying to decide what to do next. One thing I've seen many times is if you get a few senior machine learning people around, like at Stanford, I've seen this quite a few times. Sometimes the students on a project may build a system like this. If you get a few, you know, experienced professors to look at the system, you find that our opinions on what to do next, there's remarkably little variance, right? Uh, so, so you find that many, you know, experienced AI people look at it and will go, "Well, based on what this seems, we think this is the problem or this is wrong. We should try this." It's not that we always agree with each other a hundred percent, but the variance is much less than you might think. 
And to me, that, you know, I think there's a methodology behind how to approach these problems, whereas the variance in what to try next among less experienced engineers is much larger, right? So, so I-- so to me, I think there's actually, um, uh, if you have a systematic way of doing error analysis to figure out where it's actually underperforming, um, you know, experienced people all kind of... Doesn't mean experienced people are always right, but there's just no variance, right? But there, there, there really is a methodology to
1:06:30 – 1:15:12
Error analysis for pipelines: spreadsheet-driven diagnosis to pick the next work item
1. ANAndrew Ng
  figure out what to do next. And one of the key ideas is error analysis. Um, and I talk about this on online videos as well, but this is, uh, so important we should go through. Right. And, um, one thing to do is to look at the outputs of each of these intermediate steps. And it, it's one of these things I know I'll say it, you probably even agree is a good idea, but the percentage of people that'll actually do it when the time comes are found to be far lower than one hundred percent, right? But there's just, you know, take a handful of queries, be it, is it, um, latest in black hole science, or should I rent or buy an apartment? Or, uh, you know, what's the, uh, uh... What are... Help me make fun plans for a weekend in Santa Cruz, whatever. Some, some handful of queries for a deep researcher. Um, and then look at what search terms it generates and see if it makes sense. Look at the web pages it fetches, see if they look good. Look at the top pages it fetches and see if the top web pages selected are similar or materially different than what you would pick. Like my example of one of this decides to fetch Bob's backyard astronomy page.com rather than nasa.gov, right? Then you go, "Oh, okay, maybe I need to change how to do that." Uh, and then also look at the final writing to see, given the source articles, is it writing the appropriate articles? And what I find is, um, for a process like this, you know, um, it, it, it turns out error analysis is often a very manual process because when the system isn't performing, error analysis or gap analysis is often a manual process of figuring out what a human would do that is better than what the AI system would do, and we're trying to inject knowledge from the human into the AI system. So because the AI doesn't have this knowledge yet, uh, usually you do need human time to do this. 
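As a rough illustration of looking at the output of each intermediate step, here is a minimal Python sketch. It assumes a four-step deep researcher pipeline (search terms, web search, page selection, writing). The step functions are hypothetical stubs invented for this example, not anything shown in the lecture, and the example.com URLs are placeholders. The only point is that every step's output is kept so a human can read it afterward.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Holds one query plus the output of every pipeline step, for manual review."""
    query: str
    outputs: dict[str, object] = field(default_factory=dict)


def run_with_trace(query, steps):
    """Feed the query through each step in order, recording every intermediate output."""
    trace = Trace(query)
    value = query
    for name, fn in steps:
        value = fn(value)
        trace.outputs[name] = value
    return trace


# Hypothetical stand-ins for the real components (LLM calls, search API, and so on).
def generate_search_terms(query):
    return [f"{query} latest", f"{query} overview"]


def web_search(terms):
    return [f"https://example.com/{i}" for i, _ in enumerate(terms)]


def select_pages(urls):
    return urls[:1]


def write_article(pages):
    return f"Report based on {len(pages)} source(s)."


STEPS = [
    ("search_terms", generate_search_terms),
    ("web_search", web_search),
    ("fetch_pages", select_pages),
    ("writing", write_article),
]

trace = run_with_trace("black hole science", STEPS)
for step, output in trace.outputs.items():
    print(f"{step}: {output}")
```

In a real system each stub would be replaced by the actual component, and the printed trace is what a reviewer would read step by step for each query.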
And I know, you know, and people are talking about automating some of this, and maybe there'll be progress there, but so far I find error analysis, it just takes human to look at it and have the insights into where AI system is doing something different than an expert human would be, and it's that insight that then points you in the direction for how to improve it. And, um, so for... And, and so often I would build a spreadsheet like this where I have a query that is, you know, black hole science. And then another one, do I rent versus buy, you know, in whatever, San Francisco or, uh, uh, you know, weekend activities in Santa Cruz, right? Whatever. Just have, I don't know, maybe up to a hundred. I find that I often have patience looking up to a hundred examples, uh, beyond a hundred. Yeah, sometimes more than a hundred, but it's somewhere between ten and a hundred. Have a list of queries and have a list of steps. Um, and so the steps would be, right, search terms. Um, what's that? Web search. Uh, fetch pages. So this is what it feels like to do error analysis. It is a labor-intensive process because it's a process of identifying where a human outperforms AI to try to close the gap, right? Just do this over and over. But so what I would do is I'll sit down, often with a spreadsheet in front of me open like this, and I will run, you know, black hole science through the whole system and then just read, are the search terms satisfactory? Um, and if I'm not an expert in black hole science, then we need to get an expert or do, do a bit of work to figure that out. But if I find that the black ho-- the search terms sent to web search engine are completely satisfactory, then I'll just say, "Okay, this looks good." Then given those search terms, I'll look at the web search results and say, "Huh, did the web search engine retrieve u- reasonable things?" And it looks good, then great. 
And then I'll look at did my system, maybe an LLM, did it make a good choice in the top web pages to actually go and fetch? And maybe I found it's retrieving... It chose to retrieve Bob, Bob's Backyard Astronomy blog, but skipped nasa.gov. Then I'll say, "Okay, there's a problem there," right? And then often to make a note, you know, Bob instead of NASA, right? Just to kind of just take notes on, on what you're seeing. Um, and then I'll say, "Given these sources, is this writing okay?" Maybe it's okay. And then the process is, um, to take a handful, somewhere between, you know, ten, twenty, and a hundred of articles where the performance is subpar. And it's called error analysis because I, I want to focus on where it's underperforming, right? So there's some articles where it's doing a great job on, um, I would tend to pay less attention to those. But I'll focus on finding anywhere from ten, twenty to a hundred queries where it's clearly underperforming what I think a human should be doing or what, what I hope the system will do, and then going through to just try to get a sense of how often the hotspots are in different parts of the pipeline. And so maybe for rent versus buy, um, I might say, "Boy, there's a really authoritative blogger that talks about this, but somehow web search misses." I don't know. Maybe I'm using the wrong web search engine. Um, weekend in Santa Cruz. Yes, it's, it's, I don't know, buy into the, you know, some hyped up tourist things rather than actually finding locally interesting web pages and so on. And then by doing this for, you know, twenty, thirty, fifty, uh, queries, you can then start to get a sense of where the hotspots are, right? So again, all of these are subpar results. I'm focusing on queries where the performance is not good enough. And then, um, if I find that, you know, I don't know, forty percent of the time it's failing to fetch the we- top web pages, only five percent of the time, right, it's failing to do that. 
Actually, let me see. May find that, um, seventy percent of the time I'm really not happy with this. Five percent of the time web search is not good enough. You know, maybe, uh, twenty percent of the time search terms are wrong and maybe, uh, twenty percent of the time the writing's not good enough. Uh, so th- these don't have to add up to a hundred percent. Sometimes you have problems in more than one column. But if this is what it turns out to be, then we'll go, "Well, clearly a lot of the... my dissatisfaction with results is because it's just not choosing good pages from the web search to return, so let me go focus on that." And the thing about a lot of machine learning systems is you just don't know if you don't do this error, error analysis ahead of time in terms of what component to focus on. Um, and so I find that teams that know how to drive this process systematically, and, and it, it sometimes takes us hours. Remember that, you know, daily schedule thing? Sometimes it takes us a few hours to go through this process to reach conclusions. But then the benefit of spending, like, whatever, you know, three, four hours on this, uh, is that it can save you weeks of otherwise heading in the wrong direction. So, um, teams can drive this evals and error analysis process in a very methodo- methodological way. Um, you are much better at picking what direction to work in, and that allows the team to go way faster, and just way more than a 2X difference. I've, I've actually literally visited teams that have been working on something for six months. I go like, "Gee, you know, I could've told you six months ago this wasn't gonna cut it," right? And, and, and imagine if this was the problem. We're not writing the right web pages, but, you know, for some reason the team kept on trying out different web search engines. 
Maybe there's someone trying to sell a new web search service to you, so you spend a lot of time with the sales team, do a lot of integration with the website. There's actually, there's actually more web search services than most people know, right? So... And I, I actually swap between them whenever I feel like it. So, so imagine you spend all your time trying to come up with a better web search service, you know, for six months, and that's... it's entirely possible. Then you could just... it just won't move the needle for the overall performance, which is why this type of error analysis process is so important, right? I walked through this with the example of one pipeline. Uh, the online videos go through other examples as well. But, um, uh, both of building deep learning pipelines and for kind of AI agentic pipelines, I think this is a very important concept to master
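The spreadsheet tally described in this chapter can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the rows and the columns flagged for each query are illustrative placeholders, not data from the lecture, and the step names are invented labels for the four pipeline stages. Each row is one subpar query, with the set of steps a human reviewer marked as problematic and a short note.

```python
from collections import Counter

# Columns of the error analysis spreadsheet (assumed names for the pipeline steps).
STEPS = ["search_terms", "web_search", "fetch_pages", "writing"]

# Illustrative rows only: which steps a reviewer flagged for each subpar query.
rows = [
    {"query": "black hole science", "problems": {"fetch_pages"},
     "note": "Bob instead of NASA"},
    {"query": "rent vs buy", "problems": {"web_search"},
     "note": "missed an authoritative blogger"},
    {"query": "weekend in Santa Cruz", "problems": {"fetch_pages", "writing"},
     "note": "hyped tourist pages"},
]


def hotspot_rates(rows, steps):
    """Fraction of reviewed queries in which each step was flagged."""
    if not rows:
        return {step: 0.0 for step in steps}
    counts = Counter(step for row in rows for step in row["problems"])
    return {step: counts[step] / len(rows) for step in steps}


rates = hotspot_rates(rows, STEPS)
# List the most frequently flagged step first: that is the candidate to work on next.
for step, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{step}: {rate:.0%}")
```

Because one query can be flagged in more than one column, the rates need not sum to one hundred percent, which matches the point made in the lecture. With only three rows the numbers mean little. The lecture suggests reviewing somewhere between ten and a hundred subpar queries before drawing conclusions.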

Episode duration: 1:15:17


Transcript of episode s6JVGzABKho

