
Stanford CS230 | Autumn 2025 | Lecture 6: AI Project Strategy

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai

October 28, 2025

This lecture provides walkthroughs of examples of AI projects and making day-to-day decisions in building AI systems.

To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning
To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/
More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X

NOTE: There was no class on November 4, 2025 (Lecture 7). The next lecture is Lecture 8.

Andrew Ng
Founder of DeepLearning.AI
Adjunct Professor, Stanford University’s Computer Science Department

Kian Katanforoosh
CEO and Founder of Workera
Adjunct Lecturer, Stanford University’s Computer Science Department

Andrew Ng (host)
Nov 5, 2025, 1h 15m

EVERY SPOKEN WORD

  1. AN

    So what I want to do today is, um, continue our discussion on AI project strategy. So if you're building a deep learning system for some task, and today, um, for the first part of today, I'm gonna use a speech recognition, uh, voice-activated device example. And then for the second half I'm gonna use a kind of AI deep researcher example. But what I want to do is, um, walk you through a couple concrete examples of, you know, projects you might work on, um, and let you understand what it feels like to be in the thick of building an AI system, making day-to-day decisions on what to do next. Uh, I find that, I, I think as you've heard me say before, understanding the algorithms is important. So in this class you learn a lot, you know, from the online videos about the, uh, deep learning algorithms, how to build pipelines. But even beyond understanding how the algorithms work, what really drives performance is a team's ability to have an efficient development process. How do you tune the parameters? How do you collect data? When you try something and it doesn't work the first time, which it often doesn't, what do you do next? The skill in making those decisions is what often makes a massive literally 10X difference in productivity. And as I was reflecting, was preparing for, you know, what to say to you today, I was reflecting on quite a few projects where, um, that 10X difference in productivity is, is, is, is really not an exaggeration, right? Maybe more than 10X. But I've literally seen many teams in many, you know, well-known companies with good brands, um, spend a year working on a project that I will see a more skilled team execute in, like, a month, right? So these differences in skill are real. And one of the challenges, um, finding people learning this is if you work for some company, maybe you work on a different project every year or two, right? But so it takes you, like, you know, two years of your... 
Two years or a year of your life to gain experience on one more project, and then after, I don't know, 10 years, you've finally seen 10 projects and are pretty experienced. Uh, but what I want to do is in today's class walk you through a few concrete examples of projects that are similar to the ones... They're kind of simplified versions of stuff that I've seen myself, to try to accelerate your hands-on experience looking at these projects and thinking through if you are the one in the hot seat building a system and it works or doesn't work or this problem or whatever, making those decisions for what you would do. So try to get you through that, um, today with a couple examples rather than you having to spend years and years of your life to finally see a small number of examples of how these projects can be driven. Right? So the first multi... The first of two motivating examples I'm gonna use today is, um, building a voice-activated device. Uh, in my house I have, you know, like a, a Amazon Echo, right? I have a lot of them, actually, and I think it's a delightful experience. But those devices like that, uh, Amazon Echo or Google Home or the, um, Apple Siri home pod, uh, they require quite a bit of setup, right? They require some set... Connect to the Wi-Fi, figure out a way to connect to your phone, blah. And so, you know, uh, actually for a long time, even though I, I built smart speakers, um, for a long time in my house I had one light bulb connected to my home Wi-Fi internet, you know, because it's just so much of a hassle to set things up. I, I think now I have two light bulbs connected in my house. I guess I should, I should connect more stuff. Um, but so for this motivating example I want to talk about if you are, uh, part of a startup building a new product that, um, makes it much easier to get these voice control devices without the user needing to do this whole Wi-Fi setup process. 
So, um, you know, I'm not, like, very good at drawing, but if you could go to some store and buy a desk lamp and the desk lamp already has a name, I'll just call this lamp Robert. And you could just take it, plug it into the electricity, plonk it on your desk, and then say, "Robert, turn on," and then it turns on. Say, "Robert, turn off," then it turns off without needing to be connected to the internet, using cloud access and all that. Then that would give users a easier setup experience. Um, there's a project that my friends and I actually discussed a few years ago that we thought would actually be a decent startup idea. We decided not to do it because there were too many other ideas that we were even more excited about, but we felt that it was actually, you know, like, a reasonable, reasonable startup idea to build a little IC circuit, a little integrated circuit, to sell to, say, lamp manufacturers and other device manufacturers to make it really easy if, if, say, some company sells lamps, to build little things so they can very quickly make their devices voice-enabled. And if you have a few pre-built names, maybe give the users a choice. You can call your lamp Robert or Lena, you know, or, or, or I don't know, or Johnny or, I don't know, Alice or whatever. Um, um, have a little switch, then users could, um, just buy a lamp, put it down, and then immediately have it be voice-controlled without needing to worry about how do you get this onto my Wi-Fi network and if my internet is down then my whole house stays dark as well because of things like that, right? Um, and, and I actually did once have an... I was building a lot of voice assistants. I actually set up my office to have a lot of voice control devices. So had different names of different lamps. Standing desk had its name as well. So I'd say... I forget what my desk name was. You know, like, uh, you know, "Jonathan, go higher," or whatever, and then my standing desk would go up and down. 
It was actually pretty cool. Um, so what I want you to... So, uh, just for today's illustration... Probably we should give the different, uh, devices different names because you can't have every device in your house called Robert. Otherwise, you say, "Robert, turn on," the whole house illuminates. Then, "Robert, turn off," whole house is plunged into the darkness. So we found out you do need different devices to have different names. Uh, but just for today, I'm going to use, um, as illustration the task of training a neural network or building a system that detects when someone says, "Robert, turn on." And you need kind of, "Robert, turn on," "Robert, turn off." If you have a choice of names, it'd be "Lena, turn on," "Lena, turn off," and a handful of options of names. But just for simplicity, I'm gonna worry only about detecting the phrase, um, "Robert, turn on." Right? And whatever we do for this, you can rinse and repeat to get the turn off command and a handful of other names to make it user selectable what, you know, what name they want to give this thing. Okay? So, um, this is something that'll need to run on device, uh, small IC circuits. And my... And, and I'm gonna ask you a question, right? Um, when you've graduated from CS230 or maybe when you graduate from Stanford, if you are the CTO of a startup responsible for building this, um, what would you do, right? So call out. So i-imagine you just graduated from CS230 or just graduated from Stanford, and you're the CTO of a startup, and you want to build this lamp that can turn on when anyone says, "Robert, turn on." How would you approach this problem? And I know this is an incredibly open-ended question, and it turns out life is incredibly open-ended, right? You graduate from CS230, you have to decide what to do. So if this is what you're doing, uh, feel free to raise your hand and call out. If you want to build this product, what's the first thing you'd do? How would you think about it? Go for it.

  2. SP

    I think to take it from here, uh, it speaks to the exponent and then it speaks to the exponent to have the actual work and then kind of operate the work with proper knowledge. Like, uh, like bring up, um, to other ways.

  3. AN

    Yeah. Cool. Yeah, great. So get some, like, uh, open source speech-to-text model or something, and then see if it runs. Yeah, that'd be a good start. Anything else? Yeah, go ahead.

  4. SP

    It's created with three models. Um, increasing the complexity, the first one would detect audio, uh, just going audio. The second one would detect the word Robert, and the third one would try to parse the phrase that comes after that.

  5. AN

    Okay. Sorry. Three models, one to detect Robert, and then one that-

  6. SP

    Well, the first one detects the word sound.

  7. AN

    I see.

  8. SP

    The second one detects the word Robert, and the third one tries to understand what the sentence after the word Robert.

  9. AN

    I see. Cool. Right. Okay. Right. So, so, uh, uh, three models to detect Robert, understand the rest sentence, or detecting the sound. Cool. Go for it.
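
The three-model idea from this exchange can be sketched as a cascade in which each stage runs only if the previous one fires, so the cheapest check handles most of the audio. This is a minimal sketch, not a method given in the lecture: the function names, the energy threshold, and the stand-in stages are all assumptions for illustration, and in practice each stage would be a trained model.

```python
from typing import Callable, List, Optional

Frame = List[float]  # one short window of audio samples (hypothetical format)


def energy_sound_detector(frame: Frame, threshold: float = 0.01) -> bool:
    """Stage 1 stand-in: report sound when mean energy exceeds a threshold.

    The threshold value is an assumption chosen for illustration only.
    """
    if not frame:
        return False
    energy = sum(sample * sample for sample in frame) / len(frame)
    return energy > threshold


def run_cascade(
    frame: Frame,
    has_sound: Callable[[Frame], bool],
    has_name: Callable[[Frame], bool],
    parse_command: Callable[[Frame], Optional[str]],
) -> Optional[str]:
    """Run the stages in order and stop at the first one that rejects the frame."""
    if not has_sound(frame):
        return None  # silence: skip both later stages
    if not has_name(frame):
        return None  # sound, but not the device name
    return parse_command(frame)  # e.g. "turn_on", or None if not understood
```

A caller would pass its own name detector and command parser as the second and third stages; the point of the structure is that the more expensive stages are never invoked on silence.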

  10. SP

    I think it would be like-

  11. AN

    Oh, sorry. Say again.

  12. SP

    Like, you... I would do it a bit differently because like if we have a model that we detect Robert, then let's say you want to expand the company and have different services and things.

  13. AN

    Mm-hmm.

  14. SP

    Then you have to repeat like the same process for Jane, Alice, or... So instead, I think what you could do is take a model that takes, uh, given like those strings auto prompt and another auto prompt and see like train it to identify like the two policy the same as well.

  15. AN

    Okay. Oh, I see. Oh, like a... It's like a Siamese network. We actually teach a Siamese network later in this course where something that inputs two audio files and decides they're saying the same words, so you can more easily generalize to new words than, than Robert. Cool. No, that's actually pretty interesting. Yeah. Go, go for it.
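
The Siamese idea mentioned here can be sketched as one shared encoder applied to both inputs, followed by a similarity check on the two embeddings. This is a toy sketch under stated assumptions: the single tanh layer, the fixed-length feature vectors, the cosine similarity, and the 0.8 threshold are all illustrative choices, not details from the lecture. A real system would learn the encoder weights from pairs of audio clips.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def encode(features: Vector, weights: Sequence[Vector]) -> List[float]:
    """Shared encoder: one linear layer followed by tanh (toy stand-in)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, features)))
        for row in weights
    ]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embeddings; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def same_phrase(
    clip_a: Vector,
    clip_b: Vector,
    weights: Sequence[Vector],
    threshold: float = 0.8,  # assumed value, would be tuned on held-out pairs
) -> bool:
    """Encode both clips with the SAME weights and compare their embeddings."""
    similarity = cosine_similarity(encode(clip_a, weights), encode(clip_b, weights))
    return similarity >= threshold
```

Because both inputs go through the same weights, supporting a new name only requires a reference recording of that name rather than a newly trained detector, which is the generalization benefit described above.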

  16. SP

    I, I don't know how to make, uh, these devices work together, but I think it may be simpler way to just have some plugin on the phone and like have it run remotely through that. So like, kind of like it's like Siri. Uh, so if I could go like image, uh, like activate it on my phone, my phone would just then call like turn on, turn on to a device.

  17. AN

    Oh, oh, sorry. You mean just, uh, connect up your device to Siri?

  18. SP

    Uh, yes. Kind of like a neural network on my phone and the iPhone would send device, uh, the signal to the device to turn it on.

  19. AN

    I see. Okay, cool. Yeah. Right. Yeah. That sounds interesting. It sounds like a different product than this if, if we need to connect to your cell phone and all that though. Yeah. Cool. Right. So l-let me just, um... Lots of interesting ideas. Uh, let me just make some observations. Um, so I find that when building software products, um, uh, there are actually lots of good ideas or lots of reasonable things you could try. But, you know, as, as you heard me mention a few weeks ago, I think one of the strongest predictors for the odds of you building something compelling is speed. So I find that, um, of all of these ideas, I think some are better than others, but it doesn't... But, but whether the, the idea is, you know, a bit better or a little bit worse, it is important, but it's actually secondary to how quickly you can just get something built. So if you're actually the CTO of a startup like this, I would encourage you to look and say, "All right. What can we build, you know, today?" Or, "What can we build maybe in a week?" And, um, try out any of these architecture choices and build it and see what happens. Because even if what you build is a little bit less, you know, good, um, you can find that out in two days, you know, then you course correct very quickly. Um, I've actually wor- I've actually built a lot of smart speakers, um, so I have maybe firsthand experience of this, and I just share some things that I happen to know that there's no reason you would know. But it, it turns out that, um, let's see. At least today, general purpose, um, speech recognition is still a little bit heavyweight. Uh, takes quite a lot of, you know, processing power. Uh, is a bit expensive to run on an edge device if you wanna make this, like, a just a few dollars. Um, but it turns out that if you look at the, um, smart speakers, uh, there's usually, um, uh, uh... 
If you want to train a neural network just to detect one phrase, uh, be it a phrase like, you know, uh, "Okay, Google," or, "Hey, Siri," or, "Alexa," or whatever, any of the smart speaker trigger words, that can be done with a fairly small neural network. Although to your point, if we want to do different neural networks, different words, we'll need to swap out different neural networks and rinse and repeat that, that, uh, uh... But if you have only a small handful of names, phrases we want to detect, I think, I think that'd be okay. And then one other piece of advice I would give, um, if you're embarking on this is the first thing I would do, actually, if I was working on this for the first time is, is actually a literature search. Um, and it turns out that, uh, you know, we're... I think we're fortunate that the AI world has a ton of open, uh, source software and a ton of, um, uh, open research papers. Somewhat surprisingly, despite smart speakers having been around for a long time, there still to this day isn't, like, a single architecture that everyone's agreed on on the best way to do this. If you look at the literature, there is still actually a diversity of opinions on how to do this type of, uh, wake word or trigger word, uh, we sometimes just, uh... Uh, so when you say something like, you know, uh, "Okay, Google," or, "Hey, Siri," or, "Alexa," that's sometimes called a wake word 'cause it wakes up the device, or a trigger word triggers the device's wake activity. So somewhat surprisingly, even though we've had smart speakers for a long time now, for a lot more than a decade, um, uh, there still isn't a single agreed on unified architecture that the community has agreed on on what's the best algorithm to do this. 
But I feel like if you are embarking on this for the first time, um, the number one boost in your speed of learning, it could be implementing something, but I would say doing a literature search and trying open source software would be the even faster accelerator. And I wanna just give you a few tips for that, right? Um, real quick. So, you know, today, if, if this is, um, uh... There are a lot of research articles and blog posts and open GitHub repos on, uh, m-many topics, certainly wake word detection. And what I find is that, um, if this is research paper one, research paper two, research paper three, research paper four, I find that, you know, sometimes people will, um, spend a lot of time reading research paper one until you're done, right? It's just 0% complete and then it's 100% complete. And they spend a lot of time reading research paper two, spend a lot of time reading research paper three. Um, and I just recommend you not do that. Instead, when I'm doing a literature search, what it often feels like is, you know, do a few web searches for a handful of resources, skim all of them, 0% complete, 100% complete, right? Based on your initial reading, you may decide to go back to paper C and spend more time to really read and understand that, but that'll help you find additional references. You can skim, and maybe you find a paper seven that's really seminal. Spend a lot of effort. But this is what doing a very broad survey of the literature will feel like, where, you know, you, you, you really put in the time to finish only a very small number of resources, but spend a lot more time skipping around and getting a cursory level understanding of a broader set of papers, which can also point you to, um, also point you to the more useful resources to focus attention on. And just one other thing I've seen among Stanford students, um, uh, there's one other thing that I find people tend to underuse, which is, um, trying to talk to experts. 
So if you are actually a CTO of a startup trying to build this for the first time, um, I feel like, you know, we all wanna do our own work and not bother other people, which is good. But I just encourage you to consider if, um, after you've done your own work, if reaching out to an expert can really accelerate your learning. So something I've often done is, you know, I will do my own work, right? I don't wanna call random experts to try to bother them before I've at least done my work. But if you've, uh... Sometimes I'm reading a paper, I'm really struggling to understand it, and I find that instead of me struggling for another, like, four hours, um, if I send the authors a respectful email, uh, and say, "Hey, I read your paper. Tried to understand this. I'm still confused. Can you help me out? Can you explain this to me?" A lot of the time, um, uh, i-including when, you know, I, I was less well-known, right? But I, I think a lot of people, if they see that you're doing your work and not just reaching out to them before you've even done anything, a lot of them, not everyone, but a lot of, um, research authors will actually be quite, you know, understanding and, and, and try to help you out. And so I find that for a lot of projects I did, um, finding that one expert, sometimes a, sometimes a Stanford professor actually. We have a lot of speech faculty and Stanford faculty as well. But, um, when I had problems with my speech recognition systems, you know, sometimes I call, like, oh, Dan Jurafsky, whatever, and then, like, a half-hour conversation or even a 10-minute conversation really accelerates, um, what I've been able to do. So I just encourage you to, um... It, it takes you, like, 10 minutes to send an email, and maybe there's a 50% chance of a response. I don't know, right? Not 100% chance, but sometimes that tends to be really high ROI, right? 
And then what I think you find is that, um, if you do a literature search, you will likely discover that, uh, most of the robust enterprise-grade smart speaker systems all have a specialized system trained to detect the wake word or the trigger word, right? And again, there's a variety of architectures. You probably come up with some good neural network architectures for this. Um, and, and, and... Oh, it turns out that, uh, there is no data set on the internet with lots of people saying, "Robert, turn on," right? That's just not a thing, right? So if you decide that the names are Robert and Lena and, you know, I don't know, Jeremy and Alicia and whatever. Sorry if they're people with those real names. It was actually kind of weird when we chose people's names, uh, but may- but we just did that for some reason. But it turns out that there is no large data set on the internet of lots of people saying, "Robert, turn on." So if you want to train a neural network to detect if someone has said this phrase, you need to collect that data yourself, right? Um, so let me then ask you another question. Um, let's say you've done a literature search, found some open source code you can try, but that has led you, say, to conclude that you need, uh, data sets of people saying, "Robert, turn on." Um, uh, you, you need a data set to train to distinguish between someone saying, "Robert, turn on," versus not someone saying, "Robert, turn on." Where would you... How, how would you approach getting a data set like that? So yeah, go for it.

  20. SP

    Text to speech.

  21. AN

    Text to speech. Mm-hmm.

  22. SP

    Use a data

  23. AN

    Yeah. Cool, yeah. Yep, you use text to speech as one method. Um, this-- uh, I'll come back to that later, so you need that notes. Go, go.

  24. SP

    Just take all those who visit your campus and ask people to speak.

  25. AN

    Yeah.

  26. SP

    And record it.

  27. AN

    Yeah. Yep, cool. Yep, I like that. Yeah, so walk around and ask people, tell them what you're doing, get their permission to record and use their voice, and then just if they're okay giving permission, just record their voices. Yeah, I like that. Anything else? Cool. Um, yeah.

  28. SP

    You could turn on like samples where if you have a basic like maybe like parent speech, like the phrase like trigger to offset is set and then that was procedural. But in a training set where it triggers-

  29. AN

    Oh, sorry. Can you say that again? Uh-

  30. SP

    So like you mentioned use that rule, like you should have samples where they're all the turn offsets and then there's like also samples where they're not.
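
The data set discussed here needs both clips where someone says "Robert, turn on" and clips where nobody does. A minimal sketch of assembling such a labeled set follows. The file names, the 1/0 label convention, and the 20% dev split are assumptions for illustration, not details from the lecture.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, int]  # (path to audio clip, label): 1 = trigger phrase, 0 = anything else


def build_manifest(
    positive_clips: Sequence[str],
    negative_clips: Sequence[str],
    seed: int = 0,
) -> List[Example]:
    """Label each clip and shuffle so positives and negatives are interleaved."""
    examples = [(path, 1) for path in positive_clips]
    examples += [(path, 0) for path in negative_clips]
    random.Random(seed).shuffle(examples)  # seeded for a reproducible order
    return examples


def split_train_dev(
    examples: Sequence[Example],
    dev_fraction: float = 0.2,  # assumed split, adjust to the amount of data
) -> Tuple[List[Example], List[Example]]:
    """Hold out the first slice of the shuffled examples as a dev set."""
    n_dev = max(1, int(len(examples) * dev_fraction))
    return list(examples[n_dev:]), list(examples[:n_dev])
```

Negatives would typically include other speech, near misses such as "Robert, turn off," and background noise, so the detector learns what not to fire on.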

Episode duration: 1:15:17

Transcript of episode s6JVGzABKho