
Stanford CS230 | Autumn 2025 | Lecture 6: AI Project Strategy

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai

October 28, 2025

This lecture walks through example AI projects and the day-to-day decisions involved in building AI systems.

To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning
To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/
More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X

NOTE: There was no class on November 4, 2025 (Lecture 7). The next lecture is Lecture 8.

Andrew Ng
Founder of DeepLearning.AI
Adjunct Professor, Stanford University’s Computer Science Department

Kian Katanforoosh
CEO and Founder of Workera
Adjunct Lecturer, Stanford University’s Computer Science Department

Andrew Ng (host)

Nov 5, 2025 | 1h 15m

CHAPTERS

0:05 – 2:37
Why AI project strategy is about development speed, not just algorithms
Ng frames the lecture around practical AI project strategy: what to do next when a model doesn’t work, and how teams make day-to-day decisions. He argues that an efficient iteration process can create 10× (or more) productivity differences between teams, often dwarfing pure algorithm knowledge.
- Algorithm knowledge is necessary but not sufficient for strong real-world performance
- Development process (data, tuning, next-step decisions) drives outsized productivity gains
- The same project can take one team months and another team years, depending on iteration discipline
- Lecture goal: simulate hands-on experience through simplified real project stories
2:37 – 7:12
Motivating product: an offline voice-controlled device with named wake phrases
Using a startup-like scenario, Ng describes a device (e.g., a lamp) that responds to a spoken name and command without Wi‑Fi/cloud setup. The simplified technical target becomes detecting the phrase “Robert, turn on” on a low-power edge chip.
- Product vision: plug-in-and-use voice control without internet setup
- Need distinct device names to prevent whole-house activation
- Edge constraints imply small, efficient models and careful system design
- Focus for the lecture example: detect a single phrase reliably
7:12 – 12:38
How to start as CTO: pick a fast first build, then iterate
After students suggest architectures (general ASR, multi-stage models, Siamese networks, phone-assisted control), Ng emphasizes that idea quality matters less than quickly building something testable. Fast prototypes reveal what works and enable rapid course correction.
- Many reasonable architectures exist; don’t over-optimize upfront decisions
- Speed of getting a first working system is a strong predictor of success
- Early prototypes reduce uncertainty and guide architecture/data choices
- Edge reality check: full speech-to-text is often too heavy vs wake-word models
12:38 – 18:13
Literature search tactics and leveraging experts for acceleration
Ng recommends literature search and open-source implementations as the highest-leverage first moves. He shares practical advice: skim broadly before deep-reading, and don’t hesitate to respectfully contact paper authors or domain experts to unblock progress.
- No single universally accepted wake-word architecture despite long history
- Skim many papers/repos first; then deep-dive only into the most promising
- Open-source and blog posts can dramatically speed up implementation
- Emailing authors/experts (after doing your homework) can save hours/days
18:13 – 21:59
Data acquisition: collecting “Robert, turn on” (and respecting consent)
Ng highlights a common blocker: the needed dataset often doesn’t exist, so you must create it. He discusses practical collection methods (asking people to record samples) and stresses privacy, clear consent, and ethical data practices.
- Custom wake phrases require custom data collection
- In-person collection can yield dozens/hundreds of samples in a day
- Privacy and explicit consent are essential; avoid ‘sneaky’ collection
- Negative examples matter: not just the target phrase
21:59 – 27:03
Synthetic speech data: useful, but usually not the first step
While synthetic data (TTS) can help, Ng explains why he often starts with natural data: synthetic pipelines introduce many knobs and uncertainty. He illustrates with a self-driving analogy: synthetic worlds may lack the diversity of real-world variation.
- TTS can work, but voice diversity and realism can be limiting
- Synthetic generation adds complexity and hidden failure modes
- Real data reduces uncertainty early in development
- Analogy: video game cars may be too few/too uniform vs real roads
27:03 – 30:12
Windowing trick: turning long audio clips into many labeled examples
Ng describes a practical dataset-building hack: record longer clips containing the phrase, then slice them into multiple time windows and label only the window aligned to the end of the phrase as positive. This expands a small set of recordings into thousands of supervised examples.
- Represent audio as waveforms; phrase duration ~1 second
- Use sliding windows (e.g., 3-second) cut from longer recordings
- Label positive when the phrase just finished; others are negative
- Small clip count can become thousands of labeled training examples
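
The windowing idea can be sketched in a few lines. This is only an illustration, not the lecture's actual code: the function name `label_windows`, the 0.1-second hop, the tolerance `tol_s`, and the assumption that each recording comes with a known phrase end time are all hypothetical choices made for the example.

```python
def label_windows(clip_len_s, phrase_end_s, window_s=3.0, hop_s=0.1, tol_s=0.05):
    """Return (start, end, label) for each sliding window over one recording.

    A window is positive (1) only when it ends within tol_s seconds of the
    moment the phrase finishes; every other window is negative (0).
    """
    examples = []
    n_windows = int(round((clip_len_s - window_s) / hop_s)) + 1
    for i in range(n_windows):
        start = round(i * hop_s, 6)
        end = round(start + window_s, 6)
        label = 1 if abs(end - phrase_end_s) <= tol_s else 0
        examples.append((start, end, label))
    return examples

# One hypothetical 10-second recording in which the phrase ends at t = 6.0 s.
windows = label_windows(clip_len_s=10.0, phrase_end_s=6.0)
positives = sum(label for _, _, label in windows)
print(len(windows), positives)  # 71 windows, 1 of them positive
```

Under these assumed settings a single recording yields 71 examples but only one positive, which hints at how skewed the resulting labels can become.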
30:12 – 35:38
The 97% accuracy trap: diagnosing class imbalance and metric misuse
Training yields 97% accuracy, but the model always predicts 0 and therefore never detects the phrase. Ng uses this to show how misleading accuracy can be under heavy class imbalance, and prompts students for strategies that increase the share of positives or reweight the learning objective.
- High accuracy can hide a useless always-negative classifier
- Problem source: highly imbalanced labels (many more negatives)
- Fix options: oversample/duplicate positives, weight loss, penalize false negatives
- Rule of thumb: imbalance up to ~1:10 often okay; beyond that, start correcting
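
A small sketch of why accuracy misleads here. The 3-positive, 97-negative label split is an assumption picked so that an always-negative model scores exactly 97%; the helper names and the inverse-frequency weighting are one common heuristic, not necessarily what was used in the lecture.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives that the model actually detected."""
    n_pos = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / n_pos if n_pos else 0.0

def balanced_class_weights(y_true):
    """Weight each class inversely to its frequency (assumes both classes occur)."""
    n = len(y_true)
    n_pos = sum(y_true)
    n_neg = n - n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}

y_true = [1] * 3 + [0] * 97   # assumed imbalance: 3% positives
y_pred = [0] * 100            # a model that always predicts 0

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
weights = balanced_class_weights(y_true)
print(acc, rec, weights)
```

Reporting recall (or precision and recall together) alongside accuracy exposes the always-negative model immediately, and the per-class weights show roughly how much more each positive example would need to count in a weighted loss.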
35:38 – 37:41
A commercial ‘hack’: widen the positive window to create richer positives
Instead of labeling only a narrow instant as positive, Ng’s team extended the positive label to cover a longer interval after the phrase ends. This increases positive sample count and adds diversity without pure duplication, improving learnability in practice.
- Extend ‘positive’ to a 0.5–1.0 s window after phrase completion
- Generates multiple distinct positive windows from one utterance
- Adds slight diversity vs exact duplication/oversampling
- Illustrates pragmatic, product-driven labeling choices
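
A hedged sketch of the widened labeling rule: a window counts as positive if it ends anywhere from 0 to `positive_span_s` seconds after the phrase finishes. The function name, the 0.1-second hop, and the 3-second window are illustrative assumptions.

```python
def label_windows_widened(clip_len_s, phrase_end_s, window_s=3.0, hop_s=0.1,
                          positive_span_s=0.5):
    """Return (window_end, label) pairs using a widened positive interval."""
    examples = []
    n_windows = int(round((clip_len_s - window_s) / hop_s)) + 1
    for i in range(n_windows):
        end = round(i * hop_s + window_s, 6)
        delay = round(end - phrase_end_s, 6)  # time elapsed since the phrase ended
        label = 1 if 0.0 <= delay <= positive_span_s else 0
        examples.append((end, label))
    return examples

# Hypothetical 10-second recording, phrase ends at t = 6.0 s.
narrow = label_windows_widened(10.0, 6.0, positive_span_s=0.0)
half_second = label_windows_widened(10.0, 6.0, positive_span_s=0.5)
one_second = label_windows_widened(10.0, 6.0, positive_span_s=1.0)
print(sum(l for _, l in narrow), sum(l for _, l in half_second), sum(l for _, l in one_second))
```

With these assumed settings the same utterance produces 1, 6, or 11 positive windows, and each positive contains slightly different trailing audio rather than being an exact copy.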
37:41 – 41:01
Overfitting after imbalance fixes: regularization and more data
After balancing, training accuracy remains high but dev accuracy collapses, indicating overfitting. Ng discusses common remedies (regularization, more data) and flags that real projects often face distribution mismatch between training and real-world evaluation data.
- Train >> dev performance implies overfitting (high variance)
- Try regularization first; also collect more data
- In practice, train/test distributions may differ (especially with synthetic data)
- Synthetic-heavy training can diverge from real user conditions
41:01 – 48:12
Noise mixing for speech: scalable synthetic augmentation (and its pitfall)
Ng shares a synthetic technique that worked: add clean speech waveforms to background-noise clips (superposition) to simulate real environments. He warns that if all speech is the wake phrase, the model may learn voice activity detection rather than phrase recognition—so you must include non-target speech too.
- Create training data by summing clean speech + background noise audio
- Use diverse noise sources (room hum, highway, coffee shop), respecting licenses
- Avoid learning ‘someone is talking’ by adding many non-target phrases
- Increase robustness to real usage conditions like music in the background
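
Noise mixing by superposition can be sketched with plain Python lists standing in for waveforms. The lecture describes simply adding the two signals; scaling the noise to a target signal-to-noise ratio (`snr_db`) is an extra assumption added here, as are the function names and the synthetic tone and Gaussian noise used in place of real recordings.

```python
import math
import random

def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(speech, noise, snr_db, rng=random):
    """Add a randomly chosen stretch of noise to speech at the given SNR (dB)."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    offset = rng.randrange(len(noise) - len(speech) + 1)
    segment = noise[offset:offset + len(speech)]
    noise_rms = rms(segment)
    # Scale the noise so that rms(speech) / rms(scaled noise) matches snr_db.
    gain = rms(speech) / (noise_rms * 10 ** (snr_db / 20)) if noise_rms > 0 else 0.0
    return [s + gain * n for s, n in zip(speech, segment)]

rng = random.Random(0)
sample_rate = 16000
# Stand-ins for real audio: a 1 s tone as "speech", 2 s of Gaussian noise as "background".
speech = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 0.1) for _ in range(2 * sample_rate)]
mixed = mix_at_snr(speech, noise, snr_db=10, rng=rng)
```

To avoid the pitfall noted above, the same mixing would be applied to non-target utterances as well, with those mixtures labeled negative, so that the presence of speech alone never predicts the positive class.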
48:12 – 55:21
Iteration cadence: ML feels like debugging, powered by tight eval loops
Ng explains that ML development is dominated by diagnosing failure modes and fixing them—more like debugging than writing to spec. He outlines a disciplined daily loop (train overnight, analyze in morning, implement in afternoon, launch at night) and notes how training time fundamentally shapes team workflow.
- Hard to predict what fails next; success comes from systematic debugging
- Daily rhythm: train → error analysis → code/data changes → retrain
- Training duration (minutes vs hours vs weeks) changes project management style
- Transfer learning can shorten iteration cycles via quick fine-tuning
55:21 – 57:55
Why speed compounds into competitive advantage
Ng illustrates that being consistently 2× faster in iteration yields a large performance lead at any given time, not just a small schedule slip. Faster teams reach better quality sooner and become more competitive in the market.
- Small per-iteration delays compound into large capability gaps
- Marketplace cares about performance at a point in time, not eventual parity
- Speed is a strategic moat for AI product teams
- Operational discipline often outweighs marginal algorithmic differences
57:55 – 1:06:30
Second example: building an LLM ‘deep researcher’ pipeline
Ng shifts to pipeline-based systems using a deep-research agent: generate search terms, query a search engine, choose which pages to fetch, then synthesize a report. He emphasizes that multiple pipeline steps can fail, so strategy requires identifying the true bottleneck rather than guessing.
- Pipeline stages: query → LLM-generated search terms → web search → select URLs → fetch → write report
- Modern systems may iterate autonomously; early versions are linear pipelines
- Failures can come from any stage (terms, search engine freshness, page selection, writing)
- Project strategy = decide which component to improve for maximum impact
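
The linear pipeline can be sketched as a function that threads a query through each stage and keeps every intermediate output for later inspection. Everything here is hypothetical: the `Trace` record, the stage callables, and the stub lambdas stand in for real LLM calls, a real search engine, and a real page fetcher.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Intermediate outputs of one pipeline run, kept for later inspection."""
    query: str
    search_terms: list = field(default_factory=list)
    results: list = field(default_factory=list)
    selected_urls: list = field(default_factory=list)
    pages: dict = field(default_factory=dict)
    report: str = ""

def run_pipeline(query, gen_terms, search, select_urls, fetch, write_report):
    """Run the stages in order: terms, search, URL selection, fetch, report."""
    trace = Trace(query=query)
    trace.search_terms = gen_terms(query)
    trace.results = [hit for term in trace.search_terms for hit in search(term)]
    trace.selected_urls = select_urls(query, trace.results)
    trace.pages = {url: fetch(url) for url in trace.selected_urls}
    trace.report = write_report(query, trace.pages)
    return trace

# Stub stages so the sketch runs offline; real stages would call an LLM or the web.
trace = run_pipeline(
    "wake word detection",
    gen_terms=lambda q: [q, q + " survey"],
    search=lambda term: ["https://example.com/" + term.replace(" ", "-")],
    select_urls=lambda q, results: results[:1],
    fetch=lambda url: "stub page text for " + url,
    write_report=lambda q, pages: f"Report on {q} from {len(pages)} page(s)",
)
print(trace.report)
```

Saving the output of every stage in one record is a design choice that makes it possible to ask, for any bad report, which stage first went wrong.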
1:06:30 – 1:15:17
Error analysis for pipelines: spreadsheet-driven diagnosis to pick the next work item
Ng describes a hands-on evaluation method: examine intermediate outputs for a set of underperforming queries and record which stage caused the issue. By quantifying hotspots (e.g., page selection fails 70% of the time), teams can focus effort where it matters and avoid weeks of wasted work.
- Manually inspect intermediate outputs for 10–100 ‘bad’ examples
- Track each step’s quality (search terms, search results, URL choice, writing)
- Use frequency of failure to prioritize engineering effort
- Methodical error analysis reduces variance in next-step decisions and accelerates progress
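
The spreadsheet tally can be sketched as a simple count over manually labeled rows. The rows below are made-up illustration data, arranged so that URL choice is marked bad in 7 of 10 examples to mirror the 70% figure mentioned above; the stage names and the `failure_rates` helper are assumptions.

```python
from collections import Counter

STAGES = ["search_terms", "search_results", "url_choice", "writing"]

def failure_rates(rows, stages):
    """Return (stage, fraction of rows marked 'bad') sorted from worst to best."""
    counts = Counter()
    for row in rows:
        for stage in stages:
            if row.get(stage) == "bad":
                counts[stage] += 1
    total = len(rows)
    rates = [(stage, counts[stage] / total) for stage in stages]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)

# Made-up labels for 10 underperforming queries; unlisted stages were judged fine.
rows = (
    [{"url_choice": "bad"}] * 5
    + [{"url_choice": "bad", "search_terms": "bad"}] * 2
    + [{"writing": "bad"}]
    + [{"search_results": "bad"}]
    + [{"search_terms": "bad"}]
)

rates = failure_rates(rows, STAGES)
for stage, rate in rates:
    print(f"{stage}: {rate:.0%}")
```

Sorting by failure frequency puts the most common culprit first, which is the stage most likely to repay the next block of engineering effort.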