Stanford Online | Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG
At a glance
WHAT IT’S REALLY ABOUT
Practical methods to augment LLMs: prompting, RAG, agents, evals, multi-agents
- Base LLMs fail in practice due to domain gaps, staleness, controllability issues, limited context/attention, lack of sources, and task-specific precision requirements.
- Prompt engineering is presented as the first and often best optimization layer, covering prompt templates, few-shot examples, chain-of-thought guidance, and multi-prompt chaining for debuggability.
- Fine-tuning is framed as costly and often non-ideal compared with rapid prompt iteration and model upgrades, though it can help for repeated high-precision domain outputs.
- RAG is explained as grounding answers in external documents via embeddings and vector databases, with enhancements like chunking and HyDE to improve retrieval quality and citation fidelity.
- Agentic workflows extend LLM apps from single-step Q&A to multi-step tool-using systems with memory, planning/execution loops, evals, and sometimes multi-agent parallelism for reuse and speed.
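The retrieval step at the heart of RAG can be sketched in a few lines. This is a toy illustration, not a production pipeline: the vectors are hard-coded stand-ins for what an embedding model would produce, and a plain list stands in for a vector database.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, chunks, k=2):
    # Rank document chunks by similarity to the query embedding
    # and return the top-k texts to be placed into the prompt.
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in scored[:k]]

# Toy corpus: in a real system these vectors come from an embedding
# model and are stored in a vector database.
chunks = [
    {"text": "Refund policy: 30 days", "vec": [0.9, 0.1, 0.0]},
    {"text": "Shipping times: 3-5 days", "vec": [0.1, 0.9, 0.1]},
    {"text": "Returns require a receipt", "vec": [0.8, 0.2, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "what is the refund policy?"
context = retrieve(query_vec, chunks, k=2)
prompt = "Answer using only these sources:\n" + "\n".join(context)
```

Grounding the answer in the retrieved text is what enables citation fidelity: the model is instructed to answer only from the supplied sources.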
IDEAS WORTH REMEMBERING
5 ideas
Treat LLM performance as a stack: model choice plus engineering around it.
The lecture emphasizes two axes of improvement: upgrading the foundation model and adding engineering layers (prompts, RAG, agents) that can dramatically change outcomes without touching weights.
Base models are hard to control, not just “missing knowledge.”
Beyond domain gaps and staleness, real deployments face safety/behavior drift, formatting inconsistency, and precision requirements (e.g., legal wording) that require guardrails and workflow design.
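One concrete form a guardrail can take is validating model output against an expected contract before it reaches the user. The field names below are hypothetical; the point is the pattern of never trusting raw output.

```python
import json

REQUIRED_FIELDS = {"answer", "citation"}  # hypothetical contract for this app

def validate_output(raw: str):
    # Guardrail: models drift on formatting, so parse the raw output
    # and check the contract; return None to trigger a retry or fallback.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not REQUIRED_FIELDS.issubset(data):
        return None
    return data

good = validate_output('{"answer": "30 days", "citation": "policy.pdf"}')
bad = validate_output('Sure! The refund window is 30 days.')
```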
Few-shot examples are a fast way to “align” subjective labels without training.
For tasks like sentiment/tone classification where labels depend on company context, embedding labeled examples directly in the prompt can outperform zero-shot and iterate faster than building a fine-tuning dataset.
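A few-shot prompt for this kind of task is just string assembly. The labeled examples below are invented placeholders for whatever reflects the company's own labeling policy; swapping them is the whole iteration loop.

```python
# Hypothetical labeled examples encoding one company's notion of "negative".
EXAMPLES = [
    ("The update broke my workflow.", "negative"),
    ("Works as described.", "neutral"),
    ("Support resolved it in minutes!", "positive"),
]

def build_fewshot_prompt(text: str) -> str:
    # Embed labeled examples directly in the prompt so the model
    # picks up the company-specific policy without any fine-tuning.
    lines = ["Classify the sentiment of customer messages."]
    for msg, label in EXAMPLES:
        lines.append(f"Message: {msg}\nSentiment: {label}")
    lines.append(f"Message: {text}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_fewshot_prompt("The new dashboard is confusing.")
```

Changing a label takes seconds, whereas rebuilding a fine-tuning dataset takes days, which is why this is the faster alignment loop for subjective tasks.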
Prompt chaining is about debuggability and targeted optimization, not just accuracy.
Breaking one complex prompt into multiple prompts yields intermediate artifacts (issues → outline → final response) that can be evaluated separately to locate bottlenecks and improve specific steps.
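The issues → outline → final-response chain can be sketched as a pipeline that exposes each intermediate artifact. `call_llm` is a placeholder for a real model call, stubbed here so the pipeline shape runs end to end.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; returns canned text so the
    # pipeline can be exercised without an API key.
    return f"[model output for: {prompt[:40]}...]"

def support_pipeline(ticket: str) -> dict:
    # Three small prompts instead of one large one. Each intermediate
    # artifact can be logged and evaluated on its own to locate the weak step.
    issues = call_llm(f"List the distinct issues in this ticket:\n{ticket}")
    outline = call_llm(f"Draft a response outline for these issues:\n{issues}")
    final = call_llm(f"Write the customer reply from this outline:\n{outline}")
    return {"issues": issues, "outline": outline, "final": final}

trace = support_pipeline("My order arrived late and the box was damaged.")
```

Because the trace keeps all three artifacts, an eval can score issue extraction separately from response drafting instead of only grading the final reply.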
Evals are the missing production discipline for LLM systems.
You need a mix of end-to-end metrics (user satisfaction), component metrics (tool call correctness), objective checks (IDs/prices), and subjective scoring (tone/helpfulness) often using LLM judges with rubrics.
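A minimal eval harness mixing those check types might look like the sketch below. The objective check is a real string match; the tone judge would in practice be an LLM call with a rubric, replaced here by a keyword heuristic so the harness runs offline.

```python
def check_order_id(answer: str, expected_id: str) -> bool:
    # Objective check: exact facts like IDs and prices are matched literally.
    return expected_id in answer

def judge_tone(answer: str) -> int:
    # Subjective check: stands in for an LLM judge scoring against a rubric.
    return 5 if any(w in answer.lower() for w in ("sorry", "happy to help")) else 2

def run_evals(cases):
    # Each case mixes component-level (ID) and subjective (tone) checks;
    # end-to-end metrics like user satisfaction would be tracked separately.
    results = []
    for case in cases:
        results.append({
            "id_correct": check_order_id(case["answer"], case["expected_id"]),
            "tone_score": judge_tone(case["answer"]),
        })
    return results

cases = [{"answer": "Sorry for the delay, order #1234 ships today.",
          "expected_id": "#1234"}]
report = run_evals(cases)
```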
WORDS WORTH SAVING
5 quotes
We started to learn about neurons... and now we're going one level beyond into what would it look like if you were building agentic AI systems at work.
— Kian Katanforoosh
LLMs are very difficult to control.
— Kian Katanforoosh
There is a frontier within which AI is absolutely helping and one where... people relied on AI... and it ended up going worse—‘falling asleep at the wheel.’
— Kian Katanforoosh
Chaining improves performance, but most importantly helps you control your workflow and debug it more seamlessly.
— Kian Katanforoosh
By the time you're done fine-tuning your model, the next model is out, and it's actually beating your fine-tuned version.
— Kian Katanforoosh
High quality AI-generated summary created from speaker-labeled transcript.