Stanford Online | Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG
At a glance
WHAT IT’S REALLY ABOUT
Practical methods to augment LLMs: prompting, RAG, agents, evals, multi-agents
- Base LLMs fail in practice due to domain gaps, staleness, controllability issues, limited context/attention, lack of sources, and task-specific precision requirements.
- Prompt engineering is presented as the first and often best optimization layer, covering prompt templates, few-shot examples, chain-of-thought guidance, and multi-prompt chaining for debuggability.
- Fine-tuning is framed as costly and often non-ideal compared with rapid prompt iteration and model upgrades, though it can help for repeated high-precision domain outputs.
- RAG is explained as grounding answers in external documents via embeddings and vector databases, with enhancements like chunking and HyDE to improve retrieval quality and citation fidelity.
- Agentic workflows extend LLM apps from single-step Q&A to multi-step tool-using systems with memory, planning/execution loops, evals, and sometimes multi-agent parallelism for reuse and speed.
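The retrieval step at the heart of RAG can be sketched in a few lines. This is a toy illustration, not a production pipeline: the vectors are hard-coded stand-ins for what an embedding model would produce, and a plain list stands in for a vector database.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, chunks, k=2):
    # Rank document chunks by similarity to the query embedding
    # and return the top-k texts to be placed into the prompt.
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in scored[:k]]

# Toy corpus: in a real system these vectors come from an embedding
# model and are stored in a vector database.
chunks = [
    {"text": "Refund policy: 30 days", "vec": [0.9, 0.1, 0.0]},
    {"text": "Shipping times: 3-5 days", "vec": [0.1, 0.9, 0.1]},
    {"text": "Returns require a receipt", "vec": [0.8, 0.2, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "what is the refund policy?"
context = retrieve(query_vec, chunks, k=2)
prompt = "Answer using only these sources:\n" + "\n".join(context)
```

Grounding the answer in the retrieved text is what enables citation fidelity: the model is instructed to answer only from the supplied sources.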
IDEAS WORTH REMEMBERING
5 ideas
Treat LLM performance as a stack: model choice plus engineering around it.
The lecture emphasizes two axes of improvement: upgrading the foundation model and adding engineering layers (prompts, RAG, agents) that can dramatically change outcomes without touching weights.
Base models are hard to control, not just “missing knowledge.”
Beyond domain gaps and staleness, real deployments face safety/behavior drift, formatting inconsistency, and precision requirements (e.g., legal wording) that require guardrails and workflow design.
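One concrete form a guardrail can take is validating model output against an expected contract before it reaches the user. The field names below are hypothetical; the point is the pattern of never trusting raw output.

```python
import json

REQUIRED_FIELDS = {"answer", "citation"}  # hypothetical contract for this app

def validate_output(raw: str):
    # Guardrail: models drift on formatting, so parse the raw output
    # and check the contract; return None to trigger a retry or fallback.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not REQUIRED_FIELDS.issubset(data):
        return None
    return data

good = validate_output('{"answer": "30 days", "citation": "policy.pdf"}')
bad = validate_output('Sure! The refund window is 30 days.')
```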
Few-shot examples are a fast way to “align” subjective labels without training.
For tasks like sentiment/tone classification where labels depend on company context, embedding labeled examples directly in the prompt can outperform zero-shot and iterate faster than building a fine-tuning dataset.
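A few-shot prompt for this kind of task is just string assembly. The labeled examples below are invented placeholders for whatever reflects the company's own labeling policy; swapping them is the whole iteration loop.

```python
# Hypothetical labeled examples encoding one company's notion of "negative".
EXAMPLES = [
    ("The update broke my workflow.", "negative"),
    ("Works as described.", "neutral"),
    ("Support resolved it in minutes!", "positive"),
]

def build_fewshot_prompt(text: str) -> str:
    # Embed labeled examples directly in the prompt so the model
    # picks up the company-specific policy without any fine-tuning.
    lines = ["Classify the sentiment of customer messages."]
    for msg, label in EXAMPLES:
        lines.append(f"Message: {msg}\nSentiment: {label}")
    lines.append(f"Message: {text}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_fewshot_prompt("The new dashboard is confusing.")
```

Changing a label takes seconds, whereas rebuilding a fine-tuning dataset takes days, which is why this is the faster alignment loop for subjective tasks.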
Prompt chaining is about debuggability and targeted optimization, not just accuracy.
Breaking one complex prompt into multiple prompts yields intermediate artifacts (issues → outline → final response) that can be evaluated separately to locate bottlenecks and improve specific steps.
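The issues → outline → final-response chain can be sketched as a pipeline that exposes each intermediate artifact. `call_llm` is a placeholder for a real model call, stubbed here so the pipeline shape runs end to end.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; returns canned text so the
    # pipeline can be exercised without an API key.
    return f"[model output for: {prompt[:40]}...]"

def support_pipeline(ticket: str) -> dict:
    # Three small prompts instead of one large one. Each intermediate
    # artifact can be logged and evaluated on its own to locate the weak step.
    issues = call_llm(f"List the distinct issues in this ticket:\n{ticket}")
    outline = call_llm(f"Draft a response outline for these issues:\n{issues}")
    final = call_llm(f"Write the customer reply from this outline:\n{outline}")
    return {"issues": issues, "outline": outline, "final": final}

trace = support_pipeline("My order arrived late and the box was damaged.")
```

Because the trace keeps all three artifacts, an eval can score issue extraction separately from response drafting instead of only grading the final reply.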
Evals are the missing production discipline for LLM systems.
You need a mix of end-to-end metrics (user satisfaction), component metrics (tool call correctness), objective checks (IDs/prices), and subjective scoring (tone/helpfulness) often using LLM judges with rubrics.
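A minimal eval harness mixing those check types might look like the sketch below. The objective check is a real string match; the tone judge would in practice be an LLM call with a rubric, replaced here by a keyword heuristic so the harness runs offline.

```python
def check_order_id(answer: str, expected_id: str) -> bool:
    # Objective check: exact facts like IDs and prices are matched literally.
    return expected_id in answer

def judge_tone(answer: str) -> int:
    # Subjective check: stands in for an LLM judge scoring against a rubric.
    return 5 if any(w in answer.lower() for w in ("sorry", "happy to help")) else 2

def run_evals(cases):
    # Each case mixes component-level (ID) and subjective (tone) checks;
    # end-to-end metrics like user satisfaction would be tracked separately.
    results = []
    for case in cases:
        results.append({
            "id_correct": check_order_id(case["answer"], case["expected_id"]),
            "tone_score": judge_tone(case["answer"]),
        })
    return results

cases = [{"answer": "Sorry for the delay, order #1234 ships today.",
          "expected_id": "#1234"}]
report = run_evals(cases)
```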
WORDS WORTH SAVING
5 quotes
We started to learn about neurons... and now we're going one level beyond into what would it look like if you were building agentic AI systems at work.
— Kian Katanforoosh
LLMs are very difficult to control.
— Kian Katanforoosh
There is a frontier within which AI is absolutely helping and one where... people relied on AI... and it ended up going worse—‘falling asleep at the wheel.’
— Kian Katanforoosh
Chaining improves performance, but most importantly helps you control your workflow and debug it more seamlessly.
— Kian Katanforoosh
By the time you're done fine-tuning your model, the next model is out, and it's actually beating your fine-tuned version.
— Kian Katanforoosh
High quality AI-generated summary created from speaker-labeled transcript.