Stanford Online

Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai

November 11, 2025. This lecture covers agents, prompts, and RAG.

To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning
Please follow along with the course schedule and syllabus: https://cs230.stanford.edu/syllabus/
More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X

NOTE: There was no class on November 4, 2025 (Lecture 7). The previous lecture is Lecture 6.

Andrew Ng, Founder of DeepLearning.AI; Adjunct Professor, Stanford University’s Computer Science Department
Kian Katanforoosh, CEO and Founder of Workera; Adjunct Lecturer, Stanford University’s Computer Science Department

Kian Katanforoosh (host)
Nov 21, 2025 · 1h 49m · Watch on YouTube ↗

CHAPTERS

  1. Why “Beyond LLM”: Building real-world agentic AI systems

    The lecture frames a shift from studying neural networks to engineering practical LLM applications in companies. The goal is breadth: prompting, fine-tuning tradeoffs, RAG, agentic workflows, evals, and multi-agent systems—enough context to dive deeper after class.

  2. Limitations of vanilla pretrained LLMs (and why augmentation is needed)

    Students and instructor enumerate common failure modes of using a base model directly. The discussion emphasizes domain gaps, lack of freshness, controllability issues, precision requirements, and context-window constraints that block many enterprise use cases.

  3. Two axes of improvement: better foundation models vs better engineering

    The lecture separates performance gains into (1) upgrading the base model and (2) designing systems around the model to boost results. The class concentrates on the “engineering axis”: prompts, RAG, agents, and orchestration.

  4. Why prompt engineering matters: the BCG consultant study & human-AI collaboration styles

    The study of BCG consultants shows that access to LLMs improves performance on some tasks but harms it on others (the “jagged frontier”). Training in prompting outperforms untrained AI use, and people tend to collaborate either as delegators (“centaurs”) or as tight-loop iterators (“cyborgs”).

  5. Core prompt design patterns: specificity, roles, examples, reflection, step-by-step

    The lecture upgrades a vague summarization prompt into a specific, audience-aware instruction. Students propose common prompt improvements including role prompting, examples, critique/reflection, and breaking tasks into steps (chain-of-thought style guidance).
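
    A minimal sketch of the kind of upgrade described: the same summarization ask rewritten with a role, a target audience, explicit constraints, and step-by-step guidance. The prompt wording is illustrative, not the lecture’s exact text.

```python
# Hypothetical example: upgrading a vague prompt into a specific,
# audience-aware instruction (wording is illustrative).

vague_prompt = "Summarize this report."

specific_prompt = """You are a financial analyst writing for busy executives.
Summarize the report below in exactly 3 bullet points.
Focus on revenue trends and risks; omit methodology details.
Think step by step before writing, then output only the bullets.

Report:
{report_text}
"""

report = "Q3 revenue grew 12% year over year, driven by..."
print(specific_prompt.format(report_text=report))
```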

  6. Zero-shot vs few-shot prompting for alignment on subjective tasks

    A sentiment/tone classification example illustrates that “neutral vs negative” can be ambiguous and domain-dependent. Few-shot prompting aligns the model by giving labeled examples directly in the prompt, acting like lightweight dataset building without weight updates.
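
    A sketch of few-shot alignment for the ambiguous “neutral vs negative” case: labeled examples sit directly in the prompt, so the model adopts your labeling convention without any weight updates. `call_llm` is a hypothetical stand-in for whatever chat API you use, and the example labels are illustrative.

```python
# Few-shot sentiment prompt: the in-context examples, not fine-tuning,
# define where "neutral" ends and "negative" begins for this domain.

FEW_SHOT = """Classify the customer message as positive, neutral, or negative.

Message: "The package arrived on time." -> neutral
Message: "Arrived on time, but the box was dented." -> negative
Message: "Fastest delivery I've ever had!" -> positive

Message: "{message}" ->"""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real chat-completion API call.
    raise NotImplementedError("plug in your LLM client here")

def classify(message: str) -> str:
    return call_llm(FEW_SHOT.format(message=message)).strip()
```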

  7. Prompt chaining (multi-prompt pipelines) to improve control and debuggability

    Chaining is presented as a major practical technique: split a complex ask into multiple prompts with intermediate outputs (issues → outline → final response). The benefit is easier debugging and targeted optimization, at the cost of added latency.
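
    A sketch of the issues → outline → final response chain. Each stage is a separate LLM call whose intermediate output can be logged and tested independently; `call_llm` is again a hypothetical client stub.

```python
# Prompt chaining: three small prompts with inspectable intermediate
# outputs, instead of one monolithic prompt.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical stub

def answer_complaint(complaint: str) -> str:
    issues = call_llm(f"List the distinct issues in this complaint:\n{complaint}")
    outline = call_llm(f"Draft a response outline addressing each issue:\n{issues}")
    final = call_llm(f"Write a polite customer reply following this outline:\n{outline}")
    # The intermediate outputs (issues, outline) can be logged and unit-tested,
    # which is the debuggability win; the cost is three calls' worth of latency.
    return final
```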

  8. Evaluating prompts at scale: human review, tooling, and LLM-as-judge

    The lecture introduces systematic prompt testing: compare baseline vs refined prompts on fixed test cases. It then moves to scalable evaluation via tooling (e.g., promptfoo) and LLM judges using pairwise comparison, scalar ratings, references, and rubrics.
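
    A sketch of pairwise LLM-as-judge evaluation over a fixed test set: the judge sees both outputs for the same input and picks a winner. The judge prompt wording and the `call_llm` stub are assumptions.

```python
# Pairwise LLM-as-judge: compare baseline vs refined prompt outputs
# on the same fixed test cases and tally wins for the refined prompt.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for a chat API

JUDGE = """You are grading two answers to the same question.
Question: {q}
Answer A: {a}
Answer B: {b}
Reply with exactly "A" or "B" for the better answer."""

def judge_pair(question: str, baseline: str, refined: str) -> str:
    return call_llm(JUDGE.format(q=question, a=baseline, b=refined)).strip()

def win_rate(cases):  # cases: list of (question, baseline_out, refined_out)
    wins = sum(judge_pair(q, a, b) == "B" for q, a, b in cases)
    return wins / len(cases)
```

    LLM judges are known to have position bias, so swapping the A/B order across runs and averaging is a common mitigation.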

  9. Fine-tuning: when it helps, why it often hurts, and a cautionary Slack example

    Fine-tuning is framed as costly, slow, and prone to overfitting or reducing generality—especially as newer foundation models may outpace your tuned model. A humorous Slack-message fine-tuning story shows how models can learn undesirable behavioral quirks rather than task performance.

  10. RAG fundamentals: grounding answers with embeddings + vector search + cited context

    RAG is introduced as the standard way to overcome context limits, staleness, hallucinations, and lack of citations. The pipeline: embed documents into a vector database, embed the query, retrieve nearest chunks/docs, then answer using a prompt template that constrains the model to the retrieved sources.
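
    A minimal sketch of the retrieval step using cosine similarity over precomputed embeddings. `embed` stands in for any embedding model; in production a vector database replaces this brute-force search, and the prompt template constrains answers to the retrieved sources.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # hypothetical embedding-model call

def build_index(docs: list[str]) -> np.ndarray:
    vecs = np.stack([embed(d) for d in docs])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize

def retrieve(query: str, docs: list[str], index: np.ndarray, k: int = 3):
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = index @ q                       # cosine similarity to every doc
    top = np.argsort(scores)[::-1][:k]       # k nearest neighbors
    return [docs[i] for i in top]

RAG_PROMPT = """Answer using ONLY the numbered sources below; cite them by number.
Sources:
{sources}
Question: {question}"""
```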

  11. RAG improvements: chunking, hierarchical retrieval, and HyDE

    The lecture briefly surveys common RAG enhancements to address large documents and query/document mismatch. Chunking and multi-level embeddings help with precise sourcing, while HyDE (hypothetical document embeddings) generates a hypothetical answer document to embed for better retrieval alignment.
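
    A sketch of HyDE layered on a retriever like the one above: instead of embedding the raw query, the model first drafts a hypothetical answer passage, and that passage’s embedding drives retrieval. The prompt wording and `call_llm` stub are assumptions.

```python
# HyDE: embed a hypothetical answer rather than the raw query, so the
# query vector lives in "document space" and matches real passages better.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical chat-API stub

def hyde_query(question: str) -> str:
    return call_llm(
        "Write a short plausible passage that would answer this question "
        "(it may contain made-up specifics; only its wording matters):\n"
        + question
    )

# Retrieval then embeds hyde_query(question) instead of the question itself,
# e.g. retrieve(hyde_query(q), docs, index) with the retriever sketched above.
```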

  12. Agentic AI workflows: from single Q&A to multi-step tool-using systems

    Agentic workflows are defined (per Andrew Ng) as multi-step processes combining prompts, tools, memory, retrieval, and API calls—distinct from reinforcement-learning “agents.” A refund example shows the jump from static policy Q&A to an interactive workflow that asks for order details, calls tools, and completes an action.
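
    A sketch of the refund example as a tool-using loop: the model either asks the user for missing details or emits a tool call, and deterministic code executes it. The JSON protocol, tool names, and `call_llm` stub are illustrative assumptions, not the lecture’s implementation.

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stub

def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "refundable": True}   # toy tool

def issue_refund(order_id: str) -> str:
    return f"refund issued for {order_id}"              # toy tool

TOOLS = {"lookup_order": lookup_order, "issue_refund": issue_refund}

def agent_step(conversation: str) -> str:
    out = call_llm(
        "You handle refunds. Either ask the user a question, or reply with "
        'JSON like {"tool": "lookup_order", "args": {"order_id": "..."}}.\n'
        + conversation
    )
    try:
        call = json.loads(out)                           # tool call?
        result = TOOLS[call["tool"]](**call["args"])
        return agent_step(conversation + f"\nTool result: {result}")
    except (json.JSONDecodeError, KeyError, TypeError):
        return out                                       # plain message for the user
```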

  13. Engineering shift: deterministic software → fuzzy systems with guardrails

    The lecture explains why building with LLMs requires new engineering instincts: unstructured inputs, probabilistic outputs, and higher security/robustness demands. It argues for combining deterministic components where possible with fuzzy LLM components where they add value, plus human-in-the-loop safeguards.
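
    A sketch of the “deterministic where possible, fuzzy where valuable” split: the LLM handles the unstructured input, and deterministic validation plus a retry acts as a guardrail before anything downstream runs. The `ORD-` format, prompt wording, and `call_llm` stub are assumptions.

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stub

def extract_order_id(message: str, retries: int = 2) -> str | None:
    prompt = ('Return JSON {"order_id": "..."} for the message below, '
              'or {"order_id": null} if none is present.\n' + message)
    for _ in range(retries + 1):
        try:
            data = json.loads(call_llm(prompt))
            oid = data["order_id"]
            # Deterministic guardrail: enforce the expected format before use.
            if oid is None or (oid.startswith("ORD-") and oid[4:].isdigit()):
                return oid
        except (json.JSONDecodeError, KeyError, AttributeError):
            pass  # malformed output: fall through and retry
    return None  # escalate to a human instead of guessing
```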

  14. Agent architecture components: prompts, memory (working vs archival), tools, resources, MCP

    A travel agent example illustrates core building blocks: prompts, memory layers, and tool access to external systems. The class introduces Model Context Protocol (MCP) as a more scalable way to connect agents to many services than hand-integrating each API, while noting security and change-management concerns.
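
    A sketch of the working vs archival memory split from the travel-agent example: working memory is the bounded buffer that fits in the prompt; archival memory is a larger store queried on demand. The class structure and keyword search are illustrative assumptions; real systems typically use a vector store for the archive.

```python
from collections import deque

class AgentMemory:
    """Working memory: recent turns kept in context (bounded).
    Archival memory: everything else, searched on demand."""

    def __init__(self, working_size: int = 10):
        self.working = deque(maxlen=working_size)   # fits in the prompt
        self.archive: list[str] = []                # e.g. a vector store in practice

    def add(self, turn: str) -> None:
        if len(self.working) == self.working.maxlen:
            self.archive.append(self.working[0])    # spill oldest turn to archive
        self.working.append(turn)

    def recall(self, keyword: str, k: int = 3) -> list[str]:
        # Toy keyword search; real systems would embed-and-retrieve.
        return [t for t in self.archive if keyword.lower() in t.lower()][:k]
```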

  15. Evals case study: building a customer-support agent and measuring success

    Students design an address-change support agent by decomposing the human workflow and mapping each step to LLM calls or deterministic tools. The lecture then shows how to evaluate: end-to-end user satisfaction, component-level correctness, objective checks (order ID extraction), and subjective judgments (tone) using humans or LLM judges.
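
    A sketch of the component-level objective check: compare the agent’s extracted order ID against ground truth over a labeled test set. Subjective qualities like tone need a human or LLM judge; exact-match checks like this stay cheap and deterministic. The example cases are hypothetical.

```python
def eval_order_id_extraction(agent_extract, test_cases) -> float:
    """agent_extract: function mapping a message to an order ID (or None).
    test_cases: list of (message, expected_order_id) pairs."""
    correct = sum(agent_extract(msg) == expected for msg, expected in test_cases)
    return correct / len(test_cases)

# Example labeled cases (hypothetical data):
cases = [
    ("Please change the address on order ORD-1042.", "ORD-1042"),
    ("Hi, I moved recently, can you help?", None),
]
```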

  16. Multi-agent systems: when parallelism and reuse justify added complexity

    Multi-agent workflows are positioned as most useful when tasks can run in parallel or when specialized agents can be reused across teams. A smart-home brainstorming exercise yields hierarchical designs with an orchestrator agent coordinating specialized agents (climate, security, energy, groceries), plus optional peer-to-peer links.
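
    A sketch of the orchestrator pattern from the smart-home exercise: specialized agents run in parallel and the orchestrator merges their results. The specialist names mirror the brainstorm above; the threading approach, prompts, and `call_llm` stub are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stub

SPECIALISTS = {
    "climate":   "You manage heating and cooling. Task: {task}",
    "security":  "You manage locks and cameras. Task: {task}",
    "energy":    "You optimize power usage. Task: {task}",
    "groceries": "You track pantry stock and reorder. Task: {task}",
}

def orchestrate(task: str) -> str:
    # Independent specialists can run in parallel, which is the main
    # justification for the added multi-agent complexity.
    with ThreadPoolExecutor() as pool:
        results = dict(zip(
            SPECIALISTS,
            pool.map(lambda p: call_llm(p.format(task=task)), SPECIALISTS.values()),
        ))
    merged = "\n".join(f"{name}: {out}" for name, out in results.items())
    return call_llm(f"Combine these agent reports into one plan:\n{merged}")
```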

  17. What’s next in AI: plateau questions, architecture search, multimodality, and fast skill half-life

    The lecture closes with forward-looking themes: potential model progress plateauing, the likelihood that new architectures will unlock efficiency gains, and multimodality improving overall intelligence. It emphasizes that multiple learning paradigms may converge and that AI methods evolve so quickly that breadth plus rapid learning is the durable strategy.
