Stanford Online

Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai

November 11, 2025

This lecture covers agents, prompts, and RAG. To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning

Please follow along with the course schedule and syllabus: https://cs230.stanford.edu/syllabus/

More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X

NOTE: There was no class on November 4, 2025 (Lecture 7). The previous lecture is Lecture 6.

Andrew Ng, Founder of DeepLearning.AI; Adjunct Professor, Stanford University’s Computer Science Department
Kian Katanforoosh, CEO and Founder of Workera; Adjunct Lecturer, Stanford University’s Computer Science Department

Kian Katanforoosh (host)
Nov 21, 2025 · 1h 49m · Watch on YouTube ↗


  1. KK

    Hi, everyone. Welcome to another lecture for CS230 Deep Learning. Today, we're going to talk about enhancing large language model applications, and I, I call this lecture Beyond LLM. Um, it has a lot of newer content and, uh, the idea behind this lecture is y- we started to learn about neurons, and then we learned about layers, and then we learned about deep neural networks, and then we learned a little bit about how to structure projects in C3, and now we're going one level beyond into, uh, what would it look like if you were building, uh, agentic AI systems at work in a startup, in a company. Um, and, uh, it's probably one of the more practical lectures. Again, the goal is not to build a product end-to-end in the next hour or so, but rather to tell you all the techniques that AI engineers have cracked, figured out, are exploring, so that after the class you have sort of the breadth of view of different prompting techniques, different agentic workflows, multi-agent systems, evals, and then when you wanna dive deeper, you have the baggage to dive deeper and learn faster about it. Okay? Uh, let's try to make it as interactive as possible, as usual. Um, when we look at the agenda, the agenda is going to start with the core idea behind challenges and opportunities for augmenting LLMs. So we start from a base model. How do we maximize the performance of that base model? Um, then we'll dive deep into the first line of optimization, which is prompting methods, and we'll see a variety of them. Then we'll go slightly deeper. If we were to get our hands under the hood and do some fine-tuning, what would it look like? I'm not a fan of fine-tuning, and I talk a lot about that, but I'll explain why I, I try to avoid fine-tuning as much as possible. Um, and then we'll do a section four on retrieval-augmented generation or RAG, which you've probably heard of in the news. Maybe some of you have played with RAGs. 
We're gonna sort of unpack what a RAG is and how it works, and then the different methods within, um, RAGs. And then we talk about agentic AI workflows. Um, I'll define it. Um, Andrew Ng is one of the, call it, first ones to have called this trend agentic AI workflows, and so we look at the definition that Andrew gives to agentic workflows, and then we start seeing examples. The section six is very practical. It's a case study where we will, uh, think about an agentic workflow, and we'll-- and I'll ask you to measure, um, uh, if the agent actually works, and we brainstorm how we can measure if, um, an agentic workflow is working the way you want it to work. There's plenty of methods called evals that, um, uh, solve that problem. Uh, and then we look briefly at multi-agent workflows, and then we can have a, a sort of open-ended discussion where I'll share some thoughts on what's next in AI. Um, and I'm looking forward to hearing from you all as well on that one. Okay? So let's get started, uh, with the problem of augmenting LLMs. So open-ended question for you. You are all familiar with pre-trained models like GPT-3.5 Turbo or GPT-4o. What's the limitation of using just a base model? What are the typical issues that might arise as you're using a vanilla pre-trained model? Yes.

  2. SP

    We lack some domain knowledge.

  3. KK

    Lacks some domain knowledge. You're perfectly right. You know, you-- We, we had a group of students a few years ago, it was not LLM related, but, um, you know, they were building a, an autonomous, uh, farming, uh, device or vehicle that had a camera underneath taking pictures of crops to determine if the crop is, uh, sick or not, if it should be thrown away, like if it should be, if it should be used or not. And, um, and that data set is not a data set you find out there. And the base model or a pre-trained, um, computer vision model would lack that knowledge, of course. What else? Yes.

  4. SP

    Trained on, uh, quality pictures, but the reality of the pictures are very dark, uh, or blurry.

  5. KK

    Okay, maybe the-- you're saying-- So just to re- uh, repeat for people online, you're saying the model might have been trained on high-quality data, but the data in the wild is actually not that high quality. And in fact, yes, the distribution of the real world might di-differ, uh, as we've seen with GANs from the training set, and that might create an issue with pre-trained models. Although pre-trained LLMs are getting better at, you know, handling all sorts of data inputs. Uh, yes.

  6. SP

    Uh, lack current information.

  7. KK

    Like what?

  8. SP

    Current information.

  9. KK

Lacks current information. Uh, the LLM is not up to date. And in fact, you're right. Imagine you have to retrain from scratch your LLM every couple of months. Uh, one story that I found funny, um, it's from probably three years ago or maybe more, five years ago, where, um, during his first presidency, uh, President Trump one day tweeted, uh, Covfefe. You remember that tweet or no? Just covfefe, and it was probably a typo, or it was in his pocket, I don't know. But that word, uh, did not exist. Uh, the LLMs, in fact, that Twitter was running at the time could not recognize that word. And so, uh, the recommender system sort of went wild because suddenly everybody was making fun of that tweet using the word covfefe, and the LLM was so confused on, you know, what does that mean? Where should we show it? To whom should we show it? And this is an example of, uh, nowadays, especially on social media, there are so many new trends and, um, it's very hard to retrain an LLM to match the new trend and understand the new words out there. I mean, you know, you oftentimes hear Gen Z words like rizz or mid or whatever. I don't know all of them, but, uh, uh, you, you probably wanna find a way that, uh, can allow the LLM to understand those trends without retraining the LLM from scratch. Yeah. What else?

  10. SP

    It's trained to have a breadth of knowledge, and if you wanted to use

  11. KK

    Yeah. It might be trained on a breadth of knowledge, but it might fail or not perform adequately on a narrow task that is very well-defined. Uh, think about enterprise applications that... Yeah, enterprise application, you need high precision, high fidelity, low latency, and maybe the model is not great at that specific thing. It might do fine, but just not good enough, and you might wanna augment it in a certain way. Yeah.

  12. SP

    Maybe it has, like, a shallow domain knowledge, but not really deep, so it makes the model a lot heavier, a lot slower.

  13. KK

    Yeah.

  14. SP

    Training the model.

  15. KK

    So maybe it has a lot of, uh, broad domain knowledge that might not be needed for your application, and so you're using a massive heavy model when you actually are only using two percent of the model capability. You're perfectly right. You might not need all of it, so you might find ways to prune, quantize the model, modify it. All of these are, are, are good points. I'm gonna add a few more as well. Um, LLMs are very difficult to control. Uh, your last point is actually an example of that. You wanna control the LLM to use a part of its knowledge, but it's not. It's, in fact, getting confused. Uh, we've seen that in history. In 2016, uh, Microsoft created a notorious Twitter bot that learned from users, and it quickly became a racist jerk. Microsoft ended up removing the bot sixteen hours after launching it. The community was really fast [chuckles] at, uh, determining that this was a racist bot. Um, and, and, you know, you can empathize with Microsoft in the sense that it is actually hard to control an LLM. They might have done a better job to qualify before launching, uh, but it is really hard to control an LLM. Even more recently, this is a tweet from Sam Altman, um, last November, um, where there was this, uh, debate between Elon Musk and Sam Altman on, uh, whose LLM is the left-wing, uh, propaganda machine or the right-wing propaganda machine, and they were hating on each other's LLMs. Uh, but that tells you at the end of the day that, uh, even those two teams, uh, Grok and OpenAI, which are probably the best-funded team with a lot of talent, are not doing a great job at controlling their LLMs, you know. And from time to time, if you hang out on X, uh, you might see screenshots of users interacting with LLMs and the LLM saying something really controversial or, or, or, uh, racist or, you know, something that's-- would not co-- uh, not be considered, uh, uh, great by [chuckles] social standards, I guess. 
And, uh, and that tells you that the model is really hard to, um, to control. Um, the second aspect of it is, uh, something that you've mentioned earlier. LLMs may underperform in your task, um, and that might include specific knowledge gaps such as medical diagnosis. If you're doing medical diagnosis, you would rather have a-- an LLM that is specialized for that and is great at it. And in fact, something that we haven't mentioned as a group is sources, so that the answer is sourced specifically. You have a hard time believing something unless you have the actual source of the research that backs it up. Um, inconsistencies in style and format. So imagine you're building a legal AI agentic workflow. Uh, legal has a very specific way to write and read, uh, where every word counts. You know, if you're negotiating a large contract, every word on that contract might mean something else when it comes to the court, and so it's very important that you use an LLM that is very good at it. The precision matters. And then, you know, task-specific understanding, such as doing a classification on a niche field. Here, I pulled an example where, um, you know, let's say a biotech product is trying to use an LLM to, uh, categorize user reviews into positive, neutral, or negative. Um, you know, maybe for that company, something that, uh, would be considered a negative review typically is actually considered a neutral review because the NPS of that industry tends to be way lower than other industries, let's say. Um, that's a task-specific understanding, and the LLM needs to be aligned to what the company believes is the categorization that it wants. We will see an example of how to solve that problem in a second. And then limited context handling. Um, a lot of AI applications, especially in the enterprise, have, uh, required data that has a lot of context.
Like, just to give you a simple example, knowledge management is an important space where enterprises buy a lot of knowledge management tools. Um, when you go on your drive and you have all your documents, ideally, you could have an LLM running on top of that drive. You can ask any question, and it will read immediately, uh, thousands of documents and answer, "What was our Q4 performance in sales? Um, it was X dollars." Uh, it finds it super quickly. In practice, because LLMs do not have a large enough context, you cannot use a standalone vanilla pre-trained LLM to solve that problem. You will have to augment it. Does that make sense? Uh, the other aspect around context windows is they are in fact limited. If you look at the context windows of the models from the last five years, um, even the best models today will range in context, um, window or number of tokens it can take as input, um, somewhere in the hundreds of thousands, um, of tokens max. Just to give you a sense, two hundred thousand tokens is roughly two books. Yeah. So that's how much you can upload, uh, and, and it can read pretty much. And you can imagine that when you're dealing with video understanding or, um, heavier, uh, data files, that is, of course, an issue. So you might have to chunk it, you might have to embed it, you might have to find other ways to get, uh, the LLM to handle larger contexts. Um, the attention mechanism is also, uh, powerful but problematic because it does not do a great job at attending in very large contexts. There is actually a, an interesting, uh, uh, problem called needle in a haystack.
It's an AI problem where, um, or call it a benchmark, where, um, in order to test if your LLM is good at attending-- at putting attention on a very specific fact within a large, uh, corpus, researchers might randomly insert in a book one sentence that con-- uh, that, that outlines a certain fact, such as, uh, "Arun and Max are having coffee at Blue Bottle" in the middle of, uh, the Bible, let's say, or some very long, uh, text. Um, and then you ask the LLM, um, uh, "What were Arun and Max having, um, you know, uh, at Blue Bottle?" And you see if it remembers, say, it was coffee. It's actually a complex problem, not because the question is complex, but because you're asking the model to find a fact within a very large corpus, and that's complicated. Yeah. So again, this is a, a limiting factor for LLMs. We, we'll talk about RAG in a second, but I wanna preview, you know, the, the-- there is, uh, debates around whether RAG is the right, uh, long-term approach for AI systems. So as a, as a high-level idea, a RAG is a mechanism if you will, that embeds documents, uh, that an LLM can retrieve, um, and then add as context to, um, its, uh, initial prompt and answer a question. Um, it has lots of application. Knowledge management is an example. So imagine you have your drive again, but every document is sort of compressed in representation, and the LLM has access to that lower dimensional representation. Um, the debates that this tweet from Yaofu, um, outlines is, uh, in theory, if we have infinite compute, then RAG is useless because you can just read a massive corpus immediately and answer your question. Uh, but even in that case, um, latency might be an issue. Imagine the time it takes for an AI to read all your drive every single time you ask a question. It doesn't make sense. So, uh, RAG has other advantages beyond even, uh, the accuracy. On top of that, the sourcing matters as well. So it might-- RAG allows you to source. 
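The embed-retrieve-augment loop described here can be sketched in a few lines. This is a toy illustration, not a production RAG stack: the "embedding" is a bag-of-words counter standing in for a real embedding model, and the documents are made up.

```python
# Minimal retrieve-then-augment sketch. `embed` is a toy stand-in for a real
# embedding model (e.g. a sentence-transformer); documents are illustrative.
from collections import Counter
import math

def embed(text):
    # Toy "embedding": a word-frequency vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse frequency vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Rank documents by similarity to the query, keep the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    # Augment the LLM prompt with only the retrieved context.
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Q4 sales revenue was $4.2M, up 12% year over year.",
    "The offsite is scheduled for March in Denver.",
    "Engineering headcount grew from 40 to 55 in 2024.",
]
print(build_prompt("What was Q4 sales revenue?", docs))
```

In a real system the ranked documents also give you sourcing for free: each retrieved chunk can be cited back to the user, which is the advantage discussed above.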
We, we'll talk, we'll talk about all that later. But there are-- there's always this, uh, debate in the, in the community whether a certain method is actually future-proof. Because in practice, as compute power doubles every year, let's say, some of the methods we're learning right now might not be relevant three years from now. Like, we don't know, essentially. Um, you know, and, and the analogy that, that he makes on, on, um, you know, context windows and why RAG approaches might be relevant even a long time from now, um, is search. You know, when you search on a search engine, you still find, uh, sources of information, and in fact, in the background, there is very, uh, uh, you know, detailed traversal algorithms that, uh, rank and find, uh, the specific links that might be the, the best to present you, um, uh, versus if you had to read-- imagine you had to read the entire web every single time you're doing a search query, uh, without being able to narrow to a certain portion of the space, uh, that might again not be, uh, um, uh, reasonable. Okay. When we're thinking of, um, improving LLMs, uh, the easiest way, uh, we think of it is, is two dimensions. One dimension is we are going to improve the foundation model itself. So for example, we move from, uh, GPT, uh, three point five Turbo to GPT-4 to GPT-4o to GPT-5. Each of that is supposed to improve the base model. GPT-5 is another debate because it's sort of packaging other models within itself. But, you know, if you're thinking about three point five, 4 and 4o, that's really what it is. The pre-trained model improves, and so you should see your, uh, performance improve on your tasks. Um, but the other dimension is we can actually engineer-- leverage the LLM in a way that makes it better. So you can prompt simply GPT-4o. You can, mm, change some prompts and improve the prompt, and it will improve the performance. It's shown. Uh, you can even put a RAG around it. You can put an agentic workflow around it. 
You can even put a multi-agent system around it, and that is another dimension for you to improve performance. So that's how I want you to think about it. Which LLM am I using, and then how can I maximize the performance of that LLM? This lecture is about the vertical axis. Those are the methods that we will see with-- together. Sounds good for the introduction. So let's move to prompt engineering. Um, I'm gonna start with an interesting study just to motivate why prompt engineering matters. Um, there is a, a study from, uh, you know, Harvard Business School and Wharton at UPenn, and, uh, others also involved, that, uh, took a subset of BCG consultants, uh, individual contributors, split them into three groups. One group had no access to AI, one group had access to, um, I think it was GPT, uh, four, um, and then one group had access to the LLM, but also a training on how to prompt better. Um, and then they observed, uh, the performance of these consultants across a wide variety of tasks. There's a few things that they noticed that I thought were interesting. One is something they called the jagged frontier, um, meaning that certain tasks that consultants are doing fall beyond the jagged frontier, meaning AI is, uh, is not good enough. It's not, it's not improving, uh, human performance. In fact, it's actually making it worse. Um, and some tasks are within the frontier, um, meaning that, uh, AI is actually significantly improving the performance, the speed, the quality of the consultant. Um, many tasks fell within and many tasks fell without, and they shared their, um, insights. But the TLDR is, uh, there is a frontier within which AI is absolutely helping and, uh, one where they call out this behavior of falling asleep at the wheel, where people relied on AI on a task that was beyond the frontier, and in fact, uh, it ended up going worse because the human was not reviewing the outputs carefully enough.
Um, they did note that the, the group that was trained was, uh, the best, uh, better than the group that was not trained on prompt engineering, which also motivates why this lecture matters, um, so that you're, you're, you're within that group afterwards. Uh, one other insight was the centaurs and the cyborgs. Uh, they noticed that consultants had the tendency to work with AI in one of two ways, and you might yourself fi-- uh, find-- uh, be part of one of these groups. The centaurs, uh, are mythical creatures that are half, uh, human and half, uh, uh, I think half, uh, uh, uh, what, horses? Yeah, horses. Half human, half horse. Um, and, uh, those were individuals that would divide and delegate. They might give a pretty big task to the AI. So imagine you're working on a PowerPoint, which consultants are known to do. Um, you might actually write a very long prompt on how you want it to do your PowerPoint and then let it work for some time and then come back, and it's done. While others would act as cyborgs. Cyborgs are fully blended, uh, bionic humans, a human augmented with robotic parts. Uh, and those individuals would not delegate a task fully. They would actually work super quickly with the model and, like, back and forth. I find that a lot of students are actually working more like cyborgs than, uh, centaurs. But, uh, maybe in the enterprise, when you're trying to automate a workflow, you're thinking more like a centaur. Yeah. That's just something good to keep in mind. Also, a lot of companies will tell you, "Oh, we're hiring prompt engineers, et cetera. It's a career." I, I don't buy that. I think it's just a skill that everybody should have. Uh, you're not gonna make a career out of prompt engineering, but you're probably gonna use it as a very powerful skill in your career. Um, so let's talk about basic prompt design principles. Uh, I'm giving you a very simple prompt here.
Uh, summarize this document, and then the document is uploaded alongside it, and the model does not have, um, much context around, you know, what should the summary be? Well, how long should the summary be? What should it talk about? Et cetera. You can actually improve these prompts, um, by, by doing, you know, something like summarize this ten-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policymakers. That's already better, right? You're specifying the audience, and it's going to tailor it to the audience. You're saying that, uh, you want five bullet points, and you want to focus only on key findings. Um, you know, that's a better prompt, you would argue. Um, how could you even make these prompts, uh, better? Uh, what are other techniques that you've heard of or tried yourself that could make this one-shot prompt, uh, uh, better? Yeah.
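The jump from the vague prompt to the improved one can be made reusable as a small template function. The parameter names and defaults here are illustrative, not a standard API:

```python
# Reusable version of the improved summarization prompt from the lecture.
# Parameter names (audience, n_bullets, focus) are illustrative choices.
def summarization_prompt(doc_title, audience, n_bullets=5, focus="key findings"):
    return (
        f"Summarize the attached document '{doc_title}' in {n_bullets} bullet points, "
        f"focusing on {focus} and their implications for {audience}. "
        f"Use plain language appropriate for {audience}."
    )

prompt = summarization_prompt(
    doc_title="Renewable Energy Outlook 2025",
    audience="policymakers",
)
print(prompt)
```

Encoding the constraints (length, focus, audience) as parameters is what turns a one-off prompt into something you can scale across many documents and users.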

  16. SP

    Write formulas, uh, like write the example.

  17. KK

    Okay, write, write example. So say, uh, you mean, uh, here is an example of a great summary. Yeah, you're right. That's a good idea.

  18. SP

    Just ask it to be like someone. Act like you are now.

  19. KK

    Okay. Very popular technique. Act like a renewable energy expert giving a conference at, uh, Davos, let's say. Yeah, that's great. Um, someone where-

  20. SP

    It sounds cliche saying it, but just say like, "You're really good at it." Like-

  21. KK

    Yeah. You are the best in the world at this. Explain. [laughs] Yeah, yeah, actually, I mean, these things work. It's, it's funny, but, uh, it does work to, to say, act like XYZ. It's a very popular, uh, prompt template. We'll, we'll see a few examples. What else could you do? Yes.

  22. SP

    I personally like to say critique your own project.

  23. KK

    Okay.

  24. SP

    Yeah.

  25. KK

    Critique your own project, so you're using reflection. So you might actually do one output and then ask it to critique it and then give it back. Yeah, we'll see that. That's a great one. That's, that's the one that probably works best within those typically, but we, we'll see some examples. What else? Uh, yeah.
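The critique-your-own-output idea (reflection) is a simple control flow: draft, self-critique, then revise. A minimal sketch, where `llm` is a generic stand-in for any chat-model call (the stub below just labels what it was asked, so the loop is visible without a real API):

```python
# Reflection sketch: draft -> self-critique -> revision.
# `llm` is any callable that takes a prompt string and returns text;
# a real system would wrap an actual model API here.
def reflect(llm, task):
    draft = llm(f"Complete this task:\n{task}")
    critique = llm(f"Critique this answer for errors and omissions:\n{draft}")
    final = llm(
        f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
        f"Critique:\n{critique}\n\nRewrite the answer, fixing every issue raised."
    )
    return final

# Stub model that just echoes the first line of the prompt it received.
def fake_llm(prompt):
    return f"[model output for: {prompt.splitlines()[0]}]"

print(reflect(fake_llm, "Summarize the Q4 sales report."))
```

The same pattern generalizes: the critique step can use a stricter rubric, or even a different model, than the drafting step.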

  26. SP

    Break the task down into steps.

  27. KK

    Okay, break the task down into steps. Do you know how that is called?

  28. SP

    No.

  29. KK

Okay. Chain of thought. So, um, this is actually a popular method that's been shown in, in research to improve performance. You could actually give a clear instruction and also encourage the model to think step by step. Approach the task step by step and do not skip any step. And then you give it some steps such as step one, identify the three most important findings. Step two, explain how each finding impacts renewable energy policy. Step three, write the five-bullet summary with each point addressing a finding, um, et cetera. So, uh, chain of thought, I linked the paper from twenty twenty-three, uh, that popularized chain of thought. Chain of thought is very, very popular right now, especially in AI startups that are trying to control their LLMs. Okay. Um, to go back to your examples about act like XYZ, um, I-- what I like to do, Andrew Ng also talks about that, uh, is to look at other people's prompts. And in fact, online, you have a lot of prompt repositories for free on GitHub. In fact, I, I linked the awesome prompts, uh, template repo on GitHub, where you have so many examples of great prompts that engineers have built. They said it works great for us, and they published it online. And a lot of them start with act as. You know, act as a Linux terminal, act as an English translator, act as, um, a position interviewer, et cetera. The advantage of a prompt template is that you can actually put it in your code and scale it for many user requests. So let me give you an example from Workera. Um, you know, Workera evaluates skills. Some of you have taken the assessments already, and, um, it tries to personalize them to the user. Um, and in fact, if you actually look in an HR system in an enterprise, you might have, uh, Jane is a product manager, level three, and she is, uh, in the US, and her preferred language is English. And actually that metadata can be inserted in a prompt template that we personalize for Jane.
And similarly for Joe, whose favorite language is-- preferred language is Spanish, it will, uh, uh, it will tailor it to Joe, and that's called a prompt template. Yeah.
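A prompt template like the Workera example can be sketched as a format string filled from HR metadata. The field names, Joe's level, and his country are illustrative assumptions; only Jane's details and Joe's preferred language come from the lecture:

```python
# Prompt template personalized from HR metadata, as in the Workera example.
# Field names and Joe's level/country are illustrative, not real data.
TEMPLATE = (
    "Act as a career coach for {name}, a {role} (level {level}) based in {country}. "
    "Respond in {language}. Think step by step: first assess current skills, "
    "then identify gaps, then recommend one learning plan."
)

def fill_template(user):
    # Insert one user's metadata into the shared template.
    return TEMPLATE.format(**user)

jane = {"name": "Jane", "role": "product manager", "level": 3,
        "country": "US", "language": "English"}
joe = {"name": "Joe", "role": "product manager", "level": 2,
       "country": "Spain", "language": "Spanish"}

print(fill_template(jane))
print(fill_template(joe))
```

One template, many users: the same string scales across every request, which is exactly the advantage over hand-writing each prompt.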

  30. SP

    Usually foundation models, they don't use prompt templates. That's something you have to, um, integrate yourself.

Transcript of episode k1njvbBmfsw