Evals for taste: Hill-climbing a slide-generation agent

Built rubric-driven replayable eval system from real user projects giving quality, cost, latency, error, token signals in under 6 hours per model change. Evolved into dev flywheel powered by real user dissatisfaction signals.

May 23, 2026 · 39m

EVERY SPOKEN WORD

40 min read · 7,747 words

0:00 – 0:19
Intro
1. SPSpeaker
  [on-hold music]
0:19 – 1:20
Why evals matter: turning “vibes” into actionable feedback
1. SPSpeaker
  Hello. Hello. Hello. Good afternoon, everyone. I hope you all had a wonderful lunch. Um, there's so many of you as well. I'm actually kind of surprised by this. Um, happy to see that there's that much interest in talking about, uh, evals. Um, I personally am a big fan of anything evals related, but I know not everyone's... that's not everyone's cup of tea, right? Um, so very happy to see this many people of you. Um, so yeah, this, so today's session is really going to be about evals. Um, and I guess my goal for this session is for you all to be, afterwards, to be inspired to build evals, to be like, "Okay, evals are actually really useful," um, and how you can act on them, right? Like, we're gonna be building evals. I want you to get a better sense of like, "Okay, how should I be thinking about building evals? What are useful type of evals?" And then also, how can we use and take these evals to then make better agents, right? So that's the main goal of this session.
1:20 – 3:21
What evals are and how they encode expectations
1. SPSpeaker
  Um, and the way we're going to do this is by building a slide-generation agent and then finding out, like, okay, what are some good evals? What do we want to measure? And then how can we now build better agents based on the feedback that we're getting from our evals? And the first thing that we all need to set the stage on is: What are evals, right? So evals are systematic tests that measure how well an AI system performs on a specific domain or use case, right? So they give you information about, like, what's the quality of the results, um, what did it do well, what was it not good at, how can we improve, right? And evals, they are made up of tests that define certain scenarios, um, that then encode certain expectations through grading logic. So one way that we're thinking about evals is if you, for example, are building an AI system, an AI agent, and you want to make sure that the output adheres to, like, a certain type of quality, or you w- need to make sure, like, this must always be present. Evals are a way to kind of encode this behavior in a way where then afterwards, if your evals fail, you know, like, "Okay, my agent is not doing or behaving the way it is intended," right? So that's the way how we can use these evals. And then evals is also the bridge between things like "it seems to work" or, like, um, "we know it works." Or maybe it's all like, "Ah, it kind of feels a little bit worse today for some reason." It's always very hard to act on these types of vibes, right? Like, I think vibes definitely have their own place. I think they're useful just to get, like, a general sense check of, like, how people are feeling. But they're not very actionable, right? And that's kind of what we want to get out of evals. We want to have something that's actionable. So then we always ship eval, or like we always... Once we release,
3:21 – 4:52
Why generic benchmarks aren’t enough for your app
1. SPSpeaker
  like, a model, we always have this accompanying benchmark card, right? And we always list like, "Oh, these are like a bunch of evals. This is what we achieve, what our models achieve." We compare them to other models. We compare them to competitor models, right? Um, and there's like always a few usual suspects, right? Like, for example, um, SWE-bench is a very famous one which measures agentic coding abilities. Terminal-Bench is one that's also quite popular. But we also have other types of evals, right? We have, like, tool use and agents, like, for example, like τ-bench, BrowseComp, OSWorld, which are some other evals that measure different things. And then we also have, like, reasoning and knowledge, um, like ARC-AGI-2. Um, now this is all fine and dandy, right? And then you look at these evals, and we always... every time a new model releases, like, "Oh, it's top of the benchmark for these and these, um, evals," right? Um, and they give us like a gener- general sense of, like, how good is the model and h- how much did we improve upon previous versions, right? But for you guys, if you're, like, building something, if you're building an agentic system, this doesn't really say much usually, right? Like, because, like, we, we don't measure, for example, like, a very specific use case that you guys are building on, right? We measure these gener- generic, general benchmarks that measure a lot of capabilities, but they might not be applicable to your specific use case, right? So that's why we always say build your own evals, benchmark the different models, benchmark your AI agent, and make sure that you
4:52 – 7:57
Life without evals: reactive debugging and invisible regressions
1. SPSpeaker
  get the most out of the models, and make sure that you're also using the right model for the job, right? And so why are these evals specifically important? So this is my pitch to you of start using evals, right? So without evals, suppose you don't have evals. I think we've all been in the scenario where you have, like, this agent and it's working fine, and then you get, like, this feedback of, like, a customer who is, like, saying like, "Uh, it's, it's not really up to par of, like, this new model switch. Uh, it's... something is off," right? It's very hard, like, to do anything with that information, right? It's just like, "Okay, um, do you have some logs maybe that we can take a look at some specific instances?" Right? And then you try to, like, debug it manually, right? But in a way, you're still flying blind, right? You're always in a reactive loop. So you wait for the feedback, and then you're like, "Okay, let's see what we can do about this," right? So you basically only catch issues in production. If you can fix, for example, like, one issue, that then might, for example, create multiple more down the line by making, I don't know, a prompt change tweak that suddenly degrades the capabilities on, like, other tasks that you haven't even considered. Um, it's, it's also quite annoying to distinguish, like, genuine feedback from noise, right? Um, which is always, you don't wanna act on every single thing that you see because people have also some, um, biases in the way they perceive these things, right? Um, and then finally, I think which is the most important one, is there's no way to verify improvements or regressions on anything that you're building or that you've done, right? So you, like, need a way to make sure that the changes that you are making to your agent are actually impacting the quality and making sure that you improve upon your previous versions, right? And so this is basically what evals do give you. If you add evals, you have clarity.
You need to define what does success look like, right? Because, like, if... Let's say you don't have evals, right, and you're not even able to articulate, like, "This is how the agent should behave. This is what a successful end product would look like from agent," then how can you make sure that your agent's actually behaving properly? Because you can't even vocalize it to yourself, like, "This is what it should be." So building these evals forces you to define, formalize, in a way, what you expect your agent to do. Um, it also allows you, as I said, to iterate on optimal agent configs. Um, you can also adopt new models faster, right? Instead of, like, saying like, "Oh, we might test out this new model and then see if it's, like, okay," you now have, like, some clarity to say like, "Okay, this is better on this, and this is not better on this, and this is why we should or should not migrate to a new model." Which especially, I think, is quite relevant with the pace of new models coming out. I think it's also, like, just taking this load off your back of always constantly having to find, like, okay, what is the new frontier, right? Um, and then finally, making problems visible before launch, right? So you know, like, oh, if there's, like, a few, um, cases that you have that will always do well or that you trust to provide a lot of insight, that's where you get the most value out of evals. And
7:57 – 9:28
How evals fit into agent iteration loops
1. SPSpeaker
  so how, how do evals really fit in? So originally, um, when we were, like, thinking about, like, prompt engineering, we had this, basically, this flow of, like, how you should optimize a prompt, right? So first you develop your test cases, which are the evals in the end. Then you write, like, a prompt. You test out the prompt against the tasks. You refine the prompt a little bit, and then it goes back. You run the prompt again. You refine it until you've, like, "Okay, I'm doing good- great on my evals. I'm confident that my system is working properly." And then finally, you can, like, ship the polish- polished prompt, right? Um, over time, systems have gotten a lit- little... Oh. Oh, can I go back? Can we go back one slide, please? Thank you. Um, so over time, it has gotten a little bit more complex with now agents coming into the loop with, like, tool calls, skills, all the different ways to optimize your context and all that stuff. So over time, these systems get more and more complex. So it's also way more levers that you can pull to make changes to your agents, which makes it, once again, then more important to have evals that forces you to have concrete way of identifying these are the things that we can change, and these are the things that impact the system in a positive way. So once again, like with agents, it's the same flow, right? Um, except now with just way more and way more complex things. Um, evals, when you create them, there's basically a few pr- graders. A grader is what we consider basically a way how
9:28 – 12:29
Grader types: code-based, model-based, and human evaluation
1. SPSpeaker
  we can judge the outputs, right? And, like, one of those ways is, for example, a code-based grader, which is pretty similar to, for example, a unit test, as you might know in, like, software engineering, right? Um, it can be like a string match, regex, maybe fuzzy, fuzzy match. Um, but it's like a strict analysis, right? Um, it's fine for static, uh, and tool call checks, and the advantage of this one is it's fast, cheap, deterministic. But it has a big drawback, which is that it's brittle, and it also lacks in nuance, right? Um, a- and with this, we mean, like, especially brittle is quite an interesting one, in my opinion, because, like, these deterministic checks, they force a certain deterministic behavior, right? But with... Um, sometimes this is absolutely the way we want, um, an agent to behave, right? Like, for example, let's say you have an agent that creates a slide deck, for example. You want to make sure that in the end, there is a slide deck present. Deterministic check. But then if you want to have, like, a check on what's the quality of this slide deck, this is way more nuanced, right? Like, you cannot easily encode this in, like, gem- some deterministic checks, right? And that's why we also have this second type of graders, which is the model-based graders, right? And this is, like, rubric-based reasoning. Um, so you, for example, say, like, "Is this slide high quality?" Very generic, but that might be, for example, a rubric. Or, like, for example, "Is this text coherent?" Also a way to get some intel on how well your agent's performing. Um, you can do some interesting things with this as well. Um, pairwise comparison is, in my opinion, quite underrated. Let's say you have two examples, two outputs, um, and then you basically ask the model, "Which one of the two do you prefer, and why?"
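A pairwise comparison grader like the one just described might be sketched as follows. This is a minimal illustration, not the code from the session's repo: the prompt wording, the `Verdict` type, `pairwise_compare`, and `stub_judge` are all hypothetical names, and `stub_judge` stands in for a real model call so the example runs on its own.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    winner: str  # "A" or "B"
    reason: str  # the judge's free-text justification


# Hypothetical prompt template; the wording is an assumption.
PROMPT = (
    "You are comparing two slide decks generated for the same request.\n"
    "Deck A:\n{a}\n\nDeck B:\n{b}\n\n"
    "Which one of the two do you prefer, and why?\n"
    "Answer on the first line with exactly 'A' or 'B', then give your reason."
)


def pairwise_compare(a: str, b: str, judge: Callable[[str], str]) -> Verdict:
    """Ask a judge which of two outputs it prefers and parse the reply."""
    reply = judge(PROMPT.format(a=a, b=b)).strip()
    first_line, _, rest = reply.partition("\n")
    winner = first_line.strip().upper()
    if winner not in ("A", "B"):
        raise ValueError(f"unparseable verdict: {first_line!r}")
    return Verdict(winner, rest.strip())


def stub_judge(prompt: str) -> str:
    # Stand-in for a real model call; it always answers the same way.
    return "B\nDeck B uses less text per slide."


verdict = pairwise_compare("deck one outline", "deck two outline", stub_judge)
print(verdict.winner, "-", verdict.reason)
```

Asking for the reason alongside the choice keeps the result actionable: the winner gives a signal, and the justification hints at what to change.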
That's also quite interesting to get some information out of, especi- especially for these scenarios where you don't really have a clear way of, of defining what makes a better one, right? Um, and u- and then another one is the multi-judge consensus, which is just, for example, you take, like, best of three, and you say, like, three judges score independently and say, like, majority wins, for example, right? Once again, that, this multi-judge consensus is interesting because it allows you to introduce some more determinism in a way, um, where if you have, like... We know that an LLM is non-deterministic, right? And the same would be happening for these model-based graders, right? Like, if you run them, like, 100 times, a few times it might say, "Oh, this is great," and a few other times it might say, "Ah, it's not that great." If you have, like, this multi-judge consensus, you basically are assuming, "Let's put more compute into this, and let's see what the majority of our grade is." Consensus, right? A- a- and this is, unlocks a lot of things, right? Like, this is flexible, this is scalable, this is nuanced. But as I said, it's non-deterministic, it costs more money, and also it requires some calibration, which we will see is not easy at all. And then finally, the most expensive one are the human graders. And these are probably the graders that
12:29 – 14:01
Repo walkthrough: the slide-generation agent setup
1. SPSpeaker
  i- when you're building these agentic systems, you will be using the least, right? Because they are, like, incredibly expensive. Um, you have, like, a whole subject matter expert that will do, like, a whole review of the system. It will... It's expensive, it's slow as well, but it is more, it's the highest quality. It is very nuanced. Um, and yeah, it's, like, really good for, like, some A/B testing and some spot checking, right? So I'm not sure, like, how many of you were able to clone the repo beforehand and have this all set up. Um, I actually wanted to do this session a little bit differently, but given the amount of people, I will probably do a little bit more, um, myself, uh, instead of, like, letting you, um, think about all of the things. Um, but I'll quickly give you an overview of, of what's in the repo, right? Um, let me make this a little bit bigger. Um, I have made some pre-made, uh, slides that I will show you in a bit. Um, the resources is the main thing where you guys would be working in. So you have the... Let me actually close this session for now. Um, so you have, like, the agent.yaml, and this is basically where you would define your agent, right? Like, um, I think before we did a session. So this is basically what we're gonna do, use, like, the managed agent. So for the people who attended that session, um, before lunch, it's basically the same thing. We define here, like, an, an agent in this case. Um, and we have given this, uh, the system prompt, right? So this is a system prompt that we are giving. So basically, you are a slide generation agent, and when the user gives you a topic, create a PowerPoint file at this location. And then also we tell it you have a shell, um, with python-pptx, uh, pre-installed,
14:01 – 15:35
Defining slide-deck evals: audience brainstorming to concrete metrics
1. SPSpeaker
  right? Um, so that's all we give it for now. And then we also have, like, an environment which we've defined, um, with, like, a few packages, um, and what is, what it needs to complete this session. Um, and then basically that's it. We also have some other things defined, but I will get to that. I think maybe the first question that I have for the audience today is, we wanna make a slide-generation agent, right? What do you guys think is a good eval? What are you trying to measure? What would be some good information that you wanna get out of evals? Number of words on slide. Sorry? Number of words on slide. Number of words on slide is, is indeed a useful, uh, thing to track. Any- anyone else with some ideas? [laughs] Sorry? Same concept but something on the slide. Yeah, absolutely valid. Absolutely valid. Yeah, yeah, yeah. Um, and, and this actually, I, I like these two examples because they immediately give you, like, a different sense of, um, how you can use the type of graders. Like, for example, the number of words on a slide is quantifiable, right? It's, like, easy to say you can count the number of words with, like, a deterministic grader, with, like, a code grader. The one, if it's, like, overlapping or if it's overspilling, that one is harder to, um, um, encode in code, right? So for this one, you might, for example, use a model grader, and that's exactly what we did, right? So we have actually defined for you guys already a few graders beforehand. Two specific, uh, uh, directories. We have the code, and we have judge. So the code one is, as I said, it's like these. These code graders are quite deterministic. Like, for example, if we take
15:35 – 19:38
Baseline output review: spotting real failure modes in generated slides
1. SPSpeaker
  a look at emoji count, for example, is one, um, that we have defined, where we basically just count the number of emojis present in the slide deck. Um, because we do- we just noticed that it's quite prevalent. Like, for example, if I open the slide deck, um, let me go with environment one in this case. Um, so these are the slides that I... With basically the agent running. Um, it's, it's done beforehand just because it takes, can take quite a while, um, to get the agents running. Um, but this is, for example, the results of the initial agents, right? So this is slide number one, um, slide number two, slide number three, um, with some weird things on the bottom left. Um, slide four and slide five. Now, I, I think we can all agree, like, this is not the best slide deck you guys have ever seen. Um, but it's a good start. At least it has a slide deck. Um, there's five slides. I think that's exactly the prompt that we send it. Um, so we have a few slides. There's a few content on here. There's, like, few boxes. It's, you know, it's a slide deck. Um, given these slides, is there anything else that you guys are seeing that, like, this is something that we would never want in our slide deck? Teal and gray. What was that? A- No teal. No teal. We can... Indeed, if you absolutely wanna avoid teal, that's absolutely right. I think in this ca- uh, it doesn't do that for every single slide. Like, let me see for the career one. Um, let me see what is... Oh, okay, maybe it does always use teal, actually. Uh, but for example, in this one, we see, like, this overlap of, like, words and, and, and this, this horizontal. Um, what else? We have some weird coloring. Um, yeah, there's, there's, there's a few weird things happening generally, right? Um, so yeah, based on this, we take, you take a look at what it is, what the results are, and you're like, "Hmm, what type of graders do I wanna define for this specifically?" Right? 
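An emoji-count grader of the kind mentioned above could look roughly like this. It is only a sketch under assumptions: the repo's actual grader is not shown in the talk, `emoji_count` is a hypothetical name, the input is assumed to be the text already extracted from each slide, and the Unicode ranges are an approximation that will miss some emoji and count each flag as two symbols.

```python
import re

# Approximate emoji ranges (an assumption, not an exhaustive definition).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (two per flag)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def emoji_count(slide_texts: list[str]) -> int:
    """Count emoji-like characters across the text of every slide."""
    return sum(len(EMOJI_RE.findall(text)) for text in slide_texts)


# Illustrative input: one string of extracted text per slide.
deck_text = [
    "Launch \U0001F680 plan",
    "Done \u2705 \u2705",
    "No emoji here",
]
print(emoji_count(deck_text))
```

Because the check is deterministic, it is cheap to run on every deck, but it is also brittle in the sense discussed earlier: anything outside the listed ranges goes uncounted.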
And so we did that, and we noticed, for example, emoji count is one that's quite prevalent. We wanna check how many times do we see an emoji popping up. Another one is, for example, cluttered slides. Like, how many shapes do we see on these slides? Like, if there's just too many things, it becomes cluttered. Um, counting the number of slides. For example, we always ask for five slides, making sure that you have five slides. Um, do we have slides with image, small font, text-heavy slides? Now, this is, this is... In this case, it's quite arbitrarily chosen, right? These were just, like, things that we, like, thought were, like, this is quite representative of what a slide deck might have for graders, right? It really depends, like, I want, really wanna stress this, like, it really depends from use case to use case what makes a good grader, right? I think generally the way I think about this thing is, if you have a grader that you get no useful information out of, then you should not have that part of your eval, right? Like, each thing, you should be able to tell, like, for each single scenario that you're testing, you should be able to say, like, "This is the information that I wanna get out of this. This is the type of... Or this is the part of the system that I'm testing, and this is how I can act on if it's degrading." Right? Um, so those were just, like, a few code ones, and then we also have a few judge ones. For example, the color judge, which basically judges what is the color contrast, and then it gives a score from, like, zero to five. Um, same with image, um, the layout, text. And this is the prompt that we give. Um, let me close this one real quick. Um, oh. So let me keep it like this. So this is basically the system prompt, um, that we give it. Um, no. So it's saying, "Please evaluate the slide based on each of the following criteria. Text, the title should be simple and clear to indicate the main points.
For main content, avoid too many text and keep words concise. Use a consistent and readable font size, style, and color." And I mean, it, it goes on and on, right? So we give, like,
19:38 – 21:39
Running the scoring script: interpreting code metrics and judge scores
1. SPSpeaker
  for each of the different things that you wanna measure, we give, like, a little information of, like, this is what you should be focusing on when you want to measure this, right? Okay, cool. So we have these evals. Let's say you have now created a slide deck, and you now wanna see like, okay, what are the results, right? And how can we act on these results? So in this repo, we also have created this nice little script that will im- automatically score your slide deck for you. And so at the top here, we basically have it all, um, listed out. So we have, like, the slide count, which is being counted, the slides, the number of slides with image, text-heavy slides, cluttered slides, small font slides, and so on. We also have our judges over here, which are saying, like, they give a score from, like, zero to five, uh, based on, like, how good is the text, how good is the image la- the layout and the color, right? Um, uh, honestly, like, these scores, you can immediately note that these scores are quite high. So as we said, like, we calibrated between zero and five. And as we see, like, the scores we've been giving you are, like, between 2.8 and a 4, which honestly, I think are quite high given the slide deck that we have seen, right? So that's, like, a part of the calibration that needs to happen as well, right? I think there's also, like, one thing that I maybe wanna stress. It's not because you have set up your evals once that they are now, like, the ground truth, you know? Um, evals, over time, they can evolve. They need to be a living artifact. It's not, like, something you make once and then forget and then use this, like, to com- make all of your future decisions on, right? Because, like, we will see over time, like, as I go through all of the different examples that we have, we will see, like, there needs to be a way also how we can see, how we can make sure that the evals that we create are actually still measuring something useful for us, right?
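One way to make the calibration concern above concrete is to compare judge scores against a handful of human spot-check scores on the same deck. The sketch below is purely illustrative: `calibration_gaps` and `overscored` are hypothetical helpers, the per-criterion numbers are made up (the talk only states that the judge scores fell between 2.8 and 4 on a zero-to-five scale), and the 0.5 tolerance is an arbitrary assumption.

```python
def calibration_gaps(judge: dict[str, float], human: dict[str, float]) -> dict[str, float]:
    """Judge score minus human score for every criterion both have rated."""
    return {name: round(judge[name] - human[name], 2) for name in judge if name in human}


def overscored(gaps: dict[str, float], tolerance: float = 0.5) -> list[str]:
    """Criteria where the judge is more generous than the human by more than tolerance."""
    return sorted(name for name, gap in gaps.items() if gap > tolerance)


# Illustrative numbers only, on the same zero-to-five scale as the judges.
judge_scores = {"text": 4.0, "image": 2.8, "layout": 3.5, "color": 3.6}
human_scores = {"text": 2.5, "image": 2.5, "layout": 2.0, "color": 3.5}

gaps = calibration_gaps(judge_scores, human_scores)
print(gaps)
print("judge too generous on:", overscored(gaps))
```

A criterion that shows up as overscored is a candidate for a tighter rubric or a revised judge prompt, which is the kind of grader maintenance the speaker describes.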
Um, if you ever hear people talk about saturation of evals, that's basically what they mean, in the way that, like, the eval is not giving any more relevant information that we can act on due to several reasons. Cool. So we see this, and I guess maybe the first thing that we want to do in this case is we wanna make an
21:39 – 25:40
Hill-climbing via prompt improvements: typography, layout, and “AI tells”
1. SPSpeaker
  agent that is a little bit more polished, right? And so for this, we actually just update our system prompts. So in- instead of just having, like, "Oh, you are a slide-generation agent, make slide deck," we now give it a little bit more information of, like, what are the expectations that we have of you in terms of typography, right? Um, because as we noticed, we said like, "Oh, this font is too small. There's too many words on there. It's not readable, or it's too big." Right? So we give it a little bit more information. So we say, like, "Slide title should be this size. Section headers should be this size. Body this size. Caption this size," right? Um, and we also give it some information on the layout and density, like, here are the things that we expect from the layout and density point of view. Uh, for example, we say, "Keep the body text concise, leave breathing room, and left align paragraphs." Right? And then also we, I think everyone kind of, I mean, I am at least getting, like, ticked off, like, if I read something that's clearly AI-written. I'm always a little bit skeptical of if I can completely trust the content and if the p- person sending me this text is, like, has, like, read it himself or themself and is standing behind that content, right? So we also say, like, avoid these AI-generated tells as well. So never use thin accent lines on the titles and don't pepper slides with emojis as decorative icons, right? So this is based on the things that we have seen in our eval, right? So we have seen as a... Let's go back a little bit. We've looked at the slide deck, um, and we're like, "Oh, this is not properly done. These fonts are a little bit off. Um, there's some emoji use in here. It's, like, a little bit all over the place." And then based on the score, we were like, "Okay, these are the things that we're clearly failing at," right? So we have, like, emoji counts, four in this case. Small font slides also four as well.
Cluttered slides, two, and text-heavy slides, right? So based on the information that we have gotten from the eval that we have run, we have made these changes to our new agent, right? Let me now pull up the result of the new agent that we have created in this case, right? Um, so this is slide one, which I think is immediately way more enjoyable to look at. Like, there's no overlapping stuff. Um, there's no, uh, dollar sign. There's just... Generally it's cleaner. This, once again, I think this one still has, like, quite small text, but at least once again, we're, like, getting a little bit more consistent with the coloring as well. Um, once again, like, the whole slide deck is more consistent. This third slide, the fourth slide, and the fifth slide, right? And this is just by basically identifying here's a few failure modes of our original one. Here's how we now make changes based on these things that we found in the system prompt, and now we run it back. And now once again, we can do the same thing. So this, we're now basically in this loop of finding what's wrong, iterating, finding what's wrong, running it again, and making improvements over time. So now we can take a look back at what we found over here. Oh, and this is actually way worse suddenly. We see, like, emoji count 20. I'm wondering where they are. I haven't seen them actually. Wondering where that is at. Hmm. Wonder if it's, like, um, a mistake in this case. Um, but generally we see, like, okay, small font slides. We've seen that. But we've, we've improved upon the cluttering. Um, and let's see. Text-heavy. Is that still the case? Um, I think, I think that's fine. I mean, those are a little bit text-heavy, but I think it's acceptable, right? So now we, like... So this once again shows the value of, like, human review as well, right? Because now we see, oh, these things that we have defined in our evals are maybe not as well-defined as we hope them to be, right?
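The run-to-run comparison walked through above can be expressed as a small diff over the count metrics. This is a sketch under assumptions, not the scoring script from the repo: `compare_runs` is a hypothetical helper, and it treats every metric as a count where lower is better. The baseline values and the new emoji count of 20 come from the talk; the other values for the new run are placeholders, since the talk does not give exact numbers for them.

```python
def compare_runs(baseline: dict[str, int], candidate: dict[str, int]) -> tuple[list[str], list[str]]:
    """Return (regressed, improved) metric names for lower-is-better counts."""
    regressed, improved = [], []
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        if delta > 0:
            regressed.append(name)
        elif delta < 0:
            improved.append(name)
    return regressed, improved


# Baseline counts as read out in the talk.
baseline = {"emoji_count": 4, "small_font_slides": 4, "cluttered_slides": 2}
# New run: emoji_count is from the talk; the other two values are placeholders.
candidate = {"emoji_count": 20, "small_font_slides": 4, "cluttered_slides": 0}

regressed, improved = compare_runs(baseline, candidate)
print("regressed:", regressed)
print("improved:", improved)
```

A flagged regression is a prompt to look, not a verdict. In the talk the jump in emoji count did not match what was visible on the slides, so the next step there was to inspect the grader rather than the agent.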
Because now I'm here arguing, like, oh, this is not as text-heavy as I expected it to be, right? So that means that something is actually wrong with the way we're grading. So now we go back then. We would go back. Go to our grader, change the grader, update it, and make sure that it better reflects the actual thing that we want to measure. Right? And this is also not something to be underestimated. Like, this calibration of how your agent should behave, um, and how your judges should judge the specific a- output is really something very fickle, right? Like, you should spend, like, proper time
25:40 – 27:42
New requirement: diagrams on every slide (and new measurement challenges)
1. SPSpeaker
  trying to find the ways on how you should make this happen. Um, let's say now that we wanna have an agent. Like, I, I, I think with this one, I mean, it's fun. I think it's nice. Um, but it's still quite text-heavy, and it's only text, right? Um, let's say now that we wanna have an agent. Let's say that's one of our requirements, right? That we have an agent that we always want to have includes diagrams. Once again, we go back to our system prompt, we update it, and we now say every slide must include at least one generated diagram or chart inserted as an actual image, right? Um, so once again, we update the system prompt or any part of the agent that you can tune, and then we go again, and we check what do we get. Okay, so this one is quite interesting. Um, I guess personally I'm not a fan of having an image on the opening slide, but once again, it is what we defined that it should do, right? So I'm gonna let that slide. But it's, it's a nice, nice graph what it's saying. It's like no negotiation and active negotiation. So it's arguing that if you do active negotiation for your salary, you can see over time the gap widens between no negotiation and yes nego- negotiation. Some extra benchmarks. I, I think this looks immediately way better just in the way that it's, like, kind of grounded into some actual facts right now instead of just waffling its way through the slide deck, right? Yeah. This one I'm not a big fan of. I feel like it's a little bit stretched, but that might also just be the screenshot. Um, yeah, and this one also not the best one either, right? Let me see... Like, let's see what the score JSON now says. Okay. No emojis. Great. No cluttered slides. Still quite text-heavy slides, surprisingly. Um, still small font size. I think that's fine. I think we just say, like, with images, I think... Yeah, I think we, we accept, like, these types of things are fine. Um, so once again, shows you some, um, questions regarding the grader that we have set up. 
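The new requirement, that every slide includes at least one image, lends itself to a deterministic check. The sketch below assumes the deck has already been reduced to a list of shape kinds per slide; that representation, the `"picture"` label, and the function names are hypothetical, and extracting them from a real .pptx file is left out.

```python
def slides_missing_images(slides: list[list[str]]) -> list[int]:
    """Return 1-based numbers of slides that contain no 'picture' shape."""
    return [
        number
        for number, shape_kinds in enumerate(slides, start=1)
        if "picture" not in shape_kinds
    ]


def every_slide_has_image(slides: list[list[str]]) -> bool:
    """Pass only if the deck is non-empty and no slide is missing an image."""
    return bool(slides) and not slides_missing_images(slides)


# Illustrative deck: shape kinds per slide, as a hypothetical extractor might emit.
deck = [
    ["title", "picture"],
    ["title", "text"],
    ["picture", "text"],
]
print(slides_missing_images(deck))
print(every_slide_has_image(deck))
```

Returning the offending slide numbers, rather than only pass or fail, makes the result easier to act on. Whether the image is any good remains a question for the image judge.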
But now we can also take a look at, like,
27:42 – 31:17
Adding a QA loop: self-critique through render–inspect–fix cycles
1. SPSpeaker
  the judges, right? Like, for example, because now we have images that we have created, so now we can also consider how does the image judge, uh, think this is. And it says it's 3.8 out of five. Um, doesn't s- give us a lot to go off, right? It just gives us a random number. What does this mean? How can we improve upon this? But that's fine for now. Now, one thing that we always see that works just generally quite well, and that's, like, it's transversal over every single use case, is adding a QA loop, right? Um, for coding, this is quite intuitive. That's basically saying, like, you create an agent that actually is writing the code, right? And then you add a second agent that is then looking at the code that has been written and just criticizes it. So it's basically saying, "This is bad. This is bad. This is bad. This introduces a bug. This introduces a bug. This is not according to standards," whatever, right? So it basically is criticizing the, the thing that has been created. And then that part of the feedback you give back to your original agent, the creation agent. The creation agent goes off again, does the creation, does the fine-tuning, makes the changes that were informed by the criticizing, and then once again, after that is done, it goes back to the criticizing agent. And that loop basically goes on and on and on until both sides are like, "Okay, this is fine. We can ship this." And that's basically what we now do in this, uh, next step. Um, so we basically say, like, okay, required QA loop. Um, assume there are no problems. Um, oh, assume there are problems, and then your job is to find them. Approach QA as a bug hunt, not a confirmation step. And this is quite interesting because we're, like, actively instructing the agent to behave in a way adversar- adversarially, right? Like, we're saying, like, there are issues, you need to find them. It's not b- it's not like, "Oh, there might be something.
You might be interested in finding something." No, it's actively saying, "There are issues. Go find them." Um, and then we say, like, we instruct after writing the deck, okay, convert it to images, inspect every slide image yourself, fix issues, re-render, re-inspect, and then do not stop until you've completed at least one fix-and-verify cycle. Cool. Now, as I said, I think for coding, this is quite intuitive. But I think it's also quite intuitive if you take a look at, like, um, the, the slides that we have created, right? Because that's basically what we did. We have looked at the slides and we're like, "Ah, this is not good. This is not good. Let's take that feedback, update our graders, update our system prompt, and let's run it back again." Right? So let's now see if we can actually-- if this is actually showing some improvements. Um, I think this is immediately a lot better. So the, the, the image is way bigger now. I think it's way more readable even from a further distance away. Um, still the slides are small, but it's like, for example, it's sourced now. There's a source over here as well, which is quite good. Um, I think this is also way better. It is more cleanly structured. I think the image is also a little bit better as well, right? A quite interesting graph in this case, uh, your value profile versus team average. This one is still a little off in my opinion. Also, we now have, like, a little introduction of, like, these weird ticks. Um, and this one is also a little bit better, I would say. But I think, like, the, just the image taking is, um, kind of messing with the slide here. And so then we kind of know the drill by now. We take a look at the score. We see, like, has it improved? Why do we still see gaps? And now we see, like, for all of the judges that we have created, it is higher than, uh, the ones before, right? We are now all in the 4.2 to 4.4 range. Um, so we're on a, we're on a good track, right? And, um, you can keep on doing this. 
You can keep on doing this. Um, and
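The create-critique-fix cycle described in this section can be sketched in a few lines of Python. This is only an illustration under assumptions: the talk does not show the agent code, so `qa_loop`, the `Creator` and `Critic` signatures, `max_rounds`, and the toy stand-in functions are all hypothetical names. The one behavior taken from the talk is that the loop may not stop before at least one fix-and-verify cycle has run.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: the talk does not show the actual agent code.
Creator = Callable[[str, List[str]], str]  # (task, feedback) -> draft
Critic = Callable[[str], List[str]]        # draft -> list of issues found


def qa_loop(task: str, create: Creator, critique: Critic,
            max_rounds: int = 5, min_fix_cycles: int = 1) -> Tuple[str, int]:
    """Alternate creation and critique until the critic finds no issues.

    Returns the final draft and the number of fix cycles that ran.
    """
    draft = create(task, [])
    fix_cycles = 0
    for _ in range(max_rounds):
        issues = critique(draft)
        # Stop only when the critic is satisfied AND at least one
        # fix-and-verify cycle has been completed.
        if not issues and fix_cycles >= min_fix_cycles:
            break
        # Even with no reported issues, force one revision pass.
        draft = create(task, issues or ["Re-inspect every slide for problems."])
        fix_cycles += 1
    return draft, fix_cycles


# Toy stand-ins so the sketch runs without any model calls.
def toy_create(task: str, feedback: List[str]) -> str:
    font = 24 if feedback else 10
    return f"{task} | font={font}"


def toy_critique(draft: str) -> List[str]:
    return ["Font too small to read"] if "font=10" in draft else []


final_draft, cycles = qa_loop("Salary deck", toy_create, toy_critique)
print(final_draft, cycles)
```

In a real setup the two callables would wrap model calls, with the critic prompted to assume problems exist. The `max_rounds` cap is an added safeguard, not something from the talk, so that a critic that always finds something cannot loop forever.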
31:17 – 33:48
Model upgrade vs. prompt engineering: switching to Opus for better defaults
1. SPSpeaker
  you will always make, like, these little changes. But sometimes, and this is, I guess, where it gets quite interesting and more, like, on- uh, more, like, nuanced, is you can also just go to a smarter model, right? Because, like, now you're, like, defining, oh, this is what a good slide should look like. This is what it should do, this is what it should not do. But with these models getting smarter and better over time, you kind of expect them to be, like, able to figure that out on their own, right? Um, I mean, that would at least be nice. So that's what we tried out as well. So now in the last one, we basically just changed our model to Opus 4.7 instead of Sonnet 4.7, which we have used up to this point, if you can... Uh, 4.6. Um, so now we have switched to Opus 4.7, and we have basically just given it a simple prompt again. Like, you are a slide-generation agent, and then when the user gives you a topic, create a PowerPoint file at whatever, and then you have a shell. So it's basically just the initial prompt that we gave to, uh, our Sonnet model in the beginning, right? And then once again, let's now consider taking a look at the results of those. And this is just a base prompt, right? Like, you can immediately see, like, it's significantly better than the Sonnet one, right? I think there's still clear issues that we can iron out, but generally, like, it's way more structured, right? And then we can take a look at the score as well. And I think this is quite interesting and quite telling. Like, for example, this Opus just does not use any emojis. Like, it kind of knows, like, if you wanna make a slide deck about salary increase, emojis are probably not the right place to put them, right? Um, it also has, like, fewer small font slides because it kind of has, like, this innate knowledge of, okay, it should be readable. This is how a slide deck should function. This is what people expect out of a slide deck, right? 
And then we get to these judge graders, right? Um, we see a 4.4. We see a five for the image judge. Do we even have an image in this one? I don't think we, we do, actually. No, we don't. Okay. But so once again, we got a five in this one. Layout judge 4.2, and then the color judge 4.8. And title body coherence 4.4. So this is, like, immediately giving, like, extremely high scores as well, right? Which I think is quite interesting because, like, this is once again showing that we might not be measuring the right thing. And this is not too unexpected for these types of, um, graders, right? Or for these judge graders. I think one of these things with, like, I, I... Okay, let's go to the code graders. I think those are quite straightforward. I think most people in the room
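The code graders mentioned in this section (emoji count, small-font slides) are simple enough to sketch directly. The version below is a rough illustration, not the graders from the talk's repository: the function names, the Unicode ranges used to approximate "emoji", and the 18 pt threshold are all assumptions. It also takes already-extracted slide text and font sizes as input, so no slide-parsing library is involved.

```python
import re
from typing import List

# Rough approximation: matches the main emoji and symbol blocks only,
# so some emoji (flags, keycaps) are not counted.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emoji_count(slide_texts: List[str]) -> int:
    """Total number of emoji-like characters across all slide texts."""
    return sum(len(EMOJI_PATTERN.findall(text)) for text in slide_texts)


def small_font_slides(font_sizes_per_slide: List[List[float]],
                      min_pt: float = 18) -> int:
    """Number of slides whose smallest font is below min_pt (assumed threshold)."""
    return sum(1 for sizes in font_sizes_per_slide
               if sizes and min(sizes) < min_pt)


# Hypothetical extracted data for a three-slide deck.
deck_texts = ["Raise 🚀🚀", "Plain text", "Done ✅"]
deck_fonts = [[32, 18], [12, 24], []]
print(emoji_count(deck_texts), small_font_slides(deck_fonts))
```

Graders like these are deterministic, which makes them easy to compare across runs such as the Sonnet and Opus decks discussed here.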
33:48 – 37:51
Making model-based judging reliable: anchoring, explanations, and ordering effects
1. SPSpeaker
  would have understood by now, like, how they work and how, what we can do with them. Like, for example, emoji count, it's quite simple. Just count the number of emojis, and that's it. But with this judging, what we have done here is actually quite problematic. We basically say, like, give a score from zero to five, and for text, well, the text shou- uh, the title should be simple and clear to indicate the main points for main content. Avoid too many text and keywords. But it has nothing to anchor on, right? Like, it doesn't really know what good looks like in this case. It doesn't know what bad looks like. So there's still, like, this trade-off between, like, what does a model actually know, and what do we need to give more information, um, to the model to make sure that it can give, like, a proper, um, proper judging of what we actually have produced, right? So for example, in this case, um, I would, for example, say what could help is say, like, oh, this is a bad example. Like, let's say you have a zero. Like, everything is just awful. These are some telltale signs that you're dealing with an extremely badly formatted, um, slide deck. And then, like, over time, the different ranges you can kind of express. Um, and then once over time, uh, then once again, like, that doesn't mean it will still be able to give, like, a good answer because we now have these results. We have this number that our LLM decided to output for some reason. Like, for example, in this case, image judge put out five. Okay, what do we do with that number now? Okay, it's a five. We, we, we just said there was not a single image in that slide deck, right? Um, so how can we interpret this five? One way of doing this is just basically always asking your judge graders to give reasons why it came to that conclusion, right? And one thing that you should be very, like, um, cautious about is the ordering, right? 
Um, I, I've had it happen where I was, like, setting this up, and I did, like, this exact thing. So I had, like, the number, and then I said, like, "Okay, give me also reasons why you did that." And so then it said, like, "Oh, it's a four, and reasons for this are this, this, and this." But we know that an LLM, it works autoregressively, right? So if it is anchored on, like, this four, it will do anything it can to argue why it should be a four, right? Anything. A- a- even if it's, like, extremely bad. If it's like, if it should be, like, a one, it will still say, "Oh, it is good for these and these reasons," because it needs to justify the four that it put out. So how you do it is you actually turn it around. So first you say, like, "Give me a bunch of reasons. Give me pros, give me cons, give me reasons why it should be high, give me reasons why it should be bad." And then based on all of those reasons together, then you need to make a final decision on the output, right? And that's also, that goes also back to, like, this QA loop as well. Um, because then once again, you can get a little bit trickier where you have, like, multiple agents also doing the verification part, where you have, like, one agent that is, like, finding all of the issues, and then the other one is, like, refuting those. For example, one example that I can give, which I think is quite interesting, let's say you wanna make, um, a document for, um... Where you need to do, like, some analysis. You first need to gather a lot of context from the internet, for example, like on a legal document, for example, right? Um, a- and you ask the, you ask the model to, like, make a summary of, um, a certain case. What was decided? What does this have for legal implications for other cases, right? You need to be very careful with, like, all of these things that, like, legal cases are generally, like, quite tricky. 
And, like, an agent would love to create, like, oh, this and this, and jump to conclusions, like, this is the reason, and that's it, right? And then the grader might be like, "Oh, this is unclear. This is, um, maybe not as... This is maybe inf- uh, untrue. This is maybe, uh, maybe, like, glossing over the actual facts." All of those type of things, right? But then once again, you can, like, apply these multiple techniques. You can have, like, multiple graders, for example, seeing, like, um, evaluating those and seeing, like, what are the main ones popping up. Um, because once again, a grader might still hallucinate things as well, right? Especially in, like, these very nuanced scenarios, right? So
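The three judge techniques from this section (anchoring the scale, asking for reasons before the score, and comparing several graders) can be combined in one small sketch. Everything named here is an assumption for illustration: the anchor wording, the JSON reply format, and the `build_judge_prompt`, `parse_verdict`, and `aggregate` helpers do not come from the talk. The judge replies are hard-coded strings so the example runs without a model call.

```python
import json
import statistics
from typing import Dict, List

# Hypothetical anchors: the talk suggests describing what each score range
# looks like but does not give the wording.
ANCHORS: Dict[int, str] = {
    0: "Unreadable: overlapping text, no structure, missing titles.",
    3: "Usable: readable, but cluttered or inconsistent in places.",
    5: "Excellent: clear titles, concise body, consistent layout.",
}


def build_judge_prompt(criterion: str, anchors: Dict[int, str]) -> str:
    """Build a prompt that asks for pros and cons first and the score last."""
    anchor_lines = "\n".join(f"- {s}: {d}" for s, d in sorted(anchors.items()))
    return (
        f"Evaluate the slide deck on: {criterion}\n"
        f"Score anchors:\n{anchor_lines}\n"
        "Respond with JSON whose keys appear in this order: "
        '"pros" (list), "cons" (list), "score" (integer 0-5). '
        "Write the pros and cons before deciding on the score."
    )


def parse_verdict(raw: str) -> dict:
    """Parse one judge reply and reject replies with no stated reasons."""
    verdict = json.loads(raw)
    if not verdict["pros"] and not verdict["cons"]:
        raise ValueError("judge gave a score without reasons")
    if not 0 <= verdict["score"] <= 5:
        raise ValueError("score out of range")
    return verdict


def aggregate(raw_replies: List[str]) -> dict:
    """Median score plus the cons raised by at least two judges.

    Simplification: cons are matched by exact string, which real judge
    output would rarely satisfy without some normalization step.
    """
    verdicts = [parse_verdict(r) for r in raw_replies]
    counts: Dict[str, int] = {}
    for v in verdicts:
        for con in set(v["cons"]):
            counts[con] = counts.get(con, 0) + 1
    recurring = sorted(c for c, n in counts.items() if n >= 2)
    score = statistics.median(v["score"] for v in verdicts)
    return {"score": score, "recurring_cons": recurring}


prompt = build_judge_prompt("image quality", ANCHORS)
replies = [
    '{"pros": ["clear titles"], "cons": ["no images"], "score": 3}',
    '{"pros": ["clean layout"], "cons": ["no images", "small font"], "score": 2}',
    '{"pros": [], "cons": ["no images"], "score": 3}',
]
result = aggregate(replies)
print(result)
```

Keeping only the cons that more than one judge raises is one way to filter out a single grader's hallucinated complaint, at the cost of dropping real issues that only one judge noticed.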
37:51 – 39:15
Closing: evals as a continuous practice for building better agents
1. SPSpeaker
  there's, like, different ways of how you then can work with these judges to make sure that you actually get, like, good, consistent output that is actionable, right? And what I've shown you here today is basically just a small introduction to how evals can help you, but it's definitely not the end. I think 45 minutes for a session on evals is, in my opinion, quite short because it can get really deep, right? Because, like, I started this off, this session, with talking about benchmarks, which are, in the end, just evals. And every single time, why would every single model provider care so much about benchmarks, so much about evals, if it wasn't one of the main important things when we are building new models, right?
2. SPSpeaker
  [laughs]
3. SPSpeaker
  Exactly. We need to find the things that we are failing at. Exactly. We need to find things. What are we good at? What are we bad at? How can we make the model better in future generations? And that's the same thing when building applications that are consisting, that's using AI agents, right? It's the same thing. It's just finding what works, finding what doesn't, iterating, and making sure that the changes that you're making, that you're informed on the decisions that you are making, and making sure that the changes you make have actually positive influence on your final outputs. Okay. Thank you, guys, so much. This is all the time that I have. Thank you, guys. [upbeat music]

Episode duration: 39:15

Transcript of episode v9FTCvkV_a0

