Teaching agents to learn from your team

An agent that improves itself daily by treating its instructions as code: edited, reviewed, and merged like any PR. The talk covers writing skills that teach agents how to think (not what to do) and closing the feedback loop so that team judgment flows back automatically.

May 22, 2026 · 28m

EVERY SPOKEN WORD

25 min read · 5,165 words

0:00 – 0:25
Intro
1. Speaker
  [on hold music] [audience applauding]
0:25 – 2:59
Why most agents stall at “80% there” (and why this talk exists)
1. Speaker
  Hello, everyone. My name is Petra, and today we are going to talk about teaching agents. Um, I'm heading up developer experience at Warp, and I'm sure a lot of you know Warp. Um, we are pretty much the best place to run your agents and manage them and work with them. So, uh, it's probably a terminal that you know. You should check it out if, if you want to, uh, use Claude Code in there. Um, my team is responsible for telling the world how great Warp is. Um, and, um, we work a lot with the community. We create content. And so today, we are going to talk about how to teach agents through the story of how we built, uh, Buzz, our social, um, social, um, response helper little agent. Um, so this is Buzz. Um, before we-- before I say a bit more about what Buzz does, a quick lay of the land. I would like to ask all of you guys to put up your hand if you've built an agent before. Great. Obviously, that's a lot of guy-- a lot of people. Um, now put up your hand if you've built an agent that is running on a weekly or daily basis and is doing something... Okay. [chuckles] Already hands going up. A lot of you have done that. And now put up your hand if that agent is still daily in production, shipped, and you're very happy with the result. Now, that's a significantly less, uh, hands up. So that gap is what we are going to talk about how to close today and, and help all of you guys, uh, build better agents that actually, um, work for you and do things for you, um, on a daily basis. Um, so while we were building Buzz, we noticed, and I've seen from a lot of people, that we build agents, and we get to this point where a lot of you have put up your hand that you have something that kinda sorta works, and it kinda sorta does what you want to do, um, but it's just not quite there for you to put it into production and, and just let it go and let it do its thing. 
Um, and that sort of eighty percent there but just not quite is where I think a lot of agents die, and it sort of, um, end up more-- almost worse than if you would not have an agent because you end up spending time on tweaking it and prompting it and spending time on trying to get it right because you can kind of feel that it's almost there, but it's not quite good enough to just let it run on its own. Um, and so that gap is what I would like to close today through telling you about how we taught Buzz, um, to learn on its own, essentially.
2:59 – 5:00
What Buzz does: triage social mentions and draft authentic replies
1. Speaker
  Um, so Buzz is our agent at Warp to help us respond when users mention us. Um, it basically monitors, um, our social mentions, and it helps us figure out what to do with them. Warp is a very popular product, and we have ton of people wanting to talk to us. We have a lot of users who love Warp, and they want to reach out to us. They have questions. They have bug reports. They want to learn about what features are coming up next. And we want to talk to all of those people, and we want to make sure that we can get back to them and engage with them. And so these replies are very important to us because they set the tone for the product, for the company, for the community, and all of these things are very important for Warp as a, as a company. And so we are also a startup, and we don't want to hire an army of people to engage with this army of the community that we have. And so we built Buzz, and Buzz monitors these mentions for us and helps figure out what to do, um, as a, as a, like, as an action. So it helps us decide between replying to something because someone had a question or someone had a product feedback item or something along those lines, or it helps us just figure out that we should like something because it shows engagement with the user, or we should just skip it because it's not actually about Warp, or they don't expect a response, or they're talking to someone else. And if Buzz decides that we should respond to something, it also helps us draft a message so that we don't start from scratch. And all of this is a massive time save for us because we are a small team, and it allows us to focus on the high ROI moments of engaging with the community. Buzz was built in a few days. It's composed of about fifteen or so skills, um, which you will also see in a little bit. There is essentially zero code written. 
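As a rough illustration of the triage output described above, each mention could be captured as one small record. This is only a sketch: the talk says Buzz itself is built from skill files with essentially no code, so the names `Action` and `TriageDecision` and all field names are hypothetical, not Warp's implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    """The three outcomes the talk describes for a social mention."""
    REPLY = "reply"
    LIKE = "like"
    SKIP = "skip"


@dataclass(frozen=True)
class TriageDecision:
    """Hypothetical record of what the agent suggests for one mention."""
    mention_url: str
    action: Action
    reasoning: str                # short explanation of why this action fits
    draft: Optional[str] = None   # starting point for a human-written reply

    def __post_init__(self) -> None:
        # A draft only makes sense for replies; likes and skips carry none.
        if self.action is Action.REPLY and not self.draft:
            raise ValueError("a reply decision needs a draft message")
        if self.action is not Action.REPLY and self.draft:
            raise ValueError("only reply decisions carry a draft")


decision = TriageDecision(
    mention_url="https://example.com/post/1",
    action=Action.SKIP,
    reasoning="The author is talking to someone else and expects no response.",
)
```

Keeping the reasoning next to the action is what lets a teammate skim a suggestion and agree or disagree quickly.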
So all of these are skill files, and they build on these agentic primitives, um, that allow us all to build these really cool agents that take things over from our plates. Um, and it also connects to various services from the X API to Slack and all of these things to be plugged
5:00 – 6:31
The hard part: encoding “judgment and taste” (not unit-testable behavior)
1. Speaker
  into the rest of the team and the context. So the challenge that we faced when building Buzz is that something like figuring out social replies and what to do with them requires judgment and taste. When to say something, what to say, how to say it, when to completely stay out of it because by inserting yourself, you're not actually achieving the goal that you have. And you probably also know that you see all these replies on Twitter that are completely AI-generated, and you can notice those from a mile away, and you probably don't really want to engage with them because it kind of feels weird, and it's just very obviously AI-generated. And so we wanted something that is better than that, but it's-- but something that is not just not AI-generated, but that it really helps us engage with the community deeply because we care about, um, we care about our users a lot. And so the question was, how do we build an agent that can do this well? How do we build an agent that has good judgment and good taste to make these decisions, write these replies, make sense of what users are telling us about? Um, and that gap is not just-- we wanted it to be not just kind of okay, like you could kind of put this on Twitter, but we wanted it to be really good like it, like, like it actually sounds like it's coming from a human, and it actually understands our product context and our company context to make good decisions about which users-- which user mentions to engage with in what way, and what is
6:31 – 9:03
Why common agent loops work for code, but fail for fuzzy human tasks
1. Speaker
  valuable for the user as well. Um, and so if we look at some of the approaches of how this is normally solved, we see a lot of these agentic loops. A lo-- Agents are really, really good at figuring things out, basically, and they're really good at going in loops when they can figure out if they're there or not. And so, for example, you probably know the Ralph loop that you can see. Um, and why the Ralph loop works really well is because there is this external check that allows the agent to know if it's there or not. It has a specific goal, and it can evaluate if it has achieved that goal or not. Now, it-- that works really great when it's about coding. It works really great when it can run our unit test suite, and it works really great when it can use computer use and browser use, and it can check that, "Hey, I made this change. I check it in the browser, I make an API call, I make a cURL command. Um, is it doing the thing that I want it to do yet? Is it doing it? Yes. Great. Then I'm good. Is it not doing it yet? Then okay, what am I seeing? How can I iterate on it to make that achieve its goal?" Um, now the problem is that when an agent actually needs judgment and taste, and it's not that black and white, or you don't have unit tests, it's very difficult to figure out if to-- how to do this loop, essentially. Because you have something where with social replies, you would have to, for that external check to work on its own for the agent, you would have to, like, send a lot of replies live and see what users are saying and how they are reacting, what the impact on the brand perception is on the community is, and if it's like w-what other people are thinking when they are reading those threads. And those feedback loops are super complex and long, and you can't just set it up for the agent as an external check like you do with unit tests. And so what we've-- what, what the question here is, how do we get that knowledge into that agent? 
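The loop-with-an-external-check pattern described above can be sketched in a few lines. This is a generic illustration, not the Ralph loop's actual implementation: `attempt` and `check` are hypothetical callables standing in for the agent and for something like a unit test run, and the toy functions exist only to make the example runnable.

```python
from typing import Callable, Optional, Tuple


def run_until_check_passes(
    attempt: Callable[[Optional[str]], str],
    check: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 5,
) -> Optional[str]:
    """Retry `attempt` until the external `check` accepts its output.

    `attempt` receives the feedback from the previous failed check
    (None on the first try). Returns the accepted output, or None if
    the iteration budget runs out.
    """
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        candidate = attempt(feedback)
        passed, feedback = check(candidate)
        if passed:
            return candidate
    return None


# Toy stand-ins: the "agent" gets it right once it has seen feedback.
def toy_attempt(feedback: Optional[str]) -> str:
    return "patched" if feedback else "first try"


def toy_check(candidate: str) -> Tuple[bool, str]:
    ok = candidate == "patched"
    return ok, "" if ok else "tests failed"


result = run_until_check_passes(toy_attempt, toy_check)
```

The loop only works because `check` is cheap and objective. For social replies there is no comparable function to pass in, which is the gap the rest of the talk addresses.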
How do we give the agent judgment and taste? How do we teach it to do these fuzzy things that we humans really understand well? We understand nuance and context. When we look at a Twitter thread, we know that, "Hey, that was a pretty weird reply," or, "What should we say to help the user, and how should we say it in a way that makes them feel better about the product and, and also creates a, a good engagement?" And, and this applies to a lot of other things, from customer replies to helping you with your own Slack messages, to code review comments, to all of these fuzzy things that we
9:03 – 10:03
Attempt #1: “Nail the prompt” turns into brittle checklists
1. Speaker
  interact with on a d-daily basis that need judgment and taste and are not as clear-cut as something that you can unit test. So when we built Buzz, we tried a few things. Um, we started, as I think a lot of you do, with just trying to nail the prompt. Seems like a pretty reasonable place to start at. And usually, you try to craft a prompt that encapsulates everything that you want the agent to do, and then you work with the agent to improve that prompt, and you ask it to identify conflicting information, you ask it to figure out what's ambiguous for it, and you ask it to identify gaps that, that it could fill, and then you work with the agent to improve that prompt. I think a lot of us have done that, whether for an agent or just to, just to nail out, uh, plan out what a, a feature implementation should look like. What we saw with Buzz is that the agent or the prompt ended up as pretty much a checklist. It was just a list of rules of, like,
10:03 – 12:04
Shift from rules to principles: teach the agent how to think
1. Speaker
  if X happens, you should do Y. Um, and the problem with that was that it sounded like a robot because it couldn't really figure out what to say and how to say, and it also broke the moment something new appeared because it just couldn't have that flexibility, and it couldn't deal with that. The rules were just too brittle. So we thought about how would we explain this to a new team member, because working with agents is very similar as to trying to explain to a new team member how and what they should do. And when it comes to that, you would probably explain to them how to do things. You wouldn't just give them rules that if X, Y, Z happens, you should a, b-- do A, B, C. You would explain to them how to think about things, how to make good decisions, how to reason about these situations, what the purpose of engaging with the community is, and you would give them guidelines like, "Hey, you shouldn't get defensive when users complain about the product or they have an issue. You should be kind and empathic. You should come across as a product builder versus just someone who is processing support requests." And you have all these principles, and we switched our, our agent to work off of these principles. So we ended up with, instead of these long list of rules, we ended up with principles that really much better encapsulated what we wanted the agent to do. And so the result was pretty much that the skill file was like a fifth of the original length because you needed much less text, much less, uh, lines, lines of code, if I can put it that way, in a skill file, um, to achieve the same or even better result. Um, and through these principles, because they were so much more flexible for new situations, the agent could reason with them better, so the output got better. So you have a smaller skill file, you have better output just because you switched from these rules that the agent tends to do on its own. You switch from rules to principles. 
So that was the first thing that worked really well for us. But
12:04 – 14:06
Attempt #2: manual evaluation and feedback—then the agent regresses to rules
1. Speaker
  there was still this gap that it wasn't quite as good as we wanted it to be. It was like kind of-- getting kind of there, but it wasn't like, "Oh yeah, this reply is amazing, and we're just gonna send it live as is." So the next thing we did, uh, we did as I think any good engineer, you start testing the thing, you evaluate the results, and then you fix whatever you see. I think it's a pretty, pretty usual loop of things as you build something. Um, and what this looked like in practice when we were working with these skill files and these, these agents is that I collected a bunch of, um, these responses that I wanted the agent to triage and figure out what to do with, and it gave me whatever response it would have done or whatever it suggested that we do, and I gave feedback on it. I gave feedback on, "Hey, this is good because of A, B, C. This is not really good because of X, Y, Z. This is how I would do this differently. This is how my reply would be. Um, this is why I wouldn't actually reply to this because A, X, Y, Z." Um, and so this created this, um, set of human feedback on the agent-generated output, um, and I wanted the agent to learn from that. Like, "Here is, here is what you did. Here's a bunch of feedback. Go and figure out how you can do better." Now, what the agent did was go back to these damn rules, um, and we kinda know that that doesn't work. So it had these principles, and it started adding these very specific rules like, "If a person is talking about, uh, is having some X, Y, Z problem with, uh, with the product, never mention pricing in the first line." And it's like, sure, it worked in that specific case, but it's not something that is applicable to most other use cases, uh, or most other situations. A much better learning would be, "If someone is venting about the product, don't try to pitch them some other part of the product." That is a much more flexible thing. And so
14:06 – 15:08
Teaching the agent to learn: a meta-skill that updates principles correctly
1. Speaker
  what we realized was that we had to teach the agent to learn differently. The agent sort of needed to learn how to learn. And if we think about again, how would we do this with a new team member? We would explain to them that-- We would sort of explain why, um, why something is, is better based on our feedback, and we would ask them to, like, "Hey, take a look at what you did. Take a look at what I told you. Take a look at your instructions, and what is the gap between the two? What would your instructions need to be for you to have the same output that I gave you as an ideal expected output?" And so we encapsulated this in, in another skill, and that worked actually really well. I was really happy with how now we had two components. Um, basically, we had principles which told the agent what to do, and they were very flexible, and they were-- they applied well to new situations. And
15:08 – 15:39
Operational challenge: who keeps training it (without adding team toil)?
1. Speaker
  we also had a way for the agent to learn, and so it could expand its own instructions. Now, the next problem was who's going to keep teaching it? Because it takes a bunch of time to sit down and do all this, all this back and forth and keep spending time on this, giving it data, giving it feedback. So I don't really want another team meeting. I don't want another task to assign to someone on a rotation. Um, so how can we have the agent just learn from what the team is already doing? And
15:39 – 18:11
The low-friction Slack feedback loop: emoji reactions + thread notes
1. Speaker
  this was the last piece that really clicked everything, um, together for us. How do we have the smallest input from the team for the biggest output for the agent to have a better result? And so we designed a feedback loop, um, that allows us to basically have almost no extra action on our side on a daily basis and allow the agent to still learn from the team. What it-- this looks like in practice is that Buzz monitors these mentions and helps us figure out what to do. In the end, we still do everything manually when we actually interact with a user because it's very important to us to maintain that authenticity, and, and we actually care about the, the, the user, um, experience there. But by Buzz triaging these mentions and helping us figure out what to do, helping us figure out what not to care about, um, we save a lot of time. And what happens in reality is Buzz monitors these mentions and sends us Slack messages. We have a Slack channel. We get a Slack ping, "Hey, there is this mention. This is what you should do with it. This is why." So it also explains its thinking, which is really helpful for us to get in context quickly. And then the team just monitors this channel as any other team channel, um, and adds an emoji reaction. So basically, Buzz says, "Hey, here's a mention. You should reply to it. You should say something like this." And then the team takes a look at that. It's very easy and quick to skim through it. We also leverage a lot of Slack's, um, structured formatting to make it easy to skim and make it the least effort on the team. And then the team adds an emoji reaction with, with what action they actually took. If they actually reply, they just add a check mark. And through this, um, Buzz can take a look at what it suggested to do versus what the team actually did and draw takeaways on what that gap was. The team can also leave notes in Slack threads. Um, I'm gonna show all of this in, in a few minutes on screenshots. I think it's easier to see. 
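The comparison between what Buzz suggested and what the team actually did can be sketched as a small diffing step. The talk only states that a check mark means the team replied; the other two emoji in `REACTION_TO_ACTION`, the `Suggestion` record, and `find_gaps` are assumptions made for illustration, and no real Slack API call is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Reaction name -> action the team took.
REACTION_TO_ACTION: Dict[str, str] = {
    "white_check_mark": "reply",  # from the talk: check mark = team replied
    "heart": "like",              # assumed mapping
    "no_entry_sign": "skip",      # assumed mapping
}


@dataclass
class Suggestion:
    """Hypothetical view of one Slack message posted by the agent."""
    mention_id: str
    suggested_action: str
    reactions: List[str] = field(default_factory=list)
    thread_notes: List[str] = field(default_factory=list)


def actual_action(s: Suggestion) -> Optional[str]:
    """Return the action implied by the first recognised reaction, if any."""
    for reaction in s.reactions:
        if reaction in REACTION_TO_ACTION:
            return REACTION_TO_ACTION[reaction]
    return None


def find_gaps(suggestions: List[Suggestion]) -> List[dict]:
    """Keep only items where the team disagreed or left a thread note."""
    gaps = []
    for s in suggestions:
        actual = actual_action(s)
        disagreed = actual is not None and actual != s.suggested_action
        if disagreed or s.thread_notes:
            gaps.append({
                "mention_id": s.mention_id,
                "suggested": s.suggested_action,
                "actual": actual,
                "notes": list(s.thread_notes),
            })
    return gaps
```

Items where the team simply agreed produce no gap, so only disagreements and explicit notes reach the learning step.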
And so all of this allows Buzz to have a lot of contextual feedback with extremely little extra effort from the team because the team also uses these emoji reactions to not step on each other's toes and keep track of what was already handled and what wasn't. And so this means that the sort of breadcrumbs that we leave for each other, Buzz just simply learns from those
18:11 – 19:43
From Slack signals to GitHub PRs: daily automated instruction updates
1. Speaker
  and draws takeaways on how to make its instructions better. Um, and then opens pull requests. All of these skills are in a Git repo, um, so that we can essentially handle them as code, and then we just look at these, uh, pull requests on a daily basis. And the insight that we, that really worked for us was just keep it simple. [chuckles] With the trickiest part of designing feedback loops is always the humans, because you have to be very intentional about how you create some extra stuff for humans so that they still do it. If it's too complicated or if it takes too much time or too out of the normal process, they're just simply not going to do it. And you want them to do it because the agent learns from it. And so that was the first insight. The other one was to make it feel like a teammate because it just-- you just see humans interact more and more meaningfully with agents that feel like teammates. You can talk to them on Slack, you can leave them notes, you can leave them emoji reactions, and it really helps the team give very m- give more valuable feedback to the agent itself by just making it have a name and have a little bit of personality and a little bit, little bit of whimsy. And so ultimately, what closed the gap for us was this daily loop. It allowed us to consistently make the agent better, um, and allowed us to, to have consistently better results and, and more time and effort saved on our side. And all of these pieces need the others. The principles need-- The principles are needed because the agent needs to know what to do. Teaching
19:43 – 24:47
What it looks like in practice: skills repo, Slack triage examples, and PR diffs
1. Speaker
  the agent to learn is required because the t- the agent needs to get better over time, and then this day-to-day feedback loop is needed so that the agent actually gets the information that it can use to improve itself. So let's take a look at what this looks like in practice. Um, this is one of Buzz, Buzz's skill files. Um, you can see that it's just, this is just a normal, um, GitHub UI. On the left, you can see all the various skills. Um, as I mentioned, we have about fifteen, but it consistently-- It, it's, it always increases because we keep adding new stuff to Buzz and it keeps doing new things. Um, but here you can see this is the Warp reply skill. Um, this is what allows Buzz to draft messages, and it's, uh, an integral part of it, making a decision on what to reply to and what to skip and what to like. And so here you can see the principles. It doesn't say things like, "If person X mentions whatever feature, you should say ABC," or, "You should link to this docs page," or something along those lines. It has principles, so it can take a look at the, the Twitter thread, let's say, and just understand what the user needs and what they want and what they want to talk about and figure out what to say. Then, as I mentioned, we have this, um, Slack channel, and this is the channel that the, the team consistently monitors. Um, you can see here the different actions that Buzz suggests that we do. Um, the top one here is, uh, a suggested reply with some reasoning of what the user is looking for and a message that is drafted. We then use this as the basis of what to say, but we usually sort of make it our own, but it's immensely helpful to have the right tone, the right style, the right information, the right content already there for us to start from. It basically removes ninety percent of the effort and leaves us the ten percent that is, uh, the most valuable to give that reply, the, the most helpful for the user as well. Um, and then you see skipped replies here. 
This is my favorite, honestly, because it means that we don't even have to care about this. We don't have to look at this. We don't have to look at this-- the Twitter thread. It's because we don't end up, um, taking an action. They're talking to someone else, they're talking about something else, and it makes no sense for us to insert ourselves, and so we don't have to spend any time on these. Um, or it just suggests that we like something. If a user says that Warp is cool, uh, we don't necessarily need to say anything. It's just very valuable when we like the tweet and show appreciation towards the user. And then the feedback loop, um, looks like this. So you can add these emoji reactions, um, and what you can also do is add, um, a note for Buzz that it will pick up in the Slack thread itself. Um, so here you can see, um, I gave some feedback to Buzz that we shouldn't correct the user in this case. They said something nice, or they had a question about the product. We gave them, uh, an answer of where they can find that feature. It makes zero sense to correct the user about, um, whatever context they, they ask this question in. And that feedback then gets picked up by Buzz. It runs daily. It looks at all these emoji reaction differences, it looks at all these threads, and it draws these takeaways, and it opens a pull request, and it sends us a Slack message with that pull request. It explains what it changed in a very brief way. It links to the PR. So we get all of this information pushed to us. We don't pull anything. And so because it's also all in Slack, it's the same tool, it's the same processes that the team is already using, so it's extremely low friction. It makes it very easy for the team to interact with it. And so Buzz opens this pull request, links to it every morning. We just click the link. 
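The daily learning run described above could start by assembling the feedback into one prompt for the learning skill. This is a minimal sketch under stated assumptions: `FeedbackItem`, `build_learning_prompt`, and the prompt wording are hypothetical paraphrases of the talk, and the steps that open the pull request and post to Slack are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FeedbackItem:
    suggested: str   # what the agent proposed
    actual: str      # what the team actually did
    note: str = ""   # optional thread note from a teammate


def build_learning_prompt(
    instructions: str, items: List[FeedbackItem]
) -> Optional[str]:
    """Return a prompt for the learning step, or None if there is no feedback."""
    if not items:
        return None  # nothing to learn from today, so no pull request
    lines = [
        "Here are your current instructions:",
        instructions.strip(),
        "",
        "Here is what you suggested and what the team actually did:",
    ]
    for i, item in enumerate(items, start=1):
        entry = f"{i}. suggested: {item.suggested} | team did: {item.actual}"
        if item.note:
            entry += f" | note: {item.note}"
        lines.append(entry)
    lines += [
        "",
        "What is the gap between the two? Edit the instructions in the most",
        "relevant place so that they would have produced the team's outcome.",
        "Prefer a general principle over a rule that only fits one case.",
    ]
    return "\n".join(lines)
```

The closing lines of the prompt carry the two constraints from the talk: adjust the existing text in place rather than appending to a list, and generalize instead of adding a one-off rule.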
It's like a sixty-second PR review because it's just a few English line changes, and it makes a lot of, uh, we get-- we have context about what has, what it has done, and we can quickly see if that's a good change to make. And you can see the pull request here, um, and it basically adjusted the instructions in the relevant place. So it didn't just add a random rule at the end of a list. It looked at its own current instructions, and it adjusted them in the most appropriate way to not correct users when we shouldn't correct users. And so then we can just take a look at this through this normal pull request review process and, and merge it in, um, if everything looks good. I also actually personally really like the feature where you can just make quick edits to, to the instructions so you have a little bit of, um, control
24:47 – 26:51
Results and scale: volume handled, time saved, analytics, and orchestration
1. Speaker
  over how exactly things are phrased. Um, we also do this because we don't want the agent to just change its instructions willy-nilly. We want to have some control over it not drifting into some weird direction that it keeps doubling down on. Um, we want to know what it's doing because it's important to us. So what this looks like, um, in practice, in numbers, is we have about a few thousand mentions a month. Um, fifty percent of those get skipped. So as I mentioned, we save a ton of time by skipping these things, and that's such a perfect task for Buzz or an agent to just take over, um, because we don't end up spending any time on them. Um, it consists of, it consists of fifteen skills. Some of this is triaging things, some of this is writing posts, some of this is about reporting, some of this is about analytics. For example, on a daily basis, I get a DM from Buzz that shows me a bunch of graphs that it generated about our distribution of the different actions that we take, about who is replying, how much, um, so that we have some health metrics on this part of what the team is responsible for. And all of this just really allows us to get more done as a small team. Um, we have a few thousand Cloud Agent runs per month, so all of this runs on its own in the background in the cloud. We use Oz, which is Warp's orchestration platform for running cloud agents, um, and it just runs on a schedule, runs on various triggers. I think yesterday you all have, um, seen routines, um, on, on Claude Code, and it's a very similar concept where API calls, webhooks, um, cron jobs, things like that trigger an agent to just run in the cloud, and it just takes things off your plate. So it gets triggered on its own, and you don't ever have to just talk to it for it to get things done. So
26:51 – 28:27
Core takeaway: design the feedback loop, not the perfect initial prompt
1. Speaker
  all of these things, um, are there to help you build agents that improve on their own. Um, what you basically need for this is to create and design the feedback loop that works for you for your use case versus trying to nail the prompt from the get-go. Um, and each of these pieces needs the others. The principles are there so that the agent knows what it's doing. Teaching it to learn is important so it can get better over time, and then the feedback loop is important so it gets the input to be able to get better over time and, and, and, uh, improve its instructions. So if you remember one thing from this whole talk, if you recall just one thing that you can take into, um, into practice, um, when you, when you go back to your jobs and, and you build your own things, is to focus on designing that feedback loop. Think about how your agent will improve over time and not just how to nail that initial prompt. That initial prompt can be just good. It doesn't need to be perfect. But you should try to figure out how you can create a feedback loop that allows the agent to get better over time as your understanding of the problem evolves, as new situations pop up. Allow the agent to improve on its own versus you having to go back and improve it manually. So that's all. Thank you all. Um, go build great agents and thank you so much. [audience applauding] [upbeat music]

Episode duration: 28:28

Transcript of episode uGroRwlC9y4
