Teaching agents to learn from your team

An agent that improves itself daily by treating its instructions as code: edited, reviewed, and merged like any PR. The talk covers writing skills that teach agents how to think (not what to do) and closing the feedback loop so team judgment flows back automatically.

May 22, 2026 · 28m

CHAPTERS

0:25 – 2:59
Why most agents stall at “80% there” (and why this talk exists)
Petra frames the central problem: many people can build an agent prototype, but far fewer can keep one happily running in production. The talk focuses on closing the last-mile gap—getting agents from “kind of works” to reliable daily value.
- Audience poll reveals a steep drop from “built an agent” to “thriving in production”
- Agents often die at the 80% mark due to constant tweaking and prompting
- Goal: make agents good enough to run continuously with confidence
- The talk uses Warp’s agent “Buzz” as a concrete case study
2:59 – 5:00
What Buzz does: triage social mentions and draft authentic replies
Buzz is introduced as Warp’s social-mentions assistant that helps a small team handle high-volume community engagement. It classifies mentions into actions (reply/like/skip) and drafts responses to reduce effort while preserving a human final touch.
- Monitors social mentions and recommends actions: reply, like, or skip
- Drafts suggested replies so humans don’t start from scratch
- Optimizes team time toward high-ROI engagement moments
- Built quickly with ~15 skills and little or no traditional code, integrated with Slack and APIs
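One way to picture what Buzz hands to the team is a small record per mention. The sketch below is an assumption for illustration only: the talk describes the three actions and the drafted reply, but the names `Action`, `TriageResult`, and `needs_human_edit` are hypothetical and not taken from Warp's setup, which according to the talk relies on skills rather than traditional code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    """The three recommendations named in the talk."""
    REPLY = "reply"
    LIKE = "like"
    SKIP = "skip"


@dataclass
class TriageResult:
    """Hypothetical shape of one triaged mention."""
    mention_text: str
    action: Action
    reasoning: str
    draft_reply: Optional[str] = None  # only meaningful when action is REPLY

    def needs_human_edit(self) -> bool:
        # A draft is a starting point; a human still gives it the final touch.
        return self.action is Action.REPLY and self.draft_reply is not None
```

Keeping the reasoning next to the action mirrors the talk's point that the team should be able to judge a recommendation quickly instead of re-deriving it.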
5:00 – 6:31
The hard part: encoding “judgment and taste” (not unit-testable behavior)
Petra explains why social replies are fundamentally different from coding tasks: they require nuance, brand voice, and context. Standard agent evaluation loops work when there’s an external, objective check—but “good taste” lacks fast, reliable automated feedback.
- Social engagement requires tone, empathy, and knowing when not to engage
- AI-generated replies are often obvious and harm authenticity
- Agentic loops thrive with objective checks (tests, browser validation, API calls)
- Brand/community feedback loops are long, complex, and risky to run live
6:31 – 9:03
Why common agent loops work for code, but fail for fuzzy human tasks
The talk contrasts coding-focused agent loops with the ambiguity of communication tasks. Without crisp pass/fail signals, agents struggle to iterate effectively, making it unclear how they should improve autonomously.
- External checks enable iterative refinement in coding workflows
- For social replies, “success” depends on delayed, qualitative community reaction
- You can’t safely “test in production” at scale for brand interactions
- Core question: how to transfer human nuance into agent behavior
9:03 – 10:03
Attempt #1: “Nail the prompt” turns into brittle checklists
Petra describes starting with prompt engineering: iterating instructions to capture desired behavior. The outcome was a rule-heavy checklist that sounded robotic and broke in new situations—too rigid for real-world social contexts.
- Prompt iteration often yields explicit if/then rules
- Rule checklists produce robotic tone and limited adaptability
- New or unexpected scenarios break brittle rule systems
- A better approach should generalize across contexts
10:03 – 12:04
Shift from rules to principles: teach the agent how to think
To mirror onboarding a human teammate, the team replaced detailed rules with flexible principles. This made instructions shorter, improved generalization, and produced better tone and reasoning in novel cases.
- Onboarding analogy: humans learn decision-making principles, not exhaustive rules
- Principles emphasize empathy, non-defensiveness, and product-builder voice
- Skill file became much shorter while outputs improved
- Principle-based guidance adapts better to new mention types
12:04 – 14:06
Attempt #2: manual evaluation and feedback—then the agent regresses to rules
They created a dataset-like process: feed mentions, review Buzz’s suggestions, and provide detailed human feedback. But when asked to incorporate feedback, the agent tended to add narrow rules again, improving single cases without learning transferable lessons.
- Collected examples, evaluated outputs, and wrote structured feedback
- Agent tried to ‘patch’ feedback with overly specific rules
- Specific fixes don’t generalize (e.g., ‘never mention pricing in first line’)
- Desired learning is abstract/general (e.g., don’t pitch when user is venting)
14:06 – 15:08
Teaching the agent to learn: a meta-skill that updates principles correctly
The breakthrough was making the agent compare its output, the human ideal, and its existing instructions—then update the instructions at the right level of abstraction. This created a durable mechanism for improving behavior without accumulating brittle rules.
- Agent is prompted to find the gap between current instructions and ideal outcomes
- Learns by revising principles/guidelines rather than appending random rules
- Result: two key components, the principles and a ‘learning’ skill that evolves them
- Improvements become reusable across future scenarios
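The comparison step can be sketched as a prompt assembled from three inputs. This is a minimal illustration under assumptions: the function name `build_learning_prompt`, the section labels, and the wording of the task are invented here. The talk only states that the agent compares its output, the human ideal, and its existing instructions, then updates the instructions at the right level of abstraction.

```python
def build_learning_prompt(current_instructions: str,
                          agent_output: str,
                          human_ideal: str) -> str:
    """Assemble a prompt that asks for a revised principle, not a new rule.

    All labels and task wording are hypothetical, not the talk's actual skill text.
    """
    sections = [
        "You are revising your own skill instructions.",
        "CURRENT INSTRUCTIONS:\n" + current_instructions.strip(),
        "YOUR OUTPUT:\n" + agent_output.strip(),
        "HUMAN IDEAL:\n" + human_ideal.strip(),
        "TASK:\n"
        "1. Describe the gap between your output and the human ideal.\n"
        "2. Find the principle that should have prevented the gap, "
        "or name the principle that is missing.\n"
        "3. Rewrite that principle in place so the lesson carries over "
        "to different future cases.\n"
        "Do not add a rule that only covers this one example.",
    ]
    return "\n\n".join(sections)
```

Passing the existing instructions alongside the example is the important design choice in this sketch: without them, the model can only react to the single case, which is how narrow rules such as ‘never mention pricing in first line’ creep back in.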
15:08 – 15:39
Operational challenge: who keeps training it (without adding team toil)?
Even with a good learning mechanism, ongoing human feedback can become a burden. Petra explains the requirement: the agent must learn from work the team is already doing, not from extra meetings or training rituals.
- Manual coaching doesn’t scale; teams won’t sustain additional process overhead
- Need minimal human input for maximal improvement impact
- Design requirement: feedback must be embedded in existing workflows
- Goal: continuous improvement with near-zero extra team effort
15:39 – 18:11
The low-friction Slack feedback loop: emoji reactions + thread notes
Buzz posts each mention and recommendation into a Slack channel with reasoning and drafts. Team members react with emojis indicating the action taken (reply/skip/like), and optionally add notes in threads—creating structured signals Buzz can learn from.
- Buzz pushes triage results to Slack with explanation and suggested drafts
- Team uses emoji reactions to mark what actually happened
- Optional thread notes provide richer qualitative guidance
- Feedback doubles as team coordination (avoid duplicate handling)
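The reaction-based signal could be turned into a structured record along these lines. This is a sketch under assumptions: the talk does not name the emojis the team uses, so the `EMOJI_TO_ACTION` mapping is made up, and `Feedback` and `collect_feedback` are hypothetical helpers. Fetching reactions and thread replies from Slack is left out entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical mapping; the actual emojis used by the team are not given in the talk.
EMOJI_TO_ACTION = {
    "speech_balloon": "reply",
    "heart": "like",
    "no_entry_sign": "skip",
}


@dataclass
class Feedback:
    suggested_action: str            # what the agent recommended
    team_action: Optional[str]       # what the team did, or None if nobody reacted
    notes: List[str] = field(default_factory=list)  # optional thread notes

    @property
    def is_mismatch(self) -> bool:
        # Only a recorded team action that differs counts as disagreement.
        return self.team_action is not None and self.team_action != self.suggested_action


def collect_feedback(suggested_action: str,
                     reactions: List[str],
                     thread_notes: List[str]) -> Feedback:
    # The first recognised reaction wins; unrelated emojis are ignored.
    team_action = next(
        (EMOJI_TO_ACTION[r] for r in reactions if r in EMOJI_TO_ACTION), None
    )
    return Feedback(suggested_action, team_action, list(thread_notes))
```

Treating a missing reaction as "no signal" rather than as disagreement keeps unhandled mentions from being mistaken for corrections.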
18:11 – 19:43
From Slack signals to GitHub PRs: daily automated instruction updates
Buzz runs daily, compares its suggestions against team actions, extracts takeaways, and opens pull requests that update the skill instructions. The team reviews and merges small English-text diffs, preventing drift while keeping improvement continuous.
- Daily job ingests emoji mismatches and thread notes and derives improvements
- Buzz opens PRs to the skills repo and posts links back to Slack
- PR review is fast and keeps humans in control of instruction changes
- Edits are integrated into the most relevant part of instructions, not tacked-on rules
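The selection step of such a daily job might look like the sketch below. It rests on assumptions: the record keys (`suggested_action`, `team_action`, `notes`) and the functions `select_learning_cases` and `draft_pr_description` are hypothetical, and the part that actually rewrites the skill file and opens a pull request is deliberately omitted because the talk does not describe it in detail.

```python
from typing import Dict, List


def select_learning_cases(records: List[Dict]) -> List[Dict]:
    """Keep records where the team disagreed with the agent or left a note."""
    def is_learning_case(r: Dict) -> bool:
        disagreed = (
            r["team_action"] is not None
            and r["team_action"] != r["suggested_action"]
        )
        return disagreed or bool(r.get("notes"))

    return [r for r in records if is_learning_case(r)]


def draft_pr_description(cases: List[Dict]) -> str:
    """Summarise the day's cases as plain text for human reviewers."""
    lines = [f"Instruction update from {len(cases)} feedback case(s)", ""]
    for c in cases:
        team = c["team_action"] or "no action recorded"
        line = f"- suggested {c['suggested_action']}, team chose {team}"
        if c.get("notes"):
            line += ": " + "; ".join(c["notes"])
        lines.append(line)
    return "\n".join(lines)
```

Listing the underlying cases in the description gives reviewers the evidence behind each proposed wording change, which supports the fast, human-controlled review the talk describes.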
19:43 – 24:47
What it looks like in practice: skills repo, Slack triage examples, and PR diffs
Petra walks through screenshots: the skill file with principles, the Slack channel showing reply/skip/like outputs, and an example of feedback (“don’t correct the user”) turning into an instruction change via PR. This demonstrates the end-to-end loop working as intended.
- Skill file shows principle-based guidance for drafting replies
- Slack output is formatted for skimmability and quick decision-making
- Example feedback: avoid unnecessary correction of users in replies
- PR updates instruction phrasing in context to reflect the takeaway
24:47 – 26:51
Results and scale: volume handled, time saved, analytics, and orchestration
Buzz processes thousands of mentions monthly, skipping about half and saving substantial team attention. It also generates reporting/analytics and runs autonomously in the cloud via Warp’s orchestration, triggered by schedules and events.
- A few thousand mentions/month; ~50% skipped (major time savings)
- ~15 skills spanning triage, writing, reporting, and analytics
- A daily DM with metrics and graphs tracks action distributions and team engagement health
- Runs in the cloud with orchestration (cron/webhooks/API triggers), no manual babysitting
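The action-distribution metric mentioned above is simple to compute once each mention has a recorded action. The helper below is an illustrative sketch: `action_distribution` is a hypothetical name, and nothing here reflects how Buzz's reporting skills are actually written.

```python
from collections import Counter
from typing import Dict, Iterable


def action_distribution(actions: Iterable[str]) -> Dict[str, float]:
    """Return the share of each action (reply, like, skip) as a fraction of the total."""
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        return {}  # no mentions handled, nothing to report
    return {action: n / total for action, n in counts.items()}
```

Fed with one day's recorded actions, the result is a mapping such as action name to fraction, which a report can chart over time. A skip share near one half would match the roughly 50% figure quoted in the talk.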
26:51 – 28:28
Core takeaway: design the feedback loop, not the perfect initial prompt
Petra closes with the key lesson: don’t obsess over a flawless prompt upfront. Build systems that continuously capture real team behavior and convert it into improving instructions, so the agent adapts as situations and understanding evolve.
- Initial prompt only needs to be ‘good enough’ to start
- Principles guide behavior; learning mechanism updates those principles
- Feedback loop supplies ongoing training data from real workflows
- Focus on continuous improvement so agents mature into reliable production tools