The prompting playbook

Name: The prompting playbook
Uploaded: 2026-05-22T00:00:00Z
Duration: 33 min 48 s
Description: The talk frames prompting work around two common engineering scenarios: debugging an existing production prompt (often after a model migration) and designing a new agentic use case from scratch.

How to apply core prompting principles to agentic systems that plan, act, and adapt.

CHAPTERS

0:20 – 2:52
Why prompting still matters: two real-world scenarios
Margot frames prompting as a core engineering skill and introduces two common workplace situations: fixing a degrading production prompt (often during model migration) and creating a brand-new agentic system from scratch. The talk will use practical, customer-inspired examples rather than abstract dos and don’ts.
- •Prompting as a critical engineering skill for LLM systems
- •Two focus scenarios: debugging an existing prompt vs. building a new agent
- •Approach: learn through a concrete, realistic prompt example
2:52 – 3:53
Start with evals: separating prompt issues from model capability limits
Before changing anything, Margot argues you need evaluations to measure regressions and improvements. Evals help distinguish between behavior changes that prompting can fix versus true capability gaps where no prompt will help.
- •Evals provide rigor when iterating on prompts
- •Model migrations can fail due to different behavior or lower capability
- •Use eval suites to detect regressions and validate improvements
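
As a rough illustration of detecting regressions between two prompt versions, the sketch below compares per-case pass/fail results. The function name, the case names, and the result format (a dict of case name to boolean) are assumptions made for this example, not something shown in the talk.

```python
def diff_results(before, after):
    """Compare two eval runs, each a dict of case name -> passed (bool).

    Returns (regressions, improvements) as sorted lists of case names.
    A case missing from one run is treated as failing in that run.
    """
    regressions = sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )
    improvements = sorted(
        name for name, passed in after.items()
        if passed and not before.get(name, False)
    )
    return regressions, improvements


# Hypothetical results for an old and a new prompt version.
v0 = {"plan_info": True, "proration": False, "hotspot": True}
v1 = {"plan_info": True, "proration": True, "hotspot": False}

regressions, improvements = diff_results(v0, v1)
print("regressions:", regressions)
print("improvements:", improvements)
```

Keeping the comparison per case, rather than as a single aggregate score, makes it visible when one fix quietly breaks a case that used to pass.
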
3:53 – 4:53
Designing an eval suite: control cases, edge cases, and boundaries
The talk outlines what a representative eval suite should contain: easy controls, edge cases that have failed before, and boundary tests that check whether the model knows when to refuse or escalate to a human.
- •Control cases that should always pass
- •Edge cases that previously caused failures
- •Boundary tests: escalation to human and refusal when appropriate
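
One way to organize such a suite is sketched below, assuming each case carries a kind label and a grader function. The `EvalCase` type, the example messages, the keyword graders, and the stub model are all hypothetical; a real suite would call the production prompt and use more careful graders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    kind: str  # "control", "edge", or "boundary"
    user_message: str
    passes: Callable[[str], bool]  # grader applied to the model's reply


def run_suite(cases, respond):
    """Send every case to `respond` and record pass/fail per case name."""
    return {case.name: case.passes(respond(case.user_message)) for case in cases}


def pass_rate_by_kind(cases, results):
    """Aggregate results into a pass rate for each kind of case."""
    totals = {}
    for case in cases:
        passed, total = totals.get(case.kind, (0, 0))
        totals[case.kind] = (passed + int(results[case.name]), total + 1)
    return {kind: passed / total for kind, (passed, total) in totals.items()}


# Hypothetical cases with deliberately simple keyword graders.
CASES = [
    EvalCase("plan_info", "control", "What does my plan include?",
             lambda reply: "plan" in reply.lower()),
    EvalCase("proration", "edge", "What will I owe if I switch plans today?",
             lambda reply: "$" in reply),
    EvalCase("billing_error", "boundary", "I was charged twice this month.",
             lambda reply: "escalat" in reply.lower()),
]


def stub_respond(message):
    """Stand-in for the real model call."""
    return "Your plan includes unlimited talk and text."


results = run_suite(CASES, stub_respond)
rates = pass_rate_by_kind(CASES, results)
```

Reporting pass rates per kind separates "the basics broke" (control failures) from "a known hard case is still failing" (edge or boundary failures).
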
4:53 – 7:24
Telco support bot setup: five test cases and the iteration plan
Margot introduces the Meridian Mobile customer support bot and the five test cases used throughout the demo. The plan is to run an initial eval, identify failure modes, and fix them systematically—starting with general prompt hygiene.
- •Meridian Mobile support bot prompt as the running example
- •Five eval cases: plan info, proration math, policy questions, billing error escalation, hotspot data disclosure
- •Iterative workflow: run V0 → inspect failures → target one failure mode at a time
7:24 – 8:25
First eval run: good control case, poor performance elsewhere
After running the initial evals, the unambiguous control case passes, but several other scenarios fail. This motivates cleaning up the prompt before making targeted, case-specific fixes.
- •Control case passes as expected
- •Multiple failures emerge in more complex scenarios
- •Best practice: apply general hygiene before detailed debugging
8:25 – 11:28
Prompt hygiene and structure: remove junk, clarify role, separate sections
Margot demonstrates typical prompt rot: claiming the bot is human, copied webpage artifacts, and unstructured instruction blobs. She restructures the prompt using XML-like sections to separate role, guidelines, policy, and tone—improving performance immediately.
- •Remove incorrect role claims (e.g., bot pretending to be human)
- •Delete irrelevant copied content (hero image, cookies, etc.)
- •Add explicit structure: role vs. policy vs. tone vs. data
- •Rule of thumb: if humans can’t parse sections, the model likely can’t either
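
The restructuring can be sketched as a small helper that renders named sections as XML-like blocks. The section names follow the ones mentioned above (role, guidelines, policy, tone); the helper name and the section text are placeholders invented for this example.

```python
def build_system_prompt(sections):
    """Render named sections as XML-like blocks, in insertion order.

    Raises ValueError on an empty section so that a missing policy or
    role is caught before the prompt ships.
    """
    parts = []
    for tag, body in sections.items():
        body = body.strip()
        if not body:
            raise ValueError(f"section <{tag}> is empty")
        parts.append(f"<{tag}>\n{body}\n</{tag}>")
    return "\n\n".join(parts)


# Placeholder section text, not the prompt used in the talk.
prompt = build_system_prompt({
    "role": "You are an AI support assistant for Meridian Mobile.",
    "guidelines": "Answer from the policy and account data provided.",
    "policy": "(policy text goes here)",
    "tone": "Friendly and concise.",
})
print(prompt)
```

Building the prompt from separate sections also makes later edits easier to review, since a change to tone cannot silently touch policy.
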
11:28 – 13:30
Output contracts and harness controls: stop sequences and structured outputs
The talk adds an explicit output format section and explains why consistency is sometimes better enforced in the harness than the prompt. Stop sequences and structured outputs are highlighted as scalable ways to ensure well-formed responses, especially for JSON.
- •Define an output contract (e.g., XML wrapper or JSON schema)
- •Use stop sequences in the API harness to prevent runaway generation
- •Structured outputs are valuable for complex schemas (e.g., nested JSON)
- •Not all improvements show up in task accuracy, but they boost reliability
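
A minimal sketch of enforcing an output contract in the harness follows. It assumes a hypothetical contract in which the reply is a JSON object with a `reply` field wrapped in `<answer>` tags. `stop_sequences` is the parameter name used by Anthropic's Messages API; other APIs name it differently, and API-level structured outputs are not shown here.

```python
import json

ANSWER_OPEN, ANSWER_CLOSE = "<answer>", "</answer>"

# Options a harness might send with each request (assumed values).
# Stopping at the closing tag prevents runaway text after the answer.
REQUEST_OPTIONS = {"stop_sequences": [ANSWER_CLOSE], "max_tokens": 1024}


def extract_answer(raw):
    """Return the JSON object inside <answer>...</answer>.

    The closing tag is optional because generation may halt at the
    stop sequence before the tag is emitted.
    """
    start = raw.find(ANSWER_OPEN)
    if start == -1:
        raise ValueError("missing <answer> wrapper")
    body = raw[start + len(ANSWER_OPEN):]
    end = body.find(ANSWER_CLOSE)
    if end != -1:
        body = body[:end]
    payload = json.loads(body)  # JSONDecodeError is a ValueError subclass
    if not isinstance(payload, dict) or "reply" not in payload:
        raise ValueError("answer must be a JSON object with a 'reply' field")
    return payload


parsed = extract_answer('Let me check. <answer>{"reply": "Hi", "escalate": false}')
```

Validating in the harness means a malformed reply fails loudly at one point instead of propagating into downstream code.
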
13:30 – 16:36
Failure mode #1 (Hotspot): preventing the model from withholding known info
The hotspot test fails because a legacy-plan warning in the prompt makes the model deflect to a URL instead of stating the customer’s hotspot allowance. The fix reframes the instructions so that the account data provided for the customer is treated as the source of truth, which resolves the issue.
- •Legacy/grandfathered plan causes confusion vs. current policy doc
- •Bad instruction patch: “Never give wrong plan details—point to URL” overrides helpfulness
- •Fix: acknowledge grandfathered plans but trust provided customer account context
- •Lesson: models can withhold facts they have, not just hallucinate
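
A grader for this failure mode could distinguish "withheld" from other failures, as in the sketch below. The allowance value, the example URL, and the three labels are assumptions for illustration; the talk does not show its grader.

```python
import re


def states_allowance(reply, allowance_gb):
    """True if the reply states the allowance as a number of gigabytes."""
    pattern = rf"\b{allowance_gb}\s*GB\b"
    return re.search(pattern, reply, flags=re.IGNORECASE) is not None


def deflects_to_url(reply):
    """True if the reply points the customer at a web address."""
    return re.search(r"https?://|www\.", reply, flags=re.IGNORECASE) is not None


def grade_hotspot(reply, allowance_gb):
    """Label a reply 'pass', 'withheld' (deflected instead of answering), or 'fail'."""
    if states_allowance(reply, allowance_gb):
        return "pass"
    return "withheld" if deflects_to_url(reply) else "fail"


# Hypothetical replies for a customer whose account shows 15 GB.
print(grade_hotspot("Your plan includes 15 GB of hotspot data.", 15))
print(grade_hotspot("Please check https://example.com/plans for details.", 15))
```

Labeling withheld answers separately makes it easier to trace the failure back to an over-cautious instruction rather than to missing knowledge.
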
16:36 – 17:07
Prompt patches and version control: avoid overfitting to old model behavior
Margot generalizes the hotspot issue into a maintainability lesson: defensive prompt patches can become harmful as models improve at instruction-following. Tracking why changes were made (version control and rationale) makes it easier to revisit and remove obsolete constraints.
- •Defensive prompt changes can create future regressions
- •Model generations differ; old patches can become counterproductive
- •Track prompt edits and the reason they were introduced
- •Be willing to remove redundant restrictions when they start suppressing correct answers
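
One lightweight way to track why a patch exists is to store the rationale next to the instruction, as sketched here. The record fields, model labels, and date are hypothetical; a commit message or changelog entry would serve the same purpose.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PromptPatch:
    instruction: str      # the text added to the prompt
    rationale: str        # the failure it was meant to prevent
    added_on: date
    added_for_model: str  # model the patch was written against


def patches_to_review(patches, current_model):
    """Patches written for a different model than the one now deployed."""
    return [p for p in patches if p.added_for_model != current_model]


# Hypothetical patch log; model labels and the date are placeholders.
PATCHES = [
    PromptPatch(
        instruction="Never give wrong plan details; point to the URL instead.",
        rationale="Older model misquoted legacy plan allowances.",
        added_on=date(2025, 1, 15),
        added_for_model="model-v1",
    ),
    PromptPatch(
        instruction="Use the proration tool for all billing arithmetic.",
        rationale="Mental math produced unreliable amounts.",
        added_on=date(2025, 6, 1),
        added_for_model="model-v2",
    ),
]

stale = patches_to_review(PATCHES, current_model="model-v2")
```

After a migration, the flagged patches are the first candidates to re-test with the instruction removed.
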
17:07 – 20:18
Failure mode #2 (Proration): instructions don’t add capability—tools do
The proration test fails because the model ‘reasons’ with mental math and produces vague or unreliable billing results. The solution is to provide a calculation tool (with schema and implementation) and instruct the model to use it for arithmetic, leading to correct outputs.
- •Observed behavior: token-heavy reasoning but no trustworthy final number
- •Anti-pattern: “critical: calculate correctly” without enabling correctness
- •Introduce a proration calculation tool and define its schema
- •Implement the tool logic in the system/harness
- •Lesson: prompting cannot replace missing capabilities; tools increase reliability
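
A proration tool along these lines might look like the sketch below. The schema uses the name/description/input_schema layout of Anthropic's tool-use format. The tool name, its fields, and the formula (monthly price times days remaining, divided by days in the cycle) are assumptions, since the talk summary does not spell out Meridian's billing rules.

```python
from decimal import Decimal, ROUND_HALF_UP

# Tool definition handed to the model (hypothetical name and fields).
PRORATION_TOOL = {
    "name": "calculate_proration",
    "description": "Compute the prorated charge for the rest of a billing cycle.",
    "input_schema": {
        "type": "object",
        "properties": {
            "monthly_price": {"type": "string", "description": "Plan price, e.g. '45.00'"},
            "days_in_cycle": {"type": "integer", "minimum": 1},
            "days_remaining": {"type": "integer", "minimum": 0},
        },
        "required": ["monthly_price", "days_in_cycle", "days_remaining"],
    },
}


def calculate_proration(monthly_price, days_in_cycle, days_remaining):
    """Return the prorated charge as a string with two decimal places."""
    if days_in_cycle < 1 or not 0 <= days_remaining <= days_in_cycle:
        raise ValueError("days_remaining must be between 0 and days_in_cycle")
    # Decimal avoids binary floating-point rounding errors in money math.
    charge = Decimal(monthly_price) * days_remaining / days_in_cycle
    return str(charge.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def handle_tool_call(name, arguments):
    """Harness-side dispatch: run the tool the model asked for."""
    if name != PRORATION_TOOL["name"]:
        raise KeyError(f"unknown tool: {name}")
    return calculate_proration(**arguments)


print(handle_tool_call(
    "calculate_proration",
    {"monthly_price": "45.00", "days_in_cycle": 30, "days_remaining": 10},
))  # prints 15.00
```

The model decides when to call the tool and with which inputs, while the arithmetic itself runs in ordinary code that can be unit-tested.
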
20:18 – 23:29
Failure mode #3 (Billing error): resolve instruction conflicts with balanced trade-offs
The billing error test fails because the prompt discourages escalation by emphasizing cost and contract metrics, so the model tries to diagnose instead of handing off. Updating the prompt to reflect both the cost of escalation and the cost of mishandling (refunds, trust) aligns behavior with the eval goal.
- •Model avoids escalation and attempts self-diagnosis
- •Prompt bias: escalation framed only as costly and undesirable
- •Fix: include both sides—escalation cost vs. refund/customer trust cost
- •Lesson: smarter models optimize objectives; specify balanced trade-offs explicitly
23:29 – 25:01
New agent from scratch: retail staff scheduling as a constraints problem
The talk shifts to building a new agentic use case: generating a week-long retail schedule under availability and staffing constraints. Because the rules are hard constraints that a schedule either satisfies or violates, a programmatic evaluator can score each schedule by counting violations.
- •New use case: schedule generation with constraints
- •Consider three levers: prompt, model choice, and harness design
- •Use a Python function to compute constraint violations (hard-rule grading)
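
A violation counter of this kind could be sketched as follows. The schedule format (day mapped to a list of names) and the three rules checked (minimum staff per day, availability, maximum shifts per person) are assumptions for illustration, and the example covers three days rather than a full week to stay short.

```python
def find_violations(schedule, availability, min_staff, max_shifts):
    """List every hard-rule violation in a schedule.

    schedule: dict of day -> list of employee names
    availability: dict of employee name -> set of days they can work
    """
    violations = []
    shifts = {}
    for day, staff in schedule.items():
        if len(staff) < min_staff:
            violations.append(f"{day}: understaffed ({len(staff)} < {min_staff})")
        for name in staff:
            shifts[name] = shifts.get(name, 0) + 1
            if day not in availability.get(name, set()):
                violations.append(f"{day}: {name} is unavailable")
    for name, count in sorted(shifts.items()):
        if count > max_shifts:
            violations.append(f"{name}: {count} shifts exceeds {max_shifts}")
    return violations


# Hypothetical staff and schedule.
availability = {"Ana": {"Mon", "Tue"}, "Ben": {"Mon"}, "Cy": {"Tue", "Wed"}}
schedule = {"Mon": ["Ana", "Ben"], "Tue": ["Ana"], "Wed": ["Cy", "Ben"]}

violations = find_violations(schedule, availability, min_staff=2, max_shifts=2)
print(len(violations), violations)
```

Returning the violation messages, not just a count, gives a failing trial something concrete to inspect or feed back into a repair step.
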
25:01 – 28:36
Model and harness trade-offs: baseline prompt vs bigger model vs adaptive thinking
A simple prompt with Sonnet 4.6 fails consistently; switching to Opus 4.7 reduces violations but still fails. Enabling adaptive thinking on Opus produces compliant schedules reliably, but at significantly higher token usage and latency.
- •Sonnet 4.6 + simple prompt: repeated failures and no self-checking
- •Opus 4.7 improves reasoning, reduces violations, but still fails overall
- •Opus + adaptive thinking: reliable compliance
- •Downside: ~3x tokens and latency; cost/latency becomes a constraint
28:36 – 29:41
Prompting harder on a smaller model: better instructions, but output-limit bottlenecks
Improving the Sonnet prompt (more guidance, explicit self-checking) helps but introduces new failure modes: the model times out or hits output limits. Increasing token limits would raise cost and latency, making it unattractive.
- •Enhanced prompt adds reasoning approach and ‘check your work’ step
- •Partial success (some trials pass), but completion failures appear
- •Output/max-token constraints become a practical bottleneck
- •More tokens = more latency; not an ideal production path
29:41 – 32:17
Agentic workflow wins: generate–evaluate–repair loop with runtime soft constraints
A three-step agentic loop (generate, LLM-evaluate, repair) achieves full pass rates with fewer tokens and lower latency than the monolithic prompt approach. It also supports injecting soft constraints at runtime without changing the backend evaluator logic.
- •Split tasks into three prompts: generate schedule, evaluate violations, repair schedule
- •Achieves 0-violation schedules across trials with better efficiency
- •More maintainable than one huge prompt doing everything
- •Enables runtime soft preferences (e.g., avoid pairing certain employees)
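
The control flow of such a loop can be sketched with the three steps passed in as functions. In the talk each step is a separate prompt; here plain Python stubs stand in for the model calls, and the function names, the retry limit, and the stub data are assumptions.

```python
def generate_evaluate_repair(generate, evaluate, repair, max_rounds=3):
    """Run generate once, then alternate evaluate and repair.

    Returns (schedule, repairs_used). Raises RuntimeError if problems
    remain after max_rounds repairs.
    """
    schedule = generate()
    for round_number in range(max_rounds):
        problems = evaluate(schedule)
        if not problems:
            return schedule, round_number
        schedule = repair(schedule, problems)
    remaining = evaluate(schedule)
    if remaining:
        raise RuntimeError(f"still failing after {max_rounds} repairs: {remaining}")
    return schedule, max_rounds


# Stubs standing in for the three prompts (hypothetical behavior).
def stub_generate():
    return {"Mon": ["Ana"]}


def stub_evaluate(schedule):
    return [f"{day}: understaffed" for day, staff in schedule.items() if len(staff) < 2]


def stub_repair(schedule, problems):
    return {day: staff + ["Ben"] if len(staff) < 2 else staff
            for day, staff in schedule.items()}


final_schedule, repairs_used = generate_evaluate_repair(
    stub_generate, stub_evaluate, stub_repair
)
```

Because the evaluator is just a parameter, a runtime soft preference can be added by passing a different `evaluate` function or evaluation prompt, without touching the loop itself.
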
32:17 – 33:48
Wrap-up: the prompting playbook distilled
Margot summarizes the playbook: use evals for rigor, apply prompt hygiene first, then target failures one by one. Remove obsolete patches, add tools for real capability gaps, and consider multi-prompt agentic designs when tasks benefit from decomposition.
- •Evals first: measure regressions and improvements objectively
- •Hygiene: structure prompts and clarify roles/policy/tone/data
- •Remove outdated ‘patch’ instructions that cause overfitting
- •Tools > instructions for computations and hard tasks
- •Agentic decomposition can outperform single-prompt designs
