CHAPTERS
Why prompting still matters: two real-world scenarios
Margot Van Lare frames prompting as a core engineering skill for LLM-based systems, especially as models and architectures change. She sets up two scenarios: debugging a prompt that regressed after a migration, and designing a brand-new agentic workflow from scratch.
Start with evaluations: separating behavior changes from capability limits
Before changing prompts, she emphasizes building an eval suite to measure whether edits actually improve outcomes. Evals also help distinguish ‘model behaves differently’ (fixable by prompt changes) from ‘model not capable enough’ (needs a stronger model or additional tools).
Designing an eval suite: control cases, edge cases, and boundaries
She outlines what a representative eval suite should cover, even if small at first. The suite should include a control case, edge cases known to fail, and tests that ensure the model knows when to refuse or escalate to a human.
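A minimal sketch of such a suite in Python. The `EvalCase` record, the `run_suite` helper, the case prompts, and the string checks are all illustrative assumptions rather than the talk's actual tests, and `fake_model` stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    kind: str                      # "control", "edge", or "boundary"
    prompt: str
    check: Callable[[str], bool]   # returns True when the reply is acceptable


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the model and record pass/fail by case name."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


# Hypothetical cases: one control, one known-failing edge case, and one
# boundary case where the right behavior is to hand off to a human.
CASES = [
    EvalCase("plan_question", "control", "What plans do you offer?",
             lambda reply: "plan" in reply.lower()),
    EvalCase("proration", "edge", "I switched plans mid-cycle. What is my bill?",
             lambda reply: "$" in reply),
    EvalCase("billing_error", "boundary", "I was charged twice this month.",
             lambda reply: "escalate" in reply.lower()),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption, not a real API)."""
    if "charged twice" in prompt:
        return "Let me look into that for you."       # fails: should escalate
    if "bill" in prompt:
        return "Your bill will be somewhat higher."   # fails: no amount given
    return "We offer three plans."


results = run_suite(fake_model, CASES)
for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Keeping each check a plain function makes it easy to start with crude string checks and swap in stricter ones later without touching the runner.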
Case study setup: Meridian Mobile support bot + five test cases
The demo uses a telecom customer support bot prompt and a small set of five tests covering plan questions, proration math, policy adherence, escalation on billing errors, and sharing relevant customer-specific information rather than withholding it. The intent is to iteratively fix failures one by one.
First eval run + prompt hygiene: structure, remove junk, clarify role
Initial results show the control case passes, but multiple failures elsewhere. She then applies ‘general hygiene’: remove incorrect claims (e.g., bot being human), delete copied website debris (hero image/cookies), and separate mixed instructions into clear sections using structure like XML tags.
Output contracts and stop sequences: making formatting reliable
She introduces the idea of an output contract: an explicit specification of the output format (e.g., XML tags) that reduces inconsistency. She also notes harness-level controls like stop sequences, and mentions structured outputs for more complex schemas such as nested JSON.
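One way the harness side of an output contract might look, sketched in Python. The `<answer>` tag name, both helper functions, and the sample reply are assumptions for illustration. `apply_stop_sequence` only mimics locally what a stop sequence does at the API level, where generation halts before the stop string is emitted.

```python
import re
from typing import Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the content of the first <tag>...</tag> pair, or None if absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None


def apply_stop_sequence(text: str, stop: str) -> str:
    """Mimic a stop sequence: cut the text where `stop` first appears."""
    index = text.find(stop)
    return text if index == -1 else text[:index]


# Hypothetical model reply that keeps talking after the contracted answer.
raw = (
    "<answer>Your plan includes 15 GB of hotspot data.</answer>\n"
    "Anything else I can help with?"
)

# With "</answer>" as the stop sequence, output ends before the closing tag,
# so the harness appends it again before parsing.
truncated = apply_stop_sequence(raw, "</answer>") + "</answer>"
answer = extract_tag(truncated, "answer")
print(answer)
```

Returning `None` when the tag is missing gives the eval harness a clear signal that the contract was broken, instead of silently passing malformed output downstream.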
Failure mode 1 (Hotspot): when models withhold info due to defensive patches
The hotspot test fails because the model deflects to a URL instead of answering, despite having the correct customer-specific hotspot allowance in provided context. The culprit is an old defensive instruction: ‘Never give wrong plan details; point to the URL,’ which the newer model over-optimizes for.
Prompt maintenance lesson: track patches and retire them when models change
After adjusting the hotspot instruction, the eval passes consistently. She generalizes the lesson: as instruction-following improves, old patches can become harmful, so teams should track why a prompt change was added (e.g., via version control) to safely revisit it later.
Failure mode 2 (Proration): instructions don’t add capability—tools do
The proration test fails because the model does vague mental math and doesn’t deliver a reliable numeric bill. The fix is not stronger wording (‘critical’) but giving the model a calculation tool, defining the tool schema, enabling it in the API, and implementing the math correctly.
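A rough sketch of what such a calculation tool could look like. The tool name, its fields, and the proration policy (each plan is charged for the share of the cycle it was active) are assumptions, not Meridian Mobile's actual rules. The schema dictionary is modeled on the `name` / `description` / `input_schema` shape of Anthropic's tool-use API, and the harness code that passes it to the model is omitted.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical tool definition, described with JSON Schema.
PRORATION_TOOL = {
    "name": "calculate_prorated_bill",
    "description": "Compute the bill for a cycle in which the customer switched plans.",
    "input_schema": {
        "type": "object",
        "properties": {
            "old_monthly_price": {"type": "string", "description": "e.g. '30.00'"},
            "new_monthly_price": {"type": "string", "description": "e.g. '60.00'"},
            "days_in_cycle": {"type": "integer"},
            "days_on_old_plan": {"type": "integer"},
        },
        "required": ["old_monthly_price", "new_monthly_price",
                     "days_in_cycle", "days_on_old_plan"],
    },
}


def calculate_prorated_bill(old_monthly_price: str, new_monthly_price: str,
                            days_in_cycle: int, days_on_old_plan: int) -> str:
    """Charge each plan for the share of the cycle it was active (assumed policy)."""
    if days_in_cycle <= 0 or not 0 <= days_on_old_plan <= days_in_cycle:
        raise ValueError("days_on_old_plan must be between 0 and days_in_cycle")
    days_on_new_plan = days_in_cycle - days_on_old_plan
    old_part = Decimal(old_monthly_price) * days_on_old_plan / days_in_cycle
    new_part = Decimal(new_monthly_price) * days_on_new_plan / days_in_cycle
    total = (old_part + new_part).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return str(total)


# The harness would run this when the model requests the tool, then return
# the result so the model can phrase the final answer.
bill = calculate_prorated_bill("30.00", "60.00", days_in_cycle=30, days_on_old_plan=10)
print(bill)
```

Prices are passed as strings and handled with `Decimal` so that currency amounts are not subject to binary floating-point rounding.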
Failure mode 3 (Billing error): handling goal conflicts and trade-offs explicitly
The billing escalation test fails because the prompt warns against escalation due to cost and resolution metrics, so the model tries to diagnose instead of handing off. She fixes this by stating both sides of the trade-off (escalation cost vs. refund/trust cost) and aligning instructions with desired escalation policy.
From debugging to green evals: systematic iteration over failure modes
With hygiene improvements, removal of outdated patches, added tooling, and clarified escalation incentives, all five Meridian Mobile eval tests pass. She underscores the workflow: use evals, isolate one failure at a time, and change the minimal lever that actually addresses the cause.
New agent from scratch: retail staff scheduling problem and evaluation approach
She pivots to building an agentic system for generating a week-long retail staff schedule under hard constraints. Because constraints are formal, evaluation can be programmatic (Python counting violations) rather than LLM-judged, enabling faster iteration and clearer metrics.
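A programmatic evaluator of this kind could be as small as the sketch below. The schedule shape and the three constraints (minimum staff per shift, no work on unavailable days, a weekly shift cap) are assumed for illustration; the talk's actual constraint set is not reproduced here.

```python
from collections import Counter

# Hypothetical shape: (day, shift) -> list of employee names.
Schedule = dict[tuple[str, str], list[str]]


def count_violations(schedule: Schedule, min_staff: int, max_shifts: int,
                     unavailable: dict[str, set[str]]) -> int:
    """Count hard-constraint violations; zero means the schedule is valid."""
    violations = 0
    shifts_per_person = Counter()
    for (day, _shift), staff in schedule.items():
        # Constraint 1: every shift needs at least min_staff people.
        if len(staff) < min_staff:
            violations += 1
        for person in staff:
            shifts_per_person[person] += 1
            # Constraint 2: nobody works on a day they marked unavailable.
            if day in unavailable.get(person, set()):
                violations += 1
    # Constraint 3: nobody exceeds max_shifts for the week.
    violations += sum(1 for n in shifts_per_person.values() if n > max_shifts)
    return violations


# Made-up example data.
example = {
    ("Mon", "open"): ["Ana", "Ben"],
    ("Mon", "close"): ["Ana"],          # understaffed
    ("Tue", "open"): ["Ben", "Cy"],     # Cy is unavailable on Tuesday
}
total = count_violations(example, min_staff=2, max_shifts=1,
                         unavailable={"Cy": {"Tue"}})
print(total)
```

Because the score is a plain integer, runs with different models or prompts can be compared directly, with no judge model in the loop.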
Comparing approaches: model size, adaptive thinking, prompt tweaks, and agentic loops
She tests a baseline prompt on Sonnet 4.6 (fails), upgrades to Opus 4.7 (fewer violations but still failing), then uses Opus with adaptive thinking (passes, but at a high token and latency cost). A ‘better prompt’ on Sonnet improves results but hits output limits, while a generate–evaluate–repair agentic loop succeeds with fewer tokens, lower latency, and more flexibility.
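The generate–evaluate–repair pattern can be sketched as a small control loop. This is an assumed shape, not the talk's implementation: `generate` and `repair` are hypothetical callables that would wrap model calls, stubbed here so the example runs offline, and the evaluator checks a single made-up constraint.

```python
from typing import Callable

Schedule = dict[str, list[str]]   # hypothetical shape: shift name -> staff


def find_violations(schedule: Schedule, min_staff: int) -> list[str]:
    """Programmatic evaluator: list the shifts that are understaffed."""
    return [shift for shift, staff in schedule.items() if len(staff) < min_staff]


def generate_evaluate_repair(
    generate: Callable[[], Schedule],
    repair: Callable[[Schedule, list[str]], Schedule],
    min_staff: int,
    max_rounds: int = 3,
) -> tuple[Schedule, int]:
    """Draft once, then repair only reported violations until clean or out of rounds."""
    schedule = generate()
    for repairs_done in range(max_rounds + 1):
        violations = find_violations(schedule, min_staff)
        if not violations:
            return schedule, repairs_done
        if repairs_done < max_rounds:
            schedule = repair(schedule, violations)
    raise RuntimeError(f"{len(violations)} violation(s) left after {max_rounds} repairs")


def fake_generate() -> Schedule:
    """Stand-in for the model's first draft (one shift is understaffed)."""
    return {"Mon-open": ["Ana", "Ben"], "Mon-close": ["Ana"]}


def fake_repair(schedule: Schedule, violations: list[str]) -> Schedule:
    """Stand-in for a model call that fixes only the flagged shifts."""
    fixed = {shift: list(staff) for shift, staff in schedule.items()}
    for shift in violations:
        fixed[shift].append("Cy")
    return fixed


final, repairs = generate_evaluate_repair(fake_generate, fake_repair, min_staff=2)
print(final, repairs)
```

Passing only the violation list to the repair step keeps each follow-up call small, which is one plausible reason a loop like this can use fewer tokens than a single long reasoning pass.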
