At a glance
WHAT IT’S REALLY ABOUT
Prompting best practices: evals, hygiene, tools, and agentic loops
- The talk frames prompting work around two common engineering scenarios: debugging an existing production prompt (often after a model migration) and designing a new agentic use case from scratch.
- It emphasizes evaluations as the starting point for prompt iteration, helping distinguish “model behavior differences” that prompting can fix from true capability gaps that prompting cannot.
- General prompt hygiene—removing redundant/copied content, clarifying role, and adding structure (e.g., XML sections)—can produce immediate performance gains before tackling specific failure modes.
- Three concrete failure modes are addressed systematically: information withholding due to over-defensive instructions, unreliable mental math solved by tool use, and escalation failures caused by one-sided cost incentives in policy text.
- For a constraint-heavy scheduling agent, the session compares model choice and prompting strategies, showing that a generate–evaluate–repair loop can outperform a single large prompt in cost/latency while supporting soft constraints at runtime.
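The generate–evaluate–repair loop from the last point can be sketched as follows. This is a minimal illustration, not the implementation shown in the session: `solve`, `evaluate`, and the toy constraints are hypothetical names, and the `generate` and `repair` callables are deterministic stand-ins for what would be model calls in a real agent. The checks are ordinary code, so a new constraint can be appended to the list at runtime without rewriting the prompt.

```python
from typing import Callable, Dict, List, Optional, Tuple

Schedule = Dict[str, int]                          # meeting name -> start hour
Constraint = Callable[[Schedule], Optional[str]]   # returns a violation message or None


def evaluate(schedule: Schedule, constraints: List[Constraint]) -> List[str]:
    """Run every constraint check and collect the violation messages."""
    violations = []
    for check in constraints:
        message = check(schedule)
        if message is not None:
            violations.append(message)
    return violations


def solve(
    generate: Callable[[], Schedule],
    repair: Callable[[Schedule, List[str]], Schedule],
    constraints: List[Constraint],
    max_rounds: int = 3,
) -> Tuple[Schedule, List[str]]:
    """Generate a draft, then alternate evaluate/repair until clean or out of rounds."""
    schedule = generate()
    for _ in range(max_rounds):
        violations = evaluate(schedule, constraints)
        if not violations:
            return schedule, []
        schedule = repair(schedule, violations)
    return schedule, evaluate(schedule, constraints)


# --- Toy stand-ins (assumptions; a real agent would call a model here) ---
def no_overlap(schedule: Schedule) -> Optional[str]:
    hours = list(schedule.values())
    return "two meetings share a start hour" if len(set(hours)) < len(hours) else None


def review_after_ten(schedule: Schedule) -> Optional[str]:
    return "review must start at 10 or later" if schedule["review"] < 10 else None


def stub_generate() -> Schedule:
    return {"standup": 9, "review": 9}


def stub_repair(schedule: Schedule, violations: List[str]) -> Schedule:
    fixed = dict(schedule)
    fixed["review"] += 1   # naive fix: push the review one hour later
    return fixed


final_schedule, remaining = solve(stub_generate, stub_repair, [no_overlap, review_after_ten])
```

In this toy run the first draft breaks both constraints, one repair round moves the review to hour 10, and the second evaluation comes back clean.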
IDEAS WORTH REMEMBERING
Start with evals to make prompt changes measurable.
Without an eval suite you can’t tell whether edits genuinely improve behavior or just shift failures around; evals also help determine whether issues are prompt-tunable or due to insufficient model capability.
Include control, edge, and boundary cases in every eval suite.
Control cases should always pass, edge cases prevent regressions on known failures, and boundary cases verify correct handoffs/refusals when the model shouldn’t proceed.
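A small sketch of how the three case types might be organized in an eval suite. Everything here is illustrative: `EvalCase`, `run_suite`, the example questions, and `stub_model` (a canned stand-in for the real model call, with made-up replies) are assumptions, not material from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class EvalCase:
    name: str
    kind: str                      # "control", "edge", or "boundary"
    user_input: str
    check: Callable[[str], bool]   # True when the reply is acceptable


def run_suite(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, Tuple[int, int]]:
    """Return {kind: (passed, total)} so regressions show up per case type."""
    results: Dict[str, Tuple[int, int]] = {}
    for case in cases:
        passed, total = results.get(case.kind, (0, 0))
        ok = case.check(model(case.user_input))
        results[case.kind] = (passed + int(ok), total + 1)
    return results


def stub_model(user_input: str) -> str:
    """Hypothetical stand-in for a model call; replies are invented for the demo."""
    text = user_input.lower()
    if "hotspot" in text:
        return "Your plan includes hotspot data."
    if "legal" in text:
        return "I'm handing you off to a human agent."
    return "Your current plan is Basic."


CASES = [
    # Control: should always pass.
    EvalCase("plan lookup", "control", "What plan am I on?",
             lambda reply: "plan" in reply.lower()),
    # Edge: pins a known failure (answering with a URL instead of the facts).
    EvalCase("hotspot question", "edge", "Does my plan include hotspot?",
             lambda reply: "hotspot" in reply.lower() and "http" not in reply),
    # Boundary: the model should hand off rather than proceed.
    EvalCase("legal threat", "boundary", "I want to take legal action.",
             lambda reply: "human" in reply.lower()),
]

summary = run_suite(stub_model, CASES)
```

Reporting pass counts per case type makes it easier to see whether a prompt edit fixed an edge case at the expense of a control or boundary case.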
Prompt structure is not cosmetic—it changes performance.
Separating role, policy, guidelines, tone, and data (e.g., via XML tags) reduces instruction confusion; if a human can’t easily parse the prompt, the model likely can’t either.
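One way to keep those parts apart is to assemble the prompt from named sections. The sketch below is an assumption-laden example: `build_prompt`, the tag names, and the sample text are all hypothetical, and the talk does not prescribe a particular set of tags.

```python
from typing import Dict


def build_prompt(sections: Dict[str, str]) -> str:
    """Wrap each section in its own XML-style tag, preserving insertion order."""
    parts = []
    for tag, body in sections.items():
        parts.append(f"<{tag}>\n{body.strip()}\n</{tag}>")
    return "\n\n".join(parts)


prompt = build_prompt({
    "role": "You are a support assistant for a mobile carrier.",
    "policy": "Treat the customer data below as the source of truth for plan details.",
    "guidelines": "Answer the question directly before offering links.",
    "tone": "Friendly and concise.",
    "customer_data": "plan: Basic\nhotspot: included",   # sample values, invented for the demo
})
```

Keeping instructions and data in separate tags also makes it easier for a human reviewer to spot where a copied or redundant instruction has crept in.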
Over-defensive “patch” instructions can cause information withholding.
The bot redirected hotspot questions to a URL because “never give the wrong plan details” dominated; updating the policy to treat provided customer context as the source of truth restored correct answers.
Instructions don’t add capability; tools do.
Telling a model to “calculate proration correctly” won’t fix mental-math unreliability; adding a proration tool with schema + implementation made the result consistently correct.
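A proration tool in the "schema + implementation" shape might look like the sketch below. The summary does not give the actual schema or formula, so treat all of it as assumption: the tool name, the field names, the `input_schema` layout (a JSON Schema style used by several tool-calling APIs), and the simple daily proration rule `price * days_remaining / days_in_cycle`.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical tool definition handed to the model alongside the prompt.
PRORATION_TOOL = {
    "name": "calculate_proration",
    "description": "Compute the prorated charge for the rest of a billing cycle.",
    "input_schema": {
        "type": "object",
        "properties": {
            "monthly_price": {"type": "string", "description": "Price as a decimal string, e.g. '30.00'"},
            "days_remaining": {"type": "integer"},
            "days_in_cycle": {"type": "integer"},
        },
        "required": ["monthly_price", "days_remaining", "days_in_cycle"],
    },
}


def calculate_proration(monthly_price: str, days_remaining: int, days_in_cycle: int) -> str:
    """Assumed rule: charge the daily rate for each remaining day, rounded to cents."""
    if days_in_cycle <= 0 or not 0 <= days_remaining <= days_in_cycle:
        raise ValueError("days_remaining must be between 0 and days_in_cycle")
    amount = Decimal(monthly_price) * days_remaining / days_in_cycle
    return str(amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def handle_tool_call(name: str, arguments: dict) -> str:
    """Route a tool call emitted by the model to the matching Python function."""
    if name == PRORATION_TOOL["name"]:
        return calculate_proration(**arguments)
    raise KeyError(f"unknown tool: {name}")
```

Using `Decimal` rather than floats keeps the arithmetic exact to the cent, and the model only has to choose the arguments instead of doing the division itself.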
WORDS WORTH SAVING
And prompting is arguably one of the first skills, if not the first skill, that we had to learn as engineers when we first started to work with LLMs.
— Margot Van Lare
We need evaluations to provide that rigor, um, to understand whether a change to our prompt is actually correlating to an improvement in its performance.
— Margot Van Lare
A general rule of thumb that I like to follow is if you're reading a prompt and you can't tell guidelines from policy, from data, most likely the model isn't able to either.
— Margot Van Lare
We worry a lot about hallucinations or the invention of facts and numbers, but actually the opposite can also happen. The model can withhold information that it actually has access to.
— Margot Van Lare
So the key lesson to take away here is instructions don't add capability.
— Margot Van Lare
High quality AI-generated summary created from speaker-labeled transcript.
