Braintrust CEO: Evals are the new PRD for AI products

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs in my audience. We get into how coding agents can take on deeply technical architecture and infrastructure work that no single human engineer could tackle before, and then we demystify evals so you can use them to make your AI products better without touching the implementation.

*What you’ll learn:*
1. How Ankur uses Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines to speed up slow queries
2. Why he argues there’s no excuse to skip rigorous benchmarking now that agents can run them tirelessly
3. The “agent line” framework: how to decide which decisions, directions, and interactions you can hand off to an agent
4. How I think about the practical vs. theoretical quality of AI on hard technical problems, and why human attention decays on tedious work
5. Why evals are the modern version of a PRD, and how to encode “what good looks like” so a model can figure out the “how”
6. How to build a scoring function live and let an agent improve your prompt inside a safe playground
7. How Ankur turned his designer David’s taste into a repeatable eval so quality scales beyond one person
8. Why fixing your CI is the highest-leverage way to speed up engineering velocity

*Brought to you by:*
Guru—The AI layer of truth: http://getguru.com/
Persona—Trusted identity verification for any use case: https://withpersona.com/lp/howiai

*In this episode, we cover:*
(00:00) Introduction to Ankur Goyal
(03:00) Using AI agents for database optimization
(06:10) Running exhaustive benchmarks with coding agents
(09:03) Why staff engineers are wrong about AI limitations
(11:30) The “agent line” framework for delegation
(14:00) Ankur’s workflow: running 4 to 6 concurrent agents
(17:16) Technical setup: foreground agents, background agents, and cloud environments
(20:32) Spending time with AI tools
(23:06) Demystifying evals
(26:02) Live demo: Building an eval for documentation answers
(30:20) The alternative to evals: vibe checks and whack-a-mole
(32:09) Capturing designer taste in scoring functions
(33:13) Quick recap
(33:44) Managing velocity and throughput
(35:40) Why CI/CD investment is critical for AI-accelerated teams
(37:30) Ankur’s prompting strategy when agents fail
(39:10) Closing thoughts and how to connect

*Blog & detailed workflow walkthroughs from this episode:*
Blog:
↳ Ankur Goyal's Playbook for Agent-Driven Benchmarking and AI Evals https://www.chatprd.ai/how-i-ai/ankur-goyals-playbook-for-agent-driven-benchmarking-and-ai-evals
Workflows:
↳ How to Scale Expert Judgment in AI Systems with a Human Feedback Loop https://www.chatprd.ai/how-i-ai/workflows/how-to-scale-expert-judgment-in-ai-systems-with-a-human-feedback-loop
↳ How to Use AI Coding Agents for Exhaustive Infrastructure Benchmarking https://www.chatprd.ai/how-i-ai/workflows/how-to-use-ai-coding-agents-for-exhaustive-infrastructure-benchmarking

*Tools referenced:*
• Braintrust: https://www.braintrust.dev/
• Codex: https://openai.com/codex/
• GPT 5.4: https://developers.openai.com/api/docs/models/gpt-5.4
• Claude: https://claude.ai/

*Other references:*
• GPT 5.5 just did what no other model could: https://www.lennysnewsletter.com/p/gpt-55-just-did-what-no-other-model
• Paul Graham’s Maker vs. Manager Schedule: http://www.paulgraham.com/makersschedule.html
• tmux: https://github.com/tmux/tmux
• Chris Tate at Vercel: https://www.linkedin.com/in/ctatedev/

*Where to find Ankur Goyal:*
LinkedIn: https://www.linkedin.com/in/ankrgyl/

*Where to find Claire Vo:*
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Claire Vo (host), Ankur Goyal (guest)

Jun 15, 2026 · 40m

WHAT IT’S REALLY ABOUT

Evals, agents, and rigor redefine how AI products ship fast

Coding agents can tackle complex infrastructure work (e.g., database indexing/query latency) by running exhaustive, production-like benchmarks that humans rarely execute thoroughly.
Goyal argues “rigor is now cheap,” so teams have little excuse to skip performance testing, edge-case coverage, or iterative experimentation.
He introduces the “agent line” as a delegation threshold: routinely push work below it to agents, freeing maker-time and increasing personal/team throughput.
Evals are framed as the new PRD for AI products: define “what good looks like” with measurable criteria and examples, then let models explore the “how.”
Without evals, teams default to vibe checks and whack-a-mole fixes; combining quantitative evals with periodic expert taste reviews scales quality (e.g., encoding a designer’s palate into scoring functions).
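
To make the "evals as PRD" idea concrete, here is a minimal Python sketch of a scoring function that writes "what good looks like" down as explicit checks. Everything in it is a hypothetical illustration: the `EvalCase` type, the `score_answer` and `run_eval` functions, and the criteria themselves are assumptions, not Braintrust's API and not the scorer built in the episode.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One example of 'what good looks like' (hypothetical structure)."""
    question: str
    must_mention: list[str]  # facts a good answer has to include
    max_words: int           # a good answer stays concise


def score_answer(case: EvalCase, answer: str) -> float:
    """Return a score in [0, 1]: the fraction of criteria the answer meets."""
    text = answer.lower()
    # One boolean per required term, plus one for the length limit.
    checks = [term.lower() in text for term in case.must_mention]
    checks.append(len(answer.split()) <= case.max_words)
    return sum(checks) / len(checks)


def run_eval(cases: list[EvalCase], answers: list[str]) -> float:
    """Average score over all cases, so two prompt versions can be compared."""
    scores = [score_answer(c, a) for c, a in zip(cases, answers)]
    return sum(scores) / len(scores)
```

The point of the design is that the criteria describe the outcome only. A model or agent is then free to change the prompt however it likes, as long as the average score goes up.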

IDEAS WORTH REMEMBERING

5 ideas

Use agents to expand the benchmark surface area, not just write code.

Goyal’s database work emphasizes week-long continuous experiments and exhaustive matrices (e.g., column store formats × execution engines) to discover practical wins like bloom filters that might be dismissed otherwise.
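
An exhaustive matrix of this kind could be sketched roughly as follows in Python. This is an assumption-laden illustration, not Goyal's actual harness: `run_matrix`, `fake_query`, and the format and engine names are hypothetical placeholders, and a real run would execute production-like queries instead of a dummy workload.

```python
import itertools
import statistics
import time
from typing import Callable


def run_matrix(
    formats: list[str],
    engines: list[str],
    run_query: Callable[[str, str], None],
    repeats: int = 3,
) -> dict[tuple[str, str], float]:
    """Time run_query for every (format, engine) pair; return median seconds."""
    results: dict[tuple[str, str], float] = {}
    # itertools.product enumerates the full cross product, so no pair is skipped.
    for fmt, engine in itertools.product(formats, engines):
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(fmt, engine)
            timings.append(time.perf_counter() - start)
        results[(fmt, engine)] = statistics.median(timings)
    return results


def fake_query(fmt: str, engine: str) -> None:
    """Placeholder workload standing in for a real benchmark query."""
    sum(range(1000))
```

Calling `run_matrix(["format_a", "format_b"], ["engine_x", "engine_y"], fake_query)` returns one median timing per combination, which can then be sorted to surface the fastest pairing.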

The practical quality of engineering improves when agents run the tedious loops.

Humans lose context and avoid long, repetitive benchmark work; agents can keep running consistently, making it more likely you’ll catch regressions (e.g., faster queries but slower indexing).
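
The regression case mentioned here (faster queries but slower indexing) is the kind of check an agent can apply after every run. A minimal Python sketch, assuming latencies are recorded per metric in seconds; the function name `find_regressions`, the metric names, and the 5% tolerance are hypothetical choices for illustration.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return metrics where candidate is slower than baseline beyond tolerance.

    Values are latencies in seconds (lower is better). The result maps each
    regressed metric to its relative slowdown, e.g. 0.5 means 50% slower.
    """
    regressions = {}
    for metric, old in baseline.items():
        new = candidate[metric]
        change = (new - old) / old  # positive means the candidate got slower
        if change > tolerance:
            regressions[metric] = change
    return regressions
```

With a made-up baseline of `{"query_s": 2.0, "indexing_s": 10.0}` and a candidate of `{"query_s": 1.0, "indexing_s": 15.0}`, the query win passes silently while `indexing_s` is flagged as a regression.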

Adopt an “agent line” and keep raising it.

Ask whether an agent given the same information could solve the problem; if yes, delegate it and reinvest the saved time into deep work, integrations, and reusable skills that push the line upward over time.

Limit concurrency to what you can supervise; organize it intentionally.

Goyal runs ~4–6 “foreground” agents as separate tmux sessions plus a remote long-running environment for heavy workloads, acknowledging a human context limit while still multiplying throughput.

Prefer safe, sandboxed agent environments over unbounded local ‘unhinged’ modes.

He highlights that agent autonomy is far less risky in controlled playgrounds (data/prompt-only contexts) than on a laptop with shell access, encouraging more structured experimentation setups.

WORDS WORTH SAVING

5 quotes

Evals are a methodology for you to say, "This is what success looks like." In my opinion, evals are actually the modern version of a PRD.

— Ankur Goyal

There's no staff engineer who is running as many rigorous benchmarks and trying out different algorithms and analyzing ideas manually than someone who's using, uh, an agent, and even that baseline is just incredible.

— Ankur Goyal

I think that everyone should take a, a hard look in the mirror and reevaluate how they spend their time.

— Ankur Goyal

Product building and code writing is, now looks like carving rather than constructing.

— Ankur Goyal

You might make it really good at one or two things, then you ship it, and then it's not good at something else.

— Ankur Goyal

Agents for database/query optimization
Exhaustive benchmark matrices (formats × engines)
“Agent line” delegation framework
Foreground vs. background agents; tmux workflows
Remote/cloud dev environments for scale testing
Evals as PRDs (what vs. how)
CI/CD as the throughput bottleneck in AI-accelerated teams

High quality AI-generated summary created from speaker-labeled transcript.
