Braintrust CEO: Evals are the new PRD for AI products

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs in my audience. We get into how coding agents can take on deeply technical architecture and infrastructure work that no single human engineer could tackle before, and then we demystify evals so you can use them to make your AI products better without touching the implementation.

*What you’ll learn:*
1. How Ankur uses Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines to speed up slow queries
2. Why he argues there’s no excuse to skip rigorous benchmarking now that agents can run them tirelessly
3. The “agent line” framework: how to decide which decisions, directions, and interactions you can hand off to an agent
4. How I think about the practical vs. theoretical quality of AI on hard technical problems, and why human attention decays on tedious work
5. Why evals are the modern version of a PRD, and how to encode “what good looks like” so a model can figure out the “how”
6. How to build a scoring function live and let an agent improve your prompt inside a safe playground
7. How Ankur turned his designer David’s taste into a repeatable eval so quality scales beyond one person
8. Why fixing your CI is the highest-leverage way to speed up engineering velocity

*Brought to you by:*
Guru (The AI layer of truth): http://getguru.com/
Persona (Trusted identity verification for any use case): https://withpersona.com/lp/howiai

*In this episode, we cover:*
(00:00) Introduction to Ankur Goyal
(03:00) Using AI agents for database optimization
(06:10) Running exhaustive benchmarks with coding agents
(09:03) Why staff engineers are wrong about AI limitations
(11:30) The “agent line” framework for delegation
(14:00) Ankur’s workflow: running 4 to 6 concurrent agents
(17:16) Technical setup: foreground agents, background agents, and cloud environments
(20:32) Spending time with AI tools
(23:06) Demystifying evals
(26:02) Live demo: Building an eval for documentation answers
(30:20) The alternative to evals: vibe checks and whack-a-mole
(32:09) Capturing designer taste in scoring functions
(33:13) Quick recap
(33:44) Managing velocity and throughput
(35:40) Why CI/CD investment is critical for AI-accelerated teams
(37:30) Ankur’s prompting strategy when agents fail
(39:10) Closing thoughts and how to connect

*Blog & detailed workflow walkthroughs from this episode:*
Blog:
↳ Ankur Goyal's Playbook for Agent-Driven Benchmarking and AI Evals: https://www.chatprd.ai/how-i-ai/ankur-goyals-playbook-for-agent-driven-benchmarking-and-ai-evals
Workflows:
↳ How to Scale Expert Judgment in AI Systems with a Human Feedback Loop: https://www.chatprd.ai/how-i-ai/workflows/how-to-scale-expert-judgment-in-ai-systems-with-a-human-feedback-loop
↳ How to Use AI Coding Agents for Exhaustive Infrastructure Benchmarking: https://www.chatprd.ai/how-i-ai/workflows/how-to-use-ai-coding-agents-for-exhaustive-infrastructure-benchmarking

*Tools referenced:*
• Braintrust: https://www.braintrust.dev/
• Codex: https://openai.com/codex/
• GPT 5.4: https://developers.openai.com/api/docs/models/gpt-5.4
• Claude: https://claude.ai/

*Other references:*
• GPT 5.5 just did what no other model could: https://www.lennysnewsletter.com/p/gpt-55-just-did-what-no-other-model
• Paul Graham’s Maker vs. Manager Schedule: http://www.paulgraham.com/makersschedule.html
• tmux: https://github.com/tmux/tmux
• Chris Tate at Vercel: https://www.linkedin.com/in/ctatedev/

*Where to find Ankur Goyal:*
LinkedIn: https://www.linkedin.com/in/ankrgyl/

*Where to find Claire Vo:*
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Claire Vo (host), Ankur Goyal (guest)

Jun 15, 2026 · 40m

CHAPTERS

0:00 – 3:47
AI skepticism vs. “no excuse for rigor” on hard engineering problems
Claire and Ankur open by challenging the common staff-engineer belief that AI can’t help on the most complex technical work. Ankur argues that agents enable more rigor—more experiments, more benchmarks, and more iteration—than humans typically sustain manually.
- •Pushback on the idea that AI fails on “complicated things”
- •Agents can run more rigorous experiments than typical human workflows
- •Rigor becomes cheaper: fewer excuses for untested assumptions
- •Framing: many decisions and tasks now belong “below the agent line”
3:47 – 5:18
Using coding agents to diagnose and speed up real production queries
Ankur describes how Braintrust uses agents to identify slow query patterns and reproduce them for systematic optimization. The emphasis is on defining success criteria (latency, throughput, indexing cost) and letting agents explore solutions from database literature.
- •Reproducing slow queries based on real user patterns
- •Defining tests/success metrics so agents can iterate independently
- •Applying database indexing and execution ideas at scale
- •Using production or production-like data for more realistic results
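The idea of defining success metrics so an agent can iterate on its own can be pictured as a small, machine-checkable gate. The sketch below is illustrative only: the metric names, the thresholds, and the `BenchmarkResult` and `SuccessCriteria` classes are assumptions made for this example, not Braintrust's actual harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """Measurements from one benchmark run (hypothetical metric names)."""
    p95_latency_ms: float
    queries_per_second: float
    index_build_seconds: float


@dataclass(frozen=True)
class SuccessCriteria:
    """Thresholds a candidate change must meet (values are placeholders)."""
    max_p95_latency_ms: float
    min_queries_per_second: float
    max_index_build_seconds: float

    def failures(self, result: BenchmarkResult) -> list[str]:
        """Return the unmet criteria; an empty list means the run passed."""
        problems = []
        if result.p95_latency_ms > self.max_p95_latency_ms:
            problems.append("p95 latency above limit")
        if result.queries_per_second < self.min_queries_per_second:
            problems.append("throughput below minimum")
        if result.index_build_seconds > self.max_index_build_seconds:
            problems.append("indexing cost above limit")
        return problems


criteria = SuccessCriteria(
    max_p95_latency_ms=200.0,
    min_queries_per_second=50.0,
    max_index_build_seconds=600.0,
)
# A run that is fast to query but too slow to index.
run = BenchmarkResult(
    p95_latency_ms=150.0,
    queries_per_second=80.0,
    index_build_seconds=900.0,
)
print(criteria.failures(run))
```

An agent loop could keep proposing changes until `failures` returns an empty list, which makes the stopping condition explicit instead of leaving it to judgment.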
5:18 – 9:32
Exhaustive benchmarking: column stores, execution engines, and a solution matrix
They dive into Ankur’s current work: testing many open-source column store formats and execution engines to find the best combination. The key advantage is the ability to run exhaustive permutations that would be too time-consuming for a human team.
- •Trying “every” open-source column store format and engine
- •Computing the performance matrix across combinations
- •Why this breadth of benchmarking is rarely feasible manually
- •Agents as force multipliers for infrastructure/platform decisions
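Computing a performance matrix across combinations amounts to measuring every (format, engine) pair and comparing the results. The following is a minimal sketch under stated assumptions: the format and engine names are placeholders (the summary does not list the ones tested), and `measure` is a hypothetical callback standing in for a real benchmark run.

```python
import itertools
from typing import Callable

# Placeholder names; the specific formats and engines are not given in the source.
FORMATS = ["format_a", "format_b", "format_c"]
ENGINES = ["engine_x", "engine_y"]


def run_matrix(
    formats: list[str],
    engines: list[str],
    measure: Callable[[str, str], float],
    repeats: int = 3,
) -> dict[tuple[str, str], float]:
    """Measure every (format, engine) pair and keep the best of `repeats` runs."""
    results = {}
    for fmt, engine in itertools.product(formats, engines):
        results[(fmt, engine)] = min(measure(fmt, engine) for _ in range(repeats))
    return results


def best_combination(results: dict[tuple[str, str], float]) -> tuple[str, str]:
    """Return the pair with the lowest measured latency."""
    return min(results, key=results.get)


def fake_measure(fmt: str, engine: str) -> float:
    """Stand-in for a real benchmark: returns made-up latencies in seconds."""
    return 1.0 + 0.1 * FORMATS.index(fmt) + 0.5 * ENGINES.index(engine)


matrix = run_matrix(FORMATS, ENGINES, fake_measure)
print(best_combination(matrix))
```

The matrix grows multiplicatively with each added dimension, which is why this breadth is tedious by hand and a natural fit for an agent that can run it unattended.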
9:32 – 11:33
Why risk-averse teams are wrong: benchmarks, peer review, and hidden regressions
Ankur explains how even strong engineers often skip or hand-wave benchmarks under time pressure, especially for database changes. With agents, teams can systematically validate both wins (query speed) and tradeoffs (indexing slowdown), surfacing issues earlier.
- •Typical benchmark gaps: focusing on a few tests and “bullshitting” the rest
- •Example: bloom filters discovered via week-long continuous experiments
- •Benchmarking both query performance and indexing cost
- •Agents reduce the cost of thoroughness and expose regressions sooner
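Benchmarking both the win and the tradeoff can be made mechanical by comparing a candidate against a baseline on every metric at once. This is a rough sketch: the metric names, the numbers, and the 5% noise tolerance are invented for illustration, and all metrics are assumed to be lower-is-better.

```python
def classify_changes(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, str]:
    """Label each lower-is-better metric as improved, regressed, or unchanged.

    A metric counts as changed only if it moves by more than `tolerance`
    (as a fraction of the baseline), to avoid reacting to noise.
    """
    verdicts = {}
    for name, base in baseline.items():
        relative_change = (candidate[name] - base) / base
        if relative_change < -tolerance:
            verdicts[name] = "improved"
        elif relative_change > tolerance:
            verdicts[name] = "regressed"
        else:
            verdicts[name] = "unchanged"
    return verdicts


# Made-up numbers: the query gets faster while indexing gets slower.
baseline = {"query_ms": 100.0, "index_build_s": 60.0}
candidate = {"query_ms": 40.0, "index_build_s": 90.0}
print(classify_changes(baseline, candidate))
```

Reporting every metric in one pass is what surfaces a hidden regression: a change that looks like a clear win on query latency shows up alongside its indexing cost.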
11:33 – 14:16
Production safety, practical quality, and why continuous iteration beats theoretical perfection
Claire and Ankur distinguish theoretically perfect human-written code from practical reality: attention decay, context loss, and limited cycles. They argue that agent-driven persistence and repeatability often improve real-world quality and enable bigger technical bets.
- •Practical quality improves with sustained, consistent iteration
- •Humans lose context and stamina on tedious technical work
- •Agents enable “run at the problem” persistence
- •Business case: background progress without tying up staff for months
14:16 – 15:16
The “agent line” and reclaiming maker time (no meetings after noon)
Ankur introduces the “agent line” framework: if an agent can solve it given the meeting’s information, it likely belongs below the line. He pairs that with schedule design—protecting maker time—to write code deeply and effectively alongside agents.
- •Definition: can an agent solve it with the info being discussed?
- •The agent line keeps rising as tooling/integrations improve
- •Pushing the agent line via skills and integrations inside the company
- •Maker schedule tactics: blocking afternoons for focused coding
15:16 – 20:32
Running 4–6 concurrent agents: tmux sessions, local constraints, and remote experiments
Ankur describes a pragmatic setup: multiple foreground agents in tmux, each tied to a workstream, plus a remote environment for compute-heavy benchmarks. They also cover the current limits of off-the-shelf background agents for complex systems (ports, isolation, collisions).
- •Foreground agent concurrency: ~4–6 sessions in tmux
- •Naming and isolating parallel workstreams per agent
- •Port collisions and why complex software is harder to containerize neatly
- •Remote/cloud environments for high-scale, multi-day experiments
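One simple way to avoid port collisions between parallel workstreams is to give each one its own fixed block of ports. The sketch below is an assumption about how that could be done, not a description of Ankur's setup: the workstream names, the base port, and the block size are all hypothetical, and the code does not check whether the ports are actually free on the machine.

```python
def port_block(index: int, base: int = 4000, size: int = 100) -> range:
    """Return the port range reserved for the workstream at position `index`."""
    start = base + index * size
    if index < 0 or start + size - 1 > 65535:
        raise ValueError("workstream index is outside the usable port space")
    return range(start, start + size)


def assign_ports(workstreams: list[str]) -> dict[str, range]:
    """Give each named workstream its own non-overlapping block of ports."""
    return {name: port_block(i) for i, name in enumerate(workstreams)}


# Hypothetical workstream names, one per tmux session.
ports = assign_ports(["query-engine", "indexing", "evals-ui"])
for name, block in ports.items():
    print(f"{name}: ports {block.start}-{block.stop - 1}")
```

Each agent can then be told to start its services only inside its own block, so two sessions running the same stack do not compete for the same port.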
20:32 – 23:09
Sustainable AI usage: flow state vs. productivity anxiety (and closing the laptop)
They discuss the emotional side of agentic coding: some people rediscover flow, while others feel constant pressure to be “kicking off agents.” Claire advocates chunking time with AI, and both emphasize boundaries to avoid always-on work habits.
- •Two camps: renewed joy/flow vs. anxiety/burnout
- •Avoiding the feeling that you must run agents constantly
- •Chunking work time with AI for better focus and enjoyment
- •Personal boundary example: closing the laptop at dinner
23:09 – 26:16
Evals, demystified: machine learning shifts “how” to “what”
Ankur explains evals as the core mechanism for specifying outcomes in AI systems—defining success rather than implementation. He frames evals as the modern PRD: prose plus examples, encoded in measurable ways so models can explore the solution space.
- •ML changes programming from “how” to “what”
- •Transformers as an inspiration: define the objective and measure it
- •Evals as modern PRDs: examples + quantifiable criteria
- •Focus on success definitions enables creative solution search
26:16 – 30:21
Live demo: building a documentation-answer eval with a generated scoring function
Ankur walks through creating a dataset of real documentation questions, running a prompt to answer them, and asking a model to generate a scoring function. They highlight why this is safer in a sandbox/playground than letting agents loose on a local machine.
- •Creating a question dataset from real doc queries
- •Using MCP/context tools to ground doc answers
- •Letting a model generate evaluation criteria and scoring code
- •Sandboxed agent environments reduce risk vs. local bash autonomy
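The shape of the demo (a dataset of questions, a task that answers them, and a scoring function) can be sketched in plain Python. This is not the Braintrust SDK and not the scoring code generated in the episode: the dataset, the canned answers, and the fact-matching scorer are made up so the example runs without any API.

```python
from typing import Callable


def score_answer(answer: str, expected_facts: list[str]) -> float:
    """Return the fraction of expected facts that appear in the answer."""
    if not expected_facts:
        return 1.0
    text = answer.lower()
    hits = sum(1 for fact in expected_facts if fact.lower() in text)
    return hits / len(expected_facts)


def run_eval(dataset: list[dict], answer_fn: Callable[[str], str]) -> dict[str, float]:
    """Score every question and return per-question scores keyed by question."""
    return {
        row["question"]: score_answer(answer_fn(row["question"]), row["expected_facts"])
        for row in dataset
    }


# Made-up dataset and a canned "model" standing in for the real prompt.
dataset = [
    {"question": "How do I reset my API key?", "expected_facts": ["settings", "regenerate"]},
    {"question": "How do I invite a teammate?", "expected_facts": ["members", "email"]},
]
canned = {
    "How do I reset my API key?": "Open Settings and click Regenerate.",
    "How do I invite a teammate?": "Go to the Members page and add them.",
}
scores = run_eval(dataset, canned.__getitem__)
print(scores)
```

Substring matching is a deliberately crude scorer. The point is the structure: once "what good looks like" is a function, a prompt change can be judged by re-running the dataset rather than by rereading answers.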
30:21 – 30:57
Why “vibe checks” aren’t enough: avoiding whack-a-mole with systematic evals
They contrast rigorous eval pipelines with the common alternative: testing one or two examples and generalizing. Ankur supports vibe checks, but argues that without evals you end up fixing one issue at a time after shipping, never knowing what regressed.
- •Common anti-pattern: a couple of examples and a guess
- •Vibe checks are useful but insufficient alone
- •Whack-a-mole failure mode after shipping
- •Evals provide aggregate tracking and regression detection
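Aggregate tracking and regression detection are two different checks, and a toy example shows why both matter. In the sketch below the example ids, the scores, and the `min_drop` threshold are invented for illustration.

```python
from statistics import mean


def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    min_drop: float = 0.1,
) -> list[str]:
    """Return example ids whose score fell by at least `min_drop` between runs."""
    return sorted(
        example_id
        for example_id, old_score in previous.items()
        if example_id in current and old_score - current[example_id] >= min_drop
    )


# Made-up per-example scores from two runs of the same eval.
previous_run = {"q1": 1.0, "q2": 0.5, "q3": 1.0}
current_run = {"q1": 1.0, "q2": 1.0, "q3": 0.5}

print(mean(previous_run.values()), mean(current_run.values()))
print(find_regressions(previous_run, current_run))
```

Here the average score is identical across the two runs, yet one example got worse while another got better. Without the per-example comparison, the fix for `q2` would quietly ship a break in `q3`, which is the whack-a-mole pattern described above.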
30:57 – 33:53
Encoding taste: using a designer’s judgment to raise quality without replacing them
Ankur shares how a designer (“David”) provides periodic taste-based reviews that are then converted into eval criteria. Claire addresses fears about “building your replacement,” and they argue that codifying taste lets experts scale their impact and raise the quality bar.
- •Designer-in-the-loop: periodic vibe checks on top of quantitative evals
- •Turning qualitative feedback into scoring functions for future runs
- •Capturing taste increases reach and consistency across outputs
- •Codifying expertise elevates the expert rather than replacing them
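Turning qualitative feedback into a scoring function might look like a weighted rubric in which each reviewer note becomes a named check. The three criteria below are invented examples, not David's actual feedback, and the weights are arbitrary.

```python
from typing import Callable

# Each entry: (criterion name, weight, check). All three checks are invented
# examples of how a reviewer's note might be turned into code.
RUBRIC: list[tuple[str, float, Callable[[str], bool]]] = [
    ("is concise", 2.0, lambda text: len(text.split()) <= 40),
    ("avoids shouting", 1.0, lambda text: "!" not in text),
    ("starts with the answer", 1.0, lambda text: not text.lower().startswith("great question")),
]


def taste_score(text: str) -> tuple[float, list[str]]:
    """Return a weighted score in [0, 1] and the names of failed criteria."""
    total = sum(weight for _, weight, _ in RUBRIC)
    failed = [name for name, _, check in RUBRIC if not check(text)]
    earned = sum(weight for name, weight, _ in RUBRIC if name not in failed)
    return earned / total, failed


score, failed = taste_score("Great question! Open Settings and pick a theme.")
print(score, failed)
```

Returning the failed criteria alongside the score keeps the expert in the loop: when a periodic review disagrees with the rubric, the fix is to add or adjust a criterion, so each review improves every future run.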
33:53 – 37:31
Lightning round: carving vs. constructing, and why CI/CD is the bottleneck
Ankur explains that AI makes it easy to overbuild, so product work becomes “carving”—removing complexity to reduce confusion. On throughput, he emphasizes investing in CI so teams can safely move faster, and reframes engineering as building platforms for agents.
- •AI accelerates feature creation, increasing the need to remove/trim
- •Default response to confusion: simplify rather than add complexity
- •Throughput strategy: improve CI to earn speed safely
- •Core team job: build feedback loops/pipelines (for evals and for code)
37:31 – 39:08
When agents fail: reset, improve the eval, and re-run (plus hand-writing the eval)
Asked about a go-to prompting strategy, Ankur prefers restarting with better evaluation scaffolding rather than wrestling with a stuck session. He shares an example where a vibe-coded eval became unusable, so he hand-wrote the eval to understand the problem and quickly finish the migration decision.
- •Preferred fix: close the session, strengthen evals, restart clean
- •Avoids accumulating brittle, confusing “agent-made” evaluation code
- •Example: 3,000-line messy eval script replaced by a hand-written eval
- •Principle: humans focus on the eval/feedback loop; agents do the rest
39:08 – 40:11
Closing: where to find Braintrust and Ankur (and hiring)
They wrap up with links to Braintrust and Ankur’s socials, plus an invitation to talk about evals and observability. Claire closes with the standard show outro and subscription prompts.
- •Braintrust: braintrust.dev and @braintrust
- •Ankur: @ankrgyl; open to chatting
- •Hiring call for people excited about rigor and evals
- •Podcast outro: like/subscribe, ratings, and where to listen
