How I AI

Braintrust CEO: Evals are the new PRD for AI products

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs in my audience. We get into how coding agents can take on deeply technical architecture and infrastructure work that no single human engineer could tackle before, and then we demystify evals so you can use them to make your AI products better without touching the implementation.

*What you’ll learn:*
1. How Ankur uses Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines to speed up slow queries
2. Why he argues there’s no excuse to skip rigorous benchmarking now that agents can run them tirelessly
3. The “agent line” framework: how to decide which decisions, directions, and interactions you can hand off to an agent
4. How I think about the practical vs. theoretical quality of AI on hard technical problems, and why human attention decays on tedious work
5. Why evals are the modern version of a PRD, and how to encode “what good looks like” so a model can figure out the “how”
6. How to build a scoring function live and let an agent improve your prompt inside a safe playground
7. How Ankur turned his designer David’s taste into a repeatable eval so quality scales beyond one person
8. Why fixing your CI is the highest-leverage way to speed up engineering velocity

*Brought to you by:*
Guru: The AI layer of truth: http://getguru.com/
Persona: Trusted identity verification for any use case: https://withpersona.com/lp/howiai

*In this episode, we cover:*
(00:00) Introduction to Ankur Goyal
(03:00) Using AI agents for database optimization
(06:10) Running exhaustive benchmarks with coding agents
(09:03) Why staff engineers are wrong about AI limitations
(11:30) The “agent line” framework for delegation
(14:00) Ankur’s workflow: running 4 to 6 concurrent agents
(17:16) Technical setup: foreground agents, background agents, and cloud environments
(20:32) Spending time with AI tools
(23:06) Demystifying evals
(26:02) Live demo: Building an eval for documentation answers
(30:20) The alternative to evals: vibe checks and whack-a-mole
(32:09) Capturing designer taste in scoring functions
(33:13) Quick recap
(33:44) Managing velocity and throughput
(35:40) Why CI/CD investment is critical for AI-accelerated teams
(37:30) Ankur’s prompting strategy when agents fail
(39:10) Closing thoughts and how to connect

*Blog & detailed workflow walkthroughs from this episode:*
Blog:
↳ Ankur Goyal's Playbook for Agent-Driven Benchmarking and AI Evals https://www.chatprd.ai/how-i-ai/ankur-goyals-playbook-for-agent-driven-benchmarking-and-ai-evals
Workflows:
↳ How to Scale Expert Judgment in AI Systems with a Human Feedback Loop https://www.chatprd.ai/how-i-ai/workflows/how-to-scale-expert-judgment-in-ai-systems-with-a-human-feedback-loop
↳ How to Use AI Coding Agents for Exhaustive Infrastructure Benchmarking https://www.chatprd.ai/how-i-ai/workflows/how-to-use-ai-coding-agents-for-exhaustive-infrastructure-benchmarking

*Tools referenced:*
• Braintrust: https://www.braintrust.dev/
• Codex: https://openai.com/codex/
• GPT 5.4: https://developers.openai.com/api/docs/models/gpt-5.4
• Claude: https://claude.ai/

*Other references:*
• GPT 5.5 just did what no other model could: https://www.lennysnewsletter.com/p/gpt-55-just-did-what-no-other-model
• Paul Graham’s Maker vs. Manager Schedule: http://www.paulgraham.com/makersschedule.html
• tmux: https://github.com/tmux/tmux
• Chris Tate at Vercel: https://www.linkedin.com/in/ctatedev/

*Where to find Ankur Goyal:*
LinkedIn: https://www.linkedin.com/in/ankrgyl/

*Where to find Claire Vo:*
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Host: Claire Vo · Guest: Ankur Goyal
Jun 15, 2026 · 40m

CHAPTERS

  1. 0:00 – 3:00

    Ankur Goyal’s AI-first engineering thesis: define success, let agents explore

    Claire introduces Ankur Goyal (Braintrust CEO) and frames the episode as a deep technical discussion for senior engineering leaders. Ankur’s core philosophy is that as models get better at writing code, engineers should shift effort toward specifying success criteria and letting agents run experiments to find solutions.

  2. 3:00 – 6:10

    AI agents for database/query optimization on massive, long-tail workloads

    Ankur explains how Braintrust tackles slow, user-generated queries across billions of traces over long time windows. Instead of relying on a handful of manual optimizations, they reproduce slow patterns and use coding agents to try ideas from database literature under realistic constraints.

  3. 6:10 – 11:30

    Exhaustive benchmarking with coding agents: column stores, engines, and the “matrix” approach

    Ankur describes an exhaustive sweep of open-source column store formats and execution engines to find the best combination. The key unlock is having agents generate and run benchmarks continuously, producing evidence rather than opinions.

  4. 11:30 – 14:00

    The “agent line” framework: what to delegate and how it keeps moving upward

    Ankur introduces the “agent line”: if the information or decisions coming out of a meeting can be handed to an agent that would handle them just as well, they should be delegated. He argues the agent line rises as teams build the skills, integrations, and tooling that expand what agents can reliably do.

  5. 14:00 – 17:16

    Ankur’s day-to-day workflow: 4–6 concurrent foreground agents + a remote heavy-duty agent

    Ankur describes running multiple “foreground” coding agents locally (in tmux sessions) while also running a long-lived remote agent for compute-heavy experiments. Claire generalizes the pattern: know your personal context limits, keep concurrency disciplined, and use remote environments for data- and compute-intensive work.

  6. 17:16 – 23:06

    Technical setup reality: local risks, safer sandboxes, and cloud dev environments

    They discuss why off-the-shelf background agents work best for simpler web apps, while complex systems still push teams to build internal tooling. A recurring theme is moving agent autonomy into safer, constrained environments rather than letting “unhinged” agents loose on a laptop shell.

  7. 23:06 – 26:02

    Demystifying evals: “the what” vs “the how,” and why evals are the new PRD

    Ankur explains evals as a shift in programming: specify desired outcomes (the what) and let models discover implementations (the how). He argues evals function like modern PRDs—prose plus examples—but with quantification so progress can be measured and automated.

  8. 26:02 – 30:20

    Live demo walkthrough: building an eval for documentation Q&A

    Ankur demonstrates creating a dataset of documentation questions, writing a basic answering prompt, and attaching retrieval/context tools (MCP, doc indexing). He then uses a model to generate a scoring function so he doesn’t need to manually review every answer.

  9. 30:20 – 32:09

    Why vibe checks alone fail: whack-a-mole behavior and missing regressions

    They contrast eval-driven iteration with the common alternative: testing one or two examples and generalizing. Ankur argues vibe checks matter, but without evals you end up patching failures reactively after shipping.

  10. 32:09 – 33:44

    Capturing “taste” in scoring functions: scaling a designer’s judgment without replacing them

    Ankur shares how he incorporates a designer’s (David’s) high-quality taste: he uses evals to make quantitative progress, then gets periodic human critique, and finally encodes that feedback into updated scoring criteria. Claire highlights the fear of “building your replacement,” and reframes it as amplifying expertise across more work.

  11. 33:44 – 37:30

    Managing velocity and throughput: carving products down + investing in CI/CD for AI-accelerated teams

    In the lightning round, Ankur explains that building now resembles “carving”: because it is easy to overbuild with AI, teams often simplify by removing confusing features. On the technical side, he argues that faster AI-driven development increases the need for strong CI/CD; when velocity is constrained, the right move is often to improve the pipeline rather than ship lower-quality code.

  12. 37:30 – 39:10

    When agents fail: reset the loop—rewrite the eval, then try again

    Ankur’s ‘back pocket’ strategy isn’t to argue with the agent: he closes the session, improves the evals, and restarts. He gives an example where an agent-generated eval script became huge and unusable; hand-writing a clean eval clarified the problem and enabled a quick model migration decision.

  13. 39:10

    Closing: where to learn more and connect with Ankur/Braintrust

    They wrap with how to find Braintrust for evals/observability and how to connect with Ankur. Claire closes the episode with standard subscribe/review info for the show.

  14. Challenging staff-engineer skepticism: rigor beats theoretical objections

    Claire and Ankur address a common claim: AI can’t help on the hardest systems work. Ankur argues that even if models aren’t perfect at concurrency/performance code, agents enable a level of rigor—breadth of benchmarks, algorithm trials, and iteration—that most teams don’t achieve manually.

  15. Time, attention, and boundaries: getting the benefits without burnout

    Claire and Ankur discuss the emotional reality of always-on agents: flow-state joy versus productivity anxiety. They emphasize chunking time, protecting deep-work hours, and setting boundaries (e.g., closing the laptop at dinner).
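
The eval loop described in chapters 7, 8, and 10 (a dataset of questions, a scoring function, and reviewer feedback encoded as scoring criteria) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Braintrust's API or the code shown in the episode: the `Case` type, the dataset contents, the 60-word length penalty, and the two canned "prompt variants" are all hypothetical, and the variants are stub functions standing in for real model calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Case:
    question: str
    must_mention: list[str]  # facts a good answer should contain


# Hypothetical dataset of documentation questions: the "what".
DATASET = [
    Case("How do I create an API key?", ["settings", "api key"]),
    Case("How do I log a trace?", ["sdk", "trace"]),
]


def score(answer: str, case: Case) -> float:
    """Fraction of required facts present, minus a penalty for long answers."""
    text = answer.lower()
    hits = sum(fact in text for fact in case.must_mention)
    coverage = hits / len(case.must_mention)
    # Assumed example of encoded reviewer feedback: answers over 60 words lose 0.25.
    penalty = 0.25 if len(answer.split()) > 60 else 0.0
    return max(0.0, coverage - penalty)


def run_eval(task: Callable[[str], str]) -> float:
    """Average score of a task (prompt + model) over the whole dataset."""
    return mean(score(task(case.question), case) for case in DATASET)


# Stub "prompt variants"; a real setup would call a model here (the "how").
def variant_a(question: str) -> str:
    return "Please check the documentation."


def variant_b(question: str) -> str:
    canned = {
        "How do I create an API key?": "Open Settings and click Create API key.",
        "How do I log a trace?": "Install the SDK and call its trace logger.",
    }
    return canned[question]


if __name__ == "__main__":
    for name, task in [("variant_a", variant_a), ("variant_b", variant_b)]:
        print(name, run_eval(task))
```

Because every variant is scored against the same dataset, a change that fixes one question but breaks another shows up as a lower average rather than slipping through a one-off vibe check. New reviewer feedback would be added as another rule inside `score`, so it applies to every future run.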
