CHAPTERS
- 0:00 – 3:00
Ankur Goyal’s AI-first engineering thesis: define success, let agents explore
Claire introduces Ankur Goyal (Braintrust CEO) and frames the episode as a deep technical discussion for senior engineering leaders. Ankur’s core philosophy is that as models get better at writing code, engineers should shift effort toward specifying success criteria and letting agents run experiments to find solutions.
- 3:00 – 6:10
AI agents for database/query optimization on massive, long-tail workloads
Ankur explains how Braintrust tackles slow, user-generated queries across billions of traces over long time windows. Instead of relying on a handful of manual optimizations, they reproduce slow patterns and use coding agents to try ideas from database literature under realistic constraints.
- 6:10 – 11:30
Exhaustive benchmarking with coding agents: column stores, engines, and the “matrix” approach
Ankur describes an exhaustive sweep of open-source column store formats and execution engines to find the best combination. The key unlock is having agents generate and run benchmarks continuously, producing evidence rather than opinions.
- 11:30 – 14:00
The “agent line” framework: what to delegate and how it keeps moving upward
Ankur introduces the “agent line”: if the information or decisions from a meeting could be handed to an agent that would solve the problem equally well, the work should be delegated. He argues the agent line rises as teams build skills, integrations, and tooling that expand what agents can reliably do.
- 14:00 – 17:16
Ankur’s day-to-day workflow: 4–6 concurrent foreground agents + a remote heavy-duty agent
Ankur describes running multiple “foreground” coding agents locally (tmux sessions) while also running a long-lived remote agent for compute-heavy experiments. Claire generalizes the pattern: personal context limits, concurrency discipline, and remote environments for data- and compute-intensive work.
- 17:16 – 23:06
Technical setup reality: local risks, safer sandboxes, and cloud dev environments
They discuss why off-the-shelf background agents work best for simpler web apps, while complex systems still push teams to build internal tooling. A recurring theme is moving agent autonomy into safer, constrained environments rather than letting “unhinged” agents loose on a laptop shell.
- 23:06 – 26:02
Demystifying evals: “the what” vs “the how,” and why evals are the new PRD
Ankur explains evals as a shift in programming: specify desired outcomes (the what) and let models discover implementations (the how). He argues evals function like modern PRDs—prose plus examples—but with quantification so progress can be measured and automated.
- 26:02 – 30:20
Live demo walkthrough: building an eval for documentation Q&A
Ankur demonstrates creating a dataset of documentation questions, writing a basic answering prompt, and attaching retrieval/context tools (MCP, doc indexing). He then uses a model to generate a scoring function so he doesn’t need to manually review every answer.
- 30:20 – 32:09
Why vibe checks alone fail: whack-a-mole behavior and missing regressions
They contrast eval-driven iteration with the common alternative: testing one or two examples and generalizing. Ankur argues vibe checks matter, but without evals you end up patching failures reactively after shipping.
- 32:09 – 33:44
Capturing “taste” in scoring functions: scaling a designer’s judgment without replacing them
Ankur shares how he incorporates a designer’s (David’s) high-quality taste: he uses evals to make quantitative progress, then gets periodic human critique, and finally encodes that feedback into updated scoring criteria. Claire highlights the fear of “building your replacement,” and reframes it as amplifying expertise across more work.
- 33:44 – 37:30
Managing velocity and throughput: carving products down + investing in CI/CD for AI-accelerated teams
In the lightning round, Ankur explains that building now resembles “carving”: because it’s easy to overbuild with AI, teams often simplify by removing confusing features. Technically, he argues that faster AI-driven development increases the need for strong CI/CD; when constrained, the right move is often to improve the pipeline, not ship lower-quality code.
- 37:30 – 39:10
When agents fail: reset the loop—rewrite the eval, then try again
Ankur’s “back pocket” strategy isn’t to argue with the agent: he closes the session, improves the evals, and restarts. He gives an example where an agent-generated eval script became huge and unusable; hand-writing a clean eval clarified the problem and enabled a quick model migration decision.
- 39:10
Closing: where to learn more and connect with Ankur/Braintrust
They wrap with how to find Braintrust for evals/observability and how to connect with Ankur. Claire closes the episode with standard subscribe/review info for the show.
Challenging staff-engineer skepticism: rigor beats theoretical objections
Claire and Ankur address a common claim: AI can’t help on the hardest systems work. Ankur argues that even if models aren’t perfect at concurrency/performance code, agents enable a level of rigor—breadth of benchmarks, algorithm trials, and iteration—that most teams don’t achieve manually.
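The "breadth of benchmarks" idea, and the matrix sweep from the 6:10 chapter, can be illustrated with a toy example. This is a minimal sketch, not Braintrust's tooling: the two storage layouts, the two "engines", the column name `latency_ms`, and the row count are all hypothetical stand-ins chosen so the script runs with only the Python standard library. The point is the shape of the loop: every layout is paired with every engine, each pairing is timed, and the results are cross-checked so that a fast but wrong combination cannot win.

```python
import itertools
import time

N_ROWS = 10_000  # assumed toy size; real workloads are far larger


def make_row_store(n):
    # Row layout: one dict per record.
    return [{"id": i, "latency_ms": i % 100} for i in range(n)]


def make_column_store(n):
    # Column layout: one list per field.
    return {"id": list(range(n)), "latency_ms": [i % 100 for i in range(n)]}


def read_row_store(store, column):
    return (row[column] for row in store)


def read_column_store(store, column):
    return iter(store[column])


def loop_sum(values):
    total = 0
    for v in values:
        total += v
    return total


# Hypothetical axes of the matrix: (builder, reader) per layout, callable per engine.
LAYOUTS = {
    "row": (make_row_store, read_row_store),
    "column": (make_column_store, read_column_store),
}
ENGINES = {"loop": loop_sum, "builtin": sum}


def run_matrix(n_rows=N_ROWS, repeats=3):
    """Time every layout x engine pair and return results sorted fastest first."""
    results = []
    for (layout, (make, read)), (engine, fn) in itertools.product(
        LAYOUTS.items(), ENGINES.items()
    ):
        store = make(n_rows)
        best = float("inf")
        answer = None
        for _ in range(repeats):
            start = time.perf_counter()
            answer = fn(read(store, "latency_ms"))
            best = min(best, time.perf_counter() - start)
        results.append(
            {"layout": layout, "engine": engine, "seconds": best, "answer": answer}
        )
    # Every combination must agree on the answer before timings are trusted.
    if len({r["answer"] for r in results}) != 1:
        raise ValueError("combinations disagree on the query result")
    return sorted(results, key=lambda r: r["seconds"])


if __name__ == "__main__":
    for row in run_matrix():
        print(f'{row["layout"]:>6} + {row["engine"]:<7} {row["seconds"]:.6f}s')
```

In this framing, adding a new format or engine is one more dictionary entry, which is the kind of incremental change an agent can make and re-run repeatedly.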
Time, attention, and boundaries: getting the benefits without burnout
Claire and Ankur discuss the emotional reality of always-on agents: flow-state joy versus productivity anxiety. They emphasize chunking time, protecting deep-work hours, and setting boundaries (e.g., closing the laptop at dinner).
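For readers who want a concrete picture of the eval workflow described in the 23:06 and 26:02 chapters (a dataset of questions, a task that answers them, and a scoring function that replaces manual review), here is a minimal sketch. It does not use Braintrust's API, and every name in it (`Example`, `keyword_score`, `run_eval`, `stub_answer`, `DATASET`) is hypothetical. In the episode the scoring function is generated by a model; here a simple keyword check stands in for it, and `stub_answer` stands in for the prompt plus retrieval tools.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    expected_keywords: list[str]  # hypothetical stand-in for a reference answer


def keyword_score(answer: str, example: Example) -> float:
    """Return the fraction of expected keywords found in the answer (0.0 to 1.0)."""
    if not example.expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in example.expected_keywords if kw.lower() in text)
    return hits / len(example.expected_keywords)


def run_eval(task, dataset, scorer=keyword_score) -> dict:
    """Run the task on every example and report per-row and mean scores."""
    rows = []
    for ex in dataset:
        answer = task(ex.question)
        rows.append(
            {"question": ex.question, "answer": answer, "score": scorer(answer, ex)}
        )
    mean = sum(r["score"] for r in rows) / len(rows) if rows else 0.0
    return {"mean_score": mean, "rows": rows}


def stub_answer(question: str) -> str:
    # Placeholder for a real model call with documentation retrieval.
    canned = {
        "How do I install the CLI?": "Run the installer to install the CLI.",
        "How do I rotate an API key?": "Open the settings page.",
    }
    return canned.get(question, "I don't know.")


DATASET = [
    Example("How do I install the CLI?", ["install", "cli"]),
    Example("How do I rotate an API key?", ["settings", "rotate"]),
]

if __name__ == "__main__":
    report = run_eval(stub_answer, DATASET)
    for row in report["rows"]:
        print(f'{row["score"]:.2f}  {row["question"]}')
    print("mean:", report["mean_score"])
```

The dataset and scorer carry "the what"; swapping `stub_answer` for a different prompt or model changes only "the how", and the mean score shows whether the swap was an improvement or a regression.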
