At a glance
WHAT IT’S REALLY ABOUT
Evals, agents, and cheap rigor are redefining how AI products ship
- Coding agents can tackle complex infrastructure work (e.g., database indexing/query latency) by running exhaustive, production-like benchmarks that humans rarely execute thoroughly.
- Goyal argues “rigor is now cheap,” so teams have little excuse to skip performance testing, edge-case coverage, or iterative experimentation.
- He introduces the “agent line” as a delegation threshold: routinely push work below it to agents, freeing maker-time and increasing personal/team throughput.
- Evals are framed as the new PRD for AI products: define “what good looks like” with measurable criteria and examples, then let models explore the “how.”
- Without evals, teams default to vibe checks and whack-a-mole fixes; combining quantitative evals with periodic expert taste reviews scales quality (e.g., encoding a designer’s palate into scoring functions).
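To make the "evals as PRD" idea concrete, here is a minimal sketch of an eval harness in Python. It is not Goyal's tooling or any particular library's API: the names `EvalCase`, `run_eval`, `exact_match`, `length_budget`, and `toy_task` are all hypothetical, and the toy task stands in for a real model call. The point is only that "what good looks like" becomes a set of examples plus scoring functions that return a number.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One example of 'what good looks like' (hypothetical structure)."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    # 1.0 when the answer matches, ignoring case and surrounding whitespace.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def length_budget(output: str, expected: str, max_chars: int = 40) -> float:
    # A stand-in for an encoded "taste" rule: answers should stay short.
    return 1.0 if len(output) <= max_chars else 0.0


def run_eval(
    task: Callable[[str], str],
    cases: list[EvalCase],
    scorers: list[Callable[[str, str], float]],
) -> dict[str, float]:
    """Run the task on every case and return the mean score per scorer."""
    totals = {scorer.__name__: 0.0 for scorer in scorers}
    for case in cases:
        output = task(case.prompt)
        for scorer in scorers:
            totals[scorer.__name__] += scorer(output, case.expected)
    return {name: total / len(cases) for name, total in totals.items()}


def toy_task(prompt: str) -> str:
    # Placeholder for a model call; it only knows one answer.
    answers = {"capital of France?": "Paris"}
    return answers.get(prompt, "I am not sure")


CASES = [
    EvalCase("capital of France?", "paris"),
    EvalCase("capital of Peru?", "Lima"),
]

scores = run_eval(toy_task, CASES, [exact_match, length_budget])
print(scores)
```

Each scorer is an ordinary function, so an expert's periodic taste review can be folded in by adding or adjusting scorers rather than by patching individual failures one at a time.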
IDEAS WORTH REMEMBERING
5 ideas
Use agents to expand the benchmark surface area, not just write code.
Goyal’s database work emphasizes week-long continuous experiments and exhaustive matrices (e.g., column store formats × execution engines) to discover practical wins like bloom filters that might be dismissed otherwise.
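An exhaustive matrix of this kind can be sketched in a few lines of Python. The format and engine names below are hypothetical placeholders, and `make_workload` fakes a workload with simple arithmetic; a real harness would load production-like data and run actual queries. The sketch only shows the shape of the loop an agent could keep running.

```python
import itertools
import time
from typing import Callable

# Hypothetical axis values; substitute the real formats and engines under test.
FORMATS = ["format_a", "format_b"]
ENGINES = ["engine_x", "engine_y"]


def make_workload(fmt: str, engine: str) -> Callable[[], int]:
    # Placeholder workload: the cost varies by combination so timings differ.
    size = 1000 * (1 + FORMATS.index(fmt)) * (1 + ENGINES.index(engine))
    return lambda: sum(range(size))


def time_once(fn: Callable[[], int]) -> float:
    # Wall-clock seconds for a single run.
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_matrix(repeats: int = 3) -> dict[tuple[str, str], float]:
    """Time every format x engine combination, keeping the best of `repeats`."""
    results = {}
    for fmt, engine in itertools.product(FORMATS, ENGINES):
        workload = make_workload(fmt, engine)
        results[(fmt, engine)] = min(time_once(workload) for _ in range(repeats))
    return results


results = run_matrix()
for (fmt, engine), seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{fmt} + {engine}: {seconds * 1e6:.1f} microseconds")
```

Adding a third axis, such as an index structure, is one more list passed to `itertools.product`, which is why the matrix grows faster than a human is likely to run by hand.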
The practical quality of engineering improves when agents run the tedious loops.
Humans lose context and avoid long, repetitive benchmark work; agents can keep running consistently, making it more likely you’ll catch regressions (e.g., faster queries but slower indexing).
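The "faster queries but slower indexing" case is the kind of trade-off a simple automated comparison can surface. The sketch below assumes lower is better for every metric; the metric names, the 5% tolerance, and the numbers are illustrative and do not come from the source.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return metrics that got worse by more than `tolerance` (lower is better).

    Values are relative changes, e.g. 0.25 means 25% slower than baseline.
    """
    regressions = {}
    for metric, base in baseline.items():
        change = (candidate[metric] - base) / base
        if change > tolerance:
            regressions[metric] = change
    return regressions


# Illustrative numbers only: the change speeds up queries but slows indexing.
baseline = {"query_latency_ms": 120.0, "index_build_s": 40.0}
candidate = {"query_latency_ms": 90.0, "index_build_s": 50.0}

regressions = find_regressions(baseline, candidate)
print(regressions)
```

Running a check like this after every benchmark pass means an improvement on one metric cannot quietly hide a loss on another.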
Adopt an “agent line” and keep raising it.
Ask whether an agent given the same information could solve the problem; if yes, delegate it and reinvest the saved time into deep work, integrations, and reusable skills that push the line upward over time.
Limit concurrency to what you can supervise; organize it intentionally.
Goyal runs ~4–6 “foreground” agents as separate tmux sessions plus a remote long-running environment for heavy workloads, acknowledging a human context limit while still multiplying throughput.
Prefer safe, sandboxed agent environments over unbounded local ‘unhinged’ modes.
He highlights that agent autonomy is far less risky in controlled playgrounds (data/prompt-only contexts) than on a laptop with shell access, encouraging more structured experimentation setups.
WORDS WORTH SAVING
5 quotes
Evals are a methodology for you to say, "This is what success looks like." In my opinion, evals are actually the modern version of a PRD.
— Ankur Goyal
There's no staff engineer who is running as many rigorous benchmarks and trying out different algorithms and analyzing ideas manually than someone who's using, uh, an agent, and even that baseline is just incredible.
— Ankur Goyal
I think that everyone should take a, a hard look in the mirror and reevaluate how they spend their time.
— Ankur Goyal
Product building and code writing is, now looks like carving rather than constructing.
— Ankur Goyal
You might make it really good at one or two things, then you ship it, and then it's not good at something else.
— Ankur Goyal
High-quality AI-generated summary created from speaker-labeled transcript.
