At a glance
WHAT IT’S REALLY ABOUT
Braintrust CEO Explains Enterprise AI Evals, Data, Teams, and Tooling
- Ankur Goyal, CEO of Braintrust, describes how the company evolved from internal tooling into an enterprise platform for AI evals, observability, and prompt development, now used by leading AI-forward companies like Notion, Airtable, and Zapier.
- He details what enterprises are actually doing with LLMs—heavy use of RAG, far less fine-tuning than expected, cautious experiments with agents, and very limited production use of open-source models so far.
- The conversation covers how AI is reshaping data infrastructure (from warehouses and SQL to embeddings-based workflows), engineering stacks (TypeScript over Python, fewer AI-specific frameworks), and organizational structures (product-engineer-led AI platform teams).
- Goyal also shares startup lessons on hiring, customer-obsessed execution, vendor consolidation, and consciously architecting Braintrust—and his own CEO role—around deep, ongoing involvement in coding and product craftsmanship.
IDEAS WORTH REMEMBERING
5 ideas
Evals are a hard but critical bottleneck for serious AI products.
Superficially, evals look like a simple loop over prompts and outputs, but in production—especially with complex systems and agents—companies need fast, consistent, reusable evaluation workflows to iterate and improve quality reliably.
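The "it's just a for loop" framing can be sketched in a few lines; what production teams add on top is caching, concurrency, dataset versioning, and consistent scoring. All names here (`run_eval`, `toy_task`, `exact_match`) are hypothetical illustrations, not Braintrust's API.

```python
# Minimal eval loop: score each (input, expected) case with a task
# function and a scorer, then average. Real eval platforms layer
# reusability, speed, and consistency on top of this core shape.

def run_eval(dataset, task, scorer):
    """Run `task` over each case, score the outputs, and average."""
    results = []
    for case in dataset:
        output = task(case["input"])
        score = scorer(output, case["expected"])
        results.append({"input": case["input"], "output": output, "score": score})
    avg = sum(r["score"] for r in results) / len(results)
    return avg, results

# Toy stand-ins for a model call and a scorer (hypothetical).
def toy_task(prompt):
    return prompt.upper()

def exact_match(output, expected):
    return 1.0 if output == expected else 0.0

avg, results = run_eval(
    [{"input": "hi", "expected": "HI"}, {"input": "ok", "expected": "no"}],
    toy_task,
    exact_match,
)
```

Even this sketch hints at why evals get hard: once `task` is a multi-step agent and `scorer` is itself an LLM, keeping runs fast, reproducible, and comparable is a real engineering problem.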
RAG is mainstream; fine-tuning is niche and often unnecessary.
Roughly half of Braintrust customers’ production use cases involve RAG, while most have moved away from fine-tuning toward instruction-tuned frontier models because fine-tuning is slower, riskier, and harder to get right for many workloads.
Enterprises still favor proprietary frontier models over open source in production.
Despite strong developer interest in open source, Braintrust sees limited production adoption; OpenAI and Anthropic via AWS Bedrock dominate because they offer better UX, faster iteration, and strong ROI, which matter more than raw per-token cost.
Data infrastructure for AI is shifting from warehouses and SQL to embeddings and LLM-based querying.
Traditional data warehouses optimized for structured data and ad hoc SQL don’t fit AI workloads; advanced teams use embeddings and models to mine logs, discover underrepresented cases, and construct better eval and training datasets.
Free-form agents are being dialed back in favor of deterministic control flow with pervasive LLM calls.
Early adopters went deep on ‘fully autonomous’ agents but hit uncontrollable error rates and compounding failures, so they’re returning to architectures where code handles control flow and LLMs are invoked at many well-defined points.
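The pattern Goyal describes can be contrasted with a free-form agent loop in a few lines: code owns branching and retries, and the model is invoked only at fixed, well-defined points. `call_llm` and `handle_ticket` are hypothetical stand-ins, not any specific product's API.

```python
# "Code handles control flow, LLMs at well-defined points": a fixed
# pipeline where the model classifies and drafts, but branching is
# ordinary code rather than model-chosen actions.

def call_llm(instruction, text):
    # Placeholder: a real implementation would call a model API.
    if instruction == "classify":
        return "refund" if "refund" in text else "other"
    return f"Draft reply about: {text}"

def handle_ticket(text):
    category = call_llm("classify", text)    # LLM call #1: classification
    if category == "refund":                 # deterministic branch in code
        draft = call_llm("draft_refund_reply", text)  # LLM call #2: drafting
        return {"route": "billing", "draft": draft}
    return {"route": "triage", "draft": call_llm("draft_ack", text)}

result = handle_ticket("I want a refund for my order")
```

Because each model call sits behind a fixed step, an error stays contained to that step instead of compounding across an open-ended action loop, which is the failure mode that pushed early adopters back toward this architecture.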
WORDS WORTH SAVING
5 quotes
Evals really sound easy—‘oh, it’s just a for loop’—but it is actually a pretty hard problem to do evals well.
— Ankur Goyal
Almost, if not all of our customers, have moved off of fine-tuned models onto instruction-tuned models, and are seeing really good performance.
— Ankur Goyal
A data warehouse is really designed for ad hoc exploration on structured data, which is… neither of those two things is relevant in AI land.
— Ankur Goyal
TypeScript is the language of AI and Python is the language of machine learning.
— Ankur Goyal
People are always gonna push things to their extreme. AI is an inherently non-deterministic thing, and so I think evals are still gonna be there.
— Ankur Goyal
AI-generated summary created from a speaker-labeled transcript.