
No Priors Ep. 85 | CEO of Braintrust Ankur Goyal
Elad Gil (host), Ankur Goyal (guest), Sarah Guo (host)
Braintrust CEO Explains Enterprise AI Evals, Data, Teams, and Tooling
Ankur Goyal, CEO of Braintrust, describes how the company evolved from internal tooling into an enterprise platform for AI evals, observability, and prompt development, now used by leading AI-forward companies like Notion, Airtable, and Zapier.
He details what enterprises are actually doing with LLMs—heavy use of RAG, far less fine-tuning than expected, cautious experiments with agents, and very limited production use of open-source models so far.
The conversation covers how AI is reshaping data infrastructure (from warehouses and SQL to embeddings-based workflows), engineering stacks (TypeScript over Python, fewer AI-specific frameworks), and organizational structures (product-engineer-led AI platform teams).
Goyal also shares startup lessons on hiring, customer-obsessed execution, vendor consolidation, and consciously architecting Braintrust—and his own CEO role—around deep, ongoing involvement in coding and product craftsmanship.
Key Takeaways
Evals are a hard but critical bottleneck for serious AI products.
Superficially, evals look like a simple loop over prompts and outputs, but in production—especially with complex systems and agents—companies need fast, consistent, reusable evaluation workflows to iterate and improve quality reliably.
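The "simple loop" framing can be made concrete. Below is a minimal sketch of an eval harness; the names (`runEval`, `Case`, `exactMatch`) are illustrative, not Braintrust's actual API, and the task function is a stand-in for a real model call.

```typescript
// Minimal eval loop: run each case through a task, score the output,
// and aggregate a mean score.

type Case = { input: string; expected: string };
type Scorer = (output: string, expected: string) => number; // 0..1
type Task = (input: string) => string; // stand-in for an LLM call

function runEval(cases: Case[], task: Task, score: Scorer) {
  const results = cases.map((c) => {
    const output = task(c.input);
    return { ...c, output, score: score(output, c.expected) };
  });
  const mean =
    results.reduce((sum, r) => sum + r.score, 0) / (results.length || 1);
  return { results, mean };
}

const exactMatch: Scorer = (out, exp) => (out.trim() === exp.trim() ? 1 : 0);

// Toy task: uppercase the input, standing in for a model call.
const demo = runEval(
  [
    { input: "hi", expected: "HI" },
    { input: "ok", expected: "no" },
  ],
  (s) => s.toUpperCase(),
  exactMatch,
);
console.log(demo.mean); // 0.5
```

The hard parts Goyal describes live outside this loop: keeping scorers consistent across runs, making the loop fast enough to iterate on, and reusing cases and scorers across a growing set of features and agents.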
RAG is mainstream; fine-tuning is niche and often unnecessary.
Roughly half of Braintrust customers’ production use cases involve RAG, while most have moved away from fine-tuning toward instruction-tuned frontier models because fine-tuning is slower, riskier, and harder to get right for many workloads.
Enterprises still favor proprietary frontier models over open source in production.
Despite strong developer interest in open source, Braintrust sees limited production adoption; OpenAI and Anthropic via AWS Bedrock dominate because they offer better UX, faster iteration, and strong ROI, which matter more than raw per-token cost.
Data infrastructure for AI is shifting from warehouses and SQL to embeddings and LLM-based querying.
Traditional data warehouses optimized for structured data and ad hoc SQL don’t fit AI workloads; advanced teams use embeddings and models to mine logs, discover underrepresented cases, and construct better eval and training datasets.
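One way to picture the embeddings-based workflow: flag log entries whose embedding is far from every known cluster of existing eval data, as candidates for new cases. This is a hedged sketch with toy 2-D vectors; a real pipeline would get embeddings from a model, and `findUnderrepresented` is an illustrative name.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Entries whose best centroid similarity is below the threshold are
// "underrepresented": nothing in the current dataset looks like them.
function findUnderrepresented(
  logs: { id: string; embedding: number[] }[],
  centroids: number[][],
  threshold = 0.8,
): string[] {
  return logs
    .filter(
      (l) => Math.max(...centroids.map((c) => cosine(l.embedding, c))) < threshold,
    )
    .map((l) => l.id);
}

const centroids = [[1, 0], [0, 1]]; // toy cluster centers
const logs = [
  { id: "near-x", embedding: [0.9, 0.1] },
  { id: "odd-one", embedding: [-1, -1] },
];
console.log(findUnderrepresented(logs, centroids)); // ["odd-one"]
```

Note how little of this is expressible in the warehouse's native terms: the "query" is a similarity computation over vectors, not an ad hoc SQL scan over structured columns.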
Free-form agents are being dialed back in favor of deterministic control flow with pervasive LLM calls.
Early adopters went deep on ‘fully autonomous’ agents but hit uncontrollable error rates and compounding failures, so they’re returning to architectures where code handles control flow and LLMs are invoked at many well-defined points.
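The pattern "code owns the control flow, LLMs fill in well-defined steps" can be sketched as ordinary branching code that calls a model at each step. Everything here is illustrative: `callModel` is a stub standing in for a real API, and the ticket-routing scenario is an assumed example, not one from the episode.

```typescript
type CallModel = (prompt: string) => string;

function handleTicket(ticket: string, callModel: CallModel): string {
  // Step 1: classification is an LLM call with a constrained output...
  const category = callModel(`Classify as "billing" or "bug": ${ticket}`);

  // ...but the branching itself is plain, deterministic code, so an
  // unexpected model answer falls through to a safe default instead of
  // compounding into further errors.
  if (category === "billing") {
    return callModel(`Draft a billing reply for: ${ticket}`);
  }
  if (category === "bug") {
    return callModel(`Draft a bug-triage reply for: ${ticket}`);
  }
  return "Escalated to a human agent.";
}

// Stub model for demonstration: classifies everything as billing and
// echoes drafting prompts.
const stub: CallModel = (p) =>
  p.startsWith("Classify") ? "billing" : `[draft] ${p}`;

console.log(handleTicket("I was charged twice", stub));
```

The contrast with a free-form agent is that each model call has a narrow job and a bounded blast radius; a misclassification degrades to escalation rather than an unbounded chain of wrong actions.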
Product engineers and TypeScript are becoming central to AI application development.
Most of Braintrust’s customers now build AI features in TypeScript because product engineers are driving AI innovation and TS’s strong type system makes it better suited for safely handling uncertain, model-generated data shapes.
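A small example of what "safely handling uncertain, model-generated data shapes" looks like in TypeScript: parse the model's JSON as `unknown`, then narrow it with a user-defined type guard before any code relies on its shape. The `Extraction` type and function names are illustrative assumptions.

```typescript
type Extraction = { name: string; amount: number };

// Type guard: narrows unknown to Extraction only if the shape matches.
function isExtraction(v: unknown): v is Extraction {
  return (
    typeof v === "object" &&
    v !== null &&
    typeof (v as Record<string, unknown>).name === "string" &&
    typeof (v as Record<string, unknown>).amount === "number"
  );
}

function parseModelOutput(raw: string): Extraction | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isExtraction(parsed) ? parsed : null;
  } catch {
    return null; // model produced non-JSON; caller decides how to retry
  }
}

console.log(parseModelOutput('{"name":"acme","amount":42}')); // valid object
console.log(parseModelOutput('{"name":"acme"}')); // null: missing field
```

The compiler then enforces that downstream code never touches model output until it has passed through the guard, which is the kind of safety net product engineers lean on when model outputs can drift.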
AI evals themselves are increasingly powered by LLMs running on production logs.
More than half of Braintrust evals are LLM-based; teams run AI and code evaluators over live logs (sometimes with access to PII humans can’t see) to automatically assess quality at scale and continuously surface failures and edge cases.
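Running evaluators over live logs might look like the sketch below: each entry gets a cheap code check plus an LLM-as-judge score, and low-scoring entries surface for review. The judge here is a stub standing in for a real model call, and all names are illustrative.

```typescript
type LogEntry = { id: string; input: string; output: string };
type Judge = (input: string, output: string) => number; // 0..1, LLM-backed

// Score every log entry; return only the ones below the quality floor.
function scoreLogs(logs: LogEntry[], judge: Judge, floor = 0.5) {
  return logs
    .map((l) => {
      const codeScore = l.output.length > 0 ? 1 : 0; // trivial code check
      const judgeScore = judge(l.input, l.output);
      return { id: l.id, score: Math.min(codeScore, judgeScore) };
    })
    .filter((r) => r.score < floor); // surface failures and edge cases
}

// Stub judge: penalizes outputs that merely echo the input.
const stubJudge: Judge = (input, output) => (output === input ? 0 : 1);

const flagged = scoreLogs(
  [
    { id: "a", input: "q1", output: "answer" },
    { id: "b", input: "q2", output: "q2" }, // echoes input, so flagged
  ],
  stubJudge,
);
console.log(flagged.map((f) => f.id)); // ["b"]
```

Because the judge runs inside the pipeline rather than in front of a human, it can evaluate outputs containing PII that reviewers are not permitted to see, which is the access pattern the takeaway describes.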
Notable Quotes
“Evals really sound easy—‘oh, it’s just a for loop’—but it is actually a pretty hard problem to do evals well.”
— Ankur Goyal
“Almost, if not all of our customers, have moved off of fine-tuned models onto instruction-tuned models, and are seeing really good performance.”
— Ankur Goyal
“A data warehouse is really designed for ad hoc exploration on structured data, which is… neither of those two things is relevant in AI land.”
— Ankur Goyal
“TypeScript is the language of AI and Python is the language of machine learning.”
— Ankur Goyal
“People are always gonna push things to their extreme. AI is an inherently non-deterministic thing, and so I think evals are still gonna be there.”
— Ankur Goyal
Questions Answered in This Episode
How should an enterprise decide when RAG is enough versus when to invest in fine-tuning or specialized models for a given use case?
What concrete architectural patterns work best for ‘pervasive AI’ in a product without relying on brittle, fully autonomous agents?
How can a company practically transition from traditional data warehouse-centric analytics to embeddings- and LLM-driven data workflows?
If you’re building an AI platform team from scratch, what early hires and skills mix are most critical for the first 6–12 months?
What are the biggest risks of relying heavily on LLM-based evals, and how can teams validate or calibrate those evaluators over time?
Transcript Preview
(music plays) So today on No Priors, um, we have Ankur Goyal, the co-founder and CEO of Braintrust. Ankur was previously vice president of engineering at SingleStore and was the founder and CEO of Impira, an AI company acquired by Figma. Braintrust is an end-to-end enterprise platform for building AI applications. They have companies like Notion, Airtable, Instacart, Zapier, Vercel, and many more with evals, observability, and prompt development for their AI products. And Braintrust, um, just raised $36 million from Andreessen Horowitz and others. Ankur, thank you so much for joining us today on No Priors.
Very excited to be here.
Can you tell us a little bit more about Braintrust, what the product does and, um, you know, we could talk a little bit about how you got started in this area and AI more generally?
Yeah, for sure. So, um, uh, I have been working on AI f- since what one might now think of as ancient history. Uh, back in 2017 when, uh, we started working on Impira, um, you know, things were totally different, um, but still it was really hard to ship products that work. And so we built tooling internally as we developed our A- AI products, um, to help us evaluate things, collect real user data, use it to do better evals and so on. Um, fast-forward a few years, Figma acquired us, and we actually ended up having exactly the same problems and building pretty much the same tooling. And I thought that was interesting for a few reasons, um, some of which you pointed out, by the way, when we were hanging out and, and chatting about stuff. But one, Impira was kind of pre-LLM. Uh, my time at Figma was post-LLM. But these problems were the same, and I think there's some, you know, longevity that's implied by that. You know, problems that existed pre-LLM probably are gonna exist in LLM land, uh, for a while. Um, and, and the second thing is that, you know, having, having built the same tooling essentially twice, it was clear that there's a pretty consistent need. Um, and so, uh, you know, I have very fond memories of the two of us hanging out and talking to a bunch of folks like, you know, Brian and Mike at Zapier and Simon at Notion and, and, you know, many others. And, uh, you know, I've been in a lot of user interviews over time. I've never seen anything resonate like the early ideas around Braintrust and really everyone's, uh, desire to, to ha- to have a good solution to the eval problem. Um, so we got to work and built a, a, honestly a pretty crappy, uh, initial prototype. Um, but people started using it, and, uh, you know, Braintrust, um, just, uh, over a year later has now kind of iterated from people's, uh, feedback and, you know, complaints and, and ideas into something I think that's, that's really powerful. Um, and yeah, that's how we kind of got started.