Episode Details
EPISODE INFO
- Released
- May 8, 2026
- Duration
- 27m
- Channel
- Claude
EPISODE DESCRIPTION
Most teams shipping AI products can't build evals that predict how a model will actually perform in production. Michele Catasta, President & Head of AI at Replit, shares how his team closed that gap with ViBench — a public vibe-coding benchmark that scores whether the generated app works — and the offline/online evaluation loop behind Replit Agent that turns weeks of engineering into compounding overnight gains. Anthropic's Hannah Moran joins to share what separates evals that look rigorous from ones that actually help teams adopt new models with confidence.
SPEAKERS
Michele Catasta
Guest: President and Head of AI at Replit, presenting on evaluating and improving Replit Agent at scale.
Hannah Moran
Host: Member of Anthropic's Applied AI team who facilitates the discussion and interviews Michele Catasta.
EPISODE SUMMARY
In this episode of Claude, "Evaluating and improving Replit Agent at scale," Michele Catasta and Hannah Moran explore Replit's approach to evaluating and improving coding agents daily. Replit argues that traditional, human-scored offline evals are insufficient for "vibe-coding," where users expect a working app from only a natural-language spec, with no tests or framework constraints.