Evaluating and improving Replit Agent at scale

Most teams shipping AI products can't build evals that predict how a model will actually perform in production. Michele Catasta, President & Head of AI at Replit, shares how his team closed that gap with ViBench — a public vibe-coding benchmark that scores whether the generated app works — and the offline/online evaluation loop behind Replit Agent that turns weeks of engineering into compounding overnight gains. Anthropic's Hannah Moran joins to share what separates evals that look rigorous from ones that actually help teams adopt new models with confidence.

Michele Catasta (guest), Hannah Moran (host)

May 8, 2026 | 27m

EPISODE INFO

Released: May 8, 2026
Duration: 27m
Channel: Claude

SPEAKERS

Michele Catasta
guest
President and Head of AI at Replit, presenting on evaluating and improving Replit Agent at scale.
Hannah Moran
host
Member of Anthropic’s Applied AI team who facilitates the discussion and interviews Michele Catasta.

EPISODE SUMMARY

In this episode from the Claude channel, "Evaluating and improving Replit Agent at scale," Michele Catasta and Hannah Moran discuss Replit’s approach to evaluating and improving coding agents daily. Replit argues that traditional, human-scored offline evals are insufficient for “vibe-coding,” where users expect a working app from only a natural-language spec and no tests or framework constraints.

RELATED EPISODES