Evaluating and improving Replit Agent at scale

Most teams shipping AI products can't build evals that predict how a model will actually perform in production. Michele Catasta, President & Head of AI at Replit, shares how his team closed that gap with ViBench — a public vibe-coding benchmark that scores whether the generated app works — and the offline/online evaluation loop behind Replit Agent that turns weeks of engineering into compounding overnight gains. Anthropic's Hannah Moran joins to share what separates evals that look rigorous from ones that actually help teams adopt new models with confidence.

Michele Catasta (guest), Hannah Moran (host)
May 8, 2026 · 27m

Episode Details

EPISODE INFO

Released: May 8, 2026
Duration: 27m
Channel: Claude

SPEAKERS

  • Michele Catasta

    guest

    President and Head of AI at Replit, presenting on evaluating and improving Replit Agent at scale.

  • Hannah Moran

    host

    Member of Anthropic’s Applied AI team who facilitates the discussion and interviews Michele Catasta.

EPISODE SUMMARY

In this episode of Claude, Michele Catasta and Hannah Moran discuss Replit’s approach to evaluating and improving coding agents daily. Replit argues that traditional, human-scored offline evals are insufficient for “vibe-coding,” where users expect a working app from only a natural-language spec, with no tests or framework constraints.

RELATED EPISODES

Tag Claude in, right where you already work

Coding is no longer the constraint: Scaling devex to teams and agents at Spotify

How to get to production faster with Claude Managed Agents

Caching, harnesses, and advisors: Building on Claude at GitHub scale

How Slack uses Claude for AI search and summaries

Building AI-native at enterprise scale: monday.com, Doctolib, and Delivery Hero
