Claude

Evaluating and improving Replit Agent at scale

Most teams shipping AI products can't build evals that predict how a model will actually perform in production. Michele Catasta, President & Head of AI at Replit, shares how his team closed that gap with ViBench — a public vibe-coding benchmark that scores whether the generated app works — and the offline/online evaluation loop behind Replit Agent that turns weeks of engineering into compounding overnight gains. Anthropic's Hannah Moran joins to share what separates evals that look rigorous from ones that actually help teams adopt new models with confidence.

Michele Catasta (guest) · Hannah Moran (host)
May 8, 2026 · 27m · Watch on YouTube ↗

Episode Details

EPISODE INFO

Released
May 8, 2026
Duration
27m
Channel
Claude
Watch on YouTube ↗


SPEAKERS

  • Michele Catasta

    guest

    President and Head of AI at Replit, presenting on evaluating and improving Replit Agent at scale.

  • Hannah Moran

    host

    Member of Anthropic’s Applied AI team who facilitates the discussion and interviews Michele Catasta.

EPISODE SUMMARY

In this episode of Claude, featuring Michele Catasta and Hannah Moran, "Evaluating and improving Replit Agent at scale" explores Replit's approach to evaluating and improving coding agents daily. Replit argues that traditional, human-scored offline evals are insufficient for "vibe-coding," where users expect a working app from only a natural-language spec, with no tests or framework constraints.

RELATED EPISODES

  • The CLAUDE.md file

  • MCP in Claude Code

  • What's new in Claude Code

  • Building with Claude on Google Cloud

  • The thinking lever

  • Building AI-native: Inside the stacks powering Cognition, Gamma, and Harvey
