
Evaluating and improving Replit Agent at scale

Most teams shipping AI products can't build evals that predict how a model will actually perform in production. Michele Catasta, President & Head of AI at Replit, shares how his team closed that gap with ViBench — a public vibe-coding benchmark that scores whether the generated app works — and the offline/online evaluation loop behind Replit Agent that turns weeks of engineering into compounding overnight gains. Anthropic's Hannah Moran joins to share what separates evals that look rigorous from ones that actually help teams adopt new models with confidence.

Guest: Michele Catasta · Host: Hannah Moran
May 7, 2026 · 27m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

Replit’s approach to evaluating and improving coding agents daily

  1. Replit argues that traditional, human-scored offline evals are insufficient for “vibe-coding,” where users expect a working app from only a natural-language spec and no tests or framework constraints.
  2. They introduce ViBench, an open-source end-to-end benchmark that starts from real PRD-style prompts and uses automated, implementation-agnostic evaluators that interact with the generated app via a browser.
  3. Replit pairs offline gatekeeping (benchmarks to prevent regressions before shipping) with online learning loops (A/B testing plus production trace clustering) to react quickly to failures and prioritize improvements.
  4. Their internal system, Telescope, turns clustered production failures into hypotheses, agent-assisted code changes, and validation via ViBench and/or A/B tests, enabling multiple production releases per day while managing risk (see the clustering sketch after this list).
  5. They emphasize that “taste” and product philosophy still matter because A/B tests often yield mixed signals, and teams must optimize for what their specific user base values (e.g., knowledge workers vs. developers).
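
The trace-clustering step in point 4 might look something like the minimal sketch below. The trace texts, cluster count, and hashing-based embedding are all placeholders: Replit's actual Telescope pipeline is not public, and a production system would embed traces with an LLM embedding model rather than the hashing vectorizer used here so the example runs offline.

```python
# Hypothetical sketch of a Telescope-style trace-clustering pass.
# Assumes failure traces have been reduced to short text summaries.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans

failure_traces = [
    "agent kept editing package.json but never ran npm install",
    "browser check timed out waiting for login form",
    "app crashed on startup: missing DATABASE_URL",
    "agent looped rewriting the same React component",
    "deploy succeeded but homepage returned 500",
]

# 1. Embed each trace (placeholder embedding so the sketch runs offline;
#    a real pipeline would call an embedding model here).
vectors = HashingVectorizer(n_features=256).fit_transform(failure_traces)

# 2. Cluster similar failures so one fix can cover many traces.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# 3. Each cluster becomes a hypothesis to hand to an engineer (or an
#    agent) and later validate via ViBench and/or an A/B test.
for cluster_id in sorted(set(labels)):
    members = [t for t, l in zip(failure_traces, labels) if l == cluster_id]
    print(f"cluster {cluster_id}: {len(members)} traces")
    for trace in members:
        print("  -", trace)
```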

IDEAS WORTH REMEMBERING

5 ideas

Treat evaluation as a continuous engine, not a final gate.

Replit frames evals as an always-on feedback loop driven by production traces and rapid iteration, rather than a periodic score that only decides whether to ship.

Vibe-coding requires end-to-end functional evaluation, not patch-and-tests scoring.

Benchmarks like SWE-bench assume existing repos and tests; Replit’s users often start from empty repos and provide only intent, so the correct metric is whether the resulting app actually works as specified.

Automated evaluators unlock high-frequency iteration on app-building agents.

ViBench replaces human grading with automated, natural-language test plans executed by an evaluator agent that reads the codebase, runs the app, and performs browser-based checks step-by-step.
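
In concrete terms, the browser-based portion of such a check might look like the sketch below. The app URL, selectors, and action list are hypothetical; in ViBench as described, an evaluator agent derives the steps from a natural-language test plan and the generated codebase rather than from a hard-coded list.

```python
# Minimal sketch of a browser-based functional check, in the spirit of the
# ViBench evaluator agent. Here the "test plan" is already translated into
# concrete actions; the URL and selectors are invented for illustration.
from playwright.sync_api import sync_playwright

APP_URL = "http://localhost:3000"  # the generated app, already running

steps = [
    ("fill", "#todo-input", "buy milk"),
    ("click", "#add-button", None),
]

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(APP_URL)
    for action, selector, value in steps:
        if action == "fill":
            page.fill(selector, value)
        elif action == "click":
            page.click(selector)
    # The check: did the app actually do what the spec asked?
    page.wait_for_selector("text=buy milk", timeout=5_000)
    print("PASS: new todo is visible")
    browser.close()
```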

Test “slop-on-slop” explicitly—models degrade when extending their own generated code.

ViBench includes scenarios like “vibe-on-vibe,” where an agent adds features on top of an agent-built MVP; results show this is especially failure-prone, motivating frequent testing between feature additions.
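
A toy harness for that practice could be as simple as the loop below, re-running functional checks after every feature the agent adds. Both run_agent and run_checks are hypothetical stand-ins for a real agent invocation and a ViBench-style browser check.

```python
# Toy harness for the "vibe-on-vibe" failure mode: add features one at a
# time and re-run the functional checks after each step, not only at the end.
feature_prompts = [
    "Build a todo app MVP with add/delete",
    "Add user accounts with login",
    "Add due dates and sorting",
]

def run_agent(prompt: str) -> None:
    print(f"agent building: {prompt}")  # stand-in for a real agent run

def run_checks() -> bool:
    return True  # stand-in for browser-based functional checks

for i, prompt in enumerate(feature_prompts, start=1):
    run_agent(prompt)
    if not run_checks():
        # Catch "slop-on-slop" regressions immediately, while the last
        # change is still small enough to diagnose and revert.
        raise SystemExit(f"regression after feature {i}: {prompt}")
    print(f"feature {i} OK")
```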

Online A/B tests are necessary because offline benchmarks only capture part of real usage.

Replit uses production metrics (run duration, cost, sentiment from prompts, and whether users publish apps) to detect trade-offs and validate changes that benchmarks can’t predict.
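
A per-arm readout over those metrics might be computed along the lines below. The records and field names are invented for illustration, with "published" standing in for the publish signal the talk mentions.

```python
# Sketch of a per-arm A/B readout over hypothetical production runs.
from statistics import mean

runs = [
    {"arm": "control", "duration_s": 410, "cost_usd": 0.31, "published": True},
    {"arm": "control", "duration_s": 380, "cost_usd": 0.28, "published": False},
    {"arm": "treatment", "duration_s": 520, "cost_usd": 0.44, "published": True},
    {"arm": "treatment", "duration_s": 495, "cost_usd": 0.41, "published": True},
]

for arm in ("control", "treatment"):
    group = [r for r in runs if r["arm"] == arm]
    print(
        f"{arm}: "
        f"publish_rate={sum(r['published'] for r in group) / len(group):.0%}, "
        f"mean_duration={mean(r['duration_s'] for r in group):.0f}s, "
        f"mean_cost=${mean(r['cost_usd'] for r in group):.2f}"
    )
# A change can raise publish rate while also raising cost and latency:
# exactly the mixed signals that force a product-taste call.
```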

WORDS WORTH SAVING

5 quotes

So my argument today for this talk will be that we have to fundamentally rethink how we do evaluations.

Michele Catasta

This does not reflect what happens in vibe-coding. As I was mentioning before, users are not writing the test. They often start from a completely empty code base, so there is not a scenario where you can just apply patches. You're building things from the ground up.

Michele Catasta

And today on stage, I'm launching ViBench. It's a new public benchmark for vibe-coding end-to-end that we've worked on at Replit for several months.

Michele Catasta

I don't believe in competing on evaluations. I come from a research background where everything should be open.

Michele Catasta

Don't think of evaluation just as this last check before shipping. It shouldn't be just a Boolean flag; rather, think of this as an engine that allows you to ship a better agent every single day.

Michele Catasta

TOPICS

Vibe-coding vs. traditional coding-agent evaluation · Continuous evaluation loops for production agents · ViBench: PRD-in, app-out benchmark design · Automated, framework-agnostic app testing via browser agents · A/B testing metrics: latency, cost, sentiment, publish rate · Production trace embedding, clustering, and LLM classification · Telescope: agent-assisted debugging and release gating

