Evaluating and improving Replit Agent at scale

Most teams shipping AI products can't build evals that predict how a model will actually perform in production. Michele Catasta, President & Head of AI at Replit, shares how his team closed that gap with ViBench — a public vibe-coding benchmark that scores whether the generated app works — and the offline/online evaluation loop behind Replit Agent that turns weeks of engineering into compounding overnight gains. Anthropic's Hannah Moran joins to share what separates evals that look rigorous from ones that actually help teams adopt new models with confidence.

Michele Catasta (guest), Hannah Moran (host)

May 8, 2026 · 27m

CHAPTERS

Why vibe-coding changes how you evaluate coding agents
Michele Catasta frames Replit Agent’s core challenge: users start from only natural language and expect a working app without specifying frameworks, tests, or structure. This “prompt-to-app” reality breaks many assumptions behind traditional coding-agent evaluation.
From one-off benchmark scores to continuous evaluation in production
He argues evaluation must evolve from occasional, human-scored benchmarks into a continuous system powered by production traces. The goal is to measure what users care about, detect breakages quickly, and steer what to improve next.
Replit’s two-pillar evaluation system: offline gates + online learning loops
Replit combines offline benchmarks (as a release gate) with online A/B testing and trace analysis (as the engine for daily iteration). The offline pillar prevents regressions; the online pillar forces rapid reaction post-ship.
Why SWE-bench-style testing doesn’t match vibe-coding
Michele explains the mismatch between patch-based benchmarks (SWE-bench, HumanEval) and greenfield app generation. Vibe-coding is judged by whether the app works for the requested behavior, not whether unit tests pass on an existing repo.
Launching ViBench: an end-to-end public benchmark for vibe-coding
He introduces ViBench, built from real Replit user traces and designed for fully automated evaluation. The benchmark takes a PRD-like spec as input and measures whether an agent can build a functional application end-to-end from scratch.
ViBench task modes: from single-shot builds to “slop on slop” complexity
ViBench supports multiple scenario pairings to stress different agent behaviors, including extending a reference app, extending an agent-built MVP, and parallel task decomposition + merge. These modes reflect real product workflows and compound-error dynamics.
The hardest part: automated, framework-agnostic grading via app interaction
Instead of relying on fixed test suites, Replit’s evaluator reads the codebase, opens the app in a browser, and executes a natural-language test plan step-by-step. This makes grading agnostic to language/framework and suitable for greenfield apps.
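As a rough illustration of step-by-step grading against a natural-language test plan, the sketch below runs each step through a caller-supplied `perform` function and records the outcome. Everything here is hypothetical: `StepResult`, `run_test_plan`, `pass_rate`, the stop-at-first-failure rule, and the scoring formula are assumptions for illustration, not Replit's evaluator. In a real setup `perform` would be an agent driving a browser; here it is a stub.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    step: str
    passed: bool
    note: str = ""


def run_test_plan(steps: List[str], perform: Callable[[str], bool]) -> List[StepResult]:
    """Execute plan steps in order; `perform` is a stand-in for a browser-driving agent."""
    results: List[StepResult] = []
    for step in steps:
        try:
            ok, note = bool(perform(step)), ""
        except Exception as exc:  # a crashed step counts as a failure
            ok, note = False, f"error: {exc}"
        results.append(StepResult(step, ok, note))
        if not ok:
            break  # assumption: later steps depend on earlier ones
    return results


def pass_rate(results: List[StepResult], total_steps: int) -> float:
    """Fraction of planned steps that passed (steps never reached count as failed)."""
    return sum(r.passed for r in results) / total_steps if total_steps else 0.0


# Hypothetical plan and stub executor, for illustration only.
plan = ["Open the home page", "Add a todo item", "Delete the todo item"]
results = run_test_plan(plan, lambda step: "Delete" not in step)
score = pass_rate(results, len(plan))
```

Because the plan is expressed in natural language and the executor only interacts with the running app, nothing in this loop depends on the app's language or framework.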
ViBench release + early findings: frontier vs open weights, and self-extension is hard
Michele shares initial headline results and invites the community to adopt the benchmark so model builders optimize for vibe-coding. Two insights stand out: frontier models outperform open-weight models significantly, and extending agent-generated code is especially difficult.
Online evaluation at scale: A/B tests and product-native metrics
Offline benchmarks provide limited signal compared to millions of daily live sessions. Replit relies on A/B tests and rich instrumentation to understand cost, latency, user sentiment, and downstream success signals like publishing an app.
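One common way to read an A/B test on a binary success signal such as "the user published an app" is a two-proportion z-test. The sketch below is a generic textbook version, not Replit's analysis pipeline; the function name and the counts in the example are made up for illustration.

```python
import math
from typing import Tuple


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Illustrative counts only: 1000 sessions per arm, publish events as successes.
z, p = two_proportion_z(success_a=100, n_a=1000, success_b=150, n_b=1000)
```

A test like this covers only one metric at a time; cost, latency, and sentiment would each need their own comparison, and the outcomes can disagree.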
Trace clustering to discover failures and verify fixes over time
To manage overwhelming feedback volume and long-tail failures, Replit clusters production traces by embedding failure summaries and using LLM-based semantic classification. Clusters are retrained nightly to track shifting agent versions and confirm whether fixes eliminate a failure mode.
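The clustering idea can be sketched with a toy stand-in: the code below replaces learned embeddings with bag-of-words vectors and replaces LLM-based classification with a greedy cosine-similarity threshold. The function names, the threshold value, and the sample summaries are all assumptions; a production system would use a real embedding model and a proper clustering algorithm.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: word counts (stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(summaries: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Greedy clustering: join the most similar cluster above `threshold`, else start a new one."""
    clusters: List[Dict] = []
    for i, summary in enumerate(summaries):
        vec = embed(summary)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # running sum acts as an unnormalized centroid
            best["members"].append(i)
    return [c["members"] for c in clusters]


# Hypothetical failure summaries, for illustration only.
summaries = [
    "agent started before environment was ready",
    "agent started before environment finished initializing",
    "database migration failed on publish",
]
groups = cluster(summaries)
```

Rerunning the same procedure on each night's traces and comparing cluster sizes over time is one simple way to check whether a fix has actually shrunk a failure mode.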
Telescope: an agent-assisted continuous improvement loop
Michele describes Telescope, Replit’s internal system that turns observed failures into proposed fixes, validates them against ViBench, and then decides between shipping, A/B testing, or iterating further. It automates much of the engineering loop while keeping human oversight for key decisions.
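The three-way outcome described above (ship, A/B test, or iterate) can be pictured as a small decision rule on benchmark scores. The sketch below is purely hypothetical: the `Decision` enum, the `decide` function, and both thresholds are invented for illustration, and the summary does not say how Telescope actually makes this call or where humans step in.

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    AB_TEST = "ab_test"
    ITERATE = "iterate"


def decide(baseline_score: float, candidate_score: float,
           clear_win: float = 0.02, tolerance: float = 0.005) -> Decision:
    """Map an offline benchmark delta to a next action (thresholds are assumptions)."""
    delta = candidate_score - baseline_score
    if delta < -tolerance:
        return Decision.ITERATE   # regression on the offline gate: keep working on the fix
    if delta >= clear_win:
        return Decision.SHIP      # clear offline improvement
    return Decision.AB_TEST       # ambiguous offline signal: let production traffic decide


# Illustrative scores only.
outcome = decide(baseline_score=0.70, candidate_score=0.73)
```

In practice a rule like this would more likely produce a recommendation for a human reviewer than trigger a release on its own.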
Concrete production example: environment readiness causing agent ‘debug tangents’
A long-tail regression occurred when the agent began working before the execution environment fully initialized, leading it to attempt inconsistent self-debugging. Semantic trace clustering surfaced the issue despite low visibility in standard log dashboards, enabling a quick patch.
Humans still matter: hypothesis, prioritization, and product ‘taste’
Even with heavy automation, Michele emphasizes that humans must choose what to optimize, which failures matter, and how to interpret ambiguous A/B outcomes. “Taste” is developed over time and must reflect the actual user base—especially when builders are more technical than their users.
Q&A: Why open-source ViBench, how to build Telescope-like systems, and developing taste
Hannah Moran (Anthropic) asks about making ViBench public, lessons from building Telescope, and how Replit developed product taste. Michele argues public evals help everyone, modern long-context models unlock better trace analysis, and taste comes from learning the user’s needs over time.

