Evaluating and improving Replit Agent at scale

Most teams shipping AI products can't build evals that predict how a model will actually perform in production. Michele Catasta, President & Head of AI at Replit, shares how his team closed that gap with ViBench — a public vibe-coding benchmark that scores whether the generated app works — and the offline/online evaluation loop behind Replit Agent that turns weeks of engineering into compounding overnight gains. Anthropic's Hannah Moran joins to share what separates evals that look rigorous from ones that actually help teams adopt new models with confidence.

Michele Catasta (guest), Hannah Moran (host)
May 8, 2026 · 27m

CHAPTERS

  1. 0:21 – 2:23

    Why vibe-coding needs new evaluation methods

    Michele Catasta frames Replit Agent’s core challenge: users want a working app from only a natural-language spec, with no framework choices, tests, or existing codebase. This shifts what “success” means and makes traditional coding-agent evals insufficient.

    • Replit serves non-developer knowledge workers expecting prompt-to-app outcomes
    • Users provide no tests, frameworks, or implementation constraints
    • Evaluation must reflect end-to-end product success, not just correct code edits
    • Rapid model and product changes create “shifting ground” for eval stability
  2. 2:23 – 3:23

    From one-off scores to continuous evaluation in production

    He argues evaluation must evolve from single, human-scored checkpoints to a continuous system that learns from millions of real production traces. The goal is to track daily improvements and detect breakages immediately.

    • Traditional evals produce a single score and a ship/no-ship conclusion
    • Production usage generates massive, valuable signal via traces
    • Need evals that optimize user outcomes, surface what breaks, and guide what to ship next
    • Continuous evaluation becomes part of the daily development loop
  3. 3:23 – 4:55

    Two-pillar eval system: offline gate + online learning loop

    Replit uses a two-part approach: offline benchmarks as a pre-release gatekeeper and online evaluations to react quickly post-release. Together they create a tight “change → eval → ship → observe → iterate” cycle.

    • Offline benchmarks act like a Boolean gate before shipping
    • Online pillar includes A/B testing and production-trace analysis after shipping
    • Rapid iteration: multiple releases per day to millions of users
    • Feedback from both pillars informs the next code/prompt/tooling changes
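A minimal sketch of the offline "Boolean gate" idea described above, assuming the gate compares a candidate build's benchmark score against a baseline. The names `GateResult` and `offline_gate` and the `max_regression` threshold are hypothetical; the talk summary does not specify how Replit computes the gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reason: str


def offline_gate(baseline_score: float, candidate_score: float,
                 max_regression: float = 0.02) -> GateResult:
    """Return a pass/fail decision for a candidate build.

    The candidate passes unless its benchmark score falls more than
    `max_regression` below the baseline. The threshold is an assumption
    made for illustration, not a value from the talk.
    """
    delta = candidate_score - baseline_score
    if delta < -max_regression:
        return GateResult(False, f"regressed by {-delta:.3f}")
    return GateResult(True, f"delta {delta:+.3f}")
```

A gate like this only answers ship or no-ship before release; the online pillar (A/B tests and trace analysis) covers what happens after the change reaches users.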
  4. 4:55 – 5:55

    Why SWE-bench-style benchmarks don’t fit vibe-coding

    Michele explains the mismatch between patch-and-test benchmarks (SWE-bench, HumanEval) and Replit’s greenfield app-building reality. Vibe-coding requires measuring whether the app does what the user asked, not whether repo tests pass.

    • SWE-bench assumes existing repos, patches, and tests
    • Vibe-coding often starts from an empty repo with no tests
    • Key metric becomes functional correspondence to the PRD
    • Motivates a new benchmark tailored to end-to-end app creation
  5. 5:55 – 7:26

    Launching ViBench: an end-to-end public benchmark built from real PRDs

    He introduces ViBench, a new open-source benchmark designed specifically for vibe-coding. It uses real Replit user traces as PRD inputs and evaluates full app builds from scratch with automated graders.

    • Input is a PRD-like long natural-language specification
    • Built from 20 real-world Replit traces (not synthetic prompts)
    • Harness builds apps end-to-end from an empty repository
    • Automated evaluators enable frequent, repeatable runs (e.g., per PR merge)
  6. 7:26 – 8:57

    ViBench scenarios: from single-shot builds to ‘vibe-on-vibe’ and parallel agent workflows

    The benchmark supports multiple “pairings” that vary complexity: building from scratch, extending a reference app, extending agent-generated code, and decomposing tasks to run agents in parallel before merging. This targets realistic, failure-prone workflows.

    • Single-shot ‘zero-to-one’ from PRD
    • ‘Vibe-on-ref’: add features to a working reference implementation
    • ‘Vibe-on-vibe’: extend unverified agent-generated code (‘slop on slop’)
    • Parallel decomposition + merge mirrors Replit Agent 4’s workflow
    • Extensions can include starting from buggy apps to test robustness
  7. 8:57 – 10:28

    How ViBench grades without knowing the stack: browser-based automated app testing

    Michele details the hardest part: grading apps when languages/frameworks vary and implementations are unconstrained. Replit’s evaluator reads the codebase, launches the app, and executes a natural-language test plan through browser actions to produce a score.

    • Evaluator must be framework- and language-agnostic
    • Agent reads repo, opens a browser, and interacts with the running app
    • Test plans are expressed in natural language steps (login, click toggles, etc.)
    • Failures are aggregated into a score; enables greenfield evaluation unlike fixed-repo benchmarks
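The grading flow above can be sketched as follows, assuming the score is simply the fraction of natural-language steps that pass. The summary says only that failures are aggregated into a score, so that formula is an assumption. `Step`, `StepResult`, and `grade` are hypothetical names, and `run_step` stands in for the browser-driving evaluator agent, which is not implemented here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    # Natural-language instruction, e.g. "Log in as a new user".
    description: str


@dataclass
class StepResult:
    step: Step
    passed: bool


def grade(plan: List[Step], run_step: Callable[[Step], bool]) -> float:
    """Run every step of the test plan and return the fraction that passed.

    `run_step` is a placeholder for an agent that performs the step in a
    browser against the running app and reports success or failure.
    """
    if not plan:
        raise ValueError("test plan is empty")
    results = [StepResult(step, run_step(step)) for step in plan]
    return sum(r.passed for r in results) / len(results)
```

Because the plan is plain text and the runner only touches the app through the browser, nothing in this shape depends on the language or framework the app was built with.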
  8. 10:28 – 11:59

    ViBench results and what they imply for model builders

    He shares early findings: frontier models significantly outperform open-weights models, and models struggle most when extending their own generated code. These results motivate testing between feature additions and encourage community optimization on the benchmark.

    • ~2x performance gap: frontier vs open-weights models
    • ‘Vibe-on-vibe’ is the most challenging scenario
    • Advocacy: add testing between feature iterations to avoid compounding errors
    • Open-sourcing aims to steer broader community progress on vibe-coding
  9. 11:59 – 14:30

    Online evaluation at scale: A/B tests and product-centric metrics

    Offline benchmarks cover limited apps, while production yields millions of unscripted sessions daily. Replit relies on A/B testing with rich instrumentation to measure cost, run duration, sentiment, and strong success signals like publishing apps.

    • Online signal volume dwarfs offline benchmark coverage
    • A/B testing keeps teams honest about real user impact
    • Metrics include agent run duration, cost, and user sentiment from prompts
    • Publishing an app is treated as a strong positive outcome signal
    • A/B results are often mixed and require interpretation
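One common way to read an A/B result on a binary outcome such as "the user published an app" is a two-proportion z-test. The sketch below is a standard textbook version, not Replit's method; the function name and arguments are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def publish_rate_z_test(pub_a: int, n_a: int, pub_b: int, n_b: int):
    """Compare publish rates of control (a) and treatment (b).

    Returns (rate difference b - a, z statistic, two-sided p-value),
    using the pooled-variance two-proportion z-test.
    """
    p_a, p_b = pub_a / n_a, pub_b / n_b
    pooled = (pub_a + pub_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value
```

A significant lift in publish rate can still arrive together with higher cost or longer runs, which is one reason the results are described as mixed and in need of interpretation.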
  10. 14:30 – 16:31

    Trace clustering to surface long-tail failures (semantic, not regex)

    To make A/B iteration actionable, Replit clusters production traces to separate nominal behavior from problematic long-tail patterns. They embed failure summaries and use LLM-driven semantic classification, retraining clusters nightly to match rapidly changing agent versions.

    • Clustering identifies both normal behavior and rare but important failures
    • Failure summaries are embedded and clustered by type
    • LLM classification captures semantic similarity beyond deterministic logs
    • Clusters must be retrained nightly due to many daily agent versions
    • Clusters help verify fixes by observing whether a failure cluster disappears
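A toy sketch of grouping failure summaries by similarity. Replit is described as using embeddings and LLM-driven semantic classification; here a bag-of-words vector stands in for the embedding model and a greedy threshold rule stands in for the clustering algorithm, so this illustrates only the shape of the pipeline. All names and the `threshold` value are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag of words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(summaries: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Greedy clustering: each summary joins the first cluster whose
    seed (first member) is at least `threshold` similar, else starts
    a new cluster."""
    clusters: List[List[str]] = []
    seeds: List[Counter] = []
    for summary in summaries:
        vec = embed(summary)
        for i, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                clusters[i].append(summary)
                break
        else:
            seeds.append(vec)
            clusters.append([summary])
    return clusters
```

Word overlap misses paraphrases that a real embedding model or LLM classifier would catch, which is the gap the talk points to when it contrasts semantic clustering with regex-style matching.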
  11. 16:31 – 18:01

    Telescope: the automated improvement loop from discovery to shipping

    Michele outlines Telescope, Replit’s internal system that turns clustered failure insights into code changes and validates them through ViBench and/or A/B tests. It automates much of the iteration cycle while keeping humans in control of key decisions.

    • Discover problems via clustering and production signals
    • Auto-generate code changes/PRs using an agent with trace + log context
    • Re-run ViBench as a regression ‘litmus test’ before shipping
    • Use A/B tests for changes with ambiguous trade-offs; ship clear wins directly
    • Iterate with additional tests/PRs until outcomes stabilize
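The routing rule in the steps above (regression check first, then ship clear wins and A/B test ambiguous ones) can be sketched as a small decision function. The definition of a "clear win" used here, at least one metric improved and none worsened, is an assumption for illustration, as are the names `Decision` and `route_change`. Each delta is signed so that positive means better.

```python
from enum import Enum
from typing import Dict


class Decision(Enum):
    REJECT = "reject"
    SHIP = "ship"
    AB_TEST = "ab_test"


def route_change(passed_regression: bool, deltas: Dict[str, float]) -> Decision:
    """Decide what to do with a proposed change.

    `passed_regression` is the outcome of the benchmark re-run.
    `deltas` maps metric name to expected change, positive = better.
    """
    if not passed_regression:
        return Decision.REJECT
    improved = any(d > 0 for d in deltas.values())
    worsened = any(d < 0 for d in deltas.values())
    if improved and not worsened:
        return Decision.SHIP  # clear win: nothing got worse
    return Decision.AB_TEST   # trade-off or no measurable effect
```

In the system described, a human still makes the final call on ambiguous cases; a function like this would only sort changes into queues.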
  12. 18:01 – 19:32

    Concrete example: catching environment setup timing regressions via clustering

    He describes a long-tail issue where the agent began acting before the execution environment finished setting up, triggering non-deterministic debugging tangents. Traditional log graphing didn’t clearly reveal it, but semantic trace clustering surfaced it quickly and enabled a fast fix.

    • Setup time degraded in a long tail; agent started too early
    • Agent attempted to “fix” the environment, causing varied debugging traces
    • Non-determinism made it hard to detect via simple log dashboards
    • Clustering surfaced the pattern; a patch fixed it without needing an A/B test
  13. 19:32 – 21:47

    Humans still matter: hypothesis, prioritization, and product ‘taste’

    Michele emphasizes that despite automation, human judgment is essential: deciding what to fix, forming hypotheses, shaping the optimization target (cost vs speed vs UX), and making final calls when A/B tests are ambiguous.

    • Production generates overwhelming candidate issues; prioritization is critical
    • Teams form hypotheses before letting agents propose fixes
    • Product philosophy shapes what metrics to optimize and how
    • Ambiguous A/B outcomes often require human decision-making
  14. 21:47 – 27:42

    Q&A: why open-source ViBench, how to build Telescope-like systems, and developing ‘taste’

    In discussion with Hannah Moran (Anthropic), Michele explains the motivation to open-source ViBench and encourage community contributions. They cover practical advice for building trace-analysis systems today (enabled by long-context reasoning) and how teams develop product-aligned ‘taste’ as agents become more capable.

    • Open-source evals benefit models, agents, and products; don’t compete on eval secrecy
    • Modern long-context models can ingest full traces for higher-quality debugging feedback
    • Combine traces with product feedback and platform instrumentation (e.g., Datadog)
    • Clustering reduces overwhelm and helps focus on high-volume recurring issues
    • ‘Taste’ evolves with agent capability and must reflect the actual user base (non-coders vs devs)
