CHAPTERS
- 0:21 – 2:23
Why vibe-coding needs new evaluation methods
Michele Catasta frames Replit Agent’s core challenge: users want a working app from only a natural-language spec, with no framework choices, tests, or existing codebase. This shifts what “success” means and makes traditional coding-agent evals insufficient.
- •Replit serves non-developer knowledge workers expecting prompt-to-app outcomes
- •Users provide no tests, frameworks, or implementation constraints
- •Evaluation must reflect end-to-end product success, not just correct code edits
- •Rapid model and product changes create “shifting ground” for eval stability
- 2:23 – 3:23
From one-off scores to continuous evaluation in production
He argues evaluation must evolve from single, human-scored checkpoints to a continuous system that learns from millions of real production traces. The goal is to track daily improvements and detect breakages immediately.
- •Traditional evals produce a single score and a ship/no-ship conclusion
- •Production usage generates massive, valuable signal via traces
- •Need evals that optimize user outcomes, surface what breaks, and guide what to ship next
- •Continuous evaluation becomes part of the daily development loop
- 3:23 – 4:55
Two-pillar eval system: offline gate + online learning loop
Replit uses a two-part approach: offline benchmarks as a pre-release gatekeeper and online evaluations to react quickly post-release. Together they create a tight “change → eval → ship → observe → iterate” cycle.
- •Offline benchmarks act like a Boolean gate before shipping
- •Online pillar includes A/B testing and production-trace analysis after shipping
- •Rapid iteration: multiple releases per day to millions of users
- •Feedback from both pillars informs the next code/prompt/tooling changes
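As a minimal sketch of the offline "Boolean gate" idea, the snippet below blocks a release if any benchmark score falls more than a tolerance below the currently shipped baseline. The names `BenchmarkResult` and `offline_gate`, the scenario labels, the scores, and the 0.02 tolerance are all illustrative assumptions, not Replit's actual gate.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    name: str
    score: float     # candidate build's score, 0.0 to 1.0
    baseline: float  # score of the currently shipped version


def offline_gate(results, max_regression=0.02):
    """Return True only if no benchmark regresses by more than max_regression.

    The tolerance is a hypothetical knob; the talk does not specify one.
    """
    return all(r.score >= r.baseline - max_regression for r in results)


# Made-up numbers for illustration only.
results = [
    BenchmarkResult("zero_to_one", score=0.71, baseline=0.70),
    BenchmarkResult("vibe_on_vibe", score=0.48, baseline=0.52),
]
print(offline_gate(results))  # the vibe_on_vibe drop exceeds the tolerance
```

A gate like this only answers ship/no-ship; the online pillar is still needed to see how a change behaves on real sessions.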
- 4:55 – 5:55
Why SWE-bench-style benchmarks don’t fit vibe-coding
Michele explains the mismatch between patch-and-test benchmarks (SWE-bench, HumanEval) and Replit’s greenfield app-building reality. Vibe-coding requires measuring whether the app does what the user asked, not whether repo tests pass.
- •SWE-bench assumes existing repos, patches, and tests
- •Vibe-coding often starts from an empty repo with no tests
- •Key metric becomes functional correspondence to the PRD
- •Motivates a new benchmark tailored to end-to-end app creation
- 5:55 – 7:26
Launching ViBench: an end-to-end public benchmark built from real PRDs
He introduces ViBench, a new open-source benchmark designed specifically for vibe-coding. It uses real Replit user traces as PRD inputs and evaluates full app builds from scratch with automated graders.
- •Input is a PRD-like long natural-language specification
- •Built from 20 real-world Replit traces (not synthetic prompts)
- •Harness builds apps end-to-end from an empty repository
- •Automated evaluators enable frequent, repeatable runs (e.g., per PR merge)
- 7:26 – 8:57
ViBench scenarios: from single-shot builds to ‘vibe-on-vibe’ and parallel agent workflows
The benchmark supports multiple “pairings” that vary complexity: building from scratch, extending a reference app, extending agent-generated code, and decomposing tasks to run agents in parallel before merging. This targets realistic, failure-prone workflows.
- •Single-shot ‘zero-to-one’ from PRD
- •‘Vibe-on-ref’: add features to a working reference implementation
- •‘Vibe-on-vibe’: extend unverified agent-generated code (‘slop on slop’)
- •Parallel decomposition + merge mirrors Replit Agent 4’s workflow
- •Extensions can include starting from buggy apps to test robustness
- 8:57 – 10:28
How ViBench grades without knowing the stack: browser-based automated app testing
Michele details the hardest part: grading apps when languages/frameworks vary and implementations are unconstrained. Replit’s evaluator reads the codebase, launches the app, and executes a natural-language test plan through browser actions to produce a score.
- •Evaluator must be framework- and language-agnostic
- •Agent reads repo, opens a browser, and interacts with the running app
- •Test plans are expressed in natural language steps (login, click toggles, etc.)
- •Step failures are aggregated into a score, which makes greenfield evaluation possible where fixed-repo benchmarks cannot apply
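One way to picture the final aggregation step: each natural-language test step ends up with a pass/fail outcome, and the outcomes are combined into a single score. The sketch below assumes an equal-weight pass fraction; the talk does not say how ViBench weights steps, and `StepResult` and `score_test_plan` are hypothetical names. The browser-driving evaluator that produces the outcomes is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    description: str  # natural-language step from the test plan
    passed: bool      # outcome reported by the (not shown) browser evaluator


def score_test_plan(steps):
    """Equal-weight pass fraction; returns 0.0 for an empty plan (an assumption)."""
    if not steps:
        return 0.0
    return sum(s.passed for s in steps) / len(steps)


# Illustrative plan; steps and outcomes are invented.
plan = [
    StepResult("Open the app and log in with the test account", True),
    StepResult("Click the dark-mode toggle", True),
    StepResult("Create a new item and confirm it appears in the list", False),
    StepResult("Reload the page and confirm the item persists", False),
]
score = score_test_plan(plan)
print(score)
```

Because the steps describe user-visible behavior rather than code, the same plan can be scored against apps built on any stack.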
- 10:28 – 11:59
ViBench results and what they imply for model builders
He shares early findings: frontier models significantly outperform open-weights models, and models struggle most when extending their own generated code. These results motivate testing between feature additions and encourage community optimization on the benchmark.
- •~2x performance gap: frontier vs open-weights models
- •‘Vibe-on-vibe’ is the most challenging scenario
- •Advocacy: add testing between feature iterations to avoid compounding errors
- •Open-sourcing aims to steer broader community progress on vibe-coding
- 11:59 – 14:30
Online evaluation at scale: A/B tests and product-centric metrics
Offline benchmarks cover only a limited set of apps, while production yields millions of unscripted sessions daily. Replit relies on A/B testing with rich instrumentation to measure cost, run duration, and user sentiment, along with strong success signals such as publishing an app.
- •Online signal volume dwarfs offline benchmark coverage
- •A/B testing keeps teams honest about real user impact
- •Metrics include agent run duration, cost, and user sentiment inferred from prompts
- •Publishing an app is treated as a strong positive outcome signal
- •A/B results are often mixed and require interpretation
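To illustrate how a success signal such as the publish rate might be compared across A/B arms, here is a standard two-proportion z-test using only the Python standard library. This is a textbook technique, not something the talk attributes to Replit, and the session counts are invented.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled variance)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical arms: 1000 sessions each, 100 vs 130 published apps.
z, p = two_proportion_z(100, 1000, 130, 1000)
print(f"z={z:.2f}, p={p:.3f}")
```

A significant lift on one metric can still coincide with a regression on cost or run duration, which is why mixed results need interpretation rather than a single threshold.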
- 14:30 – 16:31
Trace clustering to surface long-tail failures (semantic, not regex)
To make A/B iteration actionable, Replit clusters production traces to separate nominal behavior from problematic long-tail patterns. They embed failure summaries and use LLM-driven semantic classification, retraining clusters nightly to match rapidly changing agent versions.
- •Clustering identifies both normal behavior and rare but important failures
- •Failure summaries are embedded and clustered by type
- •LLM classification captures semantic similarity beyond deterministic logs
- •Clusters must be retrained nightly due to many daily agent versions
- •Clusters help verify fixes by observing whether a failure cluster disappears
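The clustering idea can be sketched with a toy stand-in: bag-of-words vectors replace a learned embedding model, and a greedy cosine-similarity threshold replaces whatever clustering and LLM classification Replit actually runs. The function names, the 0.5 threshold, and the failure summaries are all assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(summaries, threshold=0.5):
    """Greedy clustering: join the first cluster whose seed is similar enough,
    otherwise start a new cluster seeded by this summary."""
    clusters = []  # list of (seed_vector, members)
    for s in summaries:
        v = embed(s)
        for seed, members in clusters:
            if cosine(seed, v) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((v, [s]))
    return [members for _, members in clusters]


# Invented failure summaries.
groups = cluster([
    "agent ran install before environment setup finished",
    "agent ran commands before environment setup finished",
    "login page returned 500 error after deploy",
    "login page returned 404 error after deploy",
])
print([len(g) for g in groups])
```

Re-running the same grouping on traces collected after a fix gives a simple check: if the fix worked, the corresponding cluster should shrink or vanish.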
- 16:31 – 18:01
Telescope: the automated improvement loop from discovery to shipping
Michele outlines Telescope, Replit’s internal system that turns clustered failure insights into code changes and validates them through ViBench and/or A/B tests. It automates much of the iteration cycle while keeping humans in control of key decisions.
- •Discover problems via clustering and production signals
- •Auto-generate code changes/PRs using an agent with trace + log context
- •Re-run ViBench as a regression ‘litmus test’ before shipping
- •Use A/B tests for changes with ambiguous trade-offs; ship clear wins directly
- •Iterate with additional tests/PRs until outcomes stabilize
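The routing described in the bullets above can be sketched as a small decision function: a change that fails the ViBench regression check is rejected, a change whose expected effects are all non-negative with at least one improvement ships directly, and anything with a trade-off goes to an A/B test. `Decision`, `route_change`, and the idea of passing in signed metric deltas are hypothetical; the talk does not describe Telescope's internals at this level, and humans remain the final decision-makers.

```python
from enum import Enum


class Decision(Enum):
    REJECT = "reject"
    SHIP = "ship"
    AB_TEST = "ab_test"


def route_change(passes_vibench, metric_deltas):
    """Route a candidate change.

    metric_deltas maps a metric name to an estimated change, signed so that
    positive always means "better" (an assumed convention).
    """
    if not passes_vibench:
        return Decision.REJECT
    improved = any(d > 0 for d in metric_deltas.values())
    regressed = any(d < 0 for d in metric_deltas.values())
    if improved and not regressed:
        return Decision.SHIP  # clear win
    return Decision.AB_TEST   # ambiguous trade-off or no measurable effect


# Invented estimates for illustration.
print(route_change(True, {"success_rate": 0.03, "cost": 0.0}))
print(route_change(True, {"success_rate": 0.03, "cost": -0.10}))
print(route_change(False, {"success_rate": 0.05}))
```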
- 18:01 – 19:32
Concrete example: catching environment setup timing regressions via clustering
He describes a long-tail issue where the agent began acting before the execution environment finished setting up, triggering non-deterministic debugging tangents. Traditional log graphing didn’t clearly reveal it, but semantic trace clustering surfaced it quickly and enabled a fast fix.
- •Setup time degraded in a long tail; agent started too early
- •Agent attempted to “fix” environment, causing varied debugging traces
- •Non-determinism made it hard to detect via simple log dashboards
- •Clustering surfaced the pattern; a patch fixed it without needing an A/B test
- 19:32 – 21:47
Humans still matter: hypothesis, prioritization, and product ‘taste’
Michele emphasizes that despite automation, human judgment is essential: deciding what to fix, forming hypotheses, shaping the optimization target (cost vs speed vs UX), and making final calls when A/B tests are ambiguous.
- •Production generates overwhelming candidate issues; prioritization is critical
- •Teams form hypotheses before letting agents propose fixes
- •Product philosophy shapes what metrics to optimize and how
- •Ambiguous A/B outcomes often require human decision-making
- 21:47 – 27:42
Q&A: why open-source ViBench, how to build Telescope-like systems, and developing ‘taste’
In discussion with Hannah Moran (Anthropic), Michele explains the motivation to open-source ViBench and encourage community contributions. They cover practical advice for building trace-analysis systems today (enabled by long-context reasoning) and how teams develop product-aligned ‘taste’ as agents become more capable.
- •Open-source evals benefit models, agents, and products; don’t compete on eval secrecy
- •Modern long-context models can ingest full traces for higher-quality debugging feedback
- •Combine traces with product feedback and platform instrumentation (e.g., Datadog)
- •Clustering reduces overwhelm and helps focus on high-volume recurring issues
- •‘Taste’ evolves with agent capability and must reflect the actual user base (non-coders vs devs)
