At a glance
WHAT IT’S REALLY ABOUT
Replit’s approach to evaluating and improving coding agents daily
- Replit argues that traditional, human-scored offline evals are insufficient for “vibe-coding,” where users expect a working app from only a natural-language spec and no tests or framework constraints.
- They introduce ViBench, an open-source end-to-end benchmark that starts from real PRD-style prompts and uses automated, implementation-agnostic evaluators that interact with the generated app via a browser.
- Replit pairs offline gatekeeping (benchmarks to prevent regressions before shipping) with online learning loops (A/B testing plus production trace clustering) to react quickly to failures and prioritize improvements.
- Their internal system, Telescope, turns clustered production failures into hypotheses, agent-assisted code changes, and validation via ViBench and/or A/B tests, enabling multiple production releases per day while managing risk (a minimal sketch of the clustering step follows this list).
- They emphasize that “taste” and product philosophy still matter because A/B tests often yield mixed signals, and teams must optimize for what their specific user base values (e.g., knowledge workers vs. developers).
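As a rough illustration of the failure-clustering step in a Telescope-style loop: every name below is an assumption for this sketch, not Replit's internals, and it presumes failure traces have already been embedded as vectors.

```python
# Hypothetical sketch of clustering production failure traces to pick the
# next fix hypothesis; illustrative only, not Replit's actual pipeline.
import numpy as np
from sklearn.cluster import KMeans

def largest_failure_cluster(trace_embeddings: np.ndarray, k: int = 8) -> np.ndarray:
    """Return indices of the traces in the most common failure mode."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(trace_embeddings)
    top = int(np.bincount(labels).argmax())   # biggest cluster = highest-impact issue
    return np.flatnonzero(labels == top)

# Loop, roughly: embed failed traces -> largest_failure_cluster -> write a fix
# hypothesis -> let an agent draft the change -> gate with ViBench, then A/B test.
```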
IDEAS WORTH REMEMBERING
Treat evaluation as a continuous engine, not a final gate.
Replit frames evals as an always-on feedback loop driven by production traces and rapid iteration, rather than a periodic score that only decides whether to ship.
Vibe-coding requires end-to-end functional evaluation, not patch-and-tests scoring.
Benchmarks like SWE-bench assume existing repos and tests; Replit’s users often start from empty repos and provide only intent, so the correct metric is whether the resulting app actually works as specified.
Automated evaluators unlock high-frequency iteration on app-building agents.
ViBench replaces human grading with automated, natural-language test plans executed by an evaluator agent that reads the codebase, runs the app, and performs browser-based checks step-by-step.
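As a rough illustration of that loop, the sketch below shows one way a natural-language test plan could be executed step-by-step against a running app. It uses Playwright for the browser and stubs out the LLM judge; none of the names reflect ViBench's actual API.

```python
# Hedged sketch of a step-by-step evaluator agent; not ViBench's real API.
# Assumes the generated app is already running at app_url, and that judge_step
# wraps an LLM call grading one natural-language check against page state.
from dataclasses import dataclass
from playwright.sync_api import sync_playwright

@dataclass
class StepResult:
    step: str
    passed: bool
    evidence: str  # snippet of page text the verdict was based on

def judge_step(step: str, page_text: str) -> bool:
    """Placeholder for an LLM judge: does the rendered page satisfy `step`?"""
    raise NotImplementedError("call your model of choice here")

def run_test_plan(app_url: str, plan: list[str]) -> list[StepResult]:
    results = []
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(app_url, wait_until="networkidle")
        for step in plan:
            # A real evaluator would also drive actions (clicks, typing);
            # omitted here for brevity.
            text = page.inner_text("body")       # current rendered state
            passed = judge_step(step, text)      # implementation-agnostic check
            results.append(StepResult(step, passed, text[:200]))
            if not passed:                       # stop at the first failing step
                break
    return results

# Example PRD-style plan, graded one step at a time:
# run_test_plan("http://localhost:3000",
#               ["The landing page shows an empty todo list",
#                "Adding 'buy milk' makes it appear in the list"])
```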
Test “slop-on-slop” explicitly—models degrade when extending their own generated code.
ViBench includes scenarios like “vibe-on-vibe,” where an agent adds features on top of an agent-built MVP; results show this is especially failure-prone, motivating frequent testing between feature additions.
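One way to picture a "vibe-on-vibe" scenario is as a multi-phase spec with functional checks between phases. The field names below are assumptions for illustration, not ViBench's schema; a harness would re-run the checks (e.g., the run_test_plan sketch above) after the MVP and after every added feature.

```python
# Illustrative encoding of a vibe-on-vibe scenario; field names are assumed.
from dataclasses import dataclass

@dataclass
class VibeOnVibeScenario:
    mvp_prompt: str                    # PRD-style spec for the initial app
    feature_prompts: list[str]         # follow-up asks on top of generated code
    checks_per_phase: list[list[str]]  # natural-language test plan per phase

scenario = VibeOnVibeScenario(
    mvp_prompt="Build a todo app with add, complete, and delete",
    feature_prompts=["Add due dates and sort by them",
                     "Add a dark-mode toggle"],
    checks_per_phase=[
        ["Adding a todo shows it in the list"],
        ["Todos display their due dates in ascending order"],
        ["Toggling dark mode changes the page theme"],
    ],
)
```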
Online A/B tests are necessary because offline benchmarks only capture part of real usage.
Replit uses production metrics (run duration, cost, sentiment from prompts, and whether users publish apps) to detect trade-offs and validate changes that benchmarks can’t predict.
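A minimal sketch of how those production signals might be rolled up per A/B variant; the trace fields (run_seconds, cost_usd, published) are assumed for illustration, not Replit's schema.

```python
# Hedged sketch: per-variant rollup of the production signals mentioned above.
from statistics import mean

def summarize_variant(traces: list[dict]) -> dict:
    return {
        "n": len(traces),
        "avg_run_seconds": mean(t["run_seconds"] for t in traces),
        "avg_cost_usd": mean(t["cost_usd"] for t in traces),
        "publish_rate": sum(t["published"] for t in traces) / len(traces),
        # prompt-sentiment classification would add a fourth signal; omitted here
    }

# Comparing summarize_variant(control) with summarize_variant(treatment) can
# surface trade-offs (e.g., cheaper runs but fewer published apps) that an
# offline benchmark score alone would not predict.
```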
WORDS WORTH SAVING
So my argument today for this talk will be that we have to fundamentally rethink how we do evaluations.
— Michele Catasta
This does not reflect what happens in vibe-coding. As I was mentioning before, users are not writing the test. They often start from a completely empty code base, so there is not a scenario where you can just apply patches. You're building things from the ground up.
— Michele Catasta
And today on stage, I'm launching ViBench. It's a new public benchmark for vibe-coding end-to-end that we've worked on at Replit for several months.
— Michele Catasta
I don't believe in competing on evaluations. I come from a research background where everything should be open.
— Michele Catasta
Don't think of evaluation just as this last check before shipping. It shouldn't be just a Boolean flag. Rather, think of this as an engine that allows you to ship a better agent every single day.
— Michele Catasta