Lenny's Podcast: The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)
At a glance
WHAT IT’S REALLY ABOUT
Ronny Kohavi Reveals How To Build Truly Trustworthy Experiment Cultures
- Ronny Kohavi, a leading authority on A/B testing, shares hard-won lessons from building experimentation platforms at Amazon, Microsoft/Bing, and Airbnb. He emphasizes that most experiments fail (often 70–90%), so organizations must embrace humility, rigorous statistics, and strong guardrails rather than chasing silver bullets. The conversation covers when to start experimenting, how to define the right success metric (OEC), why trust and platform quality matter more than speed, and how to institutionalize learning from both wins and surprising failures. Kohavi also addresses concerns that experimentation kills innovation, arguing instead for a portfolio of small optimizations and a minority of high-risk, high-reward bets—all tested rigorously.
IDEAS WORTH REMEMBERING
5 ideas
Expect most experiments to fail—and plan portfolios accordingly.
Across Microsoft, Bing, and Airbnb, 66–92% of experiments did not improve the target metric. This means teams should assume they’re wrong most of the time, run lots of experiments, and structure roadmaps as portfolios with many incremental bets and a smaller set of high-risk, high-reward ideas.
Test everything once you have scale; even tiny changes can have huge impact.
Kohavi advocates that every code change eventually be in an experiment because small tweaks (like reordering ad lines at Bing) sometimes drive massive revenue shifts. Once you have sufficient traffic and a platform, the marginal cost of testing should approach zero.
Define a clear Overall Evaluation Criterion that reflects long-term value.
Optimizing for a single short-term metric (e.g., revenue or conversion) is dangerous. Instead, teams should define an OEC that blends business outcomes with user experience and retention, constrained by guardrail metrics so gains don’t erode long-term customer lifetime value.
Trust in the experimentation platform is more important than speed.
If people don’t trust the stats, they’ll ignore or override results. Kohavi stresses rigorous checks (like detecting sample ratio mismatches, avoiding naïve real-time p-value peeking, and flagging anomalies via Twyman’s law) so the platform becomes a reliable “oracle” and safety net.
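One of the trust checks mentioned above, the sample ratio mismatch (SRM), can be sketched with a chi-square goodness-of-fit test on assignment counts. The function names, the 50/50 design ratio, and the p < 0.001 alert threshold below are illustrative assumptions, not details from the conversation:

```python
import math

def srm_p_value(control_users: int, treatment_users: int,
                expected_ratio: float = 0.5) -> float:
    """Chi-square test (1 degree of freedom) that observed assignment
    counts match the designed split ratio."""
    total = control_users + treatment_users
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square with 1 df: p = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))

def has_srm(control_users: int, treatment_users: int,
            threshold: float = 1e-3) -> bool:
    """Flag the experiment when the split is wildly improbable by chance."""
    return srm_p_value(control_users, treatment_users) < threshold

# A 50,000 vs 51,500 split looks small (~1.5% imbalance) but is extremely
# unlikely under a true 50/50 assignment, so the results should not be trusted.
print(has_srm(50_000, 51_500))   # True: SRM detected
print(has_srm(50_000, 50_200))   # False: within normal variation
```

The point of the low threshold is asymmetry: an SRM alert means the randomization itself is broken, so any metric movement in that experiment is suspect regardless of its p-value.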
Beware p-value myths and high false positive risk in low-success-rate environments.
A p-value of 0.05 does not mean a 95% chance the treatment is better. When only ~8% of experiments succeed (as in Airbnb Search), even p<0.05 can still mean roughly a 26% chance the result is a false positive; replication and stricter thresholds (e.g., p<0.01) help reduce this risk.
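The arithmetic behind the ~26% figure is a Bayes'-rule calculation. The summary does not state the inputs, so the sketch below assumes 80% statistical power and an effective one-sided alpha of 0.025 (half of the two-sided 0.05, since only improvements ship); under those assumptions it reproduces the quoted number:

```python
# Back-of-envelope false positive risk: among experiments that come back
# statistically significant, what fraction had no real effect?
# Assumed inputs (not stated in the summary): power = 0.8, one-sided alpha.

def false_positive_risk(alpha: float, power: float, success_rate: float) -> float:
    """P(no real effect | statistically significant win), via Bayes' rule."""
    true_positives = power * success_rate          # real wins that reach significance
    false_positives = alpha * (1 - success_rate)   # flat treatments that fluke past alpha
    return false_positives / (false_positives + true_positives)

# Airbnb Search: only ~8% of experiments have a real effect.
print(round(false_positive_risk(alpha=0.025, power=0.8, success_rate=0.08), 2))  # 0.26

# A stricter threshold (p < 0.01 two-sided, so one-sided 0.005) cuts the risk sharply.
print(round(false_positive_risk(alpha=0.005, power=0.8, success_rate=0.08), 2))  # 0.07
```

This is why the same p < 0.05 result is far more trustworthy on a team where half of ideas win than on one where one in twelve does: the prior success rate, not just the p-value, drives the risk.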
WORDS WORTH SAVING
5 quotes
I'm a big fan of test everything… any code change that you make, any feature that you introduce has to be in some experiment.
— Ronny Kohavi
Of these experiments, 92% failed to improve the metric that we were trying to move.
— Ronny Kohavi (on Airbnb Search relevance experiments)
We are often humbled by how bad we are at predicting the outcome of experiments.
— Ronny Kohavi
If you go for something big, try it out, but be ready to fail 80% of the time.
— Ronny Kohavi
If something looks too good to be true, investigate… hold the celebratory dinner.
— Ronny Kohavi (on Twyman’s law)
High quality AI-generated summary created from speaker-labeled transcript.