Lenny's Podcast: The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)
At a glance
WHAT IT’S REALLY ABOUT
Ronny Kohavi Reveals How To Build Truly Trustworthy Experiment Cultures
- Ronny Kohavi, a leading authority on A/B testing, shares hard-won lessons from building experimentation platforms at Amazon, Microsoft/Bing, and Airbnb. He emphasizes that most experiments fail (often 70–90%), so organizations must embrace humility, rigorous statistics, and strong guardrails rather than chasing silver bullets. The conversation covers when to start experimenting, how to define the right success metric (OEC), why trust and platform quality matter more than speed, and how to institutionalize learning from both wins and surprising failures. Kohavi also addresses concerns that experimentation kills innovation, arguing instead for a portfolio of small optimizations and a minority of high-risk, high-reward bets—all tested rigorously.
IDEAS WORTH REMEMBERING
5 ideas
Expect most experiments to fail—and plan portfolios accordingly.
Across Microsoft, Bing, and Airbnb, 66–92% of experiments did not improve the target metric. This means teams should assume they’re wrong most of the time, run lots of experiments, and structure roadmaps as portfolios with many incremental bets and a smaller set of high-risk, high-reward ideas.
Test everything once you have scale; even tiny changes can have huge impact.
Kohavi advocates that every code change eventually be in an experiment because small tweaks (like reordering ad lines at Bing) sometimes drive massive revenue shifts. Once you have sufficient traffic and a platform, the marginal cost of testing should approach zero.
Define a clear Overall Evaluation Criterion that reflects long-term value.
Optimizing for a single short-term metric (e.g., revenue or conversion) is dangerous. Instead, teams should define an OEC that blends business outcomes with user experience and retention, constrained by guardrail metrics so gains don’t erode long-term customer lifetime value.
Trust in the experimentation platform is more important than speed.
If people don’t trust the stats, they’ll ignore or override results. Kohavi stresses rigorous checks (like detecting sample ratio mismatches, avoiding naïve real-time p-value peeking, and flagging anomalies via Twyman’s law) so the platform becomes a reliable “oracle” and safety net.
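One of the trust checks mentioned above, the sample ratio mismatch (SRM), can be sketched with a chi-square goodness-of-fit test on assignment counts. The function names, the 50/50 design ratio, and the p < 0.001 alert threshold below are illustrative assumptions, not details from the conversation:

```python
import math

def srm_p_value(control_users: int, treatment_users: int,
                expected_ratio: float = 0.5) -> float:
    """Chi-square test (1 degree of freedom) that observed assignment
    counts match the designed split ratio."""
    total = control_users + treatment_users
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square with 1 df: p = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))

def has_srm(control_users: int, treatment_users: int,
            threshold: float = 1e-3) -> bool:
    """Flag the experiment when the split is wildly improbable by chance."""
    return srm_p_value(control_users, treatment_users) < threshold

# A 50,000 vs 51,500 split looks small (~1.5% imbalance) but is extremely
# unlikely under a true 50/50 assignment, so the results should not be trusted.
print(has_srm(50_000, 51_500))   # True: SRM detected
print(has_srm(50_000, 50_200))   # False: within normal variation
```

The point of the low threshold is asymmetry: an SRM alert means the randomization itself is broken, so any metric movement in that experiment is suspect regardless of its p-value.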
Beware p-value myths and high false positive risk in low-success-rate environments.
A p-value of 0.05 does not mean a 95% chance the treatment is better. When only ~8% of experiments succeed (as in Airbnb Search), even p<0.05 can still mean roughly a 26% chance the result is a false positive; replication and stricter thresholds (e.g., p<0.01) help reduce this risk.
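The arithmetic behind the ~26% figure is a Bayes'-rule calculation. The summary does not state the inputs, so the sketch below assumes 80% statistical power and an effective one-sided alpha of 0.025 (half of the two-sided 0.05, since only improvements ship); under those assumptions it reproduces the quoted number:

```python
# Back-of-envelope false positive risk: among experiments that come back
# statistically significant, what fraction had no real effect?
# Assumed inputs (not stated in the summary): power = 0.8, one-sided alpha.

def false_positive_risk(alpha: float, power: float, success_rate: float) -> float:
    """P(no real effect | statistically significant win), via Bayes' rule."""
    true_positives = power * success_rate          # real wins that reach significance
    false_positives = alpha * (1 - success_rate)   # flat treatments that fluke past alpha
    return false_positives / (false_positives + true_positives)

# Airbnb Search: only ~8% of experiments have a real effect.
print(round(false_positive_risk(alpha=0.025, power=0.8, success_rate=0.08), 2))  # 0.26

# A stricter threshold (p < 0.01 two-sided, so one-sided 0.005) cuts the risk sharply.
print(round(false_positive_risk(alpha=0.005, power=0.8, success_rate=0.08), 2))  # 0.07
```

This is why the same p < 0.05 result is far more trustworthy on a team where half of ideas win than on one where one in twelve does: the prior success rate, not just the p-value, drives the risk.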
WORDS WORTH SAVING
5 quotes
I'm a big fan of test everything… any code change that you make, any feature that you introduce has to be in some experiment.
— Ronny Kohavi
Of these experiments, 92% failed to improve the metric that we were trying to move.
— Ronny Kohavi (on Airbnb Search relevance experiments)
We are often humbled by how bad we are at predicting the outcome of experiments.
— Ronny Kohavi
If you go for something big, try it out, but be ready to fail 80% of the time.
— Ronny Kohavi
If something looks too good to be true, investigate… hold the celebratory dinner.
— Ronny Kohavi (on Twyman’s law)
High quality AI-generated summary created from speaker-labeled transcript.