
https://bit.ly/ABTestingIntuitionBusters

© Kohavi, Deng, Vermeer 2022. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version will be published in KDD 2022 at https://doi.org/10.1145/3534678.3539160

A/B Testing Intuition Busters

Common Misunderstandings in Online Controlled Experiments

Ron Kohavi, Los Altos, CA, [email protected]
Alex Deng, Airbnb Inc, Seattle, WA, [email protected]
Lukas Vermeer, Vista, Delft, The Netherlands, [email protected]

ABSTRACT

A/B tests, or online controlled experiments, are heavily used in industry to evaluate implementations of ideas. While the statistics behind controlled experiments are well documented and some basic pitfalls known, we have observed some seemingly intuitive concepts being touted, including by A/B tool vendors and agencies, which are misleading, often badly so. Our goal is to describe these misunderstandings, the "intuition" behind them, and to explain and bust that intuition with solid statistical reasoning. We provide recommendations that experimentation platform designers can implement to make it harder for experimenters to make these intuitive mistakes.

CCS CONCEPTS

General and Reference → Cross-computing tools and techniques → Experimentation; Mathematics of computing → Probability and statistics → Probabilistic inference problems → Hypothesis testing and confidence interval computation

KEYWORDS

A/B Testing, Controlled experiments, Intuition busters

ACM Reference format:

Ron Kohavi, Alex Deng, Lukas Vermeer. A/B Testing Intuition Busters: Common Misunderstandings in Online Controlled Experiments. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14-18, 2022, Washington DC, USA. https://doi.org/10.1145/3534678.3539160

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
KDD '22, August 14-18, 2022, Washington, DC, USA.
© 2022 Copyright is held by the authors. Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9385-0/22/08...$15.00.
https://doi.org/10.1145/3534678.3539160

1. Introduction

Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof
-- Greenland et al (2016)

A/B tests, or online controlled experiments (see appendix for references), are heavily used in industry to evaluate implementations of ideas, with the larger companies starting over 100 experiment treatments every business day (Gupta, et al. 2019). While the statistics behind controlled experiments are well documented and some pitfalls were shared (Crook, et al. 2009, Dmitriev, Frasca, et al. 2016, Kohavi, Tang and Xu 2020, Dmitriev, Gupta, et al. 2017), we see many erroneous applications and misunderstanding of the statistics, including in books, papers, and software. The appendix shows the impact of these misunderstood concepts in courts and legislation.

The concepts we share appear intuitive yet hide unexpected complexities. Although some amount of abstraction leakage is usually unavoidable (Kluck and Vermeer 2015), our goal is to share these common intuition busters so that experimentation platforms can be designed to make it harder for experimenters to misuse them. Our contributions are as follows:

• We share a collection of important intuition busters. Some well-known commercial vendors of A/B testing software have focused on "intuitive" presentations of results, resulting in incorrect claims to their users instead of addressing their underlying faulty intuitions. We believe that these solutions exacerbate the situation, as they reinforce incorrect intuitions.
• We drill deeply into one non-intuitive result, which to the best of our knowledge has not been studied before: the distribution of the treatment effect under non-uniform assignment to variants. Non-uniform assignments have been suggested in the statistical literature. We highlight several concerns.
• We provide recommendations as well as deployed examples for experimentation platform designers to help address the underlying faulty intuitions identified in our collection.

2. Motivating Example

You win some, you learn some
-- Jason Mraz

GuessTheTest is a website that shares "money-making A/B test case studies." We believe such efforts to share ideas evaluated using A/B tests are useful and should be encouraged. That said, some of the analyses could be improved with the recommendations shared in this paper (indeed, some were already integrated into the web site based on feedback from one of the authors). This site is not unique and represents common industry practices in sharing ideas. We are using it as a concrete example that shows several patterns where the industry can improve.
A real A/B test was shared on December 16, 2021, in GuessTheTest's newsletter and website with the title: "Which design radically increased conversions 337%?" (O'Malley 2021). The A/B test described two landing pages for a website (the specific change is not important). The test ran for 35 days, and traffic was split 50%/50% for maximum statistical power. The surprising results are shown in Table 1 below.

Table 1: Results of a real A/B Test

Variant     Visitors   Conversions   Conversion rate   Lift
Control     82         3             3.7%              --
Treatment   75         12            16.0%             337%

The analysis showed a massive lift of 337% for the Treatment with a p-value of 0.009 (using Fisher's exact test, which is more appropriate for small numbers, the p-value is 0.013), which the article said is "far below the standard < 0.05 cut-off," and with observed power of 97%, "well beyond the accepted 80% minimum."

Given the data presented, we strongly believe that this result should not be trusted, and we hope to convince the readers and improve industry best practices so that similar experiment results will not be shared without additional validation. Based on our feedback and feedback from others, GuessTheTest added that the experiment was underpowered and suggested doing a replication run.
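To make the numbers in Table 1 concrete, the short sketch below recomputes the two p-values mentioned above from the raw counts. It is our own illustrative code (assuming Python with scipy), not the analysis used by GuessTheTest.

```python
# Recompute the p-values for Table 1 (illustrative sketch).
from math import sqrt
from scipy.stats import norm, fisher_exact

control_conv, control_n = 3, 82      # Control: 3 conversions out of 82 visitors
treat_conv, treat_n = 12, 75         # Treatment: 12 conversions out of 75 visitors

# Two-proportion z-test with a pooled variance estimate.
p_c, p_t = control_conv / control_n, treat_conv / treat_n
p_pool = (control_conv + treat_conv) / (control_n + treat_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treat_n))
z = (p_t - p_c) / se
p_value_z = 2 * (1 - norm.cdf(abs(z)))           # about 0.009

# Fisher's exact test, more appropriate for such small counts.
_, p_value_fisher = fisher_exact(
    [[treat_conv, treat_n - treat_conv],
     [control_conv, control_n - control_conv]])  # about 0.013

print(f"lift = {(p_t - p_c) / p_c:.0%}, z-test p = {p_value_z:.3f}, "
      f"Fisher p = {p_value_fisher:.3f}")
```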
3. Surprising Results Require Strong Evidence—Lower P-Values

"Extraordinary claims require extraordinary evidence" (ECREE)
-- Carl Sagan

Surprising results make great story headlines and are often remembered even when flaws are found, or the results do not replicate. Many of the most cited psychology findings failed to replicate (Open Science Collaboration 2015). Recently, the term Bernoulli's Fallacy has been used to describe the issue as a "logical flaw in the statistical methods" (Clayton 2021).

While controlled experiments are the gold standard in science for claiming causality, many people misunderstand p-values. A very common misunderstanding is that a statistically significant result with p-value 0.05 has a 5% chance of being a false positive (Goodman 2008, Greenland, Senn, et al. 2016, Vickers 2009). A common alternative to p-values used by commercial vendors is "confidence," which is defined as (1-p-value)*100%, and often misinterpreted as the probability that the result is a true positive.

Vendors who sell A/B testing software, and should know better, get this concept wrong. For example, Optimizely's documentation equates a p-value of 0.10 with a "10% error rate" (Optimizely 2022):

…to determine whether your results are statistically significant: how confident you can be that the results actually reflect a change in your visitors' behavior, not just noise or randomness… In statistical terms, it's 1-[p-value]. If you set a significance threshold of 90%...you can expect a 10% error rate.

Book authors about A/B Testing also get it wrong. The book A/B Testing: The Most Powerful Way to Turn Clicks Into Customers (Siroker and Koomen 2013) incorrectly defines p-value:

…we can compute the probability that our observed difference (-0.007) is due to random chance. This value, called the p-value...

The book You Should Test That: Conversion Optimization for More Leads, Sales and Profit (Goward 2012) incorrectly states

…when statistical significance (that is, it's unlikely the test results are due to chance) has been achieved.

Even Andrew Gelman, a Statistics professor at Columbia University, has gotten it wrong in one of his published papers (due to an editorial change) and apologized (Gelman 2014).

The above examples, and several more in the appendix, show that p-values and confidence are often misunderstood, even among experts who should know better. What is the p-value then? The p-value is the probability of obtaining a result equal to or more extreme than what was observed, assuming that all the modeling assumptions, including the null hypothesis, H0, are true (Greenland, Senn, et al. 2016). Conditioning¹ on the null hypothesis is critical and most often misunderstood. In probabilistic terms, we have

p-value = P(Δ observed or more extreme | H0 is true).

This conditional probability is not what is being described in the examples above. All the explanations above are variations of the opposite conditional probability: what is the probability of the null hypothesis given the delta observed:

P(H0 is true | Δ observed)

Bayes Rule can be used for inverting between these two, but the crux of the problem is that it requires the prior probability of the null hypothesis. Colquhoun (2017) makes a similar point and writes that "we hardly ever have a valid value for this prior." However, in companies running online controlled experiments at scale, we can construct good prior estimates based on historical experiments.

¹ Some authors prefer to use the semicolon notation; see discussion at: https://statmodeling.stat.columbia.edu/2013/03/12/misunderstanding-the-p-value/#comment-143481

One useful metric to look at is the False Positive Risk (FPR), which is the probability that the statistically significant result is a false positive, or the probability that H0 is true (no real effect) when the test was statistically significant (Colquhoun 2017). Using the following terminology:

• SS is a statistically significant result
• α is the threshold used to determine statistical significance (SS), commonly 0.05 for a two-tailed t-test
• β is the type-II error (usually 0.2 for 80% power)
• π is the prior probability of the null hypothesis, that is, P(H0)

Using Bayes Rule, we can derive the following (Wacholder, et al. 2004, Ioannidis 2005, Kohavi, Deng and Longbotham, et al. 2014, Benjamin, et al. 2017):

P(H0 | SS) = P(SS | H0) * P(H0) / P(SS)
           = P(SS | H0) * P(H0) / (P(SS | H0) * P(H0) + P(SS | ¬H0) * P(¬H0))
           = α*π / (α*π + (1−β)*(1−π))

Several estimates of historical success rates (what the org believes are true improvements to the Overall Evaluation Criterion) have been published. These numbers may involve different accounting schemes, and we never know the true rates, but they suffice as ballpark estimates. The table below summarizes the corresponding implied FPR, assuming π = 1 − success-rate, experiments were properly powered at 80%, and using a p-value of 0.05 but plugging in 0.025 into the above formula because only statistically significant improvements are considered successful in two-tailed t-tests. In practice, some results will have a significantly lower p-value than the threshold, and those have a lower FPR, while results close to the threshold have a higher FPR, as this is the overall FPR for p-value <= 0.05 in a two-tailed t-test (Goodman and Greenland 2007). Also, other factors like multiple variants, iterating on ideas several times, and flexibility in data processing increase the FPR due to multiple hypothesis testing.

What Table 2 summarizes is how much more likely it is to have a false positive stat-sig result than what people intuitively think. Moving from the industry standard of 0.05 to 0.01 or 0.005 aligns with the threshold suggested by the 72-author paper (Benjamin, et al. 2017) for "claims of new discoveries." Finally, if the result of an experiment is highly unusual or surprising, one should invoke Twyman's law—any figure that looks interesting or different is usually wrong (Kohavi, Tang and Xu 2020)—and only accept the result if the p-value is very low.

In our motivating example, the lift to overall conversion was over 300%. We have been involved in tens of thousands of A/B tests that ran at Airbnb, Booking, Amazon, and Microsoft, and have never seen any change that improves conversions anywhere near this amount. We think it's appropriate to invoke Twyman's law here. In the next section, we show that the pre-experiment power is about 3% (highly under-powered). Plugging that number in, even with the highest success rate of 33% from Table 2, we end up with an FPR of 63%, so likely to be false. Alternatively, to override such low power, if we want the false positive probability, P(H0 | SS), to be 0.05, we would need to set the p-value threshold as follows:

α/2 = 0.05 * (1 − β) * (1 − π) / (0.95 * π)

or α = 0.0016, much lower than the 0.009 reported.

Table 2: False Positive Risk given the Success Rate, p-value threshold of 0.025 (successes only), and 80% power

Company/Source                       Success Rate   FPR     Reference
Microsoft                            33%            5.9%    (Kohavi, Crook and Longbotham 2009)
Avinash Kaushik                      20%            11.1%   (Kaushik 2006)
Bing                                 15%            15.0%   (Kohavi, Deng and Longbotham, et al. 2014)
Booking.com, Google Ads, Netflix     10%            22.0%   (Manzi 2012, Thomke 2020, Moran 2007)
Airbnb Search                        8%             26.4%   https://www.linkedin.com/in/ronnyk ²

² Permission to include statistic was given by Airbnb.
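The FPR formula above is simple to compute directly. The sketch below is our own illustrative code (assuming Python), not part of the paper's reproducibility package; it reproduces the Table 2 values, the FPR at lower power levels used in Table 3 below, and the α ≈ 0.0016 threshold implied by 3% power.

```python
# False Positive Risk: P(H0 | statistically significant), illustrative sketch.
def fpr(success_rate, alpha=0.05, power=0.80):
    """FPR for a two-tailed test; only one tail (alpha/2) counts as a 'success'."""
    pi = 1 - success_rate            # prior probability that H0 is true
    a = alpha / 2                    # statistically significant improvements only
    return a * pi / (a * pi + power * (1 - pi))

for org, rate in [("Microsoft", 0.33), ("Avinash Kaushik", 0.20), ("Bing", 0.15),
                  ("Booking/Google Ads/Netflix", 0.10), ("Airbnb Search", 0.08)]:
    print(f"{org:28s} FPR @80% power: {fpr(rate):5.1%}   "
          f"@20% power: {fpr(rate, power=0.20):5.1%}")

# The motivating example: 33% success rate but only ~3% power -> FPR around 63%.
print(f"FPR at 3% power: {fpr(0.33, power=0.03):.0%}")

# Alpha needed so that FPR = 0.05 at 3% power and a 33% success rate (~0.0016).
pi = 1 - 0.33
alpha_needed = 2 * 0.05 * 0.03 * (1 - pi) / (0.95 * pi)
print(f"alpha needed: {alpha_needed:.4f}")
```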
We recommend that experimentation platforms show the FPR or estimates of the posterior probability in addition to p-values, and that surprising results be replicated. At Microsoft, the experimentation platform, ExP, provides estimates that the treatment effect is not zero using Bayes Rule with priors from historical data. In other organizations, FPR was used to set α.

4. Experiments with Low Statistical Power are NOT Trustworthy

When I finally stumbled onto power analysis… it was as if I had died and gone to heaven
-- Jacob Cohen (1990)

Statistical power is the probability of detecting a meaningful difference between the variants when there really is one, that is, rejecting the null when there is a true difference of δ. When running controlled experiments, it is recommended that we pick the sample size to have sufficient statistical power to detect a minimum delta of interest. With an industry standard power of 80%, and p-value threshold of 0.05, the sample size for each of two equally sized variants can be determined by this simple formula (van Belle 2002):

n = 16σ² / δ²

where n is the number of users in each variant, and the variants are assumed to be of equal size, σ² is the variance of the metric of interest, and δ is the sensitivity, or the minimum amount of change you want to detect.

The derivation of the formula is useful for the rest of the section and the next section, so we will summarize its derivation (van Belle 2002). Given two variants of size n each with a standard deviation of σ, we reject the null hypothesis that there is no difference between Control and Treatment (treatment effect is zero) if the observed value is larger than Z_{1−α/2}*SE (e.g., Z_{1−α/2} for α = 0.05 in a two-tailed test is Z_{0.975} = 1.96); SE, the standard error for the difference, is σ√(2/n). We similarly reject the alternative hypothesis that the difference is δ if the observed value is smaller than Z_{1−β}*SE from δ. (Without loss of generality, we evaluate the left tail of a normal distribution centered on a positive δ as the alternative; the same mirror computation can be made with a normal centered on −δ.) The critical value is, therefore, when these two rejection criteria are equal (the approximation ignores rejection based on the wrong tail, sometimes called type III error, a very reasonable and common approximation):

Z_{1−α/2} * SE = δ − Z_{1−β} * SE          (Equation 1)
SE = δ / (Z_{1−β} + Z_{1−α/2})             (Equation 2)
σ√(2/n) = δ / (Z_{1−β} + Z_{1−α/2})
n = 2σ² * (Z_{1−β} + Z_{1−α/2})² / δ²

For 80% power, β = 0.2, Z_{1−β} = 0.84, and Z_{1−α/2} = 1.96, so the numerator is 15.68σ², conservatively rounded to 16σ². Another way to look at Equation 2 is that with 80% power, the detectable effect, δ, is 2.8 SE (0.84 SE + 1.96 SE).

From our GuessTheTest motivating example, a conservative pre-test statistical power calculation would be to detect a 10% relative change. In Optimizely's survey (2021) of 808 companies, about half said experimentation drove 10% uplift in revenue over time from multiple experiments. At Bing, monthly improvements in revenue from multiple experiments were usually in the low single digits (Kohavi, Tang and Xu 2020, Figure 1.4). A large relative percentage, such as 10% for a single experiment, is conservative in that it will require a smaller sample than attempting to detect smaller changes. Assuming historical data showed 3.7% as the conversion rate (what we see for Control), we can plug in

σ² = p*(1−p) = 3.7% * (1 − 3.7%) = 3.563%   and   δ = 3.7% * 10% = 0.37%

The sample size recommended for each variant to achieve 80% power is therefore:

16σ²/δ² = 16 * 3.563% / (0.37%)² = 41,642.

The above-mentioned test was run with about 80 users per variant, and thus was grossly underpowered even for detecting a large 10% change.
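The sketch below (our own illustration, assuming Python with scipy for the normal distribution) reproduces this sample-size calculation and the pre-experiment power for the roughly 80 users per variant that were actually used.

```python
# Sample size and pre-experiment power for a two-sample comparison (sketch).
from math import sqrt
from scipy.stats import norm

def sample_size(sigma2, delta, alpha=0.05, power=0.80):
    """Users per variant: n = 2*sigma^2*(Z_{1-beta} + Z_{1-alpha/2})^2 / delta^2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * sigma2 * z ** 2 / delta ** 2

def power_two_sample(sigma2, delta, n, alpha=0.05):
    """Power to detect delta with n users per variant (ignoring the wrong-tail term)."""
    se = sqrt(2 * sigma2 / n)
    return norm.cdf(delta / se - norm.ppf(1 - alpha / 2))

p = 0.037                 # Control conversion rate
sigma2 = p * (1 - p)      # variance of a Bernoulli metric
delta = 0.10 * p          # detect a 10% relative change

# About 40,900 with exact Z values; the rounded 16*sigma^2/delta^2 rule gives 41,642.
print(round(sample_size(sigma2, delta)))
# With only 80 users per variant, power is roughly 3%.
print(f"{power_two_sample(sigma2, delta, n=80):.1%}")
```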
The power for detecting a 10% relative change with 80 users in this example is 3% (formula in the next section). With so little power, the experiment is meaningless. Gelman et al. (2014) show that when power goes below 0.1, the probability of getting the sign wrong (e.g., concluding that the effect is positive when it is in fact negative) approaches 50%, as shown in Figure 1.

Figure 1: Type S (sign) error of the treatment effect as a function of statistical power (Gelman and Carlin 2014)

The general guidance is that A/B tests are useful to detect effects of reasonable magnitudes when you have, at least, thousands of active users, preferably tens of thousands (Kohavi, Deng and Frasca, et al. 2013).

Table 3 shows the False Positive Risk (FPR) for different levels of power. If you run experiments at 20% power with a success rate similar to Booking.com, Google Ads, Netflix, or Airbnb search, more than half of your statistically significant results will be false positives!

Table 3: False Positive Risk as in Table 2, but with 80% power, 50% power, and 20% power

Company/Source                       Success Rate   FPR @ 80% Power   FPR @ 50% Power   FPR @ 20% Power
Microsoft                            33%            5.9%              9.1%              20.0%
Avinash Kaushik                      20%            11.1%             16.7%             33.3%
Bing                                 15%            15.0%             22.1%             41.5%
Booking.com, Google Ads, Netflix     10%            22.0%             31.0%             52.9%
Airbnb search                        8%             26.4%             36.5%             59.0%

Ioannidis (2005) made this point in a highly cited paper: Why Most Published Research Findings Are False. With many low statistical power studies published, we should expect many false positives when studies show statistically significant results. Moreover, power is just one factor; other factors that can lead to incorrect findings include flexibility in designs, financial incentives, and simply multiple hypothesis testing. Even if there is no ethical concern, many researchers are effectively p-hacking.


A seminal analysis of 78 articles in the Journal of Abnormal and Social Psychology during 1960 and 1961 showed that researchers had only 50% power to detect medium-sized effects and only 20% power to detect small effects (Cohen 1962). With such low power, it is no wonder that published results are often wrong or exaggerated. In a superb paper by Button et al. (2013), the authors analyzed 48 articles that included meta-analyses in the neuroscience domain. Based on these meta-analyses, which evaluated 730 individual studies published, they were able to assess the key parameters for statistical power. Their conclusion: the median statistical power in neuroscience is conservatively estimated at 21%. With such low power, many false positive results are to be expected, and many true effects are likely to be missed!

The Open Science Collaboration (2015) attempted to replicate 100 studies from three major psychology journals, where studies typically have low statistical power. Of these, only 36% had significant results compared to 97% in the original studies.

When the power is low, the probability of detecting a true effect is small, but another consequence of low power, which is often unrecognized, is that a statistically significant finding with low power is likely to highly exaggerate the size of the effect. The winner's curse says that the "lucky" experimenter who finds an effect in a low power setting, or through repeated tests, is cursed by finding an inflated effect (Lee and Shen 2018, Zöllner and Pritchard 2007, Deng, et al. 2021). For studies in neuroscience, where power is usually in the range of 8% to 31%, initial treatment effects found are estimated to be inflated by 25% to 50% (Button, et al. 2013).

Gelman and Carlin (2014) show that when power is below 50%, the exaggeration ratio, defined as the expectation of the absolute value of the estimate divided by the true effect size, becomes so high as to be meaningless, as shown in Figure 2.

Figure 2: Exaggeration ratio as a function of statistical power (Gelman and Carlin 2014)

Our recommendation is that experimentation platforms should discourage experimenters from starting underpowered experiments. With high probability, nothing statistically significant will be found, and in the unlikely case that a statistically significant result is obtained (e.g., by running multiple iterations), it is likely to be a false positive with an overestimated effect size.

5. Post-hoc Power Calculations are Noisy and Misleading

This power is what I mean when I talk of reasoning backward
-- Sherlock Holmes, A Study in Scarlet

Given an observed treatment effect δ, one can assume that it is the true effect and compute the "observed power" or "post-hoc power" from Equation 1 above as follows:

Z_{1−β} * SE = δ − Z_{1−α/2} * SE
Z_{1−β} = δ/SE − Z_{1−α/2}
1 − β = Φ(δ/SE − Z_{1−α/2})

The term δ/SE is the observed Z-value used for the test statistic. It is hence Z_{1−pval/2}, and we can derive the post-hoc power as

1 − β = Φ(Z_{1−pval/2} − Z_{1−α/2}).

Note that power is thus fully determined by the p-value and α, and the graph is shown in Figure 3. If the p-value is greater than 0.05, then the power is less than 50% (technically, as noted above, this ignores type-III errors, which are tiny).

Figure 3: Post-hoc power is determined by the p-value (labeled points include p=0.001 → 91%, 0.005 → 80%, 0.01 → 73%, 0.05 → 50%, 0.1 → 38%, 0.2 → 25%, 0.5 → 10% power)
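A few lines of code make this dependence on the p-value explicit. The sketch below is our own illustration (assuming Python with scipy); it reproduces the labeled points in Figure 3 and the 74% value discussed next.

```python
# Post-hoc ("observed") power is a deterministic function of the p-value and alpha.
from scipy.stats import norm

def post_hoc_power(p_value, alpha=0.05):
    """1 - beta = Phi(Z_{1-p/2} - Z_{1-alpha/2}); ignores the tiny type-III term."""
    return norm.cdf(norm.ppf(1 - p_value / 2) - norm.ppf(1 - alpha / 2))

for p in [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5]:
    print(f"p-value {p:5.3f} -> post-hoc power {post_hoc_power(p):4.0%}")

# The motivating example: p = 0.009 translates to about 74% post-hoc power.
print(f"{post_hoc_power(0.009):.0%}")
```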
In our motivating example, the p-value was 0.009, translating into a Z of 2.61. Subtracting 1.96 gives 0.65, which translates into 74% post-hoc power, which may seem reasonable.

However, compare this number to the calculation in Section 4, where the pre-experiment power was estimated at 3%. In low-power experiments, the p-value has enormous variation, and translating it into post-hoc power results in a very noisy estimate (a video of p-values in a low power simulation is at https://tiny.cc/dancepvals). Gelman (2019) wrote that "using observed estimate of effect size is too noisy to be useful." Greenland (2012) wrote: "for a study as completed (observed), it is analogous to giving odds on a horse race after seeing the outcome" and "post hoc power is unsalvageable as an analytic tool, despite any value it has for study planning."

A key use of statistical power is to claim that for a non-significant result, the true treatment effect is bounded by a small region of ±ε because otherwise there is a high probability (e.g., 80%) that the observation would have been significant. This claim holds true for pre-experiment power calculations, but it fails spectacularly for post-hoc, or observed power, calculations.
In The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis (Hoenig and Heisey 2001), the authors share what they call a "fatal logical flaw" and the "power approach paradox" (PAP). Suppose two experiments gave rise to nonrejected null hypotheses, and the observed power was larger in the first than the second. The intuitive interpretation is that the first experiment gives stronger support favoring the null hypothesis, as with high power, failure to reject the null hypothesis implies that it is probably true. However, this interpretation is only correct for pre-experiment power. As shown above, post-hoc power is determined by the p-value and α, so the first experiment has a lower p-value, providing stronger support against the null hypothesis!

Experimenters who get a non-significant result will sometimes do a post-hoc power analysis and write something like this: the non-significant result is due to a small sample size, as our power was only 30%. This claim implies that they believe they have made a type-II error and if only they had a larger sample, the null would be rejected. This is a catch-22—the claim cannot be made from the data using post-hoc power, as a non-significant result will always translate to low post-hoc power.

Given the strong evidence that post-hoc power is a noisy and misleading tool, we strongly recommend that experimentation systems (e.g., https://abtestguide.com/calc) not show it at all. Instead, if power calculations are desired, such systems should encourage their users to pre-register the minimum effect size of interest ahead of experiment execution, and then base their calculations on this input rather than the observed effect size. At Booking.com, the deployed experimentation platform—Experiment Tool—asks users to enter this information when creating a new experiment.

6. Minimize Data Processing Options in Experimentation Platforms

Statistician: you have already calculated the p-value?
Surgeon: yes, I used multinomial logistic regression.
Statistician: Really? How did you come up with that?
Surgeon: I tried each analysis on the statistical software dropdown menus, and that was the one that gave the smallest p-value
-- Andrew Vickers (2009)

In an executive review, a group presented an idea that, they said, was evaluated in an A/B test and resulted in a significant increase to a key business metric. When one of us (Kohavi) asked to see the scorecard, the metric's p-value was far from significant. Why did you say it was statistically significant, he asked? The response was that it was statistically significant once you turned on the option for extreme outlier removal. We had inadvertently allowed users to do multiple comparisons and inflate type-I error rates.

Outlier removal must be blind to the hypothesis. André (2021) showed that outlier removal within a variant (e.g., removal of the 1% extreme values, determined for each variant separately), rather than across the data, can result in false-positive rates as high as 43%.
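As an illustration of what "blind to the hypothesis" means in practice, the minimal sketch below (our own example, assuming Python with numpy; not André's procedure or any platform's implementation) computes a single outlier threshold on the pooled data and applies it to both variants, rather than trimming each variant at its own percentile.

```python
# Hypothesis-blind outlier removal: one threshold from the pooled data (sketch).
import numpy as np

def remove_outliers_blind(control, treatment, pct=99):
    """Drop extreme values using a threshold computed on the pooled data."""
    threshold = np.percentile(np.concatenate([control, treatment]), pct)
    return control[control <= threshold], treatment[treatment <= threshold]

# What NOT to do: trimming each variant at its own 99th percentile couples the
# data processing to the variant assignment and can inflate false-positive rates.
```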
Optimizely's initial A/B system was showing near-real-time results, so their users peeked at the data and chose to stop when it was statistically significant, a procedure recommended by the company at the time. This type of multiple testing significantly inflates the type-I error rates (Johari, et al. 2017).

Flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates (Simmons, Nelson and Simonsohn 2011). The culprit is researcher degrees of freedom, which include:
1. Should more data be collected, or should we stop now?
2. Should some observations be excluded (e.g., outliers, bots)?
3. Segmentation by variables (e.g., gender, age, geography) and reporting just those that are statistically significant.

The authors write that "In fact, it is unacceptably easy to publish 'statistically significant' evidence consistent with any hypothesis."

Gelman and Loken (2014) discuss how data-dependent analysis, called the "garden of forking paths," leads to statistically significant comparisons that do not hold up. Even without intentional p-hacking, researchers make multiple choices that lead to a multiple-comparison problem and inflate type-I errors. For example, Bem's paper (2011) providing evidence of extrasensory perception (ESP) presented nine different experiments and had multiple degrees of freedom that allowed him to keep looking until he could find what he was searching for. The author found statistically significant results for erotic pictures, but performance could have been better overall, or for non-erotic pictures, or perhaps erotic pictures for men but not women. If results were better in the second half, one could claim evidence of learning; if it's the opposite, one could claim fatigue.

For research, preregistration seems like a simple solution, and organizations like the Center for Open Science support such preregistrations.

For experimentation systems, we recommend that data processing be standardized. If there is a reason to modify the standard process, for example, outlier removal, it should be pre-specified as part of the experiment configuration, and there should be an audit trail of changes to the configuration, as is done at Booking.com. Finally, a benefit of doing A/B testing in software is that replication is much cheaper and easier. If an insight leads to a new hypothesis about an interesting segment, pre-register it and run a replication study.

7. Beware of Unequal Variants

The difference between theory and practice is larger in practice than the difference between theory and practice in theory
-- Benjamin Brewster

In theory, a single control can be shared with several treatments, and the theory says that a larger control will be beneficial to reduce the variance (Tang, et al. 2010). Assuming equal variances, the effective sample size of a two-sample test is the harmonic mean 1/(1/N_T + 1/N_C). When there is one control taking a proportion x of users and k equally sized treatments, each with size (1−x)/k, the optimal control size should be chosen by minimizing the sum k/(1−x) + 1/x. We differentiate to get

k/(1−x)² − 1/x².

The optimal control proportion x is the positive solution to (k−1)x² + 2x − 1 = 0, which is 1/(√k + 1).

For example, when k = 3, instead of using 25% of users for all four variants, we could use 36.6% for control and 21.1% for the treatments, making control more than 1.5x larger. When k = 9, control would get 25% and each treatment only 8.3%, making control 3 times the size of treatment.
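A tiny sketch (our own, assuming Python) reproduces these proportions and shows the variance of the delta relative to an equal split; in this accounting the reduction is modest for a small number of treatments, consistent with the discussion at the end of this section.

```python
# Optimal shared-control proportion x = 1/(sqrt(k)+1) for k equal treatments (sketch).
from math import sqrt

def optimal_control_proportion(k):
    return 1 / (sqrt(k) + 1)

def relative_variance(x, k):
    """Variance of the delta (up to sigma^2): 1/N_C + 1/N_T with N_C = x, N_T = (1-x)/k."""
    return 1 / x + k / (1 - x)

for k in [2, 3, 4, 9]:
    x = optimal_control_proportion(k)
    equal = relative_variance(1 / (k + 1), k)      # equal split across k+1 variants
    optimal = relative_variance(x, k)
    print(f"k={k}: control {x:.1%}, each treatment {(1 - x) / k:.1%}, "
          f"variance reduction vs equal split {1 - optimal / equal:.1%}")
```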
Ramp-up is another scenario leading to a more extreme unequal treatment vs. control sample size. When a treatment starts at a small percentage, say 2%, the remaining 98% of traffic may seem to be the obvious control.

There are several reasons why this seemingly intuitive direction fails in practice:
1. Triggering. As organizations scale experimentation, they run more triggered experiments, which give a stronger signal for smaller populations, great for testing initial ideas and for machine learning classifiers (Kohavi, Tang and Xu 2020, Chapter 20, Triggering). It is practically too hard to share a control and compute for each treatment whether to trigger, especially for experiment treatments that start at different times and introduce performance overhead (e.g., doing inference on both control and treatment to determine if the results differ in order to trigger).
2. Because of cookie churn, unequal variants will cause a larger percentage of users in the smaller variants to be contaminated and be exposed to different variants (their probability of being re-randomized into a larger variant is higher than to their original variant). If there are mechanisms to map multiple cookies to users (e.g., based on logins), this mapping will cause sample-ratio mismatches (Kohavi, Tang and Xu 2020, Fabijan, et al. 2019).
3. Shared resources, such as Least Recently Used (LRU) caches, will have more cache entries for the larger variant, giving it a performance advantage (Kohavi, Tang and Xu 2020).

Here we raise awareness of an important statistical issue mentioned in passing by Kohavi et al (2012). When distributions are skewed, in an unequal assignment, the t-test cannot maintain the nominal Type-I error rate on both tails. When a metric is positively skewed, and the control is larger than the treatment, the t-test will over-estimate the Type-I error on one tail and under-estimate it on the other tail because the skewed distribution's convergence to normal is different. But when equal sample sizes are used, the convergence is similar and the Δ (observed delta) is represented well by a Normal or t-distribution.

Two common sources of skewness are 1) heavy-tailed measurements such as revenue and counts, often zero-inflated at the same time; and 2) binary/conversion metrics with a very small positive rate. We ran two simulated A/A studies. In the first study, we drew 100,000 random samples from a heavy-tailed distribution, D1, of counts, like nights booked at a reservation site. This distribution is both zero-inflated (about 5% nonzero) and skewed in its non-zero component, with a skewness of 35. The second study drew 1,000,000 samples from a Bernoulli distribution, D2, with a small p of 0.01%, which implies a skewness of 100.

In each study, we allocated 10% of samples to the treatment. We then compared two cases: in one, the control was also allocated 10%; in the second, the remaining 90% were allocated to the control. We did 10,000 simulation trials and counted the number of times H0 was rejected at the right tail and left tail at the 2.5% level for each side (5% two-sided). The skewness of Δ and of the metric value from the 10% treatment group are also reported.
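The sketch below is a simplified version of such an A/A simulation: it is our own illustration with smaller samples and fewer trials than the 10,000 runs reported in Table 4, and the zero-inflated lognormal is an assumed stand-in since the exact form of D1 is not specified here. It reports the left- and right-tail rejection rates and the skewness of Δ for equal and unequal allocations.

```python
# A/A simulation: tail-level Type-I errors of the t-test under unequal allocation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def draw_metric(n):
    """Zero-inflated, heavy-tailed metric (most users are zero, a few have large values)."""
    nonzero = rng.random(n) < 0.05
    return np.where(nonzero, np.exp(rng.normal(0.0, 1.5, n)), 0.0)

def aa_trial(n_treat, n_control):
    t, c = draw_metric(n_treat), draw_metric(n_control)
    delta = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    return delta, delta / se

for n_treat, n_control in [(5_000, 5_000), (5_000, 45_000)]:
    results = np.array([aa_trial(n_treat, n_control) for _ in range(2_000)])
    deltas, zs = results[:, 0], results[:, 1]
    left = np.mean(zs < stats.norm.ppf(0.025))    # rejections on the left tail
    right = np.mean(zs > stats.norm.ppf(0.975))   # rejections on the right tail
    print(f"{n_treat}/{n_control}: left {left:.2%}, right {right:.2%}, "
          f"skewness of delta {stats.skew(deltas):.3f}")
```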
Table 4 shows the results, with the following observations:
1. The realized Type-I error is close to the nominal 2.5% rate when the control is the same size as the treatment.
2. When the control is larger, the Type-I error at the left tail is greater than 2.5%, while smaller than 2.5% at the right tail.
3. The skewness of Δ is very close to 0 when control and treatment are equally sized. It is closer to the skewness of the treatment metric when the control is much larger.

Table 4: Type I errors at left and right tails from 10,000 simulation runs for two skewed distributions

Distribution   Variants    Type-I Left tail   Type-I Right tail   Skewness of Δ   Skewness of 10% variant
D1             10%/10%     2.35%              2.30%               0.0142          0.36
D1             10%/90%     5.42%              0.85%               0.2817          0.36
D2             10%/10%     2.63%              2.63%               -0.0018         0.32
D2             10%/90%     5.75%              0.96%               0.2745          0.32

The skewness of a metric's mean decreases at the rate of 1/√n as the sample size n increases. Kohavi, Deng, et al. (2014) recommended sample sizes for each variant large enough that the skewness of the averaged metric is no greater than 1/√355 = 0.053. Because the skewness of Δ is more critical for the t-test, note how in equally sized variants the skewness is materially smaller. Table 4 shows that even when the skewness of the metric itself is above 0.3, the skewness of these Δ for the equal-sized cases was all smaller than 0.053. Because the ratio of skewness is so high (e.g., 0.2817/0.0142 ≈ 19.8), achieving the same skewness, that is, convergence to normal, with unequal variants requires 19.8² ≈ 400 times more users.

For experiment ramp-up, where the focus is to reject at the left tail so we can avoid degrading experiences for users, using a much larger control can lead to higher-than-expected false rejections, so a correction should be applied (Boos and Hughes-Oliver 2000). For using a shared control to increase statistical power, the real statistical power can be lower than expected. For a number of treatments ranging from two to four, the reduced variance from using the optimal shared control size is less than 10%. We do not think this benefit justifies all the potential issues with unequal variants, and therefore recommend against the use of a large (shared) control.

8. Summary

We shared five seemingly intuitive concepts that are heavily touted in the industry, but are very misleading. We then shared our recommendations for how to design experimentation platforms to make it harder for experimenters to be misled by these. The recommendations were implemented in some of the deployed platforms in our organizations.

ACKNOWLEDGMENTS

We thank Georgi Georgiev, Somit Gupta, Roger Longbotham, Deborah O'Malley, John Cutler, Pavel Dmitriev, Aleksander Fabijan, Matt Gershoff, Adam Gustafson, Bertil Hatt, Michael Hochster, Paul Raff, Andre Richter, Nathaniel Stevens, Wolfe Styke, and Eduardo Zambrano for valuable feedback.

References

André, Quentin. 2021. "Outlier exclusion procedures must be blind to the researcher's hypothesis." Journal of Experimental Psychology: General. doi:https://psycnet.apa.org/doi/10.1037/xge0001069.
Bem, Daryl J. 2011. "Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect." Journal of Personality and Social Psychology 100 (3): 407-425. doi:https://psycnet.apa.org/doi/10.1037/a0021524.
Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E.-J. Wagenmakers, Richard Berk, Kenneth A. Bollen, et al. 2017. "Redefine Statistical Significance." Nature Human Behaviour 2 (1): 6-10. https://www.nature.com/articles/s41562-017-0189-z.
Boos, Dennis D, and Jacqueline M Hughes-Oliver. 2000. "How Large Does n Have to be for Z and t Intervals?" The American Statistician, 121-128.
Button, Katherine S, John P.A. Ioannidis, Claire Mokrysz, Brian A Nosek, Jonathan Flint, Emma S.J. Robinson, and Marcus R Munafò. 2013. "Power failure: why small sample size undermines the reliability of neuroscience." Nature Reviews Neuroscience 14: 365-376. https://doi.org/10.1038/nrn3475.
Clayton, Aubrey. 2021. Bernoulli's Fallacy: Statistical Illogic and the Crisis of Modern Science. Columbia University Press.
Cohen, Jacob. 1962. "The Statistical Power for Abnormal-Social Psychological Research: A Review." Journal of Abnormal and Social Psychology 65 (3): 145-153. https://psycnet.apa.org/doi/10.1037/h0045186.
Cohen, Jacob. 1990. "Things I have Learned (So Far)." American Psychologist 45 (12): 1304-1312. https://www.academia.edu/1527968/Things_I_Have_Learned_So_Far_.
Colquhoun, David. 2017. "The reproducibility of research and the misinterpretation of p-values." Royal Society Open Science (4). https://doi.org/10.1098/rsos.171085.
Crook, Thomas, Brian Frasca, Ron Kohavi, and Roger Longbotham. 2009. "Seven Pitfalls to Avoid when Running Controlled Experiments on the Web." KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 1105-1114.
Deng, Alex, Yicheng Li, Jiannan Lu, and Vivek Ramamurthy. 2021. "On Post-Selection Inference in A/B Tests." Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2743-2752.
Dmitriev, Pavel, Brian Frasca, Somit Gupta, Ron Kohavi, and Garnet Vaz. 2016. "Pitfalls of long-term online controlled experiments." IEEE International Conference on Big Data. Washington, DC. 1367-1376. doi:https://doi.org/10.1109/BigData.2016.7840744.
Dmitriev, Pavel, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. "A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2017). Halifax, NS, Canada: ACM. 1427-1436. http://doi.acm.org/10.1145/3097983.3098024.
Fabijan, Aleksander, Jayant Gupchup, Somit Gupta, Jeff Omhover, Wen Qin, Lukas Vermeer, and Pavel Dmitriev. 2019. "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners." KDD '19: The 25th SIGKDD International Conference on Knowledge Discovery and Data Mining. Anchorage, Alaska, USA: ACM.
Gelman, Andrew. 2019. "Don't Calculate Post-hoc Power Using Observed Estimate of Effect Size." Annals of Surgery 269 (1): e9-e10. doi:10.1097/SLA.0000000000002908.
—. 2014. "I didn't say that! Part 2." Statistical Modeling, Causal Inference, and Social Science. October 14. https://statmodeling.stat.columbia.edu/2014/10/14/didnt-say-part-2/.
Gelman, Andrew, and Eric Loken. 2014. "The Statistical Crisis in Science." American Scientist 102 (6): 460-465. doi:https://doi.org/10.1511/2014.111.460.
Gelman, Andrew, and John Carlin. 2014. "Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors." Perspectives on Psychological Science 9 (6): 641-651. doi:10.1177/1745691614551642.
Goodman, Steven. 2008. "A Dirty Dozen: Twelve P-Value Misconceptions." Seminars in Hematology. doi:https://doi.org/10.1053/j.seminhematol.2008.04.003.
Goodman, Steven, and Sander Greenland. 2007. Assessing the unreliability of the medical literature: a response to "Why most published research findings are false". Johns Hopkins University, Department of Biostatistics. https://biostats.bepress.com/cgi/viewcontent.cgi?article=1135&context=jhubiostat.
Goward, Chris. 2012. You Should Test That: Conversion Optimization for More Leads, Sales and Profit or The Art and Science of Optimized Marketing. Sybex.
Greenland, Sander. 2012. "Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative." Annals of Epidemiology 22 (5): 364-368. doi:https://doi.org/10.1016/j.annepidem.2012.02.007.
Greenland, Sander, Stephen J Senn, Kenneth J Rothman, John B Carlin, Charles Poole, Steven N Goodman, and Douglas G Altman. 2016. "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations." European Journal of Epidemiology 31: 337-350. https://doi.org/10.1007/s10654-016-0149-3.
Gupta, Somit, Ronny Kohavi, Diane Tang, Ya Xu, Reid Anderson, Eytan Bakshy, Niall Cardin, et al. 2019. "Top Challenges from the first Practical Online Controlled Experiments Summit." 21 (1). https://bit.ly/ControlledExperimentsSummit1.
Hoenig, John M, and Dennis M Heisey. 2001. "The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis." American Statistical Association 55 (1): 19-24. doi:https://doi.org/10.1198/000313001300339897.
Ioannidis, John P. 2005. "Why Most Published Research Findings Are False." PLoS Medicine 2 (8): e124. doi:10.1371/journal.pmed.0020124.
Johari, Ramesh, Leonid Pekelis, Pete Koomen, and David Walsh. 2017. "Peeking at A/B Tests." KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Halifax, NS, Canada: ACM. 1517-1525. doi:https://doi.org/10.1145/3097983.3097992.
Kaushik, Avinash. 2006. "Experimentation and Testing: A Primer." Occam's Razor by Avinash Kaushik. May 22.
Kluck, Timo, and Lukas Vermeer. 2015. "Leaky Abstraction In Online Experimentation Platforms: A Conceptual Framework To Categorize Common Challenges." The Conference on Digital Experimentation (CODE@MIT). Boston, MA. https://arxiv.org/abs/1710.00397.
Kohavi, Ron, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. "Online Controlled Experiments at Large Scale." KDD 2013: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. http://bit.ly/ExPScale.
Kohavi, Ron, Alex Deng, Roger Longbotham, and Ya Xu. 2014. "Seven Rules of Thumb for Web Site Experimenters." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '14). http://bit.ly/expRulesOfThumb.
Kohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. https://experimentguide.com.
Kohavi, Ron, Thomas Crook, and Roger Longbotham. 2009. "Online Experimentation at Microsoft." Third Workshop on Data Mining Case Studies and Practice Prize. http://bit.ly/expMicrosoft.
Lee, Minyong R, and Milan Shen. 2018. "Winner's Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments." KDD 2018: The 24th ACM Conference on Knowledge Discovery and Data Mining. London: ACM.
Manzi, Jim. 2012. Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society. Basic Books.
Moran, Mike. 2007. Do It Wrong Quickly: How the Web Changes the Old Marketing Rules. IBM Press.
O'Malley, Deborah. 2021. "Which design radically increased conversions 337%?" GuessTheTest. December 16. https://guessthetest.com/test/which-design-radically-increased-conversions-337/?referrer=Guessed.
Open Science Collaboration. 2015. "Estimating the Reproducibility of Psychological Science." Science 349 (6251). doi:https://doi.org/10.1126/science.aac4716.
Optimizely. 2022. "Change the statistical significance setting." Optimizely Help Center. January 10. https://support.optimizely.com/hc/en-us/articles/4410289762189.
—. 2021. "How to win in the Digital Experience Economy." Optimizely. https://www.optimizely.com/insights/digital-experience-economy-report/.
Simmons, Joseph P, Leif D Nelson, and Uri Simonsohn. 2011. "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant." Psychological Science 22 (11): 1359-1366. https://journals.sagepub.com/doi/full/10.1177/0956797611417632.
Siroker, Dan, and Pete Koomen. 2013. A/B Testing: The Most Powerful Way to Turn Clicks Into Customers. Wiley.
Tang, Diane, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. 2010. "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation." Proceedings 16th Conference on Knowledge Discovery and Data Mining. https://ai.google/research/pubs/pub36500.
Thomke, Stefan H. 2020. Experimentation Works: The Surprising Power of Business Experiments. Harvard Business Review Press.
van Belle, Gerald. 2002. Statistical Rules of Thumb. Wiley-Interscience.
Vickers, Andrew J. 2009. What is a p-value anyway? 34 Stories to Help You Actually Understand Statistics. Pearson.
Wacholder, Sholom, Stephen Chanock, Montserrat Garcia-Closas, Nathaniel Rothman, and Laure Elghormli. 2004. "Assessing the Probability That a Positive Report is False: An Approach for Molecular Epidemiology Studies." Journal of the National Cancer Institute. https://doi.org/10.1093/jnci/djh075.
Zöllner, Sebastian, and Jonathan K Pritchard. 2007. "Overcoming the Winner's Curse: Estimating Penetrance Parameters from Case-Control Data." The American Journal of Human Genetics 80 (4): 605-615. https://doi.org/10.1086/512821.

A/B Testing Intuition Busters: Appendix

Introduction

This appendix provides additional support and useful references to several sections in the main paper.

There are many references for A/B tests, or online controlled experiments (Kohavi, Tang and Xu 2020, Luca and Bazerman 2020, Thomke 2020, Georgiev 2019, Kohavi, Longbotham, et al. 2009, Goward 2012, Siroker and Koomen 2013); (Box, Hunter and Hunter 2005, Imbens and Rubin 2015, Gerber and Green 2012).

Statistical concepts that are misunderstood have not only caused businesses to make incorrect decisions, hurting user experiences and the businesses themselves, but have also resulted in innocent people being convicted of murder and serving years in jail. In courts, incorrect use of conditional probabilities is called the Prosecutor's fallacy, and "The use of p-values can also lead to the prosecutor's fallacy" (Fenton, Neil and Berger 2016). Sally Clark and Angela Cannings were convicted of the murder of their babies, in part based on a claim presented by an eminent British pediatrician, Professor Meadow, who incorrectly stated that the chance of two babies dying in those circumstances is 1 in 73 million (Hill 2005). The Royal Statistical Society issued a statement saying that the "figure of 1 in 73 million thus has no statistical basis" and that "This (mis-)interpretation is a serious error of logic known as the Prosecutor's Fallacy" (2001).

In the US, right turn on red was studied in the 1970s, but "these studies were underpowered" and the differences on key metrics were not statistically significant, so right turn on red was adopted; later studies showed "60% more pedestrians were being run over, and twice as many bicyclists were struck" (Reinhart 2015).

Surprising Results Require Strong Evidence—Lower P-Values

Eliason (2018) shares 16 popular myths that persist despite evidence they are likely false. In Belief in the Law of Small Numbers (Tversky and Kahneman 1971), the authors take the reader through intuition-busting exercises in statistical power and replication.

Additional examples where concepts are incorrectly stated by people or organizations in the field of A/B testing include:

Until December 2021, Adobe's documentation stated that

The confidence of an experience or offer represents the probability that the lift of the associated experience/offer over the control experience/offer is "real" (not caused by random chance). Typically, 95% is the recommended level of confidence for the lift to be considered significant.

This statement is wrong and was likely fixed after a LinkedIn post from one of us that highlighted this error.

The book Designing with Data: Improving the User Experience with A/B Testing (King, Churchill and Tan 2017) incorrectly states

p-values represent the probability that the difference you observed is due to random chance

GuessTheTest defined confidence incorrectly (GuessTheTest 2022) as

A 95% confidence level means there's just a 5% chance the results are due to random factors -- and not the variables that changed within the A/B test

The owner is in the process of updating its definitions based on our feedback.

The web site AB Test Guide (https://abtestguide.com/calc/) uses the following incorrect wording when the tool is used and the result is statistically significant:

You can be 95% confident that this result is a consequence of the changes you made and not a result of random chance

The industry standard threshold of 0.05 for the p-value is stated in medical guidance (FDA 1998, Kennedy-Shaffer 2017).

Minimize Data Processing Options in Experimentation Platforms

Additional discussion of ESP following up on Bem's paper (2011) is in Schimmack et al. (2018).

In Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results (Silberzhan, et al. 2018), the authors shared how 29 teams involving 61 analysts used the same data set to address the same research question. Analytic approaches varied widely, and estimated effect sizes ranged from 0.89 to 2.93. Twenty teams (69%) found a statistically significant positive effect, and nine teams (31%) did not. Many subjective decisions are part of the data processing and analysis and can materially impact the outcome.

In the online world, we typically deal with a larger number of units than in domains like psychology. Simmons et al. (2011) recommend at least 20 observations per cell, whereas in A/B testing we recommend thousands to tens of thousands of users (Kohavi, Deng, et al. 2013). On the one hand, this larger sample size results in less dramatic swings in p-values because experiments are adequately powered, but on the other hand online experiments offer more opportunities for optional stopping and post-hoc segmentation, which suffer from multiple hypothesis testing.

Resources for Reproducibility

The key tables and simulations are available for reproducibility at https://bit.ly/ABTestingIntuitionBustersExtra.

References

Bem, Daryl J. 2011. "Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect." Journal of Personality and Social Psychology 100 (3): 407-425. doi:https://psycnet.apa.org/doi/10.1037/a0021524.
Box, George E.P., J Stuart Hunter, and William G Hunter. 2005. Statistics for Experimenters: Design, Innovation, and Discovery. 2nd. John Wiley & Sons, Inc.
Eliason, Nat. 2018. 16 Popular Psychology Myths You Probably Still Believe. July 2. https://www.nateliason.com/blog/psychology-myths.
FDA. 1998. "E9 Statistical Principles for Clinical Trials." U.S. Food & Drug Administration. September. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/e9-statistical-principles-clinical-trials.
Fenton, Norman, Martin Neil, and Daniel Berger. 2016. "Bayes and the Law." Annual Review of Statistics and Its Application 3: 51-77. doi:https://doi.org/10.1146/annurev-statistics-041715-033428.
Georgiev, Georgi Zdravkov. 2019. Statistical Methods in Online A/B Testing: Statistics for data-driven business decisions and risk management in e-commerce. Independently published. https://www.abtestingstats.com/.
Gerber, Alan S, and Donald P Green. 2012. Field Experiments: Design, Analysis, and Interpretation. W. W. Norton & Company.
Goward, Chris. 2012. You Should Test That: Conversion Optimization for More Leads, Sales and Profit or The Art and Science of Optimized Marketing. Sybex.
GuessTheTest. 2022. "Confidence." GuessTheTest. January 10. https://guessthetest.com/glossary/confidence/.
Hill, Ray. 2005. "Reflections on the cot death cases." Significance 13-16. doi:https://doi.org/10.1111/j.1740-9713.2005.00077.x.
Imbens, Guido W, and Donald B Rubin. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
Kennedy-Shaffer, Lee. 2017. "When the Alpha is the Omega: P-Values, "Substantial Evidence," and the 0.05 Standard at FDA." Food Drug Law J. 595-635. https://pubmed.ncbi.nlm.nih.gov/30294197.
King, Rochelle, Elizabeth F Churchill, and Caitlin Tan. 2017. Designing with Data: Improving the User Experience with A/B Testing. O'Reilly Media.
Kohavi, Ron, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. "Online Controlled Experiments at Large Scale." KDD 2013: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. http://bit.ly/ExPScale.
Kohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. https://experimentguide.com.
Kohavi, Ron, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. 2009. "Controlled experiments on the web: survey and practical guide." Data Mining and Knowledge Discovery 18: 140-181. http://bit.ly/expSurvey.
Luca, Michael, and Max H Bazerman. 2020. The Power of Experiments: Decision Making in a Data-Driven World. The MIT Press.
Reinhart, Alex. 2015. Statistics Done Wrong: The Woefully Complete Guide. No Starch Press.
Royal Statistical Society. 2001. Royal Statistical Society concerned by issues raised in Sally Clark case. London, October 23. https://web.archive.org/web/20110824151124/http://www.rss.org.uk/uploadedfiles/documentlibrary/744.pdf.
Schimmack, Ulrich, Linda Schultz, Rickard Carlsson, and Stefan Schmukle. 2018. "Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 "Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect" by Daryl J. Bem." Replicability-Index. January 30. https://replicationindex.com/2018/01/05/bem-retraction/.
Silberzhan, R, E L Uhlmann, D P Martin, et al. 2018. "Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results." Advances in Methods and Practices in Psychological Science 1 (3): 337-356. doi:https://doi.org/10.1177%2F2515245917747646.
Simmons, Joseph P, Leif D Nelson, and Uri Simonsohn. 2011. "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant." Psychological Science 22 (11): 1359-1366. https://journals.sagepub.com/doi/full/10.1177/0956797611417632.
Siroker, Dan, and Pete Koomen. 2013. A/B Testing: The Most Powerful Way to Turn Clicks Into Customers. Wiley.
Thomke, Stefan H. 2020. Experimentation Works: The Surprising Power of Business Experiments. Harvard Business Review Press.
Tversky, Amos, and Daniel Kahneman. 1971. "Belief in the Law of Small Numbers." Psychological Bulletin 76 (2): 105-110. https://psycnet.apa.org/record/1972-01934-001.
