Foundations of Large-Scale Doubly-Sequential Experimentation
Aaditya Ramdas
Assistant Professor
Dept. of Statistics and Data Science
Machine Learning Dept.
Carnegie Mellon University
www.stat.cmu.edu/~aramdas/kdd19/
A/B-testing : tech :: clinical trials : pharma
In 2013, a team from Microsoft (Bing) claimed to run tens of thousands of such experiments, leading to millions of dollars in increased revenue.
Much has been discussed about doing A/B testing the “right” way,
both theoretically and practically in real-world systems.
[Figure: users of an app or website are split 50%/50% between two versions, A and B, of a “Sign up!” page; in one illustration, A gets 44 conversions and B gets 71.]
A new “doubly-sequential” perspective:
a sequence of sequential experiments

[Figure: experiments (Exp.) stacked against a Time / Samples axis; each row is a sequential A/B test, with treatments possibly compared against a common control.]
[1 hour]
The “duality” between
confidence intervals and p-values
Hypothesis testing is like
stochastic proof by contradiction.

Null hypothesis: the coin is fair (bias = 0).
Alternative: the coin is biased towards heads.

[Figure: 1000 tosses; the possible observations range from all tails to all heads.]
Calculate the p-value

[Figure: probability density over the possible observations, from all tails to all heads; the p-value P is the tail probability beyond the observed data.]

Reject the null if P ≤ α.
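To make this concrete, here is a minimal Python sketch (not from the slides) that computes such a p-value for a hypothetical outcome of 530 heads in 1000 tosses, using scipy's exact binomial test:

```python
# A minimal sketch of the coin-toss p-value; the 530-heads outcome
# is an illustrative assumption.
from scipy.stats import binomtest

n_tosses, n_heads = 1000, 530
# One-sided test of H0: P(heads) = 0.5 vs. H1: coin biased towards heads.
result = binomtest(n_heads, n_tosses, p=0.5, alternative="greater")
print(f"p-value P = {result.pvalue:.4f}")

alpha = 0.05
print("reject H0" if result.pvalue <= alpha else "fail to reject H0")
```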
An equivalent view via confidence intervals

Estimate the coin bias by μ̂ := (#H − #T)/N.

An asymptotic (1 − α)-CI for μ is given by ( μ̂ − z1−α/√N , μ̂ + z1−α/√N ),
where z1−α is the (1 − α)-quantile of N(0,1)
(appealing to the Central Limit Theorem).

Using z1−α ≈ √(2 log(1/α)), rejecting when 0 falls outside this CI amounts, approximately, to rejecting when #H − #T ≥ √(2N log(1/α)).
For any parameter μ of interest,
with associated estimator μ̂,
the following claim holds:

For H0 : μ = μ0,
μ0 lies outside the (1 − α) confidence interval for μ ⟺ the p-value Pμ0 ≤ α
(i.e., we would reject the null hypothesis at level α).

[Figure: the real line ℝ, with the (1 − α) confidence interval drawn around μ̂ and the point μ0 lying outside it.]
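This duality is easy to check numerically. Below is a minimal sketch (the 530-heads outcome, the one-sided test, and the unit-variance approximation are all assumptions for illustration):

```python
# Numerical check of the CI / p-value duality under the CLT approximation
# used above (one-sided test toward heads; variance taken as 1).
import numpy as np
from scipy.stats import norm

alpha, N, heads = 0.05, 1000, 530        # hypothetical data, as before
mu_hat = (heads - (N - heads)) / N       # (#H - #T)/N
z = norm.ppf(1 - alpha)

lower = mu_hat - z / np.sqrt(N)          # lower endpoint of the CI
mu0 = 0.0                                # H0: the coin is fair (bias = 0)

# One-sided z-test p-value for H0: mu = mu0 vs. H1: mu > mu0.
p_value = norm.sf(np.sqrt(N) * (mu_hat - mu0))

# Duality: p_value <= alpha exactly when mu0 falls at or below the lower
# confidence bound (the two booleans agree, up to the boundary case).
print(p_value <= alpha, mu0 <= lower)
```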
High-level caricature of an A/B-test

Start → “peek” → “optional continuation” → … → Stop, Report (“optional stopping”).

With commonly-taught p-values, the false positive rate under such optional stopping is ≫ α.
[Figure: peeking at the data after 10 people.]

Let P(n) be a classical p-value (eg: from a t-test), calculated using the first n samples. Then

∀n ≥ 1, Pr(P(n) ≤ α) ≤ α.   (prob. of a false positive at any fixed n)

Unfortunately, for a data-dependent stopping time τ, Pr(P(τ) ≤ α) ≰ α .
In other words, Pr( ∃n ∈ ℕ : P(n) ≤ α) ≫ α .
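A small simulation makes the inflation visible; everything here (fair-coin data, n_max = 1000 peeks, 2000 replications) is an illustrative assumption:

```python
# Simulate how continuous peeking inflates the false positive rate:
# under a fair coin (null true), compute a classical one-sided z-test
# p-value after every toss and reject if it EVER dips below alpha.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, n_max, n_sims = 0.05, 1000, 2000
ever_rejected = 0

for _ in range(n_sims):
    x = rng.choice([-1.0, 1.0], size=n_max)   # fair coin: true mean 0
    sums = np.cumsum(x)
    n = np.arange(1, n_max + 1)
    p = norm.sf(sums / np.sqrt(n))            # classical p-value P(n) at each n
    ever_rejected += np.any(p <= alpha)

print(f"Pr(exists n: P(n) <= alpha) ~ {ever_rejected / n_sims:.3f}  (>> {alpha})")
```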
Same problem with confidence intervals (CIs): peeking, optional continuation, and optional stopping break their coverage guarantee too.

A “confidence sequence” for a parameter θ is a sequence of intervals (Ln, Un) such that

ℙ( ∀n ≥ 1 : θ ∈ (Ln, Un)) ≥ 1 − α .
[Figure: for a fair coin, the cumulative miscoverage probability of a pointwise CI boundary (CLT) grows steadily with sample size, approaching 0.5; a pointwise linear Hoeffding boundary behaves similarly, while an anytime, curved Hoeffding-style CI boundary keeps it below α.]
Eg: If Xi is 1-subGaussian, then

(∑_{i≤n} Xi)/n ± 1.71 √( (log log(2n) + 0.72 log(5.19/α)) / n )

is a (1 − α) confidence sequence.

[Figure: this anytime boundary plotted against the Hoeffding and CLT bounds as the intrinsic time Vt runs from 0 to 10⁵.]
Howard, Ramdas, McAuliffe, Sekhon ’18
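A sketch of this boundary, with the constants read off the slide, next to the fixed-n CLT interval (the simulated Gaussian data and the horizon are assumptions for illustration):

```python
# Sub-Gaussian confidence sequence vs. pointwise CLT interval.
import numpy as np
from scipy.stats import norm

def cs_radius(n, alpha):
    """Anytime radius from the slide: 1.71 sqrt((log log 2n + 0.72 log(5.19/a))/n)."""
    return 1.71 * np.sqrt((np.log(np.log(2 * n)) + 0.72 * np.log(5.19 / alpha)) / n)

def clt_radius(n, alpha):
    """Pointwise CLT radius, valid at one pre-specified n only."""
    return norm.ppf(1 - alpha / 2) / np.sqrt(n)

rng = np.random.default_rng(1)
alpha, n_max = 0.05, 10_000
x = rng.standard_normal(n_max)             # 1-subGaussian observations, true mean 0
n = np.arange(1, n_max + 1)
running_mean = np.cumsum(x) / n

# The intervals running_mean +/- cs_radius(n) are simultaneously valid over
# all n with prob. >= 1 - alpha; the extra width is the price of "anytime".
print(f"radii at n={n_max}: CS {cs_radius(n_max, alpha):.4f} vs CLT {clt_radius(n_max, alpha):.4f}")
print("CS covers the true mean at every n:",
      bool(np.all(np.abs(running_mean) <= cs_radius(n, alpha))))
```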
Equivalently, ℙ( ⋃_{n∈ℕ} {θ ∉ (Ln, Un)} ) ≤ α .

Some implications (for the anytime-valid p-values dual to such confidence sequences):

For all stopping times τ, Pr(P(τ) ≤ α) ≤ α .
For all data-dependent times T, Pr(P(T) ≤ α) ≤ α .
Relationship to the Sequential Probability Ratio Test (Wald ’48)

Given a stream of data X1, X2, … ∼ fθ, suppose
we want to test a null hypothesis H0 : θ = θ0
against an alternative hypothesis H1 : θ = θ1 .

The SPRT tracks the likelihood ratio L(n) := ∏_{i≤n} fθ1(Xi)/fθ0(Xi),
and rejects when L(n) > 1/α . Can also use a prior/mixture over θ1 .
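A minimal SPRT sketch for Bernoulli data; the specific pair θ0 = 0.5, θ1 = 0.6 is an illustrative assumption:

```python
# Rejection-only SPRT (matching the slide's "reject when L(n) > 1/alpha";
# the classic SPRT also has a lower boundary for accepting H0).
import numpy as np

def sprt(xs, theta0=0.5, theta1=0.6, alpha=0.05):
    """Track log L(n) and reject H0 once L(n) > 1/alpha."""
    log_lr = 0.0
    for n, x in enumerate(xs, start=1):
        # Bernoulli likelihood ratio for one observation x in {0, 1}.
        log_lr += x * np.log(theta1 / theta0) + (1 - x) * np.log((1 - theta1) / (1 - theta0))
        # By Ville's inequality, Pr(ever crossing under H0) <= alpha.
        if log_lr > np.log(1 / alpha):
            return ("reject H0", n)
    return ("no rejection", len(xs))

rng = np.random.default_rng(2)
print(sprt(rng.binomial(1, 0.6, size=5000)))   # data from the alternative
print(sprt(rng.binomial(1, 0.5, size=5000)))   # data from the null
```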
[40 mins]
Quick recap of A/B testing

[Figure: clicks vs. misses for versions A and B, with counts on a 0–800 scale.]

Null hypothesis: A is at least as good as B.
Calculate the p-value: P = Pr(observed data or more extreme, assuming the null is true).
A wrong rejection of the null is a false discovery, and implies a bad change from A to B.
Reality: internet companies run thousands
of different (independent) A/B tests over time.

Decision rule: reject the null of the i-th test if Pi ≤ α.

P1 ≤ α?   vs. Color
P3 ≤ α?   vs. Orientation
P4 ≤ α?   vs. Style
P5 ≤ α?   vs. Logo
Problem!

Run 10,000 different, independent A/B tests: say 9,900 true nulls and 100 non-nulls.

With a per-test type-1 error rate of 0.05, we expect about 0.05 × 9,900 ≈ 495 false discoveries, swamping the (at most 100) true ones.

FDR = E[FDP], the expected false discovery proportion (# false discoveries / # discoveries).

Summary: FDR can be much larger than the per-test error rate
(even if hypotheses, tests, and data are independent).
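A back-of-the-envelope version of this calculation (the 80% power figure is an assumption for illustration):

```python
# Why a 0.05 per-test error rate does not control the FDR.
n_nulls, n_non_nulls = 9_900, 100
alpha, power = 0.05, 0.8              # power is an illustrative assumption

expected_false = n_nulls * alpha      # ~495 false discoveries
expected_true = n_non_nulls * power   # ~80 true discoveries
fdp = expected_false / (expected_false + expected_true)
print(f"expected FDP ~ {fdp:.0%}")    # ~86%: most discoveries are false!
```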
Given a possibly infinite sequence
of independent tests (p-values), can we
guarantee control of the FDR
in a fully online fashion?
Foster-Stine ’08
Aharoni-Rosset ’14
Javanmard-Montanari ’16
Ramdas-Yang-Wainwright-Jordan ’17
Ramdas-Zrnic-Wainwright-Jordan ’18
Tian-Ramdas ’19
The aim of online FDR procedures

Decision rule: reject the null of the i-th test if Pi ≤ αi, where the level αi is assigned online.

P1 ≤ α1 ?   vs. Color
P3 ≤ α3 ?   vs. Orientation
P4 ≤ α4 ?   vs. Style
P5 ≤ α5 ?   vs. Logo
Benjamini-Hochberg ’95

The following method is not a valid online FDR algorithm:

Maintain  F̂CP(T) := (∑_{i≤T} αi) / (1 ∨ ∑_{i≤T} Si)  ≤ α.

Online FCR control: the main idea

Maintain  F̂CP(T) := (∑_{i≤T} αi) / (1 ∨ ∑_{i≤T} Si)  ≤ α,

where Si indicates whether experiment i was selected (i.e., its CI was reported).
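A toy sketch of this budget-tracking invariant; the greedy budget rule and the fake selection pattern below are illustrative assumptions, not the specific algorithms from the papers cited above:

```python
# Before experiment T+1 starts, pick the largest alpha_{T+1} that keeps
#   sum_i alpha_i / max(1, sum_i S_i) <= alpha,
# which stays true however the future selections turn out.

def next_budget(alphas, selections, alpha=0.05):
    """Largest admissible error budget for the next experiment."""
    return max(0.0, alpha * max(1, sum(selections)) - sum(alphas))

alphas, selections = [], []
for i in range(1, 6):
    a_i = next_budget(alphas, selections)
    alphas.append(a_i)
    s_i = (i % 2 == 1)                 # pretend: odd experiments get selected
    selections.append(s_i)
    fcp_hat = sum(alphas) / max(1, sum(selections))
    print(f"expt {i}: budget={a_i:.4f}, selected={s_i}, FCP_hat={fcp_hat:.4f}")
```

The greedy rule spends the whole budget immediately, leaving some experiments with zero; real methods (e.g., the alpha-investing family of Foster-Stine) spread the budget more smoothly.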
α1 : error budget for the first expt.
Part III
Putting the modular pieces together:
the doubly-sequential process
[Next 10 mins]
Combining inner and outer solutions (FCR):
(a) Online FCR method assigns αi when expt. i starts
(b) We keep track of a (1 − αi) confidence sequence
(c) Adaptively decide when to stop, and whether to report the final CI
(d) Guarantee FCR(T) ≤ α at any positive time T

[Figure: Exp. vs. Time / Samples, with error budgets α1, …, α5, …, αi assigned to successive experiments.]
Combining inner and outer solutions (FDR):
(a) Online FDR method assigns αi when expt. i starts
(b) We keep track of the anytime p-value Pi(n)
(c) Adaptively stop at time τ, report a discovery if Pi(τ) ≤ αi
(d) Guarantee FDR(T) ≤ α at any positive time T

[Figure: the same Exp. vs. Time / Samples diagram, with budgets α1, …, α5, …, αi.]
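Here is how the pieces might fit together in code. This is a sketch, not any specific published algorithm: the outer budgets use a simple online-Bonferroni spending sequence (αi = α·6/(πi)², which sums to α), and the inner anytime p-value is the reciprocal running maximum of the SPRT likelihood ratio from the earlier sketch:

```python
# A doubly-sequential loop: outer budgets + inner anytime p-values.
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.05

def anytime_p_values(xs, theta0=0.5, theta1=0.6):
    """Anytime-valid p-value path: 1 / running max of the likelihood ratio."""
    log_lr = np.cumsum(np.where(xs == 1,
                                np.log(theta1 / theta0),
                                np.log((1 - theta1) / (1 - theta0))))
    return np.minimum(np.exp(-np.maximum.accumulate(log_lr)), 1.0)

for i in range(1, 6):
    alpha_i = alpha * 6 / (np.pi * i) ** 2    # (a) outer: budget assigned at start
    theta = 0.6 if i % 2 == 1 else 0.5        # odd expts non-null, even expts null
    p = anytime_p_values(rng.binomial(1, theta, size=3000))   # (b) inner p-value
    hits = np.nonzero(p <= alpha_i)[0]
    # (c) inner: stop as soon as the anytime p-value crosses alpha_i
    if hits.size:
        print(f"expt {i}: discovery at n={hits[0] + 1} (alpha_i={alpha_i:.4f})")
    else:
        print(f"expt {i}: no discovery   (alpha_i={alpha_i:.4f})")
```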
PART IV: Advanced topics
(inner sequential process)
[Next 25 mins]
1. What if we are testing more than one alternative?

[Figure: a heavy-tailed probability density over the possible observations, with quantiles q0, q1/2, q0.8 marked; Mean = +∞, or the mean is undefined.]

Eg: pk ∝ 1/k²  ⟹  Mean = +∞ (yet quantiles like q1/2, q0.8 are finite).

[Figure: this probability mass function, with q1/2 and q0.8 marked.]

2. Quantiles are sensible in totally ordered settings.

A < B < C < D < E < …
Eg: grades or non-numerical ratings.
We do not need to artificially assign numerical values.
(Are they equally spaced? The spacing and start point matter.)
2. A/B testing with quantiles

H0 : q0.9(A) = q0.9(B)
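As a simple illustration (fixed-n, not the anytime machinery above; the exponential data and the DKW-based intervals are assumptions), one can compare the 0.9-quantiles of two arms via distribution-free confidence intervals:

```python
# DKW-based CI for the 0.9-quantile of each arm; if the two intervals are
# disjoint, reject H0: q0.9(A) = q0.9(B) at level ~2*alpha by a union bound.
import numpy as np

def quantile_ci(x, q=0.9, alpha=0.025):
    """DKW band: F_hat stays within eps of F uniformly, w.p. >= 1 - alpha."""
    x, n = np.sort(x), len(x)
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))
    lo = x[max(0, int(np.floor(n * (q - eps))))]
    hi = x[min(n - 1, int(np.ceil(n * (q + eps))))]
    return lo, hi

rng = np.random.default_rng(4)
A = rng.exponential(1.0, size=5000)      # hypothetical arm A outcomes
B = rng.exponential(1.3, size=5000)      # hypothetical arm B outcomes
ci_a, ci_b = quantile_ci(A), quantile_ci(B)
disjoint = ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]
print(f"A: {ci_a}, B: {ci_b}, disjoint={disjoint}")
```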
4. Sequential Average Treatment Effect (SATE) estimation with adaptive randomization

[Figure: users of an app or website are randomized 50%/50% between A and B; the randomization probabilities can change with time (eg: to keep the groups balanced).]
[Next 15 mins]

1. Smoothly forgetting the past: mem-FDR (similarly mem-FCR)
Ramdas et al. ’17

2. Post-hoc analysis

4. False-sign rate

FSR := 𝔼 [ (# incorrect sign decisions made) / (# sign decisions made) ]
A selective history (inner process)

Fisher (1925): null hypothesis testing basics; randomization for causal inference
Wald (1948): sequential probability ratio test (the first always-valid p-values)
Robbins (1952): multi-armed bandits
Darling & Robbins (1967): confidence sequences (the first always-valid CIs)
Lai, Siegmund, … (1970s): confidence sequences; inference after stopping experiments
Jennison & Turnbull (1980s): group sequential methods (peeking only 2 or 3 times)
A selective history (outer process)

Tukey (1953): an unpublished book on the problem of multiple comparisons
Eklund & Seeger (1963): defined the false discovery proportion; suggested a heuristic algorithm
Benjamini & Hochberg (1995): rediscovered the Eklund-Seeger method; gave the first proof of FDR control
Benjamini & Yekutieli (2005): false coverage rate (FCR); first methods to control it
Foster & Stine (2008): conceptualized online FDR control; first method to control it
In this tutorial, you learnt the basics of
• How to think about a single experiment
A. Why peeking is an issue in practice
B. Why applying a t-test repeatedly inflates errors
C. Anytime confidence intervals and p-values
You also learnt some advanced topics:
• Within a single experiment:
A. Using bandits for hypothesis testing
B. Quantiles can be estimated sequentially
C. The pros and cons of running intersections
D. SATE with adaptive randomization
• Across experiments:
A. Error metrics with decaying-memory
B. The false sign rate
C. Weighted error metrics
D. Post-hoc analysis
• Open problems:
A. Incentives/errors within hierarchical organizations
B. Utilizing contextual information for testing
C. Designing systems that fail loudly
SOFTWARE
• Across experiments:
R package called “onlineFDR”
Maintained by David Robertson (Cambridge)
Frequent updates + wrappers for months to come
Aaditya Ramdas
Assistant Professor
Dept. of Statistics and Data Science
Machine Learning Dept.
Carnegie Mellon University
Funding welcomed!
www.stat.cmu.edu/~aramdas/kdd19/
Thank you! Questions?