
Foundations of large-scale

“doubly-sequential” experimentation

(KDD tutorial in Anchorage, on 4 Aug 2019)

Aaditya Ramdas
Assistant Professor
Dept. of Statistics and Data Science
Machine Learning Dept.
Carnegie Mellon University

www.stat.cmu.edu/~aramdas/kdd19/
A/B-testing : tech :: clinical trials : pharma

Large-scale A/B-testing and other related forms of randomized
experimentation have revolutionized the tech industry in the last 15 years.

In 2013, a team from Microsoft (Bing) claimed that they run tens
of thousands of such experiments, leading to millions of dollars in
increased revenue.

Much has been discussed about doing A/B testing the “right” way,
both theoretically and practically in real-world systems.

Many companies are contributing to this vast and growing literature.

Kohavi et al. ’13

Audience poll (for the speaker)

How many of you have written papers on A/B testing
or online experimentation?
(or work in the area, or consider yourselves experts?)

How many of you have read papers on A/B testing
and know what it is, but want to know more?

How many have no idea what I’m talking about?


Users of app or website

[Figure: users are split 50% / 50% between version A and version B of a
“Sign up!” page.]

A: 44 conversions    B: 71 conversions    →  B wins!


What we will NOT cover today

Pre-experiment analysis and difference-in-differences
What makes a good metric? (directionality and sensitivity)
Combining metrics into an overall evaluation criterion
Time-series aspects: dealing with periodicity and trends
Causal inference in observational studies
Parametric methods for A/B testing (like SPRT and variants)
Bayesian A/B testing
Side effects and risks associated with running experiments
Deployment with controlled, phased rollouts
Pitfalls of long experiments (survivorship bias, perceived trends)
ML meets causal inference meets online experiments
Experimentation in marketplaces or with network effects
Ethical aspects of running experiments
There are many resources for these topics

Yandex tutorial at The Web Conference ’18

Microsoft tutorial at The Web Conference ’19 (+ ExP Platform webpage)

Blog posts by Evan Miller, Etsy, Optimizely, etc.


A new “doubly-sequential” perspective:
a sequence of sequential experiments

[Figure: experiments (Exp.) stacked on a vertical axis against
Time / Samples on the horizontal axis; each row is a sequential
A/B test comparing treatments against a common control.]

Zrnic, Ramdas, Jordan ‘18
Yang, Ramdas, Jamieson, Wainwright ‘17
What kind of guarantees would we like
for doubly sequential experimentation?

(a) inner sequential process (a single experiment)
— correct inference when the experiment ends
(correct p-values for the A/B test, or correct
confidence intervals for the treatment effect)

(b) outer sequential process (multiple experiments)
— less clear (is error control on the inner
process enough?!)
Some existing problems in practice

Some potential issues within each experiment:
(a) continuous monitoring
(b) flexible experiment horizon
(c) arbitrary stopping (or continuation) rules

Some potential issues across experiments:
(a) selection bias (multiplicity)
(b) dependence across experiments
(c) don’t know future outcomes

Many other concerns as well

Solutions for these issues

Inner sequential process: Part I
“confidence sequence” for estimation,
also called “anytime confidence intervals”
(correspondingly, “always valid p-values” for testing)

Outer sequential process: Part II
“false coverage rate” for estimation
(correspondingly, “false discovery rate” for testing)

Modular solutions: fit well together. Part III
Many extensions to each piece
Part I

The INNER Sequential Process


(a single experiment)

[1 hour]
The “duality” between
confidence intervals and p-values
Hypothesis testing is like
stochastic proof by contradiction.

[Figure: bar chart of 1000 tosses, counts of Tails vs. Heads on a
0–800 scale; Heads far outnumber Tails.]

Null hypothesis:
The coin is fair (bias = 0)

Alternative:
Coin is biased towards H

Apparent contradiction!
Should we reject the null hypothesis?
Calculate p-value

[Figure: the probability density over possible observations, ranging
from “all tails” to “all heads”; the p-value P is the tail area beyond
the observed data.]

Reject null if P ≤ α, which (via a Gaussian approximation) amounts to
#H − #T ≥ √(2N log(1/α)) .

Then, Pr(false positive) ≤ α .
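
As a quick sanity check, here is a minimal sketch (not from the slides;
it assumes numpy and scipy are available) that computes this p-value
exactly under the fair-coin null and compares it with the approximate
threshold:

```python
import numpy as np
from scipy.stats import binom

def coin_p_value(n_heads: int, N: int) -> float:
    # One-sided p-value: Pr(#H >= n_heads) under the Bin(N, 1/2) null.
    return binom.sf(n_heads - 1, N, 0.5)

N, n_heads, alpha = 1000, 550, 0.05
print(coin_p_value(n_heads, N))  # exact tail probability, ~0.0009
# Gaussian approximation: reject iff #H - #T >= sqrt(2 N log(1/alpha))
print(2 * n_heads - N >= np.sqrt(2 * N * np.log(1 / alpha)))  # True
```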


An equivalent view via confidence intervals

Estimate the coin bias by μ̂ := (#H − #T)/N .

An asymptotic (1 − α)-CI for μ is given by
( μ̂ − z_{1−α}/√N , μ̂ + z_{1−α}/√N ) ,
where z_{1−α} is the (1 − α)-quantile of N(0,1)
(appealing to the Central Limit Theorem).

If this confidence interval does not contain 0,
we may be reasonably confident that the coin is biased,
and we may reject the null hypothesis.

This again amounts (approximately) to #H − #T ≥ √(2N log(1/α)) .
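
A minimal sketch of this interval (the numbers are illustrative; assumes
numpy and scipy):

```python
import numpy as np
from scipy.stats import norm

def coin_bias_ci(n_heads: int, N: int, alpha: float = 0.05):
    mu_hat = (2 * n_heads - N) / N       # (#H - #T) / N
    z = norm.ppf(1 - alpha)              # (1 - alpha)-quantile of N(0,1)
    half_width = z / np.sqrt(N)
    return mu_hat - half_width, mu_hat + half_width

lo, hi = coin_bias_ci(550, 1000)         # ~ (0.048, 0.152): excludes 0,
print(lo, hi)                            # so reject H0: bias = 0
```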
For any parameter μ of interest,
with associated estimator μ̂,
the following claim holds:

[Figure: the real line ℝ, with the (1 − α) confidence interval around μ̂
nested inside the wider (1 − α/2) confidence interval.]

For H0 : μ = μ0 :
if μ0 lies outside the (1 − α) confidence interval for μ,
then the p-value satisfies Pμ0 ≤ α
(we would reject the null hypothesis at level α);
if μ0 lies inside the (1 − α/2) confidence interval,
then Pμ0 > α/2 .
In summary, tests (p-values) and CIs are “dual”.

family of tests for θ → CI for θ
A (1 − α)-CI for a parameter θ is the set of all θ0 such that
the test for H0 : θ = θ0 has p-value larger than α .

CI for θ → family of tests for θ
A p-value for testing the null H0 : θ = θ0 can be given by
the smallest q for which the (1 − q)-CI for θ fails to cover θ0 .

CI for θ → composite tests for θ
A p-value for testing the null H0 : θ ∈ Θ0 can be given by
the smallest q for which the (1 − q)-CI for θ fails to intersect Θ0 .

Both of them are useful tools to estimate uncertainty,
and like any other tool, they can be used well, or be misused.
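
As an illustration of the first direction (family of tests → CI), here
is a minimal sketch that inverts two-sided z-tests over a grid of
candidate values θ0; the unit-variance Gaussian setup is an assumption
for illustration:

```python
import numpy as np
from scipy.stats import norm

def ci_from_tests(x: np.ndarray, alpha: float = 0.05):
    grid = np.linspace(-2, 2, 4001)            # candidate nulls theta_0
    z = np.sqrt(len(x)) * (x.mean() - grid)    # z-stat for H0: theta = theta_0
    p = 2 * norm.sf(np.abs(z))                 # p-value of each test
    kept = grid[p > alpha]                     # the theta_0 we fail to reject...
    return kept.min(), kept.max()              # ...form the (1 - alpha)-CI

rng = np.random.default_rng(4)
print(ci_from_tests(rng.standard_normal(100) + 0.5))  # ~ (0.3, 0.7)
```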
However, commonly taught confidence intervals
and p-values are only valid (correctly control error)
if the sample size is fixed in advance.
High-level caricature of an A/B-test

Start
→ Collect more data (increase sample size)
→ “peek”: check if P(n) ≤ α
→ if not, go back and collect more (“optional continuation”)
→ if yes: Stop, Report (“optional stopping”)

With commonly-taught p-values,
false positive rate ≫ α .
[Figure: repeated snapshots of the same continuously-monitored
experiment:]

After 10 people
After 284 people
After 1214 people
After 2398 people
After 7224 people
After 11,219 people, STOP!


Let P(n) be a classical p-value (e.g., from a t-test),
calculated using the first n samples.

Under the null hypothesis (no treatment effect),

∀n ≥ 1, Pr(P(n) ≤ α) ≤ α .
(prob. of false positive)

Let τ be the stopping time of the experiment.
Often, τ depends on the data, e.g., τ := min{n ∈ ℕ : P(n) ≤ α} .

Unfortunately, Pr(P(τ) ≤ α) ≰ α .
In other words, Pr(∃n ∈ ℕ : P(n) ≤ α) ≫ α .
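
A minimal simulation of this inflation (an illustration, not from the
slides): Gaussian data under the null, with a classical z-test p-value
recomputed, and “peeked” at, after every sample:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, n_max, n_reps = 0.05, 1000, 2000
rejections = 0
for _ in range(n_reps):
    x = rng.standard_normal(n_max)           # null: mean 0, variance 1
    n = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(n)            # running z-statistic
    p = 2 * norm.sf(np.abs(z))               # classical p-value at each n
    rejections += (p <= alpha).any()         # stop at the first "win"
print("Pr(exists n <= 1000: P(n) <= alpha) ~", rejections / n_reps)  # >> 0.05
```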
Same problem with confidence interval (CI)

Start
→ Collect more data (increase sample size)
→ “peek”: check if 0 ∉ (1 − α) CI
→ if not, go back and collect more (“optional continuation”)
→ if yes: Stop, Report (“optional stopping”)

Again, false positive rate ≫ α .


Let (L(n), U(n)) be any classical (1 − α) CI,
calculated using the first n samples (e.g., via the CLT).

When trying to estimate the treatment effect θ,

∀n ≥ 1, Pr(θ ∈ (L(n), U(n))) ≥ 1 − α .
(prob. of coverage)

Let τ be the stopping time of the experiment.
Again, τ may depend on the data, e.g., τ := min{n ∈ ℕ : L(n) > 0} .

Unfortunately, Pr(θ ∈ (L(τ), U(τ))) ≱ 1 − α .

In other words, Pr(∀n ≥ 1 : θ ∈ (L(n), U(n))) ≪ 1 − α
(usually = 0).
Solution: “confidence sequence”
(aka “anytime confidence intervals”)

or “sequential p-values” for testing


(aka “always-valid p-values”)
A “confidence sequence” for a parameter θ
is a sequence of confidence intervals (Ln, Un)
with a uniform (simultaneous) coverage guarantee:

ℙ(∀n ≥ 1 : θ ∈ (Ln, Un)) ≥ 1 − α .

[Figure: the intervals (Ln, Un) shrinking around θ as the sample size
grows.]

Darling, Robbins ’67, ‘68
Lai ’76, ’84
Howard, Ramdas, McAuliffe, Sekhon ’18
Example: tracking the mean of a Gaussian
or Bernoulli from i.i.d. observations.

X1, X2, … ∼ N(θ, 1) or Ber(θ)

Producing a confidence interval at a fixed time
is elementary statistics (~100 years old).

How do we produce a confidence sequence?
(which is like a confidence band over time)
[Figure: for a fair coin, the empirical mean with pointwise CLT
confidence bounds versus an anytime (curved Hoeffding) boundary, as the
number of samples t grows on a log scale; a companion panel shows the
cumulative miscoverage probability, which keeps growing for the
pointwise CLT interval but stays below α for the anytime boundary.]
Eg: If Xi is 1-subGaussian, then

(∑_{i=1}^n Xi)/n ± 1.71 √( (log log(2n) + 0.72 log(5.19/α)) / n )

is a (1 − α) confidence sequence.

[Figure: uniform boundaries compared as a function of the intrinsic time
Vt — Jamieson et al. (2013), Balsubramani (2014), Zhao et al. (2016),
Darling & Robbins (1967b, 1968), Kaufmann et al. (2014), a normal
mixture boundary, and the polynomial stitching, inverted stitching, and
discrete mixture boundaries (ours) — against the pointwise Hoeffding
and CLT bounds.]
Howard, Ramdas, McAuliffe, Sekhon ’18
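
A minimal sketch of this boundary (assuming 1-subGaussian observations,
with the constants displayed above):

```python
import numpy as np

def stitched_cs(x: np.ndarray, alpha: float = 0.05):
    # Running mean +/- the stitched boundary; valid for all n simultaneously.
    n = np.arange(1, len(x) + 1)
    mean = np.cumsum(x) / n
    radius = 1.71 * np.sqrt((np.log(np.log(2 * n)) + 0.72 * np.log(5.19 / alpha)) / n)
    return mean - radius, mean + radius

rng = np.random.default_rng(1)
lo, hi = stitched_cs(rng.standard_normal(10_000) + 0.2)
print(lo[-1], hi[-1])   # brackets the true mean 0.2 (uniformly over n)
```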

ℙ( ⋃_{n∈ℕ} {θ ∉ (Ln, Un)} ) ≤ α .

Some implications:

1. Valid inference at any time, even stopping times:
for any stopping time τ : ℙ(θ ∉ (Lτ, Uτ)) ≤ α .

2. Valid post-hoc inference (in hindsight):
for any random time T : ℙ(θ ∉ (LT, UT)) ≤ α .

3. No pre-specified sample size:
can extend or stop experiments adaptively.
The same duality between
confidence intervals and p-values
also holds in the sequential setting:
“confidence sequences” are dual to
“always valid p-values”.
Duality between anytime p-value and CI

Define a set of null values ℋ0 for θ .

Let P(n) := inf{α : the (1 − α) CI(n) does not intersect ℋ0} .

If CI(n) is a pointwise CI, then P(n) is a classical p-value:
for all fixed times n, Pr(P(n) ≤ α) ≤ α .
(prob. of false positive)

If CI(n) is an anytime CI, then P(n) is an always-valid p-value:
for all stopping times τ, Pr(P(τ) ≤ α) ≤ α ;
for all data-dependent times T, Pr(P(T) ≤ α) ≤ α .
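
A minimal sketch of this inversion, applied to the stitched boundary
from the earlier snippet (that boundary is an assumption carried over):
solve for the smallest α at which the (1 − α) confidence sequence
excludes θ0, then take a running minimum, which stays valid because the
underlying coverage guarantee is uniform over n:

```python
import numpy as np

def always_valid_p(x: np.ndarray, theta0: float = 0.0) -> np.ndarray:
    n = np.arange(1, len(x) + 1)
    mean = np.cumsum(x) / n
    # Invert |mean - theta0| = 1.71*sqrt((loglog(2n) + 0.72*log(5.19/a))/n)
    # for a: the smallest level whose confidence sequence excludes theta0.
    gap = n * (mean - theta0) ** 2 / 1.71**2 - np.log(np.log(2 * n))
    p = np.clip(5.19 * np.exp(-gap / 0.72), 0.0, 1.0)
    return np.minimum.accumulate(p)

rng = np.random.default_rng(5)
print(always_valid_p(rng.standard_normal(2000) + 0.3)[-1])  # tiny under H1
```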
Relationship to the Sequential Probability Ratio Test

Given a stream of data X1, X2, … ∼ fθ, suppose
we want to test a null hypothesis H0 : θ = θ0
against an alternative hypothesis H1 : θ = θ1 .

Wald's SPRT (or SLRT) calculates a probability/likelihood ratio:

L(n) := ∏_{i=1}^n f1(Xi) / ∏_{i=1}^n f0(Xi) ,

and rejects when L(n) > 1/α . Can also use a prior/mixture over θ1 .

Equivalently, define P(n) = 1/L(n) . Then P(n) is an always-valid
p-value. (And inverting it defines a confidence sequence.)

Wald ‘48
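
A minimal sketch of the SPRT for a Bernoulli stream (the particular
θ0, θ1, and α below are illustrative assumptions):

```python
import numpy as np

def sprt(xs, theta0=0.5, theta1=0.6, alpha=0.05):
    # Returns (stopping time or None, always-valid p-value P = 1/L).
    log_L = 0.0
    for n, x in enumerate(xs, start=1):
        # Add log f1(x)/f0(x) for one Bernoulli observation x in {0, 1}.
        log_L += np.log(theta1 if x else 1 - theta1)
        log_L -= np.log(theta0 if x else 1 - theta0)
        if log_L > np.log(1 / alpha):          # L(n) > 1/alpha: reject H0
            return n, float(np.exp(-log_L))
    return None, min(1.0, float(np.exp(-log_L)))

rng = np.random.default_rng(2)
print(sprt(rng.random(5000) < 0.6))  # typically stops within a few hundred tosses
```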


Can construct confidence sequences
(and hence always valid p-values)
in a wide variety of nonparametric settings
(eg: random variables that are
bounded, or subGaussian, or subexponential)

Howard, Ramdas, McAuliffe, Sekhon ’18


Solutions for these issues

Inner sequential process: Part I
“confidence sequence” for estimation,
also called “anytime confidence intervals”
(correspondingly, “always valid p-values” for testing)

Outer sequential process: Part II
“false coverage rate” for estimation
(correspondingly, “false discovery rate” for testing)

Modular solutions: fit well together. Part III
Many extensions to each piece
Part II

The OUTER Sequential Process


(a sequence of experiments)

[40 mins]
Quick recap of A/B testing

[Figure: clicks vs. misses for versions A and B, on a 0–800 scale.]

Null hypothesis:
A is at least as good as B.

Calculate p-value:
P = Pr(observed data or more extreme, assuming null is true)

Decision rule:
if P ≤ α, then we reject the null (“discovery”)
and change A to B, ensuring that the type-1 error is ≤ α .

A wrong rejection of the null is a false discovery,
and implies a bad change from A to B.
Reality: internet companies run thousands
of different (independent) A/B tests over time.

Decision rule (as experiments arrive over time):
P1 ≤ α ?  vs. Color
P2 ≤ α ?  vs. Size
P3 ≤ α ?  vs. Orientation
P4 ≤ α ?  vs. Style
P5 ≤ α ?  vs. Logo
…

Problem!
Run 10,000 different, independent A/B tests:
9,900 true nulls and 100 non-nulls.

With a type-1 error rate (per test) of 0.05,
we expect 0.05 × 9,900 = 495 false discoveries.

With power (per test) of 0.80,
we expect 0.80 × 100 = 80 true discoveries.

False discovery proportion:
FDP = # false discoveries / # discoveries = 495/575 ≈ 0.86 .

FDR = 𝔼[FDP]

Summary: FDR can be much larger than the per-test error rate
(even if hypotheses, tests, data are independent).
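
The slide's arithmetic, spelled out as a tiny sketch (expected counts,
not a simulation):

```python
n_nulls, n_signals = 9_900, 100
alpha_per_test, power = 0.05, 0.80
false_disc = alpha_per_test * n_nulls         # 495 expected false discoveries
true_disc = power * n_signals                 # 80 expected true discoveries
print(false_disc / (false_disc + true_disc))  # FDP = 495/575 ~ 0.86
```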
Given a possibly infinite sequence
of independent tests (p-values), can we
guarantee control of the FDR
in a fully online fashion?

Foster-Stine ’08
Aharoni-Rosset ’14
Javanmard-Montanari ’16
Ramdas-Yang-Wainwright-Jordan ’17
Ramdas-Zrnic-Wainwright-Jordan ’18
Tian-Ramdas ’19
The aim of online FDR procedures

Decision rule (now with a per-experiment level αt):
P1 ≤ α1 ?  vs. Color
P2 ≤ α2 ?  vs. Size
P3 ≤ α3 ?  vs. Orientation
P4 ≤ α4 ?  vs. Style
P5 ≤ α5 ?  vs. Logo
…

How do we set each error level αt
to control FDR at any time?
One of the most famous offline FDR methods
is the “Benjamini-Hochberg” (BH) method.

Offline FDR methods do not control the FDR in online settings.

Benjamini-Hochberg ’95
The following method is not a valid online FDR algorithm:
at the end of experiment t, run BH on P1, …, Pt .

The reason is that the decision about the first hypothesis
then depends on all future hypotheses: we cannot commit
to a decision and stick to it.

We need the error level αt for experiment t
to be specified when it starts, and we need to make
a final decision when experiment t ends.
This multiple testing issue
is not particular to p-values.
It also exists when selectively
reporting treatment effects
with confidence intervals.

Benjamini, Yekutieli ’05


Weinstein, Ramdas ’19
Multiplicity in reported CIs

One rarely cares about, or follows up on, all CIs;
usually only the most “promising” CIs are reported.

False coverage proportion:
FCP = # incorrectly reported CIs / # reported CIs

False coverage rate:
FCR = 𝔼[FCP]

Benjamini-Yekutieli ’06
Weinstein-Yekutieli ’14
Fithian et al. ’14
Controlling FCR is nontrivial

Constructing marginal 95% CIs for all parameters
fails to control FCR at 0.05:

Suppose the treatment effect is θj ∈ {±0.1} for all j,
and the experimental observations are normalized so that
Xj ∼ N(θj, 1) .

Suppose we only care about drugs with large effects,
so we pursue phase II of the trial only if Xj > 3 .

For these drugs, the standard marginal 95% CI,
Xj ± 1.96, never covers θj . So FCR = 1.
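
A minimal simulation of this example (the sample size and seed are
illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = rng.choice([-0.1, 0.1], size=1_000_000)
x = theta + rng.standard_normal(theta.size)
sel = x > 3                                    # report only "promising" drugs
covered = (x[sel] - 1.96 <= theta[sel]) & (theta[sel] <= x[sel] + 1.96)
print(sel.sum(), 1 - covered.mean())           # FCP of the reported CIs = 1.0
```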
Can we control FCR *online*?

When experiment j starts, we must assign
a target confidence level αj .
When experiment j ends, we must decide
if we wish to report θj .
This must be done such that the FCR
is controlled at any time.

Weinstein, Ramdas ‘19

A simple solution for both
online FDR control and
online FCR control
Online FCR control: the main idea

Let Si 2 {0, 1} denote the selection decision


made after experiment i. <latexit sha1_base64="GOHWK6b8TQqolUAuucVsdwvzEVA=">AAACRHicbVA9T8MwEHX4pnwVGFlOtEgMqEpYYESwMDCAoAWpqSrHuVALx45sB1FF/XEs/AA2fgELAwixIpy2A18nWXp67+7d+UWZ4Mb6/pM3MTk1PTM7N19ZWFxaXqmurrWMyjXDJlNC6auIGhRcYtNyK/Aq00jTSOBldHNU6pe3qA1X8sL2M+yk9FryhDNqHdWttkOpuIxRWjhBC/XzLoeQSwgLfycIB3VwkrIItofgtiArxxzJeGkZhlBJaYxAE4sa8C5DzdPSrM7rjW615jf8YcFfEIxBjYzrtFt9DGPF8tKACWpMO/Az2ymotpwJHFTC3GBG2Q29xraDkqZoOsUwhAFsOSaGRGn33AFD9vtEQVNj+mnkOlNqe+a3VpL/ae3cJvudgssstyjZaFGSC7AKykQh5trFIvoOUKa5uxVYj2rKXCSm4kIIfn/5L2jtNgK/EZzt1g4Ox3HMkQ2ySbZJQPbIATkmp6RJGLknz+SVvHkP3ov37n2MWie88cw6+VHe5xcxvK/J</latexit>

PT
[ i=1 ↵ i
Maintain FCP(T ) := PT  ↵.
1 _ i=1 Si
<latexit sha1_base64="h2Dsi8pb9WFrVY+BGXU36brflXA=">AAACVnicbVFNa9wwEJWdpkm3X0567EV0KaSXxS6BhkAgNFB6KWzpbhJYb81YO86KSLIjjdMsxn+yvbQ/pZdS7a4PTdKBgcd7b0bSU14p6SiOfwXhxoPNh1vbj3qPnzx99jza2T11ZW0FjkWpSnueg0MlDY5JksLzyiLoXOFZfnmy1M+u0TpZmhEtKpxquDCykALIU1mkU8Ibaj6BNOSbtzz9Jmc4B2rWyoeTYdvujd4cHqWFBdGkrtZZI4+S9uuIp6CqOWSSt03C02tEfkv+kkm/T+FVZxxkUT8exKvi90HSgT7raphF39NZKWqNhoQC5yZJXNG0AUtSKGx7ae2wAnEJFzjx0IBGN21WsbT8tWdmvCitb0N8xf470YB2bqFz79RAc3dXW5L/0yY1FQfTRpqqJjRifVBRK04lX2bMZ9KiILXwAISV/q5czMGnR/4nej6E5O6T74PTt4MkHiSf9/vH77s4ttlL9ortsYS9Y8fsIxuyMRPsB/sdhMFG8DP4E26GW2trGHQzL9itCqO/IGOz7g==</latexit>
Online FCR control: the main idea

Let Si 2 {0, 1} denote the selection decision


made after experiment i. <latexit sha1_base64="GOHWK6b8TQqolUAuucVsdwvzEVA=">AAACRHicbVA9T8MwEHX4pnwVGFlOtEgMqEpYYESwMDCAoAWpqSrHuVALx45sB1FF/XEs/AA2fgELAwixIpy2A18nWXp67+7d+UWZ4Mb6/pM3MTk1PTM7N19ZWFxaXqmurrWMyjXDJlNC6auIGhRcYtNyK/Aq00jTSOBldHNU6pe3qA1X8sL2M+yk9FryhDNqHdWttkOpuIxRWjhBC/XzLoeQSwgLfycIB3VwkrIItofgtiArxxzJeGkZhlBJaYxAE4sa8C5DzdPSrM7rjW615jf8YcFfEIxBjYzrtFt9DGPF8tKACWpMO/Az2ymotpwJHFTC3GBG2Q29xraDkqZoOsUwhAFsOSaGRGn33AFD9vtEQVNj+mnkOlNqe+a3VpL/ae3cJvudgssstyjZaFGSC7AKykQh5trFIvoOUKa5uxVYj2rKXCSm4kIIfn/5L2jtNgK/EZzt1g4Ox3HMkQ2ySbZJQPbIATkmp6RJGLknz+SVvHkP3ov37n2MWie88cw6+VHe5xcxvK/J</latexit>

PT
[ i=1 ↵ i
Maintain FCP(T ) := PT  ↵.
1 _ i=1 Si
<latexit sha1_base64="h2Dsi8pb9WFrVY+BGXU36brflXA=">AAACVnicbVFNa9wwEJWdpkm3X0567EV0KaSXxS6BhkAgNFB6KWzpbhJYb81YO86KSLIjjdMsxn+yvbQ/pZdS7a4PTdKBgcd7b0bSU14p6SiOfwXhxoPNh1vbj3qPnzx99jza2T11ZW0FjkWpSnueg0MlDY5JksLzyiLoXOFZfnmy1M+u0TpZmhEtKpxquDCykALIU1mkU8Ibaj6BNOSbtzz9Jmc4B2rWyoeTYdvujd4cHqWFBdGkrtZZI4+S9uuIp6CqOWSSt03C02tEfkv+kkm/T+FVZxxkUT8exKvi90HSgT7raphF39NZKWqNhoQC5yZJXNG0AUtSKGx7ae2wAnEJFzjx0IBGN21WsbT8tWdmvCitb0N8xf470YB2bqFz79RAc3dXW5L/0yY1FQfTRpqqJjRifVBRK04lX2bMZ9KiILXwAISV/q5czMGnR/4nej6E5O6T74PTt4MkHiSf9/vH77s4ttlL9ortsYS9Y8fsIxuyMRPsB/sdhMFG8DP4E26GW2trGHQzL9itCqO/IGOz7g==</latexit>

Weinstein & Ramdas ’19


Online FCR control: the main idea

Let Si 2 {0, 1} denote the selection decision


made after experiment i. <latexit sha1_base64="GOHWK6b8TQqolUAuucVsdwvzEVA=">AAACRHicbVA9T8MwEHX4pnwVGFlOtEgMqEpYYESwMDCAoAWpqSrHuVALx45sB1FF/XEs/AA2fgELAwixIpy2A18nWXp67+7d+UWZ4Mb6/pM3MTk1PTM7N19ZWFxaXqmurrWMyjXDJlNC6auIGhRcYtNyK/Aq00jTSOBldHNU6pe3qA1X8sL2M+yk9FryhDNqHdWttkOpuIxRWjhBC/XzLoeQSwgLfycIB3VwkrIItofgtiArxxzJeGkZhlBJaYxAE4sa8C5DzdPSrM7rjW615jf8YcFfEIxBjYzrtFt9DGPF8tKACWpMO/Az2ymotpwJHFTC3GBG2Q29xraDkqZoOsUwhAFsOSaGRGn33AFD9vtEQVNj+mnkOlNqe+a3VpL/ae3cJvudgssstyjZaFGSC7AKykQh5trFIvoOUKa5uxVYj2rKXCSm4kIIfn/5L2jtNgK/EZzt1g4Ox3HMkQ2ySbZJQPbIATkmp6RJGLknz+SVvHkP3ov37n2MWie88cw6+VHe5xcxvK/J</latexit>

PT
[ i=1 ↵ i
Maintain FCP(T ) := PT  ↵.
1 _ i=1 Si
<latexit sha1_base64="h2Dsi8pb9WFrVY+BGXU36brflXA=">AAACVnicbVFNa9wwEJWdpkm3X0567EV0KaSXxS6BhkAgNFB6KWzpbhJYb81YO86KSLIjjdMsxn+yvbQ/pZdS7a4PTdKBgcd7b0bSU14p6SiOfwXhxoPNh1vbj3qPnzx99jza2T11ZW0FjkWpSnueg0MlDY5JksLzyiLoXOFZfnmy1M+u0TpZmhEtKpxquDCykALIU1mkU8Ibaj6BNOSbtzz9Jmc4B2rWyoeTYdvujd4cHqWFBdGkrtZZI4+S9uuIp6CqOWSSt03C02tEfkv+kkm/T+FVZxxkUT8exKvi90HSgT7raphF39NZKWqNhoQC5yZJXNG0AUtSKGx7ae2wAnEJFzjx0IBGN21WsbT8tWdmvCitb0N8xf470YB2bqFz79RAc3dXW5L/0yY1FQfTRpqqJjRifVBRK04lX2bMZ9KiILXwAISV/q5czMGnR/4nej6E5O6T74PTt4MkHiSf9/vH77s4ttlL9ortsYS9Y8fsIxuyMRPsB/sdhMFG8DP4E26GW2trGHQzL9itCqO/IGOz7g==</latexit>

For testing, let Ri ∈ {0,1} represent a rejection.

Then maintain FDP ̂(T ) :=


T
∑i=1 αi
T
≤ α.
1 ∨ ∑i=1 Ri

Weinstein & Ramdas ’19


Online FCR control: the main idea

Let Si ∈ {0,1} denote the selection decision
made after experiment i.

Maintain FCP̂(T) := (∑i≤T αi) / (1 ∨ ∑i≤T Si) ≤ α.

For testing, let Ri ∈ {0,1} represent a rejection.

Then maintain FDP̂(T) := (∑i≤T αi) / (1 ∨ ∑i≤T Ri) ≤ α.

This provably controls FCR/FDR at level α.          Weinstein & Ramdas ’19
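To make the update rule concrete, here is a minimal Python sketch (an illustration of the bookkeeping, not the authors' reference code): before each experiment, compute the largest budget αi that keeps FCP̂ ≤ α even if the experiment makes no selection.

class OnlineFCRBudget:
    """Sketch of the FCP-hat tracking rule; `alpha` is the target FCR level."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.alpha_spent = 0.0   # running sum of the alpha_i handed out
        self.num_selected = 0    # running sum of the selections S_i

    def max_next_budget(self) -> float:
        # Largest alpha_i keeping FCP-hat <= alpha even if S_i turns out 0:
        # (alpha_spent + alpha_i) / max(1, num_selected) <= alpha.
        return self.alpha * max(1, self.num_selected) - self.alpha_spent

    def record(self, alpha_i: float, selected: bool) -> None:
        assert alpha_i <= self.max_next_budget() + 1e-12
        self.alpha_spent += alpha_i
        self.num_selected += int(selected)

Spending less than the maximum leaves budget for later experiments, and each new selection raises the denominator and so earns back α of budget, which is exactly the earn/spend picture on the next slide.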


Online FCR control : high-level picture

[Figure: the remaining error budget, or “alpha-wealth”, plotted over a
sequence of experiments with budgets α1, …, α6. Annotations: α1 is the
error budget for the first expt., α2 for the second expt.; expts. use
wealth; selections earn wealth; each error budget is data-dependent;
the process is infinite.]
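The earn/spend dynamics above can be simulated in a few lines; the spend-half-the-wealth policy and the coin-flip selection outcomes below are purely illustrative assumptions.

import random

random.seed(1)
alpha, spent, selections = 0.05, 0.0, 0
for i in range(1, 9):
    wealth = alpha * max(1, selections) - spent   # remaining error budget
    budget = wealth / 2                           # hypothetical spending policy
    spent += budget
    selected = random.random() < 0.3              # stand-in for expt. outcome
    selections += int(selected)
    print(f"expt {i}: alpha_{i}={budget:.4f}, selected={selected}, "
          f"wealth={alpha * max(1, selections) - spent:.4f}")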
Summary of this section

∀t, errorₜ ≤ α does not imply ∀t, FDR(t) ≤ α,
even if hypotheses, data, p-values are independent.

Can track a running estimate of the FDP (or FCP):
a simple update rule to keep this estimate bounded
also results in the FDR (or FCR) being controlled.
Handling local dependence

Most online FDR algorithms assume independent p-values
(but hypotheses can be dependent).

However, assuming arbitrary dependence between all p-values
is also extremely pessimistic and unrealistic.

A middle ground is a flexible notion of local dependence:
Pt arbitrarily depends on the previous Lt p-values,
where Lt is a user-chosen lag parameter.

The online FDR and FCR algorithms can be easily modified
to handle local dependence, as sketched below.
Zrnic, Ramdas, Jordan ’18
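A schematic of the modification (a paraphrase of the idea, not the exact rule from the paper): under lag-Lt dependence, compute the budget for test t as if the most recent Lt outcomes were still unknown.

def lagged_budget(alpha, spent_alphas, rejections, lag):
    # Budget for test t = len(spent_alphas) + 1 under lag-`lag` local
    # dependence: only rejections from tests finished at least `lag` steps
    # ago may influence the new level (a conservative schematic of the
    # modification in Zrnic, Ramdas, Jordan '18).
    t = len(spent_alphas) + 1
    visible = rejections[: max(0, t - 1 - lag)]   # hide the last `lag` outcomes
    return alpha * max(1, sum(visible)) - sum(spent_alphas)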
Solutions for these issues

Inner sequential process (Part I):
“confidence sequence” for estimation,
also called “anytime confidence intervals”
(correspondingly, “always valid p-values” for testing)

Outer sequential process (Part II):
“false coverage rate” for estimation
(correspondingly, “false discovery rate” for testing)

Modular solutions that fit well together (Part III):
many extensions to each piece
PART III: Putting the modular pieces together
(the doubly-sequential process)

[Next 10 mins]
Combining inner and outer solutions (FCR):
(a) Online FCR method assigns αi when expt. starts
(b) We keep track of a (1 − αi) confidence sequence
(c) Adaptively decide to stop, to report final CI or not
(d) Guarantee FCR(T) ≤ α at any positive time T

[Figure: confidence sequences for several experiments, run at levels
α1, …, α5, plotted against time / samples.]
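A minimal sketch of steps (a) through (c) for a single experiment, assuming the paired A/B differences lie in [−1, 1]; the Hoeffding-plus-union-bound CS below is far looser than the mixture CSs used in practice (e.g. in the confseq package), and the stopping rule is just one possible choice.

import math

def hoeffding_cs(mean_n, n, delta):
    # Anytime-valid CI for the mean of [-1, 1]-valued data: Hoeffding at
    # each n, union bound over all n with delta_n = delta / (n * (n + 1)).
    radius = math.sqrt(2 * math.log(2 * n * (n + 1) / delta) / n)
    return mean_n - radius, mean_n + radius

def run_one_experiment(alpha_i, stream, max_n=10_000):
    # Inner sequential process: x is the paired A/B difference in [-1, 1].
    # Stop adaptively; select (report the CI) iff it excludes zero.
    total = 0.0
    for n, x in enumerate(stream, start=1):
        total += x
        lo, hi = hoeffding_cs(total / n, n, alpha_i)
        if lo > 0 or hi < 0 or n == max_n:
            return (lo, hi), (lo > 0 or hi < 0)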
Combining inner and outer solutions (FDR):
(a) Online FDR method assigns αi when expt. starts
(b) We keep track of the anytime p-value Pi(n)
(c) Adaptively stop at time τ, report discovery if Pi(τ) ≤ αi
(d) Guarantee FDR(T) ≤ α at any positive time T

[Figure: anytime p-values for several experiments, run at levels
α1, …, α5, plotted against time / samples.]
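The FDR analogue swaps the CS for an anytime p-value; the crude union-bounded Hoeffding construction below (for H0: μ ≤ 0, with differences in [−1, 1]) is only meant to show the plumbing.

import math

def run_one_test(alpha_i, stream, max_n=10_000):
    # Anytime p-value for H0: mu <= 0: raw one-sided Hoeffding p-value at
    # each n, inflated by n * (n + 1) (union bound over n); the running
    # minimum is still an anytime-valid p-value.
    total, p = 0.0, 1.0
    for n, x in enumerate(stream, start=1):
        total += x
        raw = math.exp(-n * max(0.0, total / n) ** 2 / 2)
        p = min(p, 1.0, n * (n + 1) * raw)
        if p <= alpha_i or n == max_n:        # adaptive stopping time tau
            return p <= alpha_i               # discovery iff P(tau) <= alpha_i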
PART IV: Advanced topics
(inner sequential process)

[Next 25 mins]
1. What if we are testing more than one alternative?

Much more traffic is needed by an A/B/n test.


1. Multi-armed bandits for hypothesis testing

What would you do?

Depends on the aim: minimize regret OR identify the best arm?

We would like to test the null hypothesis
H0 : μA ≥ max{μB, μC} .

Can design variants of UCB algorithms to define an anytime p-value,
with optimal sample complexity for high power.
Yang, Ramdas, Jamieson, Wainwright ’17
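One valid (though not sample-optimal) instantiation, assuming each pull(arm) returns a reward in [0, 1]: keep a time-uniform confidence bound per arm and reject H0 as soon as the upper bound for A falls below another arm's lower bound. Round-robin sampling keeps the sketch short; validity holds for any sampling rule because the bounds are time-uniform, while the paper's adaptive allocation is what achieves optimal sample complexity.

import math

def anytime_ci(total, n, delta):
    # Time-uniform Hoeffding CI for a [0, 1]-valued arm mean
    # (union bound over n with delta_n = delta / (n * (n + 1))).
    r = math.sqrt(math.log(2 * n * (n + 1) / delta) / (2 * n))
    return total / n - r, total / n + r

def test_A_is_best(pull, delta=0.05, max_pulls=100_000):
    arms = ["A", "B", "C"]
    totals = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    for t in range(max_pulls):
        a = arms[t % 3]                       # round-robin sampling
        totals[a] += pull(a)
        counts[a] += 1
        if min(counts.values()) == 0:
            continue
        lo, hi = {}, {}
        for b in arms:                        # delta/3: union over the 3 arms
            lo[b], hi[b] = anytime_ci(totals[b], counts[b], delta / 3)
        if hi["A"] < max(lo["B"], lo["C"]):
            return True                       # reject H0: A provably not best
    return False                              # never rejected H0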
[Figure: the MAB-FDR meta-algorithm. An online FDR procedure with desired
FDR level α assigns level αj to experiment j; each experiment runs a MAB
that outputs an anytime p-value pj(αj); the test rejects if pj < αj, and
the decision Rj(αj) is fed back into the online FDR procedure.]
2. Switch from estimating means to quantiles?

Let X ∼ F . Define the α-quantile as qα := sup{x : F(x) ≤ α} .
(hence q1/2 is the median)

Reasons to use quantiles include:
• Quantiles always exist for any distribution,
  while means (moments) do not always exist (eg: Cauchy).
• Quantiles can be defined for any totally ordered space,
  eg: ratings A-F, where “distance between ratings” is undefined.
• Estimating quantiles can be done sequentially, without
  any tail assumptions, unlike estimating means.
• Can run A/B tests and get always valid p-values
  for testing the difference in quantiles.
• Can run bandit experiments, including best-arm identification.
• Can also estimate all quantiles simultaneously!
Howard, Ramdas ‘19
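A conservative sketch of a quantile confidence sequence: DKW at each n plus a union bound over n (the boundaries in Howard & Ramdas '19 are considerably tighter). Because DKW is uniform in x, the same ε covers every quantile at once, which is what enables the last bullet.

import math

def quantile_cs(xs_sorted, alpha, delta):
    # Anytime-valid CI for the alpha-quantile after n = len(xs_sorted)
    # observations: DKW at each n, union bound via delta_n = delta/(n(n+1)).
    n = len(xs_sorted)
    eps = math.sqrt(math.log(2 * n * (n + 1) / delta) / (2 * n))
    lo_idx = math.ceil(n * (alpha - eps))       # 1-indexed order statistics
    hi_idx = math.floor(n * (alpha + eps)) + 1
    lower = xs_sorted[lo_idx - 1] if lo_idx >= 1 else -math.inf
    upper = xs_sorted[hi_idx - 1] if hi_idx <= n else math.inf
    return lower, upper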
2. Quantiles are informative for heavy tails

[Figure: a probability density with a heavy right tail, so that the
mean = +∞, with quantiles q0, q1/2, q0.8 marked on the axis of possible
observations. Eg: amount of time spent on Reddit.]
2. The mean need not even exist (eg: Cauchy)

[Figure: a probability density with heavy left and right tails, so that
the mean is undefined, with quantiles q0.3, q1/2, q0.8 marked.
Eg: amount of money won/lost in a casino.]

Do not need to resort to trimming “outliers”.
(How to pick the threshold? Throw away or cap?)
2. The same could arise in discrete settings

[Figure: a probability mass function with pk ∝ 1/k², so that the
mean = +∞, with quantiles q1/2, q0.8 marked.
Eg: number of links clicked.]
2. Quantile sensible in totally ordered settings

[Figure: a probability mass function over ordered ratings
A < B < C < D < E < …, where the mean is undefined, with quantiles
q1/2, q0.8 marked. Eg: grades or non-numerical ratings.]

Do not need to artificially assign numerical values.
(Are they equally spaced? Spacing and start point matter.)
2. A/B testing with quantiles

First pick a target quantile α (say 0.9).

H0 : q0.9(A) = q0.9(B)
H1 : q0.9(A) < q0.9(B)

Can construct an always valid p-value.

If numerical, can construct a confidence sequence for q0.9(B) − q0.9(A) .

(In that case, one way to define a sequential p-value is the smallest δ
for which the (1 − δ) CS excludes ℝ−0 , the nonpositive reals.)
Howard, Ramdas ‘19
2. Best-arm identification with quantiles

Which arm has the highest 80% quantile?

Can design MAB algorithms to adaptively determine
the “best” arm with a prescribed failure probability.

If the first arm is “special”, can design MAB algorithms to adaptively
test the null hypothesis that A is best, and get a sequential p-value.
Howard, Ramdas ‘19
3. Running intersections or minimums: pros/cons

Fact 1: if P(n) is an anytime p-value, so is min_{m≤n} P(m) .

Fact 2: if (L(n), U(n)) is a confidence sequence, so is the running
intersection (max_{m≤n} L(m), min_{m≤n} U(m)) .

Pro of taking running intersections of CIs:
Smaller width, hence tighter inference, without inflating error.

Con of taking running intersections of CIs:
Can have intervals of decreasing width (great!) and then in the
next step, end up with an empty interval (disconcerting).

Pro of ending up with zero width:
“Failing loudly”: you know you’re in the low-probability error
event, or assumptions have been violated.
Howard, Ramdas, McAuliffe, Sekhon ’19
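Both facts are one-liners in code, and the empty-interval check is exactly where the “failing loudly” behavior comes from; this sketch is agnostic to how the per-step CSs were built.

def running_intersection(cs_stream):
    # Intersect a stream of (lower, upper) confidence intervals. Widths can
    # only shrink; an empty intersection means we are in the low-probability
    # error event, or an assumption (iid-ness, boundedness, ...) is violated.
    lo, hi = -float("inf"), float("inf")
    for l, u in cs_stream:
        lo, hi = max(lo, l), min(hi, u)
        if lo > hi:
            raise RuntimeError("Empty confidence sequence: failing loudly.")
        yield lo, hi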
4. Sequential Average Treatment Effect estimation
with adaptive randomization

[Figure: users of an app or website are randomized between A and B;
the 50%/50% split can change with time (eg: to keep groups balanced).]

Can infer the treatment effect sequentially (Neyman-Rubin
potential outcomes model) using an anytime p-value or CI.
Howard, Ramdas, McAuliffe, Sekhon ’19

PART V: Advanced topics
(outer sequential process)

[Next 15 mins]
1. Smoothly forgetting the past

Recent tests may be more relevant than older ones,
and hence we may wish to smoothly forget the past.

With this motivation, we may wish to control the
decaying memory FDR: (user-chosen decay d < 1)

mem-FDR(T) = 𝔼[ ∑t≤T d^(T−t) 1(false discoveryₜ) / ∑t≤T d^(T−t) 1(discoveryₜ) ]

(similarly mem-FCR)
Ramdas et al. ’17
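A decaying-memory analogue of the FDP̂ estimator from Part II might be computed as below (a sketch of the idea only; see Ramdas et al. '17 for the actual mem-FDR algorithms).

def mem_fdp_hat(alphas, rejections, d=0.99):
    # Exponentially down-weight old tests in both numerator and denominator;
    # with d = 1 this reduces to the usual FDP-hat.
    T = len(alphas)
    num = sum(d ** (T - t) * a for t, a in enumerate(alphas, start=1))
    den = sum(d ** (T - t) * r for t, r in enumerate(rejections, start=1))
    return num / max(den, 1.0)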
2. Post-hoc analysis

What if you did not use an online FCR or FDR algorithm,
but at the end of the year, you would like to answer
“based on the decisions made and error levels used,
how large could my FCR or FDR be?”

With probability at least 1 − δ we have

FDPt ≤ [(1 + ∑i≤t αi) / (∑i≤t Ri)] ⋅ [log(1/δ) / log(1 + log(1/δ))]
simultaneously for all t .

FCPt ≤ [(1 + ∑i≤t αi) / (∑i≤t Si)] ⋅ [log(1/δ) / log(1 + log(1/δ))]
simultaneously for all t .
Katsevich, Ramdas ’18
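The bound is a closed-form function of the levels used and the decisions made, so the year-end audit takes a few lines:

import math

def posthoc_fdp_bound(alphas, rejections, delta=0.05):
    # Uniform-in-t FDP bound of Katsevich & Ramdas '18, evaluated at the
    # current t; swap `rejections` for the selections S_i to bound the FCP.
    c = math.log(1 / delta) / math.log(1 + math.log(1 / delta))
    return c * (1 + sum(alphas)) / max(1, sum(rejections))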
3. Weighted error metrics and algorithms

The usual error metrics count all mistakes equally.
But different experiments may have differing importances.

Can define “weighted” variants of FDR and FCR
in the natural way: weighted sums in numerator/denominator.

Online FDR and FCR algorithms can be extended to
control weighted error metrics.
Benjamini, Hochberg ’97
Ramdas, Yang, Jordan, Wainwright ’17
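Concretely, the weighted false discovery proportion referred to here could be computed as below (an oracle quantity, since “false” is unknown in practice; the weights wi are user-chosen importances).

def weighted_fdp(weights, discoveries, is_false):
    # Weighted sums in numerator and denominator: a high-stakes experiment
    # counts more both as a potential error and as a discovery. The max(.,1)
    # guard mirrors the 1-or-denominator convention of the unweighted FDP.
    num = sum(w for w, d, f in zip(weights, discoveries, is_false) if d and f)
    den = sum(w for w, d, f in zip(weights, discoveries, is_false) if d)
    return num / max(den, 1.0)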
4. False-sign rate

Sometimes, all we want is a “sign decision” about a parameter:
an output of +1 if the treatment effect is +ve,
an output of −1 if the treatment effect is −ve,
or no output at all if it is uncertain.

We may correspondingly define the false sign rate as

FSR := 𝔼[ (# incorrect sign decisions made) / (# sign decisions made) ]

To control the FSR, just use the online FCR algorithm,
and report the sign iff the CI does not contain zero.
Weinstein, Ramdas ’19
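In code, the reduction from FSR control to FCR control is just the reporting rule applied to each experiment's final FCR-adjusted CI:

def sign_decision(ci):
    # Report a sign only if the FCR-corrected CI excludes zero; abstaining
    # (returning None) is what keeps the false sign rate under control.
    lo, hi = ci
    if lo > 0:
        return +1
    if hi < 0:
        return -1
    return None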
Open Problems [5 mins]
1. Errors and incentives in large organizations

A large number of different teams run
such A/B tests or randomized experiments.

From the larger organization’s perspective,
coordination is necessary to control FDR or FCR,
since these errors might affect the bottom line of the company.

But each individual group or team might feel:
“why do we have to pay if some other group
is running lots of random tests/experiments?”

How do we align incentives?
Should our notion of error be hierarchical?
1. A hierarchical FDR or FCR control?

[Figure: a company desiring FDR ≤ 0.1, split across Product 1 (Group 1),
Product 2 (Group 2), …, Product 15 (Group 15).]

The average of group FDRs does not give the company FDR.

FDR is additive in the worst case: if each group separately
controls FDR at 0.1, the company-level FDR bound could be trivial.
2. Utilizing contextual information

Often, we have contextual information about
each visitor (sample), like age, gender, etc.
These have been utilized for contextual
bandit algorithms that minimize regret.

Is such information useful for hypothesis testing?
How do we use contextual bandits for hypothesis testing?
3. Designing systems that fail loudly

When our assumptions are wrong, and the system
is not behaving as intended or expected, how
can we automatically detect and report this?

Is it possible to design such self-critical systems
that “announce” failures?
Summary [15 mins]
A selective history (inner process)

Fisher (1925)
null hypothesis testing basics,
randomization for causal inference

Wald (1948)
sequential probability ratio test
(the first always-valid p-values)

Robbins (1952)
multi-armed bandits

Darling & Robbins (1967)
confidence sequences
(the first always-valid CIs)

Lai, Siegmund, … (1970s)
confidence sequences, inference
after stopping experiments

Jennison & Turnbull (1980s)
group sequential methods
(peeking only 2 or 3 times)
A selective history (outer process)

Tukey (1953)
an unpublished book on the
problem of multiple comparisons

Eklund & Seeger (1963)
defined the false discovery proportion,
suggested a heuristic algorithm

Benjamini & Hochberg (1995)
rediscovered the Eklund-Seeger method,
first proof of FDR control

Benjamini & Yekutieli (2005)
false coverage rate (FCR),
first methods to control it

Foster & Stine (2008)
conceptualized online FDR control,
first method to control it
In this tutorial, you learnt the basics of

• How to think about a single experiment
  A. Why peeking is an issue in practice
  B. Why applying a t-test repeatedly inflates errors
  C. Anytime confidence intervals and p-values

• How to think about a sequence of experiments
  A. Why selective reporting is an issue in practice
  B. Why Benjamini-Hochberg fails in the online setting
  C. Online FCR and FDR controlling algorithms

• How to think about doubly-sequential experimentation
  A. Using anytime CIs with online FCR control
  B. Using anytime p-values with online FDR control
  C. Handling asynchronous tests with local dependence
You also learnt some advanced topics:

• Within a single experiment:
  A. Using bandits for hypothesis testing
  B. Quantiles can be estimated sequentially
  C. The pros and cons of running intersections
  D. SATE with adaptive randomization

• Across experiments:
  A. Error metrics with decaying memory
  B. The false sign rate
  C. Weighted error metrics
  D. Post-hoc analysis

• Open problems:
  A. Incentives/errors within hierarchical organizations
  B. Utilizing contextual information for testing
  C. Designing systems that fail loudly
SOFTWARE

• Within a single experiment:
  Python package called “confseq”
  Maintained by Steve Howard (Berkeley)
  Frequent updates + wrappers for months to come

• Across experiments:
  R package called “onlineFDR”
  Maintained by David Robertson (Cambridge)
  Frequent updates + wrappers for months to come

References and links at
www.stat.cmu.edu/~aramdas/kdd19/
Collaborators from this talk

Steve Howard, Jinjin Tian, Asaf Weinstein, Eugene Katsevich,
Akshay Balsubramani, Tijana Zrnic, David Robertson

Jasjeet Sekhon, Jon McAuliffe, Kevin Jamieson, Fanny Yang,
Martin Wainwright, Michael Jordan
Foundations of large-scale
“doubly-sequential” experimentation

(KDD tutorial in Anchorage, on 4 Aug 2019)

Aaditya Ramdas
Assistant Professor
Dept. of Statistics and Data Science
Machine Learning Dept.
Carnegie Mellon University

www.stat.cmu.edu/~aramdas/kdd19/

Funding welcomed! Thank you! Questions?
