
Foundations of large-scale

“doubly-sequential” experimentation

(KDD tutorial in Anchorage, on 4 Aug 2019)

Aaditya Ramdas
Assistant Professor
Dept. of Statistics and Data Science
Machine Learning Dept.
Carnegie Mellon University

www.stat.cmu.edu/~aramdas/kdd19/
A/B-testing : tech :: clinical trials : pharma

Large-scale A/B-testing and other related forms of randomized
experimentation have revolutionized the tech industry in the last 15 years.

In 2013, a team from Microsoft (Bing) claimed that they run tens
of thousands of such experiments, leading to millions of dollars in
increased revenue.

Much has been discussed about doing A/B testing the “right” way,
both theoretically and practically in real-world systems.

Many companies are contributing to this vast and growing literature.

Kohavi et al. ’13

Audience poll (for the speaker)

How many of you have written papers on A/B testing
or online experimentation?
(or work in the area, or consider yourselves experts?)

How many of you have read papers on A/B testing
and know what it is, but want to know more?

How many have no idea what I’m talking about?


Users of app or website

[Figure: users are split 50% / 50% between version A and version B of a
“Sign up!” page.]

A: 44 conversions    B: 71 conversions    →  B wins!


What we will NOT cover today

Pre-experiment analysis and difference-in-differences
What makes a good metric? (directionality and sensitivity)
Combining metrics into an overall evaluation criterion
Time-series aspects: dealing with periodicity and trends
Causal inference in observational studies
Parametric methods for A/B testing (like SPRT and variants)
Bayesian A/B testing
Side effects and risks associated with running experiments
Deployment with controlled, phased rollouts
Pitfalls of long experiments (survivorship bias, perceived trends)
ML meets causal inference meets online experiments
Experimentation in marketplaces or with network effects
Ethical aspects of running experiments
There are many resources for these topics

Yandex tutorial at The Web Conference ’18

Microsoft tutorial at The Web Conference ’19 (+ ExP Platform webpage)

Blog posts by Evan Miller, Etsy, Optimizely, etc.


A new “doubly-sequential” perspective:
a sequence of sequential experiments

[Figure: experiments (Exp.) stacked on a vertical axis against
Time / Samples on the horizontal axis; each row is a sequential
A/B test comparing treatments against a common control.]

Zrnic, Ramdas, Jordan ‘18
Yang, Ramdas, Jamieson, Wainwright ‘17
What kind of guarantees would we like
for doubly sequential experimentation?

(a) inner sequential process (a single experiment)
— correct inference when the experiment ends
(correct p-values for the A/B test, or correct
confidence intervals for the treatment effect)

(b) outer sequential process (multiple experiments)
— less clear (is error control on the inner
process enough?!)
Some existing problems in practice

Some potential issues within each experiment:
(a) continuous monitoring
(b) flexible experiment horizon
(c) arbitrary stopping (or continuation) rules

Some potential issues across experiments:
(a) selection bias (multiplicity)
(b) dependence across experiments
(c) don’t know future outcomes

Many other concerns as well

Solutions for these issues

Inner sequential process: Part I
“confidence sequence” for estimation,
also called “anytime confidence intervals”
(correspondingly, “always valid p-values” for testing)

Outer sequential process: Part II
“false coverage rate” for estimation
(correspondingly, “false discovery rate” for testing)

Modular solutions: fit well together. Part III
Many extensions to each piece
Part I

The INNER Sequential Process


(a single experiment)

[1 hour]
The “duality” between
confidence intervals and p-values
Hypothesis testing is like
stochastic proof by contradiction.

[Figure: bar chart of 1000 tosses, counts of Tails vs. Heads on a
0–800 scale; Heads far outnumber Tails.]

Null hypothesis:
The coin is fair (bias = 0)

Alternative:
Coin is biased towards H

Apparent contradiction!
Should we reject the null hypothesis?
Calculate p-value

[Figure: the probability density over possible observations, ranging
from “all tails” to “all heads”; the p-value P is the tail area beyond
the observed data.]

Reject null if P ≤ α, which (via a Gaussian approximation) amounts to
#H − #T ≥ √(2N log(1/α)) .

Then, Pr(false positive) ≤ α .
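
As a quick sanity check, here is a minimal sketch (not from the slides;
it assumes numpy and scipy are available) that computes this p-value
exactly under the fair-coin null and compares it with the approximate
threshold:

```python
import numpy as np
from scipy.stats import binom

def coin_p_value(n_heads: int, N: int) -> float:
    # One-sided p-value: Pr(#H >= n_heads) under the Bin(N, 1/2) null.
    return binom.sf(n_heads - 1, N, 0.5)

N, n_heads, alpha = 1000, 550, 0.05
print(coin_p_value(n_heads, N))  # exact tail probability, ~0.0009
# Gaussian approximation: reject iff #H - #T >= sqrt(2 N log(1/alpha))
print(2 * n_heads - N >= np.sqrt(2 * N * np.log(1 / alpha)))  # True
```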


An equivalent view via confidence intervals

Estimate the coin bias by μ̂ := (#H − #T)/N .

An asymptotic (1 − α)-CI for μ is given by
( μ̂ − z_{1−α}/√N , μ̂ + z_{1−α}/√N ) ,
where z_{1−α} is the (1 − α)-quantile of N(0,1)
(appealing to the Central Limit Theorem).

If this confidence interval does not contain 0,
we may be reasonably confident that the coin is biased,
and we may reject the null hypothesis.

This again amounts (approximately) to #H − #T ≥ √(2N log(1/α)) .
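
A minimal sketch of this interval (the numbers are illustrative; assumes
numpy and scipy):

```python
import numpy as np
from scipy.stats import norm

def coin_bias_ci(n_heads: int, N: int, alpha: float = 0.05):
    mu_hat = (2 * n_heads - N) / N       # (#H - #T) / N
    z = norm.ppf(1 - alpha)              # (1 - alpha)-quantile of N(0,1)
    half_width = z / np.sqrt(N)
    return mu_hat - half_width, mu_hat + half_width

lo, hi = coin_bias_ci(550, 1000)         # ~ (0.048, 0.152): excludes 0,
print(lo, hi)                            # so reject H0: bias = 0
```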
For any parameter μ of interest,
with associated estimator μ̂,
the following claim holds:

[Figure: the real line ℝ, with the (1 − α) confidence interval around μ̂
nested inside the wider (1 − α/2) confidence interval.]

For H0 : μ = μ0 :
if μ0 lies outside the (1 − α) confidence interval for μ,
then the p-value satisfies Pμ0 ≤ α
(we would reject the null hypothesis at level α);
if μ0 lies inside the (1 − α/2) confidence interval,
then Pμ0 > α/2 .
In summary, tests (p-values) and CIs are “dual”.

family of tests for θ → CI for θ
A (1 − α)-CI for a parameter θ is the set of all θ0 such that
the test for H0 : θ = θ0 has p-value larger than α .

CI for θ → family of tests for θ
A p-value for testing the null H0 : θ = θ0 can be given by
the smallest q for which the (1 − q)-CI for θ fails to cover θ0 .

CI for θ → composite tests for θ
A p-value for testing the null H0 : θ ∈ Θ0 can be given by
the smallest q for which the (1 − q)-CI for θ fails to intersect Θ0 .

Both of them are useful tools to estimate uncertainty,
and like any other tool, they can be used well, or be misused.
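
As an illustration of the first direction (family of tests → CI), here
is a minimal sketch that inverts two-sided z-tests over a grid of
candidate values θ0; the unit-variance Gaussian setup is an assumption
for illustration:

```python
import numpy as np
from scipy.stats import norm

def ci_from_tests(x: np.ndarray, alpha: float = 0.05):
    grid = np.linspace(-2, 2, 4001)            # candidate nulls theta_0
    z = np.sqrt(len(x)) * (x.mean() - grid)    # z-stat for H0: theta = theta_0
    p = 2 * norm.sf(np.abs(z))                 # p-value of each test
    kept = grid[p > alpha]                     # the theta_0 we fail to reject...
    return kept.min(), kept.max()              # ...form the (1 - alpha)-CI

rng = np.random.default_rng(4)
print(ci_from_tests(rng.standard_normal(100) + 0.5))  # ~ (0.3, 0.7)
```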
However, commonly taught confidence intervals
and p-values are only valid (correctly control error)
if the sample size is fixed in advance.
High-level caricature of an A/B-test

Start
→ Collect more data (increase sample size)
→ “peek”: check if P(n) ≤ α
→ if not, go back and collect more (“optional continuation”)
→ if yes: Stop, Report (“optional stopping”)

With commonly-taught p-values,
false positive rate ≫ α .
[Figure: repeated snapshots of the same continuously-monitored
experiment:]

After 10 people
After 284 people
After 1214 people
After 2398 people
After 7224 people
After 11,219 people, STOP!


Let P(n) be a classical p-value (e.g., from a t-test),
calculated using the first n samples.

Under the null hypothesis (no treatment effect),

∀n ≥ 1, Pr(P(n) ≤ α) ≤ α .
(prob. of false positive)

Let τ be the stopping time of the experiment.
Often, τ depends on the data, e.g., τ := min{n ∈ ℕ : P(n) ≤ α} .

Unfortunately, Pr(P(τ) ≤ α) ≰ α .
In other words, Pr(∃n ∈ ℕ : P(n) ≤ α) ≫ α .
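
A minimal simulation of this inflation (an illustration, not from the
slides): Gaussian data under the null, with a classical z-test p-value
recomputed, and “peeked” at, after every sample:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, n_max, n_reps = 0.05, 1000, 2000
rejections = 0
for _ in range(n_reps):
    x = rng.standard_normal(n_max)           # null: mean 0, variance 1
    n = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(n)            # running z-statistic
    p = 2 * norm.sf(np.abs(z))               # classical p-value at each n
    rejections += (p <= alpha).any()         # stop at the first "win"
print("Pr(exists n <= 1000: P(n) <= alpha) ~", rejections / n_reps)  # >> 0.05
```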
Same problem with confidence interval (CI)

Start
→ Collect more data (increase sample size)
→ “peek”: check if 0 ∉ (1 − α) CI
→ if not, go back and collect more (“optional continuation”)
→ if yes: Stop, Report (“optional stopping”)

Again, false positive rate ≫ α .


Let (L(n), U(n)) be any classical (1 − α) CI,
calculated using the first n samples (e.g., via the CLT).

When trying to estimate the treatment effect θ,

∀n ≥ 1, Pr(θ ∈ (L(n), U(n))) ≥ 1 − α .
(prob. of coverage)

Let τ be the stopping time of the experiment.
Again, τ may depend on the data, e.g., τ := min{n ∈ ℕ : L(n) > 0} .

Unfortunately, Pr(θ ∈ (L(τ), U(τ))) ≱ 1 − α .

In other words, Pr(∀n ≥ 1 : θ ∈ (L(n), U(n))) ≪ 1 − α
(usually = 0).
Solution: “confidence sequence”
(aka “anytime confidence intervals”)

or “sequential p-values” for testing


(aka “always-valid p-values”)
A “confidence sequence” for a parameter θ
is a sequence of confidence intervals (Ln, Un)
with a uniform (simultaneous) coverage guarantee:

ℙ(∀n ≥ 1 : θ ∈ (Ln, Un)) ≥ 1 − α .

[Figure: the intervals (Ln, Un) shrinking around θ as the sample size
grows.]

Darling, Robbins ’67, ‘68
Lai ’76, ’84
Howard, Ramdas, McAuliffe, Sekhon ’18
Example: tracking the mean of a Gaussian
or Bernoulli from i.i.d. observations.

X1, X2, … ∼ N(θ, 1) or Ber(θ)

Producing a confidence interval at a fixed time
is elementary statistics (~100 years old).

How do we produce a confidence sequence?
(which is like a confidence band over time)
[Figure: for a fair coin, the empirical mean with pointwise CLT
confidence bounds versus an anytime (curved Hoeffding) boundary, as the
number of samples t grows on a log scale; a companion panel shows the
cumulative miscoverage probability, which keeps growing for the
pointwise CLT interval but stays below α for the anytime boundary.]
Eg: If Xi is 1-subGaussian, then

(∑_{i=1}^n Xi)/n ± 1.71 √( (log log(2n) + 0.72 log(5.19/α)) / n )

is a (1 − α) confidence sequence.

[Figure: uniform boundaries compared as a function of the intrinsic time
Vt — Jamieson et al. (2013), Balsubramani (2014), Zhao et al. (2016),
Darling & Robbins (1967b, 1968), Kaufmann et al. (2014), a normal
mixture boundary, and the polynomial stitching, inverted stitching, and
discrete mixture boundaries (ours) — against the pointwise Hoeffding
and CLT bounds.]
Howard, Ramdas, McAuliffe, Sekhon ’18
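
A minimal sketch of this boundary (assuming 1-subGaussian observations,
with the constants displayed above):

```python
import numpy as np

def stitched_cs(x: np.ndarray, alpha: float = 0.05):
    # Running mean +/- the stitched boundary; valid for all n simultaneously.
    n = np.arange(1, len(x) + 1)
    mean = np.cumsum(x) / n
    radius = 1.71 * np.sqrt((np.log(np.log(2 * n)) + 0.72 * np.log(5.19 / alpha)) / n)
    return mean - radius, mean + radius

rng = np.random.default_rng(1)
lo, hi = stitched_cs(rng.standard_normal(10_000) + 0.2)
print(lo[-1], hi[-1])   # brackets the true mean 0.2 (uniformly over n)
```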

ℙ( ⋃_{n∈ℕ} {θ ∉ (Ln, Un)} ) ≤ α .

Some implications:

1. Valid inference at any time, even stopping times:
for any stopping time τ : ℙ(θ ∉ (Lτ, Uτ)) ≤ α .

2. Valid post-hoc inference (in hindsight):
for any random time T : ℙ(θ ∉ (LT, UT)) ≤ α .

3. No pre-specified sample size:
can extend or stop experiments adaptively.
The same duality between
confidence intervals and p-values
also holds in the sequential setting:
“confidence sequences” are dual to
“always valid p-values”.
Duality between anytime p-value and CI

Define a set of null values ℋ0 for θ .

Let P(n) := inf{α : the (1 − α) CI(n) does not intersect ℋ0} .

If CI(n) is a pointwise CI, then P(n) is a classical p-value:
for all fixed times n, Pr(P(n) ≤ α) ≤ α .
(prob. of false positive)

If CI(n) is an anytime CI, then P(n) is an always-valid p-value:
for all stopping times τ, Pr(P(τ) ≤ α) ≤ α ;
for all data-dependent times T, Pr(P(T) ≤ α) ≤ α .
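
A minimal sketch of this inversion, applied to the stitched boundary
from the earlier snippet (that boundary is an assumption carried over):
solve for the smallest α at which the (1 − α) confidence sequence
excludes θ0, then take a running minimum, which stays valid because the
underlying coverage guarantee is uniform over n:

```python
import numpy as np

def always_valid_p(x: np.ndarray, theta0: float = 0.0) -> np.ndarray:
    n = np.arange(1, len(x) + 1)
    mean = np.cumsum(x) / n
    # Invert |mean - theta0| = 1.71*sqrt((loglog(2n) + 0.72*log(5.19/a))/n)
    # for a: the smallest level whose confidence sequence excludes theta0.
    gap = n * (mean - theta0) ** 2 / 1.71**2 - np.log(np.log(2 * n))
    p = np.clip(5.19 * np.exp(-gap / 0.72), 0.0, 1.0)
    return np.minimum.accumulate(p)

rng = np.random.default_rng(5)
print(always_valid_p(rng.standard_normal(2000) + 0.3)[-1])  # tiny under H1
```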
Relationship to the Sequential Probability Ratio Test

Given a stream of data X1, X2, … ∼ fθ, suppose
we want to test a null hypothesis H0 : θ = θ0
against an alternative hypothesis H1 : θ = θ1 .

Wald's SPRT (or SLRT) calculates a probability/likelihood ratio:

L(n) := ∏_{i=1}^n f1(Xi) / ∏_{i=1}^n f0(Xi) ,

and rejects when L(n) > 1/α . Can also use a prior/mixture over θ1 .

Equivalently, define P(n) = 1/L(n) . Then P(n) is an always-valid
p-value. (And inverting it defines a confidence sequence.)

Wald ‘48
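
A minimal sketch of the SPRT for a Bernoulli stream (the particular
θ0, θ1, and α below are illustrative assumptions):

```python
import numpy as np

def sprt(xs, theta0=0.5, theta1=0.6, alpha=0.05):
    # Returns (stopping time or None, always-valid p-value P = 1/L).
    log_L = 0.0
    for n, x in enumerate(xs, start=1):
        # Add log f1(x)/f0(x) for one Bernoulli observation x in {0, 1}.
        log_L += np.log(theta1 if x else 1 - theta1)
        log_L -= np.log(theta0 if x else 1 - theta0)
        if log_L > np.log(1 / alpha):          # L(n) > 1/alpha: reject H0
            return n, float(np.exp(-log_L))
    return None, min(1.0, float(np.exp(-log_L)))

rng = np.random.default_rng(2)
print(sprt(rng.random(5000) < 0.6))  # typically stops within a few hundred tosses
```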


Can construct confidence sequences
(and hence always valid p-values)
in a wide variety of nonparametric settings
(eg: random variables that are
bounded, or subGaussian, or subexponential)

Howard, Ramdas, McAuliffe, Sekhon ’18


Solutions for these issues

Inner sequential process: Part I
“confidence sequence” for estimation,
also called “anytime confidence intervals”
(correspondingly, “always valid p-values” for testing)

Outer sequential process: Part II
“false coverage rate” for estimation
(correspondingly, “false discovery rate” for testing)

Modular solutions: fit well together. Part III
Many extensions to each piece
Part II

The OUTER Sequential Process


(a sequence of experiments)

[40 mins]
Quick recap of A/B testing

[Figure: clicks vs. misses for versions A and B, on a 0–800 scale.]

Null hypothesis:
A is at least as good as B.

Calculate p-value:
P = Pr(observed data or more extreme, assuming null is true)

Decision rule:
if P ≤ α, then we reject the null (“discovery”)
and change A to B, ensuring that the type-1 error is ≤ α .

A wrong rejection of the null is a false discovery,
and implies a bad change from A to B.
Reality: internet companies run thousands
of different (independent) A/B tests over time.

Decision rule (as experiments arrive over time):
P1 ≤ α ?  vs. Color
P2 ≤ α ?  vs. Size
P3 ≤ α ?  vs. Orientation
P4 ≤ α ?  vs. Style
P5 ≤ α ?  vs. Logo
…

Problem!
Run 10,000 different, independent A/B tests:
9,900 true nulls and 100 non-nulls.

With a type-1 error rate (per test) of 0.05,
we expect 0.05 × 9,900 = 495 false discoveries.

With power (per test) of 0.80,
we expect 0.80 × 100 = 80 true discoveries.

False discovery proportion:
FDP = # false discoveries / # discoveries = 495/575 ≈ 0.86 .

FDR = 𝔼[FDP]

Summary: FDR can be much larger than the per-test error rate
(even if hypotheses, tests, data are independent).
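
The slide's arithmetic, spelled out as a tiny sketch (expected counts,
not a simulation):

```python
n_nulls, n_signals = 9_900, 100
alpha_per_test, power = 0.05, 0.80
false_disc = alpha_per_test * n_nulls         # 495 expected false discoveries
true_disc = power * n_signals                 # 80 expected true discoveries
print(false_disc / (false_disc + true_disc))  # FDP = 495/575 ~ 0.86
```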
Given a possibly infinite sequence
of independent tests (p-values), can we
guarantee control of the FDR
in a fully online fashion?

Foster-Stine ’08
Aharoni-Rosset ’14
Javanmard-Montanari ’16
Ramdas-Yang-Wainwright-Jordan ’17
Ramdas-Zrnic-Wainwright-Jordan ’18
Tian-Ramdas ’19
The aim of online FDR procedures

Decision rule (now with a per-experiment level αt):
P1 ≤ α1 ?  vs. Color
P2 ≤ α2 ?  vs. Size
P3 ≤ α3 ?  vs. Orientation
P4 ≤ α4 ?  vs. Style
P5 ≤ α5 ?  vs. Logo
…

How do we set each error level αt
to control FDR at any time?
One of the most famous offline FDR methods
is the “Benjamini-Hochberg” (BH) method.

Offline FDR methods do not control the FDR in online settings.

Benjamini-Hochberg ’95
The following method is not a valid online FDR algorithm:
at the end of experiment t, run BH on P1, …, Pt .

The reason is that the decision about the first hypothesis
then depends on all future hypotheses: we cannot commit
to a decision and stick to it.

We need the error level αt for experiment t
to be specified when it starts, and we need to make
a final decision when experiment t ends.
This multiple testing issue
is not particular to p-values.
It also exists when selectively
reporting treatment effects
with confidence intervals.

Benjamini, Yekutieli ’05


Weinstein, Ramdas ’19
Multiplicity in reported CIs

One rarely cares about, or follows up on, all CIs;
usually only the most “promising” CIs are reported.

False coverage proportion:
FCP = # incorrectly reported CIs / # reported CIs

False coverage rate:
FCR = 𝔼[FCP]

Benjamini-Yekutieli ’06
Weinstein-Yekutieli ’14
Fithian et al. ’14
Controlling FCR is nontrivial

Constructing marginal 95% CIs for all parameters
fails to control FCR at 0.05:

Suppose the treatment effect is θj ∈ {±0.1} for all j,
and the experimental observations are normalized so that
Xj ∼ N(θj, 1) .

Suppose we only care about drugs with large effects,
so we pursue phase II of the trial only if Xj > 3 .

For these drugs, the standard marginal 95% CI,
Xj ± 1.96, never covers θj . So FCR = 1.
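
A minimal simulation of this example (the sample size and seed are
illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = rng.choice([-0.1, 0.1], size=1_000_000)
x = theta + rng.standard_normal(theta.size)
sel = x > 3                                    # report only "promising" drugs
covered = (x[sel] - 1.96 <= theta[sel]) & (theta[sel] <= x[sel] + 1.96)
print(sel.sum(), 1 - covered.mean())           # FCP of the reported CIs = 1.0
```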
Can we control FCR *online*?

When experiment j starts, we must assign
a target confidence level αj .
When experiment j ends, we must decide
if we wish to report θj .
This must be done such that the FCR
is controlled at any time.

Weinstein, Ramdas ‘19

A simple solution for both
online FDR control and
online FCR control
Online FCR control: the main idea

Let Si 2 {0, 1} denote the selection decision


made after experiment i. <latexit sha1_base64="GOHWK6b8TQqolUAuucVsdwvzEVA=">AAACRHicbVA9T8MwEHX4pnwVGFlOtEgMqEpYYESwMDCAoAWpqSrHuVALx45sB1FF/XEs/AA2fgELAwixIpy2A18nWXp67+7d+UWZ4Mb6/pM3MTk1PTM7N19ZWFxaXqmurrWMyjXDJlNC6auIGhRcYtNyK/Aq00jTSOBldHNU6pe3qA1X8sL2M+yk9FryhDNqHdWttkOpuIxRWjhBC/XzLoeQSwgLfycIB3VwkrIItofgtiArxxzJeGkZhlBJaYxAE4sa8C5DzdPSrM7rjW615jf8YcFfEIxBjYzrtFt9DGPF8tKACWpMO/Az2ymotpwJHFTC3GBG2Q29xraDkqZoOsUwhAFsOSaGRGn33AFD9vtEQVNj+mnkOlNqe+a3VpL/ae3cJvudgssstyjZaFGSC7AKykQh5trFIvoOUKa5uxVYj2rKXCSm4kIIfn/5L2jtNgK/EZzt1g4Ox3HMkQ2ySbZJQPbIATkmp6RJGLknz+SVvHkP3ov37n2MWie88cw6+VHe5xcxvK/J</latexit>

PT
[ i=1 ↵ i
Maintain FCP(T ) := PT  ↵.
1 _ i=1 Si
<latexit sha1_base64="h2Dsi8pb9WFrVY+BGXU36brflXA=">AAACVnicbVFNa9wwEJWdpkm3X0567EV0KaSXxS6BhkAgNFB6KWzpbhJYb81YO86KSLIjjdMsxn+yvbQ/pZdS7a4PTdKBgcd7b0bSU14p6SiOfwXhxoPNh1vbj3qPnzx99jza2T11ZW0FjkWpSnueg0MlDY5JksLzyiLoXOFZfnmy1M+u0TpZmhEtKpxquDCykALIU1mkU8Ibaj6BNOSbtzz9Jmc4B2rWyoeTYdvujd4cHqWFBdGkrtZZI4+S9uuIp6CqOWSSt03C02tEfkv+kkm/T+FVZxxkUT8exKvi90HSgT7raphF39NZKWqNhoQC5yZJXNG0AUtSKGx7ae2wAnEJFzjx0IBGN21WsbT8tWdmvCitb0N8xf470YB2bqFz79RAc3dXW5L/0yY1FQfTRpqqJjRifVBRK04lX2bMZ9KiILXwAISV/q5czMGnR/4nej6E5O6T74PTt4MkHiSf9/vH77s4ttlL9ortsYS9Y8fsIxuyMRPsB/sdhMFG8DP4E26GW2trGHQzL9itCqO/IGOz7g==</latexit>
Online FCR control: the main idea

Let Si 2 {0, 1} denote the selection decision


made after experiment i. <latexit sha1_base64="GOHWK6b8TQqolUAuucVsdwvzEVA=">AAACRHicbVA9T8MwEHX4pnwVGFlOtEgMqEpYYESwMDCAoAWpqSrHuVALx45sB1FF/XEs/AA2fgELAwixIpy2A18nWXp67+7d+UWZ4Mb6/pM3MTk1PTM7N19ZWFxaXqmurrWMyjXDJlNC6auIGhRcYtNyK/Aq00jTSOBldHNU6pe3qA1X8sL2M+yk9FryhDNqHdWttkOpuIxRWjhBC/XzLoeQSwgLfycIB3VwkrIItofgtiArxxzJeGkZhlBJaYxAE4sa8C5DzdPSrM7rjW615jf8YcFfEIxBjYzrtFt9DGPF8tKACWpMO/Az2ymotpwJHFTC3GBG2Q29xraDkqZoOsUwhAFsOSaGRGn33AFD9vtEQVNj+mnkOlNqe+a3VpL/ae3cJvudgssstyjZaFGSC7AKykQh5trFIvoOUKa5uxVYj2rKXCSm4kIIfn/5L2jtNgK/EZzt1g4Ox3HMkQ2ySbZJQPbIATkmp6RJGLknz+SVvHkP3ov37n2MWie88cw6+VHe5xcxvK/J</latexit>

PT
[ i=1 ↵ i
Maintain FCP(T ) := PT  ↵.
1 _ i=1 Si
<latexit sha1_base64="h2Dsi8pb9WFrVY+BGXU36brflXA=">AAACVnicbVFNa9wwEJWdpkm3X0567EV0KaSXxS6BhkAgNFB6KWzpbhJYb81YO86KSLIjjdMsxn+yvbQ/pZdS7a4PTdKBgcd7b0bSU14p6SiOfwXhxoPNh1vbj3qPnzx99jza2T11ZW0FjkWpSnueg0MlDY5JksLzyiLoXOFZfnmy1M+u0TpZmhEtKpxquDCykALIU1mkU8Ibaj6BNOSbtzz9Jmc4B2rWyoeTYdvujd4cHqWFBdGkrtZZI4+S9uuIp6CqOWSSt03C02tEfkv+kkm/T+FVZxxkUT8exKvi90HSgT7raphF39NZKWqNhoQC5yZJXNG0AUtSKGx7ae2wAnEJFzjx0IBGN21WsbT8tWdmvCitb0N8xf470YB2bqFz79RAc3dXW5L/0yY1FQfTRpqqJjRifVBRK04lX2bMZ9KiILXwAISV/q5czMGnR/4nej6E5O6T74PTt4MkHiSf9/vH77s4ttlL9ortsYS9Y8fsIxuyMRPsB/sdhMFG8DP4E26GW2trGHQzL9itCqO/IGOz7g==</latexit>

Weinstein & Ramdas ’19


Online FCR control: the main idea

Let Si 2 {0, 1} denote the selection decision


made after experiment i. <latexit sha1_base64="GOHWK6b8TQqolUAuucVsdwvzEVA=">AAACRHicbVA9T8MwEHX4pnwVGFlOtEgMqEpYYESwMDCAoAWpqSrHuVALx45sB1FF/XEs/AA2fgELAwixIpy2A18nWXp67+7d+UWZ4Mb6/pM3MTk1PTM7N19ZWFxaXqmurrWMyjXDJlNC6auIGhRcYtNyK/Aq00jTSOBldHNU6pe3qA1X8sL2M+yk9FryhDNqHdWttkOpuIxRWjhBC/XzLoeQSwgLfycIB3VwkrIItofgtiArxxzJeGkZhlBJaYxAE4sa8C5DzdPSrM7rjW615jf8YcFfEIxBjYzrtFt9DGPF8tKACWpMO/Az2ymotpwJHFTC3GBG2Q29xraDkqZoOsUwhAFsOSaGRGn33AFD9vtEQVNj+mnkOlNqe+a3VpL/ae3cJvudgssstyjZaFGSC7AKykQh5trFIvoOUKa5uxVYj2rKXCSm4kIIfn/5L2jtNgK/EZzt1g4Ox3HMkQ2ySbZJQPbIATkmp6RJGLknz+SVvHkP3ov37n2MWie88cw6+VHe5xcxvK/J</latexit>

PT
[ i=1 ↵ i
Maintain FCP(T ) := PT  ↵.
1 _ i=1 Si
<latexit sha1_base64="h2Dsi8pb9WFrVY+BGXU36brflXA=">AAACVnicbVFNa9wwEJWdpkm3X0567EV0KaSXxS6BhkAgNFB6KWzpbhJYb81YO86KSLIjjdMsxn+yvbQ/pZdS7a4PTdKBgcd7b0bSU14p6SiOfwXhxoPNh1vbj3qPnzx99jza2T11ZW0FjkWpSnueg0MlDY5JksLzyiLoXOFZfnmy1M+u0TpZmhEtKpxquDCykALIU1mkU8Ibaj6BNOSbtzz9Jmc4B2rWyoeTYdvujd4cHqWFBdGkrtZZI4+S9uuIp6CqOWSSt03C02tEfkv+kkm/T+FVZxxkUT8exKvi90HSgT7raphF39NZKWqNhoQC5yZJXNG0AUtSKGx7ae2wAnEJFzjx0IBGN21WsbT8tWdmvCitb0N8xf470YB2bqFz79RAc3dXW5L/0yY1FQfTRpqqJjRifVBRK04lX2bMZ9KiILXwAISV/q5czMGnR/4nej6E5O6T74PTt4MkHiSf9/vH77s4ttlL9ortsYS9Y8fsIxuyMRPsB/sdhMFG8DP4E26GW2trGHQzL9itCqO/IGOz7g==</latexit>

For testing, let Ri ∈ {0,1} represent a rejection.

Then maintain FDP ̂(T ) :=


T
∑i=1 αi
T
≤ α.
1 ∨ ∑i=1 Ri

Weinstein & Ramdas ’19


Online FCR control: the main idea

Let Si ∈ {0,1} denote the selection decision
made after experiment i.

Maintain FCP̂(T) := (∑i≤T αi) / (1 ∨ ∑i≤T Si) ≤ α.

For testing, let Ri ∈ {0,1} represent a rejection.

Then maintain FDP̂(T) := (∑i≤T αi) / (1 ∨ ∑i≤T Ri) ≤ α.

This provably controls FCR/FDR at level α.          Weinstein & Ramdas ’19
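To make the update rule concrete, here is a minimal Python sketch (an illustration of the bookkeeping, not the authors' reference code): before each experiment, compute the largest budget αi that keeps FCP̂ ≤ α even if the experiment makes no selection.

class OnlineFCRBudget:
    """Sketch of the FCP-hat tracking rule; `alpha` is the target FCR level."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.alpha_spent = 0.0   # running sum of the alpha_i handed out
        self.num_selected = 0    # running sum of the selections S_i

    def max_next_budget(self) -> float:
        # Largest alpha_i keeping FCP-hat <= alpha even if S_i turns out 0:
        # (alpha_spent + alpha_i) / max(1, num_selected) <= alpha.
        return self.alpha * max(1, self.num_selected) - self.alpha_spent

    def record(self, alpha_i: float, selected: bool) -> None:
        assert alpha_i <= self.max_next_budget() + 1e-12
        self.alpha_spent += alpha_i
        self.num_selected += int(selected)

Spending less than the maximum leaves budget for later experiments, and each new selection raises the denominator and so earns back α of budget, which is exactly the earn/spend picture on the next slide.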


Online FCR control : high-level picture

[Figure: the remaining error budget, or “alpha-wealth”, plotted over a
sequence of experiments with budgets α1, …, α6. Annotations: α1 is the
error budget for the first expt., α2 for the second expt.; expts. use
wealth; selections earn wealth; each error budget is data-dependent;
the process is infinite.]
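The earn/spend dynamics above can be simulated in a few lines; the spend-half-the-wealth policy and the coin-flip selection outcomes below are purely illustrative assumptions.

import random

random.seed(1)
alpha, spent, selections = 0.05, 0.0, 0
for i in range(1, 9):
    wealth = alpha * max(1, selections) - spent   # remaining error budget
    budget = wealth / 2                           # hypothetical spending policy
    spent += budget
    selected = random.random() < 0.3              # stand-in for expt. outcome
    selections += int(selected)
    print(f"expt {i}: alpha_{i}={budget:.4f}, selected={selected}, "
          f"wealth={alpha * max(1, selections) - spent:.4f}")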
Summary of this section

∀t, errorₜ ≤ α does not imply ∀t, FDR(t) ≤ α,
even if hypotheses, data, p-values are independent.

Can track a running estimate of the FDP (or FCP):
a simple update rule to keep this estimate bounded
also results in the FDR (or FCR) being controlled.
Handling local dependence

Most online FDR algorithms assume independent p-values
(but hypotheses can be dependent).

However, assuming arbitrary dependence between all p-values
is also extremely pessimistic and unrealistic.

A middle ground is a flexible notion of local dependence:
Pt arbitrarily depends on the previous Lt p-values,
where Lt is a user-chosen lag parameter.

The online FDR and FCR algorithms can be easily modified
to handle local dependence, as sketched below.
Zrnic, Ramdas, Jordan ’18
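A schematic of the modification (a paraphrase of the idea, not the exact rule from the paper): under lag-Lt dependence, compute the budget for test t as if the most recent Lt outcomes were still unknown.

def lagged_budget(alpha, spent_alphas, rejections, lag):
    # Budget for test t = len(spent_alphas) + 1 under lag-`lag` local
    # dependence: only rejections from tests finished at least `lag` steps
    # ago may influence the new level (a conservative schematic of the
    # modification in Zrnic, Ramdas, Jordan '18).
    t = len(spent_alphas) + 1
    visible = rejections[: max(0, t - 1 - lag)]   # hide the last `lag` outcomes
    return alpha * max(1, sum(visible)) - sum(spent_alphas)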
Solutions for these issues

Inner sequential process (Part I):
“confidence sequence” for estimation,
also called “anytime confidence intervals”
(correspondingly, “always valid p-values” for testing)

Outer sequential process (Part II):
“false coverage rate” for estimation
(correspondingly, “false discovery rate” for testing)

Modular solutions that fit well together (Part III):
many extensions to each piece
PART III: Putting the modular pieces together
(the doubly-sequential process)

[Next 10 mins]
Combining inner and outer solutions (FCR):
(a) Online FCR method assigns αi when expt. starts
(b) We keep track of a (1 − αi) confidence sequence
(c) Adaptively decide to stop, to report final CI or not
(d) Guarantee FCR(T) ≤ α at any positive time T

[Figure: confidence sequences for several experiments, run at levels
α1, …, α5, plotted against time / samples.]
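A minimal sketch of steps (a) through (c) for a single experiment, assuming the paired A/B differences lie in [−1, 1]; the Hoeffding-plus-union-bound CS below is far looser than the mixture CSs used in practice (e.g. in the confseq package), and the stopping rule is just one possible choice.

import math

def hoeffding_cs(mean_n, n, delta):
    # Anytime-valid CI for the mean of [-1, 1]-valued data: Hoeffding at
    # each n, union bound over all n with delta_n = delta / (n * (n + 1)).
    radius = math.sqrt(2 * math.log(2 * n * (n + 1) / delta) / n)
    return mean_n - radius, mean_n + radius

def run_one_experiment(alpha_i, stream, max_n=10_000):
    # Inner sequential process: x is the paired A/B difference in [-1, 1].
    # Stop adaptively; select (report the CI) iff it excludes zero.
    total = 0.0
    for n, x in enumerate(stream, start=1):
        total += x
        lo, hi = hoeffding_cs(total / n, n, alpha_i)
        if lo > 0 or hi < 0 or n == max_n:
            return (lo, hi), (lo > 0 or hi < 0)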
Combining inner and outer solutions (FDR):
(a) Online FDR method assigns αi when expt. starts
(b) We keep track of the anytime p-value Pi(n)
(c) Adaptively stop at time τ, report discovery if Pi(τ) ≤ αi
(d) Guarantee FDR(T) ≤ α at any positive time T

[Figure: anytime p-values for several experiments, run at levels
α1, …, α5, plotted against time / samples.]
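The FDR analogue swaps the CS for an anytime p-value; the crude union-bounded Hoeffding construction below (for H0: μ ≤ 0, with differences in [−1, 1]) is only meant to show the plumbing.

import math

def run_one_test(alpha_i, stream, max_n=10_000):
    # Anytime p-value for H0: mu <= 0: raw one-sided Hoeffding p-value at
    # each n, inflated by n * (n + 1) (union bound over n); the running
    # minimum is still an anytime-valid p-value.
    total, p = 0.0, 1.0
    for n, x in enumerate(stream, start=1):
        total += x
        raw = math.exp(-n * max(0.0, total / n) ** 2 / 2)
        p = min(p, 1.0, n * (n + 1) * raw)
        if p <= alpha_i or n == max_n:        # adaptive stopping time tau
            return p <= alpha_i               # discovery iff P(tau) <= alpha_i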
PART IV: Advanced topics
(inner sequential process)

[Next 25 mins]
1. What if we are testing more than one alternative?

Much more traffic is needed by an A/B/n test.


1. Multi-armed bandits for hypothesis testing

What would you do?

Depends on the aim: minimize regret OR identify the best arm?

We would like to test the null hypothesis
H0 : μA ≥ max{μB, μC} .

Can design variants of UCB algorithms to define an anytime p-value,
with optimal sample complexity for high power.
Yang, Ramdas, Jamieson, Wainwright ’17
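One valid (though not sample-optimal) instantiation, assuming each pull(arm) returns a reward in [0, 1]: keep a time-uniform confidence bound per arm and reject H0 as soon as the upper bound for A falls below another arm's lower bound. Round-robin sampling keeps the sketch short; validity holds for any sampling rule because the bounds are time-uniform, while the paper's adaptive allocation is what achieves optimal sample complexity.

import math

def anytime_ci(total, n, delta):
    # Time-uniform Hoeffding CI for a [0, 1]-valued arm mean
    # (union bound over n with delta_n = delta / (n * (n + 1))).
    r = math.sqrt(math.log(2 * n * (n + 1) / delta) / (2 * n))
    return total / n - r, total / n + r

def test_A_is_best(pull, delta=0.05, max_pulls=100_000):
    arms = ["A", "B", "C"]
    totals = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    for t in range(max_pulls):
        a = arms[t % 3]                       # round-robin sampling
        totals[a] += pull(a)
        counts[a] += 1
        if min(counts.values()) == 0:
            continue
        lo, hi = {}, {}
        for b in arms:                        # delta/3: union over the 3 arms
            lo[b], hi[b] = anytime_ci(totals[b], counts[b], delta / 3)
        if hi["A"] < max(lo["B"], lo["C"]):
            return True                       # reject H0: A provably not best
    return False                              # never rejected H0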
[Figure: the MAB-FDR meta-algorithm. An online FDR procedure with desired
FDR level α assigns level αj to experiment j; each experiment runs a MAB
that outputs an anytime p-value pj(αj); the test rejects if pj < αj, and
the decision Rj(αj) is fed back into the online FDR procedure.]
2. Switch from estimating means to quantiles?

Let X ∼ F . Define the α-quantile as qα := sup{x : F(x) ≤ α} .
(hence q1/2 is the median)

Reasons to use quantiles include:
• Quantiles always exist for any distribution,
  while means (moments) do not always exist (eg: Cauchy).
• Quantiles can be defined for any totally ordered space,
  eg: ratings A-F, where “distance between ratings” is undefined.
• Estimating quantiles can be done sequentially, without
  any tail assumptions, unlike estimating means.
• Can run A/B tests and get always valid p-values
  for testing the difference in quantiles.
• Can run bandit experiments, including best-arm identification.
• Can also estimate all quantiles simultaneously!
Howard, Ramdas ‘19
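A conservative sketch of a quantile confidence sequence: DKW at each n plus a union bound over n (the boundaries in Howard & Ramdas '19 are considerably tighter). Because DKW is uniform in x, the same ε covers every quantile at once, which is what enables the last bullet.

import math

def quantile_cs(xs_sorted, alpha, delta):
    # Anytime-valid CI for the alpha-quantile after n = len(xs_sorted)
    # observations: DKW at each n, union bound via delta_n = delta/(n(n+1)).
    n = len(xs_sorted)
    eps = math.sqrt(math.log(2 * n * (n + 1) / delta) / (2 * n))
    lo_idx = math.ceil(n * (alpha - eps))       # 1-indexed order statistics
    hi_idx = math.floor(n * (alpha + eps)) + 1
    lower = xs_sorted[lo_idx - 1] if lo_idx >= 1 else -math.inf
    upper = xs_sorted[hi_idx - 1] if hi_idx <= n else math.inf
    return lower, upper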
2. Quantiles are informative for heavy tails

[Figure: a probability density with a heavy right tail, so that the
mean = +∞, with quantiles q0, q1/2, q0.8 marked on the axis of possible
observations. Eg: amount of time spent on Reddit.]
2. The mean need not even exist (eg: Cauchy)

[Figure: a probability density with heavy left and right tails, so that
the mean is undefined, with quantiles q0.3, q1/2, q0.8 marked.
Eg: amount of money won/lost in a casino.]

Do not need to resort to trimming “outliers”.
(How to pick the threshold? Throw away or cap?)
2. The same could arise in discrete settings

[Figure: a probability mass function with pk ∝ 1/k², so that the
mean = +∞, with quantiles q1/2, q0.8 marked.
Eg: number of links clicked.]
2. Quantile sensible in totally ordered settings

[Figure: a probability mass function over ordered ratings
A < B < C < D < E < …, where the mean is undefined, with quantiles
q1/2, q0.8 marked. Eg: grades or non-numerical ratings.]

Do not need to artificially assign numerical values.
(Are they equally spaced? Spacing and start point matter.)
2. A/B testing with quantiles

First pick a target quantile α (say 0.9).

H0 : q0.9(A) = q0.9(B)
H1 : q0.9(A) < q0.9(B)

Can construct an always valid p-value.

If numerical, can construct a confidence sequence for q0.9(B) − q0.9(A) .

(In that case, one way to define a sequential p-value is the smallest δ
for which the (1 − δ) CS excludes ℝ−0 , the nonpositive reals.)
Howard, Ramdas ‘19
2. Best-arm identification with quantiles

Which arm has the highest 80% quantile?

Can design MAB algorithms to adaptively determine
the “best” arm with a prescribed failure probability.

If the first arm is “special”, can design MAB algorithms to adaptively
test the null hypothesis that A is best, and get a sequential p-value.
Howard, Ramdas ‘19
3. Running intersections or minimums: pros/cons

Fact 1: if P(n) is an anytime p-value, so is min_{m≤n} P(m) .

Fact 2: if (L(n), U(n)) is a confidence sequence, so is the running
intersection (max_{m≤n} L(m), min_{m≤n} U(m)) .

Pro of taking running intersections of CIs:
Smaller width, hence tighter inference, without inflating error.

Con of taking running intersections of CIs:
Can have intervals of decreasing width (great!) and then in the
next step, end up with an empty interval (disconcerting).

Pro of ending up with zero width:
“Failing loudly”: you know you’re in the low-probability error
event, or assumptions have been violated.
Howard, Ramdas, McAuliffe, Sekhon ’19
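Both facts are one-liners in code, and the empty-interval check is exactly where the “failing loudly” behavior comes from; this sketch is agnostic to how the per-step CSs were built.

def running_intersection(cs_stream):
    # Intersect a stream of (lower, upper) confidence intervals. Widths can
    # only shrink; an empty intersection means we are in the low-probability
    # error event, or an assumption (iid-ness, boundedness, ...) is violated.
    lo, hi = -float("inf"), float("inf")
    for l, u in cs_stream:
        lo, hi = max(lo, l), min(hi, u)
        if lo > hi:
            raise RuntimeError("Empty confidence sequence: failing loudly.")
        yield lo, hi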
4. Sequential Average Treatment Effect estimation
with adaptive randomization

[Figure: users of an app or website are randomized between A and B;
the 50%/50% split can change with time (eg: to keep groups balanced).]

Can infer the treatment effect sequentially (Neyman-Rubin
potential outcomes model) using an anytime p-value or CI.
Howard, Ramdas, McAuliffe, Sekhon ’19

PART V: Advanced topics
(outer sequential process)

[Next 15 mins]
1. Smoothly forgetting the past

Recent tests may be more relevant than older ones,
and hence we may wish to smoothly forget the past.

With this motivation, we may wish to control the
decaying memory FDR: (user-chosen decay d < 1)

mem-FDR(T) = 𝔼[ ∑t≤T d^(T−t) 1(false discoveryₜ) / ∑t≤T d^(T−t) 1(discoveryₜ) ]

(similarly mem-FCR)
Ramdas et al. ’17
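A decaying-memory analogue of the FDP̂ estimator from Part II might be computed as below (a sketch of the idea only; see Ramdas et al. '17 for the actual mem-FDR algorithms).

def mem_fdp_hat(alphas, rejections, d=0.99):
    # Exponentially down-weight old tests in both numerator and denominator;
    # with d = 1 this reduces to the usual FDP-hat.
    T = len(alphas)
    num = sum(d ** (T - t) * a for t, a in enumerate(alphas, start=1))
    den = sum(d ** (T - t) * r for t, r in enumerate(rejections, start=1))
    return num / max(den, 1.0)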
2. Post-hoc analysis

What if you did not use an online FCR or FDR algorithm,
but at the end of the year, you would like to answer
“based on the decisions made and error levels used,
how large could my FCR or FDR be?”

With probability at least 1 − δ we have

FDPt ≤ [(1 + ∑i≤t αi) / (∑i≤t Ri)] ⋅ [log(1/δ) / log(1 + log(1/δ))]
simultaneously for all t .

FCPt ≤ [(1 + ∑i≤t αi) / (∑i≤t Si)] ⋅ [log(1/δ) / log(1 + log(1/δ))]
simultaneously for all t .
Katsevich, Ramdas ’18
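The bound is a closed-form function of the levels used and the decisions made, so the year-end audit takes a few lines:

import math

def posthoc_fdp_bound(alphas, rejections, delta=0.05):
    # Uniform-in-t FDP bound of Katsevich & Ramdas '18, evaluated at the
    # current t; swap `rejections` for the selections S_i to bound the FCP.
    c = math.log(1 / delta) / math.log(1 + math.log(1 / delta))
    return c * (1 + sum(alphas)) / max(1, sum(rejections))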
3. Weighted error metrics and algorithms

The usual error metrics count all mistakes equally.
But different experiments may have differing importances.

Can define “weighted” variants of FDR and FCR
in the natural way: weighted sums in numerator/denominator.

Online FDR and FCR algorithms can be extended to
control weighted error metrics.
Benjamini, Hochberg ’97
Ramdas, Yang, Jordan, Wainwright ’17
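Concretely, the weighted false discovery proportion referred to here could be computed as below (an oracle quantity, since “false” is unknown in practice; the weights wi are user-chosen importances).

def weighted_fdp(weights, discoveries, is_false):
    # Weighted sums in numerator and denominator: a high-stakes experiment
    # counts more both as a potential error and as a discovery. The max(.,1)
    # guard mirrors the 1-or-denominator convention of the unweighted FDP.
    num = sum(w for w, d, f in zip(weights, discoveries, is_false) if d and f)
    den = sum(w for w, d, f in zip(weights, discoveries, is_false) if d)
    return num / max(den, 1.0)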
4. False-sign rate

Sometimes, all we want is a “sign decision” about a parameter:
an output of +1 if the treatment effect is +ve,
an output of −1 if the treatment effect is −ve,
or no output at all if it is uncertain.

We may correspondingly define the false sign rate as

FSR := 𝔼[ (# incorrect sign decisions made) / (# sign decisions made) ]

To control the FSR, just use the online FCR algorithm,
and report the sign iff the CI does not contain zero.
Weinstein, Ramdas ’19
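In code, the reduction from FSR control to FCR control is just the reporting rule applied to each experiment's final FCR-adjusted CI:

def sign_decision(ci):
    # Report a sign only if the FCR-corrected CI excludes zero; abstaining
    # (returning None) is what keeps the false sign rate under control.
    lo, hi = ci
    if lo > 0:
        return +1
    if hi < 0:
        return -1
    return None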
Open Problems [5 mins]
1. Errors and incentives in large organizations

A large number of different teams run
such A/B tests or randomized experiments.

From the larger organization’s perspective,
coordination is necessary to control FDR or FCR,
since these errors might affect the bottom line of the company.

But each individual group or team might feel:
“why do we have to pay if some other group
is running lots of random tests/experiments?”

How do we align incentives?
Should our notion of error be hierarchical?
1. A hierarchical FDR or FCR control?

[Figure: a company desiring FDR ≤ 0.1, split across Product 1 (Group 1),
Product 2 (Group 2), …, Product 15 (Group 15).]

The average of group FDRs does not give the company FDR.

FDR is additive in the worst case: if each group separately
controls FDR at 0.1, the company-level FDR bound could be trivial.
2. Utilizing contextual information

Often, we have contextual information about
each visitor (sample), like age, gender, etc.
These have been utilized for contextual
bandit algorithms that minimize regret.

Is such information useful for hypothesis testing?
How do we use contextual bandits for hypothesis testing?
3. Designing systems that fail loudly

When our assumptions are wrong, and the system
is not behaving as intended or expected, how
can we automatically detect and report this?

Is it possible to design such self-critical systems
that “announce” failures?
Summary [15 mins]
A selective history (inner process)

Fisher (1925)
null hypothesis testing basics,
randomization for causal inference

Wald (1948)
sequential probability ratio test
(the first always-valid p-values)

Robbins (1952)
multi-armed bandits

Darling & Robbins (1967)
confidence sequences
(the first always-valid CIs)

Lai, Siegmund, … (1970s)
confidence sequences, inference
after stopping experiments

Jennison & Turnbull (1980s)
group sequential methods
(peeking only 2 or 3 times)
A selective history (outer process)

Tukey (1953)
an unpublished book on the
problem of multiple comparisons

Eklund & Seeger (1963)
defined the false discovery proportion,
suggested a heuristic algorithm

Benjamini & Hochberg (1995)
rediscovered the Eklund-Seeger method,
first proof of FDR control

Benjamini & Yekutieli (2005)
false coverage rate (FCR),
first methods to control it

Foster & Stine (2008)
conceptualized online FDR control,
first method to control it
In this tutorial, you learnt the basics of

• How to think about a single experiment
  A. Why peeking is an issue in practice
  B. Why applying a t-test repeatedly inflates errors
  C. Anytime confidence intervals and p-values

• How to think about a sequence of experiments
  A. Why selective reporting is an issue in practice
  B. Why Benjamini-Hochberg fails in the online setting
  C. Online FCR and FDR controlling algorithms

• How to think about doubly-sequential experimentation
  A. Using anytime CIs with online FCR control
  B. Using anytime p-values with online FDR control
  C. Handling asynchronous tests with local dependence
You also learnt some advanced topics:

• Within a single experiment:
  A. Using bandits for hypothesis testing
  B. Quantiles can be estimated sequentially
  C. The pros and cons of running intersections
  D. SATE with adaptive randomization

• Across experiments:
  A. Error metrics with decaying memory
  B. The false sign rate
  C. Weighted error metrics
  D. Post-hoc analysis

• Open problems:
  A. Incentives/errors within hierarchical organizations
  B. Utilizing contextual information for testing
  C. Designing systems that fail loudly
SOFTWARE

• Within a single experiment:
  Python package called “confseq”
  Maintained by Steve Howard (Berkeley)
  Frequent updates + wrappers for months to come

• Across experiments:
  R package called “onlineFDR”
  Maintained by David Robertson (Cambridge)
  Frequent updates + wrappers for months to come

References and links at
www.stat.cmu.edu/~aramdas/kdd19/
Collaborators from this talk

Steve Howard, Jinjin Tian, Asaf Weinstein, Eugene Katsevich,
Akshay Balsubramani, Tijana Zrnic, David Robertson

Jasjeet Sekhon, Jon McAuliffe, Kevin Jamieson, Fanny Yang,
Martin Wainwright, Michael Jordan
Foundations of large-scale
“doubly-sequential” experimentation

(KDD tutorial in Anchorage, on 4 Aug 2019)

Aaditya Ramdas
Assistant Professor
Dept. of Statistics and Data Science
Machine Learning Dept.
Carnegie Mellon University

www.stat.cmu.edu/~aramdas/kdd19/

Funding welcomed! Thank you! Questions?
