Notes
September 8, 2022
Contents
1 Mathematical tricks
2 Non-asymptotic concentration inequalities / tail bounds
4 Uniform convergence
6 Mathematical tricks
7 Statistical Models
8 Exponential Families
    8.1 Properties of exponential families
    8.2 The maximum entropy duality
9 Methods of Constructing Estimators
10 Evaluation of estimators
12 Mathematical tricks
13 Testing frameworks
14 Multiple testing
16 Causal Inference
Part I
1 Mathematical tricks
Some proof techniques that emerge from the homework.
• Blind-man trick. To prove that $\int f(x,t)\,dt = g(x)$, express
$$\int f(x,t)\,dt = g(x)\int \frac{f(x,t)}{g(x)}\,dt,$$
and then show that $\frac{f(x,t)}{g(x)}$ is a pdf of some random variable (as a function of $t$), which implies $\int \frac{f(x,t)}{g(x)}\,dt = 1$.
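For example, this trick gives the Gaussian MGF (a standard computation, included here for illustration): if $X \sim N(0,1)$,
$$E[e^{tX}] = \int \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2 + tx}\,dx = e^{t^2/2}\int \frac{1}{\sqrt{2\pi}}\, e^{-(x-t)^2/2}\,dx = e^{t^2/2},$$
since the remaining integrand is the pdf of $N(t,1)$ and hence integrates to 1.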
• $E(X - c)^2 = \sigma^2 + (\mu - c)^2$, where $\mu, \sigma^2$ are the mean and variance of $X$.
• Concentration of iid variables around the mean. Given $X_1, \dots, X_n$ iid with expected value $\mu$, for any $t > 0$ we have
$$P\left(\frac{1}{n}\sum_{i=1}^n X_i - \mu \ge u\right) = P\left(\sum_{i=1}^n (X_i - \mu) \ge nu\right) \le \frac{E\left[\exp\left(t\sum_{i=1}^n (X_i - \mu)\right)\right]}{\exp(tnu)} = \frac{1}{\exp(tnu)}\prod_{i=1}^n E[\exp(t(X_i - \mu))] = \frac{1}{\exp(tnu)}\prod_{i=1}^n M_{X_i - \mu}(t).$$
As the next step, we can apply some upper bound for $M_{X_i - \mu}(t)$, depending on the given condition, e.g., $X_i$ being sub-Gaussian or sub-exponential.
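For instance (filling in the sub-Gaussian case as a worked step): if each $X_i$ is $\sigma$-sub-Gaussian, then $M_{X_i - \mu}(t) \le \exp(\sigma^2 t^2/2)$, so the bound becomes $\exp(-tnu + n\sigma^2 t^2/2)$; minimizing over $t > 0$ at $t = u/\sigma^2$ yields
$$P\left(\frac{1}{n}\sum_{i=1}^n X_i - \mu \ge u\right) \le \exp\left(-\frac{nu^2}{2\sigma^2}\right).$$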
• $E[I(f(X_i) \ne y_i)] = P(f(X) \ne y)$, so the expected value of the empirical risk is the true risk:
$$E[\hat R_n(f)] = \frac{1}{n}\sum_{i=1}^n E[I(f(X_i) \ne y_i)] = \frac{1}{n}\sum_{i=1}^n P(f(X) \ne y) = P(f(X) \ne y).$$
2 Non-asymptotic concentration inequalities / tail bounds
Theorem 1: (Markov Inequality)
For a positive random variable $X \ge 0$,
$$P(X \ge t) \le \frac{EX}{t}. \quad (2.1)$$

Theorem 2: (Chebyshev Inequality)
For a random variable $X$ with standard deviation $\sigma$,
$$P(|X - EX| \ge k\sigma) \le \frac{1}{k^2} \quad \forall k > 0. \quad (2.2)$$

Proof:
$$P(|X - EX| \ge k\sigma) = P(|X - EX|^2 \ge k^2\sigma^2) \le \frac{E(X - EX)^2}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.$$

An alternative expression of Chebyshev's inequality is
$$P(|X - EX| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2} \quad \forall \epsilon > 0.$$
Theorem 3: (Chernoff bound)
For a random variable $X$ with mean $\mu$ and any $t > 0$,
$$P(X - \mu \ge u) \le \frac{E[\exp(t(X - \mu))]}{\exp(tu)}.$$

Proof: Define $\mu = EX$. For any $t > 0$,
$$P(X - \mu \ge u) = P(\exp(t(X - \mu)) \ge \exp(tu)) \le \frac{E[\exp(t(X - \mu))]}{\exp(tu)}.$$
Definition 1:
A random variable $X$ with mean $\mu$ is sub-Gaussian if there exists a $\sigma$ such that
$$E[\exp(t(X - \mu))] \le \exp(\sigma^2 t^2/2) \quad \text{for all } t \in \mathbb{R}.$$
Using the Chernoff bound, we can then derive the same two-sided exponential tail bound for a sub-Gaussian variable as in (2.4). Furthermore, the average of $n$ independent $\sigma$-sub-Gaussian RVs is $\sigma/\sqrt{n}$-sub-Gaussian, with the tail bound
$$P(|\hat\mu - \mu| \ge k\sigma/\sqrt{n}) \le 2\exp(-k^2/2).$$
Theorem 5: (Jensen’s Inequality)
For a convex function $g: \mathbb{R} \to \mathbb{R}$ (i.e., $g''(x) \ge 0$ if $g''$ exists), we have
$$E[g(X)] \ge g(E[X]).$$
Theorem 8: (Bernstein’s Inequality)
Consider $n$ iid random variables $X_1, \dots, X_n$ with mean $\mu$, bounded on $[a, b]$, and with variance $\sigma^2$. Then
$$P(|\hat\mu - \mu| \ge t) \le 2\exp\left(-\frac{nt^2}{2(\sigma^2 + (b - a)t)}\right). \quad (2.9)$$
Other functions also show exponential concentration. One main property of these functions is the bounded-difference property: if we change the input $x_k$ to $x_k'$, the value of the function changes by at most $L_k$. This property yields McDiarmid's Inequality:
$$P(|f(X_1, \dots, X_n) - E[f(X_1, \dots, X_n)]| \ge t) \le 2\exp\left(-\frac{2t^2}{\sum_k L_k^2}\right). \quad (2.10)$$
For an $L$-Lipschitz function of independent standard Gaussian variables, a similar bound holds:
$$P(|f(X_1, \dots, X_n) - E[f(X_1, \dots, X_n)]| \ge t) \le 2\exp\left(-\frac{t^2}{2L^2}\right). \quad (2.11)$$
If $X_1, \dots, X_n \sim N(0, 1)$ iid, then
$$P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i^2 - 1\right| \ge t\right) \le 2\exp(-nt^2/8). \quad (2.12)$$
A summary of the inequalities above (inequality, condition, statement):
• Markov: for $X \ge 0$, $P(X \ge t) \le \frac{EX}{t}$.
• Chebyshev: for $X$ with standard deviation $\sigma$, $P(|X - EX| \ge k\sigma) \le \frac{1}{k^2}$.
• Hoeffding: for $(X_i)_{i=1}^n$ iid and bounded on $[a, b]$, $P(|\hat\mu - \mu| \ge \lambda) \le 2\exp\left(-\frac{n\lambda^2}{2(b-a)^2}\right)$.
• Bernstein: for $(X_i)_{i=1}^n$ iid, bounded on $[a, b]$, with variance $\sigma^2$, $P(|\hat\mu - \mu| \ge t) \le 2\exp\left(-\frac{nt^2}{2(\sigma^2 + (b-a)t)}\right)$.
• $\chi^2$ tail bound: for $X_i \sim N(0, 1)$ iid, $P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i^2 - 1\right| \ge t\right) \le 2\exp(-nt^2/8)$.
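As a quick numerical sanity check (a minimal sketch; the Bernoulli setup, sample size, and threshold are illustrative choices, not from the notes), we can compare the Hoeffding bound as stated above against Monte Carlo tail frequencies:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, lam = 100, 100_000, 0.2
mu = 0.5  # Bernoulli(1/2), bounded on [a, b] = [0, 1]

# Monte Carlo estimate of P(|mu_hat - mu| >= lam)
means = rng.binomial(1, mu, size=(reps, n)).mean(axis=1)
empirical = np.mean(np.abs(means - mu) >= lam)

# Hoeffding bound as stated in the summary: 2 exp(-n lam^2 / (2 (b-a)^2))
bound = 2 * np.exp(-n * lam**2 / (2 * (1 - 0) ** 2))
print(f"empirical tail {empirical:.5f} <= bound {bound:.5f}")
```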
Definition 2: (Convergence of random variables)
A sequence $X_1, X_2, \dots$ converges in probability to $X$ if, for every $\epsilon > 0$, $P(|X_n - X| \ge \epsilon) \to 0$.

A popular corollary of the WLLN is that if $X_1, X_2, \dots, X_n$ are iid, then $\frac{1}{n}\sum_{i=1}^n X_i^2 \to E[X^2]$ in probability.
A simple use case of the CLT is to construct confidence intervals for averages. Given $X_1, X_2, \dots, X_n$ iid, the interval
$$C_\alpha = \left(\hat\mu - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \hat\mu + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
has the property that $P(\mu \in C_\alpha) \approx 1 - \alpha$ (i.e., it has coverage $1 - \alpha$).
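A small simulation confirms the approximate $1 - \alpha$ coverage (a sketch; the normal data and the particular constants are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps, alpha = 200, 10_000, 0.05
mu, sigma = 3.0, 2.0
z = norm.ppf(1 - alpha / 2)  # z_{alpha/2}

means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
half = z * sigma / np.sqrt(n)  # half-width of C_alpha
coverage = np.mean((means - half <= mu) & (mu <= means + half))
print(f"coverage {coverage:.3f}  (target {1 - alpha})")
```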
As an example, suppose we have $X_1, \dots, X_n$ iid with mean $\mu$ and variance $\sigma^2$. To get the asymptotic distribution of $Y_n = \exp(\hat\mu_n)$, note that the CLT gives $\sqrt{n}(\hat\mu_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$, so applying the Delta method with $g(x) = e^x$ and $g'(\mu) = e^\mu$,
$$\sqrt{n}\,\frac{\exp(\hat\mu_n) - \exp(\mu)}{\sigma} \xrightarrow{d} N(0, \exp(2\mu)).$$
For example, with $U \sim \mathrm{Unif}[0, 1]$, the sequence $X_n = \sqrt{n}\, I(U \in [0, 1/n])$ converges to $X = 0$ in probability,
$$P(|X_n| \ge \epsilon) = P(U \in [0, 1/n]) = \frac{1}{n} \to 0,$$
but not in quadratic mean:
$$E(X_n - X)^2 = EX_n^2 = n\,P(U \in [0, 1/n]) = n \cdot \frac{1}{n} = 1.$$
Similarly, $X_n = n\, I(U \in [0, 1/n])$ has $P(|X_n| > \epsilon) = \frac{1}{n} \to 0$ but $E(X_n) = 1 \ne 0$, so convergence in probability does not imply convergence of expectations.
Math tricks to connect different quantities:
• $P(X_n \le x) \le P(X \le x + \epsilon) + P(|X_n - X| \ge \epsilon)$.
• $E[f(X_n)] = E[f(X_n) I(|X_n| > \epsilon)] + E[f(X_n) I(|X_n| \le \epsilon)]$.
4 Uniform convergence
Theorem 17: (Glivenko-Cantelli theorem)
For any distribution with cdf $F_X$, if we observe the samples $X_1, \dots, X_n$ and define the empirical cdf $\hat F_n(x) = \frac{1}{n}\sum_{i=1}^n I(X_i \le x)$, then
$$\Delta = \sup_{x \in \mathbb{R}} |\hat F_n(x) - F_X(x)| \xrightarrow{p} 0. \quad (4.1)$$
More generally, we are interested in collections of sets $\mathcal{A}$ for which we have uniform convergence,
$$\sup_{A \in \mathcal{A}} |P_n(A) - P(A)| \xrightarrow{p} 0,$$
where $P_n(A) = \frac{1}{n}\sum_{i=1}^n I(X_i \in A)$ is the empirical probability of a set $A$. Furthermore, one can replace the indicators with general integrable functions.
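To see Glivenko-Cantelli numerically (a sketch; the exponential distribution is an arbitrary choice), the sup-distance $\Delta$ shrinks as $n$ grows. The supremum over $x$ is attained at the jump points of $\hat F_n$:

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(2)
for n in [100, 1_000, 10_000]:
    x = np.sort(rng.exponential(size=n))
    # F_hat just after the i-th order statistic is i/n; just before it is (i-1)/n
    i = np.arange(1, n + 1)
    fx = expon.cdf(x)
    delta = max(np.max(i / n - fx), np.max(fx - (i - 1) / n))
    print(f"n={n:6d}  sup|F_hat - F| = {delta:.4f}")
```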
Definition 4: (Empirical risk)
Given a training set $\{(X_i, y_i)\}_{i=1}^n$, for a given classifier $f$ we can estimate its empirical risk as
$$\hat R_n(f) = \frac{1}{n}\sum_{i=1}^n I(f(X_i) \ne y_i). \quad (5.1)$$

If $f$ is some fixed classifier (does not depend on the training data), we can apply Hoeffding's bound to see that
$$P(|\hat R_n(f) - P(f(X) \ne y)| \ge t) \le 2\exp(-2nt^2). \quad (5.2)$$
If we are trying to pick a good classifier from some set of classifiers $\mathcal{F}$, a natural way is to choose the one that looks best on the training set, i.e., the empirical risk minimizer $\hat f = \arg\min_{f \in \mathcal{F}} \hat R_n(f)$.
To argue that in some cases this procedure will indeed select a good classifier, let $f^*$ be the best classifier in $\mathcal{F}$. We would like to bound the excess risk $R(\hat f) - R(f^*)$ of the classifier.
5.2 VC dimension
Definition 5:
Let $\{z_1, \dots, z_n\}$ be a finite set of $n$ points. Let $N_{\mathcal{A}}(z_1, \dots, z_n)$ be the number of distinct sets in the collection of sets
$$\{\{z_1, \dots, z_n\} \cap A : A \in \mathcal{A}\}.$$
In other words, it is the maximal number of different subsets of $n$ points that can be picked out by the collection $\mathcal{A}$.
Theorem 18:
For any distribution P and class of sets A,
Definition 6:
The VC dimension of a set system $\mathcal{A}$ is the largest integer $d$ for which $s(\mathcal{A}, d) = 2^d$.

We can use Sauer's lemma to conclude that for a system $\mathcal{A}$ with VC dimension $d$,
$$s(\mathcal{A}, n) \le \sum_{i=0}^d \binom{n}{i} \le (n + 1)^d.$$
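As a simple example (mine, not from the notes): for half-lines $\mathcal{A} = \{(-\infty, a] : a \in \mathbb{R}\}$ on $\mathbb{R}$, $n$ distinct points can only be picked out as "prefixes," so $s(\mathcal{A}, n) = n + 1$. Since $s(\mathcal{A}, 1) = 2 = 2^1$ but $s(\mathcal{A}, 2) = 3 < 2^2$, the VC dimension is $d = 1$, consistent with the bound $s(\mathcal{A}, n) = n + 1 \le (n+1)^1$.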
Part II
6 Mathematical tricks
• $E[E[X \mid Y]] = E[X]$.
• $\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar x)^2 + n(\bar x - \mu)^2$, where $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$.
• The sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2$ is an unbiased estimator of $\sigma^2$. The MLE of the variance is $\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$.
• To prove that a sufficient partition induced by T is minimal, show that, for any other
statistic T 0 , if x and y belong to the same partition induced by T 0 , then they belong to the
same partition induced by T .
• To compute the Bayes risk, we can use two nested expected values instead of an integral:
$$B_\pi(\hat\theta) = \int R(\theta, \hat\theta)\pi(\theta)\,d\theta = E_{\pi(\theta)}[R(\theta, \hat\theta)] = E_{\pi(\theta)}\left[E_X[L(\theta, \hat\theta)]\right].$$
7 Statistical Models
Definition 7: (Statistic)
A statistic is simply a function of the observed sample, i.e. if we have X1 , . . . , Xn ∼ P then
any function T (X1 , . . . , Xn ) is called a statistic. A statistic is a random variable.
Roughly, once we know the value of the sufficient statistic, the joint distribution no longer carries any more information about $\theta$.
Since the conditional probability $p(X_1, \dots, X_n \mid T; \theta)$ is not easy to compute, we often check sufficiency through the following theorem.
Theorem 20: (Factorization Theorem)
$T(X_1, \dots, X_n)$ is sufficient for $\theta$ if and only if the joint pdf/pmf of $(X_1, \dots, X_n)$ can be factored as
$$p(x_1, \dots, x_n; \theta) = h(x_1, \dots, x_n) \times g(T(x_1, \dots, x_n); \theta). \quad (7.1)$$
Definition 9: (Likelihood)
The likelihood arises from viewing the joint density as a function of $\theta$, i.e., $L(\theta) = p(x_1, \dots, x_n; \theta)$.
Theorem 21:
Define
$$R(x_1, \dots, x_n, y_1, \dots, y_n; \theta) = \frac{p(y_1, \dots, y_n; \theta)}{p(x_1, \dots, x_n; \theta)}. \quad (7.3)$$
For a statistic $T$, if it holds that $R$ does not depend on $\theta$ if and only if $T(y_1, \dots, y_n) = T(x_1, \dots, x_n)$, we say $T$ is a minimal sufficient statistic (MSS).

Note that the MSS is not unique, but the minimal sufficient partition is unique. Furthermore, the likelihood function induces a minimal partition.
Now suppose we observe $X_1, \dots, X_n \sim p(X; \theta)$ and we would like to estimate $\theta$ by an estimator $\hat\theta(X_1, \dots, X_n)$. We can define the risk as a function of $\theta$:
$$R(\theta, \hat\theta) = E_\theta[L(\theta, \hat\theta(X_1, \dots, X_n))].$$
Based on the following theorem, estimators that do not depend on the data only through sufficient statistics can be improved.
Theorem 22: (Rao-Blackwell Theorem)
Let $\hat\theta$ be an estimator and $T$ be any sufficient statistic. Define $\tilde\theta = E[\hat\theta \mid T]$. Then
$$R(\tilde\theta, \theta) \le R(\hat\theta, \theta) \quad \text{for all } \theta.$$
Proof: Observe that because $T$ is sufficient, $\tilde\theta$ is a valid estimator (it does not depend on $\theta$), and $E[\hat\theta \mid T] - \theta = E[\hat\theta - \theta \mid T]$. Then, applying Jensen's inequality conditionally,
$$R(\tilde\theta, \theta) = E\left[\left(E[\hat\theta - \theta \mid T]\right)^2\right] \le E\left[E[(\hat\theta - \theta)^2 \mid T]\right] = R(\hat\theta, \theta).$$
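A quick illustration (a sketch; the Bernoulli example is my choice): start from the crude unbiased estimator $\hat\theta = X_1$ and condition on the sufficient statistic $T = \sum_i X_i$. By symmetry $E[X_1 \mid T = t] = t/n$, so the Rao-Blackwellized estimator is the sample mean, and simulation shows the risk drop:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, theta = 20, 100_000, 0.3
x = rng.binomial(1, theta, size=(reps, n))

crude = x[:, 0]       # theta_hat = X_1: unbiased but noisy
rb = x.mean(axis=1)   # E[X_1 | T] = T / n: the Rao-Blackwellization
print("risk of X_1      :", np.mean((crude - theta) ** 2))  # ~ theta(1-theta)
print("risk of E[X_1|T] :", np.mean((rb - theta) ** 2))     # ~ theta(1-theta)/n
```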
8 Exponential Families
Definition 11: (Exponential families)
A family $P_\theta$ of distributions forms an $s$-dimensional exponential family if the distributions $P_\theta$ have densities of the form
$$p(x; \theta) = \exp\left(\sum_{i=1}^s \eta_i(\theta) T_i(x) - A(\theta)\right) h(x), \quad (8.1)$$
where $\eta_i, A$ are functions of $\theta$ and the $T_i$ are known as the sufficient statistics (due to the factorization theorem). Alternatively, we have the canonical parameterization:
$$p(x; \theta) = \exp\left(\sum_{i=1}^s \theta_i T_i(x) - A(\theta)\right) h(x). \quad (8.2)$$
The term $A(\theta)$, called the log-normalization constant (or log-partition function), is what makes the distribution integrate to 1:
$$A(\theta) = \log \int_x \exp\left(\sum_{i=1}^s \theta_i T_i(x)\right) h(x)\,dx.$$
The set of $\theta$s for which $A(\theta) < \infty$ constitutes the natural parameter space.
8.1 Properties of exponential families
• Log-partition generates moments:
$$\frac{\partial A(\theta)}{\partial \theta_i} = E[T_i(X)], \qquad \frac{\partial^2 A(\theta)}{\partial \theta_i \partial \theta_j} = \mathrm{Cov}(T_i(X), T_j(X)),$$
so $A$ is a convex function of $\theta$ (see the numerical check after this list).
• For exponential families, the Bregman divergence between parameters (using the log-partition $A$ as the convex function) is exactly equal to the KL divergence between the corresponding distributions.
• The MLE and MoM estimators of exponential families are the same.
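As a numerical check of the moment-generating property above (a sketch; the Bernoulli family in canonical form is my choice of example), take $T(x) = x$, $h(x) = 1$, and $A(\theta) = \log(1 + e^\theta)$; then $A'(\theta)$ should equal $E[T(X)]$ and $A''(\theta)$ should equal $\mathrm{Var}(T(X))$:

```python
import numpy as np

theta = 0.7                         # canonical parameter of Bernoulli
A = lambda t: np.log1p(np.exp(t))   # log-partition: log(1 + e^theta)

# numerical first and second derivatives of A
h = 1e-5
A1 = (A(theta + h) - A(theta - h)) / (2 * h)
A2 = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h**2

p = 1 / (1 + np.exp(-theta))        # mean parameter: E[T(X)] = p
print(A1, "vs E[T]   =", p)
print(A2, "vs Var(T) =", p * (1 - p))
```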
Non-minimal exponential families are not statistically identifiable (i.e., there exist $\theta_1 \ne \theta_2$ such that $p(X; \theta_1) = p(X; \theta_2)$), while minimal families are. One can eliminate some of the sufficient statistics from the non-minimal representation to obtain a minimal one.
An exponential family where the space of allowed parameters $\theta_i$ is $s$-dimensional is called a full-rank family. In this case, the sufficient statistic
$$T(X_1, \dots, X_n) = \left(\sum_{i=1}^n T_1(X_i), \dots, \sum_{i=1}^n T_s(X_i)\right)$$
is minimal sufficient.
8.2 The maximum entropy duality
Suppose we are given a random sample $\{X_1, \dots, X_n\}$ from some distribution, and we compute the empirical expectations of certain functions that we choose:
$$\hat\mu_i = \frac{1}{n}\sum_{j=1}^n T_i(X_j), \quad 1 \le i \le s.$$
If we constrain only a small number of statistics $T_i$ in this fashion, there are infinitely many consistent distributions. The principle of maximum entropy suggests picking the distribution that has the largest Shannon entropy $H(p) = -\int_x p(x) \log p(x)$ subject to the constraints. The solution takes the exponential-family form
$$p(x; \theta) \propto \exp\left(\sum_{i=1}^s \theta_i T_i(x)\right),$$
where the $\theta_i$ are called Lagrange parameters, which are equivalent to $\hat\theta_{MLE}$ of this distribution.
In summary, exponential families arise naturally from trying to constrain a few simple
statistics of a distribution using the data and then choosing a distribution that
maximizes the entropy.
9 Methods of Constructing Estimators
Definition 14: (Maximum Likelihood Estimate)
Consider the log-likelihood function
$$LL(\theta) = \log L(\theta) = \sum_i \log p(X_i; \theta).$$
The maximum likelihood estimate is $\hat\theta_{MLE} = \arg\max_\theta LL(\theta)$.
Note again that for exponential families, the MoM estimator coincides with the MLE if we choose
the sufficient statistics as moments.
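For instance (a sketch; the exponential distribution and the scipy optimizer are my choices): for $X \sim \mathrm{Exp}(\text{rate } \lambda)$ the sufficient statistic is $T(x) = x$, so matching the first moment gives $\hat\lambda = 1/\bar x$, and numerical maximization of the log-likelihood recovers the same value:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.exponential(scale=1 / 2.5, size=1_000)  # true rate lambda = 2.5

mom = 1 / x.mean()  # method of moments: match E[X] = 1/lambda

# negative log-likelihood of Exp(rate lam): -n log(lam) + lam * sum(x)
nll = lambda lam: -len(x) * np.log(lam) + lam * x.sum()
mle = minimize_scalar(nll, bounds=(1e-6, 100), method="bounded").x
print(f"MoM {mom:.4f}  vs  MLE {mle:.4f}")  # the two coincide
```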
Definition 15: (Bayes estimator)
Given a prior distribution $p(\theta)$ and sample $x_1, \dots, x_n$, we can compute the posterior distribution
$$p(\theta \mid x_1, \dots, x_n) = \frac{p(x_1, \dots, x_n \mid \theta)p(\theta)}{p(x_1, \dots, x_n)} \propto p(x_1, \dots, x_n \mid \theta)p(\theta).$$
Now compute $\hat\theta$ as the mean of the posterior:
$$\hat\theta = E(\theta \mid x_1, \dots, x_n) = \frac{\int \theta\, p(x_1, \dots, x_n \mid \theta)p(\theta)\,d\theta}{\int p(x_1, \dots, x_n \mid \theta)p(\theta)\,d\theta}. \quad (9.3)$$
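A conjugate example (a sketch; the Beta prior and the data-generating value are my choices): with a $\mathrm{Beta}(a, b)$ prior on $\theta$ and Bernoulli data, the posterior is $\mathrm{Beta}(a + \sum_i x_i,\ b + n - \sum_i x_i)$, so (9.3) has a closed form:

```python
import numpy as np

rng = np.random.default_rng(5)
a, b = 2.0, 2.0                    # Beta(2, 2) prior
x = rng.binomial(1, 0.7, size=50)  # Bernoulli(0.7) sample
s, n = x.sum(), len(x)

post_mean = (a + s) / (a + b + n)  # E(theta | x_1..x_n) under the Beta posterior
print("Bayes estimator:", post_mean, "  MLE:", s / n)
```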
10 Evaluation of estimators
For a true parameter $\theta$ and its estimator $\hat\theta$, the MSE is
$$E_\theta(\hat\theta - \theta)^2 = \int \cdots \int (\hat\theta(x_1, \dots, x_n) - \theta)^2\, p(x_1, \dots, x_n; \theta)\,dx_1 \cdots dx_n.$$
Finding estimators with lowest MSE is difficult, so one way to narrow the search space is to
restrict our attention to unbiased estimators with minimum variance.
Definition 16: (Fisher Information)
Suppose $X_1, \dots, X_n \sim p(X; \theta)$. The score function is the gradient of the log-likelihood:
$$s(\theta) = \nabla_\theta LL(\theta) = \sum_{i=1}^n \nabla_\theta \log p(X_i; \theta). \quad (10.1)$$
And the Fisher Information is the expected outer product of the score:
$$I(\theta) = E[s(\theta)s(\theta)^T]. \quad (10.2)$$
Intuitively, the score represents how quickly the distribution density will change when we slightly perturb the parameter near $\theta$. When we square it and take the expectation to get $I(\theta)$, we get an averaged version of this measure. So if the Fisher information is large, the distribution will change quickly when we move the parameter, so the distribution with parameter $\theta$ is 'quite different' and 'can be well distinguished' from the distributions with parameters not so close to $\theta$. This means that we should be able to estimate $\theta$ well based on the data. On the other hand, if the Fisher information is small, the distribution is 'very similar' to distributions with parameters not so close to $\theta$ and is thus more difficult to distinguish from them, so our estimation will be worse.
The score function and Fisher information have the following important properties:
• When there is one sample, $I_1(\theta) = -E[\nabla_\theta s(\theta)] = -E[\nabla_\theta^2 \log p(X; \theta)]$.
• For $n$ iid samples, $I(\theta) = nI_1(\theta)$.

Theorem 23: (Cramér-Rao lower bound)
For any unbiased estimator $\hat\theta$ of $\theta$,
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{I(\theta)} = \frac{1}{nI_1(\theta)}. \quad (10.3)$$
Estimators that are unbiased and achieve the Cramér-Rao lower bound on the variance are called efficient estimators. The Cramér-Rao bound also suggests that the MSE in a parametric model typically scales as $1/(nI_1(\theta))$.
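As a check of the $1/(nI_1(\theta))$ scaling (a sketch; Bernoulli is my choice of model, where $I_1(\theta) = 1/(\theta(1-\theta))$ and the MLE is the sample mean):

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 0.3, 100, 200_000
I1 = 1 / (theta * (1 - theta))  # Fisher information of one Bernoulli observation

theta_hat = rng.binomial(n, theta, size=reps) / n  # MLE = sample mean
print("Var(theta_hat):", theta_hat.var())
print("CRLB 1/(n I1) :", 1 / (n * I1))  # theta(1-theta)/n, attained here
```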
An estimator that minimizes the maximum risk is called the minimax estimator. An estimator
that minimizes the Bayes risk is called the Bayes estimator.
Theorem 24:
Let $r(\hat\theta \mid x^n) = \int L(\theta, \hat\theta)\pi(\theta \mid x^n)\,d\theta$ be the posterior risk of an estimator $\hat\theta(x^n)$, and let $m(x^n) = \int p(x^n \mid \theta)\pi(\theta)\,d\theta$ be the marginal distribution of $X^n$. The Bayes risk $B_\pi(\hat\theta)$ satisfies
$$B_\pi(\hat\theta) = \int r(\hat\theta \mid x^n)\, m(x^n)\,dx^n. \quad (10.7)$$

If we choose $\hat\theta$ to minimize the posterior risk $r(\hat\theta \mid x^n)$ for each $x^n$, it will minimize $B_\pi(\hat\theta)$. If $L$ is the squared loss, for example, we want to find
$$\arg\min_{\hat\theta}\, r(\hat\theta \mid x^n) = \arg\min_{\hat\theta} \int (\theta - \hat\theta)^2 \pi(\theta \mid x^n)\,d\theta.$$
Theorem 25:
If L is the squared loss then the Bayes estimator is
$$\hat\theta = \int \theta\, \pi(\theta \mid x^n)\,d\theta = E(\theta \mid X^n = x^n). \quad (10.8)$$
We will study two ways in which to use Bayes estimators to find minimax estimators. One involves tightly bounding the minimax risk, and the other involves identifying what is called a least favorable prior. It is worth keeping in mind the trade-off: Bayes estimators, although easy to compute, are somewhat subjective (in that they depend strongly on the prior $\pi$). Minimax estimators, although more challenging to compute, are not subjective, but they do have the drawback of protecting against the worst case, which might lead to pessimistic conclusions, i.e. the minimax risk might be much higher than the Bayes risk for a "nice" prior.
Intuitively, we choose the best estimator θ̃ and evaluate it on the worst case. The estimator
that minimizes this best-worst risk is the minimax estimator.
For any estimator $\hat\theta_{up}$, the minimax risk is upper bounded by $\sup_\theta R(\theta, \hat\theta_{up})$. For any prior $\pi$, the Bayes risk of the corresponding Bayes estimator $\hat\theta_{low}$ lower bounds the minimax risk. In summary,
$$B_\pi(\hat\theta_{low}) \le \inf_{\hat\theta}\sup_\theta R(\theta, \hat\theta) \le \sup_\theta R(\theta, \hat\theta_{up}).$$
Theorem 26:
Given $X_1, \dots, X_n \sim N(\theta, I_d)$, the average $\hat\theta = \frac{1}{n}\sum_i X_i$ is the minimax estimator of $\theta$ w.r.t. squared loss.
Theorem 27:
If $\hat\theta$ is the Bayes estimator w.r.t. some prior $\pi$ and its risk $R(\hat\theta, \theta)$ is constant in $\theta$, then $\hat\theta$ is the minimax estimator.
Under regularity conditions, the MLE is consistent,
$$\hat\theta_{MLE} \xrightarrow{p} \theta, \quad (11.1)$$
and asymptotically normal,
$$\sqrt{n}(\hat\theta_{MLE} - \theta) \xrightarrow{d} N(0, 1/I_1(\theta)), \quad (11.2)$$
or equivalently,
$$\hat\theta_{MLE} - \theta \xrightarrow{d} N(0, 1/I(\theta)). \quad (11.3)$$
In the Taylor expansion of the score that underlies this result, the remainder is small (roughly proportional to the previous term multiplied by $I(\tilde\theta) - I(\theta) \to 0$).
Definition 19: (Influence function)
The influence function
$$\psi(x) = \frac{\nabla_\theta \log p(x; \theta)}{I(\theta)} \quad (11.4)$$
measures the influence each single point has on the estimator $\hat\theta$, i.e.,
$$\hat\theta \approx \theta + \frac{1}{n}\sum_{i=1}^n \psi(X_i). \quad (11.5)$$
Part III
Statistical testing
12 Mathematical tricks
• $P(Z > z_\alpha) = \alpha$; more generally, $P(Z > t) = \alpha \Rightarrow t = \Phi^{-1}(1 - \alpha)$.
• To prove that the p-value $p$ is uniformly distributed on $[0, 1]$ when the CDF $\Phi$ is continuous and increasing:
$$p = P(T_n > t_n) = P(\sqrt{n}\, T_n > \sqrt{n}\, t_n) = \Phi(-\sqrt{n}\, t_n),$$
$$P(p \le u) = P(\Phi(-\sqrt{n}\, t_n) \le u) = P(-\sqrt{n}\, t_n \le \Phi^{-1}(u)) = \Phi(\Phi^{-1}(u)) = u.$$
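A simulation version of this argument (a sketch; the one-sided z-test on $N(0, 1)$ data is my choice): under the null the p-values should look uniform.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, reps = 50, 100_000

# Under H0: mu = 0, x_bar ~ N(0, 1/n), so sqrt(n) * x_bar ~ N(0, 1)
xbar = rng.normal(0, 1, size=(reps, n)).mean(axis=1)
p = norm.cdf(-np.sqrt(n) * xbar)  # p = Phi(-sqrt(n) * x_bar), as above

# check uniformity: P(p <= u) should equal u
for u in [0.05, 0.25, 0.5, 0.9]:
    print(u, np.mean(p <= u))
```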
13 Testing frameworks
Definition 20: (Constructing tests)
Hypothesis testing involves the following steps: specify a null hypothesis $H_0$ and an alternative $H_1$, choose a test statistic, and choose a rejection region $R$; we reject $H_0$ if $(X_1, \dots, X_n) \in R$.

We say that a test controls the Type I error at level $\alpha$ if, for any parameter $\theta_0$ in the null set $\Theta_0$,
$$P_{\theta_0}((X_1, \dots, X_n) \in R) \le \alpha.$$
Suppose that, under the null,
$$\hat\theta \xrightarrow{d} N(\theta_0, \sigma_0^2), \quad (13.3)$$
where $\sigma_0^2$ is the variance of $\hat\theta$ under the null. The canonical example is when $\hat\theta$ is the MLE. In this case, consider the statistic
$$T_n = \frac{\hat\theta - \theta_0}{\sigma_0}. \quad (13.4)$$
Under the null, $T_n \xrightarrow{d} N(0, 1)$, so we reject the null if $|T_n| > z_{\alpha/2}$.
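A concrete instance (a sketch; the Bernoulli proportion test and all constants are my choices): testing $H_0: \theta = 0.5$ with the MLE $\hat\theta = \bar x$ and $\sigma_0^2 = \theta_0(1 - \theta_0)/n$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
n, theta0, alpha = 400, 0.5, 0.05
x = rng.binomial(1, 0.56, size=n)  # data actually drawn with theta = 0.56

theta_hat = x.mean()
sigma0 = np.sqrt(theta0 * (1 - theta0) / n)  # std of theta_hat under the null
Tn = (theta_hat - theta0) / sigma0
print("Tn =", Tn, " reject:", abs(Tn) > norm.ppf(1 - alpha / 2))
```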
The likelihood ratio test rejects the null when
$$\lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)} < c. \quad (13.5)$$
Under the null,
$$-2\log\lambda \xrightarrow{d} \chi_1^2. \quad (13.6)$$
Definition 23:
Suppose we have a test of the form: reject when $T(X_1, \dots, X_n) > c$. Then the p-value is the smallest level $\alpha$ at which the test rejects the observed data; equivalently,
$$p = \sup_{\theta \in \Theta_0} P_\theta(T(X_1, \dots, X_n) \ge T_{obs}).$$
Under some conditions, the p-value will be uniformly distributed on $[0, 1]$ under the null, because $P(p \le u) = u$ (see the calculation in Section 12).
We can then show that asymptotically this test statistic, under the null, has a $\chi^2_{k-1}$ distribution.
$$\hat c_i = \frac{Z_i + Z_i'}{n_1 + n_2}. \quad (13.9)$$
Definition 27: (Permutation test)
Suppose we observe $X_1, \dots, X_n, Y_1, \dots, Y_m$. Define $N = m + n$ and consider all $N!$ permutations of the data. For each permutation, compute a statistic $T$ (so we have $T_1, \dots, T_{N!}$). Under the null hypothesis, each $T_i$ has the same distribution. We can then define the p-value as
$$p = \frac{1}{N!}\sum_{i=1}^{N!} I(T_i > T_{obs}), \quad (13.11)$$
where $T_{obs}$ is the test statistic on the observed data.
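In practice we sample permutations rather than enumerate all $N!$ of them (a Monte Carlo sketch; the data, the difference-in-means statistic, and the add-one adjustment that keeps the p-value positive are my choices, a slight variation on (13.11)):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(0.0, 1, size=30)  # sample 1
y = rng.normal(0.8, 1, size=25)  # sample 2, shifted mean

stat = lambda d, n: abs(d[:n].mean() - d[n:].mean())
data = np.concatenate([x, y])
t_obs = stat(data, len(x))

B = 10_000
t_perm = np.array([stat(rng.permutation(data), len(x)) for _ in range(B)])
p = (1 + np.sum(t_perm >= t_obs)) / (B + 1)
print("permutation p-value:", p)
```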
14 Multiple testing
The basic question is how we should adjust our p-value cutoffs to account for the fact that multiple tests are being done.
The main problem with the Šidák correction is that it requires independence of the p-values. The Bonferroni correction uses the union bound to avoid this assumption.
Definition 31: (Holm’s Procedure)
We perform the following steps:
1. Sort the p-values in increasing order: $p_{(1)} \le p_{(2)} \le \dots \le p_{(d)}$.
2. If $p_{(1)} < \frac{\alpha}{d}$, reject $H_{(1)}$ and move on. Else, stop and accept all $H_{(i)}$ for $i \ge 1$.
3. If $p_{(2)} < \frac{\alpha}{d-1}$, reject $H_{(2)}$ and move on. Else, stop and accept all $H_{(i)}$ for $i \ge 2$.
4. ...
Holm’s procedure also controls the FWER at level α and strictly dominates the Bonferroni
procedure.
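A direct implementation of these steps (a sketch; the example p-values are made up):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Boolean rejection mask following Holm's step-down procedure."""
    d = len(pvals)
    order = np.argsort(pvals)             # indices of p_(1) <= ... <= p_(d)
    reject = np.zeros(d, dtype=bool)
    for k, idx in enumerate(order):       # k = 0, ..., d-1
        if pvals[idx] < alpha / (d - k):  # compare p_(k+1) against alpha/(d-k)
            reject[idx] = True
        else:
            break                         # stop and accept the rest
    return reject

print(holm(np.array([0.001, 0.04, 0.03, 0.2])))
```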
If the p-values are independent, we can control the FDR using the Benjamini-Hochberg procedure.
Definition 33: (Benjamini-Hochberg procedure)
Suppose we do d tests. We perform the following steps:
1. Sort the p-values in increasing order: $p_{(1)} \le \dots \le p_{(d)}$.
2. Define the thresholds $t_i = \frac{i\alpha}{d}$.
3. Find the largest $i_{max}$ such that $p_{(i_{max})} < t_{i_{max}}$, and reject $H_{(1)}, \dots, H_{(i_{max})}$.
• FWER ≥ FDR always, so controlling the FWER implies FDR control. However, FDR is less stringent, so if it is the correct measure, we have more power by controlling FDR.
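A sketch of the Benjamini-Hochberg steps above (using the notes' strict inequality $p_{(i)} < t_i$; the example p-values are made up):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection mask for the BH procedure at FDR level alpha."""
    d = len(pvals)
    order = np.argsort(pvals)
    thresh = alpha * np.arange(1, d + 1) / d  # t_i = i * alpha / d
    below = pvals[order] < thresh
    reject = np.zeros(d, dtype=bool)
    if below.any():
        imax = np.nonzero(below)[0].max()     # largest i with p_(i) < t_i
        reject[order[: imax + 1]] = True      # reject H_(1), ..., H_(imax)
    return reject

print(benjamini_hochberg(np.array([0.001, 0.02, 0.04, 0.3, 0.6])))
```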
A set $C_n$ is a $1 - \alpha$ confidence set if
$$P(\theta \in C_n(X_1, \dots, X_n)) \ge 1 - \alpha, \quad \forall P \in \mathcal{P}. \quad (15.1)$$
This means that no matter which distribution in P generated the data, the interval guar-
antees coverage property described above. At a high-level, the confidence interval gives us
some idea of how precise our estimate of the unknown parameter θ is, i.e., a wide interval
indicates that our (point) estimate is imprecise.
Theorem 33: (Constructing CI by inverting a test)
Suppose we have a test / family of tests for the hypotheses $H_0: \theta = \theta_0$ and $H_1: \theta \ne \theta_0$. Denote the acceptance region for $H_0$ by $A(\theta_0)$. Given observed data $\{X_1, \dots, X_n\}$, we consider the random set
$$C(X_1, \dots, X_n) = \{\theta : (X_1, \dots, X_n) \in A(\theta)\}.$$
If our family of tests has level $\alpha$, then the set $C(X_1, \dots, X_n)$ is a $1 - \alpha$ confidence set.

An alternative construction uses a pivot: a function $Q(X_1, \dots, X_n, \theta)$ whose distribution does not depend on $\theta$, together with constants $a, b$ such that
$$P_\theta(a \le Q(X_1, \dots, X_n, \theta) \le b) = 1 - \alpha \quad \forall \theta \in \Theta. \quad (15.3)$$
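For example (standard, spelled out here for concreteness): inverting the level-$\alpha$ z-test that accepts $H_0: \theta = \theta_0$ when $|\sqrt{n}(\hat\mu - \theta_0)/\sigma| \le z_{\alpha/2}$ gives
$$C(X_1, \dots, X_n) = \{\theta : |\hat\mu - \theta| \le z_{\alpha/2}\,\sigma/\sqrt{n}\},$$
which is exactly the interval $C_\alpha$ from the CLT section. Equivalently, $Q = \sqrt{n}(\hat\mu - \theta)/\sigma$ is a pivot with $(a, b) = (-z_{\alpha/2}, z_{\alpha/2})$.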
16 Causal Inference
We will think of the case when there are two possible actions (or treatments). Often we refer to
one of the treatments as the active treatment (or just treatment) and the other as the control
treatment (or just control).
We associate every unit and the two treatments with two potential outcomes: the potential
outcome if the unit received the treatment and the potential outcome if the unit received control.
A priori both potential outcomes are possible. However, every unit only receives one of the two
treatments (i.e. either treatment or control) and so we only observe one of the two potential
outcomes. This is known as the fundamental problem of causal inference. We only observe one
of the potential outcomes for each unit.
There are many ways to measure causal associations. The causal estimand that we will focus on is the average treatment effect:
$$\tau = E[Y(1) - Y(0)] \quad \text{or} \quad \tau = \frac{1}{n}\sum_{i=1}^n \big(Y_i(1) - Y_i(0)\big), \quad (16.1)$$
which is the difference in outcomes if all units were treated versus all were in the control group.
The main problem in causal inference is that each unit is either treated or in the control group, so we never observe both potential outcomes. What we do observe is
$$Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0),$$
where $W_i \in \{0, 1\}$ indicates whether unit $i$ received treatment. Suppose $m$ individuals are in the treatment group; then we can estimate the quantity
$$\alpha = E[Y \mid W = 1] - E[Y \mid W = 0]$$
by the difference in means
$$\hat\tau = \frac{1}{m}\sum_{i: W_i = 1} Y_i - \frac{1}{n - m}\sum_{i: W_i = 0} Y_i.$$
However, in general $\alpha \ne \tau$, since in a typical setting we have selection bias, i.e. people can choose treatment or control based on their knowledge of their potential outcomes, so that $W$ and $(Y(0), Y(1))$ are not independent. One formal way of defining selection bias in this context is simply as the difference between $\tau$ and $\alpha$. If we can ensure that $W \perp (Y(0), Y(1))$, then we indeed have that $\alpha = \tau$ and, moreover, $E(\hat\tau) = \tau$:
Proof: We have
$$E(\hat\tau) = \sum_{i=1}^n \left(\frac{E(W_i)}{m} Y_i(1) - \frac{E(1 - W_i)}{n - m} Y_i(0)\right).$$
The mean of $W_i$ is given by $E(W_i) = \binom{n-1}{m-1}/\binom{n}{m} = \frac{m}{n}$, and so $E(1 - W_i) = \frac{n - m}{n}$. This gives us $E(\hat\tau) = \tau$.
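A simulation of a completely randomized experiment (a sketch; the potential-outcome values and constants are made up) shows that $\hat\tau$ is unbiased for $\tau$:

```python
import numpy as np

rng = np.random.default_rng(10)
n, m, reps = 100, 50, 50_000
y0 = rng.normal(0, 1, size=n)  # potential outcomes under control
y1 = y0 + 2.0                  # potential outcomes under treatment
tau = np.mean(y1 - y0)         # true (finite-sample) ATE = 2.0

est = np.empty(reps)
for r in range(reps):
    w = np.zeros(n, dtype=bool)
    w[rng.choice(n, size=m, replace=False)] = True  # randomize m units to treatment
    est[r] = y1[w].mean() - y0[~w].mean()           # difference in means
print("E(tau_hat) ~", est.mean(), " vs tau =", tau)
```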
A standard assumption (unconfoundedness) is that
$$W \perp (Y(0), Y(1)) \mid X,$$
where $X$ is some set of covariates. One way to think about this assumption is that, conditional on $X$, we have a randomized trial, i.e. the treatment is independent of the potential outcomes. So if we condition on $X$ (the confounders), we no longer have any selection bias. Alternatively, within levels of the covariates, treatment is decided by (a biased) coin flip.
1. Identification. Leveraging some set of "causal assumptions" in order to link the parameter of interest to something that can be derived from the observed data distribution. In a simple randomized trial, we used the assumption $W \perp (Y(0), Y(1))$ to say that $\tau = E[Y \mid W = 1] - E[Y \mid W = 0]$.
2. Estimation. Once we have “identified” the parameter (written it in the form of observed
quantities), we can design an estimator for it.