
36-705 Intermediate Statistics Summary Notes

September 8, 2022

Contents

Part I: Behavior of sums of independent random variables
  1 Mathematical tricks
  2 Non-asymptotic concentration inequalities / tail bounds
  3 Asymptotic concentration / convergence of random variables
    3.1 Definitions and Theorems
    3.2 Other notes
  4 Uniform convergence
  5 All of statistics and machine learning
    5.1 Theorems and definitions
    5.2 VC dimension

Part II: Statistical estimation and inference
  6 Mathematical tricks
  7 Statistical Models
  8 Exponential Families
    8.1 Properties of exponential families
    8.2 The maximum entropy duality
  9 Methods of Constructing Estimators
  10 Evaluation of estimators
  11 Asymptotic behavior of MLE

Part III: Statistical testing
  12 Mathematical tricks
  13 Testing frameworks
  14 Multiple testing
  15 Confidence Intervals and Bootstrapping
  16 Causal Inference

Part I

Behavior of sums of independent random variables

We analyze how averages of independent random variables concentrate around their means.

1 Mathematical tricks
Some proof techniques that emerge from the homework.
• Blind-man trick. To prove that $\int f(x, t)\,dt = g(x)$, express
  $$\int f(x, t)\,dt = g(x) \int \frac{f(x, t)}{g(x)}\,dt,$$
  and then show that $\frac{f(x,t)}{g(x)}$ is the pdf of some random variable, which implies $\int \frac{f(x,t)}{g(x)}\,dt = 1$.

• Formulas for expectation.

  – $E(e^{tX}) = E\left[\sum_{i=0}^{\infty} \frac{(tX)^i}{i!}\right]$.

  – $E(X - c)^2 = \sigma^2 + (\mu - c)^2$, where $\mu, \sigma^2$ are the mean and variance.

  – $EX = E[X \cdot I(C(X))] + E[X \cdot I(\neg C(X))]$, where $C(X)$ is some condition on $X$, e.g., $X > \epsilon$.

  – $|E[XY]| \le \sqrt{E[X^2]\,E[Y^2]}$ for any $X, Y$. (Cauchy-Schwarz)

  – For a discrete variable $X : \Omega \to \{0, 1, \dots\}$, $E(X) = \sum_{i=1}^{\infty} P(X \ge i)$.

  – For a continuous variable $X : \Omega \to [0, +\infty]$,
    $$E(X) = \int_0^\infty P(X \ge x)\,dx = \int_0^\infty P(X > x)\,dx.$$

  – For a continuous variable $X : \Omega \to [-\infty, 0]$,
    $$E(X) = -\int_{-\infty}^0 P(X \le x)\,dx = -\int_{-\infty}^0 P(X < x)\,dx.$$

• Formulas for mgf.

  – If $X$ and $Y$ are independent then $M_{X+Y}(t) = M_X(t)\,M_Y(t)$.

  – If $M_{X_n}(t) \to M_X(t)$ then $X_n$ converges in distribution to $X$.

• Concentration of iid variables around the mean. Given $X_1, \dots, X_n$ iid with expected value $\mu$, for any $t > 0$ we have
  $$P\left(\frac{1}{n}\sum_{i=1}^n X_i - \mu \ge u\right) = P\left(\sum_{i=1}^n (X_i - \mu) \ge nu\right) \le \frac{E\left[\exp\left(t\sum_{i=1}^n (X_i - \mu)\right)\right]}{\exp(tnu)} = \frac{1}{\exp(tnu)}\prod_{i=1}^n E[\exp(t(X_i - \mu))] = \frac{1}{\exp(tnu)}\prod_{i=1}^n M_{X_i - \mu}(t).$$
  As the next step, we can apply some upper bound for $M_{X_i - \mu}(t)$, depending on the given condition, e.g., $X_i$ being sub-Gaussian or sub-exponential.

• $E[I(f(X_i) \ne y_i)] = P(f(X) \ne y)$, so the expected value of the empirical risk is the true risk:
  $$E\,\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n E[I(f(X_i) \ne y_i)] = \frac{1}{n}\sum_{i=1}^n P(f(X) \ne y) = P(f(X) \ne y).$$

2 Non-asymptotic concentration inequalities / tail bounds

Theorem 1: (Markov Inequality)
For a nonnegative random variable X,
$$P(X \ge t) \le \frac{EX}{t}. \quad (2.1)$$

Proof: For any t, consider the indicator function
$$I(X) = \begin{cases} 1 & \text{if } X \ge t \\ 0 & \text{if } X < t \end{cases}.$$
It then follows that $t \cdot I(X) \le X$, so
$$E[X] \ge E[t \cdot I(X)] = t \cdot E[I(X)] = t\,P(X \ge t).$$

Theorem 2: (Chebyshev Inequality)

For any random variable X with std σ,
$$P(|X - EX| \ge k\sigma) \le \frac{1}{k^2} \quad \forall k > 0. \quad (2.2)$$

Proof: Applying Markov's Inequality (2.1) to $Y = |X - EX|^2$ and $t = k^2\sigma^2$, we see that
$$P(|X - EX| \ge k\sigma) = P(|X - EX|^2 \ge k^2\sigma^2) \le \frac{E(X - EX)^2}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.$$

An alternative expression of Chebyshev's inequality is
$$P(|X - EX| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2} \quad \forall \epsilon > 0.$$

Theorem 3: (Chernoff's bound)

For any random variable X whose mgf is finite for all |t| ≤ b where b > 0,
$$P(X - \mu \ge u) \le \inf_{0 \le t \le b} \exp(-t(u + \mu))\,E[\exp(tX)]. \quad (2.3)$$

Proof: Define µ = EX. For any t > 0,
$$P(X - \mu \ge u) = P(\exp(t(X - \mu)) \ge \exp(tu)) \le \frac{E[\exp(t(X - \mu))]}{\exp(tu)}.$$


Theorem 4: (Gaussian tail bound)

For a random variable X ∼ N(µ, σ²),
$$P(|X - \mu| \ge u) \le 2\exp(-u^2/(2\sigma^2)). \quad (2.4)$$

Proof: According to HW2, the mgf of X is
$$M_X(t) = E[\exp(tX)] = \exp(t\mu + t^2\sigma^2/2).$$
Applying the Chernoff bound,
$$P(X - \mu \ge u) \le \inf_{0 \le t \le b} \exp(-t(u + \mu))\,E[\exp(tX)] = \exp(-u^2/(2\sigma^2)).$$

Corollary 1. If we consider the average of iid Gaussian random variables $X_1, X_2, \dots, X_n$,
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i,$$
then the Gaussian tail bound gives us
$$P(|\hat{\mu} - \mu| \ge k\sigma/\sqrt{n}) \le 2\exp(-k^2/2),$$
which is much tighter than Chebyshev's bound of $\frac{1}{k^2}$.
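A quick simulation makes the gap concrete. This is an illustrative sketch, not from the notes; it assumes NumPy is available and uses k = 3, where the Gaussian bound is far tighter than Chebyshev's.

```python
# Sketch: empirically compare the Chebyshev and Gaussian tail bounds for the
# mean of n iid N(0, 1) variables at k = 3 standard deviations.
import numpy as np

rng = np.random.default_rng(0)
n, trials, k = 100, 100_000, 3.0

means = rng.normal(0.0, 1.0, size=(trials, n)).mean(axis=1)
# P(|mu_hat - mu| >= k * sigma / sqrt(n)), estimated by Monte Carlo
empirical = np.mean(np.abs(means) >= k / np.sqrt(n))

print(f"empirical tail prob : {empirical:.4f}")          # ~0.0027
print(f"Gaussian tail bound : {2 * np.exp(-k**2 / 2):.4f}")  # 2exp(-9/2) ~ 0.022
print(f"Chebyshev bound     : {1 / k**2:.4f}")               # 1/9 ~ 0.111
```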

Definition 1:
A random variable X with mean µ is sub-Gaussian if there exists a σ such that
$$E[\exp(t(X - \mu))] \le \exp(t^2\sigma^2/2) \quad (2.5)$$
for all t ∈ ℝ.

Using the Chernoff bound, we can then derive the same two-sided exponential tail bound for sub-Gaussian variables as in (2.4). Furthermore, the average of n independent σ-sub-Gaussian RVs is $\sigma/\sqrt{n}$-sub-Gaussian, with the tail bound
$$P(|\hat{\mu} - \mu| \ge k\sigma/\sqrt{n}) \le 2\exp(-k^2/2).$$

Theorem 5: (Jensen's Inequality)
For a convex function g : ℝ → ℝ (i.e., g″(x) ≥ 0 if g″ exists), we have
$$E[g(X)] \ge g(E[X]). \quad (2.6)$$
If g is concave (i.e., g″(x) ≤ 0 if g″ exists) then the reverse inequality holds.

Theorem 6: (Bounded random variables are sub-Gaussian)

Let X be a random variable bounded on [a, b] with EX = 0. Then
$$E[\exp(tX)] \le \exp(t^2(b - a)^2/2). \quad (2.7)$$
In other words, bounded random variables are (b − a)-sub-Gaussian.

This in turn yields Hoeffding's bound:


Theorem 7: (Hoeffding's bound)
Suppose that X₁, ..., Xₙ are iid bounded random variables on [a, b]; then
$$P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| \ge \lambda\right) \le 2\exp\left(-\frac{2n\lambda^2}{(b - a)^2}\right). \quad (2.8)$$

Proof: We prove the upper tail, as the lower tail is similar. For any t > 0,
$$\begin{aligned}
P\Big(\sum_{i=1}^n (X_i - \mu) \ge n\lambda\Big)
&\le \exp(-tn\lambda)\, E\Big[\exp\Big(t\sum_{i=1}^n (X_i - \mu)\Big)\Big] && \text{(Chernoff's bound)} \\
&= \exp(-tn\lambda)\prod_{i=1}^n E[\exp(t(X_i - \mu))] && \text{(independence)} \\
&\le \exp(-tn\lambda)\prod_{i=1}^n \exp(t^2(b - a)^2/8) && \text{(Hoeffding's lemma)} \\
&= \exp\Big(\frac{nt^2(b - a)^2}{8} - tn\lambda\Big) = \exp\Big(-\frac{2n\lambda^2}{(b - a)^2}\Big),
\end{aligned}$$
where the last step minimizes the quadratic in t (at $t = 4\lambda/(b - a)^2$). Note that the proof uses the sharper mgf bound $E[\exp(t(X_i - \mu))] \le \exp(t^2(b - a)^2/8)$ (Hoeffding's lemma) rather than the looser constant in (2.7).
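A minimal sketch (not from the notes, assuming NumPy) checking (2.8) on Bernoulli(0.5) variables, which are bounded on [0, 1]:

```python
# Sketch: Monte Carlo check of Hoeffding's bound for iid Bernoulli(0.5)
# variables on [0, 1], so the bound reads 2 exp(-2 n lambda^2).
import numpy as np

rng = np.random.default_rng(1)
n, trials, lam = 50, 200_000, 0.15

x = rng.binomial(1, 0.5, size=(trials, n))
deviations = np.abs(x.mean(axis=1) - 0.5)

print(f"empirical P(|mean - mu| >= {lam}): {np.mean(deviations >= lam):.4f}")
print(f"Hoeffding bound 2exp(-2n lam^2) : {2 * np.exp(-2 * n * lam**2):.4f}")
```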

If we further know that the bounded variables have small variances, we can also derive a tighter bound.

Theorem 8: (Bernstein's Inequality)
Consider n iid random variables X₁, ..., Xₙ with mean µ, bounded on [a, b], with variance σ². Then
$$P(|\hat{\mu} - \mu| \ge t) \le 2\exp\left(-\frac{nt^2}{2(\sigma^2 + (b - a)t)}\right). \quad (2.9)$$

Other functions also show exponential concentration. One main property of such functions is the bounded-difference property
$$|f(x_1, \dots, x_k, \dots, x_n) - f(x_1, \dots, x_k', \dots, x_n)| \le L_k;$$
in other words, if we change the input $x_k$ to $x_k'$, the value of the function changes by at most $L_k$. This property yields McDiarmid's Inequality.

Theorem 9: (McDiarmid's Inequality)

If the random variables X₁, ..., Xₙ are iid and f satisfies the bounded-difference property with constants L₁, ..., Lₙ, then for all t ≥ 0,
$$P(|f(X_1, \dots, X_n) - E[f(X_1, \dots, X_n)]| \ge t) \le 2\exp\left(-\frac{2t^2}{\sum_k L_k^2}\right). \quad (2.10)$$

If, on the other hand, the function is L-Lipschitz, i.e.,
$$|f(X_1, \dots, X_n) - f(Y_1, \dots, Y_n)| \le L\sqrt{\sum_{i=1}^n (X_i - Y_i)^2}$$
for all $X_i, Y_i \in \mathbb{R}$, then we have Levy's Inequality.

Theorem 10: (Levy's Inequality)

Consider a Lipschitz function f and X₁, ..., Xₙ ∼ N(0, 1); we have
$$P(|f(X_1, \dots, X_n) - E[f(X_1, \dots, X_n)]| \ge t) \le 2\exp\left(-\frac{t^2}{2L^2}\right). \quad (2.11)$$

Theorem 11: (χ² tail bound)

Let $Y = \sum_{i=1}^n X_i^2$, where $X_i \sim N(0, 1)$, be a χ² random variable with n degrees of freedom; then
$$P\left(\left|\frac{Y}{n} - 1\right| \ge t\right) = P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i^2 - 1\right| \ge t\right) \le 2\exp(-nt^2/8). \quad (2.12)$$

Finally, we present a summary of all of the above inequalities.

Inequality          | Condition                                | Statement
--------------------|------------------------------------------|------------------------------------------------------------
Markov              | X ≥ 0                                    | P(X ≥ t) ≤ EX/t
Chebyshev           | X with std σ                             | P(|X − EX| ≥ kσ) ≤ 1/k²
Chernoff            | X with mgf finite for |t| ≤ b            | P(X − µ ≥ u) ≤ inf_{0≤t≤b} exp(−t(u + µ)) E[exp(tX)]
Gaussian tail bound | X ∼ N(µ, σ²)                             | P(|X − µ| ≥ u) ≤ 2 exp(−u²/(2σ²))
Bounded RV bound    | X bounded on [a, b], EX = 0              | E[exp(tX)] ≤ exp(t²(b − a)²/2)
Hoeffding           | (Xᵢ) iid and bounded on [a, b]           | P(|µ̂ − µ| ≥ λ) ≤ 2 exp(−2nλ²/(b − a)²)
Bernstein           | (Xᵢ) iid, bounded on [a, b], variance σ² | P(|µ̂ − µ| ≥ t) ≤ 2 exp(−nt²/(2(σ² + (b − a)t)))
χ² tail bound       | Xᵢ ∼ N(0, 1)                             | P(|n⁻¹ Σᵢ Xᵢ² − 1| ≥ t) ≤ 2 exp(−nt²/8)

Table 1: Summary of tail-bound inequalities.

3 Asymptotic concentration / convergence of random variables

3.1 Definitions and Theorems

Definition 2: (Convergence of random variables)
X₁, ..., Xₙ converges in probability to X if
$$\lim_{n\to\infty} P(|X_n - X| \ge \epsilon) = 0 \quad \forall \epsilon > 0; \quad (3.1)$$
equivalently, for every ε, δ > 0 there is an N such that P(|Xₙ − X| ≥ ε) < δ for all n ≥ N.

X₁, ..., Xₙ converges in quadratic mean to X if
$$E(X_n - X)^2 \to 0. \quad (3.2)$$

X₁, ..., Xₙ converges in distribution to X if, for any t where F_X is continuous,
$$\lim_{n\to\infty} F_{X_n}(t) = F_X(t). \quad (3.3)$$

Theorem 12: (Weak law of large numbers)

Suppose Y₁, ..., Yₙ are iid with mean µ and variance σ² < ∞. Let
$$\hat{\mu}_i = \frac{1}{i}\sum_{j=1}^i Y_j;$$
then the sequence µ̂₁, µ̂₂, ... converges in probability to µ = E(Y).

A popular corollary of the WLLN is that if X₁, X₂, ..., Xₙ are iid then $\frac{1}{n}\sum_{i=1}^n X_i^2 \to E[X^2]$ in probability.

Theorem 13: (Continuous mapping theorem)

If a sequence X₁, ..., Xₙ converges in probability (respectively, in distribution) to X, then for any continuous function h, h(X₁), ..., h(Xₙ) converges in probability (respectively, in distribution) to h(X).

Theorem 14: (Slutsky's theorem)

For convergence in probability, if Xₙ → X and Yₙ → Y then Xₙ + Yₙ → X + Y and XₙYₙ → XY.
For convergence in distribution, if Yₙ → c (a constant, in probability) and Xₙ → X, then Xₙ + Yₙ → X + c and XₙYₙ → cX.

Theorem 15: (Central limit theorem)

Let X₁, ..., Xₙ be iid with mean µ and variance σ². We have
$$S_n = \frac{\sqrt{n}(\hat{\mu} - \mu)}{\sigma} \xrightarrow{d} Z \sim N(0, 1). \quad (3.4)$$

A simple use case of the CLT is to construct confidence intervals for averages. Given X₁, X₂, ..., Xₙ iid, the interval
$$C_\alpha = \left(\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
has the property that P(µ ∈ Cα) ≈ 1 − α (i.e., it has coverage 1 − α).
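A minimal coverage check, not from the notes, assuming NumPy/SciPy and Exponential(1) data (for which µ = σ = 1):

```python
# Sketch: coverage of the CLT interval mu_hat ± z_{alpha/2} * sigma / sqrt(n)
# for Exponential(1) data with sigma = 1 treated as known.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, trials, alpha = 100, 20_000, 0.05
z = stats.norm.ppf(1 - alpha / 2)          # z_{alpha/2} ~ 1.96

x = rng.exponential(1.0, size=(trials, n))
mu_hat = x.mean(axis=1)
half_width = z / np.sqrt(n)                # sigma = 1
covered = np.abs(mu_hat - 1.0) <= half_width

print(f"empirical coverage: {covered.mean():.3f} (target {1 - alpha})")
```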

Theorem 16: (Delta method)

Suppose that
$$\frac{\sqrt{n}(X_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1),$$
and g is a continuously differentiable function (i.e., g′(x) is continuous) with g′(µ) ≠ 0. Then
$$\frac{\sqrt{n}(g(X_n) - g(\mu))}{\sigma} \xrightarrow{d} N(0, g'(\mu)^2). \quad (3.5)$$

As an example, suppose we have X₁, ..., Xₙ iid with mean µ and variance σ². To get the distribution of Yₙ = exp(µ̂ₙ), by the CLT and Delta method,
$$\sqrt{n}\left(\frac{\exp(\hat{\mu}_n) - \exp(\mu)}{\sigma}\right) \xrightarrow{d} N(0, \exp(2\mu)).$$

3.2 Other notes

Some examples of different types of convergence:

• (Convergence in probability does not imply convergence in qm) Take U ∼ U[0, 1] and Xₙ = n · I(U ∈ [0, 1/n]). Then
  $$P(|X_n| \ge \epsilon) = P(U \in [0, 1/n]) = \frac{1}{n} \to 0,$$
  but
  $$E(X_n - 0)^2 = E X_n^2 = n^2\, P(U \in [0, 1/n]) = n^2 \cdot \frac{1}{n} = n \not\to 0.$$

• When X is deterministic (i.e., P(X = c) = 1 for some constant c), convergence in distribution implies convergence in probability.

• (Convergence in probability does not imply convergence in expectation) The sequence
  $$X_n = \begin{cases} n & \text{wp } \frac{1}{n} \\ 0 & \text{wp } 1 - \frac{1}{n} \end{cases}$$
  has $P(|X_n| > \epsilon) = \frac{1}{n} \to 0$ but $E(X_n) = 1 \ne 0$.

Math tricks to connect different quantities:

• $F_{X_n}(x)$ to $P(|X_n - X| \ge \epsilon)$:
  $$F_{X_n}(x) = P(X_n \le x,\ X \le x + \epsilon) + P(X_n \le x,\ X > x + \epsilon) \le P(X \le x + \epsilon) + P(|X_n - X| \ge \epsilon).$$

• $E[f(X_n)]$ to $P(|X_n| > \epsilon)$ where f is upper bounded by b:
  $$E[f(X_n)] = E[f(X_n)\, I(|X_n| > \epsilon)] + E[f(X_n)\, I(|X_n| \le \epsilon)] \le b \cdot P(|X_n| > \epsilon) + E[f(X_n)\, I(|X_n| \le \epsilon)].$$

4 Uniform convergence

Theorem 17: (Glivenko-Cantelli theorem)
For any distribution with cdf F_X, if we observe the samples X₁, ..., Xₙ and define the empirical cdf $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n I(X_i \le x)$, then
$$\Delta = \sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F_X(x)| \xrightarrow{p} 0. \quad (4.1)$$
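A minimal sketch (not from the notes, assuming NumPy/SciPy) showing the sup-distance shrink as n grows, here for Exponential(1) samples:

```python
# Sketch: the Kolmogorov-Smirnov distance sup_x |F_n(x) - F(x)| shrinks with n,
# illustrating Glivenko-Cantelli. The sup is attained at an order statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
for n in [100, 1_000, 10_000]:
    x = np.sort(rng.exponential(1.0, size=n))
    ecdf_hi = np.arange(1, n + 1) / n       # F_n at each order statistic
    ecdf_lo = np.arange(0, n) / n           # F_n just below each order statistic
    F = stats.expon.cdf(x)
    delta = max(np.max(np.abs(ecdf_hi - F)), np.max(np.abs(ecdf_lo - F)))
    print(f"n = {n:6d}, sup|F_n - F| = {delta:.4f}")
```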

More generally, we are interested in collections of sets 𝒜 for which we have uniform convergence,
$$\Delta(\mathcal{A}) = \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| \to 0,$$
where $P_n(A) = \frac{1}{n}\sum_{i=1}^n I(X_i \in A)$ is the empirical probability of a set A. Furthermore, one can replace the indicators with general integrable functions.

Definition 3: (Empirical process)

Let ℱ be a class of integrable real-valued functions and suppose we have an iid sample X₁, ..., Xₙ ∼ P; then the empirical process is defined as
$$\Delta(\mathcal{F}) = \sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n f(X_i) - E[f]\right|. \quad (4.2)$$

5 All of statistics and machine learning

5.1 Theorems and definitions

Definition 4: (Empirical risk)
Given a training set {(Xᵢ, yᵢ)}ⁿᵢ₌₁, for a given classifier f we can estimate its empirical risk as
$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n I(f(X_i) \ne y_i). \quad (5.1)$$

If f is some fixed classifier (does not depend on training data), we can apply Hoeffding's bound to see that
$$P(|\hat{R}_n(f) - P(f(X) \ne y)| \ge t) \le 2\exp(-2nt^2). \quad (5.2)$$

If we are trying to pick a good classifier from some set of classifiers ℱ, a natural way is to choose the one that looks best on the training set, i.e.,
$$\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f).$$

To argue that in some cases this procedure will indeed select a good classifier, let f* be the best classifier in ℱ. We would like to bound the excess risk of the classifier,
$$\Delta = P(\hat{f}(X) \ne y) - P(f^*(X) \ne y),$$
which can be expressed as
$$\Delta = \underbrace{P(\hat{f}(X) \ne y) - \hat{R}_n(\hat{f})}_{T_1} + \underbrace{\hat{R}_n(\hat{f}) - \hat{R}_n(f^*)}_{T_2} + \underbrace{\hat{R}_n(f^*) - P(f^*(X) \ne y)}_{T_3}.$$

Note that T₂ ≤ 0 by the definition of f̂, and T₃ is small with high probability by (5.2). Hence, up to the vanishing T₃ term,
$$\Delta \le T_1 = P(\hat{f}(X) \ne y) - \hat{R}_n(\hat{f}) \le \sup_{f \in \mathcal{F}}\big[P(f(X) \ne y) - \hat{R}_n(f)\big].$$

5.2 VC dimension

When the collection of sets 𝒜 has finite cardinality, we have
$$P(\Delta(\mathcal{A}) \ge t) \le \sum_{A \in \mathcal{A}} P(|P_n(A) - P(A)| \ge t) \le 2|\mathcal{A}|\exp(-2nt^2).$$
Therefore, with probability at least 1 − δ,
$$\Delta(\mathcal{A}) \le \sqrt{\frac{\ln(2|\mathcal{A}|/\delta)}{2n}}.$$
We are now interested in controlling ∆(𝒜) when the cardinality of 𝒜 is infinite.
Definition 5:
Let {z₁, ..., zₙ} be a finite set of n points. Let N_𝒜(z₁, ..., zₙ) be the number of distinct sets in the collection of sets
$$\{\{z_1, \dots, z_n\} \cap A : A \in \mathcal{A}\}.$$
The n-th shatter coefficient of 𝒜 is defined as
$$s(\mathcal{A}, n) = \max_{z_1, \dots, z_n} N_{\mathcal{A}}(z_1, \dots, z_n).$$
In other words, it is the maximal number of different subsets of n points that can be picked out by the collection 𝒜.

Theorem 18:
For any distribution P and class of sets 𝒜,
$$P(\Delta(\mathcal{A}) \ge t) \le 8\,s(\mathcal{A}, n)\exp(-nt^2/32). \quad (5.3)$$

Definition 6:
The VC dimension of a set system 𝒜 is the largest integer d for which s(𝒜, d) = 2ᵈ.

Theorem 19: (Sauer's lemma)

If 𝒜 has VC dimension d, then for n > d,
$$s(\mathcal{A}, n) \le (n + 1)^d. \quad (5.4)$$

We can use Sauer's lemma to conclude that for a system 𝒜 with VC dimension d,
$$P(\Delta(\mathcal{A}) \ge t) \le 8(n + 1)^d \exp(-nt^2/32). \quad (5.5)$$
As a corollary, the VC result says that for a class with VC dimension d,
$$\Delta(\mathcal{A}) \approx \sqrt{\frac{d\log n}{n}}.$$

Part II

Statistical estimation and inference

6 Mathematical tricks
Some mathematical formulas and identities that emerge from the notes / homework:

• $E(\hat{\theta} - \theta)^2 = (E\hat{\theta} - \theta)^2 + E(\hat{\theta} - E\hat{\theta})^2 = (\mu_{\hat{\theta}} - \theta)^2 + \mathrm{Var}(\hat{\theta})$.

• $E[E[X \mid Y]] = E[X]$.

• $\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$, where $\bar{x} = \frac{1}{n}\sum x_i$.

• The sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ is an unbiased estimator of σ². The MLE of the variance is $\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$.

• To prove that a sufficient partition induced by T is minimal, show that, for any other sufficient statistic T′, if x and y belong to the same partition induced by T′, then they belong to the same partition induced by T.

• To compute the Bayes risk, we can use two nested expectations instead of the integral:
  $$B_\pi(\hat{\theta}) = \int R(\theta, \hat{\theta})\pi(\theta)\,d\theta = E_{\pi(\theta)}[R(\theta, \hat{\theta})] = E_{\pi(\theta)}[E_X[L(\theta, \hat{\theta})]].$$

• Max of min is smaller than min of max:
  $$\sup_x \inf_y f(x, y) \le \inf_y \sup_x f(x, y).$$

7 Statistical Models

Definition 7: (Statistic)
A statistic is simply a function of the observed sample, i.e., if we have X₁, ..., Xₙ ∼ P then any function T(X₁, ..., Xₙ) is called a statistic. A statistic is a random variable.

Definition 8: (Sufficient Statistic)

A statistic T(X₁, ..., Xₙ) is said to be sufficient for the parameter θ if p(X₁, ..., Xₙ | T(X₁, ..., Xₙ) = t; θ) does not depend on θ for any value of t.

Roughly, once we know the value of the sufficient statistic, the joint distribution no longer carries any more information about θ.
Since the conditional probability p(X₁, ..., Xₙ | T; θ) is not easy to compute, we often check sufficiency through the following theorem.

Theorem 20: (Factorization Theorem)
T(X₁, ..., Xₙ) is sufficient for θ if and only if the joint pdf/pmf of (X₁, ..., Xₙ) can be factored as
$$p(x_1, \dots, x_n; \theta) = h(x_1, \dots, x_n) \times g(T(x_1, \dots, x_n); \theta). \quad (7.1)$$

Definition 9: (Likelihood)
The likelihood arises from viewing the joint density as a function of θ, i.e.,
$$L(\theta) = L(\theta; x_1, \dots, x_n) = p(x_1, \dots, x_n; \theta). \quad (7.2)$$

Definition 10: (Minimal Sufficient Statistic)

A statistic T(x₁, ..., xₙ) is minimal sufficient if it is sufficient, and for any other sufficient statistic S(x₁, ..., xₙ) we can write T(x₁, ..., xₙ) = g(S(x₁, ..., xₙ)).

Theorem 21:
Define
$$R(x_1, \dots, x_n, y_1, \dots, y_n; \theta) = \frac{p(y_1, \dots, y_n; \theta)}{p(x_1, \dots, x_n; \theta)}. \quad (7.3)$$
For a statistic T, if R does not depend on θ iff T(y₁, ..., yₙ) = T(x₁, ..., xₙ), then T is minimal sufficient.

Note that the MSS is not unique, but the minimal sufficient partition is unique. Furthermore, the likelihood function induces a minimal partition.
Now suppose we observe X₁, ..., Xₙ ∼ p(X; θ) and we would like to estimate θ by an estimator θ̂(X₁, ..., Xₙ). We can define the risk as a function of θ:
$$R(\hat{\theta}, \theta) = E(\hat{\theta} - \theta)^2 = \underbrace{(E\hat{\theta} - \theta)^2}_{\text{bias}^2} + \underbrace{E(\hat{\theta} - E\hat{\theta})^2}_{\text{variance}}.$$
Based on the following theorem, estimators which do not depend only on sufficient statistics can be improved.

Theorem 22: (Rao-Blackwell Theorem)
Let θ̂ be an estimator and T be any sufficient statistic. Define θ̃ = E[θ̂ | T]; then
$$R(\tilde{\theta}, \theta) \le R(\hat{\theta}, \theta). \quad (7.4)$$

Proof: Observe that because T is sufficient, θ̃ = E[θ̂ | T] is a valid estimator (it does not depend on θ), and E[θ̂ | T] − θ = E[θ̂ − θ | T]. By conditional Jensen,
$$R(\tilde{\theta}, \theta) = E\big[(E[\hat{\theta} \mid T] - \theta)^2\big] = E\big[(E[\hat{\theta} - \theta \mid T])^2\big] \le E\big[E[(\hat{\theta} - \theta)^2 \mid T]\big] = R(\hat{\theta}, \theta).$$

8 Exponential Families

Definition 11: (Exponential families)
A family P_θ of distributions forms an s-dimensional exponential family if the distributions P_θ have densities of the form
$$p(x; \theta) = \exp\left(\sum_{i=1}^s \eta_i(\theta) T_i(x) - A(\theta)\right) h(x), \quad (8.1)$$
where η_i, A are functions of θ and the T_i are known as the sufficient statistics (due to the factorization theorem). Alternatively, we have the canonical parameterization:
$$p(x; \theta) = \exp\left(\sum_{i=1}^s \theta_i T_i(x) - A(\theta)\right) h(x). \quad (8.2)$$

The term A(θ), called the log-normalization constant, is what makes the distribution integrate to 1:
$$A(\theta) = \log \int_x \exp\left(\sum_{i=1}^s \theta_i T_i(x)\right) h(x)\,dx.$$
The set of θs for which A(θ) < ∞ constitutes the natural parameter space.

8.1 Properties of exponential families

• The exponential family structure is preserved for iid samples,
  $$p(x_1, \dots, x_n; \theta) = \exp\left(\sum_{i=1}^s \theta_i \sum_{j=1}^n T_i(x_j) - nA(\theta)\right) \prod_{i=1}^n h(x_i),$$
  with the same natural parameters θ_i but with sufficient statistics
  $$T_i(x_1, \dots, x_n) = \sum_{j=1}^n T_i(x_j).$$

• The log-partition function generates moments (see the numerical check after this list):
  $$\frac{\partial A(\theta)}{\partial \theta_i} = E[T_i(X)], \qquad \frac{\partial^2 A(\theta)}{\partial \theta_i \partial \theta_j} = \mathrm{Cov}(T_i(X), T_j(X)),$$
  so A is a convex function of θ.

• The likelihood function is concave.

• Given a strictly convex function A we can define a (Bregman) divergence between parameters by
  $$D_A(\theta_1, \theta_2) = A(\theta_2) - A(\theta_1) - \langle \nabla A(\theta_1), \theta_2 - \theta_1 \rangle.$$
  For a pair of distributions we can define the KL divergence,
  $$KL(p\,\|\,q) = \int p(x) \log(p(x)/q(x))\,dx.$$
  For exponential families, the Bregman divergence between parameters (using the log-partition A as the convex function) is exactly equal to the KL divergence between the corresponding distributions.

• The MLE and MoM estimators of exponential families are the same.
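A minimal numerical check of the moment-generating property, not from the notes, using the Poisson family in canonical form (T(x) = x, A(θ) = exp(θ), rate λ = exp(θ)); it assumes NumPy:

```python
# Sketch: verify dA/dtheta = E[T(X)] for the canonical Poisson family,
# where A(theta) = exp(theta) and T(x) = x.
import numpy as np

rng = np.random.default_rng(5)
theta = 0.7
lam = np.exp(theta)                                       # Poisson rate
h = 1e-5
dA = (np.exp(theta + h) - np.exp(theta - h)) / (2 * h)    # finite-difference dA/dtheta
mean_T = rng.poisson(lam, size=1_000_000).mean()          # Monte Carlo E[T(X)] = E[X]
print(f"dA/dtheta = {dA:.4f}, E[T(X)] ~ {mean_T:.4f}, lambda = {lam:.4f}")
```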

Definition 12: (Minimal sufficiency)

An exponential family is minimal if the sufficient statistics are not redundant, i.e., there is no set of coefficients a ∈ ℝˢ, a ≠ 0, such that $\sum_{i=1}^s a_i T_i(x) = \text{const}$ for all x.

Non-minimal exponential families are not statistically identifiable (i.e., there exist θ₁ ≠ θ₂ such that p(X; θ₁) = p(X; θ₂)), while minimal families are. One can eliminate some of the sufficient statistics from a non-minimal representation to obtain a minimal one.
An exponential family where the space of allowed parameters θ is s-dimensional is called a full-rank family. In this case, the sufficient statistic
$$T(X_1, \dots, X_n) = \left(\sum_{i=1}^n T_1(X_i), \dots, \sum_{i=1}^n T_s(X_i)\right)$$
is minimal sufficient.
8.2 The maximum entropy duality

Suppose we are given a random sample {X₁, ..., Xₙ} from some distribution, and we compute the empirical expectations of certain functions that we choose:
$$\hat{\mu}_i = \frac{1}{n}\sum_{j=1}^n T_i(X_j), \quad 1 \le i \le s.$$
A distribution p is consistent with the data we observe if
$$\hat{\mu}_i = E_p[T_i(X)], \quad 1 \le i \le s.$$
If we constrain a small number of statistics T_i in this fashion, there are infinitely many consistent distributions. The principle of maximum entropy suggests picking the distribution that has the largest Shannon entropy $H(p) = -\int_x p(x)\log p(x)\,dx$:
$$p^* = \arg\max_p H(p) \quad \text{subject to} \quad \hat{\mu}_i = E_{p^*}[T_i(X)].$$
The solution to this problem turns out to be
$$p^*(x) = \exp\left(\sum_{i=1}^s \theta_i T_i(x) - A(\theta)\right) h(x),$$
where the θ_i are Lagrange parameters, which are equivalent to the MLE of θ for this distribution.
In summary, exponential families arise naturally from trying to constrain a few simple statistics of a distribution using the data and then choosing a distribution that maximizes the entropy.

9 Methods of Constructing Estimators

Given X₁, ..., Xₙ ∼ p(X; θ), there are three methods for constructing an estimator θ̂.

Definition 13: (Method of Moments)

Suppose θ = (θ₁, ..., θ_k); solve
$$\frac{1}{n}\sum_{j=1}^n X_j^i = E(X^i), \quad 1 \le i \le k. \quad (9.1)$$

Definition 14: (Maximum Likelihood Estimate)
Consider the log-likelihood function
$$LL(\theta) = \log L(\theta) = \sum_i \log p(X_i; \theta).$$
The MLE satisfies
$$\frac{\partial LL}{\partial \theta_j} = 0, \quad 1 \le j \le k. \quad (9.2)$$

Note again that for exponential families, the MoM estimator coincides with the MLE if we choose the sufficient statistics as moments.

Definition 15: (Bayes estimator)
Given a prior distribution p(θ) and sample x₁, ..., xₙ, we can compute the posterior distribution
$$p(\theta \mid x_1, \dots, x_n) = \frac{p(x_1, \dots, x_n \mid \theta)\,p(\theta)}{p(x_1, \dots, x_n)} \propto p(x_1, \dots, x_n \mid \theta)\,p(\theta).$$
Now compute θ̂ from the posterior:
$$\hat{\theta} = E(\theta \mid x_1, \dots, x_n) = \frac{\int \theta\, p(x_1, \dots, x_n \mid \theta)\,p(\theta)\,d\theta}{\int p(x_1, \dots, x_n \mid \theta)\,p(\theta)\,d\theta}. \quad (9.3)$$
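A small sketch of all three constructions on iid Poisson(λ) data, not from the notes; the Gamma(a, b) prior and its hyperparameters are illustrative assumptions (conjugacy makes the posterior mean closed-form):

```python
# Sketch: MoM, MLE, and a conjugate Bayes estimator for Poisson(lambda) data.
# For Poisson, both MoM (first moment) and MLE reduce to the sample mean.
import numpy as np

rng = np.random.default_rng(6)
lam_true, n = 3.0, 200
x = rng.poisson(lam_true, size=n)

mom = x.mean()                      # solve (1/n) sum X_j = E[X] = lambda
mle = x.mean()                      # maximizer of sum_i log p(X_i; lambda)
a, b = 2.0, 1.0                     # assumed Gamma(shape a, rate b) prior
bayes = (a + x.sum()) / (b + n)     # posterior Gamma(a + sum x, b + n) mean

print(f"MoM = {mom:.3f}, MLE = {mle:.3f}, Bayes = {bayes:.3f}")
```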

10 Evaluation of estimators

For a true parameter θ and its estimator θ̂, the MSE is
$$E_\theta(\hat{\theta} - \theta)^2 = \int \cdots \int (\hat{\theta} - \theta)^2\, p(x_1, \dots, x_n; \theta)\,dx_1 \cdots dx_n.$$
Finding estimators with lowest MSE is difficult, so one way to narrow the search space is to restrict our attention to unbiased estimators with minimum variance.
Definition 16: (Fisher Information)
Suppose X₁, ..., Xₙ ∼ p(X; θ). The score function is the gradient of the log-likelihood:
$$s(\theta) = \nabla_\theta LL(\theta) = \sum_{i=1}^n \nabla_\theta \log p(X_i; \theta). \quad (10.1)$$
And the Fisher Information is the expected outer product of the score:
$$I(\theta) = E[s(\theta)s(\theta)^T]. \quad (10.2)$$

Intuitively, the score represents how quickly the distribution density will change when we slightly change the parameter near θ. When we square it and take the expectation to get I(θ), we get an averaged version of this measure. So if the Fisher information is large, the distribution changes quickly when we move the parameter: the distribution with parameter θ is 'quite different' and 'can be well distinguished' from distributions with parameters not so close to θ. This means that we should be able to estimate θ well based on the data. On the other hand, if the Fisher information is small, the distribution is 'very similar' to distributions with parameters not so close to θ and thus more difficult to distinguish, so our estimation will be worse.
The score function and Fisher information have the following important properties:

• $E_{p(X_1, \dots, X_n)}[s(\theta)] = 0$.

• When there is one sample, $I_1(\theta) = -E[\nabla_\theta s(\theta)] = -E[\nabla_\theta^2 \log p(X; \theta)]$.

• If there are n iid samples, I(θ) = nI₁(θ). Hence,
  $$I(\theta) = E[s(\theta)s(\theta)^T] = nI_1(\theta) = -nE[\nabla_\theta^2 \log p(X; \theta)].$$

Theorem 23: (Cramer-Rao bound)

Suppose X₁, ..., Xₙ ∼ p(X; θ) and θ̂ is an unbiased estimator of θ; then
$$\mathrm{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)} = \frac{1}{nI_1(\theta)}. \quad (10.3)$$

Estimators that are unbiased and achieve the Cramer-Rao lower bound on the variance are called efficient estimators. The Cramer-Rao bound also suggests that the MSE in a parametric model typically scales as 1/(nI₁(θ)).
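A minimal sketch, not from the notes, checking the bound for the Bernoulli MLE, which is unbiased and efficient here (I₁(p) = 1/(p(1 − p))); it assumes NumPy:

```python
# Sketch: the Bernoulli MLE p_hat = sample mean attains the Cramer-Rao bound
# Var(p_hat) >= p(1 - p)/n, since I_1(p) = 1/(p(1 - p)).
import numpy as np

rng = np.random.default_rng(7)
p, n, trials = 0.3, 100, 100_000

p_hat = rng.binomial(1, p, size=(trials, n)).mean(axis=1)
print(f"Var(p_hat) (Monte Carlo) : {p_hat.var():.6f}")
print(f"Cramer-Rao bound p(1-p)/n: {p * (1 - p) / n:.6f}")
```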

Definition 17: (Risk of estimator)

The risk of an estimator θ̂ is a function of θ defined as
$$R(\theta, \hat{\theta}) = E_\theta L(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta})\, p(x_1, \dots, x_n; \theta)\,dx. \quad (10.4)$$
The maximum risk is
$$\bar{R}(\hat{\theta}) = \sup_\theta R(\theta, \hat{\theta}). \quad (10.5)$$
The Bayes risk under prior π is
$$B_\pi(\hat{\theta}) = \int R(\theta, \hat{\theta})\,\pi(\theta)\,d\theta. \quad (10.6)$$

An estimator that minimizes the maximum risk is called the minimax estimator. An estimator that minimizes the Bayes risk is called the Bayes estimator.

Theorem 24:
Let $r(\hat{\theta} \mid x^n) = \int L(\theta, \hat{\theta})\,\pi(\theta \mid x^n)\,d\theta$ be the posterior risk of an estimator θ̂(xⁿ) and $m(x^n) = \int p(x^n \mid \theta)\,\pi(\theta)\,d\theta$ be the marginal distribution of Xⁿ. The Bayes risk B_π(θ̂) satisfies
$$B_\pi(\hat{\theta}) = \int r(\hat{\theta} \mid x^n)\, m(x^n)\,dx^n. \quad (10.7)$$

If we choose θ̂ to minimize the posterior risk r(θ̂ | xⁿ), it will minimize B_π(θ̂). If L is the squared loss, for example, we want to find
$$\arg\min_{\hat{\theta}} r(\hat{\theta} \mid x^n) = \arg\min_{\hat{\theta}} \int (\theta - \hat{\theta})^2\, \pi(\theta \mid x^n)\,d\theta.$$
Taking the derivative of r w.r.t. θ̂ and setting it equal to zero yields
$$2\int (\theta - \hat{\theta})\,\pi(\theta \mid x^n)\,d\theta = 0,$$
which yields the following theorem.

Theorem 25:
If L is the squared loss then the Bayes estimator is
$$\hat{\theta} = \int \theta\, \pi(\theta \mid x^n)\,d\theta = E(\theta \mid X = x^n). \quad (10.8)$$
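A minimal sketch of this posterior mean in the conjugate Beta-Bernoulli model (not from the notes; the uniform Beta(1, 1) prior is an illustrative assumption):

```python
# Sketch: under squared loss the Bayes estimator is the posterior mean. With a
# Beta(a, b) prior on a Bernoulli parameter, the posterior is
# Beta(a + sum x, b + n - sum x), so E(theta | x^n) is closed-form.
import numpy as np

rng = np.random.default_rng(8)
theta_true, n, a, b = 0.7, 50, 1.0, 1.0     # Beta(1, 1) = uniform prior

x = rng.binomial(1, theta_true, size=n)
theta_bayes = (a + x.sum()) / (a + b + n)   # posterior mean
print(f"Bayes estimate: {theta_bayes:.3f} (MLE would be {x.mean():.3f})")
```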

We will study two ways in which to use Bayes estimators to find minimax estimators. One involves tightly bounding the minimax risk and the other involves identifying what is called a least favorable prior. It is worth keeping in mind the trade-off: Bayes estimators, although easy to compute, are somewhat subjective (in that they depend strongly on the prior π). Minimax estimators, although more challenging to compute, are not subjective, but do have the drawback that they protect against the worst case, which might lead to pessimistic conclusions, i.e., the minimax risk might be much higher than the Bayes risk for a "nice" prior.

Definition 18: (Minimax estimator)

The minimax estimator θ_minimax satisfies
$$\sup_\theta R(\theta, \theta_{\text{minimax}}) = \inf_{\tilde{\theta}} \sup_\theta R(\theta, \tilde{\theta}). \quad (10.9)$$

Intuitively, we evaluate each estimator θ̃ at its worst case; the estimator that minimizes this worst-case risk is the minimax estimator.

For any estimator θ̂_up, the minimax risk is upper bounded by sup_θ R(θ, θ̂_up). For any prior π, the Bayes risk of the Bayes estimator θ̂_low lower bounds the minimax risk. In summary,
$$B_\pi(\hat{\theta}_{\text{low}}) \le \inf_{\tilde{\theta}} \sup_\theta R(\theta, \tilde{\theta}) \le \sup_\theta R(\theta, \hat{\theta}_{\text{up}}).$$

Theorem 26:
Given X₁, ..., Xₙ ∼ N(θ, I_d), the average $\hat{\theta} = \frac{1}{n}\sum X_i$ is the minimax estimator of θ w.r.t. squared loss.

Theorem 27:
If θ̂ is the Bayes estimator w.r.t. some prior π and R(θ̂, θ) is constant in θ, then θ̂ is the minimax estimator.

11 Asymptotic behavior of MLE

Theorem 28: (Asymptotic theory)
If the model has strong identifiability and a uniform LLN, then the MLE is consistent, i.e.,
$$\hat{\theta}_{MLE} \xrightarrow{p} \theta. \quad (11.1)$$
Under enough regularity conditions,
$$\sqrt{n}(\hat{\theta}_{MLE} - \theta) \xrightarrow{d} N(0, 1/I_1(\theta)), \quad (11.2)$$
or equivalently,
$$\hat{\theta}_{MLE} - \theta \xrightarrow{d} N(0, 1/I(\theta)). \quad (11.3)$$

Using a similar proof, we can also show that
$$\hat{\theta} = \theta + \frac{1}{n}\sum_{i=1}^n \frac{\nabla_\theta \log p(X_i; \theta)}{I_1(\theta)} + \text{Remainder},$$
where the remainder is small (roughly proportional to the previous term multiplied by $I(\tilde{\theta}) - I(\theta) \to 0$).
Definition 19: (Influence function)
The influence function
$$\psi(x) = \frac{\nabla_\theta \log p(x; \theta)}{I_1(\theta)} \quad (11.4)$$
measures the influence each single point has on the estimator θ̂, i.e.,
$$\hat{\theta} \approx \theta + \frac{1}{n}\sum_{i=1}^n \psi(X_i). \quad (11.5)$$
Part III

Statistical testing

12 Mathematical tricks

• $P(Z > z_\alpha) = \alpha$; equivalently, solving $P(Z > t) = \alpha$ gives $t = \Phi^{-1}(1 - \alpha)$.

• Prove that the p-value is uniformly distributed on [0, 1] when the CDF Φ is continuous and increasing (here $\sqrt{n}\,T_n$ is standard normal under the null and $t_n$ is the observed value):
  $$p = P(T_n > t_n) = P(\sqrt{n}\,T_n > \sqrt{n}\,t_n) = \Phi(-\sqrt{n}\,t_n),$$
  $$P(p \le u) = P(\Phi(-\sqrt{n}\,t_n) \le u) = P(-\sqrt{n}\,t_n \le \Phi^{-1}(u)) = \Phi(\Phi^{-1}(u)) = u.$$
13 Testing frameworks
Definition 20: (Constructing tests)
Hypothesis testing involves the following steps:

• Choose a test statistic Tn (X1 , . . . , Xn ).

• Choose a rejection region R ⊂ X n .

Definition 21: (Power function)


Define the power function β as

β(θ) = Pθ (X1 , . . . , Xn ∈ R). (13.1)

A test is size α if supθ∈Θ0 β(θ) = α. A test is level α if supθ∈Θ0 β(θ) ≤ α.

We say that a test controls the Type I error at level α if for any parameter θ₀ ∈ Θ₀,
$$P_{\theta_0}((X_1, \dots, X_n) \in R) \le \alpha.$$

Definition 22: (Uniformly most powerful)

Let C_α denote all level-α tests. A test in C_α with power function β is UMP if β(θ) ≥ β′(θ) for all θ ∈ Θ₁ and any power function β′ of any test in C_α.

Theorem 29: (Neyman-Pearson Lemma)

Consider testing the hypotheses H₀ : θ = θ₀ versus H₁ : θ = θ₁. Let L(θ) be the likelihood function and
$$T_n = \frac{L(\theta_1)}{L(\theta_0)}. \quad (13.2)$$
The test that rejects H₀ if Tₙ > k, where k is chosen so that $P_{\theta_0}(X^n \in R) = \alpha$, is UMP.

Theorem 30: (Wald Test)

Consider testing the hypotheses H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀. Suppose we have access to an estimator θ̂ which under the null hypothesis satisfies
$$\hat{\theta} \xrightarrow{d} N(\theta_0, \sigma_0^2), \quad (13.3)$$
where σ₀² is the variance of θ̂ under the null. The canonical example is when θ̂ is the MLE. In this case, consider the statistic
$$T_n = \frac{\hat{\theta} - \theta_0}{\sigma_0}. \quad (13.4)$$
Under the null, $T_n \xrightarrow{d} N(0, 1)$, so we reject the null if |Tₙ| > z_{α/2}.
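A minimal sketch of a Wald test for a Bernoulli proportion, not from the notes; it uses the plug-in standard error and assumes NumPy/SciPy:

```python
# Sketch: Wald test of H0: p = 0.5 using the MLE p_hat and its estimated sd.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, p0 = 200, 0.5
x = rng.binomial(1, 0.56, size=n)           # data drawn slightly off the null

p_hat = x.mean()
se = np.sqrt(p_hat * (1 - p_hat) / n)       # plug-in estimate of sigma_0
T = (p_hat - p0) / se
p_value = 2 * stats.norm.sf(abs(T))         # two-sided normal tail
print(f"T = {T:.2f}, p-value = {p_value:.3f}, reject at 5%: {p_value < 0.05}")
```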

Theorem 31: (Likelihood Ratio Test)

Consider testing the hypotheses H₀ : θ ∈ Θ₀ versus H₁ : θ ∉ Θ₀. We reject H₀ if
$$\lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)} < c. \quad (13.5)$$

We can simplify the LRT by using an asymptotic approximation.

Theorem 32:
Consider testing the hypotheses H₀ : θ ∈ Θ₀ versus H₁ : θ ∉ Θ₀. Under H₀,
$$-2\log\lambda \xrightarrow{d} \chi_1^2. \quad (13.6)$$
Hence we let Tₙ = −2 log λ and reject when Tₙ > χ²₁,α.
Definition 23:
Suppose we have a test of the form: reject when T(X₁, ..., Xₙ) > c. Then the p-value is
$$p = \sup_{\theta \in \Theta_0} P_\theta(T_n(X_1, \dots, X_n) \ge T_n(x_1, \dots, x_n)), \quad (13.7)$$
where x₁, ..., xₙ are the observed data and X₁, ..., Xₙ ∼ p_{θ₀}.

Under some conditions, the p-value is uniformly distributed on [0, 1] under the null, because
$$P_0(\text{p-value} \le u) = P_0(\Phi(-T_n) \le u) = P_0(-T_n \le \Phi^{-1}(u)) = \Phi(\Phi^{-1}(u)) = u.$$

Definition 24: (Goodness-of-fit testing)

Given samples X₁, ..., Xₙ ∼ P, we want to test H₀ : P = P₀ versus H₁ : P ≠ P₀ for some fixed, known distribution P₀.

Definition 25: (χ² goodness-of-fit test)

In the simplest setting, P₀ and P are multinomials on k categories. Given a sample X₁, ..., Xₙ you can reduce it to a vector of counts (Z₁, ..., Z_k) where Z_i is the number of times you observed the i-th category. We can then consider
$$T(X_1, \dots, X_n) = \sum_{i=1}^k \frac{(Z_i - np_i^0)^2}{np_i^0}. \quad (13.8)$$
We can then show that asymptotically this test statistic, under the null, has a χ²_{k−1} distribution.
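A minimal sketch, not from the notes, computing (13.8) both by hand and with scipy.stats.chisquare (the multinomial used for the data is an illustrative assumption):

```python
# Sketch: chi-square goodness-of-fit test for a multinomial null on k = 3
# categories, with the statistic compared against a chi^2_{k-1} distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
p0 = np.array([0.2, 0.3, 0.5])              # null category probabilities
n = 300
Z = rng.multinomial(n, [0.25, 0.3, 0.45])   # counts drawn off the null

T = np.sum((Z - n * p0) ** 2 / (n * p0))
p_value = stats.chi2.sf(T, df=len(p0) - 1)  # chi^2 with k - 1 df
print(f"T = {T:.2f}, p-value = {p_value:.3f}")
print(stats.chisquare(Z, f_exp=n * p0))     # same statistic and p-value
```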

Definition 26: (Two-sample testing)

Suppose we observe X₁, ..., X_{n₁} ∼ P and Y₁, ..., Y_{n₂} ∼ Q. We want to test H₀ : P = Q versus H₁ : P ≠ Q. In multinomial testing, define (Z₁, ..., Z_k) and (Z′₁, ..., Z′_k) to be the counts in the X and Y samples respectively. We then estimate the pooled category probabilities
$$\hat{c}_i = \frac{Z_i + Z_i'}{n_1 + n_2}. \quad (13.9)$$
The two-sample χ² test then involves the statistic
$$T_n = \sum_{i=1}^k \left[\frac{(Z_i - n_1\hat{c}_i)^2}{n_1\hat{c}_i} + \frac{(Z_i' - n_2\hat{c}_i)^2}{n_2\hat{c}_i}\right], \quad (13.10)$$
which also has a χ²_{k−1} distribution under the null.

Definition 27: (Permutation test)
Suppose we observe X₁, ..., Xₙ, Y₁, ..., Y_m. Define N = m + n and consider all N! permutations of the data. For each permutation, compute a statistic T (so we have T₁, ..., T_{N!}). Under the null hypothesis, each T_i has the same distribution. We can then define the p-value as
$$p = \frac{1}{N!}\sum_{i=1}^{N!} I(T_i > T_{obs}), \quad (13.11)$$
where T_obs is the test statistic on the observed data.

We can prove that the permutation test φ_perm (reject when p ≤ α) satisfies $P_{H_0}(\phi_{perm}(Z_{obs}) = 1) \le \alpha$. By construction, for any dataset,
$$\frac{1}{N!}\sum_{i=1}^{N!} \phi_{perm}(Z_i) \le \alpha.$$
Taking expectations under H₀ and using the fact that each permuted dataset Z_i has the same distribution as Z_obs,
$$\alpha \ge \frac{1}{N!}\sum_{i=1}^{N!} E_{H_0}[\phi_{perm}(Z_i)] = E_{H_0}[\phi_{perm}(Z_{obs})] = P_{H_0}(\phi_{perm}(Z_{obs}) = 1).$$
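A minimal sketch of a two-sample permutation test (not from the notes, assuming NumPy); in practice one samples random permutations rather than enumerating all N!:

```python
# Sketch: permutation test with the absolute difference in means as T.
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.5, 1.0, size=30)
n = len(x)

pooled = np.concatenate([x, y])
T_obs = abs(x.mean() - y.mean())

B = 10_000                                   # random permutations instead of N!
count = 0
for _ in range(B):
    perm = rng.permutation(pooled)
    count += abs(perm[:n].mean() - perm[n:].mean()) >= T_obs
print(f"approximate permutation p-value: {count / B:.4f}")
```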

14 Multiple testing

The basic question is how we should adjust our p-value cutoffs to account for the fact that multiple tests are being done.

Definition 28: (Family Wise Error Rate)

The FWER is the probability of falsely rejecting the null hypothesis even once amongst the multiple tests.

There are two classical ways to control the FWER:

Definition 29: (Sidak correction)

Suppose we do d hypothesis tests and want to control the FWER at α. We reject any test if its p-value is smaller than α_t = 1 − (1 − α)^{1/d}.

The main problem with the Sidak correction is that it requires independence of the p-values. The Bonferroni correction uses the union bound to avoid this assumption.

Definition 30: (Bonferroni correction)

We reject any test if its p-value is smaller than α/d.

An improvement to the Bonferroni correction is Holm's procedure.

Definition 31: (Holm's Procedure)
We perform the following steps:

1. Order the p-values p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p₍d₎.

2. If p₍₁₎ < α/d, reject H₍₁₎ and move on. Else, stop and accept all H₍ᵢ₎ for i ≥ 1.

3. If p₍₂₎ < α/(d − 1), reject H₍₂₎ and move on. Else, stop and accept all H₍ᵢ₎ for i ≥ 2.

4. ...

5. If p₍d₎ < α, reject H₍d₎. Else, accept H₍d₎.

Holm's procedure also controls the FWER at level α and strictly dominates the Bonferroni procedure.

Definition 32: (FDR)

The false discovery rate is the expected proportion of false rejections among the rejections: FDR = E[FDP], where
$$FDP = \begin{cases} V/R & \text{if } R > 0 \\ 0 & \text{if } R = 0 \end{cases}$$
with V the number of false rejections and R the total number of rejections. In this notation, FWER = P(V ≥ 1).

If the p-values are independent, we can control the FDR using the Benjamini-Hochberg procedure.

Definition 33: (Benjamini-Hochberg procedure)
Suppose we do d tests. We perform the following steps:

1. Order the p-values p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p₍d₎.

2. Define the thresholds t_i = iα/d.

3. Find the largest i_max such that p₍ᵢ₎ ≤ t_i, i.e., i_max = max{i : p₍ᵢ₎ ≤ t_i}.

4. Reject all nulls up to and including i_max.
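A minimal sketch of the step-up procedure (not from the notes, assuming NumPy; the example p-values are illustrative):

```python
# Sketch: Benjamini-Hochberg step-up procedure on a vector of p-values.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    d = len(pvals)
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, d + 1) / d        # t_i = i * alpha / d
    below = pvals[order] <= thresholds
    if not below.any():
        return np.zeros(d, dtype=bool)                  # no rejections
    i_max = np.nonzero(below)[0].max()                  # largest i with p_(i) <= t_i
    reject = np.zeros(d, dtype=bool)
    reject[order[: i_max + 1]] = True                   # reject all nulls up to i_max
    return reject

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.20, 0.74])
print(benjamini_hochberg(pvals))    # rejects the two smallest p-values here
```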

Connections between FDR and FWER:

• FDR is a very weak notion of error control.

• Under the global null, FDR control is equivalent to FWER control.

• FWER ≥ FDR always, so controlling the FWER implies FDR control. However, FDR is less stringent, so if it is the correct measure, we have more power by controlling FDR directly.
15 Confidence Intervals and Bootstrapping

Definition 34: (Confidence interval)
Suppose we have a collection of distributions 𝒫. Let Cₙ(X₁, ..., Xₙ) be a set constructed using the observed data X₁, ..., Xₙ. Cₙ is a 1 − α confidence set for a parameter θ if
$$P(\theta \in C_n(X_1, \dots, X_n)) \ge 1 - \alpha \quad \forall P \in \mathcal{P}. \quad (15.1)$$
This means that no matter which distribution in 𝒫 generated the data, the interval guarantees the coverage property described above. At a high level, the confidence interval gives us some idea of how precise our estimate of the unknown parameter θ is, i.e., a wide interval indicates that our (point) estimate is imprecise.
Theorem 33: (Constructing CIs by inverting a test)
Suppose we have a test / family of tests for the hypotheses H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀. Denote the acceptance region for H₀ by A(θ₀). Given observed data {X₁, ..., Xₙ}, we consider the random set
$$C(X_1, \dots, X_n) = \{\theta_0 : \{X_1, \dots, X_n\} \in A(\theta_0)\}. \quad (15.2)$$
If our family of tests has level α then the set C(X₁, ..., Xₙ) is a 1 − α confidence set.

Theorem 34: (Constructing CIs by inverting probability inequalities)

If we have a tail bound inequality $P(|\hat{\theta} - \theta| > \epsilon) \le C(\epsilon)$, we can pick $\epsilon = C^{-1}(\alpha)$.

Definition 35: (Pivot)

A pivot is a function of the data and the unknown parameter θ whose distribution does not depend on θ. For example:

• X₁, ..., Xₙ ∼ U[0, θ] and $Q(X_1, \dots, X_n, \theta) = \frac{X_{(n)}}{\theta}$ has CDF
  $$F(t) = \begin{cases} t^n & 0 \le t \le 1 \\ 1 & t \ge 1 \end{cases}$$

• X₁, ..., Xₙ ∼ N(θ, 1) and $Q(X_1, \dots, X_n, \theta) = \bar{X}_n - \theta \sim N(0, 1/n)$

Theorem 35: (Constructing CIs from a pivot)

Given a pivot we can select a, b which do not depend on θ such that
$$P_\theta(a \le Q(X_1, \dots, X_n, \theta) \le b) = 1 - \alpha \quad \forall \theta \in \Theta. \quad (15.3)$$
Now our confidence interval is
$$C(X_1, \dots, X_n) = \{\theta : a \le Q(X_1, \dots, X_n, \theta) \le b\}. \quad (15.4)$$
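A minimal sketch for the first pivot example above, not from the notes (assuming NumPy): choosing a = α^{1/n} and b = 1 gives an exact interval [X₍ₙ₎, X₍ₙ₎/α^{1/n}] for θ in U[0, θ].

```python
# Sketch: exact CI for theta in U[0, theta] from the pivot Q = X_(n)/theta,
# whose cdf F(t) = t^n does not depend on theta.
import numpy as np

rng = np.random.default_rng(12)
theta_true, n, alpha, trials = 5.0, 20, 0.05, 20_000

covered = 0
for _ in range(trials):
    x_max = rng.uniform(0, theta_true, size=n).max()
    # a = alpha^(1/n), b = 1 gives P(a <= Q <= b) = 1 - a^n = 1 - alpha
    lo, hi = x_max, x_max / alpha ** (1 / n)
    covered += lo <= theta_true <= hi
print(f"coverage: {covered / trials:.3f} (target {1 - alpha})")
```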

16 Causal Inference

We will think of the case where there are two possible actions (or treatments). Often we refer to one of the treatments as the active treatment (or just treatment) and the other as the control treatment (or just control).
We associate every unit with two potential outcomes: the potential outcome if the unit receives the treatment and the potential outcome if the unit receives control. A priori both potential outcomes are possible. However, every unit only receives one of the two treatments (i.e., either treatment or control), and so we only observe one of the two potential outcomes. This is known as the fundamental problem of causal inference.

Definition 36: (Potential Outcomes and Causal Estimand)

Suppose our experiment has n units, each belonging to either the treatment or control group. For the i-th unit we will denote the potential outcome if the unit receives control as Y_i(0) and the potential outcome if the unit receives treatment as Y_i(1).

There are many ways to measure causal associations. An estimand for the causal effect that we will focus on is the average treatment effect:
$$\tau = E[Y(1) - Y(0)] \quad \text{or} \quad \frac{1}{n}\sum_{i=1}^n (Y_i(1) - Y_i(0)), \quad (16.1)$$
which is the difference in outcomes if all units were treated versus all were in the control group.
The main problem in causal inference is that each unit is either treated or in the control group, so we never observe both potential outcomes. What we do observe is
$$Y_i^{obs} = Y_i(1) \cdot W_i + Y_i(0) \cdot (1 - W_i).$$
Suppose m individuals are in the treatment group; then we can estimate the quantity
$$\alpha = E[Y(1) \mid W = 1] - E[Y(0) \mid W = 0]$$
via the estimator
$$\hat{\alpha} = \frac{1}{m}\sum_{i : W_i = 1} Y_i^{obs} - \frac{1}{n - m}\sum_{i : W_i = 0} Y_i^{obs}.$$
However, in general α ≠ τ since in a typical setting we have selection bias, i.e., people can choose treatment or control based on their knowledge of their potential outcomes, so that W and (Y(0), Y(1)) are not independent. One formal way of defining selection bias in this context is simply as the difference between τ and α. If we can ensure that W ⊥ (Y(0), Y(1)), then we indeed have that
$$\alpha = E[Y(1) \mid W = 1] - E[Y(0) \mid W = 0] = E[Y(1)] - E[Y(0)] = \tau.$$

Theorem 36: (Unbiased estimator of treatment effect)

In a completely randomized experiment (m of the n units are assigned to treatment uniformly at random), the difference-in-means estimator
$$\hat{\tau} = \frac{1}{m}\sum_{i : W_i = 1} Y_i^{obs} - \frac{1}{n - m}\sum_{i : W_i = 0} Y_i^{obs}$$
is an unbiased estimator of the treatment effect.

Proof: We have
$$E(\hat{\tau}) = \sum_{i=1}^n \left[\frac{E(W_i)}{m} Y_i(1) - \frac{E(1 - W_i)}{n - m} Y_i(0)\right].$$
The mean of W_i is given by $E(W_i) = \binom{n-1}{m-1}\big/\binom{n}{m} = \frac{m}{n}$, and so $E(1 - W_i) = \frac{n - m}{n}$. This gives us E(τ̂) = τ. ∎
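A minimal simulation sketch of this unbiasedness (not from the notes, assuming NumPy; the constant unit-level effect is an illustrative assumption):

```python
# Sketch: in a completely randomized experiment, the difference-in-means
# estimator averages out to the true ATE tau over repeated randomizations.
import numpy as np

rng = np.random.default_rng(13)
n, m, trials = 100, 50, 20_000
Y0 = rng.normal(0.0, 1.0, size=n)            # fixed potential outcomes
Y1 = Y0 + 2.0                                # constant unit-level effect
tau = np.mean(Y1 - Y0)                       # true ATE = 2

estimates = np.empty(trials)
for t in range(trials):
    treated = rng.choice(n, size=m, replace=False)
    W = np.zeros(n, dtype=bool)
    W[treated] = True
    estimates[t] = Y1[W].mean() - Y0[~W].mean()
print(f"tau = {tau:.3f}, mean of tau_hat = {estimates.mean():.3f}")
```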

Definition 37: (Selection on observables)

The key assumption that makes causal inference from observational data possible is called selection on observables:
$$W \perp (Y(1), Y(0)) \mid X,$$
where X is a set of covariates. One way to think about this assumption is that conditional on X we have a randomized trial, i.e., the treatment is independent of the potential outcomes. So if we condition on X (the confounders), we no longer have any selection bias. Alternatively, within levels of the covariates, treatment is decided by a (possibly biased) coin flip.

Theorem 37: (Identification under selection on observables)

The average treatment effect can be estimated from observed data under the assumption of selection on observables.
Proof: We have
$$\begin{aligned}
\tau &= E[Y(1) - Y(0)] \\
&= E_X[E[Y(1) - Y(0) \mid X]] && \text{(law of total expectation)} \\
&= E_X[E[Y(1) \mid X, W = 1]] - E_X[E[Y(0) \mid X, W = 0]] && \text{(selection on observables)} \\
&= E_X[E[Y^{obs} \mid X, W = 1]] - E_X[E[Y^{obs} \mid X, W = 0]],
\end{aligned}$$
which is just a function of the observed data. ∎

To recap, causal inference is most clearly thought about in two steps:

1. Identification. Leveraging some set of "causal assumptions" in order to link the parameter of interest to something that can be derived from the observed data distribution. In a simple randomized trial, we used the assumption W ⊥ (Y(0), Y(1)) to say that
   $$\tau = E[Y(1)] - E[Y(0)] = E[Y(1) \mid W = 1] - E[Y(0) \mid W = 0] = E[Y^{obs} \mid W = 1] - E[Y^{obs} \mid W = 0].$$

2. Estimation. Once we have "identified" the parameter (written it in the form of observed quantities), we can design an estimator for it.