
Probability and Statistics

Cookbook

Version 0.2.6
19th December, 2017
http://statistics.zone/
Copyright © Matthias Vallentin
Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter delta method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics

This cookbook integrates various topics in probability theory and statistics. It is based on the literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, and is also influenced by other sources [4, 5]. If you find errors or have suggestions for improvements, please get in touch at http://statistics.zone/.
1 Distribution Overview

1.1 Discrete Distributions

For each distribution we list the notation¹, CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform Unif{a, ..., b}
  F_X(x) = 0 for x < a;  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b;  1 for x > b
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2    V[X] = ((b − a + 1)² − 1)/12    M_X(s) = (e^{as} − e^{−(b+1)s})/(s(b − a))

Bernoulli Bern(p)
  F_X(x) = (1 − p)^{1−x}
  f_X(x) = p^x (1 − p)^{1−x}
  E[X] = p    V[X] = p(1 − p)    M_X(s) = 1 − p + pe^s

Binomial Bin(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np    V[X] = np(1 − p)    M_X(s) = (1 − p + pe^s)^n

Multinomial Mult(n, p)
  f_X(x) = (n!/(x_1! ··· x_k!)) p_1^{x_1} ··· p_k^{x_k}  with Σ_{i=1}^k x_i = n
  E[X] = (np_1, ..., np_k)ᵀ    V[X_i] = np_i(1 − p_i),  Cov[X_i, X_j] = −np_i p_j (i ≠ j)
  M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric Hyp(N, m, n)
  F_X(x) ≈ Φ((x − np)/√(np(1 − p)))  (with p = m/N)
  f_X(x) = C(m, x) C(N − m, n − x)/C(N, n)
  E[X] = nm/N    V[X] = nm(N − n)(N − m)/(N²(N − 1))

Negative Binomial NBin(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p    V[X] = r(1 − p)/p²    M_X(s) = (p/(1 − (1 − p)e^s))^r

Geometric Geo(p)
  F_X(x) = 1 − (1 − p)^x,  x ∈ ℕ⁺
  f_X(x) = p(1 − p)^{x−1},  x ∈ ℕ⁺
  E[X] = 1/p    V[X] = (1 − p)/p²    M_X(s) = pe^s/(1 − (1 − p)e^s)

Poisson Po(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!
  f_X(x) = λ^x e^{−λ}/x!
  E[X] = λ    V[X] = λ    M_X(s) = e^{λ(e^s − 1)}

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
[Figure: PMF (top row) and CDF (bottom row) of the discrete Uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]
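As a quick numerical companion to the table above (this sketch is not part of the original cookbook; it assumes NumPy and SciPy are available, and the parameter values are arbitrary), the following Python snippet evaluates PMFs, CDFs, and moments of a few listed discrete distributions and checks them against the closed-form expressions.

```python
import numpy as np
from scipy import stats

n, p, lam = 10, 0.3, 4.0

binom = stats.binom(n, p)      # Bin(n, p)
geom = stats.geom(p)           # Geo(p), support {1, 2, ...}
pois = stats.poisson(lam)      # Po(lambda)

# PMF and CDF at a point, e.g. x = 3
x = 3
print(binom.pmf(x), binom.cdf(x))   # C(n,x) p^x (1-p)^(n-x), I_{1-p}(n-x, x+1)
print(geom.pmf(x), pois.pmf(x))

# Moments against the table's closed forms
assert np.isclose(binom.mean(), n * p)
assert np.isclose(binom.var(), n * p * (1 - p))
assert np.isclose(geom.mean(), 1 / p)
assert np.isclose(geom.var(), (1 - p) / p**2)
assert np.isclose(pois.mean(), lam) and np.isclose(pois.var(), lam)

# MGF of Bin(n, p) at s: (1 - p + p e^s)^n, checked by direct summation
s = 0.2
mgf_direct = sum(np.exp(s * k) * binom.pmf(k) for k in range(n + 1))
assert np.isclose(mgf_direct, (1 - p + p * np.exp(s))**n)
```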
1.2 Continuous Distributions

Uniform Unif(a, b)
  F_X(x) = 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12    M_X(s) = (e^{sb} − e^{sa})/(s(b − a))

Normal N(µ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^{x} φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))
  E[X] = µ    V[X] = σ²    M_X(s) = exp(µs + σ²s²/2)

Log-Normal ln N(µ, σ²)
  F_X(x) = 1/2 + (1/2) erf((ln x − µ)/(√2 σ))
  f_X(x) = (1/(x√(2πσ²))) exp(−(ln x − µ)²/(2σ²))
  E[X] = e^{µ + σ²/2}    V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal MVN(µ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))
  E[X] = µ    V[X] = Σ    M_X(s) = exp(µᵀs + (1/2) sᵀΣs)

Student's t Student(ν)
  F_X(x) = I_x(ν/2, ν/2)
  f_X(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 for ν > 1    V[X] = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2

Chi-square χ²_k
  F_X(x) = γ(k/2, x/2)/Γ(k/2)
  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2−1} e^{−x/2}
  E[X] = k    V[X] = 2k    M_X(s) = (1 − 2s)^{−k/2} for s < 1/2

F F(d₁, d₂)
  F_X(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f_X(x) = √((d₁x)^{d₁} d₂^{d₂}/(d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2)    V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4))

Exponential* Exp(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β    V[X] = β²    M_X(s) = (1 − s/β)^{−1} for s < β

Gamma* Gamma(α, β)
  F_X(x) = γ(α, βx)/Γ(α)
  f_X(x) = (β^α/Γ(α)) x^{α−1} e^{−βx}
  E[X] = α/β    V[X] = α/β²    M_X(s) = (1 − s/β)^{−α} for s < β

Inverse Gamma InvGamma(α, β)
  F_X(x) = Γ(α, β/x)/Γ(α)
  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) for α > 1    V[X] = β²/((α − 1)²(α − 2)) for α > 2
  M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet Dir(α)
  f_X(x) = (Γ(Σ_{i=1}^k α_i)/Π_{i=1}^k Γ(α_i)) Π_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i/Σ_{i=1}^k α_i    V[X_i] = E[X_i](1 − E[X_i])/(Σ_{i=1}^k α_i + 1)

Beta Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}
  E[X] = α/(α + β)    V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (Π_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k)    V[X] = λ²Γ(1 + 2/k) − µ²  (with µ = E[X])
  M_X(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α for x ≥ x_m
  f_X(x) = α x_m^α/x^{α+1} for x ≥ x_m
  E[X] = αx_m/(α − 1) for α > 1    V[X] = x_m² α/((α − 1)²(α − 2)) for α > 2
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) for s < 0

* We use the rate parameterization where β = 1/λ. Some textbooks use β as scale parameter instead [6].

[Figure: PDF (first block) and CDF (second block) of the continuous Uniform, Normal, Log-Normal, Student's t, χ², F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for several parameter settings.]
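The continuous distributions can be exercised the same way. A minimal sketch (not from the original cookbook; assumes NumPy/SciPy, arbitrary parameters): note that SciPy's gamma distribution takes a scale argument, so with the table's rate convention the scale is 1/β, echoing the parameterization caveat in the footnote above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

mu, sigma = 1.0, 2.0
alpha, beta = 3.0, 0.5          # Gamma(alpha, beta) with beta read as a rate

norm = stats.norm(loc=mu, scale=sigma)
gamma = stats.gamma(a=alpha, scale=1 / beta)   # SciPy uses scale = 1 / rate

# Densities and CDFs at a point
x = 1.5
print(norm.pdf(x), norm.cdf(x))
print(gamma.pdf(x), gamma.cdf(x))

# Closed-form moments from the table vs. Monte Carlo estimates
samples = gamma.rvs(size=200_000, random_state=rng)
print(alpha / beta, samples.mean())        # E[X] = alpha / beta
print(alpha / beta**2, samples.var())      # V[X] = alpha / beta^2
```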
2 Probability Theory Law of Total Probability
n n
Definitions X G
P [B] = P [B|Ai ] P [Ai ] Ω= Ai
• Sample space Ω i=1 i=1

• Outcome (point or element) ω ∈ Ω Bayes’ Theorem


• Event A ⊆ Ω
n
• σ-algebra A P [B | Ai ] P [Ai ] G
P [Ai | B] = Pn Ω= Ai
1. ∅ ∈ A j=1 P [B | Aj ] P [Aj ] i=1
S∞
2. A1 , A2 , . . . , ∈ A =⇒ i=1 Ai ∈ A Inclusion-Exclusion Principle
3. A ∈ A =⇒ ¬A ∈ A
n n
r
[ X X \
• Probability Distribution P (−1)r−1

Ai = A ij


1. P [A] ≥ 0 ∀A i=1 r=1 i≤i1 <···<ir ≤n j=1

2. P [Ω] = 1
"∞ #
G ∞
X 3 Random Variables
3. P Ai = P [Ai ]
i=1 i=1 Random Variable (RV)
• Probability space (Ω, A, P) X:Ω→R

Properties Probability Mass Function (PMF)

• P [∅] = 0 fX (x) = P [X = x] = P [{ω ∈ Ω : X(ω) = x}]


• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
Probability Density Function (PDF)
• P [¬A] = 1 − P [A]
b
• P [B] = P [A ∩ B] + P [¬A ∩ B]
Z
P [a ≤ X ≤ b] = f (x) dx
• P [Ω] = 1 P [∅] = 0 a
S T T S
• ¬( n An ) = n ¬An ¬( n An ) = n ¬An DeMorgan
S T Cumulative Distribution Function (CDF)
• P [ n An ] = 1 − P [ n ¬An ]
• P [A ∪ B] = P [A] + P [B] − P [A ∩ B] FX : R → [0, 1] FX (x) = P [X ≤ x]
=⇒ P [A ∪ B] ≤ P [A] + P [B]
1. Nondecreasing: x1 < x2 =⇒ F (x1 ) ≤ F (x2 )
• P [A ∪ B] = P [A ∩ ¬B] + P [¬A ∩ B] + P [A ∩ B]
2. Normalized: limx→−∞ = 0 and limx→∞ = 1
• P [A ∩ ¬B] = P [A] − P [A ∩ B]
3. Right-Continuous: limy↓x F (y) = F (x)
Continuity of Probabilities
S∞ b
• A1 ⊂ A2 ⊂ . . . =⇒ limn→∞ P [An ] = P [A]
Z
where A = i=1 Ai
T∞ P [a ≤ Y ≤ b | X = x] = fY |X (y | x)dy a≤b
• A1 ⊃ A2 ⊃ . . . =⇒ limn→∞ P [An ] = P [A] where A = i=1 Ai a

Independence ⊥
⊥ f (x, y)
fY |X (y | x) =
A⊥
⊥ B ⇐⇒ P [A ∩ B] = P [A] P [B] fX (x)
Conditional Probability Independence

P [A | B] = P [A ∩ B] / P [B]   (P [B] > 0)        1. P [X ≤ x, Y ≤ y] = P [X ≤ x] P [Y ≤ y]
2. fX,Y (x, y) = fX (x)fY (y)
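A tiny numeric sketch of the law of total probability and Bayes' theorem above (the partition/event numbers are made up for illustration and are not from the cookbook):

```python
# Hypothetical numbers: a partition A1 / A2 of the sample space and an event B.
p_A = [0.01, 0.99]            # P[A1], P[A2]
p_B_given_A = [0.95, 0.05]    # P[B | A1], P[B | A2]

# Law of total probability: P[B] = sum_i P[B | Ai] P[Ai]
p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))

# Bayes' theorem: P[A1 | B] = P[B | A1] P[A1] / P[B]
p_A1_given_B = p_B_given_A[0] * p_A[0] / p_B
print(p_B, p_A1_given_B)      # 0.059 and approximately 0.161
```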
Z
3.1 Transformations • E [XY ] = xyfX,Y (x, y) dFX (x) dFY (y)
X,Y
Transformation function
• E [ϕ(Y )] 6= ϕ(E [X]) (cf. Jensen inequality)
Z = ϕ(X)
• P [X ≥ Y ] = 1 =⇒ E [X] ≥ E [Y ]
Discrete • P [X = Y ] = 1 =⇒ E [X] = E [Y ]
X ∞
fZ (z) = P [ϕ(X) = z] = P [{x : ϕ(x) = z}] = P X ∈ ϕ−1 (z) =
 
fX (x)
X
• E [X] = P [X ≥ x] X discrete
x∈ϕ−1 (z) x=1

Continuous Sample mean


n
Z 1X
X̄n = Xi
FZ (z) = P [ϕ(X) ≤ z] = f (x) dx with Az = {x : ϕ(x) ≤ z} n i=1
Az
Conditional expectation
Special case if ϕ strictly monotone Z

d

dx 1 • E [Y | X = x] = yf (y | x) dy
fZ (z) = fX (ϕ−1 (z)) ϕ−1 (z) = fX (x) = fX (x)

dz dz |J| • E [X] = E [E [X | Y ]]
The Rule of the Lazy Statistician
• E [ϕ(X, Y ) | X = x] = ∫_{−∞}^{∞} ϕ(x, y) fY |X (y | x) dy
Z Z ∞
E [Z] = ϕ(x) dFX (x) • E [ϕ(Y, Z) | X = x] = ϕ(y, z)f(Y,Z)|X (y, z | x) dy dz
−∞
Z Z • E [Y + Z | X] = E [Y | X] + E [Z | X]
E [IA (x)] = IA (x) dFX (x) = dFX (x) = P [X ∈ A] • E [ϕ(X)Y | X] = ϕ(X)E [Y | X]
A
• E [Y | X] = c =⇒ Cov [X, Y ] = 0
Convolution
Z ∞ Z z
X,Y ≥0
• Z := X + Y fZ (z) = fX,Y (x, z − x) dx = fX,Y (x, z − x) dx
−∞ 0 5 Variance
Z ∞
• Z := |X − Y | fZ (z) = 2 fX,Y (x, z + x) dx Definition and properties
0
Z ∞ Z ∞ 2
    2
X ⊥
⊥ • V [X] = σX = E (X − E [X])2 = E X 2 − E [X]
• Z := fZ (z) = |y|fX,Y (yz, y) dy = |y|fX (yz)fY (y) dy " n # n
Y −∞ −∞ X X X
• V Xi = V [Xi ] + Cov [Xi , Xj ]
i=1 i=1 i6=j
4 Expectation " n
X
# n
X
• V Xi = V [Xi ] if Xi ⊥
⊥ Xj
Definition and properties i=1 i=1
X

 xfX (x) X discrete Standard deviation p
sd[X] = V [X] = σX

Z  x

• E [X] = µX = x dFX (x) = Covariance

 Z
 xfX (x) dx X continuous


• Cov [X, Y ] = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ]
• P [X = c] = 1 =⇒ E [X] = c • Cov [X, a] = 0
• E [cX] = c E [X] • Cov [X, X] = V [X]
• E [X + Y ] = E [X] + E [Y ] • Cov [X, Y ] = Cov [Y, X]
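The expectation, variance, and covariance identities in Sections 4 and 5 have direct sample analogues; the following sketch (not part of the cookbook; assumes NumPy, simulated data) checks a few of them numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

X = rng.normal(0.0, 1.0, size=n)
Y = 2.0 * X + rng.normal(0.0, 0.5, size=n)   # correlated with X by construction

# Sample mean and unbiased sample variance S^2 (the 1/(n-1) factor)
print(X.mean(), X.var(ddof=1))

# Linearity of expectation: E[X + Y] = E[X] + E[Y]
print(np.isclose((X + Y).mean(), X.mean() + Y.mean()))

# Cov[X, Y] = E[XY] - E[X] E[Y], and V[X + Y] = V[X] + V[Y] + 2 Cov[X, Y]
cov_xy = (X * Y).mean() - X.mean() * Y.mean()
print(cov_xy, np.cov(X, Y, ddof=1)[0, 1])
print((X + Y).var(), X.var() + Y.var() + 2 * cov_xy)
```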
• Cov [aX, bY ] = abCov [X, Y ] 7 Distribution Relationships
• Cov [X + a, Y + b] = Cov [X, Y ]

n m

n X m
Binomial
X X X
n
• Cov  Xi , Yj  = Cov [Xi , Yj ] X
i=1 j=1 i=1 j=1
• Xi ∼ Bern (p) =⇒ Xi ∼ Bin (n, p)
i=1
Correlation • X ∼ Bin (n, p) , Y ∼ Bin (m, p) =⇒ X + Y ∼ Bin (n + m, p)
Cov [X, Y ]
ρ [X, Y ] = p • limn→∞ Bin (n, p) = Po (np) (n large, p small)
V [X] V [Y ] • limn→∞ Bin (n, p) = N (np, np(1 − p)) (n large, p far from 0 and 1)
Independence
Negative Binomial
X⊥
⊥ Y =⇒ ρ [X, Y ] = 0 ⇐⇒ Cov [X, Y ] = 0 ⇐⇒ E [XY ] = E [X] E [Y ]
• X ∼ NBin (1, p) = Geo (p)
Pr
Sample variance • X ∼ NBin (r, p) = i=1 Geo (p)
n P P
1 X • Xi ∼ NBin (ri , p) =⇒ Xi ∼ NBin ( ri , p)
S2 = (Xi − X̄n )2
n − 1 i=1 • X ∼ NBin (r, p) . Y ∼ Bin (s + r, p) =⇒ P [X ≤ s] = P [Y ≥ r]
Conditional variance Poisson
    2 n n
!
• V [Y | X] = E (Y − E [Y | X])2 | X = E Y 2 | X − E [Y | X] X X
• Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi ∼ Po λi
• V [Y ] = E [V [Y | X]] + V [E [Y | X]]
i=1 i=1
 
n n
X X λ i
6 Inequalities • Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi Xj ∼ Bin  Xj , Pn 
j=1 j=1 j=1 λ j

Cauchy-Schwarz
2 Exponential
E [XY ] ≤ E X 2 E Y 2
   
n
X
Markov • Xi ∼ Exp (β) ∧ Xi ⊥
⊥ Xj =⇒ Xi ∼ Gamma (n, β)
E [ϕ(X)]
P [ϕ(X) ≥ t] ≤ i=1
t • Memoryless property: P [X > x + y | X > y] = P [X > x]
Chebyshev
V [X] Normal
P [|X − E [X]| ≥ t] ≤
t2  
X−µ

Chernoff • X ∼ N µ, σ 2 =⇒ σ ∼ N (0, 1)
δ
 
e 
• X ∼ N µ, σ ∧ Z = aX + b =⇒ Z ∼ N aµ + b, a2 σ 2
2

P [X ≥ (1 + δ)µ] ≤ δ > −1
(1 + δ)1+δ 
• Xi ∼ N µi , σi2 ∧ Xi ⊥⊥ Xj =⇒
P
Xi ∼ N
P
µi , i σi2
P 
i i
Hoeffding  
• P [a < X ≤ b] = Φ b−µ − Φ a−µ

σ σ
X1 , . . . , Xn independent ∧ P [Xi ∈ [ai , bi ]] = 1 ∧ 1 ≤ i ≤ n • Φ(−x) = 1 − Φ(x) φ0 (x) = −xφ(x) φ00 (x) = (x2 − 1)φ(x)
−1
2 • Upper quantile of N (0, 1): zα = Φ (1 − α)
P X̄ − E X̄ ≥ t ≤ e−2nt t > 0
   

Gamma
2n2 t2
 
   
P |X̄ − E X̄ | ≥ t ≤ 2 exp − Pn 2
t>0
i=1 (bi − ai ) • X ∼ Gamma (α, β) ⇐⇒ X/β ∼ Gamma (α, 1)

Jensen • Gamma (α, β) ∼ i=1 Exp (β)
P P
E [ϕ(X)] ≥ ϕ(E [X]) ϕ convex • Xi ∼ Gamma (αi , β) ∧ Xi ⊥
⊥ Xj =⇒ i Xi ∼ Gamma ( i αi , β)
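Distribution relationships such as "a sum of n independent exponentials is Gamma(n, β)" are easy to sanity-check by simulation. An illustrative sketch (not from the cookbook; assumes NumPy/SciPy and uses the mean/scale form for the exponential draws, with the Gamma compared under the same convention):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, beta = 5, 2.0              # sum of n iid exponentials, each with mean beta

# 50,000 replicates of X_1 + ... + X_n
sums = rng.exponential(scale=beta, size=(50_000, n)).sum(axis=1)

# Compare against Gamma(shape = n, scale = beta) with a Kolmogorov-Smirnov test
ks = stats.kstest(sums, stats.gamma(a=n, scale=beta).cdf)
print(ks.statistic, ks.pvalue)    # small statistic / large p-value: consistent
```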
Z ∞
Γ(α) 9.2 Bivariate Normal
• = xα−1 e−λx dx
λα 0  
Let X ∼ N µx , σx2 and Y ∼ N µy , σy2 .
Beta  
1 Γ(α + β) α−1 1 z
• xα−1 (1 − x)β−1 = x (1 − x)β−1 f (x, y) = exp −
2(1 − ρ2 )
p
B(α, β) Γ(α)Γ(β) 2πσx σy 1 − ρ2
  B(α + k, β) α+k−1
E X k−1
  " #
• E Xk =
2 2
=
  
B(α, β) α+β+k−1 x − µx y − µy x − µx y − µy
z= + − 2ρ
• Beta (1, 1) ∼ Unif (0, 1) σx σy σx σy
Conditional mean and variance
8 Probability and Moment Generating Functions E [X | Y ] = E [X] + ρ
σX
(Y − E [Y ])
  σY
• GX (t) = E tX |t| < 1 p
V [X | Y ] = σX 1 − ρ2
"∞ # ∞  
X (Xt)i X E Xi
· ti
 
• MX (t) = GX (et ) = E eXt = E =
i=0
i! i=0
i!
9.3 Multivariate Normal
• P [X = 0] = GX (0)
• P [X = 1] = G0X (0) Covariance matrix Σ (Precision matrix Σ−1 )
(i)
GX (0)  
• P [X = i] = V [X1 ] · · · Cov [X1 , Xk ]
i! .. .. ..
Σ=
 
• E [X] = G0X (1− ) . . . 
  (k)
• E X k = MX (0) Cov [Xk , X1 ] · · · V [Xk ]
 
X! (k) If X ∼ N (µ, Σ),
• E = GX (1− )
(X − k)!  
2 1
• V [X] = G00X (1− ) + G0X (1− ) − (G0X (1− )) fX (x) = (2π) −n/2
|Σ|
−1/2
exp − (x − µ)T Σ−1 (x − µ)
d 2
• GX (t) = GY (t) =⇒ X = Y
Properties
9 Multivariate Distributions • Z ∼ N (0, 1) ∧ X = µ + Σ1/2 Z =⇒ X ∼ N (µ, Σ)
• X ∼ N (µ, Σ) =⇒ Σ−1/2 (X − µ) ∼ N (0, 1)
9.1 Standard Bivariate Normal • X ∼ N (µ, Σ) =⇒ AX ∼ N Aµ, AΣAT

p 
Let X, Y ∼ N (0, 1) ∧ X ⊥
⊥ Z where Y = ρX + 1 − ρ2 Z • X ∼ N (µ, Σ) ∧ kak = k =⇒ aT X ∼ N aT µ, aT Σa

Joint density
1 x2 + y 2 − 2ρxy
  10 Convergence
f (x, y) = exp −
2(1 − ρ2 )
p
2π 1 − ρ2 Let {X1 , X2 , . . .} be a sequence of rv’s and let X be another rv. Let Fn denote
Conditionals the cdf of Xn and let F denote the cdf of X.
Types of Convergence
(Y | X = x) ∼ N ρx, 1 − ρ2 (X | Y = y) ∼ N ρy, 1 − ρ2
 
and D
1. In distribution (weakly, in law): Xn → X
Independence
X ⊥⊥ Y ⇐⇒ ρ = 0          lim_{n→∞} Fn (t) = F (t) ∀t where F continuous
P
2. In probability: Xn → X √
X̄n − µ n(X̄n − µ) D
Zn := q   = →Z where Z ∼ N (0, 1)
(∀ε > 0) lim P [|Xn − X| > ε] = 0 σ
n→∞ V X̄n
as
3. Almost surely (strongly): Xn → X lim P [Zn ≤ z] = Φ(z) z∈R
n→∞
h i h i
P lim Xn = X = P ω ∈ Ω : lim Xn (ω) = X(ω) = 1 CLT notations
n→∞ n→∞

qm
Zn ≈ N (0, 1)
4. In quadratic mean (L2 ): Xn → X
σ2
 
X̄n ≈ N µ,
lim E (Xn − X)2 = 0 n
 
n→∞
σ2
 
X̄n − µ ≈ N 0,
Relationships n
√ 2

qm P D n(X̄n − µ) ≈ N 0, σ
• Xn → X =⇒ Xn → X =⇒ Xn → X √
as
• Xn → X =⇒ Xn → X
P n(X̄n − µ)
≈ N (0, 1)
D P
• Xn → X ∧ (∃c ∈ R) P [X = c] = 1 =⇒ Xn → X σ
P P P
• Xn →X ∧ Yn → Y =⇒ Xn + Yn → X + Y
qm qm qm
• Xn →X ∧ Yn → Y =⇒ Xn + Yn → X + Y Continuity correction
P P P
• Xn →X ∧ Yn → Y =⇒ Xn Yn → XY
x + 12 − µ
P P
 
• Xn →X =⇒ ϕ(Xn ) → ϕ(X)  
P X̄n ≤ x ≈ Φ √
D
• Xn → X =⇒ ϕ(Xn ) → ϕ(X)
D σ/ n
qm
• Xn → b ⇐⇒ limn→∞ E [Xn ] = b ∧ limn→∞ V [Xn ] = 0
x − 12 − µ
 
qm
 
• X1 , . . . , Xn iid ∧ E [X] = µ ∧ V [X] < ∞ ⇐⇒ X̄n → µ P X̄n ≥ x ≈ 1 − Φ √
σ/ n
Slutsky’s Theorem Delta method
D P D
• Xn → X and Yn → c =⇒ Xn + Yn → X + c 
σ2
 
2 σ2

D P D
• Xn → X and Yn → c =⇒ Xn Yn → cX Yn ≈ N µ, =⇒ ϕ(Yn ) ≈ N ϕ(µ), (ϕ0 (µ))
n n
D D D
• In general: Xn → X and Yn → Y =⇒
6 Xn + Yn → X + Y
11 Statistical Inference
10.1 Law of Large Numbers (LLN)
iid
Let X1 , · · · , Xn ∼ F if not otherwise noted.
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ.
Weak (WLLN)
P
X̄n → µ n→∞ 11.1 Point Estimation
Strong (SLLN) • Point estimator θbn of θ is a rv: θbn = g(X1 , . . . , Xn )
as
h i
X̄n → µ n→∞ • bias(θbn ) = E θbn − θ
P
• Consistency: θbn → θ
10.2 Central Limit Theorem (CLT)
• Sampling distribution: F (θbn )
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] = σ 2 .
r h i
• Standard error: se(θn ) = V θbn
b
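A small simulation of the CLT statement above (not part of the cookbook; assumes NumPy/SciPy, arbitrary sample size and seed): standardized means of iid Exponential draws behave approximately like N(0, 1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 50, 20_000
mu, sigma = 1.0, 1.0               # Exp(1) has mean 1 and variance 1

# Z_n = sqrt(n) (Xbar_n - mu) / sigma for each replication
xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma

# Empirical CDF of Z_n at a few points vs. the standard normal CDF Phi
for t in (-1.0, 0.0, 1.0):
    print(t, (z <= t).mean(), stats.norm.cdf(t))
```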
h i h i
• Mean squared error: mse = E (θbn − θ)2 = bias(θbn )2 + V θbn 11.4 Statistical Functionals
• limn→∞ bias(θbn ) = 0 ∧ limn→∞ se(θbn ) = 0 =⇒ θbn is consistent • Statistical functional: T (F )
θbn − θ D • Plug-in estimator of θ = (F ): θbn = T (Fbn )
• Asymptotic normality: → N (0, 1) R
se • Linear functional: T (F ) = ϕ(x) dFX (x)
• Slutsky’s Theorem often lets us replace se(θbn ) by some (weakly) consis- • Plug-in estimator for linear functional:
tent estimator σ
bn . Z n
1X
T (Fbn ) = ϕ(x) dFbn (x) = ϕ(Xi )
11.2 Normal-Based Confidence Interval n i=1
 
b 2 . Let zα/2 = Φ−1 (1 − (α/2)), i.e., P Z > zα/2 = α/2
 
Suppose θbn ≈ N θ, se
 
  b 2 =⇒ T (Fbn ) ± zα/2 se
• Often: T (Fbn ) ≈ N T (F ), se b
and P −zα/2 < Z < zα/2 = 1 − α where Z ∼ N (0, 1). Then
• pth quantile: F −1 (p) = inf{x : F (x) ≥ p}
Cn = θbn ± zα/2 se
b • µb = X̄n
n
1 X
b2 =
• σ (Xi − X̄n )2
11.3 Empirical distribution n − 1 i=1
1
Pn
Empirical Distribution Function (ECDF) n i=1 (Xi − µb)3
• κ
b=
Pn
I(Xi ≤ x) b3

Fn (x) = i=1
b n
i=1 (Xi − X̄n )(Yi − Ȳn )
n • ρb = qP qP
n 2 n 2
(X − X̄ ) i=1 (Yi − Ȳn )
(
1 Xi ≤ x i=1 i n
I(Xi ≤ x) =
0 Xi > x
Properties (for any fixed x) 12 Parametric Inference
h i
• E Fbn = F (x)

Let F = f (x; θ) : θ ∈ Θ be a parametric model with parameter space Θ ⊂ Rk
h i F (x)(1 − F (x)) and parameter θ = (θ1 , . . . , θk ).
• V Fbn =
n
F (x)(1 − F (x)) D 12.1 Method of Moments
• mse = →0
n
P j th moment
• Fbn → F (x) Z
αj (θ) = E X j = xj dFX (x)
 
Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X1 , . . . , Xn ∼ F )
 
P sup F (x) − Fn (x) > ε = 2e−2nε
b 2
j th sample moment
x n
1X j
Nonparametric 1 − α confidence band for F α
bj = X
n i=1 i
L(x) = max{Fbn − n , 0}
Method of Moments estimator (MoM)
U (x) = min{Fbn + n , 1}
s   α1 (θ) = α
b1
1 2
= log α2 (θ) = α
b2
2n α
.. ..
.=.
P [L(x) ≤ F (x) ≤ U (x) ∀x] ≥ 1 − α αk (θ) = α
bk
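The empirical CDF and the DKW band L(x), U(x) from §11.3 can be computed directly. A sketch under simulated data (not from the cookbook; assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(5)
n, alpha = 200, 0.05
x_sample = rng.normal(size=n)

def ecdf(t, data):
    """F_hat_n(t) = (1/n) * #{ X_i <= t } evaluated on a grid t."""
    return np.mean(data[:, None] <= t, axis=0)

# DKW half-width: eps_n = sqrt(log(2 / alpha) / (2 n))
eps = np.sqrt(np.log(2 / alpha) / (2 * n))

grid = np.linspace(-3, 3, 7)
F_hat = ecdf(grid, x_sample)
lower = np.maximum(F_hat - eps, 0.0)
upper = np.minimum(F_hat + eps, 1.0)
print(np.c_[grid, lower, F_hat, upper])   # nonparametric 1 - alpha band for F
```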
Properties of the MoM estimator • Equivariance: θbn is the mle =⇒ ϕ(θbn ) is the mle of ϕ(θ)
• θbn exists with probability tending to 1 • Asymptotic optimality (or efficiency), i.e., smallest variance for large sam-
P
• Consistency: θbn → θ ples. If θen is any other estimator, the asymptotic relative efficiency is:
p
• Asymptotic normality: 1. se ≈ 1/In (θ)
√ (θbn − θ) D
D
n(θb − θ) → N (0, Σ) → N (0, 1)
se
  q
where Σ = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T , b ≈ 1/In (θbn )
2. se
∂ −1
g = (g1 , . . . , gk ) and gj = ∂θ αj (θ)
(θbn − θ) D
→ N (0, 1)
se
b
12.2 Maximum Likelihood • Asymptotic optimality
Likelihood: Ln : Θ → [0, ∞) h i
V θbn
n
Y are(θen , θbn ) = h i ≤ 1
Ln (θ) = f (Xi ; θ) V θen
i=1
• Approximately the Bayes estimator
Log-likelihood
n
X 12.2.1 Delta Method
`n (θ) = log Ln (θ) = log f (Xi ; θ)
i=1 b where ϕ is differentiable and ϕ0 (θ) 6= 0:
If τ = ϕ(θ)
Maximum likelihood estimator (mle)
τn − τ ) D
(b
→ N (0, 1)
Ln (θbn ) = sup Ln (θ) se(b
b τ)
θ
where τb = ϕ(θ)
b is the mle of τ and
Score function

s(X; θ) = log f (X; θ) b = ϕ0 (θ)
se se(
b θn )
b b
∂θ
Fisher information
I(θ) = Vθ [s(X; θ)] 12.3 Multiparameter Models
In (θ) = nI(θ) Let θ = (θ1 , . . . , θk ) and θb = (θb1 , . . . , θbk ) be the mle.
Fisher information (exponential family)
∂ 2 `n ∂ 2 `n
  Hjj = Hjk =
∂ ∂θ2 ∂θj ∂θk
I(θ) = Eθ − s(X; θ)
∂θ Fisher information matrix
Observed Fisher information 
Eθ [H11 ] ··· Eθ [H1k ]

n
In (θ) = −  .. .. ..
∂2 X
 
. . .
Inobs (θ) = −

log f (Xi ; θ)
∂θ2 i=1 Eθ [Hk1 ] · · · Eθ [Hkk ]

Properties of the mle Under appropriate regularity conditions


P
• Consistency: θbn → θ (θb − θ) ≈ N (0, Jn )
with Jn (θ) = In−1 . Further, if θbj is the j th component of θ, then • Critical value c
• Test statistic T
(θbj − θj ) D • Rejection region R = {x : T (x) > c}
→ N (0, 1)
se
bj • Power function β(θ) = P [X ∈ R]
h i • Power of a test: 1 − P [Type II error] = 1 − β = inf β(θ)
b 2j = Jn (j, j) and Cov θbj , θbk = Jn (j, k)
where se θ∈Θ1
• Test size: α = P [Type I error] = sup β(θ)
θ∈Θ0
12.3.1 Multiparameter delta method
Let τ = ϕ(θ1 , . . . , θk ) and let the gradient of ϕ be Retain H0 Reject H0


∂ϕ
 H0 true Type
√ I Error (α)
 ∂θ1  H1 true Type II Error (β) (power)
 . 
p-value
 .. 
∇ϕ =  
 ∂ϕ 
∂θk

• p-value = supθ∈Θ0 Pθ [T (X) ≥ T (x)] = inf α : T (x) ∈ Rα
Pθ [T (X ? ) ≥ T (X)]

• p-value = supθ∈Θ0 = inf α : T (X) ∈ Rα
Suppose ∇ϕ θ=θb 6= 0 and τb = ϕ(θ).
b Then, | {z }
1−Fθ (T (X)) since T (X ? )∼Fθ
τ − τ) D
(b
→ N (0, 1)
se(b
b τ)
p-value evidence
where r < 0.01 very strong evidence against H0
T
0.01 − 0.05 strong evidence against H0
  
se(b
b τ) = ∇ϕ
b Jbn ∇ϕ
b
0.05 − 0.1 weak evidence against H0
b and ∇ϕ

b = ∇ϕ b. > 0.1 little or no evidence against H0
and Jbn = Jn (θ) θ=θ
Wald test
12.4 Parametric Bootstrap
• Two-sided test
Sample from f (x; θbn ) instead of from Fbn , where θbn could be the mle or method
of moments estimator. θb − θ0
• Reject H0 when |W | > zα/2 where W =
  se
b
• P |W | > zα/2 → α
13 Hypothesis Testing • p-value = Pθ0 [|W | > |w|] ≈ P [|Z| > |w|] = 2Φ(−|w|)

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1
Likelihood ratio test
Definitions

• Null hypothesis H0 supθ∈Θ Ln (θ) Ln (θbn )


• Alternative hypothesis H1 • T (X) = =
supθ∈Θ0 Ln (θ) Ln (θbn,0 )
• Simple hypothesis θ = θ0 k
• Composite hypothesis θ > θ0 or θ < θ0 iid
D
X
• λ(X) = 2 log T (X) → χ2r−q where Zi2 ∼ χ2k and Z1 , . . . , Zk ∼ N (0, 1)
• Two-sided test: H0 : θ = θ0 versus H1 : θ 6= θ0
 i=1 
• One-sided test: H0 : θ ≤ θ0 versus H1 : θ > θ0 • p-value = Pθ0 [λ(X) > λ(x)] ≈ P χ2r−q > λ(x)
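A sketch tying §12.2 to the Wald test above (not from the cookbook; assumes NumPy/SciPy, simulated Bernoulli data): the mle is p̂ = X̄n, the Fisher information is I(p) = 1/(p(1 − p)), so se_hat = √(1/I_n(p̂)) = √(p̂(1 − p̂)/n), and W = (p̂ − p₀)/se_hat is compared with z_{α/2}.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p_true, p0, alpha = 400, 0.55, 0.5, 0.05

x = rng.binomial(1, p_true, size=n)      # X_1..X_n ~ Bern(p_true)

# Maximum likelihood: for the Bernoulli model the mle is the sample mean
p_hat = x.mean()

# Fisher information I(p) = 1/(p(1-p)), so se_hat = sqrt(1 / I_n(p_hat))
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

# Wald test of H0: p = p0 against H1: p != p0
w = (p_hat - p0) / se_hat
p_value = 2 * stats.norm.cdf(-abs(w))    # p-value = 2 * Phi(-|w|)
print(p_hat, se_hat, w, p_value, abs(w) > stats.norm.ppf(1 - alpha / 2))
```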
Multinomial LRT Natural form
 
X1 Xk
• mle: pbn = ,..., fX (x | η) = h(x) exp {η · T(x) − A(η)}
n n
k
Y  pbj Xj = h(x)g(η) exp {η · T(x)}
Ln (b
pn )
• T (X) = = = h(x)g(η) exp η T T(x)

Ln (p0 ) j=1
p0j
k  
X pbj D
• λ(X) = 2 Xj log → χ2k−1 15 Bayesian Inference
j=1
p 0j

• The approximate size α LRT rejects H0 when λ(X) ≥ χ2k−1,α Bayes’ Theorem
Pearson Chi-square Test f (x | θ)f (θ) f (x | θ)f (θ)
f (θ | x) = =R ∝ Ln (θ)f (θ)
k f (xn ) f (x | θ)f (θ) dθ
X (Xj − E [Xj ])2
• T = where E [Xj ] = np0j under H0
j=1
E [Xj ] Definitions
D
• T → χ2k−1 • X n = (X1 , . . . , Xn )
 
• p-value = P χ2k−1 > T (x) • xn = (x1 , . . . , xn )
D
2
• Faster → Xk−1 than LRT, hence preferable for small n • Prior density f (θ)
• Likelihood f (xn | θ): joint density of the data
Independence testing Yn
In particular, X n iid =⇒ f (xn | θ) = f (xi | θ) = Ln (θ)
• I rows, J columns, X multinomial sample of size n = I ∗ J i=1
X
• mles unconstrained: pbij = nij • Posterior density f (θ | xn )
X
• Normalizing constant cn = f (xn ) = f (x | θ)f (θ) dθ
R
• mles under H0 : pb0ij = pbi· pb·j = Xni· n·j
• Kernel: part of a density that dependsRon θ
 
PI PJ nX
• LRT: λ = 2 i=1 j=1 Xij log Xi· Xij·j θL (θ)f (θ)dθ
• Posterior mean θ̄n = θf (θ | xn ) dθ = R Lnn(θ)f (θ) dθ
R
PI PJ (X −E[X ])2
• PearsonChiSq: T = i=1 j=1 ijE[Xij ]ij
D
• LRT and Pearson → χ2ν , where ν = (I − 1)(J − 1)
Posterior interval
14 Exponential Family Z b
P [θ ∈ (a, b) | xn ] = f (θ | xn ) dθ = 1 − α
Scalar parameter a

fX (x | θ) = h(x) exp {η(θ)T (x) − A(θ)} Equal-tail credible interval


= h(x)g(θ) exp {η(θ)T (x)} Z a Z ∞
f (θ | xn ) dθ = f (θ | xn ) dθ = α/2
Vector parameter −∞ b

Highest posterior density (HPD) region Rn


( s
)
X
fX (x | θ) = h(x) exp ηi (θ)Ti (x) − A(θ)
i=1 1. P [θ ∈ Rn ] = 1 − α
= h(x) exp {η(θ) · T (x) − A(θ)} 2. Rn = {θ : f (θ | xn ) > k} for some k
= h(x)g(θ) exp {η(θ) · T (x)} Rn is unimodal =⇒ Rn is an interval
15.2 Function of parameters 15.3.1 Conjugate Priors
Continuous likelihood (subscript c denotes constant)
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ }.
Likelihood Conjugate prior Posterior hyperparameters
Posterior CDF for τ 
Unif (0, θ) Pareto(xm , k) max x(n) , xm , k + n
Z Xn
n n n
H(r | x ) = P [ϕ(θ) ≤ τ | x ] = f (θ | x ) dθ Exp (λ) Gamma (α, β) α + n, β + xi
A
i=1
 Pn   
µ0 i=1 xi 1 n
2
 2

Posterior density N µ, σc N µ0 , σ0 + / + 2 ,
σ2 σ2 σ02 σc
 0 c−1
1 n
h(τ | xn ) = H 0 (τ | xn ) + 2
σ02 σc
Pn
 νσ02 + i=1 (xi − µ)2
Bayesian delta method N µc , σ 2 Scaled Inverse Chi- ν + n,
ν+n
square(ν, σ02 )

νλ + nx̄ n

τ | X n ≈ N ϕ(θ),
b seb ϕ0 (θ)

N µ, σ 2
b
Normal- , ν + n, α + ,
ν+n 2
scaled Inverse n 2
1X γ(x̄ − λ)
Gamma(λ, ν, α, β) β+ (xi − x̄)2 +
2 i=1 2(n + γ)
15.3 Priors −1
Σ−1 −1
Σ−1 −1

MVN(µ, Σc ) MVN(µ0 , Σ0 ) 0 + nΣc 0 µ0 + nΣ x̄ ,
−1 −1
Σ−1

Choice 0 + nΣc
Xn
MVN(µc , Σ) Inverse- n + κ, Ψ + (xi − µc )(xi − µc )T
• Subjective Bayesianism: prior should incorporate as much detail as possible Wishart(κ, Ψ) i=1
the researcher’s a priori knowledge—via prior elicitation n
X xi
• Objective Bayesianism: prior should incorporate as little detail as possible Pareto(xmc , k) Gamma (α, β) α + n, β + log
x mc
(non-informative prior) i=1
Pareto(xm , kc ) Pareto(x0 , k0 ) x0 , k0 − kn where k0 > kn
• Robust Bayesianism: consider various priors and determine sensitivity of Xn
our inferences to changes in the prior Gamma (αc , β) Gamma (α0 , β0 ) α0 + nαc , β0 + xi
i=1

Types

• Flat: f (θ) ∝ constant


R∞
• Proper: −∞ f (θ) dθ = 1
R∞
• Improper: −∞ f (θ) dθ = ∞
• Jeffrey’s Prior (transformation-invariant):

p p
f (θ) ∝ I(θ) f (θ) ∝ det(I(θ))

• Conjugate: f (θ) and f (θ | xn ) belong to the same parametric family


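A conjugate update can be carried out directly from the tables in §15.3.1. For the Bernoulli likelihood with a Beta prior (listed in the discrete-likelihood table below), the posterior is Beta(α + Σxᵢ, β + n − Σxᵢ). An illustrative sketch (not from the cookbook; assumes NumPy/SciPy, simulated data and arbitrary hyperparameters), including an equal-tail credible interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Prior Beta(alpha0, beta0) on p, data x_1..x_n ~ Bern(p)
alpha0, beta0 = 2.0, 2.0
x = rng.binomial(1, 0.7, size=50)

# Conjugate update: posterior is Beta(alpha0 + sum(x), beta0 + n - sum(x))
alpha_n = alpha0 + x.sum()
beta_n = beta0 + len(x) - x.sum()
posterior = stats.beta(alpha_n, beta_n)

# Posterior mean and a 95% equal-tail credible interval
print(posterior.mean(), posterior.interval(0.95))
```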
Discrete likelihood Bayes factor
Likelihood Conjugate prior Posterior hyperparameters log10 BF10 BF10 evidence
n n 0 − 0.5 1 − 1.5 Weak
0.5 − 1 1.5 − 10 Moderate
X X
Bern (p) Beta (α, β) α+ xi , β + n − xi
i=1 i=1
1−2 10 − 100 Strong
Xn n
X n
X >2 > 100 Decisive
Bin (p) Beta (α, β) α+ xi , β + Ni − xi
p
i=1 i=1 i=1 1−p BF10
n
X p∗ = p where p = P [H1 ] and p∗ = P [H1 | xn ]
NBin (p) Beta (α, β) α + rn, β + xi 1 + 1−p BF10
i=1
n
16 Sampling Methods
X
Po (λ) Gamma (α, β) α+ xi , β + n
i=1
n
X 16.1 Inverse Transform Sampling
Multinomial(p) Dir (α) α+ x(i)
i=1 Setup
n
X
Geo (p) Beta (α, β) α + n, β + xi • U ∼ Unif (0, 1)
i=1 • X∼F
• F −1 (u) = inf{x | F (x) ≥ u}
15.4 Bayesian Testing Algorithm
1. Generate u ∼ Unif (0, 1)
If H0 : θ ∈ Θ0 :
2. Compute x = F −1 (u)
Z
Prior probability P [H0 ] = f (θ) dθ
Θ0 16.2 The Bootstrap
Z
Posterior probability P [H0 | xn ] = f (θ | xn ) dθ Let Tn = g(X1 , . . . , Xn ) be a statistic.
Θ0
1. Estimate VF [Tn ] with VFbn [Tn ].
2. Approximate VFbn [Tn ] using simulation:
∗ ∗
Let H0 . . .Hk−1 be k hypotheses. Suppose θ ∼ f (θ | Hk ), (a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from
the sampling distribution implied by Fn b
f (xn | Hk )P [Hk ] i. Sample uniformly X1∗ , . . . , Xn∗ ∼ Fbn .
P [Hk | xn ] = PK ,
n
k=1 f (x | Hk )P [Hk ] ii. Compute Tn∗ = g(X1∗ , . . . , Xn∗ ).
(b) Then
Marginal likelihood B B
!2
1 X ∗ 1 X ∗
vboot = VFbn =
b Tn,b − T
B B r=1 n,r
Z
n
f (x | Hi ) = f (xn | θ, Hi )f (θ | Hi ) dθ b=1
Θ
16.2.1 Bootstrap Confidence Intervals
Posterior odds (of Hi relative to Hj )
Normal-based interval
n
P [Hi | x ] n
f (x | Hi ) P [Hi ] Tn ± zα/2 se
b boot
= ×
P [Hj | xn ] f (xn | Hj ) P [Hj ] Pivotal interval
| {z } | {z }
Bayes Factor BFij prior odds 1. Location parameter θ = T (F )
18
2. Pivot Rn = θbn − θ 2. Generate u ∼ Unif (0, 1)
3. Let H(r) = P [Rn ≤ r] be the cdf of Rn Ln (θcand )
∗ ∗
3. Accept θcand if u ≤
4. Let Rn,b = θbn,b − θbn . Approximate H using bootstrap: Ln (θbn )
B
1 X ∗ 16.4 Importance Sampling
H(r)
b = I(Rn,b ≤ r)
B Sample from an importance function g rather than target density h.
b=1
Algorithm to obtain an approximation to E [q(θ) | xn ]:
5. θβ∗ = β sample quantile of (θbn,1
∗ ∗
, . . . , θbn,B ) iid
1. Sample from the prior θ1 , . . . , θn ∼ f (θ)
6. rβ∗ = beta sample quantile of (Rn,1
∗ ∗
, . . . , Rn,B ), i.e., rβ∗ = θβ∗ − θbn
Ln (θi )
2. wi = PB ∀i = 1, . . . , B
 
7. Approximate 1 − α confidence interval Cn = â, b̂ where
i=1 Ln (θi )
PB
3. E [q(θ) | xn ] ≈ i=1 q(θi )wi
b −1 1 − α =
 
∗ ∗
â = θbn − H θbn − r1−α/2 = 2θbn − θ1−α/2
2

b̂ = θbn − Hb −1
2
= ∗
θbn − rα/2 = ∗
2θbn − θα/2 17 Decision Theory
Percentile interval   Definitions
∗ ∗
Cn = θα/2 , θ1−α/2 • Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous for an estimator θb
16.3 Rejection Sampling • Action a ∈ A: possible value of the decision rule. In the estimation
context, the action is just an estimate of θ, θ(x).
b
Setup
• Loss function L: consequences of taking action a when true state is θ or
• We can easily sample from g(θ) discrepancy between θ and θ, b L : Θ × A → [−k, ∞).
• We want to sample from h(θ), but it is difficult Loss functions
k(θ)
• We know h(θ) up to a proportional constant: h(θ) = R • Squared error loss: L(θ, a) = (θ − a)2
k(θ) dθ (
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ K1 (θ − a) a − θ < 0
• Linear loss: L(θ, a) =
K2 (a − θ) a − θ ≥ 0
Algorithm
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K1 = K2 )
1. Draw θcand ∼ g(θ) • Lp loss: L(θ, a) = |θ − a|p
2. Generate u ∼ Unif (0, 1)
(
0 a=θ
k(θcand ) • Zero-one loss: L(θ, a) =
3. Accept θcand if u ≤ 1 a 6= θ
M g(θcand )
4. Repeat until B values of θcand have been accepted
17.1 Risk
Example
Posterior risk
• We can easily sample from the prior g(θ) = f (θ)
Z h i
r(θb | x) = L(θ, θ(x))f
b (θ | x) dθ = Eθ|X L(θ, θ(x))
b
• Target is the posterior h(θ) ∝ k(θ) = f (xn | θ)f (θ)
• Envelope condition: f (xn | θ) ≤ f (xn | θbn ) = Ln (θbn ) ≡ M (Frequentist) risk
• Algorithm Z h i
1. Draw θ cand
∼ f (θ) R(θ, θ)
b = L(θ, θ(x))f
b (x | θ) dx = EX|θ L(θ, θ(X))
b
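The bootstrap of §16.2 and the normal-based interval of §16.2.1 can be sketched as follows (not part of the cookbook; assumes NumPy/SciPy, simulated data, statistic chosen arbitrarily as the sample median):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.exponential(scale=2.0, size=100)     # observed data
B = 2000

# Resample with replacement from F_hat_n and recompute the statistic each time
t_star = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                   for _ in range(B)])

t_n = np.median(x)
se_boot = t_star.std(ddof=1)                 # bootstrap standard error

z = stats.norm.ppf(0.975)
print(t_n, se_boot, (t_n - z * se_boot, t_n + z * se_boot))   # normal-based 95% CI
```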
Bayes risk 18 Linear Regression
ZZ
Definitions
h i
r(f, θ)
b = L(θ, θ(x))f
b (x, θ) dx dθ = Eθ,X L(θ, θ(X))
b
• Response variable Y
• Covariate X (aka predictor variable or feature)
h h ii h i
r(f, θ)
b = Eθ EX|θ L(θ, θ(X)
b = Eθ R(θ, θ)
b

18.1 Simple Linear Regression


h h ii h i
r(f, θ)
b = EX Eθ|X L(θ, θ(X)
b = EX r(θb | X)
Model
17.2 Admissibility Yi = β0 + β1 Xi + i E [i | Xi ] = 0, V [i | Xi ] = σ 2
Fitted line
• θb0 dominates θb if
b0 rb(x) = βb0 + βb1 x
∀θ : R(θ, θ ) ≤ R(θ, θ)
b
Predicted (fitted) values
∃θ : R(θ, θb0 ) < R(θ, θ)
b Ybi = rb(Xi )
• θb is inadmissible if there is at least one other estimator θb0 that dominates Residuals  
it. Otherwise it is called admissible. ˆi = Yi − Ybi = Yi − βb0 + βb1 Xi

Residual sums of squares (rss)


17.3 Bayes Rule
n
X
Bayes rule (or Bayes estimator) rss(βb0 , βb1 ) = ˆ2i
i=1
• r(f, θ)
b = inf e r(f, θ)
θ
e
R Least square estimates
• θ(x)
b = inf r(θb | x) ∀x =⇒ r(f, θ)
b = r(θb | x)f (x) dx
βbT = (βb0 , βb1 )T : min rss
β
b0 ,β
b1
Theorems

• Squared error loss: posterior mean βb0 = Ȳn − βb1 X̄n


Pn Pn
• Absolute error loss: posterior median i=1 (Xi − X̄n )(Yi − Ȳn ) i=1 Xi Yi − nX̄Y
β1 =
b Pn = P n
• Zero-one loss: posterior mode i=1 (Xi − X̄n )
2 2 2
i=1 Xi − nX
 
β0
h i
E βb | X n =
17.4 Minimax Rules β1
σ 2 n−1 ni=1 Xi2 −X n
h i  P 
Maximum risk V βb | X n = 2
R̄(θ)
b = sup R(θ, θ)
b R̄(a) = sup R(θ, a) nsX −X n 1
θ θ r Pn
2
σ i=1 Xi

b
Minimax rule se(
b βb0 ) =
sX n n
sup R(θ, θ)
b = inf R̄(θ)
e = inf sup R(θ, θ)
e
θ θe θe θ σ

b
se(
b βb1 ) =
sX n
θb = Bayes rule ∧ ∃c : R(θ, θ)
b =c Pn Pn 2
where s2X = n−1 i=1 (Xi − X n )2 and σ b2 = n−21
i=1 
ˆi (unbiased estimate).
Least favorable prior Further properties:
P P
θbf = Bayes rule ∧ R(θ, θbf ) ≤ r(f, θbf ) ∀θ • Consistency: βb0 → β0 and βb1 → β1
• Asymptotic normality: 18.3 Multiple Regression
βb0 − β0 D βb1 − β1 D Y = Xβ + 
→ N (0, 1) and → N (0, 1)
se(
b βb0 ) se(
b βb1 )
where
• Approximate 1 − α confidence intervals for β0 and β1 :      
X11 ··· X1k β1 1
 .. ..  β =  ... 
..  .. 
βb0 ± zα/2 se( and βb1 ± zα/2 se( X= . =.
 
b βb0 ) b βb1 ) . . 
Xn1 ··· Xnk βk n
• Wald test for H0 : β1 = 0 vs. H1 : β1 6= 0: reject H0 if |W | > zα/2 where
W = βb1 /se(
b βb1 ). Likelihood
 
1
R2 L(µ, Σ) = (2πσ 2 )−n/2 exp − 2 rss
Pn b 2
Pn 2 2σ
i=1 (Yi − Y ) ˆ rss
2
R = Pn 2
= 1 − Pn i=1 i 2 = 1 −
i=1 (Yi − Y ) i=1 (Yi − Y )
tss
N
X
Likelihood rss = (y − Xβ)T (y − Xβ) = kY − Xβk2 = (Yi − xTi β)2
n n n i=1
Y Y Y
L= f (Xi , Yi ) = fX (Xi ) × fY |X (Yi | Xi ) = L1 × L2
i=1 i=1 i=1 If the (k × k) matrix X T X is invertible,
Yn
L1 = fX (Xi ) βb = (X T X)−1 X T Y
i=1 h i
V βb | X n = σ 2 (X T X)−1
n
( )
Y 1 X 2
−n
L2 = fY |X (Yi | Xi ) ∝ σ exp − 2 Yi − (β0 − β1 Xi )
2σ i βb ≈ N β, σ 2 (X T X)−1

i=1

Under the assumption of Normality, the least squares estimator is also the mle
Estimate regression function
but the least squares variance estimator is not the mle.
n k
1X 2 X
b2 =
σ ˆ rb(x) = βbj xj
n i=1 i j=1

18.2 Prediction Unbiased estimate for σ 2


Observe X = x∗ of the covariate and want to predict their outcome Y∗ . n
1 X 2
b2 =
σ ˆ ˆ = X βb − Y
Yb∗ = βb0 + βb1 x∗ n − k i=1 i
h i h i h i h i
V Yb∗ = V βb0 + x2∗ V βb1 + 2x∗ Cov βb0 , βb1 mle
n−k 2
Prediction interval µ
b = X̄ b2 =
σ σ
 Pn 2
 n
2 2 i=1 (Xi − X∗ )
ξn = σ
b P +1
n i (Xi − X̄)2 j
b
1 − α Confidence interval
Yb∗ ± zα/2 ξbn βbj ± zα/2 se(
b βbj )
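The least-squares estimator β̂ = (XᵀX)⁻¹XᵀY and the unbiased variance estimate σ̂² = rss/(n − k) from §18.1–18.3 can be computed in a few lines. A sketch with synthetic data (not from the cookbook; assumes NumPy, true coefficients chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=n)    # true beta0 = 1.5, beta1 = 0.8

# Design matrix with an intercept column; beta_hat = (X^T X)^{-1} X^T Y
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = y - X @ beta_hat
k = X.shape[1]
sigma2_hat = resid @ resid / (n - k)              # unbiased estimate of sigma^2

# Estimated covariance of beta_hat: sigma2_hat * (X^T X)^{-1}
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
print(beta_hat, np.sqrt(np.diag(cov_beta)))       # estimates and standard errors
```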
18.4 Model Selection Akaike Information Criterion (AIC)
Consider predicting a new observation Y ∗ for covariates X ∗ and let S ⊂ J
denote a subset of the covariates in the model, where |S| = k and |J| = n. bS2 ) − k
AIC(S) = `n (βbS , σ
Issues
Bayesian Information Criterion (BIC)
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance k
bS2 ) −
BIC(S) = `n (βbS , σ log n
Procedure 2

1. Assign a score to each model Validation and training


2. Search through all models to find the one with the highest score
m
X n n
Hypothesis testing R
bV (S) = (Ybi∗ (S) − Yi∗ )2 m = |{validation data}|, often or
i=1
4 2
H0 : βj = 0 vs. H1 : βj 6= 0 ∀j ∈ J
Leave-one-out cross-validation
Mean squared prediction error (mspe)
n n
!2
h i X X Yi − Ybi (S)
mspe = E (Yb (S) − Y ∗ )2 R
bCV (S) = (Yi − Yb(i) )2 =
i=1 i=1
1 − Uii (S)
Prediction risk
n n h i
U (S) = XS (XST XS )−1 XS (“hat matrix”)
X X
R(S) = mspei = E (Ybi (S) − Yi∗ )2
i=1 i=1

Training error
n
R
btr (S) =
X
(Ybi (S) − Yi )2
19 Non-parametric Function Estimation
i=1
2 19.1 Density Estimation
R Pn b 2
R i=1 (Yi (S) − Y )
rss(S) btr (S) R
R2 (S) = 1 − =1− =1− Estimate f (x), where f (x) = P [X ∈ A] = A
f (x) dx.
P n 2
i=1 (Yi − Y )
tss tss Integrated square error (ise)
The training error is a downward-biased estimate of the prediction risk. Z  2 Z
h i L(f, fbn ) = f (x) − fn (x) dx = J(h) + f 2 (x) dx
b
E R btr (S) < R(S)

h i n
X h i Frequentist risk
bias(Rtr (S)) = E Rtr (S) − R(S) = −2
b b Cov Ybi , Yi
i=1
h i Z Z
R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx
Adjusted R2
n − 1 rss
R2 (S) = 1 −
n − k tss h i
Mallow’s Cp statistic b(x) = E fbn (x) − f (x)
h i
R(S)
b =R σ 2 = lack of fit + complexity penalty
btr (S) + 2kb v(x) = V fbn (x)
19.1.1 Histograms KDE
n  
Definitions 1X1 x − Xi
fbn (x) = K
n i=1 h h
• Number of bins m
Z Z
1 4 00 2 1
1 R(f, fn ) ≈ (hσK )
b (f (x)) dx + K 2 (x) dx
• Binwidth h = m 4 nh
• Bin Bj has νj observations c
−2/5 −1/5 −1/5
c2 c3
Z Z
h∗ = 1 c = σ 2
, c = K 2
(x) dx, c = (f 00 (x))2 dx
R
• Define pbj = νj /n and pj = Bj f (u) du n1/5
1 K 2 3

Z 4/5 Z 1/5
∗ c4 5 2 2/5 2 00 2
Histogram estimator R (f, fn ) = 4/5
b c4 = (σK ) K (x) dx (f ) dx
n 4
| {z }
m C(K)
X pbj
fbn (x) = I(x ∈ Bj )
j=1
h Epanechnikov Kernel
h i pj
E fbn (x) = (
3

h √
4 5(1−x2 /5)
|x| < 5
h i p (1 − p ) K(x) =
j j
V fbn (x) = 0 otherwise
nh2
h2
Z
2 1
R(fbn , f ) ≈ (f 0 (u)) du + Cross-validation estimate of E [J(h)]
12 nh
!1/3
1 6 n n n  
1 X X ∗ Xi − Xj
Z

h = 1/3 R 2Xb 2
2 du JbCV (h) = fbn2 (x) dx − f(−i) (Xi ) ≈ K + K(0)
n (f 0 (u)) n i=1 hn2 i=1 j=1 h nh
 2/3 Z 1/3
∗ b C 3 0 2
R (fn , f ) ≈ 2/3 C= (f (u)) du
n 4 Z
K ∗ (x) = K (2) (x) − 2K(x) K (2) (x) = K(x − y)K(y) dy
Cross-validation estimate of E [J(h)]

Z
2Xb
n
2 n+1 X 2
m 19.2 Non-parametric Regression
JbCV (h) = fbn2 (x) dx − f(−i) (Xi ) = − pb
n i=1 (n − 1)h (n − 1)h j=1 j Estimate f (x) where f (x) = E [Y | X = x]. Consider pairs of points
(x1 , Y1 ), . . . , (xn , Yn ) related by

Yi = r(xi ) + i
19.1.2 Kernel Density Estimator (KDE)
E [i ] = 0
Kernel K V [i ] = σ 2

• K(x) ≥ 0 k-nearest Neighbor Estimator


R
• K(x) dx = 1

R
xK(x) dx = 0 1 X
rb(x) = Yi where Nk (x) = {k values of x1 , . . . , xn closest to x}

R 2 2
x K(x) dx ≡ σK >0 k
i:xi ∈Nk (x)
Nadaraya-Watson Kernel Estimator 20 Stochastic Processes
n
X
rb(x) = wi (x)Yi Stochastic Process
i=1 (
x−xi

K {0, ±1, . . . } = Z discrete
wi (x) = h ∈ [0, 1] {Xt : t ∈ T } T =
[0, ∞)

Pn
K
x−xj continuous
j=1 h
4 Z  2
h4 f 0 (x)
Z
2 2 00 0 • Notations Xt , X(t)
R(brn , r) ≈ x K (x) dx r (x) + 2r (x) dx
4 f (x) • State space X
Z 2R 2
σ K (x) dx • Index set T
+ dx
nhf (x)
c1
h∗ ≈ 1/5 20.1 Markov Chains
n
c2
R∗ (b
rn , r) ≈ 4/5 Markov chain
n

P [Xn = x | X0 , . . . , Xn−1 ] = P [Xn = x | Xn−1 ] ∀n ∈ T, x ∈ X


Cross-validation estimate of E [J(h)]
n
X n
X (Yi − rb(xi ))2 Transition probabilities
JbCV (h) = (Yi − rb(−i) (xi ))2 = !2
i=1 i=1 K(0) pij ≡ P [Xn+1 = j | Xn = i]
1− Pn  x−x 
j
K
j=1 h pij (n) ≡ P [Xm+n = j | Xm = i] n-step

19.3 Smoothing Using Orthogonal Functions Transition matrix P (n-step: Pn )


Approximation
∞ J • (i, j) element is pij
X X
r(x) = βj φj (x) ≈ βj φj (x) • pij > 0
P
j=1 j=1 • i pij = 1
Multivariate regression
Y = Φβ + η Chapman-Kolmogorov
 
φ0 (x1 ) ··· φJ (x1 ) X
 .. .. ..  pij (m + n) = pij (m)pkj (n)
where ηi = i and Φ =  . . .  k
φ0 (xn ) · · · φJ (xn )
Least squares estimator Pm+n = Pm Pn
βb = (ΦT Φ)−1 ΦT Y
Pn = P × · · · × P = Pn
1
≈ ΦT Y (for equally spaced observations only)
n Marginal probability
Cross-validation estimate of E [J(h)]
 2 µn = (µn (1), . . . , µn (N )) where µi (i) = P [Xn = i]
n J
R
bCV (J) =
X
Yi −
X
φj (xi )βbj,(−i)  µ0 , initial distribution
i=1 j=1 µn = µ0 Pn
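The Chapman-Kolmogorov relation P_{m+n} = P_m P_n and the marginal µ_n = µ_0 Pⁿ from §20.1 amount to matrix powers. A sketch with a hypothetical two-state chain (the transition probabilities are made up; assumes NumPy):

```python
import numpy as np

# A hypothetical two-state chain (values are illustrative, not from the text)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])        # start in state 0 with probability 1

# n-step transition matrix P_n = P^n and marginal mu_n = mu0 P^n
Pn = np.linalg.matrix_power(P, 10)
mu_n = mu0 @ Pn
print(Pn)
print(mu_n)                        # approaches the stationary distribution (0.8, 0.2)
```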
20.2 Poisson Processes Autocorrelation function (ACF)
Poisson process
Cov [xs , xt ] γ(s, t)
ρ(s, t) = p =p
• {Xt : t ∈ [0, ∞)} = number of events up to and including time t V [xs ] V [xt ] γ(s, s)γ(t, t)
• X0 = 0
• Independent increments: Cross-covariance function (CCV)
∀t0 < · · · < tn : Xt1 − Xt0 ⊥
⊥ · · · ⊥⊥ Xtn − Xtn−1
γxy (s, t) = E [(xs − µxs )(yt − µyt )]
• Intensity function λ(t)
– P [Xt+h − Xt = 1] = λ(t)h + o(h) Cross-correlation function (CCF)
– P [Xt+h − Xt = 2] = o(h)
γxy (s, t)
• Xs+t − Xs ∼ Po (m(s + t) − m(s)) where m(t) =
Rt
λ(s) ds ρxy (s, t) = p
0 γx (s, s)γy (t, t)
Homogeneous Poisson process
Backshift operator
λ(t) ≡ λ =⇒ Xt ∼ Po (λt) λ>0
B k (xt ) = xt−k
Waiting times
Wt := time at which Xt occurs Difference operator
 
1 ∇d = (1 − B)d
Wt ∼ Gamma t,
λ
Interarrival times White noise
St = Wt+1 − Wt
2
 
1 • wt ∼ wn(0, σw )
St ∼ Exp iid 2

λ • Gaussian: wt ∼ N 0, σw
• E [wt ] = 0 t ∈ T
St • V [wt ] = σ 2 t ∈ T
• γw (s, t) = 0 s 6= t ∧ s, t ∈ T
Wt−1 Wt t

Random walk
21 Time Series
• Drift δ
Pt
Mean function Z ∞
• xt = δt + j=1 wj
µxt = E [xt ] = xft (x) dx • E [xt ] = δt
−∞

Autocovariance function Symmetric moving average

γx (s, t) = E [(xs − µs )(xt − µt )] = E [xs xt ] − µs µt k


X k
X
mt = aj xt−j where aj = a−j ≥ 0 and aj = 1
γx (t, t) = E (xt − µt )2 = V [xt ]
 
j=−k j=−k
21.1 Stationary Time Series Sample variance
n  
Strictly stationary 1 X |h|
V [x̄] = 1− γx (h)
n n
P [xt1 ≤ c1 , . . . , xtk ≤ ck ] = P [xt1 +h ≤ c1 , . . . , xtk +h ≤ ck ] h=−n

∀k ∈ N, tk , ck , h ∈ Z Sample autocovariance function

Weakly stationary n−h


1 X
  γ
b(h) = (xt+h − x̄)(xt − x̄)
• E x2t < ∞ ∀t ∈ Z n t=1
 2
• E xt = m ∀t ∈ Z
• γx (s, t) = γx (s + r, t + r) ∀r, s, t ∈ Z Sample autocorrelation function
Autocovariance function
γ
b(h)
ρb(h) =
• γ(h) = E [(xt+h − µ)(xt − µ)] ∀h ∈ Z γ
b(0)
 
• γ(0) = E (xt − µ)2
• γ(0) ≥ 0 Sample cross-variance function
• γ(0) ≥ |γ(h)|
n−h
• γ(h) = γ(−h) 1 X
γ
bxy (h) = (xt+h − x̄)(yt − y)
n t=1
Autocorrelation function (ACF)

Cov [xt+h , xt ] γ(t + h, t) γ(h) Sample cross-correlation function


ρx (h) = p =p =
V [xt+h ] V [xt ] γ(t + h, t + h)γ(t, t) γ(0)
γ
bxy (h)
Jointly stationary time series ρbxy (h) = p
γbx (0)b
γy (0)
γxy (h) = E [(xt+h − µx )(yt − µy )]
Properties
γxy (h)
ρxy (h) = p 1
γx (0)γy (h) • σρbx (h) = √ if xt is white noise
n
Linear process 1
• σρbxy (h) = √ if xt or yt is white noise

X ∞
X n
xt = µ + ψj wt−j where |ψj | < ∞
j=−∞ j=−∞


21.3 Non-Stationary Time Series
X
2
γ(h) = σw ψj+h ψj Classical decomposition model
j=−∞

xt = µt + st + wt
21.2 Estimation of Correlation
Sample mean • µt = trend
n
1X • st = seasonal component
x̄ = xt
n t=1 • wt = random noise term
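The sample autocovariance γ̂(h) and autocorrelation ρ̂(h) defined in §21.2 are straightforward to compute; for white noise, ρ̂(h) at nonzero lags should lie within roughly ±2/√n. A sketch (not from the cookbook; assumes NumPy, simulated Gaussian white noise):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000
x = rng.normal(size=n)            # Gaussian white noise

def sample_acf(x, max_lag):
    """rho_hat(h) = gamma_hat(h) / gamma_hat(0), with the 1/n convention."""
    xbar = x.mean()
    gamma0 = np.sum((x - xbar) ** 2) / len(x)
    acf = []
    for h in range(max_lag + 1):
        gamma_h = np.sum((x[h:] - xbar) * (x[:len(x) - h] - xbar)) / len(x)
        acf.append(gamma_h / gamma0)
    return np.array(acf)

rho_hat = sample_acf(x, 5)
print(rho_hat)                     # 1 at lag 0, roughly within 2/sqrt(n) of 0 otherwise
print(2 / np.sqrt(n))
```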
21.3.1 Detrending Moving average polynomial
Least squares θ(z) = 1 + θ1 z + · · · + θq zq z ∈ C ∧ θq 6= 0
2
1. Choose trend model, e.g., µt = β0 + β1 t + β2 t
Moving average operator
2. Minimize rss to obtain trend estimate µ bt = βb0 + βb1 t + βb2 t2
3. Residuals , noise wt θ(B) = 1 + θ1 B + · · · + θp B p
Moving average MA (q) (moving average model order q)
1
• The low-pass filter vt is a symmetric moving average mt with aj = 2k+1 : xt = wt + θ1 wt−1 + · · · + θq wt−q ⇐⇒ xt = θ(B)wt
k q
1 X X
vt = xt−1 E [xt ] = θj E [wt−j ] = 0
2k + 1
i=−k j=0
Pk ( Pq−h
1 2
• If 2k+1 i=−k wt−j ≈ 0, a linear trend function µt = β0 + β1 t passes
σw j=0 θj θj+h 0≤h≤q
γ(h) = Cov [xt+h , xt ] =
without distortion 0 h>q
Differencing MA (1)
xt = wt + θwt−1
• µt = β0 + β1 t =⇒ ∇xt = β1 
2 2
(1 + θ )σw h = 0

2
21.4 ARIMA models γ(h) = θσw h=1

0 h>1

Autoregressive polynomial
(
θ
φ(z) = 1 − φ1 z − · · · − φp zp z ∈ C ∧ φp 6= 0 2 h=1
ρ(h) = (1+θ )
0 h>1
Autoregressive operator
ARMA (p, q)
φ(B) = 1 − φ1 B − · · · − φp B p
xt = φ1 xt−1 + · · · + φp xt−p + wt + θ1 wt−1 + · · · + θq wt−q
Autoregressive model order p, AR (p)
φ(B)xt = θ(B)wt
xt = φ1 xt−1 + · · · + φp xt−p + wt ⇐⇒ φ(B)xt = wt
Partial autocorrelation function (PACF)
AR (1) • xih−1 , regression of xi on {xh−1 , xh−2 , . . . , x1 }
k−1 ∞ • φhh = corr(xh − xh−1
h , x0 − xh−1
0 ) h≥2
X k→∞,|φ|<1 X
• xt = φk (xt−k ) + φj (wt−j ) = φj (wt−j ) • E.g., φ11 = corr(x1 , x0 ) = ρ(1)
j=0 j=0
| {z } ARIMA (p, d, q)
linear process
P∞ j
∇d xt = (1 − B)d xt is ARMA (p, q)
• E [xt ] = j=0 φ (E [wt−j ]) = 0
2 h
σw φ φ(B)(1 − B)d xt = θ(B)wt
• γ(h) = Cov [xt+h , xt ] = 1−φ2
γ(h) Exponentially Weighted Moving Average (EWMA)
• ρ(h) = γ(0) = φh
• ρ(h) = φρ(h − 1) h = 1, 2, . . . xt = xt−1 + wt − λwt−1

X • Frequency index ω (cycles per unit time), period 1/ω
xt = (1 − λ)λj−1 xt−j + wt when |λ| < 1
j=1
• Amplitude A
• Phase φ
x̃n+1 = (1 − λ)xn + λx̃n
• U1 = A cos φ and U2 = A sin φ often normally distributed rv’s
Seasonal ARIMA
Periodic mixture
• Denoted by ARIMA (p, d, q) × (P, D, Q)s
q
• ΦP (B s )φ(B)∇D d s
s ∇ xt = δ + ΘQ (B )θ(B)wt X
xt = (Uk1 cos(2πωk t) + Uk2 sin(2πωk t))
k=1
21.4.1 Causality and Invertibility
P∞ • Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rv’s with variances σk2
ARMA (p, q) is causal (future-independent) ⇐⇒ ∃{ψj } : j=0 ψj < ∞ such that Pq
• γ(h) = k=1 σk2 cos(2πωk h)
  Pq

X • γ(0) = E x2t = k=1 σk2
xt = wt−j = ψ(B)wt
j=0 Spectral representation of a periodic process
P∞
ARMA (p, q) is invertible ⇐⇒ ∃{πj } : j=0 πj < ∞ such that γ(h) = σ 2 cos(2πω0 h)
∞ σ 2 −2πiω0 h σ 2 2πiω0 h
X = e + e
π(B)xt = Xt−j = wt 2 2
Z 1/2
j=0
= e2πiωh dF (ω)
Properties −1/2

• ARMA (p, q) causal ⇐⇒ roots of φ(z) lie outside the unit circle Spectral distribution function


X θ(z)
j 0
 ω < −ω0
ψ(z) = ψj z = |z| ≤ 1
φ(z) F (ω) = σ 2 /2 −ω ≤ ω < ω0
j=0 
 2
σ ω ≥ ω0
• ARMA (p, q) invertible ⇐⇒ roots of θ(z) lie outside the unit circle
• F (−∞) = F (−1/2) = 0

X φ(z) • F (∞) = F (1/2) = γ(0)
π(z) = πj z j = |z| ≤ 1
j=0
θ(z)
Spectral density
Behavior of the ACF and PACF for causal and invertible ARMA models ∞
X 1 1
AR (p) MA (q) ARMA (p, q) f (ω) = γ(h)e−2πiωh − ≤ω≤
2 2
h=−∞
ACF tails off cuts off after lag q tails off
PACF cuts off after lag p tails off q tails off P∞ R 1/2
• Needs h=−∞ |γ(h)| < ∞ =⇒ γ(h) = −1/2
e2πiωh f (ω) dω h = 0, ±1, . . .
21.5 Spectral Analysis • f (ω) ≥ 0
• f (ω) = f (−ω)
Periodic process • f (ω) = f (1 − ω)
R 1/2
xt = A cos(2πωt + φ) • γ(0) = V [xt ] = −1/2 f (ω) dω
2
= U1 cos(2πωt) + U2 sin(2πωt) • White noise: fw (ω) = σw
• ARMA (p, q) , φ(B)xt = θ(B)wt : 22.2 Beta Function
Z 1
Γ(x)Γ(y)
|θ(e−2πiω )|2
2 • Ordinary: B(x, y) = B(y, x) = tx−1 (1 − t)y−1 dt =
fx (ω) = σw 0 Γ(x + y)
|φ(e−2πiω )|2 Z x
a−1 b−1
Pp Pq • Incomplete: B(x; a, b) = t (1 − t) dt
where φ(z) = 1 − k=1 φk z k and θ(z) = 1 + k=1 θk z k 0
• Regularized incomplete:
Discrete Fourier Transform (DFT) a+b−1
B(x; a, b) a,b∈N X (a + b − 1)!
Ix (a, b) = = xj (1 − x)a+b−1−j
n
X B(a, b) j=a
j!(a + b − 1 − j)!
d(ωj ) = n−1/2 xt e−2πiωj t
• I0 (a, b) = 0 I1 (a, b) = 1
i=1
• Ix (a, b) = 1 − I1−x (b, a)
Fourier/Fundamental frequencies
22.3 Series
ωj = j/n
Finite Binomial
Inverse DFT n n  
n−1 X n(n + 1) X n
• = 2n
X
xt = n −1/2
d(ωj )e 2πiωj t k= •
2 k
j=0 k=1 k=0
n n    
X X r+k r+n+1
Periodogram • (2k − 1) = n2 • =
I(j/n) = |d(j/n)|2 k n
k=1 k=0
n n    
Scaled Periodogram
X n(n + 1)(2n + 1) X k n+1
• k2 = • =
6 m m+1
k=1 k=0
4 n
P (j/n) = I(j/n) X 
n(n + 1)
2 • Vandermonde’s Identity:
n • k3 = r  
m n
 
m+n

2
!2 !2 X
n n k=1 =
2X 2X n k r−k r
= xt cos(2πtj/n + xt sin(2πtj/n cn+1 − 1 k=0
n t=1 n t=1
X
• ck = c 6= 1 • Binomial Theorem:
c−1 n  
n n−k k
k=0
X
a b = (a + b)n
22 Math k
k=0

22.1 Gamma Function Infinite


Z ∞
∞ ∞
• Ordinary: Γ(s) = ts−1 e−t dt X 1 X p
0 • pk = , pk = |p| < 1
Z ∞ 1−p 1−p
k=0 k=1
• Upper incomplete: Γ(s, x) = ts−1 e−t dt ∞ ∞
!  
X d X d 1 1
Z xx • kpk−1 = pk
= = |p| < 1
dp dp 1 − p (1 − p)2
• Lower incomplete: γ(s, x) = ts−1 e−t dt k=0 k=0
0 ∞  
X r+k−1 k
• Γ(α + 1) = αΓ(α) α>1 • x = (1 − x)−r r ∈ N+
k
• Γ(n) = (n − 1)! n∈N k=0
∞  
• Γ(0) = Γ(−1) = ∞ X α k
√ • p = (1 + p)α |p| < 1 , α ∈ C
• Γ(1/2) = π k
k=0
• Γ(−1/2) = −2Γ(1/2)
22.4 Combinatorics [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R
Examples. Springer, 2006.
Sampling [4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra.
Springer, 2001.
k out of n w/o replacement w/ replacement [5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik.
k−1 Springer, 2002.
Y n!
ordered nk = (n − i) = nk [6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
i=0
(n − k)!
nk
     
n n! n−1+r n−1+r
unordered = = =
k k! k!(n − k)! r n−1

Stirling numbers, 2nd kind


        (
n n−1 n−1 n 1 n=0
=k + 1≤k≤n =
k k k−1 0 0 else

Partitions
n
X
Pn+k,k = Pn,i k > n : Pn,k = 0 n ≥ 1 : Pn,0 = 0, P0,0 = 1
i=1

Balls and Urns f :B→U D = distinguishable, ¬D = indistinguishable.

|B| = n, |U | = m f arbitrary f injective f surjective f bijective


( (
mn m ≥ n
 
n n! m = n
B : D, U : D mn m!
0 else m 0 else
      (
m+n−1 m n−1 1 m=n
B : ¬D, U : D
n n m−1 0 else
m  
(   (
X n 1 m≥n n 1 m=n
B : D, U : ¬D
k 0 else m 0 else
k=1
m
( (
X 1 m≥n 1 m=n
B : ¬D, U : ¬D Pn,k Pn,m
k=1
0 else 0 else

References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole,
1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American
Statistician, 62(1):45–53, 2008.
Univariate distribution relationships, courtesy Leemis and McQueston [2].
