
Probability and Statistics Cookbook

Copyright © Matthias Vallentin, 2015
[email protected]

31st March, 2015


This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature and in-class material from courses of the statistics department at the University of California in Berkeley, but also influenced by other sources [2, 3]. If you find errors or have suggestions for further topics, I would appreciate it if you send me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of Parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions

For each distribution we give, where available, the notation, CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).¹

• Uniform, Unif{a, …, b}: F_X(x) = 0 for x < a, (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b, 1 for x > b; f_X(x) = I(a ≤ x ≤ b)/(b − a + 1); E[X] = (a + b)/2; V[X] = ((b − a + 1)² − 1)/12.
• Bernoulli, Bern(p): f_X(x) = pˣ(1 − p)¹⁻ˣ; E[X] = p; V[X] = p(1 − p); M_X(s) = 1 − p + peˢ.
• Binomial, Bin(n, p): F_X(x) = I_{1−p}(n − x, x + 1); f_X(x) = C(n, x) pˣ(1 − p)ⁿ⁻ˣ; E[X] = np; V[X] = np(1 − p); M_X(s) = (1 − p + peˢ)ⁿ.
• Multinomial, Mult(n, p): f_X(x) = n!/(x₁!⋯x_k!) · p₁^{x₁}⋯p_k^{x_k} with Σ_{i=1}^k x_i = n; E[X_i] = np_i; V[X_i] = np_i(1 − p_i); M_X(s) = (Σ_{i=1}^k p_i e^{s_i})ⁿ.
• Hypergeometric, Hyp(N, m, n): F_X(x) ≈ Φ((x − np)/√(np(1 − p))) with p = m/N; f_X(x) = C(m, x)C(N − m, n − x)/C(N, n); E[X] = nm/N; V[X] = nm(N − n)(N − m)/(N²(N − 1)).
• Negative Binomial, NBin(r, p): F_X(x) = I_p(r, x + 1); f_X(x) = C(x + r − 1, r − 1) pʳ(1 − p)ˣ; E[X] = r(1 − p)/p; V[X] = r(1 − p)/p²; M_X(s) = (p/(1 − (1 − p)eˢ))ʳ.
• Geometric, Geo(p): F_X(x) = 1 − (1 − p)ˣ, x ∈ ℕ⁺; f_X(x) = p(1 − p)ˣ⁻¹, x ∈ ℕ⁺; E[X] = 1/p; V[X] = (1 − p)/p²; M_X(s) = peˢ/(1 − (1 − p)eˢ).
• Poisson, Po(λ): F_X(x) = e^{−λ} Σ_{i=0}^x λⁱ/i!; f_X(x) = λˣe^{−λ}/x!; E[X] = λ; V[X] = λ; M_X(s) = e^{λ(eˢ−1)}.

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
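As a quick sanity check on the table above, the following sketch (assuming NumPy is available; the parameter values are arbitrary) compares simulated moments of Bin(n, p) and Po(λ) with the tabulated E[X] and V[X]:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, lam = 30, 0.6, 4.0

binom = rng.binomial(n, p, size=200_000)
pois = rng.poisson(lam, size=200_000)

# Tabulated values: E = np, V = np(1 - p) for Bin(n, p); E = V = lambda for Po(lambda)
print(binom.mean(), n * p, binom.var(), n * p * (1 - p))
print(pois.mean(), lam, pois.var(), lam)
```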
Uniform (discrete) Binomial Geometric Poisson

● n = 40, p = 0.3 0.8 ●
● p = 0.2 ● ●
● λ=1
● n = 30, p = 0.6 ● p = 0.5 ● λ=4
● n = 25, p = 0.9 ● p = 0.8 ● λ = 10

0.3
0.2 ● 0.6


0.2
PMF

PMF

PMF

PMF
1 ● ● ● ●
● ●
● ● ● ● ● ● ● 0.4 ●
n ●
● ● ●

● ●
0.1
● ●
● ● ● ● ●
● ●
● 0.1 ●
● 0.2 ● ●
● ● ● ● ●
● ●
● ● ●
● ●
● ● ●
● ● ● ● ● ●
● ● ●
● ● ● ● ● ●
● ● ● ● ● ● ● ●
● ●●● ● ● ● ●
0.0 ●●●● ●●
●●●●●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●● 0.0 ● ●

● ●
● ● ● ● ● 0.0 ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ●

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20


x x x x
Uniform (discrete) Binomial Geometric Poisson
1 ● 1.00 ●●●●●●●●●●●●●●●●
● ● ●●●●●●●●●●●●●●●●●●●●●
●● 1.0 ● ● ● ● ●
● ● ● ● ● 1.00 ● ● ● ● ● ●
● ●
● ● ● ● ● ● ● ● ●
● ● ●
● ●
● ● ● ●
● ● ● ●
● ● ● ● ●
● ●
● ●
● ●
● ●


● ● ●
0.75 ●
0.8 ● ● 0.75 ●
● ● ●
i ●

● ●
n ●
● ● ●

CDF

CDF

CDF

CDF
0.50 0.6 ● 0.50
● ●
● ●

● ●
i ●
● ● ●
n ●

0.25 ● 0.4 0.25 ●

● ●


● ● ● ● n = 40, p = 0.3 ● p = 0.2 ● ● λ=1
● ● ●
● n = 30, p = 0.6 ● p = 0.5 ●
● λ=4

0 ● 0.00 ●●●● ●

●●
●●●●●●●●●●●●●●●●●
●●●●●●●●●● ● ● n = 25, p = 0.9 0.2 ● ● p = 0.8 0.00

● ● ● ● ● λ = 10

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20


x x x x

4
1.2 Continuous Distributions

For each distribution we give, where available, the notation, CDF F_X(x), PDF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

• Uniform, Unif(a, b): F_X(x) = 0 for x < a, (x − a)/(b − a) for a < x < b, 1 for x > b; f_X(x) = I(a < x < b)/(b − a); E[X] = (a + b)/2; V[X] = (b − a)²/12; M_X(s) = (e^{sb} − e^{sa})/(s(b − a)).
• Normal, N(μ, σ²): F_X(x) = Φ(x) = ∫_{−∞}^x φ(t) dt; f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)); E[X] = μ; V[X] = σ²; M_X(s) = exp(μs + σ²s²/2).
• Log-Normal, ln N(μ, σ²): F_X(x) = ½ + ½ erf((ln x − μ)/√(2σ²)); f_X(x) = (1/(x√(2πσ²))) exp(−(ln x − μ)²/(2σ²)); E[X] = e^{μ+σ²/2}; V[X] = (e^{σ²} − 1)e^{2μ+σ²}.
• Multivariate Normal, MVN(μ, Σ): f_X(x) = (2π)^{−k/2}|Σ|^{−1/2} exp(−½(x − μ)ᵀΣ⁻¹(x − μ)); E[X] = μ; V[X] = Σ; M_X(s) = exp(μᵀs + ½sᵀΣs).
• Student's t, Student(ν): f_X(x) = (Γ((ν+1)/2)/(√(νπ)Γ(ν/2)))(1 + x²/ν)^{−(ν+1)/2}; E[X] = 0; V[X] = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2.
• Chi-square, χ²_k: F_X(x) = γ(k/2, x/2)/Γ(k/2); f_X(x) = (1/(2^{k/2}Γ(k/2))) x^{k/2−1}e^{−x/2}; E[X] = k; V[X] = 2k; M_X(s) = (1 − 2s)^{−k/2} for s < 1/2.
• F, F(d₁, d₂): F_X(x) = I_{d₁x/(d₁x+d₂)}(d₁/2, d₂/2); f_X(x) = √((d₁x)^{d₁} d₂^{d₂}/(d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2)); E[X] = d₂/(d₂ − 2) for d₂ > 2; V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4)) for d₂ > 4.
• Exponential, Exp(β): F_X(x) = 1 − e^{−x/β}; f_X(x) = (1/β)e^{−x/β}; E[X] = β; V[X] = β²; M_X(s) = 1/(1 − βs) for s < 1/β.
• Gamma, Gamma(α, β): F_X(x) = γ(α, x/β)/Γ(α); f_X(x) = (1/(Γ(α)β^α)) x^{α−1}e^{−x/β}; E[X] = αβ; V[X] = αβ²; M_X(s) = (1/(1 − βs))^α for s < 1/β.
• Inverse Gamma, InvGamma(α, β): F_X(x) = Γ(α, β/x)/Γ(α); f_X(x) = (β^α/Γ(α)) x^{−α−1}e^{−β/x}; E[X] = β/(α − 1) for α > 1; V[X] = β²/((α − 1)²(α − 2)) for α > 2.
• Dirichlet, Dir(α): f_X(x) = (Γ(Σ_{i=1}^k α_i)/∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k x_i^{α_i−1}; E[X_i] = α_i/Σ_{i=1}^k α_i; V[X_i] = E[X_i](1 − E[X_i])/(Σ_{i=1}^k α_i + 1).
• Beta, Beta(α, β): F_X(x) = I_x(α, β); f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}; E[X] = α/(α + β); V[X] = αβ/((α + β)²(α + β + 1)); M_X(s) = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) sᵏ/k!.
• Weibull, Weibull(λ, k): F_X(x) = 1 − e^{−(x/λ)ᵏ}; f_X(x) = (k/λ)(x/λ)^{k−1}e^{−(x/λ)ᵏ}; E[X] = λΓ(1 + 1/k); V[X] = λ²Γ(1 + 2/k) − μ²; M_X(s) = Σ_{n=0}^∞ (sⁿλⁿ/n!)Γ(1 + n/k).
• Pareto, Pareto(x_m, α): F_X(x) = 1 − (x_m/x)^α for x ≥ x_m; f_X(x) = αx_m^α/x^{α+1} for x ≥ x_m; E[X] = αx_m/(α − 1) for α > 1; V[X] = x_m²α/((α − 1)²(α − 2)) for α > 2; M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) for s < 0.
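Analogously, a small simulation (NumPy assumed; note that NumPy's `gamma` uses the same shape–scale parameterization (α, β) as the table) can confirm the Gamma and Beta entries:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 3.0, 2.0
a, b = 2.0, 5.0

g = rng.gamma(shape=alpha, scale=beta, size=200_000)   # Gamma(alpha, beta)
x = rng.beta(a, b, size=200_000)                       # Beta(a, b)

print(g.mean(), alpha * beta, g.var(), alpha * beta**2)               # E = αβ, V = αβ²
print(x.mean(), a / (a + b), x.var(), a * b / ((a + b)**2 * (a + b + 1)))
```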
[Figure: PDFs (first two rows) and CDFs (last two rows) of the continuous uniform, normal, log-normal, Student's t, χ², F, exponential, gamma, inverse gamma, beta, Weibull, and Pareto distributions for several parameter settings.]
2 Probability Theory

Definitions
• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra A:
  1. ∅ ∈ A
  2. A₁, A₂, … ∈ A ⟹ ⋃_{i=1}^∞ A_i ∈ A
  3. A ∈ A ⟹ ¬A ∈ A
• Probability distribution P:
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[⨆_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i]
• Probability space (Ω, A, P)

Properties
• P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B] ⟹ P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]
• DeMorgan: ¬(⋃_n A_n) = ⋂_n ¬A_n and ¬(⋂_n A_n) = ⋃_n ¬A_n
• P[⋃_n A_n] = 1 − P[⋂_n ¬A_n]

Continuity of Probabilities
• A₁ ⊂ A₂ ⊂ … ⟹ lim_{n→∞} P[A_n] = P[A] where A = ⋃_{i=1}^∞ A_i
• A₁ ⊃ A₂ ⊃ … ⟹ lim_{n→∞} P[A_n] = P[A] where A = ⋂_{i=1}^∞ A_i

Independence
A ⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability
P[A | B] = P[A ∩ B]/P[B]   (P[B] > 0)

Law of Total Probability
P[B] = Σ_{i=1}^n P[B | A_i] P[A_i]   where Ω = ⨆_{i=1}^n A_i

Bayes' Theorem
P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j]   where Ω = ⨆_{i=1}^n A_i

Inclusion-Exclusion Principle
|⋃_{i=1}^n A_i| = Σ_{r=1}^n (−1)^{r−1} Σ_{i₁<⋯<i_r} |A_{i₁} ∩ ⋯ ∩ A_{i_r}|

3 Random Variables

Random Variable (RV): X : Ω → ℝ

Probability Mass Function (PMF): f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF): P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF): F_X : ℝ → [0, 1], F_X(x) = P[X ≤ x]
  1. Nondecreasing: x₁ < x₂ ⟹ F(x₁) ≤ F(x₂)
  2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
  3. Right-continuous: lim_{y↓x} F(y) = F(x)

P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy   (a ≤ b)

f_{Y|X}(y | x) = f(x, y)/f_X(x)

Independence
  1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
  2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
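As a quick numerical illustration of the PDF–CDF relationship above, the sketch below (NumPy assumed; the exponential distribution and interval are arbitrary) checks that the fraction of samples falling in [a, b] matches F(b) − F(a):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 2.0                             # Exp(beta), mean beta
x = rng.exponential(beta, size=100_000)

F = lambda t: 1 - np.exp(-t / beta)    # CDF of Exp(beta)
a, b = 0.5, 3.0

empirical = np.mean((a <= x) & (x <= b))   # Monte Carlo estimate of P[a <= X <= b]
exact = F(b) - F(a)                        # integral of the PDF over [a, b]
print(empirical, exact)                    # the two agree to 2-3 decimal places
```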
3.1 Transformations

Transformation function: Z = ϕ(X)

Discrete:
f_Z(z) = P[ϕ(X) = z] = P[{x : ϕ(x) = z}] = P[X ∈ ϕ⁻¹(z)] = Σ_{x ∈ ϕ⁻¹(z)} f(x)

Continuous:
F_Z(z) = P[ϕ(X) ≤ z] = ∫_{A_z} f(x) dx   with A_z = {x : ϕ(x) ≤ z}

Special case if ϕ is strictly monotone:
f_Z(z) = f_X(ϕ⁻¹(z)) |d ϕ⁻¹(z)/dz| = f_X(x) |dx/dz| = f_X(x)/|J|

The Rule of the Lazy Statistician:
E[Z] = ∫ ϕ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution:
• Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx = ∫_0^z f_{X,Y}(x, z − x) dx if X, Y ≥ 0
• Z := |X − Y|:  f_Z(z) = 2∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx = ∫_{−∞}^∞ |x| f_X(x) f_Y(xz) dx if X ⊥ Y

4 Expectation

Definition and properties
• E[X] = μ_X = ∫ x dF_X(x) = Σ_x x f_X(x) if X is discrete, ∫ x f_X(x) dx if X is continuous
• P[X = c] = 1 ⟹ E[X] = c
• E[cX] = c E[X]
• E[X + Y] = E[X] + E[Y]
• E[XY] = ∫∫ xy f_{X,Y}(x, y) dF_X(x) dF_Y(y)
• E[ϕ(X)] ≠ ϕ(E[X]) in general (cf. Jensen's inequality)
• P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y]
• P[X = Y] = 1 ⟹ E[X] = E[Y]
• E[X] = Σ_{x=1}^∞ P[X ≥ x]   (X taking values in the positive integers)

Sample mean: X̄n = (1/n) Σ_{i=1}^n X_i

Conditional expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[ϕ(X, Y) | X = x] = ∫_{−∞}^∞ ϕ(x, y) f_{Y|X}(y | x) dy
• E[ϕ(Y, Z) | X = x] = ∫∫ ϕ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[ϕ(X)Y | X] = ϕ(X) E[Y | X]
• E[Y | X] = c ⟹ Cov[X, Y] = 0

5 Variance

Definition and properties
• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i]   if the X_i are independent

Standard deviation: sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[X, Y] = Cov[Y, X]
• Cov[aX, bY] = ab Cov[X, Y]
• Cov[X + a, Y + b] = Cov[X, Y]
• Cov[Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]

Correlation
ρ[X, Y] = Cov[X, Y]/√(V[X] V[Y])

Independence
X ⊥ Y ⟹ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance
S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄n)²

Conditional variance
• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]

6 Inequalities

• Cauchy–Schwarz: E[XY]² ≤ E[X²] E[Y²]
• Markov: P[ϕ(X) ≥ t] ≤ E[ϕ(X)]/t
• Chebyshev: P[|X − E[X]| ≥ t] ≤ V[X]/t²
• Chernoff: P[X ≥ (1 + δ)μ] ≤ (e^δ/(1 + δ)^{1+δ})^μ for δ > −1
• Hoeffding: X₁, …, Xn independent with P[X_i ∈ [a_i, b_i]] = 1, 1 ≤ i ≤ n:
  P[X̄ − E[X̄] ≥ t] ≤ e^{−2nt²}   (t > 0)
  P[|X̄ − E[X̄]| ≥ t] ≤ 2 exp(−2n²t²/Σ_{i=1}^n (b_i − a_i)²)   (t > 0)
• Jensen: E[ϕ(X)] ≥ ϕ(E[X]) for ϕ convex

7 Distribution Relationships

Binomial
• X_i ∼ Bern(p) ⟹ Σ_{i=1}^n X_i ∼ Bin(n, p)
• X ∼ Bin(n, p), Y ∼ Bin(m, p), X ⊥ Y ⟹ X + Y ∼ Bin(n + m, p)
• Bin(n, p) ≈ Po(np)   (n large, p small)
• Bin(n, p) ≈ N(np, np(1 − p))   (n large, p far from 0 and 1)

Negative Binomial
• NBin(1, p) = Geo(p)
• X ∼ NBin(r, p) is the sum of r independent Geo(p) random variables
• X_i ∼ NBin(r_i, p) ⟹ Σ X_i ∼ NBin(Σ r_i, p)
• X ∼ NBin(r, p), Y ∼ Bin(s + r, p) ⟹ P[X ≤ s] = P[Y ≥ r]

Poisson
• X_i ∼ Po(λ_i) independent ⟹ Σ_{i=1}^n X_i ∼ Po(Σ_{i=1}^n λ_i)
• X_i ∼ Po(λ_i) independent ⟹ X_i | Σ_{j=1}^n X_j ∼ Bin(Σ_{j=1}^n X_j, λ_i/Σ_{j=1}^n λ_j)

Exponential
• X_i ∼ Exp(β) iid ⟹ Σ_{i=1}^n X_i ∼ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ∼ N(μ, σ²) ⟹ (X − μ)/σ ∼ N(0, 1)
• X ∼ N(μ, σ²), Z = aX + b ⟹ Z ∼ N(aμ + b, a²σ²)
• X ∼ N(μ₁, σ₁²), Y ∼ N(μ₂, σ₂²), X ⊥ Y ⟹ X + Y ∼ N(μ₁ + μ₂, σ₁² + σ₂²)
• X_i ∼ N(μ_i, σ_i²) independent ⟹ Σ_i X_i ∼ N(Σ_i μ_i, Σ_i σ_i²)
• P[a < X ≤ b] = Φ((b − μ)/σ) − Φ((a − μ)/σ)
• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ⁻¹(1 − α)

Gamma
• X ∼ Gamma(α, β) ⟺ X/β ∼ Gamma(α, 1)
• For integer α, Gamma(α, β) is the distribution of Σ_{i=1}^α Exp(β)
• X_i ∼ Gamma(α_i, β) independent ⟹ Σ_i X_i ∼ Gamma(Σ_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Beta
• (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
• E[Xᵏ] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[Xᵏ⁻¹]
• Beta(1, 1) ∼ Unif(0, 1)

8 Probability and Moment Generating Functions

• G_X(t) = E[tˣ]   (|t| < 1)
• M_X(t) = G_X(eᵗ) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)ⁱ/i!] = Σ_{i=0}^∞ E[Xⁱ] tⁱ/i!
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X⁽ⁱ⁾(0)/i!
• E[X] = G′_X(1⁻)
• E[Xᵏ] = M_X⁽ᵏ⁾(0)
• E[X!/(X − k)!] = G_X⁽ᵏ⁾(1⁻)
• V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
• G_X(t) = G_Y(t) ⟹ X and Y have the same distribution

9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let X, Z ∼ N(0, 1) with X ⊥ Z, and Y = ρX + √(1 − ρ²) Z.
Joint density:
f(x, y) = (1/(2π√(1 − ρ²))) exp(−(x² + y² − 2ρxy)/(2(1 − ρ²)))
Conditionals:
(Y | X = x) ∼ N(ρx, 1 − ρ²)  and  (X | Y = y) ∼ N(ρy, 1 − ρ²)
Independence: X ⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal
Let X ∼ N(μ_x, σ_x²) and Y ∼ N(μ_y, σ_y²).
f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp(−z/(2(1 − ρ²)))
z = ((x − μ_x)/σ_x)² + ((y − μ_y)/σ_y)² − 2ρ((x − μ_x)/σ_x)((y − μ_y)/σ_y)
Conditional mean and variance:
E[X | Y] = E[X] + ρ(σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X²(1 − ρ²)

9.3 Multivariate Normal
Covariance matrix Σ (precision matrix Σ⁻¹) with entries Σ_ij = Cov[X_i, X_j], i.e. V[X_1], …, V[X_k] on the diagonal.
If X ∼ N(μ, Σ):
f_X(x) = (2π)^{−n/2}|Σ|^{−1/2} exp(−½(x − μ)ᵀΣ⁻¹(x − μ))
Properties:
• Z ∼ N(0, 1), X = μ + Σ^{1/2}Z ⟹ X ∼ N(μ, Σ)
• X ∼ N(μ, Σ) ⟹ Σ^{−1/2}(X − μ) ∼ N(0, 1)
• X ∼ N(μ, Σ) ⟹ AX ∼ N(Aμ, AΣAᵀ)
• X ∼ N(μ, Σ), a a vector of length k ⟹ aᵀX ∼ N(aᵀμ, aᵀΣa)

10 Convergence

Let {X₁, X₂, …} be a sequence of rv's and let X be another rv. Let F_n denote the cdf of X_n and let F denote the cdf of X.

Types of convergence
1. In distribution (weakly, in law): X_n →D X: lim_{n→∞} F_n(t) = F(t) at all t where F is continuous
2. In probability: X_n →P X: (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →as X: P[lim_{n→∞} X_n = X] = P[ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)] = 1
4. In quadratic mean (L²): X_n →qm X: lim_{n→∞} E[(X_n − X)²] = 0

Relationships
• X_n →qm X ⟹ X_n →P X ⟹ X_n →D X
• X_n →as X ⟹ X_n →P X
• X_n →D X and P[X = c] = 1 for some c ∈ ℝ ⟹ X_n →P X
• X_n →P X and Y_n →P Y ⟹ X_n + Y_n →P X + Y
• X_n →qm X and Y_n →qm Y ⟹ X_n + Y_n →qm X + Y
• X_n →P X and Y_n →P Y ⟹ X_nY_n →P XY
• X_n →P X ⟹ ϕ(X_n) →P ϕ(X)
• X_n →D X ⟹ ϕ(X_n) →D ϕ(X)
• X_n →qm b ⟺ lim_{n→∞} E[X_n] = b and lim_{n→∞} V[X_n] = 0
• X₁, …, X_n iid, E[X] = μ, V[X] < ∞ ⟹ X̄n →qm μ

Slutzky's Theorem
• X_n →D X and Y_n →P c ⟹ X_n + Y_n →D X + c
• X_n →D X and Y_n →P c ⟹ X_nY_n →D cX
• In general, X_n →D X and Y_n →D Y does not imply X_n + Y_n →D X + Y

10.1 Law of Large Numbers (LLN)

Let {X₁, …, X_n} be a sequence of iid rv's with E[X₁] = μ.
Weak (WLLN): X̄n →P μ as n → ∞
Strong (SLLN): X̄n →as μ as n → ∞

10.2 Central Limit Theorem (CLT)

Let {X₁, …, X_n} be a sequence of iid rv's with E[X₁] = μ and V[X₁] = σ².
Z_n := (X̄n − μ)/√(V[X̄n]) = √n(X̄n − μ)/σ →D Z where Z ∼ N(0, 1)
lim_{n→∞} P[Z_n ≤ z] = Φ(z)   for all z ∈ ℝ

CLT notations
Z_n ≈ N(0, 1)
X̄n ≈ N(μ, σ²/n)
X̄n − μ ≈ N(0, σ²/n)
√n(X̄n − μ) ≈ N(0, σ²)
√n(X̄n − μ)/σ ≈ N(0, 1)

Continuity correction
P[X̄n ≤ x] ≈ Φ((x + ½ − μ)/(σ/√n))
P[X̄n ≥ x] ≈ 1 − Φ((x − ½ − μ)/(σ/√n))

Delta method
Y_n ≈ N(μ, σ²/n) ⟹ ϕ(Y_n) ≈ N(ϕ(μ), (ϕ′(μ))² σ²/n)
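The sketch below (NumPy assumed; exponential data chosen arbitrarily as a skewed example) illustrates the CLT numerically: the standardized sample mean behaves like a N(0, 1) variable even though the data are far from normal.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma = 1.0, 1.0          # mean and standard deviation of Exp(1)
n, reps = 50, 20_000

x = rng.exponential(1.0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma   # Z_n = sqrt(n)(X̄_n - µ)/σ

# By the CLT, Z_n is approximately N(0, 1): mean ≈ 0, variance ≈ 1,
# and P[Z_n <= 1.96] ≈ Φ(1.96) ≈ 0.975.
print(z.mean(), z.var(), np.mean(z <= 1.96))
```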
11 Statistical Inference

Let X₁, …, X_n iid ∼ F if not otherwise noted.

11.1 Point Estimation
• Point estimator θ̂n of θ is a rv: θ̂n = g(X₁, …, X_n)
• bias(θ̂n) = E[θ̂n] − θ
• Consistency: θ̂n →P θ
• Sampling distribution: F(θ̂n)
• Standard error: se(θ̂n) = √V[θ̂n]
• Mean squared error: mse = E[(θ̂n − θ)²] = bias(θ̂n)² + V[θ̂n]
• lim_{n→∞} bias(θ̂n) = 0 and lim_{n→∞} se(θ̂n) = 0 ⟹ θ̂n is consistent
• Asymptotic normality: (θ̂n − θ)/se →D N(0, 1)
• Slutzky's theorem often lets us replace se(θ̂n) by some (weakly) consistent estimator σ̂n.

11.2 Normal-Based Confidence Interval
Suppose θ̂n ≈ N(θ, ŝe²). Let z_{α/2} = Φ⁻¹(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α where Z ∼ N(0, 1). Then
C_n = θ̂n ± z_{α/2} ŝe

11.3 Empirical Distribution
Empirical Distribution Function (ECDF):
F̂n(x) = Σ_{i=1}^n I(X_i ≤ x)/n,   where I(X_i ≤ x) = 1 if X_i ≤ x and 0 if X_i > x
Properties (for any fixed x):
• E[F̂n(x)] = F(x)
• V[F̂n(x)] = F(x)(1 − F(x))/n
• mse = F(x)(1 − F(x))/n → 0
• F̂n(x) →P F(x)
Dvoretzky–Kiefer–Wolfowitz (DKW) inequality (X₁, …, X_n ∼ F):
P[sup_x |F(x) − F̂n(x)| > ε] ≤ 2e^{−2nε²}
Nonparametric 1 − α confidence band for F:
L(x) = max{F̂n(x) − ε_n, 0},  U(x) = min{F̂n(x) + ε_n, 1},  ε_n = √((1/(2n)) log(2/α))
P[L(x) ≤ F(x) ≤ U(x) for all x] ≥ 1 − α

11.4 Statistical Functionals
• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂n = T(F̂n)
• Linear functional: T(F) = ∫ ϕ(x) dF_X(x)
• Plug-in estimator for a linear functional: T(F̂n) = ∫ ϕ(x) dF̂n(x) = (1/n) Σ_{i=1}^n ϕ(X_i)
• Often: T(F̂n) ≈ N(T(F), ŝe²) ⟹ T(F̂n) ± z_{α/2} ŝe
• pth quantile: F⁻¹(p) = inf{x : F(x) ≥ p}
• μ̂ = X̄n
• σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄n)²
• κ̂ = (1/n) Σ_{i=1}^n (X_i − μ̂)³ / σ̂³
• ρ̂ = Σ_{i=1}^n (X_i − X̄n)(Y_i − Ȳn) / (√(Σ_{i=1}^n (X_i − X̄n)²) √(Σ_{i=1}^n (Y_i − Ȳn)²))
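A minimal sketch (NumPy assumed) of the ECDF together with its DKW confidence band as defined above:

```python
import numpy as np

def ecdf_band(x, alpha=0.05):
    """Return sorted data, ECDF values, and the DKW 1-alpha band (L, U)."""
    x = np.sort(x)
    n = x.size
    Fhat = np.arange(1, n + 1) / n                 # ECDF at the order statistics
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))     # DKW half-width ε_n
    L = np.clip(Fhat - eps, 0, 1)
    U = np.clip(Fhat + eps, 0, 1)
    return x, Fhat, L, U

rng = np.random.default_rng(3)
x, Fhat, L, U = ecdf_band(rng.normal(size=200))
# With probability >= 0.95 the true CDF lies between L and U everywhere.
print(Fhat[:5], L[:5], U[:5])
```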
12 Parametric Inference

Let F = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊂ ℝᵏ and parameter θ = (θ₁, …, θ_k).

12.1 Method of Moments
jth moment: α_j(θ) = E[Xʲ] = ∫ xʲ dF_X(x)
jth sample moment: α̂_j = (1/n) Σ_{i=1}^n X_iʲ
Method of moments estimator (MoM): θ̂n solves the system
α₁(θ) = α̂₁,  α₂(θ) = α̂₂,  …,  α_k(θ) = α̂_k

Properties of the MoM estimator
• θ̂n exists with probability tending to 1
• Consistency: θ̂n →P θ
• Asymptotic normality: √n(θ̂ − θ) →D N(0, Σ), where Σ = g E[YYᵀ] gᵀ, Y = (X, X², …, Xᵏ)ᵀ, g = (g₁, …, g_k) and g_j = ∂α_j⁻¹(θ)/∂θ

12.2 Maximum Likelihood
Likelihood: L_n : Θ → [0, ∞), L_n(θ) = ∏_{i=1}^n f(X_i; θ)
Log-likelihood: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)
Maximum likelihood estimator (mle): L_n(θ̂n) = sup_θ L_n(θ)
Score function: s(X; θ) = ∂ log f(X; θ)/∂θ
Fisher information: I(θ) = V_θ[s(X; θ)],  I_n(θ) = nI(θ)
Fisher information (exponential family): I(θ) = E_θ[−∂s(X; θ)/∂θ]
Observed Fisher information: I_nᵒᵇˢ(θ) = −(∂²/∂θ²) Σ_{i=1}^n log f(X_i; θ)

Properties of the mle
• Consistency: θ̂n →P θ
• Equivariance: θ̂n is the mle ⟹ ϕ(θ̂n) is the mle of ϕ(θ)
• Asymptotic normality:
  1. se ≈ √(1/I_n(θ)) and (θ̂n − θ)/se →D N(0, 1)
  2. ŝe ≈ √(1/I_n(θ̂n)) and (θ̂n − θ)/ŝe →D N(0, 1)
• Asymptotic optimality (efficiency): smallest variance for large samples. If θ̃n is any other estimator, the asymptotic relative efficiency is are(θ̃n, θ̂n) = V[θ̂n]/V[θ̃n] ≤ 1
• The mle is approximately the Bayes estimator

12.2.1 Delta Method
If τ = ϕ(θ) where ϕ is differentiable and ϕ′(θ) ≠ 0:
(τ̂n − τ)/ŝe(τ̂) →D N(0, 1)
where τ̂ = ϕ(θ̂) is the mle of τ and ŝe(τ̂) = |ϕ′(θ̂)| ŝe(θ̂n)
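For concreteness, here is a sketch of the delta method for the log-odds τ = log(p/(1 − p)) of a Bernoulli parameter, using the mle p̂ and ŝe(p̂) = √(p̂(1 − p̂)/n) from the Fisher information (NumPy assumed; the data are simulated and the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p_true = 500, 0.3
x = rng.binomial(1, p_true, size=n)

p_hat = x.mean()                              # mle of p
se_p = np.sqrt(p_hat * (1 - p_hat) / n)       # 1/sqrt(I_n(p̂))

tau_hat = np.log(p_hat / (1 - p_hat))         # mle of τ = ϕ(p) (equivariance)
dphi = 1 / (p_hat * (1 - p_hat))              # ϕ'(p̂)
se_tau = abs(dphi) * se_p                     # delta method: ŝe(τ̂) = |ϕ'(p̂)| ŝe(p̂)

print(tau_hat, se_tau, tau_hat - 1.96 * se_tau, tau_hat + 1.96 * se_tau)
```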
12.3 Multiparameter Models

Let θ = (θ₁, …, θ_k) and θ̂ = (θ̂₁, …, θ̂_k) be the mle.
H_jj = ∂²ℓ_n/∂θ_j²,  H_jk = ∂²ℓ_n/∂θ_j∂θ_k
Fisher information matrix: I_n(θ) = −[E_θ[H_jk]]_{j,k=1,…,k}, i.e. the matrix with entries −E_θ[H_jk].
Under appropriate regularity conditions, (θ̂ − θ) ≈ N(0, J_n) with J_n(θ) = I_n⁻¹. Further, if θ̂_j is the jth component of θ̂, then
(θ̂_j − θ_j)/ŝe_j →D N(0, 1)
where ŝe_j² = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k).

12.3.1 Multiparameter Delta Method
Let τ = ϕ(θ₁, …, θ_k) and let the gradient of ϕ be ∇ϕ = (∂ϕ/∂θ₁, …, ∂ϕ/∂θ_k)ᵀ.
Suppose ∇ϕ evaluated at θ = θ̂ is nonzero and τ̂ = ϕ(θ̂). Then
(τ̂ − τ)/ŝe(τ̂) →D N(0, 1)
where ŝe(τ̂) = √((∇̂ϕ)ᵀ Ĵ_n (∇̂ϕ)), Ĵ_n = J_n(θ̂) and ∇̂ϕ = ∇ϕ evaluated at θ = θ̂.

12.4 Parametric Bootstrap
Sample from f(x; θ̂n) instead of from F̂n, where θ̂n could be the mle or the method of moments estimator.

13 Hypothesis Testing

H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁

Definitions
• Null hypothesis H₀
• Alternative hypothesis H₁
• Simple hypothesis: θ = θ₀
• Composite hypothesis: θ > θ₀ or θ < θ₀
• Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
• One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ₁} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ₀} β(θ)

              Retain H₀             Reject H₀
H₀ true       correct               Type I error (α)
H₁ true       Type II error (β)     correct (power)

p-value
• p-value = sup_{θ∈Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• p-value = sup_{θ∈Θ₀} P_θ[T(X*) ≥ T(X)] = inf{α : T(X) ∈ R_α}, where P_θ[T(X*) ≥ T(X)] = 1 − F_θ(T(X)) since T(X*) ∼ F_θ

p-value       evidence
< 0.01        very strong evidence against H₀
0.01 – 0.05   strong evidence against H₀
0.05 – 0.1    weak evidence against H₀
> 0.1         little or no evidence against H₀

Wald test
• Two-sided test
• Reject H₀ when |W| > z_{α/2} where W = (θ̂ − θ₀)/ŝe
• P[|W| > z_{α/2}] → α
• p-value = P_{θ₀}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)
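A small worked Wald test (NumPy and the standard library only; the null value θ₀ = 0 and the simulated data are arbitrary):

```python
import numpy as np
from math import erf, sqrt

def phi_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

rng = np.random.default_rng(5)
x = rng.normal(loc=0.2, scale=1.0, size=100)

theta0 = 0.0
theta_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(x.size)

W = (theta_hat - theta0) / se_hat
p_value = 2 * phi_cdf(-abs(W))          # p-value ≈ 2Φ(−|w|)
print(W, p_value, abs(W) > 1.96)        # reject H0 at α = 0.05 if |W| > z_{0.025} ≈ 1.96
```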
Likelihood ratio test (LRT)
• T(X) = sup_{θ∈Θ} L_n(θ) / sup_{θ∈Θ₀} L_n(θ) = L_n(θ̂n)/L_n(θ̂n,0)
• λ(X) = 2 log T(X) →D χ²_{r−q}, where χ²_k is the distribution of Σ_{i=1}^k Z_i² with Z₁, …, Z_k iid ∼ N(0, 1)
• p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT
• mle: p̂n = (X₁/n, …, X_k/n)
• T(X) = L_n(p̂n)/L_n(p₀) = ∏_{j=1}^k (p̂_j/p₀_j)^{X_j}
• λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p₀_j) →D χ²_{k−1}
• The approximate size α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}

Pearson Chi-square Test
• T = Σ_{j=1}^k (X_j − E[X_j])²/E[X_j] where E[X_j] = np₀_j under H₀
• T →D χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Converges in distribution to χ²_{k−1} faster than the LRT, hence preferable for small n

Independence testing
• I rows, J columns, X multinomial sample of size n = I·J
• mles unconstrained: p̂_ij = X_ij/n
• mles under H₀: p̂₀_ij = p̂_i· p̂_·j = (X_i·/n)(X_·j/n)
• LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_ij log(nX_ij/(X_i· X_·j))
• Pearson chi-square: T = Σ_{i=1}^I Σ_{j=1}^J (X_ij − E[X_ij])²/E[X_ij]
• LRT and Pearson →D χ²_ν, where ν = (I − 1)(J − 1)

14 Exponential Family

Scalar parameter:
f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x)g(θ) exp{η(θ)T(x)}
Vector parameter:
f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x)g(θ) exp{η(θ)·T(x)}
Natural form:
f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x)g(η) exp{η·T(x)} = h(x)g(η) exp{ηᵀT(x)}

15 Bayesian Inference

Bayes' Theorem:
f(θ | x) = f(x | θ)f(θ)/f(xⁿ) = f(x | θ)f(θ)/∫ f(x | θ)f(θ) dθ ∝ L_n(θ)f(θ)

Definitions
• Xⁿ = (X₁, …, X_n),  xⁿ = (x₁, …, x_n)
• Prior density f(θ)
• Likelihood f(xⁿ | θ): joint density of the data. In particular, Xⁿ iid ⟹ f(xⁿ | θ) = ∏_{i=1}^n f(x_i | θ) = L_n(θ)
• Posterior density f(θ | xⁿ)
• Normalizing constant c_n = f(xⁿ) = ∫ f(x | θ)f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄n = ∫ θ f(θ | xⁿ) dθ = ∫ θL_n(θ)f(θ) dθ / ∫ L_n(θ)f(θ) dθ

15.1 Credible Intervals
Posterior interval: P[θ ∈ (a, b) | xⁿ] = ∫_a^b f(θ | xⁿ) dθ = 1 − α
Equal-tail credible interval: ∫_{−∞}^a f(θ | xⁿ) dθ = ∫_b^∞ f(θ | xⁿ) dθ = α/2
Highest posterior density (HPD) region R_n:
  1. P[θ ∈ R_n] = 1 − α
  2. R_n = {θ : f(θ | xⁿ) > k} for some k
If R_n is unimodal, then R_n is an interval.

15.2 Function of Parameters
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.
Posterior CDF for τ: H(τ | xⁿ) = P[ϕ(θ) ≤ τ | xⁿ] = ∫_A f(θ | xⁿ) dθ
Posterior density: h(τ | xⁿ) = H′(τ | xⁿ)
Bayesian delta method: τ | Xⁿ ≈ N(ϕ(θ̂), ŝe ϕ′(θ̂))

15.3 Priors
Choice
• Subjective Bayesianism
• Objective Bayesianism
• Robust Bayesianism
Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^∞ f(θ) dθ = 1
• Improper: ∫_{−∞}^∞ f(θ) dθ = ∞
• Jeffrey's prior (transformation-invariant): f(θ) ∝ √I(θ), and in the multiparameter case f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | xⁿ) belong to the same parametric family

15.3.1 Conjugate Priors
Continuous likelihood (subscript c denotes a constant); each entry lists likelihood, conjugate prior, and posterior hyperparameters:
• Unif(0, θ) — Pareto(x_m, k) — Pareto(max{x₍ₙ₎, x_m}, k + n)
• Exp(λ) — Gamma(α, β) — Gamma(α + n, β + Σ_{i=1}^n x_i)
• N(μ, σ_c²) (unknown mean) — N(μ₀, σ₀²) — mean (μ₀/σ₀² + Σ_{i=1}^n x_i/σ_c²)/(1/σ₀² + n/σ_c²), variance (1/σ₀² + n/σ_c²)⁻¹
• N(μ_c, σ²) (unknown variance) — Scaled Inverse Chi-square(ν, σ₀²) — Scaled Inverse Chi-square(ν + n, (νσ₀² + Σ_{i=1}^n (x_i − μ_c)²)/(ν + n))
• N(μ, σ²) (both unknown) — Normal-scaled Inverse Gamma(λ, ν, α, β) — (νλ + nx̄)/(ν + n), ν + n, α + n/2, β + ½Σ_{i=1}^n (x_i − x̄)² + nν(x̄ − λ)²/(2(ν + n))
• MVN(μ, Σ_c) — MVN(μ₀, Σ₀) — covariance (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹, mean (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹(Σ₀⁻¹μ₀ + nΣ_c⁻¹x̄)
• MVN(μ_c, Σ) — Inverse-Wishart(κ, Ψ) — Inverse-Wishart(n + κ, Ψ + Σ_{i=1}^n (x_i − μ_c)(x_i − μ_c)ᵀ)
• Pareto(x_mc, k) — Gamma(α, β) — Gamma(α + n, β + Σ_{i=1}^n log(x_i/x_mc))
• Pareto(x_m, k_c) — Pareto(x₀, k₀) — Pareto(x₀, k₀ − k_c n), where k₀ > k_c n
• Gamma(α_c, β) — Gamma(α₀, β₀) — Gamma(α₀ + nα_c, β₀ + Σ_{i=1}^n x_i)
Discrete likelihood; each entry lists likelihood, conjugate prior, and posterior hyperparameters:
• Bern(p) — Beta(α, β) — Beta(α + Σ_{i=1}^n x_i, β + n − Σ_{i=1}^n x_i)
• Bin(p) — Beta(α, β) — Beta(α + Σ_{i=1}^n x_i, β + Σ_{i=1}^n N_i − Σ_{i=1}^n x_i)
• NBin(p) — Beta(α, β) — Beta(α + rn, β + Σ_{i=1}^n x_i)
• Po(λ) — Gamma(α, β) — Gamma(α + Σ_{i=1}^n x_i, β + n)
• Multinomial(p) — Dir(α) — Dir(α + Σ_{i=1}^n x⁽ⁱ⁾)
• Geo(p) — Beta(α, β) — Beta(α + n, β + Σ_{i=1}^n x_i)

15.4 Bayesian Testing
If H₀ : θ ∈ Θ₀:
Prior probability P[H₀] = ∫_{Θ₀} f(θ) dθ
Posterior probability P[H₀ | xⁿ] = ∫_{Θ₀} f(θ | xⁿ) dθ
Let H₀, …, H_{K−1} be K hypotheses and suppose θ ∼ f(θ | H_k). Then
P[H_k | xⁿ] = f(xⁿ | H_k)P[H_k] / Σ_{k=1}^K f(xⁿ | H_k)P[H_k]
Marginal likelihood: f(xⁿ | H_i) = ∫_Θ f(xⁿ | θ, H_i)f(θ | H_i) dθ
Posterior odds (of H_i relative to H_j):
P[H_i | xⁿ]/P[H_j | xⁿ] = (f(xⁿ | H_i)/f(xⁿ | H_j)) × (P[H_i]/P[H_j]) = Bayes factor BF_ij × prior odds
Bayes factor interpretation:
log₁₀ BF₁₀    BF₁₀        evidence
0 – 0.5       1 – 3.2     Weak
0.5 – 1       3.2 – 10    Moderate
1 – 2         10 – 100    Strong
> 2           > 100       Decisive
p* = (p/(1 − p)) BF₁₀ / (1 + (p/(1 − p)) BF₁₀), where p = P[H₁] and p* = P[H₁ | xⁿ]

16 Sampling Methods

16.1 Inverse Transform Sampling
Setup: U ∼ Unif(0, 1), X ∼ F, F⁻¹(u) = inf{x | F(x) ≥ u}
Algorithm:
  1. Generate u ∼ Unif(0, 1)
  2. Compute x = F⁻¹(u)

16.2 The Bootstrap
Let T_n = g(X₁, …, X_n) be a statistic.
  1. Estimate V_F[T_n] with V_{F̂n}[T_n].
  2. Approximate V_{F̂n}[T_n] using simulation:
     (a) Repeat the following B times to get T*_{n,1}, …, T*_{n,B}, an iid sample from the sampling distribution implied by F̂n:
         i. Sample uniformly X₁*, …, X_n* ∼ F̂n.
         ii. Compute T_n* = g(X₁*, …, X_n*).
     (b) Then v_boot = V̂_{F̂n} = (1/B) Σ_{b=1}^B (T*_{n,b} − (1/B) Σ_{r=1}^B T*_{n,r})²
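A minimal non-parametric bootstrap sketch (NumPy assumed) estimating the standard error of the sample median, following the algorithm above with T_n = median:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.exponential(2.0, size=80)      # observed sample
B = 2000

T_star = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=x.size, replace=True)   # sample uniformly from F̂_n
    T_star[b] = np.median(resample)                       # T*_{n,b}

v_boot = T_star.var()                  # bootstrap estimate of V[T_n]
se_boot = np.sqrt(v_boot)
print(np.median(x), se_boot)
```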
16.2.1 Bootstrap Confidence Intervals
Normal-based interval: T_n ± z_{α/2} ŝe_boot
Pivotal interval:
  1. Location parameter θ = T(F)
  2. Pivot R_n = θ̂n − θ
  3. Let H(r) = P[R_n ≤ r] be the cdf of R_n
  4. Let R*_{n,b} = θ̂*_{n,b} − θ̂n. Approximate H using the bootstrap: Ĥ(r) = (1/B) Σ_{b=1}^B I(R*_{n,b} ≤ r)
  5. θ*_β = β sample quantile of (θ̂*_{n,1}, …, θ̂*_{n,B})
  6. r*_β = β sample quantile of (R*_{n,1}, …, R*_{n,B}), i.e., r*_β = θ*_β − θ̂n
  7. Approximate 1 − α confidence interval C_n = (â, b̂) where
     â = θ̂n − Ĥ⁻¹(1 − α/2) = θ̂n − r*_{1−α/2} = 2θ̂n − θ*_{1−α/2}
     b̂ = θ̂n − Ĥ⁻¹(α/2) = θ̂n − r*_{α/2} = 2θ̂n − θ*_{α/2}
Percentile interval: C_n = (θ*_{α/2}, θ*_{1−α/2})

16.3 Rejection Sampling
Setup:
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportionality constant: h(θ) = k(θ)/∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) for all θ
Algorithm:
  1. Draw θ_cand ∼ g(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ_cand if u ≤ k(θ_cand)/(M g(θ_cand))
  4. Repeat until B values of θ_cand have been accepted
Example:
• We can easily sample from the prior g(θ) = f(θ)
• The target is the posterior h(θ) ∝ k(θ) = f(xⁿ | θ)f(θ)
• Envelope condition: f(xⁿ | θ) ≤ f(xⁿ | θ̂n) = L_n(θ̂n) ≡ M
• Algorithm:
  1. Draw θ_cand ∼ f(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ_cand if u ≤ L_n(θ_cand)/L_n(θ̂n)

16.4 Importance Sampling
Sample from an importance function g rather than the target density h.
Algorithm to obtain an approximation to E[q(θ) | xⁿ]:
  1. Sample from the prior: θ₁, …, θ_B iid ∼ f(θ)
  2. Compute the weights w_i = L_n(θ_i)/Σ_{i=1}^B L_n(θ_i) for all i = 1, …, B
  3. E[q(θ) | xⁿ] ≈ Σ_{i=1}^B q(θ_i)w_i
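A sketch of rejection sampling (NumPy assumed): the target is known only up to a constant, k(θ) = θ(1 − θ)⁴ (a Beta(2, 5) kernel chosen for illustration), the proposal g is Unif(0, 1), and any M ≥ max_θ k(θ) satisfies the envelope condition.

```python
import numpy as np

rng = np.random.default_rng(4)

def k(theta):
    return theta * (1 - theta) ** 4          # unnormalized target density

M = k(np.linspace(0, 1, 1001)).max()         # envelope constant for g = Unif(0, 1)

samples = []
while len(samples) < 10_000:
    theta = rng.uniform()                    # 1. draw candidate from g
    u = rng.uniform()                        # 2. draw u ~ Unif(0, 1)
    if u <= k(theta) / M:                    # 3. accept with probability k(θ)/(M g(θ))
        samples.append(theta)

samples = np.array(samples)
print(samples.mean(), 2 / 7)                 # Beta(2, 5) has mean 2/7 ≈ 0.286
```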
17 Decision Theory

Definitions
• Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂; L : Θ × A → [−k, ∞).

Loss functions
• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0, K₂(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K₁ = K₂)
• Lᵖ loss: L(θ, a) = |θ − a|ᵖ
• Zero-one loss: L(θ, a) = 0 if a = θ, 1 if a ≠ θ

17.1 Risk
Posterior risk: r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]
(Frequentist) risk: R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]
Bayes risk: r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility
• θ̂′ dominates θ̂ if R(θ, θ̂′) ≤ R(θ, θ̂) for all θ, and R(θ, θ̂′) < R(θ, θ̂) for at least one θ
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it; otherwise it is called admissible.

17.3 Bayes Rule
Bayes rule (or Bayes estimator):
• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• θ̂(x) = inf r(θ̂ | x) for all x ⟹ r(f, θ̂) = ∫ r(θ̂ | x)f(x) dx
Theorems
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules
Maximum risk: R̄(θ̂) = sup_θ R(θ, θ̂),  R̄(a) = sup_θ R(θ, a)
Minimax rule: sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)
θ̂ = Bayes rule and R(θ, θ̂) = c for some constant c ⟹ θ̂ is minimax
Least favorable prior: θ̂_f = Bayes rule and R(θ, θ̂_f) ≤ r(f, θ̂_f) for all θ ⟹ θ̂_f is minimax

18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression
Model: Y_i = β₀ + β₁X_i + ε_i,  E[ε_i | X_i] = 0,  V[ε_i | X_i] = σ²
Fitted line: r̂(x) = β̂₀ + β̂₁x
Predicted (fitted) values: Ŷ_i = r̂(X_i)
Residuals: ε̂_i = Y_i − Ŷ_i = Y_i − (β̂₀ + β̂₁X_i)
Residual sum of squares (rss): rss(β̂₀, β̂₁) = Σ_{i=1}^n ε̂_i²
Least squares estimates: β̂ᵀ = (β̂₀, β̂₁)ᵀ minimizes rss:
β̂₀ = Ȳn − β̂₁X̄n
β̂₁ = Σ_{i=1}^n (X_i − X̄n)(Y_i − Ȳn)/Σ_{i=1}^n (X_i − X̄n)² = (Σ_{i=1}^n X_iY_i − nX̄Ȳ)/(Σ_{i=1}^n X_i² − nX̄²)
E[β̂ | Xⁿ] = (β₀, β₁)ᵀ
V[β̂ | Xⁿ] = (σ²/(n s_X²)) [ n⁻¹Σ_{i=1}^n X_i²   −X̄n ;  −X̄n   1 ]
ŝe(β̂₀) = (σ̂/(s_X√n)) √(Σ_{i=1}^n X_i²/n)
ŝe(β̂₁) = σ̂/(s_X√n)
where s_X² = n⁻¹Σ_{i=1}^n (X_i − X̄n)² and σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂_i² (an unbiased estimate).
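The closed-form least squares estimates and standard errors above, in a short sketch (NumPy assumed; the data are simulated from Y = 1 + 2X + ε purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
X = rng.uniform(0, 5, size=n)
Y = 1.0 + 2.0 * X + rng.normal(0, 1.0, size=n)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

resid = Y - (b0 + b1 * X)
sigma2 = np.sum(resid ** 2) / (n - 2)          # unbiased estimate of σ²
s2_X = np.mean((X - X.mean()) ** 2)

se_b1 = np.sqrt(sigma2) / (np.sqrt(s2_X) * np.sqrt(n))     # σ̂/(s_X √n)
se_b0 = se_b1 * np.sqrt(np.mean(X ** 2))                   # σ̂/(s_X √n) · √(ΣX²/n)
print(b0, b1, se_b0, se_b1)
```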
Further properties:
• Consistency: β̂₀ →P β₀ and β̂₁ →P β₁
• Asymptotic normality: (β̂₀ − β₀)/ŝe(β̂₀) →D N(0, 1) and (β̂₁ − β₁)/ŝe(β̂₁) →D N(0, 1)
• Approximate 1 − α confidence intervals for β₀ and β₁: β̂₀ ± z_{α/2} ŝe(β̂₀) and β̂₁ ± z_{α/2} ŝe(β̂₁)
• Wald test for H₀ : β₁ = 0 vs. H₁ : β₁ ≠ 0: reject H₀ if |W| > z_{α/2} where W = β̂₁/ŝe(β̂₁)

R²:
R² = Σ_{i=1}^n (Ŷ_i − Ȳ)²/Σ_{i=1}^n (Y_i − Ȳ)² = 1 − Σ_{i=1}^n ε̂_i²/Σ_{i=1}^n (Y_i − Ȳ)² = 1 − rss/tss

Likelihood:
L = ∏_{i=1}^n f(X_i, Y_i) = ∏_{i=1}^n f_X(X_i) × ∏_{i=1}^n f_{Y|X}(Y_i | X_i) = L₁ × L₂
L₁ = ∏_{i=1}^n f_X(X_i)
L₂ = ∏_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ⁻ⁿ exp{−(1/(2σ²)) Σ_i (Y_i − (β₀ + β₁X_i))²}
Under the assumption of normality, the least squares parameter estimators are also the MLEs, but the least squares variance estimator is not the MLE; the MLE is
σ̂² = (1/n) Σ_{i=1}^n ε̂_i²

18.2 Prediction
Observe X = x* of the covariate and want to predict the outcome Y*.
Ŷ* = β̂₀ + β̂₁x*
V[Ŷ*] = V[β̂₀] + x*²V[β̂₁] + 2x* Cov[β̂₀, β̂₁]
Prediction interval: ξ̂n² = σ̂²(Σ_{i=1}^n (X_i − x*)²/(n Σ_i (X_i − X̄)²) + 1), and Ŷ* ± z_{α/2} ξ̂n

18.3 Multiple Regression
Y = Xβ + ε, where X is the n × k design matrix with rows (X_{i1}, …, X_{ik}), β = (β₁, …, β_k)ᵀ, and ε = (ε₁, …, ε_n)ᵀ.
Likelihood: L(μ, Σ) = (2πσ²)^{−n/2} exp(−rss/(2σ²))
rss = (y − Xβ)ᵀ(y − Xβ) = ‖Y − Xβ‖² = Σ_{i=1}^N (Y_i − x_iᵀβ)²
If the (k × k) matrix XᵀX is invertible:
β̂ = (XᵀX)⁻¹XᵀY
V[β̂ | Xⁿ] = σ²(XᵀX)⁻¹
β̂ ≈ N(β, σ²(XᵀX)⁻¹)
Estimate of the regression function: r̂(x) = Σ_{j=1}^k β̂_j x_j
Unbiased estimate for σ²: σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂_i²,  ε̂ = Xβ̂ − Y
mle: μ̂ = X̄,  σ̂²_mle = ((n − k)/n) σ̂²
1 − α confidence interval for β_j: β̂_j ± z_{α/2} ŝe(β̂_j)
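A sketch of the multiple-regression estimator β̂ = (XᵀX)⁻¹XᵀY (NumPy assumed; an intercept column is added explicitly, and `lstsq` is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])  # design matrix with intercept
beta_true = np.array([0.5, 1.0, -2.0])
Y = X @ beta_true + rng.normal(0, 1.0, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)     # least squares solution of Y ≈ Xβ
resid = Y - X @ beta_hat
k = X.shape[1]
sigma2_hat = resid @ resid / (n - k)                 # unbiased estimate of σ²
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)       # V[β̂ | X] = σ²(XᵀX)⁻¹
print(beta_hat, np.sqrt(np.diag(cov_beta)))          # estimates and standard errors
```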
18.4 Model Selection

Consider predicting a new observation Y* for covariates X* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.
Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance
Procedure
  1. Assign a score to each model
  2. Search through all models to find the one with the highest score
Hypothesis testing: H₀ : β_j = 0 vs. H₁ : β_j ≠ 0 for all j ∈ J
Mean squared prediction error (mspe): mspe = E[(Ŷ(S) − Y*)²]
Prediction risk: R(S) = Σ_{i=1}^n mspe_i = Σ_{i=1}^n E[(Ŷ_i(S) − Y_i*)²]
Training error: R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²
R²: R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss
The training error is a downward-biased estimate of the prediction risk:
E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]
Adjusted R²: R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss
Mallow's Cp statistic: R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty
Akaike Information Criterion (AIC): AIC(S) = ℓ_n(β̂_S, σ̂_S²) − k
Bayesian Information Criterion (BIC): BIC(S) = ℓ_n(β̂_S, σ̂_S²) − (k/2) log n
Validation and training: R̂_V(S) = Σ_{i=1}^m (Ŷ_i*(S) − Y_i*)², where m = |{validation data}|, often n/4 or n/2
Leave-one-out cross-validation:
R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_{(i)})² = Σ_{i=1}^n ((Y_i − Ŷ_i(S))/(1 − U_ii(S)))²
U(S) = X_S(X_SᵀX_S)⁻¹X_Sᵀ   ("hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation
Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.
Integrated square error (ise): L(f, f̂n) = ∫ (f(x) − f̂n(x))² dx = J(h) + ∫ f²(x) dx
Frequentist risk: R(f, f̂n) = E[L(f, f̂n)] = ∫ b²(x) dx + ∫ v(x) dx, where b(x) = E[f̂n(x)] − f(x) and v(x) = V[f̂n(x)]
19.1.1 Histograms
Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du
Histogram estimator:
f̂n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)
E[f̂n(x)] = p_j/h
V[f̂n(x)] = p_j(1 − p_j)/(nh²)
R(f̂n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (1/n^{1/3}) (6/∫(f′(u))² du)^{1/3}
R*(f̂n, f) ≈ C/n^{2/3},  C = (3/4)^{2/3} (∫(f′(u))² du)^{1/3}
Cross-validation estimate of E[J(h)]:
Ĵ_CV(h) = ∫ f̂n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂_j²

19.1.2 Kernel Density Estimator (KDE)
Kernel K:
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ xK(x) dx = 0
• ∫ x²K(x) dx ≡ σ²_K > 0
KDE:
f̂n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)
R(f, f̂n) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
h* = c₁^{−2/5} c₂^{−1/5} c₃^{−1/5} n^{−1/5},  with c₁ = σ_K², c₂ = ∫K²(x) dx, c₃ = ∫(f″(x))² dx
R*(f, f̂n) = c₄/n^{4/5},  c₄ = (5/4)(σ_K²)^{2/5} (∫K²(x) dx)^{4/5} (∫(f″)² dx)^{1/5} ≡ C(K) (∫(f″)² dx)^{1/5}
Epanechnikov kernel:
K(x) = (3/(4√5))(1 − x²/5) for |x| < √5, 0 otherwise
Cross-validation estimate of E[J(h)]:
Ĵ_CV(h) = ∫ f̂n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) ≈ (1/(hn²)) Σ_{i=1}^n Σ_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)
K*(x) = K⁽²⁾(x) − 2K(x),  K⁽²⁾(x) = ∫ K(x − y)K(y) dy
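A from-scratch Gaussian KDE sketch matching the definition above (NumPy assumed; the bandwidth is fixed by hand here rather than chosen by cross-validation):

```python
import numpy as np

def kde(x_grid, data, h):
    """Kernel density estimate with a Gaussian kernel K and bandwidth h."""
    u = (x_grid[:, None] - data[None, :]) / h          # (x - X_i)/h for every grid point
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)     # Gaussian kernel
    return K.mean(axis=1) / h                          # (1/n) Σ K((x - X_i)/h) / h

rng = np.random.default_rng(6)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
grid = np.linspace(-6, 6, 200)
f_hat = kde(grid, data, h=0.4)
print(np.trapz(f_hat, grid))     # the estimate integrates to ≈ 1
```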
19.2 Non-parametric Regression
Estimate r(x), where r(x) = E[Y | X = x]. Consider pairs of points (x₁, Y₁), …, (x_n, Y_n) related by
Y_i = r(x_i) + ε_i,  E[ε_i] = 0,  V[ε_i] = σ²

k-nearest Neighbor Estimator:
r̂(x) = (1/k) Σ_{i : x_i ∈ N_k(x)} Y_i,  where N_k(x) = {k values of x₁, …, x_n closest to x}

Nadaraya–Watson Kernel Estimator:
r̂(x) = Σ_{i=1}^n w_i(x)Y_i,  w_i(x) = K((x − x_i)/h)/Σ_{j=1}^n K((x − x_j)/h) ∈ [0, 1]
R(r̂n, r) ≈ (h⁴/4)(∫ x²K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ² ∫ K²(x) dx)/(nh) ∫ dx/f(x)
h* ≈ c₁/n^{1/5},  R*(r̂n, r) ≈ c₂/n^{4/5}
Cross-validation estimate of E[J(h)]:
Ĵ_CV(h) = Σ_{i=1}^n (Y_i − r̂_{(−i)}(x_i))² = Σ_{i=1}^n (Y_i − r̂(x_i))²/(1 − K(0)/Σ_{j=1}^n K((x_i − x_j)/h))²

19.3 Smoothing Using Orthogonal Functions
Approximation: r(x) = Σ_{j=1}^∞ β_jφ_j(x) ≈ Σ_{j=1}^J β_jφ_j(x)
Multivariate regression: Y = Φβ + η, where η_i = ε_i and Φ is the matrix with entries Φ_ij = φ_j(x_i), i = 1, …, n, j = 0, …, J
Least squares estimator: β̂ = (ΦᵀΦ)⁻¹ΦᵀY ≈ (1/n)ΦᵀY   (for equally spaced observations only)
Cross-validation estimate of E[J(h)]: R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_{j=1}^J φ_j(x_i)β̂_{j,(−i)})²

20 Stochastic Processes

Stochastic process: {X_t : t ∈ T}, with index set T = {0, ±1, …} = ℤ (discrete) or T = [0, ∞) (continuous)
• Notations: X_t, X(t)
• State space X
• Index set T

20.1 Markov Chains
Markov chain: P[X_n = x | X₀, …, X_{n−1}] = P[X_n = x | X_{n−1}] for all n ∈ T, x ∈ X
Transition probabilities:
p_ij ≡ P[X_{n+1} = j | X_n = i]
p_ij(n) ≡ P[X_{m+n} = j | X_m = i]   (n-step)
Transition matrix P (n-step: P_n):
• (i, j) element is p_ij
• p_ij ≥ 0
• Σ_j p_ij = 1 (rows sum to one)
Chapman–Kolmogorov:
p_ij(m + n) = Σ_k p_ik(m) p_kj(n)
P_{m+n} = P_m P_n
P_n = P × ⋯ × P = Pⁿ
Marginal probability:
μ_n = (μ_n(1), …, μ_n(N)) where μ_n(i) = P[X_n = i]
μ₀ = initial distribution
μ_n = μ₀Pⁿ
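A short sketch of the n-step marginal μ_n = μ₀Pⁿ for a two-state chain (NumPy assumed; the transition matrix is made up purely for illustration):

```python
import numpy as np

P = np.array([[0.9, 0.1],      # hypothetical transition matrix: rows sum to 1
              [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])     # start in state 1 with probability 1

mu10 = mu0 @ np.linalg.matrix_power(P, 10)   # µ_n = µ_0 P^n with n = 10
print(mu10, mu10.sum())                      # a probability vector (sums to 1)
```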
20.2 Poisson Processes
Poisson process:
• {X_t : t ∈ [0, ∞)} = number of events up to and including time t
• X₀ = 0
• Independent increments: for all t₀ < ⋯ < t_n: X_{t₁} − X_{t₀} ⊥ ⋯ ⊥ X_{t_n} − X_{t_{n−1}}
• Intensity function λ(t):
  – P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  – P[X_{t+h} − X_t = 2] = o(h)
• X_{s+t} − X_s ∼ Po(m(s + t) − m(s)), where m(t) = ∫₀ᵗ λ(s) ds
Homogeneous Poisson process: λ(t) ≡ λ ⟹ X_t ∼ Po(λt), λ > 0
Waiting times: W_t := time at which X_t occurs; W_t ∼ Gamma(t, 1/λ)
Interarrival times: S_t = W_{t+1} − W_t; S_t ∼ Exp(1/λ)

21 Time Series

Mean function: μ_{xt} = E[x_t] = ∫_{−∞}^∞ x f_t(x) dx
Autocovariance function:
γ_x(s, t) = E[(x_s − μ_s)(x_t − μ_t)] = E[x_s x_t] − μ_sμ_t
γ_x(t, t) = E[(x_t − μ_t)²] = V[x_t]
Autocorrelation function (ACF):
ρ(s, t) = Cov[x_s, x_t]/√(V[x_s]V[x_t]) = γ(s, t)/√(γ(s, s)γ(t, t))
Cross-covariance function (CCV): γ_xy(s, t) = E[(x_s − μ_{xs})(y_t − μ_{yt})]
Cross-correlation function (CCF): ρ_xy(s, t) = γ_xy(s, t)/√(γ_x(s, s)γ_y(t, t))
Backshift operator: Bᵏ(x_t) = x_{t−k}
Difference operator: ∇ᵈ = (1 − B)ᵈ
White noise:
• w_t ∼ wn(0, σ_w²)
• Gaussian: w_t iid ∼ N(0, σ_w²)
• E[w_t] = 0 for all t ∈ T
• V[w_t] = σ² for all t ∈ T
• γ_w(s, t) = 0 for s ≠ t, s, t ∈ T
Random walk:
• Drift δ
• x_t = δt + Σ_{j=1}^t w_j
• E[x_t] = δt
Symmetric moving average:
m_t = Σ_{j=−k}^k a_j x_{t−j},  where a_j = a_{−j} ≥ 0 and Σ_{j=−k}^k a_j = 1
21.1 Stationary Time Series
Strictly stationary:
P[x_{t₁} ≤ c₁, …, x_{tₖ} ≤ cₖ] = P[x_{t₁+h} ≤ c₁, …, x_{tₖ+h} ≤ cₖ]   for all k ∈ ℕ and all tₖ, cₖ, h ∈ ℤ
Weakly stationary:
• E[x_t²] < ∞ for all t ∈ ℤ
• E[x_t] = m for all t ∈ ℤ
• γ_x(s, t) = γ_x(s + r, t + r) for all r, s, t ∈ ℤ
Autocovariance function:
• γ(h) = E[(x_{t+h} − μ)(x_t − μ)] for all h ∈ ℤ
• γ(0) = E[(x_t − μ)²]
• γ(0) ≥ 0
• γ(0) ≥ |γ(h)|
• γ(h) = γ(−h)
Autocorrelation function (ACF):
ρ_x(h) = Cov[x_{t+h}, x_t]/√(V[x_{t+h}]V[x_t]) = γ(t + h, t)/√(γ(t + h, t + h)γ(t, t)) = γ(h)/γ(0)
Jointly stationary time series:
γ_xy(h) = E[(x_{t+h} − μ_x)(y_t − μ_y)],  ρ_xy(h) = γ_xy(h)/√(γ_x(0)γ_y(0))
Linear process:
x_t = μ + Σ_{j=−∞}^∞ ψ_j w_{t−j},  where Σ_{j=−∞}^∞ |ψ_j| < ∞
γ(h) = σ_w² Σ_{j=−∞}^∞ ψ_{j+h}ψ_j

21.2 Estimation of Correlation
Sample mean: x̄ = (1/n) Σ_{t=1}^n x_t
Variance of the sample mean: V[x̄] = (1/n) Σ_{h=−n}^n (1 − |h|/n) γ_x(h)
Sample autocovariance function: γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)
Sample autocorrelation function: ρ̂(h) = γ̂(h)/γ̂(0)
Sample cross-covariance function: γ̂_xy(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ)
Sample cross-correlation function: ρ̂_xy(h) = γ̂_xy(h)/√(γ̂_x(0)γ̂_y(0))
Properties:
• σ_{ρ̂_x(h)} = 1/√n if x_t is white noise
• σ_{ρ̂_xy(h)} = 1/√n if x_t or y_t is white noise

21.3 Non-Stationary Time Series
Classical decomposition model: x_t = μ_t + s_t + w_t
• μ_t = trend
• s_t = seasonal component
• w_t = random noise term
21.3.1 Detrending
Least squares:
  1. Choose a trend model, e.g., μ_t = β₀ + β₁t + β₂t²
  2. Minimize rss to obtain the trend estimate μ̂_t = β̂₀ + β̂₁t + β̂₂t²
  3. The residuals then estimate the noise w_t
Moving average:
• The low-pass filter v_t is a symmetric moving average m_t with a_j = 1/(2k + 1): v_t = (1/(2k + 1)) Σ_{i=−k}^k x_{t−i}
• If (1/(2k + 1)) Σ_{i=−k}^k w_{t−j} ≈ 0, a linear trend function μ_t = β₀ + β₁t passes without distortion
Differencing:
• μ_t = β₀ + β₁t ⟹ ∇x_t = β₁

21.4 ARIMA Models
Autoregressive polynomial: φ(z) = 1 − φ₁z − ⋯ − φ_p z^p,  z ∈ ℂ and φ_p ≠ 0
Autoregressive operator: φ(B) = 1 − φ₁B − ⋯ − φ_p B^p
Autoregressive model of order p, AR(p): x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t ⟺ φ(B)x_t = w_t
AR(1):
• x_t = φᵏ(x_{t−k}) + Σ_{j=0}^{k−1} φʲ(w_{t−j}) → Σ_{j=0}^∞ φʲ w_{t−j} as k → ∞ when |φ| < 1 (a linear process)
• E[x_t] = Σ_{j=0}^∞ φʲ E[w_{t−j}] = 0
• γ(h) = Cov[x_{t+h}, x_t] = σ_w² φʰ/(1 − φ²)
• ρ(h) = γ(h)/γ(0) = φʰ
• ρ(h) = φρ(h − 1), h = 1, 2, …
Moving average polynomial: θ(z) = 1 + θ₁z + ⋯ + θ_q z^q,  z ∈ ℂ and θ_q ≠ 0
Moving average operator: θ(B) = 1 + θ₁B + ⋯ + θ_q B^q
MA(q) (moving average model of order q):
x_t = w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q} ⟺ x_t = θ(B)w_t
E[x_t] = Σ_{j=0}^q θ_j E[w_{t−j}] = 0
γ(h) = Cov[x_{t+h}, x_t] = σ_w² Σ_{j=0}^{q−h} θ_jθ_{j+h} for 0 ≤ h ≤ q, and 0 for h > q
MA(1): x_t = w_t + θw_{t−1}
γ(h) = (1 + θ²)σ_w² for h = 0, θσ_w² for h = 1, 0 for h > 1
ρ(h) = θ/(1 + θ²) for h = 1, 0 for h > 1
ARMA(p, q): x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q} ⟺ φ(B)x_t = θ(B)w_t
Partial autocorrelation function (PACF):
• x_i^{h−1}: regression of x_i on {x_{h−1}, x_{h−2}, …, x₁}
• φ_hh = corr(x_h − x_h^{h−1}, x₀ − x₀^{h−1}) for h ≥ 2
• E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)
ARIMA(p, d, q): ∇ᵈx_t = (1 − B)ᵈx_t is ARMA(p, q), i.e., φ(B)(1 − B)ᵈx_t = θ(B)w_t
Exponentially Weighted Moving Average (EWMA): x_t = x_{t−1} + w_t − λw_{t−1}
x_t = Σ_{j=1}^∞ (1 − λ)λ^{j−1}x_{t−j} + w_t when |λ| < 1
x̃_{n+1} = (1 − λ)x_n + λx̃_n
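To connect the AR(1) formulas with the estimators of §21.2, the sketch below (NumPy assumed; φ and the sample size are arbitrary) simulates an AR(1) process and compares the sample ACF ρ̂(h) with the theoretical ρ(h) = φʰ:

```python
import numpy as np

def sample_acf(x, max_lag):
    x = x - x.mean()
    n = len(x)
    gamma = lambda h: np.sum(x[h:] * x[:n - h]) / n     # γ̂(h) as defined in §21.2
    g0 = gamma(0)
    return np.array([gamma(h) / g0 for h in range(max_lag + 1)])

rng = np.random.default_rng(12)
phi, n = 0.7, 5000
w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]       # AR(1): x_t = φ x_{t-1} + w_t

print(sample_acf(x, 5))                # compare with φ^h = 1, 0.7, 0.49, 0.343, ...
```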
Seasonal ARIMA
• Denoted by ARIMA(p, d, q) × (P, D, Q)_s
• Φ_P(Bˢ)φ(B)∇_sᴰ∇ᵈx_t = δ + Θ_Q(Bˢ)θ(B)w_t

21.4.1 Causality and Invertibility
ARMA(p, q) is causal (future-independent) ⟺ there exist {ψ_j} with Σ_{j=0}^∞ |ψ_j| < ∞ such that
x_t = Σ_{j=0}^∞ ψ_j w_{t−j} = ψ(B)w_t
ARMA(p, q) is invertible ⟺ there exist {π_j} with Σ_{j=0}^∞ |π_j| < ∞ such that
π(B)x_t = Σ_{j=0}^∞ π_j x_{t−j} = w_t
Properties:
• ARMA(p, q) causal ⟺ roots of φ(z) lie outside the unit circle; then ψ(z) = Σ_{j=0}^∞ ψ_j zʲ = θ(z)/φ(z), |z| ≤ 1
• ARMA(p, q) invertible ⟺ roots of θ(z) lie outside the unit circle; then π(z) = Σ_{j=0}^∞ π_j zʲ = φ(z)/θ(z), |z| ≤ 1

Behavior of the ACF and PACF for causal and invertible ARMA models:
         AR(p)                  MA(q)                  ARMA(p, q)
ACF      tails off              cuts off after lag q   tails off
PACF     cuts off after lag p   tails off              tails off

21.5 Spectral Analysis
Periodic process: x_t = A cos(2πωt + φ) = U₁cos(2πωt) + U₂sin(2πωt)
• Frequency index ω (cycles per unit time), period 1/ω
• Amplitude A
• Phase φ
• U₁ = A cos φ and U₂ = A sin φ, often normally distributed rv's
Periodic mixture: x_t = Σ_{k=1}^q (U_{k1}cos(2πω_k t) + U_{k2}sin(2πω_k t))
• U_{k1}, U_{k2}, for k = 1, …, q, are independent zero-mean rv's with variances σ_k²
• γ(h) = Σ_{k=1}^q σ_k² cos(2πω_k h)
• γ(0) = E[x_t²] = Σ_{k=1}^q σ_k²
Spectral representation of a periodic process:
γ(h) = σ² cos(2πω₀h) = (σ²/2)e^{−2πiω₀h} + (σ²/2)e^{2πiω₀h} = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω)
Spectral distribution function:
F(ω) = 0 for ω < −ω₀, σ²/2 for −ω₀ ≤ ω < ω₀, σ² for ω ≥ ω₀
• F(−∞) = F(−1/2) = 0
• F(∞) = F(1/2) = γ(0)
Spectral density:
f(ω) = Σ_{h=−∞}^∞ γ(h)e^{−2πiωh},  −1/2 ≤ ω ≤ 1/2
• Needs Σ_{h=−∞}^∞ |γ(h)| < ∞ ⟹ γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω, h = 0, ±1, …
• f(ω) ≥ 0
• f(ω) = f(−ω)
• f(ω) = f(1 − ω)
• γ(0) = V[x_t] = ∫_{−1/2}^{1/2} f(ω) dω
• White noise: f_w(ω) = σ_w²
• ARMA(p, q), φ(B)x_t = θ(B)w_t: f_x(ω) = σ_w² |θ(e^{−2πiω})|²/|φ(e^{−2πiω})|², where φ(z) = 1 − Σ_{k=1}^p φ_k zᵏ and θ(z) = 1 + Σ_{k=1}^q θ_k zᵏ
Discrete Fourier Transform (DFT): d(ω_j) = n^{−1/2} Σ_{t=1}^n x_t e^{−2πiω_j t}
Fourier/fundamental frequencies: ω_j = j/n
Inverse DFT: x_t = n^{−1/2} Σ_{j=0}^{n−1} d(ω_j)e^{2πiω_j t}
Periodogram: I(j/n) = |d(j/n)|²
Scaled periodogram:
P(j/n) = (4/n)I(j/n) = ((2/n) Σ_{t=1}^n x_t cos(2πtj/n))² + ((2/n) Σ_{t=1}^n x_t sin(2πtj/n))²

22 Math

22.1 Gamma Function
• Ordinary: Γ(s) = ∫₀^∞ t^{s−1}e^{−t} dt
• Upper incomplete: Γ(s, x) = ∫_x^∞ t^{s−1}e^{−t} dt
• Lower incomplete: γ(s, x) = ∫₀^x t^{s−1}e^{−t} dt
• Γ(α + 1) = αΓ(α) for α > 1
• Γ(n) = (n − 1)! for n ∈ ℕ
• Γ(1/2) = √π

22.2 Beta Function
• Ordinary: B(x, y) = B(y, x) = ∫₀¹ t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
• Incomplete: B(x; a, b) = ∫₀^x t^{a−1}(1 − t)^{b−1} dt
• Regularized incomplete: I_x(a, b) = B(x; a, b)/B(a, b) = Σ_{j=a}^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) xʲ(1 − x)^{a+b−1−j}   (for a, b ∈ ℕ)
• I₀(a, b) = 0,  I₁(a, b) = 1
• I_x(a, b) = 1 − I_{1−x}(b, a)

22.3 Series
Finite
• Σ_{k=1}^n k = n(n + 1)/2
• Σ_{k=1}^n (2k − 1) = n²
• Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6
• Σ_{k=1}^n k³ = (n(n + 1)/2)²
• Σ_{k=0}^n cᵏ = (c^{n+1} − 1)/(c − 1), c ≠ 1
Binomial
• Σ_{k=0}^n C(n, k) = 2ⁿ
• Σ_{k=0}^n C(r + k, k) = C(r + n + 1, n)
• Σ_{k=0}^n C(k, m) = C(n + 1, m + 1)
• Vandermonde's identity: Σ_{k=0}^r C(m, k)C(n, r − k) = C(m + n, r)
• Binomial theorem: Σ_{k=0}^n C(n, k)a^{n−k}bᵏ = (a + b)ⁿ
Infinite
• Σ_{k=0}^∞ pᵏ = 1/(1 − p) and Σ_{k=1}^∞ pᵏ = p/(1 − p) for |p| < 1
• Σ_{k=0}^∞ kp^{k−1} = (d/dp) Σ_{k=0}^∞ pᵏ = (d/dp)(1/(1 − p)) = 1/(1 − p)² for |p| < 1
• Σ_{k=0}^∞ C(r + k − 1, k) xᵏ = (1 − x)^{−r} for r ∈ ℕ⁺
• Σ_{k=0}^∞ C(α, k) pᵏ = (1 + p)^α for |p| < 1, α ∈ ℂ
22.4 Combinatorics

Sampling (choose k out of n):
• Ordered, without replacement: n!/(n − k)! = ∏_{i=0}^{k−1}(n − i)
• Ordered, with replacement: nᵏ
• Unordered, without replacement: C(n, k) = n!/(k!(n − k)!)
• Unordered, with replacement: C(n − 1 + k, k) = C(n − 1 + k, n − 1)

Stirling numbers, 2nd kind:
{n k} = k{n−1 k} + {n−1 k−1} for 1 ≤ k ≤ n;  {n 0} = 1 if n = 0, 0 otherwise

Partitions:
P_{n+k,k} = Σ_{i=1}^n P_{n,i};  P_{n,k} = 0 for k > n;  P_{n,0} = 0 for n ≥ 1, P_{0,0} = 1

Balls and urns (f : B → U, |B| = n, |U| = m; D = distinguishable, ¬D = indistinguishable). Number of mappings f that are arbitrary / injective / surjective / bijective:
• B : D, U : D — mⁿ; m!/(m − n)! if m ≥ n, else 0; m!{n m}; n! if m = n, else 0
• B : ¬D, U : D — C(m + n − 1, n); C(m, n); C(n − 1, m − 1); 1 if m = n, else 0
• B : D, U : ¬D — Σ_{k=1}^m {n k}; 1 if m ≥ n, else 0; {n m}; 1 if m = n, else 0
• B : ¬D, U : ¬D — Σ_{k=1}^m P_{n,k}; 1 if m ≥ n, else 0; P_{n,m}; 1 if m = n, else 0
References

[1] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[2] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[3] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[Figure: chart of univariate distribution relationships, courtesy Leemis and McQueston [1].]