
Probability and Statistics

Cookbook

Version 0.2.6
19th December, 2017
http://statistics.zone/
Copyright © Matthias Vallentin
Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter delta method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics

This cookbook integrates various topics in probability theory and statistics. It is based on the literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, and is also influenced by other sources [4, 5]. If you find errors or have suggestions for improvements, please get in touch at http://statistics.zone/.
1 Distribution Overview

1.1 Discrete Distributions

For each distribution we list the notation¹, CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform Unif{a, ..., b}
  F_X(x) = 0 for x < a;  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b;  1 for x > b
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2    V[X] = ((b − a + 1)² − 1)/12    M_X(s) = (e^{as} − e^{−(b+1)s})/(s(b − a))

Bernoulli Bern(p)
  F_X(x) = (1 − p)^{1−x}
  f_X(x) = p^x (1 − p)^{1−x}
  E[X] = p    V[X] = p(1 − p)    M_X(s) = 1 − p + pe^s

Binomial Bin(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np    V[X] = np(1 − p)    M_X(s) = (1 − p + pe^s)^n

Multinomial Mult(n, p)
  f_X(x) = (n!/(x_1! ··· x_k!)) p_1^{x_1} ··· p_k^{x_k}  with Σ_{i=1}^k x_i = n
  E[X] = (np_1, ..., np_k)ᵀ    V[X_i] = np_i(1 − p_i),  Cov[X_i, X_j] = −np_i p_j (i ≠ j)
  M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric Hyp(N, m, n)
  F_X(x) ≈ Φ((x − np)/√(np(1 − p)))  (with p = m/N)
  f_X(x) = C(m, x) C(N − m, n − x)/C(N, n)
  E[X] = nm/N    V[X] = nm(N − n)(N − m)/(N²(N − 1))

Negative Binomial NBin(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p    V[X] = r(1 − p)/p²    M_X(s) = (p/(1 − (1 − p)e^s))^r

Geometric Geo(p)
  F_X(x) = 1 − (1 − p)^x,  x ∈ ℕ⁺
  f_X(x) = p(1 − p)^{x−1},  x ∈ ℕ⁺
  E[X] = 1/p    V[X] = (1 − p)/p²    M_X(s) = pe^s/(1 − (1 − p)e^s)

Poisson Po(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!
  f_X(x) = λ^x e^{−λ}/x!
  E[X] = λ    V[X] = λ    M_X(s) = e^{λ(e^s − 1)}

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
[Figure: PMF (top row) and CDF (bottom row) of the discrete Uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]
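As a quick numerical companion to the table above (this sketch is not part of the original cookbook; it assumes NumPy and SciPy are available, and the parameter values are arbitrary), the following Python snippet evaluates PMFs, CDFs, and moments of a few listed discrete distributions and checks them against the closed-form expressions.

```python
import numpy as np
from scipy import stats

n, p, lam = 10, 0.3, 4.0

binom = stats.binom(n, p)      # Bin(n, p)
geom = stats.geom(p)           # Geo(p), support {1, 2, ...}
pois = stats.poisson(lam)      # Po(lambda)

# PMF and CDF at a point, e.g. x = 3
x = 3
print(binom.pmf(x), binom.cdf(x))   # C(n,x) p^x (1-p)^(n-x), I_{1-p}(n-x, x+1)
print(geom.pmf(x), pois.pmf(x))

# Moments against the table's closed forms
assert np.isclose(binom.mean(), n * p)
assert np.isclose(binom.var(), n * p * (1 - p))
assert np.isclose(geom.mean(), 1 / p)
assert np.isclose(geom.var(), (1 - p) / p**2)
assert np.isclose(pois.mean(), lam) and np.isclose(pois.var(), lam)

# MGF of Bin(n, p) at s: (1 - p + p e^s)^n, checked by direct summation
s = 0.2
mgf_direct = sum(np.exp(s * k) * binom.pmf(k) for k in range(n + 1))
assert np.isclose(mgf_direct, (1 - p + p * np.exp(s))**n)
```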
1.2 Continuous Distributions

Uniform Unif(a, b)
  F_X(x) = 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12    M_X(s) = (e^{sb} − e^{sa})/(s(b − a))

Normal N(µ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^{x} φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))
  E[X] = µ    V[X] = σ²    M_X(s) = exp(µs + σ²s²/2)

Log-Normal ln N(µ, σ²)
  F_X(x) = 1/2 + (1/2) erf((ln x − µ)/(√2 σ))
  f_X(x) = (1/(x√(2πσ²))) exp(−(ln x − µ)²/(2σ²))
  E[X] = e^{µ + σ²/2}    V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal MVN(µ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))
  E[X] = µ    V[X] = Σ    M_X(s) = exp(µᵀs + (1/2) sᵀΣs)

Student's t Student(ν)
  F_X(x) = I_x(ν/2, ν/2)
  f_X(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 for ν > 1    V[X] = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2

Chi-square χ²_k
  F_X(x) = γ(k/2, x/2)/Γ(k/2)
  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2−1} e^{−x/2}
  E[X] = k    V[X] = 2k    M_X(s) = (1 − 2s)^{−k/2} for s < 1/2

F F(d₁, d₂)
  F_X(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f_X(x) = √((d₁x)^{d₁} d₂^{d₂}/(d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2)    V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4))

Exponential* Exp(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β    V[X] = β²    M_X(s) = (1 − s/β)^{−1} for s < β

Gamma* Gamma(α, β)
  F_X(x) = γ(α, βx)/Γ(α)
  f_X(x) = (β^α/Γ(α)) x^{α−1} e^{−βx}
  E[X] = α/β    V[X] = α/β²    M_X(s) = (1 − s/β)^{−α} for s < β

Inverse Gamma InvGamma(α, β)
  F_X(x) = Γ(α, β/x)/Γ(α)
  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) for α > 1    V[X] = β²/((α − 1)²(α − 2)) for α > 2
  M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet Dir(α)
  f_X(x) = (Γ(Σ_{i=1}^k α_i)/Π_{i=1}^k Γ(α_i)) Π_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i/Σ_{i=1}^k α_i    V[X_i] = E[X_i](1 − E[X_i])/(Σ_{i=1}^k α_i + 1)

Beta Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}
  E[X] = α/(α + β)    V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (Π_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k)    V[X] = λ²Γ(1 + 2/k) − µ²  (with µ = E[X])
  M_X(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α for x ≥ x_m
  f_X(x) = α x_m^α/x^{α+1} for x ≥ x_m
  E[X] = αx_m/(α − 1) for α > 1    V[X] = x_m² α/((α − 1)²(α − 2)) for α > 2
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) for s < 0

* We use the rate parameterization where β = 1/λ. Some textbooks use β as scale parameter instead [6].

[Figure: PDF (first block) and CDF (second block) of the continuous Uniform, Normal, Log-Normal, Student's t, χ², F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for several parameter settings.]
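The continuous distributions can be exercised the same way. A minimal sketch (not from the original cookbook; assumes NumPy/SciPy, arbitrary parameters): note that SciPy's gamma distribution takes a scale argument, so with the table's rate convention the scale is 1/β, echoing the parameterization caveat in the footnote above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

mu, sigma = 1.0, 2.0
alpha, beta = 3.0, 0.5          # Gamma(alpha, beta) with beta read as a rate

norm = stats.norm(loc=mu, scale=sigma)
gamma = stats.gamma(a=alpha, scale=1 / beta)   # SciPy uses scale = 1 / rate

# Densities and CDFs at a point
x = 1.5
print(norm.pdf(x), norm.cdf(x))
print(gamma.pdf(x), gamma.cdf(x))

# Closed-form moments from the table vs. Monte Carlo estimates
samples = gamma.rvs(size=200_000, random_state=rng)
print(alpha / beta, samples.mean())        # E[X] = alpha / beta
print(alpha / beta**2, samples.var())      # V[X] = alpha / beta^2
```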
2 Probability Theory Law of Total Probability
n n
Definitions X G
P [B] = P [B|Ai ] P [Ai ] Ω= Ai
• Sample space Ω i=1 i=1

• Outcome (point or element) ω ∈ Ω Bayes’ Theorem


• Event A ⊆ Ω
n
• σ-algebra A P [B | Ai ] P [Ai ] G
P [Ai | B] = Pn Ω= Ai
1. ∅ ∈ A j=1 P [B | Aj ] P [Aj ] i=1
S∞
2. A1 , A2 , . . . , ∈ A =⇒ i=1 Ai ∈ A Inclusion-Exclusion Principle
3. A ∈ A =⇒ ¬A ∈ A
n n
r
[ X X \
• Probability Distribution P (−1)r−1

Ai = A ij


1. P [A] ≥ 0 ∀A i=1 r=1 i≤i1 <···<ir ≤n j=1

2. P [Ω] = 1
"∞ #
G ∞
X 3 Random Variables
3. P Ai = P [Ai ]
i=1 i=1 Random Variable (RV)
• Probability space (Ω, A, P) X:Ω→R

Properties Probability Mass Function (PMF)

• P [∅] = 0 fX (x) = P [X = x] = P [{ω ∈ Ω : X(ω) = x}]


• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
Probability Density Function (PDF)
• P [¬A] = 1 − P [A]
b
• P [B] = P [A ∩ B] + P [¬A ∩ B]
Z
P [a ≤ X ≤ b] = f (x) dx
• P [Ω] = 1 P [∅] = 0 a
S T T S
• ¬( n An ) = n ¬An ¬( n An ) = n ¬An DeMorgan
S T Cumulative Distribution Function (CDF)
• P [ n An ] = 1 − P [ n ¬An ]
• P [A ∪ B] = P [A] + P [B] − P [A ∩ B] FX : R → [0, 1] FX (x) = P [X ≤ x]
=⇒ P [A ∪ B] ≤ P [A] + P [B]
1. Nondecreasing: x1 < x2 =⇒ F (x1 ) ≤ F (x2 )
• P [A ∪ B] = P [A ∩ ¬B] + P [¬A ∩ B] + P [A ∩ B]
2. Normalized: limx→−∞ = 0 and limx→∞ = 1
• P [A ∩ ¬B] = P [A] − P [A ∩ B]
3. Right-Continuous: limy↓x F (y) = F (x)
Continuity of Probabilities
S∞ b
• A1 ⊂ A2 ⊂ . . . =⇒ limn→∞ P [An ] = P [A]
Z
where A = i=1 Ai
T∞ P [a ≤ Y ≤ b | X = x] = fY |X (y | x)dy a≤b
• A1 ⊃ A2 ⊃ . . . =⇒ limn→∞ P [An ] = P [A] where A = i=1 Ai a

Independence ⊥
⊥ f (x, y)
fY |X (y | x) =
A⊥
⊥ B ⇐⇒ P [A ∩ B] = P [A] P [B] fX (x)
Conditional Probability Independence

P [A | B] = P [A ∩ B] / P [B]   (P [B] > 0)        1. P [X ≤ x, Y ≤ y] = P [X ≤ x] P [Y ≤ y]
2. fX,Y (x, y) = fX (x)fY (y)
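A tiny numeric sketch of the law of total probability and Bayes' theorem above (the partition/event numbers are made up for illustration and are not from the cookbook):

```python
# Hypothetical numbers: a partition A1 / A2 of the sample space and an event B.
p_A = [0.01, 0.99]            # P[A1], P[A2]
p_B_given_A = [0.95, 0.05]    # P[B | A1], P[B | A2]

# Law of total probability: P[B] = sum_i P[B | Ai] P[Ai]
p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))

# Bayes' theorem: P[A1 | B] = P[B | A1] P[A1] / P[B]
p_A1_given_B = p_B_given_A[0] * p_A[0] / p_B
print(p_B, p_A1_given_B)      # 0.059 and approximately 0.161
```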
Z
3.1 Transformations • E [XY ] = xyfX,Y (x, y) dFX (x) dFY (y)
X,Y
Transformation function
• E [ϕ(Y )] 6= ϕ(E [X]) (cf. Jensen inequality)
Z = ϕ(X)
• P [X ≥ Y ] = 1 =⇒ E [X] ≥ E [Y ]
Discrete • P [X = Y ] = 1 =⇒ E [X] = E [Y ]
X ∞
fZ (z) = P [ϕ(X) = z] = P [{x : ϕ(x) = z}] = P X ∈ ϕ−1 (z) =
 
fX (x)
X
• E [X] = P [X ≥ x] X discrete
x∈ϕ−1 (z) x=1

Continuous Sample mean


n
Z 1X
X̄n = Xi
FZ (z) = P [ϕ(X) ≤ z] = f (x) dx with Az = {x : ϕ(x) ≤ z} n i=1
Az
Conditional expectation
Special case if ϕ strictly monotone Z

d

dx 1 • E [Y | X = x] = yf (y | x) dy
fZ (z) = fX (ϕ−1 (z)) ϕ−1 (z) = fX (x) = fX (x)

dz dz |J| • E [X] = E [E [X | Y ]]
The Rule of the Lazy Statistician
• E [ϕ(X, Y ) | X = x] = ∫_{−∞}^{∞} ϕ(x, y) fY |X (y | x) dy
Z Z ∞
E [Z] = ϕ(x) dFX (x) • E [ϕ(Y, Z) | X = x] = ϕ(y, z)f(Y,Z)|X (y, z | x) dy dz
−∞
Z Z • E [Y + Z | X] = E [Y | X] + E [Z | X]
E [IA (x)] = IA (x) dFX (x) = dFX (x) = P [X ∈ A] • E [ϕ(X)Y | X] = ϕ(X)E [Y | X]
A
• E [Y | X] = c =⇒ Cov [X, Y ] = 0
Convolution
Z ∞ Z z
X,Y ≥0
• Z := X + Y fZ (z) = fX,Y (x, z − x) dx = fX,Y (x, z − x) dx
−∞ 0 5 Variance
Z ∞
• Z := |X − Y | fZ (z) = 2 fX,Y (x, z + x) dx Definition and properties
0
Z ∞ Z ∞ 2
    2
X ⊥
⊥ • V [X] = σX = E (X − E [X])2 = E X 2 − E [X]
• Z := fZ (z) = |y|fX,Y (yz, y) dy = |y|fX (yz)fY (y) dy " n # n
Y −∞ −∞ X X X
• V Xi = V [Xi ] + Cov [Xi , Xj ]
i=1 i=1 i6=j
4 Expectation " n
X
# n
X
• V Xi = V [Xi ] if Xi ⊥
⊥ Xj
Definition and properties i=1 i=1
X

 xfX (x) X discrete Standard deviation p
sd[X] = V [X] = σX

Z  x

• E [X] = µX = x dFX (x) = Covariance

 Z
 xfX (x) dx X continuous


• Cov [X, Y ] = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ]
• P [X = c] = 1 =⇒ E [X] = c • Cov [X, a] = 0
• E [cX] = c E [X] • Cov [X, X] = V [X]
• E [X + Y ] = E [X] + E [Y ] • Cov [X, Y ] = Cov [Y, X]
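The expectation, variance, and covariance identities in Sections 4 and 5 have direct sample analogues; the following sketch (not part of the cookbook; assumes NumPy, simulated data) checks a few of them numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

X = rng.normal(0.0, 1.0, size=n)
Y = 2.0 * X + rng.normal(0.0, 0.5, size=n)   # correlated with X by construction

# Sample mean and unbiased sample variance S^2 (the 1/(n-1) factor)
print(X.mean(), X.var(ddof=1))

# Linearity of expectation: E[X + Y] = E[X] + E[Y]
print(np.isclose((X + Y).mean(), X.mean() + Y.mean()))

# Cov[X, Y] = E[XY] - E[X] E[Y], and V[X + Y] = V[X] + V[Y] + 2 Cov[X, Y]
cov_xy = (X * Y).mean() - X.mean() * Y.mean()
print(cov_xy, np.cov(X, Y, ddof=1)[0, 1])
print((X + Y).var(), X.var() + Y.var() + 2 * cov_xy)
```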
• Cov [aX, bY ] = abCov [X, Y ] 7 Distribution Relationships
• Cov [X + a, Y + b] = Cov [X, Y ]

n m

n X m
Binomial
X X X
n
• Cov  Xi , Yj  = Cov [Xi , Yj ] X
i=1 j=1 i=1 j=1
• Xi ∼ Bern (p) =⇒ Xi ∼ Bin (n, p)
i=1
Correlation • X ∼ Bin (n, p) , Y ∼ Bin (m, p) =⇒ X + Y ∼ Bin (n + m, p)
Cov [X, Y ]
ρ [X, Y ] = p • limn→∞ Bin (n, p) = Po (np) (n large, p small)
V [X] V [Y ] • limn→∞ Bin (n, p) = N (np, np(1 − p)) (n large, p far from 0 and 1)
Independence
Negative Binomial
X⊥
⊥ Y =⇒ ρ [X, Y ] = 0 ⇐⇒ Cov [X, Y ] = 0 ⇐⇒ E [XY ] = E [X] E [Y ]
• X ∼ NBin (1, p) = Geo (p)
Pr
Sample variance • X ∼ NBin (r, p) = i=1 Geo (p)
n P P
1 X • Xi ∼ NBin (ri , p) =⇒ Xi ∼ NBin ( ri , p)
S2 = (Xi − X̄n )2
n − 1 i=1 • X ∼ NBin (r, p) . Y ∼ Bin (s + r, p) =⇒ P [X ≤ s] = P [Y ≥ r]
Conditional variance Poisson
    2 n n
!
• V [Y | X] = E (Y − E [Y | X])2 | X = E Y 2 | X − E [Y | X] X X
• Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi ∼ Po λi
• V [Y ] = E [V [Y | X]] + V [E [Y | X]]
i=1 i=1
 
n n
X X λ i
6 Inequalities • Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi Xj ∼ Bin  Xj , Pn 
j=1 j=1 j=1 λ j

Cauchy-Schwarz
2 Exponential
E [XY ] ≤ E X 2 E Y 2
   
n
X
Markov • Xi ∼ Exp (β) ∧ Xi ⊥
⊥ Xj =⇒ Xi ∼ Gamma (n, β)
E [ϕ(X)]
P [ϕ(X) ≥ t] ≤ i=1
t • Memoryless property: P [X > x + y | X > y] = P [X > x]
Chebyshev
V [X] Normal
P [|X − E [X]| ≥ t] ≤
t2  
X−µ

Chernoff • X ∼ N µ, σ 2 =⇒ σ ∼ N (0, 1)
δ
 
e 
• X ∼ N µ, σ ∧ Z = aX + b =⇒ Z ∼ N aµ + b, a2 σ 2
2

P [X ≥ (1 + δ)µ] ≤ δ > −1
(1 + δ)1+δ 
• Xi ∼ N µi , σi2 ∧ Xi ⊥⊥ Xj =⇒
P
Xi ∼ N
P
µi , i σi2
P 
i i
Hoeffding  
• P [a < X ≤ b] = Φ b−µ − Φ a−µ

σ σ
X1 , . . . , Xn independent ∧ P [Xi ∈ [ai , bi ]] = 1 ∧ 1 ≤ i ≤ n • Φ(−x) = 1 − Φ(x) φ0 (x) = −xφ(x) φ00 (x) = (x2 − 1)φ(x)
−1
2 • Upper quantile of N (0, 1): zα = Φ (1 − α)
P X̄ − E X̄ ≥ t ≤ e−2nt t > 0
   

Gamma
2n2 t2
 
   
P |X̄ − E X̄ | ≥ t ≤ 2 exp − Pn 2
t>0
i=1 (bi − ai ) • X ∼ Gamma (α, β) ⇐⇒ X/β ∼ Gamma (α, 1)

Jensen • Gamma (α, β) ∼ i=1 Exp (β)
P P
E [ϕ(X)] ≥ ϕ(E [X]) ϕ convex • Xi ∼ Gamma (αi , β) ∧ Xi ⊥
⊥ Xj =⇒ i Xi ∼ Gamma ( i αi , β)
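Distribution relationships such as "a sum of n independent exponentials is Gamma(n, β)" are easy to sanity-check by simulation. An illustrative sketch (not from the cookbook; assumes NumPy/SciPy and uses the mean/scale form for the exponential draws, with the Gamma compared under the same convention):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, beta = 5, 2.0              # sum of n iid exponentials, each with mean beta

# 50,000 replicates of X_1 + ... + X_n
sums = rng.exponential(scale=beta, size=(50_000, n)).sum(axis=1)

# Compare against Gamma(shape = n, scale = beta) with a Kolmogorov-Smirnov test
ks = stats.kstest(sums, stats.gamma(a=n, scale=beta).cdf)
print(ks.statistic, ks.pvalue)    # small statistic / large p-value: consistent
```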
Z ∞
Γ(α) 9.2 Bivariate Normal
• = xα−1 e−λx dx
λα 0  
Let X ∼ N µx , σx2 and Y ∼ N µy , σy2 .
Beta  
1 Γ(α + β) α−1 1 z
• xα−1 (1 − x)β−1 = x (1 − x)β−1 f (x, y) = exp −
2(1 − ρ2 )
p
B(α, β) Γ(α)Γ(β) 2πσx σy 1 − ρ2
  B(α + k, β) α+k−1
E X k−1
  " #
• E Xk =
2 2
=
  
B(α, β) α+β+k−1 x − µx y − µy x − µx y − µy
z= + − 2ρ
• Beta (1, 1) ∼ Unif (0, 1) σx σy σx σy
Conditional mean and variance
8 Probability and Moment Generating Functions E [X | Y ] = E [X] + ρ
σX
(Y − E [Y ])
  σY
• GX (t) = E tX |t| < 1 p
V [X | Y ] = σX 1 − ρ2
"∞ # ∞  
X (Xt)i X E Xi
· ti
 
• MX (t) = GX (et ) = E eXt = E =
i=0
i! i=0
i!
9.3 Multivariate Normal
• P [X = 0] = GX (0)
• P [X = 1] = G0X (0) Covariance matrix Σ (Precision matrix Σ−1 )
(i)
GX (0)  
• P [X = i] = V [X1 ] · · · Cov [X1 , Xk ]
i! .. .. ..
Σ=
 
• E [X] = G0X (1− ) . . . 
  (k)
• E X k = MX (0) Cov [Xk , X1 ] · · · V [Xk ]
 
X! (k) If X ∼ N (µ, Σ),
• E = GX (1− )
(X − k)!  
2 1
• V [X] = G00X (1− ) + G0X (1− ) − (G0X (1− )) fX (x) = (2π) −n/2
|Σ|
−1/2
exp − (x − µ)T Σ−1 (x − µ)
d 2
• GX (t) = GY (t) =⇒ X = Y
Properties
9 Multivariate Distributions • Z ∼ N (0, 1) ∧ X = µ + Σ1/2 Z =⇒ X ∼ N (µ, Σ)
• X ∼ N (µ, Σ) =⇒ Σ−1/2 (X − µ) ∼ N (0, 1)
9.1 Standard Bivariate Normal • X ∼ N (µ, Σ) =⇒ AX ∼ N Aµ, AΣAT

p 
Let X, Y ∼ N (0, 1) ∧ X ⊥
⊥ Z where Y = ρX + 1 − ρ2 Z • X ∼ N (µ, Σ) ∧ kak = k =⇒ aT X ∼ N aT µ, aT Σa

Joint density
1 x2 + y 2 − 2ρxy
  10 Convergence
f (x, y) = exp −
2(1 − ρ2 )
p
2π 1 − ρ2 Let {X1 , X2 , . . .} be a sequence of rv’s and let X be another rv. Let Fn denote
Conditionals the cdf of Xn and let F denote the cdf of X.
Types of Convergence
(Y | X = x) ∼ N ρx, 1 − ρ2 (X | Y = y) ∼ N ρy, 1 − ρ2
 
and D
1. In distribution (weakly, in law): Xn → X
Independence
X ⊥⊥ Y ⇐⇒ ρ = 0          lim_{n→∞} Fn (t) = F (t) ∀t where F continuous
P
2. In probability: Xn → X √
X̄n − µ n(X̄n − µ) D
Zn := q   = →Z where Z ∼ N (0, 1)
(∀ε > 0) lim P [|Xn − X| > ε] = 0 σ
n→∞ V X̄n
as
3. Almost surely (strongly): Xn → X lim P [Zn ≤ z] = Φ(z) z∈R
n→∞
h i h i
P lim Xn = X = P ω ∈ Ω : lim Xn (ω) = X(ω) = 1 CLT notations
n→∞ n→∞

qm
Zn ≈ N (0, 1)
4. In quadratic mean (L2 ): Xn → X
σ2
 
X̄n ≈ N µ,
lim E (Xn − X)2 = 0 n
 
n→∞
σ2
 
X̄n − µ ≈ N 0,
Relationships n
√ 2

qm P D n(X̄n − µ) ≈ N 0, σ
• Xn → X =⇒ Xn → X =⇒ Xn → X √
as
• Xn → X =⇒ Xn → X
P n(X̄n − µ)
≈ N (0, 1)
D P
• Xn → X ∧ (∃c ∈ R) P [X = c] = 1 =⇒ Xn → X σ
P P P
• Xn →X ∧ Yn → Y =⇒ Xn + Yn → X + Y
qm qm qm
• Xn →X ∧ Yn → Y =⇒ Xn + Yn → X + Y Continuity correction
P P P
• Xn →X ∧ Yn → Y =⇒ Xn Yn → XY
x + 12 − µ
P P
 
• Xn →X =⇒ ϕ(Xn ) → ϕ(X)  
P X̄n ≤ x ≈ Φ √
D
• Xn → X =⇒ ϕ(Xn ) → ϕ(X)
D σ/ n
qm
• Xn → b ⇐⇒ limn→∞ E [Xn ] = b ∧ limn→∞ V [Xn ] = 0
x − 12 − µ
 
qm
 
• X1 , . . . , Xn iid ∧ E [X] = µ ∧ V [X] < ∞ ⇐⇒ X̄n → µ P X̄n ≥ x ≈ 1 − Φ √
σ/ n
Slutsky’s Theorem Delta method
D P D
• Xn → X and Yn → c =⇒ Xn + Yn → X + c 
σ2
 
2 σ2

D P D
• Xn → X and Yn → c =⇒ Xn Yn → cX Yn ≈ N µ, =⇒ ϕ(Yn ) ≈ N ϕ(µ), (ϕ0 (µ))
n n
D D D
• In general: Xn → X and Yn → Y =⇒
6 Xn + Yn → X + Y
11 Statistical Inference
10.1 Law of Large Numbers (LLN)
iid
Let X1 , · · · , Xn ∼ F if not otherwise noted.
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ.
Weak (WLLN)
P
X̄n → µ n→∞ 11.1 Point Estimation
Strong (SLLN) • Point estimator θbn of θ is a rv: θbn = g(X1 , . . . , Xn )
as
h i
X̄n → µ n→∞ • bias(θbn ) = E θbn − θ
P
• Consistency: θbn → θ
10.2 Central Limit Theorem (CLT)
• Sampling distribution: F (θbn )
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] = σ 2 .
r h i
• Standard error: se(θn ) = V θbn
b
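A small simulation of the CLT statement above (not part of the cookbook; assumes NumPy/SciPy, arbitrary sample size and seed): standardized means of iid Exponential draws behave approximately like N(0, 1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 50, 20_000
mu, sigma = 1.0, 1.0               # Exp(1) has mean 1 and variance 1

# Z_n = sqrt(n) (Xbar_n - mu) / sigma for each replication
xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma

# Empirical CDF of Z_n at a few points vs. the standard normal CDF Phi
for t in (-1.0, 0.0, 1.0):
    print(t, (z <= t).mean(), stats.norm.cdf(t))
```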
h i h i
• Mean squared error: mse = E (θbn − θ)2 = bias(θbn )2 + V θbn 11.4 Statistical Functionals
• limn→∞ bias(θbn ) = 0 ∧ limn→∞ se(θbn ) = 0 =⇒ θbn is consistent • Statistical functional: T (F )
θbn − θ D • Plug-in estimator of θ = (F ): θbn = T (Fbn )
• Asymptotic normality: → N (0, 1) R
se • Linear functional: T (F ) = ϕ(x) dFX (x)
• Slutsky’s Theorem often lets us replace se(θbn ) by some (weakly) consis- • Plug-in estimator for linear functional:
tent estimator σ
bn . Z n
1X
T (Fbn ) = ϕ(x) dFbn (x) = ϕ(Xi )
11.2 Normal-Based Confidence Interval n i=1
 
b 2 . Let zα/2 = Φ−1 (1 − (α/2)), i.e., P Z > zα/2 = α/2
 
Suppose θbn ≈ N θ, se
 
  b 2 =⇒ T (Fbn ) ± zα/2 se
• Often: T (Fbn ) ≈ N T (F ), se b
and P −zα/2 < Z < zα/2 = 1 − α where Z ∼ N (0, 1). Then
• pth quantile: F −1 (p) = inf{x : F (x) ≥ p}
Cn = θbn ± zα/2 se
b • µb = X̄n
n
1 X
b2 =
• σ (Xi − X̄n )2
11.3 Empirical distribution n − 1 i=1
1
Pn
Empirical Distribution Function (ECDF) n i=1 (Xi − µb)3
• κ
b=
Pn
I(Xi ≤ x) b3

Fn (x) = i=1
b n
i=1 (Xi − X̄n )(Yi − Ȳn )
n • ρb = qP qP
n 2 n 2
(X − X̄ ) i=1 (Yi − Ȳn )
(
1 Xi ≤ x i=1 i n
I(Xi ≤ x) =
0 Xi > x
Properties (for any fixed x) 12 Parametric Inference
h i
• E Fbn = F (x)

Let F = f (x; θ) : θ ∈ Θ be a parametric model with parameter space Θ ⊂ Rk
h i F (x)(1 − F (x)) and parameter θ = (θ1 , . . . , θk ).
• V Fbn =
n
F (x)(1 − F (x)) D 12.1 Method of Moments
• mse = →0
n
P j th moment
• Fbn → F (x) Z
αj (θ) = E X j = xj dFX (x)
 
Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X1 , . . . , Xn ∼ F )
 
P sup F (x) − Fn (x) > ε = 2e−2nε
b 2
j th sample moment
x n
1X j
Nonparametric 1 − α confidence band for F α
bj = X
n i=1 i
L(x) = max{Fbn − n , 0}
Method of Moments estimator (MoM)
U (x) = min{Fbn + n , 1}
s   α1 (θ) = α
b1
1 2
= log α2 (θ) = α
b2
2n α
.. ..
.=.
P [L(x) ≤ F (x) ≤ U (x) ∀x] ≥ 1 − α αk (θ) = α
bk
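The empirical CDF and the DKW band L(x), U(x) from §11.3 can be computed directly. A sketch under simulated data (not from the cookbook; assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(5)
n, alpha = 200, 0.05
x_sample = rng.normal(size=n)

def ecdf(t, data):
    """F_hat_n(t) = (1/n) * #{ X_i <= t } evaluated on a grid t."""
    return np.mean(data[:, None] <= t, axis=0)

# DKW half-width: eps_n = sqrt(log(2 / alpha) / (2 n))
eps = np.sqrt(np.log(2 / alpha) / (2 * n))

grid = np.linspace(-3, 3, 7)
F_hat = ecdf(grid, x_sample)
lower = np.maximum(F_hat - eps, 0.0)
upper = np.minimum(F_hat + eps, 1.0)
print(np.c_[grid, lower, F_hat, upper])   # nonparametric 1 - alpha band for F
```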
Properties of the MoM estimator • Equivariance: θbn is the mle =⇒ ϕ(θbn ) is the mle of ϕ(θ)
• θbn exists with probability tending to 1 • Asymptotic optimality (or efficiency), i.e., smallest variance for large sam-
P
• Consistency: θbn → θ ples. If θen is any other estimator, the asymptotic relative efficiency is:
p
• Asymptotic normality: 1. se ≈ 1/In (θ)
√ (θbn − θ) D
D
n(θb − θ) → N (0, Σ) → N (0, 1)
se
  q
where Σ = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T , b ≈ 1/In (θbn )
2. se
∂ −1
g = (g1 , . . . , gk ) and gj = ∂θ αj (θ)
(θbn − θ) D
→ N (0, 1)
se
b
12.2 Maximum Likelihood • Asymptotic optimality
Likelihood: Ln : Θ → [0, ∞) h i
V θbn
n
Y are(θen , θbn ) = h i ≤ 1
Ln (θ) = f (Xi ; θ) V θen
i=1
• Approximately the Bayes estimator
Log-likelihood
n
X 12.2.1 Delta Method
`n (θ) = log Ln (θ) = log f (Xi ; θ)
i=1 b where ϕ is differentiable and ϕ0 (θ) 6= 0:
If τ = ϕ(θ)
Maximum likelihood estimator (mle)
τn − τ ) D
(b
→ N (0, 1)
Ln (θbn ) = sup Ln (θ) se(b
b τ)
θ
where τb = ϕ(θ)
b is the mle of τ and
Score function

s(X; θ) = log f (X; θ) b = ϕ0 (θ)
se se(
b θn )
b b
∂θ
Fisher information
I(θ) = Vθ [s(X; θ)] 12.3 Multiparameter Models
In (θ) = nI(θ) Let θ = (θ1 , . . . , θk ) and θb = (θb1 , . . . , θbk ) be the mle.
Fisher information (exponential family)
∂ 2 `n ∂ 2 `n
  Hjj = Hjk =
∂ ∂θ2 ∂θj ∂θk
I(θ) = Eθ − s(X; θ)
∂θ Fisher information matrix
Observed Fisher information 
Eθ [H11 ] ··· Eθ [H1k ]

n
In (θ) = −  .. .. ..
∂2 X
 
. . .
Inobs (θ) = −

log f (Xi ; θ)
∂θ2 i=1 Eθ [Hk1 ] · · · Eθ [Hkk ]

Properties of the mle Under appropriate regularity conditions


P
• Consistency: θbn → θ (θb − θ) ≈ N (0, Jn )
with Jn (θ) = In−1 . Further, if θbj is the j th component of θ, then • Critical value c
• Test statistic T
(θbj − θj ) D • Rejection region R = {x : T (x) > c}
→ N (0, 1)
se
bj • Power function β(θ) = P [X ∈ R]
h i • Power of a test: 1 − P [Type II error] = 1 − β = inf β(θ)
b 2j = Jn (j, j) and Cov θbj , θbk = Jn (j, k)
where se θ∈Θ1
• Test size: α = P [Type I error] = sup β(θ)
θ∈Θ0
12.3.1 Multiparameter delta method
Let τ = ϕ(θ1 , . . . , θk ) and let the gradient of ϕ be Retain H0 Reject H0


∂ϕ
 H0 true Type
√ I Error (α)
 ∂θ1  H1 true Type II Error (β) (power)
 . 
p-value
 .. 
∇ϕ =  
 ∂ϕ 
∂θk

• p-value = supθ∈Θ0 Pθ [T (X) ≥ T (x)] = inf α : T (x) ∈ Rα
Pθ [T (X ? ) ≥ T (X)]

• p-value = supθ∈Θ0 = inf α : T (X) ∈ Rα
Suppose ∇ϕ θ=θb 6= 0 and τb = ϕ(θ).
b Then, | {z }
1−Fθ (T (X)) since T (X ? )∼Fθ
τ − τ) D
(b
→ N (0, 1)
se(b
b τ)
p-value evidence
where r < 0.01 very strong evidence against H0
T
0.01 − 0.05 strong evidence against H0
  
se(b
b τ) = ∇ϕ
b Jbn ∇ϕ
b
0.05 − 0.1 weak evidence against H0
b and ∇ϕ

b = ∇ϕ b. > 0.1 little or no evidence against H0
and Jbn = Jn (θ) θ=θ
Wald test
12.4 Parametric Bootstrap
• Two-sided test
Sample from f (x; θbn ) instead of from Fbn , where θbn could be the mle or method
of moments estimator. θb − θ0
• Reject H0 when |W | > zα/2 where W =
  se
b
• P |W | > zα/2 → α
13 Hypothesis Testing • p-value = Pθ0 [|W | > |w|] ≈ P [|Z| > |w|] = 2Φ(−|w|)

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1
Likelihood ratio test
Definitions

• Null hypothesis H0 supθ∈Θ Ln (θ) Ln (θbn )


• Alternative hypothesis H1 • T (X) = =
supθ∈Θ0 Ln (θ) Ln (θbn,0 )
• Simple hypothesis θ = θ0 k
• Composite hypothesis θ > θ0 or θ < θ0 iid
D
X
• λ(X) = 2 log T (X) → χ2r−q where Zi2 ∼ χ2k and Z1 , . . . , Zk ∼ N (0, 1)
• Two-sided test: H0 : θ = θ0 versus H1 : θ 6= θ0
 i=1 
• One-sided test: H0 : θ ≤ θ0 versus H1 : θ > θ0 • p-value = Pθ0 [λ(X) > λ(x)] ≈ P χ2r−q > λ(x)
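A sketch tying §12.2 to the Wald test above (not from the cookbook; assumes NumPy/SciPy, simulated Bernoulli data): the mle is p̂ = X̄n, the Fisher information is I(p) = 1/(p(1 − p)), so se_hat = √(1/I_n(p̂)) = √(p̂(1 − p̂)/n), and W = (p̂ − p₀)/se_hat is compared with z_{α/2}.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p_true, p0, alpha = 400, 0.55, 0.5, 0.05

x = rng.binomial(1, p_true, size=n)      # X_1..X_n ~ Bern(p_true)

# Maximum likelihood: for the Bernoulli model the mle is the sample mean
p_hat = x.mean()

# Fisher information I(p) = 1/(p(1-p)), so se_hat = sqrt(1 / I_n(p_hat))
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

# Wald test of H0: p = p0 against H1: p != p0
w = (p_hat - p0) / se_hat
p_value = 2 * stats.norm.cdf(-abs(w))    # p-value = 2 * Phi(-|w|)
print(p_hat, se_hat, w, p_value, abs(w) > stats.norm.ppf(1 - alpha / 2))
```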
Multinomial LRT Natural form
 
X1 Xk
• mle: pbn = ,..., fX (x | η) = h(x) exp {η · T(x) − A(η)}
n n
k
Y  pbj Xj = h(x)g(η) exp {η · T(x)}
Ln (b
pn )
• T (X) = = = h(x)g(η) exp η T T(x)

Ln (p0 ) j=1
p0j
k  
X pbj D
• λ(X) = 2 Xj log → χ2k−1 15 Bayesian Inference
j=1
p 0j

• The approximate size α LRT rejects H0 when λ(X) ≥ χ2k−1,α Bayes’ Theorem
Pearson Chi-square Test f (x | θ)f (θ) f (x | θ)f (θ)
f (θ | x) = =R ∝ Ln (θ)f (θ)
k f (xn ) f (x | θ)f (θ) dθ
X (Xj − E [Xj ])2
• T = where E [Xj ] = np0j under H0
j=1
E [Xj ] Definitions
D
• T → χ2k−1 • X n = (X1 , . . . , Xn )
 
• p-value = P χ2k−1 > T (x) • xn = (x1 , . . . , xn )
D
2
• Faster → Xk−1 than LRT, hence preferable for small n • Prior density f (θ)
• Likelihood f (xn | θ): joint density of the data
Independence testing Yn
In particular, X n iid =⇒ f (xn | θ) = f (xi | θ) = Ln (θ)
• I rows, J columns, X multinomial sample of size n = I ∗ J i=1
X
• mles unconstrained: pbij = nij • Posterior density f (θ | xn )
X
• Normalizing constant cn = f (xn ) = f (x | θ)f (θ) dθ
R
• mles under H0 : pb0ij = pbi· pb·j = Xni· n·j
• Kernel: part of a density that dependsRon θ
 
PI PJ nX
• LRT: λ = 2 i=1 j=1 Xij log Xi· Xij·j θL (θ)f (θ)dθ
• Posterior mean θ̄n = θf (θ | xn ) dθ = R Lnn(θ)f (θ) dθ
R
PI PJ (X −E[X ])2
• PearsonChiSq: T = i=1 j=1 ijE[Xij ]ij
D
• LRT and Pearson → χ2ν , where ν = (I − 1)(J − 1)
Posterior interval
14 Exponential Family Z b
P [θ ∈ (a, b) | xn ] = f (θ | xn ) dθ = 1 − α
Scalar parameter a

fX (x | θ) = h(x) exp {η(θ)T (x) − A(θ)} Equal-tail credible interval


= h(x)g(θ) exp {η(θ)T (x)} Z a Z ∞
f (θ | xn ) dθ = f (θ | xn ) dθ = α/2
Vector parameter −∞ b

Highest posterior density (HPD) region Rn


( s
)
X
fX (x | θ) = h(x) exp ηi (θ)Ti (x) − A(θ)
i=1 1. P [θ ∈ Rn ] = 1 − α
= h(x) exp {η(θ) · T (x) − A(θ)} 2. Rn = {θ : f (θ | xn ) > k} for some k
= h(x)g(θ) exp {η(θ) · T (x)} Rn is unimodal =⇒ Rn is an interval
15.2 Function of parameters 15.3.1 Conjugate Priors
Continuous likelihood (subscript c denotes constant)
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ }.
Likelihood Conjugate prior Posterior hyperparameters
Posterior CDF for τ 
Unif (0, θ) Pareto(xm , k) max x(n) , xm , k + n
Z Xn
n n n
H(r | x ) = P [ϕ(θ) ≤ τ | x ] = f (θ | x ) dθ Exp (λ) Gamma (α, β) α + n, β + xi
A
i=1
 Pn   
µ0 i=1 xi 1 n
2
 2

Posterior density N µ, σc N µ0 , σ0 + / + 2 ,
σ2 σ2 σ02 σc
 0 c−1
1 n
h(τ | xn ) = H 0 (τ | xn ) + 2
σ02 σc
Pn
 νσ02 + i=1 (xi − µ)2
Bayesian delta method N µc , σ 2 Scaled Inverse Chi- ν + n,
ν+n
square(ν, σ02 )

νλ + nx̄ n

τ | X n ≈ N ϕ(θ),
b seb ϕ0 (θ)

N µ, σ 2
b
Normal- , ν + n, α + ,
ν+n 2
scaled Inverse n 2
1X γ(x̄ − λ)
Gamma(λ, ν, α, β) β+ (xi − x̄)2 +
2 i=1 2(n + γ)
15.3 Priors −1
Σ−1 −1
Σ−1 −1

MVN(µ, Σc ) MVN(µ0 , Σ0 ) 0 + nΣc 0 µ0 + nΣ x̄ ,
−1 −1
Σ−1

Choice 0 + nΣc
Xn
MVN(µc , Σ) Inverse- n + κ, Ψ + (xi − µc )(xi − µc )T
• Subjective Bayesianism: prior should incorporate as much detail as possible Wishart(κ, Ψ) i=1
the researcher’s a priori knowledge—via prior elicitation n
X xi
• Objective Bayesianism: prior should incorporate as little detail as possible Pareto(xmc , k) Gamma (α, β) α + n, β + log
x mc
(non-informative prior) i=1
Pareto(xm , kc ) Pareto(x0 , k0 ) x0 , k0 − kn where k0 > kn
• Robust Bayesianism: consider various priors and determine sensitivity of Xn
our inferences to changes in the prior Gamma (αc , β) Gamma (α0 , β0 ) α0 + nαc , β0 + xi
i=1

Types

• Flat: f (θ) ∝ constant


R∞
• Proper: −∞ f (θ) dθ = 1
R∞
• Improper: −∞ f (θ) dθ = ∞
• Jeffrey’s Prior (transformation-invariant):

p p
f (θ) ∝ I(θ) f (θ) ∝ det(I(θ))

• Conjugate: f (θ) and f (θ | xn ) belong to the same parametric family


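A conjugate update can be carried out directly from the tables in §15.3.1. For the Bernoulli likelihood with a Beta prior (listed in the discrete-likelihood table below), the posterior is Beta(α + Σxᵢ, β + n − Σxᵢ). An illustrative sketch (not from the cookbook; assumes NumPy/SciPy, simulated data and arbitrary hyperparameters), including an equal-tail credible interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Prior Beta(alpha0, beta0) on p, data x_1..x_n ~ Bern(p)
alpha0, beta0 = 2.0, 2.0
x = rng.binomial(1, 0.7, size=50)

# Conjugate update: posterior is Beta(alpha0 + sum(x), beta0 + n - sum(x))
alpha_n = alpha0 + x.sum()
beta_n = beta0 + len(x) - x.sum()
posterior = stats.beta(alpha_n, beta_n)

# Posterior mean and a 95% equal-tail credible interval
print(posterior.mean(), posterior.interval(0.95))
```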
Discrete likelihood Bayes factor
Likelihood Conjugate prior Posterior hyperparameters log10 BF10 BF10 evidence
n n 0 − 0.5 1 − 1.5 Weak
0.5 − 1 1.5 − 10 Moderate
X X
Bern (p) Beta (α, β) α+ xi , β + n − xi
i=1 i=1
1−2 10 − 100 Strong
Xn n
X n
X >2 > 100 Decisive
Bin (p) Beta (α, β) α+ xi , β + Ni − xi
p
i=1 i=1 i=1 1−p BF10
n
X p∗ = p where p = P [H1 ] and p∗ = P [H1 | xn ]
NBin (p) Beta (α, β) α + rn, β + xi 1 + 1−p BF10
i=1
n
16 Sampling Methods
X
Po (λ) Gamma (α, β) α+ xi , β + n
i=1
n
X 16.1 Inverse Transform Sampling
Multinomial(p) Dir (α) α+ x(i)
i=1 Setup
n
X
Geo (p) Beta (α, β) α + n, β + xi • U ∼ Unif (0, 1)
i=1 • X∼F
• F −1 (u) = inf{x | F (x) ≥ u}
15.4 Bayesian Testing Algorithm
1. Generate u ∼ Unif (0, 1)
If H0 : θ ∈ Θ0 :
2. Compute x = F −1 (u)
Z
Prior probability P [H0 ] = f (θ) dθ
Θ0 16.2 The Bootstrap
Z
Posterior probability P [H0 | xn ] = f (θ | xn ) dθ Let Tn = g(X1 , . . . , Xn ) be a statistic.
Θ0
1. Estimate VF [Tn ] with VFbn [Tn ].
2. Approximate VFbn [Tn ] using simulation:
∗ ∗
Let H0 . . .Hk−1 be k hypotheses. Suppose θ ∼ f (θ | Hk ), (a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from
the sampling distribution implied by Fn b
f (xn | Hk )P [Hk ] i. Sample uniformly X1∗ , . . . , Xn∗ ∼ Fbn .
P [Hk | xn ] = PK ,
n
k=1 f (x | Hk )P [Hk ] ii. Compute Tn∗ = g(X1∗ , . . . , Xn∗ ).
(b) Then
Marginal likelihood B B
!2
1 X ∗ 1 X ∗
vboot = VFbn =
b Tn,b − T
B B r=1 n,r
Z
n
f (x | Hi ) = f (xn | θ, Hi )f (θ | Hi ) dθ b=1
Θ
16.2.1 Bootstrap Confidence Intervals
Posterior odds (of Hi relative to Hj )
Normal-based interval
n
P [Hi | x ] n
f (x | Hi ) P [Hi ] Tn ± zα/2 se
b boot
= ×
P [Hj | xn ] f (xn | Hj ) P [Hj ] Pivotal interval
| {z } | {z }
Bayes Factor BFij prior odds 1. Location parameter θ = T (F )
18
2. Pivot Rn = θbn − θ 2. Generate u ∼ Unif (0, 1)
3. Let H(r) = P [Rn ≤ r] be the cdf of Rn Ln (θcand )
∗ ∗
3. Accept θcand if u ≤
4. Let Rn,b = θbn,b − θbn . Approximate H using bootstrap: Ln (θbn )
B
1 X ∗ 16.4 Importance Sampling
H(r)
b = I(Rn,b ≤ r)
B Sample from an importance function g rather than target density h.
b=1
Algorithm to obtain an approximation to E [q(θ) | xn ]:
5. θβ∗ = β sample quantile of (θbn,1
∗ ∗
, . . . , θbn,B ) iid
1. Sample from the prior θ1 , . . . , θn ∼ f (θ)
6. rβ∗ = beta sample quantile of (Rn,1
∗ ∗
, . . . , Rn,B ), i.e., rβ∗ = θβ∗ − θbn
Ln (θi )
2. wi = PB ∀i = 1, . . . , B
 
7. Approximate 1 − α confidence interval Cn = â, b̂ where
i=1 Ln (θi )
PB
3. E [q(θ) | xn ] ≈ i=1 q(θi )wi
b −1 1 − α =
 
∗ ∗
â = θbn − H θbn − r1−α/2 = 2θbn − θ1−α/2
2

b̂ = θbn − Hb −1
2
= ∗
θbn − rα/2 = ∗
2θbn − θα/2 17 Decision Theory
Percentile interval   Definitions
∗ ∗
Cn = θα/2 , θ1−α/2 • Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous for an estimator θb
16.3 Rejection Sampling • Action a ∈ A: possible value of the decision rule. In the estimation
context, the action is just an estimate of θ, θ(x).
b
Setup
• Loss function L: consequences of taking action a when true state is θ or
• We can easily sample from g(θ) discrepancy between θ and θ, b L : Θ × A → [−k, ∞).
• We want to sample from h(θ), but it is difficult Loss functions
k(θ)
• We know h(θ) up to a proportional constant: h(θ) = R • Squared error loss: L(θ, a) = (θ − a)2
k(θ) dθ (
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ K1 (θ − a) a − θ < 0
• Linear loss: L(θ, a) =
K2 (a − θ) a − θ ≥ 0
Algorithm
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K1 = K2 )
1. Draw θcand ∼ g(θ) • Lp loss: L(θ, a) = |θ − a|p
2. Generate u ∼ Unif (0, 1)
(
0 a=θ
k(θcand ) • Zero-one loss: L(θ, a) =
3. Accept θcand if u ≤ 1 a 6= θ
M g(θcand )
4. Repeat until B values of θcand have been accepted
17.1 Risk
Example
Posterior risk
• We can easily sample from the prior g(θ) = f (θ)
Z h i
r(θb | x) = L(θ, θ(x))f
b (θ | x) dθ = Eθ|X L(θ, θ(x))
b
• Target is the posterior h(θ) ∝ k(θ) = f (xn | θ)f (θ)
• Envelope condition: f (xn | θ) ≤ f (xn | θbn ) = Ln (θbn ) ≡ M (Frequentist) risk
• Algorithm Z h i
1. Draw θ cand
∼ f (θ) R(θ, θ)
b = L(θ, θ(x))f
b (x | θ) dx = EX|θ L(θ, θ(X))
b
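The bootstrap of §16.2 and the normal-based interval of §16.2.1 can be sketched as follows (not part of the cookbook; assumes NumPy/SciPy, simulated data, statistic chosen arbitrarily as the sample median):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.exponential(scale=2.0, size=100)     # observed data
B = 2000

# Resample with replacement from F_hat_n and recompute the statistic each time
t_star = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                   for _ in range(B)])

t_n = np.median(x)
se_boot = t_star.std(ddof=1)                 # bootstrap standard error

z = stats.norm.ppf(0.975)
print(t_n, se_boot, (t_n - z * se_boot, t_n + z * se_boot))   # normal-based 95% CI
```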
Bayes risk 18 Linear Regression
ZZ
Definitions
h i
r(f, θ)
b = L(θ, θ(x))f
b (x, θ) dx dθ = Eθ,X L(θ, θ(X))
b
• Response variable Y
• Covariate X (aka predictor variable or feature)
h h ii h i
r(f, θ)
b = Eθ EX|θ L(θ, θ(X)
b = Eθ R(θ, θ)
b

18.1 Simple Linear Regression


h h ii h i
r(f, θ)
b = EX Eθ|X L(θ, θ(X)
b = EX r(θb | X)
Model
17.2 Admissibility Yi = β0 + β1 Xi + i E [i | Xi ] = 0, V [i | Xi ] = σ 2
Fitted line
• θb0 dominates θb if
b0 rb(x) = βb0 + βb1 x
∀θ : R(θ, θ ) ≤ R(θ, θ)
b
Predicted (fitted) values
∃θ : R(θ, θb0 ) < R(θ, θ)
b Ybi = rb(Xi )
• θb is inadmissible if there is at least one other estimator θb0 that dominates Residuals  
it. Otherwise it is called admissible. ˆi = Yi − Ybi = Yi − βb0 + βb1 Xi

Residual sums of squares (rss)


17.3 Bayes Rule
n
X
Bayes rule (or Bayes estimator) rss(βb0 , βb1 ) = ˆ2i
i=1
• r(f, θ)
b = inf e r(f, θ)
θ
e
R Least square estimates
• θ(x)
b = inf r(θb | x) ∀x =⇒ r(f, θ)
b = r(θb | x)f (x) dx
βbT = (βb0 , βb1 )T : min rss
β
b0 ,β
b1
Theorems

• Squared error loss: posterior mean βb0 = Ȳn − βb1 X̄n


Pn Pn
• Absolute error loss: posterior median i=1 (Xi − X̄n )(Yi − Ȳn ) i=1 Xi Yi − nX̄Y
β1 =
b Pn = P n
• Zero-one loss: posterior mode i=1 (Xi − X̄n )
2 2 2
i=1 Xi − nX
 
β0
h i
E βb | X n =
17.4 Minimax Rules β1
σ 2 n−1 ni=1 Xi2 −X n
h i  P 
Maximum risk V βb | X n = 2
R̄(θ)
b = sup R(θ, θ)
b R̄(a) = sup R(θ, a) nsX −X n 1
θ θ r Pn
2
σ i=1 Xi

b
Minimax rule se(
b βb0 ) =
sX n n
sup R(θ, θ)
b = inf R̄(θ)
e = inf sup R(θ, θ)
e
θ θe θe θ σ

b
se(
b βb1 ) =
sX n
θb = Bayes rule ∧ ∃c : R(θ, θ)
b =c Pn Pn 2
where s2X = n−1 i=1 (Xi − X n )2 and σ b2 = n−21
i=1 
ˆi (unbiased estimate).
Least favorable prior Further properties:
P P
θbf = Bayes rule ∧ R(θ, θbf ) ≤ r(f, θbf ) ∀θ • Consistency: βb0 → β0 and βb1 → β1
• Asymptotic normality: 18.3 Multiple Regression
βb0 − β0 D βb1 − β1 D Y = Xβ + 
→ N (0, 1) and → N (0, 1)
se(
b βb0 ) se(
b βb1 )
where
• Approximate 1 − α confidence intervals for β0 and β1 :      
X11 ··· X1k β1 1
 .. ..  β =  ... 
..  .. 
βb0 ± zα/2 se( and βb1 ± zα/2 se( X= . =.
 
b βb0 ) b βb1 ) . . 
Xn1 ··· Xnk βk n
• Wald test for H0 : β1 = 0 vs. H1 : β1 6= 0: reject H0 if |W | > zα/2 where
W = βb1 /se(
b βb1 ). Likelihood
 
1
R2 L(µ, Σ) = (2πσ 2 )−n/2 exp − 2 rss
Pn b 2
Pn 2 2σ
i=1 (Yi − Y ) ˆ rss
2
R = Pn 2
= 1 − Pn i=1 i 2 = 1 −
i=1 (Yi − Y ) i=1 (Yi − Y )
tss
N
X
Likelihood rss = (y − Xβ)T (y − Xβ) = kY − Xβk2 = (Yi − xTi β)2
n n n i=1
Y Y Y
L= f (Xi , Yi ) = fX (Xi ) × fY |X (Yi | Xi ) = L1 × L2
i=1 i=1 i=1 If the (k × k) matrix X T X is invertible,
Yn
L1 = fX (Xi ) βb = (X T X)−1 X T Y
i=1 h i
V βb | X n = σ 2 (X T X)−1
n
( )
Y 1 X 2
−n
L2 = fY |X (Yi | Xi ) ∝ σ exp − 2 Yi − (β0 − β1 Xi )
2σ i βb ≈ N β, σ 2 (X T X)−1

i=1

Under the assumption of Normality, the least squares estimator is also the mle
Estimate regression function
but the least squares variance estimator is not the mle.
n k
1X 2 X
b2 =
σ ˆ rb(x) = βbj xj
n i=1 i j=1

18.2 Prediction Unbiased estimate for σ 2


Observe X = x∗ of the covariate and want to predict their outcome Y∗ . n
1 X 2
b2 =
σ ˆ ˆ = X βb − Y
Yb∗ = βb0 + βb1 x∗ n − k i=1 i
h i h i h i h i
V Yb∗ = V βb0 + x2∗ V βb1 + 2x∗ Cov βb0 , βb1 mle
n−k 2
Prediction interval µ
b = X̄ b2 =
σ σ
 Pn 2
 n
2 2 i=1 (Xi − X∗ )
ξn = σ
b P +1
n i (Xi − X̄)2 j
b
1 − α Confidence interval
Yb∗ ± zα/2 ξbn βbj ± zα/2 se(
b βbj )
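The least-squares estimator β̂ = (XᵀX)⁻¹XᵀY and the unbiased variance estimate σ̂² = rss/(n − k) from §18.1–18.3 can be computed in a few lines. A sketch with synthetic data (not from the cookbook; assumes NumPy, true coefficients chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=n)    # true beta0 = 1.5, beta1 = 0.8

# Design matrix with an intercept column; beta_hat = (X^T X)^{-1} X^T Y
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = y - X @ beta_hat
k = X.shape[1]
sigma2_hat = resid @ resid / (n - k)              # unbiased estimate of sigma^2

# Estimated covariance of beta_hat: sigma2_hat * (X^T X)^{-1}
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
print(beta_hat, np.sqrt(np.diag(cov_beta)))       # estimates and standard errors
```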
18.4 Model Selection Akaike Information Criterion (AIC)
Consider predicting a new observation Y ∗ for covariates X ∗ and let S ⊂ J
denote a subset of the covariates in the model, where |S| = k and |J| = n. bS2 ) − k
AIC(S) = `n (βbS , σ
Issues
Bayesian Information Criterion (BIC)
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance k
bS2 ) −
BIC(S) = `n (βbS , σ log n
Procedure 2

1. Assign a score to each model Validation and training


2. Search through all models to find the one with the highest score
m
X n n
Hypothesis testing R
bV (S) = (Ybi∗ (S) − Yi∗ )2 m = |{validation data}|, often or
i=1
4 2
H0 : βj = 0 vs. H1 : βj 6= 0 ∀j ∈ J
Leave-one-out cross-validation
Mean squared prediction error (mspe)
n n
!2
h i X X Yi − Ybi (S)
mspe = E (Yb (S) − Y ∗ )2 R
bCV (S) = (Yi − Yb(i) )2 =
i=1 i=1
1 − Uii (S)
Prediction risk
n n h i
U (S) = XS (XST XS )−1 XS (“hat matrix”)
X X
R(S) = mspei = E (Ybi (S) − Yi∗ )2
i=1 i=1

Training error
n
R
btr (S) =
X
(Ybi (S) − Yi )2
19 Non-parametric Function Estimation
i=1
2 19.1 Density Estimation
R Pn b 2
R i=1 (Yi (S) − Y )
rss(S) btr (S) R
R2 (S) = 1 − =1− =1− Estimate f (x), where f (x) = P [X ∈ A] = A
f (x) dx.
P n 2
i=1 (Yi − Y )
tss tss Integrated square error (ise)
The training error is a downward-biased estimate of the prediction risk. Z  2 Z
h i L(f, fbn ) = f (x) − fn (x) dx = J(h) + f 2 (x) dx
b
E R btr (S) < R(S)

h i n
X h i Frequentist risk
bias(Rtr (S)) = E Rtr (S) − R(S) = −2
b b Cov Ybi , Yi
i=1
h i Z Z
R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx
Adjusted R2
n − 1 rss
R2 (S) = 1 −
n − k tss h i
Mallow’s Cp statistic b(x) = E fbn (x) − f (x)
h i
R(S)
b =R σ 2 = lack of fit + complexity penalty
btr (S) + 2kb v(x) = V fbn (x)
19.1.1 Histograms KDE
n  
Definitions 1X1 x − Xi
fbn (x) = K
n i=1 h h
• Number of bins m
Z Z
1 4 00 2 1
1 R(f, fn ) ≈ (hσK )
b (f (x)) dx + K 2 (x) dx
• Binwidth h = m 4 nh
• Bin Bj has νj observations c
−2/5 −1/5 −1/5
c2 c3
Z Z
h∗ = 1 c = σ 2
, c = K 2
(x) dx, c = (f 00 (x))2 dx
R
• Define pbj = νj /n and pj = Bj f (u) du n1/5
1 K 2 3

Z 4/5 Z 1/5
∗ c4 5 2 2/5 2 00 2
Histogram estimator R (f, fn ) = 4/5
b c4 = (σK ) K (x) dx (f ) dx
n 4
| {z }
m C(K)
X pbj
fbn (x) = I(x ∈ Bj )
j=1
h Epanechnikov Kernel
h i pj
E fbn (x) = (
3

h √
4 5(1−x2 /5)
|x| < 5
h i p (1 − p ) K(x) =
j j
V fbn (x) = 0 otherwise
nh2
h2
Z
2 1
R(fbn , f ) ≈ (f 0 (u)) du + Cross-validation estimate of E [J(h)]
12 nh
!1/3
1 6 n n n  
1 X X ∗ Xi − Xj
Z

h = 1/3 R 2Xb 2
2 du JbCV (h) = fbn2 (x) dx − f(−i) (Xi ) ≈ K + K(0)
n (f 0 (u)) n i=1 hn2 i=1 j=1 h nh
 2/3 Z 1/3
∗ b C 3 0 2
R (fn , f ) ≈ 2/3 C= (f (u)) du
n 4 Z
K ∗ (x) = K (2) (x) − 2K(x) K (2) (x) = K(x − y)K(y) dy
Cross-validation estimate of E [J(h)]

Z
2Xb
n
2 n+1 X 2
m 19.2 Non-parametric Regression
JbCV (h) = fbn2 (x) dx − f(−i) (Xi ) = − pb
n i=1 (n − 1)h (n − 1)h j=1 j Estimate f (x) where f (x) = E [Y | X = x]. Consider pairs of points
(x1 , Y1 ), . . . , (xn , Yn ) related by

Yi = r(xi ) + i
19.1.2 Kernel Density Estimator (KDE)
E [i ] = 0
Kernel K V [i ] = σ 2

• K(x) ≥ 0 k-nearest Neighbor Estimator


R
• K(x) dx = 1

R
xK(x) dx = 0 1 X
rb(x) = Yi where Nk (x) = {k values of x1 , . . . , xn closest to x}

R 2 2
x K(x) dx ≡ σK >0 k
i:xi ∈Nk (x)
Nadaraya-Watson Kernel Estimator 20 Stochastic Processes
n
X
rb(x) = wi (x)Yi Stochastic Process
i=1 (
x−xi

K {0, ±1, . . . } = Z discrete
wi (x) = h ∈ [0, 1] {Xt : t ∈ T } T =
[0, ∞)

Pn
K
x−xj continuous
j=1 h
4 Z  2
h4 f 0 (x)
Z
2 2 00 0 • Notations Xt , X(t)
R(brn , r) ≈ x K (x) dx r (x) + 2r (x) dx
4 f (x) • State space X
Z 2R 2
σ K (x) dx • Index set T
+ dx
nhf (x)
c1
h∗ ≈ 1/5 20.1 Markov Chains
n
c2
R∗ (b
rn , r) ≈ 4/5 Markov chain
n

P [Xn = x | X0 , . . . , Xn−1 ] = P [Xn = x | Xn−1 ] ∀n ∈ T, x ∈ X


Cross-validation estimate of E [J(h)]
n
X n
X (Yi − rb(xi ))2 Transition probabilities
JbCV (h) = (Yi − rb(−i) (xi ))2 = !2
i=1 i=1 K(0) pij ≡ P [Xn+1 = j | Xn = i]
1− Pn  x−x 
j
K
j=1 h pij (n) ≡ P [Xm+n = j | Xm = i] n-step

19.3 Smoothing Using Orthogonal Functions Transition matrix P (n-step: Pn )


Approximation
∞ J • (i, j) element is pij
X X
r(x) = βj φj (x) ≈ βj φj (x) • pij > 0
P
j=1 j=1 • i pij = 1
Multivariate regression
Y = Φβ + η Chapman-Kolmogorov
 
φ0 (x1 ) ··· φJ (x1 ) X
 .. .. ..  pij (m + n) = pij (m)pkj (n)
where ηi = i and Φ =  . . .  k
φ0 (xn ) · · · φJ (xn )
Least squares estimator Pm+n = Pm Pn
βb = (ΦT Φ)−1 ΦT Y
Pn = P × · · · × P = Pn
1
≈ ΦT Y (for equally spaced observations only)
n Marginal probability
Cross-validation estimate of E [J(h)]
 2 µn = (µn (1), . . . , µn (N )) where µi (i) = P [Xn = i]
n J
R
bCV (J) =
X
Yi −
X
φj (xi )βbj,(−i)  µ0 , initial distribution
i=1 j=1 µn = µ0 Pn
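The Chapman-Kolmogorov relation P_{m+n} = P_m P_n and the marginal µ_n = µ_0 Pⁿ from §20.1 amount to matrix powers. A sketch with a hypothetical two-state chain (the transition probabilities are made up; assumes NumPy):

```python
import numpy as np

# A hypothetical two-state chain (values are illustrative, not from the text)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])        # start in state 0 with probability 1

# n-step transition matrix P_n = P^n and marginal mu_n = mu0 P^n
Pn = np.linalg.matrix_power(P, 10)
mu_n = mu0 @ Pn
print(Pn)
print(mu_n)                        # approaches the stationary distribution (0.8, 0.2)
```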
20.2 Poisson Processes Autocorrelation function (ACF)
Poisson process
Cov [xs , xt ] γ(s, t)
ρ(s, t) = p =p
• {Xt : t ∈ [0, ∞)} = number of events up to and including time t V [xs ] V [xt ] γ(s, s)γ(t, t)
• X0 = 0
• Independent increments: Cross-covariance function (CCV)
∀t0 < · · · < tn : Xt1 − Xt0 ⊥
⊥ · · · ⊥⊥ Xtn − Xtn−1
γxy (s, t) = E [(xs − µxs )(yt − µyt )]
• Intensity function λ(t)
– P [Xt+h − Xt = 1] = λ(t)h + o(h) Cross-correlation function (CCF)
– P [Xt+h − Xt = 2] = o(h)
γxy (s, t)
• Xs+t − Xs ∼ Po (m(s + t) − m(s)) where m(t) =
Rt
λ(s) ds ρxy (s, t) = p
0 γx (s, s)γy (t, t)
Homogeneous Poisson process
Backshift operator
λ(t) ≡ λ =⇒ Xt ∼ Po (λt) λ>0
B k (xt ) = xt−k
Waiting times
Wt := time at which Xt occurs Difference operator
 
1 ∇d = (1 − B)d
Wt ∼ Gamma t,
λ
Interarrival times White noise
St = Wt+1 − Wt
2
 
1 • wt ∼ wn(0, σw )
St ∼ Exp iid 2

λ • Gaussian: wt ∼ N 0, σw
• E [wt ] = 0 t ∈ T
St • V [wt ] = σ 2 t ∈ T
• γw (s, t) = 0 s 6= t ∧ s, t ∈ T
Wt−1 Wt t

Random walk
21 Time Series
• Drift δ
Pt
Mean function Z ∞
• xt = δt + j=1 wj
µxt = E [xt ] = xft (x) dx • E [xt ] = δt
−∞

Autocovariance function Symmetric moving average

γx (s, t) = E [(xs − µs )(xt − µt )] = E [xs xt ] − µs µt k


X k
X
mt = aj xt−j where aj = a−j ≥ 0 and aj = 1
γx (t, t) = E (xt − µt )2 = V [xt ]
 
j=−k j=−k
21.1 Stationary Time Series Sample variance
n  
Strictly stationary 1 X |h|
V [x̄] = 1− γx (h)
n n
P [xt1 ≤ c1 , . . . , xtk ≤ ck ] = P [xt1 +h ≤ c1 , . . . , xtk +h ≤ ck ] h=−n

∀k ∈ N, tk , ck , h ∈ Z Sample autocovariance function

Weakly stationary n−h


1 X
  γ
b(h) = (xt+h − x̄)(xt − x̄)
• E x2t < ∞ ∀t ∈ Z n t=1
 2
• E xt = m ∀t ∈ Z
• γx (s, t) = γx (s + r, t + r) ∀r, s, t ∈ Z Sample autocorrelation function
Autocovariance function
γ
b(h)
ρb(h) =
• γ(h) = E [(xt+h − µ)(xt − µ)] ∀h ∈ Z γ
b(0)
 
• γ(0) = E (xt − µ)2
• γ(0) ≥ 0 Sample cross-variance function
• γ(0) ≥ |γ(h)|
n−h
• γ(h) = γ(−h) 1 X
γ
bxy (h) = (xt+h − x̄)(yt − y)
n t=1
Autocorrelation function (ACF)

Cov [xt+h , xt ] γ(t + h, t) γ(h) Sample cross-correlation function


ρx (h) = p =p =
V [xt+h ] V [xt ] γ(t + h, t + h)γ(t, t) γ(0)
γ
bxy (h)
Jointly stationary time series ρbxy (h) = p
γbx (0)b
γy (0)
γxy (h) = E [(xt+h − µx )(yt − µy )]
Properties
γxy (h)
ρxy (h) = p 1
γx (0)γy (h) • σρbx (h) = √ if xt is white noise
n
Linear process 1
• σρbxy (h) = √ if xt or yt is white noise

X ∞
X n
xt = µ + ψj wt−j where |ψj | < ∞
j=−∞ j=−∞


21.3 Non-Stationary Time Series
X
2
γ(h) = σw ψj+h ψj Classical decomposition model
j=−∞

xt = µt + st + wt
21.2 Estimation of Correlation
Sample mean • µt = trend
n
1X • st = seasonal component
x̄ = xt
n t=1 • wt = random noise term
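The sample autocovariance γ̂(h) and autocorrelation ρ̂(h) defined in §21.2 are straightforward to compute; for white noise, ρ̂(h) at nonzero lags should lie within roughly ±2/√n. A sketch (not from the cookbook; assumes NumPy, simulated Gaussian white noise):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000
x = rng.normal(size=n)            # Gaussian white noise

def sample_acf(x, max_lag):
    """rho_hat(h) = gamma_hat(h) / gamma_hat(0), with the 1/n convention."""
    xbar = x.mean()
    gamma0 = np.sum((x - xbar) ** 2) / len(x)
    acf = []
    for h in range(max_lag + 1):
        gamma_h = np.sum((x[h:] - xbar) * (x[:len(x) - h] - xbar)) / len(x)
        acf.append(gamma_h / gamma0)
    return np.array(acf)

rho_hat = sample_acf(x, 5)
print(rho_hat)                     # 1 at lag 0, roughly within 2/sqrt(n) of 0 otherwise
print(2 / np.sqrt(n))
```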
21.3.1 Detrending Moving average polynomial
Least squares θ(z) = 1 + θ1 z + · · · + θq zq z ∈ C ∧ θq 6= 0
2
1. Choose trend model, e.g., µt = β0 + β1 t + β2 t
Moving average operator
2. Minimize rss to obtain trend estimate µ bt = βb0 + βb1 t + βb2 t2
3. Residuals , noise wt θ(B) = 1 + θ1 B + · · · + θp B p
Moving average MA (q) (moving average model order q)
1
• The low-pass filter vt is a symmetric moving average mt with aj = 2k+1 : xt = wt + θ1 wt−1 + · · · + θq wt−q ⇐⇒ xt = θ(B)wt
k q
1 X X
vt = xt−1 E [xt ] = θj E [wt−j ] = 0
2k + 1
i=−k j=0
Pk ( Pq−h
1 2
• If 2k+1 i=−k wt−j ≈ 0, a linear trend function µt = β0 + β1 t passes
σw j=0 θj θj+h 0≤h≤q
γ(h) = Cov [xt+h , xt ] =
without distortion 0 h>q
Differencing MA (1)
xt = wt + θwt−1
• µt = β0 + β1 t =⇒ ∇xt = β1 
2 2
(1 + θ )σw h = 0

2
21.4 ARIMA models γ(h) = θσw h=1

0 h>1

Autoregressive polynomial
(
θ
φ(z) = 1 − φ1 z − · · · − φp zp z ∈ C ∧ φp 6= 0 2 h=1
ρ(h) = (1+θ )
0 h>1
Autoregressive operator
ARMA (p, q)
φ(B) = 1 − φ1 B − · · · − φp B p
xt = φ1 xt−1 + · · · + φp xt−p + wt + θ1 wt−1 + · · · + θq wt−q
Autoregressive model order p, AR (p)
φ(B)xt = θ(B)wt
xt = φ1 xt−1 + · · · + φp xt−p + wt ⇐⇒ φ(B)xt = wt
Partial autocorrelation function (PACF)
AR (1) • xih−1 , regression of xi on {xh−1 , xh−2 , . . . , x1 }
k−1 ∞ • φhh = corr(xh − xh−1
h , x0 − xh−1
0 ) h≥2
X k→∞,|φ|<1 X
• xt = φk (xt−k ) + φj (wt−j ) = φj (wt−j ) • E.g., φ11 = corr(x1 , x0 ) = ρ(1)
j=0 j=0
| {z } ARIMA (p, d, q)
linear process
P∞ j
∇d xt = (1 − B)d xt is ARMA (p, q)
• E [xt ] = j=0 φ (E [wt−j ]) = 0
2 h
σw φ φ(B)(1 − B)d xt = θ(B)wt
• γ(h) = Cov [xt+h , xt ] = 1−φ2
γ(h) Exponentially Weighted Moving Average (EWMA)
• ρ(h) = γ(0) = φh
• ρ(h) = φρ(h − 1) h = 1, 2, . . . xt = xt−1 + wt − λwt−1

X • Frequency index ω (cycles per unit time), period 1/ω
xt = (1 − λ)λj−1 xt−j + wt when |λ| < 1
j=1
• Amplitude A
• Phase φ
x̃n+1 = (1 − λ)xn + λx̃n
• U1 = A cos φ and U2 = A sin φ often normally distributed rv’s
Seasonal ARIMA
Periodic mixture
• Denoted by ARIMA (p, d, q) × (P, D, Q)s
q
• ΦP (B s )φ(B)∇D d s
s ∇ xt = δ + ΘQ (B )θ(B)wt X
xt = (Uk1 cos(2πωk t) + Uk2 sin(2πωk t))
k=1
21.4.1 Causality and Invertibility
P∞ • Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rv’s with variances σk2
ARMA (p, q) is causal (future-independent) ⇐⇒ ∃{ψj } : j=0 ψj < ∞ such that Pq
• γ(h) = k=1 σk2 cos(2πωk h)
  Pq

X • γ(0) = E x2t = k=1 σk2
xt = wt−j = ψ(B)wt
j=0 Spectral representation of a periodic process
P∞
ARMA (p, q) is invertible ⇐⇒ ∃{πj } : j=0 πj < ∞ such that γ(h) = σ 2 cos(2πω0 h)
∞ σ 2 −2πiω0 h σ 2 2πiω0 h
X = e + e
π(B)xt = Xt−j = wt 2 2
Z 1/2
j=0
= e2πiωh dF (ω)
Properties −1/2

• ARMA (p, q) causal ⇐⇒ roots of φ(z) lie outside the unit circle Spectral distribution function


X θ(z)
j 0
 ω < −ω0
ψ(z) = ψj z = |z| ≤ 1
φ(z) F (ω) = σ 2 /2 −ω ≤ ω < ω0
j=0 
 2
σ ω ≥ ω0
• ARMA (p, q) invertible ⇐⇒ roots of θ(z) lie outside the unit circle
• F (−∞) = F (−1/2) = 0

X φ(z) • F (∞) = F (1/2) = γ(0)
π(z) = πj z j = |z| ≤ 1
j=0
θ(z)
Spectral density
Behavior of the ACF and PACF for causal and invertible ARMA models ∞
X 1 1
AR (p) MA (q) ARMA (p, q) f (ω) = γ(h)e−2πiωh − ≤ω≤
2 2
h=−∞
ACF tails off cuts off after lag q tails off
PACF cuts off after lag p tails off q tails off P∞ R 1/2
• Needs h=−∞ |γ(h)| < ∞ =⇒ γ(h) = −1/2
e2πiωh f (ω) dω h = 0, ±1, . . .
21.5 Spectral Analysis • f (ω) ≥ 0
• f (ω) = f (−ω)
Periodic process • f (ω) = f (1 − ω)
R 1/2
xt = A cos(2πωt + φ) • γ(0) = V [xt ] = −1/2 f (ω) dω
2
= U1 cos(2πωt) + U2 sin(2πωt) • White noise: fw (ω) = σw
• ARMA (p, q) , φ(B)xt = θ(B)wt : 22.2 Beta Function
Z 1
Γ(x)Γ(y)
|θ(e−2πiω )|2
2 • Ordinary: B(x, y) = B(y, x) = tx−1 (1 − t)y−1 dt =
fx (ω) = σw 0 Γ(x + y)
|φ(e−2πiω )|2 Z x
a−1 b−1
Pp Pq • Incomplete: B(x; a, b) = t (1 − t) dt
where φ(z) = 1 − k=1 φk z k and θ(z) = 1 + k=1 θk z k 0
• Regularized incomplete:
Discrete Fourier Transform (DFT) a+b−1
B(x; a, b) a,b∈N X (a + b − 1)!
Ix (a, b) = = xj (1 − x)a+b−1−j
n
X B(a, b) j=a
j!(a + b − 1 − j)!
d(ωj ) = n−1/2 xt e−2πiωj t
• I0 (a, b) = 0 I1 (a, b) = 1
i=1
• Ix (a, b) = 1 − I1−x (b, a)
Fourier/Fundamental frequencies
22.3 Series
ωj = j/n
Finite Binomial
Inverse DFT n n  
n−1 X n(n + 1) X n
• = 2n
X
xt = n −1/2
d(ωj )e 2πiωj t k= •
2 k
j=0 k=1 k=0
n n    
X X r+k r+n+1
Periodogram • (2k − 1) = n2 • =
I(j/n) = |d(j/n)|2 k n
k=1 k=0
n n    
Scaled Periodogram
X n(n + 1)(2n + 1) X k n+1
• k2 = • =
6 m m+1
k=1 k=0
4 n
P (j/n) = I(j/n) X 
n(n + 1)
2 • Vandermonde’s Identity:
n • k3 = r  
m n
 
m+n

2
!2 !2 X
n n k=1 =
2X 2X n k r−k r
= xt cos(2πtj/n + xt sin(2πtj/n cn+1 − 1 k=0
n t=1 n t=1
X
• ck = c 6= 1 • Binomial Theorem:
c−1 n  
n n−k k
k=0
X
a b = (a + b)n
22 Math k
k=0

22.1 Gamma Function Infinite


Z ∞
∞ ∞
• Ordinary: Γ(s) = ts−1 e−t dt X 1 X p
0 • pk = , pk = |p| < 1
Z ∞ 1−p 1−p
k=0 k=1
• Upper incomplete: Γ(s, x) = ts−1 e−t dt ∞ ∞
!  
X d X d 1 1
Z xx • kpk−1 = pk
= = |p| < 1
dp dp 1 − p (1 − p)2
• Lower incomplete: γ(s, x) = ts−1 e−t dt k=0 k=0
0 ∞  
X r+k−1 k
• Γ(α + 1) = αΓ(α) α>1 • x = (1 − x)−r r ∈ N+
k
• Γ(n) = (n − 1)! n∈N k=0
∞  
• Γ(0) = Γ(−1) = ∞ X α k
√ • p = (1 + p)α |p| < 1 , α ∈ C
• Γ(1/2) = π k
k=0
• Γ(−1/2) = −2Γ(1/2)
22.4 Combinatorics [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R
Examples. Springer, 2006.
Sampling [4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra.
Springer, 2001.
k out of n w/o replacement w/ replacement [5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik.
k−1 Springer, 2002.
Y n!
ordered nk = (n − i) = nk [6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
i=0
(n − k)!
nk
     
n n! n−1+r n−1+r
unordered = = =
k k! k!(n − k)! r n−1

Stirling numbers, 2nd kind


        (
n n−1 n−1 n 1 n=0
=k + 1≤k≤n =
k k k−1 0 0 else

Partitions
n
X
Pn+k,k = Pn,i k > n : Pn,k = 0 n ≥ 1 : Pn,0 = 0, P0,0 = 1
i=1

Balls and Urns f :B→U D = distinguishable, ¬D = indistinguishable.

|B| = n, |U | = m f arbitrary f injective f surjective f bijective


( (
mn m ≥ n
 
n n! m = n
B : D, U : D mn m!
0 else m 0 else
      (
m+n−1 m n−1 1 m=n
B : ¬D, U : D
n n m−1 0 else
m  
(   (
X n 1 m≥n n 1 m=n
B : D, U : ¬D
k 0 else m 0 else
k=1
m
( (
X 1 m≥n 1 m=n
B : ¬D, U : ¬D Pn,k Pn,m
k=1
0 else 0 else

References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole,
1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American
Statistician, 62(1):45–53, 2008.
Univariate distribution relationships, courtesy Leemis and McQueston [2].
