
Probability and Statistics Cookbook

Copyright © Matthias Vallentin, 2015
[email protected]

31st March, 2015


This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature and in-class material from courses of the statistics department at the University of California in Berkeley, but also influenced by other sources [2, 3]. If you find errors or have suggestions for further topics, I would appreciate it if you send me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of Parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions

For each distribution we give, where available, the notation, CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).¹

• Uniform, Unif{a, …, b}: F_X(x) = 0 for x < a, (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b, 1 for x > b; f_X(x) = I(a ≤ x ≤ b)/(b − a + 1); E[X] = (a + b)/2; V[X] = ((b − a + 1)² − 1)/12.
• Bernoulli, Bern(p): f_X(x) = pˣ(1 − p)¹⁻ˣ; E[X] = p; V[X] = p(1 − p); M_X(s) = 1 − p + peˢ.
• Binomial, Bin(n, p): F_X(x) = I_{1−p}(n − x, x + 1); f_X(x) = C(n, x) pˣ(1 − p)ⁿ⁻ˣ; E[X] = np; V[X] = np(1 − p); M_X(s) = (1 − p + peˢ)ⁿ.
• Multinomial, Mult(n, p): f_X(x) = n!/(x₁!⋯x_k!) · p₁^{x₁}⋯p_k^{x_k} with Σ_{i=1}^k x_i = n; E[X_i] = np_i; V[X_i] = np_i(1 − p_i); M_X(s) = (Σ_{i=1}^k p_i e^{s_i})ⁿ.
• Hypergeometric, Hyp(N, m, n): F_X(x) ≈ Φ((x − np)/√(np(1 − p))) with p = m/N; f_X(x) = C(m, x)C(N − m, n − x)/C(N, n); E[X] = nm/N; V[X] = nm(N − n)(N − m)/(N²(N − 1)).
• Negative Binomial, NBin(r, p): F_X(x) = I_p(r, x + 1); f_X(x) = C(x + r − 1, r − 1) pʳ(1 − p)ˣ; E[X] = r(1 − p)/p; V[X] = r(1 − p)/p²; M_X(s) = (p/(1 − (1 − p)eˢ))ʳ.
• Geometric, Geo(p): F_X(x) = 1 − (1 − p)ˣ, x ∈ ℕ⁺; f_X(x) = p(1 − p)ˣ⁻¹, x ∈ ℕ⁺; E[X] = 1/p; V[X] = (1 − p)/p²; M_X(s) = peˢ/(1 − (1 − p)eˢ).
• Poisson, Po(λ): F_X(x) = e^{−λ} Σ_{i=0}^x λⁱ/i!; f_X(x) = λˣe^{−λ}/x!; E[X] = λ; V[X] = λ; M_X(s) = e^{λ(eˢ−1)}.

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
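As a quick sanity check on the table above, the following sketch (assuming NumPy is available; the parameter values are arbitrary) compares simulated moments of Bin(n, p) and Po(λ) with the tabulated E[X] and V[X]:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, lam = 30, 0.6, 4.0

binom = rng.binomial(n, p, size=200_000)
pois = rng.poisson(lam, size=200_000)

# Tabulated values: E = np, V = np(1 - p) for Bin(n, p); E = V = lambda for Po(lambda)
print(binom.mean(), n * p, binom.var(), n * p * (1 - p))
print(pois.mean(), lam, pois.var(), lam)
```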
Uniform (discrete) Binomial Geometric Poisson

● n = 40, p = 0.3 0.8 ●
● p = 0.2 ● ●
● λ=1
● n = 30, p = 0.6 ● p = 0.5 ● λ=4
● n = 25, p = 0.9 ● p = 0.8 ● λ = 10

0.3
0.2 ● 0.6


0.2
PMF

PMF

PMF

PMF
1 ● ● ● ●
● ●
● ● ● ● ● ● ● 0.4 ●
n ●
● ● ●

● ●
0.1
● ●
● ● ● ● ●
● ●
● 0.1 ●
● 0.2 ● ●
● ● ● ● ●
● ●
● ● ●
● ●
● ● ●
● ● ● ● ● ●
● ● ●
● ● ● ● ● ●
● ● ● ● ● ● ● ●
● ●●● ● ● ● ●
0.0 ●●●● ●●
●●●●●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●● 0.0 ● ●

● ●
● ● ● ● ● 0.0 ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ●

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20


x x x x
Uniform (discrete) Binomial Geometric Poisson
1 ● 1.00 ●●●●●●●●●●●●●●●●
● ● ●●●●●●●●●●●●●●●●●●●●●
●● 1.0 ● ● ● ● ●
● ● ● ● ● 1.00 ● ● ● ● ● ●
● ●
● ● ● ● ● ● ● ● ●
● ● ●
● ●
● ● ● ●
● ● ● ●
● ● ● ● ●
● ●
● ●
● ●
● ●


● ● ●
0.75 ●
0.8 ● ● 0.75 ●
● ● ●
i ●

● ●
n ●
● ● ●

CDF

CDF

CDF

CDF
0.50 0.6 ● 0.50
● ●
● ●

● ●
i ●
● ● ●
n ●

0.25 ● 0.4 0.25 ●

● ●


● ● ● ● n = 40, p = 0.3 ● p = 0.2 ● ● λ=1
● ● ●
● n = 30, p = 0.6 ● p = 0.5 ●
● λ=4

0 ● 0.00 ●●●● ●

●●
●●●●●●●●●●●●●●●●●
●●●●●●●●●● ● ● n = 25, p = 0.9 0.2 ● ● p = 0.8 0.00

● ● ● ● ● λ = 10

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20


x x x x

4
1.2 Continuous Distributions

For each distribution we give, where available, the notation, CDF F_X(x), PDF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

• Uniform, Unif(a, b): F_X(x) = 0 for x < a, (x − a)/(b − a) for a < x < b, 1 for x > b; f_X(x) = I(a < x < b)/(b − a); E[X] = (a + b)/2; V[X] = (b − a)²/12; M_X(s) = (e^{sb} − e^{sa})/(s(b − a)).
• Normal, N(μ, σ²): F_X(x) = Φ(x) = ∫_{−∞}^x φ(t) dt; f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)); E[X] = μ; V[X] = σ²; M_X(s) = exp(μs + σ²s²/2).
• Log-Normal, ln N(μ, σ²): F_X(x) = ½ + ½ erf((ln x − μ)/√(2σ²)); f_X(x) = (1/(x√(2πσ²))) exp(−(ln x − μ)²/(2σ²)); E[X] = e^{μ+σ²/2}; V[X] = (e^{σ²} − 1)e^{2μ+σ²}.
• Multivariate Normal, MVN(μ, Σ): f_X(x) = (2π)^{−k/2}|Σ|^{−1/2} exp(−½(x − μ)ᵀΣ⁻¹(x − μ)); E[X] = μ; V[X] = Σ; M_X(s) = exp(μᵀs + ½sᵀΣs).
• Student's t, Student(ν): f_X(x) = (Γ((ν+1)/2)/(√(νπ)Γ(ν/2)))(1 + x²/ν)^{−(ν+1)/2}; E[X] = 0; V[X] = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2.
• Chi-square, χ²_k: F_X(x) = γ(k/2, x/2)/Γ(k/2); f_X(x) = (1/(2^{k/2}Γ(k/2))) x^{k/2−1}e^{−x/2}; E[X] = k; V[X] = 2k; M_X(s) = (1 − 2s)^{−k/2} for s < 1/2.
• F, F(d₁, d₂): F_X(x) = I_{d₁x/(d₁x+d₂)}(d₁/2, d₂/2); f_X(x) = √((d₁x)^{d₁} d₂^{d₂}/(d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2)); E[X] = d₂/(d₂ − 2) for d₂ > 2; V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4)) for d₂ > 4.
• Exponential, Exp(β): F_X(x) = 1 − e^{−x/β}; f_X(x) = (1/β)e^{−x/β}; E[X] = β; V[X] = β²; M_X(s) = 1/(1 − βs) for s < 1/β.
• Gamma, Gamma(α, β): F_X(x) = γ(α, x/β)/Γ(α); f_X(x) = (1/(Γ(α)β^α)) x^{α−1}e^{−x/β}; E[X] = αβ; V[X] = αβ²; M_X(s) = (1/(1 − βs))^α for s < 1/β.
• Inverse Gamma, InvGamma(α, β): F_X(x) = Γ(α, β/x)/Γ(α); f_X(x) = (β^α/Γ(α)) x^{−α−1}e^{−β/x}; E[X] = β/(α − 1) for α > 1; V[X] = β²/((α − 1)²(α − 2)) for α > 2.
• Dirichlet, Dir(α): f_X(x) = (Γ(Σ_{i=1}^k α_i)/∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k x_i^{α_i−1}; E[X_i] = α_i/Σ_{i=1}^k α_i; V[X_i] = E[X_i](1 − E[X_i])/(Σ_{i=1}^k α_i + 1).
• Beta, Beta(α, β): F_X(x) = I_x(α, β); f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}; E[X] = α/(α + β); V[X] = αβ/((α + β)²(α + β + 1)); M_X(s) = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) sᵏ/k!.
• Weibull, Weibull(λ, k): F_X(x) = 1 − e^{−(x/λ)ᵏ}; f_X(x) = (k/λ)(x/λ)^{k−1}e^{−(x/λ)ᵏ}; E[X] = λΓ(1 + 1/k); V[X] = λ²Γ(1 + 2/k) − μ²; M_X(s) = Σ_{n=0}^∞ (sⁿλⁿ/n!)Γ(1 + n/k).
• Pareto, Pareto(x_m, α): F_X(x) = 1 − (x_m/x)^α for x ≥ x_m; f_X(x) = αx_m^α/x^{α+1} for x ≥ x_m; E[X] = αx_m/(α − 1) for α > 1; V[X] = x_m²α/((α − 1)²(α − 2)) for α > 2; M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) for s < 0.
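Analogously, a small simulation (NumPy assumed; note that NumPy's `gamma` uses the same shape–scale parameterization (α, β) as the table) can confirm the Gamma and Beta entries:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 3.0, 2.0
a, b = 2.0, 5.0

g = rng.gamma(shape=alpha, scale=beta, size=200_000)   # Gamma(alpha, beta)
x = rng.beta(a, b, size=200_000)                       # Beta(a, b)

print(g.mean(), alpha * beta, g.var(), alpha * beta**2)               # E = αβ, V = αβ²
print(x.mean(), a / (a + b), x.var(), a * b / ((a + b)**2 * (a + b + 1)))
```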
[Figure: PDFs (first two rows) and CDFs (last two rows) of the continuous uniform, normal, log-normal, Student's t, χ², F, exponential, gamma, inverse gamma, beta, Weibull, and Pareto distributions for several parameter settings.]
2 Probability Theory

Definitions
• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra A:
  1. ∅ ∈ A
  2. A₁, A₂, … ∈ A ⟹ ⋃_{i=1}^∞ A_i ∈ A
  3. A ∈ A ⟹ ¬A ∈ A
• Probability distribution P:
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[⨆_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i]
• Probability space (Ω, A, P)

Properties
• P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B] ⟹ P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]
• DeMorgan: ¬(⋃_n A_n) = ⋂_n ¬A_n and ¬(⋂_n A_n) = ⋃_n ¬A_n
• P[⋃_n A_n] = 1 − P[⋂_n ¬A_n]

Continuity of Probabilities
• A₁ ⊂ A₂ ⊂ … ⟹ lim_{n→∞} P[A_n] = P[A] where A = ⋃_{i=1}^∞ A_i
• A₁ ⊃ A₂ ⊃ … ⟹ lim_{n→∞} P[A_n] = P[A] where A = ⋂_{i=1}^∞ A_i

Independence
A ⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability
P[A | B] = P[A ∩ B]/P[B]   (P[B] > 0)

Law of Total Probability
P[B] = Σ_{i=1}^n P[B | A_i] P[A_i]   where Ω = ⨆_{i=1}^n A_i

Bayes' Theorem
P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j]   where Ω = ⨆_{i=1}^n A_i

Inclusion-Exclusion Principle
|⋃_{i=1}^n A_i| = Σ_{r=1}^n (−1)^{r−1} Σ_{i₁<⋯<i_r} |A_{i₁} ∩ ⋯ ∩ A_{i_r}|

3 Random Variables

Random Variable (RV): X : Ω → ℝ

Probability Mass Function (PMF): f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF): P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF): F_X : ℝ → [0, 1], F_X(x) = P[X ≤ x]
  1. Nondecreasing: x₁ < x₂ ⟹ F(x₁) ≤ F(x₂)
  2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
  3. Right-continuous: lim_{y↓x} F(y) = F(x)

P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy   (a ≤ b)

f_{Y|X}(y | x) = f(x, y)/f_X(x)

Independence
  1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
  2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
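As a quick numerical illustration of the PDF–CDF relationship above, the sketch below (NumPy assumed; the exponential distribution and interval are arbitrary) checks that the fraction of samples falling in [a, b] matches F(b) − F(a):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 2.0                             # Exp(beta), mean beta
x = rng.exponential(beta, size=100_000)

F = lambda t: 1 - np.exp(-t / beta)    # CDF of Exp(beta)
a, b = 0.5, 3.0

empirical = np.mean((a <= x) & (x <= b))   # Monte Carlo estimate of P[a <= X <= b]
exact = F(b) - F(a)                        # integral of the PDF over [a, b]
print(empirical, exact)                    # the two agree to 2-3 decimal places
```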
3.1 Transformations

Transformation function: Z = ϕ(X)

Discrete:
f_Z(z) = P[ϕ(X) = z] = P[{x : ϕ(x) = z}] = P[X ∈ ϕ⁻¹(z)] = Σ_{x ∈ ϕ⁻¹(z)} f(x)

Continuous:
F_Z(z) = P[ϕ(X) ≤ z] = ∫_{A_z} f(x) dx   with A_z = {x : ϕ(x) ≤ z}

Special case if ϕ is strictly monotone:
f_Z(z) = f_X(ϕ⁻¹(z)) |d ϕ⁻¹(z)/dz| = f_X(x) |dx/dz| = f_X(x)/|J|

The Rule of the Lazy Statistician:
E[Z] = ∫ ϕ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution:
• Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx = ∫_0^z f_{X,Y}(x, z − x) dx if X, Y ≥ 0
• Z := |X − Y|:  f_Z(z) = 2∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx = ∫_{−∞}^∞ |x| f_X(x) f_Y(xz) dx if X ⊥ Y

4 Expectation

Definition and properties
• E[X] = μ_X = ∫ x dF_X(x) = Σ_x x f_X(x) if X is discrete, ∫ x f_X(x) dx if X is continuous
• P[X = c] = 1 ⟹ E[X] = c
• E[cX] = c E[X]
• E[X + Y] = E[X] + E[Y]
• E[XY] = ∫∫ xy f_{X,Y}(x, y) dF_X(x) dF_Y(y)
• E[ϕ(X)] ≠ ϕ(E[X]) in general (cf. Jensen's inequality)
• P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y]
• P[X = Y] = 1 ⟹ E[X] = E[Y]
• E[X] = Σ_{x=1}^∞ P[X ≥ x]   (X taking values in the positive integers)

Sample mean: X̄n = (1/n) Σ_{i=1}^n X_i

Conditional expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[ϕ(X, Y) | X = x] = ∫_{−∞}^∞ ϕ(x, y) f_{Y|X}(y | x) dy
• E[ϕ(Y, Z) | X = x] = ∫∫ ϕ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[ϕ(X)Y | X] = ϕ(X) E[Y | X]
• E[Y | X] = c ⟹ Cov[X, Y] = 0

5 Variance

Definition and properties
• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i]   if the X_i are independent

Standard deviation: sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[X, Y] = Cov[Y, X]
• Cov[aX, bY] = ab Cov[X, Y]
• Cov[X + a, Y + b] = Cov[X, Y]
• Cov[Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]

Correlation
ρ[X, Y] = Cov[X, Y]/√(V[X] V[Y])

Independence
X ⊥ Y ⟹ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance
S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄n)²

Conditional variance
• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]

6 Inequalities

• Cauchy–Schwarz: E[XY]² ≤ E[X²] E[Y²]
• Markov: P[ϕ(X) ≥ t] ≤ E[ϕ(X)]/t
• Chebyshev: P[|X − E[X]| ≥ t] ≤ V[X]/t²
• Chernoff: P[X ≥ (1 + δ)μ] ≤ (e^δ/(1 + δ)^{1+δ})^μ for δ > −1
• Hoeffding: X₁, …, Xn independent with P[X_i ∈ [a_i, b_i]] = 1, 1 ≤ i ≤ n:
  P[X̄ − E[X̄] ≥ t] ≤ e^{−2nt²}   (t > 0)
  P[|X̄ − E[X̄]| ≥ t] ≤ 2 exp(−2n²t²/Σ_{i=1}^n (b_i − a_i)²)   (t > 0)
• Jensen: E[ϕ(X)] ≥ ϕ(E[X]) for ϕ convex

7 Distribution Relationships

Binomial
• X_i ∼ Bern(p) ⟹ Σ_{i=1}^n X_i ∼ Bin(n, p)
• X ∼ Bin(n, p), Y ∼ Bin(m, p), X ⊥ Y ⟹ X + Y ∼ Bin(n + m, p)
• Bin(n, p) ≈ Po(np)   (n large, p small)
• Bin(n, p) ≈ N(np, np(1 − p))   (n large, p far from 0 and 1)

Negative Binomial
• NBin(1, p) = Geo(p)
• X ∼ NBin(r, p) is the sum of r independent Geo(p) random variables
• X_i ∼ NBin(r_i, p) ⟹ Σ X_i ∼ NBin(Σ r_i, p)
• X ∼ NBin(r, p), Y ∼ Bin(s + r, p) ⟹ P[X ≤ s] = P[Y ≥ r]

Poisson
• X_i ∼ Po(λ_i) independent ⟹ Σ_{i=1}^n X_i ∼ Po(Σ_{i=1}^n λ_i)
• X_i ∼ Po(λ_i) independent ⟹ X_i | Σ_{j=1}^n X_j ∼ Bin(Σ_{j=1}^n X_j, λ_i/Σ_{j=1}^n λ_j)

Exponential
• X_i ∼ Exp(β) iid ⟹ Σ_{i=1}^n X_i ∼ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ∼ N(μ, σ²) ⟹ (X − μ)/σ ∼ N(0, 1)
• X ∼ N(μ, σ²), Z = aX + b ⟹ Z ∼ N(aμ + b, a²σ²)
• X ∼ N(μ₁, σ₁²), Y ∼ N(μ₂, σ₂²), X ⊥ Y ⟹ X + Y ∼ N(μ₁ + μ₂, σ₁² + σ₂²)
• X_i ∼ N(μ_i, σ_i²) independent ⟹ Σ_i X_i ∼ N(Σ_i μ_i, Σ_i σ_i²)
• P[a < X ≤ b] = Φ((b − μ)/σ) − Φ((a − μ)/σ)
• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ⁻¹(1 − α)

Gamma
• X ∼ Gamma(α, β) ⟺ X/β ∼ Gamma(α, 1)
• For integer α, Gamma(α, β) is the distribution of Σ_{i=1}^α Exp(β)
• X_i ∼ Gamma(α_i, β) independent ⟹ Σ_i X_i ∼ Gamma(Σ_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Beta
• (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
• E[Xᵏ] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[Xᵏ⁻¹]
• Beta(1, 1) ∼ Unif(0, 1)

8 Probability and Moment Generating Functions

• G_X(t) = E[tˣ]   (|t| < 1)
• M_X(t) = G_X(eᵗ) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)ⁱ/i!] = Σ_{i=0}^∞ E[Xⁱ] tⁱ/i!
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X⁽ⁱ⁾(0)/i!
• E[X] = G′_X(1⁻)
• E[Xᵏ] = M_X⁽ᵏ⁾(0)
• E[X!/(X − k)!] = G_X⁽ᵏ⁾(1⁻)
• V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
• G_X(t) = G_Y(t) ⟹ X and Y have the same distribution

9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let X, Z ∼ N(0, 1) with X ⊥ Z, and Y = ρX + √(1 − ρ²) Z.
Joint density:
f(x, y) = (1/(2π√(1 − ρ²))) exp(−(x² + y² − 2ρxy)/(2(1 − ρ²)))
Conditionals:
(Y | X = x) ∼ N(ρx, 1 − ρ²)  and  (X | Y = y) ∼ N(ρy, 1 − ρ²)
Independence: X ⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal
Let X ∼ N(μ_x, σ_x²) and Y ∼ N(μ_y, σ_y²).
f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp(−z/(2(1 − ρ²)))
z = ((x − μ_x)/σ_x)² + ((y − μ_y)/σ_y)² − 2ρ((x − μ_x)/σ_x)((y − μ_y)/σ_y)
Conditional mean and variance:
E[X | Y] = E[X] + ρ(σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X²(1 − ρ²)

9.3 Multivariate Normal
Covariance matrix Σ (precision matrix Σ⁻¹) with entries Σ_ij = Cov[X_i, X_j], i.e. V[X_1], …, V[X_k] on the diagonal.
If X ∼ N(μ, Σ):
f_X(x) = (2π)^{−n/2}|Σ|^{−1/2} exp(−½(x − μ)ᵀΣ⁻¹(x − μ))
Properties:
• Z ∼ N(0, 1), X = μ + Σ^{1/2}Z ⟹ X ∼ N(μ, Σ)
• X ∼ N(μ, Σ) ⟹ Σ^{−1/2}(X − μ) ∼ N(0, 1)
• X ∼ N(μ, Σ) ⟹ AX ∼ N(Aμ, AΣAᵀ)
• X ∼ N(μ, Σ), a a vector of length k ⟹ aᵀX ∼ N(aᵀμ, aᵀΣa)

10 Convergence

Let {X₁, X₂, …} be a sequence of rv's and let X be another rv. Let F_n denote the cdf of X_n and let F denote the cdf of X.

Types of convergence
1. In distribution (weakly, in law): X_n →D X: lim_{n→∞} F_n(t) = F(t) at all t where F is continuous
2. In probability: X_n →P X: (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →as X: P[lim_{n→∞} X_n = X] = P[ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)] = 1
4. In quadratic mean (L²): X_n →qm X: lim_{n→∞} E[(X_n − X)²] = 0

Relationships
• X_n →qm X ⟹ X_n →P X ⟹ X_n →D X
• X_n →as X ⟹ X_n →P X
• X_n →D X and P[X = c] = 1 for some c ∈ ℝ ⟹ X_n →P X
• X_n →P X and Y_n →P Y ⟹ X_n + Y_n →P X + Y
• X_n →qm X and Y_n →qm Y ⟹ X_n + Y_n →qm X + Y
• X_n →P X and Y_n →P Y ⟹ X_nY_n →P XY
• X_n →P X ⟹ ϕ(X_n) →P ϕ(X)
• X_n →D X ⟹ ϕ(X_n) →D ϕ(X)
• X_n →qm b ⟺ lim_{n→∞} E[X_n] = b and lim_{n→∞} V[X_n] = 0
• X₁, …, X_n iid, E[X] = μ, V[X] < ∞ ⟹ X̄n →qm μ

Slutzky's Theorem
• X_n →D X and Y_n →P c ⟹ X_n + Y_n →D X + c
• X_n →D X and Y_n →P c ⟹ X_nY_n →D cX
• In general, X_n →D X and Y_n →D Y does not imply X_n + Y_n →D X + Y

10.1 Law of Large Numbers (LLN)

Let {X₁, …, X_n} be a sequence of iid rv's with E[X₁] = μ.
Weak (WLLN): X̄n →P μ as n → ∞
Strong (SLLN): X̄n →as μ as n → ∞

10.2 Central Limit Theorem (CLT)

Let {X₁, …, X_n} be a sequence of iid rv's with E[X₁] = μ and V[X₁] = σ².
Z_n := (X̄n − μ)/√(V[X̄n]) = √n(X̄n − μ)/σ →D Z where Z ∼ N(0, 1)
lim_{n→∞} P[Z_n ≤ z] = Φ(z)   for all z ∈ ℝ

CLT notations
Z_n ≈ N(0, 1)
X̄n ≈ N(μ, σ²/n)
X̄n − μ ≈ N(0, σ²/n)
√n(X̄n − μ) ≈ N(0, σ²)
√n(X̄n − μ)/σ ≈ N(0, 1)

Continuity correction
P[X̄n ≤ x] ≈ Φ((x + ½ − μ)/(σ/√n))
P[X̄n ≥ x] ≈ 1 − Φ((x − ½ − μ)/(σ/√n))

Delta method
Y_n ≈ N(μ, σ²/n) ⟹ ϕ(Y_n) ≈ N(ϕ(μ), (ϕ′(μ))² σ²/n)
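The sketch below (NumPy assumed; exponential data chosen arbitrarily as a skewed example) illustrates the CLT numerically: the standardized sample mean behaves like a N(0, 1) variable even though the data are far from normal.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma = 1.0, 1.0          # mean and standard deviation of Exp(1)
n, reps = 50, 20_000

x = rng.exponential(1.0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma   # Z_n = sqrt(n)(X̄_n - µ)/σ

# By the CLT, Z_n is approximately N(0, 1): mean ≈ 0, variance ≈ 1,
# and P[Z_n <= 1.96] ≈ Φ(1.96) ≈ 0.975.
print(z.mean(), z.var(), np.mean(z <= 1.96))
```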
11 Statistical Inference

Let X₁, …, X_n iid ∼ F if not otherwise noted.

11.1 Point Estimation
• Point estimator θ̂n of θ is a rv: θ̂n = g(X₁, …, X_n)
• bias(θ̂n) = E[θ̂n] − θ
• Consistency: θ̂n →P θ
• Sampling distribution: F(θ̂n)
• Standard error: se(θ̂n) = √V[θ̂n]
• Mean squared error: mse = E[(θ̂n − θ)²] = bias(θ̂n)² + V[θ̂n]
• lim_{n→∞} bias(θ̂n) = 0 and lim_{n→∞} se(θ̂n) = 0 ⟹ θ̂n is consistent
• Asymptotic normality: (θ̂n − θ)/se →D N(0, 1)
• Slutzky's theorem often lets us replace se(θ̂n) by some (weakly) consistent estimator σ̂n.

11.2 Normal-Based Confidence Interval
Suppose θ̂n ≈ N(θ, ŝe²). Let z_{α/2} = Φ⁻¹(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α where Z ∼ N(0, 1). Then
C_n = θ̂n ± z_{α/2} ŝe

11.3 Empirical Distribution
Empirical Distribution Function (ECDF):
F̂n(x) = Σ_{i=1}^n I(X_i ≤ x)/n,   where I(X_i ≤ x) = 1 if X_i ≤ x and 0 if X_i > x
Properties (for any fixed x):
• E[F̂n(x)] = F(x)
• V[F̂n(x)] = F(x)(1 − F(x))/n
• mse = F(x)(1 − F(x))/n → 0
• F̂n(x) →P F(x)
Dvoretzky–Kiefer–Wolfowitz (DKW) inequality (X₁, …, X_n ∼ F):
P[sup_x |F(x) − F̂n(x)| > ε] ≤ 2e^{−2nε²}
Nonparametric 1 − α confidence band for F:
L(x) = max{F̂n(x) − ε_n, 0},  U(x) = min{F̂n(x) + ε_n, 1},  ε_n = √((1/(2n)) log(2/α))
P[L(x) ≤ F(x) ≤ U(x) for all x] ≥ 1 − α

11.4 Statistical Functionals
• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂n = T(F̂n)
• Linear functional: T(F) = ∫ ϕ(x) dF_X(x)
• Plug-in estimator for a linear functional: T(F̂n) = ∫ ϕ(x) dF̂n(x) = (1/n) Σ_{i=1}^n ϕ(X_i)
• Often: T(F̂n) ≈ N(T(F), ŝe²) ⟹ T(F̂n) ± z_{α/2} ŝe
• pth quantile: F⁻¹(p) = inf{x : F(x) ≥ p}
• μ̂ = X̄n
• σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄n)²
• κ̂ = (1/n) Σ_{i=1}^n (X_i − μ̂)³ / σ̂³
• ρ̂ = Σ_{i=1}^n (X_i − X̄n)(Y_i − Ȳn) / (√(Σ_{i=1}^n (X_i − X̄n)²) √(Σ_{i=1}^n (Y_i − Ȳn)²))
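A minimal sketch (NumPy assumed) of the ECDF together with its DKW confidence band as defined above:

```python
import numpy as np

def ecdf_band(x, alpha=0.05):
    """Return sorted data, ECDF values, and the DKW 1-alpha band (L, U)."""
    x = np.sort(x)
    n = x.size
    Fhat = np.arange(1, n + 1) / n                 # ECDF at the order statistics
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))     # DKW half-width ε_n
    L = np.clip(Fhat - eps, 0, 1)
    U = np.clip(Fhat + eps, 0, 1)
    return x, Fhat, L, U

rng = np.random.default_rng(3)
x, Fhat, L, U = ecdf_band(rng.normal(size=200))
# With probability >= 0.95 the true CDF lies between L and U everywhere.
print(Fhat[:5], L[:5], U[:5])
```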
12 Parametric Inference

Let F = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊂ ℝᵏ and parameter θ = (θ₁, …, θ_k).

12.1 Method of Moments
jth moment: α_j(θ) = E[Xʲ] = ∫ xʲ dF_X(x)
jth sample moment: α̂_j = (1/n) Σ_{i=1}^n X_iʲ
Method of moments estimator (MoM): θ̂n solves the system
α₁(θ) = α̂₁,  α₂(θ) = α̂₂,  …,  α_k(θ) = α̂_k

Properties of the MoM estimator
• θ̂n exists with probability tending to 1
• Consistency: θ̂n →P θ
• Asymptotic normality: √n(θ̂ − θ) →D N(0, Σ), where Σ = g E[YYᵀ] gᵀ, Y = (X, X², …, Xᵏ)ᵀ, g = (g₁, …, g_k) and g_j = ∂α_j⁻¹(θ)/∂θ

12.2 Maximum Likelihood
Likelihood: L_n : Θ → [0, ∞), L_n(θ) = ∏_{i=1}^n f(X_i; θ)
Log-likelihood: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)
Maximum likelihood estimator (mle): L_n(θ̂n) = sup_θ L_n(θ)
Score function: s(X; θ) = ∂ log f(X; θ)/∂θ
Fisher information: I(θ) = V_θ[s(X; θ)],  I_n(θ) = nI(θ)
Fisher information (exponential family): I(θ) = E_θ[−∂s(X; θ)/∂θ]
Observed Fisher information: I_nᵒᵇˢ(θ) = −(∂²/∂θ²) Σ_{i=1}^n log f(X_i; θ)

Properties of the mle
• Consistency: θ̂n →P θ
• Equivariance: θ̂n is the mle ⟹ ϕ(θ̂n) is the mle of ϕ(θ)
• Asymptotic normality:
  1. se ≈ √(1/I_n(θ)) and (θ̂n − θ)/se →D N(0, 1)
  2. ŝe ≈ √(1/I_n(θ̂n)) and (θ̂n − θ)/ŝe →D N(0, 1)
• Asymptotic optimality (efficiency): smallest variance for large samples. If θ̃n is any other estimator, the asymptotic relative efficiency is are(θ̃n, θ̂n) = V[θ̂n]/V[θ̃n] ≤ 1
• The mle is approximately the Bayes estimator

12.2.1 Delta Method
If τ = ϕ(θ) where ϕ is differentiable and ϕ′(θ) ≠ 0:
(τ̂n − τ)/ŝe(τ̂) →D N(0, 1)
where τ̂ = ϕ(θ̂) is the mle of τ and ŝe(τ̂) = |ϕ′(θ̂)| ŝe(θ̂n)
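For concreteness, here is a sketch of the delta method for the log-odds τ = log(p/(1 − p)) of a Bernoulli parameter, using the mle p̂ and ŝe(p̂) = √(p̂(1 − p̂)/n) from the Fisher information (NumPy assumed; the data are simulated and the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p_true = 500, 0.3
x = rng.binomial(1, p_true, size=n)

p_hat = x.mean()                              # mle of p
se_p = np.sqrt(p_hat * (1 - p_hat) / n)       # 1/sqrt(I_n(p̂))

tau_hat = np.log(p_hat / (1 - p_hat))         # mle of τ = ϕ(p) (equivariance)
dphi = 1 / (p_hat * (1 - p_hat))              # ϕ'(p̂)
se_tau = abs(dphi) * se_p                     # delta method: ŝe(τ̂) = |ϕ'(p̂)| ŝe(p̂)

print(tau_hat, se_tau, tau_hat - 1.96 * se_tau, tau_hat + 1.96 * se_tau)
```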
12.3 Multiparameter Models

Let θ = (θ₁, …, θ_k) and θ̂ = (θ̂₁, …, θ̂_k) be the mle.
H_jj = ∂²ℓ_n/∂θ_j²,  H_jk = ∂²ℓ_n/∂θ_j∂θ_k
Fisher information matrix: I_n(θ) = −[E_θ[H_jk]]_{j,k=1,…,k}, i.e. the matrix with entries −E_θ[H_jk].
Under appropriate regularity conditions, (θ̂ − θ) ≈ N(0, J_n) with J_n(θ) = I_n⁻¹. Further, if θ̂_j is the jth component of θ̂, then
(θ̂_j − θ_j)/ŝe_j →D N(0, 1)
where ŝe_j² = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k).

12.3.1 Multiparameter Delta Method
Let τ = ϕ(θ₁, …, θ_k) and let the gradient of ϕ be ∇ϕ = (∂ϕ/∂θ₁, …, ∂ϕ/∂θ_k)ᵀ.
Suppose ∇ϕ evaluated at θ = θ̂ is nonzero and τ̂ = ϕ(θ̂). Then
(τ̂ − τ)/ŝe(τ̂) →D N(0, 1)
where ŝe(τ̂) = √((∇̂ϕ)ᵀ Ĵ_n (∇̂ϕ)), Ĵ_n = J_n(θ̂) and ∇̂ϕ = ∇ϕ evaluated at θ = θ̂.

12.4 Parametric Bootstrap
Sample from f(x; θ̂n) instead of from F̂n, where θ̂n could be the mle or the method of moments estimator.

13 Hypothesis Testing

H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁

Definitions
• Null hypothesis H₀
• Alternative hypothesis H₁
• Simple hypothesis: θ = θ₀
• Composite hypothesis: θ > θ₀ or θ < θ₀
• Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
• One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ₁} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ₀} β(θ)

              Retain H₀             Reject H₀
H₀ true       correct               Type I error (α)
H₁ true       Type II error (β)     correct (power)

p-value
• p-value = sup_{θ∈Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• p-value = sup_{θ∈Θ₀} P_θ[T(X*) ≥ T(X)] = inf{α : T(X) ∈ R_α}, where P_θ[T(X*) ≥ T(X)] = 1 − F_θ(T(X)) since T(X*) ∼ F_θ

p-value       evidence
< 0.01        very strong evidence against H₀
0.01 – 0.05   strong evidence against H₀
0.05 – 0.1    weak evidence against H₀
> 0.1         little or no evidence against H₀

Wald test
• Two-sided test
• Reject H₀ when |W| > z_{α/2} where W = (θ̂ − θ₀)/ŝe
• P[|W| > z_{α/2}] → α
• p-value = P_{θ₀}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)
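A small worked Wald test (NumPy and the standard library only; the null value θ₀ = 0 and the simulated data are arbitrary):

```python
import numpy as np
from math import erf, sqrt

def phi_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

rng = np.random.default_rng(5)
x = rng.normal(loc=0.2, scale=1.0, size=100)

theta0 = 0.0
theta_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(x.size)

W = (theta_hat - theta0) / se_hat
p_value = 2 * phi_cdf(-abs(W))          # p-value ≈ 2Φ(−|w|)
print(W, p_value, abs(W) > 1.96)        # reject H0 at α = 0.05 if |W| > z_{0.025} ≈ 1.96
```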
Likelihood ratio test (LRT)
• T(X) = sup_{θ∈Θ} L_n(θ) / sup_{θ∈Θ₀} L_n(θ) = L_n(θ̂n)/L_n(θ̂n,0)
• λ(X) = 2 log T(X) →D χ²_{r−q}, where χ²_k is the distribution of Σ_{i=1}^k Z_i² with Z₁, …, Z_k iid ∼ N(0, 1)
• p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT
• mle: p̂n = (X₁/n, …, X_k/n)
• T(X) = L_n(p̂n)/L_n(p₀) = ∏_{j=1}^k (p̂_j/p₀_j)^{X_j}
• λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p₀_j) →D χ²_{k−1}
• The approximate size α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}

Pearson Chi-square Test
• T = Σ_{j=1}^k (X_j − E[X_j])²/E[X_j] where E[X_j] = np₀_j under H₀
• T →D χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Converges in distribution to χ²_{k−1} faster than the LRT, hence preferable for small n

Independence testing
• I rows, J columns, X multinomial sample of size n = I·J
• mles unconstrained: p̂_ij = X_ij/n
• mles under H₀: p̂₀_ij = p̂_i· p̂_·j = (X_i·/n)(X_·j/n)
• LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_ij log(nX_ij/(X_i· X_·j))
• Pearson chi-square: T = Σ_{i=1}^I Σ_{j=1}^J (X_ij − E[X_ij])²/E[X_ij]
• LRT and Pearson →D χ²_ν, where ν = (I − 1)(J − 1)

14 Exponential Family

Scalar parameter:
f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x)g(θ) exp{η(θ)T(x)}
Vector parameter:
f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x)g(θ) exp{η(θ)·T(x)}
Natural form:
f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x)g(η) exp{η·T(x)} = h(x)g(η) exp{ηᵀT(x)}

15 Bayesian Inference

Bayes' Theorem:
f(θ | x) = f(x | θ)f(θ)/f(xⁿ) = f(x | θ)f(θ)/∫ f(x | θ)f(θ) dθ ∝ L_n(θ)f(θ)

Definitions
• Xⁿ = (X₁, …, X_n),  xⁿ = (x₁, …, x_n)
• Prior density f(θ)
• Likelihood f(xⁿ | θ): joint density of the data. In particular, Xⁿ iid ⟹ f(xⁿ | θ) = ∏_{i=1}^n f(x_i | θ) = L_n(θ)
• Posterior density f(θ | xⁿ)
• Normalizing constant c_n = f(xⁿ) = ∫ f(x | θ)f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄n = ∫ θ f(θ | xⁿ) dθ = ∫ θL_n(θ)f(θ) dθ / ∫ L_n(θ)f(θ) dθ

15.1 Credible Intervals
Posterior interval: P[θ ∈ (a, b) | xⁿ] = ∫_a^b f(θ | xⁿ) dθ = 1 − α
Equal-tail credible interval: ∫_{−∞}^a f(θ | xⁿ) dθ = ∫_b^∞ f(θ | xⁿ) dθ = α/2
Highest posterior density (HPD) region R_n:
  1. P[θ ∈ R_n] = 1 − α
  2. R_n = {θ : f(θ | xⁿ) > k} for some k
If R_n is unimodal, then R_n is an interval.

15.2 Function of Parameters
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.
Posterior CDF for τ: H(τ | xⁿ) = P[ϕ(θ) ≤ τ | xⁿ] = ∫_A f(θ | xⁿ) dθ
Posterior density: h(τ | xⁿ) = H′(τ | xⁿ)
Bayesian delta method: τ | Xⁿ ≈ N(ϕ(θ̂), ŝe ϕ′(θ̂))

15.3 Priors
Choice
• Subjective Bayesianism
• Objective Bayesianism
• Robust Bayesianism
Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^∞ f(θ) dθ = 1
• Improper: ∫_{−∞}^∞ f(θ) dθ = ∞
• Jeffrey's prior (transformation-invariant): f(θ) ∝ √I(θ), and in the multiparameter case f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | xⁿ) belong to the same parametric family

15.3.1 Conjugate Priors
Continuous likelihood (subscript c denotes a constant); each entry lists likelihood, conjugate prior, and posterior hyperparameters:
• Unif(0, θ) — Pareto(x_m, k) — Pareto(max{x₍ₙ₎, x_m}, k + n)
• Exp(λ) — Gamma(α, β) — Gamma(α + n, β + Σ_{i=1}^n x_i)
• N(μ, σ_c²) (unknown mean) — N(μ₀, σ₀²) — mean (μ₀/σ₀² + Σ_{i=1}^n x_i/σ_c²)/(1/σ₀² + n/σ_c²), variance (1/σ₀² + n/σ_c²)⁻¹
• N(μ_c, σ²) (unknown variance) — Scaled Inverse Chi-square(ν, σ₀²) — Scaled Inverse Chi-square(ν + n, (νσ₀² + Σ_{i=1}^n (x_i − μ_c)²)/(ν + n))
• N(μ, σ²) (both unknown) — Normal-scaled Inverse Gamma(λ, ν, α, β) — (νλ + nx̄)/(ν + n), ν + n, α + n/2, β + ½Σ_{i=1}^n (x_i − x̄)² + nν(x̄ − λ)²/(2(ν + n))
• MVN(μ, Σ_c) — MVN(μ₀, Σ₀) — covariance (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹, mean (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹(Σ₀⁻¹μ₀ + nΣ_c⁻¹x̄)
• MVN(μ_c, Σ) — Inverse-Wishart(κ, Ψ) — Inverse-Wishart(n + κ, Ψ + Σ_{i=1}^n (x_i − μ_c)(x_i − μ_c)ᵀ)
• Pareto(x_mc, k) — Gamma(α, β) — Gamma(α + n, β + Σ_{i=1}^n log(x_i/x_mc))
• Pareto(x_m, k_c) — Pareto(x₀, k₀) — Pareto(x₀, k₀ − k_c n), where k₀ > k_c n
• Gamma(α_c, β) — Gamma(α₀, β₀) — Gamma(α₀ + nα_c, β₀ + Σ_{i=1}^n x_i)
Discrete likelihood; each entry lists likelihood, conjugate prior, and posterior hyperparameters:
• Bern(p) — Beta(α, β) — Beta(α + Σ_{i=1}^n x_i, β + n − Σ_{i=1}^n x_i)
• Bin(p) — Beta(α, β) — Beta(α + Σ_{i=1}^n x_i, β + Σ_{i=1}^n N_i − Σ_{i=1}^n x_i)
• NBin(p) — Beta(α, β) — Beta(α + rn, β + Σ_{i=1}^n x_i)
• Po(λ) — Gamma(α, β) — Gamma(α + Σ_{i=1}^n x_i, β + n)
• Multinomial(p) — Dir(α) — Dir(α + Σ_{i=1}^n x⁽ⁱ⁾)
• Geo(p) — Beta(α, β) — Beta(α + n, β + Σ_{i=1}^n x_i)

15.4 Bayesian Testing
If H₀ : θ ∈ Θ₀:
Prior probability P[H₀] = ∫_{Θ₀} f(θ) dθ
Posterior probability P[H₀ | xⁿ] = ∫_{Θ₀} f(θ | xⁿ) dθ
Let H₀, …, H_{K−1} be K hypotheses and suppose θ ∼ f(θ | H_k). Then
P[H_k | xⁿ] = f(xⁿ | H_k)P[H_k] / Σ_{k=1}^K f(xⁿ | H_k)P[H_k]
Marginal likelihood: f(xⁿ | H_i) = ∫_Θ f(xⁿ | θ, H_i)f(θ | H_i) dθ
Posterior odds (of H_i relative to H_j):
P[H_i | xⁿ]/P[H_j | xⁿ] = (f(xⁿ | H_i)/f(xⁿ | H_j)) × (P[H_i]/P[H_j]) = Bayes factor BF_ij × prior odds
Bayes factor interpretation:
log₁₀ BF₁₀    BF₁₀        evidence
0 – 0.5       1 – 3.2     Weak
0.5 – 1       3.2 – 10    Moderate
1 – 2         10 – 100    Strong
> 2           > 100       Decisive
p* = (p/(1 − p)) BF₁₀ / (1 + (p/(1 − p)) BF₁₀), where p = P[H₁] and p* = P[H₁ | xⁿ]

16 Sampling Methods

16.1 Inverse Transform Sampling
Setup: U ∼ Unif(0, 1), X ∼ F, F⁻¹(u) = inf{x | F(x) ≥ u}
Algorithm:
  1. Generate u ∼ Unif(0, 1)
  2. Compute x = F⁻¹(u)

16.2 The Bootstrap
Let T_n = g(X₁, …, X_n) be a statistic.
  1. Estimate V_F[T_n] with V_{F̂n}[T_n].
  2. Approximate V_{F̂n}[T_n] using simulation:
     (a) Repeat the following B times to get T*_{n,1}, …, T*_{n,B}, an iid sample from the sampling distribution implied by F̂n:
         i. Sample uniformly X₁*, …, X_n* ∼ F̂n.
         ii. Compute T_n* = g(X₁*, …, X_n*).
     (b) Then v_boot = V̂_{F̂n} = (1/B) Σ_{b=1}^B (T*_{n,b} − (1/B) Σ_{r=1}^B T*_{n,r})²
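A minimal non-parametric bootstrap sketch (NumPy assumed) estimating the standard error of the sample median, following the algorithm above with T_n = median:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.exponential(2.0, size=80)      # observed sample
B = 2000

T_star = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=x.size, replace=True)   # sample uniformly from F̂_n
    T_star[b] = np.median(resample)                       # T*_{n,b}

v_boot = T_star.var()                  # bootstrap estimate of V[T_n]
se_boot = np.sqrt(v_boot)
print(np.median(x), se_boot)
```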
16.2.1 Bootstrap Confidence Intervals
Normal-based interval: T_n ± z_{α/2} ŝe_boot
Pivotal interval:
  1. Location parameter θ = T(F)
  2. Pivot R_n = θ̂n − θ
  3. Let H(r) = P[R_n ≤ r] be the cdf of R_n
  4. Let R*_{n,b} = θ̂*_{n,b} − θ̂n. Approximate H using the bootstrap: Ĥ(r) = (1/B) Σ_{b=1}^B I(R*_{n,b} ≤ r)
  5. θ*_β = β sample quantile of (θ̂*_{n,1}, …, θ̂*_{n,B})
  6. r*_β = β sample quantile of (R*_{n,1}, …, R*_{n,B}), i.e., r*_β = θ*_β − θ̂n
  7. Approximate 1 − α confidence interval C_n = (â, b̂) where
     â = θ̂n − Ĥ⁻¹(1 − α/2) = θ̂n − r*_{1−α/2} = 2θ̂n − θ*_{1−α/2}
     b̂ = θ̂n − Ĥ⁻¹(α/2) = θ̂n − r*_{α/2} = 2θ̂n − θ*_{α/2}
Percentile interval: C_n = (θ*_{α/2}, θ*_{1−α/2})

16.3 Rejection Sampling
Setup:
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportionality constant: h(θ) = k(θ)/∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) for all θ
Algorithm:
  1. Draw θ_cand ∼ g(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ_cand if u ≤ k(θ_cand)/(M g(θ_cand))
  4. Repeat until B values of θ_cand have been accepted
Example:
• We can easily sample from the prior g(θ) = f(θ)
• The target is the posterior h(θ) ∝ k(θ) = f(xⁿ | θ)f(θ)
• Envelope condition: f(xⁿ | θ) ≤ f(xⁿ | θ̂n) = L_n(θ̂n) ≡ M
• Algorithm:
  1. Draw θ_cand ∼ f(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ_cand if u ≤ L_n(θ_cand)/L_n(θ̂n)

16.4 Importance Sampling
Sample from an importance function g rather than the target density h.
Algorithm to obtain an approximation to E[q(θ) | xⁿ]:
  1. Sample from the prior: θ₁, …, θ_B iid ∼ f(θ)
  2. Compute the weights w_i = L_n(θ_i)/Σ_{i=1}^B L_n(θ_i) for all i = 1, …, B
  3. E[q(θ) | xⁿ] ≈ Σ_{i=1}^B q(θ_i)w_i
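A sketch of rejection sampling (NumPy assumed): the target is known only up to a constant, k(θ) = θ(1 − θ)⁴ (a Beta(2, 5) kernel chosen for illustration), the proposal g is Unif(0, 1), and any M ≥ max_θ k(θ) satisfies the envelope condition.

```python
import numpy as np

rng = np.random.default_rng(4)

def k(theta):
    return theta * (1 - theta) ** 4          # unnormalized target density

M = k(np.linspace(0, 1, 1001)).max()         # envelope constant for g = Unif(0, 1)

samples = []
while len(samples) < 10_000:
    theta = rng.uniform()                    # 1. draw candidate from g
    u = rng.uniform()                        # 2. draw u ~ Unif(0, 1)
    if u <= k(theta) / M:                    # 3. accept with probability k(θ)/(M g(θ))
        samples.append(theta)

samples = np.array(samples)
print(samples.mean(), 2 / 7)                 # Beta(2, 5) has mean 2/7 ≈ 0.286
```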
17 Decision Theory

Definitions
• Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂; L : Θ × A → [−k, ∞).

Loss functions
• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0, K₂(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K₁ = K₂)
• Lᵖ loss: L(θ, a) = |θ − a|ᵖ
• Zero-one loss: L(θ, a) = 0 if a = θ, 1 if a ≠ θ

17.1 Risk
Posterior risk: r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]
(Frequentist) risk: R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]
Bayes risk: r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility
• θ̂′ dominates θ̂ if R(θ, θ̂′) ≤ R(θ, θ̂) for all θ, and R(θ, θ̂′) < R(θ, θ̂) for at least one θ
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it; otherwise it is called admissible.

17.3 Bayes Rule
Bayes rule (or Bayes estimator):
• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• θ̂(x) = inf r(θ̂ | x) for all x ⟹ r(f, θ̂) = ∫ r(θ̂ | x)f(x) dx
Theorems
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules
Maximum risk: R̄(θ̂) = sup_θ R(θ, θ̂),  R̄(a) = sup_θ R(θ, a)
Minimax rule: sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)
θ̂ = Bayes rule and R(θ, θ̂) = c for some constant c ⟹ θ̂ is minimax
Least favorable prior: θ̂_f = Bayes rule and R(θ, θ̂_f) ≤ r(f, θ̂_f) for all θ ⟹ θ̂_f is minimax

18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression
Model: Y_i = β₀ + β₁X_i + ε_i,  E[ε_i | X_i] = 0,  V[ε_i | X_i] = σ²
Fitted line: r̂(x) = β̂₀ + β̂₁x
Predicted (fitted) values: Ŷ_i = r̂(X_i)
Residuals: ε̂_i = Y_i − Ŷ_i = Y_i − (β̂₀ + β̂₁X_i)
Residual sum of squares (rss): rss(β̂₀, β̂₁) = Σ_{i=1}^n ε̂_i²
Least squares estimates: β̂ᵀ = (β̂₀, β̂₁)ᵀ minimizes rss:
β̂₀ = Ȳn − β̂₁X̄n
β̂₁ = Σ_{i=1}^n (X_i − X̄n)(Y_i − Ȳn)/Σ_{i=1}^n (X_i − X̄n)² = (Σ_{i=1}^n X_iY_i − nX̄Ȳ)/(Σ_{i=1}^n X_i² − nX̄²)
E[β̂ | Xⁿ] = (β₀, β₁)ᵀ
V[β̂ | Xⁿ] = (σ²/(n s_X²)) [ n⁻¹Σ_{i=1}^n X_i²   −X̄n ;  −X̄n   1 ]
ŝe(β̂₀) = (σ̂/(s_X√n)) √(Σ_{i=1}^n X_i²/n)
ŝe(β̂₁) = σ̂/(s_X√n)
where s_X² = n⁻¹Σ_{i=1}^n (X_i − X̄n)² and σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂_i² (an unbiased estimate).
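The closed-form least squares estimates and standard errors above, in a short sketch (NumPy assumed; the data are simulated from Y = 1 + 2X + ε purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
X = rng.uniform(0, 5, size=n)
Y = 1.0 + 2.0 * X + rng.normal(0, 1.0, size=n)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

resid = Y - (b0 + b1 * X)
sigma2 = np.sum(resid ** 2) / (n - 2)          # unbiased estimate of σ²
s2_X = np.mean((X - X.mean()) ** 2)

se_b1 = np.sqrt(sigma2) / (np.sqrt(s2_X) * np.sqrt(n))     # σ̂/(s_X √n)
se_b0 = se_b1 * np.sqrt(np.mean(X ** 2))                   # σ̂/(s_X √n) · √(ΣX²/n)
print(b0, b1, se_b0, se_b1)
```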
Further properties:
• Consistency: β̂₀ →P β₀ and β̂₁ →P β₁
• Asymptotic normality: (β̂₀ − β₀)/ŝe(β̂₀) →D N(0, 1) and (β̂₁ − β₁)/ŝe(β̂₁) →D N(0, 1)
• Approximate 1 − α confidence intervals for β₀ and β₁: β̂₀ ± z_{α/2} ŝe(β̂₀) and β̂₁ ± z_{α/2} ŝe(β̂₁)
• Wald test for H₀ : β₁ = 0 vs. H₁ : β₁ ≠ 0: reject H₀ if |W| > z_{α/2} where W = β̂₁/ŝe(β̂₁)

R²:
R² = Σ_{i=1}^n (Ŷ_i − Ȳ)²/Σ_{i=1}^n (Y_i − Ȳ)² = 1 − Σ_{i=1}^n ε̂_i²/Σ_{i=1}^n (Y_i − Ȳ)² = 1 − rss/tss

Likelihood:
L = ∏_{i=1}^n f(X_i, Y_i) = ∏_{i=1}^n f_X(X_i) × ∏_{i=1}^n f_{Y|X}(Y_i | X_i) = L₁ × L₂
L₁ = ∏_{i=1}^n f_X(X_i)
L₂ = ∏_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ⁻ⁿ exp{−(1/(2σ²)) Σ_i (Y_i − (β₀ + β₁X_i))²}
Under the assumption of normality, the least squares parameter estimators are also the MLEs, but the least squares variance estimator is not the MLE; the MLE is
σ̂² = (1/n) Σ_{i=1}^n ε̂_i²

18.2 Prediction
Observe X = x* of the covariate and want to predict the outcome Y*.
Ŷ* = β̂₀ + β̂₁x*
V[Ŷ*] = V[β̂₀] + x*²V[β̂₁] + 2x* Cov[β̂₀, β̂₁]
Prediction interval: ξ̂n² = σ̂²(Σ_{i=1}^n (X_i − x*)²/(n Σ_i (X_i − X̄)²) + 1), and Ŷ* ± z_{α/2} ξ̂n

18.3 Multiple Regression
Y = Xβ + ε, where X is the n × k design matrix with rows (X_{i1}, …, X_{ik}), β = (β₁, …, β_k)ᵀ, and ε = (ε₁, …, ε_n)ᵀ.
Likelihood: L(μ, Σ) = (2πσ²)^{−n/2} exp(−rss/(2σ²))
rss = (y − Xβ)ᵀ(y − Xβ) = ‖Y − Xβ‖² = Σ_{i=1}^N (Y_i − x_iᵀβ)²
If the (k × k) matrix XᵀX is invertible:
β̂ = (XᵀX)⁻¹XᵀY
V[β̂ | Xⁿ] = σ²(XᵀX)⁻¹
β̂ ≈ N(β, σ²(XᵀX)⁻¹)
Estimate of the regression function: r̂(x) = Σ_{j=1}^k β̂_j x_j
Unbiased estimate for σ²: σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂_i²,  ε̂ = Xβ̂ − Y
mle: μ̂ = X̄,  σ̂²_mle = ((n − k)/n) σ̂²
1 − α confidence interval for β_j: β̂_j ± z_{α/2} ŝe(β̂_j)
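A sketch of the multiple-regression estimator β̂ = (XᵀX)⁻¹XᵀY (NumPy assumed; an intercept column is added explicitly, and `lstsq` is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])  # design matrix with intercept
beta_true = np.array([0.5, 1.0, -2.0])
Y = X @ beta_true + rng.normal(0, 1.0, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)     # least squares solution of Y ≈ Xβ
resid = Y - X @ beta_hat
k = X.shape[1]
sigma2_hat = resid @ resid / (n - k)                 # unbiased estimate of σ²
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)       # V[β̂ | X] = σ²(XᵀX)⁻¹
print(beta_hat, np.sqrt(np.diag(cov_beta)))          # estimates and standard errors
```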
18.4 Model Selection

Consider predicting a new observation Y* for covariates X* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.
Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance
Procedure
  1. Assign a score to each model
  2. Search through all models to find the one with the highest score
Hypothesis testing: H₀ : β_j = 0 vs. H₁ : β_j ≠ 0 for all j ∈ J
Mean squared prediction error (mspe): mspe = E[(Ŷ(S) − Y*)²]
Prediction risk: R(S) = Σ_{i=1}^n mspe_i = Σ_{i=1}^n E[(Ŷ_i(S) − Y_i*)²]
Training error: R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²
R²: R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss
The training error is a downward-biased estimate of the prediction risk:
E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]
Adjusted R²: R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss
Mallow's Cp statistic: R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty
Akaike Information Criterion (AIC): AIC(S) = ℓ_n(β̂_S, σ̂_S²) − k
Bayesian Information Criterion (BIC): BIC(S) = ℓ_n(β̂_S, σ̂_S²) − (k/2) log n
Validation and training: R̂_V(S) = Σ_{i=1}^m (Ŷ_i*(S) − Y_i*)², where m = |{validation data}|, often n/4 or n/2
Leave-one-out cross-validation:
R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_{(i)})² = Σ_{i=1}^n ((Y_i − Ŷ_i(S))/(1 − U_ii(S)))²
U(S) = X_S(X_SᵀX_S)⁻¹X_Sᵀ   ("hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation
Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.
Integrated square error (ise): L(f, f̂n) = ∫ (f(x) − f̂n(x))² dx = J(h) + ∫ f²(x) dx
Frequentist risk: R(f, f̂n) = E[L(f, f̂n)] = ∫ b²(x) dx + ∫ v(x) dx, where b(x) = E[f̂n(x)] − f(x) and v(x) = V[f̂n(x)]
19.1.1 Histograms
Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du
Histogram estimator:
f̂n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)
E[f̂n(x)] = p_j/h
V[f̂n(x)] = p_j(1 − p_j)/(nh²)
R(f̂n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (1/n^{1/3}) (6/∫(f′(u))² du)^{1/3}
R*(f̂n, f) ≈ C/n^{2/3},  C = (3/4)^{2/3} (∫(f′(u))² du)^{1/3}
Cross-validation estimate of E[J(h)]:
Ĵ_CV(h) = ∫ f̂n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂_j²

19.1.2 Kernel Density Estimator (KDE)
Kernel K:
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ xK(x) dx = 0
• ∫ x²K(x) dx ≡ σ²_K > 0
KDE:
f̂n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)
R(f, f̂n) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
h* = c₁^{−2/5} c₂^{−1/5} c₃^{−1/5} n^{−1/5},  with c₁ = σ_K², c₂ = ∫K²(x) dx, c₃ = ∫(f″(x))² dx
R*(f, f̂n) = c₄/n^{4/5},  c₄ = (5/4)(σ_K²)^{2/5} (∫K²(x) dx)^{4/5} (∫(f″)² dx)^{1/5} ≡ C(K) (∫(f″)² dx)^{1/5}
Epanechnikov kernel:
K(x) = (3/(4√5))(1 − x²/5) for |x| < √5, 0 otherwise
Cross-validation estimate of E[J(h)]:
Ĵ_CV(h) = ∫ f̂n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) ≈ (1/(hn²)) Σ_{i=1}^n Σ_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)
K*(x) = K⁽²⁾(x) − 2K(x),  K⁽²⁾(x) = ∫ K(x − y)K(y) dy
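A from-scratch Gaussian KDE sketch matching the definition above (NumPy assumed; the bandwidth is fixed by hand here rather than chosen by cross-validation):

```python
import numpy as np

def kde(x_grid, data, h):
    """Kernel density estimate with a Gaussian kernel K and bandwidth h."""
    u = (x_grid[:, None] - data[None, :]) / h          # (x - X_i)/h for every grid point
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)     # Gaussian kernel
    return K.mean(axis=1) / h                          # (1/n) Σ K((x - X_i)/h) / h

rng = np.random.default_rng(6)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
grid = np.linspace(-6, 6, 200)
f_hat = kde(grid, data, h=0.4)
print(np.trapz(f_hat, grid))     # the estimate integrates to ≈ 1
```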
19.2 Non-parametric Regression
Estimate r(x), where r(x) = E[Y | X = x]. Consider pairs of points (x₁, Y₁), …, (x_n, Y_n) related by
Y_i = r(x_i) + ε_i,  E[ε_i] = 0,  V[ε_i] = σ²

k-nearest Neighbor Estimator:
r̂(x) = (1/k) Σ_{i : x_i ∈ N_k(x)} Y_i,  where N_k(x) = {k values of x₁, …, x_n closest to x}

Nadaraya–Watson Kernel Estimator:
r̂(x) = Σ_{i=1}^n w_i(x)Y_i,  w_i(x) = K((x − x_i)/h)/Σ_{j=1}^n K((x − x_j)/h) ∈ [0, 1]
R(r̂n, r) ≈ (h⁴/4)(∫ x²K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ² ∫ K²(x) dx)/(nh) ∫ dx/f(x)
h* ≈ c₁/n^{1/5},  R*(r̂n, r) ≈ c₂/n^{4/5}
Cross-validation estimate of E[J(h)]:
Ĵ_CV(h) = Σ_{i=1}^n (Y_i − r̂_{(−i)}(x_i))² = Σ_{i=1}^n (Y_i − r̂(x_i))²/(1 − K(0)/Σ_{j=1}^n K((x_i − x_j)/h))²

19.3 Smoothing Using Orthogonal Functions
Approximation: r(x) = Σ_{j=1}^∞ β_jφ_j(x) ≈ Σ_{j=1}^J β_jφ_j(x)
Multivariate regression: Y = Φβ + η, where η_i = ε_i and Φ is the matrix with entries Φ_ij = φ_j(x_i), i = 1, …, n, j = 0, …, J
Least squares estimator: β̂ = (ΦᵀΦ)⁻¹ΦᵀY ≈ (1/n)ΦᵀY   (for equally spaced observations only)
Cross-validation estimate of E[J(h)]: R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_{j=1}^J φ_j(x_i)β̂_{j,(−i)})²

20 Stochastic Processes

Stochastic process: {X_t : t ∈ T}, with index set T = {0, ±1, …} = ℤ (discrete) or T = [0, ∞) (continuous)
• Notations: X_t, X(t)
• State space X
• Index set T

20.1 Markov Chains
Markov chain: P[X_n = x | X₀, …, X_{n−1}] = P[X_n = x | X_{n−1}] for all n ∈ T, x ∈ X
Transition probabilities:
p_ij ≡ P[X_{n+1} = j | X_n = i]
p_ij(n) ≡ P[X_{m+n} = j | X_m = i]   (n-step)
Transition matrix P (n-step: P_n):
• (i, j) element is p_ij
• p_ij ≥ 0
• Σ_j p_ij = 1 (rows sum to one)
Chapman–Kolmogorov:
p_ij(m + n) = Σ_k p_ik(m) p_kj(n)
P_{m+n} = P_m P_n
P_n = P × ⋯ × P = Pⁿ
Marginal probability:
μ_n = (μ_n(1), …, μ_n(N)) where μ_n(i) = P[X_n = i]
μ₀ = initial distribution
μ_n = μ₀Pⁿ
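A short sketch of the n-step marginal μ_n = μ₀Pⁿ for a two-state chain (NumPy assumed; the transition matrix is made up purely for illustration):

```python
import numpy as np

P = np.array([[0.9, 0.1],      # hypothetical transition matrix: rows sum to 1
              [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])     # start in state 1 with probability 1

mu10 = mu0 @ np.linalg.matrix_power(P, 10)   # µ_n = µ_0 P^n with n = 10
print(mu10, mu10.sum())                      # a probability vector (sums to 1)
```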
20.2 Poisson Processes
Poisson process:
• {X_t : t ∈ [0, ∞)} = number of events up to and including time t
• X₀ = 0
• Independent increments: for all t₀ < ⋯ < t_n: X_{t₁} − X_{t₀} ⊥ ⋯ ⊥ X_{t_n} − X_{t_{n−1}}
• Intensity function λ(t):
  – P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  – P[X_{t+h} − X_t = 2] = o(h)
• X_{s+t} − X_s ∼ Po(m(s + t) − m(s)), where m(t) = ∫₀ᵗ λ(s) ds
Homogeneous Poisson process: λ(t) ≡ λ ⟹ X_t ∼ Po(λt), λ > 0
Waiting times: W_t := time at which X_t occurs; W_t ∼ Gamma(t, 1/λ)
Interarrival times: S_t = W_{t+1} − W_t; S_t ∼ Exp(1/λ)

21 Time Series

Mean function: μ_{xt} = E[x_t] = ∫_{−∞}^∞ x f_t(x) dx
Autocovariance function:
γ_x(s, t) = E[(x_s − μ_s)(x_t − μ_t)] = E[x_s x_t] − μ_sμ_t
γ_x(t, t) = E[(x_t − μ_t)²] = V[x_t]
Autocorrelation function (ACF):
ρ(s, t) = Cov[x_s, x_t]/√(V[x_s]V[x_t]) = γ(s, t)/√(γ(s, s)γ(t, t))
Cross-covariance function (CCV): γ_xy(s, t) = E[(x_s − μ_{xs})(y_t − μ_{yt})]
Cross-correlation function (CCF): ρ_xy(s, t) = γ_xy(s, t)/√(γ_x(s, s)γ_y(t, t))
Backshift operator: Bᵏ(x_t) = x_{t−k}
Difference operator: ∇ᵈ = (1 − B)ᵈ
White noise:
• w_t ∼ wn(0, σ_w²)
• Gaussian: w_t iid ∼ N(0, σ_w²)
• E[w_t] = 0 for all t ∈ T
• V[w_t] = σ² for all t ∈ T
• γ_w(s, t) = 0 for s ≠ t, s, t ∈ T
Random walk:
• Drift δ
• x_t = δt + Σ_{j=1}^t w_j
• E[x_t] = δt
Symmetric moving average:
m_t = Σ_{j=−k}^k a_j x_{t−j},  where a_j = a_{−j} ≥ 0 and Σ_{j=−k}^k a_j = 1
21.1 Stationary Time Series
Strictly stationary:
P[x_{t₁} ≤ c₁, …, x_{tₖ} ≤ cₖ] = P[x_{t₁+h} ≤ c₁, …, x_{tₖ+h} ≤ cₖ]   for all k ∈ ℕ and all tₖ, cₖ, h ∈ ℤ
Weakly stationary:
• E[x_t²] < ∞ for all t ∈ ℤ
• E[x_t] = m for all t ∈ ℤ
• γ_x(s, t) = γ_x(s + r, t + r) for all r, s, t ∈ ℤ
Autocovariance function:
• γ(h) = E[(x_{t+h} − μ)(x_t − μ)] for all h ∈ ℤ
• γ(0) = E[(x_t − μ)²]
• γ(0) ≥ 0
• γ(0) ≥ |γ(h)|
• γ(h) = γ(−h)
Autocorrelation function (ACF):
ρ_x(h) = Cov[x_{t+h}, x_t]/√(V[x_{t+h}]V[x_t]) = γ(t + h, t)/√(γ(t + h, t + h)γ(t, t)) = γ(h)/γ(0)
Jointly stationary time series:
γ_xy(h) = E[(x_{t+h} − μ_x)(y_t − μ_y)],  ρ_xy(h) = γ_xy(h)/√(γ_x(0)γ_y(0))
Linear process:
x_t = μ + Σ_{j=−∞}^∞ ψ_j w_{t−j},  where Σ_{j=−∞}^∞ |ψ_j| < ∞
γ(h) = σ_w² Σ_{j=−∞}^∞ ψ_{j+h}ψ_j

21.2 Estimation of Correlation
Sample mean: x̄ = (1/n) Σ_{t=1}^n x_t
Variance of the sample mean: V[x̄] = (1/n) Σ_{h=−n}^n (1 − |h|/n) γ_x(h)
Sample autocovariance function: γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)
Sample autocorrelation function: ρ̂(h) = γ̂(h)/γ̂(0)
Sample cross-covariance function: γ̂_xy(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ)
Sample cross-correlation function: ρ̂_xy(h) = γ̂_xy(h)/√(γ̂_x(0)γ̂_y(0))
Properties:
• σ_{ρ̂_x(h)} = 1/√n if x_t is white noise
• σ_{ρ̂_xy(h)} = 1/√n if x_t or y_t is white noise

21.3 Non-Stationary Time Series
Classical decomposition model: x_t = μ_t + s_t + w_t
• μ_t = trend
• s_t = seasonal component
• w_t = random noise term
21.3.1 Detrending
Least squares:
  1. Choose a trend model, e.g., μ_t = β₀ + β₁t + β₂t²
  2. Minimize rss to obtain the trend estimate μ̂_t = β̂₀ + β̂₁t + β̂₂t²
  3. The residuals then estimate the noise w_t
Moving average:
• The low-pass filter v_t is a symmetric moving average m_t with a_j = 1/(2k + 1): v_t = (1/(2k + 1)) Σ_{i=−k}^k x_{t−i}
• If (1/(2k + 1)) Σ_{i=−k}^k w_{t−j} ≈ 0, a linear trend function μ_t = β₀ + β₁t passes without distortion
Differencing:
• μ_t = β₀ + β₁t ⟹ ∇x_t = β₁

21.4 ARIMA Models
Autoregressive polynomial: φ(z) = 1 − φ₁z − ⋯ − φ_p z^p,  z ∈ ℂ and φ_p ≠ 0
Autoregressive operator: φ(B) = 1 − φ₁B − ⋯ − φ_p B^p
Autoregressive model of order p, AR(p): x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t ⟺ φ(B)x_t = w_t
AR(1):
• x_t = φᵏ(x_{t−k}) + Σ_{j=0}^{k−1} φʲ(w_{t−j}) → Σ_{j=0}^∞ φʲ w_{t−j} as k → ∞ when |φ| < 1 (a linear process)
• E[x_t] = Σ_{j=0}^∞ φʲ E[w_{t−j}] = 0
• γ(h) = Cov[x_{t+h}, x_t] = σ_w² φʰ/(1 − φ²)
• ρ(h) = γ(h)/γ(0) = φʰ
• ρ(h) = φρ(h − 1), h = 1, 2, …
Moving average polynomial: θ(z) = 1 + θ₁z + ⋯ + θ_q z^q,  z ∈ ℂ and θ_q ≠ 0
Moving average operator: θ(B) = 1 + θ₁B + ⋯ + θ_q B^q
MA(q) (moving average model of order q):
x_t = w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q} ⟺ x_t = θ(B)w_t
E[x_t] = Σ_{j=0}^q θ_j E[w_{t−j}] = 0
γ(h) = Cov[x_{t+h}, x_t] = σ_w² Σ_{j=0}^{q−h} θ_jθ_{j+h} for 0 ≤ h ≤ q, and 0 for h > q
MA(1): x_t = w_t + θw_{t−1}
γ(h) = (1 + θ²)σ_w² for h = 0, θσ_w² for h = 1, 0 for h > 1
ρ(h) = θ/(1 + θ²) for h = 1, 0 for h > 1
ARMA(p, q): x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q} ⟺ φ(B)x_t = θ(B)w_t
Partial autocorrelation function (PACF):
• x_i^{h−1}: regression of x_i on {x_{h−1}, x_{h−2}, …, x₁}
• φ_hh = corr(x_h − x_h^{h−1}, x₀ − x₀^{h−1}) for h ≥ 2
• E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)
ARIMA(p, d, q): ∇ᵈx_t = (1 − B)ᵈx_t is ARMA(p, q), i.e., φ(B)(1 − B)ᵈx_t = θ(B)w_t
Exponentially Weighted Moving Average (EWMA): x_t = x_{t−1} + w_t − λw_{t−1}
x_t = Σ_{j=1}^∞ (1 − λ)λ^{j−1}x_{t−j} + w_t when |λ| < 1
x̃_{n+1} = (1 − λ)x_n + λx̃_n
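To connect the AR(1) formulas with the estimators of §21.2, the sketch below (NumPy assumed; φ and the sample size are arbitrary) simulates an AR(1) process and compares the sample ACF ρ̂(h) with the theoretical ρ(h) = φʰ:

```python
import numpy as np

def sample_acf(x, max_lag):
    x = x - x.mean()
    n = len(x)
    gamma = lambda h: np.sum(x[h:] * x[:n - h]) / n     # γ̂(h) as defined in §21.2
    g0 = gamma(0)
    return np.array([gamma(h) / g0 for h in range(max_lag + 1)])

rng = np.random.default_rng(12)
phi, n = 0.7, 5000
w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]       # AR(1): x_t = φ x_{t-1} + w_t

print(sample_acf(x, 5))                # compare with φ^h = 1, 0.7, 0.49, 0.343, ...
```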
Seasonal ARIMA
• Denoted by ARIMA(p, d, q) × (P, D, Q)_s
• Φ_P(Bˢ)φ(B)∇_sᴰ∇ᵈx_t = δ + Θ_Q(Bˢ)θ(B)w_t

21.4.1 Causality and Invertibility
ARMA(p, q) is causal (future-independent) ⟺ there exist {ψ_j} with Σ_{j=0}^∞ |ψ_j| < ∞ such that
x_t = Σ_{j=0}^∞ ψ_j w_{t−j} = ψ(B)w_t
ARMA(p, q) is invertible ⟺ there exist {π_j} with Σ_{j=0}^∞ |π_j| < ∞ such that
π(B)x_t = Σ_{j=0}^∞ π_j x_{t−j} = w_t
Properties:
• ARMA(p, q) causal ⟺ roots of φ(z) lie outside the unit circle; then ψ(z) = Σ_{j=0}^∞ ψ_j zʲ = θ(z)/φ(z), |z| ≤ 1
• ARMA(p, q) invertible ⟺ roots of θ(z) lie outside the unit circle; then π(z) = Σ_{j=0}^∞ π_j zʲ = φ(z)/θ(z), |z| ≤ 1

Behavior of the ACF and PACF for causal and invertible ARMA models:
         AR(p)                  MA(q)                  ARMA(p, q)
ACF      tails off              cuts off after lag q   tails off
PACF     cuts off after lag p   tails off              tails off

21.5 Spectral Analysis
Periodic process: x_t = A cos(2πωt + φ) = U₁cos(2πωt) + U₂sin(2πωt)
• Frequency index ω (cycles per unit time), period 1/ω
• Amplitude A
• Phase φ
• U₁ = A cos φ and U₂ = A sin φ, often normally distributed rv's
Periodic mixture: x_t = Σ_{k=1}^q (U_{k1}cos(2πω_k t) + U_{k2}sin(2πω_k t))
• U_{k1}, U_{k2}, for k = 1, …, q, are independent zero-mean rv's with variances σ_k²
• γ(h) = Σ_{k=1}^q σ_k² cos(2πω_k h)
• γ(0) = E[x_t²] = Σ_{k=1}^q σ_k²
Spectral representation of a periodic process:
γ(h) = σ² cos(2πω₀h) = (σ²/2)e^{−2πiω₀h} + (σ²/2)e^{2πiω₀h} = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω)
Spectral distribution function:
F(ω) = 0 for ω < −ω₀, σ²/2 for −ω₀ ≤ ω < ω₀, σ² for ω ≥ ω₀
• F(−∞) = F(−1/2) = 0
• F(∞) = F(1/2) = γ(0)
Spectral density:
f(ω) = Σ_{h=−∞}^∞ γ(h)e^{−2πiωh},  −1/2 ≤ ω ≤ 1/2
• Needs Σ_{h=−∞}^∞ |γ(h)| < ∞ ⟹ γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω, h = 0, ±1, …
• f(ω) ≥ 0
• f(ω) = f(−ω)
• f(ω) = f(1 − ω)
• γ(0) = V[x_t] = ∫_{−1/2}^{1/2} f(ω) dω
• White noise: f_w(ω) = σ_w²
• ARMA(p, q), φ(B)x_t = θ(B)w_t: f_x(ω) = σ_w² |θ(e^{−2πiω})|²/|φ(e^{−2πiω})|², where φ(z) = 1 − Σ_{k=1}^p φ_k zᵏ and θ(z) = 1 + Σ_{k=1}^q θ_k zᵏ
Discrete Fourier Transform (DFT): d(ω_j) = n^{−1/2} Σ_{t=1}^n x_t e^{−2πiω_j t}
Fourier/fundamental frequencies: ω_j = j/n
Inverse DFT: x_t = n^{−1/2} Σ_{j=0}^{n−1} d(ω_j)e^{2πiω_j t}
Periodogram: I(j/n) = |d(j/n)|²
Scaled periodogram:
P(j/n) = (4/n)I(j/n) = ((2/n) Σ_{t=1}^n x_t cos(2πtj/n))² + ((2/n) Σ_{t=1}^n x_t sin(2πtj/n))²

22 Math

22.1 Gamma Function
• Ordinary: Γ(s) = ∫₀^∞ t^{s−1}e^{−t} dt
• Upper incomplete: Γ(s, x) = ∫_x^∞ t^{s−1}e^{−t} dt
• Lower incomplete: γ(s, x) = ∫₀^x t^{s−1}e^{−t} dt
• Γ(α + 1) = αΓ(α) for α > 1
• Γ(n) = (n − 1)! for n ∈ ℕ
• Γ(1/2) = √π

22.2 Beta Function
• Ordinary: B(x, y) = B(y, x) = ∫₀¹ t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
• Incomplete: B(x; a, b) = ∫₀^x t^{a−1}(1 − t)^{b−1} dt
• Regularized incomplete: I_x(a, b) = B(x; a, b)/B(a, b) = Σ_{j=a}^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) xʲ(1 − x)^{a+b−1−j}   (for a, b ∈ ℕ)
• I₀(a, b) = 0,  I₁(a, b) = 1
• I_x(a, b) = 1 − I_{1−x}(b, a)

22.3 Series
Finite
• Σ_{k=1}^n k = n(n + 1)/2
• Σ_{k=1}^n (2k − 1) = n²
• Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6
• Σ_{k=1}^n k³ = (n(n + 1)/2)²
• Σ_{k=0}^n cᵏ = (c^{n+1} − 1)/(c − 1), c ≠ 1
Binomial
• Σ_{k=0}^n C(n, k) = 2ⁿ
• Σ_{k=0}^n C(r + k, k) = C(r + n + 1, n)
• Σ_{k=0}^n C(k, m) = C(n + 1, m + 1)
• Vandermonde's identity: Σ_{k=0}^r C(m, k)C(n, r − k) = C(m + n, r)
• Binomial theorem: Σ_{k=0}^n C(n, k)a^{n−k}bᵏ = (a + b)ⁿ
Infinite
• Σ_{k=0}^∞ pᵏ = 1/(1 − p) and Σ_{k=1}^∞ pᵏ = p/(1 − p) for |p| < 1
• Σ_{k=0}^∞ kp^{k−1} = (d/dp) Σ_{k=0}^∞ pᵏ = (d/dp)(1/(1 − p)) = 1/(1 − p)² for |p| < 1
• Σ_{k=0}^∞ C(r + k − 1, k) xᵏ = (1 − x)^{−r} for r ∈ ℕ⁺
• Σ_{k=0}^∞ C(α, k) pᵏ = (1 + p)^α for |p| < 1, α ∈ ℂ
22.4 Combinatorics

Sampling (choose k out of n):
• Ordered, without replacement: n!/(n − k)! = ∏_{i=0}^{k−1}(n − i)
• Ordered, with replacement: nᵏ
• Unordered, without replacement: C(n, k) = n!/(k!(n − k)!)
• Unordered, with replacement: C(n − 1 + k, k) = C(n − 1 + k, n − 1)

Stirling numbers, 2nd kind:
{n k} = k{n−1 k} + {n−1 k−1} for 1 ≤ k ≤ n;  {n 0} = 1 if n = 0, 0 otherwise

Partitions:
P_{n+k,k} = Σ_{i=1}^n P_{n,i};  P_{n,k} = 0 for k > n;  P_{n,0} = 0 for n ≥ 1, P_{0,0} = 1

Balls and urns (f : B → U, |B| = n, |U| = m; D = distinguishable, ¬D = indistinguishable). Number of mappings f that are arbitrary / injective / surjective / bijective:
• B : D, U : D — mⁿ; m!/(m − n)! if m ≥ n, else 0; m!{n m}; n! if m = n, else 0
• B : ¬D, U : D — C(m + n − 1, n); C(m, n); C(n − 1, m − 1); 1 if m = n, else 0
• B : D, U : ¬D — Σ_{k=1}^m {n k}; 1 if m ≥ n, else 0; {n m}; 1 if m = n, else 0
• B : ¬D, U : ¬D — Σ_{k=1}^m P_{n,k}; 1 if m ≥ n, else 0; P_{n,m}; 1 if m = n, else 0
References

[1] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[2] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[3] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[Figure: chart of univariate distribution relationships, courtesy Leemis and McQueston [1].]