
Probability and Statistics

Cheat Sheet

Copyright © Matthias Vallentin, 2010
[email protected]

February 27, 2011


This cheat sheet integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you send me an email. The most recent version of this document is available at http://bit.ly/probstat. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-based Confidence Interval
  11.3 Empirical Distribution Function
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Series
  22.2 Combinatorics
1 Distribution Overview
1.1 Discrete Distributions
For each distribution the table lists the CDF F_X(x), the PMF f_X(x), the mean E[X], the variance V[X], and the MGF M_X(s).

Uniform{a, . . . , b}
  F_X(x) = 0 for x < a;  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b;  1 for x > b
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2    V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = (e^{as} − e^{−(b+1)s}) / (s(b − a))

Bernoulli(p)
  F_X(x) = (1 − p)^{1−x}
  f_X(x) = p^x (1 − p)^{1−x}
  E[X] = p    V[X] = p(1 − p)
  M_X(s) = 1 − p + pe^s

Binomial(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np    V[X] = np(1 − p)
  M_X(s) = (1 − p + pe^s)^n

Multinomial(n, p)
  f_X(x) = n!/(x_1! · · · x_k!) p_1^{x_1} · · · p_k^{x_k}  with  Σ_{i=1}^k x_i = n
  E[X_i] = np_i    V[X_i] = np_i(1 − p_i)
  M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric(N, m, n)
  F_X(x) ≈ Φ((x − np)/√(np(1 − p)))  with p = m/N
  f_X(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N    V[X] = nm(N − n)(N − m) / (N²(N − 1))
  M_X(s): N/A

NegativeBinomial(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p    V[X] = r(1 − p)/p²
  M_X(s) = (p / (1 − (1 − p)e^s))^r

Geometric(p)
  F_X(x) = 1 − (1 − p)^x,  x ∈ N⁺
  f_X(x) = p(1 − p)^{x−1},  x ∈ N⁺
  E[X] = 1/p    V[X] = (1 − p)/p²
  M_X(s) = pe^s / (1 − (1 − p)e^s)

Poisson(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!
  f_X(x) = λ^x e^{−λ}/x!
  E[X] = λ    V[X] = λ
  M_X(s) = e^{λ(e^s − 1)}

[Figure: PMFs of the discrete distributions. Panels: Uniform (discrete) on {a, . . . , b}; Binomial with (n, p) = (40, 0.3), (30, 0.6), (25, 0.9); Geometric with p = 0.2, 0.5, 0.8; Poisson with λ = 1, 4, 10.]
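The table entries above can be spot-checked numerically. The following minimal sketch is an added illustration (assuming Python with scipy is available; the parameter values are arbitrary) that compares the Binomial, Geometric, and Poisson rows against library values.

    # Spot-check of a few rows of the discrete-distribution table (illustrative sketch).
    import math
    from scipy import stats

    n, p, lam = 10, 0.3, 4.0

    B = stats.binom(n, p)
    assert math.isclose(B.mean(), n * p)                      # E[X] = np
    assert math.isclose(B.var(), n * p * (1 - p))             # V[X] = np(1 - p)

    G = stats.geom(p)                                         # f(x) = p(1 - p)^(x-1), x = 1, 2, ...
    assert math.isclose(G.mean(), 1 / p)                      # E[X] = 1/p
    assert math.isclose(G.var(), (1 - p) / p**2)              # V[X] = (1 - p)/p^2

    P = stats.poisson(lam)
    assert math.isclose(P.pmf(3), lam**3 * math.exp(-lam) / math.factorial(3))
    assert math.isclose(P.mean(), lam) and math.isclose(P.var(), lam)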
1.2 Continuous Distributions
For each distribution the table lists the CDF F_X(x), the PDF f_X(x), the mean E[X], the variance V[X], and the MGF M_X(s).

Uniform(a, b)
  F_X(x) = 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12
  M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal(µ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^{x} φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)}
  E[X] = µ    V[X] = σ²
  M_X(s) = exp{µs + σ²s²/2}

Log-Normal(µ, σ²)
  F_X(x) = 1/2 + (1/2) erf[(ln x − µ)/(√2 σ)]
  f_X(x) = (1/(x√(2πσ²))) exp{−(ln x − µ)²/(2σ²)}
  E[X] = e^{µ + σ²/2}    V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal(µ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp{−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)}
  E[X] = µ    V[X] = Σ
  M_X(s) = exp{µᵀs + (1/2) sᵀΣs}

Chi-square(k)
  F_X(x) = γ(k/2, x/2) / Γ(k/2)
  f_X(x) = x^{k/2 − 1} e^{−x/2} / (2^{k/2} Γ(k/2))
  E[X] = k    V[X] = 2k
  M_X(s) = (1 − 2s)^{−k/2},  s < 1/2

Exponential(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β    V[X] = β²
  M_X(s) = 1/(1 − βs),  s < 1/β

Gamma(α, β)
  F_X(x) = γ(α, x/β) / Γ(α)
  f_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ    V[X] = αβ²
  M_X(s) = (1/(1 − βs))^α,  s < 1/β

InverseGamma(α, β)
  F_X(x) = Γ(α, β/x) / Γ(α)
  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) for α > 1    V[X] = β²/((α − 1)²(α − 2)) for α > 2
  M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet(α)
  f_X(x) = (Γ(Σ_{i=1}^k α_i) / ∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i / Σ_{i=1}^k α_i    V[X_i] = E[X_i](1 − E[X_i]) / (Σ_{i=1}^k α_i + 1)

Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}
  E[X] = α/(α + β)    V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^{∞} (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k)    V[X] = λ²Γ(1 + 2/k) − µ²,  with µ = E[X]
  M_X(s) = Σ_{n=0}^{∞} (s^n λ^n/n!) Γ(1 + n/k)

Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α for x ≥ x_m
  f_X(x) = α x_m^α / x^{α+1} for x ≥ x_m
  E[X] = αx_m/(α − 1) for α > 1    V[X] = x_m² α/((α − 1)²(α − 2)) for α > 2
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s),  s < 0

Notes: γ(s, x) in the CDF of the Gamma distribution denotes the lower incomplete gamma function, γ(s, x) = ∫_0^x t^{s−1} e^{−t} dt. I_x(a, b) in the CDF of the Beta distribution denotes the regularized incomplete beta function, I_x(a, b) = B(x; a, b)/B(a, b), where B(x; a, b) is the incomplete beta function B(x; a, b) = ∫_0^x t^{a−1} (1 − t)^{b−1} dt.
[Figure: PDFs of the continuous distributions. Panels: Uniform (continuous) on (a, b); Normal with (µ, σ²) = (0, 0.2), (0, 1), (0, 5), (−2, 0.5); Log-normal with (µ, σ²) = (0, 3), (2, 2), (0, 1), (0.5, 1), (0.25, 1), (0.125, 1); χ² with k = 1, . . . , 5; Exponential with β = 2, 1, 0.4; Gamma with (α, β) = (1, 2), (2, 2), (3, 2), (5, 1), (9, 0.5); InverseGamma with (α, β) = (1, 1), (2, 1), (3, 1), (3, 0.5); Beta with (α, β) = (0.5, 0.5), (5, 1), (1, 3), (2, 2), (2, 5); Weibull with (λ, k) = (1, 0.5), (1, 1), (1, 1.5), (1, 5); Pareto with x_m = 1 and α = 1, 2, 4.]
2 Probability Theory Law of Total Probability
n n
Definitions X G
P [B] = P [B|Ai ] P [Ai ] Ω= Ai
• Sample space Ω i=1 i=1

• Outcome (point or element) ω ∈ Ω Bayes’ Theorem


• Event A ⊆ Ω
n
• σ-algebra A P [B | Ai ] P [Ai ] G
P [Ai | B] = Pn Ω= Ai
1. ∅ ∈ A j=1 P [B | Aj ] P [Aj ] i=1
S∞
2. A1 , A2 , . . . , ∈ A =⇒ i=1 Ai ∈ A Inclusion-Exclusion Principle
3. A ∈ A =⇒ ¬A ∈ A
n n
r
[ X \
• Probability distribution P
X
(−1)r−1

Ai = A ij


1. P [A] ≥ 0 for every A i=1 r=1 i≤i1 <···<ir ≤n j=1

2. P [Ω] = 1
"∞ #
G X∞ 3 Random Variables
3. P Ai = P [Ai ]
i=1 i=1 Random Variable
• Probability space (Ω, A, P) X:Ω→R

Properties Probability Mass Function (PMF)

• P [∅] = 0 fX (x) = P [X = x] = P [{ω ∈ Ω : X(ω) = x}]


• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
Probability Density Function (PDF)
• P [¬A] = 1 − P [A]
b
• P [B] = P [A ∩ B] + P [¬A ∩ B]
Z
P [a ≤ X ≤ b] = f (x) dx
• P [Ω] = 1 P [∅] = 0 a
S T T S
• ¬( n An ) = n ¬An ¬( n An ) = n ¬An DeMorgan
S T Cumulative Distribution Function (CDF):
• P [ n An ] = 1 − P [ n ¬An ]
• P [A ∪ B] = P [A] + P [B] − P [A ∩ B] FX : R → [0, 1] FX (x) = P [X ≤ x]
=⇒ P [A ∪ B] ≤ P [A] + P [B] 1. Nondecreasing: x1 < x2 =⇒ F (x1 ) ≤ F (x2 )
• P [A ∪ B] = P [A ∩ ¬B] + P [¬A ∩ B] + P [A ∩ B] 2. Normalized: limx→−∞ = 0 and limx→∞ = 1
• P [A ∩ ¬B] = P [A] − P [A ∩ B] 3. Right-continuous: limy↓x F (y) = F (x)
Continuity of Probabilities
S∞ Z b
• A1 ⊂ A2 ⊂ . . . =⇒ limn→∞ P [An ] = P [A] where A = i=1 Ai P [a ≤ Y ≤ b | X = x] = fY |X (y | x)dy a≤b
T∞
• A1 ⊃ A2 ⊃ . . . =⇒ limn→∞ P [An ] = P [A] where A = i=1 Ai a
f (x, y)
Independence ⊥
⊥ fY |X (y | x) =
A⊥
⊥ B ⇐⇒ P [A ∩ B] = P [A] P [B] fX (x)
Independence
Conditional Probability
1. P [X ≤ x, Y ≤ y] = P [X ≤ x] P [Y ≤ y]
P [A ∩ B]
P [A | B] = if P [B] > 0 2. fX,Y (x, y) = fX (x)fY (y)
P [B] 6
Z
3.1 Transformations • E [XY ] = xyfX,Y (x, y) dFX (x) dFY (y)
X,Y
Transformation function
• E [ϕ(Y )] 6= ϕ(E [X]) (cf. Jensen inequality)
Z = ϕ(X)
• P [X ≥ Y ] = 0 =⇒ E [X] ≥ E [Y ] ∧ P [X = Y ] = 1 =⇒ E [X] = E [Y ]
Discrete X ∞
X • E [X] = P [X ≥ x]
fZ (z) = P [ϕ(X) = z] = P [{x : ϕ(x) = z}] = P X ∈ ϕ−1 (z) =
 
f (x) x=1
x∈ϕ−1 (z) Sample mean
n
Continuous 1X
X̄n = Xi
Z n i=1
FZ (z) = P [ϕ(X) ≤ z] = f (x) dx with Az = {x : ϕ(x) ≤ z} Conditional Expectation
Az Z
Special case if ϕ strictly monotone • E [Y | X = x] = yf (y | x) dy

d

dx 1 • E [X] = E [E [X | Y ]]
fZ (z) = fX (ϕ−1 (z)) ϕ−1 (z) = fX (x) = fX (x)
Z ∞
dz dz |J| • E[ϕ(X, Y ) | X = x] = ϕ(x, y)fY |X (y | x) dx
Z −∞

The Rule of the Lazy Statistician
• E [ϕ(Y, Z) | X = x] = ϕ(y, z)f(Y,Z)|X (y, z | x) dy dz
−∞
Z
E [Z] = ϕ(x) dFX (x) • E [Y + Z | X] = E [Y | X] + E [Z | X]
Z Z • E [ϕ(X)Y | X] = ϕ(X)E [Y | X]
E [IA (x)] = IA (x) dFX (x) = dFX (x) = P [X ∈ A] • E[Y | X] = c =⇒ Cov [X, Y ] = 0
A
Convolution
Z ∞ Z z
5 Variance
X,Y ≥0
• Z := X + Y fZ (z) = fX,Y (x, z − x) dx = fX,Y (x, z − x) dx Variance
−∞ 0
Z ∞ 2
    2
• Z := |X − Y | fZ (z) = 2 fX,Y (x, z + x) dx • V [X] = σX = E (X − E [X])2 = E X 2 − E [X]
" n # n
Z ∞ 0 Z ∞ X X X
X ⊥⊥ • V Xi = V [Xi ] + 2 Cov [Xi , Yj ]
• Z := fZ (z) = |x|fX,Y (x, xz) dx = xfx (x)fX (x)fY (xz) dx i=1 i=1
Y −∞ −∞ " n #
i6=j
X n
X
• V Xi = V [Xi ] iff Xi ⊥
⊥ Xj
4 Expectation i=1 i=1

Expectation Standard deviation p


X sd[X] = V [X] = σX


 xfX (x) X discrete Covariance
Z  x

• E [X] = µX = x dFX (x) = • Cov [X, Y ] = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ]
• Cov [X, a] = 0

 Z
 xfX (x) X continuous


• Cov [X, X] = V [X]
• P [X = c] = 1 =⇒ E [c] = c • Cov [X, Y ] = Cov [Y, X]
• E [cX] = c E [X] • Cov [aX, bY ] = abCov [X, Y ]
• E [X + Y ] = E [X] + E [Y ] • Cov [X + a, Y + b] = Cov [X, Y ]
 
Xn m
X n X
X m • limn→∞ Bin (n, p) = N (np, np(1 − p)) (n large, p far from 0 and 1)
• Cov  Xi , Yj  = Cov [Xi , Yj ]
i=1 j=1 i=1 j=1
Negative Binomial
• X ∼ NBin (1, p) = Geometric (p)
Correlation Pr
Cov [X, Y ] • X ∼ NBin (r, p) = i=1 Geometric (p)
ρ [X, Y ] = p P P
V [X] V [Y ] • Xi ∼ NBin (ri , p) =⇒ Xi ∼ NBin ( ri , p)
• X ∼ NBin (r, p) . Y ∼ Bin (s + r, p) =⇒ P [X ≤ s] = P [Y ≥ r]
Independence
Poisson
X⊥
⊥ Y =⇒ ρ [X, Y ] = 0 ⇐⇒ Cov [X, Y ] = 0 ⇐⇒ E [XY ] = E [X] E [Y ] n n
!
X X
• Xi ∼ Poisson (λi ) ∧ Xi ⊥
⊥ Xj =⇒ Xi ∼ Poisson λi
Sample variance i=1 i=1
n
1 X  
S2 = (Xi − X̄n )2
n n
n−1
X X λ i
i=1 • Xi ∼ Poisson (λi ) ∧ Xi ⊥
⊥ Xj =⇒ Xi Xj ∼ Bin  Xj , Pn 
j=1 j=1 j=1 λ j
Conditional Variance
    2 Exponential
• V [Y | X] = E (Y − E [Y | X])2 | X = E Y 2 | X − E [Y | X]
n
• V [Y ] = E [V [Y | X]] + V [E [Y | X]]
X
• Xi ∼ Exp (β) ∧ Xi ⊥
⊥ Xj =⇒ Xi ∼ Gamma (n, β)
i=1
• Memoryless property: P [X > x + y | X > y] = P [X > x]
6 Inequalities
Normal
Cauchy-Schwarz  
X−µ

2
E [XY ] ≤ E X 2 E Y 2
    • X ∼ N µ, σ 2 =⇒ ∼ N (0, 1)
σ
 
Markov • X ∼ N µ, σ ∧ Z = aX + b =⇒ Z ∼ N aµ + b, a2 σ 2
2
  
E [ϕ(X)] • X ∼ N µ1 , σ12 ∧ Y ∼ N µ2 , σ22 =⇒ X + Y ∼ N µ1 + µ2 , σ12 + σ22
P [ϕ(X) ≥ t] ≤
t
 
Xi ∼ N µi , σi2 =⇒
P P P 2
• X ∼N i µi , i σi
Chebyshev  i i
P [a < X ≤ b] = Φ b−µ − Φ a−µ

V [X] • σ σ
P [|X − E [X]| ≥ t] ≤
t2 • Φ(−x) = 1 − Φ(x) φ0 (x) = −xφ(x) φ00 (x) = (x2 − 1)φ(x)
Chernoff • Upper quantile of N (0, 1): zα = Φ−1 (1 − α)

 
P [X ≥ (1 + δ)µ] ≤ δ > −1 Gamma (distribution)
(1 + δ)1+δ
Jensen • X ∼ Gamma (α, β) ⇐⇒ X/β ∼ Gamma (α, 1)

E [ϕ(X)] ≥ ϕ(E [X]) ϕ convex • Gamma (α, β) ∼ i=1 Exp (β)
P P
• Xi ∼ Gamma (αi , β) ∧ Xi ⊥
⊥ Xj =⇒ i Xi ∼ Gamma ( i αi , β)
Z ∞
Γ(α)
7 Distribution Relationships • = xα−1 e−λx dx
λα 0
Binomial Gamma (function)
n R∞
X • Ordinary: Γ(s) = 0 ts−1 e−t dt
• Xi ∼ Bernoulli (p) =⇒ Xi ∼ Bin (n, p) R∞
i=1
• Upper incomplete: Γ(s, x) = x ts−1 e−t dt
Rx
• X ∼ Bin (n, p) , Y ∼ Bin (m, p) =⇒ X + Y ∼ Bin (n + m, p) • Lower incomplete: γ(s, x) = 0 ts−1 e−t dt
• limn→∞ Bin (n, p) = Poisson (np) (n large, p small) • Γ(α + 1) = αΓ(α) α>1
• Γ(n) = (n − 1)!

n∈N 9 Multivariate Distributions
• Γ(1/2) = π
9.1 Standard Bivariate Normal
Beta (distribution)
p
Let X, Y ∼ N (0, 1) ∧ X ⊥
⊥ Z with Y = ρX + 1 − ρ2 Z

1 Γ(α + β) α−1 Joint density


• xα−1 (1 − x)β−1 = x (1 − x)β−1  2
x + y 2 − 2ρxy

B(α, β) Γ(α)Γ(β) 1
f (x, y) = exp −
  B(α + k, β) α+k−1 2(1 − ρ2 )
p
• E Xk = = E X k−1
  2π 1 − ρ2
B(α, β) α+β+k−1
• Beta (1, 1) ∼ Unif (0, 1) Conditionals

(Y | X = x) ∼ N ρx, 1 − ρ2 (X | Y = y) ∼ N ρy, 1 − ρ2
 
and
Beta (function):
Independence
Z 1
x−1 y−1 Γ(x)Γ(y) X⊥
⊥ Y ⇐⇒ ρ = 0
• Ordinary: B(x, y) = B(y, x) = t (1 − t) dt =
0 Γ(x + y)
Z x
• Incomplete: B(x; a, b) = ta−1 (1 − t)b−1 dt 9.2 Bivariate Normal
0  
• Regularized incomplete: Let X ∼ N µx , σx2 and Y ∼ N µy , σy2 .
a+b−1
B(x; a, b) a,b∈N X (a + b − 1)!
xj (1 − x)a+b−1−j
 
Ix (a, b) = = 1 z
B(a, b) j!(a + b − 1 − j)! f (x, y) = exp −
2(1 − ρ2 )
p
j=a 2πσx σy 1 − ρ2
• I0 (a, b) = 0 I1 (a, b) = 1
" 2 2 #
• Ix (a, b) = 1 − I1−x (b, a) x − µx

y − µy

x − µx

y − µy
z= + − 2ρ
σx σy σx σy

Conditional mean and variance


8 Probability and Moment Generating Functions
σX
E [X | Y ] = E [X] + ρ (Y − E [Y ])
 
• GX (t) = E tX |t| < 1 σY
"∞ # ∞  
X (Xt)i E Xi
p
V [X | Y ] = σX 1 − ρ2
X
· ti
t Xt


• MX (t) = GX (e ) = E e =E =
i=0
i! i=0
i!
• P [X = 0] = GX (0) 9.3 Multivariate Normal
• P [X = 1] = G0X (0)
(i) Covariance Matrix Σ (Precision Matrix Σ−1 )
G (0)
• P [X = i] = X
i!
 
V [X1 ] · · · Cov [X1 , Xk ]
• E [X] = G0X (1− ) .. .. ..
Σ=
 
  (k) . . . 
• E X k = MX (0)
  Cov [Xk , X1 ] · · · V [Xk ]
X! (k)
• E = GX (1− )
(X − k)! If X ∼ N (µ, Σ),
2
• V [X] = G00X (1− ) + G0X (1− ) − (G0X (1− ))  
−n/2 −1/2 1
• GX (t) = GY (t) =⇒ X = Y
d
fX (x) = (2π) |Σ| exp − (x − µ)T Σ−1 (x − µ)
2 9
Properties Slutzky’s Theorem
1/2 D P D
• Z ∼ N (0, 1) ∧ X = µ + Σ Z =⇒ X ∼ N (µ, Σ) • Xn → X and Yn → c =⇒ Xn + Yn → X + c
D P D
• X ∼ N (µ, Σ) =⇒ Σ−1/2 (X − µ) ∼ N (0, 1) • Xn → X and Yn → c =⇒ Xn Yn → cX
D D D

• X ∼ N (µ, Σ) =⇒ AX ∼ N Aµ, AΣAT • In general: Xn → X and Yn → Y =⇒
6 Xn + Yn → X + Y

• X ∼ N (µ, Σ) ∧ a is vector of length k =⇒ aT X ∼ N aT µ, aT Σa
10.1 Law of Large Numbers (LLN)
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] < ∞.
10 Convergence
Weak (WLLN)
Let {X1 , X2 , . . .} be a sequence of rv’s and let X be another rv. Let Fn denote P
the cdf of Xn and let F denote the cdf of X. X̄n → µ as n → ∞
Strong (SLLN)
as
Types of Convergence X̄n → µ as n → ∞
D
1. In distribution (weakly, in law): Xn → X 10.2 Central Limit Theorem (CLT)
lim Fn (t) = F (t) ∀t where F continuous Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] = σ 2 .
n→∞

P X̄n − µ n(X̄n − µ) D
2. In probability: Xn → X Zn := q   = →Z where Z ∼ N (0, 1)
V X̄n σ
(∀ε > 0) lim P [|Xn − X| > ε] = 0
n→∞ lim P [Zn ≤ z] = Φ(z) z∈R
n→∞
as CLT Notations
3. Almost surely (strongly): Xn → X
h i h i Zn ≈ N (0, 1)
P lim Xn = X = P ω ∈ Ω : lim Xn (ω) = X(ω) = 1 σ2
 
n→∞ n→∞ X̄n ≈ N µ,
n
qm
4. In quadratic mean (L2 ): Xn → X σ2
 
X̄n − µ ≈ N 0,
n
lim E (Xn − X)2 = 0
 
√ 2

n→∞ n(X̄n − µ) ≈ N 0, σ

Relationships n(X̄n − µ)
≈ N (0, 1)
qm P D
n
• Xn → X =⇒ Xn → X =⇒ Xn → X
as P
• Xn → X =⇒ Xn → X
D P Continuity Correction
• Xn → X ∧ (∃c ∈ R) P [X = c] = 1 =⇒ Xn → X
x + 12 − µ
 
P P P  
• Xn →X ∧ Yn → Y =⇒ Xn + Yn → X + Y P X̄n ≤ x ≈ Φ √
qm qm qm σ/ n
• Xn →X ∧ Yn → Y =⇒ Xn + Yn → X + Y
x − 12 − µ
P P P
 
• Xn →X ∧ Yn → Y =⇒ Xn Yn → XY  
P X̄n ≥ x ≈ 1 − Φ √
P P
• Xn →X =⇒ ϕ(Xn ) → ϕ(X) σ/ n
D
• Xn → X =⇒ ϕ(Xn ) → ϕ(X)
D Delta Method
σ2 σ2
qm
   
• Xn → b ⇐⇒ limn→∞ E [Xn ] = b ∧ limn→∞ V [Xn ] = 0 Yn ≈ N µ, =⇒ ϕ(Yn ) ≈ N 0 2
ϕ(µ), (ϕ (µ))
qm
• X1 , . . . , Xn iid ∧ E [X] = µ ∧ V [X] < ∞ ⇐⇒ X̄n → µ n n
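A short simulation illustrates the CLT statement above. This is an added sketch (assuming numpy; the Exponential(β) model, n, and the number of replications are arbitrary choices): the standardized sample mean behaves like N(0, 1).

    # Monte Carlo illustration of the CLT (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    beta, n, reps = 2.0, 50, 20000                        # X_i ~ Exp(beta): mu = beta, sigma^2 = beta^2
    X = rng.exponential(beta, size=(reps, n))
    Zn = (X.mean(axis=1) - beta) / (beta / np.sqrt(n))    # Z_n = sqrt(n)(X_bar - mu)/sigma

    # P[Z_n <= 1] should be close to Phi(1) ~ 0.8413 for moderately large n.
    print((Zn <= 1.0).mean())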
11 Statistical Inference P
• F̂n → F (x)
iid
Let X1 , · · · , Xn ∼ F if not otherwise noted. Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality (X1 , . . . , Xn ∼ F )
 
2
11.1 Point Estimation P sup F (x) − F̂n (x) > ε = 2e−2nε

x
• Point estimator θbn of θ is a rv: θbn = g(X1 , . . . , Xn )
h i Nonparametric 1 − α confidence band for F
• bias(θbn ) = E θbn − θ
P L(x) = max{F̂n − n , 0}
• Consistency: θbn → θ
• Sampling distribution: F (θbn ) U (x) = min{F̂n + n , 1}
r h i s  
• Standard error: se(θn ) = V θbn
b 1 2
= log
h i h i 2n α
• Mean squared error: mse = E (θbn − θ)2 = bias(θbn )2 + V θbn
• limn→∞ bias(θbn ) = 0 ∧ limn→∞ se(θbn ) = 0 =⇒ θbn is consistent
P [L(x) ≤ F (x) ≤ U (x) ∀x] ≥ 1 − α
θbn − θ D
• Asymptotic normality: → N (0, 1)
se
• Slutzky’s Theorem often lets us replace se(θbn ) by some (weakly) consis- 11.4 Statistical Functionals
tent estimator σ
bn . • Statistical functional: T (F )
• Plug-in estimator of θ = T (F ) : θbn = T (F̂n )
11.2 Normal-based Confidence Interval •
R
Linear functional: T (F ) = ϕ(x) dFX (x)
 
b 2 . Let zα/2 = Φ−1 (1 − (α/2)), i.e., P Z > zα/2 = α/2 •
 
Suppose θbn ≈ N θ, se Plug-in estimator for linear functional:
 
and P −zα/2 < Z < zα/2 = 1 − α where Z ∼ N (0, 1). Then Z n
1X
T (F̂n ) =
ϕ(x) dFbn (x) = ϕ(Xi )
Cn = θbn ± zα/2 se
b n i=1
 
b 2 =⇒ T (F̂n ) ± zα/2 se
• Often: T (F̂n ) ≈ N T (F ), se
11.3 Empirical Distribution Function
b
• pth quantile: F −1 (p) = inf{x : F (x) ≥ p}
Empirical Distribution Function (ECDF)
• µ̂ = X̄n
Pn n
I(Xi ≤ x) 1 X
Fn (x) = i=1
b b2 =
• σ (Xi − X̄n )2
n n − 1 i=1
1
Pn 3
i=1 (Xi − µ̂)
(
1 Xi ≤ x n
• κ̂ =
I(Xi ≤ x) = b3 j
σ
0 Xi > x Pn
(Xi − X̄n )(Yi − Ȳn )
• ρ̂ = qP i=1 qP
Properties (for any fixed x) n
− 2 n
i=1 (X i X̄n ) i=1 (Yi − Ȳn )
h i
• E F̂n = F (x)
h i F (x)(1 − F (x)) 12 Parametric Inference
• V F̂n =
n 
F (x)(1 − F (x)) D Let F = f (x; θ : θ ∈ Θ be a parametric model with parameter space Θ ⊂ Rk
• mse = →0 and parameter θ = (θ1 , . . . , θk ).
n 11
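The ECDF and the DKW confidence band translate directly into code. A minimal sketch, added here for illustration (assuming numpy; the sample is arbitrary):

    # Empirical CDF with a nonparametric 1 - alpha DKW confidence band (illustrative sketch).
    import numpy as np

    def ecdf_with_dkw_band(x, alpha=0.05):
        x = np.sort(x)
        n = len(x)
        Fhat = np.arange(1, n + 1) / n                   # F_n evaluated at the order statistics
        eps = np.sqrt(np.log(2 / alpha) / (2 * n))       # DKW half-width epsilon_n
        lower = np.clip(Fhat - eps, 0, 1)                # L(x) = max(F_n - eps, 0)
        upper = np.clip(Fhat + eps, 0, 1)                # U(x) = min(F_n + eps, 1)
        return x, Fhat, lower, upper

    rng = np.random.default_rng(1)
    x, Fhat, lo, hi = ecdf_with_dkw_band(rng.normal(size=200))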
12.1 Method of Moments Fisher Information
I(θ) = Vθ [s(X; θ)]
j th moment Z
In (θ) = nI(θ)
αj (θ) = E X j = xj dFX (x)
 

Fisher Information (exponential family)


j th sample moment  
n
1X j ∂
α̂j = X I(θ) = Eθ − s(X; θ)
n i=1 i ∂θ

Method of Moments Estimator (MoM) Observed Fisher Information


n
∂2 X
α1 (θ) = α̂1 Inobs (θ) = − log f (Xi ; θ)
∂θ2 i=1
α2 (θ) = α̂2
.. .. Properties of the mle
.=.
P
αk (θ) = α̂k • Consistency: θbn → θ
• Equivariance: θbn is the mle =⇒ ϕ(θbn ) is the mle of ϕ(θ)
Properties of the MoM estimator • Asymptotic normality:

p
θbn exists with probability tending to 1 1. se ≈ 1/In (θ)
P
• Consistency: θbn → θ (θbn − θ) D
→ N (0, 1)
• Asymptotic normality: se

q
D
n(θb − θ) → N (0, Σ) b ≈ 1/In (θbn )
2. se
  (θbn − θ) D
where Σ = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T , → N (0, 1)
∂ −1 se
g = (g1 , . . . , gk ) and gj = ∂θ αj (θ)
b
• Asymptotic optimality (or efficiency), i.e., smallest variance for large sam-
ples. If θen is any other estimator, the asymptotic relative efficiency is
12.2 Maximum Likelihood
h i
Likelihood: Ln : Θ → [0, ∞) V θbn
are(θen , θbn ) = h i ≤ 1
n
Y V θen
Ln (θ) = f (Xi ; θ)
i=1
• Approximately the Bayes estimator
Log-likelihood
n
X 12.2.1 Delta Method
`n (θ) = log Ln (θ) = log f (Xi ; θ) b where ϕ is differentiable and ϕ0 (θ) 6= 0:
If τ = ϕ(θ)
i=1

Maximum Likelihood Estimator (mle) τn − τ ) D


(b
→ N (0, 1)
se(b
b τ)
Ln (θbn ) = sup Ln (θ)
θ
where τb = ϕ(θ)
b is the mle of τ and
Score Function
∂ b = ϕ0 (θ)
se se(
b θn )
b b
s(X; θ) = log f (X; θ)
∂θ 12
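As a concrete instance of the mle machinery, the added sketch below (assuming numpy; the Exponential(β) model and sample size are arbitrary) computes the mle β̂ = x̄ and a Fisher-information standard error, using I_n(β) = n/β² for this model.

    # mle and Fisher-information standard error for an Exponential(beta) sample (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(2)
    beta_true, n = 3.0, 500
    x = rng.exponential(beta_true, size=n)

    beta_hat = x.mean()            # mle: solves the score equation sum_i (x_i/b^2 - 1/b) = 0
    In = n / beta_hat**2           # I_n(beta) = n / beta^2, evaluated at the mle
    se_hat = 1 / np.sqrt(In)       # se ~ sqrt(1 / I_n(beta_hat)) = beta_hat / sqrt(n)

    # Approximate 95% confidence interval from the asymptotic normality of the mle.
    z = 1.959963984540054
    print(beta_hat, (beta_hat - z * se_hat, beta_hat + z * se_hat))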
12.3 Multiparameter Models 13 Hypothesis Testing
Let θ = (θ1 , . . . , θk ) and θb = (θb1 , . . . , θbk ) be the mle.
H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1
∂ 2 `n ∂ 2 `n
Hjj = Hjk = Definitions
∂θ2 ∂θj ∂θk
Fisher Information Matrix • Null hypothesis H0
• Alternative hypothesis H1
 
Eθ [H11 ] · · · Eθ [H1k ]
In (θ) = − 
 .. .. ..  • Simple hypothesis θ = θ0
. . . 
• Composite hypothesis θ > θ0 or θ < θ0
Eθ [Hk1 ] · · · Eθ [Hkk ]
• Two-sided test: H0 : θ = θ0 versus H1 : θ 6= θ0
Under appropriate regularity conditions • One-sided test: H0 : θ ≤ θ0 versus H1 : θ > θ0
(θb − θ) ≈ N (0, Jn ) • Critical value c
• Test statistic T
with Jn (θ) = In−1 . Further, if θbj is the j th component of θ, then • Rejection Region R = {x : T (x) > c}
• Power function β(θ) = P [X ∈ R]
(θbj − θj ) D
→ N (0, 1) • Power of a test: 1 − P [Type II error] = 1 − β = inf β(θ)
se
bj θ∈Θ1
h i • Test size: α = P [Type I error] = sup β(θ)
b 2j = Jn (j, j) and Cov θbj , θbk = Jn (j, k)
where se θ∈Θ0

Retain H0 Reject H0
12.3.1 Multiparameter Delta Method √
H0 true Type
√ I error (α)
Let τ = ϕ(θ1 , . . . , θk ) be a function and let the gradient of ϕ be H1 true Type II error (β) (power)
p-value
∂ϕ
 

 ∂θ1  • p-value = supθ∈Θ0 Pθ [T (X) ≥ T (x)] = inf α : T (x) ∈ Rα
 . 
 .. 
∇ϕ =  Pθ [T (X ? ) ≥ T (X)]

• p-value = supθ∈Θ0 = inf α : T (X) ∈ Rα

 ∂ϕ  | {z }
1−Fθ (T (X)) since T (X ? )∼Fθ
∂θk
p-value evidence
Suppose ∇ϕ θ=θb 6= 0 and τb = ϕ(θ).
b Then,
< 0.01 very strong evidence against H0
τ − τ) D
(b 0.01 − 0.05 strong evidence against H0
→ N (0, 1) 0.05 − 0.1 weak evidence against H0
se(b
b τ)
> 0.1 little or no evidence against H0
where r Wald Test
 T  
se(b
b τ) = ˆ
∇ϕ Jˆn ∇ϕ
ˆ
• Two-sided test
and Jˆn = Jn (θ) ˆ = ∇ϕ b. θb − θ0

b and ∇ϕ
θ=θ • Reject H0 when |W | > zα/2 where W =
  se
b
12.4 Parametric Bootstrap • P |W | > zα/2 → α
• p-value = Pθ0 [|W | > |w|] ≈ P [|Z| > |w|] = 2Φ(−|w|)
Sample from f (x; θbn ) instead of from F̂n , where θbn could be the mle or method
of moments estimator. Likelihood Ratio Test (LRT)
supθ∈Θ Ln (θ) Ln (θbn ) • X n = (X1 , . . . , Xn )
• T (X) = =
supθ∈Θ0 Ln (θ) Ln (θbn,0 ) • xn = (x1 , . . . , xn )
k • Prior density f (θ)
iid
D
X
• λ(X) = 2 log T (X) → χ2r−q where Zi2 ∼ χ2k with Z1 , . . . , Zk ∼ N (0, 1) • Likelihood f (xn | θ): joint density of the data
n
 i=1 Y
In particular, X n iid =⇒ f (xn | θ) = f (xi | θ) = Ln (θ)

• p-value = Pθ0 [λ(X) > λ(x)] ≈ P χ2r−q > λ(x)
i=1
Multinomial LRT • Posterior density f (θ | xn )
• Normalizing constant cn = f (xn ) = f (x | θ)f (θ) dθ
  R
X1 Xk
• Let p̂n = ,..., be the mle
n n • Kernel: part of a density that depends Ron θ
k  θLn (θ)f (θ)
• Posterior Mean θ̄n = θf (θ | xn ) dθ = R Ln (θ)f
Xj R
Ln (p̂n ) Y p̂j (θ) dθ
• T (X) = =
Ln (p0 ) j=1
p0j
Xk 
p̂j
 14.1 Credible Intervals
D
• λ(X) = 2 Xj log → χ2k−1
j=1
p 0j 1 − α Posterior Interval
• The approximate size α LRT rejects H0 when λ(X) ≥ χ2k−1,α Z b
n
P [θ ∈ (a, b) | x ] = f (θ | xn ) dθ = 1 − α
2
Pearson χ Test a

k
X (Xj − E [Xj ])2 1 − α Equal-tail Credible Interval
• T = where E [Xj ] = np0j under H0
j=1
E [Xj ] Z a Z ∞
n
D f (θ | x ) dθ = f (θ | xn ) dθ = α/2
• T → χ2k−1 −∞ b
 
• p-value = P χ2k−1 > T (x)
2D
1 − α Highest Posterior Density (HPD) region Rn
• Faster → Xk−1 than LRT, hence preferable for small n
1. P [θ ∈ Rn ] = 1 − α
Independence Testing
2. Rn = {θ : f (θ | xn ) > k} for some k
• I rows, J columns, X multinomial sample of size n = I ∗ J
X
• mles unconstrained: p̂ij = nij Rn is unimodal =⇒ Rn is an interval
X
• mles under H0 : p̂0ij = p̂i· p̂·j = Xni· n·j
PI PJ 
nX
 14.2 Function of Parameters
• LRT: λ = 2 i=1 j=1 Xij log Xi· Xij·j
PI PJ (X −E[X ])2 Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ }.
• Pearson χ2 : T = i=1 j=1 ijE[Xij ]ij
Posterior CDF for τ
D
• LRT and Pearson → χ2ν , where ν = (I − 1)(J − 1) Z
H(r | xn ) = P [ϕ(θ) ≤ τ | xn ] = f (θ | xn ) dθ
A
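A minimal sketch of the Wald test for a proportion and the Pearson χ² goodness-of-fit test from the hypothesis-testing section (added illustration, assuming numpy and scipy; the data and null values are arbitrary):

    # Wald test for a proportion and Pearson chi-square test (illustrative sketch).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    # Wald test of H0: p = 0.5 for Bernoulli data.
    x = rng.binomial(1, 0.55, size=400)
    p_hat = x.mean()
    se = np.sqrt(p_hat * (1 - p_hat) / len(x))
    W = (p_hat - 0.5) / se
    p_value_wald = 2 * stats.norm.sf(abs(W))             # p-value = 2 * Phi(-|w|)

    # Pearson chi-square: H0 specifies multinomial probabilities p0.
    p0 = np.array([0.2, 0.3, 0.5])
    counts = rng.multinomial(300, [0.25, 0.3, 0.45])
    expected = 300 * p0                                   # E[X_j] = n * p0j under H0
    T = ((counts - expected) ** 2 / expected).sum()
    p_value_pearson = stats.chi2.sf(T, df=len(p0) - 1)    # T ~ chi^2_{k-1} under H0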
14 Bayesian Inference
Posterior Density
Bayes’ Theorem h(τ | xn ) = H 0 (τ | xn )
f (x | θ)f (θ) f (x | θ)f (θ) Bayesian Delta Method
f (θ | x) = =R ∝ Ln (θ)f (θ)
f (xn ) f (x | θ)f (θ) dθ  
τ | X n ≈ N ϕ(θ),
b seb ϕ0 (θ)
b
Definitions

14.3 Priors Continuous likelihood (subscript c denotes constant)
Likelihood Conjugate Prior Posterior hyperparameters
Choice 
Uniform(0, θ) Pareto(xm , k) max x(n) , xm , k + n
n
• Subjective Bayesianism: prior should incorporate as much detail as possible Exponential(λ) Gamma(α, β) α + n, β +
X
xi
the research’s a priori knowledge — via prior elicitation. i=1
• Objective Bayesianism: prior should incorporate as little detail as possible  Pn   
µ0 i=1 xi 1 n
(non-informative prior). Normal(µ, σc2 ) Normal(µ0 , σ02 ) + / + 2 ,
σ2 σ2 σ02 σc
• Robust Bayesianism: consider various priors and determine sensitivity of  0 c−1
1 n
our inferences to changes in the prior. + 2
σ02 σc
Pn
νσ02 + i=1 (xi − µ)2
Types Normal(µc , σ 2 ) Scaled Inverse Chi- ν + n,
ν+n
square(ν, σ02 )
• Flat: f (θ) ∝ constant νλ + nx̄ n
R∞ Normal(µ, σ 2 ) Normal- , ν + n, α + ,
• Proper: −∞ f (θ) dθ = 1 ν+n 2
scaled Inverse n
γ(x̄ − λ)2
R∞
• Improper: −∞ f (θ) dθ = ∞ 1X 2
Gamma(λ, ν, α, β) β+ (xi − x̄) +
• Jeffreys’ prior (transformation-invariant): 2 i=1 2(n + γ)
−1
Σ−1 −1
Σ−1 −1

p p MVN(µ, Σc ) MVN(µ0 , Σ0 ) 0 + nΣc 0 µ0 + nΣ x̄ ,
f (θ) ∝ I(θ) f (θ) ∝ det(I(θ))
−1 −1
Σ−1

0 + nΣc
n
• Conjugate: f (θ) and f (θ | xn ) belong to the same parametric family X
MVN(µc , Σ) Inverse- n + κ, Ψ + (xi − µc )(xi − µc )T
Wishart(κ, Ψ) i=1
n
X xi
14.3.1 Conjugate Priors Pareto(xmc , k) Gamma(α, β) α + n, β + log
i=1
xm c
Discrete likelihood Pareto(xm , kc ) Pareto(x0 , k0 ) x0 , k0 − kn where k0 > kn
Xn
Likelihood Conjugate Prior Posterior hyperparameters Gamma(αc , β) Gamma(α0 , β0 ) α0 + nαc , β0 + xi
n n i=1
X X
Bernoulli(p) Beta(α, β) α+ xi , β + n − xi
i=1
Xn n
X
i=1
n
X
14.4 Bayesian Testing
Binomial(p) Beta(α, β) α+ xi , β + Ni − xi If H0 : θ ∈ Θ0 :
i=1 i=1 i=1
n
X
Z
Negative Binomial(p) Beta(α, β) α + rn, β + xi Prior probability P [H0 ] = f (θ) dθ
n
i=1 ZΘ0
Posterior probability P [H0 | xn ] = f (θ | xn ) dθ
X
Poisson(λ) Gamma(α, β) α+ xi , β + n
Θ0
i=1
n
X
Multinomial(p) Dirichlet(α) α+ x(i)
i=1 Let H0 , . . . , HK−1 be K hypotheses. Suppose θ ∼ f (θ | Hk ),
n
f (xn | Hk )P [Hk ]
X
Geometric(p) Beta(α, β) α + n, β + xi
P [Hk | xn ] = PK ,
n
k=1 f (x | Hk )P [Hk ]
i=1
Marginal Likelihood 1. Estimate VF [Tn ] with VF̂n [Tn ].
Z 2. Approximate VF̂n [Tn ] using simulation:
f (xn | Hi ) = f (xn | θ, Hi )f (θ | Hi ) dθ
∗ ∗
Θ (a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from
Posterior Odds (of Hi relative to Hj ) the sampling distribution implied by F̂n

P [Hi | xn ] f (xn | Hi ) P [Hi ] i. Sample uniformly X1∗ , . . . , Xn∗ ∼ F̂n .


= × ii. Compute Tn∗ = g(X1∗ , . . . , Xn∗ ).
P [Hj | xn ] f (xn | Hj ) P [Hj ]
(b) Then
| {z } | {z }
Bayes Factor BFij prior odds

B B
!2
Bayes Factor 1 X ∗ 1 X ∗
log10 BF10 BF10 evidence vboot = V̂F̂n = Tn,b − T
B B r=1 n,r
b=1
0 − 0.5 1 − 1.5 Weak
0.5 − 1 1.5 − 10 Moderate
1−2 10 − 100 Strong
16.1.1 Bootstrap Confidence Intervals
>2 > 100 Decisive
p
1−p BF 10 Normal-based Interval
p∗ = p where p = P [H1 ] and p∗ = P [H1 | xn ]
1 + 1−p BF10
Tn ± zα/2 se
ˆ boot

15 Exponential Family Pivotal Interval


Scalar parameter
1. Location parameter θ = T (F )
fX (x | θ) = h(x) exp {η(θ)T (x) − A(θ)} 2. Pivot Rn = θbn − θ
= h(x)g(θ) exp {η(θ)T (x)} 3. Let H(r) = P [Rn ≤ r] be the cdf of Rn
∗ ∗
4. Let Rn,b = θbn,b − θbn . Approximate H using bootstrap:
Vector parameter
( s
)
B
1 X
X
fX (x | θ) = h(x) exp ηi (θ)Ti (x) − A(θ) Ĥ(r) = ∗
I(Rn,b ≤ r)
i=1 B
b=1
= h(x) exp {η(θ) · T (x) − A(θ)}
= h(x)g(θ) exp {η(θ) · T (x)} 5. Let θβ∗ denote the β sample quantile of (θbn,1
∗ ∗
, . . . , θbn,B )
Natural form 6. Let rβ∗ denote the β sample quantile of (Rn,1
∗ ∗
, . . . , Rn,B ), i.e., rβ∗ = θβ∗ − θbn
 
fX (x | η) = h(x) exp {η · T(x) − A(η)} 7. Then, an approximate 1 − α confidence interval is Cn = â, b̂ with
= h(x)g(η) exp {η · T(x)}  α
= h(x)g(η) exp η T T(x) â = θbn − Ĥ −1 1 − = ∗
θbn − r1−α/2 = ∗
2θbn − θ1−α/2

2

b̂ = θbn − Ĥ −1 = ∗
θbn − rα/2 = ∗
2θbn − θα/2
2
16 Sampling Methods
Percentile Interval
16.1 The Bootstrap  
∗ ∗
Cn = θα/2 , θ1−α/2
Let Tn = g(X1 , . . . , Xn ) be a statistic.
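The bootstrap variance and the normal, pivotal, and percentile intervals above can be computed as follows (added sketch assuming numpy; the statistic here is the sample median, an arbitrary choice):

    # Nonparametric bootstrap for a statistic T_n, with three confidence intervals (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(2.0, size=100)
    theta_hat = np.median(x)

    B = 2000
    boot = np.array([np.median(rng.choice(x, size=len(x), replace=True)) for _ in range(B)])

    se_boot = boot.std(ddof=1)                              # sqrt(v_boot)
    z = 1.959963984540054
    normal_ci = (theta_hat - z * se_boot, theta_hat + z * se_boot)
    lo, hi = np.quantile(boot, [0.025, 0.975])
    percentile_ci = (lo, hi)
    pivotal_ci = (2 * theta_hat - hi, 2 * theta_hat - lo)   # (2*theta - theta*_{1-a/2}, 2*theta - theta*_{a/2})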
16.2 Rejection Sampling • Decision rule: synonymous for an estimator θb
• Action a ∈ A: possible value of the decision rule. In the estimation
Setup
context, the action is just an estimate of θ, θ(x).
b
• We can easily sample from g(θ) • Loss function L: consequences of taking action a when true state is θ or
• We want to sample from h(θ), but it is difficult discrepancy between θ and θ, b L : Θ × A → [−k, ∞).
k(θ)
• We know h(θ) up to proportional constant: h(θ) = R Loss functions
k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ • Squared error loss: L(θ, a) = (θ − a)2
(
K1 (θ − a) a − θ < 0
Algorithm • Linear loss: L(θ, a) =
K2 (a − θ) a − θ ≥ 0
1. Draw θcand ∼ g(θ) • Absolute error loss: L(θ, a) = |θ − a| (linear loss with K1 = K2 )
2. Generate u ∼ Unif (0, 1) • Lp loss: L(θ, a) = |θ − a|p
k(θcand ) (
3. Accept θcand if u ≤ 0 a=θ
M g(θcand ) • Zero-one loss: L(θ, a) =
1 a 6= θ
4. Repeat until B values of θcand have been accepted

Example 17.1 Risk


• We can easily sample from the prior g(θ) = f (θ) Posterior Risk
• Target is the posterior with h(θ) ∝ k(θ) = f (xn | θ)f (θ) Z h i
• Envelope condition: f (xn | θ) ≤ f (xn | θbn ) = Ln (θbn ) ≡ M r(θb | x) = L(θ, θ(x))f
b (θ | x) dθ = Eθ|X L(θ, θ(x))
b

• Algorithm
(Frequentist) Risk
1. Draw θcand ∼ f (θ)
Z
2. Generate u ∼ Unif (0, 1)
h i
R(θ, θ)
b = L(θ, θ(x))f
b (x | θ) dx = EX|θ L(θ, θ(X))
b
Ln (θcand )
3. Accept θcand if u ≤
Ln (θbn ) Bayes Risk
ZZ
16.3 Importance Sampling
h i
r(f, θ)
b = L(θ, θ(x))f
b (x, θ) dx dθ = Eθ,X L(θ, θ(X))
b
Sample from an importance function g rather than target density h.
Algorithm to obtain an approximation to E [q(θ) | xn ]:
h h ii h i
r(f, θ)
b = Eθ EX|θ L(θ, θ(X)
b = Eθ R(θ, θ)b
iid
1. Sample from the prior θ1 , . . . , θn ∼ f (θ)
h h ii h i
r(f, θ)
b = EX Eθ|X L(θ, θ(X)
b = EX r(θb | X)
Ln (θi )
2. For each i = 1, . . . , B, calculate wi = PB
i=1 Ln (θi )
n
PB 17.2 Admissibility
3. E [q(θ) | x ] ≈ i=1 q(θi )wi
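Both samplers reduce to a few lines. The added sketch below (assuming numpy) uses a Beta(2, 5) kernel with a Uniform(0, 1) proposal for rejection sampling, and prior draws with normalized likelihood weights for importance sampling; the models are arbitrary illustrations, not part of the original sheet.

    # Rejection sampling and importance sampling (illustrative sketches).
    import numpy as np

    rng = np.random.default_rng(5)

    # Rejection sampling: target kernel k(t) = t(1-t)^4 (Beta(2, 5) up to a constant), proposal g = Unif(0, 1).
    def rejection_sample(B=1000):
        M = 0.09                                   # envelope constant: k(t) <= M * g(t) = M on (0, 1)
        out = []
        while len(out) < B:
            t = rng.uniform()
            if rng.uniform() <= t * (1 - t) ** 4 / M:
                out.append(t)
        return np.array(out)

    # Importance sampling: E[q(theta) | x^n] ~ sum_i q(theta_i) w_i with w_i proportional to L_n(theta_i),
    # theta_i drawn from the (flat) prior, for Bernoulli data.
    x = rng.binomial(1, 0.7, size=50)
    theta = rng.uniform(size=5000)
    logL = x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta)
    w = np.exp(logL - logL.max())
    w /= w.sum()
    posterior_mean = (theta * w).sum()             # approximates (x.sum() + 1) / (len(x) + 2) for a flat prior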
• θb0 dominates θb if
∀θ : R(θ, θb0 ) ≤ R(θ, θ)
b
17 Decision Theory
∃θ : R(θ, θb0 ) < R(θ, θ)
b
Definitions
• θb is inadmissible if there is at least one other estimator θb0 that dominates
• Unknown quantity affecting our decision: θ ∈ Θ it. Otherwise it is called admissible.
17.3 Bayes Rule Residual Sums of Squares (rss)
Bayes Rule (or Bayes Estimator) n
X
rss(βb0 , βb1 ) = ˆ2i
• r(f, θ)
b = inf e r(f, θ)
θ
e
i=1
R
• θ(x) = inf r(θ | x) ∀x =⇒ r(f, θ)
b b b = r(θb | x)f (x) dx
Least Square Estimates
Theorems
βbT = (βb0 , βb1 )T : min rss
β
b0 ,β
b1
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode βb0 = Ȳn − βb1 X̄n
Pn Pn
(Xi − X̄n )(Yi − Ȳn ) i=1 Xi Yi − nX̄Y
17.4 Minimax Rules βb1 = i=1 Pn 2
= n
(X − X̄ ) 2
P 2
i=1 i n i=1 Xi − nX
Maximum Risk
 
β0
h i
R̄(θ)
b = sup R(θ, θ) R̄(a) = sup R(θ, a) E βb | X n =
b β1
θ θ
σ 2 n−1 ni=1 Xi2 −X n
h i  P 
Minimax Rule n
V β |X =
b
e = inf sup R(θ, θ)
b = inf R̄(θ)
sup R(θ, θ) e nsX −X n 1
θ θe θe θ
r Pn
2
σ i=1 Xi

b
se(
b βb0 ) =
θb = Bayes rule ∧ ∃c : R(θ, θ)
b =c sX n n
σ

b
Least Favorable Prior se(
b βb1 ) =
sX n
θbf = Bayes rule ∧ R(θ, θbf ) ≤ r(f, θbf ) ∀θ Pn Pn
where s2X = n−1 i=1 (Xi − X n )2 and σ
b2 = 1
n−2 ˆ2i
i=1  an (unbiased) estimate
of σ. Further properties:
18 Linear Regression
P P
• Consistency: βb0 → β0 and βb1 → β1
Definitions
• Asymptotic normality:
• Response variable Y
• Covariate X (aka predictor variable or feature) βb0 − β0 D βb1 − β1 D
→ N (0, 1) and → N (0, 1)
se(
b βb0 ) se(
b βb1 )
18.1 Simple Linear Regression
• Approximate 1 − α confidence intervals for β0 and β1 are
Model
Yi = β0 + β1 Xi + i E [i | Xi ] = 0, V [i | Xi ] = σ 2 βb0 ± zα/2 se(
b βb0 ) and βb1 ± zα/2 se(
b βb1 )
Fitted Line
rb(x) = βb0 + βb1 x • The Wald test for testing H0 : β1 = 0 vs. H1 : β1 6= 0 is: reject H0 if
|W | > zα/2 where W = βb1 /se(
b βb1 ).
Predicted (Fitted) Values
Ybi = rb(Xi ) R2
Pn b 2
Pn 2
Residuals i=1 (Yi − Y ) ˆ rss
2
= 1 − Pn i=1 i 2 = 1 −
 
ˆi = Yi − Ybi = Yi − βb0 + βb1 Xi R = Pn 2
i=1 (Yi − Y ) i=1 (Yi − Y )
tss
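The closed-form least-squares quantities of the simple linear regression section can be verified numerically; a minimal added sketch (assuming numpy; the data are simulated with arbitrary coefficients):

    # Simple linear regression via the closed-form least-squares formulas (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(6)
    n = 200
    X = rng.uniform(0, 10, size=n)
    Y = 1.5 + 0.8 * X + rng.normal(0, 1.0, size=n)

    b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
    b0 = Y.mean() - b1 * X.mean()

    resid = Y - (b0 + b1 * X)
    sigma2_hat = (resid ** 2).sum() / (n - 2)           # unbiased estimate of sigma^2
    sX = np.sqrt(((X - X.mean()) ** 2).mean())          # s_X with the 1/n convention
    se_b1 = np.sqrt(sigma2_hat) / (sX * np.sqrt(n))     # se(b1) = sigma_hat / (s_X sqrt(n))
    se_b0 = np.sqrt(sigma2_hat) * np.sqrt((X ** 2).mean()) / (sX * np.sqrt(n))

    R2 = 1 - (resid ** 2).sum() / ((Y - Y.mean()) ** 2).sum()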
Likelihood If the (k × k) matrix X T X is invertible,
n n n
Y Y Y βb = (X T X)−1 X T Y
L= f (Xi , Yi ) = fX (Xi ) × fY |X (Yi | Xi ) = L1 × L2 h i
i=1 i=1 i=1 V βb | X n = σ 2 (X T X)−1
n
βb ≈ N β, σ 2 (X T X)−1
Y 
L1 = fX (Xi )
i=1
n
(
2
) Estimate regression function
Y
−n 1 X
L2 = fY |X (Yi | Xi ) ∝ σ exp − 2 Yi − (β0 − β1 Xi ) k
2σ i X
i=1 rb(x) = βbj xj
j=1
Under the assumption of Normality, the least squares estimator is also the mle
2
Unbiased estimate for σ
n
1X 2 n
b2 =
σ ˆ 1 X 2
n i=1 i b2 =
σ ˆ ˆ = X βb − Y
n − k i=1 i

18.2 Prediction mle


n−k 2
µ
b = X̄ b2 =
σ σ
Observe X = x∗ of the covarite and want to predict their outcome Y∗ . n
1 − α Confidence Interval
Yb∗ = βb0 + βb1 x∗ βbj ± zα/2 se(
b βbj )
h i h i h i h i
V Yb∗ = V βb0 + x2∗ V βb1 + 2x∗ Cov βb0 , βb1
18.4 Model Selection
Prediction Interval  Pn Consider predicting a new observation Y ∗ for covariates X ∗ and let S ⊂ J
2

2 2 i=1 (Xi − X∗ ) denote a subset of the covariates in the model, where |S| = k and |J| = n.
ξn = σ
b P +1
n i (Xi − X̄)2 j
b
Issues
• Underfitting: too few covariates yields high bias
Yb∗ ± zα/2 ξbn
• Overfitting: too many covariates yields high variance

18.3 Multiple Regression Procedure


1. Assign a score to each model
Y = Xβ + 
2. Search through all models to find the one with the highest score
where       Hypothesis Testing
X11 ··· X1k β1 1
 .. ..  β =  ... 
..  ..  H0 : βj = 0 vs. H1 : βj 6= 0 ∀j ∈ J
X= . =.
 
. . 
Xn1 ··· Xnk βk n Mean Squared Prediction Error (mspe)
Likelihood h i

1
 mspe = E (Yb (S) − Y ∗ )2
2 −n/2
L(µ, Σ) = (2πσ ) exp − 2 rss

Prediction Risk
N
X n
X n
X h i
rss = (y − Xβ)T (y − Xβ) = ||Y − Xβ||2 = (Yi − xTi β)2 R(S) = mspei = E (Ybi (S) − Yi∗ )2
i=1 i=1 i=1 19
Training Error 19 Non-parametric Function Estimation
n
X
R
btr (S) = (Ybi (S) − Yi )2 19.1 Density Estimation
i=1 R
Estimate f (x), where f (x) = P [X ∈ A] = A f (x) dx.
2
R Integrated Square Error (ise)
Pn b 2
R i=1 (Yi (S) − Y )
rss(S) btr (S) Z  2 Z
R2 (S) = 1 − =1− =1− P n 2 L(f, fbn ) = f (x) − fbn (x) dx = J(h) + f 2 (x) dx
i=1 (Yi − Y )
tss tss

The training error is a downward-biased estimate of the prediction risk. Frequentist Risk
h i Z Z
h i R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx
E R btr (S) < R(S)
h i
h
i n
X h i b(x) = E fbn (x) − f (x)
bias(R btr (S) − R(S) = −2
btr (S)) = E R Cov Ybi , Yi h i
i=1 v(x) = V fbn (x)

Adjusted R2
19.1.1 Histograms
2 n − 1 rss
R (S) = 1 −
n − k tss Definitions
Mallow’s Cp statistic • Number of bins m
1
• Binwidth h = m
R(S)
b =R σ 2 = lack of fit + complexity penalty
btr (S) + 2kb • Bin Bj has νj observations
R
• Define pbj = νj /n and pj = Bj f (u) du
Akaike Information Criterion (AIC)
Histogram Estimator
m
AIC(S) = bS2 )
`n (βbS , σ −k X pbj
fbn (x) = I(x ∈ Bj )
j=1
h
Bayesian Information Criterion (BIC) h i p
j
E fbn (x) =
k h
bS2 ) − log n
BIC(S) = `n (βbS , σ h i p (1 − p )
j j
2 V fbn (x) =
nh2
h2
Z
Validation and Training 2 1
R(fn , f ) ≈
b (f 0 (u)) du +
12 nh
m
X n n !1/3
R
bV (S) = (Ybi∗ (S) − Yi∗ )2 m = |{validation data}|, often or ∗ 1 6
i=1
4 2 h = 1/3 R 2 du
n (f 0 (u))
 2/3 Z 1/3
Leave-one-out Cross-validation C 3 2
R∗ (fbn , f ) ≈ 2/3 C= (f 0 (u)) du
n n
!2 n 4
X
2
X Yi − Ybi (S)
R
bCV (S) = (Yi − Yb(i) ) = Cross-validation estimate of E [J(h)]
i=1 i=1
1 − Uii (S) Z n m
2 2Xb 2 n+1 X 2
JCV (h) = fn (x) dx −
b b f(−i) (Xi ) = − pb
U (S) = XS (XST XS )−1 XS (“hat matrix”) n i=1 (n − 1)h (n − 1)h j=1 j
19.1.2 Kernel Density Estimator (KDE) k-nearest Neighbor Estimator
Kernel K 1 X
rb(x) = Yi where Nk (x) = {k values of x1 , . . . , xn closest to x}
k
i:xi ∈Nk (x)
• K(x) ≥ 0
Nadaraya-Watson Kernel Estimator
R
• K(x) dx = 1
R
• xK(x) dx = 0 n
X
rb(x) = wi (x)Yi
R 2 2
• x K(x) dx ≡ σK >0
i=1
x−xi

KDE K
wi (x) = h  ∈ [0, 1]
n
Pn x−xj
j=1 K
 
1X1 x − Xi h
fbn (x) = K
n i=1 h h h4
Z 4 Z 
f 0 (x)
2
Z Z rn , r) ≈
R(b x2 K 2 (x) dx r00 (x) + 2r0 (x) dx
1 4 00 2 1 4 f (x)
R(f, fn ) ≈ (hσK )
b (f (x)) dx + K 2 (x) dx
4 nh σ 2 K 2 (x) dx
Z R
−2/5 −1/5 −1/5 + dx
nhf (x)
Z Z
c c2 c3
h∗ = 1 c1 = σ 2
K , c 2 = K 2
(x) dx, c 3 = (f 00 (x))2 dx c1
n1/5 h∗ ≈
Z 4/5 Z 1/5 n1/5
c4 5 2 2/5 c2
R∗ (f, fbn ) = 4/5 c4 = (σK ) K 2 (x) dx (f 00 )2 dx R∗ (b
rn , r) ≈ 4/5
n 4 n
| {z }
C(K)

Cross-validation estimate of E [J(h)]


Epanechnikov Kernel
n n
√ X X (Yi − rb(xi ))2
(Yi − rb(−i) (xi ))2 =
(
√ 3
|x| < 5 JbCV (h) = !2
K(x) = 4 5(1−x2 /5)
i=1 i=1 K(0)
0 otherwise 1− Pn  x−x 
j
j=1 K h

Cross-validation estimate of E [J(h)]


19.3 Smoothing Using Orthogonal Functions
n n n  
1 X X ∗ Xi − Xj
Z
2 2Xb 2 Approximation
JbCV (h) = fn (x) dx −
b f(−i) (Xi ) ≈ 2
K + K(0)
n i=1 hn i=1 j=1 h nh ∞
X J
X
r(x) = βj φj (x) ≈ βj φj (x)
Z j=1 i=1
K ∗ (x) = K (2) (x) − 2K(x) K (2) (x) = K(x − y)K(y) dy Multivariate Regression
Y = Φβ + η
 
19.2 Non-parametric Regression φ0 (x1 ) ··· φJ (x1 )
 .. .. .. 
where ηi = i and Φ =  . . . 
Estimate f (x), where f (x) = E [Y | X = x]. Consider pairs of points
(x1 , Y1 ), . . . , (xn , Yn ) related by φ0 (xn ) · · · φJ (xn )
Least Squares Estimator
Yi = r(xi ) + i
E [εi ] = 0    V [εi ] = σ²
β̂ = (ΦT Φ)−1 ΦT Y ≈ (1/n) ΦT Y    (for equally spaced observations only)
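A direct implementation of the kernel density estimator with a Gaussian kernel (added sketch assuming numpy; the normal-reference bandwidth constant 1.06 is a common default of the form h* ~ c/n^{1/5}, not something stated in this sheet):

    # Gaussian KDE: f_n(x) = (1/n) sum_i (1/h) K((x - X_i)/h)  (illustrative sketch).
    import numpy as np

    def kde(x_grid, data, h=None):
        data = np.asarray(data)
        n = len(data)
        if h is None:
            h = 1.06 * data.std(ddof=1) * n ** (-1 / 5)  # normal-reference bandwidth
        u = (x_grid[:, None] - data[None, :]) / h
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
        return K.mean(axis=1) / h

    rng = np.random.default_rng(7)
    sample = rng.normal(size=300)
    grid = np.linspace(-4, 4, 201)
    f_hat = kde(grid, sample)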
Cross-validation estimate of E [J(h)] 20.2 Poisson Processes
 2
Xn J
X Poisson Process
R
bCV (J) = Yi − φj (xi )βbj,(−i) 
i=1 j=1
• {Xt : t ∈ [0, ∞)} – number of events up to and including time t
• X0 = 0
20 Stochastic Processes • Independent increments:
Stochastic Process
( ∀t0 < · · · < tn : Xt1 − Xt0 ⊥
⊥ ··· ⊥
⊥ Xtn − Xtn−1
{0, ±1, . . . } = Z discrete
{Xt : t ∈ T } T =
[0, ∞) continuous
• Intensity function λ(t)
• Notations: Xt , X(t)
• State space X – P [Xt+h − Xt = 1] = λ(t)h + o(h)
• Index set T – P [Xt+h − Xt = 2] = o(h)
Rt
• Xs+t − Xs ∼ Poisson (m(s + t) − m(s)) where m(t) = 0
λ(s) ds
20.1 Markov Chains
Markov Chain {Xn : n ∈ T } Homogeneous Poisson Process
P [Xn = x | X0 , . . . , Xn−1 ] = P [Xn = x | Xn−1 ] ∀n ∈ T, x ∈ X
λ(t) ≡ λ =⇒ Xt ∼ Poisson (λt) λ>0
Transition probabilities

pij ≡ P [Xn+1 = j | Xn = i] Waiting Times


pij (n) ≡ P [Xm+n = j | Xm = i] n-step
Wt := time at which Xt occurs
Transition matrix P (n-step: Pn )
• (i, j) element is pij  
1
• pij > 0 Wt ∼ Gamma t,
P λ
• i pij = 1

Chapman-Kolmogorov Interarrival Times


X
pij (m + n) = pij (m)pkj (n) St = Wt+1 − Wt
k

Pm+n = Pm Pn  
1
Pn = P × · · · × P = Pn St ∼ Exp
λ
Marginal probability

µn = (µn (1), . . . , µn (N )) where µi (i) = P [Xn = i]


St
µ0 , initial distribution
µn = µ0 Pn Wt−1 Wt t
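The relation µ_n = µ_0 P^n can be iterated to approximate the stationary distribution of a chain; a minimal added sketch (assuming numpy; the transition matrix is arbitrary):

    # Marginal distribution mu_n = mu_0 P^n for a 3-state Markov chain (illustrative sketch).
    import numpy as np

    P = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.3, 0.6]])     # rows sum to 1: p_ij = P[X_{n+1} = j | X_n = i]
    mu0 = np.array([1.0, 0.0, 0.0])     # initial distribution

    mu = mu0.copy()
    for _ in range(200):
        mu = mu @ P                     # mu_{n+1} = mu_n P

    # After many steps mu approximates the stationary distribution pi, which satisfies pi = pi P.
    print(mu, np.allclose(mu, mu @ P))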
21 Time Series 21.1 Stationary Time Series
Mean function Z ∞
Strictly stationary
µxt = E [xt ] = xft (x) dx
−∞ P [xt1 ≤ c1 , . . . , xtk ≤ ck ] = P [xt1 +h ≤ c1 , . . . , xtk +h ≤ ck ]
Autocovariance function

γx (s, t) = E [(xs − µs )(xt − µt )] = E [xs xt ] − µs µt ∀k ∈ N, tk , ck , h ∈ Z

γx (t, t) = E (xt − µt )2 = V [xt ]


 
Weakly stationary
Autocorrelation function (ACF)  
• E x2t < ∞ ∀t ∈ Z
 2
Cov [xs , xt ] γ(s, t) • E xt = m ∀t ∈ Z
ρ(s, t) = p =p
V [xs ] V [xt ] γ(s, s)γ(t, t) • γx (s, t) = γx (s + r, t + r) ∀r, s, t ∈ Z

Cross-covariance function (CCV) Autocovariance function


γxy (s, t) = E [(xs − µxs )(yt − µyt )]
• γ(h) = E [(xt+h − µ)(xt − µ)] ∀h ∈ Z
 
Cross-correlation function (CCF) • γ(0) = E (xt − µ)2
γxy (s, t) • γ(0) ≥ 0
ρxy (s, t) = p • γ(0) ≥ |γ(h)|
γx (s, s)γy (t, t)
• γ(h) = γ(−h)
Backshift operator
B k (xt ) = xt−k Autocorrelation function (ACF)
Difference operator
∇d = (1 − B)d Cov [xt+h , xt ] γ(t + h, t) γ(h)
ρx (h) = p =p =
V [xt+h ] V [xt ] γ(t + h, t + h)γ(t, t) γ(0)
White Noise
2
• wt ∼ wn(0, σw ) Jointly stationary time series
iid 2

• Gaussian: wt ∼ N 0, σw
γxy (h) = E [(xt+h − µx )(yt − µy )]
• E [wt ] = 0 t ∈ T
• V [wt ] = σ 2 t ∈ T
• γw (s, t) = 0 s 6= t ∧ s, t ∈ T γxy (h)
ρxy (h) = p
γx (0)γy (h)
Random Walk
• Drift δ Linear Process
Pt
• xt = δt + j=1 wj ∞
X ∞
X
• E [xt ] = δt xt = µ + ψj wt−j where |ψj | < ∞
j=−∞ j=−∞
Symmetric Moving Average
k
X k
X ∞
X
2
mt = aj xt−j where aj = a−j ≥ 0 and aj = 1 γ(h) = σw ψj+h ψj
j=−k j=−k j=−∞
21.2 Estimation of Correlation 21.3.1 Detrending
Sample mean Least Squares
n
1X
x̄ = xt 1. Choose trend model, e.g., µt = β0 + β1 t + β2 t2
n t=1
2. Minimize rss to obtain trend estimate µ bt = βb0 + βb1 t + βb2 t2
Sample variance 3. Residuals , noise wt
n  
1 X |h|
V [x̄] = 1− γx (h) Moving average
n n
h=−n
1
• The low-pass filter vt is a symmetric moving average mt with aj = 2k+1 :
Sample autocovariance function
k
n−h 1 X
1 X vt = xt−1
γ
b(h) = (xt+h − x̄)(xt − x̄) 2k + 1
n t=1 i=−k

1
Pk
Sample autocorrelation function • If 2k+1 i=−k wt−j ≈ 0, a linear trend function µt = β0 + β1 t passes
without distortion
γ
b(h)
ρb(h) = Differencing
γ
b(0)
• µt = β0 + β1 t =⇒ ∇xt = β1
Sample cross-variance function
n−h
1 X 21.4 ARIMA models
γ
bxy (h) = (xt+h − x̄)(yt − y)
n t=1 Autoregressive polynomial

Sample cross-correlation function φ(z) = 1 − φ1 z − · · · − φp zp z ∈ C ∧ φp 6= 0

γ
bxy (h) Autoregressive operator
ρbxy (h) = p
γbx (0)b
γy (0)
φ(B) = 1 − φ1 B − · · · − φp B p
Properties
Autoregressive model order p, AR (p)
1
• σρbx (h) = √ if xt is white noise
n xt = φ1 xt−1 + · · · + φp xt−p + wt ⇐⇒ φ(B)xt = wt
1
• σρbxy (h) = √ if xt or yt is white noise AR (1)
n
k−1 ∞
X k→∞,|φ|<1 X
21.3 Non-Stationary Time Series • xt = φk (xt−k ) + φj (wt−j ) = φj (wt−j )
j=0 j=0
Classical decomposition model | {z }
linear process
P∞
xt = µt + st + wt • E [xt ] = j=0 φj (E [wt−j ]) = 0
2 h
σw φ
• µt = trend • γ(h) = Cov [xt+h , xt ] = 1−φ2
γ(h)
• st = seasonal component • ρ(h) = γ(0) = φh
• wt = random noise term • ρ(h) = φρ(h − 1) h = 1, 2, . . .
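For an AR(1) process, ρ(h) = φ^h can be checked against the sample ACF estimator of the estimation-of-correlation section; an added sketch (assuming numpy; φ and the sample size are arbitrary):

    # Simulate x_t = phi * x_{t-1} + w_t and compare the sample ACF with rho(h) = phi^h (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(8)
    phi, n = 0.7, 5000
    w = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + w[t]

    def sample_acf(x, h):
        xc = x - x.mean()
        n = len(xc)
        gamma_h = (xc[h:] * xc[:n - h]).sum() / n   # sample autocovariance (1/n convention)
        gamma_0 = (xc * xc).sum() / n
        return gamma_h / gamma_0

    for h in (1, 2, 3):
        print(h, sample_acf(x, h), phi ** h)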
Moving average polynomial Seasonal ARIMA
θ(z) = 1 + θ1 z + · · · + θq zq z ∈ C ∧ θq 6= 0 • Denoted by ARIMA (p, d, q) × (P, D, Q)s
Moving average operator • ΦP (B s )φ(B)∇D d s
s ∇ xt = δ + ΘQ (B )θ(B)wt

θ(B) = 1 + θ1 B + · · · + θp B p
21.4.1 Causality and Invertibility
MA (q) (moving average model order q) P∞
ARMA (p, q) is causal (future-independent) ⇐⇒ ∃{ψj } : j=0 ψj < ∞ such that
xt = wt + θ1 wt−1 + · · · + θq wt−q ⇐⇒ xt = θ(B)wt
q ∞
X
xt = wt−j = ψ(B)wt
X
E [xt ] = θj E [wt−j ] = 0
j=0
j=0
( Pq−h P∞
2
σw j=0 θj θj+h 0≤h≤q ARMA (p, q) is invertible ⇐⇒ ∃{πj } : j=0 πj < ∞ such that
γ(h) = Cov [xt+h , xt ] =
0 h>q

X
MA (1) π(B)xt = Xt−j = wt
xt = wt + θwt−1 j=0

2 2
(1 + θ )σw h = 0

Properties
γ(h) = θσw 2
h=1


0 h>1 • ARMA (p, q) causal ⇐⇒ roots of φ(z) lie outside the unit circle
(
θ
2 h=1 ∞
θ(z)
ρ(h) = (1+θ )
X
0 h>1 ψ(z) = ψj z j = |z| ≤ 1
j=0
φ(z)
ARMA (p, q)
xt = φ1 xt−1 + · · · + φp xt−p + wt + θ1 wt−1 + · · · + θq wt−q • ARMA (p, q) invertible ⇐⇒ roots of θ(z) lie outside the unit circle

φ(B)xt = θ(B)wt X φ(z)
π(z) = πj z j = |z| ≤ 1
Partial autocorrelation function (PACF) j=0
θ(z)
• xh−1
i , regression of xi on {xh−1 , xh−2 , . . . , x1 }
Behavior of the ACF and PACF for causal and invertible ARMA models
• φhh = corr(xh − xh−1
h , x0 − xh−1
0 ) h≥2
• E.g., φ11 = corr(x1 , x0 ) = ρ(1) AR (p) MA (q) ARMA (p, q)
ARIMA (p, d, q) ACF tails off cuts off after lag q tails off
∇d xt = (1 − B)d xt is ARMA (p, q) PACF cuts off after lag p tails off q tails off
φ(B)(1 − B)d xt = θ(B)wt
Exponentially Weighted Moving Average (EWMA) 21.5 Spectral Analysis
xt = xt−1 + wt − λwt−1 Periodic process

X xt = A cos(2πωt + φ)
xt = (1 − λ)λj−1 xt−j + wt when |λ| < 1
j=1 = U1 cos(2πωt) + U2 sin(2πωt)
x̃n+1 = (1 − λ)xn + λx̃n
• Frequency index ω (cycles per unit time), period 1/ω
• Amplitude A Discrete Fourier Transform (DFT)
• Phase φ n
• U1 = A cos φ and U2 = A sin φ often normally distributed rv’s
X
d(ωj ) = n−1/2 xt e−2πiωj t
Periodic mixture i=1

q
X Fourier/Fundamental frequencies
xt = (Uk1 cos(2πωk t) + Uk2 sin(2πωk t))
k=1
ωj = j/n
• Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rv’s with variances σk2
Pq
• γ(h) = k=1 σk2 cos(2πωk h) Inverse DFT
  Pq n−1
• γ(0) = E x2t = k=1 σk2
X
xt = n−1/2 d(ωj )e2πiωj t
Spectral representation of a periodic process j=0

γ(h) = σ 2 cos(2πω0 h) Periodogram


I(j/n) = |d(j/n)|2
σ 2 −2πiω0 h σ 2 2πiω0 h
= e + e
2 2 Scaled Periodogram
Z 1/2
= e2πiωh dF (ω) 4
−1/2
P (j/n) = I(j/n)
n
!2 !2
Spectral distribution function 2X
n
2X
n
 = xt cos(2πtj/n + xt sin(2πtj/n
0
 ω < −ω0 n t=1 n t=1
F (ω) = σ 2 /2 −ω ≤ ω < ω0

 2
σ ω ≥ ω0 22 Math
• F (−∞) = F (−1/2) = 0
• F (∞) = F (1/2) = γ(0)
22.1 Series
Spectral density Finite Binomial
∞ n  
X 1 1 n n
γ(h)e−2πiωh − ≤ω≤ n(n + 1)
X
f (ω) = X
• = 2n
2 2 • k= k
h=−∞ 2 k=0
k=1
n n    
• Needs
P∞
|γ(h)| < ∞ =⇒ γ(h) =
R 1/2
e2πiωh f (ω) dω h = 0, ±1, . . .
X r+k r+n+1

X
h=−∞ −1/2 • (2k − 1) = n 2 =
k n
• f (ω) ≥ 0 k=1 k=0
n n
X k    
• f (ω) = f (−ω) n(n + 1)(2n + 1) n+1

X
• k2 = =
• f (ω) = f (1 − ω) 6 m m+1
k=1 k=0
R 1/2
• γ(0) = V [xt ] = −1/2 f (ω) dω n  2 • Vandermonde’s Identity:
X n(n + 1) r  
• k3 =
  
2
• White noise: fw (ω) = σw
X m n m+n
2 =
• ARMA (p, q) , φ(B)xt = θ(B)wt :
k=1 k r−k r
n k=0
X cn+1 − 1 • Binomial Theorem:
|θ(e )| −2πiω 2 • ck = c 6= 1 n  
2
fx (ω) = σw c−1 X n n−k k
|φ(e−2πiω )|2
k=0 a b = (a + b)n
k
Pp Pq k=0
where φ(z) = 1 − k=1 φk z k and θ(z) = 1 + k=1 θk z k
Infinite

• Σ_{k=0}^{∞} p^k = 1/(1 − p),  Σ_{k=1}^{∞} p^k = p/(1 − p),  |p| < 1
• Σ_{k=0}^{∞} k p^{k−1} = d/dp (Σ_{k=0}^{∞} p^k) = d/dp (1/(1 − p)) = 1/(1 − p)²,  |p| < 1
• Σ_{k=0}^{∞} C(r + k − 1, k) x^k = (1 − x)^{−r},  r ∈ N⁺
• Σ_{k=0}^{∞} C(α, k) p^k = (1 + p)^α,  |p| < 1,  α ∈ C

Balls and Urns (number of functions f : B → U with |B| = n, |U| = m; D = distinguishable, ¬D = indistinguishable; {n k} denotes a Stirling number of the second kind):

• B : D, U : D    arbitrary f: m^n;  injective f: C(m, n) n! if m ≥ n, else 0;  surjective f: m! {n m};  bijective f: n! if m = n, else 0
• B : ¬D, U : D   arbitrary f: C(n + m − 1, n);  injective f: C(m, n);  surjective f: C(n − 1, m − 1);  bijective f: 1 if m = n, else 0
• B : D, U : ¬D   arbitrary f: Σ_{k=1}^{m} {n k};  injective f: 1 if m ≥ n, else 0;  surjective f: {n m};  bijective f: 1 if m = n, else 0
• B : ¬D, U : ¬D  arbitrary f: Σ_{k=1}^{m} P_{n,k};  injective f: 1 if m ≥ n, else 0;  surjective f: P_{n,m};  bijective f: 1 if m = n, else 0
22.2 Combinatorics

Sampling

Choosing k out of n:
• ordered, without replacement: n_k = n!/(n − k)! = ∏_{i=0}^{k−1} (n − i)
• ordered, with replacement: n^k
• unordered, without replacement: C(n, k) = n!/(k!(n − k)!)
• unordered, with replacement: C(n − 1 + r, r) = C(n − 1 + r, n − 1)

Stirling numbers, 2nd kind

{n k} = k {n−1 k} + {n−1 k−1} for 1 ≤ k ≤ n,  with {n 0} = 1 if n = 0, else 0

Partitions

P_{n+k,k} = Σ_{i=1}^{n} P_{n,i};  P_{n,k} = 0 for k > n;  P_{n,0} = 0 for n ≥ 1;  P_{0,0} = 1

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

Univariate distribution relationships, courtesy of Leemis and McQueston [2].
