Statistical Inference Cheat Sheet
Geometric Series:

a + ar + ar² + · · · + ar^(n−1) = Σ_{k=0}^{n−1} ar^k = a (1 − rⁿ)/(1 − r)

Taylor Series for Exponential Function:

eˣ = Σ_{n=0}^{∞} xⁿ/n! = 1 + x + x²/2! + x³/3! + · · · = lim_{n→∞} (1 + x/n)ⁿ

Sum of First n Terms of Harmonic Series:

1 + 1/2 + 1/3 + · · · + 1/n ≈ log n + 0.5772

Gamma and Beta Integrals:

∫₀^∞ x^(t−1) e^(−x) dx = Γ(t)        ∫₀¹ x^(a−1) (1 − x)^(b−1) dx = Γ(a)Γ(b)/Γ(a + b)

Notes: Γ(a + 1) = aΓ(a), and Γ(n) = (n − 1)! if n is a positive integer.

Useful Stat 110 Concepts

Definition of Conditional Probability:

P(A | B) = P(A, B)/P(B)

Law of Total Probability:

P(A) = P(A | B1)P(B1) + P(A | B2)P(B2) + · · · = Σ_{k=1}^{n} P(A | Bk)P(Bk)

Adam's Law (LOTE):

E(Y) = E(E(Y | A)) = E(Y | A1)P(A1) + E(Y | A2)P(A2) + · · ·

Eve's Law (LOTV):

Var(Y) = E(Var(Y | X)) + Var(E(Y | X))

Fundamental Bridge (Indicator RVs):

E(Ind(A)) = P(A)

Variance:

Var(X) = E(X − E(X))² = E(X²) − (E(X))²

Standard Deviation:

SD(X) = √Var(X)

Covariance:

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y)

Correlation:

Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y))

Properties:
• Var(aX) = a² Var(X)
• For X ⊥⊥ Y, Var(X + Y) = Var(X − Y) = Var(X) + Var(Y)
• Cov(X, Y) = Cov(Y, X)
• Cov(X + a, Y + b) = Cov(X, Y)
• Cov(aX, bY) = ab Cov(X, Y)
• Cov(W + X, Y + Z) = Cov(W, Y) + Cov(W, Z) + Cov(X, Y) + Cov(X, Z)
• Corr(aX + b, cY + d) = Corr(X, Y)

Note: for Y = g(X) with g strictly decreasing, FY(y) = 1 − FX(x), where x = g⁻¹(y).

Exponential Order Statistics:
• Minimum of k independent Expo(λj) ∼ Expo(λ1 + · · · + λk)
• Maximum of k i.i.d. Expo(λ) is distributed as Y1 + Y2 + · · · + Yk, where Yj ∼ Expo(jλ)
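As a quick sanity check of Adam's and Eve's laws (added here, not part of the original sheet), a minimal Python sketch; the Gamma-Poisson hierarchy is an assumed example:

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed hierarchy for illustration: X ~ Gamma(3, rate 2), Y | X ~ Pois(X).
    x = rng.gamma(3.0, 1 / 2.0, size=10**6)   # numpy uses scale = 1/rate
    y = rng.poisson(x)

    # Adam's law: E(Y) = E(E(Y | X)) = E(X).
    print(y.mean(), x.mean())                 # both ≈ 1.5

    # Eve's law: Var(Y) = E(Var(Y | X)) + Var(E(Y | X)) = E(X) + Var(X).
    print(y.var(), x.mean() + x.var())        # both ≈ 1.5 + 0.75 = 2.25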
Special Cases of Distributions
• Bin(n, p) can be thought of as the sum of n i.i.d. Bern(p)
• Beta(1, 1) is the same distribution as Unif(0, 1)
• The sum of n i.i.d. Expo(λ) is Gamma(n, λ)
• χ²ₙ is the sum of squares of n i.i.d. N(0, 1), or Gamma(n/2, 1/2)
• NBin(r, p) can be thought of as the sum of r i.i.d. Geom(p)
• For X ∼ Expo(λ), X^(1/γ) ∼ Weibull(λ, γ)
• For X ∼ N(µ, σ²), e^X ∼ Log-Normal(µ, σ²)
• For X ∼ Gamma(a, λ) and Y ∼ Gamma(b, λ), X/(X + Y) ∼ Beta(a, b), with X + Y ⊥⊥ X/(X + Y)
• Beta-Binomial Conjugacy: for X | p ∼ Bin(n, p) and p ∼ Beta(a, b), the posterior is p | X ∼ Beta(a + x, b + n − x)
• Gamma-Poisson Conjugacy: for X | λ ∼ Pois(λt) and λ ∼ Gamma(r, b), the posterior is λ | X ∼ Gamma(r + x, b + t)
• Chicken-Egg: for Z ∼ Pois(λ) items and acceptance probability p, accepted items Z1 ∼ Pois(λp) and rejected items Z2 ∼ Pois(λ(1 − p)), with Z1 ⊥⊥ Z2
• Poisson-Normal: Pois(λ) is approximately N(λ, λ) when λ is large
• Binomial-Poisson: Bin(n, p) is approximately Pois(np) when p is small
• Binomial-Normal: Bin(n, p) is approximately N(np, npq) when n is large and p is not near 0 or 1
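Two of these facts checked by simulation (added; the parameter values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    N = 10**6

    # Gamma/Beta: for X ~ Gamma(a, λ) and Y ~ Gamma(b, λ), X/(X+Y) ~ Beta(a, b).
    a, b, lam = 2.0, 5.0, 3.0
    x = rng.gamma(a, 1 / lam, N)              # numpy uses scale = 1/rate
    y = rng.gamma(b, 1 / lam, N)
    w = x / (x + y)
    print(w.mean(), a / (a + b))              # Beta(a, b) mean is a/(a+b)
    print(np.corrcoef(w, x + y)[0, 1])        # ≈ 0, consistent with X+Y ⊥⊥ W

    # Chicken-Egg: thinning Z ~ Pois(λ) with probability p gives Pois(λp).
    z = rng.poisson(4.0, N)
    z1 = rng.binomial(z, 0.3)                 # accepted items, Z1 | Z ~ Bin(Z, p)
    print(z1.mean(), z1.var())                # both ≈ λp = 1.2, as for a Poisson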
Examples

MLE and MoM of Mean/Variance in Normal

Let Y1, Y2, . . . , Yn ~ i.i.d. N(µ, σ²).

Maximum Likelihood Estimates:

L(µ, σ²) = (1/σⁿ) exp(−Σ(Yj − µ)²/(2σ²))

ℓ(µ, σ²) = −n log(σ) − [Σ(Yj − Ȳ)² + n(Ȳ − µ)²]/(2σ²)

The log-likelihood is maximized when µ = Ȳ, so µ̂_MLE = Ȳ.

Plugging µ̂_MLE into the log-likelihood function:

ℓ(σ²) = −n log(σ) − (1/(2σ²)) Σ(Yj − Ȳ)²
      = −(n/2) log(σ²) − (1/(2σ²)) Σ(Yj − Ȳ)²

s(σ²) = −n/(2σ²) + (1/(2σ⁴)) Σ(Yj − Ȳ)² = 0

σ̂²_MLE = (1/n) Σ_{j=1}^{n} (Yj − Ȳ)²

Method of Moments:
1. µ = E(Y), so µ̂_MoM = Ȳ
2. σ² = E(Y²) − (E(Y))², so σ̂²_MoM = (1/n) ΣYj² − Ȳ² = (1/n) Σ(Yj − Ȳ)²
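A numerical companion (added, not from the sheet): the MLE and MoM estimates coincide here; the true (µ, σ) below are arbitrary:

    import numpy as np

    rng = np.random.default_rng(2)
    y = rng.normal(loc=10.0, scale=2.0, size=500)   # assumed truth: µ=10, σ=2

    mu_hat = y.mean()                               # µ̂_MLE = µ̂_MoM = Ȳ
    sigma2_mle = ((y - mu_hat) ** 2).mean()         # (1/n) Σ (Yj − Ȳ)², biased
    sigma2_unb = ((y - mu_hat) ** 2).sum() / (len(y) - 1)  # n−1 gives unbiased

    print(mu_hat, sigma2_mle, sigma2_unb)           # ≈ 10, ≈ 4, ≈ 4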
German Tank Problem

Suppose n tanks are captured, with serial numbers Y1, Y2, . . . , Yn. Assume the population serial numbers are 1, 2, . . . , t and that the data is a simple random sample. Estimate the total number of tanks t.

L(t) = 1/C(t, n) if Y1, Y2, . . . , Yn ∈ {1, 2, . . . , t} and 0 otherwise
     = Ind(Y(n) ≤ t)/C(t, n)

where C(·, ·) denotes the binomial coefficient. The likelihood of t is 0 for t < Y(n) because we would have already observed a tank with a higher serial number. For t ≥ Y(n), the likelihood function is decreasing in t, so the maximum likelihood estimate must be t̂_MLE = Y(n). However, this estimator is biased.

The PMF for Y(n) is the number of ways to choose n − 1 tanks with serial numbers less than Y(n) divided by the total number of ways to choose n tanks from t:

P(Y(n) = m) = C(m − 1, n − 1)/C(t, n)

E(Y(n)) = Σ_{m=n}^{t} m C(m − 1, n − 1)/C(t, n) = n(t + 1)/(n + 1)

So, we can correct our estimator to ((n + 1)/n) Y(n) − 1, which is unbiased.
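A simulation sketch (added), with arbitrary t and n, contrasting the biased MLE with the corrected estimator:

    import numpy as np

    rng = np.random.default_rng(3)
    t, n, reps = 500, 10, 20000

    # Sample n serial numbers without replacement from {1, ..., t}, many times.
    ymax = np.array([rng.choice(t, size=n, replace=False).max() + 1
                     for _ in range(reps)])

    print(ymax.mean())                        # ≈ n(t+1)/(n+1) ≈ 455.5, biased low
    print(((n + 1) / n * ymax - 1).mean())    # ≈ t = 500, bias corrected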
Variance-Stabilizing of Poisson

Let T ∼ Pois(λ), so that T ·∼ N(λ, λ) for large λ. What is the approximate distribution of √T?

T ·∼ N(λ, λ)

√T ·∼ N(√λ, 1/4), by the Delta Method, since for g(t) = √t the approximate variance is λ g′(λ)² = λ (1/(2√λ))² = 1/4, which does not depend on λ.
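A quick numerical check (added), assuming λ = 100:

    import numpy as np

    rng = np.random.default_rng(4)
    lam = 100.0
    t = rng.poisson(lam, size=10**6)

    root_t = np.sqrt(t)
    print(root_t.mean(), np.sqrt(lam))   # ≈ 10, up to a small O(1/√λ) bias
    print(root_t.var())                  # ≈ 1/4, independent of λ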
Sample Mean vs. Sample Median

Let Y1, Y2, . . . , Yn ~ i.i.d. N(θ, σ²); the estimand is θ.

Sample mean: Ȳ ∼ N(θ, σ²/n)

Sample median: Mn ·∼ N(θ, (π/2)(σ²/n)) by the asymptotic distribution of sample quantiles

The sample mean is the more efficient estimator, as it has lower variance; but in cases where the Normality assumption is wrong (e.g. Cauchy data), the sample median may be more robust.
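An illustrative simulation (added): under Normality the variance ratio should approach π/2 ≈ 1.57:

    import numpy as np

    rng = np.random.default_rng(5)
    n, reps = 201, 20000
    y = rng.normal(0.0, 1.0, size=(reps, n))

    means = y.mean(axis=1)
    medians = np.median(y, axis=1)
    print(medians.var() / means.var())   # ≈ π/2 ≈ 1.571 for large n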
Sufficient Statistic and MLE in an NEF

The PMF/PDF of an NEF can be written as fθ(y) = e^(θy − ψ(θ)) h(y), so the joint likelihood satisfies:

L(θ) ∝ e^(θ ΣYj − nψ(θ))

ℓ(θ) = θ Σ_{j=1}^{n} Yj − nψ(θ)

s(θ) = Σ_{j=1}^{n} Yj − nψ′(θ) = 0

(1/n) Σ_{j=1}^{n} Yj = ψ′(θ) = E(Y)

µ̂_MLE = Ȳ

So, Ȳ is a sufficient statistic.

Generic Method of Moments

Let Y1, Y2, . . . , Yn be i.i.d. with mean θ and variance σ².
1. θ = E(Y)
2. θ̂_MoM = Ȳ
Evaluation:
• Bias(θ̂) = 0 by linearity
• Var(θ̂) = σ²/n
• θ̂ ·∼ N(θ, σ²/n) for large n, by the CLT
• θ̂ →p θ by the LLN
Confidence Interval for Mean/Variance in Normal

Let Y1, Y2, . . . , Yn ~ i.i.d. N(µ, σ²), with µ and σ² both unknown.

Suppose we use the unbiased sample variance σ̂² = (1/(n − 1)) Σ(Yj − Ȳ)².

(n − 1)σ̂²/σ² ∼ χ²_{n−1}

So, the 95% confidence interval for σ² is:

[ σ̂²(n − 1)/Q1(0.975), σ̂²(n − 1)/Q1(0.025) ]

where Q1(p) is the quantile function for χ²_{n−1}, or Gamma((n − 1)/2, 1/2).

Now, suppose we use the sample mean µ̂ = Ȳ.

Ȳ − µ ∼ (σ/√n) N(0, 1)        (n − 1)σ̂²/σ² ∼ χ²_{n−1}

So, we can use the pivot:

(Ȳ − µ)/√(σ̂²/n) = [√n(Ȳ − µ)/σ] / √(σ̂²/σ²) = Z/√(χ²_{n−1}/(n − 1)) ∼ t_{n−1}

So, the 95% confidence interval for µ is:

[ Ȳ − (σ̂/√n) Q2(0.975), Ȳ − (σ̂/√n) Q2(0.025) ]

where Q2(p) is the quantile function for the Student-t distribution with parameter n − 1.
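The same intervals computed with scipy (added); the data are simulated with arbitrary true parameters:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    y = rng.normal(5.0, 3.0, size=40)
    n = len(y)

    s2 = y.var(ddof=1)                       # unbiased sample variance
    q_chi2 = stats.chi2.ppf([0.975, 0.025], df=n - 1)
    print((n - 1) * s2 / q_chi2)             # 95% CI for σ²

    q_t = stats.t.ppf([0.975, 0.025], df=n - 1)
    print(y.mean() - np.sqrt(s2 / n) * q_t)  # 95% CI for µ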
Censored Data

Suppose there are n = 30 devices. They are observed for 7 months, at which point 21 have failed while 9 still work. Assume each device's lifetime is Yj ~ i.i.d. Expo(λ) and the estimand is µ = 1/λ.

For each observation:

Lj(λ) = f(yj) if the failure is observed, and 1 − F(7) if not observed

L(λ) = [ Π_{j=1}^{21} λe^(−λtj) ] (e^(−7λ))⁹ = λ²¹ e^(−21λt̄) e^(−63λ)

where t1, . . . , t21 are the observed failure times with mean t̄.

ℓ(λ) = 21 log(λ) − 21λt̄ − 63λ

s(λ) = 21/λ − 21t̄ − 63 = 0

λ̂_MLE = 1/(t̄ + 3)

µ̂_MLE = t̄ + 3, by invariance.
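A numerical sketch (added). The true λ and the general form λ̂ = (# failures)/(total time on test) are assumptions for illustration; with 21 failures and 9 censored devices that form reduces to the closed form above:

    import numpy as np

    rng = np.random.default_rng(7)
    lam_true, n, cutoff = 0.25, 30, 7.0
    lifetimes = rng.exponential(1 / lam_true, size=n)

    observed = lifetimes[lifetimes <= cutoff]     # failures seen by 7 months
    n_obs = len(observed)

    # MLE for censored Expo data: failures over total observed time,
    # counting each censored device as `cutoff` months on test.
    total_time = observed.sum() + (n - n_obs) * cutoff
    lam_hat = n_obs / total_time
    print(lam_hat, 1 / lam_hat)                   # λ̂_MLE and µ̂_MLE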
Asymptotics Theorems

Let Y1, Y2, . . . , Yn be i.i.d. with mean µ ≠ 0 and variance σ². Suppose a variable of interest is T = √n(Ȳ − µ)/(Ȳ³ + µ³). Find its asymptotic distribution.

Numerator: √n(Ȳ − µ) →d σZ where Z ∼ N(0, 1), by the CLT

Denominator: Ȳ →p µ by the LLN, then Ȳ³ →p µ³ by the CMT, so Ȳ³ + µ³ →p 2µ³

Combining the numerator and denominator using Slutsky's Theorem:

T →d σZ/(2µ³) = N(0, σ²/(4µ⁶))
Fisher Information Equality

Let Y1, Y2, . . . , Yn ~ i.i.d. Geom(p). Find IY(p).

L1(p) = p(1 − p)^(Y1)

ℓ1(p) = log(p) + Y1 log(1 − p)

s1(p) = 1/p − Y1/(1 − p)

1. IY(p) = nI1(p) = n Var(1/p − Y1/(1 − p)) = n Var(Y1)/(1 − p)²
         = n [(1 − p)/p²]/(1 − p)² = n/(p²(1 − p))

2. IY(p) = nI1(p) = −nE(s1′(p)) = −nE(−1/p² − Y1/(1 − p)²)
         = n/p² + n E(Y1)/(1 − p)² = n/p² + n/(p(1 − p)) = n/(p²(1 − p))

Both routes give the same answer, as guaranteed by the Fisher information equality.
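A simulation check (added) that Var(s1(p)) matches I1(p) = 1/(p²(1 − p)) for one assumed p:

    import numpy as np

    rng = np.random.default_rng(8)
    p = 0.3
    # numpy's geometric counts trials starting at 1; subtract 1 so that
    # Y1 counts failures, matching the Geom(p) convention used here.
    y1 = rng.geometric(p, size=10**6) - 1

    score = 1 / p - y1 / (1 - p)              # s1(p), single-observation score
    print(score.var(), 1 / (p**2 * (1 - p)))  # both ≈ 15.87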
Poisson Rao-Blackwellization

Let Y1, Y2, . . . , Yn ~ i.i.d. Pois(λ). The sufficient statistic is T = ΣYj. Suppose we use the unbiased estimator λ̂ = Y1.

λ̂_RB = E(λ̂ | T) = E(Y1 | T)

We can think of the data as a Poisson process with a rate of λ and a total of T arrivals. We can split the timeline into disjoint intervals of length 1, so the number of arrivals in each interval is distributed as Pois(λ). Since we know that there are a total of T arrivals and per-interval arrivals are i.i.d., we can view the distribution of arrivals that fall within the first interval as Y1 | T ∼ Bin(T, 1/n).

λ̂_RB = E(Y1 | T) = T/n = ΣYj/n = Ȳ
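A simulation (added) showing the variance reduction from Rao-Blackwellization; λ and n are arbitrary:

    import numpy as np

    rng = np.random.default_rng(9)
    lam, n, reps = 3.0, 25, 20000
    y = rng.poisson(lam, size=(reps, n))

    crude = y[:, 0]            # λ̂ = Y1, unbiased but noisy
    rb = y.mean(axis=1)        # λ̂_RB = Ȳ = E(Y1 | T)

    print(crude.mean(), rb.mean())   # both ≈ λ = 3 (unbiased)
    print(crude.var(), rb.var())     # ≈ λ = 3 vs. ≈ λ/n = 0.12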
Log-Normal MoM

Let Y1, Y2, . . . , Yn ~ i.i.d. Log-Normal(µ, σ²). Use two methods to find Method of Moments estimators for (µ, σ²).

Method 1:
1. Let Xj = log(Yj) for all j ∈ {1, 2, . . . , n} ⟹ Xj ∼ N(µ, σ²)
2. µ = E(X) and σ² = E(X²) − (E(X))²
3. µ̂ = X̄ and σ̂² = (1/n) ΣXj² − X̄² = (1/n) Σ(Xj − X̄)²

Method 2:
1. E(Y) = exp(µ + σ²/2) and E(Y²) = exp(2µ + 2σ²)
2. Let M = (1/n) ΣYj²
3. Ȳ = exp(µ̂ + σ̂²/2) and M = exp(2µ̂ + 2σ̂²), so log Ȳ = µ̂ + σ̂²/2 and log M = 2µ̂ + 2σ̂²
4. σ̂² = log M − 2 log Ȳ and µ̂ = 2 log Ȳ − (1/2) log M

Log-Normal Sample Median

Now suppose the estimator is the sample median ψ̂ of the Yj. Plugging the median ψ = e^µ (which holds by symmetry of log(Y)) into the asymptotic distribution of sample quantiles:

√n(ψ̂ − ψ) →d N(0, 2πσ²ψ²/4)

Then, for large n:

ψ̂ ·∼ N(ψ, πσ²ψ²/(2n))

(1) What are the bias and standard error for large n when the estimand is the mean θ = E(Y1)?

Bias(ψ̂) ≈ ψ − θ

SE(ψ̂) = √Var(ψ̂) ≈ σψ √(π/(2n))

(2) What are the bias and standard error for large n when the estimand is ψ?

By the approximate distribution for ψ̂ found previously:

Bias(ψ̂) ≈ ψ − ψ = 0

SE(ψ̂) ≈ σψ √(π/(2n))
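Both MoM routes in code (added), with assumed true (µ, σ) = (1, 0.5):

    import numpy as np

    rng = np.random.default_rng(10)
    mu, sigma = 1.0, 0.5
    y = rng.lognormal(mean=mu, sigma=sigma, size=10**6)

    # Method 1: work on the log scale.
    x = np.log(y)
    print(x.mean(), x.var())                      # ≈ (µ, σ²) = (1, 0.25)

    # Method 2: match E(Y) and E(Y²) on the original scale.
    m = (y**2).mean()
    sigma2_hat = np.log(m) - 2 * np.log(y.mean())
    mu_hat = 2 * np.log(y.mean()) - 0.5 * np.log(m)
    print(mu_hat, sigma2_hat)                     # also ≈ (1, 0.25)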
Asymptotics for MoM Estimator

Suppose we have data in the form of triplets (Xj, Yj, Zj) for j = 1, 2, . . . , n. The estimand is β = E(YZ)/E(XZ).

We can use the method of moments estimator β̂ = ΣYjZj / ΣXjZj.

(1) Show that β̂ is asymptotically unbiased.

Let Uj = Yj − βXj, so that YjZj = UjZj + βXjZj. Then:

E(UZ) = E(YZ) − βE(XZ) = E(YZ) − [E(YZ)/E(XZ)] E(XZ) = E(YZ) − E(YZ) = 0

β̂ = ΣYjZj/ΣXjZj = (ΣUjZj + β ΣXjZj)/ΣXjZj = β + ΣUjZj/ΣXjZj

Then,

√n(β̂ − β) = [(1/√n) ΣUjZj] / [(1/n) ΣXjZj]

Numerator: (1/√n) ΣUjZj = √n[(1/n) ΣUjZj − E(UZ)] →d N(0, Var(UZ)) by the CLT, where

Var(UZ) = E((UZ)²) − E(UZ)² = E((UZ)²)

Denominator: (1/n) ΣXjZj →p E(XZ) by the LLN

√n(β̂ − β) →d N(0, E((UZ)²)/E(XZ)²) by Slutsky's
MoM for Neil's Commute Problem

Let there be two different routes X and Y, and let X1, . . . , Xn and Y1, . . . , Yn be independent commute times with the Xj's i.i.d. and the Yj's i.i.d. We want to compare the commute times by looking at the ratio of expected commute times beyond 40 minutes. The estimand is:

θ = E[(Y1 − 40)Ind(Y1 > 40)] / E[(X1 − 40)Ind(X1 > 40)]

Note: with A = {Y1 > 40}, the numerator can also be written as E[(Y1 − 40)Ind(Y1 > 40)] = E(Y1 − 40 | A)P(A), by Adam's law.

Find an MoM estimator for θ and show that it is consistent.

Let ν = E[(Y1 − 40)Ind(Y1 > 40)]. Applying the MoM principle,

ν̂ = (1/n) Σ_{j=1}^{n} (Yj − 40)Ind(Yj > 40)

Similarly, let η = E[(X1 − 40)Ind(X1 > 40)], with

η̂ = (1/n) Σ_{j=1}^{n} (Xj − 40)Ind(Xj > 40)

θ̂ = ν̂/η̂ = Σ_{j=1}^{n} (Yj − 40)Ind(Yj > 40) / Σ_{j=1}^{n} (Xj − 40)Ind(Xj > 40)

For consistency, we want to show that θ̂ converges in probability to θ:

ν̂ →p ν and η̂ →p η by the LLN

θ̂ = ν̂/η̂ →p θ by Slutsky's
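A direct computation of θ̂ (added); the Gamma commute-time models are made-up placeholders:

    import numpy as np

    rng = np.random.default_rng(11)
    n = 10**5
    x = rng.gamma(20.0, 2.0, n)       # route X times, mean 40 (assumed model)
    y = rng.gamma(22.0, 2.0, n)       # route Y times, mean 44 (assumed model)

    nu_hat = np.mean((y - 40) * (y > 40))
    eta_hat = np.mean((x - 40) * (x > 40))
    print(nu_hat / eta_hat)           # θ̂, the plug-in ratio estimator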
MLE and MoM for Gaussian Linear Regression

Let the data be pairs (Xj, Yj) such that Yj | Xj ∼ N(θXj, σ²(Xj)), with σ²(x) known. Note that under homoskedasticity, σ²(x) is a constant σ².

Maximum Likelihood Estimate (under homoskedasticity):

ℓ(θ) = −(1/(2σ²)) Σ_{j=1}^{n} (yj − θxj)²

s(θ) = (1/σ²) Σ_{j=1}^{n} xj(yj − θxj) = 0

θ̂ = ΣXjYj / ΣXj²

Properties:
• E(θ̂ | X) = (1/Σxj²) E(Σxj Yj | X) = (1/Σxj²) Σxj E(Yj | Xj) = (1/Σxj²) Σxj θxj = θ ⟹ Unbiased
• Var(θ̂ | X) = (1/(Σxj²)²) Var(Σxj Yj | X) = (1/(Σxj²)²) Σxj² Var(Yj | Xj) = (1/(Σxj²)²) Σxj² σ²(xj)
  ⋆ Simplifies to σ²/Σxj² (the CRLB) under homoskedasticity
• Robust Variance: since E(σ²(Xj) − (Yj − θXj)² | Xj) = 0, use (Yj − θXj)² as an unbiased estimator for σ²(Xj). Then, the robust variance is (1/(Σxj²)²) Σxj²(yj − θxj)²

Method of Moments 1 (Gauss's Estimator):

E(XY) = E(E(XY | X)) = E(X E(Y | X)) = E(XθX) = θE(X²)

So θ = E(XY)/E(X²), and the MoM estimator θ̂ = ΣXjYj/ΣXj² coincides with the MLE above.
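A sketch (added) computing θ̂ and a robust standard error on simulated heteroskedastic data; in practice θ̂ is plugged in for θ in the residuals:

    import numpy as np

    rng = np.random.default_rng(12)
    n, theta = 400, 2.0
    x = rng.uniform(1.0, 3.0, n)
    sigma2_x = 0.5 * x**2                      # assumed heteroskedastic variance
    y = theta * x + rng.normal(0.0, np.sqrt(sigma2_x))

    theta_hat = (x * y).sum() / (x**2).sum()   # MLE = Gauss's MoM estimator

    # Robust variance, using squared residuals in place of σ²(xj).
    resid2 = (y - theta_hat * x) ** 2
    var_robust = (x**2 * resid2).sum() / (x**2).sum() ** 2
    print(theta_hat, np.sqrt(var_robust))      # estimate and its standard error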