Course outline:
• Distribution Theory
– Distributions
– Distribution of transformations
• Point estimation
– Method of moments
– Optimality theory
– Sufficiency
• Hypothesis testing
• Confidence sets
– Pivots
• Decision theory
Statistics versus Probability

Theory   Prediction
A        1
B        2
C        3

Add randomness:

Theory   Prediction
A        Usually 1, sometimes 2, never 3
B        Usually 2, sometimes 1, never 3
C        Usually 3, sometimes 1, never 2
5
Probability Definitions
2. A ∈ F implies A^c = {ω ∈ Ω : ω ∉ A} ∈ F.
• P is a function with domain F and range a subset of [0, 1], satisfying the probability axioms.
Consequence of the axioms: A_i ∈ F, i = 1, 2, ··· implies ∩_i A_i ∈ F.
7
Vector valued random variable: a function X : Ω → R^p such that, writing X = (X_1, ..., X_p), each set
{ω ∈ Ω : X_1(ω) ≤ x_1, ..., X_p(ω) ≤ x_p}
is an event, that is, a member of F.
8
Borel σ-field in R^p: the smallest σ-field in R^p containing every open ball. The distribution of X is the map
A ↦ P(X ∈ A)
which is a probability on R^p equipped with the Borel σ-field, rather than on the original Ω and F.
9
Cumulative Distribution Function (CDF) of X: the function F_X on R^p defined by
F_X(x_1, ..., x_p) = P(X_1 ≤ x_1, ..., X_p ≤ x_p).
Properties: 1. 0 ≤ F(x) ≤ 1.
10
Defn: The distribution of a rv X is discrete (we also call X discrete) if there is a countable set x_1, x_2, ··· such that
P(X ∈ {x_1, x_2, ···}) = 1 = Σ_i P(X = x_i).
In this case the discrete density or probability mass function of X is
f_X(x) = P(X = x).
Example: X is Uniform(0, 1):
f(x) = 1 for 0 < x < 1; undefined for x ∈ {0, 1}; 0 otherwise.
Example: X is exponential:
F(x) = 1 − e^{−x} for x > 0; 0 for x ≤ 0,
f(x) = e^{−x} for x > 0; undefined at x = 0; 0 for x < 0.
12
Distribution Theory
Basic Problem:
Univariate Techniques
F_Y(y) = P(Y ≤ y) = P(−log U ≤ y) = P(log U ≥ −y) = P(U ≥ e^{−y})
= 1 − e^{−y} for y > 0; 0 for y ≤ 0,
so Y has the standard exponential distribution.
13
Example: Z ∼ N(0, 1), i.e.
f_Z(z) = (1/√(2π)) e^{−z²/2},
and Y = Z². Then
F_Y(y) = P(Z² ≤ y) = 0 for y < 0; P(−√y ≤ Z ≤ √y) for y ≥ 0.
Now differentiate
P(−√y ≤ Z ≤ √y) = F_Z(√y) − F_Z(−√y)
to get
f_Y(y) = 0 for y < 0; (d/dy)[F_Z(√y) − F_Z(−√y)] for y > 0; undefined at y = 0.
14
Then
(d/dy) F_Z(√y) = f_Z(√y) (d/dy)√y = (1/√(2π)) exp{−(√y)²/2} (1/2) y^{−1/2} = e^{−y/2}/(2√(2πy)).
(Similar formula for the other derivative.) Thus
f_Y(y) = e^{−y/2}/√(2πy) for y > 0; 0 for y < 0; undefined at y = 0.
15
Notice: I never evaluated FY before differen-
tiating it. In fact FY and FZ are integrals I
can’t do but I can differentiate them anyway.
Remember fundamental theorem of calculus:
d x
Z
f (y) dy = f (x)
dx a
at any x where f is continuous.
P (Y ≤ y) = P (g(X) ≤ y)
= P (X ∈ g −1(−∞, y]) .
Take d/dy to compute the density
d
Z
fY (y) = fX (x) dx .
dy {x:g(x)≤y}
Often can differentiate without doing integral.
16
Method 2: Change of variables.
P (x ≤ X ≤ x + δx) ≈ fX (x)δx
so that
Mnemonic:
fY (y)dy = fX (x)dx .
19
Example: X ∼ Weibull(shape α, scale β), i.e.
f_X(x) = (α/β)(x/β)^{α−1} exp{−(x/β)^α} 1(x > 0).

Marginals: X = (X_1, ..., X_p), Y = X_1 (or in general Y is any X_j). Then
f_Y(x_1, ..., x_q) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(x_1, ..., x_p) dx_{q+1} ··· dx_p.
21
Example: The function
22
General case: Y = (Y1, . . . , Yq ) with compo-
nents Yi = gi(X1 , . . . , Xp).
Case 1: q > p.
Case 2: q = p.
23
Case 3: q < p.
Find fY by integration:
f_Y(y_1, ..., y_q) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f_Z(y_1, ..., y_q, z_{q+1}, ..., z_p) dz_{q+1} ··· dz_p.
24
Change of Variables
f_Y(y) dy = f_X(x) dx
and rewrite it in the form
f_Y(y) = f_X(g^{−1}(y)) |dx/dy|.
Interpretation of the derivative dx/dy when p > 1:
dx/dy = det(∂x_i/∂y_j),
which is the so-called Jacobian.
25
Equivalent formula inverts the matrix:
f_Y(y) = f_X(g^{−1}(y)) / |dy/dx|.
26
Solve for x in terms of y:
X_1 = Y_1 cos(Y_2), X_2 = Y_1 sin(Y_2)
so that
g(x_1, x_2) = (g_1(x_1, x_2), g_2(x_1, x_2)) = (√(x_1² + x_2²), argument(x_1, x_2)) = y.
It follows that
f_Y(y_1, y_2) = (1/(2π)) exp{−y_1²/2} y_1 1(y_1 ≥ 0) 1(0 ≤ y_2 < 2π),
which factors as h_1(y_1) h_2(y_2) with h_2(y_2) = (2π)^{−1} 1(0 ≤ y_2 < 2π). Then
f_{Y_1}(y_1) = ∫_{−∞}^{∞} h_1(y_1) h_2(y_2) dy_2 = h_1(y_1) ∫_{−∞}^{∞} h_2(y_2) dy_2
so the marginal density of Y_1 is a multiple of h_1. The multiplier makes ∫ f_{Y_1} = 1, but in this case
∫_{−∞}^{∞} h_2(y_2) dy_2 = ∫_0^{2π} (2π)^{−1} dy_2 = 1
so that
f_{Y_1}(y_1) = y_1 e^{−y_1²/2} 1(0 ≤ y_1 < ∞).
(A special Weibull, or Rayleigh, distribution.)
28
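Not part of the original notes: a small simulation sketch (assuming numpy is available) checking the Rayleigh result just derived, namely that R = √(X_1² + X_2²) for X_1, X_2 iid N(0,1) has density r e^{−r²/2} on r ≥ 0.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((2, 200_000))
r = np.sqrt(x1**2 + x2**2)

# Compare an empirical histogram of R with the claimed density at bin centres.
edges = np.linspace(0, 4, 41)
hist, _ = np.histogram(r, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
density = centers * np.exp(-centers**2 / 2)
print(np.max(np.abs(hist - density)))   # small: only Monte Carlo error remains
```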
Similarly
29
Independence, conditional distributions
30
Example: Toss a coin twice.
P (X ∈ A; Y ∈ B) = P (X ∈ A)P (Y ∈ B)
for all A and B.
32
Proof:
P(X ≤ x, Y ≤ y) = P(X ≤ x) P(Y ≤ y).
Since P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B),
∫_A ∫_B [f_{X,Y}(x, y) − f_X(x) f_Y(y)] dy dx = 0.
It follows (measure theory) that the quantity in brackets is 0 for almost every pair (x, y).
33
3: For any A and B we have
P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B) = ∫_A f_X(x) dx ∫_B f_Y(y) dy = ∫_A ∫_B f_X(x) f_Y(y) dy dx.
Define g(x, y) = f_X(x) f_Y(y); we have proved that for C = A × B
P((X, Y) ∈ C) = ∫_C g(x, y) dy dx.
To prove that g is f_{X,Y} we need only prove that this integral formula is valid for an arbitrary Borel set C, not just a rectangle A × B.
34
5: We are given
P(X ∈ A, Y ∈ B) = ∫_A ∫_B g(x) h(y) dy dx = ∫_A g(x) dx ∫_B h(y) dy.
Take B = R¹ to see that
P(X ∈ A) = c_1 ∫_A g(x) dx
where c_1 = ∫ h(y) dy. So c_1 g is the density of X. Since ∫∫ f_{X,Y}(x, y) dx dy = 1 we see that ∫ g(x) dx ∫ h(y) dy = 1, so that c_1 = 1/∫ g(x) dx. A similar argument applies to Y.
35
Conditional probability
37
The Multivariate Normal Distribution
38
Matrix A singular: X does not have a density.
3: M X + ν ∼ M V N (M µ + ν, M ΣM t): affine
transformation of MVN is normal.
41
Normal samples: Distribution Theory
3. (n − 1)s²/σ² ∼ χ²_{n−1}.
42
Proof: Let Z_i = (X_i − µ)/σ. Then
n^{1/2}(X̄ − µ)/σ = n^{1/2} Z̄,
(n − 1)s²/σ² = Σ (Z_i − Z̄)²,
and
T = n^{1/2}(X̄ − µ)/s = n^{1/2} Z̄ / s_Z
where (n − 1)s_Z² = Σ (Z_i − Z̄)².
43
Step 1: Define
Y = (√n Z̄, Z_1 − Z̄, ..., Z_{n−1} − Z̄)ᵗ.
(So Y has the same dimension as Z.) Now Y = M Z where the n × n matrix M has first row (1/√n, ..., 1/√n) and, for i = 1, ..., n − 1, row i + 1 has 1 − 1/n in column i and −1/n in every other column. It follows that Y ∼ MVN(0, M Mᵗ), so we need to compute M Mᵗ:
M Mᵗ = [ 1  0ᵗ ; 0  Q ],
where Q is the (n − 1) × (n − 1) matrix with diagonal entries 1 − 1/n and off-diagonal entries −1/n.
44
Solve for Z from Y: Z_i = n^{−1/2} Y_1 + Y_{i+1} for 1 ≤ i ≤ n − 1. Use the identity
Σ_{i=1}^{n} (Z_i − Z̄) = 0
to get Z_n = −Σ_{i=2}^{n} Y_i + n^{−1/2} Y_1. So M is invertible:
Σ^{−1} ≡ (M Mᵗ)^{−1} = [ 1  0ᵗ ; 0  Q^{−1} ].
45
Note: f_Y is a function of y_1 times a function of y_2, ..., y_n. Thus √n Z̄ is independent of Z_1 − Z̄, ..., Z_{n−1} − Z̄. Since s_Z² is a function of Z_1 − Z̄, ..., Z_{n−1} − Z̄ we see that √n Z̄ and s_Z² are independent.
46
Derivation of the χ² density:
Z_1 = U^{1/2} cos θ_1
Z_2 = U^{1/2} sin θ_1 cos θ_2
...
Z_{n−1} = U^{1/2} sin θ_1 ··· sin θ_{n−2} cos θ_{n−1}
Z_n = U^{1/2} sin θ_1 ··· sin θ_{n−1}.
(Spherical co-ordinates in n dimensions. The θ values run from 0 to π except the last θ, which runs from 0 to 2π.) Derivative formulas:
∂Z_i/∂U = Z_i/(2U)
and
∂Z_i/∂θ_j = 0 for j > i; −Z_i tan θ_i for j = i; Z_i cot θ_j for j < i.
47
Fix n = 3 to clarify the formulas. Use the shorthand R = √U. For n = 3 the determinant works out to
U^{1/2} sin(θ_1)/2
(non-negative for all U and θ_1).
48
General n: every term in the first column contains a factor U^{−1/2}/2 while every other entry has a factor U^{1/2}. So the density of U has the form
c u^{(n−2)/2} exp(−u/2)
for some c.
49
Evaluate c by making
∫ f_U(u) du = c ∫_0^∞ u^{(n−2)/2} exp(−u/2) du = 1.
Substitute y = u/2, du = 2 dy to see that
c 2^{n/2} ∫_0^∞ y^{(n−2)/2} e^{−y} dy = c 2^{n/2} Γ(n/2) = 1.
CONCLUSION: the χ²_n density is
(1/(2Γ(n/2))) (u/2)^{(n−2)/2} e^{−u/2} 1(u > 0).
50
Fourth part: consequence of first 3 parts and
def’n of tν distribution.
51
Plug in
f_U(u) = (1/(2Γ(ν/2))) (u/2)^{(ν−2)/2} e^{−u/2}
to get
f_T(t) = ∫_0^∞ (u/2)^{(ν−1)/2} e^{−u(1+t²/ν)/2} du / (2√(πν) Γ(ν/2)).
Substitute y = u(1 + t²/ν)/2, so that
dy = (1 + t²/ν) du/2.
52
Expectation, moments
53
In general, there are random variables which
are neither absolutely continuous nor discrete.
Here’s how probabilists define E in general.
Defn: If X ≥ 0 then
54
Defn: X is integrable if
E(|X|) < ∞ .
In this case we define
55
Major technical theorems:
Monotone Convergence: If 0 ≤ X1 ≤ X2 ≤
· · · and X = lim Xn (which has to exist) then
E(Xn ) → E(X) .
Often used with all Yn the same rv Y .
56
Theorem: With this definition of E if X has
density f (x) (even in Rp say) and Y = g(X)
then
Z
E(Y ) = g(x)f (x)dx .
57
Defn: The r th moment (about the origin) of
a real rv X is µ0r = E(X r ) (provided it exists).
We generally use µ for E(X).
µr = E[(X − µ)r ]
We call σ 2 = µ2 the variance.
µX = E(X)
is the vector whose ith entry is E(Xi ) (provided
all entries exist).
58
Moments and probabilities of rare events are
closely connected as will be seen in a number
of important probability theorems.
59
Example moments: If Z ∼ N(0, 1) then
E(Z) = ∫_{−∞}^{∞} z e^{−z²/2} dz/√(2π) = [−e^{−z²/2}/√(2π)]_{−∞}^{∞} = 0
and (integrating by parts)
E(Z^r) = ∫_{−∞}^{∞} z^r e^{−z²/2} dz/√(2π)
= [−z^{r−1} e^{−z²/2}/√(2π)]_{−∞}^{∞} + (r − 1) ∫_{−∞}^{∞} z^{r−2} e^{−z²/2} dz/√(2π)
so that
µ_r = (r − 1) µ_{r−2}
for r ≥ 2. Remembering that µ_1 = 0 and
µ_0 = ∫_{−∞}^{∞} z⁰ e^{−z²/2} dz/√(2π) = 1
we find that
µ_r = 0 for r odd; (r − 1)(r − 3) ··· 1 for r even.
60
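Not in the original notes: a quick numerical check (assuming numpy) of the conclusion above, comparing Monte Carlo moments of Z ∼ N(0,1) with (r − 1)(r − 3)···1 for even r and 0 for odd r.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(1_000_000)

def normal_moment(r):
    # (r - 1)(r - 3) ... 1 for even r; 0 for odd r
    if r % 2 == 1:
        return 0.0
    out = 1.0
    for k in range(r - 1, 0, -2):
        out *= k
    return out

for r in range(1, 7):
    print(r, np.mean(z**r), normal_moment(r))
```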
If now X ∼ N(µ, σ²), that is, X = σZ + µ, then E(X) = σE(Z) + µ = µ. In the multivariate case X = AZ + µ with AAᵗ = Σ,
Var(X) = E{AZ(AZ)ᵗ} = A E(ZZᵗ) Aᵗ = A I Aᵗ = Σ.
Note the use of the easy calculation: E(Z) = 0 and Var(Z) = E(ZZᵗ) = I.
61
Moments and independence
62
E(X_1 ··· X_p)
= Σ_{j_1,...,j_p} x_{1j_1} ··· x_{pj_p} E(1(X_1 = x_{1j_1}) ··· 1(X_p = x_{pj_p}))
= Σ_{j_1,...,j_p} x_{1j_1} ··· x_{pj_p} P(X_1 = x_{1j_1}, ..., X_p = x_{pj_p})
= Σ_{j_1,...,j_p} x_{1j_1} ··· x_{pj_p} P(X_1 = x_{1j_1}) ··· P(X_p = x_{pj_p})
= Σ_{j_1} x_{1j_1} P(X_1 = x_{1j_1}) × ··· × Σ_{j_p} x_{pj_p} P(X_p = x_{pj_p})
= Π E(X_i).
63
General X_i ≥ 0: approximate by discrete variables. That is, if
k/2ⁿ ≤ X_i < (k + 1)/2ⁿ
then put X_{in} = k/2ⁿ for k = 0, ..., n2ⁿ, and for X_i > n put X_{in} = n. For general X_i write
X_i = max(X_i, 0) − max(−X_i, 0)
and apply the positive case.
64
Moment Generating Functions
d k
0
3. µk = k MX (0).
dt
66
MGFs and Sums
o2
2
nX
+3 E[(Xi − E(Xi )) ]
67
Related quantities: cumulants add up prop-
erly.
Observe
X
κr (Y ) = κr (Xi ) .
68
Relation between cumulants and moments:
x = µt + µ′_2 t²/2 + µ′_3 t³/3! + ···
Expand out powers of x; collect together like terms. For instance,
x² = µ² t² + µ µ′_2 t³ + [2µ′_3 µ/3! + (µ′_2)²/4] t⁴ + ···
x³ = µ³ t³ + 3µ′_2 µ² t⁴/2 + ···
x⁴ = µ⁴ t⁴ + ···
Now gather up the terms. The power t¹ occurs only in x, with coefficient µ. The power t² occurs in x and in x², and so on.
69
Putting these together gives
K(t) = µt + [µ′_2 − µ²] t²/2 + [µ′_3 − 3µµ′_2 + 2µ³] t³/3! + [µ′_4 − 4µ′_3µ − 3(µ′_2)² + 12µ′_2µ² − 6µ⁴] t⁴/4! + ···
Comparing coefficients of t^r/r! we see that
κ_1 = µ
κ_2 = µ′_2 − µ² = σ²
κ_3 = µ′_3 − 3µµ′_2 + 2µ³ = E[(X − µ)³]
κ_4 = µ′_4 − 4µ′_3µ − 3(µ′_2)² + 12µ′_2µ² − 6µ⁴ = E[(X − µ)⁴] − 3σ⁴.
70
Example: If X_1, ..., X_p are independent and X_i has a N(µ_i, σ_i²) distribution then
M_{X_i}(t) = ∫_{−∞}^{∞} e^{tx} e^{−(x−µ_i)²/(2σ_i²)} dx/(√(2π) σ_i)
= ∫_{−∞}^{∞} e^{t(σ_i z + µ_i)} e^{−z²/2} dz/√(2π)
= e^{tµ_i} ∫_{−∞}^{∞} e^{−(z−tσ_i)²/2 + t²σ_i²/2} dz/√(2π)
= e^{σ_i² t²/2 + tµ_i}.
So for Y = Σ X_i,
K_Y(t) = Σ (σ_i² t²/2 + tµ_i)
which is the cumulant generating function of N(Σ µ_i, Σ σ_i²).
71
Example: Homework: derive the moment and cumulant generating functions and the moments of a Gamma rv.
By definition S_ν = Σ_{i=1}^{ν} Z_i² has a χ²_ν distribution. It is easy to check that S_1 = Z_1² has density
(u/2)^{−1/2} e^{−u/2}/(2√π)
and then the mgf of S_1 is
(1 − 2t)^{−1/2}.
It follows that M_{S_ν}(t) = (1 − 2t)^{−ν/2}. SO: the χ²_ν distribution has the Gamma(ν/2, 2) density:
(u/2)^{(ν−2)/2} e^{−u/2}/(2Γ(ν/2)).
72
Example: The Cauchy density is
1/(π(1 + x²));
the corresponding moment generating function is
M(t) = ∫_{−∞}^{∞} e^{tx}/(π(1 + x²)) dx
which is +∞ except for t = 0, where we get 1.
Characteristic Functions
√
Complex numbers: add i = −1 to the real
numbers.
74
Addition: follow the usual rules to get
(a + bi) + (c + di) = (a + c) + (b + d)i.
Multiplicative inverses:
1/(a + bi) = (1/(a + bi)) ((a − bi)/(a − bi)) = (a − bi)/(a² − abi + abi − b²i²) = (a − bi)/(a² + b²).
Division:
(a + bi)/(c + di) = ((a + bi)(c − di))/((c + di)(c − di)) = (ac + bd + (bc − ad)i)/(c² + d²).
Notice: the usual rules of arithmetic don't require any more numbers than
x + yi
where x and y are real.
75
Transcendental functions: For real x we have e^x = Σ x^k/k!, so
e^{x+iy} = e^x e^{iy}.
How to compute e^{iy}? Remember i² = −1, so i³ = −i, i⁴ = 1, i⁵ = i, and so on. Then
e^{iy} = Σ_{k=0}^{∞} (iy)^k/k!
= 1 + iy + (iy)²/2 + (iy)³/6 + ···
= 1 − y²/2 + y⁴/4! − y⁶/6! + ··· + i(y − y³/3! + y⁵/5! − ···)
= cos(y) + i sin(y).
We can thus write e^{x+iy} = e^x(cos(y) + i sin(y)).
76
Identify x + yi with the corresponding point
(x, y) in the plane. Picture the complex num-
bers as forming a plane.
77
We will need from time to time a couple of other definitions:
z z̄ = x² + y² = r² = |z|²,
z′/z = z′ z̄/|z|² = (r′/r) e^{i(θ′ − θ)},
and the conjugate of re^{iθ} is re^{−iθ}.
78
Notes on calculus with complex variables.
End of Aside
79
Characteristic Functions
φX (t) = E(eitX )
√
where i = −1 is the imaginary unit.
Since
eitX = cos(tX) + i sin(tX)
we find that
80
Theorem 2 For any two real rvs X and Y the
following are equivalent:
P (X ∈ A) = P (Y ∈ A) .
81
Inversion
φ_X(t) = Σ_k e^{ikt} P(X = k).
82
Now suppose X has continuous bounded den-
sity f . Define
Xn = [nX]/n
where [a] denotes the integer part (rounding
down to the next smallest integer). We have
83
Now, as n → ∞ we have
T = Σ λ_j Z_j²
where the Z_j are iid N(0, 1). In this case
E(e^{itT}) = Π E(e^{itλ_j Z_j²}) = Π (1 − 2itλ_j)^{−1/2}.
Imhof (Biometrika, 1961) gives a simplification of the Fourier inversion formula for
F_T(x) − F_T(0)
which can be evaluated numerically:
F_T(x) − F_T(0) = ∫_0^x f_T(y) dy = ∫_0^x (1/(2π)) ∫_{−∞}^{∞} Π (1 − 2itλ_j)^{−1/2} e^{−ity} dt dy.
87
Multiply
φ(t) = [1/Π(1 − 2itλ_j)]^{1/2}
top and bottom by the complex conjugate of the denominator:
φ(t) = [Π(1 + 2itλ_j)]^{1/2} / [Π(1 + 4t²λ_j²)]^{1/2}.
This allows us to rewrite
φ(t) = [Π r_j e^{iΣθ_j}]^{1/2} / Π r_j
or
φ(t) = e^{iΣ tan^{−1}(2tλ_j)/2} / Π(1 + 4t²λ_j²)^{1/4}.
Also
∫_0^x e^{−iyt} dy = (e^{−ixt} − 1)/(−it).
We can now collect up the real part of the resulting integral to derive the formula given by Imhof. I don't produce the details here.
89
2): The central limit theorem (in some versions) can be deduced from the Fourier inversion formula: if X_1, ..., X_n are iid with mean 0 and variance 1 and T = n^{1/2} X̄ then, with φ denoting the characteristic function of a single X, we have
E(e^{itT}) = E(e^{i n^{−1/2} t Σ X_j}) = [φ(n^{−1/2} t)]^n ≈ [φ(0) + tφ′(0)/√n + t²φ″(0)/(2n) + o(n^{−1})]^n.
But now φ(0) = 1 and
φ′(t) = (d/dt) E(e^{itX_1}) = iE(X_1 e^{itX_1}),
so φ′(0) = iE(X_1) = 0. Similarly
φ″(t) = i²E(X_1² e^{itX_1})
so that
φ″(0) = −E(X_1²) = −1.
It now follows that
E(e^{itT}) ≈ [1 − t²/(2n) + o(1/n)]^n → e^{−t²/2}.
90
With care we can then apply the Fourier inversion formula and get
f_T(x) = (1/(2π)) ∫_{−∞}^{∞} e^{−itx} [φ(tn^{−1/2})]^n dt
→ (1/(2π)) ∫_{−∞}^{∞} e^{−itx} e^{−t²/2} dt
= (1/√(2π)) φ_Z(−x)
where φ_Z is the characteristic function of a standard normal variable Z. Doing the integral we find
φ_Z(x) = φ_Z(−x) = e^{−x²/2}
so that
f_T(x) → (1/√(2π)) e^{−x²/2}
which is the standard normal density.
91
Proof of the central limit theorem not general:
requires T to have bounded continuous density.
P (T ≤ t) → P (Z ≤ t) .
92
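Not in the original notes: a simulation sketch (assuming numpy) illustrating the limit E(e^{itT}) → e^{−t²/2} for T = n^{1/2} X̄ with iid mean-0, variance-1 summands; the centred exponential used here is a deliberately skewed choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 50_000
x = rng.exponential(size=(reps, n)) - 1.0   # mean 0, variance 1
t_stat = np.sqrt(n) * x.mean(axis=1)

for t in (0.5, 1.0, 2.0):
    empirical = np.mean(np.exp(1j * t * t_stat))   # empirical characteristic function
    print(t, empirical.real, np.exp(-t**2 / 2))
```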
Along the contour in question we have z =
c + iy so we can think of the integral as being
Z ∞
i exp(K(c + iy) − (c + iy)x)dy .
−∞
Now do a Taylor expansion of the exponent:
93
Essentially same idea: Laplace’s approxima-
tion.
95
Q1) If n is a large number is the N (0, 1/n)
distribution close to the distribution of X ≡ 0?
√
Q3) Is N (0, 1/n) close to N (1/ n, 1/n) distri-
bution?
96
Answers depend on how close close needs to
be so it’s a matter of definition.
1. Xn converges in distribution to X.
99
[Figure: N(0,1/n) vs X=0; n=10000. Cdfs of N(0, 1/n) and of the constant X = 0, plotted for x between −3 and 3, vertical scale 0 to 1; two panels.]
100
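Not in the original notes: a sketch (assuming numpy and scipy) of the comparison in the figure above, showing that away from the jump at 0 the two cdfs agree, which is the sense in which N(0, 1/n) is close to the distribution of X ≡ 0.

```python
import numpy as np
from scipy.stats import norm

n = 10_000
x = np.linspace(-3, 3, 601)
cdf_normal = norm.cdf(x, loc=0.0, scale=1.0 / np.sqrt(n))
cdf_point = (x >= 0).astype(float)          # cdf of the point mass at 0

# Largest discrepancy away from a small neighbourhood of the jump at 0.
print(np.max(np.abs(cdf_normal - cdf_point)[np.abs(x) > 0.05]))
```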
[Figure: N(1/n,1/n) vs N(0,1/n); n=10000. Cdfs of the two distributions plotted for x between −3 and 3, vertical scale 0 to 1; two panels, the second zoomed in.]
101
• Multiply both X_n and Y_n by n^{1/2} and let X ∼ N(0, 1). Then √n X_n ∼ N(n^{−1/2}, 1) and √n Y_n ∼ N(0, 1). Use characteristic functions to prove that both √n X_n and √n Y_n converge to N(0, 1) in distribution.
102
[Figure: N(1/sqrt(n),1/n) vs N(0,1/n); n=10000. Cdfs of the two distributions plotted for x between −3 and 3, vertical scale 0 to 1; two panels, the second zoomed in.]
103
Summary: to derive approximate distributions:
Proof: As before
itn1/2 X̄ −t2/2
E(e )→e .
This is the characteristic function of N (0, 1)
so we are done by our theorem.
104
Edgeworth expansions
log(φ(t)) = log(1 + u)
where
u = −t2/2 − iγt3 /6 + · · · .
Use log(1 + u) = u − u2/2 + · · · to get
log(φ(t)) ≈
[−t2/2 − iγt3 /6 + · · · ]
− [· · · ]2/2 + · · ·
which rearranged is
105
Now apply this calculation to
106
Remarks:
107
Multivariate convergence in distribution
E(g(Xn )) → E(g(X))
for each bounded continuous real valued func-
tion g on Rp.
or
108
Extensions of the CLT
109
Slutsky’s Theorem: If Xn converges in dis-
tribution to X and Yn converges in distribu-
tion (or in probability) to c, a constant, then
Xn + Yn converges in distribution to X + c.
More generally, if f (x, y) is continuous then
f (Xn , Yn) ⇒ f (X, c).
110
The delta method: Suppose:
If X_n ∈ R^p and f : R^p → R^q then f′ is the q × p matrix of first derivatives of the components of f.
Define
W_i = (X_i², X_i)ᵗ.
See that
W̄_n = (ΣX_i²/n, ΣX_i/n)ᵗ.
Define
f(x_1, x_2) = x_1 − x_2².
See that s² = f(W̄_n). Also
µ_W ≡ E(W̄_n) = (E(X_i²), E(X_i))ᵗ = (µ² + σ², µ)ᵗ,
and
n^{1/2}(Y_n − y) ⇒ MVN(0, Σ)
where Σ = Var(W_i) = E[(W − µ_W)(W − µ_W)ᵗ].
There are 4 entries in this matrix. The top left entry is (X² − µ² − σ²)², which has expectation
E{(X² − µ² − σ²)²} = E(X⁴) − (µ² + σ²)².
113
Using a binomial expansion, the off-diagonal entry is
(X² − µ² − σ²)(X − µ),
which has expectation
E(X³) − µE(X²).
Similarly to the 4th moment calculation, this works out to µ_3 + 2µσ². So
Σ = [ µ_4 − σ⁴ + 4µµ_3 + 4µ²σ²    µ_3 + 2µσ² ; µ_3 + 2µσ²    σ² ].
114
7) Compute the derivative (gradient) of f: it has components (1, −2x_2). Evaluate at y = (µ² + σ², µ) to get
aᵗ = (1, −2µ).
This leads to
n^{1/2}(s² − σ²) ≈ n^{1/2}[1, −2µ] ( ΣX_i²/n − (µ² + σ²) ; X̄ − µ )
which converges in distribution to N(0, aᵗΣa).
115
An alternative approach is worth pursuing. Suppose c is a constant and define X_i* = X_i − c. Since s² is unchanged, take c = µ; then the new mean is 0,
aᵗ = (1, 0)
and
Σ = [ µ_4 − σ⁴    µ_3 ; µ_3    σ² ].
Notice that
aᵗΣ = [µ_4 − σ⁴, µ_3]
and
aᵗΣa = µ_4 − σ⁴.
116
Special case: if the population is N(µ, σ²) then µ_3 = 0 and µ_4 = 3σ⁴. Our calculation gives
n^{1/2}(s² − σ²) ⇒ N(0, 2σ⁴).
You can divide through by σ² and get
n^{1/2}(s²/σ² − 1) ⇒ N(0, 2).
In fact ns²/σ² has a χ²_{n−1} distribution, and so the usual central limit theorem gives the same limit directly.
117
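Not in the original notes: a simulation sketch (assuming numpy) checking the delta-method result n^{1/2}(s² − σ²) ⇒ N(0, µ_4 − σ⁴). An exponential population is used, for which σ² = 1 and µ_4 = 9, so the limiting variance is 8.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 400, 20_000
x = rng.exponential(size=(reps, n))          # sigma^2 = 1, mu_4 = 9
s2 = x.var(axis=1, ddof=1)
stat = np.sqrt(n) * (s2 - 1.0)

print(stat.var(), 9.0 - 1.0)                 # empirical variance vs mu_4 - sigma^4
```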
Monte Carlo
Note: accuracy is inversely proportional to √N.
Transformation
119
Pitfall: Random uniforms generated on com-
puter sometimes have only 6 or 7 digits.
Improved algorithm:
If Y is to have cdf F: take Y = F^{−1}(U); then
P(Y ≤ y) = P(F^{−1}(U) ≤ y) = P(U ≤ F(y)) = F(y).
Define
Y_1 = √(−2 log(U_1)) cos(2πU_2)
and
Y_2 = √(−2 log(U_1)) sin(2πU_2).
122
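Not in the original notes: a small sketch (assuming numpy) of the two transformations just described, inversion Y = F^{−1}(U) for the standard exponential and the Box–Muller pair built from two uniforms.

```python
import numpy as np

rng = np.random.default_rng(4)

# Inversion: F(y) = 1 - exp(-y), so F^{-1}(u) = -log(1 - u).
u = rng.uniform(size=100_000)
y_exp = -np.log(1.0 - u)
print(y_exp.mean(), y_exp.var())      # both should be close to 1

# Box-Muller: two independent N(0, 1) variables from two independent uniforms.
u1, u2 = rng.uniform(size=(2, 100_000))
y1 = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)
y2 = np.sqrt(-2.0 * np.log(u1)) * np.sin(2.0 * np.pi * u2)
print(y1.mean(), y1.var(), np.corrcoef(y1, y2)[0, 1])
```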
Acceptance Rejection
Algorithm:
1) Generate W1.
4) Let Y = W1 if U1 ≤ p.
R
Estimate things like A f (x)dx by computing
the fraction of the Wi which land in A.
125
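Not in the original notes: a minimal acceptance–rejection sketch (assuming numpy) filling in the steps not shown above; the target f(x) = 6x(1 − x) on (0, 1), the Uniform(0, 1) proposal, and the envelope constant c = 1.5 are all illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    return 6.0 * x * (1.0 - x)        # Beta(2, 2) density

c = 1.5                               # f(w) <= c * g(w) with g = 1 on (0, 1)
samples = []
while len(samples) < 50_000:
    w = rng.uniform()                 # 1) generate W1 from the proposal g
    p = f(w) / c                      # 2) acceptance probability f(W1) / (c g(W1))
    if rng.uniform() <= p:            # 3) draw U1 and compare with p
        samples.append(w)             # 4) let Y = W1 if U1 <= p

samples = np.array(samples)
print(samples.mean(), samples.var())  # Beta(2, 2): mean 0.5, variance 0.05
```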
Variance reduction
The Cauchy density is
f(x) = 1/(π(1 + x²)).
Generate U_1, ..., U_N uniforms. Compute T = X̄. The estimate is unbiased.
Standard error is √(p(1 − p)/N).
126
Improvement: −Xi also has Cauchy dstbn.
Take Si = −Ti.
The variance of p̃ is
Compute θ̂ = Σ|Z_i|/N. So we try
θ̃ = θ̂ − c(Σ Z_i²/N − 1).
Notice that E(θ̃) = θ and
Var(θ̃) = Var(θ̂) − 2c Cov(θ̂, Σ Z_i²/N) + c² Var(Σ Z_i²/N).
130
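Not in the original notes: a sketch (assuming numpy) of the control-variate idea above, estimating θ = E|Z| = √(2/π) by θ̂ = mean(|Z_i|) and correcting with the control variate mean(Z_i²) − 1, whose expectation is 0; the coefficient c is estimated from the simulations themselves.

```python
import numpy as np

rng = np.random.default_rng(6)
N, reps = 1_000, 2_000
z = rng.standard_normal((reps, N))

theta_hat = np.abs(z).mean(axis=1)
control = (z**2).mean(axis=1) - 1.0

# Variance-minimizing coefficient c = Cov(theta_hat, control) / Var(control).
c = np.cov(theta_hat, control)[0, 1] / np.var(control)
theta_tilde = theta_hat - c * control

print(np.sqrt(2 / np.pi))
print(theta_hat.mean(), theta_hat.var())
print(theta_tilde.mean(), theta_tilde.var())   # same mean, smaller variance
```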
Several schools of statistical thinking. Main
schools of thought summarized roughly as fol-
lows:
p is probability of getting H.
133
Maximum Likelihood Estimation
[Figure: Cauchy Data. Panels of likelihood functions, Theta (−10 to 10) on the horizontal axis, Likelihood (0 to 1) on the vertical axis.]
135
[Figure: Likelihood Function: Cauchy, n = 5. Six panels, Theta (−2 to 2) versus Likelihood (0 to 1).]
136
[Figure: Likelihood Function: Cauchy, n = 25. Six panels, Theta (−10 to 10) versus Likelihood (0 to 1).]
137
[Figure: Likelihood Function: Cauchy, n = 25. Six panels, Theta (−1 to 1) versus Likelihood (0 to 1).]
138
I want you to notice the following points:
139
To maximize this likelihood: differentiate L,
set result equal to 0.
`(θ) = log{L(θ)} .
140
[Figure: Likelihood Ratio Intervals: Cauchy, n = 5. Six panels, Theta (−10 to 10) versus Log Likelihood, with the observed data points marked along the axis.]
141
[Figure: Likelihood Ratio Intervals: Cauchy, n = 5. Six panels, Theta (−2 to 2) versus Log Likelihood.]
142
[Figure: Likelihood Ratio Intervals: Cauchy, n = 25. Six panels, Theta (−10 to 10) versus Log Likelihood, with the observed data points marked along the axis.]
143
[Figure: Likelihood Ratio Intervals: Cauchy, n = 25. Six panels, Theta (−1 to 1) versus Log Likelihood.]
144
Notice the following points:
[Figure: panels of Score and Log Likelihood functions plotted against Theta from −10 to 10.]
146
[Figure: further panels of Score and Log Likelihood functions plotted against Theta from −10 to 10.]
147
Example: X ∼ Binomial(n, θ).
L(θ) = (n choose X) θ^X (1 − θ)^{n−X}
ℓ(θ) = log(n choose X) + X log(θ) + (n − X) log(1 − θ)
U(θ) = X/θ − (n − X)/(1 − θ)
The function L is 0 at θ = 0 and at θ = 1 unless X = 0 or X = n, so for 1 ≤ X ≤ n − 1 the MLE must be found by setting U = 0, giving
θ̂ = X/n.
For X = n the log-likelihood has derivative
U(θ) = n/θ > 0
for all θ, so the likelihood is an increasing function of θ, maximized at θ̂ = 1 = X/n. Similarly when X = 0 the maximum is at θ̂ = 0 = X/n.
148
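Not in the original notes: a small sketch (assuming numpy) of the Binomial example, confirming numerically that the log-likelihood is maximized at θ̂ = X/n; the values n = 20, X = 7 are illustrative.

```python
import numpy as np

def loglik(theta, x, n):
    # log-likelihood up to the additive constant log(n choose x)
    return x * np.log(theta) + (n - x) * np.log(1.0 - theta)

n, x = 20, 7
grid = np.linspace(0.001, 0.999, 9_999)
print(grid[np.argmax(loglik(grid, x, n))], x / n)   # both about 0.35
```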
The Normal Distribution
150
[Figure: perspective plots of the normal likelihood surface for samples of size n = 10 and n = 100.]
151
[Figure: contour plots of the normal likelihood in (Mu, Sigma) for n = 10 and n = 100.]
152
Notice that the contours are quite ellipsoidal
for the larger sample size.
U (θ) = 0 .
153
Solving U (θ) = 0: Examples
N(µ, σ 2)
Binomial(n, θ)
155
The 2 parameter exponential: the density is
f(x; α, β) = (1/β) e^{−(x−α)/β} 1(x > α).
The log-likelihood is −∞ for α > min{X_1, ..., X_n} and otherwise is
ℓ(α, β) = −n log(β) − Σ(X_i − α)/β.
This is an increasing function of α until α reaches α̂ = min{X_1, ..., X_n}.
157
It is not possible to find explicitly the remain-
ing two parameters; numerical methods are
needed.
158
Large Sample Theory
U(θ)/n → µ(θ)
where
µ(θ) = E_{θ_0}[(∂ log f/∂θ)(X_i, θ)] = ∫ (∂ log f/∂θ)(x, θ) f(x, θ_0) dx.
160
Example: N(µ, 1) data:
U(µ)/n = Σ(X_i − µ)/n = X̄ − µ.
If the true mean is µ_0 then X̄ → µ_0 and
U(µ)/n → µ_0 − µ.
162
Fact: inequality is strict unless the θ and θ0
densities are actually the same.
163
Definition A sequence θ̂n of estimators of θ is
consistent if θ̂n converges weakly (or strongly)
to θ.
Suppose:
A = {|θ̂n − θ0| ≤ }
164
Theorem:
U(θ̂) = 0 = U(θ_0) + U′(θ_0)(θ̂ − θ_0) + U″(θ̃)(θ̂ − θ_0)²/2
for some θ̃ between θ_0 and θ̂.
166
Derivatives of U are sums of n terms.
167
Normal case:
U(θ_0) = Σ(X_i − µ_0)
has a normal distribution with mean 0 and variance n (SD √n). The derivative is
U′(µ) = −n,
and the next derivative U″ is 0. Let
U_i = (∂ log f/∂θ)(X_i, θ_0)
and
V_i = −(∂² log f/∂θ²)(X_i, θ).
168
In general, U (θ0) = Ui has mean 0 and ap-
P
169
Notice: we interchanged the order of differentiation and integration at one point. The identity used is
∫ −(∂² log f/∂θ²)(x, θ) f(x, θ) dx = ∫ (∂ log f/∂θ)(x, θ) (∂f/∂θ)(x, θ) dx = ∫ [(∂ log f/∂θ)(x, θ)]² f(x, θ) dx.
170
Definition: The Fisher Information is
I(θ) = −Eθ (U 0(θ)) = nEθ0 (V1 )
We refer to I(θ0) = Eθ0 (V1) as the information
in 1 observation.
172
Summary
173
• Also
{V(θ̂)/n}n1/2(θ̂ − θ0) − n−1/2U (θ0) → 0
in probability as n → ∞.
176
Finding (good) preliminary Point
Estimates
Method of Moments
µ02 = X 2
and so on to
µ0p = X p
You need to remember that the population mo-
ments µ0k will be formulas involving the param-
eters.
178
Gamma Example: the first two moment equations are
αβ = X̄
α(α + 1)β² = ΣX_i²/n
or
αβ = X̄
αβ² = ΣX_i²/n − X̄².
179
Divide the second equation by the first to find the method of moments estimate of β:
β̃ = (ΣX_i²/n − X̄²)/X̄.
Then from the first equation get
α̃ = X̄/β̃ = X̄²/(ΣX_i²/n − X̄²).
180
The score function has components
U_β = ΣX_i/β² − nα/β
and
U_α = −nψ(α) + Σ log(X_i) − n log(β).
You can solve for β in terms of α, which leaves you trying to find a root of the equation
−nψ(α) + Σ log(X_i) − n log(ΣX_i/(nα)) = 0.
To use Newton–Raphson on this you begin with the preliminary estimate α̂_1 = α̃ and then compute iteratively
α̂_{k+1} = α̂_k − [Σ log(X_i)/n − ψ(α̂_k) − log(X̄/α̂_k)] / [1/α̂_k − ψ′(α̂_k)]
until the sequence converges. Computation of ψ′, the trigamma function, requires special software. Web sites like netlib and statlib are good sources for this sort of thing.
181
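Not in the original notes: a sketch (assuming numpy and scipy, whose digamma and polygamma functions stand in for ψ and ψ′) of the scheme just described — method-of-moments starting values followed by Newton–Raphson on the profile score for α.

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(7)
x = rng.gamma(shape=3.0, scale=2.0, size=500)   # simulated Gamma data

xbar, x2bar = x.mean(), (x**2).mean()
beta_mm = (x2bar - xbar**2) / xbar              # method-of-moments estimates
alpha = xbar / beta_mm

# Profile score in alpha: g(alpha) = mean(log X) - psi(alpha) - log(xbar / alpha)
logx_bar = np.log(x).mean()
for _ in range(20):
    g = logx_bar - digamma(alpha) - np.log(xbar / alpha)
    gprime = 1.0 / alpha - polygamma(1, alpha)  # derivative of the profile score
    step = g / gprime
    alpha -= step
    if abs(step) < 1e-10:
        break

beta = xbar / alpha
print(alpha, beta)    # should be near the true values 3 and 2
```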
Estimating Equations
182
Typically assume g(µi) = β0 + xiβ; g is link
function.
183
The key observation, however, is that it is not
necessary to believe that Yi has a Poisson dis-
tribution to make solving the equation U = 0
sensible. Suppose only that log(E(Yi)) = xiβ.
Then we have assumed that
Eβ (U (β)) = 0
This was the key condition in proving that
there was a root of the likelihood equations
which was consistent and here it is what is
needed, roughly, to prove that the equation
U (β) = 0 has a consistent root β̂.
184
Ignoring higher order terms in a Taylor expansion will give
V(β)(β̂ − β) ≈ U(β)
where V = −U′. In the mle case we had identities relating the expectation of V to the variance of U. In general here we have
Var(U) = Σ x_i² Var(Y_i).
If Y_i is Poisson with mean µ_i (and so Var(Y_i) = µ_i) this is
Var(U) = Σ x_i² µ_i.
Moreover we have
V_i = x_i² µ_i
and so
V(β) = Σ x_i² µ_i.
185
The central limit theorem (the Lyapunov kind) will show that U(β) has an approximate normal distribution with variance σ_U² = Σ x_i² Var(Y_i), and so
β̂ − β ≈ N(0, σ_U²/(Σ x_i² µ_i)²).
If Var(Y_i) = µ_i, as it is for the Poisson case, the asymptotic variance simplifies to 1/Σ x_i² µ_i.
186
Other estimating equations are possible and popular. If the w_i are any set of deterministic weights (possibly depending on µ_i) then we could define
U(β) = Σ w_i(Y_i − µ_i)
and still conclude that U = 0 probably has a consistent root which has an asymptotic normal distribution.
Abbreviation: GEE.
187
An estimating equation is unbiased if
Eθ (U (θ)) = 0
p̂ = X/n
where X is the number of heads. In this case
I get p̂ = 1
3.
190
Alternative estimate: p̃ = 1
2.
192
Question: is there a best estimate – one which
is better than every other estimator?
M SEθ̂ (θ ∗) = 0
so that θ̂ = θ ∗ with probability equal to 1.
So θ̂ = θ̃.
193
Principle of Unbiasedness: A good estimate is unbiased, that is,
E_θ(θ̂) ≡ θ.
(A statistic T is unbiased for θ if E_θ(T) ≡ θ.)
When we worked with the score function we derived some information from the identity
∫ f(x, θ) dx ≡ 1
by differentiation, and we do the same here. If T = T(X) is some function of the data X which is unbiased for θ then
E_θ(T) = ∫ T(x) f(x, θ) dx ≡ θ.
Differentiate both sides to get
1 = (d/dθ) ∫ T(x) f(x, θ) dx
= ∫ T(x) (∂/∂θ) f(x, θ) dx
= ∫ T(x) (∂/∂θ) log(f(x, θ)) f(x, θ) dx
= E_θ(T(X) U(θ))
where U is the score function.
195
Since score has mean 0
196
Summary of Implications
197
What can we do to find UMVUEs when the
CRLB is a strict inequality?
198
LHS of ≡ sign is polynomial function of p as is
RHS.
n(n − 1)h(2)/2
so h(2) = 0.
200
This can be rewritten in the form
Σ_{k=0}^{n} w(k) (n choose k) p^k (1 − p)^{n−k}
where
w(0) = h(0, 0)
2w(1) = h(1, 0) + h(0, 1)
w(2) = h(1, 1).
So, as before, w(0) = w(1) = w(2) = 0.
201
Now let's look at the variance of T:
Var(T) = E_p([T(Y_1, ..., Y_n) − p]²)
= E_p([T(Y_1, ..., Y_n) − X/n + X/n − p]²)
= E_p([T(Y_1, ..., Y_n) − X/n]²) + 2E_p([T(Y_1, ..., Y_n) − X/n][X/n − p]) + E_p([X/n − p]²).
Claim: the cross product term is 0, which will prove the variance of T is the variance of X/n plus a non-negative quantity (positive unless T(Y_1, ..., Y_n) ≡ X/n). Compute the cross product term by writing
202
Sum over those y1, . . . , yn whose sum is an in-
teger x; then sum over x:
203
To get more insight rewrite
E_p{T(Y_1, ..., Y_n)}
= Σ_{x=0}^{n} Σ_{Σy_i = x} T(y_1, ..., y_n) P(Y_1 = y_1, ..., Y_n = y_n)
= Σ_{x=0}^{n} Σ_{Σy_i = x} T(y_1, ..., y_n) P(Y_1 = y_1, ..., Y_n = y_n | X = x) P(X = x)
= Σ_{x=0}^{n} [Σ_{Σy_i = x} T(y_1, ..., y_n) / (n choose x)] (n choose x) p^x (1 − p)^{n−x}.
205
Sufficiency
206
Mathematically Precise version of this in-
tuition: Suppose T (X) is sufficient statistic
and S(X) is any estimate or confidence inter-
val or ... If you only know value of T then:
207
Example 1: Y_1, ..., Y_n iid Bernoulli(p). Given ΣY_i = y the indexes of the y successes have the same chance of being any one of the (n choose y) possible subsets of {1, ..., n}. This chance does not depend on p, so T(Y_1, ..., Y_n) = ΣY_i is a sufficient statistic.
209
Proof: Review conditional distributions: ab-
stract definition of conditional expectation is:
210
Proof:
E(R(X)g(X)) = ∫ R(x) g(x) f_X(x) dx
= ∫∫ R(x) y f_X(x) f(y|x) dy dx
= ∫∫ R(x) y f_{X,Y}(x, y) dy dx
= E(R(X) Y).
Also
E(Σ A_i(X) Y_i | X) = Σ A_i(X) E(Y_i | X).
211
Example: Y1, . . . , Yn iid Bernoulli(p). Then
X = Yi is Binomial(n, p). Summary of con-
P
clusions:
depend on p.
• p̂ is the UMVUE of p.
Eθ [E(T |S)] = Eθ (T )
so that E(T |S) and T have the same bias. If
T is unbiased then
2
i
−2E(Y )E[Y |X] + E (Y )
214
Simplify remembering E(Y |X) is function of X
— constant when holding X fixed. So
E[Y |X]E[Y |X] = E[Y E(Y |X)|X]
taking expectations gives
E[(E[Y |X])2 ] = E[E[Y E(Y |X)|X]]
= E[Y E(Y |X)]
So 3rd term above cancels with 2nd term.
E(Y1 (1 − Y2)|X)
We do this in two steps. First compute
E(Y1(1 − Y2)|X = x)
216
Notice that the random variable Y_1(1 − Y_2) is either 1 or 0, so its expected value is just the probability that it is equal to 1:
E(Y_1(1 − Y_2)|X = x)
= P(Y_1(1 − Y_2) = 1|X = x)
= P(Y_1 = 1, Y_2 = 0|Y_1 + Y_2 + ··· + Y_n = x)
= P(Y_1 = 1, Y_2 = 0, Y_1 + ··· + Y_n = x) / P(Y_1 + Y_2 + ··· + Y_n = x)
= P(Y_1 = 1, Y_2 = 0, Y_3 + ··· + Y_n = x − 1) / [(n choose x) p^x(1 − p)^{n−x}]
= p(1 − p) (n−2 choose x−1) p^{x−1}(1 − p)^{(n−2)−(x−1)} / [(n choose x) p^x(1 − p)^{n−x}]
= (n−2 choose x−1) / (n choose x)
= x(n − x)/(n(n − 1)).
This is simply np̂(1 − p̂)/(n − 1) (which can be bigger than 1/4, the maximum value of p(1 − p)).
217
Example: If X1 , . . . , Xn are iid N (µ, 1) then X̄
is sufficient and X1 is an unbiased estimate of
µ. Now
218
Finding Sufficient statistics
219
Proof: Find statistic T (X) such that X is a
one to one function of the pair S, T . Apply
change of variables to the joint density of S
and T . If the density factors then
220
Example: If X_1, ..., X_n are iid N(µ, σ²) then the joint density is
(2π)^{−n/2} σ^{−n} exp{−ΣX_i²/(2σ²) + µΣX_i/σ² − nµ²/(2σ²)}
which is evidently a function of
(ΣX_i², ΣX_i).
This pair is a sufficient statistic. You can write this pair as a bijective function of (X̄, Σ(X_i − X̄)²).
Example: for Y_1, ..., Y_n iid Bernoulli(p),
f(y_1, ..., y_n; p) = Π p^{y_i}(1 − p)^{1−y_i} = p^{Σy_i}(1 − p)^{n−Σy_i}.
Define g(x, p) = p^x(1 − p)^{n−x} and h ≡ 1 to see that X = ΣY_i is sufficient by the factorization criterion.
221
Minimal Sufficiency
1. S1 = (X1 , . . . , Xn).
2. S2 = (X(1) , . . . , X(n)).
3. S3 = X̄.
`(θ) − `(θ ∗)
is minimal sufficient. WARNING: the function
is the statistic.
223
Completeness
Eθ (h(T )) = 0
for all θ implies h(T ) = 0.
224
We have already seen that X is complete in the Binomial(n, p) model. In the N(µ, 1) model suppose
E_µ(h(X̄)) ≡ 0.
Since X̄ has a N(µ, 1/n) distribution we find that
E(h(X̄)) = (√n e^{−nµ²/2}/√(2π)) ∫_{−∞}^{∞} h(x) e^{−nx²/2} e^{nµx} dx
so that
∫_{−∞}^{∞} h(x) e^{−nx²/2} e^{nµx} dx ≡ 0.
This is called the Laplace transform of h(x) e^{−nx²/2}. Hence h ≡ 0.
225
How to Prove Completeness
(S1(X), . . . , Sp(X))
is complete and sufficient.
226
Example: the N(µ, σ²) model density has the form
(1/√(2π)) exp{−x²/(2σ²) + µx/σ² − µ²/(2σ²) − log σ}
which is an exponential family with
h(x) = 1/√(2π)
a_1(θ) = −1/(2σ²),  S_1(x) = x²
a_2(θ) = µ/σ²,      S_2(x) = x
and
c(θ) = −µ²/(2σ²) − log σ.
It follows that
(ΣX_i², ΣX_i)
is a complete sufficient statistic.
227
Remark: The statistic (s2 , X̄) is a one to one
function of ( Xi2, Xi) so it must be com-
P P
229
Criticism of Unbiasedness
φ̃ = min(φ̂, 1/4)
is smaller than that of φ̂.
φ = log(p/(1 − p)) .
Since the expectation of any function of
the data is a polynomial function of p and
since φ is not a polynomial function of p
there is no unbiased estimate of φ
230
• The UMVUE of σ is not the square root
of the UMVUE of σ 2. This method of es-
timation does not have the parameteriza-
tion equivariance that maximum likelihood
does.
231
Hypothesis Testing
R = {X : we choose Θ1 if we observe X}
called the rejection or critical region of the
test.
232
For technical reasons which will come up soon I
prefer to use the second description. However,
each φ corresponds to a unique rejection region
Rφ = {x : φ(x) = 1}.
π(θ) = Pθ (X ∈ Rφ ) = Eθ (φ(X))
233
Simple versus Simple testing
234
Type I error: the error made when θ = θ0 but
we choose H1, that is, X ∈ Rφ.
235
Problem: choose, for each x, either the value
0 or the value 1, in such a way as to minimize
the integral. But for each x the quantity
236
Neyman and Pearson suggested that in prac-
tice the two kinds of errors might well have
unequal consequences. They suggested that
rather than minimize any quantity of the form
above you pick the more serious kind of error,
label it Type I and require your rule to hold
the probability α of a Type I error to be no
more than some prespecified level α0. (This
value α0 is typically 0.05 these days, chiefly for
historical reasons.)
237
Example: Suppose X is Binomial(n, p) and ei-
ther p = p0 = 1/2 or p = p1 = 3/4.
Region α β
R1 = ∅ 0 1
R2 = {x = 0} 0.03125 1 − (1/4)5
R3 = {x = 5} 0.03125 1 − (3/4)5
238
The first three have the same α and β as before
while R4 has α = α0 = 0.0625 an β = 1 −
(3/4)5 − (1/4)5 . Thus R4 is optimal!
R α
∅ 0
{3}, {0} 1/8
{0,3} 2/8
{1}, {2} 3/8
{0,1}, {0,2}, {1,3}, {2,3} 4/8
{0,1,3}, {0,2,3} 5/8
{1,2} 6/8
{0,1,2}, {1,2,3} 7/8
{0,1,2,3} 1
π(θ) = Eθ (φ(X))
α = E0(φ(X))
and
β = 1 − E1(φ(X))
φ(x) = 1(x ∈ C)
241
The Neyman Pearson Lemma: In testing f_0 against f_1 the probability β of a type II error is minimized, subject to α ≤ α_0, by the test function
φ(x) = 1 if f_1(x)/f_0(x) > λ; γ if f_1(x)/f_0(x) = λ; 0 if f_1(x)/f_0(x) < λ,
where λ is the largest constant such that
P_0(f_1(X)/f_0(X) ≥ λ) ≥ α_0
and
P_0(f_1(X)/f_0(X) ≤ λ) ≥ 1 − α_0,
and where γ is any number chosen so that
E_0(φ(X)) = P_0(f_1(X)/f_0(X) > λ) + γP_0(f_1(X)/f_0(X) = λ) = α_0.
The value of γ is unique if P_0(f_1(X)/f_0(X) = λ) > 0.
242
Example: Binomial(n, p) with p_0 = 1/2 and p_1 = 3/4: the ratio f_1/f_0 is
3^x 2^{−n}.
If n = 5 this ratio is one of 1, 3, 9, 27, 81, 243 divided by 32.
Since
P_0(X = 5) + γP_0(X = 4) = 0.05
we solve for γ and find
γ = (0.05 − 1/32)/(5/32) = 0.12.
244
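Not in the original notes: a sketch (assuming scipy) of the randomized Neyman–Pearson test just described for Binomial(5, p): reject for X = 5, randomize at X = 4, and check the level and the power at p_1 = 3/4.

```python
from scipy.stats import binom

n, p0, alpha0 = 5, 0.5, 0.05
p_x5 = binom.pmf(5, n, p0)               # 1/32
p_x4 = binom.pmf(4, n, p0)               # 5/32
gamma = (alpha0 - p_x5) / p_x4
level = p_x5 + gamma * p_x4
print(gamma, level)                      # gamma = 0.12, level = 0.05

# Power of the randomized test at p1 = 3/4:
p1 = 0.75
power = binom.pmf(5, n, p1) + gamma * binom.pmf(4, n, p1)
print(power)
```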
If α0 = 6/32 then we can either take λ to be
243/32 and γ = 1 or λ = 81/32 and γ = 0.
However, our definition of λ in the theorem
makes λ = 81/32 and γ = 0.
Xi − nµ2
X
exp{µ1 1/2}
245
Now choose λ so that
Xi − nµ2
X
P0(exp{µ1 1/2} > λ) = α0
Can make it equal because f1 (X)/f0 (X) has a
continuous distribution. Rewrite probability as
246
The rejection region looks complicated: reject
if a complicated statistic is larger than λ which
has a complicated formula. But in calculating
λ we re-expressed the rejection region in terms
of
Xi
P
√ > z α0
n
The key feature is that this rejection region is
the same for any µ1 > 0. [WARNING: in the
algebra above I used µ1 > 0.] This is why the
Neyman Pearson lemma is a lemma!
247
Definition: In the general problem of testing
Θ0 against Θ1 the level of a test function φ is
α = sup Eθ (φ(X))
θ∈Θ0
The power function is
π(θ) = Eθ (φ(X))
A test φ∗ is a Uniformly Most Powerful level α0
test if
1. φ∗ has level α ≤ αo
248
Proof of Neyman Pearson lemma: Given a test φ with level strictly less than α_0 we can define the test
φ*(x) = [(1 − α_0)/(1 − α)] φ(x) + (α_0 − α)/(1 − α),
which has level α_0 and β smaller than that of φ.
Hence we may assume without loss that α = α0
and minimize β subject to α = α0. However,
the argument which follows doesn’t actually
need this.
249
Lagrange Multipliers
251
Now φ has level α0 and according to the the-
orem above minimizes lambda0 α + β. Suppose
φ∗ is some other test with level α∗ ≤ α0. Then
λ0 α φ + βφ ≤ λ 0 α φ ∗ + β φ ∗
We can rearrange this as
βφ ∗ ≥ β φ
which proves the Neyman Pearson Lemma.
252
Example application of NP: Binomial(n, p)
to test p = p0 versus p1 for a p1 > p0 the NP
test is of the form
253
Application of the NP lemma: In the N (µ, 1)
model consider Θ1 = {µ > 0} and Θ0 = {0}
or Θ0 = {µ ≤ 0}. The UMP level α0 test of
H0 : µ ∈ Θ0 against H1 : µ ∈ Θ1 is
254
Now if φ is any other level α0 test then we have
E0(φ(X1 , . . . , Xn)) ≤ α0
Fix a µ > 0. According to the NP lemma
255
Fairly general phenomenon: for any µ > µ0 the
likelihood ratio fµ/f0 is an increasing function
of Xi. The rejection region of the NP test
P
256
Typical family where this works: one parameter
exponential family. Usually there is no UMP
test.
π(θ) ≥ α .
π 0(µ0) = 0
258
Example: N(µ, 1): data X = (X_1, ..., X_n). If φ is any test function then
π′(µ) = (∂/∂µ) ∫ φ(x) f(x, µ) dx.
Differentiate under the integral and use
∂f(x, µ)/∂µ = Σ(x_i − µ) f(x, µ)
to get the condition
∫ φ(x) x̄ f(x, µ_0) dx = µ_0 α_0.
So the two conditions are
E_{µ_0}(φ(X)) = α_0
and
E_{µ_0}(X̄ φ(X)) = µ_0 α_0.
259
Fix two values λ1 > 0 and λ2 and minimize
260
The likelihood ratio f1 /f0 is simply
λ1 + λ2 (X̄ − µ0)
for all X̄ sufficiently large or small. That is,
2. φ∗ is unbiased.
(Xi − µ0)2
X
S=
is complete and sufficient. Remember defini-
tions of both completeness and sufficiency de-
pend on the parameter space.
263
Suppose φ(ΣX_i, S) is an unbiased level α test. Then we have
E_{µ_0,σ}(φ(ΣX_i, S)) = α
for all σ. Condition on S and get
E_{µ_0,σ}[E(φ(ΣX_i, S)|S)] = α
for all σ. Sufficiency guarantees that
g(S) = E(φ(ΣX_i, S)|S)
is a statistic and completeness that
g(S) ≡ α.
264
If we maximize the conditional power of this
test for each s then we will maximize its power.
What is the conditional model given S = s?
That is, what is the conditional distribution of
X̄ given S = s? The answer is that the joint
density of X̄, S is of the form
265
Note the disappearance of θ_2, and the null is θ_1 = 0. This permits application of the NP lemma to the conditional family to prove that the UMP unbiased test has the form
φ(X̄, S) = 1(n^{1/2}X̄ / √(n[S/n − X̄²]/(n − 1)) > K*(S))
for some K*. The quantity
T = n^{1/2}X̄ / √(n[S/n − X̄²]/(n − 1))
is the usual t statistic and is exactly independent of S (see Theorem 6.1.5 on page 262 in Casella and Berger). This guarantees that
K*(S) = t_{n−1,α}
and makes our UMPU test the usual t test.
266
Optimal tests
267
Likelihood Ratio tests
268
Example 1: N(µ, 1): test µ ≤ 0 against µ > 0. (Remember the UMP test.) The log likelihood is
−n(X̄ − µ)²/2.
If X̄ > 0 the global maximum over Θ_1 is at X̄; if X̄ ≤ 0 the global maximum over Θ_1 is at 0. Thus µ̂_1, which maximizes ℓ(µ) subject to µ > 0, is X̄ if X̄ > 0 and 0 if X̄ ≤ 0. Similarly, µ̂_0 is X̄ if X̄ ≤ 0 and 0 if X̄ > 0. Hence
f_{θ̂_1}(X)/f_{θ̂_0}(X) = exp{ℓ(µ̂_1) − ℓ(µ̂_0)}
which simplifies to
exp{nX̄|X̄|/2}.
This is a monotone increasing function of X̄, so the rejection region will be of the form X̄ > K. To get level α reject if n^{1/2}X̄ > z_α. Notice a simpler statistic is the log likelihood ratio
λ ≡ 2 log(f_{µ̂_1}(X)/f_{µ̂_0}(X)) = nX̄|X̄|.
269
Example 2: In the N (µ, 1) problem suppose
we make the null µ = 0. Then the value of
µ̂0 is simply 0 while the maximum of the log-
likelihood over the alternative µ 6= 0 occurs at
X̄. This gives
λ = nX̄ 2
which has a χ21 distribution. This test leads to
the rejection region λ > (zα/2 )2 which is the
usual UMPU test.
270
On the null µ = 0, so find σ̂_0 by maximizing
ℓ(0, σ) = −ΣX_i²/(2σ²) − n log(σ).
This leads to
σ̂_0² = ΣX_i²/n
and
ℓ(0, σ̂_0) = −n/2 − n log(σ̂_0).
This gives
λ = −n log(σ̂²/σ̂_0²).
Since
σ̂²/σ̂_0² = Σ(X_i − X̄)²/[Σ(X_i − X̄)² + nX̄²]
we can write λ = n log(1 + nX̄²/Σ(X_i − X̄)²), a monotone function of the square of the usual t statistic. In general,
λ = 2[ℓ(θ̂_1) − ℓ(θ̂_0)]
has nearly a χ²_1 distribution.
272
Now we maximize the likelihood over the null
hypothesis, that is we find θ̂0 = (φ0 , γ̂0 ) to
maximize
`(φ0 , γ)
The log-likelihood ratio statistic is
2[`(θ̂) − `(θ̂0 )]
273
According to our large sample theory for the
mle we have
θ̂ ≈ θ + I −1U
and
−1 U
γ̂0 ≈ γ0 + Iγγ γ
274
If you subtract these you find that
2[`(θ̂) − `(θ̂0 )]
can be written in the approximate form
U tM U
for a suitable matrix M . It is now possible to
use the general theory of the distribution of
X tM X where X is M V N (0, Σ) to demonstrate
that
λ = 2[`(θ̂) − `(θ̂0)]
has, under the null hypothesis, approximately
a χ2
p distribution.
275
Aside:
AQQAt = AQAt
Since Σ is non-singular so is A. Multiply by
A−1 on left and (At)−1 on right; get QQ = Q.
ν = trace(Λ)
277
But
278
Confidence Sets
Pθ (φ(θ) ∈ C) ≥ β
Confidence sets are very closely connected with
hypothesis tests:
279
From tests to confidence sets
280
Confidence sets from Pivots
g(θ, X) ∈ A
to get
θ ∈ C(X, A) .
281
Example: (n − 1)s²/σ² ∼ χ²_{n−1} is a pivot in the N(µ, σ²) model. Use the quantiles χ²_{n−1,1−α/2} and χ²_{n−1,α/2}. Then
P(χ²_{n−1,1−α/2} ≤ (n − 1)s²/σ² ≤ χ²_{n−1,α/2}) = β
for all µ, σ.
In the same model we also have
P(χ²_{n−1,1−α} ≤ (n − 1)s²/σ²) = β
which can be solved to get
P(σ ≤ (n − 1)^{1/2} s/χ_{n−1,1−α}) = β.
This gives a level 1 − α interval
283
Decision Theory and Bayesian Methods
• B = Ride my bike.
• H = Stay home.
284
Ingredients of Decision Problem: No data
case.
285
In the example we might use the following table for L:

      C   B   T   H
R     3   8   5   25
S     5   0   2   25
286
[Figure: Losses of deterministic rules. The four rules C, B, T, H plotted as points in the (Sun, Rain) loss plane, both axes 0 to 30.]
287
Statistical Decision Theory
288
Example: In estimation theory to estimate a
real parameter θ we used D = Θ,
L(d, θ) = (d − θ)2
and find that the risk of an estimator θ̂(X) is
289
• Minimax methods choose δ to minimize
the worst case risk:
290
Example: Transport problem has no data so
the only possible (non-randomized) decisions
are the four possible actions B, C, T, H. For B
and T the worst case is rain. For the other two
actions Rain and Sun are equivalent. We have
the following table:
          C   B   T   H
R         3   8   5   25
S         5   0   2   25
Maximum   5   8   5   25
291
Now imagine: toss coin with probability λ of
getting Heads, take my car if Heads, otherwise
take transit. Long run average daily loss would
be 3λ + 5(1 − λ) when it rains and 5λ + 2(1 − λ)
when it is Sunny. Call this procedure dλ ; add
it to graph for each value of λ. Varying λ
from 0 to 1 gives a straight line running from
(3, 5) to (5, 2). The two losses are equal when
λ = 3/5. For smaller λ worst case risk is for
sun; for larger λ worst case risk is for rain.
292
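Not in the original notes: a tiny sketch (plain Python) of the randomized rule d_lambda just described — take the car with probability lambda, transit otherwise — showing that the two expected losses cross at lambda = 3/5, where the worst case is 3.8.

```python
# Losses from the table above: Car (rain 3, sun 5), Transit (rain 5, sun 2).
loss = {"C": {"R": 3, "S": 5}, "T": {"R": 5, "S": 2}}

def risk(lam):
    rain = lam * loss["C"]["R"] + (1 - lam) * loss["T"]["R"]
    sun = lam * loss["C"]["S"] + (1 - lam) * loss["T"]["S"]
    return rain, sun, max(rain, sun)

for lam in (0.0, 0.3, 0.6, 1.0):
    print(lam, risk(lam))
# At lam = 3/5 the rain and sun losses are both 3.8, below the worst case of
# every deterministic rule in the table.
```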
[Figure: Losses. The (Sun, Rain) loss plane with the four deterministic rules; the randomized rules d_lambda trace the segment joining C and T.]
293
The figure then shows that d3/5 is actually
the minimax procedure when randomized pro-
cedures are permitted.
294
[Figure: Losses. The (Sun, Rain) loss plane illustrating that d_{3/5} is the minimax randomized rule.]
295
Randomization in decision problems permits
assumption that set of possible risk functions
is convex — an important technical conclusion
used to prove many basic decision theory re-
sults.
Rδ ∗ (θ) ≤ Rδ (θ)
for all θ and there is at least one value of θ
where the inequality is strict. A rule which is
not inadmissible is called admissible.
r π = π R LR + π S LS .
Consider set of L such that this Bayes risk is
equal to some constant.
297
Consider three priors: π1 = (0.9, 0.1), π2 =
(0.5, 0.5) and π3 = (0.1, 0.9).
298
[Figure: Losses. The (Sun, Rain) loss plane with a line of constant Bayes risk.]
299
Here is a picture showing the same lines for the
three priors above.
[Figure: Losses. The (Sun, Rain) loss plane with constant Bayes-risk lines for the three priors π1, π2, π3.]
300
Bayes procedure for π1 (you’re pretty sure it
will be sunny) is to ride your bike. If it’s a toss
up between R and S you take the bus. If R is
very likely you take your car. Prior (0.6, 0.4)
produces the line shown here:
[Figure: Losses. The (Sun, Rain) loss plane with the constant Bayes-risk line for the prior (0.6, 0.4).]
301
Decision Theory and Bayesian Methods
Summary for no data case
• We call δ ∗ minimax if
302
• A prior is a probability distribution π on Θ,.
rπ (δ ∗ ) ≤ rπ (δ)
for any decision δ.
303
• For infinite parameter spaces: π(θ) > 0 on
R
Θ is a proper prior if π(θ)dθ < ∞; divide π
R
by integral to get a density. If π(θ)dθ = ∞
π is an improper prior density.
304
• Every Bayes procedure with finite Bayes
risk (for prior with density > 0 for all θ)
is admissible.
L(δ ∗ , θ) ≤ L(δ, θ)
Multiply by the prior density; integrate:
rπ (δ ∗ ) ≤ rπ (δ)
If there is a θ for which the inequality in-
volving L is strict and if the density of π
is positive at that θ then the inequality for
rπ is strict which would contradict the hy-
pothesis that δ is Bayes for π.
305
• A minimax procedure is admissible. (Ac-
tually there can be several minimax proce-
dures and the claim is that at least one of
them is admissible. When the parameter
space is infinite it might happen that set
of possible risk functions is not closed; if
not then we have to replace the notion of
admissible by some notion of nearly admis-
sible.)
306
Decision Theory and Bayesian Methods
Summary when there is data
• A procedure is a map δ : X 7→ D.
307
• We call δ ∗ minimax if
rπ (δ ∗ ) ≤ rπ (δ)
for any decision δ.
308
• For infinite parameter spaces: π(θ) > 0 on
R
Θ is a proper prior if π(θ)dθ < ∞; divide π
R
by integral to get a density. If π(θ)dθ = ∞
π is an improper prior density.
309
• If every risk function is continuous then ev-
ery Bayes procedure with finite Bayes risk
(for prior with density > 0 for all θ) is ad-
missible.
310
Bayesian estimation
L(d, θ) = (d − θ)2 .
Bayes risk of θ̂ is
Z
rπ = Rθ̂ (θ)π(θ)dθ
Z Z
= (θ̂(x) − θ)2 f (x; θ)π(θ)dxdθ
311
Choose θ̂ to minimize rπ ?
which is
E(θ|X)
and is called the posterior expected mean
of θ.
313
Example: estimating normal mean µ.
314
Now collect 25 measurements of the speed of
sound.
315
Alternatively: the exponent in the joint density has the form
−(1/2)[µ²/γ² − 2µψ/γ²]
plus terms not involving µ, where
1/γ² = n/σ² + 1/τ²
and
ψ/γ² = ΣX_i/σ² + ν/τ².
So: the conditional distribution of µ given the data is N(ψ, γ²).
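Not in the original notes: a sketch (assuming numpy) of the posterior just derived. The numerical values (prior mean 330, prior SD 10, σ = 1, and the simulated measurements around 331.5) are hypothetical stand-ins for the 25 speed-of-sound measurements mentioned earlier.

```python
import numpy as np

def posterior(x, sigma, nu, tau):
    # N(nu, tau^2) prior, X_i iid N(mu, sigma^2); returns (psi, gamma^2)
    n = len(x)
    prec = n / sigma**2 + 1.0 / tau**2          # 1 / gamma^2
    gamma2 = 1.0 / prec
    psi = gamma2 * (x.sum() / sigma**2 + nu / tau**2)
    return psi, gamma2

rng = np.random.default_rng(8)
x = rng.normal(loc=331.5, scale=1.0, size=25)   # hypothetical data
print(posterior(x, sigma=1.0, nu=330.0, tau=10.0))
```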
π(µ) ≡ 1.
This “density” integrates to ∞; using Bayes’
theorem to compute the posterior would give
π(µ|X) =
(2π)−n/2 σ −n exp{− (Xi − µ)2 /(2σ 2 )}
P
317
Admissibility: Bayes procedures correspond-
ing to proper priors are admissible. It follows
that for each w ∈ (0, 1) and each real ν the
estimate
wX̄ + (1 − w)ν
is admissible. That this is also true for w = 1,
that is, that X̄ is admissible is much harder to
prove.
318
Example: Given p, X has a Binomial(n, p) dis-
tribution.
cpX+α−1 (1 − p)n−X+β−1
for a suitable normalizing constant c.
319
The mean of the Beta(α, β) distribution is α/(α + β). So the Bayes estimate of p is
(X + α)/(n + α + β) = w p̂ + (1 − w) α/(α + β)
where p̂ = X/n is the usual mle and w = n/(n + α + β).
320
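Not in the original notes: a one-function sketch (plain Python) of the Beta–Binomial posterior mean above; the data values x = 7, n = 20 are illustrative, and the second call uses the minimax prior α = β = √n/2 derived below.

```python
def bayes_estimate(x, n, alpha, beta):
    # Posterior mean (x + alpha) / (n + alpha + beta) written as a weighted
    # average of the mle x/n and the prior mean alpha / (alpha + beta).
    w = n / (n + alpha + beta)
    prior_mean = alpha / (alpha + beta)
    return w * (x / n) + (1 - w) * prior_mean

print(bayes_estimate(x=7, n=20, alpha=1, beta=1))                        # uniform prior
print(bayes_estimate(x=7, n=20, alpha=20**0.5 / 2, beta=20**0.5 / 2))    # minimax prior
```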
The risk function of wp̂ + (1 − w)p0 is
Coefficient of p2 is
−w2/n + (1 − w)2
so w = n1/2/(1 + n1/2).
Coefficient of p is then
321
Working backwards: to get these values for w and p_0 we require α = β. Moreover
w²/(1 − w)² = n
gives
n/(α + β) = √n,
or α = β = √n/2. The minimax estimate of p is
[√n/(1 + √n)] p̂ + [1/(1 + √n)] (1/2).
322
Multivariate estimation: common to extend
the notion of squared error loss by defining
324
Simple hypotheses: Prior is π0 > 0 and π1 >
0 with π0 + π1 = 1.
Rφ(θ0 ) = E0(L(δ, θ0 ))
and
Rφ(θ1 ) = E1(L(δ, θ1 ))
We find
325
The Bayes risk of φ is
π0 ` 0 α + π 1 ` 1 β
We saw in the hypothesis testing section that
this is minimized by
326
Hypothesis Testing and Decision Theory
327
Simple hypotheses: Prior is π0 > 0 and π1 >
0 with π0 + π1 = 1.
Rφ(θ0 ) = E0(L(δ, θ0 ))
and
Rφ(θ1 ) = E1(L(δ, θ1 ))
We find
328
The Bayes risk of φ is
π0 ` 0 α + π 1 ` 1 β
We saw in the hypothesis testing section that
this is minimized by
329