Statistics - Lecture 7
Today, Lecture 7: A Lot of Math
1. KL divergence
2. Exponential Families
3. Bayes priors
Towards KL Divergence
• The fact that evidence keeps growing towards the truth is related to the fact that the “KL divergence” between arbitrary distributions $P_1$ and $P_0$,
$$D(P_1 \,\|\, P_0) := \mathbf{E}_{X \sim P_1}\left[\log \frac{p_1(X)}{p_0(X)}\right] = \sum_x p_1(x) \log \frac{p_1(x)}{p_0(x)},$$
satisfies $D(P_1 \,\|\, P_0) \geq 0$ with equality iff $P_1 = P_0$.
• The KL divergence is just the expected log-likelihood ratio (or ‘difference’).
• Nonnegativity of KL is known as the information inequality in information theory.
Kullback-Leibler Divergence
$$D(P_1 \,\|\, P_0) := \mathbf{E}_{X \sim P_1}\left[\log \frac{p_1(X)}{p_0(X)}\right] = \sum_x p_1(x) \log \frac{p_1(x)}{p_0(x)}$$
satisfies $D(P_1 \,\|\, P_0) \geq 0$ with equality iff $P_1 = P_0$.
Nonnegativity of KL is known as the information inequality in information theory.
Proof: Jensen’s inequality says that in general, for convex functions $f$, $\mathbf{E}[f(X)] \geq f(\mathbf{E}[X])$, where the inequality is strict if $f$ is strictly convex and $X$ is nondegenerate (i.e. $P(X \neq \mathbf{E}[X]) > 0$). Using that $-\log$ is convex, we get
$$D(P_1 \,\|\, P_0) = \mathbf{E}_{X \sim P_1}\left[-\log \frac{p_0(X)}{p_1(X)}\right] \geq -\log \mathbf{E}_{X \sim P_1}\left[\frac{p_0(X)}{p_1(X)}\right] = -\log \sum_x p_1(x)\,\frac{p_0(x)}{p_1(x)} = -\log 1 = 0$$
…with strict inequality if $P_0 \neq P_1$.
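A minimal numerical sketch (not part of the slides) illustrating the definition and the information inequality; the two distributions on a three-point sample space are arbitrary choices.

```python
import numpy as np

# Two hypothetical discrete distributions on {0, 1, 2} (values chosen arbitrarily).
p1 = np.array([0.5, 0.3, 0.2])   # plays the role of P_1
p0 = np.array([0.2, 0.3, 0.5])   # plays the role of P_0

def kl(p, q):
    """D(P || Q) = sum_x p(x) log(p(x)/q(x)): the expected log-likelihood ratio under P."""
    return np.sum(p * np.log(p / q))

print(kl(p1, p0))   # strictly positive, since P_1 != P_0
print(kl(p1, p1))   # exactly 0: the equality case of the information inequality
```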
Jensen’s Inequality
[Figure: a strictly convex function $f$ with $\mathbf{E}[f(x)]$ and $f(\mathbf{E}[x])$ marked, illustrating $\mathbf{E}[f(x)] \geq f(\mathbf{E}[x])$]
KL divergence and GRO (log-optimality)
• You have already seen in an earlier homework exercise that the GRO (growth-rate-optimal) e-variable for testing a simple $H_0 = \{P_0\}$ against a simple alternative $H_1 = \{P_1\}$ is given by the likelihood ratio (see the sketch below).
• The proof implicitly used KL divergence.
• …we will greatly extend this result in the coming weeks, to composite $H_0$ and $H_1$.
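A small sketch (not from the slides) of the simple-vs-simple case on a finite sample space, with arbitrarily chosen $P_0$ and $P_1$: the likelihood ratio $p_1/p_0$ has expectation 1 under $P_0$ (so it is a valid e-variable), and its growth rate under $P_1$ is exactly $D(P_1 \,\|\, P_0)$.

```python
import numpy as np

# Hypothetical simple null and alternative on a 3-element sample space
# (the numbers are illustrative only).
p0 = np.array([0.6, 0.3, 0.1])    # null P_0
p1 = np.array([0.2, 0.3, 0.5])    # alternative P_1
lr = p1 / p0                       # the likelihood-ratio e-variable E(x) = p1(x)/p0(x)

print(np.sum(p0 * lr))             # E_{P_0}[E] = 1  -> valid e-variable
print(np.sum(p1 * np.log(lr)))     # growth rate E_{P_1}[log E]
print(np.sum(p1 * np.log(p1/p0)))  # D(P_1 || P_0): identical to the previous line
```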
Examples
(again, different from the standard definition, but equivalent in the cases we care about)
KL divergence and Fisher Information
• If Fisher information as defined above exists, then for $\theta_0, \theta_1 \in \Theta$ which are “close” to each other, a second-order Taylor approximation gives (see the sketch below):
$$D(P_{\theta_0} \,\|\, P_{\theta_1}) \approx \tfrac{1}{2}\, I(\theta_1)\,(\theta_0 - \theta_1)^2$$
(more precisely, $D(P_{\theta_0} \,\|\, P_{\theta_1}) = \tfrac{1}{2}\, I(\theta_1)\,(\theta_0 - \theta_1)^2 + O(|\theta_0 - \theta_1|^3)$)
• [multivariate:]
$$D(P_{\theta_0} \,\|\, P_{\theta_1}) \approx \tfrac{1}{2}\,(\theta_0 - \theta_1)^\top I(\theta_1)\,(\theta_0 - \theta_1)$$
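A numerical sketch (not from the slides) of this approximation for the Bernoulli model in its mean parameterization, where $I(\theta) = 1/(\theta(1-\theta))$; the parameter values are arbitrary.

```python
import numpy as np

# Compare the exact Bernoulli KL divergence with the quadratic
# Fisher-information approximation (1/2) * I(theta_1) * (theta_0 - theta_1)^2.
def kl_bernoulli(t0, t1):
    return t0 * np.log(t0 / t1) + (1 - t0) * np.log((1 - t0) / (1 - t1))

def fisher_bernoulli(t):
    return 1.0 / (t * (1 - t))

theta1 = 0.4
for theta0 in [0.41, 0.45, 0.6]:   # increasingly far from theta1
    exact = kl_bernoulli(theta0, theta1)
    approx = 0.5 * fisher_bernoulli(theta1) * (theta0 - theta1) ** 2
    print(theta0, exact, approx)    # agreement degrades as |theta0 - theta1| grows
```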
Today, Lecture 7: A Lot of Math
1. KL divergence
2. Exponential Families
3. Bayes priors
Exponential Families
• Recall the general form: an exponential family with carrier $q$ and sufficient statistic $\phi$ has densities $p_\beta(x) = q(x)\, e^{\beta \phi(x)} / Z(\beta)$
• Example: $\mathcal{X} = \{0,1\}$; set $q(x) = 1$, $\phi(x) = x$
Theorem:
for every exponential family $\mathcal{M} = \{P_\beta : \beta \in \Theta_\beta\}$, $\mathbf{E}_{P_\beta}[\phi(X)]$ is strictly monotonically increasing as a function of $\beta$
• [in the MDL book this statement is also made precise for $k$-dim families with $k > 1$, i.e. $\phi: \mathcal{X} \to \mathbb{R}^k$, where it is not directly clear what ‘monotonic’ means]
• Intuition for proof: if $\beta$ increases, then $x$ with high $\phi(x)$ get exponentially more weight
• The theorem implies that we can identify a distribution in $\mathcal{M}$ by its mean of $\phi$ rather than by the value of $\beta$
Bernoulli
• $\mathcal{X} = \{0,1\}$; $\mathcal{P} = \{P_\mu : \mu \in (0,1)\}$, $P_\mu(X = 1) = \mu$
• $\mathbf{E}_{P_\mu}[X] = P_\mu(X=1)\cdot 1 + P_\mu(X=0)\cdot 0 = \mu$
• As $\beta$ ranges from $-\infty$ to $\infty$, $\mathbf{E}_{P_\beta}[X]$ ranges from $0$ to $1$
• Recall that $\beta = \log \frac{\mu}{1-\mu}$, the log-odds of $\mu$ (see the sketch below)
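A small sketch (not from the slides) of the Bernoulli family as an exponential family with $q(x)=1$, $\phi(x)=x$, so that $Z(\beta) = 1 + e^\beta$: the mean is the logistic transform of $\beta$, strictly increasing, and inverting it gives back $\beta = \log(\mu/(1-\mu))$.

```python
import numpy as np

def mean_from_canonical(beta):
    # E_{P_beta}[X] = e^beta / (1 + e^beta), since Z(beta) = 1 + e^beta
    return np.exp(beta) / (1 + np.exp(beta))

def canonical_from_mean(mu):
    # the log-odds: beta = log(mu / (1 - mu))
    return np.log(mu / (1 - mu))

betas = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
mus = mean_from_canonical(betas)
print(mus)                       # strictly increasing in beta, staying inside (0, 1)
print(canonical_from_mean(mus))  # recovers the original betas
```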
Gaussian Scale Family
• The standard parameterization: $p_{\sigma^2}(x) \propto e^{-x^2 / (2\sigma^2)}$
• We have (numerical check in the sketch below):
• $\mu_\beta = \frac{\partial}{\partial \beta} \log Z(\beta)$  [multivariate: $\mu_\beta = \nabla \log Z(\beta)$]
• $\mathrm{var}_{P_\beta}(\phi) = \frac{\partial^2}{\partial \beta^2} \log Z(\beta) = I(\beta)$ (Fisher information matrix at $\beta$)
[multivariate: covariance matrix = Hessian = $I(\beta)$]
• …analogous properties hold for $\beta_\mu$, with $\log Z(\beta)$ replaced by $D(\mu \,\|\, \mu_0)$: $\frac{\partial}{\partial \mu} D(\mu \,\|\, \mu_0) = \beta_\mu - \beta_{\mu_0}$
• $I(\mu) = \frac{\partial^2}{\partial \mu^2} D(\mu \,\|\, \mu_0) = \frac{1}{I(\beta_\mu)} = \frac{1}{\mathrm{var}_{P_{\beta_\mu}}(\phi)}$  [multivariate: $I(\beta_\mu) = I^{-1}(\mu) = \mathrm{cov}_{P_{\beta_\mu}}(\phi)$]
• Note the compact and overloaded notation!
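A numerical sketch (not from the slides) of the first two identities for the Bernoulli family, where $\log Z(\beta) = \log(1 + e^\beta)$, using finite differences; the value of $\beta$ is arbitrary.

```python
import numpy as np

# Check mu_beta = d/dbeta log Z(beta) and var_{P_beta}(phi) = d^2/dbeta^2 log Z(beta)
# for the Bernoulli family (q(x) = 1, phi(x) = x, Z(beta) = 1 + e^beta).
def log_Z(beta):
    return np.log(1 + np.exp(beta))

beta, h = 0.7, 1e-4
mu = np.exp(beta) / (1 + np.exp(beta))   # exact mean of phi(X) = X under P_beta
var = mu * (1 - mu)                      # exact variance of phi(X) under P_beta

d1 = (log_Z(beta + h) - log_Z(beta - h)) / (2 * h)                  # ≈ mu
d2 = (log_Z(beta + h) - 2 * log_Z(beta) + log_Z(beta - h)) / h**2   # ≈ var = I(beta)
print(mu, d1)
print(var, d2)
```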
$$n \cdot (\beta_\mu - \beta)\cdot \mathbf{E}_{P_\mu}[\phi(X)] \;-\; n \log\frac{Z(\beta_\mu)}{Z(\beta)} \;=\; n \cdot (\beta_\mu - \beta)\cdot \mu \;-\; n \log\frac{Z(\beta_\mu)}{Z(\beta)}$$
(this is $n \cdot D(\beta_\mu \,\|\, \beta)$ written out; compare it with the empirical log-likelihood ratio on the next slide)
KL Robustness Property for Exponential Families
• For all full expfams we have: if there is $\mu$ such that $\mathbf{E}_{P_\mu}[\phi(X)] = n^{-1} \sum_{i=1..n} \phi(X_i)$, then the canonical MLE satisfies $\hat\beta = \beta_\mu$ and the mean-value MLE satisfies $\hat\mu = \mu = n^{-1} \sum_{i=1..n} \phi(X_i)$
• For all $\mu \in \Theta_\mu$, $\beta \in \Theta_\beta$ with $\beta = \beta_\mu$ we then have:
$$D(\hat\mu \,\|\, \mu) = D(\hat\beta \,\|\, \beta) = \frac{1}{n} \sum_{i=1..n} \log \frac{p_{\hat\beta}(X_i)}{p_\beta(X_i)}$$
• ‘Expected log-likelihood difference is empirical average log-likelihood difference’ (see the sketch below)
• Very special property (we would perhaps expect this to hold approximately, with high probability, for large samples by the LLN, but it holds precisely, for all sample sizes.)
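A numerical sketch (not from the slides) of the robustness property for the Bernoulli family; the sample size, seed, data-generating parameter and comparison point $\mu$ are arbitrary choices.

```python
import numpy as np

# For Bernoulli data, the empirical average of phi(X) = X is the mean-value MLE
# mu_hat, and the average log-likelihood difference against any other mu equals
# D(mu_hat || mu) exactly, for any sample size.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=17)   # deliberately small, arbitrary sample
mu_hat = x.mean()                    # mean-value MLE

def kl_bernoulli(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def avg_loglik(m):
    return np.mean(x * np.log(m) + (1 - x) * np.log(1 - m))

mu = 0.6                             # any other point of the mean-value parameter space
print(avg_loglik(mu_hat) - avg_loglik(mu))   # empirical average log-likelihood difference
print(kl_bernoulli(mu_hat, mu))              # D(mu_hat || mu): identical up to rounding
```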
Locally Approximating KL by squared Euclidean distance
• Consider a 1-dimensional exponential family.
• MV parameter $\mu_j$ corresponds to CAN parameter $\beta_j$ ($j = 0, 1$)
• By a second-order Taylor approximation, we have:
$$D(P_{\mu_0} \,\|\, P_{\mu_1}) = D(P_{\beta_0} \,\|\, P_{\beta_1}) \approx \tfrac{1}{2}\, I(\beta_0)\,(\beta_0 - \beta_1)^2 \approx \tfrac{1}{2}\, I(\mu_1)\,(\mu_0 - \mu_1)^2$$
Approximating KL by squared Euclidean distance
• Consider a $k$-dimensional exponential family.
• MV parameter $\mu_j$ corresponds to CAN parameter $\beta_j$ ($j = 0, 1$)
• We have:
$$D(P_{\mu_0} \,\|\, P_{\mu_1}) = D(P_{\beta_0} \,\|\, P_{\beta_1}) \approx \tfrac{1}{2}\,(\beta_0 - \beta_1)^\top I(\beta_0)\,(\beta_0 - \beta_1) \approx \tfrac{1}{2}\,(\mu_0 - \mu_1)^\top I(\mu_1)\,(\mu_0 - \mu_1)$$
Jeffreys Prior
• Examples
• Bernoulli (see the sketch below)
• Gaussian location, fixed $\sigma^2$, model truncated to $[\epsilon, 1-\epsilon]$: $I(\mu)$ constant, so Jeffreys prior uniform
• Untruncated Gaussian location: Jeffreys prior improper!
• Multidimensional case: replace $I(\theta)$ by $\det I(\theta)$
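A small sketch (not from the slides) spelling out the Bernoulli example: taking the Jeffreys prior proportional to $\sqrt{I(\theta)}$ with $I(\mu) = 1/(\mu(1-\mu))$ gives $w(\mu) \propto 1/\sqrt{\mu(1-\mu)}$, i.e. a Beta(1/2, 1/2) density.

```python
import numpy as np
from scipy.stats import beta

# Jeffreys prior for the Bernoulli model in the mean-value parameterization:
# w(mu) ∝ sqrt(I(mu)) = 1/sqrt(mu*(1-mu)); normalized, this is Beta(1/2, 1/2).
mu = np.linspace(0.01, 0.99, 99)
unnormalized = 1.0 / np.sqrt(mu * (1 - mu))
ratio = unnormalized / beta.pdf(mu, 0.5, 0.5)
print(ratio.min(), ratio.max(), np.pi)   # constant ratio ≈ pi, the normalizing constant
```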
Parameterization (In)Dependence
(more correct notation would be $P_\mu^{[\mathrm{MEAN}]}$, $P_\beta^{[\mathrm{CAN}]}$, $I^{\mathrm{MEAN}}(\mu)$, $I^{\mathrm{CAN}}(\beta)$)
Improper Priors
• Example: Jeffreys’ prior for normal location family, 𝑤 𝜇 ≡ 1 (or any other
constant)
$$w(\mu \mid X^n) := \frac{w(\mu)\, p_\mu(X^n)}{\int w(\mu)\, p_\mu(X^n)\, d\mu} = \frac{e^{-\sum_{i=1..n}(X_i - \mu)^2 / 2\sigma^2}}{\int e^{-\sum_{i=1..n}(X_i - \mu)^2 / 2\sigma^2}\, d\mu} \;\propto\; e^{-\frac{n}{2\sigma^2}(\mu - \bar X_n)^2}$$
• …so although the prior is improper, the posterior is a proper (normal) distribution with mean $\bar X_n$ and variance $\sigma^2 / n$ (see the sketch below).
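A numerical sketch (not from the slides) of this point: normalizing the flat-prior posterior on a grid works fine, and it matches a normal density with mean $\bar X_n$ and variance $\sigma^2/n$; the data-generating values and the grid are arbitrary choices.

```python
import numpy as np

# Improper flat prior w(mu) ≡ 1, normal likelihood with known sigma:
# the posterior ∝ likelihood, and it normalizes to a proper density.
rng = np.random.default_rng(1)
sigma, n = 2.0, 10
x = rng.normal(1.5, sigma, size=n)

mu_grid = np.linspace(-10, 10, 4001)
log_lik = np.array([-np.sum((x - m) ** 2) / (2 * sigma**2) for m in mu_grid])
post = np.exp(log_lik - log_lik.max())        # flat prior: posterior ∝ likelihood
post /= np.trapz(post, mu_grid)               # proper despite the improper prior

print(np.trapz(mu_grid * post, mu_grid), x.mean())                       # mean ≈ sample mean
print(np.trapz((mu_grid - x.mean())**2 * post, mu_grid), sigma**2 / n)   # variance ≈ sigma^2/n
```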