
Statistics - Lecture 7

The document discusses foundational concepts in statistics and machine learning, focusing on KL divergence, Fisher information, and exponential families. It explains the significance of KL divergence in measuring the difference between probability distributions and its role in statistical theory. Additionally, it covers the properties of exponential families and sufficient statistics, highlighting their importance in simplifying statistical inference.


Foundations of Statistics and Machine Learning:

testing and uncertainty quantification with e-values


(and their link to likelihood, betting)

Today, Lecture 7: A Lot of Math

1. KL divergence, Fisher Information

2. Exponential Families

3. Bayes priors

Towards KL Divergence

Suppose that, for $j \in \{A, B\}$, $H_j$ says: $X_1, X_2, \ldots$ i.i.d. $\sim P_j$, where $P_A \neq P_B$.

Then with $P_A$-probability 1,
$$\frac{p_A(X^n)}{p_B(X^n)} \to \infty$$
…and even more strongly, there is a constant $c > 0$ such that, for all large $n$, with $P_A$-probability 1,
$$\frac{p_A(X^n)}{p_B(X^n)} \geq e^{cn}$$
(and analogously with $A$ and $B$ interchanged)

To analyze the size of the constant $c$, we need to introduce the KL divergence.


Kullback-Leibler Divergence

• The fact that evidence keeps growing towards the truth is related to the fact that the “KL divergence” between arbitrary distributions $P_A$ and $P_B$,
$$D(P_B \,\|\, P_A) := \mathbf{E}_{X \sim P_B}\left[\log \frac{p_B(X)}{p_A(X)}\right] = \sum_x p_B(x) \log \frac{p_B(x)}{p_A(x)},$$
satisfies $D(P_B \,\|\, P_A) \geq 0$ with equality iff $P_B = P_A$.
The KL divergence is just the expected log-likelihood ratio (or ‘difference’).

• KL divergence plays a central role in statistics in general, and a super-central role in e-value/anytime-valid theory!
Kullback-Leibler Divergence

$$D(P_B \,\|\, P_A) := \mathbf{E}_{X \sim P_B}\left[\log \frac{p_B(X)}{p_A(X)}\right] = \sum_x p_B(x) \log \frac{p_B(x)}{p_A(x)}$$
satisfies $D(P_B \,\|\, P_A) \geq 0$ with equality iff $P_B = P_A$.
Nonnegativity of KL is known as the information inequality in information theory.
Proof: Jensen’s inequality says that in general, for convex functions, $\mathbf{E}[f(X)] \geq f(\mathbf{E}[X])$, where the inequality is strict if $f$ is strictly convex and $X$ is nondegenerate (i.e. $P(X \neq \mathbf{E}[X]) > 0$). Using that $-\log$ is convex, we get
$$D(P_B \,\|\, P_A) = \mathbf{E}_{X \sim P_B}\left[-\log \frac{p_A(X)}{p_B(X)}\right] \geq -\log \mathbf{E}_{X \sim P_B}\left[\frac{p_A(X)}{p_B(X)}\right] = -\log 1 = 0$$
…with strict inequality if $P_A \neq P_B$
Jensen’s Inequality

[Figure: graph of a convex function $f$, illustrating $\mathbf{E}[f(X)] \geq f(\mathbf{E}[X])$]
KL divergence and GRO (log-optimality)

• You have already seen in an earlier homework exercise that the GRO (growth-rate-optimal) e-variable for testing a simple $H_0 = \{P_0\}$ against a simple alternative $H_1 = \{P_1\}$ is given by the likelihood ratio.
• The proof implicitly used KL divergence.

• …we will greatly extend this result in the coming weeks, to composite $H_0$ and $H_1$.
Examples

• Two Gaussians with variance $\sigma^2$ (here $\sigma^2 = 1$) and means $\mu_1, \mu_2$:
$$D(P_{\mu_1} \| P_{\mu_2}) = \frac{1}{2\sigma^2}(\mu_1 - \mu_2)^2$$
• Two Bernoulli distributions with success probability (= mean) $\mu_1, \mu_2$: a 2nd-order Taylor approximation gives
$$D(P_{\mu_1} \| P_{\mu_2}) = \mu_1 \log \frac{\mu_1}{\mu_2} + (1 - \mu_1) \log \frac{1 - \mu_1}{1 - \mu_2} \approx \frac{1}{2\sigma^2_{\mu_2}}(\mu_1 - \mu_2)^2$$
where $\sigma^2_\mu = \mu(1 - \mu)$ is the variance of $P_\mu$

• only approximately true: higher-order terms are missing in the equality
• note the asymmetry, in particular the possibility $D(P_{\mu_1} \| P_{\mu_2}) = \infty$ while $D(P_{\mu_2} \| P_{\mu_1}) = \log 2$ (e.g. $\mu_1 = 1/2$, $\mu_2 = 1$)
• yet also note there is “local” symmetry
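The Bernoulli computation above is easy to verify numerically. A minimal sketch (illustrative, not from the slides; the helper `kl_bernoulli` is our name):

```python
import math

def kl_bernoulli(mu1, mu2):
    """D(P_mu1 || P_mu2) between two Bernoulli distributions."""
    return (mu1 * math.log(mu1 / mu2)
            + (1 - mu1) * math.log((1 - mu1) / (1 - mu2)))

mu1, mu2 = 0.5, 0.52
exact = kl_bernoulli(mu1, mu2)
# second-order Taylor approximation: (mu1 - mu2)^2 / (2 * sigma^2_{mu2})
approx = (mu1 - mu2) ** 2 / (2 * mu2 * (1 - mu2))
# the asymmetry: D(mu1||mu2) and D(mu2||mu1) differ in general
asym_1, asym_2 = kl_bernoulli(0.9, 0.5), kl_bernoulli(0.5, 0.9)
```

For nearby means the quadratic approximation is already accurate to several digits; it degrades as $|\mu_1 - \mu_2|$ grows.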
Fisher Information

• Let $\mathcal{M} = \{P_\theta : \theta \in \Theta\}$ with $\Theta$ an open interval in $\mathbb{R}$ represent a statistical model, such that for all $\theta_0 \in \Theta$,
$$I(\theta_0) := \frac{d^2}{d\theta^2} D(P_{\theta_0} \,\|\, P_\theta) \Big|_{\theta = \theta_0}$$
is well-defined.

…then we call $I(\theta_0)$ the Fisher information at parameter $\theta_0$

(this definition is different from the one usually given, but if the model is an exponential family as defined next, and the parameterization is such that there is a smooth and strictly monotonic transformation from the $\Theta$-parameterization to the canonical parameterization, then the definition above turns out to be equivalent to the traditional one)
Fisher Information

• Let $\mathcal{M} = \{P_\theta : \theta \in \Theta\}$ with $\Theta$ an open connected set in $\mathbb{R}^k$ represent a statistical model, such that for all $\theta_0 \in \Theta$ and all $i, j \in \{1, \ldots, k\}$:
$$I_{ij}(\theta_0) := \frac{\partial^2}{\partial\theta_i \partial\theta_j} D(P_{\theta_0} \,\|\, P_\theta) \Big|_{\theta = \theta_0}$$
is well-defined.

…then we call $I(\theta_0)$ the ($k \times k$) Fisher information matrix at parameter $\theta_0$

again, different from the standard definition but equivalent in the cases we care about
KL divergence and Fisher Information

• If the Fisher information as defined above exists, then for $\theta_0, \theta_1 \in \Theta$ which are “close” to each other, a second-order Taylor approximation gives:
$$D(P_{\theta_0} \,\|\, P_{\theta_1}) \approx \frac{1}{2} I(\theta_1) (\theta_0 - \theta_1)^2$$
(more precisely, $D(P_{\theta_0} \,\|\, P_{\theta_1}) = \frac{1}{2} I(\theta_1)(\theta_0 - \theta_1)^2 + O(|\theta_0 - \theta_1|^3)$)

Bernoulli and Gauss are both special cases of this.
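The definition of Fisher information as the second derivative of the KL divergence can be checked numerically; a sketch (not from the slides) for the Bernoulli model, where the closed form $I(\mu) = 1/(\mu(1-\mu))$ is known:

```python
import math

def kl_bernoulli(mu1, mu2):
    """D(P_mu1 || P_mu2) for two Bernoulli distributions."""
    return (mu1 * math.log(mu1 / mu2)
            + (1 - mu1) * math.log((1 - mu1) / (1 - mu2)))

mu0 = 0.3
fisher = 1.0 / (mu0 * (1 - mu0))  # known closed form for Bernoulli

# numeric second derivative of theta -> D(P_mu0 || P_theta) at theta = mu0
h = 1e-4
num_fisher = (kl_bernoulli(mu0, mu0 + h) - 2 * kl_bernoulli(mu0, mu0)
              + kl_bernoulli(mu0, mu0 - h)) / h**2
```

The central difference recovers $I(\mu_0)$ to several digits, matching the slide's definition of $I(\theta_0)$ as the curvature of the KL divergence at $\theta_0$.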
KL divergence and Fisher Information

• Multivariate case: if the Fisher information as defined above exists, then for $\theta_0, \theta_1 \in \Theta$ which are “close” to each other, a second-order Taylor approximation gives:
$$D(P_{\theta_0} \,\|\, P_{\theta_1}) \approx \frac{1}{2} (\theta_0 - \theta_1)^\top I(\theta_1) (\theta_0 - \theta_1)$$
Today, Lecture 7: A Lot of Math

1. KL divergence

2. Exponential Families

3. Bayes priors

Exponential Families

Let $\phi: \mathcal{X} \to \mathbb{R}^k$ be a function and $q: \mathcal{X} \to \mathbb{R}_{\geq 0}$ a nonnegative function.

A family of distributions $\{P_\beta : \beta \in B\}$ on $\mathcal{X}$, with $B \subseteq \mathbb{R}^k$, is called an exponential family relative to sufficient statistic $\phi$ and carrier $q$ if for all $\beta \in B$:
$$p_\beta(x) = \frac{1}{Z(\beta)} \, e^{\beta^\top \phi(x)} \, q(x), \qquad Z(\beta) = \sum_{x \in \mathcal{X}} e^{\beta^\top \phi(x)} \, q(x)$$
(sum replaced by an integral if $\mathcal{X} = \mathbb{R}^d$)

If we do not specify $B$ explicitly, we refer to the ‘full’ family with
$$B = \{\beta \in \mathbb{R}^k : Z(\beta) < \infty\}$$
Exponential Families

• Many popular models are exponential families: Bernoulli, multinomial, normal (with fixed mean, with fixed variance, and full), exponential, Gamma, Poisson, Pareto, Zipf, Beta…: all exp families
• …also: Markov (need to extend the definition to cover this)
• Gaussian (and other) mixtures do not form an exponential family!
Bernoulli Example

• $\mathcal{X} = \{0, 1\}$; set $q(x) = 1$, $\phi(x) = x$

Then
$$p_\beta(x) = \frac{e^{\beta x}}{\sum_{x' \in \{0,1\}} e^{\beta x'}}, \qquad \text{so} \quad p_\beta(1) = \frac{e^\beta}{1 + e^\beta}, \quad p_\beta(0) = \frac{1}{1 + e^\beta}$$

Why is this the Bernoulli model?

Fix $p \in (0, 1)$ and set $\beta = \log \frac{p}{1 - p}$. Then $p_\beta(X = 1) = p$.
• Note: the boundary points $p \in \{0, 1\}$ are not included (the definition can be extended so that they are)
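A small sanity check of this correspondence between the canonical parameter and the success probability (a sketch; the function names are ours):

```python
import math

def p_beta(x, beta):
    """Exponential family on X = {0,1} with q(x) = 1, phi(x) = x."""
    Z = sum(math.exp(beta * y) for y in (0, 1))  # Z(beta) = 1 + e^beta
    return math.exp(beta * x) / Z

p = 0.7
beta = math.log(p / (1 - p))  # canonical parameter giving success prob p
```

So the canonical parameter is just the log-odds (‘logit’) of the success probability.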
Gaussian Example

• Gaussian scale family $\{p_{\sigma^2} : \sigma^2 > 0\}$ with fixed mean (say 0):
• $\phi(x) = x^2$, $q(x) = 1$, $\beta = -\frac{1}{2\sigma^2}$
• Then $e^{\beta \phi(x)} = e^{-\frac{x^2}{2\sigma^2}} \propto p_{\sigma^2}(x)$
…so we immediately see that $Z(\beta) = \sqrt{2\pi}\,\sigma$
• Gaussian mean family: now you need to take $q(x)$ nonconstant!
Parameterizations

• For general exponential families, we call $\beta \mapsto p_\beta$ the canonical parameterization.
• If $\phi(X) = X$ (as in the Bernoulli model), we also call it the natural parameterization (and we call the family a natural exponential family)
Sufficient Statistics!

• Why are exponential families easy to work with? Because, if we extend them to $n$ outcomes by assuming independence (take product distributions), the sum of the $\phi(X_i)$ is a sufficient statistic and hence remains $k$-dimensional, no matter how big $n$ is
• …(with some caveats) they are the only models with this property (Pitman-Koopman-Darmois)

• “A sufficient statistic of a sample relative to a model summarizes all information in the sample that is important to make inferences relative to the model”
Sufficient Statistics!

• Sample size $n$: exponential families are extended by taking product distributions:
$$p_\beta(x^n) = \frac{1}{Z(\beta)^n} \, e^{\beta \sum_{i=1..n} \phi(x_i)} \prod_i q(x_i)$$
$$\max_\beta \log p_\beta(x^n) = \max_\beta \Big( \beta \sum_i \phi(x_i) - n \log Z(\beta) + \sum_i \log q(x_i) \Big)$$

• To determine this, you only need to know the sum (equivalently, the average) of $\phi$!
Sufficient Statistics!

$$\max_\beta \log p_\beta(x^n) = \max_\beta \Big( \beta \sum_i \phi(x_i) - n \log Z(\beta) + \sum_i \log q(x_i) \Big)$$
• To determine this, you only need to know the sum (equivalently, the average) of $\phi$!
• Bernoulli/binomial: need the number of 1s <no other details>
• Full normal distribution: need the average and the empirical variance <no other details>
• Poisson: only need the average
VERY easy to do statistics with: the underlying reason why they are used so often. Not necessarily that they are good models of reality!
E.g. mixture models do not have finite-dimensional sufficient statistics.
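The sufficiency claim can be illustrated directly: two Bernoulli samples with the same number of 1s have identical log-likelihood as a function of $\beta$, whatever the ordering. A sketch using the canonical form (names ours):

```python
import math

def loglik(beta, xs):
    """Bernoulli log-likelihood in canonical form:
    beta * sum(x_i) - n * log Z(beta), with Z(beta) = 1 + e^beta."""
    return beta * sum(xs) - len(xs) * math.log(1 + math.exp(beta))

# same sufficient statistic (two 1s in five outcomes), different orderings
xs1 = [1, 1, 0, 0, 0]
xs2 = [0, 0, 1, 0, 1]
```

Since the likelihood depends on the data only through `sum(xs)`, the two samples are indistinguishable for inference within the model.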
Mean-Value Parameterization

Theorem:
for every 1-dimensional exponential family $\mathcal{M} = \{P_\beta : \beta \in \Theta_\beta\}$, $\mathbf{E}_{P_\beta}[\phi(X)]$ is strictly monotonically increasing as a function of $\beta$
• [in the MDL book this statement is also made precise for $k$-dimensional families with $k > 1$, i.e. $\phi: \mathcal{X} \to \mathbb{R}^k$, where it is not directly clear what ‘monotonic’ means]
• Intuition for proof: if $\beta$ increases, then $x$ with high $\phi(x)$ get exponentially more weight
• The theorem implies that we can identify a distribution in $\mathcal{M}$ by its mean of $\phi$ rather than by the value of $\beta$
Bernoulli

• $\mathcal{X} = \{0, 1\}$; $\mathcal{P} = \{P_\mu : \mu \in (0, 1)\}$, $P_\mu(X = 1) = \mu$
• $\mathbf{E}_{P_\mu}[X] = P_\mu(X = 1) \cdot 1 + P_\mu(X = 0) \cdot 0 = \mu$
• As $\beta$ ranges from $-\infty$ to $\infty$, $\mathbf{E}_{P_\beta}[X]$ ranges from 0 to 1
• Recall that $\beta = \log \frac{\mu}{1 - \mu}$
Gaussian Scale Family
The standard parameterization $p_{\sigma^2}(x) \propto e^{-\frac{x^2}{2\sigma^2}}$
is the mean-value parameterization, since $\sigma^2 = \mathbf{E}_{P_{\sigma^2}}[X^2]$ is the mean of $\phi(x) = x^2$ (not the ‘variance’ parameterization)

$\beta = -\frac{1}{2\sigma^2}$: a completely different link function than for Bernoulli
• Link functions are always strictly increasing, but they vary wildly from family to family
Mean-Value Parameterization

We can also identify a distribution in $\mathcal{M}$ by its mean of $\phi$ rather than by the value of $\beta$. Thus we can always re-parameterize
$\mathcal{M} = \{P_\beta : \beta \in \Theta_\beta\}$ as $\mathcal{M} = \{P_\mu : \mu \in \Theta_\mu\}$ with $\mu_\beta := \mathbf{E}_{P_\beta}[\phi(X)]$
• $\beta$: natural or canonical parameterization
• $\mu$: mean-value parameterization
• $\beta_\mu$: inverse of $\mu_\beta$
• Bernoulli: $\beta_\mu = \log \frac{\mu}{1 - \mu}$; Exponential: $\beta_\mu = -1/\mu$
• Normal with mean 0, varying $\sigma^2 = \mathbf{E}[X^2]$, the mean(!)-value parameter: $\beta_{\sigma^2} = -1/(2\sigma^2)$
Maximum Likelihood
in Mean-Value Parameterization
• We already informally claimed that to calculate the MLE $\hat\beta$, we only need the sum (equivalently, the average) of the sufficient statistic
• Indeed, for all full exponential families we have:
• If there is $\mu$ such that $\mathbf{E}_{P_\mu}[\phi(X)] = n^{-1} \sum_{i=1..n} \phi(X_i)$, then the canonical MLE satisfies $\hat\beta = \beta_\mu$ and the mean-value MLE satisfies $\hat\mu = \mu = n^{-1} \sum_{i=1..n} \phi(X_i)$
• We of course already know this for Bernoulli (MLE is the empirical frequency of ones) and Gaussian location
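For instance, in the Poisson family ($\phi(x) = x$, mean-value parameter $\lambda$), the mean-value MLE is just the sample average. A small grid check (a sketch, not from the slides; names ours):

```python
import math

xs = [3, 1, 4, 1, 5, 9, 2, 6]
mle_mu = sum(xs) / len(xs)  # mean-value MLE = empirical average of phi(x) = x

def loglik(lam, xs):
    """Poisson log-likelihood; the log(x!) terms are constant in lam
    and therefore omitted."""
    return sum(x * math.log(lam) - lam for x in xs)

# grid check: no nearby lambda beats the empirical mean
grid = [mle_mu + d for d in (-0.5, -0.1, 0.1, 0.5)]
best_is_mean = all(loglik(mle_mu, xs) > loglik(l, xs) for l in grid)
```

The Poisson log-likelihood is strictly concave in $\lambda$, so the maximum at the empirical mean is global, in line with the mean-matching statement above.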
Nice Properties (“duality”)

• We have:
• $\mu_\beta = \frac{d}{d\beta} \log Z(\beta)$ [multivariate: $\mu_\beta = \nabla \log Z(\beta)$]
• $\mathrm{var}_{P_\beta}(\phi) = \frac{d^2}{d\beta^2} \log Z(\beta) = I(\beta)$ (Fisher information at $\beta$)
[multivariate: covariance matrix = Hessian = $I(\beta)$]
Note the compact and overloaded notation!
• …analogous properties hold for $\beta_\mu$ (one has $\frac{d}{d\mu} D(\mu \,\|\, \mu_0) = \beta_\mu - \beta_{\mu_0}$):
• $I(\mu) = \frac{d^2}{d\mu^2} D(\mu \,\|\, \mu_0) = \frac{1}{I(\beta_\mu)} = \frac{1}{\mathrm{var}_{P_{\beta_\mu}}(\phi)}$ [multivariate: $I(\beta_\mu) = I^{-1}(\mu) = \mathrm{cov}_{P_{\beta_\mu}}(\phi)$]

(i.e. with $f(\mu) = D(\mu \,\|\, \mu_0)$, $I(\mu) = f''(\mu)$)


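These identities are easy to check numerically for the Bernoulli family, where $\log Z(\beta) = \log(1 + e^\beta)$ (a sketch; finite differences stand in for the derivatives):

```python
import math

def logZ(beta):
    """log-partition function of the Bernoulli family."""
    return math.log(1 + math.exp(beta))

beta, h = 0.4, 1e-5
mu = math.exp(beta) / (1 + math.exp(beta))   # mean of phi(x) = x under P_beta
var = mu * (1 - mu)                          # variance of phi = I(beta)

# numeric first and second derivatives of log Z at beta
d1 = (logZ(beta + h) - logZ(beta - h)) / (2 * h)
d2 = (logZ(beta + h) - 2 * logZ(beta) + logZ(beta - h)) / h**2
```

The first derivative recovers $\mu_\beta$ and the second recovers $\mathrm{var}_{P_\beta}(\phi) = I(\beta)$, as the duality slide states.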
ML is empirical mean!
" 2∑
𝑝= 𝑥+ = A = " 𝑒 = (3$.." @('( ) ∏ 𝑞(𝑥 )
8

max log 𝑝= 𝑥 + = max (𝛽∑𝜙 𝑥8 − 𝑛 log 𝑍 𝛽 + ∑log 𝑞 𝑥8 )


= =
6 6
C 6*
A = 6*
∫ > *7 , L / C/
• log 𝑍 𝛽 = = =
C= A = A(=)
∫ @(/)> *7 8 L / C/
A(=)
= 𝐸.* 𝜙 𝑋 = 𝜇=
C
• C=
log 𝑝= 𝑥 + = 𝑛(𝑛2"∑𝜙 𝑋8 − 𝜇= )
• Derivative is 0 and maximum obtained if 𝑛2"∑𝜙 𝑋8 = 𝜇=
KL Robustness Property for
Exponential Families
Write $\hat\mu = n^{-1} \sum_i \phi(x_i)$ and $\hat\beta = \beta_{\hat\mu}$. Then
$$\log p_\beta(x^n) = \beta n \hat\mu - n \log Z(\beta) + \sum_i \log q(x_i), \qquad \log p_{\hat\beta}(x^n) = n \big( \beta_{\hat\mu} \hat\mu - \log Z(\beta_{\hat\mu}) \big) + \sum_i \log q(x_i)$$
So
$$\log \frac{p_{\hat\beta}(x^n)}{p_\beta(x^n)} = n (\beta_{\hat\mu} - \beta) \hat\mu - n \log \frac{Z(\beta_{\hat\mu})}{Z(\beta)}$$
but also
$$n \cdot D(\beta_{\hat\mu} \,\|\, \beta) = n \cdot \mathbf{E}_{P_{\beta_{\hat\mu}}}\!\left[\log \frac{p_{\beta_{\hat\mu}}(X)}{p_\beta(X)}\right] = n (\beta_{\hat\mu} - \beta) \cdot \mathbf{E}_{P_{\beta_{\hat\mu}}}[\phi(X)] - n \log \frac{Z(\beta_{\hat\mu})}{Z(\beta)} = n (\beta_{\hat\mu} - \beta) \cdot \hat\mu - n \log \frac{Z(\beta_{\hat\mu})}{Z(\beta)}$$
…since $\mathbf{E}_{P_{\beta_{\hat\mu}}}[\phi(X)] = \hat\mu$: the two expressions coincide.
KL Robustness Property for Exponential
Families
• For all full exponential families we have: if there is $\mu$ such that $\mathbf{E}_{P_\mu}[\phi(X)] = n^{-1} \sum_{i=1..n} \phi(X_i)$, then the canonical MLE satisfies $\hat\beta = \beta_\mu$ and the mean-value MLE satisfies $\hat\mu = \mu = n^{-1} \sum_{i=1..n} \phi(X_i)$
• For all $\mu \in \Theta_\mu$, $\beta \in \Theta_\beta$ with $\beta = \beta_\mu$, we then have:
$$D(\hat\mu \,\|\, \mu) = D(\hat\beta \,\|\, \beta) = \frac{1}{n} \sum_{i=1..n} \log \frac{p_{\hat\beta}(X_i)}{p_\beta(X_i)}$$
• ‘Expected log-likelihood difference is empirical average log-likelihood difference’
• A very special property (we would perhaps expect this to hold approximately, with high probability, for large samples, by the LLN; but it holds precisely, for all sample sizes.)
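For Bernoulli this exactness is easy to confirm (a sketch; `kl_bernoulli` is our helper): the empirical average log-likelihood ratio between the MLE and any other parameter equals the KL divergence exactly, not just asymptotically.

```python
import math

def kl_bernoulli(mu1, mu2):
    """D(P_mu1 || P_mu2) for two Bernoulli distributions."""
    return (mu1 * math.log(mu1 / mu2)
            + (1 - mu1) * math.log((1 - mu1) / (1 - mu2)))

xs = [1, 1, 1, 0, 0, 0, 0, 0]     # small sample; mu_hat = 3/8
mu_hat = sum(xs) / len(xs)
mu = 0.6                          # any other mean-value parameter

def logp(x, m):
    """Bernoulli log-density with success probability m."""
    return math.log(m if x == 1 else 1 - m)

# empirical average log-likelihood ratio (MLE vs mu)
avg_llr = sum(logp(x, mu_hat) - logp(x, mu) for x in xs) / len(xs)
```

The identity holds for this $n = 8$ just as it would for any other sample size, which is the robustness property the slide describes.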
Locally Approximating KL by squared
Euclidean distance
• Consider a 1-dimensional exponential family.
• MV parameter $\mu_j$ corresponds to CAN parameter $\beta_j$
• By a second-order Taylor approximation, we have:
$$D(P_{\mu_0} \,\|\, P_{\mu_1}) = D(P_{\beta_0} \,\|\, P_{\beta_1}) \approx \frac{1}{2} I(\beta_0) (\beta_0 - \beta_1)^2 \approx \frac{1}{2} I(\mu_1) (\mu_0 - \mu_1)^2$$
Approximating KL by squared Euclidean
distance
• Consider a $k$-dimensional exponential family.
• MV parameter $\mu_j$ corresponds to CAN parameter $\beta_j$
• We have:
$$D(P_{\mu_0} \,\|\, P_{\mu_1}) = D(P_{\beta_0} \,\|\, P_{\beta_1}) \approx \frac{1}{2} (\beta_0 - \beta_1)^\top I(\beta_0) (\beta_0 - \beta_1) \approx \frac{1}{2} (\mu_0 - \mu_1)^\top I(\mu_1) (\mu_0 - \mu_1)$$
Jeffreys Prior

• Suppose we want to use a Bayesian method (e.g. to predict the probability that the next outcome in a stream of 0s and 1s is a 1) and we do not have clear prior knowledge
• Bayes and Laplace thought we should use the uniform prior on $\mu$
• …we already criticized this earlier as being somewhat arbitrary: it highly depends on the parameterization; a prior that has uniform density in one parameterization can be highly non-uniform in another

• So what prior should we use instead?
Jeffreys Prior

• Consider a 1-dimensional model $\{p_\theta : \theta \in \Theta\}$
• Jeffreys’ prior is defined as
$$w(\theta) := \frac{\sqrt{I(\theta)}}{\int_\Theta \sqrt{I(\theta)} \, d\theta}$$
• Examples
• Bernoulli: $I(\mu) = \frac{1}{\mu(1 - \mu)}$, so $w(\mu) \propto \mu^{-1/2}(1 - \mu)^{-1/2}$, the Beta(1/2, 1/2) density
• Gaussian location, fixed $\sigma^2$, model truncated to $[\epsilon, 1 - \epsilon]$: $I(\mu)$ is constant, so the Jeffreys prior is uniform
• Untruncated Gaussian location: Jeffreys prior improper!
• Multidimensional case: replace $I(\theta)$ by $\det I(\theta)$
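For Bernoulli, the unnormalized Jeffreys density $\sqrt{I(\mu)} = 1/\sqrt{\mu(1-\mu)}$ integrates to $\pi$ over $(0,1)$ (the Beta(1/2, 1/2) normalization). A numeric sketch with a midpoint rule (names ours; the crude quadrature is only approximate near the endpoint singularities):

```python
import math

def jeffreys_unnorm(mu):
    """sqrt of the Bernoulli Fisher information I(mu) = 1/(mu(1-mu))."""
    return 1.0 / math.sqrt(mu * (1 - mu))

# midpoint-rule approximation of the normalizer over (0, 1); true value: pi
N = 20000
Z = sum(jeffreys_unnorm((i + 0.5) / N) / N for i in range(N))
```

The integrable singularities at 0 and 1 make the midpoint rule slightly undershoot, but it still lands close to $\pi$.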
Parameterization (In)Dependence

For general models (not just exponential families):
$D(P_{\theta^*} \,\|\, P_\theta)$ is parameterization independent
($D$ is a divergence between probability distributions, not parameters)
This immediately implies that $I(\theta)$ is parameterization dependent, since
$$D(P_\theta \,\|\, P_{\theta^*}) \approx \tfrac{1}{2} I(\theta^*) (\theta - \theta^*)^2$$

Indeed, we just saw that if $P_\mu = P_\beta$ then $I(\mu) = 1/I(\beta)$

(more correct notation would be $P_{\mu[\text{mean}]}$, $P_{\beta[\text{can}]}$, $I^{\text{mean}}(\mu)$, $I^{\text{can}}(\beta)$)
Parameterization (In)Dependence

$I(\theta)$ is parameterization dependent:

indeed, we just saw that if $P_\mu = P_\beta$ then $I(\mu) = 1/I(\beta)$
(more correct notation would be $P_{\mu[\text{mean}]}$, $P_{\beta[\text{can}]}$, $I^{\text{mean}}(\mu)$, $I^{\text{can}}(\beta)$)

YET Jeffreys’ prior is parameterization independent:

under weak regularity conditions, if $p_{\theta,[I]} = p_{f(\theta),[II]}$ for smooth 1-to-1 $f$, then
$$\int_{\theta \in [a,b]} w_{J,[I]}(\theta) \, d\theta = \frac{\int_{\theta \in [a,b]} \sqrt{I_{[I]}(\theta)} \, d\theta}{\int_{\theta \in \Theta} \sqrt{I_{[I]}(\theta)} \, d\theta} = \int_{\gamma \in [f(a), f(b)]} w_{J,[II]}(\gamma) \, d\gamma$$
Improper Priors

• An improper prior on $\Theta$ is a density $w: \Theta \to \mathbb{R}_{\geq 0}$ such that $\int w(\theta) \, d\theta = \infty$
• In some cases these are miraculously well-behaved, in that the posterior becomes proper after a small number of observations!
• can then use them just like standard priors
• But Bayes’ theorem is now “an algorithm rather than a mathematical theorem” (important that you understand this sentence!)
Improper Priors

• An improper prior on $\Theta$ is a density $w: \Theta \to \mathbb{R}_{\geq 0}$ such that $\int w(\theta) \, d\theta = \infty$
• In some cases these are miraculously well-behaved, in that the posterior becomes proper after a small number of observations!
• can then use them just like standard priors
• Example: Jeffreys’ prior for the normal location family, $w(\mu) \equiv 1$ (or any other constant)
Improper Priors

• Example: Jeffreys’ prior for the normal location family, $w(\mu) \equiv 1$ (or any other constant)
$$w(\mu \mid X_1) := \frac{p_\mu(X_1) \, w(\mu)}{\int p_\mu(X_1) \, w(\mu) \, d\mu} = \frac{e^{-\frac{(X_1 - \mu)^2}{2\sigma^2}}}{\int e^{-\frac{(X_1 - \mu)^2}{2\sigma^2}} \, d\mu} \propto e^{-\frac{(\mu - X_1)^2}{2\sigma^2}}$$

…is a normal density on $\mu$ with mean $X_1$ and variance $\sigma^2$

…we really use the first outcome as ‘start-up’ to determine the prior for the next outcomes!
…with this improper Jeffreys’ prior, the 95% credible interval of the posterior after $X^n$ is exactly the 95% confidence interval!
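A numeric sketch of the flat-prior posterior for the normal location family (names ours; finite differences read off the mode and curvature): with $w(\mu) \equiv 1$, the log-posterior is an exact quadratic with mode at the sample mean and curvature $-n/\sigma^2$, i.e. posterior variance $\sigma^2/n$.

```python
import math

xs = [1.2, -0.3, 0.8, 2.1]   # toy sample, known variance
sigma2 = 1.0

def log_post_unnorm(mu):
    """Log posterior up to a constant: flat improper prior means the
    posterior density is proportional to the likelihood."""
    return -sum((x - mu) ** 2 for x in xs) / (2 * sigma2)

xbar = sum(xs) / len(xs)
h = 1e-4
# first derivative at the sample mean (should vanish: xbar is the mode)
d1 = (log_post_unnorm(xbar + h) - log_post_unnorm(xbar - h)) / (2 * h)
# curvature gives the posterior variance sigma^2 / n
d2 = (log_post_unnorm(xbar + h) - 2 * log_post_unnorm(xbar)
      + log_post_unnorm(xbar - h)) / h**2
post_var = -1.0 / d2
```

This is the $n$-observation version of the slide's $X_1$ computation: the posterior is $N(\bar{X}, \sigma^2/n)$, whose 95% credible interval coincides with the classical confidence interval.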
Jeffreys is the New Uniform

• uniform prior (used by Bayes in his 1763 paper)
• not parameterization invariant
• uniform does not seem a meaningful concept: it refers to the names (parameters) of distributions rather than to the distributions themselves
• Jeffreys prior
• is parameterization invariant (though not the only one!)
• puts about equal prior mass in $\epsilon$-KL balls around $\theta$ ‘everywhere in the space’ (though there are some issues at the boundaries)
• Jeffreys priors are thus ‘uniform’ in a meaningful way
Outlook

• Next week: no lecture!

• In two weeks: using KL divergence tools, applied both to the alternative and to the null, to get growth-optimal e-variables, and quantifying how good they are
