Statistics - Lecture 7
Today, Lecture 7: A Lot of Math
1. KL divergence
2. Exponential Families
3. Bayes priors
Towards KL Divergence
• The fact that evidence keeps growing towards the truth is related to the fact that the “KL divergence” between arbitrary distributions $P_1$ and $P_0$,
$$D(P_1 \,\|\, P_0) := \mathbf{E}_{X \sim P_1}\left[\log \frac{p_1(X)}{p_0(X)}\right] = \sum_x p_1(x) \log \frac{p_1(x)}{p_0(x)},$$
satisfies $D(P_1 \,\|\, P_0) \geq 0$ with equality iff $P_1 = P_0$.
• The KL divergence is just the expected log-likelihood ratio (or ‘difference’).
• Nonnegativity of KL is known as the information inequality in information theory.
Kullback-Leibler Divergence
$$D(P_1 \,\|\, P_0) := \mathbf{E}_{X \sim P_1}\left[\log \frac{p_1(X)}{p_0(X)}\right] = \sum_x p_1(x) \log \frac{p_1(x)}{p_0(x)}$$
satisfies $D(P_1 \,\|\, P_0) \geq 0$ with equality iff $P_1 = P_0$.
Nonnegativity of KL is known as the information inequality in information theory.
Proof: Jensen’s inequality says that in general, for convex functions $f$, $\mathbf{E}[f(X)] \geq f(\mathbf{E}[X])$, where the inequality is strict if $f$ is strictly convex and $X$ is nondegenerate (i.e. $P(X \neq \mathbf{E}[X]) > 0$). Using that $-\log$ is convex, we get
$$D(P_1 \,\|\, P_0) = \mathbf{E}_{X \sim P_1}\left[-\log \frac{p_0(X)}{p_1(X)}\right] \geq -\log \mathbf{E}_{X \sim P_1}\left[\frac{p_0(X)}{p_1(X)}\right] = -\log \sum_x p_1(x)\,\frac{p_0(x)}{p_1(x)} = -\log 1 = 0$$
…with strict inequality if $P_0 \neq P_1$.
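A minimal numerical sketch (not part of the slides) illustrating the definition and the information inequality; the two distributions on a three-point sample space are arbitrary choices.

```python
import numpy as np

# Two hypothetical discrete distributions on {0, 1, 2} (values chosen arbitrarily).
p1 = np.array([0.5, 0.3, 0.2])   # plays the role of P_1
p0 = np.array([0.2, 0.3, 0.5])   # plays the role of P_0

def kl(p, q):
    """D(P || Q) = sum_x p(x) log(p(x)/q(x)): the expected log-likelihood ratio under P."""
    return np.sum(p * np.log(p / q))

print(kl(p1, p0))   # strictly positive, since P_1 != P_0
print(kl(p1, p1))   # exactly 0: the equality case of the information inequality
```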
Jensen’s Inequality
[Figure: a strictly convex function $f$ with $\mathbf{E}[f(x)]$ and $f(\mathbf{E}[x])$ marked, illustrating $\mathbf{E}[f(x)] \geq f(\mathbf{E}[x])$]
KL divergence and GRO (log-optimality)
• You have already seen in an earlier homework exercise that the GRO (growth-rate-optimal) e-variable for testing a simple $H_0 = \{P_0\}$ against a simple alternative $H_1 = \{P_1\}$ is given by the likelihood ratio (see the sketch below).
• The proof implicitly used KL divergence.
• …we will greatly extend this result in the coming weeks, to composite $H_0$ and $H_1$.
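A small sketch (not from the slides) of the simple-vs-simple case on a finite sample space, with arbitrarily chosen $P_0$ and $P_1$: the likelihood ratio $p_1/p_0$ has expectation 1 under $P_0$ (so it is a valid e-variable), and its growth rate under $P_1$ is exactly $D(P_1 \,\|\, P_0)$.

```python
import numpy as np

# Hypothetical simple null and alternative on a 3-element sample space
# (the numbers are illustrative only).
p0 = np.array([0.6, 0.3, 0.1])    # null P_0
p1 = np.array([0.2, 0.3, 0.5])    # alternative P_1
lr = p1 / p0                       # the likelihood-ratio e-variable E(x) = p1(x)/p0(x)

print(np.sum(p0 * lr))             # E_{P_0}[E] = 1  -> valid e-variable
print(np.sum(p1 * np.log(lr)))     # growth rate E_{P_1}[log E]
print(np.sum(p1 * np.log(p1/p0)))  # D(P_1 || P_0): identical to the previous line
```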
Examples
(again, different from the standard definition, but equivalent in the cases we care about)
KL divergence and Fisher Information
• If Fisher information as defined above exists, then for $\theta_0, \theta_1 \in \Theta$ which are “close” to each other, a second-order Taylor approximation gives (see the sketch below):
$$D(P_{\theta_0} \,\|\, P_{\theta_1}) \approx \tfrac{1}{2}\, I(\theta_1)\,(\theta_0 - \theta_1)^2$$
(more precisely, $D(P_{\theta_0} \,\|\, P_{\theta_1}) = \tfrac{1}{2}\, I(\theta_1)\,(\theta_0 - \theta_1)^2 + O(|\theta_0 - \theta_1|^3)$)
• [multivariate:]
$$D(P_{\theta_0} \,\|\, P_{\theta_1}) \approx \tfrac{1}{2}\,(\theta_0 - \theta_1)^\top I(\theta_1)\,(\theta_0 - \theta_1)$$
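A numerical sketch (not from the slides) of this approximation for the Bernoulli model in its mean parameterization, where $I(\theta) = 1/(\theta(1-\theta))$; the parameter values are arbitrary.

```python
import numpy as np

# Compare the exact Bernoulli KL divergence with the quadratic
# Fisher-information approximation (1/2) * I(theta_1) * (theta_0 - theta_1)^2.
def kl_bernoulli(t0, t1):
    return t0 * np.log(t0 / t1) + (1 - t0) * np.log((1 - t0) / (1 - t1))

def fisher_bernoulli(t):
    return 1.0 / (t * (1 - t))

theta1 = 0.4
for theta0 in [0.41, 0.45, 0.6]:   # increasingly far from theta1
    exact = kl_bernoulli(theta0, theta1)
    approx = 0.5 * fisher_bernoulli(theta1) * (theta0 - theta1) ** 2
    print(theta0, exact, approx)    # agreement degrades as |theta0 - theta1| grows
```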
Today, Lecture 7: A Lot of Math
1. KL divergence
2. Exponential Families
3. Bayes priors
Exponential Families
• Recall the general form: an exponential family with carrier $q$ and sufficient statistic $\phi$ has densities $p_\beta(x) = q(x)\, e^{\beta \phi(x)} / Z(\beta)$
• Example: $\mathcal{X} = \{0,1\}$; set $q(x) = 1$, $\phi(x) = x$
Theorem:
for every exponential family $\mathcal{M} = \{P_\beta : \beta \in \Theta_\beta\}$, $\mathbf{E}_{P_\beta}[\phi(X)]$ is strictly monotonically increasing as a function of $\beta$
• [in the MDL book this statement is also made precise for $k$-dim families with $k > 1$, i.e. $\phi: \mathcal{X} \to \mathbb{R}^k$, where it is not directly clear what ‘monotonic’ means]
• Intuition for proof: if $\beta$ increases, then $x$ with high $\phi(x)$ get exponentially more weight
• The theorem implies that we can identify a distribution in $\mathcal{M}$ by its mean of $\phi$ rather than by the value of $\beta$
Bernoulli
• $\mathcal{X} = \{0,1\}$; $\mathcal{P} = \{P_\mu : \mu \in (0,1)\}$, $P_\mu(X = 1) = \mu$
• $\mathbf{E}_{P_\mu}[X] = P_\mu(X=1)\cdot 1 + P_\mu(X=0)\cdot 0 = \mu$
• As $\beta$ ranges from $-\infty$ to $\infty$, $\mathbf{E}_{P_\beta}[X]$ ranges from $0$ to $1$
• Recall that $\beta = \log \frac{\mu}{1-\mu}$, the log-odds of $\mu$ (see the sketch below)
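A small sketch (not from the slides) of the Bernoulli family as an exponential family with $q(x)=1$, $\phi(x)=x$, so that $Z(\beta) = 1 + e^\beta$: the mean is the logistic transform of $\beta$, strictly increasing, and inverting it gives back $\beta = \log(\mu/(1-\mu))$.

```python
import numpy as np

def mean_from_canonical(beta):
    # E_{P_beta}[X] = e^beta / (1 + e^beta), since Z(beta) = 1 + e^beta
    return np.exp(beta) / (1 + np.exp(beta))

def canonical_from_mean(mu):
    # the log-odds: beta = log(mu / (1 - mu))
    return np.log(mu / (1 - mu))

betas = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
mus = mean_from_canonical(betas)
print(mus)                       # strictly increasing in beta, staying inside (0, 1)
print(canonical_from_mean(mus))  # recovers the original betas
```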
Gaussian Scale Family
• The standard parameterization: $p_{\sigma^2}(x) \propto e^{-x^2 / (2\sigma^2)}$
• We have (numerical check in the sketch below):
• $\mu_\beta = \frac{\partial}{\partial \beta} \log Z(\beta)$  [multivariate: $\mu_\beta = \nabla \log Z(\beta)$]
• $\mathrm{var}_{P_\beta}(\phi) = \frac{\partial^2}{\partial \beta^2} \log Z(\beta) = I(\beta)$ (Fisher information matrix at $\beta$)
[multivariate: covariance matrix = Hessian = $I(\beta)$]
• …analogous properties hold for $\beta_\mu$, with $\log Z(\beta)$ replaced by $D(\mu \,\|\, \mu_0)$: $\frac{\partial}{\partial \mu} D(\mu \,\|\, \mu_0) = \beta_\mu - \beta_{\mu_0}$
• $I(\mu) = \frac{\partial^2}{\partial \mu^2} D(\mu \,\|\, \mu_0) = \frac{1}{I(\beta_\mu)} = \frac{1}{\mathrm{var}_{P_{\beta_\mu}}(\phi)}$  [multivariate: $I(\beta_\mu) = I^{-1}(\mu) = \mathrm{cov}_{P_{\beta_\mu}}(\phi)$]
• Note the compact and overloaded notation!
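A numerical sketch (not from the slides) of the first two identities for the Bernoulli family, where $\log Z(\beta) = \log(1 + e^\beta)$, using finite differences; the value of $\beta$ is arbitrary.

```python
import numpy as np

# Check mu_beta = d/dbeta log Z(beta) and var_{P_beta}(phi) = d^2/dbeta^2 log Z(beta)
# for the Bernoulli family (q(x) = 1, phi(x) = x, Z(beta) = 1 + e^beta).
def log_Z(beta):
    return np.log(1 + np.exp(beta))

beta, h = 0.7, 1e-4
mu = np.exp(beta) / (1 + np.exp(beta))   # exact mean of phi(X) = X under P_beta
var = mu * (1 - mu)                      # exact variance of phi(X) under P_beta

d1 = (log_Z(beta + h) - log_Z(beta - h)) / (2 * h)                  # ≈ mu
d2 = (log_Z(beta + h) - 2 * log_Z(beta) + log_Z(beta - h)) / h**2   # ≈ var = I(beta)
print(mu, d1)
print(var, d2)
```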
$$n \cdot (\beta_\mu - \beta)\cdot \mathbf{E}_{P_\mu}[\phi(X)] \;-\; n \log\frac{Z(\beta_\mu)}{Z(\beta)} \;=\; n \cdot (\beta_\mu - \beta)\cdot \mu \;-\; n \log\frac{Z(\beta_\mu)}{Z(\beta)}$$
(this is $n \cdot D(\beta_\mu \,\|\, \beta)$ written out; compare it with the empirical log-likelihood ratio on the next slide)
KL Robustness Property for Exponential Families
• For all full expfams we have: if there is $\mu$ such that $\mathbf{E}_{P_\mu}[\phi(X)] = n^{-1} \sum_{i=1..n} \phi(X_i)$, then the canonical MLE satisfies $\hat\beta = \beta_\mu$ and the mean-value MLE satisfies $\hat\mu = \mu = n^{-1} \sum_{i=1..n} \phi(X_i)$
• For all $\mu \in \Theta_\mu$, $\beta \in \Theta_\beta$ with $\beta = \beta_\mu$ we then have:
$$D(\hat\mu \,\|\, \mu) = D(\hat\beta \,\|\, \beta) = \frac{1}{n} \sum_{i=1..n} \log \frac{p_{\hat\beta}(X_i)}{p_\beta(X_i)}$$
• ‘Expected log-likelihood difference is empirical average log-likelihood difference’ (see the sketch below)
• Very special property (we would perhaps expect this to hold approximately, with high probability, for large samples by the LLN, but it holds precisely, for all sample sizes.)
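A numerical sketch (not from the slides) of the robustness property for the Bernoulli family; the sample size, seed, data-generating parameter and comparison point $\mu$ are arbitrary choices.

```python
import numpy as np

# For Bernoulli data, the empirical average of phi(X) = X is the mean-value MLE
# mu_hat, and the average log-likelihood difference against any other mu equals
# D(mu_hat || mu) exactly, for any sample size.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=17)   # deliberately small, arbitrary sample
mu_hat = x.mean()                    # mean-value MLE

def kl_bernoulli(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def avg_loglik(m):
    return np.mean(x * np.log(m) + (1 - x) * np.log(1 - m))

mu = 0.6                             # any other point of the mean-value parameter space
print(avg_loglik(mu_hat) - avg_loglik(mu))   # empirical average log-likelihood difference
print(kl_bernoulli(mu_hat, mu))              # D(mu_hat || mu): identical up to rounding
```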
Locally Approximating KL by squared Euclidean distance
• Consider a 1-dimensional exponential family.
• MV parameter $\mu_j$ corresponds to CAN parameter $\beta_j$ ($j = 0, 1$)
• By a second-order Taylor approximation, we have:
$$D(P_{\mu_0} \,\|\, P_{\mu_1}) = D(P_{\beta_0} \,\|\, P_{\beta_1}) \approx \tfrac{1}{2}\, I(\beta_0)\,(\beta_0 - \beta_1)^2 \approx \tfrac{1}{2}\, I(\mu_1)\,(\mu_0 - \mu_1)^2$$
Approximating KL by squared Euclidean distance
• Consider a $k$-dimensional exponential family.
• MV parameter $\mu_j$ corresponds to CAN parameter $\beta_j$ ($j = 0, 1$)
• We have:
$$D(P_{\mu_0} \,\|\, P_{\mu_1}) = D(P_{\beta_0} \,\|\, P_{\beta_1}) \approx \tfrac{1}{2}\,(\beta_0 - \beta_1)^\top I(\beta_0)\,(\beta_0 - \beta_1) \approx \tfrac{1}{2}\,(\mu_0 - \mu_1)^\top I(\mu_1)\,(\mu_0 - \mu_1)$$
Jeffreys Prior
• Examples
• Bernoulli (see the sketch below)
• Gaussian location, fixed $\sigma^2$, model truncated to $[\epsilon, 1-\epsilon]$: $I(\mu)$ constant, so Jeffreys prior uniform
• Untruncated Gaussian location: Jeffreys prior improper!
• Multidimensional case: replace $I(\theta)$ by $\det I(\theta)$
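A small sketch (not from the slides) spelling out the Bernoulli example: taking the Jeffreys prior proportional to $\sqrt{I(\theta)}$ with $I(\mu) = 1/(\mu(1-\mu))$ gives $w(\mu) \propto 1/\sqrt{\mu(1-\mu)}$, i.e. a Beta(1/2, 1/2) density.

```python
import numpy as np
from scipy.stats import beta

# Jeffreys prior for the Bernoulli model in the mean-value parameterization:
# w(mu) ∝ sqrt(I(mu)) = 1/sqrt(mu*(1-mu)); normalized, this is Beta(1/2, 1/2).
mu = np.linspace(0.01, 0.99, 99)
unnormalized = 1.0 / np.sqrt(mu * (1 - mu))
ratio = unnormalized / beta.pdf(mu, 0.5, 0.5)
print(ratio.min(), ratio.max(), np.pi)   # constant ratio ≈ pi, the normalizing constant
```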
Parameterization (In)Dependence
(more correct notation would be $P_\mu^{[\mathrm{MEAN}]}$, $P_\beta^{[\mathrm{CAN}]}$, $I^{\mathrm{MEAN}}(\mu)$, $I^{\mathrm{CAN}}(\beta)$)
Improper Priors
• Example: Jeffreys’ prior for normal location family, 𝑤 𝜇 ≡ 1 (or any other
constant)
$$w(\mu \mid X^n) := \frac{w(\mu)\, p_\mu(X^n)}{\int w(\mu)\, p_\mu(X^n)\, d\mu} = \frac{e^{-\sum_{i=1..n}(X_i - \mu)^2 / 2\sigma^2}}{\int e^{-\sum_{i=1..n}(X_i - \mu)^2 / 2\sigma^2}\, d\mu} \;\propto\; e^{-\frac{n}{2\sigma^2}(\mu - \bar X_n)^2}$$
• …so although the prior is improper, the posterior is a proper (normal) distribution with mean $\bar X_n$ and variance $\sigma^2 / n$ (see the sketch below).
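A numerical sketch (not from the slides) of this point: normalizing the flat-prior posterior on a grid works fine, and it matches a normal density with mean $\bar X_n$ and variance $\sigma^2/n$; the data-generating values and the grid are arbitrary choices.

```python
import numpy as np

# Improper flat prior w(mu) ≡ 1, normal likelihood with known sigma:
# the posterior ∝ likelihood, and it normalizes to a proper density.
rng = np.random.default_rng(1)
sigma, n = 2.0, 10
x = rng.normal(1.5, sigma, size=n)

mu_grid = np.linspace(-10, 10, 4001)
log_lik = np.array([-np.sum((x - m) ** 2) / (2 * sigma**2) for m in mu_grid])
post = np.exp(log_lik - log_lik.max())        # flat prior: posterior ∝ likelihood
post /= np.trapz(post, mu_grid)               # proper despite the improper prior

print(np.trapz(mu_grid * post, mu_grid), x.mean())                       # mean ≈ sample mean
print(np.trapz((mu_grid - x.mean())**2 * post, mu_grid), sigma**2 / n)   # variance ≈ sigma^2/n
```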