
Maximum Entropy and Exponential Families

Christopher Ré
(edits by Tri Dao and Anand Avati)
August 5, 2019

Abstract
The goal of this note is to derive the exponential form of probability distributions from
more basic considerations, in particular entropy. It follows a description by E. T. Jaynes
in Chapter 11 of his book Probability Theory: The Logic of Science [1], which is available
online in many places, including http://omega.albany.edu:8008/ETJ-PS/cc11g.ps.

1 Motivating the Exponential Model


This section will motivate the exponential model form that we’ve seen in lecture.

The Setup The setup for our problem is that we are given a finite set of instances Y and
a set of m statistics (T_j, c_j) in which T_j : Y → R and c_j ∈ R. An instance (or possible
world) is just an element of the set. We can think of a statistic as a measurement of an
instance: it tells us the features of that instance that are important for our model. More
precisely, the only information we have about the instances is the values of the T_j on these
instances. Our goal is to find a probability function

p : Y → [0, 1]  with  ∑_{y∈Y} p(y) = 1.

The main goal of this note is to provide a set of assumptions under which such distri-
butions have a specific functional form, the exponential family, that we saw in generalized
linear models:

p(y; η) = exp {η · T(y) − a(η)}

in which η ∈ R^m, T(y) ∈ R^m, and T(y)_j = T_j(y). Notice that there is exactly one parameter
for each statistic. As we’ll see for discrete distributions, we are able to derive this exponential
form as a consequence of maximizing entropy subject to matching the statistics. (Unfortunately,
for continuous distributions, such a derivation does not work due to some technical issues
with entropy; this hasn’t stopped folks from using it as justification.)

1.1 The problem: Too many distributions!
We’ll examine the problem of defining a distribution from statistics (measurements). We’ll
see that there are often many probability distributions that satisfy our constraints, and
we’ll be forced to pick among them. (Throughout this section, it will be convenient to view
p and the T_j as functions from Y → R, and also as vectors indexed by Y; their use should
be clear from context.)

The Constraints We interpret a statistic as a constraint on p of the following form:

E_p[T_j] = c_j,  i.e.,  ∑_{i=1}^N T_j(y_i) p_i = ⟨T_j, p⟩ = c_j

Let’s set up some notation to describe these constraints. Let N = |Y|; then the probability
distribution we are after is a vector p ∈ R^N subject to constraints.

• There are m constraints of the form

⟨T_j, p⟩ = c_j for j = 1, . . . , m.

• A single constraint of the form ∑_{i=1}^N p_i = 1 to ensure that p is a probability distribution.
We can write this more succinctly as ⟨1, p⟩ = 1.

• We also have that p_i ≥ 0 for i = 1, . . . , N.

More compactly, we can write the equality constraints in a matrix G as

G = [ 1ᵀ ; T ] ∈ R^{(m+1)×N}  so that  Gp = (1, c)ᵀ,

where 1ᵀ is the all-ones row and the rows of T are the statistics T_j. If N(G) = {0}, then p is
uniquely determined by the constraints; in particular, if G is square, it is invertible and
p = G^{−1}(1, c)ᵀ. However, m is often much smaller than N, so that N(G) ≠ {0}, and there are
many solutions that satisfy the constraints.

Example 1.1 Suppose we have three possible worlds, i.e., Y = {y_1, y_2, y_3}, and one statistic
with T(y_i) = i and c = 2.5. Then we have:

G = [ 1 1 1 ; 1 2 3 ]  and  N(G) = span{ (1, −2, 1)ᵀ }.

Let p^(1) = (1/12, 1/3, 7/12); then Gp^(1) = (1, 2.5)ᵀ, but (infinitely) many other vectors satisfy
the same constraints: in particular, q(α) = p^(1) + α(1, −2, 1)ᵀ is valid so long as
α ∈ [−1/12, 1/6] (due to positivity).
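As a quick numerical sanity check of Example 1.1, here is a minimal numpy sketch (our own illustration, not part of the original note):

import numpy as np

# Constraint matrix from Example 1.1: the first row enforces normalization,
# the second enforces E[T] = 2.5 with T(y_i) = i.
G = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])
target = np.array([1.0, 2.5])

p1 = np.array([1/12, 1/3, 7/12])
print(np.allclose(G @ p1, target))          # True: p1 satisfies the constraints

# The whole family q(alpha) = p1 + alpha*(1, -2, 1) is feasible as well,
# so long as every coordinate stays nonnegative.
null_dir = np.array([1.0, -2.0, 1.0])
for alpha in [-1/12, 0.0, 1/6]:
    q = p1 + alpha * null_dir
    print(alpha, np.allclose(G @ q, target), bool(np.all(q >= 0)))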

Picking a probability distribution p In the case N(G) ≠ {0}, there are many probability
distributions we can pick. All of these distributions can be written as follows:

p = p^(0) + p^(1)  in which  p^(0) ∈ N(G)  and  p^(1) satisfies  Gp^(1) = (1, c)ᵀ

Example 1.2 Continuing the computation above, we see that p^(0) = α(1, −2, 1)ᵀ is a vector in
N(G) for any α.

Which p should we pick? We’ll use a method called the method of maximum entropy. In
turn, this will lead to the fact that our function p has a very special form: the form of
exponential family distributions!

1.2 Entropy
To pick among the distributions, we’ll need some scoring method. (A few natural methods
don’t work as we might think they should, e.g., minimizing variance; see [1, Ch. 11] for a
description of these alternative approaches.) We’ll cut to the chase here and define the
entropy, which is a function on probability distributions p ∈ R^N such that p ≥ 0 and
⟨1, p⟩ = 1:

H(p) = − ∑_{i=1}^N p_i log p_i

Effectively, the entropy rewards one for “spreading” the distribution out more. One can
motivate entropy from axioms, and either Jaynes or the Wikipedia page is pretty good on
this account (https://en.wikipedia.org/wiki/Entropy_(information_theory)#Rationale). The
intuition should be that entropy can be used to select the least informative prior; it’s a
way of making as few additional assumptions as possible. In other words, we want to encode
the prior information given by the constraints on the statistics while being as “objective”
or “agnostic” as possible. This is called the maximum entropy principle.
For example, one can verify that under no constraints, H(p) is maximized with p_i = N^{−1},
that is, all alternatives have equal probability. This is what we mean by spread out.
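To make “spreading out” concrete, here is a small numerical illustration of our own (the example distributions are arbitrary):

import numpy as np

def entropy(p):
    # H(p) = -sum_i p_i log p_i, with the convention 0 log 0 = 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(entropy(np.full(3, 1/3)))        # log(3) ~ 1.0986, the maximum for N = 3
print(entropy([1/12, 1/3, 7/12]))      # ~ 0.89: more concentrated, lower entropy
print(entropy([0.1, 0.1, 0.8]))        # ~ 0.64: even more concentrated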
We’ll pick the distribution that maximizes entropy subject to our constraints. Mathe-
matically, we’ll examine:

max_{p∈R^N} H(p)  s.t.  ⟨1, p⟩ = 1,  p ≥ 0,  and  T p = c

We will not discuss it, but under appropriate conditions there is a unique solution p.
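One can also find this solution numerically. Below is a minimal sketch for Example 1.1 using scipy’s SLSQP solver (our own illustration; any generic constrained optimizer would do):

import numpy as np
from scipy.optimize import minimize

G = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])
b = np.array([1.0, 2.5])

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)        # keep the log well-defined near zero
    return np.sum(p * np.log(p))       # minimizing -H(p) maximizes H(p)

res = minimize(
    neg_entropy,
    x0=np.full(3, 1/3),                # start from the uniform distribution
    method="SLSQP",
    bounds=[(0.0, 1.0)] * 3,
    constraints=[{"type": "eq", "fun": lambda p: G @ p - b}],
)
print(res.x)   # ~ [0.116, 0.268, 0.616]: max-entropy distribution with E[T] = 2.5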

1.3 The Lagrangian
We’ll create a function called the Lagrangian that has the property that any critical point of
the constrained problem is a critical point of the Lagrangian. We will show that all critical
points of the Lagrangian (and so of our original problem) can be written in the exponential
format we described above.
To simplify our discussion, let’s imagine that p > 0, i.e., there are no possible worlds y
such that p(y) = 0. In this case, our problem reduces to:

max_{p∈R^N} H(p)  s.t.  T p = c  and  ⟨1, p⟩ = 1

We can write the Lagrangian L : R^N × (R^m × R) → R as follows:

L(p; η, λ) = H(p) + ⟨η, T p − c⟩ + λ(⟨1, p⟩ − 1)

The special property of L is that any critical point of our original problem, in particular
any maximum or minimum, corresponds to a critical point of the Lagrangian. Thus, if we
prove something about critical points of the Lagrangian, we prove something about the
critical points of the original function. Later in the course, we’ll see more sophisticated uses
of Lagrangians, but for now we include a simple derivation below to give a hint of what’s
going on. For this section, we’ll assume this special property is true.
Due to that special property, we find the critical points of L by differentiating with
respect to p_i and setting the resulting equations to 0.

∂L/∂p_i = ∂/∂p_i [H(p) + ⟨η, T p − c⟩ + λ(⟨1, p⟩ − 1)]
        = −(log p_i + 1) + ⟨η, T(y_i)⟩ + λ

Setting this expression equal to 0 and solving for p_i, we learn:

p_i = e^{λ−1} exp{⟨η, T(y_i)⟩}  ⇒  p(y) ∝ exp{η · T(y)}

which is of the right form, except that we have one too many parameters, namely λ. Nev-
ertheless, this is remarkable: at a critical point, it’s always the case that the exponential
family “pops out”!
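The differentiation above is easy to verify symbolically. Here is a short sympy check of our own for N = 3 with the single statistic of Example 1.1:

import sympy as sp

p1, p2, p3, eta, lam = sp.symbols("p1 p2 p3 eta lam", positive=True)
p = [p1, p2, p3]
T = [1, 2, 3]                          # T(y_i) = i, as in Example 1.1
c = sp.Rational(5, 2)

H = -sum(pi * sp.log(pi) for pi in p)
L = H + eta * (sum(Ti * pi for Ti, pi in zip(T, p)) - c) + lam * (sum(p) - 1)

for pi in p:
    print(sp.diff(L, pi))              # -log(p_i) - 1 + eta*T_i + lam, as claimed

# Setting the derivative to zero and solving recovers the exponential form:
print(sp.solve(sp.diff(L, p1), p1))    # [exp(eta + lam - 1)], i.e. e^(lam-1) * e^(eta*T_1)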

Eliminating λ The parameter λ can be eliminated, which is the final step needed to match
our original claimed exponential form. To do so, we sum over all the p_i: on one hand, this
sum equals 1; on the other hand, we have the above expression for each p_i. This gives us
the following equation:

∑_{i=1}^N p_i = 1  and  ∑_{i=1}^N p_i = e^{λ−1} ∑_{i=1}^N exp{η · T(y_i)},  thus  e^{−λ+1} = ∑_{y∈Y} exp{η · T(y)}

Thus, we have expressed λ as a function of η and we can eliminate it. To do so, we write:

Z(η) = ∑_{y∈Y} exp{η · T(y)}

⇒ p(y; η) = Z(η)^{−1} exp{η · T(y)} = exp{η · T(y) − a(η)}  where  a(η) = log Z(η)

This function Z is called the partition function, and a is called the log-partition function.
The above is the claimed exponential form we saw in lecture.
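To close the loop on Example 1.1: with a single statistic, matching E_p[T] = c is a one-dimensional root-finding problem in η. A minimal sketch of our own:

import numpy as np
from scipy.optimize import brentq

T = np.array([1.0, 2.0, 3.0])          # the statistic from Example 1.1
c = 2.5

def p_of_eta(eta):
    w = np.exp(eta * T)
    return w / w.sum()                 # w.sum() is the partition function Z(eta)

# Moment matching: choose eta so that E_p[T] = c.
eta_star = brentq(lambda eta: p_of_eta(eta) @ T - c, -10.0, 10.0)
p = p_of_eta(eta_star)
a = np.log(np.exp(eta_star * T).sum()) # the log-partition function a(eta)
print(eta_star, p)                     # eta ~ 0.834, p ~ [0.116, 0.268, 0.616]

The result agrees with the generic constrained-optimization sketch from Section 1.2, as it should.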

2 Why the Lagrangian? [optional]

We observe that this is a constrained optimization problem with linear constraints. (One
can form the Lagrangian for non-linear constraints too, but deriving it requires fancier math
like the implicit function theorem; we only need linear constraints for our applications.)
Let r be the rank of G, so that dim(N(G)) = N − r. We create a function φ : R^{N−r} → R
such that there is a map between any point in the domain of φ and a feasible solution to
our constrained problem, and moreover φ takes the same value as H. In contrast to our
original constrained problem, φ has an unconstrained domain (all of R^{N−r}), and so we can
apply standard calculus to find its critical points. To that end, we define a (linear) map
B ∈ R^{N×(N−r)} that has rank N − r. We also insist that BᵀB = I_{N−r}. Such a B exists:
simply take its columns to be an orthonormal basis for N(G). We have

φ(x) = H(Bx + p^(1)),

where p^(1) is a fixed vector satisfying Gp^(1) = (1, c)ᵀ.
Observe that for any x ∈ R^{N−r}, Bx ∈ N(G), so that G(Bx + p^(1)) = Gp^(1) = (1, c)ᵀ, and
so Bx + p^(1) is feasible. Moreover, x ↦ Bx + p^(1) is a bijection from R^{N−r} onto the set of
feasible solutions. (To see injectivity: suppose p, q are feasible with p ≠ q. Write
p = p^(0) + p^(1) and q = q^(0) + p^(1) as above. If Bᵀp = Bᵀq, then Bᵀp^(0) = Bᵀq^(0), and
since the columns of B form an orthonormal basis for N(G) and p^(0), q^(0) ∈ N(G), this
implies p^(0) = q^(0), i.e., p = q, a contradiction.)
Importantly, φ is now unconstrained, and so any stationary point (and in particular any
maximum or minimum) must satisfy:

∇_x φ(x) = 0
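Here is a sketch of this construction for Example 1.1, using scipy.linalg.null_space to produce exactly such a B (our own code):

import numpy as np
from scipy.linalg import null_space

G = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])
b = np.array([1.0, 2.5])

B = null_space(G)                        # N = 3, r = 2, so B is 3 x 1
print(np.allclose(B.T @ B, np.eye(1)))   # True: orthonormal columns spanning N(G)

p1 = np.array([1/12, 1/3, 7/12])         # fixed feasible point with G @ p1 = b

def phi(x):
    # Unconstrained reparameterization: B @ x + p1 is feasible for every x.
    q = B @ x + p1
    return -np.sum(q * np.log(q))

x = np.array([0.05])
print(np.allclose(G @ (B @ x + p1), b))  # True: constraints hold automatically
print(phi(x))                            # entropy of the corresponding feasible p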

Gradient Decomposition Any critical point of H yields a critical point of φ; that is, if
p = p^(0) + p^(1) is a critical point of H, then x = Bᵀp^(0) is a critical point of φ. Consider any
critical point p; then we can uniquely decompose the gradient as:

∇_p H(p) = g_0 + g_1  in which  g_0 ∈ N(G)  and  g_1 ∈ N(G)^⊥.


We claim g_0 = B∇_x φ(Bᵀp^(0)), or equivalently Bᵀg_0 = ∇_x φ(Bᵀp^(0)). By direct calculation,
with x = Bᵀp^(0),

∇_x φ(x) = Bᵀ∇_p H(Bx + p^(1)) = Bᵀ∇_p H(p^(0) + p^(1)) = Bᵀ∇_p H(p) = Bᵀg_0,

where the last equality is due to g_1 ∈ N(G)^⊥. A critical point of H satisfying the constraints
must not change along any direction that satisfies the constraints, which is to say that we
must have g_0 = 0. Very roughly, the intuition is that if p were a maximum (or minimum)
and g_0 were non-zero, there would be a way to strictly increase (or decrease) the function
in a neighborhood around p, contradicting p being a maximum (minimum).

Lagrangian Since g_1 ∈ N(G)^⊥ = R(Gᵀ) (see the fundamental theorem of linear algebra),
we can find an η(p) such that g_1 = −Gᵀη(p), which motivates the following functional form:

L(p, η(p)) = H(p) + ⟨η(p), Gp − c⟩

By the definition of η(p), we have:

∇_p L(p, η(p)) = g_0 + g_1 + Gᵀη(p) = g_0.

That is, for any critical point p of the original function (which corresponds to g_0 = 0), we
can select η(p) so that p is a critical point of L(p, η). Informally, the multipliers combine
the rows of G to cancel g_1, the component of the gradient in the direction of the constraints.
This establishes that any critical point of the original constrained function is also a critical
point of the Lagrangian.
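Both facts are easy to verify numerically at the maximum-entropy solution of Example 1.1 (a sketch of our own; the closed form x = e^η below comes from the moment condition E_p[T] = 2.5, which reduces to x^2 − x − 3 = 0):

import numpy as np

G = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])
x = (1 + np.sqrt(13)) / 2              # x = e^eta, positive root of x^2 - x - 3 = 0
w = x ** np.array([1.0, 2.0, 3.0])
p_star = w / w.sum()                   # ~ [0.116, 0.268, 0.616]

grad_H = -(np.log(p_star) + 1.0)       # gradient of H(p) = -sum_i p_i log p_i

# Decompose grad_H into its N(G) and N(G)-perp = R(G^T) components.
P_row = G.T @ np.linalg.solve(G @ G.T, G)   # orthogonal projector onto R(G^T)
g1 = P_row @ grad_H
g0 = grad_H - g1
print(np.allclose(g0, 0.0))            # True: no gradient along the feasible set

# Recover the multipliers from g1 = -G^T eta(p) by least squares.
eta = np.linalg.lstsq(G.T, -g1, rcond=None)[0]
print(eta)                             # ~ [-1.99, 0.834]: the (lambda, eta) of Section 1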

References
[1] E. T. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press,
2003.
