THERML: THE THERMODYNAMICS OF MACHINE LEARNING

ABSTRACT

1 INTRODUCTION
Let X, Y be some paired data, for example: a set of images X and their labels Y. We imagine the data comes from some true, unknown data generating process Φ¹, from which we have drawn a training set of N pairs:

T_N ≡ (x^N, y^N) ≡ {x_1, y_1, x_2, y_2, . . . , x_N, y_N} ∼ φ(x^N, y^N).  (1)

We further imagine the process is exchangeable² and the data is conditionally independent given the governing process Φ:

p(x^N, y^N | φ) = ∏_i p(x_i|φ) p(y_i|x_i, φ).  (2)
As machine learners, we believe that by studying the training set, we should be able to infer or predict new draws from the same data generating process. Call a set of M future draws from the data generating process T′_M ≡ {X^M, Y^M} the test set.

The predictive information (Bialek et al., 2001) is the mutual information between the training set and an infinite test set, or equivalently the amount of information the training set provides about the generative process itself:

I_pred(T_N) ≡ lim_{M→∞} I(T_N; T′_M) = I(T_N; Φ) = I(X^N, Y^N; Φ).  (3)
The predictive information measures the underlying complexity of the data generating process (Still, 2014); it is fundamentally limited and must grow sublinearly in the dataset size (Bialek et al., 2001). Hence, the predictive information is a vanishing fraction of the total information in the training set³:

lim_{N→∞} I_pred(T_N) / H(T_N) = 0  (4)
Only a vanishing fraction of the information present in our training data is in any way useful for future tasks: a vanishing fraction of the information contained in the training data is signal, the rest is noise. We claim the goal of learning is to learn a representation of the data, both locally and globally, that captures the predictive information while being maximally compressed: one that separates the signal from the noise.
¹ Here we aim to invoke the same philosophy as in the introduction to Watanabe (2018).
² That is, we imagine the data satisfies De Finetti's theorem, for which infinite exchangeable processes usually can be described by products of conditionally independent distributions, but we don't want to worry too much about the complicated details since there are subtle special cases (Accardi, 2018).
³ Here and throughout, H(A) is used to denote the entropy H(A) = −∑_a p(a) log p(a).
[Figure: graphical models for the two worlds, with nodes Θ, Φ, Z, X, Y.] (a) Graphical model for world P, the real world augmented with a local and global representation. The dashed lines emphasize that θ only depends on the first N data points, the training set. Blue denotes nodes outside our control, while red nodes are under our direct control. (b) Graphical model for world Q, the world we desire. In this world, Z acts as a latent variable for X and Y jointly.

2 A TALE OF TWO WORLDS
We are primarily interested in learning a stochastic local representation of X, call it Z, defined by some parametric distribution of our own design, p(z_i|x_i, θ), with its own parameters θ. A training procedure is a process that assigns a distribution p(θ|x^N, y^N) to the parameters conditioned on the observed dataset. In this way, the parameters of our local parametric map are themselves a global representation of the dataset. With our augmentations, the world now looks like the graphical model in Figure 3a, denoted World P: some data generating process Φ generates a dataset (X^N, Y^N), on which we perform some learning algorithm to obtain parameters p(θ|x^N, y^N), which we can use to form a parametric local representation p(z_i|x_i, θ).
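As a concrete picture of World P's conditional structure, here is a minimal ancestral-sampling sketch; the specific Gaussian choices and noise scales are our own hypothetical placeholders, not distributions used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# All distributional choices below are hypothetical placeholders, used only to make
# the conditional structure of World P executable.
phi = rng.normal()                            # Phi: the true data generating parameters
x = rng.normal(loc=phi, size=10)              # X_i | Phi
y = x + 0.1 * rng.normal(size=10)             # Y_i | X_i, Phi

# A training procedure assigns a distribution p(theta | x^N, y^N); here a noisy
# posterior-mean estimate stands in for a single draw from that distribution.
theta = x.mean() + 0.1 * rng.normal()         # Theta | X^N, Y^N (the global representation)

# The local representation is a stochastic function of each input and theta.
z = theta * x + 0.1 * rng.normal(size=10)     # Z_i | X_i, Theta
```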
World P is what we have. It is not necessarily what we want. What we have to contend with is an unknown distribution over our data. What we want is a world that corresponds to the traditional modeling assumptions, in which Z acts as a latent factor for X and Y, rendering them conditionally independent and leaving no correlations unexplained. Similarly, we would prefer if we could easily marginalize out the dependence on our universal (Φ) and model specific (Θ) parameters. World Q in Figure 3b is the world we want⁴.
We can measure the degree to which the real world aligns with our desires by computing the minimum possible relative information⁵ between our distribution p and any distribution consistent with the conditional dependencies encoded in graphical model Q⁶. It can be shown (Friedman et al., 2001) that this quantity is given by the difference in multi-informations between the two graphical models, as measured in World P:

J ≡ min_{q∈Q} D_KL[p; q] = I_P − I_Q.  (5)
The multi-information (Slonim et al., 2005) of a graphical model is the KL divergence between the joint distribution and the product of all of the marginal distributions, which can be computed as a sum of mutual informations, one for each node in the graph, between itself and its parents:

I_G ≡ ⟨log [p(g^N) / ∏_i p(g_i)]⟩ = ∑_i I(g_i; Pa(g_i))  (6)
In our case:

J = I(Θ; X^N, Y^N) + ∑_i [I(X_i; Φ) + I(Y_i; X_i, Φ) + I(Z_i; X_i, Θ) − I(X_i; Z_i) − I(Y_i; Z_i)].  (7)
⁴ We could consider different alternatives, deciding to relax some of the constraints we imposed in World Q, or generalizing World P by letting the representation depend on X and Y jointly, for instance. What follows demonstrates a general sort of calculus that we can invoke for any specified pair of graphical models. In particular, Appendices A to C discuss alternatives.
⁵ Also known as the KL divergence.
⁶ Note that this is D_KL[p; q*] where q* is the well-known reverse-information projection or moment projection: q* = argmin_{q∈Q} D_KL[p; q] (Csiszár & Matúš, 2003).
This minimal relative information has two terms that are outside our control and can be taken to be constant, but which relate to the predictive information:

∑_i [I(X_i; Φ) + I(Y_i; X_i, Φ)] ≥ ∑_i I(Y_i; X_i) + I_pred(T_N).  (8)
These terms measure the intrinsic complexity of our data. The remaining four terms are:

• I(X_i; Z_i) - which measures how much information our representation contains about the input (X). This should be maximized to ensure our local representation actually represents the input.

• I(Y_i; Z_i) - which measures how much information our representation contains about our auxiliary data. This should be maximized as well to ensure that our local representation is predictive for the labels.

• I(Z_i; X_i, Θ) - which measures how much information the parameters and input determine about our representation. This should be minimized to ensure consistency between worlds, and to ensure we learn compressed local representations. Notice that this is similar to, but distinct from the first term above.

• I(Θ; X^N, Y^N) - which measures how much information the parameters, our global representation, capture about the entire training set. This should be minimized as well; it measures our risk of overfitting the parameters.
These mutual informations are all intractable in general, since we cannot compute the necessary
marginals in closed form, given that we do not have access to the true data generating distribution.
2.1 FUNCTIONALS
Despite their intractability, we can compute variational bounds on these mutual informations.
2.1.1 ENTROPY

S ≡ ⟨log [p(θ|x^N, y^N) / q(θ)]⟩_P ≥ I(Θ; X^N, Y^N)  (10)
The relative entropy in our parameters, or just entropy for short, measures the relative information of the distribution we assign to our parameters in World P after learning from the data (X^N, Y^N), with respect to some data-independent prior q(θ) on the parameters. This is an upper bound on the mutual information between the data and our parameters, and as such can measure our risk of overfitting our parameters.
2.1.2 RATE

R_i ≡ ⟨log [p(z_i|x_i, θ) / q(z_i)]⟩_P ≥ I(Z_i; X_i, Θ)  (11)
The rate measures the complexity of our representation. It is the relative information of a sample-specific representation z_i ∼ p(z|x_i, θ) with respect to our variational marginal q(z). It measures how many bits we actually encode about each sample, and can measure our risk of overfitting our representation. We use R ≡ ∑_i R_i.
⁷ Given this relationship, we could actually reduce the total number of functionals we consider from 4 to 3, as discussed in Appendix A.
2.1.3 CLASSIFICATION ERROR

C_i ≡ ⟨−log q(y_i|z_i)⟩_P ≥ H(Y_i|Z_i)  (12)

The classification error measures the conditional entropy of Y left after conditioning on Z. It is a measure of how much information about Y is left unspecified in our representation. This functional measures our supervised learning performance. We use C ≡ ∑_i C_i.
2.1.4 DISTORTION

D_i ≡ ⟨−log q(x_i|z_i)⟩_P ≥ H(X_i|Z_i)  (13)

The distortion measures the conditional entropy of X left after conditioning on Z. It is a measure of how much information about X is left unspecified in our representation. This functional measures our unsupervised learning performance. We use D ≡ ∑_i D_i.
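To make the four functionals concrete, the following is a minimal numerical sketch (our own illustration, with entirely hypothetical distributional choices and parameter values, not the paper's) of Monte Carlo estimates of the per-example rate, distortion, and classification error for a one-dimensional Gaussian encoder, Gaussian decoder, and logistic classifier; the entropy S would additionally require a distribution over θ itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: x real-valued, y binary (a hypothetical stand-in for (X, Y)).
x = rng.normal(size=(8, 1))
y = (x[:, 0] > 0).astype(float)

# Hypothetical encoder parameters theta: a linear Gaussian encoder p(z|x, theta).
W_enc, b_enc, log_sigma_enc = 1.5, 0.0, np.log(0.5)
# Variational marginal q(z): standard normal.
# Variational decoder q(x|z): linear Gaussian. Variational classifier q(y|z): logistic.
W_dec, b_dec, log_sigma_dec = 0.6, 0.0, np.log(0.7)
W_cls, b_cls = 4.0, 0.0

def gauss_logpdf(v, mean, log_sigma):
    sigma = np.exp(log_sigma)
    return -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((v - mean) / sigma) ** 2

n_mc = 256  # Monte Carlo samples of z per example
mu_z = W_enc * x[:, 0] + b_enc
z = mu_z[None, :] + np.exp(log_sigma_enc) * rng.normal(size=(n_mc, 8))

# Rate:  R_i = E_{z ~ p(z|x_i, theta)} [ log p(z|x_i, theta) - log q(z) ]
log_p_z = gauss_logpdf(z, mu_z[None, :], log_sigma_enc)
log_q_z = gauss_logpdf(z, 0.0, 0.0)
R_i = (log_p_z - log_q_z).mean(axis=0)

# Distortion:  D_i = E[ -log q(x_i | z) ]
log_q_x = gauss_logpdf(x[:, 0][None, :], W_dec * z + b_dec, log_sigma_dec)
D_i = (-log_q_x).mean(axis=0)

# Classification error:  C_i = E[ -log q(y_i | z) ]
logits = W_cls * z + b_cls
log_q_y = y[None, :] * -np.logaddexp(0.0, -logits) + (1 - y[None, :]) * -np.logaddexp(0.0, logits)
C_i = (-log_q_y).mean(axis=0)

print("R =", R_i.sum(), " D =", D_i.sum(), " C =", C_i.sum())
# S would additionally require a distribution over theta itself, e.g. a
# variational posterior over theta relative to a data-independent prior q(theta).
```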
2.2 GEOMETRY

The distributions p(z|x, θ), p(θ|x^N, y^N), q(z), q(x|z), q(y|z) can be chosen arbitrarily. Once chosen, the functionals R, C, D, S take on well-described values. The choice of the five distributional families specifies a single point in a four-dimensional space.

Importantly, the sum of these functionals is a variational upper bound (up to an additive constant) for the minimum possible relative information between worlds (Appendix D):

S + R + C + D ≥ J + ∑_i H(X_i, Y_i|Φ)  (14)
Besides just the upper bound, we can consider the full space of feasible points. Notice that S and R are both themselves upper bounds on mutual informations, and so must be positive semi-definite. If our data is discrete, or if we have discretized it⁸, D and C, which are both upper bounds on conditional entropies, must be positive as well. Along with Equation (14), given that ∑_i H(X_i, Y_i|Φ) is a positive constant outside our control, the space of possible (R, C, D, S) values is at least restricted to be points in the positive orthant with some minimum possible Manhattan distance to the origin:

S + R + C + D ≥ ∑_i H(X_i, Y_i|Φ)    R ≥ 0    S ≥ 0    D ≥ 0    C ≥ 0  (15)
Even in the infinite model family limit, data-processing inequalities on mutual information terms
all defined in a set of variables that satisfy some nontrivial conditional dependencies ensure that
there are regions in this functional space that are wholly out of reach. The surface of the feasible region maps out an optimal frontier: optimal in the degree to which it minimizes the mismatch between our two worlds, subject to constraints on the relative magnitudes of the individual terms. This convex polytope has edges, faces and corners that are identifiable as the optimal solutions for well-known objectives.
This story is a generalization of the story presented in Alemi et al. (2018), which can be considered a two-dimensional projection of this larger space (onto R, D). Within our larger framework we can derive more specific bounds between subsets of the functionals. For instance:

R_i + D_i ≥ H(X_i) + I(Z_i; Θ|X_i).  (16)

This mirrors the bound given in Alemi et al. (2018) where R + D ≥ H(X), which is still true given that all conditional mutual informations are positive semi-definite (H(X) + I(Z; Θ|X) ≥ H(X)), but here we obtain a tighter pointwise bound that has a term measuring how much information about our encoding is revealed by the parameters after conditioning on the input itself. This term
⁸ More generally, if we choose some measure m(x), m(y) on both X and Y, we can define D and C in terms of that measure, e.g. D ≡ ⟨−log [q(x|z)/m(x)]⟩ ≥ H_m(X) − I(X; Z) = H_m(X|Z).
I(Z_i; Θ|X_i) captures the degree to which our local representation is overly sensitive to the particular parameter settings⁹,¹⁰.
2.3 GENERALIZATION

We can evaluate how much information our representations capture about the true data generating process. For instance, I(Z_i; Φ) measures how much information about the true data generating process our local representations capture. Notice that given the conditional dependencies in world P, we have the following Markov chain:

Φ → (X_i, Y_i, Θ) → Z_i  (17)

and so by the Data Processing Inequality (Cover & Thomas, 2012):

I(Z_i; Φ) ≤ I(Z_i; Θ, X_i, Y_i) = I(Z_i; X_i, Θ) + I(Z_i; Y_i|X_i, Θ) = I(Z_i; X_i, Θ) ≤ R_i,  (18)

where I(Z_i; Y_i|X_i, Θ) vanishes because, in world P, Z_i depends only on (X_i, Θ).
The per-instance rate R_i forms an upper bound on the mutual information between our encoding Z_i and the true governing parameters of our data Φ. Similarly, we can establish that:

Φ → (X^N, Y^N) → Θ  ⟹  I(Θ; Φ) ≤ I(Θ; X^N, Y^N) ≤ S.  (19)

S upper bounds the amount of information our encoder's parameters Θ, the global representation of the dataset, can contain about the true process Φ. At the same time:

I(Θ; Φ) ≤ I(X^N, Y^N; Φ) ≤ ∑_i I(X_i, Y_i; Φ),  (20)

which sets a natural upper limit for the maximum S that might be useful.
3 OPTIMAL FRONTIER

As in Alemi et al. (2018), under mild assumptions about the variational distributional families, it can be argued that the surface is monotonic in all of its arguments. The optimal surface in the infinite family limit can be characterized as a convex polytope (Equation (15)). In practice we will be in the realistic setting corresponding to finite parametric families such as neural network approximators. We then expect that there is an irrevocable gap that opens up in the variational bounds. Any failure of the distributional families to model the correct corresponding marginal in P means that the space of all realizable R, C, D, S values will be some convex relaxation of the optimal feasible surface. This surface will be described by some function f(R, C, D, S) = 0, which means we can identify points on the surface as a function of one functional with respect to the others (e.g. R = R(C, D, S)). Finding points on this surface equates to solving a constrained optimization problem, e.g.

min_{q(z) q(x|z) q(y|z) p(z|x,θ) p(θ|{x,y})} R   such that   D = D_0, S = S_0, C = C_0,  (21)

or, equivalently, the unconstrained Lagrangian problem

min R + δD + γC + σS.  (22)

Here δ, γ, σ are Lagrange multipliers that impose the constraints. They each correspond to the partial derivative of the rate at the solution with respect to their corresponding functional, keeping the others fixed.
Notice that this single objective encompasses a wide range of existing techniques.
• If we retain C alone, we are doing traditional supervised learning and our network will
learn to be deterministic in its activations and parameters.
⁹ In Appendix A we consider taking this bound seriously to limit the space to only three functionals: S, C and V ≥ I(Z_i; Θ|X_i).
¹⁰ This could help explain the observation that oftentimes putting additional modeling power on the prior rather than the encoder can give improvements in the ELBO (Chen et al., 2016).
Examples of the behavior of all of these objectives on a simple toy model are shown in Appendix H.

Notice that all of these previous approaches describe low dimensional sub-surfaces of the optimal three dimensional frontier. These approaches were all interested in different domains; some were focused on supervised prediction accuracy, others on learning a generative model. Depending on your specific problem and downstream tasks, different points on the optimal frontier will be desirable. However, instead of choosing a single point on the frontier, we can now explore a region of the surface to see what class of solutions is possible within the modeling choices. By simply adjusting the three control parameters δ, γ, σ, we can move smoothly across the entire frontier, interpolating between all of these objectives and beyond.
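As a sketch of how a single scalar objective can sweep across these special cases, the snippet below builds the frontier objective of Equation (22), with an extra multiplier ρ on R as used in Appendix H; the particular multiplier settings shown are illustrative assumptions, not values prescribed by the paper.

```python
def therml_objective(R, C, D, S, gamma, delta, sigma, rho=1.0):
    """Scalar frontier objective; Equation (22) corresponds to rho = 1:
       rho*R + delta*D + gamma*C + sigma*S."""
    return rho * R + delta * D + gamma * C + sigma * S

# Illustrative (hypothetical) multiplier settings tracing out special cases:
settings = {
    "supervised, C only":             dict(gamma=1.0, delta=0.0, sigma=0.0, rho=0.0),
    "unsupervised, R + D (VAE-like)": dict(gamma=0.0, delta=1.0, sigma=0.0, rho=1.0),
    "semi-supervised, R + D + C":     dict(gamma=1.0, delta=1.0, sigma=0.0, rho=1.0),
    "full objective with S":          dict(gamma=1.0, delta=1.0, sigma=0.5, rho=1.0),
}

# Evaluate the scalar objective at an arbitrary example point on the surface.
for name, m in settings.items():
    print(name, "->", therml_objective(R=1.2, C=0.4, D=2.5, S=3.0, **m))
```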
3.1 OPTIMIZATION

So far we've considered explicit forms of the objective in terms of the four functionals. For S this would require some kind of tractable approximation to the posterior over the parameters of our encoding distribution¹¹. Alternatively, we can formally describe the exact solution to our minimization problem:

min S   s.t.   R = R_0, C = C_0, D = D_0.  (23)
Recall that S measures the relative entropy of our parameter distribution with respect to the prior q(θ). As such, the solution that minimizes the relative entropy subject to some constraints is a generalized Boltzmann distribution (Jaynes, 1957):

p*(θ|{x, y}) = (q(θ)/Z) e^{−(R+δD+γC)/σ}.  (24)
¹¹ As in Blundell et al. (2015); Achille & Soatto (2017).
Here Z is the partition function, the normalization constant for the distribution:

Z = ∫ dθ q(θ) e^{−(R+δD+γC)/σ}  (25)
This suggests an alternative method for finding points on the optimal frontier. We could turn the un-
constrained Lagrange optimization problem that required some explicit choice of tractable posterior
distribution over parameters into a sampling problem for a richer implicit distribution.
A naive way to draw samples from this posterior would be to use Stochastic Gradient Langevin
Dynamics or its cousins (Welling & Teh, 2011; Chen et al., 2014; Ma et al., 2015) which, in practice,
would look like ordinary stochastic gradient descent (or its cousins like momentum) for the objective
R + δD + γC, with injected noise. By choosing the magnitude of the noise relative to the learning
rate, the effective temperature σ can be controlled.
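A minimal sketch of such an SGLD-style update is given below, assuming we can evaluate minibatch gradients of the objective U = R + δD + γC and of log q(θ), and targeting p*(θ) ∝ q(θ) exp(−U(θ)/σ); the quadratic toy target and all constants are hypothetical, chosen only to make the script self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgld_step(theta, grad_objective, grad_log_prior, lr, sigma):
    """One SGLD step targeting p*(theta) ∝ q(theta) exp(-U(theta)/sigma).

    grad_objective: gradient of U(theta) = R + delta*D + gamma*C (minibatch in practice)
    grad_log_prior: gradient of log q(theta)
    The injected noise has std sqrt(2 * lr * sigma); its size relative to the
    learning rate sets the effective temperature sigma.
    """
    noise = rng.normal(size=theta.shape)
    drift = grad_objective(theta) - sigma * grad_log_prior(theta)
    return theta - lr * drift + np.sqrt(2.0 * lr * sigma) * noise

# Tiny illustrative target: hypothetical quadratic objective, standard normal prior.
grad_U = lambda th: 4.0 * (th - 1.0)          # U(theta) = 2 (theta - 1)^2
grad_log_q = lambda th: -th                   # q(theta) = N(0, 1)

theta = np.zeros(1)
samples = []
for t in range(5000):
    theta = sgld_step(theta, grad_U, grad_log_q, lr=1e-3, sigma=0.5)
    if t > 1000:
        samples.append(theta.copy())
print("posterior mean ≈", np.mean(samples), " std ≈", np.std(samples))
```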
There is increasing evidence that the stochastic part of stochastic gradient descent itself is enough
to turn SGD less into an optimization procedure and more into an approximate posterior sam-
pler (Mandt et al., 2017; Smith & Le, 2017; Achille & Soatto, 2017; Zhang et al., 2018; Chaudhari
& Soatto, 2017), where hyperparameters such as the learning rate and batch size set the effective
temperature. If ordinary stochastic gradient descent is doing something more akin to sampling from
a posterior and less like optimizing to some minimum, it would help explain improved performance
through ensemble averages of different points along trajectories (Huang et al., 2017).
When viewed in this light, Equation 24 describes the optimal posterior for the parameters so as to ensure the minimal divergence between worlds P and Q. q(θ) plays the role of the prior over parameters, but our overall objective is minimized when

q(θ) = p(θ) = ⟨p(θ|x^N, y^N)⟩_{p(x^N, y^N)}.  (26)
That is, when our prior is the marginal of the posteriors over all possible datasets drawn from the
true distribution. A fair draw from this marginal is to take a sample from the posterior obtained on
a different but related dataset. Insomuch as ordinary SGD training is an approximate method for
drawing a posterior sample, the common practice of fine-tuning a pretrained network on a related
dataset is using a sample from the optimal prior as our initial parameters. The fact that fine-tuning
approximates use of an optimal prior presumably helps explain its broad success.
If we identify our true goal not as optimizing some objective but instead directly sampling from
Equation 24, we can consider alternative approaches to define our learning dynamics, such as par-
allel tempering or population annealing (Machta & Ellis, 2011). Alternatively, we could, instead of
adopting variational bounds on the mutual informations, consider other mutual information bounds
such as those in Ishmael Belghazi et al. (2018); van den Oord et al. (2018). Perhaps our priors can be fit, provided we form estimates of the expectation over datasets (e.g. by bootstrapping or jackknifing our dataset (DasGupta, 2008)).
4 THERMODYNAMICS

So far we have described a framework for learning that involves finding points that lie on a convex three-dimensional surface described in terms of four functional coordinates R, C, D, S. Interestingly, this is all that is required to establish a formal connection to thermodynamics, which similarly is little more than the study of exact differentials (Sethna, 2006; Finn, 1993).
Whereas previous approaches connecting thermodynamics and learning (Parrondo et al., 2015; Still,
2017; Still et al., 2012) have focused on describing the thermodynamics and statistical mechanics
of physical realizations of learning systems (i.e. the heat bath in these papers is a physical heat
bath at finite temperature), in this work we make a formal analogy to the structure of the theory of
thermodynamics, without any physical content.
The optimal frontier creates an equivalence class of states, being the set of all states that minimize
as much as possible the distortion introduced in projecting world P onto a set of distributions that
respect the conditions in Q. The surface satisfies some equation f(R, C, D, S) = 0 which we can use to describe any one of these functionals in terms of the rest, e.g. R = R(C, D, S). This function is entire, and so we can equate partial derivatives of the function with differentials of the functionals¹²:

dR = (∂R/∂C)|_{D,S} dC + (∂R/∂D)|_{C,S} dD + (∂R/∂S)|_{C,D} dS.  (27)
Since the function is smooth and convex, instead of identifying the surface of optimal rates in terms of the functionals C, D, S, we could just as well describe the surface in terms of the partial derivatives by applying a Legendre transformation. We will name the partial derivatives:

γ ≡ −(∂R/∂C)|_{D,S}    δ ≡ −(∂R/∂D)|_{C,S}    σ ≡ −(∂R/∂S)|_{C,D}.  (28)
These measure the exchange rate for turning rate into reduced distortion, reduced classification error,
or increased entropy, respectively.
The functionals R, C, D, S are analogous to extensive thermodynamic variables such as volume,
entropy, particle number, magnetic field, charge, surface area, length and energy which grow as the
system grows, while the named partial derivatives γ, δ, σ are analogous to the intensive, generalized
forces in thermodynamics corresponding to their paired state variable, such as pressure, temperature,
chemical potential, magnetization, electromotive force, surface tension, elastic force, etc. Just as
in thermodynamics, the extensive functionals are defined for any state, while the intensive partial
derivatives are only well defined for equilibrium states, which in our language are the states lying
on the optimal surface¹³.
Recasting our total differential:
dR = −γdC − δdD − σdS, (29)
we create a law analogous to the First Law of Thermodynamics. In thermodynamics the First Law is
often taken to be a statement about the conservation of energy, and by analogy here we could think
about this law as a statement about the conservation of information. Granted, the actual content
of the law is fairly vacuous, equivalent only to the statement that there exists a scalar function
R = R(C, D, S) defining our surface and its partial derivatives.
Requiring that Equation 29 be an exact differential has mathematically trivial but intuitively non-
obvious implications that relate various partial derivatives of the system to one another, akin to the
Maxwell Relations in thermodynamics. For example, requiring that mixed second partial derivatives
are symmetric establishes that:
∂²R/(∂D∂C) = ∂²R/(∂C∂D)  ⟹  (∂δ/∂C)|_D = (∂γ/∂D)|_C.  (30)
This equates the result of two very different experiments. In the experiment encoded in the partial
derivative on the left, one would measure the change in the derivative of the R − D curve (δ) as
a function of the classification error (C) at fixed distortion (D). On the right one would measure
the change in the derivative of the R − C curve (γ) as a function of the distortion (D) at fixed
classification error (C). As different as these scenarios appear, they are mathematically equivalent.
A full set of Maxwell relations can be found in Appendix F.
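As a sanity check of what a relation like Equation (30) asserts, the sketch below verifies it by finite differences on a smooth, hypothetical stand-in for the frontier R = R(C, D, S); any twice-differentiable surface obeys the same identity.

```python
import numpy as np

# Hypothetical smooth stand-in for the frontier R = R(C, D, S); the identity in
# Eq. (30) holds for any twice-differentiable surface, this one included.
def R(C, D, S):
    return np.exp(-C) + np.exp(-D) + 1.0 / (1.0 + S) + 0.1 * C * D

h = 1e-4
C0, D0, S0 = 0.5, 1.0, 2.0

def delta(C, D, S):   # delta = -dR/dD at fixed C, S (central difference)
    return -(R(C, D + h, S) - R(C, D - h, S)) / (2 * h)

def gamma(C, D, S):   # gamma = -dR/dC at fixed D, S (central difference)
    return -(R(C + h, D, S) - R(C - h, D, S)) / (2 * h)

# Maxwell relation (Eq. 30): d(delta)/dC at fixed D  ==  d(gamma)/dD at fixed C
lhs = (delta(C0 + h, D0, S0) - delta(C0 - h, D0, S0)) / (2 * h)
rhs = (gamma(C0, D0 + h, S0) - gamma(C0, D0 - h, S0)) / (2 * h)
print(lhs, rhs)   # both ≈ -0.1, the mixed partial of this surrogate surface
```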
We can additionally take and name higher order partial derivatives, analogous to the susceptibilities
of thermodynamics like bulk modulus, the thermal expansion coefficient, or heat capacities. For
instance, we can define the analog of heat capacity for our system, a sort of rate capacity at constant
distortion:
K_D ≡ (∂R/∂σ)|_D.  (31)
¹² (∂X/∂Y)|_Z denotes the partial derivative of X with respect to Y holding Z constant.
¹³ For more discussion of equilibrium states, and how they connect with more intuitive notions of equilibrium, see Appendix G.
Just as in thermodynamics, these susceptibilities may offer useful ways to characterize and quan-
tify the systematic differences between model families. Perhaps general scaling laws can be found
between susceptibilities and network widths, or depths, or number of parameters or dataset size. Di-
vergences or discontinuities in the susceptibilities are the hallmark of phase transitions in physical
systems, and it is reasonable to expect to see similar phenomena for certain models.
A great many first-, second- and third-order partial derivatives in thermodynamics are given unique names. This is because the quantities are particularly useful for comparing different physical systems. We expect a subset of the first, second and higher order partial derivatives of the base functionals will prove similarly useful for comparing, quantifying, and understanding differences between modeling choices.
Even when doing deterministic training, training is non-invertible (Maclaurin et al., 2015), and we
need to contend with and track the entropy (S) term. We set the parameters of our networks initially
with a fair draw from some prior distribution q(θ). The training procedure acts as a Markov process
on the distribution of parameters, transforming it from the prior distribution into some modified
distribution, the posterior p(θ|x^N, y^N). Optimization is a many-to-one function that, in the ideal limiting case, maps all possible initializations to a single global optimum. In this limiting case S would be divergent, and there is nothing to prevent us from memorizing the training set.
The Second Law of Thermodynamics states that the entropy of an isolated system tends to increase.
All systems tend to disorder, and this places limits on the maximum possible efficiency of heat
engines.
Formally, there are many statements akin to the Second Law of Thermodynamics that can be made about Markov chains generally (Cover & Thomas, 2012). The central one is that for any two distributions p_n, q_n both evolving according to the same Markov process (n marks the time step), the relative entropy D_KL[p_n; q_n] is monotonically decreasing with time. This establishes that for a stationary Markov chain, the relative entropy to the stationary state D_KL[p_n; p_∞] monotonically decreases¹⁴.
In our language, we can make strong statements about dynamics that target points on the optimal
frontier, or dynamics that implement a relaxation towards equilibrium. There is a fundamental
distinction between states that live on the frontier and those off of it, analogous to the distinction
between equilibrium and non-equilibrium states in thermodynamics.
Any equilibrium distribution can be expressed in the form of Equation (24) and identified by its partial derivatives γ, δ, σ. If we name the objective in Equation (22)

J(γ, δ, σ) ≡ R + δD + γC + σS,  (32)

the value this objective takes for any equilibrium distribution can be shown to be given by the log partition function (Equation (25)):

min J(γ, δ, σ) = −σ log Z(γ, δ, σ),  (33)

and the KL divergence between any distribution over parameters p(θ) and an equilibrium distribution is:

D_KL[p(θ); p*(θ; γ, δ, σ)] = ΔJ/σ  (34)

ΔJ ≡ J^noneq(p; γ, δ, σ) − J(γ, δ, σ),  (35)

where J^noneq is the non-equilibrium objective:

J^noneq(p; γ, δ, σ) = ⟨R + δD + γC + σS⟩_{p(θ)}.  (36)
For a stationary Markov process whose stationary distribution is an equilibrium distribution, the KL divergence to the stationary distribution must monotonically decrease at each step. This means that ΔJ/σ must decrease monotonically; that is, our objective J must decrease monotonically:

J_{t=0} ≥ J_t ≥ J_{t+1} ≥ J_{t=∞}.  (37)
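The monotone decrease of the relative entropy to the stationary distribution is easy to see numerically; the following sketch uses an arbitrary small discrete Markov chain (a hypothetical example, not a learning dynamics) and prints D_KL[p_n; p_∞] over a few steps.

```python
import numpy as np

rng = np.random.default_rng(2)

# A small arbitrary row-stochastic transition matrix (hypothetical example).
T = rng.random((4, 4))
T /= T.sum(axis=1, keepdims=True)

# Its stationary distribution: the left eigenvector with eigenvalue 1.
vals, vecs = np.linalg.eig(T.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = np.abs(pi) / np.abs(pi).sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.1, 0.1, 0.1])   # arbitrary initial distribution
for n in range(6):
    print(f"step {n}:  KL[p_n ; p_inf] = {kl(p, pi):.6f}")   # non-increasing
    p = p @ T                          # one step of the Markov chain
```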
¹⁴ For discrete state Markov chains, this implies that if the stationary distribution is uniform, the entropy of the distribution H(p_n) is strictly increasing.
5 CONCLUSION
We have formalized representation learning as the process of minimizing the distortion introduced
when we project the real world (World P ) onto the world we desire (World Q). The projection is nat-
urally described by a set of four functionals which variationally bound relevant mutual informations
in the real world. Relations between the functionals describe an optimal three-dimensional surface
in a four dimensional space of optimal states. A single learning objective targeting points on this
optimal surface can express a wide array of existing learning objectives spanning from unsupervised
learning to supervised learning and everywhere in between. The geometry of the optimal frontier
suggests a wide array of identities involving the functionals and their partial derivatives. This offers
a direct analogy to thermodynamics independent of any physical content. By analogy to thermody-
namics, we can begin to develop new quantitative measures and relationships amongst properties of
our models that we believe will offer a new class of theoretical understanding of learning behavior.
ACKNOWLEDGEMENTS
The authors would like to thank Jascha Sohl-Dickstein, Mallory Alemi, Rif A. Saurous, Ali Rahimi, Kevin
Murphy, Ben Poole, Danilo Rezende and Matt Hoffman for helpful discussions and feedback on the
draft.
REFERENCES
L Accardi. De Finetti, 2018. URL http://www.encyclopediaofmath.org/index.php?title=De_Finetti_theorem&oldid=12884.
A. Achille and S. Soatto. Emergence of Invariance and Disentangling in Deep Representations.
Proceedings of the ICML Workshop on Principled Approaches to Deep Learning, 2017.
Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information
bottleneck. arXiv:1612.00410, 2016. URL http://arxiv.org/abs/1612.00410.
Alexander A Alemi, Ben Poole, Joshua V Dillon, Rif A Saurous, and Kevin Murphy. Fixing a
broken ELBO. ICML 2018, 2018. URL http://arxiv.org/abs/1711.00464.
William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, complexity, and learning. Neu-
ral computation, 13(11):2409–2463, 2001.
C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight Uncertainty in Neural Networks. arXiv:1505.05424, May 2015. URL https://arxiv.org/abs/1505.05424.
Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference,
converges to limit cycles for deep networks. arXiv, 2017. URL https://arxiv.org/abs/
1710.11029.
T. Chen, E. B. Fox, and C. Guestrin. Stochastic Gradient Hamiltonian Monte Carlo.
arXiv:1402.4102, February 2014. URL https://arxiv.org/abs/1402.4102.
X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel.
Variational Lossy Autoencoder. arXiv, 2016. URL https://arxiv.org/abs/1611.
02731.
Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
Imre Csiszár and František Matúš. Information projections revisited. IEEE Transactions on Infor-
mation Theory, 49(6):1474–1490, 2003.
Anirban DasGupta. Edgeworth expansions and cumulants. In Asymptotic Theory of Statistics and
Probability, pp. 185–201. Springer, 2008.
Colin BP Finn. Thermal physics. CRC Press, 1993.
Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. Multivariate information bottle-
neck. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp.
152–161. Morgan Kaufmann Publishers Inc., 2001.
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick,
Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning Basic Visual Concepts with a Con-
strained Variational Framework. 2017.
G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger. Snapshot Ensembles: Train 1, get M for free. arXiv:1704.00109, March 2017. URL https://arxiv.org/abs/1704.00109.
M. Ishmael Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R Devon
Hjelm. MINE: Mutual Information Neural Estimation. arXiv, 2018. URL https://arxiv.
org/abs/1801.04062.
Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957.
D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling. Semi-Supervised Learning with Deep
Generative Models. arXiv: 1406.5298, June 2014. URL https://arxiv.org/abs/1406.
5298.
Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. 2014.
Y.-A. Ma, T. Chen, and E. B. Fox. A Complete Recipe for Stochastic Gradient MCMC.
arXiv:1506.04696, June 2015. URL https://arxiv.org/abs/1506.04696.
J. Machta and R. S. Ellis. Monte Carlo Methods for Rough Free Energy Landscapes: Population
Annealing and Parallel Tempering. Journal of Statistical Physics, 144:541–553, August 2011.
doi: 10.1007/s10955-011-0249-0. URL https://arxiv.org/abs/1104.1138.
Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter opti-
mization through reversible learning. In Proceedings of the 32Nd International Conference on In-
ternational Conference on Machine Learning - Volume 37, ICML’15, pp. 2113–2122. JMLR.org,
2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045343.
S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic Gradient Descent as Approximate Bayesian
Inference. arXiv: 1704.04289, April 2017. URL https://arxiv.org/abs/1704.04289.
Juan MR Parrondo, Jordan M Horowitz, and Takahiro Sagawa. Thermodynamics of informa-
tion. Nature physics, 11(2):131–139, 2015. URL http://jordanmhorowitz.mit.edu/
sites/default/files/documents/natureInfo.pdf.
James Sethna. Statistical mechanics: entropy, order parameters, and complexity, volume 14. Ox-
ford University Press, 2006. URL http://pages.physics.cornell.edu/˜sethna/
StatMech/EntropyOrderParametersComplexity.pdf.
Noam Slonim, Gurinder S Atwal, Gasper Tkacik, and William Bialek. Estimating mutual information and multi-information in large networks. arXiv, 2005. URL https://arxiv.org/abs/cs/0502017.
S. L. Smith and Q. V. Le. A Bayesian Perspective on Generalization and Stochastic Gradient De-
scent. arXiv:1710.06451, October 2017. URL https://arxiv.org/abs/1710.06451.
S. Still. Thermodynamic cost and benefit of data representations. arXiv: 1705.00612, April 2017.
URL https://arxiv.org/abs/1705.00612.
S. Still, D. A. Sivak, A. J. Bell, and G. E. Crooks. Thermodynamics of Prediction. Physical Re-
view Letters, 109(12):120604, September 2012. doi: 10.1103/PhysRevLett.109.120604. URL
https://arxiv.org/abs/1203.3271.
In principle, these three functionals still fully characterize the distortion introduced in our information projection. Notice that this new functional requires the variational approximation q(z_i|x_i), a variational approximation to the marginal over our parameter distribution. Notice also that we no longer require a variational approximation to p(x_i|z_i). That is, in this formulation we no longer require any form of decoder, or synthesis in our original data space X. While equivalent in its information projection, this more naturally corresponds to the model of our desired world Q:

q(x, y, φ, z, θ) = q(φ) q(θ) ∏_i q(z_i|x_i) q(y_i|z_i),  (41)

depicted below in Figure 2. Here we desire, not the joint generative model X ← Z → Y, but the predictive model X → Z → Y.
Figure 2: Modified graphical model for world Q (nodes Θ, Φ, and the chain X → Z → Y), instead of Figure 3b, the world we desire, which satisfies the joint density in Equation 41. Notice that this graphical model encodes all of the same conditional independencies as the original.
We can imagine tracing out this, now three dimensional, frontier that still explores a space consistent
with our original graphical model, but wherein we no longer have to do any form of direct variational
synthesis.
B BAYESIAN INFERENCE
Just as in Appendix A, we can consider alternative graphical models for World P. In particular, we can consider a simplified scenario, depicted in Figure 3, corresponding to the usual situation in Bayesian inference.
Figure 3: (a) Graphical model for world P, depicting Bayesian inference as learning a single global representation of data (nodes Φ, X, Θ). (b) Graphical model for world Q, the world we desire, the usual generative model of Bayesian inference.
Here we have just data, generated by some process and we form a single global representation of the
dataset. The world we desire, World Q, corresponds to the usual Bayesian modeling assumption,
whereby our own global representation generates the data conditionally independently.
For these sets of graphical models, we have the following information projection:

J_bayes = min_{q∈Q} D_KL[p; q] = I_P − I_Q = ∑_i I(X_i; Φ) + I(Θ; X^N) − ∑_i I(X_i; Θ)  (43)

S ≡ ⟨log [p(θ|X^N) / q(θ)]⟩ ≥ I(Θ; X^N)  (44)
This entropy gives an upper bound on the mutual information between our parameters and the dataset; it requires a variational approximation, q(θ), a prior, to the true marginal of the posterior p(θ|X^N) over datasets.
Just as in our earlier paper (Alemi et al., 2018) we could trace out the frontier by doing the con-
strained optimization problem:
min S + βU (47)
Using a temperature to regulate the relative contribution of the prior and posterior has been used broadly, but ordinarily doesn't have a well-founded justification. Here we can unapologetically vary
¹⁵ U ≡ ∑_i U_i
the relative contributions of the prior and likelihood, since in the representational framework those are both variational approximations that might have differing abilities to model the true distributions they approximate. By varying the β parameter here, just as in the β-VAE case (Alemi et al., 2018), we can smoothly explore the frontier within our modeling family, smoothly controlling the amount of information our model extracts from the dataset. This can help us control for overfitting in a principled way.
Additionally, we could try to relax our variational approximations, and fit our prior, assuming we
could estimate an expectation over datasets. One way to do that is with a bootstrap or jackknife
procedure (DasGupta, 2008).
C DISCRIMINATIVE MODELS
Similarly, we could consider the situation of usual discriminative learning, depicted in Figure 4.
Figure 4: (a) Graphical model P, depicting conditionally independent data with a global representation (nodes Φ, Θ, X, Y). (b) Graphical model Q, depicting a discriminative generative model.
For these sets of graphical models, we have the following information projection:

J_d = I_P − I_Q = I(Θ; X^N, Y^N) + ∑_i [I(X_i; Φ) + I(Y_i; X_i, Φ) − I(Y_i; X_i, Θ)].  (51)
S ≡ ⟨log [p(θ|X^N, Y^N) / q(θ)]⟩_P ≥ I(Θ; X^N, Y^N)  (52)

This entropy gives an upper bound on the mutual information between our parameters and the dataset; it requires a variational approximation, q(θ), a prior, to the true marginal of the posterior p(θ|X^N, Y^N) over datasets.
Here again we can smoothly explore the frontier set by the variational approximations given by the prior and likelihood by simply adjusting β. We might additionally consider going beyond the fixed variational approximations and pushing the frontier by fitting the prior, or likelihood.
D FUNCTIONAL INEQUALITIES
Here we show the details for deriving Equation (14).
We start by expressing our functional inequalities, being explicit about the presence of the relative informations of our variational approximations.
I(Θ; X^N, Y^N) = S − D_KL[p(θ); q(θ)]  (56)

I(X_i; Z_i) = H(X_i) − D_i + D_KL[p(x_i|z_i); q(x_i|z_i)]  (57)

I(Y_i; Z_i) = H(Y_i) − C_i + D_KL[p(y_i|z_i); q(y_i|z_i)]  (58)

I(Z_i; X_i, Θ) = R_i − D_KL[p(z_i); q(z_i)]  (59)

Combining Equations (7) and (56) to (59):

J = S + D + C + R − D_KL[p; q] − ∑_i [H(X_i) + H(Y_i) − I(X_i; Φ) − I(Y_i; X_i, Φ)] ≥ 0.  (60)
Here we have collected all of the KL divergences for our variational approximations:

D_KL[p; q] ≡ D_KL[p(θ); q(θ)] + ∑_i D_KL[p(x_i|z_i); q(x_i|z_i)] + ∑_i [D_KL[p(y_i|z_i); q(y_i|z_i)] + D_KL[p(z_i); q(z_i)]].  (61)
We can simplify:

H(X_i) − I(X_i; Φ) = H(X_i|Φ)  (62)

H(Y_i) − I(Y_i; X_i, Φ) = H(Y_i|X_i, Φ)  (63)

H(Y_i|X_i, Φ) + H(X_i|Φ) = H(Y_i, X_i|Φ)  (64)

to obtain:

J = S + D + C + R − D_KL[p; q] − ∑_i H(Y_i, X_i|Φ),  (65)
which yields:

S + D + C + R = J + D_KL[p; q] + ∑_i H(Y_i, X_i|Φ)  (66)

S + D + C + R ≥ J + ∑_i H(Y_i, X_i|Φ)  (67)

S + D + C + R ≥ ∑_i H(Y_i, X_i|Φ)  (68)
E IDENTITIES
We will utilize some basic information identities, first by definition
I(A; B) = H(A) − H(A|B) (70)
= H(B) − H(B|A) (71)
= H(A) + H(B) − H(A, B) (72)
= H(A, B) − H(A|B) − H(B|A) (73)
By the chain rule of mutual information:
I(A, B; C) = I(A; C) + I(B; C|A) ≥ 0 (74)
Mutual informations and conditional mutual informations are always positive:
I(A; B) ≥ 0 (75)
I(A; B|C) ≥ 0 (76)
We will also use the following rule for conditional entropies
H(B|A) = H(A, B) − H(A) (77)
F MAXWELL RELATIONS
We can also define other potentials analogous to the alternative thermodynamic potentials such as enthalpy, free energy, and Gibbs free energy by performing partial Legendre transformations. For instance, we can define a free rate:
F (C, D, σ) ≡ R + σS (78)
dF = −γdC − δdD + Sdσ. (79)
The free rate measures the rate of our system, not as a function of S (something difficult to keep
fixed), but in terms of σ, a parameter in our loss or optimal posterior.
The free rate gives rise to other Maxwell relations such as

(∂S/∂C)|_σ = −(∂γ/∂σ)|_C,  (80)
which equates how much each additional bit of entropy (S) buys you in terms of classification error
(C) at fixed effective temperature (σ), to a seemingly very different experiment where you measure
the change in the effective supervised tension (γ, the slope on the R − C curve) versus effective
temperature (σ) at a fixed classification error (C).
Here we enumerate a complete set of Maxwell relations. First, if we write R = R(D, C, S):

dR = −γdC − δdD − σdS

(∂γ/∂D)|_C = (∂δ/∂C)|_D  (81)

(∂δ/∂S)|_D = (∂σ/∂D)|_S  (82)

(∂γ/∂S)|_C = (∂σ/∂C)|_S  (83)

(∂γ/∂σ)|_C = −(∂S/∂C)|_σ  (84)

(∂δ/∂σ)|_D = −(∂S/∂D)|_σ  (85)

(∂C/∂D)|_γ = −(∂δ/∂γ)|_D  (87)

(∂C/∂S)|_γ = −(∂σ/∂γ)|_S  (88)

(∂C/∂σ)|_γ = (∂S/∂γ)|_σ  (90)

(∂γ/∂δ)|_C = −(∂D/∂C)|_δ  (92)

(∂D/∂S)|_δ = −(∂σ/∂δ)|_S  (93)

Finally, transforming to B = R + δD + σS = B(δ, C, σ):

dB = −γdC + Ddδ + Sdσ  (94)

(∂γ/∂σ)|_C = −(∂S/∂C)|_σ  (95)

(∂S/∂δ)|_σ = (∂D/∂σ)|_δ  (96)
G ZEROTH LAW OF LEARNING
A central concept in thermodynamics is the notion of equilibrium. The so-called Zeroth Law of thermodynamics defines thermal equilibrium through a sort of transitivity property of systems (Finn, 1993): if system A is in thermal equilibrium with system C, and system B is separately in thermal equilibrium with system C, then systems A and B are in thermal equilibrium with each other.
When any sub-part of a system is in thermal equilibrium with any other sub-part, the system is said
to be an equilibrium state.
In our framework, the points on the optimal surface are analogous to the equilibrium states, for
which we have well defined partial derivatives. We can demonstrate that this notion of equilib-
rium agrees with a more intuitive notion of equilibrium between coupled systems. Imagine we
have two different models, characterized by their own sets of distributions: model A is defined by p_A(z|x, θ), p_A(θ|{x, y}), q_A(z), and model B by p_B(z|x, θ), p_B(θ|{x, y}), q_B(z). Both models will have their own value for each of the functionals: R_A, S_A, D_A, C_A and R_B, S_B, D_B, C_B. Each model defines its own representation Z_A, Z_B. Now imagine coupling the models by forming the joint representation Z_C = (Z_A, Z_B), obtained by concatenating the two representations together. Now the governing distributions over Z are simply the product of the two models' distributions, e.g. q_C(z_C) = q_A(z_A) q_B(z_B). Thus the rate R_C and entropy S_C for the combined model are the sums of the individual models': R_C = R_A + R_B, S_C = S_A + S_B.
Now imagine we sample new states for the combined system which are maximally entropic, with the constraint that the combined rate stays constant:

min S  s.t.  R = R_C   ⟹   p(θ|{x, y}) = (q(θ)/Z) e^{−R/σ}.  (97)
For the expectation of the two rates to be unchanged after they have been coupled and evolved holding their total rate fixed, we must have

−(1/σ_A) R_A − (1/σ_B) R_B = −(1/σ_C) R_C = −(1/σ_C)(R_A + R_B)   ⟹   σ_A = σ_B = σ_C.  (98)
Therefore, we can see that σ, the effective temperature, allows us to identify whether two systems
are in thermal equilibrium with one another. Just as in thermodynamics, if two systems at different
temperatures are coupled, some transfer takes place.
H EXPERIMENTS
We show examples of models trained on a toy dataset for all of the different objectives we define
above. The dataset has both an infinite data variant, where overfitting is not a problem, and a finite
data variant, where overfitting can be clearly observed for both reconstruction and classification.
Data generation. We follow the toy model from Alemi et al. (2018), but add an additional classifi-
cation label in order to explore supervised and semi-supervised objectives. The true data generating
distribution is as follows. We first sample a latent binary variable, z ∼ Ber(0.7), then sample a
latent 1D continuous value from that variable, h|z ∼ N(h|µ_z, σ_z), and finally we observe a discretized value, x = discretize(h; B), where B is a set of 30 equally spaced bins, and a discrete label, y = z (so the true label is the latent variable that generated x). We set µ_z and σ_z such that R* ≡ I(x; z) = 0.5 nats in the true generative process, representing the ideal rate target for a latent
variable model. For the finite dataset, we select 50 examples randomly from the joint p(x, y, z). For
the infinite dataset, we directly supply the true full marginal p(x, y) at each iteration during training.
When training on the finite dataset, we evaluate model performance against the infinite dataset so
that there is no error in the evaluation metrics due to a finite test set.
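A sketch of this generative process is shown below; the specific values of µ_z, σ_z and the bin range are hypothetical placeholders, since the paper only states that they are tuned so that I(x; z) = 0.5 nats.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical values for mu_z, sigma_z and the bin range; the paper only states
# that they are chosen so that I(x; z) = 0.5 nats in the true generative process.
MU = {0: -1.0, 1: 1.0}
SIGMA = {0: 1.0, 1: 1.0}
BINS = np.linspace(-4.0, 4.0, 31)   # edges of 30 equally spaced bins

def sample(n):
    z = (rng.random(n) < 0.7).astype(int)                        # z ~ Ber(0.7)
    h = np.array([rng.normal(MU[zi], SIGMA[zi]) for zi in z])    # h | z ~ N(mu_z, sigma_z)
    x = np.clip(np.digitize(h, BINS) - 1, 0, 29)                 # x = discretize(h; B)
    y = z                                                        # the label is the latent z
    return x, y, z

x_train, y_train, z_train = sample(50)   # the finite-data variant uses 50 examples
```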
Model details. We choose to use a discrete latent representation with K = 30 values, with an encoder of the form q(z_i|x_j) ∝ exp[−(w_i^e x_j − b_i^e)²], where z is the one-hot encoding of the latent categorical variable, and x is the one-hot encoding of the observed categorical variable. We use a decoder of the same form, but with different parameters: q(x_j|z_i) ∝ exp[−(w_i^d x_j − b_i^d)²]. We use a classifier of the same form as well: q(y_j|z_i) ∝ exp[−(w_i^c y_j − b_i^c)²]. Finally, we use a variational marginal, q(z_i) = π_i. Given this, the true joint distribution has the form p(x, y, z) = p(x) p(z|x) p(y|x), with marginal p(z) = ∑_x p(x, z), and conditionals p(x|z) = p(x, z)/p(z) and p(y|z) = p(y, z)/p(z).

The encoder is additionally parameterized following Achille & Soatto (2017) by α, a set of learned parameters for a Log Normal distribution of the form log N(−α_i/2, α_i). In total, the model has 184 parameters: 120 weights and biases in the encoder and decoder, 4 weights and biases in the classifier, 30 weights in the marginal, and an additional 30 weights for the α_i parameterizing the stochastic encoder. We initialize the weights so that when σ = 0, there is no noticeable effect on the encoder during training or testing.
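The following sketch spells out one plausible reading of this parameterization (treating the one-hot indices as scalar bin/label indices); the random parameter values, and the exact indexing of the classifier weights, are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 30   # number of latent values, and of observed bins

# Hypothetical (randomly initialized) parameters; the trained model has 184 in total.
w_e, b_e = rng.normal(size=K), rng.normal(size=K)   # encoder: 30 weights + 30 biases
w_d, b_d = rng.normal(size=K), rng.normal(size=K)   # decoder: 30 weights + 30 biases
w_c, b_c = rng.normal(size=2), rng.normal(size=2)   # classifier: 2 weights + 2 biases
pi_logits = rng.normal(size=K)                      # variational marginal q(z_i) = pi_i

def softmax(scores):
    scores = scores - scores.max()
    return np.exp(scores) / np.exp(scores).sum()

def q_z_given_x(j):
    """q(z_i | x_j) ∝ exp[-(w_i^e x_j - b_i^e)^2], with x_j read as the bin index j."""
    return softmax(-(w_e * j - b_e) ** 2)

def q_x_given_z(i):
    """q(x_j | z_i) ∝ exp[-(w_i^d x_j - b_i^d)^2], a bump over the 30 bins."""
    j = np.arange(K)
    return softmax(-(w_d[i] * j - b_d[i]) ** 2)

def q_y_given_z(i):
    """q(y | z_i) ∝ exp[-(w_y^c z_i - b_y^c)^2], one weight/bias per label (assumed indexing)."""
    return softmax(-(w_c * i - b_c) ** 2)

q_z = softmax(pi_logits)   # the learned variational marginal over the 30 latent values
print(q_z_given_x(12).shape, q_x_given_z(3).shape, q_y_given_z(3).shape)
```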
Experiments. In Figure 5, we show the optimal, hand-crafted model for the toy dataset, as well
as a selection of parameterizations of the TherML objective that correspond to commonly-used ob-
jective functions and a few new objective functions not previously described. In the captions, the
parameters are specified with γ, δ, σ as in the main text, as well as ρ, which is a corresponding La-
grange multiplier for R, in order to simplify the parameterization. It just parameterizes the optimal
surface slightly differently. We train all objectives for 10,000 gradient steps. For all of the objectives
described, the model has converged, or come close to convergence, by that point.
Because the model is sufficiently powerful to memorize the dataset, most of the objectives are very
susceptible to overfitting. Only the objective variants that are “regularized” by the S term (parame-
terized by σ) are able to avoid overfitting in the decoder and classifier.
Figure 5: Hand-crafted optimal model. Toy model illustrating the difference between selected points on the three-dimensional optimal surface defined by γ, δ, and σ. See Section 3 for more description of the objectives, and Appendix H for details on the experiment setup. Top (i): three distributions in data space: the true data distribution p(x), the model's generative distribution g(x) = ∑_z q(z) q(x|z), and the empirical data reconstruction distribution d(x) = ∑_{x′} ∑_z p(x′) q(z|x′) q(x|z). Middle (ii): four distributions in latent space: the learned (or computed) marginal q(z), the empirical induced marginal e(z) = ∑_x p(x) q(z|x), the empirical distribution over z values for data vectors in the set X_0 = {x_n : z_n = 0}, which we denote by e(z_0) in purple, and the empirical distribution over z values for data vectors in the set X_1 = {x_n : z_n = 1}, which we denote by e(z_1) in yellow. Bottom: three K × K distributions: (iii) q(z|x), (iv) q(x|z) and (v) q(x′|x) = ∑_z q(z|x) q(x′|z).
Figure 10: Full Objective. σ = 0.5, γ = 1000, δ = 1, ρ = 0.9. Simple demonstration of the behavior
with all terms present in the objective.