THERML: THE THERMODYNAMICS OF MACHINE LEARNING

ABSTRACT

1 INTRODUCTION
Let X, Y be some paired data, for example: a set of images X and their labels Y. We imagine the data comes from some true, unknown data generating process Φ¹, from which we have drawn a training set of N pairs:

T_N ≡ (x^N, y^N) ≡ {x_1, y_1, x_2, y_2, . . . , x_N, y_N} ∼ φ(x^N, y^N).  (1)

We further imagine the process is exchangeable² and the data is conditionally independent given the governing process Φ:

p(x^N, y^N | φ) = ∏_i p(x_i|φ) p(y_i|x_i, φ).  (2)
As machine learners, we believe that by studying the training set, we should be able to infer or predict new draws from the same data generating process. Call a set of M future draws from the data generating process T′_M ≡ {X^M, Y^M} the test set.

The predictive information (Bialek et al., 2001) is the mutual information between the training set and an infinite test set, or equivalently the amount of information the training set provides about the generative process itself:

I_pred(T_N) ≡ lim_{M→∞} I(T_N; T′_M) = I(T_N; Φ) = I(X^N, Y^N; Φ).  (3)
The predictive information measures the underlying complexity of the data generating process (Still, 2014); it is fundamentally limited and must grow sublinearly in the dataset size (Bialek et al., 2001). Hence, the predictive information is a vanishing fraction of the total information in the training set³:

lim_{N→∞} I_pred(T_N) / H(T_N) = 0  (4)
Only a vanishing fraction of the information present in our training data is in any way useful for future tasks: a vanishing fraction of the information contained in the training data is signal, the rest is noise. We claim the goal of learning is to learn a representation of the data, both locally and globally, that captures the predictive information while being maximally compressed: one that separates the signal from the noise.
¹ Here we aim to invoke the same philosophy as in the introduction to Watanabe (2018).
² That is, we imagine the data satisfies De Finetti's theorem, for which infinite exchangeable processes usually can be described by products of conditionally independent distributions, but we don't want to worry too much about the complicated details since there are subtle special cases (Accardi, 2018).
³ Here and throughout, H(A) is used to denote the entropy H(A) = −∑_a p(a) log p(a).
[Figure: graphical models for the two worlds, with nodes Θ, Φ, Z, X, Y.] (a) Graphical model for world P, the real world augmented with a local and global representation. The dashed lines emphasize that θ only depends on the first N data points, the training set. Blue denotes nodes outside our control, while red nodes are under our direct control. (b) Graphical model for world Q, the world we desire. In this world, Z acts as a latent variable for X and Y jointly.

2 A TALE OF TWO WORLDS
We are primarily interested in learning a stochastic local representation of X, call it Z, defined by some parametric distribution of our own design, p(z_i|x_i, θ), with its own parameters θ. A training procedure is a process that assigns a distribution p(θ|x^N, y^N) to the parameters conditioned on the observed dataset. In this way, the parameters of our local parametric map are themselves a global representation of the dataset. With our augmentations, the world now looks like the graphical model in Figure 3a, denoted World P: some data generating process Φ generates a dataset (X^N, Y^N), on which we perform some learning algorithm to obtain parameters p(θ|x^N, y^N), which we can use to form a parametric local representation p(z_i|x_i, θ).
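As a concrete picture of World P's conditional structure, here is a minimal ancestral-sampling sketch; the specific Gaussian choices and noise scales are our own hypothetical placeholders, not distributions used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# All distributional choices below are hypothetical placeholders, used only to make
# the conditional structure of World P executable.
phi = rng.normal()                            # Phi: the true data generating parameters
x = rng.normal(loc=phi, size=10)              # X_i | Phi
y = x + 0.1 * rng.normal(size=10)             # Y_i | X_i, Phi

# A training procedure assigns a distribution p(theta | x^N, y^N); here a noisy
# posterior-mean estimate stands in for a single draw from that distribution.
theta = x.mean() + 0.1 * rng.normal()         # Theta | X^N, Y^N (the global representation)

# The local representation is a stochastic function of each input and theta.
z = theta * x + 0.1 * rng.normal(size=10)     # Z_i | X_i, Theta
```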
World P is what we have. It is not necessarily what we want. What we have to contend with is an unknown distribution over our data. What we want is a world that corresponds to the traditional modeling assumptions, in which Z acts as a latent factor for X and Y, rendering them conditionally independent and leaving no correlations unexplained. Similarly, we would prefer if we could easily marginalize out the dependence on our universal (Φ) and model specific (Θ) parameters. World Q in Figure 3b is the world we want⁴.
We can measure the degree to which the real world aligns with our desires by computing the minimum possible relative information⁵ between our distribution p and any distribution consistent with the conditional dependencies encoded in graphical model Q⁶. It can be shown (Friedman et al., 2001) that this quantity is given by the difference in multi-informations between the two graphical models, as measured in World P:

J ≡ min_{q∈Q} D_KL[p; q] = I_P − I_Q.  (5)
The multi-information (Slonim et al., 2005) of a graphical model is the KL divergence between the joint distribution and the product of all of the marginal distributions, which can be computed as a sum of mutual informations, one for each node in the graph, between itself and its parents:

I_G ≡ ⟨log [p(g^N) / ∏_i p(g_i)]⟩ = ∑_i I(g_i; Pa(g_i))  (6)
In our case:

J = I(Θ; X^N, Y^N) + ∑_i [I(X_i; Φ) + I(Y_i; X_i, Φ) + I(Z_i; X_i, Θ) − I(X_i; Z_i) − I(Y_i; Z_i)].  (7)
⁴ We could consider different alternatives, deciding to relax some of the constraints we imposed in World Q, or generalizing World P by letting the representation depend on X and Y jointly, for instance. What follows demonstrates a general sort of calculus that we can invoke for any specified pair of graphical models. In particular, Appendices A to C discuss alternatives.
⁵ Also known as the KL divergence.
⁶ Note that this is D_KL[p; q*] where q* is the well-known reverse-information projection or moment projection: q* = argmin_{q∈Q} D_KL[p; q] (Csiszár & Matúš, 2003).
This minimal relative information has two terms that are outside our control and can be taken to be constant, but which relate to the predictive information:

∑_i [I(X_i; Φ) + I(Y_i; X_i, Φ)] ≥ ∑_i I(Y_i; X_i) + I_pred(T_N).  (8)
These terms measure the intrinsic complexity of our data. The remaining four terms are:

• I(X_i; Z_i) - which measures how much information our representation contains about the input (X). This should be maximized to ensure our local representation actually represents the input.

• I(Y_i; Z_i) - which measures how much information our representation contains about our auxiliary data. This should be maximized as well to ensure that our local representation is predictive for the labels.

• I(Z_i; X_i, Θ) - which measures how much information the parameters and input determine about our representation. This should be minimized to ensure consistency between worlds, and to ensure we learn compressed local representations. Notice that this is similar to, but distinct from the first term above.

• I(Θ; X^N, Y^N) - which measures how much information the parameters, our global representation, capture about the entire training set. This should be minimized as well; it measures our risk of overfitting the parameters.
These mutual informations are all intractable in general, since we cannot compute the necessary
marginals in closed form, given that we do not have access to the true data generating distribution.
2.1 FUNCTIONALS
Despite their intractability, we can compute variational bounds on these mutual informations.
2.1.1 ENTROPY

S ≡ ⟨log [p(θ|x^N, y^N) / q(θ)]⟩_P ≥ I(Θ; X^N, Y^N)  (10)
The relative entropy in our parameters, or just entropy for short, measures the relative information of the distribution we assign to our parameters in World P after learning from the data (X^N, Y^N), with respect to some data-independent prior q(θ) on the parameters. This is an upper bound on the mutual information between the data and our parameters, and as such can measure our risk of overfitting our parameters.
2.1.2 RATE

R_i ≡ ⟨log [p(z_i|x_i, θ) / q(z_i)]⟩_P ≥ I(Z_i; X_i, Θ)  (11)
The rate measures the complexity of our representation. It is the relative information of a sample-specific representation z_i ∼ p(z|x_i, θ) with respect to our variational marginal q(z). It measures how many bits we actually encode about each sample, and can measure our risk of overfitting our representation. We use R ≡ ∑_i R_i.
⁷ Given this relationship, we could actually reduce the total number of functionals we consider from 4 to 3, as discussed in Appendix A.
2.1.3 CLASSIFICATION ERROR

C_i ≡ ⟨−log q(y_i|z_i)⟩_P ≥ H(Y_i|Z_i)  (12)

The classification error measures the conditional entropy of Y left after conditioning on Z. It is a measure of how much information about Y is left unspecified in our representation. This functional measures our supervised learning performance. We use C ≡ ∑_i C_i.
2.1.4 DISTORTION

D_i ≡ ⟨−log q(x_i|z_i)⟩_P ≥ H(X_i|Z_i)  (13)

The distortion measures the conditional entropy of X left after conditioning on Z. It is a measure of how much information about X is left unspecified in our representation. This functional measures our unsupervised learning performance. We use D ≡ ∑_i D_i.
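To make the four functionals concrete, the following is a minimal numerical sketch (our own illustration, with entirely hypothetical distributional choices and parameter values, not the paper's) of Monte Carlo estimates of the per-example rate, distortion, and classification error for a one-dimensional Gaussian encoder, Gaussian decoder, and logistic classifier; the entropy S would additionally require a distribution over θ itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: x real-valued, y binary (a hypothetical stand-in for (X, Y)).
x = rng.normal(size=(8, 1))
y = (x[:, 0] > 0).astype(float)

# Hypothetical encoder parameters theta: a linear Gaussian encoder p(z|x, theta).
W_enc, b_enc, log_sigma_enc = 1.5, 0.0, np.log(0.5)
# Variational marginal q(z): standard normal.
# Variational decoder q(x|z): linear Gaussian. Variational classifier q(y|z): logistic.
W_dec, b_dec, log_sigma_dec = 0.6, 0.0, np.log(0.7)
W_cls, b_cls = 4.0, 0.0

def gauss_logpdf(v, mean, log_sigma):
    sigma = np.exp(log_sigma)
    return -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((v - mean) / sigma) ** 2

n_mc = 256  # Monte Carlo samples of z per example
mu_z = W_enc * x[:, 0] + b_enc
z = mu_z[None, :] + np.exp(log_sigma_enc) * rng.normal(size=(n_mc, 8))

# Rate:  R_i = E_{z ~ p(z|x_i, theta)} [ log p(z|x_i, theta) - log q(z) ]
log_p_z = gauss_logpdf(z, mu_z[None, :], log_sigma_enc)
log_q_z = gauss_logpdf(z, 0.0, 0.0)
R_i = (log_p_z - log_q_z).mean(axis=0)

# Distortion:  D_i = E[ -log q(x_i | z) ]
log_q_x = gauss_logpdf(x[:, 0][None, :], W_dec * z + b_dec, log_sigma_dec)
D_i = (-log_q_x).mean(axis=0)

# Classification error:  C_i = E[ -log q(y_i | z) ]
logits = W_cls * z + b_cls
log_q_y = y[None, :] * -np.logaddexp(0.0, -logits) + (1 - y[None, :]) * -np.logaddexp(0.0, logits)
C_i = (-log_q_y).mean(axis=0)

print("R =", R_i.sum(), " D =", D_i.sum(), " C =", C_i.sum())
# S would additionally require a distribution over theta itself, e.g. a
# variational posterior over theta relative to a data-independent prior q(theta).
```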
2.2 GEOMETRY

The distributions p(z|x, θ), p(θ|x^N, y^N), q(z), q(x|z), q(y|z) can be chosen arbitrarily. Once chosen, the functionals R, C, D, S take on well-described values. The choice of the five distributional families specifies a single point in a four-dimensional space.

Importantly, the sum of these functionals is a variational upper bound (up to an additive constant) for the minimum possible relative information between worlds (Appendix D):

S + R + C + D ≥ J + ∑_i H(X_i, Y_i|Φ)  (14)
Besides just the upper bound, we can consider the full space of feasible points. Notice that S and R are both themselves upper bounds on mutual informations, and so must be positive semi-definite. If our data is discrete, or if we have discretized it⁸, D and C, which are both upper bounds on conditional entropies, must be positive as well. Along with Equation (14), given that ∑_i H(X_i, Y_i|Φ) is a positive constant outside our control, the space of possible (R, C, D, S) values is at least restricted to be points in the positive orthant with some minimum possible Manhattan distance to the origin:

S + R + C + D ≥ ∑_i H(X_i, Y_i|Φ)    R ≥ 0    S ≥ 0    D ≥ 0    C ≥ 0  (15)
Even in the infinite model family limit, data-processing inequalities on mutual information terms
all defined in a set of variables that satisfy some nontrivial conditional dependencies ensure that
there are regions in this functional space that are wholly out of reach. The surface of the feasible region maps out an optimal frontier: optimal in the degree to which it minimizes the mismatch between our two worlds, subject to constraints on the relative magnitudes of the individual terms. This convex polytope has edges, faces and corners that are identifiable as the optimal solutions for well-known objectives.
This story is a generalization of the story presented in Alemi et al. (2018), which can be considered a two-dimensional projection of this larger space (onto R, D). Within our larger framework we can derive more specific bounds between subsets of the functionals. For instance:

R_i + D_i ≥ H(X_i) + I(Z_i; Θ|X_i).  (16)

This mirrors the bound given in Alemi et al. (2018) where R + D ≥ H(X), which is still true given that all conditional mutual informations are positive semi-definite (H(X) + I(Z; Θ|X) ≥ H(X)), but here we obtain a tighter pointwise bound that has a term measuring how much information about our encoding is revealed by the parameters after conditioning on the input itself. This term
⁸ More generally, if we choose some measure m(x), m(y) on both X and Y, we can define D and C in terms of that measure, e.g. D ≡ ⟨−log [q(x|z)/m(x)]⟩ ≥ H_m(X) − I(X; Z) = H_m(X|Z).
I(Z_i; Θ|X_i) captures the degree to which our local representation is overly sensitive to the particular parameter settings⁹,¹⁰.
2.3 GENERALIZATION

We can evaluate how much information our representations capture about the true data generating process. For instance, I(Z_i; Φ) measures how much information about the true data generating process our local representations capture. Notice that given the conditional dependencies in world P, we have the following Markov chain:

Φ → (X_i, Y_i, Θ) → Z_i  (17)

and so by the Data Processing Inequality (Cover & Thomas, 2012):

I(Z_i; Φ) ≤ I(Z_i; Θ, X_i, Y_i) = I(Z_i; X_i, Θ) + I(Z_i; Y_i|X_i, Θ) = I(Z_i; X_i, Θ) ≤ R_i,  (18)

where I(Z_i; Y_i|X_i, Θ) vanishes because, in world P, Z_i depends only on (X_i, Θ).
The per-instance rate R_i forms an upper bound on the mutual information between our encoding Z_i and the true governing parameters of our data Φ. Similarly, we can establish that:

Φ → (X^N, Y^N) → Θ  ⟹  I(Θ; Φ) ≤ I(Θ; X^N, Y^N) ≤ S.  (19)

S upper bounds the amount of information our encoder's parameters Θ, the global representation of the dataset, can contain about the true process Φ. At the same time:

I(Θ; Φ) ≤ I(X^N, Y^N; Φ) ≤ ∑_i I(X_i, Y_i; Φ),  (20)

which sets a natural upper limit for the maximum S that might be useful.
3 OPTIMAL FRONTIER

As in Alemi et al. (2018), under mild assumptions about the variational distributional families, it can be argued that the surface is monotonic in all of its arguments. The optimal surface in the infinite family limit can be characterized as a convex polytope (Equation (15)). In practice we will be in the realistic setting corresponding to finite parametric families such as neural network approximators. We then expect that there is an irrevocable gap that opens up in the variational bounds. Any failure of the distributional families to model the correct corresponding marginal in P means that the space of all realizable R, C, D, S values will be some convex relaxation of the optimal feasible surface. This surface will be described by some function f(R, C, D, S) = 0, which means we can identify points on the surface as a function of one functional with respect to the others (e.g. R = R(C, D, S)). Finding points on this surface equates to solving a constrained optimization problem, e.g.

min_{q(z) q(x|z) q(y|z) p(z|x,θ) p(θ|{x,y})} R   such that   D = D_0, S = S_0, C = C_0,  (21)

or, equivalently, the unconstrained Lagrangian problem

min R + δD + γC + σS.  (22)

Here δ, γ, σ are Lagrange multipliers that impose the constraints. They each correspond to the partial derivative of the rate at the solution with respect to their corresponding functional, keeping the others fixed.
Notice that this single objective encompasses a wide range of existing techniques.
• If we retain C alone, we are doing traditional supervised learning and our network will
learn to be deterministic in its activations and parameters.
⁹ In Appendix A we consider taking this bound seriously to limit the space to only three functionals: S, C and V ≥ I(Z_i; Θ|X_i).
¹⁰ This could help explain the observation that oftentimes putting additional modeling power on the prior rather than the encoder can give improvements in the ELBO (Chen et al., 2016).
Examples of the behavior of all of these objectives on a simple toy model are shown in Appendix H.

Notice that all of these previous approaches describe low dimensional sub-surfaces of the optimal three dimensional frontier. These approaches were all interested in different domains; some were focused on supervised prediction accuracy, others on learning a generative model. Depending on your specific problem and downstream tasks, different points on the optimal frontier will be desirable. However, instead of choosing a single point on the frontier, we can now explore a region of the surface to see what class of solutions is possible within the modeling choices. By simply adjusting the three control parameters δ, γ, σ, we can move smoothly across the entire frontier, interpolating between all of these objectives and beyond.
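As a sketch of how a single scalar objective can sweep across these special cases, the snippet below builds the frontier objective of Equation (22), with an extra multiplier ρ on R as used in Appendix H; the particular multiplier settings shown are illustrative assumptions, not values prescribed by the paper.

```python
def therml_objective(R, C, D, S, gamma, delta, sigma, rho=1.0):
    """Scalar frontier objective; Equation (22) corresponds to rho = 1:
       rho*R + delta*D + gamma*C + sigma*S."""
    return rho * R + delta * D + gamma * C + sigma * S

# Illustrative (hypothetical) multiplier settings tracing out special cases:
settings = {
    "supervised, C only":             dict(gamma=1.0, delta=0.0, sigma=0.0, rho=0.0),
    "unsupervised, R + D (VAE-like)": dict(gamma=0.0, delta=1.0, sigma=0.0, rho=1.0),
    "semi-supervised, R + D + C":     dict(gamma=1.0, delta=1.0, sigma=0.0, rho=1.0),
    "full objective with S":          dict(gamma=1.0, delta=1.0, sigma=0.5, rho=1.0),
}

# Evaluate the scalar objective at an arbitrary example point on the surface.
for name, m in settings.items():
    print(name, "->", therml_objective(R=1.2, C=0.4, D=2.5, S=3.0, **m))
```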
3.1 OPTIMIZATION

So far we've considered explicit forms of the objective in terms of the four functionals. For S this would require some kind of tractable approximation to the posterior over the parameters of our encoding distribution¹¹. Alternatively, we can formally describe the exact solution to our minimization problem:

min S   s.t.   R = R_0, C = C_0, D = D_0.  (23)
Recall that S measures the relative entropy of our parameter distribution with respect to the prior q(θ). As such, the solution that minimizes the relative entropy subject to some constraints is a generalized Boltzmann distribution (Jaynes, 1957):

p*(θ|{x, y}) = (q(θ)/Z) e^{−(R+δD+γC)/σ}.  (24)
¹¹ As in Blundell et al. (2015); Achille & Soatto (2017).
Here Z is the partition function, the normalization constant for the distribution:

Z = ∫ dθ q(θ) e^{−(R+δD+γC)/σ}  (25)
This suggests an alternative method for finding points on the optimal frontier. We could turn the un-
constrained Lagrange optimization problem that required some explicit choice of tractable posterior
distribution over parameters into a sampling problem for a richer implicit distribution.
A naive way to draw samples from this posterior would be to use Stochastic Gradient Langevin
Dynamics or its cousins (Welling & Teh, 2011; Chen et al., 2014; Ma et al., 2015) which, in practice,
would look like ordinary stochastic gradient descent (or its cousins like momentum) for the objective
R + δD + γC, with injected noise. By choosing the magnitude of the noise relative to the learning
rate, the effective temperature σ can be controlled.
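A minimal sketch of such an SGLD-style update is given below, assuming we can evaluate minibatch gradients of the objective U = R + δD + γC and of log q(θ), and targeting p*(θ) ∝ q(θ) exp(−U(θ)/σ); the quadratic toy target and all constants are hypothetical, chosen only to make the script self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgld_step(theta, grad_objective, grad_log_prior, lr, sigma):
    """One SGLD step targeting p*(theta) ∝ q(theta) exp(-U(theta)/sigma).

    grad_objective: gradient of U(theta) = R + delta*D + gamma*C (minibatch in practice)
    grad_log_prior: gradient of log q(theta)
    The injected noise has std sqrt(2 * lr * sigma); its size relative to the
    learning rate sets the effective temperature sigma.
    """
    noise = rng.normal(size=theta.shape)
    drift = grad_objective(theta) - sigma * grad_log_prior(theta)
    return theta - lr * drift + np.sqrt(2.0 * lr * sigma) * noise

# Tiny illustrative target: hypothetical quadratic objective, standard normal prior.
grad_U = lambda th: 4.0 * (th - 1.0)          # U(theta) = 2 (theta - 1)^2
grad_log_q = lambda th: -th                   # q(theta) = N(0, 1)

theta = np.zeros(1)
samples = []
for t in range(5000):
    theta = sgld_step(theta, grad_U, grad_log_q, lr=1e-3, sigma=0.5)
    if t > 1000:
        samples.append(theta.copy())
print("posterior mean ≈", np.mean(samples), " std ≈", np.std(samples))
```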
There is increasing evidence that the stochastic part of stochastic gradient descent itself is enough
to turn SGD less into an optimization procedure and more into an approximate posterior sam-
pler (Mandt et al., 2017; Smith & Le, 2017; Achille & Soatto, 2017; Zhang et al., 2018; Chaudhari
& Soatto, 2017), where hyperparameters such as the learning rate and batch size set the effective
temperature. If ordinary stochastic gradient descent is doing something more akin to sampling from
a posterior and less like optimizing to some minimum, it would help explain improved performance
through ensemble averages of different points along trajectories (Huang et al., 2017).
When viewed in this light, Equation 24 describes the optimal posterior for the parameters so as to ensure the minimal divergence between worlds P and Q. q(θ) plays the role of the prior over parameters, but our overall objective is minimized when

q(θ) = p(θ) = ⟨p(θ|x^N, y^N)⟩_{p(x^N, y^N)}.  (26)
That is, when our prior is the marginal of the posteriors over all possible datasets drawn from the
true distribution. A fair draw from this marginal is to take a sample from the posterior obtained on
a different but related dataset. Insomuch as ordinary SGD training is an approximate method for
drawing a posterior sample, the common practice of fine-tuning a pretrained network on a related
dataset is using a sample from the optimal prior as our initial parameters. The fact that fine-tuning
approximates use of an optimal prior presumably helps explain its broad success.
If we identify our true goal not as optimizing some objective but instead directly sampling from
Equation 24, we can consider alternative approaches to define our learning dynamics, such as par-
allel tempering or population annealing (Machta & Ellis, 2011). Alternatively, we could, instead of
adopting variational bounds on the mutual informations, consider other mutual information bounds
such as those in Ishmael Belghazi et al. (2018); van den Oord et al. (2018). Perhaps our priors can be fit, provided we form estimates of the expectation over datasets (e.g. by bootstrapping or jackknifing our dataset (DasGupta, 2008)).
4 THERMODYNAMICS

So far we have described a framework for learning that involves finding points that lie on a convex three-dimensional surface described in terms of four functional coordinates R, C, D, S. Interestingly, this is all that is required to establish a formal connection to thermodynamics, which similarly is little more than the study of exact differentials (Sethna, 2006; Finn, 1993).
Whereas previous approaches connecting thermodynamics and learning (Parrondo et al., 2015; Still,
2017; Still et al., 2012) have focused on describing the thermodynamics and statistical mechanics
of physical realizations of learning systems (i.e. the heat bath in these papers is a physical heat
bath at finite temperature), in this work we make a formal analogy to the structure of the theory of
thermodynamics, without any physical content.
The optimal frontier creates an equivalence class of states, being the set of all states that minimize
as much as possible the distortion introduced in projecting world P onto a set of distributions that
respect the conditions in Q. The surface satisfies some equation f(R, C, D, S) = 0 which we can use to describe any one of these functionals in terms of the rest, e.g. R = R(C, D, S). This function is entire, and so we can equate partial derivatives of the function with differentials of the functionals¹²:

dR = (∂R/∂C)|_{D,S} dC + (∂R/∂D)|_{C,S} dD + (∂R/∂S)|_{C,D} dS.  (27)
Since the function is smooth and convex, instead of identifying the surface of optimal rates in terms of the functionals C, D, S, we could just as well describe the surface in terms of the partial derivatives by applying a Legendre transformation. We will name the partial derivatives:

γ ≡ −(∂R/∂C)|_{D,S}    δ ≡ −(∂R/∂D)|_{C,S}    σ ≡ −(∂R/∂S)|_{C,D}.  (28)
These measure the exchange rate for turning rate into reduced distortion, reduced classification error,
or increased entropy, respectively.
The functionals R, C, D, S are analogous to extensive thermodynamic variables such as volume,
entropy, particle number, magnetic field, charge, surface area, length and energy which grow as the
system grows, while the named partial derivatives γ, δ, σ are analogous to the intensive, generalized
forces in thermodynamics corresponding to their paired state variable, such as pressure, temperature,
chemical potential, magnetization, electromotive force, surface tension, elastic force, etc. Just as
in thermodynamics, the extensive functionals are defined for any state, while the intensive partial
derivatives are only well defined for equilibrium states, which in our language are the states lying
on the optimal surface¹³.
Recasting our total differential:
dR = −γdC − δdD − σdS, (29)
we create a law analogous to the First Law of Thermodynamics. In thermodynamics the First Law is
often taken to be a statement about the conservation of energy, and by analogy here we could think
about this law as a statement about the conservation of information. Granted, the actual content
of the law is fairly vacuous, equivalent only to the statement that there exists a scalar function
R = R(C, D, S) defining our surface and its partial derivatives.
Requiring that Equation 29 be an exact differential has mathematically trivial but intuitively non-
obvious implications that relate various partial derivatives of the system to one another, akin to the
Maxwell Relations in thermodynamics. For example, requiring that mixed second partial derivatives
are symmetric establishes that:
∂²R/(∂D∂C) = ∂²R/(∂C∂D)  ⟹  (∂δ/∂C)|_D = (∂γ/∂D)|_C.  (30)
This equates the result of two very different experiments. In the experiment encoded in the partial
derivative on the left, one would measure the change in the derivative of the R − D curve (δ) as
a function of the classification error (C) at fixed distortion (D). On the right one would measure
the change in the derivative of the R − C curve (γ) as a function of the distortion (D) at fixed
classification error (C). As different as these scenarios appear, they are mathematically equivalent.
A full set of Maxwell relations can be found in Appendix F.
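As a sanity check of what a relation like Equation (30) asserts, the sketch below verifies it by finite differences on a smooth, hypothetical stand-in for the frontier R = R(C, D, S); any twice-differentiable surface obeys the same identity.

```python
import numpy as np

# Hypothetical smooth stand-in for the frontier R = R(C, D, S); the identity in
# Eq. (30) holds for any twice-differentiable surface, this one included.
def R(C, D, S):
    return np.exp(-C) + np.exp(-D) + 1.0 / (1.0 + S) + 0.1 * C * D

h = 1e-4
C0, D0, S0 = 0.5, 1.0, 2.0

def delta(C, D, S):   # delta = -dR/dD at fixed C, S (central difference)
    return -(R(C, D + h, S) - R(C, D - h, S)) / (2 * h)

def gamma(C, D, S):   # gamma = -dR/dC at fixed D, S (central difference)
    return -(R(C + h, D, S) - R(C - h, D, S)) / (2 * h)

# Maxwell relation (Eq. 30): d(delta)/dC at fixed D  ==  d(gamma)/dD at fixed C
lhs = (delta(C0 + h, D0, S0) - delta(C0 - h, D0, S0)) / (2 * h)
rhs = (gamma(C0, D0 + h, S0) - gamma(C0, D0 - h, S0)) / (2 * h)
print(lhs, rhs)   # both ≈ -0.1, the mixed partial of this surrogate surface
```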
We can additionally take and name higher order partial derivatives, analogous to the susceptibilities
of thermodynamics like bulk modulus, the thermal expansion coefficient, or heat capacities. For
instance, we can define the analog of heat capacity for our system, a sort of rate capacity at constant
distortion:
K_D ≡ (∂R/∂σ)|_D.  (31)
¹² (∂X/∂Y)|_Z denotes the partial derivative of X with respect to Y holding Z constant.
¹³ For more discussion of equilibrium states, and how they connect with more intuitive notions of equilibrium, see Appendix G.
Just as in thermodynamics, these susceptibilities may offer useful ways to characterize and quan-
tify the systematic differences between model families. Perhaps general scaling laws can be found
between susceptibilities and network widths, or depths, or number of parameters or dataset size. Di-
vergences or discontinuities in the susceptibilities are the hallmark of phase transitions in physical
systems, and it is reasonable to expect to see similar phenomena for certain models.
A great many first-, second- and third-order partial derivatives in thermodynamics are given unique names. This is because the quantities are particularly useful for comparing different physical systems. We expect a subset of the first, second and higher order partial derivatives of the base functionals will prove similarly useful for comparing, quantifying, and understanding differences between modeling choices.
Even when doing deterministic training, training is non-invertible (Maclaurin et al., 2015), and we
need to contend with and track the entropy (S) term. We set the parameters of our networks initially
with a fair draw from some prior distribution q(θ). The training procedure acts as a Markov process
on the distribution of parameters, transforming it from the prior distribution into some modified
distribution, the posterior p(θ|x^N, y^N). Optimization is a many-to-one function that, in the ideal limiting case, maps all possible initializations to a single global optimum. In this limiting case S would be divergent, and there is nothing to prevent us from memorizing the training set.
The Second Law of Thermodynamics states that the entropy of an isolated system tends to increase.
All systems tend to disorder, and this places limits on the maximum possible efficiency of heat
engines.
Formally, there are many statements akin to the Second Law of Thermodynamics that can be made about Markov chains generally (Cover & Thomas, 2012). The central one is that for any two distributions p_n, q_n both evolving according to the same Markov process (n marks the time step), the relative entropy D_KL[p_n; q_n] is monotonically decreasing with time. This establishes that for a stationary Markov chain, the relative entropy to the stationary state D_KL[p_n; p_∞] monotonically decreases¹⁴.
In our language, we can make strong statements about dynamics that target points on the optimal
frontier, or dynamics that implement a relaxation towards equilibrium. There is a fundamental
distinction between states that live on the frontier and those off of it, analogous to the distinction
between equilibrium and non-equilibrium states in thermodynamics.
Any equilibrium distribution can be expressed in the form of Equation (24) and identified by its partial derivatives γ, δ, σ. If we name the objective in Equation (22)

J(γ, δ, σ) ≡ R + δD + γC + σS,  (32)

the value this objective takes for any equilibrium distribution can be shown to be given by the log partition function (Equation (25)):

min J(γ, δ, σ) = −σ log Z(γ, δ, σ),  (33)

and the KL divergence between any distribution over parameters p(θ) and an equilibrium distribution is:

D_KL[p(θ); p*(θ; γ, δ, σ)] = ΔJ/σ  (34)

ΔJ ≡ J^noneq(p; γ, δ, σ) − J(γ, δ, σ),  (35)

where J^noneq is the non-equilibrium objective:

J^noneq(p; γ, δ, σ) = ⟨R + δD + γC + σS⟩_{p(θ)}.  (36)
For a stationary Markov process whose stationary distribution is an equilibrium distribution, the KL divergence to the stationary distribution must monotonically decrease at each step. This means that ΔJ/σ must decrease monotonically; that is, our objective J must decrease monotonically:

J_{t=0} ≥ J_t ≥ J_{t+1} ≥ J_{t=∞}.  (37)
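The monotone decrease of the relative entropy to the stationary distribution is easy to see numerically; the following sketch uses an arbitrary small discrete Markov chain (a hypothetical example, not a learning dynamics) and prints D_KL[p_n; p_∞] over a few steps.

```python
import numpy as np

rng = np.random.default_rng(2)

# A small arbitrary row-stochastic transition matrix (hypothetical example).
T = rng.random((4, 4))
T /= T.sum(axis=1, keepdims=True)

# Its stationary distribution: the left eigenvector with eigenvalue 1.
vals, vecs = np.linalg.eig(T.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = np.abs(pi) / np.abs(pi).sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.1, 0.1, 0.1])   # arbitrary initial distribution
for n in range(6):
    print(f"step {n}:  KL[p_n ; p_inf] = {kl(p, pi):.6f}")   # non-increasing
    p = p @ T                          # one step of the Markov chain
```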
¹⁴ For discrete state Markov chains, this implies that if the stationary distribution is uniform, the entropy of the distribution H(p_n) is strictly increasing.
5 CONCLUSION
We have formalized representation learning as the process of minimizing the distortion introduced
when we project the real world (World P ) onto the world we desire (World Q). The projection is nat-
urally described by a set of four functionals which variationally bound relevant mutual informations
in the real world. Relations between the functionals describe an optimal three-dimensional surface
in a four dimensional space of optimal states. A single learning objective targeting points on this
optimal surface can express a wide array of existing learning objectives spanning from unsupervised
learning to supervised learning and everywhere in between. The geometry of the optimal frontier
suggests a wide array of identities involving the functionals and their partial derivatives. This offers
a direct analogy to thermodynamics independent of any physical content. By analogy to thermody-
namics, we can begin to develop new quantitative measures and relationships amongst properties of
our models that we believe will offer a new class of theoretical understanding of learning behavior.
ACKNOWLEDGEMENTS
The authors would like to thank Jascha Sohl-Dickstein, Mallory Alemi, Rif A. Saurous, Ali Rahimi, Kevin
Murphy, Ben Poole, Danilo Rezende and Matt Hoffman for helpful discussions and feedback on the
draft.
REFERENCES
L Accardi. De Finetti, 2018. URL http://www.encyclopediaofmath.org/index.php?title=De_Finetti_theorem&oldid=12884.
A. Achille and S. Soatto. Emergence of Invariance and Disentangling in Deep Representations.
Proceedings of the ICML Workshop on Principled Approaches to Deep Learning, 2017.
Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information
bottleneck. arXiv:1612.00410, 2016. URL http://arxiv.org/abs/1612.00410.
Alexander A Alemi, Ben Poole, Joshua V Dillon, Rif A Saurous, and Kevin Murphy. Fixing a
broken ELBO. ICML 2018, 2018. URL http://arxiv.org/abs/1711.00464.
William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, complexity, and learning. Neu-
ral computation, 13(11):2409–2463, 2001.
C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight Uncertainty in Neural Networks. arXiv:1505.05424, May 2015. URL https://arxiv.org/abs/1505.05424.
Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference,
converges to limit cycles for deep networks. arXiv, 2017. URL https://arxiv.org/abs/
1710.11029.
T. Chen, E. B. Fox, and C. Guestrin. Stochastic Gradient Hamiltonian Monte Carlo.
arXiv:1402.4102, February 2014. URL https://arxiv.org/abs/1402.4102.
X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel.
Variational Lossy Autoencoder. arXiv, 2016. URL https://arxiv.org/abs/1611.
02731.
Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
Imre Csiszár and František Matúš. Information projections revisited. IEEE Transactions on Infor-
mation Theory, 49(6):1474–1490, 2003.
Anirban DasGupta. Edgeworth expansions and cumulants. In Asymptotic Theory of Statistics and
Probability, pp. 185–201. Springer, 2008.
Colin BP Finn. Thermal physics. CRC Press, 1993.
Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. Multivariate information bottle-
neck. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp.
152–161. Morgan Kaufmann Publishers Inc., 2001.
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick,
Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning Basic Visual Concepts with a Con-
strained Variational Framework. 2017.
G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger. Snapshot Ensembles: Train 1, get M for free. arXiv:1704.00109, March 2017. URL https://arxiv.org/abs/1704.00109.
M. Ishmael Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R Devon
Hjelm. MINE: Mutual Information Neural Estimation. arXiv, 2018. URL https://arxiv.
org/abs/1801.04062.
Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957.
D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling. Semi-Supervised Learning with Deep
Generative Models. arXiv: 1406.5298, June 2014. URL https://arxiv.org/abs/1406.
5298.
Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. 2014.
Y.-A. Ma, T. Chen, and E. B. Fox. A Complete Recipe for Stochastic Gradient MCMC.
arXiv:1506.04696, June 2015. URL https://arxiv.org/abs/1506.04696.
J. Machta and R. S. Ellis. Monte Carlo Methods for Rough Free Energy Landscapes: Population
Annealing and Parallel Tempering. Journal of Statistical Physics, 144:541–553, August 2011.
doi: 10.1007/s10955-011-0249-0. URL https://arxiv.org/abs/1104.1138.
Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter opti-
mization through reversible learning. In Proceedings of the 32Nd International Conference on In-
ternational Conference on Machine Learning - Volume 37, ICML’15, pp. 2113–2122. JMLR.org,
2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045343.
S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic Gradient Descent as Approximate Bayesian
Inference. arXiv: 1704.04289, April 2017. URL https://arxiv.org/abs/1704.04289.
Juan MR Parrondo, Jordan M Horowitz, and Takahiro Sagawa. Thermodynamics of informa-
tion. Nature physics, 11(2):131–139, 2015. URL http://jordanmhorowitz.mit.edu/
sites/default/files/documents/natureInfo.pdf.
James Sethna. Statistical mechanics: entropy, order parameters, and complexity, volume 14. Ox-
ford University Press, 2006. URL http://pages.physics.cornell.edu/˜sethna/
StatMech/EntropyOrderParametersComplexity.pdf.
Noam Slonim, Gurinder S Atwal, Gasper Tkacik, and William Bialek. Estimating mutual information and multi-information in large networks. arXiv, 2005. URL https://arxiv.org/abs/cs/0502017.
S. L. Smith and Q. V. Le. A Bayesian Perspective on Generalization and Stochastic Gradient De-
scent. arXiv:1710.06451, October 2017. URL https://arxiv.org/abs/1710.06451.
S. Still. Thermodynamic cost and benefit of data representations. arXiv: 1705.00612, April 2017.
URL https://arxiv.org/abs/1705.00612.
S. Still, D. A. Sivak, A. J. Bell, and G. E. Crooks. Thermodynamics of Prediction. Physical Re-
view Letters, 109(12):120604, September 2012. doi: 10.1103/PhysRevLett.109.120604. URL
https://arxiv.org/abs/1203.3271.
In principle, these three functionals still fully characterize the distortion introduced in our information projection. Notice that this new functional requires the variational approximation q(z_i|x_i), a variational approximation to the marginal over our parameter distribution. Notice also that we no longer require a variational approximation to p(x_i|z_i). That is, in this formulation we no longer require any form of decoder, or synthesis in our original data space X. While equivalent in its information projection, this more naturally corresponds to the model of our desired world Q:

q(x, y, φ, z, θ) = q(φ) q(θ) ∏_i q(z_i|x_i) q(y_i|z_i),  (41)

depicted below in Figure 2. Here we desire, not the joint generative model X ← Z → Y, but the predictive model X → Z → Y.
Figure 2: Modified graphical model for world Q (nodes Θ, Φ, and the chain X → Z → Y), instead of Figure 3b, the world we desire, which satisfies the joint density in Equation 41. Notice that this graphical model encodes all of the same conditional independencies as the original.
We can imagine tracing out this, now three dimensional, frontier that still explores a space consistent
with our original graphical model, but wherein we no longer have to do any form of direct variational
synthesis.
B BAYESIAN INFERENCE
Just as in Appendix A, we can consider alternative graphical models for World P. In particular, we can consider a simplified scenario, depicted in Figure 3, corresponding to the usual situation in Bayesian inference.
Figure 3: (a) Graphical model for world P, depicting Bayesian inference as learning a single global representation of data (nodes Φ, X, Θ). (b) Graphical model for world Q, the world we desire, the usual generative model of Bayesian inference.
Here we have just data, generated by some process and we form a single global representation of the
dataset. The world we desire, World Q, corresponds to the usual Bayesian modeling assumption,
whereby our own global representation generates the data conditionally independently.
For these sets of graphical models, we have the following information projection:

J_bayes = min_{q∈Q} D_KL[p; q] = I_P − I_Q = ∑_i I(X_i; Φ) + I(Θ; X^N) − ∑_i I(X_i; Θ)  (43)

S ≡ ⟨log [p(θ|X^N) / q(θ)]⟩ ≥ I(Θ; X^N)  (44)
This entropy gives an upper bound on the mutual information between our parameters and the dataset; it requires a variational approximation, q(θ), a prior, to the true marginal of the posterior p(θ|X^N) over datasets.
Just as in our earlier paper (Alemi et al., 2018) we could trace out the frontier by doing the con-
strained optimization problem:
min S + βU (47)
Using a temperature to regulate the relative contribution of the prior and posterior has been used broadly, but ordinarily doesn't have a well-founded justification. Here we can unapologetically vary
¹⁵ U ≡ ∑_i U_i
the relative contributions of the prior and likelihood, since in the representational framework those are both variational approximations that might have differing abilities to model the true distributions they approximate. By varying the β parameter here, just as in the β-VAE case (Alemi et al., 2018), we can smoothly explore the frontier within our modeling family, smoothly controlling the amount of information our model extracts from the dataset. This can help us control for overfitting in a principled way.
Additionally, we could try to relax our variational approximations, and fit our prior, assuming we
could estimate an expectation over datasets. One way to do that is with a bootstrap or jackknife
procedure (DasGupta, 2008).
C DISCRIMINATIVE MODELS
Similarly, we could consider the situation of usual discriminative learning, depicted in Figure 4.
Figure 4: (a) Graphical model P, depicting conditionally independent data with a global representation (nodes Φ, Θ, X, Y). (b) Graphical model Q, depicting a discriminative generative model.
For these sets of graphical models, we have the following information projection:

J_d = I_P − I_Q = I(Θ; X^N, Y^N) + ∑_i [I(X_i; Φ) + I(Y_i; X_i, Φ) − I(Y_i; X_i, Θ)].  (51)
S ≡ ⟨log [p(θ|X^N, Y^N) / q(θ)]⟩_P ≥ I(Θ; X^N, Y^N)  (52)

This entropy gives an upper bound on the mutual information between our parameters and the dataset; it requires a variational approximation, q(θ), a prior, to the true marginal of the posterior p(θ|X^N, Y^N) over datasets.
Here again we can smoothly explore the frontier set by the variational approximations given by the prior and likelihood by simply adjusting β. We might additionally consider going beyond the fixed variational approximations and pushing the frontier by fitting the prior, or likelihood.
D FUNCTIONAL INEQUALITIES
Here we show the details for deriving Equation (14).
We start by expressing our functional inequalities, being explicit about the presence of the relative informations of our variational approximations.
I(Θ; X^N, Y^N) = S − D_KL[p(θ); q(θ)]  (56)

I(X_i; Z_i) = H(X_i) − D_i + D_KL[p(x_i|z_i); q(x_i|z_i)]  (57)

I(Y_i; Z_i) = H(Y_i) − C_i + D_KL[p(y_i|z_i); q(y_i|z_i)]  (58)

I(Z_i; X_i, Θ) = R_i − D_KL[p(z_i); q(z_i)]  (59)

Combining Equations (7) and (56) to (59):

J = S + D + C + R − D_KL[p; q] − ∑_i [H(X_i) + H(Y_i) − I(X_i; Φ) − I(Y_i; X_i, Φ)] ≥ 0.  (60)
Here we have collected all of the KL divergences for our variational approximations:

D_KL[p; q] ≡ D_KL[p(θ); q(θ)] + ∑_i D_KL[p(x_i|z_i); q(x_i|z_i)] + ∑_i [D_KL[p(y_i|z_i); q(y_i|z_i)] + D_KL[p(z_i); q(z_i)]].  (61)
We can simplify:

H(X_i) − I(X_i; Φ) = H(X_i|Φ)  (62)

H(Y_i) − I(Y_i; X_i, Φ) = H(Y_i|X_i, Φ)  (63)

H(Y_i|X_i, Φ) + H(X_i|Φ) = H(Y_i, X_i|Φ)  (64)

to obtain:

J = S + D + C + R − D_KL[p; q] − ∑_i H(Y_i, X_i|Φ),  (65)
which yields:

S + D + C + R = J + D_KL[p; q] + ∑_i H(Y_i, X_i|Φ)  (66)

S + D + C + R ≥ J + ∑_i H(Y_i, X_i|Φ)  (67)

S + D + C + R ≥ ∑_i H(Y_i, X_i|Φ)  (68)
E IDENTITIES
We will utilize some basic information identities, first by definition
I(A; B) = H(A) − H(A|B) (70)
= H(B) − H(B|A) (71)
= H(A) + H(B) − H(A, B) (72)
= H(A, B) − H(A|B) − H(B|A) (73)
By the chain rule of mutual information:
I(A, B; C) = I(A; C) + I(B; C|A) ≥ 0 (74)
Mutual informations and conditional mutual informations are always positive:
I(A; B) ≥ 0 (75)
I(A; B|C) ≥ 0 (76)
We will also use the following rule for conditional entropies
H(B|A) = H(A, B) − H(A) (77)
F MAXWELL RELATIONS
We can also define other potentials analogous to the alternative thermodynamic potentials such as enthalpy, free energy, and Gibbs free energy by performing partial Legendre transformations. For instance, we can define a free rate:
F (C, D, σ) ≡ R + σS (78)
dF = −γdC − δdD + Sdσ. (79)
The free rate measures the rate of our system, not as a function of S (something difficult to keep
fixed), but in terms of σ, a parameter in our loss or optimal posterior.
The free rate gives rise to other Maxwell relations such as

(∂S/∂C)|_σ = −(∂γ/∂σ)|_C,  (80)
which equates how much each additional bit of entropy (S) buys you in terms of classification error
(C) at fixed effective temperature (σ), to a seemingly very different experiment where you measure
the change in the effective supervised tension (γ, the slope on the R − C curve) versus effective
temperature (σ) at a fixed classification error (C).
Here we enumerate a complete set of Maxwell relations. First, if we write R = R(D, C, S):

dR = −γdC − δdD − σdS

(∂γ/∂D)|_C = (∂δ/∂C)|_D  (81)

(∂δ/∂S)|_D = (∂σ/∂D)|_S  (82)

(∂γ/∂S)|_C = (∂σ/∂C)|_S  (83)

(∂γ/∂σ)|_C = −(∂S/∂C)|_σ  (84)

(∂δ/∂σ)|_D = −(∂S/∂D)|_σ  (85)

(∂C/∂D)|_γ = −(∂δ/∂γ)|_D  (87)

(∂C/∂S)|_γ = −(∂σ/∂γ)|_S  (88)

(∂C/∂σ)|_γ = (∂S/∂γ)|_σ  (90)

(∂γ/∂δ)|_C = −(∂D/∂C)|_δ  (92)

(∂D/∂S)|_δ = −(∂σ/∂δ)|_S  (93)

Finally, transforming to B = R + δD + σS = B(δ, C, σ):

dB = −γdC + Ddδ + Sdσ  (94)

(∂γ/∂σ)|_C = −(∂S/∂C)|_σ  (95)

(∂S/∂δ)|_σ = (∂D/∂σ)|_δ  (96)
G ZEROTH LAW OF LEARNING
A central concept in thermodynamics is the notion of equilibrium. The so-called Zeroth Law of thermodynamics defines thermal equilibrium through a sort of transitivity property of systems (Finn, 1993): if system A is in thermal equilibrium with system C, and system B is separately in thermal equilibrium with system C, then systems A and B are in thermal equilibrium with each other.
When any sub-part of a system is in thermal equilibrium with any other sub-part, the system is said
to be an equilibrium state.
In our framework, the points on the optimal surface are analogous to the equilibrium states, for
which we have well defined partial derivatives. We can demonstrate that this notion of equilib-
rium agrees with a more intuitive notion of equilibrium between coupled systems. Imagine we
have two different models, characterized by their own sets of distributions: model A is defined by p_A(z|x, θ), p_A(θ|{x, y}), q_A(z), and model B by p_B(z|x, θ), p_B(θ|{x, y}), q_B(z). Both models will have their own value for each of the functionals: R_A, S_A, D_A, C_A and R_B, S_B, D_B, C_B. Each model defines its own representation Z_A, Z_B. Now imagine coupling the models by forming the joint representation Z_C = (Z_A, Z_B), obtained by concatenating the two representations together. Now the governing distributions over Z are simply the product of the two models' distributions, e.g. q_C(z_C) = q_A(z_A) q_B(z_B). Thus the rate R_C and entropy S_C for the combined model are the sums of the individual models': R_C = R_A + R_B, S_C = S_A + S_B.
Now imagine we sample new states for the combined system which are maximally entropic, with the constraint that the combined rate stays constant:

min S  s.t.  R = R_C   ⟹   p(θ|{x, y}) = (q(θ)/Z) e^{−R/σ}.  (97)
For the expectation of the two rates to be unchanged after they have been coupled and evolved holding their total rate fixed, we must have

−(1/σ_A) R_A − (1/σ_B) R_B = −(1/σ_C) R_C = −(1/σ_C)(R_A + R_B)   ⟹   σ_A = σ_B = σ_C.  (98)
Therefore, we can see that σ, the effective temperature, allows us to identify whether two systems
are in thermal equilibrium with one another. Just as in thermodynamics, if two systems at different
temperatures are coupled, some transfer takes place.
H EXPERIMENTS
We show examples of models trained on a toy dataset for all of the different objectives we define
above. The dataset has both an infinite data variant, where overfitting is not a problem, and a finite
data variant, where overfitting can be clearly observed for both reconstruction and classification.
Data generation. We follow the toy model from Alemi et al. (2018), but add an additional classifi-
cation label in order to explore supervised and semi-supervised objectives. The true data generating
distribution is as follows. We first sample a latent binary variable, z ∼ Ber(0.7), then sample a
latent 1D continuous value from that variable, h|z ∼ N(h|µ_z, σ_z), and finally we observe a discretized value, x = discretize(h; B), where B is a set of 30 equally spaced bins, and a discrete label, y = z (so the true label is the latent variable that generated x). We set µ_z and σ_z such that R* ≡ I(x; z) = 0.5 nats in the true generative process, representing the ideal rate target for a latent
variable model. For the finite dataset, we select 50 examples randomly from the joint p(x, y, z). For
the infinite dataset, we directly supply the true full marginal p(x, y) at each iteration during training.
When training on the finite dataset, we evaluate model performance against the infinite dataset so
that there is no error in the evaluation metrics due to a finite test set.
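A sketch of this generative process is shown below; the specific values of µ_z, σ_z and the bin range are hypothetical placeholders, since the paper only states that they are tuned so that I(x; z) = 0.5 nats.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical values for mu_z, sigma_z and the bin range; the paper only states
# that they are chosen so that I(x; z) = 0.5 nats in the true generative process.
MU = {0: -1.0, 1: 1.0}
SIGMA = {0: 1.0, 1: 1.0}
BINS = np.linspace(-4.0, 4.0, 31)   # edges of 30 equally spaced bins

def sample(n):
    z = (rng.random(n) < 0.7).astype(int)                        # z ~ Ber(0.7)
    h = np.array([rng.normal(MU[zi], SIGMA[zi]) for zi in z])    # h | z ~ N(mu_z, sigma_z)
    x = np.clip(np.digitize(h, BINS) - 1, 0, 29)                 # x = discretize(h; B)
    y = z                                                        # the label is the latent z
    return x, y, z

x_train, y_train, z_train = sample(50)   # the finite-data variant uses 50 examples
```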
Model details. We choose to use a discrete latent representation with K = 30 values, with an encoder of the form q(z_i|x_j) ∝ exp[−(w_i^e x_j − b_i^e)²], where z is the one-hot encoding of the latent categorical variable, and x is the one-hot encoding of the observed categorical variable. We use a decoder of the same form, but with different parameters: q(x_j|z_i) ∝ exp[−(w_i^d x_j − b_i^d)²]. We use a classifier of the same form as well: q(y_j|z_i) ∝ exp[−(w_i^c y_j − b_i^c)²]. Finally, we use a variational marginal, q(z_i) = π_i. Given this, the true joint distribution has the form p(x, y, z) = p(x) p(z|x) p(y|x), with marginal p(z) = ∑_x p(x, z), and conditionals p(x|z) = p(x, z)/p(z) and p(y|z) = p(y, z)/p(z).

The encoder is additionally parameterized following Achille & Soatto (2017) by α, a set of learned parameters for a Log Normal distribution of the form log N(−α_i/2, α_i). In total, the model has 184 parameters: 120 weights and biases in the encoder and decoder, 4 weights and biases in the classifier, 30 weights in the marginal, and an additional 30 weights for the α_i parameterizing the stochastic encoder. We initialize the weights so that when σ = 0, there is no noticeable effect on the encoder during training or testing.
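The following sketch spells out one plausible reading of this parameterization (treating the one-hot indices as scalar bin/label indices); the random parameter values, and the exact indexing of the classifier weights, are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 30   # number of latent values, and of observed bins

# Hypothetical (randomly initialized) parameters; the trained model has 184 in total.
w_e, b_e = rng.normal(size=K), rng.normal(size=K)   # encoder: 30 weights + 30 biases
w_d, b_d = rng.normal(size=K), rng.normal(size=K)   # decoder: 30 weights + 30 biases
w_c, b_c = rng.normal(size=2), rng.normal(size=2)   # classifier: 2 weights + 2 biases
pi_logits = rng.normal(size=K)                      # variational marginal q(z_i) = pi_i

def softmax(scores):
    scores = scores - scores.max()
    return np.exp(scores) / np.exp(scores).sum()

def q_z_given_x(j):
    """q(z_i | x_j) ∝ exp[-(w_i^e x_j - b_i^e)^2], with x_j read as the bin index j."""
    return softmax(-(w_e * j - b_e) ** 2)

def q_x_given_z(i):
    """q(x_j | z_i) ∝ exp[-(w_i^d x_j - b_i^d)^2], a bump over the 30 bins."""
    j = np.arange(K)
    return softmax(-(w_d[i] * j - b_d[i]) ** 2)

def q_y_given_z(i):
    """q(y | z_i) ∝ exp[-(w_y^c z_i - b_y^c)^2], one weight/bias per label (assumed indexing)."""
    return softmax(-(w_c * i - b_c) ** 2)

q_z = softmax(pi_logits)   # the learned variational marginal over the 30 latent values
print(q_z_given_x(12).shape, q_x_given_z(3).shape, q_y_given_z(3).shape)
```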
Experiments. In Figure 5, we show the optimal, hand-crafted model for the toy dataset, as well
as a selection of parameterizations of the TherML objective that correspond to commonly-used ob-
jective functions and a few new objective functions not previously described. In the captions, the
parameters are specified with γ, δ, σ as in the main text, as well as ρ, which is a corresponding La-
grange multiplier for R, in order to simplify the parameterization. It just parameterizes the optimal
surface slightly differently. We train all objectives for 10,000 gradient steps. For all of the objectives
described, the model has converged, or come close to convergence, by that point.
Because the model is sufficiently powerful to memorize the dataset, most of the objectives are very
susceptible to overfitting. Only the objective variants that are “regularized” by the S term (parame-
terized by σ) are able to avoid overfitting in the decoder and classifier.
Figure 5: Hand-crafted optimal model. Toy model illustrating the difference between selected points on the three-dimensional optimal surface defined by γ, δ, and σ. See Section 3 for more description of the objectives, and Appendix H for details on the experiment setup. Top (i): three distributions in data space: the true data distribution p(x), the model's generative distribution g(x) = ∑_z q(z) q(x|z), and the empirical data reconstruction distribution d(x) = ∑_{x′} ∑_z p(x′) q(z|x′) q(x|z). Middle (ii): four distributions in latent space: the learned (or computed) marginal q(z), the empirical induced marginal e(z) = ∑_x p(x) q(z|x), the empirical distribution over z values for data vectors in the set X_0 = {x_n : z_n = 0}, which we denote by e(z_0) in purple, and the empirical distribution over z values for data vectors in the set X_1 = {x_n : z_n = 1}, which we denote by e(z_1) in yellow. Bottom: three K × K distributions: (iii) q(z|x), (iv) q(x|z) and (v) q(x′|x) = ∑_z q(z|x) q(x′|z).
Figure 10: Full Objective. σ = 0.5, γ = 1000, δ = 1, ρ = 0.9. Simple demonstration of the behavior
with all terms present in the objective.