Machine Learning 2025
Kyunghyun Cho
New York University & Genentech
May 8, 2025
PREFACE
I prepared this lecture note in order to teach DS-GA 1003 “Machine Learn-
ing” at the Center for Data Science of New York University. This is the first
course on machine learning for master’s and PhD students in data science, and
my goal was to provide them with a solid foundation on top of which they can
continue on to learn more advanced and modern topics in machine learning, data science and, more broadly, artificial intelligence. Because of this goal, this lecture note contains quite a few mathematical derivations of various concepts in machine learning. This should not deter students from reading through the note, as I have interleaved these derivations with accessible explanations of the intuition and insights behind them. Of course, as I was preparing this note, it became clear how shallow my own foundation in machine learning was. But I tried.
In preparing this lecture note, I tried my best to constantly remind my-
self of “Bitter Lesson” by Richard Sutton [Sutton, 2019]. I forced myself to
present various algorithms, models and theories in ways that support scalable
implementations, both for compute and data. All machine learning algorithms
in this lecture are thus presented to work with stochastic gradient descent and
its variants. Of course, there are other aspects of scalability, such as distributed computing, but I expect and hope that more advanced follow-up courses will teach students these topics, building on the foundation this course has equipped them with.
Despite my intention to cover as many foundational topics as possible in this course, it soon became apparent that one course is not long enough to dig deeply into all of them. I had to make a difficult decision to omit
some topics I find foundational, interesting and exciting, such as online learning,
kernel methods and how to handle missing values. There were on the other hand
some topics I intentionally omitted, although I believe them to be foundational
as well, because they are covered extensively in various other courses, such
as sequence modeling (or large-scale language modeling). I have furthermore
refrained from discussing any particular application, hoping that there are other
follow-up courses focused on individual application domains, such as computer
vision, computational biology and natural language processing.
There are a few more modern topics I hoped to cover but could not due to time constraints. These include ordinary differential equation (ODE) based generative models and contrastive learning for both representation
learning and metric learning. Perhaps in the future, I could create a two-course
series in machine learning and add these extra materials. Until then, students
will have to look for other materials to learn about these topics.
This lecture note is not intended to be a reference book but was created as teaching material. This is my way of apologizing in advance for not having been careful about extensively and exhaustively citing all relevant past literature. I will hopefully add citations more thoroughly the next time I teach
this same course, although there is no immediate plan to do so anytime soon.
Contents
1 An Energy Function

6 Further Topics
6.1 Reinforcement Learning
6.2 Ensemble Methods
6.3 Meta-Learning
6.4 Regression: Mixture Density Networks
6.5 Causality
Chapter 1
An Energy Function
$$e : \mathcal{X} \times \mathcal{Z} \times \Theta \to \mathbb{R}. \tag{1.1}$$
where Y is the set of all possible outcomes y. When Y consists of discrete items,
we call it classification. If y is a continuous variable, we call it regression.
When Z is a finite set of discrete items, a given energy function e defines the cluster assignment of an observation x, resulting in clustering:
$$\hat{z}(x) = \arg\min_{z \in \mathcal{Z}} e(x, z, \theta).$$
If z is a continuous variable, we would solve the same problem but call it representation learning.
All these different paradigms effectively correspond to solving a minimization
problem with respect to some subset of the inputs to the energy function e. In
other words, given a partially-observed input, we infer the unobserved part that
minimizes the energy function. This is often why people refer to using any
machine learning model after training as inference.
It is not trivial to solve such a minimization problem. The level of difficulty
depends on a variety of factors, including how the energy function is defined,
when there is no latent variable. It unfortunately turned out that learning is not as easy, since we must ensure that the energy assigned to undesirable observations, i.e. those with pdata(x) ↓, is relatively high. In other words, we must introduce an extra term that regularizes learning:
The choice of R must be made appropriately for each problem we solve, and
throughout the course, we will learn how to design appropriate regularizers to
ensure proper learning.
Of course, it becomes even more involved when there are latent (unobserved) variables z, since it requires us to solve the problem of inference simultaneously as well. This happens for problems such as clustering, where the cluster assignment of each observation is unknown, and factor analysis, where latent factors are unknown in advance. We will learn how to interpret such latent variables and study algorithms that allow us to estimate θ even when the latent variables are never observed.
In summary, there are three aspects to every machine learning problem: (1) defining an energy function e (parametrization), (2) estimating the parameters θ from data (learning), and (3) inferring a missing part given a partial observation (inference). Across these three steps sits one energy function, and once
we obtain an energy function e, we can easily mix and match these steps from
different paradigms of machine learning.
Chapter 2

Basic Ideas in Machine Learning with Classification

2.1 Classification
In the problem of classification, an observation x can be split into the input and
output; [x, y]. The output y takes one of the finite number of categories in Y.
For now, we assume that there is no latent variable, i.e., Z = ∅. Inference is
quite trivial in this case, since all we need to do is to pick the category that has
the lowest energy, after computing the energy for all possible categories one at
a time:
$$\hat{y}(x) = \arg\min_{y \in \mathcal{Y}} e([x, y], \emptyset, \theta). \tag{2.1}$$
where θ = (W, b) with W ∈ R|Y|×|x| and b ∈ R|Y| . When such a linear feature
extractor is used, we call it a linear classifier.
A natural next question is how we can learn the parameters θ (e.g. W and b).
We approach learning from the perspective of optimization. That is, we establish
a loss function first and figure out how to minimize the loss function averaged
over a training set D, where the training set D is assumed to consist of N
independently sampled observations from the identical distribution (i.i.d.):
$$\mathcal{D} = \{[x^n, y^n]\}_{n=1}^N. \tag{2.4}$$
Perhaps the most obvious loss function we can imagine is the so-called zero-one (0-1) loss:
$$L_{0\text{-}1}([x, y], \theta) = \mathbb{1}(\hat{y}(x) \neq y),$$
where $\mathbb{1}(\cdot)$ is the indicator function, equal to 1 when its argument is true and 0 otherwise.
With this zero-one loss function, the overall objective of learning is then
$$\min_\theta \frac{1}{N}\sum_{n=1}^N L_{0\text{-}1}([x^n, y^n], \theta). \tag{2.8}$$
the true outcome y, i.e., e([x, y], ∅, θ), is lower than the energy associated with
any other y′ ≠ y. This goal can then be written down as satisfying the following inequality:
$$m + e([x, y], \emptyset, \theta) - \min_{y' \neq y} e([x, y'], \emptyset, \theta) \leq 0,$$
where m ≥ 0 is a margin.
In order to satisfy this inequality, we need to minimize the left hand side (l.h.s.)
until it hits 0. We do not need to further minimize l.h.s. after hitting 0, since the
inequality is already satisfied. This translates to the following so-called margin loss (or a hinge loss):
$$L_{\text{margin}}([x, y], \theta) = \max\Big(0,\; m + e([x, y], \emptyset, \theta) - e([x, \hat{y}'], \emptyset, \theta)\Big), \quad \text{where } \hat{y}' = \arg\min_{y' \neq y} e([x, y'], \emptyset, \theta).$$
This loss is called a margin loss, because it ensures that there exists at least
the margin of m between the energy values of the correct outcome y and the
second best outcome ŷ ′ . The margin loss is at the heart of support vector ma-
chines [Cortes, 1995].
Consider the case where m = 0:
2. Normalization: $\sum_{y' \in \mathcal{Y}} p_\theta(y'|x) = 1$.
Of course, there can be many (if not infinitely many) different ways to map
e([x, y], ∅, θ) to pθ (y|x), while satisfying these two conditions [Peters et al.,
2019]. We thus need to impose a further constraint to narrow down to one
particular mapping from the energy function to the Categorical probability. A
natural such constraint is the maximum entropy criterion.
The (Shannon) entropy is defined as
$$H(y|x; \theta) = -\sum_{y \in \mathcal{Y}} p_\theta(y|x) \log p_\theta(y|x). \tag{2.15}$$
$$\max_{p_1, \ldots, p_d}\; -\sum_{i=1}^d a_i p_i - \sum_{i=1}^d p_i \log p_i \tag{2.17}$$
subject to
$$p_i = s_i^2 \;\text{ for } i = 1, \ldots, d, \tag{2.18}$$
$$\sum_{i=1}^d p_i = 1, \tag{2.19}$$
where the first (slack-variable) constraint enforces the non-negativity of each $p_i$. The corresponding Lagrangian is
$$J(p_1, \ldots, p_d, \lambda_1, \ldots, \lambda_d, \gamma) = -\sum_{i=1}^d a_i p_i - \sum_{i=1}^d p_i \log p_i + \sum_{i=1}^d \lambda_i (p_i - s_i^2) + \gamma\Big(\sum_{i=1}^d p_i - 1\Big). \tag{2.20}$$
Let us first compute the partial derivative of J with respect to $p_i$ and set it to 0:
$$\frac{\partial J}{\partial p_i} = -a_i - \log p_i - 1 + \lambda_i + \gamma = 0 \tag{2.21}$$
$$\iff \log p_i = -a_i + \lambda_i - 1 + \gamma \tag{2.22}$$
$$\iff p_i = \exp(-a_i + \lambda_i - 1 + \gamma) > 0. \tag{2.23}$$
Let us now plug it into the second constraint and solve for γ:
$$\exp(-1 + \gamma)\sum_{i=1}^d \exp(-a_i) = 1 \tag{2.25}$$
$$\iff -1 + \gamma + \log\sum_{i=1}^d \exp(-a_i) = 0 \tag{2.26}$$
$$\iff \gamma = 1 - \log\sum_{i=1}^d \exp(-a_i). \tag{2.27}$$
Plugging this back into the expression for $p_i$, we get
$$p_i = \exp(-a_i)\exp\Big(-1 + 1 - \log\sum_{j=1}^d \exp(-a_j)\Big) \tag{2.28}$$
$$\phantom{p_i} = \frac{\exp(-a_i)}{\sum_{j=1}^d \exp(-a_j)}. \tag{2.29}$$
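To make this concrete, here is a minimal NumPy sketch (my own toy example; the energy values in `a` are arbitrary) that evaluates the maximum-entropy solution of Eq. (2.29) and checks that it is a proper distribution.

```python
import numpy as np

# Energies a_1, ..., a_d for d outcomes (arbitrary example values).
a = np.array([2.0, 0.5, -1.0, 3.0])

# Maximum-entropy solution from Eq. (2.29): softmax of the negative energies.
# Subtracting the maximum of -a first is a standard numerical-stability trick
# and does not change the result.
logits = -a - np.max(-a)
p = np.exp(logits) / np.sum(np.exp(logits))

print(p)         # the lowest-energy outcome receives the highest probability
print(p.sum())   # 1.0 up to floating-point error
```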
β was added to make our analysis easier. We often call β an inverse temperature.
β is by default 1, but by varying β, we can gain more insights into the negative
term.
Consider the case where β = 0: the negative term reduces to
$$-\frac{1}{|\mathcal{Y}|}\sum_{y' \in \mathcal{Y}} \nabla_\theta e([x, y'], \emptyset, \theta). \tag{2.34}$$
This would correspond to increasing the energy associated with each outcome
equally.
How about when β → ∞? In that case, the negative term reduces to
$$-\nabla_\theta e([x, \hat{y}], \emptyset, \theta),$$
where
$$\hat{y} = \arg\min_{y' \in \mathcal{Y}} e([x, y'], \emptyset, \theta).$$
When β → ∞, we end up with two cases. First, the classifier makes the
correct prediction; ŷ = y. In this case, the positive and negative terms cancel
each other, and there is no gradient. Hence, there is no update to the param-
eters. This reminds us of the perceptron loss from the earlier section. On the
other hand, if ŷ ≠ y, it will try to lower the energy value associated with the correct outcome y while increasing the energy value associated with the current prediction ŷ. This continues until the prediction matches the correct outcome.

2 Recall that this is a loss which is minimized.
These two extreme cases tell us what happens with the cross-entropy loss. It softly adjusts the energy values associated with all possible outcomes, based on how likely each is to be the prediction. The cross-entropy loss has become more or less the de facto standard for training neural networks in recent years.
2.2 Backpropagation
Once the loss function is decided, it is time for us to train a classifier to minimize the average loss. In doing so, one of the most effective approaches has been stochastic gradient descent, or one of its variants. Stochastic gradient descent, which we will discuss in more depth later, takes a subset of training instances from D, computes and averages the gradients of the loss of each instance in this subset, and updates the parameters in the negative direction of this stochastic gradient. This makes it both interesting and important to think about how to compute the gradient of a loss function.
Let us consider both the margin loss and cross entropy loss, since there is
no meaningful gradient of the zero-one loss function and the perceptron loss is
a special case of the margin loss:
$$\nabla_\theta L_{\text{margin}}([x, y], \theta) = \begin{cases} \nabla_\theta e([x, y], \emptyset, \theta) - \nabla_\theta e([x, \hat{y}'], \emptyset, \theta), & \text{if } L_{\text{margin}}([x, y], \theta) > 0, \\ 0, & \text{otherwise}. \end{cases} \tag{2.37}$$
$$\nabla_\theta L_{\text{ce}}([x, y], \theta) = \nabla_\theta e([x, y], \emptyset, \theta) - \mathbb{E}_{y'|x;\theta}\big[\nabla_\theta e([x, y'], \emptyset, \theta)\big]. \tag{2.38}$$
In both cases, the gradient of the energy function shows up: $\nabla_\theta e([x, y], \emptyset, \theta)$.
We thus focus on the gradient of the energy function in this case.
The gradient of the energy function with respect to the associated weight vector $w_y$ and bias $b_y$ is then
$$\nabla_{w_y} e([x, y], \emptyset, \theta) = -x, \qquad \nabla_{b_y} e([x, y], \emptyset, \theta) = -1.$$
The first one (the gradient w.r.t. wy ) states that for the energy to be lowered
for this particular combination (x, y), we should add the input x to the weight
vector wy . The second one (the gradient w.r.t. by ) lowers the energy for the
outcome y regardless of the input.
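As a small sanity check, the sketch below (a toy example; it assumes the linear energy e([x, y], ∅, θ) = −(w_y⊤x + b_y), which is consistent with the gradients just described) applies the negative-gradient update for the correct outcome and confirms that the energy of that pair decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 3, 5
W = rng.normal(size=(num_classes, dim))   # one weight vector w_y per outcome
b = np.zeros(num_classes)                 # one bias b_y per outcome

def energy(x, y):
    # Linear energy consistent with the gradients above: e([x, y]) = -(w_y^T x + b_y).
    return -(W[y] @ x + b[y])

x, y = rng.normal(size=dim), 1
eta = 0.1

before = energy(x, y)
# Negative-gradient update for the correct outcome:
# the gradient w.r.t. w_y is -x and w.r.t. b_y is -1, so we add eta*x and eta.
W[y] += eta * x
b[y] += eta
after = energy(x, y)

print(before, after)   # the energy of the correct pair has decreased
```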
Let us consider the perceptron loss, or the margin loss with zero margin.
The first-term gradient, ∇θ e([x, y], ∅, θ), updates the weight vector and the
bias value associated with the correct outcome. With a learning rate η > 0,
the updated energy associated with the correct outcome, where we follow the negative gradient,3 is then smaller than the original energy:
$$e([x, y], \emptyset, \theta_{\text{new}}) = e([x, y], \emptyset, \theta) - \eta\big(\|x\|^2 + 1\big) \leq e([x, y], \emptyset, \theta),$$
where $\theta_{\text{new}}$ is obtained by replacing $w_y$ with $w_y + \eta x$ and $b_y$ with $b_y + \eta$.
This is precisely what we intended, since we want the energy value to be lower
with a good combination of the input and outcome.
This alone is however not enough as a full learning rule. Even if the energy
value associated with the right combination is lowered, it may not be lowered
enough, so that the correct outcome is selected when the input is presented
again. The second-term gradient complements this by having the opposite sign in
front of it. By following the negative gradient of the negative energy associated
with the input and the predicted outcome ŷ, we ensure that this particular
energy value is increased:
So, this learning rule would lower the energy value associated with the correct
outcome and increase that associated with the incorrectly-predicted outcome,
until the outcome with the lowest energy coincides with the correct outcome.
When that happens, the loss is constant, and no learning happens, because
y = ŷ.
3 We will shortly discuss why we do so later in this chapter.
At this point, we start to see that the derivation and argument above apply equally to x, the input. Instead of the gradient of the energy w.r.t. the weight vector $w_y$, we can compute that w.r.t. the input x as well:
$$\nabla_x e = -w_y,$$
$$\nabla_x L_{\text{perceptron}}([x, y], \theta) = \nabla_x e([x, y], \theta) - \nabla_x e([x, \hat{y}], \theta) = -w_y + w_{\hat{y}}. \tag{2.47}$$
Similarly to the weight matrix and bias vector above, if we update the input x following the opposite of this direction, we can lower the energy value associated with the correct outcome y while increasing that associated with the incorrectly-predicted outcome ŷ. Although this is generally useless with a linear energy
function, as we discussed just now, this is an interesting thought experiment, as
this tells us that we can solve the problem either by adapting the parameters,
i.e. the weight matrix and bias vector, or by adapting the input data points
themselves. The latter sounds like an attractive alternative, because it would
break us free from being constrained by the linearity of the energy function.
There is however a major issue with the latter alternative. That is, we do not
know how to change the new input in the future (not included in the training
set), since such a new input may not come together with the associated correct
outcome. We thus need to build a system that predicts what the altered input
would be given a new input in the future.
To overcome this issue, we start by using some transformation h of the
input x, with its own parameters θ′ , instead of the original input x. That is,
h = F (x, θ′ ). Analogously, we refer to the newly updated input by ĥ. We obtain
ĥ by following the gradient direction from Eq. (2.47). We now define a new
energy function e′ such that the combination (h, ĥ) is assigned a lower energy
than the other combinations if h and ĥ are close to each other. Under this energy
function, the energy is low if this transformation of the input h = F(x, θ′) is similar to the updated input ĥ. This intuitively makes sense, since ĥ is the desirable transformation of the input x, as it lowers the overall loss function above.

4 When it is clear that there is no latent (unobserved) variable z, I will skip ∅ for brevity.
A typical example of such an energy function would be
$$e'([h, \hat{h}], \theta') = \frac{1}{2}\Big\|\underbrace{\sigma(U^\top x + c)}_{=h} - \hat{h}\Big\|^2, \tag{2.48}$$
where U and c are the weight matrix and bias vector, respectively, and σ is an
arbitrary nonlinear function. h = σ(U ⊤ x + c) would be some transformation of
the input x, as described above.
The loss function in this case can simply be the energy function itself:
$$L([x, \hat{h}], \theta') = e'([h, \hat{h}], \theta'). \tag{2.49}$$
The gradient of the loss function w.r.t. the transformation matrix U is then
$$\nabla_U = x\big((h - \hat{h}) \odot h'\big)^\top, \tag{2.50}$$
where
$$h' = \sigma'(U^\top x + c) \tag{2.51}$$
with $\sigma'(a) = \frac{\partial \sigma}{\partial a}(a)$, according to the chain rule of derivatives. ⊙ denotes element-wise multiplication. Similarly, the gradient w.r.t. the bias vector c is
$$\nabla_c = (h - \hat{h}) \odot h'. \tag{2.52}$$
Looking at one element of $\nabla_U$ at a time,
$$\frac{\partial}{\partial u_{ij}} = x_i (h_j - \hat{h}_j) h'_j = (x_i h_j - x_i \hat{h}_j) h'_j. \tag{2.53}$$
We already know what h′j does: it decides whether the slope was positive or
negative, and thereby whether the update direction should flip. Because we
follow the opposite direction (since we want to lower the energy), the first term
xi hj is subtracted from uij . This term tells us how strongly the value of xi is
reflected on the value of hj . Since hj is now being updated away, the effect of
xi on the j-th dimension of the transformation via uij must be reduced. On the
other hand, the second term xi ĥj does the opposite. It states that the effect of
xi on the j-th dimension of the transformation, according to the newly updated
value ĥj , must be reflected more on uij . If the new value of the j-th dimension
has the same sign as xi , uij should tend toward the positive value. Otherwise,
it should tend toward the negative value.
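The gradients in Eqs. (2.50)–(2.52) are easy to verify numerically. The sketch below (a toy example with tanh as an arbitrary choice of σ) compares the analytical gradient of e′ w.r.t. one entry of U against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
dim_x, dim_h = 4, 3
U = rng.normal(size=(dim_x, dim_h))
c = rng.normal(size=dim_h)
x = rng.normal(size=dim_x)
h_hat = rng.normal(size=dim_h)            # the "target" updated transformation

sigma = np.tanh                            # an arbitrary choice of nonlinearity
sigma_prime = lambda a: 1.0 - np.tanh(a) ** 2

def energy(U, c):
    return 0.5 * np.sum((sigma(U.T @ x + c) - h_hat) ** 2)

# Analytical gradients from Eqs. (2.50)-(2.52).
a = U.T @ x + c
h = sigma(a)
grad_U = np.outer(x, (h - h_hat) * sigma_prime(a))
grad_c = (h - h_hat) * sigma_prime(a)

# Finite-difference check for one entry of U.
eps = 1e-6
U_pert = U.copy()
U_pert[2, 1] += eps
fd = (energy(U_pert, c) - energy(U, c)) / eps
print(grad_U[2, 1], fd)    # the two numbers should agree closely
```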
We can now imagine a procedure where we alternate between computing
ĥ and updating U and c to match ĥ. Of course, this procedure may not be
optimal, since there is no guarantee (or it is difficult to obtain any guarantee)
that repeatedly updating U and c following the gradient of the second energy
function leads to improvement in the overall loss when h = σ(U ⊤ x + c) is used
in place of the target ĥ. When the second energy function is truly minimized
so that σ(U′⊤x + c′) coincides with ĥ, the loss will be smaller than with the original h = σ(U⊤x + c). It is however unclear whether the loss will be smaller until this
minimum is achieved.
Instead, we can think of a procedure in which we update U and c directly
without producing ĥ as an intermediate quantity. Assume we take just a unit
step to update ĥ:
$$\hat{h} = h - \nabla_h,$$
so that $h - \hat{h} = \nabla_h$ and, plugging this into Eqs. (2.50) and (2.52),
$$\nabla_U = x\,(\nabla_h \odot h')^\top, \qquad \nabla_c = \nabla_h \odot h'.$$
In other words, we can skip computing ĥ and directly compute the gradients of
the loss w.r.t. U and c using the gradient w.r.t. h.
Just like what we did with h (or originally x), we can check how we would
change this new x to minimize the second energy function e′ . This is done by
computing the gradient of e′ w.r.t. x:
$$\nabla_x = U\big((h - \hat{h}) \odot h'\big). \tag{2.58}$$
It is the third time we are discussing it, but yes, we know what h′ does here:
it decides the sign of the update. If we ignore h′ by simply assuming that σ
was e.g. an identity map (which would mean that h′ = 1), we realize that ∇x is a linear transformation of ∇h,5 as
$$\nabla_x = U \nabla_h. \tag{2.60}$$
Recall that
$$h = \sigma(U^\top x + c). \tag{2.61}$$
The forward computation U⊤x above can be thought of as propagating the input signal x via U⊤ to h. In contrast, U∇h can be thought of as back-propagating the error signal ∇h via U to the input x.
You can probably see where we are heading now. Let us replace x once more, this time with z. In other words,
$$h = \sigma(U^\top z + c)$$
and
$$z = \sigma(V^\top x + s).$$
Following the same unit-step argument as before,
$$\hat{z} = z - \nabla_z \tag{2.63}$$
$$\phantom{\hat{z}} = z - U \nabla_h. \tag{2.64}$$
Following exactly the same steps of derivation from above, we end up with
$$\nabla_V = x\,(\nabla_z \odot z')^\top \tag{2.65}$$
$$\nabla_s = \nabla_z \odot z', \tag{2.66}$$
where
$$\nabla_z = U \nabla_h. \tag{2.67}$$
In one single sweep, we could backpropagate the error signal from the loss func-
tion all the way back to x and compute the gradient of the loss function w.r.t.
all the parameters, W, b, U, c, V and s. Of course, in doing so, we had to store
the so-called forward activation vectors, x, z and h, which is often referred to as book-keeping.
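The single sweep can be written out directly. The sketch below (a toy example; it assumes a squared-error loss at the output h and, unlike the simplified exposition above, keeps the σ′ factors rather than assuming an identity σ) book-keeps the forward activations and then backpropagates the error signal to obtain ∇U, ∇c, ∇V, ∇s and even ∇x.

```python
import numpy as np

rng = np.random.default_rng(2)
dim_x, dim_z, dim_h = 5, 4, 3
V, s = rng.normal(size=(dim_x, dim_z)), np.zeros(dim_z)
U, c = rng.normal(size=(dim_z, dim_h)), np.zeros(dim_h)
x = rng.normal(size=dim_x)
target = rng.normal(size=dim_h)        # stand-in target for a squared-error loss at h

sigma, dsigma = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2

# Forward pass: book-keep the pre-activations and activations.
a_z = V.T @ x + s; z = sigma(a_z)
a_h = U.T @ z + c; h = sigma(a_h)
loss = 0.5 * np.sum((h - target) ** 2)

# Backward pass: propagate the error signal from the loss back to x.
grad_h = h - target                     # dL/dh for the squared-error loss
delta_h = grad_h * dsigma(a_h)          # error at the pre-activation of h
grad_U, grad_c = np.outer(z, delta_h), delta_h

grad_z = U @ delta_h                    # error back-propagated through U (cf. grad_z = U grad_h)
delta_z = grad_z * dsigma(a_z)
grad_V, grad_s = np.outer(x, delta_z), delta_z

grad_x = V @ delta_z                    # the error signal reaching the input itself
print(loss, grad_U.shape, grad_V.shape, grad_x.shape)
```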
This process of computing the gradient of the loss function w.r.t. all the parameters across multiple stages of nonlinear transformation of the input is called backpropagation.

5 Whenever it is clear, we will drop some terms for both brevity and clarity. In this case, ∇h is ∇h L(h).
2.3 Stochastic Gradient Descent
When the overall loss is the average (or sum) of the individual loss functions,
we say that the loss is decomposable.
We can view such an overall loss function as computing the expected individual loss function:
$$f(\theta) = \mathbb{E}_i\big[f_i(\theta)\big], \tag{2.69}$$
where the index i may in principle be drawn from any distribution over {1, . . . , N}, although we will for now stick to the uniform indexing over the training set. With Eq. (2.69), we also get
$$\nabla f = \mathbb{E}_i[\nabla f_i], \tag{2.71}$$
because the expectation over a finite, discrete random variable can be written down using a finite sum.
There are two constants we should consider when deciding how we are going
to minimize f w.r.t. θ. They are the number of training examples N and the
number of parameters dim(θ) (when it is not confusing, we will use dim(θ) and |θ| interchangeably). Let us start with the latter, |θ|. If the number of parameters is large, we cannot expect to compute any higher-order derivative information of the function f beyond the first-order derivative, that is, its gradient. Without access to higher-order derivatives, we cannot benefit from advanced optimization algorithms, such as Newton's algorithm. Unfortunately, in modern machine learning, |θ| can be as large as tens of billions, and we are often stuck with first-order optimization algorithms.
If N is large, it becomes increasingly burdensome to compute f, not to mention ∇f, at each update. In other words, we can only expect to use the true gradient of f when there are few training examples, i.e., small N.
In modern machine learning, we are often faced with hundreds of thousands, if
not millions or billions, of training examples, and it is often impossible for us to
exactly compute the overall loss. In short, we are in a situation where we cannot
even use the full, true gradient information to update the parameters.
In order to cope with large N and large |θ|, we often resort to a stochastic gradient estimate rather than the full gradient, where the stochastic gradient is defined as
$$g_i(\theta) = \nabla f_i(\theta),$$
where i was drawn from the uniform distribution over {1, . . . , N}. We then update the parameters using this stochastic gradient estimate by
$$\theta_{t+1} \leftarrow \theta_t - \alpha_t\, g_{i_t}(\theta_t),$$
where $\alpha_t > 0$ is the learning rate at step t.
Since $\|\nabla f(\theta_t)\|^2 \geq 0$, we want to find $\alpha_t$ that maximizes $-\frac{L}{2}\alpha_t^2 + \alpha_t$. We simply compute the derivative of this expression w.r.t. $\alpha_t$ and set it to zero:
$$-L\alpha_t + 1 = 0 \iff \alpha_t = \frac{1}{L}. \tag{2.84}$$
In other words, if we set the learning rate to 1/L (that is, inversely proportional to how rapidly the function changes), we make the most progress at each step. Of course, this does not directly apply to the stochastic case, since the descent lemma does not apply to the stochastic gradient estimate as it is.
Applying the same expansion to the stochastic update instead gives
$$f(\theta_{t+1}) \leq f(\theta_t) - \alpha_t \nabla f(\theta_t)^\top g_{i_t} + \frac{L}{2}\alpha_t^2 \|g_{i_t}\|^2. \tag{2.86}$$
We are interested in the expected progress here over $i_t \sim \mathcal{U}(1, \ldots, N)$:
$$\mathbb{E}\big[f(\theta_{t+1})\big] \leq f(\theta_t) - \alpha_t \nabla f(\theta_t)^\top \mathbb{E}[g_{i_t}] + \frac{L}{2}\alpha_t^2\, \mathbb{E}\|g_{i_t}\|^2 \tag{2.87}$$
$$= f(\theta_t) - \underbrace{\alpha_t \|\nabla f(\theta_t)\|^2}_{=(a)} + \underbrace{\frac{L}{2}\alpha_t^2\, \mathbb{E}\|g_{i_t}\|^2}_{=(b)}, \tag{2.88}$$
convergence rate(s) of stochastic gradient descent are out of the scope of this
course.
In summary, we use stochastic gradient descent in modern machine learning, and with a small learning rate stochastic gradient descent exhibits the descent property in expectation. We will therefore worry less and rely on stochastic gradient descent throughout the course.
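As a minimal illustration, the sketch below (a toy least-squares problem on arbitrary synthetic data) runs plain stochastic gradient descent, drawing one training example uniformly at random per update.

```python
import numpy as np

rng = np.random.default_rng(3)
N, dim = 1000, 10
X = rng.normal(size=(N, dim))
theta_true = rng.normal(size=dim)
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(dim)
alpha = 0.01                                # a small, fixed learning rate

for t in range(20000):
    i = rng.integers(N)                     # i_t ~ U(1, ..., N)
    # Per-example loss f_i(theta) = 0.5 * (x_i^T theta - y_i)^2,
    # so the stochastic gradient is (x_i^T theta - y_i) * x_i.
    g = (X[i] @ theta - y[i]) * X[i]
    theta -= alpha * g                      # follow the negative stochastic gradient

print(np.linalg.norm(theta - theta_true))   # should be small
```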
The regret must grow sub-linearly, i.e., R(T) = o(T), since linear growth, i.e., R(T) = O(T), implies that the learning algorithm is not converging toward the optimal solution (or its associated minimum value).
We (try to) achieve this goal by finding an appropriate update rule that maps θt−1 and gt to θt. In doing so, it is relatively straightforward to think of the following simplified framework that generalizes stochastic gradient descent:
$$\theta_t \leftarrow \theta_{t-1} + \eta_t \odot g_t, \tag{2.90}$$
practically by taking into account the loss function landscape, that is, how the
loss changes w.r.t. the parameters, more carefully.
Adagrad [Duchi et al., 2011]. For each parameter $\theta^i$, the magnitude of the partial derivative of the loss, $(g_t^i)^2$, tells us how sensitive the loss value was to the change in $\theta^i$. Another way is to view it as the impact of the change in $\theta^i$ on the loss. By accumulating this over time, $\sum_{t'=1}^{t} (g_{t'}^i)^2$, we can measure the overall impact of $\theta^i$ on the loss. We can then normalize each update inversely proportionally to this accumulated quantity, in order to ensure that each and every parameter has a more or less equal impact on the loss. That is,
$$\theta_t \leftarrow \theta_{t-1} + \begin{bmatrix} \dfrac{1}{\sqrt{\sum_{t'=1}^{t} (g_{t'}^1)^2}} \\ \vdots \\ \dfrac{1}{\sqrt{\sum_{t'=1}^{t} (g_{t'}^{|\theta|})^2}} \end{bmatrix} \odot g_t. \tag{2.91}$$
The regret of Adagrad is $O(\sqrt{T})$, just like that of SGD, assuming $\|g_t\| \ll \infty$.
It however often decreases faster especially when many parameters are inconse-
quential (sparse) and/or quickly learned (because the accumulated magnitude
rapidly grows and its inverse converges to zero quickly.)
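A per-parameter implementation of Eq. (2.91) is short. The sketch below (a toy example; it follows the usual convention of subtracting the gradient of the loss, and adds a small ε to avoid division by zero on the first steps) is written as a drop-in update function.

```python
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.1, eps=1e-8):
    """One Adagrad step: accumulate squared gradients and scale each
    coordinate of the update by the inverse square root of its accumulator."""
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

# toy usage: minimize f(theta) = 0.5 * ||theta||^2
theta = np.array([5.0, -3.0, 0.5])
accum = np.zeros_like(theta)
for _ in range(500):
    grad = theta                      # gradient of 0.5 * ||theta||^2
    theta, accum = adagrad_update(theta, grad, accum)
print(theta)                          # every coordinate has shrunk toward zero
```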
$$\theta_t^i \leftarrow \theta_{t-1}^i + \frac{g_t^i}{\sqrt{v_t^i + \epsilon}}, \tag{2.93}$$
where ϵ > 0 is a small scalar to prevent the degenerate case.
Adam furthermore uses exponential smoothing to reduce the variance of the
gradient estimate as well:
$$\theta_t^i \leftarrow \theta_{t-1}^i + \alpha\, \frac{m_t^i}{\sqrt{v_t^i + \epsilon}}. \tag{2.95}$$
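Combining the two exponential moving averages gives the Adam update. The sketch below (a toy example, written with the conventional descent sign so the loss decreases) also includes the bias-correction terms that are commonly used in practice but omitted in the simplified presentation above.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: exponential moving averages of the gradient (m) and of
    its element-wise square (v), with the usual bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# toy usage: minimize f(theta) = 0.5 * ||theta||^2
theta = np.array([5.0, -3.0, 0.5])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_update(theta, theta, m, v, t)
print(theta)    # every coordinate has moved toward zero
```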
2.4 Generalization and Model Selection
Unfortunately, this expected risk is not computable, and we only have access
to a sample-based proxy to the expected risk, called the empirical risk:
$$\hat{R}(\theta) = \frac{1}{N}\sum_{n=1}^N L([x^n, y^n], \theta). \tag{2.97}$$
For brevity and clarity, let $S_n = \sum_{k=1}^n L([x^k, y^k], \theta)$. Then, we can express these risks as
$$R(\theta) = \mathbb{E}_{\text{data}\times\cdots\times\text{data}}\Big[\frac{1}{N} S_N\Big], \quad\text{and}\quad \hat{R}(\theta) = \frac{1}{N} S_N. \tag{2.98}$$
The former holds because each instance (x, y) is drawn independently from the
same data distribution.
8 This is certainly not true in reality but is a reasonable starting point. We will discuss later
in the course what we can do if this assumption does not hold, hopefully if time permits.
$$p\big(|R(\theta) - \hat{R}(\theta)| \geq \epsilon\big) \leq 2\exp\Bigg(-\frac{2(N\epsilon)^2}{\sum_{n=1}^N (1 - 0)^2}\Bigg) = 2\exp\big(-2N\epsilon^2\big). \tag{2.99}$$
This inequality tells us that the gap between the expected and empirical risks
shrinks exponentially with N , the number of training examples we use to com-
pute the empirical risk. This inequality applies to any θ, implying that this
convergence of the empirical risk toward the expected risk is uniform over the
parameter space (or the corresponding classifier space.) Such uniform conver-
gence is nice in that we do not have to worry about how well learning works
(that is, what kind of solution we end up with after optimization), in order to
determine how much deviation we would anticipate between the empirical risk
(the one we can compute) and the expected risk at any θ. On the other hand,
there is a big question of whether we actually care about most of the parameter
space; it is likely that we do not and we only care about a small subset of the
parameter space over which iterative optimization, such as stochastic gradient
descent, explores. We will discuss this a bit more later, but for now, let’s assume
we are happy with this uniform convergence.9
Let’s imagine that someone (or some learning algorithm) gave me θ that is
supposed to be good with a particular empirical risk R̂(θ). Is there any way
for me to check how much worse the expected risk R(θ) would be, based on
the Hoeffding’s inequality above? Of course, such a statement would have to be
probabilistic, since we are working with random variables, R(θ) and R̂(θ).
The inequality above allows us to express that
$$|R(\theta) - \hat{R}(\theta)| < \epsilon = \sqrt{\frac{\log\frac{2}{\delta}}{2N}}$$
with probability at least 1 − δ. Be aware that the direction of the inequality
has flipped.
If |R(θ) − R̂(θ)| < ϵ, we know that R(θ) < R̂(θ) + ϵ. We are interested in
this latter inequality, because we want to upper-bound the expected (true) risk.
If the true risk was lower than the empirical risk, we are happy and do not care
about it. We want to know if we were to be unhappy (that is, the expected risk
was greater than the empirical risk), how unhappy we would be in the worst
case.
Because we want to make such a statement with the probability of at least
This is somewhat obvious, because a pair (ei , ej ) may not be mutually exclusive.
Think of a Venn diagram. With this, we want to compute
$$p\Big(\bigcup_{\theta \in \Theta} |R(\theta) - \hat{R}(\theta)| \geq \epsilon\Big) \leq \sum_{\theta \in \Theta} p\big(|R(\theta) - \hat{R}(\theta)| \geq \epsilon\big) \leq \underbrace{2|\Theta| \exp\big(-2N\epsilon^2\big)}_{=2\exp(\log|\Theta| - 2N\epsilon^2)}. \tag{2.107}$$
We can follow exactly the same logic as above:
$$2\exp\big(\log|\Theta| - 2N\epsilon^2\big) = \delta \tag{2.108}$$
$$\iff \epsilon = \sqrt{\frac{\log|\Theta| - \log\frac{\delta}{2}}{2N}}. \tag{2.109}$$
This makes sense, as the generalization bound now depends on the size of Θ,
our hypothesis space. If the hypothesis set is large, there is a greater chance of
us finding a solution that is good empirically R̂(θ) ↓ but is on expectation very
bad R(θ) ↑. This also implies that N (the number of training examples) only needs to grow logarithmically w.r.t. the size of the hypothesis space Θ to keep this gap fixed; equivalently, the hypothesis space may grow exponentially with N.
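To get a feel for the numbers, the tiny sketch below (my own; δ = 0.05 is an arbitrary choice) evaluates the bound of Eq. (2.109) for a few combinations of N and |Θ|.

```python
import numpy as np

def epsilon_bound(n_examples, hypothesis_count, delta=0.05):
    # Eq. (2.109): epsilon = sqrt((log|Theta| - log(delta/2)) / (2N)).
    return np.sqrt((np.log(hypothesis_count) - np.log(delta / 2.0)) / (2.0 * n_examples))

for n in (100, 10_000, 1_000_000):
    for size in (10, 10**6):
        print(n, size, round(float(epsilon_bound(n, size)), 4))
# The gap shrinks like 1/sqrt(N) and grows only with log|Theta|.
```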
This bound only works with a finite-size hypothesis set Θ without favouring
any particular parameter configuration. In order to work with an infinitely large
hypothesis set, we must come up with different approaches. For instance, the
Vapnik–Chervonenkis (VC) dimension can be used to bound the complexity
of the infinitely large hypothesis set [Vapnik and Chervonenkis, 1971]. Or, we
can use the PAC-Bayes bound, where a prior distribution over the (potentially
infinitely large) hypothesis set is introduced [McAllester, 1999]. These are all
out of the scope of this course, but we briefly touch upon the idea of the PAC-Bayes bound here before ending this section.
with probability at least 1 − δ. Although this inequality looks quite dense, these
terms are extremely descriptive, once we define and learn how to read them.
First, R(Q) and R̂(Q) are defined analogously to R(θ) and R̂(θ), except that
we marginalize out θ using the so-called posterior distribution Q(θ). That is,
prior belief. This effect will however vanish rapidly as the number of training examples increases, due to the $\frac{1}{N}$ factor.
We can read two things from the second term $\frac{1}{N}\log\frac{N+1}{\delta}$. Because δ is in the denominator, we know that we would potentially get a greater discrepancy if we want a stronger guarantee, that is, δ → 0. $\frac{\log(N+1)}{N}$ vanishes toward 0 as the data size increases, i.e. N → ∞. The rate of this convergence is however quite slow, i.e. sublinear.
Similarly to what we did earlier, we can turn this inequality in Eq. (2.110)
into a generalization bound. In particular, we use Pinsker's inequality. In our case with Bernoulli random variables, we get
$$\big(\hat{R}(Q) - R(Q)\big)^2 \leq \frac{1}{2} D_{\mathrm{KL}}\big(\mathcal{B}(\hat{R}(Q))\,\|\,\mathcal{B}(R(Q))\big). \tag{2.113}$$
Then,
$$\big|\hat{R}(Q) - R(Q)\big| \leq \sqrt{\frac{1}{2N}\Big(D_{\mathrm{KL}}(Q\|P) + \log\frac{N+1}{\delta}\Big)}. \tag{2.114}$$
Unlike the earlier generalization bound, and its variants, this PAC-Bayesian
bound provides us with more actionable insights. First, we want the posterior
distribution Q to be good in that it results in a lower empirical risk on average. It
sounds obvious, but the earlier generalization bound was designed to work with
any parameter configuration (uniform convergence) and did not tell us what
it means to choose a good parameter configuration. With the PAC-Bayesian
bound, we already know that we want to choose the parameter configuration so
that the empirical risk is low on average. In other words, we should use a good
learning algorithm.
The posterior distribution Q however cannot be too far away from where we started, as the bound is a function of the discrepancy between Q and our prior belief P. Flipping the coin around, it also states that we must choose
our prior P so that it puts high probabilities on parameter configurations that
are likely to be probable under the posterior distribution Q. In other words, we
want to ensure that we need a minimum amount of work to go from P to Q, in
order to minimize the generalization bound.
In summary, the PAC-Bayesian bound tells us that we should have some
good prior knowledge of the problem and that we should not train a predictive
model too much, thereby ensuring that the posterior distribution Q stays close
to the prior distribution P . This will ensure that the expected risk does not
deviate too much from the empirical risk.
There are so many terms we need to consider in this equation, but we will
consider them one at a time, from the back. First, let us start with (µy − µ̂y )2 .
This term (c) tells us how well our learner captures the mean of the true output y. This term does not care about how much variance there is, either under the data distribution pdata(y|x) or under the model distribution q(θ). It only cares about getting the outcome correct on average. This term is referred to as the bias. When this term (c) is zero, we call our predictor unbiased.
The second term from the back, which is zero, is the (negative) covariance
between the true outcome y and the predicted one ŷ(x, θ), both of which are
random variables. Because we did not assume anything about q(θ), in general
we cannot assume θ is in any way correlated with y|x, implying that there should
not be any covariance. We can ignore this term.
Let us continue with the two remaining terms, (a) and (b). The first term (a)
is the variance of the true outcome y. This reflects the inherent uncertainty present in
the true outcome given an input x. This inherent uncertainty cannot be reduced,
since it is not what we control but is given to us by the nature of the problem
we are tackling. If this quantity is large, there is only so much we can do. We
often refer to this as aleatoric uncertainty or irreducible uncertainty.
The second term (b) is also uncertainty, as it measures the variance arising
from the uncertainty in the model parameters. This uncertainty is however con-
trollable and thereby reducible with efforts, since it arises from our uncertainty
q(θ) in choosing the parameters θ. When the model is simpler, we tend to have a
better grasp at learning and can reduce this reducible (or epistemic) uncertainty
greatly. When the model is complex and thereby exhibits many symmetries that
must be broken arbitrarily, it is difficult (if not impossible) to reduce this epis-
temic uncertainty much. This term is often referred to as variance.
It should be quite clear at this point that there must be some inherent trade-off between the bias (c) and the variance (b). The more complex a classifier is, the higher the variance we end up with, but due to its complexity, it would be
able to fit data well, resulting in a lower bias. When a classifier is simple, the
variance will be lower, but the bias will be higher. Learning can thus be thought
of as finding a good balance between these two competing quantities.10
The explanation above is slightly different from the usual way in which the bias-variance tradeoff is described [Wikipedia contributors, 2023]. In particular, we are considering a generic distribution q(θ) that may or may not be directly related to any particular training dataset, whereas the conventional approach often sticks to a strong dependence on the training dataset and the distribution over the training set. This is a minor difference, but it can come in handy when we start thinking about more exotic ways by which we come up with q(θ). If time and space permit later in the course, we may learn one or two techniques that involve such exotic setups, such as transfer learning and multi-task learning.
10 I must emphasize here that the complexity of a classifier is not easy to quantify. When I
speak of ‘complex’ or ‘simple’ here, I am referring to this mythical measure of the classifier’s
complexity and do not mean that we can compute it easily.
$\bar{l}_N$ is a random variable that refers to the average loss computed over the N examples. In other words, with larger N, we expect that the average accuracy we get from considering N examples is centered at the true average µ with variance $\frac{\sigma^2}{N}$. So, the larger N, the more confident we are that the
sample average does not deviate too much from the true average. With small
N , however, we cannot be confident that our sample average accuracy is close
enough to the true average, and this lack of confidence is proportional to the
true variance underlying the accuracy. Unfortunately, we do not have access
to the true variance of the accuracy but often can get a rough sense of it by
considering the sample variance.
If N is large, we can compute the confidence interval11 and use it to compare
against another classifier or your prior expectation on the accuracy. For instance,
because the accuracy estimate converges to the normal distribution, we can use
the so-called t-test, since the difference between the true mean and the mean of
the estimate converges toward the Student’s t distribution. In that case, the
confidence interval for the binary accuracy (simply 1 − l⋆ , where l⋆ is the true
loss of the classifier) is given by
" r r #
¯lN (1 − ¯lN ) ¯lN (1 − ¯lN )
¯
CI ≈ (1 − lN ) − Z ¯
, (1 − lN ) + Z , (2.129)
N N
11 The confidence interval for a quantity with the confidence level γ means that if we repeat
the process of inferring the target quantity and measure the confidence interval, the true target
quantity would be included in the confidence interval proportional to γ.
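As a concrete illustration of Eq. (2.129), the sketch below (a toy example; Z ≈ 1.96 corresponds to the 95% level of the normal approximation) computes the interval for a classifier evaluated on N held-out examples.

```python
import numpy as np

def accuracy_confidence_interval(mean_loss, n_examples, z=1.96):
    """95% interval for the binary accuracy 1 - mean_loss, following Eq. (2.129).
    z = 1.96 corresponds to the 95% confidence level of the normal approximation."""
    acc = 1.0 - mean_loss
    half_width = z * np.sqrt(mean_loss * (1.0 - mean_loss) / n_examples)
    return acc - half_width, acc + half_width

print(accuracy_confidence_interval(mean_loss=0.12, n_examples=100))     # wide
print(accuracy_confidence_interval(mean_loss=0.12, n_examples=10_000))  # much narrower
```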
where (·)j refers to the j-th element of the vector, because the zero-one loss is
invariant to the multiplicative scaling of the energy value. A pair of co-linear features is defined to have a linear relationship given the target outcome. Imagine
that
xj = αxi , (2.131)
when y = c. We then say that (xi , xj ) are co-linear given y = c. In this case,
the following two energy functions are equivalent:
0 , . . . , wc,|x| x − bc ,
e([x, c], θ) = − wc,1 , . . . , wc,i , . . . , |{z} (2.132)
=wc,j
1
e′ ([x, c], θ′ ) = − wc,1 , . . . , |{z}
0 , . . . , wc,i , . . . , wc,|x| x − bc , (2.133)
α
=wc,i
probability γ. Let γ = 1 − α for convenience. Then, we are looking for an interval [l, u]:
$$p(\bar{l}_N \leq l \mid \mathcal{D}, \mathcal{D}') = \frac{\alpha}{2} \quad\text{and}\quad p(\bar{l}_N \geq u \mid \mathcal{D}, \mathcal{D}') = \frac{\alpha}{2}. \tag{2.135}$$
This credible interval is reasonable when $p(\bar{l}_N | \mathcal{D}, \mathcal{D}')$ is unimodal, but this may
not be the case. The probability density may be concentrated in two well-
separated sub-regions, in which case this credible interval would be unnecessarily
wide and uninformative.
In that case, we can try to define a credible region C, which may not be contiguous. The credible region is defined to satisfy
$$\int_{\bar{l}_N \in C} p(\bar{l}_N | \mathcal{D}, \mathcal{D}')\, \mathrm{d}\bar{l}_N = \gamma, \tag{2.136}$$
$$p(\bar{l}_N | \mathcal{D}, \mathcal{D}') \geq p(\bar{l}'_N | \mathcal{D}, \mathcal{D}') \quad \text{for all } \bar{l}_N \in C \text{ and } \bar{l}'_N \notin C. \tag{2.137}$$
examples with replacement, (2) compute the sample statistics of interest and
(3) repeat (1-2) M times. In step (2), we can split the resampled set into the
resampled training set and the resampled test set. We use the resampled training
set to train a model and then the resampled test set to evaluate the trained
model to obtain $\bar{l}^{(m)}_{|\mathcal{D}'|}$. After M such iterations, we end up with $\big\{\bar{l}^{(m)}_N\big\}_{m=1}^M$.
These sample statistics then serve as a set of samples drawn from $p(\bar{l})$, allowing us to get a good sense of how the proposed learning algorithm works on this particular problem (not just a particular dataset).
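The resampling procedure above takes only a few lines. In the sketch below (a toy example; the nearest-class-mean classifier merely stands in for an arbitrary learning algorithm), we obtain M bootstrap estimates of the test accuracy.

```python
import numpy as np

def bootstrap_accuracies(X, y, train_fn, eval_fn, n_rounds=200, seed=0):
    """(1) resample the dataset with replacement, (2) split it into a resampled
    training and test set, train and evaluate, (3) repeat M times."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    for _ in range(n_rounds):
        idx = rng.integers(n, size=n)              # resample with replacement
        tr, te = idx[: n // 2], idx[n // 2 :]      # resampled train / test split
        model = train_fn(X[tr], y[tr])
        stats.append(eval_fn(model, X[te], y[te]))
    return np.array(stats)

# toy learner: nearest class-mean classifier on a synthetic binary problem
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-1, 1, size=(100, 2)), rng.normal(1, 1, size=(100, 2))])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

train = lambda X, y: np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
evaluate = lambda means, X, y: np.mean(
    np.argmin(((X[:, None, :] - means[None]) ** 2).sum(-1), axis=1) == y
)

accs = bootstrap_accuracies(X, y, train, evaluate, n_rounds=200)
print(accs.mean(), np.percentile(accs, [2.5, 97.5]))   # spread across resampled datasets
```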
There are many ways to characterize the uncertainty in evaluating how well
any learning algorithm works. Although we have considered a few aspects of un-
certainty we should consider in this section, there are many more ways to think
of this problem. For instance, if we want to compare two learning algorithms,
how should we take into account the uncertainty? If there is uncertainty in my
learning algorithm, is there a better way to benefit from this uncertainty? We
will touch upon some of these questions in the rest of the course.
In other words, learning is the process of minimizing the empirical risk. This
learning process is however not only a function of data D but also of the hyper-
parameters λ and noise ϵ.
We now need to find the right set of hyperparameters. What should the objective function be here? We can use a separate dataset $\mathcal{D}_{\text{val}}$, with $\mathcal{D}_{\text{val}} \cap \mathcal{D} = \emptyset$, called a validation set, to measure how good each hyperparameter set is:
$$\text{Tune}(\mathcal{D}_{\text{val}}, \mathcal{D}; \epsilon') = \arg\min_\lambda \mathbb{E}_\epsilon\Big[\hat{R}\big(\text{Learn}(\mathcal{D}; \lambda, \epsilon);\, \mathcal{D}_{\text{val}}\big)\Big]. \tag{2.140}$$
This hyperparameter tuning process is a function of both the training and validation sets as well as some source of noise ε′.
We can then obtain the final model by
$$\hat{\theta} = \text{Learn}(\mathcal{D}; \text{Tune}(\mathcal{D}_{\text{val}}, \mathcal{D}; \epsilon'), \epsilon), \tag{2.141}$$
or
There are many different ways to approximate this quantity, such as forward-
mode automatic differentiation as well as the implicit function theorem. Neverthe-
less, this quantity is ultimately a fairly expensive quantity to compute due to
many factors including the ever-increasing dataset size |D| and thereby the
ever-increasing optimization cost of learning.
It is thus more usual to treat hyperparameter optimization as a black-box
optimization problem, where we can evaluate the outcome (that is, the loss
computed on the validation set) of a particular hyperparameter combination
but cannot access anything else of this learning process.
Random search is one of the most widely used black-box optimization based
approaches to hyperparameter optimization. In random search, we start by
defining a prior distribution p(λ) over the hyperparameters λ. We draw K sam-
ples from this prior distribution, {λ1 , . . . , λK }, and in parallel evaluate them by
training a model using each of these sampled hyperparameters. We then pick the
best hyperparameter based on the validation risk, rk = R̂(Learn(D; λk , ϵ); Dval ).
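A bare-bones version of random search is sketched below (a toy example; the log-uniform prior and the smooth stand-in for the validation risk are arbitrary choices made so that the sketch is self-contained).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparameters():
    """Draw one configuration lambda from a simple prior p(lambda)."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform prior
        "weight_decay": 10 ** rng.uniform(-6, -2),
    }

def validation_risk(lam):
    """Stand-in for R_hat(Learn(D; lambda, eps); D_val): in practice we would
    train a model with these hyperparameters and return its validation risk.
    Here it is faked with a smooth function plus noise."""
    return (np.log10(lam["learning_rate"]) + 2.5) ** 2 + \
           (np.log10(lam["weight_decay"]) + 4.0) ** 2 + rng.normal(scale=0.05)

K = 32
candidates = [sample_hyperparameters() for _ in range(K)]
risks = [validation_risk(lam) for lam in candidates]     # evaluated in parallel in practice
best = candidates[int(np.argmin(risks))]
print(best, min(risks))
```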
Instead of simply picking the best one, one can update the prior over the
hyperparameters based on
as well, but this process tends to be too expensive computationally to be practical, since we
need to repeatedly train many new models for the purpose of optimization.
p(r|λ, θ) is a model that predicts the output r given the hyperparameter con-
figuration λ, using the parameters θ. See Eq. (6.64) and surrounding discussion
on how to create such a model. The expected improvement of a hyperparameter
configuration λ is then defined as
where
to narrow down the space by making the density concentrate locally around the best hyperparameter so far:
Once we have found the best hyperparameter configuration, we train the final model on the training set D to obtain our final model parameters θ̂. How well would it
work?
Unfortunately, we cannot use the validation risk, as that was the objective
by which θ̂ was selected. Meanwhile, when this model is deployed in the wild,
the world will not be so kind and a set of examples thrown at this model will
not be so perfect for the model. We thus need another set, called the test set,
Dtest in order to check the test accuracy. This set must be separate from both
the training and validation sets, and we can report the risk on this set as is, or
we can report more statistics, as we discussed earlier in §2.4.3.
Chapter 3

Building Blocks of Neural Networks

Earlier in §2.2.2, we talked about how general the transformation F(x; θ) can be. As an example back then, we considered
$$F^\sigma_{\text{linear}}(x; \theta) = \sigma(U^\top x + c), \tag{3.1}$$
or a linear layer.
3.1 Normalization
Let us consider the simple squared energy function from Eq. (6.64), with an identity nonlinearity:
$$e'([x, y], (u, c)) = \frac{1}{2}\big(u^\top x + c - y\big)^2. \tag{3.3}$$
We will further assume that y is a scalar and thereby u is a vector rather than
a matrix.
The overall loss is then
$$J(\theta) = \frac{1}{N}\sum_{n=1}^N e'([x_n, y_n], (u, c)) = \frac{1}{2N}\sum_{n=1}^N \big(u^\top x_n + c - y_n\big)^2. \tag{3.4}$$
So far, there is nothing different from our earlier exercises. We now consider
the Hessian of the loss:
" P #
N PN
1
N n=1 xn x⊤
n
1
N n=1 xn
H= 1
PN . (3.7)
N n=1 xn 1
The Hessian matrix tells us about the curvature of the objective function and
directly relates to the difficulty of optimization by a gradient-based approach. In
particular, gradient-based optimization is more challenging when the condition
number is larger, where the condition number is defined as
$$\kappa = \frac{|\max_i \lambda_i(H)|}{|\min_i \lambda_i(H)|} \geq 1, \tag{3.8}$$
where $\lambda_i(H)$ is the i-th eigenvalue of H. In other words, we want the eigenvalues of the Hessian to be similar to each other, for such an iterative optimization algorithm to work well. For more rigorous discussion, refer to your
favourite convex optimization book [Nocedal and Wright, 2006].
Based on this definition, an identity matrix has the minimal condition number. In other words, we can transform the Hessian matrix into the identity matrix, in order to facilitate gradient-based optimization [LeCun et al., 1998].
In this particular case, because the Hessian matrix does not depend on θ but
only on the observations xn ’s, we can simply transform the input in advance by
$$\text{(1)} \quad x_n \leftarrow x_n - \frac{1}{N}\sum_{n'=1}^N x_{n'} \qquad \text{(centering)} \tag{3.9}$$
$$\text{(2)} \quad x_n \leftarrow \Big(\frac{1}{N}\sum_{n'=1}^N x_{n'} x_{n'}^\top\Big)^{-\frac{1}{2}} x_n \qquad \text{(whitening)} \tag{3.10}$$
This will result in the identity Hessian matrix, improving the convergence of
gradient-based optimization.
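In NumPy, the two transformations of Eqs. (3.9)–(3.10) can be written as below (a toy sketch; the inverse matrix square root is computed via an eigendecomposition, with a small ridge term added for numerical safety).

```python
import numpy as np

def center_and_whiten(X, ridge=1e-8):
    """X has one example per row. Returns centered, whitened examples whose
    second-moment matrix is (approximately) the identity."""
    X = X - X.mean(axis=0, keepdims=True)                 # Eq. (3.9): centering
    C = (X.T @ X) / X.shape[0]                            # (1/N) sum_n x_n x_n^T
    eigval, eigvec = np.linalg.eigh(C)
    inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(eigval + ridge)) @ eigvec.T
    return X @ inv_sqrt                                   # Eq. (3.10): whitening

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0, 0], [0.5, 1.0, 0], [0, 0, 0.2]])
Xw = center_and_whiten(X)
print(np.round((Xw.T @ Xw) / Xw.shape[0], 2))             # close to the identity matrix
```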
Such normalization is key to successful optimization, but it is challenging to apply exactly in practice, as the Hessian matrix is often non-stationary when we train a deep neural network. The Hessian matrix changes as we update the model parameters, and there is no tractable way to turn the Hessian matrix into the identity matrix. Furthermore, it is even more challenging to invert this Hessian matrix. It has however turned out that normalizing (a weaker version of whitening) the input to each block helps learning. Such normalization can itself be considered a building block, and let us look at a few widely used ones here.
Batch normalization [Ioffe and Szegedy, 2015]. This is one of the building blocks that sparked the revolution in deep learning, greatly facilitating learning:
$$\text{bn}(h) = \frac{h - \mu}{\sigma},$$
where µ and σ² are the mean and diagonal covariance of the input to this block, and the division is element-wise. Because the inverse of a full covariance matrix, which is often similar to the Hessian matrix up to an additive term, is costly, we only consider the diagonal of the covariance matrix, which is readily invertible.
Instead of using the full training set to estimate µ and σ 2 , which will be
prohibitively expensive, we use the minibatch at each update during training to
get stochastic estimates of these two quantities. This practice is perfectly fine
during training but it becomes problematic when the model is deployed, as the
model will receive one example at a time. With a single example, we cannot estimate either µ or σ², or if we do, it will simply subtract out the input in its
entirety. It is a usual practice instead to either fully re-estimate µ and σ 2 using
the full training set once training is over or keep the running estimates of µ and
σ 2 during training and use them after training is over.
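A minimal forward pass of batch normalization is sketched below (a toy example; the learnable scale and shift that typically follow the normalization are omitted).

```python
import numpy as np

def batch_norm_forward(H, eps=1e-5):
    """H holds one minibatch, one example per row. Normalize each feature
    using the minibatch mean and (diagonal) variance."""
    mu = H.mean(axis=0)               # per-feature mean over the minibatch
    var = H.var(axis=0)               # per-feature (diagonal) variance
    return (H - mu) / np.sqrt(var + eps), mu, var

rng = np.random.default_rng(0)
H = 3.0 + 2.0 * rng.normal(size=(64, 5))       # a minibatch of pre-activations
H_norm, mu, var = batch_norm_forward(H)
print(np.round(H_norm.mean(axis=0), 3), np.round(H_norm.std(axis=0), 3))

# At deployment one example arrives at a time, so mu and var are replaced by
# running (or re-estimated) statistics kept from training, as described above.
```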
In other words, we apply the k-th filter f k at each position t to check how similar
(in the sense of dot product) the signal centered at t is to the filter f k .
Another way to write it down is
$$h_t = \sum_{m' = -M}^{M} F_{m' + M + 1}\, x_{t + m'}, \tag{3.16}$$
where
$$F_m = \begin{bmatrix} f^1_m \\ f^2_m \\ \vdots \\ f^K_m \end{bmatrix} \in \mathbb{R}^{d \times K} \tag{3.17}$$
with d = |x_t|. The full parameters of this 1-D convolution block can be summarized as a 3-D tensor of size d × K × (2M + 1).
It is pretty straightforward to see that this operation is translation equivari-
ant. If we shift every xt by δ, the resulting ht will shift by δ without any impact
on its computed value. Unfortunately, in practice, this does not hold perfectly,
as we do not work with an infinitely long sequence. We must decide how to
handle the boundaries of the sequence with a finite-length sequence, and this
choice will impact the degree of translation equivariance near the boundaries.
Detailed discussion on how we handle boundaries is out of the scope of this
course, though.
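A direct, if inefficient, implementation of the 1-D convolution in Eq. (3.16) for a finite sequence is sketched below (a toy example; zero-padding is one arbitrary choice among the boundary handlings mentioned above).

```python
import numpy as np

def conv1d(x, filters):
    """x: sequence of length T with d-dimensional items, shape (T, d).
    filters: K filters of width 2M+1, shape (K, 2M+1, d).
    Returns h of shape (T, K); h[t, k] is the dot product between filter k
    and the signal centred at position t (zero-padded at the boundaries)."""
    T, d = x.shape
    K, width, _ = filters.shape
    M = (width - 1) // 2
    x_pad = np.pad(x, ((M, M), (0, 0)))
    h = np.zeros((T, K))
    for t in range(T):
        window = x_pad[t : t + width]                  # x_{t-M}, ..., x_{t+M}
        h[t] = np.einsum("kmd,md->k", filters, window)
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))
f = rng.normal(size=(4, 5, 3))                          # K=4 filters, M=2, d=3
print(conv1d(x, f).shape)                               # (10, 4)

# Shifting x (away from the boundaries) shifts the output by the same amount:
# translation equivariance.
```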
We can now readily extend this 1-D convolution to N -D convolution. For
instance, 2-D convolution would work on an infinitely large image, and 3-D
convolution on an infinitely large-and-long video. Furthermore, we can extend
it by introducing various features, such as a stride. These are also out of the
scope of this course, but I recommend the first half of the classic by LeCun et al.
[1998].
where
This (weighted) linear combination has been shown to effectively address the issue of the vanishing gradient [Bengio et al., 1994], and has become standard practice in machine learning over the past decade or so [He et al., 2016].
These are referred to as the key, query and value vectors of the i-th item $x_i$. For each j-th item $x_j$, we check how compatible it is with the current i-th item $x_i$:
$$\alpha_{ij} = \frac{\exp(q_i^\top k_j)}{\sum_{j'=1}^N \exp(q_i^\top k_{j'})}. \tag{3.26}$$
where the lack of the superscript in the second term means that there is no
nonlinearity. If |hi | = |xi |, it is customary to fix θr = (I, 0). It is usual to add a
layer normalization block after v̂i or at hi , to facilitate optimization.
When implemented in a single block, this block is often referred to as the
(multi-headed) attention block [Bahdanau et al., 2015, Vaswani et al., 2017].
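The computation above fits in a few lines of NumPy. The sketch below (a toy, single-headed example with zero biases, and without the residual connection and layer normalization mentioned above) maps N input vectors to N output vectors.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """X: N items as rows. Each item emits a query, key and value vector;
    item i's output is the sum of all value vectors weighted by the
    compatibility alpha_ij of Eq. (3.26)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    alpha = softmax(Q @ K.T, axis=-1)    # alpha[i, j] = softmax_j(q_i . k_j)
    return alpha @ V

rng = np.random.default_rng(0)
N, d, d_att = 6, 8, 4
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_att)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)    # (6, 4)
```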
overcome in that case. That is, we must ensure that each item in a sequence is
marked with its position.
There are two major approaches to this. The first approach is based on
additive marking. For each position i, let ei be a vector of size |x| and represent
the i-th position. There are many ways to construct this vector, and sometimes
it is even possible to learn this vector from data, although we can only handle
the length seen during training in the latter case. One particular approach is to
use sinusoidal functions so that each dimension of ei captures different rates at
which the position changes. For instance,
$$e_i^d = \begin{cases} \sin\!\Big(\dfrac{i}{L^{d/|x|}}\Big), & \text{if } d \bmod 2 = 0, \\[8pt] \cos\!\Big(\dfrac{i}{L^{(d-1)/|x|}}\Big), & \text{if } d \bmod 2 = 1, \end{cases} \tag{3.30}$$
where L is a hyperparameter and is often set to 10000. This vector is then added
to each input item, i.e., xi + ei before being fed to the attention block.
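A direct implementation of Eq. (3.30) is sketched below (a toy example with L = 10000, alternating sine and cosine across dimensions as above).

```python
import numpy as np

def sinusoidal_positions(num_positions, dim, L=10000.0):
    """Returns a (num_positions, dim) array; row i is the embedding e_i that
    is added to the i-th input item before the attention block."""
    E = np.zeros((num_positions, dim))
    for d in range(dim):
        rate = L ** ((d - d % 2) / dim)          # same rate for each (sin, cos) pair
        if d % 2 == 0:
            E[:, d] = np.sin(np.arange(num_positions) / rate)
        else:
            E[:, d] = np.cos(np.arange(num_positions) / rate)
    return E

E = sinusoidal_positions(num_positions=50, dim=16)
print(E.shape)           # (50, 16)
# x_i + E[i] is what gets fed into the attention block.
```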
The first approach, the additive approach, makes it easy for the attention block to capture the locality of each vector, because nearby vectors tend to have similar positional embeddings, and to capture absolute-position-based patterns, as each absolute position is represented by its unique positional embedding
vector. It is however challenging for the attention block to capture the patterns
based on relative positions beyond simple locality.
In particular, consider how the so-called attention weight on the j-th item
for the i-th item was computed in Eq. (3.26). The weight is proportional to the
dot product between the i-th query vector and the j-th key vector:
where we assumed zero bias vectors for both query and key vectors. From the first term in the expanded expression, we notice that the content-based
relationship between the i-th input and j-th input is largely independent of
their positions. In other words, the semantic relationship between these two
inputs is stationary across their relative positions, which may be restrictive in
many downstream applications.
Focusing on the first term above, we can think of a way to ensure that this pairwise semantic relationship is position-dependent. More specifically, we
want it to depend on the relative position between xi and xj :
the diagonal:
$$R_m = \begin{bmatrix} R^2_1(m) & 0 & \cdots & 0 \\ 0 & R^2_2(m) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & R^2_{|x|/2}(m) \end{bmatrix}, \tag{3.34}$$
where each $R^2_i(m)$ is a 2×2 rotation matrix whose angle depends on the position m.
In other words, we rotate every pair of elements of the query/key vector based on
its position before computing the dot product between these two vectors. Since
this rotation depends on the relative position between the query and key vectors,
this approach can capture position-dependent semantic relationship between
the i-th input and the j-th input. This idea has become one of the standard
approaches to incorporating positional information in the attention block in
recent years [Su et al., 2021].
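A compact way to apply such a rotation is sketched below (a toy example; the particular geometric schedule of per-pair angles is an assumed choice, in the spirit of the block-diagonal R_m above). The final two lines check that the dot product depends only on the relative position.

```python
import numpy as np

def rotate_by_position(v, m, L=10000.0):
    """Apply a block-diagonal rotation R_m to a query or key vector v:
    each consecutive pair (v[2i], v[2i+1]) is rotated by an angle m * theta_i,
    where theta_i decreases geometrically with i (an assumed choice of rates)."""
    d = v.shape[0]
    i = np.arange(d // 2)
    theta = m / (L ** (2 * i / d))                # one angle per 2-D block
    cos, sin = np.cos(theta), np.sin(theta)
    v_even, v_odd = v[0::2], v[1::2]
    out = np.empty_like(v)
    out[0::2] = cos * v_even - sin * v_odd
    out[1::2] = sin * v_even + cos * v_odd
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# The dot product depends only on the relative position (5 - 2 = 103 - 100 = 3):
print(np.dot(rotate_by_position(q, 5), rotate_by_position(k, 2)))
print(np.dot(rotate_by_position(q, 103), rotate_by_position(k, 100)))
```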
Chapter 4
Probabilistic Machine Learning and Unsupervised Learning
$$p(x, z; \theta) = \frac{\exp(-e(x, z, \theta))}{\iint \exp(-e(x', z', \theta))\, \mathrm{d}x'\, \mathrm{d}z'}. \tag{4.1}$$
Of course, it is often (if not almost always) challenging to compute the normal-
ization constant (or the partition function) in the denominator. Such a challenge
hints at a different approach to the same problem. Instead of defining an en-
ergy function first and then deriving the probability function, why not directly
define the probability function? After all, we can recover the underlying energy function given a probability function, up to a constant:
$$e(x, z, \theta) = -\log p(x, z; \theta) + \text{const.}$$
In fact, it may be even easier to decompose the joint probability function p(x, z)
With these probability functions in our hands, we can now define a generic
loss function:
$$L_{\text{ll}}(x, \theta) = -\log \int p(x|z; \theta)\, p(z; \theta)\, \mathrm{d}z. \tag{4.5}$$
where we assume that p(x) is simply given and is not optimized with its own
parameters. When z does not exist, it reduces to the cross-entropy loss from
Eq. (2.30):
$$L_{\text{ll}}([x, y], \theta) = -\log p(y|x; \theta). \tag{4.8}$$
1 As usual, we will omit θ if its existence, or lack thereof, is clear from the context.
In the rest of this chapter, we focus on the case where we have input-only
observations. We often call such a setup unsupervised learning.
2 In short, p(z) should be a so-called conjugate prior to the likelihood p(x|z), so that the
To find q (or its parameters φ(x)), we minimize the second and third terms above, since the first term log p(x) is not a function of q. In other words,
$$\hat{\phi}(x) = \arg\min_{\phi(x)}\; -\mathbb{E}_{z \sim q}\big[\log p(x|z)\big] + D_{\mathrm{KL}}\big(q(z; \phi(x))\,\|\,p(z)\big). \tag{4.16}$$
$$\log p(x; \theta) \geq \underbrace{\mathbb{E}_{z \sim q}\big[\log p(x|z; \theta)\big] - D_{\mathrm{KL}}\big(q(z; \phi(x))\,\|\,p(z)\big)}_{=J(\theta)}. \tag{4.18}$$
that there exist a finite number of Gaussian distributions, which are referred to
as “components”, and a latent variable z selects one of these components. Once
the component is selected, an observation x is drawn from the corresponding
Gaussian distribution.
To map this story onto the probability functions, we begin with a prior
distribution over the components:
$$p(z) = \frac{1}{M}, \tag{4.20}$$
where M is the number of Gaussian components. This prior states that each
and every component is equally likely to be selected. This can be relaxed, but
we will stick to this for now. z can take any one of {1, . . . , M }.
Once the component is selected, we draw an observation x from
$$p(x|z) = \mathcal{N}(x;\, \mu_z, \Sigma_z),$$
where $\mu_z$ and $\Sigma_z$ are the mean and covariance of the z-th Gaussian component. For simplicity, let us assume that $\Sigma_z = I$, that is, the covariance is an identity matrix. In such a case, we say that the component is a spherical Gaussian.
We introduce an approximate posterior for each training example xn . This
approximate posterior is
where d = dim(xn ).
Let’s compute the gradient of J w.r.t. µk :
$$\nabla_{\mu_k} = \frac{1}{N}\sum_n \alpha_k^n (x^n - \mu_k) = \frac{1}{N}\Big(\sum_n \alpha_k^n x^n - \mu_k \sum_n \alpha_k^n\Big) = 0 \tag{4.25}$$
$$\iff \mu_k = \sum_{n=1}^N \frac{\alpha_k^n}{\sum_{n'=1}^N \alpha_k^{n'}}\, x^n. \tag{4.26}$$
Similarly, for each $\alpha_k^n$,
$$\nabla_{\alpha_k^n} = -\frac{1}{2}\|x^n - \mu_k\|^2 - \frac{d}{2}\log 2\pi - \log M - \log \alpha_k^n - 1 = 0 \tag{4.27}$$
$$\iff \log \alpha_k^n = -\frac{1}{2}\|x^n - \mu_k\|^2 - \frac{d}{2}\log 2\pi - \log M - 1 \tag{4.28}$$
$$\iff \alpha_k^n = \frac{\exp\big(-\frac{1}{2}\|x^n - \mu_k\|^2 - \frac{d}{2}\log 2\pi - \log M\big)}{\sum_{k'=1}^K \exp\big(-\frac{1}{2}\|x^n - \mu_{k'}\|^2 - \frac{d}{2}\log 2\pi - \log M\big)} \tag{4.29}$$
$$\iff \alpha_k^n = \frac{\exp\big(-\frac{1}{2}\|x^n - \mu_k\|^2\big)}{\sum_{k'=1}^K \exp\big(-\frac{1}{2}\|x^n - \mu_{k'}\|^2\big)}, \tag{4.30}$$
because $\sum_{k=1}^K \alpha_k^n = 1$.
For the approximate posterior, we can solve it analytically and exactly. In
fact, if we analyze the solution above more carefully, we realize that it is identical
to the true posterior:
$$\underbrace{\log \alpha_k^n}_{p(z=k|x^n)} = \underbrace{\log \mathcal{N}(x^n | \mu_k, I)}_{=p(x^n|z=k;\theta)} + \log \underbrace{\frac{1}{M}}_{=p(z=k)} - \log Z. \tag{4.31}$$
draw more samples and (b) infer the posterior distribution over the components
given a new observation.
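The two closed-form updates, Eq. (4.30) for the approximate posterior and Eq. (4.26) for the means, suggest a simple alternating procedure. The sketch below (a toy example with spherical, unit-covariance components and a uniform prior, as assumed above) implements it on synthetic data.

```python
import numpy as np

def fit_spherical_mixture(X, n_components, n_iters=50, seed=0):
    """Alternate between Eq. (4.30) (posterior responsibilities alpha) and
    Eq. (4.26) (component means mu), for spherical unit-covariance components
    and a uniform prior over components."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=n_components, replace=False)]   # initial means
    for _ in range(n_iters):
        # E-step, Eq. (4.30): alpha[n, k] proportional to exp(-0.5 ||x_n - mu_k||^2)
        sq_dist = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        logits = -0.5 * sq_dist
        logits -= logits.max(axis=1, keepdims=True)
        alpha = np.exp(logits)
        alpha /= alpha.sum(axis=1, keepdims=True)
        # M-step, Eq. (4.26): mu_k = sum_n (alpha[n, k] / sum_n' alpha[n', k]) x_n
        mu = (alpha.T @ X) / alpha.sum(axis=0)[:, None]
    return mu, alpha

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-3, 1, size=(200, 2)), rng.normal(3, 1, size=(200, 2))])
mu, alpha = fit_spherical_mixture(X, n_components=2)
print(np.round(mu, 2))         # close to the two true cluster centres
```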
instead of K values for each n-th training example. In other words, we need
only ⌈log2 K⌉ bits as opposed to K × B bits where B is the number of bits for
representing a real value in one’s system.
In this case, the update rule for the mean of each component in Eq. (4.26)
can be simplified as well:
$$\mu_k = \sum_{n=1}^N \frac{\alpha_k^n}{\sum_{n'=1}^N \alpha_k^{n'}}\, x^n \tag{4.35}$$
$$\phantom{\mu_k} = \frac{1}{\sum_{n'=1}^N \mathbb{1}(\hat{z}^{n'} = k)} \sum_{n=1}^N \mathbb{1}(\hat{z}^n = k)\, x^n. \tag{4.36}$$
That is, we collect all training examples that belong to the k-th component and
compute the average of these training examples. This further saves a significant
amount of compute, as we only go through on average N/K training examples
for each component to compute its mean vector.
Because we are effectively making a hard choice of which component each training example belongs to (4.34), we often refer to this special case as K-means clustering.
Let’s restate the objective function derived from the variational inference prin-
ciple earlier in Eq. (4.19):
\max_{\phi(x_1),\ldots,\phi(x_N),\theta} \frac{1}{N}\sum_{n=1}^N E_{z\sim q(z;\phi(x_n))}\left[\log p(x_n|z;\theta)\right] - D_{KL}\left(q(z;\phi(x_n)) \,\|\, p(z)\right).  (4.37)
J_n = E_{z\sim q_n}\left[-\frac{1}{2}\|x_n - Wz - b\|^2 - \frac{|x|}{2}\log 2\pi\right] - \frac{1}{2}\left(\frac{K + \|\mu_n\|^2}{\sigma^2} - K + 2K\ln\sigma\right)  (4.41)
= -E_z\left[\frac{1}{2}\|x_n\|^2 + \frac{1}{2}\|Wz+b\|^2 - x_n^\top(Wz+b)\right] - \frac{1}{2\sigma^2}\|\mu_n\|^2 + \text{const.}  (4.42)
= -E_z\left[\frac{1}{2}z^\top W^\top W z + \frac{1}{2}\|b\|^2 + b^\top W z - x_n^\top W z - x_n^\top b\right] - \frac{1}{2\sigma^2}\|\mu_n\|^2 + \text{const.}  (4.43)
= -\frac{1}{2}\mathrm{tr}\, W E_z[(z-\mu_n)(z-\mu_n)^\top] W^\top - \frac{1}{2}\mathrm{tr}\, W \mu_n E_z[z]^\top W^\top - \frac{1}{2}\mathrm{tr}\, W E_z[z]\mu_n^\top W^\top + \frac{1}{2}\mathrm{tr}\, W \mu_n\mu_n^\top W^\top  (4.44)
\quad - \frac{1}{2}\|b\|^2 - b^\top W \mu_n + x_n^\top W \mu_n + x_n^\top b - \frac{1}{2\sigma^2}\|\mu_n\|^2 + \text{const.}  (4.45)
= -\frac{1}{2}\mathrm{tr}\, W W^\top - \frac{1}{2}\mu_n^\top W^\top W \mu_n - \frac{1}{2}\|b\|^2 - b^\top W \mu_n + x_n^\top W \mu_n + x_n^\top b - \frac{1}{2\sigma^2}\|\mu_n\|^2 + \text{const.},  (4.46)
where const. refers to the terms that do not depend on either ϕ(x_n) or θ.
Let’s perform posterior inference first by computing the gradient of J =
1
P
N n Jn w.r.t. ϕ(xn ):
1
∇µn = −W ⊤ W µn + W ⊤ (xn − b) − µn = 0 (4.47)
σ2
− (W ⊤ W + σ −2 I)µn + W ⊤ (xn − b) = 0 (4.48)
⊤ −2 −1 ⊤
µn = (W W + σ I) W (xn − b). (4.49)
Just like with the MoG above, we get a clean, analytical solution for each µ_n. Because we need to compute the inverse of W^⊤W + σ^{−2}I ∈ R^{K×K}, this may be somewhat expensive, but we only need to compute it once and can reuse it for all N µ_n's.
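As a quick illustration of Eq. (4.49), the following sketch computes all N posterior means from one matrix factorization, reused across examples; the variable names (W, b, sigma2, X) and their sizes are assumptions for this example.

import numpy as np

rng = np.random.default_rng(1)
N, D, K = 1000, 10, 3
W = rng.standard_normal((D, K))
b = rng.standard_normal(D)
sigma2 = 4.0
X = rng.standard_normal((N, D))             # observations x_n as rows

# mu_n = (W^T W + sigma^{-2} I)^{-1} W^T (x_n - b), Eq. (4.49)
A = W.T @ W + np.eye(K) / sigma2            # computed once, shared by all n
mu = np.linalg.solve(A, W.T @ (X - b).T).T  # shape (N, K): one posterior mean per example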
Let’s look at the role of σ 2 from the prior p(z) earlier in this context. When
2
σ → ∞, the expression above simplifies to
which is equivalent to
xn = W µn + b. (4.52)
Setting the gradient of J w.r.t. b to zero similarly gives b = \frac{1}{N}\sum_{n=1}^N (x_n - W\mu_n). This expression makes intuitive sense: the bias b is the average offset between what we get given the latent configuration and what we actually observe.
Let us continue with W :
\nabla_W = \frac{1}{N}\sum_{n=1}^N \left(-W - W\mu_n\mu_n^\top - b\mu_n^\top + x_n\mu_n^\top\right)  (4.55)
= -W\left(I + \frac{1}{N}\sum_{n=1}^N \mu_n\mu_n^\top\right) + \frac{1}{N}\sum_{n=1}^N (x_n - b)\mu_n^\top.  (4.56)
Then,
W = \left(\frac{1}{N}\sum_{n=1}^N (x_n - b)\mu_n^\top\right)\left(I + \frac{1}{N}\sum_{n=1}^N \mu_n\mu_n^\top\right)^{-1}.  (4.57)
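Before interpreting Eq. (4.57), here is a minimal sketch of this M-step, together with the companion update of the bias b mentioned above; the array shapes are assumptions for the example.

import numpy as np

def update_W(X, mu, b):
    # W = (1/N sum (x_n - b) mu_n^T) (I + 1/N sum mu_n mu_n^T)^{-1}, Eq. (4.57)
    N, K = mu.shape
    A = (X - b).T @ mu / N                  # (1/N) sum (x_n - b) mu_n^T
    B = np.eye(K) + mu.T @ mu / N           # I + (1/N) sum mu_n mu_n^T
    return A @ np.linalg.inv(B)

def update_b(X, mu, W):
    # b is the average offset between W mu_n and the observation x_n
    return (X - mu @ W.T).mean(axis=0)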
The first term in the product on the right-hand side can be thought of as implementing the so-called Hebbian learning rule: "neurons that fire together, wire together" [Hebb, 1949]. If the i-th dimension of the observation x_i fires (that is, beyond the bias b_i) and the j-th dimension of the latent variable µ_j fires together (where 'firing' means any deviation away from the bias or zero), the strength of the weight value w_ij between them must be large. This already shows up as the second term in the gradient w.r.t. W above.
The second term on the right-hand side (which corresponds to the first term in the gradient) is more complicated. It works by whitening µ_n inside the first term; that is, it makes the µ_n's distributed so that their covariance is closer to the identity. In effect, W captures the covariance between the mean-subtracted observations (x_n − b) and the whitened latent configurations \mu_n^\top\left(I + \frac{1}{N}\sum_{n=1}^N \mu_n\mu_n^\top\right)^{-1}. By doing so, in the next iteration of this EM
\mathrm{Jac}_\theta F(z;\theta) = \frac{\partial F}{\partial \theta}(z;\theta).  (4.59)
This small change has a big consequence in terms of the modeling power of
the continuous latent variable model. This is due to the peculiar (and amazing)
properties of normal distributions. Let us revisit the linear case above (4.38):
3 I will use the ∂/∂x notation to refer to the Jacobian matrix unless it is confusing.
\log p(x,z;\theta) = \log p(z) + \log p(x|z;\theta) = -\frac{1}{2\sigma^2}\|z\|^2 - \frac{1}{2}\|x - Wz - b\|^2 + \text{const.}  (4.62)
= -\frac{1}{2}\left(z^\top(\sigma^{-2}I)z + (x-b)^\top I(x-b) + z^\top W^\top W z - (x-b)^\top W z - z^\top W^\top(x-b)\right) + \text{const.}  (4.63)
= -\frac{1}{2}\left((x-b)^\top I(x-b) + z^\top\left(W^\top W + \sigma^{-2}I\right)z - (x-b)^\top W z - z^\top W^\top(x-b)\right) + \text{const.}  (4.64)
Let v = [x, z]^\top and \mu = [b, 0_{|z|}]^\top. Then,
\log p(x,z;\theta) - \log Z(\theta) = -\frac{1}{2}(v-\mu)^\top \underbrace{\begin{bmatrix} I & -W \\ -W^\top & W^\top W + \sigma^{-2}I \end{bmatrix}}_{=\Sigma^{-1}} (v-\mu) + \text{const.}  (4.65)
This shows that the joint distribution over [x, z] is also Gaussian with the mean
µ. Although we just needed to show this for our further argument, let us also
check the covariance matrix of the joint distribution.
There is a magical formula called the block matrix inversion lemma:
\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1} \end{bmatrix}.  (4.66)
We can use this to write down the covariance of the joint distribution p(x,z;\theta):
\Sigma = \begin{bmatrix} I + W(W^\top W + \sigma^{-2}I - W^\top W)^{-1}W^\top & W(W^\top W + \sigma^{-2}I - W^\top W)^{-1} \\ (W^\top W + \sigma^{-2}I - W^\top W)^{-1}W^\top & (W^\top W + \sigma^{-2}I - W^\top W)^{-1} \end{bmatrix}  (4.67)
= \begin{bmatrix} I + \sigma^2 WW^\top & \sigma^2 W \\ \sigma^2 W^\top & \sigma^2 I \end{bmatrix}.  (4.68)
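Eq. (4.68) is easy to verify numerically. The following sketch builds the precision matrix from Eq. (4.65) for a random W and checks that its inverse matches the claimed covariance; the dimensions and the value of σ² are arbitrary choices for the check.

import numpy as np

rng = np.random.default_rng(2)
D, K, sigma2 = 4, 2, 0.7
W = rng.standard_normal((D, K))

# precision matrix from Eq. (4.65)
prec = np.block([[np.eye(D), -W],
                 [-W.T, W.T @ W + np.eye(K) / sigma2]])
# claimed covariance from Eq. (4.68)
cov = np.block([[np.eye(D) + sigma2 * W @ W.T, sigma2 * W],
                [sigma2 * W.T, sigma2 * np.eye(K)]])
assert np.allclose(np.linalg.inv(prec), cov)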
Let’s consider the variational lowerbound with this nonlinear formulation for
one particular example xn :
J_n = E_{z\sim q_n}\left[-\frac{1}{2}\|x_n - F(z;\theta)\|^2 - \frac{|x|}{2}\log 2\pi\right] - \frac{1}{2}\left(\frac{K + \|\mu_n\|^2}{\sigma^2} - K + 2K\ln\sigma\right).  (4.70)
\nabla_{\mu_n} = -\frac{1}{2}\int \frac{\exp\left(-\frac{1}{2}\|z-\mu_n\|^2\right)}{(2\pi)^{|z|/2}}\, (z-\mu_n)\|x_n - F(z;\theta)\|^2\, dz - \frac{\mu_n}{\sigma^2}  (4.71)
= -\frac{1}{2} E_z\left[(z-\mu_n)\|x_n - F(z;\theta)\|^2\right] - \frac{\mu_n}{\sigma^2}  (4.72)
= -\frac{1}{2}\left(-2 E_z\left[\left(x_n^\top F(z;\theta)\right) z\right] + 2\left(x_n^\top E_z[F(z;\theta)]\right)\mu_n + E_z\left[z\|F(z;\theta)\|^2\right] - \mu_n E_z\left[\|F(z;\theta)\|^2\right]\right) - \frac{\mu_n}{\sigma^2}.  (4.73)
It is clear that without knowing the form of F, it is not possible to come up with an analytical solution for µ_n in general. Even worse, it is unclear how to compute the gradient analytically either, due to the challenging expectations that must be computed. We can however use a sample-based Monte Carlo approximation, since we can choose the approximate posterior q to be readily samplable:
\tilde{z} = \mu_n + \sigma\epsilon,  (4.76)
where ϵ is drawn from a standard Normal distribution.4
4 The reparametrization trick expresses a sample from a particular distribution as nonlinearly and deterministically transforming noise drawn from another distribution:
z = g(\epsilon;\phi), \quad \epsilon \sim p(\epsilon).  (4.75)
This allows us to compute the derivative of the sample z w.r.t. the parameters of this deterministic function, i.e., \frac{\partial g}{\partial \phi}(z). This is a handy trick, since sampling is often considered a non-differentiable operator. Despite its usefulness, it is not always possible to come up with such a reparametrization.
We can then use this stochastic gradient estimate to find the solution to µn .
Looking at the gradient above, we can however see what this gradient direc-
tion points at. Particularly, the first term considers directions that are similar
to the mean µn but with some noise. We then weigh each such direction (or the
difference between this direction and the current estimate of µn ) by their qual-
ity, where the quality is defined as how similar the decoded observation, F (z; θ),
is to the actual observation xn (notice the negative sign that turns this distance
into the quality.) In other words, we look for the change to µn that makes the
decoded observation closer to the actual observation. This makes perfect sense,
from the perspective of p(x|z). The second term simply pulls µ_n toward the origin, at a rate inversely proportional to the prior variance; the smaller the prior variance, the stronger this regularization.
This lack of an analytical solution for µ_n is problematic, because we must keep µ_n for all N training examples across E-M iterations, even with stochastic gradient descent. At each iteration, we would select a small number M of training examples, retrieve the associated observations {x_m}_{m=1}^M as well as the associated current estimates of the posterior means {µ_m}_{m=1}^M, update the posterior means slightly following the gradient direction, update the parameters slightly following the stochastic gradient direction, and finally store back the updated estimates of the posterior means. This does not put too much pressure on computation, but it puts a huge pressure on storage and I/O, as we need O(b × |z| × N) bits to store the posterior means.
We can avoid this by amortizing posterior inference: instead of storing a separate posterior mean for each training example, we train an inference network G that predicts it directly from the observation,
\mu_m = G(x_m; \theta_G).  (4.78)
Plugging this into the variational lowerbound and using the reparametrization trick above, we arrive at the following single-sample estimate of J_n:
\tilde{J}_n = -\frac{1}{2}\left\|x_n - F\left(G(x_n;\theta_G) + \sigma\epsilon;\, \theta\right)\right\|^2 - \frac{1}{2\sigma^2}\left\|G(x_n;\theta_G)\right\|^2 + \text{const.},  (4.81)
where ϵ ∼ N (0, I).
There are two non-constant terms in this approximate objective. The first
term is the reconstruction error. The input xn is processed by the inference
network G first, and then the noisy version of the output of G is then processed
by F to reconstruct the input. The objective is maximized when the difference
between the original input and the reconstructed input is minimized (see the
negation in front of the L2 norm.) This process is often referred to as autoencoding,
and this is why this whole framework is called a variational autoencoder [Kingma
and Welling, 2013].
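Here is a minimal NumPy sketch that evaluates the single-sample objective of Eq. (4.81) for one example; the two small tanh networks standing in for F and G, their sizes and the variable names are illustrative assumptions, not a prescribed architecture.

import numpy as np

rng = np.random.default_rng(3)
D, K, H, sigma = 8, 2, 16, 1.0

# toy inference network G (encoder) and generation network F (decoder)
Wg1, Wg2 = rng.standard_normal((H, D)) * 0.1, rng.standard_normal((K, H)) * 0.1
Wf1, Wf2 = rng.standard_normal((H, K)) * 0.1, rng.standard_normal((D, H)) * 0.1

def G(x):                          # approximate posterior mean, mu_n = G(x_n; theta_G)
    return Wg2 @ np.tanh(Wg1 @ x)

def F(z):                          # decoder F(z; theta)
    return Wf2 @ np.tanh(Wf1 @ z)

def objective_estimate(x):
    # single-sample estimate of J_n from Eq. (4.81)
    eps = rng.standard_normal(K)
    z = G(x) + sigma * eps                         # reparametrized sample from the approximate posterior
    recon = -0.5 * np.sum((x - F(z)) ** 2)         # reconstruction term (autoencoding)
    reg = -0.5 / sigma**2 * np.sum(G(x) ** 2)      # pulls the inferred posterior means toward the origin
    return recon + reg

x_n = rng.standard_normal(D)
print(objective_estimate(x_n))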
The second term is a regularizer that pushes the L2 norm of the output from
the inference network to be small. This ensures that all the inputs {x1 , . . . , xM }
are mapped to the latent space, i.e. the space of the latent variable z, as tightly
as possible. Without this term, the norm of the output of G could grow indefinitely, pushing the inferred posteriors of the inputs as far away from one another as possible, since this would ensure that F can reconstruct the original input perfectly even
with the injected noise. This would however make it impossible for F to cope
with any z sampled from the prior or located between any pair of inputs’ inferred
posterior distributions, resulting in a lousy generative model.
Thanks to the reparametrization trick, we can compute the gradient of J˜n
w.r.t. all parameters, including those of F and those of G. In other words, we
can use backpropagation to train both the inference and generation networks,
G and F , respectively. This allows us to train the inference network extremely
efficiently, without having to maintain a whole database of instance-specific ap-
proximate posterior parameters. Furthermore, as discussed earlier, this inference
network can be used with a novel input, making it useful for analyzing a set
of inputs that are not present during training. The approximate posterior com-
puted by the inference network can be further finetuned using gradient descent
to match the true posterior better [Hjelm et al., 2016].
Perhaps more importantly, this implies that such end-to-end learning of in-
ference and generation networks is possible with backpropagation and stochastic
gradient descent as long as we can use a reparametrization trick to sample from
an approximate posterior without breaking differentiability. This opens the door to a whole new set of opportunities for scaling up various probabilistic models that were previously cumbersome to derive and use, although these are out of
the scope of this course.
Unlike at training time, we are under less time pressure, and therefore a natural approach would be a naive Monte Carlo approximation:
p(x) \approx \frac{1}{M}\sum_{m=1}^M p(x|z_m),  (4.83)
where z_m ∼ p_z(z).
Unfortunately this naive approach can have a large variance. For brevity, let
f (z) = p(x|z) and p(z) = pz (z). Because we already know that it is unbiased,
we can then write the variance as
\mathbb{V}\left[\frac{1}{M}\sum_{m=1}^M f(z_m)\right] = \frac{1}{M^2}\mathbb{V}\left[\sum_{m=1}^M f(z_m)\right] = \frac{1}{M^2}\sum_{m=1}^M \mathbb{V}\left[f(z_m)\right] = \frac{1}{M^2}\, M\, \mathbb{V}[f(z)] = \frac{\mathbb{V}[f(z)]}{M}.  (4.84)
If we instead draw z from a proposal distribution q and reweigh accordingly, the variance of the resulting estimator is
\mathbb{V}\left[\frac{p(x|z)p_z(z)}{q(z)}\right] = \int \frac{p(x|z)^2\, p_z(z)^2}{q(z)}\, dz - \text{const.}  (4.88)
The second term is constant w.r.t. q, because that is nothing but the original
quantity we are trying to approximate.
Recall the Cauchy-Schwarz inequality:
|\langle u, v\rangle|^2 \leq \langle u, u\rangle\, \langle v, v\rangle,  (4.89)
where ⟨·, ·⟩ is an inner product that generalizes a standard dot product. We can
define an inner product on the square-integrable functions5 as
\langle f, g\rangle = \int f(x)\, g(x)\, dx.  (4.91)
Because \int q(z)\, dz = 1 by definition, we observe that
\int \frac{p(x|z)^2\, p_z(z)^2}{q(z)}\, dz\; \underbrace{\int q(z)\, dz}_{=1} \geq \left(\int \frac{p(x|z)\, p_z(z)}{\sqrt{q(z)}}\sqrt{q(z)}\, dz\right)^2 = \left(\int p(x|z)\, p_z(z)\, dz\right)^2.  (4.93)
Considering both sides carefully, we see that they are equal when p(x|z)p_z(z) = C\, q(z) for some constant C > 0. This is easy to check by plugging it into the left-hand side of the inequality above:
\int \left(\frac{C\, q(z)}{\sqrt{q(z)}}\right)^2 dz \int q(z)\, dz = C^2,  (4.95)
Because \int q(z)\, dz = 1,
C = \int p(x|z)\, p_z(z)\, dz.  (4.97)
The optimal proposal distribution is therefore
q^*(z) = \frac{p(x|z)\, p_z(z)}{\int p(x|z')\, p_z(z')\, dz'},  (4.98)
5 A function f is square-integrable when \int f(x)^2\, dx < \infty.  (4.90)
which turns out to be exactly the posterior distribution over z given x. In other words, if we sample from the posterior distribution instead of the prior distribution and reweigh p(x|z) according to the ratio p(z)/p(z|x), our approximation is both unbiased and has the minimal variance.
This is however not the right way forward, since the posterior probability
has in its own denominator the intractable integral. Rather, this says that the
so-called proposal distribution q must be close to the true posterior distribution
p(z|x), which is in fact exactly the criterion we used to derive the variational
lowerbound earlier in §4.2. When the variational lowerbound, which serves as the
objective function for latent-variable models, is maximized, the KL divergence
between q (the approximate posterior) and p (the true posterior) shrinks. In
other words, we can simply use the trained q inference network as the proposal
distribution to approximate the log-marginal probability of an observation x
after training to obtain an unbiased, low-variance estimator of this quantity.6 It turns out that maximizing the variational lowerbound has yet another advantage.
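A minimal sketch of this importance-sampling estimator of the (log-)marginal probability follows, assuming we can draw from the proposal q and evaluate its density, the prior density p_z and the likelihood p(x|z); the one-dimensional Gaussian choices below are placeholders chosen so the answer can be checked in closed form.

import numpy as np

rng = np.random.default_rng(4)

def log_importance_estimate(x, sample_q, log_q, log_prior, log_lik, M=1000):
    """log of (1/M) sum_m p(x|z_m) p_z(z_m) / q(z_m), with z_m ~ q."""
    zs = sample_q(M)
    log_w = log_lik(x, zs) + log_prior(zs) - log_q(zs)
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())   # log-sum-exp for numerical stability

# toy model: p(z) = N(0, 1), p(x|z) = N(z, 1); proposal q = N(x/2, 1), close to the true posterior
x = 1.3
est = log_importance_estimate(
    x,
    sample_q=lambda M: x / 2 + rng.standard_normal(M),
    log_q=lambda z: -0.5 * (z - x / 2) ** 2 - 0.5 * np.log(2 * np.pi),
    log_prior=lambda z: -0.5 * z**2 - 0.5 * np.log(2 * np.pi),
    log_lik=lambda x, z: -0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi),
)
# for this toy model log p(x) = log N(x; 0, 2), so the two printed numbers should be close
print(est, -0.5 * x**2 / 2 - 0.5 * np.log(2 * np.pi * 2))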
6 The variational lowerbound can be used as a proxy to the log-marginal probability as well.
This is indeed a standard practice during training, to monitor the progress of learning. This
quantity is however a biased estimate of the log-marginal probability, and it is important to
use importance sampling to check the true log-marginal probability.
Chapter 5
Undirected Generative Models
\log p(x,z;\theta) = -e(x,z,\theta) - \log \int_{x'\in\mathcal{X}} \sum_{z'\in\{0,1\}^{|z|}} \exp\left(-e(x',z',\theta)\right) dx',  (5.3)
where \mathcal{X} is the set of all possible values x can take. If \mathcal{X} is a finite set, we replace \int with \sum.
Let us focus on the normalization constant, \sum_{z'\in\{0,1\}^{|z|}} \exp(-e(x,z',\theta)). Since
\exp(-e(x,z,\theta)) = \prod_{i=1}^{|x|}\prod_{j=1}^{|z|}\exp(w_{ij}x_i z_j)\, \prod_{i=1}^{|x|}\exp(x_i b_i)\, \prod_{j=1}^{|z|}\exp(z_j c_j)  (5.5)
= \prod_{i=1}^{|x|}\exp(x_i b_i)\, \prod_{j=1}^{|z|}\exp\left(\sum_{i=1}^{|x|} w_{ij}x_i z_j + z_j c_j\right).  (5.6)
Now, I want to marginalize out z from this expression. In most cases, this would be intractable, because there are 2^{|z|} possible values z can take. This bipartite structure however turns out to be a blessing we can rely on.
Let’s consider the following simple case:
X 2
Y
fj (zj ) = f1 (0)f2 (0) + f1 (0)f2 (1) + f1 (1)f2 (0) + f1 (1)f2 (1) (5.7)
z∈{0,1}2 j=1
Instead of summing exponentially many terms, we can multiply |z| terms only:
\sum_{z\in\{0,1\}^{|z|}} \exp(-e(x,z,\theta)) = \sum_{z\in\{0,1\}^{|z|}} \prod_{i=1}^{|x|}\exp(x_i b_i)\, \prod_{j=1}^{|z|}\exp\left(\sum_{i=1}^{|x|} w_{ij}x_i z_j + z_j c_j\right)  (5.11)
= \prod_{i=1}^{|x|}\exp(x_i b_i) \sum_{z\in\{0,1\}^{|z|}} \prod_{j=1}^{|z|}\exp\left(\sum_{i=1}^{|x|} w_{ij}x_i z_j + z_j c_j\right)  (5.12)
= \exp\left(\sum_{i=1}^{|x|} x_i b_i\right) \prod_{j=1}^{|z|}\left(1 + \exp\left(\sum_{i=1}^{|x|} w_{ij}x_i + c_j\right)\right).  (5.13)
You can think of the left-hand side of this derivation as the unnormalized
probability function p̃(x; θ) of x, since the normalization constant of p(x, z; θ) is
neither a function of x nor z. In that case, we can write it down as
\tilde{p}(x;\theta) \propto \phi_0(x) \prod_{j=1}^{|z|}\phi_j(x),  (5.15)
where \phi_0(x) = \exp\left(\sum_{i=1}^{|x|} x_i b_i\right) and \phi_j(x) = 1 + \exp\left(\sum_{i=1}^{|x|} w_{ij} x_i + c_j\right).
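The following sketch evaluates this unnormalized log-probability of a binary RBM using the factorization in Eq. (5.13); the parameter shapes and toy values are assumptions for the example.

import numpy as np

def rbm_unnorm_logprob(x, W, b, c):
    """log ptilde(x) = sum_i x_i b_i + sum_j log(1 + exp(sum_i w_ij x_i + c_j)), cf. Eq. (5.13)."""
    pre = x @ W + c                              # sum_i w_ij x_i + c_j for each hidden unit j
    return x @ b + np.sum(np.logaddexp(0.0, pre))

rng = np.random.default_rng(5)
Dx, Dz = 6, 4
W = rng.standard_normal((Dx, Dz)) * 0.1
b = rng.standard_normal(Dx) * 0.1
c = rng.standard_normal(Dz) * 0.1
x = rng.integers(0, 2, size=Dx).astype(float)
print(rbm_unnorm_logprob(x, W, b, c))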
Just like earlier, we use stochastic gradient descent, and to do so, we need to
be able to compute the gradient of this per-example loss w.r.t. the energy e.
Once we can compute it, we can use the chain rule of derivatives to compute
the gradient w.r.t. each parameter. So,
\nabla_\theta L_{ll} = \underbrace{\nabla_\theta e(x,\theta)}_{=(a)} - \underbrace{\int \frac{\exp(-e(x',\theta))}{\int \exp(-e(x'',\theta))\, dx''}\, \nabla_\theta e(x',\theta)\, dx'}_{=(b)},  (5.19)
where the ratio inside the second integral is exactly p(x';\theta).
There are two terms in this gradient. The first term (a) is called a positive
phase, since it proactively decreases (recall that we are taking the negative gra-
dient direction) the energy of the positive example, where the positive example
refers to one of the training examples x from the training set. The second term
(b) is called a negative phase, where it proactively increases the energy of a
configuration x′ that is highly probable under the current model, i.e. p(x′ ; θ) ↑.
This is exactly what we saw earlier when we learned about the cross-entropy
loss for classification in §2.1.2.
Unlike the cross entropy with softmax earlier, we are in a worse situation
here, because the number of possible values x can take is much greater. In fact,
it is exponentially larger, since we often use RBMs or any of these generative
models to model a distribution over a high-dimensional space. In other words, we
cannot compute the negative phase (b) exactly in a tractable time, or sometimes
we just do not know how to compute it at all.
In the remainder of this section, we study how we can efficiently draw these
negative samples and use them for learning.
and be convinced that the collected series of samples form collectively a set of
samples from the target distribution.
In addition to this condition (p_∞ = p^*), we need to meet an extra condition: this stationary distribution has to be unique. If there are other stationary distributions, we may not be able to tell, even after running this transition operator indefinitely, whether we are collecting samples from the true distribution. To ensure uniqueness, we further put a constraint that this Markov chain is
ergodic. In an ergodic Markov chain, any state (or a region of the state space, in
the case of an infinitely large X ) is reachable from any other state within a finite
number of transition steps. This ergodicity guarantees that there is only one sta-
tionary distribution, and that repeated applications of the transition operator
will eventually converge toward this unique stationary distribution.
Sampling from a complicated target distribution p∗ then boils down to de-
signing a transition operator T such that the resulting Markov chain has a
unique stationary distribution. The next question is how we can guarantee that
there exists a stationary distribution, since the ergodicity tells us that there is
a unique stationary distribution if there is a stationary distribution under this
Markov chain. There is more than one way to do so, and one relatively well-known way is the principle of detailed balance. Detailed balance in a Markov chain is defined as having the transition operator T satisfy
T (x′ |x)p∞ (x) = T (x|x′ )p∞ (x′ ). (5.22)
As is clear from the equation, it says that whatever flows from one state to another must flow back. This is stronger than having a stationary distribution, as a stationary distribution p_∞ need not satisfy this. When detailed balance is satisfied, we often refer to such a Markov chain as a reversible Markov chain, since we will not be able to tell the direction of time once it has converged.
Our goal is then to design a transition operator T such that the resulting
Markov chain is ergodic and satisfies detailed balance.1 We refer to the procedure of sampling by collecting a series of visited states from such a Markov chain as a Markov chain Monte Carlo (MCMC) method.
One of the most popular and widely-used MCMC algorithms is the Metropolis-Hastings (M-H) algorithm [Hastings, 1970]. The M-H algorithm assumes that we have access to the unnormalized probability \tilde{p}^*(x) of the target distribution:
p^*(x) = \frac{\tilde{p}^*(x)}{\int \tilde{p}^*(x')\, dx'}.  (5.23)
This assumption makes the M-H algorithm particularly suitable for many energy-based models, such as restricted Boltzmann machines (RBM), since we can often easily evaluate the unnormalized probability but cannot tractably compute the normalization constant.
1 This statement does not exclude the possibility of designing a Markov chain that allows
us to sample from a target distribution even when it does not satisfy detailed balance. Fur-
thermore, this statement does not exclude the possibility of expanding the state space by
augmenting x with an extra variable. It has been shown that this can be beneficial, as in so-called Hamiltonian Monte Carlo methods [Neal, 1993].
We first assume we are given (or can create) a proposal distribution q(x|x′ )
that is often centered at x′ and whose probability mass is largely concentrated
in the neighbourhood of x′ . q must be ergodic, that is, if we repeatedly sample
from q(x|x′ ), we should be able to reach any state (or a region of the state space)
within a finite number of steps. We then define an acceptance probability α(x|x′) such that
\alpha(x|x') = \min\left(1,\; \frac{\tilde{p}^*(x)\, q(x'|x)}{\tilde{p}^*(x')\, q(x|x')}\right),
and the resulting transition operator either moves to the proposed state or stays at x′:
T(x|x') = \alpha(x|x')\, q(x|x') + \left(1 - \int \alpha(x''|x')\, q(x''|x')\, dx''\right)\delta_{x'}(x),
where
\delta_{x'}(x) = \begin{cases} \infty, & \text{if } x = x' \\ 0, & \text{otherwise} \end{cases}  (5.26)
and
\int \delta_{x'}(x)\, dx = 1.  (5.27)
We can sample from this transition operator given the past sample x′ by first drawing a proposal x ∼ q(x|x′), accepting it with probability α(x|x′) and, if it is rejected, keeping x′ as the next sample.
This transition operator satisfies both ergodicity and detailed balance, and
a lot of MCMC algorithms can be viewed as variants of the M-H algorithm with
particular choices of the proposal distribution q.
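Here is a minimal sketch of one M-H transition with a Gaussian random-walk proposal, which is symmetric, so the proposal densities cancel in the acceptance ratio; the two-mode target below is an arbitrary unnormalized density chosen only for illustration.

import numpy as np

rng = np.random.default_rng(6)

def log_ptilde(x):
    # unnormalized target: a mixture of two Gaussian bumps (illustrative)
    return np.logaddexp(-0.5 * np.sum((x - 2.0) ** 2), -0.5 * np.sum((x + 2.0) ** 2))

def mh_step(x_prev, step=1.0):
    x_prop = x_prev + step * rng.standard_normal(x_prev.shape)   # draw from the proposal q(x|x')
    log_alpha = log_ptilde(x_prop) - log_ptilde(x_prev)          # log acceptance ratio (symmetric q)
    if np.log(rng.uniform()) < log_alpha:
        return x_prop                                            # accept the proposal
    return x_prev                                                # reject: stay at the previous sample

x = np.zeros(2)
samples = []
for t in range(5000):
    x = mh_step(x)
    if t > 1000:                 # discard an initial burn-in before collecting samples
        samples.append(x)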
One such choice is to resample a single dimension d from its conditional distribution p_d while keeping all the other dimensions fixed, so that the proposed x differs from x′ only in the d-th dimension. The acceptance ratio then becomes
\frac{\tilde{p}^*(x)\, p_d(x'|x)}{\tilde{p}^*(x')\, p_d(x|x')} = \frac{\tilde{p}^*([x'_1,\ldots,x'_{d-1},x_d,x'_{d+1},\ldots,x'_{|x|}])\; p_d(x'_d\,|\,[x'_1,\ldots,x'_{d-1},x'_{d+1},\ldots,x'_{|x|}])}{\tilde{p}^*([x'_1,\ldots,x'_{|x|}])\; p_d(x_d\,|\,[x'_1,\ldots,x'_{d-1},x'_{d+1},\ldots,x'_{|x|}])}.  (5.32)
Writing \tilde{p}^*(x) = p_d(x_d\,|\,x_{\neq d})\, C(x_{\neq d}), where C collects the factors that do not depend on x_d, every factor cancels:
= \frac{p_d(x_d\,|\,x'_{\neq d})\, C(x'_{\neq d})\; p_d(x'_d\,|\,x'_{\neq d})}{p_d(x'_d\,|\,x'_{\neq d})\, C(x'_{\neq d})\; p_d(x_d\,|\,x'_{\neq d})}  (5.33)
= 1.  (5.34)
In other words, the acceptance probability is 1, and we always accept this new
sample which differs from the previous sample in just one dimension d.
This procedure is called Gibbs sampling. We pick one coordinate, sample
from the conditional distribution of that particular coordinate, replace it with
the newly sampled coordinate value and repeat it. This procedure is often ap-
plicable even when we have access only to the unnormalized probability, since
the conditional probability is often tractable in that case. Furthermore, because
every sample is automatically accepted, there is almost no extra overhead in
implementation, which makes it an attractive algorithm choice.
If we run the Gibbs sampling chain for too few steps before collecting a negative sample, our stochastic gradient estimate will likely be incorrect, resulting in a disastrous outcome.
Instead, it turned out that we can simply start the Gibbs sampling chain
from a positive example, run it only a small number of steps (as few as just one)
and use the resulting sample as the negative example. That is,
\nabla_\theta L_k(\theta; x) = \underbrace{\nabla_\theta e(x,\theta)}_{=\text{positive}} - \underbrace{\frac{1}{S}\sum_{s=1}^S \nabla_\theta e(x'_s,\theta)}_{=\text{negative}},  (5.42)
where x′s is one of the S samples drawn after running k steps of Gibbs sampling
starting from x. It is usual to set S to 1. In the limit of k → ∞, this is exact,
since the negative sample x′ would be from the stationary distribution which
coincides with the true distribution p(x; θ). It is however not so with a finite k,
and there is not even a guarantee that a larger k leads to a better approximation,
when k is small. This strategy nevertheless results in a reasonably well trained
RBM and is often called contrastive divergence.
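A minimal sketch of CD-k for the binary RBM above follows, using the conditionals implied by its bipartite energy (sigmoid in the pre-activations) and a single chain (S = 1); the shapes and helper names are assumptions for this example.

import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def cd_k_gradients(x, W, b, c, k=1):
    """Approximate gradients of the per-example loss via CD-k, cf. Eq. (5.42)."""
    # positive phase: expected hidden activation p(z_j = 1 | x)
    pz_pos = sigmoid(x @ W + c)
    # negative phase: run k steps of Gibbs sampling starting from the positive example
    x_neg = x.copy()
    for _ in range(k):
        z = (rng.uniform(size=c.shape) < sigmoid(x_neg @ W + c)).astype(float)
        x_neg = (rng.uniform(size=b.shape) < sigmoid(z @ W.T + b)).astype(float)
    pz_neg = sigmoid(x_neg @ W + c)
    # the gradient of the energy w.r.t. W is -x z^T, so positive minus negative phase gives
    dW = -np.outer(x, pz_pos) + np.outer(x_neg, pz_neg)
    db = -x + x_neg
    dc = -pz_pos + pz_neg
    return dW, db, dc   # take a step in the negative direction of these to increase the likelihood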
It turns out that we can keep the computational complexity, with only a minimal overhead in memory, by maintaining S samples across multiple stochastic gradient steps while ensuring that learning converges to the exact solution asymptotically. We do so by running S chains of Gibbs sampling in parallel to stochastic gradient descent. Between consecutive steps of SGD, we run the S chains of Gibbs sampling for T ≈ 1 steps each to update the set of S samples, so that they are more likely to have been drawn from the latest model. Then, we use these newly updated samples to compute the stochastic gradient estimate and update the model parameters.
As learning continues, the change to the model parameters slows down (since
we are getting increasingly closer to a local minimum), and thereby Gibbs sam-
pling chains in the background are increasingly getting closer to the stationary
distribution of the final model. The early stage of learning is thus inexact but has a low variance (because we are not perturbing the negative examples too much), while the later stage is increasingly exact since the model parameters change very slowly. This strategy is called persistent contrastive divergence.
x = g(ϵ; θg ), (5.43)
where ϵ ∼ p(ϵ) and p(ϵ) is some easy-to-sample distribution of our choice. This
sampler is parametrized by θg .
L_{rkl}(\theta_g) = \mathrm{KL}(p_g \| p_e) = -E_{x\sim p_g}\left[\log p_e(x) - \log p_g(x)\right]  (5.44)
= -\underbrace{E_{x\sim p_g}[\log p_e(x)]}_{=(a)} - \underbrace{H(p_g)}_{=(b)},  (5.45)
(a) = E_{x\sim p_g}[\log p_e(x)] = E\left[-e(x) - \log\int \exp(-e(x'))\, dx'\right].  (5.46)
Although we do not have p_g, we can draw samples from this distribution with g. We can thus compute the stochastic gradient of (a):
\nabla_{\theta_g}(a) \approx -\frac{1}{M}\sum_{m=1}^M \nabla_{\theta_g} e(g(\epsilon_m)),  (5.48)
where k(·, ·) is a kernel function. We will not discuss what kernel functions are in depth, but you can think of the kernel function as some kind of a distance metric, such that any kernel function k(a, b) satisfies two properties. First, it is symmetric: k(a, b) = k(b, a). Second, it is positive semi-definite: the Gram matrix of pairwise kernel values computed on any finite set of points has no negative eigenvalues.
Because the kernelized MMD above is differentiable w.r.t. the samples, as long
as the kernel function was selected to be differentiable, we can compute the
gradient of the MMD w.r.t. the parameters of the sampler g and use it in place
of the gradient of (b) from Eq. (5.55).
Although we will not go into any technical detail behind this kernelized
MMD, it is instructive to inspect it at an intuitive level. Let us start from
the back. The third term (c) is intuitively correct, as it computes the average
pair-wise distance between all possible pairs of samples from two distributions. If
the average pair-wise distance is larger, the discrepancy between two underlying
distributions must be high as well.
Let’s assume |D| = |D′ | (that is, we have the same number of samples
from each distribution.) Then, the minimum this pair-wise distance can at-
tain is determined by the average pair-wise distance within each set, since all
these samples would be placed on top of the samples from the other distribu-
tion. Furthermore, when this happens, the first two terms, (a) and (b), would
coincide with each other. Considering that the first two terms and the final
term have opposite signs, they would cancel out each other, resulting in 0, as
desirable. In other words, (c) determines the overall discrepancy between two
distributions, while (a) and (b) are there to take into account that the mini-
mum discrepancy between two distributions is largely bounded from below by
the intra-distribution dispersion.
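The following sketch computes a standard biased estimator of the squared MMD with a Gaussian (RBF) kernel; the kernel choice, the bandwidth and the exact constants are illustrative and may differ from the omitted definition in the text.

import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth**2))

def mmd2(X, Y, bandwidth=1.0):
    # (a) average kernel value within X, (b) within Y, (c) across X and Y
    a = rbf_kernel(X, X, bandwidth).mean()
    b = rbf_kernel(Y, Y, bandwidth).mean()
    c = rbf_kernel(X, Y, bandwidth).mean()
    return a + b - 2.0 * c      # exactly zero when the two sample sets coincide

rng = np.random.default_rng(8)
X = rng.standard_normal((200, 2))
Y = rng.standard_normal((200, 2)) + 1.5
print(mmd2(X, X.copy()), mmd2(X, Y))   # ~0 versus clearly positive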
By minimizing the following loss, we can train a sampling network g that
transforms a sample from a simple distribution p(ϵ) into a sample from the
target distribution defined from the energy function e:
J_g(\theta_g; e) = -\frac{1}{M}\sum_{m=1}^M e(g(\epsilon_m)) - \lambda \underbrace{\mathrm{MMD}^2\left(\{s_n\}_{n=1}^N,\, \{g(\epsilon_m)\}_{m=1}^M\right)}_{=R(\theta_g)},  (5.55)
where
s_n \sim \mathcal{N}\left(\mu = \frac{1}{M}\sum_{m=1}^M g(\epsilon_m),\; \alpha\Sigma\right)  (5.56)
with
\Sigma = \frac{1}{M}\sum_{m=1}^M \left(g(\epsilon_m) - \mu\right)\left(g(\epsilon_m) - \mu\right)^\top.  (5.57)
Training the energy function e and the sampler g together then amounts to solving the following minimax problem:
\min_\theta \max_{\theta_g}\; E_{x\sim D}\left[e(x,\theta)\right] - E_{\epsilon\sim p(\epsilon)}\left[e(g(\epsilon;\theta_g),\theta)\right] - \lambda R(\theta_g),  (5.59)
where Π is an arbitrary permutation of (1, 2, . . . , d). This chain rule states that
the probability of any configuration of X can be computed as the product of
the probabilities of the d constituents, appropriately conditioned on a subset of
constituents. Without loss of generality, we assume Π(i) = i.
Our goal is to build a neural network that models (a) above and thereby
model the joint probability function p(X). There are two things to consider.
First, we do not want to have d separate neural networks to capture d conditional
probability distributions. We instead want to have a single neural network that
is able to model the relationship between any pair of the target dimension xi
and the context dimensions x<i = (x1 , . . . , xi−1 ). This allows the predictor to
benefit from patterns shared across these pairs. For instance, if xi was the i-th
pixel in an image, we know that the pixel value of xi must be somewhat similar
to xi−1 , regardless of i, due to the locality of pixel values. This knowledge should
be more readily captured if a single predictor is used for all i.
Second, the number of parameters should not grow w.r.t. d, i.e., |θ| = o(d). It
is in fact desirable to have |θ| = O(1), by having absolutely no dependency on d.
This enables us to build an unsupervised model that can work on a variable-sized
observation, which is critically important when dealing with variable-length se-
quences, such as natural language text and video.
Combining these two considerations, we can now write this approach in the
form of
xi ∼ G(F ((x1 , x2 , . . . , xi−1 ); θ), ϵ), (5.63)
where ϵ is noise. This reminds us of autoregressive modeling in signal process-
ing,2 and thus we refer to such an approach as autoregressive modeling, as this is
akin to a nonlinear autoregressive model with an unbounded context (p → ∞).
Two building blocks from §3 are particularly suitable for implementing F: a recurrent block and an attention block (with a positional encoding). In the case
of a recurrent block, we do not need any modification, but can simply feed in the
entire sequence (x0 , x1 , x2 , . . . , xd ) and read out (p(x1 ), p(x2 |x1 ), . . . , p(xd |x<d )).
More specifically, if we use gated recurrent units,
h_i = F_{GRU}([x_i, h_{i-1}]; \theta_r)  (5.65)
p(x_{i+1}|x_{\leq i}) = \frac{\exp\left(u_{x_{i+1}}^\top h_i + c_{x_{i+1}}\right)}{\sum_{x\in C}\exp\left(u_x^\top h_i + c_x\right)},  (5.66)
\alpha_{ij} = \frac{\exp\left(q_i^\top k_j - m_{ij}\right)}{\sum_{j'=1}^N \exp\left(q_i^\top k_{j'} - m_{ij'}\right)},  (5.68)
where
m_{ij} = \begin{cases} 0, & \text{if } j < i \\ \infty, & \text{if } j \geq i. \end{cases}  (5.69)
This would ensure that the output v̂i from the attention block is not computed
using any input vectors (xi , xi+1 , . . . , xd ). Some refer to this as causal masking
by borrowing from the concept of a causal system in signal processing.
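A minimal sketch of this causal mask inside a single attention block follows; the query/key/value matrices are random placeholders, and a large negative constant stands in for the infinity in Eq. (5.69).

import numpy as np

rng = np.random.default_rng(9)
N, d = 5, 8
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# m_ij = 0 if j < i and "infinity" otherwise, cf. Eq. (5.69)
i_idx = np.arange(N)[:, None]
j_idx = np.arange(N)[None, :]
mask = np.where(j_idx < i_idx, 0.0, 1e9)

scores = Q @ K.T - mask                        # q_i^T k_j - m_ij, cf. Eq. (5.68)
scores -= scores.max(axis=1, keepdims=True)    # numerical stability
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)
out = alpha @ V                                # v_hat_i depends only on positions j < i
# the first position has nothing to attend to; in practice a start token x_0 takes care of this
print(np.round(alpha, 2))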
We must be careful when we are dealing with continuous xi ’s. We will discuss
why this is the case, and how we can deal with it properly in §6.4, if time permits.
A major advantage of this autoregressive modeling approach is that we can
compute the log-probability of any observation exactly. We simply need to com-
pute the conditional log-probabilities and sum them to get the log-probability
of the observation. This is unlike any latent variable approaches we have con-
sidered above. In the case of a variational autoencoder, we have to solve an
intractable marginalization problem, and in the case of RBM’s, we must com-
pute the intractable log-partition function, or log-normalization constant. Furthermore, we can readily draw independent samples tractably with this autoregressive model, which is a great advantage over RBMs, which require costly and challenging MCMC sampling.
This autoregressive modeling paradigm has become the de facto standard in
building conversational agents in recent years since the successful demonstra-
tions by Brown et al. [2020] and Ouyang et al. [2022]. To learn more about the
fundamentals behind language modeling and related ideas, see this somewhat
outdated lecture note [Cho, 2015]. We do not go into any further detail, as these
topics are out of the scope of this course.
Chapter 6
Further Topics
The first question we often need to ask is whether we can compute the stochastic
gradient of this objective w.r.t. the parameters θ. Let us try that ourselves here:
\nabla_\theta \int p(x) \sum_{y=1}^C p(y|x;\theta) R(y)\, dx = \int p(x) \underbrace{\sum_{y=1}^C \nabla_\theta p(y|x;\theta)\, R(y)}_{=\nabla_\theta E_{y|x;\theta} R(y)}\, dx.  (6.2)
f ′ = f · (log f )′ , (6.6)
where ỹ ∼ y|x; θ.
Before declaring victory, let us compute the variance of this stochastic gradient estimator.
Looking at the first term of the variance, we notice that there are two things
that affect the variance greatly. The first factor is the magnitude of the reward.
If the reward has a high magnitude, it results in an increased variance of the
because
(\log f)' = \frac{f'}{f}.  (6.7)
The optimal baseline to minimize the upperbound on the right-hand side is
\nabla_b\, \frac{c_{\max}}{2} E\left[(R(y) - b)^2\right] = -c_{\max}\left(E[R(y)] - b\right) = 0  (6.24)
\iff b^* = E_{y|x;\theta}\left[R(y)\right].  (6.25)
In other words, the optimal baseline is the expected reward we anticipated given
the input x.
Of course, this quantity is again intractable or impossible to compute exactly.
We can however now fit a predictor of b∗ given x using all the past observations of
(x, R(ỹ)), because each R(ỹ) is a single-sample approximation to Ey|x;θ [R(y)].3
Because we update θ along the way, many of the past samples would not be
valid under the current θ. If we however assume that θ is updated slowly and
that the predictor is adapted rapidly, asymptotically this is an exact procedure,
just like persistent contrastive divergence from §5.1.2.
We then need to maintain two predictors. One predictor is often called a
policy network that maps the current input, or state, x to the distribution
over possible outputs, or actions. The other predictor is often referred to as a
value network that maps the current state x to the expected reward. The latter
is called a value network, because it predicts the value of the current state,
regardless of the action to be taken by the policy. These networks are trained
in parallel.
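A minimal sketch of training these two predictors in parallel follows, for a toy contextual problem with a linear softmax policy and a linear value baseline; the reward structure, the learning rates and the variable names are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(10)
D, C = 4, 3                          # input dimension and number of actions
theta = np.zeros((C, D))             # policy network (here just a linear map)
w_b = np.zeros(D)                    # value network predicting the baseline b(x)
lr_pi, lr_b = 0.05, 0.05
true_best = 1                        # the action that yields the highest reward (toy setup)

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

for step in range(2000):
    x = rng.standard_normal(D)                   # state / input
    pi = softmax(theta @ x)                      # p(y|x; theta)
    y = rng.choice(C, p=pi)                      # sample an action from the policy
    R = 1.0 if y == true_best else 0.0           # observed reward
    b = w_b @ x                                  # predicted baseline b_hat(x; theta_b)
    # policy gradient: (R - b) * grad log p(y|x; theta) for a softmax policy
    grad_logp = -np.outer(pi, x)
    grad_logp[y] += x
    theta += lr_pi * (R - b) * grad_logp
    # value network fit with the mean-squared error, as in footnote 3
    w_b += lr_b * (R - b) * x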
The case of noisy reward: an actor critic method. Let’s imagine that
the reward R depends on both x and y and that it is random as well. That is,
we observe only a noisy estimate of the reward at x given the output choice y.
We probably want then to maximize the expected, expected reward:
∇θ Ey|x;θ Eϵ [R(y, x; ϵ)] ≈ (Eϵ [R(ỹ, x; ϵ)] − b(x))∇θ log p(ỹ|x; θ) (6.29)
≈ (R̃(ỹ, x) − b̂(x; θb ))∇θ log p(ỹ|x; θ), (6.30)
3 In particular, we should use a mean-squared error as the loss function when fitting a
predictor to estimate the expected reward. This comes from the fact that the optimal solution
to minimizing the mean-squared error corresponds to computing the average, as easily seen
below:
\nabla_\mu \frac{1}{2N}\sum_{n=1}^N (\mu - x_n)^2 = \frac{1}{N}\sum_{n=1}^N (\mu - x_n) = \mu - \frac{1}{N}\sum_{n=1}^N x_n = 0  (6.26)
\iff \mu = \frac{1}{N}\sum_{n=1}^N x_n.  (6.27)
where b̂(x) refers to a predicted baseline, e.g. the value of x, and θ_b denotes the parameters of this value function.
Unfortunately, this estimator will have an extra variance due to noisy reward.
Similarly to what we did with the baseline above, we can lower the variance by
predicting the expected reward at (x, ỹ) using a predictor trained on samples.
That is,
\nabla_\theta E_{y|x;\theta} E_\epsilon\left[R(y,x;\epsilon)\right] \approx \left(\underbrace{\hat{R}(\tilde{y},x;\theta_r)}_{=(a)} - \hat{b}(x;\theta_b)\right)\nabla_\theta \log p(\tilde{y}|x;\theta),  (6.31)
which may help reduce the variance from having to train two separate predictors. With a reasonable C, this can be implemented quite efficiently by having R̂ output a C-dimensional real-valued vector, multiplying this output elementwise with the output of the y predictor (which is often called a policy) and summing the resulting values. Sometimes we call this R̂ a critic and p(y|x;θ) an actor. This approach is thus called an actor-critic algorithm.
for y ∈ {1, 2, . . . , C}. This transition matrix gives us the distribution over the next state given the current state x_{t−1} and the selected action y_t. We further assume a reward function S⋆ that is defined on each state and returns a scalar, i.e. S⋆ : X → R. Each time we transit from x_{t−1} to x_t due to y_t, we receive a reward S⋆(x_t).
Together with a policy π(y|x; θ), this defines a distribution over trajectories, often called episodes. We can then sample a (potentially infinitely long) sequence
of tuples of previous state xt−1 , selected action yt , next state xt and received
reward st = S ⋆ (xt ). Of course, these tuples are highly correlated with each
other, since they are collected from a single trajectory defined by a shared set
of distributions, the policy, transition and reward. We will however for now
ignore this by saying that we are considering a particular time step t from many
independent trajectories.
Let us use n to refer to each of these trajectories. In order to apply the policy
gradient, or actor-critic algorithm, from above, we must start with the Q network
Q̂(xnt−1 , ytn ). This Q network approximates the expected quality of (xnt−1 , ytn ).
We define the expected quality by first defining the quality of (xnt−1 , ytn ) from
the n-th trajectory as
\tilde{Q}(x^n_{t-1}, y^n_t) = s^n_t + \sum_{t'=t+1}^{T_n} \gamma^{t'-t}\, s^n_{t'},  (6.35)
where γ ∈ [0, 1] is a so-called discounting factor and Tn is the length of the n-th
trajectory.
This formulation tells us that the quality of any particular state-action pair
is determined by the accumulated rewards from there on throughout the full
trajectory. Because we assumed the Markov property, it is perfectly fine for us
to ignore how we arrived at (x_{t−1}, y_t). With γ < 1, we are specifying that we do
not want to take into account what happens too far into the future. This is often
a good strategy to facilitate learning in the case of finite-length episodes, i.e.
Tn < ∞, and is necessary to define the quality to be finite with infinitely-long
episodes, i.e. Tn → ∞.5
This particular quality from the n-th trajectory can be thought of as a sample from a random variable Q(x^n_{t−1}, y^n_t), which is defined as
Q(x^n_{t-1}, y^n_t) = s^n_t + E_{q(x_t|x_{t-1},y^n_t)}\Big[\gamma E_{\pi(y_{t+1}|x_t)\, q(x_{t+1}|x_t,y_{t+1})}\big[s^\star(x_{t+1}) + \gamma E_{\pi(y_{t+2}|x_{t+1})\, q(x_{t+2}|x_{t+1},y_{t+2})}\left[s^\star(x_{t+2}) + \cdots\right]\big]\Big].  (6.36)-(6.38)
In other words, the expected quality is the weighted sum of all future per-step
rewards after marginalizing out all possible future trajectories according to the
transition model and the policy.
When we are working with finite-length trajectories, we can easily train the
Q network to minimize the following quantity:
\min_{\theta_r} \frac{1}{N}\sum_{n=1}^N \sum_{t=2}^{T_n} \frac{1}{2}\left(\hat{R}(x^n_{t-1}, y^n_t) - \tilde{Q}(x^n_{t-1}, y^n_t)\right)^2,  (6.39)
5 Unless γ < 1, the quality easily diverges as T_n → ∞, assuming s_t > 0, even when |s_t| < ∞.
because Q̃ is an unbiased sample drawn from the true distribution of the quality defined immediately above.
Unfortunately this is not possible, if we are working with an infinitely-long
episode. Such an infinitely-long episode is not common in the current-day setups,
but it is something we aspire to working with in the future, where we would
anticipate a learning based system to be deployed in real world situations and
adapt itself on the fly. Of course, in this case, we must update the Q network
also on-the-fly. It is unfortunately not possible to get even a single sample Q̃,
since we never see the end of any episode.
Let us re-arrange terms in Eq. (6.35):
\tilde{Q}(x^n_{t-1}, y^n_t) = s^n_t + \sum_{t'=t+1}^{T_n} \gamma^{t'-t}\, s^n_{t'}  (6.40)
= s^n_t + \gamma \underbrace{\left(s^n_{t+1} + \sum_{t'=t+2}^{T_n} \gamma^{t'-t-1}\, s^n_{t'}\right)}_{=\tilde{Q}(x^n_t, y^n_{t+1})}.  (6.41)
This allows us to write a loss function to train the Q network without waiting
for the full episode to end (or never end) by considering the temporal difference
at time t:
\min_{\theta_r} \frac{1}{N}\sum_{n=1}^N \left(\hat{R}(x^n_{t-1}, y^n_t; \theta_r) - \left(s^n_t + \gamma \hat{R}(x^n_t, y^n_{t+1}; \tilde{\theta}_r)\right)\right)^2  (6.43)
\iff \min_{\theta_r} \frac{\gamma^2}{N}\sum_{n=1}^N \left(\frac{1}{\gamma}\left(\hat{R}(x^n_{t-1}, y^n_t; \theta_r) - s^n_t\right) - \hat{R}(x^n_t, y^n_{t+1}; \tilde{\theta}_r)\right)^2,  (6.44)
dependencies of the choice of a particular action ytn at xnt−1 on many steps later,
since the naive temporal difference only considers one step deviation at a time.
There have been many improvements proposed since the initial work, but it is
out of the scope of this course to cover those.
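Here is a minimal sketch of the temporal-difference update behind Eq. (6.43) with a tabular Q function; the table shapes, the slowly-updated copy Q_target and the toy usage are assumptions for the example.

import numpy as np

def td_update(Q, Q_target, s, a, r, s_next, a_next, gamma=0.9, lr=0.1):
    """One temporal-difference step on 1/2 * (Q[s,a] - target)^2, cf. Eq. (6.43).

    Q and Q_target are tables of shape (num_states, num_actions); Q_target plays
    the role of the slowly-changing parameters theta~_r."""
    target = r + gamma * Q_target[s_next, a_next]   # s^n_t + gamma * R_hat(x^n_t, y^n_{t+1}; theta~_r)
    td_error = Q[s, a] - target
    Q[s, a] -= lr * td_error                        # gradient step on the squared temporal difference
    return Q

Q = np.zeros((5, 2)); Q_tgt = np.zeros((5, 2))
Q = td_update(Q, Q_tgt, s=0, a=1, r=1.0, s_next=2, a_next=0)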
With this Q network (or the critic network, as we learned to call it above), we can rely on the policy gradient to update the policy (or the actor network)
from Eq. (6.31). There are of course many different ways to improve the actor
update, for instance by constraining the update to be somewhat limited. Again,
these are more or less out of the scope of this course.
We will discuss where such a distribution comes from later, but for now, we will
assume it magically exists and that we can readily draw N classifiers from this
distribution q.
We already considered the case of having q earlier in §2.4.2 when we consid-
ered the following bias-variance decomposition from Eq. (2.118):
This decomposition was done on the loss averaged over the predictors drawn from the posterior distribution q. We can instead consider the loss computed using the average prediction of the predictors drawn from the posterior distribution. That is, our prediction is ŷ(x) = E_{θ∼q}[ŷ(x;θ)]. Then,
E_{x,y}\left[(y - \hat{y}(x))^2\right] \propto E_x\Big[\underbrace{E_{y|x}\left[(y - \mu_y)^2\right]}_{=(a')} - 2\underbrace{\hat{y}(x)}_{=\hat{\mu}_y}\underbrace{E_{y|x}[y]}_{=\mu_y} + \underbrace{\hat{y}^2(x)}_{=\hat{\mu}_y^2}\Big]  (6.50)
= E_x\Big[\underbrace{E_{y|x}\left[(y - \mu_y)^2\right]}_{=(a')} \underbrace{- \mu_y^2}_{=(b')} + \underbrace{(\hat{\mu}_y - \mu_y)^2}_{=(c')}\Big].  (6.51)
Now, let us consider the difference between these two loss functions. Since (a) and (a') are equivalent and (c) and (c') are equivalent, we only need to compare (b) and (b'):
E_\theta\left[(\hat{y}(x;\theta) - \hat{\mu}_y)^2\right] + \hat{\mu}_y^2 = E_\theta\left[\hat{y}^2(x;\theta)\right] - 2\hat{\mu}_y\underbrace{E_\theta[\hat{y}(x;\theta)]}_{=\hat{\mu}_y} + \hat{\mu}_y^2 + \hat{\mu}_y^2 = E_\theta\left[\hat{y}^2(x;\theta)\right] \geq 0.  (6.52)
In other words, this tells us that the average loss over the predictors is always
greater than or equal to the loss of the average prediction by the predictors.
This motivates the idea of bagging [Breiman, 1996].
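A minimal sketch of bagging follows: we draw predictors by refitting a least-squares regressor on bootstrap resamples of the training set, and average their predictions; the toy data, the number of bootstrap rounds B and the variable names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(11)
N, D, B = 200, 3, 25
X = rng.standard_normal((N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(N)

# draw B predictors by fitting least squares on bootstrap resamples of the training set
thetas = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)              # sample the training set with replacement
    Xb, yb = X[idx], y[idx]
    thetas.append(np.linalg.lstsq(Xb, yb, rcond=None)[0])

x_new = rng.standard_normal(D)
preds = np.array([x_new @ th for th in thetas])
y_bagged = preds.mean()                           # bagged prediction: average over the sampled predictors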
As long as we have q, or a sampler that draws predictors, or the correspond-
ing parameters, from this distribution q, bagging tells us that it is never a bad
idea to use many of those sampled predictors and average their predictions,
rather than using any one of them solely, on average. It turned out there are
many different ways that make our predictor θ random rather than determin-
istic. We have already covered most of them earlier in the course, but let us
briefly go through them here once more.
In modern machine learning, a major source of randomness is the use of
stochastic gradient descent on a non-convex loss function. The loss function is
not convex w.r.t. the parameters, as we stack highly nonlinear blocks to build
a deep neural network based predictor, and in doing so, we introduce a large
degree of redundancies (or ambiguities). These ambiguities are more or less
arbitrarily resolved by randomness in stochastic gradient descent. For instance,
our choice of the initial values of the parameters affects the subspace that stochastic gradient descent can explore to find a local minimum. In addition to initialization, there are other sources of randomness in stochastic gradient descent, such as how we construct minibatches by selecting random subsets of the training set. Furthermore, quite a few building blocks are inherently stochastic. Recall
the variational autoencoder from §4.3.1, where we injected noise for processing
each and every instance during training. In other words, we can think of the
resulting solution by running stochastic gradient descent as a sample drawn
from some distribution implicitly defined by this process of learning.
Of course, another major source of randomness is the choice of the training
set. As we have discussed earlier in §2.4.3, we can imitate the randomness in
data collection even when we have a single set of data points drawn from the underlying distribution.
e(\theta; D) = \sum_{x\in D} L(x;\theta).  (6.53)
We can interpret this energy function just like any other energy function we have defined and used throughout the semester. We want the predictor parametrized by θ to be assigned a low energy value when it is good, where the goodness of the predictor is defined by how low a loss it attains on the training set D. We can now turn this energy function into a probability function using the Gibbs distribution with an inverse temperature β.
where θm ∼ q(θ|D, β). In other words, if we follow the Bayes’ rule and think of
the loss function as an energy function of the parameter θ given an individual
instance, we arrive at the conclusion that we should draw predictors from the
posterior distribution q(θ|D, β). This is a great property, since we now have a
good guideline on what we should do, although the inclusion of β here was quite
intentional, as it says that we still need some kind of hyperparameter search even
in so-called Bayesian machine learning.
Let us now connect this (log-)posterior distribution with what we have
learned so far by writing it as
\log p(\theta|D,\beta) = \sum_{x\in D}\log p(x|\theta,\beta) + \log p(\theta) - \log Z(D,\beta)  (6.61)
= -\beta\sum_{x\in D} L(x;\theta) + \log p(\theta) - \log Z'(D,\beta),  (6.62)
where we collect all terms that are constant w.r.t. β into log Z ′ .
By setting β = \frac{\alpha}{|D|}, we end up with
-\log p(\theta|D) = \underbrace{\frac{\alpha}{|D|}\sum_{x\in D} L(x;\theta) - \log p(\theta)}_{=-\log p^*(\theta|D,\alpha)} + \text{const.},  (6.63)
where α > 0. We can then write the energy function that includes h = f(x;θ) + αg(x;θ') as
e'([x,y], \{\theta,\theta'\}) = \frac{1}{2}\left\|\underbrace{\left(y - f(x;\theta)\right)}_{=(a)} - \alpha g(x;\theta')\right\|^2.
We can minimize this energy function w.r.t. α and θ′ , which results in g that
complements the existing predictor f to minimize any remaining error by f .
This idea is often referred to as boosting [Schapire, 1990], as it boosts the represen-
tational power of weak predictors by combining two weak predictors, here f and
g, to form a stronger predictor. This procedure can be repeated by considering
h as f and introducing yet another weak predictor g into the mix, until the
point at which a satisfyingly low level of the loss is achieved. Although we have
derived it in the context of a single example (x, y), it should be readily extended
to multiple example pairs.
By carefully inspecting (a), we realize that this term is the negative gradient
of Eq. (6.64) w.r.t. f (x; θ):
\frac{\partial e}{\partial f(x;\theta)} = -(y - f(x;\theta)).  (6.68)
Instead of e, which was equivalent to the loss, because it was formulated using
L2 distance, we can use a more generic loss l(θ; [x, y]). We can then further
rewrite e′ as
e′ ([x; y], {θ, θ′ }) ∝ ∥ − ∇ŷ l(θ; [x, y]) − αg(x; θ′ )∥2 , (6.69)
This procedure resembles the process of gradient descent from §2.3.2, and is
thereby referred to as gradient boosting [Friedman, 2001].
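A minimal sketch of this procedure follows, repeatedly fitting a weak learner to the negative gradient of the squared loss (i.e. the residual) and adding it to the current predictor; the decision stump used as the weak learner, the step size and the toy data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(12)
N = 300
x = rng.uniform(-3, 3, size=N)
y = np.sin(x) + 0.1 * rng.standard_normal(N)

def fit_stump(x, r):
    """A decision stump fit to the residual r: one threshold, two constant predictions."""
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = r[x <= t].mean(), r[x > t].mean()
        err = np.sum((r - np.where(x <= t, left, right)) ** 2)
        if best is None or err < best[0]:
            best = (err, t, left, right)
    _, t, left, right = best
    return lambda q: np.where(q <= t, left, right)

f = lambda q: np.zeros_like(q)                   # start from the zero predictor
alpha = 0.3                                      # step size for each weak learner
for _ in range(50):
    residual = y - f(x)                          # negative gradient of the squared loss, cf. Eq. (6.68)
    g = fit_stump(x, residual)                   # weak learner fit to the residual
    f = (lambda f_prev, g: (lambda q: f_prev(q) + alpha * g(q)))(f, g)

print(np.mean((y - f(x)) ** 2))                  # training error decreases as stumps accumulate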
Boosting does not specify how to estimate β and g (or equivalently θ′ ) at each
iteration, and it is up to the practitioner to decide which (weak) learner g they
use and which loss function l they choose. Popular choices include decision trees
and kernel-based support vector machines. In this sense, this is not a learning algorithm but more of a meta-heuristic.
6.3 Meta-Learning
In the previous section §6.2, we learned that it is a good idea to average the
predictions from multiple models if we have a distribution q(θ) over the models
(or predictors) rather than a single predictor. We then learned that Bayesian
machine learning tells us that this distribution should be conditioned on the
training set, resulting in q(θ|D), and that we can obtain this posterior distribution
following the Bayes’ rule:
Y
q(θ|D) ∝ p(θ) p(x|θ). (6.72)
x∈D
It is a fair question at this point whether we must follow this particular for-
mulation based on the Bayes’ rule. Perhaps there is a better way to map the
training set D to the posterior distribution over θ.
Let us assume that we have not one but multiple training sets D^1, D^2, \ldots, D^M, corresponding to M prediction tasks. For each training set, we can define a so-called K-fold cross-validation loss as
L_{KCV}(\phi; D^m) = -\frac{1}{K}\sum_{k=1}^K \sum_{x\in D^m_{\sigma_k(1):\sigma_k(\lceil\frac{1}{K}|D^m|\rceil)}} \log \int_\Omega p(x|\theta)\, q\!\left(\theta \,\middle|\, D^m_{\sigma_k(\lceil\frac{1}{K}|D^m|\rceil+1):\sigma_k(|D^m|)};\, \phi\right) d\theta,  (6.73)
where \theta_b \sim q\!\left(\theta \,\middle|\, D^m_{\sigma_k(\lceil\frac{1}{K}|D^m|\rceil+1):\sigma_k(|D^m|)};\, \phi\right). Since θ is often continuous, we can for instance compute the gradient of L_{KCV} w.r.t. ϕ with the reparametrization trick, as long as q is differentiable w.r.t. ϕ.
This is interesting, since we can be flexible about how we parametrize q,
and this q is directly optimized to result in a distribution over θ or a set of
θ’s under which the predictive loss is minimal. In other words, q is a learning
algorithm, and we are training a learning algorithm by minimizing the meta-
objective function in Eq. (6.74) w.r.t. q.
For instance, we can define q implicitly by drawing a sample of the parame-
ters θ from q using just a few steps N of stochastic gradient descent, as opposed
to running it until convergence as from §2.3.2. In doing so, we can consider the
initialization θ0 of the parameters as ϕ. By minimizing the meta-objective func-
tion w.r.t. θ0 , we are looking for the initialization of the parameters that are
optimal with N SGD steps. If a new training set encountered after such meta-learning is similar to the meta-training sets, we would expect N SGD steps to be enough, if not optimal, for obtaining a good predictor. This approach was originally proposed by Finn et al. [2017] and is called model-agnostic meta-learning.
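A minimal sketch of this idea follows for toy linear-regression tasks, where the loss is quadratic, so both the inner SGD step and the meta-gradient through it can be written analytically; the task generator, the single inner step and the learning rates are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(13)
D, inner_lr, meta_lr = 3, 0.1, 0.05
theta0 = np.zeros(D)                     # phi: the shared initialization we meta-learn

def make_task():
    """A random linear-regression task split into a support set and a query set."""
    w = rng.standard_normal(D)
    X = rng.standard_normal((20, D))
    y = X @ w + 0.1 * rng.standard_normal(20)
    return (X[:10], y[:10]), (X[10:], y[10:])

def grad(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

for step in range(2000):
    (Xs, ys), (Xq, yq) = make_task()
    # inner loop: one SGD step from the shared initialization (N = 1 for brevity)
    theta_adapted = theta0 - inner_lr * grad(theta0, Xs, ys)
    # outer loop: differentiate the query loss through the inner step;
    # d theta_adapted / d theta0 = I - inner_lr * (Hessian of the support loss)
    H = Xs.T @ Xs / len(ys)
    meta_grad = (np.eye(D) - inner_lr * H) @ grad(theta_adapted, Xq, yq)
    theta0 -= meta_lr * meta_grad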
Of course, we can completely forgo any iterative optimization when designing q and instead build a predictor that directly maps a set of training data points D to the prediction on a new observation x′. In doing so, it is important to realize that this predictor cannot simply take D as input but needs to model noise in learning itself. This naturally calls for including latent variables z in this predictor, just like how we did earlier with generative models in §4. In this case, the posterior distribution q(θ) is implicit, and we directly predict the predictive probability by
p(x|D;\phi) = \int_{\mathcal{Z}} p(x|z;\phi_x)\, p_z(z|D;\phi_z)\, dz,  (6.76)
where p_z is the prior over z and we marginalize out z. This approach is often referred to as neural processes [Garnelo et al., 2018]. Because this marginalization is often intractable, it is a common practice to approach it with the variational inference and learning we covered in §4.3.1.
Overall, these approaches are referred to as meta-learning, since such a proce-
dure results in a predictor that knows how to learn to solve a problem given a
set of new examples. Meta-learning can then be used to solve not only learning
problems but also any kind of set-to-set problems, such as causal discovery and
statistical inference problems. This is an exciting and active area of research.
assuming
e([x,y], z, \theta) = \frac{1}{2\sigma^2_z(F(x;\theta_F);\theta_\sigma)}\left\|y - \mu_z(F(x;\theta_F);\theta_\mu)\right\|^2.  (6.81)
6 A categorical variable takes a value out of a small number of predefined possible values.
We can solve this problem by gradient descent which will find one of at most K
modes of this complex distribution or find a saddle point.
It however is unsatisfactory to return a single point estimate of the solution,
when we trained our predictor to capture the full distribution over the output
space. Rather, it may be desirable to return a set of possible values of the
outcome y that are within a credible region, following the procedure from §2.4.3.
This is particularly appealing, as we can readily draw as many independent samples as we need from the mixture of Gaussians. Once the samples {y_1, ..., y_M} are drawn, we score each sample with the mixture density network, which is again trivial, resulting in {p_1, ..., p_M}. We can then fit a cumulative distribution function to these scores and pick only those samples whose scores are above a predefined threshold. These
selected outputs can be considered a credible set of outputs for x.
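A minimal sketch of this procedure follows for a one-dimensional mixture of Gaussians; the mixture parameters stand in for the outputs of a mixture density network at a particular x, and the quantile-based threshold is an illustrative choice.

import numpy as np

rng = np.random.default_rng(14)
pis = np.array([0.5, 0.3, 0.2])              # mixture weights pi_k(x) predicted by the network
mus = np.array([-2.0, 0.0, 3.0])             # component means mu_k(x)
sigmas = np.array([0.5, 1.0, 0.8])           # component standard deviations sigma_k(x)

def mog_density(y):
    return np.sum(pis * np.exp(-0.5 * ((y[:, None] - mus) / sigmas) ** 2)
                  / (np.sqrt(2 * np.pi) * sigmas), axis=1)

# draw independent samples from the mixture and score them with the mixture density
M = 2000
comp = rng.choice(len(pis), size=M, p=pis)
samples = rng.normal(mus[comp], sigmas[comp])
scores = mog_density(samples)

# keep the samples whose density exceeds a threshold chosen from the empirical scores
threshold = np.quantile(scores, 0.2)         # e.g. drop the lowest-scoring 20%
credible_set = samples[scores >= threshold]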
6.5 Causality
A major limitation of all methods in this lecture note, perhaps except for rein-
forcement learning in §6.1, is that they all rely almost entirely on association, or
correlation. These algorithms all look for which patterns appear together with
which other patterns frequently within a given dataset.
Already in §2.2.2, this was apparent. For instance, recall the following update
rule for a linear block in Eq. (2.53):
\frac{\partial}{\partial u_{ij}} = x_i h_j - x_i \hat{h}_j,  (6.84)
where we assume there was no nonlinearity, i.e. h′j = 1. The first term decreases
the value of uij toward the origin 0 if xi and the old, undesired value of the
j-th hidden neuron had the same sign.7 The second term on the other hand
increases the value of uij away from the origin if xi and the new, desirable ĥj
have the same sign. In other words, uij , one of the many parameters of this
predictor, encodes how correlated the i-th dimension of the observation and the
j-th dimension of the hidden variable are with each other.
This is perfectly fine, if the goal is to capture such correlations and use them
to impute missing values, such as outputs associated with test-time observa-
tions. This is not enough however if we want to infer the causal relationship
among variables, because as we often say casually, “correlation does not imply
causation.”8
Let us dig slightly deeper into this statement and consider a few cases where
correlation exists but causation does not. The first case is when there exists an
unobserved confounder, where the confounder z is defined to affect both the
input x and the outcome y, such that
x ← z → y.
It is relatively straightforward to see that the joint distribution of x and y would not factorize into the product of p(x) and p(y), i.e.
\int p(x|z)\, p(y|z)\, p(z)\, dz \neq p(x)\, p(y),  (6.86)
unless
\int p(x|z)\, p(y|z)\, p(z)\, dz = p(x)\int p(y|z)\, p(z)\, dz,  (6.87)
which would imply that there is no edge going from z to x in the first place.
That we cannot factor p(x, y) into the product of the marginals of x and
y implies that x and y are dependent on each other. Equivalently, we can say
that x and y are correlated with each other (potentially nonlinearly.) They are
however unrelated to each other causally, since intervening on x would not cause
any change in y and vice versa.
An example of this case of an unobserved confounder can be found in driving.
If one is not aware of how driving works and only looks at the dashboard of a
car,9 it is easy to see that the turn indicator and the steering wheel angle
are highly correlated with each other, which may result in an incorrect causal
conclusion that the turn indicator causes the steering wheel to turn, or vice
versa. This is missing a big confounder that is a driver and their intention to
turn the car.
The second case is what we often refer to as confirmation bias. Consider the following causal model:
x → z ← y.
Conditioned on z, the joint distribution over x and y is
p(x,y|z) = \frac{p(x)\, p(y)\, p(z|x,y)}{\int p(x')\, p(y')\, p(z|x',y')\, dx'\, dy'}.  (6.88)
Because of p(z|x, y), we cannot factor p(x, y|z) into the product of two terms,
each of which depends only on either x or y. If we could, that would imply that
z is caused by either one of x or y (or neither.) The input and outcome are
correlated in this case, because we only selectively consider a subset of (x, y)
pairs that are associated with a particular value of z. This is thus also called a
selection bias.
Let us consider an example, where x corresponds to a burglary and y to an
earthquake. z is a house alarm. The house alarm goes off (z = 1) when either
there is burglary (x = 1) or there is an earthquake (y = 1). It is pretty safe for
9 Imagine you are collecting data from the car to build a self-driving model.
102 CHAPTER 6. FURTHER TOPICS
now to assume that the chances of burglary and earthquake are pretty much
independent of each other. If you however hear that your alarm went off, that is, if you condition on z = 1, burglary and earthquake are not independent anymore: I would be able to explain away the chance of a burglary if I had felt the earthquake myself, since the question becomes how likely it is that an earthquake and a burglary happened together and triggered the alarm. Although there is no causal relationship between the earthquake and the burglary, they are now negatively correlated with each other because we have conditioned on the alarm going off.
These cases emphasize the difference between association (correlation) and
causality. In order to capture causal relationships among variables and use them
to control the underlying system, we must use an extra set of assumptions and
tools to rule out non-causal associations, or so-called spurious correlations. Once
we are equipped with such tools, we can make machine learning more robust in
more realistic scenarios, for instance where the distribution from which obser-
vations are drawn shifts between training and test times. This is a fascinating
topic in machine learning and more broadly artificial intelligence, but is out
of the scope of this course. I suggest you check out my lecture note “A Brief
Introduction to Causal Inference in Machine Learning” [Cho, 2024] and then
move on to more in-depth materials on causal inference, causal discovery and
causal representation learning.
Bibliography
W. K. Hastings. Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57(1):97–109, 1970. doi: 10.1093/biomet/57.1.97.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recogni-
tion. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016.
D. O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley
& Sons, New York, 1949.
G. E. Hinton. Training products of experts by minimizing contrastive diver-
gence. Neural computation, 14(8):1771–1800, 2002.
D. Hjelm, R. R. Salakhutdinov, K. Cho, N. Jojic, V. Calhoun, and J. Chung.
Iterative refinement of the approximate posterior for directed belief networks.
Advances in neural information processing systems, 29, 2016.
H. Hotelling. Analysis of a complex of statistical variables into principal com-
ponents. Journal of educational psychology, 24(6):417, 1933.
A. Ilin and T. Raiko. Practical approaches to principal component analysis in
the presence of missing values. The Journal of Machine Learning Research,
11:1957–2000, 2010.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. Proceedings of the 32nd International
Conference on Machine Learning, 37:448–456, 2015.
E. T. Jaynes. Information theory and statistical mechanics. Physical review,
106(4):620, 1957.
D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of
expensive black-box functions. Journal of Global Optimization, 13(4):455–492,
1998. doi: 10.1023/A:1008352424803.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation
and model selection. IJCAI, 14(2):1137–1145, 1995.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep
convolutional neural networks. In Advances in neural information processing
systems, pages 1097–1105, 2012.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning ap-
plied to document recognition. Proceedings of the IEEE, 86(11):2278–2324,
1998.