
An Introduction to Machine Learning

Yao-Liang Yu
[email protected]
School of Computer Science
University of Waterloo

December 30, 2021

Abstract
This is the lecture note for CS480/680.

Contents
-1 Optimization Basics

0 Statistical Learning Basics

1 Perceptron

2 Linear Regression

3 Logistic Regression

4 Support Vector Machines (SVM)

5 Soft-margin Support Vector Machines

6 Reproducing Kernels

7 Automatic Differentiation (AutoDiff)

8 Deep Neural Networks

9 Convolutional Neural Networks

10 Recurrent Neural Networks

11 Graph Neural Networks

12 k-Nearest Neighbors (kNN)

13 Decision Trees

14 Boosting

15 Expectation-Maximization (EM) and Mixture Models

16 Restricted Boltzmann Machine (RBM)

17 Deep Belief Networks (DBN)

18 Generative Adversarial Networks (GAN)

19 Attention

20 Learning to Learn


-1 Optimization Basics
Goal

Convex sets and functions, gradient and Hessian, Fenchel conjugate, Lagrangian dual and gradient descent.

Alert -1.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Definition -1.2: Vector space

Throughout the course, our universe is typically a (linear) vector space V over the real scalar field R. On V
we have two operators: addition + : V × V → V and (scalar) multiplication · : R × V → V. Together they
satisfy the following axioms:
• Addition commutativity: ∀u, v ∈ V, u + v = v + u;
• Addition associativity: ∀u, v, w ∈ V, (u + v) + w = u + (v + w);
• Identity of addition: ∃0 ∈ V such that ∀v ∈ V, 0 + v = v;

• Inverse of addition: ∀v ∈ V, ∃ unique u ∈ V such that v + u = 0, in which case we denote u = −v;


• Identity of multiplication: ∀v ∈ V, 1 · v = v, where 1 ∈ R is the usual constant 1;
• Compatibility: ∀a, b ∈ R, v ∈ V, a(bv) = (ab)v;
• Distributivity over vector addition: ∀a ∈ R, u, v ∈ V, a(u + v) = au + av;

• Distributivity over scalar addition: ∀a, b ∈ R, v ∈ V, (a + b)v = av + bv.


These axioms are so natural that you may question why we bother to formalize them. Well, take two
images/documents/graphs/speeches/DNAs/etc., how do you add them? multiply with scalars? inverse?
Not so obvious... Representing objects as vectors in a vector space so that we can exploit linear algebra
tools is arguably one of the most important lessons in ML/AI.

Example -1.3: Euclidean space

The most common vector space we are going to need is the d-dimensional Euclidean space Rd . Each vector
v ∈ Rd can be identified with a d-tuple: v = (v1 , v2 , . . . , vd ). The addition and multiplication operators are
defined element-wise, and we can easily verify the axioms:
• Let u = (u1 , u2 , . . . , ud ), then u + v = (u1 + v1 , u2 + v2 , . . . , ud + vd );
• Verify associativity yourself;
• 0 = (0, 0, . . . , 0);

• −v = (−v_1, −v_2, . . . , −v_d);


• Obvious;
• av = (av1 , av2 , . . . , avd );
• Verify distributivity yourself;


• Verify distributivity yourself.


In the perceptron lecture, we encode an email as a binary vector x ∈ {0, 1}^d ⊆ R^d. This crude bag-of-words representation allowed us to use all linear algebra tools.
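
To make the representation concrete, here is a tiny Python sketch (not part of the original notes; the five-word vocabulary is made up purely for illustration) that encodes short texts as binary vectors in {0, 1}^d on which addition and inner products make sense:

import numpy as np

vocabulary = ["free", "money", "meeting", "tomorrow", "prize"]  # d = 5, illustrative only

def encode(text):
    # binary bag-of-words: coordinate j is 1 iff the j-th vocabulary word occurs in the text
    words = set(text.lower().split())
    return np.array([1 if w in words else 0 for w in vocabulary])

x1 = encode("free money claim your prize")
x2 = encode("meeting tomorrow about the prize budget")
print(x1, x2, x1 + x2, x1 @ x2)  # vector addition and inner product now make sense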

Definition -1.4: Convex set

A point set C ⊆ V is called convex if

∀w, z ∈ C, [w, z] := {λw + (1 − λ)z : λ ∈ [0, 1]} ⊆ C.

Elements in the “interval” [w, z] are called convex combinations of w and z.

Theorem -1.5: Intersection and union of convex sets

Arbitrary intersection and increasing union of convex sets are convex.


Thus, lim inf α Cα := ∪α ∩β≥α Cβ is convex. However, arbitrary union hence lim supα Cα := ∩α ∪β≥α Cβ
may not be convex.

Exercise -1.6: Hyperplane and halfspace

Verify the convexity of the hyperplane and halfspace:

∂Hw,b := {x ∈ Rd : ⟨x, w⟩ + b = 0}
Hw,b := {x ∈ Rd : ⟨x, w⟩ + b ≤ 0}

(The partial notation ∂ in front of a set means boundary.)

Exercise -1.7: Polyhedron

A polyhedron is a finite intersection of halfspaces:

P := ⋂_{i=1,...,n} H_{w_i, b_i} = {x ∈ R^d : Wx + b ≤ 0}.

Prove any polyhedron is convex.


If a polyhedron is bounded, then we call it a polytope. Prove the following:
• the unit ball of ℓ1 norm {x : ∥x∥1 ≤ 1} is a polytope;

• the unit ball of ℓ∞ norm {x : ∥x∥∞ ≤ 1} is a polytope;


• the unit ball of ℓ2 norm {x : ∥x∥2 ≤ 1} is not a polytope.

Theorem -1.8: Convex sets as intersection of halfspaces

Any closed convex set is an intersection of halfspaces:

C = ⋂_{i∈I} H_{w_i, b_i}.


The subtlety is that for non-polyhedra, the index set I is infinite. For instance:

{w : ∥w∥₂ ≤ 1} = ⋂_{w : ∥w∥₂ = 1} H_{w, −1}.

Definition -1.9: Convex function (Jensen 1905)

The extended real-valued function f : V → (−∞, ∞] is called convex if Jensen’s inequality holds:

∀w, z ∈ V, ∀λ ∈ (0, 1), f (λw + (1 − λ)z) ≤ λf (w) + (1 − λ)f (z). (-1.1)

It is necessary that the (effective) domain of f , i.e. dom f := {w ∈ V : f (w) < ∞}, is a convex set.
We call f strictly convex iff the equality in (-1.1) holds only when w = z.
A function f is (strictly) concave iff −f is (strictly) convex.
According to wikipedia, Jensen (Danish) never held any academic position and proved his mathematical
results in his spare time.
Jensen, Johan Ludwig William Valdemar (1905). “Om konvekse Funktioner og Uligheder mellem Middelværdier”. Nyt
Tidsskrift for Matematik B, vol. 16, pp. 49–68.

Exercise -1.10: Affine = convex and concave

Prove that a function is both convex and concave iff it is affine (see Definition 1.13).

Remark -1.11: Convexity is remarkably important!

In the above definition of convexity, we have used the fact that V is a vector space, so that we can add
vectors and multiply them with (real) scalars. It is quite remarkable that such a simple definition leads to
a huge body of interesting results, a tiny part of which we shall be able to present below.

Definition -1.12: Epigraph and sub-level sets

The epigraph of a function f is defined as the set of points lying on or above its graph:

epi f := {(w, t) ∈ V × R : f (w) ≤ t}.

It is clear that the epigraph of a function completely determines it:

f (w) = min{t : (w, t) ∈ epi f }.

So two functions are equal iff their epigraphs (sets) are the same. Again, we see that functions and sets are
somewhat equivalent.
The sub-level sets of f are defined as:

∀t ∈ R, Lt := Jf ≤ tK := {w ∈ V : f (w) ≤ t}.

Sub-level sets also completely determine the function:

f (w) = min{t : w ∈ Lt }.

Each sub-level set is clearly a section of the epigraph.


Exercise -1.13: Calculus for convexity

Prove the following:


• Any norm is convex;
• If f and g are convex, then for any α, β ≥ 0, αf + βg is also convex; (what about −f ?)
• If f : R^d → R is convex, then so is w ↦ f(Aw + b);
• If ft is convex for all t ∈ T , then f := supt∈T ft is convex;

• If f(w, t) is jointly convex in w and t, then w ↦ inf_{t∈T} f(w, t) is convex;


• f is a convex function iff its epigraph epi f is a convex set;
• If f : C → R is convex, then the perspective function g(w, t) := tf (w/t) is convex on C × R++ ;

• If f (w, z) is convex, then for any w, fw := f (w, ·) is convex and similarly for f z := f (·, z). (what
about the converse?)
• All sub-level sets of a convex function are convex, but a function with all sub-level sets being convex
need not be convex (these are called quasi-convex functions).

Definition -1.14: Fenchel conjugate function

For any extended real-valued function f : V → (−∞, ∞] we define its Fenchel conjugate function as:

f∗(w∗) := sup_w ⟨w, w∗⟩ − f(w).

According to one of the rules in Exercise -1.13, f ∗ is always a convex function (of w∗ ).
If f is convex, dom f is nonempty and closed, and f is continuous (on its domain), then

f∗∗ := (f∗)∗ = f.

This remarkable property of convex functions will be used later in the course.

Theorem -1.15: Verifying convexity

Let f : R^d → R be twice differentiable. Then f is convex iff its Hessian ∇²f is everywhere positive semidefinite. If the Hessian is everywhere positive definite, then f is strictly convex. (The converse fails: the function f(x) = x⁴ is strictly convex but its Hessian vanishes at its minimum x = 0.)
To see why, fix any w and take any direction d. Consider the univariate function h(t) := f(w + td). We verify that h″(t) = ⟨d, ∇²f(w + td) d⟩ ≥ 0. In other words, a convex function has an increasing derivative along any direction d (starting from any point w).
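
As a quick numerical sanity check (not part of the original notes), the following Python sketch estimates the Hessian of the log-sum-exp function (see Exercise -1.16) by finite differences and verifies that its smallest eigenvalue is nonnegative up to numerical error; the finite-difference step and the tolerance are arbitrary choices of the sketch:

import numpy as np

def logsumexp(w):
    m = w.max()
    return m + np.log(np.exp(w - m).sum())

def hessian(f, w, eps=1e-4):
    # central finite-difference Hessian (a rough numerical check, not a proof)
    d = len(w)
    I = np.eye(d)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(w + eps*I[i] + eps*I[j]) - f(w + eps*I[i] - eps*I[j])
                       - f(w - eps*I[i] + eps*I[j]) + f(w - eps*I[i] - eps*I[j])) / (4 * eps**2)
    return H

rng = np.random.default_rng(0)
for _ in range(20):
    w = rng.normal(size=4)
    assert np.linalg.eigvalsh(hessian(logsumexp, w)).min() >= -1e-5  # PSD up to numerical noise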

Exercise -1.16: Example convex functions

Prove the following functions are convex (Exercise -1.13 and Theorem -1.15 may be handy):
• affine functions: f(w) = a⊤w + b;
• exponential function: f(x) = exp(x);
• entropy: f(x) = x log x with x ≥ 0 and 0 log 0 := 0;
• log-sum-exp: f(x) = log Σ_{j=1}^{d} exp(x_j); (its gradient is the so-called softmax, to be discussed later)


(You may appreciate the epigraph more after the last exercise.)

Definition -1.17: Optimization problem

Consider a function f : R^d → R; we are interested in the minimization problem:

p∗ := min_{w∈C} f(w),    (-1.2)

where C ⊆ R^d represents the constraints that w must satisfy.


Historically, the minimization problem (-1.2) was motivated by the need of solving nonlinear equations
h(w) = 0 (Cauchy 1847; Curry 1944), which can be reformulated as a least squares minimization problem
minw h2 (w). As pointed out by Polyak (1963), the minimization problem itself has become ubiquitous, and
often does not have to correspond to solving any nonlinear equation.
We remind that the minimum value p∗ is an extended real number in [−∞, ∞] (where p∗ = ∞ iff C = ∅).
When p∗ is finite, any feasible w ∈ C such that f (w) = p∗ is called a minimizer. Minimum value always
exists while minimizers may not!
Cauchy, M. Augustin-Louis (1847). “Méthode générale pour la résolution des systémes d’équations simultanées”.
Comptes rendus hebdomadaires des séances de l’Académie des sciences, vol. 25, no. 2, pp. 536–538.
Curry, Haskell B. (1944). “The Method of Steepest Descent for Non-linear Minimization Problems”. Quarterly of
Applied Mathematics, vol. 2, no. 3, pp. 258–261.
Polyak, Boris Teodorovich (1963). “Gradient methods for the minimization of functionals”. USSR Computational
Mathematics and Mathematical Physics, vol. 3, no. 4, pp. 643–653.

Definition -1.18: Level-bounded

A function f : Rd → (−∞, ∞] is said to be level bounded iff for all t ∈ R, the sublevel set Jf ≤ tK is bounded.
Equivalently, f is level bounded iff ∥w∥ → ∞ =⇒ f (w) → ∞.

Theorem -1.19: Existence of minimizer

A continuous and level-bounded function f : Rd → (−∞, ∞] has a minimizer.


To appreciate this innocent theorem, let f (x) = exp(x):
• Is f continuous?

• Is f level-bounded?
• What is the minimum value of f? Does f have a minimizer on R?

Definition -1.20: Extrema of an unconstrained function

Recall that w is a local minimizer of f if there exists an (open) neighborhood N of w so that for all z ∈ N we
have f (w) ≤ f (z). In case when N can be chosen to be the entire space Rd , we say w is a global minimizer
of f with the notation w ∈ argmin f .
By definition, a global minimizer is always a local minimizer while the converse is true only for a special
class of functions (known as invex functions). The global minimizer may not be unique (take a constant
function) but the global minimum value is.
The definition of a local (global) maximizer is analogous.


Exercise -1.21: Properties of extrema

Prove the following:


• w is a local (global) minimizer of f iff w is a local (global) maximizer of −f .
• w is a local (global) minimizer of f iff w is a local (global) minimizer of λf + c for any λ > 0 and
c ∈ R.
• If w is a local (global) minimizer (maximizer) of f , then it is a local (global) minimizer (maximizer)
of g(f ) for any increasing function g : R → R. What if g is strictly increasing?

• w is a local (global) minimizer (maximizer) of a positive function f iff it is a local (global) maximizer
(minimizer) of 1/f .
• w is a local (global) minimizer (maximizer) of a positive function f iff it is a local (global) minimizer
(maximizer) of log f .

Alert -1.22: The epigraph trick

Often, we rewrite the optimization problem

min_{w∈C} f(w)

as the equivalent one:

min_{(w,t) ∈ epi f ∩ (C×R)} t,    (-1.3)

where the newly introduced variable t is jointly optimized with w. The advantages of (-1.3) include:
• the objective in (-1.3) is a simple, canonical linear function ⟨(0, 1), (w, t)⟩;
• the constraints in (-1.3) may reveal more (optimization) insights, as we will see in SVMs.

Theorem -1.23: Local is global under convexity

Any local minimizer of a convex function (over some convex constraint set) is global.

Proof. Let w be a local minimizer of a convex function f . Suppose there exists z such that f (z) < f (w).
Take convex combination and appeal to the definition of convexity in Definition -1.9:

∀λ ∈ (0, 1), f (λw + (1 − λ)z) ≤ λf (w) + (1 − λ)f (z) < f (w),

contradicting the local minimality of w.


In fact, in Theorem 2.22 we will see that any stationary point of a convex function is its global minimizer.

Theorem -1.24: Fermat’s necessary condition for extrema

A necessary condition for w to be a local minimizer of a differentiable function f : Rd → R is

∇f (w) = 0.

(Such points are called stationary, a.k.a. critical.) If f is convex, then the necessary condition is also

sufficient.

Proof. Suppose t is a local minimizer of a univariate function g : R → R; then, applying the definition of the derivative, we easily deduce that g′(t) ≤ 0 and g′(t) ≥ 0, i.e. g′(t) = 0.
Now if w is a local minimizer of f : R^d → R, then t = 0 is a local minimizer of the univariate function g(t) := f(w + td) for any direction d. Applying the previous result for univariate functions and the chain rule:

g′(0) = d⊤∇f(w) = 0.

Since the direction d was arbitrary, we must have ∇f(w) = 0.


If f is convex, then we have

⟨∇f(w), z − w⟩ = lim_{t→0⁺} [f(w + t(z − w)) − f(w)] / t ≤ [t f(z) + (1 − t) f(w) − f(w)] / t = f(z) − f(w).

In other words, we obtain the first-order characterization of convexity:

∀w, z,  f(z) ≥ f(w) + ⟨∇f(w), z − w⟩.

Plugging in ∇f(w) = 0 we see that w is in fact a global minimizer.

Taking f(x) = x³ and x = 0, we see that this necessary condition is not sufficient for nonconvex functions.
For local maximizers, we simply negate the function and apply the theorem to −f instead.

Example -1.25: The difficulty of satisfying a constraint

Consider the trivial problem:

min_{x≥1} x²,

which admits a unique minimizer x⋆ = 1. However, if we ignore the constraint and set the derivative to zero
we would obtain x = 0, which does not satisfy the constraint!
We can apply Theorem 2.22 only when there is no constraint. We will introduce the Lagrangian to
“remove” constraints below.
A common trick is to ignore any constraint and apply Theorem 2.22 anyway to derive a “bogus” minimizer
w⋆ . Then, we verify if w⋆ satisfies the constraint: If it does, then we actually find a minimizer for the
constrained problem! For instance, had the constraint above been x ≥ −1, we would be fine ignoring
the constraint. Needless to say, this trick only works occasionally.

Remark -1.26: Iterative algorithm

The prevailing algorithms in machine learning are iterative in nature, i.e., we will construct a sequence w₀, w₁, . . ., which will hopefully “converge” to something we are content with.

Definition -1.27: Projection to a closed set

Let C ⊆ Rd be a closed set. We define the (Euclidean) projection of a point w ∈ Rd to C as:


P_C(w) := argmin_{z∈C} ∥z − w∥₂,

i.e., points in C that are closest to the given point w. Needless to say, PC (w) = w iff w lies in C.
The projection is always unique iff C is convex (Bunt 1934; Motzkin 1935). The “only if” part remains a long-standing open problem when the space is infinite-dimensional.
Bunt, L. N. H. (1934). “Bijdrage tot de theorie de convexe puntverzamelingen”. PhD thesis. University of Groningen.


Motzkin, Theodore Samuel (1935). “Sur quelques propriétés caractéristiques des ensembles convexes”. Atti della Reale
Accademia Nazionale dei Lincei, vol. 21, no. 6, pp. 562–567.

Exercise -1.28: Projecting to nonnegative orthant

Let C = Rd+ be the nonnegative orthant. Find the formula for PC (w). (Exercise -1.21 may be handy.)

Algorithm -1.29: Algorithm of feasible direction

We remind that line 4 below is possible because our universe is a vector space! This testifies again to the (obvious?) importance of making machine learning amenable to linear algebra.

Algorithm: Algorithm of feasible direction
Input: w₀ ∈ dom f ∩ C
1 for t = 0, 1, . . . do
2   choose direction d_t   // e.g. so that lim sup_{t→∞} ⟨d_t, ∇f(w_t)⟩ > 0
3   choose step size η_t > 0
4   w_{t+1} = w_t − η_t d_t   // update
5   w_{t+1} = P_C(w_{t+1})   // optional projection step

Intuitively, Algorithm -1.29 first finds a direction dt , and then moves the iterate wt along the direction.
How far we move away from the current iterate wt is determined by the step size (assuming dt is normalized).
To motivate the selection of the direction, apply Taylor’s expansion:

f (wt+1 ) = f (wt − ηt dt ) = f (wt ) − ηt ⟨dt , ∇f (wt )⟩ + o(ηt ),

where o(ηt ) is the small order term. Clearly, if ⟨dt , ∇f (wt )⟩ > 0 and ηt is small, then f (wt+1 ) < f (wt ), i.e.
the algorithm is descending hence making progress. Typical choices for the direction include:
• Gradient Descent (GD): dt = ∇f (wt );
• Newton: dt = [∇2 f (wt )]−1 ∇f (wt );
• Stochastic Gradient Descent (SGD): dt = ξt , E(ξt ) = ∇f (wt ).
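
To make Algorithm -1.29 concrete, here is a minimal Python sketch (not part of the original notes) of the gradient-descent direction combined with the optional projection step; the toy objective and the unit-ball constraint are made up for illustration:

import numpy as np

def projected_gradient_descent(grad, project, w0, eta=0.1, iters=200):
    # Algorithm -1.29 with direction d_t = grad f(w_t) and optional projection onto C
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - eta * grad(w)     # update step (line 4)
        w = project(w)            # optional projection step (line 5)
    return w

# toy usage: minimize f(w) = 0.5*||w - c||^2 over the Euclidean unit ball
c = np.array([1.0, -2.0, 3.0])
grad = lambda w: w - c
project = lambda w: w / max(1.0, np.linalg.norm(w))  # projection onto {w : ||w||_2 <= 1}
print(projected_gradient_descent(grad, project, np.zeros(3)))  # approx. c / ||c||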

Remark -1.30: Randomness helps?

Note that if we start from a stationary point w0 , i.e. ∇f (w0 ) = 0, then gradient descent will not move, no
matter what step size we choose! This is why such points are called stationary in the first place. On the
other hand, stochastic gradient descent may still move because of the random noise added to it.

Remark -1.31: More on SGD

In machine learning we typically minimize some average loss over a training set D = *(x_i, y_i)+:

min_w (1/n) Σ_{i=1}^{n} ℓ(w; x_i, y_i) =: min_w Ê ℓ(w; x, y),    (-1.4)

where (x, y) is randomly chosen from the training set D and the empirical expectation Ê is taken w.r.t. D.
Computing the gradient obviously costs O(n) since we need to go through each sample in the training set D. On the other hand, if we take a random sample (x, y) from the training set, we can compute

ξ = ∇ℓ(w; x, y).

Obviously, Êξ equals the gradient, but computing ξ is clearly much cheaper. In practice, one usually samples a mini-batch and computes the (average) gradient over the mini-batch.
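
As a concrete (and purely illustrative) instance of the above, the following Python sketch runs mini-batch SGD on the average squared loss; the synthetic data, step size, and batch size are all assumptions of the example:

import numpy as np

def sgd(X, y, eta=0.01, batch_size=32, epochs=50, seed=0):
    # mini-batch SGD for the average loss (1/n) sum_i 0.5*(x_i^T w - y_i)^2
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # mini-batch gradient
            w -= eta * g
    return w

# toy usage: recover a planted linear model
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=1000)
print(np.allclose(sgd(X, y), w_true, atol=0.05))  # True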

Exercise -1.32: Perceptron as SGD

Recall the perceptron update: if y(⟨w, x⟩ + b) ≤ 0, then

w ← w + yx,   b ← b + y.

Construct a loss function ℓ(y(⟨w, x⟩ + b)) in (-1.4) so that the perceptron reduces to SGD (with step size, say, η ≡ 1).

Example -1.33: Descending alone does NOT guarantee convergence to stationary point

Consider the function

f(x) =  (3/4)(1 − x)² − 2(1 − x),  if x > 1;
        (3/4)(1 + x)² − 2(1 + x),  if x < −1;
        x² − 1,                    if −1 ≤ x ≤ 1,

with gradient

f′(x) =  (3/2)x + 1/2,  if x > 1;
         (3/2)x − 1/2,  if x < −1;
         2x,            if −1 ≤ x ≤ 1.

Clearly, f is convex and has a unique minimizer x⋆ = 0 (with f ⋆ = −1). It is easy to verify that

f (x) < f (y) ⇐⇒ |x| < |y| .

Start with x₀ > 1, set the step size η ≡ 1, and choose d = f′(x); then x₁ = x₀ − f′(x₀) = −(x₀ + 1)/2. By induction it is easy to show that x_{t+1} = −(x_t + (−1)^t)/2. Thus, x_t > 1 if t is even and x_t < −1 if t is odd. Moreover, |x_{t+1}| < |x_t|, implying f(x_{t+1}) < f(x_t). It is easy to show that |x_t| → 1 and f(x_t) → 0, hence the algorithm is not converging to a stationary point.
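
A few lines of Python (not in the original notes) reproduce this behaviour numerically:

# gradient of the piecewise function f above
def fprime(x):
    if x > 1:
        return 1.5 * x + 0.5
    if x < -1:
        return 1.5 * x - 0.5
    return 2.0 * x

x = 2.0                      # start with x0 > 1
for t in range(50):
    x = x - 1.0 * fprime(x)  # gradient descent with constant step size 1
print(x)                     # |x| is essentially 1; the minimizer 0 is never approached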

Remark -1.34: Step sizes

Let us mention a few ways to choose the step size η_t:

• Cauchy's rule (Cauchy 1847), where the existence of the minimizer is assumed:

  η_t ∈ argmin_{η≥0} f(w_t − η d_t).

• Curry's rule (Curry 1944), where the finiteness of η_t is assumed:

  η_t = inf{η ≥ 0 : f′(w_t − η d_t) = 0}.

• Constant rule: η_t ≡ η > 0.

• Summable rule: Σ_t η_t = ∞, Σ_t η_t² < ∞, e.g. η_t = O(1/t);

• Diminishing rule: Σ_t η_t = ∞, lim_t η_t = 0, e.g. η_t = O(1/√t).

The latter three are most common, especially for SGD.


Note that under the condition ⟨d_t, ∇f(w_t)⟩ > 0, the function h(η) := f(w_t − η d_t) is decreasing for small positive η. Therefore, Curry's rule essentially selects the smallest local minimizer of h while Cauchy's rule
selects the global minimizer of h. Needless to say, Cauchy’s rule leads to larger per-step decrease of the
function value but is also computationally more demanding. Under both Cauchy’s and Curry’s rule, we have
the orthogonality property:

⟨dt , ∇f (wt+1 )⟩ = 0,

i.e., the gradient at the next iterate wt+1 is orthogonal to the current direction vector dt . This explains the
zigzag behaviour in gradient algorithms.
Cauchy, M. Augustin-Louis (1847). “Méthode générale pour la résolution des systémes d’équations simultanées”.
Comptes rendus hebdomadaires des séances de l’Académie des sciences, vol. 25, no. 2, pp. 536–538.
Curry, Haskell B. (1944). “The Method of Steepest Descent for Non-linear Minimization Problems”. Quarterly of
Applied Mathematics, vol. 2, no. 3, pp. 258–261.

Definition -1.35: L-smoothness

For twice differentiable functions f, define its smoothness parameter (a.k.a. the Lipschitz constant of the gradient):

L = max_w ∥∇²f(w)∥_sp,

where the norm ∥·∥_sp is the spectral norm (i.e., the largest singular value, see Definition 2.13). For such functions, choosing η ∈ (0, 1/L) will guarantee convergence of gradient descent. For convex functions, the step size can be enlarged to η ∈ (0, 2/L).
With η = 2/L, gradient descent may not converge (although the function value still converges to some value): simply revisit Example -1.33.

Example -1.36: Iterates of gradient descent may NOT converge

While the function values f (wt ) of an iterative algorithm (such as gradient descent) usually converge, the
iterate wt itself may not:
• when there is no minimizer at all: f (x) = exp(x);
• when there is a unique minimizer but the function is non-differentiable: consider the convex function

  f(x) = e^{−x} if x ≤ 0,   and   f(x) = x + 1 if x > 0;

• even when the function is smooth but there are many minimizers: see the intriguing “smoothed Mexican
hat” function in (Absil et al. 2005, Fig 2.1).

Absil, P., R. Mahony, and B. Andrews (2005). “Convergence of the Iterates of Descent Methods for Analytic Cost
Functions”. SIAM Journal on Optimization, vol. 16, no. 2, pp. 531–547.

Example -1.37: Gradient descent may converge to saddle point

Consider the function f(x, y) = (1/2)x² + (1/4)y⁴ − (1/2)y², whose gradient is ∇f(x, y) = (x, y³ − y) and whose Hessian is diag(1, 3y² − 1). Clearly, there are three stationary points (0, 0), (0, 1), (0, −1), where the last two are global minimizers whereas the first is a saddle point. Take any w₀ = (x₀, 0); then w_t = (x_t, 0), hence gradient descent can only converge to (0, 0), which is a saddle point.


Definition -1.38: Lagrangian Dual

Consider the canonical optimization problem:

  min_{w∈C⊆R^d} f(w)    (-1.5)
  s.t. g(w) ≤ 0,         (-1.6)
       h(w) = 0,         (-1.7)

where f : R^d → R, g : R^d → R^n, and h : R^d → R^m. The set C is retained here to represent “simple” constraints that are more convenient to deal with directly than to put into either (-1.6) or (-1.7). (Alternatively, one may always put C = R^d.)
The (nonlinear) constraints (-1.6) are difficult to deal with. Fortunately, we can introduce the Lagrangian multipliers (a.k.a. dual variables) µ ∈ R^n_+, ν ∈ R^m to move the constraints into the Lagrangian:

  L(w; µ, ν) := f(w) + µ⊤g(w) + ν⊤h(w).

We can now rewrite the original problem (-1.5) as the following fancy min-max problem:

  p⋆ := min_{w∈C} max_{µ≥0, ν} L(w; µ, ν).    (-1.8)

(Here p stands for primal, as opposed to the dual below.) Indeed, if we choose some w ∈ C that does
not satisfy either of the two constraints in (-1.6)-(-1.7), then there exist µ ∈ Rn+ and ν ∈ Rm such that
L(w; µ, ν) → ∞. Thus, in order to minimize w.r.t. w, we are forced to choose w ∈ C to satisfy both
constraints in (-1.6)-(-1.7), in which case maxµ≥0,ν L(w; µ, ν) simply reduces to f (w) (by setting for instance
µ = 0, ν = 0). In other words, the explicit constraints in (-1.5) have become implicit in (-1.8)!
The Lagrangian dual simply swaps the order of min and max:

  d⋆ := max_{µ≥0, ν} min_{w∈C} L(w; µ, ν).    (-1.9)

(Here d stands for dual.) We emphasize the dimension of the dual variable µ is the number of inequality
constraints while that of ν is the number of equality constraints; they are different from the dimension of
the (primal) variable w. Note also that there is no constraint on w in the dual problem (-1.9) (except the
simple one w ∈ C, which we do not count), implicit or explicit!
In some sense, the Lagrangian dual is “the trick” that allows us to remove complicated constraints and
apply Theorem 2.22.

Theorem -1.39: Weak duality

For any function f : X × Y → R, we have

  min_w max_y f(w, y) ≥ max_y min_w f(w, y).

Proof. Left as exercise.

We immediately deduce that the Lagrangian dual (-1.9) is a lower bound of the original problem (-1.5),
i.e. p⋆ ≥ d⋆ . When equality is achieved, we say strong duality holds. For instance, if f , gi for all i = 1, . . . , n,
and C are convex, h is affine, and some mild regularity condition holds (e.g. Slater’s condition), then we
have strong duality for the Lagrangian in Definition -1.38.


Exercise -1.40: Does the direction matter?

Derive the Lagrangian dual of the following problem:

  max_{w∈C} f(w)
  s.t. g(w) ≥ 0,
       h(w) = 0.

(Start by reducing to (-1.5) and then see if you can derive it directly.)

Definition -1.41: KKT conditions (Karush 1939; Kuhn and Tucker 1951)

The inner minimization in the Lagrangian dual (-1.9) can usually be solved in closed form:

  X(µ, ν) := argmin_{w∈C} L(w; µ, ν).    (-1.10)

When C = R^d, applying Theorem 2.22 we obtain the stationarity condition (-1.12) below. Plugging any minimizer X(µ, ν) back into (-1.9) we obtain the dual problem:

  max_{µ∈R^n_+, ν∈R^m} L(X(µ, ν); µ, ν) ≡ − min_{µ∈R^n_+, ν∈R^m} −L(X(µ, ν); µ, ν).    (-1.11)

Compared to the original problem (-1.5) that comes with complicated constraints (-1.6)-(-1.7), the dual
problem (-1.11) has only a very simple nonnegativity constraint on µ. Thus, we may try to solve the Lagrangian
dual (-1.11) instead! In fact, we have the following necessary conditions for w to minimize the original
problem (-1.5) and for µ, ν to maximize the dual problem (-1.11):
• primal feasibility:

g(w) ≤ 0, h(w) = 0, w ∈ C;

• dual feasibility:

µ ≥ 0;

• stationarity: w = X(µ, ν); for C = R^d, we simply have

  ∇f(w) + Σ_{i=1}^{n} µ_i ∇g_i(w) + Σ_{j=1}^{m} ν_j ∇h_j(w) = 0;    (-1.12)

• complementary slackness:

⟨µ, g(w)⟩ = 0.

Note that from primal and dual feasibility we always have

∀i = 1, . . . , n, µi gi (w) ≤ 0.

Thus, complementary slackness actually implies equality in all n inequalities above.


When strong duality mentioned in Theorem -1.39 holds, the above KKT conditions are also sufficient!
Karush, W. (1939). “Minima of Functions of Several Variables with Inequalities as Side Constraints”. MA thesis.
University of Chicago.
Kuhn, H. W. and A. W. Tucker (1951). “Nonlinear programming”. In: Proceedings of the Second Berkeley Symposium
on Mathematical Statistics and Probability, pp. 481–492.


Algorithm -1.42: Dual gradient descent (Uzawa 1958)

We can apply Algorithm -1.29 to the dual problem (-1.11), equipped with the projection onto nonnegative
orthants in Exercise -1.28. After solving the dual, we may attempt to recover the primal through (-1.10).
In both steps we sidestep any complicated constraints! However, one needs to always keep the following
Alert -1.43 in mind.
Uzawa, Hirofumi (1958). “Iterative methods for concave programming”. In: Studies in linear and non-linear programming. Ed. by Kenneth J. Arrow, Leonid Hurwicz, and Hirofumi Uzawa. Stanford University Press, pp. 154–165.

Alert -1.43: Instability

A popular way to solve the primal problem (-1.5) is to solve the dual (-1.11) first. Then, with the optimal
dual variable (µ⋆ , ν ⋆ ) at hand, we “recover” the primal solution by

  X(µ⋆, ν⋆) = argmin_{w∈C} L(w; µ⋆, ν⋆).

The minimizing set X always contains all minimizers of the primal problem (-1.5). Thus, if X is a singleton,
everything is fine. Otherwise X may actually contain points that are not minimizers of the primal problem!
Fortunately, we need only verify primal feasibility in order to identify the true minimizers of the primal
(-1.5) (when strong duality holds). The problem is, in practice, our algorithm only returns one “solution”
from X, and if it fails primal feasibility, we may not be able to fetch a different “solution” from X.

Example -1.44: Instability

Let us consider the trivial problem:

  min_x 0,   s.t. x ≥ 0,

whose minimizers are R_+. We derive the Lagrangian dual:

  max_{µ≥0} min_x −µx = max_{µ≥0} { 0 if µ = 0;  −∞ otherwise }.

Clearly, we have µ⋆ = 0. Fixing µ⋆ = 0 and solving

  min_x −µ⋆x

gives us X = R which strictly contains the primal solutions R+ ! Verifying primal feasibility x ≥ 0 then
identifies true minimizers.

Example -1.45: (Kernel) ridge regression

Recall ridge regression:

  min_{w,z} (1/2)∥z∥₂² + (λ/2)∥w∥₂²
  s.t. Xw − y = z,

where we introduced an “artificial” constraint (and variable). Derive the Lagrangian dual:

  max_α min_{w,z} (1/2)∥z∥₂² + (λ/2)∥w∥₂² + α⊤(Xw − y − z).


Applying Theorem 2.22 to the inner minimization problem:

  w⋆ = −X⊤α/λ,   z⋆ = α.    (-1.13)

Plugging these back in (and simplifying) we obtain the dual:

  max_α −(1/(2λ))∥X⊤α∥₂² − α⊤y − (1/2)∥α∥₂².

Applying Theorem 2.22 again we obtain:

  α⋆ = −(XX⊤/λ + I)⁻¹ y,

and hence from (-1.13) we have

  w⋆ = X⊤(XX⊤ + λI)⁻¹ y.

From Section 2 we know the solution of ridge regression is

w⋆ = (X ⊤ X + λI)−1 X ⊤ y

Verify the two solutions of w⋆ are the same. (In fact, we have accidentally proved the Sherman-Morrison
formula.) The purpose of this “linear algebra exercise” will become evident when we discuss reproducing
kernels.
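
A quick numerical sanity check (not part of the notes) of this identity on randomly generated data:

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.3
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)      # X^T (X X^T + lam I)^{-1} y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)    # (X^T X + lam I)^{-1} X^T y
print(np.allclose(w_dual, w_primal))  # True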

Algorithm -1.46: Gradient descent ascent (GDA)

When we cannot solve the inner minimization problem in the Lagrangian dual (-1.9), an alternative is to
iteratively perform one gradient descent step on the inner minimization and then perform another gradient
ascent step on the outer maximization. This idea can be traced back to (at least) (Brown and Neumann
1950; Arrow and Hurwicz 1958).
More generally, let us consider the min-max optimization problem:

  min_{w∈X} max_{y∈Y} f(w, y).

Algorithm: Gradient descent ascent for min-max
Input: (w₀, y₀) ∈ dom f ∩ X × Y
1 s_{−1} = 0, (w̄_{−1}, ȳ_{−1}) = (0, 0)   // optional
2 for t = 0, 1, . . . do
3   choose step size η_t > 0
4   w_{t+1} = P_X[w_t − η_t ∇_w f(w_t, y_t)]   // GD on minimization
5   y_{t+1} = P_Y[y_t + η_t ∇_y f(w_t, y_t)]   // GA on maximization
6   s_t = s_{t−1} + η_t
7   (w̄_t, ȳ_t) = [s_{t−1}(w̄_{t−1}, ȳ_{t−1}) + η_t(w_t, y_t)] / s_t   // averaging: (w̄_t, ȳ_t) = Σ_{k≤t} η_k(w_k, y_k) / Σ_{k≤t} η_k

Variations of Algorithm -1.46 include (but are not limited to):


• use different step sizes on w and y;

• use wt+1 in the update on y (or vice versa);


• use stochastic gradients in both steps;
• after every update in w, perform k updates in y (or vice versa);

Brown, G. W. and J. von Neumann (1950). “Solutions of Games by Differential Equations”. In: Contributions to the
Theory of Games I. Ed. by H. W. Kuhn and A. W. Tucker. Princeton University Press, pp. 73–79.


Arrow, Kenneth J. and Leonid Hurwicz (1958). “Gradient method for concave programming I: Local results”. In: Studies in linear and non-linear programming. Ed. by Kenneth J. Arrow, Leonid Hurwicz, and Hirofumi Uzawa. Stanford University Press, pp. 117–126.
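
Below is a minimal Python sketch (purely illustrative, not from the notes) of Algorithm -1.46 on the bilinear toy problem min_{x∈[−1,1]} max_{y∈[−1,1]} xy analyzed in the next example; it shows the last iterate circling while the averaged iterate approaches the equilibrium (0, 0):

import numpy as np

def gda(eta=0.1, iters=2000, x0=0.5, y0=0.5):
    x, y = x0, y0
    s, x_bar, y_bar = 0.0, 0.0, 0.0
    for _ in range(iters):
        # simultaneous projected GD step on x and GA step on y
        x, y = np.clip(x - eta * y, -1, 1), np.clip(y + eta * x, -1, 1)
        s += eta
        x_bar += eta * (x - x_bar) / s   # running weighted averages
        y_bar += eta * (y - y_bar) / s
    return (x, y), (x_bar, y_bar)

last, avg = gda()
print(last)  # the last iterate keeps circling near the boundary, far from (0, 0)
print(avg)   # the averaged iterate is much closer to the equilibrium (0, 0)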

Example -1.47: Vanilla GDA may never converge for any step size

Let us consider the following simple problem:

  min_{x∈[−1,1]} max_{y∈[−1,1]} xy ≡ max_{y∈[−1,1]} min_{x∈[−1,1]} xy.

The left-hand side is equivalent to min_{x∈[−1,1]} |x|, hence has the unique minimizer x* = 0. Similarly, the right-hand side is equivalent to max_{y∈[−1,1]} −|y|, hence has the unique maximizer y* = 0. It follows that the optimal saddle point (equilibrium) is x* = y* = 0.
If we run vanilla (projected) GDA with step size η_t ≥ 0, then

  x_{t+1} = [x_t − η_t y_t]_{−1}^{1},
  y_{t+1} = [y_t + η_t x_t]_{−1}^{1},

where [z]_{−1}^{1} := (z ∧ 1) ∨ (−1) is the projection of z onto the interval [−1, 1]. Thus, we have

  x_{t+1}² + y_{t+1}² ≥ 1 ∧ [(x_t − η_t y_t)² + (y_t + η_t x_t)²] = 1 ∧ [(1 + η_t²)(x_t² + y_t²)] ≥ 1 ∧ (x_t² + y_t²).

Therefore, if we do not initialize at the equilibrium x* = y* = 0, then the norm of (x_t, y_t) will always be lower bounded by 1 ∧ ∥(x₀, y₀)∥ > 0 = ∥(x*, y*)∥. In other words, (x_t, y_t) will not converge to (x*, y*).


In fact, we can generalize the above failure to any bilinear matrix game:

  min_{w∈C} max_{y∈D} w⊤Ay,

where C and D are closed sets and A ∈ Rd×d is nonsingular. Suppose there exists ϵ > 0 so that any
equilibrium (w∗ , y∗ ) is ϵ away from the boundary: dist(w∗ , ∂C) ∧ dist(y∗ , ∂D) > ϵ, then we know Ay∗ = 0
hence y∗ = 0 and similarly w∗ = 0. It follows that vanilla projected gradient

wt+1 = PC (wt − ηt Ayt )


yt+1 = PD (yt + ηt A⊤ wt )

will not converge to any equilibrium. Indeed, for vanilla gradient to converge, the projection will eventually
be vacuous (otherwise we are ϵ away), but then for the equilibrium (w∗ = 0, y∗ = 0):
  ∥w_{t+1}∥₂² + ∥y_{t+1}∥₂² = ∥w_t∥₂² + ∥y_t∥₂² + η_t² (∥A⊤w_t∥₂² + ∥Ay_t∥₂²) ≥ (1 + η_t² σ_min²(A))(∥w_t∥₂² + ∥y_t∥₂²),

which is strictly lower bounded by ∥w0 ∥22 + ∥y0 ∥22 if the starting point (w0 , y0 ) does not coincide with the
equilibrium.

Algorithm -1.48: Alternating

The following alternating algorithm is often applied in practice for solving the joint minimization problem:

  min_{w∈X} min_{y∈Y} f(w, y).


Algorithm: Alternating minimization for min-min
Input: (w₀, y₀) ∈ dom f ∩ X × Y
1 for t = 0, 1, . . . do
2   w_{t+1} = argmin_{w∈X} f(w, y_t)   // exact minimization in w
3   y_{t+1} = argmin_{y∈Y} f(w_{t+1}, y)   // exact minimization in y

It is tempting to adapt alternating to solve min-max problems. The resulting algorithm, when compared
to dual gradient (see Algorithm -1.42), is more aggressive in optimizing y: dual gradient only takes a gradient
ascent step while the latter finds the exact maximizer. Surprisingly, being more aggressive (and spending
more effort) here actually hurts (sometimes).

Example -1.49: When to expect alternating to work?

Let us consider the simple problem

  min_{x∈[−1,1]} min_{y∈[−1,1]} xy,

where two minimizers {(1, −1), (−1, 1)} exist. Let us apply the alternating Algorithm -1.48. Choosing any x > 0, we obtain

  y ← −1,  x ← 1,  y ← −1,

which converges to an optimal solution after 1 iteration!


To compare, let us consider the “twin” problem:

  min_{x∈[−1,1]} max_{y∈[−1,1]} xy,

where a unique equilibrium (x*, y*) = (0, 0) exists. Choose any x > 0 (the analysis for x < 0 is similar). Let us perform the inner maximization exactly to obtain y = 1. With y = 1 in mind, we perform the outer minimization exactly to obtain x = −1. Keep iterating, and we obtain the cycle

  y ← −1,  x ← 1,  y ← 1,  x ← −1,  y ← −1,  . . . ,

which is bounded away from the equilibrium! Of course, if we perform averaging along the trajectory we again
obtain convergence. Dual gradient essentially avoids the oscillating behaviour of alternating by (implicitly)
averaging.


0 Statistical Learning Basics


Goal

Maximum Likelihood, Prior, Posterior, MAP, Bayesian LR

Alert 0.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Definition 0.2: Distribution and density

Recall that the cumulative distribution function (cdf) of a random vector X ∈ Rd is defined as:

F (x) := Pr(X ≤ x),

and its probability density function (pdf) is


  p(x) := ∂^d F / (∂x₁ · · · ∂x_d)(x),   or equivalently   F(x) = ∫_{−∞}^{x₁} · · · ∫_{−∞}^{x_d} p(x) dx.

Clearly, each cdf F : Rd → [0, 1] is


• monotonically increasing in each of its inputs;
• right continuous in each of its inputs;

• limx→∞ F (x) = 1 and limx→−∞ F (x) = 0.


On the other hand, each pdf p : R^d → R_+
• integrates to 1, i.e. ∫ p(x) dx = 1.

(The cdf and pdf of a discrete random variable can be defined similarly and are omitted.)

Remark 0.3: Change-of-variable

Let T : R^d → R^d be a diffeomorphism (differentiable bijection with differentiable inverse). Let X = T(Z); then we have the change-of-variable formula for the pdfs:

  p(x) dx ≈ q(z) dz,   i.e.   p(x) = q(T⁻¹(x)) |det (dT⁻¹/dx)(x)|,
                                q(z) = p(T(z)) |det (dT/dz)(z)|,

where det denotes the determinant.
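
As a hedged numerical illustration (not from the notes), the following Python snippet checks the one-dimensional change-of-variable formula for the made-up transform T(z) = exp(z) with Z standard normal:

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1_000_000)
x = np.exp(z)                                            # X = T(Z)

q = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)     # pdf of Z
p = lambda x: q(np.log(x)) * np.abs(1.0 / x)             # q(T^{-1}(x)) |det dT^{-1}/dx|

# compare the formula with an empirical histogram density of X
hist, edges = np.histogram(x, bins=200, range=(0.1, 5.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - p(centers))))  # small: the two densities agree up to histogram noise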

Definition 0.4: Marginal, conditional, and independence

Let X = (X1 , X2 ) be a random vector with pdf p(x) = p(x1 , x2 ). We say X1 is a marginal of X with pdf
  p₁(x₁) = ∫_{−∞}^{∞} p(x₁, x₂) dx₂,


where we marginalize over X2 by integrating it out. Similarly X2 is a marginal of X with pdf


  p₂(x₂) = ∫_{−∞}^{∞} p(x₁, x₂) dx₁.

We then define the conditional X1 |X2 with density:

p1|2 (x1 |x2 ) = p(x1 , x2 )/p2 (x2 ),

where the value of p1|2 is arbitrary if p2 (x2 ) = 0 (usually immaterial). Similarly we may define the conditional
X2 |X1 . It is obvious from our definition that

p(x1 , x2 ) = p1 (x1 )p2|1 (x2 |x1 ) = p2 (x2 )p1|2 (x1 |x2 ),

namely the joint density p can be factorized into the product of marginal p1 and conditional p2|1 . Usually,
we omit all subscripts in p when referring to the marginal or conditional whenever the meaning is obvious
from context.
Iterating the above construction, we obtain the famous chain rule:

  p(x₁, x₂, . . . , x_d) = ∏_{j=1}^{d} p(x_j | x₁, . . . , x_{j−1}),

with the convention that the first factor is simply p(x₁). We say that the random vectors X₁, X₂, . . . , X_d are independent if

  p(x₁, x₂, . . . , x_d) = ∏_{j=1}^{d} p(x_j).

All of our constructions above can be done with cdfs as well (with serious complications for the conditional, though). In particular, we have the Bayes rule:

  Pr(A|B) = Pr(A, B) / Pr(B) = Pr(B|A) Pr(A) / [Pr(B, A) + Pr(B, ¬A)].

Definition 0.5: Mean, variance and covariance

Let X = (X₁, . . . , X_d) be a random (column) vector. We define its mean (vector) as

  µ = EX,   where   µ_j = ∫ x_j · p(x_j) dx_j,

and its covariance (matrix) as

  Σ = E(X − µ)(X − µ)⊤,   where   Σ_{ij} = ∫ (x_i − µ_i)(x_j − µ_j) · p(x_i, x_j) dx_i dx_j.

By definition Σ is symmetric (Σ_{ij} = Σ_{ji}) and positive semidefinite (all eigenvalues are nonnegative). The j-th diagonal entry of the covariance, σ_j² := Σ_{jj}, is called the variance of X_j.

Exercise 0.6: Covariance

Prove the following equivalent formula for the covariance:


• Σ = EXX⊤ − µµ⊤ ;


• Σ = (1/2) E(X − X′)(X − X′)⊤, where X′ is iid (independent and identically distributed) with X.

Suppose X has mean µ and covariance Σ. Find the mean and covariance of AX + b, where A, b are
deterministic.

Example 0.7: Multivariate Gaussian

The pdf of the multivariate Gaussian distribution (a.k.a. normal distribution) is:

  p(x) = (2π)^{−d/2} [det(Σ)]^{−1/2} exp(−(1/2)(x − µ)⊤Σ⁻¹(x − µ)),

where d is the dimension and det denotes the determinant of a matrix. We typically use the notation
X ∼ N (µ, Σ), where µ = EX is its mean and Σ = E(X − µ)(X − µ)⊤ is its covariance.
An important property of the multivariate Gaussian distribution is its equivariance under affine trans-
formations:

X ∼ N (µ, Σ) =⇒ AX + b ∼ N (Aµ + b, AΣA⊤ ).

(This property actually characterizes the multivariate Gaussian distribution.)
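
The affine equivariance above is easy to check numerically; here is a small numpy sketch (the particular µ, Σ, A, b are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
mu, Sigma = np.array([1.0, -2.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
A, b = np.array([[1.0, 2.0], [0.0, 3.0]]), np.array([0.5, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
Y = X @ A.T + b                                   # samples of AX + b
print(Y.mean(axis=0), A @ mu + b)                 # sample mean vs A mu + b
print(np.cov(Y.T), A @ Sigma @ A.T)               # sample covariance vs A Sigma A^T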

Exercise 0.8: Marginal and conditional of multivariate Gaussian


     
X1 µ1 Σ Σ12
Let ∼N , 11 . Prove the following results:
X2 µ2 Σ21 Σ22

X1 ∼ N (µ1 , Σ11 ), X2 |X1 ∼ N (µ2 + Σ21 Σ−1 −1


11 (X1 − µ1 ), Σ22 − Σ21 Σ11 Σ12 );
X2 ∼ N (µ2 , Σ22 ), X1 |X2 ∼ N (µ1 + Σ12 Σ−1 −1
22 (X2 − µ2 ), Σ11 − Σ12 Σ22 Σ21 ).

Remark 0.9: Bias-variance trade-off

Suppose we are interested in predicting a random (scalar) quantity Y based on some feature vector (a.k.a. covariate) X, using the function f̂. Here the hat notation suggests f̂ may depend on other random quantities, such as samples from a training set. Often, we use the squared loss to evaluate our prediction:

  E(f̂(X) − Y)² = E[f̂(X) − Ef̂(X) + Ef̂(X) − E(Y|X) + E(Y|X) − Y]²
               = E[f̂(X) − Ef̂(X)]² + E[Ef̂(X) − E(Y|X)]² + E[E(Y|X) − Y]²,
                  (variance)          (bias²)               (difficulty)

where recall that E(Y |X) is the so-called regression function. The last term indicates the difficulty of our
problem and cannot be reduced by our choice of f̂. The first two terms reveal an inherent trade-off in designing f̂:
• the variance term reflects the fluctuation incurred by training on some random training set. Typically,
a less flexible fˆ will incur a smaller variance (e.g. constant functions have 0 variance);

• the (squared) bias term reflects the mismatch of our choice of fˆ and the optimal regression function.
Typically, a very flexible fˆ will incur a smaller bias (e.g. when fˆ can model any function).
The major goal of much of ML is to strike an appropriate balance between the first two terms.


Definition 0.10: Maximum likelihood estimation (MLE)

Suppose we have a dataset D = *x₁, . . . , x_n+, where each sample x_i (is assumed to) follow some pdf p(x|θ) with unknown parameter θ. We define the likelihood of a parameter θ given the dataset D as:

  L(θ) = L(θ; D) := p(D|θ) = ∏_{i=1}^{n} p(x_i|θ),

where in the last equality we assume our data is iid. A popular way to find an estimate of the parameter θ is to maximize the likelihood over some parameter space Θ:

  θ_MLE := argmax_{θ∈Θ} L(θ).

Equivalently, by taking the log and negating, we minimize the negative log-likelihood (NLL):

  θ_MLE := argmin_{θ∈Θ} Σ_{i=1}^{n} − log p(x_i|θ).

We remark that MLE is applicable only when we can evaluate the likelihood function efficiently, which
turns out to be not the case in many settings and we will study alternative algorithms.

Example 0.11: Sample mean and covariance as MLE

Let x₁, . . . , x_n be iid samples from the multivariate Gaussian distribution N(µ, Σ), where the parameters µ and Σ are to be found. We apply maximum likelihood:

  µ̂_MLE := argmin_µ (1/2) Σ_{i=1}^{n} (x_i − µ)⊤Σ⁻¹(x_i − µ).

Applying Theorem 2.22 we obtain the sample mean:

  µ̂_MLE = (1/n) Σ_{i=1}^{n} x_i =: Êx,

where the hat expectation Ê is w.r.t. the given data.


Similarly we can show

  Σ̂_MLE := argmin_Σ n log det Σ + Σ_{i=1}^{n} (x_i − µ)⊤Σ⁻¹(x_i − µ),

or equivalently

  Σ̂_MLE⁻¹ := argmin_S −n log det S + Σ_{i=1}^{n} (x_i − µ)⊤S(x_i − µ).

Applying Theorem 2.22 (with the fact that the gradient of log det S is S⁻¹), we obtain:

  Σ̂_MLE = (1/n) Σ_{i=1}^{n} (x_i − µ)(x_i − µ)⊤ = Êxx⊤ − (Êx)(Êx)⊤,

where we plug in the ML estimate µ̂_MLE of µ if it is not known.
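
A small numpy sketch (not part of the notes) of these MLE formulas on synthetic Gaussian data:

import numpy as np

rng = np.random.default_rng(0)
mu, Sigma = np.array([1.0, -1.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
x = rng.multivariate_normal(mu, Sigma, size=100_000)

mu_mle = x.mean(axis=0)                               # (1/n) sum_i x_i
Sigma_mle = (x - mu_mle).T @ (x - mu_mle) / len(x)    # (1/n) sum_i (x_i - mu)(x_i - mu)^T
print(mu_mle, Sigma_mle)                              # close to the true mu and Sigma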


Exercise 0.12: Bias and variance of sample mean and covariance

Calculate the following bias and variance:

E[µ − µ̂MLE ] =
E[µ − µ̂MLE ][µ − µ̂MLE ]⊤ =
E[Σ − Σ̂MLE ] =

Definition 0.13: f -divergence (Csiszár 1963; Ali and Silvey 1966)

Let f : R_+ → R be a strictly convex function (see Definition -1.9) with f(1) = 0. We define the following f-divergence to measure the closeness of two pdfs p and q:

  D_f(p∥q) := ∫ f(p(x)/q(x)) · q(x) dx,

where we assume q(x) = 0 =⇒ p(x) = 0 (otherwise we put the divergence to ∞).


Csiszár, Imre (1963). “Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität
von Markoffschen Ketten”. A Magyar Tudományos Akadémia Matematikai Kutató Intézetének közleményei, vol. 8,
pp. 85–108.
Ali, S. M. and S. D. Silvey (1966). “A General Class of Coefficients of Divergence of One Distribution from Another”.
Journal of the Royal Statistical Society. Series B (Methodological), vol. 28, no. 1, pp. 131–142.

Exercise 0.14: Properties of f -divergence

Prove the following:


• Df (p∥q) ≥ 0, with 0 attained iff p = q;
• Df +g = Df + Dg and Dsf = sDf for s > 0;

• Let g(t) = f (t) + s(t − 1) for any s. Then, Dg = Df ;


• If p(x) = 0 ⇐⇒ q(x) = 0, then D_f(p∥q) = D_{f⋄}(q∥p), where f⋄(t) := t · f(1/t);
• f ⋄ is (strictly) convex, f ⋄ (1) = 0 and (f ⋄ )⋄ = f ;
The second last result indicates that f -divergences are not usually symmetric. However, we can always
symmetrize them by the transformation: f ← f + f ⋄ .

Example 0.15: KL and LK

Let f(t) = t log t; then we obtain the Kullback-Leibler (KL) divergence:

  KL(p∥q) = ∫ p(x) log(p(x)/q(x)) dx.

Reversing the inputs we obtain the reverse KL divergence:

  LK(p∥q) := KL(q∥p).

Verify for yourself that the underlying function is f = − log for the reverse KL.


Definition 0.16: Entropy, conditional entropy, cross-entropy, and mutual information

We define the entropy of a random vector X with pdf p as:

  H(X) := E[− log p(X)] = −∫ p(x) log p(x) dx,

the conditional entropy between X and Z (with pdf q) as:

  H(X|Z) := E[− log p(X|Z)] = −∫ p(x, z) log p(x|z) dx dz,

and the cross-entropy between X and Z as:

  †(X, Z) := E[− log q(X)] = −∫ p(x) log q(x) dx.

Finally, we define the mutual information between X and Z as:

  I(X, Z) := KL(p(x, z) ∥ p(x)q(z)) = ∫ p(x, z) log [p(x, z) / (p(x)q(z))] dx dz.

Exercise 0.17: Information theory

Verify the following:

H(X, Z) = H(Z) + H(X|Z)


†(X, Z) = H(X) + KL(X∥Z) = H(X) + LK(Z∥X)
I(X, Z) = H(X) − H(X|Z)
I(X, Z) ≥ 0, with equality iff X independent of Z
KL(p(x, z)∥q(x, z)) = KL(p(z)∥q(z)) + E[KL(p(x|z)∥q(x|z))].

All of the above can obviously be iterated to yield formula for more than two random vectors.

Exercise 0.18: Multivariate Gaussian

Compute
• the entropy of the multivariate Gaussian N (µ, Σ);

• the KL divergence between two multivariate Gaussians N (µ1 , Σ1 ) and N (µ2 , Σ2 ).

Example 0.19: More divergences, more fun

Derive the formula for the following f-divergences:

• χ²-divergence: f(t) = (t − 1)²;
• Hellinger divergence: f(t) = (√t − 1)²;
• total variation: f(t) = |t − 1|;
• Jensen-Shannon divergence: f(t) = t log t − (t + 1) log(t + 1) + log 4;
• Rényi divergence (Rényi 1961): f(t) = (t^α − 1)/(α − 1) for some α > 0 (for α = 1 we take the limit and obtain ?).


Which of the above are symmetric?


Rényi, Alfréd (1961). “On Measures of Entropy and Information”. In: Proceedings of the Fourth Berkeley Symposium
on Mathematical Statistics and Probability, pp. 547–561.

Remark 0.20: MLE = KL minimization

Let us define the empirical “pdf” based on a dataset D = *x₁, . . . , x_n+:

  p̂(x) = (1/n) Σ_{i=1}^{n} δ_{x_i},

where δ_x is the “illegal” delta mass concentrated at x. Then, we claim that

  θ_MLE = argmin_{θ∈Θ} KL(p̂ ∥ p(x|θ)).

Indeed, we have

  KL(p̂ ∥ p(x|θ)) = ∫ [log p̂(x) − log p(x|θ)] p̂(x) dx = C + (1/n) Σ_{i=1}^{n} − log p(x_i|θ),

where C is a constant that does not depend on θ.

Exercise 0.21: Is the flood gate open?

Now obviously you are thinking to replace the KL divergence with any f -divergence, hoping to obtain some
generalization of MLE. Try and explain any difficulty you may run into. (We will revisit this in the GAN
lecture.)

Exercise 0.22: Why KL is so special

To appreciate the uniqueness of the KL divergence, prove the following:


Up to a multiplicative constant, log is the only continuous function satisfying f(st) = f(s) + f(t).

Remark 0.23: Information theory for ML

A beautiful theory that connects information theory, Bayes risk, convexity and proper losses is available in
(Grünwald and Dawid 2004; Reid and Williamson 2011) and the references therein.
Grünwald, Peter D. and Alexander Philip Dawid (2004). “Game theory, maximum entropy, minimum discrepancy
and robust Bayesian decision theory”. Annals of Statistics, vol. 32, no. 4, pp. 1367–1433.
Reid, Mark D. and Robert C. Williamson (2011). “Information, Divergence and Risk for Binary Experiments”. Journal
of Machine Learning Research, vol. 12, no. 22, pp. 731–817.

Example 0.24: Linear regression as MLE

Let us now give linear regression a probabilistic interpretation, by making the following assumption:

Y = x⊤ w + ϵ,

where ϵ ∼ N(0, σ²). Namely, the response is a linear function of the feature vector x, corrupted by some zero-mean Gaussian noise, or in fancy notation: Y ∼ N(x⊤w, σ²). Given a dataset D = *(x₁, y₁), . . . , (x_n, y_n)+ (where we assume the feature vectors x_i are fixed and deterministic, unlike the responses y_i which are random), the likelihood function of the parameter w is:

  L(w; D) = p(D|w) = ∏_{i=1}^{n} (2πσ²)^{−1/2} exp(−(y_i − x_i⊤w)²/(2σ²)).

Maximizing the likelihood (i.e., minimizing the NLL) gives

  ŵ_MLE = argmin_w (n/2) log σ² + (1/(2σ²)) Σ_{i=1}^{n} (y_i − x_i⊤w)²,

which is exactly the ordinary linear regression.


Moreover, we can now also obtain an MLE of the noise variance σ² by solving:

  σ̂²_MLE = argmin_{σ²} (n/2) log σ² + (1/(2σ²)) Σ_{i=1}^{n} (y_i − x_i⊤ŵ_MLE)²
          = (1/n) Σ_{i=1}^{n} (y_i − x_i⊤ŵ_MLE)²,

which is nothing but the average training error.
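
A compact numpy sketch (illustrative only) of this MLE: least squares for w, followed by the average squared residual for σ²:

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true, sigma_true = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ w_true + sigma_true * rng.normal(size=n)

w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)    # argmin_w sum_i (y_i - x_i^T w)^2
sigma2_mle = np.mean((y - X @ w_mle) ** 2)       # average training error
print(w_mle, np.sqrt(sigma2_mle))                # close to w_true and sigma_true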

Definition 0.25: Prior

In a full Bayesian approach, we also assume the parameter θ is random and follows a prior pdf p(θ). Ideally,
we choose the prior p(θ) to encode our a priori knowledge of the problem at hand. (Regrettably, in practice
computational convenience often dominates the choice of the prior.)

Definition 0.26: Posterior

Suppose we have chosen a prior pdf p(θ) for our parameter of interest θ. After observing some data D, our
belief on the probable values of θ will have changed, so we obtain the posterior:

  p(θ|D) = p(D|θ)p(θ) / p(D) = p(D|θ)p(θ) / ∫ p(D|θ)p(θ) dθ,

where recall that p(D|θ) is exactly the likelihood of θ given the data D. Note that computing the denominator
may be difficult since it involves an integral that may not be tractable.

Example 0.27: Bayesian linear regression

Let us consider linear regression (with vector-valued response y ∈ R^m and matrix-valued covariate X ∈ R^{m×d}):

  Y = Xw + ϵ,

where the noise ϵ ∼ N_m(µ, S) and we impose a Gaussian prior on the weights w ∼ N_d(µ₀, S₀). As usual we assume ϵ is independent of w. Given a dataset D = *(X₁, y₁), . . . , (X_n, y_n)+, we compute the posterior:

  p(w|D) ∝ p(w) p(D|w)
         ∝ exp(−(w − µ₀)⊤S₀⁻¹(w − µ₀)/2) · ∏_{i=1}^{n} exp(−(y_i − X_i w − µ)⊤S⁻¹(y_i − X_i w − µ)/2),
i.e. p(w|D) = N(µ_n, S_n),


where (by completing the square) we have

  S_n⁻¹ = S₀⁻¹ + Σ_{i=1}^{n} X_i⊤S⁻¹X_i,
  µ_n = S_n (S₀⁻¹µ₀ + Σ_{i=1}^{n} X_i⊤S⁻¹(y_i − µ)).

The posterior covariance S_n contains both the prior covariance S₀ and the data X_i. As n → ∞, the data dominate the prior. A similar remark applies to the posterior mean µ_n.
We can also derive the predictive distribution on a new input X:

  p(y|X, D) = ∫ p(y|X, w) p(w|D) dw = N(Xµ_n + µ, XS_nX⊤ + S).

The covariance XS_nX⊤ + S reflects our uncertainty about the prediction at X.
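
For concreteness, here is a minimal numpy sketch (not from the notes) of the posterior update above in the scalar-response case, i.e. each X_i is a row vector x_i⊤ (m = 1), with µ = 0 and S = σ²; all the particular numbers are made up:

import numpy as np

rng = np.random.default_rng(0)
d, n, sigma2 = 3, 200, 0.25
mu0, S0 = np.zeros(d), np.eye(d)                 # prior  w ~ N(mu0, S0)
w_true = np.array([1.0, -1.0, 2.0])
X = rng.normal(size=(n, d))
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

Sn_inv = np.linalg.inv(S0) + X.T @ X / sigma2    # S_n^{-1} = S_0^{-1} + sum_i x_i x_i^T / sigma^2
Sn = np.linalg.inv(Sn_inv)
mun = Sn @ (np.linalg.inv(S0) @ mu0 + X.T @ y / sigma2)   # posterior mean mu_n
print(mun)   # close to w_true: with this much data, the data dominate the prior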

Theorem 0.28: Bayes classifier

Consider the classification problem with random variables X ∈ R^d and Y ∈ [c] := {1, . . . , c}. The optimal (Bayes) classification rule, defined as

  argmin_{h : R^d → [c]} Pr(Y ≠ h(X)),

admits the closed-form formula:

  h⋆(x) = argmax_{k∈[c]} Pr(Y = k | X = x)    (0.1)
        = argmax_{k∈[c]} p(X = x | Y = k) · Pr(Y = k),
                         (likelihood)         (prior)

where ties can be broken arbitrarily.

Proof. Let h(x) be any classification rule. Its classification error is:

  Pr(h(X) ≠ Y) = 1 − Pr(h(X) = Y) = 1 − E[Pr(h(X) = Y | X)].

Thus, conditioned on X, to minimize the error we should maximize Pr(h(X) = Y | X), leading to h(x) = h⋆(x).
To understand the second formula, we resort to the definition of conditional expectation: for any set A,

  ∫_A Pr(Y = k | X = x) p(x) dx = Pr(X ∈ A, Y = k) = Pr(X ∈ A | Y = k) Pr(Y = k) = ∫_A p(X = x | Y = k) Pr(Y = k) dx.

Since the set A is arbitrary, we must have

  Pr(Y = k | X = x) = p(X = x | Y = k) Pr(Y = k) / p(X = x).

(We assume the marginal density p(x) and the class-specific densities p(x | Y = k) exist.)


In practice, we do not know the distribution of (X, Y ), hence we cannot compute the optimal Bayes
classification rule. One natural idea is to estimate the pdf of (X, Y ) and then plug into (0.1). This approach
however does not scale to high dimensions and we will see direct methods that avoid estimating the pdf.
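
To illustrate (0.1), here is a tiny Python sketch (not part of the notes) of the Bayes rule when the class-conditional densities are assumed to be known one-dimensional Gaussians; the priors, means, and standard deviations are made up:

import numpy as np

priors = np.array([0.3, 0.7])                           # Pr(Y = k)
means, stds = np.array([-1.0, 2.0]), np.array([1.0, 1.5])

def bayes_predict(x):
    # likelihood p(x | Y = k) times prior Pr(Y = k), then argmax over classes k
    lik = np.exp(-0.5 * ((x - means) / stds) ** 2) / (np.sqrt(2 * np.pi) * stds)
    return int(np.argmax(lik * priors))

print([bayes_predict(x) for x in (-2.0, 0.0, 3.0)])     # predicted class indices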
It is clear that the Bayes error (achieved by the Bayes classification rule) is:

  E[1 − max_{k∈[c]} Pr(Y = k | X)].

In particular, for c = 2, we have

  Bayes error = E[min{Pr(Y = 1 | X), Pr(Y = −1 | X)}].
Exercise 0.29: Cost-sensitive classification (Elkan 2001)

Cost-sensitive classification refers to the setting where making certain mistakes is more expensive than
making some other ones. Formally, we suffer cost cij when we predict class i while the true class is j. We
may of course assume cii ≡ 0. Derive the optimal Bayes rule.
Elkan, Charles (2001). “The Foundations of Cost-Sensitive Learning”. In: Proceedings of the 17th International Joint
Conference on Artificial Intelligence (IJCAI), pp. 973–978.

Exercise 0.30: Bayes estimator

Let ℓ : Ŷ × Y → R+ be a loss function that compares our prediction ŷ with the groundtruth y. We define
the Bayes estimator as:

  min_{f : X → Ŷ} E ℓ(f(X), Y).

Can you derive the formula for the Bayes estimator (using conditional expectation)?

Definition 0.31: Maximum a posteriori (MAP)

Another popular parameter estimation algorithm is MAP, which simply maximizes the posterior:

  θ_MAP := argmax_{θ∈Θ} p(θ|D)
         = argmin_{θ∈Θ} [− log p(D|θ)] + [− log p(θ)],

where the first term is the negative log-likelihood and the second acts as a prior-induced regularization.
A strong (i.e. sharply concentrated, i.e. small-variance) prior helps reduce the variance of our estimator, at the potential cost of increasing our bias (see Definition 0.10) if our a priori belief is mis-specified, such as stereotypes.
MAP is not a Bayes estimator, since we cannot find an underlying loss ℓ for it.

Example 0.32: Ridge regression as MAP

Continuing Example 0.24 let us now choose a Gaussian prior w ∼ N (0, (1/λ) I). Then,

ŵMAP = argmin_w  (n/2) log σ² + (1/(2σ²)) Σ_{i=1}^n (yi − xi⊤ w)² + (λ/2) ∥w∥22 − (d/2) log λ,


which is exactly equivalent to ridge regression. Note that the larger the regularization constant λ is, the
smaller the variance of the prior is. In other words, larger regularization means more concentrated (more confident) prior
information.
Needless to say, if we choose a different prior on the weights, MAP would yield a different regularized
linear regression formulation. For instance, with the Laplacian prior (which is more peaked than the Gaussian
around the mode), we obtain the celebrated Lasso (Tibshirani 1996):
min_w  (1/(2σ²)) ∥Xw − y∥22 + λ∥w∥1 .

Tibshirani, Robert (1996). “Regression Shrinkage and Selection via the Lasso”. Journal of the Royal Statistical Society:
Series B, vol. 58, no. 1, pp. 267–288.

Theorem 0.33: Bayes rule arose from optimization (e.g. Zellner 1988)
Let p(θ) be a prior pdf of our parameter θ, p(D|θ) the pdf of data D given θ, and p(D) = ∫ p(θ)p(D|θ) dθ
the data pdf. Then,

p(θ|D) = argmin_{q(θ)} KL( p(D)q(θ) ∥ p(θ)p(D|θ) ),        (0.2)

where the minimization is over all pdf q(θ).

Proof. KL is nonnegative while the posterior p(θ|D) already achieves 0. In fact, only the posterior can achieve
0, see Exercise 0.14.

This result may seem trivial at first sight. However, it motivates a number of important extensions:
• If we restrict the minimization to a subclass P of pdfs, then we obtain some KL projection of the
posterior p(θ|D) to the class P. This is essentially the so-called variational inference.
• If we replace the KL divergence with any other f -divergence, the same result still holds. This opens a
whole range of possibilities when we can only optimize over a subclass P of pdfs.

• The celebrated expectation-maximization (EM) algorithm also follows from (0.2)!


We will revisit each of the above extensions later in the course.
Zellner, Arnold (1988). “Optimal Information Processing and Bayes’s Theorem”. The American Statistician, vol. 42,
no. 4, pp. 278–280.


1 Perceptron
Goal

Understand the celebrated perceptron algorithm for online binary classification.

Alert 1.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Definition 1.2: Binary classification

Given a set of n known example pairs *(xi , yi ) : i = 1, 2, . . . , n+, where xi ∈ Rd and yi ∈ {±1}, we want to
learn a (binary) “classification rule” h : Rd → {±1}, so that

h(x) = y

on most unseen (future) examples (x, y). Throughout we will call xi the feature vector of the i-th example,
and yi the (binary) label of the i-th example. Together, the known example pairs *(xi , yi ) : i = 1, 2, . . . , n+
are called the training set, with n being its size and d being its dimension. The unseen future example (x, y)
will be called the test example. If we have a set of test examples, together they will be called a test set.

Alert 1.3: Notations

We use boldface letters, e.g. x, for a vector of appropriate size. Subscripts are used for two purposes: (the
bold) xi denotes a vector that may have nothing to do with x, while (the non-bold) xi denotes the i-th
coordinate of x. The j-th coordinate of xi will be denoted as xji . We use 1 and 0 to denote a vector of all
1s and 0s of appropriate size (which should be clear from context), respectively.
By default, all vectors are column vectors and we use x⊤ to denote the transpose (i.e. a row vector) of
a column vector x.

Definition 1.4: Functions and sets are equivalent

A binary classification rule h : Rd → {±1} can be identified with a set P ⊆ Rd and its complement
N = Rd \ P , where h(x) = 1 ⇐⇒ x ∈ P .

Exercise 1.5: Multiclass rules

Let h : Rd → {1, 2, . . . , c}, where c ≥ 2 is the number of classes. How do we identify the function h with
sets?

Remark 1.6: Memorization does NOT work... Or does it?

The challenge of binary classification lies in two aspects:


• on a test example (x, y), we actually only have access to x but not the label y. It is our job to predict
y, hopefully correctly most of the time.
• the test example x can be (very) different from any of the training examples {xi : i = 1, . . . , n}. So we
can not expect naive memorization to work.


Essentially, we need a (principled?) way to interpolate from the training set (where labels are known) and
hopefully generalize to the test set (where labels need to be predicted). For this to be possible, we need

• the training set to be “indicative” of what the test set look like, and/or
• a proper baseline (competitor) to compare against.

Definition 1.7: Statistical learning

We assume the training examples (xi , yi ) and the test example (x, y) are drawn independently and identically
(i.i.d.) from an unknown distribution P:
i.i.d.
(x1 , y1 ), . . . , (xn , yn ), (x, y) ∼ P,

in which case we usually use capital letters (Xi , Yi ) and (X, Y ) to emphasize the random nature of these
quantities. Our goal is then to find a classification rule h : Rd → {±1} so that the classification error

P(h(X) ̸= Y )

is as small as possible. Put in optimization terms, we are interested in solving the following (abstract)
optimization problem:

min P(h(X) ̸= Y ). (1.1)


h:Rd →{±1}

We will shortly see that if P is known, then the classification problem (1.1) admits a closed-form solution
known as the Bayes classifier. In the more realistic case where P is not known, our hope is that the training
set {(Xi , Yi ) : i = 1, . . . , n} may provide enough information about P, at least when n is sufficiently large,
which is basically the familiar law of large numbers in (serious) disguise.

Remark 1.8: i.i.d., seriously?

Immediate objections to the i.i.d. assumption in statistical learning include (but are not limited to):
• the training examples are hardly i.i.d..
• the test example may follow a different distribution than the training set, known as domain shift.

Reasons to support the i.i.d. assumption include (but are not limited to):
• it is a simple, clean mathematical abstraction that allows us to take a first step in understanding and
solving the binary classification problem.
• for many real problems, the i.i.d. assumption is not terribly off. In fact, it is a reasonably successful
approximation.
• there exist more complicated ways to alleviate the i.i.d. assumption, usually obtained by refining
results under the i.i.d. assumption.
We will take a more pragmatic viewpoint: we have to start from somewhere and the i.i.d. assumption
seems to be a good balance between what we can analyze and what we want to achieve.


Definition 1.9: Online learning

A different strategy, as opposed to statistical learning, is not to put any assumption whatsoever on the data,
but on what we want to compare against: Given a collection of existing classification rules G = {gk : k ∈ K},
we want to construct a classification rule h that is competitive against the “best” g ∗ ∈ G, in terms of the
number of mistakes:
n
X
M(h) := Jh(xi ) ̸= yi K.
i=1

The “subtlety” is that even the best g ∗ ∈ G may not perform well on the data D = *(xi , yi ) : i = 1, . . . , n+,
so being competitive against the best g ∗ in G may or may not be as significant as you would have liked.
When the examples (xi , yi ) come one at a time, i.e. in the online (streaming) fashion, we can give
ourselves even more flexibility: we construct a sequence of classification rules {hi : i = 1, 2, . . .}, and the
evaluation proceeds as follows. Start with i = 1 and choose h1 :

(I). receive xi and predict ŷi = hi (xi )


(II). receive true label yi and possibly suffer a mistake if ŷi ̸= yi
(III). adjust hi to hi+1 and increment i by 1.

(We could also allow hi to depend on xi , i.e. delay the adjustment of hi until receiving xi .) Note that while
we are allowed to adaptively adjust our classification rules {hi }, the competitor is more restricted: it has to
stick to some fixed rule gk ∈ G chosen well before seeing any example.

Definition 1.10: Proper vs. improper learning

When we require h ∈ G, we are in the proper learning regime, where we (who chooses h) must act in the
same space as the competitor (who is constrained to choose from G). However, sometimes learning is easier
if we abandon h ∈ G and operate in the improper learning regime. Life is never fair anyways .
Of course, this distinction is relative to the class G. In particular, if G consists of all possible functions,
then any learning algorithm is proper but we are competing against a very strong competitor.

Alert 1.11: Notation

The Iverson notation JAK or sometimes also 1(A) (or even 1A ) denotes the indicator function of the event
A ⊆ Rd , i.e., JAK is 1 if the event A holds and 0 otherwise.
We use |A| to denote the size (i.e. the number of elements) of a set A.

Definition 1.12: Thresholding

Often it is more convenient to learn a real-valued function f : Rd → R and then use thresholding to get a
binary-valued classification rule: h = sign(f ), where say we define sign(0) = −1 (or sign(0) = 1, the actual
choice is usually immaterial).

Definition 1.13: Linear and affine functions

Perhaps the simplest multivariate function is the class of linear/affine functions. Recall that a function
f : Rd → R is linear if for all w, z ∈ Rd and α, β ∈ R:
f (αw + βz) = αf (w) + βf (z).
From the definition it follows that f (0) = 0 for any linear function f .


Similarly, a function f : Rd → R is called affine if for all w, z ∈ Rd and α ∈ R:

f (αw + (1 − α)z) = αf (w) + (1 − α)f (z).

Compared to the definition of linear functions, the restriction α + β = 1 is enforced.

Exercise 1.14: Representation of linear and affine functions

Prove:
• If f : Rd → R is linear, then for any n ∈ N, any x1 , . . . , xn ∈ Rd , any α1 , . . . , αn ∈ R, we have
f ( Σ_{i=1}^n αi xi ) = Σ_{i=1}^n αi f (xi ).

What is the counterpart for affine functions?

• A function f : Rd → R is linear iff there exists some w ∈ Rd so that f (x) = w⊤ x.


• A function f : Rd → R is affine iff there exists some w ∈ Rd and b ∈ R so that f (x) = w⊤ x + b, i.e.
an affine function is a linear function translated by some constant.

Definition 1.15: Inner product

Recall the inner product (i.e. dot product) between two vectors w, x ∈ Rd is defined as
w⊤ x = Σ_{j=1}^d wj xj = x⊤ w.

The notation ⟨w, x⟩ is also used (especially when we want to abstract away coordinates/basis).

Algorithm 1.16: Perceptron

Combining thresholding (see Definition 1.12) and affine functions (see Definition 1.13), the celebrated per-
ceptron algorithm of Rosenblatt (1958) tries to learn a classification rule

h(x) = sign(f (x)), f (x) = w⊤ x + b,

parameterized by the weight vector w ∈ Rd and bias (threshold) b ∈ R so that

∀i, yi = h(xi ) ⇐⇒ yi ŷi > 0, ŷi = f (xi ) = w⊤ xi + b, (1.2)

where we have used the fact that yi ∈ {±1} (and ignored the possibility of ŷi = 0 for the direction ⇒).
In the following perceptron algorithm, the training examples come in the online fashion (see Defini-
tion 1.9), and the algorithm updates only when it makes a “mistake” (line 3).


Algorithm: Perceptron (Rosenblatt 1958)


Input: Dataset D = *(xi , yi ) ∈ Rd × {±1} : i = 1, . . . , n+, initialization w ∈ Rd and b ∈ R,
threshold δ ≥ 0
Output: approximate solution w and b
1 for t = 1, 2, . . . do
2 receive training example index It ∈ {1, . . . , n} // the index It can be random
3 if yIt (w⊤ xIt + b) ≤ δ then
4 w ← w + yIt xIt // update only after making a “mistake”
5 b ← b + yIt

We can break the for-loop if a maximum number of iterations has been reached, or if all training examples
are correctly classified in a full cycle (in which case the algorithm will no longer update itself).
Rosenblatt, F. (1958). “The perceptron: A probabilistic model for information storage and organization in the brain”.
Psychological Review, vol. 65, no. 6, pp. 386–408.
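A minimal NumPy sketch of Algorithm 1.16 (ours, not from the original notes); the cyclic choice of the index It, the iteration cap, and the toy dataset at the end are our own illustrative choices.

import numpy as np

def perceptron(X, y, delta=0.0, max_passes=100):
    """X: n x d feature matrix, y: length-n array of +/-1 labels."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_passes):
        mistakes = 0
        for i in range(n):                       # cyclic choice of the index I_t
            if y[i] * (w @ X[i] + b) <= delta:   # a "mistake": update w and b
                w += y[i] * X[i]
                b += y[i]
                mistakes += 1
        if mistakes == 0:                        # all examples correct in a full cycle
            break
    return w, b

# toy usage on a (strictly) linearly separable dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X @ np.array([1.0, -2.0]) + 0.5 > 0, 1, -1)
w, b = perceptron(X, y)
print("training mistakes remaining:", int(np.sum(y * (X @ w + b) <= 0)))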

Remark 1.17: Padding


 
If we define ai = yi (xi ; 1), A = [a1 , . . . , an ] ∈ Rp×n (where p = d + 1 stands for the number of predictors),
and w = (w; b), then clearly

ai⊤ w = yi (w⊤ xi + b).

Thus, the perceptron problem (1.2) can be concisely reduced to the following (slightly more general) system
of linear inequalities:

Given A ∈ Rp×n find w ∈ Rp so that A⊤ w > 0, (1.3)

where the (strict) inequality is meant elementwise, i.e. x > w ⇐⇒ ∀j, xj > wj . In the sequel we will
identify the perceptron problem (1.2) with the above system of linear inequalities in (1.3).
The trick to pad the constant 1 to xi and the bias b to w so that we can deal with the pair (w, b) more
concisely is used ubiquitously in machine learning. The trick to multiply the binary label yi to xi is also
often used in binary classification problems.

Alert 1.18: Notation

We use x and w for the original vectors and x = (x; 1) and w = (w; b) for the padded versions (with constant 1 and bias b
appended, respectively). Similarly, we use X and W for the original matrices and X and w for the padded versions.
We use ŷ ∈ R for a real-valued prediction and ŷ ∈ {±1} for a binary prediction, keeping in mind that
usually ŷ = sign(ŷ).

Remark 1.19: “If it ain’t broke, don’t fix it”

The perceptron algorithm is a perfect illustration of the good old wisdom: “If it ain’t broke, don’t fix
it.” Indeed, it maintains the same weight vector (w, b) until when a “mistake” happens, i.e. line 3 in
Algorithm 1.16. This principle is often used in designing machine learning algorithms, or in life .
On the other hand, had we always performed the updates in line 4 and 5 (even when we predicted
correctly), then it is easy to construct an infinite sequence (x1 , y1 ), (x2 , y2 ), . . . , that is strictly linearly
separable (see Definition 1.24 below), and yet the modified (aggressive) perceptron will make infinitely many
mistakes.


Indeed, let δ = 0 and yi ≡ 1. This dataset (with any xi ) is clearly linearly separable (simply take
w = 0, b = 1). The aggressive perceptron maintains the weight (assuming w.l.o.g that w0 = 0)
wt = Σ_{i=1}^t (xi ; 1).

Thus, to fool it we need

⟨wt , at+1 ⟩ = ⟨Σ_{i=1}^t xi , xt+1 ⟩ + t < 0, i.e., ⟨x̄t , xt+1 ⟩ < −1,

where x̄t := (1/t) Σ_{i=1}^t xi . Now, let xt = −2x if log2 t ∈ N and xt = x otherwise,
for any fixed x with unit length, i.e. ∥x∥2 = 1. Then, we verify x̄t → x while for sufficiently large t such that
log2 (t + 1) ∈ N, we have ⟨x̄t , xt+1 ⟩ → ⟨x, −2x⟩ = −2 < −1. Obviously, such t’s form an infinite subsequence,
on which the aggressive perceptron errs.

Exercise 1.20: Mistakes all the time

Construct a linearly separable sequence (x1 , y1 ), (x2 , y2 ), . . . , so that the aggressive perceptron makes mis-
takes on every example! [Hint: you may let xi go unbounded. Can you make it bounded?]

History 1.21: “What doesn’t kill you makes you stronger” ?

Historically, perceptron was the first algorithm that kicked off the entire field of artificial intelligence. Its
design, analysis, and application have had lasting impact on the machine learning field even to this day.
Ironically, the failure of perceptron on nonlinear problems (to be discussed in later lectures) almost killed
the entire artificial intelligence field as well...

Exercise 1.22: Perceptron for solving (homogeneous) linear inequalities

Modify Algorithm 1.16 to solve the system of (homogeneous) linear inequalities (1.3).

Alert 1.23: Existence and uniqueness of solution

For any problem you are interested in solving, the first question you should ask is:
Does there exist a solution?

• If the answer is “no,” then the second question you should ask is:

If there is no solution at all, can we still “solve” the problem in certain meaningful ways?

• If the answer is “yes,” then the second question you should ask is:

If there is at least one solution, then is the solution unique?

– If the answer is “no,” then the third question you should ask is:


Is there any reason to prefer a certain solution?

Too often in ML, we hasten to design algorithms and run experiments, without fully understanding or
articulating what we are trying to solve, or deciding if the problem is even solvable in any well-defined sense.
There is certain value in this philosophy but also great danger.

Definition 1.24: (Strictly) linear separable

We say that the perceptron problem (1.3) is (strictly) linearly separable if for some (hence all) s > 0, there exists
some w such that ∀i, ai⊤ w ≥ s > 0 (or in matrix notation A⊤ w ≥ s1). Otherwise, we say the perceptron
problem is linearly inseparable.


This is the reason why the threshold parameter δ in Algorithm 1.16 is immaterial, at least in terms of
convergence when the problem is indeed linearly separable.

Definition 1.25: Norms and Cauchy-Schwarz inequality

For any vector w ∈ Rd , its Euclidean (ℓ2 ) norm (i.e., length) is defined as:
∥w∥2 := √(w⊤ w) = ( Σ_{j=1}^d |wj |² )^{1/2} .

More generally, for any p ≥ 1, we define the ℓp norm

∥w∥p := ( Σ_{j=1}^d |wj |^p )^{1/p} ,

while for p = ∞ we define the max norm

∥w∥∞ := max_{j=1,...,d} |wj | .

Even more generally, a norm is any function ∥ · ∥ : Rd → R+ that satisfies:

• (definite) ∥w∥ = 0 ⇐⇒ w = 0
• (homogeneous) for all λ ∈ R and w ∈ Rd , ∥λw∥ = |λ| · ∥w∥
• (triangle inequality) for all w and z ∈ Rd :

∥w + z∥ ≤ ∥w∥ + ∥z∥.

The norm function is a convenient way to convert a vector quantity to a real number, for instance, to facilitate
numerical comparison. Part of the business in machine learning is to understand the effect of different norms
on certain learning problems, even though all norms are “formally equivalent:” for any two norms ∥ · ∥ and
||| · |||, there exist constants cd , Cd ∈ R so that ∀w ∈ Rd ,

cd ∥w∥ ≤ |||w||| ≤ Cd ∥w∥.

The subtlety lies on the dependence of the constants cd , Cd on the dimension d: could be exponential and
could affect a learning algorithm a lot.
The dual (norm) ∥ · ∥◦ of the norm ∥ · ∥ is defined as:

∥z∥◦ := max_{∥w∥=1} w⊤ z = max_{w̸=0} w⊤ z/∥w∥ = max_{∥w∥=1} |w⊤ z| = max_{w̸=0} |w⊤ z|/∥w∥ .


From the definition it follows the important inequality:

w⊤ z ≤ |w⊤ z| ≤ ∥w∥ · ∥z∥◦ .

The Cauchy-Schwarz inequality, which will be repeatedly used throughout the course, is essentially a
self-duality property of the ℓ2 norm:

w⊤ z ≤ |w⊤ z| ≤ ∥w∥2 · ∥z∥2 ,

i.e., the dual norm of the ℓ2 norm is itself. The dual norm of the ℓp norm is the ℓq norm, where

1 1
∞ ≥ p, q ≥ 1 and + = 1.
p q

Exercise 1.26: Norms

Prove the following:


• for any w ∈ Rd : ∥w∥p → ∥w∥∞ as p → ∞

• for any ∞ ≥ p ≥ 1, the ℓp norm is indeed a norm. What about p < 1?


• the dual of any norm ∥ · ∥ is indeed again a norm
• for any ∞ ≥ p ≥ q ≥ 1, ∥w∥p ≤ ∥w∥q ≤ d^{1/q − 1/p} ∥w∥p
• for any w, z ∈ Rd : ∥w + z∥22 + ∥w − z∥22 = 2(∥w∥22 + ∥z∥22 ).

Remark 1.27: Key insight

The key insight for the success of the perceptron Algorithm 1.16 is the following simple inequality:

⟨a, wt+1 ⟩ = ⟨a, wt + a⟩ = ⟨a, wt ⟩ + ∥a∥22 > ⟨a, wt ⟩ .

(Why can we assume w.l.o.g. that ∥a∥22 > 0?) Therefore, if the condition ⟨a, wt ⟩ > δ is violated, then we
perform an update which brings us strictly closer to satisfy that constraint. [The magic is that by doing so
we do not ruin the possibility of satisfying all other constraints, as we shall see.]
This particular update rule of perceptron can be justified as performing stochastic gradient descent on
an appropriate objective function, as we are going to see in a later lecture.

Theorem 1.28: Perceptron convergence theorem (Block 1962; Novikoff 1962)

Assume the data A (see Remark 1.17) is (strictly) linearly separable and denote by wt the iterate after the
t-th update in the perceptron algorithm. Then, wt → some w∗ in finite time. If each column of A is selected
infinitely often, then A⊤ w∗ > δ1.

Proof. Under the linearly separable assumption there exists some solution w⋆ , i.e., A⊤ w⋆ ≥ s1 for some
s > 0. Then, upon making an update from wt to wt+1 (using the data example denoted as a):

⟨wt+1 , w⋆ ⟩ = ⟨wt , w⋆ ⟩ + ⟨a, w⋆ ⟩ ≥ ⟨wt , w⋆ ⟩ + s.

Hence, by telescoping we have ⟨wt , w⋆ ⟩ ≥ ⟨w0 , w⋆ ⟩ + ts, which then, using the Cauchy-Schwarz inequality,



implies ∥wt ∥2 ≥ (⟨w0 , w⋆ ⟩ + ts)/∥w⋆ ∥2 . [Are we certain that ∥w⋆ ∥2 ̸= 0?]
On the other hand, using the fact that we make an update only when ⟨a, wt ⟩ ≤ δ:

∥wt+1 ∥22 = ∥wt + a∥22 = ∥wt ∥22 + 2 ⟨wt , a⟩ + ∥a∥22 ≤ ∥wt ∥22 + 2δ + ∥a∥22 .

Hence, telescoping again we have ∥wt ∥22 ≤ ∥w0 ∥22 + (2δ + ∥A∥22,∞ )t, where we use the notation ∥A∥2,∞ :=
maxi ∥ai ∥2 .
Combine the above two (blue) inequalities: (⟨w0 , w⋆ ⟩ + ts)/∥w⋆ ∥2 ≤ √( ∥w0 ∥22 + (2δ + ∥A∥22,∞ )t ), solving which gives:

t ≤ [ (2δ + ∥A∥22,∞ )∥w⋆ ∥22 + 2s∥w⋆ ∥2 ∥w0 ∥2 ] / s² .        (1.4)
Thus, the perceptron algorithm performs at most a finite number of updates, meaning that wt remains
unchanged thereafter.
Typically, we start with w0 = 0 and we choose δ = 0, then the perceptron algorithm converges after at
most ∥A∥22,∞ ∥w⋆ ∥22 / s² updates.
Block, H. D. (1962). “The perceptron: A model for brain functioning”. Reviews of Modern Physics, vol. 34, no. 1,
pp. 123–135.
Novikoff, A. (1962). “On Convergence proofs for perceptrons”. In: Symposium on Mathematical Theory of Automata,
pp. 615–622.

Exercise 1.29: Data normalization

Suppose the data D = *(xi , yi ) ∈ Rd × {±1} : i = 1, . . . , n+ is linearly separable, i.e., there exists some s > 0
and w = (w; b) such that A⊤ w ≥ s1, where recall that ai = yi (xi ; 1).


• If we scale each instance xi to λxi for some λ > 0, is the resulting data still linearly separable? Does
perceptron converge faster or slower after scaling? How does the bound (1.4) change?
• If we translate each instance xi to xi + x̄ for some x̄ ∈ Rd , is the resulting data still linearly separable?
Does perceptron converge faster or slower after translation? How does the bound (1.4) change?

[Hint: you could consider initializing w0 differently after the scaling.]

Remark 1.30: Optimizing the bound

As we mentioned above, (for linearly separable data) the perceptron algorithm converges after at most
∥A∥22,∞ ∥w⋆ ∥22 / s² steps, if we start with w0 = 0 (and choose δ = 0). Note, however, that the “solution” w⋆ is
introduced merely for the analysis of the perceptron algorithm; the algorithm in fact does not “see” it at all.
In other words, w⋆ is “fictional,” hence we can tune it to optimize our bound as follows:

min_{(w⋆ ,s): A⊤ w⋆ ≥ s1} ∥w⋆ ∥22 /s² = min_{(w,s): ∥w∥2 ≤1, A⊤ w ≥ s1} 1/s² = 1/[ max_{(w,s): ∥w∥2 ≤1, A⊤ w ≥ s1} s ]² = [ 1 / ( max_{∥w∥2 ≤1} min_i ⟨ai , w⟩ ) ]²,

where the quantity max_{∥w∥2 ≤1} min_i ⟨ai , w⟩ is the margin γ2 , and where we implicitly assumed it is positive (i.e. A is linearly separable). Therefore, the
perceptron algorithm (with w0 = 0, δ = 0) converges after at most

T = T (A) := maxi ∥ai ∥22 / [ max_{∥w∥2 ≤1} mini ⟨ai , w⟩ ]²


steps. If we scale the data so that ∥A∥2,∞ := maxi ∥ai ∥2 = 1, then we have the appealing bound:

T = T (A) = 1/γ2² , where γ2 = γ2 (A) = max_{∥w∥2 ≤1} min_i ⟨ai , w⟩ ≤ min_i max_{∥w∥2 ≤1} ⟨ai , w⟩ = min_i ∥ai ∥2 ≤ ∥A∥2,∞ = 1.        (1.5)

Intuitively, the margin parameter γ2 characterizes how “linearly separable” a dataset A is, and the perceptron
algorithm converges faster if the data is “more” linearly separable!

Remark 1.31: Uniqueness

The perceptron algorithm outputs a solution w such that A⊤ w > δ1, but it does not seem to care which
solution to output if there are multiple ones. The iteration bound in (1.5) actually suggests a different
algorithm, famously known as the support vector machines (SVM). The idea is simply to find the weight
vector w that attains the margin in (1.5):

max_{∥w∥2 ≤1} min_i ⟨ai , w⟩ ⇐⇒ min_{w: A⊤ w ≥ 1} ∥w∥22 ,

where the right-hand side is the usual formula for hard-margin support vector machines (SVM), to be
discussed in a later lecture! (Strictly speaking, we need to replace the squared norm of the padded vector w = (w; b) with that of w alone, i.e., unregularizing the
bias b in SVM.)

Alert 1.32: Notation

For two real numbers u, v ∈ R, the following standard notations will be used throughout the course:
• u ∨ v := max{u, v}: maximum of the two
• u ∧ v := min{u, v}: minimum of the two

• u+ = u ∨ 0 = max{u, 0}: positive part


• u− = max{−u, 0}: negative part
These operations extend straightforwardly in the elementwise manner to two vectors u, v ∈ Rd .

Exercise 1.33: Decomposition

Prove the following claims (note that the negative part u− is a positive number by definition):
• u+ = (−u)−

• u = u+ − u−
• |u| = u+ + u−

Theorem 1.34: Optimality of perceptron

Let n = (1/γ²) ∧ d. For any deterministic algorithm A, there exists a dataset {(ei , yi )}_{i=1}^n with margin at
least γ such that A makes at least n mistakes on it.

Proof. For any deterministic algorithm A, set yi = −A(e1 , y1 , . . . , ei−1 , yi−1 , ei ). Clearly, A makes n mis-
takes on the dataset *(ei , yi )+ni=1 (due to the hostility of nature in constructing yi ).


We need only verify the margin claim. Let wi⋆ = yi γ (and b = 0), then yi ⟨ei , w⋆ ⟩ = γ. Thus, the dataset
*(ei , yi )+ni=1 has margin at least γ.

Therefore, for high dimensional problems (i.e. large d), the perceptron algorithm achieves the optimal
worst-case mistake bound.

Theorem 1.35: Perceptron boundedness theorem (Amaldi and Hauser 2005)

Let A ∈ Rp×n be a matrix with nonzero columns, w0 ∈ Rp arbitrary, ηt ∈ [0, η̄], and define

wt+1 = wt + ηt aIt ,

where aIt is some column of A chosen such that ⟨wt , aIt ⟩ ≤ 0. Then, for all t,
h  i
∥wt ∥2 ≤ 2 max ∥w0 ∥2 , η̄ max ∥ai ∥2 × (1 ∧ min ∥ai ∥2 )rank(A) × κ(A)23p/2 + 1 , (1.6)
i i

where the condition number

κ−2 (A) := min{det(B ⊤ B) : B = [ai1 , ai2 , . . . , airank(A) ] is a submatrix of A with full column rank}.

Proof. We omit the somewhat lengthy proof.


The preceptron algorithm corresponds to ηt ≡ 1, in which case the boundedness claim (without the
quantitative bound (1.6)) was first established in Minsky and Papert (1988, originally published in 1969)
and Block and Levin (1970).
Amaldi, Edoardo and Raphael Hauser (2005). “Boundedness Theorems for the Relaxation Method”. Mathematics of
Operations Research, vol. 30, no. 4, pp. 939–955.
Minsky, Marvin L. and Seymour A. Papert (1988, originally published in 1969). Perceptron. second expanded. MIT
press.
Block, H. D. and S. A. Levin (1970). “On the boundedness of an iterative procedure for solving a system of linear
inequalities”. Proceedings of the American Mathematical Society, vol. 26, pp. 229–235.

Remark 1.36: Reducing multiclass to binary

We can easily adapt the perceptron algorithm to datasets with c > 2 classes, using either of the following
general reduction schemes:
• one-vs-all: For each class k, use its examples as positive and examples from all other classes as negative,
allowing us to train a perceptron with weight wk = [wk ; bk ]. Upon receiving a new example x, we
predict according to the “winner” of the c perceptrons (break ties arbitrarily):

ŷ = argmax_{k=1,...,c} wk⊤ x + bk .

The downside of this scheme is that when we train the k-th perceptron, the dataset is imbalanced, i.e.
we have much more negatives than positives. The upside is that we only need to train c (or c − 1 if we
set one class as the default) perceptrons.
• one-vs-one: For each pair (k, k ′ ) of classes, we train a perceptron wk,k′ where we use examples from
class k as positive and examples from class k ′ as negative. In total we train c(c − 1)/2 perceptrons. Upon
receiving a new example x, we count how many times each class k is the (binary) prediction:

|{k ′ : x⊤ wk,k′ + bk,k′ > 0 or x⊤ wk′ ,k + bk′ ,k ≤ 0}|.

Of course, we take again the “winner” as our predicted class. The downside here is we have to train
O(c2 ) perceptrons while the upside is that each time the training set is more balanced.
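A rough sketch (ours, not from the notes) of the one-vs-all reduction; train_binary is a placeholder for any binary learner, e.g. the perceptron sketch given after Algorithm 1.16, and classes are assumed to be labeled 0, . . . , c − 1.

import numpy as np

def one_vs_all_train(X, y, num_classes, train_binary):
    # train_binary(X, y_pm) returns (w, b) for +/-1 labels y_pm
    models = []
    for k in range(num_classes):
        y_pm = np.where(y == k, 1, -1)     # class k vs. the rest (imbalanced!)
        models.append(train_binary(X, y_pm))
    return models

def one_vs_all_predict(X, models):
    scores = np.stack([X @ w + b for (w, b) in models], axis=1)
    return scores.argmax(axis=1)           # "winner" among the c perceptrons

# example usage (assuming the perceptron sketch from earlier):
# models = one_vs_all_train(X, y, c, perceptron); yhat = one_vs_all_predict(X, models)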


Algorithm 1.37: General linear inequalities (Agmon 1954; Motzkin and Schoenberg 1954)

More generally, to solve any (non-homogeneous) linear inequality system

A⊤ w ≤ c, i.e., ai⊤ w ≤ ci , i = 1, . . . , n,

we can extend the idea of perceptron to the following projection algorithm:


Algorithm: Projection Algorithm for Linear Inequalities
Input: A ∈ Rp×n , c ∈ Rn , initialization w ∈ Rp , relaxation parameter η ∈ (0, 2]
Output: approximate solution w
1 for t = 1, 2, . . . do
2 select an index It ∈ {1, . . . , n} // the index It can be random
3 w ← (1 − η)w + η [ w − (aIt⊤ w − cIt )+ / ⟨aIt , aIt ⟩ · aIt ]

The term within the square bracket is exactly the projection of w onto the halfspace aIt⊤ w ≤ cIt . If we
choose η ≡ 1 then we just repeatedly project w onto each of the halfspaces. With η ≡ 2 we actually perform
reflections, which, as argued by Motzkin and Schoenberg (1954), can accelerate convergence a lot in certain
settings.
Agmon, Shmuel (1954). “The Relaxation Method for Linear Inequalities”. Canadian Journal of Mathematics, vol. 6,
pp. 382–392.
Motzkin, T. S. and I. J. Schoenberg (1954). “The Relaxation Method for Linear Inequalities”. Canadian Journal of
Mathematics, vol. 6, pp. 393–404.

Remark 1.38: Choosing the index

There are a couple of ways to choose the index It , i.e., which example we are going to deal with at iteration
t:
• cyclic: It = (It−1 + 1) mod n.
• chaotic: ∃τ ≥ n so that for any t ∈ N, {1, 2, . . . , n} ⊆ {It , It+1 , . . . , It+τ −1 }.

• randomized: It = i with probability pi . A typical choice is pi = ∥ai ∥22 / Σ_ı ∥aı ∥22 .

• permuted: in each epoch randomly permute {1, 2, . . . , n} and then follow cyclic.

• maximal distance: (aIt⊤ w − cIt )+ /∥aIt ∥2 = max_{i=1,...,n} (ai⊤ w − ci )+ /∥ai ∥2 (break ties arbitrarily).

• maximal residual: (aIt⊤ w − cIt )+ = max_{i=1,...,n} (ai⊤ w − ci )+ (break ties arbitrarily).

Remark 1.39: Understanding perceptron mathematically

Let us define a polyhedral cone cone(A) := {Aλ : λ ≥ 0} whose dual is [cone(A)]∗ = {w : A⊤ w ≥ 0}. The
linear separability assumption in Definition 1.24 can be written concisely as int([cone(A)]∗ ) ̸= ∅, but it is
known in convex analysis that the dual cone [cone(A)]∗ has nonempty interior iff int([cone(A)]∗ )∩cone(A) ̸= ∅,
i.e., iff there exists some λ ≥ 0 so that w = Aλ satisfies A⊤ w > 0. Slightly perturbing λ, we may assume w.l.o.g.
that λ is rational. Scaling if necessary, we may even assume λ is integral. The perceptron algorithm
gives a constructive way to find such an integral λ (hence also w).


Remark 1.40: Variants

We mention some interesting variants of the perceptron algorithm: (Cesa-Bianchi et al. 2005; Dekel et al.
2008; Soheili and Peña 2012; Soheili and Peña 2013).
Cesa-Bianchi, Nicolò, Alex Conconi, and Claudio Gentile (2005). “A Second-Order Perceptron Algorithm”. SIAM
Journal on Computing, vol. 34, no. 3, pp. 640–668.
Dekel, Ofer, Shai Shalev-Shwartz, and Yoram Singer (2008). “The Forgetron: A Kernel-based Perceptron on A Bud-
get”. SIAM Journal on Computing, vol. 37, no. 5, pp. 1342–1372.
Soheili, Negar and Javier Peña (2012). “A Smooth Perceptron Algorithm”. SIAM Journal on Optimization, vol. 22,
no. 2, pp. 728–737.
— (2013). “A Primal–Dual Smooth Perceptron–von Neumann Algorithm”. In: Discrete Geometry and Optimization,
pp. 303–320.

Remark 1.41: More refined results

A primal-dual version is given in (Spingarn 1985; Spingarn 1987), which solves the general system of linear
inequalities in finite time (provided a solution exists in the interior).
For a more refined analysis of the perceptron algorithm and related, see (Goffin 1980; Goffin 1982;
Ramdas and Peña 2016; Peña et al. 2021).
Spingarn, Jonathan E. (1985). “A primal-dual projection method for solving systems of linear inequalities”. Linear
Algebra and its Applications, vol. 65, pp. 45–62.
— (1987). “A projection method for least-squares solutions to overdetermined systems of linear inequalities”. Linear
Algebra and its Applications, vol. 86, pp. 211–236.
Goffin, J. L. (1980). “The relaxation method for solving systems of linear inequalities”. Mathematis of Operations
Research, vol. 5, no. 3, pp. 388–414.
— (1982). “On the non-polynomiality of the relaxation method for systems of linear inequalities”. Mathematical
Programming, vol. 22, pp. 93–103.
Ramdas, Aaditya and Javier Peña (2016). “Towards a deeper geometric, analytic and algorithmic understanding of
margins”. Optimization Methods and Software, vol. 31, no. 2, pp. 377–391.
Peña, Javier F., Juan C. Vera, and Luis F. Zuluaga (2021). “New characterizations of Hoffman constants for systems
of linear constraints”. Mathematical Programming, vol. 187, pp. 79–109.

Remark 1.42: Solving conic linear system

The perceptron algorithm can be used to solve linear programs (whose KKT conditions form a system of
linear inequalities) and more generally conic linear programs, see (Dunagan and Vempala 2008; Belloni et al.
2009; Peña and Soheili 2016; Peña and Soheili 2017).
Dunagan, John and Santosh Vempala (2008). “A simple polynomial-time rescaling algorithm for solving linear pro-
grams”. Mathematical Programming, vol. 114, no. 1, pp. 101–114.
Belloni, Alexandre, Robert M. Freund, and Santosh Vempala (2009). “An Efficient Rescaled Perceptron Algorithm
for Conic Systems”. Mathematics of Operations Research, vol. 34, no. 3, pp. 621–641.
Peña, Javier and Negar Soheili (2016). “A deterministic rescaled perceptron algorithm”. Mathematical Programming,
vol. 155, pp. 497–510.
— (2017). “Solving Conic Systems via Projection and Rescaling”. Mathematical Programming, vol. 166, no. 1-2,
pp. 87–111.

Remark 1.43: Herding

Some interesting applications of the perceptron algorithm and its boundedness can be found in (Gelfand
et al. 2010; Harvey and Samadi 2014), (Briol et al. 2015; Briol et al. 2019; Chen et al. 2016), and (Phillips
and Tai 2020; Dwivedi and Mackey 2021; Turner et al. 2021).
Gelfand, Andrew, Yutian Chen, Laurens Maaten, and Max Welling (2010). “On Herding and the Perceptron Cycling
Theorem”. In: Advances in Neural Information Processing Systems.


Harvey, Nick and Samira Samadi (2014). “Near-Optimal Herding”. In: Proceedings of The 27th Conference on Learning
Theory, pp. 1165–1182.
Briol, François-Xavier, Chris J. Oates, Mark Girolami, Michael A. Osborne, and Dino Sejdinovic (2015). “Frank-Wolfe
Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees”. In: Advances in Neural Information
Processing Systems.
— (2019). “Probabilistic Integration: A Role in Statistical Computation?” Statistical Science, vol. 34, no. 1, pp. 1–22.
Chen, Yutian, Luke Bornn, Nando de Freitas, Mareija Eskelin, Jing Fang, and Max Welling (2016). “Herded Gibbs
Sampling”. Journal of Machine Learning Research, vol. 17, no. 10, pp. 1–29.
Phillips, Jeff M. and Wai Ming Tai (2020). “Near-Optimal Coresets of Kernel Density Estimates”. Discrete & Com-
putational Geometry, vol. 63, pp. 867–887.
Dwivedi, Raaz and Lester Mackey (2021). “Kernel Thinning”. In: Proceedings of Thirty Fourth Conference on Learning
Theory, pp. 1753–1753.
Turner, Paxton, Jingbo Liu, and Philippe Rigollet (2021). “A Statistical Perspective on Coreset Density Estimation”.
In: Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pp. 2512–2520.


2 Linear Regression
Goal

Understand linear regression for predicting a real response. Regularization and cross-validation.

Alert 2.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.
Some notations and conventions in this note are different from those in the slides.

Definition 2.2: Interpolation

Given a sequence of pairs D = *(xi , yi ) ∈ Rd × Rr : i = 1, . . . , n+, we want to find a function f : Rd → Rr


so that for all i:

f (xi ) ≈ yi .

Most often, r = 1, i.e., each yi is real-valued. However, we will indulge ourselves for treating any r (since it
brings very minimal complication).
The variable r here stands for the number of responses (tasks), i.e., how many values we are interested
in predicting (simultaneously).

Theorem 2.3: Exact interpolation

For any finite number of pairs D = *(xi , yi ) ∈ Rd × Rr : i = 1, . . . , n+ that satisfy xi = xj =⇒ yi = yj ,


there exist infinitely many functions f : Rd → Rr so that for all i:

f (xi ) = yi .

Proof. W.l.o.g. we may assume all xi ’s are distinct. Lagrange polynomials give immediately such a claimed
function. More generally, one may put a bump function within a small neighborhood of each xi and then
glue them together. In details, set Ni := {z : ∥z − xi ∥∞ < δ} ⊆ Rd . Clearly, xi ∈ Ni and for δ sufficiently
small, Ni ∩ Nj = ∅. Define
fi (z) = yi e^{d/δ²} ∏_{j=1}^d exp( −1/(δ² − (zj − xji )²) )  if z ∈ Ni ,   and   fi (z) = 0 otherwise.

The function f = Σ_i fi again exactly interpolates our data D.

The condition xi = xj =⇒ yi = yj is clearly necessary, for otherwise there cannot be any function so
that yi = f (xi ) = f (xj ) = yj and yi ̸= yj . Of course, when all xi ’s are distinct, this condition is trivially
satisfied.
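A quick 1-d illustration (ours) of the phenomenon behind Theorem 2.3: two different functions interpolate the same training data exactly yet disagree at a new test point; the data and the second interpolant g are invented for the example.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0])

p = np.polynomial.Polynomial.fit(x, y, deg=len(x) - 1)   # degree-4 interpolating polynomial
g = lambda z: p(z) + 100.0 * np.prod(z - x)              # also interpolates: the extra term vanishes on x

print([round(float(p(xi)), 6) for xi in x])   # reproduces y
print([round(float(g(xi)), 6) for xi in x])   # reproduces y as well
print(float(p(2.5)), float(g(2.5)))           # very different predictions at a new point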

Exercise 2.4: From 1 to ∞

Complete the proof of Theorem 2.3 with regard to the “infinite” part.


Remark 2.5: Infinitely many choices...

Theorem 2.3 has the following important implication: Given a finite training set D, no matter how large
its size might be, there exist infinitely many smooth (infinitely differentiable) functions f that maps each xi
in D exactly to yi , i.e. they all achieve zero training “error.” However, on a new test instance x ̸∈ D, the
predictions ŷ = f (x) of different choices of f can be wildly different (in fact, we can make f (x) = y for any
y ∈ Rr ).
Which function should we choose then?

Definition 2.6: Least squares regression

To resolve the difficulty in Remark 2.5, we need some statistical assumption on how our data is generated.
In particular, we assume (Xi , Yi ) are independently and identically distributed (i.i.d.) random samples from
an unknown distribution P. The test sample (X, Y) is also drawn independently and identically from the
same distribution P. We are interested in solving the least squares regression problem:

min_{f : Rd → Rr} E∥Y − f (X)∥22 ,        (2.1)

i.e., finding a function f : Rd → Rr so that f (X) approximates Y well in expectation. (Strictly speaking we
need the technical assumption that f is measurable so that the above expectation is even defined.)
In reality, we do not know the distribution P of (X, Y) hence will not be able to compute the expectation,
let alone minimizing it. Instead, we use the training set D = {(Xi , Yi ) : i = 1, . . . , n} to approximate the
expectation:
min_{f : Rd → Rr} Ê∥Y − f (X)∥22 := (1/n) Σ_{i=1}^n ∥Yi − f (Xi )∥22 .        (2.2)

By law of large numbers, for any fixed function f , we indeed have


(1/n) Σ_{i=1}^n ∥Yi − f (Xi )∥22 → E∥Y − f (X)∥22   as n → ∞.

Remark 2.7: Properties of (conditional) expectation

• The expectation E[Y] intuitively is the (elementwise) average value of the random variable Y.
• The conditional expectation E[Y|X] is an (equivalent class of) real-valued function(s) of X.

• E[f (X)|X] = f (X) (almost everywhere). Intuitively, conditioned on X, f (X) is not random hence the
(conditional) expectation is vacuous.
• Law of total expectation: E[Y] = E[E(Y|X)] for any random variable Y and X. Intuitively, the left
hand side is the average value of Y, while the right hand side is the average of averages: E(Y|X) is
the average of Y for a given realization of X. For instance, take Y to be the height of a human being
and X to be the gender. Then, the average height of a human (left hand side) is equal to the average
of the average heights of man and woman (right hand side).


Definition 2.8: Regression function

For a moment let us assume the distribution P is known to us, so we can at least in theory solve the least
squares regression problem (2.1). It turns out there exists an optimal solution whose closed-form expression
can be derived as follows:

E∥Y − f (X)∥22 = E∥Y − E(Y|X) + E(Y|X) − f (X)∥22
= E∥Y − E(Y|X)∥22 + E∥E(Y|X) − f (X)∥22 + 2E[⟨Y − E(Y|X), E(Y|X) − f (X)⟩]
= E∥Y − E(Y|X)∥22 + E∥E(Y|X) − f (X)∥22 + 2E[ E[⟨Y − E(Y|X), E(Y|X) − f (X)⟩ |X] ]
= E∥Y − E(Y|X)∥22 + E∥E(Y|X) − f (X)∥22 + 2E[ ⟨E[Y − E(Y|X)|X], E(Y|X) − f (X)⟩ ]
= E∥Y − E(Y|X)∥22 + E∥E(Y|X) − f (X)∥22 ,

whence the regression function

f ⋆ (X) := E(Y|X) (2.3)

eliminates the second nonnegative term while the first nonnegative term is not affected by any f at all.
With hindsight, it is not surprising the regression function f ⋆ is an optimal solution for the least squares
regression problem: it basically says given (X, Y), we set f ⋆ (X) = Y if there is a unique value of Y associated
with X (which of course is optimal), while if there are multiple values of Y associated to the given X, then
we simply average them.
The constant term E∥Y − E(Y|X)∥22 describes the difficulty of our regression problem: no function f can
reduce it.

Definition 2.9: Overfitting, underfitting, and regularization

All regression methods are in one way or another trying to approximate the regression function (2.3). In
reality, we only have access to an i.i.d. training set, but if we solve (2.2) naively we will run into again the
difficulty in Remark 2.5: we achieve very small (or even zero) error on the training set but the performance
on a test sample can be very bad. This phenomenon is known as overfitting, i.e. we are taking our training
set “too seriously.” Remember in machine learning we are not interested in doing well on the training set
at all (even though training data is all we got, oh life!); training set is used only as a means to get good
performance on (future) test set.
Overfitting arises here because we are considering all functions f : Rd → Rr , which is a huge class,
while we only have limited (finite) training examples. In other words, we do not have enough training data
to support our ambition. To address overfitting, we will restrict ourselves to a subclass Fn of functions
Rd → Rr and solve:
min_{f ∈ Fn} (1/n) Σ_{i=1}^n ∥Yi − f (Xi )∥22 .        (2.4)

In other words, we regularize our choice of candidate functions f to avoid fitting too well on the training
set. Typically, Fn grows as n increases, i.e. with more training data we could allow us to consider more
candidate functions. On the other hand, if there are too few candidate functions in Fn (for instance when
Fn consists of all constant functions), then we may not be able to do well even on the training set. This
phenomenon is known as underfitting. Generally speaking, doing too well or too badly on the training set
are both indications of poor design (either more data needs to be collected or a larger/smaller function class
needs to be considered).


Remark 2.10: Approximation vs. estimation and bias vs. variance

It is possible that the regression function f ⋆ (see (2.3)), the object we are trying to find, is not in the chosen
function class Fn at all. This results in the so-called approximation error (which solely depends on the
function class Fn but not training data), or bias in statistical terms. Clearly, the “larger” Fn is, the smaller
the approximation error. On the other hand, finding the “optimal” f ∈ Fn based on (2.4) is more challenging
for a larger Fn (when n is fixed). This is called estimation error, e.g. the error due to using a finite training
set (random sampling). Typically, the “larger” Fn is, the larger the estimation error is. Often, with a “larger”
Fn , the performance on the test set has more variation (had we repeated with different training data) .
Much of the work on regression is about how to balance between the approximation error and estimation
error, a.k.a. the bias and variance trade-off.

Definition 2.11: Linear least squares regression

The simplest choice for the function class Fn ≡ F is perhaps the class of linear/affine functions (recall
Definition 1.13). Adopting this choice leads to the linear least squares regression problem:
min_{W ∈ Rr×d , b ∈ Rr} (1/n) Σ_{i=1}^n ∥Yi − W Xi − b∥22 ,        (2.5)

where recall that Xi ∈ Rd and Yi ∈ Rr .

Exercise 2.12: Linear functions may not exactly interpolate

Show that Theorem 2.3 fails if we are only allowed to use linear/affine functions.

Definition 2.13: Matrix norms

For a matrix W ∈ Rr×d , we define its Frobenius norm


∥W ∥F = √( Σ_{ij} wij² ),

which is essentially the matrix analogue of the vector Euclidean norm ∥w∥2 .
Another widely used matrix norm is the spectral norm

∥W ∥sp = max_{∥x∥2 =1} ∥W x∥2 ,

which coincides with the largest singular value of W .

It is known that
• ∥W ∥sp ≤ ∥W ∥F ≤ √(rank(W )) ∥W ∥sp ,
• ∥W x∥2 ≤ ∥W ∥sp ∥x∥2 .

Remark 2.14: Padding again

We apply the same padding trick as in Remark 1.17. Define xi = (Xi ; 1) and X = [x1 , . . . , xn ] ∈ Rp×n ,
Y = [Y1 , . . . , Yn ] ∈ Rr×n , and W = [W, b] ∈ Rr×p , where of course p = d + 1. We can then rewrite the
linear least squares problem in the following (equivalent but prettier) matrix format:

min_{W ∈ Rr×p} (1/n) ∥Y − WX∥2F .        (2.6)

Remark 2.15: Pseudo-inverse

Recall that the Moore-Penrose pseudo-inverse A† of any matrix A ∈ Rp×n is the unique matrix G ∈ Rn×p
so that

AGA = A, GAG = G, (AG)⊤ = AG, (GA)⊤ = GA.

In particular, if A = U SV ⊤ is the thin SVD (singular value decomposition) of A, then A† = V S −1 U ⊤ . If A


is in fact invertible, then A† = A−1 .
We can use pseudo-inverse to solve the linear least squares problem (2.6). Indeed, it is known that
W ⋆ := A† CB † is a solution for the minimization problem:

min ∥AW B − C∥2F .


W

In fact, W ⋆ is the unique solution that enjoys the minimum Frobenius norm. See this lecture note for proofs
and more related results.
Equipped with the above result, we know that W⋆ = Y X† is a closed-form solution for the linear least
squares problem (2.6).

Definition 2.16: Gradient and Hessian (for the brave hearts)

Recall that the gradient of a smooth function f : Rd → Rr at w is defined as the linear mapping ∇f (w) :
Rd → Rr so that:
lim_{0 ̸= ∆w → 0}  ∥f (w + ∆w) − f (w) − [∇f (w)](∆w)∥ / ∥∆w∥ = 0,        (2.7)

where say the norm ∥ · ∥ is the Euclidean norm. Or equivalently in big-o notation:

∥f (w + ∆w) − f (w) − [∇f (w)](∆w)∥ = o(∥∆w∥).

As w varies, we may think of the gradient as the (nonlinear) mapping:

∇f : Rd → L(Rd , Rr ), w 7→ {∇f (w) : Rd → Rr }

where L(Rd , Rr ) denotes the class of linear mappings from Rd to Rr , or equivalently the class of matrices
Rr×d .
We can iterate the above definition. In particular, replacing f with ∇f we define the Hessian ∇2 f :
Rd → L(Rd , L(Rd , Rr )) ≃ B(Rd × Rd , Rr ) of f as the gradient of the gradient ∇f , where B(Rd × Rd , Rr )
denotes the class of bilinear mappings from Rd × Rd to Rr .
denotes the class of bilinear mappings from Rd × Rd to Rr .

Definition 2.17: Gradient and Hessian through partial derivatives

Let f : Rd → R be a real-valued smooth function. We can define its gradient through partial derivatives:
[∇f (w)]j = (∂f /∂wj )(w).
Note that the gradient ∇f (w) ∈ Rd has the same size as the input w.


Similarly, we can define the Hessian through partial derivatives:

∂2f ∂2f
[∇2 f (w)]ij = (w) = (w) = [∇2 f (w)]ji ,
∂wi ∂wj ∂wj ∂wi

where the second equality holds as long as f is twice-differentiable. Note that the Hessian is a symmetric
matrix ∇2 f (w) ∈ Rd×d with the same number of rows/columns as the size of the input w.

Remark 2.18: Matrix input

If f : Rr×p → R takes a matrix as input, then its gradient is computed similarly as in Definition 2.17:
[∇f (W)]ij = (∂f /∂wij )(W).

Again, the gradient ∇f (W) ∈ Rr×p has the same size as the input W.
In principle, computing the Hessian is completely similar, although we will need 4 indices and the notation
can get quite messy quickly. Fortunately, we will rarely find ourselves facing this challenge.

Example 2.19: Quadratic function

Consider the following quadratic function


f : Rd → R, w 7→ w⊤ Qw + p⊤ w + α, (2.8)

where Q ∈ Rd×d , p ∈ Rd , and α ∈ R. We can write explicitly:


f (w) = α + Σ_{k=1}^d wk [ pk + Σ_{l=1}^d qkl wl ],

whence follows

[∇f (w)]j = (∂f /∂wj )(w) = Σ_{k=1}^d [ (∂wk /∂wj )( pk + Σ_{l=1}^d qkl wl ) + wk ∂( pk + Σ_{l=1}^d qkl wl )/∂wj ]
          = ( pj + Σ_{l=1}^d qjl wl ) + Σ_{k=1}^d wk qkj
          = pj + [(Q + Q⊤ )w]j .

That is, collectively

∇f (w) = p + (Q + Q⊤ )w.

Similarly,

[∇2 f (w)]ij = (∂²f /∂wi ∂wj )(w) = ∂( pj + Σ_{l=1}^d qjl wl + Σ_{k=1}^d wk qkj )/∂wi = qji + qij .

That is, collectively

∇2 f (w) ≡ Q + Q⊤ .

The formula further simplifies if we assume Q is symmetric, i.e. Q = Q⊤ (which is usually the case).
As demonstrated above, using partial derivatives to derive the gradient and Hessian is straightforward
but tedious. Fortunately, we need only do this once: derive and memorize a few, and then resort to the
chain rule.
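A finite-difference sanity check (ours, not from the notes) of the gradient formula just derived; Q, p, α and the test point w are random toy values.

import numpy as np

rng = np.random.default_rng(0)
d = 4
Q, p, alpha = rng.normal(size=(d, d)), rng.normal(size=d), 1.5
f = lambda w: w @ Q @ w + p @ w + alpha

w = rng.normal(size=d)
grad = p + (Q + Q.T) @ w                       # derived gradient
eps = 1e-6
num_grad = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.allclose(grad, num_grad, atol=1e-4))  # True: the formulas agree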


Example 2.20: Quadratic function (for the brave hearts)

Another way to derive the gradient and Hessian is to guess and then verify the definition (2.7). Using the
quadratic function (2.8) again as an example. We need to verify:

o(∥∆w∥) = ∥f (w + ∆w) − f (w) − [∇f (w)](∆w)∥


= ∥(w + ∆w)⊤ Q(w + ∆w) + p⊤ (w + ∆w) − w⊤ Qw − p⊤ w − [∇f (w)](∆w)∥
= ∥w⊤ Q∆w + (∆w)⊤ Qw + p⊤ ∆w − [∇f (w)](∆w) + (∆w)⊤ Q(∆w)∥,

whence we guess

∇f (w) = p + (Q + Q⊤ )w

so that we can cancel out the first four terms (that are all linear in ∆w). It is then easy to verify that the
remaining term indeed satisfies

∥(∆w)⊤ Q(∆w)∥ = o(∥∆w∥).

Similarly, we can guess and verify the Hessian:

o(∥∆w∥) = ∥∇f (w + ∆w) − ∇f (w) − [∇2 f (w)](∆w)∥


= ∥(Q + Q⊤ )∆w − [∇2 f (w)](∆w)∥,

from which it is clear that ∇2 f (w) = Q + Q⊤ would do.

Alert 2.21: Symmetric gradient

Let us consider the function of a symmetric matrix:

f (X) := tr(AX),

where both A and X are symmetric. What is the gradient?

X vs. 2X − diag(X).

(representation of the derivative depends on the underlying inner product. if we identify the symmetric
matrix [a, b; b, c] with the vector [a, b, c], we are equipping the latter with the dot product aa′ + cc′ + 2bb′ .)

Theorem 2.22: Fermat’s necessary condition for extrema (recalled)

A necessary condition for w to be a local minimizer of a differentiable function f : Rd → R is

∇f (w) = 0.

(Such points are called stationary, a.k.a. critical.) For convex f the necessary condition is also sufficient.

Proof. See the lecture note.

Take f (w) = w3 and w = 0 we see that this necessary condition is not sufficient for nonconvex functions.
For local maximizers, we simply negate the function and apply the theorem to −f instead.


Alert 2.23: Existence of a minimizer

Use the geometric mean as an example for the inapplicability of Fermat’s condition when the minimizer does
not exist.

Definition 2.24: Normal equation

We may now solve the linear least squares problem (2.6). Simply compute its gradient (by following Exam-
ple 2.19) and set it to 0, as suggested by Theorem 2.22. Upon simplifying, we obtain the so-called normal
equation (for the unknown variable W ∈ Rr×p ):

WXX⊤ = Y X⊤ , or after transposing XX⊤ W⊤ = XY ⊤ . (2.9)

This is a system of linear equations, which we can solve using standard numerical linear algebra toolboxes
(Cholesky decomposition in this case). The time complexity is O(p3 + p2 n + pnr). For large n or p, we may
use iterative algorithms (e.g. conjugate gradient) to directly but approximately solve (2.6).
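A NumPy sketch (ours) of forming and solving the normal equation (2.9) after padding; the toy data, the noise level, and the use of np.linalg.solve (in place of an explicit Cholesky factorization) are our own choices.

import numpy as np

rng = np.random.default_rng(0)
d, r, n = 5, 2, 200
X_raw = rng.normal(size=(d, n))
W_true, b_true = rng.normal(size=(r, d)), rng.normal(size=r)
Y = W_true @ X_raw + b_true[:, None] + 0.01 * rng.normal(size=(r, n))

X = np.vstack([X_raw, np.ones((1, n))])        # padding: p = d + 1
W_pad = np.linalg.solve(X @ X.T, X @ Y.T).T    # solves X X^T W^T = X Y^T, giving an r x p matrix
W_hat, b_hat = W_pad[:, :d], W_pad[:, d]
print(np.linalg.norm(W_hat - W_true), np.linalg.norm(b_hat - b_true))   # both small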

Exercise 2.25: Linear least squares is linear

Based on the original linear least squares formula (2.6), or the normal equation (2.9) (or the more direct
solution Y X† using pseudo-inverse), prove the following equivariance property of linear least squares:
• If we apply a nonsingular transformation T ∈ Rp×p to X (or X), what would happen to the linear least
squares solution W?

• If we apply a nonsingular transformation T ∈ Rr×r to Y , what would happen to the linear least squares
solution W?

Definition 2.26: Prediction

Once we have the linear least squares solution Ŵ = (Ŵ , b̂), we perform prediction on a (future) test sample
X naturally by:

Ŷ := Ŵ X + b̂.

We measure the “goodness” of our prediction Ŷ by:

∥Ŷ − Y∥22 ,

which is usually averaged over a test set.

Alert 2.27: Calibration

Note that we used the squared loss (y, ŷ) 7→ (y − ŷ)2 in training linear least squares regression, see (2.5).
Thus, naturally, when evaluating the linear least squares solution Ŵ on a test set, we should use the same
squared loss. If we use a different loss, such as the absolute error |ŷ − y|, then our training procedure may
be suboptimal. Be consistent in terms of the training objective and the test measure!
Nevertheless, sometimes it might be necessary to use one loss ℓ1 (ŷ, y) for training and another one ℓ2 (ŷ, y)
for testing. For instance, ℓ1 may be easier to handle computationally. The theory of calibration studies when
a minimizer under the loss ℓ1 remains optimal under a different loss ℓ2 . See the (heavy) paper of Steinwart
(2007) and some particular refinement in Long and Servedio (2013) for more details.
Steinwart, Ingo (2007). “How to compare different loss functions and their risks”. Constructive Approximation, vol. 26,
no. 2, pp. 225–287.


Long, Philip M. and Rocco A. Servedio (2013). “Consistency versus Realizable H-Consistency for Multiclass Classi-
fication”. In: Proceedings of the 31st International Conference on Machine Learning (ICML).

Definition 2.28: Ridge regression with Tikhonov regularization (Hoerl and Kennard 1970)

The class of linear functions may still be too large, leading linear least squares to overfit or instable. We can
then put some extra restriction, such as the Tikhonov regularization in ridge regression:

min_{W ∈ Rr×p} (1/n) ∥Y − WX∥2F + λ∥W∥2F ,        (2.10)

where λ ≥ 0 is the regularization constant (hyperparameter) that balances the two terms.
To understand ridge regression, consider
• when λ is small, thus we are neglecting the second regularization term, and the solution resembles that
of the ordinary linear least squares solution;
• when λ is large, thus we are neglecting the first data term, and the solution degenerates to 0.
In the literature, the following variant that chooses not to regularize the bias term b is also commonly
used:

min_{W = [W, b]} (1/n) ∥Y − WX∥2F + λ∥W ∥2F .        (2.11)

Hoerl, Arthur E. and Robert W. Kennard (1970). “Ridge regression: biased estimation for nonorthogonal problems”.
Technometrics, vol. 12, no. 1, pp. 55–67.

Exercise 2.29: Solution for ridge regression

Prove that the unique solution for (2.10) is given by the following normal equation:

(XX⊤ + nλI)W⊤ = XY ⊤ ,

where I is the identity matrix of appropriate size.


• Follow the derivation in Definition 2.24 by computing the gradient and setting it to 0.
• Use the trick you learned in elementary school, completing the square, to reduce ridge regression to
the ordinary linear least squares regression.

• Data augmentation: ridge regression is equivalent to the ordinary linear least squares regression (2.6) with the augmented data

    X ← [X, √(nλ) I_{p×p}],   Y ← [Y, 0_{r×p}].

How about the solution for the variant (2.11)?
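For concreteness, a minimal numpy sketch (ours, not part of the notes) of solving the ridge normal equation (XX⊤ + nλI)W⊤ = XY⊤ for the simpler variant (2.10), where the padded X and W are used throughout:

    import numpy as np

    def fit_ridge(X, Y, lam):
        """X: p x n (already padded), Y: r x n, lam >= 0. Solves (X X^T + n*lam*I) W^T = X Y^T."""
        p, n = X.shape
        A = X @ X.T + n * lam * np.eye(p)
        Wt = np.linalg.solve(A, X @ Y.T)   # p x r
        return Wt.T                        # r x p

For λ = 0 this recovers the ordinary least squares solution (when XX⊤ is invertible).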

Remark 2.30: Equivalence between regularization and constraint

The regularized problem

    min_W  ℓ(W) + λ · r(W)    (2.12)


is “equivalent” to the following constrained problem:

    min_{r(W)≤c}  ℓ(W)    (2.13)

in the sense that


• for any λ ≥ 0 and any solution W of (2.12), there exists some c so that W remains a solution of (2.13);
• under mild conditions, for any c there exists some λ ≥ 0 so that any solution W of (2.13) remains a solution of (2.12).
The regularized problem (2.12) is computationally more appealing as it is unconstrained while the con-
strained problem (2.13) is more intuitive and easier to analyze theoretically. Using this equivalence we see
that ridge regression essentially performs linear least squares regression over the restricted class of linear
functions whose weights are no larger than some constant c.

Exercise 2.31: Is ridge regression linear?

Redo Exercise 2.25 with ridge regression.

Remark 2.32: Choosing the regularization constant

The regularization constant λ ≥ 0 balances the data term and the regularization term in ridge regression
(2.10). A typical way to select λ is through a validation set V that is different from the training set D and
the test set T . For each λ in a candidate set Λ ⊆ R+ (chosen by experience or trial and error), we train our
algorithm on D and get Ŵλ , evaluate the performance of Ŵλ on V (according to some metric perf), choose
the best Ŵλ (w.r.t. perf) and finally apply it on the test set T . Importantly,
• One should never “see” the test set during training and parameter tuning; otherwise we are cheating
and bound to overfit.

• The validation set V should be different from both the training set and the test set.
• We can only use the validation set once in parameter tuning. Burn After Reading! (However, see the
recent work of Dwork et al. (2017) and the references therein for data re-usage.)
• As mentioned in Alert 2.27, the performance measure perf should ideally align with the training
objective.

Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth (2017). “Guilt-free
Data Reuse”. Communications of the ACM, vol. 60, no. 4, pp. 86–93.

Algorithm 2.33: Cross-validation

When data is limited, we do not have the luxury to afford a separate validation set, in which case we may use
the cross-validation procedure for hyperparameter selection. Essentially we split the training set and hold
a (small) part out as validation set. However, to avoid random fluctuations (especially when the validation
set is small), we repeat the split k times and average the results.


Algorithm: Cross-Validation
Input: candidate regularization constants Λ ⊆ R, performance metric perf, k ∈ N, training data D
Output: best regularization constant λ⋆
1 randomly partition D into similarly sized k subsets D1 , . . . , Dk // e.g. |Dl | = ⌊|D|/k⌋ , ∀l < k
2 for λ ∈ Λ do
3 p(λ) ← 0
4 for l = 1, . . . , k do
5 train on D¬l := ∪j̸=l Dj = D \ Dl and get Ŵλ,¬l
6 p(λ) ← p(λ) + perf(Ŵλ,¬l , Dl ) // evaluate Ŵλ,¬l on the holdout set Dl
7 λ⋆ ← argminλ∈Λ p(λ) // assuming the smaller perf the better
With the “optimal” λ⋆ at hand, we re-train on the entire training set D to get Ŵλ⋆ and then evaluate it
on the test set using the same performance measure perf(Ŵλ⋆ , T ).
In the extreme case where k = |D|, each time we train on the entire training set except one data instance.
This is called leave-one out cross-validation. Typically we set k = 10 or k = 5. Note that the larger k is, the
more expensive (computationally) cross-validation is.
As an example, for ridge regression, we can set the performance measure as:

    perf(W, D) := (1/|D|) Σ_{i=1}^{|D|} ∥Y_i − W X_i − b∥²₂ = (1/|D|) ∥Y − WX∥²_F,

where recall that

    X = [X_1 ··· X_{|D|}; 1 ··· 1]  (each column padded with a constant 1),  Y = [Y_1, . . . , Y_{|D|}],  and  W = [W, b].
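A minimal numpy sketch (ours, not part of the notes) of the cross-validation loop above for ridge regression, with the squared-error perf; all names are our own:

    import numpy as np

    def fit_ridge(X, Y, lam):                      # normal equation from Exercise 2.29
        p, n = X.shape
        return np.linalg.solve(X @ X.T + n * lam * np.eye(p), X @ Y.T).T

    def cross_validate(X, Y, lambdas, k=5, seed=0):
        """X: p x n (padded), Y: r x n. Returns the lambda with smallest average validation error."""
        n = X.shape[1]
        idx = np.random.default_rng(seed).permutation(n)
        folds = np.array_split(idx, k)             # D_1, ..., D_k
        best_lam, best_perf = None, np.inf
        for lam in lambdas:
            perf = 0.0
            for l in range(k):
                val = folds[l]
                train = np.concatenate([folds[j] for j in range(k) if j != l])
                W = fit_ridge(X[:, train], Y[:, train], lam)          # train on D \ D_l
                perf += np.mean(np.sum((W @ X[:, val] - Y[:, val]) ** 2, axis=0))
            if perf < best_perf:                   # the smaller perf the better
                best_lam, best_perf = lam, perf
        return best_lam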

Exercise 2.34: Can we use training objective as performance measure?

Explain whether we can use the regularized training objective (1/|D|)∥Y − WX∥²_F + λ∥W∥²_F as our performance measure perf.


3 Logistic Regression
Goal

Understand logistic regression. Comparison with linear regression.

Alert 3.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.
Unlike the slides, in this note we encode the label y ∈ {±1} and we arrange xi in X columnwise.
We use x and w for the original vectors and x and w for the padded versions (with constant 1 and bias
b respectively). Similarly, we use X and W for the original matrices and X and W for the padded versions.

Remark 3.2: Confidence of prediction

In perceptron we make predictions directly through a linear threshold function:

ŷ = sign(w⊤ x + b).

Often, we would also like to know how confident we are about this prediction ŷ. For example, we can use the
magnitude |w⊤ x + b| as the indication of our “confidence.” This choice, however, can be difficult to interpret
at times, after all the magnitude could be any positive real number.
In the literature there are many attempts to turn the real output of a classifier into probability estimates,
see for instance Vovk and Petej (2012) and Vovk et al. (2015).
Vovk, Vladimir and Ivan Petej (2012). “Venn-Abers Predictors”. In: UAI.
Vovk, Vladimir, Ivan Petej, and Valentina Fedorova (2015). “Large-scale probabilistic predictors with and without
guarantees of validity”. In: NIPS.

Remark 3.3: Reduce classification to regression?

Recall that the optimal Bayes classifier is

h⋆ (x) = sign(2η(x) − 1), where η(x) = Pr(Y = 1|X = x).

The posterior probability η(x) is a perfect measure of our confidence in predicting ŷ = h⋆ (x). Therefore,
one may attempt to estimate the posterior probability

η(x) = Pr(Y = 1|X = x) = E(1Y =1 |X = x).

If we define Y = 1Y =1 ∈ {0, 1}, then η(X) is exactly the regression function of (X, Y), see Definition 2.8.
So, in principle, we could try to estimate the regression function based on some i.i.d. samples {(Xi , Yi ) :
i = 1, . . . , n}.
The issue with the above approach is that we are in fact reducing an easier problem (classification) to a
more general hence harder problem (regression). Note that the posterior probability η(x) always lies in [0, 1],
and we would like to exploit this a priori knowledge. However, a generic approach to estimate the regression
function would not be able to take this structure into account. In fact, an estimate of the regression function
(e.g. through linear regression) may not always take values in [0, 1] at all.
As a practical rule of thumb: Never try to solve a more general problem than necessary. (Theoreticians
violate this rule all the time but nothing is meant to be practical in theory anyways.)


Definition 3.4: Bernoulli model

Let us consider the binary classification problem with labels y ∈ {±1}. With the parameterization:

Pr(Y = 1|X = x) =: p(x; w), (3.1)

where p is a function that maps x and w into [0, 1], we then have the Bernoulli model for generating the
label y ∈ {±1}:

Pr(Y = y|X = x) = p(x; w)^{(1+y)/2} [1 − p(x; w)]^{(1−y)/2}.

Let D = {(xi , yi ) : i = 1, . . . , n} be an i.i.d. sample from the same distribution as (X, Y ). The conditional
likelihood factorizes under the i.i.d. assumption:
    Pr(Y_1 = y_1, . . . , Y_n = y_n | X_1 = x_1, . . . , X_n = x_n) = ∏_{i=1}^n Pr(Y_i = y_i | X_i = x_i)
                                                                    = ∏_{i=1}^n p(x_i; w)^{(1+y_i)/2} [1 − p(x_i; w)]^{(1−y_i)/2}.    (3.2)

A standard algorithm in statistics and machine learning for parameter estimation is to maximize the (con-
ditional) likelihood. In this case, we can maximize (3.2) w.r.t. w. Once we figure out w, we can then make
probability estimates on any new test sample x, by simply plugging w and x into (3.1).

Example 3.5: What is that function p(x; w)?

Let us consider two extreme cases:


• p(x; w) = p(x), i.e., the function p can take any value on any x. This is the extreme case where
anything we learn from one data point xi may have nothing to do with what p(xj ) can take. Denote
pi = p(xi ), take logarithm on (3.2), and negate:
    min_{p_1,...,p_n}  −(1/2) Σ_{i=1}^n (1 + y_i) log p_i + (1 − y_i) log(1 − p_i).

Since the p_i’s are not related, we can solve them separately. Recall the definition of the KL divergence in Definition 15.3, we know

    −(1+y_i)/2 · log p_i − (1−y_i)/2 · log(1 − p_i) = KL( [(1+y_i)/2, (1−y_i)/2] ∥ [p_i, 1 − p_i] ) − (1+y_i)/2 · log (1+y_i)/2 − (1−y_i)/2 · log (1−y_i)/2.

Since the KL divergence is nonnegative, to maximize the conditional likelihood we should set

    p_i = (1 + y_i)/2.

This result does make sense, since for yi = 1 we set pi = 1 while for yi = −1 we set pi = 0 (so that
1 − pi = 1, recall that pi is the probability for yi being 1).
• p(x; w) = p(w) = p, i.e., the function p is independent of x hence is a constant. This is the extreme case where anything we learn from one data point immediately applies in the same way to any other data point. Similarly to the above, we find p by solving

    min_p  −(1/2) Σ_{i=1}^n (1 + y_i) log p + (1 − y_i) log(1 − p).

Let p̄ = (1/2n) Σ_{i=1}^n (1 + y_i), which is exactly the fraction of positive examples in our training set D. Obviously then 1 − p̄ = (1/2n) Σ_{i=1}^n (1 − y_i). Prove by yourself that p̄ is indeed the optimal choice. This

again makes sense: if we have to pick one and only one probability estimate (for every data point),
intuitively we should just use the fraction of positive examples in our training set.

The above two extremes are not satisfactory: it is either too flexible by allowing each data point to have
its own probability estimate (which may have nothing to do with each other hence learning is impossible) or
it is too inflexible by restricting every data point to use the same probability estimate. Logistic regression,
which we define next, is an interpolation between the two extremes.

Definition 3.6: Logistic Regression (Cox 1958)

Motivated by the two extreme cases in Example 3.5, we want to parameterize p(x; w) in a not-too-flexible
and not-too-inflexible way. One natural choice is to set p as an affine function (how surprising): p(x; w) =
w⊤ x + b. However, this choice has the disadvantage in the sense that the left-hand side takes value in [0, 1]
while the right-hand side takes value in R. To avoid this issue, we first take a logit transformation of p and
then equate it to an affine function:
    log [ p(x; w) / (1 − p(x; w)) ] = w⊤x + b.
The ratio on the left-hand side is known as the odds ratio (probability of 1 divided by probability of −1). Equivalently,

    p(x; w) = 1 / (1 + exp(−w⊤x − b)) = sgm(w⊤x + b),   where   sgm(t) = 1 / (1 + exp(−t)) = exp(t) / (1 + exp(t))    (3.3)
is the so-called sigmoid function. Note that our definition of p involves x but not the label y. This is crucial
as later on we will use p(x; w) to predict the label y.
Plugging (3.3) into the conditional likelihood (3.2) and maximizing over w = (w, b), we get the formulation of logistic regression, in the equivalent (negated log-likelihood) form:

    min_w  (1/2) Σ_{i=1}^n (1 + y_i) log(1 + exp(−x_i⊤w)) + (1 − y_i) log(1 + exp(−x_i⊤w)) + (1 − y_i) x_i⊤w,

which is usually written in the more succinct form:

    min_w  Σ_{i=1}^n lgt(y_i ŷ_i),   where   ŷ_i = w⊤x_i,   and   lgt(t) := log(1 + exp(−t))    (3.4)

is the so-called logistic loss in the literature. (Clearly, the base of log is immaterial and we use the natural
log.) In the above we have applied padding, see Remark 1.17, to ease notation.
[Figure: the sigmoid function sgm(t) = 1/(1 + exp(−t)) (left) and the logistic loss lgt(t) = log(1 + exp(−t)) (right).]
Cox, D. R. (1958). “The Regression Analysis of Binary Sequences”. Journal of the Royal Statistical Society. Series B
(Methodological), vol. 20, no. 2, pp. 215–242.


Alert 3.7: Binary labels: {±1} or {0, 1}?

In this note we choose to encode binary labels as {±1}. In the literature the alternative choice {0, 1} is
also common. In essence there is really no difference if we choose one convention or the other: the eventual
conclusions would be the same. However, some formulas do look a bit different on the surface! For example,
the neat formula (3.4) becomes

    min_w  Σ_{i=1}^n log[ exp((1 − y_i) w⊤x_i) + exp(−y_i w⊤x_i) ],

had we chosen the convention yi ∈ {0, 1}. Always check the convention before you subscribe to any formula!

Remark 3.8: Prediction with confidence

Once we solve w as in (3.4) and given a new test sample x, we can compute p(x; w) = 1/(1 + exp(−w⊤x)) and predict

    ŷ(x) = 1 if p(x; w) ≥ 1/2 ⇐⇒ w⊤x ≥ 0,   and   ŷ(x) = −1 otherwise.

In other words, for predicting the label we are back to the familiar rule ŷ = sign(w⊤ x). However, now we
are also equipped with the probability confidence p(x; w).
It is clear that logistic regression is a linear classification algorithm, whose decision boundary is given by
the hyperplane

H = {x : w⊤ x = 0}.

Alert 3.9: Something is better than nothing?

It is tempting to prefer logistic regression over other classification algorithms since the former spits out not
only label predictions but also probability confidences. However, one should keep in mind that in logistic
regression, we make the assumption (see Equation (3.3))
    Pr(Y = 1|X = x) = sgm(w⊤x) = 1 / (1 + exp(−w⊤x)),

which may or may not hold on your particular dataset. So the probability estimates we get from logistic
regression can be totally off. Is it really better to have a probability estimate that is potentially very wrong
than not to have anything at all? To exaggerate in another extreme, for any classification algorithm we can
“make up” a 100% confidence on each of its predictions. Does this “completely fake” probability confidence
bring any comfort? But, how is this any different from the numbers you get from logistic regression?

Alert 3.10: Do not do extra work

Logistic regression does more than classification, since it also tries to estimate the posterior probabilities.
However, if prediction (of the label) is our sole goal, then we do not have to, and perhaps should not, estimate
the posterior probabilities. Put it more precisely, all we need to know is whether or not η(x) = Pr(Y =
1|X = x) is larger than 1/2. The precise value of η(x) is not important; only its comparison with 1/2 is.
As we shall see, support vector machines, in contrast, only tries to estimate the decision boundary (i.e. the
relative comparison between η(x) and 1/2), hence can be more efficient.


Remark 3.11: More than logistic regression

The main idea behind logistic regression is to equate the posterior probability p(x; w) with some transforma-
tion F of the affine function w⊤ x. Here the transformation F turns a real number into some value in [0, 1]
(where the posterior probability belongs to). Obviously, we can choose F to be any cumulative distribution
function (cdf) on the real line. Indeed, plugging the formula

p(x; w) = F (w⊤ x),

into the conditional likelihood (3.2) gives us many variants of logistic regression. If we choose F to be the
cdf of the logistic distribution (hence the name)
    F(x; µ, s) = 1 / (1 + exp(−(x − µ)/s)),

where µ is the mean and s is some shape parameter (with variance s2 π 2 /3), then we recover logistic regression
(provided that µ = 0 and s = 1).
If we choose F to be the cdf of the standard normal distribution, then we get the so-called probit
regression.

Algorithm 3.12: Gradient descent for logistic regression

Unlike linear regression, logistic regression no longer admits a closed-form solution. Instead, we can apply
gradient descent to iteratively converge to a solution. All we need is to apply the chain rule to compute the
gradient of each summand of the objective function in (3.4):
    ∇lgt(y_i x_i⊤w) = − [exp(−t)/(1 + exp(−t))]|_{t = y_i w⊤x_i} · y_i x_i = −sgm(−y_i w⊤x_i) · y_i x_i
                    = −y_i x_i + sgm(y_i w⊤x_i) · y_i x_i
                    = (p(x_i; w) − 1) x_i if y_i = 1,   and   (p(x_i; w) − 0) x_i if y_i = −1
                    = (p(x_i; w) − (y_i + 1)/2) x_i.


In the following algorithm, we need to choose a step size η. A safe choice is

    η = 4 / ∥X∥²_sp,

namely, the inverse of (an upper bound on) the largest eigenvalue of the Hessian (see below). An alternative is to start with
some small η and decrease it whenever we are not making progress (known as step size annealing).
Algorithm: Gradient descent for binary logistic regression.
Input: X ∈ Rd×n , y ∈ {−1, 1}n (training set), initialization w ∈ Rp
Output: w ∈ Rp
1 for t = 1, 2, . . . , maxiter do
2 sample a minibatch I = {i1 , . . . , im } ⊆ {1, . . . , n}
3 g←0
4   for i ∈ I do                           // use for-loop only in parallel implementation
5       p_i ← 1/(1 + exp(−w⊤x_i))          // in serial, replace with p ← 1/(1 + exp(−X_{:,I}⊤ w))
6       g ← g + (p_i − (1+y_i)/2) x_i      // in serial, replace with g ← X_{:,I}(p − (1+y_I)/2)
7 choose step size η > 0
8 w ← w − ηg
9 check stopping criterion // e.g. ∥ηg∥ ≤ tol

For small problems (n ≤ 10^4, say), we can set I = {1, . . . , n}, i.e., use the entire dataset in every iteration.
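A minimal numpy translation of the serial updates above (a sketch of ours, not official course code); X is d x n with padding already applied and y ∈ {−1, +1}^n:

    import numpy as np

    def logistic_gd(X, y, eta=None, maxiter=1000, tol=1e-6, batch=None, seed=0):
        d, n = X.shape
        w = np.zeros(d)
        rng = np.random.default_rng(seed)
        if eta is None:
            eta = 4.0 / np.linalg.norm(X, 2) ** 2        # safe step size eta = 4 / ||X||_sp^2
        for _ in range(maxiter):
            I = np.arange(n) if batch is None else rng.choice(n, size=batch, replace=False)
            p = 1.0 / (1.0 + np.exp(-X[:, I].T @ w))     # p = sgm(X_{:,I}^T w)
            g = X[:, I] @ (p - (1 + y[I]) / 2)           # (minibatch) gradient
            w -= eta * g
            if np.linalg.norm(eta * g) <= tol:           # stopping criterion
                break
        return w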


Algorithm 3.13: Newton iteration for logistic regression

We can also apply Newton’s algorithm for solving logistic regression. In addition to computing the
gradient, we now also need to compute the Hessian:

    H_i = ∇²_w lgt(y_i w⊤x_i) = x_i [∇_w p(x_i; w)]⊤ = p(x_i; w)[1 − p(x_i; w)] x_i x_i⊤.

Algorithm: Newton iteration for binary logistic regression.


Input: X ∈ Rd×n , y ∈ {−1, 1}n (training set), initialization w ∈ Rp
Output: w ∈ Rp
1 for t = 1, 2, . . . , maxiter do
2 sample a minibatch I = {i1 , . . . , im } ⊆ {1, . . . , n}
3 g ← 0, H ← 0
4   for i ∈ I do                           // use for-loop only in parallel implementation
5       p_i ← 1/(1 + exp(−w⊤x_i))          // in serial, replace with p ← 1/(1 + exp(−X_{:,I}⊤ w))
6       g ← g + (p_i − (1+y_i)/2) x_i      // in serial, replace with g ← X_{:,I}(p − (1+y_I)/2)
7       H ← H + p_i(1 − p_i) x_i x_i⊤      // in serial, replace with H ← X_{:,I} diag(p ⊙ (1 − p)) X_{:,I}⊤
8 choose step size η > 0
9 w ← w − ηH −1 g // solve H −1 g as linear system
10 check stopping criterion // e.g. ∥ηg∥ ≤ tol

Typically, we need to tune η at the initial phase but quickly we can just set η ≡ 1. Newton’s algorithm
is generally much faster than gradient descent. The downside, however, is that computing and storing the
Hessian can be expensive. For example, Algorithm 3.12 has per-step time complexity O(md) and space
complexity O(d) (or O(nd) if X is stored explicitly in memory) while Algorithm 3.13 has per-step time
complexity O(md² + d³) and space complexity O(d²) (or O(nd + d²) if X is stored explicitly in memory).
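A sketch (ours) of the full-batch Newton iteration, matching the serial comments in the pseudocode above:

    import numpy as np

    def logistic_newton(X, y, eta=1.0, maxiter=50, tol=1e-8):
        d, n = X.shape
        w = np.zeros(d)
        for _ in range(maxiter):
            p = 1.0 / (1.0 + np.exp(-X.T @ w))
            g = X @ (p - (1 + y) / 2)
            # H = X diag(p ⊙ (1-p)) X^T; a tiny ridge keeps the linear solve well-posed
            H = (X * (p * (1 - p))) @ X.T + 1e-10 * np.eye(d)
            step = np.linalg.solve(H, g)             # solve H^{-1} g as a linear system
            w -= eta * step
            if np.linalg.norm(eta * step) <= tol:
                break
        return w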

Alert 3.14: Overflow and underflow

Numerically computing exp(a) can be tricky when the vector a has very large or small entries. The usual
trick is to shift the origin as follows. Let t = maxi ai − mini ai be the range of the elements in a. Then, after
shifting 0 to t/2:

exp(a) = exp(a − t/2) exp(t/2).

Computing exp(a − t/2) may be numerically better than computing exp(a) directly. The scaling factor
exp(t/2) usually will cancel out in later computations so we do not need to compute it. (Even when we have
to, it may be better to return t/2 than exp(t/2).)
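For instance, a common numerically stable way (a sketch of ours, not prescribed by the notes) to evaluate the sigmoid and the logistic loss without overflowing exp:

    import numpy as np

    def sigmoid(t):
        # t: numpy array; evaluate 1/(1+exp(-t)) without exponentiating large positive numbers
        out = np.empty_like(t, dtype=float)
        pos = t >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-t[pos]))
        e = np.exp(t[~pos])              # here t < 0, so exp(t) cannot overflow
        out[~pos] = e / (1.0 + e)
        return out

    def logistic_loss(t):
        # log(1+exp(-t)) = max(-t, 0) + log(1 + exp(-|t|)), stable for both signs of t
        return np.maximum(-t, 0) + np.log1p(np.exp(-np.abs(t)))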

Remark 3.15: Logistic regression as iterative re-weighted linear regression

Let us define the diagonal matrix Ŝ = diag(p̂ ⊙ (1 − p̂)). If we set η ≡ 1, then we can interpret Newton’s iteration in Algorithm 3.13 as iterative re-weighted linear regression (IRLS):


    w⁺ = w − (XŜX⊤)⁻¹ X(p̂ − (1+y)/2)
       = (XŜX⊤)⁻¹ [(XŜX⊤)w − X(p̂ − (1+y)/2)]
       = (XŜX⊤)⁻¹ XŜỹ,    ỹ := X⊤w − Ŝ⁻¹(p̂ − (1+y)/2)
       = argmin_w Σ_{i=1}^n ŝ_i (w⊤x_i − ỹ_i)²,    ŝ_i := p̂_i(1 − p̂_i),    (3.5)

where the last equality can be seen by setting the derivative w.r.t. w to 0.


So, Newton’s algorithm basically consists of two steps:


• given the current w, compute the weights ŝi and update the targets ỹi . Importantly, if the current w
yields very confident prediction p̂i for the i-th training example (i.e., when p̂i is close to 0 or 1), then
the corresponding weight ŝi is close to 0, i.e., we are down-weighting this training example whose label
we are already fairly certain about. On the other hand, if p̂i is close to 1/2, meaning we are very unsure
about the i-th training example, then the corresponding weight ŝi will be close to the maximum value
1/4, i.e. we pay more attention to it in the next iteration.
• solve the re-weighted least squares problem (3.5).
We have to iterate the above two steps because p̂ hence ŝ are both functions of w themselves. It would be
too difficult to solve w in one step. This iterative way of solving complicated problems is very typical in
machine learning.

Remark 3.16: Linear regression vs. logistic regression

In the following comparison, Ŝ = diag(p̂ ⊙ (1 − p̂)). We note that as ŷ_i deviates from y_i, the least squares loss varies from 0 to ∞. Similarly, as p̂_i deviates from y_i, the cross-entropy loss varies from 0 to ∞ as well.

Linear regression:
• loss: least-squares Σ_{i=1}^n (y_i − ŷ_i)²
• prediction: ŷ_i = w⊤x_i
• objective: ∥y − ŷ∥²₂
• gradient: w ← w − ηX(ŷ − y)
• Newton: w ← w − η(XX⊤)⁻¹X(ŷ − y)

Logistic regression:
• loss: cross-entropy Σ_{i=1}^n −(1+y_i)/2 · log p̂_i − (1−y_i)/2 · log(1 − p̂_i)
• prediction: ŷ_i = sign(w⊤x_i), p̂_i = sgm(w⊤x_i)
• objective: KL((1+y)/2 ∥ p̂)
• gradient: w ← w − ηX(p̂ − (1+y)/2)
• Newton: w ← w − η(XŜX⊤)⁻¹X(p̂ − (1+y)/2)

Exercise 3.17: Linearly separable

If the training data D = {(xi , yi ) : i = 1, . . . , n} is linearly separable (see Definition 1.24), does logistic
regression have a solution w? What happens if we run gradient descent (Algorithm 3.12) or Newton’s
iteration (Algorithm 3.13)?

Exercise 3.18: Regularization

Derive the formulation and an algorithm (gradient or Newton) for ℓ2 -regularized logistic regression, where
we add λ∥w∥22 .

Remark 3.19: More than 2 classes

We can easily extend logistic regression to c > 2 classes. As before, we make the assumption

Pr(Y = k|X = x) = fk (W⊤ x), k = 1, . . . , c,

where W = [w1 , . . . , wc ] ∈ Rp×c and the vector-valued function f = [f1 , . . . , fc ] : Rc → ∆c−1 maps a vector
of size c × 1 to a probability vector in the simplex ∆c−1 . Given an i.i.d. training dataset D = {(xi , yi ) :
i = 1, . . . , n}, where each yi ∈ {0, 1}c is a one-hot vector, i.e. 1⊤ yi = 1, then the (negated) conditional
log-likelihood is:
    −log Pr(Y_1 = y_1, . . . , Y_n = y_n | X_1 = x_1, . . . , X_n = x_n) = −log ∏_{i=1}^n ∏_{k=1}^c [f_k(W⊤x_i)]^{y_{ki}}
                                                                         = Σ_{i=1}^n Σ_{k=1}^c −y_{ki} log f_k(W⊤x_i).

To minimize the negated log-likelihood, we can apply gradient descent or Newton’s iteration as before:
    ∇ℓ_i(W) = Σ_{k=1}^c −y_{ki} · [1/f_k(W⊤x_i)] · x_i [∇f_k]⊤|_{W⊤x_i},    (3.6)

    ∀G ∈ R^{p×c},   [∇²ℓ_i(W)](G) = Σ_{k=1}^c y_{ki} · [1/f_k²(W⊤x_i)] · (x_i x_i⊤) G [∇f_k ∇f_k⊤ − f_k ∇²f_k]|_{W⊤x_i},    (3.7)

where recall that ∇ℓi (W) ∈ Rp×c and ∇2 ℓi (W) : Rp×c → Rp×c . Note that due to our one-hot encoding, the
above summation has actually one term.

Definition 3.20: Multiclass logistic regression, a.k.a. Multinomial logit or softmax regression

The multinomial logit model corresponds to choosing the softmax function:

    f(W⊤x) = softmax(W⊤x),   where   softmax : R^c → ∆_{c−1},   y ↦ exp(y) / (1⊤exp(y)).

Letting p̂_i = f(W⊤x_i) and specializing (3.6) and (3.7) to the softmax function, we obtain its gradient and Hessian:

    ∇ℓ_i(W) = x_i (p̂_i − y_i)⊤,
    ∀G ∈ R^{p×c},   [∇²ℓ_i(W)](G) = (x_i x_i⊤) G (diag(p̂_i) − p̂_i p̂_i⊤).

In the multiclass setting, solving the Newton step could quickly become infeasible (O(d³c³)). As Böhning (1992) pointed out, we can instead use the upper bound:

    0 ⪯ diag(p̂_i) − p̂_i p̂_i⊤ ⪯ ½ (I_k − (1/(k+1)) 11⊤),

which would reduce the computation to inverting only the data matrix XX⊤ = Σ_i x_i x_i⊤.
Böhning, Dankmar (1992). “Multinomial logistic regression algorithm”. Annals of the Institute of Statistical Mathe-
matics, vol. 44, no. 1, pp. 197–200.
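A sketch (ours, not part of the notes) of full-batch gradient descent for softmax regression with one-hot labels Y (c x n), using the gradient formula ∇ℓ_i(W) = x_i(p̂_i − y_i)⊤ above:

    import numpy as np

    def softmax(Z):
        # columnwise softmax with the usual max-shift for numerical stability
        Z = Z - Z.max(axis=0, keepdims=True)
        E = np.exp(Z)
        return E / E.sum(axis=0, keepdims=True)

    def softmax_regression_gd(X, Y, eta=0.1, maxiter=500):
        """X: p x n (padded), Y: c x n one-hot. Returns W: p x c."""
        p, n = X.shape
        c = Y.shape[0]
        W = np.zeros((p, c))
        for _ in range(maxiter):
            P = softmax(W.T @ X)            # c x n matrix of class probabilities
            G = X @ (P - Y).T / n           # average of x_i (p_i - y_i)^T
            W -= eta * G
        return W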

Remark 3.21: Mean and Covariance

We point out the following “miracle:” Let Y be a random vector taking values on standard bases {ek ∈
{0, 1}c : k = 1, . . . , c, 1⊤ ek = 1} and following the multinomial distribution:

Pr(Y = ek ) = pk , k = 1, . . . , c.

Then, straightforward calculation verifies:

E(Y) = p,
Cov(Y) = diag(p) − pp⊤ .

Remark 3.22: Removing translation invariance in softmax

In the above multiclass logistic regression formulation, we used a matrix W with c columns to represent c


classes. Note however that the softmax function is translation invariant:

∀w, softmax((W + w1⊤ )⊤ x) = softmax(W⊤ x).

Therefore, for identifiability purposes, we may assume w.l.o.g. wc = 0 and we need only optimize the first
c − 1 columns. If we denote L(w1 , . . . , wc−1 , wc ) as the original negated log-likelihood in Definition 3.20,
then fixing wc = 0 changes our objective to L(w1 , . . . , wc−1 , 0). Clearly, the gradient and Hessian formula
in Definition 3.20 still works after deleting the entries corresponding to wc .
Setting c = 2 we recover binary logistic regression, with the alternative encoding y ∈ {0, 1} though.

Exercise 3.23: Alternative constraint to remove translation invariance

An alternative fix to the translation-invariance issue of softmax is to add the following constraint:

W1 = 0. (3.8)
In this case our objective changes to L(w_1, . . . , w_{c−1}, −Σ_{k=1}^{c−1} w_k). How should we modify the gradient and Hessian?
Interestingly, after we add ℓ2 regularization to the unconstrained multiclass logistic regression:

    min_W  L(w_1, . . . , w_c) + λ∥W∥²_F,

the solution automatically satisfies the constraint (3.8). Why? What if we added ℓ1 regularization?

Definition 3.24: Generalized linear models (GLMs)

The similarity between linear regression and logistic regression is not coincidental: they both belong to
generalized linear models (i.e. exponential family noise distributions), see Nelder and Wedderburn (1972).
Nelder, J. A. and R. W. M. Wedderburn (1972). “Generalized Linear Models”. Journal of the Royal Statistical Society.
Series A (General), vol. 135, no. 3, pp. 370–384.


4 Support Vector Machines (SVM)


Goal

Define and understand the classical hard-margin SVM for binary classification. Dual view.

Alert 4.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
For less mathematical readers, think of the norm ∥ · ∥ and its dual norm ∥ · ∥◦ as the Euclidean ℓ2 norm
∥ · ∥2 . Treat all distances as the Euclidean distance. All of our pictures are for this special case.
This note is likely to be updated again soon.

Definition 4.2: SVM as maximizing minimum distance

[Figure: a toy linearly separable dataset with two positive and two negative examples, shown with two candidate separating hyperplanes.]

Given a (strictly) linearly separable dataset D = {(xi , yi ) ⊆ Rd × {±1} : i = 1, . . . , n}, there exists a
separating hyperplane Hw = {x ∈ Rd : w⊤ x + b = 0}, namely that

∀i, yi (w⊤ xi + b) > 0.

In fact, there exist infinitely many separating hyperplanes: if we perturb (w, b) slightly, the resulting hyper-
plane would still be separating, thanks to continuity. Is there a particular separating hyperplane that stands
out, and be “optimal”?
The answer is yes! Let Hw be any separating hyperplane (w.r.t. the given dataset D). We can compute
the distance from each training sample xi to the hyperplane Hw :

    dist(x_i, H_w) := min_{x∈H_w} ∥x − x_i∥◦    (e.g., the typical choice ∥·∥◦ = ∥·∥ = ∥·∥₂)
                    ≥ |w⊤(x − x_i) + b − b| / ∥w∥    (Cauchy-Schwarz, see Definition 1.25)
                    = |w⊤x_i + b| / ∥w∥    (equality at x = x_i − z(b + w⊤x_i)/∥w∥², where z⊤w = ∥w∥², ∥z∥◦ = ∥w∥, i.e. z ∈ ∂[½∥w∥²])
                    = y_i(w⊤x_i + b) / ∥w∥    (y_i ∈ {±1} and H_w is separating).    (4.1)
Here and in the following, we always assume w.l.o.g. that the dataset D contains at least 1 positive example
and 1 negative example, so that w = 0 with any b cannot be a separating hyperplane.
Among all separating hyperplanes, support vector machines (SVM) tries to find one that maximizes the
minimum distance (with the typical choice ∥ · ∥ = ∥ · ∥2 in mind):
    max_{w: ∀i, y_i ŷ_i > 0}  min_{i=1,...,n}  y_i ŷ_i / ∥w∥,   where ŷ_i = w⊤x_i + b.    (4.2)


We remark that the above formulation is scaling-invariant: If w = (w, b) is optimal, then so is γw for any
γ > 0 (the fraction is unchanged and the constraint on w is not affected). This is not at all surprising,
as w and γw really represent the same hyperplane: Hw = Hγw . Note also that the separating condition
∀i, yi ŷi > 0 can be omitted since it is automatically satisfied if the dataset D is indeed (strictly) linearly
separable.

Alert 4.3: Margin as minimum distance

We repeat the formula in Definition 4.2:


h i |w⊤ x + b| y(w⊤ x + b) yŷ
dist(x, Hw ) := min ∥z − x∥◦ = = = ,
z∈Hw ∥w∥ ∥w∥ ∥w∥

where the third equality holds if yŷ ≥ 0 and y ∈ {±1}. Given any hyperplane Hw , we define its margin
w.r.t. a data point (x, y) as:

yŷ
γ((x, y); Hw ) := , ŷ = w⊤ x + b,
∥w∥

Geometrically, when the hyperplane Hw classifies the data point (x, y) correctly (i.e. yŷ > 0), this margin is
exactly the distance from x to the hyperplane Hw , and the negation of the distance otherwise.
Fixing any hyperplane Hw , we can extend the notion of its margin to a dataset D = {(xi , yi ) : i = 1, . . . , n}
by taking the (worst-case) minimum:
    γ(D; H_w) := min_{i=1,...,n} γ((x_i, y_i); H_w) = min_i y_i ŷ_i / ∥w∥,   ŷ_i := w⊤x_i + b.

Again, when the hyperplane Hw (strictly) separates the dataset D, the margin γ(D; Hw ) > 0 coincides with
the minimum distance, as we saw in Definition 4.2. However, when D is not (strictly) separated by Hw , the
margin γ(D; Hw ) ≤ 0 is the negation of the maximum distance among all wrongly classified data points.
We can finally define the margin of a dataset D as the (best-case) maximum among all hyperplanes:
    γ(D) := max_w γ(D; H_w) = max_w min_{i=1,...,n} y_i ŷ_i / ∥w∥.    (4.3)

Again, when the dataset D is (strictly) linearly separable, the margin γ(D) > 0 reduces to the minimum
distance to the SVM hyperplane, in which case the margin definition here coincides with what we saw in
Remark 1.30 (with the choice ∥ · ∥◦ = ∥ · ∥ = ∥ · ∥2 ) and characterizes “how linearly separable” our dataset
D is. On the other hand, when D is not (strictly) linearly separable, the margin γ(D) ≤ 0.
To summarize, hard-margin SVM, as defined in Definition 4.2, maximizes the margin among all hyper-
planes on a (strictly) linearly separable dataset. Interestingly, with this interpretation, the hard-margin
SVM formulation (4.3) continues to make sense even on a linearly inseparable dataset.
In the literature, people sometimes call the unnormalized quantity yŷ the margin, which is fine as long as the scale ∥w∥ is kept constant.
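A tiny numpy sketch (ours) of the margin computations above for the Euclidean norm; X is d x n and y ∈ {−1, +1}^n:

    import numpy as np

    def margin_point(x, y, w, b):
        # gamma((x, y); H_w) = y (w^T x + b) / ||w||_2
        return y * (w @ x + b) / np.linalg.norm(w)

    def margin_dataset(X, y, w, b):
        # gamma(D; H_w): worst case over the training examples
        return np.min(y * (X.T @ w + b)) / np.linalg.norm(w)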

Definition 4.4: Alternative definition of margin

We give a slightly different definition of margin here: γ + . As the notation suggests, γ + coincides with the
definition in Alert 4.3 on a (strictly) linearly separable dataset, and reduces to 0 otherwise.
• Given any hyperplane Hw , we define its margin w.r.t. a data point (x, y) as:
    γ⁺((x, y); H_w) := (yŷ)₊ / ∥w∥,   ŷ = w⊤x + b,
where recall (t)+ = max{t, 0} is the positive part. Geometrically, when the hyperplane Hw classifies the


data point (x, y) correctly (i.e. yŷ ≥ 0), this margin is exactly the distance from x to the hyperplane
Hw , and 0 otherwise.

• Fixing any hyperplane Hw , we can extend the notion of its margin to a dataset D = {(xi , yi ) : i =
1, . . . , n} by taking the (worst-case) minimum:
    γ⁺(D; H_w) := min_{i=1,...,n} γ⁺((x_i, y_i); H_w) = min_i (y_i ŷ_i)₊ / ∥w∥,   ŷ_i := w⊤x_i + b.

Again, when the hyperplane Hw (strictly) separates the dataset D, the margin γ + (D; Hw ) > 0 coincides
with the minimum distance, as we saw in Definition 4.2. However, when D is not (strictly) separated
by Hw , the margin γ + (D; Hw ) = 0.
• We can finally define the margin of a dataset D as the (best-case) maximum among all hyperplanes:
    γ⁺(D) := max_w γ⁺(D; H_w) = max_w min_{i=1,...,n} [y_i ŷ_i]₊ / ∥w∥.

Again, when the dataset D is (strictly) linearly separable, the margin γ + (D) reduces to the minimum
distance to the SVM hyperplane. In contrast, when D is not (strictly) linearly separable, the margin
γ + (D) = 0.

Remark 4.5: Important standardization trick

A simple standardization trick in optimization is to introduce an extra variable so that we can reduce an
arbitrary objective function to the canonical linear function. For instance, if we are interested in solving

    min_w  f(w),

where f can be any complicated nonlinear function. Upon introducing an extra variable t, we can reformulate
our minimization problem equivalently as:

    min_{(w,t): f(w)≤t}  t,

where the new objective (0; 1)⊤ (w; t) is a simple linear function of (w; t). The expense, of course, is that we
have to deal with the extra constraint f (w) ≤ t now.

Remark 4.6: Removing homogeneity by normalizing direction

To remove the scaling-invariance mentioned in Definition 4.2, we can restrict the direction vector w to have
unit norm, which happened to yield the same formulation as that in Rosen (1965) (see Remark 4.23 below
for more details):
    max_{w:∥w∥=1} min_{i=1,...,n} y_i ŷ_i.    (4.4)

Applying the trick in Remark 4.5 (and noting we are maximizing here) yields the reformulation:
    max_{(w,δ):∥w∥=1} δ,   s.t.   min_{i=1,...,n} y_i ŷ_i ≥ δ  ⇐⇒  y_i ŷ_i ≥ δ, ∀i = 1, . . . , n,

which is completely equivalent to (4.3) (except for excluding the trivial solution w = 0).
Observe that on any linearly separable dataset, at optimality we can always achieve δ ≥ 0. Thus, we
may relax the unit norm constraint on w slightly:
    max_{w,δ}  δ    (4.5)


s.t. ∥w∥ ≤ 1
yi ŷi ≥ δ, ∀i = 1, . . . , n.

It is clear if the dataset D is indeed linearly separable, at maximum we may choose ∥w∥ = 1, hence the
“relaxation” is in fact equivalent (on any linearly separable dataset that consists of at least 1 positive and 1
negative).
Note that (4.5) is exactly the bound we got for the perceptron algorithm, see Remark 1.30. Thus, SVM
could have been derived by optimizing the convergence bound of perceptron. So, theory does inspire new
algorithms, although this was not what really happened in history: Vapnik and Chervonenkis (1964) did not
seem to be aware of Theorem 1.28 at the time; in fact they wrongly claimed that the perceptron algorithm
may not find a solution even when one exists....
Rosen, J.B (1965). “Pattern separation by convex programming”. Journal of Mathematical Analysis and Applications,
vol. 10, no. 1, pp. 123–134.
Vapnik, Vladimir N. and A. Ya. Chervonenkis (1964). “On a class of perceptrons”. Automation and Remote Control,
vol. 25, no. 1, pp. 112–120.

Exercise 4.7: Detecting linear separability

Prove an additional advantage of the “relaxation” (4.5): its maximum value is always at least 0, and the value 0 is attained iff the dataset is not (strictly) linearly separable.
In contrast, prove that the original formulation (4.4) with exact unit norm constraint
• is equivalent to (4.5) with strictly positive maximum value, iff the dataset is (strictly) linearly separable;
• is different from (4.5) with strictly negative maximum value, iff the dataset is not (strictly) linearly
separable and the intersection of positive and negative convex hulls has nonempty (relative) interior;
• is similar to (4.5) with exactly 0 maximum value, iff the dataset is not (strictly) linearly separable and
the intersection of positive and negative convex hulls has empty (relative) interior.

Remark 4.8: History of SVM

In this box we summarize the first SVM paper due to Vapnik and Lerner (1963). Our terminology and
notation are different from the somewhat obscure original.
Let our universe be X and D ⊆ X a training set. Vapnik and Lerner (1963) considered essentially the
unsupervised setting, where labels are not provided even at training. Let φ : X → S ⊆ H be a mapping
that turns the original input x ∈ X into a point z := φ(x) in the unit sphere S := {z ∈ H : ∥z∥2 = 1}
of a Hilbert space H (with induced norm ∥ · ∥2 ). Our goal is to divide the data into c (disjoint) categories
C1 , . . . , Cc , each of which is represented by a center ck ∈ S so that

    max_{i∈C_k} ∥z_i − c_k∥²₂ < min_{j∉C_k} ∥z_j − c_k∥²₂.

In other words, if we circumscribe the training examples in each category Ck by the smallest ball with center
in the unit sphere S, then these balls only contain training examples in the same category (in fact this could
be how we define categories). (However, these balls may still intersect.) Since both z and c have unit norm,
equivalently we may require

    min_{i∈C_k} ⟨z_i, c_k⟩ > max_{j∉C_k} ⟨z_j, c_k⟩.    (4.6)

In other words, there exists a hyperplane with direction c_k that (strictly) separates the category C_k from the rest of the categories C_{¬k} := ∪_{l≠k} C_l. Let us define

    r_k = max_{i∈C_k} ∥z_i − c_k∥₂,   k = 1, . . . , c,


so that we may declare any point z ∈ B(c_k, r_k) := {z ∈ H : ∥z − c_k∥₂ ≤ r_k}, or equivalently any z ∈ H_k^+ := {z : ⟨w_k, z⟩ ≥ 1} where w_k := c_k / (1 − ½r_k²), is in category k.
Vapnik and Lerner (1963) considered two scenarios:
• transductive learning (distinction): each (test) example is known to belong to one and only one ball
Bk := B(ck , rk ). In this case we may identify the category for any x ∈ X :
    x ∈ C_k  ⇐⇒  z := φ(x) ∈ B_k \ B_{¬k},  where B_{¬k} := ∪_{l≠k} B_l,  or simply z ∈ B_k.

In other words, ∥z − ck ∥2 ≤ rk and ∥z − cl ∥2 > rl for all l ̸= k, or equivalently


    ∀l ≠ k,  ⟨z, w_k⟩ ≥ 1 > ⟨z, w_l⟩,   where   w_k := c_k / (1 − ½r_k²).

Therefore, we may use the simple rule to predict the category c(x) of x (under transformation φ):

    c(x) = argmax_{k=1,...,c} ⟨φ(x), w_k⟩.    (4.7)

• inductive learning (recognition): each (test) example may be in several balls or may not be in any ball
at all. We may still use the same prediction rule (4.7), but declare “failure”

– if maxk=1,...,c ⟨φ(x), wk ⟩ < 1: x does not belong to any existing category;


– on the other hand, if |c(x)| > 1: x belongs to at least two existing categories. Ambiguous.
Vapnik and Lerner (1963) ended with an announcement of the main result in Remark 4.11, namely how
to find the centers ck (or equivalently the weights wk ) using (4.6).
Vapnik, Vladimir N. and A. Ya. Lerner (1963). “Pattern Recognition using Generalized Portraits”. Automation and
Remote Control, vol. 24, no. 6, pp. 709–715.

Remark 4.9: More on (Vapnik and Lerner 1963)

Vapnik and Lerner (1963) defined a dataset as indefinite, if there exists some test example x so that

    ∀k, z ∈ B(c_k, r_{¬k}) \ B(c_k, r_k),   where   r_{¬k} := max_{l≠k} r_l,

i.e., z is not in category k but would be if we increase its radius rk to that of the best alternative r¬k .
Vapnik and Lerner (1963) proposed to use the (positive) quantity

    I = I(D) = 1 − max_{k≠l} ⟨c_k, c_l⟩

to measure the distinguishability of our dataset: the bigger I is, the more spread (orthogonal) the centers
are hence the easier to distinguish the categories. Using I as the evaluation metric one can sequentially
refine the feature transformation φ so that the resulting distinguishability is steadily increased.
Vapnik and Lerner (1963) also noted the product space trick: Let {φj : X → Hj , j = 1, . . . , m} be a set
of feature transformations. Then, w.l.o.g., we can assemble them into a single transformation φ : X → H :=
H1 × · · · × Hm .
Vapnik, Vladimir N. and A. Ya. Lerner (1963). “Pattern Recognition using Generalized Portraits”. Automation and
Remote Control, vol. 24, no. 6, pp. 709–715.


Remark 4.10: Linear separability, revisited

Recall our definition of (strict) linear separability of a dataset D = {(xi , yi ) ∈ Rd × {±1} : i = 1, . . . , n}:

∃w ∈ Rd , b ∈ R, s > 0, such that yi ŷi ≥ s, ∀i = 1, . . . , n, where ŷi := w⊤ xi + b.

Let us now break the above condition for any positive example yi = 1 and any negative example yj = −1:

    w⊤x_i + b ≥ s ≥ −s ≥ w⊤x_j + b  ⇐⇒  w⊤x_i ≥ s − b ≥ −s − b ≥ w⊤x_j
                                      ⇐⇒  min_{i:y_i=1} w⊤x_i > max_{j:y_j=−1} w⊤x_j.

It is clear now that the linear separability condition has nothing to do with the offset term b but the normal
vector w.

Remark 4.11: History of SVM, continued

In this box we summarize the main result in Vapnik and Chervonenkis (1964).
Inspired by (4.6), Vapnik and Chervonenkis (1964) essentially applied the one-vs-all reduction (see Re-
mark 1.36) and arrived at what they called the optimal approximation:

    max_{∥w∥=1}  min_{i:y_i=1} w⊤x_i    (4.8)
    s.t.  min_{i:y_i=1} w⊤x_i ≥ max_{j:y_j=−1} w⊤x_j.

According to Remark 4.10, the last condition is equivalent to requiring the dataset to be linearly separable
(strictly or not). Reintroducing the offset b and applying the standardization trick in Remark 4.5:

    max_{w,b,δ}  δ − b    (4.9)
    s.t.  ∥w∥ = 1   (assuming the objective is positive, this may be relaxed to ∥w∥ ≤ 1)
          y_i ŷ_i ≥ δ ≥ 0,   ŷ_i := w⊤x_i + b,   i = 1, . . . , n.

The above problem obviously admits a solution iff the dataset is linearly separable. Moreover, if we assume
the maximum objective is strictly positive (which is different from strict linear separability: take say D =
{(−1, +), (−2, −)}), then we can relax the unit norm constraint, in which case the optimal w is unique.
This formulation (4.9) of Vapnik and Chervonenkis differs from our previous one (4.5) mainly in the
objective function: δ − b vs. δ, i.e. Vapnik and Chervonenkis always subtract the offset from the minimum
margin. This difference can be significant though, see Example 4.12 below for an illustration.
The main result in (Vapnik and Chervonenkis 1964), aside from the formulation in (4.8), is the derivation
of its Lagrangian dual, which is quite a routine derivation nowadays. Nevertheless, we reproduce the original
argument of Vapnik and Chervonenkis for historical interests.
Define C(w) = mini:yi =1 w⊤ xi , and define the set of support vectors S(w) := {yi xi : w⊤ xi = C(w)}.
By definition there is always a positive support vector while there may not be any negative support vector
(consider say D = {(1, −), (2, +)}). Recall that Vapnik and Chervonenkis assumed the optimal objective
C ⋆ = C(w⋆ ) in (4.6) is strictly positive, hence the uniqueness of the optimal solution w⋆ . By restricting the
norm ∥ · ∥ = ∥ · ∥2 , Vapnik and Chervonenkis made the following observations:
• w⋆ is a conic combination of support vectors S = S(w⋆ ). Suppose not, let P denote the ℓ2 projector
onto the conic hull of support vectors. Let wη = (I − P )w⋆ + ηP w⋆ . For any support vector yx ∈ S:

y[⟨wη , x⟩ − η ⟨w⋆ , x⟩] = (1 − η) ⟨w⋆ − P w⋆ , yx⟩ = (1 − η) ⟨w⋆ − P w⋆ , (yx + P w⋆ ) − P w⋆ ⟩ ≥ 0,

since P is the projector onto the conic hull and η ≥ 1. Since C(w⋆ ) > 0 by assumption, slightly
increase η from 1 to 1 + ϵ will maintain all constraints but increase the objective in (4.6), contradiction
to the optimality of w⋆ .


• If we normalize the weight vector w∗ = w⋆/C(w⋆), the constraint in (4.6) is not affected but the objective C(w∗) now becomes unit. Of course, w∗ remains a conic combination of support vectors:

    w∗ = Σ_{i=1}^n α_i y_i x_i,   α_i ≥ 0,   α_i(x_i⊤w∗ − 1) = 0,   y_i(x_i⊤w∗ − 1) ≥ 0,   i = 1, . . . , n,    (4.10)

where the conditions follow from our definition of support vectors and are known as the KKT conditions.
• Vapnik and Chervonenkis mentioned the following differential system:

    dα/dt = −ϵα + [y − (K ⊙ yy⊤)α]₊,   K_ij := ⟨x_i, x_j⟩,
whose equilibrium will approach the KKT condition hence the solution in (4.10) as ϵ → 0. Vapnik and
Chervonenkis concluded that to compute α, we need only the dot product K. Moreover, to reconstruct
w, only the support vectors from the training set are needed.

• To perform testing, we compare x⊤ w∗ with threshold 1 (see the last condition in (4.10)). Or equiva-
lently, using uniqueness we can recover w⋆ = w∗ /∥w∗ ∥2 , where
* n
+ n
X X
∗ 2 ∗ ∗ ∗
∥w ∥2 = ⟨w , w ⟩ = w , αi yi xi = αi yi ⟨w∗ , xi ⟩ = α⊤ y.
i=1 i=1

Thus, we may also compare x⊤ w⋆ with the threshold √ 1⊤ . Usually we prefer to use w∗ because
α y
its threshold is normalized: in case there are multiple classes, we can then use the argmax rule:
ŷ = argmaxk wk⊤ x. However, note that this decision boundary corresponds to the hyperplane that first
passes through a positive example!

• Vapnik and Chervonenkis mentioned the possibility to dynamically update the dual variables α: upon
making a mistake on an example (x, y), we augment the support vectors S with (x, y) and recompute
the dual variables on the “effective training set” S. Again, only dot products are needed.

Vapnik, Vladimir N. and A. Ya. Chervonenkis (1964). “On a class of perceptrons”. Automation and Remote Control,
vol. 25, no. 1, pp. 112–120.

Example 4.12: SVM, old and new

As illustrated in the following figure, the optimal approximation in Remark 4.11 can behave very differently
from the familiar SVM formulation (4.11) in Remark 4.13. To verify that the left purple solid line in the left
plot is indeed optimal, simply note that the minimum distance from positive examples to any hyperplane
passing through the origin (e.g. the objective in (4.8)) is at most the minimum distance from positive
examples to the origin, which the left purple line achieves.
[Figure: the optimal approximation of Remark 4.11 (left) vs. the hard-margin SVM of (4.11) (right) on the same toy dataset; the margin 1/∥w∥₂ is indicated.]


Remark 4.13: Removing homogeneity by normalizing offset

A different way to remove the scaling-invariance mentioned in Definition 4.2 is to perform normalization on
the offset so that

    min_{i=1,...,n} y_i ŷ_i = δ,

where δ > 0 is any fixed constant. When the dataset D is indeed (strictly) linearly separable, this nor-
malization can always be achieved (simply by scaling w). After normalizing this way, we can simplify (4.2)
as:
    max_w  δ / ∥w∥,   s.t.   min_{i=1,...,n} y_i ŷ_i = δ.

We remind again that δ here is any fixed positive constant and we are not optimizing it (in contrast to what
we did in Remark 4.6). Applying some elementary transformations (that do not change the minimizer) we
arrive at the usual formulation of SVM (due to Boser et al. (1992)):

    min_w  ½∥w∥²    (4.11)
    s.t.  y_i ŷ_i ≥ δ,   ∀i = 1, . . . , n.

It is clear that the actual value of the positive constant δ is immaterial. Most often, we simply set δ = 1,
which is our default choice in the rest of this note.
The formulation (4.11) only makes sense on (strictly) linearly separable datasets, unlike our original
formulation (4.3).
Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik (1992). “A Training Algorithm for Optimal Margin Classifiers”. In: COLT, pp. 144–152.
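Since (4.11) is a quadratic program, it can be handed to any QP solver; below is a sketch of ours (assuming the cvxpy package is available), with δ = 1 and the ℓ2 norm:

    import cvxpy as cp
    import numpy as np

    def hard_margin_svm(X, y):
        """X: d x n, y in {-1,+1}^n. Solves min 1/2 ||w||_2^2 s.t. y_i (w^T x_i + b) >= 1."""
        d, n = X.shape
        w = cp.Variable(d)
        b = cp.Variable()
        constraints = [cp.multiply(y, X.T @ w + b) >= 1]
        prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
        prob.solve()   # infeasible if the dataset is not (strictly) linearly separable
        return w.value, b.value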

Alert 4.14: Any positive number but not zero

Note that in the familiar SVM formulation (4.11), we can choose δ to be any (strictly) positive number
(which amounts to a simple change of scale). However, we cannot set δ = 0, for otherwise the solution could
be trivially w = 0, b = 0.

Remark 4.15: Perceptron vs. SVM

We can formulate perceptron as the following feasibility problem:

    min_w  0
    s.t.  y_i ŷ_i ≥ δ,   ∀i = 1, . . . , n,

where as before δ > 0 is any fixed constant.


Unlike SVM, the objective function of perceptron is the trivial constant 0 function, i.e., we are not trying
to optimize anything (such as distance/margin) other than satisfying a bunch of constraints (separating the
positives from the negatives). Computationally, perceptron belongs to linear programming (LP), i.e., when
the objective function and all constraints are linear functions. In contrast, SVM belongs to the slightly more
complicated quadratic programming (QP): the objective function is a quadratic function while all constraints
are still linear. Needless to say, LP ⊊ QP.


Remark 4.16: Three parallel hyperplanes

Geometrically, we have the following intuitive picture. As an example, the dataset D consists of 2 positive
and 2 negative examples. The left figure shows the SVM solution, and for comparison the right figure depicts
a suboptimal solution. We will see momentarily why the left solution is optimal.
[Figure: the three parallel hyperplanes H−, H0, H+ and the margin 1/∥w∥₂ for the SVM solution (left) and for a suboptimal separating hyperplane (right) on a dataset with 2 positive and 2 negative examples.]

To understand the above figure, let us take a closer look at the SVM formulation (4.11), where w.l.o.g.
we choose δ = 1. Recall that the dataset D contains at least 1 positive example and 1 negative example (so
that w = 0 is ruled out). Let us break down the constraints in (4.11):

    w⊤x_i + b ≥ 1 for y_i = 1, and w⊤x_i + b ≤ −1 for y_i = −1   ⇐⇒   1 − min_{i:y_i=1} w⊤x_i ≤ b ≤ −1 − max_{i:y_i=−1} w⊤x_i.

If one of the inequalities is strict, say the left one, then we can decrease b slightly so that both inequalities are strict. But then we can scale down w and b without violating any constraint while decreasing the objective ½∥w∥² further. Therefore, at minimum, we must have

    1 − min_{i:y_i=1} w⊤x_i = b = −1 − max_{i:y_i=−1} w⊤x_i,   i.e., y_i ŷ_i = 1 for at least one y_i = 1 and one y_i = −1.

Given the SVM solution (w, b), we can now define three parallel hyperplanes:

    H_0 := {x : w⊤x + b = 0}
    H_+ := {x : w⊤x + b = 1}    (we choose δ = 1)
    H_− := {x : w⊤x + b = −1}.

The hyperplane H0 is the decision boundary of SVM: any point above or below it is classified as positive or
negative, respectively, i.e. y = sign(w⊤ x + b). The hyperplane H+ is the translate of H0 on which for the
first time we pass through some positive examples, and similarly for H− . Note that there are no training
examples between H− and H+ (a dead zone), with H0 at the middle between H− and H+ . More precisely,
we can compute the distance between H0 and H+ :

    dist(H_+, H_0) := min_{p∈H_+} min_{q∈H_0} ∥p − q∥◦
                    = min_{i:y_i=1} dist(x_i, H_0)    (since H_+ first passes through positive examples)
                    = 1 / ∥w∥    (see (4.1))
                    = min_{i:y_i=−1} dist(x_i, H_0)    (since H_− first passes through negative examples)
                    = dist(H_−, H_0).


Exercise 4.17: Uniqueness of w

For the ℓ2 norm, prove the parallelogram equality

∥w1 + w2 ∥22 + ∥w1 − w2 ∥22 = 2(∥w1 ∥22 + ∥w2 ∥22 ).

(The parallelogram law, in fact, characterizes norms that are induced by an inner product). With this choice
∥ · ∥ = ∥ · ∥2 , prove
• that the SVM weight vector w is unique;

• that the SVM offset b is also unique.

Definition 4.18: Convex set

A set C ⊆ Rd is called convex iff for all x, z ∈ C and for all α ∈ [0, 1] we have

(1 − α)x + αz ∈ C,

i.e., the line segment connecting any two points in C remains in C.


By convention the empty set is convex. Obviously, the universe Rd , being a vector space, is convex.

Exercise 4.19: Basic properties of convex sets

Prove the following:


• The intersection ∩_{γ∈Γ} C_γ of a collection of convex sets {C_γ}_{γ∈Γ} is convex.

• A set in R (the real line) is convex iff it is an interval (not necessarily bounded or closed).
• The union of two convex sets need not be convex.

• The complement of a convex set need not be convex.


• Hyperplanes H0 := {x ∈ Rd : w⊤ x + b = 0} are convex.
• Halfspaces H≤ := {x ∈ Rd : w⊤ x + b ≤ 0} are convex.
(In fact, a celebrated result in convex analysis shows that any closed convex set is an intersection of halfs-
paces.)

Definition 4.20: Convex hull

The convex hull conv(A) of an arbitrary set A is the intersection of all convex supersets of A, i.e.,
    conv(A) := ∩_{convex C ⊇ A} C.

In other words, the convex hull is the “smallest” convex superset.

Exercise 4.21: Convex hull as convex combination


We define a convex combination of a finite set of points x_1, . . . , x_n as any point x = Σ_{i=1}^n α_i x_i with


coefficients α ≥ 0, 1⊤ α = 1, i.e. α ∈ ∆n−1 . Prove that for any A ⊆ Rd :


    conv(A) = { x = Σ_{i=1}^n α_i x_i : n ∈ N, α ∈ ∆_{n−1}, x_i ∈ A },

i.e., the convex hull is simply the set of all convex combinations of points in A.
(The celebrated Carathéodory theorem allows us to restrict n ≤ d + 1, and n ≤ d if A is connected.)

Exercise 4.22: Unit balls of norms are convex

Recall that the unit ball of the ℓp “norm” is defined as:

Bp := {x : ∥x∥p ≤ 1},

which is convex iff p ≥ 1. The following figure shows the unit ball B_p for p = 2, ∞, ½, 1.

[Figure: the unit balls B_2, B_∞, B_{1/2}, and B_1.]

As shown above,

    conv(B_{1/2}) = B_1.

• For what values of p and q do we have conv(Bp ) = Bq ?


• For what value of p is the sphere Sp := {x : ∥x∥p = 1} = ∂Bp convex?

Remark 4.23: The first dual view of SVM (Rosen 1965)

Rosen (1965) was among the first few people who recognized that a dataset D is (strictly) linearly separable
(see Definition 1.24) iff

conv(D+ ) ∩ conv(D− ) = ∅, where D± := {xi ∈ D : yi = ±1}.

(Prove the only if part by yourself; to see the if part, note that the convex hull of a compact set (e.g. finite
set) is compact, and disjoint compact sets can be strictly separated by a hyperplane, due to the celebrated
Hahn-Banach Theorem.)
To test if a given dataset D is (strictly) linearly separable, Rosen’s idea was to compute the minimum
(Euclidean) distance between the (convex hulls of the) two classes. In his Eq (2.5), after applying the
standardization trick, see Remark 4.5, Rosen proposed exactly (up to a constant ½) the hard-margin SVM formulation (4.4). Then, to get an equivalent convex formulation, Rosen (1965) did some simple algebraic manipulations to arrive at the familiar hard-margin SVM in (4.11) (his Eq (2.6), again, up to a constant ½). Rosen (1965) proved the uniqueness of the hard-margin SVM solution, and he further proved that the

number of support vectors can be bounded by d + 1. Rosen (1965) also discussed how to separate more than
two classes, using basically the one-vs-one and the one-vs-all reductions (see Remark 1.36). It would seem
appropriate to attribute our hard-margin SVM formulations in (4.4) and (4.11) to Rosen (1965).
Rosen, J.B (1965). “Pattern separation by convex programming”. Journal of Mathematical Analysis and Applications,
vol. 10, no. 1, pp. 123–134.


Remark 4.24: More on linear separability detection (Mangasarian 1965)

omitted.
Mangasarian, O. L. (1965). “Linear and Nonlinear Separation of Patterns by Linear Programming”. Operations
Research, vol. 13, no. 3, pp. 444–452.

Remark 4.25: Dual view of SVM, as bisector of minimum distance pair

[Figure: three panels illustrating the dual view: the closest pair x+, x− between the two convex hulls with their bisecting hyperplane H; an alternative separating hyperplane H′ with projections p± and intersection point q; and the parallel hyperplanes H−, H, H+ with a hypothetical point z+ and a moving point u.]

In Definition 4.2 we defined SVM as maximizing the minimum distance of training examples to the
decision boundary H0 . We now provide a dual view which geometrically is very appealing.

• We first make a simple observation about a (strict) separating hyperplane H:


) (
⟨w, xi ⟩ + b > 0, if xi ∈ D+ := {xj : yj = 1} ⟨w, x⟩ + b > 0, if x ∈ conv(D+ )
=⇒ ,
⟨w, xi ⟩ + b < 0, if xi ∈ D− := {xj : yj = −1} ⟨w, x⟩ + b < 0, if x ∈ conv(D− )

i.e., H also (strictly) separates the convex hulls of positive examples and negative ones.
• The second observation we make is about the minimum distance of all positive (negative) examples to
a separating hyperplane:

    min_{x∈D±} dist(x, H) = min_{x∈D±} ±(w⊤x + b)/∥w∥ = min_{x∈conv(D±)} ±(w⊤x + b)/∥w∥ = min_{x∈conv(D±)} dist(x, H),

where the first equality follows from (4.1), the second from linearity, and the third from our observation
above. In other words, we could replace the datasets D± with their convex hulls.
• Based on the second observation, we now find the pair of x+ ∈ conv(D+ ) and x− ∈ conv(D− ) so that
dist(x+ , x− ) achieves the minimum distance among all pairs from the two convex hulls. We connect
the segment from x+ to x− and find its bisector, a separating hyperplane H that passes the middle
point (x+ + x−)/2 with normal vector proportional to ∂ (1/2)∥x+ − x−∥². We claim that

min_{x∈D±} dist(x, H) = min_{x∈conv(D±)} dist(x, H) = (1/2) dist(x+, x−) = (1/2) dist(conv(D+), conv(D−)).

To see the second equality, we translate H in parallel until it passes x+ and x− , and obtain hyperplanes
H+ and H− , respectively. Since H is a bisector of the line segment x+ x− ,

dist(H+, H) = dist(H−, H) = (1/2) dist(x+, x−).

We are left to prove there is no point in conv(D± ) that lies between H− and H+ . Suppose, for the sake
of contradiction, there is some z+ ∈ conv(D+ ) that lies between H− and H+ . The remaining proof
for the Euclidean case where ∥ · ∥ = ∥ · ∥2 is depicted above: We know the angle ∠x− x+ z+ < 90◦ .
If we move a point u on the segment z+ x+ from z+ to x+ , because the angle ∠ux− x+ → 0◦ , so


eventually we will have ∠x− ux+ ≥ 90◦ , in which case we would have dist(u, x− ) < dist(x+ , x− ). Since
u ∈ conv(D+ ), we have a contradiction:

dist(u, x− ) ≥ dist(conv(D+ ), conv(D− )) = dist(x+ , x− ) > dist(u, x− ).

The proof for any norm is as follows: Since the line segment z+ x+ ∈ conv(D+ ) and by definition
dist(x+ , x− ) = dist(conv(D+ ), conv(D− )), we know for any uλ = λz+ + (1 − λ)x+ on the line segment,
f (λ) := dist(uλ , x− ) ≥ dist(x+ , x− ) = f (0), i.e. the minimum of f (λ) over the interval λ ∈ [0, 1] is
achieved at λ = 0. Since f(λ) is convex, its right derivative at λ = 0, namely ⟨w, z+ − x+⟩, where
w ∈ ∂∥x+ − x−∥, must be nonnegative. But we know the hyperplane H+ = {x : w⊤(x − x+) = 0} and the
middle point (x+ + x−)/2 is on the left side of H+, hence z+ is on the right side of H+, contradiction.
• We can finally claim that H is the SVM solution, i.e., H maximizes the minimum distance to every
training examples in D. Indeed, let H ′ be any other separating hyperplane. According to our first
observation above, H ′ intersects with the line segment x+ x− at some point q (due to separability).
Define p± as the projection of x± onto the hyperplane H ′ , and since q ∈ H ′ ,

dist(x± , p± ) = dist(x± , H ′ ) ≤ dist(x± , q).

Therefore, using our second and third observations above:

min_{x∈D±} dist(x, H′) = min_{x∈conv(D±)} dist(x, H′) ≤ dist(x+, p+) ∧ dist(x−, p−)
                       ≤ (1/2)[dist(x+, p+) + dist(x−, p−)]
                       ≤ (1/2)[dist(x+, q) + dist(x−, q)]
                       = (1/2) dist(x+, x−)
                       = min_{x∈conv(D±)} dist(x, H) = min_{x∈D±} dist(x, H).

Exercise 4.26: Necessity of convex hull

In Remark 4.25, we picked the pair x+ and x− from the two convex hulls D± of the positive and negative
examples, respectively. Prove the following:
• One of x+ and x− can be chosen from the original datasets D± .
• Not both of x+ and x− may be chosen from the original datasets D± .

• What observation(s) in Remark 4.25 might fail if we insist in picking both x+ and x− from the original
datasets D± ?

[Figure: two configurations of D± with the closest pair x+, x− and the corresponding bisector H.]


Remark 4.27: SVM dual, from geometry to algebra

We complement the geometric dual view of SVM in Remark 4.25 with a “simpler” algebraic view. Applying
scaling we may assume the weight vector w of a separating hyperplane Hw is normalized. Then, we maximize
the minimum distance as follows:
 
max_{∥w∥=1,b} dist(D+, Hw) ∧ dist(D−, Hw)
  = max_{∥w∥=1,b} [ min_{x+∈D+} (w⊤x+ + b) ∧ min_{x−∈D−} −(w⊤x− + b) ]
  = max_{∥w∥=1,b} min_{x±∈D±, t∈[0,1]} [ t(w⊤x+ + b) + (1 − t)(−w⊤x− − b) ]
  = max_{∥w∥≤1,b} min_{x+∈t·conv(D+), x−∈(1−t)·conv(D−), t∈[0,1]} [ w⊤(x+ − x−) + b(2t − 1) ]
  = min_{x+∈t·conv(D+), x−∈(1−t)·conv(D−), t∈[0,1]} max_{∥w∥≤1,b} [ w⊤(x+ − x−) + b(2t − 1) ]
  = min_{x+∈(1/2)conv(D+), x−∈(1/2)conv(D−)} max_{∥w∥≤1} w⊤(x+ − x−)
  = min_{x+∈(1/2)conv(D+), x−∈(1/2)conv(D−)} ∥x+ − x−∥◦
  = (1/2) dist(conv(D+), conv(D−)),

where in the third equality we used linearity to replace with convex hulls, which then allowed us to apply the
minimax theorem to swap max with min. The sixth equality follows from Cauchy-Schwarz and is attained
when w ∝ x+ − x− , i.e. when Hw is a bisector.
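
The equivalence above is easy to try numerically. Below is a minimal sketch (not from Rosen (1965) or these notes; all variable names and the toy data are made up): we approximate the closest pair between the two convex hulls with Frank–Wolfe and read off the bisecting hyperplane and the half distance.

```python
import numpy as np

def closest_pair(Xp, Xm, iters=2000):
    """Approximately minimize ||x+ - x-|| over x+ in conv(rows of Xp), x- in conv(rows of Xm)."""
    lp = np.ones(len(Xp)) / len(Xp)                 # convex weights for conv(D+)
    lm = np.ones(len(Xm)) / len(Xm)                 # convex weights for conv(D-)
    for t in range(iters):
        diff = Xp.T @ lp - Xm.T @ lm                # current x+ - x-
        # linear minimization oracle over each simplex: pick the best vertex
        sp = np.zeros_like(lp); sp[np.argmin(Xp @ diff)] = 1.0
        sm = np.zeros_like(lm); sm[np.argmax(Xm @ diff)] = 1.0
        gamma = 2.0 / (t + 2.0)                     # standard Frank-Wolfe step size
        lp = (1 - gamma) * lp + gamma * sp
        lm = (1 - gamma) * lm + gamma * sm
    return Xp.T @ lp, Xm.T @ lm

rng = np.random.default_rng(0)
Xp = rng.normal(size=(20, 2)) + np.array([3.0, 3.0])    # toy separable data
Xm = rng.normal(size=(20, 2)) - np.array([3.0, 3.0])
xp, xm = closest_pair(Xp, Xm)
w = xp - xm                                 # normal of the bisector
b = -w @ (xp + xm) / 2                      # passes through the midpoint
print("half distance between hulls:", np.linalg.norm(xp - xm) / 2)
print("separates all points:", np.all(Xp @ w + b > 0) and np.all(Xm @ w + b < 0))
```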


5 Soft-margin Support Vector Machines


Goal

Extend hard-margin SVM to handle linearly inseparable data.

Alert 5.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.


6 Reproducing Kernels
Goal

Understand the kernel trick for training nonlinear classifiers with linear techniques. Reproducing kernels.

Alert 6.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Example 6.2: XOR problem

[Figure: the XOR dataset (left) and an illustration that no separating hyperplane exists (right); axes x1 and x2.]
The famous XOR problem is a simple binary classification problem with dataset

D = {x1 = (1; 1), y1 = −1,
     x2 = (−1; −1), y2 = −1,
     x3 = (1; −1), y3 = 1,
     x4 = (−1; 1), y4 = 1},

which is perfectly classified by the (nonlinear) XOR rule:

y = x1 xor x2 := −x1 x2 .

However, as illustrated on the right plot, no separating hyperplane can achieve 0 error on D. Indeed, a
hyperplane parameterized by (w, b) (strictly) linearly separates D iff for all i, yi ŷi > 0, where as usual
ŷ = w⊤x + b. On the XOR dataset D, we would obtain

y1 ŷ1 = −(w1 + w2 + b) > 0
y2 ŷ2 = −(−w1 − w2 + b) > 0
y3 ŷ3 = (w1 − w2 + b) > 0
y4 ŷ4 = (−w1 + w2 + b) > 0.

Adding the 4 inequalities we obtain 0 > 0, which is absurd, hence there cannot exist a (strictly) separating
hyperplane for D.
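
The contradiction above, and the standard cure via a nonlinear feature map, can be checked in a few lines of code; the feature map below is a made-up illustration (these notes only assume the XOR rule itself).

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([-1, -1, 1, 1])

# summing the four constraints y_i (w^T x_i + b) > 0: the coefficients of w1, w2, b all cancel
print(np.sum(y[:, None] * X, axis=0), np.sum(y))   # -> [0. 0.] 0, hence 0 > 0 is impossible

# a simple feature map phi(x) = (x1, x2, x1*x2) makes D linearly separable
Phi = np.c_[X, X[:, 0] * X[:, 1]]
w, b = np.array([0.0, 0.0, -1.0]), 0.0
print(y * (Phi @ w + b))                           # all entries equal +1 > 0
```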

Example 6.3: Dimension of Gaussian kernel feature space

Since the Gaussian density is a kernel, we know there exists a feature transformation φ : Rd → H so that

⟨φ(x), φ(x′ )⟩ = exp(−∥x − x′ ∥22 /σ),

where σ > 0 is any fixed positive number. We prove that dim(H) cannot be finite below.
For the sake of contradiction, suppose dim(H) = h < ∞. For any n points x1 , . . . , xn ∈ Rd we define the
matrix

Φ = [φ(x1 ), . . . , φ(xn )] ∈ Rh×n .

It is immediate that rank(Φ) ≤ n ∧ h.


Next, we choose a set of distinct points x1 , . . . , xn ∈ Rd and define the matrix

Kij = exp(−∥xi − xj ∥22 /σ).

We claim that rank(K) = n. But we also know K = Φ⊤ Φ hence rank(K) ≤ rank(Φ) ≤ n ∧ h. Now if we set
n > h we arrive at a contradiction.
We are left to prove the claim that rank(K) = n. We use a tensorization argument. Recall that the
tensor product A = u ⊗ v ⊗ · · · ⊗ w is a multi-dimensional array such that Ai,j,··· ,l = ui vj · · · wl . We use the
short-hand notation x⊗k := x ⊗ · · · ⊗ x (k times). For example, x⊗1 = x and x⊗2 ≃ xx⊤. The following claims are easily proved:
• Let x1, x2, · · · , xn ∈ Rp be linearly independent vectors. Then, for any k ∈ N, x1⊗k, x2⊗k, . . . , xn⊗k are linearly independent vectors in (Rp)⊗k.

• Let x1 , . . . , xn ∈ Rp be linearly independent, and xn+1 ̸∈ {x1 , . . . , xn }. Then there exists some
k ∈ {1, . . . , n + 1} such that x1⊗k, . . . , xn⊗k, xn+1⊗k are linearly independent.

– Suppose not, then we know

∀k = 1, . . . , n + 1,   x_{n+1}^{⊗k} = Σ_{i=1}^{n} α_i^k x_i^{⊗k} = (Σ_{i=1}^{n} α_i^1 x_i)^{⊗k} = Σ_{i_1=1}^{n} · · · Σ_{i_k=1}^{n} α_{i_1}^1 · · · α_{i_k}^1 x_{i_1} ⊗ · · · ⊗ x_{i_k}.

Since {xi } are linearly independent, so are {xi1 ⊗ · · · ⊗ xik }. Thus,


|{i1, . . . , ik}| > 1 =⇒ ∏_{j=1}^{k} α_{i_j}^1 = 0.
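
The key claim rank(K) = n is also easy to sanity-check numerically (a check, not a proof; the values n = 50, d = 3 and σ = 1 below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 1.0
X = rng.normal(size=(n, d))                           # n distinct points (almost surely)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-sq / sigma)                               # Gaussian kernel matrix
print(np.linalg.matrix_rank(K))                       # expected to print 50, i.e. full rank
```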


Example 6.4: XOR revisited

[Figure: the XOR dataset with the two decision regions of the XOR rule (top row), and the original x space versus a learned feature space h = (h1, h2) in which the two classes become linearly separable (bottom row).]


7 Automatic Differentiation (AutoDiff)


Goal

Forward and reverse mode auto-differentiation.

Alert 7.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Definition 7.2: Function Superposition and Computational Graph (Bauer 1974)

Let BF be a class of basic functions. A (vector-valued) function g : X ⊆ Rd → Rm is a superposition of


the basic class BF if the following is satisfied:
• There exist some DAG G = (V , E ) where using topological sorting we arrange the nodes as follows:

v1 , . . . , v d , vd+1 , . . . , vd+k , vd+k+1 , . . . , vd+k+m , and (vi , vj ) ∈ E =⇒ i < j.


| {z } | {z } | {z }
input intermediate variables output

Here we implicitly assume the outputs of the function g do not depend on each other. If they do, we
need only specify the indices of the output nodes accordingly (i.e. they may not all appear in the end).

• For each node vi , let Ii := {u ∈ V : (u, vi ) ∈ E } and Oi := {u ∈ V : (vi , u) ∈ E } denote the


(immediate) predecessors and successors of vi , respectively. Clearly, Ii = ∅ if i ≤ d (i.e. input nodes)
and Oi = ∅ if i > d + k (i.e. output nodes).
• The nodes are computed as follows: sequentially for i = 1, . . . , d + k + m,
vi = xi if i ≤ d,   and   vi = fi(Ii) if i > d,   where fi ∈ BF.    (7.1)

Our definition of superposition closely resembles the computational graph of Bauer (1974), who attributed
the idea to Kantorovich (1957).
Bauer, F. L. (1974). “Computational Graphs and Rounding Error”. SIAM Journal on Numerical Analysis, vol. 11,
no. 1, pp. 87–96.
Kantorovich, L. V. (1957). “On a system of mathematical symbols, convenient for electronic computer operations”.
Soviet Mathematics Doklady, vol. 113, no. 4, pp. 738–741.

Exercise 7.3: Neural Networks as Function Superposition

Let BF = {+, ×, σ, constant}. Prove that any multi-layer NN is a superposition of the basic class BF.
Is exp a superposition of the basic class above?

Theorem 7.4: Automatic Differentiation (e.g. Kim et al. 1984)

Let BF be a basic class of differentiable functions that includes +, ×, and all constants. Denote T (f ) as
the complexity of computing the function f and T (f, ∇f ) the complexity with additional computation of
the gradient. Let If and Of be the input and output arguments and assume there exists some constant


C = C(BF) > 0 so that

∀f ∈ BF, T (f, ∇f ) + |If ||Of |[T (+) + T (×) + T (constant)] ≤ C · T (f ).

Then, for any superposition g : Rd → Rm of the basic class BF, we have

T (g, ∇g) ≤ Cγ(m ∧ d) · T (g),

where γ is the maximal output dimension of basic functions used to superpose g.

Proof. Applying the chain rule to the recursive formula (7.1) it is clear that any superposition g is differen-
tiable too. We split the proof into two parts: a forward mode and a backward mode.
Forward mode: Let us define the block matrix V = [V1, . . . , Vd, Vd+1, . . . , Vd+k, Vd+k+1, . . . , Vd+k+m] ∈ R^{d×Σ_i d_i}, where each column block Vi corresponds to the gradient ∇vi = ∂vi/∂x ∈ R^{d×d_i}, where di is the output dimension of node vi (typically 1). By definition of the input nodes we have
Vi = ei , i = 1, . . . , d,
where ei is the standard basis vector in Rd . Using the recursive formula (7.1) and chain rule we have
Vi = Σ_{j∈Ii} Vj · ∇j fi,   where ∇j fi = ∂fi/∂vj ∈ R^{d_j×d_i}.

In essence, by differentiating at each node, we obtained a square and sparse system of linear equations, where
∇j fi are known coefficients and Vi are unknown variables. Solving the linear system yields Vd+k+1 , . . . , Vd+k+m ,
the desired gradient of g. Thanks to the topological ordering, we can simply solve Vi one by one. Let
γ = maxi di be the maximum output dimension of any node. We bound the complexity of the forward mode
as follows:
T(g, ∇g) ≤ Σ_{i∈V} ( T(fi, ∇fi) + Σ_{j∈Ii} d di dj [T(+) + T(×) + T(constant)] )
        ≤ dγ Σ_{i∈V} ( T(fi, ∇fi) + |Ifi||Ofi| [T(+) + T(×) + T(constant)] ) ≤ dγ C T(g).

Reverse mode: Let us rename the outputs yi = vd+k+i for i = 1, . . . , m. Similarly we define the block matrix V = [V1; . . . ; Vd; Vd+1; . . . ; Vd+k; Vd+k+1; . . . ; Vd+k+m] ∈ R^{(Σ_i d_i)×m}, where each row block Vi corresponds to the transpose of the gradient ∇vi = ∂y/∂vi ∈ R^{m×d_i}, where di is the output dimension of node vi (typically 1). By definition of the output nodes we have
Vd+k+i = ei, i = 1, . . . , m,   ei ∈ R^{1×m}.
Using the recursive formula (7.1) and chain rule we have
Vi = Σ_{j∈Oi} ∇i fj · Vj,   where ∇i fj = ∂fj/∂vi ∈ R^{d_i×d_j}.

Again, by differentiating at each node we obtained a square and sparse system of linear equations, where
∇i fj are known coefficients and Vi are unknown variables. Solving the linear system yields V1 , . . . , Vd , the
desired gradient of g. Thanks to the topological ordering, we can simply solve Vi one by one backwards,
after a forward pass to get the function values at each node. Similar as the forward mode, we can bound
the complexity as mγCT (g).
Thus, surprisingly, for real-valued superpositions (m = γ = 1), computing the gradient, which is a d × 1
vector, costs at most constant times that of the function value (which is a scalar), if we operate in the reverse
mode! The common misconception is that the gradient has size d × 1 hence if we compute one component
at a time we end up d times slower. This is wrong, because we can recycle computations. Note also that
even reading the input already costs O(d). However, this time complexity gain, as compared to that of the
forward mode, is achieved through a space complexity tradeoff: in reverse mode we need a forward pass first
to collect and store all function values at each node, whereas in the forward mode these function values can
be computed on the fly.


Kim, K. V., Yuri E. Nesterov, and B. V. Cherkasskii (1984). “An estimate of the effort in computing the gradient”.
Soviet Mathematics Doklady, vol. 29, no. 2, pp. 384–387.

Algorithm 7.5: Automatic Differentiation (AD) Pesudocode

We summarize the forward and reverse algorithms below. Note that to compute the gradient-vector mul-
tiplication ∇g · w for some compatible vector w, we can use the forward mode and initialize Vi with w.
Similarly, to compute w · ∇g, we can use the reverse mode with proper initialization to Vd+k+i .
Algorithm: Forward Automatic Differentiation for Superposition.
Input: x ∈ Rd , basic function class BF, computational graph G
Output: gradient [Vd+k+1 , . . . , Vd+k+m ] ∈ Rd×m
1 for i = 1, . . . , d do // forward: initialize function values and derivatives
2 vi ← x i
3 Vi ← ei ∈ Rd×1
4 for i = d + 1, . . . , d + k + m do // forward: accumulate function values and derivatives
5 compute vi ← fi (Ii )
6 for j ∈ Ii do
7 compute partial derivatives ∇j fi (Ii )
8     Vi ← Σ_{j∈Ii} Vj · ∇j fi

Algorithm: Reverse Automatic Differentiation for Superposition.

Input: x ∈ Rd , basic function class BF, computational graph G
Output: gradient [V1 ; . . . ; Vd ] ∈ Rd×m
1 for i = 1, . . . , d do // forward: initialize function values
2     vi ← xi
3 for i = 1, . . . , m do // backward: initialize output derivatives
4     Vd+k+i ← ei ∈ R1×m
5 for i = d + 1, . . . , d + k + m do // forward: accumulate function values
6     compute vi ← fi (Ii )
7 for i = d + k, . . . , 1 do // backward: accumulate derivatives
8     Vi ← Σ_{j∈Oi} ∇i fj · Vj

We remark that, as suggested by Wolfe (1982), one effective way to test AD (or manually programmed
derivatives) and locate potential errors is through the classic finite difference approximation.
Wolfe, Philip (1982). “Checking the Calculation of Gradients”. ACM Transactions on Mathematical Software, vol. 8,
no. 4, pp. 337–343.
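
To make the reverse mode concrete, here is a minimal tape-based sketch in Python together with the finite-difference check suggested by Wolfe (1982). It is an illustration of the idea, not the pseudocode verbatim; only +, × and sin are supported and all names are made up.

```python
import math

class Node:
    """A node v_i; parents stores (parent node, local partial derivative of this node w.r.t. it)."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

def add(a, b): return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])
def mul(a, b): return Node(a.value * b.value, [(a, b.value), (b, a.value)])
def sin(a):    return Node(math.sin(a.value), [(a, math.cos(a.value))])

def backward(out):
    # topological order (parents before children), then sweep in reverse so that
    # V_i <- sum_{j in O_i} (d f_j / d v_i) V_j uses each V_j only after it is final
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

# f(x1, x2) = sin(x1 * x2) + x1
x1, x2 = Node(1.3), Node(-0.7)
f = add(sin(mul(x1, x2)), x1)
backward(f)

def f_val(a, b): return math.sin(a * b) + a
eps = 1e-6
fd = (f_val(1.3 + eps, -0.7) - f_val(1.3 - eps, -0.7)) / (2 * eps)
print(x1.grad, fd)      # the AD gradient and the finite difference should agree closely
```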

Exercise 7.6: Matrix multiplication

To understand the difference between forward-mode and backward-mode differentiation, let us consider the
simple matrix multiplication problem: Let Aℓ ∈ Rdℓ ×dℓ+1 , ℓ = 1, . . . , L, where d1 = d and dL+1 = m. We
are interested in computing
A = ∏_{ℓ=1}^{L} Aℓ.

• What is the complexity if we multiply from left to right (i.e. ℓ = 1, 2, . . . , L)?


• What is the complexity if we multiply from right to left (i.e. ℓ = L, L − 1, . . . , 1)?
• What is the optimal way to compute the product?


Remark 7.7: Further insights on AD

If we associate an edge weight wij = ∂vj/∂vi to (i, j) ∈ E , then the desired gradient

∂gi/∂xj = Σ_{path P : vj→vi} ∏_{e∈P} we.    (7.2)

However, we cannot compute the above naively, as the number of paths in a DAG can grow exponentially
quickly with the depth. The forward and reverse modes in the proof of Theorem 7.4 correspond to two
dynamic programming solutions. (Incidentally, this is exactly how one computes the graph kernel too.)
Naumann (2008) showed that finding the optimal way to compute (7.2) is NP-hard.
Naumann, Uwe (2008). “Optimal Jacobian accumulation is NP-complete”. Mathematical Programming, vol. 112, no. 2,
pp. 427–441.

Remark 7.8: Tightness of dimension dependence in AD (e.g. Griewank 2012)

The dimensional dependence m ∧ d cannot be reduced in general. Indeed, consider the simple function
f (x) = sin(w⊤ x)b, where x ∈ Rd and b ∈ Rm . Computing f clearly costs O(d + m) (assuming sin can be
evaluated in O(1)) while even outputting the gradient costs O(dm).
Griewank, Andreas (2012). “Who Invented the Reverse Mode of Differentiation?” Documenta Mathematica, vol. Extra
Volume ISMP, pp. 389–400.

Exercise 7.9: Backpropagation (e.g. Rumelhart et al. 1986)

Apply Theorem 7.4 to multi-layer NNs and recover the celebrated backpropagation algorithm. Distinguish
two cases:
• Fix the network weights W1 , . . . , WL and compute the derivative w.r.t. the input x of the network.
This is useful for constructing adversarial examples.
• Fix the input x of the network and compute the derivative w.r.t. the network weights W1 , . . . , WL .
This is useful for training the network.
Suppose we know how to compute the derivatives of f (x, y). Explain how to compute the derivative of
f (x, x)?
• Generalize from above to derive the backpropagation rule for convolutional neural nets (CNN).

• Generalize from above to derive the backpropagation rule for recurrent neural nets (RNN).

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams (1986). “Learning representations by back-
propagating errors”. Nature, vol. 323, pp. 533–536.

Remark 7.10: Fast computation of other derivatives (Kim et al. 1984)

Kim et al. (1984) pointed out an important observation, namely that the proof of Theorem 7.4 only uses
the chain-rule property of differentiation:
∂f/∂y = ∂f/∂x · ∂x/∂y.
In other words, we could replace differentiation with any other operation that respects the chain rule and
obtain the same efficient procedure for computation. For instance, the relative differential in numerical
analysis or the directional derivative can both be efficiently computed in the same way. Similarly, one


can compute the Hessian-vector multiplication efficiently as it also respects the chain rule (Møller 1993;
Pearlmutter 1994).
Kim, K. V., Yuri E. Nesterov, V. A. Skokov, and B. V. Cherkasskii (1984). “An efficient algorithm for computing
derivatives and extremal problems”. Ekonomika i matematicheskie metody, vol. 20, no. 2, pp. 309–318.
Møller, M. (1993). Exact Calculation of the Product of the Hessian Matrix of Feed-Forward Network Error Functions
and a Vector in O(N ) Time. Tech. rep. DAIMI Report Series, 22(432).
Pearlmutter, Barak A. (1994). “Fast Exact Multiplication by the Hessian”. Neural Computation, vol. 6, no. 1, pp. 147–
160.


8 Deep Neural Networks


Goal

Introducing the basics of deep neural networks.

Alert 8.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.


9 Convolutional Neural Networks


Goal

Introducing the basics of CNN and the popular architectures.

Alert 9.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.


10 Recurrent Neural Networks


Goal

Introducing the basics of RNN. LSTM. GRU. PixelRNN and related.

Alert 10.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.


11 Graph Neural Networks


Goal

Introducing the basics of GNN and the popular variants.

Alert 11.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.
Nice surveys on this topic include Bronstein et al. (2017) and Wu et al. (2020).
Bronstein, M. M., J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017). “Geometric Deep Learning: Going
beyond Euclidean data”. IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42.
Wu, Z., S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2020). “A Comprehensive Survey on Graph Neural
Networks”. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21.

Definition 11.2: Graph learning

Consider a graph G = (V, E, l) with nodes V, edges E, node feature/attribute/label lv ∈ Rd for each node
v ∈ V, and edge feature/attribute/label le ∈ Rp for each edge e ∈ E. The graph may be directed or
undirected, where the direction of the edge can be easily encoded in the edge feature. We use Nv = N (v) ⊆ V
to denote the neighboring nodes of v and Mv = M(v) ⊆ E for the edges that have node v as a vertex. For
positional graphs, we also have an injective function pv : Nv → {1, 2, . . . , |V|} that encodes the relative
position of each neighbor of a node v. For instance, on a 2-D image, {1, 2, 3, 4} may represent the west,
north, east, and south neighbor, respectively.

Alert 11.3: All for one, and one for all

Let (Gi , yi ), i = 1, . . . , n be a given supervised set of graphs and labels. Our goal is to learn a predictive
function ŷ that maps a new test graph G to its corresponding label: ŷ(G) ≈ y. The labels could be at the
node, edge or graph level. Do not confuse the label y with the feature l, since some authors also refer to the
latter as “labeling.”
Interestingly, we can piece all graphs into one large, disconnected graph, greatly simplifying our notation
and without compromising generality. Note that this is more than just a reduction trick: in some cases it is
actually the natural thing to do, such as in web-scale applications where the entire internet is just one giant
graph. We follow this trick throughout.

Example 11.4: Some applications of graph learning

We mention some example applications of graph learning:


• Each node may represent an atom in some chemical compound while the edges model the (strength
of) chemical bonds linking the atoms. We may be interested in predicting how a certain disease reacts
to the chemical compound.
• All image analyses fall into graph learning with each pixel playing a node of the underlying (regular)
grid and the pixel value being the node feature.
• Social network, where we may be interested in classifying the nodes or imputing missing links. For
instance, each webpage is a node and hyperlinks act as edges.


Definition 11.5: Graph neural network (GNN) (Scarselli et al. 2009)

GNNs, as defined here, can be regarded as a natural extension of recurrent networks, from a chain graph to
a general graph. Indeed, we define the following recursion: for all v ∈ V,

hv ← f (hv , hNv , lv , lNv , lMv ; w)


ov = g(hv , lv ; w),

where hv is the hidden state at node v and ov is its output. The two (local) update functions f , g are
parameterized by w, which is shared among all nodes. We remark that in general it is up to us to define
the neighborhoods N and M, and f , g may have slightly different forms (such as involving other inputs).
Collect all local updates into one abstract formula:
 
x := [h; o] ← F(x, l; w).    (11.1)

Note that the input node/edge features l are fixed. Thus, for a fixed weight w, the above update defines the
(enhanced) state x as a fixed point of the map Fl,w : x 7→ F(x, l; w).
To compute the state x with a fixed weight w, we perform the (obvious) iteration:

xt+1 = F(xt , l; w), x0 initialized. (11.2)

According to Banach’s fixed point theorem, (for any initialization x0 ) the above iteration converges geomet-
rically to the unique fixed point of Fl,w , provided that the latter is a contraction (or more generally a firm
nonexpansion). For later reference, we abstract (the unique) solution of the nonlinear equation (11.1) as:

o = ŷ(l; w),

where we have discarded the state h and only retained the output o.
Scarselli, F., M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009). “The Graph Neural Network Model”.
IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80.
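
A toy instance of the fixed-point view (11.1)–(11.2), with a linear-then-tanh update whose weights are scaled to ensure a contraction (the graph, dimensions and the 0.2 scaling below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4                                       # 6 nodes, 4-dimensional states
A = (rng.random((n, n)) < 0.4).astype(float)      # a random adjacency matrix
np.fill_diagonal(A, 0)
A = A / np.maximum(A.sum(1, keepdims=True), 1)    # average over neighbors
L_feat = rng.normal(size=(n, p))                  # node features l_v
W = rng.normal(size=(p, p))
W *= 0.2 / np.linalg.norm(W, 2)                   # small spectral norm -> F is a contraction

def F(H):                                         # h_v <- f(h_{N_v}, l_v; w), all nodes at once
    return np.tanh(A @ H @ W + L_feat)

H = np.zeros((n, p))                              # any initialization works
for t in range(200):
    H_new = F(H)
    if np.linalg.norm(H_new - H) < 1e-10:
        break
    H = H_new
print("converged after", t + 1, "iterations")     # geometric convergence, as Banach predicts
```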

Alert 11.6: Recursive neural network

When the underlying graph is a DAG (and the update function of a node only depends on its descendants),
we may arrange the computation in (11.2) according to some topological ordering so that it stops after one
(sequential) pass of all nodes. When the graph is a chain, we recover the familiar recurrent neural network.

Example 11.7: Local update function

We mention two examples of local update function:


• For positional graphs, we arrange the neighbors in hNv , lNv , lMv according to their relative positions
decided by pv (say in increasing order). For non-existent neighbors, we may simply pad with null
values.
• For non-positional graphs, the following permutation-invariant local update is convenient:
hv ← (1/|Nv|) Σ_{u∈Nv} f(hv, hu, lv, lu, l(v,u)).

More generally, we may replace the above average with any permutation-invariant function (e.g. aver-
aged ℓp norm), see Xu et al. (2019) for some discussion on possible limitations of this choice.

Xu, Keyulu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka (2019). “How Powerful are Graph Neural Networks?”
In: International Conference on Learning Representations.


Algorithm 11.8: Learning GNN

To learn the weights w of a GNN, we choose a loss function ℓ, and apply (stochastic) gradient descent to
solve

min_w ℓ(ŷ(l; w), y),

where recall that ŷ(l; w) is (the unique) solution of the nonlinear equation (11.1) and is practically computed
by the iteration (11.2) (similar to unrolling in RNN). If F(x, l; w) is differentiable in w and contracting in x,
then a simple application of the implicit function theorem reveals that the solution ŷ(l; w) is also differentiable
in w. Thus, we may apply the recurrent back-propagation algorithm. If memory is not an issue, we can
also apply back-propagation through time (BPTT) by replacing ŷ with ot after a fixed number of unrolling
steps in (11.1).

Example 11.9: Parameterizing local update function

• Affine: Let F(x, l; w) = A(l; w)x + b(l; w), where the matrix A and bias vector b are outputs of some
neural net with input l and weights w. By properly scaling A, it is easy to make F a contraction.
• More generally, we may parameterize F by a (highly) nonlinear deep network. However, care must be
taken (e.g. through regularization) so that F is (close to) a contraction at the learned weights.

• We remark that in theory any parameterization of F can be used; it does not have to be a neural
network.

Example 11.10: PageRank belongs to GNN

Define the normalized adjacency matrix


Āuv = 1/|Nu| if (u, v) ∈ E,   and Āuv = 0 otherwise,

which represents the probability of visiting a neighboring node v once we are at node u. Consider the GNN
with linear state update function:

x ← αx0 + (1 − α)Ā⊤ x,

where the parameter α ∈ [0, 1) models the probability of “teleporting” and x0 ∈ ∆. In other words, the state
of node v is an aggregation of the states of its neighbors:
xv = α xv,0 + (1 − α) Σ_{(u,v)∈E} (1/|Nu|) xu.

For any α ∈ (0, 1), the above iterate converges to a unique fixed point known as the PageRank.
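
A direct implementation of this fixed point by the iteration (11.2) (a small sketch on a made-up 4-node directed graph):

```python
import numpy as np

E = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]      # toy directed edges
n, alpha = 4, 0.15                                # alpha = teleport probability
out_deg = np.zeros(n)
for u, v in E: out_deg[u] += 1
Abar = np.zeros((n, n))
for u, v in E: Abar[u, v] = 1.0 / out_deg[u]      # normalized adjacency

x0 = np.ones(n) / n                               # restart distribution
x = x0.copy()
for _ in range(100):
    x = alpha * x0 + (1 - alpha) * Abar.T @ x     # the linear GNN state update
print(x, x.sum())                                 # PageRank scores; the sum stays 1
```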

Definition 11.11: Spatial convolutional networks on graphs (Bruna et al. 2014)

Given a (weighted) graph G 0 = (V 0 , A0 ), where A0 is the adjacency matrix, we define a sequence of coars-
enings G l = (V l , Al ), l = 1, . . . , L, where recursively each node V ∈ V l+1 is a subset (e.g. neighborhood) of


nodes in V l , i.e.
V^{l+1} ⊆ 2^{V^l},   and for all U, V ∈ V^{l+1},   A^{l+1}_{UV} = Σ_{u∈U⊆V^l} Σ_{v∈V⊆V^l} A^l_{uv}.

Typically, the nodes in V l+1 form a partition of the nodes in V l , using say some graph partitioning algorithm.
For instance we may cluster a node u with all “nearby” and available nodes v with Auv ≤ ϵ, hence forming
an ϵ-cover.
Let x^l = [x^l_1; . . . ; x^l_{d_l}] ∈ R^{|V^l| d_l} be a d_l-channel signal on the nodes of graph G^l. We define a layer of spatial convolution as follows:

x^{l+1}_r = P( σ(W^l_r x^l) ),   W^l_r ∈ R^{|V^l|×|V^l| d_l},   r = 1, . . . , d_{l+1},

where each Wrl is a spatially compact filter (with nonzero entries only when Aluv larger than some threshold),
σ : R → R some (nonlinear) component-wise activation function, and P a pooling operator that pools the
values in each neighborhood (corresponding to nodes in V l+1 ). The total number of parameters in the filter
W l is O(|E l |dl dl+1 ). Since nodes in a general graph (as opposed to regular ones such as grids) may have
different neighborhoods, it is not possible to share the filter weights at different nodes (i.e. the rows in Wrl
have to be different).
Bruna, Joan, Wojciech Zaremba, Arthur Szlam, and Yann LeCun (2014). “Spectral Networks and Locally Connected
Networks on Graphs”. In: International Conference on Learning Representations.

Example 11.12: Spatial CNN (Niepert et al. 2016)

The main difficulty in extending spatial convolution to general graphs is the lack of correspondence of the
nodes. Niepert et al. (2016) proposed to first label the nodes so that they are somewhat in correspondence.
Consider l : V → L that sends a node v ∈ V to a color lv in some totally ordered set L. For instance, l could
simply be the node degree or computed by the WL Algorithm 11.23 below. We proceed similarly as in CNN:
• The color l induces an ordering of the nodes, allowing us to select a fixed number n of nodes, starting
from the “smallest” and incrementing with stride s. We pad (disconnected) trivial nodes if run out of
choices.
• For each chosen node v above, we incrementally select its neighbors Nv := ∪_d {u : dist(u, v) ≤ d} using breadth first search (BFS), until exceeding the receptive field size or running out of choice.
• We recompute colors on Nv with the constraint dist(u, v) < dist(w, v) =⇒ lu < lw . Depending on the
size of Nv , we either select a fixed number m of (top) neighbors and recompute their colors, or pad
(disconnected) trivial nodes to make the fixed number m. Lastly, we perform canonicalization using
Nauty (McKay and Piperno 2014) while respecting the node colors.
• Finally, we collect the results into tensors with size n × m × d for d-dim node features and n × m × m × p
for p-dim edge features, which can be reshaped to nm × d and nm2 × p. We apply 1-d convolution
with stride and receptive field size m to the first and m2 to the second tensor.
For grid graphs, if we use the WL Algorithm 11.23 to color the nodes, then it is easy to see that the above
procedure recovers the usual CNN.
Niepert, Mathias, Mohamed Ahmed, and Konstantin Kutzkov (2016). “Learning Convolutional Neural Networks for
Graphs”. In: Proceedings of The 33rd International Conference on Machine Learning, pp. 2014–2023.
McKay, Brendan D. and Adolfo Piperno (2014). “Practical graph isomorphism, II”. Journal of Symbolic Computation,
vol. 60, pp. 94–112.


Definition 11.13: Graph Laplacian

Let A be the usual adjacency matrix of an (undirected) graph and D the diagonal matrix of degrees:
Auv = 1 if (u, v) ∈ E and 0 otherwise,   Duu = Σ_v Auv and Duv = 0 for u ≠ v.

More generally, we may consider a weighted graph with (nonnegative, real-valued and symmetric) weights
Auv = wuv . We define the graph Laplacian and its normalized version:

L = D − A, L̄ = I − D−1/2 AD−1/2 = D−1/2 LD−1/2 .

Among many other nice properties, the graph Laplacian is useful because of its connection to quadratic
potentials. To see this, let xv ∈ Rd be a feature vector at each node v and we verify that

(1/2) Σ_{u,v} Auv ∥xu − xv∥² = (1/2) Σ_{u,v} Auv [∥xu∥² + ∥xv∥² − 2⟨xu, xv⟩] = Σ_u du ∥xu∥² − Σ_{u,v} Auv ⟨xu, xv⟩
  = tr(X(D − A)X⊤) = tr(XLX⊤) = Σ_{j=1}^{d} Xj: L Xj:⊤,   X = [. . . , xv, . . .] ∈ R^{d×|V|}.

Taking d = 1 we see that the Laplacian L is symmetric and positive semidefinite. Similarly,
tr(X L̄ X⊤) = tr((XD^{−1/2}) L (D^{−1/2}X⊤)) = (1/2) Σ_{u,v} Auv ∥ xu/√du − xv/√dv ∥².

Of course, the normalized graph Laplacian is also symmetric and positive semidefinite.
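
A quick numerical check of the quadratic-form identity and of positive semidefiniteness (the 4-node weighted graph below is arbitrary):

```python
import numpy as np

A = np.array([[0, 1, 0, 2],
              [1, 0, 3, 0],
              [0, 3, 0, 1],
              [2, 0, 1, 0]], dtype=float)         # symmetric nonnegative weights
d = A.sum(1)
L = np.diag(d) - A
Lbar = np.eye(4) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                       # X in R^{d x |V|}, one column per node
lhs = 0.5 * sum(A[u, v] * np.sum((X[:, u] - X[:, v]) ** 2)
                for u in range(4) for v in range(4))
print(np.isclose(lhs, np.trace(X @ L @ X.T)))     # True
print(np.all(np.linalg.eigvalsh(L) >= -1e-12),
      np.all(np.linalg.eigvalsh(Lbar) >= -1e-12)) # both PSD
```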

Exercise 11.14: Laplacian and Connectedness

Prove that the dimension of the null space of the Laplacian is exactly the number of connected components
in the (weighted) graph.
Moreover, L1 = 0, so the Laplacian always has 0 as an eigenvalue and 1 as the corresponding eigenvector.

Remark 11.15: Graph Laplacian is everywhere

The graph Laplacian played significant roles in the early days of segmentation, dimensionality reduction and
semi-supervised learning, see Shi and Malik (e.g. 2000), Dhillon et al. (2007), Zhu et al. (2003), Zhou et al.
(2004), Coifman et al. (2005), Belkin et al. (2006), Belkin and Niyogi (2008), Hammond et al. (2011), and
Shuman et al. (2013). It allows us to propagate information from one node to another through traversing the
edges and to enforce global consistency through local ones. Typical ways to construct graph from sampled
data include thresholding pairwise distances or comparing node features.
Shi, Jianbo and J. Malik (2000). “Normalized cuts and image segmentation”. IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 22, no. 8, pp. 888–905.
Dhillon, I. S., Y. Guan, and B. Kulis (2007). “Weighted Graph Cuts without Eigenvectors A Multilevel Approach”.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 11, pp. 1944–1957.
Zhu, Xiaojin, Zoubin Ghahramani, and John Lafferty (2003). “Semi-Supervised Learning Using Gaussian Fields and
Harmonic Functions”. In: Proceedings of the Twentieth International Conference on International Conference on
Machine Learning, pp. 912–919.
Zhou, Dengyong, Olivier Bousquet, Thomas N. Lal, Jason Weston, and Bernhard Schölkopf (2004). “Learning with
Local and Global Consistency”. In: Advances in Neural Information Processing Systems 16, pp. 321–328.
Coifman, R. R., S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker (2005). “Geometric
diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps”. Proceedings of the
National Academy of Sciences, vol. 102, no. 21, pp. 7426–7431.


Belkin, Mikhail, Partha Niyogi, and Vikas Sindhwani (2006). “Manifold Regularization: A Geometric Framework for
Learning from Labeled and Unlabeled Examples”. Journal of Machine Learning Research, vol. 7, pp. 2399–2434.
Belkin, Mikhail and Partha Niyogi (2008). “Towards a theoretical foundation for Laplacian-based manifold methods”.
Journal of Computer and System Sciences, vol. 74, no. 8, pp. 1289–1308.
Hammond, David K., Pierre Vandergheynst, and Rémi Gribonval (2011). “Wavelets on graphs via spectral graph
theory”. Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150.
Shuman, D. I., S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst (2013). “The emerging field of signal
processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains”. IEEE
Signal Processing Magazine, vol. 30, no. 3, pp. 83–98.

Definition 11.16: Spectral convolutional networks on graphs (Bruna et al. 2014)

Bruna et al. (2014) also defined the spectral graph convolution of two graph signals x ∈ R|V| and g ∈ R|V|
as:

x ∗ g := U [(U ⊤ x) ⊙ (U ⊤ g)], where L = U ΛU ⊤

is the spectral decomposition of the graph Laplacian L and ⊙ denotes component-wise multiplication. Let
g, or equivalently w := U ⊤ g, represent a filter. We then define a layer of spectral graph convolution as:

x^{l+1}_r = σ( U [W^l_r ⊙ (U⊤X^l)] 1 ),   r = 1, . . . , d_{l+1},   X^l = [x^l_1, . . . , x^l_{d_l}],    (11.3)

where dl is the number of channels for layer l and σ : R → R is some component-wise (nonlinear) activation
function. The formula (11.3) continues to make sense if we only take say bottom sl eigenvectors in U
(corresponding to the smallest eigenvalues). Thus, the number of filter parameters in W l is O(sl dl dl+1 ),
which we may reduce through interpolating a few “landmarks”: Wrl = Bαlr , where B is a fixed interpolation
kernel and the few knots αlr are tunable.
When a sequence of coarsenings G l is available (like the spatial convolution in Definition 11.11), we can
then perform pooling on the signal X l by pooling the values in each neighborhood (corresponding to nodes
in V l+1 ).
Henaff et al. (2015) also considered learning the graph topology and spectral convolution alternately.
Bruna, Joan, Wojciech Zaremba, Arthur Szlam, and Yann LeCun (2014). “Spectral Networks and Locally Connected
Networks on Graphs”. In: International Conference on Learning Representations.
Henaff, Mikael, Joan Bruna, and Yann LeCun (2015). “Deep Convolutional Networks on Graph-Structured Data”.

Definition 11.17: Chebyshev polynomial

Let p0 ≡ 1 and p1 (x) = x. For k ≥ 2 we define the k-th Chebyshev polynomial recursively:

pk (x) = 2x · pk−1 (x) − pk−2 (x).



It is known that Chebyshev polynomials form an orthogonal basis for L2([−1, 1], dx/√(1 − x²)).

Example 11.18: Chebyshev Net (Defferrard et al. 2016)

The spectral graph convolution in Definition 11.16 is expensive as we need to eigen-decompose the Laplacian
L. However, note that

x ∗ g := U [(U ⊤ g) ⊙ (U ⊤ x)] = U [diag(f (λ; w))(U ⊤ x)] = [U diag(f (λ; w))U ⊤ ]x,

where we assume U ⊤ g = f (λ; w) and recall the eigen-decomposition L = U diag(λ)U ⊤ . The univariate
function f : R → R is parameterized by w and is applied component-wise to a vector (and component-wise


to the eigenvalues of a symmetric matrix). Then, it follows

x ∗ g = f (L; w)x,
Pk−1 Pk−1
and with a polynomial function f (λ; w) = j=0 wj λj we have x ∗ g = j=0 wj Lj x, where the polynomial
Lj only depends on nodes within j edges hence localized. Using the Chebyshev polynomial we may then
parameterize spectral convolution:
k−1
X
x∗g = wj pj (L̃)x, where L̃ := 2L/∥L∥ − I,
j=0

whose spectrum lies in [−1, 1]. If we define xj = pj (L̃)x, then recursively

xj = 2L̃xj−1 − xj−2 , with x0 = x, x1 = L̃x.

The above recursion indicates that Chebyshev net is similar to a k-step unrolling of GNN with linear update
functions.
Thus, computing the graph convolution x ∗ g costs only O(k|E|). We easily extend to multi-channel
signals X = [x1, . . . , xs] ∈ R^{|V|×s} with filters Wr = [w^r_0, . . . , w^r_{k−1}] ∈ R^{s×k}:

x ∗ gr = Σ_{i=1}^{s} Σ_{j=0}^{k−1} w^r_{ij} pj(L̃) xi = [p0(L̃), · · · , pk−1(L̃)] vec(X Wr),   r = 1, . . . , t,

where s and t are the number of input and output channels, respectively. Component-wise nonlinear acti-
vation is applied afterwards, and pooling can be similarly performed as before if a sequence of coarsenings
is available.
Defferrard, Michaël, Xavier Bresson, and Pierre Vandergheynst (2016). “Convolutional Neural Networks on Graphs
with Fast Localized Spectral Filtering”. In: Advances in Neural Information Processing Systems 29, pp. 3844–
3852.
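
The three-term recursion makes the filtering cheap to implement; a single-channel sketch (the graph, k and the random weights below are arbitrary illustrative choices):

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
Lt = 2 * L / np.linalg.norm(L, 2) - np.eye(4)     # L~, spectrum in [-1, 1]

k = 3
rng = np.random.default_rng(0)
w = rng.normal(size=k)                            # filter coefficients (learned in practice)
x = rng.normal(size=4)                            # a single-channel graph signal

xs = [x, Lt @ x]                                  # x_0 = x, x_1 = L~ x
for j in range(2, k):
    xs.append(2 * Lt @ xs[-1] - xs[-2])           # x_j = 2 L~ x_{j-1} - x_{j-2}
out = sum(w[j] * xs[j] for j in range(k))         # x * g = sum_j w_j p_j(L~) x
print(out)
```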

Definition 11.19: Graph convolutional network (GCN) (Kipf and Welling 2017)

Given a weighted graph G = (V, A), a layer of GCN is defined concisely as:
X^{l+1} = σ( D̊^{−1/2} Å D̊^{−1/2} X^l W^l ),   X^l = [x^l_1, . . . , x^l_s] ∈ R^{|V|×s},   W^l ∈ R^{s×t},    (11.4)


where Å = A + I (i.e. adding self-cycle), D̊ is the usual diagonal degree matrix of Å, and s and t are the
number of input and output channels, respectively.
GCN can be motivated by setting k = 1 and with weight-sharing w^r_{i,0} = −w^r_{i,1} = w^r_i in Chebyshev net (see Example 11.18):

x ∗ gr = Σ_{i=1}^{s} (w^r_{i,0} I + w^r_{i,1} L̃) xi = Σ_{i=1}^{s} w^r_i (I − 2L/∥L∥ + I) xi.

If we use the normalized Laplacian and assume ∥L̄∥ = 2, then


x ∗ gr = Σ_{i=1}^{s} w^r_i (I + D^{−1/2}AD^{−1/2}) xi = (I + D^{−1/2}AD^{−1/2}) X wr,

where the identity term corresponds to the self-loop and D^{−1/2}AD^{−1/2} to the 1-hop neighbors.

Comparing to (11.4), we see that GCN first adds the self-loop to the adjacency matrix to get Å and then
renormalizes to get the 1-hop neighbor term D̊−1/2 ÅD̊−1/2 .
Comparing to Chebyshev net, 1 layer of GCN only takes 1-hop neighbors into account while Chebyshev
net takes all k-hop neighbors into account. However, this can be compensated by stacking k layers in GCN.
Kipf and Welling (2017) applied GCN to semi-supervised node classification where cross-entropy on labeled
nodes is minimized while the unlabeled nodes affect the Laplacian hence also learning of the weights W .


Kipf, Thomas N. and Max Welling (2017). “Semi-Supervised Classification with Graph Convolutional Networks”. In:
International Conference on Learning Representations.
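
One GCN layer (11.4) takes a few lines of numpy (a sketch on a made-up 3-node path graph, with random weights and ReLU as σ):

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_self = A + np.eye(3)                                      # Å = A + I (add self-loops)
d = A_self.sum(1)
L_hat = np.diag(d ** -0.5) @ A_self @ np.diag(d ** -0.5)    # D̊^{-1/2} Å D̊^{-1/2}

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                                 # |V| x s node features
W = rng.normal(size=(5, 2))                                 # s x t weights (learned in practice)
X_next = np.maximum(L_hat @ X @ W, 0)                       # σ = ReLU
print(X_next.shape)                                         # (3, 2)
```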

Example 11.20: Simple graph convolution (SGC) (Wu et al. 2019)

As mentioned above, GCN replaces a layer of Chebyshev net with k compositions of a simple layer defined
in (11.4):

X → σ(L̊XW 1 ) → · · · → σ(L̊XW k ), L̊ := D̊−1/2 ÅD̊−1/2 .

Surprisingly, Wu et al. (2019) showed that collapsing the above leads to similar performance, effectively
bringing us back to Chebyshev net with a different polynomial parameterization:

X → σ(L̊k XW ).

Wu et al. (2019) proved that the self-loop in Å effectively shrinks the spectrum.
Wu, Felix, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger (2019). “Simplifying Graph
Convolutional Networks”. In: Proceedings of the 36th International Conference on Machine Learning, pp. 6861–
6871.

Exercise 11.21: Multiplication is indeed composition

Prove that the mapping x 7→ Lk x depends only on k-hop neighbors.

Alert 11.22: The deeper, the worse? (Oono and Suzuki 2020)

Both GCN and SGC seem to suggest that we do not need to build very deep graph networks. This is possibly
due to the small-world phenomenon in many real-world graphs, namely that each node can be reached from
any other node through very few hops. See Oono and Suzuki (2020) for an interesting result along this
direction.
Oono, Kenta and Taiji Suzuki (2020). “Graph Neural Networks Exponentially Lose Expressive Power for Node
Classification”. In: International Conference on Learning Representations.

Algorithm 11.23: Iterative color refinement (Weisfeiler and Lehman 1968)

Algorithm: Weisfeiler-Lehman iterative color refinement (Weisfeiler and Lehman 1968)


Input: Graph G = (V, E, l0 )
Output: l|V|−1
1 for t = 0, 1, . . . ,|V| − 1 do 
2 lt+1 ← hash [ltv , ltu∈Nv ] : v ∈ V // [·] is a multiset, allowing repetitions

Algorithm: Assuming node features l from a totally ordered space L


 
1 Function hash [lv , lu∈Nv ] : v ∈ V :

2 for v ∈ V do 
3 sort lu∈Nv // sort the neighbors
4 add lv as prefix to the sorted list [lv , lu∈Nv ] // lv does not participate in sorting!

5     lv+ ← f([lv , lu∈Nv ])   // f : L∗ → L strictly increasing w.r.t. lexicographic order


We follow Shervashidze et al. (2011) to explain the Weisfeiler-Lehman (WL) iterative color refinement
algorithm. Consider a graph G = (V, E, l) with node feature lv in some totally ordered space L for each node
v ∈ V. For instance, we may simply set lv ≡ 1 and L = {1, 2, . . . , |V|} (a.k.a. colors) if no better information
is available. Then, for each node (in parallel) we repeatedly aggregate information from its neighbors and
reassign its node feature using a hash function (which may change from iteration to iteration).
A typical choice for the hash function is illustrated above, based on sorting the neighbors and using
a strictly increasing function f : L∗ → L that maps the smallest neighborhood [lv , lu∈Nv ] to the smallest
element in L, and so on and so forth. (Note that in this convention f may change in different iterations in
WL). By construction, the node feature lv for any node will never decrease (thanks to the monotonicity of
f ). W.l.o.g. we may identify L = {1, 2, . . . , |V|}, from which we see that the algorithm need only repeat for
at most |V| iterations: |V|² ≥ Σv lv ≥ |V| and each non-vacuous update increases the sum by at least 1. If
we maintain a histogram on the alphabet L, then we may early stop the algorithm when the histogram stops
changing. WL can be implemented in almost linear time (e.g. Berkholz et al. 2017).
As mentioned in this historic comment, WL was motivated by applications in computational chemistry,
where a precursor already appeared in Morgan (1965). An interesting story about Andrey Lehman is
available here while an unsettling story about the disappearance of Boris Weisfeiler is available here.
Weisfeiler, Boris and Andrey Lehman (1968). “The reduction of a graph to canonical form and the algebra which
appears therein”. Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16.
Shervashidze, Nino, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt (2011).
“Weisfeiler-Lehman Graph Kernels”. Journal of Machine Learning Research, vol. 12, no. 77, pp. 2539–2561.
Berkholz, C., P. Bonsma, and M. Grohe (2017). “Tight Lower and Upper Bounds for the Complexity of Canonical
Colour Refinement”. Theory of Computing Systems, vol. 60, pp. 581–614.
Morgan, H. L. (1965). “The Generation of a Unique Machine Description for Chemical Structures-A Technique
Developed at Chemical Abstracts Service”. Journal of Chemical Documentation, vol. 5, no. 2, pp. 107–113.
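
A compact implementation of 1-WL refinement (a sketch: the hash here is the usual relabel-by-dictionary trick rather than the sorted-prefix function f above), together with the disjoint-union isomorphism test discussed next:

```python
def wl_colors(adj, init=None):
    """adj: dict node -> list of neighbors. Returns a stable 1-WL coloring."""
    colors = {v: (init[v] if init else 1) for v in adj}
    for _ in range(len(adj)):                    # at most |V| rounds suffice
        # signature = own color + multiset (sorted tuple) of neighbor colors
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())), start=1)}
        new_colors = {v: relabel[sig[v]] for v in adj}
        if new_colors == colors:                 # coloring is stable
            break
        colors = new_colors
    return colors

# glue two graphs into one and compare color histograms (the test of Algorithm 11.24)
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}                 # C_6
tri2 = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}  # two triangles
union = {("a", v): [("a", u) for u in cycle6[v]] for v in cycle6}
union.update({("b", v): [("b", u) for u in tri2[v]] for v in tri2})
c = wl_colors(union)
hist_a = sorted(c[("a", v)] for v in cycle6)
hist_b = sorted(c[("b", v)] for v in tri2)
print(hist_a == hist_b)   # True although the graphs are not isomorphic: a classic 1-WL failure
```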

Algorithm 11.24: Graph isomorphism test

Testing whether two graphs are isomorphic is one of the few surprising problems in NP that we do not know
if it is in NPC or P. The WL Algorithm 11.23 immediately leads to an early test for graph isomorphism:
we simply “glue” the two input graphs as disjoint components into one graph and start with trivial labeling
lv ≡ 1. Run WL Algorithm 11.23. If at some iteration the histograms on the two components/graphs differ,
then we claim “non-isomorphic.” Otherwise we classify as “possibly isomorphic.”
The above test was mistakenly believed to be a solution to graph isomorphism (Weisfeiler and Lehman
1968) but soon counterexamples were found. Nevertheless, Babai and Kucera (1979) and Babai et al. (1980)
proved that for almost all graphs, the WL test is valid. The exact power of the WL test has been characterized
in Arvind et al. (2015) and Kiefer et al. (2015).
Weisfeiler, Boris and Andrey Lehman (1968). “The reduction of a graph to canonical form and the algebra which
appears therein”. Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16.
Babai, L. and L. Kucera (1979). “Canonical labelling of graphs in linear average time”. In: 20th Annual Symposium
on Foundations of Computer Science, pp. 39–46.
Babai, László, Paul Erdös, and Stanley M. Selkow (1980). “Random Graph Isomorphism”. SIAM Journal on Com-
puting, vol. 9, no. 3, pp. 628–635.
Arvind, V., Johannes Köbler, Gaurav Rattan, and Oleg Verbitsky (2015). “On the Power of Color Refinement”. In:
Fundamentals of Computation Theory, pp. 339–350.
Kiefer, Sandra, Pascal Schweitzer, and Erkal Selman (2015). “Graphs Identified by Logics with Counting”. In: Math-
ematical Foundations of Computer Science, pp. 319–330.

Algorithm 11.25: High dimensional WL (e.g. Grohe 2017; Weisfeiler 1976, §O)

For any k ≥ 2, we may lift the WL algorithm by considering k-tuples of nodes v in V k . Variations on the
neighborhood Nv include:
• WLk : Nv := [Nv,1 , . . . , Nv,k ], where Nv,j = [u ∈ V k : u\j = v\j ].
• fWLk : Nv := [Nv,u : u ∈ V], where Nv,u = [(u, v2 , . . . , vk ), (v1 , u, . . . , vk ), . . . , (v1 , v2 , . . . , u)].


• sWLk (Morris et al. 2019): Nv := [u ∈ V k : |u ∩ v| = k − 1].

We initialize k-tuples u and v with the same node feature (color) if the (ordered) subgraphs they induce are isomorphic (and with the same node features inherited from the original graph). The WL Algorithm 11.23
will be denoted as WL1 ; see (Grohe 2017, p. 84) on how to unify the description.
It is known that WLk+1 is as powerful as fWLk (Grohe and Otto 2015). For k ≥ 2, WLk+1 is strictly
more powerful than WLk (Cai et al. 1992; Grohe and Otto 2015, Observation 5.13 and Theorem 5.17), while
WL1 is equivalent to WL2 (Cai et al. 1992; Grohe and Otto 2015). Moreover, sWLk is strictly weaker than
WLk (Sato 2020, page 15).
Grohe, Martin (2017). Descriptive Complexity, Canonisation, and Definable Graph Structure Theory. Cambridge
University Press.
Weisfeiler, Boris (1976). On Construction and Identification of Graphs. Springer.
Morris, Christopher, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and
Martin Grohe (2019). “Weisfeiler and Leman Go Neural Higher-Order Graph Neural Networks”. In: Proceedings
of the AAAI Conference on Artificial Intelligence.
Grohe, Martin and Martin Otto (2015). “Pebble Games and Linear Equations”. The Journal of Symbolic Logic, vol. 80,
no. 3, pp. 797–844.
Cai, J., M. Fürer, and N. Immerman (1992). “An optimal lower bound on the number of variables for graph identifi-
cation”. Combinatorica, vol. 12, pp. 389–410.
Sato, Ryoma (2020). “A Survey on The Expressive Power of Graph Neural Networks”.

Remark 11.26: The connection between WL and GCN

The similarity between WL Algorithm 11.23 and GCN is recognized in (Kipf and Welling 2017). Indeed,
consider the following specialization of the hash function in Algorithm 11.23:
l_v^{l+1} = σ( ( 1/(dv+1) · l_v^l + Σ_{u∈Nv} a_{vu}/√((dv+1)(du+1)) · l_u^l ) W ),

which is exactly the GCN update in (11.4) (with the identification Xv: = lv ). From this observation we
see that even with random weights W , GCN may still be able to extract useful node features, as confirmed
through an example in (Kipf and Welling 2017, Appendix A.1).
More refined and exciting findings along this connection have appeared in Xu et al. (e.g. 2019), Maron
et al. (2019), Morris et al. (2019), and Sato (2020) lately. See also Kersting et al. (2009) and Kersting et al.
(2014) for applications to graphical models.
Kipf, Thomas N. and Max Welling (2017). “Semi-Supervised Classification with Graph Convolutional Networks”. In:
International Conference on Learning Representations.
Xu, Keyulu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka (2019). “How Powerful are Graph Neural Networks?”
In: International Conference on Learning Representations.
Maron, Haggai, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman (2019). “Provably Powerful Graph Networks”.
In: Advances in Neural Information Processing Systems 32, pp. 2156–2167.
Morris, Christopher, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and
Martin Grohe (2019). “Weisfeiler and Leman Go Neural Higher-Order Graph Neural Networks”. In: Proceedings
of the AAAI Conference on Artificial Intelligence.
Sato, Ryoma (2020). “A Survey on The Expressive Power of Graph Neural Networks”.
Kersting, Kristian, Babak Ahmadi, and Sriraam Natarajan (2009). “Counting Belief Propagation”. In: Proceedings of
the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pp. 277–284.
Kersting, Kristian, Martin Mladenov, Roman Garnett, and Martin Grohe (2014). “Power Iterated Color Refinement”.
In: The Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI).


12 k-Nearest Neighbors (kNN)


Goal

Understand k-nearest neighbors for classification and regression. Relation to Bayes error.

Alert 12.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Definition 12.2: Distance

Given a domain X ⊆ Rd , we define a distance metric dist : X × X → R+ as any function that satisfies the
following axioms:
• nonnegative: dist(x, z) ≥ 0;
• identity: dist(x, z) = 0 iff x = z;
• symmetric: dist(x, z) = dist(z, x);

• triangle inequality: dist(x, z) ≤ dist(x, y) + dist(y, z).


We call the space X equipped with a distance metric dist a metric space, with notation (X, dist).
If we relax the “iff” part in identity to “if” then we obtain pseudo-metric; if we drop symmetry we obtain
quasi-metric; and finally if we drop the triangle inequality we get semi-metric.

Exercise 12.3: Example distances

Given any norm ∥ · ∥ on a vector space V, it immediately induces a distance metric:

dist∥·∥ (x, z) = ∥x − z∥.

Verify by yourself dist∥·∥ is indeed a distance metric.


In particular, for the ℓp norm defined in Definition 1.25, we obtain the ℓp distance.
Another often used “distance” is the cosine similarity:

∠(x, z) = x⊤z / (∥x∥2 · ∥z∥2).

Is it a distance metric?

Remark 12.4: kNN in a nutshell

Given a metric space (X, dist) and a dataset D = *(x1 , y1 ) . . . , (xn , yn )+, where xi ∈ X, upon receiving a new
instance x ∈ X, it is natural to find near neighbors (e.g. “friends”) in our dataset D according to the metric
dist and predict ŷ(x) according to the y-values of the neighbors. The underlying assumption is
neighboring feature vectors tend to have similar or same y-values.
The subtlety of course lies on what do we mean by neighboring, i.e., how do we choose the metric dist.


Remark 12.5: The power of an appropriate metric

Suppose we have (X, Y ) following some distribution on X × Y, where the target space Y is equipped with
some metric disty (acting as a measure of our prediction error). Then, we may define a (pseudo)metric on
X as:

distx (x, x′ ) := E[disty (Y, Y ′ )|X = x, X′ = x′ ],

where (X′ , Y ′ ) is an independent copy of (X, Y ). (Note that distx (x, x) = 0 may not hold.) Given a test
instance X = x, if we can find a near neighbor X′ = x′ so that distx (x, x′ ) ≤ ϵ, then predicting Y (x)
according to Y (x′ ) gives us at most ϵ error:

E[disty (Y (X), Y (X′ ))] = E[distx (X, X′ )] ≤ ϵ.

Of course, we would not be able to construct the distance metric distx in practice, as it depends on the
unknown distribution of our data.

Algorithm 12.6: kNN

Given a dataset D = *(x1 , y1 ), . . . , (xn , yn )+, where xi ∈ (X, dist) and yi ∈ Y, and a test instance x, we
predict according to the knn algorithm:
Algorithm: kNN
Input: Dataset D = *(xi , yi ) ∈ X × Y : i = 1, . . . , n+, new instance x ∈ X, hyperparameter k
Output: y = y(x)
1 for i = 1, 2, . . . , n do
2 di ← dist(x, xi ) // avoid for-loop if possible
3 find indices i1 , . . . , ik of the k smallest entries in d
4 y ← aggregate(yi1 , . . . , yik )
For different target space Y, we may use different aggregations:

• multi-class classification Y = {1, . . . , c}: we can perform majority voting

y ← argmax_{j=1,...,c} #{y_{i_l} = j : l = 1, . . . , k},    (12.1)

where ties can be broken arbitrarily.

• regression: Y = Rm : we can perform averaging


y ← (1/k) ∑_{l=1}^{k} y_{i_l}.    (12.2)

Strictly speaking, there is no training time in kNN as we need only store the dataset D. For testing, it
costs O(nd) as we have to go through the entire dataset to compute all distances to the test instance. There
is a large literature that aims to bring down this complexity in test time by pre-processing our dataset and
often by contending with near (but not necessarily nearest) neighbors (see e.g. Andoni and Indyk (2008)).
Andoni, Alexandr and Piotr Indyk (2008). “Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in
High Dimensions”. Communications of the ACM, vol. 51, no. 1, pp. 117–122.
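To make the test-time procedure concrete, here is a minimal NumPy sketch of the brute-force O(nd) prediction above (the function and variable names are ours and purely illustrative; for large datasets one would rely on the approximate-neighbor methods surveyed by Andoni and Indyk (2008)):

import numpy as np

def knn_predict(X_train, y_train, x, k=3, classification=True):
    d = np.linalg.norm(X_train - x, axis=1)      # distances to the test instance (ℓ2 metric)
    idx = np.argsort(d)[:k]                      # indices of the k nearest neighbors
    if classification:
        labels, counts = np.unique(y_train[idx], return_counts=True)
        return labels[np.argmax(counts)]         # majority vote as in (12.1)
    return y_train[idx].mean(axis=0)             # averaging as in (12.2)

# tiny usage example with two well-separated classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))    # -> 0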


Exercise 12.7: The power of weights

More generally, suppose we also have a distance metric disty on Y; then we may set

π ← argmin_π ∑_{i=1}^n wi^↓ · distx(x, xπ(i))    (12.3)

y ← argmin_{y∈Y} ∑_{i=1}^n vi^↓ · disty^2(y, yπ(i)),    (12.4)

where π : [n] → [n] is a permutation, and w1 ≥ w2 ≥ · · · ≥ wn ≥ 0, v1 ≥ v2 ≥ · · · ≥ vn ≥ 0 are weights


(e.g. how much each training instance should contribute to the final result). We may also use disty in (12.4)
(without squaring). A popular choice is to set vi ∝ 1/dπ(i) so that nearer neighbors will contribute more to
predicting y.
Prove that with the following choices we recover (12.1) and (12.2) from (12.3)-(12.4), respectively:
• Let Y = {1, . . . , c} and disty(y, y′) = 0 if y = y′ and 1 otherwise (the discrete distance). Use the kNN weights w = v = (1, . . . , 1, 0, . . . , 0), with exactly k leading ones.

• Let Y = Rm and disty (y, y′ ) = ∥y − y′ ∥2 be the ℓ2 distance.

Remark 12.8: Effect of k

Intuitively, using a larger k would give us more stable predictions (if we vary the training dataset), as we
are averaging over more neighbors, corresponding to smaller variance but potentially larger bias (see ??):
• If we use k = n, then we always predict the same target irrespective of the input x, which clearly does not vary at all (zero variance) but may incur a large bias.
• Indeed, if we have a dataset where different classes are well separated, then using a large k can bring
significant bias while 1NN achieves near 0 error.
In practice we may select k using cross-validation (see Algorithm 2.33). For a moderately large dataset,
typically k = 3 or 5 suffices. A rule of thumb is we use larger k for larger and more difficult datasets.
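For instance, assuming scikit-learn is available, one can pick k by 5-fold cross-validation along the following lines (a sketch for illustration only; the dataset and the hyperparameter grid are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in (1, 3, 5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()   # average validation accuracy
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)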

Theorem 12.9: kNN generalization error (Biau and Devroye 2015)

Let k be odd and fixed. Then, for all distributions of (X, Y ), as n → ∞,


" k   #
X k  
l k−l k k
LkNN := Pr[hn (X) ̸= Y ] → E r (X)(1 − r(X)) r(X)Jl < 2 K + (1 − r(X))Jl ≥ 2 K ,
l
l=0

where the knn classifier hn is defined in (12.5) and r(x) := Pr[Y = 1|X = x] is the regression function.

i.i.d. i.i.d.
Proof. Let X1 , . . . , Xn ∼ X and let Yi = JUi ≤ r(Xi )K, where Ui ∼ Uniform([0, 1]). Clearly, (Xi , Yi , Ui )
form an i.i.d. sequence where (Xi , Yi ) ∼ (X, Y ). Let Dn = *(Xi , Yi , Ui ), i = 1, . . . , n+. Fixing x, define
Ỹi (x) = JUi ≤ r(x)K. Order X(i) (x), Y(i) (x), Ỹ(i) (x) and U(i) (x) according to the distance dist(Xi , x).
Consider the classifiers:
( Pk ( Pk
1, if l=1 Y(l) (x) > k/2 1, if l=1 Ỹ(l) (x) > k/2
hn (x) = , h̃n (x) = . (12.5)
0, o.w. 0, o.w.


Then, we have
" k k
#
X X
Pr[hn (X) ̸= h̃n (X)] ≤ Pr Y(l) (X) ̸= Ỹ(l) (X)
l=1 l=1
h   i
≤ Pr Y(1) (X), . . . , Y(k) (X) ̸= Ỹ(1) (X), . . . , Ỹ(k) (X)
" k
#
[
≤ Pr Jr(X(l) (X)) ∧ r(X) < U(l) (X) ≤ r(X(l) (X)) ∨ r(X)K
l=1
k
n→∞
X
≤ E r(X(l) (X)) − r(X) −→ 0, see Stone’s Lemma 12.13 below.
l=1

Recall that L(hn ) := Pr(hn (X) ̸= Y |D) and similarly for L(h̃n ). Thus,

E L(hn ) − L(h̃n ) ≤ Pr[hn (X) ̸= h̃n (X)] = o(1),

i.i.d.
whereas noting that given x, Ỹl (x) ∼ Bernoulli(r(x)), hence

EL(h̃n ) = Pr Binomial(k, r(X)) > k2 , Y = 0 + Pr Binomial(k, r(X)) ≤ k2 , Y = 1


   

= E (1 − r(X))JBinomial(k, r(X)) > k2 K + r(X)JBinomial(k, r(X)) ≤ k2 K .


 

Combining the above completes the proof.


The proof above exploits the beautiful decoupling idea: Y(i) ’s, which the kNN classifier gn depends on, are
coupled through the ordering induced by the Xi ’s. On the other hand, Ỹ(i) ’s are independent (conditioned
on X = x) hence allow us to analyze the closely related classifier g̃n with ease. Stone’s Lemma 12.13 adds
the final piece that establishes the asymptotic equivalence of the two classifiers.
Biau, Gérard and Luc Devroye (2015). Lectures on the Nearest Neighbor Method. Springer.

Corollary 12.10: 1NN ≤ 2×Bayes (Cover and Hart 1967)

For n → ∞, we have

LBayes ≤ L1NN ≤ 2LBayes (1 − LBayes ) ≤ 2LBayes ,

and L3NN = E[r(X)(1 − r(X))] + 4E[r2 (X)(1 − r(X))2 ].

Proof. For k = 1, it follows from Theorem 12.9 that


L1NN = 2E[r(X)(1 − r(X))]
whereas the Bayes error is
LBayes = E[r(X) ∧ (1 − r(X))].
Therefore, letting s(x) = r(x) ∧ (1 − r(x)), we have
L1NN = 2E[s(X)(1 − s(X))] = 2Es(X) · E(1 − s(X)) − 2 · Variance(s(X)) ≤ 2LBayes (1 − LBayes ).
The formula for L3NN follows immediately from Theorem 12.9.
We note that for trivial problems where LBayes = 0 or LBayes = 1/2, L1NN = LBayes . On the other hand,
when the Bayes error is small, L1NN ∼ 2LBayes while L3NN ∼ LBayes .
Cover, T. M. and P. E. Hart (1967). “Nearest Neighbor Pattern Classification”. IEEE Transactions on Information
Theory, vol. 13, no. 1, pp. 21–27.


Proposition 12.11: Continuity

Let f : Rd → R be (Lebesgue) integrable. If k/n → 0, then


E[ (1/k) ∑_{l=1}^k | f(X(l)(X)) − f(X) | ] → 0,

where X(i)(X) is ordered by the distance ∥Xi − X∥2 and Xi ∼ X for i = 1, . . . , n.

Proof. Since Cc is dense in L1, we may approximate f by a (uniformly) continuous function fϵ with compact support. In particular, for ϵ > 0 there exists δ > 0 such that dist(x, z) ≤ δ =⇒ |fϵ(x) − fϵ(z)| ≤ ϵ. Thus,

E[(1/k) ∑_{l=1}^k |f(X(l)(X)) − f(X)|]
    ≤ E[(1/k) ∑_{l=1}^k |f(X(l)(X)) − fϵ(X(l)(X))|] + E[(1/k) ∑_{l=1}^k |fϵ(X(l)(X)) − fϵ(X)|] + E|fϵ(X) − f(X)|
    ≤ (γd + 2) E|f(X) − fϵ(X)| + 2∥fϵ∥∞ · Pr[dist(X(k), X) > δ] + ϵ        (Stone’s Lemma 12.13)
    ≤ (γd + 2)ϵ + 2∥fϵ∥∞ · Pr[dist(X(k), X) > δ]
    ≤ (γd + 3)ϵ,   thanks to Theorem 12.12 when n is large.

The proof is complete by noting that ϵ is arbitrary.

Theorem 12.12: projection through kNN

Fix x and define ρ = dist(x, suppµ) where suppµ is the support of some measure µ. If k/n → 0, then almost
surely

dist(X(k) (x), x) → ρ,

where Xi ∼ µ i.i.d. and X(i) is ordered by dist(Xi, x), i = 1, . . . , n.

Proof. Fix any ϵ > 0 and let p = Pr(dist(X, x) ≤ ϵ + ρ) > 0. Then, for large n,

Pr(dist(X(k), x) − ρ > ϵ) = Pr( ∑_{i=1}^n Bi < k ),   where Bi ∼ Bernoulli(p) i.i.d.
                          = Pr( (1/n) ∑_{i=1}^n (Bi − p) < k/n − p )
                          ≤ exp( −2n(p − k/n)^2 ).


Since p > 0 and k/n → 0, the theorem follows.

Let X ∼ µ be another independent copy; then with k/n → 0:

dist(X(k), X) → 0 almost surely.

Indeed, for µ-almost all x (so that ρ = 0) and large n, we have

Pr( sup_{m≥n} dist(X(k,m)(x), x) ≥ ϵ ) ≤ ∑_{m≥n} exp(−mp^2) → 0 as n → ∞.


Lemma 12.13: Stone’s Lemma (Stone 1977)

Let (w1^{(n)}, . . . , wn^{(n)}) be a probability vector with w1^{(n)} ≥ · · · ≥ wn^{(n)} for all n. Then, for any integrable function f : R^d → R,

E[ ∑_{i=1}^n wi^{(n)} |f(X(i)(X))| ] ≤ (1 + γd) E|f(X)|,

where Xi’s are i.i.d. copies of X, X(i)’s are ordered by ∥Xi − X∥2, and γd < ∞ only depends on d.

Proof. Define

Wi^{(n)}(x) := Wi^{(n)}(x; x1, . . . , xn) := wk^{(n)}   if xi is the k-th nearest neighbor of x (ties broken by index).

We first prove

∑_{i=1}^n Wi^{(n)}(xi; x1, . . . , xi−1, x, xi+1, . . . , xn) ≤ (1 + γd).    (12.6)

Cover Rd with γd angular cones Kt , t = 1, . . . , γd , each with angle π/12. Let A = {i : xi = x} and
Bt = {i : xi ∈ (Kt + x) \ {x}}. Choose any a, b ∈ Bt such that 0 < ∥xa − x∥ ≤ ∥xb − x∥, then

∥xa − xb ∥2 ≤ ∥xa − x∥2 + ∥xb − x∥2 − 2∥xa − x∥∥xb − x∥ cos(π/6) < ∥xb − x∥2 . (12.7)

Therefore, if xb is the k-th closest to x among x_{Bt}, then x is at best the k-th closest to xb among {x} ∪ x_{Bt\{b}}. Since the weights wi^{(n)} are ordered, we have

∑_{i∈Bt} Wi^{(n)}(xi; x1, . . . , xi−1, x, xi+1, . . . , xn) ≤ ∑_{i=1}^{n−|A|} wi^{(n)} ≤ 1,
∑_{i∈A} Wi^{(n)}(xi; x1, . . . , xi−1, x, xi+1, . . . , xn) = ∑_{i=1}^{|A|} wi^{(n)} ≤ 1.

Taking unions over the γd angular cones proves (12.6). Therefore,

E[ ∑_{i=1}^n wi^{(n)} |f(X(i)(X))| ] = E[ ∑_{i=1}^n Wi^{(n)}(X) |f(Xi)| ]
    = E[ |f(X)| ∑_{i=1}^n Wi^{(n)}(Xi; X1, . . . , Xi−1, X, Xi+1, . . . , Xn) ]      (symmetrization)
    ≤ (1 + γd) E|f(X)|.

Here γd is the covering number of Rd by angular cones:

K(z, θ) := {x ∈ Rd : ∠(x, z) ≤ θ}.

The proof above relies on the ℓ2 distance only in (12.7).


Stone, Charles J. (1977). “Consistent Nonparametric Regression”. The Annals of Statistics, vol. 5, no. 4, pp. 595–620.


Theorem 12.14: No free lunch (Shalev-Shwartz and Ben-David 2014)

Let h be any classifier learned from a training set Dn with size n ≤ |X|/2. Then, there exists a distribution
(X, Y ) over X × {0, 1} such that the Bayes error is zero while

Pr[h(X; Dn) ≠ Y] ≥ 1/4.

In particular, with probability at least 1/7 over the training set Dn we have Pr[h(X; Dn) ≠ Y | Dn] ≥ 1/8.

Proof. We may assume w.l.o.g. that |X| = 2n. Enumerate all T = 2^{2n} functions ht : X → {0, 1}, each of which induces a distribution where X ∈ X is uniformly random while Y = ht(X). For each labeling function ht, we have S = (2n)^n possible training sets Dn(s, t). Thus,

max_{t∈[T]} (1/S) ∑_{s=1}^S Pr[h(X; Dn(s, t)) ≠ ht(X)]
    ≥ (1/T) ∑_{t=1}^T (1/S) ∑_{s=1}^S Pr[h(X; Dn(s, t)) ≠ ht(X)]
    ≥ min_{s∈[S]} (1/T) ∑_{t=1}^T Pr[h(X; Dn(s, t)) ≠ ht(X)]
    ≥ min_{s∈[S]} (1/T) ∑_{t=1}^T (1 / (2|X \ Dn(s, t)|)) ∑_{xi∈X\Dn(s,t)} Jh(xi; Dn(s, t)) ≠ ht(xi)K
    = min_{s∈[S]} (1 / (2|X \ Dn(s, t)|)) (1/T) ∑_{t=1}^T ∑_{xi∈X\Dn(s,t)} Jh(xi; Dn(s, t)) ≠ ht(xi)K
    ≥ 1/4,

since we apparently have

Jh(xi ; Dn (s, t)) ̸= ht (xi )K + Jh(xi ; Dn (s, τ )) ̸= hτ (xi )K = 1,

for two labeling functions ht and hτ which agree on x iff x ∈ Dn .


Let c > 1 be arbitrary. Consider the uniform grid X in the cube [0, 1]d with 1/c distance between
neighbors. Clearly, there are (c + 1)^d points in X. If our training set is smaller than (c + 1)^d /2, then kNN
suffers at least 1/4 error while the Bayes error is 0! Thus, the condition n → ∞ in Theorem 12.9 can be
very unrealistic in high dimensions!
Shalev-Shwartz, Shai and Shai Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms.
Cambridge University Press.


13 Decision Trees
Goal

Define and understand the classic decision trees. Bagging and random forest.

Alert 13.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.


14 Boosting
Goal

Understand the ensemble method for combining classifiers. Bagging, Random Forest, and the celebrated
Adaboost.

Alert 14.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Remark 14.2: Together and stronger?

Often it is possible to train a variety of different classifiers for a particular problem at hand, and a lot of
time, energy and discussion are spent on debating and choosing the most appropriate classifier. This makes
sense when the classifiers are “expensive” to obtain (be it computationally or financially or resourcefully).
Putting operational costs aside, however, is it possible to combine a bunch of classifiers and get better
performance, for instance when compared against the best classifier in the gang? Of course, one usually does
not know which classifier in the gang is “best” (unless when we try all of them out).

Remark 14.3: The power of aggregation

To motivate our development, let us consider an ideal scenario where we have a collection of classifiers
{ht : X → {0, 1}, t = 1, 2, . . . , 2T + 1} for a binary classification problem (where we encode the labels
as {0, 1}). Conditional on a given test sample (X, Y ), we assume the classifiers ht independently achieve
accuracy

p := Pr(ht(X) = Y | X, Y) > 1/2.

(For instance, if classifier ht is trained on an independent dataset Dt.) We predict according to the majority:

h(X) = 1 if #{t : ht(X) = 1} ≥ T + 1, and h(X) = 0 if #{t : ht(X) = 0} ≥ T + 1.

What is the accuracy of this “meta-classifier” h? Simple calculation reveals:


Pr(h(X) = Y | X, Y) = Pr( ∑_{t=1}^{2T+1} Jht(X) = YK ≥ T + 1 | X, Y ) = ∑_{k=T+1}^{2T+1} (2T+1 choose k) p^k (1 − p)^{2T+1−k}

    = Pr( [1/√((2T+1)p(1−p))] ∑_{t=1}^{2T+1} (Jht(X) = YK − p) ≥ [T + 1 − p(2T+1)] / √((2T+1)p(1−p)) | X, Y )

    → 1 − Φ( [T + 1 − p(2T+1)] / √((2T+1)p(1−p)) ) → 1   as T → ∞ (using 2p > 1),

where Φ : R → [0, 1] is the cdf of a standard normal distribution, and the convergence follows from the
central limit theorem.
Therefore, provided that we can combine a large number of conditionally independent classifiers, each of
which is slightly better than random guessing, we can approach perfect accuracy! The caveat of course is
the difficulty (or even impossibility) in obtaining decent independent classifiers in the first place.
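As a quick numerical illustration (a sketch assuming SciPy; the particular numbers are ours): with p = 0.6 and 2T + 1 = 101 conditionally independent classifiers, the majority vote is already correct about 98% of the time.

from scipy.stats import binom

p, T = 0.6, 50
n = 2 * T + 1                        # 101 classifiers
acc = 1.0 - binom.cdf(T, n, p)       # Pr(Binomial(n, p) >= T + 1)
print(acc)                           # roughly 0.98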


Exercise 14.4: Odd vs. even

Perform a similar analysis as in Remark 14.3 for 2T classifiers. Is there any advantage in using the odd
2T + 1 over the even 2T ?

Algorithm 14.5: Bootstrap Aggregating (Bagging, Breiman 1996)

One way to achieve (approximately) independent classifiers is to simply train them on independent datasets,
which in most (if not all) applications is simply a luxury. However, we can use the same bootstrapping idea
as in cross-validation (cf. Algorithm 2.33):
Algorithm: Bagging predictors.
Input: training set D, number of repetitions T
Output: meta-predictor h
1 for t = 1, 2, . . . , T do // in parallel
2 sample a new training set Dt ⊆ D // with or without replacement
3 train predictor ht on Dt
4 h ← aggregate(h1 , . . . , hT ) // majority vote for classification; average for regression
There are two ways to sample a new training set:
• Sample with replacement: copy a (uniformly) random element from D to Dt and repeat |D| times.
Usually there will be repetitions of the same element in Dt (which has no effect on most machine
learning algorithms). This is the common choice.
• Sample without replacement: “cut” a (uniformly) random element from (a copy of) D to Dt and repeat
say 70% × |D| times. There will be no repetitions in each Dt . Note that had we repeated |D| times
(just as in sample with replacement) we would have Dt ≡ D, which is not very useful.

Of course the training sets {Dt : t = 1, . . . , T } are not really independent of each other, but aggregating
predictors trained on them usually (but not always) improves performance.
Breiman, Leo (1996). “Bagging Predictors”. Machine Learning, vol. 24, no. 2, pp. 123–140.
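A minimal sketch of the bagging loop with sampling with replacement (names are ours; any classifier object with fit/predict, e.g. a scikit-learn decision tree, can play the role of the base predictor):

import numpy as np

def bagging_fit(X, y, base_learner, T=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)                   # bootstrap: n draws with replacement
        models.append(base_learner().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models]).astype(int)     # (T, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])  # majority vote

# usage sketch (assuming scikit-learn):
#   from sklearn.tree import DecisionTreeClassifier
#   models = bagging_fit(X_train, y_train, DecisionTreeClassifier)
#   y_hat  = bagging_predict(models, X_test)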

Algorithm 14.6: Randomizing output (e.g. Breiman 2000)

Bagging perturbs the training set by taking a random subset. We can also perturb the training set by simply
adding noise:
• for regression tasks: replace line 2 in the bagging algorithm by “adding small Gaussian noise to each
response yi ”

• for classification tasks: replace line 2 in the bagging algorithm by “randomly flip a small portion of the
labels yi ”
We will come back to this perturbation later. Intuitively, adding noise can prevent overfitting, and
aggregating reduces variance.
Breiman, Leo (2000). “Randomizing outputs to increase prediction accuracy”. Machine Learning, vol. 40, no. 3,
pp. 229–242.

Algorithm 14.7: Random forest (Breiman 2001)

One of the most popular machine learning algorithms is random forest, where we combine bagging and
decision trees. Instead of training a usual decision tree, we introduce yet another layer of randomization:
• during training, at each internal node where we need to split the training samples, we select a random


subset of features and perform the splitting. A different subset of features is used at a different internal
node.
Then we apply bagging and aggregate a bunch of the above “randomized” decision trees.
Breiman, Leo (2001). “Random Forest”. Machine Learning, vol. 45, no. 1, pp. 5–32.
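In practice one rarely implements this from scratch; assuming scikit-learn is available, the following sketch trains such an ensemble (max_features controls the size of the random feature subset considered at each split):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))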

Algorithm 14.8: Hedging (Freund and Schapire 1997)

Algorithm: Hedging.
Input: initial weight vector w1 ∈ Rn++ , discount factor β ∈ [0, 1]
Output: last weight vector wT +1
1 for t = 1, 2, . . . , T do
2 learner chooses probability vector pt = wt / (1⊤wt) // normalization
3 environment chooses loss vector ℓt ∈ [0, 1]^n // ℓt may depend on pt!
4 learner suffers (expected) loss ⟨pt , ℓt ⟩
5 learner updates weights wt+1 = wt ⊙ β^{ℓt} // element-wise product ⊙ and power
6 optional scaling: wt+1 ← ct+1 wt+1 // ct+1 > 0 can be arbitrary

Imagine the following horse racing game: There are n horses in a race, which we repeat for T rounds.
On each round we bet a fixed amount of money, with pit being the proportion we spent on horse i at the t-th
round. At the end of round t we receive a loss ℓit ∈ [0, 1] that we suffer on horse i. Note that the losses can
be completely arbitrary (e.g. no i.i.d. assumption), although for simplicity we assume they are bounded.
How should we place our bets (i.e. pit ) to hedge our risk? Needless to say, we must decide the proportions
pt before we see the losses ℓt . On the other hand, the losses ℓt can be completely adversarial (i.e. depend
on pt ).
The Hedging algorithm gives a reasonable (in fact in some sense optimal) strategy: basically if a horse i
is doing well (badly) on round t, then we spend a larger (smaller) proportion pi,t+1 on it in the next round
t + 1. On a high level, this is similar to the idea behind perceptron (see Section 1).
Note that line 5 continues decreasing each entry in the weight vector (because both β ∈ [0, 1] and
ℓ ∈ [0, 1]). To prevent the weights from underflowing, we can re-scale them by a positive number c > 0, as
shown in the optional line 6. Clearly, this scaling does not change p hence the algorithm (in any essential
way).
Freund, Yoav and Robert E. Schapire (1997). “A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting”. Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139.

(Footnote: Viewer discretion is advised; please do not try at home!)
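A minimal NumPy sketch of the Hedging loop (names are ours; the loss vectors below are random merely for illustration, whereas in general they may be chosen adversarially):

import numpy as np

def hedge(losses, beta=0.9):
    # losses: array of shape (T, n) with entries in [0, 1]
    T, n = losses.shape
    w = np.ones(n)                      # w1 > 0 (multiplicative updates!)
    total = 0.0
    for t in range(T):
        p = w / w.sum()                 # line 2: normalization
        total += p @ losses[t]          # line 4: expected loss suffered
        w = w * beta ** losses[t]       # line 5: multiplicative update
    return total

rng = np.random.default_rng(0)
L = rng.random((1000, 5))               # T = 1000 rounds, n = 5 "horses"
print(hedge(L), L.sum(axis=0).min())    # Hedging's total loss vs. the best horse in hindsight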

Alert 14.9: Multiplicative Updates

The Hedging algorithm belongs to the family of multiplicative updates, where we repeatedly multiplying,
instead of adding (cf. Perceptron in Section 1), our weights by some correction terms. For multiplicative
updates, it is important to start with strictly positive initializations, for a zero weight will remain as zero in
multiplicative updates.

Theorem 14.10: Hedging Guarantee

For any nonempty S ⊆ {1, . . . , n}, we have


L ≤ [ − ln( ∑_{i∈S} p_{i1} ) − (ln β) max_{i∈S} Li ] / (1 − β),

where L := ∑_{t=1}^T ⟨pt, ℓt⟩ is our total (expected) loss and Li := ∑_{t=1}^T ℓ_{it} is the total loss on the i-th horse.


Proof. We first lower bound the sum of weights at the end:


1⊤wT+1 = ∑_{i=1}^n w_{i,T+1} ≥ ∑_{i∈S} w_{i,T+1}            // weights remain nonnegative
        = ∑_{i∈S} w_{i,1} β^{Li}                             // line 5 of Algorithm 14.8   (14.1)
        ≥ β^{max_{i∈S} Li} ∑_{i∈S} w_{i,1}                   // β ∈ [0, 1].   (14.2)

Then, we upper bound the sum of weights:

1⊤wt+1 = ∑_{i=1}^n w_{i,t+1} = ∑_{i=1}^n w_{i,t} β^{ℓ_{i,t}}       // line 5 of Algorithm 14.8
       ≤ ∑_{i=1}^n w_{i,t} (1 − (1 − β)ℓ_{i,t})                     // x^ℓ ≤ 1 + (x − 1)ℓ: x^ℓ concave when ℓ ∈ [0, 1] and x ≥ 0   (14.3)
       = (1⊤wt)[1 − (1 − β) pt⊤ℓt]                                  // line 2 of Algorithm 14.8.

Thus, by telescoping:

(1⊤wT+1) / (1⊤w1) ≤ ∏_{t=1}^T [1 − (1 − β) pt⊤ℓt] ≤ exp( −(1 − β) ∑_{t=1}^T pt⊤ℓt ) = exp[−(1 − β)L],    (14.4)

where we have used the elementary inequality 1 − x ≤ exp(−x) for any x.


Finally, combining the inequalities (14.2) and (14.4) completes the proof.
By inspecting the proof closely, we realize that the same conclusion still holds if we change the update to

wt+1 = wt ⊙ Uβ (ℓt )

as long as the function Uβ satisfies

β^ℓ ≤ Uβ(ℓ) ≤ 1 − (1 − β)ℓ

(so that inequalities (14.1) and (14.3) still hold).

Exercise 14.11: Optional re-scaling has no effect

Show that the same conclusion in Theorem 14.10 still holds even if we include the optional line 6 in Algo-
rithm 14.8.

Corollary 14.12: Comparable to best “horse” in hindsight

If we choose w1 = 1 and |S| = 1, then


∑_{t=1}^T ⟨pt, ℓt⟩ ≤ [ (min_i Li) ln(1/β) + ln n ] / (1 − β).    (14.5)


In addition, if we choose β = 1 / (1 + √((2m)/U)), where U ≥ Lmin := min_i Li and m ≥ ln n, then

(1/T) ∑_{t=1}^T ⟨pt, ℓt⟩ ≤ Lmin/T + √(2mU)/T + (ln n)/T.    (14.6)

If we also choose m = ln n and U = T (in our choice of β), then

(1/T) ∑_{t=1}^T ⟨pt, ℓt⟩ ≤ Lmin/T + √((2 ln n)/T) + (ln n)/T.    (14.7)

Proof. Indeed, for β ∈ (0, 1] we have

2 ≤ 2/β =⇒ 2β ≥ 2 ln β + 2         // 2β − 2 ln β − 2 is decreasing and nonnegative at β = 1
         =⇒ β^2 ≤ 1 + 2β ln β       // β^2 − 2β ln β − 1 is increasing and nonpositive at β = 1
         ⇐⇒ ln(1/β) ≤ (1 − β^2)/(2β).

Plugging our choice of β into (14.5) and applying the above inequality, we obtain (14.6). The bound (14.7) holds because we can choose U = T (recall that we assume ℓ_{i,t} ∈ [0, 1] for all t and i).
Based on a result due to Vovk (1998), Freund and Schapire (1997) proved that the coefficients in (14.5) (i.e., −ln β/(1 − β) and ln n/(1 − β)) cannot be simultaneously improved for any β, using any algorithm (not necessarily Hedging).
In our last setting, β = 1 / (1 + √((2 ln n)/T)), indicating that for a longer game (larger T) we should use a larger β (to discount less aggressively) while for more horses (larger n) we should do the opposite (although this effect is much smaller due to the log).
Vovk, Valadimir (1998). “A Game of Prediction with Expert Advice”. Journal of Computer and System Sciences,
vol. 56, no. 2, pp. 153–173.
Freund, Yoav and Robert E. Schapire (1997). “A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting”. Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139.

Remark 14.13: Appreciating the significance of Hedging

Corollary 14.12 implies that the average loss of Hedging on the left-hand side of (14.7) is no larger than the average loss of the “best horse”, plus a constant term proportional to √((2 ln n)/T) (where we omit the higher order term (ln n)/T). As the number of rounds T goes to infinity, Hedging can compete against the best horse in
hindsight! However, we emphasize three important points:
• It does not mean Hedging will achieve small loss, simply because the best horse may itself achieve a
large average loss, in which case being competitive against the best horse does not really mean much.
• The loss vectors ℓt can be adversarial against the Hedging algorithm! However, if the environment
always tries to “screw up” the Hedging algorithm, then the guarantee in Corollary 14.12 implies that
the environment inevitably also “screws up” the best horse.

• If the best horse can achieve very small loss, i.e. Lmin ≈ 0, then by setting U ≈ 0 we know from (14.6) that Hedging is off by at most a term proportional to (ln n)/T, which is much smaller than the dominating term √((2 ln n)/T) in (14.7).


Remark 14.14: History of Boosting

Schapire (1990) first formally proved the possibility to combine a few mediocre classifiers into a very accurate
one, which was subsequently improved in (Freund 1995) and eventually in (Freund and Schapire 1997), which
has since become the main source of the celebrated Adaboost algorithm. Freund and Shapire received the
Gödel prize in 2003 for this seminal contribution.
Schapire, Robert E. (1990). “The strength of weak learnability”. Machine Learning, vol. 5, no. 2, pp. 197–227.
Freund, Y. (1995). “Boosting a Weak Learning Algorithm by Majority”. Information and Computation, vol. 121,
no. 2, pp. 256–285.
Freund, Yoav and Robert E. Schapire (1997). “A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting”. Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139.

Algorithm 14.15: Adaboost (Freund and Schapire 1997)

Algorithm: Adaptive Boosting.


Input: initial weight vector w1 ∈ R^n_{++}, training set Dn = *(xi, yi)+_{i=1}^n ⊆ R^d × {0, 1}
Output: meta-classifier h̄ : R^d → {0, 1},  x ↦ [[ ∑_{t=1}^T (ln(1/βt))(ht(x) − 1/2) ≥ 0 ]]
1 for t = 1, 2, . . . , T do
2   pt = wt / (1⊤wt) // normalization
3   ht ← WeakLearn(Dn, pt) // t-th weak classifier ht : R^d → [0, 1]
4   ∀i, ℓit = 1 − |ht(xi) − yi| // loss is higher if prediction is more accurate!
5   ϵt = 1 − ⟨pt, ℓt⟩ = ∑_{i=1}^n pit |ht(xi) − yi| // weighted (expected) error ϵt ∈ [0, 1] of ht
6   βt = ϵt/(1 − ϵt) // adaptive discounting parameter βt ≤ 1 ⇐⇒ ϵt ≤ 1/2
7   wt+1 = wt ⊙ βt^{ℓt} // element-wise product ⊙ and power
8 optional scaling: wt+1 ← ct+1 wt+1 // ct+1 > 0 can be arbitrary

Provided that ϵt ≤ 1/2 hence βt ∈ [0, 1] (so the classifier ht is better than random guessing), if ht predicts correctly on a training example xi, then we suffer a larger loss ℓti so that in the next iteration we assign less weight to xi. In other words, each classifier is focused on hard examples that are misclassified by the previous classifier. On the other hand, if ϵt > 1/2 then βt > 1 and we do the opposite.
For the meta-classifier h̄, we first perform a weighted aggregation of the confidences of individual (weak) classifiers and then threshold. Alternatively, we could threshold each weak classifier first and then perform (weighted) majority voting. The former approach, which we adopt here, is found to work better in practice. Note that a classifier ht with lower (training) error ϵt will be assigned a higher weight ln(1/βt) = ln(1/ϵt − 1) in the final aggregation, making intuitive sense.
Freund, Yoav and Robert E. Schapire (1997). “A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting”. Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139.
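A compact NumPy sketch of the algorithm with decision stumps as the weak learner (our own illustrative implementation for labels in {0, 1}; it follows the update rules above but is not tuned for efficiency):

import numpy as np

def stump_predict(stump, X):
    j, thr, pol = stump
    return (X[:, j] >= thr).astype(float) if pol else (X[:, j] < thr).astype(float)

def stump_learn(X, y, p):
    # weighted decision stump: exhaustive search over feature, threshold, polarity
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (0, 1):
                err = p @ np.abs(stump_predict((j, thr, pol), X) - y)
                if err < best_err:
                    best, best_err = (j, thr, pol), err
    return best

def adaboost(X, y, T=10):
    n = len(y)
    w = np.ones(n)
    stumps, alphas = [], []
    for t in range(T):
        p = w / w.sum()                                   # line 2
        stump = stump_learn(X, y, p)                      # line 3 (weak learner)
        h = stump_predict(stump, X)
        loss = 1.0 - np.abs(h - y)                        # line 4
        eps = p @ np.abs(h - y)                           # line 5
        beta = max(eps, 1e-12) / max(1.0 - eps, 1e-12)    # line 6 (clipped to avoid log(0))
        w = w * beta ** loss                              # line 7
        stumps.append(stump)
        alphas.append(np.log(1.0 / beta))                 # aggregation weight ln(1/beta_t)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    score = sum(a * (stump_predict(s, X) - 0.5) for s, a in zip(stumps, alphas))
    return (score >= 0).astype(int)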

Exercise 14.16: Implicitly fixing “bad” classifiers

If ϵt ≥ 1/2 (hence βt ≥ 1), the t-th (weak) classifier is in fact worse than random guessing! In our binary
setting here, a natural idea is to discard ht but use instead h̃t := 1 − ht . Prove the following (all tilde
quantities are w.r.t. the flipped classifier h̃t ):
• ℓ̃t = 1 − ℓt
• ϵ̃t = 1 − ϵt
• β̃t = 1/βt
• The Adaboost algorithm is not changed in any essential way even if we flip all “bad” classifiers. In
particular, we arrive at the same meta-classifier.
In other words, Adaboost implicitly fixes all “bad” classifiers!


Theorem 14.17: Adaboost Guarantee

The following bound on the training error holds for the Adaboost Algorithm 14.15:
∑_{i=1}^n pi1 [[h̄(xi) ≠ yi]] ≤ ∏_{t=1}^T √(4ϵt(1 − ϵt)).

Proof. The proof closely follows what we have seen in the proof of Theorem 14.10. As before, we upper
bound the sum of weights in the same manner:
∥wT+1∥1 / ∥w1∥1 ≤ ∏_{t=1}^T [1 − (1 − βt) pt⊤ℓt] = ∏_{t=1}^T [1 − (1 − βt)(1 − ϵt)].

To lower bound the sum of weights, let S = {i : h̄(xi) ≠ yi}, where recall that h̄ is the meta-classifier constructed in Algorithm 14.15. Proceed as before:

∥wT+1∥1 / ∥w1∥1 ≥ ∑_{i∈S} w_{i,T+1}/∥w1∥1 = ∑_{i∈S} p_{i,1} ∏_{t=1}^T βt^{ℓ_{i,t}} ≥ ∑_{i∈S} p_{i,1} [ ∏_{t=1}^T βt ]^{1/2},

where the last inequality follows from the definition of S and the meta-classifier h̄:

i ∈ S =⇒ 0 ≥ sign(2yi − 1) ∑_t (ln(1/βt))(ht(xi) − 1/2) = ∑_t (ln(1/βt))(1/2 − |ht(xi) − yi|) = ∑_t (ln(1/βt))(ℓ_{i,t} − 1/2).

Combine the upper bound and the lower bound:

∑_{i∈S} p_{i,1} ∏_{t=1}^T √βt ≤ ∏_{t=1}^T [1 − (1 − βt)(1 − ϵt)],   i.e.,   ∑_{i∈S} p_{i,1} ≤ ∏_{t=1}^T [1 − (1 − βt)(1 − ϵt)] / √βt.

Optimizing w.r.t. βt we obtain βt = ϵt /(1 − ϵt ). Plugging it back in we complete the proof.


Importantly, we observe that the training error of the meta-classifier h̄ is upper bounded by the errors
of all classifiers ht : improving any individual classifier leads to a better bound. Moreover, the symmetry on
the right-hand side confirms again that in the binary setting a very “inaccurate” classifier (i.e. large ϵt ) is as
good as a very accurate classifier (i.e. small ϵt ).

Corollary 14.18: Exponential decay of training error

Assume |ϵt − 1/2| > γt; then

∑_{i=1}^n pi1 [[h̄(xi) ≠ yi]] ≤ ∏_{t=1}^T √(1 − 4γt^2) ≤ exp( −2 ∑_{t=1}^T γt^2 ).

In particular, if γt ≥ γ for all t, then

∑_{i=1}^n pi1 [[h̄(xi) ≠ yi]] ≤ exp(−2Tγ^2).

Thus, to achieve ϵ (weighted) training error, we need to combine at most

T = ⌈ (1/(2γ^2)) ln(1/ϵ) ⌉
weak classifiers, each of which is slightly better than random guessing (by a margin of γ).
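For example, with margin γ = 0.1 and target (weighted) training error ϵ = 0.01, it suffices to combine T = ⌈(1/(2 · 0.01)) ln(100)⌉ = ⌈50 × 4.61⌉ = 231 weak classifiers.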


Remark 14.19: Generalization Error of Adaboost

It is possible to bound the generalization error of Adaboost as well. For instance, Freund and Schapire
(1997) bounded the VC dimension of the (family of) meta-classifiers constructed by Adaboost. A standard
application of the VC theory then relates the generalization error with the training error. More refined
analysis can be found in Schapire et al. (1998), Koltchinskii and Panchenko (2002), Koltchinskii et al.
(2003), Koltchinskii and Panchenko (2005), Freund et al. (2004), and Rudin et al. (2007) (just to give a few
pointers).
Freund, Yoav and Robert E. Schapire (1997). “A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting”. Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139.
Schapire, Robert E., Yoav Freund, Peter Bartlett, and Wee Sun Lee (1998). “Boosting the margin: a new explanation
for the effectiveness of voting methods”. The Annals of Statistics, vol. 26, no. 5, pp. 1651–1686.
Koltchinskii, V. and D. Panchenko (2002). “Empirical Margin Distributions and Bounding the Generalization Error
of Combined Classifiers”. The Annals of Statistics, vol. 30, no. 1, pp. 1–50.
Koltchinskii, Vladimir, Dmitriy Panchenko, and Fernando Lozano (2003). “Bounding the generalization error of convex
combinations of classifiers: balancing the dimensionality and the margins”. The Annals of Applied Probability,
vol. 13, no. 1, pp. 213–252.
Koltchinskii, Vladimir and Dmitry Panchenko (2005). “Complexities of convex combinations and bounding the gen-
eralization error in classification”. The Annals of Statistics, vol. 33, no. 4, pp. 1455–1496.
Freund, Yoav, Yishay Mansour, and Robert E. Schapire (2004). “Generalization bounds for averaged classifiers”. The
Annals of Statistics, vol. 32, no. 4, pp. 1698–1722.
Rudin, Cynthia, Robert E. Schapire, and Ingrid Daubechies (2007). “Analysis of boosting algorithms using the smooth
margin function”. The Annals of Statistics, vol. 35, no. 6, pp. 2723–2768.

Alert 14.20: Does Adaboost Overfit? (Breiman 1999; Grove and Schuurmans 1998)

It has long been observed that Adaboost, even after decreasing the training error to zero, continues to
improve test error. In other words, Adaboost does not seem to overfit even when we combine many many
(weak) classifiers.
A popular explanation, due to Schapire et al. (1998), attributes Adaboost’s resistance against overfitting
to margin maximization: after decreasing the training error to 0, Adaboost continues to improve the margin
(i.e. y h̄(x)), which leads to better generalization. However, Breiman (1999) and Grove and Schuurmans
(1998) later designed the LPboost that explicitly maximizes the margin but observed inferior generalization.

Breiman, Leo (1999). “Prediction Games and Arcing Algorithms”. Neural Computation, vol. 11, no. 7, pp. 1493–1517.
Grove, Adam J. and Dale Schuurmans (1998). “Boosting in the Limit: Maximizing the Margin of Learned Ensembles”.
In: Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 692–699.
Schapire, Robert E., Yoav Freund, Peter Bartlett, and Wee Sun Lee (1998). “Boosting the margin: a new explanation
for the effectiveness of voting methods”. The Annals of Statistics, vol. 26, no. 5, pp. 1651–1686.

Algorithm 14.21: Face Detection (Viola and Jones 2004)

Viola and Jones (2004) applied the Adaboost algorithm to real-time face detection and are among the first
few people who demonstrated the power of Adaboost in real challenging applications.
Viola, Paul and Michael J. Jones (2004). “Robust Real-Time Face Detection”. International Journal of Computer
Vision, vol. 57, no. 2, pp. 137–154.

Remark 14.22: Comparison

The following figure illustrates the similarity and difference between bagging and boosting:


To summarize:
• Both bagging and boosting train an ensemble of (weak) classifiers, which are combined in the end to
produce a “meta-classifier”;
• Bagging is amenable to parallelization while boosting is strictly sequential;

• Bagging resamples training examples while boosting reweighs training examples;


• As suggested in Breiman (2004), we can think of bagging as averaging independent classifiers (in order
to reduce variance), while boosting averages dependent classifiers which may be analyzed through
dynamic system and ergodic theory;

• We can of course combine bagging with boosting. For instance, each classifier in boosting can be
obtained through bagging (called bag-boosting by Bühlmann and Yu, see the discussion of Friedman
et al. (2000)).

Breiman, Leo (2004). “Population theory for boosting ensembles”. The Annals of Statistics, vol. 32, no. 1, pp. 1–11.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2000). “Additive logistic regression: a statistical view of
boosting (With discussion and a rejoinder by the authors)”. The Annals of Statistics, vol. 28, no. 2, pp. 337–407.

Exercise 14.23: Diversity


Recall that ϵt(h) := ∑_{i=1}^n pit |h(xi) − yi| is the weighted error of classifier h at iteration t. Assume the (weak) classifiers are always binary-valued (and βt ∈ (0, 1) for all t). Prove:

ϵt+1(ht) ≡ 1/2.

In other words, Adaboost would never choose the same weak classifier twice in a row.


15 Expectation-Maximization (EM) and Mixture Models


Goal

Mixture models for density estimation and the celebrated expectation-maximization algorithm.

Alert 15.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Definition 15.2: Density estimation

The central problem of this note is to estimate a density function (or more generally a probability measure),
through a finite training sample. Formally, we are interested in estimating a probability measure χ from a
(non)parametric family {χθ }θ∈Θ . A typical approach is to minimize some statistical divergence (distance)
between a noisy version χ̂ and χθ :

inf D(χ̂∥χθ ).
θ∈Θ

However, the minimization problem above may not always be easy to solve, and alternative (indirect)
strategies have been developed. As we show below, choosing the KL divergence corresponds to the maximum
likelihood estimation procedure.

Definition 15.3: KL and LK

Recall that the Kullback-Leibler (KL) divergence between two density functions p and q is defined as:

KL(p∥q) = ∫ p(x) log( p(x)/q(x) ) dx.

Reversing the inputs, we obtain the reverse KL divergence:

LK(p∥q) := KL(q∥p).

(In Section 18 we will generalize the above two divergences to a family of f -divergence.)

Definition 15.4: Entropy, conditional entropy, cross-entropy, and mutual information

We define the entropy of a random vector X with pdf p as:


H(X) := −E log p(X) = −∫ p(x) log p(x) dx,

the conditional entropy between X and Z (with pdf q) as:

H(X|Z) := −E log p(X|Z) = −∫ p(x, z) log p(x|z) dx dz,

and the cross-entropy between X and Z as:

†(X, Z) := −E log q(X) = −∫ p(x) log q(x) dx.


Finally, we define the mutual information between X and Z as:


I(X, Z) := KL(p(x, z)∥p(x)q(z)) = ∫ p(x, z) log [ p(x, z) / (p(x)q(z)) ] dx dz

Exercise 15.5: Information theory

Verify the following:

H(X, Z) = H(Z) + H(X|Z)


†(X, Z) = H(X) + KL(X∥Z) = H(X) + LK(Z∥X)
I(X, Z) = H(X) − H(X|Z)
I(X, Z) ≥ 0, with equality iff X independent of Z
KL(p(x, z)∥q(x, z)) = KL(p(z)∥q(z)) + E[KL(p(x|z)∥q(x|z))].

All of the above can obviously be iterated to yield formula for more than two random vectors.

Exercise 15.6: Multivariate Gaussian

Compute
• the entropy of the multivariate Gaussian N (µ, Σ);
• the KL divergence between two multivariate Gaussians N (µ1 , Σ1 ) and N (µ2 , Σ2 ).

Remark 15.7: MLE = KL minimization

Let us define the empirical “pdf” based on a dataset D = *x1 , . . . , xn +:


p̂(x) = (1/n) ∑_{i=1}^n δ_{xi}(x),

where δx is the “illegal” delta mass concentrated at x. Then, we claim that

θMLE = argmin_{θ∈Θ} KL( p̂ ∥ p(x|θ) ).

Indeed, we have

KL(p̂∥p(x|θ)) = ∫ [log p̂(x) − log p(x|θ)] p̂(x) dx = C + (1/n) ∑_{i=1}^n −log p(xi|θ),

where C is a constant that does not depend on θ.

Exercise 15.8: Why KL is so special

To appreciate the uniqueness of the KL divergence, prove the following:


up to a multiplicative constant, log is the only continuous function f : (0, ∞) → R satisfying f(st) = f(s) + f(t).


Algorithm 15.9: Expectation-Maximization (EM) (Dempster et al. 1977)

We formulate EM under the density estimation formulation in Definition 15.2, except that we carry out
the procedure in a lifted space X × Z where Z is the space that some latent random variable Z lives in.
Importantly, we do not observe the latent variable Z: it is “artificially” constructed to aid our job. We fit
our model with a prescribed family of joint distributions

µθ (dx, dz) = ζθ (dz) · Dθ (dx|z) = χθ (dx) · Eθ (dz|x), θ ∈ Θ.

In EM, we typically specify the joint distribution µθ explicitly, and in a way that the posterior distribution
Eθ (dz|x) can be easily computed. Similarly, we “lift” χ(dx) (our target of estimation) to the joint distribution

ν̂(dx, dz) = χ̂(dx) · E(dz|x).

(We use the hat notation to remind that we do not really have access to the true distribution χ but a
sample from it, represented by the empirical distribution χ̂.) Then, we minimize the discrepancy between
the joint distributions ν̂ and µθ , which is an upper bound of the discrepancy of the marginals KL(χ̂∥χθ )
(Exercise 15.5):

inf_{θ∈Θ} inf_{E(dz|x)} KL( ν̂(dx, dz) ∥ µθ(dx, dz) ).

Note that there is no restriction on E (and do not confuse it with Eθ , which is “prescribed”).
The EM algorithm proceeds with alternating minimization:

• (E-step) Fix θt , we solve Et+1 by (recall Exercise 15.5)

inf_E KL( ν̂(dx, dz) ∥ µθt(dx, dz) ) = KL(χ̂∥χθt) + Eχ̂ KL(E∥Eθt),

which leads to the “closed-form” solution:

Et+1 = Eθt .

• (M-step) Fix Et+1 , we solve θt+1 by

inf_{θ∈Θ} KL( ν̂t+1(dx, dz) ∥ µθ(dx, dz) ) = KL(χ̂∥χθ) + Eχ̂ KL(Et+1∥Eθ),
where the first term is the likelihood term and the second term acts as a regularizer.

For the generalized EM algorithm, we need only decrease the above (joint) KL divergence if finding a (local) minimum is expensive. It may be counter-intuitive that minimizing the sum of the two terms above can be easier than minimizing the first (likelihood) term only!
Obviously, the EM algorithm monotonically decreases our (joint) KL divergence KL(ν̂, µθ ). Moreover, thanks
to construction, the EM algorithm also ascends the likelihood:

KL(χ̂∥χθt+1 ) ≤ KL(ν̂t+1 ∥µθt+1 ) ≤ KL(ν̂t+1 ∥µθt ) = KL(χ̂∥χθt ).

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). “Maximum Likelihood from Incomplete Data via the EM
Algorithm”. Journal of the Royal Statistical Society. Series B (Methodological), vol. 39, no. 1, pp. 1–38.

Definition 15.10: Exponential family distribution

The exponential family distributions have the following density form:



p(x) = h(x) exp( ⟨η, T(x)⟩ − A(η) ),


where T (x) is the sufficient statistics, η is the natural parameter, A is the log-partition function, and h
represents the base measure. Since p integrates to 1, we have
A(η) = log ∫ exp( ⟨η, T(x)⟩ ) · h(x) dx.

We can verify that A is a convex function.

Example 15.11: Gaussian distribution in exponential family

Recall that the multivariate Gaussian density is:

p(x) = (2π)^{−d/2} [det(Σ)]^{−1/2} exp( −(1/2)(x − µ)⊤Σ^{−1}(x − µ) )
     = exp( ⟨ (x, −(1/2)xx⊤), (Σ^{−1}µ, Σ^{−1}) ⟩ − (1/2)(µ⊤Σ^{−1}µ + d log(2π) + log det Σ) ).

Thus, we identify

T(x) = (x, −(1/2) xx⊤)
η = (Σ^{−1}µ, Σ^{−1}) =: (ξ, S)
A(µ, Σ) = (1/2)(µ⊤Σ^{−1}µ + d log(2π) + log det Σ)
A(η) = A(ξ, S) = (1/2)(ξ⊤S^{−1}ξ + d log(2π) − log det S).

Example 15.12: Bernoulli and Multinoulli in exponential family

The Bernoulli distribution is given as:

p(z) = π^z (1 − π)^{1−z} = exp( z log π + (1 − z) log(1 − π) )
     = exp( z log(π/(1 − π)) + log(1 − π) ).

Thus, we may identify

T(z) = z
η = log( π/(1 − π) )
A(η) = log(1 + exp(η)).

We can also consider the multinoulli distribution:


p(z) = ∏_{k=1}^c πk^{zk} = exp( ⟨z, log π⟩ )
     = exp( z̃⊤ log( π̃ / (1 − ⟨1, π̃⟩) ) + log(1 − ⟨1, π̃⟩) ),

where recall that z ∈ {0, 1}^c is one-hot (i.e. 1⊤z = 1), and we use the tilde notation to denote the subvector with the last entry removed. Thus, we may identify

T(z̃) = z̃
η̃ = log( π̃ / (1 − ⟨1, π̃⟩) )
A(η̃) = log(1 + ⟨1, exp(η̃)⟩).

(Here, we use the tilde quantities to remove one redundancy since 1⊤ z = 1⊤ π = 1).


Exercise 15.13: Mean parameter and moments

Prove that for the exponential family distribution,

∇A(η) = E[T (X)]


∇2 A(η) = E[T (X) · T (X)⊤ ] − E[T (X)] · E[T (X)]⊤ = Cov(T (X)),

where the last equality confirms again that the log-partition function A is convex (since the covariance matrix
is positive semidefinite).

Exercise 15.14: Marginal, conditional and product of exponential family

Let p(x, z) be a joint distribution from the exponential family. Prove the following:
• The marginal p(x) need not be from the exponential family.
• The conditional p(z|x) is again from the exponential family.
• The product of two exponential family distributions is again in exponential family.

Exercise 15.15: Exponential family approximation under KL

Let p(x) be an arbitrary distribution and qη (x) from the exponential family with sufficient statistics T ,
natural parameter η and log-partition function A. Then,

η∗ := argmin_η KL(p∥qη)

is given by moment-matching:

Ep T (X) = Eqη T (X) = ∇A(η), i.e., η = ∇A−1 (Ep T (X)).

Exercise 15.16: EM for exponential family

Prove that the M-step of EM simplifies to the following, if we assume the joint distribution µη is from the
exponential family with natural parameter η, sufficient statistics T and log-partition function A:

η t+1 = ∇A−1 (Eν̂t+1 (T (X))).

Definition 15.17: Mixture Distribution

We define the joint distribution over a discrete latent random variable Z ∈ {1, . . . , c} and an observed
random variable X:
p(x, z) = ∏_{k=1}^c [πk · pk(x; θk)]^{zk},

where we represent z using one-hot encoding. We easily obtain the marginal and conditional:

p(x) = ∑_{k=1}^c πk · pk(x; θk)    (15.1)


p(z = ek) = πk
p(x|z = ek) = pk(x; θk)
p(z = ek|x) = πk · pk(x; θk) / ∑_{j=1}^c πj · pj(x; θj).

The marginal distribution p(x) can be interpreted as follows: There are c component densities pk . We choose
a component pk with probability πk and then we sample x from the resulting component density. However,
in reality we do not know which component density an observation x is sampled from, i.e., the discrete
random variable Z is not observed (missing).
Let pk (x; θk ) be multivariate Gaussian (with θk denoting its mean and covariance) we get the popular
Gaussian mixture model (GMM).

Algorithm 15.18: Mixture density estimation – MLE

Replacing the parameterization χθ with the mixture model in (15.1) we get a direct method for estimating
the density function χ based on a sample:
min_{π∈∆, θ∈Θ} KL(χ̂∥p),    p(x) = ∑_{k=1}^c πk · pk(x; θk),

where ∆ denotes the simplex constraint (i.e., π ≥ 0 and 1⊤ π = 1). The number of components c is a
hyperparameter that needs to be determined a priori. We may apply (projected) gradient descent to solve
π and θ. However, it is easy to verify that the objective function is nonconvex hence convergence to a
reasonable solution may not be guaranteed.
We record the gradient here for later comparison. We use pW (x, z) for the joint density whose marginal-
ization over the latent z gives p(x). For mixtures, the parameter W includes both π and θ.

∂KL(χ̂∥pW)/∂W = −E_{p̂W(z,x)} [ ∂ log pW(x, z)/∂W ],   where p̂W(z, x) := χ̂(dx) · pW(z|x).

Algorithm 15.19: Mixture density estimation – EM

Let us now apply the EM Algorithm 15.9 to the mixture density estimation problem. As mentioned before,
we minimize the upper bound:
min_{π∈∆, θ∈Θ} min_E KL( ν̂(x, z) ∥ p(x, z) ),    ν̂(x, z) = χ̂(x)E(z|x),    p(x, z) = ∏_{k=1}^c [πk · pk(x; θk)]^{zk},

with the following two steps alternated until convergence:

• E-step: Fix π^{(t)} and θ^{(t)}; we solve

  Et+1 = p^{(t)}(z = ek|x) = πk^{(t)} · pk(x; θk^{(t)}) / ∑_{j=1}^c πj^{(t)} · pj(x; θj^{(t)}) =: rk^{(t+1)}(x).    (15.2)

• M-step: Fix Et+1 hence ν̂t+1; we solve

  min_{π∈∆} min_{θ∈Θ} KL( ν̂t+1(x, z) ∥ p(x, z) ) ≡ max_{π∈∆} max_{θ∈Θ} Eχ̂ E_{Et+1}[ ⟨z, log π⟩ + ⟨z, log p(X; θ)⟩ ]
                                                = max_{π∈∆} max_{θ∈Θ} Eχ̂ [ ⟨r^{(t+1)}(X), log π⟩ + ⟨r^{(t+1)}(X), log p(X; θ)⟩ ].


It is clear that the optimal

π^{(t+1)} = Eχ̂ r^{(t+1)}(X) = ∑_{i=1}^n χ̂(xi) · r^{(t+1)}(xi).    (15.3)

(For us the empirical training distribution χ̂ ≡ 1/n, although we prefer to keep everything abstract and general.)
The θk’s can be solved independently:

max_{θk∈Θk} Eχ̂ [ rk^{(t+1)}(X) · log pk(X; θk) ] ≡ min_{θk∈Θk} KL( χ̂k^{(t+1)} ∥ pk(·; θk) ),    (15.4)

where we define χ̂k^{(t+1)} ∝ χ̂ · rk^{(t+1)}. (This is similar to Adaboost, where we reweigh the training examples!)

If we choose the component density pk(x; θk) from the exponential family with sufficient statistics Tk and log-partition function Ak, then from Exercise 15.15 we know (15.4) can be solved in closed form:

θk^{(t+1)} = ∇Ak^{−1}[ E_{χ̂k^{(t+1)}} Tk(X) ]    (15.5)

Alert 15.20: implicit EM vs. explicit ML

We now make an important connection between EM and ML. We follow the notation in Algorithm 15.18.
For the joint density pW (x, z), EM solves

Wt+1 = argmin_W KL( p̂t+1 ∥ pW ),   where p̂t+1(x, z) := p̂Wt(x, z) = χ̂(dx) · pWt(z|x).

In particular, at a minimizer Wt+1 the gradient vanishes:

−E_{p̂t+1} [ ∂ log pW(x, z)/∂W ] = 0.
In other words, EM solves the above nonlinear equation (in W ) to get Wt+1 while ML with gradient descent
simply performs one fixed-point iteration.

Example 15.21: Gaussian Mixture Model (GMM) – EM

Using the results in multivariate Gaussian Example 15.11 we derive:


∇A(η) = ( S^{−1}ξ, −(1/2)(S^{−1}ξξ⊤S^{−1} + S^{−1}) ) = ( µ, −(1/2)(µµ⊤ + Σ) ),

E_{χ̂^{(t+1)}} T(X) = ( E_{χ̂^{(t+1)}} X, −(1/2) E_{χ̂^{(t+1)}} XX⊤ ) ∝ ∑_{i=1}^n (χ̂ · r^{(t+1)})(xi) ( xi, −(1/2) xi xi⊤ ),

where we have omitted the component subscript k. Thus, from (15.5) we obtain:

µk^{(t+1)} = E_{χ̂k^{(t+1)}} X = ∑_{i=1}^n [ (χ̂ · rk^{(t+1)})(xi) / ∑_{ι=1}^n (χ̂ · rk^{(t+1)})(xι) ] · xi    (15.6)

Σk^{(t+1)} = E_{χ̂k^{(t+1)}} XX⊤ − µk^{(t+1)} (µk^{(t+1)})⊤ = ∑_{i=1}^n [ (χ̂ · rk^{(t+1)})(xi) / ∑_{ι=1}^n (χ̂ · rk^{(t+1)})(xι) ] · (xi − µk^{(t+1)})(xi − µk^{(t+1)})⊤,    (15.7)

where we remind that the empirical training distribution χ̂ ≡ 1/n.


The updates on the “responsibility” r in (15.2), the mixing distribution π in (15.3), the means µk in (15.6), and the covariance matrices Σk in (15.7) constitute the main steps for estimating a GMM using EM.
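Putting (15.2), (15.3), (15.6) and (15.7) together, a minimal NumPy sketch of EM for a GMM might look as follows (illustrative only: the initialization is crude, there is no convergence test, and χ̂ ≡ 1/n is hard-coded):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, c=2, T=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(c, 1.0 / c)                          # mixing weights
    mu = X[rng.choice(n, size=c, replace=False)]      # initialize means at random data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(c)])
    for _ in range(T):
        # E-step (15.2): responsibilities r[i, k]
        r = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(c)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step (15.3), (15.6), (15.7)
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        for k in range(c):
            Xc = X - mu[k]
            Sigma[k] = (r[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma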


16 Restricted Boltzmann Machine (RBM)


Goal

Gibbs sampling, Boltzmann Machine and Restricted Boltzmann Machine.

Alert 16.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Algorithm 16.2: The Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970)

Suppose we want to take a sample from a density p, where direct sampling is costly. Instead, we resort to
an iterative algorithm:
Algorithm: The Metropolis-Hastings Algorithm
Input: proposal (conditional) density q(y|x), symmetric function s(x, y), target density p(x)
Output: approximate sample X ∼ p
1 choose X
2 repeat
3 sample Y ∼ q(·|X)
4   α(X, Y) ← s(X, Y) / ( 1 + [p(X)q(Y|X)] / [p(Y)q(X|Y)] ) = s(X, Y) · p(Y)q(X|Y) / ( p(Y)q(X|Y) + p(X)q(Y|X) )
5   with probability α(X, Y): X ← Y
6   until convergence
Obviously, the symmetric function s must be chosen so that α ∈ [0, 1]. Popular choices include:
• Metropolis-Hastings (Hastings 1970):

  s(x, y) = [ p(x)q(y|x) + p(y)q(x|y) ] / [ p(x)q(y|x) ∨ p(y)q(x|y) ]   =⇒   α(x, y) = 1 ∧ [ p(y)q(x|y) / (p(x)q(y|x)) ]    (16.1)

• Barker (Barker 1965):

  s(x, y) = 1   =⇒   α(x, y) = p(y)q(x|y) / ( p(y)q(x|y) + p(x)q(y|x) )

The algorithm simplifies considerably if the proposal q is symmetric, i.e. q(x|y) = q(y|x), which is the
original setting in (Metropolis et al. 1953):
Algorithm: The Symmetric Metropolis-Hastings Algorithm
Input: symmetric proposal density q(y|x), symmetric function s(x, y), target density p(x)
Output: approximate sample X ∼ p
1 choose X
2 repeat
3 sample Y ∼ q(·|X)
4   α(X, Y) ← s(X, Y) · p(Y) / ( p(Y) + p(X) )
5   with probability α(X, Y): X ← Y
6   until convergence

For MH’s rule (16.1), we now have

s(x, y) = p(x)/p(y) ∧ p(y)/p(x)   =⇒   α(x, y) = 1 ∧ p(y)/p(x)


while Barker’s rule (6) reduces to

s(x, y) = 1 =⇒ α(x, y) = p(y) / ( p(y) + p(x) ).

In particular, if p(Y) ≥ p(X), then MH always moves to the new position Y while Barker’s rule may still
reject and repeat over.
Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller (1953).
“Equation of state calculations by fast computing machines”. Journal of Chemical Physics, vol. 21, pp. 1087–1092.
Hastings, W. Keith (1970). “Monte Carlo sampling methods using Markov chains and their applications”. Biometrika,
vol. 57, pp. 97–109.
Barker, A. A. (1965). “Monte Carlo calculations of the radial distribution functions for a proton-electron plasma”.
Australian Journal of Physics, vol. 18, no. 2, pp. 119–134.
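A minimal sketch of the symmetric (random-walk) variant with the MH acceptance rule, applied to an unnormalized one-dimensional target (code and target are ours, for illustration):

import numpy as np

def metropolis(p_tilde, x0=0.0, steps=10000, step_size=1.0, seed=0):
    # random-walk Metropolis: p_tilde is the target density known only up to a constant
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(steps):
        y = x + step_size * rng.standard_normal()      # symmetric proposal q(y|x)
        alpha = min(1.0, p_tilde(y) / p_tilde(x))      # MH rule: 1 ∧ p(y)/p(x)
        if rng.random() < alpha:
            x = y                                      # accept the move
        samples.append(x)
    return np.array(samples)

# example: sample from an unnormalized standard Gaussian
s = metropolis(lambda t: np.exp(-0.5 * t * t))
print(s.mean(), s.var())                               # roughly 0 and 1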

Alert 16.3: Significance of MH

To appreciate the significance of MH, let us point out that:


• There is immense flexibility in choosing the proposal q!
• We need only know the target density p up to a constant!

Both are crucial for our application to (restricted) Boltzmann machines, as we will see.

Algorithm 16.4: Gibbs sampling (Hastings 1970; Geman and Geman 1984)

If we choose the proposal density q so that q(y|x) ̸= 0 only if the new position y and the original position
x do not differ much (e.g. agree on all but 1 coordinate), then we obtain the so-called Gibbs sampler.
Variations include:
• randomized: randomly choose a (block of) coordinate(s) j in x and change it (them) according to qj .

• cyclic: loop over each (block of) coordinate(s) j in x and change it (them) according to qj .

If we choose q(y|x) = p(y|x), then for MH’s rule α ≡ 1 while for Barker’s rule α ≡ 1/2.
Hastings, W. Keith (1970). “Monte Carlo sampling methods using Markov chains and their applications”. Biometrika,
vol. 57, pp. 97–109.
Geman, Stuart and Donald Geman (1984). “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration
of Images”. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 6, pp. 721–741.

Remark 16.5: Optimality of MH

Peskun (1973) showed that the MH rule is optimal in terms of asymptotic variance.
Peskun, P. H. (1973). “Optimum Monte-Carlo Sampling Using Markov Chains”. Biometrika, vol. 60, no. 3, pp. 607–
612.

Definition 16.6: Boltzmann distribution (e.g. Hopfield 1982)

We say a (discrete) random variable S ∈ {±1}^m follows a Boltzmann distribution p iff there exists a symmetric matrix W ∈ S^{m+1} such that

∀s ∈ {±1}^m,  pW(S = s) = exp(s⊤Ws − A(W)),  where  A(W) = log ∑_{s∈{±1}^m} exp(s⊤Ws)    (16.2)


is the log-partition function. It is clear that Boltzmann distributions belong to the exponential family
Definition 15.10, with sufficient statistics

T (s) = ss⊤ .

We remind that we have appended the constant 1 in s so that W contains the bias term too.
Hopfield, John J. (1982). “Neural networks and physical systems with emergent collective computational abilities”.
Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558.

Alert 16.7: Coding convention

We used the encoding {±1} to represent a binary value above. As a consequence, the diagonal entries in
W only contribute a constant (independent of the realization s) in (16.2). Thus, w.l.o.g. we may absorb
diag(W ) into A(W ) so that diag(W ) = 0.
On the other hand, if we use the encoding {0, 1}, while conceptually being equivalent, we will no longer
need to perform padding, since the bias term can now be stored in the diagonal of W .

Alert 16.8: Intractability of Boltzmann distributions

Despite the innocent form (16.2), Boltzmann distributions are in general intractable (for large m), since
the log-partition function involves summation over 2m terms (Long and Servedio 2010). This is common in
Bayesian analysis where we know a distribution only up to an intractable normalization constant.
Long, Philip M. and Rocco A. Servedio (2010). “Restricted Boltzmann Machines Are Hard to Approximately Eval-
uate or Simulate”. In: Proceedings of the 27th International Conference on International Conference on Machine
Learning, pp. 703–710.

Example 16.9: m=1 reduces to (binary) logistic

For m = 1 we have

pW(S = 1) = exp(2w12) / ( exp(2w12) + exp(−2w12) ) = sgm(w),   where w := 4w12,

and recall the sigmoid function sgm(t) = 1/(1 + exp(−t)).
This example confirms that even if we can choose any W , the resulting set of Boltzmann distributions
forms a strict subset of all discrete distributions over the cube {±1}m .

Definition 16.10: Boltzmann machine (BM) (Ackley et al. 1985; Hinton and Sejnowski 1986)

Now let us partition the Boltzmann random variable S into the concatenation of an observed random variable
X ∈ {±1}d and a latent random variable Z ∈ {±1}t . We call the marginal distribution over X a Boltzmann
machine. Note that X no longer belongs to the exponential family!
Given a sample X1 , . . . , Xn , we are interested in learning the Boltzmann machine, namely the marginal
density pW (x). We achieve this goal by learning the symmetric matrix W that defines the joint Boltzmann
distribution pW (s) = pW (x, z) in (16.2).
Ackley, David H., Geoffrey E. Hinton, and Terrence J. Sejnowski (1985). “A learning algorithm for boltzmann ma-
chines”. Cognitive Science, vol. 9, no. 1, pp. 147–169.
Hinton, Geoffrey E. and T. J. Sejnowski (1986). “Learning and Relearning in Boltzmann Machines”. In: Parallel
Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. Ed. by David E.
Rumelhart, James L. McClelland, and the PDP Research Group. The MIT Press, pp. 282–317.


Definition 16.11: Restricted Boltzmann Machine (RBM) (Smolensky 1986)

Let us consider the partition of the symmetric matrix

    W = [ Wxx    Wxz
          Wxz⊤   Wzz ] .

If we require Wxx = 0 and Wzz = 0, then we obtain the restricted Boltzmann machine:

    pW (x, z) ∝ exp(x⊤ Wxz z),                                                                (16.3)

i.e., only cross products are allowed.


Similarly, we will consider learning RBM through estimating the (rectangular) matrix Wxz ∈ R(d+1)×(t+1) .

Smolensky, Paul (1986). “Information Processing in Dynamical Systems: Foundations of Harmony Theory”. In: Par-
allel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. The MIT
Press, pp. 194–281.

Example 16.12: d = 1, t = 1

Let d = 1 and t = 1 (so m = 2). We have for (R)BM:

p(X = x, Z = z) ∝ exp (2xzw12 + 2xw13 + 2zw23 )


p(X = 1) ∝ exp (2w12 + 2w13 + 2w23 ) + exp (−2w12 + 2w13 − 2w23 ) ,

In general, RBM is a strict subset of BM.

Remark 16.13: Representation power of (R)BM—the power of latent variables

Freund and Haussler (1992) and Neal (1992) are among the first to prove that RBM and BM can approximate
any discrete distribution on {±1}d arbitrarily well if the number t of latent variables is large (approaching
2^d ). More refined results appeared later in (Le Roux and Bengio 2008; Le Roux and Bengio 2010; Montúfar
and Ay 2011; Montúfar 2014).
In essence, when we marginalize out the latent variables in a (restricted) Boltzmann distribution, we
create a mixture of many components on the remaining variables, hence the ability to approximate any
discrete distribution.
Freund, Yoav and David Haussler (1992). “Unsupervised learning of distributions on binary vectors using two layer
networks”. In: Advances in Neural Information Processing Systems 4. Ed. by J. E. Moody, S. J. Hanson, and
R. P. Lippmann, pp. 912–919.
Neal, Radford M. (1992). “Connectionist learning of belief networks”. Artificial Intelligence, vol. 56, no. 1, pp. 71–113.
Le Roux, Nicolas and Yoshua Bengio (2008). “Representational Power of Restricted Boltzmann Machines and Deep
Belief Networks”. Neural Computation, vol. 20, no. 6, pp. 1631–1649.
— (2010). “Deep Belief Networks Are Compact Universal Approximators”. Neural Computation, vol. 22, no. 8,
pp. 2192–2207.
Montúfar, Guido and Nihat Ay (2011). “Refinements of Universal Approximation Results for Deep Belief Networks
and Restricted Boltzmann Machines”. Neural Computation, vol. 23, no. 5, pp. 1306–1319.
Montúfar, Guido F. (2014). “Universal Approximation Depth and Errors of Narrow Belief Networks with Discrete
Units”. Neural Computation, vol. 26, no. 7, pp. 1386–1407.


Remark 16.14: Computing the gradient

We now apply the same maximum likelihood idea in Algorithm 15.18 for learning an (R)BM:
    min_W KL(χ̂(x)∥pW (x)) ≡ min_W − Eχ̂ log ∑_{z∈{±1}^t} pW (X, z).

Taking derivative w.r.t. W we obtain:


    ∂/∂W [ −Eχ̂ log ∑_z pW (X, z) ]
        = −Eχ̂ [ ∑_z ∂pW (X, z)/∂W ] / [ ∑_z pW (X, z) ]
        = −Eχ̂ ∑_z pW (z|X) ∂ log pW (X, z)/∂W
        = −Eχ̂ ∑_z pW (z|X) ∂(s⊤ W s − A(W ))/∂W
        = −Eχ̂ ∑_z pW (z|X) ss⊤ + ∇A(W ).

Denoting p̂W (x, z) = χ̂(dx)pW (z|x) and applying Exercise 15.13 we obtain the beautiful formula:


    ∂/∂W [ −Eχ̂ log pW (X) ] = −Ep̂W ss⊤ + EpW ss⊤ ,   where s = (x; z; 1).
The same result, with the restriction that Wxx = 0 and Wzz = 0, holds for RBM.
Therefore, we may apply (stochastic) gradient descent to find W , provided that we can evaluate the two
expectations above. This is where we need the Gibbs sampling algorithm in Algorithm 16.4.

Alert 16.15: Failure of EM

Let us attempt to apply the EM Algorithm 15.9 for estimating W :


• E-step: Et+1 (z|x) = pWt (z|x).
• M-step: Wt+1 = argminW KL(p̂t+1 ∥pW ), where recall that p̂t+1 (x, z) := χ̂(dx) · pWt (z|x).
It follows then from Exercise 15.15 (or more directly from Exercise 15.16) that

Wt+1 = ∇A−1 (Ep̂t+1 T (X)), i.e. Ept+1 T (X) = Ep̂t+1 T (X), where pt+1 := pWt+1 . (16.4)

However, we cannot implement (16.4) since the log-partition function A hence also its gradient ∇A is not
tractable!
In fact, the gradient algorithm in Remark 16.14 can be viewed as an explicit scheme that bypasses this difficulty in EM.
Namely, to solve a nonlinear equation

f (W ) = 0, [ for our case, f (W ) = EpW T (X) − Ep̂t+1 T (X)]

we apply the fixed-point iteration:

W ← W − η · f (W ),

which, if it converges at all, converges to a point where f (W ) = 0.


Algorithm 16.16: Learning BM

To estimate EpW ss⊤ , we need to be able to draw a sample S ∼ pW . This can be achieved by the Gibbs sam-
pling Algorithm 16.4. Indeed, we know (recall that conditional of exponential family is again in exponential
family, Exercise 15.14)

    pW (Sj = sj | S\j = s\j ) ∝ exp(2 sj ⟨W\j,j , s\j ⟩),   i.e.   pW (Sj = sj | S\j = s\j ) = sgm(4 sj ⟨W\j,j , s\j ⟩),        (16.5)

where the subscript \j indicates removal of the j-th entry or row.


Algorithm: Gibbs sampling from Boltzmann distribution pW
Input: symmetric matrix W ∈ Sm+1
Output: approximate sample s ∼ pW
1 initialize s ∈ {±1}^m
2 repeat
3     for j = 1, . . . , m do
4         pj ← sgm(4 ⟨W\j,j , s\j ⟩)
5         with probability pj set sj = 1, otherwise set sj = −1
6 until convergence
Similarly, to estimate Ep̂W ss⊤ we first draw a training sample x ∼ χ̂, and then draw z ∼ pW (·|x). For
the latter, we fix x and apply the Gibbs sampling algorithm:
Algorithm: Gibbs sampling from conditional Boltzmann distribution pW (·|x)
Input: symmetric matrix W ∈ Sm+1 , training sample x
Output: approximate sample z ∼ pW (·|x)
1 initialize z ∈ {±1}^t and set s = (x; z; 1)
2 repeat
3     for j = d + 1, . . . , m do
4         pj ← sgm(4 ⟨W\j,j , s\j ⟩)
5         with probability pj set sj = 1, otherwise set sj = −1
6 until convergence
The above algorithms are inherently sequential hence extremely slow.
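
For concreteness, here is a numpy sketch of the first sampler above (the fixed sweep budget replacing the
convergence check, and all helper names, are assumptions of this sketch, not part of the notes):

    # Illustrative Python (numpy) sketch: sequential Gibbs sampling from a Boltzmann distribution p_W.
    import numpy as np

    def sgm(t):
        return 1.0 / (1.0 + np.exp(-t))

    def gibbs_boltzmann(W, n_sweeps=100, rng=None):
        # W: (m+1) x (m+1) symmetric; s is padded with a trailing constant 1.
        rng = np.random.default_rng() if rng is None else rng
        m = W.shape[0] - 1
        s = np.append(rng.choice([-1.0, 1.0], size=m), 1.0)
        for _ in range(n_sweeps):                       # fixed budget in place of "until convergence"
            for j in range(m):                          # inherently sequential over coordinates
                w = np.delete(W[:, j], j)               # W_{\j, j}
                s_rest = np.delete(s, j)                # s_{\j}
                p_j = sgm(4.0 * w @ s_rest)             # P(S_j = 1 | rest), cf. (16.5)
                s[j] = 1.0 if rng.random() < p_j else -1.0
        return s[:-1]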

Algorithm 16.17: Learning RBM

We can now appreciate a big advantage in RBM:

pW (Zj = zj |Z\j = z\j , X = x) ∝ exp (zj ⟨Wx,j , x⟩) , i.e. pW (Zj = zj |Z\j = z\j , X = x) = sgm(2zj ⟨Wx,j , x⟩),

and similarly

pW (Xj = xj |X\j = x\j , Z = z) = sgm(2xj ⟨Wj,z , z⟩).

This is possible since in RBM (16.3), X only interacts with Z but there is no interaction within either X or
Z. Thus, we may apply block Gibbs sampling:


Algorithm: Block Gibbs sampling from RBM pW


Input: rectangular matrix W ∈ R(d+1)×(t+1)
Output: approximate sample s ∼ pW
1 initialize s = (x, z) ∈ {±1}^{d+t}
2 repeat
3     p ← sgm(2W z)
4     for j = 1, . . . , d, in parallel do
5         with probability pj set xj = 1, otherwise set xj = −1
6     q ← sgm(2W ⊤ x)
7     for j = 1, . . . , t, in parallel do
8         with probability qj set zj = 1, otherwise set zj = −1
9 until convergence
Similarly, to estimate Ep̂W ss⊤ we first draw a training sample x ∼ χ̂, and then draw z ∼ pW (·|x). For
the latter, we fix x and apply the Block Gibbs sampling algorithm:
Algorithm: Block Gibbs sampling from conditional RBM pW (·|x)
Input: rectangular matrix W ∈ R(d+1)×(t+1)
Output: approximate sample z ∼ pW (·|x)
1 initialize z ∈ {±1}^t
2 repeat
3     q ← sgm(2W ⊤ x)
4     for j = 1, . . . , t, in parallel do
5         with probability qj set zj = 1, otherwise set zj = −1
6 until convergence
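
Below is a numpy sketch of the block Gibbs samplers and of one stochastic gradient step built from them,
estimating the two expectations of Remark 16.14 restricted to the Wxz block (the sweep budgets, batch handling
and step size are assumptions of this sketch; it is not the original algorithm as stated).

    # Illustrative Python (numpy) sketch: block Gibbs sampling for an RBM and a crude gradient step.
    import numpy as np

    def sgm(t):
        return 1.0 / (1.0 + np.exp(-t))

    def sample_pm1(p, rng):
        return np.where(rng.random(p.shape) < p, 1.0, -1.0)

    def block_gibbs_rbm(W, d, t, n_sweeps, x=None, rng=None):
        # W: (d+1) x (t+1); x and z each carry a trailing constant 1 (bias).
        rng = np.random.default_rng() if rng is None else rng
        clamp_x = x is not None
        if x is None:
            x = np.append(rng.choice([-1.0, 1.0], size=d), 1.0)
        z = np.append(rng.choice([-1.0, 1.0], size=t), 1.0)
        for _ in range(n_sweeps):
            if not clamp_x:
                p = sgm(2.0 * W @ z)                    # P(X_j = 1 | z), all j in parallel
                x[:d] = sample_pm1(p[:d], rng)
            q = sgm(2.0 * W.T @ x)                      # P(Z_j = 1 | x), all j in parallel
            z[:t] = sample_pm1(q[:t], rng)
        return x, z

    def rbm_grad_step(W, X_batch, d, t, eta=0.01, rng=None):
        # one stochastic step on min_W KL(data || p_W): dL/dW = -E_{p_hat}[x z^T] + E_{p_W}[x z^T]
        rng = np.random.default_rng() if rng is None else rng
        pos, neg = np.zeros_like(W), np.zeros_like(W)
        for x_raw in X_batch:
            x = np.append(x_raw, 1.0)
            _, z_c = block_gibbs_rbm(W, d, t, n_sweeps=1, x=x, rng=rng)   # clamped phase: exact z ~ p_W(.|x)
            pos += np.outer(x, z_c)
            x_f, z_f = block_gibbs_rbm(W, d, t, n_sweeps=50, rng=rng)     # free phase: approximate (x, z) ~ p_W
            neg += np.outer(x_f, z_f)
        return W - eta * (neg - pos) / len(X_batch)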

Remark 16.18: Sampling and marginalization

After we have learned the parameter matrix W , we can draw a new sample (X, Z) from pW (x, z) using the
same unconditioned (block) Gibbs sampling algorithm. Simply dropping Z we obtain a sample X from the
marginal distribution pW (x).
For RBM, we can actually derive the marginal density (up to a constant):
    pW (X = x) ∝ ∑_{z∈{±1}^t} exp(x⊤ W z)                                                     (16.6)

               = ∑_{z∈{±1}^t} ∏_{j=1}^{t+1} exp(x⊤ W:j zj )

               = exp(x⊤ W:,t+1 ) ∏_{j=1}^{t} [exp(⟨x, W:j ⟩) + exp(− ⟨x, W:j ⟩)] .

A similar formula for pW (Z = z) obviously holds as well. Thus, for RBM, if we need only draw a sample X,
we can and perhaps should directly apply Gibbs sampling to the marginal density (16.6).

Exercise 16.19: Conditional independence in RBM

Prove or disprove the following for BM and RBM:


    pW (z|x) = ∏_{j=1}^{t} pW (zj |x),          pW (x|z) = ∏_{j=1}^{d} pW (xj |z).


Remark 16.20: RBM as stochastic/feed-forward neural network

RBM is often referred to as a two-layer stochastic neural network, where X is the input layer while Z is the
output layer. It also defines an underlying nonlinear, deterministic, feed-forward network. Indeed, let

yj = pW (Zj = 1|X = x) = sgm(2 ⟨W:j , x⟩)

we obtain a nonlinear feedforward network that takes an input x ∈ {±1}d and maps it non-linearly to an
output y ∈ [0, 1]t , through:

h = 2W ⊤ x
y = sgm(h).

Of course, we can stack RBMs on top of each other and go “deep” (Hinton and Salakhutdinov 2006;
Salakhutdinov and Hinton 2012). Applications can be found in (Mohamed et al. 2012; Sarikaya et al. 2014).

Hinton, G. E. and R. R. Salakhutdinov (2006). “Reducing the Dimensionality of Data with Neural Networks”. Science,
vol. 313, pp. 504–507.
Salakhutdinov, Ruslan and Geoffrey Hinton (2012). “An Efficient Learning Procedure for Deep Boltzmann Machines”.
Neural Computation, vol. 24, no. 8, pp. 1967–2006.
Mohamed, A., G. E. Dahl, and G. Hinton (2012). “Acoustic Modeling Using Deep Belief Networks”. IEEE Transactions
on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22.
Sarikaya, R., G. E. Hinton, and A. Deoras (2014). “Application of Deep Belief Networks for Natural Language
Understanding”. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 778–
784.

Remark 16.21: Not necessarily binary data

It is possible to extend (R)BM to handle other types of data, see for instance (Welling et al. 2005).
In fact, let pW (x, z) = exp(⟨T (x, z), W ⟩ − A(W )) be any joint density in the exponential family. Then,
we can estimate the parameter matrix W as before:
    min_W KL(χ̂(x)∥pW (x)) ≡ min_W − Eχ̂ log ∫_z pW (X, z) dz.

Denoting p̂W (x, z) = χ̂(dx)pW (z|x) and following the same steps as in Remark 16.14 we obtain:


    ∂/∂W [ −Eχ̂ log pW (X) ] = −Ep̂W T (X, Z) + EpW T (X, Z).
A restricted counterpart corresponds to

⟨T (x, z), W ⟩ = T1 (x)⊤ Wxz T2 (z).

Similar Gibbs sampling algorithms can be derived to approximate the expectations.


Welling, Max, Michal Rosen-zvi, and Geoffrey E. Hinton (2005). “Exponential Family Harmoniums with an Appli-
cation to Information Retrieval”. In: Advances in Neural Information Processing Systems 17, pp. 1481–1488.


17 Deep Belief Networks (DBN)


Goal

Belief network, sigmoid belief network, deep belief network, maximum likelihood, Gibbs sampling.

Alert 17.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Definition 17.2: DAG

A directed acyclic graph (DAG) is a directed graph that contains no directed cycles. Note that the underlying
undirected graph could still contain cycles.

Definition 17.3: Notions in DAGs

A useful property of a DAG is that we can always define a partial ordering on its vertices so that u ≤ v
iff there is a directed path from u to v. For each vertex v in the directed graph G = (V, E), we use
pa(v) := {u : u→v ∈ E}, ch(v) := {u : v→u ∈ E}, an(v) := {u : ∃w1 , . . . , wk ∈ V, u→w1 , w1→w2 , . . . , wk→v ∈ E},
de(v) := {u : ∃w1 , . . . , wk ∈ V, v→w1 , w1→w2 , . . . , wk→u ∈ E}, and nd(v) := V \ ({v} ∪ de(v)) to denote the
parents, children, ancestors, descendants, and non-descendants of v, respectively. Similarly we define such
sets for a set of nodes by taking unions.

Definition 17.4: Belief/Bayes networks (BNs) (Pearl 1986)

For each vertex v of a DAG G = (V = {1, . . . , m}, E), we associate a random variable Sv with it. Inter-
changeably we refer to the vertex either by v or Sv . Together S = (S1 , . . . , Sm ) defines a Belief network
w.r.t. G iff the joint density p factorizes as follows:
    p(s1 , . . . , sm ) = ∏_{v∈V} p(sv | pa(sv )).                                            (17.1)

Pearl, Judea (1986). “Fusion, propagation, and structuring in belief networks”. Artificial Intelligence, vol. 29, no. 3,
pp. 241–288.

Theorem 17.5: Factorization of BNs

Fix a DAG G. For any set of normalized probability densities {pv (sv | pa(sv ))}v∈V ,
    p(s1 , . . . , sm ) = ∏_{v∈V} pv (sv | pa(sv ))                                           (17.2)

defines a BN over G, whose conditionals are precisely given by {pv }.

Proof. W.l.o.g. we may assume the nodes are so arranged that u ∈ de(v) =⇒ u ≥ v. This is possible since


the graph is a DAG. We claim that for all j,


    p(s1 , . . . , sj ) = ∏_{1≤v≤j} pv (sv | pa(sv )).

The idea is to perform marginalization bottom-up. Indeed,

    p(s1 , . . . , sj ) := ∫ p(s1 , . . . , sm ) dsj+1 · · · dsm
        = ∏_{1≤v≤j} pv (sv | pa(sv )) ∫ ∏_{j+1≤v≤m} pv (sv | pa(sv )) dsj+1 · · · dsm          [first product involves none of sj+1 , . . . , sm ]
        = ∏_{1≤v≤j} pv (sv | pa(sv )) ∫ ∏_{j+1≤v≤m−1} pv (sv | pa(sv )) dsj+1 · · · dsm−1 [ ∫ pm (sm | pa(sm )) dsm ]   [only the last factor involves sm ]
        = ∏_{1≤v≤j} pv (sv | pa(sv )) ∫ ∏_{j+1≤v≤m−1} pv (sv | pa(sv )) dsj+1 · · · dsm−1
        = · · · = ∏_{1≤v≤j} pv (sv | pa(sv )).

This also verifies that p, as defined in (17.2), is indeed a density. Moreover, for all j,

    p(sj | s1 , . . . , sj−1 ) := p(s1 , . . . , sj ) / p(s1 , . . . , sj−1 ) = pj (sj | pa(sj )).

Therefore,
    p(sj , pa(sj )) = ∫ p(s1 , . . . , sj ) ∏_{v∈{1,...,j−1}\pa(sj )} dsv
                    = pj (sj | pa(sj )) ∫ p(s1 , . . . , sj−1 ) ∏_{v∈{1,...,j−1}\pa(sj )} dsv
                    = pj (sj | pa(sj )) · p(pa(sj )),

implying the coincidence of the conditionals

p(sj | s1 , . . . , sj−1 ) = pj (sj | pa(sj )) = p(sj | pa(sj )),

hence completing our proof.

The main significance of this theorem is that by specifying local conditional probability densities pv (sv |
pa(sv )) for each node v, we automatically obtain a bona fide joint probability density p over the DAG.

Example 17.6: Necessity of the DAG condition

Theorem 17.5 can fail if the underlying graph is not a DAG. Consider the simple graph with two nodes
X1 , X2 taking values in {0, 1}. Define the conditionals as follows:

p(X1 = 1 | X2 = 1) = 0.5, p(X1 = 1 | X2 = 0) = 1, p(X2 = 1 | X1 = 1) = 1, p(X2 = 1 | X1 = 0) = 0.5.

If the graph were allowed to be cyclic and the factorization (17.2) still held, then we would have

p(X1 = 1, X2 = 1) = p(X1 = 1 | X2 = 1) · p(X2 = 1 | X1 = 1) = 0.5


p(X1 = 1, X2 = 0) = p(X1 = 1 | X2 = 0) · p(X2 = 0 | X1 = 1) = 0
p(X1 = 0, X2 = 1) = p(X1 = 0 | X2 = 1) · p(X2 = 1 | X1 = 0) = 0.25


p(X1 = 0, X2 = 0) = p(X1 = 0 | X2 = 0) · p(X2 = 0 | X1 = 0) = 0,

not a joint distribution. Note that even if we re-normalize the joint distribution, we won't be able to recover
the conditionals: p(X1 = 1 | X2 = 1) = 0.5/0.75 ̸= 0.5.

Remark 17.7: Economic parameterization

For a binary valued random vector X ∈ {±1}^d , in general we need 2^d − 1 positive numbers to specify its
joint density. However, if X is a BN over a DAG whose maximum in-degree is k, then we need only (at
most) d(2^{k+1} − 1) positive numbers to specify the joint density (by specifying each conditional in (17.1)).
For k = 0 we obtain the independent setting while for k = 1 we include the so-called naive Bayes model.

Definition 17.8: Sigmoid belief network (SBN) (Neal 1992)

Instead of parameterizing each conditional densities p(sv |pa(sv )) directly, we now consider more efficient
ways, which is also necessary if Sv is continuous valued.
Following Neal (1992), we restrict our discussion to binary S ∈ {±1}m and define
    ∀j = 1, . . . , m,    pW (Sj = sj | S<j = s<j ) = sgm( sj (Wj,m+1 + ∑_{k=1}^{j−1} sk Wjk ) ) = sgm(sj Wj: s),

where W ∈ Rm×(m+1) is strictly lower-triangular (i.e. Wjk = 0 if m ≥ k ≥ j). Note that we have added a
bias term in the last column of W and appended the constant 1 to get s = (s; 1). Note the similarity with
Boltzmann machines (16.5).
The ordering of the nodes Sj here are chosen rather arbitrarily. A different ordering may lead to conse-
quences in later analysis, although we shall assume in the following a fixed ordering is given without much
further discussion.
Applying Theorem 17.5 we obtain a joint density pW (s) := pW (x, z), parameterized by the (strictly
lower-triangular) weight matrix W . We will see how to learn W based on a sample X1 , . . . , Xn ∼ pW (x).
Neal, Radford M. (1992). “Connectionist learning of belief networks”. Artificial Intelligence, vol. 56, no. 1, pp. 71–113.
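
Since the conditionals above are specified in ancestral order, drawing a joint sample from an SBN takes a
single pass. A numpy sketch (the masking and helper names are assumptions of this sketch):

    # Illustrative Python (numpy) sketch: ancestral sampling from a sigmoid belief network.
    import numpy as np

    def sgm(t):
        return 1.0 / (1.0 + np.exp(-t))

    def sample_sbn(W, rng=None):
        # W: m x (m+1); the first m columns are strictly lower-triangular, the last column holds the biases.
        rng = np.random.default_rng() if rng is None else rng
        m = W.shape[0]
        s = np.zeros(m + 1); s[m] = 1.0             # padded vector s = (s; 1)
        for j in range(m):                          # s_j depends only on s_{<j} (later entries masked by W)
            p_j = sgm(W[j] @ s)                     # P(S_j = 1 | s_{<j}), cf. Definition 17.8
            s[j] = 1.0 if rng.random() < p_j else -1.0
        return s[:m]

    rng = np.random.default_rng(0)
    m = 5
    W = np.tril(rng.normal(size=(m, m)), k=-1)       # strictly lower-triangular interactions
    W = np.hstack([W, rng.normal(size=(m, 1))])      # append the bias column
    print(sample_sbn(W, rng))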

Example 17.9: SBN with m = 1 and m = 2

Let m = 1 in SBN. We have

pb (S = s) = sgm(sb),

which is exactly the Bernoulli distribution whose mean parameter p is parameterized as a sigmoid transfor-
mation of b. Compare with Example 16.9.
For m = 2, we have instead

pw,b,c (S1 = s1 , S2 = s2 ) = sgm(s1 c) · sgm(s2 (ws1 + b)).

These examples confirm that without introducing latent variables, SBNs may not approximate all distribu-
tions over the cube well.

Definition 17.10: Noisy-OR belief network (Neal 1992)

The sigmoid function is chosen in Definition 17.8 mostly to mimic Boltzmann machines, but it is clear that


any univariate CDF works equally well. Indeed, we can even define:

    p(Sj = 1 | S<j = s<j ) = 1 − exp( −Wj: (1 + s)/2 ),

where as before W ∈ Rm×(m+1) is strictly lower-triangular. The awkward term (1 + s)/2 simply maps the
{±1} values to the {0, 1} encoding, the setting in which noisy-or is traditionally defined. Note that we need
the constraint W (1 + s) ≥ 0 (for instance W ≥ 0 suffices) to ensure the resulting density is nonnegative.
See (Arora et al. 2017) for some recent results on learning noisy-or networks.
Neal, Radford M. (1992). “Connectionist learning of belief networks”. Artificial Intelligence, vol. 56, no. 1, pp. 71–113.
Arora, Sanjeev, Rong Ge, Tengyu Ma, and Andrej Risteski (2017). “Provable Learning of Noisy-OR Networks”. In:
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1057–1066.

Remark 17.11: Universality

As shown in (Neal 1992), Boltzmann distributions, SBNs and Noisy-OR BNs in general represent different
strict subsets of all discrete distributions over the cube {±1}m . If we introduce (a large number of) t latent
variables, then the marginal distribution on X can approximate any discrete distribution arbitrarily well,
for all 3 networks. More refined universality results can be found in (Sutskever and Hinton 2008).
Neal, Radford M. (1992). “Connectionist learning of belief networks”. Artificial Intelligence, vol. 56, no. 1, pp. 71–113.
Sutskever, Ilya and Geoffrey E. Hinton (2008). “Deep, Narrow Sigmoid Belief Networks Are Universal Approximators”.
Neural Computation, vol. 20, no. 11, pp. 2629–2636.

Example 17.12: SBN with d = 1 and t = 1

Let m = 2 in SBN. We partition S = (X, Z), hence

pw,b,c (X = x, Z = z) = sgm(xc) · sgm(z(wx + b)).

Marginalizing over z = {±1}:

pw,b,c (X = x) = sgm(xc)[sgm(wx + b) + sgm(−wx − b)] = sgm(xc).

On the other hand, if we swap the position of Z and X:

pw,b,c (Z = z, X = x) = sgm(zc) · sgm(x(wz + b)).

Marginalizing over z = {±1}:

pw,b,c (X = x) = [sgm(c)sgm(x(w + b)) + sgm(−c)sgm(x(−w + b))],

which is a mixture.

Algorithm 17.13: SBN–Maximum Likelihood

Given a sample X1 , . . . , Xn ∈ {±1}d , we apply ML to estimate W :


    min_{W ∈ Rm×(m+1)} KL(χ̂(x)∥pW (x)),

where the SBN pW is defined in Definition 17.8 and we remind again that W is always constrained to be
strictly lower-triangular (modulus the last bias column) in this section (although we shall not signal this
anymore).
To apply (stochastic) gradient descent, we compute the gradient:
    ∂/∂Wj: KL(χ̂(x)∥pW (x)) = −Ep̂W ∂ log pW (s)/∂Wj: ,    where p̂W (s) := p̂W (x, z) := χ̂(dx) · pW (z|x)


        = −Ep̂W ∂ log ∏_{k=1}^{m} pW (sk |s<k ) / ∂Wj:
        = −Ep̂W ∂ log pW (sj |s<j ) / ∂Wj:
        = −Ep̂W [ sgm′(sj Wj: s) / sgm(sj Wj: s) ] sj s̃<j ,

where we use the tilde notation s̃<j to denote the full size vector where coordinates k ∈ [j, m] are zeroed out
(since W is strictly lower-triangular). Using the fact that sgm′ (t) = sgm(t)sgm(−t) we obtain:


    ∀j = 1, . . . , m,  ∀k ̸∈ [j, m],    ∂/∂Wjk = −Ep̂W sgm(−sj Wj: s) sj sk .

Note that the gradient only involves 1 expectation, while there were 2 in BMs (cf. Remark 16.14). This
is perhaps expected, since in BM we have that intractable log-partition function to normalize the density
while in BN the joint density is automatically normalized since each conditional is so.

Algorithm 17.14: Conditional Gibbs sampling for SBN

We now give the conditional Gibbs sampling algorithm for p̂W (x, z) = χ̂(dx) · pW (z|x). For that we derive
the conditional density (Pearl 1987):

    p(Sj = t | S\j = s\j ) ∝ p(S\j = s\j , Sj = t)
                           ∝ p(Sj = t | S<j = s<j ) · ∏_{k>j} p(Sk = sk | S<k,\j = s<k,\j , Sj = t)
                           = sgm(t Wj: s) ∏_{k>j} sgm[ sk (Wk: s + Wkj (t − sj )) ]

We omit the pseudo-code but point out that the matrix vector product W s should be dynamically updated
to save computation.
Pearl, Judea (1987). “Evidential reasoning using stochastic simulation of causal models”. Artificial Intelligence, vol. 32,
no. 2, pp. 245–257.

Alert 17.15: Importance of ordering: training vs. testing

If we choose to put the observed variables X before the latent variables Z so that S = (X, Z), then drawing a
sample Z necessitates the above (iterative) Gibbs sampling procedure, so training is expensive. However, after
training, drawing a new (exact) sample X takes only one pass. In contrast, if we arrange S = (Z, X), then
drawing a sample Z during training no longer requires the Gibbs sampling algorithm, hence training is fast.
However, the price to pay is that at test time, when we want to draw a new sample X, we have to run the
iterative Gibbs sampling algorithm.

Exercise 17.16: Failure of EM

I know you must be wondering whether we can apply EM to SBN. The answer is no, and the details are left as an exercise.

Definition 17.17: Deep belief network (DBN) (Bengio and Bengio 1999)

We now extend SBN to handle any type of data and go deep. We achieve this goal through 3 key steps:
• First, we realize that SBN amounts to specifying d univariate, parameterized densities. Indeed, let

p(sj |s<j ) = pj (sj ; θj (s<j )),

where pj is a prescribed univariate density on sj with parameter θj . For instance, pj may be the


exponential family with natural parameter θj (e.g. Bernoulli where θj = pj or Gaussian where θj =
(µj , σj2 )).

• Second, the parameter θj can be computed as a function from the previous observations s<j . For
instance, SBN specifies pj to be Bernoulli whose mean parameter pj is computed as the output of a
two-layer neural network with sigmoid activation function:

h = Ws
p = sgm(h).

• It is then clear that we could use a deep neural net to compute the parameter θj :

h0 = s (17.3)
∀ℓ = 1, . . . , L − 1, hℓ = σ(Wℓ hℓ−1 )
θ = σ(WL hL−1 ).

Note that we need to make sure θj depends only on the first j − 1 inputs s<j , which can be achieved
by wiring the network appropriately. For instance, if W1 is strictly lower-triangular while W≥2 is
lower-triangular (Bengio and Bengio 1999), then we obviously satisfy this constraint.

Bengio, Yoshua and Samy Bengio (1999). “Modeling High-Dimensional Discrete Data with Multi-Layer Neural Net-
works”. In: NeurIPS.

Alert 17.18: Weight sharing

We compare the network structure (17.3) with the following straightforward alternative:
    h0^(j) = s<j                                                                              (17.4)
    ∀ℓ = 1, . . . , L − 1,   hℓ^(j) = σ(Wℓ^(j) hℓ−1^(j) )
    θj = σ(WL^(j) hL−1^(j) ),

where we use separate networks for each parameter θj . The weights Wℓ^(j) above can be arbitrary. We observe
that the parameterization in (17.3) is much more efficient, since the weights used to compute θj are shared
with all subsequent computations for θ≥j . Needless to say, the parameterization in (17.4) is more flexible.


18 Generative Adversarial Networks (GAN)


Goal

Push-forward, Generative Adversarial Networks, min-max optimization, duality.

Alert 18.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.

Example 18.2: Simulating distributions

Suppose we want to sample from a Gaussian distribution with mean u and covariance S. The typical approach
is to first sample from the standard Gaussian distribution (with zero mean and identity covariance) and then
perform the transformation:

If Z ∼ N (0, I), then X = T(Z) := u + S^{1/2} Z ∼ N (u, S).

Similarly, we can sample from a χ2 distribution with d degrees of freedom by the transformation:

    If Z ∼ N (0, Id ), then X = T(Z) := ∑_{j=1}^{d} Zj^2 ∼ χ2 (d).

In fact, we can sample from any distribution F on R by the following transformation:

If Z ∼ N (0, 1), then X = T(Z) := F− (Φ(Z)) ∼ F, where F− (z) = min{x : F (x) ≥ z},

and Φ is the cumulative distribution function of standard normal.
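
As a quick numerical check of the last transformation (not in the original notes), we can push standard
normal samples through Φ and then through a generalized inverse CDF; below F is taken to be the
exponential(1) distribution, whose inverse CDF is available in closed form.

    # Illustrative Python sketch: sampling from F via X = F^-(Phi(Z)), Z ~ N(0, 1).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    z = rng.standard_normal(100_000)

    u = norm.cdf(z)                  # Phi(Z) is uniform on (0, 1)
    x = -np.log(1.0 - u)             # F^-(u) for F(x) = 1 - exp(-x), the exponential(1) CDF

    print(x.mean(), x.var())         # both should be close to 1 for exponential(1)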

Theorem 18.3: Transforming to any probability measure

Let µ be a diffuse (Borel) probability measure on a Polish space Z and similarly let ν be any (Borel) probability
measure on another Polish space X. Then, there exist (measurable) maps T : Z → X such that

If Z ∼ µ, then X := T(Z) ∼ ν.

Recall that a (Borel) probability measure is diffuse iff any single point has measure 0. For less math-
ematical readers, think of Z = Rp , X = Rd , µ and ν as probability densities on the respective Euclidean
spaces.

Definition 18.4: Push-forward generative modeling

Given an i.i.d. sample X1 , . . . , Xn ∼ χ, we can now estimate the target density χ by the following push-
forward approach:
    inf_θ D(X, Tθ (Z)),

where say Z ∼ N (0, Ip ), Tθ : R^p → R^d , and X ∼ χ (the true underlying data generating distribution). The

function D is a “distance” that measures the closeness of our (true) data distribution (represented by X) and
model distribution (represented by Tθ (Z)). By minimizing D we bring our model Tθ (Z) close to our data
X.


Remark 18.5: The good, the bad, and the beautiful

One big advantage of the push-forward approach in Definition 18.4 is that after training (e.g. finding a
reasonable θ) we can effortlessly generate new data: we sample Z ∼ N (0, Ip ) and then set X = Tθ (Z). On
the flip side, we no longer have any explicit form for the model density (namely, that of Tθ (Z) when p < d).
This renders direct maximum likelihood estimation of θ impossible.
This is where we need the beautiful idea called duality. Basically, we need to distinguish two distributions:
the data distribution represented by a sample X and the model distribution represented by a sample Tθ (Z).
We distinguish them by running many tests, represented by functions f :

    sup_{f∈F} | E f (X) − E f (Tθ (Z)) |.

If the class of tests F we run is dense enough, then we would be able to tell the difference between the two
distributions and provide feedback for the model θ to improve, until we no longer can tell the difference.

Definition 18.6: f -divergence (Csiszár 1963; Ali and Silvey 1966)

Let f : R+ → R be a strictly convex function (see the background lecture on optimization) with f (1) = 0.
We define the following f -divergence to measure the closeness of two pdfs p and q:
    Df (p∥q) := ∫ f( p(x)/q(x) ) · q(x) dx,                                                   (18.1)

where we assume q(x) = 0 =⇒ p(x) = 0 (otherwise we put the divergence to ∞).


For two random variables Z ∼ q and X ∼ p, we sometimes abuse the notation to mean

Df (X∥Z) := Df (p∥q).

Csiszár, Imre (1963). “Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität
von Markoffschen Ketten”. A Magyar Tudományos Akadémia Matematikai Kutató Intézetének közleményei, vol. 8,
pp. 85–108.
Ali, S. M. and S. D. Silvey (1966). “A General Class of Coefficients of Divergence of One Distribution from Another”.
Journal of the Royal Statistical Society. Series B (Methodological), vol. 28, no. 1, pp. 131–142.

Exercise 18.7: Properties of f -divergence

Prove the following:


• Df (p∥q) ≥ 0, with 0 attained iff p = q;
• Df +g = Df + Dg and Dsf = sDf for s > 0;

• Let g(t) = f (t) + s(t − 1) for any s. Then, Dg = Df ;


• If p(x) = 0 ⇐⇒ q(x) = 0, then Df (p∥q) = Df ⋄ (q∥p), where f ⋄ (t) := t · f (1/t);
• f ⋄ is (strictly) convex, f ⋄ (1) = 0 and (f ⋄ )⋄ = f ;
The second last result indicates that f -divergences are not usually symmetric. However, we can always
symmetrize them by the transformation: f ← f + f ⋄ .


Example 18.8: KL and LK

Let f (t) = t log t, then we obtain the Kullback-Leibler (KL) divergence:


    KL(p∥q) = ∫ p(x) log(p(x)/q(x)) dx.

Reversing the inputs, we obtain the reverse KL divergence:

LK(p∥q) := KL(q∥p).

Verify by yourself that the underlying function f = − log for reverse KL.

Example 18.9: More divergences, more fun

Derive the formula for the following f -divergences:


• χ2 -divergence: f (t) = (t − 1)2 ;

• Hellinger divergence: f (t) = (√t − 1)^2 ;

• total variation: f (t) = |t − 1|;


• Jensen-Shannon divergence: f (t) = t log t − (t + 1) log(t + 1) + log 4;
• Rényi divergence (Rényi 1961): f (t) = (t^α − 1)/(α − 1) for some α > 0 (for α = 1 we take the limit and obtain the KL divergence).

Which of the above are symmetric?


Rényi, Alfréd (1961). “On Measures of Entropy and Information”. In: Proceedings of the Fourth Berkeley Symposium
on Mathematical Statistics and Probability, pp. 547–561.

Definition 18.10: Fenchel conjugate function

For any extended real-valued function f : V → (−∞, ∞] we define its Fenchel conjugate function as:

    f ∗ (x∗ ) := sup_x ⟨x, x∗ ⟩ − f (x).

We remark that f ∗ is always a convex function (of x∗ ).


If f is convex, proper (i.e. dom f is nonempty), and closed (lower semicontinuous), then

    f ∗∗ := (f ∗ )∗ = f.

This remarkable property of convex functions will now be used!

Example 18.11: Fenchel conjugate of JS

Consider the convex function that defines the Jensen-Shannon divergence:


f (t) = t log t − (t + 1) log(t + 1) + log 4. (18.2)
We derive its Fenchel conjugate:
    f ∗ (s) = sup_t [ st − f (t) ] = sup_t [ st − t log t + (t + 1) log(t + 1) − log 4 ].

Taking derivative w.r.t. t we obtain


    s − log t − 1 + log(t + 1) + 1 = 0   ⇐⇒   t = 1/(exp(−s) − 1),


and plugging it back we get

    f ∗ (s) = s/(exp(−s) − 1) − [1/(exp(−s) − 1)] log[1/(exp(−s) − 1)]
                  + [exp(−s)/(exp(−s) − 1)] log[exp(−s)/(exp(−s) − 1)] − log 4
            = s/(exp(−s) − 1) − [1/(exp(−s) − 1)] log[1/(exp(−s) − 1)]
                  + [exp(−s)/(exp(−s) − 1)] log[1/(exp(−s) − 1)] − s exp(−s)/(exp(−s) − 1) − log 4
            = −s − log(exp(−s) − 1) − log 4
            = − log(1 − exp(s)) − log 4.                                                      (18.3)

Using conjugation again, we obtain the important formula:

    f (t) = sup_s [ st − f ∗ (s) ] = sup_s [ st + log(1 − exp(s)) + log 4 ].

Exercise 18.12: More conjugates

Derive the Fenchel conjugate of the other convex functions in Example 18.8 and Example 18.9.

Definition 18.13: Generative adversarial networks (GAN) (Goodfellow et al. 2014)

We are now ready to define the original GAN, which amounts to using the Jensen-Shannon divergence in
Definition 18.4:

    inf_θ JS(X∥Tθ (Z)),    where   JS(p∥q) = Df (p∥q) = KL(p∥ (p+q)/2 ) + KL(q∥ (p+q)/2 ),

and the convex function f is defined in (18.2), along with its Fenchel conjugate f ∗ given in (18.3).
To see how we can circumvent the lack of an explicit form of the density q(x) of Tθ (Z), we expand using
duality:
    JS(X∥Tθ (Z)) = ∫_x f( p(x)/q(x) ) q(x) dx
                 = ∫_x [ sup_s s · p(x)/q(x) − f ∗ (s) ] q(x) dx
                 = ∫_x [ sup_s s · p(x) − f ∗ (s) q(x) ] dx
                 = sup_{S:R^d→R} ∫_x S(x) p(x) dx − ∫_x f ∗ (S(x)) q(x) dx
                 = sup_{S:R^d→R} E S(X) − E f ∗ (S(Tθ (Z))).

Therefore, if we parameterize the test function S by ϕ (say a deep net), then we obtain a lower bound of
the Jensen-Shannon divergence for minimizing:

    inf_θ sup_ϕ  E Sϕ (X) − E f ∗ (Sϕ (Tθ (Z))).

Of course, we cannot compute either of the two expectations, so we use sample average to approximate them:

    inf_θ sup_ϕ  Ê Sϕ (X) − Ê f ∗ (Sϕ (Tθ (Z))),                                              (18.4)

where the first sample expectation Ê is simply the average of the given training data while the second sample
expectation is the average over samples generated by the model Tθ (Z) (recall Remark 18.5).
In practice, both Tθ and Sϕ are represented by deep nets, and the former is called the generator while
the latter is called the discriminator. Our final objective (18.4) represents a two-player game between


the generator and the discriminator. At equilibrium (if any) the generator is forced to mimic the (true)
data distribution (otherwise the discriminator would be able to tell the difference and incur a loss for the
generator).
See the background lecture on optimization for a simple algorithm (gradient-descent-ascent) for solving
(18.4).
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville,
and Yoshua Bengio (2014). “Generative Adversarial Nets”. In: NIPS.
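
A compact PyTorch sketch of the resulting two-player game (18.4) on a toy one-dimensional problem is given
below. All architecture choices, sizes and the alternating update schedule are assumptions of this sketch; it
uses the standard log D / log(1 − D) parameterization, which corresponds to choosing S(x) = log D(x) ∈
dom(f ∗ ) = (−∞, 0) in (18.3), together with the common non-saturating generator loss.

    # Illustrative Python (PyTorch) sketch: a minimal GAN training loop for the game (18.4).
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    p, d = 2, 1                                                       # latent and data dimensions

    G = nn.Sequential(nn.Linear(p, 32), nn.ReLU(), nn.Linear(32, d))  # generator T_theta
    D = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))  # discriminator (outputs logits)
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    def sample_data(n):                                               # the "true" chi: a shifted, scaled Gaussian
        return 2.0 + 0.5 * torch.randn(n, d)

    for step in range(2000):
        x, z = sample_data(128), torch.randn(128, p)
        # discriminator ascent step: separate data (label 1) from generated samples (label 0)
        loss_d = bce(D(x), torch.ones(128, 1)) + bce(D(G(z).detach()), torch.zeros(128, 1))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        # generator descent step: fool the discriminator (non-saturating variant)
        loss_g = bce(D(G(torch.randn(128, p))), torch.ones(128, 1))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    print(G(torch.randn(1000, p)).mean().item())                      # should drift towards the data mean 2.0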

Remark 18.14: Approximation

We made a number of approximations in Definition 18.13. Thus, technically speaking, the final GAN
objective (18.4) no longer minimizes the Jensen-Shannon divergence. Nock et al. (2017) and Liu et al.
(2017) formally studied this approximation trade-off.
Nock, Richard, Zac Cranko, Aditya K. Menon, Lizhen Qu, and Robert C. Williamson (2017). “f -GANs in an Infor-
mation Geometric Nutshell”. In: NIPS.
Liu, Shuang, Léon Bottou, and Kamalika Chaudhuri (2017). “Approximation and convergence properties of generative
adversarial learning”. In: NIPS.

Exercise 18.15: Catch me if you can

Let us consider the game between the generator q(x) (the implicit density of Tθ (Z)) and the discriminator
S(x):
    inf_q sup_S  ∫_x S(x) p(x) dx + ∫_x log( 1 − exp(S(x)) ) q(x) dx + log 4.

• Fixing the generator q, what is the optimal discriminator S?

• Plugging the optimal discriminator S back in, what is the optimal generator?
• Fixing the discriminator S, what is the optimal generator q?
• Plugging the optimal generator q back in, what is the optimal discriminator?

Exercise 18.16: KL vs. LK

Recall that the f -divergence Df (p∥q) is infinite iff for some x, p(x) ̸= 0 while q(x) = 0. Consider the
following twin problems:

    qKL := argmin_{q∈Q} KL(p∥q),          qLK := argmin_{q∈Q} LK(p∥q).

Recall that supp(p) := cl{x : p(x) ̸= 0}. What can we say about supp(p), supp(qKL ) and supp(qLK )?
What about JS?

Definition 18.17: f -GAN (Nowozin et al. 2016)

Following Nowozin et al. (2016), we summarize the main idea of f -GAN as follows:
• Generator: Let µ be a fixed reference probability measure on space Z (usually the standard normal
distribution) and Z ∼ µ. Let ν be any target probability measure on space X and X ∼ ν. Let


T ⊆ {T : Z → X} be a class of transformations. According to Theorem 18.3 we know there exist


transformations T (which may or may not be in our class T ) so that T(Z) ∼ X ∼ ν. Our goal is to
approximate such transformations T using our class T .

• Loss: We use the f -divergence to measure the closeness between the target X and the transformed
reference T(Z):

        inf_{T∈T} Df( X∥T(Z) ).

In fact, any loss function that allows us to distinguish two probability measures can be used. However,
we face an additional difficulty here: the densities of X and T(Z) (w.r.t. a third probability measure
λ) are not known to us (especially the former) so we cannot naively evaluate the f -divergence in (18.1).
• Discriminator: A simple variational reformulation will resolve the above difficulty! Indeed,
        Df (X∥T(Z)) = ∫ f( (dν/dτ )(x) ) dτ (x)                                               (T(Z) ∼ τ )
                    = ∫ sup_{s∈dom(f ∗ )} [ s · (dν/dτ )(x) − f ∗ (s) ] dτ (x)                (f ∗∗ = f )
                    ≥ sup_{S∈S} ∫ [ S(x) · (dν/dτ )(x) − f ∗ (S(x)) ] dτ (x)                  (S ⊆ {S : X → dom(f ∗ )})
                    = sup_{S∈S} E[S(X)] − E[ f ∗( S(T(Z)) ) ]                                 (equality if f ′(dν/dτ ) ∈ S),

so our estimation problem reduces to the following minimax zero-sum game:

        inf_{T∈T} sup_{S∈S}  E[S(X)] − E[ f ∗( S(T(Z)) ) ].

By replacing the expectations with empirical averages we can (approximately) solve the above problem
with classic stochastic algorithms.
• Reparameterization: The class of functions S we use to test the difference between two probability
measures in the f -divergence must have their range contained in the domain of f ∗ . One convenient
way to enforce this constraint is to set

S = σ ◦ U := {σ ◦ U : U ∈ U}, σ : R → dom(f ∗ ), U ⊆ {U : X → R},

where the functions U are unconstrained and the domain constraint is enforced through a fixed “acti-
vation function” σ. With this choice, the final f -GAN problem we need to solve is:

        inf_{T∈T} sup_{U∈U}  E[(σ ◦ U)(X)] − E[ (f ∗ ◦ σ)( U(T(Z)) ) ].

Typically we choose an increasing σ so that the composition f ∗ ◦σ is “nice.” Note that the monotonicity
of σ implies the same monotonicity of the composition f ∗ ◦ σ (since f ∗ is always increasing as f is
defined only on R+ ). In this case, we prefer to pick a test function U so that U(X) is large while
U(T(Z)) is small. This choice aligns with the goal to “maximize target and minimize transformed
reference,” although the opposite choice would work equally well (merely a sign change).

Nowozin, Sebastian, Botond Cseke, and Ryota Tomioka (2016). “f -GAN: Training Generative Neural Samplers using
Variational Divergence Minimization”. In: NIPS.


Remark 18.18: f -GAN recap

To specify an f -GAN, we need:


• A reference probability measure µ: should be easy to sample and typically we use standard normal;
• A class of transformations (generators): T ⊆ {T : Z → X};
• An increasing convex function f ∗ : dom(f ∗ ) → R with f ∗ (0) = 0 and f ∗ (s) ≥ s (or equivalently an
f -divergence);

• An increasing activation function σ : R → dom(f ∗ ) so that f ∗ ◦ σ is “nice”;


• A class of unconstrained test functions (discriminators): U ⊆ {U : X → R} so that S = σ ◦ U.

Definition 18.19: Wasserstein GAN (WGAN) (Arjovsky et al. 2017)

If we let the test functions range over the set of all 1-Lipschitz continuous functions L, we then obtain
WGAN:

    inf_θ sup_{S∈L}  E S(X) − E S( Tθ (Z) ),

which corresponds to the dual of the 1-Wasserstein distance.


Arjovsky, Martin, Soumith Chintala, and Léon Bottou (2017). “Wasserstein Generative Adversarial Networks”. In:
ICML.

Definition 18.20: Maximum Mean Discrepancy GAN (MMD-GAN)

If, instead, we choose the test functions from a reproducing kernel Hilbert space (RKHS), then we obtain the
so-called MMD-GAN (Dziugaite et al. 2015; Li et al. 2015; Li et al. 2017; Bellemare et al. 2017; Bińkowski
et al. 2018):

    inf_θ sup_{S∈Hκ}  E S(X) − E S( Tθ (Z) ),

where Hκ is the unit ball of the RKHS induced by the kernel κ.


Dziugaite, Gintare Karolina, Daniel M. Roy, and Zoubin Ghahramani (2015). “Training generative neural networks
via maximum mean discrepancy optimization”. In: UAI.
Li, Yujia, Kevin Swersky, and Rich Zemel (2015). “Generative Moment Matching Networks”. In: ICML.
Li, Chun-Liang, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos (2017). “MMD GAN: Towards
Deeper Understanding of Moment Matching Network”. In: NIPS.
Bellemare, Marc G., Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and
Remi Munos (2017). “The Cramer Distance as a Solution to Biased Wasserstein Gradients”. arXiv:1705.10743.
Bińkowski, Mikolaj, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton (2018). “Demystifying MMD GANs”.
In: ICLR.


19 Attention
Goal

Introduction to transformer, attention, BERT, GPTs, and the exploding related.


A nice code tutorial is available: https://nlp.seas.harvard.edu/2018/04/03/attention.html.

Alert 19.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers. Tokens are arranged row-wise.
This note is likely to be updated again soon.

Remark 19.2: The input sequential price of RNNs

A lot of natural language processing (NLP) techniques rely on recurrent neural networks, which are unfor-
tunately sequential in nature (w.r.t. input tokens). Our main goal in this lecture is to use a hierarchical,
parallelizable attention mechanism to trade the input sequential part in RNNs with that in network depth.

Definition 19.3: Transformer (Vaswani et al. 2017)

In a nutshell, a transformer (in machine learning!) is composed of multiple blocks of components that we
explain in details below. It takes an input sequence and outputs another sequence, much like an RNN:

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and
Illia Polosukhin (2017). “Attention is All you Need”. In: Advances in Neural Information Processing Systems 30,
pp. 5998–6008.

Definition 19.4: Input and output embeddings

Typically, the tokens in an input sequence are one-hot encoded over a dictionary with size say p. We add
and learn a distributed representation We ∈ Rp×d of the tokens. The same We is also used to decode the
output tokens, and its transpose We⊤ is used to compute the softmax output probability (over tokens).


Definition 19.5: Positional encoding

The order of tokens in an input sequence matters. To encode this information, we may simply add a
positional vector pt ∈ Rd for each fixed position t:
   
    pt,2i = sin( t/10000^{2i/d} ),   pt,2i+1 = cos( t/10000^{2i/d} ),    i = 0, . . . , d/2 − 1.

It is clear that pt+k is an orthogonal transformation of pt , since



    ( pt+k,2i   )   (  cos(k/10000^{2i/d})   sin(k/10000^{2i/d}) ) ( sin(t/10000^{2i/d}) )
    ( pt+k,2i+1 ) = ( −sin(k/10000^{2i/d})   cos(k/10000^{2i/d}) ) ( cos(t/10000^{2i/d}) )

                    (  cos(k/10000^{2i/d})   sin(k/10000^{2i/d}) ) ( pt,2i   )
                  = ( −sin(k/10000^{2i/d})   cos(k/10000^{2i/d}) ) ( pt,2i+1 ) .

The periodic functions chosen here also allow us to handle test sequences that are longer than the training
sequences. Of course, one may instead try to directly learn the positional encoding pt .
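
A short numpy sketch of the sinusoidal encoding (illustrative only, not part of the original notes):

    # Illustrative Python (numpy) sketch: sinusoidal positional encoding.
    import numpy as np

    def positional_encoding(max_len, d):
        # returns a (max_len, d) matrix whose row t is the positional vector p_t (d assumed even)
        t = np.arange(max_len)[:, None]             # positions 0, ..., max_len - 1
        i = np.arange(d // 2)[None, :]              # dimension pairs i = 0, ..., d/2 - 1
        angles = t / (10000.0 ** (2.0 * i / d))
        P = np.zeros((max_len, d))
        P[:, 0::2] = np.sin(angles)                 # p_{t,2i}
        P[:, 1::2] = np.cos(angles)                 # p_{t,2i+1}
        return P

    P = positional_encoding(50, 16)
    print(P.shape)                                  # (50, 16); rows are added to the token embeddings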

Definition 19.6: Residual connection, layer-normalization and dropout

In each layer we add residual connection (He et al. 2016) and layer-wise normalization (Ba et al. 2016) to
ease training. We may also add dropout layers (Srivastava et al. 2014).
He, K., X. Zhang, S. Ren, and J. Sun (2016). “Deep Residual Learning for Image Recognition”. In: IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton (2016). “Layer Normalization”.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2014). “Dropout:
A Simple Way to Prevent Neural Networks from Overfitting”. Journal of Machine Learning Research, vol. 15,
no. 56, pp. 1929–1958.

Definition 19.7: Attention (e.g. Bahdanau et al. 2015)

Given a query vector q ∈ Rdk and a database consisting of m key-value pairs (K, V ) : K = [k1 , . . . , km ]⊤ ∈
Rm×dk , V = [v1 , . . . , vm ]⊤ ∈ Rm×dv , a natural way to guess/retrieve the value of the query is through
comparison to the known key-value pairs:
    att(q; K, V ) = ∑_{i=1}^{m} πi · vi = π ⊤ V,                                              (19.1)

namely the value of the query is some convex combination of the values in the database.
The coefficient vector π can be determined as follows:
    argmin_{π∈∆m−1}  ∑_{i=1}^{m} πi · distk (q, ki ) + λ · πi log πi ,                        (19.2)

where distk (q, ki ) measures the dissimilarity between the query q and the key ki , and πi log πi is the so-called
entropic regularizer. As you can verify in Exercise 19.8:
    πi = exp(−distk (q, ki )/λ) / ∑_{j=1}^{m} exp(−distk (q, kj )/λ),    i.e.,   π = softmax(−distk (q, K)/λ).        (19.3)

A popular choice for the dissimilarity function is distk (q, k) = −q⊤ k.


Now, with the above π, we solve a similar weighted regression problem to retrieve the value of the query:
    argmin_a  ∑_{i=1}^{m} πi · distv (a, vi ).

With distv (a, vi ) := ∥a − vi ∥22 , we easily verify the convex combination in (19.1).


Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2015). “Neural machine translation by jointly learning to
align and translate”. In: International Conference on Learning Representations.
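
A numpy sketch of (19.1)–(19.3) with the (negated) dot-product dissimilarity and λ = √dk (shapes and names
are assumptions of this sketch, not part of the notes):

    # Illustrative Python (numpy) sketch: scaled dot-product attention.
    import numpy as np

    def softmax(a, axis=-1):
        a = a - a.max(axis=axis, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values; tokens are arranged row-wise
        d_k = K.shape[1]
        scores = Q @ K.T / np.sqrt(d_k)             # -dist_k(q, k)/lambda with dist_k(q, k) = -q^T k
        pi = softmax(scores, axis=-1)               # one convex-combination weight vector per query, cf. (19.3)
        return pi @ V                               # att(q; K, V) = pi^T V, cf. (19.1)

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 5))
    print(attention(Q, K, V).shape)                 # (4, 5): one retrieved value per query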

Exercise 19.8: KL leads to softmax

Prove that the optimal solution of (19.2) is given by the softmax in (19.3).

Example 19.9: Popular attention mechanisms

Popular choices for the dissimilarity function distk include:


• (negated) dot-product: we have already mentioned this choice distk (q, k) = −q⊤ k. We further point
  out that when normalized, i.e. ∥q∥2 = ∥k∥2 = 1, then the (negated) dot-product essentially measures
  the angular distance between the query and the key. This choice is particularly efficient, as we can
  easily implement the matrix-product QK ⊤ (for a set of queries and keys) in parallel. One typically
  sets λ = √dk so that if q ⊥ k ∼ N (0, Idk ), then Var(q⊤ k/√dk ) = 1.
• additive: more generally we may parameterize the dissimilarity function as a feed-forward network
distk (q, k; w) whose weight vector w is learned from data.

Remark 19.10: Sparse attention (Sukhbaatar et al. 2019; Correia et al. 2019)

TBD
Sukhbaatar, Sainbayar, Edouard Grave, Piotr Bojanowski, and Armand Joulin (2019). “Adaptive Attention Span
in Transformers”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pp. 331–335.
Correia, Gonçalo M., Vlad Niculae, and André F. T. Martins (2019). “Adaptively Sparse Transformers”. In: Pro-
ceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2174–2184.

Exercise 19.11: Self-attention in the limit

Let V ∈ Rm×d be arbitrary and consider applying self-attention repeatedly to it:

V ← Aλ (V ) := softmax(V V ⊤ /λ)V,

where of course the softmax operator is applied row-wise. What is the limiting behaviour of Aλ^∞ (V )?

Definition 19.12: Multi-head attention

Recall that in CNNs, we employ a number of filters to extract different feature maps. Similarly, we use
multi-head attention so that our model can learn to attend to different parts of the input simultaneously:

H = [A1 , . . . , Ah ]W, with Ai = att(QWiQ ; KWiK , V WiV ), i = 1, . . . , h,

where W ∈ R(hdv )×d , WiQ ∈ Rd×dk , WiK ∈ Rd×dk , WiV ∈ Rd×dv are weight matrices to be learned, and att
is an attention mechanism in Definition 19.7 (applied to each row of QWiQ ; see also Example 19.9).
If we set dk = dv = d/h, then the number of parameters in h-head attention is on par with single-head
attention. The choice of dimensions here also facilitates the implementation of residual connections above.
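
Continuing the sketch above, a numpy version of h-head attention (random weights and the loop over heads are
purely illustrative; in practice all heads are computed with one batched matrix multiplication):

    # Illustrative Python (numpy) sketch: multi-head attention built from single-head attention.
    import numpy as np

    def softmax(a, axis=-1):
        a = a - a.max(axis=axis, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1) @ V

    def multi_head_attention(Q, K, V, heads, W):
        # heads: list of (W_Q, W_K, W_V), one triple per head; W: output projection of shape (h * d_v, d)
        A = [attention(Q @ WQ, K @ WK, V @ WV) for (WQ, WK, WV) in heads]   # A_i, i = 1, ..., h
        return np.concatenate(A, axis=-1) @ W                               # H = [A_1, ..., A_h] W

    rng = np.random.default_rng(0)
    m, d, h = 6, 16, 4
    d_k = d_v = d // h                              # keeps the parameter count on par with a single head
    heads = [(rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)), rng.normal(size=(d, d_v)))
             for _ in range(h)]
    W = rng.normal(size=(h * d_v, d))
    X = rng.normal(size=(m, d))                     # self-attention: Q = K = V = X
    print(multi_head_attention(X, X, X, heads, W).shape)    # (6, 16)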


Remark 19.13: implicit distance learning?

Another way to interpret the linear projections WiQ and WiK is through distance learning. Indeed, the
distance between the query and key, after linear projection, is

−(QWiQ )(KWiK )⊤ = −Q WiQ (WiK )⊤ K ⊤ =: −QMi K ⊤ ,




where the low-rank matrix Mi ∈ Rd×d “distorts” the dot-product: ⟨q, k⟩Mi := q⊤ Mi k. This explanation
suggests tying together WiQ , WiK , and possibly also WiV .

Definition 19.14: Self- and context- attention

The transformer uses multi-head attention along with an encoder-decoder structure, and employs self-
attention (e.g. Cheng et al. 2016) as a computationally efficient way to relate different positions in an
input sequence:
• encoder self-attention: In this case Q = K = V all come from the input (of the current encoder layer).
Each output can attend to all positions in the input.
• context attention: In this case Q comes from the input of the current decoder layer while (K, V ) come
from the output of (the final layer of) the encoder, namely the context. Again, each output can attend
to all positions in the input.
• decoder self-attention: In this case Q comes from the input (of the current decoder layer), K = Q ⊙ M
and V = Q ⊙ M are masked versions of Q so that each output position can only attend to positions
up to and including the current position. In practical implementation we can simply reset illegal
dissimilarities:

distk (qi , qj ) ← ∞ if i < j.

The input to the transformer is a sequence X = [x1 , . . . , xm ]⊤ ∈ Rm×p and it generates an output
sequence Y = [y1 , . . . , yl ]⊤ ∈ Rl×p . Note that we shift the output sequence 1 position to the right, so that
combined with the decoder self-attention above, each output symbol only depends on symbols before it (not
including itself).


Cheng, Jianpeng, Li Dong, and Mirella Lapata (2016). “Long Short-Term Memory-Networks for Machine Reading”.
In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 551–561.

Definition 19.15: Position-wise feed-forward network

In each encoder and decoder layer, we also apply a feed-forward network at each position. For instance, let
ht ∈ Rd be the (row) representation for position t (at some layer), then we compute

    FFN(ht ) = σ(ht⊤ W1 ) W2 .

The weights W1 ∈ Rd×4d and W2 ∈ R4d×d are shared among different positions but change from layer to
layer.

Remark 19.16: Comparison between attention, RNN and CNN

    Layer type                     per-layer complexity    sequential operations    max path length
    Self-attention                 O(m^2 d)                O(1)                     O(1)
    Recurrent                      O(m d^2)                O(m)                     O(m)
    Convolution                    O(k m d^2)              O(1)                     O(log_k m)
    Self-attention (restricted)    O(r m d)                O(1)                     O(m/r)

Let m be the length of an input sequence and d the internal representation dimension (as above). In
self-attention, the dot-product QQ⊤ costs O(m^2 d) (recall that Q ∈ Rm×d ). However, this matrix-matrix
multiplication can be trivially parallelized using GPUs. We define the maximum path length to be the
maximum number of sequential operations for any output position to attend to any input position. Clearly,
for the transformer, each output position can attend to each input position hence its maximum path length
is O(1).
In contrast, for RNNs, computation is sequential in terms of the tokens in the input sequence. Each
recurrence, e.g. h ← W2 σ(W1 (h, x)), costs O(d^2), totaling O(m d^2). For CNNs (e.g. Gehring et al. 2017)
with filter size k, each convolution costs O(kd) and for a single output filter we need to repeat m times while
we have d output filters, hence the total cost O(k m d^2). Convolutions can be trivially parallelized on GPUs,
and if we employ dilated convolutions (Kalchbrenner et al. 2016) the maximum path length is O(log_k m)
(and O(m/k) for usual convolutions with stride k). Separable convolutions (Chollet 2017) can reduce the
per-layer complexity to O(k m d + m d^2).
Finally, if we restrict attention to the r neighbors (instead of all m tokens in the input sequence), we may
reduce the per-layer complexity to O(rmd), at the cost of increasing the maximum path length to O(m/r).
From the comparison we see that transformer (with restricted attention, if quadratic time/space com-
plexity is a concern) is very suitable for modeling long-range dependencies.
Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin (2017). “Convolutional Sequence
to Sequence Learning”. In: Proceedings of the 34th International Conference on Machine Learning, pp. 1243–1252.
Kalchbrenner, Nal, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu
(2016). “Neural Machine Translation in Linear Time”. In:
Chollet, F. (2017). “Xception: Deep Learning with Depthwise Separable Convolutions”. In: IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 1800–1807.

Remark 19.17: Attention as interpretation?

One advantage of the attention mechanism is visualization: we can inspect the attention distribution over
layers or positions and try to interpret the model; see (Jain and Wallace 2019; Wiegreffe and Pinter 2019)
for some discussions.


Jain, Sarthak and Byron C. Wallace (2019). “Attention is not Explanation”. In: Proceedings of the 2019 Conference
of the North American Chapter of the Association for Computational Linguistics, pp. 3543–3556.
Wiegreffe, Sarah and Yuval Pinter (2019). “Attention is not not Explanation”. In: Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), pp. 11–20.

Definition 19.18: Training using machine translation

Vaswani et al. (2017) trained the transformer using the supervised machine translation problem on the WMT
2014 dataset that consists of pairs (X, Y ) of sentences, one (X) from the source language (say English) and
the other (Y ) from the target language (say French or German). The usual (multi-class) cross-entropy loss
is used as the objective function:
    min − Ê ⟨Y, log Ŷ ⟩ ,    Ŷ = [ŷ1 , . . . , ŷℓ ]⊤ ,

where ŷj depends on the input sentence X = [x1 , . . . , xm ]⊤ and all previous target tokens Y<j := [y1 , . . . , yj−1 ]⊤ .
Note that the sequence lengths m and l vary from one input to another. In practice, we “bundle” sentence
pairs with similar lengths to streamline the parallel computation.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and
Illia Polosukhin (2017). “Attention is All you Need”. In: Advances in Neural Information Processing Systems 30,
pp. 5998–6008.

Remark 19.19: Some practicalities about Transformer

We mention some training choices in the original work:


• optimizer: Adam (Kingma and Ba 2015) was used with β1 = 0.9, β2 = 0.98 and ϵ = 10−9 .

• learning rate:

    ηiter = (1/√d) · min{ 1/√iter ,  iter · 1/(τ √τ ) },

where τ = 4000 controls the warm-up stage where the learning rate increases linearly. After that, the
learning rate decreases proportionally to the inverse square root of the iteration number.
• regularization: (a) dropout (with rate 0.1) was added after the positional encoding layer and after each
attention layer; (b) residual connection and layer normalization is performed after each attention layer
and feed-forward layer; (c) label smoothing (Szegedy et al. 2016):

    y ← (1 − α) y + (α/C) 1,

where y is the (original, typically one-hot encoded) label vector, C is the number of classes, and α = 0.1
controls the amount of smoothing.

Kingma, D. P. and J. Ba (2015). “Adam: A method for stochastic optimization”. In: International Conference on
Learning Representations.
Szegedy, C., V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016). “Rethinking the Inception Architecture for
Computer Vision”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826.


Definition 19.20: Beam search during inference time

After we have trained the transformer, during test time we are given an input sentence and asked to decode
its translation. We use beam search:
• We keep b = 4 currently best candidate translations;
• We augment each candidate translation Yk with a next word y:

scorek ← scorek − log P̂ (y|X, Yk ).

We may prune the next word by considering only those whose score contribution lies close to the best
one (up to some threshold, say b).
• We keep only the best b (in terms of score) augmented translations.
• If some candidate translation ends (either by outputting the stop word or exceeding maximum allowed
  length, say input length plus 50), we compute its normalized score by dividing its score by its length raised
  to the power α = 0.6 (to reduce bias towards shorter translations).
• We prune any candidate translation that lies some threshold (say b) below the best normalized score.
Note that beam search is highly sequential and subject to possible improvements.
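
A generic Python sketch of the search loop is given below (the toy scoring function, the vocabulary, and the
simple length normalization are assumptions of this sketch; the score-gap pruning from the bullets above is
omitted for brevity).

    # Illustrative Python sketch: beam search over a token-level scoring function.
    import numpy as np

    def beam_search(log_prob_fn, vocab, eos, max_len, b=4, alpha=0.6):
        # log_prob_fn(prefix) returns {token: log P(token | X, prefix)} for a hypothetical trained model
        beams = [([], 0.0)]                              # (prefix, cumulative negative log-probability)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                logp = log_prob_fn(prefix)
                for y in vocab:
                    candidates.append((prefix + [y], score - logp[y]))   # score_k <- score_k - log P(y | X, Y_k)
            candidates.sort(key=lambda c: c[1])
            beams = []
            for prefix, score in candidates:
                if prefix[-1] == eos or len(prefix) >= max_len:
                    finished.append((prefix, score / len(prefix) ** alpha))  # length-normalized score
                elif len(beams) < b:
                    beams.append((prefix, score))
            if not beams:
                break
        if finished:
            return min(finished, key=lambda c: c[1])[0]
        return beams[0][0]

    def toy_log_prob(prefix):                            # toy "model": token 2 plays the role of the stop word
        p_end = min(0.1 * (len(prefix) + 1), 0.9)
        probs = {0: (1 - p_end) * 0.3, 1: (1 - p_end) * 0.7, 2: p_end}
        return {k: np.log(v) for k, v in probs.items()}

    print(beam_search(toy_log_prob, vocab=[0, 1, 2], eos=2, max_len=10))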

Example 19.21: Image transformer (Parmar et al. 2018)

Parmar, Niki, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran (2018).
“Image Transformer”. In: Proceedings of the 35th International Conference on Machine Learning, pp. 4055–4064.

Example 19.22: Sparse transformer (Child et al. 2019)

Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever (2019). “Generating Long Sequences with Sparse Trans-
formers”.

Definition 19.23: Embedding from Language Models (ELMo) (Peters et al. 2018)

Unlike conventional word embeddings that assign each word a fixed representation, ELMo is contextualized
(Melamud et al. 2016; McCann et al. 2017), where the representation for each word depends on the entire
sentence (context) it lies in. ELMo is applied to down-stream tasks in the following way:
• Two-Stage (TS, not to be confused with few-shot learning in Example 19.34): We fix the parameters
in ELMo and use it as an (additional) feature extractor, on top of which we tune a task-specific
architecture. For the latter, Peters et al. (2018) also found that concatenating ELMo with the original token
representation x at the input layer, and with the task model's output before it is passed to a softmax layer,
appears to improve performance.

ELMo trains a bidirectional LM:

min_{Θ,Φ}  Ê[ − log p⃗(X|Θ) − log p⃖(X|Φ) ],    where

p⃗(X|Θ) = ∏_{j=1}^{m} p(xj |x1 , . . . , xj−1 ; Θ),      p⃖(X|Φ) = ∏_{j=1}^{m} p(xj |xj+1 , . . . , xm ; Φ),

and the two probabilities are modeled by two LSTMs with shared embedding and softmax layers but different
hidden parameters.


Given an input sequence, the ELMo representation of any token x is computed as:

ELMo(x; A) = A(h0 , {h⃗l}_{l=1}^{L} , {h⃖l}_{l=1}^{L}) =: A(h0 , {hl}_{l=1}^{L}),      hl := [h⃗l ; h⃖l],

where recall that the embedding h0 (e.g. We x, or extracted from any context-insensitive approach) is shared
between the two L-layer LSTMs, whose hidden states are h⃗ and h⃖, respectively. Here A is an aggregation
function. Typical choices include:
• Top: A(h0 , {h⃗l}_{l=1}^{L} , {h⃖l}_{l=1}^{L}) = hL , where only the top layers are retained.

• ELMo as in (Peters et al. 2018):

  ELMo(x; s, γ) = γ ∑_{l=0}^{L} sl hl ,

where the layer-wise scaling parameters s and global scaling parameter γ are task-dependent.
ELMo employs 3 layers of representation: context-insensitive embedding through character-level CNN,
followed by the forward and backward LSTMs. With L = 2, Peters et al. (2018) found that the lower layers
tend to learn syntactic information while semantic information is captured in higher layers (through e.g.
testing on tasks that require syntactic/semantic information), justifying the aggregation scheme in ELMo.
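A minimal Python sketch (our own) of the ELMo aggregation above, where, following Peters et al. (2018), the
layer weights s are taken to be softmax-normalized and γ is a single task-dependent scale; the toy vectors
below merely stand in for the character-CNN embedding and the biLSTM states:

```python
import numpy as np

def elmo_aggregate(layers, s_logits, gamma):
    """layers  : list of L+1 vectors h_0, ..., h_L (same dimension).
       s_logits: unnormalized layer weights, softmax-normalized below.
       gamma   : global task-dependent scaling scalar."""
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()                           # softmax over the L+1 layers
    H = np.stack(layers)                      # shape (L+1, d)
    return gamma * (s[:, None] * H).sum(axis=0)

d, L = 8, 2
h0 = np.random.randn(d)                       # context-insensitive embedding
hl = [np.concatenate([np.random.randn(d // 2), np.random.randn(d // 2)]) for _ in range(L)]
# each h_l = [forward LSTM state ; backward LSTM state]
print(elmo_aggregate([h0] + hl, s_logits=np.zeros(L + 1), gamma=1.0))
```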
Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, and Luke Zettlemoyer (2018).
“Deep Contextualized Word Representations”. In: Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics, pp. 2227–2237.
Melamud, Oren, Jacob Goldberger, and Ido Dagan (2016). “context2vec: Learning Generic Context Embedding
with Bidirectional LSTM”. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language
Learning, pp. 51–61.
McCann, Bryan, James Bradbury, Caiming Xiong, and Richard Socher (2017). “Learned in Translation: Contextual-
ized Word Vectors”. In: Advances in Neural Information Processing Systems 30, pp. 6294–6305.

Definition 19.24: Bidirectional Encoder Representation from Transformers (Devlin et al. 2019)

BERT followed up on the pre-training path in GPT, and added some interesting twists:
• The input to BERT is a concatenation of two sequences, starting with a special symbol [CLS], followed
by sequence A, the special separator [SEP], sequence B, and ending again with [SEP]. (It obviously still
works in the absence of a sequence B.) The final representation of [CLS] is a sentence-level abstraction
and can be used for various downstream sentence classification tasks, while the two-sequence input
format can be extremely useful in question answering tasks (e.g. question for A and passage for B).
• Apart from token positional embedding, we also add a sequence positional embedding, where tokens
in the first sequence share an embedding vector while those in the second sequence share another one.
• Masked language model (MLM): As in ELMo (Definition 19.23), Devlin et al. (2019) argue that
in certain tasks it is important to exploit information from both directions, instead of the left-to-right
order in usual language models (LMs, e.g. GPT). Thus, BERT aims at training an MLM: On each
input, it randomly replaces 15% tokens with the special symbol [Mask]. The modified input then
goes through the Transformer encoder where each position can attend to any other position (hence
bidirectional). The final hidden representations are passed to a softmax layer where we predict only
the masked tokens with the usual cross-entropy loss. Note that our predictions on the masked tokens
are in parallel, unlike in usual LMs where tokens are predicted sequentially and may affect each other
(at test time).
• Hack: At test time, the input never includes the special symbol [Mask] hence creating a mismatch
between training and testing in BERT. To remedy this issue, of the 15% tokens chosen during training,
only 80% of them will actually be replaced with [Mask], while 10% of them will be replaced with a
random token and the remaining 10% will remain unchanged.


• Next sequence prediction (NSP): Given the first sequence, we choose the second sequence as its next
sequence 50% of the time (labeled as true) and as a random sequence the remaining time (labeled
as false). Then, BERT pre-training includes a binary classification loss besides MLM, based on the
(binary) softmax applied on the final hidden representation of [CLS].
• The training loss of BERT is the sum of averaged MLM likelihood and NSP likelihood.
• Fine-tuning (FT): during fine-tuning, depending on the task we may proceed differently. For se-
quence classification problems, we add a softmax layer to the final hidden representation of [CLS],
while for token-level predictions we add softmax layers to the final hidden representations of all rel-
evant input tokens. For instance, we add two (independent) softmax layers (corresponding to start
and end) in span prediction where during test time we use the approximation

log Pr(start = i, end = j) ≈ log Pr(start = i) + log Pr(end = j),

considering of course only i ≤ j. Note that all parameters are adjusted in FT, in sharp contrast to the
TS approach in ELMo. Of course, BERT can also be applied in the TS setting, where the performance
may be slightly worse (Devlin et al. 2019).
• BERT showed that scaling to extreme model sizes leads to surprisingly large improvements on very
small scale tasks, provided that the model has been sufficiently pre-trained.
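A minimal Python sketch (our own toy illustration, not BERT's actual code) of the 15% selection and the
80%/10%/10% replacement rule described in the MLM and Hack items above:

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", p_select=0.15):
    """Return (corrupted tokens, positions to be predicted) following the 80/10/10 rule."""
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() > p_select:
            continue
        targets.append(i)                        # this position is predicted by the MLM
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_token            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
print(mlm_mask(sentence, vocab))
```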

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). “BERT: Pre-training of Deep Bidirec-
tional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics, pp. 4171–4186.

Definition 19.25: XLNet (Yang et al. 2019)

Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le (2019). “XL-
Net: Generalized Autoregressive Pretraining for Language Understanding”. In: Advances in Neural Information
Processing Systems 32, pp. 5753–5763.

Example 19.26: RoBERTa (Liu et al. 2020)

Liu, Yinhan et al. (2020). “RoBERTa: A Robustly Optimized BERT Pretraining Approach”.

Example 19.27: A lite BERT (ALBERT) (Lan et al. 2020)

Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut (2020).
“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”. In: International Conference
on Learning Representations.

Remark 19.28: More on BERT

(Joshi et al. 2020; Lewis et al. 2020; Saunshi et al. 2019; Song et al. 2019)
Joshi, Mandar, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy (2020). “SpanBERT:
Improving Pre-training by Representing and Predicting Spans”. Transactions of the Association for Computational
Linguistics, vol. 8, pp. 64–77.
Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoy-
anov, and Luke Zettlemoyer (2020). “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language
Generation, Translation, and Comprehension”. In: Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp. 7871–7880.
Saunshi, Nikunj, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar (2019). “A Theo-
retical Analysis of Contrastive Unsupervised Representation Learning”. In: Proceedings of the 36th International
Conference on Machine Learning, pp. 5628–5637.
Song, Kaitao, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu (2019). “MASS: Masked Sequence to Sequence Pre-
training for Language Generation”. In: Proceedings of the 36th International Conference on Machine Learning,
pp. 5926–5936.

Definition 19.29: Generative Pre-Training (GPT) (Radford et al. 2018)

GPT works in two stages:


• unsupervised pre-training through learning a language model:

  min_{Θ} − Ê log p(X|Θ),    where p(X|Θ) = ∏_{j=1}^{m} p(xj |x1 , . . . , xj−1 ; Θ).

Namely, given the context consisting of previous tokens x1 , . . . , xj−1 , we aim to predict the current
token xj . The conditional probability is computed through a multi-layer transformer decoder (Liu
et al. 2018):

H(0) = XWe + Wp
H(ℓ) = transformer_decoder_block(H(ℓ−1)),    ℓ = 1, . . . , L
p(xj |x1 , . . . , xj−1 ; Θ) = softmax(h(L)_j We⊤),

where X = [x1 , . . . , xm ]⊤ is the input sequence consisting of m tokens (m may vary from input to
input), L is the number of transformer blocks, We is the token embedding matrix and Wp is the
position embedding matrix.
• supervised fine-tuning with task-aware input transformations:

  min_{Wy} min_{Θ} − Ê log p(y|X, Θ) − λ · Ê log p(X|Θ),    where p(y|X, Θ) = ⟨y, softmax(h(L)_m Wy)⟩,

and we include the unsupervised pre-training loss to help improve generalization and accelerate
convergence. Note that for different tasks, we only add an extra softmax layer to do the classification.
Unlike ELMo, GPT avoids task-specific architectures and aims to learn a universal representation
(through language model pre-training) for all tasks.
Sequence pre-training has been explored earlier in (Dai and Le 2015; Howard and Ruder 2018) using
LSTMs.
Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever (2018). “Improving Language Understanding
by generative pre-training”.


Liu, Peter J., Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer (2018).
“Generating Wikipedia by Summarizing Long Sequences”. In: International Conference on Learning Representa-
tions.
Dai, Andrew M and Quoc V Le (2015). “Semi-supervised Sequence Learning”. In: Advances in Neural Information
Processing Systems 28, pp. 3079–3087.
Howard, Jeremy and Sebastian Ruder (2018). “Universal Language Model Fine-tuning for Text Classification”. In:
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 328–339.

Remark 19.30: Input transformations

GPT employs task-specific input transformations so that its fine-tuning strategy in Definition 19.29 is
applicable:
• Textual entailment: Given premise, decide if hypothesis holds. We simply concatenate premise
with hypothesis, delimited by a special token, as input to GPT and reduce to 3-class classification:
entailment, neutral, contradiction.
• Similarity: Similarly, we concatenate the two input sentences in both orders, process each ordering
independently, and add the two resulting representations element-wise.

• Question answering and reasoning: We concatenate the context, question and each possible answer
to obtain multiple input sequences and reduce to multi-class classification.

Remark 19.31: Some practicalities about GPT

We mention the following implementation choices in GPT:


• Byte pair encoding (BPE): Current character-level language models (LMs) (e.g. Gillick et al. 2016)
are not as competitive as word-level LMs (e.g. Al-Rfou et al. 2019). BPE (e.g. Sennrich et al. 2016)
provides a practical middle ground, where we start with UTF-8 bytes (256 base vocabs instead of the
overly large 130k codes) and repeatedly merge pairs with the highest frequency in our corpus. We
prevent merging between different (UTF-8 code) categories, with an exception on spaces (to reduce
similar vocabs such as dog and dog! but allow dog cat). BPE (or any other character-level encoding)
allows us to compute probabilities even over words and sentences that are not seen at training.
• Gaussian Error Linear Unit (GELU) (Hendrycks and Gimpel 2016):

σ(x) = xΦ(x) = E(x · m|x), where m ∼ Bernoulli(Φ(x)),

i.e., on the input x, with probability Φ(x) we keep it (and with probability 1 − Φ(x) we drop it). Unlike
ReLU, which drops any negative input deterministically, GELU drops the input with higher probability as the
input decreases (see the sketch after this list).


• fine-tuning uses a smaller batch size of 32, and converges after 3 epochs in most cases; λ = 1/2; warm-up
over τ = 2000 updates during pre-training and over 0.2% of updates during fine-tuning.
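Minimal Python sketches (our own, for illustration only) of two ingredients above: one round of pair merging
in a character-level simplification of BPE (the real GPT BPE works on UTF-8 bytes and restricts merges across
categories), and the GELU activation σ(x) = xΦ(x):

```python
import collections
import math

def most_frequent_pair(vocab):
    """vocab maps a word (tuple of symbols) to its corpus frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` by the concatenated new symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

def gelu(x):
    """sigma(x) = x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}
pair = most_frequent_pair(vocab)              # the pair with the highest corpus frequency
print(pair, merge_pair(vocab, pair))
print(gelu(-1.0), gelu(0.0), gelu(1.0))       # GELU shrinks negative inputs but does not zero them
```

Repeating the two steps above (count pairs, merge the most frequent one) until a desired vocabulary size is
reached gives the BPE merge table.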

Gillick, Dan, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya (2016). “Multilingual Language Processing From
Bytes”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computa-
tional Linguistics, pp. 1296–1306.
Al-Rfou, Rami, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones (2019). “Character-Level Language
Modeling with Deeper Self-Attention”. In: AAAI.
Sennrich, Rico, Barry Haddow, and Alexandra Birch (2016). “Neural Machine Translation of Rare Words with
Subword Units”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics,
pp. 1715–1725.
Hendrycks, Dan and Kevin Gimpel (2016). “Gaussian Error Linear Units (GELUs)”.

Alert 19.32: “What I cannot create, I do not understand.” — Richard Feynman

A popular hypothesis to support pre-training (through say language modeling) is that the underlying gener-
ative model learns to perform many of the downstream (supervised) tasks in order to improve its language
modeling capability, and the long-range dependencies allowed by transformers assists in transfer compared
to LSTMs (or other RNNs).

Example 19.33: GPT-2 (Radford et al. 2019)

Most current ML systems are narrow in the sense that they specialize exclusively on a single task. This
approach, however successful, suffers from generalizing to other domains/tasks that are not encountered
during training. While unsupervised pre-training combined with supervised fine-tuning (such as in GPT)
proves effective, GPT-2 moves on to the completely unsupervised zero-shot setting.
McCann et al. (2018) showed that many NLP tasks (input and output) can be specified purely by
language, and hence can be solved through training a single model. Indeed, the target output to a natural
language task is just one of the many possible next sentences. Thus, training a sufficiently large unsupervised
language model may allow us to perform well on a range of tasks (not explicitly trained with supervision),
albeit in a much less data-efficient way perhaps.
GPT-2 also studied the effect of test set contamination, where the large training set accidentally includes
near-duplicates of part of the test data. For example CIFAR-10 has 3.3% overlap between train and test
images (Barz and Denzler 2020).
Apart from some minor adjustments, the main upgrade from GPT to GPT-2 is a (sharp) increase in
model capacity, a larger training set, and the exclusive focus on zero-shot learning. Conceptually, GPT-2
demonstrated the (surprising?) benefit of training (excessively?) large models, with near-human performance
on some NLP tasks (such as text generation).
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever (2019). “Language models
are unsupervised multitask learners”.
McCann, Bryan, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher (2018). “The Natural Language De-
cathlon: Multitask Learning as Question Answering”.
Barz, Björn and Joachim Denzler (2020). “Do We Train on Test Data? Purging CIFAR of Near-Duplicates”. Journal
of Imaging, vol. 6, no. 41, pp. 1–8.

Example 19.34: GPT-3 (Brown et al. 2020)

The main contribution of GPT-3 is perhaps its crystallization of different evaluation schemes:
• Fine-Tuning (FT): this was explored in GPT-1 and involves task-dedicated datasets and gradient updates
on both the LM and the task-dependent classifier.


• Few-Shot (FS): a natural language description of the task (e.g. Translate English to French) and
a few example demonstrations (e.g. English-French sentence pairs), followed by a final context (e.g.
English sentence) that will be completed by GPT-3 (e.g. a French translation).

• One-Shot (1S): same as above with the number of demonstrations restricted to 1.


• Zero-Shot (0S): just the task description (in natural language form) is provided.
GPT-3 focuses exclusively on the last 3 scenarios and never performs gradient updates on the LM. The
Common Crawl dataset that GPT-3 was built on is so large that no sequence was ever updated twice during
training.
Brown, Tom B. et al. (2020). “Language Models are Few-Shot Learners”.

Example 19.35: Pixel GPT (Chen et al. 2020)

Chen, Mark, Alec Radford, Rewon Child, Jeffrey K Wu, Heewoo Jun, David Luan, and Ilya Sutskever (2020).
“Generative Pretraining From Pixels”. In: Proceedings of the 37th International Conference on Machine Learning.

Alert 19.36: The grand comparison

model data en de input embedding heads params batch steps GPUs time
transformer WMT14 6 6 - 512 8 213M 50k tokens 100k 8×P100 12h
sparseTran
imageTran
GPT-i
ELMo 1B Word Bench - 2×2 2048c 512 - ? ? 10e ? ?
GPT-1 BooksCorp 0 12 512×40k 768 12 100M 64 100e 8×P600 1m
BERTbase 12 768 12 110M 4×cTPU
BooksCorp+Wiki 0 512×30k 256 1M 4d
BERTlarge 24 1024 16 340M 16×cTPU
12 768 117M
24 1024 345M
GPT-2 WebText 0 1024× 50k 12 512 ? ? ?
36 1280 762M
48 1600 1542M
RoBERTa
XLNet
ALBERT
GPT-3-S 12 768 12 125M 0.5M
GPT-3-M 24 1024 16 350M 0.5M
GPT-3-L 24 1536 16 760M 0.5M
GPT-3-XL 24 2048 24 1.3B 1M
CommonCrawl 0 2048× 50k ? ?×V100 ?
GPT-3-SS 32 2560 32 2.7B 1M
GPT-3-MM 32 4096 32 6.7B 2M
GPT-3-LL 40 5140 40 13B 2M
GPT-3 96 12288 96 175B 3.2M


20 Learning to Learn
Goal

A general introduction to the recent works on meta-learning.

Alert 20.1: Convention

Gray boxes are not required hence can be omitted for unenthusiastic readers.
This note is likely to be updated again soon.
Earlier developments can be found in the monograph (Thrun and Pratt 1998) while recent ones can be
found in the survey article (Hospedales et al. 2020).
Thrun, Sebastian and Lorien Pratt (1998). “Learning to Learn: Introduction and Overview”. In: Learning to Learn.
Springer.
Hospedales, Timothy, Antreas Antoniou, Paul Micaelli, and Amos Storkey (2020). “Meta-Learning in Neural Net-
works: A Survey”.

Definition 20.2: c-class and s-shot (cs)

In few-shot learning, we are given cs training examples, with s shots (examples) for each of the c classes,
and we are asked to classify new test examples into the c classes. Typically, both c and s are small (e.g.
c = 5 or c = 10, s = 5 or s = 1 or even s = 0). Training a deep model based on the cs training examples
can easily lead to severe overfitting.

Algorithm 20.3: kNN in few-shot learning

A common algorithm in the above c-class and s-shot setting is the k nearest neighbor algorithm discussed in
Section 12. In addition, we may train a feature extractor (such as BERT or GPT) based on unlabeled data
(e.g. images or documents depending on the application) collected from the internet, and apply kNN (with
the labeled cs training examples) in the feature space.
Recall that there is no learning but memorization (of the training set) in kNN.
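A minimal Python sketch (our own) of this recipe, where a fixed random projection merely stands in for a
pre-trained feature extractor such as BERT or GPT, and plain 1-nearest-neighbour classifies the test point
using the cs labeled shots:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 10))                 # stand-in for a pre-trained feature extractor
features = lambda X: np.tanh(X @ W.T)             # phi(x); in practice BERT/GPT/CNN features

# c = 3 classes, s = 2 shots each (so cs = 6 labeled examples)
X_train = rng.standard_normal((6, 10))
y_train = np.array([0, 0, 1, 1, 2, 2])
x_test = rng.standard_normal(10)

F_train, f_test = features(X_train), features(x_test[None, :])[0]
dists = np.linalg.norm(F_train - f_test, axis=1)
print("predicted class:", y_train[np.argmin(dists)])   # 1-NN in the feature space
```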

Definition 20.4: Matching network (Vinyals et al. 2016)

Matching network (Vinyals et al. 2016) (along with many precursors) is a simple extension of the above
kNN algorithm. It makes the prediction ŷ on a new test sample x after seeing a few shots S = {(xi , yi ), i =
1, . . . , m} as follows:

ŷ(x; w|S) = ∑_{i=1}^{m} aw (x, xi |S) · yi ,        (20.1)

where the attention mechanism a : X × X → R+ measures the similarity between its inputs and is learned
parametrically. (For classification problems we interpret ŷ as the confidence, as usual.) The idea to param-
eterize a non-parametric model is prevalent in ML (e.g. neural nets or kernel learning).
Vinyals, Oriol, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra (2016). “Matching
Networks for One Shot Learning”. In: Advances in Neural Information Processing Systems 29, pp. 3630–3638.

Exercise 20.5: kNN as attention

With what fixed choice of the attention mechanism can we reduce (20.1) to kNN?


Alert 20.6: Like father, like son

An ML algorithm is good at the tasks that we train it to be good atᵃ. Thus, in the few-shot setting it is
natural to consider meta-training (Vinyals et al. 2016), to reflect the challenges that will be encountered
in the test phase. Specifically, we define a task (a.k.a. episode) T to consist of m = cs training examples
S := {(xi , yi ), i = 1, . . . , m} and r test examples R = {(xι , yι ), ι = 1, . . . , r} from c sampled classes:

min_{w}  E_{T =(S,R)} E_{(x,y)∼R} ℓ(y, ŷ(x; w|S)),

where ŷ(x; w|S) is the prediction based on the training data S (so the expectation over R mimics the test
loss), and ℓ is some loss function (e.g. cross-entropy for classification). Note that the c classes may be
different for different tasks T . In fact, in more challenging evaluations, it is typical to use classes in
the test phase that are disjoint from those appearing in the above training phase.
After training, we have two choices:
• directly apply the model, without any further fine-tuning, to new few-shot settings that may involve
novel classes that were never seen before. Of course, we expect the performance to degrade as the
novel classes become increasingly different from the ones that appeared in training;
• fine-tune the model using the given m = cs training examples before applying to test examples, at the
risk of potential severe overfitting.

Vinyals, Oriol, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra (2016). “Matching
Networks for One Shot Learning”. In: Advances in Neural Information Processing Systems 29, pp. 3630–3638.

a There is of course some mild generalization beyond memorization, but one should not take it too far: we simply cannot

expect an algorithm trained on dogs to recognize cats, unless we incorporate additional mechanism to make it even plausible.

Example 20.7: Instantiation of Matching Network

The simplest parameterization of the attention mechanism in (20.1) is through learning an embedding
(representation): Let f (x; w) = g(x; w) be a feature extractor (e.g. deep network with weights w), and set
 
aw (x, xi ) := softmax(− dist(f (x; w), f (X; w))),    X = [x1 , . . . , xm ],  f (X; w) = [f (x1 ; w), . . . , f (xm ; w)].

Vinyals et al. (2016) also considered the more complicated parameterization:


 
aw (x, xi ) := softmax(− dist(f (x; X, θ), g(X; ϕ))),    w = [θ, ϕ],

where dist is the cosine distance and f and g are two embedding networks with tunable parameters θ and ϕ,
respectively. We choose a bidirectional LSTM for g where the examples X are treated as the input sequence
(so that the embedding of xi depends on the entire set X). For f , we parameterize it as a memory-enhanced
LSTM:

[hk , ck ] = LSTM(f0 (x), [hk−1 , rk−1 ], ck−1 ),

rk = g(X) · softmax(g(X)⊤ (hk + f0 (x))),    k = 1, . . . , K,

where f0 (x) is the input embedding of x, hk is the LSTM state after k recurrences, ck is the cell state of the
LSTM, hk + f0 (x) serves as a residual connection, and g(X) = g(X; ϕ) = [g(x1 ; ϕ), . . . , g(xm ; ϕ)] acts as the
memory while rk is the read from memory (the softmax term above gives the attention weights). We set
f (x; X, θ) = hK .
Vinyals, Oriol, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra (2016). “Matching
Networks for One Shot Learning”. In: Advances in Neural Information Processing Systems 29, pp. 3630–3638.


Definition 20.8: Prototypical Network (Snell et al. 2017)

Prototypical network is a drastic simplification of the matching network, by first grouping training examples
(in the feature space) from the same class before passing to softmax:
min_{w}  E_{T =(S,R)} E_{(x,y)∼R} ℓ(y, ŷ(x; w|S)),    where  ŷ(x; w|S) = ∑_{k=1}^{c} a(f (x; w), f̄(Sk ; w)) · ek ,

Sk := {xi : yi = ek } is the set of training examples from the k-th class. As usual,

a(f (x; w), f̄(Sk ; w)) ∝ exp(− dist(f (x; w), f̄(Sk ; w))),


while we use the center (i.e. mean) of each class as its prototype:

f̄(Sk ; w) = (1/|Sk |) ∑_{x∈Sk} f (x; w).

Snell et al. (2017) showed that the (squared) Euclidean distance dist(x, xi ) := ∥x − xi ∥₂² performs better
than the cosine distance in matching networks. It is empirically observed that using more classes during
training while maintaining the same number of shots tends to improve test performance.
Mensink et al. (2013) and Rippel et al. (2016) also considered using multiple prototypes (e.g. mixture
model) for each class, providing an interpolation between the prototypical and the matching network.
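A minimal Python sketch (our own) of a prototypical-network prediction: prototypes are feature means of the
shots from each class, and the class confidence is a softmax over negative squared Euclidean distances:

```python
import numpy as np

def proto_predict(f_support, y_support, f_query, c):
    """f_support: (m, d) features of the cs shots; y_support: (m,) labels in {0, ..., c-1};
       f_query: (d,) feature of the test point. Returns the softmax confidence over the c classes."""
    protos = np.stack([f_support[y_support == k].mean(axis=0) for k in range(c)])
    neg_sq_dist = -np.sum((protos - f_query) ** 2, axis=1)
    e = np.exp(neg_sq_dist - neg_sq_dist.max())
    return e / e.sum()

rng = np.random.default_rng(1)
f_support = rng.standard_normal((6, 4))                 # c = 3 classes, s = 2 shots, feature dim 4
y_support = np.array([0, 0, 1, 1, 2, 2])
f_query = f_support[2] + 0.05 * rng.standard_normal(4)  # close to a class-1 shot
print(proto_predict(f_support, y_support, f_query, c=3))
```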
Snell, Jake, Kevin Swersky, and Richard Zemel (2017). “Prototypical Networks for Few-shot Learning”. In: Advances
in Neural Information Processing Systems 30, pp. 4077–4087.
Mensink, T., J. Verbeek, F. Perronnin, and G. Csurka (2013). “Distance-Based Image Classification: Generalizing
to New Classes at Near-Zero Cost”. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35,
no. 11, pp. 2624–2637.
Rippel, Oren, Manohar Paluri, Piotr Dollar, and Lubomir Bourdev (2016). “Metric Learning with Adaptive Density
Discrimination”. In: ICLR.

Definition 20.9: Archetypes vs. prototypes (Cutler and Breiman 1994)

Cutler and Breiman (1994) proposed the following archetype analysis: Given training data X = [x1 , . . . , xn ],
we solve

min ℓ(X, XU V ), (20.2)


U ∈Rn×r
+ ,V ∈Rr×n
+

s.t. U ⊤ 1 = 1, V ⊤ 1 = 1,

where Z = [z1 , . . . , zr ] := XU are called archetypes. The objective in (20.2) aims at recovering each training
point xi through convex combinations (with weights vi ) of the archetypes, which are themselves convex
combinations of the original data points. In fact, at optimality we may choose archetypes to lie on the
boundary of the convex hull of training points, and many of them tend to be extremal (especially for large
r). We can solve (20.2) by alternating minimization, although Cutler and Breiman (1994) pointed out that
the number of local minima increases with the number r of archetypes. Archetypes also tend to be non-robust under the
least squares objective.
Archetypes have the advantage of being quite interpretable. We may display them and analyze (the
distribution of) the reconstruction weights vi for each training point xi .
Cutler, Adele and Leo Breiman (1994). “Archetypal Analysis”. Technometrics, vol. 36, no. 4, pp. 338–347.


Definition 20.10: Model-agnostic meta-learning (MAML) (Finn et al. 2017)

Both matching and prototypical networks are some variations of (soft) kNN, which does not require any
training at the classifier level (although we do learn a versatile representation). Recall our meta-training
objective:
min_{w}  E_{T =(S,R)} E_{(x,y)∼R} ℓ(y, ŷ(x; w|S))        (as before, mimicking the test loss),

where ŷ(x; w|S) now can be any prediction rule based on training data S and model initializer w. In
particular, MAML (Finn et al. 2017) proposed the following instantiation:

ŷ(x; w|S) = ŷ(x; w̃), w̃ = Tη,S (w) := w − η∇L(w; S), L(w; S) := Ê(x,y)∼S ℓ(y, ŷ(x; w)) (20.3)

where ŷ(x; w) is a chosen prediction rule (e.g. softmax for classification or linear for regression). In other
words, based on the training data S we update the model parameter w to w̃, which is then applied on
the test set R to compute a loss for w, mimicking completely what one would do during fine-tuning. Of
course, we may change the one-step gradient in (20.3) to any update rule, such as k-step gradient. Plugging
everything in we obtain:
min_{w}  E_{T =(S,R)} E_{(x,y)∼R} ℓ(y, ŷ(x; w − η Ê_{(x,y)∼S} ∇ℓ(y, ŷ(x; w)))),

where the inner gradient step mimics fine-tuning using the training data S and the outer expectation mimics
the loss on the test set R.

MAML performs (stochastic) gradient updates on the above objective, which requires computing the Hessian ∇²ℓ
(or rather Hessian-vector products). Empirically, omitting the Hessian part seems not to impair the performance noticeably.
Importantly, MAML imposes no restriction on the prediction rule ŷ(x; w) (as long as we can backpropa-
gate gradient), and it aims at learning a representation (or initialization), which will be further fine-tuned at
test time (by running the update Tη,S (w) for several steps on the new shots S of the test task). In contrast, fine-tuning
is optional in both matching and prototypical networks (thanks to their kNN nature).
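A minimal Python sketch (our own) of MAML on toy 1-d linear regression tasks, using the first-order
approximation mentioned above, i.e. the Hessian term is simply dropped when back-propagating through the
inner gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(2)                             # meta-initialization of [slope, intercept]
eta, meta_lr = 0.1, 0.02

def sample_task():
    a, b = rng.uniform(-2, 2), rng.uniform(-1, 1)        # each task: y = a*x + b
    def draw(n=5):
        x = rng.uniform(-1, 1, n)
        return np.stack([x, np.ones(n)], axis=1), a * x + b
    return draw

def grad(w, X, y):                          # gradient of the squared loss 0.5*mean((Xw - y)^2)
    return X.T @ (X @ w - y) / len(y)

for it in range(3000):
    draw = sample_task()
    Xs, ys = draw()                         # support set S (the shots)
    Xq, yq = draw()                         # query set R (mimicking the test set)
    w_tilde = w - eta * grad(w, Xs, ys)             # inner fine-tuning step T_{eta,S}(w)
    w = w - meta_lr * grad(w_tilde, Xq, yq)         # first-order MAML: Hessian term omitted

print("meta-learned initialization:", w)
```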
Finn, Chelsea, Pieter Abbeel, and Sergey Levine (2017). “Model-Agnostic Meta-Learning for Fast Adaptation of Deep
Networks”. In: Proceedings of the 34th International Conference on Machine Learning, pp. 1126–1135.

Definition 20.11: Learning to optimize few-shot learning (Ravi and Larochelle 2017)

Ravi and Larochelle (2017) noted that the (stochastic) gradient update we typically use resembles the update
for the cell state in an LSTM:

ct = ft ⊙ ct−1 + it ⊙ c̃t ,

where ft ≡ 1, ct−1 = wt−1 , it = αt , and c̃t = −∇Lt . Inspired by (Andrychowicz et al. 2016), we can thus
employ an LSTM to parameterize our optimization algorithm:

it = σ(U [∇Lt , Lt , wt−1 , it−1 ])


ft = σ(V [∇Lt , Lt , wt−1 , ft−1 ])
wt = ct .

The parameters U, V of the LSTM are learned using the meta-training objective.
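A toy Python sketch (our own) of the parameterized update rule above, applied coordinate-wise with fixed,
untrained meta-parameters U and V to a simple quadratic; in Ravi and Larochelle (2017) these meta-parameters
are instead learned from the meta-training objective (the bias we place on V below, which merely keeps the
forget gate close to 1 so that the toy run behaves like gradient descent, is purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
U = rng.standard_normal(4) * 0.1                                     # meta-parameters (learned in practice)
V = rng.standard_normal(4) * 0.1 + np.array([0.0, 0.0, 0.0, 3.0])    # bias the forget gate towards 1

def loss_and_grad(w):                                 # objective: 0.5 * ||w - 1||^2
    return 0.5 * np.sum((w - 1.0) ** 2), w - 1.0

w = np.zeros(3)
i_prev, f_prev = np.full(3, 0.1), np.ones(3)
for t in range(200):
    L, g = loss_and_grad(w)
    feats_i = np.stack([g, np.full(3, L), w, i_prev])  # [grad, loss, w_{t-1}, i_{t-1}]
    feats_f = np.stack([g, np.full(3, L), w, f_prev])  # [grad, loss, w_{t-1}, f_{t-1}]
    i_t = sigmoid(U @ feats_i)                         # input gate, plays the role of a step size
    f_t = sigmoid(V @ feats_f)                         # forget gate, plays the role of weight decay
    w = f_t * w + i_t * (-g)                           # c_t = f_t*c_{t-1} + i_t*(-grad),  w_t = c_t
    i_prev, f_prev = i_t, f_t

print("final w:", w, "final loss:", loss_and_grad(w)[0])
```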
Ravi, Sachin and Hugo Larochelle (2017). “Optimization as a Model for Few-Shot Learning”. In: ICLR.
Andrychowicz, Marcin, Misha Denil, Sergio Gómez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shilling-
ford, and Nando de Freitas (2016). “Learning to learn by gradient descent by gradient descent”. In: Advances in
Neural Information Processing Systems 29, pp. 3981–3989.
