
Introduction to Machine Learning

Flavio Schneider and Soel Micheletti

Spring 2020
Table of Symbols

Symbol or Acronym Meaning


𝑎, 𝑏, 𝑐, 𝛼, 𝛽, 𝛾 Scalars (lowercase).
x, y, z, 𝜶 Vectors (bold lowercase).
𝑥𝑖 Index at 𝑖 of vector x.
A, B, C Matrices (bold uppercase).
A𝑖,: Row 𝑖 of matrix A as row vector.
A:,𝑗 Column 𝑗 of matrix A as column vector.
a𝑖 Row or column 𝑖 (vector) of matrix A (depends on context).
𝑎 𝑖,𝑗 Index at row 𝑖 and column 𝑗 of matrix A.
(𝑎, 𝑏, 𝑐) Row vector (ordered tuple) of 𝑎, 𝑏, 𝑐.
[𝑎, 𝑏, 𝑐] Column vector of 𝑎, 𝑏, 𝑐 .
(x , y , z) Matrix with x , y , z as columns.
[x , y , z] Matrix with x , y , z as rows.
{𝑥, 𝑦, 𝑧} Set of elements (unordered).
xᵀ, Aᵀ Transpose of vector or matrix.
X ,Y ,Z Random variables (sans-serif uppercase).
X ,Y ,Z Multivariate random variables (sans-serif bold uppercase).
ℙ[X = 𝑥] Probability that the random variable X is realized as 𝑥 .
𝔼[X] Expected value of X.
𝕍 [X ] Variance of X.
∀𝑥 Universal quantifier: for all 𝑥
∃𝑥 Existential quantifier: there exists 𝑥
𝑎∈𝐴 The set 𝐴 contains the element 𝑎 .
∅ Empty set.
𝑎 ..= 𝑏 The value 𝑎 is defined as 𝑏 .
𝑎 =.. 𝑏 The value 𝑏 is defined as 𝑎 .
ℕ The set of natural numbers: {1 , 2 , 3 , . . . }.
ℕ0 The set of natural numbers with 0: {0 , 1 , 2 , . . . }.
ℤ The set of integers.
ℝ The set of real numbers.
ℂ The set of complex numbers.
[𝑛] All integers from 1 to 𝑛 : [𝑛] ..= {1 , . . . , 𝑛}.
i. i. d. Independent and identically distributed.
i. e. That is. . . / This means. . .
e. g. For example. . .
s. t. Such that. . .
etc. And so on.
et al. And others.
Contents

Table of Symbols iii

Contents v

1 Introduction 1
1.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Supervised Learning 5
2 Regression 6
2.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Prediction Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Classification 15
3.1 Binary Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Perceptron Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6 Class Imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.7 Multi-class Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4 Kernels 27
4.1 Feature Explosion Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Polynomial Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Kernelized Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Kernel Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Infinite Dimensional Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.6 Kernelized SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.7 Kernelized Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 Neural Networks 36
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 General Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 Forward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.4 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.5 Computational Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.6 Back-Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.7 Weight Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.8 Optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.9 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.10 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.11 Convolutional Neural Networks (CNNs) . . . . . . . . . . . . . . . . . . . 53
5.12 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

6 Probabilistic Approach to Supervised Learning 60


6.1 Bias Variance Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3 Bayesian Decision Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.4 Generative Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

Unsupervised Learning 72
7 Classification 73
7.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.2 K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

8 Regression 77
8.1 Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.2 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . 77
8.3 Kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.4 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

9 Probabilistic Approach to Unsupervised Learning 87


9.1 Mixture Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
9.2 Gaussian Mixtures Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Introduction 1
Example 1.0.1 (Spam E-Mail) Imagine that one has to write a program
that, given an E-Mail message, decides whether the E-Mail is spam or
not. In order to solve the task one can devise a set of rules and classify
the E-Mail accordingly. An example of a rule could be: if the text body
contains "login here", classify it as spam. This approach is infeasible in
practice because it is easy to circumvent and choosing a good set of rules is
very difficult. Here is where machine learning comes into play: we
want an automatic discovery of rules from training data. Concretely, one
collects a large amount of data (i.e. examples of spam/non-spam
E-Mails) and then uses general purpose learning methods to discover a
decision rule that (hopefully) generalizes well to new examples which
are not part of the original training data.

Now we discuss a very broad definition of Machine Learning given by
Tom Mitchell, and we mention a couple of observations which show how
design choices affect the algorithm that one will use in the end:

A computer program is said to learn from experience E with respect to
some task T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E.

I A straightforward example of a performance measure could be
accuracy (i.e. the fraction of correct decisions made on examples
which are not part of the training set). Although it might seem a
natural choice, this is not always what one wants. In the case of
Example 1.0.1 a false positive (i.e. a legitimate E-Mail classified as
spam) might be considered worse than a false negative
(i.e. a spam E-Mail classified as non-spam).
I In Example 1.0.1 we discussed a binary classification task. However
in the E-Mail classification problem this might be not exactly what
the designer wants. Concretely, it could be better to have three
different labels: spam (which are put in the junk box), non-spam
(which are put in the inbox) and doubt (which are put in the inbox
but marked with a warning sign).

Nowadays, Machine Learning is a very hot topic in the news. In fact,
both Machine Learning and Artificial Intelligence play a very important
role in today’s society. This happens because we are in the era of big data,
i.e. we are in an era in which a lot of data (about human behaviour and
other phenomena) are available. If analysed properly, one can get value
out of this data, and Machine Learning plays a core role in this value
chain.
Machine Learning algorithms are usually divided into two major categories:
supervised learning and unsupervised learning. In this chapter we see a brief
overview of both types of learning, which will be the core of this script.

1.1 Supervised learning

Supervised learning can be thought of as learning from labeled examples.
In fact, the goal of supervised learning is to learn functions of the form

𝑓 :𝑋 →𝑌

Concretely, we want to learn a mapping from inputs in the set 𝑋 to
outputs in the set 𝑌. In the case of Example 1.0.1, 𝑋 is the set of all
possible E-Mail messages and 𝑌 is the set of possible labels (i.e.
spam / not-spam).
We can categorize supervised learning in the following subcategories:

I Classification: where 𝑌 is a discrete set of labels. When the cardinality
of 𝑌 is two, we have binary classification, otherwise multi-class
classification. A canonical example of multi-class classification is
ImageNet, a dataset of several images used for object identification
(i.e. given an image, one has to decide which of 1000 possible
objects is the main one in the picture). In this case, an element of 𝑋 is
the vector representing the pixels of the image and the corresponding element of 𝑌
is a vector of dimension 1000 where each component represents the probability that the
corresponding object is the main one in the image.
I Regression: where 𝑌 is a real (continuous) value, often in the form of
a vector. Some examples of regression are given in Table 1.1.

Table 1.1: Examples of regression.

X                      Y
Flight route           Delay
Real estate objects    Price
Patient and drug       Effectiveness
User and articles      Products to display
Image                  Sentence in natural language
Text in English        Text in Italian
Ugly JS code           Nice JS code

In order to understand how the general supervised learning pipeline


works we introduce some definitions.

Definition 1.1.1 (Labeled Dataset) A labeled dataset 𝐷 is a set of 𝑚 tuples.
Formally 𝐷 ..= {(x1, y1), . . . , (x𝑚, y𝑚)}. We refer to x𝑖 ∈ ℝⁿ as the feature
vector of the 𝑖-th element.

Definition 1.1.2 (Training Data) The training data 𝑇 ⊆ 𝐷 is a subset of
the labeled dataset that the learning algorithm will use to train the model.

Definition 1.1.3 (Test Data) The test data 𝑇′ ⊆ 𝐷 is a subset of the labeled
dataset (usually s.t. 𝑇 ∩ 𝑇′ = ∅) used to evaluate the performance of the
model.

In general (and at a high level) the supervised learning pipeline looks as


follows:

1. Gather some labeled dataset (i.e. examples of elements in 𝑋 associated


to the right value in 𝑌 ).
2. Pick some training data and some test data from the labeled dataset.
3. Use some learning algorithm on the training data in order to find a
(hopefully good) mapping from 𝑋 to 𝑌 .
4. Use the found mapping on the test data to evaluate its performance
according to some metrics. If the model works well one can use it
to predict the output from real world data.

Now we discuss some crucial aspects of this pipeline:

I Representation of data: since we use general-purpose learning
algorithms we have to choose a representation that works for
different kinds of inputs (e.g. both text and images). In this
course we will mostly use feature vectors, i.e. every 𝑥 ∈ 𝑋 will be
represented as a vector in ℝᵈ, where 𝑑 is the number of features
we consider (and must be chosen carefully). An example of data
representation for text documents is called bag-of-words: we suppose that
a language has at most 𝑑 = 100000 words and we represent each
document as a vector 𝑥 in ℝᵈ, where 𝑥ᵢ counts the occurrences
of word 𝑖 in the document (a minimal sketch is given after this list). This is very simple, but it has some
drawbacks: some words are more important than others (e.g. the
word no can be more informative than the word the), the scheme
ignores the order of the words in the sentence (i.e. the mapping
between documents and vectors is not injective), it is dependent
on the length of the document and it ignores the semantics of
the language. All these drawbacks can be mitigated with more
involved versions of the same concept.
I Model fitting: given training examples (i.e. feature vectors with
associated labels), it is necessary to find a decision rule to learn
the desired function. Examples of decision rules are hyperplanes,
linear decision trees, random forests, and deep neural networks.
I Prediction and generalization: at this point we have a model
that, given an element of 𝑋 , returns an element of 𝑌 . This model
has two desirable properties: goodness of fit (i.e. it must have a
good performance on the test set) and complexity (i.e. it should
not be too complex). In general when the model is too simple
we speak of underfitting and when the model is too complex we
speak of overfitting. Ideal models are simultaneously statistically
and computationally efficient.
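As a toy illustration of the bag-of-words representation mentioned above, the following sketch builds count vectors over a tiny fixed vocabulary (the vocabulary, documents and helper names are made up for illustration; real systems use vocabularies with tens of thousands of words):

```python
import numpy as np

# Hypothetical toy vocabulary.
vocabulary = ["login", "here", "meeting", "tomorrow", "free", "money"]
word_index = {w: i for i, w in enumerate(vocabulary)}

def bag_of_words(document: str) -> np.ndarray:
    """Map a document to a count vector x in R^d (d = vocabulary size)."""
    x = np.zeros(len(vocabulary))
    for word in document.lower().split():
        if word in word_index:
            x[word_index[word]] += 1
    return x

spam = bag_of_words("free money login here login here")
ham = bag_of_words("meeting tomorrow")
print(spam)  # [2. 2. 0. 0. 1. 1.]
print(ham)   # [0. 0. 1. 1. 0. 0.]
```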

In order to understand the concepts above, it is useful to apply them


concretely to Example 1.0.1. We can represent E-Mails as bag-of-words
(optimized in the best possible way) and represent each E-Mail of the
training set in a 𝑑-dimensional vector space. Then one has to find a
decision rule, e.g. a partition of the vector space into regions 𝑅 and 𝑅′ such
that:
I 𝑅 ∩ 𝑅′ = ∅
I 𝑅 ∪ 𝑅′ = ℝᵈ
I The spam E-Mails of the training set are in 𝑅
I The non-spam E-Mails of the training set are in 𝑅′

We can use this model (which, since we are in the case of classification,
is called a classifier) to classify a new E-Mail as follows: we represent the
new E-Mail in bag-of-words and we check whether this vector is in 𝑅
or 𝑅′. If the model is good enough we have a high probability of having
made the right choice. Keep in mind that we aim to have goodness of fit
and reasonable complexity of the regions at the same time.

1.2 Unsupervised learning

The other very famous class of Machine Learning algorithms is known


as unsupervised learning. This means learning without supervision, i.e.
the dataset does not have any labels.

Definition 1.2.1 (Unlabeled Dataset) An unlabeled dataset 𝐷 is a set of
𝑚 vectors. Formally 𝐷 ..= {x1, . . . , x𝑚}. We refer to x𝑖 ∈ ℝⁿ as the feature
vector of the 𝑖-th element. The difference to the labeled dataset is that we don't
have any labels.

In some sense we are still trying to learn the same function 𝑓 we intro-
duced for supervised learning and the steps of the pipeline are essentially
the same (i.e. training data, learning algorithm, model, prediction on test
data). Two canonical classifications of unsupervised learning algorithms
are:
I Clustering: can be thought of as unsupervised classification. Here
we have a set of data without labels as input and we want to assign
each vector input to a cluster (i.e. a group of similar data points) in
order to infer the label a posteriori.
I Dimension reduction: can be thought of as unsupervised regression.
Here we want to find a lower-dimensional representation of the dataset (which maybe
can even be visualized) in order to have more efficient computation.
The goal of dimension reduction is to preserve as much information as
possible, otherwise having the data in a lower-dimensional space
would be useless.
A common goal of unsupervised learning algorithms is finding a good
data representation (a form of compression). The objective, however, is
often not as clear as in supervised learning tasks. Examples of applications
where an unsupervised learning approach was used are face recognition,
anomaly detection, image generation, network inference, and many
more.
Supervised Learning
Regression 2

2.1 Linear regression

In its most general form, regression has the goal of learning a function 𝑓 of
the form:

𝑓 : ℝ𝑑 → ℝ

We have to ask ourselves two fundamental questions:

1. What type of functions should we consider?


2. How should we measure the goodness of fit?

In this section we talk about linear regression, but as we will see later, the
same ideas apply also to other types of regression. Linear regression is
of the form 𝑦 ≈ 𝑓 (𝑥), where 𝑓 is a linear function, i.e. a function that can
be written as:

𝑓 (𝑥) = 𝑤 1 𝑥1 + · · · + 𝑤 𝑑 𝑥 𝑑 + 𝑤 0 = w𝑇 𝑥 + 𝑤 0

where w = [𝑤 1 , . . . , 𝑤 𝑑 ] and x = [𝑥 1 , . . . , 𝑥 𝑑 ]. Without loss of generality,


we can use homogeneous coordinates and write:

𝑓 (𝑥) = w𝑇 𝑥

with w = [𝑤 1 , . . . , 𝑤 𝑑 , 𝑤 0 ] and x = [𝑥 1 , . . . , 𝑥 𝑑 , 1].


How do we quantify goodness of fit? There are multiple possible design
choices (and, depending on which one we pick, we will use different
algorithms). In this case we minimize the sum of squared residuals.
Concretely, given the training data 𝐷 = {(x1, 𝑦1), . . . , (x𝑛, 𝑦𝑛)} we define the
residual 𝑟ᵢ as 𝑦ᵢ − 𝑓(xᵢ) = 𝑦ᵢ − wᵀxᵢ. In the end we want to minimize

\hat{R}(w) = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - w^T x_i)^2

Now the question is: given a dataset 𝐷 = {(x1, 𝑦1), . . . , (x𝑛, 𝑦𝑛)}, how do
we find the optimal vector? Formally we want to find ŵ defined as:

\hat{w} = \arg\min_w \sum_{i=1}^{n} (y_i - w^T x_i)^2

In this particular case, we have a closed-form solution. In fact, we
can write the optimization problem as an (overdetermined) system of
equations:

\begin{bmatrix} x_{1,1} & \dots & x_{1,d} & 1 \\ x_{2,1} & \dots & x_{2,d} & 1 \\ \vdots & & \vdots & \vdots \\ x_{n,1} & \dots & x_{n,d} & 1 \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \\ w_0 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}

i.e. Aw = y, where A is the data matrix (with a column of ones appended for the
homogeneous coordinate), w is the weight vector and y is the label vector.
The least-squares solution gives us that the optimal weights are given by:

\hat{w} = (A^T A)^{-1} A^T y

However, most of the problems we solve in Machine Learning don't have
a closed-form solution. And even in this particular case, it might not always
be convenient to use the closed-form solution, since it is less efficient (one
needs to do a matrix multiplication and solve a linear system of equations)
and an exact optimum is not always necessary (an arbitrarily
good approximation is usually good enough).
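As a quick illustration, here is a minimal NumPy sketch of the closed-form least-squares solution above (the data and variable names are ours; in practice one would prefer a numerically stable solver such as np.linalg.lstsq over forming AᵀA explicitly):

```python
import numpy as np

# Toy data: n = 5 samples, d = 2 features, y roughly linear in X.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0], [5.0, 2.5]])
y = np.array([4.1, 3.2, 6.0, 9.1, 9.8])

# Append a column of ones for the bias term (homogeneous coordinates).
A = np.hstack([X, np.ones((X.shape[0], 1))])

# Closed-form least-squares solution w_hat = (A^T A)^{-1} A^T y.
w_closed = np.linalg.solve(A.T @ A, A.T @ y)

# Equivalent, numerically preferable route.
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(w_closed, w_lstsq)  # both should agree up to numerical error
```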
In general a widely used algorithm to solve optimization problems
is called gradient descent. The algorithm is very simple and looks as
follows:

Algorithm 2.1: Gradient descent
1  Start at an arbitrary w0 ∈ ℝᵈ
2  for 𝑡 = 0, 1, 2, . . .
3      w𝑡+1 = w𝑡 − 𝜂𝑡 ∇R̂(w𝑡)

where 𝜂𝑡 is called the learning rate. If the learning rate is chosen properly (in
this case a learning rate of 0.5 would work), the algorithm converges to
the optimum on convex functions∗, i.e. on functions such that:

∀x, x′ and 𝜆 ∈ [0, 1] it holds that 𝑓(𝜆x + (1 − 𝜆)x′) ≤ 𝜆𝑓(x) + (1 − 𝜆)𝑓(x′)
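Below is a minimal sketch of gradient descent for the squared-error objective R̂(w) above (the step size, number of steps and toy data are illustrative choices, not prescriptions from the text):

```python
import numpy as np

def gradient_descent(A, y, eta=0.01, steps=2000):
    """Minimize R_hat(w) = ||A w - y||^2 by plain gradient descent."""
    w = np.zeros(A.shape[1])          # arbitrary starting point w_0
    for _ in range(steps):
        grad = 2 * A.T @ (A @ w - y)  # gradient of the sum of squared residuals
        w = w - eta * grad            # w_{t+1} = w_t - eta_t * grad
    return w

# Tiny example: fit y = 2x + 1 with a bias column appended to the inputs.
x = np.array([0.0, 1.0, 2.0, 3.0])
A = np.column_stack([x, np.ones_like(x)])
y = 2 * x + 1
print(gradient_descent(A, y))  # should approach [2, 1]
```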

Here we squared the residuals, but there are many other possible (and
meaningful) loss functions. In general, the choice of the loss function
introduces trade-offs:

I |𝑟|: we have zero error with the right weights; no closed-form solution;
less sensitive to noise than the squared loss.

∗ This condition is sufficient but not necessary. For non-convex functions this method
sometimes converges to an optimum and sometimes to a stationary point.

I |𝑟|ᵖ (for 𝑝 > 1): convex function; the loss is almost zero for residuals which
are less than one in absolute value but very sensitive to noise
otherwise; might be useful if we want all points to be equally
important.
I |𝑟|ᵖ (for 𝑝 < 1): not convex, but still possible to use gradient descent
because of the shape of the function; robust to noise.

2.2 Polynomial Regression

Often fitting a linear model doesn’t work well because we underfit the
data, thus instead we will use a polynomial.

Goal
Given a set of feature vectors x1, . . . , xn where xi ∈ ℝᵈ (which can be
represented as a matrix X ∈ ℝⁿˣᵈ), and a set of labels 𝑦1, . . . , 𝑦𝑛 where
𝑦ᵢ ∈ ℝ (which can be represented as a vector y ∈ ℝⁿ).

Output the coefficient vector w ∈ ℝᴰ such that

f(x_i) := \sum_{j=1}^{D} w_j \phi_j(x_i) \approx y_i,  ∀𝑖 ∈ {1, . . . , 𝑛}      (2.1)

where 𝜙(xi) is the vector of all monomials up to the chosen maximum degree in
𝑥ᵢ,₁, . . . , 𝑥ᵢ,𝑑, and 𝐷 = |𝜙(x)|.ᵃ

ᵃ In two dimensions we have 𝜙(xi) = x̃i = [1, 𝑥ᵢ,₁, 𝑥ᵢ,₂, 𝑥²ᵢ,₁, 𝑥²ᵢ,₂, 𝑥ᵢ,₁ · 𝑥ᵢ,₂, . . . ]ᵀ

Notice that to find the coefficient vector w we can still use linear regression:
we compute the values of the feature map 𝜙 as new vectors
x̃i = 𝜙(xi), and then solve the problem using some standard linear
regression method:

\hat{w} = \arg\min_w \frac{1}{n}\sum_{i=1}^{n} (y_i - w^T \tilde{x}_i)^2      (2.2)
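A minimal sketch of this idea for one-dimensional inputs, assuming a quadratic feature map (data and helper names are ours):

```python
import numpy as np

def poly_features(x, degree=2):
    """Map scalar inputs x to [1, x, x^2, ..., x^degree] (phi in the text)."""
    return np.vstack([x**j for j in range(degree + 1)]).T

# Noisy samples of a quadratic function.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.1 * rng.standard_normal(x.shape)

Phi = poly_features(x, degree=2)                 # transformed inputs x_tilde
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # ordinary linear regression on Phi
print(w_hat)  # should be close to [1, -2, 3]
```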

2.3 Prediction Error

Choosing the degree of the polynomial is a critical task. If we use a
high-degree polynomial on noisy data we fit the noise, and hence the prediction
error will increase. This situation is known as overfitting. Conversely, if
error will increase. This situation is known as overfitting. Conversely, if
we use a polynomial with (too) low degree, we don’t have a model that
is powerful enough to properly fit the data. This situation is known as
underfitting. An important task in regression is choosing a model that
is powerful enough (in order to avoid underfitting), but also not too
complex (in order to avoid overfitting). The prediction error is a useful
tool to find the best degree which optimizes the goodness of fit.

We start by assuming that each tuple in the training dataset is generated
i. i. d. from some unknown distribution 𝑃:

(xi , 𝑦 𝑖 ) ∼ 𝑃(X, y) (2.3)

then we can define the notion of error.

Definition 2.3.1 (Expected Error) The expected error (or true risk) of w
under 𝑃 is defined as:

R(w) := \mathbb{E}_{x,y}\big[(y - w^T x)^2\big]                          (2.4)
      = \int\!\!\int P(x, y)\,(y - w^T x)^2 \, dx \, dy                  (2.5)

The problem is that 𝑃 is not known, and thus we can’t directly optimize
for 𝑅 . Instead, we estimate 𝑅 .

Definition 2.3.2 (Estimated Expected Error) Let 𝐷 be some labeled data,
then the estimated true risk (or empirical risk) is defined as:

\hat{R}_D(w) := \frac{1}{|D|} \sum_{(x,y) \in D} (y - w^T x)^2           (2.6)

Then, by the Law of large numbers, we know that R̂_D(w) → R(w) as |D| → ∞
for any fixed w, i. e. the more data we have the better our approximation
for 𝑅 will be since it will approach the true value.
Finally we can optimize our empirical risk using our training data:

\hat{w}_D := \arg\min_w \hat{R}_D(w)                                     (2.7)

ideally:

w^* := \arg\min_w R(w)                                                   (2.8)

However, it’s not always the case that as we have more training data in 𝐷
the optimal risk w∗ approaches the empirical risk ŵ𝐷 (this is not implied
by the Law of large numbers alone), for this we need the stronger notion
of uniform convergence.

Definition 2.3.3 (Uniform Convergence) We say that R̂_D converges
uniformly to R if

\sup_w \big| R(w) - \hat{R}_D(w) \big| \to 0  \text{ as } |D| \to \infty  (2.9)

Learning from Data The previous notions use the fact that |𝐷| must
approach infinity, however we always deal with a finite amount of training
samples and hence the following problem occurs:

Lemma 2.3.1 (Optimistic Estimate) Given a data set 𝐷 we have that:

\mathbb{E}_D\big[\hat{R}_D(\hat{w}_D)\big] \le \mathbb{E}_D\big[R(\hat{w}_D)\big]      (2.10)

Proof.

\mathbb{E}_D\big[\hat{R}_D(\hat{w}_D)\big] = \mathbb{E}_D\big[\min_w \hat{R}_D(w)\big]
  \le \min_w \mathbb{E}_D\big[\hat{R}_D(w)\big]                                        (Jensen's inequality)
  = \min_w \mathbb{E}_D\Big[\frac{1}{|D|}\sum_{i=1}^{|D|}(y_i - w^T x_i)^2\Big]
  = \min_w \frac{1}{|D|}\sum_{i=1}^{|D|}\mathbb{E}_{(x_i, y_i)\sim P}\big[(y_i - w^T x_i)^2\big]
  = \min_w R(w)
  \le \mathbb{E}_D\big[R(\hat{w}_D)\big]

Lemma 2.3.1 tells us that the expected value of the estimated error is
never larger than the expected value of the true error. This
is a problem, because by using a finite training set we will tend to
estimate a smaller error than the one we actually have. In order to avoid
underestimating the prediction error, we will use two different data sets
𝐷𝑡𝑟𝑎𝑖𝑛 and 𝐷𝑡𝑒𝑠𝑡 drawn from the same distribution, 𝐷𝑡𝑟𝑎𝑖𝑛, 𝐷𝑡𝑒𝑠𝑡 ∼ 𝑃, then:

Lemma 2.3.2 (Correct Estimate) Given 𝐷 := 𝐷𝑡𝑟𝑎𝑖𝑛 and 𝑉 := 𝐷𝑡𝑒𝑠𝑡, then:

\mathbb{E}_{D,V}\big[\hat{R}_V(\hat{w}_D)\big] = \mathbb{E}_D\big[R(\hat{w}_D)\big]      (2.11)

Proof.

\mathbb{E}_{D,V}\big[\hat{R}_V(\hat{w}_D)\big] = \mathbb{E}_D\Big[\mathbb{E}_V\big[\hat{R}_V(\hat{w}_D)\big]\Big]      (D and V are independent)
  = \mathbb{E}_D\Big[\mathbb{E}_V\Big[\frac{1}{|V|}\sum_{i=1}^{|V|}(y_i - \hat{w}_D^T x_i)^2\Big]\Big]
  = \mathbb{E}_D\Big[\frac{1}{|V|}\sum_{i=1}^{|V|}\mathbb{E}_{(x_i, y_i)\sim P}\big[(y_i - \hat{w}_D^T x_i)^2\big]\Big]
  = \mathbb{E}_D\big[R(\hat{w}_D)\big]

Lemma 2.3.2 tells us that if we use independent train and test (validation)
sets, the expected value of the estimated error is the same as the expected
value of the true error, and thus we will be able to estimate the correct
error by using our test set 𝐷𝑡𝑒𝑠𝑡 without a systematic underestimation.
This works because the two sets are independent, and thus we have an
unbiased error estimate. We must be careful to choose the test data in a
way that it's actually independent from the training data.¹

1: Test data samples might not be independent, e. g. when drawn from:
I Time series data, which might contain time-correlated values (e. g. stocks, video, audio, ...).
I Spatial data, which might be correlated (e. g. images).
I Noise, which might be correlated.

2.4 Cross Validation

We have analyzed the expected prediction error using 𝐷𝑡𝑟𝑎𝑖𝑛 and 𝐷𝑡𝑒𝑠𝑡
as samples from a distribution 𝑃 . In practice, we are given a labeled
dataset 𝐷 of finite dimension, hence we can’t sample the data from such a
distribution. Recall our initial goal of finding a good model that optimizes
goodness of fit given different parameters, e. g. different degrees for the
polynomials used in the polynomial regression. Thus we have to find a
way to pick 𝐷𝑡𝑟 𝑎𝑖𝑛 and 𝐷𝑡𝑒 𝑠𝑡 from 𝐷 and a way to exploit them in order
to evaluate the performance of a given model. This process is called
cross-validation and there are different ways to apply it.

Monte Carlo Cross-Validation We split the dataset 𝐷 into two disjoint
sets 𝐷 = 𝐷𝑡𝑟𝑎𝑖𝑛 ⊎ 𝐷𝑡𝑒𝑠𝑡 by picking some number of elements uniformly
at random for 𝐷𝑡𝑟𝑎𝑖𝑛 ⊂ 𝐷; the remaining elements form the test set
𝐷𝑡𝑒𝑠𝑡 = 𝐷 \ 𝐷𝑡𝑟𝑎𝑖𝑛. Then we train the model on the training set
𝐷𝑡𝑟𝑎𝑖𝑛 and validate on the test set 𝐷𝑡𝑒𝑠𝑡. Lastly, we estimate the prediction
error by averaging the test error over multiple random trials and we pick
the best model.

Algorithm 2.2: Monte Carlo CV
1  foreach model 𝑚 = 1, . . . , 𝑀
2      foreach repetition 𝑟 = 1, . . . , 𝑅
3          𝐷 = 𝐷𝑡𝑟𝑎𝑖𝑛 ⊎ 𝐷𝑡𝑒𝑠𝑡                            ▷ Split training and test sets randomly
4          ŵ = arg min_w R̂_{𝐷𝑡𝑟𝑎𝑖𝑛}(w)                   ▷ Train model
5          R̂_𝑚^{(𝑟)} := R̂_{𝐷𝑡𝑒𝑠𝑡}(ŵ)                     ▷ Save estimated error of repetition 𝑟 and model 𝑚
6      end
7  end
8  return 𝑚̂ = arg min_𝑚 (1/𝑅) Σ_{𝑟=1}^{𝑅} R̂_𝑚^{(𝑟)}      ▷ Pick model with smallest average error

K-fold Cross-Validation We split the dataset 𝐷 into 𝑘 disjoint folds such
that 𝐷 = 𝐷^{(1)} ⊎ · · · ⊎ 𝐷^{(𝑘)}; for fold 𝑖 we use 𝐷𝑡𝑒𝑠𝑡^{(𝑖)} := 𝐷^{(𝑖)} as the test set
and 𝐷𝑡𝑟𝑎𝑖𝑛^{(𝑖)} := 𝐷 \ 𝐷^{(𝑖)} as the training set. Then, for each
model, we train on each training fold and validate on the corresponding test fold. Lastly,
we estimate the prediction error by averaging over the folds and we
pick the best model.

Algorithm 2.3: K-fold CV
1  foreach model 𝑚 = 1, . . . , 𝑀
2      foreach fold 𝑖 = 1, . . . , 𝑘
3          𝐷 = 𝐷𝑡𝑟𝑎𝑖𝑛^{(𝑖)} ⊎ 𝐷𝑡𝑒𝑠𝑡^{(𝑖)}                  ▷ Split training and test sets using fold 𝑖
4          ŵ^{(𝑖)} = arg min_w R̂_{𝐷𝑡𝑟𝑎𝑖𝑛^{(𝑖)}}(w)        ▷ Train model on the training set of fold 𝑖
5          R̂_𝑚^{(𝑖)} := R̂_{𝐷𝑡𝑒𝑠𝑡^{(𝑖)}}(ŵ^{(𝑖)})          ▷ Save estimated error of fold 𝑖 and model 𝑚
6      end
7  end
8  return 𝑚̂ = arg min_𝑚 (1/𝑘) Σ_{𝑖=1}^{𝑘} R̂_𝑚^{(𝑖)}       ▷ Pick model with smallest average error
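A minimal sketch of k-fold cross-validation used to choose the polynomial degree, with plain least squares on polynomial features as the learning algorithm (the scoring loop and helper names are ours; libraries such as scikit-learn provide ready-made versions of this procedure):

```python
import numpy as np

def poly_features(x, degree):
    return np.vstack([x**j for j in range(degree + 1)]).T

def kfold_error(x, y, degree, k=5, seed=0):
    """Average squared test error of a degree-`degree` fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        Phi_tr, Phi_te = poly_features(x[train], degree), poly_features(x[test], degree)
        w, *_ = np.linalg.lstsq(Phi_tr, y[train], rcond=None)   # train on the other folds
        errors.append(np.mean((y[test] - Phi_te @ w) ** 2))     # validate on fold i
    return np.mean(errors)

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)
best = min(range(1, 8), key=lambda d: kfold_error(x, y, d))
print("selected degree:", best)
```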

What are the tradeoffs of choosing different sizes of 𝑘 ?


If 𝑘 is too small:
I Risk of overfitting to the test set.
I Risk of using too little data for training.
I Risk of underfitting to the training set.

If 𝑘 is too large:
I Better performance, usually 𝑘 = 𝑛 works really well and it’s called
leave-one-out cross-validation LOOCV.
I Higher computational complexity.
I Risk of underfitting to training set.

2.5 Model Selection

Cross-validation is a useful tool for model selection, i. e. choosing the


model which performs best on our task and, hopefully, generalizes well
to real-world data. In case we have to select the best polynomial degree
for a regression task, we can simply iterate with increasing degrees and
check which model has the least expected error. However polynomial
regression is not the only possible case, in facts we might also want to
perform regression by choosing a basis function which is not a polynomial
(e. g. a combination of trigonometrical, logarithmical and exponential
transformations). In those cases, iterating with increasing degree is not
an option. Here one would have to create a list of candidate models
(without any ordering relation between them) and iterating it.

2.6 Regularization

Standard linear regression seeks to optimize the data fit by minimizing
the mean squared error, min_w R̂(w). This approach might work well if our
data don't contain any outliers, but it may be problematic if our data
are (a little bit) noisy. Consider a dataset with 1000 points. A polynomial
with degree 1000 will interpolate those points and hence will have
zero error on them. However, if a few points are outliers, our high-degree
polynomial will take an artificial shape to fit the noise. This will lead
to a large error for points which are in proximity of this artificial
shape, and hence this model would not generalize well. Such a polynomial
will have some large weights in order to interpolate the points, and hence
we can say that large weights are indicators of overfitting. Our goal is
to keep weights reasonably small in order to avoid such a phenomenon.
Thus, instead of optimizing just for the mean squared error, we add
an additional penalty on the size of the weights. This practice is called
regularization.

Definition 2.6.1 (Regularization) Given a penalty function 𝐶 and a regularization
parameter 𝜆 we can define a regularization problem as:

\min_w \hat{R}(w) + \lambda C(w)                                          (2.12)

where 𝜆 is used to weight how much 𝐶 should penalize w.

Regularization seeks to find a balance between the goodness of fit and


the magnitude of the weights. There are countless possible regularizers,
here we present ridge regression, a popular solution that is widely adopted
in practice.

Ridge Regression This type of regularization uses the squared norm
𝐶(w) := ‖w‖₂² as the penalty function, which has nice mathematical
properties when we want to solve for w:

\hat{w} = \arg\min_w \frac{1}{n}\sum_{i=1}^{n}\big(y_i - w^T x_i\big)^2 + \lambda \|w\|_2^2      (2.13)

The solution to this problem can be found both via gradient descent:

Gradient Evaluation

\nabla_w \big(\hat{R}(w) + \lambda\|w\|_2^2\big) = \nabla_w \hat{R}(w) + \lambda\nabla_w\|w\|_2^2      (2.14)
                                                 = \nabla_w \hat{R}(w) + \lambda\nabla_w(w^T w)        (2.15)
                                                 = \nabla_w \hat{R}(w) + 2\lambda w                    (2.16)

GD Update Rule

w_{t+1} \leftarrow w_t - \eta_t\big(\nabla_w \hat{R}(w_t) + 2\lambda w_t\big)                          (2.17)
        = (1 - 2\lambda\eta_t)\, w_t - \eta_t \nabla\hat{R}(w_t)                                       (2.18)

and via the analytical solution:

\hat{w} = (X^T X + \lambda I)^{-1} X^T y                                                               (2.19)

The best 𝜆 is typically chosen with cross-validation over a logarithmically
spaced list of candidates².

2: 𝜆 ∈ {. . . , 10⁻², 10⁻¹, 10⁰, 10¹, 10², . . . }
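A small NumPy sketch of the analytical ridge solution (2.19); the data and the values of 𝜆 are arbitrary illustrations:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y, i.e. the ridge solution (2.19)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(50)

for lam in [0.0, 0.1, 10.0]:
    print(lam, ridge_closed_form(X, y, lam))  # larger lam shrinks the weights toward 0
```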

2.7 Standardization

In the previous section, we have seen that large weights often correspond
to noise and are indicators of overfitting. For this reason, we introduced
a term to penalize large weights. However, the idea that having smaller
weights leads to more accurate models might not always be true. Consider
an example where we have three features whose magnitudes are completely
different (e. g. the first feature is in the order of 10⁴, the
second in the order of 10³ and the third one in the order of 10⁰). If we
penalize large weights we might come to a situation where all weights
are similar (e. g. all close to one). However, since the features have
completely different magnitudes, the first feature would have a much
larger impact than the other ones, and this is undesirable since we would
lose the information brought by the other features. A solution to this
problem is using standardization to scale our data such that each feature has
zero mean and unit variance.

\hat{\mu}_j = \frac{1}{n}\sum_{i=1}^{n} x_{i,j}                            (2.20)

\hat{\sigma}_j^2 = \frac{1}{n}\sum_{i=1}^{n} (x_{i,j} - \hat{\mu}_j)^2     (2.21)

\tilde{x}_{i,j} := \frac{x_{i,j} - \hat{\mu}_j}{\hat{\sigma}_j}            (2.22)
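The corresponding operation in NumPy, applied column-wise to a data matrix X (a sketch; libraries provide equivalents such as scikit-learn's StandardScaler):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit variance."""
    mu = X.mean(axis=0)            # per-feature empirical mean (2.20)
    sigma = X.std(axis=0)          # per-feature empirical standard deviation (2.21)
    return (X - mu) / sigma        # standardized features (2.22)

X = np.array([[1e4, 2e3, 1.0],
              [2e4, 1e3, 2.0],
              [3e4, 3e3, 3.0]])
X_std = standardize(X)
print(X_std.mean(axis=0), X_std.std(axis=0))  # approximately zeros and ones
```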
Classification 3

In the previous chapter we discussed regression, i. e. the problem of
learning a function 𝑓 : 𝑋 → 𝑌, where 𝑌 is a continuous set such as ℝ.
Now we introduce classification, where the set 𝑌 is discrete. The high-level
idea is assigning each point in 𝑋 to a specific category. For example, the
space 𝑋 could represent the space of pictures and we want to find the
best way to assign each picture to either the category cat or the category
dog, depending on which animal is represented in the picture. In order
to design such algorithms, we will combine some concepts learned in
the previous chapter (e. g. gradient descent, regularization, the general
idea of minimizing a loss function, ...) with new, ad hoc tools.

3.1 Binary Classification


Goal
Given a set of feature vectors x1 , . . . , xn where xi ∈ ℝ 𝑑 (which can be
represented as a matrix X ∈ ℝ 𝑛×𝑑 ), and a set of labels 𝑦1 , . . . , 𝑦𝑛 where
𝑦 𝑖 ∈ {−1 , +1} (which can be represented as a vector y ∈ {−1 , +1} 𝑛 ).

Output the coefficient vector w ∈ ℝ 𝑑 such that

𝑓 (xi ) ..= sign(w𝑇 xi ) = 𝑦 𝑖 , ∀𝑖 ∈ {1, . . . , 𝑛} (3.1)

The intuition is that w represents the normal to a hyperplane (in 2D, the
normal to the separating line) that points in the direction of the positive
labels. If the point xi is on the same side as where the vector w points to,
we will have a positive dot product wᵀxi and thus a sign of +1, otherwise
a negative dot product and a sign of −1.

0/1 Loss To find the optimal ŵ we could consider the approach of
counting the number of samples that we got wrong, using a loss function
called the 0/1 loss, and then minimizing that value:

\hat{w} = \arg\min_w \frac{1}{n}\sum_{i=1}^{n} \ell_{0/1}(w; x_i, y_i)      (3.2)

Definition 3.1.1 (0/1 Loss) Given the coefficient vector w, the current
feature vector xi and label 𝑦ᵢ, the 0/1 loss is defined as:

\ell_{0/1}(w; x_i, y_i) := \begin{cases} 0 & \text{if } y_i \cdot w^T x_i \ge 0 \\ 1 & \text{if } y_i \cdot w^T x_i < 0 \end{cases}      (3.3)

The problem is that ℓ 0/1 is neither differentiable nor convex, and thus we
cannot use our standard optimization method such as gradient descent.

3.2 Perceptron Algorithm

Since ℓ0/1 is not suitable for our purposes, we have to introduce a surrogate
loss which is both informative and compatible with gradient descent.
The perceptron algorithm uses the following loss function ℓ𝑃, which is
similar to ℓ0/1, convex and differentiable almost everywhere.

Definition 3.2.1 (Perceptron Loss) Given the coefficient vector w the


current feature vector xi and label 𝑦 𝑖 , the perceptron loss is defined as:

ℓ 𝑃 (w; xi , 𝑦 𝑖 ) ..= max(0 , −𝑦 𝑖 · w𝑇 xi ) (3.4)

The following objective function can now be optimized with gradient
descent:

\hat{w} = \arg\min_w \frac{1}{n}\sum_{i=1}^{n} \ell_P(w; x_i, y_i)      (3.5)

Gradient Evaluation

\nabla_w \hat{R}(w) = \frac{1}{n}\sum_{i=1}^{n} \nabla_w \max(0, -y_i \cdot w^T x_i)      (3.6)
                    = \frac{1}{n}\sum_{i=1}^{n} \begin{cases} 0 & \text{if } y_i \cdot w^T x_i \ge 0 \\ -y_i x_i & \text{if } y_i \cdot w^T x_i < 0 \end{cases}      (3.7)

GD Update Rule

w_{t+1} \leftarrow w_t - \eta_t \frac{1}{n}\sum_{i=1}^{n} \begin{cases} 0 & \text{if } y_i \cdot w^T x_i \ge 0 \\ -y_i x_i & \text{if } y_i \cdot w^T x_i < 0 \end{cases}      (3.8)

Theorem 3.2.1 (Perceptron Convergence) If the provided training dataset


is linearly separable, then the perceptron algorithm will always find coefficients
w that linearly separate the data.

Note that while we use the perceptron loss on the training data, we still
use the ℓ 0/1 loss on the test data to compute the number of errors and
evaluate the performance of our model. A drawback of the algorithm we
have presented so far is that, in order to do a single weights update, we
have to iterate over the whole dataset. This might be very inefficient for
large datasets. Now we present the variant of the perceptron algorithm
which is most widely used in practice. This variant uses stochastic gradient
descent (see next section) in order to efficiently optimize the objective
function.

Algorithm 3.1: Perceptron
1  w0 ← 0
2  foreach 𝑡 = 1, 2, . . . , 𝑇
3      sample (xi, 𝑦ᵢ) ∈u.a.r. 𝐷𝑡𝑟𝑎𝑖𝑛      ▷ Sample uniformly at random with replacement
4      if 𝑦ᵢ w𝑡ᵀ xi ≥ 0
5          w𝑡+1 ← w𝑡
6      else
7          w𝑡+1 ← w𝑡 + 𝜂𝑡 𝑦ᵢ xi
8  end
9  return w𝑇+1
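A compact sketch of this stochastic perceptron update on a linearly separable toy problem (data, step count and learning rate are illustrative choices; ties at the decision boundary are treated as mistakes so that the all-zero starting point can move):

```python
import numpy as np

def perceptron(X, y, steps=1000, eta=1.0, seed=0):
    """Stochastic perceptron: update w only when a sampled point is misclassified."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))          # sample a point uniformly at random
        if y[i] * (w @ X[i]) <= 0:        # misclassified (boundary counts as a mistake)
            w = w + eta * y[i] * X[i]     # move w toward the correct side
    return w

# Linearly separable toy data with a homogeneous coordinate appended.
X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0], [-1.0, -1.5, 1.0], [-2.0, -0.5, 1.0]])
y = np.array([+1, +1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))  # should reproduce y
```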

3.3 Stochastic Gradient Descent

Standard gradient descent is highly inefficient: if we have a lot of training


samples we have to loop through them all just to take a small step towards
the optimum. Instead, stochastic gradient descent samples a single data
point uniformly at random (with replacement) from the training set at
each step, and therefore it’s much more efficient.

Algorithm 3.2: SGD
1  w0 ∈u.a.r. ℝᵈ
2  foreach 𝑡 = 1, 2, . . . , 𝑇
3      sample (x, 𝑦) ∈u.a.r. 𝐷𝑡𝑟𝑎𝑖𝑛      ▷ Sample uniformly at random with replacement
4      w𝑡+1 ← w𝑡 − 𝜂𝑡 ∇w ℓ(w𝑡; x, 𝑦)
5  end
6  return w𝑇+1

Mini-Batch SGD Using stochastic gradient descent with only a single


sample might have a large variance in the gradient estimate, and hence
leads to slow convergence. To solve this problem and reduce the variance,
instead of picking a single training sample, we will pick a small subset
of the training data called mini-batch. Mini-batches will both have the
advantage of a fast and stable convergence.

Algorithm 3.3: Mini-Batch SGD
1  w0 ∈u.a.r. ℝᵈ
2  foreach 𝑡 = 1, 2, . . . , 𝑇
3      sample 𝐷𝑏𝑎𝑡𝑐ℎ ⊆u.a.r. 𝐷𝑡𝑟𝑎𝑖𝑛      ▷ Mini-batch sampled uniformly at random with replacement
4      w𝑡+1 ← w𝑡 − 𝜂𝑡 (1/|𝐷𝑏𝑎𝑡𝑐ℎ|) Σ_{(x,𝑦)∈𝐷𝑏𝑎𝑡𝑐ℎ} ∇ℓ(w𝑡; x, 𝑦)
5  end
6  return w𝑇+1
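A generic sketch of the mini-batch SGD loop, parameterized by a per-example gradient function (the function names, batch size and constant step size are our illustrative choices):

```python
import numpy as np

def minibatch_sgd(grad_fn, X, y, dim, batch_size=8, eta=0.1, steps=2000, seed=0):
    """Generic mini-batch SGD: grad_fn(w, x_i, y_i) returns the per-example gradient."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)                      # arbitrary starting point
    for _ in range(steps):
        batch = rng.integers(len(y), size=batch_size) # mini-batch with replacement
        g = np.mean([grad_fn(w, X[i], y[i]) for i in batch], axis=0)
        w = w - eta * g
    return w

# Example: squared loss (y - w^T x)^2, whose per-example gradient is -2 (y - w^T x) x.
sq_grad = lambda w, x, y: -2 * (y - w @ x) * x
X = np.column_stack([np.linspace(0, 1, 50), np.ones(50)])
y = 3 * X[:, 0] + 0.5
print(minibatch_sgd(sq_grad, X, y, dim=2))  # should approach [3, 0.5]
```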

3.4 Support Vector Machine

The perceptron algorithm is an efficient method for binary classification


and it always finds coefficients w that linearly separate the data if such
coefficients exist. But what if there are multiple vector coefficients w that
linearly separate the data? The perceptron algorithm will choose an
arbitrary hyperplane. However, different hyperplanes can be more or less

error-prone than others. For example, if the line we pick is close to one of
the two clusters of data, it will be more sensitive to noise than another
one which keeps a larger margin between clusters. The support vector
machine algorithm uses an objective function that maximizes the margin
between the separating hyperplane and the data. With this method, lines
close to the clusters of data are penalized and therefore noise resistance
is increased. This is obtained by introducing a new loss function and
applying regularization.¹

1: Since we are using regularization, remember that we have to standardize our data.

\hat{w} = \arg\min_w \frac{1}{n}\sum_{i=1}^{n} \ell_H(w; x_i, y_i) + \lambda \|w\|_2^2      (3.9)

Definition 3.4.1 (Hinge Loss) Given the coefficient vector w the current
feature vector xi and label 𝑦 𝑖 , the hinge loss is defined as:

ℓ 𝐻 (w; xi , 𝑦 𝑖 ) ..= max{0, 1 − 𝑦 𝑖 w𝑇 xi } (3.10)

Gradient Evaluation

\nabla_w \hat{R}(w) = \frac{1}{n}\sum_{i=1}^{n} \nabla_w \max\{0, 1 - y_i w^T x_i\} + \lambda \nabla_w \|w\|_2^2      (3.11)
                    = \frac{1}{n}\sum_{i=1}^{n} \begin{cases} 0 & \text{if } y_i \cdot w^T x_i \ge 1 \\ -y_i x_i & \text{if } y_i \cdot w^T x_i < 1 \end{cases} + 2\lambda w      (3.12)

GD Update Rule ²

2: Usually a good choice for the learning rate is 𝜂𝑡 = 1/(𝜆𝑡), where 𝜆 is found using cross-validation.

w_{t+1} \leftarrow (1 - 2\eta_t\lambda)\, w_t - \eta_t \frac{1}{n}\sum_{i=1}^{n} \begin{cases} 0 & \text{if } y_i \cdot w_t^T x_i \ge 1 \\ -y_i x_i & \text{if } y_i \cdot w_t^T x_i < 1 \end{cases}      (3.13)

SGD Update Rule

w_{t+1} \leftarrow (1 - 2\eta_t\lambda)\, w_t + \eta_t y_t x_t \, [y_t w_t^T x_t < 1]      (3.14)

Similarly to the perceptron algorithm, we don’t use the hinge loss for the
validation of our model but we would use the target performance metric
(e. g. the number of mistakes with the 0/1 loss).
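A short sketch of the SGD update (3.14) for the SVM objective; the regularization strength, step-size schedule and toy data are illustrative choices:

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, steps=5000, seed=0):
    """SGD on the regularized hinge loss, using the update rule (3.14)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, steps + 1):
        eta = 1.0 / (lam * t)                 # learning-rate schedule eta_t = 1/(lambda * t)
        i = rng.integers(len(y))
        violated = y[i] * (w @ X[i]) < 1      # is the margin condition violated?
        w = (1 - 2 * eta * lam) * w + eta * y[i] * X[i] * violated
    return w

X = np.array([[2.0, 2.0, 1.0], [1.0, 3.0, 1.0], [-2.0, -1.0, 1.0], [-1.0, -3.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w = svm_sgd(X, y)
print(np.sign(X @ w))  # should reproduce y on this separable toy set
```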

3.5 Feature Selection

The models we have presented so far are trained with some feature
vectors x1 , . . . , xn , where 𝑥 ∈ ℝ 𝑑 . If the dimension 𝑑 of a feature vector is
high (i. e. there are many parameters) our model might take a long time
to train. In many cases, some features in a feature vector are redundant
and don’t bring any useful information: keeping those features is not
desirable since they make our model less efficient without improving its
performance. For this reason, it’s crucial to find a way to select only the
important features. The optimization process of selecting the best features
is called feature selection and, in general, it’s a very difficult combinatorial
problem. In this section, we will explore some heuristics to approach

the problem. Before we present them, it’s important to introduce the


preliminary notion of feature error.

Feature Error

Definition 3.5.1 (Feature Selection) Given a set of features 𝑉 = {1, . . . , 𝑑},
a subset 𝑆 ⊆ 𝑉 of cardinality 𝑘, and one feature vector xi = [𝑥ᵢ,₁, . . . , 𝑥ᵢ,𝑑],
the feature selection of that feature vector is the sub-vector containing only
the coordinates in 𝑆, e. g.

xi(𝑆) := [𝑥ᵢ,₁, . . . , 𝑥ᵢ,𝑘]  if 𝑆 = {1, . . . , 𝑘}      (3.15)

When using this feature selection, the associated coefficient vector is
given by:

\hat{w}(S) := \arg\min_{w(S)} \frac{1}{n}\sum_{i=1}^{n} \ell\big(w(S); x_i(S), y_i\big) + \lambda \|w(S)\|_2^2      (3.16)

A feature selection just picks a sparse version of the initial feature vector
that is then reduced to a lower dimensional vector, thus both w(𝑆) and
xi (𝑆) will be a lower-dimensional version of w and xi respectively. We
will now be able to define the feature error as:

Definition 3.5.2 (Feature Error) Given 𝑆 ⊆ 𝑉 and a coefficient vector
ŵ(𝑆), the feature error is the cross-validation prediction error of ŵ(𝑆),
denoted by 𝐿̂(𝑆).

The feature error allows us to measure the error of a subset of features.
We can use this concept to try different sets 𝑆 and take the one that gives
the least feature error 𝐿̂(𝑆). The first, naive idea would be to test all
possible subsets 𝑆 ⊆ 𝑉 and pick the best one. However, this method is
highly inefficient and thus we will consider more clever methods.

Greedy Forward Selection The idea of greedy forward selection is starting


with an empty set 𝑆 and always picking the remaining feature that
decreases the error the most until the error increases.

Algorithm 3.4: Greedy Forward Selection
1  𝑆 ← ∅
2  𝐸₀ ← ∞
3  foreach 𝑖 = 1, . . . , 𝑑
4      𝑠ᵢ := arg min_{𝑗∈𝑉\𝑆} 𝐿̂(𝑆 ∪ {𝑗})      ▷ Find best element to add
5      𝐸ᵢ ← 𝐿̂(𝑆 ∪ {𝑠ᵢ})                       ▷ Compute error
6      if 𝐸ᵢ > 𝐸ᵢ₋₁ break                      ▷ Stop if the new best element increases the error
7      else 𝑆 ← 𝑆 ∪ {𝑠ᵢ}                       ▷ Otherwise add the new best element and continue
8  end
9  return 𝑆

The advantage of this algorithm is that it’s relatively fast if we have only a
few features that are important and many that are not. However, it cannot
handle dependent features well since it might get stuck in a sub-optimal
solution, especially if almost all features are necessary.
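A minimal sketch of greedy forward selection where the feature error 𝐿̂(𝑆) is approximated by a hold-out mean squared error of a least-squares fit on the selected columns (the scoring function, data and split are illustrative assumptions, not the notes' prescription):

```python
import numpy as np

def holdout_error(X_tr, y_tr, X_te, y_te, S):
    """Approximate L_hat(S): fit least squares on columns S, evaluate on hold-out data."""
    S = sorted(S)
    w, *_ = np.linalg.lstsq(X_tr[:, S], y_tr, rcond=None)
    return np.mean((y_te - X_te[:, S] @ w) ** 2)

def greedy_forward(X_tr, y_tr, X_te, y_te):
    d = X_tr.shape[1]
    S, prev_err = set(), np.inf
    while len(S) < d:
        candidates = [j for j in range(d) if j not in S]
        errs = {j: holdout_error(X_tr, y_tr, X_te, y_te, S | {j}) for j in candidates}
        best = min(errs, key=errs.get)        # feature whose addition decreases the error most
        if errs[best] > prev_err:             # stop as soon as the error starts increasing
            break
        S.add(best)
        prev_err = errs[best]
    return S

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 6))
y = 2 * X[:, 0] - 3 * X[:, 4] + 0.1 * rng.standard_normal(200)  # only features 0 and 4 matter
print(greedy_forward(X[:100], y[:100], X[100:], y[100:]))        # likely {0, 4}
```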

Greedy Backward Selection The idea of greedy backward selection is


starting with the entire set of features 𝑆 , and always removing the
remaining feature that decreases the error the most.

Algorithm 3.5: Greedy Backward Selection
1  𝑆 ← {1, . . . , 𝑑}
2  𝐸_{𝑑+1} ← ∞
3  foreach 𝑖 = 𝑑, . . . , 1
4      𝑠ᵢ := arg min_{𝑗∈𝑆} 𝐿̂(𝑆 \ {𝑗})      ▷ Find best element to remove
5      𝐸ᵢ ← 𝐿̂(𝑆 \ {𝑠ᵢ})                     ▷ Compute error
6      if 𝐸ᵢ > 𝐸ᵢ₊₁ break                    ▷ Stop if removing the element increases the error
7      else 𝑆 ← 𝑆 \ {𝑠ᵢ}                     ▷ Otherwise remove the element and continue
8  end
9  return 𝑆

This selection can handle dependent features much better. If almost all
features are important and only a few can be removed this algorithm
might work better than forward selection.

Joint Selection Both greedy feature selection methods are expensive
if we have a large dataset; furthermore, neither method guarantees that
an optimal solution is reached. It would be ideal if we could find all
the important features by solving a single optimization problem. The
idea is that, as we have seen before, the coefficient vector w(𝑆) is a
lower-dimensional version of a sparse w; thus, if we limit the number of
non-zero entries of w, we implicitly select only a small number of features.
This method is called joint selection.

\hat{w} := \arg\min_w \frac{1}{n}\sum_{i=1}^{n}\big(y_i - w^T x_i\big)^2 \quad \text{s.t.} \ \|w\|_0 \le k      (3.17)

Definition 3.5.3 (𝐿0-Norm) Let w be a vector, then the 𝐿0-norm is:

\|w\|_0 := \text{number of non-zero entries of } w      (3.18)

Thus by constraining ‖w‖₀ to be smaller than 𝑘, we constrain the
number of features to be at most 𝑘. However, optimizing under an ‖w‖₀
constraint is a hard combinatorial problem and it cannot be solved easily.

Lasso Regression Previously, we have seen how the perceptron algorithm
uses a surrogate loss ℓ𝑃 such that ℓ0/1 can be approximated and
optimized. In this case, we will use a similar method, but instead of working
with a surrogate loss we will use a surrogate convex regularization
term. The idea is that instead of limiting the number of non-zero entries of w to
be at most 𝑘, we will penalize large weights of w in the following way:

\hat{w} := \arg\min_w \frac{1}{n}\sum_{i=1}^{n}\big(y_i - w^T x_i\big)^2 + \lambda\|w\|_1      (3.19)

Definition 3.5.4 (𝐿1-Norm) Let w be a vector, then the 𝐿1-normᵃ is:

\|w\|_1 := \sum_{i=1}^{d} |w_i|      (3.20)

ᵃ This norm is convex and thus easy to optimize for.

The 𝐿1-norm penalizes large weights and thus promotes the sparsity³ of w.
This regression method is called Lasso Regression.

3: This idea of using the 𝐿1-norm to promote sparsity of the coefficient vector is very
important and used throughout machine learning.

One clear advantage of this method is that it's faster: we train the
model and at the same time select the best features by encouraging sparsity.
However, this method only works for linear models, whereas the greedy
methods are slower but apply to any model.
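A small illustration of the sparsity-inducing effect of the L1 penalty, using scikit-learn's Lasso (assuming scikit-learn is available; the data and the value of 𝜆, called alpha there, are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 2] - 2 * X[:, 7] + 0.1 * rng.standard_normal(200)  # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 2))  # most coefficients should be exactly zero
```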

3.6 Class Imbalance

Sometimes it’s possible that we are given an unbalanced dataset, which


means that the labels 𝑦1 , . . . , 𝑦𝑛 ∈ {−1 , +1} of the feature vectors are
not properly divided in two sets of equal size representing +1 and −1.
More formally, given 𝑃 ..= {𝑦 𝑖 | 𝑦 𝑖 = +1} and 𝑀 ..= {𝑦 𝑖 | 𝑦 𝑖 = −1} we
have either |𝑃|  |𝑀 | or |𝑃|  |𝑀 | . We will refer to the set with more
elements as the majority class and to the set with fewer elements to the
minority class.

Example 3.6.1 Some examples of applications that suffer from class imbalance:
I Recommender systems, which suggest items of interest to the
user (e. g. advertisements, movies, books, ...). In this context the
data about what the user likes is much scarcer than the data about what
the user doesn't like.
I Fraud detection: the number of fraudulent transactions is (usually)
much smaller than the number of normal transactions.
I Medical applications: the number of people with a disease is
much smaller than the number of people without.

There are a few issues with class imbalance. If we use the fraction of
correctly labeled elements (accuracy) as our metric to test the performance,
even if our classifier doesn’t work well, we will label most of the elements
correctly since the number of labels in the minority class will contribute
almost nothing to the error. Also, during training, the minority class may
be ignored for optimization since it will contribute little to the empirical
risk. Thus we will have to find a better way both to train and to test our
classifier.

Naive Classifier One easy solution could be to subsample the data of


the majority class (e. g. by removing them uniformly at random) until it
contains the same number of samples of the minority class. The advantage
of this approach is that we get a smaller dataset. However, we throw
away a lot of useful information. The opposite solution is to upsample
the minority class by duplicating data (possibly with some random
perturbation) until we obtain the same number of data as in the majority

class. The advantage is that we make use of all data but the issue is that
adding perturbation might give us inaccurate data and the dataset will
be much larger and thus slower to train. Those naive solutions are not
optimal but deal with both the training and testing problem.

Cost-Sensitive Classifier Another solution could be giving more importance
to the minority class during training. In order to do this, we
will define a slightly modified loss function that can be applied to any of
the loss functions that we have seen.

Definition 3.6.1 (Cost-Sensitive Loss) Given a loss function ℓ★(w; x, 𝑦)
and a scalar 𝑐_𝑦 ∈ ℝ>0 that depends on the label 𝑦, we define its
cost-sensitive loss as:

\ell_\star^{(CS)}(w; x, y) := c_y \, \ell_\star(w; x, y)      (3.21)

Using the cost-sensitive loss we can redefine the empirical risk as:

\hat{R}(w; c_+, c_-) = \frac{1}{n}\sum_{i:\, y_i = +1} c_+ \,\ell_\star(w; x_i, y_i) + \frac{1}{n}\sum_{i:\, y_i = -1} c_- \,\ell_\star(w; x_i, y_i)      (3.22)

where 𝑐+ is the cost that we put on the data with positive labels and 𝑐−
the cost on the data with negative labels. Note that this empirical risk has
the following property:

\forall \alpha > 0: \ \alpha\,\hat{R}(w; c_+, c_-) = \hat{R}(w; \alpha c_+, \alpha c_-)      (3.23)

Thus if we let 𝛼 := 1/𝑐− we can rewrite the empirical risk as:

\hat{R}\big(w; \tfrac{c_+}{c_-}, 1\big) =: \hat{R}(w; c)      (3.24)

which removes the redundancy of using two different costs, so we
can use the single value 𝑐 := 𝑐+/𝑐− as a weighting factor. Then if 𝑐 > 1 we give
more importance to the class where 𝑦ᵢ = +1, and if 𝑐 < 1 we give
more importance to the class where 𝑦ᵢ = −1.

Threshold Classifier Instead of using a cost-sensitive classifier we can
also use another option, which is to train a standard classifier and change
the classification threshold used to compute the predicted value, for some value
of 𝜏 ∈ ℝ:

\hat{y} = \text{sign}(w^T x + \tau)      (3.25)

This method moves the decision boundary of the classifier and, if moved in the
right direction, it might correctly label more data in the minority class.

Testing Unbalanced Classifiers The cost-sensitive loss and the thresh-


old classifier are different options that we can take to solve the problem
of training an unbalanced classifier (i. e. to find w), however we still have
to find a way to properly test it and to choose the values of 𝑐 for the
cost-sensitive and 𝜏 for the threshold classifier.

Given a test set, i. e. x1, . . . , x𝑛 with labels 𝑦1, . . . , 𝑦𝑛, and the trained
classifier w, let 𝑦̂ᵢ = sign(wᵀxi) (or 𝑦̂ᵢ = sign(wᵀxi + 𝜏) if using the
threshold classifier) be the predicted value for feature vector xi. Then the usual
way to check the accuracy is to count the fraction of correctly classified
elements:

\text{accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[\hat{y}_i = y_i]      (3.26)

Where the accuracy should be as close as possible to 1. However, as we


have discussed before, this method doesn’t work well because we give
the same importance to errors in the minority and the majority classes.
To change our definition of accuracy we have to define the notion of
true/false positive/negative.

                      Positive Label                        Negative Label
Positive Prediction   TP := Σᵢ₌₁ⁿ 1[𝑦̂ᵢ = +1 ∧ 𝑦ᵢ = +1]      FP := Σᵢ₌₁ⁿ 1[𝑦̂ᵢ = +1 ∧ 𝑦ᵢ = −1]
Negative Prediction   FN := Σᵢ₌₁ⁿ 1[𝑦̂ᵢ = −1 ∧ 𝑦ᵢ = +1]      TN := Σᵢ₌₁ⁿ 1[𝑦̂ᵢ = −1 ∧ 𝑦ᵢ = −1]

Where the number of positive labels is 𝑛+ ..= 𝑇𝑃 + 𝐹𝑁 , the number of


negative labels is 𝑛− ..= 𝐹𝑃 + 𝑇𝑁 , the number of positive prediction is
𝑝+ ..= 𝑇𝑃 + 𝐹𝑃 and the number of negative prediction is 𝑝− ..= 𝐹𝑁 + 𝑇𝑁
where 𝑛+ + 𝑛− = 𝑝 + + 𝑝 − = 𝑛 . Furthermore the number of correctly
classified elements is 𝑡 ..= 𝑇𝑃 + 𝑇𝑁 and the number of wrongly classified
elements is 𝑓 ..= 𝐹𝑁 + 𝐹𝑃 . Then using the previous definition we can
build different types of metric. Note that we use the convention that the
minority class will use positive labels.

I Accuracy:

  accuracy = t/n = (TP + TN) / (TP + TN + FP + FN) ∈ [0, 1]      (3.27)

  which is the same accuracy (i. e. fraction of correctly classified
  elements) that we have defined before.

I Precision:

  precision = TP/p+ = TP / (TP + FP) ∈ [0, 1]      (3.28)

  which measures the fraction of correctly predicted positives out of
  all elements that are predicted positive.

I Recall (or TPR):

  recall = TP/n+ = TP / (TP + FN) ∈ [0, 1]      (3.29)

  which measures the fraction of correctly predicted positives out of
  all elements that are labeled positive.

I FPR:

  FPR = FP/n− = FP / (TN + FP) ∈ [0, 1]      (3.30)

  which measures the false positive rate, i. e. the fraction of negatively
  labeled elements that are wrongly predicted positive.

I F1 Score:

  F1 = 2TP / (2TP + FP + FN) = 2 / (1/precision + 1/recall) ∈ [0, 1]      (3.31)

  which considers both precision and recall.
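A small helper computing these metrics from predicted and true labels (a sketch; libraries such as scikit-learn provide equivalent functions):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute TP/FP/FN/TN-based metrics for labels in {-1, +1}."""
    tp = np.sum((y_pred == +1) & (y_true == +1))
    fp = np.sum((y_pred == +1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == +1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
    }

y_true = np.array([+1, +1, +1, -1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([+1, -1, +1, -1, -1, +1, -1, -1, -1, -1])
print(binary_metrics(y_true, y_pred))
```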

Then, as always, we can use cross-validation to pick different models


w and values for 𝑐 if using the cost-sensitive classifier or 𝜏 if using the
threshold classifier. However, instead of using accuracy, we can use the
newly defined metrics like the F1 score to test the performance of the
unbalanced model.
There are many ways to use the different metrics, one of them is to plot
for different algorithms on the x,y-axis (recall , precision) in what is called
the precision recall curve, then the goal is to get as close as possible to (1 , 1).
Another way is to plot on the x,y-axis (FPR , TPR) in what is called the
ROC Curve, where the goal is to get as close as possible to (0 , 1).

Theorem 3.6.1 (PR vs ROC) A model w1 dominatesa a model w2 in terms


of the ROC Curve iff w1 dominates model w2 in terms of the PR curve.
a Dominate means a curve that is always above the other.

Note that if the area under either the ROC or PR curve (AUC) is less than
1/2, then there is something wrong; probably the labels are swapped.

3.7 Multi-class Classification

In the previous sections, we considered binary classification, i. e. the pres-


ence of two classes i. e. 𝑦 𝑖 ∈ {−1 , +1}. However, in some cases, we might
want to consider more than two classes. In such scenarios, the algorithms
we studied for the binary classification task are no longer applicable, at
least directly. This problem is called multi-class classification.

Goal
Given a set of feature vectors x1, . . . , xn where xi ∈ ℝᵈ (which can be
represented as a matrix X ∈ ℝⁿˣᵈ), and a set of labels 𝑦1, . . . , 𝑦𝑛 where
𝑦ᵢ ∈ 𝐶 = {1, 2, . . . , |𝐶|} (which can be represented as a vector y ∈ 𝐶ⁿ).

Output a classifier 𝑓 such that

𝑓(xi) = 𝑦ᵢ,  ∀𝑖 ∈ {1, . . . , 𝑛}      (3.32)

One-Vs-All One easy way to do multi-class classification without build-


ing a new classifier from scratch is to transform the labels of the data
such that we can reuse the well known binary classifier. To do so we will
pick each class individually and compare it to all others. This method is
called one-vs-all multi-class classification. More formally let 𝑐 ∈ 𝐶 be the

current class; then we redefine the labels as:

\forall i: \ \tilde{y}_i^{(c)} = \begin{cases} +1 & \text{if } y_i = c \\ -1 & \text{otherwise} \end{cases}      (3.33)

and train |𝐶| different binary classifiers, where each one of them is
responsible for detecting a single class in 𝐶. The problem is that a
feature vector x may be detected as part of more than one class if we use
the standard prediction method 𝑦̂^{(𝑐)} = sign((w^{(𝑐)})ᵀx) (for class 𝑐 ∈ 𝐶). To
solve this problem we introduce the notion of confidence.

Definition 3.7.1 (Confidence) Given a trained model w(𝑐) for a class 𝑐 ∈ 𝐶


and a feature vector x, then the confidence of this class is defined as:

𝑓 (𝑐) (x) ..= (w(𝑐) )𝑇 x (3.34)

Geometrically, the confidence is higher if the feature vector x is further
away from the decision boundary defined by w^{(𝑐)}. Then, instead of using
the sign function to check whether the feature vector is on one side of the
decision boundary, we use the confidence as a number that tells us how
far from the decision boundary x is, which gives us a metric to
compare the classifiers of different classes. Thus our new prediction is
given by the classifier with the highest confidence:

\hat{y} = \arg\max_{c \in C} f^{(c)}(x) = \arg\max_{c \in C} (w^{(c)})^T x      (3.35)

We have to be careful when comparing the different w^{(𝑐)}: the sign function is
invariant to scale, i. e. ∀𝛼 > 0: sign(𝛼wᵀx) = sign(wᵀx), however the
same doesn't hold for the confidence. To solve this problem we have to
either normalize the weight vectors, w ← w/‖w‖₂, or use regularization to
force all the weight vectors to be small and have approximately the same
magnitude ‖w‖₂. Another problem is that by isolating a single class and
comparing it to all others we obtain an unbalanced classification problem,
and thus we will have to use cost-sensitive or threshold classifiers.
Lastly, if one class is not linearly separable from all others this method
will fail.

One-Vs-One If some classes are not linearly separable, one-vs-all will


fail to separate them appropriately. Instead it makes more sense to
compare only two classes at the time and possibly use a non-linear
classifier between them. This method is called one-vs-one multi-class
classification. Let 𝑐 1 , 𝑐 2 ∈ 𝐶 be the classes in comparison, then we will
redefine:



 +1 if 𝑦 𝑖 = 𝑐 1
(𝑐1 ,𝑐 2 )


∀𝑖 : 𝑦˜ 𝑖 = −1 if 𝑦 𝑖 = 𝑐 2 (3.36)

otherwise ignore sample 𝑖 if 𝑦 𝑖 ∉ {𝑐 1 , 𝑐 2 }



|𝐶|(|𝐶|−1)
and then train 2 binary classifiers (one for each pair). Here we
don’t need the notion of confidence, but instead the class with the highest
number of positive prediction wins. The methods has the disadvantage
3 Classification 26

that it needs to train more classifiers, however it doesn’t suffer from class
imbalance and can handle non-linearly separable data.

Multi-class SVM In both previous methods, we had to train more than


one classifier, however, by taking inspiration from the confidence in the
one-vs-all method, we can modify the loss function of the SVM algorithm
to handle multi-class classification.

Definition 3.7.2 (Multi-class Hinge Loss) Given a feature vector x, its


label 𝑦 and weight vectors w(1) , . . . , w(|𝐶|) we can define the multi-class
hinge loss function is defined as:
   
(1) (|𝐶 |) (𝑐) 𝑇 (𝑦) 𝑇
ℓ 𝑀𝐶−𝐻 w , . . . , w ; x , 𝑦 = max 0 , 1 + max (w ) x − (w ) x
𝑐∈𝐶\{𝑦}
(3.37)

The key idea is that as in one-vs-all we keep |𝐶| weight vectors, then if
we evaluate the confidence on the correct class 𝑦 it must be higher than
the confidence on all other classes by at least a margin (e. g. 1), i. e.:

∀𝑐 ∈ 𝐶 \ {𝑦} : (w(𝑐) )𝑇 x > (w(𝑦) )𝑇 x + 1 (3.38)


(𝑐) 𝑇 (𝑦) 𝑇
⇐⇒ ∀𝑐 ∈ 𝐶 \ {𝑦} : (w ) x − (w ) x > 1 (3.39)
(𝑐) 𝑇 (𝑦) 𝑇
⇐⇒ max (w ) x − (w ) x > 1 (3.40)
𝑐∈𝐶\{𝑦}

Thus if the condition of Equation 3.40 ★ is satisfied the gradient of the


loss function simplifies by a lot.

 


 0 if (★) ∨ (𝑐 ≠ 𝑦 ∧ 𝑐 ≠ arg max 𝑗∈𝐶 )(w(𝑗) )𝑇 x


∇w(𝑐) ℓ 𝑀𝐶−𝐻 w(1:|𝐶|) ; x , 𝑦 = −𝑥 if ¬(★) ∧ 𝑐 = 𝑦

 +𝑥 otherwise


(3.41)

Confusion Matrices We have seen that for unbalanced datasets we can


use true/false positive/negative rates to select a better testing metric.
When we are dealing with multi-class datasets we can create a square
matrix that compares each predicted label with each true label and
counts for each pair the number of predictions. Then the correct number
of predicted labels will be counted on the diagonal of the matrix and
everything that is not on the diagonal will be wrongly predicted labels.
Kernels 4
4.1 Feature Explosion Problem . 27
4.1 Feature Explosion Problem 4.2 Polynomial Kernels . . . . . 29
4.3 Kernelized Perceptron . . . . 30
Both in linear regression and in classification we can fit non linear (e. g. 4.4 Kernel Properties . . . . . . . . 31
polynomial) functions to our data by mapping feature vectors x1 , . . . , xn 4.5 Infinite Dimensional Kernels 32
to some non linear function 𝜙(xi ) = x̃i . This transformation returns a new 4.6 Kernelized SVM . . . . . . . 34
set of data x˜1 , . . . , x˜n which can be optimized with a standard squared
4.7 Kernelized Linear Regression 34

loss. Notice that 𝜙 : ℝ 𝑑 → ℝ 𝑑 and since most of the times we wish


0

to use more complicated features than the initial ones, we often have
𝑑0 > 𝑑 .

Example 4.1.1 In a real world scenario we might have 𝑑 = 10000,


i. e. 10000 features or 10000 dimensional vectors, which cannot be
fitted with a linear function. We decide to use a polynomial of degree
𝑚 = 2, which is the smallest possible polynomial (besides linear)
that we can use. Thus, we have to evaluate our new feature vectors
𝜙(xi ) = x̃i = [𝑥 2𝑖1 , . . . , 𝑥 2𝑖 𝑑 , 𝑥 𝑖1 · 𝑥 𝑖2 , . . . , 𝑥 𝑖 𝑑−1 · 𝑥 𝑖 𝑑 ]𝑇 , notice that x̃i ∈ ℝ 𝑑
0

where 𝑑 = 100000000.
0

From the previous example we observe that in facts, even with a small
degree polynomial, if we have a lot of features, we might have feature
explosion from 𝑑 to 𝑑 𝑘 features, and thus often 𝑑0  𝑑 which is very
computationally inefficient. The use of kernel functions will help us to solve
this problem. In their essence, kernels allow us to exploit the benefits
brought by a larger amount of features without paying for their overhead.
In order to understand the core concepts of kernel methods, we introduce
the following lemma.

Lemma 4.1.1 (Linear Optimum) Given some labels 𝑦 𝑖 and some feature
vectors xi , we can always find some scalars 𝛼 𝑖 ∈ ℝ for 𝑖 ∈ {1 , . . . , 𝑛} such
that we can represent the optimum ŵ as a linear combination:
𝑛
X
ŵ = 𝛼 𝑖 𝑦 𝑖 xi (4.1)
𝑖=1

Proof (Handwavy). We will give a handwavy proof for the specific cases
of the perceptron and SVM algorithms. Recall that we can obtain the
optimum ŵ with stochastic gradient descent in the following way:

w𝑡+1 ← w𝑡 + 𝜂𝑡 𝑦𝑡 xt , [𝑦𝑡 wt 𝑇 xt < 0] Perceptron (4.2)


𝑇
w𝑡+1 ← w𝑡 (1 − 2𝜆𝜂𝑡 ) + 𝜂𝑡 𝑦𝑡 xt , [𝑦𝑡 wt xt < 1] SVM (4.3)

Consider the specific case of the SGD for the perceptron, after some time
4 Kernels 28

𝑇 we will have

ŵ = w𝑇+1 (4.4)
= w𝑇 + 𝜂𝑇 𝑦𝑇 xT SGD (4.5)
=. w𝑇−1 + (𝜂𝑇−1 𝑦𝑇−1 xT−1 ) + (𝜂𝑇 𝑦𝑇 xT ) SGD unroll twice (4.6)
..
= w0 + (𝜂1 𝑦1 x1 ) + · · · + (𝜂𝑇 𝑦𝑇 xT ) SGD unroll 𝑇 times (4.7)
=0
X𝑛
= 𝛼 𝑖 𝑦 𝑖 xi Group same 𝑦 𝑖 xi (4.8)
𝑖=1

Where 𝛼 𝑖 will be the sum of the learning rates 𝜂 from the same terms
𝑦 𝑖 xi . The proof of the linear optimum for SVM is analogous.

We will now see how Lemma 4.1.1 will help us to solve the problem of
feature explosion. The basic idea is that instead of optimizing for the
best ŵ ∈ ℝ 𝑑 we want to find a way to optimize for 𝜶ˆ ∈ ℝ 𝑛 . If we have
0

feature explosion clearly 𝑛  𝑑0 and thus the problem will be much less
computationally expensive.

Perceptron Reformulation By using Lemma 4.1.1 we can redefine the


optimum of the perceptron method by parametrizing it by 𝜶 and obtain-
ing a dual optimization problem.
𝑛
1X
min max{0 , −𝑦 𝑖 w𝑇 x̃i } (4.9)
w 𝑛 𝑖=1
!𝑇
𝑛 𝑛
 
1X


 X 


= min max 0 , −𝑦 𝑖 𝛼 𝑗 𝑦 𝑗 x̃j x̃i Lemma 4.1.1 (4.10)
𝜶 𝑛
𝑖=1 𝑗=1

 


( )
𝑛 𝑛
1X
𝛼 𝑗 𝑦 𝑖 𝑦 𝑗 (x̃i 𝑇 x̃j )
X
= min max 0 , − (4.11)
𝜶 𝑛 𝑖=1 𝑗=1
( )
𝑛 𝑛
1X
𝛼 𝑗 𝑦 𝑖 𝑦 𝑗 (𝜙(xi )𝑇 𝜙(xj ))
X
= min max 0 , − (4.12)
𝜶 𝑛
𝑖=1 𝑗=1

Definition 4.1.1 (Kernel Function) Let xi , x 𝑗 ∈ ℝ 𝑑 be two vectors and


𝜙 : ℝ 𝑑 → ℝ 𝑑 be map, then the kernel function k : ℝ 𝑑 × ℝ 𝑑 → ℝ is
0

defined as:

k(xi , xj ) ..= 𝜙(xi )𝑇 𝜙(xj ) (4.13)

Then using the previous reformulation and the notion of kernel function
we can rewrite the dual optimization problem as:
( )
𝑛 𝑛
1X X
𝜶ˆ = arg min max 0 , − 𝛼 𝑗 𝑦 𝑖 𝑦 𝑗 k(x, x0) (4.14)
𝜶 𝑛 𝑖=1 𝑗=1

Now, even if we have reduced the problem to finding 𝜶ˆ ∈ ℝ 𝑛 instead


of ŵ ∈ ℝ 𝑑 , there is still the problem that computing the dot product
0

𝜙(x)𝑇 𝜙(x0) of the kernel function is very expensive (since 𝜙(x) is also
of dimension 𝑑0). The most important part of this section is to realize
4 Kernels 29

that by using some clever tricks we can compute the kernel function
k(xi , xj ) without computing the dot product of dimension 𝑑0 and neither
the function 𝜙(x).

4.2 Polynomial Kernels


Homogeneous Polynomial Kernel

Lemma 4.2.1 (Homogeneous Polynomial Kernel) Let xi , x 𝑗 ∈ ℝ 𝑑 . The


homogeneous polynomial kernel of degree 𝑚 , that corresponds to the
feature space spanned by all products of exactlya 𝑚 attributes, can be easily
evaluated and is:

k𝑚 (xi , xj ) = (xi 𝑇 xj )𝑚 (4.15)


a The dimensionality of this feature space is 𝑑0 ..= 𝑑 𝑚 .

Lemma 4.2.1 tells us that we never have to actually evaluate the dot
product 𝜙(xi )𝑇 𝜙(xj ), and thus the computation can be done much more
efficiently.

Example 4.2.1 (Homogeneous Quadratic Kernel (𝑚 = 𝑑 = 2))


Let 𝜙 : ℝ2 → ℝ3 be defined as

𝜙(x) = 𝜙([𝑥1 , 𝑥2 ]) ..= [𝑥12 , 𝑥22 , 2 𝑥1 𝑥2 ]𝑇 = x̃ (4.16)

which maps a 𝑑0 = 2 dimensional feature vector to a 𝑑0 = 3 dimensional


feature vector. Then, given 2 feature vectors xi , xj we can compute the
kernel function in 2 ways:

k2 (xi , xj ) ..= 𝜙(xi )𝑇 𝜙(xj ) (4.17)


= 𝑥 2𝑖,1 𝑥 2𝑗,1 + 𝑥 2𝑖,2 𝑥 2𝑗,2 + 2 𝑥 𝑖,1 𝑥 𝑖,2 𝑥 𝑗,1 𝑥 𝑗,2 (4.18)
= (𝑥 𝑖,1 𝑥 𝑗,1 + 𝑥 𝑖,2 𝑥 𝑗,2 )2 (4.19)
𝑇 2
= (xi xj ) (4.20)

In Equation 4.18 if we compute the function and the dot product in a


standard way we have to do 2 additions and 3+3+4=10 multiplications,
where in Equation 4.19, which is the one that can be directly derived
with Lemma 4.2.1, only 1 addition and 3 multiplications.

Inhomogeneous Polynomial Kernel

Lemma 4.2.2 (Inhomogeneous Polynomial Kernel) Let xi , x 𝑗 ∈ ℝ 𝑑 . The


inhomogeneous polynomial kernel of degree 𝑚 , that corresponds to the
feature space spanned by all products of at mosta 𝑚 attributes, can be easily
evaluated and is:

k𝑚 (xi , xj ) = (𝑐 + xi 𝑇 xj )𝑚 (4.21)
𝑑+𝑚 
a The dimensionality of this feature space is 𝑑0 ..= 𝑚 = O(𝑑 𝑚 ).
4 Kernels 30

Proof. Let n = [𝑛0 , . . . , 𝑛 𝑑 ]𝑇 where | n | = 𝑛1 + · · · + 𝑛 𝑑 , then:

k𝑚 (x , y) = (𝑐 + x𝑇 y)𝑚 (4.22)
𝑚
= (𝑐 + 𝑥1 𝑦1 + · · · + 𝑥 𝑑 𝑦 𝑑 ) (4.23)
X 𝑚
 
= 𝑐 𝑛0 (𝑥 1 𝑦1 )𝑛1 (𝑥2 𝑦2 )𝑛2 · · · (𝑥 𝑑 𝑦 𝑑 )𝑛 𝑑 (4.24)
| n |=𝑚
n
s  s 
𝑑 𝑑
X © 𝑚 𝑛0 Y 𝑛 ª© 𝑚 𝑛0 Y 𝑛 ª
= ­ 𝑐 𝑥𝑘 𝑘 ® ­ 𝑐 𝑦𝑘 𝑘 ® (4.25)
| n |=𝑚
n 𝑘=1
n 𝑘=1
« ¬« ¬
= 𝜙(x)𝑇 𝜙(y) (4.26)

Example 4.2.2 (Inhomogeneous Quadratic Kernel (𝑚 = 𝑑 = 2))


Let 𝜙 : ℝ2 → ℝ6 be defined as:
√ √ √
𝜙(x) = 𝜙([𝑥 1 , 𝑥2 ]) ..= [1 , 2 𝑥1 , 2𝑥 2 , 𝑥12 , 𝑥22 , 2 𝑥1 𝑥2 ]𝑇 = x̃ (4.27)

which maps a 𝑑0 = 2 dimensional feature vector to a 𝑑0 = 6 dimensional


feature vector . Then given 2 feature vectors xi , xj we can compute the
kernel function in 2 ways:

k2 (xi , xj ) ..= 𝜙(xi )𝑇 𝜙(xj ) (4.28)


= 1 + 2 𝑥 𝑖,1 𝑥 𝑗,1 + 2 𝑥 𝑖,2 𝑥 𝑗,2 + 2 𝑥 2𝑖,1 𝑥 2𝑗,1 + (4.29)
𝑥 2𝑖,2 𝑥 2𝑗,2 + 2𝑥 𝑖,1 𝑥 𝑖,2 𝑥 𝑗,1 𝑥 𝑗,2 (4.30)
= (1 + 𝑥 𝑖,1 𝑥 𝑗,1 + 𝑥 𝑖,2 𝑥 𝑗,2 )2 (4.31)
𝑇 2
= (1 + xi xj ) (4.32)

Here, again, we see that the number of operations if we use Lemma


4.2.2 directly is reduced dramatically.

We have reduced a dot product between two vectors 𝜙(x) of size 𝑑0 (order
O((𝑑0)𝑚 )) to one single dot product of two vectors of size 𝑑 (order O(𝑑 𝑚 )).
Also remember that we never have to compute 𝜙(x) in any way, it’s
implicitly computed by the kernel. Complicated functions like Equation
4.27 must not be derived manually and the computational complexity
between homogeneous and inhomogeneous kernels is the same.

4.3 Kernelized Perceptron

We can use the dual optimization problem and the kernel trick to solve
efficiently the perceptron algorithm training phase.

1 𝜶0 ← 0 Algorithm 4.1: Kernelized Perceptron


2 foreach 𝑡 = 1 , 2 , ..., 𝑇
3 sample (xi , 𝑦 𝑖 ) ∈u. a. r. 𝐷𝑡𝑟 𝑎𝑖𝑛 B Sample uniformly at random with replacement.
if 𝑦 𝑖 𝑛𝑗=1 𝛼 𝑡,𝑗 𝑦 𝑗 k(xi , xj ) ≥ 0
P
4 B Predict using the dual.
5 𝜶 𝑡+1 ← 𝜶 𝑡 B Correct prediction, no update.
6 else
4 Kernels 31

7 𝜶 𝑡+1 ← 𝜶 𝑡
8 𝛼 𝑡+1,𝑖 ← 𝛼 𝑡+1,𝑖 + 𝜂𝑡 B Wrong prediction, update.
9 end
10 return 𝜶𝑇+1

Then if we are given a new point x to predict using the trained preceptron
we just check the sign.
!
𝑛
X
𝑦ˆ = sign 𝛼 𝑗 𝑦 𝑗 k(x, xj ) (4.33)
𝑗=1

4.4 Kernel Properties

Instead of computing the kernel for each combination of inputs x1 , . . . , xn


as a function, we can store all the kernels in a kernel matrix.

Definition 4.4.1 (Kernel Matrix) Let k : ℝ 𝑑 × ℝ 𝑑 → ℝ be a kernel


function then the associated kernel matrix K is defined as:

𝐾 𝑖,𝑗 ..= k(xi , xj ) (4.34)

The advantage of using a kernel matrix in our model is that once we have
computed K we don’t have to store our data x1 , . . . , xn anymore since it’s
implicitly contained in K. The kernel has the following properties:
I Symmetric:

Proof.

𝐾 𝑖,𝑗 ..= k(xi , xj ) (4.35)


𝑇
= 𝜙(xi ) 𝜙(xj ) (4.36)
𝑇
= 𝜙(xj ) 𝜙(xi ) (4.37)
= k(xj , xi ) = 𝐾 𝑗,𝑖 (4.38)

I Positve Semi-definite:

Proof.
𝑛 X
𝑛
a𝑇 Ka =
X
𝑎 𝑖 𝑎 𝑗 k(xi , xj ) (4.39)
𝑖=1 𝑗=1
𝑛 X
𝑛
𝑎 𝑖 𝑎 𝑗 𝜙(xi )𝑇 𝜙(xj )
X
= (4.40)
𝑖=1 𝑗=1
!𝑇 !
𝑛
X 𝑛
X
= 𝑎 𝑖 𝜙(xi ) 𝑎 𝑗 𝜙(xj ) (4.41)
𝑖=1 𝑗=1
2
𝑛
X
= 𝑎 𝑖 𝜙(xi ) ≥0 (4.42)
𝑖=1
4 Kernels 32

I Composition rules:
Given kernel functions k𝑖 : 𝑋 × 𝑋 → ℝ defined on some data space
𝑋 , then all of the following are valid kernels:
• k(𝑥, 𝑥 0) = k1 (𝑥, 𝑥 0) + k2 (𝑥, 𝑥 0)
• k(𝑥, 𝑥 0) = k1 (𝑥, 𝑥 0) k2 (𝑥, 𝑥 0)
• k(𝑥, 𝑥 0) = 𝑐 k1 (𝑥, 𝑥 0)
• k(𝑥, 𝑥 0) = 𝑓 (k1 (𝑥, 𝑥 0))
• k(𝑧, 𝑧 0) = k1 (𝑉(𝑧), 𝑉(𝑧 0))
• k(𝑥, 𝑥 0) = 𝑑𝑖=1 k𝑖 (𝑥 𝑖 , 𝑥 0𝑖 ) for 𝑥 ∈ ℝ 𝑑 ANOVA Kernel.
P

Where 𝑐 ∈ ℝ, 𝑓 is a polynomial with positive coefficients or the


exponential function, and 𝑉 : 𝑍 → 𝑋 is some function on the data
space.

Lemma 4.4.1 (Feature Map Construction) Given a finite data space 𝑋 =


{1, . . . , 𝑛} i. e. 𝑛 < ∞ and a symmetric positive semi-definite matrix K ∈
ℝ 𝑛×𝑛 , then we can always construct a feature map 𝜙 : 𝑋 → ℝ 𝑛 such that
𝐾 𝑖,𝑗 = 𝜙(𝑖)𝑇 𝜙(𝑗).

Proof. Since K is a symmetric positive semi-definite matrix it has real


and non negative eigenvalues, and can be decomposed with eigen
decomposition as follows:

K = UΛU𝑇 (4.43)

Where U = [u1 · · · un ] is the orthonormal matrix of eigenvectors and Λ =


diag([𝜆1 , . . . , 𝜆𝑛 ]) is the diagonal matrix of eigenvalues, both arranged
in non-increasing order of eigenvalues 𝜆1 ≥ · · · ≥ 𝜆𝑛 ≥ 0. Then we can
define Λ = Λ 2 Λ 2 𝑇 and we get:
1 1

K = UΛ 2 Λ 2 𝑇 U𝑇
1 1
(4.44)
|{z} | {z }
..=Φ𝑇 =Φ

Where Φ𝑖 = 𝜙(𝑖), and thus it holds that 𝐾 𝑖,𝑗 = Φ𝑇𝑖 Φ 𝑗 = 𝜙(𝑖)𝑇 𝜙(𝑗).

Theorem 4.4.2 (Mercer’s) Let 𝑋 be a compact subset of ℝ 𝑑 , and 𝑘 :


𝑋 × 𝑋 → ℝ a kernel function, then 𝑘 can be expanded into a uniformely
convergent series of bounded functions 𝜙 𝑖 such that:

X
k(𝑥, 𝑥 0) = 𝜆 𝑖 𝜙 𝑖 (𝑥)𝜙 𝑖 (𝑥 0) (4.45)
𝑖=1

4.5 Infinite Dimensional Kernels

We can define other types of kernel other than polynomial that have an
infinite feature space, often such kernels are referred to as non-parametric
kernels.

Definition 4.5.1 (Gaussian Kernel) Given feature vectors xi , xj ∈ ℝ 𝑑 and


4 Kernels 33

a scalar bandwidth ℎ ∈ ℝ, the Gaussian kernel is defined as:


2!
− xi − xj 2
k(xi , xj ) = exp (4.46)
ℎ2

The Gaussian kernel is useful since it obtains a value close to 1 the closer
xi is to xj , and the value approaches 0 as they are farther away. In other
words, we can measure the similarity between two points xi and xj using
the Gaussian kernel.
With this information, we can construct a 𝑘 nearest neighbor classifier
which doesn’t need any training and only uses the provided data to
classify a new point.

1 input X = [x1 · · · xn ], y = [𝑦1 , . . . , 𝑦𝑛 ]𝑇 , 𝑘 ∈ ℕ , x ∈ ℝ 𝑑 Algorithm 4.2: KNN


2 output 𝑦ˆ ∈ ℝ !
X
3 return 𝑦ˆ = sign k(xi , x)𝑦 𝑖
𝑖∈N 𝑘 (𝑥)

where N 𝑘 (𝑥) is the set with the 𝑘 closest neighbor of 𝑥 and k is a Gaussian
or some other similarity measuring kernel. The downside compared to
the kernelized perceptron is that this algorithm uses all of the training
data for each new point and thus it’s very inefficient. Also, the prediction
cannot capture global trends but is only depends on close points.

Definition 4.5.2 (Laplacian Kernel) Given feature vectors xi , xj ∈ ℝ 𝑑 and


a scalar bandwidth ℎ ∈ ℝ, the Laplacian kernel is defined as:
!
− xi − xj 1
k(xi , xj ) = exp (4.47)

This Kernel is similar to the Gaussian kernel but it uses exponential decay
instead of smooth decay.
4 Kernels 34

4.6 Kernelized SVM

By using Lemma 4.1.1 and the kernel trick we can kernelize the SVM
algorithm by finding the dual with 𝜶 .
𝑛
1X
min max{0 , 1 − 𝑦 𝑖 w𝑇 x̃i } + 𝜆 k w k 22 (4.48)
w 𝑛 𝑖=1
!𝑇
𝑛 𝑛 𝑛
 
1X


 X 

 X
= min max 0 , 1 − 𝑦 𝑖 𝛼 𝑗 𝑦 𝑗 x̃j x̃i + 𝜆 𝛼 𝑗 𝑦 𝑗 x̃j Lemma 4.1.1
𝜶 𝑛
𝑖=1 𝑗=1 𝑗=1

 

 
(4.49)
( )
𝑛 𝑛 𝑛 X 𝑛
1X
𝛼 𝑗 𝑦 𝑖 𝑦 𝑗 (x̃i 𝑇 x̃j ) + 𝜆 𝛼 𝑖 𝛼 𝑗 𝑦 𝑖 𝑦 𝑗 x̃i 𝑇 x̃j
X X
= min max 0 , 1 −
𝜶 𝑛 𝑖=1 𝑗=1 𝑖=1 𝑗=1
(4.50)
( )
𝑛 𝑛 𝑛 X 𝑛
1X X X
= min max 0 , 1 − 𝛼 𝑗 𝑦 𝑖 𝑦 𝑗 k(xi , xj ) + 𝜆 𝛼 𝑖 𝛼 𝑗 𝑦 𝑖 𝑦 𝑗 k(x𝑖 , x 𝑗 )
𝜶 𝑛 𝑖=1 𝑗=1 𝑖=1 𝑗=1
(4.51)

To write the objective in a more compact way we define 𝜶 ..= [𝛼 1 , . . . , 𝛼 𝑛 ]𝑇 ,


D 𝑦 ..= diag(𝑦1 , . . . , 𝑦𝑛 ), k𝑖 = [𝑦1 k(xi , x1 ), . . . , 𝑦𝑛 k(xi , xn )]𝑇 and K is the
kernel matrix. Then the more compact kernelized SVM objective dual
is:
Learning
𝑛
1X
𝜶ˆ = arg min max 0 , 1 − 𝑦 𝑖 𝜶𝑇 k𝑖 + 𝜆𝜶𝑇 D 𝑦 KD 𝑦 𝜶

(4.52)
𝜶 𝑛 𝑖=1

Prediction
!
𝑛
X
𝑦ˆ = 𝑓 (x) = sign 𝛼 𝑗 𝑦 𝑗 k(x , xj ) (4.53)
𝑗=1

4.7 Kernelized Linear Regression

By using Lemma 4.1.1 and the kernel trick we can kernelize the linear
regression algorithm by finding the dual with 𝜶 .
𝑛
1X
min (w𝑇 x̃𝑖 − 𝑦 𝑖 )2 + 𝜆 k w k 22 (4.54)
w 𝑛 𝑖=1
! !2 2
𝑛 𝑛 𝑛
1X X X
= min 𝛼 𝑗 𝑦 𝑗 x̃j x̃𝑖 − 𝑦 𝑖 +𝜆 𝛼 𝑗 𝑦 𝑗 x̃j Lemma 4.1.1
𝜶 𝑛
𝑖=1 𝑗=1 𝑗=1 2
(4.55)
𝑛 X
𝑛 𝑛 X
𝑛
1
𝛼 𝑗 (x̃𝑇𝑗 x̃𝑖 ) + 𝜆 𝛼 𝑖 𝛼 𝑗 𝑦 𝑖 𝑦 𝑗 x̃i 𝑇 x̃j
X X
= min (4.56)
𝜶 𝑛 𝑖=1 𝑗=1 𝑖=1 𝑗=1
𝑛 X 𝑛 𝑛 X 𝑛
1X X
= min 𝛼 𝑗 k(x𝑖 , x 𝑗 ) + 𝜆 𝛼 𝑖 𝛼 𝑗 𝑦 𝑖 𝑦 𝑗 k(x𝑖 , x 𝑗 ) (4.57)
𝜶 𝑛 𝑖=1 𝑗=1 𝑖=1 𝑗=1
4 Kernels 35

Then the more compact kernelized linear regression objective dual is:
Learning

1
𝜶ˆ = arg min k𝜶K − y k 22 + 𝜆𝜶𝑇 K𝜶 (4.58)
𝜶 𝑛

We can also solve this optimization problem using the following closed
form:

𝜶 = (K + 𝑛𝜆I)−1 y (4.59)

Prediction
𝑛
X
𝑦ˆ = 𝑓 (x) = 𝛼ˆ 𝑖 k(x𝑖 , x) (4.60)
𝑖=1

Semi-parametric Regression If we have some periodic data we can


use a e. g. Gaussian kernel and with the right parameters it will fit the
periodicity correctly. However, if the data is both periodic but also follows
a linear trend is some direction using only a Gaussian kernel will not
be enough. It would be optimal if we could use parametric models that
are rigid (e. g. linear or polynomial kernels) and very good for general
high level trends, but also non-parametric models e. g. Gaussian kernels
that can handle variability and periodicity. We have seen before that
the composition of two kernels is still a valid kernel and thus we can
combine more than one single kernel to get semi-parametric regression
models.

Example 4.7.1 (Semi-parametric kernel)


2!
− xi − xj
k(x𝑖 , x 𝑗 ) = 𝑐 1 exp 2
+𝑐 2 x𝑇𝑖 x 𝑗
ℎ2
| {z } |{z}
Non-parametric kernel (Gaussian) Parametric kernel (Linear, 𝑚 = 1)
(4.61)

This kernel used on kernelized linear regression will fit both linear
and periodic data at the same time.

Kernel Tuning We have seen a lot of different kernels, in semi-parametric


regression often the choice of the kernel is not obvious. There are less ad-
vanced methods to select the kernel e. g. brute force or domain knowledge
(i. e. like in Example 4.7.1 in which we choose a linear and Gaussian kernel
knowing that our data was periodic in some linearly defined direction),
but also more advanced algorithms used for kernel learning.
Neural Networks 5
Neural networks were first idealized in the 1950s. In the 1980s many 5.1 Introduction . . . . . . . . . . 36
papers were published about them, however, at some point, SVMs were 5.2 General Neural Network . . 37
discovered and neural networks lost popularity. In the 2000s neural 5.3 Forward Propagation . . . . . 39
5.4 Objective . . . . . . . . . . . . 40
networks came back thanks to the advances in computation and the
5.5 Computational Graphs . . . . 41
increasing amount of data. Today, neural networks are everywhere as
5.6 Back-Propagation . . . . . . . 45
the main component of the field of deep learning.
5.7 Weight Initialization . . . . . 48
5.8 Optimizers . . . . . . . . . . . 50
5.9 Overfitting . . . . . . . . . . . 50
5.1 Introduction 5.10 Regularization . . . . . . . . . . 51
5.11 Convolutional Neural Networks
(CNNs) . . . . . . . . . . . . . . . 53
We have seen that when dealing with non-linearly separable data we
5.12 Other . . . . . . . . . . . . . . . 58
can transform the features with non-linear functions (e. g. polynomials)
and then apply kernels to be able to use non-linear classification or
regression. This method has some drawbacks: in facts, choosing the
right kernel can be challenging since different applications need different
kernels, and with the increasing amount of data also the computational
complexity increases. The goal of neural networks is to automatically
learn a non-linear mapping 𝜙 from a labeled dataset.

Goal
Given a set of feature vectors x1 , . . . , xn where xi ∈ ℝ 𝑑 (which can be
represented as a matrix X ∈ ℝ 𝑛×𝑑 ), and a set labels y1 , . . . , y𝑛 where
y𝑖 ∈ ℝ 𝑘 (which can be represented as a matrix Y ∈ ℝ 𝑛×𝑘 ).

Output the coefficients 𝜽 such that

𝑓 (x𝑖 ; 𝜽) = ŷ𝑖 ≈ y𝑖 , ∀𝑖 ∈ {1 , . . . , 𝑛} (5.1)

Basic Neural Network We still have to define 𝑓 , we start with the most
basic neural network possible.
𝑚
𝑤 𝑗 𝜙(x; 𝜃 𝑗 ) = 𝜙(x; 𝜽)𝑇 w
X
𝑓 (x𝑖 ; 𝜽, w) ..= (5.2)
𝑗=1

Then use this definition of 𝑓 and try to learn 𝜙 by minimizing some loss
function L:
𝑛
X
L y𝑖 ; 𝑓 (x𝑖 ; w , 𝜽)

min (5.3)
w ,𝜽 𝑖=1

Where 𝜙 will be a non-linear activation function.


5 Neural Networks 37

Activation Functions

Definition 5.1.1 (Activation Function) Let 𝜽 ∈ ℝ 𝑚 be a vector of learnable


parameters for some function 𝜙 and x ∈ ℝ 𝑚 be the input vector, then:

𝜙(x; 𝜽) ..= 𝜑(𝜽𝑇 x) (5.4)

where 𝜑 : ℝ → ℝ is called activation-function.a


a Often we only write 𝜙 and refer to it as activation function, however the actual activation

function is 𝜑 which takes a scalar as input.

Note that in the study of neural networks we often give up the notion of a
convex non-linear function for non-convex non-linear activation function,
where the optimal convergence is no longer guaranteed. We will argue
later on that this is not a big problem for training.
Let 𝑧 ..= 𝜽𝑇 x, then the most used activation functions are:
I Sigmoid

1
𝜑(𝑧) = (5.5)
1 + exp(−𝑧)
𝜑0(𝑧) = 𝜑(𝑧)(1 − 𝜑(𝑧)) (5.6)

I Tanh

exp(𝑧) − exp(−𝑧)
𝜑(𝑧) = tanh(𝑧) = (5.7)
exp(𝑧) + exp(−𝑧)
𝜑0(𝑧) = 1 − 𝜑(𝑧)2 (5.8)

I ReLU

𝜑(𝑧) = max(𝑧, 0) (5.9)




 0 if 𝑧 < 0
0


𝜑 (𝑧) = 1 if 𝑧 > 0 (5.10)

 undefined if 𝑧 = 0

5.2 General Neural Network

In the basic neural network, we used a single non-linear activation


function which can be used to approximate simple non-linear functions
𝑓 . However, in the case that fitting our dataset requires a more complex
function the basic setting will not be enough. The general idea is that
instead of giving x as input to the activation function we can nest many
activation functions by continuously giving the output of the previous
activation function as input to the next one. To understand this idea we
can use different notations, the standard mathematical notation gives
us a precise understanding of how to evaluate the different activation
functions but is often harder to have an intuitive grasp of what is going
on. The graphical notation is much more intuitive but might miss some
details that only the mathematical one provides. Since different notation
give us different understandings we will provide many of them.
5 Neural Networks 38

Mathematical View Differently form the basic setting, here we use


many nested functions. Each function has its weight matrix W and bias
vector b (in the basic settings we only had a coefficient matrix 𝜽 ). We will
have many layers of parameters 𝜽 ..= (W(1) , b(1) , . . . , W(𝐿) , b(𝐿) ), then to
compute 𝑓 :

𝑓 (x; 𝜽) = 𝑓 (𝐿) (· · · 𝑓 (2) ( 𝑓 (1) (x; 𝜽); 𝜽) · · · ; 𝜽) (5.11)

Then for each layer 𝑙 ∈ {1 , . . . , 𝐿} the function 𝑓 (𝑙) uses only it own
weights, biases, and possibly a different activation function 𝜑 (𝑙) . Then
each intermediate vector will be called a hidden layer h(𝑙) .
 
h(1) ..= 𝑓 (1) (x; W(1) , b(1) ) = 𝜑 (1) xW(1) + b(1) (5.12)
 
h(2) ..= 𝑓 (2) (h1 ; W(2) , b(2) ) = 𝜑 (2) h(1) W(2) + b(2) (5.13)
..
.  
h(𝐿) ..= 𝑓 (𝐿) (h(𝐿−1) ; W(𝐿) , b(𝐿) ) = 𝜑 (𝐿) h(𝐿−1) W(𝐿) + b(𝐿) (5.14)

Where h(0) ..= x, is the input layer and h(𝐿) ..= 𝑓 (x; 𝜽) = ŷ ≈ y, is the
output layer, however we usually use keep the notation x and ŷ to show
that each function can be seen as taking a hidden layer as input and
returning a hidden layer as output. The dimensions are 𝑾 (𝑙) ∈ ℝ 𝑑 ×𝑑 ,
(𝑙−1) (𝑙)

(𝑙)
b(𝑙) ∈ ℝ1×𝑑 , where 𝑑 (𝑙) is the number of hidden units in layer 𝑙 , i. e.
h(𝑙) ∈ ℝ 𝑑 , and we use the convention that 𝑑 (0) ..= 𝑑 and 𝑑 (𝐿) ..= 𝑘 .
(𝑙)

Note that depending on the way we build the neural network each hidden
layer (and thus weight matrix W(𝑙) and bias b(𝑙) ) can have arbitrary size,
the only constraint is in the input and output layers that are fixed
by the dimensionality of our dataset. The number of layers 𝐿 can be
chosen depending on how complex is the function 𝑓 that we want to
approximate. All of those sizes must be chosen manually and are called
hyper-parameters of the network, there is no general rule of thumb that
works for all functions 𝑓 .

Graph View We can visualize the previous mathematical view in a


network graph view to get a more intuitive understanding. Here, instead
of viewing each layer as a vector operation on the previous layer, we can
view each element of the hidden vector as a neuron. The value of the
neuron is evaluated as a weighted sum of the previous layer neurons
and the strength of each connecting edge weight. More formally, assume
that the hidden layer 𝑙 has dimension 𝑑 (𝑙) ∈ ℕ , then:
!
𝑑 (𝑙)
(𝑙) (𝑙) (𝑙−1) (𝑙)
(𝑙)
X
ℎ𝑗 =𝜑 𝑏𝑗 + ℎ 𝑖 𝑊𝑖,𝑗 (5.15)
𝑖=1

Neural networks are similar to the perceptron algorithm, instead, here,


we use a non-linear activation function, and thus each neuron returns a
real-valued output (not only -1 or 1) to learn non-linear features. Instead of
a single layer we have many layers, each layer can learn more complicated
features by taking all outputs of the previous neurons (which in reality
are just number in a vector) and combining them with a weighted sum
to which we apply a non-linear function.
5 Neural Networks 39

Example 5.2.1 (Neural Network Graph View) In this example we input


dimension 𝑑 = 2, output dimension 𝑘 = 3, then we choose to use 𝐿 = 2
hidden layers where 𝑑 (1) = 4 and 𝑑 (2) = 3. Remember that the input
(0)
layer can also be seen as the hidden layer 0, 𝑥 𝑗 = ℎ 𝑗 , and the output
(3)
layer can also be seen as the hidden layer 3, 𝑦ˆ 𝑗 = ℎ 𝑗 . Furthermore to
(𝑙−1)
simplify the evaluation for the bias we let, ℎ 0 = 1 and the weights
(𝑙) (𝑙)
𝑊0,𝑗 = 𝑏𝑗 ,
such that the first row of the weight matrix of layer 𝑙 will
contain all the biases for layer 𝑙 , and we won’t need to add the biases
separately. Then we can represent 𝑓 (x; 𝜽) using the following graph:

Input Hidden Hidden Output


layer layer 1 layer 2 layer

(1)
𝑊 (1) ℎ0 𝑊 (2) 𝑊 (3)
(2)
ℎ0
𝑥0 (1)
ℎ1 𝑦ˆ1
(2)
ℎ1
𝑥1 (1)
ℎ2 𝑦ˆ2
(2)
ℎ2
𝑥2 (1)
ℎ3 𝑦ˆ3
(2)
ℎ3
(1)
ℎ4

(1) (1) (1) (1) 


Then for example: ℎ 1 = 𝜑 (1) 𝑥 0 𝑊0,1 +𝑥 1 𝑊1,1 + 𝑥 2 𝑊2,1 .
| {z }
(1)
=1·𝑏1

Usually, we never compute the sum presented in the graph view but use
the mathematical view with matrix multiplication and thus each layer
is computed in one go. The graph view is just another way to see the
mathematical view i. e. the output is the same, however instead of vector
operations we can visualize how the input x flows to the output ŷ, this
process is called forward propagation.

5.3 Forward Propagation

The previous different views of a neural network give us ways to un-


derstand the process of forward propagation more intuitively. One nice
functionality of neural networks is that by selecting appropriately the
output layer function we can solve both regression or classification
problems.

1 h0 ← x Algorithm 5.1: NN Forward Propagation


2 foreach 𝑙 = 1 : 𝐿
3 z(𝑙) = h(𝑙−1) W(𝑙)
4 h(𝑙) = 𝜑 (𝑙) (z(𝑙) )
5 end
6 return ℎ (𝐿)
5 Neural Networks 40

If we want to solve a regression problem, the function 𝜑 (𝐿) , will be


the identity function 𝜑 (𝐿) (z(𝐿) ) = z(𝐿) . For classification problems if
the output dimension is 𝑘 = 1 then 𝜑 (𝐿) (z(𝐿) ) = sign(z(𝐿) ), if instead
we are dealing with multi-class classification i. e. 𝑘 > 1, we will use
𝜑(𝐿) (z(𝐿) ) = arg max𝑖 (z(𝐿) )𝑖 to select the class with the highest score.
As we have seen before there is no general rule that tells us how many
hidden layers we should use, however the universal approximation theorem
tells us that

Theorem 5.3.1 (Universal Approximation) Let 𝜑(z) be the Sigmoid acti-


vation function and 𝑓 ∗ (x) : ℝ 𝑑 → ℝ 𝑘 a continuous function 𝑓 ∈ 𝐶 (1) , then
there always exists a finite (𝑚 < ∞) sum of the form
𝑚
𝑤 𝑗 𝜑(𝜽𝑇 x)
X
𝑓 (x; 𝜽, w) = (5.16)
𝑗=1

such that for all 𝜀 > 0

𝑓 (x; 𝜽, w) − 𝑓 ∗ (x) < 𝜀 ∀x (5.17)

More simply, the basic neural network with a single layer 𝐿 = 1 and the
right activation function could be enough to approximate any continuous
function with an error as small as we want. If this is true, why we might
want more than a single layer? The reason is that 𝑚 might be really large
and is often unknown. A nice property of multi-layer neural networks is
that by adding only a few layers we can exponentially decrease the size
of 𝑚 .

5.4 Objective

Forward propagation is used for prediction and thus assumes that we


already have the parameters 𝜽 . Stochastic gradient descent is used to
train the network and thus find the correct parameters 𝜽 such that
𝑓 (x; 𝜽) = ŷ ≈ y for any pair (x , y) in the dataset.
Recall that if we add an additional input at each hidden layer with value
1, we can merge the biases as the first row in each layer’s weight matrix
W(𝑙) . This will allow us to represent our parameters in a more compact
form 𝜽 ..= (W(1) , W(2) , . . . , W(𝐿) ).
We can then define the objective function of a neural network as
𝑛
1X
𝜽ˆ = arg min L(𝜽 ; x𝑖 , y𝑖 ) (5.18)
𝜽∈ℝ 𝐷 𝑛 𝑖=1

where L is a multi-output loss function.

Definition 5.4.1 (Multi-output loss) Given a standard loss function


ℓ★(𝑦, 𝑦)
ˆ , ℓ★ : ℝ2 → ℝ, vectors 𝜽 ,x, and y, the multi-output loss
5 Neural Networks 41

L : ℝ 𝐷 → ℝ is defined as:

𝑘
1X
L(𝜽 ; x , y) ..= ℓ★(𝑦 𝑗 , 𝑓 (𝜽 ; x) 𝑗 ) (5.19)
𝑘 𝑗=1

More simply it’s just a function that averages the standard loss of each
component of the output vector if we are using a neural network with
many outputs. When we are dealing with regression ★ is usually a mean
squared error, and if we are dealing with classification a multi-class
perceptron or hinge loss.
As we have seen the when computing the forward propagation for 𝑓 ,
we apply non-linear and possibly non-convex activation functions. The
reason we do this that the advantages outweigh the disadvantages. In fact,
we can still get a very good approximation of the optimal solution for 𝜽
by using stochastic gradient descent even if the problem is non-convex.
SGD Update Rule

𝜽𝑡+1 ← 𝜽𝑡 − 𝜂𝑡 ∇𝜽𝑡 L(𝜽𝑡 ; x, y) (5.20)

5.5 Computational Graphs

In the previous models, when we had to compute the gradient of the


loss function, it was more or less straight forward. In this case, the
gradient of the loss L with respect to 𝜽 contains the output 𝑓 (𝜽, x) of the
forward propagation, thus it can be very tedious to compute. We will
introduce the idea of a computational graph, which is a way to visualize
the computation of gradients of any multivariate function, then show
how this process can be easily extended to compute the gradient of our
specific loss function very efficiently. This process is not only a way to
visualize how to compute partial derivatives, but also the method that
most machine learning software use to compute complicated derivatives,
called automatic differentiation.
Given any multivariate function 𝐹(𝑥 1 , . . . , 𝑥 𝑛 ), to compute the partial
𝜕𝐹
derivative 𝜕𝑥 𝑖
, we will first build a computational graph, and then read
the output partial derivative from the graph. More precisely we will
follow the next steps, later we will show a concrete example.

Build Graph First, attribute a variable to the output of each sub-function


of 𝐹 . Secondly, each input of 𝐹 will be a node without any edges pointing
at it, ad each output of 𝐹 will be a node that doesn’t point to any other
node, and each intermediate variable that is associated with the output
of a sub-function of 𝐹 will also be a node. Nodes will have edges pointing
at them from other nodes that are input, and will point to nodes where
their output is passed to. Since each node represents a variable once we
will have computed 𝐹 each node will contain a numerical value. Then if
we have a direct connection from a source node with variable 𝑠 to node
with target node with variable 𝑡 , the intermediate edge will represents
𝜕𝑡
the partial derivative 𝜕𝑠 .
5 Neural Networks 42

Compute Partial Derivative There are three ways to compute partial


derivatives depending on the structure of the graph, each with its own
tradeoffs.
1. Standard Chain Rule, has the advantage that it lets us compute the
derivative between any two nodes in the graph 𝑠 and 𝑡 that are not
𝜕𝑡
directly connected i. e. 𝜕𝑠 . To do so we will follow the path in the
graph from 𝑠 to 𝑡 and multiply the value of the derivative of each
intermediate edge, if there are multiple paths from 𝑠 to 𝑡 we will
sum their values.
2. Forward-Mode Differentiation, lets us compute the derivative of any
node with respect to one input, i. e. apply the operator 𝜕𝑥𝜕 𝑖 to each
node. In this case, instead of working with edges, we will save
𝜕𝑛
in each intermediate node 𝑛 the derivative 𝜕𝑥 𝑖
. We will start by
𝜕𝑥 𝑗
setting the derivative of 𝜕𝑥 𝑖 to 0 if 𝑖 ≠ 𝑗 and to 1 if 𝑖 = 𝑗 . Then
we will follow the graph from the inputs to the outputs and each
intermediate node 𝑛 will sum the value of its children’s partial
derivatives multiplied by the derivative of the connecting edge and
flow them forward as output. This process will be repeated until
𝜕𝐹 𝑗
we have all derivatives with respect to each node, included 𝜕 .
3. Backward-Mode Differentiation, lets us compute the derivative of
one output node with respect to all other nodes, i. e. apply the
𝜕𝐹 𝑗
operator 𝜕 to each node. The process is similar to forward mode
differentiation but instead flows from outputs to inputs. We will
𝜕𝐹 𝑗
set 𝐹𝑖 to 0 if 𝑖 ≠ 𝑗 and to 1 if 𝑖 = 𝑗 , then store in each intermediate
𝜕𝐹 𝑗
node 𝑛 the derivative 𝜕𝑛 , to do so we will sum all parents’ partial
derivatives multiplied by the derivative of the connecting edge and
flow them backward.
One obvious question is, why we need three different methods if the
first one works for all possible partial derivatives? The reason is that
following all paths between two nodes as in standard chain rule might
lead to an exponential explosion. When using forward or backward mode
differentiation we consider for each node only the children or parent
nodes and thus it’s much more efficient. Furthermore, depending on the
function it might be more efficient to use forward-mode differentiation if
the number of output of 𝐹 is much larger than the number of outputs,
conversely, backward-mode, differentiation is much more efficient if the
number of inputs is much larger than the number of outputs. 1 1: Computational graphs and the neural
network graph view are not the same thing.
In the computational graph each variable
Example 5.5.1 (Standard Chain Rule) Let 𝐹 : ℝ3 → ℝ2 , be defined as: is represented by a node and the edges rep-
resent partial derivatives, wherein the neu-
h i𝑇 𝑇 ral network graph each node represents a
𝑥 2 +𝑦
𝐹(𝑥, 𝑦, 𝑧) ..= 𝑦 2 + 𝑘(𝑧, 𝑧) = 𝐹1 𝐹2

exp(𝑥)
(5.21) neuron output and each edge a connection
weight. Neural network graphs are used
to get an intuitive understanding of how
where 𝑘 : ℝ2 → ℝ can be any function (used to demonstrate how the
the network works but not implemented
graph looks if we have undefined functions), then we start by building directly. The computational graph also
the graph with one variable that represents a single sub-function per gives an intuitive understanding but are
node, where the edges represent the derivative of the successive node actually implemented by many libraries
to compute first-order derivatives with
with respect to the previous one. Green nodes will represent inputs,
automatic differentiation.
blue nodes sub-functions, and red nodes outputs.
5 Neural Networks 43

𝜕𝑐
𝜕𝑥 𝜕𝑑
𝑥 𝑐 ..= exp(𝑥) 𝜕𝑐

𝜕𝑎
𝜕𝑥 𝜕𝑏
𝜕𝑎
𝑏 ..= 𝑎 + 𝑦 𝑏
𝑎 ..= 𝑥 2 𝑑 ..= 𝑐 = 𝐹1
𝜕𝑑
𝜕𝑏 𝜕𝑏
𝜕𝑦

𝑦 𝜕𝑖
𝑒 ..= 𝑦 2 𝜕𝑒
𝜕𝑒
𝜕𝑦

𝜕ℎ
𝜕𝑓 𝜕𝑓
𝜕𝑧
𝑓 ..= 𝑧 ℎ ..= 𝑘( 𝑓 , 𝑔) 𝑖 ..= 𝑒 + ℎ = 𝐹2
𝜕𝑖
𝜕ℎ
𝜕ℎ
𝜕𝑔 𝜕𝑔
𝜕𝑧
𝑧 𝑔 ..= 𝑧

Now if we want to compute the partial derivative of 𝐹1 with respect


to 𝑥 , 𝜕𝐹
𝜕𝑥
1
, using the standard chain rule, we will simply sum all the
paths from node 𝑑 to node 𝑥 (or 𝑥 to 𝑑 ) in the graph and multiply their
partial derivatives.

𝜕𝐹1 𝜕𝑑 𝜕𝑑 𝜕𝑐 𝜕𝑑 𝜕𝑏 𝜕𝑎
= = + (5.22)
𝜕𝑥 𝜕𝑥 𝜕𝑐 𝜕𝑥 𝜕𝑏 𝜕𝑎 𝜕𝑥
More concretely let’s see a numerical example, let (𝑥, 𝑦, 𝑧) = (1 , 2 , 3)
and define 𝑘( 𝑓 , 𝑔) ..= 𝑓 𝑔 .
𝜕𝑐 = exp(𝑥) = 2.72 𝜕𝑑
𝜕𝑥 𝜕 𝑐 = −𝑏 =
𝑥=1 𝑐 = 2.72 𝑐2 −0.54

𝜕𝑎
𝜕𝑥 = 2
𝑥 =2 𝜕𝑏 = 1
𝜕𝑎
𝑎=1 𝑏=4 𝑑 = 1.47
𝜕𝑑 = 1 = 0.37
𝜕𝑏 𝑐
𝜕𝑏 = 1
𝜕𝑦
𝜕𝑖
𝑦=2 𝑒=4 𝜕𝑒 = 1
𝜕𝑒 = 2 𝑦 = 4
𝜕𝑦

𝜕ℎ = 𝑔 𝑓 𝑔−1 = 27
𝜕𝑓
𝜕𝑓 = 1 𝑓 =3 ℎ = 27 𝑖 = 31
𝜕𝑧 𝜕𝑖 = 1
𝜕ℎ

𝜕𝑔
𝜕𝑧
=1 9 . 66
=2
𝑧=3 𝑔=3 𝑔 ln( 𝑓
)
𝑓
𝜕ℎ =
𝜕𝑓

Then again by following the two paths from 𝑑 to 𝑥 , the previous partial
derivative is:

𝜕𝐹1 𝜕𝑑
= = −0.54 · 2.72 + 0.37 · 1 · 2 = −0.73 (5.23)
𝜕𝑥 𝜕𝑥
Or in more simple terms, if we change the value of 𝑥 by 1, the value of
𝑑 (which is the output of 𝐹1 ) changes by approximately -0.73, this idea
is really important because it shows us how by changing one variable
in the graph (which in this case is the input, but could be any variable)
affects the change to another variable.
Example 5.5.1 shows us how we can use the graph to compute the partial
derivative of any two variables, however, as we have seen before there
are some cases in which the number of paths between two variables
can increase exponentially, and thus the standard chain rule is very
5 Neural Networks 44

inefficient.

Example 5.5.2 (Backward-Mode Differentiation) Let 𝐹 : ℝ → ℝ, be


defined as:

𝐹(𝑥) ..= 𝑦(𝑧(𝑥, 𝑥, 𝑥), 𝑧(𝑥, 𝑥, 𝑥), 𝑧(𝑥, 𝑥, 𝑥)) (5.24)

for some functions 𝑦, 𝑧 : ℝ3 → ℝ, note that usually we would write 𝐹


as 𝐹(𝑥) ..= 𝑦(𝑑, 𝑑, 𝑑) where 𝑑 ..= 𝑧(𝑥, 𝑥, 𝑥).

𝜕𝑑 𝜕𝑒 𝜕ℎ
𝜕𝑎
𝜕𝑥
𝑎 ..= 𝑥 𝜕𝑎 𝜕𝑑 𝑒 ..= 𝑑 𝜕𝑒

𝜕𝑏 𝜕𝑑 𝜕𝑓 𝜕ℎ
𝜕𝑥 𝜕𝑏 𝜕𝑑 𝜕𝑓
𝑥 𝑏 ..= 𝑥 𝑑 ..= 𝑧(𝑎, 𝑏, 𝑐) 𝑓 ..= 𝑑 ℎ ..= 𝑦(𝑑, 𝑒, 𝑓 )

𝜕𝑐 𝜕𝑑 𝜕𝑔 𝜕ℎ
𝜕𝑥 𝜕𝑐 𝜕𝑑 𝜕𝑔

𝑐 ..= 𝑥 𝑔 ..= 𝑑

𝜕𝐹 𝜕ℎ
This time if we want to compute the derivative 𝜕𝑥
= 𝜕𝑥
the number of
paths from ℎ to 𝑥 is 32 = 9.
Standard chain rule:

𝜕ℎ 𝜕ℎ 𝜕𝑒 𝜕𝑑 𝜕𝑎 𝜕ℎ 𝜕𝑔 𝜕𝑑 𝜕𝑐
= +··· + (5.25)
𝜕𝑥 𝜕𝑒 𝜕𝑑 𝜕𝑎 𝜕𝑥 𝜕𝑔 𝜕𝑑 𝜕𝑐 𝜕𝑥
| {z } | {z }
Path 1 Path 9

For each path we compute 4 derivatives for a total of 9 · 4 = 36


derivatives.
Bachward-mode idea:

𝜕ℎ 𝜕ℎ 𝜕𝑑
= · (5.26)
𝜕𝑥 𝜕𝑑 𝜕𝑥
𝜕ℎ 𝜕𝑒 𝜕ℎ 𝜕 𝑓 𝜕ℎ 𝜕𝑔 𝜕𝑑 𝜕𝑎 𝜕𝑑 𝜕𝑏 𝜕𝑑 𝜕𝑐
  
= + + · + +
𝜕𝑒 𝜕𝑑 𝜕 𝑓 𝜕𝑑 𝜕𝑔 𝜕𝑑 𝜕𝑎 𝜕𝑥 𝜕𝑏 𝜕𝑥 𝜕𝑐 𝜕𝑥
(5.27)

In this case we only compute 12 derivatives. It’s easy to see that if the
depth of this kind of function increases, the number of derivatives with
the standard chain rule increases exponentially. With backward-mode
differentiation the number of derivatives its linear in both depth and
width of the graph.
Backward-mode differentiation starts from ℎ and flows backward
through the directed graph, each time storing in the node the derivative
with respect to 𝐹 = ℎ , i. e. applying the operator 𝜕ℎ𝜕
on all nodes. Thus
𝜕ℎ
we start from the node ℎ and store its derivative 𝜕ℎ = 1 (if the function
𝜕ℎ
has multiple outputs note that 𝜕𝐹 = 0), then flow backward through
the graph where each node will add all derivatives contained on their
parent nodes multiplied by the edge from which they came from.
5 Neural Networks 45

𝜕𝑎
𝜕𝑥 𝜕ℎ = 𝜕ℎ 𝜕𝑑 𝜕𝑑 𝜕𝑒 𝜕ℎ = 𝜕ℎ 𝜕ℎ
𝜕𝑎 𝜕𝑑 𝜕𝑎 𝜕𝑎 𝜕𝑑 𝜕𝑒 𝜕ℎ 𝜕𝑒 𝜕ℎ
𝜕𝑒

𝜕𝑏 𝜕𝑑 𝜕𝑓 𝜕ℎ
𝜕𝑥 𝜕𝑏 𝜕𝑑 𝜕𝑓

𝜕ℎ = 𝜕ℎ 𝜕𝑎 + 𝜕ℎ 𝜕𝑏 + 𝜕ℎ 𝜕𝑐 𝜕ℎ = 𝜕ℎ 𝜕𝑑 𝜕ℎ = 𝜕ℎ 𝜕𝑒 + 𝜕ℎ 𝜕 𝑓 + 𝜕ℎ 𝜕𝑔 𝜕ℎ = 𝜕ℎ 𝜕ℎ 𝜕ℎ = 1
𝜕𝑥 𝜕𝑎 𝜕𝑥 𝜕𝑏 𝜕𝑥 𝜕𝑐 𝜕𝑥 𝜕𝑏 𝜕𝑑 𝜕𝑏 𝜕𝑑 𝜕𝑒 𝜕𝑑 𝜕 𝑓 𝜕𝑑 𝜕𝑔 𝜕𝑑 𝜕𝑓 𝜕ℎ 𝜕 𝑓 𝜕ℎ

𝜕𝑑 𝜕𝑔
𝜕𝑐 𝜕𝑐 𝜕𝑑
𝜕𝑥
𝜕ℎ
𝜕ℎ = 𝜕ℎ 𝜕𝑑 𝜕ℎ = 𝜕ℎ 𝜕ℎ 𝜕𝑔
𝜕𝑐 𝜕𝑑 𝜕𝑐 𝜕𝑔 𝜕ℎ 𝜕𝑔

Other than avoiding an exponential explosion, backward-mode dif-


ferentiation computes the derivative of one output with respect to all
variables, instead of just one as with standard chain rule. This is really
important, each node tells us how much influence it has on the output.
For example, if we observe that 𝜕ℎ 𝜕𝑎
= 3 this tells us that if we would
change 𝑎 by 1, then the output of 𝐹 would change by approximately 3.
If our goal is to make 𝐹 as small or as big as possible we can just check
the graph and see which variables are responsible for the greatest
increase or decrease if a small change is applied. Note that if we are
dealing with multivariate functions backward-mode differentiation
will compute all derivatives, including the ones in the gradient ∇x𝐹 ,
which is just the vector of the derivatives of one output (red) with
respect to all inputs (green nodes).
All of the previous methods are useful to compute complicated deriva-
tives, but how does this apply to neural networks? In fact, neural networks
are just complicated functions with some variables (𝜽 ) that can be changed
as needed. Our goal is to optimize L : ℝ 𝐷 → ℝ, i. e. a function with a
lot of inputs and a single output. We have seen before that for functions
with few outputs and many inputs, backward-mode differentiation is the
most suitable way to compute derivatives with respect to any variable.
This process of computing the derivatives of a loss function (∇𝜽 L) for a
neural network is called back-propagation. Note that the loss function L
will always have a one-dimensional output even if the neural network
has a very high dimensional output.

5.6 Back-Propagation

In back-propagation the goal is to compute ∇𝜽 L(𝜽, x , y). Recall that the


gradient of L with respect to 𝜽 = (W(1) , . . . , W(𝐿) ) is the vector of all
partial derivatives 𝜕L(𝑙) , ∀𝑖, 𝑗, 𝑙 .
𝜕𝑊𝑖,𝑗

Example 5.6.1 In this example we will compute back-propagation for


the following single hidden layer neural network with one input, one
output, and no biases (for semplicity).

(1) (2)
𝑓 (𝑥1 ; 𝜽) = 𝜑 (2) (𝜑(1) (𝑥1𝑊1,1 )𝑊1,1 ) = 𝑦ˆ1 (5.28)

(1) (2)
where 𝜽 = (𝑊1,1 , 𝑊1,1 ), and 𝑓 has the following graph view.
5 Neural Networks 46

(1) (2)
𝑊1,1 𝑊1,1
𝑥1 (1)
ℎ1 𝑦ˆ1

As we have seen before we will use a computational graph for the


loss L(𝜽 ; 𝑥 1 , 𝑦1 ), where green nodes are inputs (i. e. 𝜽 ), blue nodes are
sub-functions, red nodes are outputs (i. e. L) and yellow nodes are fixed
variables (i. e. 𝑥 1 , 𝑦1 ). Note that when computing the loss we want to
find the partial derivatives with respect to the weights in 𝜽 , and thus
𝑥 1 , 𝑦1 are not inputs as in 𝑓 but fixed variables. We will assume that
𝜑 (1) , 𝜑(2) are sigmoid activation functions 𝜎 , and that our single output
loss is the mean squared error. Then the computational graph of L is:

𝑥1
(1)
𝜕𝑧
1
𝜕𝑥1
(1) (1 )
𝜕𝑧 𝜕ℎ
1 1
(1) (1)
𝜕𝑊 𝜕𝑧
(1) 1 ,1 (1) (1) 1 (1) (1)
𝑊1,1 𝑧 1 ..= 𝑥1 𝑊1,1 (2)
ℎ 1 = 𝜎(𝑧1 )
𝜕𝑧
1
(1 )
(2 ) 𝜕ℎ
𝜕𝑧 1
1 𝜕 𝑦ˆ1
(2 ) (2 )
𝜕𝑊 𝜕𝑧
(2) 1 ,1 (2 ) . (1) (2) 1 (2)
𝑊1,1 𝑧1 .= ℎ 1 𝑊1,1 𝑦ˆ1 = 𝜎(𝑧 1 )
𝜕L
𝜕 𝑦ˆ1

𝜕L
𝜕𝑦1
𝑦1 L = (𝑦1 − 𝑦ˆ1 )2

Then we will apply backward-mode differentiation from the loss


function, where the goal is to find the derivatives 𝜕L(1) and 𝜕L(2) .
𝜕𝑊1,1 𝜕𝑊1,1

𝜕 L = 𝜕 L 𝑊 (1 )
𝜕𝑥1 (1 ) 1 , 1
𝜕𝑧 (1)
1 𝜕𝑧
1
𝜕𝑥1
(1 ) (1)
𝜕𝑧 𝜕ℎ
1 1
(1 ) (1 )
𝜕L = 𝜕L 𝑥 𝜕𝑊 𝜕𝑧
   
1 ,1 𝜕L = 𝜕L 𝜎 𝑧 (1) 1 − 𝜎 𝑧 (1) 1 𝜕L = 𝜕L 𝑊 (2)
(1 ) (1 ) 1 (1) (1) 1 1 (1 ) (2) 1 ,1
𝜕𝑊 𝜕𝑧 𝜕𝑧 𝜕ℎ 𝜕𝑧
(2) 𝜕ℎ 𝜕𝑧
1 ,1 1 1 1 1 1
1
(1)
𝜕ℎ
(2) 1
𝜕𝑧 𝜕 𝑦ˆ1
1
(2) (2)
𝜕𝑊 𝜕𝑧
   
𝜕L = 𝜕L ℎ (1) 1 ,1 𝜕L = 𝜕L 𝜎 𝑧 (2) 1 − 𝜎 𝑧 (2) 1 𝜕L = 𝜕L (−2)(𝑦 − 𝑦ˆ )
𝜕𝑊
(2 )
𝜕𝑧
(2) 1 (2 ) 𝜕 𝑦ˆ1 1 1 𝜕 𝑦ˆ1 𝜕L 1 1
1 ,1 1 𝜕𝑧
1
𝜕L
𝜕 𝑦ˆ1
𝜕L
𝜕L = 𝜕L 2(𝑦 − 𝑦ˆ ) 𝜕𝑦1
𝜕L = 1
𝜕𝑦1 𝜕L 1 1 𝜕L

Then we can easily read out our derivatives from the graph. Note that
for for spacing reasons we wrote the derivatives in a more compact way,
usually we just have to read out the value of the partial derivatives from
the green nodes. Also the yellow nodes are never actually computed
since we can’t change our data, however they are just variables and
thus nothing is stopping us from computing their partial derivatives.
5 Neural Networks 47

In this case the extended form is:

𝜕L 𝜕L
(1)
= 𝑥
(1) 1
(5.29)
𝜕𝑊1,1 𝜕𝑧1
𝜕L 
(1)
 
(1)

= (1)
𝜎 𝑧1 1 − 𝜎 𝑧1 𝑥1 (5.30)
𝜕ℎ1
𝜕L (2)

(1)
 
(1)

= 𝑊 𝜎 𝑧1
(2) 1,1
1 − 𝜎 𝑧1 𝑥1 (5.31)
𝜕𝑧1
𝜕L  (2)    
(2) (2)
 
(1)
 
(1)
= 𝜎 𝑧1 1 − 𝜎 𝑧1 𝑊1,1 𝜎 𝑧 1 1 − 𝜎 𝑧1 𝑥1
𝜕 𝑦ˆ1
(5.32)
       
(2) (2) (2) (1) (1)
= (−2)(𝑦1 − 𝑦ˆ1 )𝜎 𝑧 1 1 − 𝜎 𝑧1 𝑊1,1 𝜎 𝑧1 1 − 𝜎 𝑧1 𝑥1
(5.33)
𝜕L 𝜕L (1)
(2)
= (2)
ℎ1 (5.34)
𝜕𝑊1,1 𝜕𝑧1
𝜕L  (2)    
(2) (1)
= 𝜎 𝑧1 1 − 𝜎 𝑧1 ℎ1 (5.35)
𝜕 𝑦ˆ1
   
(2) (2) (1)
= (−2)(𝑦1 − 𝑦ˆ1 )𝜎 𝑧1 1 − 𝜎 𝑧1 ℎ1 (5.36)

The last thing to realize is that we don’t have tocompute


 most
 of the
(1) (2)
values in the partial derivatives. For example 𝜎 𝑧 1 and 𝜎 𝑧 1 are
(2) (1)
computed during forward propagation together with ℎ1 , and ℎ1 ,
𝑦ˆ1 . Furthermore, since we use backward-mode differentiation, 𝜕L(2) is
𝜕𝑧1
computed only once and used to evaluate both 𝜕L(1) and 𝜕L(2) . After
𝜕𝑊1,1 𝜕𝑊1,1
having computed the partial derivatives of the loss with respect to the
weights 𝜽 , we will know by how much the loss changes if we slightly
change the weights of the neural network. This information will be
used by gradient descent to take small steps towards the optimal value
for L.
Example 5.6.1 shows how to compute the partial derivatives of a simple
neural network, where the advantages of backward-mode differentiation
are not fully expressed as we only have one dimensional layers. In the
following example we will show how backward-mode differentiation
can be used with wider networks i. e. with matrices.

Example 5.6.2

The architecture of standard neural networks is always the same (only


the number of layers, the width of each layer, the activation functions,
or the single output loss are changing) thus we don’t have to build the
computational graph for every new standard neural network but only
run the following iterative algorithm.

h i
𝜕L 𝜕L 𝜕L Standard back-
1 ∇z(𝐿) L ..= 𝜕 𝑦ˆ1 𝜕 𝑦ˆ2
··· 𝜕 𝑦ˆ 𝑘 B Compute output error gradient. Algorithm 5.2:
propagation
2 ∇W(𝐿) L ..= h(𝐿−1)· ∇z(𝐿) L B Output weight error gradient.
3 foreach 𝑙 = 𝐿 − 1 : 1
∇z(𝑙) L ..= 𝜑0(𝑙) (z(𝑙) ) ∇z(𝑙+1) L · W(𝑙+1)

4 B Hidden layer 𝑙 error gradient.
5 Neural Networks 48

5 ∇W(𝑙) L ..= h(𝑙) ∇z(𝑙) L B Weight layer 𝑙 error gradient.


6 end  
7 return ∇𝜽 L = ∇W(1) L ∇W(2) L ··· ∇W(𝐿) L

Note that if we want to compute the gradient of the loss of a different


architecture (e. g. RNN, LSTM, GRU), and thus a different function, we
would have to build a new computational graph and find a new iterative
algorithm. However, using backward-mode differentiation we can easily
compute the gradient for any network or function. If we are using a
library that implements automatic differentiation this process will be
automated.

Recap The steps of the training process of a neural network are:


1. Initialize the parameters 𝜽 .
2. Choose an optimization algorithm e. g. gradient descent.
3. Choose a single output loss function ℓ★.
4. Repeat the following steps:
a) Pick an input output pair (x𝑖 , y𝑖 ) from the dataset
(or mini-batch pair (X1...𝐵 = [x𝑖1 , . . . , x𝑖 𝐵 ]𝑇 , Y1...𝐵 = [y𝑖1 , . . . , y𝑖 𝐵 ]𝑇 )).
b) Forward propagate the pair: 𝑓 (x𝑖 , y𝑖 ; 𝜽) = ŷ𝑖 .
c) Compute the gradient of the loss function L(𝜽 ; x𝑖 , y𝑖 ) with
respect to the weights 𝜽 using back-propagation.
d) Use the optimization algorithm to update the parameters 𝜽 .

5.7 Weight Initialization

We have seen that the first step to train a neural network model is to
initialize the weights 𝜽 in some way. We will show that the initialization
of the parameters is really important, and that if it’s not properly applied
the network won’t be able to learn. Recall the two formulas applied
during forward and back-propagation:
Forward Propagation Step
!
𝑑 (𝑙)
(𝑙) (𝑙−1) (𝑙)
(𝑙)
X
ℎ𝑗 =𝜑 ℎ 𝑖 𝑊𝑖,𝑗 (5.37)
𝑖=1
| {z }
(𝑙)
𝑧𝑗

Back-Propagation Step

𝜕L (𝑙−1) (𝑙−1) 𝜕L
(𝑙−1)
= ℎ𝑖 𝜑0(𝑙−1) (𝑧 𝑖 ) (𝑙)
(5.38)
𝜕𝑊𝑖,𝑗 𝜕ℎ 𝑗

During back-propagation we multiply iteratively the derivative of the


(𝑙−1) (𝑙−1)
activation function 𝜑 at 𝑧 𝑖 . If the variance of 𝑧 𝑖 is very high (i. e.
𝕍 [𝑧 (𝑙−
𝑖
1)
]  1) and we are using a Sigmoid/TanH activation function, its
derivative will be always close to 0. Conversely if the variance is very
(𝑙−1)
small (i. e. 𝕍 [𝑧 𝑖 ]  1) the derivative of the activation function will
always be around the same value. The first problem is called vanishing
5 Neural Networks 49

gradients problem, since we can’t follow the gradient if it’s 0, the second
exploding gradients problem since again we can’t follow the gradient if it’s
not changing.
To solve this problem we assume that the inputs are standardized
(zero mean and constant variance) and drawn from some distribution.
Furthermore, we assume that 𝑥 1 , . . . , 𝑥 𝑑 are independent. Note that the
input is the same as the hidden layer 0.

𝔼[𝑥 𝑗 ] = 𝔼[ℎ (𝑗0) ] = 0 (5.39)

𝕍 [𝑥 𝑗 ] = 𝕍 [ℎ (𝑗0) ] = 1 (5.40)

Our goal is to show that, for each activation function, we can find some
distribution from which we can pick the weights of the neural network
such that the standardization is preserved through each layer. In other
words if the standardization of the input is preserved through all layers
this means that the values nor explode neither shrink too much, i. e. the
variance is constant. This prevents the exploding or vanishing gradient
problems. We assume that all the weights are drawn from a normal
(𝑙)
distribution with zero mean and unknown variance, 𝑊𝑖,𝑗 ∼ N (0 , 𝜎 2 ).
Then, by induction:

" #
𝑑 (𝑙)
𝔼[𝑧 (𝑙) (𝑙−1) (𝑙)
X
𝑗 ]=𝔼 ℎ𝑖 𝑊𝑖,𝑗 (5.41)
𝑖=1
𝑑(𝑙) h i
𝔼 ℎ (𝑙− 1) (𝑙)
X
= 𝑖 𝑊𝑖,𝑗 (5.42)
𝑖=1
𝑑(𝑙) h i h i
𝔼 ℎ (𝑙− 1) (𝑙)
X
= 𝑖 𝔼 𝑊𝑖,𝑗 (5.43)
𝑖=1 | {z }
=0
=0 (5.44)
" #
𝑑 (𝑙)
𝕍 [𝑧 (𝑙) (𝑙−1) (𝑙)
X
𝑗 ]=𝕍 ℎ𝑖 𝑊𝑖,𝑗 (5.45)
𝑖=1
𝑑(𝑙) h i
𝕍 ℎ (𝑙− 1) (𝑙)
X
= 𝑖 𝑊𝑖,𝑗 (5.46)
𝑖=1
𝑑(𝑙) h i i
𝕍 ℎ (𝑙− 1) (𝑙)
X
= 𝑖 𝕍 [𝑊𝑖,𝑗 (5.47)
𝑖=1 | {z } | {z }
=1 =𝜎 2
(𝑙) 2
=𝑑 𝜎 (5.48)

Where we use the property that given two random variables 𝑋 , 𝑌 if 𝑌


has 0 mean, then 𝕍 [𝑋𝑌] = 𝕍 [𝑋]𝕍 [𝑌]. Then if we pick 𝜎 2 correctly we
can mantain a variance of 1 throughout all layers.

I TanH/Sigmoid: Xavier Glorot normal initialization


 
(𝑙) 2
𝑊𝑖,𝑗 ∼ N 𝜇 = 0 , 𝜎 2 = (5.49)
𝑑(𝑙−1) + 𝑑(𝑙)
5 Neural Networks 50

I ReLU: Kaiming He normal initialization


 
(𝑙) 2
𝑊𝑖,𝑗 ∼ N 𝜇 = 0, 𝜎 =2
(5.50)
𝑑(𝑙−1)

5.8 Optimizers

We have seen how to update the weights of the neural network using
stochastic gradient descent:

SGD Update Rule

𝜽𝑡+1 ← 𝜽𝑡 − 𝜂𝑡 ∇𝜽𝑡 L(𝜽𝑡 ; x, y) (5.51)

However SGD is not the only option.

Adaptive SGD Adaptive SGD is analog to SGD, where instead of


using a fixed learning rate 𝜂𝑡 , we change it as the number of iterations 𝑡
increases. The idea is to begin with a fixed small learning rate and start
decreasing it slowly after a fixed amount of iterations.

Example 5.8.1 In this example we will design an adaptive learning


rate that starts as 0.1 constant and then after 1000 iterations it starts to
decrease.
100
𝜂𝑡 = min{0.1, } (5.52)
𝑡

Momentum SGD Recall that neural networks are not convex, and thus
it’s possible that we get stuck in a local minimum or saddle point. Momen-
tum SGD is a form of SGD where we use information about the previous
step to follow the gradient even if we are temporarily stuck. The real-world
analogy is that of a ball rolling down a mountain, even if there are short
bumps or flat points the ball keeps rolling down because it has momentum.

Momentum SGD Update Rule:

v𝑡 ← 𝛼 v𝑡−1 − 𝜂𝑡 ∇𝜽𝑡 L(𝜽𝑡 ; x , y) (5.53)


𝜽𝑡+1 ← 𝜽𝑡 + v𝑡 (5.54)

Where v𝑡−1 ∈ ℝ 𝐷 is the momentum vector that contains the retained


gradient of the previous step and 𝛼 ∈ [0 , 1] determines how much of the
momentum vector should contribute to the current update.

5.9 Overfitting

There are different countermeasures against overfitting.


5 Neural Networks 51

1. Early Stopping i. e. don’t run the optimizer until convergence,


otherwise we might start to overfit to the training data. Instead
every few training steps we pick some data from the validation set
and check if the validation performance starts to decrease, if so we
stop training. One way to do this is using a plot of the validation
and training error with the number of epochs.
2. Regularization

5.10 Regularization

Penality Regularization As with regression methods we can regularize


the model by keeping the weights small, to do so we can use a matrix
norms as penality to the objective.
𝑛
1X
𝜽ˆ = arg min L(𝜽 ; x , y) + 𝜆 k𝜽k 𝐹 (5.55)
𝜽∈ℝ 𝐷 𝑛 𝑖=1

Definition 5.10.1 (Matrix Norm) Let 𝑝, 𝑞 ∈ ℕ , and 𝐴 ∈ ℝ 𝑚×𝑛 be a


matrix then the 𝑝, 𝑞 -matrix norm is defined as:
1
! 𝑝𝑞 𝑞
𝑛 𝑚
|𝑎 𝑖,𝑗 | 𝑝
©X X
k𝐴k 𝑝,𝑞 =­ (5.56)
ª
®
« 𝑗=1 𝑖=1 ¬

Using the matrix norm we can define the Frobenius norm which is the
most popular norm used to regularize the weights 𝜽 , it’s the analogous
of the 𝐿2-norm but with matrices.

Definition 5.10.2 (Frobenius Norm) If 𝑝 = 2, 𝑞 = 2, and 𝐴 ∈ ℝ 𝑚×𝑛 the


matrix norm is called Frobenius norm and is defined as:

𝑚 X
𝑛
s
p
trace(𝐴𝐻 𝐴)
X
k𝐴k 𝐹 = |𝑎 𝑖,𝑗 | 2 = (5.57)
𝑖=1 𝑗=1

Similarly if we set 𝑝 = 1, 𝑞 = 1 we would get the analogous to the


𝐿1-norm. As always the value of 𝜆, which weights the importance of
regularization, is found using cross-validation.

Dropout Overfitting happens because the training data is a noisy


approximation of the real underlying distribution, thus some neurons
in the hidden layers of our neural network will start to recognize noise
instead of general characteristics of the data. To solve this problem
theoretically, we could train many neural networks to get different values
of 𝜽 and then average the learned parameters. The problem with this
approach is that if there are a lot of parameters training a neural network
many times takes a prohibitive amount of time. To solve this problem,
instead of training many networks, we only train a single one but apply
a neat trick called dropout. Dropout is a form of regularization that is
not applied to the objective function but is applied directly inside of the
5 Neural Networks 52

network, more precisely on some hidden layer. Recall how to evaluate


the current hidden layer given the previous one.
 
h(𝑙) ..= 𝜑 (𝑙) h(𝑙−1) W(𝑙) + b(𝑙) (5.58)

(5.59)

The idea is that during training we set some amount of neurons of the
(𝑙−1)
hidden layer to 0 (i. e. ℎ 𝑗 = 0 for some 𝑗 ) with probability 𝑝 , this
simulates the training of only a single sub-network. This has many effects,
the first is that the network is forced to learn a representation that is
more sparse (i. e. to not use the entire hidden layer all the times), the
second is that if some neurons previously were overfitting to noise now
that they are deactivated they are forced to store only important features.
Furthermore, activating only a single set of hidden units has a similar
effect to training multiple networks and averaging the values of the
weights. To avoid overfitting using dropout we change the hidden layer
evaluation slightly as follows:

d(𝑙−1) ∼ Bernoulli(1 − 𝑝) (5.60)


 
h(𝑙) ..= 𝜑 (𝑙) (h(𝑙−1) d(𝑙−1) )W(𝑙) + b(𝑙) (5.61)

Where d is the dropout vector that randomly sets some values of the
hidden layer to 0. Note that each training step will select a different set of
hidden neurons with probability 𝑝 and that during validation dropout is
disabled, i. e. 𝑝 = 0. Depending on the architecture we can apply dropout
to only a few layers or to all of them.

Batch Normalization Recall that using stochastic gradient descent with


a simple training sample might have a large variance and thus lead to
a slow convergence. We have seen that training with mini-batches of
data (X1...𝐵 = [x𝑖1 , . . . , x𝑖 𝐵 ]𝑇 , Y1...𝐵 = [y𝑖1 , . . . , y𝑖 𝐵 ]𝑇 ) (𝐵 samples in the
mini-batch) helps to solve this problem. Note that by the way neural
networks are built, if instead of a single vector we pass a matrix X1...𝐵
with features in the columns and different elements of the batch in the
rows as input, we will get as output the entire batch prediction, i. e.
𝑓 (X1...𝐵 , Y1...𝐵 ; 𝜽) = Yˆ 1...𝐵 , without having to loop through each data
sample. The Batch-SGD step is then:

1
𝜽𝑡+1 ← 𝜽𝑡 − 𝜂𝑡 ∇𝜽𝑡 L(𝜽𝑡 ; X1...𝐵 , Y1...𝐵 ) (5.62)
𝐵
Usually is a good practice to standardize the input data in each batch
(x𝑖1 , . . . , x𝑖 𝐵 ) to avoid the vanishing and exploding gradient problem.
The idea of batch normalization is that we can not only standardize the
input data (and thus the hidden layer 0), but also each intermediate
hidden layer. Similarly to the dropout regularization technique, batch
normalization is a regularization technique applied to each layer of the
network. Batch normalization has many advantages, it helps to keep
the weights of the neural network small, it enables faster training with
higher learning rates, and overall improves the stability of the network.
To apply batch normalization to layer 𝑙 , we will normalize the output
of the hidden layer using the data from the entire batch. If we feed a
batch as input, the hidden layer 𝑙 , h(𝑙) = 𝜙 (𝑙) (z(𝑙) ), will be of dimension
5 Neural Networks 53

h(𝑙) , z(𝑙) ∈ ℝ 𝑑
(𝑙) ×𝐵
. Then we apply batch normalization on z(𝑙) :

𝐵
(𝑙) 1X (𝑙)
𝝁𝐵 ..= z Batch mean (5.63)
𝐵 𝑗=1 :,𝑗
𝐵
2 (𝑙) . 1X (𝑙) (𝑙)
𝝈𝐵 .= (z − 𝜇𝐵 ) Batch standard deviation (5.64)
𝐵 𝑗=1 :,𝑗
(𝑙)
z(𝑙) − 𝝁𝐵
z̃(𝑙) ..= q Standardization (5.65)
2 (𝑙)
𝝈𝐵 + 𝜺
BN𝜸(𝑙) ,𝜷(𝑙) (z(𝑙) ) ..= 𝜸 (𝑙) z̃(𝑙) + 𝜷 (𝑙) Batch normalization (5.66)

where 𝜺 ∈ ℝ^{1×𝑑^(𝑙)} is a small number added to avoid division by 0, and 𝜸, 𝜷 ∈ ℝ^{1×𝑑^(𝑙)} are two variables that will be trained along with 𝜽, i. e. 𝜽 ← 𝜽 ∪ {𝜸, 𝜷}, to automatically denormalize our data. The idea is that by optimizing 𝜸, 𝜷 the network won't have to destabilize itself by increasing its weights disproportionately, but only these two parameters. Finally we can change the evaluation of layer 𝑙 to:

h^(𝑙) = φ^(𝑙)(BN_{𝜸^(𝑙),𝜷^(𝑙)}(z^(𝑙)))    (5.67)
      = φ^(𝑙)(BN_{𝜸^(𝑙),𝜷^(𝑙)}(h^(𝑙−1) W^(𝑙)))    (5.68)

Note that when using batch normalization the bias b^(𝑙) is not necessary since it is already implemented by 𝜷.
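A small NumPy sketch of the batch normalization transform above; here the mini-batch is stored row-wise as a (𝐵, 𝑑^(𝑙)) matrix (the transpose of the column convention used in the equations), and gamma/beta are the trainable parameters:

import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    # Z: (B, d_l) pre-activations z^(l) for a mini-batch, one row per sample
    mu = Z.mean(axis=0)                      # batch mean, Eq. (5.63)
    var = Z.var(axis=0)                      # batch variance, Eq. (5.64)
    Z_tilde = (Z - mu) / np.sqrt(var + eps)  # standardization, Eq. (5.65)
    return gamma * Z_tilde + beta            # scale and shift, Eq. (5.66)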

5.11 Convolutional Neural Networks (CNNs)

Convolutional neural networks are very similar to standard neural


networks. CNNs are usually used for image recognition, if we used
a standard neural network to recognize images there would be some
problems. The first is that we would have to input a vector of the same
size as the number of pixels; even small images, e. g. 200x200x3 (200x200
pixels with 3 color channels), would use 120000 different inputs. Secondly,
images are a special type of data, most neighboring pixels are correlated
with each other and thus contain redundant information, usually the
greatest amount of information is found when there are big differences
between neighboring pixels. Convolutional neural networks try to solve
those problems by using trainable convolution filters to compress the
representation of the grid of pixels and keeping only important features.
Note that CNNs work well for all types of applications that have spatially
correlated data, i. e. 𝑑 -dimensional grids of values where neighboring
data-points are correlated, hence not only for images. The general idea
behind CNNs is that we start with a grid of values, we apply one or
multiple filters on it, and then use this new representation as input to a
standard feed forward neural network.

Transformation Invariance When dealing with spatially correlated


data (think of an image, or 2 dimensional grid, of a digit e. g. ‘4’),
we usually want the information to be recognized even if different
transformations are applied.

I Shift invariance, if the important information is shifted (e. g. if the


digit ‘4’ is not centered in the image but on the top left).
I Rotation invariance, if the important information is rotated (e. g. digit
‘4’ rotated by a few degrees).
I Scale invariance, if the important information is concentrated in a
small area or on the entire grid (e. g. digit ‘4’ is written in a small
font, or big font).
Thus we want our model to be invariant to such transformations. There are different
ways to solve this problem, one might be to use a standard neural network
and train it not only with the standard data, but also with the transformed
data (shifted, rotated, scaled). This process is called data augmentation.
Other methods use special types of regularization, but the state of the
art is usually only achieved by using CNNs.

Convolution Operation The convolution operation is used to extract important correlations between neighboring cells and also to transform the input. We have seen that convolutions are applied on grids of values. The
idea is that we have one or more convolutional filters, which are just
matrices of numbers, that will be sliding over the grid and extract the
important features. To precisely describe a convolution we will have to
set the following parameters:
I Batch size: 𝐵, similarly to neural networks we can apply the convolution to one sample or to many samples at the same time (e. g. if we apply it to a single image at a time 𝐵 = 1, if to many images at a time 𝐵 > 1).
I Filters: 𝐹 , during a convolution operation a filter is applied on the
image, we can apply a single filter at the time i. e. 𝐹 = 1, or many
filters at the time i. e. 𝐹 > 1.
I Dimension: 𝐷𝑤 , 𝐷 ℎ , 𝐷𝑑 , the dimension of the input (e. g. a 2D
image of 20 × 30 pixels will have 𝐷𝑤 = 20 , 𝐷 ℎ = 30, or a 1D audio
signal with 100 discrete timesteps will have 𝐷𝑤 = 100).
I Channels: 𝐶 , the number of channels of the signal (e. g. a 2D color
image in RGB will have 𝐶 = 3 color channels, and a 1D audio signal
will have 𝐶 = 2 different sound channels). Note that sometimes the
number of channel is added as an additional dimension, however
with standard application it’s good practice to keep it separate (e. g.
a video will have 𝐷𝑤 , 𝐷 ℎ , 𝐷𝑑 = 3 for width, height and time, and
𝐶 = 3 for the color channels).
I Kernel Size: 𝐾 𝑤 , 𝐾 ℎ , 𝐾 𝑑 , the dimension of the filter (e. g. 𝐾 𝑤 =
3 , 𝐾 ℎ = 3).
I Padding: 𝑃𝑤 , 𝑃 ℎ , 𝑃𝑑 , sometimes is convenient to pad the input
volume with zeros so that we can control the output dimension
(e. g. in 2D with 𝑃𝑤 , 𝑃 ℎ we go to 𝐷𝑤 ← 𝐷𝑤 + 2𝑃𝑤 , 𝐷 ℎ ← 𝐷 ℎ + 2𝑃 ℎ .
Times 2 since we pad top/bottom respectively left/right).
I Stride: 𝑆_𝑤, 𝑆_ℎ, 𝑆_𝑑, by how much we move the filter window, usually 𝑆 = 1. The higher the stride, the smaller the output representation we get (e. g. a stride of 𝑆 = 2 will slide the filter at intervals of 2 pixels on the image).
Then each filter will be applied to all samples in the batch and we will get as output a tensor of size 𝐵 × 𝐹 × 𝐷̃_𝑤 × 𝐷̃_ℎ × 𝐷̃_𝑑, where 𝐷̃_★ depends on the input dimension, padding and stride. Note that matrices with more than 2 dimensions are usually referred to as tensors.

𝐷̃_★ = ⌊(𝐷_★ + 2𝑃_★ − 𝐾_★) / 𝑆_★⌋ + 1    (5.69)

Definition 5.11.1 (Convolution) Given a 2D input matrix X ∈ ℝ^{𝐷_𝑤×𝐷_ℎ} with 𝐶 channels and a kernel filter matrix^a W^(𝑓) ∈ ℝ^{𝐾_𝑤×𝐾_ℎ} with 𝑓 ∈ {1, . . . , 𝐹}, the convolution Ŷ^(𝑓) := X ∗ W^(𝑓) between X and W^(𝑓) at 𝑖, 𝑗 is defined as:

Ŷ_{𝑖,𝑗,𝑓} := Σ_{𝑥=0}^{𝐾_𝑤−1} Σ_{𝑦=0}^{𝐾_ℎ−1} Σ_{𝑐=0}^{𝐶−1} X_{𝑖+𝑥, 𝑗+𝑦, 𝑐} W^(𝑓)_{𝑥,𝑦,𝑐}    (5.70)

a We use the notation W to make clear that this is a trainable weight matrix.

Example 5.11.1 (2D-Convolution) Let the following convolution be


applied on a 2D color image: 𝐵 = 3 (3 images at the time), 𝐹 = 2 (2
filters applied on each image), 𝐷𝑤 , 𝐷 ℎ = 5 (5 × 5 image size), 𝐶 = 3
(color channels), 𝐾 𝑥 , 𝐾 𝑦 = 3 (3 × 3 kernel size) 𝑃𝑥 , 𝑃𝑦 = 1 (1 zero pad),
𝑆 𝑥 , 𝑆 𝑦 = 2 (2 stride) Then to compute the convolution Yˆ we slide
all filters over all images in the batch, and at each step multiply all
overlapping cells and sum them together. The following illustration
(1)
shows computation of the convolution step for 𝑌ˆ1,1 .
[Illustration: the input tensor of size 𝐵 × 𝐶 × 𝐷_ℎ × 𝐷_𝑤 (zero padded), the kernel tensor of size 𝐹 × 𝐾_ℎ × 𝐾_𝑤 × 𝐶, and the output tensor of size 𝐵 × 𝐹 × 𝐷̃_ℎ × 𝐷̃_𝑤, with the first filter placed over the padded top-left corner of the first image to produce Ŷ_{1,1,1}.]

We have 3 images in the batch and 2 filters; each image is convolved with all filters, thus we have a total of 6 convolved images (𝐵 · 𝐹). For example the index 1, 1 of the first convolved image is evaluated as:

Ŷ_{1,1,1} = X_{1,1,0} W^(1)_{1,1,0} + X_{1,2,0} W^(1)_{1,2,0} + X_{2,1,0} W^(1)_{2,1,0} + X_{2,2,0} W^(1)_{2,2,0}    (5.71)
          + X_{1,1,1} W^(1)_{1,1,1} + X_{1,2,1} W^(1)_{1,2,1} + X_{2,1,1} W^(1)_{2,1,1} + X_{2,2,1} W^(1)_{2,2,1}    (5.72)
          + X_{1,1,2} W^(1)_{1,1,2} + X_{1,2,2} W^(1)_{1,2,2} + X_{2,1,2} W^(1)_{2,1,2} + X_{2,2,2} W^(1)_{2,2,2}    (5.73)

Note that for this cell we sum only 4 values for each channel since the input tensor is zero padded.
We start with images of size 𝐶 × 𝐷_ℎ × 𝐷_𝑤 and after the convolution each of those images will have size 𝐹 × 𝐷̃_ℎ × 𝐷̃_𝑤, thus in a sense the number of filters 𝐹 is the same as the number of output channels, i. e. 𝐶̃ := 𝐹. Going back to the convolution operation, we could either put the current representation into a vector² x ∈ ℝ^{𝐷̃_ℎ·𝐷̃_𝑤·𝐶̃} and feed it as input to a standard neural network, or apply more layers of operations using the output channels 𝐶̃ as the new input channels 𝐶.

2: The operation of putting a tensor into a vector is called flattening.
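The definition above can be implemented directly with loops. Below is a minimal NumPy sketch (not an efficient implementation) for a single image and a single filter, with illustrative parameter names P and S for padding and stride:

import numpy as np

def conv2d_single(X, W, P=0, S=1):
    # X: (D_h, D_w, C) input image, W: (K_h, K_w, C) filter; returns the (D~_h, D~_w) feature map
    X = np.pad(X, ((P, P), (P, P), (0, 0)))   # zero padding
    D_h, D_w, _ = X.shape
    K_h, K_w, _ = W.shape
    out_h = (D_h - K_h) // S + 1              # output size, Eq. (5.69), padding already applied
    out_w = (D_w - K_w) // S + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = X[i*S:i*S + K_h, j*S:j*S + K_w, :]
            Y[i, j] = np.sum(patch * W)       # sum over x, y and channels, Eq. (5.70)
    return Y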

Pooling Operation Convolutional layers work well to extract important


correlation, but are not designed to compress the representation. If we
want to reduce the number of parameters that we feed to the neural
network by a reasonable amount usually we apply an additional operation
to the output of the convolution called pooling. Pooling subsamples the
given representation, usually it’s an operation that splits the grid into
subgrids, returns either the average or the maximum of each subgrid.
Similarly to convolutions, we can still change the padding, kernel size,
or stride even if usually 𝑆𝑤 = 𝐾 𝑤 , 𝑆 ℎ = 𝐾 ℎ but this time the number of
channels stays fixed.

Definition 5.11.2 (Max/Average Pooling) Max Pooling:

Ŷ_{𝑖,𝑗,𝑐} := max_{𝑥=0,...,𝐾_𝑤−1} max_{𝑦=0,...,𝐾_ℎ−1} X_{𝑖+𝑥, 𝑗+𝑦, 𝑐}    (5.74)

Avg. Pooling:

Ŷ_{𝑖,𝑗,𝑐} := (1/(𝐾_𝑤 𝐾_ℎ)) Σ_{𝑥=0}^{𝐾_𝑤−1} Σ_{𝑦=0}^{𝐾_ℎ−1} X_{𝑖+𝑥, 𝑗+𝑦, 𝑐}    (5.75)

Example 5.11.2 (Pooling) Let the following pooling operation be ap-


plied on a 2D color image: 𝐵 = 3 (3 images at the time), 𝐷𝑤 , 𝐷 ℎ = 4
(4 × 4 image size), 𝐶 = 3 (color channels), 𝐾 𝑥 , 𝐾 𝑦 = 2 (2 × 2 kernel
size) 𝑃𝑥 , 𝑃𝑦 = 0 (0 zero pad), 𝑆 𝑥 , 𝑆 𝑦 = 2 (2 stride) Then to compute
the pooling we take subgrids of 2 at the time and average or take the
maximum of each.
[Illustration: one channel of the 𝐵 × 𝐶 × 𝐷_ℎ × 𝐷_𝑤 input tensor and the corresponding 𝐵 × 𝐶 × 𝐷̃_ℎ × 𝐷̃_𝑤 output tensors.]

Input (one channel):
1 2 3 7
4 1 5 5
0 1 0 1
3 0 1 0

Output (Max Pooling):      Output (Average Pooling):
4 7                        2 5
3 1                        1 0.5
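A small NumPy sketch of the pooling operation on a single channel, reproducing the numbers of the example above (kernel 2 × 2, stride 2):

import numpy as np

def pool2d(X, K=2, S=2, mode="max"):
    D_h, D_w = X.shape
    out_h, out_w = (D_h - K) // S + 1, (D_w - K) // S + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = X[i*S:i*S + K, j*S:j*S + K]
            Y[i, j] = patch.max() if mode == "max" else patch.mean()
    return Y

X = np.array([[1, 2, 3, 7], [4, 1, 5, 5], [0, 1, 0, 1], [3, 0, 1, 0]], dtype=float)
print(pool2d(X, mode="max"))  # [[4. 7.] [3. 1.]]
print(pool2d(X, mode="avg"))  # [[2. 5.] [1. 0.5]]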

Operation Stacking In a real-world scenario, we might apply many


layers of convolution and pooling until the last layer returns a tensor of
size 𝐶̃ × 1 × 1 with a high-dimensional 𝐶̃ that will then be flattened into a
vector and fed to a standard neural network. If we then visualize what
each filter matrix W^(𝑓) is doing layer by layer we will see that, as more
layers are applied, more and more complex features will be recognized.
If, for example, we are doing face recognition we will start with vertical,
horizontal lines and edges, then we recognize curves, then circles, then
heads eyes and noses, and finally faces. Thus if by using standard feed
forward neural network each neuron was recognizing a global pattern,
with convolutional neural networks we can use the convolutional layers
to recognize useful local patterns.

5.12 Other

Weight-Space Symmetry The weights of a neural network are symmetric under permutations of the hidden units: relabeling the neurons of a hidden layer (and permuting the corresponding weights) does not change the function the network computes. Thus there are exponentially many equivalent weight configurations, and many of the local minima might in fact be equivalent global minima.

Kernels vs NNs The tradeoffs are:


Kernels
I + Convex optimization i. e. no local minima.
I + Robust against noise.
I +/- Models grow with the size of data.
I - Don’t allow multiple layers.
NNs
I + Flexible non-linear models with fixed parametrization.
I + Multiple layers discover representations at different levels of abstraction.
I - Many hyperparameters that have to be tuned.
I - Suffer if data is noisy (i. e. regularization is often essential).

How to choose hyperparameters?


Probabilistic Approach to
Supervised Learning 6
So far we have discussed how we can fit prediction models (both linear
and non-linear) for regression and classifications problems without any
statistical interpretation. Often we would like to statistically model the
data in order to quantify uncertainty and being able to express prior
knowledge or assumptions about the data. In this chapter, we will see how
many of the approaches we have already discussed can be interpreted
as fitting probabilistic models. This view will allow us to derive new
methods.
So far, given training data, we wanted to identify a hypothesis (e. g.
linear model, kernelized models, neural networks, ...) in order to min-
imize the prediction error. Our fundamental assumption was that the
data set is generated independently and identically distributed from a
probability distribution 𝑃(x , 𝑦). Unfortunately, this probability distribu-
tion is unknown, otherwise supervised learning would simply mean
finding the hypothesis which minimizes (in terms of a loss function 𝑙 )
the following expression:


𝑅(ℎ) = ∫ 𝑃(x, 𝑦) 𝑙(𝑦, ℎ(x)) 𝑑x 𝑑𝑦 = 𝔼_{x,𝑦}[𝑙(𝑦; ℎ(x))]
 

Now we want to answer the following question: what is the best possible risk that one can achieve, using the best possible hypothesis? The following lemma gives the answer for the case of the square loss, but the same idea generalizes also to other loss functions.

Lemma 6.0.1 If one knows 𝑃(x, 𝑦) and assuming that the data are generated i. i. d. from such a distribution, the best possible hypothesis predicts

ℎ*(x) = 𝔼[𝑌 | 𝑋 = x]

This, in practice unattainable, hypothesis is called the Bayes' optimal predictor for the square loss.

Proof. We have:

min_ℎ 𝑅(ℎ) = min_ℎ 𝔼_{X,𝑌∼𝑃}[(𝑌 − ℎ(X))²]
           = min_ℎ 𝔼_X[ 𝔼_𝑌[(𝑌 − ℎ(x))² | X = x] ]
           = 𝔼_X[ min_{ℎ(x)} 𝔼_𝑌[(𝑌 − ℎ(x))² | X = x] ]

Hence the best possible hypothesis is the one which, for each x, finds the optimal prediction 𝑦*(x) defined as:

𝑦*(x) ∈ arg min_𝑦̂ 𝔼_𝑌[(𝑦̂ − 𝑌)² | X = x]

This means that we formally have to minimize the following expression:

𝑙(𝑦̂) := ∫ (𝑦̂ − 𝑦)² · 𝑝(𝑦 | x) 𝑑𝑦

which we can do by taking the derivative with respect to 𝑦̂ and setting it to zero. One gets that the necessary condition is given by:

∫ 𝑦̂ · 𝑝(𝑦 | x) 𝑑𝑦 = ∫ 𝑦 · 𝑝(𝑦 | x) 𝑑𝑦

where the first part is equal to 𝑦̂ and the second is 𝔼[𝑌 | X = x].

We will now study least squares estimation from a statistical perspective.


Before we dive into the analysis, we need to define some concepts.

Definition 6.0.1 (Parametric estimation) A parametric estimate of ℙ[𝑌 | X] is a parametric formula of the form

ℙ̂[𝑌 | X, Θ]

Definition 6.0.2 (Maximum conditional likelihood estimation) We want to estimate the optimal value of Θ, i. e.

Θ* = arg max_Θ ℙ̂[𝑦_1, . . . , 𝑦_𝑛 | x_1, . . . , x_𝑛, Θ]
   = arg max_Θ ∏_{𝑖=1}^𝑛 ℙ̂[𝑦_𝑖 | x_𝑖, Θ]
   = arg max_Θ log ∏_{𝑖=1}^𝑛 ℙ̂[𝑦_𝑖 | x_𝑖, Θ]
   = arg max_Θ Σ_{𝑖=1}^𝑛 log ℙ̂[𝑦_𝑖 | x_𝑖, Θ]
   = arg min_Θ − Σ_{𝑖=1}^𝑛 log ℙ̂[𝑦_𝑖 | x_𝑖, Θ]

Lemma 6.0.2 Under the conditional linear Gaussian assumption, maximizing the likelihood is equivalent to least squares optimization. That is, if we assume

𝑦_𝑖 ∼ N(w^𝑇 x_𝑖, σ²)

then we get

arg max_w ℙ[𝑦_1, . . . , 𝑦_𝑛 | x_1, . . . , x_𝑛, w] = arg min_w Σ_{𝑖=1}^𝑛 (𝑦_𝑖 − w^𝑇 x_𝑖)²

Proof. We have:

arg max_w ℙ[𝑦_1, . . . , 𝑦_𝑛 | x_1, . . . , x_𝑛, w]
= arg min_w − Σ_{𝑖=1}^𝑛 log ℙ̂[𝑦_𝑖 | x_𝑖, w]
= arg min_w Σ_{𝑖=1}^𝑛 [ (1/2) log(2πσ²) + (1/(2σ²)) (𝑦_𝑖 − w^𝑇 x_𝑖)² ]
= arg min_w (𝑛/2) log(2πσ²) + (1/(2σ²)) Σ_{𝑖=1}^𝑛 (𝑦_𝑖 − w^𝑇 x_𝑖)²
= arg min_w Σ_{𝑖=1}^𝑛 (𝑦_𝑖 − w^𝑇 x_𝑖)²

Hence we have that the maximum likelihood estimator is given by the


least square solution, assuming that the noise is i. i. d. Gaussian with
constant variance. This is useful since the maximum likelihood estimator
satisfies several nice statistical properties such as consistency (parameter
estimate converges to true parameters in probability), asymptotic effi-
ciency (smallest variance among all well-behaved estimators for large
𝑛) and asymptotic normality. Keep in mind that those properties are asymptotic (i. e. they hold for 𝑛 → ∞). For finite 𝑛 it is crucial to avoid overfitting!

6.1 Bias Variance Tradeoff


Definition 6.1.1 (Bias) Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. Models with high bias pay very little attention to the training data and oversimplify the model, which leads to high error on both training and test data. Bias can also be seen as the excess risk of the best model in our class compared to the minimal achievable risk knowing ℙ[X, Y]. Formally:

𝔼_X[ 𝔼_𝐷[ℎ̂_𝐷(X)] − ℎ*(X) ]

Definition 6.1.2 (Variance) Variance is the variability of the model prediction for a given data point, i. e. the risk incurred due to estimating the model from limited data. A model with high variance pays a lot of attention to the training data and does not generalize to data it has not seen before. As a result, such models perform very well on training data but have high error rates on test data. Formally:

𝔼_X[ Var_𝐷[ℎ̂_𝐷(X)] ] = 𝔼_X[ 𝔼_𝐷[ (ℎ̂_𝐷(X) − 𝔼_{𝐷'}[ℎ̂_{𝐷'}(X)])² ] ]

Definition 6.1.3 (Noise) The irreducible error, formally defined as

𝔼_{X,Y}[(Y − ℎ*(X))²]

Lemma 6.1.1 (Bias variance tradeoff) The expected prediction error is given by

Bias² + Variance + Noise

Proof. We have:

𝔼_𝐷 𝔼_{X,Y}[(Y − ℎ̂_𝐷(X))²] = 𝔼_X[ Var_𝐷[ℎ̂_𝐷(X)] ]
                            + 𝔼_X[ (𝔼_𝐷[ℎ̂_𝐷(X)] − ℎ*(X))² ]
                            + 𝔼_{X,Y}[(Y − ℎ*(X))²]

Ideally, we wish to find an estimator that minimizes both bias and variance.
However one should keep this idea in mind: the bias is a decreasing
function in terms of model complexity (i. e. with very complex models
we can achieve a very small bias), while the variance is an increasing
function in terms of model complexity (i. e. with a very complex model
the variance increases). Hence we need a model with a good balance
between bias and variance in order to minimize the prediction error. We
have that the maximum likelihood estimator (i. e. least squares) for linear
regression is unbiased. In fact, by choosing a proper polynomial degree,
we can fit all possible data sets. Moreover, as stated in the Gauss-Markov
theorem, this is the minimum variance estimator among all unbiased
ones. However, we have already seen that the least squares solution
can overfit. Thus we trade a little bit of bias for a potentially dramatic
reduction of variance. We have discussed regularization as a solution in
this sense. But how do this kind of tricks fit into the probabilistic view of
the situation?
The basic idea of regularization is penalizing large weights because we
believe that those are an indicator of overfitting. Hence we are implicitly
introducing assumptions about the weights, we assume that weights will
probably not be too large. From the statistical perspective, we can achieve
the same result by introducing prior assumptions about the probability
distribution.

Lemma 6.1.2 Ridge regression can be understood as finding the Maximum a


Posteriori (MAP) parameter estimate for a linear regression problem, assuming that the noise ℙ[𝑦 | w, x] is i. i. d. Gaussian and the prior ℙ[w] on the model
that the noise ℙ 𝑦| w , x is i. i. d. Gaussian and the prior ℙ [w] on the model
parameters w is Gaussian.

Proof. First we need to compute the posterior distribution of w using Bayes' rule:

ℙ[w | x_{1:𝑛}, 𝑦_{1:𝑛}] = ℙ[w | x_{1:𝑛}] ℙ[𝑦_{1:𝑛} | w, x_{1:𝑛}] / ℙ[𝑦_{1:𝑛} | x_{1:𝑛}]
                        = ℙ[w] ℙ[𝑦_{1:𝑛} | w, x_{1:𝑛}] / ℙ[𝑦_{1:𝑛} | x_{1:𝑛}]

We have to find the weights w that maximize this expression, under the assumption that they are normally distributed with mean zero and variance β². We get

arg max_w ℙ[w | x_{1:𝑛}, 𝑦_{1:𝑛}] = arg min_w [ − log ℙ[w] − log ℙ[𝑦_{1:𝑛} | w, x_{1:𝑛}] + log ℙ[𝑦_{1:𝑛} | x_{1:𝑛}] ]

where the second term is equal to (1/(2σ²)) Σ_{𝑖=1}^𝑛 (𝑦_𝑖 − w^𝑇 x_𝑖)² (up to constants), as shown in the proof of Lemma 6.0.2, and the third term does not depend on w. Hence we have to work on the first term only:
Hence we have to work on the first term only

− log ℙ[w] = − Σ_{𝑖=1}^𝑑 log ℙ[𝑤_𝑖]
           = − Σ_{𝑖=1}^𝑑 log( (1/√(2πβ²)) exp(−𝑤_𝑖² / (2β²)) )
           = (𝑑/2) log(2πβ²) + (1/(2β²)) Σ_{𝑖=1}^𝑑 𝑤_𝑖²
           = (1/(2β²)) ||w||₂² + O(1)

Hence our problem reduces to

arg min_w (1/(2β²)) ||w||₂² + (1/(2σ²)) Σ_{𝑖=1}^𝑛 (𝑦_𝑖 − w^𝑇 x_𝑖)²
= arg min_w (σ²/β²) ||w||₂² + Σ_{𝑖=1}^𝑛 (𝑦_𝑖 − w^𝑇 x_𝑖)²

which is ridge regression with parameter λ := σ²/β².

By changing our assumption regarding the distribution of the weights we get different regularizers, e. g. the Laplace distribution is the prior corresponding to Lasso regression.
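A small NumPy check of this correspondence on synthetic data (a sketch, not part of the original argument): the MAP estimate under a Gaussian prior with variance β² is the closed-form ridge solution with λ = σ²/β²:

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, beta = 200, 5, 0.5, 1.0                 # noise std sigma, prior std beta
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=sigma, size=n)

lam = sigma**2 / beta**2                             # lambda = sigma^2 / beta^2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # ridge / MAP estimate
w_mle = np.linalg.solve(X.T @ X, X.T @ y)                    # least squares / MLE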

6.2 Logistic Regression

So far we have discussed a probabilistic approach to regression. What


can we say about classification? In classification the risk is, considering
accuracy as metric, given by:

𝑅(ℎ) = 𝔼X ,𝑌 [[𝑌 ≠ ℎ(X)]]

If we unrealistically suppose that we knew 𝑃(X , 𝑌), the ℎ that minimizes


the risk would be given by the one that outputs the most probable class,
i.e.

ℎ ∗ (𝑥) = arg max ℙ 𝑌 = 𝑦| X = x


 
𝑦

A new model for classification is logistic regression, where we estimate


the probability that a given sample belongs to a certain class. We want
that a realisation that is positive and far away from the border will have
6 Probabilistic Approach to Supervised Learning 65

a very high probability, a realisation that is negative and far away from
the border will have a probability close to zero, and cases close to the
border will have a probability of around 0.5.

Definition 6.2.1 (Logistic regression) Logistic regression is a classification method that estimates the probability that an input is in the positive class. Formally

ℙ[𝑌 = 𝑦 | x] = σ(𝑦 w^𝑇 x) = 1 / (1 + exp(−𝑦 w^𝑇 x))

We replace the assumption of Gaussian noise that we used for regression with i. i. d. Bernoulli noise, i. e.

ℙ[𝑦 | w, x] = Ber(𝑦; σ(w^𝑇 x))

Lemma 6.2.1 The maximum likelihood estimator for logistic regression is obtained by minimizing the logistic loss

𝑅̂(w) = Σ_{𝑖=1}^𝑛 log(1 + exp(−𝑦_𝑖 w^𝑇 x_𝑖))

Proof. We have

ŵ ∈ arg max_w ℙ[𝐷 | w] = arg max_w ∏_{𝑖=1}^𝑛 ℙ[𝑦_𝑖 | x_𝑖, w]
                       = arg min_w − Σ_{𝑖=1}^𝑛 log ℙ[𝑦_𝑖 | x_𝑖, w]
                       = arg min_w − Σ_{𝑖=1}^𝑛 log( 1 / (1 + exp(−𝑦_𝑖 w^𝑇 x_𝑖)) )
                       = arg min_w Σ_{𝑖=1}^𝑛 log(1 + exp(−𝑦_𝑖 w^𝑇 x_𝑖))

A good property of the logistic loss is convexity, hence we can use stochas-
tic gradient descent in order to find (an arbitrarily good approximation
of) optimal weights.

Lemma 6.2.2 The gradient of the logistic loss is given by

∇_w log(1 + exp(−𝑦 w^𝑇 x)) = (1 / (1 + exp(−𝑦 w^𝑇 x))) · exp(−𝑦 w^𝑇 x) · (−𝑦 x)
                            = (exp(−𝑦 w^𝑇 x) / (1 + exp(−𝑦 w^𝑇 x))) · (−𝑦 x)
                            = (1 / (1 + exp(𝑦 w^𝑇 x))) · (−𝑦 x)
                            = −𝑦 x ℙ[𝑌 = −𝑦 | w, x]

Of course, in order to avoid overfitting, one can use a regularizer (either with the L2 or the L1 norm) and modify the above gradient accordingly. Once the model is trained one can compute the following in order to do the classification:

arg max_𝑦̂ ℙ[𝑦̂ | x, w] = arg max_𝑦̂ 1 / (1 + exp(−𝑦̂ w^𝑇 x))
                       = arg min_𝑦̂ exp(−𝑦̂ w^𝑇 x)
                       = arg min_𝑦̂ −𝑦̂ w^𝑇 x
                       = arg max_𝑦̂ 𝑦̂ w^𝑇 x
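A minimal NumPy sketch of (regularized) logistic regression trained with SGD, using the gradient of Lemma 6.2.2; labels are assumed to be in {−1, +1} and the hyperparameters are illustrative:

import numpy as np

def logistic_sgd(X, y, lam=0.0, eta=0.1, epochs=100, rng=np.random.default_rng(0)):
    # X: (n, d) features, y: (n,) labels in {-1, +1}, lam: L2 regularization weight
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            p_other = 1.0 / (1.0 + np.exp(y[i] * (w @ X[i])))  # P[Y = -y_i | x_i, w]
            grad = -y[i] * X[i] * p_other + 2 * lam * w        # gradient of logistic loss + L2 term
            w -= eta * grad
    return w
# prediction: sign(w @ x); class probability: 1 / (1 + np.exp(-(w @ x)))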

Logistic regression has some important variants that are worth mentioning:

I Kernelized logistic regression: find optimal weights with

α̂ = arg min_α Σ_{𝑖=1}^𝑛 log(1 + exp(−𝑦_𝑖 α^𝑇 K_𝑖)) + λ α^𝑇 K α

where K_𝑖 is the 𝑖-th column of the kernel matrix.
I Multi-class logistic regression: maintain one weight vector per class and estimate the probability of each class as

ℙ[𝑌 = 𝑖 | x, w_1, . . . , w_𝑐] = exp(w_𝑖^𝑇 x) / Σ_{𝑗=1}^𝑐 exp(w_𝑗^𝑇 x)

which corresponds to the loss function

𝑙(𝑦; x, w_1, . . . , w_𝑐) = − log ℙ[𝑌 = 𝑦 | x, w_1, . . . , w_𝑐]

Logistic regression is a classification method. So far we have discussed


other classification methods such as SVMs. A drawback of logistic
regression compared to SVMs is that the obtained solutions are often dense, but with logistic regression it is very easy to obtain class probabilities.

6.3 Bayesian Decision Theory

So far we have seen how we can interpret supervised learning as fitting probabilistic models of the data. Now we will discuss a framework that allows one to pick the best decision under uncertainty with the estimated models.

Definition 6.3.1 (Bayesian Decision Theory) Given a conditional distribution over labels ℙ[𝑦 | x], a set of actions A and a cost function C : 𝑌 × A → ℝ, Bayesian decision theory recommends to pick the action that minimizes the expected cost

𝑎* = arg min_{𝑎∈A} 𝔼_𝑦[C(𝑦, 𝑎) | x]

In general, if we had access to the true distribution ℙ[𝑦 | x], this implements the Bayes optimal decision. In practice, this probability can only be estimated, e. g. with logistic regression.

Example 6.3.1 Suppose one has estimated a logistic regression model for spam filtering and has obtained a probability 𝑝 for a given message to be spam. Further, suppose a set of three actions: spam (S), not spam (N), uncertain (U). Which one should be picked? First, one has to define a cost function, which is represented in the following table:

Action   Is spam   Is not spam
S        0         10
N        1         0
U        5         5

In this case one can compute the expected cost of each action and pick
the one with lower cost. In this case it holds:
I Cost of S: (1 − 𝑝) · 10
I Cost of N: 𝑝
I Cost of U: 𝑝 · 5 + (1 − 𝑝) · 5
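As a small illustration (a sketch using the hypothetical cost table of this example), the Bayesian decision simply compares these expected costs:

def best_action(p):
    # p: estimated probability that the message is spam
    expected_cost = {"S": (1 - p) * 10,          # classify as spam when it is not
                     "N": p * 1,                 # let a spam message through
                     "U": p * 5 + (1 - p) * 5}   # always costs 5
    return min(expected_cost, key=expected_cost.get)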

Example 6.3.2 (Optimal Decision for Logistic Regression) In the ex-


ample of logistic regression for binary classification we have:
I Estimated conditional distribution: ℙ̂ 𝑦| x = 𝐵𝑒𝑟(𝑦 ; 𝜎(w𝑇 x))
 

I Action set: A = {+1 , −1} 


I Cost function: 𝐶(𝑦, 𝑎) = 𝑦 ≠ 𝑎


Then the action that minimizes the expected cost is the most likely
class.

Example 6.3.3 (Asymmetric Cost) Consider the following situation:
I Estimated conditional distribution: ℙ̂[𝑦 | x] = Ber(𝑦; σ(w^𝑇 x))
I Action set: A = {+1, −1}
I Costs:

𝐶(𝑦, 𝑎) = 𝑐_FP if 𝑦 = −1 and 𝑎 = +1;  𝑐_FN if 𝑦 = +1 and 𝑎 = −1;  0 otherwise

Then, writing 𝑝 = ℙ̂[𝑌 = +1 | x], the expected costs for our set of actions are:
I 𝑐_+ = (1 − 𝑝) · 𝑐_FP
I 𝑐_− = 𝑝 · 𝑐_FN

and we have to pick the smallest one, i. e. we predict +1 if and only if 𝑝 > 𝑐_FP / (𝑐_FP + 𝑐_FN).

Example 6.3.4 (Asymmetric Cost for Regression) Consider the following situation:
I Estimated conditional distribution: ℙ̂[𝑦 | x] = N(𝑦; w^𝑇 x, σ²)
I Action set: A = ℝ
I Costs: 𝐶(𝑦, 𝑎) = 𝑐_1 max(𝑦 − 𝑎, 0) + 𝑐_2 max(𝑎 − 𝑦, 0)

This means that underestimating and overestimating have different costs. Then the action that minimizes the expected cost is

𝑎* = ŵ^𝑇 x + σ Φ^{−1}( 𝑐_1 / (𝑐_1 + 𝑐_2) )

Example 6.3.5 (Doubtful Logistic Regression) Consider the following situation:
I Estimated conditional distribution: ℙ̂[𝑦 | x] = Ber(𝑦; σ(w^𝑇 x))
I Action set: A = {+1, −1, 𝐷}
I Costs:

𝐶(𝑦, 𝑎) = [𝑦 ≠ 𝑎] if 𝑎 ∈ {+1, −1};  𝑐 if 𝑎 = 𝐷

Then the action that minimizes the expected cost is given by

𝑎* = 𝑦 if ℙ̂[𝑦 | x] ≥ 1 − 𝑐 (for the most likely class 𝑦);  𝐷 otherwise

That is, we pick the most likely class only if we are confident enough.

In Machine Learning there is a golden (business) rule: a labelled dataset is


money. In fact, obtaining data is relatively cheap, but obtaining labels
is more expensive. Hence minimizing the number of labels is a useful
goal.

Definition 6.3.2 (Active Learning) Active learning is a technique that uses algorithms in order to minimize the number of labels. A simple strategy is uncertainty sampling, which follows the principle of always picking the most uncertain example.

Algorithm 6.1: Uncertainty Sampling
1 Given: pool of unlabeled examples 𝐷_𝑋 = {x_1, . . . , x_𝑛}
2 Also maintain a labelled dataset 𝐷, initially empty
3 for 𝑡 = 1, 2, 3, . . .
4   Estimate ℙ̂[𝑌_𝑖 | x_𝑖] given current data 𝐷
5   𝑖_𝑡 ∈ arg min_𝑖 |0.5 − ℙ̂[𝑌_𝑖 | x_𝑖]|
6   Query label 𝑦_{𝑖_𝑡} and 𝐷 ← 𝐷 ∪ {(x_{𝑖_𝑡}, 𝑦_{𝑖_𝑡})}
7 end
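A small Python sketch of one round of uncertainty sampling; here `model` is any probabilistic classifier exposing a hypothetical predict_proba interface (e. g. a fitted logistic regression):

import numpy as np

def uncertainty_sampling_step(model, X_pool, labeled_idx):
    candidates = [i for i in range(len(X_pool)) if i not in labeled_idx]
    probs = model.predict_proba(X_pool[candidates])[:, 1]     # estimates of P[Y_i = 1 | x_i]
    i_t = candidates[int(np.argmin(np.abs(0.5 - probs)))]     # most uncertain example
    return i_t  # query the label y_{i_t} from an oracle and add (x_{i_t}, y_{i_t}) to D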

6.4 Generative Modeling

As a motivational example, think of the scenario of classification. If we want to classify a feature vector in the negative region which is far away from the decision boundary, a classification method such as logistic regression would be very confident in predicting that the point has a negative label. This happens even if the new point that has to be classified is an outlier, i. e. far away from all other negative examples that the model has seen so far.

This means that logistic regression can be overconfident about labels for
outliers.
So far, we have considered learning methods that estimate conditional distributions ℙ[𝑦 | x]. Such models don't attempt to estimate ℙ[x] and thus they will not be able to detect outliers, i. e. unusual points for which ℙ[x] is very small. These models are called discriminative models. Now we consider the so called generative models that aim to estimate the joint distribution ℙ[x, 𝑦]. Keep in mind that generative models are more powerful than discriminative models; in fact it is possible to derive ℙ[𝑦 | x] from ℙ[x, 𝑦] (we have ℙ[𝑦 | x] = ℙ[x, 𝑦] / ℙ[x]) but not vice versa.
   

The typical approach to generative modeling is attempting to infer the


process, according to which examples are generated and then do the
following:

I Estimate the prior on labels ℙ[𝑦]
I Estimate the conditional distribution ℙ[x | 𝑦] for each class 𝑦
I Obtain the predictive distribution using Bayes' rule

ℙ[𝑦 | x] = (1 / ℙ[x]) ℙ[𝑦] ℙ[x | 𝑦]

Usually, in the family of Naive classifiers, features are assumed to be conditionally independent given 𝑌, i. e. ℙ[X_1, . . . , X_𝑑 | 𝑌] = ∏_{𝑖=1}^𝑑 ℙ[X_𝑖 | 𝑌].

Example 6.4.1 (Gaussian Naive Bayes Classifier) Learning: given data 𝐷 = {(x_1, 𝑦_1), . . . , (x_𝑛, 𝑦_𝑛)}.

I MLE for the class prior: ℙ̂[𝑌 = 𝑦] = Count(𝑌 = 𝑦) / 𝑛
I MLE for the feature distribution: ℙ̂[𝑥_𝑖 | 𝑦] = N(𝑥_𝑖; μ̂_{𝑦,𝑖}, σ²_{𝑦,𝑖}) with

μ̂_{𝑦,𝑖} = (1 / Count(𝑌 = 𝑦)) Σ_{𝑗: 𝑦_𝑗=𝑦} 𝑥_{𝑗,𝑖}
σ²_{𝑦,𝑖} = (1 / Count(𝑌 = 𝑦)) Σ_{𝑗: 𝑦_𝑗=𝑦} (𝑥_{𝑗,𝑖} − μ̂_{𝑦,𝑖})²

Prediction given a new point x:

𝑦 = arg max_{𝑦'} ℙ̂[𝑦' | x] = arg max_{𝑦'} ℙ̂[𝑦'] ∏_{𝑖=1}^𝑑 ℙ̂[𝑥_𝑖 | 𝑦']

Here one could show that, in the case of binary classification and with the additional assumption of shared variance, the Gaussian Naive Bayes classifier produces a linear classifier of the same form as logistic regression. For the sake of brevity we omit this argument and point to the literature.
This model has some limitations, such as the fact that if the conditional independence assumption is violated (i. e. features are not generated independently) then the predictions might become overconfident. This might be fine if we are interested in the most likely outcome only, but it would hurt if we use this probability to make decisions. In order to improve on this issue, we introduce a new (more complex) model in the next example.
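Before moving on, here is a minimal NumPy sketch of the Gaussian Naive Bayes classifier of Example 6.4.1 (fit with the MLE formulas above, prediction done in log-space for numerical stability):

import numpy as np

def fit_gnb(X, y):
    # X: (n, d) features, y: (n,) integer class labels
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}          # P[Y = c]
    means  = {c: X[y == c].mean(axis=0) for c in classes}   # mu_{c,i}
    vars_  = {c: X[y == c].var(axis=0) for c in classes}    # sigma^2_{c,i}
    return classes, priors, means, vars_

def predict_gnb(x, classes, priors, means, vars_):
    def log_posterior(c):   # log P[c] + sum_i log N(x_i; mu_{c,i}, sigma^2_{c,i})
        return np.log(priors[c]) - 0.5 * np.sum(
            np.log(2 * np.pi * vars_[c]) + (x - means[c]) ** 2 / vars_[c])
    return max(classes, key=log_posterior)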

Example 6.4.2 (Gaussian Bayes Classifier) Learning: given data 𝐷 = {(x_1, 𝑦_1), . . . , (x_𝑛, 𝑦_𝑛)}.

I MLE for the class prior: ℙ̂[𝑌 = 𝑦] = Count(𝑌 = 𝑦) / 𝑛
I MLE for the feature distribution: ℙ̂[x | 𝑦] = N(x; μ̂_𝑦, Σ̂_𝑦) with

μ̂_𝑦 = (1 / Count(𝑌 = 𝑦)) Σ_{𝑖: 𝑦_𝑖=𝑦} x_𝑖
Σ̂_𝑦 = (1 / Count(𝑌 = 𝑦)) Σ_{𝑖: 𝑦_𝑖=𝑦} (x_𝑖 − μ̂_𝑦)(x_𝑖 − μ̂_𝑦)^𝑇

Prediction given a new point x:

𝑦 = arg max_{𝑦'} ℙ̂[𝑦' | x] = arg max_{𝑦'} ℙ̂[𝑦'] ℙ̂[x | 𝑦']

Given 𝑝 := ℙ[𝑌 = +1] and ℙ[x | 𝑦] = N(x; μ_𝑦, Σ_𝑦) we want to compute the discriminant

𝑓(x) = log( ℙ[𝑌 = +1 | x] / ℙ[𝑌 = −1 | x] )

This discriminant function is given by

𝑓(x) = log(𝑝 / (1 − 𝑝)) + (1/2) [ log(|Σ̂_−| / |Σ̂_+|) + (x − μ̂_−)^𝑇 Σ̂_−^{−1} (x − μ̂_−) − (x − μ̂_+)^𝑇 Σ̂_+^{−1} (x − μ̂_+) ]

By fixing 𝑝 = 0.5 and with the additional assumption Σ̂_− = Σ̂_+, one obtains a linear classifier known as Fisher's linear discriminant analysis which, as happened with the Gaussian Naive Bayes classifier, has the same form of class distribution as logistic regression. Without those further assumptions, we do quadratic discriminant analysis.

We have introduced generative modeling which is in contrast with the


discriminative models we had discussed before. This introduces some
trade-offs:

I Fisher's Linear Discriminant Analysis


• Is a generative model, i. e. it models ℙ [X , 𝑌]
• Can be used to detect outliers, i. e. ℙ [X] is small
• Assumes normality of X
• Not very robust against violation of this assumption
I Logistic regression
• Discriminative model, i. e. models ℙ [𝑌| X] only
• Cannot detect outliers
• Makes no assumptions on ℙ[X]
• More robust

Moreover, in the class of generative models, we have a naive classifier


which assumed conditional independence of features and a more involved

one which considered covariances. Also in this situation we have some


trade-offs:

I Naive Gaussian Bayes Models


• Conditional independence assumption may lead to overconfi-
dence
• Predictions might still be useful
• The number of parameters is in O(𝑐𝑑)
• Complexity (memory and inference) is linear in 𝑑
I General Gaussian Bayes Models
• Captures correlations among features
• Avoids overconfidence
• The number of parameters is in O(𝑐𝑑 2 )
• Complexity is quadratic in 𝑑
Unsupervised Learning
Classification 7
7.1 Clustering

In unsupervised clustering we are given a set of features vectors x1 , . . . , x𝑛 ,


where x𝑖 ∈ ℝ 𝑑 , and the goal is to group the given data points into clusters
such that similar points are in the same cluster and dissimilar points are
in different clusters. Different clustering approaches define their own
metric to describe the similarity between two points.

Applications
I Words clustering: Given a document, group the words based on
what they describe.
I Image Clustering: Given a set of images, group them based on their
features.
I Outlier Detection: Given a set of vectors, group them to find which
ones are outliers.
I Given a set of products, group them based on which type of
customer bought them.

Clustering Approaches
I Hierarchical Clustering separates the data points into small clusters
by distance (norm), then the small clusters are again separated
in coarser and coarser clusters until all the points are in one big
cluster (this way is bottom-up but could also be done top-down). By
representing each group of clusters with a node and connecting sub-
clusters with parent clusters we can represent the entire structure as
a hierarchical tree. Then by chopping the branches of the structure
at different heights we can get many small or few big clusters. Some
algorithms are single/average-linkage clustering.
I Partitional Clustering uses a graph data structure to connect data points depending on some cost function. Then, using different graph cuts (e. g. min-cut), we get different clusters. Some algorithms are spectral clustering or graph-cut based clustering.
I Model-Based Clustering We represent each cluster by a model (e. g.
the center, which means that we will assign to each point the closest
center), then for new points, we will infer the cluster by picking
which model fits best. Some algorithms are k-means clustering or
Gaussian mixture models.
Model-based clustering has the advantage that given a new unseen data
point we can easily apply the model and infer to which cluster it should
be part of. In hierarchical/partitional clustering we apply the structure
only on points that are already given and hence it’s less flexible. More
specifically we will look into k-means clustering.

7.2 K-Means Clustering


Goal
Given a set of feature vectors x_1, . . . , x_𝑛 where x_𝑖 ∈ ℝ^𝑑 (which can be represented as a matrix X ∈ ℝ^{𝑛×𝑑}), and a desired number of output clusters 𝑘 ∈ ℕ.

Output a set of cluster centers 𝝁_1, . . . , 𝝁_𝑘 where 𝝁_𝑖 ∈ ℝ^𝑑 (which can be represented as a matrix M ∈ ℝ^{𝑘×𝑑}) such that the empirical error

𝑅̂(M) = 𝑅̂(𝝁_1, . . . , 𝝁_𝑘) := Σ_{𝑖=1}^𝑛 min_{𝑗∈{1,...,𝑘}} ||x_𝑖 − 𝝁_𝑗||²_2    (7.1)

is minimal, i. e. find M̂ such that

M̂ = arg min_M 𝑅̂(M)    (7.2)

In simpler terms, we want to find 𝑘 cluster centers (𝝁1 , . . . , 𝝁 𝑘 ), such that


the squared distance between the cluster centers (which is how we define
the similarity) and all the points is minimal. Then given a new unseen
point x we will put it into the cluster with the closest center 𝝁 𝑗 .

The problem with this approach is that 𝑅ˆ is a non-convex function


(because of the min operator), and thus this optimization problem is
NP-hard (i. e. we cannot hope to solve it optimally in general). One solution is to use gradient
descent, however other than selecting the initial values, which if not
picked correctly might give us a bad solution, we would also have to
manually select the learning rate. A better solution is to use Lloyd’s
algorithm.

Lloyd’s algorithm is an iterative algorithm that is guaranteed to mono-


tonically decrease at each step and hence it will always converge to a local
optimum. The idea is that we initialize 𝑘 random centers 𝜇 𝑗 , then for each
point x𝑖 , at each iteration, we find the closest center index 𝑧 𝑖 ∈ {1 , . . . , 𝑘}
and update all previous means 𝝁 𝑗 to be the average point of all x𝑖 that
have 𝑗 as their closest center (i. e. 𝑧 𝑖 = 𝑗 ).

Algorithm 7.1: Lloyd's Algorithm
1 M^(0) = [𝝁_1^(0), . . . , 𝝁_𝑘^(0)]    ▷ Initialize cluster centers.
2 𝑡 ← 1
3 while not converged
4   𝑧_𝑖^(𝑡) ← arg min_{𝑗∈{1,...,𝑘}} ||x_𝑖 − 𝝁_𝑗^(𝑡−1)||²_2    ▷ Assign each x_𝑖 to the closest center.
5   𝝁_𝑗^(𝑡) ← (1/𝑛_𝑗) Σ_{𝑖: 𝑧_𝑖^(𝑡)=𝑗} x_𝑖    ▷ Set new center as mean of assigned points.
6   𝑡 ← 𝑡 + 1
7 end
8 return M^(𝑡) = [𝝁_1^(𝑡), . . . , 𝝁_𝑘^(𝑡)]

where 𝑛_𝑗 is the number of points in cluster 𝑗 (i. e. centered at 𝝁_𝑗), and 𝑖 : 𝑧_𝑖^(𝑡) = 𝑗 means all 𝑖 such that 𝑧_𝑖^(𝑡) = 𝑗. Each step has a computational complexity of O(𝑛𝑑𝑘).
Lloyd's algorithm is guaranteed to find a local minimum since its objective decreases monotonically at each step.

Lemma 7.2.1 (Lloyd's Monotonic Decrease) Let 𝑧_𝑖^(𝑡) ∈ {1, . . . , 𝑘} be the index of the closest center 𝝁_{𝑧_𝑖^(𝑡)} of vector x_𝑖 at step 𝑡 and 𝑅̂(𝝁, 𝑧) := Σ_{𝑖=1}^𝑛 ||x_𝑖 − 𝝁_{𝑧_𝑖}||²_2 be the error for a fixed assignment, then:

𝑅̂(𝝁^(𝑡), 𝑧^(𝑡)) ≥ 𝑅̂(𝝁^(𝑡+1), 𝑧^(𝑡+1))    (7.3)

Proof.

𝑅̂(𝝁^(𝑡), 𝑧^(𝑡)) ≥ 𝑅̂(𝝁^(𝑡), 𝑧^(𝑡+1)),    since 𝑧^(𝑡+1) = arg min_𝑧 𝑅̂(𝝁^(𝑡), 𝑧)    (7.4)
               ≥ 𝑅̂(𝝁^(𝑡+1), 𝑧^(𝑡+1)),    since 𝝁^(𝑡+1) = arg min_𝝁 𝑅̂(𝝁, 𝑧^(𝑡+1))    (7.5)
However, there are still a few problems:


I Exponential iterations: even if the algorithm is guaranteed to find a
local optimum it might take an exponential number of steps. This
problem manifests itself only in some rare cases, and hence can
be usually ignored. In fact, the number of steps to convergence is
usually very small.
(0)
I Initialization: how to initialize the centers 𝝁 𝑗 ? We have seen that
it’s guaranteed to converge but usually to local optima, what if
the local optima is really bad? This problem heavily depends on
initialization.
I Cluster number: how do we pick the number of clusters 𝑘 if we don’t
know into how many clusters our data can be separated?
I Cluster shape: we represent a cluster by a central point, however, it’s
not always the case that a cluster can be represented by a single mean
point. This can be solved with kernel-k-means clustering, similarly
to how we use kernels in supervised learning to fit non-linear
functions.

Initialization approaches
I Random Start: we can pick 𝑘 points among the x_𝑖 and set them as our initial
𝝁 𝑗 . However, if there are some large clusters and some small clusters
the probability of picking a point in the large cluster is much higher
and thus we might find a bad solution.
I Farthest Points Heuristic: instead of picking 𝑘 random points among
x𝑖 we pick one center at the time and for each new point if it’s
further away than the other centers it will have a higher probability
of being selected. This approach works really well if our data
doesn’t contain outliers, however if it does it will pick outliers with
a high probability and thus fail to find a good solution.

I K-Means++: is a variant of the farthest points heuristic where, instead of only considering points that are far away from the current centers (to solve the problem of picking points in the same cluster), we also increase the probability of picking points in clusters that have many points (to solve the problem of picking outlier points). K-Means++ is usually used as the standard initializer.

Algorithm 7.2: K-Means++ Initialization
1 𝑖_1 ∼ Uniform({1, . . . , 𝑛})    ▷ Pick first center randomly.
2 𝝁_1^(0) ← x_{𝑖_1}
3 for 𝑗 = 2, . . . , 𝑘
4   pick 𝑖_𝑗 with probability (1/𝑧) min_{𝑙∈{1,...,𝑗−1}} ||x_{𝑖_𝑗} − 𝝁_𝑙^(0)||²_2
5   𝝁_𝑗^(0) = x_{𝑖_𝑗}
6 end
7 return M^(0) = [𝝁_1^(0), . . . , 𝝁_𝑘^(0)]

This initialization technique, other than picking the initial points, already
gives us a good guess for the optimum without using Lloyd’s algorithm
or other model-based clustering algorithms.
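A small NumPy sketch of the K-Means++ initialization above (the normalization constant 𝑧 is handled implicitly by normalizing the squared distances):

import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    centers = [X[rng.integers(len(X))]]                          # first center uniformly at random
    for _ in range(1, k):
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])   # prob. proportional to min squared distance
    return np.array(centers)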

Lemma 7.2.2 (K-Means++ Log Competitive) If we pick M^(0) as our final guess for the centers, assuming that M^(0) is sampled from the random process above, then:

𝔼[𝑅̂(M^(0))] ≤ O(log(𝑘)) min_M 𝑅̂(M)    (7.7)

In simpler terms, if we pick M^(0) as our final guess, it is only a logarithmic factor away from the optimal k-means solution.

Cluster Number Selection In general, picking the number of clusters


𝑘 is very difficult. With supervised learning we know the type of model
that we want to fit. In unsupervised clustering we don’t really know. This
is still an unsolved problem, and the standard approaches usually are:
I Elbow Method Heuristic: we start with a low value of 𝑘 and then increase it until the error 𝑅̂(M) decreases by only a negligible amount, then pick the second-last guess (if you graph the error, the second-last guess is pointy and looks like an elbow).
I Regularization: instead of minimizing only 𝑅̂(M) we add a regularization term weighted as always by λ:

min_{𝑘, M_{1:𝑘}} 𝑅̂(M_{1:𝑘}) + λ𝑘    (7.8)

This is in fact equivalent to the elbow method.


Note that we can’t use cross-validation because by increasing 𝑘 the error
decreases continuously until 𝑘 = 𝑛 , where the error will be 0. But a
smaller error with a large 𝑘 is not what we are looking for, we want both
the error and 𝑘 to be as small as possible.
Regression 8

8.1 Dimension Reduction

Dimension reduction is a method that, given a set of vectors x_1, . . . , x_𝑛 where x_𝑖 ∈ ℝ^𝑑, returns some other vectors z_1, . . . , z_𝑛 where z_𝑖 ∈ ℝ^𝑘 with 𝑘 < 𝑑. In other words, we want to represent the same points/vectors in a smaller dimension without losing too much information. This method has many applications:
dimension reduction helps to shrink the dimension of the data to 3
or less and hence give us a visible or intuitive understanding.
I Regularization: shrinking the dimension of the data without losing
too much information is a form of regularization. If we give as
input to a model lower-dimensional data it will train faster and be
able to find the important characteristics with less work (i. e. reduce
the model complexity).
I Unsupervised Feature Discovery: similarly to the point above, by shrinking the dimension the features that are most important will remain encoded in the vectors.

8.2 Principal Component Analysis (PCA)


Goal
Given a set of centered^a feature vectors x_1, . . . , x_𝑛 where x_𝑖 ∈ ℝ^𝑑 (which can be represented as a matrix X ∈ ℝ^{𝑛×𝑑}), and the desired output dimension 𝑘 ∈ ℕ with 1 ≤ 𝑘 ≤ 𝑑.

Output the analogous set of dimensionally reduced vectors z_1, . . . , z_𝑛 where z_𝑖 ∈ ℝ^𝑘 (which can be represented as a matrix Z ∈ ℝ^{𝑛×𝑘}) and an orthogonal matrix W ∈ ℝ^{𝑑×𝑘} such that the empirical error

𝑅̂(W, Z) = 𝑅̂(W, z_1, . . . , z_𝑛) := Σ_{𝑖=1}^𝑛 ||Wz_𝑖 − x_𝑖||²_2    (8.1)

is minimal, i. e. find (Ŵ, Ẑ) such that:

(Ŵ, Ẑ) = arg min_{W,Z} 𝑅̂(W, Z)    (8.2)

a Centered means 𝝁 := Σ_{𝑖=1}^𝑛 x_𝑖 = 0.

The reason why we output not only the dimensionally reduced vectors in Z, but also the matrix W, is that given a new point x that is not in the initial dataset we can easily find the analogous dimensionally reduced vector z by only computing z := W^𝑇 x.¹ In other words, W^𝑇 is a transformation matrix from 𝑑-dimensional coordinates to 𝑘-dimensional coordinates, and W is a transformation matrix from 𝑘-dimensional coordinates to 𝑑-dimensional coordinates. Since the transformation W is trying to reconstruct higher dimensional vectors x from compressed lower dimensional vectors z, we must lose some information (except if 𝑘 = 𝑑, in which case we would reconstruct the exact same data). If we let x̄_𝑖 := Wz_𝑖 be the reconstructed version of x_𝑖, the goal of PCA is to lose as little information as possible, i. e. make sure that x_𝑖 is as close as possible to x̄_𝑖. More precisely, we want to find a matrix W such that when we take the compressed version z_𝑖 of x_𝑖, we can reconstruct the initial x_𝑖 with the least possible error, that is, minimize the reconstruction error ||x̄_𝑖 − x_𝑖||²_2 = ||Wz_𝑖 − x_𝑖||²_2 for all 𝑖. As always, we have defined the similarity/distance of two vectors as the squared norm.
The nice thing about PCA is that we can find the globally optimal solution (W*, Z*) by only using ideas from linear algebra; in fact we only have to find W*, then Z* := XW* (equivalently z_𝑖 = (W*)^𝑇 x_𝑖). So the idea is to somehow compute W* using X, and then compute Z* as described.

1: The coefficients z_𝑖 := W^𝑇 x_𝑖 of the projected vector x_𝑖 are called principal scores.

Minimizing Error to Maximizing Variance Before computing W* we will show how to convert the problem of minimizing the reconstruction error into a maximization problem.

Lemma 8.2.1 (Min Max PCA) Let the matrix W* be

W* = [w*_1  w*_2  . . .  w*_𝑘] ∈ ℝ^{𝑑×𝑘}    (8.3)

then to find w*_𝑗 ∈ ℝ^𝑑, which is called the 𝑗-th principal component/axis, we can solve either of the following two optimization problems:

w*_𝑗 = arg min_{w_𝑗^𝑇 w_𝑗 = 1, w_𝑗^𝑇 w_𝑙 = 0 ∀1≤𝑙<𝑗, z_1,...,z_𝑛} Σ_{𝑖=1}^𝑛 ||w_𝑗 𝑧_{𝑖,𝑗} − x_𝑖||²_2    (8.4)
     = arg max_{w_𝑗^𝑇 w_𝑗 = 1, w_𝑗^𝑇 w_𝑙 = 0 ∀1≤𝑙<𝑗} w_𝑗^𝑇 X^𝑇 X w_𝑗    (8.5)

Proof. In the following chain the constraints w_𝑗^𝑇 w_𝑗 = 1 and w_𝑗^𝑇 w_𝑙 = 0 ∀1 ≤ 𝑙 < 𝑗 are understood under every arg min/arg max:

w*_𝑗 = arg min_{z_1,...,z_𝑛} Σ_{𝑖=1}^𝑛 ||w_𝑗 𝑧_{𝑖,𝑗} − x_𝑖||²_2    (8.6)
     = arg min Σ_{𝑖=1}^𝑛 ||w_𝑗 w_𝑗^𝑇 x_𝑖 − x_𝑖||²_2,    Def. 𝑧_{𝑖,𝑗} := w_𝑗^𝑇 x_𝑖    (8.7)
     = arg min Σ_{𝑖=1}^𝑛 (w_𝑗 w_𝑗^𝑇 x_𝑖 − x_𝑖)^𝑇 (w_𝑗 w_𝑗^𝑇 x_𝑖 − x_𝑖),    Def. L2    (8.8)
     = arg min Σ_{𝑖=1}^𝑛 (x_𝑖^𝑇 w_𝑗 w_𝑗^𝑇 − x_𝑖^𝑇)(w_𝑗 w_𝑗^𝑇 x_𝑖 − x_𝑖)    (8.9)
     = arg min Σ_{𝑖=1}^𝑛 x_𝑖^𝑇 w_𝑗 (w_𝑗^𝑇 w_𝑗) w_𝑗^𝑇 x_𝑖 − 2 x_𝑖^𝑇 w_𝑗 w_𝑗^𝑇 x_𝑖 + x_𝑖^𝑇 x_𝑖,    with w_𝑗^𝑇 w_𝑗 = 1 and x_𝑖^𝑇 x_𝑖 = ||x_𝑖||²_2    (8.10)
     = arg min Σ_{𝑖=1}^𝑛 −x_𝑖^𝑇 w_𝑗 w_𝑗^𝑇 x_𝑖 + ||x_𝑖||²_2    (8.11)
     = arg min Σ_{𝑖=1}^𝑛 ||x_𝑖||²_2 − Σ_{𝑖=1}^𝑛 (w_𝑗^𝑇 x_𝑖)²,    first sum constant    (8.12)
     = arg max Σ_{𝑖=1}^𝑛 (w_𝑗^𝑇 x_𝑖)²,    min to max by switching sign    (8.13)
     = arg max Σ_{𝑖=1}^𝑛 w_𝑗^𝑇 x_𝑖 x_𝑖^𝑇 w_𝑗    (8.14)
     = arg max w_𝑗^𝑇 (Σ_{𝑖=1}^𝑛 x_𝑖 x_𝑖^𝑇) w_𝑗    (8.15)
     = arg max w_𝑗^𝑇 X^𝑇 X w_𝑗    (8.16)

The first minimization problem has the same form as our initial optimization problem, where instead of solving directly for W* we solve for each component individually. We add two additional constraints: the first constraint, w_𝑗^𝑇 w_𝑗 = ||w_𝑗||²_2 = 1, makes sure that our principal axis has length one, so that our solution is unique; the second constraint, w_𝑗^𝑇 w_𝑙 = 0, is called the orthogonality constraint and makes sure that all principal axes are orthogonal to each other. Lastly, note that by converting the problem into a maximization problem we can discard the dependence on Z, and hence the second form is less constrained.
This min/max duality has a nice geometric interpretation. The variance of the centered points x_𝑖 projected on a unit vector w_𝑗 is given by (1/𝑛) Σ_{𝑖=1}^𝑛 (x_𝑖^𝑇 w_𝑗)², which is the same as in 8.13 (up to an irrelevant factor of 1/𝑛 which doesn't affect the optimization); thus finding the unit vector w_𝑗 that maximizes the variance of the projected points² is the same as finding the unit vector that minimizes the reconstruction error of the projected points.

2: Assume that 𝑧_1, . . . , 𝑧_𝑛 are i. i. d. observations of a random variable Z, then 𝔼[Z] = 0 since the points are centered, and by the law of large numbers 𝕍[Z] = 𝔼[Z²] ≈ (1/𝑛) Σ_{𝑖=1}^𝑛 𝑧_𝑖² = (1/𝑛) Σ_{𝑖=1}^𝑛 (x_𝑖^𝑇 w_𝑗)².

Finding Optimum Now that we have an easier form of the optimization problem we can finally find W* by using the following lemma.

Lemma 8.2.2 (PCA) Let X ∈ ℝ^{𝑛×𝑑} be the feature matrix, 𝑗 ∈ {1, . . . , 𝑘}, and v_𝑗 be the eigenvector associated to the 𝑗-th largest^a eigenvalue^b λ_𝑗 of 𝚺 := (1/𝑛) X^𝑇 X (i. e. 𝚺v_𝑗 = λ_𝑗 v_𝑗)^c. Then w*_𝑗 = v_𝑗 with

w*_𝑗 = arg max_{w_𝑗^𝑇 w_𝑗 = 1, w_𝑗^𝑇 w_𝑙 = 0 ∀1≤𝑙<𝑗} w_𝑗^𝑇 X^𝑇 X w_𝑗    (8.19)

i. e. v_𝑗 is equal to the 𝑗-th principal component w*_𝑗.

a (λ_1 ≥ · · · ≥ λ_𝑘 ≥ · · · ≥ λ_𝑑)
b λ_𝑗 is called the principal eigenvalue of the 𝑗-th principal component.
c The matrix 𝚺 is called the covariance matrix of X; note that this definition is true only if the rows x_𝑖 are centered.

Proof. We will prove by induction on 𝑗 that the lemma holds for all 𝑗 ∈ {1, . . . , 𝑘}.
Base Case (𝑗 = 1): note that for 𝑗 = 1 there is no orthogonality constraint.

w*_1 = arg max_{w_1^𝑇 w_1 = 1} w_1^𝑇 X^𝑇 X w_1    (8.20)
     = arg max_{||w_1||_2 = 1} w_1^𝑇 𝑛𝚺 w_1,    Def. 𝚺 (the constant 𝑛 does not change the arg max)    (8.21)
     = arg max_{||w_1||_2 = 1} w_1^𝑇 𝚺 w_1    (8.22)
     = arg max_{||w_1||_2 = 1} w_1^𝑇 V𝚲V^𝑇 w_1,    Eigendecomposition    (8.23)
     = arg max_{||u_1||_2 = 1} u_1^𝑇 𝚲 u_1,    Let u_𝑗 := V^𝑇 w_𝑗    (8.24)
     = arg max_{||u_1||_2 = 1} Σ_{𝑖=1}^𝑑 λ_𝑖 𝑢²_{1,𝑖}    (8.25)

Note that 𝚺 := (1/𝑛) X^𝑇 X is a symmetric and positive semi-definite matrix and thus has an eigendecomposition of the form 𝚺 = V𝚲V^𝑇, where V ∈ ℝ^{𝑑×𝑑} is orthonormal, i. e. V^𝑇 V = VV^𝑇 = I, and 𝚲 = diag(λ_1, . . . , λ_𝑑) with λ_1 ≥ · · · ≥ λ_𝑑. Furthermore, V contains the eigenvectors of 𝚺 as columns. Note that since V is orthonormal we have that ||u_1||_2 = ||V^𝑇 w_1||_2 = ||w_1||_2 = 1. Finally, we have to pick a unit vector u_1 that maximizes the sum Σ_{𝑖=1}^𝑑 λ_𝑖 𝑢²_{1,𝑖}; since the eigenvalues λ_𝑖 are sorted from largest (λ_1) to smallest (λ_𝑑), the best we can do is set u*_1 = e_1, where e_1 = [1, 0, . . . , 0]^𝑇 ∈ ℝ^𝑑 (the first unit vector), such that our sum will equal λ_1, i. e. be as big as possible. Then to find the optimal w*_1:

u*_𝑗 := V^𝑇 w*_𝑗 ⇔ Vu*_𝑗 = VV^𝑇 w*_𝑗 = w*_𝑗    (since VV^𝑇 = I)    (8.27)
⇒ w*_1 = Vu*_1 = Ve_1 = v_1    (8.29)

where v_1 is the eigenvector of 𝚺 associated with the biggest eigenvalue λ_1, which is what we wanted to prove.


Induction Hypothesis:

v_𝑗 = arg max_{w_𝑗^𝑇 w_𝑗 = 1, w_𝑗^𝑇 w_𝑙 = 0 ∀1≤𝑙<𝑗} w_𝑗^𝑇 X^𝑇 X w_𝑗    (8.31)

where v_𝑗 is the unit eigenvector associated to the 𝑗-th largest eigenvalue of 𝚺.
Induction Step (𝑗 → 𝑗 + 1): We will show that

v_{𝑗+1} = arg max_{w_{𝑗+1}^𝑇 w_{𝑗+1} = 1, w_{𝑗+1}^𝑇 w_𝑙 = 0 ∀1≤𝑙<𝑗+1} w_{𝑗+1}^𝑇 X^𝑇 X w_{𝑗+1}    (8.32)

by using the Lagrangian of v_{𝑗+1}, parametrized by λ and η_𝑖:

L(v_{𝑗+1}) := v_{𝑗+1}^𝑇 X^𝑇 X v_{𝑗+1} − λ(v_{𝑗+1}^𝑇 v_{𝑗+1} − 1) + Σ_{𝑖=1}^𝑗 η_𝑖 v_{𝑗+1}^𝑇 v_𝑖    (8.33)

where the second term enforces the unitary constraint and the third the orthogonality constraints. Setting the gradient to zero:

∇L(v_{𝑗+1}) = 2X^𝑇 X v_{𝑗+1} − 2λ v_{𝑗+1} + Σ_{𝑖=1}^𝑗 η_𝑖 v_𝑖 = 0    (8.34)

By the induction hypothesis we know that all v_𝑙 with 𝑙 < 𝑗 are mutually orthogonal (i. e. v_𝑙^𝑇 v_𝑗 = 0); we have to prove that this also holds for 𝑙 < 𝑗 + 1. Observe that if we can prove that η_𝑙 = 0 for all 𝑙, then all the orthogonality terms vanish and the constraints are satisfied. Multiplying the gradient by v_𝑙^𝑇 for any 𝑙 < 𝑗 + 1:

0 = v_𝑙^𝑇 0    (8.37)
  = 2 v_𝑙^𝑇 X^𝑇 X v_{𝑗+1} − 2λ v_𝑙^𝑇 v_{𝑗+1} + Σ_{𝑖=1}^𝑗 η_𝑖 v_𝑙^𝑇 v_𝑖,    v_𝑙^𝑇 v_{𝑗+1} = 0 by the constraint, v_𝑙^𝑇 v_𝑖 = 1 only if 𝑙 = 𝑖    (8.38)
  = 2 v_𝑙^𝑇 X^𝑇 X v_{𝑗+1} + η_𝑙    (8.39)
  = 2 (X^𝑇 X v_𝑙)^𝑇 v_{𝑗+1} + η_𝑙    (8.40)
  = 2 (λ_𝑙 v_𝑙)^𝑇 v_{𝑗+1} + η_𝑙,    Def. eigenvector (IH)    (8.41)
  = 2 λ_𝑙 v_𝑙^𝑇 v_{𝑗+1} + η_𝑙 = η_𝑙    (8.42)

Finally, by plugging η_𝑙 = 0 back into the gradient of the Lagrangian we get that:

2X^𝑇 X v_{𝑗+1} − 2λ v_{𝑗+1} = 0 ⇔ X^𝑇 X v_{𝑗+1} = λ v_{𝑗+1}    (8.44)

hence λ is by definition an eigenvalue of X^𝑇 X with eigenvector v_{𝑗+1}. Since the objective value at such a solution is proportional to λ, and the solution must be orthogonal to v_1, . . . , v_𝑗, the maximum is attained by the eigenvector associated to the (𝑗 + 1)-th largest eigenvalue.

Computing Optimum We can usually solve this optimization problem in two different ways:
I Eigendecomposition: as we have seen, we can compute the matrix 𝚺 := (1/𝑛) X^𝑇 X and then apply an eigendecomposition on 𝚺 to get 𝚺 = V𝚲V^𝑇. Then we set W* to the first 𝑘 columns of V.
I Singular Value Decomposition: has the same effect as the eigendecomposition, but the computation is more direct since we don't have to compute the matrix 𝚺. By applying a singular value decomposition on X we get:

X = USV^𝑇    (8.46)

where U ∈ ℝ^{𝑛×𝑛} contains the eigenvectors of XX^𝑇 in its columns, V ∈ ℝ^{𝑑×𝑑} contains the eigenvectors of X^𝑇 X in its columns, and S = diag(σ_1, . . . , σ_𝑑) ∈ ℝ^{𝑛×𝑑} with σ_1 ≥ · · · ≥ σ_𝑑, where σ_𝑖² = λ_𝑖 and λ_𝑖 is the 𝑖-th largest eigenvalue of X^𝑇 X. Since V already contains the sorted eigenvectors of X^𝑇 X, we again set W* to the first 𝑘 columns of V.
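A minimal NumPy sketch of PCA via the SVD, returning the first 𝑘 principal components and the principal scores:

import numpy as np

def pca(X, k):
    # X: (n, d) data matrix (one sample per row)
    X = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:k].T                                # top-k right singular vectors as columns, W* (d, k)
    Z = X @ W                                   # principal scores, Z (n, k)
    return W, Z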

Dimension Selection Choosing the value of 𝑘 depends on the applica-


tion.
I Visualization: to visualize the data clearly we can only pick 𝑘 ∈
{1, 2, 3}.
I Feature Induction: if the dimension reduced features are given as
input to a supervised learning algorithm we can use cross validation
on 𝑘 to find which one performs the best.
I Elbow Method Heuristic: we can start with a low 𝑘 and then increase
it until most of the variance of the data is accounted by the principal
components.

8.3 Kernel PCA

If our feature vectors x_1, . . . , x_𝑛 are non-linearly separable and we use PCA to project our 𝑑-dimensional data points to 𝑘 < 𝑑 dimensions, we will get a bad representation. Similarly to supervised learning, we can apply a mapping φ(x_𝑖) = x̃_𝑖 with φ : ℝ^𝑑 → ℝ^{𝑑'} that will increase the initial dimension of the data to 𝑑' > 𝑑, such that when we apply PCA to project the data to a lower dimension 𝑘 < 𝑑' we get a linearly separable representation. Recall that to avoid the feature explosion we never actually apply the function φ, but instead we use the kernel trick.

Lemma 8.3.1 (Kernel PCA Objective) Let X̃ ∈ ℝ^{𝑛×𝑑'} be the transformed feature matrix (with rows x̃_𝑖 = φ(x_𝑖)). Then the following optimization problems are equivalent:

w*_𝑗 = arg max_{w_𝑗^𝑇 w_𝑗 = 1, w_𝑗^𝑇 w_𝑙 = 0 ∀1≤𝑙<𝑗} w_𝑗^𝑇 X̃^𝑇 X̃ w_𝑗    (8.47)
α*_𝑗 = arg max_{α_𝑗^𝑇 Kα_𝑗 = 1, α_𝑗^𝑇 Kα_𝑙 = 0 ∀1≤𝑙<𝑗} α_𝑗^𝑇 K^𝑇 K α_𝑗    (8.48)

where α_𝑗 ∈ ℝ^𝑛 and K ∈ ℝ^{𝑛×𝑛} is the kernel matrix^a of the features x_1, . . . , x_𝑛.

a Recall 𝐾_{𝑖,𝑗} := k(x_𝑖, x_𝑗)

Proof. Assume that any principal component can be written as a linear combination of the form w_𝑗 = Σ_{𝑘=1}^𝑛 α_𝑘^(𝑗) x̃_𝑘 for some α_𝑗 := (α_1^(𝑗), . . . , α_𝑛^(𝑗)) ∈ ℝ^𝑛, where 𝑗 ∈ {1, . . . , 𝑘}. Then:

Objective

w_𝑗^𝑇 X̃^𝑇 X̃ w_𝑗 = Σ_{𝑖=1}^𝑛 (w_𝑗^𝑇 x̃_𝑖)²    (8.49)
               = Σ_{𝑖=1}^𝑛 ( (Σ_{𝑘=1}^𝑛 α_𝑘^(𝑗) x̃_𝑘)^𝑇 x̃_𝑖 )²    (8.50)
               = Σ_{𝑖=1}^𝑛 ( Σ_{𝑘=1}^𝑛 α_𝑘^(𝑗) x̃_𝑘^𝑇 x̃_𝑖 )²    (8.51)
               = Σ_{𝑖=1}^𝑛 ( Σ_{𝑘=1}^𝑛 α_𝑘^(𝑗) k(x_𝑘, x_𝑖) )²    (8.52)
               = Σ_{𝑖=1}^𝑛 (α_𝑗^𝑇 K_𝑖)²    (8.53)
               = α_𝑗^𝑇 K^𝑇 K α_𝑗    (8.54)

Constraints

w_𝑗^𝑇 w_𝑙 = ( Σ_{𝑘_1=1}^𝑛 α_{𝑘_1}^(𝑗) x̃_{𝑘_1} )^𝑇 ( Σ_{𝑘_2=1}^𝑛 α_{𝑘_2}^(𝑙) x̃_{𝑘_2} )    (8.55)
         = Σ_{𝑘_1=1}^𝑛 Σ_{𝑘_2=1}^𝑛 α_{𝑘_1}^(𝑗) α_{𝑘_2}^(𝑙) x̃_{𝑘_1}^𝑇 x̃_{𝑘_2}    (8.56)
         = Σ_{𝑘_1=1}^𝑛 Σ_{𝑘_2=1}^𝑛 α_{𝑘_1}^(𝑗) α_{𝑘_2}^(𝑙) k(x_{𝑘_1}, x_{𝑘_2})    (8.57)
         = α_𝑗^𝑇 K α_𝑙    (8.58)

Lemma 8.3.2 (Kernel PCA) Let K ∈ ℝ^{𝑛×𝑛} be the kernel matrix of the features x_1, . . . , x_𝑛, 𝑗 ∈ {1, . . . , 𝑘}, and v_𝑗 be the eigenvector associated to the 𝑗-th largest eigenvalue λ_𝑗 of K. Then α*_𝑗 = (1/√λ_𝑗) v_𝑗 with

α*_𝑗 = arg max_{α_𝑗^𝑇 Kα_𝑗 = 1, α_𝑗^𝑇 Kα_𝑙 = 0 ∀1≤𝑙<𝑗} α_𝑗^𝑇 K^𝑇 K α_𝑗    (8.59)

Computing Kernel PCA To compute the kernelized PCA of some (centered) feature matrix X we apply the following steps:
1. Pick any kernel, even possibly infinite dimensional kernels (e. g. a Gaussian kernel k(x_𝑖, x_𝑗) = exp(−||x_𝑖 − x_𝑗||²_2 / ℎ²) for ℎ ∈ ℝ).
2. Compute the kernel matrix using the selected kernel: 𝐾_{𝑖,𝑗} = k(x_𝑖, x_𝑗).
3. Apply kernel PCA by computing the eigendecomposition K = V𝚲V^𝑇, then α_𝑗 = (1/√λ_𝑗) v_𝑗 where v_𝑗 is the 𝑗-th column of V and λ_𝑗 is the 𝑗-th diagonal entry of 𝚲.
4. Finally, pick some output dimension 𝑘; to find the reduced vector z = (𝑧^(1), . . . , 𝑧^(𝑘)) ∈ ℝ^𝑘 (principal scores) of some x we compute 𝑧^(𝑖) = Σ_{𝑗=1}^𝑛 α_𝑗^(𝑖) k(x_𝑗, x).
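A compact NumPy sketch of these four steps with a Gaussian kernel (the bandwidth ℎ is an illustrative parameter), applied back to the training points themselves:

import numpy as np

def kernel_pca(X, k, h=1.0):
    # X: (n, d) data matrix; returns alphas (n, k) and the principal scores Z (n, k)
    n = len(X)
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / h**2)   # kernel matrix
    E = np.full((n, n), 1.0 / n)
    K = K - K @ E - E @ K + E @ K @ E                                    # center the kernel matrix
    lams, V = np.linalg.eigh(K)                                          # eigenvalues in ascending order
    lams, V = lams[::-1][:k], V[:, ::-1][:, :k]                          # keep the top-k
    alphas = V / np.sqrt(np.maximum(lams, 1e-12))                        # alpha_j = v_j / sqrt(lambda_j)
    Z = K @ alphas                                                       # z^(i) = sum_j alpha_j^(i) k(x_j, x)
    return alphas, Z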

Notes
1. Kernel K-Means: if we want to cluster some 𝑑-dimensional features that are not linearly separable, we can apply kernel PCA on the features (e. g. with some infinite dimensional kernel) with 𝑘 = 𝑑. By having 𝑘 = 𝑑 and an infinite dimensional kernel we project the initial data to an infinite dimensional space, and then back to the initial dimension 𝑑 with kernel PCA.
2. Centering the Kernel: even if our initial features in X are centered, we may get a non-centered kernel matrix K. To solve this problem it is good practice to recenter it as: K' = K − KE − EK + EKE where 𝐸_{𝑖,𝑗} = 1/𝑛, E ∈ ℝ^{𝑛×𝑛}.
3. Uses: kernel PCA is a very useful method to discover non-linear features before applying any model, including supervised methods (SVMs, neural networks, . . . ).

8.4 Autoencoders

Autoencoders are an application of neural networks to unsupervised


dimension reduction. The key idea works as follows: build a multi-
layer neural network 𝑓 such that the input and output dimension 𝑑
are the same, then feed some feature vector as input, and train the
network to reconstruct the input vector as output, that is find 𝑓 such that
𝑓 (x , 𝜽) = ŷ ≈ x. How does this network reduce the dimension of x? The
idea is that we will build the network such that it has some hidden layer
h(𝑙) ∈ ℝ 𝑘 where 𝑘 < 𝑑 (i. e. some bottleneck). Then, after the network is
trained with back-propagation to reconstruct the input as accurately as
possible, if we feed some vector x, the vector z ..= h(𝛽) must contain a
dimensionally reduced representation of x.

Mathematical View

Goal
Given a set of feature vectors x_1, . . . , x_𝑛 where x_𝑖 ∈ ℝ^𝑑 (which can be represented as a matrix X ∈ ℝ^{𝑛×𝑑}), and the desired output dimension 𝑘 ∈ ℕ with 1 ≤ 𝑘 ≤ 𝑑.

Output the analogous set of dimensionally reduced vectors z_1, . . . , z_𝑛 where z_𝑖 ∈ ℝ^𝑘 (which can be represented as a matrix Z ∈ ℝ^{𝑛×𝑘}) and the parameters 𝜽 such that

𝑓(x; 𝜽) = 𝑓^(𝐿)(· · · 𝑓^(𝛽)(· · · 𝑓^(1)(x; 𝜽) · · · ; 𝜽) · · · ; 𝜽)    (8.60)

(layers 1 to 𝛽 form the encoder, layers 𝛽 + 1 to 𝐿 the decoder) satisfies the following optimization for some loss function l★:

𝜽̂ = arg min_{𝜽∈ℝ^𝐷} (1/(𝑛𝑑)) Σ_{𝑖=1}^𝑛 Σ_{𝑗=1}^𝑑 l★(𝑥_{𝑖,𝑗}, 𝑓(𝜽; x_𝑖)_𝑗)    (8.61)

More precisely:

h^(1) := 𝑓^(1)(x; W^(1), b^(1)) = φ^(1)(xW^(1) + b^(1))    (8.62)
h^(2) := 𝑓^(2)(h^(1); W^(2), b^(2)) = φ^(2)(h^(1)W^(2) + b^(2))    (8.63)
...
h^(𝛽) := 𝑓^(𝛽)(h^(𝛽−1); W^(𝛽), b^(𝛽)) = φ^(𝛽)(h^(𝛽−1)W^(𝛽) + b^(𝛽))    (8.64)
...
h^(𝐿) := 𝑓^(𝐿)(h^(𝐿−1); W^(𝐿), b^(𝐿)) = φ^(𝐿)(h^(𝐿−1)W^(𝐿) + b^(𝐿))    (8.65)

where the layer 𝛽 is the bottleneck and z := h^(𝛽) ∈ ℝ^𝑘.

Graph View

Example 8.4.1 (Autoencoder Graph View) In this example we have
d = 4 and k = 2; the architecture is as simple as possible, i. e. only
one encoding and one decoding function, with no biases.
The autoencoder neural network graph view is then:

[Network graph: input layer (x₁, x₂, x₃, x₄), connected by W^(1) to hidden layer 1 with units h₁^(1), h₂^(1) (the bottleneck), connected by W^(2) to the output layer (ŷ₁, ŷ₂, ŷ₃, ŷ₄).]

Here ŷ should be as close as possible to x after training, and if we
feed x into the trained network, z = h^(1) is its dimension-reduced
representation.

Notes
I Autoencoders and PCA: if we pick the identity function as activation
function φ^(l) for all layers, the autoencoder recovers essentially the
same result as PCA (it learns the same k-dimensional subspace). If,
instead, we use non-linear functions φ^(l), the autoencoder will usually
find a better compression than PCA for the same k. The downside is
that the optimization is non-convex and thus relies heavily on the
initialization of the weights and biases.
I Denoising Autoencoders: a very interesting application of autoencoders
is denoising. Denoising is a procedure in which we add a noise vector n
to each input, x′ ..= x + n, and then train the autoencoder to reconstruct
the original x. Since the bottleneck is forced to store only the important
features of x′, the noise is removed in favor of more important
characteristics. Denoising has many applications, one of which is image
processing. A minimal sketch is given below.
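A sketch of the denoising variant, mirroring the linear autoencoder above but feeding the noisy x′ while keeping the clean x as the reconstruction target (the noise level and the low-dimensional toy data are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 8, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))    # toy data lying in a k-dim subspace
X_noisy = X + rng.normal(scale=0.3, size=(n, d))         # x' = x + n

W1 = rng.normal(scale=0.1, size=(d, k))
W2 = rng.normal(scale=0.1, size=(k, d))
lr = 0.5

for _ in range(3000):
    Z = X_noisy @ W1                          # encode the *noisy* input
    R = Z @ W2 - X                            # ...but reconstruct the *clean* x
    gW2 = Z.T @ R * (2 / (n * d))
    gW1 = X_noisy.T @ (R @ W2.T) * (2 / (n * d))
    W1 -= lr * gW1
    W2 -= lr * gW2

X_denoised = X_noisy @ W1 @ W2
print(np.mean((X_noisy - X) ** 2), np.mean((X_denoised - X) ** 2))   # error before vs. after
```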
9 Probabilistic Approach to Unsupervised Learning
9.1 Mixture Distribution

To understand mixture models consider the following experiment: we


are given a set of vectors x1 , . . . , x𝑁 with x𝑛 ∈ ℝ 𝐷 , and we are told that
those vectors were drawn from 𝐾 distinct distributions with probability
densities 𝑓1 , . . . , 𝑓𝐾 with 𝑓 𝑘 : ℝ 𝐷 → [0 , 1]. Let X be the random vector
and x_n for n ∈ {1, . . . , N} be realizations of X (where X = [X₁ · · · X_D]^T,
and the X_d for d ∈ {1, . . . , D} are random variables). The goal will be to
find the probability density function f_X of X; clearly it cannot be a single
f_k, since each of them specifies only one of those distributions, so it
must be something more complex. If we knew, for each x_n, which of the
K distributions X was drawing from for that particular realization, then
we could just pick the right f_k, but we have no idea which of the K
distributions is chosen each time.
To solve this problem we introduce a new random variable Z that
takes the value k when X is drawing from the k-th distribution. In
other words, we assume that Z tells us from which distribution X is
drawing. Clearly we don't know how X chooses among the different
distributions, and thus we don't know the probability density function
f_Z(k) = ℙ[Z = k] : {1, . . . , K} → [0, 1] of Z, but we assume that we do
and see where this takes us.¹ To reformulate what we have seen before
in terms of Z: ℙ[X = x | Z = k] = f_{X|Z}(x | k, 𝜽_k) = f_k(x | 𝜽_k) (where 𝜽_k
are the parameters of the distribution f_k), i. e. if we know that the k-th
distribution has been chosen, then we know that X is distributed as f_k.

¹: Random variables that are not observed, like Z, are usually called latent variables.
Using the previous assumption we can now evaluate the marginal
distribution of X and thus find its probability density function:

    ℙ[X = x] ..= Σ_{k=1}^{K} ℙ[Z = k] ℙ[X = x | Z = k]        (9.1)
               = Σ_{k=1}^{K} f_Z(k) f_k(x | 𝜽_k)        (9.2)

where the distribution of X is parametrized by 𝜽 i. e. 𝑓X (x | 𝜽) = ℙ[X = x]


with 𝜽 = {𝜽1 , . . . , 𝜽𝐾 }. Recall that ℙ[Z = 𝑘] is unknown and thus we can
take it one step further and let 𝜋 𝑘 ..= ℙ[Z = 𝑘] also be parameters of the
probability density function of X, i. e. 𝜽 = {𝜽1 , . . . , 𝜽𝐾 , 𝜋1 , . . . , 𝜋𝐾 } hence
removing the dependency on Z. Note that since the π_k represent
probabilities, Σ_{k=1}^{K} π_k = 1.

Definition 9.1.1 (Mixture Distribution) Let 𝑓 𝑘 (x | 𝜽𝑘 ) : ℝ 𝐷 → [0 , 1] for


𝑘 ∈ {1, . . . , 𝐾} be 𝐾 probability density functions of different distributions,
then we define their mixture distribution as:

    f_X(x | 𝜽) ..= Σ_{k=1}^{K} π_k f_k(x | 𝜽_k)        (9.3)

with 𝜽 = {𝜽₁, . . . , 𝜽_K, π₁, . . . , π_K} as the parameters of f_X,ᵃ where the π_k
are called mixture weights and the f_k are called mixture components.

a: We have that X ∼ f_X(𝜽).
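As a quick illustration of Definition 9.1.1, the following Python sketch samples from and evaluates a two-component 1-D Gaussian mixture; the weights, means, and variances are made-up values for the example:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# made-up mixture parameters: theta = {theta_1, theta_2, pi_1, pi_2}
pis = np.array([0.3, 0.7])            # mixture weights, sum to 1
mus = np.array([-2.0, 3.0])           # component means
sigmas = np.array([0.5, 1.5])         # component standard deviations

def f_X(x):
    """Mixture density f_X(x | theta) = sum_k pi_k f_k(x | theta_k)."""
    return sum(p * norm.pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))

# sampling: first draw the latent Z, then X from the chosen component
N = 5
z = rng.choice(len(pis), size=N, p=pis)     # Z = k with probability pi_k
x = rng.normal(mus[z], sigmas[z])           # X | Z = k  ~  N(mu_k, sigma_k^2)
print(z, x, f_X(x))
```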

9.2 Gaussian Mixtures Model

A Gaussian mixture model is a model used for unsupervised clustering.
To understand how it works, let us go back to the previous notion of
mixture distribution. We will make the assumption that the distributions
from which x1 , . . . , x𝑁 are drawn are 𝐾 multivariate Gaussians, i. e.
𝑓 𝑘 (x | 𝜽𝑘 ) ..= N (x | 𝝁 𝑘 , 𝚺 𝑘 ) for 𝑘 ∈ {1 , . . . , 𝐾}. In this case we have
decided to use the same distribution with different parameters for all of
the 𝐾 mixture components, but different models exist that use different
assumptions. This will give the following mixture distribution for X:

    f_X(x | 𝜽) ..= Σ_{k=1}^{K} π_k N(x | 𝝁_k, 𝚺_k)        (9.4)

with 𝜽 = {𝝅, 𝝁₁, . . . , 𝝁_K, 𝚺₁, . . . , 𝚺_K}. It's important to realize that if
we had 𝜽, then we could also compute γ_k(x) ..= ℙ[Z = k | X = x], i. e.
for any new x, the probability that it belongs to mixture k. This function
γ_k(x) is crucial for clustering since it tells us which cluster (mixture) x
most probably belongs to. Note that we can evaluate γ_k(x) using
Bayes' theorem as follows:

    γ_k(x) ..= ℙ[Z = k | X = x] = ℙ[Z = k] ℙ[X = x | Z = k] / Σ_{k′=1}^{K} ℙ[Z = k′] ℙ[X = x | Z = k′]        (9.5)
             = π_k N(x | 𝝁_k, 𝚺_k) / Σ_{k′=1}^{K} π_{k′} N(x | 𝝁_{k′}, 𝚺_{k′})        (9.6)

We can now write the concrete goal of a Gaussian Mixture model.

Goal
Given a set of feature vectors x₁, . . . , x_N where x_n ∈ ℝ^D (which can be
represented as a matrix X ∈ ℝ 𝑁×𝐷 ), and a desired number of clusters
𝐾 ∈ ℕ.

Output the parameters 𝜽 = {𝝅, 𝝁1 , . . . , 𝝁𝐾 , 𝚺1 , . . . , 𝚺𝐾 } with 𝝅 ∈ ℝ 𝐾 ,


𝝁 𝑘 ∈ ℝ 𝐷 , 𝚺 𝑘 ∈ ℝ 𝐷×𝐷 that characterize

    γ_k(x) ..= π_k N(x | 𝝁_k, 𝚺_k) / Σ_{k′=1}^{K} π_{k′} N(x | 𝝁_{k′}, 𝚺_{k′})        (9.7)

Now that we have clearly defined our goal only one part is missing: the
estimation of the parameters 𝜽 given the concrete realizations x1 , . . . , x𝑁 .

To do so we will evaluate the maximum likelihood estimate (MLE)
of f_X(x | 𝜽) = ℙ[X = x], which gives us the following log likelihood
function:

    ln ℙ[X₁ = x₁, . . . , X_N = x_N]  =(i. i. d.)  ln ∏_{n=1}^{N} ℙ[X_n = x_n]        (9.8)
                                      = Σ_{n=1}^{N} ln ℙ[X_n = x_n]        (9.9)
                                      = Σ_{n=1}^{N} ln Σ_{k=1}^{K} π_k N(x_n | 𝝁_k, 𝚺_k)        (9.10)
                                      =.. LL(𝜽)        (9.11)
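For reference, LL(𝜽) can be computed directly; a small Python sketch using scipy.stats.multivariate_normal, where the parameter values and data are placeholders:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pis, mus, Sigmas):
    """LL(theta) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pis)
    dens = np.zeros((N, K))
    for k in range(K):
        dens[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    return np.sum(np.log(dens.sum(axis=1)))

# placeholder data and parameters with D = 2, K = 2
X = np.random.default_rng(0).normal(size=(100, 2))
pis = np.array([0.5, 0.5])
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print(log_likelihood(X, pis, mus, Sigmas))
```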

Then since 𝜽 = {𝝅, 𝝁₁, . . . , 𝝁_K, 𝚺₁, . . . , 𝚺_K} we have to differentiate LL
with respect to π_k, 𝝁_k, 𝚺_k, which gives:

    ∂/∂π_k [LL(𝜽) + λ(Σ_{k=1}^{K} π_k − 1)] = 0   ⇒   π_k = (Σ_{n=1}^{N} γ_k(x_n)) / N        (9.12)

    ∂/∂𝝁_k [LL(𝜽)] = 0   ⇒   𝝁_k = (Σ_{n=1}^{N} γ_k(x_n) x_n) / (Σ_{n=1}^{N} γ_k(x_n))        (9.13)

    ∂/∂𝚺_k [LL(𝜽)] = 0   ⇒   𝚺_k = (Σ_{n=1}^{N} γ_k(x_n) (x_n − 𝝁_k)(x_n − 𝝁_k)^T) / (Σ_{n=1}^{N} γ_k(x_n))        (9.14)

For the first parameter we use a Lagrangian to enforce the constraint
that Σ_{k=1}^{K} π_k = 1. Usually at this point we would solve the system
of equations for π_k, 𝝁_k, 𝚺_k to find the maximum of the log likelihood.
However, in this case the equations are coupled and hard to solve
jointly; we thus use an iterative algorithm called Soft EM that works
as follows:

Algorithm 9.1: Soft EM Algorithm
1  Initialize the means 𝝁_k, covariances 𝚺_k, and mixing coefficients π_k for all k ∈ {1, . . . , K}.
2  while not converged
3      for each n ∈ {1, . . . , N}                                                      ▷ E step
4          γ_k(x_n) ..= π_k N(x_n | 𝝁_k, 𝚺_k) / Σ_{k′=1}^{K} π_{k′} N(x_n | 𝝁_{k′}, 𝚺_{k′})
5      π_k = (Σ_{n=1}^{N} γ_k(x_n)) / N                                                 ▷ M steps
6      𝝁_k = (Σ_{n=1}^{N} γ_k(x_n) x_n) / (Σ_{n=1}^{N} γ_k(x_n))
7      𝚺_k = (Σ_{n=1}^{N} γ_k(x_n) (x_n − 𝝁_k)(x_n − 𝝁_k)^T) / (Σ_{n=1}^{N} γ_k(x_n))
8  end
9  return 𝜽 = {𝝅, 𝝁₁, . . . , 𝝁_K, 𝚺₁, . . . , 𝚺_K}
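Below is a compact Python/NumPy-SciPy sketch of Algorithm 9.1; the initialization scheme, the fixed iteration count (used in place of a convergence check), and the toy data are assumptions made for this example, not part of the algorithm as stated:

```python
import numpy as np
from scipy.stats import multivariate_normal

def soft_em(X, K, n_iter=100, seed=0):
    """Fit a Gaussian mixture to X (N, D) with K components via Soft EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # line 1: initialize means, covariances, and mixing coefficients
    mus = X[rng.choice(N, K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)

    for _ in range(n_iter):                      # line 2 (fixed iterations here)
        # E step (lines 3-4): responsibilities gamma[n, k]
        dens = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M steps (lines 5-7)
        Nk = gamma.sum(axis=0)                   # sum_n gamma_k(x_n)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]

    return pis, mus, Sigmas, gamma

# toy usage: two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, size=(150, 2)), rng.normal(3, 1, size=(150, 2))])
pis, mus, Sigmas, gamma = soft_em(X, K=2)
labels = gamma.argmax(axis=1)                    # hard cluster assignment per point
print(pis, mus)
```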

Finally, we have obtained an approximation 𝜽̂ which can be used to
perform clustering. Note that the Soft EM algorithm usually converges
to different local optima of the log likelihood depending on the
initialization step.
