Maths For Machine Learning
Contents
0 Introduction
9 Lecture 9 - Regression and Regularisation
  9.1 Regression and general loss functions
  9.2 Linear regression
  9.3 Stability and overfitting
  9.4 Regularised risk minimisation
  9.5 Tikhonov regularisation
18 Lecture 18 - Boosting
19 Lecture 19 - Clustering
  19.1 Linkage-based clustering
  19.2 k-means clustering
  19.3 Spectral clustering
0 Introduction
This lecture course is about:
1. Language of ML: What is classification, regression, ranking, clustering, dimensionality reduction, supervised and unsupervised learning, generalisation, overfitting.
2. PAC Theory: PAC learning model, finite hypothesis sets, consistent and inconsistent problems, deterministic and agnostic learning.
3. Rademacher complexity and VC dimension: generalisation bounds via Rademacher complexity, the growth function and its connection to Rademacher complexity, VC dimension, VC-dimension-based upper bounds, lower bounds on generalisation.
4. Model Selection: bias-variance trade-off, structural risk minimisation, cross-validation, regularisation.
5. Support Vector Machines: generalisation bounds, margin theory and margin-based generalisation bounds.
6. Kernel Methods: reproducing kernel Hilbert spaces, the representer theorem, kernel SVM, generalisation bounds for kernel-based methods.
7. Neural Networks (mostly shallow).
8. Clustering: k-means, Lloyd's algorithm, Ncut, Cheeger cut, spectral clustering.
9. Dimensionality Reduction: PCA, diffusion maps, the Johnson-Lindenstrauss lemma.
Literature:
• Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018. https://cs.nyu.edu/~mohri/mlbook/
• Shalev-Shwartz, Shai, and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/
• Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009. https://web.stanford.edu/~hastie/ElemStatLearn/
Lecture notes: You are reading the lecture notes for the course right now. They are being developed during the course. If you find things that can be improved, please contact Philipp Petersen to give feedback. This is highly appreciated.
Required prior knowledge: This is an applied mathematics course and will therefore often touch on many different mathematical fields, such as harmonic analysis, graph theory, random matrix theory, etc. Students are not required to know these topics beforehand, but a certain willingness to look up concepts from time to time is necessary.
To enjoy this course, students should have taken an introductory course on probability theory and linear algebra. Most of the examples, as well as the challenges, will require rudimentary knowledge of Python. Towards the end, a basic knowledge of functional analysis is helpful, but not required.
Assessment criteria: There will be an oral exam at the end of the lecture.
In addition, we will have challenges where you will solve machine learning problems in Python. You need to form groups for this. You need to beat me to pass (this will be very easy, because I will publish my approach and my solutions will typically not be very sophisticated). The winning team and the team with the most creative solution will receive prizes. There will be at least three challenges this year.
Note that your results for the challenges need to be handed in before the given deadline. There are no exceptions to this rule. You will have plenty of time to complete these, so it may make sense to prepare one submission as a backup and then fine-tune it later.
1 Lecture 1 – Example and Language of Machine Learning
A cheesy lecture on machine learning would probably start by claiming that machine learning is revolu-
tionary and constitutes a completely new paradigm for science and mathematics. One may even say a
new world of unknown and exciting terrain with uncountable possibilities.
We instead show that machine learning can quite literally show us new worlds.
We import a data set of light curves of stars recorded by the Kepler telescope. These can be found online at https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data. We print the first five lines of the data set to get a feeling for what is going on.
The columns are the light intensities at different points in time. The label is 2 if some astrophysicist has claimed that this star has an exoplanet and 1 if they claimed it has none. We will plot a couple of these curves to get a good understanding of what is going on.
fig = plt.figure(figsize=(18,14))
ax = fig.add_subplot(231)
plt.plot(data.values[6, 1:]/np.max(np.abs(data.values[6, 1:])))        # normalise each curve by its own maximum
plt.title('Has exoplanet')
ax = fig.add_subplot(232)
plt.plot(data.values[2003, 1:]/np.max(np.abs(data.values[2003, 1:])))
plt.title('No exoplanet')
ax = fig.add_subplot(233)
plt.plot(data.values[1, 1:]/np.max(np.abs(data.values[1, 1:])))
plt.title('Has exoplanet')
ax = fig.add_subplot(234)
plt.plot(data.values[13, 1:]/np.max(np.abs(data.values[13, 1:])))
plt.title('Has exoplanet')
ax = fig.add_subplot(235)
plt.plot(data.values[75, 1:]/np.max(np.abs(data.values[75, 1:])))
plt.title('No exoplanet')
ax = fig.add_subplot(236)
plt.plot(data.values[77, 1:]/np.max(np.abs(data.values[77, 1:])))
plt.title('No exoplanet')
Stars with exoplanets often have periodically occurring sharp drops in light intensity. We do not know if this is the only indication, though. Since we are also not trained in astrophysics, we should not overanalyse this. Maybe there is another obvious way of differentiating between stars with and without exoplanets. We shall start with some exploratory data analysis. This consists of looking at certain statistical aspects of the data set:
ex_labels = data.values[:, 0]
print('In the data set there are: ' + str(np.sum(ex_labels==1)) + ' stars without exoplanets.')
print('In the data set there are: ' + str(np.sum(ex_labels==2)) + ' stars with exoplanets.')
fig = plt.figure(figsize=(18,14))
means1 = LightCurves[ex_labels==1].mean(axis=1)
means2 = LightCurves[ex_labels==2].mean(axis=1)
ax = fig.add_subplot(231)
ax.hist(means1,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.hist(means2,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('Mean Intensity')
ax.set_ylabel('Num. of Stars')
std1 = LightCurves[ex_labels==1].std(axis=1)
std2 = LightCurves[ex_labels==2].std(axis=1)
ax = fig.add_subplot(232)
ax.hist(std1,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.hist(std2,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('Standard Deviation')
ax.set_ylabel('Num. of Stars')
# The following statistics were not defined in the original excerpt; these definitions are
# reconstructions matching the axis labels below.
spread1 = LightCurves[ex_labels==1].max(axis=1) - LightCurves[ex_labels==1].min(axis=1)
spread2 = LightCurves[ex_labels==2].max(axis=1) - LightCurves[ex_labels==2].min(axis=1)
ax = fig.add_subplot(233)
ax.hist(spread1,alpha=0.8,bins=50,density=True,range=(-2500,2500))
ax.hist(spread2,alpha=0.8,bins=50,density=True,range=(-2500,2500))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('Max minus min value')
ax.set_ylabel('Num. of Stars')
Derivative = np.sum(np.abs(np.gradient(LightCurves[ex_labels==1], axis=1)), axis=1)    # reconstructed
Derivative2 = np.sum(np.abs(np.gradient(LightCurves[ex_labels==2], axis=1)), axis=1)   # reconstructed
ax = fig.add_subplot(234)
ax.hist(Derivative,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.hist(Derivative2,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('L1 Norm of Derivative')
ax.set_ylabel('Num. of Stars')
MaxDerivative = np.max(np.gradient(LightCurves[ex_labels==1], axis = 1), axis = 1)
MaxDerivative2 = np.max(np.gradient(LightCurves[ex_labels==2], axis = 1), axis = 1)
ax = fig.add_subplot(235)
ax.hist(MaxDerivative,alpha=0.8,bins=50,density=True,range=(-500,500))
ax.hist(MaxDerivative2,alpha=0.8,bins=50,density=True,range=(-500,500))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('Max of Derivative')
ax.set_ylabel('Num. of Stars')
MaxSecDerivative = np.max(np.gradient(np.gradient(LightCurves[ex_labels==1], axis=1), axis=1), axis=1)    # reconstructed
MaxSecDerivative2 = np.max(np.gradient(np.gradient(LightCurves[ex_labels==2], axis=1), axis=1), axis=1)   # reconstructed
ax = fig.add_subplot(236)
ax.hist(MaxSecDerivative,alpha=0.8,bins=50,density=True,range=(-500,500))
ax.hist(MaxSecDerivative2,alpha=0.8,bins=50,density=True,range=(-500,500))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('Max of Second Derivative')
ax.set_ylabel('Num. of Stars')
Unfortunately, none of our clever statistics seems to really separate the data. It seems like stars with exoplanets may have higher maximal derivatives, but this only holds for the distribution and does not yet make for a simple test. We need to actually perform machine learning. Let us use an all-purpose weapon, the support vector machine:
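The training code itself is not shown in the excerpt; a minimal sketch with scikit-learn (the file name 'exoTest.csv', the variable names TestLightCurves and test_labels, and the choice of a linear kernel are assumptions) is:

from sklearn.svm import SVC

test_data = pd.read_csv('exoTest.csv')          # assumed held-out set from the same Kaggle page
TestLightCurves = test_data.values[:, 1:]
test_labels = test_data.values[:, 0]

SupportVectorClassifier = SVC(kernel='linear')  # assumed kernel choice
SupportVectorClassifier.fit(LightCurves, ex_labels)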
We have trained the support vector machine on the data. Now let us evaluate how well this trained
algorithm performs on a test set.
prediction=SupportVectorClassifier.predict(TestLightCurves)
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix
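The computation of the reported accuracy is not shown either; assuming the held-out labels are stored in test_labels as above, it would be:

print('Accuracy: ' + str(accuracy_score(test_labels, prediction)))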
At first sight, we have achieved 99.12% accuracy on the test set, which seems nice. But let us dig a little bit deeper by also printing the confusion matrix. This is the matrix C = (C_{i,j})_{i,j=0}^{1}, where C_{0,0} denotes the number of true negatives, C_{1,1} the true positives, C_{1,0} the false negatives, and C_{0,1} the false positives.
print('Confusion Matrix:'); print(confusion_matrix(test_labels, prediction))
plt.pie([np.sum(ex_labels==1), np.sum(ex_labels==2)], labels=['No exoplanet', 'Has exoplanet'],
        shadow=True, startangle=90)   # reconstructed pie chart of the label distribution in the data set
plt.show()
Confusion Matrix:
[[565 0]
[ 5 0]]
The confusion matrix and the pie chart show quite clearly what the problem is. The data set is very imbalanced. The classifier, while achieving high accuracy, did not label a single star with an exoplanet correctly. In fact, it labelled all stars as having no exoplanet.
It seems like we have to use somewhat more sophisticated methods.
We start by making the data a bit nicer by standardising and filtering it. We filter out high and low frequencies by applying a wavelet transform. We also remove very oscillatory elements, as they seem to be outliers.
Next, we will transform the data so that it is in a format that may exhibit the characteristics that we need to classify. As we have seen in one of the light curves, stars with exoplanets exhibit periodically appearing drops in light intensity. To expose this periodicity, it makes sense to take the Fourier transform. We also want our classifier to be independent of temporal shifts. This can be enforced by taking the absolute value of the Fourier transform, since translation of a function corresponds to modulation of its Fourier transform, and the modulation disappears when taking absolute values.
[13]: from scipy.fft import fft
      from scipy.signal.windows import gaussian   # imports added here; not shown in the original excerpt

      def filterData(DataSet, wav_len):
          wavelet = gaussian(wav_len, 1)
          wavelet = np.diff(np.diff(wavelet))   # produce a wavelet with two vanishing moments
          for k in range(DataSet.shape[0]):
              DataSet[k,:] = DataSet[k,:] - DataSet[k,:].mean()
              DataSet[k,:] = DataSet[k,:] / DataSet[k,:].std()
              if np.sum(np.abs(np.diff(DataSet[k,:]))) > 200*max(abs(DataSet[k,:])):
                  DataSet[k,:] = 0   # remove light curves with too much oscillation
              else:
                  DataSet[k,:] = np.convolve(DataSet[k,:], wavelet, 'same')
                  DataSet[k,:] = np.abs(fft(DataSet[k,:]))**2
          return DataSet
One big problem that we observed was the imbalance of the data set. We attack this problem by generating artificial data. The artificial data is produced by making signals that have periodic spikes.

newexoplanets = np.zeros([500, LightCurves.shape[1]])   # reconstructed initialisation; not shown in the excerpt
for k in range(500):
    newexoplanets[k,:] = LightCurves[500+k,:]
    newexoplanets[k,:] = (newexoplanets[k,:] - newexoplanets[k,:].mean())/newexoplanets[k,:].std()
    period = 100+k
    start = np.random.uniform(3,int(period))
    for j in range(int(start), LightCurves.shape[1]-3, int(period)):
        randpershift = int(np.random.uniform(-1,1))
        # the spike-injection step is missing from the excerpt; a plausible reconstruction:
        newexoplanets[k, j+randpershift] -= 5
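Before the prediction below, the classifier is retrained on the filtered and augmented data; this step is missing from the excerpt. A sketch, under the assumption that the filtered arrays are called LightCurves_F and TestLightCurves_F and that the wavelet length is 200:

LightCurves_F = filterData(np.vstack([LightCurves, newexoplanets]).copy(), 200)   # assumed wavelet length
labels_F = np.concatenate([ex_labels, 2*np.ones(newexoplanets.shape[0])])         # artificial curves get label 2
TestLightCurves_F = filterData(TestLightCurves.copy(), 200)

SupportVectorClassifier = SVC(kernel='linear')
SupportVectorClassifier.fit(LightCurves_F, labels_F)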
[17]: prediction=SupportVectorClassifier.predict(TestLightCurves_F)
. . . this looks much better. We found all five exoplanets without knowing anything about astrophysics!
Learning stages:
Figure 1: Data sets to cluster
• Examples: Observations/Instances of data used in the learning process or to evaluate. Stars in our
exoplanet study.
• Features: The set of attributes of the examples. In the exoplanet study, these are the light curves.
• Labels: Values or categories assigned to the examples. Has exoplanet or does not have exoplanet.
• Hyperparameters: Parameters that define the learning algorithm. These are not learned, e.g., the number of neurons of a neural network, when to stop training, etc.
• Training sample: These are the examples that are used to train the learning algorithm.
• Validation sample: These examples are only indirectly used in the learning algorithm, to tune its
hyperparameters.
• Test sample: These examples are not accessed during training. After training they are used to deter-
mine the accuracy of the algorithm.
• Loss function: This function is used to measure the distance between the predicted and the true label. If Y is the set of labels, then L : Y × Y → R_+. Examples include the zero-one loss: Y = {−1, 1}, L_{0−1}(x, y) = 1_{x ≠ y}; the square loss: Y = R^d, L_sq(x, y) = ∥x − y∥². In the exoplanet study, we used the binary cross entropy: Y = [0, 1], where L_ce(x, y) = −(y log(x) + (1 − y) log(1 − x)) (in our case, the true labels y only take values in {0, 1}). A small numerical illustration of these losses follows after this list.
• Hypothesis set: A set of functions that map features to labels.
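As a small, self-contained illustration (not part of the original notes), the losses named above can be evaluated numerically as follows:

import numpy as np

def zero_one_loss(x, y):
    return float(x != y)

def square_loss(x, y):
    return float(np.sum((np.asarray(x) - np.asarray(y))**2))

def cross_entropy_loss(x, y):
    # y is the true label in {0, 1}, x the predicted probability in (0, 1)
    return float(-(y*np.log(x) + (1 - y)*np.log(1 - x)))

print(zero_one_loss(1, -1))                  # 1.0
print(square_loss([1.0, 2.0], [1.5, 1.0]))   # 1.25
print(cross_entropy_loss(0.9, 1))            # approx. 0.105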
Learning Scenarios:
• Supervised learning: The learner has access to labels for every training and evaluation sample. This
was the case in the exoplanet study.
• Unsupervised learning: Here we do not have labels. A typical example is clustering.
• Semi-supervised learning: Here only some of the data have labels. Both the labels and the structure of the data need to be used.
• Online learning: Here training and testing are performed iteratively in rounds. In each round we receive new data, make a prediction, receive an evaluation, and update our model. The goal is to reduce the so-called regret, which describes how much worse one performed than an expert would have in hindsight.
• Reinforcement learning: Similar to online learning in the sense that training and testing phases are mixed. The learner receives a reward for each action and seeks to maximise this reward. This is often used to train algorithms to play computer games.
Figure 2: Learning pipeline. We learn using an algorithm A(Θ). This algorithm can be chosen based on certain features and prior knowledge of the problem. This algorithm has hyperparameters Θ that we can choose based on the validation sample.
• Active learning: An oracle exists that can be queried by the learner for labels to samples chosen by
the learner.
Generalisation: Generalisation describes the performance of the learned algorithm outside of the train-
ing set.
Example:
a) Polynomials fitting points
[2]: N = 25
     x = np.arange(0,1, 1/N)
     # reconstructed: noisy data and the polynomial fits (not shown in the original excerpt)
     y = np.sin(2*np.pi*x) + 0.1*np.random.randn(N)
     polyordLOW = np.poly1d(np.polyfit(x, y, 1))
     polyordRIGHT = np.poly1d(np.polyfit(x, y, 2))
     polyordHIGH = np.poly1d(np.polyfit(x, y, 20))
     plt.figure(figsize = (15,5))
     plt.subplot(1,3,1)
     plt.scatter(x,y)
     plt.plot(x,polyordLOW(x), c = 'r')
     plt.title('Degree 1')
     plt.subplot(1,3,2)
     plt.scatter(x,y)
     plt.title('Degree 2')
     plt.plot(x,polyordRIGHT(x), c = 'r')
     plt.subplot(1,3,3)
     plt.scatter(x,y)
     plt.plot(x,polyordHIGH(x), c = 'r')
     plt.title('Degree 20')
[2]:
b) Binary classification
c) Real world: Sports statistics. "Red Bull Salzburg never loses a game in the Champions League if they play
at home, the moon is full and at least 3 yellow cards are awarded in the first 20 minutes to players with odd
jersey numbers."
d) Science: Geocentric model. Based on epicycles. See Figure 3.
Figure 3: Ptolemaic system
Definition 2.1 (Generalisation error). Let h ∈ H, c ∈ C, and let D be a data distribution. The generalisation error or risk of h is defined as
R(h) = P_{x∼D}( h(x) ≠ c(x) ) = E_{x∼D}[ 1_{h(x)≠c(x)} ].
In practice, we cannot compute the generalisation error R(h), since we know neither D nor the target hypothesis c. We can compute the error on a sample instead:
Definition 2.2 (Empirical error). Let h ∈ H, and let S := (x_i, y_i)_{i=1}^m be a training sample. The empirical error or empirical risk is defined as
R̂_S(h) = (1/m) ∑_{i=1}^m 1_{h(x_i)≠y_i}.
The empirical risk is an unbiased estimator of the risk: if the labels are given by a target concept c, then
E_{(x_i)_{i=1}^m ∼ D^m}[ R̂_{(x_i, c(x_i))_{i=1}^m}(h) ] = (1/m) ∑_{i=1}^m E_{(x_i)_{i=1}^m ∼ D^m}[ 1_{h(x_i)≠c(x_i)} ] = (1/m) ∑_{i=1}^m E_{x∼D}[ 1_{h(x)≠c(x)} ] = (1/m) ∑_{i=1}^m R(h) = R(h).
We want to learn the target concept from samples. When is this even possible? What does possible even
mean?
Definition 2.3 (PAC learnability). Let C be a concept class. We say that C is PAC-learnable if there exists a function m_C : (0, 1)² → N and an algorithm A mapping samples S to functions A(S) ∈ {X → Y} with the following property: for every distribution D on X, for every target concept c ∈ C, and for all e, δ ∈ (0, 1),
P_{S∼D^m}( R(A(S)) ≤ e ) ≥ 1 − δ
if m ≥ m_C(e, δ).
Note that the definition of PAC learnability is distribution-free. It also describes the worst-case behaviour over the whole concept class.
2.2 An example
• X := [0, 1]²
• C := { 1_{[r_1, r_2] × [r_3, r_4]} : r_1, r_2, r_3, r_4 ∈ (0, 1) }.
Let us define our learning algorithm A as follows: for S = (x_i, y_i)_{i=1}^m we pick r′_1, r′_2, r′_3, r′_4 ∈ (0, 1) so that [r′_1, r′_2] × [r′_3, r′_4] is the smallest rectangle containing all x_i with y_i = 1, and then A(S) = 1_{[r′_1, r′_2] × [r′_3, r′_4]}.
Let us analyse the expected error of our algorithm. Pick an arbitrary c ∈ C and a distribution D on [0, 1]². Let e > 0:
1. Note that for a sample S we have {A(S) = 1} ⊂ {c = 1}.
2. The expected error is therefore given by
R(A(S)) = E_D( c − A(S) ).
3. Assuming E_D(c) > e, we choose four rectangles (R_j)_{j=1}^4 as in Figure 4, each of probability mass exactly e/4.
4. Observe that, if E_D(c − A(S)) > e, then in particular E_D(c) > e and supp A(S) cannot intersect all 4 rectangles of Step 3. Hence, there is one rectangle that does not contain any training samples. In other words,
P_{(x_i)_{i=1}^m ∼ D^m}( R(A(S)) > e ) ≤ ∑_{j=1}^4 P_{(x_i)_{i=1}^m ∼ D^m}( x_i ∉ R_j for all i ∈ [m] ) ≤ ∑_{j=1}^4 ∏_{i=1}^m P_{x∼D}( x ∉ R_j ) = 4 (1 − e/4)^m.
Figure 4: Left: A sample, drawn according to D as well as the target concept. Middle: Rectangles of area
e/4 each. Right: Red box is the solution of A.
Theorem 2.1 (Learning bound, finite H, consistent). Let H ⊇ C be a hypothesis set and concept class. Let D be a data distribution and A be an algorithm such that for each c ∈ H and each sample S = (x_i, c(x_i))_{i=1}^m we have that
R̂_S(A(S)) = 0.
Then, for every e, δ > 0,
P_{S∼D^m}( R(A(S)) ≤ e ) ≥ 1 − δ,
if
m ≥ (1/e) ( log |H| + log(1/δ) ).
In other words, for every e, δ > 0, with probability at least 1 − δ,
R(A(S)) ≤ (1/m) ( log |H| + log(1/δ) ).
(We write [N] := {1, . . . , N}.)
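For instance (a made-up illustration of the bound, not from the original notes): if |H| = 1000, e = 0.1, and δ = 0.01, then m ≥ (1/0.1)(log 1000 + log 100) ≈ 10 · (6.91 + 4.61) ≈ 116 samples suffice to guarantee R(A(S)) ≤ 0.1 with probability at least 0.99.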
Hoeffding's inequality tells us that if we roll a die m times, then the mean of the observed values should concentrate strongly around 3.5. Indeed, modelling the rolls of the die by i.i.d. random variables X_i taking values in [6] yields that
P( | (1/m) ∑_{i=1}^m X_i − 3.5 | > m^{−1/4} ) ≤ 2 e^{−2√m/25}.
[48]: num_of_experiments = 30
      num_of_draws = 1000   # reconstructed; not shown in the excerpt
      draws = np.random.randint(1, 7, size=(num_of_experiments, num_of_draws))   # reconstructed die rolls
      cum_mean = np.cumsum(draws, axis=1)/np.arange(1, num_of_draws+1)           # running means
      plt.figure(figsize = (12,6))
      plt.plot(cum_mean.T)
      plt.plot(np.arange(1,num_of_draws+1), 3.5+np.power(np.arange(1,num_of_draws+1), -1/4), c = 'k')
      plt.plot(np.arange(1,num_of_draws+1), 3.5-np.power(np.arange(1,num_of_draws+1), -1/4), c = 'k')
Corollary 2.1. Let e > 0, let D be a distribution on X, and let c : X → {0, 1} be a target concept. Then, for every h : X → {0, 1} it holds that
P_{(x_i)_{i=1}^m ∼ D^m}( R̂_{(x_i, c(x_i))_{i=1}^m}(h) − R(h) ≥ e ) ≤ e^{−2me²},
P_{(x_i)_{i=1}^m ∼ D^m}( R̂_{(x_i, c(x_i))_{i=1}^m}(h) − R(h) ≤ −e ) ≤ e^{−2me²}.
In particular,
P_{(x_i)_{i=1}^m ∼ D^m}( | R̂_{(x_i, c(x_i))_{i=1}^m}(h) − R(h) | ≥ e ) ≤ 2 e^{−2me²}.
Proof.
• R̂_{(x_i, c(x_i))_{i=1}^m}(h) = ∑_{i=1}^m X_i for independent random variables X_i with 0 ≤ X_i ≤ 1/m almost surely for i ∈ [m].
We conclude the proof by applying Theorem 2.2.
We can extend Corollary 2.1 to any finite hypothesis set by a union bound.
Theorem 2.3 (Learning bound, finite H, inconsistent). Let H be a finite hypothesis set. Then, for every δ > 0, the following inequality holds with probability at least 1 − δ over the sample S = (x_i, c(x_i))_{i=1}^m:
R(h) ≤ R̂_S(h) + √( ( log |H| + log(2/δ) ) / (2m) )   for all h ∈ H.  (1)
Proof. Writing H = {h_1, . . . , h_n}, a union bound gives
P( ∃ h_i ∈ H : | R(h_i) − R̂_S(h_i) | ≥ e ) ≤ ∑_{i=1}^n P( | R(h_i) − R̂_S(h_i) | ≥ e ) ≤ ∑_{i=1}^n 2 e^{−2me²} ≤ 2 |H| e^{−2me²},   [Corollary 2.1]
Setting δ = 2 |H| e^{−2me²} and solving for e yields (1).
3 Lecture 3 – Some Generalisations and Rademacher Complexities
3.1 Agnostic PAC learning:
The notion of concept class requires a deterministic relationship between input x drawn according to D and
the label. This is not always sensible. Instead consider a distribution D on X × Y . Below is an example:
Below we plot the temperature in Austria over periods of two weeks. We can consider the week number as the example space X and the temperature as the label space Y.
If D is considered as a probability distribution on X × Y , then we call the learning problem stochastic.
Analogously, we call our previous set-up deterministic.
In this case, we can redefine the risk to be
R(h) := P_{(x,y)∼D}( h(x) ≠ y ) = E_{(x,y)∼D}[ 1_{h(x)≠y} ].  (2)
Definition 3.1 (Agnostic PAC learnability). Let H be a hypothesis set. An algorithm A mapping samples S to functions in H is an agnostic PAC learning algorithm if there exists a function m_H : (0, 1)² → N with the following property: for all e, δ ∈ (0, 1) and for all distributions D over X × Y,
P_{S∼D^m}( R(A(S)) ≤ inf_{h∈H} R(h) + e ) ≥ 1 − δ
if m ≥ m_H(e, δ). We call H agnostic PAC learnable if an agnostic PAC learning algorithm exists.
For every x we have P_{(x,y)∼D}( h_Bayes(x) ≠ y | x ) = min{ P_{(x,y)∼D}(1 | x), P_{(x,y)∼D}(0 | x) }, which is the smallest possible error. Hence h_Bayes is indeed a Bayes classifier.
Definition 3.3 (Noise). Given a distribution D over X × Y , we define the noise at point x ∈ X by noise( x ) =
min{P(x,y)∼D (1| x ), P(x,y)∼D (0| x )}.
The average noise or simply noise is then defined as E(noise( x )).
It is clear by construction that
E(noise( x )) = R∗ .
The noise level is one aspect describing the hardness of a learning task.
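For example (an illustration not contained in the original notes): if at some point x the conditional distribution satisfies P_{(x,y)∼D}(1 | x) = 0.9 and P_{(x,y)∼D}(0 | x) = 0.1, then noise(x) = 0.1; the Bayes classifier predicts 1 at x and errs there with probability exactly 0.1.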
Remark 3.1. The empirical Rademacher complexity measures how well the class G can correlate with random noise on a given sample S. If, for example, G is the set of continuous functions from [0, 1] to [−1, 1] and S contains m elements (x_1, . . . , x_m) with x_i ≠ x_j for all i ≠ j ∈ [m], then R̂_S(G) = 1. If G = {1} contains only one function, then R̂_S(G) = 0.
The Rademacher complexity is defined for functions with real outputs. To apply it to general learning
problems, we introduce the concept of a loss function:
Definition 3.5 (Family of loss functions). A function L : Y × Y → R is called a loss function. For a hypothesis
class H, we define the family of loss functions associated to H by
G := { g : X × Y → R : g( x, y) = L(h( x ), y), h ∈ H}
Setting Z = X × Y we can apply Definition 3.4 to families of loss functions. We can also define a non-
empirical version of the Rademacher complexity.
Definition 3.6. Let a, b ∈ R and Z be a set. Let G ⊂ M(Z , [ a, b]) and let D be a distribution over Z . For
m ∈ N, we define the Rademacher complexity by
R_m(G) := E_{S∼D^m}[ R̂_S(G) ].
Theorem 3.1. Let G ⊂ M(Z, [0, 1]) and let D be a distribution on Z. For every δ > 0 and m ∈ N, each of the following bounds holds with probability at least 1 − δ over the sample S = (z_1, . . . , z_m) ∼ D^m, simultaneously for all g ∈ G:
E(g) ≤ (1/m) ∑_{i=1}^m g(z_i) + 2 R_m(G) + √( log(1/δ) / (2m) ),  (3)
E(g) ≤ (1/m) ∑_{i=1}^m g(z_i) + 2 R̂_S(G) + 3 √( log(2/δ) / (2m) ).  (4)
[4]: # set-up
     iterations = 50
     degrees = [3,4,7,20]
     largeNumber = 1000
     # errors to be computed
     RademacherPoly = np.ones([iterations, len(degrees)])
     EmpErrorsPoly = np.zeros([iterations, len(degrees)])
     ErrorsPoly = np.ones([iterations, len(degrees)])
     # reconstructed data generation (not shown in the excerpt): noisy sine data
     x_test = np.arange(0, 1, 0.001)
     y_test = np.sin(2*np.pi*x_test) + 0.1*np.random.randn(len(x_test))

     for m in range(1,iterations):
         x_short = np.random.uniform(0, 1, m)                        # reconstructed
         y_short = np.sin(2*np.pi*x_short) + 0.1*np.random.randn(m)  # reconstructed
         for k in range(len(degrees)):
             # fit polynomials to data:
             p = np.poly1d(np.polyfit(x_short, y_short, degrees[k]))
             # compute errors
             y_exp = p(x_test) - y_test
             y_emp = p(x_short) - y_short
             EmpErrorsPoly[m, k] = abs(y_emp).mean()
             ErrorsPoly[m, k] = abs(y_exp).mean()
             # reconstructed estimate of the empirical Rademacher complexity: sample largeNumber
             # random sign vectors and measure how well the model class can correlate with them
             err = 0
             for _ in range(largeNumber):
                 sigma = np.random.choice([-1.0, 1.0], m)
                 q = np.poly1d(np.polyfit(x_short, sigma, degrees[k]))
                 err += np.mean(sigma*q(x_short))
             RademacherPoly[m, k] = err/largeNumber
plt.subplot(131)
plt.plot(np.arange(iterations), RademacherPoly)
plt.legend(('Degree 3', 'Degree 4', 'Degree 7', 'Degree 20'))
plt.title('Rademacher complexities')
plt.subplot(132)
plt.semilogy(np.arange(iterations), EmpErrorsPoly)
plt.legend(('Degree 3', 'Degree 4', 'Degree 7', 'Degree 20'))
plt.title('Empirical errors')
plt.subplot(133)
plt.semilogy(np.arange(iterations), ErrorsPoly)
plt.legend(('Degree 3', 'Degree 4', 'Degree 7', 'Degree 20'))
plt.title('Expected errors')
[5]:
Having understood the nature of Theorem 3.1, we can now look at its proof. We need the following result:
Theorem 3.2 (McDiarmid's inequality). Let m ∈ N, and let X_1, . . . , X_m be independent random variables taking values in X. Assume that there exist c_1, . . . , c_m > 0 and a function f : X^m → R satisfying
| f(x_1, . . . , x_i, . . . , x_m) − f(x_1, . . . , x′_i, . . . , x_m) | ≤ c_i
for all i ∈ [m] and all points x_1, . . . , x_m, x′_i ∈ X. Then the following inequalities hold for all e > 0:
P( f(S) − E(f(S)) ≥ e ) ≤ e^{−2e²/|c|²},
P( f(S) − E(f(S)) ≤ −e ) ≤ e^{−2e²/|c|²},
where S = (X_1, . . . , X_m) and |c|² = ∑_{i=1}^m c_i².
Proof of Theorem 3.1. We define two short-hand notations for a sample S = (z_1, . . . , z_m):
Ê_S(g) := (1/m) ∑_{i=1}^m g(z_i),
Φ(S) := sup_{g∈G} ( E(g) − Ê_S(g) ).
To prove the theorem, we need to bound Φ(S), and we will use McDiarmid's inequality for this.
Let S and S′ be two samples that differ in exactly one point, i.e., S = (z_1, . . . , z_i, . . . , z_m) and S′ = (z_1, . . . , z′_i, . . . , z_m). We compute:
Φ(S′) − Φ(S) = sup_{g∈G} ( E(g) − Ê_{S′}(g) ) − sup_{g∈G} ( E(g) − Ê_S(g) )
            ≤ sup_{g∈G} ( Ê_S(g) − Ê_{S′}(g) )
            = sup_{g∈G} ( g(z_i) − g(z′_i) )/m ≤ 1/m,
where the first inequality is due to elementary properties of suprema, the second step follows from the definition of Ê_S(g) and of S, S′, and the last inequality is due to the fact that g takes values in [0, 1]. By symmetry, we obtain
|Φ(S′) − Φ(S)| ≤ 1/m
for all S, S′ differing in one point only. By McDiarmid's inequality, we have that for a random sample S,
P( Φ(S) − E_S(Φ(S)) ≥ e ) ≤ e^{−2e²m}.  (5)
Next, we bound the expectation of Φ(S):
E_S(Φ(S)) = E_S( sup_{g∈G} ( E(g) − Ê_S(g) ) ) = E_S( sup_{g∈G} E_{S′}( Ê_{S′}(g) − Ê_S(g) ) ),
where S′ is a sample that is independent from and distributed like S. We used that E_{S′}(Ê_{S′}(g)) = E(g). By the monotonicity of the expected value, we have that
E_S( sup_{g∈G} E_{S′}( Ê_{S′}(g) − Ê_S(g) ) ) ≤ E_{S,S′}( sup_{g∈G} ( Ê_{S′}(g) − Ê_S(g) ) ) = E_{S,S′}( sup_{g∈G} (1/m) ∑_{i=1}^m ( g(z′_i) − g(z_i) ) ).
Introducing independent Rademacher variables σ = (σ_1, . . . , σ_m), the latter equals
E_σ E_{S,S′}( sup_{g∈G} (1/m) ∑_{i=1}^m σ_i ( g(z′_i) − g(z_i) ) ).  (7)
To see why this holds, we observe that for every fixed σ a negative sign of σ_i corresponds to switching z_i and z′_i in ∑_{i=1}^m ( g(z′_i) − g(z_i) ). Since all z_i, z′_i are chosen i.i.d. and we are taking the expectation, this does not affect the output. Applying the sub-additivity of the supremum to (7) yields that
E_σ E_{S,S′}( sup_{g∈G} (1/m) ∑_{i=1}^m σ_i ( g(z′_i) − g(z_i) ) ) ≤ E_σ E_{S′}( sup_{g∈G} (1/m) ∑_{i=1}^m σ_i g(z′_i) ) + E_σ E_S( sup_{g∈G} (1/m) ∑_{i=1}^m −σ_i g(z_i) )
≤ 2 E_σ E_S( sup_{g∈G} (1/m) ∑_{i=1}^m σ_i g(z_i) ) = 2 R_m(G),
where the last inequality follows since σ and −σ have the same distribution. This yields (3).
To prove (4), we apply McDiarmid's inequality again. Note that for two samples S, S′ differing in one point only,
R̂_S(G) − R̂_{S′}(G) ≤ 1/m,
and hence with probability at least 1 − δ/2,
R_m(G) = E( R̂_S(G) ) ≤ R̂_S(G) + √( log(2/δ) / (2m) ).  (8)
Therefore, we conclude with a union bound from (8) and (5) that with probability at least 1 − δ,
Φ(S) ≤ 2 R̂_S(G) + 3 √( log(2/δ) / (2m) ),
which yields (4).
4 Lecture 4 – Application of Rademacher Complexities and Growth Function
4.1 Rademacher complexity bounds for binary classification
Theorem 3.1 holds for general families of loss functions. We want to make this notion more concrete for
common learning problems.
Lemma 4.1. Let H ⊂ M(X, {−1, 1}). Furthermore, let G = { X × Y ∋ (x, y) ↦ 1_{h(x)≠y} : h ∈ H }. For a sample (x_i, y_i)_{i=1}^m = S ∈ (X × Y)^m we denote S_X = (x_i)_{i=1}^m. It holds that
R̂_S(G) = (1/2) R̂_{S_X}(H).
Proof. The proof follows from a simple computation which is fundamentally based on the identity 1_{h(x)≠y} = (1 − h(x)y)/2. With this, we have that
R̂_S(G) = E_σ( sup_{g∈G} (1/m) ∑_{i=1}^m σ_i g(z_i) )
       = E_σ( sup_{h∈H} (1/m) ∑_{i=1}^m σ_i 1_{h(x_i)≠y_i} )
       = E_σ( sup_{h∈H} (1/m) ∑_{i=1}^m σ_i (1 − h(x_i) y_i)/2 )
       = (1/2) E_σ( sup_{h∈H} (1/m) ∑_{i=1}^m (−σ_i y_i) h(x_i) ) = (1/2) R̂_{S_X}(H),
where the last identity follows since (−σ_i y_i) and σ_i have the same distribution.
Now we can transfer our generalisation bound of Theorem 3.1 to the binary classification setting:
Theorem 4.1. Let H ⊂ M(X, {−1, 1}) and let D be a distribution on X. Then, for every δ > 0, each of the following bounds holds with probability at least 1 − δ, for all h ∈ H:
R(h) ≤ R̂_S(h) + R_m(H) + √( log(1/δ) / (2m) ),
R(h) ≤ R̂_S(h) + R̂_S(H) + 3 √( log(2/δ) / (2m) ),
where S ∼ D^m.
For the binary loss, computing the empirical Rademacher complexity of a hypothesis class H amounts to solving, for every choice of Rademacher vector, an optimisation problem over the whole class H. This can be computationally challenging if H is very complex and m is large. Moreover, computing R_m is often not possible at all, since we do not know the underlying distribution.
The growth function describes the number of ways m points could be grouped into two classes by ele-
ments in H. The growth function is independent of the underlying distribution and a useful tool to bound
the Rademacher complexity.
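For example (an illustration not from the original notes): for threshold classifiers on the real line, H = { x ↦ sign(x − t) : t ∈ R }, any m distinct points can only be labelled in the m + 1 ways "everything to the right of some position is +1", so Π_H(m) = m + 1, which grows only polynomially in m.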
A helpful result here is Massart’s lemma:
Theorem 4.2 (Massart's Lemma). Let A ⊂ { x = (x_1, . . . , x_m) ∈ R^m : |x| ≤ r } be a finite set. Then
E_σ( sup_{x∈A} (1/m) ∑_{i=1}^m σ_i x_i ) ≤ r √( 2 log |A| ) / m,
where σ_1, . . . , σ_m are independent Rademacher variables.
For H ⊂ M(X, {−1, 1}) and a sample S = (x_1, . . . , x_m), the set
H_S := { h(S) : h ∈ H }
is contained in the ball of radius √m in R^m and, per definition, |H_S| ≤ Π_H(m). Therefore,
R_m(H) = E_S E_σ( sup_{h∈H} (1/m) ∑_{i=1}^m σ_i h(x_i) ) = E_S E_σ( sup_{u∈H_S} (1/m) ∑_{i=1}^m σ_i u_i ) ≤ √( 2 log Π_H(m) / m ),
by Theorem 4.2.
Using this estimate, we can reformulate our previous generalisation bound that was formulated in terms
of Rademacher complexity via the growth function instead:
Corollary 4.2. Let H ⊂ { h : X → {−1, 1} }. Then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H:
R(h) ≤ R̂_S(h) + √( 2 log Π_H(m) / m ) + √( log(1/δ) / (2m) ).
Example 4.1 (Intervals). Let H = { 2·1_{[a,b]} − 1 : a, b ∈ R }.
It is clear that VCdim(H) ≥ 2, since for x_1 < x_2 the functions
2·1_{[x_1−2, x_1−1]} − 1,  2·1_{[x_1−2, x_1]} − 1,  2·1_{[x_1, x_2]} − 1,  2·1_{[x_2, x_2+1]} − 1
realise all four possible labellings of {x_1, x_2}.
Figure 5: Different ways to classify two or three points. The coloured-blocks correspond to the intervals
[ a, b].
For any four points (x_1, x_2, x_3, x_4) one of two situations will happen. Either one point is in the convex hull of the remaining three, or the four points form the vertices of a convex quadrilateral. In the first case, we can assume that
without loss of generality x4 is a convex combination of x1 , x2 , x3 . Since half-spaces are convex too, we have that
if h( x1 ) = h( x2 ) = h( x3 ) = 1 then h( x4 ) = 1. Therefore, we cannot shatter sets of this form. If, on the other
hand, the points (x_1, x_2, x_3, x_4) are the vertices of a convex quadrilateral, then, without loss of generality, the points
x1 and x3 lie on different sides of the line connecting x2 and x4 . Since ( x1 , x2 , x3 , x4 ) are the extreme points of the
quadrilateral, it must be the case that the lines connecting x1 and x3 and x2 and x4 intersect. Further, any half-space
that contains x1 and x3 contains by convexity also the line between x1 and x3 . Any half-space not containing x2
and x4 contains, by convexity also no element of the line between x2 , x4 . Hence, there is no half space containing
x1 , x3 but not x2 and x4 . A visualisation of the argument above is given in Figure 7.
We conclude that for the half-space classifier VCdim(H) = 3.
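As a computational illustration (not part of the original notes; the point configurations below are made up), one can check shattering by half-spaces numerically with a linear-programming feasibility test:

import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    """Check whether the labelling is realisable by a half-space sign(<w,x>+b),
    via the LP feasibility problem y_i*(<w,x_i>+b) >= 1 (strict separability)."""
    m, d = points.shape
    # variables: (w_1,...,w_d, b); constraints: -y_i*(<w,x_i>+b) <= -1
    A = -labels[:, None] * np.hstack([points, np.ones((m, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(m),
                  bounds=[(None, None)]*(d + 1))
    return res.status == 0   # status 0 means a feasible (w, b) was found

def shattered(points):
    return all(separable(points, np.array(lab))
               for lab in product([-1.0, 1.0], repeat=points.shape[0]))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # convex quadrilateral
print(shattered(three))   # True:  three points in general position are shattered
print(shattered(four))    # False: the diagonal labelling cannot be realised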
Figure 7: Visualisation of the argument prohibiting shattering of sets of four elements.
Theorem 4.3 (Sauer's lemma). Let H be a hypothesis set with VCdim(H) = d. Then, for all m ∈ N,
Π_H(m) ≤ ∑_{i=0}^{d} \binom{m}{i}.  (9)
Proof. We prove this by induction over m + d ≤ k. For k = 2, we have the options m = 1 and d = 0, 1 as
well as m = 2, d = 0.
1. If d = 0 and m ∈ N, then |HS | ≤ 1 for all samples S of size 1 and hence ΠH (1) ≤ 1. Moreover,
if for an m ∈ N, Π H (m) > 1, then there would exist a set S with m samples on which |HS | > 1.
That means that on at least one of the elements of S, HS takes at least two different values and hence
Π H (1) > 1, a contradiction. Hence Π H (m) ≤ 1 for all m ∈ N. The right-hand side of (9) is always
at least 1.
2. If d ≥ 1 and m = 1, then Π H (1) ≤ 2 per definition, which is always bounded by the right-hand side
of (9).
Assume now that the statement (9) holds for all m + d ≤ k and let m̄ + d¯ = k + 1. By Points 1 and 2 above,
we can assume without loss of generality that m̄ > 1 and d¯ > 0.
Let S = { x1 , . . . , xm̄ } be a set so that ΠH (m̄) = |HS | and let S0 = { x1 , . . . , xm̄−1 }.
Let us define an auxiliary set
G := { g ∈ H_{S′} : there exist h_1, h_2 ∈ H_S with h_1 ≠ h_2 and h_1(x_i) = h_2(x_i) = g(x_i) for all i ∈ [m̄ − 1] }.
In words, G contains all those maps in H_{S′} that have two corresponding functions in H_S.
Now it is clear that
|H_{S′}| ≤ Π_H(m̄ − 1) ≤ ∑_{i=0}^{d̄} \binom{m̄ − 1}{i}.  (12)
Note that G is a set of functions defined on S′, hence we can compute its VC dimension. If a set Z ⊂ S′ is shattered by G, then Z ∪ {x_m̄} is shattered by H_S. We conclude that VCdim(G) ≤ d̄ − 1 and therefore
|G| ≤ Π_G(m̄ − 1) ≤ ∑_{i=0}^{d̄−1} \binom{m̄ − 1}{i}.  (13)
Combining (12) and (13) with Pascal's rule, we obtain
Π_H(m̄) = |H_S| = |H_{S′}| + |G| ≤ ∑_{i=0}^{d̄} \binom{m̄ − 1}{i} + ∑_{i=0}^{d̄−1} \binom{m̄ − 1}{i} = ∑_{i=0}^{d̄} \binom{m̄}{i}.
This shows, in particular, that for m ≥ d,
Π_H(m) ≤ ∑_{i=0}^{d} \binom{m}{i} ≤ ∑_{i=0}^{d} \binom{m}{i} (m/d)^{d−i} ≤ ∑_{i=0}^{m} \binom{m}{i} (m/d)^{d−i} = (m/d)^d ∑_{i=0}^{m} \binom{m}{i} (d/m)^i = (m/d)^d (1 + d/m)^m ≤ (em/d)^d.
Plugging Theorem 4.3 into Corollary 4.2, we can now state a generalisation bound for binary classification
in terms of the VC dimension.
Corollary 4.3. Let H ⊂ { h : X → {−1, 1} }. Then, for every δ > 0, with probability at least 1 − δ, for any h ∈ H:
R(h) ≤ R̂_S(h) + √( 2d log(em/d) / m ) + √( log(1/δ) / (2m) ),
where d := VCdim(H).
Two files will be supplied to you via Moodle: a training set 'data_train_db.csv' and a test set 'data_test_db.csv'. They were taken by observing a mystery machine. The first entry 'Running' is 1 if the machine worked. It is 0 if it failed to work. In the test set, the labels are set to 2. You should predict them.
Let us look at our data first:
[2]: import numpy as np               # imports added here; not shown in the excerpt
     import pandas as pd
     import matplotlib.pyplot as plt
     import seaborn as sn

     data_train_db = pd.read_csv('data_train_db.csv')
     data_test_db = pd.read_csv('data_test_db.csv')
     data_train_db.head()
[3]: data_train_db.describe()
How is the distribution of the labels?
[4]: data_train = data_train_db.values
     plt.hist(data_train[:, 0])     # reconstructed plot of the label distribution; not shown in the excerpt
     plt.title('Distribution of labels')
     plt.show()

     fig = plt.figure(figsize = (14, 4))
     plt.subplot(1,2,1)
     plt.hist(data_train[data_train[:,0] == 1,1:].std(1))
     plt.title('Distribution of standard deviation--- running')
     plt.subplot(1,2,2)
     plt.hist(data_train[data_train[:,0] == 0,1:].std(1))
     plt.title('Distribution of standard deviation--- not running')

     fig = plt.figure(figsize = (14, 4))
     plt.subplot(1,2,1)
     plt.hist(np.sum(data_train[data_train[:,0]==1,1:], 1)/100)
     plt.title('Distribution of means--- running')
     plt.subplot(1,2,2)
     plt.hist(np.sum(data_train[data_train[:,0]==0,1:], 1)/100)
     plt.title('Distribution of means--- not running')
fig = plt.figure(figsize = (14, 4))
plt.subplot(1,2,1)
plt.hist(np.amax(data_train[data_train[:,0] == 1,1:], axis = 1))
plt.title('Distribution of max value--- running')
plt.subplot(1,2,2)
plt.hist(np.amax(data_train[data_train[:,0] == 0,1:], axis = 1))
plt.title('Distribution of max value--- not running')
fig = plt.figure(figsize = (14, 4))
plt.subplot(1,2,1)
plt.hist(np.amin(data_train[data_train[:,0] == 1,1:], axis = 1))
plt.title('Distribution of min value--- running')
plt.subplot(1,2,2)
plt.hist(np.amin(data_train[data_train[:,0] == 0,1:], axis = 1))
plt.title('Distribution of min value--- not running')
The distribution of the min values is a bit worrying. A very few entries have a very high standard deviation, and a very few (possibly the same) entries have very low negative values, while almost all other entries have only positive values. This may be a problem in the data set. We decide that these entries are outliers and drop them from the data set.
[6]: # It seems like there are some data points which have much higher standard deviation than most.
     # Let us just remove those.
     def clean_dataset(data):
         to_drop = []
         for k in range(data.shape[0]):
             if data[k,:].std() > 15:
                 to_drop.append(k)
         return np.delete(data, to_drop, axis = 0)
Let us apply the cleaning and look at the data set again
data_train = clean_dataset(data_train)   # reconstructed: apply the cleaning; not shown in the excerpt

fig = plt.figure(figsize = (14, 4))
plt.subplot(1,2,1)
plt.hist(data_train[data_train[:,0] == 1,1:].std(1))
plt.title('Distribution of standard deviation--- running')
plt.subplot(1,2,2)
plt.hist(data_train[data_train[:,0] == 0,1:].std(1))
plt.title('Distribution of standard deviation--- not running')
fig = plt.figure(figsize = (14, 4))
plt.subplot(1,2,1)
plt.hist(np.amin(data_train[data_train[:,0] == 1,1:], axis = 1))
plt.title('Distribution of min value--- running')
plt.subplot(1,2,2)
plt.hist(np.amin(data_train[data_train[:,0] == 0,1:], axis = 1))
plt.title('Distribution of min value--- not running')
Controller mintcream -0.018974 0.208065 -0.278604 -0.801354
Controller mistyrose 0.040784 -0.166662 -0.335028 -0.225535
Controller moccasin -0.038704 0.771165 0.072201 -0.240072
corrMatrix = data_train_db.corr()   # reconstructed; the computation is not shown in the excerpt
plt.figure(figsize = (12,6))
plt.plot(np.arange(1, 100), corrMatrix['Running'][1:100])
plt.title('Correlation with Running')
plt.show()
The first column of the data set (after the column 'Running' itself) seems to be suspiciously important. Let us look at it in isolation.
[10]: plt.hist(data_train[:,1])
plt.title(data_train_db.columns[1])
We see that 'Blue Switch On' only takes two values (On and Off). Let us look in detail at the effect of this switch on whether the mechanism runs or not.
[16]: runs_switchon = np.count_nonzero((data_train[:,0]==1)*(data_train[:,1]==1))
runs_switchoff = np.count_nonzero((data_train[:,0]==1)*(data_train[:,1]==0))
runsnot_switchon = np.count_nonzero((data_train[:,0]==0)*(data_train[:,1]==1))
runsnot_switchoff = np.count_nonzero((data_train[:,0]==0)*(data_train[:,1]==0))
conf_matrix = [[runs_switchon, runs_switchoff], [runsnot_switchon, runsnot_switchoff]]
sn.set(color_codes=True)
plt.figure(1, figsize=(9, 6))
plt.title("Confusion Matrix")
sn.set(font_scale=1.4)
ax = sn.heatmap(conf_matrix, annot=True, cmap="YlGnBu", fmt='2')
Now this is fantastic. If the Blue Switch is off, then the mechanism never works.
Next, we would like to extract additional important parameters of the machine. We rank the columns
according to their correlation with ‘Running’:
[12]: S = np.argsort(np.array(corrMatrix['Running']))[::-1]
print(S)
[ 0 1 72 98 50 74 37 10 33 14 89 41 34 8 56 68 7 90 95 11 83 39 67 64
17 81 47 70 96 92 84 27 80 82 22 69 73 24 63 60 58 13 77 86 49 2 28 5
44 53 71 16 18 3 66 45 55 75 93 79 87 52 35 61 25 59 38 42 48 43 29 85
78 26 91 36 20 51 21 23 94 88 57 15 31 19 54 65 30 97 12 9 76 32 46 6
62 40 4 99]
We saw that the first entry is always 1 if the machine is working. Also, from the ranking above, we expect that large values in coordinates 72 and 98 indicate that the machine works.
Let us describe a hypothesis set that takes this observation into account, by defining a classifier below. The hypothesis set is characterised by a thresholding value 'thresh'.
def myclassifier(x, thresh):
    # Reconstructed decision rule; the beginning of this cell is missing from the excerpt.
    # The rule below is a guess based on the surrounding text: the blue switch must be on
    # and coordinates 72 and 98 must be large enough.
    if x[1] == 1 and x[72] + x[98] > thresh:
        return 1
    return 0
Next we find the value thresh, that yields the best classification on the test set:
[40]: best_thresh = 0
best_err = data_train.shape[0]
for tr in range(100):
thresh = tr/20
err = 0
for t in range(data_train.shape[0]):
err = err + (myclassifier(data_train[t, :], thresh) != data_train[t, 0])
if err < best_err:
best_err = err
best_thresh = thresh
Test accuracy:0.7604010025062656
The test accuracy above is quite terrible. On the other hand, the hypothesis class seems very small, so
Corollary 4.2 gives us some confidence that the result may generalise in the sense that it will not be worse
on the test set. ("Not worse, but still very bad" is of course not a very desirable outcome.)
I am sure you can do much better than this.
data_test = data_test_db.values                      # reconstructed; not shown in the excerpt
predicted_labels = np.zeros(data_test_db.shape[0])   # reconstructed initialisation
for k in range(data_test_db.shape[0]):
    predicted_labels[k] = myclassifier(data_test[k, :], best_thresh)
Please send your result via email to [email protected]. Your email should include the names
of all people who worked on your code, their student identification numbers, a name for your team, and
the code used. It should also contain one or two paragraphs of a short description of the method you
used.
P_{S∼D^m}( R_D(A(S)) > (d − 1)/(32m) ) ≥ 0.01.
Proof.
1. Set-up: We first build a very imbalanced distribution. Let X := {x_1, x_2, . . . , x_d} ⊂ X be a set that is shattered by H.
For e > 0, we define the distribution D_e by P(x_1) = 1 − 8e and P(x_k) = 8e/(d − 1) for k = 2, . . . , d.
For S ∈ X^m, we denote by S̄ := { s_i ∈ S : s_i ≠ x_1, i ∈ [m] } the sample points different from x_1. Additionally, let 𝒮 ⊂ X^m be the set of samples S such that |S̄| ≤ (d − 1)/2.
Let, for S ∈ X^m and u ∈ {0, 1}^{d−1}, f_u ∈ H be such that
f_u(x_1) = 1 and f_u(x_k) = u_{k−1} for k = 2, . . . , d.
2. For a uniformly distributed random vector U on {0, 1}^{d−1} we compute
E_U( R_{D_e}(A(S), f_U) ) = ∑_{u∈{0,1}^{d−1}} ∑_{k=2}^{d} 1_{A(S)(x_k) ≠ f_u(x_k)} P[x_k] P[u],
where E(R_{D_e}(A(S), f_u)) denotes the expected risk with target concept f_u. By reducing the set that we sum over, we may estimate from below by
E_U( R_{D_e}(A(S), f_U) ) ≥ ∑_{u∈{0,1}^{d−1}} ∑_{k=2, x_k ∉ S̄}^{d} 1_{A(S)(x_k) ≠ f_u(x_k)} P[x_k] P[u] = ∑_{k=2, x_k ∉ S̄}^{d} ( ∑_{u∈{0,1}^{d−1}} 1_{A(S)(x_k) ≠ f_u(x_k)} P[u] ) P[x_k].
Per definition of f_u it is clear that for every x_k with k > 1 it holds that 1_{A(S)(x_k) ≠ f_u(x_k)} = 1 for exactly half of all values u ∈ {0, 1}^{d−1}. Hence, we estimate that
E_U( R_{D_e}(A(S), f_U) ) ≥ ∑_{k=2, x_k ∉ S̄}^{d} (1/2) P[x_k] = (1/2) ( d − 1 − |S̄| ) · 8e/(d − 1).
Thus, if S ∈ 𝒮, then
E_U( R_{D_e}(A(S), f_U) ) ≥ ∑_{k=2, x_k ∉ S̄}^{d} (1/2) P[x_k] ≥ (1/2) · ((d − 1)/2) · 8e/(d − 1) = 2e.  (15)
The estimate on the expected value (16) implies that there exists at least one u* ∈ {0, 1}^{d−1} such that E_S( R_{D_e}(A(S), f_{u*}) | S ∈ 𝒮 ) ≥ 2e. Moreover, for every sample S,
R_{D_e}(A(S), f_{u*}) = ∑_{k=2}^{d} 1_{A(S)(x_k) ≠ f_{u*}(x_k)} P[x_k] ≤ ∑_{k=2}^{d} 8e/(d − 1) = 8e.  (18)
Combining the two estimates yields
2e ≤ ∑_{S : R_{D_e}(A(S), f_{u*}) ≥ e} 8e P(S | 𝒮) + ∑_{S : R_{D_e}(A(S), f_{u*}) < e} e P(S | 𝒮) ≤ 8e P( R_{D_e}(A(S), f_{u*}) ≥ e | S ∈ 𝒮 ) + e,
and hence
P( R_{D_e}(A(S), f_{u*}) ≥ e | S ∈ 𝒮 ) ≥ 1/7.
More generally, for arbitrary S ∼ D_e^m we have that
P_{S∼D_e^m}( R_{D_e}(A(S), f_{u*}) ≥ e ) ≥ P_{D_e^m}(𝒮)/7.  (19)
Theorem 6.2 (Multiplicative Chernoff bound). Let X_1, . . . , X_m be independent random variables drawn according to a distribution D with mean µ and such that 0 ≤ X_k ≤ 1 almost surely for all k ∈ [m]. Then, for γ ∈ [0, 1/µ − 1] it holds that
P[ µ̂ ≥ (1 + γ) µ ] ≤ e^{−mµγ²/3},
P[ µ̂ ≤ (1 − γ) µ ] ≤ e^{−mµγ²/2},
where µ̂ = (1/m) ∑_{i=1}^m X_i.
Let Y_1, . . . , Y_m be i.i.d. random variables distributed according to D_e. Further, let for k ∈ [m]
Z_k := 1_{{x_2, . . . , x_d}}(Y_k).
It is clear that E(Z_k) = 8e. Assuming that 8e ≤ 1/2, we can apply Theorem 6.2 with γ = 1 to obtain
P( ∑_{i=1}^m Z_i ≥ 16em ) ≤ e^{−8em/3}.  (20)
Now notice that if a sample S = (Y_1, . . . , Y_m) is not in 𝒮, then the associated (Z_1, . . . , Z_m) must satisfy ∑_{i=1}^m Z_i > (d − 1)/2. Therefore,
1 − P(𝒮) ≤ P( ∑_{i=1}^m Z_i ≥ (d − 1)/2 ).
8 Lecture 8 - Model Selection
How do we choose an appropriate hypothesis set or learning algorithm for a given problem?
For a given binary hypothesis class H and a function h ∈ H, we have that
R(h) − R* = ( R(h) − inf_{g∈H} R(g) ) + ( inf_{g∈H} R(g) − R* ),  (21)
where the first term is the estimation error and the second term the approximation error, and R* is the Bayes error of Definition 3.2. See Figure 9 for a visualisation of (21).
Note that h_S^ERM does not need to exist, but if S is finite and Y is too, as in the binary classification case, then it is easy to see that h_S^ERM is well defined.
We have that the empirical risk minimiser inflicts a small estimation error if the generalisation error is
small.
Proposition 8.1. Let H be a hypothesis set and S be a sample. Then we have that
P( R(h_S^ERM) − inf_{h∈H} R(h) > e ) ≤ P( sup_{h∈H} |R(h) − R̂_S(h)| > e/2 ).  (22)
Proof. For every δ > 0, there exists h_δ ∈ H such that R(h_δ) − inf_{h∈H} R(h) < δ. Therefore, we have that
P( R(h_S^ERM) − inf_{h∈H} R(h) > e ) ≤ P( R(h_S^ERM) − R(h_δ) > e − δ ),  (23)
for all δ > 0.
Moreover,
P( R(h_S^ERM) − R(h_δ) > e − δ ) ≤ P( 2 sup_{h∈H} |R(h) − R̂_S(h)| > e − δ ),  (24)
since R(h_S^ERM) − R(h_δ) = ( R(h_S^ERM) − R̂_S(h_S^ERM) ) + ( R̂_S(h_S^ERM) − R̂_S(h_δ) ) + ( R̂_S(h_δ) − R(h_δ) ) ≤ 2 sup_{h∈H} |R(h) − R̂_S(h)|, where we used that R̂_S(h_S^ERM) ≤ R̂_S(h_δ).
Since the left-hand side of (24) is independent of δ, we obtain the claim from the continuity of measures.
We saw before that we can control the right hand side of (22), if the VC dimension of H is bounded.
Thereby, (22) yields a bound on the estimation error. However, requiring a small VC dimension does not
let us take a very large hypothesis space. This implies that we may have a large approximation error.
One idea is to consider a nested sequence of hypothesis sets
H_1 ⊂ H_2 ⊂ · · · ⊂ H_k ⊂ · · · .
The approximation error will decrease (or at least not increase) for growing k, while the estimation error decreases with decreasing k. The idea is shown in Figure 10.
Figure 10: Visualisation of structural risk minimisation, where h∗ is the Bayes classifier.
Structural risk minimisation is a method to choose an appropriate value of k. Here one employs a penalty for large k.
Definition 8.2. Let (H_k)_{k=1}^∞ be a sequence of hypothesis sets and let S be a sample. Then, the solution of structural risk minimisation is
h_S^SRM := arg min { F_k(h) : k ∈ N, h ∈ H_k },
where
F_k(h) := R̂_S(h) + R_m(H_k) + √( log k / m ).
We have the following learning guarantee for SRM:
Theorem 8.1. Let δ > 0, let (H_k)_{k=1}^∞ be a sequence of hypothesis sets, H := ∪_{k∈N} H_k, and let D be a distribution. For h ∈ H, let k(h) denote the smallest index k such that h ∈ H_k. Then, for every h ∈ H, it holds with probability at least 1 − δ that
R(h_S^SRM) ≤ R(h) + 2 R_m(H_{k(h)}) + √( log k(h) / m ) + √( 2 log(3/δ) / m ).
Proof. We first remind ourselves of Theorem 4.1, where we found that with probability at least 1 − δ,
R(h) ≤ R̂_S(h) + R_m(H) + √( log(1/δ) / (2m) ),
or, equivalently,
P( R(h) − R̂_S(h) − R_m(H) > δ ) ≤ e^{−2mδ²}.  (25)
Now we compute for an arbitrary h ∈ H:
P( R(h_S^SRM) − R(h) − 2 R_m(H_{k(h)}) − √( log k(h) / m ) > e )
[Equation (27)] ≤ P( R(h_S^SRM) − F_{k(h_S^SRM)}(h_S^SRM) > e/2 ) + P( F_{k(h_S^SRM)}(h_S^SRM) − R(h) − 2 R_m(H_{k(h)}) − √( log k(h) / m ) > e/2 )
[Equation (26)] ≤ 2 e^{−me²/2} + P( F_{k(h_S^SRM)}(h_S^SRM) − R(h) − 2 R_m(H_{k(h)}) − √( log k(h) / m ) > e/2 )
≤ 2 e^{−me²/2} + P( F_{k(h)}(h) − R(h) − 2 R_m(H_{k(h)}) − √( log k(h) / m ) > e/2 )
≤ 2 e^{−me²/2} + P( R̂_S(h) − R(h) − R_m(H_{k(h)}) > e/2 )
[Equation (25)] ≤ 3 e^{−me²/2}.
8.3 Cross-validation
Definition 8.3. Let (H_k)_{k=1}^∞ be a sequence of hypothesis sets. Let α ∈ (0, 1) and S = (x_i, y_i)_{i=1}^m be a sample. Then, the solution of cross-validation is
h_S^CV := arg min { R̂_{S_2}(h_{S_1,k}^ERM) : k ∈ N },
where S_1 = (x_i, y_i)_{i=1}^{m′} for m′ = ⌈(1 − α)m⌉, S_2 = (x_i, y_i)_{i=m′+1}^{m}, and h_{S_1,k}^ERM is the empirical risk minimiser over the hypothesis class H_k with sample S_1.
In words, cross-validation consists in setting aside a validation set on which the loss is measured, but
which is not used for training.
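A minimal sketch of this hold-out procedure for polynomial hypothesis classes (the data generation and the candidate degrees are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(2*np.pi*x) + 0.1*rng.standard_normal(100)

alpha = 0.2                                   # fraction of the sample used for validation
m1 = int(np.ceil((1 - alpha)*len(x)))
x1, y1, x2, y2 = x[:m1], y[:m1], x[m1:], y[m1:]

degrees = [1, 2, 3, 5, 10, 20]
val_err = []
for k in degrees:
    p = np.poly1d(np.polyfit(x1, y1, k))      # empirical risk minimiser on S1 (squared loss)
    val_err.append(np.mean((p(x2) - y2)**2))  # validation error on S2
print('selected degree:', degrees[int(np.argmin(val_err))])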
The following proposition will be stated without proof. It shows that with high probability, the empirical
risk with respect to S2 is close to the expected risk. The proof is based on Hoeffding’s inequality.
Proposition 8.2. Let (H_k)_{k=1}^∞ be a sequence of hypothesis sets. Let α ∈ (0, 1), and let S ∼ D^m and S_1 be as in Definition 8.3. Then it holds that
P( sup_{k≥1} ( | R(h_{S_1,k}^ERM) − R̂_{S_2}(h_{S_1,k}^ERM) | − √( log k / (αm) ) ) > e ) ≤ 4 e^{−2αme²}.  (28)
Based on the result above, we can show that cross-validation can often perform very similarly to structural
risk minimisation.
Theorem 8.2. Let (H_k)_{k=1}^∞ be a sequence of hypothesis sets. Let α ∈ (0, 1), and let S ∼ D^m and S_1 be as in Definition 8.3. For every δ ∈ (0, 1) it holds with probability at least 1 − 2δ that
R(h_S^CV) − R(h_{S_1}^SRM) ≤ 2 √( log( max{ k(h_S^CV), k(h_{S_1}^SRM) } ) / (αm) ) + 2 √( log(4/δ) / (2αm) ).
Again, for k = k(h_{S_1}^SRM) it holds that h_{S_1}^SRM is the empirical risk minimiser over H_k. Therefore, we get from Proposition 8.2 that with probability 1 − 2δ
II ≤ R(h_{S_1}^SRM) + √( log(k(h_S^CV)) / (αm) ) + √( log(k(h_{S_1}^SRM)) / (αm) ) + 2 √( log(4/δ) / (2αm) )
   ≤ R(h_{S_1}^SRM) + 2 √( max{ log(k(h_S^CV)), log(k(h_{S_1}^SRM)) } / (αm) ) + 2 √( log(4/δ) / (2αm) ).
If αm is not too small, i.e., when the validation set is large, then we achieve similar results with cross
validation to those achieved with structural risk minimisation with sample S1 . However, if this means
that S1 is very small, then, we do not benefit. Hence, the right choice of α is crucial.
Definition 9.2. Let L be a loss function on Y × Y. Let h ∈ H, and let S := (x_i, y_i)_{i=1}^m be a training sample. The empirical risk is defined as
R̂_{S,L}(h) = (1/m) ∑_{i=1}^m L(h(x_i), y_i).
Note that Theorem 3.1 yields generalisation bounds for these loss functions.
Example 9.1. Some loss functions that are quite frequently used:
• The 0-1 loss: L_{0−1}(y_1, y_2) = 1_{y_1 ≠ y_2}. We have used this everywhere until now. Used if Y = {a, b} for a ≠ b.
• The quadratic loss: L_2(y_1, y_2) = ∥y_1 − y_2∥². Used if Y is a normed space such as R^d, d ∈ N.
• Cross entropy loss/log-likelihood loss: L_CE(y_1, y_2) = −( y_1 log(y_2) + (1 − y_1) log(1 − y_2) ). Used if Y ⊂ [0, 1], y_1 ∈ {0, 1}, y_2 ∈ (0, 1).
• Hinge loss: L_H(y_1, y_2) = max{ 0, 1 − y_1 y_2 }. Used if Y ⊂ [−1, 1].
â = (X^T X)^{−1} X^T y,
where X denotes the data matrix and y the vector of labels.
A small generalisation of linear regression is polynomial regression or, more generally, basis regression. Let (h_k)_{k=1}^K be linearly independent functions such that span((h_k)_{k=1}^K) = H ⊂ {X → R} for an arbitrary linear space X. Then finding
arg min_{h∈H} (1/m) ∑_{i=1}^m ∥h(x_i) − y_i∥²
is equivalent to finding
arg min_{a∈R^K} (1/m) ∑_{i=1}^m | ∑_{k=1}^K a_k h_k(x_i) − y_i |².  (31)
[2]: import numpy as np
     import seaborn as sn
     import matplotlib.pyplot as plt

     x = np.arange(0, 1, 0.01)   # reconstructed: 100 grid points, matching the array below
     sines = np.zeros([100,10])
     for k in range(10):
         sines[:,k] = np.sin(2*(k+1)*np.pi*(x + 0.1)) # small shift to not have all start at 0.

[86]: num_points = 15   # reconstructed; the same value is used in the regularised experiment below
      x_dat = np.arange(0,1,1/num_points)
      y_dat = np.sin(2*np.pi*(x_dat + 0.1)) # this data should be very easy to fit.
      hx_dat = np.zeros([num_points,10])
      for k in range(10):
          hx_dat[:, k] = np.sin(2*(k+1)*np.pi*(x_dat + 0.1))
      # reconstructed: solve the basis-regression problem (31) by least squares
      a = np.linalg.lstsq(hx_dat, y_dat, rcond=None)[0]
plt.figure(figsize = (15, 5))
plt.subplot(1,3,1)
plt.plot(x, np.dot(sines,a))
plt.scatter(x_dat, y_dat, c = 'r')
[108]:
Polynomial/basis regression does not seem to be very stable under very small changes of a single data point. Stability, however, seems to be a quite desirable property if we want to generalise well. We make this more precise in the next chapter.
Indeed, we have the following theorem.
Theorem 9.1. Let D be a distribution and S ∼ D^m. Let U be the uniform distribution on [m]. Further, let L be a loss function. Then we have that, for every learning algorithm A,
E( R_L(A(S)) − R̂_{S,L}(A(S)) ) = E E_{i∼U}( L(A(S^i)(x_i), y_i) − L(A(S)(x_i), y_i) ),  (32)
where S^i denotes the sample S with its i-th element (x_i, y_i) replaced by an independent copy, and where the last equality in the underlying computation follows by swapping s_i and s′_i. Moreover, we have that
E( R̂_{S,L}(A(S)) ) = E( (1/m) ∑_{i=1}^m L(A(S)(x_i), y_i) ) = E E_{i∼U}( L(A(S)(x_i), y_i) ).
The result now follows from the linearity of the expected value.
We will address this problem in the next chapter. Beforehand, let us fix the concept of stability used in Theorem 9.1 in the form of a definition.
Definition 9.3. Let κ : N → R be monotonically decreasing. A learning algorithm A is on-average-replace-one-stable with rate κ if for every distribution D and every m ∈ N it holds that
E_{S∼D^m, i∼U}( L(A(S^i)(x_i), y_i) − L(A(S)(x_i), y_i) ) ≤ κ(m),  (33)
where S^i is as above.
Choosing L to be the 0-1 loss, H := ∪_{k∈N} H_k for a sequence of hypothesis sets (H_k)_{k∈N}, and
r(θ) := R_m(H_{k(h_θ)}) + √( log k(h_θ) / m )
shows that structural risk minimisation is a special case of regularised risk minimisation.
9.5 Tikhonov regularisation
If we have a hypothesis class H = (h_θ)_{θ∈Θ} where Θ ⊂ R^d, then we call the regulariser
r_Tikh,λ : Θ → R,  r_Tikh,λ(θ) := λ∥θ∥²
the Tikhonov regulariser. Here ∥·∥ is the Euclidean norm. We will see below that this norm can be replaced by any sufficiently convex norm and Θ can be a general normed space.
We are now interested in finding out under which condition regularised risk minimisation with the
Tikhonov regulariser admits generalisation bounds. We first study the convexity of r Tikh,λ .
Definition 9.5. For a normed space X, we say that a function f : X → R is strongly λ-convex if for all x_1, x_2 ∈ X and all α ∈ (0, 1) it holds that
f(αx_1 + (1 − α)x_2) ≤ α f(x_1) + (1 − α) f(x_2) − (λ/2) α(1 − α) ∥x_1 − x_2∥².
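For example (an illustration not from the original notes): the squared norm f(x) = λ∥x∥² is strongly 2λ-convex, since expanding the squares gives the identity λ∥αx_1 + (1 − α)x_2∥² = αλ∥x_1∥² + (1 − α)λ∥x_2∥² − λα(1 − α)∥x_1 − x_2∥², which is exactly the defining inequality (with equality) for strong 2λ-convexity.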
Figure 11: Example of a strongly convex function.
If x is a minimiser of a strongly λ-convex function f, then for every y,
f(y) − f(x) ≥ (λ/2) ∥x − y∥².
With these observations in place, we can state the following result:
Proposition 9.1. Assume that L is a loss function which is ρ-Lipschitz in the first coordinate and is such that θ ↦ L(h_θ(x), y) is convex for every (x, y) ∈ X × Y. Let S ∼ D^m and λ > 0, and let A(S) = h_{S,L}^RRM, where h_{S,L}^RRM is the solution of the regularised risk minimisation with regulariser r_Tikh,λ.
Then A is on-average-replace-one-stable with rate 2ρ²/(λm). In particular,
E( R_L(h_{S,L}^RRM) − R̂_{S,L}(h_{S,L}^RRM) ) ≤ 2ρ²/(λm).  (34)
Proof. We write R(θ; S) = R̂_{S,L}(h_θ). Then, by Points 1 and 2 of Lemma 9.1, we conclude that θ ↦ R(θ; S) + r_Tikh,λ(θ) is 2λ-strongly convex. By Point 3 of Lemma 9.1, we conclude that for every θ′ and if θ′′ is the minimiser of R(·; S) + r_Tikh,λ(·).
Of course, (40) holds when replacing x_i and y_i by x′_i and y′_i, respectively. If we plug this equation into (39), we obtain
∥θ′′ − θ′∥ ≤ 2ρ/(λm).
If we combine this estimate with (40), then we conclude that
L(h_{θ′}(x_i), y_i) − L(h_{θ′′}(x_i), y_i) ≤ 2ρ²/(λm).
This implies the on-average-replace-one-stability of A with rate 2ρ²/(λm). The "in particular" part of the statement follows from Theorem 9.1.
Lipschitz continuity of the loss may sometimes be a bit much to ask. For example the very frequently
used square loss is not Lipschitz continuous in its input (unless it is restricted to a compact set). The
result holds under weaker conditions that include the square loss, too.
Corollary 9.1. Assume that L is a loss function which is ρ-Lipschitz in the first coordinate and is such that θ ↦ L(h_θ(x), y) is convex for every (x, y) ∈ X × Y. Assume further that Θ is contained in a ball of radius B > 0. Let S ∼ D^m, let λ > 0 be chosen appropriately (depending on ρ, B, and m), and let h_{S,L}^RRM be the solution of the regularised risk minimisation with regulariser r_Tikh,λ. Then, for every e > 0,
P( R_L(h_{S,L}^RRM) − min_{θ∈Θ} R_L(h_θ) > e ) ≤ ρB √( 8/(m e²) ).
Let us end this section by looking at the regularised risk minimisation applied to the regression problem
from above:
[134]: num_points = 15
lmbda = 0.0000001
x_dat = np.arange(0,1,1/num_points)
y_dat = np.sin(2*np.pi*(x_dat + 0.1)) # this data should be very easy to fit.
hx_dat = np.zeros([num_points,10])
for k in range(10):
hx_dat[:, k] = np.sin(2*(k+1)*np.pi*(x_dat + 0.1))
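The regularised fit itself is not shown in the excerpt; a minimal sketch of Tikhonov-regularised least squares for the basis-regression problem (31), reusing the variables above and with a_reg as a hypothetical variable name, is:

# minimise (1/num_points)*||hx_dat @ a - y_dat||^2 + lmbda*||a||^2; the normal equations give:
a_reg = np.linalg.solve(hx_dat.T @ hx_dat + num_points*lmbda*np.eye(10), hx_dat.T @ y_dat)

plt.plot(x, np.dot(sines, a_reg))
plt.scatter(x_dat, y_dat, c='r')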
10 Lecture 10 - Freezing Fritz
Freezing Fritz is a pretty cool guy. He has one problem, though: in his house, it is quite often too cold or too hot during the night. Then he has to get up and open or close his windows or turn the heat up or down. Needless to say, he would like to avoid this.
However, his flat has three doors that he can keep open or closed, it has four radiators, and four windows.
There is a picture of his home in Figure 12. It seems like there are endless possibilities of prepping the flat
for whatever temperature the night will have.
Fritz does not want to push his luck any longer and has decided to get active. He recorded the temperature outside and inside of his bedroom for the last two years. Now he would like to find a predictor that, given the outside temperature as well as a certain configuration of his flat, tells him how cold or warm his bedroom will become.
Can you help Freezing Fritz find blissful sleep?
Figure 12: The home of Freezing Fritz. Here we see him lying in his bed. It is too cold. There are four radiators, labelled R1-R4, four windows, labelled W1-W4, and three doors, labelled D1-D3. Fritz also owns three plants. It is unclear if they have anything to do with the heat distribution in this place, though.
Let us first look at the situation. Below we see the experiments that Fritz carried out in 8 cases.
[4]: letItFlow([1,1, 1,1], [0,0,0,0], [1,1,1], 0, report = True)
letItFlow([1,1, 1,1], [5,0,0,0], [0,0,1], 0, report = True)
letItFlow([0,0, 0,0], [5,5,5,5], [1,1,1], 0, report = True)
letItFlow([0,0, 0,0], [0,0,0,0], [1,1,1], 0, report = True)
letItFlow([1,1, 1,1], [0,0,0,0], [1,1,0], 10, report = True)
letItFlow([1,1, 0,0], [5,5,0,5], [1,1,0], 10, report = True)
letItFlow([0,1, 0,1], [3,5,2,1], [1,0,1], 10, report = True)
letItFlow([0,0, 0,0], [5,5,5,5], [1,1,1], 20, report = True)
Temperature outside: 0°C
Window one open: False, window two open: False, window three open: False, window four open: False
Door one open: True, door two open: True, door three open: True
Heater one level: 5, heater two level: 5, heater three level: 5, heater four level: 5
Temperature in bed: 20.4°C
Temperature outside: 10°C
Window one open: True, window two open: True, window three open: False, window four open: False
Door one open: True, door two open: True, door three open: False
Heater one level: 5, heater two level: 5, heater three level: 0, heater four level: 5
Temperature in bed: 19.0°C
[4]:
For the experiment, we first load the data from the data sets ’data_train_Temperature.csv’ and
’data_test_Temperature.csv’ that will be supplied on moodle.
[8]: data_train_Temperature = pd.read_csv('data_train_Temperature.csv')
data_test_Temperature = pd.read_csv('data_test_Temperature.csv')
data_train_Temperature.head()
[9]: data_train_Temperature.describe()
Door 3 Temperature Outside Temperature Bed
count 730.000000 730.000000 730.000000
mean 0.480822 7.837429 19.530556
std 0.499975 7.788304 3.867791
min 0.000000 -4.998988 5.869975
25% 0.000000 1.039908 16.772720
50% 0.000000 7.470895 20.015297
75% 1.000000 14.628865 22.617748
max 1.000000 21.988839 28.606276
We use the correlation matrix again to see how each of the parameters of the problem affects the temperature in the bedroom. We also look at how the trade-off between outside and inside temperature is affected by some of the parameters.
plt.figure(figsize = (12,12))
sn.pairplot(data_train_Temperature, vars = ['Temperature Outside', 'Temperature Bed'], kind = 'scatter',␣
,→hue='Window 1')
plt.show()
64
My idea is to interpolate over the data but weigh it according to my observations and domain knowledge.
So I give low weights to windows 2 and 3. Same with door 3. Then I also think that the doors are more
important for the overall value than the individual heaters. My predictor is now a simple weighted
interpolation over 80% of the data training set. I validate on 20% of the training set.
def predict(data):
simple_test_set = data.copy()
#some parameters are more important to fit than others. (In our case, window 1 and doors 1 and 2)
weights = np.ones(12)
weights[0] = 10 # Window 1
weights[1] = 0.1 # Window 2
weights[2] = 1 # Window 3
weights[3] = 10 # Window 4
weights[4] = 1 # Heat 1
weights[5] = 1 # Heat 2
weights[6] = 1 # Heat 3
65
weights[7] = 1 # Heat 4
weights[8] = 10 # Door 1
weights[9] = 10 # Door 2
weights[10] = 1 # Door 3
weights[11] = 1 # Temp Out
for k in range(simple_test_set.shape[0]):
value = 0;
totaldist = 0
for j in range(simple_data_set.values.shape[0]):
value = value + simple_data_set.values[j,-1]/(np.linalg.norm(weights*simple_data_set.
,→values[j,:-1] - weights*simple_test_set.values[k,:-1]))**4
return simple_test_set
[33]: 1.882422855960373
The algorithm was not very sophisticated. Nonetheless I came within an accuracy of 2 degrees on the
validation set. For me this seems acceptable. Maybe you can help Freezing Fritz even more?
I will just store my prediction on the test set now:
Please send your result via email to [email protected]. Your email should include the names
of all people who worked on your code, their student identification numbers, a name for your team, and
the code used. It should also contain one or two paragraphs of a short description of the method you
used.
66
This hypothesis class corresponds to a binary classification problem with Y = {−1, 1}. We say that a
sample S = ( xi , yi )im=1 is linearly separable, if there exist w ∈ Rd \ {0} and b ∈ R such that
or equivalently
yi (hw, xi i + b) ≥ 0 for all i ∈ [m].
Practically, this means that there exists a hyperplane in Rd splitting Rd into two parts, where one contains
all samples labelled −1 and the other contains all points labelled 1. An illustration is given in Figure 13.
Figure 13: Left: Linearly separable data set. Right: Hyperplane that maximises the margin.
Definition 11.1. Let d ∈ N and h be a linear classifier. We define the geometric margin ρh ( x ) of h at a point z
as the Euclidean distance of z to the hyperplane given by { h = 0}. For a sample S ∈ Rd × {−1, 1}, we define the
geometric margin of h as
ρh = min ρh ( xi ).
i ∈[m]
|hw, zi + b|
ρh (z) = . (44)
k w k2
Equation (44) is easily verified by the following argument: Let H 0 := { h = 0}. For x1 , x2 ∈ H 0 we have
that hw/kwk2 , x1 i + b/kwk2 = 0 = hw/kwk2 , x2 i + b/kwk2 and hence
0 = hw/kwk2 , x1 − x2 i = 0.
|hz, wi + b|
min kz − x k2 = min |hz, w/kwk2 i − h x, w/kwk2 i| = min |hz, w/kwk2 i + b/kwk2 | = .
x∈ H0 x∈ H0 x∈ H0 k w k2
The support vector machine algorithm returns the hyperplane classifier that maximises the geometric
margin.
Definition 11.2. For d ∈ N, the algorithm ASVM takes a sample S ∈ (Rd × {−1, 1})m and outputs a linear
classifier hSVM = ASVM (S) such that
ρhSVM = max ρh .
h∈H
67
11.2 Primal optimisation problem
We are interested in finding the hyperplane with the maximum geometric margin. By (44), this geometric
margin is given by
|hw, xi i + b|
ρ= max min .
w,b : yi (hw,xi i+b)≥0 i ∈[m] kwk
|hw, xi i + b| y (hw, xi i + b) 1
max min = max min i = max , (45)
w,b : yi (hw,xi i+b)≥0 i ∈[m] kwk w,b i ∈[m] kwk mini∈[m] yi (hw,xi i+b)=1 k w k
where the last equality follows by observing that rescaling of (w, b) by any scalar does not affect the value
of the fraction.
Since increasing kwk will decrease the value (45), we conclude that
1
ρ= max . (46)
mini∈[m] yi (hw,xi i+b)≥1 k w k
Instead of maximising 1/kwk we can minimise 21 kwk2 and we therefore end with the optimisation prob-
lem
1
min kwk2 , (47)
w,b 2
subject to: yi (hw, xi i + b) ≥ 1. (48)
This optimisation problem is strictly convex and the constraints are affine linear. Hence, there exists a
unique solution and efficient solvers to find it.
min f ( x )
x
subject to: gi ( x ) ≤ 0,
∇ x L( x, α) = 0 (49)
(∇α L( x, α))i = gi ( x ) ≤ 0, for all i ∈ [m] (50)
m
∑ αi gi = 0, (51)
i =1
where
m
L( x, λ) = f ( x ) + ∑ λi gi for x ∈ X , λ = (λ1 , . . . , λm ) ∈ (R+ )m .
i =1
68
Note that, (50) and (51) are equivalent to
gi ( x ) ≤ 0 ∧ (∀i ∈ [m], αi gi ( x ) = 0). (52)
Applying Theorem 11.1 to (46) by defining
m
1
L( x, λ) = L(w, b, λ) = kwk22 + ∑ λi [yi (hw, xi i) − 1]
2 i =1
Definition 11.3. Let A be a learning algorithm taking samples S ∈ (X × {−1, 1})m and mapping them to
hypotheses h : X → Y . The leave-one-out error of h on a sample S is defined by
m
bLOO (A, S) := 1 ∑ 1h
R ,
m i=1 S−{si } (xi )6=yi
where S − {si } denotes the sample of size m − 1 which results from S = (s1 , . . . , sm ) by removing si .
In words, the leave one out error of a sample is the mean error committed when training on all elements
of a sample except one and then observing the error on the left out point.
We have that the average leave-one-out error is an unbiased estimator of the risk.
Lemma 11.1. Let A be a learning algorithm taking samples S ∈ (X × {−1, 1})m and mapping them to hypotheses
h : X → Y . Let D be a distribution on X × {−1, 1}. Then it holds that
ES∼D m ( R
bLOO (A, S)) = ES∼D m−1 (R(A(S)))
Proof. From the definition of the leave-one-out error and the linearity of the expected value, we have that
m
bLOO (A, S)) = 1 ∑ ES∼D m 1h
ES∼D m ( R ( x )6 = y
m i =1 S−{si } i i
= ES∼D m 1hS−{s } (x1 )6=y1
1
= ES∼D m−1 Es∼D 1hS (x)6=y
= ES∼D m−1 R(A(S)).
69
Using this lemma, we can now obtain a first generalisation bound for the SVM algorithm.
Theorem 11.2. Let, for a linearly separable sample S ∈ (X × {−1, 1})m , hSVM = ASVM (S) be the hypothesis
returned by the SVM algorithm. Let NSV (S) be the number of support vectors that define hS . Then
NSV (S)
ES∼D m (R(hSVM )) ≤ ES∼ Dm+1 . (55)
m+1
Proof. Let S = ( xi , yi )im=+11 be a linearly separable sample of size m + 1. We observe that if xi is not a
support vector for hS then by (53) and (54), hS = hS−{si } . Since S was linearly separable, we have that
hS ( xi ) = yi for all i ∈ [m + 1]. We conclude that
1 m +1 NSV (S)
bLOO (ASVM , S) =
R ∑
m + 1 i =1
1hS−{s } (xi )6=yi ≤
i m+1
.
Now the right-hand side of (56) makes sense for all real valued functions h. Then, one may interpret the
classification with a general h : X → R so that the sign of h( x ) corresponds to the predicted class and the
magnitude of h( x ) corresponds to the confidence of the classification.
From this observation, we can create a loss function that penalises not only wrong classifications but those
that do not have enough confidence. In words, for ρc > 0 we could define the hard margin loss as
For analysis purposes it is convenient to consider a continuous alternative of the hard margin loss, which
we shall define below:
Definition 12.1. Let ρc > 0. The ρc -margin loss is the function Lρc : R × R → R+ defined as Lρc (y, y0 ) =
Φρc (yy0 ), where y, y0 ∈ R and
1 if x ≤ 0
x
Φρc ( x ) = min 1, max 1 − , 0 = 1 − x/ρc if 0 ≤ x ≤ ρc
ρc
0 if ρc ≤ x.
for x ∈ R
70
0 ρc 1
Figure 14: The functions x 7→ 1x≤ρc (dotted, orange), Φρc ( x ) (solid, green), and x 7→ 1x≤0 (dashed, red).
Using the margin loss function, we can define the associated empirical risk function.
Definition 12.2. For a sample S = ( xi , yi )im=1 , a margin ρc > 0, and h : X → R, we define the empirical margin
loss as
m
b S,ρ (h) := 1 ∑ Φρc (yi h( xi )).
R c
m i =1
1 m m
b S,ρ (h) ≤ 1 ∑ 1y h(x )≤ρ ,
∑
m i =1
1yi h(xi )≤0 ≤ R c
m i =1 i i c
(58)
see also Figure 14. As mentioned before, the reason to introduce the margin loss function is that it is
continuous. More precisely, the fact that it is Lipschitz continuous with Lipschitz constant 1/ρc . The
reason why this is beneficial will become clear from the result below:
Theorem 12.1 (Talagrand’s Lemma). Let Φ : R → R be C -Lipschitz for C > 0. Then for every hypothesis set
H of real-valued functions it holds that for every sample S
b S (Φ ◦ H) ≤ CR
R b S (H),
where Φ ◦ H := {Φ ◦ h : h ∈ H}.
Now we obtain a first margin based generalisation bound.
Theorem 12.2. Let H be a set of real-valued functions. Let D be a distribution on X × {−1, 1}. For ρc > 0 it
holds for all δ > 0 with probability at least 1 − δ over a sample S ∼ D m and for all h ∈ H
r
R(h) ≤ R b S,ρ (h) + Rm (H) + log(1/δ)
2
(59)
c
ρc 2m
r
R(h) ≤ R
2
b S,ρ (h) + R b S (H) + 3 log(2/δ) , (60)
c
ρc 2m
Proof. Let H
e := {(z = ( x, y) 7→ yh( x ) : h ∈ H)}. Next we set H b := Φρ ◦ H e and observe that H
b contains
c
functions mapping from X × {−1, 1} to [0, 1]. Thus, we can apply Theorem 3.1, which yields that with
probability 1 − δ it holds for all g ∈ Hb:
r
1 m log(1/δ)
E( g ) ≤ ∑
m i =1
g(zi ) + 2Rm (H) +
b
2m
. (61)
71
Therefore, for every h ∈ H
r
log(1/δ)
E(x,y)∼D (Φρc (yh( x ))) ≤ R
b S,ρ (h) + 2Rm (H)
c
b + . (62)
2m
We have that for all x ∈ R, 1x≤0 ≤ Φρc ( x ) and hence
We conclude that
r
log(1/δ)
R(h) ≤ R
b S,ρ (h) + 2Rm (H)
c
b + .
2m
Next, we apply Theorem 12.1 which yields that
r
b S,ρ (h) + 2 Rm (H) log(1/δ)
R(h) ≤ R c
e + ,
ρ 2m
This yields (59). We can show (60) by following the same steps as above but estimating with the empirical
Rademacher complexity in (61).
As outlined at the beginning of this section, we consider the hypothesis set H of affine linear functions.
Lets compute the empirical Rademacher complexity of this set.
Theorem 12.3. Let X be an inner product space and S ⊂ Br (0) be a sample of size m. Further let H := { x 7→
hw, x i + b : kwk ≤ Λ, |b| ≤ s}. Then
r
2
Rb S (H) ≤ (s + rΛ) .
m
Λ m
s m
≤
m
Eσ ∑ σi xi +
m
Eσ ∑ σi .
i =1 i =1
72
Moreover again by Jensens inequality and the independence of the σi , we conclude that
" #!2 2
m m m
Eσ ∑ σi xi ≤ Eσ ∑ σi xi = ∑ k xi k2 ≤ mr2 .
i =1 i =1 i =1
We conclude that
Λ m
s m
Λr + s
m
Eσ ∑ σi xi + m
Eσ ∑ σi ≤ √ .
m
i =1 i =1
Now we can combine Theorems 12.3 and 12.2 to obtain the following generalisation bound for affine
linear classifiers.
Corollary 12.1. Let H = { x 7→ hw, x i + b : kwk ≤ Λ, |b| ≤ s} for Λ, s > 0. Assume further, that X is a subset
of an inner product space and all elements of X have norm bounded by r. Let D be a distribution on X × {−1, 1}.
Then, ρc > 0 and for all δ > 0 it holds with probability 1 − δ over a sample S ∼ D m :
r r
(s + rΛ)2 /$2 log(1/δ)
R(h) ≤ RS,ρc (h) + 2
b + ,
m 2m
for all h ∈ H.
Now let S ⊂ Br (0) be a linearly separable sample with geometric margin ρ and let hS = ASV M (S). Since
S is linearly separable, we can assume that the separating hyperplane passes through Br (0). Let z ∈ Br (0)
be such that hS (z) = 0, i.e. z lies on the hyperplane.
We have that
hS ( x ) = sign(hw, x i + b) = sign(hw, x + zi) = hS0 ( x ),
where hS = ASV M (S0 ) and S0 = ( xi0 , yi )im=1 = ( xi + z, yi )im=1 results from shifting S by z. Since S was
linearly separable with geometric margin ρ we have by (46) that kwk = 1/ρ. Furthermore, S0 ⊂ B2r (0).
Moreover, by the definition of a geometric margin, we have that
for all i ∈ [m] and hence, Φ1 (hw, xi0 iyi ) = 0 for all i ∈ [m]. Corollary 12.1, therefore implies that with
probability 1 − δ over the choice of S we have that
s r
(2r )2 log(1/δ)
RD (hS ) = E(x,y)∼D (1hS0 (x−z)y≤0 ) ≤ 2 + , (63)
mρ2 2m
where we define the right-hand side as ∞ if the sample is not linearly separable. Let us note this as a final
corollary.
Corollary 12.2. For all distributions D on X × {−1, 1}, where X is contained in a ball of radius r in a inner
product space, it holds that for all δ > 0 with probability 1 − δ:
s r
r2 log(1/δ)
RD ( hS ) ≤ 4 2
+ ,
mρ 2m
73
The estimate of Corollary 12.2 is remarkable in the sense that it does not seem to depend in any way on
X . We have seen earlier that the VC dimension of linear classifiers depends on the dimension d. Hence,
the VC-dimension-based generalisation bound of Corollary 16.1 would not be dimension independent.
Note though, that we have seen in Theorem 6.1 that the VC dimension based bounds are optimal in the
sense that for a carefully designed distribution we cannot significantly outperform the upper bounds. We
know now that these bad distributions cannot be such that they only generate linearly separable samples
with large geometric margins.
Again, we would like (65) to hold with a large margin ρ = kw1 k , which in this case, we call soft-margin.
In addition, we now want the slack variables to be as small as possible. In total, we end up with the
optimisation problem:
m
1
min kwk2 + ∑ ξ i
p
w,b,ξ 2 i =1
subject to yi (hw, xi i + b) ≥ 1 − ξ i and ξ i ≥ 0,
74
where p ≥ 1. Similarly to the separable case one can show that the minimiser of the optimisation problem
depends only on few sample points. In this case, these points are the support vectors, such that equality
holds in (65).
Figure 16: Left:: Original not linearly separable problem for a sample S = ( xi , yi )31
i =1 . Right: The sample
(( xi , 1 − 5| xi |), yi )31
i =1 . This seems to be separable.
75
For a given transformation ψ as above Kψ ( x, x 0 ) = hψ( x ), ψ( x 0 )iZ is a kernel. Often one can, however,
also start with a kernel for which then a transform exists.
Theorem 14.1 (Mercer’s condition). Let X ⊂ Rn be compact and let K : X × X → R be a continuous and
symmetric function. Then, there exist φn : X → R and an > 0 such that
∞
K ( x, x 0 ) = ∑ an φn (x)φn (x0 ), (66)
n =0
Theorem 14.1 shows that we do not need to find a specific ψ and an associated Hilbert space as long as we
are content with the existence of these objects. We can design a kernel instead. This kernel should satisfy
that K( x, x 0 ) is large, if x, x 0 should be classified similarly.
Having a kernel that can be efficiently computed saves us from computing scalar products between high-
dimensional feature embeddings of values x, x 0 ∈ X . Nonetheless, to apply the SVM algorithm we need
to compute inner products between ψ( x ) and a vector w in the Hilbert space. To compute these scalar
products efficiently, we make use of the representer theorem.
Consider the following optimisation problem:
w∗ = w̄ + u,
where w̄ ∈ span{ψ( xi ) : i ∈ [m]} and u ⊥ span{ψ( xi ) : i ∈ [m]}. By the Pythagorean theorem it holds that
by construction. Hence, if there exists an optimal solution of (67) in Z , then there exists also one in
span{ψ( xi ) : i ∈ [m]}.
76
Based on Theorem 14.2, we can now compute for w = ∑im=1 αi ψ( xi )
m
hw, ψ( xi )i = ∑ α j hψ(x j ), ψ(xi )i (69)
j =1
m m
kwk2 = hw, wi = ∑ ∑ α j αi hψ(x j ), ψ(xi )i. (70)
j =1 i =1
If now K( x, x 0 ) = hψ( x ), ψ( x 0 )i, then the SVM problem on Z corresponds to the minimisation of
1 T
minα Gα
α∈R ,b∈R 2
m
subject to yi (( Gα)i + b) ≥ 1,
where Gi,j = K( xi , x j ) is the so-called Gram matrix of the kernel. The optimisation problem above predicts
for an new sample x:
! !
N N
sign(hw, ψ( x )i + b) = sign ∑ αi hψ(xi ), ψ(x)i + b = sign ∑ αi K(xi , x) + b .
i =1 i =1
b S (H) ≤ Λ s rΛ + s
q
R Tr (K) + √ ≤ √ , (71)
m m m
Proof. The proof is very similar to that of Theorem 12.3, but we take the effect of the kernel into account.
We have per definition that
" * + #
m m
1
b S (H) = Eσ
R sup w, ∑ σi ψ( xi ) + ∑ σi b
m kwk≤Λ,|b|≤s i =1 i =1
!
Λ m
s m
≤ Eσ
m ∑ σi ψ( xi ) + Eσ ∑ σi .
m
i =1 i =1
77
An application of Jensen’s inequality and the independence of σi and σj for all i 6= j yields
1/2 1/2
2 2
Λ m
s m
b S (H) ≤ Eσ
R
m ∑ σi ψ(xi ) +
m
Eσ ∑ σi
i =1 i =1
!!1/2 !1/2
Λ m
s m
∑ kψ(xi )k Eσ ∑ kσi k
2 2
≤ Eσ +
m i =1
m i =1
!!1/2
Λ m
s
≤
m
Eσ ∑ K(xi , xi ) +√
m
i =1
Λ s rΛ + s
q
= Tr (K) + √ ≤ √ .
m m m
We can apply Theorem 12.2 to Proposition 14.1 which yields the following generalisation bound for large
margin kernel SVM classifiers:
Theorem 14.3. Let K : X × X → R be a kernel such that for a function ψ : X → Z , where Z is a Hilbert space,
it holds that K( x, x 0 ) = hψ( x ), ψ( x 0 )i for all x, x 0 ∈ X . Let r2 := supx∈X K( x, x ). We denote H := { x 7→
hw, ψ( x )i + b : kwkZ ≤ Λ, |b| ≤ s}. Let D be a distribution on X × {−1, 1}. For ρc > 0 it holds for all δ > 0
with probability at least 1 − δ over a sample S ∼ D m and for all h ∈ H
r
2 ( rΛ + s ) log(1/δ)
R(h) ≤ R b S,ρ (h) +
c
√ + (72)
ρc m 2m
( x 0 )21
x12
2 0 2
√ x2 √ ( x0 )2 0
* +
√2x1 x2
2( x )1 ( x )2
√
K( x, x 0 ) = ( x1 x10 + x2 x20 + c)2 = ,
0
√2cx1 √2c( x 0 )1
2cx2 2c( x )2
c c
• Gaussian/Radial basis function kernel: For any constant σ > 0, the Gaussian or RBF kernel over R N is
defined as
0 2 2
K( x, x 0 ) = e−kx−x k /(2σ )
It can be shown, see [5], that the Gaussian kernel satisfies the assumptions of Theorem 14.1 and
there is a ψ and an infinite dimensional feature space Z such that K( x, x 0 ) = hψ( x ), ψ( x 0 )iZ .
78
14.4 A numerical example
We use the following helper function to plot the decision regions of our SVM classifiers. This is taken
from the Python Data Science Handbook by Jake VanderPlas
if ax is None:
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
We will now apply the kernel SVM to three different data sets. We start with the moons data set, which
consists of two interleaving half-circles.
79
[109]: def plot_svm_moons(N, kernel):
X,y = make_moons(n_samples=N, shuffle=True, noise=0.05)
ax = plt.gca()
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='Dark2')
plot_svc_decision_function(model, ax)
N = 200
plt.figure(figsize = (18, 6))
plt.subplot(131)
plot_svm(N, 'rbf')
plt.subplot(132)
plot_svm(N, 'linear')
plt.subplot(133)
plot_svm(N, 'poly')
Next, we look at the performance on one of the most famous data sets in data science, the iris data set.
This data set contains as features the lengths of two leaves of iris flowers. The associated labels are one
of three flower types: "iris setosa’, ‘iris versicolor’, ‘iris virginica’. Since our support vector classifier only
performs binary classification, we combine iris versicolor, iris virginica to one class.
X = X[:N, :]
y = y[:N]
ax = plt.gca()
80
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='Dark2')
ax.set_xlabel(load_iris().feature_names[1])
ax.set_ylabel(load_iris().feature_names[2])
plot_svc_decision_function(model, ax)
N = 200
plt.figure(figsize = (18, 6))
plt.subplot(131)
plot_svm_iris(N, 'rbf')
plt.subplot(132)
plot_svm_iris(N, 'linear')
plt.subplot(133)
plot_svm_iris(N, 'poly')
Finally, we look at a non linearly separable data set, which is the wine data set. It contains labelled data
of three types of wine of which we combine the later two. The data is in the form of specific measurable
characteristics of the wine, such as the alkohol or magnesium content.
features = [0, 6]
X = load_wine().data[:, features]
y = load_wine().target<1
X = X[:N, :]
y = y[:N]
ax = plt.gca()
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='Dark2')
ax.set_xlabel(load_wine().feature_names[features[0]])
ax.set_ylabel(load_wine().feature_names[features[1]])
plot_svc_decision_function(model, ax)
81
N = 200
plt.figure(figsize = (18, 6))
plt.subplot(131)
plot_svm_wine(N, 'rbf')
plt.subplot(132)
plot_svm_wine(N, 'linear')
plt.subplot(133)
plot_svm_wine(N, 'poly')
ρ( x, xπ (x)(i) ) ≤ ρ( x, xπ (x)(i+1) ).
hkNN
S ( x ) := A(S)( x ) := majority label of (yπ (x)(i) )i∈[k] .
Here the majority label is that value y appearing to most in the sequence (yπ (x)(i) )i∈[k] .
Of course, we do not have to take the majority label in the definition of A(S) but could take an average,
such as the mean, of the observed labels, if we want to perform regression instead of classification.
82
Figure 17: Sketch of a 1-nearest neighbour classifier. The blue and yellow dots are from the training set.
The points separate the input space into regions that are closest to one data point. These regions are called
Voronoi regions.
We conclude that
!
E ∑ P( Q i ) ≤ r max P( Qi )e−P(Qi )m .
i =1,...,r
(74)
i : Qi ∩ S = ∅
and has therefore one root at x ∗ = 1/m. It is not hard to see that this is a maximum of h. Since h( x ∗ ) =
1/(em), we conclude that h( x ) ≤ 1/(em). Applying this observation to (74), we obtain the result.
83
Using the lemma above, we can now prove a generalisation bound for the one-nearest-neighbour classi-
fier, if the underlying concept class is Lipschitz continuous.
Theorem 15.1. Let X = [0, 1]d , Y ⊂ [0, 1] and let D be a distribution on X . Let c be a C1 -Lipschitz continuous
target concept and L be a C2 -Lipschitz loss function bounded by 1. It holds that for a sample S ∼ D m with
probability 1 − δ
√
1 1
ES R L (hS ) ≤ 2 dC1 C2 +
1NN
m − d +1 .
e
Proof. For x ∈ X and a sample S ∈ X m , we denote by πS ( x ) the closest element to x in S. Then, h1NN
S (x) =
c(π ( x )). We have that
ES R L (h1NN
S ) = ES EL(h1NN
S ( x ), c( x ))
= ES E L h1NN
S ( x ) , c ( x ) 1 k x −π ( x )k∞ ≤e + E S E L h 1NN
S ( x ) , c ( x ) 1 k x −π ( x )k∞ >e
≤ ES E L h1NN
S ( x ), c( x ) 1kx−π (x)k∞ ≤e + ES E 1kx−π (x)k∞ >e =: I + II. (75)
Figure 18: The domain X can be covered by Md cubes of side length 1/M.
S
i : Qi ∩ S = ∅ Qi ). With a union bound and Lemma 15.1, we obtain that
Md 2d e − d
II = ES P(k x − πS ( x )k∞ > e) ≤ ≤ .
em em
84
1
Choosing e = m− d+1 yields that
√ − 1 d
√ 1
ES R L (h1NN
S ) = I + II ≤ C1 C2 dm d+1 + m d+1 /em ≤ C1 C2 d + e m − d +1 ,
Remark 15.1. Note that the generalisation bound of 1-nearest neighbour classification deteriorates exponentially
fast with increasing dimension. This is one instance of the so-called curse of dimension.
Φ : Rd → R
N
Φ( x ) = ∑ ci $(hai , xi + bi ) + e,
i =1
Figure 19: Sketch of a neural network with input dimension 3 and 6 neurons.
In practice, often deep neural networks are used. These are functions that result from stacking multiple of
these neural networks after another in multiple layers. We will not discuss these types of networks here.
Neural networks form a general class of hypothesis set. If ρ = 21(0,∞) − 1, N = 1, d = 0, and c1 = 1, then
the class of such neural networks is the class of hyperplane classifiers that we have already encountered
in Section 11.
We first would like to understand this set a bit better and in particular the role of the number of neurons
and the activation function. The following result is one of the most famous in neural network theory:
85
Theorem 16.1 (Universal approximation theorem). Let $ : R → R be a sigmoidal function, i.e., $ is continuous
and limx→−∞ $( x ) = 0 and limx→∞ $( x ) = 1. Then, for every compact set K ⊂ Rd and every continuous function
f : K → R and every e > 0, there exists Φ such that
Proof. Assume towards a contradiction that there exists a function f : K → R and an e > 0 such that for
all neural networks Φ with activation function $
sup | f ( x ) − Φ( x )| > e.
x ∈K
Let us denote the set of all neural networks with input dimension d and activation function $ by N N d,$ .
It is clear from the definition that N N d,$ is a subspace of the space of continuous functions on K, which
we denote by C (K ).
By the theorem of Hahn-Banach, there exists a continuous linear functional h ∈ C (K )0 , the dual space of
C (K ), such that
h(Φ) = 0 and h( f ) = 1,
for all Φ ∈ N N d,$ . Furthermore, by the representation theorem of Riesz, there exists a signed Borel
measure µ 6= 0 such that Z
h( g) = g( x )dµ( x ).
K
Since h(Φ) = 0 for all neural networks Φ, it holds in particular for every neural network with one neuron:
x 7→ $(h a, x i + b).
86
By using the linearity of the integral, we conclude that for all a ∈ Rd and b1 , b2 ∈ R
Z
1[b1 ,b2 ) (h a, x i)µ( x ) = 0.
K
Since every univariate continuous function on an interval can be approximated arbitrarily well uniformly
by step functions we conclude that for every g ∈ C (R)
Z Z
g(h a, x i)µ( x ) = g[c1 ,c2 ] (h a, x i)µ( x ) = 0,
K K
where c1 = min{h a, x i : x ∈ K }, c2 = max{h a, x i : x ∈ K }.
In particular, g = sin and cos is possible and by Euler’s formula eix = i sin( x ) + cos( x ), we conclude that
Z Z
ei(ha,xi) µ( x ) = ei(ha,xi) µ̃( x ),
K
for a measure µ̃ on Rd supported on K that coincides with µ on K. We conclude that the Fourier transform
of µ̃ vanishes. This implies that µ̃ and hence µ vanishes, which is a contradiction to the choice of µ.
We see that neural networks are a versatile hypothesis set, since they can represent every continuous
function arbitrarily well, if they are sufficiently large. From a generalisation point of view this is of course
not so exciting since the VC dimension of the set of continuous functions is infinite. We recall from
Theorem 6.1 that an infinite VC dimension prohibits us from learning anything.
In practice, neural networks with only a finite number of neurons are used. Typically the resulting set of
neural networks does then not yield form dense subset of the set of continuous functions. Then, we have
again a chance to learn something. Indeed, we can bound the VC dimension of sets of neural networks.
The definition of VC dimension requires a function class with outputs in Y = {−1, 1}. Thus, we can only
define a VC dimension for the set of NNs with binary output, which we get by composing every NN with
a sign function.
Theorem 16.2 ([1, Theorem 2.1]). Let d, N ∈ N and let $ be a piecewise polynomial function. We denote the set
of neural networks with N neurons input dimension d and activation function $ by F N . It holds that
87
[175]: X, y = make_moons(noise=0.1, random_state=0)
plt.scatter(X[:,0], X[:,1], c = y, cmap = 'coolwarm')
[222]: # We define three models. Always one hidden layer with the relu (x \mapsto \max \{x, 0\}) activation␣
,→function.
# The first model has 2 neurons, the second 5, and the last has 20 neurons.
# We apply a sigmoid to the output for for stability reasons.
model1 = Sequential()
model1.add(Dense(2, input_dim=2, activation='relu'))
model1.add(Dense(1, activation='sigmoid'))
model2 = Sequential()
model2.add(Dense(5, input_dim=2, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))
model3 = Sequential()
model3.add(Dense(20, input_dim=2, activation='relu'))
model3.add(Dense(1, activation='sigmoid'))
# compile the models. Here we need to chose an optimiser. This one is called adam, it is used to
# determine how the training is performed. We do not care in this lecture how this is done.
opt = ks.optimizers.Adam(learning_rate=0.02)
model1.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])
model2.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])
model3.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])
88
# fit the models on the dataset
h=0.2
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z1 = model1.predict(np.c_[xx.ravel(), yy.ravel()])
Z1 = Z1.reshape(xx.shape)
Z2 = model2.predict(np.c_[xx.ravel(), yy.ravel()])
Z2 = Z2.reshape(xx.shape)
Z3 = model3.predict(np.c_[xx.ravel(), yy.ravel()])
Z3 = Z3.reshape(xx.shape)
plt.figure(figsize = (18,5))
plt.subplot(1,3,1)
plt.contourf(xx, yy, Z1, cmap='coolwarm', alpha=.8)
plt.scatter(X[:,0], X[:,1], c = y, cmap = 'coolwarm')
plt.title('2 Neurons')
plt.subplot(1,3,2)
plt.contourf(xx, yy, Z2, cmap='coolwarm', alpha=.8)
plt.scatter(X[:,0], X[:,1], c = y, cmap = 'coolwarm')
plt.title('5 Neurons')
plt.subplot(1,3,3)
plt.contourf(xx, yy, Z3, cmap='coolwarm', alpha=.8)
plt.scatter(X[:,0], X[:,1], c = y, cmap = 'coolwarm')
plt.title('15 Neurons')
Model 1:
4/4 [==============================] - 0s 720us/step - loss: 0.0851 - accuracy:
0.8800
Model 2:
4/4 [==============================] - 0s 680us/step - loss: 0.0351 - accuracy:
89
0.9700
Model 3:
4/4 [==============================] - 0s 1ms/step - loss: 0.0011 - accuracy:
1.0000
90
plt.figure(figsize = (15,15))
for k in range(16):
plt.subplot(4,4,k+1)
plt.imshow(data_train[k,:,:], cmap= cmap)
plt.title(Emotions[int(labels[k])])
Our biggest problem is to deal with the massive input dimension of 35 × 35.
My solution is to use a very simple algorithm. Other solutions could involve reducing the dimension in a
smart way and then applying tools from earlier.
91
[98]: # we split the training set into a train and a validation set:
data_train_split = data_train[0:int(data_train.shape[0]/2), :, :]
lab_train_split = labels[0:int(data_train.shape[0]/2)]
data_validation_split = data_train[int(data_train.shape[0]/2)::, :, :]
lab_validation_split = labels[int(data_train.shape[0]/2)::]
[98]: KNeighborsClassifier(n_neighbors=1)
# validation accuracy:
Accuracy: 0.8646
Let us have a look at the misclassified data points to see if there is something conspicuous about them.
plt.figure(figsize = (15,15))
for k in range(16):
plt.subplot(4,4,k+1)
plt.imshow(data_validation_split[mistakes[k],:,:], cmap= cmap)
plt.title(Emotions[int(validation_labels[mistakes[k]])])
92
I am very happy with the accuracy on the validation set. I also have no simple explanation why the
faces above were misclassified and therefore no direct way of improving my algorithm. (One notices a
surprisingly high amount of faces with glasses though.) Hence I choose to proceed.
I apply this algorithm to the test set now:
93
18 Lecture 18 - Boosting
Boosting is a type of ensemble method where multiple classifiers/predictors are combined to yield one
more powerful classifier/predictor.
We start with the definition of a weak learning algorithm.
Definition 18.1. Let C be a concept class. A weak PAC learning algorithm is an algorithm A taking samples
S ∈ X m to functions in H ⊂ X × {−1, 1} such that for a γ > 0 there exists a function m : (0, 1) → N, such that
for every δ > 0, all distributions D on X and every target concept c, it holds that
1
PS∼D m RS (A(S)) ≤ − γ ≥ 1 − δ,
2
if m ≥ m(δ).
A weak learning algorithm only needs to be slightly better than the trivial algorithm that predicts
Rademacher random labels.
The idea behind boosting is now to cleverly combine the hypotheses returned by weak learning algo-
rithms to build a stronger algorithm.
Probably the most widely-used boosting algorithm is AdaBoost:
5. for i = 1, . . . , m:
D t (i ) exp(−αt yi ht ( xi ))
6. set D t+1 (i ) := ∑m
.
j=1 D t ( j ) exp(− αt y j ht ( x j ))
Theorem 18.1. Let S = ( xi , yi )im=1 be a sample, let H be a set of base classifiers and assume that in the iteration of
AdaBoost, 0 < et < 1/2 − γ for a fixed γ > 0. Then, for f = ADABOOST (H, S, T )
m
b S ( f ) = 1 ∑ 1 f (x )6=y ≤ e−2γ2 T .
R
m i =1 i i
ft = ∑ αphp
p≤t
1 m − yi f t ( xi )
m i∑
Zt := e ,
=1
94
Since 1h(x)y≤0 ≤ e−yh(x) , we have that
m
b S ( f ) = 1 ∑ 1 f (x )6=y
R
m i =1 i i
1 m
m i∑
= 1 f T ( x i ) y i ≤0
=1
≤ ZT
Z
= T
Z0
ZT Z1
= ... .
ZT − 1 Z0
e − y i f t −1 ( x i )
D t (i ) = − y i f t −1 ( x j )
. (78)
∑m
j =1 e
Since (78) holds for t = 1, we conclude by induction that (78) holds for all t ∈ [ T ].
Now we have that
Zt+1 ∑ m e − y i f t +1 ( x i )
= i=m1 −y f (x )
Zt ∑ i =1 e i t i
∑ m e − y i f t ( x i ) e − y i α t +1 h t +1 ( x i )
= i =1
∑im=1 e−yi f t (xi )
m
= ∑ D t +1 ( i ) e − y α + h + ( x )
i t 1 t 1 i
i =1
=e − α t +1
∑ + e α t +1 ∑
i : y i = h t +1 ( x i ) i : y i 6 = h t +1 ( x i )
= e − α t +1 ( 1 − e t + 1 ) + e α t +1 e t + 1
1 p
=p (1 − et+1 ) + 1/et+1 − 1et+1
1/et+1 − 1
s
1 − e t +1
r
e t +1
= (1 − e t +1 ) + e t +1
1 − e t +1 e t +1
q q q
= et+1 (1 − et+1 ) + (1 − et+1 )(et+1 ) = 2 et+1 (1 − et+1 ).
95
Using 1 − x ≤ e− x yields that
Zt+1 2
≤ e−2γ .
Zt
This completes the proof.
We saw that Adaboost can very quickly reduce the empirical error, if weak learners exist and can be
found quickly. A standard choice for the set of base classifiers is that of so-called decision stumps (this
name comes from the fact that these are decision trees with minimal depth.), which are linear classifiers
acting on a single axis of the data, i.e., for X = R N
H := { x 7→ b · sign( xi − θ ) : θ ∈ R, b ∈ {±1}, i ∈ [ N ]} .
Figure 20: Visualisation of classification with boosting and decision stumps. The top left shows the sam-
ples and the underlying disctibution. The next four panels show successively built sums of decision
stumps.
Note that the set of decision stumps is quite small. In fact, there exist simple distributions so that there
does not exist a PAC learning algorithm with hypothesis set H.
We can ask ourselves how the base class affects the generalisation capabilities of Adaboost. For this, we
observe that the output of Adaboost is an element of the following set:
( ! )
T
L(H, T ) = x 7→ sign ∑ αt ht ( x ) : αt ∈ R, ht ∈ H .
t =1
96
Proposition 18.1. Let H be a base class and let T ∈ N, T ≥ 3. Then
where d := VCdim(H).
| { f (C ) : f ∈ L(H, T )} | ≤ (eT +1 mdT +(T +1) ) = eT +1 m(d+1)T +1 ≤ 22(T +1) m(d+1)(T +1) .
and hence
and thus
19 Lecture 19 - Clustering
Clustering is the act of associating elements of a data set ( xi )im=1 into a number of sets that may or may
not be determined beforehand. In low dimensions, humans have a very good intuition on how to cluster
data points. For example in Figure 21, most people would have a pretty strong opinion on how to cluster
the points into 2 or three sets. However, defining a mathematical rule is typically harder. To perform
clustering numerically, one needs to specify an objective to minimise or a procedure to follow. We will
discuss some examples of such algorithms in this chapter.
97
Figure 21: Six clustering problems
Let us first describe the task of clustering in more mathematical terms. Clustering is a procedure that
maps an input to an output:
• Input: A set X = ( xi )im=1 and a distance function d : X × X → R+ which is symmetric and satisfies
d( x, x ) = 0. Alternatively, a similarity measure s : X × X → [0, 1] can be given with s symmetric and
s( x, x ) = 1.
Sk
• Output: A sequence of disjoint subsets of X denoted by (Ci )ik=1 such that j =1 Ck = X.
How this segmentation of X into the (Ci )ik=1 is performed depends on d or s and is different from algorithm
to algorithm.
are the most similar clusters (we will discuss what this means below). Then
(Ci` )im=−` ` `−1
1 = {Ci : i 6 = i1 , i2 } ∪ {Ci1 ∪ Ci`−
2
1
}. (82)
The notion of "most similar clusters", that was used above can mean many things, depending on the appli-
cation in mind. A couple of examples are listed below:
• Single Linkage clustering: Here we define
d(Ci , Cj ) := min{d( x` , x`0 ) : x` ∈ Ci , x`0 ∈ Cj , ` ∈ [m]}.
98
• Max Linkage clustering: Here we define
µ(Ci ) := argminµ∈coX ∑ d ( x j , µ )2 ,
x j ∈Ci
where ( x j )m
j=1 is the data set.
k
Gk−means (( x j )m
j=1 , d )(C1 , . . . , Ck ) := ∑ ∑ d( x j , µ(Ci ))2 . (83)
i =1 x j ∈Ci
Finding a solution of this problem is NP-hard in general. However, there is a widely used algorithm, that
typically performs well. This is Lloyd’s algorithm and consists of the following steps:
Lemma 19.1. Each iteration of the k-means algorithm does not increase the k-means objective function Gk−means of
(83).
99
1
Proof. It holds that for µ̄(C ) = |C | ∑ x j ∈C x j that for arbitrary λ ∈ co( X )
Hence µ̄(C ) = µ(C ). Let for ` ∈ N and i ∈ [k ], Ci` be as in Lloyd’s algorithm. We have that
k
Gk−means (( x j )m `+1 `+1
j=1 , d )(C1 , . . . , Ck ) = ∑ ∑ k x j − µ(Ci`+1 )k2
i =1 x j ∈C `+1
i
k
≤ ∑ ∑ k x j − µi` k2
i =1 x j ∈C `+1
i
k
≤ ∑ ∑ k x j − µi` k2
i =1 x j ∈ C `
i
k
= ∑ ∑ k x j − µ(Ci` )k2 = Gk−means (( x j )m ` `
j=1 , d )(C1 , . . . , Ck ),
i =1 x j ∈ C `
i
where the first inequality follows by the definition of µ(Ci`+1 ) as a minimum, the second inequality follows
from the definition of µ(Ci` ), and penultimate equality follows by the considerations at the beginning of
the proof.
Figure 23: Two examples of the evolution of means µ1` , µ2` in Lloyd’s algorithm. On the left-hand side, we
observe successful convergence. On the right-hand side, there is no convergence.
100
Remark 19.1. Lloyd’s algorithm is simple and very often effective. However, it comes with a couple of issues.
First of all, convergence is not guaranteed or could take a very long time. An example of a bad initialisation
prohibiting convergence is shown in Figure 23. Moreover, k-NN generally suffers from the issue that all clusters
must necessarily be convex in the sense that if x ∈ co(Ci ) ∩ X, then x ∈ Ci . In some of the examples of Figure 21,
this can be a serious issue. See also Figure 24 for an illustration.
Figure 24: K-means clustering for two data sets. On the left hand side the clustering (with k = 3) was
successful. The problem on the right hand side cannot be clustered correctly (with k = 2), because this
would require non-convex clusters.
101
0 1 1 1 1 0 0 0 0 0
0 1 1 1 1 0 0 0 0
1 0 1 0 0 0 1 0 0 0
1 0 1 0 1 0 0 0 0
1 1 0 1 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0
1 0 1 0 1 0 0 0 0 0
1 0 1 0 1 0 0 0 0
1 0 0 1 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0 and . (84)
0 0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 1 1 1
0 1 0 0 0 1 0 1 0 1
0 0 0 0 0 1 0 1 0
0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 1 1 0 1
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 1 0 1 0
0 0 0 0 0 0 1 0 0 0
Next, we would like to define a measure that allows us to formulate an objective for clustering. A natural
measure is the so-called cut.
Definition 19.3. Let (V, E, W ) be a weighted graph. Let S ⊂ V be a vertex partition. We define the cut of S as
cut(S) := ∑ ∑ wi,j .
vi ∈ S v j ∈ S c
A small cut, means that not many large weights needed to be removed in order to make the partition.
Intuitively, this seems to be a reasonable condition for a clustering algorithm. It turns out that the cut is
closely related to the so-called graph Laplacian of a graph.
Definition 19.4. Let G = (V, E, W ) be a weighted graph. The degree matrix of G is defined as
102
Let S be a vertex partition and y ∈ {±1}n be such that yi = 1 if and only if vi ∈ S. Then it holds that
!
1 1
4∑ ∑ wi,j + 4 c ∑ wi,j
wi,j (yi − y j )2 = 4
i< j
4 i ∈S,j∈Sc ,i < j i ∈S ,j∈S,i < j
!
= ∑ wi,j + ∑ wi,j = cut(S).
i ∈S,j∈Sc ,i < j i ∈S,j∈Sc ,i > j
Using the formula above, we are now able to express the cut in terms of the graph Laplacian.
Proposition 19.1. Let G = (V, E, W ) be a weighted graph and let LG be the associated graph Laplacian. It holds
for all x ∈ Rn , where n = |V | that
In particular,
1 T
cut(S) = y LG y, (87)
4
for y ∈ {±1} with yi = 1 if and only if vi ∈ S.
∑ wi,j (xi − x j )2 = xT LG x.
i< j
Remark 19.3. Proposition 19.1 shows that we can compute the cut by computing a quadratic form involving
the graph Laplacian. Minimising expressions of the form x T Ax for positive semidefinite matrices reduces to an
eigenvalue problem and can be considered simple. We have two issues though: First, we do not want to minimise
such an expression over general x ∈ Rn , but only over x that take values in {±1}. Second, a minimiser of the cut
is actually very easy to compute. In fact S = ∅ or S = V always yield cut(S) = 0, the minimal value. This is,
however, not the minimiser we were looking for.
To address the issues raised in Remark 19.3, we first introduce a different notion of a cut that promotes
balanced partitions.
Definition 19.5. Let G = (V, E, W ) be a weighted graph and S ⊂ V be a vertex partition. We define the weighted
cut of G as
cut(S) cut(Sc )
1 1
Ncut(S) := + = cut ( S ) + ,
vol(S) vol(Sc ) vol(S) vol(Sc )
where vol(S) = ∑i∈S deg(i ). Here one needs to decide on a convention for S = ∅ or S = V.
103
The following proposition holds:
Proposition 19.2. Let G = (V, E, W ) be a weighted graph and S ⊂ V be a vertex partition. It holds that
Ncut(S) = y T LG y
where 1/2
vol(Sc )
if i ∈ S,
vol(S)vol(V )
yi = 1/2
vol(S)
−
vol(Sc )vol(V )
if i ∈ Sc .
1
2 i,j∑
yT LG y = wi,j (yi − y j )2
∈V
= ∑ ∑ wi,j (yi − y j )2
i ∈S j∈Sc
1/2 1/2 !2
vol(Sc )
vol(S)
= ∑ ∑ wi,j vol(S)vol(V )
+
vol(Sc )vol(V )
i ∈S j∈Sc
1/2 !
vol(Sc ) vol(Sc )
vol(S) vol(S)
= ∑ ∑ wi,j +2 +
i ∈S j∈Sc
vol(S)vol(V )vol(S)vol(V ) vol(Sc )vol(V ) vol(Sc )vol(V )
vol(Sc )
2 vol(S)
= ∑ ∑ wi,j + +
i ∈S j∈Sc
vol(S)vol(V ) vol(V ) vol(Sc )vol(V )
wi,j vol(Sc )
vol(S)
=∑ ∑ +2+ .
i ∈S j∈Sc
vol(V ) vol(S) vol(Sc )
vol(S) vol(Sc )
Using that vol(S)
+ vol(Sc )
= 2, we obtain that
Using Proposition 19.3, we can now rewrite the minimisation of Ncut as a minimisation problem involv-
ing the graph Laplacian.
y T Dy = 1,
y T D1 = 0.
Proposition 19.3. Let G = (V, E, W ) be a weighted graph and S ⊂ V be a vertex partition. S is a minimum of
Ncut if and only if y is a minimiser of (88) with yi ∈ { a, b} for some a, b ∈ R and all i ∈ [n] and S = {i ∈
[ n ] : y i = a }.
104
Proof. We only need to show that, for every minimiser y of (88), the entries are of the form prescribed in
Proposition 19.3 as well as that every y of the form given by Proposition 19.3 satisfies the constraints of
(88). This is left as an exercise for the reader.
Unfortunately, minimising (88) is in not simple at all. In fact, it can be shown to be an NP hard problem.
However, we can simplify this problem by relaxing the condition that y can only take two values.
This optimisation problem is simple. In fact it is equivalent to an eigenvalue problem. Set z = D1/2 y, then
we obtain the problem
kzk2 = 1,
( D1/2 1)T z = 0,
Hence, D1/2 1 is an eigenvector associated to the smallest eigenvalue of LG . We recall the following con-
sequence of the Courant-Fischer Theorem:
Theorem 19.1. For a matrix A ∈ Rn×n it holds that
where λ2 is the second smallest eigenvalue of A and v2 is an associated eigenvector. Moreover, v1 is the eigenvector
associated to the smallest eigenvalue of A.
We conclude that the solution of (91) is the smallest eigenvector associated to the second smallest eigen-
value of LG .
This is the motivaton for the following algorithm:
105
Figure 26: Eigenvalues and D −1/2 v1 and D −1/2 v2 for v1 , v2 the eigenvectors associated to the smallest and
second smallest eigenvalue of LG for the two graphs of Figure 25.
The relationship of spectral clustering to the problem (88) or equivalently the minimisation of the Ncut
problem is given by the following result:
Theorem 19.2. Let G = (V, E, W ) be a weighted graph. There exists a threshhold τ ∈ R such that
q
λ2 (LG ) . Ncut(Sτ ) . λ2 (LG ),
where Sτ is the result of spectral clustering with parameter τ. Here all implicit constants are less than 4.
Figure 27: Spectral clustering is successful for the non-convex clustering problem above.
106
20 Lecture 20 - Dimensionality reduction
We have run into the curse of dimension when we analysed the k-nearest neighbour algorithm. We found
that, for some algorithms to work, it is beneficial if the input dimension is small. Sometimes it is possible
to reduce the dimension of the data space, without really compromising the data. For example, if the data
lies in a low dimensional subspace, it is conceivable that we could restrict our learning problem to this
low dimensional subspace and thereby simplify it. A bit more involved is the situation if the data only
lies on or close to a low-dimensional non-linear manifold. In both situations, we cannot certainly, still
need to find the low dimensional structure before reducing the problem to the simplified setup. Doing
this is called dimensionality reduction.
where V is the matrix with k-th row equal to vkT . We only need to decide what we mean by ≈. In the case
of principle component analysis (PCA), we choose µ, V, ( β i )in=1 such that the least squares error is small.
Concretely, we are looking for µ, V, β that assume the following minimum:
n
min
µ,V,β k
∑ k xi − µ − Vβi k22 . (92)
i =1
V T V =Id
Finding the minimiser of (96) can be done in multiple steps. First we restrict ourselves to minimisers of
the form ∑in=1 β i = 0. Indeed, if ∑in=1 β i = λ 6= 0 then we can replace β i by β̃ = β i − λ/n and observe that
n n
∑ k xi − µ − Vβi k22 = ∑ kxi − µ̃ − V β̃i k22 . (93)
i =1 i =1
where µ̃ = µ − λ/nV1 and 1 denotes the vector with all entries equal to 1.
Under this assumption on the β’s, we first seek to find µ. This is done by looking for a stationary point of
(96) in µ, i.e., µ∗ such that
n
∇µ ∑ k xi − µ∗ − Vβ i k22 = 0. (94)
i =1
We compute:
n n n n n
∇µ ∑ k xi − µ − Vβ i k22 = −2 ∑ ( xi − µ − Vβ i ) = 2nµ − 2 ∑ xi + V ∑ β i = 2nµ − 2 ∑ xi .
i =1 i =1 i =1 i =1 i =1
min k xi − µ − Vβ i k22
βi
107
is given by β i = V T ( xi − µ). We simply need to show that this choice of β i satisfies ∑in=1 β i = 0. This of
course follows immediately from the linearity of V T .
In the final step, we would like to find V. By the previous computations the problem reduces to
n
min
VT V =Id
∑ k xi − µ∗ − VV T (xi − µ∗ )k22 . (95)
i =1
k xi − µ∗ − VV T ( xi − µ∗ )k22 = ( xi − µ∗ )T ( xi − µ∗ ) − 2( xi − µ∗ )T VV T ( xi − µ∗ ) + ( xi − µ∗ )T VV T VV T ( xi − µ∗ )
= ( xi − µ∗ )T ( xi − µ∗ ) − ( xi − µ∗ )T VV T ( xi − µ∗ ),
where we used V T V = Id. The first term above does not depend on V and so we observe that the
optimisation problem (95) is equivalent to
n
max
V T V =Id
∑ (xi − µ∗ )T VV T (xi − µ∗ ). (96)
i =1
Now we use some of the magic of the trace operator. We denote by Tr( A) = ∑ik=1 Aii the trace operator.
Note that for a scalar λ ∈ R it holds that Tr(λ) = λ. It is also well known that for matrices A ∈ Rm×n and
B ∈ Rn×m it holds that Tr( BA) = Tr( AB).
Also, directly from the definition, we have that Tr( A) = Tr( A T ). It holds that
n n
max ∑ (xi − µ∗ )T VV T (xi − µ∗ ) = max
V T V =Id i =1
∑
V T V =Id i =1
Tr ( x i − µ ∗ T
) VV T
( x i − µ ∗
)
n
= max
VT V =Id
∑ Tr V T ( xi − µ∗ )( xi − µ∗ )T V
i =1
= max (n − 1)Tr V T Σn V , (97)
V T V =Id
where Σn = n− 1
1 ∑i =1 ( xi − µ )( xi − µ ) is the sample variance. It is not hard to see that Tr V Σn V
n ∗ ∗ T T
108
Definition 20.1. For ( xi )in=1 ⊂ Rd and e ≥ 0 we call a map f : Rd → R p an e-isometry if for all i, j ∈ [n]
Now it is clear, that a 0-isometry exists if p ≥ min{d, n}, since in this case ( xi )in=1 lie in a p dimensional
space and we can simply project onto this space without distorting the pairwise distances at all.
Nonetheless, the question arises, how small p can be to still allow for an e-isometry.
The following theorem yields a lower bound on p such that a linear e-isometry exists.
Theorem 20.1. Let n, d, p ∈ N, d ≥ 4, and 0 < e < 1/2 be such that
20
p≥ log n.
e2
Lemma 20.1 ([5, Lemma 15.3]). Let x ∈ Rd , p < d and assume that A is a p × d matrix with every entry
independently normally distributed. Then it holds that for every 0 < e < 1/2
" #
2
1 2 3
P (1 − e)k x k2 ≤ √ Ax ≤ (1 + e)k x k2 ≥ 1 − 2e−(e −e ) p/4 .
p
Figure 28: Projection of 3 points onto three subspaces. The projection onto the purple subspace distorts
the pairwise distances the least.
Proof of Theorem 20.1. Let p be as in the statement of the theorem and choose f = √1p A, where A is a p × d
matrix with all entries i.i.d normal distributed. Let xi , x j be two points in ( xi )in=1 then it holds that with
2 3
probability at least 1 − 2e−(e −e ) p/4
109
There are (n2 ) ≤ n2 many pairs of data points in ( xi )in=1 and hence
∑ P k f ( xi ) − f ( x j )k2 /k xi − x j k2 6∈ (1 − e, 1 + e)
≤
i,j∈[n],i < j
2 − e3 ) p/4
≤ 2n2 e−(e .
20
Choosing p ≥ e2
log n implies that
2 − e3 ) p/4
2n2 e−(e ≤ 2n2 e−(1−e)5 log n = 2e−(3−5e) log n < 2e−1/2 log n ≤ 1
for n ≥ 4, where we used that e < 1/2. Hence, the probability that f is an e isometry for ( xi )in=1 is not
zero, which implies that such an f exists.
Figure 29: Swiss roll data set on the left. The middle figure shows an embedding resulting from projecting
on a one dimensional subspace identified by PCA. On the right is the embedding obtained by the diffusion
embedding discussed in this section.
It seems obvious that we do not want to maintain all pairwise distances in this nonlinear setting any
longer. Instead we would rather like to have a map that in some sense respects the intrinsic distances
of the points. To come up with a sensible notion of distance between points in a point cloud, that also
respects the intrinsic structure, we consider the following example:
Example 20.1. Consider the heat equation on a two dimensional domain: u0 ∈ L2 (R2 ), γ > 0
∂
u(t, x ) − γ∆ x u(t, x ) = 0 for (t, x ) ∈ [0, T ] × R2 (99)
∂t
u(0, ·) = u0 . (100)
If u0 is a small bump function centered at a point t1 and ũ0 is a second bump function centered at t2 and both ũ0
and u0 have compact and disjoint support, then kũ0 − u0 k2L2 = kũ0 k2 + ku0 k2L2 . From this information alone, we
110
obtain very little information about the distance of t1 from t2 . However, if we instead look at kũ( T, ·) − u( T, ·)k L2 ,
where ũ and u are the solutions of (99) with initial conditions ũ0 and u0 respectively, then these may give us a more
nuanced description of the distance between t1 and t2 . Consider Figure 30. There, we see that after some time has
passed the distances between heat profiles are closely related to the distances between the initial heat sources.
Figure 30: First and second row: Heat diffusion associated to two sources. Third row: relationship be-
tween distance of initial heat sources and heat profiles after two seconds of diffusion.
The interesting thing about the diffusion is that we can also use it to make sense of a distance on a non-euclidean
domain.
111
Figure 31: Diffusion on a map with a slit. The heat profile after two seconds now describes quite accurately
the distance of points when taking into account the more involved geometry. Indeed, the two lower rows
have a very similar heat profile after two seconds since the associated heat sources lie on the same side of
the slit.
Example 20.1 showed us that, if we define a distance via the heat equation then this will oftentimes
respect the underlying geometry. We will now do something similar for general graphs. Let (V, E, W ) be
a weighted graph. We define a random walk on V by
112
For the graph corresponding to the slit domain of Figure 31, where every pixel in the white part of the
image is one vertex of the graph and vertices corresponding to neighbouring pixels are connected by an
edge with weight 1, we run some examples of the random walk in Figure 32.
We denote the matrix of transition probabilities by M, where Mi,j = wi,j /deg(i ). Note that M = D −1 W,
where D is a diagonal matrix with Di,i = deg(i ). If we start the random walk X in the node i, i.e., X (0) = i,
then we can compute the probability that X (t) = j by
The map vi 7→ Mt (i, ·) can now be considered as a map from V to Rn . This map now maps points to
the associated probability distribution of the random walk starting in that point after t iterations. While
we had observed that this embedding seems to reflect the inner geometry of the problem, it is hardly a
dimensionality reduction since n may be quite large.
113
Figure 32: Three random walks starting at the same points as the heat sources of Figure 31. The points
are marked by a blue dot in the images. The resulting distribution is very similar to the heat profiles of
Figure 31.
To reduce the dimension of the embedding, we truncate the size of the matrix M by a spectral decompo-
sition. First of all, we notice that
S = VΛV T ,
for a matrix V such that V T V = Id and a diagonal matrix Λ with Λi,i ≥ Λi+1,i+1 for all i ∈ [n − 1]. Now
we have that
M = D −1/2 SD1/2 = ( D −1/2 V )Λ(V T D1/2 ) =: ΦΛΨ T .
114
Because of this, we can write
n
M= ∑ λk φk ψkT ,
k =1
where φi and ψi are the i-th column of Φ and Ψ, respectively. Note also, that by construction ΦΨ T = Id
and hence
n
Mt = ΦΛt Ψ T = ∑ λtk φk ψkT .
k =1
Before we turn this representation into the so-called diffusion map, we observe that φ1 = 1. Indeed, it holds
that M1 = 1. Therefore, φ1 = 1 holds if 1 is the largest eigenvalue of M.
Proposition 20.1. All eigenvalues of M are bounded by 1 in absolute value.
Proof. Let φk be a right eigenvector of M and let imax be such that φk (imax ) = maxi∈[n] φk (i ) > 0. Then
n
λk φk (imax ) = Mφk (imax ) = ∑ Mi max ,j
φk ( j).
j =1
By this proposition and the previous discussion we conclude that φ1 = 1 and hence does not carry any
information about G.
Definition 20.2. Given a graph G = (V, E, W ) let M = ΦΛΨ T be as above. For t ∈ N, the diffusion map ϕt is
defined as
ϕ t : V → Rn −1
t
λ2 φ2 (i )
λt φ3 (i )
3
ϕ t ( vi ) = .
..
.
λtn φn (i )
The diffusion map is still not really performing dimensionality reduction. However, if we assume that
many eigenvalues are significantly smaller than 1, then for sufficiently large t the values λtk will be very
small. Hence, we believe that we can drop the associated dimensions from the embedding without sig-
nificantly affecting the embedding’s quality.
115
Definition 20.3. Given a graph G = (V, E, W ) let M = ΦΛΨ T be as above. For p ∈ [n − 1] and t ∈ N, the
( p)
truncated diffusion map ϕt is defined as
( p)
ϕt : V → R p
λ2t φ2 (i )
λ3t φ3 (i )
ϕ t ( vi ) = .. .
.
λtp+1 φ p+1 (i )
Let us conclude this section by proving that the diffusion map does indeed what we wanted it to do,
which is to produce an embedding where the distances correspond to the distances of the probability
densities of a random walk.
Proposition 20.2. Let G = (V, E, W ) be a weighted graph, let M = ΦΛΨ T be as above, and let X be a random
walk as above. For every pair of nodes vi , v j ∈ V and for every t ∈ N it holds that
n
1
∑ deg(k) (P(X (t) = k|X (0) = i) − P(X (t) = k|X (0) = j))
2
k ϕt (vi ) − ϕt (v j )k2 = .
k =1
k =1
!2
n n n
1
=∑ ∑ λt` φ` (i )ψ`T (k ) − ∑ λt` φ` ( j)ψ`T (k )
k =1
deg (k) `=1 `=1
!2
n n
1
=∑ ∑ λt` (φ` (i ) − φ` ( j))ψ`T (k )
k =1
deg(k ) `=1
!2
n n ψ`T (k)
= ∑ ∑ ` `
λ t
( φ ( i ) − φ` ( j )) p
deg(k )
k =1 `=1
!2
n n
= ∑ ∑ λt` (φ` (i) − φ` ( j)) D−1/2 ψ`T (k)
k =1 `=1
2
n
= ∑ λt` (φ` (i ) − φ` ( j)) D −1/2 ψ`T .
`=1
Since D −1/2 ψ`T = v` per construction, where (v` )n`=1 denote the eigenvectors of S = D1/2 MD −1/2 . Since
(v` )n`=1 form an orthogonal basis, we obtain that
2
n n 2
∑ λt` (φ` (i ) − φ` ( j)) D −1/2 ψ`T = ∑ λt` (φ` (i ) − φ` ( j)) = k ϕt (vi ) − ϕt (v j )k2 ,
`=1 `=1
by Parseval’s identity.
Example 20.2. Using the diffusion map, we can produce a visual representation of data from which we only know
the relationship between each pair of elements. For example, consider the situation that we only know which central
European country as which neighbours, as given in Figure 33.
116
Figure 33: Adjacency matrix of countries in Europe. Countries that have a common border are considered
to be connected by an edge.
From the local neighbourhood relationship of the countries, we cannot really see how central Europe looks like.
We can, however apply a truncated diffusion map to the graph associated to that adjacency matrix and obtain the
following embedding of Figure 34 two dimensional space.
Figure 34: Embedding of the graph of Figure 33 in the two-dimensional plane. The positions coincide
somewhat with the locations of the countries on a real map. On the right, Voronoi regions are drawn.
They do not yield the correct borders. This is not surprising since Voronoi regions are necessarily convex,
but countries are not.
References
[1] Peter L Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear VC-dimension bounds for piecewise
polynomial networks. Neural computation, 10(8):2159–2173, 1998.
117
[2] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,
signals and systems, 2(4):303–314, 1989.
[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are uni-
versal approximators. Neural networks, 2(5):359–366, 1989.
[4] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward net-
works with a nonpolynomial activation function can approximate any function. Neural networks,
6(6):861–867, 1993.
[5] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT
press, 2018.
118