
Mathematics of Machine Learning

Philipp Christian Petersen*


April 18, 2022

* PP can be found in room 7.39 in the Kolingasse 14-16, 1090 Vienna, unless there is a lockdown. In that case, he will hide at an undisclosed location, but will still be very willing to read emails sent to [email protected].

Contents

0 Introduction
1 Lecture 1 – Example and Language of Machine Learning
  1.1 A new world (literally)
  1.2 Language of machine learning:
2 Lecture 2 – A Mathematical Framework
  2.1 PAC learning framework
  2.2 An example
  2.3 Finite hypothesis, consistent case
  2.4 Finite hypothesis, inconsistent case
3 Lecture 3 – Some Generalisations and Rademacher Complexities
  3.1 Agnostic PAC learning:
  3.2 Bayes error and noise
  3.3 The Rademacher complexity
  3.4 Generalisation bound with Rademacher complexity
4 Lecture 4 – Application of Rademacher Complexities and Growth Function
  4.1 Rademacher complexity bounds for binary classification
  4.2 The growth function
  4.3 The Vapnik–Chervonenkis Dimension
  4.4 Generalisation bounds via VC-dimension
5 Lecture 5 - The Mysterious Machine
6 Lecture 6 - Lower Bounds on Learning
7 Lecture 7 - The Mysterious Machine - Discussion
8 Lecture 8 - Model Selection
  8.1 Empirical Risk Minimisation
  8.2 Structural risk minimisation
  8.3 Cross-validation
9 Lecture 9 - Regression and Regularisation
  9.1 Regression and general loss functions
  9.2 Linear regression
  9.3 Stability and overfitting
  9.4 Regularised risk minimisation
  9.5 Tikhonov regularisation
10 Lecture 10 - Freezing Fritz
11 Lecture 11 - Support Vector Machines I
  11.1 Definition of the support vector machine algorithm
  11.2 Primal optimisation problem
  11.3 Support vectors
  11.4 Generalisation bounds
    11.4.1 Leave-one-out analysis
12 Lecture 12 - Support Vector Machines II
  12.1 Margin theory
  12.2 Non-separable case
13 Lecture 13 - Freezing Fritz - Discussion
14 Lecture 14 - Kernel methods
  14.1 The kernel trick and kernel SVM
  14.2 Learning guarantees
  14.3 Some standard kernels
  14.4 A numerical example
15 Lecture 15 - Nearest Neighbour
  15.1 Generalisation bounds for one-nearest-neighbour
16 Lecture 16 - Neural Networks
17 Lecture 17 - Facial Expression Classification
18 Lecture 18 - Boosting
19 Lecture 19 - Clustering
  19.1 Linkage-based clustering
  19.2 k-means clustering
  19.3 Spectral clustering
20 Lecture 20 - Dimensionality reduction
  20.1 Principal component analysis
  20.2 Johnson-Lindenstrauss embedding
  20.3 Diffusion maps

0 Introduction
This lecture is about:

1. Language of ML: What is classification, regression, ranking, clustering, dimensionality reduction, supervised and unsupervised learning, generalisation, overfitting.
2. PAC Theory: PAC learning model, finite hypothesis sets, consistent and inconsistent problems, deterministic and agnostic learning.
3. Rademacher complexity and VC dimension: generalisation bounds via Rademacher complexity, growth function, connection to Rademacher complexity, VC dimension, VC-dimension-based upper bounds, lower bounds on generalisation.
4. Model Selection: bias-variance trade-off, structural risk minimisation, cross-validation, regularisation.
5. Support Vector Machines: generalisation bounds, margin theory/margin-based generalisation bounds.
6. Kernel Methods: reproducing kernel Hilbert spaces, Representer Theorem, kernel SVM, generalisation bounds for kernel-based methods.
7. Neural Networks (mostly shallow).
8. Clustering: k-means, Lloyd's algorithm, Ncut, Cheeger cut, spectral clustering.
9. Dimensionality Reduction: PCA, diffusion maps, Johnson–Lindenstrauss lemma.

Literature:
• Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018. https://cs.nyu.edu/~mohri/mlbook/
• Shalev-Shwartz, Shai, and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/
• Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009. https://web.stanford.edu/~hastie/ElemStatLearn/

Lecture notes: You are reading the lecture notes to the course right now. They are being developed during the course. If you find things that can be improved, please contact Philipp Petersen to give feedback. This is highly appreciated.

Required prior knowledge: This is an applied math course. Therefore, it will often touch on many different mathematical fields, such as harmonic analysis, graph theory, and random matrix theory. Students are not required to know these topics beforehand, but a certain willingness to look up concepts from time to time is necessary.
To enjoy this course, students should have taken an introductory course on probability theory and linear algebra. Most of the examples, as well as the challenges, will require rudimentary knowledge of Python. Towards the end, a basic knowledge of functional analysis is helpful, but not required.

Assessment criteria: There will be an oral exam at the end of the lecture.
In addition, we will have challenges where you will solve machine learning problems in Python. You need to form groups for this. You need to beat me to pass (this will be very easy, because I will publish my approach, and my solutions will typically not be very sophisticated). The winning team and the team with the most creative solution will receive prizes. There will be at least three challenges this year.
Note that your results for the challenges need to be handed in before the given deadline. There are no exceptions to this rule. You will have plenty of time to complete these, so it may make sense to prepare one submission as a backup and then fine-tune it later.

1 Lecture 1 – Example and Language of Machine Learning
A cheesy lecture on machine learning would probably start by claiming that machine learning is revolutionary and constitutes a completely new paradigm for science and mathematics. One may even say a new world of unknown and exciting terrain with uncountable possibilities.
We instead show that machine learning can quite literally show us new worlds.

1.1 A new world (literally)

[1]: import matplotlib as mpl


import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('seaborn')
import numpy as np
from scipy import signal
from scipy.fftpack import fft, ifft
from scipy.signal.windows import gaussian

We import a data set of light curves of stars recorded by the Kepler telescope. These can be found online at https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data. We print the first five lines of the data set to get a feeling for what is going on.

[2]: import pandas as pd


data = pd.read_csv("exoTrain.csv")
# Preview the first 5 lines of the loaded data
data.head()

[2]: LABEL FLUX.1 FLUX.2 FLUX.3 FLUX.4 FLUX.5 FLUX.6 FLUX.7 \


0 2 93.85 83.81 20.10 -26.98 -39.56 -124.71 -135.18
1 2 -38.88 -33.83 -58.54 -40.09 -79.31 -72.81 -86.55
2 2 532.64 535.92 513.73 496.92 456.45 466.00 464.50
3 2 326.52 347.39 302.35 298.13 317.74 312.70 322.33
4 2 -1107.21 -1112.59 -1118.95 -1095.10 -1057.55 -1034.48 -998.34
[5 rows x 3198 columns]

The columns are the intensities of the light at different positions in time. The label is 2 if an astrophysicist has claimed that the star has an exoplanet and 1 if they claimed it has none. We will plot a couple of these curves to get a better understanding of what is going on.

[3]: plt.figure(figsize = (10, 8))

fig = plt.figure(figsize=(18,14))
ax = fig.add_subplot(231)
plt.plot(data.values[6, 1:]/np.max(np.abs(data.values[69, 1:])))
plt.title('Has exoplanet')
ax = fig.add_subplot(232)
plt.plot(data.values[2003, 1:]/np.max(np.abs(data.values[2003, 1:])))
plt.title('No exoplanet')
ax = fig.add_subplot(233)
plt.plot(data.values[1, 1:]/np.max(np.abs(data.values[1, 1:])))

plt.title('Has exoplanet')
ax = fig.add_subplot(234)
plt.plot(data.values[13, 1:]/np.max(np.abs(data.values[69, 1:])))
plt.title('Has exoplanet')
ax = fig.add_subplot(235)
plt.plot(data.values[75, 1:]/np.max(np.abs(data.values[2003, 1:])))
plt.title('No exoplanet')
ax = fig.add_subplot(236)
plt.plot(data.values[77, 1:]/np.max(np.abs(data.values[1, 1:])))
plt.title('No exoplanet')

[3]: Text(0.5, 1.0, 'No exoplanet')

<Figure size 720x576 with 0 Axes>

Stars with exoplanets often have periodically occurring sharp drops in light intensity. We do not know if this is the only indication, though. Since we are also not trained in astrophysics, we should not overanalyse this. Maybe there is another obvious way of differentiating between stars with exoplanets and stars without. We shall start with some exploratory data analysis. This consists of looking at certain statistical aspects of the data set:

[4]: LightCurves = data.values[:, 1:]

ex_labels = data.values[:, 0]

print('In the data set there are: ' + str(np.sum(ex_labels==1)) + ' Stars without exoplanets.')
print('In the data set there are: ' + str(np.sum(ex_labels==2)) + ' Stars with exoplanets.')

fig = plt.figure(figsize=(18,14))

means1 = LightCurves[ex_labels==1].mean(axis=1)
means2 = LightCurves[ex_labels==2].mean(axis=1)

ax = fig.add_subplot(231)

ax.hist(means1,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.hist(means2,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('Mean Intensity')
ax.set_ylabel('Num. of Stars')

std1 = LightCurves[ex_labels==1].std(axis=1)
std2 = LightCurves[ex_labels==2].std(axis=1)

ax = fig.add_subplot(232)

ax.hist(std1,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.hist(std2,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('Standard Deviation')
ax.set_ylabel('Num. of Stars')

spread1 = LightCurves[ex_labels==1].max(axis=1) - LightCurves[ex_labels==1].min(axis=1)


spread2 = LightCurves[ex_labels==2].max(axis=1) - LightCurves[ex_labels==2].min(axis=1)

ax = fig.add_subplot(233)

ax.hist(spread1,alpha=0.8,bins=50,density=True,range=(-2500,2500))
ax.hist(spread2,alpha=0.8,bins=50,density=True,range=(-2500,2500))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('Max minus min value')
ax.set_ylabel('Num. of Stars')

Derivative = np.abs(np.gradient(LightCurves[ex_labels==1], axis = 1)).mean(axis=1)


Derivative2 = np.abs(np.gradient(LightCurves[ex_labels==2], axis = 1)).mean(axis=1)

ax = fig.add_subplot(234)

ax.hist(Derivative,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.hist(Derivative2,alpha=0.8,bins=50,density=True,range=(-250,250))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('L1 Norm of Derivative')
ax.set_ylabel('Num. of Stars')

MaxDerivative = np.max(np.gradient(LightCurves[ex_labels==1], axis = 1), axis = 1)
MaxDerivative2 = np.max(np.gradient(LightCurves[ex_labels==2], axis = 1), axis = 1)

ax = fig.add_subplot(235)

ax.hist(MaxDerivative,alpha=0.8,bins=50,density=True,range=(-500,500))
ax.hist(MaxDerivative2,alpha=0.8,bins=50,density=True,range=(-500,500))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('Max of Derivative')
ax.set_ylabel('Num. of Stars')

MaxSecDerivative = np.max(np.gradient(np.gradient(LightCurves[ex_labels==1], axis = 1), axis = 1), axis = 1)
MaxSecDerivative2 = np.max(np.gradient(np.gradient(LightCurves[ex_labels==2], axis = 1), axis = 1), axis = 1)

ax = fig.add_subplot(236)

ax.hist(MaxSecDerivative,alpha=0.8,bins=50,density=True,range=(-500,500))
ax.hist(MaxSecDerivative2,alpha=0.8,bins=50,density=True,range=(-500,500))
ax.legend(['No Exoplanets', 'Has Exoplanets'])
ax.set_xlabel('Max of Second Derivative')
ax.set_ylabel('Num. of Stars')

In the data set there are: 5050 Stars without exoplanets.
In the data set there are: 37 Stars with exoplanets.

[4]: Text(0, 0.5, 'Num. of Stars')

Unfortunately, none of our clever statistics seems to really separate the data. It seems that stars with exoplanets may have higher maximal derivatives, but this only holds for the distribution and does not yet give a simple test. We need to actually perform machine learning. Let us use an all-purpose weapon, the support vector machine:

[5]: from sklearn import svm


SupportVectorClassifier = svm.SVC()
SupportVectorClassifier.fit(LightCurves, ex_labels);

We have trained the support vector machine on the data. Now let us evaluate how well this trained
algorithm performs on a test set.

[6]: data_test = pd.read_csv("exoTest.csv")

TestLightCurves = data_test.values[:, 1:]


TestLabels = data_test.values[:, 0]

prediction=SupportVectorClassifier.predict(TestLightCurves)
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix

print('Accuracy Score: {}'


.format(accuracy_score(TestLabels,prediction)))

Accuracy Score: 0.9912280701754386

At first sight, we have achieved 99.12% accuracy on the test set, which seems nice. But let us dig a little deeper by also printing the confusion matrix. This is the 2 × 2 matrix $C$ whose entry $C_{0,0}$ is the number of true negatives, $C_{1,1}$ the number of true positives, $C_{1,0}$ the number of false negatives, and $C_{0,1}$ the number of false positives.

[7]: fig = plt.figure(figsize=(8,8))


plt.pie([np.sum(TestLabels==1), np.sum(TestLabels==2)], labels=['No exoplanet', 'Has exoplanet'],
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.show()

plot_confusion_matrix(SupportVectorClassifier, TestLightCurves, TestLabels)


plt.grid(False)
print('Confusion Matrix:\n {}'
.format(confusion_matrix(TestLabels,prediction)))

Confusion Matrix:
[[565 0]
[ 5 0]]

The confusion matrix and the pie chart show quite clearly what the problem is: the data set is very imbalanced. The classifier, while achieving high accuracy, did not label a single star with an exoplanet correctly. In fact, it labelled all stars as having no exoplanet.
It seems we have to use somewhat more sophisticated methods.
We start by making the data a bit nicer by standardising and filtering it. We filter out high and low frequencies by applying a wavelet transform. We also remove strongly oscillating elements, as they seem to be outliers.
Next, we will transform the data so that it is in a format that may exhibit the characteristics that we need to classify. As we have seen in one of the light curves, stars with exoplanets exhibit periodically appearing drops in light intensity. To expose this periodicity, it makes sense to take the Fourier transform. We also want our classifier to be independent of temporal shifts. This can be enforced by taking the absolute value of the Fourier transform, since a translation of a function corresponds to a modulation of its Fourier transform.
[13]: def filterData(DataSet,wav_len):

wavelet = gaussian(wav_len, 1)
wavelet = np.diff(np.diff(wavelet))# Produce a wavelet with two vanishing moments

for k in range(DataSet.shape[0]):
DataSet[k,:] = DataSet[k,:] - DataSet[k,:].mean()
DataSet[k,:] = DataSet[k,:] / DataSet[k,:].std()
if(np.sum(np.abs(np.diff(DataSet[k,:]))) > 200*max(abs(DataSet[k,:]))):
DataSet[k,:] = 0; # remove light curves with too much oscillation
else:
DataSet[k,:] = np.convolve(DataSet[k,:], wavelet, 'same')
DataSet[k,:] = np.abs(fft(DataSet[k,:]))**2

return DataSet

One big problem that we observed was the imbalance of the data set. We attack this problem by generating artificial data. The artificial data is produced by making signals that have periodic spikes.

[14]: newexoplanets = np.zeros([500, LightCurves.shape[1]])


new_ex_labels = 2*np.ones(500)

for k in range(500):
newexoplanets[k,:] = LightCurves[500+k,:]
newexoplanets[k,:] = (newexoplanets[k,:] - newexoplanets[k,:].mean())/newexoplanets[k,:].std()
period = 100+k
start = np.random.uniform(3,int(period))
for j in range(int(start), LightCurves.shape[1]-3, int(period)):
randpershift = int(np.random.uniform(-1,1))

newexoplanets[k, j-1 + randpershift] = newexoplanets[k, j-1 + randpershift]-6


newexoplanets[k,j + randpershift] = newexoplanets[k,j + randpershift]-12
newexoplanets[k, j+1 + randpershift] = newexoplanets[k, j+1+randpershift]-6

LightCurves_Augmented = np.concatenate((LightCurves, newexoplanets))


ex_labels_Augmented = np.concatenate([ex_labels, new_ex_labels])

Now we apply the filtering to the augmented data set.

[15]: LightCurves_Augmented_F = filterData(LightCurves_Augmented.copy(), 12)


TestLightCurves_F =filterData(TestLightCurves.copy(), 12)

Let's apply our support vector machine again

[16]: SupportVectorClassifier = svm.SVC()


SupportVectorClassifier.fit(LightCurves_Augmented_F, ex_labels_Augmented);

. . . and let's look at the result again:

[17]: prediction=SupportVectorClassifier.predict(TestLightCurves_F)

print('Accuracy Score: {}'


.format(accuracy_score(TestLabels,prediction)))

plot_confusion_matrix(SupportVectorClassifier, TestLightCurves_F, TestLabels)


plt.grid(False)
print('Confusion Matrix:\n {}'
.format(confusion_matrix(TestLabels,prediction)))

Accuracy Score: 1.0


Confusion Matrix:
[[565 0]
[ 0 5]]

. . . this looks much better. We found all five exoplanets without knowing anything about astrophysics!

1.2 Language of machine learning:


Types of learning:
• Classification: Assigning a discrete label to items. Example: Exoplanet yes or no, topics in document
classification, or content in image classification.
• Regression: Predicting a real value. Example: prediction of the value of a stock, the temperature, or other physical quantities.
• Ranking: Ordering items according to a criterion. Example: PageRank, which orders webpages according to how well they fit a search query.
• Clustering: Partitioning of items into subsets. See Figure 1. Example: social networks.
• Dimensionality reduction/manifold learning: Transforming a high-dimensional data set into a low-dimensional representation.

Learning stages:

Figure 1: Data sets to cluster

• Examples: Observations/Instances of data used in the learning process or to evaluate. Stars in our
exoplanet study.
• Features: The set of attributes of the examples. In the exoplanet study, these are the light curves.
• Labels: Values or categories assigned to the examples. Has exoplanet or does not have exoplanet.
• Hyperparameters: Parameters that define the learning algorithm. These are not learned. E.g., the number of neurons of a neural network, when to stop training, etc.
• Training sample: These are the examples that are used to train the learning algorithm.
• Validation sample: These examples are only used indirectly in the learning algorithm, to tune its hyperparameters.
• Test sample: These examples are not accessed during training. After training, they are used to determine the accuracy of the algorithm.
• Loss function: This function is used to measure the distance between the predicted and the true label. If Y is the set of labels, then $L : Y \times Y \to R_+$. Examples include the zero-one loss: Y = {−1, 1}, $L_{0-1}(x, y) = 1_{x \neq y}$; the square loss: Y = Rᵈ, $L_{sq}(x, y) = \|x - y\|^2$; and, in the exoplanet study, the binary cross entropy: Y = [0, 1], where $L_{ce}(x, y) = -(y \log(x) + (1 - y)\log(1 - x))$ (in our case, the true labels y only take values in {0, 1}). A short code sketch of these losses is given after this list.
• Hypothesis set: A set of functions that map features to labels.
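
To make the loss functions above concrete, here is a minimal sketch in Python (the function names are our own and not taken from any library):

import numpy as np

# Zero-one loss for labels in {-1, 1}: 1 if the prediction differs from the true label.
def zero_one_loss(x, y):
    return float(x != y)

# Square loss for real- or vector-valued predictions: squared Euclidean distance.
def square_loss(x, y):
    return float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

# Binary cross entropy for a predicted probability x in (0, 1) and a true label y in {0, 1}.
def cross_entropy_loss(x, y):
    return float(-(y * np.log(x) + (1 - y) * np.log(1 - x)))

print(zero_one_loss(1, -1))                 # 1.0
print(square_loss([0.5, 1.0], [0.0, 1.0]))  # 0.25
print(cross_entropy_loss(0.9, 1))           # approximately 0.105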

Learning Scenarios:
• Supervised learning: The learner has access to labels for every training and evaluation sample. This
was the case in the exoplanet study.
• Unsupervised learning: Here we do not have labels. A typical example is clustering.
• Semi-supervised learning: Here only some of the data have labels. Both the labels and the structure of the unlabelled data need to be used.
• Online learning: Here training and testing are performed iteratively in rounds. In each round we receive new data, make a prediction, receive an evaluation, and update our model. The goal is to reduce the so-called regret, which describes how much worse one performed than an expert would have in hindsight.
• Reinforcement learning: Similar to online learning in the sense that training and testing phases are mixed. The learner receives a reward for each action and seeks to maximise this reward. This is often used to train algorithms to play computer games.

Figure 2: Learning pipeline. We learn using an algorithm A(Θ). This algorithm can be chosen based on certain features and prior knowledge of the problem. The algorithm has hyperparameters Θ that we can choose based on the validation sample; the test sample is only used for the final evaluation.

• Active learning: An oracle exists that can be queried by the learner for labels to samples chosen by
the learner.

Generalisation: Generalisation describes the performance of the learned algorithm outside of the training set.
Example:
a) Polynomials fitting points

[1]: import numpy as np


import matplotlib.pyplot as plt

[2]: N = 25

x = np.arange(0,1, 1/N)

y = x**2 - x + np.random.normal(0, 0.02, N)

polyordLOW = np.poly1d(np.polyfit(x, y, 1))


polyordRIGHT = np.poly1d(np.polyfit(x, y, 2))
polyordHIGH = np.poly1d(np.polyfit(x, y, N-5))

plt.figure(figsize = (15,5))
plt.subplot(1,3,1)
plt.scatter(x,y)
plt.plot(x,polyordLOW(x), c = 'r')
plt.title('Degree 1')
plt.subplot(1,3,2)
plt.scatter(x,y)
plt.title('Degree 2')
plt.plot(x,polyordRIGHT(x), c = 'r')

plt.subplot(1,3,3)
plt.scatter(x,y)
plt.plot(x,polyordHIGH(x), c = 'r')
plt.title('Degree 20')

[2]:

b) Binary classification

c) Real world: Sports statistics. "Red Bull Salzburg never loses a game in the Champions League if they play
at home, the moon is full and at least 3 yellow cards are awarded in the first 20 minutes to players with odd
jersey numbers."
d) Science: Geocentric model. Based on epicycles. See Figure 3.

Figure 3: Ptolemaic system

2 Lecture 2 – A Mathematical Framework


2.1 PAC learning framework
• Input/Example space X .
• Output/Label space Y . (For the rest of this chapter we do binary classification Y = {0, 1}.)
• Concept class C ⊂ {X → Y}. These are the possible relationships between examples and labels. We typically assume that we know this. A function c ∈ C is called a concept. There is often one specific concept that we want to identify. We call this the target concept. We do not know this.
• Data distribution: a distribution D on X. For simplicity, we assume in the sequel that D has a density if X is not discrete. We do not know this.
• Hypothesis set H ⊂ {X → Y}. This does not need to coincide with C.
• Training samples are generated by drawing i.i.d. examples $x_1, \ldots, x_m$ according to D. The sample is then given as $(x_i, c(x_i))_{i=1}^m$ for a fixed concept c.
Based on the training data, a learning algorithm chooses a function in the hypothesis set. This choice is good if it is close to an underlying target concept. What is meant by close? We want the generalisation error to be small:

Definition 2.1 (Generalisation error). Let h ∈ H, c ∈ C, and let D be a data distribution. The generalisation error or risk of h is defined as

$$R(h) = P_{x \sim D}\big(h(x) \neq c(x)\big) = E_{x \sim D}\big[1_{h(x) \neq c(x)}\big],$$

where $1_A$ is the indicator/characteristic function of the event A.ᵃ

ᵃ We assume here that all probabilities are well defined. Of course, this restricts the hypothesis and concept classes to some extent. We will ignore all issues of measurability from now on.

In practice, we cannot compute the generalisation error R(h), since we know neither D nor the target concept c. We can compute the error on a sample instead:
Definition 2.2 (Empirical error). Let h ∈ H, and let $S := (x_i, y_i)_{i=1}^m$ be a training sample. The empirical error or empirical risk is defined as

$$\hat{R}_S(h) = \frac{1}{m} \sum_{i=1}^m 1_{h(x_i) \neq y_i}.$$

Since the data is generated i.i.d. with respect to D, we see that $\hat{R}_S(h)$ is an unbiased estimator of R(h):

$$E_{(x_i)_{i=1}^m \sim D^m}\big[\hat{R}_{(x_i, c(x_i))_{i=1}^m}(h)\big] = \frac{1}{m}\sum_{i=1}^m E_{(x_i)_{i=1}^m \sim D^m}\big[1_{h(x_i) \neq c(x_i)}\big] = \frac{1}{m}\sum_{i=1}^m E_{x \sim D}\big[1_{h(x) \neq c(x)}\big] = \frac{1}{m}\sum_{i=1}^m R(h) = R(h).$$

We want to learn the target concept from samples. When is this even possible? What does possible even mean?
Definition 2.3 (PAC learnability). Let C be a concept class. We say that C is PAC-learnable if there exist a function $m_C : (0,1)^2 \to N$ and an algorithm A mapping samples S to functions A(S) ∈ {X → Y} with the following property: for every distribution D on X, for every target concept c ∈ C, and for all ε, δ ∈ (0, 1),

$$P_{(x_i)_{i=1}^m \sim D^m}\Big(R\big(A\big((x_i, c(x_i))_{i=1}^m\big)\big) \leq \epsilon\Big) \geq 1 - \delta,$$

if $m \geq m_C(\epsilon, \delta)$.
Note that the definition of PAC learnability is distribution-free. Also, it describes the worst-case behaviour over the whole concept class.

2.2 An example
• X := [0, 1]²
• C := $\{1_{[r_1, r_2] \times [r_3, r_4]} : r_1, r_2, r_3, r_4 \in (0,1)\}$.
Let's define our learning algorithm A as follows: for $S = (x_i, y_i)_{i=1}^m$ we pick $r_1', r_2', r_3', r_4' \in (0,1)$ such that $[r_1', r_2'] \times [r_3', r_4']$ is the smallest rectangle containing all $x_i$ with $y_i = 1$, and then $A(S) = 1_{[r_1', r_2'] \times [r_3', r_4']}$.
Let us analyse the expected error of our algorithm. Pick an arbitrary c ∈ C and a distribution D on [0, 1]². Let ε > 0:
1. Note that for a sample S we have {A(S) = 1} ⊂ {c = 1}.
2. The expected error is therefore given by
$$R(A(S)) = E_D\big(c - A(S)\big).$$
3. Assuming $E_D(c) > \epsilon$, we choose four rectangles $(R_j)_{j=1}^4$ as in Figure 4, each of probability mass exactly ε/4.¹
4. Observe that, if $E_D(c - A(S)) > \epsilon$, then in particular $E_D(c) > \epsilon$, and supp A(S) cannot intersect all four rectangles of Step 3. Hence, there is one rectangle that does not contain any training samples. In other words,
$$P_{(x_i)_{i=1}^m \sim D^m}\big(R(A(S)) > \epsilon\big) \leq \sum_{j=1}^4 P_{(x_i)_{i=1}^m \sim D^m}\big(x_i \notin R_j \text{ for all } i\big) \leq \sum_{j=1}^4 \prod_{i=1}^m P_{x \sim D}(x \notin R_j) \leq 4(1 - \epsilon/4)^m \leq 4e^{-m\epsilon/4},$$
where we use the inequality $1 + x \leq e^x$, which holds for all x ∈ R.²
5. Setting $\delta = 4e^{-m\epsilon/4}$ yields that C is PAC-learnable with $m_C(\epsilon, \delta) = (4/\epsilon)\ln(4/\delta)$.

¹ This is possible, by adapting the widths, because we assumed that D has a density on the continuous space X.
² Note that the argument requires the choice of the $(R_j)_{j=1}^4$ to be independent of A.

Figure 4: Left: A sample drawn according to D, as well as the target concept. Middle: Rectangles of area ε/4 each. Right: The red box is the solution of A.
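
The analysis above can be illustrated numerically. The following minimal sketch (our own illustration, not part of the original argument) implements the algorithm A for a fixed target rectangle, assuming a uniform distribution D on [0, 1]², and estimates the risk R(A(S)) by Monte Carlo for growing sample sizes:

import numpy as np

rng = np.random.default_rng(0)
r = np.array([0.2, 0.7, 0.3, 0.9])  # target rectangle [r1, r2] x [r3, r4] (our choice)

def concept(X):
    # c(x) = 1 if x lies in the target rectangle
    return ((X[:, 0] >= r[0]) & (X[:, 0] <= r[1]) &
            (X[:, 1] >= r[2]) & (X[:, 1] <= r[3])).astype(int)

def learn(X, y):
    # A(S): tightest axis-aligned rectangle around the positive examples
    pos = X[y == 1]
    if len(pos) == 0:  # no positive examples: return an empty rectangle
        return np.array([1.0, 0.0, 1.0, 0.0])
    return np.array([pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max()])

X_test = rng.uniform(0, 1, (100000, 2))  # Monte Carlo estimate of R(A(S))
for m in [10, 100, 1000]:
    X = rng.uniform(0, 1, (m, 2))
    rr = learn(X, concept(X))
    pred = ((X_test[:, 0] >= rr[0]) & (X_test[:, 0] <= rr[1]) &
            (X_test[:, 1] >= rr[2]) & (X_test[:, 1] <= rr[3])).astype(int)
    print(m, np.mean(pred != concept(X_test)))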

2.3 Finite hypothesis, consistent case


We analyse the consistent case now, which is when the concept class C is a subset of the hypothesis set H
of possible solutions of our learning algorithm.
If H is finite (and therefore C is finite) we can get the following learning bound:

Theorem 2.1 (Learning bound, finite H, consistent). Let H ⊃ C be a hypothesis set and concept class. Let D be a data distribution and A an algorithm such that for each c ∈ H and each sample $S = (x_i, c(x_i))_{i=1}^m$ we have that
$$\hat{R}_S(A(S)) = 0.$$
Then, for every δ, ε > 0, we have that
$$P_{S \sim D^m}\big(R(A(S)) \leq \epsilon\big) \geq 1 - \delta,$$
if
$$m \geq \frac{1}{\epsilon}\Big(\log|H| + \log\frac{1}{\delta}\Big).$$
In other words, for every ε, δ > 0, with probability at least 1 − δ,
$$R(A(S)) \leq \frac{1}{m}\Big(\log|H| + \log\frac{1}{\delta}\Big).$$

Proof. • Let ε > 0 and define $H_\epsilon = \{h \in H : R(h) > \epsilon\}$.
• A fixed hypothesis $h \in H_\epsilon$ fails to match the target concept c on a set Z of measure at least ε. If $\hat{R}_S(h) = 0$, then this means that we have avoided Z over m random draws subject to D. The probability of this happening is bounded by $(1 - \epsilon)^m$.
• We bound the probability that this happens for at least one $h \in H_\epsilon$ by a union bound:
$$P\big(\exists h \in H_\epsilon : \hat{R}_S(h) = 0\big) \leq \sum_{h \in H_\epsilon} P\big(\hat{R}_S(h) = 0\big) \leq \sum_{h \in H_\epsilon} (1 - \epsilon)^m \leq |H_\epsilon| e^{-\epsilon m} \leq |H| e^{-\epsilon m}.$$
• We set $\delta = |H| e^{-\epsilon m}$ and conclude the result.

2.4 Finite hypothesis, inconsistent case


If C ⊄ H, then we can still show that R(h) is not much larger than $\hat{R}_S(h)$ with high probability. We need some preparation first.
Theorem 2.2 (Hoeffding's inequality). Let $X_1, \ldots, X_m$ be independent random variables such that for all i ∈ [m]ᵃ, $a_i \leq X_i \leq b_i$ almost surely for some $a_i, b_i \in R$. Then, for ε > 0, it holds with $S_m = \sum_{i=1}^m X_i$ that
$$P\big(S_m - E(S_m) > \epsilon\big) \leq e^{-2\epsilon^2 / \sum_{i=1}^m (b_i - a_i)^2},$$
$$P\big(S_m - E(S_m) < -\epsilon\big) \leq e^{-2\epsilon^2 / \sum_{i=1}^m (b_i - a_i)^2}.$$

ᵃ We write [N] := {1, . . . , N}.

Proof. See [5, Theorem D.2].

[8]: import numpy as np


import numpy.matlib as mlb
import matplotlib.pyplot as plt

Hoeffding's inequality tells us that if we roll a die m times, then the average of the observed values should concentrate strongly around 3.5. Indeed, modelling each roll of the die by an i.i.d. random variable $X_i$ taking values in [6], Hoeffding's inequality yields that
$$P\Big(\Big|\frac{1}{m}\sum_{i=1}^m X_i - 3.5\Big| > m^{-1/4}\Big) \leq 2e^{-\frac{2\sqrt{m}}{25}}.$$

[48]: num_of_experiments = 30

for num_of_draws in 100, 1000:


diceRes = np.random.randint(1,7, [num_of_experiments, num_of_draws])
scaling = 1/mlb.repmat(np.arange(1,num_of_draws+1), num_of_experiments, 1)
cum_mean = np.multiply(np.cumsum(diceRes, 1),scaling)

plt.figure(figsize = (12,6))
plt.plot(cum_mean.T)
plt.plot(np.arange(1,num_of_draws+1), 3.5+np.power(np.arange(1,num_of_draws+1), -1/4), c = 'k')
plt.plot(np.arange(1,num_of_draws+1), 3.5-np.power(np.arange(1,num_of_draws+1), -1/4), c = 'k')

Now we observe the following corollary:

Corollary 2.1. Let ε > 0, let D be a distribution on X and let c : X → {0, 1} be a target concept. Then, for every h : X → {0, 1} it holds that
$$P_{(x_i)_{i=1}^m \sim D^m}\Big(\hat{R}_{(x_i, c(x_i))_{i=1}^m}(h) - R(h) \geq \epsilon\Big) \leq e^{-2m\epsilon^2},$$
$$P_{(x_i)_{i=1}^m \sim D^m}\Big(\hat{R}_{(x_i, c(x_i))_{i=1}^m}(h) - R(h) \leq -\epsilon\Big) \leq e^{-2m\epsilon^2}.$$
In particular,
$$P_{(x_i)_{i=1}^m \sim D^m}\Big(\big|\hat{R}_{(x_i, c(x_i))_{i=1}^m}(h) - R(h)\big| \geq \epsilon\Big) \leq 2e^{-2m\epsilon^2}.$$

Proof. We have by Definition 2.2 that
• $E\big(\hat{R}_{(x_i, c(x_i))_{i=1}^m}(h)\big) = R(h)$,
• $\hat{R}_{(x_i, c(x_i))_{i=1}^m}(h) = \sum_{i=1}^m X_i$ for independent random variables $X_i$ with $0 \leq X_i \leq 1/m$ almost surely for i ∈ [m].
We conclude the proof by applying Theorem 2.2.

We can extend Corollary 2.1 to any finite hypothesis set by a union bound.
Theorem 2.3 (Learning bound, finite H, inconsistent). Let H be a finite hypothesis set. Then, for every δ > 0, the following inequality holds with probability at least 1 − δ over the sample $S = (x_i, c(x_i))_{i=1}^m$:
$$R(h) \leq \hat{R}_S(h) + \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}} \quad \text{for all } h \in H. \tag{1}$$

Proof. Let $H := \{h_1, \ldots, h_n\}$. We compute:
$$P\big(\exists h_i \in H : |R(h_i) - \hat{R}_S(h_i)| \geq \epsilon\big) \leq \sum_{i=1}^n P\big(|R(h_i) - \hat{R}_S(h_i)| \geq \epsilon\big) \leq \sum_{i=1}^n 2e^{-2m\epsilon^2} \leq 2|H| e^{-2m\epsilon^2},$$
where the second inequality follows from Corollary 2.1. Setting $\delta = 2|H| e^{-2m\epsilon^2}$ and solving for ε yields (1).

Theorem 2.3 shows an instance of Occam's razor principle.
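
To get a feeling for the size of the bound in (1), the following small sketch (our own) evaluates the square-root term for a fixed finite hypothesis set and several sample sizes:

import numpy as np

def generalisation_gap(H_size, m, delta):
    # the square-root term in (1): sqrt((log|H| + log(2/delta)) / (2m))
    return np.sqrt((np.log(H_size) + np.log(2 / delta)) / (2 * m))

for m in [100, 1000, 10000]:
    print(m, generalisation_gap(H_size=10**6, m=m, delta=0.05))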

3 Lecture 3 – Some Generalisations and Rademacher Complexities
3.1 Agnostic PAC learning:
The notion of concept class requires a deterministic relationship between input x drawn according to D and
the label. This is not always sensible. Instead consider a distribution D on X × Y . Below is an example:

[96]: import matplotlib.pyplot as plt


import numpy as np
import joypy as jp
import pandas as pd
data = pd.read_csv("weather_2017.csv")
data.head()

[96]: number month day temp_dailyMin temp_minGround temp_dailyMean


0 0 1 1 -6.2 -9.5 -3.7
1 1 1 2 -7.2 -7.5 -1.0
2 2 1 3 -0.7 -3.3 1.1
3 3 1 4 0.9 -0.1 2.9
4 4 1 5 -3.2 -0.8 -0.1

Dataset available here: https://www.kaggle.com/zikazika/sickness-and-weather-data?select=weather_2017.csv
We want to make a plot of temperature vs week. Hence we transform the first column so that each number
corresponds to two weeks.

[88]: data["number"] = np.ceil(data["number"]/14)

Below we draw the temperature in Austria over periods of two weeks. We can consider the week number
as the example space X and the temperature as the label space Y .

[93]: # Draw Plot


plt.figure(figsize=(12,8), dpi= 80)
fig, axes = jp.joyplot(data, column='temp_dailyMean', by="number", figsize=(12,8))
plt.title('Temperature per week in Austria over a year', fontsize=22)
plt.show()

<matplotlib.figure.Figure at 0x7f6639f8e860>

If D is considered as a probability distribution on X × Y, then we call the learning problem stochastic. Analogously, we call our previous set-up deterministic.
In this case, we can redefine the risk to be
$$R(h) := P_{(x,y) \sim D}\big(h(x) \neq y\big) = E_{(x,y) \sim D}\big[1_{h(x) \neq y}\big]. \tag{2}$$

Definition 3.1 (Agnostic PAC learnability). Let H be a hypothesis set. An algorithm A mapping samples S to functions in H is an agnostic PAC learning algorithm if there exists a function $m_H : (0,1)^2 \to N$ with the following property: for all ε, δ ∈ (0, 1) and for all distributions D over X × Y,
$$P_{S \sim D^m}\Big(R(A(S)) - \min_{h \in H} R(h) \leq \epsilon\Big) \geq 1 - \delta,$$
if $m \geq m_H(\epsilon, \delta)$. We call H agnostic PAC learnable if an agnostic PAC learning algorithm exists.

3.2 Bayes error and noise


In the stochastic case, there does not necessarily exist any function f such that R(f) = 0.
Definition 3.2 (Bayes error). Let D be a distribution over X × Y. The Bayes error R* is defined as $R^* := \inf_{h \in M(X, Y)} R(h)$.ᵃ
A hypothesis h such that R(h) = R* is called a Bayes classifier.

ᵃ We denote by M(X, Y) the set of measurable functions from X to Y.

We can define a potential Bayes classifier in terms of conditional probabilities:
$$h_{\mathrm{Bayes}}(x) = \arg\max_{y \in \{0,1\}} P[y \mid x].$$
For every x we have $P_{(x,y) \sim D}(h_{\mathrm{Bayes}}(x) \neq y \mid x) = \min\{P_{(x,y) \sim D}(1 \mid x), P_{(x,y) \sim D}(0 \mid x)\}$, which is the smallest possible error. Hence $h_{\mathrm{Bayes}}$ is indeed a Bayes classifier.

Definition 3.3 (Noise). Given a distribution D over X × Y, we define the noise at a point x ∈ X by
$$\mathrm{noise}(x) = \min\{P_{(x,y) \sim D}(1 \mid x), P_{(x,y) \sim D}(0 \mid x)\}.$$
The average noise, or simply noise, is then defined as E(noise(x)).

It is clear by construction that
$$E(\mathrm{noise}(x)) = R^*.$$
The noise level is one aspect describing the hardness of a learning task.
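
As a small illustration (our own, with made-up numbers), consider X = {0, 1} with equal probability, P(y = 1 | x = 0) = 0.1 and P(y = 1 | x = 1) = 0.8. The sketch below computes the noise at each point, the Bayes classifier and the Bayes error:

p_x = {0: 0.5, 1: 0.5}            # marginal distribution of x (assumed)
p_y1_given_x = {0: 0.1, 1: 0.8}   # conditional probability of label 1 (assumed)

noise = {x: min(p, 1 - p) for x, p in p_y1_given_x.items()}
bayes_classifier = {x: int(p >= 0.5) for x, p in p_y1_given_x.items()}
bayes_error = sum(p_x[x] * noise[x] for x in p_x)

print(noise)             # {0: 0.1, 1: 0.2}
print(bayes_classifier)  # {0: 0, 1: 1}
print(bayes_error)       # 0.15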

3.3 The Rademacher complexity


We saw that finite hypothesis classes are PAC learnable. Some infinite hypothesis sets seem to be learnable
too. This was seen in the example in Section 2.2. We now introduce a new type of complexity that handles
infinite hypothesis sets.
Definition 3.4. Let a, b ∈ R and let Z be a set. Let G ⊂ M(Z, [a, b]) and let $S = (z_1, \ldots, z_m) \in Z^m$. Then the empirical Rademacher complexity of G with respect to S is defined as
$$\hat{R}_S(G) = E_\sigma\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i g(z_i)\Big],$$
where $\sigma = (\sigma_1, \ldots, \sigma_m)$ with $\sigma_i$ being i.i.d. Rademacher random variables.ᵃ

ᵃ This means that the $\sigma_i$ satisfy $P(\sigma_i = \pm 1) = 1/2$.

Remark 3.1. The empirical Rademacher complexity measures how well the class G can correlate with random noise on a given sample S. If, for example, G is the set of continuous functions from [0, 1] to [−1, 1] and S contains m elements $(x_1, \ldots, x_m)$ with $x_i \neq x_j$ for all i ≠ j, then $\hat{R}_S(G) = 1$. If G = {1} contains only one function, then $\hat{R}_S(G) = 0$.
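
Whenever the supremum over G can be evaluated, the empirical Rademacher complexity can be approximated by Monte Carlo over the Rademacher vector σ. The sketch below (our own) does this for the finite class of one-dimensional threshold classifiers $g_t(z) = \mathrm{sign}(z - t)$ with t on a grid:

import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(0, 1, 20)                      # a fixed sample S of m = 20 points
thresholds = np.linspace(0, 1, 51)             # finite class of threshold classifiers
G = np.sign(z[None, :] - thresholds[:, None])  # G[j, i] = sign(z_i - t_j)
G[G == 0] = 1                                  # map the (measure-zero) zeros to +1

estimates = []
for _ in range(2000):                          # Monte Carlo over sigma
    sigma = rng.choice([-1.0, 1.0], size=len(z))
    estimates.append(np.max(G @ sigma) / len(z))   # sup over the class, then normalise by m
print(np.mean(estimates))                      # approximate empirical Rademacher complexity
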
The Rademacher complexity is defined for functions with real outputs. To apply it to general learning
problems, we introduce the concept of a loss function:
Definition 3.5 (Family of loss functions). A function L : Y × Y → R is called a loss function. For a hypothesis
class H, we define the family of loss functions associated to H by

G := { g : X × Y → R : g( x, y) = L(h( x ), y), h ∈ H}
Setting Z = X × Y, we can apply Definition 3.4 to families of loss functions. We can also define a non-empirical version of the Rademacher complexity.
Definition 3.6. Let a, b ∈ R and let Z be a set. Let G ⊂ M(Z, [a, b]) and let D be a distribution over Z. For m ∈ N, we define the Rademacher complexity by
$$R_m(G) := E_{S \sim D^m}\big[\hat{R}_S(G)\big].$$

3.4 Generalisation bound with Rademacher complexity


Below, we present a generalisation bound similar to Theorem 2.3, but for potentially infinite hypothesis
sets.

Theorem 3.1. Let G ⊂ M(Z, [0, 1]) and let D be a distribution on Z. For every δ > 0 and m ∈ N we have that for a sample $S = (z_1, \ldots, z_m) \sim D^m$, for all g ∈ G,
$$E(g) \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2R_m(G) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}, \tag{3}$$
$$E(g) \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\hat{R}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}, \tag{4}$$
with probability at least 1 − δ.


Before we prove this result, let's look at an example.
We consider four hypothesis sets: polynomials of degree 3, 4, 7, and 20. The target concept is a polynomial $p_{\mathrm{true}}$ of degree 5. The data distribution is hence constructed from a uniform distribution on (−1, 1) and $p_{\mathrm{true}}$. Below, we vary the number of sample points and compute the empirical Rademacher complexities of the model with loss function $L(h(x), y) = h(x) - p_{\mathrm{true}}(x)$. Note that
$$E_\sigma\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i\big(h(x_i) - p_{\mathrm{true}}(x_i)\big)\Big] = E_\sigma\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big].$$
We compute the empirical error as
$$\frac{1}{m}\sum_{i=1}^m |L(h(x_i), y_i)|$$
and approximate the expected error $E(|L(h(x), y)|)$. Note that due to the absolute value, we are not completely in the setup of Theorem 3.1. We will later see that this does not matter, so we should not overthink this now.
[1]: import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', np.RankWarning)

[4]: # set-up
iterations = 50
degrees = [3,4,7,20]
largeNumber = 1000

#errors to be computed
RademacherPoly = np.ones([iterations, len(degrees)])
EmpErrorsPoly = np.zeros([iterations, len(degrees)])
ErrorsPoly = np.ones([iterations, len(degrees)])

#the test data


x_test = np.arange(-1,1,1/largeNumber)
y_test = (x_test - 0.3)* (x_test + 0.15) * x_test * (x_test + 0.75) * (x_test - 0.8)

# precompute training data on random points:


x = np.random.uniform(-1,1, iterations)
y = (x - 0.3) * (x + 0.15) * x * (x + 0.75) * (x - 0.8)

for m in range(1,iterations):

# take subset of length m from training data


x_short = x[0:m]
y_short = y[0:m]

for k in range(len(degrees)):
# fit polynomials to data:
p = np.poly1d(np.polyfit(x_short, y_short, degrees[k]))

#compute errors
y_exp = p(x_test) - y_test
y_emp = p(x_short) - y_short
EmpErrorsPoly[m, k] = abs(y_emp).mean()
ErrorsPoly[m, k] = abs(y_exp).mean()

#estimate empirical Rademacher complexities:


err = 0
for it in range(largeNumber):
rdm = 2*np.round(np.random.uniform(0,1, m))-1
p = np.poly1d(np.polyfit(x_short, rdm, degrees[k]))
err = err + np.dot(p(x_short), rdm)/m

RademacherPoly[m, k] = err/largeNumber

[5]: plt.figure(figsize = (18,5))

plt.subplot(131)
plt.plot(np.arange(iterations), RademacherPoly)
plt.legend(('Degree 3', 'Degree 4', 'Degree 7', 'Degree 20'))
plt.title('Rademacher complexities')

plt.subplot(132)
plt.semilogy(np.arange(iterations), EmpErrorsPoly)
plt.legend(('Degree 3', 'Degree 4', 'Degree 7', 'Degree 20'))
plt.title('Empirical errors')

plt.subplot(133)
plt.semilogy(np.arange(iterations), ErrorsPoly)
plt.legend(('Degree 3', 'Degree 4', 'Degree 7', 'Degree 20'))
plt.title('Expected errors')

[5]:

Having understood the nature of Theorem 3.1, we can now look at its proof. We need the following result:
Theorem 3.2 (McDiarmid's inequality). Let m ∈ N, and let $X_1, \ldots, X_m$ be independent random variables taking values in X. Assume that there exist $c_1, \ldots, c_m > 0$ and a function $f : X^m \to R$ satisfying
$$|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)| \leq c_i,$$
for all i ∈ [m] and all points $x_1, \ldots, x_m, x_i' \in X$. Then the following inequalities hold for all ε > 0:
$$P\big(f(S) - E(f(S)) \geq \epsilon\big) \leq e^{-2\epsilon^2 / |c|^2},$$
$$P\big(f(S) - E(f(S)) \leq -\epsilon\big) \leq e^{-2\epsilon^2 / |c|^2},$$
where $S = (X_1, \ldots, X_m)$ and $c = (c_1, \ldots, c_m)$.

Proof of Theorem 3.1. We define two short-hand notations for a sample $S = (z_1, \ldots, z_m)$:
$$\hat{E}_S(g) := \frac{1}{m}\sum_{i=1}^m g(z_i), \qquad \Phi(S) := \sup_{g \in G}\big(E(g) - \hat{E}_S(g)\big).$$
To prove the theorem, we need to bound Φ(S), and we will use McDiarmid's inequality for this.
Let S and S′ be two samples that differ in exactly one point, i.e., $S = (z_1, \ldots, z_i, \ldots, z_m)$ and $S' = (z_1, \ldots, z_i', \ldots, z_m)$. We compute:
$$\Phi(S') - \Phi(S) = \sup_{g \in G}\big(E(g) - \hat{E}_{S'}(g)\big) - \sup_{g \in G}\big(E(g) - \hat{E}_S(g)\big) \leq \sup_{g \in G}\big(\hat{E}_S(g) - \hat{E}_{S'}(g)\big) \leq \sup_{g \in G}\frac{g(z_i) - g(z_i')}{m} \leq \frac{1}{m},$$
where the first inequality is due to elementary properties of suprema, the second follows from the definition of $\hat{E}_S(g)$ and of S, S′, and the last is due to the fact that g takes values in [0, 1].
The choice of S, S′ was arbitrary, and so we conclude that
$$|\Phi(S') - \Phi(S)| \leq \frac{1}{m}$$
for all S, S′ differing in one point only. By McDiarmid's inequality, we have that for a random sample S
$$P\big(\Phi(S) - E_S(\Phi(S)) \geq \epsilon\big) \leq e^{-2\epsilon^2 m}.$$
Hence, with probability 1 − δ/2,
$$\Phi(S) \leq E_S(\Phi(S)) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{5}$$
Let us compute $E_S(\Phi(S))$ next:
$$E_S(\Phi(S)) = E_S\Big(\sup_{g \in G}\big(E(g) - \hat{E}_S(g)\big)\Big) = E_S\Big(\sup_{g \in G} E_{S'}\big(\hat{E}_{S'}(g) - \hat{E}_S(g)\big)\Big), \tag{6}$$
where S′ is a sample that is independent from and distributed like S. We used that $E_{S'}(\hat{E}_{S'}(g)) = E(g)$. By the monotonicity of the expected value, we have that
$$E_S\Big(\sup_{g \in G} E_{S'}\big(\hat{E}_{S'}(g) - \hat{E}_S(g)\big)\Big) \leq E_{S, S'}\Big(\sup_{g \in G}\big(\hat{E}_{S'}(g) - \hat{E}_S(g)\big)\Big) = E_{S, S'}\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m\big(g(z_i') - g(z_i)\big).$$
Assume next that $\sigma = (\sigma_1, \ldots, \sigma_m)$ is a vector of i.i.d. Rademacher random variables. Then it holds that
$$E_{S, S'}\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m\big(g(z_i') - g(z_i)\big) = E_\sigma E_{S, S'}\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m \sigma_i\big(g(z_i') - g(z_i)\big). \tag{7}$$
To see why this holds, we observe that for every fixed σ a negative sign of $\sigma_i$ corresponds to switching $z_i$ and $z_i'$ in $\sum_{i=1}^m (g(z_i') - g(z_i))$. Since all $z_i, z_i'$ are chosen i.i.d. and we are taking the expectation, this does not affect the value. Applying the sub-additivity of the supremum to (7) yields that
$$E_\sigma E_{S, S'}\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m \sigma_i\big(g(z_i') - g(z_i)\big) \leq E_\sigma E_{S'}\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i') + E_\sigma E_S\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m (-\sigma_i) g(z_i) \leq 2 E_\sigma E_S\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i) = 2 R_m(G),$$
where the last inequality follows since σ and −σ have the same distribution. This yields (3).
To prove (4), we apply McDiarmid's inequality again. Note that for two samples S, S′ differing in one point only,
$$\big|\hat{R}_S(G) - \hat{R}_{S'}(G)\big| \leq \frac{1}{m},$$
and hence with probability 1 − δ/2,
$$R_m(G) = E\big(\hat{R}_S(G)\big) \leq \hat{R}_S(G) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{8}$$
Therefore, we conclude with a union bound from (8) and (5) that with probability 1 − δ,
$$\Phi(S) \leq 2\hat{R}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}},$$
which yields (4).

4 Lecture 4 – Application of Rademacher Complexities and Growth Function
4.1 Rademacher complexity bounds for binary classification
Theorem 3.1 holds for general families of loss functions. We want to make this notion more concrete for
common learning problems.
Lemma 4.1. Let H ⊂ M(X, {−1, 1}). Furthermore, let $G = \{X \times Y \ni (x, y) \mapsto 1_{h(x) \neq y} : h \in H\}$. For a sample $(x_i, y_i)_{i=1}^m = S \in (X \times Y)^m$ we denote $S_X = (x_i)_{i=1}^m$. It holds that
$$\hat{R}_S(G) = \frac{1}{2}\hat{R}_{S_X}(H).$$

Proof. The proof follows from a simple computation which is fundamentally based on the identity $1_{h(x) \neq y} = (1 - h(x)y)/2$. With this, we have that
$$\hat{R}_S(G) = E_\sigma\Big[\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\Big] = E_\sigma\Big[\sup_{h \in H}\frac{1}{m}\sum_{i=1}^m \sigma_i 1_{h(x_i) \neq y_i}\Big] = E_\sigma\Big[\sup_{h \in H}\frac{1}{m}\sum_{i=1}^m \sigma_i\frac{1 - h(x_i) y_i}{2}\Big] = \frac{1}{2}E_\sigma\Big[\sup_{h \in H}\frac{1}{m}\sum_{i=1}^m (-\sigma_i y_i) h(x_i)\Big] = \frac{1}{2}\hat{R}_{S_X}(H),$$
where the last identity follows since $(-\sigma_i y_i)$ and $\sigma_i$ have the same distribution.

Now we can transfer our generalisation bound of Theorem 3.1 to the binary classification setting:
Theorem 4.1. Let H ⊂ M(X, {−1, 1}) and let D be a distribution on X. Then, for every δ > 0, it holds with probability at least 1 − δ that
$$R(h) \leq \hat{R}_S(h) + R_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
$$R(h) \leq \hat{R}_S(h) + \hat{R}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}},$$
where $S \sim D^m$.
For the binary loss, computing the empirical Rademacher complexity of a hypothesis class H amounts to solving, for every choice of the Rademacher vector, an optimisation problem over the whole class H. This can be computationally challenging if H is very complex and m is large. Moreover, computing $R_m$ is often not possible at all, since we do not know the underlying distribution.

4.2 The growth function


Definition 4.1 (Growth function). For a hypothesis set H ⊂ {h : X → {−1, 1}}, the growth function $\Pi_H : N \to N$ is defined by
$$\Pi_H(m) = \max_{\{x_1, \ldots, x_m\} \subset X} \big|\{(h(x_1), \ldots, h(x_m)) : h \in H\}\big|.$$

The growth function describes the number of ways m points can be grouped into two classes by elements of H. The growth function is independent of the underlying distribution and a useful tool to bound the Rademacher complexity.
A helpful result here is Massart's lemma:
Theorem 4.2 (Massart's Lemma). Let $A \subset \{x = (x_1, \ldots, x_m) \in R^m : |x| \leq r\}$ be a finite set. Then
$$E_\sigma\Big[\sup_{x \in A}\frac{1}{m}\sum_{i=1}^m \sigma_i x_i\Big] \leq \frac{r\sqrt{2\log|A|}}{m},$$
where the $\sigma_i$ are independent Rademacher random variables.


Now we can show the following upper bound on the Rademacher complexity:
Corollary 4.1. Let H ⊂ {h : X → {−1, 1}} and let D be a distribution on X. Then, for every m ∈ N, it holds that
$$R_m(H) \leq \sqrt{\frac{2\log\Pi_H(m)}{m}}.$$

Proof. Notice that every vector of length m with entries either plus or minus one has Euclidean norm $\sqrt{m}$. Hence, for every sample $S = (x_1, \ldots, x_m)$, the set
$$H_S := \{h(S) : h \in H\}$$
is contained in the ball of radius $\sqrt{m}$ and, per definition, $|H_S| \leq \Pi_H(m)$. Therefore
$$R_m(H) = E_S E_\sigma\Big[\sup_{h \in H}\frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big] = E_S E_\sigma\Big[\sup_{u \in H_S}\frac{1}{m}\sum_{i=1}^m \sigma_i u_i\Big] \leq \sqrt{\frac{2\log\Pi_H(m)}{m}},$$
by Theorem 4.2.

Using this estimate, we can reformulate our previous generalisation bound, which was formulated in terms of the Rademacher complexity, via the growth function instead:
Corollary 4.2. Let H ⊂ {h : X → {−1, 1}}. Then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H:
$$R(h) \leq \hat{R}_S(h) + \sqrt{\frac{2\log\Pi_H(m)}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

4.3 The Vapnik–Chervonenkis Dimension

Definition 4.2 (Shattering). For a function h : X → {−1, 1} and a set of points $S = (x_1, \ldots, x_m) \in X^m$, we denote by $h_S$ the restriction of h to S. For a hypothesis class H ⊂ {h : X → {−1, 1}}, we say that S is shattered by H if $|\{h_S : h \in H\}| = 2^m$.
The VC-dimension of a hypothesis class is the size of the largest set that is shattered by the hypothesis class. We can equivalently state it in terms of the growth function:
Definition 4.3 (VC-Dimension). Let H ⊂ {h : X → {−1, 1}}. Then we define the VC-dimension of H by
$$\mathrm{VCdim}(H) = \max\{m \in N : \Pi_H(m) = 2^m\}.$$

Example 4.1 (Intervals). Let $H = \{2 \cdot 1_{[a,b]} - 1 : a, b \in R\}$.
It is clear that VCdim(H) ≥ 2, since for $x_1 < x_2$ the functions
$$2 \cdot 1_{[x_1 - 2, x_1 - 1]} - 1, \quad 2 \cdot 1_{[x_1 - 2, x_1]} - 1, \quad 2 \cdot 1_{[x_1, x_2]} - 1, \quad 2 \cdot 1_{[x_2, x_2 + 1]} - 1$$
are all different when restricted to $S = (x_1, x_2)$.
On the other hand, if $x_1 < x_2 < x_3$, then, since $h^{-1}(\{1\})$ is an interval for all h ∈ H, $h(x_1) = 1 = h(x_3)$ implies $h(x_2) = 1$. Hence, no set of three elements can be shattered. Therefore, VCdim(H) = 2. The situation is depicted in Figure 5.

Figure 5: Different ways to classify two or three points. The coloured blocks correspond to the intervals [a, b].
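
The growth function of the interval class can also be counted directly: on m distinct points, an interval classifier labels a consecutive block of points (possibly empty) with +1, so one expects Π_H(m) = m(m+1)/2 + 1. The sketch below (our own) enumerates the realised labelings and compares their number with 2^m:

import numpy as np
from itertools import combinations

def interval_labelings(x):
    # all distinct labelings of the points x by classifiers 2*1_[a,b] - 1
    x = np.sort(x)
    labelings = {tuple([-1] * len(x))}            # the empty interval
    for i in range(len(x)):                       # intervals containing a single point
        lab = [-1] * len(x)
        lab[i] = 1
        labelings.add(tuple(lab))
    for i, j in combinations(range(len(x)), 2):   # intervals containing a block of points
        lab = [-1] * len(x)
        for k in range(i, j + 1):
            lab[k] = 1
        labelings.add(tuple(lab))
    return labelings

for m in [2, 3, 5]:
    x = np.random.uniform(0, 1, m)
    print(m, len(interval_labelings(x)), 2 ** m)  # growth function value vs 2^m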

Example 4.2 (Two-dimensional half-spaces). Let $H = \{2 \cdot 1_{R_+}(\langle a, \cdot\rangle + b) - 1 : a \in R^2, b \in R\}$ be the hypothesis set of rotated and shifted two-dimensional half-spaces. By Figure 6, we see that H shatters a set of three points.

Figure 6: Different ways to classify three points by a half-space.

For any four points $(x_1, x_2, x_3, x_4)$, one of two situations will happen: either one point is in the convex hull of the remaining three, or the four points are the vertices of a convex quadrilateral. In the first case, we can assume without loss of generality that $x_4$ is a convex combination of $x_1, x_2, x_3$. Since half-spaces are convex, we have that if $h(x_1) = h(x_2) = h(x_3) = 1$, then $h(x_4) = 1$. Therefore, we cannot shatter sets of this form. If, on the other hand, the points $(x_1, x_2, x_3, x_4)$ are the vertices of a convex quadrilateral, then, without loss of generality, the points $x_1$ and $x_3$ lie on different sides of the line connecting $x_2$ and $x_4$. Since $(x_1, x_2, x_3, x_4)$ are the extreme points of the quadrilateral, it must be the case that the segments connecting $x_1$ with $x_3$ and $x_2$ with $x_4$ intersect. Further, any half-space that contains $x_1$ and $x_3$ contains, by convexity, also the segment between $x_1$ and $x_3$. Any half-space containing neither $x_2$ nor $x_4$ contains, by convexity of its complement, no element of the segment between $x_2$ and $x_4$. Hence, there is no half-space containing $x_1$ and $x_3$ but neither $x_2$ nor $x_4$. A visualisation of the argument above is given in Figure 7.
We conclude that for the half-space classifier VCdim(H) = 3.

Figure 7: Visualisation of the argument prohibiting shattering of sets of four elements.

The half-space VC dimension bound generalises to arbitrary dimensions.


Example 4.3 (Half-spaces). Let d ∈ N and let $H = \{2 \cdot 1_{R_+}(\langle a, \cdot\rangle + b) - 1 : a \in R^d, b \in R\}$ be the hypothesis set of rotated and shifted half-spaces. Then VCdim(H) = d + 1.
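
A quick way to see the d + 1 threshold experimentally is to count how many of the 2^m sign patterns random half-spaces realise on m points in R². For m = 3 points in general position one typically finds all 8 patterns, while for m = 4 one finds strictly fewer than 16. The sketch below is our own; the random search only gives a lower bound on the number of realisable patterns:

import numpy as np

rng = np.random.default_rng(1)

def realised_patterns(X, trials=20000):
    # count the sign patterns of 2*1_{R+}(<a, x> + b) - 1 realised by random half-spaces
    patterns = set()
    for _ in range(trials):
        a = rng.normal(size=2)
        b = rng.normal()
        s = np.sign(X @ a + b)
        s[s == 0] = 1
        patterns.add(tuple(s.astype(int)))
    return patterns

X3 = rng.uniform(0, 1, (3, 2))
X4 = rng.uniform(0, 1, (4, 2))
print(len(realised_patterns(X3)), 2 ** 3)  # typically 8 of 8: three points are shattered
print(len(realised_patterns(X4)), 2 ** 4)  # strictly fewer than 16: four points in the plane are never shattered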

4.4 Generalisation bounds via VC-dimension

First, we look for connections between the VC dimension and the growth function.
Theorem 4.3. Let H ⊂ {h : X → {−1, 1}} be such that VCdim(H) = d. Then, for all m ∈ N,
$$\Pi_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}. \tag{9}$$
In particular, for all m ≥ d,
$$\log\Pi_H(m) \leq d\log\Big(\frac{em}{d}\Big) = O(d\log(m)).$$

Proof. We prove this by induction over m + d ≤ k. For k = 2, we have the options m = 1 and d = 0, 1, as well as m = 2, d = 0.
1. If d = 0 and m ∈ N, then $|H_S| \leq 1$ for all samples S of size 1 and hence $\Pi_H(1) \leq 1$. Moreover, if $\Pi_H(m) > 1$ for some m ∈ N, then there would exist a set S of m samples on which $|H_S| > 1$. That means that on at least one of the elements of S, $H_S$ takes at least two different values and hence $\Pi_H(1) > 1$, a contradiction. Hence $\Pi_H(m) \leq 1$ for all m ∈ N. The right-hand side of (9) is always at least 1.
2. If d ≥ 1 and m = 1, then $\Pi_H(1) \leq 2$ per definition, which is always bounded by the right-hand side of (9).
Assume now that the statement (9) holds for all m + d ≤ k and let $\bar{m} + \bar{d} = k + 1$. By Points 1 and 2 above, we can assume without loss of generality that $\bar{m} > 1$ and $\bar{d} > 0$.
Let $S = \{x_1, \ldots, x_{\bar{m}}\}$ be a set so that $\Pi_H(\bar{m}) = |H_S|$ and let $S' = \{x_1, \ldots, x_{\bar{m} - 1}\}$.
Let us define an auxiliary set
$$G := \{h \in H_{S'} : \exists\, h', h'' \in H_S,\ h'(x_{\bar{m}}) \neq h''(x_{\bar{m}}),\ h = h'_{S'} = h''_{S'}\}. \tag{10}$$
In words, G contains all those maps in $H_{S'}$ that have two corresponding functions in $H_S$.
Now it is clear that
$$|H_S| = |H_{S'}| + |G|. \tag{11}$$
Per assumption, $(\bar{m} - 1) + \bar{d} \leq k$ and $(\bar{m} - 1) \in N$. Hence, by the induction hypothesis,
$$|H_{S'}| \leq \Pi_H(\bar{m} - 1) \leq \sum_{i=0}^{\bar{d}}\binom{\bar{m} - 1}{i}. \tag{12}$$
Note that G is a set of functions defined on S′. Hence we can compute its VC dimension. If a set Z ⊂ S′ is shattered by G, then $Z \cup \{x_{\bar{m}}\}$ is shattered by $H_S$. We conclude that
$$\mathrm{VCdim}(G) \leq \mathrm{VCdim}(H_S) - 1 \leq \mathrm{VCdim}(H) - 1 = \bar{d} - 1.$$
Since, by assumption, $\bar{d} - 1 \geq 0$, we conclude with the induction hypothesis that
$$|G| \leq \Pi_G(\bar{m} - 1) \leq \sum_{i=0}^{\bar{d} - 1}\binom{\bar{m} - 1}{i}. \tag{13}$$
We conclude with (11), (12), and (13) that
$$\Pi_H(\bar{m}) = |H_S| = |H_{S'}| + |G| \leq \sum_{i=0}^{\bar{d}}\binom{\bar{m} - 1}{i} + \sum_{i=0}^{\bar{d} - 1}\binom{\bar{m} - 1}{i} = \sum_{i=0}^{\bar{d}}\binom{\bar{m}}{i}.$$
This completes the induction step and yields (9).

Now let us address the 'in particular' part. We have for m > d by (9) that
$$\Pi_H(m) \leq \sum_{i=0}^{d}\binom{m}{i} \leq \sum_{i=0}^{d}\binom{m}{i}\Big(\frac{m}{d}\Big)^{d-i} \leq \sum_{i=0}^{m}\binom{m}{i}\Big(\frac{m}{d}\Big)^{d-i} = \Big(\frac{m}{d}\Big)^{d}\sum_{i=0}^{m}\binom{m}{i}\Big(\frac{d}{m}\Big)^{i}.$$
The binomial theorem states that
$$\sum_{i=0}^{m}\binom{m}{i}x^{m-i}y^{i} = (x + y)^m.$$
In particular, setting x = 1 and y = d/m, we conclude that
$$\Pi_H(m) \leq \Big(\frac{m}{d}\Big)^{d}\Big(1 + \frac{d}{m}\Big)^{m} \leq \Big(\frac{m}{d}\Big)^{d} e^{d}. \tag{14}$$
The result follows by applying the logarithm to (14).
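
The two bounds of Theorem 4.3 can be compared numerically; a small sketch (our own):

import numpy as np
from math import comb, log

d = 10
for m in [20, 100, 1000]:
    sauer = sum(comb(m, i) for i in range(d + 1))   # right-hand side of (9)
    print(m, log(sauer), d * log(np.e * m / d))     # log of the Sauer bound vs the simpler bound d*log(em/d)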

Plugging Theorem 4.3 into Corollary 4.2, we can now state a generalisation bound for binary classification
in terms of the VC dimension.

Corollary 4.3. Let H ⊂ {h : X → {−1, 1}}. Then, for every δ > 0, with probability at least 1 − δ, for any h ∈ H:
$$R(h) \leq \hat{R}_S(h) + \sqrt{\frac{2d\log\big(\frac{em}{d}\big)}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
where d is the VC dimension of H and m ≥ d.

5 Lecture 5 - The Mysterious Machine


Having established some theory, we are now ready for the first challenge.

[1]: import numpy as np


import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn
%matplotlib inline

Two files will be supplied to you via Moodle: a training set 'data_train_db.csv' and a test set 'data_test_db.csv'. They were obtained by observing a mystery machine. The first entry 'Running' is 1 if the machine worked and 0 if it failed to work. In the test set, the labels are set to 2. You should predict them.
Let us look at our data first:
[2]: data_train_db = pd.read_csv('data_train_db.csv')
data_test_db = pd.read_csv('data_test_db.csv')
data_train_db.head()

[2]: Running Blue Switch On Battery level Humidity Magnetic field


0 1.0 1.0 0.504463 0.654691 0.809938
1 1.0 1.0 0.441385 0.597252 0.690019
2 0.0 1.0 0.497714 0.521752 0.512899
3 0.0 0.0 0.729477 0.974705 0.629772
4 0.0 1.0 0.828015 0.768117 0.694428

[5 rows x 100 columns]

Let's look at some more properties of the data:

[3]: data_train_db.describe()

[3]: Running Blue Switch On Battery level Humidity


count 2000.000000 2000.000000 2000.000000 2000.000000
mean 0.319000 0.803436 0.697403 0.699631
std 0.466206 1.344869 1.604714 0.903394
min 0.000000 -42.078674 -54.697685 -29.500793
25% 0.000000 1.000000 0.556451 0.556232
50% 0.000000 1.000000 0.706002 0.699358
75% 1.000000 1.000000 0.853678 0.852918
max 1.000000 18.242558 44.936291 25.747851

How is the distribution of the labels?
[4]: data_train = data_train_db.values

labels = 'Runs', 'Does not run'


sizes = [np.sum(data_train[:,0]), np.sum(1-data_train[:,0])]

fig1, ax1 = plt.subplots()


ax1.pie(sizes, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

Let's look at some standard statistics of the data:


[5]: fig = plt.figure(figsize = (14, 4))

plt.subplot(1,2,1)
plt.hist(data_train[data_train[:,0] == 1,1:].std(1))
plt.title('Distribution of standard deviation--- running')
plt.subplot(1,2,2)
plt.hist(data_train[data_train[:,0] == 0,1:].std(1))
plt.title('Distribution of standard deviation--- not running')

fig = plt.figure(figsize = (14, 4))

plt.subplot(1,2,1)
plt.hist(np.sum(data_train[data_train[:,0]==1,1:], 1)/100)
plt.title('Distribution of means--- running')
plt.subplot(1,2,2)
plt.hist(np.sum(data_train[data_train[:,0]==0,1:], 1)/100)
plt.title('Distribution of means--- not running')

fig = plt.figure(figsize = (14, 4))

plt.subplot(1,2,1)
plt.hist(np.amax(data_train[data_train[:,0] == 1,1:], axis = 1))
plt.title('Distribution of max value--- running')
plt.subplot(1,2,2)
plt.hist(np.amax(data_train[data_train[:,0] == 0,1:], axis = 1))
plt.title('Distribution of max value--- not running')

fig = plt.figure(figsize = (14, 4))

plt.subplot(1,2,1)
plt.hist(np.amin(data_train[data_train[:,0] == 1,1:], axis = 1))
plt.title('Distribution of min value--- running')
plt.subplot(1,2,2)
plt.hist(np.amin(data_train[data_train[:,0] == 0,1:], axis = 1))
plt.title('Distribution of min value--- not running')

[5]: Text(0.5, 1.0, 'Distribution of min value--- not running')

The distribution of the min values is a bit worrying. A very few entries have a very high standard deviation, and a very few (possibly the same) entries have very large negative values, while almost all other entries only contain positive values. This may be a problem in the data set. We decide that these entries are outliers and drop them from the data set.

[6]: # It seems like there are some data points which have much higher standard deviation than most.
# Let us just remove those.

def clean_dataset(data):
    to_drop = []
    for k in range(data.shape[0]):
        if data[k,:].std() > 15:
            to_drop.append(k)
    return np.delete(data, to_drop, axis = 0)

Let us apply the cleaning and look at the data set again

[7]: data_train = clean_dataset(data_train)

fig = plt.figure(figsize = (14, 4))

plt.subplot(1,2,1)
plt.hist(data_train[data_train[:,0] == 1,1:].std(1))
plt.title('Distribution of standard deviation--- running')
plt.subplot(1,2,2)
plt.hist(data_train[data_train[:,0] == 0,1:].std(1))
plt.title('Distribution of standard deviation--- not running')

fig = plt.figure(figsize = (14, 4))

plt.subplot(1,2,1)
plt.hist(np.amin(data_train[data_train[:,0] == 1,1:], axis = 1))
plt.title('Distribution of min value--- running')
plt.subplot(1,2,2)
plt.hist(np.amin(data_train[data_train[:,0] == 0,1:], axis = 1))
plt.title('Distribution of min value--- not running')

[7]: Text(0.5, 1.0, 'Distribution of min value--- not running')

This looks much better.


Now let us try to understand our data set in a bit more detail and get a feeling for the dependencies between the columns.
[8]: data_train_db.corr()

[8]: Running Blue Switch On Battery level Humidity


Running 1.000000 0.100058 0.004500 0.000527
Blue Switch On 0.100058 1.000000 0.373730 -0.374582
Battery level 0.004500 0.373730 1.000000 0.353327
Humidity 0.000527 -0.374582 0.353327 1.000000
Magnetic field -0.035802 -0.554634 0.272756 0.321244
... ... ... ... ...
Blade density 0.015307 -0.144248 -0.737770 -0.151252
Blade rotation 0.012993 0.148190 -0.527079 -0.675620

Controller mintcream -0.018974 0.208065 -0.278604 -0.801354
Controller mistyrose 0.040784 -0.166662 -0.335028 -0.225535
Controller moccasin -0.038704 0.771165 0.072201 -0.240072

[100 rows x 100 columns]

[9]: corrMatrix = data_train_db.corr()


plt.figure(figsize = (12,12))
sn.heatmap(corrMatrix, annot=False)
plt.show()

plt.figure(figsize = (12,6))
plt.plot(np.arange(1, 100), corrMatrix['Running'][1:100])
plt.title('Correlation with Running')
plt.show()

The first feature of the data set (after 'Running' itself) seems to be suspiciously important. Let's look at it in isolation.
[10]: plt.hist(data_train[:,1])
plt.title(data_train_db.columns[1])

[10]: Text(0.5, 1.0, 'Blue Switch On')

We see that 'Blue Switch On' only takes two values (on and off). Let us look in detail at the effect of this switch on whether the mechanism runs or not.
[16]: runs_switchon = np.count_nonzero((data_train[:,0]==1)*(data_train[:,1]==1))
runs_switchoff = np.count_nonzero((data_train[:,0]==1)*(data_train[:,1]==0))
runsnot_switchon = np.count_nonzero((data_train[:,0]==0)*(data_train[:,1]==1))
runsnot_switchoff = np.count_nonzero((data_train[:,0]==0)*(data_train[:,1]==0))
conf_matrix = [[runs_switchon, runs_switchoff], [runsnot_switchon, runsnot_switchoff]]

sn.set(color_codes=True)
plt.figure(1, figsize=(9, 6))

plt.title("Confusion Matrix")

sn.set(font_scale=1.4)
ax = sn.heatmap(conf_matrix, annot=True, cmap="YlGnBu", fmt='2')

ax.set_yticklabels(['runs', 'does not run'])


ax.set_xticklabels(['Blue Switch On', 'Blue Switch Off'])

[16]: [Text(0.5, 0, 'Blue Switch On'), Text(1.5, 0, 'Blue Switch Off')]

Now this is fantastic. If the Blue Switch is off, then the mechanism never works.
Next, we would like to extract additional important parameters of the machine. We rank the columns
according to their correlation with ‘Running’:

[12]: S = np.argsort(np.array(corrMatrix['Running']))[::-1]
print(S)

[ 0 1 72 98 50 74 37 10 33 14 89 41 34 8 56 68 7 90 95 11 83 39 67 64
17 81 47 70 96 92 84 27 80 82 22 69 73 24 63 60 58 13 77 86 49 2 28 5
44 53 71 16 18 3 66 45 55 75 93 79 87 52 35 61 25 59 38 42 48 43 29 85
78 26 91 36 20 51 21 23 94 88 57 15 31 19 54 65 30 97 12 9 76 32 46 6
62 40 4 99]

We saw that the first entry is always 1 if the machine is running. Also, from the ranking above, we expect that large values in coordinates 72 and 98 indicate that the machine runs.
Let us describe a hypothesis set that takes this observation into account by defining a classifier below. The hypothesis set is characterised by a thresholding value 'thresh'.

[39]: def myclassifier(data, thresh):
    if data[1] == 0:
        return 0  # If the blue switch is off, then we know that the mechanism won't work.
    if data[72] + data[98] > thresh:
        return 1
    return 0

Next we find the value of thresh that yields the best classification on the training set:

[40]: best_thresh = 0
best_err = data_train.shape[0]

for tr in range(100):
    thresh = tr/20
    err = 0
    for t in range(data_train.shape[0]):
        err = err + (myclassifier(data_train[t, :], thresh) != data_train[t, 0])
    if err < best_err:
        best_err = err
        best_thresh = thresh

print('Training accuracy: ' + str(1-best_err/data_train.shape[0]))

Training accuracy: 0.7604010025062656

The training accuracy above is quite terrible. On the other hand, the hypothesis class seems very small, so Corollary 4.2 gives us some confidence that the result may generalise in the sense that it will not be much worse on the test set. ("Not much worse, but still very bad" is of course not a very desirable outcome.)
I am sure you can do much better than this.

[41]: # Finally, we predict the result.


predicted_labels = np.zeros(data_test_db.shape[0])
data_test = data_test_db.values

for k in range(data_test_db.shape[0]):
    predicted_labels[k] = myclassifier(data_test[k, :], best_thresh)

np.savetxt('PhilippPetersens_prediction.csv', predicted_labels, delimiter=',')

Please send your result via email to [email protected]. Your email should include the names
of all people who worked on your code, their student identification numbers, a name for your team, and
the code used. It should also contain one or two paragraphs of a short description of the method you
used.

6 Lecture 6 - Lower Bounds on Learning


A finite VC dimension guarantees a controllable generalisation error, but is it necessary? Yes!
Theorem 6.1. Let H be a hypothesis set with VCdim(H) = d > 1. Then, for every m ≥ (d − 1)/2 and for every learning algorithm A there exists a distribution D over $\mathcal{X}$ and a target concept g ∈ H such that
$$\mathbb{P}_{S\sim\mathcal{D}^m}\left[\mathcal{R}_{\mathcal{D}}(\mathcal{A}(S)) > \frac{d-1}{32m}\right] \ge 0.01.$$

Proof.

1. Set-up: We first build a very imbalanced distribution. Let X := {x_1, x_2, . . . , x_d} be a set of d points in the domain that is shattered by H.
For e > 0, we define the distribution D_e by P(x_1) = 1 − 8e and P(x_k) = 8e/(d − 1) for k = 2, . . . , d.

Figure 8: Distribution D_e for e = 1/16, 1/32, 1/64.

For a sample $S \in \mathcal{X}^m$, we denote by $\bar S := \{s_i \in S : s_i \neq x_1\}$ the set of sample points different from x_1. Additionally, let $\mathcal{S} \subset \mathcal{X}^m$ be the set of samples S such that $|\bar S| \le (d-1)/2$.
For u ∈ {0, 1}^{d−1}, let f_u ∈ H be such that
$$f_u(x_1) = 1 \quad \text{and} \quad f_u(x_k) = u_{k-1} \text{ for } k = 2, \dots, d.$$
We have that f_u is well defined since H shatters X.

Assume that A is any learning algorithm. We can assume without loss of generality that A(S)(x_1) = 1. Otherwise, we could modify A to satisfy this and end up with a lower expected error, since we will only consider concepts g below that satisfy g(x_1) = 1.
2. Bounding the expected error for a fixed sample:
Let U be the uniform distribution on {0, 1}^{d−1}. Then, for any $S \in \mathcal{X}^m$,
$$\mathbb{E}_U\big(\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_U)\big) = \sum_{u \in \{0,1\}^{d-1}} \sum_{k=2}^{d} 1_{\mathcal{A}(S)(x_k) \neq f_u(x_k)}\, \mathbb{P}[x_k]\, \mathbb{P}[u],$$
where $\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_u)$ denotes the risk with target concept f_u. By reducing the set that we sum over, we may estimate from below:
$$\mathbb{E}_U\big(\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_U)\big) \ge \sum_{u \in \{0,1\}^{d-1}} \sum_{\substack{k=2 \\ x_k \notin \bar S}}^{d} 1_{\mathcal{A}(S)(x_k) \neq f_u(x_k)}\, \mathbb{P}[x_k]\, \mathbb{P}[u] = \sum_{\substack{k=2 \\ x_k \notin \bar S}}^{d} \Bigg( \sum_{u \in \{0,1\}^{d-1}} 1_{\mathcal{A}(S)(x_k) \neq f_u(x_k)}\, \mathbb{P}[u] \Bigg) \mathbb{P}[x_k].$$
By definition of f_u it is clear that, for every x_k with k > 1 and $x_k \notin \bar S$, it holds that $1_{\mathcal{A}(S)(x_k) \neq f_u(x_k)} = 1$ for exactly half of all values u ∈ {0, 1}^{d−1}. Hence, we estimate
$$\mathbb{E}_U\big(\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_U)\big) \ge \sum_{\substack{k=2 \\ x_k \notin \bar S}}^{d} \frac{1}{2}\, \mathbb{P}[x_k] \ge \frac{1}{2}\,\big(d - 1 - |\bar S|\big)\,\frac{8e}{d-1}.$$

Thus, if $S \in \mathcal{S}$, then
$$\mathbb{E}_U\big(\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_U)\big) \ge \sum_{\substack{k=2 \\ x_k \notin \bar S}}^{d} \frac{1}{2}\, \mathbb{P}[x_k] \ge \frac{1}{2}\,\frac{d-1}{2}\,\frac{8e}{d-1} = 2e. \tag{15}$$

3. Finding one 'bad' concept:

We conclude from (15) that
$$\mathbb{E}_{S \in \mathcal{S}}\, \mathbb{E}_U\big(\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_U)\big) \ge 2e,$$
where $\mathbb{E}_{S \in \mathcal{S}}$ denotes the expectation over samples conditioned on $S \in \mathcal{S}$. By Fubini's theorem, we also have that
$$\mathbb{E}_U\big(\mathbb{E}_{S \in \mathcal{S}}\, \mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_U)\big) \ge 2e. \tag{16}$$
The estimate on the expected value (16) implies that there exists at least one $u^* \in \{0,1\}^{d-1}$ such that
$$\mathbb{E}_{S \in \mathcal{S}}\, \mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) \ge 2e. \tag{17}$$
Note that, for every $S \in \mathcal{X}^m$,
$$\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) = \sum_{k=2}^{d} 1_{\mathcal{A}(S)(x_k) \neq f_{u^*}(x_k)}\, \mathbb{P}[x_k] \le \sum_{k=2}^{d} \frac{8e}{d-1} = 8e. \tag{18}$$
Now we compute
$$\begin{aligned}
\mathbb{E}_{S \in \mathcal{S}}\, \mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*})
&= \sum_{S \in \mathcal{S}:\, \mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) \ge e} \mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*})\, \mathbb{P}(S \mid \mathcal{S})
 + \sum_{S \in \mathcal{S}:\, \mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) < e} \mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*})\, \mathbb{P}(S \mid \mathcal{S}) \\
&\overset{(18)}{\le} \sum_{S \in \mathcal{S}:\, \mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) \ge e} 8e\, \mathbb{P}(S \mid \mathcal{S})
 + \sum_{S \in \mathcal{S}:\, \mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) < e} e\, \mathbb{P}(S \mid \mathcal{S}) \\
&\le 8e\, \mathbb{P}\big(\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) \ge e \mid \mathcal{S}\big) + e\,\big(1 - \mathbb{P}\big(\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) \ge e \mid \mathcal{S}\big)\big) \\
&= e + 7e\, \mathbb{P}\big(\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) \ge e \mid \mathcal{S}\big).
\end{aligned}$$
With (17), we conclude that
$$\mathbb{P}\big(\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) \ge e \mid S \in \mathcal{S}\big) \ge \frac{1}{7}.$$
Consequently, for $S \sim \mathcal{D}_e^m$ we have that
$$\mathbb{P}_{S \sim \mathcal{D}_e^m}\big(\mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) \ge e\big) \ge \frac{\mathbb{P}_{\mathcal{D}_e^m}(\mathcal{S})}{7}. \tag{19}$$

4. Finding $\mathbb{P}_{\mathcal{D}_e^m}[\mathcal{S}]$:

We will use the following multiplicative Chernoff bound:

Theorem 6.2 (Multiplicative Chernoff Bound). Let X_1, . . . , X_m be independent random variables drawn according to a distribution D with mean µ and such that 0 ≤ X_k ≤ 1 almost surely for all k ∈ [m]. Then, for γ ∈ [0, 1/µ − 1] it holds that
$$\mathbb{P}[\hat\mu \ge (1+\gamma)\mu] \le \exp\!\Big(-\frac{m\mu\gamma^2}{3}\Big), \qquad \mathbb{P}[\hat\mu \le (1-\gamma)\mu] \le \exp\!\Big(-\frac{m\mu\gamma^2}{2}\Big),$$
where $\hat\mu = \frac{1}{m}\sum_{i=1}^m X_i$.

Let Y_1, . . . , Y_m be i.i.d. distributed according to D_e. Further, let for k ∈ [m]
$$Z_k := 1_{\{x_2, \dots, x_d\}}(Y_k).$$
It is clear that E(Z_k) = 8e. Assuming that 8e ≤ 1/2, we can apply Theorem 6.2 with γ = 1 to obtain
$$\mathbb{P}\left( \sum_{i=1}^{m} Z_i \ge 16em \right) \le \exp\!\Big(-\frac{8em}{3}\Big). \tag{20}$$
Now notice that if a sample S = (Y_1, . . . , Y_m) is not in $\mathcal{S}$, then the associated (Z_1, . . . , Z_m) must satisfy $\sum_{i=1}^m Z_i > (d-1)/2$. Therefore,
$$1 - \mathbb{P}(\mathcal{S}) \le \mathbb{P}\left( \sum_{i=1}^{m} Z_i \ge \frac{d-1}{2} \right).$$

5. Finishing the proof: Setting e = (d − 1)/(32m) ≤ 1/16, we have 16em = (d − 1)/2 and conclude from (20) that
$$\mathbb{P}(\mathcal{S}) \ge 1 - \exp\!\Big(-\frac{d-1}{12}\Big) \ge 7\delta$$
for some δ > 1/100 (using d ≥ 2).

We conclude with (19) that
$$\mathbb{P}\left( \mathcal{R}_{\mathcal{D}_e}(\mathcal{A}(S), f_{u^*}) \ge \frac{d-1}{32m} \right) > \frac{1}{100},$$
which is the claim.

A similar result to Theorem 6.1 holds in the non-realisable/agnostic setting.

Theorem 6.3. Let H be a hypothesis set with d = VCdim(H) > 1. Then for every m ∈ N and any learning algorithm A, there exists a distribution D over $\mathcal{X} \times \{-1,1\}$ such that
$$\mathbb{P}_{S \sim \mathcal{D}^m}\left( \mathcal{R}_{\mathcal{D}}(\mathcal{A}(S)) - \inf_{h \in \mathcal{H}} \mathcal{R}_{\mathcal{D}}(h) > \sqrt{\frac{d}{320m}} \right) \ge \frac{1}{64}.$$

7 Lecture 7 - The Mysterious Machine - Discussion


This will be a discussion about the challenge as well as help with coding issues.
I recommend that you use Python and Jupyter notebooks. See, for example, https://jupyter.org/install for a guide to installing both.

8 Lecture 8 - Model Selection
How do we choose an appropriate hypothesis set or learning algorithm for a given problem?
For a given binary hypothesis class H and a function h ∈ H, we have that
$$\mathcal{R}(h) - \mathcal{R}^* = \underbrace{\Big(\mathcal{R}(h) - \inf_{g \in \mathcal{H}} \mathcal{R}(g)\Big)}_{\text{estimation}} + \underbrace{\Big(\inf_{g \in \mathcal{H}} \mathcal{R}(g) - \mathcal{R}^*\Big)}_{\text{approximation}}, \tag{21}$$
where R* is the Bayes error of Definition 3.2. See Figure 9 for a visualisation of (21).

Figure 9: Visualisation of (21), where h* is the Bayes classifier.

8.1 Empirical Risk Minimisation


Empirical risk minimisation is the algorithm that chooses the hypothesis with the smallest empirical risk.
Definition 8.1. Let H be a hypothesis set and S be a sample. Then we define the solution of empirical risk minimisation as
$$h_S^{ERM} := \operatorname*{arg\,min}_{h \in \mathcal{H}} \widehat{\mathcal{R}}_S(h).$$
Note that $h_S^{ERM}$ does not need to exist, but if S is finite and Y is too, as in the binary classification case, then it is easy to see that $h_S^{ERM}$ is well defined.
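As a small illustration (not part of the original notes), empirical risk minimisation over a simple finite hypothesis class of threshold classifiers can be written in a few lines; the data and the class are made up for demonstration purposes.

import numpy as np

def erm_threshold(x, y, thresholds):
    # ERM over the finite class {x -> 1[x > t] : t in thresholds} with the 0-1 loss
    best_t, best_risk = None, np.inf
    for t in thresholds:
        risk = np.mean((x > t).astype(int) != y)   # empirical risk of the threshold classifier
        if risk < best_risk:
            best_t, best_risk = t, risk
    return best_t, best_risk

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = (x > 0.6).astype(int)                          # noiseless labels generated by a threshold at 0.6
print(erm_threshold(x, y, thresholds=np.linspace(0, 1, 21)))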
We have that the empirical risk minimiser inflicts a small estimation error if the generalisation error is
small.
Proposition 8.1. Let H be a hypothesis set and S be a sample. Then we have that
$$\mathbb{P}\left( \mathcal{R}(h_S^{ERM}) - \inf_{h \in \mathcal{H}} \mathcal{R}(h) > e \right) \le \mathbb{P}\left( \sup_{h \in \mathcal{H}} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| > e/2 \right). \tag{22}$$

Proof. For every δ > 0, there exists h_δ ∈ H such that R(h_δ) − inf_{h∈H} R(h) < δ. Therefore, we have that
$$\mathbb{P}\left( \mathcal{R}(h_S^{ERM}) - \inf_{h \in \mathcal{H}} \mathcal{R}(h) > e \right) \le \mathbb{P}\left( \mathcal{R}(h_S^{ERM}) - \mathcal{R}(h_\delta) > e - \delta \right) \tag{23}$$
for all δ > 0.
Moreover,
$$\begin{aligned}
\mathcal{R}(h_S^{ERM}) - \mathcal{R}(h_\delta) &= \mathcal{R}(h_S^{ERM}) - \widehat{\mathcal{R}}_S(h_S^{ERM}) + \widehat{\mathcal{R}}_S(h_S^{ERM}) - \mathcal{R}(h_\delta) \\
&\le \mathcal{R}(h_S^{ERM}) - \widehat{\mathcal{R}}_S(h_S^{ERM}) + \widehat{\mathcal{R}}_S(h_\delta) - \mathcal{R}(h_\delta) \\
&\le 2 \sup_{h \in \mathcal{H}} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)|,
\end{aligned}$$
where the first inequality uses that $\widehat{\mathcal{R}}_S(h_S^{ERM}) \le \widehat{\mathcal{R}}_S(h_\delta)$. We obtain from (23) that
$$\mathbb{P}\left( \mathcal{R}(h_S^{ERM}) - \inf_{h \in \mathcal{H}} \mathcal{R}(h) > e \right) \le \mathbb{P}\left( \sup_{h \in \mathcal{H}} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| > \frac{e - \delta}{2} \right). \tag{24}$$
Since the left-hand side of (24) is independent of δ, we obtain the claim from the continuity of measures.

We saw before that we can control the right hand side of (22), if the VC dimension of H is bounded.
Thereby, (22) yields a bound on the estimation error. However, requiring a small VC dimension does not
let us take a very large hypothesis space. This implies that we may have a large approximation error.

8.2 Structural risk minimisation


Here we perform ERM over nested hypothesis spaces
$$\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots \subset \mathcal{H}_k \subset \cdots.$$
The approximation error will decrease (or at least not increase) for growing k, while the estimation error decreases with decreasing k. The idea is shown in Figure 10.

Figure 10: Visualisation of structural risk minimisation, where h* is the Bayes classifier.

Structural risk minimisation is a method to choose an appropriate value of k. Here one employs a penalty
on large terms.

Definition 8.2. Let (H_k)_{k=1}^∞ be a sequence of hypothesis sets and let S be a sample. Then, the solution of structural risk minimisation is
$$h_S^{SRM} := \operatorname*{arg\,min}\{ F_k(h) : k \in \mathbb{N},\ h \in \mathcal{H}_k \}, \quad \text{where} \quad F_k(h) := \widehat{\mathcal{R}}_S(h) + \mathfrak{R}_m(\mathcal{H}_k) + \sqrt{\frac{\log k}{m}}.$$
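To make Definition 8.2 concrete, the following sketch (not from the original notes; all numbers are hypothetical and the Rademacher complexities are assumed to be given as upper bounds) evaluates the SRM objective for a nested family of classes and picks the minimiser.

import numpy as np

def srm_select(emp_risks, rademacher_bounds, m):
    # F_k = empirical risk of the ERM solution in H_k + complexity of H_k + sqrt(log k / m)
    ks = np.arange(1, len(emp_risks) + 1)
    scores = np.asarray(emp_risks) + np.asarray(rademacher_bounds) + np.sqrt(np.log(ks) / m)
    return int(np.argmin(scores)) + 1, scores

# hypothetical numbers: richer classes fit the data better but are more complex
k_best, scores = srm_select(emp_risks=[0.30, 0.18, 0.12, 0.11, 0.10],
                            rademacher_bounds=[0.02, 0.05, 0.09, 0.15, 0.25],
                            m=1000)
print(k_best, np.round(scores, 3))

The selected index balances the decreasing empirical risk against the growing complexity penalty, exactly as in the bias-variance picture of Figure 10.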
We have the following learning guarantee for SRM:
Theorem 8.1. Let δ > 0, let (H_k)_{k=1}^∞ be a sequence of hypothesis sets, $\mathcal{H} := \bigcup_{k \in \mathbb{N}} \mathcal{H}_k$, and let D be a distribution. With probability at least 1 − δ over a sample S ∼ D^m, it holds that
$$\mathcal{R}(h_S^{SRM}) \le \inf_{h \in \mathcal{H}} \left( \mathcal{R}(h) + 2\mathfrak{R}_m(\mathcal{H}_{k(h)}) + \sqrt{\frac{\log k(h)}{m}} \right) + \sqrt{\frac{2\log(3/\delta)}{m}},$$
where k(h) is the smallest k such that h ∈ H_k.

Proof. We first remind ourselves of Theorem 4.1, where we found that with probability at least 1 − δ
s
log 1δ
R(h) ≤ Rb S (h) + Rm (H) +
2m
or equivalently
 
b S (h) − Rm (H) > δ ≤ e−2mδ2 .
P R(h) − R (25)

We compute with a union bound that


! !
P sup R(h) − Fk(h) (h) > e = P sup sup R(h) − Fk (h) > e
h∈H k∈N h∈Hk

!
≤ ∑P sup R(h) − Fk (h) > e .
k =1 h∈Hk

Invoking the definition of Fk and (25) yields that



!
∑P sup R(h) − Fk (h) > e
k =1 h∈Hk

!
q
= ∑P sup R(h) − R
b S ( h ) − Rm ( h ) > e + log(k)/m
k =1 h∈Hk
∞  q 2 !
≤ ∑ exp −2m e + log(k )/m
k =1

∑ e−2me e−2 log(k)
2

k =1

1 π 2 −2me2
∑ k2
2 2
= e−2me = e ≤ 2e−2me . (26)
k =1
6

Consider two random variables X1 , X2 . It is clear that for every t ∈ R


P( X1 + X2 > t) ≤ P( X1 > t/2) + P( X2 > t/2). (27)

Now we compute for an arbitrary h ∈ H
r !
log k (h)
P R(hSRM
S ) − R(h) − 2Rm (Hk(h) ) − >e
m
 
[ Equation (27) ] ≤ P R(hSRM
S ) − Fk(h SRM (
) Sh SRM
) > e/2
S
r !
log k (h)
+P Fk(hSRM ) (hSRM
S ) − R(h) − 2Rm (Hk(h) ) − > e/2
S m
r !
−me2 /2 log k (h)
[ Equation (26) ] ≤ 2e + P Fk(hSRM ) (hS ) − R(h) − 2Rm (Hk(h) ) −
SRM
> e/2
S m
r !
2 log k ( h )
≤ 2e−me /2 + P Fk(h) (h) − R(h) − 2Rm (Hk(h) ) − > e/2
m
2
 
≤ 2e−me /2 + P R b S (h) − R(h) − Rm (Hk(h) ) > e/2
2 /2
[ Equation (25) ] ≤ 3e−me .

This completes the proof.


Remark 8.1. Except for the term $\sqrt{\log(k(h))/m}$, the generalisation bound of SRM is that of the best hypothesis from the sequence (H_k)_{k=1}^∞. On the flip side, we would need to solve many empirical risk minimisations and know the Rademacher complexities of all individual hypothesis sets.

8.3 Cross-validation
Definition 8.3. Let (H_k)_{k=1}^∞ be a sequence of hypothesis sets, let α ∈ (0, 1), and let S = (x_i, y_i)_{i=1}^m be a sample. Then, the solution of cross-validation is
$$h_S^{CV} := \operatorname*{arg\,min}\big\{ \widehat{\mathcal{R}}_{S_2}(h_{S_1,k}^{ERM}) : h_{S_1,k}^{ERM} \in \mathcal{H}_k,\ k \in \mathbb{N} \big\},$$
where $S_1 = (x_i, y_i)_{i=1}^{m'}$ for $m' = \lceil (1-\alpha)m \rceil$, $S_2 = (x_i, y_i)_{i=m'+1}^{m}$, and $h_{S_1,k}^{ERM}$ is the empirical risk minimiser over the hypothesis class H_k with sample S_1.
In words, cross-validation consists in setting aside a validation set on which the loss is measured, but which is not used for training.
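As a small sketch of Definition 8.3 (not from the original notes; the data, the polynomial hypothesis classes, and the square loss instead of the 0-1 loss are choices made only for illustration), one can select a polynomial degree by validating on a held-out part of the sample.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=200)

alpha = 0.2
m1 = int(np.ceil((1 - alpha) * len(x)))            # size of the training part S_1
x1, y1, x2, y2 = x[:m1], y[:m1], x[m1:], y[m1:]    # S_2 is the held-out validation set

best_k, best_val = None, np.inf
for k in range(1, 16):                             # H_k = polynomials of degree at most k
    coeffs = np.polyfit(x1, y1, deg=k)             # ERM (least squares) over H_k on S_1
    val = np.mean((np.polyval(coeffs, x2) - y2) ** 2)   # empirical risk on S_2
    if val < best_val:
        best_k, best_val = k, val
print(best_k, round(best_val, 4))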
The following proposition will be stated without proof. It shows that with high probability, the empirical
risk with respect to S2 is close to the expected risk. The proof is based on Hoeffding’s inequality.
Proposition 8.2. Let (H_k)_{k=1}^∞ be a sequence of hypothesis sets, let α ∈ (0, 1), and let S ∼ D^m and S_1 be as in Definition 8.3. Then it holds that
$$\mathbb{P}\left( \sup_{k \ge 1} \left| \mathcal{R}(h_{S_1,k}^{ERM}) - \widehat{\mathcal{R}}_{S_2}(h_{S_1,k}^{ERM}) \right| - \sqrt{\frac{\log k}{\alpha m}} > e \right) \le 4\exp(-2\alpha m e^2). \tag{28}$$

Based on the result above, we can show that cross-validation can often perform very similarly to structural
risk minimisation.

Theorem 8.2. Let (H_k)_{k=1}^∞ be a sequence of hypothesis sets. Let α ∈ (0, 1), and let S ∼ D^m and S_1 be as in Definition 8.3. For every δ ∈ (0, 1) it holds with probability 1 − 2δ that
$$\mathcal{R}(h_S^{CV}) - \mathcal{R}(h_{S_1}^{SRM}) \le 2\sqrt{\frac{\log\big(\max\{k(h_S^{CV}),\, k(h_{S_1}^{SRM})\}\big)}{\alpha m}} + 2\sqrt{\frac{\log(4/\delta)}{2\alpha m}},$$
where k(h) denotes the smallest k such that h ∈ H_k.

Proof. We have by Proposition 8.2 that


s r
log(k (hCV
S )) log(4/δ)
R(hCV
S ) ≤ b S (hCV ) +
R 2 S + =: I, (29)
αm 2αm

with probability 1 − δ. Since hSRM


S1 is an empirical risk minimiser on Hk with k = k (hSRM
S1 ), we have that
b S (h ) ≤ R
R CV b S (h SRM ) and hence
2 S 2 S1
s r
b S (hSRM ) + log(k (hCV
S )) log(4/δ)
I≤ R 2 S1 + =: II
αm 2αm

Again, for k = k (hSRM SRM is the empirical risk minimiser over H . Therefore, we get from
S1 ) it holds that hS1 k
Proposition 8.2 that with probability 1 − 2δ
s s
log(k (hSRM
r
CV ))
SRM log ( k ( h S S1 )) log(4/δ)
II ≤ R(hS1 ) + + +2
αm αm 2αm
s
max{log(k (hCV SRM
r
SRM S )), log( k ( hS1 ))} log(4/δ)
≤ R(hS1 ) + 2 +2 .
αm 2αm

If αm is not too small, i.e., when the validation set is large, then we achieve similar results with cross
validation to those achieved with structural risk minimisation with sample S1 . However, if this means
that S1 is very small, then, we do not benefit. Hence, the right choice of α is crucial.

9 Lecture 9 - Regression and Regularisation


9.1 Regression and general loss functions
Until now, we have stated most of our results for binary classification problems. In practice, we often
have labels that are not necessarily only 0 or 1. Hence, we need to generalise Definition 2.2 and Definition
2.1/ Equation (2).
We do this by invoking the notion of a loss function already introduced in Definition 3.5.
Definition 9.1. Let L be a loss function on Y × Y . Let D be a distribution on X × Y and let h ∈ H. The risk of
h is defined by
R L (h) = ED ( L(h( x ), y)) .
Similarly, we define the empirical risk for a general loss function:

Definition 9.2. Let L be a loss function on Y × Y. Let h ∈ H, and let S := (x_i, y_i)_{i=1}^m be a training sample. The empirical risk is defined as
$$\widehat{\mathcal{R}}_{S,L}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i).$$

Note that, Theorem 3.1 yields generalisation bounds for these loss functions.
Example 9.1. Some loss functions that are quite frequently used:
• The 0-1 loss: L0−1 (y1 , y2 ) = 1y1 6=y2 . We have used this everywhere until now. Used if Y = { a, b} for a 6= b.
• The quadratic loss: L2 (y1 , y2 ) = ky1 − y2 k2 . Used if Y is a normed space such as Rd , d ∈ N.
• Cross entropy loss/Log Likelihood-loss: LCE (y1 , y2 ) = −(y1 log(y2 ) + (1 − y1 ) log(1 − y2 )). Used if Y ⊂
[0, 1], y1 ∈ {0, 1}, y2 ∈ (0, 1).
• Hinge loss: L_H(y_1, y_2) = max{0, 1 − y_1 y_2}. Used if Y ⊂ [−1, 1].
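These losses are straightforward to implement; the following short sketch (added for illustration, not part of the original notes) shows one possible vectorised version of each.

import numpy as np

def zero_one_loss(y1, y2):
    return np.asarray(y1 != y2, dtype=float)

def quadratic_loss(y1, y2):
    return np.sum((np.atleast_1d(y1) - np.atleast_1d(y2)) ** 2)

def cross_entropy_loss(y1, y2):          # y1 in {0,1}, y2 in (0,1)
    return -(y1 * np.log(y2) + (1 - y1) * np.log(1 - y2))

def hinge_loss(y1, y2):                  # labels/scores in [-1, 1]
    return np.maximum(0.0, 1.0 - y1 * y2)

print(zero_one_loss(1, -1), quadratic_loss([1.0, 2.0], [0.5, 2.5]),
      cross_entropy_loss(1, 0.9), hinge_loss(1, 0.3))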

9.2 Linear regression


Using non-binary loss functions, we can now also solve regression problems via empirical risk minimisa-
tion. One classical example is linear regression.
In linear regression we have a distribution D on R p × Rq and H is the set of all linear maps from R p → Rq
which we can interpret as q × p matrices.
Choosing q = 1 for simplicity, we have for a sample S = (x_i, y_i)_{i=1}^m and the square loss L_2 that
$$h_{S,L_2}^{ERM}(x) = \langle a, x\rangle, \quad \text{where} \quad a \in \operatorname*{arg\,min}_{a \in \mathbb{R}^p} \sum_{i=1}^{m} |\langle a, x_i\rangle - y_i|^2.$$

Clearly a is the solution of the least squares problem
$$a = \operatorname*{arg\,min}_{a \in \mathbb{R}^p} \|Xa - y\|^2, \tag{30}$$
where the rows of X are the x_i and y = (y_i)_{i=1}^m. If X^T X is invertible, one solution of (30) is
$$\hat a = (X^T X)^{-1} X^T y.$$
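A tiny numerical check (an illustration added here, not part of the original notes; the data is synthetic) confirms that the normal equations and a generic least-squares solver give the same answer when X^T X is invertible, while np.linalg.lstsq also covers the degenerate case.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                        # rows are the samples x_i
a_true = np.array([1.0, -2.0, 0.5])
y = X @ a_true + 0.1 * rng.normal(size=50)

a_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations
a_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]      # SVD-based solver, works even if X^T X is singular
print(np.allclose(a_normal_eq, a_lstsq))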

A small generalisation of linear regression is polynomial regression or more generally basis regression.
Let (h_k)_{k=1}^K be linearly independent functions such that span((h_k)_{k=1}^K) = H ⊂ {X → R} for an arbitrary linear space X. Then finding
$$\operatorname*{arg\,min}_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} |h(x_i) - y_i|^2$$
is equivalent to finding
$$\operatorname*{arg\,min}_{a \in \mathbb{R}^K} \frac{1}{m} \sum_{i=1}^{m} \Big| \sum_{k=1}^{K} a_k h_k(x_i) - y_i \Big|^2. \tag{31}$$
Hence, setting X_{i,k} = h_k(x_i), we have that
$$\hat a = (X^T X)^{-1} X^T y$$
solves (31). Finally,
$$h_{S,L_2}^{ERM} = \sum_{k=1}^{K} \hat a_k h_k.$$

[2]: import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt

Let us look at a hypothesis set of sums of sinosoids up to a frequency of 10.

[86]: x = np.arange(0,1,0.01) # everything lives on [0,1]

sines = np.zeros([100,10])
for k in range(10):
    sines[:,k] = np.sin(2*(k+1)*np.pi*(x + 0.1)) # small shift to not have all start at 0.

plt.figure(figsize = (15, 5))


a = np.random.uniform(0,1, 10)
plt.subplot(1,3,1)
plt.plot(x, np.dot(sines,a))
a = np.random.uniform(0,1, 10)
plt.subplot(1,3,2)
plt.plot(x, np.dot(sines,a))
a = np.random.uniform(0,1, 10)
plt.subplot(1,3,3)
plt.plot(x, np.dot(sines,a))

[86]:

Let us fit some data:


[108]: num_points = 15

x_dat = np.arange(0,1,1/num_points)
y_dat = np.sin(2*np.pi*(x_dat + 0.1)) # this data should be very easy to fit.

hx_dat = np.zeros([num_points,10])
for k in range(10):
    hx_dat[:, k] = np.sin(2*(k+1)*np.pi*(x_dat + 0.1))

a = np.linalg.lstsq(hx_dat, y_dat, rcond=-1)[0]

plt.figure(figsize = (15, 5))
plt.subplot(1,3,1)
plt.plot(x, np.dot(sines,a))
plt.scatter(x_dat, y_dat, c = 'r')

# We change one value by almost nothing.


y_dat[4] = y_dat[4] - 1e-15
a = np.linalg.lstsq(hx_dat, y_dat, rcond=-1)[0]
plt.subplot(1,3,2)
plt.plot(x, np.dot(sines,a))
plt.scatter(x_dat, y_dat, c = 'r')
plt.scatter(x_dat[4], y_dat[4], c = 'g',s=80)

# We change one value by ten times almost nothing.


y_dat[4] = y_dat[4] - 1e-14
a = np.linalg.lstsq(hx_dat, y_dat, rcond=-1)[0]
plt.subplot(1,3,3)
plt.plot(x, np.dot(sines,a))
plt.scatter(x_dat, y_dat, c = 'r')
plt.scatter(x_dat[4], y_dat[4], c = 'g',s=80)

[108]:

Polynomial/basis regression does not seem to be very stable with respect to very small changes of a single data point. Stability, however, seems to be a quite desirable property if we want to generalise well. We make this more precise in the next chapter.

9.3 Stability and overfitting


Let S = (s_k)_{k=1}^m ∼ D^m be a sample. We denote by S^i = (s_k^i)_{k=1}^m the sample with s_k^i = s_k for all k ≠ i and s_i^i ∼ D independent from S.
Now let A be a sensible learning algorithm and L be a loss function. Then, we expect that
$$L(\mathcal{A}(S), s_i) \le L(\mathcal{A}(S^i), s_i),$$
where we use the short-hand notation L(A(S), s_i) = L(A(S)(x_i), y_i) with s_i = (x_i, y_i). If, on the other hand,
$$L(\mathcal{A}(S), s_i) \ll L(\mathcal{A}(S^i), s_i),$$
then the algorithm performs well on the sample s_i only if it sees it in its training set. This is a sign of overfitting.

Indeed, we have the following theorem.
Theorem 9.1. Let D be a distribution and S ∼ D^m. Let U be the uniform distribution on [m]. Further, let L be a loss function. Then, for every learning algorithm A,
$$\mathbb{E}\big( \mathcal{R}_L(\mathcal{A}(S)) - \widehat{\mathcal{R}}_{S,L}(\mathcal{A}(S)) \big) = \mathbb{E}\, \mathbb{E}_{i \sim U}\big( L(\mathcal{A}(S^i)(x_i), y_i) - L(\mathcal{A}(S)(x_i), y_i) \big), \tag{32}$$
where (x_i, y_i) = s_i and (x_i^i, y_i^i) = s_i^i.

Proof. Since (x_i^i, y_i^i) is independent of S, we have, for every fixed i ∈ [m], that
$$\mathbb{E}\big( \mathcal{R}_L(\mathcal{A}(S)) \big) = \mathbb{E}\big( L(\mathcal{A}(S)(x_i^i), y_i^i) \big) = \mathbb{E}\big( L(\mathcal{A}(S^i)(x_i), y_i) \big),$$
where the last equality follows by swapping s_i and s_i^i. Moreover, we have that
$$\mathbb{E}\big( \widehat{\mathcal{R}}_{S,L}(\mathcal{A}(S)) \big) = \mathbb{E}\left( \frac{1}{m} \sum_{i=1}^{m} L(\mathcal{A}(S)(x_i), y_i) \right) = \mathbb{E}\,\mathbb{E}_{i \sim U}\big( L(\mathcal{A}(S)(x_i), y_i) \big).$$
The result now follows from the linearity of the expected value.

Bounding the right-hand side of (32) yields another way to guarantee a small generalisation error. Unfortunately, we have just seen in the previous chapter that for linear regression we cannot expect the right-hand side of (32) to be small.
We will address this problem in the next chapter. Beforehand, let us fix the concept of stability used in Theorem 9.1 in the form of a definition.
Definition 9.3. Let κ : N → R be monotonically decreasing. A learning algorithm A is on-average-replace-one-stable with rate κ if for every distribution D and every m ∈ N it holds for every sample S ∼ D^m that
$$\mathbb{E}_{S,\, U(m)}\big( L(\mathcal{A}(S^i)(x_i), y_i) - L(\mathcal{A}(S)(x_i), y_i) \big) \le \kappa(m), \tag{33}$$
where U(m) is the uniform distribution on [m].

9.4 Regularised risk minimisation


We introduce yet another risk minimisation problem. This time we add an auxiliary function to distort
the problem in a hopefully sensible way. The effect that we aim at is that we obtain some stability in the
sense of Definition (9.3) and Theorem 9.1.
Definition 9.4. Let H = (h_θ)_{θ∈Θ} be a hypothesis set, let L : X × Y → R be a loss function, let S be a sample, and let r : Θ → R. Then we define the solution of regularised risk minimisation with regulariser r as $h_{S,L}^{RRM} = h_{\theta_{S,L}^{RRM}}$, where
$$\theta_{S,L}^{RRM} := \operatorname*{arg\,min}_{\theta \in \Theta}\ \widehat{\mathcal{R}}_{S,L}(h_\theta) + r(\theta).$$

Choosing L to be the 0-1 loss, $\mathcal{H} := \bigcup_{k \in \mathbb{N}} \mathcal{H}_k$ for a sequence of hypothesis sets (H_k)_{k∈N}, and
$$r(\theta) := \mathfrak{R}_m(\mathcal{H}_{k(h_\theta)}) + \sqrt{\log k(h_\theta)/m}$$
shows that structural risk minimisation is a special case of regularised risk minimisation.

9.5 Tikhonov regularisation
If we have a hypothesis class H = (h_θ)_{θ∈Θ} where Θ ⊂ R^d, then we call the regulariser
$$r_{\mathrm{Tikh},\lambda} : \Theta \to \mathbb{R}, \qquad r_{\mathrm{Tikh},\lambda}(\theta) := \lambda \|\theta\|^2,$$
the Tikhonov regulariser. Here ‖·‖ is the Euclidean norm. We will see below that this norm can be replaced by any sufficiently convex norm and Θ can be a general normed space.
We are now interested in finding out under which condition regularised risk minimisation with the
Tikhonov regulariser admits generalisation bounds. We first study the convexity of r Tikh,λ .
Definition 9.5. For a normed space X, we say that a function f : X → R is strongly λ-convex if for all x_1, x_2 ∈ X and all α ∈ (0, 1) it holds that
$$f(\alpha x_1 + (1-\alpha)x_2) \le \alpha f(x_1) + (1-\alpha) f(x_2) - \frac{\lambda}{2}\, \alpha(1-\alpha)\|x_1 - x_2\|^2.$$

Figure 11: Example of a strongly convex function.

The following lemma will prove useful in the next proof:


Lemma 9.1. Let X be a normed space and λ > 0. The following statements hold:
1. r_{Tikh,λ} is 2λ-strongly convex.
2. If f : X → R is a λ-strongly convex function and g is a convex function, then f + g is λ-strongly convex.
3. If f : X → R is λ-strongly convex and f(x) = min_{z∈X} f(z), then for every y ∈ X it holds that
$$f(y) - f(x) \ge \frac{\lambda}{2}\|x - y\|^2.$$
With these observations in place, we can state the following result:
Proposition 9.1. Assume that L is a loss function which is ρ-Lipschitz in the first coordinate and such that θ ↦ L(h_θ(x), y) is convex for every (x, y) ∈ X × Y. Let S ∼ D^m and λ > 0, and let $\mathcal{A}(S) = h_{S,L}^{RRM}$, where $h_{S,L}^{RRM}$ is the solution of regularised risk minimisation with regulariser r_{Tikh,λ}.
Then, A is on-average-replace-one-stable with rate 2ρ²/(λm). In particular,
$$\mathbb{E}\big( \mathcal{R}_L(h_{S,L}^{RRM}) - \widehat{\mathcal{R}}_{S,L}(h_{S,L}^{RRM}) \big) \le \frac{2\rho^2}{\lambda m}. \tag{34}$$

Proof. We write R(θ; S) = R b S,L (hθ ). Then, by Points 1 and 2 of Lemma 9.1, we conclude that θ 7→
R(θ; S) + r Tikh,λ (θ ) is 2λ-strongly convex. By Point 3 of Lemma 9.1, we conclude that for every θ 0 and if
θ 00 is the minimiser of R(·; S) + r Tikh,λ (·)

R(θ 0 ; S) + r Tikh,λ (θ 0 ) − (R(θ 00 ; S) + r Tikh,λ (θ 00 )) ≥ λkθ 00 − θ 0 k2 . (35)

For every i ∈ [m], the left-hand side of (35) can be rewritten as

R(θ 0 ; S) + r Tikh,λ (θ 0 ) − (R(θ 00 ; S) + r Tikh,λ (θ 00 ))


= R(θ 0 ; Si ) + rTikh,λ (θ 0 ) − (R(θ 00 ; Si ) + rTikh,λ (θ 00 ))
L(hθ 0 ( xi ), yi ) − L(hθ 0 ( xii ), yii ) L(hθ 00 ( xi ), yi ) − L(hθ 00 ( xii ), yii )
+ − . (36)
m m
Now if we choose θ 0 as the minimiser of R(·; Si ) + r Tikh,λ (·), then

R(θ 0 ; Si ) + r Tikh,λ (θ 0 ) − (R(θ 00 ; Si ) + r Tikh,λ (θ 00 )) ≤ 0

and hence (36) implies

R(θ 0 ; S) + r Tikh,λ (θ 0 ) − (R(θ 00 ; S) + r Tikh,λ (θ 00 ))


L(hθ 0 ( xi ), yi ) − L(hθ 0 ( xii ), yii ) L(hθ 00 ( xi ), yi ) − L(hθ 00 ( xii ), yii )
≤ − (37)
m m
L(hθ 0 ( xi ), yi ) − L(hθ 00 ( xi ), yi ) L(hθ 0 ( xii ), yii ) − L(hθ 00 ( xii ), yii )
= − . (38)
m m
Combined with (35) we now have that

L(hθ 0 ( xi ), yi ) − L(hθ 00 ( xi ), yi ) L(hθ 0 ( xii ), yii ) − L(hθ 00 ( xii ), yii )


λkθ 00 − θ 0 k2 ≤ − . (39)
m m
The estimate of (39) seems to go in the wrong direction to obtain the on-average-replace-one-stability.
However, the Lipschitz property of L allows us to perform a boot-strap argument. We observe that

| L(hθ 0 ( xi ), yi ) − L(hθ 00 ( xi ), yi )| ≤ ρkθ 0 − θ 00 k. (40)

Of course (40) holds when replacing xi and yi by xii and yii , respectively. If we plug this equation into (39),
we obtain

L(hθ 0 ( xi ), yi ) − L(hθ 0 ( xii ), yii ) L(hθ 00 ( xi ), yi ) − L(hθ 00 ( xii ), yii ) 2ρ


λkθ 00 − θ 0 k2 ≤ − ≤ kθ 00 − θ 0 k.
m m m
This implies that


kθ 00 − θ 0 k ≤ .
λm
If we combine this estimate with (40), then we conclude that

2ρ2
L(hθ 0 ( xi ), yi ) − L(hθ 00 ( xi ), yi ) ≤ .
λm
2ρ2
This implies the on-average-replace-one-stability of A with rate λm . The "in particular" part of the state-
ment follows from Theorem 9.1.

Lipschitz continuity of the loss may sometimes be a bit much to ask. For example the very frequently
used square loss is not Lipschitz continuous in its input (unless it is restricted to a compact set). The
result holds under weaker conditions that include the square loss, too.
Corollary 9.1. Assume that L is a loss function which is ρ-Lipschitz in the first coordinate and such that θ ↦ L(h_θ(x), y) is convex for every (x, y) ∈ X × Y. Let S ∼ D^m and λ > 0, and let $h_{S,L}^{RRM}$ be the solution of the regularised risk minimisation with regulariser r_{Tikh,λ}.

Then the following statements hold:
1. For all θ* ∈ Θ it holds that
$$\mathbb{E}\big(\mathcal{R}_L(h_{S,L}^{RRM})\big) \le \mathcal{R}_L(h_{\theta^*}) + \lambda\|\theta^*\|^2 + \frac{2\rho^2}{\lambda m}. \tag{41}$$
2. If ‖θ‖ ≤ B for all θ ∈ Θ and $\lambda = \sqrt{2\rho^2/(B^2 m)}$, then
$$\mathbb{E}\big(\mathcal{R}_L(h_{S,L}^{RRM})\big) \le \min_{\theta \in \Theta} \mathcal{R}_L(h_\theta) + \rho B\sqrt{\frac{8}{m}}. \tag{42}$$
3. With probability $1 - \rho B\sqrt{8/(m e^2)}$ it holds that
$$\mathcal{R}_L(h_{S,L}^{RRM}) \le \min_{\theta \in \Theta} \mathcal{R}_L(h_\theta) + e. \tag{43}$$

Proof. We have from Proposition 9.1 that for every θ ∗ ∈ Θ


2
b S,L (h RRM ) + 2$
   
E R L (hS,L
RRM
) ≤E R S,L
λm
2
b S,L (hθ ∗ ) + λkθ ∗ k2 + 2$
 
≤E R
λm
2$ 2
= R L ( h θ ∗ ) + λ k θ ∗ k2 + .
λm
where the second inequality follows since the regularised empirical risk is larger than the empirical risk
RRM was the minimiser of the regularised risk. This yields (41).
and the fact that hS,L
2$2
For λ = 2ρ2 /( B2 m), we estimate λkθ ∗ k2 by B 2ρ2 /m and also have that λm = B 2ρ2 /m. Applied to
p p p

(41) this yields (42).


To prove (43), we observe that with (42)
  r
8
E RRM
R L (hS,L ) − min R L (hθ ) ≤ ρB
θ ∈Θ m
RRM ) − min
and R L (hS,L θ ∈Θ R L ( hθ ) ≥ 0. Hence, by Markov’s inequality

  r
8
P R L (hS,L
RRM
) − min R L (hθ ) > e ≤ ρB .
θ ∈Θ me2

Let us end this section by looking at the regularised risk minimisation applied to the regression problem
from above:
[134]: num_points = 15
lmbda = 0.0000001

x_dat = np.arange(0,1,1/num_points)
y_dat = np.sin(2*np.pi*(x_dat + 0.1)) # this data should be very easy to fit.

hx_dat = np.zeros([num_points,10])
for k in range(10):
    hx_dat[:, k] = np.sin(2*(k+1)*np.pi*(x_dat + 0.1))

a = np.linalg.lstsq(hx_dat, y_dat, rcond=-1)[0]

plt.figure(figsize = (15, 5))


plt.subplot(1,3,1)
plt.plot(x, np.dot(sines,a))
plt.scatter(x_dat, y_dat, c = 'r')

# We change one value by something.


y_dat[4] = y_dat[4] - 1e-2
a = np.linalg.lstsq(np.dot(hx_dat.T,hx_dat) + lmbda*np.identity(10), np.dot(hx_dat.T,y_dat), rcond=-1)[0]
plt.subplot(1,3,2)
plt.plot(x, np.dot(sines,a))
plt.scatter(x_dat, y_dat, c = 'r')
plt.scatter(x_dat[4], y_dat[4], c = 'g',s=80)

# We change one value by ten times something.


y_dat[4] = y_dat[4] - 1e-1
a = np.linalg.lstsq(np.dot(hx_dat.T,hx_dat) + lmbda*np.identity(10), np.dot(hx_dat.T,y_dat), rcond=-1)[0]
plt.subplot(1,3,3)
plt.plot(x, np.dot(sines,a))
plt.scatter(x_dat, y_dat, c = 'r')
plt.scatter(x_dat[4], y_dat[4], c = 'g',s=80)

[134]: <matplotlib.collections.PathCollection at 0x7f62c2c3e668>

10 Lecture 10 - Freezing Fritz
Freezing Fritz is a pretty cool guy. He has one problem, though. In his house, it is quite often too cold or too hot during the night. Then he has to get up and open or close his windows or turn the heat up or down. Needless to say, he would like to avoid this.
However, his flat has three doors that he can keep open or closed, it has four radiators, and four windows.
There is a picture of his home in Figure 12. It seems like there are endless possibilities of prepping the flat
for whatever temperature the night will have.
Fritz, does not want to play his luck any longer and decided to get active. He recorded the temperature
outside and inside of his bedroom for the last two years. Now he would like to find a prediction that,
given the outside temperature, as well as a certain configuration of his flat, tells him how cold or warm
his bedroom will become.
Can you help Freezing Fritz find blissful sleep?

Figure 12: The home of Freezing Fritz. Here we see him lying in his bed. It is too cold. There are four radiators, labelled R1-R4, four windows, labelled W1-W4, and three doors, labelled D1-D3. Fritz also owns three plants. It is unclear if they have anything to do with the heat distribution in this place, though.

Let us first look at the situation. Below we see the experiment that Fritz carried out in 8 cases.

[1]: import numpy as np


import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import ion
from scipy.signal import convolve2d
import pandas as pd
import seaborn as sn
%matplotlib inline

[4]: letItFlow([1,1, 1,1], [0,0,0,0], [1,1,1], 0, report = True)
letItFlow([1,1, 1,1], [5,0,0,0], [0,0,1], 0, report = True)
letItFlow([0,0, 0,0], [5,5,5,5], [1,1,1], 0, report = True)
letItFlow([0,0, 0,0], [0,0,0,0], [1,1,1], 0, report = True)
letItFlow([1,1, 1,1], [0,0,0,0], [1,1,0], 10, report = True)
letItFlow([1,1, 0,0], [5,5,0,5], [1,1,0], 10, report = True)
letItFlow([0,1, 0,1], [3,5,2,1], [1,0,1], 10, report = True)
letItFlow([0,0, 0,0], [5,5,5,5], [1,1,1], 20, report = True)

/usr/lib/python3/dist-packages/ipykernel_launcher.py:224: RuntimeWarning: divide


by zero encountered in log

Temperature outside: 0°C


Window one open: True, window two open: True, window three open: True, window four open: True
Door one open: True, door two open: True, door three open: True
Heater one level: 0, heater two level: 0, heater three level: 0, heater four level: 0
Temperature in bed: 4.5°C

Temperature outside: 0°C


Window one open: True, window two open: True, window three open: True, window four open: True
Door one open: False, door two open: False, door three open: True
Heater one level: 5, heater two level: 0, heater three level: 0, heater four level: 0
Temperature in bed: 13.8°C

Temperature outside: 0°C
Window one open: False, window two open: False, window three open: False, window four open: False
Door one open: True, door two open: True, door three open: True
Heater one level: 5, heater two level: 5, heater three level: 5, heater four level: 5
Temperature in bed: 20.4°C

Temperature outside: 0°C


Window one open: False, window two open: False, window three open: False, window four open: False
Door one open: True, door two open: True, door three open: True
Heater one level: 0, heater two level: 0, heater three level: 0, heater four level: 0
Temperature in bed: 10.8°C

Temperature outside: 10°C


Window one open: True, window two open: True, window three open: True, window four open: True
Door one open: True, door two open: True, door three open: False
Heater one level: 0, heater two level: 0, heater three level: 0, heater four level: 0
Temperature in bed: 13.9°C

Temperature outside: 10°C
Window one open: True, window two open: True, window three open: False, window four open: False
Door one open: True, door two open: True, door three open: False
Heater one level: 5, heater two level: 5, heater three level: 0, heater four level: 5
Temperature in bed: 19.0°C

Temperature outside: 10°C


Window one open: False, window two open: True, window three open: False, window four open: True
Door one open: True, door two open: False, door three open: True
Heater one level: 3, heater two level: 5, heater three level: 2, heater four level: 1
Temperature in bed: 18.2°C

Temperature outside: 20°C


Window one open: False, window two open: False, window three open: False, window four open: False
Door one open: True, door two open: True, door three open: True
Heater one level: 5, heater two level: 5, heater three level: 5, heater four level: 5
Temperature in bed: 28.6°C

[4]:
For the experiment, we first load the data from the data sets ’data_train_Temperature.csv’ and
’data_test_Temperature.csv’ that will be supplied on moodle.

[8]: data_train_Temperature = pd.read_csv('data_train_Temperature.csv')
data_test_Temperature = pd.read_csv('data_test_Temperature.csv')
data_train_Temperature.head()

[8]: Window 1 Window 2 Window 3 Window 4 Heat Control 1 Heat Control 2 \


0 0.0 0.0 1.0 1.0 5.0 2.0
1 0.0 1.0 1.0 0.0 3.0 3.0
2 0.0 0.0 0.0 1.0 5.0 2.0
3 1.0 1.0 1.0 1.0 1.0 1.0
4 0.0 0.0 1.0 1.0 1.0 0.0

Heat Control 3 Heat Control 4 Door 1 Door 2 Door 3 \


0 3.0 3.0 1.0 0.0 1.0
1 4.0 0.0 1.0 0.0 1.0
2 2.0 4.0 0.0 0.0 0.0
3 0.0 0.0 1.0 0.0 0.0
4 2.0 2.0 1.0 0.0 0.0

Temperature Outside Temperature Bed


0 -3.321343 13.728095
1 -4.474207 15.266521
2 -1.854384 23.918517
3 14.983739 19.458973
4 12.165330 19.635414

Let us look at this closely

[9]: data_train_Temperature.describe()

[9]: Window 1 Window 2 Window 3 Window 4 Heat Control 1 \


count 730.000000 730.000000 730.000000 730.000000 730.000000
mean 0.502740 0.500000 0.509589 0.506849 2.500000
std 0.500335 0.500343 0.500251 0.500296 1.683658
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 1.000000
50% 1.000000 0.500000 1.000000 1.000000 2.000000
75% 1.000000 1.000000 1.000000 1.000000 4.000000
max 1.000000 1.000000 1.000000 1.000000 5.000000

Heat Control 2 Heat Control 3 Heat Control 4 Door 1 Door 2 \


count 730.000000 730.000000 730.000000 730.000000 730.000000
mean 2.486301 2.420548 2.515068 0.506849 0.495890
std 1.703850 1.703660 1.692529 0.500296 0.500326
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 1.000000 1.000000 0.000000 0.000000
50% 3.000000 2.000000 3.000000 1.000000 0.000000
75% 4.000000 4.000000 4.000000 1.000000 1.000000
max 5.000000 5.000000 5.000000 1.000000 1.000000

Door 3 Temperature Outside Temperature Bed
count 730.000000 730.000000 730.000000
mean 0.480822 7.837429 19.530556
std 0.499975 7.788304 3.867791
min 0.000000 -4.998988 5.869975
25% 0.000000 1.039908 16.772720
50% 0.000000 7.470895 20.015297
75% 1.000000 14.628865 22.617748
max 1.000000 21.988839 28.606276

We use the correlation matrix again to see how each of the parameters of the problem affect the tempera-
ture in the bedroom. We also look at how the trade-off between outside and inside temperature is affected
by some of the parameters.

[10]: corrMatrix = data_train_Temperature.corr()


plt.figure(figsize = (12,12))
palette = sn.diverging_palette(20, 220, n=256)
sn.heatmap(corrMatrix, annot=False, cmap = palette, vmin = -1, vmax = 1)
plt.show()

plt.figure(figsize = (12,12))
sn.pairplot(data_train_Temperature, vars = ['Temperature Outside', 'Temperature Bed'], kind = 'scatter', hue='Window 1')
sn.pairplot(data_train_Temperature, vars = ['Temperature Outside', 'Temperature Bed'], kind = 'scatter', hue='Heat Control 1')
sn.pairplot(data_train_Temperature, vars = ['Temperature Outside', 'Temperature Bed'], kind = 'scatter', hue='Door 1')

plt.show()

My idea is to interpolate over the data but weight it according to my observations and domain knowledge. So I give low weights to windows 2 and 3, and the same for door 3. I also think that the doors are more important for the overall value than the individual heaters. My predictor is a simple weighted interpolation over 80% of the training set. I validate on the remaining 20% of the training set.

[32]: observations = data_train_Temperature.shape[0]

simple_data_set = data_train_Temperature.copy().drop(range(int(0.2*observations)), axis = 0)

def predict(data):
    simple_test_set = data.copy()

    # some parameters are more important to fit than others (in our case, window 1 and doors 1 and 2)
    weights = np.ones(12)
    weights[0] = 10   # Window 1
    weights[1] = 0.1  # Window 2
    weights[2] = 1    # Window 3
    weights[3] = 10   # Window 4
    weights[4] = 1    # Heat 1
    weights[5] = 1    # Heat 2
    weights[6] = 1    # Heat 3
    weights[7] = 1    # Heat 4
    weights[8] = 10   # Door 1
    weights[9] = 10   # Door 2
    weights[10] = 1   # Door 3
    weights[11] = 1   # Temp Out

    for k in range(simple_test_set.shape[0]):
        value = 0
        totaldist = 0
        for j in range(simple_data_set.values.shape[0]):
            # inverse-distance weighting with the feature weights above
            dist = np.linalg.norm(weights*simple_data_set.values[j,:-1] - weights*simple_test_set.values[k,:-1])
            value = value + simple_data_set.values[j,-1]/dist**4
            totaldist = totaldist + 1/dist**4
        simple_test_set.values[k, -1] = value/totaldist

    return simple_test_set

Now we compute the error on the validation set:

[33]: validation_set = data_train_Temperature.copy().drop(range(int(0.2*observations), observations), axis = 0)


res = predict(validation_set)
# root mean square error:
np.linalg.norm(res.values[:,-1] - validation_set.values[:,-1])/np.sqrt(validation_set.shape[0])

[33]: 1.882422855960373

The algorithm was not very sophisticated. Nonetheless I came within an accuracy of 2 degrees on the
validation set. For me this seems acceptable. Maybe you can help Freezing Fritz even more?
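One possible direction (only a suggestion added here, not part of the original solution) is to compare against an ordinary least squares baseline on the twelve features, which fits the regression framework of Lecture 9; it reuses the variables simple_data_set and validation_set defined above.

train_vals = simple_data_set.values
val_vals = validation_set.values

A = np.hstack([train_vals[:, :-1], np.ones((train_vals.shape[0], 1))])   # features plus a constant column
coeffs = np.linalg.lstsq(A, train_vals[:, -1], rcond=None)[0]

A_val = np.hstack([val_vals[:, :-1], np.ones((val_vals.shape[0], 1))])
pred = A_val @ coeffs
print(np.linalg.norm(pred - val_vals[:, -1]) / np.sqrt(val_vals.shape[0]))  # RMSE on the validation set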
I will just store my prediction on the test set now:

[34]: #make prediction


prediction = predict(data_test_Temperature)
predicted_Temperatures = prediction.values[:,-1]

np.savetxt('PhilippPetersens_Temperature_prediction.csv', predicted_Temperatures, delimiter=',')

Please send your result via email to [email protected]. Your email should include the names
of all people who worked on your code, their student identification numbers, a name for your team, and
the code used. It should also contain one or two paragraphs of a short description of the method you
used.

11 Lecture 11 - Support Vector Machines I


11.1 Definition of the support vector machine algorithm
For d ∈ N, we consider the hypothesis class of (affine) linear classifiers on R^d:
$$\mathcal{H} = \{ x \mapsto \operatorname{sign}(\langle w, x\rangle + b) : w \in \mathbb{R}^d,\ b \in \mathbb{R} \}.$$

This hypothesis class corresponds to a binary classification problem with Y = {−1, 1}. We say that a sample S = (x_i, y_i)_{i=1}^m is linearly separable if there exist w ∈ R^d \ {0} and b ∈ R such that
$$y_i = \operatorname{sign}(\langle w, x_i\rangle + b) \quad \text{for all } i \in [m],$$
or equivalently
$$y_i(\langle w, x_i\rangle + b) \ge 0 \quad \text{for all } i \in [m].$$
Practically, this means that there exists a hyperplane in Rd splitting Rd into two parts, where one contains
all samples labelled −1 and the other contains all points labelled 1. An illustration is given in Figure 13.

Figure 13: Left: Linearly separable data set. Right: Hyperplane that maximises the margin.

Definition 11.1. Let d ∈ N and let h be a linear classifier. We define the geometric margin ρ_h(z) of h at a point z as the Euclidean distance of z to the hyperplane given by {h = 0}. For a sample S ∈ (R^d × {−1, 1})^m, we define the geometric margin of h as
$$\rho_h = \min_{i \in [m]} \rho_h(x_i).$$
If h(x) = sign(⟨w, x⟩ + b), then, for z ∈ R^d,
$$\rho_h(z) = \frac{|\langle w, z\rangle + b|}{\|w\|_2}. \tag{44}$$

Equation (44) is easily verified by the following argument. Let H' := {h = 0}. For x_1, x_2 ∈ H' we have that ⟨w/‖w‖_2, x_1⟩ + b/‖w‖_2 = 0 = ⟨w/‖w‖_2, x_2⟩ + b/‖w‖_2 and hence
$$\langle w/\|w\|_2,\ x_1 - x_2\rangle = 0.$$
Hence, w/‖w‖_2 is a normal on H'. We conclude that
$$\min_{x \in H'} \|z - x\|_2 = \min_{x \in H'} |\langle z, w/\|w\|_2\rangle - \langle x, w/\|w\|_2\rangle| = \min_{x \in H'} |\langle z, w/\|w\|_2\rangle + b/\|w\|_2| = \frac{|\langle w, z\rangle + b|}{\|w\|_2}.$$

The support vector machine algorithm returns the hyperplane classifier that maximises the geometric
margin.

Definition 11.2. For d ∈ N, the algorithm A_SVM takes a sample S ∈ (R^d × {−1, 1})^m and outputs a linear classifier h_SVM = A_SVM(S) such that
$$\rho_{h_{SVM}} = \max_{h \in \mathcal{H}} \rho_h.$$

11.2 Primal optimisation problem
We are interested in finding the hyperplane with the maximum geometric margin. By (44), this geometric margin is given by
$$\rho = \max_{w,b :\ y_i(\langle w, x_i\rangle + b) \ge 0}\ \min_{i \in [m]} \frac{|\langle w, x_i\rangle + b|}{\|w\|}.$$
Since we assume the sample S to be separable, it holds that
$$\max_{w,b :\ y_i(\langle w, x_i\rangle + b) \ge 0}\ \min_{i \in [m]} \frac{|\langle w, x_i\rangle + b|}{\|w\|} = \max_{w,b}\ \min_{i \in [m]} \frac{y_i(\langle w, x_i\rangle + b)}{\|w\|} = \max_{\min_{i \in [m]} y_i(\langle w, x_i\rangle + b) = 1} \frac{1}{\|w\|}, \tag{45}$$
where the last equality follows by observing that rescaling (w, b) by any scalar does not affect the value of the fraction.
Since increasing ‖w‖ will decrease the value in (45), we conclude that
$$\rho = \max_{\min_{i \in [m]} y_i(\langle w, x_i\rangle + b) \ge 1} \frac{1}{\|w\|}. \tag{46}$$
Instead of maximising 1/‖w‖ we can minimise ½‖w‖², and we therefore end up with the optimisation problem
$$\min_{w,b}\ \frac{1}{2}\|w\|^2, \tag{47}$$
$$\text{subject to: } y_i(\langle w, x_i\rangle + b) \ge 1. \tag{48}$$
This optimisation problem is strictly convex and the constraints are affine linear. Hence, there exists a unique solution and efficient solvers to find it.
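As a small sketch (added for illustration, assuming scikit-learn is available, and not part of the original notes), one can approximate the hard-margin problem (47)-(48) with scikit-learn's soft-margin solver by choosing a very large penalty parameter C; the synthetic data below is linearly separable.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel='linear', C=1e10).fit(X, y)   # huge C makes slack essentially forbidden
w, b = clf.coef_[0], clf.intercept_[0]
print('geometric margin:', 1 / np.linalg.norm(w))
print('support vectors:', clf.support_vectors_)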

11.3 Support vectors


Now we would like to find a different representation of the solution of the support vector machine algorithm. Therefore, we first need to recall a central result from optimisation theory.
Theorem 11.1 (Karush-Kuhn-Tucker's theorem (simplified)). Assume that f : X → R is convex and differentiable and, for i ∈ [m], let g_i : X → R be affine linear. Then x is a solution of the optimisation problem
$$\min_x f(x) \quad \text{subject to: } g_i(x) \le 0,$$
if and only if there exists α ∈ (R_+)^m such that
$$\nabla_x L(x, \alpha) = 0, \tag{49}$$
$$(\nabla_\alpha L(x, \alpha))_i = g_i(x) \le 0 \quad \text{for all } i \in [m], \tag{50}$$
$$\sum_{i=1}^{m} \alpha_i g_i(x) = 0, \tag{51}$$
where
$$L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) \quad \text{for } x \in X,\ \lambda = (\lambda_1, \dots, \lambda_m) \in (\mathbb{R}_+)^m.$$

Note that (50) and (51) are equivalent to
$$g_i(x) \le 0\ \wedge\ \big(\forall i \in [m]:\ \alpha_i g_i(x) = 0\big). \tag{52}$$
Applying Theorem 11.1 to (47)-(48) by defining
$$L(x, \lambda) = L(w, b, \lambda) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m} \lambda_i\big[ y_i(\langle w, x_i\rangle + b) - 1 \big]$$
yields that, for an α ∈ (R_+)^m,
$$w = -\sum_{i=1}^{m} \alpha_i y_i x_i, \tag{53}$$
$$\sum_{i=1}^{m} \alpha_i y_i = 0,$$
$$\alpha_i\big[ y_i(\langle w, x_i\rangle + b) - 1 \big] = 0 \quad \forall i \in [m]. \tag{54}$$
By the first equality, we conclude that w is a linear combination of the x_i. Moreover, α_i ≠ 0 only if y_i(⟨w, x_i⟩ + b) = 1. We call the vectors with y_i(⟨w, x_i⟩ + b) = 1 the support vectors. By (45), these are the vectors whose distance to the hyperplane is equal to the geometric margin.
We notice that the solution of the support vector machine algorithm only depends on the support vectors. This property will be very important in the sequel.

11.4 Generalisation bounds


11.4.1 Leave-one-out analysis

Definition 11.3. Let A be a learning algorithm taking samples S ∈ (X × {−1, 1})^m and mapping them to hypotheses h : X → Y. The leave-one-out error of A on a sample S is defined by
$$\widehat{\mathcal{R}}_{LOO}(\mathcal{A}, S) := \frac{1}{m} \sum_{i=1}^{m} 1_{h_{S - \{s_i\}}(x_i) \neq y_i},$$
where S − {s_i} denotes the sample of size m − 1 which results from S = (s_1, . . . , s_m) by removing s_i, and $h_{S - \{s_i\}} = \mathcal{A}(S - \{s_i\})$.
In words, the leave-one-out error of a sample is the mean error committed when training on all elements of the sample except one and then observing the error on the left-out point.
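A direct implementation of Definition 11.3 (a sketch added here, not part of the original notes; the data and the use of scikit-learn's SVC are assumptions for illustration) looks as follows.

import numpy as np
from sklearn.svm import SVC

def loo_error(X, y, make_classifier):
    # empirical leave-one-out error: retrain with one point removed, test on that point
    mistakes = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        h = make_classifier().fit(X[mask], y[mask])
        mistakes += int(h.predict(X[i:i + 1])[0] != y[i])
    return mistakes / len(y)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (15, 2)), rng.normal(-2, 1, (15, 2))])
y = np.hstack([np.ones(15), -np.ones(15)])
print(loo_error(X, y, lambda: SVC(kernel='linear', C=1e6)))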
We have that the average leave-one-out error is an unbiased estimator of the risk.
Lemma 11.1. Let A be a learning algorithm taking samples S ∈ (X × {−1, 1})^m and mapping them to hypotheses h : X → Y. Let D be a distribution on X × {−1, 1}. Then it holds that
$$\mathbb{E}_{S \sim \mathcal{D}^m}\big( \widehat{\mathcal{R}}_{LOO}(\mathcal{A}, S) \big) = \mathbb{E}_{S \sim \mathcal{D}^{m-1}}\big( \mathcal{R}(\mathcal{A}(S)) \big).$$
Proof. From the definition of the leave-one-out error and the linearity of the expected value, we have that
$$\mathbb{E}_{S \sim \mathcal{D}^m}\big( \widehat{\mathcal{R}}_{LOO}(\mathcal{A}, S) \big) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{S \sim \mathcal{D}^m}\big( 1_{h_{S-\{s_i\}}(x_i) \neq y_i} \big) = \mathbb{E}_{S \sim \mathcal{D}^m}\big( 1_{h_{S-\{s_1\}}(x_1) \neq y_1} \big) = \mathbb{E}_{S \sim \mathcal{D}^{m-1}}\, \mathbb{E}_{s \sim \mathcal{D}}\big( 1_{h_S(x) \neq y} \big) = \mathbb{E}_{S \sim \mathcal{D}^{m-1}}\, \mathcal{R}(\mathcal{A}(S)).$$

Using this lemma, we can now obtain a first generalisation bound for the SVM algorithm.
Theorem 11.2. Let, for a linearly separable sample S ∈ (X × {−1, 1})^m, h_SVM = A_SVM(S) be the hypothesis returned by the SVM algorithm. Let N_SV(S) be the number of support vectors that define h_S. Then
$$\mathbb{E}_{S \sim \mathcal{D}^m}\big( \mathcal{R}(h_{SVM}) \big) \le \mathbb{E}_{S \sim \mathcal{D}^{m+1}}\left( \frac{N_{SV}(S)}{m+1} \right). \tag{55}$$
Proof. Let S = (x_i, y_i)_{i=1}^{m+1} be a linearly separable sample of size m + 1. We observe that if x_i is not a support vector for h_S, then by (53) and (54), h_S = h_{S−{s_i}}. Since S was linearly separable, we have that h_S(x_i) = y_i for all i ∈ [m + 1]. We conclude that
$$\widehat{\mathcal{R}}_{LOO}(\mathcal{A}_{SVM}, S) = \frac{1}{m+1} \sum_{i=1}^{m+1} 1_{h_{S-\{s_i\}}(x_i) \neq y_i} \le \frac{N_{SV}(S)}{m+1}.$$

The result now follows by Lemma 11.1.

12 Lecture 12 - Support Vector Machines II


12.1 Margin theory
Until now we have not seen any benefit from a large geometric margin in the classification by SVMs. Intu-
itively a large margin makes the classification simpler and should yield improved generalisation bounds.
Defying common sense, one typically starts by studying a different type of margin in a much more general
setting first, to ultimately understand the geometric margin in SVM classification. We shall do the same.
We will now move from classification by hyperplanes with the 0-1 loss, to classification classification with
affine-linear hypotheses and a classification confidence.
If h(x) = sign(⟨w, x⟩ + b), then
$$1_{h(x) \neq y_i} = 1 - 1_{(\langle w, x\rangle + b) y_i > 0} = 1 - 1_{h(x) y_i > 0}. \tag{56}$$
Now the right-hand side of (56) makes sense for all real-valued functions h. Then, one may interpret the classification with a general h : X → R so that the sign of h(x) corresponds to the predicted class and the magnitude of h(x) corresponds to the confidence of the classification.
From this observation, we can create a loss function that penalises not only wrong classifications but also those that do not have enough confidence. In words, for ρ_c > 0 we could define the hard margin loss as
$$L_{\mathrm{hard},\rho_c}(y, y') = 1 - 1_{y y' > \rho_c} = 1_{y y' \le \rho_c}. \tag{57}$$

For analysis purposes it is convenient to consider a continuous alternative of the hard margin loss, which
we shall define below:
Definition 12.1. Let ρ_c > 0. The ρ_c-margin loss is the function L_{ρ_c} : R × R → R_+ defined as L_{ρ_c}(y, y') = Φ_{ρ_c}(y y'), where y, y' ∈ R and, for x ∈ R,
$$\Phi_{\rho_c}(x) = \min\Big\{1,\ \max\Big\{1 - \frac{x}{\rho_c},\, 0\Big\}\Big\} = \begin{cases} 1 & \text{if } x \le 0, \\ 1 - x/\rho_c & \text{if } 0 \le x \le \rho_c, \\ 0 & \text{if } \rho_c \le x. \end{cases}$$

Figure 14: The functions x ↦ 1_{x ≤ ρ_c} (dotted, orange), Φ_{ρ_c}(x) (solid, green), and x ↦ 1_{x ≤ 0} (dashed, red).

Using the margin loss function, we can define the associated empirical risk function.
Definition 12.2. For a sample S = (x_i, y_i)_{i=1}^m, a margin ρ_c > 0, and h : X → R, we define the empirical margin loss as
$$\widehat{\mathcal{R}}_{S,\rho_c}(h) := \frac{1}{m} \sum_{i=1}^{m} \Phi_{\rho_c}\big(y_i h(x_i)\big).$$
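A short numpy sketch of the margin loss and the empirical margin loss (added for illustration, not part of the original notes; the values of h and the labels are made up):

import numpy as np

def margin_loss(margins, rho_c):
    # Phi_{rho_c} applied to the confidences y_i * h(x_i)
    return np.clip(1.0 - margins / rho_c, 0.0, 1.0)

def empirical_margin_loss(h_values, y, rho_c):
    # empirical margin loss of Definition 12.2
    return float(np.mean(margin_loss(y * h_values, rho_c)))

y = np.array([1, -1, 1, 1, -1])
h_values = np.array([2.0, -0.5, 0.1, -0.3, 0.2])   # real-valued outputs h(x_i)
print(empirical_margin_loss(h_values, y, rho_c=1.0))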

Note that, for all samples S, hypotheses h : X → R, and margins ρ_c > 0,
$$\frac{1}{m} \sum_{i=1}^{m} 1_{y_i h(x_i) \le 0} \le \widehat{\mathcal{R}}_{S,\rho_c}(h) \le \frac{1}{m} \sum_{i=1}^{m} 1_{y_i h(x_i) \le \rho_c}; \tag{58}$$
see also Figure 14. As mentioned before, the reason to introduce the margin loss function is that it is continuous; more precisely, that it is Lipschitz continuous with Lipschitz constant 1/ρ_c. The reason why this is beneficial will become clear from the result below.
Theorem 12.1 (Talagrand's Lemma). Let Φ : R → R be C-Lipschitz for C > 0. Then, for every hypothesis set H of real-valued functions and every sample S, it holds that
$$\widehat{\mathfrak{R}}_S(\Phi \circ \mathcal{H}) \le C\, \widehat{\mathfrak{R}}_S(\mathcal{H}),$$
where Φ ∘ H := {Φ ∘ h : h ∈ H}.
Now we obtain a first margin-based generalisation bound.
Theorem 12.2. Let H be a set of real-valued functions. Let D be a distribution on X × {−1, 1}. For ρ_c > 0 and every δ > 0, with probability at least 1 − δ over a sample S ∼ D^m, it holds for all h ∈ H that
$$\mathcal{R}(h) \le \widehat{\mathcal{R}}_{S,\rho_c}(h) + \frac{2}{\rho_c}\, \mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}}, \tag{59}$$
$$\mathcal{R}(h) \le \widehat{\mathcal{R}}_{S,\rho_c}(h) + \frac{2}{\rho_c}\, \widehat{\mathfrak{R}}_S(\mathcal{H}) + 3\sqrt{\frac{\log(2/\delta)}{2m}}, \tag{60}$$
where R(h) = E_{(x,y)∼D}(1_{y h(x) ≤ 0}).

Proof. Let H
e := {(z = ( x, y) 7→ yh( x ) : h ∈ H)}. Next we set H b := Φρ ◦ H e and observe that H
b contains
c
functions mapping from X × {−1, 1} to [0, 1]. Thus, we can apply Theorem 3.1, which yields that with
probability 1 − δ it holds for all g ∈ Hb:
r
1 m log(1/δ)
E( g ) ≤ ∑
m i =1
g(zi ) + 2Rm (H) +
b
2m
. (61)

Therefore, for every h ∈ H
r
log(1/δ)
E(x,y)∼D (Φρc (yh( x ))) ≤ R
b S,ρ (h) + 2Rm (H)
c
b + . (62)
2m
We have that for all x ∈ R, 1x≤0 ≤ Φρc ( x ) and hence

R(h) = E(x,y)∼D (1y6=h(x) ) = E(x,y)∼D (1yh(x)≤0 ) ≤ E(x,y)∼D (Φρc (yh( x ))).

We conclude that
r
log(1/δ)
R(h) ≤ R
b S,ρ (h) + 2Rm (H)
c
b + .
2m
Next, we apply Theorem 12.1 which yields that
r
b S,ρ (h) + 2 Rm (H) log(1/δ)
R(h) ≤ R c
e + ,
ρ 2m

since Φρc is 1/ρc -Lipschitz.


Finally, we compute
! !
m m
e = 1 ES Eσ
Rm (H) sup ∑ σi yi h( xi )
1
= ES Eσ sup ∑ σi h( xi ) = Rm (H).
m h∈H i =1 m h∈H i =1

This yields (59). We can show (60) by following the same steps as above but estimating with the empirical
Rademacher complexity in (61).

As outlined at the beginning of this section, we consider the hypothesis set H of affine linear functions. Let us compute the empirical Rademacher complexity of this set.
Theorem 12.3. Let X be an inner product space and S ⊂ B_r(0) be a sample of size m. Further, let H := {x ↦ ⟨w, x⟩ + b : ‖w‖ ≤ Λ, |b| ≤ s}. Then
$$\widehat{\mathfrak{R}}_S(\mathcal{H}) \le (s + r\Lambda)\sqrt{\frac{2}{m}}.$$

Proof. We start by applying the definition of the empirical Rademacher complexity:


" #
m
1
Rb S (H) = Eσ
m
sup ∑ σi (hw, xi i + b)
kwk≤Λ,|b|≤Λ i =1
" * + #
m m
1
= Eσ sup w, ∑ σi xi + ∑ σi b
m kwk≤Λ,|b|≤Λ i =1 i =1

Λ m
s m

m
Eσ ∑ σi xi +
m
Eσ ∑ σi .
i =1 i =1

We observe that by Jensens inequality and the independence of the σi


!2
m m m
Eσ k ∑ σi k ≤ Eσ k ∑ σi k2 = ∑ Eσ σi2 = m.
i =1 i =1 i =1

Moreover again by Jensens inequality and the independence of the σi , we conclude that
" #!2 2
m m m
Eσ ∑ σi xi ≤ Eσ ∑ σi xi = ∑ k xi k2 ≤ mr2 .
i =1 i =1 i =1

We conclude that

Λ m
s m
Λr + s
m
Eσ ∑ σi xi + m
Eσ ∑ σi ≤ √ .
m
i =1 i =1

Now we can combine Theorems 12.3 and 12.2 to obtain the following generalisation bound for affine linear classifiers.
Corollary 12.1. Let H = {x ↦ ⟨w, x⟩ + b : ‖w‖ ≤ Λ, |b| ≤ s} for Λ, s > 0. Assume further that X is a subset of an inner product space and all elements of X have norm bounded by r. Let D be a distribution on X × {−1, 1}. Then, for ρ_c > 0 and for all δ > 0, it holds with probability 1 − δ over a sample S ∼ D^m that
$$\mathcal{R}(h) \le \widehat{\mathcal{R}}_{S,\rho_c}(h) + 2\sqrt{\frac{(s + r\Lambda)^2/\rho_c^2}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}},$$
for all h ∈ H.
Now let S ⊂ B_r(0) be a linearly separable sample with geometric margin ρ and let h_S = A_SVM(S). Since S is linearly separable, we can assume that the separating hyperplane passes through B_r(0). Let z ∈ B_r(0) be such that h_S(z) = 0, i.e., z lies on the hyperplane, so that b = −⟨w, z⟩.
We have that
$$h_S(x) = \operatorname{sign}(\langle w, x\rangle + b) = \operatorname{sign}(\langle w, x - z\rangle) = h_{S'}(x - z),$$
where h_{S'} = A_SVM(S') and S' = (x_i', y_i)_{i=1}^m = (x_i − z, y_i)_{i=1}^m results from shifting S by z. Since S was linearly separable with geometric margin ρ, we have by (46) that ‖w‖ = 1/ρ. Furthermore, S' ⊂ B_{2r}(0).
Moreover, by the definition of the geometric margin, we have that
$$\langle w, x_i'\rangle\, y_i \ge 1$$
for all i ∈ [m], and hence Φ_1(⟨w, x_i'⟩ y_i) = 0 for all i ∈ [m]. Corollary 12.1 therefore implies that, with probability 1 − δ over the choice of S, we have that
$$\mathcal{R}_{\mathcal{D}}(h_S) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big(1_{h_{S'}(x - z)\, y \le 0}\big) \le 2\sqrt{\frac{(2r)^2}{m\rho^2}} + \sqrt{\frac{\log(1/\delta)}{2m}}, \tag{63}$$
where we define the right-hand side as ∞ if the sample is not linearly separable. Let us note this as a final corollary.
corollary.
Corollary 12.2. For all distributions D on X × {−1, 1}, where X is contained in a ball of radius r in a inner
product space, it holds that for all δ > 0 with probability 1 − δ:
R_D(h_S) ≤ 4 √( r²/(m ρ²) ) + √( log(1/δ)/(2m) ),

where ρ is the geometric margin of S.

The estimate of Corollary 12.2 is remarkable in the sense that it does not seem to depend in any way on
X . We have seen earlier that the VC dimension of linear classifiers depends on the dimension d. Hence,
the VC-dimension-based generalisation bound of Corollary 16.1 would not be dimension independent.
Note though, that we have seen in Theorem 6.1 that the VC dimension based bounds are optimal in the
sense that for a carefully designed distribution we cannot significantly outperform the upper bounds. We
know now that these bad distributions cannot be such that they only generate linearly separable samples
with large geometric margins.
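To get a feeling for the size of the bound of Corollary 12.2, the following small sketch (an illustration only, not part of the lecture's formal development) trains an approximately hard-margin linear SVM with scikit-learn, estimates the geometric margin ρ = 1/‖w‖ and the radius r, and evaluates the right-hand side of the bound. Using SVC with a very large penalty parameter C as a stand-in for the hard-margin SVM is an assumption of this sketch.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# linearly separable toy data with labels in {-1, 1}
X, y = make_blobs(n_samples=200, centers=2, cluster_std=0.8, random_state=0)
y = 2 * y - 1

# a very large C approximates the hard-margin SVM
model = SVC(kernel='linear', C=1e6).fit(X, y)

w = model.coef_.ravel()
rho = 1.0 / np.linalg.norm(w)           # geometric margin, rho = 1 / ||w||
r = np.max(np.linalg.norm(X, axis=1))   # radius of a ball containing the sample
m = X.shape[0]
delta = 0.05

# right-hand side of Corollary 12.2
bound = 4 * np.sqrt(r**2 / (m * rho**2)) + np.sqrt(np.log(1 / delta) / (2 * m))
print('geometric margin rho =', rho)
print('generalisation bound =', bound)
```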

12.2 Non-separable case


Occasionally, one encounters a classification problem that is not linearly separable. One such example is shown in Figure 15.

Figure 15: Data set which is not linearly separable.

In this case, no w, b exist such that

y_i(⟨w, x_i⟩ + b) ≥ 1   (64)

for all (x_i, y_i)_{i=1}^m. In particular, we cannot define a margin as in (46). Here one typically resorts to a relaxed notion of separating hyperplane by introducing slack variables. Instead of (64), we require that for slack variables (ξ_i)_{i=1}^m it holds that

y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i.   (65)

Again, we would like (65) to hold with a large margin ρ = 1/‖w‖, which in this case we call the soft margin. In addition, we now want the slack variables to be as small as possible. In total, we end up with the optimisation problem:

min_{w, b, ξ} (1/2)‖w‖² + ∑_{i=1}^m ξ_i^p
subject to y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0,

where p ≥ 1. Similarly to the separable case, one can show that the minimiser of the optimisation problem depends only on a few sample points. In this case, these points are the support vectors, i.e., those points for which equality holds in (65).
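As a brief illustration of the slack formulation (not part of the lecture's formal development), the following sketch uses scikit-learn's SVC, whose objective is (1/2)‖w‖² + C ∑_i ξ_i, i.e. the case p = 1 above with the slack term weighted by a parameter C; smaller C permits more slack and hence a larger soft margin. The toy data set is an arbitrary choice.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# overlapping classes: not linearly separable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

# In scikit-learn's SVC the objective is (1/2)||w||^2 + C * sum_i xi_i  (p = 1),
# so C controls how strongly the slack variables are penalised.
for C in [0.01, 1, 100]:
    model = SVC(kernel='linear', C=C).fit(X, y)
    w = model.coef_.ravel()
    margin = 1.0 / np.linalg.norm(w)   # soft margin rho = 1 / ||w||
    print(f'C = {C:>6}: soft margin = {margin:.3f}, '
          f'#support vectors = {len(model.support_)}')
```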

13 Lecture 13 - Freezing Fritz - Discussion

14 Lecture 14 - Kernel methods


Consider the classification problem of Figure 16 below. We see that by tactically increasing the dimen-
sion, or more generally speaking by mapping the data to a higher dimensional space, we can sometimes
simplify a problem significantly.

Figure 16: Left: Original, not linearly separable problem for a sample S = (x_i, y_i)_{i=1}^{31}. Right: The sample ((x_i, 1 − 5|x_i|), y_i)_{i=1}^{31}. This seems to be separable.

14.1 The kernel trick and kernel SVM


If we would like to make use of the linear separability after the transformation, then we could use a
support vector machine for classification. The overall approach would then look like this:
• Given a domain X we choose a map ψ : X → Z , where Z is some space.
• For a given sample S ∈ (X × Y )m we compute the image sample Se = (ψ( xi ), yi )im=1 .

• Apply the SVM algorithm to Se which yields a linear hypothesis h on Z .


• h ◦ ψ is then a classifier for the original problem.
For this procedure to make sense, Z needs to be such that we can perform the SVM learning algorithm. This requires Z to be a space with an inner product. In the sequel, we will therefore assume Z to be a Hilbert space.
It appears to be computationally suboptimal to first embed the data into a high dimensional or even infi-
nite dimensional space and then take high, or even infinite dimensional scalar products. If our algorithms
only depend on scalar products between the samples, then it may make sense to use a so-called kernel
instead of these scalar products.
Definition 14.1. Let X be a set. Then a function K : X × X → R is called a kernel over X .

For a given transformation ψ as above, K_ψ(x, x') = ⟨ψ(x), ψ(x')⟩_Z is a kernel. Often one can, however, also start with a kernel for which a transform then exists.
Theorem 14.1 (Mercer's condition). Let X ⊂ R^n be compact and let K : X × X → R be a continuous and symmetric function. Then, there exist φ_n : X → R and a_n > 0 such that

K(x, x') = ∑_{n=0}^∞ a_n φ_n(x) φ_n(x'),   (66)

if and only if for all g ∈ L²(X) the following holds:

∫_X ∫_X g(x) K(x, x') g(x') dx dx' ≥ 0.

Theorem 14.1 shows that we do not need to find a specific ψ and an associated Hilbert space as long as we are content with the existence of these objects. We can design a kernel instead. This kernel should be designed such that K(x, x') is large if x and x' should be classified similarly.
Having a kernel that can be efficiently computed saves us from computing scalar products between high-
dimensional feature embeddings of values x, x 0 ∈ X . Nonetheless, to apply the SVM algorithm we need
to compute inner products between ψ( x ) and a vector w in the Hilbert space. To compute these scalar
products efficiently, we make use of the representer theorem.
Consider the following optimisation problem:

min_w f(⟨w, ψ(x_1)⟩, …, ⟨w, ψ(x_m)⟩) + R(‖w‖),   (67)

where f is an arbitrary function from R^m to R and R is a nondecreasing function from R_+ to R. We observe that the SVM algorithm on the feature space is a special case of such an algorithm, when setting R(‖w‖) = (1/2)‖w‖² and

f(a_1, …, a_m) = 1

if there exists a b ∈ R such that

y_i(a_i + b) ≥ 1

for all i ∈ [m], and f(a_1, …, a_m) = ∞ otherwise.
Theorem 14.2. Assume ψ : X → Z, where Z is a Hilbert space. Assume further that (67) has a solution. Then there exists a vector α ∈ R^m such that w = ∑_{i=1}^m α_i ψ(x_i) is a solution of (67).

Proof. Let w* ∈ Z be a solution of (67). Then, we can write

w* = w̄ + u,

where w̄ ∈ span{ψ(x_i) : i ∈ [m]} and u ⊥ span{ψ(x_i) : i ∈ [m]}. By the Pythagorean theorem it holds that

‖w*‖² = ‖w̄‖² + ‖u‖² ≥ ‖w̄‖².

Therefore, R(‖w*‖) ≥ R(‖w̄‖). Also,

f(⟨w̄, ψ(x_1)⟩, …, ⟨w̄, ψ(x_m)⟩) = f(⟨w*, ψ(x_1)⟩, …, ⟨w*, ψ(x_m)⟩),   (68)

by construction. Hence, if there exists an optimal solution of (67) in Z, then there also exists one in span{ψ(x_i) : i ∈ [m]}.

Based on Theorem 14.2, we can now compute, for w = ∑_{i=1}^m α_i ψ(x_i),

⟨w, ψ(x_i)⟩ = ∑_{j=1}^m α_j ⟨ψ(x_j), ψ(x_i)⟩,   (69)

‖w‖² = ⟨w, w⟩ = ∑_{j=1}^m ∑_{i=1}^m α_j α_i ⟨ψ(x_j), ψ(x_i)⟩.   (70)

If now K(x, x') = ⟨ψ(x), ψ(x')⟩, then the SVM problem on Z corresponds to the minimisation of

min_{α ∈ R^m, b ∈ R} (1/2) α^T G α
subject to y_i((Gα)_i + b) ≥ 1,

where G_{i,j} = K(x_i, x_j) is the so-called Gram matrix of the kernel. The resulting classifier predicts for a new sample x:

sign(⟨w, ψ(x)⟩ + b) = sign( ∑_{i=1}^m α_i ⟨ψ(x_i), ψ(x)⟩ + b ) = sign( ∑_{i=1}^m α_i K(x_i, x) + b ).

This learning algorithm is called kernel SVM.
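The following sketch (an illustration, not part of the lecture) makes the role of the Gram matrix explicit: it assembles G for a Gaussian kernel by hand and hands it to scikit-learn's SVC with kernel='precomputed', so the classifier only ever sees kernel evaluations K(x_i, x_j), never the feature map ψ. The kernel width and the moons data set are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

def rbf_gram(A, B, sigma=0.5):
    """Gram matrix G[i, j] = K(A[i], B[j]) for the Gaussian kernel."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T)
    return np.exp(-sq_dists / (2 * sigma**2))

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# the classifier is trained on the Gram matrix of the training sample ...
G_train = rbf_gram(X_train, X_train)
model = SVC(kernel='precomputed', C=10).fit(G_train, y_train)

# ... and prediction only needs K(x, x_i) for the training points x_i
G_test = rbf_gram(X_test, X_train)
print('test accuracy:', model.score(G_test, y_test))
```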

14.2 Learning guarantees


To obtain learning guarantees of the kernel SVM algorithm, we first compute the Rademacher complexity
of the associated hypothesis class.
Proposition 14.1. Let K : X × X → R be a kernel such that for a function ψ : X → Z , where Z is a Hilbert
space, it holds that K( x, x 0 ) = hψ( x ), ψ( x 0 )i for all x, x 0 ∈ X . Let S = ( xi , yi )im=1 ∈ (X × {−1, 1})m be a sample
so that K( xi , xi ) ≤ r2 for a given r > 0 and all i ∈ [m].
We denote H := { x 7→ hw, ψ( x )i + b : kwkZ ≤ Λ, |b| ≤ s}. Then

R̂_S(H) ≤ (Λ/m) √(Tr(K)) + s/√m ≤ (rΛ + s)/√m,   (71)

where Tr (K) = ∑im=1 K( xi , xi ) is the trace of K.

Proof. The proof is very similar to that of Theorem 12.3, but we take the effect of the kernel into account.
We have by definition that

R̂_S(H) = (1/m) E_σ[ sup_{‖w‖≤Λ, |b|≤s} ⟨w, ∑_{i=1}^m σ_i ψ(x_i)⟩ + ∑_{i=1}^m σ_i b ]
       ≤ (Λ/m) E_σ‖∑_{i=1}^m σ_i ψ(x_i)‖ + (s/m) E_σ|∑_{i=1}^m σ_i|.

An application of Jensen's inequality and the independence of σ_i and σ_j for all i ≠ j yields

R̂_S(H) ≤ (Λ/m)( E_σ‖∑_{i=1}^m σ_i ψ(x_i)‖² )^{1/2} + (s/m)( E_σ|∑_{i=1}^m σ_i|² )^{1/2}
       ≤ (Λ/m)( ∑_{i=1}^m ‖ψ(x_i)‖² )^{1/2} + s/√m
       = (Λ/m)( ∑_{i=1}^m K(x_i, x_i) )^{1/2} + s/√m
       = (Λ/m) √(Tr(K)) + s/√m ≤ (rΛ + s)/√m.

We can apply Theorem 12.2 to Proposition 14.1 which yields the following generalisation bound for large
margin kernel SVM classifiers:
Theorem 14.3. Let K : X × X → R be a kernel such that for a function ψ : X → Z, where Z is a Hilbert space, it holds that K(x, x') = ⟨ψ(x), ψ(x')⟩ for all x, x' ∈ X. Let r² := sup_{x∈X} K(x, x). We denote H := { x ↦ ⟨w, ψ(x)⟩ + b : ‖w‖_Z ≤ Λ, |b| ≤ s }. Let D be a distribution on X × {−1, 1}. For ρ_c > 0 it holds for all δ > 0 with probability at least 1 − δ over a sample S ∼ D^m and for all h ∈ H

R(h) ≤ R̂_{S,ρ_c}(h) + 2(rΛ + s)/(ρ_c √m) + √( log(1/δ)/(2m) ),   (72)

where R(h) = E_{(x,y)∼D}(1_{y h(x) ≤ 0}).

14.3 Some standard kernels


• Polynomial kernels: For a constant c > 0, the polynomial kernel of degree d ∈ N over R^N is defined by

K(x, x') = (⟨x, x'⟩ + c)^d.

For example, if N = 2 and d = 2, then

K(x, x') = (x_1 x_1' + x_2 x_2' + c)² = ⟨ψ(x), ψ(x')⟩ with ψ(x) = (x_1², x_2², √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c)^T,

so the degree-two polynomial kernel corresponds to an explicit feature map ψ : R² → R⁶. (A short numerical check of this identity is given after this list.)

• Gaussian/Radial basis function kernel: For any constant σ > 0, the Gaussian or RBF kernel over R^N is defined as

K(x, x') = e^{−‖x − x'‖²/(2σ²)}.

It can be shown, see [5], that the Gaussian kernel satisfies the assumptions of Theorem 14.1 and there is a ψ and an infinite dimensional feature space Z such that K(x, x') = ⟨ψ(x), ψ(x')⟩_Z.
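The following tiny sketch numerically checks the identity for the polynomial kernel with N = 2 and d = 2 stated above; the random test points are arbitrary.

```python
import numpy as np

def poly_kernel(x, xp, c=1.0):
    return (np.dot(x, xp) + c) ** 2

def psi(x, c=1.0):
    # explicit feature map for N = 2, d = 2 from above
    return np.array([x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2 * c) * x[0],
                     np.sqrt(2 * c) * x[1],
                     c])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
# the two numbers agree up to rounding errors
print(poly_kernel(x, xp), np.dot(psi(x), psi(xp)))
```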

14.4 A numerical example

[80]: import numpy as np


import matplotlib.pyplot as plt
from scipy import stats

# use seaborn plotting defaults


import seaborn as sns; sns.set()
from sklearn.svm import SVC
from sklearn.datasets import make_moons, load_iris, load_wine

We use the following helper function to plot the decision regions of our SVM classifiers. This is taken
from the Python Data Science Handbook by Jake VanderPlas

[41]: def plot_svc_decision_function(model, ax=None, plot_support=True):


"""Plot the decision function for a 2D SVC"""

# This method is taken from:


# Python Data Science Handbook
# by Jake VanderPlas
# Released November 2016
# Publisher(s): O'Reilly Media, Inc.
# ISBN: 9781491912058

if ax is None:
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# create grid to evaluate model


x = np.linspace(xlim[0], xlim[1], 30)
y = np.linspace(ylim[0], ylim[1], 30)
Y, X = np.meshgrid(y, x)
xy = np.vstack([X.ravel(), Y.ravel()]).T
P = model.decision_function(xy).reshape(X.shape)

# plot decision boundary and margins


ax.contour(X, Y, P, colors='k',
levels=[-1, 0, 1], alpha=0.5,
linestyles=['--', '-', '--'])

# plot support vectors


if plot_support:
ax.scatter(model.support_vectors_[:, 0],
model.support_vectors_[:, 1],
s=300, linewidth=1, facecolors='none');
ax.set_xlim(xlim)
ax.set_ylim(ylim)

We will now apply the kernel SVM to three different data sets. We start with the moons data set, which
consists of two interleaving half-circles.

[109]: def plot_svm_moons(N, kernel):
X,y = make_moons(n_samples=N, shuffle=True, noise=0.05)

model = SVC(kernel=kernel, C=2000)


model.fit(X, y)

ax = plt.gca()
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='Dark2')
plot_svc_decision_function(model, ax)

N = 200
plt.figure(figsize = (18, 6))
plt.subplot(131)
plot_svm_moons(N, 'rbf')
plt.subplot(132)
plot_svm_moons(N, 'linear')
plt.subplot(133)
plot_svm_moons(N, 'poly')

Next, we look at the performance on one of the most famous data sets in data science, the iris data set. This data set contains as features measurements of iris flowers (here we use sepal width and petal length). The associated labels are one of three flower types: 'iris setosa', 'iris versicolor', 'iris virginica'. Since our support vector classifier only performs binary classification, we combine iris versicolor and iris virginica into one class.

[108]: def plot_svm_iris(N, kernel):


X = load_iris().data[:, 1:3]
y = load_iris().target<1

X = X[:N, :]
y = y[:N]

model = SVC(kernel=kernel, C=2000)


model.fit(X, y)

ax = plt.gca()

ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='Dark2')
ax.set_xlabel(load_iris().feature_names[1])
ax.set_ylabel(load_iris().feature_names[2])
plot_svc_decision_function(model, ax)

N = 200
plt.figure(figsize = (18, 6))
plt.subplot(131)
plot_svm_iris(N, 'rbf')
plt.subplot(132)
plot_svm_iris(N, 'linear')
plt.subplot(133)
plot_svm_iris(N, 'poly')

Finally, we look at a data set that is not linearly separable, the wine data set. It contains labelled data of three types of wine, of which we combine the latter two. The data is in the form of specific measurable characteristics of the wine, such as the alcohol or magnesium content.

[111]: def plot_svm_wine(N, kernel):

features = [0, 6]
X = load_wine().data[:, features]
y = load_wine().target<1

X = X[:N, :]
y = y[:N]

model = SVC(kernel=kernel, C=2000)


model.fit(X, y)

ax = plt.gca()
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='Dark2')
ax.set_xlabel(load_wine().feature_names[features[0]])
ax.set_ylabel(load_wine().feature_names[features[1]])
plot_svc_decision_function(model, ax)

N = 200
plt.figure(figsize = (18, 6))
plt.subplot(131)
plot_svm_wine(N, 'rbf')
plt.subplot(132)
plot_svm_wine(N, 'linear')
plt.subplot(133)
plot_svm_wine(N, 'poly')

15 Lecture 15 - Nearest Neighbour


The nearest neighbour learning algorithm is one of the simplest, but also most versatile and frequently
used machine learning algorithms. Let us define it below:
Definition 15.1. Let X be a set and ρ : X × X → R be a distance function. Let Y be a label space and S = (x_i, y_i)_{i=1}^m ∈ (X × Y)^m be a sample. Let π : X → perm(m) be such that

ρ(x, x_{π(x)(i)}) ≤ ρ(x, x_{π(x)(i+1)})   for all i ∈ [m − 1].

Then we define, for k ≤ m, the k-nearest neighbour classifier by

h_S^{kNN}(x) := A(S)(x) := majority label of (y_{π(x)(i)})_{i∈[k]}.

Here the majority label is the value y appearing most often in the sequence (y_{π(x)(i)})_{i∈[k]}.

Of course, we do not have to take the majority label in the definition of A(S) but could take an average,
such as the mean, of the observed labels, if we want to perform regression instead of classification.
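The following is a minimal sketch of Definition 15.1 for Euclidean data; the tiny data set and the choice k = 3 are only for illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3,
                 dist=lambda a, b: np.linalg.norm(a - b)):
    """k-nearest neighbour prediction following Definition 15.1."""
    # sort the sample by distance to x (this plays the role of pi(x))
    order = np.argsort([dist(x, xi) for xi in X_train])
    # majority label among the k closest sample points
    return Counter(y_train[order[:k]]).most_common(1)[0][0]

# tiny example with two labelled clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, -1, 1, 1])
print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))  # -> -1
```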

Figure 17: Sketch of a 1-nearest neighbour classifier. The blue and yellow dots are from the training set.
The points separate the input space into regions that are closest to one data point. These regions are called
Voronoi regions.

15.1 Generalisation bounds for one-nearest-neighbour


Given a distribution on X and a concept c : X → Y , and a sample S, we expect the one-nearest-neighbour
rule to work well if the following two conditions are met:
• for most randomly chosen x there exists an x_i in the sample such that ‖x − x_i‖ is small,
• for x close to x', we have that c(x) is close to c(x').
The second condition is satisfied if c is a Lipschitz function, i.e., if ‖c(x) − c(x')‖_Y ≤ C ‖x − x'‖_X for a constant C > 0 and norms on X and Y. The first condition seems reasonable to assume if sufficiently many samples have been drawn.
The following result will be useful.
Lemma 15.1. Let Q1 , . . . , Qr ⊂ X and let D be a distribution on X . Further let S ∼ D m . It holds that
E( ∑_{i : Q_i ∩ S = ∅} P(Q_i) ) ≤ r/(em).   (73)

Proof. By the linearity of the expected value, we have that

E( ∑_{i : Q_i ∩ S = ∅} P(Q_i) ) = ∑_{i=1}^r E(1_{S ∩ Q_i = ∅}) P(Q_i).

Since S ∼ D^m we have that

E(1_{S ∩ Q_i = ∅}) = P(S ∩ Q_i = ∅) = (1 − P(Q_i))^m ≤ e^{−P(Q_i) m}.

We conclude that

E( ∑_{i : Q_i ∩ S = ∅} P(Q_i) ) ≤ r max_{i=1,…,r} P(Q_i) e^{−P(Q_i) m}.   (74)

The function h(x) = x e^{−mx} satisfies

h'(x) = e^{−mx} − mx e^{−mx} = (1 − mx) e^{−mx}

and therefore has one root at x* = 1/m. It is not hard to see that this is a maximum of h. Since h(x*) = 1/(em), we conclude that h(x) ≤ 1/(em). Applying this observation to (74), we obtain the result.

Using the lemma above, we can now prove a generalisation bound for the one-nearest-neighbour classi-
fier, if the underlying concept class is Lipschitz continuous.

Theorem 15.1. Let X = [0, 1]^d, Y ⊂ [0, 1] and let D be a distribution on X. Let c be a C_1-Lipschitz continuous target concept and L be a C_2-Lipschitz loss function bounded by 1. It holds for a sample S ∼ D^m that

E_S R_L(h_S^{1NN}) ≤ (2 √d C_1 C_2 + 1/e) m^{−1/(d+1)}.

Proof. For x ∈ X and a sample S ∈ X^m, we denote by π_S(x) the closest element to x in S. Then h_S^{1NN}(x) = c(π_S(x)). We have that

E_S R_L(h_S^{1NN}) = E_S E L(h_S^{1NN}(x), c(x))
 = E_S E( L(h_S^{1NN}(x), c(x)) 1_{‖x − π_S(x)‖_∞ ≤ ε} ) + E_S E( L(h_S^{1NN}(x), c(x)) 1_{‖x − π_S(x)‖_∞ > ε} )
 ≤ E_S E( L(h_S^{1NN}(x), c(x)) 1_{‖x − π_S(x)‖_∞ ≤ ε} ) + E_S E( 1_{‖x − π_S(x)‖_∞ > ε} ) =: I + II.   (75)

We start by estimating the term I in (75). Since the loss of a correct prediction vanishes, L(h_S^{1NN}(x), c(π_S(x))) = 0, it holds that

L(h_S^{1NN}(x), c(x)) = L(h_S^{1NN}(x), c(x)) − L(h_S^{1NN}(x), c(π_S(x)))
 ≤ C_2 |c(x) − c(π_S(x))|
 ≤ C_1 C_2 ‖x − π_S(x)‖,

due to the Lipschitz regularity of c and L. Moreover, it is not hard to see that

‖x − π_S(x)‖ ≤ √d ‖x − π_S(x)‖_∞.

Hence, I ≤ C_1 C_2 √d ε. To estimate II, we make the following construction. For a given M ∈ N, we can decompose the domain X into M^d cubes Q_1, …, Q_{M^d} of side length 1/M as in Figure 18. For 2/ε ≥ M ≥ 1/ε, we have that if x_1, x_2 ∈ Q_i, then ‖x_1 − x_2‖_∞ ≤ ε. We conclude that P(‖x − π_S(x)‖_∞ > ε) ≤ P(x ∈ ∪_{i : Q_i ∩ S = ∅} Q_i). With a union bound and Lemma 15.1, we obtain that

II = E_S P(‖x − π_S(x)‖_∞ > ε) ≤ M^d/(em) ≤ (2^d ε^{−d})/(em).

Figure 18: The domain X can be covered by M^d cubes of side length 1/M.

Choosing ε = m^{−1/(d+1)} yields that

E_S R_L(h_S^{1NN}) = I + II ≤ C_1 C_2 √d m^{−1/(d+1)} + 2^d m^{d/(d+1)}/(em) ≤ (C_1 C_2 √d + 2^d/e) m^{−1/(d+1)},

which yields the claim.

Remark 15.1. Note that the generalisation bound of 1-nearest neighbour classification deteriorates exponentially
fast with increasing dimension. This is one instance of the so-called curse of dimension.

16 Lecture 16 - Neural Networks


We start by directly defining neural networks.
Definition 16.1. Let N, d ∈ N and ϱ : R → R. A (shallow) neural network is a function Φ of the form

Φ : R^d → R,   Φ(x) = ∑_{i=1}^N c_i ϱ(⟨a_i, x⟩ + b_i) + e,

where c_i, b_i, e ∈ R and a_i ∈ R^d for i ∈ [N]. We say that

• Φ has input dimension d,
• Φ has N neurons,
• the activation function of Φ is ϱ,
• the a_i, c_i for i ∈ [N] are the weights of the neural network,
• the b_i, e are the biases of the neural network.
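To make Definition 16.1 concrete, here is a minimal numpy sketch that evaluates a shallow network; the tanh activation and the random weights are arbitrary choices for illustration, not part of the definition.

```python
import numpy as np

def shallow_network(x, a, b, c, e, rho=np.tanh):
    """Phi(x) = sum_{i=1}^N c_i * rho(<a_i, x> + b_i) + e  (Definition 16.1)."""
    return float(c @ rho(a @ x + b) + e)

rng = np.random.default_rng(0)
d, N = 3, 6                      # input dimension and number of neurons
a = rng.normal(size=(N, d))      # weights a_i
b = rng.normal(size=N)           # biases b_i
c = rng.normal(size=N)           # weights c_i
e = 0.5                          # output bias e
x = rng.normal(size=d)
print(shallow_network(x, a, b, c, e))
```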

Figure 19: Sketch of a neural network with input dimension 3 and 6 neurons.

In practice, often deep neural networks are used. These are functions that result from stacking multiple of these neural networks after one another in multiple layers. We will not discuss these types of networks here.
Neural networks form a general class of hypothesis set. If ϱ = 2·1_{(0,∞)} − 1, N = 1, c_1 = 1, and e = 0, then the class of such neural networks is the class of hyperplane classifiers that we have already encountered in Section 11.
We first would like to understand this set a bit better and in particular the role of the number of neurons
and the activation function. The following result is one of the most famous in neural network theory:

Theorem 16.1 (Universal approximation theorem). Let ϱ : R → R be a sigmoidal function, i.e., ϱ is continuous and lim_{x→−∞} ϱ(x) = 0 and lim_{x→∞} ϱ(x) = 1. Then, for every compact set K ⊂ R^d, every continuous function f : K → R and every ε > 0, there exists Φ such that

sup_{x∈K} |f(x) − Φ(x)| < ε,   (76)

where Φ is a neural network with activation function ϱ.


Multiple proofs of this statement or generalisations thereof have been found in the literature, see [3, 4, 2].
We present a proof that is close to that in [2].

Proof. Assume towards a contradiction that there exists a function f : K → R and an ε > 0 such that for all neural networks Φ with activation function ϱ

sup_{x∈K} |f(x) − Φ(x)| > ε.

Let us denote the set of all neural networks with input dimension d and activation function ϱ by NN_{d,ϱ}. It is clear from the definition that NN_{d,ϱ} is a subspace of the space of continuous functions on K, which we denote by C(K).
By the theorem of Hahn–Banach, there exists a continuous linear functional h ∈ C(K)', the dual space of C(K), such that

h(Φ) = 0 and h(f) = 1,

for all Φ ∈ NN_{d,ϱ}. Furthermore, by the representation theorem of Riesz, there exists a signed Borel measure µ ≠ 0 such that

h(g) = ∫_K g(x) dµ(x).

Since h(Φ) = 0 for all neural networks Φ, this holds in particular for every neural network with one neuron:

x ↦ ϱ(⟨a, x⟩ + b).

Hence, we conclude that for the non-zero measure µ

∫_K ϱ(⟨a, x⟩ + b) dµ(x) = 0

for all a ∈ R^d and b ∈ R.


Since ϱ is continuous and tends to 0 or 1 for x → −∞ or x → ∞ respectively, we conclude that ϱ is bounded and, for λ, θ > 0,

ϱ(⟨λa, x⟩ + λb + θ) = ϱ(λ(⟨a, x⟩ + b) + θ) → { 1 if ⟨a, x⟩ + b > 0;  0 if ⟨a, x⟩ + b < 0;  ϱ(θ) if ⟨a, x⟩ + b = 0 }

for λ → ∞. Letting also θ → ∞, we see that for every x ∈ R^d

ϱ(⟨λa, x⟩ + λb + θ) → 1_{[0,∞)}(⟨a, x⟩ + b).

We conclude by the dominated convergence theorem that for all a ∈ R^d and b ∈ R

∫_K 1_{[b,∞)}(⟨a, x⟩) dµ(x) = ∫_K 1_{[0,∞)}(⟨a, x⟩ − b) dµ(x) = 0.

By using the linearity of the integral, we conclude that for all a ∈ R^d and b_1, b_2 ∈ R

∫_K 1_{[b_1, b_2)}(⟨a, x⟩) dµ(x) = 0.

Since every univariate continuous function on an interval can be approximated arbitrarily well uniformly by step functions, we conclude that for every g ∈ C(R)

∫_K g(⟨a, x⟩) dµ(x) = ∫_K g|_{[c_1, c_2]}(⟨a, x⟩) dµ(x) = 0,

where c_1 = min{⟨a, x⟩ : x ∈ K}, c_2 = max{⟨a, x⟩ : x ∈ K}.
In particular, g = sin and g = cos are possible, and by Euler's formula e^{ix} = cos(x) + i sin(x) we conclude that

0 = ∫_K e^{i⟨a, x⟩} dµ(x) = ∫_{R^d} e^{i⟨a, x⟩} dµ̃(x),

for a measure µ̃ on R^d supported on K that coincides with µ on K. We conclude that the Fourier transform of µ̃ vanishes. This implies that µ̃ and hence µ vanishes, which is a contradiction to the choice of µ.

We see that neural networks are a versatile hypothesis set, since they can represent every continuous
function arbitrarily well, if they are sufficiently large. From a generalisation point of view this is of course
not so exciting since the VC dimension of the set of continuous functions is infinite. We recall from
Theorem 6.1 that an infinite VC dimension prohibits us from learning anything.
In practice, neural networks with only a finite number of neurons are used. Typically, the resulting set of neural networks does not form a dense subset of the set of continuous functions. Then we again have a chance to learn something. Indeed, we can bound the VC dimension of sets of neural networks.
The definition of VC dimension requires a function class with outputs in Y = {−1, 1}. Thus, we can only
define a VC dimension for the set of NNs with binary output, which we get by composing every NN with
a sign function.
Theorem 16.2 ([1, Theorem 2.1]). Let d, N ∈ N and let ϱ be a piecewise polynomial function. We denote the set of neural networks with N neurons, input dimension d and activation function ϱ by F_N. It holds that

VCdim(sign ∘ F_N) = O(N log N), for N → ∞.


We can combine this theorem with the VC-dimension-based generalisation bound from earlier, which yields the following:
Corollary 16.1. Let d, N ∈ N and let ϱ be a piecewise polynomial function. We denote the set of neural networks with N neurons, input dimension d and activation function ϱ by F_N. Let D be a distribution on X × Y with Y = {−1, 1} and S ∼ D^m. Then, for every δ > 0, with probability at least 1 − δ, for any h ∈ F_N:

|R(sign(h)) − R̂_S(sign(h))| = O( √( 2N log(N) log( em/(N log N) ) / m ) + √( log(1/δ)/(2m) ) ),

where m ≥ N log N and N → ∞.

[166]: import numpy as np


import keras as ks
from keras.models import Sequential
from keras.layers import Dense
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

We test a neural network on the classical moons data set

[175]: X, y = make_moons(noise=0.1, random_state=0)
plt.scatter(X[:,0], X[:,1], c = y, cmap = 'coolwarm')

[175]: <matplotlib.collections.PathCollection at 0x7fad8efd2908>

[222]: # We define three models, each with one hidden layer with the ReLU (x -> max{x, 0}) activation function.
# The first model has 2 neurons, the second 5, and the last has 20 neurons.
# We apply a sigmoid to the output for stability reasons.
model1 = Sequential()
model1.add(Dense(2, input_dim=2, activation='relu'))
model1.add(Dense(1, activation='sigmoid'))

model2 = Sequential()
model2.add(Dense(5, input_dim=2, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))

model3 = Sequential()
model3.add(Dense(20, input_dim=2, activation='relu'))
model3.add(Dense(1, activation='sigmoid'))

# compile the models. Here we need to choose an optimiser. This one is called Adam; it is used to
# determine how the training is performed. We do not care in this lecture how this is done.
opt = ks.optimizers.Adam(learning_rate=0.02)
model1.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])
model2.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])
model3.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])

# fit the models on the dataset

model1.fit(X, y, epochs=30, batch_size=5, verbose = False)


model2.fit(X, y, epochs=30, batch_size=5, verbose = False)
model3.fit(X, y, epochs=30, batch_size=5, verbose = False)

# evaluate the keras model


print('Model 1:')
_, accuracy1 = model1.evaluate(X, y)
print('Model 2:')
_, accuracy2 = model2.evaluate(X, y)
print('Model 3:')
_, accuracy3 = model3.evaluate(X, y)

h=0.2
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))

Z1 = model1.predict(np.c_[xx.ravel(), yy.ravel()])
Z1 = Z1.reshape(xx.shape)
Z2 = model2.predict(np.c_[xx.ravel(), yy.ravel()])
Z2 = Z2.reshape(xx.shape)
Z3 = model3.predict(np.c_[xx.ravel(), yy.ravel()])
Z3 = Z3.reshape(xx.shape)

plt.figure(figsize = (18,5))

plt.subplot(1,3,1)
plt.contourf(xx, yy, Z1, cmap='coolwarm', alpha=.8)
plt.scatter(X[:,0], X[:,1], c = y, cmap = 'coolwarm')
plt.title('2 Neurons')

plt.subplot(1,3,2)
plt.contourf(xx, yy, Z2, cmap='coolwarm', alpha=.8)
plt.scatter(X[:,0], X[:,1], c = y, cmap = 'coolwarm')
plt.title('5 Neurons')

plt.subplot(1,3,3)
plt.contourf(xx, yy, Z3, cmap='coolwarm', alpha=.8)
plt.scatter(X[:,0], X[:,1], c = y, cmap = 'coolwarm')
plt.title('20 Neurons')

Model 1:
4/4 [==============================] - 0s 720us/step - loss: 0.0851 - accuracy:
0.8800
Model 2:
4/4 [==============================] - 0s 680us/step - loss: 0.0351 - accuracy:

0.9700
Model 3:
4/4 [==============================] - 0s 1ms/step - loss: 0.0011 - accuracy:
1.0000

[222]: Text(0.5, 1.0, '20 Neurons')

17 Lecture 17 - Facial Expression Classification

[68]: import numpy as np


import matplotlib.pyplot as plt
from matplotlib import colors
from scipy import ndimage, signal
import pandas as pd

Let us load the data first


[69]: labels = np.loadtxt('true_labels_Facial_train.csv', delimiter=',')
data_train = np.load('data_train_Facial.npy', allow_pickle=True)
data_test = np.load('data_test_Facial.npy', allow_pickle=True)
print(data_train.shape)
print(data_test.shape)

(20000, 35, 35)


(10000, 35, 35)

This is an image data set in the form of a numpy array.

It contains images of 35x35 pixels. The images are of faces and the labels correspond to their emotional state:
0: happy
1: sad
2: angry
Let's have a look at the images:

[70]: cmap = colors.ListedColormap(['white', 'yellow', 'black'])


Emotions = ['Happy', 'Sad', 'Angry']

plt.figure(figsize = (15,15))
for k in range(16):
plt.subplot(4,4,k+1)
plt.imshow(data_train[k,:,:], cmap= cmap)
plt.title(Emotions[int(labels[k])])

Our biggest problem is to deal with the massive input dimension of 35 × 35.
My solution is to use a very simple algorithm. Other solutions could involve reducing the dimension in a
smart way and then applying tools from earlier.

[71]: from sklearn.neighbors import KNeighborsClassifier

[98]: # we split the training set into a train and a validation set:
data_train_split = data_train[0:int(data_train.shape[0]/2), :, :]
lab_train_split = labels[0:int(data_train.shape[0]/2)]
data_validation_split = data_train[int(data_train.shape[0]/2)::, :, :]
lab_validation_split = labels[int(data_train.shape[0]/2)::]

#we train the nearest neighbor classifier on the training set:


neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(np.reshape(data_train_split, [data_train_split.shape[0], data_train_split.shape[1]*data_train_split.shape[2]]), lab_train_split)

[98]: KNeighborsClassifier(n_neighbors=1)

Next we compute the accuracy of our algorithm on the validation set:

[99]: # make prediction:

validation_pred_labels = neigh.predict(np.reshape(data_validation_split, [data_validation_split.shape[0], data_validation_split.shape[1]*data_validation_split.shape[2]]))

# validation accuracy:

accuracy = np.sum(lab_validation_split == validation_pred_labels)/lab_validation_split.shape[0]


print('Accuracy: ' + str(accuracy))

Accuracy: 0.8646

Let us have a look at the misclassified data points to see if there is something conspicuous about them.

[100]: # Let's look at some of the missclassified examples:


mistakes = np.where(lab_validation_split != validation_pred_labels)[0]

cmap = colors.ListedColormap(['white', 'yellow', 'black'])


Emotions = ['Happy', 'Sad', 'Angry']

plt.figure(figsize = (15,15))
for k in range(16):
plt.subplot(4,4,k+1)
plt.imshow(data_validation_split[mistakes[k],:,:], cmap= cmap)
plt.title(Emotions[int(lab_validation_split[mistakes[k]])])

I am very happy with the accuracy on the validation set. I also have no simple explanation why the faces above were misclassified and therefore no direct way of improving my algorithm. (One notices a surprisingly high number of faces with glasses, though.) Hence, I choose to proceed.
I now apply this algorithm to the test set:

[74]: labels_test = neigh.predict(np.reshape(data_test, [data_test.shape[0], data_test.shape[1]*data_test.shape[2]]))

Finally we store the prediction to enter the competition.

[61]: np.savetxt('prediction_facial_recognition_PhilippPetersen.csv', labels_test, delimiter=',')

18 Lecture 18 - Boosting
Boosting is a type of ensemble method where multiple classifiers/predictors are combined to yield one
more powerful classifier/predictor.
We start with the definition of a weak learning algorithm.
Definition 18.1. Let C be a concept class. A weak PAC learning algorithm is an algorithm A taking samples S ∈ X^m to functions in H ⊂ {−1, 1}^X such that, for a γ > 0, there exists a function m : (0, 1) → N such that for every δ > 0, all distributions D on X and every target concept c, it holds that

P_{S∼D^m}( R_S(A(S)) ≤ 1/2 − γ ) ≥ 1 − δ,

if m ≥ m(δ).
A weak learning algorithm only needs to be slightly better than the trivial algorithm that predicts
Rademacher random labels.
The idea behind boosting is now to cleverly combine the hypotheses returned by weak learning algo-
rithms to build a stronger algorithm.
Probably the most widely-used boosting algorithm is AdaBoost:

ADABOOST: Input: Base classifier set H, sample (x_i, y_i)_{i=1}^m, number of steps T.

1. Initialise D_1 as the uniform probability distribution on [m].
2. for t = 1, …, T:
3.   Choose h_t ∈ H such that ε_t := ∑_{i=1}^m D_t(i) 1_{h_t(x_i) ≠ y_i} is small.
4.   set α_t := log( (1 − ε_t)/ε_t ) / 2.
5.   for i = 1, …, m:
6.     set D_{t+1}(i) := D_t(i) exp(−α_t y_i h_t(x_i)) / ∑_{j=1}^m D_t(j) exp(−α_t y_j h_t(x_j)).
7. return f := sign ∘ ∑_{t=1}^T α_t h_t.
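Below is a minimal sketch of ADABOOST for numerical data, using axis-aligned threshold classifiers (the decision stumps discussed later in this section) as base class. The exhaustive search over thresholds in step 3 and the toy data set are simplifying choices of this sketch, not part of the lecture.

```python
import numpy as np

def best_stump(X, y, D):
    """Pick h(x) = s * sign(x_j - theta) with small weighted error sum_i D(i) 1[h(x_i) != y_i]."""
    best_params, best_err = None, np.inf
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for s in (-1, 1):
                pred = s * np.sign(X[:, j] - theta)
                pred[pred == 0] = s
                err = np.sum(D[pred != y])
                if err < best_err:
                    best_params, best_err = (j, theta, s), err
    return best_params, best_err

def adaboost(X, y, T=30):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                       # step 1: uniform distribution
    stumps, alphas = [], []
    for t in range(T):
        (j, theta, s), eps = best_stump(X, y, D)  # step 3
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)     # step 4
        pred = s * np.sign(X[:, j] - theta)
        pred[pred == 0] = s
        D = D * np.exp(-alpha * y * pred)         # steps 5-6
        D = D / D.sum()
        stumps.append((j, theta, s))
        alphas.append(alpha)

    def f(Xnew):                                  # step 7
        agg = np.zeros(Xnew.shape[0])
        for a, (j, theta, s) in zip(alphas, stumps):
            pred = s * np.sign(Xnew[:, j] - theta)
            pred[pred == 0] = s
            agg += a * pred
        return np.sign(agg)
    return f

# toy example: diagonal decision boundary, which no single stump can represent
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
f = adaboost(X, y, T=30)
print('training accuracy:', np.mean(f(X) == y))
```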

Theorem 18.1. Let S = ( xi , yi )im=1 be a sample, let H be a set of base classifiers and assume that in the iteration of
AdaBoost, 0 < et < 1/2 − γ for a fixed γ > 0. Then, for f = ADABOOST (H, S, T )
R̂_S(f) = (1/m) ∑_{i=1}^m 1_{f(x_i) ≠ y_i} ≤ e^{−2γ² T}.

Proof. Let us denote, for t ∈ [T],

f_t = ∑_{p ≤ t} α_p h_p,   Z_t := (1/m) ∑_{i=1}^m e^{−y_i f_t(x_i)},

and f_0 = 0, Z_0 = 1. Note that sign(f_T) = f = ADABOOST(H, S, T).

Since 1_{h(x) y ≤ 0} ≤ e^{−y h(x)}, we have that

R̂_S(f) = (1/m) ∑_{i=1}^m 1_{f(x_i) ≠ y_i} = (1/m) ∑_{i=1}^m 1_{f_T(x_i) y_i ≤ 0} ≤ Z_T = Z_T/Z_0 = (Z_T/Z_{T−1}) ⋯ (Z_1/Z_0).

Therefore, the result follows if we can show that for all t ∈ [T − 1]

Z_{t+1}/Z_t ≤ e^{−2γ²}.   (77)
Assume that for a fixed t ∈ [T − 1]

D_t(i) = e^{−y_i f_{t−1}(x_i)} / ∑_{j=1}^m e^{−y_j f_{t−1}(x_j)}.   (78)

Then we conclude that

D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / ∑_{j=1}^m D_t(j) exp(−α_t y_j h_t(x_j)) = e^{−y_i f_t(x_i)} / ∑_{j=1}^m e^{−y_j f_t(x_j)}.

Since (78) holds for t = 1, we conclude by induction that (78) holds for all t ∈ [T].
Now we have that

Z_{t+1}/Z_t = ∑_{i=1}^m e^{−y_i f_{t+1}(x_i)} / ∑_{i=1}^m e^{−y_i f_t(x_i)}
 = ∑_{i=1}^m e^{−y_i f_t(x_i)} e^{−y_i α_{t+1} h_{t+1}(x_i)} / ∑_{i=1}^m e^{−y_i f_t(x_i)}
 = ∑_{i=1}^m D_{t+1}(i) e^{−y_i α_{t+1} h_{t+1}(x_i)}
 = e^{−α_{t+1}} ∑_{i : y_i = h_{t+1}(x_i)} D_{t+1}(i) + e^{α_{t+1}} ∑_{i : y_i ≠ h_{t+1}(x_i)} D_{t+1}(i)
 = e^{−α_{t+1}} (1 − ε_{t+1}) + e^{α_{t+1}} ε_{t+1}
 = (1/√(1/ε_{t+1} − 1)) (1 − ε_{t+1}) + √(1/ε_{t+1} − 1) ε_{t+1}
 = √(ε_{t+1}/(1 − ε_{t+1})) (1 − ε_{t+1}) + √((1 − ε_{t+1})/ε_{t+1}) ε_{t+1}
 = √(ε_{t+1}(1 − ε_{t+1})) + √((1 − ε_{t+1}) ε_{t+1}) = 2 √(ε_{t+1}(1 − ε_{t+1})).

We had assumed that ε_{t+1} < 1/2 − γ and hence

2 √(ε_{t+1}(1 − ε_{t+1})) ≤ 2 √((1/2 − γ)(1/2 + γ)) = 2 √(1/4 − γ²) = √(1 − 4γ²).

Using 1 − x ≤ e− x yields that
Z_{t+1}/Z_t ≤ e^{−2γ²}.
This completes the proof.

We saw that AdaBoost can very quickly reduce the empirical error, if weak learners exist and can be found quickly. A standard choice for the set of base classifiers is that of so-called decision stumps (this name comes from the fact that these are decision trees with minimal depth), which are linear classifiers acting on a single axis of the data, i.e., for X = R^N

H := { x 7→ b · sign( xi − θ ) : θ ∈ R, b ∈ {±1}, i ∈ [ N ]} .

See Figure 20 for a visualisation of boosting with decision stumps.

Figure 20: Visualisation of classification with boosting and decision stumps. The top left shows the samples and the underlying distribution. The next four panels show successively built sums of decision stumps.

Note that the set of decision stumps is quite small. In fact, there exist simple distributions so that there
does not exist a PAC learning algorithm with hypothesis set H.
We can ask ourselves how the base class affects the generalisation capabilities of Adaboost. For this, we
observe that the output of Adaboost is an element of the following set:
L(H, T) = { x ↦ sign( ∑_{t=1}^T α_t h_t(x) ) : α_t ∈ R, h_t ∈ H }.

We can compute the VC dimension of L(H, T ).

Proposition 18.1. Let H be a base class and let T ∈ N, T ≥ 3. Then

VCdim( L(H, T )) ≤ 2(d + 3)( T + 1) log2 ((d + 3)( T + 1)), (79)

where d := VCdim(H).

Proof. Let C = (x_1, …, x_m) be a set of points shattered by L(H, T).
Every function f ∈ L(H, T) is built by the concatenation of h_1, …, h_T with a linear classifier. By Theorem 4.3 we have that |{h(C) : h ∈ H}| ≤ (em/d)^d = m^d (e/d)^d ≤ e m^d, where d = VCdim(H). Therefore,

|{(h_1(C), …, h_T(C)) : h_1, …, h_T ∈ H}| ≤ e^T m^{dT}.

By Example 4.3 and Theorem 4.3, we have for each element c = (c_1, …, c_T) ∈ {(h_1(C), …, h_T(C)) : h_1, …, h_T ∈ H} that

|{sign(⟨a, c⟩ + b) : a ∈ R^T, b ∈ R}| ≤ (em/(T + 1))^{T+1} = (e/(T + 1))^{T+1} m^{T+1} ≤ e m^{T+1}.

Therefore, we conclude that

|{f(C) : f ∈ L(H, T)}| ≤ e^{T+1} m^{dT + (T+1)} = e^{T+1} m^{(d+1)T + 1} ≤ 2^{2(T+1)} m^{(d+1)(T+1)}.

Since C was shattered by L(H, T), we conclude that

2^m ≤ 2^{2(T+1)} m^{(d+1)(T+1)}

and hence

m ≤ 2(T + 1) + (d + 1)(T + 1) log_2(m) ≤ (d + 3)(T + 1) log_2(m).   (80)

Since for x > 1 we have that log_2(x) ≤ √x, it follows from (80) that

√m ≤ (d + 3)(T + 1)

and thus

log_2(m) ≤ 2 log_2((d + 3)(T + 1)).   (81)

Applying (81) to (80) yields

m ≤ 2(d + 3)(T + 1) log_2((d + 3)(T + 1)).

19 Lecture 19 - Clustering
Clustering is the act of grouping the elements of a data set (x_i)_{i=1}^m into a number of sets that may or may not be determined beforehand. In low dimensions, humans have a very good intuition on how to cluster data points. For example, in Figure 21, most people would have a pretty strong opinion on how to cluster the points into two or three sets. However, defining a mathematical rule is typically harder. To perform clustering numerically, one needs to specify an objective to minimise or a procedure to follow. We will discuss some examples of such algorithms in this chapter.

Figure 21: Six clustering problems

Let us first describe the task of clustering in more mathematical terms. Clustering is a procedure that
maps an input to an output:
• Input: A set X = ( xi )im=1 and a distance function d : X × X → R+ which is symmetric and satisfies
d( x, x ) = 0. Alternatively, a similarity measure s : X × X → [0, 1] can be given with s symmetric and
s( x, x ) = 1.
• Output: A sequence of disjoint subsets of X denoted by (C_i)_{i=1}^k such that ∪_{j=1}^k C_j = X.

How this segmentation of X into the (Ci )ik=1 is performed depends on d or s and is different from algorithm
to algorithm.

19.1 Linkage-based clustering


Linkage-based clustering is very simple, but oftentimes surprisingly effective. It works by the following procedure, which is visualised in Figure 22:
1. Start with (C_i^0)_{i=1}^m ⊂ X, disjoint, such that x_i ∈ C_i^0.
2. Construct, for ℓ < m, the sets (C_i^ℓ)_{i=1}^{m−ℓ} by the following procedure: Find i_1, i_2 such that C_{i_1}^{ℓ−1} and C_{i_2}^{ℓ−1} are the most similar clusters (we will discuss what this means below). Then

(C_i^ℓ)_{i=1}^{m−ℓ} = {C_i^{ℓ−1} : i ≠ i_1, i_2} ∪ {C_{i_1}^{ℓ−1} ∪ C_{i_2}^{ℓ−1}}.   (82)

3. The output of the clustering algorithm is (C_i^ℓ)_{i=1}^{m−ℓ} for a given ℓ.

The notion of "most similar clusters" that was used above can mean many things, depending on the application in mind. A couple of examples are listed below; a short code sketch follows the list.
• Single Linkage clustering: Here we define
d(C_i, C_j) := min{ d(x_ℓ, x_{ℓ'}) : x_ℓ ∈ C_i, x_{ℓ'} ∈ C_j, ℓ, ℓ' ∈ [m] }.
• Average Linkage clustering: Here we define
d(C_i, C_j) := mean{ d(x_ℓ, x_{ℓ'}) : x_ℓ ∈ C_i, x_{ℓ'} ∈ C_j, ℓ, ℓ' ∈ [m] }.
• Max Linkage clustering: Here we define
d(C_i, C_j) := max{ d(x_ℓ, x_{ℓ'}) : x_ℓ ∈ C_i, x_{ℓ'} ∈ C_j, ℓ, ℓ' ∈ [m] }.
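Here is a short sketch of the linkage procedure above using scipy's hierarchical clustering; note that scipy calls max linkage 'complete' linkage, and the three-blob data set is an arbitrary toy example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# three well-separated point clouds
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

d = pdist(X)  # pairwise Euclidean distances
# 'single', 'average' and 'complete' correspond to single, average and max linkage
for method in ['single', 'average', 'complete']:
    Z = linkage(d, method=method)
    labels = fcluster(Z, t=3, criterion='maxclust')  # stop once 3 clusters remain
    print(method, 'cluster sizes:', np.bincount(labels)[1:])
```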

Figure 22: Example of linkage-based clustering.

19.2 k-means clustering


One way to perform clustering is to define a cost function which the partition (Ci )ik=1 should minimise. For
k-means clustering, we wish to find k clusters such that every point is as close as possible to the center of
its associated cluster.
To make this formal, we first define the centroid of a cluster C_i as

μ(C_i) := argmin_{μ ∈ co(X)} ∑_{x_j ∈ C_i} d(x_j, μ)²,

where (x_j)_{j=1}^m is the data set.
Then, k-means clustering consists in finding (C_i)_{i=1}^k minimising

G_{k-means}((x_j)_{j=1}^m, d)(C_1, …, C_k) := ∑_{i=1}^k ∑_{x_j ∈ C_i} d(x_j, μ(C_i))².   (83)

Finding a solution of this problem is NP-hard in general. However, there is a widely used algorithm that typically performs well. This is Lloyd's algorithm, which consists of the following steps:

LLOYD'S ALGORITHM: Input: (x_j)_{j=1}^m, number of clusters k ∈ [m].

1. Randomly choose initial centroids μ_1^1, …, μ_k^1.
2. while not converged:
3.   ∀ i ∈ [k] set C_i^{ℓ+1} := { x ∈ X : i = argmin_{j ∈ [k]} ‖x − μ_j^ℓ‖ },
4.   ∀ i ∈ [k] set μ_i^{ℓ+1} := (1/|C_i^{ℓ+1}|) ∑_{x ∈ C_i^{ℓ+1}} x.
5. return the clustering from the last iteration (C_i^L)_{i=1}^k.
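The following is a minimal numpy sketch of Lloyd's algorithm as stated above; the initialisation by k randomly chosen data points and the handling of empty clusters are simplifying assumptions of this sketch.

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    """A direct implementation of Lloyd's algorithm for Euclidean distances."""
    rng = np.random.default_rng(seed)
    # step 1: random initial centroids (here: k distinct data points)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 3: assign every point to its closest centroid
        dists = ((X[:, None, :] - mu[None, :, :])**2).sum(axis=2)
        assignment = np.argmin(dists, axis=1)
        # step 4: recompute the centroids as cluster means
        new_mu = np.array([X[assignment == i].mean(axis=0)
                           if np.any(assignment == i) else mu[i]
                           for i in range(k)])
        if np.allclose(new_mu, mu):   # converged
            break
        mu = new_mu
    return assignment, mu

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
assignment, mu = lloyd(X, k=3)
print('centroids:\n', mu)
```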

Lemma 19.1. Each iteration of the k-means algorithm does not increase the k-means objective function Gk−means of
(83).

Proof. Let μ̄(C) = (1/|C|) ∑_{x_j ∈ C} x_j. It holds for arbitrary λ ∈ co(X) that

∑_{x_j ∈ C} ‖x_j − λ‖² = ∑_{x_j ∈ C} ‖x_j − μ̄(C) − (λ − μ̄(C))‖²
 = ∑_{x_j ∈ C} ( ‖x_j − μ̄(C)‖² − 2⟨x_j − μ̄(C), λ − μ̄(C)⟩ + ‖λ − μ̄(C)‖² ).

Next, we observe that

∑_{x_j ∈ C} ( ‖x_j − μ̄(C)‖² − 2⟨x_j − μ̄(C), λ − μ̄(C)⟩ + ‖λ − μ̄(C)‖² )
 = ( ∑_{x_j ∈ C} ‖x_j − μ̄(C)‖² ) − 2 ⟨ ∑_{x_j ∈ C} (x_j − μ̄(C)), λ − μ̄(C) ⟩ + |C| ‖λ − μ̄(C)‖²
 = ( ∑_{x_j ∈ C} ‖x_j − μ̄(C)‖² ) + |C| ‖λ − μ̄(C)‖² ≥ ∑_{x_j ∈ C} ‖x_j − μ̄(C)‖².

Hence μ̄(C) = μ(C). Let, for ℓ ∈ N and i ∈ [k], C_i^ℓ be as in Lloyd's algorithm. We have that

G_{k-means}((x_j)_{j=1}^m, d)(C_1^{ℓ+1}, …, C_k^{ℓ+1}) = ∑_{i=1}^k ∑_{x_j ∈ C_i^{ℓ+1}} ‖x_j − μ(C_i^{ℓ+1})‖²
 ≤ ∑_{i=1}^k ∑_{x_j ∈ C_i^{ℓ+1}} ‖x_j − μ_i^ℓ‖²
 ≤ ∑_{i=1}^k ∑_{x_j ∈ C_i^ℓ} ‖x_j − μ_i^ℓ‖²
 = ∑_{i=1}^k ∑_{x_j ∈ C_i^ℓ} ‖x_j − μ(C_i^ℓ)‖² = G_{k-means}((x_j)_{j=1}^m, d)(C_1^ℓ, …, C_k^ℓ),

where the first inequality follows by the definition of μ(C_i^{ℓ+1}) as a minimiser, the second inequality follows from the definition of C_i^{ℓ+1} (every point is assigned to its closest centroid μ_j^ℓ), and the penultimate equality follows by the considerations at the beginning of the proof.

Figure 23: Two examples of the evolution of means µ1` , µ2` in Lloyd’s algorithm. On the left-hand side, we
observe successful convergence. On the right-hand side, there is no convergence.

Remark 19.1. Lloyd's algorithm is simple and very often effective. However, it comes with a couple of issues. First of all, convergence is not guaranteed or could take a very long time. An example of a bad initialisation prohibiting convergence is shown in Figure 23. Moreover, k-means generally suffers from the issue that all clusters must necessarily be convex in the sense that if x ∈ co(C_i) ∩ X, then x ∈ C_i. In some of the examples of Figure 21, this can be a serious issue. See also Figure 24 for an illustration.

Figure 24: K-means clustering for two data sets. On the left hand side the clustering (with k = 3) was
successful. The problem on the right hand side cannot be clustered correctly (with k = 2), because this
would require non-convex clusters.

19.3 Spectral clustering


It is often convenient to cast a clustering problem in the framework of graph theory. In this setting, the
relationship between data points is described by weights associated to edges between them. Clustering
is then the task of splitting the graph in multiple subgraphs according to some rules. Let us first make a
formal definition of a graph.
Definition 19.1. An undirected graph G = (V, E) is a tuple of a set of nodes V = (v1 , . . . , vn ) and edges
E ⊂ {(i, j), i, j ∈ [n]}, such that for all i, j ∈ [n], (i, j) ∈ E if ( j, i ) ∈ E. We say that vi , v j are connected if
(i, j) ∈ E.
A graph is often represented through its adjacency matrix A ∈ Rn×n , defined as

A_{i,j} := 1 if (i, j) ∈ E, and A_{i,j} := 0 otherwise.
If one wants to make a more nuanced description of the relationship between data points, then a weighted
graph is more appropriate.
Definition 19.2. A weighted graph is a triple (V, E, W ), where (V, E) is an undirected graph and W ∈ Rn×n is
a symmetric matrix with positive entries and n = |V |.
Example 19.1. In Figure 25, we show two graphs with the following adjacency matrices:

0 1 1 1 1 0 0 0 0
1 0 1 0 1 0 0 0 0
1 1 0 1 0 0 0 0 0
1 0 1 0 1 0 0 0 0
1 1 0 1 0 0 0 0 0
0 0 0 0 0 0 1 1 1
0 0 0 0 0 1 0 1 0
0 0 0 0 0 1 1 0 1
0 0 0 0 0 1 0 1 0

and

0 1 1 1 1 0 0 0 0 0
1 0 1 0 0 0 1 0 0 0
1 1 0 1 0 0 0 0 0 0
1 0 1 0 1 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 0
0 1 0 0 0 1 0 1 0 1
0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
   (84)
Next, we would like to define a measure that allows us to formulate an objective for clustering. A natural
measure is the so-called cut.
Definition 19.3. Let (V, E, W ) be a weighted graph. Let S ⊂ V be a vertex partition. We define the cut of S as

cut(S) := ∑_{v_i ∈ S} ∑_{v_j ∈ S^c} w_{i,j}.

Figure 25: Two visualisations of the simple graphs of Example 19.1

A small cut, means that not many large weights needed to be removed in order to make the partition.
Intuitively, this seems to be a reasonable condition for a clustering algorithm. It turns out that the cut is
closely related to the so-called graph Laplacian of a graph.
Definition 19.4. Let G = (V, E, W ) be a weighted graph. The degree matrix of G is defined as

D_{i,i} = deg(i) := ∑_{(i,j) ∈ E} w_{i,j}.

The graph Laplacian of G is given by


LG = D − W.
Remark 19.2. The graph Laplacian LG satisfies the following formula:

L_G = ∑_{i < j} w_{i,j} (e_i − e_j)(e_i − e_j)^T,   (85)

where ei , e j are the canonical unit (column) vectors.

Let S be a vertex partition and y ∈ {±1}^n be such that y_i = 1 if and only if v_i ∈ S. Then it holds that

(1/4) ∑_{i < j} w_{i,j} (y_i − y_j)² = (1/4) ( 4 ∑_{i ∈ S, j ∈ S^c, i < j} w_{i,j} + 4 ∑_{i ∈ S^c, j ∈ S, i < j} w_{i,j} )
 = ∑_{i ∈ S, j ∈ S^c, i < j} w_{i,j} + ∑_{i ∈ S, j ∈ S^c, i > j} w_{i,j} = cut(S).

Using the formula above, we are now able to express the cut in terms of the graph Laplacian.
Proposition 19.1. Let G = (V, E, W ) be a weighted graph and let LG be the associated graph Laplacian. It holds
for all x ∈ Rn , where n = |V | that

x^T L_G x = ∑_{i < j} w_{i,j} (x_i − x_j)².   (86)

In particular,

cut(S) = (1/4) y^T L_G y,   (87)

for y ∈ {±1}^n with y_i = 1 if and only if v_i ∈ S.

Proof. By associativity we have that

∑_{i < j} w_{i,j} (x_i − x_j)² = ∑_{i < j} w_{i,j} ((e_i − e_j)^T x)²
 = ∑_{i < j} w_{i,j} (x^T (e_i − e_j))((e_i − e_j)^T x)
 = ∑_{i < j} w_{i,j} x^T (e_i − e_j)(e_i − e_j)^T x.

Using now the linearity as well as (85) yields that

∑_{i < j} w_{i,j} (x_i − x_j)² = x^T L_G x.

Remark 19.3. Proposition 19.1 shows that we can compute the cut by computing a quadratic form involving
the graph Laplacian. Minimising expressions of the form x T Ax for positive semidefinite matrices reduces to an
eigenvalue problem and can be considered simple. We have two issues though: First, we do not want to minimise
such an expression over general x ∈ Rn , but only over x that take values in {±1}. Second, a minimiser of the cut
is actually very easy to compute. In fact S = ∅ or S = V always yield cut(S) = 0, the minimal value. This is,
however, not the minimiser we were looking for.
To address the issues raised in Remark 19.3, we first introduce a different notion of a cut that promotes
balanced partitions.
Definition 19.5. Let G = (V, E, W ) be a weighted graph and S ⊂ V be a vertex partition. We define the weighted
cut of G as

Ncut(S) := cut(S)/vol(S) + cut(S^c)/vol(S^c) = cut(S) ( 1/vol(S) + 1/vol(S^c) ),

where vol(S) = ∑i∈S deg(i ). Here one needs to decide on a convention for S = ∅ or S = V.

The following proposition holds:
Proposition 19.2. Let G = (V, E, W ) be a weighted graph and S ⊂ V be a vertex partition. It holds that

Ncut(S) = y^T L_G y,

where

y_i = ( vol(S^c) / (vol(S) vol(V)) )^{1/2}   if i ∈ S,
y_i = − ( vol(S) / (vol(S^c) vol(V)) )^{1/2}   if i ∈ S^c.

Proof. It holds by (86) that

y^T L_G y = (1/2) ∑_{i,j ∈ V} w_{i,j} (y_i − y_j)²
 = ∑_{i ∈ S} ∑_{j ∈ S^c} w_{i,j} (y_i − y_j)²
 = ∑_{i ∈ S} ∑_{j ∈ S^c} w_{i,j} ( (vol(S^c)/(vol(S) vol(V)))^{1/2} + (vol(S)/(vol(S^c) vol(V)))^{1/2} )²
 = ∑_{i ∈ S} ∑_{j ∈ S^c} w_{i,j} ( vol(S^c)/(vol(S) vol(V)) + 2 ( vol(S^c) vol(S) / (vol(S) vol(V) vol(S^c) vol(V)) )^{1/2} + vol(S)/(vol(S^c) vol(V)) )
 = ∑_{i ∈ S} ∑_{j ∈ S^c} w_{i,j} ( vol(S^c)/(vol(S) vol(V)) + 2/vol(V) + vol(S)/(vol(S^c) vol(V)) )
 = ∑_{i ∈ S} ∑_{j ∈ S^c} (w_{i,j}/vol(V)) ( vol(S^c)/vol(S) + 2 + vol(S)/vol(S^c) ).

Using that vol(S)/vol(S) + vol(S^c)/vol(S^c) = 2, we obtain that

y^T L_G y = ∑_{i ∈ S} ∑_{j ∈ S^c} (w_{i,j}/vol(V)) ( (vol(S^c) + vol(S))/vol(S) + (vol(S) + vol(S^c))/vol(S^c) )
 = ∑_{i ∈ S} ∑_{j ∈ S^c} w_{i,j} ( 1/vol(S) + 1/vol(S^c) ) = Ncut(S).

Using Proposition 19.2, we can now rewrite the minimisation of Ncut as a minimisation problem involving the graph Laplacian:

min_y y^T L_G y   subject to   (88)
y ∈ {a, b}^n for some a, b ∈ R,
y^T D y = 1,
y^T D 1 = 0.

Proposition 19.3. Let G = (V, E, W ) be a weighted graph and S ⊂ V be a vertex partition. S is a minimum of
Ncut if and only if y is a minimiser of (88) with yi ∈ { a, b} for some a, b ∈ R and all i ∈ [n] and S = {i ∈
[ n ] : y i = a }.

Proof. We only need to show that, for every minimiser y of (88), the entries are of the form prescribed in
Proposition 19.3 as well as that every y of the form given by Proposition 19.3 satisfies the constraints of
(88). This is left as an exercise for the reader.

Unfortunately, minimising (88) is not simple at all. In fact, it can be shown to be an NP-hard problem. However, we can simplify this problem by relaxing the condition that y can only take two values:

min_y y^T L_G y   subject to   (89)
y ∈ R^n,   (90)
y^T D y = 1,
y^T D 1 = 0.

This optimisation problem is simple. In fact, it is equivalent to an eigenvalue problem. Set z = D^{1/2} y; then we obtain the problem

min_z z^T 𝓛_G z   subject to   (91)
z ∈ R^n,
‖z‖₂ = 1,
(D^{1/2} 1)^T z = 0,

where 𝓛_G = D^{−1/2} L_G D^{−1/2} is the normalised graph Laplacian. To solve (91), we first observe that

𝓛_G D^{1/2} 1 = D^{−1/2} L_G 1 = D^{−1/2} (D − W) 1 = 0.

Hence, D^{1/2} 1 is an eigenvector associated to the smallest eigenvalue of 𝓛_G. We recall the following consequence of the Courant–Fischer theorem:
Theorem 19.1. For a symmetric matrix A ∈ R^{n×n} it holds that

λ_2(A) = min_{‖x‖=1, x ⊥ v_1} x^T A x = v_2^T A v_2,

where λ_2 is the second smallest eigenvalue of A and v_2 is an associated eigenvector. Moreover, v_1 is an eigenvector associated to the smallest eigenvalue of A.
We conclude that a solution of (91) is given by an eigenvector associated to the second smallest eigenvalue of 𝓛_G.
This is the motivation for the following algorithm:

SPECTRAL CLUSTERING: Input: Graph G = (V, E, W), threshold τ ∈ R.

1. construct 𝓛_G = D^{−1/2} (D − W) D^{−1/2}.
2. compute φ_2 = D^{−1/2} v_2, where v_2 is an eigenvector associated to the second smallest eigenvalue of 𝓛_G.
3. set S_τ := {i ∈ V : φ_2(i) ≤ τ}.
4. return S_τ.
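A minimal numpy sketch of the spectral clustering procedure above; the small graph of two triangles joined by a weak edge is a toy example chosen for illustration.

```python
import numpy as np

def spectral_cluster(W, tau=0.0):
    """Spectral clustering as above: threshold phi_2 = D^{-1/2} v_2."""
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.diag(deg) - W                       # graph Laplacian
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt       # normalised graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L_norm)  # eigenvalues in ascending order
    v2 = eigvecs[:, 1]                         # eigenvector of the 2nd smallest eigenvalue
    phi2 = D_inv_sqrt @ v2
    return phi2 <= tau

# two triangles joined by a single weak edge
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1          # weak connection between the two groups
print(spectral_cluster(W))        # the two triangles are separated
```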

Figure 26: Eigenvalues and D −1/2 v1 and D −1/2 v2 for v1 , v2 the eigenvectors associated to the smallest and
second smallest eigenvalue of LG for the two graphs of Figure 25.

The relationship of spectral clustering to the problem (88) or equivalently the minimisation of the Ncut
problem is given by the following result:
Theorem 19.2. Let G = (V, E, W) be a weighted graph. There exists a threshold τ ∈ R such that

λ_2(𝓛_G) ≲ Ncut(S_τ) ≲ √(λ_2(𝓛_G)),

where S_τ is the result of spectral clustering with parameter τ. Here all implicit constants are less than 4.

Figure 27: Spectral clustering is successful for the non-convex clustering problem above.

20 Lecture 20 - Dimensionality reduction
We have run into the curse of dimension when we analysed the k-nearest neighbour algorithm. We found
that, for some algorithms to work, it is beneficial if the input dimension is small. Sometimes it is possible
to reduce the dimension of the data space, without really compromising the data. For example, if the data
lies in a low dimensional subspace, it is conceivable that we could restrict our learning problem to this
low dimensional subspace and thereby simplify it. A bit more involved is the situation if the data only
lies on or close to a low-dimensional non-linear manifold. In both situations, we certainly still need to find the low-dimensional structure before reducing the problem to the simplified setup. Doing this is called dimensionality reduction.

20.1 Principal component analysis


We assume that we are given n ∈ N data points (x_i)_{i=1}^n which lie in a high-dimensional space R^d. For some reason, we believe that we can also represent these data points in a p-dimensional space, where p < d. Our idea is to project onto a suitable p-dimensional affine subspace of R^d. One way of doing this is by looking for orthonormal vectors v_1, …, v_p ∈ R^d, a vector µ ∈ R^d, as well as coefficients β_i ∈ R^p for i ∈ [n] such that

x_i ≈ µ + ∑_{k=1}^p (β_i)_k v_k = µ + Vβ_i,

where V is the d × p matrix whose k-th column is v_k. We only need to decide what we mean by ≈. In the case of principal component analysis (PCA), we choose µ, V, (β_i)_{i=1}^n such that the least squares error is small. Concretely, we are looking for µ, V, β that attain the following minimum:

min_{µ, V, β : V^T V = Id} ∑_{i=1}^n ‖x_i − µ − Vβ_i‖₂².   (92)

Finding the minimiser of (92) can be done in multiple steps. First we restrict ourselves to minimisers satisfying ∑_{i=1}^n β_i = 0. Indeed, if ∑_{i=1}^n β_i = λ ≠ 0, then we can replace β_i by β̃_i = β_i − λ/n and observe that

∑_{i=1}^n ‖x_i − µ − Vβ_i‖₂² = ∑_{i=1}^n ‖x_i − µ̃ − Vβ̃_i‖₂²,   (93)

where µ̃ = µ + Vλ/n.
Under this assumption on the β's, we first seek to find µ. This is done by looking for a stationary point of (92) in µ, i.e., a µ* such that

∇_µ ∑_{i=1}^n ‖x_i − µ* − Vβ_i‖₂² = 0.   (94)

We compute:

∇_µ ∑_{i=1}^n ‖x_i − µ − Vβ_i‖₂² = −2 ∑_{i=1}^n (x_i − µ − Vβ_i) = 2nµ − 2 ∑_{i=1}^n x_i + 2V ∑_{i=1}^n β_i = 2nµ − 2 ∑_{i=1}^n x_i.

Combining the computation above with (94) yields that µ* = (1/n) ∑_{i=1}^n x_i is the sample mean of the data points.
points.
We find the β_i's next. Since V has orthonormal columns, for a fixed i ∈ [n], the solution of

min_{β_i} ‖x_i − µ − Vβ_i‖₂²

is given by β_i = V^T(x_i − µ). We only need to check that this choice of β_i satisfies ∑_{i=1}^n β_i = 0. This follows immediately from the linearity of V^T, since µ* is the sample mean.
In the final step, we would like to find V. By the previous computations the problem reduces to

min_{V^T V = Id} ∑_{i=1}^n ‖x_i − µ* − VV^T(x_i − µ*)‖₂².   (95)

The binomial formula yields that

‖x_i − µ* − VV^T(x_i − µ*)‖₂² = (x_i − µ*)^T(x_i − µ*) − 2(x_i − µ*)^T VV^T(x_i − µ*) + (x_i − µ*)^T VV^T VV^T(x_i − µ*)
 = (x_i − µ*)^T(x_i − µ*) − (x_i − µ*)^T VV^T(x_i − µ*),

where we used V^T V = Id. The first term above does not depend on V and so we observe that the optimisation problem (95) is equivalent to

max_{V^T V = Id} ∑_{i=1}^n (x_i − µ*)^T VV^T(x_i − µ*).   (96)

Now we use some of the magic of the trace operator. We denote by Tr(A) = ∑_i A_{ii} the trace of a square matrix A. Note that for a scalar λ ∈ R it holds that Tr(λ) = λ. It is also well known that for matrices A ∈ R^{m×n} and B ∈ R^{n×m} it holds that Tr(BA) = Tr(AB). Also, directly from the definition, we have that Tr(A) = Tr(A^T). It holds that

max_{V^T V = Id} ∑_{i=1}^n (x_i − µ*)^T VV^T(x_i − µ*) = max_{V^T V = Id} ∑_{i=1}^n Tr( (x_i − µ*)^T VV^T(x_i − µ*) )
 = max_{V^T V = Id} ∑_{i=1}^n Tr( V^T(x_i − µ*)(x_i − µ*)^T V )
 = max_{V^T V = Id} (n − 1) Tr( V^T Σ_n V ),   (97)

where Σ_n = (1/(n−1)) ∑_{i=1}^n (x_i − µ*)(x_i − µ*)^T is the sample covariance matrix. It is not hard to see that Tr(V^T Σ_n V) is maximised by choosing V = (v_1, …, v_p), where v_1, …, v_p are the p eigenvectors associated to the p largest eigenvalues of Σ_n.
We introduced PCA in (92) as the affine subspace that best approximates the data points in a least squares sense. Interestingly, there is a second interpretation of PCA. Indeed, we can equivalently characterise the subspace of PCA as the one that maximises the variance of the projected data points. Indeed,

max_{V^T V = Id} ∑_{k=1}^n ‖V^T x_k − (1/n) ∑_{r=1}^n V^T x_r‖₂² = max_{V^T V = Id} ∑_{k=1}^n ‖V^T(x_k − µ*)‖₂²
 = max_{V^T V = Id} (n − 1) Tr( V^T Σ_n V ),

where we applied the computation (97) in the last equation.
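As a small illustration (not part of the derivation above), the following sketch implements PCA exactly along these lines: centre the data with the sample mean, compute the sample covariance, and project onto the eigenvectors of its largest eigenvalues. The synthetic data set is an arbitrary choice.

```python
import numpy as np

def pca(X, p):
    """Project the rows of X onto the top-p principal components."""
    mu = X.mean(axis=0)                        # mu* = sample mean
    Sigma = np.cov(X, rowvar=False)            # sample covariance Sigma_n
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending order
    V = eigvecs[:, ::-1][:, :p]                # eigenvectors of the p largest eigenvalues
    beta = (X - mu) @ V                        # coefficients beta_i = V^T (x_i - mu*)
    X_approx = mu + beta @ V.T                 # approximations mu* + V beta_i
    return beta, X_approx

# data that is approximately two-dimensional inside R^5
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 5))
X = Z @ A + 0.01 * rng.normal(size=(200, 5))
beta, X_approx = pca(X, p=2)
print('mean reconstruction error:', np.mean(np.sum((X - X_approx)**2, axis=1)))
```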

20.2 Johnson-Lindenstrauss embedding


Assume we have n points in Rd , which we call ( xi )in=1 . Now we want to find a low dimensional represen-
tation of the ( xi )in=1 such that the pairwise distances of the xi are not distorted by too much. Let us make
the phrase "not distorted by too much" a bit more precise.

Definition 20.1. For (x_i)_{i=1}^n ⊂ R^d and ε ≥ 0 we call a map f : R^d → R^p an ε-isometry if for all i, j ∈ [n]

(1 − ε)‖x_i − x_j‖₂ ≤ ‖f(x_i) − f(x_j)‖₂ ≤ (1 + ε)‖x_i − x_j‖₂.   (98)

Now it is clear that a 0-isometry exists if p ≥ min{d, n}, since in this case the (x_i)_{i=1}^n lie in a p-dimensional space and we can simply project onto this space without distorting the pairwise distances at all. Nonetheless, the question arises how small p can be to still allow for an ε-isometry. The following theorem yields a bound on p such that a linear ε-isometry exists.
Theorem 20.1. Let n, d, p ∈ N, n ≥ 4, and 0 < ε < 1/2 be such that

p ≥ (20/ε²) log n.

Then, for any set (x_i)_{i=1}^n ⊂ R^d, there exists a linear ε-isometry f : R^d → R^p.


The proof of this theorem is based on a probabilistic argument. We will need the following concentration
inequality.

Lemma 20.1 ([5, Lemma 15.3]). Let x ∈ R^d, p < d and assume that A is a p × d matrix with every entry independently normally distributed. Then it holds that for every 0 < ε < 1/2

P( (1 − ε)‖x‖₂ ≤ ‖(1/√p) A x‖₂ ≤ (1 + ε)‖x‖₂ ) ≥ 1 − 2e^{−(ε² − ε³)p/4}.

Figure 28: Projection of 3 points onto three subspaces. The projection onto the purple subspace distorts
the pairwise distances the least.

Proof of Theorem 20.1. Let p be as in the statement of the theorem and choose f = √1p A, where A is a p × d
matrix with all entries i.i.d normal distributed. Let xi , x j be two points in ( xi )in=1 then it holds that with
2 3
probability at least 1 − 2e−(e −e ) p/4

(1 − e)k xi − x j k2 ≤ k f ( xi ) − f ( x j )k2 ≤ (1 + e)k xi − x j k2 .

109
There are $\binom{n}{2} \le n^2$ many pairs of data points in $(x_i)_{i=1}^n$ and hence, by a union bound,

$$\mathbb{P}\big[\exists\, i, j \in [n],\ i < j: \|f(x_i) - f(x_j)\|_2 / \|x_i - x_j\|_2 \notin (1 - \varepsilon, 1 + \varepsilon)\big]$$
$$\le \sum_{i, j \in [n],\ i < j} \mathbb{P}\big[\|f(x_i) - f(x_j)\|_2 / \|x_i - x_j\|_2 \notin (1 - \varepsilon, 1 + \varepsilon)\big] \le 2n^2 e^{-(\varepsilon^2 - \varepsilon^3)p/4}.$$

Choosing $p \ge \frac{20}{\varepsilon^2} \log n$ implies that

$$2n^2 e^{-(\varepsilon^2 - \varepsilon^3)p/4} \le 2n^2 e^{-5(1 - \varepsilon)\log n} = 2e^{-(3 - 5\varepsilon)\log n} < 2e^{-\frac{1}{2}\log n} \le 1$$

for $n \ge 4$, where we used that $\varepsilon < 1/2$. Hence, the probability that $f$ is an $\varepsilon$-isometry for $(x_i)_{i=1}^n$ is positive, which implies that such an $f$ exists.
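Since the proof is constructive up to the random choice of $A$, it translates directly into code. The following sketch (assuming numpy; the function names are ours) draws $f = A/\sqrt{p}$ with i.i.d. standard normal entries, uses the $p$ prescribed by Theorem 20.1, and reports the worst pairwise distortion:

```python
import numpy as np

def jl_embedding(X, eps, rng=None):
    """Map the rows of X (n points in R^d) to R^p with p = ceil(20/eps^2 * log n),
    using f(x) = A x / sqrt(p) with i.i.d. standard normal entries in A."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    p = int(np.ceil(20 / eps**2 * np.log(n)))
    A = rng.normal(size=(p, d))
    return X @ A.T / np.sqrt(p), p

def worst_distortion(X, Y):
    """Largest value of | ||y_i - y_j||_2 / ||x_i - x_j||_2 - 1 | over all pairs."""
    worst = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            ratio = np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
            worst = max(worst, abs(ratio - 1))
    return worst

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5000))      # n = 100 points in R^5000
Y, p = jl_embedding(X, eps=0.3, rng=2)
print(p)                              # 1024 = ceil(20/0.09 * log 100)
print(worst_distortion(X, Y) < 0.3)   # typically True, with much room to spare
```

The constant 20 in the theorem is generous; in practice a noticeably smaller $p$ often already yields the desired distortion.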

20.3 Diffusion maps


Principal component analysis as well as the Johnson-Lindenstrauss embedding compute a linear embedding into a lower dimensional space. Quite often, however, the intrinsic dimension is not adequately captured by a linear subspace. Consider, for example, Figure 29, where the Swiss roll data set is supposed to be mapped to a lower dimensional space. We see in the middle that the projection approach via PCA fails to capture the intrinsic structure of the Swiss roll. On the right we find the method discussed in this section, which seems to do a much better job.

Figure 29: Swiss roll data set on the left. The middle figure shows an embedding resulting from projecting
on a one dimensional subspace identified by PCA. On the right is the embedding obtained by the diffusion
embedding discussed in this section.

It seems obvious that we do not want to maintain all pairwise distances in this nonlinear setting any longer. Instead, we would rather like to have a map that in some sense respects the intrinsic distances of the points. To come up with a sensible notion of distance between points in a point cloud that also respects the intrinsic structure, we consider the following example:
Example 20.1. Consider the heat equation on a two dimensional domain: for $u_0 \in L^2(\mathbb{R}^2)$ and $\gamma > 0$,

$$\frac{\partial}{\partial t} u(t, x) - \gamma \Delta_x u(t, x) = 0 \quad \text{for } (t, x) \in [0, T] \times \mathbb{R}^2, \qquad (99)$$
$$u(0, \cdot) = u_0. \qquad (100)$$
If $u_0$ is a small bump function centered at a point $t_1$ and $\tilde{u}_0$ is a second bump function centered at $t_2$, and both $\tilde{u}_0$ and $u_0$ have compact and disjoint support, then $\|\tilde{u}_0 - u_0\|_{L^2}^2 = \|\tilde{u}_0\|_{L^2}^2 + \|u_0\|_{L^2}^2$. From this information alone, we obtain very little information about the distance of $t_1$ from $t_2$. However, if we instead look at $\|\tilde{u}(T, \cdot) - u(T, \cdot)\|_{L^2}$, where $\tilde{u}$ and $u$ are the solutions of (99) with initial conditions $\tilde{u}_0$ and $u_0$ respectively, then this may give us a more nuanced description of the distance between $t_1$ and $t_2$. Consider Figure 30. There, we see that after some time has passed the distances between the heat profiles are closely related to the distances between the initial heat sources.

Figure 30: First and second row: Heat diffusion associated to two sources. Third row: relationship be-
tween distance of initial heat sources and heat profiles after two seconds of diffusion.

The interesting thing about the diffusion is that we can also use it to make sense of a distance on a non-Euclidean domain.

Figure 31: Diffusion on a map with a slit. The heat profile after two seconds now describes quite accurately
the distance of points when taking into account the more involved geometry. Indeed, the two lower rows
have a very similar heat profile after two seconds since the associated heat sources lie on the same side of
the slit.

Example 20.1 showed us that, if we define a distance via the heat equation, then this will oftentimes respect the underlying geometry. We will now do something similar for general graphs. Let $(V, E, W)$ be a weighted graph. We define a random walk $X$ on $V$ by

$$\mathbb{P}(X(t + 1) = v_j \mid X(t) = v_i) = w_{i,j} / \deg(i). \qquad (101)$$

For the graph corresponding to the slit domain of Figure 31, where every pixel in the white part of the
image is one vertex of the graph and vertices corresponding to neighbouring pixels are connected by an
edge with weight 1, we run some examples of the random walk in Figure 32.
We denote the matrix of transition probabilities by $M$, where $M_{i,j} = w_{i,j} / \deg(i)$. Note that $M = D^{-1} W$, where $D$ is a diagonal matrix with $D_{i,i} = \deg(i)$. If we start the random walk $X$ in the node $v_i$, i.e., $X(0) = v_i$, then we can compute the probability that $X(t) = v_j$ by

$$\mathbb{P}\big(X(t) = v_j \mid X(0) = v_i\big) = (M^t)_{i,j}.$$




As a result, we have that

$$\Big(\mathbb{P}\big(X(t) = v_j \mid X(0) = v_i\big)\Big)_{j=1}^{n} = e_i^T M^t = M^t(i, \cdot) \in \mathbb{R}^n. \qquad (102)$$

The map $v_i \mapsto M^t(i, \cdot)$ can now be considered as a map from $V$ to $\mathbb{R}^n$. It maps each node to the probability distribution of the random walk started in that node after $t$ steps. While we have observed that this embedding seems to reflect the inner geometry of the problem, it is hardly a dimensionality reduction since $n$ may be quite large.
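As a small illustration (a sketch assuming numpy; the path graph below is ours and $\deg(i) = \sum_j w_{i,j}$ as in the definition of the random walk), the embedding $v_i \mapsto M^t(i, \cdot)$ can be computed directly from the weight matrix $W$:

```python
import numpy as np

# Illustrative weighted graph: a path  v_0 - v_1 - v_2 - v_3  with unit weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

deg = W.sum(axis=1)                 # deg(i) = sum_j w_{i,j}
M = W / deg[:, None]                # M = D^{-1} W, rows sum to one

t = 5
Mt = np.linalg.matrix_power(M, t)   # (M^t)_{i,j} = P(X(t) = v_j | X(0) = v_i)

# Row i of M^t is the distribution of the walk after t steps started in v_i,
# i.e. the embedding v_i -> M^t(i, .) of (102).
print(Mt[0])
```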

Figure 32: Three random walks starting at the same points as the heat sources of Figure 31. The points
are marked by a blue dot in the images. The resulting distribution is very similar to the heat profiles of
Figure 31.

To reduce the dimension of the embedding, we truncate the size of the matrix $M$ by a spectral decomposition. First of all, we notice that

$$S = D^{1/2} M D^{-1/2} = D^{-1/2} W D^{-1/2}$$

is a symmetric matrix and can therefore be diagonalised by an orthogonal matrix, i.e.,

$$S = V \Lambda V^T$$

for a matrix $V$ such that $V^T V = \mathrm{Id}$ and a diagonal matrix $\Lambda$ with $\Lambda_{i,i} \ge \Lambda_{i+1,i+1}$ for all $i \in [n-1]$. Now we have that

$$M = D^{-1/2} S D^{1/2} = (D^{-1/2} V) \Lambda (V^T D^{1/2}) =: \Phi \Lambda \Psi^T.$$

Because of this, we can write

$$M = \sum_{k=1}^{n} \lambda_k \phi_k \psi_k^T,$$

where $\lambda_k = \Lambda_{k,k}$ and $\phi_k$ and $\psi_k$ are the $k$-th columns of $\Phi$ and $\Psi$, respectively. Note also that by construction $\Phi \Psi^T = \mathrm{Id}$ and hence

$$M^t = \Phi \Lambda^t \Psi^T = \sum_{k=1}^{n} \lambda_k^t \phi_k \psi_k^T.$$

For the map $v_i \mapsto M^t(i, \cdot)$ we now have that

$$v_i \mapsto \sum_{k=1}^{n} \lambda_k^t \phi_k(i) \psi_k^T. \qquad (103)$$

Before we turn this representation into the so-called diffusion map, we observe that $\phi_1 = \mathbb{1}$, where $\mathbb{1}$ denotes the constant all-ones vector (up to normalisation). Indeed, it holds that $M \mathbb{1} = \mathbb{1}$ because the rows of $M$ sum to one. Therefore, $\phi_1 = \mathbb{1}$ holds if $1$ is the largest eigenvalue of $M$.
Proposition 20.1. All eigenvalues of M are bounded by 1 in absolute value.

Proof. Let $\phi_k$ be a right eigenvector of $M$ with eigenvalue $\lambda_k$ and let $i_{\max}$ be such that $|\phi_k(i_{\max})| = \max_{i \in [n]} |\phi_k(i)| > 0$. Replacing $\phi_k$ by $-\phi_k$ if necessary, we may assume $\phi_k(i_{\max}) > 0$. Then

$$\lambda_k \phi_k(i_{\max}) = (M \phi_k)(i_{\max}) = \sum_{j=1}^{n} M_{i_{\max}, j}\, \phi_k(j).$$

Since $M_{i_{\max}, j} \ge 0$, $\sum_{j=1}^{n} M_{i_{\max}, j} = 1$, and $\phi_k(j) \le \phi_k(i_{\max})$ for all $j \in [n]$, this implies that

$$\lambda_k = \sum_{j=1}^{n} M_{i_{\max}, j}\, \frac{\phi_k(j)}{\phi_k(i_{\max})} \le \sum_{j=1}^{n} M_{i_{\max}, j} = 1.$$

Similarly, since $\phi_k(j) \ge -\phi_k(i_{\max})$ for all $j \in [n]$, we obtain that

$$\lambda_k = \sum_{j=1}^{n} M_{i_{\max}, j}\, \frac{\phi_k(j)}{\phi_k(i_{\max})} \ge -\sum_{j=1}^{n} M_{i_{\max}, j} = -1.$$

By this proposition and the previous discussion, we conclude that $\phi_1 = \mathbb{1}$, which hence does not carry any information about $G$.
Definition 20.2. Given a graph $G = (V, E, W)$, let $M = \Phi \Lambda \Psi^T$ be as above. For $t \in \mathbb{N}$, the diffusion map $\varphi_t$ is defined as

$$\varphi_t: V \to \mathbb{R}^{n-1}, \qquad \varphi_t(v_i) = \begin{pmatrix} \lambda_2^t \phi_2(i) \\ \lambda_3^t \phi_3(i) \\ \vdots \\ \lambda_n^t \phi_n(i) \end{pmatrix}.$$
The diffusion map is still not really performing dimensionality reduction. However, if we assume that many eigenvalues are significantly smaller than 1, then for sufficiently large $t$ the values $\lambda_k^t$ will be very small. Hence, we believe that we can drop the associated dimensions from the embedding without significantly affecting the embedding's quality.

Definition 20.3. Given a graph $G = (V, E, W)$, let $M = \Phi \Lambda \Psi^T$ be as above. For $p \in [n-1]$ and $t \in \mathbb{N}$, the truncated diffusion map $\varphi_t^{(p)}$ is defined as

$$\varphi_t^{(p)}: V \to \mathbb{R}^{p}, \qquad \varphi_t^{(p)}(v_i) = \begin{pmatrix} \lambda_2^t \phi_2(i) \\ \lambda_3^t \phi_3(i) \\ \vdots \\ \lambda_{p+1}^t \phi_{p+1}(i) \end{pmatrix}.$$
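A minimal implementation sketch of the truncated diffusion map following the construction above (assuming numpy; the function name truncated_diffusion_map and the example graph are ours):

```python
import numpy as np

def truncated_diffusion_map(W, p, t):
    """Truncated diffusion map of a weighted graph given by its weight matrix W.

    Returns an (n, p) array whose i-th row is phi_t^{(p)}(v_i), computed from
    S = D^{-1/2} W D^{-1/2} = V Lambda V^T and Phi = D^{-1/2} V.
    """
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # symmetric matrix S
    lam, V = np.linalg.eigh(S)                          # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]                      # descending; lam[0] = 1
    Phi = d_inv_sqrt[:, None] * V                       # right eigenvectors of M
    # Drop the constant eigenvector phi_1 and keep the next p coordinates,
    # scaled by the t-th powers of the corresponding eigenvalues.
    return (lam[1:p + 1] ** t) * Phi[:, 1:p + 1]

# Example: the path graph from before, embedded into R^2 with t = 3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(truncated_diffusion_map(W, p=2, t=3))
```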

Let us conclude this section by proving that the diffusion map does indeed do what we wanted it to do, namely produce an embedding in which the distances correspond to the distances between the probability distributions of the random walk.
Proposition 20.2. Let $G = (V, E, W)$ be a weighted graph, let $M = \Phi \Lambda \Psi^T$ be as above, and let $X$ be a random walk as above. For every pair of nodes $v_i, v_j \in V$ and for every $t \in \mathbb{N}$ it holds that

$$\|\varphi_t(v_i) - \varphi_t(v_j)\|_2^2 = \sum_{k=1}^{n} \frac{1}{\deg(k)} \big(\mathbb{P}(X(t) = v_k \mid X(0) = v_i) - \mathbb{P}(X(t) = v_k \mid X(0) = v_j)\big)^2.$$

Proof. By (102), it follows that

$$\sum_{k=1}^{n} \frac{1}{\deg(k)} \big(\mathbb{P}(X(t) = v_k \mid X(0) = v_i) - \mathbb{P}(X(t) = v_k \mid X(0) = v_j)\big)^2$$
$$= \sum_{k=1}^{n} \frac{1}{\deg(k)} \Big( \sum_{\ell=1}^{n} \lambda_\ell^t \phi_\ell(i) \psi_\ell(k) - \sum_{\ell=1}^{n} \lambda_\ell^t \phi_\ell(j) \psi_\ell(k) \Big)^2$$
$$= \sum_{k=1}^{n} \frac{1}{\deg(k)} \Big( \sum_{\ell=1}^{n} \lambda_\ell^t (\phi_\ell(i) - \phi_\ell(j)) \psi_\ell(k) \Big)^2$$
$$= \sum_{k=1}^{n} \Big( \sum_{\ell=1}^{n} \lambda_\ell^t (\phi_\ell(i) - \phi_\ell(j)) \frac{\psi_\ell(k)}{\sqrt{\deg(k)}} \Big)^2$$
$$= \sum_{k=1}^{n} \Big( \sum_{\ell=1}^{n} \lambda_\ell^t (\phi_\ell(i) - \phi_\ell(j)) \big(D^{-1/2} \psi_\ell\big)(k) \Big)^2$$
$$= \Big\| \sum_{\ell=1}^{n} \lambda_\ell^t (\phi_\ell(i) - \phi_\ell(j)) D^{-1/2} \psi_\ell \Big\|_2^2.$$

By construction, $D^{-1/2} \psi_\ell = v_\ell$, where $(v_\ell)_{\ell=1}^{n}$ denote the eigenvectors of $S = D^{1/2} M D^{-1/2}$. Since $(v_\ell)_{\ell=1}^{n}$ form an orthonormal basis, Parseval's identity yields

$$\Big\| \sum_{\ell=1}^{n} \lambda_\ell^t (\phi_\ell(i) - \phi_\ell(j)) v_\ell \Big\|_2^2 = \sum_{\ell=1}^{n} \big(\lambda_\ell^t (\phi_\ell(i) - \phi_\ell(j))\big)^2 = \sum_{\ell=2}^{n} \big(\lambda_\ell^t (\phi_\ell(i) - \phi_\ell(j))\big)^2 = \|\varphi_t(v_i) - \varphi_t(v_j)\|_2^2,$$

where the term for $\ell = 1$ vanishes because $\phi_1$ is a constant vector and hence $\phi_1(i) = \phi_1(j)$.
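The identity of Proposition 20.2 is also easy to check numerically. The following sketch (assuming numpy; the random weighted graph is purely illustrative) compares both sides for the full diffusion map, i.e. $p = n - 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 6, 3
W = rng.uniform(0.1, 1.0, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)                        # a random fully connected weighted graph

deg = W.sum(axis=1)
M = W / deg[:, None]                            # M = D^{-1} W
Mt = np.linalg.matrix_power(M, t)

# Full diffusion map (p = n - 1), as in Definition 20.2.
d_inv_sqrt = 1.0 / np.sqrt(deg)
S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
lam, V = np.linalg.eigh(S)
lam, V = lam[::-1], V[:, ::-1]
Phi = d_inv_sqrt[:, None] * V
diffusion = (lam[1:] ** t) * Phi[:, 1:]         # rows are phi_t(v_i)

i, j = 0, 3
lhs = np.sum((diffusion[i] - diffusion[j]) ** 2)   # ||phi_t(v_i) - phi_t(v_j)||_2^2
rhs = np.sum((Mt[i] - Mt[j]) ** 2 / deg)           # weighted distance of the two distributions
print(np.isclose(lhs, rhs))                        # True
```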

Example 20.2. Using the diffusion map, we can produce a visual representation of data of which we only know the relationship between each pair of elements. For example, consider the situation that we only know which central European country has which neighbours, as given in Figure 33.

Figure 33: Adjacency matrix of countries in Europe. Countries that have a common border are considered
to be connected by an edge.

From the local neighbourhood relationships of the countries alone, we cannot really see what central Europe looks like. We can, however, apply a truncated diffusion map to the graph associated to that adjacency matrix and obtain the embedding into two-dimensional space shown in Figure 34.

Figure 34: Embedding of the graph of Figure 33 in the two-dimensional plane. The positions coincide
somewhat with the locations of the countries on a real map. On the right, Voronoi regions are drawn.
They do not yield the correct borders. This is not surprising since Voronoi regions are necessarily convex,
but countries are not.
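A sketch of how such an embedding could be produced with the truncated_diffusion_map function from the earlier sketch (the small adjacency matrix below is a stand-in; the actual data behind Figures 33 and 34 is not reproduced here):

```python
import numpy as np

# Stand-in adjacency matrix (1 = shared border) for a handful of countries.
countries = ["AT", "DE", "CH", "IT", "CZ"]
A = np.array([[0, 1, 1, 1, 1],
              [1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0]], dtype=float)

# Reuse the truncated_diffusion_map sketch from above with p = 2, t = 2.
emb = truncated_diffusion_map(A, p=2, t=2)
for name, (x, y) in zip(countries, emb):
    print(f"{name}: ({x:+.3f}, {y:+.3f})")
```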

References
[1] Peter L. Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear VC-dimension bounds for piecewise polynomial networks. Neural Computation, 10(8):2159–2173, 1998.

[2] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[4] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.

[5] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
