Machine Learning (Ulrike von Luxburg)
Table of contents
Warmup
  The kNN algorithm
Unsupervised learning
  Clustering
    K-means and kernel k-means
      Standard k-means algorithm
    Linkage algorithms for hierarchical clustering
    A glimpse on spectral graph theory
      Unnormalized Laplacians
      Normalized Laplacians
Training
  Selecting parameters by cross validation
  Feature selection
Wrap up
Mathematical Appendix
  Recap: Probability theory
    Discrete probability theory
    Continuous probability theory
  Recap: Linear algebra
  Excursion to convex optimization: primal, dual, Lagrangian
Literature
Here are some of the books we use during the lecture (just
individual chapters):
Literature (2)
I Schölkopf, Smola: Learning with kernels. MIT Press, 2002.
SVMs and kernel algorithms, for computer science audience.
I Steinwart, Christmann: Support Vector Machines. Springer,
2008. SVMs and kernel algorithms, for maths audience.
Learning theory:
I Devroye, Györfi, Lugosi: A Probabilistic Theory of Pattern
Recognition. Springer, 1996.
Literature (3)
I Kevin Murphy: Machine learning, a probabilistic perspective.
MIT Press, 2012.
Information theoretic (and Bayesian) point of view on machine
learning:
I MacKay: Information theory, inference and learning algorithms.
Cambridge University Press, 2003
Optimization:
Introduction to Machine Learning
Spam filtering
I Want to classify all emails into two classes: spam or not-spam
I Similar problem as above: hand-designed rules don’t work in
many cases
I So we are going to “train” the spam filter:
[Figure: example results with multi-frame 3D inference and explicit short-term occlusion reasoning for onboard vehicle and pedestrian tracking. Image: Christian Wojek et al., IEEE PAMI, 2013.]
Self-driving cars
First breakthrough: the DARPA Grand Challenge in 2005:
I Build an autonomous car that can find its way 100 km through
the desert.
It took decades to work out all these steps; machine learning is the driving force behind the recent success.
Bioinformatics
Machine learning is used all over the place in bioinformatics:
I Classify different types of diseases based on microarray data
(Images: Wikipedia; BiochemLabSolutions.com)
Bioinformatics (2)
[Figure from Larranaga et al., 2005: classification of the topics where machine learning methods are applied.]
Language processing
2011: Computer “Watson” wins the american quiz show
“jeopardy”. It is a bit like “Wer wird Millionär”, but not so much
about facts and more about word games.
I Speech recognition
I Object recognition
I Info Sheet
I Important: doodle for tutorial groups
I Literature
I Material covered: see table of contents
I Material NOT covered: neural networks and deep learning;
Bayesian / probabilistic approaches; reinforcement learning
I Language: English / German
instances
I Ideally, the computer should use the examples to extract a general “rule” for how the specific task has to be performed correctly.
Example:
I Premise 1: every person in this room is a student.
Example:
I We throw lots of things, very often.
I In all our experiments, the things fell down and not up.
Example:
I You come 10 minutes late to every lecture I give.
consequences.
I BUT you cannot be sure ...
Example:
Formally:
I we want to find a function out of F := Y^X (the space of all functions). This space contains 2^100 functions.
I … same number of functions with f(X′) = 1. And there is no way we can decide which one is going to be the best prediction.
I In fact, no matter how many data points we get, our
prediction on unseen points will be as bad as random guessing.
Without any further restrictions or assumptions, the problem of machine learning would be ill-posed!
I And yes, we did not talk about what happens if the true
function is not contained in F after all.
The details of all this are quite tricky. It is the big success story
of machine learning theory to work out how exactly all these
things come together.
At the end of the course you will know all this, at least roughly.
Overfitting, underfitting
Choosing a “reasonable” function class F is crucial.
Rats get two choices of water. One choice makes them feel sick,
the other one doesn’t.
Experiment 1:
I Rats learn very fast not to drink the water that makes them
sick.
f : X_i ↦ Y_i.
I But we want to exploit the extra knowledge we gain from the
unlabeled training points.
Online learning:
I Examples arrive online, one at a time.
Recap!
To be able to follow the lecture, you need to know the following material. If you cannot remember it, please recap. You can use the slides in the maths appendix, the tutorials, and any textbook ...
Linear algebra:
I Vector space, basis, linear mapping, norm, scalar product
eigenvectors
I Eigen-decomposition of symmetric matrices, positive definite
matrices
Recap! (2)
Probability theory:
I Discrete:
I Basics: Probability distribution, conditional probability, Bayes
theorem, random variables, expectation, variance,
independence,
I Distributions: joint, marginal, product distributions;
Bernoulli, Binomial distribution, Poisson distribution, uniform
distribution
I multivariate random variables: variance, covariance
I Continuous (we keep it on the intuitive level):
I Density, cumulative distribution function, expectation,
variance
I Normal distribution (univariate and multivariate), covariance
matrix.
Warmup

The kNN algorithm
Literature:
For the algorithm: Hastie, Tibshirani, Friedman, Section 2.3.2; Duda, Hart, Section 4.5. For theory (not covered in this …
Training error:
I Predict the labels of all training points: Ŷ_i := f_alg(X_i).
I err(X_i, Y_i, Ŷ_i) := 0 if Ŷ_i = Y_i, and 1 otherwise
I err_train(f_alg) = (1/n) · ∑_{i=1}^n err(X_i, Y_i, f_alg(X_i))
Remarks:
I This error obviously also depends on the training sample, but
we drop this from the notation to make it more readable. To
be really precise, one would have to write something like
err_train(f_alg, (X_i, Y_i)_{i=1,…,n})
I Later we will call this quantity the “empirical risk” of the
classifier (with respect to the 0-1-loss).
I err(X_j, Y_j, Ŷ_j) := 0 if Ŷ_j = Y_j, and 1 otherwise
I Define the test error of the classifier as the average error over all test points:
  err_test(f_alg) = (1/m) · ∑_{j=1}^m err(X_j, Y_j, f_alg(X_j))
I Let X_{i_1}, …, X_{i_k} be the first k points in this order (the k nearest neighbors of X′). We denote the set of these points by kNN(X′).
I Assign to Y′ the majority label among the corresponding labels Y_{i_1}, …, Y_{i_k}, that is, define
  Y′ = 0 if ∑_{j=1}^k Y_{i_j} ≤ k/2, and 1 otherwise.
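As a small illustration, here is a minimal sketch of this rule in Python/NumPy; the function and variable names are my own, and labels are assumed to be 0/1 as in the majority vote above.

```python
import numpy as np

def knn_predict(X_train, Y_train, x_query, k=5):
    """Predict the label of x_query by majority vote among its k nearest
    training points (Euclidean distance, labels in {0, 1})."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distances to all training points
    nn_idx = np.argsort(dists)[:k]                     # indices of the k nearest neighbors
    return int(Y_train[nn_idx].sum() > k / 2)          # majority vote as in the rule above

# toy usage
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2))
Y_train = (X_train[:, 0] > 0).astype(int)
print(knn_predict(X_train, Y_train, np.array([1.0, 0.0]), k=3))
```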
Example
[Figure: scatter plot of the two-class example data (training points drawn as 'o'). Yellow/blue little dots in the background: if such a dot were a test point, the kNN classifier would classify it as yellow/blue. Figure and accompanying text excerpt from Hastie, Tibshirani, Friedman, Section 2.3.3, "From Least Squares to Nearest Neighbors".]
I Mixture of Gaussians
[Figure: scatter plot of the example data set.]
[Figure: two error curves of the kNN classifier plotted against the number of neighbors k (k = 2, …, 12).]
d(X, X′) = ( ∑_{s=1}^{256} (X_s − X′_s)² )^{1/2}
[Figure: error curves of the kNN classifier as a function of k (k = 2, …, 12), four panels.]
Inductive bias
WHAT DO YOU THINK IS THE INDUCTIVE BIAS IN THIS
ALGORITHM? WHAT KIND OF FUNCTIONS ARE
“PREFERRED” OR “LEARNED”?
Input points that are close to each other should have the
same label
Extensions
I The kNN rule can easily be used for regression as well: as output value take the average over the labels in the neighborhood.
I kNN-based algorithms can also be used for many other tasks
such as density estimation, outlier detection, clustering, etc.
Summary
I The kNN classifier is about the simplest classifier that exists.
I But often it performs quite reasonably.
I For any type of data, always consider the kNN classifier as a
baseline.
I One can prove that in the limit of infinitely many data points,
the kNN classifier is “consistent”, that is, it learns the best …
Formal setup
I If Y = R, that is regression
Loss function
The loss function measures how “expensive” an error is:
I 0-1 loss: ℓ(x, y, y′) := 0 if y = y′, and 1 otherwise
I Squared loss: ℓ(x, y, y′) := (y − y′)²
True Risk
The true risk (or true expected loss) of a prediction function f : X → Y (with respect to loss function ℓ) is defined as R(f) := E(ℓ(X, Y, f(X))). The best possible risk and a best function are
R* := inf{ R(f) | f : X → Y, f measurable }
f* := argmin R(f)
points”, then the true risk of our learning rule fn will be arbitrarily
close to the best possible risk.
I Devroye, Section 2
[Figure: class-conditional densities of body height and the class probabilities for female/male.]
f_n(X) = m if P(Y = m) > P(Y = f), and f otherwise
[Figure: class-conditional densities, class priors, and posteriors P(Y|X) for the female/male height example.]
f_n(x) = m if P(X = x | Y = m) > P(X = x | Y = f), and f otherwise

P(Y = m | X = x) = P(X = x | Y = m) · P(Y = m) / P(X = x)

f_n(x) = m if P(Y = m | X = x) > P(Y = f | X = x), and f otherwise
Approach 3: Bayesian a posteriori criterion (2)
Visually: select according to which curve is higher.
[Figure: posteriors P(Y|X) and loss weights for the female/male example.]
I Use Bayes decision rule: Select the label fn (X) for which the
conditional risk is minimal.
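For illustration (this is not the MATLAB demo used in the lecture), here is a small Python sketch of this rule for the male/female height example; the Gaussian class-conditional densities, priors, and loss values are made-up assumptions.

```python
import numpy as np
from scipy.stats import norm

# assumed (made-up) model: class-conditional Gaussians over height, priors, and loss weights
prior   = {"f": 0.5, "m": 0.5}
density = {"f": norm(loc=165, scale=7), "m": norm(loc=178, scale=7)}
loss    = {("m", "f"): 1.0, ("f", "m"): 1.0}   # loss[(prediction, true label)], 0 if correct

def bayes_decision(x):
    # posteriors P(Y=y | X=x), computed up to the common factor P(X=x) and then normalized
    post = {y: density[y].pdf(x) * prior[y] for y in ("f", "m")}
    z = sum(post.values())
    post = {y: p / z for y, p in post.items()}
    # conditional risk R(prediction | X=x) = sum over true labels of loss * posterior
    risk = {pred: sum(loss.get((pred, true), 0.0) * post[true] for true in ("f", "m"))
            for pred in ("f", "m")}
    return min(risk, key=risk.get)   # label with minimal conditional risk

print(bayes_decision(170.0), bayes_decision(185.0))
```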
Run demo_bayesian_decision_theory.m
Example: male vs female
[Figures: several runs of demo_bayesian_decision_theory.m with different loss values. Each run shows the class-conditional height densities, the posteriors P(Y|X), the loss weights l(m, true=f) and l(f, true=m), the pointwise risk R(prediction | X = x) for prediction=f and prediction=m, and the overall risk as a function of the position of the decision threshold.]
η(x) := E(Y | X = x)
η(x) = 0 · P(Y = 0 | X = x) + 1 · P(Y = 1 | X = x) = P(Y = 1 | X = x)
WHY?
P(X = x, Y = 1) = P(Y = 1 | X = x) · P(X = x) = η(x) · µ(x)
and similarly
P(X = x, Y = 0) = P(Y = 0 | X = x) · P(X = x) = (1 − η(x)) · µ(x)

f°(x) := 1 if η(x) ≥ 1/2, and 0 otherwise
Remark:
I The theorem shows that f ◦ = f ∗ (WHY?)
P(f(x) ≠ Y | X = x)
= 1 − P(f(x) = Y | X = x)
= 1 − P(f(x) = 1, Y = 1 | X = x) − P(f(x) = 0, Y = 0 | X = x)
=(∗) 1 − 1_{f(x)=1} · P(Y = 1 | X = x) − 1_{f(x)=0} · P(Y = 0 | X = x)
= 1 − 1_{f(x)=1} · η(x) − 1_{f(x)=0} · (1 − η(x))
Comparing this expression for an arbitrary f with the same expression for f° (which by definition picks the larger of η(x) and 1 − η(x)) shows that the difference is ≥ 0 pointwise, and hence R(f) ≥ R(f°).
Plug-in classifier
Simple idea: If we don’t know the underlying distribution, but are
given some training data, simply estimate the regression function
η(x) by some quantity ηn (x) and build the plugin-classifier
f_n(x) := 1 if η_n(x) ≥ 0.5, and 0 otherwise
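A minimal sketch of the plug-in idea; the estimator η_n used here (a kNN average) is just one illustrative choice, and the names are my own.

```python
import numpy as np

def eta_knn(X_train, Y_train, x, k=10):
    """Estimate eta(x) = P(Y=1 | X=x) by the fraction of 1-labels among
    the k nearest training points (one simple choice of estimator)."""
    nn = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    return Y_train[nn].mean()

def plugin_classifier(X_train, Y_train, x, k=10):
    # threshold the estimated regression function at 0.5
    return int(eta_knn(X_train, Y_train, x, k) >= 0.5)
```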
Theoretical considerations:
I It can be shown that the plug-in approach is universally consistent.
Practical considerations:
I Estimating densities is notoriously hard, in particular for high-dimensional input spaces. So unfortunately, the plug-in approach is useless in practice.
In the following, let’s look at the classic case, the squared loss
function:
η(x) = E(Y | X = x)
Regression function (context of L2 regression)
For the squared loss, the minimizer of the true risk is the regression function itself:
f° : X → R,  f°(x) := η(x)
R_n(f) := (1/n) · ∑_{i=1}^n ℓ(X_i, Y_i, f(X_i))
The key point is that the empirical risk can be computed based on
the training points only.
f_n := argmin_{f ∈ F} R_n(f)
Overfitting vs Underfitting
Coming back to the terms underfitting and overfitting:
I Underfitting happens if F is too small. In this case we have a
small estimation error but a large approximation error.
I Overfitting happens if F is too large. Then we have a high
estimation error but a small approximation error.
= E[ (f_n(x) − E(f_n(x)))² ] + ( E(f_n(x)) − f*(x) )²
(the first term is the variance term, the second the bias term)
Note: we always have ≤ (for any loss function), but for the L2 -loss
we get equality (as we have seen in Proposition 3).
ERM, remarks
I From a conceptual/theoretical side, ERM is a straightforward learning principle.
I The key to the success / failure of ERM is to choose a “good”
function class F
I From the computational side, it is not always easy (depending on the function class and loss function, the problem can be quite …
Alternative approach:
I Let F be a very large space of functions.
Rreg,n (f ) := Rn (f ) + λ · Ω(f )
Linear methods for supervised learning
I Bishop Sec 3
Linear setup
I Assume we have training data (Xi , Yi ) with Xi ∈ X := Rd and
Yi ∈ Y := R.
I We want to find the “best” linear function, that is a function
of the form
f(x) = ∑_{i=1}^d w_i x^(i) + b
(1/n) · ∑_{i=1}^n (Y_i − f(X_i))²
Example
Want to predict the shoe size of a person, based on many input
values:
I For each person X, we have a couple of real-valued
measurements: X (1) = height, X (2) = weight, X (3) = income,
X (4) = age.
(Note: some measurements are useful for the question, some
might not be useful)
shoesize = 2/10 · height + 0 · weight + 0 · income + 0 · children + 1
Concise notation
I To write everything in a more concise form, we stack the
training inputs into a big matrix (each point is one row) and
the output in a big vector:
Getting rid of b
We want to write the problem even more concisely.
I Define X̃ as X with an additional column of ones appended (so X̃ ∈ R^{n×(d+1)}), and w̃ := (w_1, …, w_d, b)^t ∈ R^{d+1}.
I Then we have
(X̃w̃)_i = ∑_{k=1}^{d+1} X̃_ik w̃_k = ∑_{k=1}^{d} X_ik w_k + b = (Xw)_i + b
(1/n) · ∑_{i=1}^n (Y_i − (Xw)_i)² = (1/n) · ‖Y − Xw‖²
ML ⇝ Optimization problem
We can see:
I In order to solve (###), we need to solve an optimization
problem
I In this particular case, we will see in a minute that we can
solve it analytically.
I For most other ML algorithms, we need to use optimization
Proof. Exercise
I Rank of a matrix
Proof intuition.
I Want to find the minimum of the function ‖Y − Xw‖²
I Now finally note that the right hand side agrees with the k-th coordinate of the vector X^t(Y − Xw).
Proof (sketch).
I As above we get the necessary condition X^t Y = (X^t X) w.
I One can check that one particular vector w that satisfies this condition is given as w = (X^t X)^+ X^t Y (EXERCISE!)
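For illustration, the closed-form solution in NumPy; np.linalg.pinv computes the pseudo-inverse (X^t X)^+, and np.linalg.lstsq is the numerically preferable way to get the same least-squares solution. All data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=n)

w_pinv = np.linalg.pinv(X.T @ X) @ X.T @ Y        # w = (X^t X)^+ X^t Y
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # same solution, numerically preferable
print(w_pinv, w_lstsq)
```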
training points.
I But on the test points, the solutions will disagree. The
question is then which one to prefer. One idea here is to use
regularization, see below.
I d high, n low
I n ≈ d
??????????
large n is.
Case d ≪ n:
I This is the harmless case: we have many points in a
low-dimensional space. Here linear functions are not very
flexible, and we tend not to overfit (sometimes we underfit).
I
is a convex optimization problem, and we can compute its
solution analytically.
“horses”)
I For each text, count how often each word occurs
I Then represent each text as a vector: each dimension corresponds to one word, and the entry of the vector is how often this word occurs in the given text.
I Hastie/Tibshirani/Friedman Section 3
I Bishop Section 3
Examples
Example 1:
I Your data lives in Rd , but it clearly cannot be described by a
linear function of the original coordinates. Alternatively, you
can fit a function of the form
f(x) = ∑_{i=1}^D w_i Φ_i(x)
Examples (2)
I For example, if we want to learn a periodic function, the Φi
might be the first couple of Fourier functions to fit a function
of the form
g(x) = ∑_{k=1}^D w_k sin(kx)
I Or choose the functions Φ_i as the polynomials x, x², …, x^D to fit a function of the form
h(x) = ∑_{k=1}^D w_k x^k
In this way you can use a linear algorithm (find the linear
coefficients wi ) to fit non-linear functions (such as g(x) or h(x)) to
your data.
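A short sketch of this idea with a polynomial basis; the data and the choice D = 5 are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)          # non-linear target function

D = 5
Phi = np.column_stack([x ** k for k in range(1, D + 1)])   # basis functions x, x^2, ..., x^D
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                # linear least squares in w
y_hat = Phi @ w                                            # fitted non-linear function h(x)
```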
Examples (3)
Example 2: Feature spaces
The input X consists of a web page; the task is to predict how many seconds users stay on the page before they leave it again.
How to solve it
It is easy to rewrite the “standard” least squares problem in this
more general framework:
I Define the design matrix Φ ∈ R^{n×D} with entries Φ_ij := Φ_j(X_i) (one row per training point, one column per basis function).
Section 3
Idea
Want to improve standard L2 -regression. Two points of view:
w_{n,λ} := argmin_{w ∈ R^D}  (1/n) · ‖Y − Φw‖² + λ‖w‖².
Solution
Theorem 8 (Solution of Ridge Regression)
The coefficients wn,λ that solve the ridge regression problem are
given as
w_{n,λ} := ( Φ^t Φ + nλ I_D )^{-1} Φ^t Y
Solution (2)
Proof.
I Objective function is Obj(w) := (1/n) ‖Y − Φw‖² + λ‖w‖².
I Note that this function is convex.
I Setting the gradient to zero:
grad(Obj)(w) = −(2/n) · Φ^t (Y − Φw) + 2λw = 0
⟹ ( Φ^t Φ + nλ I_D ) w_{n,λ} = Φ^t Y
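A direct, illustrative implementation of the resulting formula (Phi is any design matrix; the function name is my own):

```python
import numpy as np

def ridge_regression(Phi, Y, lam):
    """Return w = (Phi^t Phi + n*lam*I)^(-1) Phi^t Y, as in Theorem 8."""
    n, D = Phi.shape
    return np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(D), Phi.T @ Y)
```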
Solution (3)
Proof that (Φt Φ + nλID ) is invertible:
I The matrix A := Φ^t Φ is symmetric, hence we can decompose it into eigenvalues: A = V^t Λ V, where Λ is the diagonal matrix with all eigenvalues of A
I Because of the special form A := Φ^t Φ, all eigenvalues are ≥ 0 (the matrix is positive semi-definite).
I A has full rank (is invertible) iff all its eigenvalues are > 0.
(A + λI)v = Av + λv = σv + λv = (σ + λ)v
I If λ > 0, then all eigenvalues of A + λI are > 0:
σ ≥ 0 and λ > 0 implies σ + λ > 0
I So we know that A + λI has full rank and is invertible.
f(x) = ∑_{k=1}^{10} w_k sin(kx)
[Figure: least squares fit (red) and ridge regression fit (blue) for a linear …]
[Figure: Bishop, Figure 3.5 — panels for ln λ = 2.6, ln λ = −0.31, and ln λ = −2.4; see the caption below.]
Figure 3.5 Illustration of the dependence of bias and variance on model complexity, governed by a regulariza-
tion parameter λ, using the sinusoidal data set from Chapter 1. There are L = 100 data sets, each having N = 25
data points, and there are 24 Gaussian basis functions in the model so that the total number of parameters is
M = 25 including the bias parameter. The left column shows the result of fitting the model to the data sets for
various values of ln λ (for clarity, only 20 of the 100 fits are shown). The right column shows the corresponding
average of the 100 fits (red) along with the sinusoidal function from which the data sets were generated (green).
[Figure: error as a function of ln λ; the minimum of the curve is close to the minimum error.]
Write the singular value decomposition of the design matrix as Φ = U Σ V^t. Then
w_{n,λ} = … = V · diag( σ_j / (σ_j² + λ) ) · U^t Y
Compare the factors σ_j / (σ_j² + λ) with the factors 1/σ_j.
simple example:
I Assume you want to estimate the mean of a normal
distribution N (Θ, I) in Rd , d ≥ 3.
I Assume we have just a single data point X ∈ Rd from this
distribution.
I Standard least squares estimator: Θ̂LS = X.
I
convex optimization problem, and we can compute its solution
analytically.
• Hastie/Tibshirani/Wainwright, Section 2
Sparsity
I Consider the setting of linear regression with basis functions
Φ1 , ..., ΦD .
I It is very desirable to obtain a solution function f_n := ∑_i w_i Φ_i for which many of the coefficients w_i are zero. Such a solution is called “sparse”.
I Reasons:
Excursion: p-norms
I For p > 0, define for a vector w ∈ R^D:  ‖w‖_p := ( ∑_{i=1}^D |w_i|^p )^{1/p}.
This is not a norm (it does not even satisfy the homogeneity
I … small weights. It punishes all weights linearly, not quadratically, and thus can afford to have a large weight if at the same time many small weights disappear.
The Lasso
Consider the following regularization problem:
I Input space X arbitrary, output space Y = R.
I Fix a set of basis functions Φ1 , ..., ΦD : X → R
I As function
P space choose all functions of the form
f (x) = i wi Φi (x).
I As regularizer use Ω(f) := ‖w‖₁ = ∑_{i=1}^D |w_i|. Choose a …
History
I The name LASSO stands for “least absolute shrinkage and
selection operator”
I First invented by Tibshirani: Regression shrinkage and
selection via the lasso. J. Royal. Statist. Soc. B, 1996
I For a short retrospective and some important literature
pointers, see Tibshirani: Regression shrinkage and selection via
I
efficient solvers exist.
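For illustration, one such solver is the implementation in scikit-learn (assuming that library is available; its parameter alpha plays the role of the regularization constant). The data below is synthetic.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]                  # only three relevant features
y = X @ w_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(lasso.coef_))             # indices of the non-zero (selected) coefficients
```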
Y = Xw + noise, that is (with X considered fixed):
Y | X, w ∼ N(Xw, σ² I)

max_w P(Y | X, w)
⟺ max_w exp( −‖Y − Xw‖² / σ² )
⟺ min_w ‖Y − Xw‖²
Feature normalization
In practice: normalization
In regularized regression, it makes a difference how we scale our
data. Example:
I Body height measured in mm or cm or even km
Different scales lead to different solutions, because they affect the
regularization in a different way (WHY???)
Φ_i^rescaled := Φ_i^centered / ( ∑_{j=1}^n Φ_i^centered(X_j)² )^{1/2}
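A small sketch of this normalization applied to the columns of a design matrix (function and variable names are my own):

```python
import numpy as np

def normalize_columns(Phi):
    """Center each column of the design matrix, then rescale it to unit
    Euclidean norm over the sample, as in the formula above."""
    Phi_centered = Phi - Phi.mean(axis=0)
    norms = np.sqrt((Phi_centered ** 2).sum(axis=0))
    return Phi_centered / norms
```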
I But you also might want to figure out whether certain design
choices make sense, for example whether it is useful to remove
outliers in the beginning or not.
It is very important that all these choices are made appropriately.
Cross validation is the method of choice for doing that.
5 … and train with parameters s.
6 Compute the validation error err(s, k) on fold k
7 Compute the average validation error over the folds: err(s) = (1/K) · ∑_{k=1}^K err(s, k).
8 Select the parameter combination s that leads to the best
validation error: s∗ = argmins∈S err(s).
9 OUTPUT: s∗
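A compact sketch of steps 5–9; train_and_validate is a hypothetical helper that trains with parameter combination s on all folds except fold k and returns the validation error on fold k.

```python
import numpy as np

def cross_validate(S, K, train_and_validate):
    """Select the parameter combination s in S with the smallest average
    validation error over K folds."""
    avg_err = {s: np.mean([train_and_validate(s, k) for k in range(K)]) for s in S}
    return min(avg_err, key=avg_err.get)   # s* = argmin_s err(s)
```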
As soon as you try to “improve the test error”, the test data effectively becomes part of the training procedure and is spoiled.
folds might not be the best ones on the whole data set
(because the latter is larger).
I It is very difficult to prove theoretical statements that relate
the cross-validation error to the test error (due to the high
dependency between the training runs). In particular, the CV
error is not unbiased; it tends to underestimate the test error.
Intuition
Given:
I We assume that our data lives in Rd (perhaps, through a
feature space representation).
I Want to solve a classification problem with input space
X = Rd and output space Y = {±1} (for simplicity we focus
on the two-class case for now).
Intuition (2)
I The idea is to separate the two classes by a linear function:
Hyperplanes in Rd
Now let’s consider linear classification with hyperplanes.
I A hyperplane in Rd has the form
H = { x ∈ R^d | ⟨w, x⟩ + b = 0 }
sign( ⟨w, x⟩ + b ) ∈ {±1}
Projection interpretation
Here is another way to interpret classification by hyperplanes:
I The function hw, xi projects the points x on a real line in the
direction of the normal vector of the hyperplane.
I The term b shifts them along this line.
I Then we look at the sign of the result and classify by the sign
I Duda / Hart
m_+ := (1/n_+) · ∑_{i : Y_i = +1} X_i ∈ R^d

σ²_{w,+} := (1/n_+) · ∑_{i : Y_i = +1} ( ⟨w, X_i⟩ − ⟨w, m_+⟩ )²

Make the analogous definitions for class −1: ...

C_W := (1/n_+) · ∑_{i : Y_i = +1} (X_i − m_+)(X_i − m_+)^t + (1/n_−) · ∑_{i : Y_i = −1} (X_i − m_−)(X_i − m_−)^t
Solution vector w
Proposition 9 (Solution vector w* of LDA)
If the matrix C_W is invertible, then the optimal solution of the problem w* := argmax_{w ∈ R^d} J(w) is given by w* = C_W^{-1} (m_+ − m_−) (up to scaling).

Proof idea: set the derivative of J to zero,
∂J/∂w (w) = 2 · ( C_B w · ⟨w, C_W w⟩ − C_W w · ⟨w, C_B w⟩ ) / ⟨w, C_W w⟩²
where ⟨w, C_W w⟩ and ⟨w, C_B w⟩ are real numbers.
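A small sketch of computing the LDA direction via the closed form w* ∝ C_W^{-1}(m_+ − m_−) stated above (labels assumed to be ±1; function names are my own):

```python
import numpy as np

def lda_direction(X, Y):
    """Return the LDA direction proportional to C_W^{-1} (m_+ - m_-)."""
    Xp, Xm = X[Y == +1], X[Y == -1]
    mp, mm = Xp.mean(axis=0), Xm.mean(axis=0)
    # within-class scatter C_W: sum of the two (biased) class covariance matrices
    Cw = np.cov(Xp, rowvar=False, bias=True) + np.cov(Xm, rowvar=False, bias=True)
    return np.linalg.solve(Cw, mp - mm)
```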
Determining b
So far, we only discussed how to find the normal vector w. How do
we set the offset b? (Recall that the hyperplane is ⟨w, x⟩ + b.)
F = { f(x) = ⟨w, x⟩ + b ;  w ∈ R^d, b ∈ R }
w* = argmax_{w ∈ R^d} J(w)
Proof: skipped.
LDA, alternative motivation by ERM (3)
I LDA does not work well if the variance of the two classes is
very different from each other (remember, in the derivation of
LDA based on Gaussian distributions we assumed equal
variance for both classes).
History
I A variant of this was first published by R. Fisher in 1936:
Fisher, R. A. (1936). The Use of Multiple Measurements in
Taxonomic Problems. Annals of Eugenics 7 (2): 179–188.
I LDA goes under various names: Linear discriminant analysis,
Fisher’s linear discriminant.
I R. Fisher is THE founder of modern statistics (design of
Logistic regression
Literature: Hastie/Tibshirani/Friedman Section 4.4
For the probabilistic point of view, see Chapter 8 in Murphy
But why would someone come up with the logistic loss function???
P(Y = y | X = x) = 1 / (1 + exp(−y f(x)))
For such points, the classifier “is not sure”, but ideally we would like to find a classifier that is “pretty sure” about which side each point belongs to.
Decision function:
P(Y = circle | X = x, b) = 1 / (1 + exp(−y f(x)))
Loss incurred for the respective decisions: logistic
Adding regularization
I As in linear regression, we can now use regularization to avoid
overfitting.
I For example, we could use Ω(f) = ‖w‖₂² (as in ridge regression) or Ω(f) = ‖w‖₁ (as in the Lasso).
I Then regularized logistic regression minimizes
(1/n) · ∑_{i=1}^n log( 1 + exp(−Y_i ⟨w, X_i⟩) ) + λ · Ω(f).
Literature: Murphy, Machine learning: a probabilistic perspective, Chapter 8
General idea
I In linear discriminant analysis (LDA): Minimizing the L2 loss
over the class of linear functions is “the same” as finding the
Bayesian decision theory solution for the probabilistic model
with Gaussian class conditional priors, and Gaussian noise, and
uniform class prior.
I In logistic regression: Minimizing the logistic loss function over
I … minimization with respect to this newly defined loss function ℓ:
argmin_{f ∈ F} ∑_{i=1}^n ℓ(X_i, f(X_i), Y_i)
Then there always exists a particular loss function ` such that this
approach corresponds to ERM with this particular loss function.
Note that it does not always work the other way round (if we start
with a given loss function `, it is not always possible to construct a
corresponding model probability distribution).
I Shawe-Taylor / Cristianini
Prelude
The support vector machine (SVM) is the algorithm that made machine learning its own sub-discipline of computer science; it is one of the most important machine learning algorithms. It was published in the late 1990s (see later for more on its history).
We are going to study the linear case first. The main power of the method comes from the “kernel trick”, which is going to make it non-linear.
Geometric motivation
Given a set of linearly separable data points in Rd . Which
hyperplane to take???
Canonical hyperplane
I We are interested in a linear classifier of the form
f(x) = sign( ⟨w, x⟩ + b )
I Note that if we multiply w and b by the same constant a > 0,
this does not change the classifier:
sign(haw, xi + ab) = sign(a(hw, xi + b)) = sign(hw, xi + b)
Ulrike von Luxburg: Statistical Machine Learning
The Margin
I Let H := {x ∈ R^d : ⟨w, x⟩ + b = 0} be a hyperplane.
I Assume that a hyperplane correctly separates the training data.
I The margin of the hyperplane H with respect to the training
points (Xi , Yi )i=1,..,n is defined as the minimal distance of a
training point to the hyperplane:
First proof.
Observe:
We can write x = h + ρ · w/‖w‖, where h is the projection of x onto the hyperplane and ρ its distance to the hyperplane.
Now we take the scalar product with w and add b on both sides:
⟹ ⟨w, x⟩ = ⟨w, h + ρ w/‖w‖⟩ = ⟨w, h⟩ + ρ ‖w‖²/‖w‖
⟹ ⟨w, x⟩ + b = ⟨w, h⟩ + b + ρ‖w‖   (where ⟨w, x⟩ + b = 1 and ⟨w, h⟩ + b = 0)
⟹ ρ = 1/‖w‖   ☺
⟨w, x⟩ + b = 1
⟨w, h⟩ + b = 0
⟹ ⟨w, x − h⟩ = 1
⟹ ⟨w/‖w‖, x − h⟩ = 1/‖w‖
Now the proposition follows from the fact that w and x − h point in the same direction and w/‖w‖ has norm 1. ☺
In formulas:
maximize_{w ∈ R^d, b ∈ R}   1/‖w‖
subject to   Y_i = sign(⟨w, X_i⟩ + b)   ∀ i = 1, ..., n
             |⟨w, X_i⟩ + b| ≥ 1         ∀ i = 1, ..., n
This problem only has a solution if the data set is linearly separable,
that is there exists a hyperplane H that separates all training points
without error. This might be too strict ...
minimize_{w ∈ R^d, b ∈ R, ξ ∈ R^n}   ½ ‖w‖² + (C/n) Σ_{i=1}^n ξ_i
subject to   Y_i (⟨w, X_i⟩ + b) ≥ 1 − ξ_i   ∀ i = 1, ..., n
             ξ_i ≥ 0                        ∀ i = 1, ..., n
Here C is a constant that controls the tradeoff between the
two terms, see below.
This problem is called the (primal) soft margin SVM problem.
I Note that this is a convex (quadratic) problem as well.
Note that for soft-margin SVMs, the margin is defined implicitly (the points on the margin are the ones that satisfy ⟨w, x⟩ + b = ±1).
It looks as follows:
minimize_{w,b}   (C/n) Σ_{i=1}^n max{0, 1 − Y_i(⟨w, X_i⟩ + b)}  +  ‖w‖²
(the first term is the empirical risk with respect to the hinge loss, the second term is the L2-regularizer)
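A minimal sketch (Python/NumPy, my own code rather than the course demos) of this unconstrained formulation: subgradient descent on the hinge loss plus the L2-regularizer.

import numpy as np

def fit_linear_svm(X, Y, C=10.0, lr=0.01, n_iter=2000):
    # minimize (C/n) sum_i max{0, 1 - Y_i(<w,X_i>+b)} + ||w||^2
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        active = Y * (X @ w + b) < 1          # points with non-zero hinge loss
        gw = -(C / n) * (X[active] * Y[active][:, None]).sum(axis=0) + 2 * w
        gb = -(C / n) * Y[active].sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
Y = np.array([-1] * 50 + [1] * 50)
w, b = fit_linear_svm(X, Y)
print("training error:", np.mean(np.sign(X @ w + b) != Y))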
L(w, b, α) = ½ ‖w‖² − Σ_{i=1}^n α_i (Y_i(⟨w, X_i⟩ + b) − 1)
Dual Problem:
maximize_α   g(α)
subject to   α_i ≥ 0,  i = 1, ..., n
But this is pretty abstract, we would need to first compute the dual
function, but this seems non-trivial. We now show how to compute
g(α) explicitly. Let’s try to simplify the Lagrangian first.
In particular,
∂/∂b L(w, b, α) = − Σ_{i=1}^n α_i Y_i  = 0    (∗)
∂/∂w L(w, b, α) = w − Σ_i α_i Y_i X_i = 0    (∗∗)
L(w, b, α) = ½ ‖w‖² − Σ_{i=1}^n α_i (Y_i(⟨w, X_i⟩ + b) − 1)
           = ½ ‖w‖² + Σ_i α_i − Σ_i α_i Y_i ⟨w, X_i⟩ − b Σ_i α_i Y_i
(the last term vanishes by (∗))
We obtain the dual problem of the hard margin SVM:
maximize_{α ∈ R^n}   Σ_{i=1}^n α_i − ½ Σ_{i,j=1}^n α_i α_j Y_i Y_j ⟨X_i, X_j⟩
subject to   α_i ≥ 0   ∀ i = 1, ..., n
             Σ_{i=1}^n α_i Y_i = 0
maximize_{α ∈ R^n}   Σ_{i=1}^n α_i − ½ Σ_{i,j=1}^n α_i α_j Y_i Y_j ⟨X_i, X_j⟩
To recover the primal solution, use w = Σ_i α_i Y_i X_i (by (∗∗)) and determine b from Y_j = sign(⟨w, X_j⟩ + b) for a support vector X_j, which leads to
⟨w, X⟩ + b = Σ_i α_i Y_i ⟨X_i, X⟩ + Y_j − Σ_i Y_i α_i ⟨X_i, X_j⟩
Support vectors
Support vector property:
I KKT conditions in the hard margin case tell us: Only Lagrange
multipliers αi that are non-zero correspond to active
constraints (the ones that are precisely met). Formally,
α_i (Y_i f(X_i) − 1) = 0
A similar statement holds for the soft margin case; there the α_i are only non-zero for points on the margin, in the margin, or on the wrong side of the margin.
I In our context: Only those αi are non-zero that correspond to
points that lie exactly on the margin, inside the margin or on
the wrong side of the hyperplane. The corresponding points
are called support vectors.
Scalar products
We can see that all the information about the input points Xi that
enters the optimization problem is expressed in terms of scalar
products:
I hXi , Xj i in the dual objective function
This is going to be the key point to be able to apply the kernel trick.
Exercise
It might be instructive to solve the following exercise:
Input data: x1 = (1, 0); y1 = +1; x2 = (−1, 0); y2 = −1.
Primal problem:
I Write down the hard margin primal optimization problem and
solve it using the Lagrange approach.
I Write down the soft margin primal optimization problem and
Exercise (2)
I Use the dual solution to recover the solution of the primal
problem. Compare the values of the objective functions at the
dual and primal solution.
I Determine the support vectors.
History
I Vladimir Vapnik is the “inventor” of the SVM (and, in fact, he
laid the foundations of statistical learning theory in general).
I The hard margin SVM and the kernel trick was introduced by
Boser, Bernhard; Guyon, Isabelle; and Vapnik, Vladimir. A
training algorithm for optimal margin classifiers. Conference on
Learning Theory (COLT), 1992
Kernel methods for supervised learning
Intuition
[Figure: two-dimensional toy data set.]
maximize_{α ∈ R^n}   Σ_{i=1}^n α_i − ½ Σ_{i,j=1}^n α_i α_j Y_i Y_j k(X_i, X_j)
subject to   0 ≤ α_i ≤ C/n   ∀ i = 1, ..., n
             Σ_{i=1}^n α_i Y_i = 0
(and the same goes for the function that evaluates the results on
test points).
We never need to compute the embeddings Φ(X_i) explicitly; we can compute the scalar products directly via the function k.
is a valid kernel!
Proof sketch:
I Symmetry: clear
Example: the linear kernel
k : R^d × R^d → R,   k(x, y) = ⟨x, y⟩
is obviously a kernel.
The Gaussian (RBF) kernel:
k : R^d × R^d → R,   k(x, y) = exp( −‖x − y‖² / (2σ²) )
One can prove that this is indeed a kernel; this is not obvious at all!!! (WHAT DO WE NEED TO PROVE?)
Not very useful for practice, but often mentioned, hence I put it on the slides.
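For concreteness, a short NumPy sketch (my own, not from the lecture) that computes the Gaussian kernel matrix K_ij = k(x_i, x_j) for a data matrix whose rows are the points.

import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # squared distances via ||x||^2 + ||z||^2 - 2<x,z>
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_kernel(X, X, sigma=2.0)
# K is symmetric and positive semi-definite (this is what one needs to prove)
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) > -1e-10))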
function.
I Then we classify with an SVM.
I Now define
s(v, ṽ) = Σ_{π ∈ Π_k(v,ṽ)} w(π)   if Π_k(v, ṽ) ≠ ∅,   and   s(v, ṽ) = 0 otherwise
I k̃ = k1 + k2 is a kernel
I k̃ = k1 · k2 is a kernel
Proof. EXERCISE.
v′XX′v = v′Kv = λv′v = λ
Reproducing kernel Hilbert space and feature maps
If you have never heard of Hilbert spaces, just think of the space
Rd . The crucial properties are:
I H is a vector space with a scalar product ⟨·, ·⟩_H
for all x, y ∈ X.
x ↦ Φ(x) := k_x := k(x, ·)
⟨f, g⟩_G := Σ_{i,j} α_i β_j k(x_i, y_j)
Proof.
⟨k(x, ·), f⟩ = ⟨k(x, ·), Σ_i α_i k(x_i, ·)⟩
             = Σ_i α_i ⟨k(x_i, ·), k(x, ·)⟩
             = Σ_i α_i k(x_i, x)
             = f(x)   ☺
depend on w_comp.
I So in particular, the loss is not affected by w_comp. ☺
However, a kernel for which the feature map is not injective might
not be too useful (WHY?)
Example:
I The Gaussian kernel with fixed kernel width σ on a compact
subset X of Rd is universal.
Kernels — history
I Reproducing kernel Hilbert spaces play a big role in
mathematics, they have been invented by Aronszajn in 1950.
He already proved all of the key properties.
Aronszajn. Theory of Reproducing Kernels. Transactions of
the American Mathematical Society, 1950
I The feature space interpretation has first been published by
Kernel algorithms
In the following, we are now going to see a couple of algorithms
that all use the kernel trick. The roadmap is always the same:
I Start with a linear algorithm
I Shawe-Taylor / Cristianini
In kernel language:
⟨w, X⟩ + b = ⟨Σ_i α_i Y_i X_i, X⟩ + b
           = Σ_i α_i Y_i k(X_i, X) + Y_j − Σ_i Y_i α_i k(X_i, X_j)
If k is a Gaussian kernel:
f(x) = Σ_i β_i exp( −‖x − X_i‖² / (2σ²) )
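A small illustrative sketch (Python/NumPy; it assumes the dual coefficients alpha and the offset b have already been obtained from some QP solver, which is not shown) of evaluating this kernel decision function.

import numpy as np

def svm_decision(x, X_train, Y_train, alpha, b, k):
    # f(x) = sum_i alpha_i Y_i k(X_i, x) + b; only support vectors contribute
    sv = np.where(alpha > 1e-10)[0]
    return sum(alpha[i] * Y_train[i] * k(X_train[i], x) for i in sv) + b

gauss = lambda x, z, s=1.0: np.exp(-np.sum((x - z)**2) / (2 * s**2))
X_train = np.array([[0.0, 0.0], [1.0, 1.0]]); Y_train = np.array([-1, 1])
alpha, b = np.array([0.5, 0.5]), 0.0          # made-up values, for illustration only
print(np.sign(svm_decision(np.array([0.9, 1.1]), X_train, Y_train, alpha, b, gauss)))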
Regularization interpretation
Recall that we interpreted the linear primal SVM problem in terms of regularized risk minimization where the risk was the hinge loss and we regularized by the L2-regularizer ‖w‖².
‖w‖² = ⟨Σ_i β_i Φ(X_i), Σ_j β_j Φ(X_j)⟩ = βᵀKβ
minimize_{w ∈ H}   ‖w‖²_H + (C/n) Σ_{i=1}^n ξ_i
subject to   Y_i ⟨w, Φ(X_i)⟩_H ≥ 1 − ξ_i   (i = 1, ..., n)
‖w‖² = ⟨w, w⟩ = Σ_{i,j} β_i β_j k(X_i, X_j)
and
⟨w, Φ(X_j)⟩ = Σ_i β_i ⟨Φ(X_i), Φ(X_j)⟩ = Σ_i β_i k(X_i, X_j)
should be large.
I All this leads to a convex optimization problem that can be
solved efficiently.
I There are lots of important properties (support vector
property, representer theorem, etc).
I The kernel SVM is equivalent to regularized risk minimization
with the Hinge loss and regularization by the squared norm in
the feature space.
minimize_w   (1/n) Σ_{i=1}^n (Y_i − ⟨Φ(X_i), w⟩)²
α* = K⁻¹ Y   (EXERCISE!)
I To evaluate the solution on a new data point x, we need to compute
f(x) = ⟨Φ(x), w*⟩ = Σ_j α*_j ⟨Φ(x), Φ(X_j)⟩ = Σ_j α*_j k(x, X_j)
Ridge regression
Recall ridge regression in feature space:
minimize_w   (1/n) Σ_{i=1}^n (Y_i − ⟨w, Φ(X_i)⟩)² + λ ‖w‖²
w = Σ_{j=1}^n α_j Φ(X_j)
α = (nλ I + K)⁻¹ Y
As before, we can compute the prediction for a new test point just
using kernels.
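A minimal kernel ridge regression sketch (Python/NumPy, my own code) using exactly this closed form; the Gaussian kernel and all parameter values are arbitrary choices for the toy example.

import numpy as np

def rbf(A, B, s=0.5):
    sq = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-sq / (2 * s**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

lam = 1e-3
K = rbf(X, X)
alpha = np.linalg.solve(len(X) * lam * np.eye(len(X)) + K, Y)   # alpha = (n*lam*I + K)^{-1} Y
x_test = np.array([[0.5]])
prediction = rbf(x_test, X)[0] @ alpha                          # f(x) = sum_j alpha_j k(x, X_j)
print(prediction, np.sin(0.5))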
What we want to do
I Have seen: many algorithms require that the data points are
centered (have mean = 0) and are normalized.
I However, now we want to work in feature space, but without
explicitly working with the coordinates in feature space.
I So how can we do this ???
The scalar product between the centered feature vectors becomes
k̃(x_i, x_j) = k(x_i, x_j) − (1/n) Σ_{s=1}^n k(x_i, x_s) − (1/n) Σ_{s=1}^n k(x_j, x_s) + (1/n²) Σ_{s,t=1}^n k(x_s, x_t)
In matrix notation: K̃ = K − 1_n K − K 1_n + 1_n K 1_n, where 1_n denotes the n × n matrix with all entries equal to 1/n.
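A short sketch (Python/NumPy, my own code) of this centering formula, checked against explicit centering in input space for the linear kernel.

import numpy as np

def center_kernel(K):
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)          # the matrix 1_n with entries 1/n
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

X = np.random.default_rng(0).normal(size=(6, 3))
Xc = X - X.mean(axis=0)
print(np.allclose(center_kernel(X @ X.T), Xc @ Xc.T))   # True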
Good news:
I Normalize the data points, that is, rescale each data point such that it has norm 1. This is equivalent to normalizing the rows of the centered matrix to have unit norm.
I Normalize the individual features, that is rescale all columns of
the centered matrix to have norm 1. This is just a rescaling of
the coordinate axes such that the variance in each coordinate
direction is 1.
I Normalize features:
I Typically this never hurts, and often helps.
I For kernel methods it is impossible to normalize the features
(we don’t know the embedding Φ explicitly, in particular we
don’t know what the features are).
⟨Φ̂(X), Φ̂(Y)⟩ = ⟨ Φ(X)/‖Φ(X)‖ , Φ(Y)/‖Φ(Y)‖ ⟩
              = ⟨Φ(X), Φ(Y)⟩ / (‖Φ(X)‖ ‖Φ(Y)‖)
              = k(x, y) / √( k(x, x) k(y, y) )
k̂(x, y) = k(x, y) / √( k(x, x) k(y, y) )
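The corresponding normalization of a whole kernel matrix, as a small sketch of my own (Python/NumPy):

import numpy as np

def normalize_kernel(K):
    d = np.sqrt(np.diag(K))                   # ||Phi(X_i)|| = sqrt(k(x_i, x_i))
    return K / np.outer(d, d)

X = np.random.default_rng(0).normal(size=(5, 3))
K_hat = normalize_kernel(X @ X.T)
print(np.allclose(np.diag(K_hat), 1.0))       # all implicit feature vectors have norm 1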
More supervised learning algorithms
(*) Boosting
Unsupervised learning
Goal:
I Given data points x1 , ..., xn ∈ Rd
Recap: Projections
... see slides in the appendix (slides 1212 ff.)
For simplicity, let us assume that the data points are centered:
x̄ = (1/n) Σ_{i=1}^n x_i = 0
(If this is not the case, we can center the data points by setting
x̃i = xi − x̄.)
a′X′Xa = λa′a = λ
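In the spirit of demo_pca.m (which is not reproduced here), a minimal NumPy sketch of PCA via the eigendecomposition of the covariance matrix; all names are my own.

import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                    # center the data
    C = Xc.T @ Xc / Xc.shape[0]                # covariance matrix
    evals, evecs = np.linalg.eigh(C)           # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:n_components]
    V = evecs[:, idx]                          # principal directions
    return Xc @ V, V, evals[idx]               # projections, directions, variances

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
scores, V, variances = pca(X, 1)
print(variances.round(2), V[:, 0].round(2))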
demo_pca.m
Example: simple Gaussian toy data
USPS example
USPS handwritten digits, 16 x 16 greyscale images.
; demo_pca_usps.m
[Figure: a 3 × 3 grid of 16 × 16 USPS images.]
[Figure: principal components 4 to 9, displayed as 16 × 16 images.]
[Figure: reconstructions of a digit using 4, 5, 10, 50, 100, and 256 eigenvectors.]
[Figure: eigenvalue spectrum (scree plot) of the USPS data.]
Eigenfaces
Principal components for a data set of faces:
One can prove that this approach leads to exactly the same solution
as the one induced by the max-variance criterion (we skip this
derivation).
[Figure: eigenvalue spectrum (scree plot).]
Global!
Keep in mind that PCA optimizes global criteria.
I No guarantees what happens to individual data points. This is
different for some other dimensionality reduction methods
(such as random projections and Johnson-Lindenstrauss).
I If the sample size is small, then outliers can have a large effect
on PCA.
Summary: PCA requires the solution of an eigenvalue problem; the principal components v_i define a new basis along directions of maximal variance.
Observe:
I PCA uses the covariance matrix — and this matrix inherently
uses the actual coordinates of the data points.
I So how should we be able to kernelize PCA???
C_kl = Cov_1dim(X^(k), X^(l)) = Σ_{i=1}^n X_i^(k) X_i^(l) = (X′X)_kl
Also note that because X_i^(k) X_i^(l) = (x_i x_i′)_kl, this implies
X′X = Σ_{i=1}^n x_i x_i′   (a d × d matrix)
I Kernel matrix is K = XX 0
Proof of (1).
Ka = λa
⟺ XX′a = λa
⟹ X′XX′a = λX′a
⟺ Cv = λv   (with v := X′a)
with a_j = (1/λ) ⟨x_j, v⟩ ∈ R (or more compactly, a = (1/λ) X v ∈ R^n).
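A compact kernel PCA sketch (Python/NumPy, my own reading of the derivation above): eigenvectors a of the centered kernel matrix, rescaled by 1/√λ, give the projections of the training points via K a.

import numpy as np

def kernel_pca(K, n_components):
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n          # center in feature space
    evals, evecs = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:n_components]
    A = evecs[:, idx] / np.sqrt(np.maximum(evals[idx], 1e-12))  # rescale the eigenvectors
    return Kc @ A                                               # embedded training points

X = np.random.default_rng(0).normal(size=(50, 3))
sq = ((X[:, None] - X[None, :])**2).sum(-1)
Y = kernel_pca(np.exp(-sq / 2.0), n_components=2)
print(Y.shape)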
I Dimensions:
I C is a d × d-matrix, so its eigendecomposition has d
eigenvalues.
I K is a n × n matrix with n eigenvalues.
I But intuitively spoken, we just showed that we can convert
the eigenvalues of K to those of C and vice versa.
HOW CAN THIS BE, IF d AND n ARE DIFFERENT???
I Then compute v := (1/√λ) Σ_i a_i x_i
π_v(x_i) = v′x_i = ⟨Σ_j a_j x_j, x_i⟩ = Σ_j a_j ⟨x_j, x_i⟩
Output: y_1, ..., y_n ∈ R^ℓ
[Figures: kernel PCA embeddings of several 2D and 3D toy data sets.]
History
I Classical PCA was invented by Pearson:
On Lines and Planes of Closest Fit to Systems of Points in
Space. Philosophical Magazine, 1901.
I It is one of the most popular “classical” techniques for data
analysis.
I Kernel PCA was invented pretty much 100 years later ,
Ulrike von Luxburg: Statistical Machine Learning
covariance matrix.
Kernel PCA:
I Use the kernel trick to make PCA non-linear.
Multi-dimensional scaling
Literature:
Multi-dimensional scaling:
I Is a classic that is covered in many books on data analysis.
Embedding problem
I Assume we are given a distance matrix D ∈ Rn×n that
contains distances dij = kxi − xj k between data points.
I Can we “recover” the points (xi )i=1,...,n ∈ Rd ?
This problem is called (metric) multi-dimensional scaling.
in this lecture.
Classic MDS
Assume we are given a Euclidean distance matrix D. Will now see
how to express the entries of the Gram matrix S = (hxi , xj i)ij=1,...,n
in terms of entries of D:
I By definition: d²_ij = ‖x_i − x_j‖² = ⟨x_i, x_i⟩ + ⟨x_j, x_j⟩ − 2⟨x_i, x_j⟩
I Rearranging gives
⟨x_i, x_j⟩ = ½ ( ⟨x_i, x_i⟩ + ⟨x_j, x_j⟩ − d²_ij ),   where ⟨x_i, x_i⟩ = d(0, x_i)² and ⟨x_j, x_j⟩ = d(0, x_j)²
⟨x_i, x_j⟩ = ½ ( d²_1i + d²_1j − d²_ij )
I So we can express the entries of the Gram matrix S in terms of the entries of D.
Demos
demo_mds.m
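In the same spirit as demo_mds.m (not reproduced here), a minimal classic-MDS sketch in Python/NumPy; it uses the standard double-centering of D², a mild variant of the derivation above (which centers at x_1).

import numpy as np

def classic_mds(D, dim):
    n = D.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)
    S = -0.5 * J @ (D**2) @ J                  # Gram matrix recovered from the distances
    evals, evecs = np.linalg.eigh(S)
    idx = np.argsort(evals)[::-1][:dim]
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))

X = np.random.default_rng(0).normal(size=(30, 2))
D = np.sqrt(((X[:, None] - X[None, :])**2).sum(-1))
Y = classic_mds(D, 2)
D_rec = np.sqrt(((Y[:, None] - Y[None, :])**2).sum(-1))
print(np.allclose(D, D_rec))                   # Euclidean distances are reproduced exactly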
Metric MDS
Metric MDS refers to the problem where the distance matrix D is
no longer Euclidean, but we still believe (hope) that a good
embedding exists.
I If the distance matrix D is not Euclidean, we will not be able
to recover an exact embedding.
I Instead, one defines a "stress function". Below is an example:
stress(embedding) = Σ_{ij} (‖x_i − x_j‖ − d_ij)² / Σ_{ij} ‖x_i − x_j‖
Non-metric MDS
I Instead of distance values, we are just given distance
comparisons, that is we know whether dij < dik or vice versa.
I The task is then to find an embedding such that these ordinal
relationships are preserved.
I Our group is working on this problem ,
History of MDS
I Metric MDS: Torgerson (1952) - The first well-known MDS
proposal. Fits the Euclidean model.
I Non-metric MDS: Shepard (1962) and Kruskal (1964)
Summary MDS
I Given a distance matrix D, MDS tries to construct an
embedding of the data points in Rd such that the distances are
preserved as well as possible.
I If D is Euclidean, a perfect embedding can easily be
constructed.
I If D is not Euclidean, MDS tries to find an embedding that preserves the distances as well as possible.
Graph-based machine learning algorithms: introduction
Neighborhood graphs
Given the similarity or distance scores between our objects, we want
to build a graph based on it.
I Vertices = objects
Different variants:
[Figures: different neighborhood graphs built on a 2D toy data set.]
Isomap
Literature:
I Original paper:
J. Tenenbaum, V. de Silva, J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 2000.
Isomap
We often think that data is “inherently low-dimensional”:
I Images of a tea pot, taken from all angles. Even though the
images live in R256 , say, we believe they sit on a manifold
corresponding to a circle:
Isomap (2)
I A phenomenon generates very high-dimensional data, but the
“effective number of parameters” is very low
Isomap (3)
[Figure: hand images parametrized by wrist rotation and fingers extension.]
Isomap (4)
(A) (B) (C)
More abstractly:
I We assume that the data lives in a high-dimensional space, but
effectively just sits on a low-dimensional manifold
I We would like to find a mapping that recovers this manifold.
Isomap (5)
(A) (B) (C)
Ulrike von Luxburg: Statistical Machine Learning
Figure 1: The problem of nonlinear dimensionality reduction, as illustrated for three dimensional
data (B) sampled from two dimensional manifolds (A). An unsupervised learning al-
gorithm must discover the global internal coordinates of the manifold without external
signals that suggest how the data should be embedded in two dimensions. The LLE algo-
rithm described in this paper discovers the neighborhood-preserving mappings shown in
(C); the color coding reveals how the data is embedded in two dimensions.
ambient space.
Theoretical guarantees
In the original paper (supplement) the authors have proved:
I If the data points X1 , ..., Xn are sampled uniformly from a
“nice” manifold, then as n → ∞ and k ≈ log n, the shortest
path distances in the kNN graph approximate the geodesic
distances on the manifold.
I Under some geometric assumptions on the manifold, MDS
Ulrike von Luxburg: Statistical Machine Learning
Demos
I demo_isomap.m
I Toolbox by Laurens van der Maaten:
I addpath(genpath(’/Users/ule/matlab_ule/
downloaded_packages_not_in_path/dim_reduction_
toolbox/’)); then call drgui
I Use swiss roll and helix, play with k
History
I Manifold methods became fashionable in machine learning in
the early 2000s.
I Isomap was invented in 2000.
I Since then, a large number of manifold-based dimensionality
reduction techniques has been invented:
Locally linear embedding, Laplacian eigenmaps, Hessian eigenmaps, and many more.
Summary Isomap
I Unsupervised learning technique to extract the manifold
structure from distance / similarity data
I Intuition: local distances define the intrinsic geometry, shortest
paths in a kNN graphs correspond to geodesics.
I MDS then tries to find an appropriate embedding.
SKIPPED
(*) Ordinal embedding (GNMDS, SOE, t-STE):
Clustering
Data clustering
Data clustering is one of the most important problems of
unsupervised learning.
I Given just input data X1 , ..., Xn
I ...
individually.
Clustering Gene Expression Data
Results on real data sets can be found on the web page caltech.edu/lihi/Demos/SelfTuningClustering.html
[Figure: clustering of mammal species (whales, primates, etc.).]
Naive approach:
I Define a criterion that measures these distances and try to find
the best partition with respect to this criterion: Example:
k-means objective
I Assume we are given data points X1 , ..., Xn ∈ Rd
I Assume we want to separate it into K groups.
I We want to construct K class representatives (class means)
m1 , ..., mK that represent the groups.
I Consider the following objective function:
min_{m_1, ..., m_K ∈ R^d}   Σ_{k=1}^K Σ_{i ∈ C_k} ‖X_i − m_k‖²
That is, we want to find the centers such that the sum of
squared distances of data points to the closest centers are
minimized.
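A minimal sketch (Python/NumPy, my own code) of Lloyd's algorithm, the standard heuristic for this objective, which is discussed next.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]        # random initialization
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(-1)
        labels = d2.argmin(axis=1)                           # assign to the closest center
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                        for k in range(K)])                  # recompute the cluster means
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in [(0, 0), (4, 0), (2, 4)]])
print(kmeans(X, K=3)[1].round(2))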
clusters.
[Figure: iterations of Lloyd's algorithm: the clustering is updated for fixed centers, then the centers are updated based on the new clustering, and so on.]
m_k^(i+1) = (1 / |C_k^(i+1)|) Σ_{s ∈ C_k^(i+1)} X_s
decreases.
I There are only finitely many partitions we can inspect.
I So the algorithm has to terminate.
[Figure: a small data set X_1, ..., X_4 and two different placements of the centers m_1, m_2, m_3, illustrating that Lloyd's algorithm can end in a bad local optimum.]
1 S = ∅ # S set of centers
1. Find a partition such that the within-cluster distances are minimized:
min_{C_1, ..., C_K}   Σ_{k=1}^K (1/|C_k|²) Σ_{i, j ∈ C_k} ‖X_i − X_j‖²
2. Find cluster centers such that the distances of the data points to these centers are minimized:
min_{m_1, ..., m_K ∈ R^d}   Σ_{k=1}^K Σ_{i ∈ C_k} ‖X_i − m_k‖²
This is curious because one can prove that there only exist
polynomially many Voronoi partitions of any given data set.
The difficulty is that we cannot construct any enumeration to
Ulrike von Luxburg: Statistical Machine Learning
Summary K-means
I Represent clusters by cluster centers
I Highly non-convex NP hard optimization problem
I Heuristic: Lloyd’s k-means algorithm
I Very easy to implement, hence very widely used.
I In my opinion: k-means works well for vector quantization (if
you want to find a large number of clusters, say 100 or so). It
does not work so well for small k, here you should consider
spectral clustering.
Hierarchical clustering
Goal: obtain a complete hierarchy of clusters and sub-clusters in
form of a dendrogram
[Figure: dendrogram over mammal species, with groups such as marsupials and monotremes, rodents, ferungulates, and primates.]
Simple idea
Agglomerative (bottom-up) strategy:
I Start: each point is its own cluster
Average linkage: dist(C, C′) = Σ_{x ∈ C, y ∈ C′} d(x, y) / (|C| · |C′|)
• find the two clusters which have the smallest distance to each
other
• merge them to one cluster
Examples
... show matlab demos ...
demo_linkage_clustering_by_foot()
demo_linkage_clustering_comparison()
Theoretical considerations:
I Linkage algorithms attempt to estimate the density tree
ISMB 1999.
What is it about?
General idea:
I many properties of graphs can be described by properties of the
adjacency matrix and related matrices (“graph Laplacians”).
I In particular, the eigenvalues and eigenvectors can say a lot
about the “geometry” of the graph.
Unnormalized Laplacians
fᵀLf = fᵀDf − fᵀWf
     = Σ_i d_i f_i² − Σ_{i,j} f_i f_j w_ij
     = ½ ( Σ_i (Σ_j w_ij) f_i² − 2 Σ_{i,j} f_i f_j w_ij + Σ_j (Σ_i w_ij) f_j² )
     = ½ Σ_{i,j} w_ij (f_i − f_j)²
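A tiny numerical check of this identity (Python/NumPy, my own example graph):

import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0.2],
              [0, 0, 0.2, 0]])                 # symmetric weight matrix
D = np.diag(W.sum(axis=1))                     # degree matrix
L = D - W                                      # unnormalized graph Laplacian
f = np.random.default_rng(0).normal(size=4)
lhs = f @ L @ f
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j])**2 for i in range(4) for j in range(4))
print(np.isclose(lhs, rhs))                    # True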
looks like a discrete version of the standard Laplace operator:
⟨f, Δf⟩ = ∫ |∇f|² dx
graph.
I The eigenspace of eigenvalue 0 is spanned by the characteristic
functions 1A1 , ..., 1Ak of those components
(where 1Ai (j) = 1 if vj ∈ Ai and 1Ai (j) = 0 otherwise).
Unnormalized Laplacians and connected components (6)
Normalized Laplacians
Two versions:
I The "symmetric" normalized graph Laplacian L_sym = D^(−1/2) L D^(−1/2)
I The "random walk" normalized graph Laplacian L_rw = D^(−1) L
We will now see that both normalized Laplacians are closely related,
and have similar properties as the unnormalized Laplacian.
Part (4): The statement about Lsym follows from the adapted key
property, and then the statement about Lrw follows from (2). ,
Proof.
Analogous to the one for the unnormalized case.
Cheeger constant
Cheeger constant
Let G be an an undirected graph with non-negative edge weights
wij , S ⊂ V be a subset of vertices, S̄ := V \ S its complement.
Define:
I Volume of the set: vol(S) := Σ_{s ∈ S} d(s)
I Cut value: cut(S, S̄) := Σ_{i ∈ S, j ∈ S̄} w_ij
I Cheeger constant:
h_G(S) := cut(S, S̄) / min{ vol(S), vol(S̄) }
h_G := min_{S ⊂ V} h_G(S)
Small values of the Cheeger constant are achieved by cuts that split the graph into two reasonably big, tightly connected subgraphs (so that the denominator is large) which are only loosely connected to each other (so that the numerator, the cut value, is small).
λ₂ / 2 ≤ h_G ≤ √(2 λ₂)
Intuition:
I The Cheeger constant describes the cluster properties of a
graph.
I The Cheeger constant is controlled by the second eigenvalue.
Spectral clustering
Literature:
I U. Luxburg. Tutorial on Spectral Clustering, Statistics and
Computing, 2007.
Clustering in graphs
General problem:
I Given a graph
Problem: outliers
Clustering in graphs (4)
RatioCut criterion
Idea:
I want to define an objective function that measures the quality
of a “nearly balanced cut”:
I the smaller the cut value, the smaller the objective function
I the more balanced the cut, the smaller the objective function
I Consider a partition V = A ∪̇ B (disjoint union).
I Define |A| := number of vertices in A
I Introduce the balancing term 1/|A| + 1/|B|.
I Observe: The balancing term is small when A and B have
(approximately) the same number of vertices:
Bad news: finding the global minimum of RatioCut is NP hard ☹
Good news: there exists an algorithm that finds very good solutions in most of the cases: spectral clustering ☺
Now observe:
cut(A, Ā) = Σ_{i ∈ A, j ∈ Ā} w_ij = ¼ Σ_{i,j=1}^n w_ij (f_i − f_j)² = ½ fᵀLf
min_f   fᵀLf   subject to   Σ_{i=1}^n f_i = 0 and f_i = ±1     (∗∗)
different way.
min_f   fᵀLf   subject to   Σ_{i=1}^n f_i = 0, f_i ∈ R, and ‖f‖ = 1     (#)
Finally, observe that Σ_i f_i = 0 ⟺ f ⊥ 1, where 1 is the constant-one vector (1, 1, ..., 1). We obtain:
i ∈ A :⟺ f*_i ≥ 0
HardBalancedCutRelaxation(G)
1 Input: Weight matrix (or adjacency matrix) W of the graph
2 D := the corresponding degree matrix
3 L := D − W (the corresponding graph Laplacian)
4 Compute the second-smallest eigenvector f of L
Relaxing RatioCut
Now we want to solve the soft balanced mincut problem of
optimizing ratiocut:
This goes along the same lines as the hard balanced mincut
problem:
f_i = +( |Ā| / |A| )^(1/2)   if i ∈ A,      f_i = −( |A| / |Ā| )^(1/2)   if i ∈ Ā
6 Return A, Ā
Examples
A couple of data points drawn from a mixture of Gaussians on R.
Examples (2)
General idea:
I The first k eigenvectors encode the cluster structure of k
disjoint clusters.
(To see this, consider the case of k perfectly disconnected
clusters)
I Idea: they are “nearly the same” if we still have nice (but not
perfect) clusters.
I In particular, any simple algorithm can recover the cluster
membership based on the embedded points Yi . We use
k-means to do so.
Unnormalized spectral clustering for k clusters (3)
Unnormalized spectral clustering for k clusters (4)
5 Define the new data points Y_i ∈ R^k to be the rows of the matrix V. This is sometimes called the spectral embedding.
6 Now cluster the points (Yi )i=1,...,n by the k-means algorithm.
Ncut(A, B) := cut(A, B)/vol(A) + cut(A, B)/vol(B)
This looks very similar to RatioCut, but we measure the size of the sets A and B not by their number of vertices, but by the weight of their edges:
vol(A) = Σ_{i ∈ A} d_i
NormalizedSpectralClustering(G)
1 Input: Weight matrix (or adjacency matrix) W of the graph
2 D := the corresponding degree matrix
3 L_sym := D^(−1/2) (D − W) D^(−1/2) (the normalized graph Laplacian)
4 Compute the second-smallest eigenvector f of L_sym
5 Compute the vector g = D^(−1/2) f
6 Define the partition A = {i : g_i ≥ 0}, Ā = V \ A
7 Return A, Ā
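A direct translation of this procedure into Python/NumPy (my own sketch; it assumes the weight matrix is given in full and all degrees are positive):

import numpy as np

def normalized_spectral_bipartition(W):
    d = W.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_isqrt @ (np.diag(d) - W) @ D_isqrt
    evals, evecs = np.linalg.eigh(L_sym)
    f = evecs[:, np.argsort(evals)[1]]          # second-smallest eigenvector
    g = D_isqrt @ f
    return g >= 0                               # indicator of the set A (up to a sign flip)

W = np.zeros((6, 6))                            # two cliques, weakly connected
W[:3, :3] = 1; W[3:, 3:] = 1; np.fill_diagonal(W, 0)
W[2, 3] = W[3, 2] = 0.1
print(normalized_spectral_bipartition(W))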
that just separates one point from the rest of the space. This never happens for normalized spectral clustering. Details are beyond the scope of this lecture and require heavy functional analysis.
Literature: Luxburg, Bousquet, Belkin: Consistency of spectral clustering, 2008
D̃ = degree matrix of W̃
L̃ = D̃ − W̃
Regularization for spectral clustering (2)
No, we can still use the power method and exploit sparsity of the
graph:
(D̃^(−1/2) W̃ D̃^(−1/2)) v = D̃^(−1/2) (W + (τ/n) J) D̃^(−1/2) v
                          = D̃^(−1/2) W D̃^(−1/2) v + (τ/n) · 1 (1ᵀ v)
(the first term is a sparse matrix-vector product, the second term can be computed in O(n))
Introduction to learning theory
The framework
The data. Data points (Xi , Yi ) are an i.i.d. sample from some
underlying (unknown) probability distribution P on the space
X × {±1}.
f_F = argmin_{f ∈ F} R(f)
f_n = argmin_{f ∈ F} R_n(f)
Controlling the estimation error: generalization bounds
(1/n) Σ_{i=1}^n Z_i → E(Z)   (almost surely)
predict the training label; for all other points predict -1.
Uniform convergence
Want to have a condition that is sufficient for the convergence of
the empirical risk of the data-dependent function fn :
WHY EXACTLY?
P( sup_{f ∈ F} |R(f) − R_n(f)| ≥ ε ) → 0   as n → ∞.
|R(fn ) − R(fF )|
(by definition of fF we know that R(fn ) − R(fF ) ≥ 0)
= R(fn ) − R(fF )
Finite classes
P( |R(f) − R_n(f)| ≥ ε ) ≤ 2 exp(−2nε²).
Pr( sup_{f ∈ F} |R(f) − R_n(f)| ≥ ε )
|R(f) − R_n(f)| < ε.
δ = 2m exp(−2nε²)   ⟹   ε = √( (log(2m) + log(1/δ)) / (2n) )
With this, the proposition becomes the following generalization bound: with probability at least 1 − δ, for all f ∈ F,
R(f) ≤ R_n(f) + √( (log(2m) + log(1/δ)) / (2n) ).
Note that the generalization bound holds uniformly (with the same
error guarantee) for all functions in F, so in particular for the
function that a classifier might pick based on the sample points it
has seen.
Shattering coefficient
VC dimension
VC dimension: Definition
Definition: We say that a function class F shatters a set of points
X1 , ..., Xn if F can realize all possible labelings of the points, that
is |FX1 ,...,Xn | = 2n .
VC(F) = min{ d, 2R²/ρ² } + 1
R(f) ≤ R_n(f) + 2 √( (d log(2en/d) − log(δ)) / n ).
Proof skipped.
Rademacher complexity
Rademacher complexity
The shattering coefficient is a purely combinatorial object, it does
not take into account what the actual probability distribution is.
This seems suboptimal.
Definition: Fix a number n of points. Let σ1 , ..., σn be i.i.d. tosses
of a fair coin (result is -1 or 1 with probability 0.5 each). The
Rademacher complexity of a function class F with respect to n is
defined as
Rad_n(F) := E [ sup_{f ∈ F} (1/n) Σ_{i=1}^n σ_i f(X_i) ]
The expectation is both over the draw of the random points Xi and
the random labels σi .
It measures how well a function class can fit random labels.
R(f) ≤ R_n(f) + 2 Rad_n(F) + √( log(1/δ) / (2n) )
Regularization
Recap: regularized risk minimization:
minimize Rn (f ) + λ · Ω(f )
Regularization (2)
Proving consistency for regularization is technical but very elegant:
I Make sure that your overall space of functions F is dense in
the space of continuous functions
Example: linear combinations of a universal kernel.
I Consider a sequence of regularization constants λn with
λn → 0 as n → ∞.
I Define function class Fn := {f ∈ F ; λn · Ω(f ) ≤ const}
Regularization (3)
If you want to see the mathematical details, I recommend the
following paper:
Brief history
I The first proof that there exists a learning algorithm that is
universally Bayes consistent was the Theorem of Stone 1977,
about the kNN classifier.
I The combinatorial tools and generalization bounds have
essentially been developed in the early 1970ies already (Vapnik,
Chervonenkis, 1971, 1972, etc) and refined in the years around
2000.
I The statistics community also proved many results, in
particular rates of convergence. There the focus is more on
regression rather than classification.
I By and large, the theory is well understood by now, the focus
of attention moved to different areas of machine learning
theory (for example, online learning, unsupervised learning,
etc).
Examples revisited
Remember the examples we discussed in the first lecture?
Many of you had argued that unless we have a strong belief that
the right curve is correct, we should prefer the left one due to
“simplicity”.
Intuitive argument:
These formulations can be found in many papers and text books, I don’t know the original source ...
SKIPPED
(*) Loss functions, proper and surrogate losses:
SKIPPED
(*) Probabilistic interpretation of ERM:
Introduction: recommender systems, collaborative filtering
Recommender systems
Goal: give recommendations to users, based on their past behavior:
I Recommend movies (e.g., netflix)
I recommend music (e.g., lastfm)
I recommend products to buy (e.g., amazon)
Prominent example: Pandora Radio. You start with a song you like,
PROOFS: EXERCISE!
I Compute the SVD such that the singular values are sorted in
decreasing order.
I Keep the first k columns of U and V . Call the resulting
matrices Uk ∈ Rn×k and Vk ∈ Rd×k (such that Vk0 ∈ Rk×d ).
Given A and k, find the matrix Ak with rank at most k such that
kA − Ak kF is minimized.
Netflix problem
General problem:
I Consider a huge matrix of user ratings of movies. Rows
correspond to movies, columns correspond to users, entries are
ratings on a scale from 1 to 5.
I We only know few entries in this matrix.
minimize rank(M)   subject to   Σ_{(i,j) ∈ Ω} (m_ij − z_ij)² ≤ δ
This problem is NP hard ☹
Hard-Impute
I Have an initial guess for the missing entries ⇝ matrix Z_1
I Compute a low-rank approximation of Z_1 (via the truncated SVD) ⇝ matrix Z_2
I Fill in the missing entries with the ones of Z_2, and start over again ...
Sometimes this works reasonably.
But let’s try to think about alternatives ... one option for
non-convex optimization problems is always to construct a convex
relaxation (have seen this before, at least twice, where? )
‖σ‖₁ = Σ_i |σ_i|.
minimize ‖M‖_tr   subject to   Σ_{(i,j) ∈ Ω} (m_ij − z_ij)² ≤ δ     (∗)
minimize ½ Σ_{(i,j) ∈ Ω} (m_ij − z_ij)² + λ ‖M‖_tr     (∗∗)
These two problems are essentially the same, once in the natural
formulation (∗) and once in the regularization / Lagrangian
formulation (∗∗).
Soft-thresholding:
I Given the SVD of a matrix Z = U DV 0 , denote the singular
values by di .
I We define Sλ (Z) := U Dλ V 0 where Dλ is the diagonal matrix
with diagonal entries (di − λ)+ := max(di − λ, 0)
I Soft thresholding decreases the trace norm and also often
decreases the rank of a matrix.
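A minimal sketch of this operator (Python/NumPy, my own code):

import numpy as np

def soft_threshold_svd(Z, lam):
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(d - lam, 0.0)) @ Vt    # shrink singular values by lam

Z = np.random.default_rng(0).normal(size=(5, 4))
print(np.linalg.svd(Z, compute_uv=False).round(2))
print(np.linalg.svd(soft_threshold_svd(Z, 1.0), compute_uv=False).round(2))  # shrunk, lower rank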
Proposition 41
Consider a matrix Z that is completely known, and choose some
λ > 0. Then solution of the optimization problem
Introduce notation:
I Denote by Ω the set of matrix entries that are known.
(a) Repeat:
i. Compute Z new ← Sλk (PΩ (X) + PΩ⊥ (Z old )).
Note, however, that we can write this matrix as the sum of a sparse and a low rank matrix, and exploit this cleverly in the algorithm.
[Figure 7.2 from "Statistical learning with sparsity": root-mean-squared error on the Netflix training and test data for Hard-Impute and Soft-Impute; the dotted line on the right-hand side marks Netflix's own algorithm as baseline.]
Sanity check: random guessing would have RMSE≈2
If there is a column (or a row) that does not have any observed
entry, it will be impossible to reconstruct the matrix.
μ(U) := (d/r) max_{i=1,...,d} ‖P_U e_i‖²
Examples:
N ≥ Crp log p
minimize ‖X‖_*   subject to   X_ij = M_ij, (i, j) ∈ Ω     (1.3)
[Excerpt from the original paper: under incoherence assumptions (column and row spaces sampled from the Haar measure on the Grassmann manifold), the minimizer of (1.3) is unique and equal to M with high probability once on the order of μ² r (n₁ + n₂) polylog(n₂) entries are observed.]
Simulations: noise-free setting (3)
[Figure 7.4 from "Statistical learning with sparsity": convex matrix completion in the no-noise setting. Shown is the probability of exact completion (mean ± one standard error) as a function of the proportion of missing entries, for n × n matrices with n ∈ {20, 40}; the true rank is one in the left panel and five in the right panel.]
WHAT CAN YOU SEE?
I Plot
Results:
Simulations: noisy setting (2)
[Figure from "Statistical learning with sparsity": imputation error of matrix completion as a function of the proportion of missing entries, for 40 × 40 matrices. Shown is the average relative error (± one standard error) over 100 simulations, relative to the noise standard deviation.]
Outlook / literature
I We just scratched the surface, there are many more variants of
the problem, and also many more algorithms.
I If you are interested, the book “Statistical learning with
sparsity” is a good starting point.
History:
Compressed sensing
Book chapters:
• Hastie, Tibshirani, Wainwright: Statistical learning with
sparsity. 2015. Chapter 10.
Motivation
Consider the camera in your phone:
I If you take a picture, it first generates a raw image that is
stored by a pixel-based representation (e.g., rgb values for each
pixel).
I Then it compresses the picture by representing it in a suitable
basis (say, a wavelet basis) and generates a compressed version
Motivation (3)
Idea: it would be great if we could skip the first step and directly
capture the data in the better representation.
Applications:
I Cameras with little power / storage. Take a picture with less
Setup
Assume we observe a vector x ∈ Rd .
I Typically it is not sparse in the standard basis, that is, ‖x‖₀ is close to d
I But it might be sparse in a different basis: There exists an
orthonormal matrix U such that x = U α and α is a sparse
vector: kαk0 =: s is small
Setup (2)
Notation in the following:
I d dimension of the original space (high)
We always have s ≤ k ≤ d.
Figure 6. A schematic diagram of the “one-pixel camera.” The “DMD” is the grid of micro-mirrors that
reflect some parts of the incoming light beam toward the sensor, which is a single photodiode. Other
parts of the image (the black squares) are diverted away. Each measurement made by the photodiode is
a random combination of many pixels. In “One is Enough” (p.114), 1600 random measurements suffice
to create an image comparable to a 4096-pixel camera. (Figure courtesy of Richard Baraniuk.)
Compressed sensing
Another way to describe it:
x̃ = Wx ∈ R^k
minimize ‖x‖₁   subject to   x̃ = Wx
Intuitively:
x̃ := argmin{ ‖v‖₀ : v ∈ R^d, Wv = y }
‖x̃‖₀ ≤ ‖x‖₀ ≤ s
| ‖Wz‖₂² / ‖z‖₂² − 1 | = |0 − 1| = 1 > ε
argmin{ ‖v‖₀ : v ∈ R^d, Wv = y } = argmin{ ‖v‖₁ : v ∈ R^d, Wv = y }
Remarks:
entry is drawn randomly from a normal distribution N(0, 1/s). Then, with probability 1 − δ (over the choice of the matrix),
the matrix W is (ε, s)-RIP.
(ii) More generally, if U is any d × d orthonormal matrix, then with
probability 1 − δ, the matrix W U is (ε, s)-RIP.
Remarks:
I This result is closely related to the theorem of
Johnson-Lindenstrauss, which is widely used in randomized
algorithms.
I The second part of the theorem takes care of the situation
Example: time series (2)
Signal in the "default basis" (say, the time series itself, not sparse):
[Figure (FIG5): heuristic procedure for reconstruction from undersampled data. Equispaced undersampling results in signal aliasing, preventing recovery; random undersampling results in incoherent interference. Strong signal components stick above the interference and are recovered by thresholding; their interference is then computed and subtracted, lowering the interference level and enabling recovery of weaker components.]
Example: time series (3)
Obvious first idea: equispaced undersampling.
I Just measure ("sense") the signal at equispaced positions (in the image on the previous slide, at the positions indicated by the red dots at the bottom).
I Replace the remaining entries with 0.
I Go over to the sparse basis and represent the signal there.
[Figure: equispaced undersampling leads to aliasing; the reconstruction is ambiguous.]
Example: time series (4)
The compressed sensing approach: random undersampling.
I Instead of sampling at equispaced positions, randomly pick some entries (in the image before, this is indicated by the red dots at the top).
I Try to represent the image in the sparse basis:
Works! If we threshold the small Fourier coefficients, we are left with the sparse representation of the signal.
[Figure: random undersampling produces incoherent, noise-like artifacts from which the sparse signal can be recovered.]
Figure 2. Reconstructing a sparse wave train. (a) The frequency spectrum of a 3-sparse signal. (b) The
signal itself, with two sampling strategies: regular sampling (red dots) and random sampling (blue dots).
(c) When the spectrum is reconstructed from the regular samples, severe “aliasing” results because the
number of samples is 8 times less than the Shannon-Nyquist limit. It is impossible to tell which frequen-
cies are genuine and which are impostors. (d) With random samples, the two highest spikes can easily
be picked out from the background. (Figure courtesy of M. Lustig, D. Donoho, J.Santos and J. Pauly,
Compressed Sensing MRI, Signal Processing Magazine, March 2008. ⃝2008 c IEEE.)
Example: Images
Example taken from: MacKenzie, Dana. Compressed sensing makes
every pixel count. What is happening in the mathematical sciences
7 (2009): 114-127.
Original noisy image. Shown is the image itself and (I guess) the
coefficients in Fourier (Wavelet?) basis. Signal is sparse in this
basis (but of course, it was not recorded in this basis, here the
Many artifacts.
Nice :-)
[Figure caption: compressed sensing with noisy data. (a) An image with added noise. (b) The image, undersampled and reconstructed using the Shannon-Nyquist approach; as in Figure 2, artifacts appear in the reconstructed image. (d) The same image, undersampled randomly and reconstructed with a "too optimistic" noise model; although there are no artifacts, some of the noise has been misinterpreted as real signal.]
Outlook
I Active area of research
I Lots of actual applications!!! Cameras, MRI scanning, etc
Ranking from pairwise comparisons
Introduction
Text books (but I don’t like both chapters so much):
I Mohri et al. chapter 9
Introduction, informal
I Ranking candidates for a job offering
I Ranking of the world’s best tennis players
I Ranking of search results in google
I Ranking of molecules according to whether they could serve as
a drug for a certain disease
d_τ(π, π̂) := (2 / (n(n−1))) Σ_{i=1}^{n} Σ_{j=i+1}^{n} 1{ sign(π(i) − π(j)) ≠ sign(π̂(i) − π̂(j)) }
Note that it only looks at the unordered sets, not at the order
within the sets.
I Normalized discounted cumulative gain (NDCG): We take the
ranking π as “reference ranking”. Then we compare it to the
second ranking π̂, but we weight errors among the top items of
π more severely than errors for items at the bottom of the
ranking π. Many different ways in which this can be done ...
The model
Ground truth model is very general:
I n objects
τ_i := (1/n) Σ_{j=1}^n P(i ≻ j)
This algorithm is about the simplest thing you can come up with; it is sometimes called Borda count or Copeland method in the literature.
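One natural reading of this counting algorithm as code (Python/NumPy, my own sketch): estimate τ_i by the empirical win frequency of item i and rank by these scores.

import numpy as np

def borda_scores(comparisons, n):
    wins, played = np.zeros(n), np.zeros(n)
    for winner, loser in comparisons:              # each observed comparison (winner, loser)
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
    return wins / np.maximum(played, 1)            # empirical win frequencies

rng = np.random.default_rng(0)
n, comparisons = 5, []                             # true order: 0 > 1 > 2 > 3 > 4
for _ in range(500):
    i, j = rng.choice(n, 2, replace=False)
    winner = min(i, j) if rng.random() < 0.8 else max(i, j)   # noisy comparisons
    comparisons.append((winner, i + j - winner))
print(np.argsort(-borda_scores(comparisons, n)))   # estimated ranking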
Ψ_k(n, r, p_obs) := (τ_k − τ_{k+1}) · √( n · p_obs · r / log n )
(the first factor is the separation parameter, the second the sampling parameter)
Example 1:
I Assume the true ordering is o_1 ≻ o_2 ≻ ... ≻ o_n, that is, the best player is o_1. Our goal is to find the best player (that is, k = 1).
I Assume a noise-free setting: a better player always wins against a worse player, that is, p_ij = 1 if o_i ≻ o_j and 0 otherwise.
(As a side remark: the lower bound also holds if we make the
can show that the deviations of the random variables are small.
I In particular, we can then bound the probability that one of
the top-k items “is beaten” (in terms of τ̂i ) by one of the
not-top-k items.
See the following figure:
P_a(i ≻ j) = 1/2        if i, j ∈ S*(a) or i, j ∉ S*(a)
             1/2 + δ    if i ∈ S*(a) and j ∉ S*(a)
             1/2 − δ    if i ∉ S*(a) and j ∈ S*(a)
I Note that the true τ -values give the correct top-k set.
I Our goal is to identify the true permutation based on
observations, that is we want to find the correct parameter a
that has been used.
“information”.
I Assume that a is chosen uniformly from k, ..., n. Then we
sample observations according to the model Pa .
Details skipped.
Proof: union bound with the previous theorem (union bound leads
to power 13 instead of 14).
Approximate recovery
Result looks surprisingly similar. Just the separation term now no longer depends just on τ_k − τ_{k+1}, but on all τ-values in a certain neighborhood of k (where the size of the neighborhood depends on the error we are allowed to make).
Details skipped.
Discussion
I On a high level, the theorem shows two things:
I Ranking from noisy data is difficult if we don’t make any
assumptions.
I You cannot improve on the counting algorithm — unless you
do make more assumptions.
I In practice, the query complexity of n3 log n is completely out
Ulrike von Luxburg: Statistical Machine Learning
Learning to rank
Learning to rank
I Objects x1 , ..., xn .
I Observations of the form xi ≺ xj . Encode this as follows:
I Consider the space S of all unordered pairs of objects.
I Output variable y_ij = +1 if x_i ≺ x_j, and y_ij = −1 otherwise.
I Goal: learn a classifier that makes as few mistakes as possible.
implies xi ≺ xi .
I
examples are enough for good classification performance.
I Both approaches make only minimalistic / no assumptions
whatsoever on the structure of the numbers we want to sort.
WHERE IS THE CATCH?
ranking.
I Learning to rank: the bound only considers the estimation
error of the classifier, when applied to predict unobserved
comparisons. [As a side remark: in the given example even
the Bayes classifier would have a poor performance close to
random guessing. The generalization bound just tells us that
we need not so many comparisons to come close to the
performance of the actual Bayes classifier. ]
SVM ranking
As ERM is infeasible computationally, we could use a linear SVM
instead:
I Encode an ordered pair of objects by a feature vector
xij := ei − ej ∈ Rn and the outcome yij as described above.
I Get training points of the form (xij , yij ).
It is often easy to say that things “are pretty similar” or “not similar
at all”, but it is hard to come up with good ways to quantify this.
dist( , ) = 0.1
dist( , ) = 0.7
The full distance ranking can then for example be used to find the
And this is also a nice example for the type of questions we work in
in my group - we just proved this result a couple of weeks ago...
Spectral ranking
Based on the following paper:
Fogel, d’Aspremont, Vojnovic: SerialRank: Spectral ranking using
seriation. NIPS 2014.
Spectral ranking
Setting as before: we observe pairwise comparison, want to output
a ranking.
0 if no data exists
I
corrupted entries (selected uniformly at random). Then if m = O(√(δn)), the SerialRank algorithm will produce the ground truth ranking with probability at least 1 − δ.
This is the interesting statement.
Proofs are based on some old work by Atkinson 1998, we skip them.
The setting
Want to build a search engine:
I Query comes in
The new idea by the google founders was to instead look at the link
structure of the webpages.
Page Rank
Published by Brin, Page, 1998.
Main idea:
I A webpage is important if many important links point to that
page.
I A link is important if it comes from an page that is important.
r(j) = Σ_{i ∈ parents(j)} r(i) / d_out(i)     (∗)
rᵀ = rᵀ · A
So r is a left eigenvector of A with eigenvalue 1.
Intuition: if the random walk runs for a long time, it will converge
to an equilibrium distribution. It is called the stationary distribution
or the invariant distribution.
πP = π
I disconnected components
αP + (1 − α) (1/n) 1
where n is the number of vertices and 1 the constant-one matrix.
Av = A( Σ_i a_i v_i ) = Σ_i a_i (A v_i) = Σ_i a_i λ_i v_i
Caveat:
I Won’t work if q0 ⊥ first eigenvector
r_{k+1}ᵀ = r_kᵀ ( αA + (1 − α) e vᵀ )
         = α r_kᵀ A + (1 − α) (r_kᵀ e) vᵀ          (r_kᵀ e = 1)
         = α r_kᵀ A + (1 − α) vᵀ                    (r_kᵀ A is a sparse matrix-vector product)
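A minimal power-iteration sketch for this update (Python/NumPy, my own code; A is assumed to be the row-stochastic transition matrix and v a probability vector):

import numpy as np

def pagerank(A, alpha=0.85, v=None, n_iter=100):
    n = A.shape[0]
    v = np.full(n, 1.0 / n) if v is None else v    # personalization vector
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        r = alpha * (r @ A) + (1 - alpha) * v      # r stays a probability distribution
    return r

A = np.array([[0, 1, 0],                           # tiny link graph: 0->1, 1->2, 2->0, 2->1
              [0, 0, 1],
              [0.5, 0.5, 0]])
print(pagerank(A).round(3))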
Comments:
I v is the “personalization vector” (≈ probability over all
webpages of whether the surfer would like to see that page)
The data processing chain: from raw data to machine learning
Are all the potential test cases covered in the training data, with all
the variety that exists?
I If you want to classify digits, are all of them going to be
upright? If not, add digits in all orientations to your system.
I
machine learning class, about whether you like the class or not.
Example: you want to detect faces in images, and you have a set of
images (with and without faces).
I What exactly do you use as training examples? The whole
image? The part of the image that contains a face / not a
Outlier detection
Data often contains “outliers”:
I Model-free approach:
I Want to find a set S which has two properties:
it is as small as possible, but it contains most of the data
points.
Missing values
Often data is not complete, you have missing values. Two big
cases:
I Missing at random: this is the easier case, no bias introduced
by the fact that data is missing.
I Missing not at random: values could be missing systematically, and
the fact that a value is missing might contain information:
Choice of features
I Choose reasonable features if you can.
I If you want to classify texts into different topics, a bag of
words seems reasonable. A “bag of letters” would not be
reasonable.
I Don’t be shy with including features. If in doubt, always
include a questionable feature. Many supervised algorithms like
I But in many cases this is a bad idea. Note that the numerical
values 1, 2, 3 suggest that a detective story is closer to a novel
than to a children’s book (as similarity between feature vectors
we use the scalar product, and it implicitly encodes this kind of
intuition). MAKE SURE YOU UNDERSTAND THIS POINT!
Example:
Feature 1: is it a detective story? (0 or 1)
Feature 2: is it a novel? (0 or 1)
And so on.
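A minimal sketch of this one-hot encoding in plain Python/NumPy; the genre list is invented for illustration:

    import numpy as np

    genres = ["detective", "novel", "childrens_book"]   # hypothetical categories
    def one_hot(genre):
        x = np.zeros(len(genres))
        x[genres.index(genre)] = 1.0
        return x

    # Scalar products between one-hot vectors no longer impose a spurious ordering:
    print(one_hot("detective") @ one_hot("novel"))       # 0.0
    print(one_hot("detective") @ one_hot("detective"))   # 1.0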
Images: BiochemLabSolutions.com
Data compression
I Sometimes the data set we have is too big to be processed
(here I refer to the number n of points, not to their dimension
d).
I Ideally, we would like to have a smaller data set, but we don’t
want to lose much information.
I Most importantly, by reducing the data set we do not want to
Dimensionality reduction
Dimensionality reduction is a very useful preprocessing step if the
dimensionality of the data is high. Unless one removes too many
dimensions, it very often helps to improve classification accuracy
(intuitively, we remove noise from the data).
Data standardization
Data standardization
It is very common to use data standardization (centering and
normalizing), and it almost never hurts.
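A minimal NumPy sketch, assuming X is an n × d data matrix with one point per row:

    import numpy as np

    def standardize(X, eps=1e-12):
        # Center each feature and scale it to unit standard deviation.
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        return (X - mu) / (sigma + eps)

    X = np.random.randn(100, 3) * np.array([1.0, 10.0, 0.1]) + np.array([5.0, -2.0, 0.0])
    Xs = standardize(X)
    print(Xs.mean(axis=0).round(6), Xs.std(axis=0).round(6))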
Clustering
Clustering
Sometimes it makes sense to understand the cluster structure of
your data.
I Many small clusters to reduce the size of your data set (vector
quantization).
I Few big clusters of data. You might then treat the classes
differently, or use learning algorithms that exploit the cluster
Unbalanced classes.
I Assume your training data consists of 1000 points, but just 10
of them are from class +1, and the other 990 from class -1.
I If you now use a standard loss function, it is very likely that
the best classifier is the one that simply predicts -1
everywhere. WHY?
I To circumvent this problem you have to reweight the loss
function such that training errors on points of class +1 are
punished much more than errors on points of class -1.
R_n(f) = Σ_{i=1}^n w_i ℓ(X_i, Y_i, f(X_i))
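One possible sketch of such a reweighting, here simply inversely proportional to the class frequencies (many libraries offer this directly, e.g. scikit-learn's class_weight='balanced'):

    import numpy as np

    def class_weights(y):
        # Weight each point inversely proportional to the size of its class.
        classes, counts = np.unique(y, return_counts=True)
        w_per_class = {c: len(y) / (len(classes) * cnt) for c, cnt in zip(classes, counts)}
        return np.array([w_per_class[label] for label in y])

    y = np.array([+1] * 10 + [-1] * 990)
    w = class_weights(y)
    print(w[:1], w[-1:])   # points of class +1 get a much larger weight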
Training
Multi-class approaches
Assume we are given a classification problem with K classes, labeled
1, .., K. Further, assume that the ordering of these labels is not
important (“closeness” of the labels does not have any meaning).
One-versus-all
I For each k ∈ {1, .., K} we train a binary classifier where the
first class contains all points of class k and the second class
contains all remaining points.
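A minimal one-versus-all sketch on top of an arbitrary binary classifier; LogisticRegression and the random toy data are placeholders (scikit-learn also ships OneVsRestClassifier for exactly this):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_one_vs_all(X, y, K):
        # One binary classifier per class k: class k against the rest.
        return [LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int)) for k in range(K)]

    def predict_one_vs_all(classifiers, X):
        # Assign each point to the class whose classifier is most confident.
        scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
        return scores.argmax(axis=1)

    X = np.random.randn(300, 2)
    y = np.random.randint(0, 3, size=300)   # toy labels for 3 classes
    models = train_one_vs_all(X, y, K=3)
    print(predict_one_vs_all(models, X[:5]))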
One-vs-one.
Comments.
I One can prove that both approaches lead to Bayes consistent
classifiers if the underlying binary classifiers are Bayes
consistent.
I There also exist more complicated schemes, but in practice
they don’t really perform better than the simple ones.
I Multi-class scenarios are also an active field of research.
I But you also might want to figure out whether certain design
choices make sense, for example whether it is useful to remove
outliers in the beginning or not.
It is very important that all these choices are made appropriately.
Cross validation is the method of choice for doing that.
5 Remove fold k from the training data and train with parameters s.
6 Compute the validation error err(s, k) on fold k
7 Compute the average validation error over the folds:
err(s) = Σ_{k=1}^K err(s, k) / K.
8 Select the parameter combination s that leads to the best
validation error: s∗ = argmins∈S err(s).
9 OUTPUT: s∗
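A compact sketch of this procedure in Python; the candidate set and the SVM are stand-ins for whatever parameters and learner you actually use (scikit-learn's GridSearchCV automates the same loop):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    def cv_select(X, y, candidates, K=5):
        best_s, best_err = None, np.inf
        for s in candidates:                      # s = parameter combination
            errs = []
            for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
                clf = SVC(C=s).fit(X[train_idx], y[train_idx])
                errs.append(1.0 - clf.score(X[val_idx], y[val_idx]))   # err(s, k)
            err_s = np.mean(errs)                 # average over the folds
            if err_s < best_err:
                best_s, best_err = s, err_s
        return best_s

    X, y = np.random.randn(200, 5), np.random.randint(0, 2, 200)
    print(cv_select(X, y, candidates=[0.1, 1.0, 10.0]))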
soon as you try to “improve the test error”, the test data
effectively gets part of the training procedure and is spoiled.
folds might not be the best ones on the whole data set
(because the latter is larger).
I It is very difficult to prove theoretical statements that relate
the cross-validation error to the test error (due to the high
dependency between the training runs). In particular, the CV
error is not unbiased; it tends to underestimate the test error.
Feature selection
Literature:
I Text book: Shalev-Shwartz/Ben-David, Chapter 25
I Computational reasons
I We simply might want to “understand” our classifier. For
example, in a medical context people don’t want to have a
black-box classifier that simply suggests a certain treatment.
They want to know what are the reasons for this choice.
Filter methods
General procedure:
I Take the training data (points and labels)
General properties:
I Faster than wrapper methods, less overfitting than wrapper
methods
I But independent of the actual classifier we use, which might
not be such a good idea
Drawbacks:
I It can happen that two features are only indicative if they are
both in the set of features (but each alone is not very
indicative). The naive sequential method might miss this.
I You can never “undo” a choice.
Wrapper methods
As opposed to filter methods, the wrapper methods repeatedly train
the actual learning algorithm we want to use.
I Train the classifier with different sets of features and compute
the cross validation error.
I To select the final set of features, use the “smallest set” that
still produces a good cross validation error.
Advantage: We only select features if they are really useful for the
actual algorithm we use.
Disadvantage:
I Computationally very expensive (we have to retrain the
classifier over and over again).
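A sketch of one simple wrapper, greedy forward selection with an inner cross validation; the classifier and toy data are placeholders:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression

    def forward_selection(X, y, max_features):
        selected, remaining = [], list(range(X.shape[1]))
        for _ in range(max_features):
            # Add the single feature that improves the CV score the most.
            scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                         X[:, selected + [j]], y, cv=5).mean()
                      for j in remaining}
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected

    X, y = np.random.randn(200, 10), np.random.randint(0, 2, 200)
    print(forward_selection(X, y, max_features=3))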
I etc
Note: indeed,
I tp/P ∈ [0, 1]
I fp/N ∈ [0, 1]
ROC and AUC (3)
I Here you compare the statistic erri − ẽrri against the statistic
obtained when you randomly exchange erri and ẽrri .
I Read up on permutation tests to see how this works ...
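A sketch of such a paired permutation test in NumPy; here err and err2 stand for the per-fold (or per-repetition) errors of the two classifiers, with toy values:

    import numpy as np

    def paired_permutation_test(err, err2, n_perm=10000, seed=0):
        # Test statistic: mean difference between the paired error values.
        diff = np.asarray(err) - np.asarray(err2)
        observed = diff.mean()
        rng = np.random.default_rng(seed)
        # Randomly exchanging err_i and err2_i flips the sign of each difference.
        signs = rng.choice([-1, 1], size=(n_perm, len(diff)))
        permuted = (signs * diff).mean(axis=1)
        return np.mean(np.abs(permuted) >= np.abs(observed))   # two-sided p-value

    err  = [0.10, 0.12, 0.09, 0.11, 0.13]
    err2 = [0.14, 0.15, 0.13, 0.16, 0.14]
    print(paired_permutation_test(err, err2))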
I For example, the t-test assumes the “data” (in this case, the
values erri and ẽrri ) to be normally distributed, independent,
Some references
There is lots of research on how to compare classifiers.
Sinica, 2006.
selection.
I The learning approach:
I Consider to incorporate weights in your loss function if your
classes are unbalanced.
I Use regularization to prevent overfitting.
I Always use cross-validation to set your parameters / decide on
design principles.
I Make sure you run the final test on an independent set!
High-level guidelines
Try not to lose sight of the following, very general (and abstract)
principles:
I Try to incorporate any prior knowledge about your data into
the process of learning.
I Try to understand the inductive bias used by your learning
algorithm. (Remember, learning is impossible without an
inductive bias.)
SKIPPED
(*) Online learning
Online learning
Literature:
On the level of text books:
I Chapter 7 in Mohri/Rostamizadeh/Talwalkar
Warmup
But the circumstances are quite different than in the batch setting:
I The distribution over these examples can change over time
I We incur an error if pt 6= yt .
?!?!??!?!??!?!??!?!??!
Looks pretty hopeless ... but looking at it the right way shows that
it isn’t ...
Minimizing regret
What would be a good way to measure the success of an online
learning algorithm? Above we have already seen that the absolute
number of errors is not a good measure for how successful an
algorithm is.
Regret(T) := sup_{f∈F} sup_{(x_1,y_1),...,(x_T,y_T)} ( Σ_{t=1}^T |p_t − y_t| − Σ_{t=1}^T |f(x_t) − y_t| )
Intuition:
I Assume there exists one expert that is very good.
I The weights of all other experts (the ones that predict worse
than the best expert) get exponentially smaller than the weight
of the best one.
I So in the long run, most of the weight will be accumulated by
the best expert.
I So in the long run, our prediction will be close to the one by
the best expert.
6       ŷt ← 1
7   else ŷt ← 0
8   Receive(yt)
9   if (ŷt ≠ yt) then
10      for i ← 1 to N do
11          if (yt,i ≠ yt) then
12              wt+1,i ← β wt,i
13          else wt+1,i ← wt,i
14  return wT+1
Theorem 7.3
Fix β ∈ (0, 1). Let mT be the number of mistakes made by algorithm WM after T ≥ 1
rounds, and m∗T be the number of mistakes made by the best of the N experts. Then,
the following inequality holds:
Randomized weighted majority (3)
When the expert is chosen at random according to w(t), the loss is
defined to be the averaged cost, namely Σ_i w(t)_i v_{t,i} = ⟨w(t), v_t⟩.
The following algorithm is often called the exponential weights
algorithm. It assumes that the number of rounds T is given; Exercise
21.4 shows how to get rid of this dependence using the doubling trick.

Weighted-Majority
input: number of experts, d; number of rounds, T
parameter: η = √(2 log(d)/T)
initialize: w̃(1) = (1, . . . , 1)
for t = 1, 2, . . .
    set w(t) = w̃(t)/Z_t where Z_t = Σ_i w̃(t)_i
    choose expert i at random according to P[i] = w(t)_i
    receive costs of all experts v_t ∈ [0, 1]^d
    [...]
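A minimal NumPy sketch of this scheme; the cost vectors are random just to make it runnable, and the multiplicative update w̃_i ← w̃_i · e^{−η v_{t,i}} used below is the standard exponential-weights rule:

    import numpy as np

    def exponential_weights(costs, eta):
        # costs: T x d matrix, costs[t, i] = loss of expert i in round t (in [0, 1]).
        T, d = costs.shape
        w_tilde = np.ones(d)
        total_cost = 0.0
        for t in range(T):
            w = w_tilde / w_tilde.sum()              # normalize: w(t) = w~(t)/Z_t
            total_cost += w @ costs[t]               # expected cost <w(t), v_t>
            w_tilde *= np.exp(-eta * costs[t])       # multiplicative update
        return total_cost

    T, d = 1000, 5
    costs = np.random.default_rng(0).uniform(size=(T, d))
    eta = np.sqrt(2 * np.log(d) / T)
    # Compare the algorithm's cost to the cost of the best expert in hindsight:
    print(exponential_weights(costs, eta), costs.sum(axis=0).min())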
Randomized weighted majority (4)
The following theorem is key for analyzing the regret bound of
Weighted-Majority.
Theorem 21.11. Assuming that T > 2 log(d), the Weighted-Majority
algorithm enjoys the bound
Σ_{t=1}^T ⟨w(t), v_t⟩ − min_{i∈[d]} Σ_{t=1}^T v_{t,i} ≤ √(2 log(d) T).
Proof: Skipped, based on deriving upper and lower bounds on the
potential function W_t := Σ_{i=1}^N w_{t,i}. See book by Mohri et al.
Discussion:
I The number of mistakes the algorithm makes in T rounds is of
the order √T.
I There exists a lower bound that shows that this bound is
optimal: there does not exist any algorithm that achieves a
better asymptotic guarantee.
Example 1: routing
I You drive to work every day.
I Goal is, in the long run, to have a good strategy for selecting
the route. The learning setting is as follows:
I Each day you choose one of them, and you observe how much
time it takes you.
I But you don’t get any feedback on how much time it would
have taken to take any of the other routes! → limited
feedback
I After the user has seen the ad, you record whether he clicked
on it or not.
I However, you don’t have any feedback on what the user
would have done if you had shown him a different ad.
I But you don’t know what you would have gained at another
machine.
Note that an interesting new aspect comes into the game:
exploration vs. exploitation ...
Main idea:
I You maintain a weight vector w
Aizerman 1964).
Winnow
Winnow also constructs linear classifiers. But it uses multiplicative
updates rather than additive ones:
I If the prediction of an expert is correct, its weight is increased.
I If the prediction of an expert is incorrect, its weight is decreased.
Winnow (2)
Winnow(η)
1   w1 ← 1/N
2   for t ← 1 to T do
3       Receive(xt)
4       ŷt ← sgn(wt · xt)
5       Receive(yt)
6       if (ŷt ≠ yt) then
7           Zt ← Σ_{i=1}^N wt,i exp(η yt xt,i)
8           for i ← 1 to N do
9               wt+1,i ← wt,i exp(η yt xt,i) / Zt
10      else wt+1 ← wt
11  return wT+1
(Figure 7.10 in Mohri et al.: Winnow algorithm, with yt ∈ {−1, +1} for all t ∈ [1, T].)
The principle is reminiscent of weighted majority in the expert
setting...
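A direct NumPy transcription of this pseudocode, with invented toy data so that it runs:

    import numpy as np

    def winnow(X, y, eta=0.5):
        # X: T x N matrix of inputs, y: labels in {-1, +1}.
        T, N = X.shape
        w = np.full(N, 1.0 / N)
        for t in range(T):
            y_hat = np.sign(w @ X[t])
            if y_hat != y[t]:
                w = w * np.exp(eta * y[t] * X[t])
                w /= w.sum()                        # normalize by Z_t
        return w

    rng = np.random.default_rng(0)
    X = rng.choice([-1.0, 1.0], size=(200, 10))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1])            # toy target depending on two coordinates
    print(winnow(X, y).round(3))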
History
I There is a close connection of online learning to game theory,
and a number of important results have already been proved in
the 1950s (regret of order √T, but with a linear dependency on d)
I Perceptron: dates back to Rosenblatt 1958; first theoretical
analysis by Minsky and Papert 1969 and Novikoff 1962
I Weighted majority and its variants are due to Littlestone and
Warmuth
Wrap up
Keywords to discuss:
I blind, double-blind,
Keywords are:
I program chair = editor in chief
Journal:
I impact factors, bogus!!!!
I editorial board
... of a researcher:
I citation numbers, h-index
I prizes
the group.
I Check how much and where all the other PhD students publish.
I How many international guests the group had during the last
year.
I To which conferences the people in the group go regularly. Is
there travel money?
I Are there any regular activities going on? Reading groups,
seminars, ...?
But be aware: using such tools can only get you so far.
I You don’t have the freedom to make all the design choices you
want.
Also good:
I AISTATS
I UAI
Mathematical Appendix
Conditional probabilities
Define the probability of event A under the condition that event B
has taken place:
P(A | B) = P(A ∩ B) / P(B)
Solution:
A = {3}, B = {1, 3, 5}, P(A ∩ B) = P({3}) = 1/6, P(B) = 1/2,
this implies P({3} | “odd”) = (1/6)/(1/2) = 1/3.
Important formulas
I Union bound. Let A1 , ..., Ak be any events. Then
P(A_1 ∪ A_2 ∪ ... ∪ A_k) ≤ Σ_{i=1}^k P(A_i)
Intuitive reason:
P(B | A) = P(B ∩ A) / P(A) = P(A | B) · P(B) / P(A)
Example:
The probability that a woman has breast cancer is 1%. The
probability that the disease is detected by a mammography is 80%.
Given:
I P(B) = 0.01
I P(A | B) = 0.80
I P(A | ¬B) = 0.096
I Need to compute P(A). Here we use the total probability:
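With these numbers, the total probability and Bayes’ formula give (a quick computation, worked out here for concreteness):
P(A) = P(A | B) · P(B) + P(A | ¬B) · P(¬B) = 0.80 · 0.01 + 0.096 · 0.99 ≈ 0.103
P(B | A) = P(A | B) · P(B) / P(A) = 0.008 / 0.103 ≈ 0.078
So a positive mammography corresponds to a breast cancer probability of only about 8%.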
Random variables
A random variable is a function X : Ω → R.
Example:
I We have 5 red and 5 black balls in an urn
PX (A) = P (X ∈ A)
P(X = k) = (n choose k) p^k (1 − p)^{n−k}

P(X = k) = λ^k e^{−λ} / k!
The Poisson distribution counts the occurrence of “rare
events” in a fixed time interval (like radioactive decay), λ is
the intensity parameter.
Independence
Two events A, B are called independent if
P (A ∩ B) = P (A) · P (B).
Example:
I Throw a coin twice. X = result of the first toss, Y = result of
the second toss. These two random variables are independent.
Independence (2)
I Throw a coin twice. X = result of the first toss, Y = sum of
the two results. These two random variables are not
independent.
Expectation
For a discrete random variable X : Ω → {r1 , ..., rk } its expectation
(mean value) is defined as
E(X) := Σ_{i=1}^k r_i · P(X = r_i)
Expectation (2)
Important formulas and properties:
I The expectation is linear: for random variables X1 , ..., Xn and
real numbers a1 , ..., an ∈ R,
E(Σ_{i=1}^n a_i X_i) = Σ_{i=1}^n a_i E(X_i)
Variance
The variance of a random variable is defined as
Var(X) = E((X − E(X))²) = E(X²) − (E(X))²

Var(X) = Σ_i (r_i − E(X))² · P(X = r_i)
Variance (2)
Example:
I We throw a biased coin, heads occurs with probability p, tails
comes with probability 1 − p. We assign the random variable
X = 1 for heads and X = 0 for tails.
I We have already seen: E(X) = p.
Variance (3)
I If X, Y are independent random variables, then
Var(X + Y) = Var(X) + Var(Y).
Standard deviation
The standard deviation of a random variable is just the square root
of the variance:
std(X) = √Var(X)
Properties:
I Cov(X, Y ) = Cov(Y, X)
P(|X − E(X)| ≥ t) ≤ Var(X) / t²
spaces.
P(X ∈ A, Y ∈ B | Z ∈ C) = P(X ∈ A | Z ∈ C) · P(Y ∈ B | Z ∈ C)
Var_d(X) = Σ_{i=1}^d E(‖X^(i) − E(X^(i))‖²)
“empirical”...
P(X ∈ A) = ∫_A p(x) dx
I ... in this course we won’t discuss this.
g : R → R, g(x) = P (X ≤ x)
Uniform distribution
The uniform distribution on [0, 1]: for 0 ≤ a < b ≤ 1 we define
P (X ∈ [a, b]) = b − a
f_{µ,σ}(x) = 1/√(2πσ²) · exp(−(x − µ)² / (2σ²))
I The special case of mean 0 and variance 1 is called the
“standard normal distribution”. Sometimes the normal
distribution is also called a Gaussian distribution.
f_{µ,Σ}(x) = 1/√((2π)^d det(Σ)) · exp(−(1/2) (x − µ)′ Σ^{−1} (x − µ))
Mixture of Gaussians
When generating toy data for machine learning applications, one
often uses a mixture of Gaussian distributions:
f(x) = Σ_{i=1}^k α_i f_{µ_i,Σ_i}(x)
Mixture of Gaussians (2)
[Figure: density plot of a mixture of Gaussians.]
Expectation
In the continuous domain, sums are going to be replaced by
integrals. For example, the expectation of a random variable X
with density function p(x) is defined as
E(X) = ∫_R x · p(x) dx
I On the homepage you can also find the link to a short linear
algebra recap writeup (by Zico Kolter and Chuong Do).
Vector space
A vector space V is a set of “vectors” that supports the following
operations:
I We can add and subtract vectors: For v, w ∈ V we can build
v + w, v − w
I We can multiply vectors with scalars: For v ∈ V , a ∈ R we can
build av.
Basis
A basis of a vector space is a set of vectors b1 , ..., bd ∈ V that
satisfies two properties:
I Any vector in V can be written as a linear combination of
basis vectors:
For any v ∈ V there exist a1 , ..., ad ∈ R such that
v = Σ_{i=1}^d a_i b_i
Basis (2)
Example:
I e1 := (1, 0) and e2 := (0, 1) form a basis of R2
Linear mappings
A linear mapping T : V → V satisfies
T (av1 + bv2 ) = aT (v1 ) + bT (v2 ) for all a, b ∈ R, v1 , v2 ∈ V .
Typical linear mappings are: stretching, rotation, projections, etc.,
and combinations thereof.
Matrices
m × n-matrix A:
Matrices (2)
Transpose of a matrix , written as At or A0 is the matrix where we
exchange rows with columns (that is, instead of aij we have aji ).
Matrices (3)
We can multiply two matrices if their “dimensions” fit:
X an m × n matrix, Y an n × k matrix. Then Z := X · Y is an
m × k matrix with entries
z_{ij} = Σ_{l=1}^n x_{il} y_{lj}
[Figure: the (i, j) entry of Z is the product of row i of X with column j of Y.]
Matrices (4)
Special case where Y is a vector of length n × 1 is called
matrix-vector-multiplication:
z = Xy with z_i = Σ_j x_{ij} y_j
Inverse of a matrix
I For some matrices A we can compute the inverse matrix A−1 .
It is the unique matrix that satisfies
A · A−1 = A−1 · A = Id
(the identity matrix, which has ones on the diagonal and zeros
everywhere else).
I A matrix is called invertible if it has an inverse matrix.
I A square matrix is invertible if and only if it has full rank.
Intuition:
I The scalar product is related to the angle between the two
vectors:
I hv, wi = 0 ⇐⇒ v ⊥ w (vectors are orthogonal)
I If v and w have norm 1, then hv, wi is the cosine of the angle
between the two vectors.
I The norm is the length of a vector.
A = V DV t
Φ = U ΣV t
where
I U ∈ Rn×n is orthogonal. Its columns are called
left singular vectors.
I Σ ∈ Rn×d is a diagonal matrix containing the singular values
σ1 , ..., σd on the diagonal
I V ∈ Rd×d is an orthogonal matrix. Its columns are called
right singular vectors.
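A quick NumPy check of this decomposition on an invented toy matrix:

    import numpy as np

    Phi = np.random.default_rng(0).normal(size=(6, 4))   # toy n x d data matrix
    U, sigma, Vt = np.linalg.svd(Phi, full_matrices=True)
    # U: 6x6 orthogonal, sigma: the 4 singular values, Vt: 4x4 orthogonal (rows = right singular vectors)
    Sigma = np.zeros((6, 4))
    Sigma[:4, :4] = np.diag(sigma)
    print(np.allclose(Phi, U @ Sigma @ Vt))               # True: Phi = U Sigma V^t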
Equivalent formulations:
I Positive definite ⇐⇒ v t Av > 0 for all v ∈ Rn \ {0}.
I In case the matrix has rank d, all its eigenvalues are non-zero.
Then we can write the inverse of A as
A^{−1} = Σ_{i=1}^d (1/λ_i) v_i v_i^t
I (A+ )+ = A
A(x1 , x2 , x3 )′ := (x1 , x2 )′ ∈ R^2
A A_rec A = A
λ_1 = max_{v∈R^n} (v^t A v)/‖v‖² = max_{v∈R^n: ‖v‖=1} v^t A v.
Projections
A linear mapping P : E → E between vector spaces is a projection
if and only if P 2 = P .
It is an orthogonal projection if and only if it is a projection and
nullspace(P ) ⊥ image(P ).
Projections (2)
We always have two points of view of a projection:
Projections (3)
Projections (4)
View 2, Projection on a one-dimensional subspace:
The orthogonal projection on a one-dimensional space spanned by
vector a can be expressed as
π : Rd → R, π(x) = a0 x
View 2, Projection on an `-dimensional subspace:
Want to project on an `-dim subspace S with ONB v1 , ..., v` .
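A small NumPy sketch of both views, assuming the orthonormal basis vectors v_1, ..., v_ℓ are stored as the columns of a matrix V:

    import numpy as np

    def project_onto_vector(x, a):
        a = a / np.linalg.norm(a)
        return (a @ x) * a                  # coordinate a'x, mapped back into R^d

    def project_onto_subspace(x, V):
        # V: d x l matrix with orthonormal columns v_1, ..., v_l
        return V @ (V.T @ x)

    x = np.array([1.0, 2.0, 3.0])
    a = np.array([1.0, 0.0, 0.0])
    V = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
    print(project_onto_vector(x, a), project_onto_subspace(x, V))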
Projections (5)
Affine projections:
(Funnily, this is not true the other way round: you can have all
sublevel sets convex, and yet the function is not convex.)
minimize f (x)
subject to gi (x) ≤ 0 (i = 1, ..., k)
settings as well, but one would need convex analysis for this).
minimize f (x)
subject to g(x) = 0
∇x L(x, ν) = 0
I The condition g(x) = 0 is equivalent to ∇ν L(x, ν) = 0.
Simple example
Consider the problem to minimize f(x) subject to g(x) = 0, where
f, g : R^2 → R are defined as f(x) = x_1² + x_2² and g(x) = x_1 + x_2 − 1.
Setting the gradient of the Lagrangian L(x, ν) = f(x) + ν g(x) to zero gives
∇_{x_1} L = 2x_1 + ν = 0
∇_{x_2} L = 2x_2 + ν = 0
∇_ν L = x_1 + x_2 − 1 = 0
minimize f (x)
subject to g(x) ≤ 0
Simple example
What are the side lengths of a rectangle that maximize its area,
under the assumption that its perimeter is at most 1?
maximize x · y subject to 2x + 2y ≤ 1
minimize(−x · y) subject to 2x + 2y − 1 ≤ 0
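Worked out (added for concreteness): the Lagrangian is L(x, y, λ) = −xy + λ(2x + 2y − 1) with λ ≥ 0. Setting the gradient with respect to x and y to zero gives −y + 2λ = 0 and −x + 2λ = 0, hence x = y. At the optimum the constraint is active (λ > 0), so 2x + 2y = 1 and therefore x = y = 1/4: the optimal rectangle is a square with area 1/16.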
minimize f0 (x)
subject to fi (x) ≤ 0 (i = 1, ..., m)
hj (x) = 0 (j = 1, ..., k)
I Then define
L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{j=1}^k ν_j h_j(x)
Proof.
I Let x0 be a feasible point of the primal problem (that is, a
point that satisfies all constraints).
Σ_{i=1}^m λ_i f_i(x_0) + Σ_{j=1}^k ν_j h_j(x_0) ≤ 0
(since λ_i ≥ 0, f_i(x_0) ≤ 0 and h_j(x_0) = 0)
Weak duality
Proposition 52 (Weak duality)
The solution d∗ of the dual problem is always a lower bound for the
solution of the primal problem, that is d∗ ≤ p∗ .
Proof. Follows directly from Proposition 51 above.
Strong duality
I We say that strong duality holds if p∗ = d∗ .
I This is not always the case, but only under particular conditions.
Such conditions are called constraint qualifications in the
optimization literature.
I Convex optimization problems often satisfy strong duality, but
not always.
minimizex,y exp(−x)
subject to x/y ≤ 0
y≥0
EXERCISE!
Remarks:
I This proposition always holds (not only under strong duality).