
Statistical Machine Learning

Ulrike von Luxburg
Summer 2019
Department of Computer Science, University of Tübingen

(Version as of April 18, 2019)

Contents will be updated constantly, see course webpage.


Some chapters are marked with (∗); these will NOT be covered in this lecture, so you can ignore them.

Table of contents

Introduction to Machine Learning

What is machine learning? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16


Motivating examples and applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Organisation of the course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Machine learning as inductive inference . . . . . . . . . . . . . . . . . . . . . . . . . 38


Different learning scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

Recap: Linear algebra, probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Warmup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
The kNN algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

Formal setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102




Standard setup for supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 103
Statistical and Bayesian Decision theory . . . . . . . . . . . . . . . . . . . . . . . 115
Optimal prediction functions in closed form . . . . . . . . . . . . . . . . . . . . 131
... for classification under 0-1 loss . . . . . . . . . . . . . . . . . . . . . . . . . . 132
... for regression under L2 loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Basic learning principles: ERM, RRM . . . . . . . . . . . . . . . . . . . . . . . . . . 151

Linear methods for supervised learning



Linear methods for regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167


Linear least squares regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Feature representation of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Least squares with linear combination of basis functions . . . . . . . . 203
Ridge regression: least squares with L2 -regularization . . . . . . . . . . . 211
Lasso: least squares with L1 -regularization . . . . . . . . . . . . . . . . . . . . . 229
(∗) Probabilistic interpretation of linear regression . . . . . . . . . . . . . . 246


Feature normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
Selecting parameters by cross validation . . . . . . . . . . . . . . . . . . . . . . . 255

Linear methods for classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266


Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Linear discriminant analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
(∗) Probabilistic interpretation of linear classification . . . . . . . . . . . 311

Linear Support vector machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318


Intuition and primal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .319
Excursion: convex optimization, primal, Lagrangian, dual . . . . 344
Deriving the dual problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Important properties of SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359

Kernel methods for supervised learning


Positive definite kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369


Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
Definition and properties of kernels . . . . . . . . . . . . . . . . . . . . . . . . . 379
Reproducing kernel Hilbert space and feature maps . . . . . . . . . . 402
Support vector machines with kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 425
Regression methods with kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
Kernelized least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
Kernel ridge regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
How to center and normalize in the feature space . . . . . . . . . . . . . . 459

More supervised learning algorithms


(*) Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
(*) Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473

Unsupervised learning

Dimensionality reduction and embedding . . . . . . . . . . . . . . . . . . . . . . . . . . 475




PCA and kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
Multi-dimensional scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
Graph-based machine learning algorithms: introduction . . . . . . . . . 543
Isomap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
(*) Maximum Variance Unfolding: SKIPPED . . . . . . . . . . . . . . . . . . 565
(*) tSNE: SKIPPED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
(*) Johnson-Lindenstrauss: SKIPPED . . . . . . . . . . . . . . . . . . . . . . . . . 567
(*) Ordinal embedding (GNMDS, SOE, t-STE): SKIPPED . . . . . .568

Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
K-means and kernel k-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
Standard k-means algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
Linkage algorithms for hierarchical clustering . . . . . . . . . . . . . . . . . . . 604
A glimpse on spectral graph theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
Unnormalized Laplacians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
Normalized Laplacians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629


Cheeger constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
Spectral clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
Unnormalized spectral clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . .652
Normalized spectral clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
(*) Correlation clustering: SKIPPED . . . . . . . . . . . . . . . . . . . . . . . . . . 688

Introduction to learning theory



The standard theory for supervised learning . . . . . . . . . . . . . . . . . . . . . . . 690


Learning Theory: setup and main questions . . . . . . . . . . . . . . . . . . . . 691
Controlling the estimation error: generalization bounds . . . . . . . . . 700
Capacity measures for function classes . . . . . . . . . . . . . . . . . . . . . . . . . 714
Finite classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715
Shattering coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725
VC dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735
Rademacher complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748


Controlling the approximation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754
Getting back to Occam’s razor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762

(*) Loss functions, proper and surrogate losses: SKIPPED . . . . . . . . . 771

(*) Probabilistic interpretation of ERM: SKIPPED . . . . . . . . . . . . . . . . 772

(*) The No-Free-Lunch Theorem: SKIPPED . . . . . . . . . . . . . . . . . . . . . . 773



Fairness in machine learning

Low rank matrix methods


Introduction: recommender systems, collaborative filtering . . . . . . 776
Matrix factorization basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 780
Low rank matrix completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
Compressed sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .831


Ranking from pairwise comparisons
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 875
Simple but effective counting algorithm . . . . . . . . . . . . . . . . . . . . . . . . 884
Learning to rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903
Application: distance completion problem . . . . . . . . . . . . . . . . . . . . . . 914
Spectral ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 926
Google page rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 930

The data processing chain: from raw data to machine learning

Preparing the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .948


Data acquisition: train versus test distribution . . . . . . . . . . . . . . . . . . 949
Converting raw data to training data . . . . . . . . . . . . . . . . . . . . . . . . . . 954
Data cleaning: missing values, outliers . . . . . . . . . . . . . . . . . . . . . . . . . 956
Defining features, similarities, distance functions . . . . . . . . . . . . . . . 964


Defining a similarity / kernel / distance function . . . . . . . . . . . . . . . 969
Reducing the number of training points? . . . . . . . . . . . . . . . . . . . . . . 975
Unsupervised dimensionality reduction . . . . . . . . . . . . . . . . . . . . . . . . . 979
Data standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .982
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 984

Setting up the learning problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 986


Choice of a loss / risk function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987

Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .995
Selecting parameters by cross validation . . . . . . . . . . . . . . . . . . . . . . . 998
Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1009

Evaluation of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1025

General guidelines for attacking ML problems . . . . . . . . . . . . . . . . . . . . 1056


Key steps in the processing chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057


High-level guidelines and principles . . . . . . . . . . . . . . . . . . . . . . . . . . . 1060

(*) Online learning SKIPPED

Online learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1063


Warmup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1064
Prediction with expert advice, weighted majority algorithm . . . . 1074
Follow the (perturbed, regularized) leader . . . . . . . . . . . . . . . . . . . . . 1088

Perceptron and winnow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1092

Outlook: some of the research in our group

Wrap up

Excursion: Research, publications, reviewing . . . . . . . . . . . . . . . . . . . . . 1102


Things we did not talk about . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1114


Further machine learning resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 1116

Mathematical Appendix
Recap: Probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1122
Discrete probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1123
Continuous probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1161
Recap: Linear algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1175
Excursion to convex optimization: primal, dual, Lagrangian . . . . 1214

Convex optimization problems: intuition . . . . . . . . . . . . . . . . . . . 1215


Lagrangian: intuitive point of view . . . . . . . . . . . . . . . . . . . . . . . . 1222
Lagrangian: formal point of view . . . . . . . . . . . . . . . . . . . . . . . . . . 1243

Literature
Here are some of the books we use during the lecture (just
individual chapters):

General machine learning:


I Shai Shalev-Shwartz, Shai Ben-David: Understanding Machine
Learning. Cambridge University Press 2014. Theory and
algorithms.

I Chris Bishop: Pattern recognition and Machine Learning. Springer 2006. Focus on algorithms.
I Hastie, Tibshirani, Friedman: Elements of statistical learning.
Springer, 2009. More from a traditional statistics point of
view.
I Mohri, Rostamizadeh, Talwalkar: Foundations of machine
learning. MIT Press, 2012. More theory than algorithms.
I Schölkopf, Smola: Learning with kernels. MIT Press, 2002.
SVMs and kernel algorithms, for computer science audience.
I Steinwart, Christmann: Support Vector Machines. Springer,
2008. SVMs and kernel algorithms, for maths audience.
Learning theory:
I Devroye, Györfi, Lugosi: A Probabilistic Theory of Pattern
Recognition. Springer, 1996.
Covers all the basic results of statistical learning theory, but we won't use much of it in this lecture.
Low rank matrix methods:
I Hastie, Tibshirani, Wainwright: Statistical learning with
sparsity. CRC Press, 2015. In depth.
The Bayesian / probabilistic viewpoint:
I Kevin Murphy: Machine learning, a probabilistic perspective.
MIT Press, 2012.
Information theoretic (and Bayesian) point of view on machine
learning:
I MacKay: Information theory, inference and learning algorithms.
Cambridge University Press, 2003
Optimization:

I Boyd / Vandenberghe for convex optimization

I Nocedal / Wright: Numerical optimization (focus on algorithms)

Introduction to Machine Learning

What is machine learning?



Motivating examples and applications



Hand-written digit recognition


Want to automatically recognize the postal code in the address
field of a letter:

Hand-written digit recognition (2)


I Take a camera picture of the address
I Segment the image into individual letters and digits
I Image of a digit: 16 × 16 greyscale image, corresponds to a
vector with 256 entries, each entry between 0 and 1 (0 =
white, 1 = black)

I Goal: want to know which digits they are, that is we want to find a “correct” mapping f : [0, 1]^256 → {0, 1, 2, ..., 9}.
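As a small illustration (a sketch with a made-up image, not part of the original slides): the input to such a mapping is simply the flattened greyscale image.

import numpy as np

# Hypothetical 16 x 16 greyscale digit image with entries in [0, 1]
# (0 = white, 1 = black), e.g. one image from a digit dataset.
image = np.random.rand(16, 16)

# Flatten it into a vector with 256 entries: this is the input x on which
# a classifier f : [0, 1]^256 -> {0, 1, ..., 9} operates.
x = image.reshape(-1)
print(x.shape)  # (256,)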

Hand-written digit recognition (3)


I Problem: it is impossible to hand-design such a rule!

Hand-written digit recognition (4)


The machine learning approach:
I Present many “training examples” to the computer:
(Xi , Yi )i=1,...,n such that Xi = greyscale image,
Yi ∈ {0, 1, 2, ..., 9} the true class label
I The computer is supposed “to learn” a function f that
correctly assigns digits to greyscale images
This problem is one of the “founding problems” of pattern recognition and machine learning.

Spam filtering
I Want to classify all emails into two classes: spam or not-spam
I Similar problem as above: hand-designed rules don’t work in
many cases
I So we are going to “train” the spam filter:

Spam filtering (2)


I Internally, the spam filter “updates its rules” based on the training examples it gets.
This is a typical “online learning problem”: training data arrives in an online stream, rules have to be updated all the time

Object detection in general

[Figure: example results for onboard vehicle and pedestrian tracking with multi-frame 3D inference and explicit short-term occlusion reasoning, with overlaid horizon estimate for different public state-of-the-art datasets (all results at 0.1 FPPI). Image: Christian Wojek et al., IEEE PAMI, 2013]

Self-driving cars
First breakthrough: The DARPA grand challenge in 2005:
I Build an autonomous car that can find its way 100 km through
the desert.
I The winning team was that of Sebastian Thrun (Stanford), one of the leading machine learning researchers.

Self-driving cars (2)


By now: all major companies realize self-driving cars:

Self-driving cars (3)


How does it work?
I Many sensors, cameras, etc

I All of them generate low-level signals

I Need to “extract information” from each signal such as


I “here is a wall / a car / a pedestrian”
I “the cyclist extends his arm to signal a turn”
I “Railways cross the street”

I Lots of information, noisy, unreliable, contradicting. Need to aggregate / combine this information.
I Need to predict what happens next (“cyclist wants to turn
left”)
I Then the car can “decide” what to do next.

It took decades to work out all these steps; machine learning is the driving force behind the recent success.

Bioinformatics
Machine learning is used all over the place in bioinformatics:
I Classify different types of diseases based on microarray data

Image: Wikipedia

I Want to find drugs that can bind to a protein:

Images: BiochemLabSolutions.com

Bioinformatics (2)
[Figure: “Classification of the topics where machine learning methods are applied.” Figure from Larrañaga et al., 2005]

Medical image analysis


Skin cancer detection:

Figure taken from Esteva et al, Nature 2017

Machine learning achieves a level comparable to the best human experts.



Machine Learning in Science


Example Archeology: an ML analysis of the human genome finds evidence for an unknown human ancestor:

Nature communications, 2019



Language processing
2011: The computer “Watson” wins the American quiz show “Jeopardy!”. It is a bit like “Wer wird Millionär”, but not so much about facts and more about word games.

Now: many breakthroughs, just talk to your phone ...

Teaser: Try the DeepL translator



Breakthroughs in recent years, based on deep learning
... in areas with large amounts of structured data:
I Machine translation

I Speech recognition

I Object recognition

Another recent success story: DeepMind’s AlphaGo
DeepMind (Google) constructed a computer player for the board game Go. In March 2016 it defeated the 18-time world champion of Go, Lee Se-dol.

Another recent success story: DeepMind’s AlphaGo (2)
AlphaGo combines an advanced tree search with deep neural networks. These neural networks take a description of the Go board as an input and process it through 12 different network layers containing millions of neuron-like connections. We trained the neural networks on 30 million moves from games played by human experts. AlphaGo learned to discover new strategies for itself, by playing thousands of games between its neural networks, and adjusting the connections using a trial-and-error process known as reinforcement learning.

Organisation of the course



I Info Sheet
I Important: doodle for tutorial groups
I Literature
I Material covered: see table of contents
I Material NOT covered: neural networks and deep learning;
Bayesian / probabilistic approaches; reinforcement learning
I Language: English / German
I Prerequisites: maths, nothing else (see first exercises)



Machine learning as inductive inference



What is machine learning?


First explanation:

I Development of algorithms which allow a computer to “learn” specific tasks from training examples.
I Learning means that the computer can not only memorize the seen examples, but can generalize to previously unseen instances.
I Ideally, the computer should use the examples to extract a
general “rule” how the specific task has to be performed
correctly.

Deduction vs. Induction


WHO KNOWS WHAT INDUCTION AND DEDUCTION MEAN?

Deduction vs. Induction (2)


Deductive inference is the process of reasoning from one or more
general statements (premises) to reach a logically certain
conclusion.

Example:
I Premise 1: every person in this room is a student.

I Premise 2: every student is older than 10 years.


I Conclusion: every person in this room is older than 10 years.

If the premises are correct, then all conclusions are correct as well.

Nice in theory. For example, mathematics is based on this principle.


But no natural way to deal with uncertainty regarding the premises.

Deduction vs. Induction (3)


Inductive inference: reasoning that constructs or evaluates
general propositions that are derived from specific examples.

Example:
I We throw lots of things, very often.

I In all our experiments, the things fell down and not up.

I So we conclude that likely, things always fall down.


Very important: we can never be sure; our conclusion can be wrong!

Deduction vs. Induction (4)


Humans do inductive reasoning all the time: draw uncertain
conclusions from our relatively limited experiences.

Example:
I You come 10 minutes late to every lecture I give.

I The first 7 times I don’t complain.

I You conclude that I don’t care and it won’t have any consequences.
I BUT you cannot be sure ...

Machine learning as inductive inference


Now comes our second, more abstract description of what machine learning is:

Machine learning tries to automate the process of inductive inference.

Machine learning as inductive inference (2)


In most applications of machine learning: focus is not so
much on building models (“understanding the underlying
process”) but more on predicting.

Example with the things that always fall down:


I We are not so much concerned why the stones always fall down and not up
I We just want to predict that they fall down



Machine learning as inductive inference (3)


Very important observation:
I Being able to predict is often easier than understanding the
underlying mechanism.
I In particular, it is often not necessary to understand the
underlying mechanism for being able to make good predictions!

Example:

I We observe that the sun rises every day.


I So we predict that the sun is also going to rise the next day.
I To make this prediction, we don’t need to understand why this
is the case.

Why should machine learning work at all?


Consider the following simple regression example:
I Given: input-output pairs (Xi , Yi ), Xi ∈ X , Yi ∈ Y.

I Goal: learn to predict the Y -values from the X-values, that is we want to “learn” a suitable function f : X → Y.

EXAMPLE 1: WHAT DO YOU BELIEVE IS THE VALUE f (0.4)?



Why should machine learning work at all? (2)


Here are two guesses:

WHICH ONE IS BETTER?



Why should machine learning work at all? (3)


Now I tell you that in fact, the function values Yi have been
generated by a uniform random number generator.

WHAT DO YOU PREDICT NOW?



Why should machine learning work at all? (4)


Consequence 1: we will only be able to learn if “there is
something we can learn”.
I Output Y “has something to do” with input X
I “Similar inputs” lead to “similar outputs”
I There is a “simple relationship” or “simple rule” to generate
the output for a given input
I The function f is “simple” (but caution, this is not the end of the story, see later in the section on learning theory)
These assumptions are rarely made explicit, but something along
this line has to be satisfied, otherwise ML is doomed.

Why should machine learning work at all? (5)


Consequence 2: We need to have an idea what we are
looking for. This is called the “inductive bias”. Learning is
impossible without such a bias.

Let’s try to get some intuition for what this means.



Inductive bias: very simple example


Discrete input space X = {0.01, 0.02, ..., 1}.
Output space: Y = {0, 1}
Given: training examples (Xi , Yi )i=1,...,n ⊂ X × Y, assume there is
no label noise (all training labels are correct).
Goal: Learn a function f : X → Y based on the examples

Case 1: no inductive bias, every function f : X → Y can be the correct one.

Formally:
I we want to find a function out of F := Y^X (the space of all functions). This space contains 2^100 functions.

Inductive bias: very simple example (2)


I Now assume we have already 5 training points and their labels.
I This means that we can rule out all functions from F which do not satisfy f (Xi ) = Yi . So we are left with 2^95 possible functions.
I Now we want to predict the value at a previously unseen point X′ ∈ X .
I There are 2^94 remaining functions with f (X′) = 0 and the same number of functions with f (X′) = 1.
I And there is no way we can decide which one is going to be the best prediction.
I In fact, no matter how many data points we get, our prediction on unseen points will be as bad as random guessing.
Without any further restrictions or assumptions, the problem of machine learning would be ill posed!
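This counting argument can be checked on a smaller toy version (a sketch with 10 input points instead of 100 and made-up labels; not part of the original slides): among all functions consistent with the training data, exactly half predict 0 at the unseen point and half predict 1.

from itertools import product

# Toy version: |X| = 10, so all 2^10 functions f : X -> {0, 1} can be enumerated.
n_points = 10
labeled = {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}   # 5 observed (index, label) pairs, labels arbitrary
new_point = 7                              # previously unseen point

# Keep only the functions that agree with all observed training labels.
consistent = [f for f in product([0, 1], repeat=n_points)
              if all(f[i] == y for i, y in labeled.items())]

predict_zero = sum(1 for f in consistent if f[new_point] == 0)
predict_one = len(consistent) - predict_zero
print(len(consistent), predict_zero, predict_one)   # 32 16 16: no reason to prefer either label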

Inductive bias: very simple example (3)


A more formal way of stating this result is called the “No free lunch
theorem”. There is a section on this topic in the slides on learning
theory, you can read it if you want (we won’t discuss it in the
lecture though).

Inductive bias: very simple example (4)


Case 2: model with an inductive bias.
I Assume that the true function is one out of two functions:
either the constant one function 1 or the constant zero
function 0.
Our hypothesis space is F = {0, 1}.
I Then, after we observed one training example, we know exactly
which function is the correct one and can predict without error.
If we have a (strong enough) inductive bias, we can predict based on few training examples.

Inductive bias: very simple example (5)


A bit too simplistic?
I Yes, the hypothesis space F seems too restricted to be useful for
practice. The problem of selecting a good hypothesis class is
called model selection.
I And yes, we did not take noise into account (yet).

I And yes, we did not talk about what happens if the true
function is not contained in F after all.

The details of all this are quite tricky. It is the big success story
of machine learning theory to work out how exactly all these
things come together.

At the end of the course you will know all this, at least roughly.

Overfitting, underfitting
Choosing a “reasonable function class F” is crucial.

Consider the following example (a small numerical sketch follows the list):


I True function f : quadratic function

I Training points: Yi = f (Xi ) + noise

I Red curve: F = all linear functions



I Blue curve: F = all polynomial functions
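A minimal numerical sketch of this example (made-up data; the noise level and polynomial degrees are assumptions, not the values behind the plot on the slides): fit a low-degree and a high-degree polynomial by least squares and compare training and test errors.

import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: 1.0 + 2.0 * x - 3.0 * x ** 2       # true quadratic function
x_train = rng.uniform(0, 1, 15)
y_train = f_true(x_train) + rng.normal(0, 0.2, 15)    # training points: f(X_i) + noise
x_test = rng.uniform(0, 1, 200)
y_test = f_true(x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 12):                                # candidates for under- and overfitting
    coeffs = np.polyfit(x_train, y_train, degree)     # least squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")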



Overfitting, underfitting (2)


Overfitting:
I We can always find a function that explains all training points
very well or even exactly
I But such a function tends to be very complicated and models
the noise as well
I Predictions for unseen data points are poor (“large test error”)
I Low approximation error, high estimation error (see later for definitions)
Underfitting:
I Model is too simplistic

I But estimated functions are stable with respect to noise

I Large approximation error, low estimation error (see later for definitions)

Excursion: Inductive bias in animal learning


Any “system” that learns has an inductive bias. Consider
learning in animals:

Rats get two choices of water. One choice makes them feel sick,
the other one doesn’t.

Experiment 1:

I Two types of water taste differently (neutral and sugar).

I Rats learn very fast not to drink the water that makes them
sick.

Excursion: Inductive bias in animal learning (2)


Experiment 2:
I Same water, but one type of water is presented together with
“audio-visual stimuli” (certain sounds and light conditions),
while the other type of water is presented without these
accompanying audio-visual stimuli.
I In this setting, rats did NOT learn to avoid the water that
makes them sick.
I Apparently, they cannot make a connection between “sound of the food” and “sickness”.

Excursion: Inductive bias in animal learning (3)


Explanation:
I From the point of view of evolution, it makes a lot of sense
that the taste of food is related to whether it makes sick or
not, whereas this does not seem so useful for sounds coming
with food.

In our words: the rat has an inductive bias!


In psychology, this effect is called the “Garcia effect” (published in a line of papers by John Garcia and co-workers in the 1960s).
Reference: Garcia, John and Brett, Linda Phillips and Rusiniak, Kenneth W. Limits of
Darwinian conditioning. In S.B. Klein and R.R. Mowrer, editors, Contemporary learning
theories: instrumental conditioning theory and the impact of biological constraints, pages
181-204, 1989.

Inductive bias, bottom line for now


Any successful learning algorithm has an inherent inductive
bias.
I It sounds really great to say “We look at the data in a
completely unbiased way, and just let the data speak for
itself”. But this cannot work (also not in science). We need to
make assumptions.
I In machine learning, we often prefer to select a hypothesis from some “restricted” or “small” function space F.


I Whether this function is “close to the truth” depends on
whether the model class F is “selected well” for the problem
at hand.
I For some algorithms it will be obvious what the inductive bias
is. For some algorithms it is hard to understand what exactly
the bias is. But if the algorithm works, there HAS TO
BE a bias. This is very important to keep in mind.

Different learning scenarios



Supervised, semi-supervised, unsupervised learning
Supervised learning:
I We are given input-output pairs (Xi , Yi ) as training data.

I The goal is to “learn” the function f : Xi ↦ Yi .

I “Supervised”: the teacher tells us what the true outcome should be on the given samples

The most important supervised learning tasks:


I Classification: Yi ∈ {0, 1}, or Yi ∈ {1, 2, ..., K}.
Applications: spam classification, digit recognition, ...
I Regression: Yi ∈ R.
Example: predict how much a person is willing to pay for a
laptop, based on his past shopping behavior

Supervised, semi-supervised, unsupervised learning (2)
Unsupervised learning:
I We are just given input values Xi , but no output values

I The learner is supposed to learn the “structure” of these inputs.
Examples:

I Clustering: partition the data into “meaningful groups”.


Applications:
I Different types of cancer based on gene-expression profiles
I Find communities in a social network
I Build a taxonomy of a document collection, based on the
topics of the documents
I Outlier detection: find data points that are “outliers”.
Application: intrusion detection, data preprocessing.

Supervised, semi-supervised, unsupervised learning (3)
Semi-supervised learning:
I We are given very many input values X1 , ..., Xu (unlabeled
points), and some input-output pairs (Xi , Yi ) (“labeled
points”).
I Goal is, as in supervised learning, to predict the function f : Xi ↦ Yi .
I But we want to exploit the extra knowledge we gain from the
unlabeled training points.

Supervised, semi-supervised, unsupervised learning (4)
I Applications:
I Predict the function of certain proteins: lots of proteins are
known (unlabeled points), but it is very expensive to run lab
experiments to find out whether they have a certain
functionality or not (labeled points)
I Computed tomography: lots of images of patients are available, but only few of them are labeled by a medical doctor as containing a tumor or not.

Batch vs. Online vs. Active Learning


Batch learning:
I We get a bunch of training data before we start the learning
process.
I Then we learn on this whole training set.

Online learning:
I Examples arrive online, one at a time.
I Each time a new training example arrives, we update our internal model.
I Prediction accuracy is measured in an online fashion as well
(we have to predict constantly).
I Example: spam detection

Batch vs. Online vs. Active Learning (2)


Active learning:
I Instead of examples being given to us, we can actively choose
which example should be the next one we want to learn (we
can “actively ask” a teacher)
I Of course it is crucial to select a question that gives you as
much information about the true underlying function as
possible.

I Applications: domains where obtaining labels is expensive.


Active learning is particularly interesting in combination with
semi-supervised learning.

Discriminative vs. generative approach


We distinguish two different machine learning scenarios:
I Discriminative approach. We are just interested in predicting
the output variable Y .
I Generative approach. For some cases, we not only want to
predict the labels, but we want to “understand” the underlying
process, that is we want to model the joint distribution P (X, Y ), in particular the class conditional distributions P (X|Y = 1) and P (X|Y = 0).

In general, solving discriminative problems is easier than generative problems.

Recap: Linear algebra, probability theory



Recap!
To be able to follow the lecture, you need to know the following
material. If you cannot remember it, please recap. You can use the
slides in the maths appendix, the tutorials, and any text book ...

Linear algebra:
I Vector space, basis, linear mapping, norm, scalar product

I Matrices: matrix multiplication, rank, inverse, eigenvalues and eigenvectors
I Eigen-decomposition of symmetric matrices, positive definite
matrices

Recap! (2)
Probability theory:
I Discrete:
I Basics: Probability distribution, conditional probability, Bayes
theorem, random variables, expectation, variance,
independence,
I Distributions: joint, marginal, product distributions;
Bernoulli, Binomial distribution, Poisson distribution, uniform distribution
I multivariate random variables: variance, covariance
I Continuous (we keep it on the intuitive level):
I Density, cumulative distribution function, expectation,
variance
I Normal distribution (univariate and multivariate), covariance
matrix.

Warmup

The kNN algorithm
Literature:
For the algorithm: Hastie, Tibshirani, Friedman, Section 2.3.2; Duda, Hart, Section 4.5
For theory (not covered in this lecture): Devroye, Györfi, Lugosi: A Probabilistic Theory of Pattern Recognition

A simple machine learning experiment


Data:
I Take a set of training points and labels (Xi , Yi )i=1,...,n . The
machine learning algorithm has access to this training input
and can use it to generate a classification rule f : X → {0, 1}.
I Take a set of test points (Xj , Yj )j=1,...,m . This set is
independent from the training set (“previously unseen points”) and will be used to evaluate the success of the training algorithm.

A simple machine learning experiment (2)


Assume our machine learning algorithm has used the training data
to construct a rule falg for predicting labels.

Training error:
I Predict the labels of all training points: Ŷi := falg (Xi ).

I Compute the error of the classifier on each training point:


err(Xi , Yi , Ŷi ) := 0 if Ŷi = Yi , and 1 otherwise.

This is called the “pointwise 0-1-loss”.



A simple machine learning experiment (3)


I Define the training error of the classifier as the average error,
over all training points:

errtrain (falg ) = (1/n) Σ_{i=1}^{n} err(Xi , Yi , falg (Xi ))

Remarks:
I This error obviously also depends on the training sample, but
we drop this from the notation to make it more readable. To
be really precise, one would have to write something like
errtrain (falg , (Xi , Yi )i=1,...,n )
I Later we will call this quantity the “empirical risk” of the
classifier (with respect to the 0-1-loss).

A simple machine learning experiment (4)


Test error:
I Predict the labels of all test points: Ŷj := falg (Xj ).

I Compute the error of the classifier on each test point:

err(Xj , Yj , Ŷj ) := 0 if Ŷj = Yj , and 1 otherwise.

I Define the test error of the classifier as the average error, over
all test points:

errtest (falg ) = (1/m) Σ_{j=1}^{m} err(Xj , Yj , falg (Xj ))
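As a small sketch (with hypothetical labels and predictions, not from the lecture), both the training and the test error are just averages of the pointwise 0-1 loss:

import numpy as np

def zero_one_error(y_true, y_pred):
    """Average pointwise 0-1 loss: fraction of points with Y-hat != Y."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

# Hypothetical labels together with the predictions of some classifier f_alg:
err_train = zero_one_error([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])   # -> 0.2
err_test = zero_one_error([1, 0, 1, 1], [1, 1, 1, 0])          # -> 0.5
print(err_train, err_test)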

A simple machine learning experiment (5)


Technical remarks:
I The quantity errtest as defined above is an empirical test error
(it depends on the test set). Later, we will define the risk of
the classifier, which is the expectation over this quantity.

A simple machine learning experiment (6)


Remarks:
I Obviously, it is not so much of a challenge for an algorithm to
correctly predict the training labels (after all, the algorithm
gets to know these labels).
I Still, machine learning algorithms usually make training errors,
that is they construct a rule f that does not perfectly fit the
training data.
I But the crucial measure of success is the performance of the classifier on an independent test set.
I In particular, it is not the case that a low training error
automatically indicates a low test error or vice versa.

The kNN classifier


Given: Training points (Xi , Yi )i=1,...,n ⊂ X × {0, 1} and a
distance function d : X × X → R.
Goal: Construct a classifier f that predicts the labels from the
inputs.
I Given a test point X′, compute all distances d(X′, Xi ) and sort them in ascending order.
I Let Xi1 , ..., Xik be the first k points in this order (the k nearest neighbors of X′). We denote the set of these points by kNN(X′).
I Assign to Y′ the majority label among the corresponding labels Yi1 , ..., Yik , that is define
Y′ = 0 if Σ_{j=1}^{k} Yij ≤ k/2, and 1 otherwise.
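A minimal NumPy sketch of this rule (assuming points in R^d with the Euclidean distance as d; the toy data are made up, not from the lecture). A vote of exactly k/2 ones gives label 0, matching the formula above.

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote over its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]      # indices of the k nearest neighbors of x_new
    vote = y_train[nearest].sum()        # number of 1-labels among them
    return 0 if vote <= k / 2 else 1     # majority label, as in the formula above

# Hypothetical toy data in R^2 with labels in {0, 1}:
X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]]
y = [0, 0, 1, 1, 1]
print(knn_predict(X, y, [0.2, 0.1], k=3))   # -> 0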

Example

Influence of the parameter k


The classification result will depend on the parameter k.

WHAT DO YOU THINK, IS IT BETTER TO HAVE k SMALL OR LARGE?

Influence of the parameter k (2)

[Figure: decision boundaries of the 1-Nearest Neighbor Classifier and the 15-Nearest Neighbor Classifier, from Hastie, Tibshirani, Friedman, Chapter 2 (Overview of Supervised Learning, Section 2.3 Least Squares and Nearest Neighbors)]
o o
o o o o o o
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
o o o o o o
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
o oo oo o o ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
oo oo o
o .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. o ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ooo .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ooo
o o o oo o
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
o o o oo o
.....................................................................
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
o
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
oo
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. o oo
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
o
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. o
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
o o
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. o o
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
o ..................................................................... o

FIGURE 2.3. The


Figure from same classification example in two dimensions
Hastie/Tibshirani/Friedman: as in Fig-2.2. The same classification example in two dimensions as in Fi
FIGURE
ure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE
ure=2.1.
1), The
and classes are coded as a binary variable (BLUE = 0, ORANGE = 1) an
then predicted by 1-nearest-neighbor classification. then fit by 15-nearest-neighbor averaging as in (2.8). The predicted class is hen
I Yellow/blue circles: training points and their labels
chosen by majority vote amongst the 15-nearest neighbors.

2.3.3 From
I Least Squares to Nearest Neighbors Yellow/blue little dots: if this were a test point, the kNN classifier would
In Figure 2.2 we see that far fewer training observations are misclassifie
The linear decision boundary from least squares is very smooth,thanand
in Figure
ap- 2.1. This should not give us too much comfort, though, sin
parently stable to fit. It does appear to rely heavily on the in classify the point as yellow/blue.
Figure 2.3 none of the training data are misclassified. A little thoug
assumption
suggests
that a linear decision boundary is appropriate. In language we will that for k-nearest-neighbor fits, the error on the training da
develop
shortly, it has low variance and potentially high bias. should be approximately an increasing function of k, and will always be
85

On the other hand, the k-nearest-neighbor procedures do not forappear


k = 1. to
An independent test set would give us a more satisfactory mea
Summer 2019

Influence of the parameter k (3)


Generally:
I k too small ⇝ overfitting
(Extreme case: k = 1, very wiggly and prone to noise, zero
training error)
I k too large ⇝ underfitting
(Extreme case: k = n, then every point gets the same label,
namely the overall majority label)
Ulrike von Luxburg: Statistical Machine Learning

I Theoretical analysis reveals that k should be roughly of order
log n as n → ∞ (we won’t prove it in this course; if you are
interested, consult the book by Devroye, see the literature list).
86
Summer 2019

Application: simple mixture of Gaussians


Recap:
I Normal distribution in 1 dimension

I Multivariate normal distribution

I Mixture of Gaussians
Ulrike von Luxburg: Statistical Machine Learning
87
Summer 2019

Application: simple mixture of Gaussians (2)


We draw 100 points randomly from a mixture of two Gaussian
distributions. The figure shows a typical training set:

[Figure: scatter plot of a typical training set of 100 points in the plane.]
88
Summer 2019

Application: simple mixture of Gaussians (3)


Conduct the following experiment:
1 for rep = 1, ..., 10
2   Draw n training points (X_i, Y_i)_{i=1,...,n}
3   Draw m test points (X'_i, Y'_i)_{i=1,...,m}
4   for k = k_1, ..., k_s
5     Predict the labels of all training points, using the k nearest training points
6     ErrTrain(k, rep) = the training error, averaged over all training points
7     Predict the labels of all test points, using the k nearest training (!) points
8     ErrTest(k, rep) = the test error, averaged over all test points
9 For each k, return the average train and test error
  (where the average is taken over the repetitions)
89
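As a concrete illustration, here is a minimal Python/numpy sketch of this experiment (it is not
the original lecture demo; the mixture parameters, equal class priors, means (−1, 0) and (1, 0),
spherical standard deviation 0.7, and the random seed are made-up choices, and the kNN rule is
implemented by brute force, with each training point counting as its own neighbor):

import numpy as np

def draw_mixture(n, rng):
    # draw n points from a two-component Gaussian mixture; the label is the component index
    y = rng.integers(0, 2, size=n)                       # equal class priors (assumption)
    means = np.array([[-1.0, 0.0], [1.0, 0.0]])          # made-up component means
    X = means[y] + rng.normal(scale=0.7, size=(n, 2))    # made-up spherical noise
    return X, y

def knn_predict(X_train, y_train, X_query, k):
    # brute-force kNN: majority vote among the k nearest training points
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]
    return (y_train[nn].mean(axis=1) >= 0.5).astype(int)

rng = np.random.default_rng(0)
ks = list(range(1, 13))
err_train = np.zeros((len(ks), 10))
err_test = np.zeros((len(ks), 10))
for rep in range(10):
    X_tr, y_tr = draw_mixture(500, rng)
    X_te, y_te = draw_mixture(100, rng)
    for i, k in enumerate(ks):
        # note: a training point counts as its own neighbor here ("pt included")
        err_train[i, rep] = np.mean(knn_predict(X_tr, y_tr, X_tr, k) != y_tr)
        err_test[i, rep] = np.mean(knn_predict(X_tr, y_tr, X_te, k) != y_te)

for i, k in enumerate(ks):
    # step 9: average train and test error over the repetitions
    print(k, err_train[i].mean(), err_test[i].mean())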
Summer 2019

Application: simple mixture of Gaussians (4)


Note:
I For the kNN classifier, the training error and test error are
about the same (it does not really “train” in the sense that it
selects a function that is particularly good on the training
data).
I Depending on whether a point is considered to be part of its
own kNN neighborhood or not, the train error differs a bit.
Ulrike von Luxburg: Statistical Machine Learning
90
Summer 2019

Application: simple mixture of Gaussians (5)


The following figure shows these train and test errors:
I Left figures: errors in each individual repetition

I Right figure: errors averaged over all repetitions


Ulrike von Luxburg: Statistical Machine Learning
91
Summer 2019

Application: simple mixture of Gaussians (6)

[Figure: Mixture of Gaussians, 500 train and 100 test points.
Left: train errors (blue) and test errors (red) for 10 repetitions.
Right: average train error (point included / point not included) and average test error,
plotted against the parameter k.]
92
Summer 2019

Application: hand written digits


Data: hand-written digits. Each digit is represented as a 16 × 16 greyscale
image, that is, as a vector of length 256 with entries in [0, 1].

Task 1: Learn to distinguish between 1 and 8


Task 2: Learn to distinguish between 3 and 8
93
Summer 2019

Application: hand written digits (2)


Setup:
I To apply the kNN rule we need to define a distance function
between digits.
I For simplicity, we use the Euclidean distance between the
vectors: for X = (X_1, ..., X_256)^t and X' = (X'_1, ..., X'_256)^t we set

d(X, X') = ( Σ_{s=1}^{256} (X_s − X'_s)² )^{1/2}
94
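In code, this distance is essentially a one-liner. The following sketch (with random placeholder
vectors instead of the actual digit data) computes the distance of one query digit to all
training digits at once:

import numpy as np

def euclidean(x, x_prime):
    # d(X, X') = ( sum_s (X_s - X'_s)^2 )^(1/2)
    return np.sqrt(np.sum((x - x_prime) ** 2))

def distances_to_training_set(x_query, X_train):
    # distances of one query digit to all training digits (one digit per row of X_train)
    return np.sqrt(((X_train - x_query) ** 2).sum(axis=1))

x_query = np.random.rand(256)          # placeholder for a flattened 16 x 16 greyscale image
X_train = np.random.rand(500, 256)     # placeholder training digits
print(distances_to_training_set(x_query, X_train).argmin())   # index of the nearest neighbor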
Summer 2019

Application: hand written digits (3)


Results task 1 (digit 1 vs digit 8):
[Figure: Digit 1 vs 8, 500 train and 100 test points.
Left: train errors (blue) and test errors (red) for 10 repetitions.
Right: average train error (point included / point not included) and average test error.]

(x-Axis: parameter k, y-axis: error)


95
Summer 2019

Application: hand written digits (4)

WHAT DO YOU THINK, IS THIS GOOD OR BAD?


Ulrike von Luxburg: Statistical Machine Learning
96
Summer 2019

Application: hand written digits (5)


Results task 2 (digit 3 vs digit 8):
[Figure: Digit 3 vs 8, 500 train and 100 test points.
Left: train errors (blue) and test errors (red) for 10 repetitions.
Right: average train error (point included / point not included) and average test error,
plotted against the parameter k.]

These results are surprisingly good!!!


97
Summer 2019

Influence of the similarity function


The choice of the similarity function is crucial as well:
I The performance of kNN rules can only be good if the distance
/ similarity function encodes the “relevant information”.
Example: you want to classify mushrooms as “edible” or “not
edible” and as distance function between mushrooms you use
the difference in weight ...
Ulrike von Luxburg: Statistical Machine Learning

I In many applications it is not so obvious how to define a good


distance / similarity function
Example: you want to classify the genre of songs. How do you
compute a similarity between different songs???
98
Summer 2019

Inductive bias
WHAT DO YOU THINK IS THE INDUCTIVE BIAS IN THIS
ALGORITHM? WHAT KIND OF FUNCTIONS ARE
“PREFERRED” OR “LEARNED”?
Ulrike von Luxburg: Statistical Machine Learning
99
Summer 2019

Inductive bias
WHAT DO YOU THINK IS THE INDUCTIVE BIAS IN THIS
ALGORITHM? WHAT KIND OF FUNCTIONS ARE
“PREFERRED” OR “LEARNED”?

Input points that are close to each other should have the
same label
Ulrike von Luxburg: Statistical Machine Learning
99
Summer 2019

Extensions
I The kNN rule can easily be used for regression as well: As output
value take the average over the labels in the neighborhood.
I kNN-based algorithms can also be used for many other tasks
such as density estimation, outlier detection, clustering, etc.
Ulrike von Luxburg: Statistical Machine Learning
100
Summer 2019

Summary
I The kNN classifier is about the simplest classifier that exists.
I But often it performs quite reasonably.
I For any type of data, always consider the kNN classifier as a
baseline.
I One can prove that in the limit of infinitely many data points,
the kNN classifier is “consistent”, that is it learns the best
Ulrike von Luxburg: Statistical Machine Learning

possible function (see next lecture).


101
102 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Formal setup
103 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Standard setup for supervised learning


Summer 2019

Let’s become more formal now


We now want to introduce the formal setup for supervised
statistical learning.
Ulrike von Luxburg: Statistical Machine Learning
104
Summer 2019

The underlying space


I Input space X , output space Y
I Sometimes, the spaces X or Y have some mathematical
structure (topology, metric, vector space, etc), or we try to
construct such a structure.
I We assume that each space is endowed with a sigma algebra, to
be able to define a probability measure on the space. We
Ulrike von Luxburg: Statistical Machine Learning

ignore this issue in the following (for real world machine


learning this is not an issue).
I Probability distribution P on the product space X × Y
(with product sigma algebra)
I no assumption on the form of the probability distribution
I output variables random as well
105
Summer 2019

A classifier / prediction function


... is simply a function f : X → Y.

We now need to be able to measure how “good” a classifier /


prediction function is.

Depends on the problem we want to solve, e.g.


Ulrike von Luxburg: Statistical Machine Learning

I If Y is discrete, the problem is called classification

I If Y = R, the problem is called regression

I Other outputs are possible, for example “structured prediction”


106
Summer 2019

Loss function
The loss function measures how “expensive” an error is:

A loss function is a function ` : X × Y × Y → R≥0 .


Example:
I The 0-1-loss function for classification is defined as

`(x, y, y') = 0 if y = y', and 1 otherwise

I The squared loss for regression is defined as

`(x, y, y') = (y − y')²

Note: the choice of a loss function influences the inductive bias.


107
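As a small illustration, the two loss functions above can be written as plain Python functions
(a minimal sketch; the argument x is kept in the signature ` : X × Y × Y → R≥0 even though
these two particular losses ignore it):

def zero_one_loss(x, y, y_hat):
    # 0-1 loss: 1 if the prediction is wrong, 0 otherwise
    return 0.0 if y == y_hat else 1.0

def squared_loss(x, y, y_hat):
    # squared (L2) loss for regression
    return (y - y_hat) ** 2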
Summer 2019

Loss function (2)


Note:
I In some applications, it is important that the loss also depends
on x.

CAN YOU COME UP WITH AN EXAMPLE?

I In some applications, it is important that the loss depends on


Ulrike von Luxburg: Statistical Machine Learning

the order of y and y' (the type of error)

CAN YOU COME UP WITH AN EXAMPLE?


108
Summer 2019

True Risk
The true risk (or true expected loss) of a prediction function
f : X → Y (with respect to loss function `) is defined as

R(f) := E(`(X, Y, f(X)))

where the expectation is over the random draw of (X, Y ) according


to the probability distribution P on X × Y.
Ulrike von Luxburg: Statistical Machine Learning

The goal of learning is to use some training data to


construct a function fn that has true risk as small as
possible.
109
Summer 2019

Bayes risk and Bayes classifier


What is the best function we can think of?

I The Bayes risk is defined as

R∗ := inf{R(f) | f : X → Y, f measurable}

(we won’t discuss measurability, if you’ve never heard of it


Ulrike von Luxburg: Statistical Machine Learning

then simply assume that f can be any function you want).


I In case the infimum is attained, the corresponding function

f ∗ := argmin R(f )

is called the Bayes classifier / Bayes predictor.


110
Summer 2019

The training data and learning


Assume we are given supervised training data:
I We draw n training points (Xi , Yi )i=1,...,n ∈ X × Y
i.i.d. (independent and identically distributed) according to
distribution P .

Note: this is a strong assumption!!!


Ulrike von Luxburg: Statistical Machine Learning

The goal of learning is to construct a function fn that has


true risk close to the Bayes risk, that is R(fn ) ≈ R∗ .
111
Summer 2019

Consistency of a learning algorithm


Consider an infinite sequence of data points (Xi , Yi )i∈N that have
been drawn i.i.d. from distribution P over X × Y. Denote by fn
the learning rule that has been constructed by an algorithm A
based on the first n training points.
I We say that the algorithm A is consistent (for probability
distribution P ) if the risk R(fn ) of its selected function fn
Ulrike von Luxburg: Statistical Machine Learning

converges to the Bayes risk, that is

∀ε > 0 : lim_{n→∞} P(R(fn) − R∗ > ε) = 0.


I Convergence is in probability, for those who know what that


means; if we have convergence almost surely, the algorithm is
called strongly consistent.
112
Summer 2019

Consistency of a learning algorithm (2)


I We say that algorithm A is universally consistent if it is
consistent for all possible probability distributions P over
X × Y.

Ultimately, what we want to find are learning algorithms


that are universally consistent: No matter what the underlying
probability distribution is, when we have seen “enough data
Ulrike von Luxburg: Statistical Machine Learning

points”, then the true risk of our learning rule fn will be arbitrarily
close to the best possible risk.
113
Summer 2019

Consistency of a learning algorithm (3)


For quite some time it was unknown whether universally consistent
algorithms exist at all. The first positive answer was in 1977 when
Stone proved that the kNN classifier is universally consistent.

Since then many algorithms have been found to be universally


consistent, among them support vector machines, boosting, and
many more.
Ulrike von Luxburg: Statistical Machine Learning

Understanding the underlying principles behind these algorithms is


the focus of this course, and in the learning theory part we will take
a glimpse on how to get consistency statements.
114
Summer 2019

Statistical and Bayesian Decision theory


Literature:
I Hastie, Section 2.4 - 2.9 (parts only)

I Devroye, Section 2
Ulrike von Luxburg: Statistical Machine Learning

I Duda/Hart, Section 2 (only parts of it, very technical)


115
Summer 2019

What if we know all the underlying quantities?


Before we go into learning, let’s consider how we would solve
classification if we had perfect knowledge of the probability
distribution P .
Ulrike von Luxburg: Statistical Machine Learning
116
Summer 2019

Running example: Male or female?


Predict sex of a person from body height:

HOW WOULD YOU PROCEED? ???


Ulrike von Luxburg: Statistical Machine Learning
117
Summer 2019

Running example: Male or female?


Predict sex of a person from body height:

HOW WOULD YOU PROCEED? ???


[Figure: the class-conditional densities P(X | Y) for female and male over body height
(140–200 cm), and the class priors P(Y).]

GIVEN THIS INFORMATION, HOW WOULD YOU LABEL THE INPUT X = 160?
117
Summer 2019

Approach 1: just look at priors (a bit stupid)


Decide based on class prior probabilities P (Y ).
I If you don’t have any clue what to do, you could simply use
the following rule:
You always predict the label of the “larger class”, that is

fn(X) = m if P(Y = m) > P(Y = f), and f otherwise
118
Summer 2019

Approach 1: just look at priors (a bit stupid) (2)


Visually: select the higher bar
[Figure: bar plot of the class priors P(Y) for female and male.]
119
Summer 2019

Approach 2: maximum likelihood principle


Decide based on the likelihood functions P (X|Y ) (maximum
likelihood approach).

I Consider the class conditional distributions P (X|Y=m) and


P (X|Y = f ).
[Figure: the two class-conditional densities P(X | Y = f) and P(X | Y = m) over body height.]
120
Summer 2019

Approach 2: maximum likelihood principle (2)


I Then predict the label with the higher likelihood:

fn(x) = m if P(X = x|Y = m) > P(X = x|Y = f), and f otherwise

Visually: select according to which curve is higher


Ulrike von Luxburg: Statistical Machine Learning
121
Summer 2019

Approach 3: Bayesian a posteriori criterion


Decide based on the posterior distributions P (Y |X) (“Bayesian
maximum a posteriori approach”):

I Compute the posterior probabilities

P(Y = m|X = x) = P(X = x|Y = m) · P(Y = m) / P(X = x)

I Predict by the following rule:

fn(x) = m if P(Y = m|X = x) > P(Y = f|X = x), and f otherwise
122
Summer 2019

Approach 3: Bayesian a posteriori criterion (2)

Visually: select according to which curve is higher

[Figure: the posterior probabilities P(Y = m|X = x) and P(Y = f|X = x) as functions of the
body height x.]

(figure is for uniform prior)


123
Summer 2019

Approach: also take costs of errors into account


Take the “costs” of errors into account:
I Define a loss function `(x, y, ŷ) that tells you how much loss
you incur by classifying the label of x as ŷ if the true label is y.
I The risk R(ŷ|X = x) := E(`(x, Y, ŷ)|X = x) is the expected loss we
incur at point x when predicting ŷ (where the expectation is
over the randomness in the sample, in this case only the
randomness concerning the true label Y of x).

I Explicitly, the conditional risk at point x is

R(ŷ|X = x) = `(x, m, ŷ) P(Y = m|X = x) + `(x, f, ŷ) P(Y = f|X = x)

I Use Bayes decision rule: Select the label fn (X) for which the
conditional risk is minimal.
124
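The following Python sketch puts these ingredients together for the running example. It is not
the demo_bayesian_decision_theory.m script; the Gaussian class conditionals, the priors and the
loss weights are made-up numbers for illustration only:

import numpy as np

# Assumed toy model (all numbers made up): heights in cm, Gaussian class conditionals.
PRIORS = {"f": 0.5, "m": 0.5}
PARAMS = {"f": (166.0, 7.0), "m": (179.0, 7.0)}   # (mean, std) of P(X | Y)
LOSS = {("m", "f"): 1.0,                          # loss of predicting m when the truth is f
        ("f", "m"): 2.0,                          # predicting f when the truth is m (assumed more costly)
        ("m", "m"): 0.0, ("f", "f"): 0.0}

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def posterior(y, x):
    # P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)
    evidence = sum(gaussian_pdf(x, *PARAMS[c]) * PRIORS[c] for c in ("f", "m"))
    return gaussian_pdf(x, *PARAMS[y]) * PRIORS[y] / evidence

def conditional_risk(y_hat, x):
    # R(y_hat | X = x) = sum over true labels y of loss(x, y, y_hat) * P(Y = y | X = x)
    return sum(LOSS[(y_hat, y)] * posterior(y, x) for y in ("f", "m"))

def bayes_decision(x):
    # Bayes decision rule: pick the prediction with the smallest conditional risk at x
    return min(("f", "m"), key=lambda y_hat: conditional_risk(y_hat, x))

print(bayes_decision(160.0), bayes_decision(185.0))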
125 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Example: male vs female

Run demo_bayesian_decision_theory.m
Summer 2019

Example: male vs female (2)


[Figure (screenshot of the demo): panels showing the class conditionals P(X | Y), the priors
P(Y), the marginal P(X), the posteriors P(Y|X), the loss weights, the pointwise risk
R(prediction | X = x), and the overall risk at the given position of the decision threshold.]
126
Summer 2019

Example: male vs female (3)


[Figure: the same panels as on the previous slide, for a different setting of the demo parameters.]
127
Summer 2019

Example: male vs female (4)


[Figure: the same panels as before, for a different setting of the demo parameters.]
128
Summer 2019

Example: male vs female (5)


[Figure: the same panels as before, for a different setting of the demo parameters.]
129
Summer 2019

Example: male vs female (6)


[Figure: the same panels as before, for a different setting of the demo parameters.]
130
131 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Optimal prediction functions in closed form


132 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

... for classification under 0-1 loss


Summer 2019

Regression function (context of classification)


Consider (X, Y ) drawn according to a probability distribution P on
the product space X × {0, 1}. We want to describe the distribution
P in terms of two other quantities:
I Let µ be the marginal distribution of X, that is
µ(A) = P (X ∈ A).
Ulrike von Luxburg: Statistical Machine Learning

I Define the so-called regression (!) function:

η(x) := E(Y | X = x)

I In the special case of classification, the regression function can
be rewritten as

η(x) = 0 · P(Y = 0 | X = x) + 1 · P(Y = 1 | X = x) = P(Y = 1 | X = x)
133
Summer 2019

Regression function (context of classification) (2)


Intuition:
I If η(x) is close to 0 or close to 1, then classifying x is easy.

I If η(x) is close to 0.5, then classifying x is difficult.

WHY?
Ulrike von Luxburg: Statistical Machine Learning
134
Summer 2019

Regression function (context of classification) (3)


Proposition 1 (Unique decomposition)
The probability distribution P is uniquely determined by µ and η.

Intuition (discrete case): We can rewrite

P (X = x, Y = 1) = P (Y = 1|X = x)P (X = x)
Ulrike von Luxburg: Statistical Machine Learning

= η(x)µ(x)

and similarly

P (X = x, Y = 0) = P (Y = 0|X = x)P (X = x)
= (1 − η(x))µ(x)

So we can express the probability of any event (X, Y ) in terms of


η and µ.
135
Summer 2019

Regression function (context of classification) (4)


Formal proof for the general case:
... see the book of Devroye, Györfi, Lugosi, first pages.
Ulrike von Luxburg: Statistical Machine Learning
136
Summer 2019

Explicit form of the Bayes classifier


Consider the 0-1-loss function. Recall:
I the risk of a classifier under the 0-1-loss counts “how often”
the classifier fails, that is

R(f) = E(`(X, Y, f(X))) = E(1_{f(X) ≠ Y}) = P(f(X) ≠ Y).

IThe Bayes classifier f ∗ was defined as the classifier that


Ulrike von Luxburg: Statistical Machine Learning

minimizes the true risk (this is an implicit definition, we don’t


yet have a formula for it).
Now consider the following classifier:

f ◦(x) := 1 if η(x) ≥ 1/2, and 0 otherwise
137
Summer 2019

Explicit form of the Bayes classifier (2)


Theorem 2 (f ◦ is the Bayes classifier)
Let f : X → {0, 1} be any (measurable) decision function and f ◦
the classifier defined above. Then R(f) ≥ R(f ◦).

Remark:
I The theorem shows that f ◦ = f ∗ (WHY?)
Ulrike von Luxburg: Statistical Machine Learning

I Consequence: in the particular case of classification with the


0-1-loss, we have an explicit formula for the Bayes classifier.
I In practice, this doesn’t help, WHY?
138
Summer 2019

Explicit form of the Bayes classifier (3)


Proof of Theorem 2:
Step 1: Consider any fixed classifier f : X → {0, 1} and compute
its error probability at some fixed point x:

P(f(x) ≠ Y | X = x)
= 1 − P(f(x) = Y | X = x)
= 1 − P(f(x) = 1, Y = 1 | X = x) − P(f(x) = 0, Y = 0 | X = x)
(∗)
= 1 − 1_{f(x)=1} P(Y = 1 | X = x) − 1_{f(x)=0} P(Y = 0 | X = x)
= 1 − 1_{f(x)=1} η(x) − 1_{f(x)=0} (1 − η(x))

For step (∗), observe that f (x) is a deterministic function.


139
Summer 2019

Explicit form of the Bayes classifier (4)


Step 2: Now compare the pointwise error of any other classifier f
to the one of f ◦ :

P(f(X) ≠ Y | X = x) − P(f ◦(X) ≠ Y | X = x)
= ... plug in the formula from the last page and simplify ...
= (2η(x) − 1)(1_{f ◦(x)=1} − 1_{f(x)=1})
(∗∗)
≥ 0

To see the last step (∗∗):


I if f ◦ (x) = 1, then η(x) ≥ 0.5, so both terms ≥ 0.

I if f ◦ (x) = 0, then η(x) ≤ 0.5, so both terms ≤ 0.


140
Summer 2019

Explicit form of the Bayes classifier (5)


Step 3: We have seen that for all fixed values x, the probability of
error satisfies

P(f(X) ≠ Y | X = x) ≥ P(f ◦(X) ≠ Y | X = x)

Because this holds for any individual value of x, it also holds in


Ulrike von Luxburg: Statistical Machine Learning

expectation over all x. This implies

R(f ) ≥ R(f ◦ ).

,
141
Summer 2019

Explicit form of the Bayes classifier (6)


Remarks:
I If we work with 0-1-loss and if we know the regression
function, then we don’t need to “learn”, we can simply write
down what the optimal classifier is.
I One can also explicitly compute the optimal classifier for many
other loss functions, it is also going to depend on the
regression function. We skip this.
Ulrike von Luxburg: Statistical Machine Learning

I Problem in practice: we don’t know the regression function.


142
Summer 2019

Plug-in classifier
Simple idea: If we don’t know the underlying distribution, but are
given some training data, simply estimate the regression function
η(x) by some quantity ηn (x) and build the plugin-classifier
fn(x) := 1 if ηn(x) ≥ 0.5, and 0 otherwise
Ulrike von Luxburg: Statistical Machine Learning

Theoretical considerations:
I It can be shown that the plugin-approach is universally
consistent. ,
Practical considerations:
I Estimating densities is notoriously hard, in particular for
high-dimensional input spaces. So unfortunately, the
plugin-approach is useless for practice. /
143
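As a sketch of the plug-in idea in code (with one possible non-parametric choice of ηn, namely a
kNN estimate of the regression function; this is an illustration under that assumption, not the
density-estimation approach mentioned above):

import numpy as np

def eta_hat(x, X_train, y_train, k=10):
    # a possible estimate of eta(x) = P(Y=1 | X=x): the fraction of 1-labels
    # among the k nearest training points (labels assumed to be in {0, 1})
    d2 = ((X_train - x) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:k]
    return y_train[nearest].mean()

def plugin_classifier(x, X_train, y_train, k=10):
    # predict 1 if the estimated regression function is >= 0.5, else 0
    return 1 if eta_hat(x, X_train, y_train, k) >= 0.5 else 0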
144 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

... for regression under L2 loss


Summer 2019

Loss functions for regression


While in classification, there is a “natural loss function” (the
0-1-loss), there exist many loss functions for regression and it is not
so obvious which one is the most useful one.

In the following, let’s look at the classic case, the squared loss
function:
Ulrike von Luxburg: Statistical Machine Learning

Squared loss (L2 -loss): `(x, y, f (x)) = (f (x) − y)2


145
Summer 2019

Regression function (context of L2 regression)


As in the classification setting, we define the regression function:

η(x) = E(Y | X = x)

Recall: E stands for expectation, not for error...


Ulrike von Luxburg: Statistical Machine Learning

We now want to show an explicit formula for the Bayes learner as


well. As in the classification case, we fix a particular loss function,
this time it is the squared loss.

We need one more intermediate result:


146
Summer 2019

Regression function (context of L2 regression)


(2)
Proposition 3 (Decomposition)
We always have

E(|f(X) − Y|²) = E(|f(X) − η(X)|²) + E(|η(X) − Y|²).
Ulrike von Luxburg: Statistical Machine Learning

Note: Getting a related inequality with ≤ is trivial (by the triangle


inequality), but the equality in this statement is not obvious.

Proof.

(see Gyorfi, Kohler, Krzyzak, Walk: Distribution-free theory for nonparametric


regression, p.2)
147
148 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Regression function (context of L2 regression) (3)
149 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Regression function (context of L2 regression) (4)

,
Summer 2019

Explicit form of optimal solution under L2 loss


Define the following learning rule that predicts the real-valued
output based on the regression function η:

f ◦ : X → R, f ◦ (x) := η(x)

Theorem 4 (Explicit form of optimal L2 -solution)


Ulrike von Luxburg: Statistical Machine Learning

The function f ◦ minimizes the L2 -risk.


Proof. Follows directly from Proposition 3:
I Second expectation on the rhs does not depend on f .

I First expectation is always ≥ 0, and it is = 0 for


f (X) = η(X).
I So the whole right hand side is minimized by f (X) = η(X).
150

,
151 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Basic learning principles: ERM, RRM


Summer 2019

Two major principles


I Assume we operate in the standard setup, and are given a set
of training points (Xi , Yi ).
I Based on these points we want to “learn” a function
f : X → Y that has as small true loss as possible.

There are two major approaches to supervised learning:


Ulrike von Luxburg: Statistical Machine Learning

I Empirical risk minimization (ERM)


I Regularized risk minimization (RRM)
152
Summer 2019

Empirical risk minimization


As we don’t know P we cannot compute the true risk. But we can
compute the empirical risk based on a sample (Xi , Yi )i=1,...,n

Rn(f) := (1/n) Σ_{i=1}^n `(Xi, Yi, f(Xi))
Ulrike von Luxburg: Statistical Machine Learning

The key point is that the empirical risk can be computed based on
the training points only.
153
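In code, the empirical risk is just an average over the training sample. A minimal sketch,
reusing a loss function with the signature `(x, y, y') from the earlier slides:

import numpy as np

def empirical_risk(f, X, Y, loss):
    # R_n(f) = (1/n) * sum_i loss(X_i, Y_i, f(X_i))
    return float(np.mean([loss(x, y, f(x)) for x, y in zip(X, Y)]))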
Summer 2019

Empirical risk minimization (2)


Empirical risk minimization approach:
I Define a set F of functions from X → Y.

I Within these functions, choose one that has the smallest


empirical risk:

fn := argmin_{f ∈ F} Rn(f)
Ulrike von Luxburg: Statistical Machine Learning

(might not be unique; for simplicity, let’s assume the minimizer


exists)
154
Summer 2019

Estimation vs approximation error


With this approach, we can make two types of error:
I Denote by f˜ the true best function in the set F, that is
f˜ = argminf ∈F R(f ).
I The quantity R(fn ) − R(f˜) is called the estimation error. It is
a random variable that depends on the random sample.
I The quantity R(f˜) − R(f ∗ ) is called the approximation error.
Ulrike von Luxburg: Statistical Machine Learning

It is a deterministic quantity that does not depend on the


sample, but on the choice of the space F.
155
156 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Estimation vs approximation error (2)


Summer 2019

Estimation vs approximation error (3)


In the following sketch, one curve shows the approximation error,
one the estimation error.
Ulrike von Luxburg: Statistical Machine Learning

I WHICH ONE IS WHICH?


I HOW WOULD THE CURVE OF THE TRUE RISK OF THE
CLASSIFIER LOOK LIKE?
157
Summer 2019

Overfitting vs Underfitting
Coming back to the terms underfitting and overfitting:
I Underfitting happens if F is too small. In this case we have a
small estimation error but a large approximation error.
I Overfitting happens if F is too large. Then we have a high
estimation error but a small approximation error.
Ulrike von Luxburg: Statistical Machine Learning
158
Summer 2019

Bias-Variance tradeoff in L2-regression


Sometimes another decomposition of the errors is used. Can be
seen most easily for the case of regression with L2 loss:
Let fn be the function constructed by an algorithm on n points, and
f ∗ : Rd → R the true best function (the regression function). Then
we can decompose the pointwise expected L2 risk in two terms:

E(|fn(x) − f∗(x)|²) = E(|fn(x) − E(fn(x))|²) + (E(fn(x)) − f∗(x))²

where the first term on the right hand side is called the variance term and the second one
the bias term.

Note: we always have ≤ (for any loss function), but for the L2 -loss
we get equality (as we have seen in Proposition 3).

Proof. skipped (see Gyorfi, Kohler, Krzyzak, Walk: Distribution-free theory


for nonparametric regression, p.24)
159
Summer 2019

Bias-Variance tradeoff in L2-regression (2)


Intuition:
I The variance term: same intuition as estimation error, depends
on random data and the capacity of the function class F.
I The bias term: same intuition as the approximation error.
Does not depend on the data, just on the capacity of the
function class F.
Ulrike von Luxburg: Statistical Machine Learning
160
161 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Bias-Variance tradeoff in L2-regression (3)


Summer 2019

ERM, remarks
I From a conceptual/theoretical side, ERM is a straightforward
learning principle.
I The key to the success / failure of ERM is to choose a “good”
function class F
I From the computational side, it is not always easy (depending
on function class and loss function, the problem can be quite
Ulrike von Luxburg: Statistical Machine Learning

challenging: finding the minimizer of the 0-1-loss is often NP


hard.) This is why in practice we use convex relaxations of the
0-1-loss function, see later.
162
Summer 2019

Regularized risk minimization


Crucial problem in ERM: choose F

Alternative approach:
I Let F be a very large space of functions.

I Define a regularizer Ω : F → R≥0 that measures how


“complex” a function is. Examples:
I F = polynomials, Ω(f) = degree of the polynomial f
I F = differentiable functions, Ω(f) = maximal slope


I Define the regularized risk

Rreg,n (f ) := Rn (f ) + λ · Ω(f )

Here λ > 0 is called regularization constant.


I Then choose f ∈ F to minimize the regularized risk.
163
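A minimal sketch of this principle in Python, using the first example above (F = polynomials,
Ω(f) = degree): for each candidate degree we take the least-squares fit of that degree and then
pick the degree with the smallest regularized risk. The data, the maximal degree and the value
of λ are arbitrary choices for illustration:

import numpy as np

def regularized_fit(x, y, max_degree=8, lam=0.01):
    # choose, within F = {polynomials of degree <= max_degree}, the function minimizing
    # R_reg,n(f) = R_n(f) + lam * Omega(f), with Omega(f) = degree of the polynomial
    best = None
    for deg in range(max_degree + 1):
        coeffs = np.polyfit(x, y, deg)                       # least-squares fit of this degree
        emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
        reg_risk = emp_risk + lam * deg
        if best is None or reg_risk < best[0]:
            best = (reg_risk, deg, coeffs)
    return best

x = np.linspace(-1.0, 1.0, 30)
y = np.sin(2.0 * np.pi * x) + 0.1 * np.random.randn(30)     # made-up data
print(regularized_fit(x, y)[1])                             # the chosen degree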
Summer 2019

Regularized risk minimization (2)


Intuition:
I If I can fit the data reasonably well with a “simple function”,
then choose such a simple function.
I If all simple functions lead to a very high empirical risk, then
better choose a more complex function.
Ulrike von Luxburg: Statistical Machine Learning
164
Summer 2019

Regularized risk minimization (3)


EXERCISE: WHAT HAPPENS IF λ IS VERY SMALL? VERY
LARGE?
Ulrike von Luxburg: Statistical Machine Learning
165
166 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Linear methods for
supervised learning
167 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Linear methods for regression


Summer 2019

Linear least squares regression


Literature:
I Hastie/Tibshirani/Friedman Section 3

I Bishop Sec 3
Ulrike von Luxburg: Statistical Machine Learning
168
Summer 2019

Linear setup
I Assume we have training data (Xi , Yi ) with Xi ∈ X := Rd and
Yi ∈ Y := R.
I We want to find the “best” linear function, that is a function
of the form

f(x) = Σ_{i=1}^d wi x^(i) + b

where x = (x^(1), ..., x^(d))^t ∈ Rd.

The wi are called “weights” and b the “offset” or “intercept”


or “threshold”.
169
Summer 2019

Linear setup (2)


As loss function, we want to use the squared loss (L2 loss).

Formally, the linear least squares problem is the following:

(#) Find parameters w1 , ..., wd ∈ R and b ∈ R such that the


empirical least squares error of the linear function f (as defined on
the last slide) is minimal:
Ulrike von Luxburg: Statistical Machine Learning

(1/n) Σ_{i=1}^n (Yi − f(Xi))²
170
Summer 2019

Example
Want to predict the shoe size of a person, based on many input
values:
I For each person X, we have a couple of real-valued
measurements: X (1) = height, X (2) = weight, X (3) = income,
X (4) = age.
(Note: some measurements are useful for the question, some
might not be useful)
Ulrike von Luxburg: Statistical Machine Learning

I In this example, we might find that the following function is


good for predicting the shoe size:

shoesize = (2/10) · height + 0 · weight + 0 · income + 0 · age + 1
171
Summer 2019

Concise notation
I To write everything in a more concise form, we stack the
training inputs into a big matrix X ∈ Rn×d (each point Xi is one row) and
the outputs in a big vector Y = (Y1, ..., Yn)^t ∈ Rn.
Ulrike von Luxburg: Statistical Machine Learning

I Notation: the i-th training point consists of the vector


Xi ∈ Rd , its entries are denoted as Xi1 , ..., Xid .
I Now we can write:

f(Xi) = ⟨Xi, w⟩ + b = (Xw)i + b


172
Summer 2019

Concise notation (2)


I Formally, the linear least squares problem is the following:

(##) Determine w ∈ Rd and b ∈ R as to minimize the


empirical least squares error
(1/n) Σ_{i=1}^n (Yi − ((Xw)i + b))²
Ulrike von Luxburg: Statistical Machine Learning
173
Summer 2019

Getting rid of b
We want to write the problem even more concisely.
I Define the augmented matrix X̃ ∈ Rn×(d+1) by appending a column of ones
to X, and set w̃ := (w1, ..., wd, b)^t ∈ Rd+1.
I Then we have

(X̃w̃)i = Σ_{k=1}^{d+1} X̃ik w̃k = Σ_{k=1}^d Xik wk + b = (Xw)i + b
174
Summer 2019

Getting rid of b (2)


I Hence, there is a unique correspondence between the original
problem and the following new problem:

(###) determine w̃ ∈ Rd+1 as to minimize the empirical least


squares error
(1/n) Σ_{i=1}^n (Yi − (X̃w̃)i)² = (1/n) ‖Y − X̃w̃‖²
175
Summer 2019

Getting rid of b (3)


Without loss of generality, from now on we consider the simplified
problem that does not involve the intercept b. We also remove the
twiddles on the letters X̃ and w̃ to make notation simpler. We still
call the resulting problem (###).

(###) Determine w ∈ Rd as to minimize the empirical least


squares error
Ulrike von Luxburg: Statistical Machine Learning

(1/n) Σ_{i=1}^n (Yi − (Xw)i)² = (1/n) ‖Y − Xw‖²

In the following, we sometimes consider different factors in front of


the norm (for example, we might drop the 1/n, and include a factor
1/2 for mathematical convenience). It doesn’t change the solution,
but the formulas then look nicer.
176
Summer 2019

ML ⇝ Optimization problem
We can see:
I In order to solve (###), we need to solve an optimization
problem
I In this particular case, we will see in a minute that we can
solve it analytically.
I For most other ML algorithms, we need to use optimization
Ulrike von Luxburg: Statistical Machine Learning

algorithms to achieve this.


177
Summer 2019

Least squares regression is convex


Recap:
I Convex optimization problem (maths appendix, page 1219)

Proposition 5 (Least squares is convex)


The least squares optimization problem (###) is a convex
optimization problem.
Ulrike von Luxburg: Statistical Machine Learning

Proof. Exercise
178
Summer 2019

Solution, case of full rank


Recap:
I Inverse of a matrix

I Rank of a matrix
Ulrike von Luxburg: Statistical Machine Learning
179
Summer 2019

Solution, case of full rank (2)


Theorem 6 (Solution, case rank(X) = d)
Assume that X has rank d. Then the solution w of linear least
squares regression (###) is given by w = (X t X)−1 X t Y .

Proof intuition.
I Want to find the minimum of the function ‖Y − Xw‖²
Ulrike von Luxburg: Statistical Machine Learning

I Take the derivative and set it to 0.

I Then we have to check that what we get is indeed a minimum


(in a 1-dimensional situation we would look at the second
derivative for this).
I The minimum then has to be a global minimum because the
objective function is convex.
We can either do all this by foot, coordinate-wise. Or we do it more
elegantly as follows:
180
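A quick numerical sanity check of this formula on synthetic data (a minimal sketch; the
ground-truth weights and noise level are made up, and in practice one would rather solve the
normal equations than form the inverse of X^t X explicitly):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))                  # design matrix with full column rank (almost surely)
w_true = np.array([2.0, -1.0, 0.5])          # made-up ground-truth weights
Y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy linear targets

w_hat = np.linalg.inv(X.T @ X) @ X.T @ Y     # w = (X^t X)^{-1} X^t Y
# numerically it is preferable to solve the normal equations instead of forming the inverse:
w_hat_stable = np.linalg.solve(X.T @ X, X.T @ Y)
print(w_hat, w_hat_stable)                   # both are close to w_true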
Summer 2019

Solution, case of full rank (3)


Proof, formally. We write all equations in matrix form:
I Objective function: Obj : Rd → R, Obj(w) := (1/2) ‖Y − Xw‖²
I Derivative: the gradient is a vector in Rd consisting of all
partial derivatives, it is given as
grad(Obj)(w) = −X t (Y − Xw).
(To see this, either use matrix derivatives or compute all the
partial derivatives by foot, see slide below.)
Ulrike von Luxburg: Statistical Machine Learning

I Setting the gradient to zero gives the necessary condition:


X t Y = (X t X)w. Ideally, we would like to solve this for w.
I We always have rank(X) = rank(X t X) = rank(XX t ). In
particular, under the assumption that X has rank d, the matrix
X t X is of full rank, hence invertible.
I So we can solve for w by w = (X t X)−1 X t Y .
181
Summer 2019

Solution, case of full rank (4)


I Now we need to figure out whether this is indeed a minimum.
To this end, consider the Hessian matrix that contains all
second derivatives: H(Obj) = ∂²Obj/(∂w ∂w^t) = X^t X.
I This matrix is positive semi-definite: obviously all eigenvalues
≥ 0,
(WHY ???)
and because of the rank condition we have > 0. So the
Ulrike von Luxburg: Statistical Machine Learning

solution we computed above is indeed a local minimum.


,
182
Summer 2019

Solution, case of full rank (5)


Side remark, here is the “derivative by foot”: To see that
the gradient is indeed given by the expression on the previous slide,
we again compute it, this time coordinate-wise:
I Objective function, written explicitly: Obj : Rd → R,
Obj(w) = (1/2) ‖Y − Xw‖² = (1/2) Σ_{i=1}^n (Yi − Σ_k wk Xi^(k))².
Ulrike von Luxburg: Statistical Machine Learning

I Take partial derivatives, for each wk separately (attention, the
variables of this function are the wk, not the Xi):

∂Obj(w)/∂wk = Σ_{i=1}^n (Yi − Σ_l wl Xi^(l)) · (−Xi^(k))

I Now finally note that the right hand side is exactly the k-th
coordinate of the vector −X^t(Y − Xw), that is, of the gradient.
183
Summer 2019

Solution, general case


Recap:
I Generalized inverse of a matrix (see maths appendix page
1204)
Ulrike von Luxburg: Statistical Machine Learning
184
Summer 2019

Solution, general case (2)


Theorem 7 (Solution, case rank(X) < d)
Assume that X has rank < d.
1. Then a solution w of linear least squares regression (###) is
given by w = (X t X)+ X t Y , where A+ denotes the generalized
inverse of a matrix A.
2. This solution is not unique. But even if w1 , w2 are two
Ulrike von Luxburg: Statistical Machine Learning

different solutions, then their predictions agree on the training


data, that is hw1 , Xi i = hw2 , Xi i for all i = 1, ..., n.

Proof (sketch).
I As above we get the necessary condition X t Y = (X t X)w.

I One can check that one particular vector w that satisfies this
condition is given as w = (X t X)+ X t Y (EXERCISE!)
185
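A small numerical illustration of both parts of the theorem, with an artificially rank-deficient
design matrix (the construction with duplicated columns is just one convenient way to force
rank(X) < d in this sketch):

import numpy as np

rng = np.random.default_rng(1)
n = 100
B = rng.normal(size=(n, 3))
X = np.hstack([B, B[:, :2]])                 # duplicate two columns: rank(X) = 3 although d = 5
Y = X[:, 0] - X[:, 1] + 0.05 * rng.normal(size=n)

w = np.linalg.pinv(X.T @ X) @ X.T @ Y        # w = (X^t X)^+ X^t Y, generalized (Moore-Penrose) inverse
v = np.array([1.0, 1.0, 0.0, -1.0, -1.0])    # X v = 0 by construction of the duplicated columns
# w + v also satisfies the necessary condition, but the predictions on the training points agree:
print(np.allclose(X @ w, X @ (w + v)))       # True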
Summer 2019

Solution, general case (3)


I So w = (X t X)+ X t Y is one solution to the problem, which
proves part 1 of the theorem.
I However, w is not unique:
I Let w be a solution and v any vector with Xv = 0 (exists
because X has rank < d).
I Then w + v is a solution as well (EXERCISE: CHECK IT BY
PLUGGING IT IN THE NECESSARY CONDITION).
Ulrike von Luxburg: Statistical Machine Learning

I So our problem has many solution vectors. Note that all of


them lead to the same objective value.
I Moreover observe: all solutions give the same results on the
training points:
186
Summer 2019

Solution, general case (4)


I Let w1 be a solution, and w2 = w1 + v a solution as well.
Then the vector of all predictions on the training points looks
as follows:

Xw2 = X(w1 + v) = Xw1 + Xv = Xw1

(the left hand side is the vector of predictions by w2, the right hand side the one by w1)

So all solutions to the problem predict the same values on the


Ulrike von Luxburg: Statistical Machine Learning

training points.
I But on the test points, the solutions will disagree. The
question is then which one to prefer. One idea here is to use
regularization, see below.
,
187
Summer 2019

Relationship between n and d


n = number of points, d = dimension of the space.

WHAT DO YOU THINK IS BETTER:


I n high, d low

I d high, n low

I n ≈ d
Ulrike von Luxburg: Statistical Machine Learning

??????????
188
Summer 2019

Relationship between n and d (2)


Formally:
I the linear system we solve for least squares regression is
X t Y = (X t X)w. It has d equations and d unknowns
(independently of the value of n).
I So either there exists a unique solution (case rank(X t X) = d)
or many solutions (case rank < d).
I Note that the system always has a solution, no matter how
Ulrike von Luxburg: Statistical Machine Learning

large n is.

Intuition: we just want to find the best linear function, we


don’t require that it goes through the data points exactly.
189
Summer 2019

Relationship between n and d (3)


Informally, we interpret d as the number of parameters and n as the
number of constraints.

Case d ≪ n:
I This is the harmless case: we have many points in a
low-dimensional space. Here linear functions are not very
flexible, and we tend not to overfit (sometimes we underfit).
Ulrike von Luxburg: Statistical Machine Learning
190
Summer 2019

Relationship between n and d (4)


Case d ≫ n:
I Typically, it is a bad idea to have many more parameters than
constraints because it leads to overfitting.
I Geometric reason: if we have few points (n) in a very
high-dimensional space (d), then linear functions are very
powerful (in machine learning terms, the size of the function
class is large because the dimension of the space is so large).
Ulrike von Luxburg: Statistical Machine Learning

This leads to overfitting.


191
Summer 2019

Summary: Linear least squares regression


I Regression problem, X = Rd , Y = R
I Loss function: L2 -loss
I Function class F: set of all linear functions over X (this space
is “pretty small”).
I No regularization.
Finding the linear function that minimizes the empirical L2 -loss
Ulrike von Luxburg: Statistical Machine Learning

I
is a convex optimization problem, and we can compute its
solution analytically.
192
193 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Feature representation of data


Summer 2019

Feature representation of data


On the first glance, the assumption that the data points are in Rd
looks pretty restrictive. What if our data is not “numbers”?

It turns out that in many cases it is a good idea to represent


“objects” by “feature vectors”.
Ulrike von Luxburg: Statistical Machine Learning
194
Summer 2019

Feature representation of data (2)


Example: bag of words representation for texts
I Make a list of all words occurring in the text
I Throw away all words that are too common (“the”, “a”, “for”,
“you”, ... )
I Use “stemming” to throw away word endings (like the plural
“s”): we want to consider the word “horse” the same as
Ulrike von Luxburg: Statistical Machine Learning

“horses”)
I For each text, count how often each word occurs
I Then represent each text as a vector: each dimension
corresponds to one word, and the entry of the vector is how
often this word occurs in the given text.
195
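A toy sketch of such a bag-of-words feature map (the stopword list and the "strip a trailing s"
stemming are crude placeholders for a real stopword list and a real stemmer):

import re
from collections import Counter

STOPWORDS = {"the", "a", "for", "you", "and", "of", "to"}   # assumed small stopword list

def bag_of_words(text, vocabulary):
    # map a text to a count vector over a fixed vocabulary (very crude stemming: strip a trailing 's')
    words = [w.rstrip("s") for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    return [counts[w] for w in vocabulary]

vocab = ["horse", "soccer", "team"]
print(bag_of_words("The horses play soccer for the team", vocab))   # [1, 1, 1]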
196 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Feature representation of data (3)


Summer 2019

Feature representation of data (4)


Example: strings in a feature representation
I Given a string

I Represent it by counting substrings (can also allow substrings


with “gaps” in between)
Ulrike von Luxburg: Statistical Machine Learning
197
Summer 2019

Feature representation of data (5)


Example: motif representation of graphs (such as chemical
molecules)

Count the occurrence of certain subgraphs (called motifs):


Ulrike von Luxburg: Statistical Machine Learning
198
Summer 2019

Feature representation of data (6)


Example: books and/or users in amazon.

I can describe a book by how often it was bought by each user


I or by how often it was bought together with each other book.

Feature representation of data (7)


Example: images

I Can obviously represent images as vectors of greyscale values,
or RGB values or CMYK values ...

Feature representation of data (8)


General procedure that works very often:
I Given a set of “objects” (texts, graphs, images, emails, ...)
I Describe the objects by simple “features” that can be
expressed as numbers
I Together, these features give a feature vector ∈ Rd .
I Note that often, the dimension d ends up very large! The
incentive is: give as much information as possible to the
learning algorithm, and hope that it is going to identify /
extract the information that is helpful for classification.
Feature representation of data (9)


In machine learning, the mapping Φ : X → Rd that takes an
abstract object X to its feature representation is called the feature
map. It is usually denoted by Φ.

All in all, the assumption that “data is in Rd ” does make


sense in very many applications.

Least squares with linear combination of basis functions

Literature:
I Hastie/Tibshirani/Friedman Section 3
I Bishop Section 3
Using non-linear basis functions


Idea:
I Linear functions are often quite restrictive.
I Instead, want to learn a function of the form
f (x) = \sum_{i=1}^{D} wi Φi (x)

where the functions Φ1 , ..., ΦD are arbitrary “basis functions”.
I Note: f is linear in the parameters w, but if the functions Φ
are non-linear in x, then so is f .
Examples
Example 1:
I Your data lives in Rd , but it clearly cannot be described by a
linear function of the original coordinates. Alternatively, you
can fit a function of the form
f (x) = \sum_{i=1}^{D} wi Φi (x)

(where the number D of basis functions does not need to
coincide with the original dimension d)
Examples (2)
I For example, if we want to learn a periodic function, the Φi
might be the first couple of Fourier functions to fit a function
of the form
g(x) = \sum_{k=1}^{D} wk sin(kx)

I In some other case, we might want to choose the basis
functions Φi as the polynomials x, x^2 , ..., x^D to fit a function of
the form

h(x) = \sum_{k=1}^{D} wk x^k

In this way you can use a linear algorithm (find the linear
coefficients wk ) to fit non-linear functions (such as g(x) or h(x)) to
your data.
Examples (3)
Example 2: Feature spaces
The input X consists of a web page, the task is to predict how
many seconds users stay on the page before they leave it again.

We might consider basis functions such as the following:
I Φ1 counts the number of occurrences of the word “soccer”
I Φ2 counts the number of occurrences of the word “team”
I Φ3 tells you how many words the text has in total
I ... etc ...
How to solve it
It is easy to rewrite the “standard” least squares problem in this
more general framework:
I Define the design matrix Φ ∈ Rn×D with entries Φij = Φj (Xi ).
I Then the least squares problem is to find w so as to minimize
kY − Φwk2 .
I This has the solution w = (Φt Φ)−1 Φt Y (with exact inverse or
generalized inverse) as we have seen above (a numerical sketch follows below).
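A concrete sketch of this computation (assuming Python with NumPy; np.linalg.lstsq solves the least squares problem directly, which is numerically preferable to forming (Φt Φ)−1 explicitly):

```python
import numpy as np

# Toy data: a noisy periodic function on the real line
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=50))
Y = np.sin(2 * X) + 0.1 * rng.standard_normal(50)

# Design matrix for the basis functions Phi_k(x) = sin(kx), k = 1, ..., D
D = 10
Phi = np.column_stack([np.sin(k * X) for k in range(1, D + 1)])   # shape (n, D)

# Least squares fit: minimize ||Y - Phi w||^2
w, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
print(np.round(w, 2))    # most of the weight should end up on the k = 2 component
```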
Advantages and disadvantages


I Note that in the given scenario, we did not choose our
function basis to depend on the input. We chose the functions
before we got to see the data points.
I If we have prior knowledge about our data, we can select a
“good” set of basis functions.
I For any useful inference, the dimension D of the feature space
has to be much smaller than the number n of data points.
I WHY, AGAIN???
I This is still quite restrictive. Just consider the case where our
data is D-dimensional and we want to have a function
space with polynomials of degrees one and two. There are
already of the order D2 many basis polynomials of degree two
(xi xj for i, j = 1, ..., D). So it can easily happen that we
need more basis functions than we can cope with...
Advantages and disadvantages (2)


I There is one way out of this trap, namely to regularize, in
particular by enforcing sparsity. See Lasso below.
Ridge regression: least squares with L2-regularization

Literature: Hastie/Tibshirani/Friedman Section 3.4.3; Bishop
Section 3
Idea
Want to improve standard L2 -regression. Two points of view:

1. Want to have a unique solution, no matter what the rank of


the design matrix is. This is going to improve numerical
stability.

2. In the standard problem, the coefficients wi can become very
large. This leads to a high variance of the results.
To avoid this effect, we want to introduce regularization to
force the coefficients to stay “small”.
Ridge regression problem


Consider the following regularization problem:
I Input space X arbitrary, output space Y = R.
I Fix a set of basis functions Φ1 , ..., ΦD : X → R
I As function space choose all functions of the form
f (x) = \sum_i wi Φi (x).
I As regularizer use Ω(f ) := kwk2 = \sum_{i=1}^{D} wi^2 . Choose a
regularization constant λ > 0.
I Then solve the problem

wn,λ := argmin_{w ∈ RD} (1/n) kY − Φwk2 + λkwk2 .
Solution
Theorem 8 (Solution of Ridge Regression)
The coefficients wn,λ that solve the ridge regression problem are
given as
wn,λ := (Φt Φ + nλ ID)^{−1} Φt Y
where ID is the D × D identity matrix.
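A small numerical sketch of this closed-form solution (Python/NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, lam = 40, 10, 0.1
X = rng.uniform(0, 2 * np.pi, size=n)
Y = np.sin(2 * X) + 0.1 * rng.standard_normal(n)
Phi = np.column_stack([np.sin(k * X) for k in range(1, D + 1)])

# Ridge solution w = (Phi^t Phi + n*lam*I_D)^{-1} Phi^t Y,
# computed with solve() instead of an explicit matrix inverse.
w_ridge = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(D), Phi.T @ Y)

# For comparison: plain least squares (lam = 0)
w_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
print(np.linalg.norm(w_ridge), "<=", np.linalg.norm(w_ls))   # ridge shrinks the coefficients
```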


Solution (2)
Proof.
I Objective function is Obj(w) := (1/n) kY − Φwk2 + λkwk2 .
I Note that this function is convex.
I Take the derivative with respect to w and set it to 0:

grad(Obj)(w) = −(2/n) Φt (Y − Φw) + 2λw = 0
=⇒ (Φt Φ + nλ ID) wn,λ = Φt Y

I It is straightforward to see that the matrix (Φt Φ + nλID ) has
full rank whenever λ > 0 (see below). So we can take the
inverse, and the theorem follows as in the standard
L2 -regression case.
Solution (3)
Proof that (Φt Φ + nλID ) is invertible:
I The matrix A := Φt Φ is symmetric, hence we can decompose
it into eigenvalues: A = V t ΛV where Λ is the diagonal matrix
with all eigenvalues of A
I Because of the special form A := Φt Φ, all eigenvalues are ≥ 0
(the matrix is positive semi-definite).
I A has full rank (is invertible) iff all its eigenvalues are > 0.
I σ is an eigenvalue of A with eigenvector v ⇐⇒ σ + λ is an


eigenvalue of A + λI. Reason:

(A + λI)v = Av + λv = σv + λv = (σ + λ)v
I If λ > 0, then all eigenvalues of A + λI are > 0:
σ ≥ 0 and λ > 0 implies σ + λ > 0
I So we know that A + λI has full rank and is invertible.
Example (by Matthias Hein)

I True function: periodic function + noise


I Basis functions Φ: first 10 Fourier basis functions

x ↦ sin(kx), (k = 1, ..., 10)

So we want to determine the coefficients wk for a function of
the form

f (x) = \sum_{k=1}^{10} wk sin(kx)
Example (by Matthias Hein) (2)

[Figure: least squares fit (red) and ridge regression fit (blue). Observe that the solution of ridge regression is biased towards ...]
Choice of the parameter λ


QUESTION: WHAT IS THE ROLE OF λ? WHAT HAPPENS TO
ESTIMATION AND APPROXIMATION ERROR IF IT IS HIGH /
LOW?
Choice of the parameter λ (2)


Example (from Bishop’s book):
Left: results for decreasing amount of regularization
Right: True curve (green), average estimated curve (red)
[Figure 3.5 from Bishop: fits for ln λ = 2.6, ln λ = −0.31, ln λ = −2.4]
Figure 3.5 Illustration of the dependence of bias and variance on model complexity, governed by a regulariza-
tion parameter λ, using the sinusoidal data set from Chapter 1. There are L = 100 data sets, each having N = 25
data points, and there are 24 Gaussian basis functions in the model so that the total number of parameters is
M = 25 including the bias parameter. The left column shows the result of fitting the model to the data sets for
various values of ln λ (for clarity, only 20 of the 100 fits are shown). The right column shows the corresponding
average of the 100 fits (red) along with the sinusoidal function from which the data sets were generated (green).
Choice of the parameter λ (3)


Same example, bias-variance decomposition:
Larger regularization constant λ leads to less complex functions:

[Figure (from Bishop): squared bias, variance, their sum, and the test error
plotted against ln λ; the minimum of (bias)2 + variance occurs close to
the value of ln λ that gives the minimum test error.]
(*) Ridge regression as shrinkage method


Geometric interpretation of regularization via SVD:
I Consider the Singular Value Decomposition of the matrix
Φ ∈ Rn×d :

Φ = U ΣV t

I Plugging this into the formula for wn,λ leads to


wn,λ = ... = V diag( σj / (σj^2 + λ) ) U t Y

I Standard least squares regression (without regularization)
corresponds to λ = 0, and the fraction satisfies
σj / (σj^2 + λ) = 1/σj
(*) Ridge regression as shrinkage method (2)


I Regularized case:
I Case σi large: not much difference to the non-regularized case:
σj / (σj^2 + λ) ≈ 1/σj
I Case σi small: here it makes a lot of difference whether we
have σi^2 or σi^2 + λ in the denominator. In particular,
σj / (σj^2 + λ) ≪ 1/σj
This means that the regularization “shrinks” the directions of
small variance. Intuitively, these are the directions that mainly
contain noise, no signal.
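A quick numerical check of this SVD view (NumPy assumed; here λ denotes the coefficient multiplying the identity in Φt Φ + λI, i.e. the factor nλ from the earlier theorem is absorbed into λ):

```python
import numpy as np

rng = np.random.default_rng(2)
n, D, lam = 30, 5, 0.5
Phi = rng.standard_normal((n, D))
Y = rng.standard_normal(n)

# Direct ridge solution: (Phi^t Phi + lam * I)^{-1} Phi^t Y
w_direct = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ Y)

# The same solution via the SVD Phi = U Sigma V^t and the shrinkage factors s/(s^2 + lam):
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
w_svd = Vt.T @ (np.diag(s / (s**2 + lam)) @ (U.T @ Y))

print(np.allclose(w_direct, w_svd))   # True: both formulas agree
```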
(*) Ridge regression as shrinkage method (3)


In statistics, related methods are often called “shrinkage methods”
(because we try to “shrink” the weights wi ).

From a statistics point of view, they can be justified by what is


called “Stein’s paradox” (discovered in the 1950s). Essentially,
this paradox says that if we want to estimate at least three
parameters jointly, then it is better to “shrink” them. Here is a
simple example:
I Assume you want to estimate the mean of a normal
distribution N (Θ, I) in Rd , d ≥ 3.
I Assume we have just a single data point X ∈ Rd from this
distribution.
I Standard least squares estimator: Θ̂LS = X.
(*) Ridge regression as shrinkage method (4)


I Now consider the following “shrinkage estimator” (it is called
the James-Stein estimator): Θ̂JS = (1 − (d−2)/kXk2 ) X.
I One can prove that it always outperforms the standard least
squares estimator:
squares error:

E(kΘ − Θ̂LS k) ≥ E(kΘ − Θ̂JS k)


Read it on wikipedia if you are interested ,


History and Terminology


I Invented by Andrey Tikhonov, 1943, in the context of integral
equations.
Original publication: Tikhonov, Andrey Nikolayevich. On the
stability of inverse problems. Doklady Akademii Nauk SSSR,
1943
This type of regularization is often called Tikhonov
regularization after its inventor.


I Introduced in statistics literature in the following paper:
Hoerl and Kennard. Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 1970.
History and Terminology (2)


I Originally, the intention was to make the solution of the least
squares problem more stable and to achieve a unique solution.
I Replace the matrix Φt Φ in the least squares solution by the
matrix Φt Φ + λId.
I This is where the name “ridge” comes from (we add a little
“ridge” on the diagonal of the matrix).
[Figure: surface plots of the matrix Xt X and of the matrix Xt X + λ Id]

I The regularization interpretation we described above is more


recent.
Summary: Ridge regression


I Regression problem, X = Rd , Y = R
I Loss function: L2 -loss
I Function class: linear functions parameterized by w
F := {fw : Rd → R, fw (x) = hw, xi; w ∈ Rd }
I Regularizer: Ω(fw ) = kwk2
I Finding the function that minimizes the regularized risk is a
convex optimization problem, and we can compute its solution
analytically.
Lasso: least squares with L1-regularization


Books:
• Hastie/Tibshirani/Friedman, Section 3.4.3;
• Bishop Section 3
• Hastie/Tibshirani/Wainwright, Section 2

Original paper: Tibshirani: Regression shrinkage and selection via


the lasso. J. Royal. Statist. Soc. B, 1996
Sparsity
I Consider the setting of linear regression with basis functions
Φ1 , ..., ΦD .
I It is very desirable to obtain a solution function fn := \sum_i wi Φi
for which many of the coefficients wi are zero. Such a solution
is called “sparse”.
I Reasons:
I Computational reasons: even if we have many basis functions,


we just need to evaluate few of them.
I Interpretability of the solution
A naive regularizer for sparsity

QUESTION: WHAT WOULD BE A GOOD REGULARIZER TO


ENFORCE SPARSITY?
A naive regularizer for sparsity (2)


Need to find a function that is small if w is sparse:

Use the regularizer


Ω0 (f ) := \sum_{i=1}^{D} 1_{wi ≠ 0} .

It directly penalizes the number of non-zero entries wi .

HOWEVER, USING THIS REGULARIZER IS NOT A GOOD


IDEA. WHY?
A naive regularizer for sparsity (3)


Ω0 is a discrete function, and optimizing discrete functions is
typically NP hard.
Excursion: p-norms
I For p > 0, define for a vector w ∈ RD
kwkp := ( \sum_{i=1}^{D} |wi|^p )^{1/p} .

I For any p ≥ 1, this is a norm and as such a convex function. It
is called the p-norm.


I for 0 < p < 1, it is not a norm (exercise!) and also not convex
(exercise, and see figure on next slide)
Excursion: p-norms (2)


Unit spheres of p-balls for different values of p (e.g., the red line is
the set of points w ∈ R2 for which kwk2 = 1).
(Image by Matthias Hein)


Excursion: p-norms (3)


For p = 0, we can define the function
kwk0 := lim_{p→0} kwkp^p = lim_{p→0} \sum_{i=1}^{d} |wi|^p = \sum_{i=1}^{d} |wi|^0

(Note that we take the limit of kwkp^p , not of kwkp ).

This is not a norm (it does not even satisfy the homogeneity
condition kaxk = |a| kxk), but it is still called zero-norm in the
literature.

It coincides with our regularizer: if we define 0^0 = 0 and recall that
a^0 = 1 (for a ≠ 0), we get

kwk0 = \sum_{i=1}^{d} |wi|^0 = \sum_{i=1}^{d} 1_{wi ≠ 0} = Ω0 (f )
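A tiny numerical illustration of these claims (Python/NumPy assumed):

```python
import numpy as np

def p_norm(w, p):
    """(sum_i |w_i|^p)^(1/p); a norm only for p >= 1."""
    return np.sum(np.abs(w) ** p) ** (1 / p)

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# For p = 0.5 the triangle inequality fails, so this is not a norm:
print(p_norm(u + v, 0.5), ">", p_norm(u, 0.5) + p_norm(v, 0.5))   # 4.0 > 2.0

# kwk_p^p approaches the number of non-zero entries as p -> 0:
w = np.array([3.0, 0.0, -0.2, 0.0, 1.5])
for p in [1.0, 0.1, 0.01]:
    print(p, np.sum(np.abs(w) ** p))   # tends to 3 = number of non-zero entries
```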
Sparsity and the L1-norm


We now want to settle for kwk1 as a regularizer: It is “as close” to
the non-convex regularizer kwk0 as possible while still being convex.

Question: Does it still tend to give sparse solutions?

Answer is yes, see the illustration on the next slide:



Sparsity and the L1-norm (2)


Illustration: Assume we restrict the search to functions with
kwk ≤ const. The blue cross shows the best solution
w = (w1 , w2 )t . It is not sparse for L2 , but sparse for L1 -norm
regularization.
Sparsity and the L1-norm (3)


Another intuitive argument why solutions with L1 -regularization
might be sparser than L2 -regularization:
I The L2 -norm puts a particularly large penalty on large
coefficients wi . That is, to avoid a large L2 -penalty, it is better
to have many small wi that are all non-zero than to have most
wi equal to 0 and a couple of large wi .
I The L1 -norm at least does not have this “preference” for many
small weights. It punishes all weights linearly, not quadratically,
and thus can afford to have a large weight if at the same time
many small weights disappear.
The Lasso
Consider the following regularization problem:
I Input space X arbitrary, output space Y = R.
I Fix a set of basis functions Φ1 , ..., ΦD : X → R
I As function space choose all functions of the form
f (x) = \sum_i wi Φi (x).
I As regularizer use Ω(f ) := kwk1 = \sum_{i=1}^{D} |wi | . Choose a
regularization constant λ > 0.
I Then solve the problem

wn,λ := argmin_{w ∈ RD} (1/n) kY − Φwk_2^2 + λkwk1 .
Solution of the Lasso problem


I The Lasso objective function is convex (it is a sum of two
convex functions).
I However, there does not exist a closed form solution.
I Hence it has to be solved by a standard algorithm for convex
optimization.
I In general, any convex solver can be used, but might be slow.
I Observing that the problem can be recast as a quadratic


problem might help already.
I But many faster approaches exist, for example coordinate
descent algorithms. We are not going to discuss them in the
lecture.
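For illustration, a hedged sketch using scikit-learn (assuming sklearn is available; note that its Lasso objective is parameterized as (1/(2n)) kY − Φwk2 + α kwk1 , so α plays the role of λ up to a constant factor):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, D = 100, 20
Phi = rng.standard_normal((n, D))
w_true = np.zeros(D)
w_true[[0, 3, 7]] = [2.0, -1.5, 1.0]           # only 3 relevant basis functions
Y = Phi @ w_true + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1)                       # alpha ~ regularization strength
lasso.fit(Phi, Y)
print(np.flatnonzero(lasso.coef_))             # typically recovers {0, 3, 7}
print(np.round(lasso.coef_, 2))                # most coefficients are exactly 0
```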
Example

(Figure by Matthias Hein)


Example (2)

(Figure by Matthias Hein)


History
I The name LASSO stands for “least absolute shrinkage and
selection operator”
I First invented by Tibshirani: Regression shrinkage and
selection via the lasso. J. Royal. Statist. Soc. B, 1996
I For a short retrospective and some important literature
pointers, see Tibshirani: Regression shrinkage and selection via
the lasso: a retrospective. J. R. Statist. Soc. B (2011)


Summary: the Lasso


I Regression problem, X arbitrary space, Y = R
I Loss function: L2 -loss
I Function class F: a linear combination of a fixed set of basis
functions.
I Regularizer: L1 -norm kwk1 to enforce sparsity.
I Convex optimization problem, no analytic solution, but
efficient solvers exist.
(∗) Probabilistic interpretation of linear regression

The following slides just provide a sketch. If you want to know
more or see exact formulas, please read this book chapter:

Kevin Murphy: Machine Learning, a probabilistic perspective,
Chapter 7
Linear regression: ERM = maximum likelihood


I Assume the following probabilistic setup: the data is generated
by the following linear model:

Y = Xw + noise

where w is unknown and the noise follows a (d-dim) normal
distribution N (0, σ^2 I) (σ unknown “meta-parameter”,
considered fixed):

Y |X, w ∼ N (Xw, σ^2 I)
Linear regression: ERM = maximum likelihood (2)

I Maximum likelihood framework: want to find the parameter w
such that the likelihood of the observations is maximized:

max_w P (Y |X, w)
⇐⇒ max_w exp(−kY − Xwk2 /σ^2 )
⇐⇒ min_w kY − Xwk2

That is: Maximum likelihood regression with a Gaussian noise


model corresponds to ERM with the L2 loss function.
Linear regression: RRM = Bayesian MAP


I Assume that the observations are generated as above, but
additionally assume that we have a prior distribution over the
parameter w:

Y |X, w ∼ N (Xw, σ 2 I) and w ∼ N (0, τ 2 I)


I Bayesian maximum a posteriori approach (MAP): choose w
that maximizes the posterior probability:

P (w|X, Y ) = P (Y |X, w) P (w) / P (Y |X)
I Writing down all formulas, can see: Leads to ridge regression
(with tradeoff constant λ = σ^2 /τ^2 ):

min_w kXw − Y k2 + λkwk2
More generally: Bayesian interpretation of ERM and RRM
I The noise model in the probabilistic setup corresponds to the
choice of a loss function in the ERM framework.
I The prior distribution of the parameter in the Bayesian model
corresponds to a particular choice of regularizer in RRM.
Examples:
I If the data contains many outliers, one chooses a Laplace noise
model (rather than a Gaussian one), i.e., the noise density is
proportional to exp(−| · |/τ ). This then leads to the L1 -loss function

(1/n) \sum_i |Yi − Ŷi |
More generally: Bayesian interpretation of ERM and RRM (2)
I Similarly, if we use a Laplace prior instead of a normal prior for
the parameter, we end with Lasso regularization instead of
Tikhonov/Ridge regression.
Feature normalization
In practice: normalization
In regularized regression, it makes a difference how we scale our
data. Example:
I Body height measured in mm or cm or even km
Different scales lead to different solutions, because they affect the
regularization in a different way (WHY???)
Moreover, we typically want all coordinates to have “the same


amount of influence” on the solution. This is not the case if our
measurements have completely different orders of magnitude (for
example, one coordinate is “body height in mm” and one is “shoe
size”).
In practice: normalization (2)


In order to make sure that all basis functions “are treated the
same” it is thus recommended to standardize your data:
1. Centering:
Replace Φi by Φi^centered := Φi − Φ̄i with Φ̄i := (1/n) \sum_{j=1}^{n} Φi (Xj ).
2. Normalizing the variance: rescale each basis function such that
it has unit L2 -norm (variance) on the training data:

Φi^rescaled := Φi^centered / ( \sum_{j=1}^{n} Φi^centered (Xj )^2 )^{1/2}

In terms of the matrix Φ: you center and normalize the columns of


the matrix to have center 0 and unit norm.
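In matrix form this is two lines of NumPy (a sketch; dividing by the column norm, as here, versus the standard deviation is a matter of convention):

```python
import numpy as np

def standardize_columns(Phi):
    """Center each column and rescale it to unit Euclidean norm."""
    Phi_centered = Phi - Phi.mean(axis=0)                  # centering
    norms = np.linalg.norm(Phi_centered, axis=0)
    return Phi_centered / np.where(norms > 0, norms, 1)    # avoid division by zero

Phi = np.array([[1800.0, 42], [1650.0, 38], [1710.0, 45]])  # e.g. height in mm, shoe size
Phi_std = standardize_columns(Phi)
print(np.linalg.norm(Phi_std, axis=0))   # each column now has norm 1
```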
Selecting parameters by cross validation


Cross validation - purpose


In all machine learning algorithms, we have to set parameters or
make design decisions:
I Regularization parameter in ridge regression or Lasso
I Parameter C of the SVM
I Parameter σ in the Gaussian kernel
I Number of principal components in PCA
I But you also might want to figure out whether certain design
choices make sense, for example whether it is useful to remove
outliers in the beginning or not.
It is very important that all these choices are made appropriately.
Cross validation is the method of choice for doing that.
K-fold cross validation


1 INPUT: Training points (Xi , Yi )i=1,...,n , a set S of different
parameter combinations.
2 Partition the training set into K parts that are equally large.
These parts are called “folds”.
3 for all choices of parameters s ∈ S
4 for k = 1, ..., K
5 Build one training set out of folds 1, ..., k − 1, k + 1, ..., K
and train with parameters s.
6 Compute the validation error err(s, k) on fold k
7 Compute the average validation error over the folds:
err(s) = \sum_{k=1}^{K} err(s, k)/K.
8 Select the parameter combination s that leads to the best
validation error: s∗ = argmin_{s∈S} err(s).
9 OUTPUT: s∗
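A compact sketch of this procedure in plain NumPy (the train and error callables and the parameter grid below are placeholders for whatever model and loss you are tuning; here they implement ridge regression):

```python
import numpy as np

def k_fold_cv(X, Y, params, train, error, K=5, seed=0):
    """Return the parameter value with the smallest average validation error."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(Y)), K)
    avg_err = {}
    for s in params:                                    # every parameter combination
        errs = []
        for k in range(K):
            val = folds[k]
            tr = np.concatenate([folds[j] for j in range(K) if j != k])
            model = train(X[tr], Y[tr], s)              # train on K-1 folds
            errs.append(error(model, X[val], Y[val]))   # validate on the held-out fold
        avg_err[s] = np.mean(errs)
    return min(avg_err, key=avg_err.get)

# Example: tuning the ridge constant lam on a logarithmic grid
train = lambda X, Y, lam: np.linalg.solve(X.T @ X + len(Y) * lam * np.eye(X.shape[1]), X.T @ Y)
error = lambda w, X, Y: np.mean((X @ w - Y) ** 2)
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
Y = X[:, 0] - 2 * X[:, 3] + 0.5 * rng.standard_normal(200)
print("selected lambda:", k_fold_cv(X, Y, [10.0 ** e for e in range(-3, 3)], train, error))
```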
K-fold cross validation (2)


K-fold cross validation (3)


I Once you selected the parameter combination s∗ , you train
your classifier a final time on the whole training set. Then you
use a completely new test set to compute the test error.
K-fold cross validation (4)


I Never, never use your test set in the validation phase. As soon
as the test points enter the learning algorithm in any way, they
can no longer be used to compute a test error. The test set
must not be used in training in any way!
I In particular: you are NOT ALLOWED to first train using
cross validation, then compute the test error, realize that it is
not good, then train again until the test error gets better. As
soon as you try to “improve the test error”, the test data
effectively gets part of the training procedure and is spoiled.
K-fold cross validation (5)


What number of folds K?

Not so critical, often people use 5 or 10.


K-fold cross validation (6)


How to choose the set S?
I If you just have to tune one parameter, say the regularization
constant λ. Then choose λ on a logspace, say
λ ∈ {10−3 , 10−2 , ...103 }.
I If you have to choose two parameters, say C and the kernel
width σ, define, say, SC = {10−2 , 10−1 , ...105 },
Sσ = {10−2 , 10−1 , ..., 103 }, and then choose S = SC × Sσ .
That is, you have to try every parameter combination!


I You can already guess that if we have more parameters, then
this is going to become tricky. Here you might want to run several
cross validations, say first choose C and σ (jointly) and fix
them. Then choose the number of principal components, etc.
I Note that overfitting can also happen for cross-validation!
K-fold cross validation (7)


I There are also some advanced methods to “walk in the
parameter space” (the idea is to try something like a gradient
descent in the space of parameters).
Advantages and disadvantages


Disadvantages of cross validation:
I Computationally expensive!!! In particular, if you have many
parameters to tune, not just one or two.
I Note that the training size of the problems used in the
individual cross validation training runs is n · (K − 1)/K. If the
sample size is small, then the parameters tuned on the smaller
folds might not be the best ones on the whole data set
(because the latter is larger).
I It is very difficult to prove theoretical statements that relate
the cross-validation error to the test error (due to the high
dependency between the training runs). In particular, the CV
error is not unbiased, it tends to underestimate the test error.

Further reading: Y. Yang. Comparing learning methods for


classification. Statistica Sinica, 2006. and references therein.
Advantages and disadvantages (2)


Advantages:

There is no other, systematic method to choose parameters in a


useful way.

Always, always, always do cross validation!!! Make sure the final


test set is never touched while training (retraining for improving the
test error is not allowed, then the data is spoiled).
Linear methods for classification


Intuition
Given:
I We assume that our data lives in Rd (perhaps, through a
feature space representation).
I Want to solve a classification problem with input space
X = Rd and output space Y = {±1} (for simplicity we focus
on the two-class case for now).
Intuition (2)
I Idea is to separate the two classes by a linear function:
Hyperplanes in Rd
Now let’s consider linear classification with hyperplanes.
I A hyperplane in Rd has the form

H = {x ∈ Rd | hw, xi + b = 0}

where w ∈ Rd is the normal vector of the hyperplane and b the


offset.
Classification using hyperplanes


To decide whether a point lies on the right or left side of a
hyperplane, we use the decision function

sign(hw, xi + b) ∈ {±1}

Note that it is a convenient convention to use the class labels +1


and −1 (because we can then simply use the sign function).
Projection interpretation
Here is another way to interpret classification by hyperplanes:
I The function hw, xi projects the points x on a real line in the
direction of the normal vector of the hyperplane.
I The term b shifts them along this line.
I Then we look at the sign of the result and classify by the sign
Loss functions for classification


There exist quite a number of loss functions that are used in
classification:

We are now going to see a number of basic approaches for linear


classification based on various loss functions:
I Linear discriminant analysis (least squares loss)

I Logistic regression (logistic loss)


I Linear support vector machines (hinge loss)


Linear discriminant analysis


Literature:
I Hastie/Tibshirani/Friedman Sec. 4.3

I Duda / Hart
LDA: Geometric motivation


Different projections: which one is better for classification?
LDA: Geometric motivation (2)


LDA: Geometric motivation (3)


I Linear classification amounts to a one-dimensional projection.
I LDA: Chooses the projection direction w such that ...
I The class centers are as far away from each other as possible
I The variance within each class is as small as possible.
LDA: Geometric motivation (4)


LDA: Geometric motivation (5)


Observe the different roles of w and b:

Step 1: finding a good separating direction → w


I All the intuition above is about how to find a good direction w
(neither the separation of the two classes nor their variances
are affected by b).
I So the first step of LDA will be to find a good direction w.
Step 2: given w, decide where to cut → b


I The parameter b only influences where we “cut” the two
classes, after we projected them on w.
I The best parameter b is thus selected only once we know w.

Note: the label information is used in both steps!


Formally: the Fisher criterion


Define the following quantities for class +1:
I Let n+ be the number of points in class 1

I Define the center of class 1 as

m+ := (1/n+) \sum_{i | Yi =+1} Xi ∈ Rd

Note that after projecting on w, the mean is given as hw, m+ i.
I Define the within-class variance after projecting on w as

σ^2_{w,+} := (1/n+) \sum_{i | Yi =+1} ( hw, Xi i − hw, m+ i )^2

Make the analogous definitions for class −1: ...
Formally: the Fisher criterion (2)


Now define the Fisher criterion as

J(w) = hw, m+ − m− i^2 / ( σ^2_{w,+} + σ^2_{w,−} ) .

The idea of linear discriminant analysis is now to select
w ∈ Rd such that the Fisher criterion is maximized.
Fisher criterion in matrix form


We can write the Fisher criterion in matrix form as follows:
I Define the between-class scatter matrix as

CB := (m+ − m− )(m+ − m− )t ∈ Rd×d


I Define the total within-class scatter matrix as

CW := (1/n+) \sum_{i | Yi =+1} (Xi − m+ )(Xi − m+ )t
    + (1/n−) \sum_{i | Yi =−1} (Xi − m− )(Xi − m− )t

I The Fisher criterion can now be rewritten as

J(w) = hw, CB wi / hw, CW wi
Solution vector w
Proposition 9 (Solution vector w∗ of LDA)
If the matrix CW is invertible, then the optimal solution of the
problem w∗ := argmaxw∈Rd J(w) is given by

w∗ = (CW )−1 (m+ − m− ).


Remark: it can happen that CW is not invertible (in particular, if


d > n. WHY?).
In this case, one can resort to the pseudo-inverse.
Proof (sketch).
I Take the derivative:

∂J/∂w (w) = 2 ( CB w hw, CW wi − CW w hw, CB wi ) / hw, CW wi^2
Solution vector w (2)


I Set it to 0:
CB w · ( hw, CW wi / hw, CB wi ) = CW w

I Rewrite (plug in the definition of CB ):

(m+ − m− ) · (m+ − m− )t w · ( hw, CW wi / hw, CB wi ) = CW w

where both (m+ − m− )t w and hw, CW wi / hw, CB wi are scalars in R.

I Additionally, observe that J(w) is invariant under rescaling of
w, that is J(w) = J(αw) for α ≠ 0.
I So the solution is
w∗ ∝ (CW )−1 (m+ − m− )
I We can check that the Hessian of J(w) at w∗ is negative
definite, so w∗ is indeed a maximum. ,
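A direct implementation sketch of this solution (NumPy assumed; the offset b is chosen here by the simple heuristic of cutting halfway between the projected class means, which is one common choice rather than the error-minimizing b discussed on the next slide):

```python
import numpy as np

def lda(X, Y):
    """LDA direction w = C_W^{-1}(m_+ - m_-), offset b halfway between projected means."""
    Xp, Xm = X[Y == 1], X[Y == -1]
    m_plus, m_minus = Xp.mean(axis=0), Xm.mean(axis=0)
    C_W = np.cov(Xp, rowvar=False, bias=True) + np.cov(Xm, rowvar=False, bias=True)
    w = np.linalg.solve(C_W, m_plus - m_minus)     # use pinv(C_W) if C_W is singular
    b = -0.5 * (w @ m_plus + w @ m_minus)
    return w, b

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([2, 0], 1.0, (100, 2)), rng.normal([-2, 0], 1.0, (100, 2))])
Y = np.hstack([np.ones(100), -np.ones(100)])
w, b = lda(X, Y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == Y))
```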
Determining b
So far, we only discussed how to find the normal vector w. How do
we set the offset b? (Recall that the hyperplane is hw, xi + b).

The standard is to choose b, once w is known, as to minimize the


training error.
LDA, alternative motivation by ERM


We can also start with the ERM framework and make the following
assumptions:
I As function class we use the affine linear functions as above:

F = {f (x) = hw, xi + b; w ∈ Rd , b ∈ R}

I As loss function we use the squared loss between the
real-valued output (!) of the function f (x) and the actual
class labels:

ℓ(X, Y, f (X)) = (Y − f (X))^2

I No assumption on the underlying distributions.

Then we can prove the following nice theorem:


LDA, alternative motivation by ERM (2)


Theorem 10 (LDA as ERM)
Consider the following two optimization problems:
(1) Minimizing the least squares loss of affine linear functions:
(w′, b′) := argmin_{w∈Rd ,b∈R} \sum_{i=1}^{n} (Yi − hw, Xi i − b)^2

(2) The LDA problem:

w∗ = argmax_{w∈Rd} J(w)

Then the solutions w′ and w∗ coincide up to a constant, in
particular they correspond to the same hyperplane.
LDA, alternative motivation by ERM (3)

Proof: skipped.
LDA, alternative motivation by ERM (4)


Comments:
I Note: The least squares loss in problem (1) of the theorem is
with respect to hw, Xi i + b, not with respect to the sign of this
expression (which is what we are ultimately interested in):

(Yi − (hw, Xi i + b))^2    versus    (Yi − sign(hw, Xi i + b))^2

where hw, Xi i + b ∈ R and sign(hw, Xi i + b) ∈ {±1}.
LDA, motivation by Bayesian decision theory


Let us make the following assumptions:
I The two class conditional distributions P (X|Y = 1) and
P (X|Y = −1) follow a multivariate normal distribution with
the same covariance matrix, but different means
I Classes have equal prior weights, that is
P (Y = 1) = P (Y = −1) = 0.5.
Then we can argue as follows:
I Bayesian decision theory: under these assumptions the optimal
classifier selects according to whether
P (Y = 1 | X = x) > P (Y = −1 | X = x).
I Equivalently:
log( P (Y = 1 | X = x) / P (Y = −1 | X = x) ) > 0.
LDA, motivation by Bayesian decision theory (2)


I If we compute this term for the normal distributions, one can
see that the decision boundary between the two classes (i.e.,
the set where both classes have equal posteriors) is a
hyperplane, and coincides with the LDA solution.
I Details skipped, see Hastie/Tibshirani/Friedman for details.
LDA, motivation by Bayesian decision theory (3)


Insights:
I Under the given assumptions (normal distributions, same
weights, same covariance, etc), LDA should work nicely!
I We can also suspect that it does not do such a good job if the
assumptions are not satisfied.
Limitations and generalizations


I LDA does not work well if the classes are not “blobs”
I LDA does not work well if the variance of the two classes is
very different from each other (remember, in the derivation of
LDA based on Gaussian distributions we assumed equal
variance for both classes).
Limitations and generalizations (2)


Generalizations:
I LDA tends to overfit (so far, we do not regularize). There also
exist regularized versions, we’ll skip it.
I LDA can be generalized to multiclass problems as well. We’ll
skip it.
History
I A variant of this was first published by R. Fisher in 1936:
Fisher, R. A. (1936). The Use of Multiple Measurements in
Taxonomic Problems. Annals of Eugenics 7 (2): 179–188.
I LDA goes under various names: Linear discriminant analysis,
Fisher’s linear discriminant.
I R. Fisher is THE founder of modern statistics (design of
experiments, analysis of variance, maximum likelihood,


sufficient statistics, randomized tests, ... )
Summary: Linear discriminant analysis (LDA)


Three different motivations:
I Geometric motivation: project in a direction that separates the
classes well
I ERM motivation: minimize the least squares loss on space of
linear functions
I Model-based (probabilistic) motivation: Bayes classifier under
assumption of normal distributions with equal variances


All the motivations lead to the same algorithm:
I Minimize the Fisher criterion

I Can compute solution vector w analytically


Logistic regression
Literature: Hastie/Tibshirani/Friedman Section 4.4
For the probabilistic point of view, see Chapter 8 in Murphy
Logistic regression problem as ERM


I Want to solve classification on Rd with linear functions:
I Given Xi ∈ Rd , Yi ∈ {±1}.
I F = {f (x) = hw, xi + b; w ∈ Rd , b ∈ R}
I Use ERM
I Using L2 -loss corresponds to Linear Discriminant Analysis
(LDA)
I Now: use the logistic loss function:


Logistic regression problem as ERM (2)

`(X, f (X), Y ) = log2 (1 + exp(−Y f (X)))


I It already starts to “punish” if points are still on the correct


side of the hyperplane, but get close to it.
I Once on the wrong side, it punishes “moderately” (close to
linear)
Logistic regression problem as ERM (3)


Computing the ERM solution


Consider the problem of finding the best linear function under the
logistic loss in the ERM setting.
I There is no closed form solution for this problem.
I Good news: the logistic loss function is convex.
This can be proved by showing that the Hessian matrix is
positive definite.
I So we can use our favorite convex solver to obtain the logistic


regression solution.
I The standard technique in this case is the Newton-Raphson
algorithm, but we won’t discuss the details.

But why would someone come up with the logistic loss function???

The answer comes from the following Bayesian approach to


classification.
The logistic model


I We do NOT make a full model of the joint probability
distribution P (x, y) or the class conditional distributions
P (y|x) (generative approach).
I We just specify a model for the conditional posterior
distributions (discriminative approach):
P (Y = y | X = x) = 1 / (1 + exp(−y f (x)))

with f (x) = hw, xi + b. Here w and b are the parameters. The


function 1/(1 + exp(−t)) is called the logistic function and
looks as follows:
The logistic model (2)


The logistic model (3)


I Intuition:
Consider the projection scenario. Instead of a hard threshold
(left = one class, right = other class) we have a smooth
transition of the probability.
The logistic model (4)


The actual value of f (x) tells “how far” we are from the
decision surface, that is “how sure” the classifier is about this
class. A predicted probability P (Y = y | X = x) ≈ 0.5 means
that the classifier does not really know by itself; a probability
close to 0 or 1 means that the classifier is “pretty sure”.
The logistic model (5)


I Maximizing P (Y = y | X = x) = 1/(1 + exp(−yf (x)))
corresponds to minimizing the following loss function:

`(X, f (X), Y ) = log(1 + exp(−Y f (X)))

This is the logistic loss function.


Note that the logistic loss also punishes points that are correctly
classified but are “too close” to the hyperplane.

For such points, the classifier “is not sure”, but ideally we would
like to find a classifier that is “pretty sure” about which side each
point belongs to.
The logistic model (6)


Decision function:
P (Y = circle|X = x, b) = 1/(1 + exp(−yf (x)))
Loss incurred for the respective decisions: logistic
Adding regularization
I As in linear regression, we can now use regularization to avoid
overfitting.
I For example, we could use Ω(f ) = kwk22 (as in ridge
regression) or Ω(f ) = kwk1 (as in Lasso).
I Then regularized logistic regression minimizes

(1/n) \sum_{i=1}^{n} log( 1 + exp(−Yi hw, Xi i) ) + λ Ω(f ).

I If the regularizer is convex in w, then so is the regularized
logistic regression problem. It can be solved by standard
convex solvers.
I More specialized (more efficient) solvers exist.
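A hedged sketch with scikit-learn (assumed available); note that sklearn's LogisticRegression uses a parameter C which, like the SVM constant C later in the lecture, plays the inverse role of λ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([1.5, 0], 1.0, (100, 2)), rng.normal([-1.5, 0], 1.0, (100, 2))])
Y = np.hstack([np.ones(100), -np.ones(100)])

# L2-regularized logistic regression; C is the inverse regularization strength.
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, Y)
print(clf.coef_, clf.intercept_)          # the learned w and b
print(clf.predict_proba(X[:3]))           # columns: P(Y=-1 | x), P(Y=+1 | x)

# An L1 penalty (Lasso-style sparsity) needs a solver that supports it:
clf_l1 = LogisticRegression(penalty="l1", solver="saga", C=1.0).fit(X, Y)
```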
History of logistic regression


Very nice historic account: Cramer: The origins of logistic
regression. Tinbergen Institute Working Paper, 2002
I Dates back to the 19th century to the work of Pierre-Francois
Verhulst (published in several papers around 1845)
I Rediscovered in the 1920s by Pearl and Reed
I Many variants and adaptations (“probit” or “logit”)
Ulrike von Luxburg: Statistical Machine Learning

I In 1973, Daniel McFadden draws the connections to decision
theory; in 2000, he earns the Nobel Prize in Economic Sciences
for his development of theory and methods for analyzing
discrete choice!
Summary: logistic regression


I Loss function: logistic loss (a “smoothed” version of a step
function)
I Function class: linear
I Either pure empirical risk minimization, or regularized risk
minimization, for example with L1 - or L2 -regularizer
I Convex optimization problem, no closed form solution.
(∗) Probabilistic interpretation of linear classification

Literature: Kevin Murphy: Machine Learning, a probabilistic
perspective, Chapter 8
General idea
I In linear discriminant analysis (LDA): Minimizing the L2 loss
over the class of linear function is “the same” as finding the
Bayesian decision theory solution for the probabilistic model
with Gaussian class conditional priors, and Gaussian noise, and
uniform class prior.
I In logistic regression: Minimizing the logistic loss function over
linear functions can be interpreted as a probabilistic approach


as well.

Let’s briefly look at the general concept.


Excursion: Probabilistic interpretation of ERM


I Bayesian approach: choose f (x) according to whether
P (Y = +1 | X = x) is larger or smaller than
P (Y = −1 | X = x).
I Assume that the conditional probability P (Y = +1 | X = x)
has a certain functional form, that is it can be described by
some function f ∈ F (for some appropriate F).
I The goal is to find the function f ∈ F that “best explains our
training data”. That is, for each training point we would like
to have P (f (Xi ) = Yi ) as large as possible.
I This amounts to selecting f ∈ F by

argmax_{f ∈ F} \prod_{i=1}^{n} P (f (Xi ) = Yi )
Excursion: Probabilistic interpretation of ERM (2)
I This is equivalent to the following problem (simply take − log):
argmin_{f ∈ F} \sum_{i=1}^{n} − log P (f (Xi ) = Yi ) ,
where we define ℓ(Xi , f (Xi ), Yi ) := − log P (f (Xi ) = Yi ).

I This approach can be interpreted as empirical risk
minimization with respect to this newly defined loss function ℓ:

argmin_{f ∈ F} \sum_{i=1}^{n} ℓ(Xi , f (Xi ), Yi )
Excursion: Probabilistic interpretation of ERM (3)
What does this tell us?

Assume we start with an assumption on how the probability
distributions P (Y | X = x) look, and we follow the Bayesian
approach of selecting according to P (Y | X = x).
Then there always exists a particular loss function ` such that this
approach corresponds to ERM with this particular loss function.

Note that it does not always work the other way round (if we start
with a given loss function `, it is not always possible to construct a
corresponding model probability distribution)
Excursion: Probabilistic interpretation of ERM (4)
Why is this insight useful?

It helps to get more intuition:


I For some loss functions, we can “understand” what the
corresponding probabilistic model is. This gives insight into
when a particular approach might or might not work.
Linear discriminant analysis is an example for this: you would


not guess that the quadratic loss for a linear function class
means that we assume that classes are round blobs with the
same shape (normal distributions with the same covariance
structure).
Excursion: Probabilistic interpretation of ERM (5)
I Given a particular model, writing down the loss function helps
to understand the behavior of the classifier: What are the
errors that are punished most? So what does the classifier try
to avoid at all costs?
Linear Support vector machines


Literature:
I Schölkopf / Smola Section 7

I Shawe-Taylor / Cristianini
Intuition and primal


Prelude
The support vector machine (SVM) is the algorithm that made
machine learning its own sub-discipline of computer science; it is
one of the most important machine learning algorithms. It was
published in the late 1990s (see later for more on history).

We are going to study the linear case first. The main power of the
method comes from the “kernel trick”, which is going to make it
non-linear.
Geometric motivation
Given a set of linearly separable data points in Rd . Which
hyperplane to take???
Geometric motivation (2)


Idea: take the hyperplane with the largest distance to both classes
(“large margin”):

Why might this make sense?


Geometric motivation (3)


Why might this make sense:
I Robustness: assume our data points are noisy. If we “wiggle”
some of the points, then they are still on the same side of the
hyperplane, so the classification result is robust on the training
points.
I Later we will see: the size of the margin can be interpreted as
a regularization term. The larger the margin, the “less
complex” the corresponding function class.


Canonical hyperplane
I We are interested in a linear classifier of the form
f (x) = sign(hw, xi + b)
I Note that if we multiply w and b by the same constant a > 0,
this does not change the classifier:
sign(haw, xi + ab) = sign(a(hw, xi + b)) = sign(hw, xi + b)
I Want to remove this degree of freedom.


I For now, assume data can be perfectly separated by
hyperplane.
We say that the pair (w, b) is in canonical form with respect to the
points x1 , ..., xn if they are scaled such that
min_{i=1,...,n} |hw, xi i + b| = 1

We also say that the hyperplane is in canonical representation.


The Margin
I Let H := {x ∈ Rd | hw, xi + b = 0} be a hyperplane.
I Assume that a hyperplane correctly separates the training data.
I The margin of the hyperplane H with respect to the training
points (Xi , Yi )i=1,..,n is defined as the minimal distance of a
training point to the hyperplane:
ρ(H, X1 , ..., Xn ) := min_{i=1,...,n} d(Xi , H) := min_{i=1,...,n} min_{h∈H} kXi − hk
The Margin (2)


Proposition 11 (Margin)
For a hyperplane in canonical representation, the margin ρ can be
computed by ρ = 1/kwk.

First proof.
Observe:
I Points on the hyperplane itself satisfy hw, xi + b = 0.


(Reason: definition of the hyperplane)
I Points that sit on the margin satisfy hw, xi + b = ±1.
(Reason: canonical representation)
I Let x be the training point that is closest to the hyperplane
(that is, the one that defines the margin), and h ∈ H the
closest point on the hyperplane. Then kx − hk = ρ.
The Margin (3)


The Margin (4)


We also know that

x = h + ρ w/kwk

because the line connecting x and h is in the normal direction w


and has length ρ.
Now we build the scalar product with w and add b on both sides:

=⇒ hw, xi = hw, h + ρ w/kwk i = hw, hi + ρ kwk2 /kwk
=⇒ hw, xi + b = hw, hi + b + ρkwk    (where hw, xi + b = 1 and hw, hi + b = 0)
=⇒ ρ = 1/kwk
The Margin (5)


Alternative proof:
By definition, the margin is ρ = kX − hk. In order to compute it,
observe that

hw, xi + b = 1
hw, hi + b = 0
Subtracting these two equations and rescaling with kwk gives

hw, x − hi = 1
hw/kwk, x − hi = 1/kwk

Now the proposition follows from the fact that w and x − h point
in the same direction and w/kwk has norm 1. ,
Hard margin SVM


So here is our first formulation of the SVM optimization problem:
• Maximize the margin
• Subject to:
I all points are on the correct side of the hyperplane
I and outside the margin.

In formulas:
maximize_{w∈Rd ,b∈R}   1/kwk
subject to   Yi = sign(hw, Xi i + b)   ∀i = 1, ..., n
             |hw, Xi i + b| ≥ 1   ∀i = 1, ..., n
Hard margin SVM (2)


Usually, we consider the following equivalent optimization problem:
minimize_{w∈Rd ,b∈R}   (1/2) kwk2
subject to   Yi (hw, Xi i + b) ≥ 1   ∀i = 1, ..., n

This problem is called the (primal) hard margin SVM problem.


Hard margin SVM (3)


First remarks:
I This optimization problem is convex.

I In fact, it is a quadratic optimization problem (objective


function is quadratic, constraints are linear).
I Observe that the solution will always be a hyperplane in
canonical form. EXERCISE.
I The only reason to add constant 1/2 in front of kwk2 is for


mathematical convenience (the derivative is then w and not
2w). Sometimes we also drop it later.
Hard margin SVM (4)


However, big disadvantage:

This problem only has a solution if the data set is linearly separable,
that is there exists a hyperplane H that separates all training points
without error. This might be too strict ...
Soft margin SVM


I We want to allow for the case that the separating hyperplane
makes some errors (that is, it does not perfectly separate the
training data).
I To this end, we introduce “slack variables” ξi and consider the
following new optimization problem:
minimize_{w∈Rd ,b∈R,ξ∈Rn}   (1/2) kwk2 + (C/n) \sum_{i=1}^{n} ξi
subject to   Yi (hw, Xi i + b) ≥ 1 − ξi   ∀i = 1, ..., n
             ξi ≥ 0   ∀i = 1, ..., n
Here C is a constant that controls the tradeoff between the
two terms, see below.
This problem is called the (primal) soft margin SVM problem.
I Note that this is a convex (quadratic) problem as well.
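Since it is just a quadratic program, it can be handed to a generic convex solver. A sketch assuming the cvxpy library (variable names mirror the formulation above):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(6)
n, d, C = 100, 2, 1.0
X = np.vstack([rng.normal([1.5, 1.5], 1.0, (n // 2, d)),
               rng.normal([-1.5, -1.5], 1.0, (n // 2, d))])
Y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / n) * cp.sum(xi))
constraints = [cp.multiply(Y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                                  # the separating hyperplane
print(np.mean(np.sign(X @ w.value + b.value) == Y))      # training accuracy
```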
Soft margin SVM (2)


Interpretation:
I If ξi = 0, then the point Xi is on the correct side of the
hyperplane, outside the margin.
I If ξi ∈]0, 1[, then Xi is still on the correct side of the
hyperplane, but inside the margin.
I If ξi > 1, then Xi is on the wrong side of the hyperplane.
Note that for soft SVMs, the margin is defined implicitly (the
points on the margin are the ones that satisfy hw, xi + b = ±1).
Soft margin SVM (3)


SVM as regularized risk minimization


We want to interpret the SVM in the regularization framework:
minimize   (1/2) kwk2   +   (C/n) \sum_{i=1}^{n} ξi
          (regularization term)   (risk term)

To this end, we want to incorporate the constraints into the
objective to form a new loss function:
SVM as regularized risk minimization (2)


Consider the constraint Yi (hw, Xi i + b) ≥ 1 − ξi . Exploiting ξi ≥ 0
we can rewrite it as follows:

ξi ≥ max{0, 1 − Yi (hw, Xi i + b)}

This is a loss function, the so called Hinge loss:

`(x, y, f (x)) = max{0, 1 − yf (x)}


It looks as follows:
SVM as regularized risk minimization (3)


Comparison to other loss functions:
SVM as regularized risk minimization (4)


This loss function has a couple of interesting properties:
I It even punishes points if they have the correct label but are
too close to the decision surface (the margin).
I For points on the wrong side it increases linearly, like an
L1 -norm, not quadratic.
SVM as regularized risk minimization (5)


With this loss function, we can now interpret the soft margin SVM
as regularized risk minimization:

minimize_{w,b}   (C/n) \sum_{i=1}^{n} max{0, 1 − Yi (hw, Xi i + b)}   +   kwk2
                 (empirical risk wrt Hinge loss)                          (L2 -regularizer)
The constant C plays the “inverse role” of the regularization


constant λ we used in the previous problems (just multiply the
objective with 1/C and replace 1/C by λ).

It is a convention that we use C in SVMs, not λ ...
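A minimal sketch of this unconstrained view (plain NumPy, using a simple fixed-step subgradient descent; a serious solver would be smarter, but the point is only that the hinge-loss objective can be minimized directly):

```python
import numpy as np

def svm_subgradient(X, Y, C=1.0, steps=2000, lr=0.01):
    """Minimize (C/n) * sum_i max{0, 1 - Y_i(<w,X_i>+b)} + ||w||^2 by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        active = Y * (X @ w + b) < 1                   # points with non-zero hinge loss
        grad_w = 2 * w - (C / n) * (Y[active, None] * X[active]).sum(axis=0)
        grad_b = -(C / n) * Y[active].sum()
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

rng = np.random.default_rng(7)
X = np.vstack([rng.normal([2, 2], 1.0, (50, 2)), rng.normal([-2, -2], 1.0, (50, 2))])
Y = np.hstack([np.ones(50), -np.ones(50)])
w, b = svm_subgradient(X, Y, C=1.0)
print(np.mean(np.sign(X @ w + b) == Y))                # should be close to 1
```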


SVM as regularized risk minimization (6)


EXERCISE: what happens if C is chosen very small, what if C is
chosen very large?
Summary so far: Linear SVM (primal)


What we have seen so far:
I The linear SVM tries to maximize the margin between the two
classes.
I The hard margin SVM only considers solutions without
training errors. The soft margin SVM can trade-off margin
errors or misclassification errors with a large margin.
I Both hard and soft SVM are quadratic optimization problems


(in particular, convex).
I The soft margin SVM can be interpreted as regularized risk
minimization with the Hinge loss function and
L2 -regularization.
343
Summer 2019

Excursion: convex optimization, primal,


Lagrangian, dual
see slides 1219ff. in the appendix
Ulrike von Luxburg: Statistical Machine Learning
344
345 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Deriving the dual problem


Summer 2019

Dual of hard margin SVM


It turns out that all the important properties of SVM can only be
seen from the dual optimization problem.

So let us derive the dual problem:


Ulrike von Luxburg: Statistical Machine Learning
346
Summer 2019

Dual of hard margin SVM (2)


Primal problem (the one we start with):
1
minimizew∈Rd ,b∈R kwk2
2
subject to Yi (hw, Xi i + b) ≥ 1 ∀i = 1, ..., n

Lagrangian: we introduce one Lagrange multiplier αi ≥ 0 for


Ulrike von Luxburg: Statistical Machine Learning

each constraint and write down the Lagrangian:

n
1 X
L(w, b, α) = kwk2 − αi (Yi (hw, Xi i + b) − 1)
2 i=1
347
Summer 2019

Dual of hard margin SVM (3)


Formally, the dual problem is the following:
Dual function:

g(α) = min L(w, b, α)


w,b

Dual Problem:
Ulrike von Luxburg: Statistical Machine Learning

maximize g(α)
α
subject to αi ≥ 0, i = 1, ..., n

But this is pretty abstract, we would need to first compute the dual
function, but this seems non-trivial. We now show how to compute
g(α) explicitly. Let’s try to simplify the Lagrangian first.
348
Summer 2019

Dual of hard margin SVM (4)


Saddle point condition: We know that at the solution of the
primal, the saddle point condition has to hold:

In particular,
n
∂ X !
L(w, b, α) = − αi Yi = 0 (∗)
∂b i=1
Ulrike von Luxburg: Statistical Machine Learning

∂ X !
L(w, b, α) = w − αi Yi X i = 0 (∗∗)
∂w i
349
Summer 2019

Dual of hard margin SVM (5)


Rewrite the Lagrangian: We plug (∗) and (∗∗) in the
Lagrangian at the saddle point (w∗ , b∗ , α∗ ):
I First exploit (∗):

n
1 X
L(w, b, α) = kwk2 − αi (Yi (hw, Xi i + b) − 1)
2 i=1
1 X X X
Ulrike von Luxburg: Statistical Machine Learning

= kwk2 + αi − αi Yi hw, Xi i − b αi Yi
2 i i i
| {z }
=0 by (∗)

I Now we replace w by formula (∗∗) and get after simplification:


X 1X
L(w∗ , b∗ , α∗ ) = αi − αi αj Yi Yj hXi , Xj i
i
2 i,j

I Observe: L(w, b, α) does not depend on ω and b any more!


350
Summer 2019

Dual of hard margin SVM (6)


Dual function:
So at the saddle point (w∗ , b∗ , α∗ ), the dual function is very simple:

g(α) := min L(w, b, α)


w,b
X 1X
= αi − αi αj Yi Yj hXi , Xj i
i
2 i,j
Ulrike von Luxburg: Statistical Machine Learning

(we can drop the “minw,b ” because w, b have disappeared).


351
Summer 2019

Dual of hard margin SVM (7)


To finally write down the dual optimization problem, we have to
keep enforcing (∗) and (∗∗) (otherwise the transformation of the
Lagrangian to its simpler form is no longer valid).
I By now, (∗∗) is meaningless, because w disappeared already.
So we drop it.
I But we need to carry the condition (∗) to the dual.

So finally we end up with the dual problem of the linear hard


Ulrike von Luxburg: Statistical Machine Learning

margin SVN:
n n
X 1X
maximize αi − αi αj Yi Yj hXi , Xj i
n
α∈R
i=1
2 i,j=1
subject to αi ≥ 0 ∀i = 1, ..., n
Xn
αi Yi = 0
i=1
352
Summer 2019

Dual of the soft margin SVM


Analogously, one can derive the dual problem of the soft margin
SVM, it looks nearly the same:

n n
X 1X
maximize αi − αi αj Yi Yj hXi , Xj i
α∈Rn
i=1
2 i,j=1
Ulrike von Luxburg: Statistical Machine Learning

subject to 0 ≤ αi ≤ C/n ∀i = 1, ..., n


X n
αi Yi = 0
i=1
353
Summer 2019

Dual SVM in practice


I Given the input data, compute all the scalar products hXi , Xj i
I Solve the dual optimization problem (it is convex), this gives
you the αi .
I To compute the class label of a test point X, we need to
understand how we can recover the primal variables w and b
from the dual variables α. This works as follows.
Ulrike von Luxburg: Statistical Machine Learning
354
Summer 2019

Dual SVM in practice (2)


Recover the primal optimal variables w, b from the dual
solution α:
P
I To compute w: directly use (∗∗): w =
i αi Yi Xi
I To compute b, we need to exploit the KKT conditions of the
soft margin SVM. As we did not put all details of the soft
margin deriviation on the slides, here is the summary:
(i) αi = 0 implies that the slack variable ξi = 0 and that the
Ulrike von Luxburg: Statistical Machine Learning

point Xi is outside of the margin and correctly classified.


(ii) 0 < αi < C/n implies that the corresponding point sits
exactly on the margin. In particular, we then have
Yi (hw, Xi i + b) = 1).
(iii) αi = C/n implies that the slack variable ξi > 0. The
corresponding points sit either inside the margin (still on the
correct side) or on the wrong side of the hyperplane.
355
Summer 2019

Dual SVM in practice (3)


To compute b, we thus select a point Xj on the margin as in
case (ii) above, that is an index j with 0 < αj < C/n, and
then solve Yj (hw, Xj i + b) = 1) for b.
To increase numerical stability, we might use all such points
Xj and average the resulting values of b.
EXERCISE: USE THE LAGRANGE APPROACH TO DERIVE
THE DUAL OF THE SOFT MARGIN SVM, AND USE THE
Ulrike von Luxburg: Statistical Machine Learning

KKT CONDITIONS TO VERIFY THE THREE CASES (i)-(iii)


ABOVE.
356
Summer 2019

Dual SVM in practice (4)


Now we can evaluate the label of a test point X:

Yi = sign(hw, Xi + b)

with w and b as on the previous slide:


X
hw, Xi + b = h αi Yi Xi , Xi + b
Ulrike von Luxburg: Statistical Machine Learning

i
X  X 
= αi Yi hXi , Xi + Yj − Yi αi hXi , Xj i
i i

In practice, you don’t need to take the intermediate steps of


computing w and b, you can use this formula directly, it only
depends on the αi .
357
Summer 2019

In practice: solve the primal or the dual?


In practice: Solve the primal or dual problem?
I Because we know that for quadratic problems we have strong
duality, we could either solve the primal or the dual problem.
I The primal problem has d + 1 variables (where d is the
dimension of the space), and n constraints (where n is the
number of training points). If d is small compared to n, then it
Ulrike von Luxburg: Statistical Machine Learning

makes sense to solve the primal problem.


I The dual problem has n variables and n + 1 constraints. If d is
large compared to n, then it is better to solve the dual
problem. In most SVM libraries, this is the default.
358
359 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Important properties of SVMs


Summer 2019

Solution as linear combination


Representation of the solution: From (∗∗) we see immediately
that the solution vector w can always be
P expressed as a linear
combination of the input points: w = i αi Yi Xi . This is very
important for the kernel version of the algorithm (; representer
theorem, see later).
Ulrike von Luxburg: Statistical Machine Learning
360
Summer 2019

Support vectors
Support vector property:
I KKT conditions in the hard margin case tell us: Only Lagrange
multipliers αi that are non-zero correspond to active
constraints (the ones that are precisely met). Formally,
 
αi Yi f (Xi ) − 1 = 0
Ulrike von Luxburg: Statistical Machine Learning

A similar statement holds for the soft margin case, there the αi
are only non-zero for points on the margin, in the margin, or
on the wrong side of the margin.
I In our context: Only those αi are non-zero that correspond to
points that lie exactly on the margin, inside the margin or on
the wrong side of the hyperplane. The corresponding points
are called support vectors.
361
Summer 2019

Support vectors (2)


I So the solution can be expressed just by the coefficients of the
support vectors.
I In low-dimensional spaces this property means that we have a
sparse solution vector w. But note that sparsity is not
necessarily true in very high-dimensional spaces (then
essentially all points sit on the margin).
Ulrike von Luxburg: Statistical Machine Learning
362
Summer 2019

Scalar products
We can see that all the information about the input points Xi that
enters the optimization problem is expressed in terms of scalar
products:
I hXi , Xj i in the dual objective function

I hx, Xi i and hXi , Xj i in the evaluation of the target function


on new points
Ulrike von Luxburg: Statistical Machine Learning

This is going to be the key point to be able to apply the kernel trick.
363
Summer 2019

Exercise
It might be instructive to solve the following exercise:
Input data: x1 = (1, 0); y1 = +1; x2 = (−1, 0); y2 = −1.
Primal problem:
I Write down the hard margin primal optimization problem and
solve it using the Lagrange approach.
I Write down the soft margin primal optimization problem and
Ulrike von Luxburg: Statistical Machine Learning

solve it using the Lagrange approach.


Dual problem:
I Write down the dual hard margin optimization problem and
solve it.
I Write down the dual soft margin primal optimization problem
and solve it.
364
Summer 2019

Exercise (2)
I Use the dual solution to recover the solution of the primal
problem. Compare the values of the objective functions at the
dual and primal solution.
I Determine the support vectors.
Ulrike von Luxburg: Statistical Machine Learning
365
Summer 2019

History
I Vladimir Vapnik is the “inventor” of the SVM (and, in fact, he
laid the foundations of statistical learning theory in general).
I The hard margin SVM and the kernel trick was introduced by
Boser, Bernhard; Guyon, Isabelle; and Vapnik, Vladimir. A
training algorithm for optimal margin classifiers. Conference on
Learning Theory (COLT), 1992
Ulrike von Luxburg: Statistical Machine Learning

I This was generalized to the soft margin SVM by Cortes,


Corinna and Vapnik, Vladimir. ”Support-Vector Networks”,
Machine Learning, 20, 1995.
366
Summer 2019

Summary: linear SVM


I Input data: X = Rd , Y = {±1}
I Function class: linear functions of the form f (x) = hw, xi + b
I Want to select hyperplane as to maximize the margin
I Soft margin SVM has interpretation as regularized risk
minimization with respect to the Hinge loss and with
L2 -regularization
Ulrike von Luxburg: Statistical Machine Learning

I Is a quadratic optimization problem


I Convex duality leads to the following key properties of the
solution:
I Solution w∗ can always be expressed as linear combination of
input points
I Sparsity: only points that are on, in or on the wrong side of
the margin contribute to this linear combination
I To compute and evaluate the solution, all we need are scalar
products of input points.
367
368 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

supervised learning
Kernel methods for
Summer 2019

Positive definite kernels


Introductory literature:
I Schölkopf / Smola Section 2

I Shawe-Taylor / Cristianini Section 2 and 3


Ulrike von Luxburg: Statistical Machine Learning

For a deeper mathematical treatment of kernels see the following


book:
I Steinwart, Christmann: Support Vector Machines. Springer,
2008.
369
370 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Intuition
Summer 2019

Linear methods — disadvantages


We have seen several linear methods for regression and
classification. Even though these methods are conceptually
appealing, they have a number of disadvantages.

I Linear functions are restrictive. This can be of advantage to


avoid overfitting, but often it leads to underfitting. For
example, in classification we could not find any hyperplane to
Ulrike von Luxburg: Statistical Machine Learning

separate the following example:


1.5

0.5

−0.5

−1

−1.5
−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5
371
Summer 2019

Linear methods — disadvantages (2)


I Alternatively, we could use a feature map with basis functions
Φi to represent more complex functions, say polynomials. But:

I It is not so obvious which are ”good” basis functions.


I We need to fix the basis before we see the data. This means
that we need to have very many basis functions to be flexible.
This leads to a very high dimensional representations of our
Ulrike von Luxburg: Statistical Machine Learning

data.
372
Summer 2019

Linear methods — disadvantages (3)


The goal of kernel methods is to introduce a non-linear
component to linear methods:
Ulrike von Luxburg: Statistical Machine Learning
373
Summer 2019

Starting point: Key observation for SVMs


To run the linear support vector machine algorithm, we do not need
to compute Φ(X) explicitly — all we need to know are scalar
products of the form hΦ(Xi ), Φ(Xj )i:
I The dual objective function only contains terms of the form
hXi , Xj i, the Xi never occur “alone”.
I To evaluate the solution at the test point, again we only need
Ulrike von Luxburg: Statistical Machine Learning

to be able to compute scalar products of the input, we never


need to know coordinates of the input points.
374
Summer 2019

Starting point: Key observation for SVMs (2)


Let us be more explicit:
I Assume the data lives in Rd .

I Introduce the shorthand notation k(x, y) := hx, yi.

Then we can write the SVM optimization problem purely in terms


of the function k (the Xi never occur outside the function k):
Ulrike von Luxburg: Statistical Machine Learning

n n
X 1X
maximize αi − αi αj Yi Yj k(Xi , Xj )
n
α∈R
i=1
2 i,j=1
subject to 0 ≤ αi ≤ C/n ∀i = 1, ..., n
X n
αi Yi = 0
i=1

(and the same goes for the function that evaluates the results on
test points).
375
Summer 2019

Idea: Kernels replacing feature maps


Assume we are in a feature mapping scenario, but we know how to
compute scalar products explicitly, that is we know a function
k : X × X → R with

k(xi , xj ) = hΦ(xi ), Φ(xj )i.

The idea is that it might even be possible to avoid computing the


Ulrike von Luxburg: Statistical Machine Learning

embeddings Φ(Xi ) and compute the scalar products directly via the
function k.
376
Summer 2019

Kernel methods — the overall picture


What we want to do:
I Given points in some abstract space X

I Would like to (implicitly) embed the points into some space Rd


via a (non-linear) feature map Φ
I In that space, we use a linear method like an SVM

I Ideally, we never compute the embedding directly.


Ulrike von Luxburg: Statistical Machine Learning

I Instead we want to use a “kernel function ” to compute

k(x, y) = hΦ(x), Φ(y)i.

This approach is called the “kernel trick” and the


corresponding algorithms are called “kernel methods”.
377
Summer 2019

Kernel methods — the overall picture (2)


In the following we try to make this idea formal.
I How do the functions k need to look like? (; kernels)
I Once we have an appropriate k, what is the corresponding
feature map? (; RKHS)
Ulrike von Luxburg: Statistical Machine Learning
378
379 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Definition and properties of kernels


Summer 2019

Kernel function — definition


Let X be any space. A symmetric function k : X × X → R is
called a kernel function if for all n ≥ 1, x1 , x2 , ..., xn ∈ X and
c1 , ..., cn ∈ R we have
n
X
ci cj k(xi , xj ) ≥ 0.
i,j=1
Ulrike von Luxburg: Statistical Machine Learning

Given a set of points x1 , ..., xn , we define the corresponding kernel


matrix as the matrix K with entries kij = k(xi , xj ).

The condition above is equivalent to saying that c0 Kc ≥ 0 for all


c ∈ Rn .
380
Summer 2019

Kernel function — definition (2)


Remarks:
I It is NOT true that a function that satisfies k(x, y) ≥ 0 for all
x, y ∈ X is positive definite!!!

EXERCISE: FIND A COUNTEREXAMPLE (try to construct a


matrix with positive entries that is not pd).
Ulrike von Luxburg: Statistical Machine Learning

I In the maths literature, the above condition would be called


“positive semi-definite” (and it would be called “positive
definite” only if the inequality is strict).
381
Summer 2019

Scalar products lead to kernels


Observe:
For any mapping Φ : X → Rd the function defined (!) by

k : X × X → R, k(x, y) = hΦ(x), Φ(y)i

is a valid kernel!
Ulrike von Luxburg: Statistical Machine Learning

Proof sketch:
I Symmetry: clear

I Positive definiteness: follows from the positive definiteness of


the scalar product, EXERCISE!
382
Summer 2019

Scalar products lead to kernels (2)


In this case, the kernel matrix is given as follows:
I Let X1 , ..., Xn ∈ X be data points, Φ : X → Rd a feature map.
I Denote by Φ the n × d-matrix that contains the data points
Φ(Xi ) as rows.
I Then the matrix Φ · Φt ∈ Rn×n coincides with the
corresponding kernel matrix K with entries
Ulrike von Luxburg: Statistical Machine Learning

kij = hΦ(Xi ), Φ(Xj )i = Φ(Xi )Φ(Xj )t .


383
Summer 2019

Intuition: kernels as similarity functions


I The scalar product can be interpreted as a measure of how
similar two points are.
I We now use the same intuition for a kernel. The kernel is a
measure of how “similar” two points in the feature
space are.
Ulrike von Luxburg: Statistical Machine Learning
384
Summer 2019

Example: linear kernel


The linear kernel. The trivial kernel on Rd defined by the
standard scalar product:

k : Rd × Rd → R, k(x, y) = hx, yi

Is obviously a kernel.
Ulrike von Luxburg: Statistical Machine Learning
385
Summer 2019

Example: cosine similarity


I Assume your data lives in Rd and is normalized such that all
data points have (roughly) norm 1. Then they sit on the
hypersphere (surface of the ball of radius 1).
I Points are similar if the corresponding vectors “point to the
same direction”.
Ulrike von Luxburg: Statistical Machine Learning

I As a measure how similar the points are, we use the cosine of


the angle between the two points.
I cosine = 1 ⇐⇒ points agree
I cosine = 0 ⇐⇒ points are orthogonal
386
Summer 2019

Example: cosine similarity (2)


I If the data points are normalized, then the cosine of the angle
spanned by two points x and y is given by the scalar product
hx, yi.

Is obviously a kernel.
Ulrike von Luxburg: Statistical Machine Learning
387
Summer 2019

Example: the Gaussian kernel


On Rd , define the following kernel:

−kx − yk2
 
d d
k : R × R → R, k(x, y) = exp
2σ 2

where σ > 0 is a parameter.


Ulrike von Luxburg: Statistical Machine Learning

One can prove that this is indeed a kernel, this is not obvious at
all!!! (WHAT DO WE NEED TO PROVE?).

See the text books if you are interested.


388
Summer 2019

Example: the Gaussian kernel (2)


Induced notion of similarity:
Ulrike von Luxburg: Statistical Machine Learning

Two points are considered “very similar” if they are of distance at


most σ, “somewhat similar” if they are at distance (roughly) at
most 3σ, and “pretty dissimilar” if they are further away than that.
389
Summer 2019

Example: the Gaussian kernel (3)


Note: the Gaussian kernel is also called rbf-kernel for “radial basis
function”.
Ulrike von Luxburg: Statistical Machine Learning
390
Summer 2019

Example: polynomial kernel


X = Rd .

k(x, y) = (x0 y + c)k

where c > 0 and k ∈ N.

Not very useful for practice, but often mentioned, hence I put it on
Ulrike von Luxburg: Statistical Machine Learning

the slides.
391
Summer 2019

Example: kernels based on explicit feature maps


I Assume we explicitly constructed a feature space embedding
such as a bag-of-words representation for texts or a
bag-of-motifs representation of graphs.
I Then simply use the linear kernel in the feature space Rd .

Induced similarity functions:


Ulrike von Luxburg: Statistical Machine Learning

I books are considered “similar” if they get bought by the same


users.
I Graphs are considered “similar” if they contain the same
motifs.
I etc ...
392
Summer 2019

Example: kernel between vertices in a graph


Application scenario:
I Say we want to classify persons in a social network, whether
they prefer samsung or apple phones. All we know about the
persons are their friendships.
I We consider people as “similar” if they have similar sets of
friends. We want to encode this notion of similarity by kernel
Ulrike von Luxburg: Statistical Machine Learning

function.
I Then we classify with an SVM.
393
Summer 2019

Example: kernel between vertices in a graph (2)


There exists a large number of graph kernels. One big family is
based on paths between vertices:
Ulrike von Luxburg: Statistical Machine Learning
394
Summer 2019

Example: kernel between vertices in a graph (3)


To define a kernel between vertices on a graph:
I Consider a directed graph with edge weights in [0, 1], where
this value encodes a similarity (high means very similar). For a
directed path π = v1 , ..., vk define the weight of the path as
k−1
w(π) = πj=1 w(vj , vj+1 )
I For each pair of vertices v, ṽ consider the set Πk (v, ṽ) which
consists of all paths from v to ṽ of lengths at most k
Ulrike von Luxburg: Statistical Machine Learning

I Now define
(P
π∈Πk w(π) if Πk (v, ṽ) 6= ∅
s(v, ṽ) =
0 otherwise

and the symmetric kernel function

k(v, ṽ) = s(v, ṽ) + s(ṽ, v)


395
396 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Example: kernel between vertices in a graph (4)


Summer 2019

Example: kernel between vertices in a graph (5)


I Note that this kernel cannot be interpreted in terms of a
simple feature vector! (WHY EXACTLY?)
I This principle leads to the family of diffusion kernels, we
won’t discuss details.
Ulrike von Luxburg: Statistical Machine Learning
397
Summer 2019

Simple rules for dealing with kernels


I In general, it is really difficult to prove that a certain function
k is indeed a kernel (WHAT DO WE HAVE TO PROVE?)
I In practice, it usually does not work to come up with a nice
similarity function and “hope” that it is a kernel.
I But at least, there are some simple rules that can help to
transform and combine elementary kernels:
Ulrike von Luxburg: Statistical Machine Learning
398
Summer 2019

Simple rules for dealing with kernels (2)


Assume that k1 , k2 : X × X → R are kernel functions. Then:
I k̃ = α · k1 for some constant α > 0 is a kernel.

I k̃ = k1 + k2 is a kernel

I k̃ = k1 · k2 is a kernel

I The pointwise limit of a sequence of kernels is a kernel.

I For any function f : X → R, the expression


Ulrike von Luxburg: Statistical Machine Learning

k̃(x, y) := f (x)k(x, y)f (y) defines a kernel.

In particular, k̃(x, y) = f (x)f (y) is a kernel.

Proof. EXERCISE.
399
Summer 2019

(∗) Kernel matrix: pd or psd?


Due to a common confusion, let me stress again:
I A scalar product is positive definite. This means that the
property hv, vi > 0 holds (with strict inequality!) for all v 6= 0
I The kernel matrix is positive semi-definite in the sense that
c0 Kc ≥ 0 (greater or equal!).
WHY IS THIS NOT A CONTRADICTION?
Ulrike von Luxburg: Statistical Machine Learning
400
Summer 2019

(∗) Kernel matrix: pd or psd? (2)


I Consider data X1 , ..., Xn ∈ Rd .
I Then the kernel matrix coincides with K = XX t .
I Let v be an eigenvector of K. We have

v 0 XX 0 v = v 0 Kv = λv 0 v = λ

I For eigenvectors with λ > 0: fine.


Ulrike von Luxburg: Statistical Machine Learning

I For eigenvectors with λ = 0: Then X 0 v = 0, and the scalar


product of the 0-vector with itself is 0. Fine as well.
In particular: The rank of the kernel matrix is at most the
dimension of the underlying vector space. So if n > d, the kernel
matrix must have eigenvalues 0. EXERCISE!
401
402 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

maps
Reproducing kernel Hilbert space and feature
Summer 2019

Kernels do what they are supposed to do


Here is the justification for why we defined kernels the way we did:

Theorem 12 (Kernel implies embedding)


A function k : X × X → R is a kernel if and only if there exists a
Hilbert space H and a map Φ : X → H such that
k(x, y) = hΦ(x), Φ(y)i.
Ulrike von Luxburg: Statistical Machine Learning

If you have never heard of Hilbert spaces, just think of the space
Rd . The crucial properties are:
I H is a vector space with a scalar product h·, ·iH

I Space is complete (all Cauchy sequences converge)

I (Scalar product gives rise to a norm: kxkH := hx, xiH )


403
404 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Kernels do what they are supposed to do (2)


Summer 2019

Kernels do what they are supposed to do (3)


WHICH DIRECTION OF THE THEOREM IS EASY, WHICH ONE
IS DIFFICULT?
Ulrike von Luxburg: Statistical Machine Learning
405
Summer 2019

Kernels do what they are supposed to do (4)


Proof of “⇐”
Clear by definition of the kernel (we defined the kernel exactly such
that this direction holds).
Ulrike von Luxburg: Statistical Machine Learning
406
Summer 2019

Kernels do what they are supposed to do (5)


Proof of “⇒”
We have to prove the following:

Given X and k, there exists a vector space H with a scalar product


h·, ·iH , and a mapping Φ : X → H such that

k(x, y) = hΦ(x), Φ(y)iH


Ulrike von Luxburg: Statistical Machine Learning

for all x, y ∈ X .

We now introduce the Reproducing Kernel Hilbert Space (RKHS),


a scalar product on this space and a corresponding feature mapping
Φ. ,
407
Summer 2019

Reproducing kernel Hilbert space (RKHS)


As vector space we are going to use a space of functions:
I Consider a mapping Φ : X → RX (where RX denotes the
space of all real-valued functions from X to R), defined as

x 7→ Φ(x) := kx := k(x, ·)

That is, the point x ∈ X is mapped to the function


Ulrike von Luxburg: Statistical Machine Learning

kx : X → R, kx (y) = k(x, y).


408
Summer 2019

Reproducing kernel Hilbert space (RKHS) (2)


I Now consider the images {kx |x ∈ X } as a spanning set of a
vector space. That is, we define the space G that contains all
finite linear combinations of such functions:
Xr
G := { αi k(xi , ·) αi ∈ R, r ∈ N, xi ∈ X }
i=1
Ulrike von Luxburg: Statistical Machine Learning
409
Summer 2019

Reproducing kernel Hilbert space (RKHS) (3)


I Define a scalar product on G as follows:
I For the spanning functions we define

hkx , ky i = hk(x, ·), k(y, ·)i := k(x, y)


I For general functions
P in G the scalar product
P is then given as
follows: If g = i αi k(xi , ·) and f = j βj k(yi , ·) then
Ulrike von Luxburg: Statistical Machine Learning

X
hf, giG := αi βj k(xi , yj )
i,j

To make sure that this is really a scalar product, we need to


prove two things (EXERCISE!):
I Check that this is well-defined (not obvious because there
might be several different linear combinations for the same
function).
I Check that it satisfies all properties of a scalar product
(crucial ingredient is the fact that k is positive definite. )
410
Summer 2019

Reproducing kernel Hilbert space (RKHS) (4)


I Finally, to make G a proper Hilbert space we need to take its
topological completion G, that is we add all limits of Cauchy
sequences.
Ulrike von Luxburg: Statistical Machine Learning

I The resulting space H := G is called the reproducing kernel


Hilbert space.

I By construction, it has the property that

k(x, y) = hΦ(x), Φ(y)i.


411
Summer 2019

(∗) RKHS, further properties


The reproducing property:
P
Let f = i αi k(xi , ·). Then hf, k(x, ·)i = f (x).

Proof.
X
hk(x, ·), f i = hk(x, ·), αi k(xi , ·)i
Ulrike von Luxburg: Statistical Machine Learning

i
X
= αi hk(xi , ·), k(x, ·)i
i
X
= αi k(xi , x)
i
= f (x)

,
412
Summer 2019

(∗) RKHS, further properties (2)


For those who know a bit of functional analysis:
I Let H be a Hilbert space of functions from X to R. Then H is
a reproducing kernel Hilbert space if and only if all evaluation
functionals δx : H → R, f 7→ f (x) are continuous.
I In particular, functions in an RKHS are pointwise well defined
(as opposed to, say, function in an L2 -space which are only
defined almost everywhere).
Ulrike von Luxburg: Statistical Machine Learning

I Given a kernel, the RKHS is unique (up to isometric


isomorphisms). Given an RKHS, the kernel is unique.
I There is a close connection to the Riesz representation
theorem.
413
Summer 2019

The representer theorem


I In general, the RKHS is an infinite-dimensional vector space (a
basis has to contain infinitely many vectors).
I The next theorem shows that in practice, we only have to deal
with a finite-dimensional subspace.
I This subspace is still pretty large, later we discuss how to avoid
overfitting!
Ulrike von Luxburg: Statistical Machine Learning
414
Summer 2019

The representer theorem (2)


Setup:
I Assume we are given a kernel k. Denote the corresponding
RKHS with H, and the norm and scalar product in the space
by k · kH and h·, ·iH .
I Assume that we want to learn a linear function f : H → R
that acts on the RKHS H of a kernel k.
I All such functions have the form f (x) = hw, xiH for some
Ulrike von Luxburg: Statistical Machine Learning

w ∈ H, that is we can identify the function f with the


corresponding vector w ∈ H.
(for maths people: reason is that the dual of a real Hilbert
space is isomorphic to this Hilbert space)

In this setup, we can prove the following theorem:


415
Summer 2019

The representer theorem (3)


Theorem 13 (Representer theorem)
Consider a regularized risk minimization problem of the form

minimize Rn (w) + λΩ(kwkH ) (∗)


w∈H

where X arbitrary input space, Y output space, k : X × X → R a


kernel, H the corresponding RKHS. For a given training set
Ulrike von Luxburg: Statistical Machine Learning

(Xi , Yi )i=1,...,n ⊂ X × Y and classifier fw (x), = hw, xiH , let Rn be the


empirical risk of the classifier with respect to a loss function `, and
Ω : [0, ∞[→ R a strictly monotonically increasing function. Then
problem (∗) always has an optimal solution of the form
n
X

w = αi k(Xi , ·).
i=1
416
Summer 2019

The representer theorem (4)


Proof intuition.
I Split a the space H into the subspace
Hdata := span{kX1 , ..., kXn } (induced by the data) and its
orthogonal complement Hcomp . Then H = Hdata + Hcomp .
I Now express each vector w ∈ H as w = wdata + wcomp .

I It is not difficult to see that the predictions of all functions


with the same wdata agree on all training points, they do not
Ulrike von Luxburg: Statistical Machine Learning

depend on wcomp .
I So in particular, the loss w is not affected by wcomp .

I For fixed wdata , the norm ow w is smallest if wcomp is 0.

I So if we had a solution w ∗ where wcomp would be non-zero, we


could get a better solution by setting wcomp to zero.
I Thus we can always find an optimal solution with wcomp = 0.

,
417
Summer 2019

The representer theorem (5)


Intuitively, this theorem implies the following:
I We have seen that for any given kernel k there exists a feature
space H.
I However, this space was a function space that usually is an
infinite-dimensional Hilbert space.
I The representer theorem now says that for any finite data set
with n points, we don’t need to deal with all the infinitely
Ulrike von Luxburg: Statistical Machine Learning

many dimensions, but we are only confronted with a space of


at most n dimensions.
I As any n-dimensional subspace of a Hilbert space is isomorphic
to Rn , we can simply assume that our feature map goes to Rn .
I This makes our lives much easier.
418
Summer 2019

(∗) Injective feature map


Note that without any further assumptions, the feature map
Φ : X → H of a kernel k does not need to be injective!

(Simple counterexample: k(x, y) = hx, yi2 ).

However, a kernel for which the feature map is not injective might
not be too useful (WHY?)
Ulrike von Luxburg: Statistical Machine Learning

A particular class of “nice” kernels are univesal kernels:


419
Summer 2019

(∗) Universal kernels


A continuous kernel k on a compact metric space X is called
universal if the RKHS H of k is dense in C(X ), that is for every
function g ∈ C(X ) and all ε > 0 there exists a function f ∈ H
such that kf − gk∞ ≤ ε.

Intuition: with a universal kernel, we can approximate pretty much


any function we like: all continuous functions, and all functions that
Ulrike von Luxburg: Statistical Machine Learning

can be approximated by continuous functions (such as step


functions). In particular, we can separate any pair of disjoint
compact subsets from each other.

Example:
I The Gaussian kernel with fixed kernel width σ on a compact
subset X of Rd is universal.
420
Summer 2019

(∗) Universal kernels (2)


I Related statements can also be proved if we let σ → 0 slowly
as n → inf ty.
I Polynomial kernels are not universal.

The kernel being universal is a necessary requirement if we want to


construct learning algorithms that are uniformly Bayes consistent.
Ulrike von Luxburg: Statistical Machine Learning

Universal kernels have many nice properties. For example, their


feature maps are injective.

For details and proofs see the book by Steinwart / Christmann:


Support Vector Machines. Springer 2008.
421
Summer 2019

Kernels — history
I Reproducing kernel Hilbert spaces play a big role in
mathematics, they have been invented by Aronszajn in 1950.
He already proved all of the key properties.
Aronszajn. Theory of Reproducing Kernels. Transactions of
the American Mathematical Society, 1950
I The feature space interpretation has first been published by
Ulrike von Luxburg: Statistical Machine Learning

Aizerman 1964, but in a different context. At that time the


potential of the method had not been realized.
Aizerman, Braverman, Rozonoer: Theoretical foundations of
the potential function method in pattern recognition learning.
Automation and Remote Control, 1964.
422
Summer 2019

Kernels — history (2)


I Then it was rediscovered in the context of the SVM in 1992:
Boser, Bernhard E.; Guyon, Isabelle M.; and Vapnik, Vladimir
N.; A training algorithm for optimal margin classifiers.
Conference on Learning Theory (COLT), 1992
I Since then, kernels and the kernel trick became extremely
popular, the first text books already appeared pretty soon, e.g.
Schölkopf / Smola 2002 and Shawe-Taylor / Cristianini 2004.
Ulrike von Luxburg: Statistical Machine Learning
423
Summer 2019

Kernel algorithms
In the following, we are now going to see a couple of algorithms
that all use the kernel trick. The roadmap is always the same:
I Start with a linear algorithm

I Try to write this algorithm (both training and testing parts) in


such a way that the only access to training and testing points
is in terms of scalar products (this is often possible but not
Ulrike von Luxburg: Statistical Machine Learning

always; sometimes it is simple, sometimes it is difficult).


I Then replace the scalar product by the kernel function.

In the machine learning lingo: we kernelize the algorithm.


424
Summer 2019

Support vector machines with kernels


Literature:
I Schölkopf / Smola

I Shawe-Taylor / Cristianini
Ulrike von Luxburg: Statistical Machine Learning

I A very theoretical / mathematically deep treatment of the


theory of kernels and support vector machines is the following
book:
Steinwart / Christmann: Support Vector Machines. Springer,
2008.
425
Summer 2019

SVMs with kernels


I Consider the dual (!) SVM problem
I Have seen: the only way it accesses the training points in
terms of scalar products
I So replace hXi , Xj i by k(Xi , Xj ) everywhere
I The result is the dual of the “kernelized” SVM.
Ulrike von Luxburg: Statistical Machine Learning

Formally, this looks as follows:


426
Summer 2019

SVMs with kernels (2)


Given input training points (Xi , Yi )i=1,...,n and a kernel function
k : X × X → R.

Kernelized dual SVM problem:


n n
X 1X
maximize αi − αi αj Yi Yj k(Xi , Xj )
n
α∈R
i=1
2 i,j=1
Ulrike von Luxburg: Statistical Machine Learning

subject to 0 ≤ αi ≤ C/n ∀i = 1, ..., n


X n
αi Yi = 0
i=1

Solving this problem gives the dual variables α.


427
Summer 2019

SVMs with kernels (3)


Computing labels at new points: Have already seen how to
compute the label of a test points for known α:
P
• w = i α i Yi Xi
P
• b = Yj − i Yi αi hXi , Xj i for some j such that C/n > αj > 0.
• Label of test point X given by hw, Xi + b

In kernel language:
Ulrike von Luxburg: Statistical Machine Learning

X
hw, Xi + b = h αi Yi Xi , Xi + b
i
X  X 
= αi Yi k(Xi , X) + Yj − Yi αi k(Xi , Xj )
i i

This is the approach that is typically used in practice.


428
Summer 2019

The power of kernels


Why is the kernel framework so powerful? Let’s look at one
particular example, the Gaussian kernel.
I Have seen that the decision function of a kernelized SVM has
the form
X
f (x) = βi k(x, Xi ) + b
Ulrike von Luxburg: Statistical Machine Learning

If k is a Gaussian kernel:
X
f (x) = βi exp(−kx − Xi k2 /(2σ 2 ))
i
429
Summer 2019

The power of kernels (2)


I Important property: we can approximate any arbitrary
continuous function g : Rd → R by a sum of Gaussian kernels.
Ulrike von Luxburg: Statistical Machine Learning

In particular, we can approximate any reasonable “decision


surface” in Rd by an SVM with Gaussian kernel:
430
431 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

The power of kernels (3)


Summer 2019

The power of kernels (4)


I Kernels with this property are called “universal” kernels. One
can prove that SVMs with universal kernels are universally
consistent in the sense we defined in the very beginning of the
lecture, that is they approximate the Bayes risk.
I Note: if the kernel is universal, the underlying function class is
huge (all continuous functions can be approximated). All the
more important is that we regularize!
Ulrike von Luxburg: Statistical Machine Learning
432
Summer 2019

Regularization interpretation
Recall that we interpreted the linear primal SVM problem in terms
of regularized risk minimization where the risk was the Hinge loss
and we regularized by L2 - regularizer kwk2 .

How does it look for the kernelized SVM?


I The loss function is still the Hinge loss because the part with
the variables ξi does not change.
Ulrike von Luxburg: Statistical Machine Learning

I But the regularizer is now kwk2 where w is a vector in the


feature space, and the norm is taken in the feature space.
I By the representer theorem,

X X
kwk2 = h βi Φ(Xi ), βj Φ(Xj )i = β t Kβ
i j
433
Summer 2019

Regularization interpretation (2)


I It is not so easy to gain intuition about this norm (obviously it
depends on the kernel). But at least we can say that the
regularization “restricts the size of the function space” (in the
sense that there are fewer functions that can be expressed with
w with low norm than with high norm).
Ulrike von Luxburg: Statistical Machine Learning
434
Summer 2019

Regularization interpretation (3)


This regularization interpretation is really important, otherwise the
SVM “could not work”:
I We implicitly embed our data in a very high-dimensional space.

I In high-dimensional spaces, it happens very easily that we


overfit.
I The only way we can circumvent this is to regularize.
Ulrike von Luxburg: Statistical Machine Learning

I This is what the kernelized SVM does.


435
Summer 2019

Kernel SVMs in practice


If you want to use SVMs in practice, here is the vanilla approach:
I Come up with a good kernel that encodes a “natural notion”
of similarity (sometimes easy, sometimes not).
I Train an SVM by some standard package (there are lots of
SVM packages out there; for matlab, my favorite one is
libSVM)
Ulrike von Luxburg: Statistical Machine Learning

I MAKE SURE YOU SET ALL PARAMETERS BY CROSS


VALIDATION!
The results are very sensitive to the choice of the
regularization parameter C and the kernel parameters (such as
σ for the Gaussian kernel).
Later in the lecture we will look at more preprocessing steps that
you should use (; non-vanilla-version).
436
Summer 2019

(∗) Kernelizing the SVM primal


AS AN EXERCISE: IT IS POSSIBLE TO EXPRESS THE PRIMAL
OPTIMIZATION PROBLEM IN TERMS OF KERNELS?

n
CX
minimizew∈H kwk2H + ξi
n i=1
Ulrike von Luxburg: Statistical Machine Learning

 
subject to Yi hw, Φ(Xi )iH ≥ 1 − ξi (i = 1, ..., n)
437
Summer 2019

(∗) Kernelizing the SVM primal (2)


I A priori, we cannot write the primal function just in terms of
scalar products because it contains a scalar product between
the variable we are looking for (w) and the input points (Xi ).
I But according to the representer theorem, the solution vector
w can always be written as aPlinear combination of input
feature vectors, that is w = i βi Φ(Xi ).
I Consequently,
Ulrike von Luxburg: Statistical Machine Learning

X
kwk2 = hw, wi = βi βj k(Xi , Xj )
i,j

and
X X
hw, Φ(Xj )i = βi hΦ(Xi ), Φ(Xj )i = βi k(Xi , Xj )
i i
438
Summer 2019

(∗) Kernelizing the SVM primal (3)


I With this knowledge we can also kernelize the primal problem:
n
1X CX
minimize βi βj k(Xi , Xj ) + ξi
β∈Rd ,b∈R,ξ∈Rd 2 n i=1
i,j
n
X 
subject to Yi βj k(Xj , Xi ) + b ≥ 1 − ξi (i = 1, ..., n)
j=1
Ulrike von Luxburg: Statistical Machine Learning
439
Summer 2019

Why are SVMs so successful?


Before SVMs, there were neural networks. They are great, but they
have a couple of drawbacks:
I Lots of parameters to tune (design choices to make: how many
neurons, how many layers, etc)
I Training a neural network is a non-convex problem

I To be able to successfully work with neural networks one needs


Ulrike von Luxburg: Statistical Machine Learning

a large amount of experience.


Then came SVMs, they revolutionized the field. Why?
I Convex optimization problem, easy to implement

I Very few variables to tune (C, and maybe a kernel parameter


such as σ in the Gaussian kernel), this can be done by cross
validation
440
Summer 2019

Why are SVMs so successful? (2)


I Appealing from a conceptual side (large margin principle) and
also from the mathematical point of view (support vector
property, representer theorem, etc).
I The kernel framework boosts the potential of the SVM to the
non-linear regime, but does not lead to excessive overfitting.
I Statistical learning theory shows many nice guarantees about
the SVM (consistency, etc).
Ulrike von Luxburg: Statistical Machine Learning
441
Summer 2019

Why are SVMs so successful? (3)


By now, neural networks are back in the form of “deep networks”:
I Deep networks are very successfull in cases where there exists
lots of highly structured data (speech, text, images).
I Main difference to 30 years ago: computational power
increased a lot, some good heuristics have been worked out.
I From theory point of view, not understood at all.
Ulrike von Luxburg: Statistical Machine Learning

Comparing SVMs and Deep networks, both tend to be successful in


very different types of applications.
442
Summer 2019

Summary: SVM with kernels


I Given data points in some space X and a kernel function k on
this space
I Want to solve classification.
I By the kernel trick, we embed our data points into some
abstract feature space and use a linear classifier in this space.
I The inductive principle is that the margin in this feature space
Ulrike von Luxburg: Statistical Machine Learning

should be large.
I All this leads to a convex optimization problem that can be
solved efficiently.
I There are lots of important properties (support vector
property, representer theorem, etc).
I The kernel SVM is equivalent to regularized risk minimization
with the Hinge loss and regularization by the squared norm in
the feature space.
443
Summer 2019

Summary: SVM with kernels (2)


The kernel SVM is the most important classification
algorithm that is out there. If you just remember one thing
from this whole course, try to remember SVMs ,
Ulrike von Luxburg: Statistical Machine Learning
444
445 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Regression methods with kernels


446 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Kernelized least squares


Summer 2019

Least squares revisited


We had already seen in the beginning how to solve least squares
regression in a feature space (at that point we called it differently,
regression with basis functions):

Given data in some space X , a mapping Φ : X → Rd . Already


seen: The least squares problem in feature space
Ulrike von Luxburg: Statistical Machine Learning

n
1X
minimize (Yi − hΦ(Xi ), wi)2
w∈Rd n i=1

has the analytic solution w∗ = (Φt Φ)−1 Φt Y .

We are now going to rewrite everything using kernels.


447
Summer 2019

Kernelizing least squares (first method via


representer theorem)
I The representer theorem tells us that the least squares problem
always has a solution of the form
n
X

w = αj Φ(Xj ).
j=1
Ulrike von Luxburg: Statistical Machine Learning

I Plugging this in the objective gives


n
1X
minimize (Yi − hΦ(Xi ), wi)2
w∈Rd n i=1
n n
1X X
⇐⇒ minimize (Y i − αj hΦ(Xi ), Φ(Xj )i)2
α∈Rn
n i=1 j=1
| {z }
kij
448
Summer 2019

Kernelizing least squares (first method via


representer theorem) (2)
I In matrix notation:
1
minimize kY − Kαk2
α∈Rn n
I By taking the derivative with respect to α and exploiting that
K is pd it is easy to see that the solution is given as
Ulrike von Luxburg: Statistical Machine Learning

α∗ = K −1 Y (EXERCISE!).
I To evaluate the solution on a new data point x , we need to
compute
X X
f (x) = hΦ(x), w∗ i = αj∗ hΦ(x), Φ(Xj )i = αj∗ k(x, Xj )
j j

I So we can express the optimization problem, its


solution and the evaluation function purely in terms of
kernel functions. We have kernelized least squares.
449
Summer 2019

(∗) Kernelizing least squares (second method via


SVD)
Recap: the kernel matrix is ΦΦt
I Let X1 , ..., Xn ∈ X be data points, Φ : X → Rd a feature map.
I Denote by Φ the n × d-matrix that contains the data points
Φ(Xi ) as rows.
I Then the matrix Φ · Φt ∈ Rn×n coincides with the
Ulrike von Luxburg: Statistical Machine Learning

corresponding kernel matrix K with entries


kij = hΦ(Xi ), Φ(Xj )i = Φ(Xi )Φ(Xj )t .
450
Summer 2019

(∗) Kernelizing least squares (second method via


SVD) (2)
Proposition 14 (Matrix Identities)
For any n × d-matrix Φ we have

(Φt Φ)−1 Φt = Φt (ΦΦt )−1


Ulrike von Luxburg: Statistical Machine Learning

Proof of the proposition.


I Let Φ = U ΣV t the singular value decomposition of Φ.

I It is straightforward to prove that (Φt Φ)−1 Φt = V Σ+ U t (have


seen this already when we derived least squares).
I It is even more straightforward to see that
Φt (ΦΦt )−1 = V Σ+ U t . ,
451
Summer 2019

(∗) Kernelizing least squares (second method via


SVD) (3)
Using this proposition and the fact that the kernel matrix K is
given as ΦΦt we can rewrite the least squares solution as

w∗ = (Φt Φ)−1 Φt Y = Φt (ΦΦt )Y −1 = Φt K −1 Y

Denote α := K −1 Y . With this notation, the evaluation function is


Ulrike von Luxburg: Statistical Machine Learning

f (x) = hw∗ , Φ(x)i = (w∗ )t Φ(x)


= (Φt K −1 Y )t Φ(x) = Y t K −1 Φ Φ(x)
= αt ΦΦ(x)
Xn
= αj Φ(Xj )t Φ(x)
j=1
n
X
= αj k(Xj , x)
452

j=1
Summer 2019

(∗) Kernelizing least squares (second method via


SVD) (4)
So we can express the optimization problem, its solution
and the evaluation function purely in terms of kernel
functions. We have kernelized least squares.
Ulrike von Luxburg: Statistical Machine Learning
453
454 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Kernel ridge regression


Summer 2019

Ridge regression
Recall ridge regression in feature space:
n
1X
minimize (Yi − hw, Φ(Xi )i)2 + λkwk2
w∈Rd n i=1

Again use the representer theorem to express w as a linear


combination of input points:
Ulrike von Luxburg: Statistical Machine Learning

n
X
w= αj Φ(Xj )
j=1
455
Summer 2019

Ridge regression (2)


This leads to the following kernelized ridge regression problem:
1
minimize kY − Kαk2 + λαt Kα
α∈Rn n
The solution is given by

α = (nλI + K)−1 Y
Ulrike von Luxburg: Statistical Machine Learning

As before, we can compute the prediction for a new test point just
using kernels.
456
Summer 2019

A subtle difference: Ridge regression vs. kernel


ridge regression
I Given points in space X . Want to compare:
I Ridge regression using basis functions Φi (x) = k(Xi , x)
I Kernel ridge regression using kernel k and feature map Φ
I In both cases, we work in the same function space, namely the
one spanned by the functions Φi .
Ulrike von Luxburg: Statistical Machine Learning

I So no matter which function we use, the least squares error is


the same in both approaches.
I However, the regularizers are different:
I In the standard case we regularize by kαk2 .
I In the kernel case we regularize by αt Kα.
I This is as for linear and kernel SVMs ...
457
Summer 2019

Kernel version of LASSO


Note: the trick that we used to derive the kernel version of ridge
regression does NOT work for Lasso (WHY????)
Ulrike von Luxburg: Statistical Machine Learning
458
Summer 2019

How to center and normalize in the feature space


Literature:
I Shawe-Taylor / Cristianini Section 5.1
Ulrike von Luxburg: Statistical Machine Learning
459
Summer 2019

What we want to do
I Have seen: many algorithms require that the data points are
centered (have mean = 0) and are normalized.
I However, now we want to work in feature space, but without
explicitly working with the coordinates in feature space.
I So how can we do this ???
Ulrike von Luxburg: Statistical Machine Learning
460
Summer 2019

Centering in the feature space


To center points in the feature space, we would need to perform
the following calculations:
P
I Compute center: Φ̄ := 1/n
i Φ(xi )
I Replace Φ(xi ) by Φ(xi ) − Φ̄

Not obvious that we can express this in terms of scalar products...


Ulrike von Luxburg: Statistical Machine Learning
461
Summer 2019

Centering in the feature space (2)


To proceed, assume that we can compute Φ̄ and let’s compute the
kernel values between the centered points:

K˜ij := hΦ(xi ) − Φ̄, Φ(xj ) − Φ̄i


Xn n
X
= hΦ(xi ) − 1/n Φ(xs ), Φ(xj ) − 1/n Φ(xs )i
Ulrike von Luxburg: Statistical Machine Learning

s=1 s=1
n n
1X 1X
= k(xi , xj ) − k(xi , xs ) − k(xj , xs )
n s=1 n s=1
n
1 X
+ 2 k(xs , xt )
n s,t=1
462
Summer 2019

Centering in the feature space (3)


In matrix notation, this means that we can compute the centered
kernel matrix as follows:

K̃ = (K − 1n K − K1n + 1n K1n )

where 1n is the n × n matrix containing 1/n as each entry.

Good news:
Ulrike von Luxburg: Statistical Machine Learning

I We do not have to do the centering operation explicitly.

I We can implicitly center the data by replacing the “old” kernel


matrix K by the new matrix K̃.
463
464 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Centering in the feature space (4)


Summer 2019

Normalizing in feature space


Assume our n data points are in Rd and we stack them in a data
matrix X as usual:
I Each row of X corresponds to one data point.

I The data matrix has dimensions n × d

Two different ways to normalize data:


Ulrike von Luxburg: Statistical Machine Learning

I Normalize the data points, that is recale each data point such
that it has norm 1. This is equivalent to normalizing the rows
of the centered matrix to have unit norm.
I Normalize the individual features, that is rescale all columns of
the centered matrix to have norm 1. This is just a rescaling of
the coordinate axes such that the variance in each coordinate
direction is 1.
465
Summer 2019

Normalizing in feature space (2)


orig data points centered
2
4
0
2
−2
0
−2 0 2 4 6 −5 0 5
Ulrike von Luxburg: Statistical Machine Learning

points centered, points normalized points centered, features normalized

0.1
0.5
0
0
−0.1
−0.5

−0.2
−1 0 1 −0.3 −0.2 −0.1 0 0.1 0.2
466
Summer 2019

Normalizing in feature space (3)


Note:
I Normalize the points
I We will see below that there is a way to normalize points in
the feature space.
I Sometimes it helps, sometimes it hurts.
I If in doubt, use cross-validation to see whether your results
improve if you normalize or not.
Ulrike von Luxburg: Statistical Machine Learning

I Normalize features:
I Typically this never hurts, and often helps.
I For kernel methods it is impossible to normalize the features
(we don’t know the embedding Φ explicitly, in particular we
don’t know what the features are).
467
Summer 2019

Normalizing in feature space (4)


To normalize the points such that they have unit norm in the
feature space:
I Assume the data points are already centered in feature space.
I Then define the normalized data point
Φ̂(X) := Φ(X)/kΦ(X)k.
I Observe:
Ulrike von Luxburg: Statistical Machine Learning

 
Φ(X) Φ(Y )
hΦ̂(X), Φ̂(Y )i = ,
kΦ(X)k kΦ(Y )k
hΦ(X), Φ(Y )i
=
kΦ(X)kkΦ(Y )k
k(x, y)
=p
k(x, x)k(y, y)
468
Summer 2019

Normalizing in feature space (5)


So instead of first normalizing the points and then computing their
kernels we can directly compute the kernels for the normalized data
points.

So to normalize the points in feature space, we simply replace the


kernel function k by the normalized kernel function

k(x, y)
Ulrike von Luxburg: Statistical Machine Learning

k̂(x, y) = p .
k(x, x)k(y, y)

(VERIFY THAT THIS IS INDEED A KERNEL)


469
Summer 2019

When it can go wrong


Standardizing the data is a preprocessing step. As any such step, it
often helps, but sometimes can go wrong:
Ulrike von Luxburg: Statistical Machine Learning
470
471 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

algorithms
More supervised learning
472 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

(*) Random Forests


473 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

(*) Boosting
474 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Unsupervised learning
475 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Dimensionality reduction and embedding


Summer 2019

PCA and kernel PCA


Classical PCA is covered in many statistics books:
I A complete book on PCA is Jolliffe: Principal Component
Analysis. Springer, 2002.
Ulrike von Luxburg: Statistical Machine Learning

I Chapter 8 in Mardia, Kent, Bibby: Multivariate Analysis.


Academic Press, 1979. A classic.
Literature on kernel PCA:
I Chapter 14.2 of Schölkopf and Smola

I Chapter 6.2. of Shawe-Taylor and Cristianini


476
Summer 2019

Principal component analysis (PCA)


... is a “traditional” method for unsupervised dimensionality
reduction. Is based on linear principles.

Goal:
I Given data points x1 , ..., xn ∈ Rd

I want to reduce the dimensionality of the data by throwing


away “dimensions which are not important”.
Ulrike von Luxburg: Statistical Machine Learning

I Result is set of new data points y1 , ..., yn ∈ R` with ` < d.


477
478 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Principal component analysis (PCA) (2)


Summer 2019

Principal component analysis (PCA) (3)


Three approaches in this lecture:
I Traditional approach: Maximize the variance of the reduced
data ; Covariance matrix approach
I Traditional approach: Minimize the quadratic error ; SVD
approach
I Kernel PCA
Ulrike von Luxburg: Statistical Machine Learning
479
Summer 2019

Recap: Projections
... see slides in the appendix (slides 1212 ff.)
Ulrike von Luxburg: Statistical Machine Learning
480
Summer 2019

Recap: Variance and Covariance


... see slides in the appendix (slides 1159 ff.)
Ulrike von Luxburg: Statistical Machine Learning
481
Summer 2019

PCA by max variance approach: Idea


Ulrike von Luxburg: Statistical Machine Learning

Want to find a linear projection on a low-dim space such that the


overall variance of the resulting points is as large as possible.
482
Summer 2019

PCA by max variance approach: Idea (2)


Given: data points x1 , ..., xn ∈ Rd , parameter ` < d (the dimension
of the space we want to project to).

Goal: find a projection πS on an affine subspace S such that the


variance of the projected points is maximized: maxS Var` (πS (X)).

For simplicity, let us assume that the data points are centered:
Ulrike von Luxburg: Statistical Machine Learning

n
1X
x̄ = xi = 0
n i=1

(If this is not the case, we can center the data points by setting
x̃i = xi − x̄.)
483
Summer 2019

PCA by max variance approach: Case ` = 1


One-dimensional case:
We first of all assume that ` = 1, that is we want to project the
data points on a 1-dim space.

Have to solve the following optimization problem:


Ulrike von Luxburg: Statistical Machine Learning

max Var(πa (X))


a∈Rd ,kak=1
X n
⇐⇒ max (πa (xi ))2 subject to a0 a = 1
a∈Rd
i=1
n
X
⇐⇒ max (a0 xi )2 subject to a0 a = 1
a∈Rd
i=1
⇐⇒ max kXak2 subject to a0 a = 1
a∈Rd
484
Summer 2019

PCA by max variance approach: Case ` = 1 (2)


To solve this:
I Write the Lagrangian:

L(a, λ) = kXak2 − λ(a0 a − 1) = a0 X 0 Xa − λ(a0 a − 1)

I Compute the partial derivatives wrt a:


!
∂L/∂a = X 0 Xa − λa = 0
Ulrike von Luxburg: Statistical Machine Learning

Thus necessary condition: a is an eigenvector of X 0 X.


I Substitute X 0 Xa = λa in the original objective function:

a0 X 0 Xa = λa0 a = λ

I This is maximal for a being the largest eigenvector of X 0 X.


Solution: If the data points are centered, then projecting on the
largest eigenvector of C = X 0 X solves the problem for ` = 1.
485
Summer 2019

PCA by max variance approach: Case ` > 1


Case ` > 1:
By similar arguments we can prove that we need to project the data
on the space spanned by the ` largest eigenvectors of X 0 X.

(and by “largest eigenvector” I mean the eigenvector corresponding


to the largest eigenvalue).
Ulrike von Luxburg: Statistical Machine Learning
486
Summer 2019

PCA: algorithm using covariance matrix C


Input: Data points x1 , ..., xn ∈ Rd , parameter ` ≤ d.

I Center the data points, that is compute x̃i = xi − x̄ for all i.


I Compute the n × d data matrix X with the centered data
points x̃i as rows, and the d × d sample covariance matrix
C = X 0 X.
I Compute the eigendecomposition C = V DV 0 .

I Define V` as the matrix containing the ` largest eigenvectors


(i.e., the first ` columns of V if the eigs in D are ordered
decreasingly).
I Compute the new data points:
I View 2: yi = V`0 x̃i ∈ R`
I View 1: zi = P x̃i + x̄ ∈ Rd with P = V` V`0
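To make these steps concrete, here is a minimal NumPy sketch of the algorithm (my own illustrative code, not the lecture demos; function and variable names are assumptions):

import numpy as np

def pca(X, ell):
    # X: (n, d) array with data points as rows; ell: target dimension
    x_bar = X.mean(axis=0)
    X_tilde = X - x_bar                             # center the data points
    C = X_tilde.T @ X_tilde                         # d x d matrix C = X'X
    eigvals, V = np.linalg.eigh(C)                  # eigenvalues in ascending order
    V_ell = V[:, np.argsort(eigvals)[::-1][:ell]]   # the ell largest eigenvectors as columns
    Y = X_tilde @ V_ell                             # "View 2": new points in R^ell
    Z = X_tilde @ (V_ell @ V_ell.T) + x_bar         # "View 1": projections back in R^d
    return Y, Z, V_ell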

PCA: algorithm using covariance matrix C (2)


Notation:
I The eigenvectors are called principal axes or principal
directions.
I In View 1: the distance between a point and its projection is
called the reconstruction error or projection error.

Example: simple Gaussian toy data
demo_pca.m

USPS example
USPS handwritten digits, 16 x 16 greyscale images.
; demo_pca_usps.m

USPS example (2)


Some digits from the data set


USPS example (3)


The first principal components (computed based on 500 digits):
[Figure: images of the first nine principal components, PrincComp 1–9]

USPS example (4)


Reconstructing digits:
[Figure: the first digit reconstructed with 1, 2, 3, 4, 5, 10, 50, 100, and 256 eigenvectors]

USPS example (5)


All eigenvalues:
[Figure: plot of the singular values of the data matrix]

Eigenfaces
Principal components for a data set of faces:

PCA – min squared error approach


Second approach to PCA:

Find a projection πS on an affine subspace S such that the squared distance between the points and their projections is minimized:

min_S ∑_{i=1}^{n} kxi − πS (xi )k2 .

PCA – min squared error approach (2)



One can prove that this approach leads to exactly the same solution
as the one induced by the max-variance criterion (we skip this
derivation).

Choosing the parameter `


I Heuristic: look at the largest eigenvalues, and take the most “informative” ones. One can also show that the reconstruction error is bounded as

∑_{i} kxi − π` xi k2 ≤ ∑_{k>`} λk

[Figure: singular values of the data matrix, as shown above]

Choosing the parameter ` (2)


I If PCA is used as a preprocessing step for supervised learning,
then use cross validation to set the parameter `!

Note: It is not a priori clear whether it is better to choose ` large or small ...

Global!
Keep in mind that PCA optimizes global criteria.
I No guarantees what happens to individual data points. This is
different for some other dimensionality reduction methods
(such as random projections and Johnson-Lindenstrauss).

I If the sample size is small, then outliers can have a large effect
on PCA.

When does it (not) make sense?


I The PCA works best if the data comes from a Gaussian.
[Figure (slide excerpt): given a set of m centered vectors xi ∈ Rn , PCA diagonalizes the covariance matrix C = (1/m) ∑_{i=1}^{m} xi xi0 ; this requires solving an eigenvalue problem, and the principal components vi define a new basis along directions of maximal variance]

When does it (not) make sense? (2)


I But it can have very bad effects if the data is far from
Gaussian:

Towards kernel PCA


Now we want to kernelize the PCA algorithm to be able to have
non-linear principal components.

Observe:
I PCA uses the covariance matrix — and this matrix inherently
uses the actual coordinates of the data points.
I So how should we be able to kernelize PCA???

I The solution will be: there is a tight relationship between the covariance matrix and the kernel matrix.

Recap: Covariance matrix vs. kernel matrix


Consider centered data points x1 , ..., xn , stacked in a data matrix
X as rows. Denote the k-th column of the matrix by X (k) (contains
the k-th coordinate of all data points). Then:
I Covariance matrix is C = X 0 X because

Ckl = Cov1dim (X (k) , X (l) ) = ∑_{i=1}^{n} Xi(k) Xi(l) = (X 0 X)kl

Also note that because Xi(k) Xi(l) = (xi x0i )kl , this implies X 0 X = ∑_{i=1}^{n} xi x0i (a d × d matrix).

I Kernel matrix is K = XX 0

(because (XX 0 )ij = ∑_{k=1}^{d} xik xjk = x0i xj = hxi , xj i).

Recap: Covariance matrix vs. kernel matrix (2)


What we now try to do is to express the eigenvalues/eigenvectors
of C by those of K and vice versa.

Eig of K implies eig of C


Proposition 15 (Eig of K implies eig of C)
Consider a set of points x1 , ..., xn ∈ Rd , λ ∈ R and a ∈ Rn with Ka = λa. Define v := X 0 a = ∑_{j=1}^{n} aj xj ∈ Rd . Then:

1. If v ≠ 0, then v is an eigenvector of C with eigenvalue λ, that is Cv = λv.

2. If kak = 1, then kvk2 = λ.

Proof of (1).

Ka = λa
⇐⇒ XX 0 a = λa
=⇒ X 0 XX 0 a = λX 0 a
⇐⇒ Cv = λv

Eig of K implies eig of C (2)


Proof of (2).

kvk2 = k ∑_j aj xj k2 = h ∑_j aj xj , ∑_i ai xi i = ∑_{i,j} ai aj hxi , xj i = ∑_{i,j} ai aj k(xi , xj ) = a0 Ka = a0 λa = λ

Bottom line: an eigenvector a of K gives rise to an eigenvector v of C, and to obtain a unit vector we have to normalize it by 1/√λ.

Eig of C implies eig of K


Proposition 16 (Eig of C implies eig of K)
Assume that the points xi are centered. Let v and λ be eigenvector
and eigenvalue of C, that is Cv = λv. Then the vector
a := (1/λ) Xv ∈ Rn is an eigenvector of K with eigenvalue λ, that is
Ka = λa.

Proof in several steps:



Eig of C implies eig of K (2)


Step 1: non-zero eigenvectors of C are linear combinations
of input points:
By assumption, λv = Cv, and because the data is centered we have C = ∑_{j=1}^{n} xj x0j . Hence:

v = (1/λ) Cv = (1/λ) ∑_{j=1}^{n} xj x0j v = (1/λ) ∑_{j=1}^{n} xj hxj , vi =: ∑_{j=1}^{n} aj xj

with aj = (1/λ) hxj , vi ∈ R (or more compactly, a = (1/λ) Xv ∈ Rn ).

Eig of C implies eig of K (3)


Step 2: express eig of C as eig of K:
Cv = λv
⇐⇒ ( ∑_{j=1}^{n} xj x0j ) ( ∑_{i=1}^{n} ai xi ) = λ ∑_{i=1}^{n} ai xi
⇐⇒ ∑_{i,j=1}^{n} xj x0j ai xi = λ ∑_{i=1}^{n} ai xi

(now multiply with x0s for some s)

=⇒ x0s ( ∑_{i,j} ai xj hxj , xi i ) = λ ∑_{i} ai x0s xi
⇐⇒ ∑_{i,j} ai hxs , xj ihxj , xi i = λ ∑_{i} ai hxi , xs i
⇐⇒ (K 2 a)s = λ(Ka)s

Eig of C implies eig of K (4)


Thus we obtain:
v = ∑_i ai xi eig of C
=⇒ ∀s = 1, ..., n : (K 2 a)s = λ(Ka)s
⇐⇒ K 2 a = λKa
⇐⇒ Ka = λa (as K is positive definite)
⇐⇒ a is eigenvector of K with eigenvalue λ.

(∗) Sanity checks


Let’s apply the two propositions one after the other:
I Assume Cv = λv, set a := (1/λ) Xv. Then by Prop. 16, Ka = λa. Now set ṽ = X 0 a. Then by Prop. 15, C ṽ = λṽ. Note that this makes sense because

ṽ = X 0 a = X 0 (1/λ) Xv = (1/λ) X 0 Xv = (1/λ) Cv = v
I Dimensions:
I C is a d × d-matrix, so its eigendecomposition has d
eigenvalues.
I K is a n × n matrix with n eigenvalues.
I But intuitively speaking, we just showed that we can convert the eigenvalues of K to those of C and vice versa.
HOW CAN THIS BE, IF d AND n ARE DIFFERENT???

First algorithm: eigs of C by eigs of K


Assume that the data points are centered in Rd . Then to compute
the `-th eigenvector of C, we can proceed as follows:
I Compute the kernel matrix K and its `-th eigenvector a

I Make sure a is normalized to kak = 1.

I Then compute v := (1/√λ) ∑_i ai xi

Note: this “algorithm” still requires us to know the original vectors xi .


However, in practice we don’t want to compute the eigenvectors
themselves but just the projections on these eigenvectors. Let’s
look at it.

Expressing the projection on eigs of C using K


I Assume we want to project on eigenvector v of C. Have already seen that we can write v = ∑_i ai xi . Thus:

πv (xi ) = v 0 xi = h ∑_j aj xj , xi i = ∑_j aj hxj , xi i

I If we want to project on a subspace spanned by ` vectors v1 , ..., v` ∈ Rd , compute each of the ` coordinates by this formula as well.
formula as well.
I To compute the projections, we only need scalar products.
So we can write PCA as a kernel algorithm:

Finally: kernel PCA


Input: Kernel matrix K (computed from abstract data points
X1 , ..., Xn ), parameter `
I Center the data in feature space by computing the centered
kernel matrix K̃ = K − 1n K − K1n + 1n K1n
I Compute the eigendecomposition K̃ = ADA0 . Let Ak denote
the k-th column of A and λk the corresponding eigenvalue.

I Define the matrix V` which has the columns Ak /√λk , k = 1, ..., `
I Compute the low-dimensional representation points yi = (yi1 , ..., yi` )0 with the formula yis = ∑_{j=1}^{n} vjs K̃ji (for all s = 1, ..., `).

Output: y1 , ..., yn ∈ R`
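A minimal NumPy sketch of these steps (my own illustrative code, assuming the kernel matrix K has already been computed and its top ` eigenvalues are positive; not the lecture demo):

import numpy as np

def kernel_pca(K, ell):
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # center in feature space
    eigvals, A = np.linalg.eigh(K_tilde)                      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:ell]
    V = A[:, order] / np.sqrt(eigvals[order])                 # columns A_k / sqrt(lambda_k)
    Y = K_tilde @ V                                           # y_i^s = sum_j v_js * K_tilde_ji
    return Y                                                  # rows are the new points in R^ell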

kPCA toy example: three Gaussians


demo_kpca_bernhard.m
I Data drawn from 2-dim Gaussians (red crosses)

I kernel used is Gaussian kernel

I Now can compute the first eigenvectors in kernel feature space

I For test points, plot the coordinate which results when projecting on this eigenvector in the feature space (grey scale)
kPCA toy example: three Gaussians (2)

[Figure: grey-scale plots of the projections onto the first kernel PCA eigenvectors for the three-Gaussians toy data]

kPCA toy example: rings


demo_kpca_toy.m:
Consider the following three-dimensional data set:
[Figure: “Original data” — scatter plot of the ring-shaped data set]

kPCA toy example: rings (2)


Now apply kPCA:
I Choose the Gaussian kernel with σ = 2.

I Note: we implicitly work in the RKHS, which has n dimensions

I So it makes sense to choose ` = 3 (even though the original data set just had d = 2).

kPCA toy example: rings (3)


Here is the result:
[Figure: “3 dim kpca” — the data embedded in three dimensions by kernel PCA]

Surprising, isn’t it???



More toy examples


[Figure: “3 dim kpca” — kernel PCA embedding of another toy data set]

More toy examples (2)


[Figure: “3 dim kpca” — kernel PCA embedding of another toy data set]

More toy examples (3)


[Figure: “3 dim kpca” — kernel PCA embedding of another toy data set]

History
I Classical PCA was invented by Pearson:
On Lines and Planes of Closest Fit to Systems of Points in
Space. Philosophical Magazine, 1901.
I It is one of the most popular “classical” techniques for data
analysis.
I Kernel PCA was invented pretty much 100 years later ,

B. Schölkopf, A. Smola, and K.-R. Müller. Kernel Principal


component Analysis. In B. Schölkopf, C. J. C. Burges, and A.
J. Smola, editors, Advances in Kernel Methods–Support Vector
Learning, pages 327-352. MIT Press, Cambridge, MA, 1999.

Summary: PCA and kernel PCA


Standard PCA:
I Technique to reduce the dimension of a data set in Rd by linear
projections.
I First explanation: throw away dimensions with “low variance”

I Second explanation: minimize the squared error.

I Can be computed by an eigendecomposition of the empirical covariance matrix.
Kernel PCA:
I Use the kernel trick to make PCA non-linear.

Multi-dimensional scaling
Literature:
Multi-dimensional scaling:
I Is a classic that is covered in many books on data analysis.

I A whole book on the subject: Borg, Groenen: Modern multidimensional scaling. Springer, 2005.
Isomap:
I The original paper is: J. Tenenbaum, V. De Silva, J. Langford.
A global geometric framework for nonlinear dimensionality
reduction. Science, 2000.

Embedding problem
I Assume we are given a distance matrix D ∈ Rn×n that
contains distances dij = kxi − xj k between data points.
I Can we “recover” the points (xi )i=1,...,n ∈ Rd ?
This problem is called (metric) multi-dimensional scaling.

A more general way of asking: Given abstract “objects” x1 , ..., xn ∈ X , can we find an embedding Φ : X → Rd (for some d) such that kΦ(xi ) − Φ(xj )k = dij ?

DO YOU BELIEVE IT ALWAYS WORKS?



Embedding problem (2)


Answer will be:
I We can find a correct point configuration if the distances really
come from points ∈ Rd . In this case we say that D is a
Euclidean distance matrix. See next slide for how this works.
I For general distance matrices D, we cannot achieve such an
embedding without distorting the data. There is a large body of literature on approximate embeddings, but we won’t cover it in this lecture.

Embedding problem (3)


WHY DO YOU THINK SUCH AN EMBEDDING MIGHT BE
USEFUL?

Embedding problem (4)


Why might we be interested in such an embedding?
I Visualization!

I Many algorithms are just defined for Euclidean data. If we want to apply them, we need to find a Euclidean representation of our data.
I Identify low-dimensional structure, see Isomap below.

What might be problematic about it?


I We might introduce distortion to the data ...

MDS in various flavors


I Classic MDS: we assume that the given distance matrix is
Euclidean.
I If the matrix is Euclidean, embedding will be exact.
I If the matrix is not Euclidean, embedding will make some
errors.
I Metric MDS: we are given any distance matrix (might be non-Euclidean). We try to find an embedding that approximately preserves all distances.
I In case the original matrix is non-Euclidean, perfect
reconstruction is impossible, so we definitely will make
approximation errors.
I In case the original matrix is Euclidean, we might still make
errors due to the formulation of the problem, see below.
I Non-metric MDS: we are not given distances, but ordinal
information, see below.

Classic MDS
Assume we are given a Euclidean distance matrix D. Will now see
how to express the entries of the Gram matrix S = (hxi , xj i)ij=1,...,n
in terms of entries of D:
I By definition:

d2ij = kxi − xj k2 = hxi − xj , xi − xj i = hxi , xi i + hxj , xj i − 2hxi , xj i

I Rearranging gives
hxi , xj i = (1/2) ( hxi , xi i + hxj , xj i − d2ij ) = (1/2) ( d(0, xi )2 + d(0, xj )2 − d2ij )

Classic MDS (2)


I We are free to choose the origin 0 as we want. For simplicity,
we choose the first data point x1 as the origin. This gives:

hxi , xj i = (1/2) ( d21i + d21j − d2ij )

I So we can express the entries of the Gram matrix S with sij = hxi , xj i in terms of the given distance values.


I Because S is positive semi-definite, we can decompose S in the form S = XX 0 where X ∈ Rn×d .
EXERCISE: HOW EXACTLY DO YOU DO THIS? WHAT IS
THE DIMENSION d GOING TO BE?
I The rows of X are what we are looking for, that is we set the
embedding of point xi as the i-th row of the matrix X.

Classic MDS implementation


This is how it finally works:
I Compute the matrix S with sij = (1/2) ( d21i + d21j − d2ij ).
I Compute the eigenvalue decomposition S = V ΛV 0 .

I Define X = V √Λ.
Alternatively, if you want to fix some dimension d ≤ n, set Vd to be the first d columns of V and Λd the d × d diagonal matrix with the first d eigenvalues on the diagonal, and then set X = Vd √Λd .
I Row i of X then gives the coordinates of the embedded point
xi .
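A small NumPy sketch of classic MDS as described above (illustrative only; here D is the matrix of pairwise distances dij , and negative eigenvalues are clipped to 0 as a safeguard for non-Euclidean input):

import numpy as np

def classic_mds(D, d):
    # S_ij = 0.5 * (d_1i^2 + d_1j^2 - d_ij^2), using the first point as origin
    D2 = D ** 2
    S = 0.5 * (D2[0][:, None] + D2[0][None, :] - D2)
    eigvals, V = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:d]            # the d largest eigenvalues
    X = V[:, order] * np.sqrt(np.maximum(eigvals[order], 0))
    return X                                         # row i = coordinates of embedded point x_i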

Classic MDS implementation (2)


How to choose d?
I If the data points come from Rd , then the matrix S is going to
have rank d, that is there are d eigenvalues > 0 and n − d
eigenvalues equal to 0.
I Hence, looking at the spectrum of S gives you an idea to
choose d. In case of classic MDS, you can just read off d from
the matrix, in the more general case of metric MDS you simply

“choose it reasonable” (as in PCA).



Demos
demo_mds.m

Metric MDS
Metric MDS refers to the problem where the distance matrix D is
no longer Euclidean, but we still believe (hope) that a good
embedding exists.
I If the distance matrix D is not Euclidean, we will not be able
to recover an exact embedding.
I Instead, one defines a “stress function”. Below is an example of such a stress function:

stress(embedding) = ∑_{ij} (kxi − xj k − dij )2 / ∑_{ij} kxi − xj k

Many more stress functions are considered in the literature.


I Then we try to find an embedding x1 , ..., xn with small stress
by a standard non-convex optimization algorithm, say gradient
descent.

Metric MDS (2)


When using metric MDS, there are two sources of error:
I The distance matrix is not Euclidean, so we will not be able to
recover a perfect embedding.
I The optimization problems are highly non-convex and suffer all
kinds of problems of local optima.
Using metric MDS only makes sense if the data is “nearly
Euclidean”, and results should always be treated with care.

Non-metric MDS
I Instead of distance values, we are just given distance
comparisons, that is we know whether dij < dik or vice versa.
I The task is then to find an embedding such that these ordinal
relationships are preserved.
I Our group is working on this problem ,

Which of the bottom images is most similar to the top image?



History of MDS
I Metric MDS: Torgerson (1952) - The first well-known MDS
proposal. Fits the Euclidean model.
I Non-metric MDS: Shepard (1962) and Kruskal (1964)

Outlook: general embedding problems


There exists a huge literature on embedding metric spaces in
Euclidean spaces:
I Given certain assumptions on the metric ...
I In what space can I embed (dimension???)
I What are the guarantees on the distortion?
Some literature pointers:

I Theorem of Bourgain: Any n-point metric space can be embedded in Euclidean space with distortion O(log n). By the theorem of Johnson-Lindenstrauss, we can achieve dimensionality of O(log n) as well.
I An overview paper on the area is: Piotr Indyk and Jiri
Matousek. Low-distortion embeddings of finite metric spaces.
In: Handbook of Discrete and Computational Geometry, 2004.

Summary MDS
I Given a distance matrix D, MDS tries to construct an
embedding of the data points in Rd such that the distances are
preserved as well as possible.
I If D is Euclidean, a perfect embedding can easily be
constructed.
I If D is not Euclidean, MDS tries to find an embedding that minimizes the “stress”. The resulting problem is highly non-convex. Solutions should be treated with care.
Graph-based machine learning algorithms: introduction

Neighborhood graphs
Given the similarity or distance scores between our objects, we want to build a graph based on them.
I Vertices = objects

I Edges between objects in the same “neighborhood”

Different variants:

I directed k-nearest neighbor graph: connect xi by a directed edge to its k nearest neighbors (or to the k points with the largest similarity).

Neighborhood graphs (2)


Note that this graph is not symmetric. Many algorithms need
undirected graphs (in particular, spectral methods). To make
it undirected:
I Standard k-nearest neighbor graph: put an edge between xi
and xj if xi is among the k nearest neighbors of xj OR vice
versa.
I Mutual k-nearest neighbor graph: put an edge between xi and xj if xi is among the k nearest neighbors of xj AND vice versa.

Neighborhood graphs (3)


Alternatively, we can use the ε-graph:
I Connect each point to all other points that have distance
smaller than ε (or similarity larger than some threshold e)

Note: all these neighborhood graphs can be built based on similarities or based on distances.
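For illustration, a small NumPy sketch of how these graphs can be built from a distance matrix (my own code, not from the lecture; returns 0/1 adjacency matrices):

import numpy as np

def knn_graph(D, k, mutual=False):
    # D: (n, n) distance matrix with zero diagonal
    n = D.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(D[i])[1:k+1]      # skip the point itself
        W[i, neighbors] = 1
    # symmetrize: OR for the standard kNN graph, AND for the mutual kNN graph
    return np.minimum(W, W.T) if mutual else np.maximum(W, W.T)

def eps_graph(D, eps):
    W = (D < eps).astype(float)
    np.fill_diagonal(W, 0)
    return W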

Neighborhood graphs (4)


[Figure: example graphs on the same data set — kNN graph (k=5), mutual kNN graph (k=5), mutual kNN graph (k=10), and eps-graph (eps=0.4)]

Neighborhood graphs (5)


Edge weights:
I A priori, all the graphs above are unweighted.

I On kNN graphs, it often makes sense to use similarities as edge weights.
(Reason: edges have very diverse “lengths”, and we want to
tell this to the algorithm; e.g., spectral clustering is allowed to
cut long edges more easily than short edges)

I Never use distances as weights! (this destroys the “logic” behind a neighborhood graph: no edge means “far away”, and no edge is the same as edge weight 0 ... )
I For ε-graphs, edge weights do not make so much sense
because all edges are more or less “equally long”. The edge
weights then do not carry much extra information.

Neighborhood graphs (6)


Why are we interested in similarity graphs?
I Sparse representation of the similarity structure

I Graphs are well-known objects, lots of algorithms to deal with them.
I Similarity graph encodes local structure, goal of machine
learning (unsupervised learning) is to make statements about
its global structure.

I There exist many algorithms for machine learning on graphs:


I Clustering: Spectral clustering (see later)
I Dimensionality reduction: Isomap, Laplacian eigenmaps,
Maximum Variance Unfolding
I Semi-supervised learning: label propagation (not treated in
the lecture)

Isomap
Literature:
I Original paper:
J. Tenenbaum, V. De Silva, J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 2000.

Isomap
We often think that data is “inherently low-dimensional”:
I Images of a tea pot, taken from all angles. Even though the
images live in R256 , say, we believe they sit on a manifold
corresponding to a circle:

Isomap (2)
I A phenomenon generates very high-dimensional data, but the
“effective number of parameters” is very low

Isomap (3)

[Figure: images of a hand, parameterized by two degrees of freedom — “Wrist rotation” and “Fingers extension”]

Isomap (4)
More abstractly:
I We assume that the data lives in a high-dimensional space, but
effectively just sits on a low-dimensional manifold
I We would like to find a mapping that recovers this manifold.

I If we could do this, then we could reduce the dimensionality in a very meaningful way.

Isomap (5)
Figure 1: The problem of nonlinear dimensionality reduction, as illustrated for three-dimensional data (B) sampled from two-dimensional manifolds (A). An unsupervised learning algorithm must discover the global internal coordinates of the manifold without external signals that suggest how the data should be embedded in two dimensions. The LLE algorithm described in this paper discovers the neighborhood-preserving mappings shown in (C); the color coding reveals how the data is embedded in two dimensions.

The Isomap algorithm


Intuition:
I In a small local region, Euclidean (extrinsic) distances between
points on a manifold approximately coincide with the intrinsic
distances. We want to keep the local distances unchanged.
I This is no longer the case for large distances: we want to keep the intrinsic (geodesic) distances rather than the ones in the ambient space.

The Isomap algorithm (2)


I If we want to “straighten” a manifold, we need to embed it in
such a way that the Euclidean distance after embedding
corresponds to the geodesic distance on the manifold.
So we would like to discover the geodesic distances in the manifold.

The Isomap algorithm (3)


I To discover the geodesic distances in the manifold:
I Build a kNN graph on the data
I Use the shortest path distance in this graph.
I Idea is: it goes “along” the manifold.

The Isomap algorithm (4)

(figure by Matthias Hein)



The Isomap algorithm (5)


The algorithm:
I Given some abstract data points X1 , ..., Xn and a distance
function d(xi , xj ).
I Build a k-nearest neighbor graph where the edges are weighted
by the distances. These are the local distances.
I In the kNN graph, compute the shortest path distances dsp between all pairs of points and write them in the matrix D. They correspond to the geodesic distances.


I Then apply metric MDS with D as input. Finds embedding
that preserves the geodesic distances.
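A compact NumPy sketch of these steps (illustrative only; the shortest paths are computed with a simple Floyd–Warshall loop, and the final embedding step reuses the classic_mds sketch from the MDS section above as a stand-in for the MDS step):

import numpy as np

def isomap(D, k, d):
    # D: (n, n) pairwise distances, k: number of neighbors, d: target dimension
    n = D.shape[0]
    W = np.full((n, n), np.inf)                  # no edge = infinite length
    for i in range(n):
        nb = np.argsort(D[i])[1:k+1]
        W[i, nb] = D[i, nb]                      # kNN graph weighted by the local distances
    W = np.minimum(W, W.T)                       # make the graph undirected
    np.fill_diagonal(W, 0)
    for m in range(n):                           # Floyd-Warshall: all-pairs shortest paths
        W = np.minimum(W, W[:, m:m+1] + W[m:m+1, :])
    return classic_mds(W, d)                     # embed the geodesic distances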

Theoretical guarantees
In the original paper (supplement) the authors have proved:
I If the data points X1 , ..., Xn are sampled uniformly from a
“nice” manifold, then as n → ∞ and k ≈ log n, the shortest
path distances in the kNN graph approximate the geodesic
distances on the manifold.
I Under some geometric assumptions on the manifold, MDS then recovers an embedding with distortion converging to 0.

(Attention, the dimension is an issue here. Typically, we


cannot embed a manifold without distortion in a space of the
intrinsic dimension, we need to choose the dimension larger).

Demos
I demo_isomap.m
I Toolbox by Laurens van der Maaten:
I addpath(genpath(’/Users/ule/matlab_ule/
downloaded_packages_not_in_path/dim_reduction_
toolbox/’)); then call drgui
I Use swiss roll and helix, play with k
Summer 2019

History
I Manifold methods became fashionable in machine learning in
the early 2000s.
I Isomap was invented in 2000.
I Since then, a large number of manifold-based dimensionality
reduction techniques has been invented:
Locally linear embedding, Laplacian eigenmaps, Hessian
Ulrike von Luxburg: Statistical Machine Learning

eigenmaps, Diffusion maps, Maximum Variance Unfolding, and


many more ...
I There exists a nice matlab toolbox that implements all these
algorithms, written by Laurens van der Maaten.
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_
for_Dimensionality_Reduction.html
To call the demo:
I addpath(genpath(’/Users/ule/matlab_ule/downloaded_packages_not_in_path/
dim_reduction_toolbox/’))
I call drgui

Summary Isomap
I Unsupervised learning technique to extract the manifold
structure from distance / similarity data
I Intuition: local distances define the intrinsic geometry, shortest paths in a kNN graph correspond to geodesics.
I MDS then tries to find an appropriate embedding.

(*) Maximum Variance Unfolding: SKIPPED



(*) tSNE: SKIPPED



(*) Johnson-Lindenstrauss: SKIPPED


(*) Ordinal embedding (GNMDS, SOE, t-STE): SKIPPED

Clustering

Data clustering
Data clustering is one of the most important problems of
unsupervised learning.
I Given just input data X1 , ..., Xn

I We want to discover groups (“clusters”) in the data such that points in the same cluster are “similar” to each other and points in different clusters are “dissimilar” from each other.
I Important: a priori, we don’t have any information (training labels) about these groups, and often we don’t know how many groups there are (if any).

Data clustering (2)


Applications:
I Find “genres” of songs

I Find different “groups” of customers

I Find two different types of cancer, based on gene expression data
I Discover proteins that have a similar function

I ...

Data clustering (3)


Main reasons to do this:
I Improve your understanding of the data! Exploratory data
analysis.
I Reduce the complexity of the data. Vector quantization.
For example, instead of training on a set of 10^6 customers, use 1000 “representative” customers.
I Break your problem into subproblems and treat each cluster individually.

Example: Clustering gene expression data



M. Eisen et al., PNAS, 1998



Example: Protein interaction networks



(from http://www.math.cornell.edu/ durrett/RGD/RGD.html)



Example: Social networks


Corporate email communication (Adamic and Adar, 2005)

Example: Image segmentation



(from Zelnik-Manor/Perona, 2005)



Example: Genetic distances between mammals


[Figure: dendrogram of genetic distances — the evolutionary tree built from mammalian mtDNA sequences, grouping marsupials and monotremes, rodents, ferungulates, and primates]

cf. Chen/Li/Ma/Vitanyi (2004)

Most common approach


Have two goals:
I Want distances between points in the same cluster to be small

I Want distances between points in different clusters to be large

Naive approach:
I Define a criterion that measures these distances and try to find
the best partition with respect to this criterion. Example:

minimize (average within-cluster distances) / (average between-cluster distances)
Problem:
I Which objective to choose?

I Most such optimization problems are NP hard (combinatorial optimization).

K-means and kernel k-means


Literature:
Tibshirani/Hastie/Friedman

Standard k-means algorithm



k-means objective
I Assume we are given data points X1 , ..., Xn ∈ Rd
I Assume we want to separate it into K groups.
I We want to construct K class representatives (class means)
m1 , ..., mK that represent the groups.
I Consider the following objective function:
min_{m1 ,...,mK ∈Rd } ∑_{k=1}^{K} ∑_{i∈Ck } kXi − mk k2

That is, we want to find the centers such that the sum of squared distances of the data points to their closest centers is minimized.


Lloyd’s algorithm (k-means algorithm)


The following heuristic is typically used to find a local optimum of
the k-means objective function:
I Start with randomly chosen centers.
I Repeat the following two steps until convergence:
I Assign all points to the closest cluster center.
I Define the new centers as the mean vectors of the current clusters.

Lloyd’s algorithm (k-means algorithm) (2)


[Figure: successive iterations of Lloyd’s algorithm — each row shows the previous step, the updated clustering for fixed centers, and the updated centers based on the new clustering]

Lloyd’s algorithm (k-means algorithm) (3)


[Figure: a further iteration of Lloyd’s algorithm, same panels as above]

Lloyd’s algorithm (k-means algorithm) (4)


The formal k-means algorithm:
1 Input: Data points X1 , ..., Xn ∈ Rd , number K of clusters to construct.
2 Randomly initialize the centers m1(0) , ..., mK(0) .
3 while not converged
4   Assign each data point to the closest cluster center, that is define the clusters C1(i+1) , ..., CK(i+1) by
    Xs ∈ Ck(i+1) ⇐⇒ kXs − mk(i) k2 ≤ kXs − ml(i) k2 , l = 1, ..., K
5   Compute the new cluster centers by
    mk(i+1) = (1 / |Ck(i+1) |) ∑_{s∈Ck(i+1)} Xs
6 Output: Clusters C1 , ..., CK
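A minimal NumPy version of this algorithm (my own illustrative sketch, not the lecture's demo_kmeans(); empty clusters keep their old center, cf. the heuristics discussed later):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, K, replace=False)]            # random initialization
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)                        # assign to closest center
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):               # converged
            break
        centers = new_centers
    return labels, centers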

Lloyd’s algorithm (k-means algorithm) (5)


matlab demo: demo_kmeans()

K-means algorithm — Termination


Proposition 17 (Termination)
Given a finite set of n points in Rd . Then the k-means algorithm
terminates after a finite number of iterations.
Proof sketch.
I In each iteration of the while loop, the objective function decreases.
I There are only finitely many partitions we can inspect.
I So the algorithm has to terminate.
,

K-means algorithm — Solutions can be arbitrarily bad
Proposition 18 (Bad solution possible)
The algorithm ends in a local optimum which can be an arbitrary
factor away from the global solution.
Proof.

I We give an example with four points in R, see figure on the next slide.
I By adjusting the parameters a and b and c we can achieve an
arbitrarily bad ratio of global and local solution.

K-means algorithm — Solutions can be arbitrarily bad (2)
Data set: four points on the real line:
[Figure: four points X1 , X2 , X3 , X4 on the real line, with gaps a, b, c between consecutive points]

Optimal solution: (initialization: X1, X2, X3; value of solution: c^2 / 2 )

m1 m2 m3

Bad solution: (initialization: X1, X3, X4; value of solution: a^2 / 2 )

m1 m2 m3

K-means algorithm — Initialization


Methods to select initial centers:
I Most common: randomly choose some data points as starting
centers.
I Much better: Farthest first heuristic:

1 S = ∅ # S set of centers
2 Pick x uniformly at random from the data points, S = {x}
3 while |S| < k
4   for all x ∈ X \ S
5     Compute D(x) := min_{s∈S} kx − sk2 .
6   Select the next center y with probability proportional to D(x) among the remaining data points, and add it: S = S ∪ {y}
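A sketch of this seeding rule in NumPy (my own illustrative code; D(x) is the squared distance to the closest center chosen so far, and the next center is drawn with probability proportional to D(x)):

import numpy as np

def farthest_first_seeding(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                  # first center uniformly at random
    while len(centers) < k:
        C = np.array(centers)
        D = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = D / D.sum()                         # probability proportional to D(x)
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)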

K-means algorithm — Initialization (2)


The k-means algorithm with this heuristic is called kmeans++
and satisfies nice approximation guarantees.
I Initialize the centers using the solution of an even simpler
clustering algorithm.
I Ideally have prior knowledge, for example that certain points
are in different clusters.

K-means algorithm — Heuristics for practice


As is standard for highly non-convex optimization problems, in practice we restart the algorithm many times with
different initializations. Then we use the best of all these runs as
our final result.

K-means algorithm — Heuristics for practice (2)


Common problem:
I In the course of the algorithm it can happen that a center
“loses” all its data points (no point is assigned to the center
any more).
I In this case, one either restarts the whole algorithm, or
randomly replaces the empty center by one of the data points.

K-means algorithm — Heuristics for practice (3)


Local search heuristics to improve the result once the algorithm has
terminated:
I Restart many times with different initializations.
I Swap individual points between clusters.
I Remove a cluster center, and introduce a completely new
center instead.

I Merge clusters, and additionally introduce a completely new cluster center.
I Split a cluster in two pieces (preferably, one which has a very
bad objective function). Then reduce the number of clusters
again, for example by randomly removing one.

k-means minimizes within-cluster distances


Another way to understand the k-means objective:

Proposition 19 (k-means and within-cluster distances)


The following two optimization problems are equivalent:

1. Find a discrete partition of the data set such that the within-cluster distances are minimized:

min_{C1 ,...,CK } ∑_{k=1}^{K} (1/(2|Ck |)) ∑_{i∈Ck ,j∈Ck } kXi − Xj k2

2. Find cluster centers such that the distances of the data points to these centers are minimized:

min_{m1 ,...,mK ∈Rd } ∑_{k=1}^{K} ∑_{i∈Ck } kXi − mk k2

Proof. Elementary, but a bit lengthy, we skip it.



k-means leads to Voronoi partitions


Observe that the partition induced by the k-means objective
corresponds to a Voronoi partition of the space:

k-means leads to Voronoi partitions (2)


This has two important consequences:
I all cluster boundaries are linear (WHY MIGHT THIS BE
INTERESTING?)
I The k-means algorithm always constructs convex clusters! This
gives intuition about when it works and when it doesn’t work:

k-means leads to Voronoi partitions (3)

[Figure: examples illustrating when k-means works and when it does not]

K-means — computational complexity


I Finding the global solution of the k-means optimization
problem is NP hard (both if k is fixed or variable, and both if
the dimension is fixed or variable).

This is curious because one can prove that there only exist
polynomially many Voronoi partitions of any given data set.
The difficulty is that we cannot construct any enumeration to search through them.

See the following paper and references therein:


Mahajan, Meena and Nimbhorkar, Prajakta and Varadarajan,
Kasturi: The planar k-means problem is NP-hard. WALCOM:
Algorithms and Computation, 2009.

K-means — computational complexity (2)


I On the other hand, optimizing the k-means objective has
polynomial smoothed complexity.
Arthur, David and Manthey, Bodo and Röglin: k-Means has
polynomial smoothed complexity. FOCS 2009.

I With careful seeding, one can achieve good approximation guarantees:
I Consider the random farthest first rule for initialization (kmeans with this initialization is called kmeans++).
I Then, the expected objective value is at most a factor
O(log k) worse than the optimal solution.
I Reference:
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of
careful seeding. In: Symposium on Discrete Algorithms
(SODA), 2007.

More variants of K-means


I K-median: here the centers are always data points. Can be
used if we only have distances, but no coordinates of data
points.
I weighted K-means: introduce weights for the individual data
points
I kernel-K-means: the kernelized version of K-means (recall that in standard K-means all boundaries between clusters are linear).


I. S. Dhillon, Y. Guan, and B. Kulis, Kernel k-means, spectral
clustering and normalized cuts. KDD, 2004.
I soft K-means: no hard assignments, but “soft” assignments
(often interpreted as “probability” of belonging to a certain
cluster)
I Note: K-means is a simplified version of the EM-algorithm
which fits a Gaussian mixture model to the data.

Summary K-means
I Represent clusters by cluster centers
I Highly non-convex NP hard optimization problem
I Heuristic: Lloyd’s k-means algorithm
I Very easy to implement, hence very widely used.
I In my opinion: k-means works well for vector quantization (if you want to find a large number of clusters, say 100 or so). It does not work so well for small k; here you should consider spectral clustering.

Linkage algorithms for hierarchical clustering



Hierarchical clustering
Goal: obtain a complete hierarchy of clusters and sub-clusters in
form of a dendrogram
[Figure: dendrogram — the evolutionary tree built from mammalian mtDNA sequences; cf. Chen/Li/Ma/Vitanyi (2004)]

Simple idea
Agglomerative (bottom-up) strategy:
I Start: each point is its own cluster

I Then check which points are closest and “merge” them to form a new cluster
I Continue, always merge two “closest” clusters until we are left
with one cluster only

Simple idea (2)


To define which clusters are “closest”:

Single linkage: dist(C, C 0 ) = min_{x∈C, y∈C 0 } d(x, y)

Average linkage: dist(C, C 0 ) = ( ∑_{x∈C, y∈C 0 } d(x, y) ) / ( |C| · |C 0 | )

Complete linkage: dist(C, C 0 ) = max_{x∈C, y∈C 0 } d(x, y)

[Figure: for each linkage, an illustration of which pair of points determines the distance between the two clusters]

Linkage algorithms – basic form


Input:
• Distance matrix D between data points (size n × n)
• function dist to compute a distance between clusters (usually
takes D as input)
Initialization: Clustering C (0) = {C1(0) , ..., Cn(0) } with Ci(0) = {i}.
While the current number of clusters is > 1:

• find the two clusters which have the smallest distance to each
other
• merge them to one cluster

Output: Resulting dendrogram
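A compact Python sketch of this procedure (illustrative only; the cluster distance is passed in as a function, O(n^3) and fine for small data sets):

import numpy as np

def linkage_clustering(D, dist):
    # D: (n, n) distance matrix; dist(A, B, D): distance between clusters A, B (lists of indices)
    clusters = [[i] for i in range(D.shape[0])]
    merges = []                                    # records the dendrogram structure
    while len(clusters) > 1:
        a, b = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]], D))
        merges.append((clusters[a], clusters[b]))  # merge the two closest clusters
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

single_linkage   = lambda A, B, D: D[np.ix_(A, B)].min()
average_linkage  = lambda A, B, D: D[np.ix_(A, B)].mean()
complete_linkage = lambda A, B, D: D[np.ix_(A, B)].max()
# usage: merges = linkage_clustering(D, single_linkage)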



Examples
... show matlab demos ...
demo_linkage_clustering_by_foot()
demo_linkage_clustering_comparison()

Linkage algorithms tend to be problematic


Observations from practice:
I Linkage algorithms are very vulnerable to outliers

I One cannot “undo” a bad link

Theoretical considerations:
I Linkage algorithms attempt to estimate the density tree

I Even though this can be done in a statistically consistent way, estimating densities in high dimensions is extremely problematic and usually does not work in practice.

History and References


I The original article: S. C. Johnson. Hierarchical clustering
schemes. Psychometrika, 2:241 - 254, 1967.
I A complete book on the topic: N. Jardine and R. Sibson.
Mathematical taxonomy. Wiley, London, 1971.
I Nice, more up-to-date overview with application in biology: J.
Kim and T. Warnow. Tutorial on phylogenetic tree estimation.

ISMB 1999.

Linkage algorithms — summary


I Attempt to estimate the whole cluster tree
I There exist many more ways of generating different trees from
a given distance matrix.
I Advantage of tree-based algorithms: do not need to decide on
“the correct” number of clusters, get more information than
just a flat clustering

I However, one should be very careful about the results because they are very unstable, prone to outliers and statistically unreliable.

A glimpse on spectral graph theory


Literature:
I U. Luxburg. Tutorial on Spectral Clustering, Statistics and
Computing, 2007.

I F. Chung: Spectral Graph Theory (Chapters 1 and 2).


I D. Spielman: Spectral Graph Theory, 2011.
Book chapter whose official reference I was unable to verify, but you can simply download it from Dan Spielman’s homepage (be aware that there are many papers on this subject by him, make sure you download the one that looks like a tutorial).

What is it about?
General idea:
I many properties of graphs can be described by properties of the
adjacency matrix and related matrices (“graph Laplacians”).
I In particular, the eigenvalues and eigenvectors can say a lot
about the “geometry” of the graph.

Unnormalized Laplacians

Unnormalized Graph Laplacians: Definition


Consider an undirected graph with non-negative edge weights wij .
Notation:
I W :=the weight matrix of the graph
I D := diag(d1 , ...., dn ) the degree matrix of the graph
I L := D − W the unnormalized graph Laplacian matrix

Unnormalized Laplacians: Key property


Proposition 20 (Key property)
Let G be an undirected graph. Then for all f ∈ Rn ,

f t Lf = (1/2) ∑_{i,j=1}^{n} wij (fi − fj )2 .

Proof. Simply do the calculus:



Unnormalized Laplacians: Key property (2)

f t Lf = f t Df − f t W f
= ∑_i di fi2 − ∑_{i,j} fi fj wij
= (1/2) ( ∑_i ( ∑_j wij ) fi2 − 2 ∑_{i,j} fi fj wij + ∑_j ( ∑_i wij ) fj2 )
= (1/2) ∑_{i,j} wij (fi − fj )2

Why is it called “Laplacian”?


Where does the name “graph Laplacian” come from?
f t Lf = (1/2) ∑ wij (fi − fj )2

Interpret wij ∼ 1/d(Xi , Xj )2 :

f t Lf = (1/2) ∑ ((fi − fj )/dij )2

This looks like a discrete version of the standard Laplace operator:

hf, ∆f i = ∫ |∇f |2 dx

Hence the graph Laplacian measures the variation of the function f along the graph: f t Lf is low if points that are close in the graph have similar values fi .

Unnormalized Laplacians: Spectral properties


Proposition 21 (Simple spectral properties)
For an undirected graph with non-negative edge weights, the graph
Laplacian has the following properties:
I L is symmetric and positive semi-definite.

I Smallest eigenvalue of L is 0, corresponding eigenvector is 1 := (1, ..., 1)t .


I Thus eigenvalues 0 = λ1 ≤ λ2 ≤ ... ≤ λn .

Unnormalized Laplacians: Spectral properties (2)


Proof.

Symmetry: W is symmetric (graph is undirected), D is symmetric, so L is symmetric.

Positive Semi-Definite: by the key proposition:

f t Lf = (1/2) ∑_{ij} wij (fi − fj )2 ≥ 0
Smallest Eigenvalue/vector: It is indeed an eigenvector because


L1 = D1 − W 1 = 0
It is the smallest because all eigs are ≥ 0.

Unnormalized Laplacians and connected components
Proposition 22 (Relation between spectrum and clusters)
Consider an undirected graph with non-negative edge weights.
I Then the (geometric) multiplicity of eigenvalue 0 is equal to the number k of connected components A1 , ..., Ak of the graph.
I The eigenspace of eigenvalue 0 is spanned by the characteristic
functions 1A1 , ..., 1Ak of those components
(where 1Ai (j) = 1 if vj ∈ Ai and 1Ai (j) = 0 otherwise).
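The proposition is easy to check numerically. Here is a small illustrative sketch (my own code) that builds L = D − W for a graph with two connected components and inspects the spectrum:

import numpy as np

# weight matrix of a graph with two connected components: {0,1,2} and {3,4}
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))          # degree matrix
L = D - W                           # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
print(np.round(eigvals, 6))         # eigenvalue 0 appears twice (two components)
print(np.round(eigvecs[:, :2], 3))  # these two eigenvectors span the two indicator vectors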

Unnormalized Laplacians and connected components (2)
Proof, case k=1.
I Assume that the graph is connected.
I Let f be an eigenvector with eigenvalue 0.
I Want to show: f is a constant vector.

Here is the reasoning:


I By definition: Lf = 0.
I Exploiting this and the key proposition:
0 = f t Lf = ∑_{ij} wij (fi − fj )2

I The right hand side can only be 0 if all summands are 0.



Unnormalized Laplacians and connected components (3)
I Hence, for all pairs (i, j):
I either wij = 0 (that is, vi and vj are not connected by an
edge in the graph),
I or fi = fj .
Consequently: if vi and vj are connected in the graph, then
fi = fj .

In particular, f is constant on the whole connected component.



Unnormalized Laplacians and connected components (4)
Proof, case k > 1.
I If the graph consists of k disconnected components, both the
adjacency matrix and the graph Laplacian are block diagonal.
In particular, each little block is the graph Laplacian of the
corresponding connected component.

Unnormalized Laplacians and connected components (5)
I For each block (= each connected component), by the case
k = 1 we know that there is exactly one eigenvector for
eigenvalue 0, and it is constant:

I For the matrix L we then know that there are k eigenvalues 0, each one coming from one of the blocks. Padding the eigenvectors with zeros leads to the cluster indicator vectors:
Unnormalized Laplacians and connected components (6)

[Figure: the block-diagonal Laplacian and the zero-padded indicator eigenvectors]

Unnormalized Laplacians and connected components (7)
QUESTION:
Consider a graph with k connected components:

WHY IS THERE NO CONTRADICTION BETWEEN PROPOSITION 22 (SECOND STATEMENT) AND PROPOSITION 21???

Normalized Laplacians

Normalized graph Laplacian


For various reasons (see below) it is better to normalize the graph
Laplacian matrix.

Two versions:
I The “symmetric” normalized graph Laplacian

Lsym = D−1/2 LD−1/2



(where the square root of the diagonal matrix D can be computed entry-wise).

I The “random walk graph Laplacian”

Lrw = D−1 L

We will now see that both normalized Laplacians are closely related,
and have similar properties as the unnormalized Laplacian.

Normalized Laplacians: First properties


Proposition 23 (Adapted key property)
For every f ∈ Rn we have
f 0 Lsym f = (1/2) ∑_{i,j=1}^{n} wij ( fi /√di − fj /√dj )2 .

Proof. Similar to the unnormalized case. ,



Normalized Laplacians: First properties (2)


Proposition 24 (Simple spectral properties)
Consider an undirected graph with non-negative edge weights.
Then:
1. λ is an eigenvalue of Lrw with eigenvector u
⇐⇒ λ is an eigenvalue of Lsym with eigenvector w = D1/2 u.
2. λ is an eigenvalue of Lrw with eigenvector u

⇐⇒ λ and u solve the generalized eigenproblem Lu = λDu.


3. 0 is an eigenvalue of Lrw with the constant one vector 1 as
eigenvector. 0 is an eigenvalue of Lsym with eigenvector D1/2 1.
4. Lsym and Lrw are positive semi-definite and have n
non-negative real-valued eigenvalues 0 = λ1 ≤ . . . ≤ λn .

Normalized Laplacians: First properties (3)


Proof.

Part (1): multiply the eigenvalue equation Lsym w = λw with D−1/2 from the left and substitute u = D−1/2 w.

Part (2): multiply Lrw u = λu with D from the left.

Part (3): Just plug it in the corresponding eigenvalue equations.



Part (4): The statement about Lsym follows from the adapted key
property, and then the statement about Lrw follows from (2). ,

Normalized Laplacians and connected components
Proposition 25 (Relation between spectrum and clusters)
Let G be an undirected graph with non-negative weights. Then the
multiplicity k of the eigenvalue 0 of both Lrw and Lsym equals the
number of connected components A1 , . . . , Ak in the graph. For
Lrw , the eigenspace of 0 is spanned by the indicator vectors 1Ai of
those components. For Lsym , the eigenspace of 0 is spanned by the vectors D1/2 1Ai .

Proof.
Analogous to the one for the unnormalized case.

Cheeger constant

Cheeger constant
Let G be an undirected graph with non-negative edge weights
wij , S ⊂ V be a subset of vertices, S̄ := V \ S its complement.
Define:
I Volume of the set: vol(S) := ∑_{s∈S} d(s)
I Cut value: cut(S, S̄) := ∑_{i∈S, j∈S̄} wij
I Cheeger constant:

hG (S) := cut(S, S̄) / min{vol(S), vol(S̄)}

hG := min_{S⊂V} hG (S)
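For very small graphs, the Cheeger constant can be computed by brute force over all subsets S. A sketch (my own illustrative code, exponential in n, and assuming the graph has no isolated vertices):

import numpy as np
from itertools import combinations

def cheeger_constant(W):
    n = W.shape[0]
    deg = W.sum(axis=1)
    best = np.inf
    for size in range(1, n):                       # all non-trivial subsets S
        for S in combinations(range(n), size):
            S = list(S)
            Sbar = [i for i in range(n) if i not in S]
            cut = W[np.ix_(S, Sbar)].sum()
            h = cut / min(deg[S].sum(), deg[Sbar].sum())
            best = min(best, h)
    return best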

Cheeger constant (2)


Example: a clique (=fully connected graph, including self-loops)
with n vertices has hG = Θ(1):
I S that contains n/2 vertices:

Cheeger constant (3)


I S that contains 1 vertex:

I Similarly for other sets S.



Cheeger constant (4)


Example: two cliques with n/2 vertices each, connected by a single
edge. Results in hG = O(1/n2 )

Cheeger constant (5)


Intuition:

small values of the Cheeger constant are achieved for cuts that split the graph into reasonably big, tightly connected subgraphs (so that the denominator is large) which are well separated from each other (so that the numerator, the cut value, is small).

Relation of the Cheeger constant and λ2


Theorem 26 (Cheeger inequality for graphs)
Consider a connected, undirected, unweighted graph. Let λ2 be the
second-smallest eigenvalue of Lsym . Then

   λ2 / 2  ≤  hG  ≤  √(2 λ2)

Intuition:
I The Cheeger constant describes the cluster properties of a
  graph.
I The Cheeger constant is controlled by the second eigenvalue.

Proof. Blackboard / skipped (see Chung: Spectral Graph Theory).
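As an illustration (my own addition, not from the slides), the inequality can
be checked numerically on a tiny graph by computing hG by brute force over
all vertex subsets; the example graph below is made up.

    import numpy as np
    from itertools import combinations

    # Two triangles joined by a single edge (6 vertices, unweighted)
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    n = 6
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0

    d = W.sum(axis=1)
    L_sym = np.eye(n) - np.diag(d**-0.5) @ W @ np.diag(d**-0.5)
    lam2 = np.sort(np.linalg.eigvalsh(L_sym))[1]

    def h(S):
        S = set(S); Sbar = set(range(n)) - S
        cut = sum(W[i, j] for i in S for j in Sbar)
        return cut / min(d[list(S)].sum(), d[list(Sbar)].sum())

    hG = min(h(S) for k in range(1, n) for S in combinations(range(n), k))
    print(lam2 / 2 <= hG <= np.sqrt(2 * lam2))   # True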
Summer 2019

Spectral clustering
Literature:
I U. Luxburg. Tutorial on Spectral Clustering, Statistics and
Computing, 2007.
Ulrike von Luxburg: Statistical Machine Learning

I The more recent editions of Tibshirani/Hastie/Friedman also


contain a chapter on it.
642
Summer 2019

Clustering in graphs
General problem:
I Given a graph

I Want to find “clusters” in the graph:


I many connections inside the cluster
I few connections between different clusters
Ulrike von Luxburg: Statistical Machine Learning
643
Summer 2019

Clustering in graphs (2)


Examples:
I Find “communities” in a social network (e.g., to analyze
communication patterns in a company; to place targeted ads in
facebook)
I Find groups of jointly acting proteins in a protein-interaction
network
I Find groups of similar films (; “genres” )
Ulrike von Luxburg: Statistical Machine Learning

I Find subgroups of diseases (for more specific medical


treatment)
644
Summer 2019

Clustering in graphs (3)


First idea:
I “Few connections between clusters” ≈ “small cut”.

I Thus clustering ≈ find a mincut in the graph.


Ulrike von Luxburg: Statistical Machine Learning
645
Clustering in graphs (4)

Problem: outliers. A plain mincut often just separates a single
outlier vertex from the rest of the graph.

Clustering in graphs (5)


Better idea: find two sets such that
I the cut between the sets is small

I each of the clusters is “reasonably large”

Some background on complexity of cut problems:
I Finding any mincut (without extra constraints) is easy and can
  be done in polynomial time.
I Finding the balanced mincut is NP hard.
I Can we do something in between?
Summer 2019

RatioCut criterion
Idea:
I want to define an objective function that measures the quality
  of a “nearly balanced cut”:
I the smaller the cut value, the smaller the objective function
I the more balanced the cut, the smaller the objective function

Measuring the balancedness of a cut:
I Consider a partition V = A ∪ B (disjoint union).
I Define |A| := number of vertices in A.
I Introduce the balancing term 1/|A| + 1/|B|.
I Observe: The balancing term is small when A and B have
  (approximately) the same number of vertices:
Summer 2019

RatioCut criterion (2)


I Example: n vertices in total.
I Case |A| = n/2, |B| = n/2. Then
1/|A| + 1/|B| = 4/n = O(1/n).
I Case |A| = 1, |B| = n − 1. Then
1/|A| + 1/|B| = 1 + 1/(n − 1) = O(1).

I In general: Under the constraint that |A| + |B| = n, the term
  1/|A| + 1/|B| is minimal if |A| = |B|.
  Formally, this can be seen by taking the derivative of the
  function f(a) = 1/a + 1/(n − a) with respect to a and setting
  it to 0.
Summer 2019

RatioCut criterion (3)


Combining cut and balancing: Define:

   cut(A, B) := Σ_{i∈A, j∈B} wij

   RatioCut(A, B) := cut(A, B) · ( 1/|A| + 1/|B| )

RatioCut gets smaller if
I the cut is smaller
I the clusters are more balanced
This is what we wanted to achieve.
Summer 2019

RatioCut criterion (4)


Target is now: find the cut with the minimal RatioCut value in a
graph.

Bad news:
finding the global minimum of RatioCut is NP hard.

Good news:
there exists an algorithm that finds very good solutions in most
cases: spectral clustering.

Unnormalized spectral clustering


Summer 2019

Goal: minimize RatioCut


Consider the following problem:

Given an undirected graph G with non-negative edge


weights. What is the minimal RatioCut in the graph?

On the following slides we want to show how we can use spectral


graph theory to achieve what we want.
Ulrike von Luxburg: Statistical Machine Learning
653
Summer 2019

Relaxing balanced cut


Consider a graph with an even number n of vertices. For A ⊂ V ,
denote Ā = V \ A. We want to solve the following problem:

   min_{A⊂V} cut(A, Ā)   subject to |A| = |Ā|        (*)

We want to rewrite the problem in a more convenient way.
Introduce f = (f1, ..., fn)^t ∈ Rn with

   fi = +1 if i ∈ A,     fi = −1 if i ∉ A.

Now observe (each edge across the cut contributes (fi − fj)² = 4,
all other edges contribute 0):

   cut(A, Ā) = Σ_{i∈A, j∈Ā} wij = (1/8) Σ_{i,j=1}^n wij (fi − fj)² = (1/4) f^t L f
Summer 2019

Relaxing balanced cut (2)


So we can rewrite problem (∗) equivalently as follows:

   min_f  f^t L f   subject to  Σ_{i=1}^n fi = 0  and  fi = ±1        (**)

So far, we did not change the problem at all, we just wrote it in a
different way.

It still looks difficult because it is a discrete optimization problem.
Summer 2019

Relaxing balanced cut (3)


Now we are going to relax the problem: we simply replace the
difficult condition fi = ±1 by the two conditions
fi ∈ R and ||f|| = 1:

   min_f  f^t L f   subject to  Σ_{i=1}^n fi = 0,  fi ∈ R,  ||f|| = 1        (#)

Finally, observe that Σ_i fi = 0 ⇐⇒ f ⊥ 1, where 1 is the
constant-one vector (1, 1, ..., 1). We obtain:

   min_f  f^t L f   subject to  f ⊥ 1,  fi ∈ R,  ||f|| = 1        (##)
Summer 2019

Relaxing balanced cut (4)


The final observation is now:
I The constant-one vector 1 is the eigenvector of L belonging to the
  smallest eigenvalue 0.
I So Rayleigh’s principle tells us that the solution to Problem
  (##) is f*, the second-smallest eigenvector of L.

To transform the solution of the relaxed problem into a partition we
simply consider the sign:

   i ∈ A  :⇐⇒  fi* ≥ 0
Summer 2019

Relaxing balanced cut (5)


So we end up with the following algorithm:

HardBalancedCutRelaxation(G)
1 Input: Weight matrix (or adjacency matrix) W of the graph
2 D := the corresponding degree matrix
3 L := D − W (the corresponding graph Laplacian)
4 Compute the second-smallest eigenvector f of L
5 Define the partition A = {i : fi ≥ 0}, Ā = V \ A
6 Return A, Ā
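A minimal NumPy sketch of this relaxation (my own illustration with a
made-up toy weight matrix, not part of the original slides):

    import numpy as np

    def hard_balanced_cut_relaxation(W):
        """Split the vertices by the sign of the second-smallest
        eigenvector of the unnormalized Laplacian L = D - W."""
        d = W.sum(axis=1)
        L = np.diag(d) - W
        vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
        f = vecs[:, 1]                   # second-smallest eigenvector
        return np.where(f >= 0)[0], np.where(f < 0)[0]

    # toy graph: two tightly connected pairs, loosely linked
    W = np.array([[0, 2, 0.1, 0],
                  [2, 0, 0, 0.1],
                  [0.1, 0, 0, 2],
                  [0, 0.1, 2, 0]], dtype=float)
    print(hard_balanced_cut_relaxation(W))   # e.g. (array([0, 1]), array([2, 3]))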
Summer 2019

Relaxing balanced cut (6)


Remarks about the relaxation approach:
I Our original problem was NP hard.
I We now solve a relaxed problem (in polynomial time, see
below).
I In general, relaxing a problem does not lead to any guarantees
about whether the solution of the relaxed problem is close to
Ulrike von Luxburg: Statistical Machine Learning

the solution of the original problem.


I This is also the case for spectral clustering. We can construct
example graphs for which the relaxation is arbitrarily bad.
However, such examples are very artificial.
I However, in practice the spectral relaxation works very well!!!
659
Summer 2019

Relaxing RatioCut
Now we want to solve the soft balanced mincut problem of
optimizing ratiocut:

   min_{A⊂V} RatioCut(A, Ā)        (*)

This goes along the same lines as the hard balanced mincut
problem:
I Define particular values of fi, namely

     fi = +(|Ā|/|A|)^{1/2} if i ∈ A,      fi = −(|A|/|Ā|)^{1/2} if i ∈ Ā.

I Observe that with this choice one can compute
  f^t L f = n · RatioCut(A, Ā).
Summer 2019

Relaxing RatioCut (2)


I So the RatioCut problem is equivalent to

     min_f  f^t L f   subject to  fi “of the form given above”        (**)

I Also observe that any f of the form given above satisfies
  Σ_{i∈V} fi = |A|·√(|Ā|/|A|) − |Ā|·√(|A|/|Ā|) = 0. So any such f
  satisfies f ⊥ 1.
I So the RatioCut problem is also equivalent to

     min_f  f^t L f   subject to  f ⊥ 1 and fi “of the form given above”

I Now we relax the condition fi “of the form given above” to
  fi ∈ R, and apply Rayleigh to see that we need to compute the
  second eigenvector.
Summer 2019

Relaxing RatioCut (3)


I As before, we assign points to A and Ā according to the sign
of the resulting f ∗ .

In pseudo-code, this algorithm is exactly the one we have


already seen above. It is called (unnormalized) spectral
clustering.
Ulrike von Luxburg: Statistical Machine Learning
662
Summer 2019

(Unnormalized) Spectral Clustering, case two


clusters
UnnormalizedSpectralClustering(G)
1 Input: Weight matrix (or adjacency matrix) W of the graph
2 D := the corresponding degree matrix
3 L := D − W (the corresponding graph Laplacian)
4 Compute the second-smallest eigenvector f of L
5 Define the partition A = {i : fi ≥ 0}, Ā = V \ A
6 Return A, Ā
Summer 2019

Examples
A couple of data points drawn from a mixture of Gaussians on R.

Unnormalized spectral clustering for k clusters


One can extend the algorithm to the case of k clusters.

General idea:
I The first k eigenvectors encode the cluster structure of k
disjoint clusters.
(To see this, consider the case of k perfectly disconnected
clusters)
Ulrike von Luxburg: Statistical Machine Learning

I To extract the cluster information from the first k


eigenvectors, we construct the so-called spectral embedding:
I Let V be the matrix that contains the first k eigenvectors as
columns.
I Now define new points Yi ∈ Rk as the i-th row of matrix V .
666
Summer 2019

Unnormalized spectral clustering for k clusters


(2)
I Note: in the “ideal case” (disconnected clusters) the Yi are
the same for all points in the same cluster.
Ulrike von Luxburg: Statistical Machine Learning

I Idea: they are “nearly the same” if we still have nice (but not
perfect) clusters.
I In particular, any simple algorithm can recover the cluster
membership based on the embedded points Yi . We use
k-means to do so.
667
668 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

(3)
Unnormalized spectral clustering for k clusters
669 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

(4)
Unnormalized spectral clustering for k clusters
Summer 2019

Unnormalized spectral clustering for k clusters


(5)
UnnormalizedSpectralClustering(G)
1 Input: Weight matrix (or adjacency matrix) W of the graph
2 D := the corresponding degree matrix
3 L := D - W (the corresponding graph Laplacian)
4 Compute the n × k matrix V that contains the first k
  eigenvectors of L as columns.
5 Define the new data points Yi ∈ R^k to be the rows of the
  matrix V. This is sometimes called the spectral embedding.
6 Now cluster the points (Yi)_{i=1,...,n} by the k-means algorithm.
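A short sketch of this algorithm in Python (my own illustration, using NumPy
and scikit-learn's KMeans; not part of the original slides):

    import numpy as np
    from sklearn.cluster import KMeans

    def unnormalized_spectral_clustering(W, k):
        """Spectral embedding with the first k eigenvectors of L = D - W,
        followed by k-means on the embedded points Y_i."""
        d = W.sum(axis=1)
        L = np.diag(d) - W
        _, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
        Y = vecs[:, :k]                  # rows of Y are the embedded points Y_i
        return KMeans(n_clusters=k, n_init=10).fit_predict(Y)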
Summer 2019

Unnormalized spectral clustering for k clusters


(6)
Some more intuition:
I Seems funny: we first say we want to use spectral clustering,
and in the end we run k-means.
I The point is that the spectral embedding is such a clever
transformation of the original data that after this
transformation the cluster structure is “obvious”, we just have
Ulrike von Luxburg: Statistical Machine Learning

to extract it by a simple algorithm.


671
Summer 2019

Analysis: Running time


The bottleneck of the algorithm is the computation of the
eigenvector:
I In general, the first eigenvectors of a symmetric matrix can be
computed in time O(n3 )
I However, one can do much better on sparse matrices (running
time then depends on the sparsity and on other conditions
Ulrike von Luxburg: Statistical Machine Learning

such as the “spectral gap”).


672
Summer 2019

Analysis: No approximation guarantees


I As hinted above: we cannot guarantee that the solution we
find by unnormalized spectral clustering is close to the
minimizer of RatioCut
I In the following example, the cut constructed by spectral
clustering is c · n times larger than the best RatioCut:
Ulrike von Luxburg: Statistical Machine Learning

Note that this example relies heavily on symmetry.


Guattery, S., Miller, G. (1998). On the quality of spectral separators. SIAM Journal of Matrix Anal. Appl., 1998.
673
Summer 2019

Analysis: No approximation guarantees (2)


I However, despite the lack of approximation guarantees, it
  performs extremely well in practice. In terms of clustering
  performance, it is the state of the art and one of the most
  widely used “modern” algorithms for clustering.
Ulrike von Luxburg: Statistical Machine Learning
674
675 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Normalized spectral clustering


Summer 2019

Normalized cut criterion


We have seen that the unnormalized spectral clustering algorithm
solves the (relaxed) problem of minimizing Ratiocut.

For various reasons (see later), it turns out to be better to consider


the following objective function, called normalized cut:

   Ncut(A, B) = cut(A, B) · ( 1/vol(A) + 1/vol(B) )

This looks very similar to RatioCut, but we measure the size of the
sets A and B not by their number of vertices, but by the weight of
their edges:

   vol(A) = Σ_{i∈A} di
Summer 2019

Normalized cut criterion (2)


By a derivation that is very similar to what we have seen for
Ratiocut:
I relaxing the problem to minimizing Ncut leads to clustering
the eigenvectors of the random walk Laplacian Lrw .
I For computational reasons, one replaces the
eigendecomposition of Lrw by the one of Lsym (WHY?)
Ulrike von Luxburg: Statistical Machine Learning
677
Summer 2019

Minimizing normalized cut


Relaxation approach, very similar to the one for Ratiocut, leads to
the following algorithm:

NormalizedSpectralClustering(G)
1 Input: Weight matrix (or adjacency matrix) W of the graph
2 D := the corresponding degree matrix
3 Lsym := D^{-1/2} (D − W) D^{-1/2} (the normalized graph Laplacian)
4 Compute the second-smallest eigenvector f of Lsym
5 Compute the vector g = D^{-1/2} f
6 Define the partition A = {i : gi ≥ 0}, Ā = V \ A
7 Return A, Ā
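A minimal NumPy sketch of this pseudocode (my own illustration, not part of
the original slides):

    import numpy as np

    def normalized_spectral_clustering(W):
        """Two-cluster normalized spectral clustering (Ncut relaxation)."""
        d = W.sum(axis=1)
        D_is = np.diag(1.0 / np.sqrt(d))
        L_sym = D_is @ (np.diag(d) - W) @ D_is
        _, vecs = np.linalg.eigh(L_sym)
        f = vecs[:, 1]               # second-smallest eigenvector of L_sym
        g = D_is @ f                 # back-transformed vector
        return np.where(g >= 0)[0], np.where(g < 0)[0]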
Summer 2019

Normalized vs. unnormalized spectral clustering


You should always prefer the normalized spectral clustering
algorithm. There are several theoretical (and practical) results that
show this. Here is one of them:

The unnormalized spectral algorithm is not statistically consistent:
as the sample size increases, the second eigenvector of the
unnormalized Laplacian can converge to a trivial Dirac function
that just separates one point from the rest of the space. This never
happens for normalized spectral clustering. Details are beyond the
scope of this lecture and require heavy functional analysis.
Literature: Luxburg, Bousquet, Belkin: Consistency of spectral clustering, 2008
679
Summer 2019

Regularization for spectral clustering


I In practice it still happens regularly that normalized spectral
clustering identifies outliers rather than the true clusters. The
reason is that the balancing is not strong enough.
I There is a very simple cure for it that helps very often: add a
little regularization term to the adjacency matrix before
applying normalized spectral clustering. Concretely, replace the
weight matrix W by the matrix

   W̃ := W + (τ/n) · J

where J is the all-ones matrix and τ a small parameter, and
then compute the normalized Laplacian and use spectral
clustering as usual:

   D̃ = degrees of W̃,     L̃ = D̃ − W̃

Regularization for spectral clustering (2), (3)
(Figure: Karl Rohe)
Summer 2019

Regularization for spectral clustering (4)


Why does regularization help? Here is the line of argument:
I Many sparse random graphs provably have many “k-dangling
sets”, and those sets create small eigenvalues of the normalized
Laplacian (by Cheeger’s inequality, λ2 ≤ 2/(2k − 1) ≈ 1/k)
Ulrike von Luxburg: Statistical Machine Learning

I If we regularize, one can prove that these eigenvalues


“disappear”: if the graph has a good cluster structure, then
one can prove that the correct cut has a smaller value in the
regularized graph than the cut of dangling sets...
683
Summer 2019

Regularization for spectral clustering (5)


Hang on, the regularized matrix is dense! So do we run into
computational issues?

No, we can still use the power method and exploit sparsity of the
graph:
Ulrike von Luxburg: Statistical Machine Learning

   (D̃^{-1/2} W̃ D̃^{-1/2}) v = D̃^{-1/2} (W + (τ/n) J) D̃^{-1/2} v
                            = D̃^{-1/2} W D̃^{-1/2} v + (τ/n) · D̃^{-1/2} 1 (1^t D̃^{-1/2} v)

The first term is a sparse matrix-vector product; the second term can be
computed in O(n) time because J = 1·1^t never has to be formed explicitly.
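A matrix-free sketch of this trick (my own illustration; it assumes SciPy's
LinearOperator / eigsh interface and a sparse weight matrix W):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, eigsh

    def regularized_operator(W, tau):
        """Multiplication with the regularized normalized adjacency matrix
        without ever forming the dense matrix W + (tau/n) J."""
        n = W.shape[0]
        d_reg = np.asarray(W.sum(axis=1)).ravel() + tau   # degrees of W-tilde
        d_is = 1.0 / np.sqrt(d_reg)

        def matvec(v):
            u = d_is * v
            return d_is * (W @ u) + (tau / n) * d_is * u.sum()
        return LinearOperator((n, n), matvec=matvec)

    # usage sketch: the top eigenvectors of this operator correspond to the
    # smallest eigenvectors of the regularized normalized Laplacian
    # op = regularized_operator(W_sparse, tau=1.0)
    # vals, vecs = eigsh(op, k=3, which='LA')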
Summer 2019

Regularization for spectral clustering (6)


Literature on regularized spectral clustering:
I Zhang, Rohe: Understanding Regularized Spectral Clustering
via Graph Conductance, Arxiv, 2018
I Chaudhuri, Chung, Tsiatas: Spectral clustering of graphs with
general degrees in the planted partition model, COLT 2012
I Amini, Cheng, Bickel, Levina: Pseudo-Likelihood methods for
community detection, Annals of Statistics, 2013
Ulrike von Luxburg: Statistical Machine Learning
685
Summer 2019

History of spectral clustering


I Has been discovered and rediscovered several times since the
1970ies, but went pretty much unnoticed.
I Breakthrough papers:
I Shi, J. and Malik, J. Normalized cuts and image
segmentation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2000.
Ulrike von Luxburg: Statistical Machine Learning

I Meila, M. and Shi, J. A random walks view of spectral


segmentation. AISTATS, 2001.
I Ng, A., Jordan, M., and Weiss, Y. On spectral clustering:
analysis and an algorithm. NIPS, 2002.
I By now, it has been established as the most popular “modern”
clustering algorithm, with many theoretical results
underpinning its usefulness.
Still active field of research, e.g. see regularized spectral clustering.
686
Summer 2019

Spectral clustering summary


I Spectral clustering tries to solve a balanced cut problem:
I minimize Ratiocut (; unnormalized spectral clustering)
I minimize Ncut (; normalized spectral clustering)
I Both these problems are discrete optimization problems and
NP hard to solve.
I Spectral clustering solves a relaxed version of these problems.
Ulrike von Luxburg: Statistical Machine Learning

I In theory, there are no approximation guarantees — the


relaxed solution can be miles away from the one we want.
In practice, it works very well and is state of the art in many
applications.
I Running time complexity can be as bad as O(n3 ), but for
sparse graphs it is very fast.
Normalized spectral clustering is THE modern state of the
art clustering algorithm.
687
688 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

(*) Correlation clustering: SKIPPED


Introduction to learning theory

The standard theory for supervised learning


Summer 2019

Learning Theory: setup and main questions


Literature on learning theory:
I High-level: U. von Luxburg and B. Schölkopf. Statistical
Learning Theory: Models, Concepts, and Results. 2011.
Ulrike von Luxburg: Statistical Machine Learning

I More technical: Bousquet, Boucheron, Lugosi: Introduction to


statistical learning theory, 2003
I The “classic” book (technical): Devroye, Györfi, Lugosi: A
probabilistic theory of pattern recognition. Springer, 1996
I Some of the general text books cover different aspects, for
example the books by Shalev-Shwartz,Ben-David and
Mohri,Rostamizadeh, Talwalkar.
691
Summer 2019

Statistical learning theory


On an abstract level, SLT tries to answer questions such as:
I Which learning tasks can be performed by computers in
general (positive and negative results)?
I What kind of assumptions do we have to make such that
machine learning can be successful?
I What are the key properties a learning algorithm needs to
Ulrike von Luxburg: Statistical Machine Learning

satisfy in order to be successful?


I Which performance guarantees can we give on the results of
certain learning algorithms?
In the following we focus on the case of binary classification,
for which the theory is well-understood.
692
Summer 2019

The framework
The data. Data points (Xi , Yi ) are an i.i.d. sample from some
underlying (unknown) probability distribution P on the space
X × {±1}.

Goal. Our goal is to learn a deterministic function f : X → {±1}


such that the expected loss (risk) according to some given loss
function ` is as small as possible. In classification, the natural loss
Ulrike von Luxburg: Statistical Machine Learning

function is the 0-1-loss.


693
Summer 2019

The framework (2)


Assumptions we (do not) make:
I We do not make any assumption on the underlying
distribution P that generates our data, it can be anything.
I True labels do not have to be a deterministic function of the
input (consider the example of predicting male/female based
on body height).
Ulrike von Luxburg: Statistical Machine Learning

I Data points have been sampled i.i.d.


I Data does not change over time (the ordering of the training
points does not matter, and the distribution P does not
change),
I The distribution P is unknown at the time of learning.
694
Summer 2019

Recap: Bayes classifier


The Bayes classifier
The Bayes classifier for a particular learning problem is the classifier
that achieves the minimal expected risk.

Have already seen: if we knew the underlying distribution P , then


we also know the Bayes classifier (just look at the regression
function).
Ulrike von Luxburg: Statistical Machine Learning

The challenge is that we do not know P . The goal is now to


construct a classifier that is “as close to the Bayes classifier” as
possible. Now let’s become more formal.
695
Summer 2019

Convergence and consistency


Assume we have a set of n points (Xi , Yi ) drawn from P . Consider
a given function class F from which we are allowed to pick our
classifier. Denote:
I f ∗ the Bayes classifier corresponding to P .

I fF the best classifier in F, that is fF = argmin_{f∈F} R(f)
I fn the classifier chosen from F by some training algorithm on
  the given sample of n points.

Now consider the following definitions:


696
Summer 2019

Convergence and consistency (2)


1. A learning algorithm is called consistent with respect to F
and P if the risk R(fn ) converges in probability to the risk
R(fF ) of the best classifier in F, that is for all ε > 0,

P (R(fn ) − R(fF ) > ε) → 0 as n → ∞.

2. A learning algorithm is called Bayes-consistent with respect


to P if the risk R(fn ) converges to the risk R(f ∗ ) of the
Ulrike von Luxburg: Statistical Machine Learning

Bayes classifier, that is for all ε > 0,

P (R(fn ) − R(f ∗ ) > ε) → 0 as n → ∞.

3. A learning algorithm is called universally consistent with


respect to F (resp. universally Bayes-consistent) if it is
consistent with respect to F (resp. Bayes-consistent) for all
probability distributions P .
697
Summer 2019

Convergence and consistency (3)


Note that consistency with respect to a fixed function class F only
concerns the estimation error, not the approximation error:

I Here consistency means that our decisions are not affected


systematically from the fact that we only get to see a finite
sample, rather than the full space. In other words, all “finite
sample effects” cancel out once we get to see enough data.
Ulrike von Luxburg: Statistical Machine Learning

I If a learning algorithm is consistent, it means that it does not


overfit when it gets to see enough data (low estimation error,
low variance).
I Consistency with respect to F does not tell us anything about
underfitting (approximation error; this depends on the choice
of F).
698
Summer 2019

Empirical risk minimization (ERM)

True risk of a function:   R(f) = E( ℓ(X, f(X), Y) )

Empirical risk:   Rn(f) = (1/n) Σ_{i=1}^n ℓ(Xi, f(Xi), Yi)

Empirical risk minimization: given n training points (Xi, Yi)_{i=1,...,n}
and a fixed function class F, select the function fn that minimizes
the training error on the data:

   fn = argmin_{f∈F} Rn(f)

Earlier we informally discussed that ERM won’t work if the function
class F is “too large”. We are now going to make this formal.
Controlling the estimation error: generalization bounds
Summer 2019

Law of large numbers and concentration


Recall from probability theory:

Proposition 27 (Law of large numbers, simplest version)


Let (Zi )i∈N be a sequence of independent random variables that
have been drawn according to some probability distribution P ,
denote its expectation as E(Z). Then (under mild assumptions)

   (1/n) Σ_{i=1}^n Zi → E(Z)   (almost surely).
Summer 2019

Law of large numbers and concentration (2)


Even more, there exist very strong guarantees on how fast this
convergence takes place:

Proposition 28 (Concentration inequality, Chernoff 1952,


Hoeffding 1963)
Assume that the random variables Z1, ..., Zn are independent and
take values in [0, 1]. Then for any ε > 0

   P( | (1/n) Σ_{i=1}^n Zi − E(Z) | ≥ ε ) ≤ 2 exp(−2nε²).
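A quick Monte Carlo illustration of this bound (my own addition; fair-coin
variables, all parameters made up):

    import numpy as np

    rng = np.random.default_rng(0)
    n, eps, trials = 100, 0.1, 20000
    Z = rng.integers(0, 2, size=(trials, n))        # Z_i in {0, 1}, mean 0.5
    deviation = np.abs(Z.mean(axis=1) - 0.5)
    print(np.mean(deviation >= eps))                # empirical frequency ...
    print(2 * np.exp(-2 * n * eps**2))              # ... stays below the bound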
Summer 2019

Law of large numbers and concentration (3)


Now consider our scenario of binary classification.

Proposition 29 (Risks converge for fixed function)


Fix a function f0 ∈ F. Then, for this fixed function f0 ,

Rn (f0 ) → R(f0 ) (almost surely).


Ulrike von Luxburg: Statistical Machine Learning

DO YOU SEE WHY?


703
Summer 2019

Law of large numbers and concentration (4)


Proof:
I Apply the Hoeffding bound to the variables
Zi := `(f0 (Xi ), Yi ). This leads to convergence in probability.
I (For those who know about probability theory: To get almost
  sure convergence, you need to do one extra step, namely apply
  the Borel-Cantelli lemma. Key is that Σ_{n=1}^∞ exp(−2nε²) is
  finite. Exercise.)
704
Summer 2019

Law of large numbers and concentration (5)


Question: Let fn be the function selected by empirical risk
minimization. Does the LLN imply that

Rn (fn ) − R(fn ) → 0 ???????????.


Ulrike von Luxburg: Statistical Machine Learning
705
Summer 2019

Law of large numbers and concentration (6)


NO!!!

Here is a simple counter-example:


I X = [0, 1] with uniform distribution; labels deterministic with
x < 0.5 =⇒ y = −1 and x ≥ 0.5 =⇒ y = +1
I Draw n training points

I Define fn as follows: for all points in the training sample,


Ulrike von Luxburg: Statistical Machine Learning

predict the training label; for all other points predict -1.

Here we have Rn (fn ) = 0 but R(fn ) = 0.5 for all n.

DO YOU SEE WHY WE CANNOT APPLY THE LLN, WHERE
DOES IT GO WRONG???
706
Summer 2019

Uniform convergence
Want to have a condition that is sufficient for the convergence of
the empirical risk of the data-dependent function fn :

I We require that for all functions in F, the empirical risk (as


measured on the data) has to be close to the true risk.
I Formally: for any ε > 0, we want that with high probability,

   sup_{f∈F} |Rn(f) − R(f)| < ε

I Note the logic behind this condition: if Rn (f ) and R(f ) are


close for all functions f ∈ F, they particularly will be close for
the function fn that has been chosen by the classification
algorithm.
707
Summer 2019

Uniform convergence (2)


Note that in the counter-example above, this requirement is clearly
not satisfied.

WHY EXACTLY?
Ulrike von Luxburg: Statistical Machine Learning
708
Summer 2019

Uniform convergence (3)


Definition:

We say that the law of large numbers holds uniformly over a
function class F if for all ε > 0,

   P( sup_{f∈F} |R(f) − Rn(f)| ≥ ε ) → 0   as n → ∞.
Ulrike von Luxburg: Statistical Machine Learning
709
Summer 2019

Uniform convergence: sufficient for consistency


Relatively easy to see:

Proposition 30 (Uniform convergence is sufficient for


consistency)
Let fn be the function that minimizes the empirical risk in F. Then:

   P( |R(fn) − R(fF)| ≥ ε ) ≤ P( sup_{f∈F} |R(f) − Rn(f)| ≥ ε/2 ).
Summer 2019

Uniform convergence: sufficient for consistency


(2)
Proof.

   |R(fn) − R(fF)|
   = R(fn) − R(fF)          (by definition of fF we know that R(fn) − R(fF) ≥ 0)
   = R(fn) − Rn(fn) + Rn(fn) − Rn(fF) + Rn(fF) − R(fF)
   ≤ R(fn) − Rn(fn) + Rn(fF) − R(fF)    (note that Rn(fn) − Rn(fF) ≤ 0 by def. of fn)
   ≤ 2 sup_{f∈F} |R(f) − Rn(f)|
Summer 2019

Uniform convergence: necessary for consistency


What is much less obvious is that uniform convergence is also
necessary, this is in fact a very deep result:

Theorem 31 (Vapnik & Chervonenkis, 1971)


Let F be any function class. Then empirical risk minimization is
uniformly consistent with respect to F if and only if uniform
convergence holds:

   P( sup_{f∈F} |R(f) − Rn(f)| > ε ) → 0   as n → ∞.        (1)

The proof is beyond the scope of this lecture.


712
Summer 2019

Uniform convergence: necessary for consistency


(2)

But the big question is now:

How do I know whether we have uniform consistency for some


Ulrike von Luxburg: Statistical Machine Learning

function class F???


713
714 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Capacity measures for function classes


715 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Finite classes
Summer 2019

Capacity measures: intuition


Have seen:
I If a function class is too large (as in the counter-example),
then we don’t have uniform convergence.
I If a function class is small (say, it only consists of a single
function), then we have uniform convergence.
Ulrike von Luxburg: Statistical Machine Learning

We now want to come up with ways to measure the size of a


function class — in such a way that we can bound the term

   P( sup_{f∈F} |R(f) − Rn(f)| > ε )
Summer 2019

Generalization bound for finite classes


Recall the Hoeffding bound for a fixed function f0:

   P( |R(f0) − Rn(f0)| ≥ ε ) ≤ 2 exp(−2nε²).

Now consider a function class with finitely many functions:
F = {f1, ..., fm}. We get:

   P( sup_{f∈F} |R(f) − Rn(f)| ≥ ε )
   = P( max_{i=1,...,m} |R(fi) − Rn(fi)| ≥ ε )
   = P( |R(f1) − Rn(f1)| ≥ ε  or  |R(f2) − Rn(f2)| ≥ ε  or ... )
   ≤ Σ_{i=1}^m P( |R(fi) − Rn(fi)| ≥ ε )                (union bound)
   ≤ 2m exp(−2nε²)
Summer 2019

Generalization bound for finite classes (2)


Leads to the first result:
Proposition 32 (Generalization of finite classes)
Assume F is finite and contains m functions. Choose any
ε with 0 < ε < 1. Then, with probability at least 1 − 2m exp(−2nε²),
we have for all f ∈ F that

   |R(f) − Rn(f)| < ε.

Note that this statement is somewhat inconvenient, it is “the
wrong way round”: we choose the error and obtain the probability
with which this error guarantee holds; but we would rather fix the
probability and ask how large the error is.
718
Summer 2019

Generalization bound for finite classes (3)


So we now try to reverse the statement: set the probability to some
value δ, and then solve for ε:

   δ = 2m exp(−2nε²)   =⇒   ε = √( (log(2m) + log(1/δ)) / (2n) )

With this, the proposition becomes the following generalization
bound:
719
Summer 2019

Generalization bound for finite classes (4)


Theorem 33 (Generalization bound for finite classes)
Assume F is finite and contains m functions. Choose some failure
probability 0 < δ < 1. Then, with probability at least 1 − δ, for all
f ∈ F we have

   R(f) ≤ Rn(f) + √( (log(2m) + log(1/δ)) / (2n) )

Note that the generalization bound holds uniformly (with the same
error guarantee) for all functions in F, so in particular for the
function that a classifier might pick based on the sample points it
has seen.
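To get a feeling for the size of the capacity term, here is a small worked
computation (my own addition; the values of m, δ and n are made up):

    import numpy as np

    m, delta = 10**6, 0.05          # one million functions, 95% confidence
    for n in [100, 1000, 10000, 100000]:
        eps = np.sqrt((np.log(2 * m) + np.log(1 / delta)) / (2 * n))
        print(n, round(float(eps), 3))
    # even for a huge (but finite) class the gap shrinks like 1/sqrt(n)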
Summer 2019

Generalization bound for finite classes (5)


Let’s digest this bound:
I It bounds the true risk by the empirical risk plus a “capacity
term”.
I If the function class gets larger (m increases), then the bound
gets worse.
I If m is “small enough” compared to n (in the sense that
  log(m)/n is small), then we get a tight bound.
I The whole bound only holds with probability 1 − δ. When we
  decrease δ (higher confidence), the bound gets worse.
I If m is fixed, and the confidence value δ is fixed, and n → ∞,
  then the empirical risk converges to the true risk. The speed of
  convergence is of the order 1/√n.
I If you want to grow your function space with n in order to be
able to fit more accurately if you have more data, you need to
make sure that (log m)/n → 0 if you want to get consistency.
721
Summer 2019

Generalization bound for finite classes (6)


EXERCISE:
Consider X = [0, 1], split it into a grid of k cells of the same size.
As function class, consider all functions that are piecewise constant
(0 or 1) on all cells.
Ulrike von Luxburg: Statistical Machine Learning

I Prove that ERM is consistent if k is fixed.


I Consider the case that k grows with n. How fast can k grow
such that we still have consistency?
722
Summer 2019

Generalization bound for finite classes (7)


I What about the approximation error?
Ulrike von Luxburg: Statistical Machine Learning
723
Summer 2019

Generalization bound for finite classes (8)


Bottom line:
I For finite function classes, we can measure the size of F by its
number m of functions.
I This leads to a generalization bound with plausible behavior.
Ulrike von Luxburg: Statistical Machine Learning

However, what should we do if F is infinite (say, space of all linear


functions)? Then the approach above does not work ... WHY?
724
725 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Shattering coefficient
Summer 2019

Shattering coefficient: definition


We now want to measure the capacity of an infinite class of
functions. The most basic such capacity measure is the following:
Definition: For a given sample X1 , ...., Xn ∈ X and a function
class F define FX1 ,...,Xn as the set of those functions that we get by
restricting F to the sample:

   F_{X1,...,Xn} := { f|_{X1,...,Xn} ; f ∈ F }

The shattering coefficient N(F, n) of a function class F is defined
as the maximal number of functions in F_{X1,...,Xn}:

   N(F, n) := max{ |F_{X1,...,Xn}| ; X1, ..., Xn ∈ X }
Summer 2019

Shattering coefficient: definition (2)


Example 1: X = R, F as below (positive class = right half-space)
Ulrike von Luxburg: Statistical Machine Learning
727
Summer 2019

Shattering coefficient: definition (3)


Example 2: X = R2 , F such that positive class = space above a
horizontal line
Ulrike von Luxburg: Statistical Machine Learning
728
Summer 2019

Shattering coefficient: definition (4)


Example 3: X = R2 , F = interior of circles.
Ulrike von Luxburg: Statistical Machine Learning

CAN YOU COME UP WITH A BOUND ON THE SHATTERING


COEFFICIENT FOR A SMALL n?
729
Summer 2019

Shattering coefficient: generalization bound


Theorem 34 (Generalization bound with shattering
coefficient)
Let F be any arbitrary function class. Then for all 0 < ε < 1,

   P( sup_{f∈F} |R(f) − Rn(f)| > ε ) ≤ 2 N(F, 2n) exp(−nε²/4).

The other way round: with probability at least 1 − δ, all functions
f ∈ F satisfy

   R(f) ≤ Rn(f) + 2 √( (log N(F, 2n) − log δ) / n ).
Summer 2019

Proof of Theorem 34 by symmetrization


I By Rn we denote the risk on our given sample of n points.
I By R'n we denote the risk that we get on a second,
  independent sample of n points, called the “ghost sample”.

Proposition 35 (Symmetrization lemma)

   P( sup_{f∈F} |R(f) − Rn(f)| > ε ) ≤ 2 P( sup_{f∈F} |Rn(f) − R'n(f)| > ε/2 ).

(Proof elementary, omitted)


731
Summer 2019

Proof of Theorem 34 by symmetrization (2)


What is the point of symmetrization?
I The right hand side only depends on the values of the
functions f on the two samples:
  If two functions f and g coincide on all points of the original
  sample and the ghost sample, that is f(x) = g(x) for all x in the
  samples, then Rn(f) = Rn(g) and R'n(f) = R'n(g).
I So the supremum over f ∈ F in fact only runs over finitely
Ulrike von Luxburg: Statistical Machine Learning

many functions: all possible binary functions on the two


samples.
I The number of such functions is bounded by the shattering
coefficient N (F, 2n).
I Now Theorem 34 is a consequence of Theorem 33.
732
Summer 2019

Discussion of the generalization bound with


shattering coefficient
I The bound is analogous to the one for finite function classes,
just the number m of functions has been replaced by the
shattering coefficient.
I Intuitively, the shattering coefficient measures “how powerful”
a function class is, how many different labelings of a data set it
Ulrike von Luxburg: Statistical Machine Learning

can possibly realize.


I Overfitting happens if a function class is very powerful and can
in principle fit everything. Then we don’t get consistency, the
shattering coefficient is large.
The smaller the shattering coefficient, the less prone we are to
overfitting (in the extreme case of one function, we don’t
overfit).
733
Summer 2019

Discussion of the generalization bound with


shattering coefficient (2)
I To prove consistency of a classifier, we need to establish that
log N (F, 2n)/n → 0 as n → ∞.

Intuitively: the number of possibilities in which a data set can


be labeled has to grow at most polynomially in n.
Ulrike von Luxburg: Statistical Machine Learning

I Shattering coefficients are complicated to compute and to deal


with. To prove consistency, we would need to know how fast
the shattering coefficients grow with n (exponentially or less).

We now study a tool that can help us with this.


734
735 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

VC dimension
Summer 2019

VC dimension: Definition
Definition: We say that a function class F shatters a set of points
X1, ..., Xn if F can realize all possible labelings of the points, that
is |F_{X1,...,Xn}| = 2^n.

The VC dimension of F is defined as the largest number n such
that there exists a sample of size n which is shattered by F.
Formally,

   VC(F) = max{ n ∈ N : ∃ X1, ..., Xn ∈ X  s.t.  |F_{X1,...,Xn}| = 2^n }.

If the maximum does not exist, the VC dimension is defined to be
infinity.

(VC stands for Vapnik-Chervonenkis, the people who invented it)


736
Summer 2019

VC dimension: Definition (2)


Example: positive class = closed interval
Ulrike von Luxburg: Statistical Machine Learning
737
Summer 2019

VC dimension: Definition (3)


Example: positive class = interior of a axis algined rectangle
Ulrike von Luxburg: Statistical Machine Learning
738
Summer 2019

VC dimension: Definition (4)


Example: positive class = interior of a convex polygon
Ulrike von Luxburg: Statistical Machine Learning
739
VC dimension: Definition (5)

Example: sine waves. The class of classifiers x ↦ sign(sin(tx)), t ∈ R,
shatters arbitrarily large point sets, hence has infinite VC dimension.
Summer 2019

VC dimension: Definition (6)


Finally, examples that are relevant for practice (SVMs!):
I X = Rd , F = linear hyperplanes. Then V C(F) = d + 1.
Proof see exercises.
I X = Rd , ρ > 0, Fρ := linear hyperplanes with margin at least
ρ. Then one can prove: if the data points are restricted to a
ball of radius R, then

     VC(Fρ) = min{ d, 2R²/ρ² } + 1
Summer 2019

VC dimension: Sauer-Shelah Lemma


Why are we interested in the VC dimension? Here is the reason:

Proposition 36 (Vapnik, Chervonenkis, Sauer, Shelah)


Let F be a function class with finite VC dimension d. Then

   N(F, n) ≤ Σ_{i=0}^d ( n choose i )

for all n ∈ N. In particular, for all n ≥ d we have

   N(F, n) ≤ (en/d)^d.

Proof: nice combinatorial argument, see the exercises.
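A tiny numerical illustration of the lemma (my own addition): for fixed VC
dimension d, the Sauer-Shelah bound grows only polynomially in n, in stark
contrast to the 2^n of an unrestricted class.

    import numpy as np
    from math import comb

    d = 3
    for n in [5, 10, 20, 50]:
        sauer = sum(comb(n, i) for i in range(d + 1))
        print(n, sauer, round((np.e * n / d) ** d), 2 ** n)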
Summer 2019

VC dimension: Sauer-Shelah Lemma (2)


This is a really cool statement:
I If a function class has a finite VC dimension, then the
shattering coefficient only grows polynomially!
I If a function class has infinite VC dimension, then the
shattering coefficient grows exponentially.
I It is impossible that the growth rate of the function class is
Ulrike von Luxburg: Statistical Machine Learning

“slightly smaller” than 2n . Either it is 2n , or much smaller,


polynomial.
743
Summer 2019

Generalization bound with VC dimension


Plugging the Sauer-Shelah-Lemma in Theorem 34 immediately gives
the following generalization bound in terms of the VC dimension:

Theorem 37 (Generalization bound with VC dimension)


Let F be a function class with VC dimension d. Then with
probability at least 1 − δ, all functions f ∈ F satisfy

   R(f) ≤ Rn(f) + 2 √( (d log(2en/d) − log δ) / n ).

Consequence: VC-dim finite =⇒ consistency


745
Summer 2019

Generalization bound with VC dimension (2)


More generally, the statement also holds the other way round:
Theorem 38
Empirical risk minimization is consistent with respect to F if and
only if VC(F) is finite.

Proof skipped.
Ulrike von Luxburg: Statistical Machine Learning
746
Summer 2019

Generalization bound with VC dimension (3)


Yet another interpretation of the generalization bound: how many
samples do we need to draw to achieve error at most ε?
I Set ε := 2 √( (d log(2en/d) − log δ) / n ), solve for n, and ignore all
  constants and logarithmic factors.
I Result: We need of the order n = d/ε² many sample points.
Ulrike von Luxburg: Statistical Machine Learning
747
748 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Rademacher complexity
Summer 2019

Rademacher complexity
The shattering coefficient is a purely combinatorial object, it does
not take into account what the actual probability distribution is.
This seems suboptimal.
Definition: Fix a number n of points. Let σ1 , ..., σn be i.i.d. tosses
of a fair coin (result is -1 or 1 with probability 0.5 each). The
Rademacher complexity of a function class F with respect to n is
defined as

   Radn(F) := E [ sup_{f∈F} (1/n) Σ_{i=1}^n σi f(Xi) ]

The expectation is both over the draw of the random points Xi and
the random labels σi .
It measures how well a function class can fit random labels.
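As an illustration (my own addition, not from the slides), the empirical
Rademacher complexity of the half-line class from Example 1 (positive class
= right half-line) can be estimated by Monte Carlo for a fixed sample:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 50, 2000
    X = np.sort(rng.uniform(0, 1, size=n))
    thresholds = np.concatenate(([-np.inf], X))   # all relevant thresholds

    total = 0.0
    for _ in range(reps):
        sigma = rng.choice([-1.0, 1.0], size=n)
        # sup over f in F of (1/n) sum_i sigma_i f(X_i), with f_t(x) = sign(x - t)
        total += max(np.mean(sigma * np.sign(X - t + 1e-12)) for t in thresholds)
    print(total / reps)   # small, and it decreases further as n grows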
Summer 2019

Rademacher complexity (2)


There exists a number of generalization bounds based on Rademacher
complexities, and they tend to be sharper than the ones based on
combinatorial concepts like shattering coefficients. They typically
look like this:
Theorem 39 (Rademacher generalization bound)
With probability at least 1 − δ, for all f ∈ F,

   R(f) ≤ Rn(f) + 2 Radn(F) + √( log(1/δ) / (2n) )

Proofs are beyond the scope of this lecture.


750
Summer 2019

Rademacher complexity (3)

Computing Rademacher complexities for function classes is in many


cases much simpler than computing shattering coefficients or VC
dimensions.
Ulrike von Luxburg: Statistical Machine Learning
751
Summer 2019

Generalization bounds: conclusions


I Generalization bounds are a tool to answer the question
whether a learning algorithm is consistent.
I Consistency refers to the estimation error, not the
approximation error.
I Typically, generalization bounds have the following form:
With probability at least 1 − δ, for all f ∈ F
Ulrike von Luxburg: Statistical Machine Learning

R(f ) ≤ Rn (f ) + capacity term + confidence term

The capacity term measures the size of the function class.


The confidence term deals with how certain we are about our
statement.
I There are many different ways to measure the capacity of
function classes, we just scratched the surface.
752
Summer 2019

Generalization bounds: conclusions (2)


I Generalization bounds are worst-case bounds: worst case over all
  possible probability distributions, and worst case over all
  learning algorithms that pick a function from F.
Ulrike von Luxburg: Statistical Machine Learning
753
754 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Controlling the approximation error


Summer 2019

Nested function classes


I So far, we always fixed a function class F and investigated
  whether the estimation error in this class vanishes as we get to
  see more data.
I However, we need to take into account the approximation error
as well.
I Idea is now: consider function classes that slowly grow with n:
Ulrike von Luxburg: Statistical Machine Learning
755
Summer 2019

Nested function classes (2)


I If we have few data, the class is supposed to be small to avoid
overfitting (generalization bound!)
I Eventually, when we see enough data, we can afford a larger
function class without overfitting. The larger the class, the
smaller our approximation error.
Ulrike von Luxburg: Statistical Machine Learning

There are two major approaches to this:


I Structural risk minimization: explicit approach

I Regularization: implicit approach


756
Summer 2019

Structural risk minimization (SRM)


I Consider a nested sequence of function spaces: F1 ⊂ F2 ⊂ ...
I We now select an appropriate function class and a good
function in this class simultaneously:

     fn := argmin_{m∈N, f∈Fm}  [ Rn(f) + capacity term(Fm) ]

I The capacity term is the one that comes from a generalization
  bound.
I If the nested function classes approximate the space of “all”
  functions, one can prove that such an approach can lead to
  universal consistency.
757
Summer 2019

Regularization
Recap: regularized risk minimization:

minimize Rn (f ) + λ · Ω(f )

where Ω punishes “complex” functions.


Ulrike von Luxburg: Statistical Machine Learning

The trick is now: Regularization is an implicit way of performing


structural risk minimization.
758
Summer 2019

Regularization (2)
Proving consistency for regularization is technical but very elegant:
I Make sure that your overall space of functions F is dense in
the space of continuous functions
Example: linear combinations of a universal kernel.
I Consider a sequence of regularization constants λn with
λn → 0 as n → ∞.
I Define function class Fn := {f ∈ F ; λn · Ω(f ) ≤ const}
I Choose λn → 0 so slowly that log N(Fn, n)/n → 0.
I On the one hand, this ensures that in the limit we won’t
  overfit: the estimation error goes to 0.
I On the other hand, if λn → 0, then Fn → F because
  λn · Ω(f) ≤ const  =⇒  Ω(f) ≤ const/λn → ∞.
  Hence, the approximation error goes to 0, so we won’t
  underfit.
759
Summer 2019

Regularization (3)
If you want to see the mathematical details, I recommend the
following paper:

Steinwart: Support Vector Machines Are Universally Consistent.


Journal of Complexity, 2002.
Ulrike von Luxburg: Statistical Machine Learning
760
Summer 2019

Brief history
I The first proof that there exists a learning algorithm that is
universally Bayes consistent was the Theorem of Stone 1977,
about the kNN classifier.
I The combinatorial tools and generalization bounds have
essentially been developed in the early 1970ies already (Vapnik,
Chervonenkis, 1971, 1972, etc) and refined in the years around
Ulrike von Luxburg: Statistical Machine Learning

2000.
I The statistics community also proved many results, in
particular rates of convergence. There the focus is more on
regression rather than classification.
I By and large, the theory is well understood by now, the focus
of attention moved to different areas of machine learning
theory (for example, online learning, unsupervised learning,
etc).
761
762 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Getting back to Occam’s razor


Summer 2019

Examples revisited
Remember the examples we discussed in the first lecture?
Ulrike von Luxburg: Statistical Machine Learning

The question was which of the two functions should be preferred.

Many of you had argued that unless we have a strong belief that
the right curve is correct, we should prefer the left one due to
“simplicity”.
763
Summer 2019

Examples revisited (2)


This principle is often called “Occam’s razor” or “principle of
parsimony”:

When we choose from a set of otherwise equivalent models, the


simpler model should be preferred.

Intuitive argument:
Ulrike von Luxburg: Statistical Machine Learning

“Occam’s razor helps us to shave off those concepts, variables or


constructs that are not really needed to explain the phenomenon.
By doing that, developing the model will become easier, and there
is less chance of introducing inconsistencies, ambiguities and
redundancies. “

These formulations can be found in many papers and text books, I don’t know the original source ...
764
Summer 2019

Occam’s razor vs. learning theory


However:
I The main message of learning theory was that we need to
control the size of the function class F.
I We had not at all talked about “simplicity” of functions!

Is this a contradiction? Is Occam’s razor wrong???


Ulrike von Luxburg: Statistical Machine Learning
765
Summer 2019

Occam’s razor vs. learning theory (2)

First point of view: we don’t need “simplicity”:

I Consider an example of a function class that just contains 10


functions, all of which are very “complicated” (not “simple”).
I For the estimation error, this would be great: we would soon
  be able to detect, with high probability, which of the functions
  minimizes the empirical risk.
I If the function class also happens to be able to describe the
underlying phenomenon (low approximation error), this would
be perfect.
I In this case, we do not need simple functions!!!
766
Summer 2019

Occam’s razor vs. learning theory (3)


Second point of view: Spaces of simple functions tend to be
small.

Example: Polynomials in one variable, with a discrete set of


coefficients:
   f(x) = Σ_{k=1}^d ak x^k   with  ak ∈ {−1, −0.99, −0.98, ..., 0.98, 0.99, 1}

There are about 200 polynomials of degree 1,
200² polynomials of degree 2,
200^d polynomials of degree d.

Here, the spaces get larger the more “parameters” we have.


767
Summer 2019

Occam’s razor vs. learning theory (4)


Both points of view come together if we talk about data
compression.
I A space with few functions can be represented with few bits
(say, by a small lookup table).
I A space with “simple” functions can be represented with few
bits as well (encode all the parameters).
Ulrike von Luxburg: Statistical Machine Learning

I A space of “complex” function cannot be compressed.


Intuitive conclusion:
I Spaces of simple functions are small, spaces of complex
functions tend to be large.
I Learning theory tells us that we should prefer small function
spaces.
I This often leads to spaces of simple functions.
768
Summer 2019

Occam’s razor vs. learning theory (5)


This intuition can be made rigorous and formal:
I Sample compression bounds in statistical learning theory

I The whole branch of learning based on the “Minimum


description length principle” (comprehensive book in this area
by Peter Grünwald)
Ulrike von Luxburg: Statistical Machine Learning
769
Summer 2019

Occam’s razor vs. learning theory (6)


Bottom line:
I The quantity that is important is not so much the simplicity of
the functions but rather the size of the function space.
I But spaces of simple functions tend to be small and are good
candidates for learning.
I Occam’s razor slightly misses the point, but is a good first
proxy. It is not always correct, but often...
Ulrike von Luxburg: Statistical Machine Learning
770
(*) Loss functions, proper and surrogate losses: SKIPPED

(*) Probabilistic interpretation of ERM: SKIPPED

(*) The No-Free-Lunch Theorem: SKIPPED


Fairness in machine learning

Low rank matrix methods

Introduction: recommender systems, collaborative filtering
Summer 2019

Recommender systems
Goal: give recommendations to users, based on their past behavior:
I Recommend movies (e.g., netflix)
I recommend music (e.g., lastfm)
I recommend products to buy (e.g., amazon)

ANY IDEAS HOW WE COULD DO THIS?


Ulrike von Luxburg: Statistical Machine Learning
777
Summer 2019

Recommender systems (2)


Content-based approach:
I Model products based on explicit features. Use these features
  to define a similarity function between products.
I If a user likes product A, then recommend products similar to
A.

Prominent example: Pandora Radio. You start with a song you like,
Ulrike von Luxburg: Statistical Machine Learning

and then Pandora plays similar songs.


778
Summer 2019

Recommender systems (3)


Collaborative approach:
I Forget about explicitly modeling users or features.

I Instead, implicitly model similarity of users and products based


on past shopping behavior.
I Consider user/product matrix with ratings. Defines an implicit
similarity between users (or products).
Ulrike von Luxburg: Statistical Machine Learning

I Then recommend similar items to similar users.

Prominent example: lastfm

ADVANTAGES / DISADVANTAGES OF THE TWO?


779
Summer 2019

Matrix factorization basics


Hastie, Tibshirani, Wainwright: Statistical learning with sparsity.
2015. Chapter 7.2
Ulrike von Luxburg: Statistical Machine Learning
780
Summer 2019

Recap: singular value decomposition (SVD)


Recall PCA:
I Eigenvalue decomposition for a symmetric matrix

I Best rank-k approximation of the matrix: based on highest k


eigenvalues
Now want to do something more general for arbitrary (non-square)
matrices.
Ulrike von Luxburg: Statistical Machine Learning
781
Summer 2019

Recap: singular value decomposition (SVD) (2)


Every (!) matrix can be decomposed as follows:
Ulrike von Luxburg: Statistical Machine Learning

U is the matrix of left singular vectors, V the right singular vectors,


and the diagonal of Σ contains the singular values.
782
Summer 2019

Recap: singular value decomposition (SVD) (3)


There is a simple relationship between SVD and PCA:

For any matrix A,


I the left singular vectors are the eigenvectors of AA'
I the right singular vectors are the eigenvectors of A'A
I the non-zero singular values are the square roots of the
  eigenvalues of both AA' and A'A.

PROOFS: EXERCISE!
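A quick numerical check of these relationships (my own addition; the random
matrix is just for illustration):

    import numpy as np

    A = np.random.default_rng(0).normal(size=(6, 4))
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1][:4]
    eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
    print(np.allclose(s**2, eig_AAt))   # squared singular values equal the
    print(np.allclose(s**2, eig_AtA))   # non-zero eigenvalues of AA' and A'A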
Summer 2019

SVD for rank-k approximation


Consider the following “Top-k-SVD” procedure:
I Given a n × d matrix A.

I Compute the SVD such that the singular values are sorted in
decreasing order.
I Keep the first k columns of U and V. Call the resulting
  matrices Uk ∈ R^{n×k} and Vk ∈ R^{d×k} (such that Vk' ∈ R^{k×d}).
I Keep the singular values σ1, ..., σk and write them in a
  diagonal matrix Σk ∈ R^{k×k}.
I Now define Ak := Uk Σk Vk'.
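A direct NumPy sketch of this procedure (my own illustration, not part of the
original slides):

    import numpy as np

    def top_k_svd(A, k):
        """Rank-k approximation A_k = U_k Sigma_k V_k' via truncated SVD."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s sorted decreasingly
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    A = np.random.default_rng(0).normal(size=(8, 5))
    for k in (1, 2, 5):
        print(k, np.linalg.norm(A - top_k_svd(A, k)))     # error shrinks with k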
Summer 2019

SVD for rank-k approximation (2)


Intuitive interpretation:
I Assume that A is a matrix recording ratings of n users about d
products.
I Then the top-k right singular vectors can be interpreted as
basic “customer types”. Each customer is a weighted mixture
of the basic customers.
Ulrike von Luxburg: Statistical Machine Learning

I Similarly, the top-k left singular vectors can be interpreted as


basic “product types”.
785
Summer 2019

SVD for rank-k approximation (3)


The Frobenius norm of a matrix is defined as ||B||_F := √( Σ_{ij} b_ij² ).

Theorem 40 (SVD as rank-k approximation)

The matrix Ak defined by the first k singular values/vectors solves
the following rank-k approximation problem:
Given A and k, find the matrix Ak with rank at most k such that
||A − Ak||_F is minimized.

Proof: EXERCISE (consider the hints in Exercise 7.2. p. 196 in


“statistical learning with sparsity”).

EXERCISE: COMPARE THIS RESULT WITH THE


CORRESPONDING RESULT FOR PCA!
786
Summer 2019

SVD for rank-k approximation (4)


Digest again what the intuitive interpretation is:
I We know the full product / user matrix.

I The k top singular vectors define k types of users / products.

I Based on these few types, we can “explain” the behavior of


everybody (up to a small approximation error).

Low rank matrix completion


Literature
• Hastie, Tibshirani, Wainwright: Statistical learning with
sparsity. 2015. Chapter 7

Some important original papers:


• Candes, Recht: Exact matrix completion via convex
optimization. Foundations of computational mathematics,
2009.
• Keshavan, Montanari, Oh: Matrix completion from a few entries.
  IEEE Transactions on Information Theory, 2010.
• Keshavan, Montanari, Oh: Matrix completion from noisy entries.
  JMLR, 2010.
• Mazumder, Hastie, Tibshirani: Spectral regularization
algorithms for learning large incomplete matrices. JMLR, 2010.

Netflix problem
General problem:
I Consider a huge matrix of user ratings of movies. Rows
correspond to movies, columns correspond to users, entries are
ratings on a scale from 1 to 5.
I We only know a few entries in this matrix.

I The matrix completion problem is to estimate the missing



entries in order to recommend new movies to a user.



Netflix problem (2)


History of the Netflix challenge:
I Launched in 2006

I Data: about 20,000 movies, 500,000 users, 10^8 ratings (that
  is, about 1% of the entries are known)
I Goal: predict the missing entries; error measure RMSE (root
  mean squared error, sqrt( Σ_{i=1}^{n} (zi − ẑi )^2 / n ))
I First team that beats Netflix's own algorithm by an


improvement of at least 10% wins a prize of 1 million dollars.
I Was finally achieved in 2009.

Matrix completion problem


General setup:
I Consider an m × n matrix which is unknown.

I We get to see some entries in the matrix.

I Assume that the position of the revealed entries is random (no


adversarial setting).
I Goal is to estimate the unknown entries as well as possible.

CAN YOU THINK OF EASY / DIFFICULT CASES? IS IT


ALWAYS POSSIBLE?

Matrix completion problem (2)


We need to make assumptions to be able to solve this problem
(inductive bias!). If the entries are not related to each other (say,
independent random numbers), there is no way in which we could
predict missing entries.

Matrix completion problem (3)


High-level idea from learning theory: A useful inductive bias is one
that leads to a “small” set of possible matrices.

Here is what everybody uses:

We are going to look for a matrix that has low rank.



(Just as a sanity check: a matrix with independent random entries typically


has high rank, the eigenvalues follow the semi-circle law.)

Matrix completion, first formulations


Denote by Ω the set of entries of a matrix Z that have been
observed: we know the values zij for all (i, j) ∈ Ω. We would like
to solve the following problem:
minimize rank(M ) subject to mij = zij for (i, j) ∈ Ω.
or a slightly weaker version

  minimize rank(M ) subject to Σ_{(i,j)∈Ω} (mij − zij )^2 ≤ δ

or the regularization version

  minimize Σ_{(i,j)∈Ω} (mij − zij )^2 + λ rank(M )

All of these formulations are NP hard :-(

CAN YOU SEE WHAT MAKES IT SO DIFFICULT?



Matrix completion, first approach using SVD


Here is a straightforward heuristic by which we can try to solve the
optimization problem:

Hard-Impute
I Have an initial guess for the missing entries ; matrix Z1

I Compute the SVD of Z1 , keep the first r singular components


; Z2

I Fill in the missing entries with the ones of Z2 , and start over
again ...
Sometimes this works reasonably.
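A minimal numpy sketch of this Hard-Impute heuristic (rank r fixed in advance; all names chosen here for illustration):

import numpy as np

def hard_impute(Z_obs, mask, r, n_iter=100):
    # Z_obs: matrix containing the observed entries, mask: True where an entry is observed
    Z = np.where(mask, Z_obs, 0.0)                 # initial guess: fill missing entries with zeros
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z_lowrank = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # keep the first r singular components
        Z = np.where(mask, Z_obs, Z_lowrank)       # clamp observed entries, impute the rest, repeat
    return Z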

But let’s try to think about alternatives ... one option for
non-convex optimization problems is always to construct a convex
relaxation (have seen this before, at least twice, where? )

Trace as convex relaxation of rank


Let us try to find a convex relaxation of the rank function:
I If σ := σ(A) denotes the vector of singular values of matrix A,
then
rank(A) = kσk0
I Recall the standard approach in sparse regression (Lasso). We
  relaxed the 0-norm to the 1-norm, which is convex:

  kσk1 = Σi |σi |.

We now use this as a norm for matrices, it is called the nuclear


norm or the trace norm:
kAktr := kσ(A)k1 .
One can prove that the nuclear norm is the tightest convex
relaxation of the rank of a matrix.



Trace norm regularization


Now consider the following optimization problems:

  minimize kM ktr subject to Σ_{(i,j)∈Ω} (mij − zij )^2 ≤ δ        (∗)

  minimize (1/2) Σ_{(i,j)∈Ω} (mij − zij )^2 + λkM ktr              (∗∗)

These two problems are essentially the same, once in the natural
formulation (∗) and once in the regularization / Lagrangian
formulation (∗∗).

Trace norm regularization (2)


Two big questions:
I Can we give an efficient algorithm that can find the global
optimum, either in formulation (∗) or in (∗∗)?
I If yes, what can we say about the theoretical properties of the
global optimum, how close is it going to be to the matrix we
are looking for? In particular, how many entries do we need to
observe to find a good reconstruction?

Solving (∗), naive algorithm: semi-definite


program
The first formulation of the problem is a semi-definite program. In
principle, SDPs can be solved in polynomial time, but “polynomial”
can still be very long... There are general-purpose solvers for such
problems, but they are so slow that they only work for small
instances.

We skip the details.



Solving (∗∗) efficiently: soft-impute


Here is a strategy to solve (∗∗):
I Start with initial guesses for the missing values.
I Compute the SVD, “soft-threshold” the singular values by
some threshold λ.
I Repeat until convergence.

Soft-thresholding:
I Given the SVD of a matrix Z = U DV 0 , denote the singular
values by di .
I We define Sλ (Z) := U Dλ V' where Dλ is the diagonal matrix
  with diagonal entries (di − λ)+ := max(di − λ, 0)
I Soft thresholding decreases the trace norm and also often
decreases the rank of a matrix.
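A minimal numpy sketch of the soft-thresholding operator Sλ (names chosen here for illustration):

import numpy as np

def soft_threshold(Z, lam):
    # SVD of Z, then shrink the singular values towards zero by lam
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    d_shrunk = np.maximum(d - lam, 0.0)      # (d_i - lambda)_+
    return U @ np.diag(d_shrunk) @ Vt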

Solving (∗∗) efficiently: soft-impute (2)


Let’s first consider one step of soft-thresholding on a completely
known matrix Z (no missing entries):

Proposition 41
Consider a matrix Z that is completely known, and choose some
λ > 0. Then the solution of the optimization problem

  min_M kZ − M k2F + λkM ktr

is given by the result Sλ (Z) of one round of soft-thresholding.

Proof: see Mazumder, Hastie, Tibshirani: Spectral regularization


algorithms for learning large incomplete matrices. JMLR, 2010.

Solving (∗∗) efficiently: soft-impute (3)


Now we want to use a similar approach to complete the matrix Z in
case it is just partially observed (problem (∗∗)).

Introduce notation:
I Denote by Ω the set of matrix entries that are known.

I Define the “projection” PΩ (Z) as the matrix that has the


original values zij at all the observed positions of Z, and 0

otherwise (that is, fill the unobserved entries with zeros).


I With this definition,
  Σ_{(i,j)∈Ω} (zij − mij )^2 = kPΩ (Z) − PΩ (M )k2F .
I Define PΩ⊥ (Z) as the “projection” of the matrix Z on the
entries that are NOT in Ω (so that Z = PΩ (Z) + PΩ⊥ (Z)).
Solving (∗∗) efficiently: soft-impute (4)

Algorithm 1 (Soft-Impute, from Mazumder, Hastie, Tibshirani, 2010): computes
a sequence of solutions for different values of λ, using warm starts.

1. Initialize Z_old = 0.
2. Do for λ1 > λ2 > . . . > λK :
   (a) Repeat:
       i. Compute Z_new ← S_{λk} ( PΩ (X) + PΩ⊥ (Z_old) ).
       ii. If kZ_new − Z_old k2F / kZ_old k2F < ε, exit.
       iii. Assign Z_old ← Z_new .
   (b) Assign Ẑ_{λk} ← Z_new .
3. Output the sequence of solutions Ẑ_{λ1} , . . . , Ẑ_{λK} .

Intuition: the algorithm repeatedly replaces the missing entries of X with
the entries of the current guess, and then updates the guess by a
soft-thresholded SVD.
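A minimal numpy sketch of the inner loop for a single fixed λ (reusing the soft-thresholding sketch from above; all names chosen here for illustration):

import numpy as np

def soft_threshold(Z, lam):
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(d - lam, 0.0)) @ Vt

def soft_impute(X_obs, mask, lam, eps=1e-5, max_iter=500):
    # X_obs: observed values (arbitrary at unobserved positions), mask: True where observed
    Z = np.zeros_like(X_obs)                                   # Z_old = 0
    for _ in range(max_iter):
        Z_new = soft_threshold(np.where(mask, X_obs, Z), lam)  # S_lam( P_Omega(X) + P_Omega^perp(Z_old) )
        if np.linalg.norm(Z_new - Z, "fro")**2 < eps * (np.linalg.norm(Z, "fro")**2 + 1e-12):
            return Z_new
        Z = Z_new
    return Z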

Solving (∗∗) efficiently: soft-impute (5)


I Inner loop (a) for fixed λ: clamp the observed values of the
matrix, fill the rest by a low-rank approximation, until
convergence
I Outer loop (2): start with a case that is easy (λi large, matrix
low rank) and work our way towards the more difficult
situation of smaller λi

Solving (∗∗) efficiently: soft-impute (6)


Properties:
I It can be proved that this algorithm always converges to the
  global solution of (∗∗) (for a suitable choice of the sequence
  (λk )k=1,...,K ).
I It can be implemented efficiently even for huge matrices (just a
  couple of hours for the whole Netflix dataset). Main trick: we can
  decompose the dense matrix Z into a sum of a sparse matrix and
  a low rank matrix. Note that we can write

    PΩ (Z) + PΩ⊥ (Z_old) = [ PΩ (Z) − PΩ (Z_old) ] + Z_old

  where the first term in brackets is sparse (it has at most |Ω|
  non-zero entries) and Z_old is low rank. This can be exploited
  cleverly in the algorithm.

R-package: softImpute

Details: Mazumder, Hastie, Tibshirani: Spectral regularization
algorithms for learning large incomplete matrices. JMLR, 2010.
Solving (∗∗) efficiently: soft-impute (7)
Soft-impute on the Netflix data:
[Figure 7.2 from "Statistical learning with sparsity": Netflix competition
data. Left: training and test RMSE as a function of the rank for the
iterated-SVD (Hard-Impute) and the convex spectral-regularization algorithm
(Soft-Impute). Right: test RMSE versus training RMSE; the dotted line is
Netflix's own algorithm (the baseline).]

Sanity check: random guessing would have RMSE ≈ 2



Theoretical results on matrix completion


Setting:
I Consider an arbitrary matrix Z. For simplicity, assume that Z
is square of size p × p.
I Assume that we observed n entries of the matrix, drawn
uniformly at random.
I Question: how large does n need to be such that we can

successfully reconstruct the matrix Z (exactly or


approximately)?

Condition: no empty columns or rows


Problem of empty columns:

If there is a column (or a row) that does not have any observed
entry, it will be impossible to reconstruct the matrix.

QUESTION TO YOU: We sample n elements. Each element


belongs to one of p columns. How large does n need to be such

that, with reasonably high probability, we have at least one element


per column?

Condition: no empty columns or rows (2)


ANSWER: this is the coupon collector problem, we need at least
p log p samples.
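A quick simulation to illustrate the coupon collector effect (a sketch; all numbers chosen here for illustration):

import numpy as np

rng = np.random.default_rng(0)
p = 100                                   # number of columns
n_draws = int(p * np.log(p))              # roughly p log p samples
hits = rng.integers(0, p, size=n_draws)   # each sample falls into a uniformly random column
print(len(np.unique(hits)), "of", p, "columns hit")  # typically all or all but ~1 columns are covered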

Condition: enough parameters to express rank r


Number of “parameters” for a rank-r matrix:
I A rank-r matrix of dimension p × p can be described by r
vectors of length p ; rp parameters
I In general, we cannot assume that we can “compress” these rp
entries any further
I So it is plausible that we won’t be able to perfectly reconstruct

such a matrix if we observe less than rp entries of the matrix.

(of course, this argument is really hand-waving, but often such
hand-waving intuition helps to get a first feeling for a problem; the
next step is then to make it precise)
Condition: coherency

Consider the following matrix: the rank-one matrix Z = e1 e1' , which has a
single 1 in its upper left corner and zeros everywhere else.

Assume we want to solve EXACT recovery:
I It is of rank one (so the problem should be easy, few
  parameters to learn).
I However, there is only one important entry.
I If we sample less than of the order p2 entries, the likelihood
  is high that we never get to see this particular entry.
  So if we are after exact recovery, we have a problem.

Condition: coherency (2)


The problem in this example is:
I Some entries are much more important than others.

I In mathematical terms: the eigenvectors of this matrix are too


much aligned with the standard basis of Rp .

Condition: coherency (3)


To deal with this problem:

We want to “measure” up to which extent the eigenvectors of the


matrix are aligned with the standard basis: this leads to the notion
of coherence.

Definition: Let U be a subspace of Rd of dimension r and PU the
orthogonal projection on U . Then the coherence of U wrt the
standard basis (ei )i is defined as

  µ(U ) := (d/r) · max_{i=1,...,d} kPU ei k2

FOR MATRIX COMPLETION, IS IT BETTER TO HAVE SMALL


OR LARGE COHERENCE?
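A minimal numpy sketch that computes this coherence for a subspace given by an orthonormal basis (all names chosen here for illustration):

import numpy as np

def coherence(U_basis):
    # U_basis: d x r matrix with orthonormal columns spanning the subspace U
    d, r = U_basis.shape
    # ||P_U e_i||^2 equals the squared norm of the i-th row of U_basis
    row_norms_sq = np.sum(U_basis**2, axis=1)
    return (d / r) * np.max(row_norms_sq)

d = 100
e1 = np.zeros((d, 1)); e1[0] = 1.0
print(coherence(e1))                        # spanned by a standard basis vector: d/r = 100 (worst case)
print(coherence(np.full((d, 1), d**-0.5)))  # constant vector: coherence 1 (best case)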

Condition: coherency (4)


To get intuition, consider the case where U is spanned by one
vector:
I The smaller µ(U ), the “easier” the matrix will be to recover.

I Maximum value of coherence occurs if U = span(ei ) (more
  generally, if ei ∈ U ). Then we have kPU ei k = 1.
I Minimum value of coherence occurs if U is spanned by the
  vector (1/√n, ..., 1/√n)' .

I Intuition: More generally, coherence is low (good) if all entries


of the vector have about the same order of magnitude. Then
each entry contains about the same amount of information, so
sampling few entries should be fine.

Condition: coherency (5)

Examples:

The matrix we started with (Z = e1 e1'):
[example matrix and its coherence shown as a figure, omitted in this transcript]

Condition: coherency (6)

Same matrix, just flipped entries:
[example matrix shown as a figure, omitted in this transcript]

As a sanity check, the all ones matrix:
[example matrix shown as a figure, omitted in this transcript]

Condition: coherency (7)


A random rank-r matrix:

See plot on next page:



Condition: coherency (8)


Plot of max_{i=1,...,n} kPV ei k for a random rank-r matrix, n = 100:
[plot omitted in this transcript]

Guarantee for exact recovery


One can prove guarantees of the following flavor (p size of matrix, r
rank):

With high probability, exact recovery is possible if the number of


observed entries is at least

N ≥ Crp log p

Here, C is a constant that depends on the coherence of the matrix:


I Coherence low: N ≈ rp log p

I Coherence high: C · r ≈ p, such that we need to sample about


p2 log p entries.

Proofs are beyond the scope of this lecture.

GIVEN OUR PREVIOUS OBSERVATIONS, DO YOU THINK


THIS IS A GOOD OR A BAD RESULT?
Guarantee for exact recovery (2)

Concretely, the coherency-based theorem for exact recovery looks as
follows (Candes, 2009):

Theorem: Let M be an n1 × n2 matrix of rank r with singular value
decomposition U ΣV ∗ . Without loss of generality, impose the conventions
n1 ≤ n2 , Σ is r × r, U is n1 × r and V is n2 × r. Assume that

  A0: The row and column spaces have coherences bounded above by some
      positive µ0 .
  A1: The matrix U V ∗ has a maximum entry bounded by µ1 sqrt(r/(n1 n2 ))
      in absolute value for some positive µ1 .

Suppose m entries of M are observed with locations sampled uniformly at
random. Then if

  m ≥ 32 max{µ1^2 , µ0 } r (n1 + n2 ) β log^2 (2n2 )

for some β > 1, the minimizer to the problem

  minimize kXk∗  subject to Xij = Mij for (i, j) ∈ Ω

is unique and equal to M with probability at least
1 − 6 log(n2 ) (n1 + n2 )^{2−2β} − n2^{2−2β^{1/2}} .

(here k · k∗ denotes the trace norm).

Guarantee for approximate recovery


Guarantee from Keshavan et al, 2010:
[theorem statement shown as an image, omitted in this transcript]

Simulations: noise-free setting


Assume you want to run simulations for low rank matrix
completion, under controlled conditions.

HOW COULD YOU GENERATE A “RANDOM” LOW RANK


MATRIX TO PLAY WITH?

Simulations: noise-free setting (2)


I Model the ground truth by a simple low rank model:
I Generate the p × r matrices U and V with independent
random entries, normally distributed according to N (0, 1)
(in the figures below: p = 20 or 40, and r = 1 or 5. )
I Define Z = U · V 0 .
(WHAT IS THE RANK OF THESE MATRICES? WHAT
CAN YOU SAY ABOUT THE SINGULAR VALUES?)

I Generate the toy data: Sample n random entries from this


matrix.
I Try to recover the ground truth:
I Use soft-impute to complete the matrices, results in Ẑ.
I Check whether kẐ − Zk ≈ 0
I Repeat this experiment many times and report fraction of
correctly recovered matrices.
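A minimal numpy sketch of this simulation setup; the call to soft_impute refers to the sketch given earlier and is purely illustrative, all names and parameter values are chosen here for illustration:

import numpy as np

rng = np.random.default_rng(0)
p, r = 20, 1                                   # matrix size and true rank
U = rng.normal(size=(p, r))
V = rng.normal(size=(p, r))
Z = U @ V.T                                    # ground truth, rank r (with probability 1)

prop_missing = 0.5
mask = rng.random((p, p)) > prop_missing       # True = entry observed

Z_hat = soft_impute(Z, mask, lam=0.1)          # lam chosen ad hoc; smaller lam pushes towards exact fit
exact = np.linalg.norm(Z_hat - Z, "fro") < 1e-3 * np.linalg.norm(Z, "fro")
print(exact)                                   # repeat many times and report the fraction of successes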

Simulations: noise-free setting (3)

[Figure 7.4 from "Statistical learning with sparsity": convex matrix
completion in the no-noise setting. Shown is the probability of exact
completion (mean ± one standard error) as a function of the proportion of
missing entries, for p × p matrices with p ∈ {20, 40}. The true rank of the
complete matrix is one in the left panel and five in the right panel.]

WHAT CAN YOU SEE?

Simulations: noise-free setting (4)


I Problem gets harder the fewer entries we observe (curves
decrease from left to right). Makes sense.
I Problem gets more difficult if original matrix has a higher
rank (compare left and right figure). Makes sense.

I For a fixed r, the likelihood of exact recovery increases with the
  dimension p (that is, the problem seems to get easier if the
  original matrix is larger!!!). Not entirely sure here, let's
  discuss:

The last point seems surprising, but here are two possible explanations:

Simulations: noise-free setting (5)


I Consider the recovery theorem of Keshavan: everything else
being fixed, the necessary number of samples m increases as
mp ≈ p log p, but probability of exact recovery increases as
1 − 1/p3 . So if we would compare the probability of recovering
all entries from mp measurements, then this probability would
increase with p.
(Note that strictly speaking, the theorem does not make a

statement that says that if the proportion of missing samples is


fixed to a particular constant, that then the probability of
recovery grows with p. But I guess one could extract such a
statement from the proof.)

Simulations: noise-free setting (6)


I Another hand-waving explanation: to recover a rank-r matrix
of size p means to recover rp parameters. The matrix has
const · p2 many entries that can serve as our source of
information. So the ratio between what we want and what we
have is rp/p2 = r/p, which for fixed r gets better when p
increases.
I All these explanations are a bit ad hoc, and if I really had to

find out, I would run simulations on my own ...



Simulations: noisy setting


Noisy setting:
I Generate the matrix Z as before.

I Now add Gaussian noise with standard deviation 0.5 to each of


the entries of Z, this results in Znoisy .
I Now try to reconstruct Z, when you observe entries from
Znoisy (using Soft-impute).

I Plot

  average relative error := (average Frobenius norm error) / (noise standard deviation)
(WHAT IS THE BEST AVERAGE RELATIVE ERROR YOU
COULD HOPE FOR? )

Results:

Simulations: noisy setting (2)

[Figure 7.6 from "Statistical learning with sparsity": matrix completion via
Soft-Impute in the noisy setting. The plots show the imputation error as a
function of the proportion of missing entries for 40 × 40 matrices, rank 1 in
the left panel and rank 5 in the right panel. Shown is the mean error (± one
standard error) over 100 simulations, relative to the noise standard
deviation.]

Outlook / literature
I We just scratched the surface, there are many more variants of
the problem, and also many more algorithms.
I If you are interested, the book “Statistical learning with
sparsity” is a good starting point.
History:

I PhD thesis Fazel 2002: nuclear norm as surrogate for rank


I Nati Srebro et al, 2005, nuclear norm relaxations, with first
generalization bounds.
I Candes, Recht: Exact matrix completion via convex
optimization. Foundations of computational mathematics
(FOCM), 2009. Bounds in exact case.
I Netflix challenge: 2006 - 2009.

Compressed sensing
Book chapters:
• Hastie, Tibshirani, Wainwright: Statistical learning with
sparsity. 2015. Chapter 10.

• Shalev-Shwartz, Ben-David: Understanding Machine Learning,


Section 23.3.

Motivation
Consider the camera in your phone:
I If you take a picture, it first generates a raw image that is
stored by a pixel-based representation (e.g., rgb values for each
pixel).
I Then it compresses the picture by representing it in a suitable
basis (say, a wavelet basis) and generates a compressed version

of the image (say, a jpg file).


Motivation (2)

[Figure (reproduced from the IEEE Signal Processing Magazine, March 2008;
also Figure 1.2 in "Statistical learning with sparsity"): (a) Original
megapixel image with pixel values in the range [0, 255] and (b) its wavelet
transform coefficients (arranged in random order for enhanced visibility).
Relatively few wavelet coefficients capture most of the signal energy; many
such images are highly compressible. (c) The reconstruction obtained by
zeroing out all the coefficients in the wavelet expansion but the 25,000
largest (pixel values are thresholded to the range [0, 255]). The difference
from the original picture is hardly noticeable. The image can be perfectly
recovered from just 96,000 incoherent measurements.]

Motivation (3)
Idea: it would be great if we could skip the first step and directly
capture the data in the better representation.

This is called compressed (compressive) sensing.

Applications:
I Cameras with little power / storage. Take a picture with less

pixels, but achieve the same quality in the end.


I MRI / tomography: scans parts of the body, scanning time
increases for larger images. Want to speed up scanning (take
less pictures) but still have the same quality.

Setup
Assume we observe a vector x ∈ Rd .
I Typically it is not sparse in the standard basis, that is, kxk0
is close to d
I But it might be sparse in a different basis: There exists an
orthonormal matrix U such that x = U α and α is a sparse
vector: kαk0 =: s is small

I If we would know the basis U and would have a technical way


to measure the signal in this basis directly, this would be great.
I Goal is now: construct a basis that does the job.

Setup (2)
Notation in the following:
I d dimension of the original space (high)

I s true sparsity of the signal in the basis U (low)

I k the sparsity we actually achieve (hopefully low as well)

We always have s ≤ k ≤ d.

Example: Single pixel camera


I Standard camera: record millions of pixels and then apply
compression (e.g. jpeg compression) after the picture has been
taken.
I New approach: we use only a single pixel detector to create
images and we gather only a small fraction of the information,
effectively compressing the image while taking it.

Example: Single pixel camera (2)


[Figure: A schematic diagram of the "one-pixel camera." The "DMD" is the grid
of micro-mirrors that reflect some parts of the incoming light beam toward
the sensor, which is a single photodiode. Other parts of the image (the black
squares) are diverted away. Each measurement made by the photodiode is a
random combination of many pixels. 1600 random measurements suffice to create
an image comparable to a 4096-pixel camera. (Figure courtesy of Richard
Baraniuk, from "What's Happening in the Mathematical Sciences".)]

Original resolution: d = 64 × 64 = 4096.


Compressed: k = 1600 measurements.

Example: Single pixel camera (3)


How does it work? See figure below:
I uses an array of bacteria-sized mirrors to acquire a random
sample of the incoming light and project it on a single
photodiode.
I Each mirror can be tilted in one of two ways, either to reflect
the light toward the single sensor or away from it. The
photodiode thus receives a linear combination of the images on

all the mirrors that are “on”.


I Thus the light that the sensor receives is a weighted average of
many different pixels, all combined into one pixel.
I One configuration wi of the mirrors gives rise to one linear
measurement of our signal.
I Repeat this procedure for k different mirror configurations
w1 , ..., wk and store the k measurements the photodiode
receives.
839
Summer 2019

Example: Single pixel camera (4)


Cool result: By taking on the order of Θ(s log(d/s)) snapshots, with a
different random selection of pixels each time, the single-pixel
camera is able to acquire a recognizable picture with a resolution
comparable to d pixels.
[Figure "One is Enough": A photograph taken by the "single-pixel camera"
built by Richard Baraniuk and Kevin Kelly of Rice University. (a) A
photograph of a soccer ball, taken by a conventional digital camera at
64 × 64 resolution. (b) The same soccer ball, photographed by a single-pixel
camera. The image is derived mathematically from 1600 separate, randomly
selected measurements, using a method called compressed sensing. (Photos
courtesy of R. G. Baraniuk, Compressive Sensing [Lecture Notes], Signal
Processing Magazine, July 2007. © 2007 IEEE.)]

Example: Single pixel camera (5)


Nice blog discussion by Terence Tao on this topic:
https://terrytao.wordpress.com/2007/04/13/
compressed-sensing-and-single-pixel-cameras/

Key steps in compressed sensing


1. We design k “linear measurements” w1 , ...., wk ∈ Rd (in
applications, this is done by specific hardware, see later).
2. Nature picks an unknown signal x ∈ Rd (d large)
3. We directly receive the measurement results
x̃1 = hw1 , xi, x̃2 = hw2 , xi, ..., resulting in the measurement
vector x̃ ∈ Rk (k reasonably small).

4. We are now supposed to reconstruct x ∈ Rd from x̃ ∈ Rk .

Note: The goal is to design a single (!) set of measurement vectors


w1 , ..., wk that works well for all (!) signals x in the sense that we
are able to reconstruct with little error.

Compressed sensing
Another way to describe it:

I We don’t measure the signal x directly, but just a


“compressed” version of it, namely we measure

x̃ = W x ∈ Rk

where W is a k × d-matrix with k ≪ d. The matrix is known


to us, we choose it before we see any data.
I Now we want to reconstruct the high-dimensional signal x
from the low-dimensional representation x̃.

WHAT WOULD BE THE NAIVE WAY OF RECONSTRUCTION?


WHY DOESN’T IT WORK?

Compressed sensing (2)


I To reconstruct, we would need to solve the linear system
x̃ = W x for x.
I However, the latter is heavily underdetermined (we have k
  equalities but d unknowns, with k ≪ d). There are infinitely
many solutions to this linear system.

The trick is going to be:



I We need to make assumptions on the x we are looking for.

I In particular, we assume that x is sparse in some basis. We will


see below that this does the job.

Compressed sensing main result


Theorem 42 (Compressed sensing with random
measurements)
Fix a signal length d and a sparsity level s. Let W be a k × d
matrix with k = Θ(s log(d/s)), with each of its entries chosen
independently from a standard normal distribution N (0, 1). Then,
with high probability over the choice of W , every s-sparse signal

can be efficiently recovered from x̃ = W x by the following


optimization problem:

minimize kxk1 subject to x̃ = W x
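A minimal sketch of this recovery step, solving the l1-minimization as a linear program with scipy (all sizes, names and the random instance are chosen here for illustration):

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, s, k = 200, 5, 40                       # k of the order s log(d/s)
W = rng.normal(size=(k, d))                # random Gaussian measurement matrix

x = np.zeros(d)                            # an s-sparse signal
x[rng.choice(d, s, replace=False)] = rng.normal(size=s)
x_tilde = W @ x                            # the k measurements

# minimize ||x||_1 s.t. W x = x_tilde, written as an LP in the variables (x, t):
# minimize sum(t) subject to -t <= x <= t and W x = x_tilde
c = np.concatenate([np.zeros(d), np.ones(d)])
A_ub = np.block([[np.eye(d), -np.eye(d)], [-np.eye(d), -np.eye(d)]])
b_ub = np.zeros(2 * d)
A_eq = np.hstack([W, np.zeros((k, d))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=x_tilde,
              bounds=[(None, None)] * d + [(0, None)] * d)
x_hat = res.x[:d]
print(np.max(np.abs(x_hat - x)))           # typically close to 0: (near-)exact recovery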

Compressed sensing main result (2)


DO YOU THINK THAT k = Θ(s log(d/s)) IS GOOD OR BAD?

Compressed sensing main result (3)


I To measure a signal with sparsity s, we will definitely need at
least s measurements.
I s log(d/s) is definitely much smaller than d, it is close to s.
So it looks really great!

Some hand-wavy intuition why the result is exactly


k = Θ(s log(d/s)):

I We know that we are looking for a vector of length d which


has only s non-zero components (in some appropriate basis).
But we don’t know which are the non-zero components.
I There are (d choose s) subsets of size s among the d components.
I To recover the vector we could try out all different such
subsets, and then reconstruct based on these subsets.

Compressed sensing main result (4)


I Any efficient algorithm to do so would need to be able to
distinguish at least log (d choose s) ≈ Θ(s log(d/s)) many situations.

(This argument is similar to the proof for the lower bound in


comparison-based sorting ...)

Compressed sensing main result (5)


Key steps in the proof of the theorem:
1. If we choose the matrix W such that is has the “restricted
isometry property” (RIP, see below), then any k-sparse vector
x can be reconstructed from its compressed image x̃ with only
little distortion, by an inefficient algorithm using `0 -norm
optimization.
2. The reconstruction of x from x̃ can be calculated equally well

using `1 -norm optimization (rather than `0 -norm). This is very


surprising!
3. It is easy to find matrices W that have the RIP property: we
can use a random matrix with k = Ω(s log d) where s is the
sparsity of the signal and d the original dimension of the space.

Proof step 1: define RIP matrices


Definition (Restricted Isometry Property):

A matrix W ∈ Rk×d is (ε, s)-RIP if for all x ≠ 0 with kxk0 ≤ s we
have

  | kW xk22 / kxk22 − 1 | ≤ ε

Intuitively:

Multiplying an RIP-matrix to a sparse vector does not considerably


change the norm of the vector, no matter which vector we choose.
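The property cannot be verified exhaustively in practice (it quantifies over all s-sparse vectors), but a small Monte Carlo sketch illustrates the near-isometry on random sparse vectors. Note the 1/√k scaling used here, chosen so that E kW xk2 = kxk2; different references use different scaling conventions:

import numpy as np

rng = np.random.default_rng(0)
d, s, k = 1000, 10, 200
W = rng.normal(size=(k, d)) / np.sqrt(k)       # scaling so that E||Wx||^2 = ||x||^2

worst = 0.0
for _ in range(1000):                          # random s-sparse test vectors
    x = np.zeros(d)
    x[rng.choice(d, s, replace=False)] = rng.normal(size=s)
    ratio = np.linalg.norm(W @ x)**2 / np.linalg.norm(x)**2
    worst = max(worst, abs(ratio - 1))
print("largest observed distortion:", worst)   # typically well below 1 for these sizes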

Proof step 1: Perfect reconstruction from RIP


using `0
The following theorem shows that RIP matrices yield a lossless
compression for sparse vectors:

Theorem 43 (Reconstruction based on 0-norm)


Let W be (ε, 2s)-RIP for some ε < 1, x with kxk0 ≤ s (that is, x is
sparse in the standard basis of Rd ), y = W x the compression of x

by matrix W . Then the reconstruction

x̃ := argmin{kvk0 ; v ∈ Rd , W v = y}

coincides exactly with x.



Proof step 1: Perfect reconstruction from RIP


using `0 (2)
Proof of the theorem, by contradiction:
I Assume that x̃ 6= x.
I By definition of x̃ we have

kx̃k0 ≤ kxk0 ≤ s

In particular, kx − x̃k0 ≤ 2s.


I Now apply the RIP property to the vector z := x − x̃. Recall
that W x = W x̃, hence W z = 0.
I Then the RIP property gives

  | kW zk22 / kzk22 − 1 | = |0 − 1| = 1 > ε,

  which is a contradiction.



Proof step 1: Perfect reconstruction from RIP


using `0 (3)
Note that this theorem immediately gives a first algorithm for
reconstructing x from its compressed representation x̃.

WHICH ONE? IS IT A GOOD ONE?



Proof step 1: Perfect reconstruction from RIP


using `0 (4)
We need to solve an `0 -optimization problem. This is
combinatorial, hard, undesirable ...

Proof step 2: Perfect reconstruction from RIP


using `1
Now comes the very surprising result: we get exact (!) recovery
even if we replace the `0 norm by the `1 norm:

Theorem 44 (Reconstruction based on 1-norm)


Under the same assumptions as before:

argmin{kvk0 ; v ∈ Rd , W v = y} = argmin{kvk1 ; v ∈ Rd , W v = y}

Proof: omitted, a nice writeup can be found in Chapter 23.3 of


Shalev-Shwartz and Ben-David.

WHY IS THIS SURPRISING? WHY IS IT INTERESTING?



Proof step 2: Perfect reconstruction from RIP


using `1 (2)
The theorem gives us a pretty efficient (=polynomial) way to
exactly (!) reconstruct the original signal from the sparse one, by
solving a linear program!

Remarks:

I There exists an even stronger version of the theorem which


does not assume that the original vector is s-sparse.
Essentially, the statement says that we can perfectly recover
the s largest components.
I There also exists a version of the theorem which only assumes
  that the signal is sparse in some unknown basis (not the
  original one).

Proof step 3: Constructing RIP matrices


We still haven’t clarified how we actually construct the compression
matrix W :

Theorem 45 (Random matrices are RIP)


(i) Let s ≤ d be an integer, ε, δ ∈ ]0, 1[. Choose
    k ≥ const · s log(d/(δε)) / ε^2 . Now choose W ∈ Rk×d such that each
    entry is drawn randomly from a normal distribution N (0, 1/s).
Then, with probabilitiy 1 − δ (over the choice of the matrix),
the matrix W is (ε, s)-RIP.
(ii) More generally, if U is any d × d orthonormal matrix, then with
probability 1 − δ, the matrix W U is (ε, s)-RIP.

Proof: omitted, a nice writeup can be found in Chapter 23.3 of


Shalev-Shwartz and Ben-David.

Proof step 3: Constructing RIP matrices (2)

Remarks:
I This result is closely related to the theorem of
Johnson-Lindenstrauss, which is widely used in randomized
algorithms.
I The second part of the theorem takes care of the situation

that the signal is not sparse in the original basis, but a


different basis, by additionally applying a basis transformation
U to the signal.

More intuition: a different way to tell the same


story
Compressed sensing is advantageous whenever
I signals are sparse in a known basis

I measurements (or computation at the sensor end) are


expensive
I but computations at the receiver end are cheap.

More intuition: a different way to tell the same


story (2)
I One measures a relatively small number of random linear
combinations of the signal values — much smaller than
the number of signal samples nominally defining it.
I However, because the underlying signal is compressible, the
nominal number of signal samples is still an overestimate of
the effective number of degrees of freedom of the signal.

I As a result, the signal can be reconstructed with good


accuracy from relatively few measurements by a clever
nonlinear procedure.

More intuition: a different way to tell the same


story (3)
When does it work?

Transform sparsity: The desired image should have a sparse


representation in a known transform domain (i.e., it must be
compressible by transform coding).

Incoherence of undersampling artifacts: The artifacts in linear


reconstruction caused by undersampling should be incoherent (noise
like) in the sparsifying transform domain.

Nonlinear reconstruction: The image should be reconstructed by


a nonlinear method that enforces both sparsity of the image
representation and consistency of the reconstruction with the
acquired samples.
Example: time series

The following example is taken from Lustig, M., Donoho, D. L.,
Santos, J. M., Pauly, J. M. (2008). Compressed sensing MRI.
Signal Processing Magazine, IEEE, 25(2), 72-82.

Sparse signal, as it would be in the appropriate basis (say, a vector
of Fourier coefficients of a time series):
[figure panel from Lustig et al., omitted in this transcript]


Example: time series (2)

Signal in the "default basis" (say, the time series itself, not sparse):
[figure panel from Lustig et al., omitted in this transcript]

Assume it is too costly to sample the whole time series completely.
Example: time series (3)

Obvious first idea: equispaced undersampling.
I Just measure ("sense") the signal at equispaced positions (in
  the image on the previous slide, at the positions indicated by
  the red dots at the bottom).
I Replace the remaining entries with 0.
I Go over to the sparse basis and represent the signal there.

Result: artifacts called "aliasing". It does not work at all!
[figure panels from Lustig et al., omitted in this transcript]
Example: time series (4)

The compressed sensing approach: Random undersampling
I Instead of sampling at equispaced positions, randomly pick
  some entries (in the image before, this is indicated by the red
  dots at the top).
I Try to represent the image in the sparse basis:
  [figure panels from Lustig et al., omitted in this transcript]

Works! If we threshold the small Fourier coefficients, we are
left with the sparse representation of the signal.

Example: time series (5)


[Figure: Reconstructing a sparse wave train. (a) The frequency spectrum of a
3-sparse signal. (b) The signal itself, with two sampling strategies: regular
sampling (red dots) and random sampling (blue dots). (c) When the spectrum is
reconstructed from the regular samples, severe "aliasing" results because the
number of samples is 8 times less than the Shannon-Nyquist limit. It is
impossible to tell which frequencies are genuine and which are impostors.
(d) With random samples, the two highest spikes can easily be picked out from
the background. (Figure courtesy of M. Lustig, D. Donoho, J. Santos and
J. Pauly, Compressed Sensing MRI, Signal Processing Magazine, March 2008.
© 2008 IEEE.)]

Example: Images
Example taken from: MacKenzie, Dana. Compressed sensing makes
every pixel count. What is happening in the mathematical sciences
7 (2009): 114-127.

Original noisy image. Shown is the image itself and (I guess) the
coefficients in Fourier (Wavelet?) basis. Signal is sparse in this
basis (but of course, it was not recorded in this basis, here the

transform to the sparse basis happened afterwards):


[figure omitted in this transcript]

Example: Images (2)

[figure omitted in this transcript]

Example: Images (3)


Now use random undersampling to record the picture, and
reconstruct based on l2-minimization:

[figure omitted in this transcript]

Many artifacts.

Example: Images (4)


Random undersampling, reconstruction based on `1 -minimization:
[Figure omitted in this transcript; caption in the source: "Compressed
sensing with noisy data. (a) An image with added noise. (b) The image,
undersampled and reconstructed using the Shannon-Nyquist approach. As in
Figure 2, artifacts appear in the reconstructed image. (d) The same image,
undersampled randomly and reconstructed ..."]

Nice :-)

Relation to standard information theory


Shannon sampling theorem (1949):
I A time-varying signal with no frequencies higher than d hertz
can be perfectly reconstructed by sampling the signal at
regular intervals of 1/(2d) seconds (that is, we sample at 2d
different time points).
I A signal with frequencies higher than d hertz cannot be

reconstructed uniquely if we sample with this rate; there is


always a possibility of aliasing (two different signals that have
the same samples).
Compressed sensing: makes stronger assumptions than Shannon:
I The achievable resolution is controlled not only by the maximal
number of frequencies (the dimension d of the space), but by
the “information content” (the sparsity s of the signal).

Relation to standard information theory (2)


I If we know that among the d different frequencies only s of
them really occur, then we can reconstruct the signal from a
small number of measurements.

Outlook
I Active area of research
I Lots of actual applications!!! Cameras, MRI scanning, etc

Ranking from pairwise comparisons

Introduction
Text books (but I don't like either of the two chapters so much):
I Mohri et al. chapter 9

I Shalev-Shwartz/Ben-David, Chapter 17.4



Papers: see the individual sections.



Introduction, informal
I Ranking candidates for a job offering
I Ranking of the world’s best tennis players
I Ranking of search results in google
I Ranking of molecules according to whether they could serve as
a drug for a certain disease

IN WHICH SENSE ARE THESE PROBLEMS DIFFERENT, IN


WHICH SENSE SIMILAR?

Introduction, informal (2)


I top-k ranking vs full ranking
I sampling with or without replacement
I active vs. passive selection of comparisons
I distributed or not
I ground truth exists or not
Problems run under many different names: rank aggregation,

ranking, tournaments, voting, ... and are tackled in many different


communities (machine learning, computational social choice,
theoretical computer science, etc).

Introduction, more formal


I Given n objects x1 , ..., xn .
I In the simplest case, we assume that there exists a “true” total
order ≺ on the objects, that is there exists a permutation π
such that xπ(1) ≺ xπ(2) ≺ ... ≺ xπ(n) .
I Goal is to learn this permutation from partial observations of
the ranking. In the simplest case, observations are of the form

xk ≺ xl for certain pairs (k, l).



Introduction, more formal (2)


Distance functions between permutations:

Given two permutations π and π̂ of the same set of objects. Want


to compute how different these rankings are.
I Kendall-τ distance: Count the number of pairs (i, j) that are
in different order in the two permutations:
  dτ (π, π̂) := 2/(n(n − 1)) · Σ_{i=1}^{n} Σ_{j=i+1}^{n} 1{ sign(π(i)−π(j)) ≠ sign(π̂(i)−π̂(j)) }

I Spearman-ρ distance: Count for each object by how much it


is “displaced” in one permutation with respect to the other:
  dρ (π, π̂) = Σ_{i=1}^{n} |π(i) − π̂(i)|

Introduction, more formal (3)


I Top-k differences. Assume we are just interested in whether
the top k objects in the two rankings coincide. Denote by Sk
the set of first k objects in π, and by Ŝk the corresponding set
in π̂. We define the distance

dk (π, π̂) := |Sk △ Ŝk | := |(Sk ∪ Ŝk ) \ (Sk ∩ Ŝk )|



Note that it only looks at the unordered sets, not at the order
within the sets.
I Normalized discounted cumulative gain (NDCG): We take the
ranking π as “reference ranking”. Then we compare it to the
second ranking π̂, but we weight errors among the top items of
π more severely than errors for items at the bottom of the
ranking π. Many different ways in which this can be done ...
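A minimal numpy sketch of the first two distances (rankings encoded as arrays where pi[i] is the position of object i; all names chosen here for illustration):

import numpy as np

def kendall_tau_distance(pi, pi_hat):
    pi, pi_hat = np.asarray(pi), np.asarray(pi_hat)
    n = len(pi)
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if np.sign(pi[i] - pi[j]) != np.sign(pi_hat[i] - pi_hat[j]):
                discordant += 1
    return 2.0 * discordant / (n * (n - 1))

def spearman_rho_distance(pi, pi_hat):
    return np.sum(np.abs(np.asarray(pi) - np.asarray(pi_hat)))

print(kendall_tau_distance([1, 2, 3, 4], [1, 3, 2, 4]))   # one discordant pair out of six
print(spearman_rho_distance([1, 2, 3, 4], [1, 3, 2, 4]))  # total displacement 2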

Introduction, more formal (4)


There exists a large variety of probabilistic model assumptions in
the literature. Here are some typical examples:
I We assume there exists a true ranking. When asking a user to
provide an answer to the question xi ? xj , he gives an incorrect
answer with probability p (where p is independent of xi , xj ).
I We assume that the objects xi can be represented by a real
number u(xi ), for example a utility score. Then we define

xi ≺ xj := u(xi ) < u(xj ). The likelihood to observe an


incorrect answer depends on the distance u(xi ) − u(xj ). Many
different versions, for example the BTL model below.
881
Summer 2019

Introduction, more formal (5)


I Model for paired comparisons: Bradley-Terry-Luce (BTL)
model. Each object has a score u(xi ) (utility value, skill, ...).
Probability of anwers to comparisons are modeled by a logistic
model:
  P (xi ≻ xj ) = 1 / ( 1 + exp(−(u(xi ) − u(xj ))) )

I Mallows model (probability distribution over all permutations):


Assume that π is the true ranking. Then the probability to
observe a ranking π̂ is chosen proportional to α^{dτ (π,π̂)} where
α ∈]0, 1] is a parameter and dτ is the Kendall-τ distance.
Choosing α = 1 implies the uniform distribution over all
permutations, the closer α is to 0, the more the mass
concentrates around π.
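A minimal numpy sketch that samples pairwise comparisons from such a BTL model (the utility values and all names are chosen here for illustration):

import numpy as np

rng = np.random.default_rng(0)
u = np.array([2.0, 1.0, 0.5, 0.0])        # utility scores of 4 objects

def btl_compare(i, j):
    # returns True if object i wins against object j under the BTL model
    p_i_wins = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))
    return rng.random() < p_i_wins

print([btl_compare(0, 3) for _ in range(10)])   # object 0 wins most of the time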

Introduction, more formal (6)


Default statistical approach. Given a probabilistic model, a straight
forward idea is to use a maximum likelihood estimator: Find the
permutation that maximizes the likelihood of the observed data.
However, it is often infeasible due to computational complexity
(need to have a clever way to try out all permutations).

Default algorithmic approach: Given the observations, find the



permutation that is as consistent as possible with your observations


(minimizes a loss function). For example, assume in a sports
tournament that everybody played against everybody. Now find a
ranking that violates as few outcomes as possible. This problem is
NP hard, there exists a PTAS for it (Kenyon-Mathieu and Schudy:
How to rank with few errors. STOC 2007).

Simple but effective counting algorithm


Based on the paper:
Shah, Wainwright: Simple, robust and optimal ranking from
pairwise comparisons. Arxiv, 2015.

The model
Ground truth model is very general:
I n objects

I For each pair of objects, assume a parameter pij := P (i ≻ j).
  Assume that P (i ≻ j) + P (i ≺ j) = 1 (no ties).
I Define the score that measures the probability that object i
beats a randomly chosen object j:

  τi := (1/n) · Σ_{j=1}^{n} P (i ≻ j)

This score τi can be interpreted as the probability that object i


wins against a randomly chosen object j (under the uniform
probability distribution of objects). We consider the ranking
induced by these scores as the true ranking. Note: high score
= top of the list.

The model (2)


Observation model:
I Assume that the number of times that a pair (i, j) is observed
is distributed according to a binomial distribution Bin(r, pobs )
(where r ∈ N and pobs ∈ [0, 1] are global parameters
independent of i and j).
I To generate the observations we proceed as follows:
I For each pair (i, j), we draw a random variable

nij ∼ Bin(r, pobs ). This is the number of times that we are


going to observe comparisons between i and j.
I Now we ask nij times independently whether i ≺ j or i ≻ j.
We get the answers with probabilities according to pij .
This model is very general, it encompasses most of the more
specialized models that exist in the literature.

Goal: Given a set of comparisons, find either the top-k ranking or


the full ranking.

The counting algorithm


Simple counting algorithm:
I Define τ̂i as the number of times that object i has won over
another object j, based on the observed comparisons.
I Define the estimated ranking (or the top-k set) as the order
induced by the estimated scores τ̂i .

This algorithm is about the simplest thing you can come up with, it
is sometimes called Borda count or Copeland method in the
literature.
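A minimal numpy sketch of this counting estimator on a list of observed comparisons (all names chosen here for illustration):

import numpy as np

def counting_ranking(n, comparisons):
    # comparisons: list of (winner, loser) pairs of object indices
    wins = np.zeros(n)                  # tau_hat_i = number of wins of object i
    for winner, _loser in comparisons:
        wins[winner] += 1
    order = np.argsort(-wins)           # objects sorted by estimated score, best first
    return wins, order

wins, order = counting_ranking(4, [(0, 1), (0, 2), (1, 2), (0, 3), (3, 2)])
print(order)                            # object 0 first; order among ties is arbitrary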

Bounds for exact recovery of top-k items


Define the following parameter:

  Ψk (n, r, pobs ) := (τk − τk+1 ) · sqrt( n · pobs · r / log n )

  (the first factor is the "separation parameter", the square-root factor
  the "sampling parameter")

I The separation parameter measures how well-separated the


first k items are from the remaining ones.
I The sampling parameter is a complexity term that depends on
the number n of objects and the expected number n · pobs · r
of observations per object.
I We will see below that the larger Ψk , the easier it is to
discover the true ranking. (DOES IT MAKE SENSE?)

Bounds for exact recovery of top-k items (2)


Theorem 46 (Exact top k recovery)
(a) Upper bound: Denote by Sk the set of true top-k items, and
by Ŝk the estimated set of top-k items according to the
counting algorithm. If Ψk (n, r, pobs ) ≥ 8, then Ŝk = Sk with
probability at least 1 − 1/n14 .

(b) Lower bound: If Ψk (n, r, pobs ) ≤ 1/7, n ≥ 7 and


r · pobs > log n/(2n), then there exist instances such that any
algorithm that attempts to recover the top-k items will err
with probability at least 1/7.

Digesting the theorem


Let’s digest the upper bound by constructing a simple example:

Example 1:
I Assume the true ordering is o1  o2  ...  on , that is the best
player is o1 . Our goal is to find the best player (that is, k = 1).
I Assume a noise-free setting: a better player always wins against
a worse player, that is pij = 1 if oi  oj and 0 otherwise.
Ulrike von Luxburg: Statistical Machine Learning

I Then τi = (n − i)/n, and in particular τ1 − τ2 = 1/n.

I Consider the case where we observe each pair exactly r times


(we set pobs = 1, so the number of observations is
deterministic as well).
890
Summer 2019

Digesting the theorem (2)


I Upper bound in the theorem: perfect recovery works if
  Ψ := (τ_1 − τ_2) · √( nr / log n ) ≥ const. In our case, τ_1 − τ_2 = 1/n,
  and solving the equation leads to r ≥ n log n. That is, the
  upper bound guarantees perfect recovery if we observe each
  pair (!) at least n log n times. So overall we have to make
  n^3 log n comparisons.
Ulrike von Luxburg: Statistical Machine Learning

On the other side, the lower bound says the following:


I Assume that we have separation τ_1 − τ_2 = 1/n as in our
  example. Then, if we observe fewer than n^3 log n comparisons
  in total, we cannot guarantee recovery.
I The point is that the lower bound is a worst-case statement:
  for the worst of all examples, we cannot guarantee recovery if
  we observe fewer than n^3 log n comparisons.
891
Summer 2019

Digesting the theorem (3)


I For example, we can construct an example that has separation
1/n as well, but has lots of noise:
Example 2: Assume that for i > 2 we have
P(1 ≻ i) = 1/2 + 2/(n − 2), P(2 ≻ i) = 1/2 + 2/(n − 2),
and all other pairwise probabilities are 1/2. This clearly is a
very difficult case.
Ulrike von Luxburg: Statistical Machine Learning

Taken jointly, upper and lower bound say:


I The counting algorithm gives perfect recovery if we get to see
n3 log n comparisons (this is for the case of separation 1/n).
I There are instances where it fails if we get to see less than this
amount of samples.
I In this sense, the counting algorithm is optimal (up to constants
  in the bounds).
892
Summer 2019

Digesting the theorem (4)


I But of course, the query complexity (number of comparisons
  we need) is huge. The problem is not that we have a bad
  algorithm (this is what the lower bound tells us). The problem is
  that we make too few assumptions, so there is no structure
  we can exploit.

  (As a side remark: the lower bound also holds if we make the
  Bradley-Terry-Luce assumptions; one can construct an example
  that satisfies their assumptions and still needs many
  comparisons.)
893
Summer 2019

Digesting the theorem (5)


Just as comparison: In our example 1, we are in a completely
noiseless case.
I WHAT IS THE NUMBER OF COMPARISONS WE NEED TO
SORT A SEQUENCE?
I WHAT IS THE NUMBER OF COMPARISONS WE NEED TO
FIND THE TOP ITEM IN A LIST OF ITEMS?
Ulrike von Luxburg: Statistical Machine Learning

So we are miles away from this good performance. HOW CAN


THIS BE, WHAT IS THE DIFFERENCE?
894
Summer 2019

Proof sketch, upper bound


Let’s briefly look at the proof:
I For each pair (i, j), we have a certain number of independent
observations.
I The parameter τ̂i is an average over these observations.
I This average is highly concentrated around its expectation.
Applying standard concentration inequalities (Bernstein), one
Ulrike von Luxburg: Statistical Machine Learning

can show that the deviations of the random variables are small.
I In particular, we can then bound the probability that one of
the top-k items “is beaten” (in terms of τ̂i ) by one of the
not-top-k items.
See the following figure:
Proof sketch, upper bound (2)

[Figure omitted.]
Summer 2019

Proof sketch, lower bound


We construct one particular example in a clever way:
I For each a in {k, k+1, ..., n}, let
  S*(a) := {1, 2, ..., k − 1} ∪ {a}. This is supposed to be the
  true top-k set.
I Define the probabilities

      P_a(i ≻ j) = 1/2       if i, j ∈ S*(a) or i, j ∉ S*(a)
                 = 1/2 + δ   if i ∈ S*(a) and j ∉ S*(a)
                 = 1/2 − δ   if i ∉ S*(a) and j ∈ S*(a)

I Note that the true τ -values give the correct top-k set.
I Our goal is to identify the true permutation based on
observations, that is we want to find the correct parameter a
that has been used.
897
Summer 2019

Proof sketch, lower bound (2)


To construct the lower bound, we now want to show that no matter
which algorithm we use to estimate the correct top-k set in our
example, it always errs with a constant probability.

To this end we use a tool from information theory: Fano’s


inequality. Essentially it says that if we want to recover a certain
parameter, we need to receive a certain amount of “signal” or
Ulrike von Luxburg: Statistical Machine Learning

“information”.
I Assume that a is chosen uniformly from k, ..., n. Then we
sample observations according to the model Pa .
898
Summer 2019

Proof sketch, lower bound (3)


I Fano's inequality now states that any algorithm that estimates
  a by some â has to make an error of at least

      P(a ≠ â) ≥ 1 − ( I(a; observation) + log 2 ) / log(n − k + 1)

  So we need to bound the mutual information
  I(a; observation), which boils down to a sum of
  Kullback-Leibler divergences D(P_a || P_b). They can be
  computed by standard methods.

Details skipped.
899
Summer 2019

Exact recovery of full ranking


The bound for top-k ranking can immediately be turned into a
bound of exact full ranking. The main observation is that a ranking
is correct if the top-k rankings for all k = 1, ..., n − 1 are correct.
This immediately leads to:

Theorem 47 (Upper bound, full permutation)


Ulrike von Luxburg: Statistical Machine Learning

Let π̂ be the permutation induced by the estimated scores τ̂, and π
the one by the true scores τ. If Ψ_k(n, r, p_obs) ≥ 8 for all
k = 1, ..., n − 1, then P(π̂ = π) ≥ 1 − 1/n^13.

Proof: union bound with the previous theorem (union bound leads
to power 13 instead of 14).
900
Summer 2019

Approximate recovery
The result looks surprisingly similar. Just the separation term now
depends not only on τ_k − τ_{k+1}, but on all τ-values in a certain
neighborhood of k (where the size of the neighborhood depends on
the error we are allowed to make).

We still get the same kind of worst case query complexity.


Ulrike von Luxburg: Statistical Machine Learning

Details skipped.
901
Summer 2019

Discussion
I On a high level, the theorem shows two things:
I Ranking from noisy data is difficult if we don’t make any
assumptions.
I You cannot improve on the counting algorithm — unless you
do make more assumptions.
I In practice, the query complexity of n^3 log n is completely out
  of bounds; there is no way you can collect that many
  comparisons in a realistic setting. So what is obviously needed
  are algorithms that work well with fewer queries in realistic
  settings (assumptions).
902

Learning to rank
Summer 2019

Learning to rank
I Objects x1 , ..., xn .
I Observations of the form xi ≺ xj . Encode this as follows:
I Consider the space S of all unordered pairs of objects.
I Output variable y_ij = +1 if x_i ≺ x_j, and −1 otherwise.
I Goal: learn a classifier that makes as few mistakes as possible.
Ulrike von Luxburg: Statistical Machine Learning
904
Summer 2019

Naive idea: ERM


The first naive algorithm we can think of is to perform empirical
risk minimization on the set of permutations: that is, we pick the
permutation that agrees most with our observations.

We have already mentioned that this is NP-hard
(computational complexity), but let's look at how many queries we
would need (query complexity).
Ulrike von Luxburg: Statistical Machine Learning
905
Summer 2019

Generalization bounds for learning to rank


Proposition 48 (VC dim of permutations)
Consider a set V of n objects, and the set S of all unordered pairs
of objects. Denote by Π the set of permutations of V . Each
permutation π induces a classifier fπ : S → {−1, 1} on the set S
(as described above). Then the space F := {fπ | π ∈ Π} has
VC-dimension n − 1.
Ulrike von Luxburg: Statistical Machine Learning

Proof: Step 1: Prove that VC < n. To this end, consider any
subset S′ ⊂ S with |S′| = n. We want to show that it cannot be
shattered by the function class F. We construct a proof by
contradiction.
I Assume we have a set S′ ⊂ S of n pairs of objects that can be
  shattered by F.
906
Summer 2019

Generalization bounds for learning to rank (2)


I Consider the comparison graph of S′: vertices = all n objects;
  an undirected edge between objects i and j if {i, j} ∈ S′.
I The graph has n vertices and n edges by construction, so it
  needs to contain an undirected cycle. Now observe that we
  cannot shatter the pairs in S′ that correspond to the
  edges in the cycle: we cannot realize the function that
  corresponds to x_i ≺ x_j ≺ ... ≺ x_l ≺ x_i, because the latter
  implies x_i ≺ x_i. □
907
Summer 2019

Generalization bounds for learning to rank (3)


Step 2: Prove that VC ≥ n − 1. To this end, we need to find at least
one subset S′ ⊂ S with |S′| = n − 1 that can be shattered. Using
the same construction as above, we simply choose S′ such that the
graph is a tree. This can always be done, which completes the proof.

Remark: Naively, the set Π consists of n! permutations, so
the shattering coefficient is at most n!. The log-shattering
coefficient is then log(n!) ≈ n log n, so the first natural guess is
that the VC dimension might be n log n. We now see that it is even
smaller, namely n − 1.
908
Summer 2019

Generalization bounds for learning to rank (4)


Our standard VC-generalization bound for a class with
VC-dimension d over a sample of m comparisons is that with
probability at least 1 − δ, any permutation π satisfies
      R(f) ≤ R_n(f) + 2 · √( (d log(2em/d) − log(δ)) / m ).
As a rule of thumb: how many sample points do you need to
achieve an error of about ε at most? Here is the argument:
I Ignore all log terms and constants.
I Then the error ε is of the order ε := √(d/m). Solving for m
  tells us that we need to observe of the order of d/ε^2 comparisons.
  In our case with d = n, this means that we need to observe
  about m := n/ε^2 comparisons to achieve an error of at most
  ε, with high probability.

Seems pretty good!


909
Summer 2019

Generalization bounds for learning to rank (5)


Comparison to the bound in the Shah/Wainwright approach
(simple counting algorithm):
I Note: the bound in the Shah/Wainwright approach was:
  recovery works if we see about n^3 log n^3 comparisons (with
  similar results for approximate recovery and recovery of the full
  ranking).
I Now we have a VC bound that says that of the order of n
  examples are enough for good classification performance.
I Both approaches make only minimalistic / no assumptions
  whatsoever on the structure of the numbers we want to sort.
WHERE IS THE CATCH?
910
Summer 2019

Generalization bounds for learning to rank (6)


I Note that the Shah/Wainwright bound talks about identifying
the correct ranking, while the VC bound just talks about
predicting the outcome of comparisons.
I Consider an example that is difficult in the Shah/Wainwright
  framework: all p_ij-values are close to 1/2, so the τ-values are
  very similar to each other.
I Shah/Wainwright: need many samples to find the actual
Ulrike von Luxburg: Statistical Machine Learning

ranking.
I Learning to rank: the bound only considers the estimation
error of the classifier, when applied to predict unobserved
comparisons. [As a side remark: in the given example even
the Bayes classifier would have a poor performance close to
random guessing. The generalization bound just tells us that
we need not so many comparisons to come close to the
performance of the actual Bayes classifier. ]
911
Summer 2019

Generalization bounds for learning to rank (7)


I So the bounds are difficult to compare, neither the estimation
error of the predictor nor its approximation error are directly
related to the difficulty of the ranking problem.
Ulrike von Luxburg: Statistical Machine Learning
912
Summer 2019

SVM ranking
As ERM is infeasible computationally, we could use a linear SVM
instead:
I Encode an ordered pair of objects by a feature vector
xij := ei − ej ∈ Rn and the outcome yij as described above.
I Get training points of the form (xij , yij ).

I Classify using a linear hyperplane (that is, find a vector
  w ∈ R^n) such that sign(⟨w, x_ij⟩) makes as few errors as
  possible. Use an SVM to find this hyperplane.
I In particular, the predicted ordering can then be recovered by
the ordering of the coordinates of w.
Can also prove margin-type generalization bounds for SVM ranking,
skipped.
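A minimal Python sketch of this encoding and training step (my own illustration using scikit-learn's LinearSVC; the label sign convention and the absence of an intercept are modelling choices made here, not prescribed by the slides):

import numpy as np
from sklearn.svm import LinearSVC

def svm_rank(n, comparisons):
    # comparisons: list of (i, j, y) with y in {+1, -1} encoding the observed order of objects i, j
    X = np.zeros((len(comparisons), n))
    y = np.zeros(len(comparisons))
    for row, (i, j, label) in enumerate(comparisons):
        X[row, i], X[row, j] = 1.0, -1.0     # feature vector x_ij = e_i - e_j
        y[row] = label
    w = LinearSVC(fit_intercept=False).fit(X, y).coef_.ravel()
    return np.argsort(w)                      # predicted ordering = ordering of the coordinates of w

The direction of the recovered ordering depends on the chosen sign convention for y.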
913
Summer 2019

Application: distance completion problem


This is a topic we are actually working on right now in my research
group.
Ulrike von Luxburg: Statistical Machine Learning
914
Summer 2019

Setup: Triplet comparisons


A scenario beyond simple ranking:
I Points X1 , ..., Xn from Rd

I We don’t know any numeric information such as vector


representations or distance values
I We just get to see binary variables that compare distances:
Ulrike von Luxburg: Statistical Machine Learning

d(Xi , Xj ) < d(Xk , Xl ) = true or false

So in the ranking language, we get a partial ranking between the


distances of the objects.
915
Summer 2019

Setup: Triplet comparisons (2)


Why is this interesting?

It is often easy to say that things “are pretty similar” or “not similar
at all”, but it is hard to come up with good ways to quantify this.

Example user ratings (the item images shown on the slide are omitted here): it is easier to compare

    dist(item 1, item 2) < dist(item 3, item 4)

... than to give numeric distance values:

    dist(item 1, item 2) = 0.1
    dist(item 3, item 4) = 0.7
916
Summer 2019

Setup: Triplet comparisons (3)


In the following we consider:
I Triple questions: is d(X_i, X_j) ≤ d(X_i, X_k)?
I Quadruple questions: is d(X_i, X_j) ≤ d(X_k, X_l)?
Ulrike von Luxburg: Statistical Machine Learning
917
Summer 2019

Distance completion problem


Distance comparison problem:
I n objects from Rd

I All we observe are a subset of all triple comparisons of the


form d(Xi , Xj ) < d(Xi , Xk ).
I Want to estimate the full ranking between all distances dij .

The full distance ranking can then, for example, be used to find the
nearest neighbors of each data point, and then we can apply
classification algorithms, regression algorithms, clustering
algorithms, etc.
918
Summer 2019

Query complexity of the distance completion


problem
Given n objects, how many randomly chosen triple comparisons do I
need in order to estimate the true distance ranking reliably?
Ulrike von Luxburg: Statistical Machine Learning
919
Summer 2019

Query complexity, first observations


First observations about ranking m objects:
I There are of the order of m := n^2 distances. A comparison-based
  sorting algorithm would need Θ(m log m) = Θ(n^2 log n^2) many
  (actively chosen!) comparisons. In the noiseless case, it would
  produce the perfect ranking.
I If we use ERM to recover an approximate ranking, we would
  just need m/ε^2 = n^2/ε^2 many queries to learn the ranking up
  to error ε (ignoring that the computational complexity is much
  too high).
I If we apply the simple counting algorithm by
  Shah/Wainwright, we also get a query complexity of
  m^3 log m = n^6 log n (of randomly chosen comparisons).
  This would also work in a noisy case.
920
Summer 2019

Query complexity, first observations (2)


In any real-world application, query complexities of order n^2 are
prohibitive ...
Ulrike von Luxburg: Statistical Machine Learning
921
Summer 2019

Query complexity, first observations (3)


However:
I Observe that the three approaches mentioned above do not
  make any assumption on the objects that need to be ordered;
  they can be an arbitrary collection of numbers.
I We know more about our data: the things we want to order
are Euclidean distances. Can we exploit this in some way?
Ulrike von Luxburg: Statistical Machine Learning
922
Summer 2019

Query complexity, exploit structure


The answer is yes: we can exploit the structure of the problem.
I Consider the set of points X1 , ..., Xn ∈ Rd , and a certain
subset of triple questions.
I Observe that a triple comparison gives a relationship in form of
a hyperplane: dij < dik is equivalent to saying: if we consider
the hyperplane between point Xj and Xk , then point Xi is on
Ulrike von Luxburg: Statistical Machine Learning

the same side as Xj .


I We can now build equivalence classes of point sets: the ones
that satisfy the same hyperplane conditions.
I It is now possible to “count” the number of equivalence classes
  (non-trivial!). The result is: there are of the order of 2^(dn log n)
  such equivalence classes (where d is the dimension of the space).
I This means that the log-shattering coefficient of the set of all
Euclidean (!) distance completions is just dn log n.
923
Summer 2019

Query complexity, exploit structure (2)


I So the standard shattering-coefficient generalization bound
  says that with high probability, if we want to approximate the
  correct distance ranking up to error ε, we need of the order of
  dn log n/ε^2 many triple questions, close to linear!!!
I Note that this is much better than the n^2 log n requirements
  we had without making any assumption.
This is a very nice example to demonstrate that exploiting the
Ulrike von Luxburg: Statistical Machine Learning

structure in the problem helps (at least in theory).

And this is also a nice example for the type of questions we work
on in my group; we just proved this result a couple of weeks ago...
924
Summer 2019

Query complexity, exploit structure (3)


Outlook:
I Note that while this shows that in theory we only need few
triples to recover the full distance ranking for Euclidean points,
we don’t know how to do it in practice.
I We would need to have an algorithm that does ERM on the
set of equivalence classes ...
Ulrike von Luxburg: Statistical Machine Learning
925
Summer 2019

Spectral ranking
Based on the following paper:
Fogel, d’Aspremont, Vojnovic: SerialRank: Spectral ranking using
seriation. NIPS 2014.
Ulrike von Luxburg: Statistical Machine Learning
926
Summer 2019

Spectral ranking
Setting as before: we observe pairwise comparison, want to output
a ranking.

Define the comparison matrix:

      C_ij = +1  if i ≻ j
           = −1  if i ≺ j
           =  0  if no data exists

Define a similarity matrix as follows:

      S_ij := Σ_{k=1}^n (1 + C_ik · C_jk) / 2

(this counts the number of matching comparisons of i and j with other
items k)
927
Summer 2019

Spectral ranking (2)


SpectralRanking algorithm:
I Compute the similarity matrix S based on the observed data
I Construct the unnormalized Laplacian L and compute its
second eigenvector.
I Rank all items according to the corresponding entries in this
eigenvector.
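A minimal Python sketch of these three steps (my own illustration, not code from the paper; C is the comparison matrix defined on the previous slide, and the resulting ordering is only determined up to reversal because the sign of an eigenvector is arbitrary):

import numpy as np

def serial_rank(C):
    # C: n x n comparison matrix with entries in {+1, -1, 0}
    n = C.shape[0]
    S = (n + C @ C.T) / 2.0                 # S_ij = sum_k (1 + C_ik * C_jk) / 2
    L = np.diag(S.sum(axis=1)) - S          # unnormalized graph Laplacian
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    fiedler = eigenvectors[:, 1]            # eigenvector of the second smallest eigenvalue
    return np.argsort(fiedler)              # item indices ordered along this eigenvector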
Ulrike von Luxburg: Statistical Machine Learning
928
Summer 2019

Spectral ranking (3)


There are lots of theoretical results on this algorithm:
I Assume we get to see all pairwise comparisons, answered
truthfully, and there are no ties. Then SpectralRanking
recovers the correct ranking perfectly.
(not interesting from an algorithmic point of view, we could
just do topological sort in this case).
I Given a comparison matrix for n objects, with at most m
  corrupted entries (selected uniformly at random): if
  m = O(√(δn)), then the SerialRank algorithm will produce the
  ground truth ranking with probability at least 1 − δ.
  This is the interesting statement.
Proofs are based on some old work by Atkinson 1998, we skip them.
929
Summer 2019

Google page rank


The setting here is not a pairwise-comparison setting. But no
student should leave this university without knowing google page
rank, so let’s discuss it anyway.
Ulrike von Luxburg: Statistical Machine Learning
930
Summer 2019

The setting
Want to build a search engine:
I Query comes in

I First need to find all documents that match the query

I Then need to decide which to display on the top of the list. So


we need to rank the search results according to their
“relevance”.
Ulrike von Luxburg: Statistical Machine Learning

Early attempts looked at the content of the documents (count how


often the keyword occurs, etc).

The new idea by the google founders was to instead look at the link
structure of the webpages.
931
Summer 2019

Page Rank
Published by Brin, Page, 1998.

Main idea:
I A webpage is important if many important links point to that
page.
I A link is important if it comes from a page that is important.
Ulrike von Luxburg: Statistical Machine Learning

Results of a search query should then be ranked according to


importance.
932
Summer 2019

Page Rank (2)


Given a directed graph G = (V, E), potentially with edge weights
sij , define:
I Out-degree: d_out(i) = Σ_{k: i→k} s_ik
I In-degree: d_in(j) = Σ_{k: k→j} s_kj

Define the ranking function r for all vertices:

      r(j) = Σ_{i ∈ parents(j)} r(i) / d_out(i)        (∗)

This is an implicit definition. We need to find a way to solve this


for r(j), for all j.
933
Summer 2019

Page Rank (3)


Define the matrix A with entries
      a_ij = 1/d_out(i)   if i → j
           = 0            otherwise

and the vector r with the relevance scores as entries.

Observe that (r^t · A)_j = Σ_i r_i a_ij, so we can rewrite (∗) as

      r^t = r^t · A
So r is a left eigenvector of A with eigenvalue 1.

The page rank idea consists of ranking vertices according to this


eigenvector.
In the following we will consider two things:
I Interpretation as a random surfer model
I How to compute the eigenvector.
934
Summer 2019

Random walks on a graph


To give the random surfer interpretation to pagerank, we first need
to learn about random walks on a graph:
I Consider a directed graph G = (V, E) with n vertices.
I A random walk on the graph is a time-homogeneous,
discrete-time Markov chain. At each point in time, we
randomly jump from one vertex to a neighboring vertex. The
Ulrike von Luxburg: Statistical Machine Learning

probability to end in one of the neighbors only depends on the


current vertex, not on the past beyond this.
I It is fully described by the transition matrix P with entries

      p_ij = P(X_{t+1} = v_j | X_t = v_i).

  For a weighted graph with similarity edge weights s_ij, we have
  p_ij = s_ij / d_i and in particular P = D^{-1} S.


935
Summer 2019

Random walks on a graph (2)


Initial distribution:

At time point 0, we start the random walk at a random vertex
according to a probability row vector µ = (µ_1, ..., µ_n) with
µ_i ≥ 0, Σ_i µ_i = 1. The special case where we start at a
deterministic vertex corresponds to µ = (0, ..., 0, 1, 0, ..., 0).
Ulrike von Luxburg: Statistical Machine Learning
936
Summer 2019

Random walks on a graph (3)


k-step distribution:

I At time t = 0, the state distribution is µ, our initial


distribution.
I At time t = 1, the distribution is µP .
I At time t = 2, the distribution is (µP)P = µP^2. Note that
  entry ij of the matrix P^2 sums over all possible ways to get
  from vertex i to j in exactly 2 steps.
I In general, the matrix P^k describes the k-step transition
  probabilities.
937
Summer 2019

Random walks on a graph (4)


Stationary distribution:

Intuition: if the random walk runs for a long time, it will converge
to an equilibrium distribution. It is called the stationary distribution
or the invariant distribution.

Definition of a stationary distribution: If we start in a stationary


distribution π and perform one step of the random walk, we have
Ulrike von Luxburg: Statistical Machine Learning

again the stationary distribution. In formulas: π is a stationary


distribution of P if

πP = π
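As a small numerical illustration (my own sketch, not from the slides), one can compute the stationary distribution of the random walk on a weighted undirected graph and verify πP = π:

import numpy as np

S = np.array([[0., 2., 1.],
              [2., 0., 1.],
              [1., 1., 0.]])          # symmetric similarity / weight matrix
d = S.sum(axis=1)                     # degrees
P = S / d[:, None]                    # transition matrix P = D^{-1} S
pi = d / d.sum()                      # candidate stationary distribution (normalized degrees)
print(np.allclose(pi @ P, pi))        # True: pi P = pi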
938
Summer 2019

Random walks on a graph (5)


Convergence to the stationary distribution:

Consider a graph G that is weighted, undirected, connected, and
not bipartite. Denote by P the transition matrix. Then:
I The stationary distribution of P is given as the (normalized)
  degree vector: π_i = d_i / (Σ_j d_j)   (CAN YOU SEE WHY?)
I lim_{t→∞} P(X_t = v_i) = π_i
Ulrike von Luxburg: Statistical Machine Learning

I The matrix P^t converges to the matrix 1π with constant
  columns. The speed of convergence depends on the
  eigengap between the first and second eigenvalue of P:
  I The largest eigenvalue is always 1; the second largest eigenvalue
    λ_2 satisfies |λ_2| < 1 (note: it might not be real-valued!). Note
    that the right and left eigenvalues are the same, just the
    eigenvectors differ.
  I Perron-Frobenius theorem: P^t = 1π + O(n^c |λ_2|^t).
939
Summer 2019

The random surfer model


Recall the definition of the matrix A for pagerank. Observe:

I The matrix A is the transition matrix of a random walk on the
  graph of the internet.
I The ranking vector r is its stationary distribution.
Ulrike von Luxburg: Statistical Machine Learning

If done naively, two big problems:


I dangling nodes (e.g., pdf pages).

I disconnected components
940
Summer 2019

The random surfer model (2)


Solution to both problems:
we introduce a random restart (“teleportation”):
I with probability α close to 1, we walk along edges of the graph.

I With probability 1 − α, we teleport: we jump to any other


random webpage.
The transition matrix is then given as

      αP + (1 − α) · (1/n) · 1

where n is the number of vertices and 1 the constant-one matrix.

The ranking is then the stationary distribution of this matrix.


941
Summer 2019

How to compute it: the power method


Need to compute an eigenvector of a matrix of size n × n where n
is the number of webpages in the internet (2014: one billion
webpages).

Computing an eigenvector of a symmetric matrix has a worst-case
complexity of about O(n^3) (and by the way, our current matrix is
not symmetric).
Ulrike von Luxburg: Statistical Machine Learning

IS THERE ANY REASON TO BELIEVE THAT THIS MIGHT


WORK?
942
Summer 2019

How to compute it: the power method (2)


The simplest way to compute eigenvectors: the power method

I Let A be any diagonalizable matrix.


I Goal: want to compute eigenvector corresponding to the
largest eigenvalue.
I Observe: Denote by v_1, ..., v_n a basis of eigenvectors of matrix
  A. Consider any vector v = Σ_i a_i v_i. Then

      A v = A ( Σ_i a_i v_i ) = Σ_i a_i (A v_i) = Σ_i a_i λ_i v_i

  If we apply A k times, then:

      A^k v = Σ_i a_i λ_i^k v_i = a_1 λ_1^k ( v_1 + Σ_{i=2}^n (a_i λ_i^k)/(a_1 λ_1^k) v_i )

  where the first term dominates and the ratios (λ_i/λ_1)^k in the
  sum vanish as k grows (when |λ_1| is strictly the largest).
943
Summer 2019

How to compute it: the power method (3)


The Power Method, vanilla version:

1  Initialize q^(0) by any random vector with ‖q^(0)‖ = 1
2  while not converged
3      z^(k) := A q^(k−1)
4      q^(k) := z^(k) / ‖z^(k)‖
Ulrike von Luxburg: Statistical Machine Learning

Caveat:
I Won’t work if q0 ⊥ first eigenvector

I Does not necessarily converge if the multiplicity of the largest


eigenvalue is larger than 1.
I Speed of convergence depends on the gap between the first
and second eigenvalue, namely λ2 /λ1 .
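A minimal Python sketch of this vanilla power iteration (my own illustration; the convergence check on successive iterates is one simple choice and ignores the sign-flip issue that can occur when the dominant eigenvalue is negative):

import numpy as np

def power_method(A, tol=1e-10, max_iter=1000, seed=0):
    q = np.random.default_rng(seed).normal(size=A.shape[0])
    q /= np.linalg.norm(q)                   # random start vector with norm 1
    for _ in range(max_iter):
        z = A @ q
        q_new = z / np.linalg.norm(z)
        if np.linalg.norm(q_new - q) < tol:  # stop when the iterate barely changes
            return q_new
        q = q_new
    return q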
944
Summer 2019

How to compute it: the power method (4)


Implementation of page rank is a simple power iteration:
I We initialize with the constant vector r = e = (1, ..., 1)^t
  (normalized so that r^t e = 1).
I We iterate until convergence:

      r_{k+1}^t = r_k^t (αA + (1 − α) e v^t)
                = α r_k^t A + (1 − α) (r_k^t e) v^t        (using r_k^t e = 1)
                = α r_k^t A + (1 − α) v^t                  (r_k^t A is a sparse product)

Comments:
I v is the “personalization vector” (≈ probability over all
webpages of whether the surfer would like to see that page)
945
Summer 2019

How to compute it: the power method (5)


I 1 − α is the teleportation parameter.
I In the last line, we essentially have to perform one sparse
  matrix-vector multiplication; this can be done in parallel.
I The speed of convergence depends on the gap between the first
  and second eigenvalue. Personalization adds speed because if the
  spectrum of P is {1, λ_2, λ_3, ...}, then the spectrum of the
  personalized matrix is {1, αλ_2, αλ_3, ...}.
  Thus we have a tradeoff: a large α means a small gap and slow
  convergence, but the structure of the web graph is well represented.
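A minimal dense Python sketch of this iteration (my own illustration; the damping value 0.85 is a typical choice, not prescribed by the slides, and A is assumed to be row-stochastic, i.e. dangling nodes have already been handled; in a real implementation A would be stored as a sparse matrix):

import numpy as np

def pagerank(A, alpha=0.85, v=None, tol=1e-12, max_iter=200):
    # A: row-stochastic transition matrix, alpha: damping factor, v: personalization vector
    n = A.shape[0]
    v = np.full(n, 1.0 / n) if v is None else v
    r = np.full(n, 1.0 / n)                  # start with the uniform distribution
    for _ in range(max_iter):
        r_new = alpha * (r @ A) + (1 - alpha) * v
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r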
From raw data to machine learning

The data processing chain: (figure omitted)

Preparing the data

Data acquisition: train versus test distribution


Summer 2019

Train versus test distribution


Machine learning starts with acquiring good data. The results of
your learning algorithm can only be as good as the information in
your data!!!

If you want to train a classifier to perform a certain task and you


start collecting data for training, you should ask yourself the
following questions:
Ulrike von Luxburg: Statistical Machine Learning

Are all the potential test cases covered in the training data, with all
the variety that exists?
I If you want to classify digits, are all of them going to be
upright? If not, add digits in all orientations to your system.
950
Summer 2019

Train versus test distribution (2)


Is your training set “representative” for the test cases: is the
distribution of your training inputs roughly the same as the
distribution of your test inputs?
I If you want to predict whether general customers like a certain
product, it is not a good idea to just collect opinions from
students, say.
I Yet another sampling bias: we take a questionnaire in the
  machine learning class, about whether you like the class or not.
  I Because we take it towards the end of the semester, the
    people who disliked it most are no longer present (because
    they already dropped the lecture).
  I And the people who never attend the lecture because they
    think it is enough to read the slides at home are also not
    present.
951
Summer 2019

Train versus test distribution (3)


I Much more subtle: credit scoring rules. Such rules are used by
the banks to predict whether a customer is likely to pay back a
loan in time. The problem is that the available training data
(persons plus the knowledge whether they paid back) is highly
biased, because the banks only gave the credits to pre-selected
people in the first place (so if their “old” selection rule never
gave a credit to females, say, then they never get positive
Ulrike von Luxburg: Statistical Machine Learning

training examples for females who pay back the credit).


952
Summer 2019

Train versus test distribution (4)


Ulrike von Luxburg: Statistical Machine Learning

At least, try to be aware of any imbalance in your training
set.
953

Converting raw data to training data


Summer 2019

Raw data to training data


Often there is a considerable amount of decisions to take when you
go from raw data to training data.

Example: you want to detect faces in images, and you have a set of
images (with and without faces).
I What exactly do you use as training examples? The whole
image? The part of the image that contains a face / not a
Ulrike von Luxburg: Statistical Machine Learning

face? All 16 × 16 patches of your image?


I What representation of the image do you use in the first place?
What color space? What resolution?
I What do you do with different brightnesses? Do you also use
the additional data provided by the camera as input
(exposition time, aperture, focus, etc)?
955

Data cleaning: missing values, outliers


Summer 2019

Outlier detection
Data often contains “outliers”:

An outlier is an observation which deviates so much from the other


observations as to arouse suspicions that it was generated by a
different mechanism.

Unfortunately, outliers can have a very big influence on the


Ulrike von Luxburg: Statistical Machine Learning

outcome of certain algorithms.

DO YOU KNOW AN EXAMPLE FOR THIS?

Consequently, we sometimes might want to remove outliers from


our data.
957
Summer 2019

Outlier detection (2)


There exist many algorithms for outlier detection:
I The classical statistics / model-based approach:
I Fit some model to the data, say a mixture of Gaussians
I Then the outliers are points which have very low probability
under this model.
I For most ML applications, this is not applicable at all ... data
too complex ... try to avoid building explicit models ...
Ulrike von Luxburg: Statistical Machine Learning

I Model-free approach:
I Want to find a set S which has two properties:
it is as small as possible, but it contains most of the data
points.

I To identify outliers, we have to find such a set S. We then


say that all points that fall outside of S are outliers.
958
Summer 2019

Outlier detection (3)


I The one-class SVM falls into this framework (see the book of
Schölkopf/Smola).
I Many other approaches exist.
Ulrike von Luxburg: Statistical Machine Learning
959
Summer 2019

Outlier detection (4)


Very important to understand:
I “Right” or “wrong” does not really exist (outlier detection is
an unsupervised technique!)
I Depending on the algorithm (and, as a matter of fact, the
underlying distance function) completely different points might
get marked as outliers.
I Note that “outliers” are not always bad and undesired, to the
Ulrike von Luxburg: Statistical Machine Learning

contrary: these might be points that are particularly


interesting.
I For example, if we want to assess the side effects of some
medical treatment, there might be just a very small number of
patients who have severe side effects, but these are the ones
we care about.
I If we throw outliers away, we might lose the most important
  data!
960
Summer 2019

Outlier detection (5)


Always be suspicious if people generously remove outliers! I
tend not to use outlier detection, unless I really know what I
am doing...
Ulrike von Luxburg: Statistical Machine Learning
961
Summer 2019

Missing values
Often data is not complete, you have missing values. Two big
cases:
I Missing at random: this is the easier case, no bias introduced
by the fact that data is missing.
I Missing not at random: values could miss systematically, and
the fact that a value is missing might contain information:
Ulrike von Luxburg: Statistical Machine Learning

I You run a population survey. One of the questions is about


the income. People might not want to disclose their income if
it is really high or really low. So the value is not missing at
random, which introduces a bias.
What to do? No standard recipe ...
I Throw away corrupted entries (ok if missing at random, not ok
if missing not at random)
I Impute missing values in some clever way.
962
Summer 2019

Missing values (2)


I Use an algorithm that can cope with missing values (example:
if you use a feature vector to describe your data, and one entry
in the vector is missing, this is a problem if you want to
compute a scalar product).
Ulrike von Luxburg: Statistical Machine Learning
963

Defining features, similarities, distance functions


Summer 2019

Choice of features
I Choose reasonable features if you can.
I If you want to classify texts into different topics, a bag of
words seems reasonable. A “bag of letters” would not be
reasonable.
I Don’t be shy with including features. If in doubt, always
include a questionable feature. Many supervised algorithms like
Ulrike von Luxburg: Statistical Machine Learning

SVMs are good at identifying useful features and can work


with thousands of features (not always, but often).
I Choosing good features can be a very tough problem (in fact,
  half of the computer vision community does research on what
  are good features for image classification). Famous are the
  SIFT features: they are scale- and rotation-invariant local
  features on images (google it if you are interested).
965
Summer 2019

Dealing with categorical features


I Assume you have a feature that is not a numerical value but a
“category”.
Example: you want to describe books, the categories are
“detective story”, “novel”, “children’s book”, ....
I The naive attempt would be to say “detective story”=1,
“novel” = 2, “children’s book” = 3, ...
Ulrike von Luxburg: Statistical Machine Learning

I But in many cases this is a bad idea. Note that the numerical
values 1, 2, 3 suggest that a detective story is closer to novel
than to a children’s book (as similarity between feature vectors
we use the scalar product, and it implicitly encodes this kind of
intuition). MAKE SURE YOU UNDERSTAND THIS POINT!
966
Summer 2019

Dealing with categorical features (2)


I Alternative: use binary encodings:
For each category, you introduce one yes/no feature.

Example:
Feature 1: is it a detective story? (0 or 1)
Feature 2: is it a novel? (0 or 1)
And so on.
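A minimal Python sketch of such a binary (one-hot) encoding (my own illustration; the book-category example is the one from the slide):

import numpy as np

def one_hot(values, categories):
    # one 0/1 feature per category
    enc = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        enc[row, categories.index(value)] = 1.0
    return enc

categories = ["detective story", "novel", "children's book"]
print(one_hot(["novel", "detective story"], categories))
# [[0. 1. 0.]
#  [1. 0. 0.]]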
Ulrike von Luxburg: Statistical Machine Learning
967
Summer 2019

Sparse feature vectors


I Sometimes the feature vector is very sparse (e.g. in a bag of
words approach).
I You might use dimensionality reduction first (e.g., SVD) or
  cluster features into meaningful groups.
I You might also build disjunctive features (replace two features
  by one feature that contains the sum of the two, or a logical
  disjunction in case of categorical features).


I Sometimes (but not very often) people also use a hash
function to map the original feature vector to a condensed
representation. This is called “feature hashing”.
968

Defining a similarity / kernel / distance function


Summer 2019

Importance of a good choice


I One of the basic principles of all the learning algorithms we
have seen is that points which are “similar” or “close” tend to
belong to the same class (have a similar label).
I The notion of “similarity” or “closeness” has to be given to
the ML algorithm as input.
I This is the place where a lot of the prior knowledge you have
Ulrike von Luxburg: Statistical Machine Learning

about your problem is made accessible to your algorithm.


I The choice of the similarity function / kernel / distance
  function is crucial! If this function is not well-suited, there is
  nothing even the best learning algorithm in the world can save.
970
Summer 2019

Defining a similarity /kernel / distance function


In a kernel approach (or distance or similarity based
approach):
I Any prior knowledge you have has to go into the similarity /
kernel / distance function.
I Keep in mind: ML tries to classify such that points that are
similar / close tend to end up in the same classes.
Ulrike von Luxburg: Statistical Machine Learning
971
Summer 2019

Defining a similarity /kernel / distance function


(2)
Example from bioinformatics:
I We want to classify proteins as “druggable” or “not druggable”
  (this means whether they have the potential to be used in a
  medical context).
Ulrike von Luxburg: Statistical Machine Learning

(Images omitted; credit: BiochemLabSolutions.com)
972
Summer 2019

Defining a similarity /kernel / distance function


(3)
I As similarity function between proteins we use a score
produced by the BLAST sequence alignment algorithm. The
intuition is that proteins with a similar primary sequence have
a similar function.
I Alternatively, we might take into account that the important
information is not so much the primary sequence but the 3d
Ulrike von Luxburg: Statistical Machine Learning

structure of the protein. In this case, we need to define a


similarity function that takes the 3d structure of the protein
into account.
973
Summer 2019

Defining a similarity /kernel / distance function


(4)
Example digit classification:
I To define the similarity between two hand-written digits, we
should ignore the color of the pen in which the digit has been
written.
Ulrike von Luxburg: Statistical Machine Learning
974

Reducing the number of training points?


Summer 2019

Data compression
I Sometimes the data set we have is too big to be processed
(here I refer to the number n of points, not to their dimension
d).
I Ideally, we would like to have a smaller data set, but we don’t
want to loose much information.
I Most importantly, by reducing the data set we do not want to
Ulrike von Luxburg: Statistical Machine Learning

introduce substantial artefacts or distortion to the data.


Two approaches that are both used commonly:
976
Summer 2019

Data compression (2)


Subsampling
I randomly select n0 < n points from the original data set and
just train on the smaller set.
I One might want to repeat this procedure several times and
average in the end.
I Here the distribution of the data is the same as before, but the
variance of the result will be higher (simply because we have a
Ulrike von Luxburg: Statistical Machine Learning

lower sample size).


977
Summer 2019

Data compression (3)


Vector quantization:
I Run an algorithm to select a set of n0 “representative points”
from the original data set.
I A common approach is to use the k-means algorithm with
k = n0 for this task.
I This approach introduces a change to the data distribution,
the centers selected by k-means no longer follow the original
Ulrike von Luxburg: Statistical Machine Learning

distribution of the data.


One indicator why this is the case: It can be very efficient
(from the k-means point of view) to just put a small number
of representatives in high-density regions, but cover the outliers
by a representative each.
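A minimal Python sketch of the two compression approaches just described (my own illustration; the vector quantization variant uses scikit-learn's k-means and inherits the distribution-distortion caveat above):

import numpy as np
from sklearn.cluster import KMeans

def subsample(X, n_small, seed=0):
    # keep n_small randomly chosen points; same distribution, higher variance
    idx = np.random.default_rng(seed).choice(len(X), size=n_small, replace=False)
    return X[idx]

def vector_quantize(X, n_small, seed=0):
    # replace the data set by n_small k-means centers; distorts the distribution
    km = KMeans(n_clusters=n_small, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_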
978

Unsupervised dimensionality reduction


Summer 2019

Dimensionality reduction
Dimensionality reduction is a very useful preprocessing step if the
dimensionality of the data is high. Unless one removes too many
dimensions, it very often helps to improve classification accuracy
(intuitively, we remove noise from the data).

There are two considerably different approaches to this problem:


I unsupervised dimensionality reduction (see below)
Ulrike von Luxburg: Statistical Machine Learning

I supervised dimensionality reduction. This is usually called


“feature selection”, see later today.
980
Summer 2019

Dimensionality reduction (2)


Unsupervised dimensionality reduction: we have seen two
algorithms so far:
I (kernel) PCA / SVD. This is THE standard approach in this
context.
I Isomap. Is usually not so much used for data preprocessing,
but more for unsupervised learning on its own.
There exist many, many more methods.
Ulrike von Luxburg: Statistical Machine Learning

But be aware: it can always happen that your (unsupervised)


dimensionality reduction destroys important information, see the
discussion in the PCA section for examples.
981

Data standardization
Summer 2019

Data standardization
It is very common to use data standardization (centering and
normalizing), and almost never hurts.

We have already discussed standardization in the context of


(kernel) PCA:
I Center the data points
Ulrike von Luxburg: Statistical Machine Learning

I Normalize rows or columns of your data matrix, see the


discussion in the context of kernel PCA.
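A minimal Python sketch of one common form of standardization (my own illustration; centering each feature and scaling it to unit standard deviation is one typical choice):

import numpy as np

def standardize_columns(X, eps=1e-12):
    # center each feature (column) and scale it to unit standard deviation
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)   # eps guards against constant features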
983

Clustering
Summer 2019

Clustering
Sometimes it makes sense to understand the cluster structure of
your data.
I Many small clusters to reduce the size of your data set (vector
quantization).
I Few big clusters of data. You might then treat the classes
differently, or use learning algorithms that exploit the cluster
Ulrike von Luxburg: Statistical Machine Learning

structure. Often, this is simply used as an explorative


preprocessing step, in particular as a sanity check for your data
(or to discover unknown aspects of your data).
985

Setting up the learning problem



Choice of a loss / risk function


Summer 2019

Weighted loss and risk functions


The default loss function for classification is the 0-1-loss.

However, there are a couple of standard cases where we need to


incorporate some weights into loss functions.

Unbalanced classes.
I Assume your training data consists of 1000 points, but just 10
Ulrike von Luxburg: Statistical Machine Learning

of them are from class +1, and the other 990 from the class -1.
I If you now use a standard loss function, it is very likely that
the best classifier is the one that simply predicts -1
everywhere. WHY?
I To circumvent this problem you have to reweight the loss
function such that training errors that mispredict points of
class 1 get much more punished than the other way round.
988
Summer 2019

Weighted loss and risk functions (2)


I As an example, you can define the training error (empirical
error) as
      R_n(f) = Σ_{i: Y_i = 1} ℓ(X_i, Y_i, f(X_i)) + γ · Σ_{i: Y_i = −1} ℓ(X_i, Y_i, f(X_i))

where γ is a parameter. High γ means that errors in class -1


get punished much more severely.
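A minimal Python sketch of this reweighted empirical 0-1 error (my own illustration; labels are assumed to be in {+1, -1} as on the slide):

import numpy as np

def weighted_01_error(y_true, y_pred, gamma):
    # errors on class -1 are weighted by gamma, errors on class +1 by 1
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    errors_pos = np.sum((y_true == 1) & (y_pred != y_true))
    errors_neg = np.sum((y_true == -1) & (y_pred != y_true))
    return errors_pos + gamma * errors_neg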
Ulrike von Luxburg: Statistical Machine Learning
989
Summer 2019

Weighted loss and risk functions (3)


Importance-weighted classification.
I It also might be the case that a correct result is very important
for some training points, but not so important for other
training points.
Example: you might care much more about “good customers”
than about not so good customers. So you might be more
careful with annoying some customers by adds than others.
Ulrike von Luxburg: Statistical Machine Learning

I In such cases you can assign different weights wi to the


training points and then define the empirical error as

      R_n(f) = Σ_{i=1}^n w_i ℓ(X_i, Y_i, f(X_i))
990
Summer 2019

Weighted loss and risk functions (4)


Cost sensitive classification.
I In many applications, the kind of errors we make are not
symmetric.
I Example: spam classification
I If a spam mail ends up in your inbox, no much harm done.
I But if an important non-spam email ends in your spam folder,
this can be a disaster.
Ulrike von Luxburg: Statistical Machine Learning

I Here the loss function itself contains weights, that is


      ℓ(X_i, Y_i, f(X_i)) = 1  if Y_i = spam, f(X_i) = ham
                          = γ  if Y_i = ham, f(X_i) = spam

  where γ is a parameter (say, 10^3).


991
Summer 2019

Designing your own loss function


Discrete loss functions like the weighted 0-1-loss are NP-hard to
optimize. Instead, most ML algorithms use a different loss function
(called surrogate loss).
If you want to design such a surrogate loss function, here are some
considerations you should take into account (this is not part of the
standard ML processing chain, and often rather difficult, but I
wanted to mention the keywords):
Ulrike von Luxburg: Statistical Machine Learning

I Ideally, your new loss function should be convex, otherwise you


are going to have a hard time optimizing it.
992
Summer 2019

Designing your own loss function (2)


I There are two (slightly different) properties of loss functions:
being proper and classification calibrated. On a very high
level, both definitions try to ensure that for any classifier f and
the Bayes classifier f ∗ we have

      f ≠ f*  ⟹  R(f) > R(f*)


Ulrike von Luxburg: Statistical Machine Learning

I Formally, the definitions of “proper” and “calibrated” are not


exactly the same but are in a close relationship.
993
Summer 2019

Designing your own loss function (3)


I All this is pretty recent work, and the design of loss and risk
functions is a very active field of research, see for example the
publications by Peter Bartlett (now in Brisbane) or Bob
Williamson (Australian National University).
Ulrike von Luxburg: Statistical Machine Learning
994

Training
Summer 2019

Multi-class approaches
Assume we are given a classification problem with K classes, labeled
1, .., K. Further, assume that the ordering of these labels is not
important (“closeness” of the labels does not have any meaning).
One-versus-all
I For each k ∈ {1, .., K} we train binary classification problems
where the first class contains all points of class k and the other
class all remaining points.
Ulrike von Luxburg: Statistical Machine Learning

I We then get K classifiers f_k. The final decision is for the
  class k whose classifier gives the highest score:

      f_final(x) = argmax_{k=1,...,K} f_k(x)

One-vs-one.
996
Summer 2019

Multi-class approaches (2)


I We train each class against each other class, this then gives
K(K − 1)/2 classifiers fkl in the end.
I The final classification is then by majority vote:

      f_final(x) = argmax_{l=1,...,K} Σ_{k=1,...,l−1,l+1,...,K} 1_{f_lk(x) > 0}
Ulrike von Luxburg: Statistical Machine Learning

Comments.
I One can prove that both approaches lead to Bayes consistent
classifiers if the underlying binary classifiers are Bayes
consistent.
I There also exist more complicated schemes, but in practice
they don’t really perform better than the simple ones.
I Multi-class scenarios are also an active field of research.
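A minimal Python sketch of the one-versus-all scheme (my own illustration; train_binary is a placeholder for whatever binary learning algorithm is used and is assumed to return a real-valued scoring function):

import numpy as np

def train_one_vs_all(X, y, classes, train_binary):
    # one binary problem per class: class k against everything else
    return {k: train_binary(X, np.where(y == k, 1, -1)) for k in classes}

def predict_one_vs_all(classifiers, x):
    # final decision: the class whose binary classifier assigns the highest score
    return max(classifiers, key=lambda k: classifiers[k](x))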
997

Selecting parameters by cross validation


Summer 2019

Cross validation - purpose


In all machine learning algorithms, we have to set parameters or
make design decisions:
I Regularization parameter in ridge regression or Lasso
I Parameter C of the SVM
I Parameter σ in the Gaussian kernel
I Number of principal components in PCA
Ulrike von Luxburg: Statistical Machine Learning

I But you also might want to figure out whether certain design
choices make sense, for example whether it is useful to remove
outliers in the beginning or not.
It is very important that all these choices are made appropriately.
Cross validation is the method of choice for doing that.
999
Summer 2019

K-fold cross validation


1  INPUT: Training points (X_i, Y_i)_{i=1,...,n}, a set S of different
   parameter combinations.
2  Partition the training set into K parts that are equally large.
   These parts are called "folds".
3  for all choices of parameters s ∈ S
4      for k = 1, ..., K
5          Build one training set out of folds 1, ..., k − 1, k + 1, ..., K
           and train with parameters s.
6          Compute the validation error err(s, k) on fold k.
7      Compute the average validation error over the folds:
       err(s) = Σ_{k=1}^K err(s, k) / K.
8  Select the parameter combination s that leads to the best
   validation error: s* = argmin_{s ∈ S} err(s).
9  OUTPUT: s*
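A minimal Python sketch of this procedure (my own illustration; train_fn and error_fn are placeholders for the learning algorithm and validation error of your choice, and X, y are assumed to be numpy arrays):

import numpy as np

def k_fold_cv(X, y, param_grid, train_fn, error_fn, K=5, seed=0):
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
    best_s, best_err = None, np.inf
    for s in param_grid:
        errs = []
        for k in range(K):
            val = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            model = train_fn(X[train], y[train], s)       # train on all folds except fold k
            errs.append(error_fn(model, X[val], y[val]))  # validate on fold k
        if np.mean(errs) < best_err:
            best_s, best_err = s, float(np.mean(errs))
    return best_s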
1000

K-fold cross validation (2)

[Figure omitted.]
Summer 2019

K-fold cross validation (3)


I Once you selected the parameter combination s∗ , you train
your classifier a final time on the whole training set. Then you
use a completely new test set to compute the test error.
Ulrike von Luxburg: Statistical Machine Learning
1002
Summer 2019

K-fold cross validation (4)


I Never, never use your test set in the validation phase. As soon
as the test points enter the learning algorithm in any way, they
can no longer be used to compute a test error. The test set
must not be used in training in any way!
I In particular: you are NOT ALLOWED to first train using
cross validation, then compute the test error, realize that it is
not good, then train again until the test error gets better. As
Ulrike von Luxburg: Statistical Machine Learning

soon as you try to “improve the test error”, the test data
effectively gets part of the training procedure and is spoiled.
1003
Summer 2019

K-fold cross validation (5)


What number of folds K?

Not so critical, often people use 5 or 10.


Ulrike von Luxburg: Statistical Machine Learning
1004
Summer 2019

K-fold cross validation (6)


How to choose the set S?
I If you just have to tune one parameter, say the regularization
  constant λ, then choose λ on a log scale, say
  λ ∈ {10^−3, 10^−2, ..., 10^3}.
I If you have to choose two parameters, say C and the kernel
  width σ, define, say, S_C = {10^−2, 10^−1, ..., 10^5},
  S_σ = {10^−2, 10^−1, ..., 10^3}, and then choose S = S_C × S_σ.
Ulrike von Luxburg: Statistical Machine Learning

That is, you have to try every parameter combination!


I You can already guess that if we have more parameters, then
  this is going to become tricky. Here you might want to run several
  cross validations, say first choose C and σ (jointly) and fix
  them. Then choose the number of principal components, etc.
I Note that overfitting can also happen for cross-validation!
1005
Summer 2019

K-fold cross validation (7)


I There are also some advanced methods to “walk in the
parameter space” (the idea is to try something like a gradient
descent in the space of parameters).
Ulrike von Luxburg: Statistical Machine Learning
1006
Summer 2019

Advantages and disadvantages


Disadvantages of cross validation:
I Computationally expensive!!! In particular, if you have many
parameters to tune, not just one or two.
I Note that the training size of the problems used in the
  individual cross-validation training runs is n · (K − 1)/K. If the
  sample size is small, then the parameters tuned on the smaller
  folds might not be the best ones on the whole data set
  (because the latter is larger).
I It is very difficult to prove theoretical statements that relate
the cross-validation error to the test error (due to the high
dependency between the training runs). In particular, the CV
error is not unbiased, it tends to underestimate the test error.

Further reading: Y. Yang. Comparing learning methods for
classification. Statistica Sinica, 2006, and references therein.
1007
Summer 2019

Advantages and disadvantages (2)


Advantages:

There is no other, systematic method to choose parameters in a


useful way.

Always, always, always do cross validation!!! Make sure the final


test set is never touched while training (retraining for improving the
test error is not allowed, then the data is spoiled).
Ulrike von Luxburg: Statistical Machine Learning
1008
Summer 2019

Feature selection
Literature:
I Text book: Shalev-Shwartz/Ben-David, Chapter 25

I A great overview paper: Guyon, Elisseeff: Introduction to


Ulrike von Luxburg: Statistical Machine Learning

variable and feature selection. JMLR, 2003.


I A whole book (based on a feature selection challenge): Guyon
et al.: Feature Extraction: Foundations and Applications.
Springer, 2006.
1009
Summer 2019

Feature selection problem


I Given high-dimensional data vectors.
I Goal is to reduce the number of features, but in a supervised
way.
I We don’t want to lose (much) classification accuracy.
I Reasons for doing this:
I Curse of dimensionality
Ulrike von Luxburg: Statistical Machine Learning

I Computational reasons
I We simply might want to “understand” our classifier. For
example, in a medical context people don’t want to have a
black-box classifier that simply suggests a certain treatment.
They want to know what are the reasons for this choice.
1010
Summer 2019

Feature selection, first thoughts


I Ideally, what we would like to do is to take all subsets of
features and figure out which of them leads to the best
classifier.
I However, this is not a good idea. WHY?
Ulrike von Luxburg: Statistical Machine Learning
1011
Summer 2019

Filter methods
General procedure:
I Take the training data (points and labels)

I Try to identify “good features” just based on this data (see


below for how this works).
I Then train your classifiers with the good features only.
Ulrike von Luxburg: Statistical Machine Learning

General properties:
I Faster than wrapper methods, less overfitting than wrapper
methods
I But independent of the actual classifier we use, which might
not be such a good idea
1012
Summer 2019

Filter methods (2)


Scores:
I Usually, filter methods try to compute “dependency scores”
between sets of features and labels.
I Example:
I Assume we want to classify mushrooms as “edible” or
“poisonous”. We collect many features: size, smell, color,
shape, ... , and also have the true labels in our training set.
Ulrike von Luxburg: Statistical Machine Learning

I If we now figure out that mushrooms are edible if and only if


they are brown then we just need the color feature for perfect
prediction.
I Note: this kind of feature selection is supervised (we need the
label information)!
1013
Summer 2019

Filter methods (3)


I To assess how “indicative” a subset of features is for given
labels, there exist many scores (a small sketch follows after this list):
I based on correlation
I based on Fisher information
I based on information theoretic measures such as mutual
information
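A minimal sketch of such a score in Python: rank the features by their
mutual information with the labels and keep the highest-scoring ones
(assuming scikit-learn; the data and the number of kept features are
placeholders):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

X = np.random.randn(300, 20)                                 # placeholder data
y = (X[:, 3] + 0.1 * np.random.randn(300) > 0).astype(int)   # label depends mostly on feature 3

scores = mutual_info_classif(X, y, random_state=0)   # one dependency score per feature
top = np.argsort(scores)[::-1][:5]                   # indices of the 5 highest-scoring features
X_reduced = X[:, top]                                # train the classifier on these columns only
print("selected features:", top)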
Ulrike von Luxburg: Statistical Machine Learning
1014
Summer 2019

Filter methods (4)


Sequential forward selection:
I Start with an empty set.

I Then, one after another, add “the best” of the remaining


features, according to some score.
I Do this as long as a second, overall score significantly improves
by adding features. Then stop.
Ulrike von Luxburg: Statistical Machine Learning

Drawbacks:
I it can happen that two features are only indicative together
(each alone is not very indicative). The naive sequential
method might miss this.
I You can never “undo” a choice.
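A minimal sketch of sequential forward selection in Python; the score
used here (cross-validated accuracy of a logistic regression) is only
one possible choice and is an assumption, not prescribed by the slides:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features):
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # try adding each remaining feature and keep the best one
        candidates = [(np.mean(cross_val_score(LogisticRegression(max_iter=1000),
                                               X[:, selected + [j]], y, cv=5)), j)
                      for j in remaining]
        score, j = max(candidates)
        if score <= best_score:      # overall score no longer improves: stop
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected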
1015
Summer 2019

Filter methods (5)


Sequential backward selection:

Analogous, just start with all features and keep on removing


“unimportant ones”.
Ulrike von Luxburg: Statistical Machine Learning
1016
Summer 2019

Filter methods (6)


More complicated procedures, for example based on branch and
bound methods:
try to search through the tree of all feature subsets, but evaluate
only few of them ... difficult and seldom used.
Ulrike von Luxburg: Statistical Machine Learning
1017
Summer 2019

Wrapper methods
As opposed to filter methods, the wrapper methods repeatedly train
the actual learning algorithm we want to use.
I Train the classifier with different sets of features and compute
the cross validation error.
I To select the final set of features, use the “smallest set” that
still produces a good cross validation error.
Ulrike von Luxburg: Statistical Machine Learning

Advantage: We only select features if they are really useful for the
actual algorithm we use.

Disadvantage:
I Computationally very expensive (we have to retrain the
classifier over and over again).
1018
Summer 2019

Wrapper methods (2)


I Very prone to overfitting: one can interpret the feature
selection method as a very large blowup of the hypothesis
space, so overfitting happens easily.
Ulrike von Luxburg: Statistical Machine Learning
1019
Summer 2019

Feature selection checklist from Guyon /


Elisseeff 2003
1. Do you have domain knowledge? If yes, construct a better set
of ad hoc features.
2. Are your features commensurate? If no, consider normalizing
them.
3. Do you suspect interdependence of features? If yes, expand
Ulrike von Luxburg: Statistical Machine Learning

your feature set by constructing conjunctive features or


products of features, as much as your computer resources
allow you.
4. Do you need to prune the input variables (e.g. for cost, speed
or data understanding reasons)? If no, construct disjunctive
features or weighted sums of features (e.g. by clustering or
matrix factorization, see Section 5).
1020
Summer 2019

Feature selection checklist from Guyon /


Elisseeff 2003 (2)
5. Do you need to assess features individually (e.g. to understand
their influence on the system or because their number is so
large that you need to do a first filtering)? If yes, use a
variable ranking method (Section 2 and Section 7.2); else, do
it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
Ulrike von Luxburg: Statistical Machine Learning

7. Do you suspect your data is “dirty” (has a few meaningless


input patterns and/or noisy outputs or wrong class labels)? If
yes, detect the outlier examples using the top ranking variables
obtained in step 5 as representation; check and/or discard
them.
1021
Summer 2019

Feature selection checklist from Guyon /


Elisseeff 2003 (3)
8. Do you know what to try first? If no, use a linear predictor.
Use a forward selection method (Section 4.2) with the “probe”
method as a stopping criterion (Section 6) or use the 0-norm
embedded method (Section 4.3). For comparison, following
the ranking of step 5, construct a sequence of predictors of
same nature using increasing subsets of features. Can you
Ulrike von Luxburg: Statistical Machine Learning

match or improve performance with a smaller subset? If yes,


try a non-linear predictor with that subset.
9. Do you have new ideas, time, computational resources, and
enough examples? If yes, compare several feature selection
methods, including your new idea, correlation coefficients,
backward selection and embedded methods (Section 4). Use
linear and non-linear predictors. Select the best approach with
model selection (Section 6).
1022
Summer 2019

Feature selection checklist from Guyon /


Elisseeff 2003 (4)
10. Do you want a stable solution (to improve performance and/or
understanding)? If yes, sub-sample your data and and redo
your analysis for several bootstraps.
Ulrike von Luxburg: Statistical Machine Learning
1023
Summer 2019

Literature on feature selection


I Text book: Shalev-Shwartz/Ben-David, Chapter 25
I An great overview paper: Guyon, Elisseeff: Introduction to
variable and feature selection. JMLR, 2003.
I A whole book (based on a feature selection challenge): Guyon
et. al: Feature Extraction: Foundations and Applications.
Springer, 2006.
Ulrike von Luxburg: Statistical Machine Learning
1024
1025 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Evaluation of the results


Summer 2019

Counting performance measures


There are many different ways to measure the error of a classifier,
we are going to summarize many of them now.

The main differences between these performance measures show up
when the classes are very unbalanced.
Ulrike von Luxburg: Statistical Machine Learning
1026
Summer 2019

Counting performance measures (2)


Confusion table:
Ulrike von Luxburg: Statistical Machine Learning

For example, fp denotes the number of points that have wrongly


been predicted to belong to the positive class.
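The table itself is a figure in the slides; for reference, the standard
layout of the confusion table is

                    true class +    true class −
    predicted +     tp              fp
    predicted −     fn              tn

with P = tp + fn (number of truly positive points) and N = fp + tn.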
1027
Summer 2019

Counting performance measures (3)


I Error rate: fraction of points that are wrongly classified:
(f n + f p)/(P + N )
I Accuracy: fraction of examples that are correctly classified:
1 − errorrate
Ulrike von Luxburg: Statistical Machine Learning
1028
Summer 2019

Counting performance measures (4)


I True positive rate (sensitivity): tp / P
(“How many of the true positives did we find?”)
I False positive rate: fp / N
(“How many of the negative points have been wrongly
classified positive”)?
I True negative rate (specificity): tn / N
I False negative rate: fn / P
Summer 2019

Counting performance measures (5)


If the classes are highly unbalanced, one sometimes uses:
I Positive predictive value: tp/(tp + f p)
I Negative predictive value: tn/(tn + f n)
Ulrike von Luxburg: Statistical Machine Learning
1030
Summer 2019

Counting performance measures (6)


In information retrieval the following measures are common (here
we are mainly interested in the positive class, we want to retrieve
documents from a collection that fit the search query):
I Recall: tp/P (how many positive examples can we find)
I Precision: tp/(tp + f p) (how many of the positively classified
examples are indeed correct)
Ulrike von Luxburg: Statistical Machine Learning

These are particularly used in applications where discovering true
negatives does not add much value to a classifier.
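A small Python helper that computes the counting measures above,
assuming the four entries of the confusion table are given:

def counting_measures(tp, fp, tn, fn):
    P, N = tp + fn, tn + fp                 # number of truly positive / negative points
    return {
        "error rate":      (fn + fp) / (P + N),
        "accuracy":        (tp + tn) / (P + N),
        "true pos rate":   tp / P,          # sensitivity, recall
        "false pos rate":  fp / N,
        "true neg rate":   tn / N,          # specificity
        "precision":       tp / (tp + fp),
    }

print(counting_measures(tp=40, fp=10, tn=930, fn=20))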
1031
Summer 2019

ROC and AUC


In many applications, in particular in information retrieval, there are
many more negative examples than positive examples:
I Most webpages are irrelevant to a certain search query.

I Most possible links do not exist in a social network.

I etc
Ulrike von Luxburg: Statistical Machine Learning

In such cases, classification accuracy is a very bad performance


measure (WHY?).

Instead, one is interested in true positives and false positives. To be
able to judge classifiers based on both criteria simultaneously, we can
now use ROC curves.
1032
Summer 2019

ROC and AUC (2)


ROC (=Receiver-operator characteristic) curve:
I Consider a family of classifiers class = sign(g(x) + Θ)
I Plots the false positive rate versus the true positive rate for
varying decision threshold Θ:
I Vary Θ from −∞ to ∞
I Evaluate tp(Θ) and f p(Θ)
I Then plot the points (f p(Θ)/N, tp(Θ)/P ).
I Leads to a curve in [0, 1]^2 (a sketch of the computation follows below).
Note: indeed, tp/P ∈ [0, 1] and f p/N ∈ [0, 1], so the ROC is a
curve in the unit square.

[Figures: example ROC curves; intuition with normal distributions.]
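A minimal sketch in Python (numpy only) of how such a curve, and the
area under it used later, can be computed from real-valued scores g(x);
the scores and labels below are placeholders:

import numpy as np

def roc_points(scores, labels):                      # labels in {0, 1}
    P, N = labels.sum(), (1 - labels).sum()
    fpr, tpr = [0.0], [0.0]
    for t in np.sort(scores)[::-1]:                  # sweep the threshold from high to low
        positive = scores >= t
        fpr.append((positive & (labels == 0)).sum() / N)
        tpr.append((positive & (labels == 1)).sum() / P)
    fpr.append(1.0); tpr.append(1.0)
    return np.array(fpr), np.array(tpr)

scores = np.random.randn(1000)
labels = (scores + np.random.randn(1000) > 0).astype(int)    # noisy "ground truth"
fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)        # trapezoidal rule
print("AUC:", auc)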


Summer 2019

ROC and AUC (7)


ROC for comparing classifiers:
I Assume you have two classifiers that depend on a certain
parameter σ
I Plot the ROC curve of both classifiers
I If the curve of classifier 1 is always above the one of classifier
2, then classifier 1 is considered superior.
Ulrike von Luxburg: Statistical Machine Learning
1038
Summer 2019

ROC and AUC (8)


I Often, such a clear picture is not true, the curves are going to
intersect.
Ulrike von Luxburg: Statistical Machine Learning

I In this case, you might still be able to say in what parameter


range one classifier is better than the other.
I Or you might want to use AUC.
1039
Summer 2019

ROC and AUC (9)


AUC (Area under the ROC curve):
I To translate the ROC to a “number”, sometimes the area
under the ROC curve is used as a performance measure.
I The larger the area, the “better” the classifier.
Ulrike von Luxburg: Statistical Machine Learning
1040
Summer 2019

ROC and AUC (10)


ROC and AUC are used a lot in machine learning. However, there is
also a lot of criticism related to these measures, see references in
the end.
Ulrike von Luxburg: Statistical Machine Learning
1041
Summer 2019

Multi-class performance measures


I Accuracy and error rate can still be defined, but the more
classes there are, the less informative these numbers become. WHY?
I In general, the more classes there are, the harder it is to
summarize the classification performance in one number.
I The best way to assess the quality of multi-class classifiers is
to discuss the confusion matrix directly ...
Ulrike von Luxburg: Statistical Machine Learning
1042
Summer 2019

Comparing many classifiers


I Below is a table I took from a random publication on
classification (µ denotes the mean, σ the standard deviation of
the result on several independent tests).
I You will find similar tables in very many publications.

I What can you read from this table???


[Table: errors (mean µ ± standard deviation σ) of several classifiers
on several data sets, taken from the mentioned publication.]
Summer 2019

Comparing many classifiers (3)


Often it is not so easy to decide which classifier is “better”:
I “Better in general” will be hard anyway (no free lunch
theorem, see Monday!)
I Depending on the choice of data sets! (here lots of cheating is
possible)
I When would you say is a classifier “really better” than another
one???
Ulrike von Luxburg: Statistical Machine Learning
1045
Summer 2019

Statistical tests for comparing classifiers


You can try to do this in a more sound way using statistical tests:
I The null hypothesis is that both classifiers perform the same

I As test statistic use the difference in error rates: erri − ẽrri on


many different data sets i (here erri and ẽrri are the errors of
the two classifiers on data set i)
I Assumption: These values are independent across data sets.
Ulrike von Luxburg: Statistical Machine Learning

I Then you can use a t-test to test whether the performance of


the classifiers is significantly different.
Permutation test:
I You can also use a permutation test.

I Here you compare the statistic erri − ẽrri against the statistic
where you randomly exchange erri and ẽrri .
I Read up on permutation tests to see how this works ... (a sketch follows below)
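A minimal sketch of such a paired permutation (sign-flip) test in
Python; the error values of the two classifiers are placeholders:

import numpy as np

err_a = np.array([0.12, 0.25, 0.08, 0.31, 0.19, 0.22, 0.15])   # classifier 1, one error per data set
err_b = np.array([0.15, 0.27, 0.09, 0.35, 0.18, 0.26, 0.20])   # classifier 2
diff = err_a - err_b
observed = np.abs(diff.mean())

rng = np.random.default_rng(0)
n_perm = 10000
stats = np.empty(n_perm)
for i in range(n_perm):
    signs = rng.choice([-1, 1], size=len(diff))   # randomly exchange err_i and err~_i per data set
    stats[i] = np.abs((signs * diff).mean())

p_value = np.mean(stats >= observed)              # fraction of permutations at least as extreme
print("p-value:", p_value)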
1046
Summer 2019

Statistical tests for comparing classifiers (2)


Comment:
I It is not extremely popular to use statistical tests, and it is also
somewhat questionable whether they are really useful.
I A test has to make assumptions on the underlying distribution.
I For example, the t-test assumes the “data” (in this case, the
values erri and ẽrri ) to be normally distributed, independent,
and from the same population.


I This cannot really be true, it would not even hold for the
Bayes errors (if we plotted a histogram of the Bayes errors in
the data sets we use, it would not look like a normal
distribution).
I But if the assumptions are not satisfied, the test is
meaningless.
1047
Summer 2019

Some critical remarks


A quote from Duin (1996), see also Hand (2008), (full references
below):

We are interested in the real performance for practical applications.


Therefore, an application domain has to be defined. The traditional
way to do this is by a diverse collection of datasets. In studying the
results, however, one should keep in mind that such a collection
Ulrike von Luxburg: Statistical Machine Learning

does not represent any reality. It is an arbitrary collection, at most


showing partially the diversity, but certainly not with any
representative weight. It appears still possible that for classifiers
showing a consistently bad behavior in the problem collection,
somewhere an application exists for which they are perfectly suited.
1048
Summer 2019

Some critical remarks (2)


And one more (same source):

In comparing classifiers one should realize that some classifiers are


valuable because they are heavily parameterized and thereby offer a
trained analyst a large flexibility in integrating his problem
knowledge in the classification procedure. Other classifiers, on the
contrary, are very valuable because they are entirely automatic and
Ulrike von Luxburg: Statistical Machine Learning

do not demand any user parameter adjustment. As a consequence


they can be used by anybody. It is therefore difficult to compare
these types of classifiers in a fair and objective way.
1049
Summer 2019

Some critical remarks (3)


If you want to compare classifiers:
I Always try to compare them for a particular task (the “best
classifier” does not exist, see no free lunch theorem).
I Always test on a variety of data sets, under many different
conditions.
I Be fair: try to implement all classifiers in the best way you
can. If available, use implementations by the people who
invented the algorithms (often a considerable amount of work
goes into fine-tuning a classifier).
I Most people won’t believe you if you say that your classifier
“consistently outperforms” the other classifiers.
1050
Summer 2019

Some critical remarks (4)


I Try to assess the strengths / weaknesses of your classifier (and
be open about it): the most helpful insight is to say that “in
this situation, prefer classifier A, in that situation classifier B”.
This also means to identify the situations where your approach
does NOT work.
Ulrike von Luxburg: Statistical Machine Learning
1051
Summer 2019

Some critical remarks (5)


If you read that “one classifier is better than the other one” always
be a bit suspicious:
I Has the experiment been designed in a fair way? For example,
have all parameters been chosen by cross validation?
I How were the data sets selected?

I Are there data sets of different types (many/few sample
points, high/low dimension, balanced/unbalanced classes,
toy/real world data, ... )?


1052
Summer 2019

Some critical remarks (6)


Finally:
I Eventually, some of the very good algorithms tend to get
identified in the community (SVM, spectral clustering, ... ),
simply because many people get positive experiences by using
them (but others might go completely unnoticed).
I Machine learning challenges:
I People put a particular machine learning problem online (data
Ulrike von Luxburg: Statistical Machine Learning

and some evaluation protocol). Then everybody can submit


his/her solutions. The winner gets a prize (or just the fame).
I For examples see www.kaggle.com
1053
Summer 2019

Some references
There is lots of research on how to compare classifiers.

David Hand (London) did a lot of (critical) research on the topic of


comparing classifiers, see for example:
I Hand D.J. Assessing the performance of classification
methods. International Statistical Review, 2012
I Hand D.J. Measuring classifier performance: a coherent
Ulrike von Luxburg: Statistical Machine Learning

alternative to the area under the ROC curve. Machine


Learning 77, 2009.
I Jamain, A., Hand, D. J. Mining supervised classification
performance studies: a meta-analytic investigation. Journal of
Classification, 25, 87-112. 2008.
1054
Summer 2019

Some references (2)


A couple of other references:
I Duin, R. A Note on Comparing Classifiers, Pattern Recognition
Letters, 17:529-536.1996.
I Demsar: Statistical comparisons of classifiers over multiple
data sets. JMLR, 2006.
I Yang: Comparing learning methods for classification. Statistica
Ulrike von Luxburg: Statistical Machine Learning

Sinica, 2006.
1055
1056 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

General guidelines for attacking ML problems


1057 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Key steps in the processing chain


Summer 2019

The key steps


As we have seen:
I Machine learning is not about applying one given algorithm to
your data set.
I It requires a lot of tuning, playing, trial and error.

I If your problem is easy, simple approaches might work.

I If your problem is difficult, you might not be successful if you


Ulrike von Luxburg: Statistical Machine Learning

don’t use a sophisticated processing chain.


1058
Summer 2019

The key steps (2)


Here is the list of what I consider the minimal things you should
always perform (a sketch of such a processing chain follows after the list):
I Data preprocessing:
I Figure out whether your training data has systematic biases.
I Standardize your features
I If your data is high-dimensional, try unsupervised
dimensionality reduction (PCA) and supervised feature
Ulrike von Luxburg: Statistical Machine Learning

selection.
I The learning approach:
I Consider to incorporate weights in your loss function if your
classes are unbalanced.
I Use regularization to prevent overfitting.
I Always use cross-validation to set your parameters / decide on
design principles.
I Make sure you run the final test on an independent set!
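A minimal sketch of such a processing chain in Python (assuming
scikit-learn; the data, the classifier and the parameter grid are
placeholders). Standardization and PCA are part of the pipeline, so
they are re-fitted inside every cross-validation fold and no
information leaks from the validation part:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = np.random.randn(300, 50), np.random.randint(0, 2, 300)   # placeholder data

pipe = Pipeline([("scale", StandardScaler()),                    # standardize features
                 ("pca", PCA(n_components=10)),                  # unsupervised dimensionality reduction
                 ("clf", SVC(class_weight="balanced"))])         # weights for unbalanced classes

search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)      # the final test still has to happen on a separate, untouched set
print(search.best_params_)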
1059
1060 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

High-level guidelines and principles


Summer 2019

High-level guidelines
Try not to lose sight of the following, very general (and abstract)
principles:
I Try to incorporate any prior knowledge about your data into
the process of learning.
I Try to understand the inductive bias used by your learning
algorithm. (Remember, learning is impossible without an
Ulrike von Luxburg: Statistical Machine Learning

inductive bias). Is it appropriate for your problem?


I When solving a problem of interest, do not solve a more
general problem as an intermediate step. (V. Vapnik).
I Play, play, play!
1061
1062 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

SKIPPED
(*) Online learning
Summer 2019

Online learning
Literature:
On the level of text books:
I Chapter 7 in Mohri/Rostamizadeh/Tawalkar
Ulrike von Luxburg: Statistical Machine Learning

I Chapter 21 in Shalev-Shwartz/Ben-David, Textbook level


introduction.
Research level, focus on theory:
I Cesa-Bianchi, Lugosi: Prediction, learning and games. A
comprehensive research-level book, focus is on theory.
I Lecture notes by Sasha Rakhlin, focus is on theory.
1063
1064 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Warmup
Summer 2019

Online learning: intuition


Consider the example of spam classification:
I Training examples arrive in a stream, one after the other.

I At each point in time, we have a current working model by


which we can predict the label of the new instance (spam or
not-spam).
I The goal is to make as few mistakes as possible.
Ulrike von Luxburg: Statistical Machine Learning

But the circumstances are quite different from the batch setting:
I The distribution over these examples can change over time

I We play against an adversary who tries to exploit all the


weaknesses of our current predictions.
1065
Summer 2019

Online learning: intuition (2)


Online learning, a bit more formally:
I We maintain a “model” ft by which we predict labels.
I At each point t in time, we first observe a point xt .
I Then we have to predict what we believe is the correct label of
xt , by applying our current model function: pt := ft (xt ).
I Then we get to see the true label yt of this point
Ulrike von Luxburg: Statistical Machine Learning

I We incur an error if pt ≠ yt .
1066
Summer 2019

No assumptions on input sequence


To accommodate all the different possibilities for the input sequences
(changing over time, adversary, etc.) we proceed as follows:

We do not make ANY assumption on how the sequence is being


generated.

There could be an adversary who, at each point in time, comes up


Ulrike von Luxburg: Statistical Machine Learning

with a “difficult” example (Xi , Yi ). He can choose Xi and Yi


completely as he wishes (that is, he can also give wrong labels).

?!?!??!?!??!?!??!?!??!
1067
Summer 2019

No assumptions on input sequence (2)


Let’s start with a couple of examples:

Example: Random outcomes


I Adversary picks Xi uniformly at random from [0, 1]

I Then he throws a coin and gives us the corresponding result as


Yi .
I In this scenario there won’t be any way that we can predict the
Ulrike von Luxburg: Statistical Machine Learning

label better than random guessing.


1068
Summer 2019

No assumptions on input sequence (3)


Example: fictitious play
I The adversary is allowed to choose the label of the new point.

I We assume that he knows our learning strategy, that is, he
knows which function ft we currently use to predict.
I He now chooses some Xt , finds out what we would have
predicted by ft (Xt ), and chooses Yt exactly as the opposite
label.
Ulrike von Luxburg: Statistical Machine Learning

I We are going to predict the wrong label at every single step!

Looks pretty hopeless ... but looking at it the right way shows that
it isn’t ...
1069
Summer 2019

Minimizing regret
What would be a good way to measure the success of an online
learning algorithm? Above we have already seen that the absolute
number of errors is not a good measure for how successful an
algorithm is.

Assume that we choose our model function ft from some
function class F. We now define the regret with respect to F:

    Regret(T) := sup_{f ∈ F}  sup_{(x1,y1),...,(xT,yT)}  [ Σ_{t=1}^T |pt − yt| − Σ_{t=1}^T |f(xt) − yt| ]

In words, we measure how much we regret not having used another
predictor in our class F.
1070
Summer 2019

Minimizing regret (2)


Consider again the two examples from above:

Example 1 (random outcome):


I No function in the class would be better than what we could
do, so the regret would be (close to) 0. This makes sense and
is the result we would expect: if there is nothing we can learn,
then we should not be punished for not being able to learn.
Ulrike von Luxburg: Statistical Machine Learning

Example 2 (fictitious play):


I In the example where the adversary chooses a label to fool us,
the regret can be up to T (depending on the function class),
and there is nothing we can do against it.
I This is not what we would hope to get — after all, we should
be as clever as the adversary, so why is he allowed to fool us?
1071
Summer 2019

Minimizing regret (3)


I To sidestep the latter problem, we introduce one more
component: we allow the learner to randomize his predictions.
Ulrike von Luxburg: Statistical Machine Learning
1072
Summer 2019

Minimizing regret (4)


Randomized learner:
I At each point in time, the learner maintains a probability
distribution over the function space F.
I When he is asked to make a prediction, he randomly chooses a
strategy according to his internal probability.
I The adversary is allowed to know the current probability
distribution (in order to be able to act adversarially), but he
Ulrike von Luxburg: Statistical Machine Learning

does not have any influence on the randomness that generates


our label.
1073
Summer 2019

Prediction with expert advice, weighted majority


algorithm
Literature: Chapter 21 in Shalev-Shwartz/Ben-David
Ulrike von Luxburg: Statistical Machine Learning
1074
Summer 2019

Prediction with expert advice


I We are given a (finite) class F of d different prediction
functions. The functions are called “experts”.
I Your job is to find out how to combine the opinion of all
experts to extract a good decision.
Example applications:
I Ask different experts for an opinion about whether to buy or
Ulrike von Luxburg: Statistical Machine Learning

sell stock at the stock market


I Consider different weather forecast services and figure out
what the best prediction is
1075
Summer 2019

Deterministic weighted majority


First idea:
I We start with all experts having the same weight.
I We predict the label of the new point by the weighted majority
among our experts.
I If an expert makes a mistake, we reduce its weight by a factor
of 1/2.
Ulrike von Luxburg: Statistical Machine Learning

Intuition:
I Assume there exists one expert that is very good.

I The weights of all other experts (the ones that predict worse
than the best expert) become exponentially smaller than the
weight of the best one.
I So in the long run, most of the weight will be accumulated by
the best expert.
I So in the long run, our prediction will be close to the one by
the best expert.
1076
Summer 2019

Deterministic weighted majority (2)


Weighted majority algorithm (N denotes the number of experts, β the
update factor):

Weighted-Majority(N)
  for i ← 1 to N do
      w1,i ← 1
  for t ← 1 to T do
      Receive(xt)
      if Σ_{i: yt,i = 1} wt,i ≥ Σ_{i: yt,i = 0} wt,i then ŷt ← 1
      else ŷt ← 0
      Receive(yt)
      if ŷt ≠ yt then
          for i ← 1 to N do
              if yt,i ≠ yt then wt+1,i ← β · wt,i
              else wt+1,i ← wt,i
  return wT+1

Figure 7.3 (Mohri et al.): Weighted majority algorithm, with
yt, yt,i ∈ {0, 1} (yt,i denotes the prediction of expert i at time t).
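A minimal Python version of the algorithm above (numpy only; the
expert predictions and the true labels are assumed to be given as
arrays):

import numpy as np

def weighted_majority(expert_preds, labels, beta=0.5):
    # expert_preds: shape (T, N), entries in {0, 1}; labels: shape (T,)
    T, N = expert_preds.shape
    w = np.ones(N)
    mistakes = 0
    for t in range(T):
        vote_1 = w[expert_preds[t] == 1].sum()
        vote_0 = w[expert_preds[t] == 0].sum()
        prediction = 1 if vote_1 >= vote_0 else 0
        mistakes += int(prediction != labels[t])
        w[expert_preds[t] != labels[t]] *= beta    # penalize the experts that were wrong
    return mistakes, w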


Deterministic weighted majority (3)

We can give the following guarantee on the performance of the
algorithm (β is the update factor):

Theorem 7.3 (Mohri et al.)
Fix β ∈ (0, 1). Let mT be the number of mistakes made by algorithm
WM after T ≥ 1 rounds, and m*T the number of mistakes made by the
best of the N experts in hindsight. Then the following inequality
holds:

    mT ≤ (log N + m*T · log(1/β)) / log(2/(1 + β))

Proof: based on simple counting arguments, see the book of Mohri
et al.
Summer 2019

Deterministic weighted majority (4)


Bottom line:
I number of mistakes is about a constant times the number of
mistakes of the best expert in hindsight
I This is remarkable, we don’t make any assumption on the
input sequence whatsoever, it can be random, adversarial,
whatever you want.
I However, by adversarial fictitious play we can always construct
sequences that have regret of the order Θ(T ).


1079
Summer 2019

Randomized weighted majority


The weighted majority algorithm does not take into account how
“sure” our experts are, in the sense that our decision is always the
same, no matter how much more weight we have for one or the
other decision.

This can be improved by introducing randomization. The model is


as follows:
Ulrike von Luxburg: Statistical Machine Learning

I The learner is allowed to randomize his choice: at time t, the
learner comes up with a probability distribution over the
experts, that is, a vector of probabilities w1(t), ..., wd(t).
I When he gets to see the point Xt , he chooses one of his
experts, say fi , according to his probability distribution, and
uses this expert to predict the label fi (Xt ).
1080
Summer 2019

Randomized weighted majority (2)


I This decision leads to a loss of 1_{fi(Xt) ≠ Yt}. Because the
strategy is randomized, we consider the expected error:

    Σ_{i=1}^d wt^(i) · 1_{fi(Xt) ≠ Yt}
Ulrike von Luxburg: Statistical Machine Learning
1081
Randomized weighted majority (3)

The following algorithm is often called the exponential weights
algorithm (it assumes that the number of rounds T is known in
advance; this assumption can be removed with the doubling trick):

Weighted-Majority (exponential weights)
  input: number of experts, d; number of rounds, T
  parameter: η = sqrt(2 log(d) / T)
  initialize: w̃(1) = (1, . . . , 1)
  for t = 1, 2, . . .
      set w(t) = w̃(t) / Zt where Zt = Σi w̃i(t)
      choose expert i at random according to P[i] = wi(t)
      receive costs of all experts vt ∈ [0, 1]^d
      pay cost ⟨w(t), vt⟩
      update rule: for all i, w̃i(t+1) = w̃i(t) · e^(−η vt,i)
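A minimal Python version of the exponential weights algorithm above
(numpy only; the cost vectors vt ∈ [0, 1]^d are placeholders):

import numpy as np

def exponential_weights(costs, eta, seed=0):
    # costs: shape (T, d), one cost vector per round
    T, d = costs.shape
    w_tilde = np.ones(d)
    rng = np.random.default_rng(seed)
    total_cost = 0.0
    for t in range(T):
        w = w_tilde / w_tilde.sum()            # current distribution over the experts
        i = rng.choice(d, p=w)                 # play one expert at random
        total_cost += costs[t, i]
        w_tilde *= np.exp(-eta * costs[t])     # multiplicative update for all experts
    return total_cost

T, d = 1000, 10
costs = np.random.rand(T, d)
eta = np.sqrt(2 * np.log(d) / T)
print(exponential_weights(costs, eta))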
Summer 2019
Randomized weighted majority (4)

Theorem 21.11 (Shalev-Shwartz/Ben-David)
Assuming that T > 2 log(d), the Weighted-Majority (exponential
weights) algorithm enjoys the bound

    Σ_{t=1}^T ⟨w(t), vt⟩ − min_{i∈[d]} Σ_{t=1}^T vt,i ≤ sqrt(2 log(d) · T)

Proof: Skipped, based on deriving upper and lower bounds on the
potential function Wt := Σ_{i=1}^N wt,i. See book by Mohri et al.

Discussion:
I The regret the algorithm incurs in T rounds is of the order √T.
I There exists a lower bound that shows that this bound is
optimal: there does not exist any algorithm that achieves a
better asymptotic guarantee.
Summer 2019

Randomized weighted majority (5)


I Formally, for this bound to hold we need to know the number
T of rounds in advance (in the algorithm, η depends on T ).
Can get around this by the “doubling trick”.
I The term log(d) measures the “capacity” of the function class
F. The whole setting can be generalized to infinite function
classes as well. The capacity term log d is then replaced by the
“Littlestone dimension”.
Ulrike von Luxburg: Statistical Machine Learning

I Using concentration inequalities, the expected loss can be


replaced by the actual loss.
1084
Summer 2019

Outlook: Multi-armed bandits


One very important feature of the expert advice setting:
In each round, we get full feedback on the loss that each single
strategy would have produced. This is unrealistic in many scenarios.

Example 1: routing
I You drive to work every day.

I There are d different routes you can take.


Ulrike von Luxburg: Statistical Machine Learning

I Goal is, in the long run, to have a good strategy for selecting
the route. The learning setting is as follows:
I Each day you choose one of them, and you observe how much
time it takes you.
I But you don’t get any feedback on how much time it would
have taken to take any of the other routes! → limited
feedback
1085
Summer 2019

Outlook: Multi-armed bandits (2)


Example 2: online advertising
I You have a choice of different ads to show to a user.

I After the user has seen the ad, you record whether he clicked
on it or not.
I However, you don’t have any feedback on what the user
would have done if you had shown him a different ad.
Ulrike von Luxburg: Statistical Machine Learning
1086
Summer 2019

Outlook: Multi-armed bandits (3)


The setting where you only see the outcome of the selected expert
is called “multi-armed bandits” setting.
I You have a number of slot machines.

I At each time t, you can pull the arm of one machine.

I You observe the amount of money you gain.

I But you don’t know what you would have gained at another
Ulrike von Luxburg: Statistical Machine Learning

machine.
Note that an interesting new aspect comes into the game:
exploration vs. exploitation ...

We don’t discuss bandits any further in this lecture.


1087
1088 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Follow the (perturbed,regularized) leader


Summer 2019

Follow the leader


Here is yet another, pretty simple strategy called follow the leader:
At each point in time, simply choose the expert that, up to now,
accumulated the smallest cumulative loss.

However, this rule can easily be fooled:


I Consider a scenario of two experts. The first one always
predicts 0, the other one always 1.
Ulrike von Luxburg: Statistical Machine Learning

I The adversary plays fictitious play.

I Then each of the experts is wrong in half of the cases, but


follow the leader is wrong in all of the cases, leading to a
regret of size Ω(T ).
But there is a simple trick to mend this: try to ensure that follow
the leader gets a bit more stable.

... follow the regularized and perturbed leader:


1089
Summer 2019

Follow the leader (2)


Follow the perturbed leader:
I Denote by Li,t the cumulative loss that expert i would have
incurred up to time t.
I Now add some noise Zi,t to it (introduce some perturbation):

L̃i,t := Li,t + Zi,t


Ulrike von Luxburg: Statistical Machine Learning

I Now choose the prediction of the expert with the smallest


L̃i,t -value.
1090
Summer 2019

Follow the leader (3)


Follow the regularized leader:
I Instead of adding a noise term, we add a regularization term to
each of the cumulative losses.
I This could, for example, be the norm of the coefficients in case
of linear classification.
I Makes the solutions more stable over time.
Ulrike von Luxburg: Statistical Machine Learning
1091
1092 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Literature: book by Mohri et al.


Perceptron and winnow
Summer 2019

Online linear classification: perceptron


Consider linear classification in an online setting:
I Data points live in Rd , but are received one after the other

I Goal is to maintain a hyperplane that separates the classes.

Main idea:
I You maintain a weight vector w
Ulrike von Luxburg: Statistical Machine Learning

I Whenever your prediction makes a mistake, you slightly update


your weight vector:
I If the algorithm makes an error, this means that yt ⟨wt, xt⟩ < 0.
I After the update, we have yt ⟨wt+1, xt⟩ = yt ⟨wt, xt⟩ + η‖xt‖², so
the inner product becomes “more positive”, moving the
weight vector in the correct direction.
1093
Summer 2019

Online linear classification: perceptron (2)


Perceptron(w0)
  w1 ← w0                          (typically w0 = 0)
  for t ← 1 to T do
      Receive(xt)
      ŷt ← sgn(wt · xt)
      Receive(yt)
      if ŷt ≠ yt then
          wt+1 ← wt + yt xt        (more generally wt + η yt xt with η > 0)
      else wt+1 ← wt
  return wT+1

Figure 7.6 (Mohri et al.): Perceptron algorithm.
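A minimal Python version of the perceptron above (numpy only, labels
in {−1, +1}; the data stream is a placeholder):

import numpy as np

def perceptron(stream, d, eta=1.0):
    w = np.zeros(d)
    for x, y in stream:                 # points arrive one after the other
        y_hat = np.sign(w @ x)
        if y_hat != y:                  # mistake: move w towards y * x
            w = w + eta * y * x
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))   # labels from a separating hyperplane
print(perceptron(zip(X, y), d=3))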


1094
Summer 2019

Online linear classification: perceptron (3)


There are many different ways in which the perceptron algorithm
can be analyzed:
I It can be shown that it is equivalent to stochastic gradient
descent.
I One can prove margin-like generalization bounds that describe
the performance if the data can be linearly separated by a
hyperplane with a certain margin.
Ulrike von Luxburg: Statistical Machine Learning
1095
Summer 2019

Online linear classification: perceptron (4)


Variants: there are many, many variants of the perceptron:
I You can kernelize it.

I You can teach it to “forget the past” (“forgetron”).
I many more ...

The perceptron is about the oldest learning algorithm that exists
(Rosenblatt, 1958; margin bound by Novikoff 1962; kernel version by
Aizerman 1964).
1096
Summer 2019

Winnow
Winnow also constructs linear classifiers, but it uses multiplicative
updates rather than additive ones:
I If the prediction of an expert is correct, its weight is increased.
I If the prediction of an expert is incorrect, its weight gets decreased.
Ulrike von Luxburg: Statistical Machine Learning
1097
Summer 2019

Winnow (2)
Winnow(η)
  w1 ← (1/N, . . . , 1/N)
  for t ← 1 to T do
      Receive(xt)
      ŷt ← sgn(wt · xt)
      Receive(yt)
      if ŷt ≠ yt then
          Zt ← Σ_{i=1}^N wt,i exp(η yt xt,i)
          for i ← 1 to N do
              wt+1,i ← wt,i exp(η yt xt,i) / Zt
      else wt+1 ← wt
  return wT+1

Figure 7.10 (Mohri et al.): Winnow algorithm, with yt ∈ {−1, +1} for
all t ∈ [1, T ].

The principle is reminiscent of weighted majority in the expert
setting ...
1098
Summer 2019

History
I There is a close connection of online learning to game theory,
and a number of important results have already been proved in
the 1950s (regret of order √T, but linear dependency on d).
I Perceptron: dates back to Rosenblatt 1958; first theoretical
analysis by Minsky and Papert 1969 and Novikoff 1962.
I Weighted majority and its variants are due to Littlestone and
Warmuth, 1989 and the following years (first analysis with
log d dependency).
I In the last 10 years or so the COLT community has had a strong
focus on analyzing online algorithms:
I Littlestone/Warmuth 1999
I EXP3 (bandit setting breakthrough) around 2000
Still a very active area of research.
1099
1100 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

research in our group


Outlook: some of the
1101 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Wrap up
1102 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Excursion: Research, publications, reviewing


Summer 2019

Publication culture in Computer Science


Keywords to discuss:

journals, conferences, lecture notes, arxiv, technical reports


Ulrike von Luxburg: Statistical Machine Learning
1103
Summer 2019

Publication culture in Computer Science (2)


Top conferences in Machine Learning: NIPS, ICML, COLT
Top journals in Machine Learning: JMLR, MLJ, IEEE IT
Top journal in Statistics: Annals of Statistics
Ulrike von Luxburg: Statistical Machine Learning
1104
Summer 2019

Reviewing: the process itself


... in journals:

Keywords to discuss:
I blind, double-blind,

I editor (in chief, associate)

I how the whole process works: submission, editor, associate


Ulrike von Luxburg: Statistical Machine Learning

editor, 3 reviewers, associate editor, decision (accept,


minor/major revision, reject), notification
1105
Summer 2019

Reviewing: the process itself (2)


... in conferences:

Keywords are:
I program chair = editor in chief

I area chair = associate editor (mainly in large conferences)

I program committee: sometimes this means the reviewers,
sometimes the area chairs.
Ulrike von Luxburg: Statistical Machine Learning

I Biggest issue: scaling! (NIPS 2016: 2500 submissions, 3200


reviewers, 100 area chairs)
I How the process works: submission, program chairs, bidding +
paper assignment to reviewers, reviews come in, discussion
among reviewers, discussion among area chairs/program
chairs, decision (accept/reject).
1106
Summer 2019

Points to address in a review


Two different parties are addressed in a review:
I Authors: want constructive and fair feedback about their paper

I Editors: need arguments for their decision

Points typically addressed in a review:


I Quality. Is the paper technically sound? Are claims
Ulrike von Luxburg: Statistical Machine Learning

well-supported? Is this a complete piece of work, or merely a


position paper? Are the authors careful (and honest) about
evaluating both the strengths and weaknesses of the work?
I Clarity. Is the paper clearly written? Is it well-organized?
Does it adequately inform the reader? (A superbly written
paper provides enough information for the expert reader to
reproduce its results.)
1107
Summer 2019

Points to address in a review (2)


I Originality. Are the problems or approaches new? Is this a
novel combination of familiar techniques? Is it clear how this
work differs from previous contributions? Is related work
adequately referenced?
I Significance. Are the results important? Does the paper
address a difficult problem in a better way than previous
research? Does it advance the state of the art? Does it
Ulrike von Luxburg: Statistical Machine Learning

provide unique data, unique conclusions on existing data, or a


unique theoretical or pragmatic approach?
1108
Summer 2019

Points to address in a review (3)


Typical structure of a review:
I Summarize what you believe are the main contributions of
the paper, in own words (about 1 paragraph), and summarize
your opinion about the paper (few sentences)
I Give detailed evaluation: address the four points (quality,
clarity, originality, significance), and summarize pros / cons.
I Give minor comments to the authors (typos, unclear
formulations, parts that cannot be understood, wrong
formulas, etc.)
I Private comments to the editor (not seen by the authors).
Here one can declare conflicts of interests, whether one knows
the authors, how thoroughly one has done the review (eg,
checked proof details).
1109
1110 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

... see files ...


Example reviews
Summer 2019

Judging the quality of researchers


People (in particular, administrators) always try to invent
numbers to measure quality. Per se, this is impossible, but you
should know a couple of terms:

Journal:
I impact factors, bogus!!!!

I editorial board
Ulrike von Luxburg: Statistical Machine Learning

I “standing in the community”

... of a researcher:
I citation numbers, h-index

I prizes

I is he/she in editorial boards


1111
Summer 2019

Finding a PhD position


Check carefully where you go:
I Your supervisor should have published regularly during the last
couple of years, in good conferences / journals.
I Ideally, the group should not be too small (just you) or too
large (30 people). In the latter case, insist on finding out who
would supervise you, because it definitely won’t be the head of
the group.
I Check how much and where all the other PhD students publish.

I How many international guests the group had during the last
year.
I To which conferences the people in the group go regularly. Is
there travel money?
I Are there any regular activities going on? Reading groups,
seminars, ...?
1112
Summer 2019

Finding a PhD position (2)


I Where do the other PhD students come from? From the same
university? Are there postdocs who came after their PhDs?
I How long does a PhD in that group usually take?
I How is the supervision organized in the group?
I Find out how much freedom the people have in selecting what
they want to work on.
Ulrike von Luxburg: Statistical Machine Learning

I How much time do people have for research, what other


obligations are there (teaching, project work, ...)?
I Ask in advance whether you will get the opportunity to talk to
another PhD student in the group. Listen to what they say
“between the lines”. Ask all the questions above to the head
of the group, and once more to the PhD student.
I Ultimately, you also need to have the impression that you like
the place and get along with your potential supervisor...
1113
1114 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Things we did not talk about


Summer 2019

Things that are important but we did not cover


I Many important algorithms, of course
I Probabilistic approaches (Graphical models, Bayesian learning,
Gaussian processes, etc). See the book by Kevin Murphy.
I Bio-inspired approaches (neural networks, deep networks,
genetic algorithms, etc).
I Reinforcement learning
Ulrike von Luxburg: Statistical Machine Learning

I Information-theoretic approaches. See book by David MacKay.


1115
1116 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Further machine learning resources


Summer 2019

Online classes and videos


There exist a number of online lectures (MOOC = massive online
open course) about machine learning, taught by some of the best
machine learners in the world:
I Stanford, Caltech, New York, ...

Machine learning summer schools:


I many of them exist, see for example www.mlss.cc.
Ulrike von Luxburg: Statistical Machine Learning

I Nearly all of them get videotaped and are available at youtube


or videolectures.net
Videolectures:
I Most machine learning conferences, workshops, summer
schools, etc are videotaped. The videos are online, many of
them at videolectures.net or youtube.
I So if you are interested in a particular topic, try to find a video
at videolectures, likelihood is high that you are going to be
successful.
1117
Summer 2019

Machine learning software


There exist many software tools for machine learning, from very
applied and simple to more sophisticated and closer to research. I
don’t have a favorite one.

But be aware: using such tools can only get you so far.
I You don’t have the freedom to make all the design choices you
want.
Ulrike von Luxburg: Statistical Machine Learning

I At least, try to understand what they do if you press a button...


1118
Summer 2019

Machine learning research


The top conferences in machine learning (with reviewed papers):
I Neural Information Processing Systems (NIPS)

I International Conference on Machine Learning (ICML)

I International Conference on Learning Theory (COLT)

Also good:
I AISTATS
Ulrike von Luxburg: Statistical Machine Learning

I UAI

I KDD (more data mining oriented)

The top journals:


I Journal of machine learning research (JMLR), the flagship
journal
I Machine Learning Journal (MLJ), good as well.
1119
1120 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Want more? Plan for the next semesters ...


1121 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Mathematical Appendix
Summer 2019

Recap: Probability theory


Literature:
I In general, any book on probability theory
Ulrike von Luxburg: Statistical Machine Learning

I On the homepage you can also find the link to a probability


recap writeup for a CS course at Stanford University (written
by Arian Maleki and Tom Do).
1122
1123 Ulrike von Luxburg: Statistical Machine Learning Summer 2019

Discrete probability theory


Summer 2019

Discrete probability measure


I Ω = space of “elementary events”, “sample space”.
This space is called “discrete” if it has finitely many elements.
I “Space of events”: In the discrete case this is simply the power
set P(Ω) of Ω, that is all possible subsets of Ω.
(In general it is more complicated, the space of events has to
be a “σ-algebra”).
Ulrike von Luxburg: Statistical Machine Learning

I Probability measure: P : P(Ω) → [0, 1] such that the following
three rules are satisfied (“Axioms of Kolmogorov”):
I P (A) ≥ 0 for all events A ⊂ Ω
I P (Ω) = 1
I “sigma-additivity”: Let S1, S2, ... ⊂ Ω be at most countably
many disjoint sets. Then P (S1 ∪ S2 ∪ ...) = Σi P (Si).
Note: in the discrete case, the probability measure is uniquely
defined on all of P(Ω) if we know P (ω) for all elementary events
ω ∈ Ω.
1124
Summer 2019

Discrete probability measure (2)


Example: throwing a die
I Elementary events: {1, 2, ..., 6}

I Probability of the elementary events:


P (1) = P (2) = ... = P (6) = 1/6.
I Probabilities of all other subsets of Ω can be computed based
on the elementary events due to the sigma-additivity.
Example: P (1, 2, 5) = P (1) + P (2) + P (5) = 3 · 1/6 = 1/2.
Ulrike von Luxburg: Statistical Machine Learning
1125
Summer 2019

Conditional probabilities
Define the probability of event A under the condition that event B
has taken place:

    P (A | B) = P (A ∩ B) / P (B)

Example with a die: compute the probability P ({3} | “odd”).

Solution:
A = {3}, B = {1, 3, 5}, P (A ∩ B) = P ({3}) = 1/6, P (B) = 1/2,
this implies P ({3} | “odd”) = (1/6)/(1/2) = 1/3.
1126
Summer 2019

Important formulas
I Union bound. Let A1 , ..., Ak be any events. Then
    P (A1 ∪ A2 ∪ ... ∪ Ak) ≤ Σ_{i=1}^k P (Ai)

Intuitive reason:
Ulrike von Luxburg: Statistical Machine Learning
1127
Summer 2019

Important formulas (2)


I Formula of total probability. Let B1 , ..., Bk be a disjoint
decomposition of the probability space, that is all Bi are
disjoint and B1 ∪ ... ∪ Bk = Ω. Then:
    P (A) = Σ_{i=1}^k P (A ∩ Bi) = Σ_{i=1}^k P (A | Bi) · P (Bi)
Ulrike von Luxburg: Statistical Machine Learning
1128
Summer 2019

Important formulas (3)


I Bayes’ formula:

    P (B | A) = P (B ∩ A) / P (A) = P (A | B) · P (B) / P (A)

Example:
The probability that a woman has breast cancer is 1%. The
probability that the disease is detected by a mammography is
Ulrike von Luxburg: Statistical Machine Learning

80 % (true positive rate). The probability that the test detects


the disease although the patient does not have it is 9.6% (false
positive rate). If a woman at age 40 is tested as positive, what
is the probability that she indeed has breast cancer?
1129
Summer 2019

Important formulas (4)


Define the following events:
A := mammography is positive
B := woman has breast cancer

Given:
I P (B) = 0.01
I P (A | B) = 0.80
I P (A | ¬B) = 0.096
I Need to compute P (A). Here we use the formula of total probability:

    P (A) = P (A | B) P (B) + P (A | ¬B) P (¬B)
          = 0.8 · 0.01 + 0.096 · 0.99 ≈ 0.103

Now we plug this into Bayes’ theorem and obtain

    P (B | A) = (0.80 · 0.01) / 0.103 ≈ 0.078
1130
Summer 2019

Random variables
A random variable is a function X : Ω → R.
Example:
I We have 5 red and 5 black balls in an urn

I We draw 3 balls randomly without replacement

I Random variable X = number of red balls we got


Ulrike von Luxburg: Statistical Machine Learning

A random variable is called discrete if its image is discrete (it can


take at most finitely many values).
1131
Summer 2019

Random variables (2)


A random variable X : Ω → R induces a probability distribution PX
on its image: for any (measurable) set A ⊂ R we define

PX (A) = P (X ∈ A)

The measure PX is called the distribution of the random variable.


Ulrike von Luxburg: Statistical Machine Learning
1132
Summer 2019

Important discrete probability distributions


I Bernoulli distribution: we throw a biased coin once. It takes
value 1 with probability p and value 0 with probability (1 − p).
I Binomial distribution B(n, p). We throw a biased coin n
times independently from each other. The binomial random
variable counts how often we got 1. It is defined as

    P (X = k) = (n choose k) · p^k · (1 − p)^(n−k)

It has expected value np and variance np(1 − p).


1133
Summer 2019

Important discrete probability distributions (2)


I Poisson distribution Pois(λ):

    P (X = k) = λ^k e^(−λ) / k!
The Poisson distribution counts the occurrence of “rare
events” in a fixed time interval (like radioactive decay), λ is
the intensity parameter.
Ulrike von Luxburg: Statistical Machine Learning

It has expected value λ and variance λ.


1134
Summer 2019

Independence
Two events A, B are called independent if
P (A ∩ B) = P (A) · P (B).

Note that this implies that P (A | B) = P (A).

Two random variables X, Y : Ω → R are called independent if for


Ulrike von Luxburg: Statistical Machine Learning

all events A, B we have that


P (X ∈ A, Y ∈ B) = P (X ∈ A) · P (Y ∈ B).

Example:
I Throw a coin twice. X = result of the first toss, Y = result of
the second toss. These two random variables are independent.
1135
Summer 2019

Independence (2)
I Throw a coin twice. X = result of the first toss, Y = sum of
the two results. These two random variables are not
independent.
Ulrike von Luxburg: Statistical Machine Learning
1136
Summer 2019

Expectation
For a discrete random variable X : Ω → {r1 , ..., rk } its expectation
(mean value) is defined as
    E(X) := Σ_{i=1}^k ri · P (X = ri)

Intuition: the expectation is the “average result”, where the results


Ulrike von Luxburg: Statistical Machine Learning

are weighted according to their probabilities.


Examples:
I We throw a die, X is the result. Then
E(X) = Σ_{i=1}^6 i · (1/6) = 3.5.
I We throw a biased coin, heads occurs with probability p, tails
with probability 1 − p. We assign the random variable X = 1
for heads and X = 0 for tails. Then
E(X) = 0 · (1 − p) + 1 · p = p.
1137
Summer 2019

Expectation (2)
Important formulas and properties:
I The expectation is linear: for random variables X1 , ..., Xn and
real numbers a1 , ..., an ∈ R,

    E( Σ_{i=1}^n ai Xi ) = Σ_{i=1}^n ai E(Xi)
Ulrike von Luxburg: Statistical Machine Learning

I Expectation and independence: If X, Y are independent, then

E(X · Y ) = E(X) · E(Y ).


1138
Summer 2019

Variance
The variance of a random variable is defined as

    Var(X) = E((X − E(X))²) = E(X²) − (E(X))²

For a discrete random variable with possible values r1, ..., rn, it is
given as

    Var(X) = Σ_{i=1}^n (ri − E(X))² · P (X = ri)

The variance measures how much the random variable “varies”


about its mean.
1139
Summer 2019

Variance (2)
Example:
I We throw a biased coin, heads occurs with probability p, tails
comes with probability 1 − p. We assign the random variable
X = 1 for heads and X = 0 for tails.
I We have already seen: E(X) = p.

I Now let’s compute the variance:


Ulrike von Luxburg: Statistical Machine Learning

Var(X) = (1 − p)2 p + (0 − p)2 (1 − p) = (1 − p)p

Important properties of the variance:


I Var(X) ≥ 0.
I For random variables X and scalars a, b ∈ R we have
Var(aX + b) = a2 Var(X)
1140
Summer 2019

Variance (3)
I If X, Y are independent random variables, then

Var(X + Y ) = Var(X) + Var(Y ).


Ulrike von Luxburg: Statistical Machine Learning
1141
Summer 2019

Standard deviation
The standard deviation of a random variable is just the square root
of the variance:

    std(X) = sqrt( Var(X) )
Ulrike von Luxburg: Statistical Machine Learning
1142
Summer 2019

Covariance and correlation


The covariance of two real-valued random variables X and Y is
defined as

    Cov(X, Y) = E( (X − E(X)) (Y − E(Y)) ) = E(XY) − E(X) E(Y)
It provides (one particular) measure of how related the two random
Ulrike von Luxburg: Statistical Machine Learning

variables are: whether we can use a linear (!) function to predict


one of them from the other one.

Properties:
I Cov(X, Y ) = Cov(Y, X)

I Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X, Y ).

I If Cov(X, Y ) = 0, the random variables are called


uncorrelated.
I X, Y independent =⇒ X, Y uncorrelated (but not vice versa)
1143
Summer 2019

Covariance and correlation (2)


The correlation coefficient is defined as

Cor(X, Y ) := ρ(X, Y ) := Cov(X, Y )/(std(X)std(Y ))

I rescales the covariance to a number between −1 and 1


I ρ = 1 iff Y = aX + b for a > 0, b ∈ R
I ρ = −1 iff Y = aX + b for a < 0, b ∈ R
Ulrike von Luxburg: Statistical Machine Learning
1144
Summer 2019

Covariance and correlation (3)


Examples (point sets and their correlation coefficient, taken from
wikipedia):
Ulrike von Luxburg: Statistical Machine Learning
1145
Summer 2019

Covariance and correlation (4)


(*) Covariance and correlation care about linear relationships:
Ulrike von Luxburg: Statistical Machine Learning

(*) Even if Y is a deterministic function of X, the covariance can


be 0
1146
Summer 2019

Covariance and correlation (5)


Exercise: Consider a symmetric random variable X (such that the
distributions of X and −X are the same), and define Y = X².
Then Cov(X, Y) = 0.
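Solution sketch (assuming the relevant moments exist): by symmetry
E(X) = 0 and E(X³) = 0, hence

    Cov(X, Y) = E(XY) − E(X) E(Y) = E(X³) − 0 · E(X²) = 0,

even though Y is a deterministic function of X.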
Ulrike von Luxburg: Statistical Machine Learning
1147
Summer 2019

(*) Important inequalities


I Markov’s inequality: Let X be a non-negative random variable
  and t > 0. Then

      P (X ≥ t) ≤ E(X)/t

I Chebyshev’s inequality:

      P (|X − E(X)| ≥ t) ≤ Var(X)/t²
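
A simulation sketch checking both bounds (an addition; the exponential distribution, the threshold t = 3 and the sample size are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # non-negative, E(X) = 1, Var(X) = 1
t = 3.0

print((x >= t).mean(), x.mean() / t)                        # Markov: left side <= right side
print((np.abs(x - x.mean()) >= t).mean(), x.var() / t**2)   # Chebyshev: left side <= right side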

Joint, marginal and product distribution


We want to look at the “joint distribution” of two random variables.
Example:
I We “sample” people: Ω = set of all people

I X = their weight (in kg), Y = their height (in cm).

I The joint distribution measures how the pair of random
  variables (X, Y ) : Ω → R² is distributed.

Joint, marginal and product distribution (2)


I The distribution of X is called the marginal distribution of X,
similarly for Y .

Joint, marginal and product distribution (3)


I Note that for given marginal distributions, there exist many
joint distributions that respect the marginals!

Joint, marginal and product distribution (4)


A particular joint distribution is the product distribution: it gives
the joint distribution of X and Y if they are independent of each
other:
I Consider two discrete random variables X, Y : Ω → R.

I Define the product distribution

      P ((X, Y ) = (x, y)) = P (X = x) · P (Y = y).

The construction works analogously for a product of finitely many
spaces.

(*) Conditional independence


Consider three discrete random variables X, Y, Z : Ω → R. We say
that X and Y are conditionally independent given Z if

    P (X ∈ A, Y ∈ B | Z ∈ C) = P (X ∈ A | Z ∈ C) · P (Y ∈ B | Z ∈ C)

for all sets A, B, C ⊂ R with P (Z ∈ C) > 0.



Variance and covariance of multivariate random variables
Variance and covariance for 1-dim random variables X ∈ R:

    Var(X) = E((X − E(X))²)

    Cov(X, Y ) = E((X − E(X))(Y − E(Y )))

They can be estimated from sample points x1 , ..., xn and y1 , ..., yn
as follows:

    x̄ := 1/n Σ_{i=1}^n x_i

    V̂ar(X) = 1/n Σ_{i=1}^n (x_i − x̄)²

Variance and covariance of multivariate random variables (2)
    Ĉov(X, Y ) = 1/n Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)

Note that for variance and covariance, one sometimes normalizes
the estimators V̂ar resp. Ĉov with the factor 1/(n − 1) instead of
1/n to achieve an unbiased estimate (we skip this issue here).

Variance and covariance of multivariate random variables (3)
Now consider d-dim random variables: X = (X^(1) , ..., X^(d) )'.
The expectation E(X) of a d-dim random variable is the vector
that contains the coordinate-wise expectations.

The overall variance over all d dimensions is the sum of the
variances of the individual dimensions:

    Var_d(X) = Σ_{i=1}^d E(‖X^(i) − E(X^(i))‖²)

The covariance matrix of X is a d × d-matrix C which encodes the
covariances between the individual dimensions of the distribution:

Variance and covariance of multivariate random variables (4)

    C_kl = Cov(X^(k) , X^(l))

EXAMPLE: SHOE SIZE / HEIGHT / AGE OF A PERSON

Variance and covariance of multivariate random variables (5)
These quantities can be estimated from the data:

    V̂ar_d(X) = 1/n Σ_{k=1}^d Σ_{i=1}^n (x_i^(k) − x̄^(k))²

    (Ĉ)_kl = 1/n Σ_{i=1}^n (x_i^(k) − x̄^(k))(x_i^(l) − x̄^(l))

I Ĉ is called the empirical covariance matrix or the sample
  covariance matrix.

Variance and covariance of multivariate random variables (6)
I If the data points are centered, and we define the matrix X
  containing the points as rows, then the empirical covariance
  matrix Ĉ coincides with 1/n X'X, because
  Ĉ_kl = 1/n Σ_{i=1}^n X_i^(k) X_i^(l) = 1/n (X'X)_kl

I In the following, we often drop the “hat” and the word
  “empirical”...
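
A small numpy sketch (an addition; the toy data below are arbitrary) that checks the estimator against numpy's own np.cov; bias=True makes numpy use the factor 1/n as in the definition above:

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))   # some correlated toy data

Xc = X - X.mean(axis=0)            # center the data points
C_hat = Xc.T @ Xc / n              # empirical covariance matrix 1/n X'X (centered data)
print(np.allclose(C_hat, np.cov(X, rowvar=False, bias=True)))   # True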

(*) Conditional expectation


Example:
I X, Y two independent throws of a die, Z = X + Y .

I Want to compute the expectation of Z under the condition
  that X was 3.
I We write E(Z | X = 3).

If we don’t fix the outcome value of X, then we write E(Z | X);
this is a random variable (because we don’t know the random
outcome of X).

Formally, this is a pretty complicated mathematical object. For
those who have not seen it before, we just treat it in an intuitive
manner.

Continuous probability theory


Probability theory gets more complicated once we go beyond the
discrete regime. In this class, we try to keep it on a somewhat
intuitive level.

Density and cumulative distribution


Consider a random variable X : Ω → R. We say that X has density
function p : R → R if for all (measurable) subsets A ⊂ R we have

    P (X ∈ A) = ∫_A p(x) dx

Density and cumulative distribution (2)


I Intuitively, a density is something like a “continuous
histogram”.
I Sometimes the density is abbreviated as “pdf” (“probability
density function”) in the literature.
I Density functions are always non-negative and integrate to 1.
  They don’t have to be continuous.
I Not every random variable can be described by a density, but
  in this course we won’t discuss this.

(*) Cumulative distribution function


A real-valued random variable can always be described by its
cumulative distribution function (sometimes abbreviated as “cdf” in
the literature).
For a random variable X : Ω → R it is defined as

g : R → R, g(x) = P (X ≤ x)

Uniform distribution
The uniform distribution on [0, 1]: for 0 ≤ a < b ≤ 1 we define

P (X ∈ [a, b]) = b − a

Its density is constant.



Normal distribution (univariate)


The most important continuous distribution on R is the normal
distribution, abbreviated N (µ, σ²).
I It has two parameters: its expectation µ and its variance σ².
I µ controls the location of the distribution
I σ controls the “width” of the distribution
I The density function of N (µ, σ²) is given as

      f_{µ,σ}(x) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²))
I The special case of mean 0 and variance 1 is called the
“standard normal distribution”. Sometimes the normal
distribution is also called a Gaussian distribution.
Multivariate normal distribution


The multivariate normal is defined for the d-dimensional space Rd ,
it is abbreviated by N (µ, Σ).
I It has two parameters: the expectation vector µ ∈ Rd , and the
covariance matrix Σ ∈ Rd×d . The covariance matrix is always
positive definite.
I The density function is defined as follows:

      f_{µ,Σ}(x) = 1/√((2π)^d det(Σ)) · exp(−1/2 (x − µ)' Σ⁻¹ (x − µ))

I The eigenvectors and eigenvalues of the covariance matrix
  control the shape of the Gaussian.
I Each of the marginal distributions is a univariate normal
distribution.
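
A small sampling sketch (an addition; the concrete µ, Σ and sample size are arbitrary choices) using numpy's multivariate normal generator:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])              # positive definite covariance matrix

X = rng.multivariate_normal(mu, Sigma, size=100_000)
print(X.mean(axis=0))                       # close to mu
print(np.cov(X, rowvar=False))              # close to Sigma

evals, evecs = np.linalg.eigh(Sigma)        # eigenvalues / -vectors control the shape
print(evals)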

Mixture of Gaussians
When generating toy data for machine learning applications, one
often uses a mixture of Gaussian distributions:

Given mean vectors µ1 , ..., µk ∈ R^d , positive definite covariance
matrices Σ1 , ..., Σk ∈ R^{d×d} , and mixing coefficients α1 , ..., αk > 0
with Σ_i αi = 1, the density function of the mixture of Gaussians is
defined as follows:

    f (x) = Σ_{i=1}^k αi f_{µi,Σi}(x)
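
A minimal sketch for generating such toy data (an addition; the mixing coefficients, means and covariance matrices below are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.5, 0.3, 0.2])                        # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])]

n = 1000
components = rng.choice(len(alphas), size=n, p=alphas)    # pick a component for each point
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in components])
print(X.shape)                                            # (1000, 2) toy data from the mixture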

Expectation
In the continuous domain, sums are going to be replaced by
integrals. For example, the expectation of a random variable X
with density function p(x) is defined as

    E(X) = ∫_R x · p(x) dx

Recap: Linear algebra


Literature:
I In general, any introductory book on linear algebra

I On the homepage you can also find the link to a short linear
algebra recap writeup (by Zico Kolter and Chuong Do).

Vector space
A vector space V is a set of “vectors” that supports the following
operations:
I We can add and subtract vectors: For v, w ∈ V we can build
  v + w, v − w.
I We can multiply vectors with scalars: For v ∈ V , a ∈ R we can
  build av.
I These operations satisfy all kinds of formal requirements
  (associativity, commutativity, identity element, inverse element,
  and so on).
Vector space (2)

Most prominent example: V = R^d .

Basis
A basis of a vector space is a set of vectors b1 , ..., bd ∈ V that
satisfies two properties:
I Any vector in V can be written as a linear combination of
  basis vectors:
  For any v ∈ V there exist a1 , ..., ad ∈ R such that
  v = Σ_{i=1}^d a_i b_i
I The vectors in the basis cannot be expressed in terms of each
  other, they are linearly independent:
  Σ_{i=1}^d a_i b_i = 0 =⇒ a_i = 0 for all i = 1, ..., d.

The number of vectors in a basis is called the dimension of the
vector space.

Basis (2)
Example:
I e1 := (1, 0) and e2 := (0, 1) form a basis of R2

I v1 := (1, 1) and v2 := (1, 2) form a basis of R2

I v1 := (1, 1) and v2 := (2, 2) do not form a basis of R2 .



Linear mappings
A linear mapping T : V → V satisfies
T (av1 + bv2 ) = aT (v1 ) + bT (v2 ) for all a, b ∈ R, v1 , v2 ∈ V .
Typical linear mappings are: stretching, rotation, projections, etc.,
and combinations thereof.

Note: to figure out what a linear mapping does, it is enough to
know what it does on the basis vectors: for v = Σ_i a_i b_i we know
by linearity that T (v) = T (Σ_i a_i b_i) = Σ_i a_i T (b_i).



Matrices
m × n-matrix A:

Matrices (2)
The transpose of a matrix, written as A^t or A', is the matrix where we
exchange rows with columns (that is, instead of a_ij we have a_ji).
Matrices (3)
We can multiply two matrices if their “dimensions” fit:
X = m × n-matrix, Y = n × k-matrix. Then Z := X · Y is an
m × k-matrix with entries

    Z_ij = Σ_{s=1}^n x_is y_sj

To make it more visual, Z_ij is the dot product of the ith row of X
with the jth column of Y.

Matrices (4)
The special case where Y is a vector of size n × 1 is called
matrix-vector multiplication:

    z = Xy with z_i = Σ_j x_ij y_j
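
A small numpy sketch (an addition; the matrix sizes are arbitrary) that spells out the entry-wise definition and compares it with numpy's built-in matrix product:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))     # m x n
Y = rng.standard_normal((3, 2))     # n x k

# entry (i, j) is the dot product of row i of X with column j of Y
Z = np.array([[X[i, :] @ Y[:, j] for j in range(Y.shape[1])] for i in range(X.shape[0])])
print(np.allclose(Z, X @ Y))        # True

y = rng.standard_normal(3)          # matrix-vector case
print(np.allclose(X @ y, [X[i, :] @ y for i in range(X.shape[0])]))   # True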

Linear mappings correspond to matrices


Linear mappings correspond to matrices:
Intuition: the columns of the matrix contain the images of the basis
vectors:

I Matrix-vector multiplication is then the same as applying the
  mapping to the vector.
I Multiplication of two matrices is the same as applying the
mappings one after the other.

The rank of a matrix


Many equivalent definitions: The rank of a matrix is ...
I ... the largest number of independent columns in the matrix
I ... the largest number of independent rows in the matrix
I ... the dimension of the image space of the linear mapping
  that corresponds to the matrix
I ... in case the matrix is symmetric: the rank is the number of
  non-zero eigenvalues of the matrix (see below).

An n × n-matrix is said to have full rank if it has rank n.

An n × n-matrix is said to have low rank if its rank is “small”
compared to n (this is not a formal definition, it is often used
informally).

Inverse of a matrix
I For some matrices A we can compute the inverse matrix A−1 .
It is the unique matrix that satisfies

A · A−1 = A−1 · A = Id

  where Id is the identity matrix (1 on the diagonal, 0
  everywhere else).
I A matrix is called invertible if it has an inverse matrix.
I A square matrix is invertible if and only if it has full rank.

Norms and scalar products


Some vector spaces have additional structure: norms or even scalar
products. In particular, this is true for Rd .

Given two vectors v = (v1 , ..., vn )^t and w = (w1 , ..., wn )^t ∈ R^n , their
scalar product is defined as ⟨v, w⟩ = Σ_{i=1}^n v_i w_i .

The norm ‖v‖ of a vector v ∈ R^d is defined as ‖v‖² = ⟨v, v⟩.

Intuition:
I The scalar product is related to the angle between the two
  vectors:
  I ⟨v, w⟩ = 0 ⇐⇒ v ⊥ w (vectors are orthogonal)
  I If v and w have norm 1, then ⟨v, w⟩ is the cosine of the angle
    between the two vectors.
I The norm is the length of a vector.

Norms and scalar products (2)


A matrix A is called orthogonal if all its columns are orthogonal to
each other. It is called orthonormal if additionally, all its columns
have norm 1.

For orthonormal square matrices, we always have A^t = A⁻¹ .

Eigenvalues and eigenvectors


A vector v ∈ R^n , v ≠ 0, is called an eigenvector of A ∈ R^{n×n} with
eigenvalue λ if Av = λv.

Intuition: in the direction of v, the linear mapping corresponding to
A is stretching by factor λ.

Taken together, all eigenvectors with eigenvalue λ form a subspace
called the eigenspace associated to eigenvalue λ. The dimension of
this subspace is called the geometric multiplicity of λ.

Eigenvalues and eigenvectors (2)


Just for completeness:

Eigenvalues can also be defined as the roots of the characteristic
polynomial f (λ) := det(A − λI) = 0. The degree of this
polynomial is d (the dimension of the space). The multiplicity of
such a root is called the algebraic multiplicity of λ.

The algebraic multiplicity is always greater than or equal to the geometric
one. In case of strict inequality, the matrix cannot be diagonalized.

Simple example where the two multiplicities do not agree:
the nilpotent matrix [0, 1; 0, 0] has eigenvalue 0 with geometric
multiplicity 1, but algebraic multiplicity 2. It cannot even be
diagonalized over C.

Eigenvalue decomposition of a symmetric matrix


Diagonalization:
I A matrix A is called diagonalizable if there exists a basis of
eigenvectors.
I In this case, we can write the matrix in the form

      A = V D V^t

  where V is an orthonormal matrix that contains the
  eigenvectors as columns, and D is a diagonal matrix containing
  the eigenvalues.
I One can also write the matrix in the form

      A = Σ_{i=1}^d λ_i v_i v_i^t

  where λ_i are the eigenvalues and v_i the eigenvectors.


Eigenvalue decomposition of a symmetric matrix (2)
I Intuitively, a matrix is diagonalizable if it performs “stretching
  and reflecting”, but no rotation.

Symmetric matrices are always diagonalizable and have real-valued
eigenvalues. Their eigenvectors (of different eigenvalues) are always
perpendicular to each other.
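
A small numerical check of this decomposition (an addition; the random symmetric matrix is an arbitrary example), using np.linalg.eigh for symmetric matrices:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                                # a symmetric matrix

lam, V = np.linalg.eigh(A)                       # real eigenvalues, orthonormal eigenvectors
print(np.allclose(A, V @ np.diag(lam) @ V.T))    # A = V D V^t
print(np.allclose(A, sum(lam[i] * np.outer(V[:, i], V[:, i]) for i in range(4))))
print(np.allclose(V.T @ V, np.eye(4)))           # eigenvectors are orthonormal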

(*) Singular Value Decomposition (SVD)


If a matrix is not square, we cannot compute eigenvalues. But there
exists a closely related concept, the singular values:

Any matrix Φ ∈ Rn×d can be decomposed as follows:

Φ = U ΣV t

where
I U ∈ Rn×n is orthogonal. Its columns are called
left singular vectors.
I Σ ∈ Rn×d is a diagonal matrix containing the singular values
σ1 , ..., σd on the diagonal
I V ∈ Rd×d is an orthogonal matrix. Its columns are called
right singular vectors.

(*) Singular Value Decomposition (SVD) (3)


There is a close relation between the singular values of Φ and the
eigenvalues of the (symmetric!) matrices ΦΦ^t and Φ^t Φ:
I The left singular vectors of Φ are the eigenvectors of ΦΦ^t .
  CAN YOU SEE WHY?
  ΦΦ^t = (U ΣV^t)(U ΣV^t)^t = U ΣV^t V Σ^t U^t = U Σ² U^t .
I The right singular vectors of Φ are the eigenvectors of Φ^t Φ.
I The non-zero singular values of Φ are the square roots of the
  non-zero eigenvalues of both Φ^t Φ and ΦΦ^t .
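
A small numpy sketch (an addition; the matrix size is arbitrary) checking these relations numerically:

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((6, 4))                   # an arbitrary non-square matrix

U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
lam = np.linalg.eigvalsh(Phi.T @ Phi)[::-1]         # eigenvalues of Phi^t Phi, descending

print(np.allclose(s ** 2, lam))                     # squared singular values = eigenvalues
print(np.allclose(Phi, U @ np.diag(s) @ Vt))        # Phi = U Sigma V^t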



(*) Singular Value Decomposition (SVD) (4)


Note in particular:
I An SVD exists for any matrix!

I The singular values are unique.

I The singular vectors are “as unique” as in an eigenvector
  decomposition (that is, up to scalar multiplication, and in case
  of higher multiplicity the singular vectors span a whole space).

Positive Definite Matrices


A symmetric matrix A is called positive semi-definite if all its
eigenvalues are ≥ 0. In case of strict inequality it is called positive
definite.

Equivalent formulations:
I Positive definite ⇐⇒ v t Av > 0 for all v ∈ Rn \ {0}.

I Positive semi-definite ⇐⇒ v^t Av ≥ 0 for all v ∈ R^n \ {0}.
I Positive semi-definite ⇐⇒ we can decompose the matrix in
  the form A = XX'.

(*) Generalized inverse


Consider a symmetric matrix A ∈ Rd×d .
I Let λ1 , ..., λd be the eigenvalues and v1 , ..., vd a corresponding set
  of eigenvectors of A. We can write A in the spectral
  decomposition as

      A = Σ_{i=1}^d λ_i v_i v_i^t

I In case the matrix has rank d, all its eigenvalues are non-zero.
  Then we can write the inverse of A as

      A⁻¹ = Σ_{i=1}^d (1/λ_i) v_i v_i^t

(*) Generalized inverse (2)


I In case the matrix is not of full rank, it is not invertible.
  However, we can define the Moore-Penrose generalized inverse
  as

      A+ := Σ_{i: λ_i ≠ 0} (1/λ_i) v_i v_i^t

  (intuitively, this is the inverse of the matrix A restricted to the
  subspace orthogonal to its nullspace).

(*) Generalized inverse (3)


Properties of the generalized inverse:

In general we don’t have that AA+ = I or A+ A = I.

But we have the following slightly weaker properties:


I AA+ A = A and A+ AA+ = A+
I (A+ )+ = A
I A+ A and AA+ are both symmetric.
I If A is invertible, then A−1 = A+ .
I AA+ is an orthogonal projection onto ran(A) (the image of
  the matrix A), and A+ A is an orthogonal projection onto
  ran(A^t ).
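
A small sketch checking these properties with numpy's np.linalg.pinv, which computes the Moore-Penrose pseudoinverse (an addition; the rank-deficient matrix below is an arbitrary example):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
A = X @ X.T                                      # symmetric, rank 3 < 5, not invertible

A_plus = np.linalg.pinv(A)                       # Moore-Penrose generalized inverse
print(np.allclose(A @ A_plus @ A, A))            # A A+ A = A
print(np.allclose(A_plus @ A @ A_plus, A_plus))  # A+ A A+ = A+
print(np.allclose(A @ A_plus, (A @ A_plus).T))   # A A+ is symmetric
print(np.allclose(A_plus @ A, (A_plus @ A).T))   # A+ A is symmetric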

(*) Generalized inverse (4)


Some intuition:
Consider a linear operator A that is a projection on some
lower-dimensional subspace. As an example, consider the projection of
the three-dim space to the two-dim plane:

    A(x1 , x2 , x3 )' := (x1 , x2 )' ∈ R²

Call the projection A and consider a “reconstruction” operator Arec .

I Note that from the result of the projection, it is impossible to
  reconstruct the original point exactly (this is why the matrix A
  is not invertible).
I However, I can reconstruct another point that would give the
same projection result: for example, I can simply define

Arec (x1 , x2 ) := (x1 , x2 , 17) ∈ R3



(*) Generalized inverse (5)


I Note that if I apply the projection again after reconstruction, I
get the same result as after the first projection: I have

AArec A = A

The Moore-Penrose pseudoinverse is one particular such
reconstruction operator.

(*) Rayleigh principle


Proposition 49 (Rayleigh principle)
Let A ∈ R^{n×n} be a symmetric matrix with eigenvalues λ1 ≥ ... ≥ λn
and eigenvectors v1 , ..., vn . Then

    λ1 = max_{v∈R^n} (v^t A v)/‖v‖² = max_{v∈R^n: ‖v‖=1} v^t A v.

The eigenvector v1 is the vector for which this maximum is attained.
Moreover,

    λ_{k+1} = max_{v ⊥ {v1,...,vk}: ‖v‖=1} v^t A v.

This theorem holds analogously for minimization problems. In this case,
the solution is given by the smallest eigenvalue / vector.
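
A small numerical sanity check (an addition; the matrix size and the number of random test vectors are arbitrary choices): v^t A v over random unit vectors never exceeds the largest eigenvalue and never falls below the smallest one.

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                         # symmetric matrix

lam, V = np.linalg.eigh(A)                # eigenvalues in ascending order
v = rng.standard_normal((5, 10_000))
v /= np.linalg.norm(v, axis=0)            # random unit vectors as columns

vals = np.einsum('ij,ij->j', v, A @ v)    # v^t A v for every column v
print(vals.max(), lam[-1])                # max is bounded by the largest eigenvalue
print(vals.min(), lam[0])                 # min is bounded by the smallest eigenvalue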

(*) Rayleigh principle (2)


Proof intuition.
I Let λ be any eigenvalue with unit eigenvector v. Then
  v^t A v = v^t (λv) = λ (because v^t v = 1).
I So among all eigenvectors v1 , ..., vn , the eigenvector v1 leads
  to the largest value λ1 .
I Now consider an arbitrary unit vector w ∈ R^n . Because A is
  symmetric, there exists a basis of eigenvectors v1 , ..., vn . In
  particular, there exist coefficients c_i such that w = Σ_i c_i v_i and
  ‖c‖ = 1.
I Then w^t A w = ... = Σ_{i,j=1}^n c_i c_j v_i^t A v_j .
I But for i ≠ j we get v_i^t A v_j = v_i^t λ_j v_j = 0 (because different
  eigenvectors are perpendicular to each other).
I So w^t A w = Σ_i c_i² v_i^t A v_i = Σ_i c_i² λ_i .

(*) Rayleigh principle (3)


I Among all c with ‖c‖ = 1, the maximum of this expression is
  attained for c1 = 1, c2 = ... = cn = 0.

Projections
A linear mapping P : E → E on a vector space is a projection
if and only if P² = P .
It is an orthogonal projection if and only if it is a projection and
nullspace(P ) ⊥ image(P ).

Projections (2)
We always have two points of view of a projection:

View 1: Represent the projected points still as elements of the
original space, that is P : R^d → R^d .

View 2: Represent the projected points just as elements of the
low-dim space, that is π : R^d → R^ℓ .

Projections (4)
View 2, projection on a one-dimensional subspace:
The orthogonal projection on a one-dimensional space spanned by
a unit vector a can be expressed as

    π : R^d → R, π(x) = a'x

View 2, projection on an ℓ-dimensional subspace:
Want to project on an ℓ-dim subspace S with ONB v1 , ..., vℓ .
Define the matrix V with the vectors v1 , ..., vℓ as columns. Then
compute the low-dim representation as

    π : R^d → R^ℓ , x ↦ V'x

View 1:
Define P := V V' (with V as above) and set

    π : R^d → R^d , x ↦ P x
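
A small numpy sketch relating the two views (an addition; the dimensions d = 5 and ℓ = 2 are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d, ell = 5, 2
V, _ = np.linalg.qr(rng.standard_normal((d, ell)))   # ONB v1, ..., v_ell as columns of V

x = rng.standard_normal(d)
low_dim = V.T @ x             # View 2: coordinates in the ell-dimensional subspace
P = V @ V.T                   # View 1: projection expressed in the original space

print(np.allclose(P @ P, P))               # P is a projection: P^2 = P
print(np.allclose(P @ x, V @ low_dim))     # both views describe the same projected point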

Projections (5)
Affine projections:

Linear projections always map 0 to 0. If we want to perform an
orthogonal projection on an affine (= shifted) space S̃ = S + µ, we
need to express the mapping as T x = P (x − µ) + µ.

(*) Matrix norms


There are many different ways in which one can define a norm of a
matrix:
I Operator norm, spectral norm: Consider a matrix as a linear
  operator on a normed vector space V with norm ‖ · ‖. Then
  define

      ‖A‖ = sup_{x∈V; ‖x‖=1} ‖Ax‖

  If A is normal (that is, AA* = A*A), then the operator norm
  coincides with

      max{|λ|; λ eigenvalue of A}.

  This is sometimes called the spectral norm.



(*) Matrix norms (2)


I Frobenius norm, Hilbert-Schmidt norm:

      ‖A‖_F := √(Σ_{ij} a_ij²) = √(trace(A*A)) = √(Σ_i σ_i²)

  where σ_i are the singular values of the matrix.



Excursion to convex optimization: primal, dual, Lagrangian
Literature:

I Appendix E in the book by Bishop


I Section 6.3 in the book by Schölkopf / Smola
I Your favorite book on convex optimization, for example:
I Boyd, S. and L. Vandenberghe. Convex Optimization.
Cambridge University Press, 2004. Comprehensive, yet easy to
read.

Convex optimization problems: intuition



Convex optimization problems


Convex sets:
I A subset S of a vector space is called convex if for all x, y ∈ S
and for all t ∈ [0, 1] it holds that tx + (1 − t)y ∈ S.
I In words: for any two points x, y ∈ S, the straight line
connecting these two points is contained in the set S.

Convex optimization problems (2)


Convex functions:
I A function f : S → R that is defined on a convex domain S is
called convex if for all x, y ∈ S and t ∈ [0, 1] we have
f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y).
I Intuitively, this means that if we look at the graph of the
function and we connect two points of this graph by a straight
line, then this line is always above the graph.

Convex optimization problems (3)


Examples:
I functions of one variable that are twice differentiable are
convex iff their second derivative is non-negative.
I Functions of several variables that are twice differentiable are
convex if their Hessian matrix is positive (semi)-definite.

Convex optimization problems (4)


Observe: For convex functions g, the sublevel sets of the form
{x|g(x) ≤ 0} are convex.

(Funnily, this is not true the other way round: you can have all
sublevel sets convex, yet the function is not convex.)

Convex optimization problems (5)


Convex optimization problem:
I An optimization problem of the form

      minimize f (x)
      subject to gi (x) ≤ 0 (i = 1, ..., k)

  is called convex if the functions f , gi are convex.
I Sometimes one also considers equality constraints of the form
  hj = 0. They can always be replaced by two inequality
  constraints hj ≤ 0 and −hj ≤ 0. However, the only situation
  in which hj and −hj are both convex occurs if h is a linear
  function.
I Convex optimization problems have the desirable property that
  any local minimum is already a global minimum.

Convex optimization problems (6)


Important terms:
I The function f over which we optimize is called the objective
function
I The functions gi are called the constraints.
I The set of points where all constraints are satisfied is called
  the feasible set.

Lagrangian: intuitive point of view



Convex optimization problem


We now want to derive a “recipe” by which many convex
optimization problems can be analyzed / rewritten / solved. We
don’t consider formal proofs, but just derive the concepts in an
intuitive way.

In particular, for the ease of presentation assume that all functions
are continuously differentiable (all statements hold in more general
settings as well, but one would need convex analysis for this).

Recap: gradient of a function


Consider a function f : Rd → R.
I The gradient of f is the vector of partial derivatives:

      ∇f (x) = (∂f /∂x1 , ..., ∂f /∂xd )'(x)

I For each x, the gradient ∇f (x) points in the direction where
  the function increases most:

Lagrange multiplier for equality constraints


Consider the following convex optimization problem:

minimize f (x)
subject to g(x) = 0

where f and g are convex.



Lagrange multiplier for equality constraints (2)


Recall: if g is convex, then its sublevel sets are convex:

Sublevel set: {x | g(x) ≤ c} (the green set in the figure)



Lagrange multiplier for equality constraints (3)


Gradient (equality constraint): For any point x on the “surface”
{g(x) = 0} the gradient ∇g(x) is orthogonal to the surface itself.

Intuition: to increase / decrease g(x), you need to move away from
the surface, not walk along the surface.

Lagrange multiplier for equality constraints (4)


Gradient (objective function): Consider the point x∗ on the
surface {g(x) = 0} for which f (x) is minimized. This point must
have the property that ∇f (x) is orthogonal to the surface.

Intuition: otherwise we could move a little along the surface to
decrease f (x).

Lagrange multiplier for equality constraints (5)


Consequence: at the optimal point, ∇g(x) and ∇f (x) are parallel,
that is, there exists some ν ∈ R such that ∇f (x) + ν∇g(x) = 0.

Lagrange multiplier for equality constraints (6)


We now define the Lagrangian function

L(x, ν) = f (x) + νg(x)

where ν ∈ R is a new variable called the Lagrange multiplier. Now
observe:
I The condition ∇f (x) + ν∇g(x) = 0 is equivalent to
  ∇x L(x, ν) = 0.
I The condition g(x) = 0 is equivalent to ∇ν L(x, ν) = 0.

To find an optimal point x∗ we need to find a saddle point of
L(x, ν), that is a point such that both ∇x L(x, ν) and ∇ν L(x, ν)
vanish.

Simple example
Consider the problem to minimize f (x) subject to g(x) = 0, where
f, g : R² → R are defined as

    f (x1 , x2 ) = x1² + x2² − 1
    g(x1 , x2 ) = x1 + x2 − 1

Observe: it is hard to solve this problem by naive methods because
it is unclear how to take care of the constraints!

Solution by the Lagrange approach:

Write it in the standard form:

    minimize x1² + x2² − 1
    subject to x1 + x2 − 1 = 0

Simple example (2)


The Lagrangian is

    L(x, ν) = x1² + x2² − 1 + ν(x1 + x2 − 1)

Now compute the derivatives and set them to 0:

    ∇x1 L = 2x1 + ν = 0
    ∇x2 L = 2x2 + ν = 0
    ∇ν L = x1 + x2 − 1 = 0

If we solve this linear system of equations we obtain
(x1∗ , x2∗ ) = (0.5, 0.5).
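
The three saddle point conditions form a linear system in (x1, x2, ν); a small numpy sketch (an addition) that solves it:

import numpy as np

# 2*x1 + nu = 0,  2*x2 + nu = 0,  x1 + x2 = 1
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])

x1, x2, nu = np.linalg.solve(A, b)
print(x1, x2, nu)                 # 0.5, 0.5, -1.0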

Lagrange multiplier for inequality constraints


Consider the following convex optimization problem:

minimize f (x)
subject to g(x) ≤ 0

where f and g are convex.

We now distinguish two cases: constraint is “active” or “inactive”:



Lagrange multiplier for inequality constraints (2)


Case 1: Constraint is “active”, that is the optimal point is on the
surface g(x) = 0.

Again ∇f and ∇g are parallel in the optimal point.

But furthermore, the direction of derivatives matters:


I The derivative of g points outwards (at any point on the
  surface g = 0). This is always the case if g is convex.
I Then the derivative of f is directed inwards (otherwise we
  could decrease the objective by walking inside).

Lagrange multiplier for inequality constraints (3)



So we have ∇f (x) = −λ∇g(x) for some value λ > 0.



Lagrange multiplier for inequality constraints (4)


Case 2: Constraint is “inactive”, that is the optimal point is not on
the surface g(x) = 0 but somewhere in the interior.

I Then we have ∇f = 0 at the solution (otherwise we could
  decrease the objective value).
I We do not have any condition on ∇g (it is as if we did not
  have this constraint).

Lagrange multiplier for inequality constraints (5)


We can summarize both cases using the Lagrangian again. We now
define the Lagrangian

L(x, λ) = f (x) + λg(x)

where the Lagrange multiplier has to be non-negative: λ ≥ 0.

I Case 1: constraint active, λ > 0.
  I Need to find a saddle point: ∇x L(x, λ) = ∇λ L(x, λ) = 0.
I Case 2: constraint inactive, λ = 0.
  I Then L(x, λ) = f (x). Hence ∇x L(x, λ) = ∇x f (x) = 0,
    ∇λ L(x, λ) ≡ 0.
I So in both cases we have again a saddle point of the
  Lagrangian.

Lagrange multiplier for inequality constraints (6)

Also in both cases we have λ g(x∗ ) = 0.
I Constraint active: λ > 0, g(x∗ ) = 0.
I Constraint inactive: λ = 0, g(x∗ ) ≠ 0.

This is called the Karush-Kuhn-Tucker (KKT) condition.

Simple example
What are the side lengths of a rectangle that maximize its area,
under the assumption that its perimeter is at most 1?

We need to solve the following optimization problem:

maximize x · y subject to 2x + 2y ≤ 1
Bring the problem into standard form:

    minimize (−x · y) subject to 2x + 2y − 1 ≤ 0

Form the Lagrangian:

    L(x, y, λ) = −xy + λ(2x + 2y − 1)



Simple example (2)


Saddle point conditions / derivatives:

    ∂L/∂x = −y + 2λ = 0
    ∂L/∂y = −x + 2λ = 0
    ∂L/∂λ = 2x + 2y − 1 = 0

Solving this system of three equations gives x = y = 0.25.
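
For comparison, a hedged sketch (an addition) that hands the same problem to scipy's general-purpose SLSQP solver, which is not part of the lecture; the starting point is arbitrary and the bounds x, y ≥ 0 encode that side lengths are non-negative (scipy's 'ineq' constraints use the convention fun(z) ≥ 0):

import numpy as np
from scipy.optimize import minimize

res = minimize(lambda z: -z[0] * z[1],                 # minimize -x*y
               x0=np.array([0.1, 0.1]),
               bounds=[(0, None), (0, None)],
               constraints=[{'type': 'ineq',
                             'fun': lambda z: 1 - 2 * z[0] - 2 * z[1]}],  # 2x + 2y <= 1
               method='SLSQP')
print(res.x)                                           # approximately (0.25, 0.25)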

Simple example (3)


Now we need to see: when does this approach work, when does it not
work, and what can we prove about it?

Lagrangian: formal point of view



Lagrangian and dual: formal definition


Consider the primal optimization problem

minimize f0 (x)
subject to fi (x) ≤ 0 (i = 1, ..., m)
hj (x) = 0 (j = 1, ..., k)

Denote by x∗ a solution of the problem and by p∗ := f0 (x∗ ) the
objective value at the solution.

Lagrangian and dual: formal definition (2)


Define the corresponding Lagrangian as follows:

I For each equality constraint j introduce a new variable νj ∈ R,
  and for each inequality constraint i introduce a new variable
  λi ≥ 0. These variables are called Lagrange multipliers.
I Then define

      L(x, λ, ν) = f0 (x) + Σ_{i=1}^m λi fi (x) + Σ_{j=1}^k νj hj (x)

Define the dual function g : R^m × R^k → R by

    g(λ, ν) = inf_x L(x, λ, ν)

Dual function as lower bound on primal


Proposition 50 (Dual function is concave)
No matter whether the primal problem is convex or not, the dual
function is always concave in (λ, ν).

Proof. For fixed x, L(x, λ, ν) is linear in λ and ν and thus
concave. The dual function, as a pointwise infimum over concave
functions, is concave as well. □

Note that concave is good, because we are going to maximize this
function later on.

Dual function as lower bound on primal (2)


Proposition 51 (Dual function as lower bound on primal)
For all λi ≥ 0 and νj ∈ R we have g(λ, ν) ≤ p∗ .

Proof.
I Let x0 be a feasible point of the primal problem (that is, a
point that satisfies all constraints).
I For such a point we have

      Σ_{i=1}^m λi fi (x0 ) + Σ_{j=1}^k νj hj (x0 ) ≤ 0

  (the λi are ≥ 0, the fi (x0 ) are ≤ 0, and the hj (x0 ) are = 0).

Dual function as lower bound on primal (3)


I This implies

      L(x0 , λ, ν) = f0 (x0 ) + Σ_{i=1}^m λi fi (x0 ) + Σ_{j=1}^k νj hj (x0 ) ≤ f0 (x0 )

  Note that this property holds in particular when x0 is x∗ .
I Moreover, for any x0 (and in particular for x0 := x∗ ) we have

      inf_x L(x, λ, ν) ≤ L(x0 , λ, ν)

I Combining the last two properties gives

      g(λ, ν) = inf_x L(x, λ, ν) ≤ L(x∗ , λ, ν) ≤ f0 (x∗ )

□

Dual optimization problem


Have seen: the dual function provides a lower bound on the primal
value. Finding the highest such lower bound is the task of the dual
problem:

We define the dual optimization problem as

    max_{λ,ν} g(λ, ν) subject to λi ≥ 0, νj ∈ R

Denote the solution of this problem by λ∗ , ν ∗ and the corresponding
objective value d∗ := g(λ∗ , ν ∗ ).
Dual vs Primal, some intuition:

Weak duality
Proposition 52 (Weak duality)
The solution d∗ of the dual problem is always a lower bound for the
solution of the primal problem, that is d∗ ≤ p∗ .
Proof. Follows directly from Proposition 51 above. □

We call the difference p∗ − d∗ the duality gap.



Strong duality
I We say that strong duality holds if p∗ = d∗ .
I This is not always the case, just under particular conditions.
Such conditions are called constraint qualifications in the
optimization literature.
I Convex optimization problems often satisfy strong duality, but
not always.

Strong duality (2)


Examples:
I Linear problems have strong duality
I Quadratic problems have strong duality (→ support vector
  machines)
I There exist many convex problems that do not satisfy strong
  duality. Here is an example:

      minimize_{x,y} exp(−x)
      subject to x²/y ≤ 0
                 y > 0

  One can check that this is a convex problem, yet p∗ = 1 and
  d∗ = 0.

Strong duality: how to convert the solution of the dual to the one of the primal
By strong duality: p∗ = d∗ , that is we get the same objective
values. But how can we recover the primal variables x∗ that lead to
this solution, if we just know the dual variables λ∗ , ν ∗ of the
optimal dual solution?

EXERCISE!

Strong duality implies saddle point


Proposition 53 (Strong duality implies saddle point)
Assume strong duality holds, let x∗ be the solution of the primal
and (λ∗ , ν ∗ ) the solution of the dual optimization problem. Then
(x∗ , λ∗ , ν ∗ ) is a saddle point of the Lagrangian.

Strong duality implies saddle point (3)


Proof.
I We first have to show that x∗ is a minimizer of L(x, λ∗ , ν ∗ ):
I By the strong duality assumption we have f0 (x∗ ) = g(λ∗ , ν ∗ ).
I With this we get

      f0 (x∗ ) = g(λ∗ , ν ∗ ) = inf_x L(x, λ∗ , ν ∗ ) ≤ L(x∗ , λ∗ , ν ∗ ) ≤ f0 (x∗ )

  (the last inequality follows from Proposition 52).
I Because we have the same term on the left and right hand side, we
  have equality everywhere.
I So in particular, inf_x L(x, λ∗ , ν ∗ ) = L(x∗ , λ∗ , ν ∗ ).
I Then we have to show that (λ∗ , ν ∗ ) are maximizers of
  L(x∗ , λ, ν).
I This follows from the definition of (λ∗ , ν ∗ ) as solutions of
  max_{λ,ν} min_x L(x, λ, ν).

Strong duality implies saddle point (4)


I Taken together we get

L(x∗ , λ, ν) ≤ L(x∗ , λ∗ , ν ∗ ) ≤ L(x, λ∗ , ν ∗ )

  That is, (x∗ , λ∗ , ν ∗ ) is a saddle point of the Lagrangian:
  I It is a minimum for x (with fixed λ∗ , ν ∗ ).
  I It is a maximum for (λ, ν) (with fixed x∗ ).
□

Saddle point always implies primal solution


Proposition 54 (Saddle point implies primal solution)
If (x∗ , λ∗ , ν ∗ ) is a saddle point of the Lagrangian, then x∗ is always
a solution of the primal problem.

Proof. Not very difficult, but we skip it. □

Remarks:
I This proposition always holds (not only under strong duality).
I This proposition gives a sufficient condition for optimality.
  Under additional assumptions (constraint qualifications) it is
  also a necessary condition.

Why is this whole approach useful?


I Whenever we have a saddle point of the Lagrangian, we have a
  solution of our constrained optimization problem. This is great,
  because otherwise we would not know how to solve it.
I If strong duality holds, we even know that any solution must
  be a saddle point. So if we don’t find a saddle point, then we
  know that no solution exists.
I If your original minimization problem is not convex, at least its
  dual is a concave maximization problem (or, by changing the
  sign, a convex minimization problem). If the duality gap is
  small, then it might make sense to solve the dual instead of
  the primal (you will not find the optimal solution, but maybe a
  solution that is close).
I As we will see for support vector machines, the Lagrangian
  framework sometimes gives important insights into properties
  of the solution.