
Introduction to Machine Learning

Marc Toussaint
July 11, 2019

This is a direct concatenation and reformatting of all lecture slides and exercises from
the Machine Learning course (summer term 2019, U Stuttgart), including indexing to
help prepare for exams.
Double-starred** sections and slides are not relevant for the exam.

Contents
1 Introduction 4

2 Regression 12
Linear regression (2:3) Features (2:6) Regularization (2:11) Estimator variance (2:12)
Ridge regularization (2:15) Ridge regression (2:15) Cross validation (2:17)
Lasso regularization (2:19) Feature selection** (2:19)
Dual formulation of ridge regression** (2:24) Dual formulation of Lasso regression** (2:25)

3 Classification & Structured Output 23


Structured output and structured input (3:1)

3.1 The discriminative function . . . . . . . . . . . . . . . . . . . . . . . . 23


Discriminative function (3:3)
3.2 Loss functions for classification . . . . . . . . . . . . . . . . . . . . . . 25
Accuracy, Precision, Recall (3:8) Log-likelihood (3:10) Neg-log-likelihood (3:10) Cross
entropy (3:11) one-hot-vector (3:11) Hinge loss (3:12)
3.3 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Logistic regression (multi-class) (3:14) Logistic regression (binary) (3:18)

3.4 Conditional Random Fields** . . . . . . . . . . . . . . . . . . . . . . . 31


Conditional random fields** (3:27)

4 Neural Networks 35
Neural Network function class (4:3) NN loss functions (4:9) NN regularization (4:10)
NN Dropout (4:10) data augmentation (4:11) NN gradient (4:13) NN back propagation (4:13)
Stochastic Gradient Descent (4:16) NN initialization (4:20) Historical Discussion** (4:22)
4.1 Computation Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Computation Graph (4:27) Chain rules (4:28)

4.2 Images & Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


Convolutional NN (4:31) LSTM** (4:36) Gated Recurrent Units** (4:38)

5 Kernelization 50
Kernel trick (5:1) Kernel ridge regression (5:1)

6 Unsupervised Learning 54

6.1 PCA and Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . 54


Principal Component Analysis (PCA) (6:3) Autoencoders (6:10) Independent component analysis** (6:13)
Partial least squares (PLS)** (6:14) Kernel PCA** (6:18) Spectral clustering** (6:22)
Multidimensional scaling** (6:26) ISOMAP** (6:29) Non-negative matrix factorization** (6:32)
Factor analysis** (6:34)
6.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
k-means clustering (6:37) Gaussian mixture model (6:41) Agglomerative Hierarchical
Clustering** (6:43) Centering and whitening** (6:45)

7 Local Learning & Ensemble Learning 72


Local learning and ensemble approaches (7:1)

7.1 Local & lazy learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72


Smoothing kernel (7:3) kNN smoothing kernel (7:4) Epanechnikov quadratic
smoothing kernel (7:4) kd-trees (7:7)
7.2 Combining weak and randomized learners . . . . . . . . . . . . . . . 75
Bootstrapping (7:12) Bagging (7:12) Model averaging (7:14) Weak learners as features
(7:15) Boosting (7:16) AdaBoost** (7:17) Gradient boosting (7:21) Decision trees**
(7:26) Boosting decision trees** (7:31) Random forests** (7:32)

8 Probabilistic Machine Learning 85


Learning as Inference (8:1)

8.1 Bayesian [Kernel] Ridge|Logistic Regression & Gaussian Processes . 85


Bayesian (kernel) ridge regression (8:5) Predictive distribution (8:8) Gaussian Process (8:12)
Bayesian Kernel Ridge Regression (8:12) Bayesian (kernel) logistic regression (8:19)
Gaussian Process Classification (8:22) Bayesian Kernel Logistic Regression (8:22)

8.2 Bayesian Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 92


Bayesian Neural Networks (8:24) Dropout for Uncertainty Prediction (8:25)

8.3 No Free Lunch** . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93


No Free Lunch (8:27)

A Probability Basics 95
Inference: general meaning (9:4)

A.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97


Random variables (9:8) Probability distribution (9:9) Joint distribution (9:10)
Marginal (9:10) Conditional distribution (9:10) Bayes’ Theorem (9:12) Multiple RVs,
conditional independence (9:13) Learning as Bayesian inference (9:19)
A.2 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Bernoulli and Binomial distributions (9:21) Beta (9:22) Multinomial (9:25) Dirichlet
(9:26) Conjugate priors (9:30)
A.3 Distributions over continuous domain . . . . . . . . . . . . . . . . . . 106
Dirac distribution (9:33) Gaussian (9:34) Particle approximation of a distribution (9:38)
Utilities and Decision Theory (9:41) Entropy (9:42) Kullback-Leibler divergence (9:43)
A.4 Monte Carlo methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Monte Carlo methods (9:45) Rejection sampling (9:46) Importance sampling (9:47)
Student’s t, Exponential, Laplace, Chi-squared, Gamma distributions (9:49)

9 Exercises 113
9.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
9.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
9.5 Exercise 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.6 Exercise 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.7 Exercise 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.8 Exercise 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.9 Exercise 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
9.10 Exercise 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.11 Exercise 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
9.12 Exercise 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

Index 138

1 Introduction

What is Machine Learning?

1) A long list of methods/algorithms for different data analysis problems


– in sciences
– in commerce

2) Frameworks to develop your own learning algorithm/method

3) Machine Learning = model formulation + optimization


1:1

What is Machine Learning?


• Pedro Domingos: A Few Useful Things to Know about Machine Learning

learning = representation + evaluation + optimization

• “Representation”: Choice of model, choice of hypothesis space


• “Evaluation”: Choice of objective function, optimality principle
Notes: The prior is both a choice of representation and, usually, a part of the objective function.
In Bayesian settings, the choice of model often directly implies the “objective function”.
• “Optimization”: The algorithm to compute/approximate the best model
1:2

Pedro Domingos: A Few Useful Things to Know about Machine Learning


• It’s generalization that counts
– Data alone is not enough
– Overfitting has many faces
– Intuition fails in high dimensions
– Theoretical guarantees are not what they seem
• Feature engineering is the key
• More data beats a cleverer algorithm
• Learn many models, not just one
• Simplicity does not imply accuracy
• Representable does not imply learnable

• Correlation does not imply causation


1:3

What is Machine Learning?


• In large parts, ML is: (let’s call this ML0)
Fitting a function f : x ↦ y to given data D = {(x_i, y_i)}_{i=1}^n

• Why is function fitting so omnipresent?


– Dynamics, behavior, decisions, control, predictions – are all about functions
– Thinking? (i.e., planning, optimization, (logical) inference, CSP solving, etc?)
– In the latter case, algorithms often provide the scaffolding, ML the heuristics to
accelerate/scale them. (E.g., an evaluation function within a MCTS planning algo-
rithm.)
1:4

ML0 objective: Empirical Risk Minimization


• We have a hypothesis space H of functions f : x ↦ y
In a standard parametric case H = {f_θ | θ ∈ R^n} are functions f_θ : x ↦ y that are described by n parameters θ ∈ R^n

• Given data D = {(x_i, y_i)}_{i=1}^n, the standard objective is to minimize the “error” on the data

f* = argmin_{f∈H} Σ_{i=1}^n ℓ(f(x_i), y_i) ,

where ℓ(ŷ, y) > 0 penalizes a discrepancy between a model output ŷ and the data y.
– Squared error ℓ(ŷ, y) = (ŷ − y)²
– Classification error ℓ(ŷ, y) = [ŷ ≠ y]
– neg-log likelihood ℓ(ŷ, y) = − log p(y | ŷ)
– etc
1:5

What is Machine Learning beyond ML0 ?


• Fitting more structured models to data, which includes
– Time series, recurrent processes
– Graphical Models
– Unsupervised learning (semi-supervised learning)
...but in all these cases, the scenario is still not interactive, the data D is static, the de-
cision is about picking a single model f from a hypothesis space, and the objective is a
loss based on f and D only.

• Active Learning, where the “ML agent” makes decisions about what data label
to query next
• Bandits, Reinforcement Learning, manipulating the domain (and thereby data
source)
1:6

Machine Learning is everywhere

NSA, Amazon, Google, Zalando, Trading, ...


Chemistry, Biology, Physics, ...
Control, Operations Research, Scheduling, ...
1:7

Face recognition

eigenfaces

keypoints

(e.g., Viola & Jones)


1:8

Hand-written digit recognition (US postal data)

(e.g., Yann LeCun)


1:9

Gene annotation

(e.g., Gunnar Rätsch, mGene Project)


1:10

Speech recognition

1:11

Spam filters

1:12

Machine Learning became an important technology


in science as well

(Stuttgart Cluster of Excellence “Data-integrated Simulation Science (SimTech)”)


1:13

Organization of this lecture

See TOC of last year’s slide collection

• Part 1: The Core: Regression & Classification


• Part 2: The Breadth of ML methods
• Part 3: Bayesian Methods
1:14

Is this a theoretical or practical course?

Neither alone.

• The goal is to teach how to design good learning algorithms


data

modelling [requires theory & practice]

algorithms [requires practice & theory]

testing, problem identification, restart
1:15

How much math do you need?

• Let L(x) = ||y − Ax||². What is

argmin_x L(x)

• Find

min_x ||y − Ax||²   s.t.   x_i ≤ 1

• Given a discriminative function f(x, y) we define

p(y | x) = e^{f(y,x)} / Σ_{y'} e^{f(y',x)}

• Let A be the covariance matrix of a Gaussian. What does the Singular Value Decomposition A = V D Vᵀ tell us?
1:16

How much coding do you need?


• A core subject of this lecture: learning to go from principles (math) to code

• Many exercises will implement algorithms we derived in the lecture and collect
experience on small data sets

• Choice of language is fully free. I support C++; tutors might prefer Python; Octave/Matlab or R is also a good choice.
1:17

Books

Trevor Hastie, Robert Tibshirani and Jerome Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Second Edition, 2009.
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
(recommended: read introductory chapter)

(this course will not go to the full depth in math of Hastie et al.)
1:18

Books
Bishop, C. M.: Pattern Recognition and Machine Learning.
Springer, 2006
http://research.microsoft.com/en-us/um/people/cmbishop/prml/
(some chapters are fully online)

1:19

Books & Readings

• more recently:
– David Barber: Bayesian Reasoning and Machine Learning
– Kevin Murphy: Machine learning: a Probabilistic Perspective

• See the readings at the bottom of:


http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/index.html#readings
1:20

Organization
• Course Webpage:
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/19-MachineLearning/
– Slides, Exercises & Software (C++)
– Links to books and other resources
• Admin things, please first ask:
Carola Stahl, [email protected], Room 2.217

• Rules for the tutorials:


– Doing the exercises is crucial!
– Only “Votieraufgaben” (voting exercises). At the beginning of each tutorial:
– sign into a list
– vote on exercises you have (successfully) worked on
– Students are randomly selected to present their solutions
– You need 50% of completed exercises to be allowed to the exam
– Please check 2 weeks before the end of the term whether you can take the exam

1:21

2 Regression

[plots: a 1D regression fit (‘train’ data and ‘model’ curve) and a 2D classification example with its decision boundary]
• Are these linear models? Linear in what?


– Input: No.
– Parameters, features: Yes!
2:1

Linear Modelling is more powerful than it might seem at first!

• Linear Regression on non-linear features → very powerful (polynomials, piece-wise,


spline basis, kernels)
• Regularization (Ridge, Lasso) & cross-validation for proper generalization to test data
• Gaussian Processes and SVMs are closely related (linear in kernel features, but with
different optimality criteria)
• Liquid/Echo State Machines, Extreme Learning, are examples of linear modelling on
many (sort of random) non-linear features
• Basic insights in model complexity (effective degrees of freedom)
• Input relevance estimation (z-score) and feature selection (Lasso)
• Linear regression → linear classification (logistic regression: outputs are likelihood ra-
tios)

⇒ Linear modelling is a core of ML


(We roughly follow Hastie, Tibshirani, Friedman: Elements of Statistical Learning)
2:2

Linear Regression

• Notation:
– input vector x ∈ R^d
– output value y ∈ R
– parameters β = (β_0, β_1, .., β_d)ᵀ ∈ R^{d+1}
– linear model
f(x) = β_0 + Σ_{j=1}^d β_j x_j

• Given training data D = {(x_i, y_i)}_{i=1}^n we define the least squares cost (or “loss”)
L^ls(β) = Σ_{i=1}^n (y_i − f(x_i))²

2:3

Optimal parameters β

• Augment the input vector with a 1 in front: x̄ = (1, x) = (1, x_1, .., x_d)ᵀ ∈ R^{d+1}
β = (β_0, β_1, .., β_d)ᵀ ∈ R^{d+1}
f(x) = β_0 + Σ_{j=1}^d β_j x_j = x̄ᵀβ

• Rewrite the sum of squares as:
L^ls(β) = Σ_{i=1}^n (y_i − x̄_iᵀβ)² = ||y − Xβ||²

where X ∈ R^{n×(d+1)} has rows x̄_iᵀ = (1, x_{i,1}, x_{i,2}, ··· , x_{i,d}) and y = (y_1, .., y_n)ᵀ

• Optimum:
0_dᵀ = ∂L^ls(β)/∂β = −2(y − Xβ)ᵀX   ⟺   0_d = XᵀXβ − Xᵀy
β̂^ls = (XᵀX)⁻¹Xᵀy

2:4
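A minimal NumPy sketch of this closed-form least-squares solution (variable names and the toy data are illustrative, not the lecture's own demo code):

import numpy as np

def least_squares(X, y):
    # beta_ls = (X^T X)^-1 X^T y, computed via a linear solve for numerical stability
    return np.linalg.solve(X.T @ X, X.T @ y)

# usage sketch: 1D inputs with the augmented bias feature x_bar = (1, x)
x = np.linspace(-3, 3, 20)
y = 0.5 * x - 1 + 0.1 * np.random.randn(20)
X = np.stack([np.ones_like(x), x], axis=1)
beta = least_squares(X, y)   # approximately (-1, 0.5)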

[plots: linear regression fits on two example data sets (‘train’ data and ‘model’ line)]

./x.exe -mode 1 -dataFeatureType 1 -modelFeatureType 1


2:5

Non-linear features
• Replace the inputs x_i ∈ R^d by some non-linear features φ(x_i) ∈ R^k
f(x) = Σ_{j=1}^k φ_j(x) β_j = φ(x)ᵀβ

• The optimal β is the same

β̂^ls = (XᵀX)⁻¹Xᵀy   but with   X = (φ(x_1)ᵀ; ..; φ(x_n)ᵀ) ∈ R^{n×k}

• What are “features”?


a) Features are an arbitrary set of basis functions
b) Any function linear in β can be written as f (x) = φ(x)>β
for some φ, which we denote as “features”
2:6

Example: Polynomial features


• Linear: φ(x) = (1, x_1, .., x_d) ∈ R^{1+d}
• Quadratic: φ(x) = (1, x_1, .., x_d, x_1², x_1x_2, x_1x_3, .., x_d²) ∈ R^{1+d+d(d+1)/2}
• Cubic: φ(x) = (.., x_1³, x_1²x_2, x_1²x_3, .., x_d³) ∈ R^{1+d+d(d+1)/2+d(d+1)(d+2)/6}

[diagram: the input x = (x_1, .., x_d) is mapped through the features φ(x) = (1, x_1, .., x_d, x_1², x_1x_2, ..) and weights β to the output f(x) = φ(x)ᵀβ]

./x.exe -mode 1 -dataFeatureType 1 -modelFeatureType 1


2:7

Example: Piece-wise features (in 1D)


• Piece-wise constant: φ_j(x) = [ξ_j < x ≤ ξ_{j+1}]
• Piece-wise linear: φ_j(x) = (1, x)ᵀ [ξ_j < x ≤ ξ_{j+1}]
• Continuous piece-wise linear: φ_j(x) = [x − ξ_j]_+   (and φ_0(x) = x)

2:8

Example: Radial Basis Functions (RBF)


• Given a set of centers {c_j}_{j=1}^k, define

φ_j(x) = b(x, c_j) = e^{−½ ||x − c_j||²} ∈ [0, 1]

Each φ_j(x) measures similarity with the center c_j

• Special case:
use all training inputs {x_i}_{i=1}^n as centers

φ(x) = (1, b(x, x_1), .., b(x, x_n))ᵀ   (n + 1 dim)

This is related to “kernel methods” and GPs, but not quite the same—we’ll discuss this later.
2:9
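A small sketch of such an RBF feature map in NumPy (the width parameter mirrors the -rbfWidth option of the demo and is an added assumption; width = 1 recovers the formula above):

import numpy as np

def rbf_features(X, centers, width=1.0):
    # phi_j(x) = exp(-||x - c_j||^2 / (2 width^2)), plus a constant 1 feature
    sqdist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    Phi = np.exp(-0.5 * sqdist / width**2)
    return np.concatenate([np.ones((X.shape[0], 1)), Phi], axis=1)

# special case from the slide: use all training inputs as centers
X_train = np.random.randn(40, 1)
Phi = rbf_features(X_train, centers=X_train, width=0.1)   # shape (40, 41)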

Features
• Polynomial
• Piece-wise
• Radial basis functions (RBF)
• Splines (see Hastie Ch. 5)

• Linear regression on top of rich features is extremely powerful!


2:10

The need for regularization

Noisy sin data fitted with radial basis functions:
[plot: noisy ‘z.train’ data and the overfitted ‘z.model’ RBF fit]

./x.exe -mode 1 -n 40 -modelFeatureType 4 -dataType 2 -rbfWidth .1 -sigma .5 -lambda 1e-10

• Overfitting & generalization:


The model overfits to the data—and generalizes badly

• Estimator variance:
When you repeat the experiment (keeping the underlying function fixed), the
regression always returns a different model estimate
2:11

Estimator variance
• Assumption:
– The data was noisy with variance Var{y} = σ²I_n

• We computed parameters β̂ = (XᵀX)⁻¹Xᵀy, therefore

Var{β̂} = (XᵀX)⁻¹σ²

– high data noise σ → high estimator variance
– more data → less estimator variance: Var{β̂} ∝ 1/n

• In practice we don’t know σ, but we can estimate it based on the deviation from the learnt model: (with k = dim(β) = dim(φ))

σ̂² = 1/(n − k) Σ_{i=1}^n (y_i − f(x_i))²

2:12

Estimator variance
• “Overfitting”
– picking one specific data set y ∼ N(y_mean, σ²I_n)
↔ picking one specific β̂ ∼ N(β_mean, (XᵀX)⁻¹σ²)

• If we could reduce the variance of the estimator, we could reduce overfitting—


and increase generalization.
2:13

Hastie’s section on shrinkage methods is great! Describes several ideas on re-


ducing estimator variance by reducing model complexity. We focus on regu-
larization.
2:14

Ridge regression: L2 -regularization


• We add a regularization to the cost:
L^ridge(β) = Σ_{i=1}^n (y_i − φ(x_i)ᵀβ)² + λ Σ_{j=2}^k β_j²

NOTE: β_1 is usually not regularized!

• Optimum:
β̂^ridge = (XᵀX + λI)⁻¹Xᵀy

(where I = I_k, or with I_{1,1} = 0 if β_1 is not regularized)


2:15
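A sketch of this ridge solution on top of precomputed features Phi (assuming, as noted above, that the first feature is the constant 1 and is left unregularized):

import numpy as np

def ridge(Phi, y, lam):
    k = Phi.shape[1]
    I = np.eye(k)
    I[0, 0] = 0.0   # do not regularize the bias weight
    return np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ y)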

• The objective is now composed of two “potentials”: The loss, which depends
on the data and jumps around (introduces variance), and the regularization
penalty (sitting steadily at zero). Both are “pulling” at the optimal β → the
regularization reduces variance.

• The estimator variance becomes less: Var{β̂} = (X>X + λI)-1 σ 2

• The ridge effectively reduces the complexity of the model:

df(λ) = Σ_{j=1}^d d_j²/(d_j² + λ)

where d_j² are the eigenvalues of XᵀX = V D² Vᵀ

(details: Hastie 3.4.1)
2:16

Choosing λ: generalization error & cross validation

• λ = 0 will always have a lower training data error


We need to estimate the generalization error on test data

• k-fold cross-validation:

D1 D2 ··· Di ··· Dk

1: Partition data D in k equal sized subsets D = {D_1, .., D_k}
2: for i = 1, .., k do
3:   compute β̂_i on the training data D \ D_i, leaving out D_i
4:   compute the error ℓ_i = L^ls(β̂_i, D_i)/|D_i| on the validation data D_i
5: end for
6: report mean squared error ℓ̂ = 1/k Σ_i ℓ_i and variance 1/(k−1)[(Σ_i ℓ_i²) − k ℓ̂²]

• Choose λ for which ℓ̂ is smallest


2:17
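A direct sketch of this k-fold procedure, reusing the ridge() sketch from above (fold assignment and error metric follow steps 1–6):

import numpy as np

def cross_validate(Phi, y, lam, k=10):
    n = Phi.shape[0]
    folds = np.array_split(np.random.permutation(n), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.setdiff1d(np.arange(n), val)
        beta = ridge(Phi[train], y[train], lam)                   # fit on D \ D_i
        errors.append(np.mean((y[val] - Phi[val] @ beta) ** 2))   # error on D_i
    return np.mean(errors), np.var(errors, ddof=1)

# choose the lambda with the smallest mean CV error:
# best_lam = min(lambdas, key=lambda l: cross_validate(Phi, y, l)[0])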

quadratic features on sine data:

[plot: cross-validation error and training error (mean squared error) as a function of the regularization parameter lambda]

./x.exe -mode 4 -n 10 -modelFeatureType 2 -dataType 2 -sigma .1


./x.exe -mode 1 -n 10 -modelFeatureType 2 -dataType 2 -sigma .1

2:18

Lasso: L1 -regularization

• We add an L1 regularization to the cost:

L^lasso(β) = Σ_{i=1}^n (y_i − φ(x_i)ᵀβ)² + λ Σ_{j=2}^k |β_j|

NOTE: β_1 is usually not regularized!

• Has no closed form expression for optimum

(Optimum can be found by solving a quadratic program; see appendix.)

2:19

Lasso vs. Ridge:



• Lasso → sparsity! feature selection!


2:20

L^q(β) = Σ_{i=1}^n (y_i − φ(x_i)ᵀβ)² + λ Σ_{j=2}^k |β_j|^q

• Subset selection: q = 0 (counting the number of β_j ≠ 0)


2:21

Summary
• Representation: choice of features

f (x) = φ(x)>β

• Objective: squared error + Ridge/Lasso regularization


L^ridge(β) = Σ_{i=1}^n (y_i − φ(x_i)ᵀβ)² + λ||β||²_I

• Solver: analytical (or quadratic program for Lasso)

β̂ ridge = (X>X + λI)-1 X>y

2:22

Summary
• Linear models on non-linear features—extremely powerful

[diagram: feature choices (linear, polynomial, piece-wise linear, RBF, kernel) combined with methods (Ridge regression, Lasso, classification*)]

*logistic regression

• Generalization ↔ Regularization ↔ complexity/DoF penalty

• Cross validation to estimate generalization empirically → use to choose regu-


larization parameters
2:23

Appendix: Dual formulation of Ridge


• The standard way to write the Ridge regularization:
L^ridge(β) = Σ_{i=1}^n (y_i − φ(x_i)ᵀβ)² + λ Σ_{j=2}^k β_j²

• Finding β̂^ridge = argmin_β L^ridge(β) is equivalent to solving

β̂^ridge = argmin_β Σ_{i=1}^n (y_i − φ(x_i)ᵀβ)²
subject to Σ_{j=2}^k β_j² ≤ t

λ is the Lagrange multiplier for the inequality constraint


2:24

Appendix: Dual formulation of Lasso


• The standard way to write the Lasso regularization:
L^lasso(β) = Σ_{i=1}^n (y_i − φ(x_i)ᵀβ)² + λ Σ_{j=2}^k |β_j|

• Equivalent formulation (via KKT):

β̂^lasso = argmin_β Σ_{i=1}^n (y_i − φ(x_i)ᵀβ)²
subject to Σ_{j=2}^k |β_j| ≤ t

• Decreasing t is called “shrinkage”: The space of allowed β shrinks. Some β


will become zero → feature selection
2:25

3 Classification & Structured Output

Structured Output & Structured Input


• regression:
R^d → R

• structured output:
R^d → binary class label {0, 1}
R^d → integer class label {1, 2, .., M}
R^d → sequence labelling y_{1:T}
R^d → image labelling y_{1:W,1:H}
R^d → graph labelling y_{1:N}

• structured input:
relational database → R
labelled graph/sequence → R

3:1

3.1 The discriminative function


3:2

Discriminative Function
• Represent a discrete-valued function F : Rd → Y via a discriminative func-
tion
f : Rd × Y → R
such that
F : x ↦ argmax_y f(x, y)
That is, a discriminative function f(x, y) maps an input x to an output
ŷ(x) = argmax_y f(x, y)

• A discriminative function f (x, y) has high value if y is a correct answer to x;


and low value if y is a false answer
• In that way a discriminative function discriminates correct labelling from wrong
ones
3:3

Example Discriminative Function

• Input: x ∈ R2 ; output y ∈ {1, 2, 3}


displayed are p(y = 1|x), p(y = 2|x), p(y = 3|x)

[3D plot of the three functions p(y = 1|x), p(y = 2|x), p(y = 3|x) over the 2D input space]
(here already “scaled” to the interval [0,1]... explained later)

• You can think of f (x, y) as M separate functions, one for each class y ∈ {1, .., M }. The highest one
determines the class prediction ŷ
• More examples: plot[-3:3] -x-2,0,x-2 splot[-3:3][-3:3] -x-y-2,0,x+y-2
3:4

How could we parameterize a discriminative function?


• Linear in features!
– Same features, different parameters for each output: f (x, y) = φ(x)>βy
– More general input-output features: f (x, y) = φ(x, y)>β

• Example for joint features: Let x ∈ R and y ∈ {1, 2, 3}; φ might be

φ(x, y) = (1 [y=1], x [y=1], x² [y=1],  1 [y=2], x [y=2], x² [y=2],  1 [y=3], x [y=3], x² [y=3])ᵀ ,

which is equivalent to f(x, y) = (1, x, x²)ᵀ β_y

• Example when both x, y ∈ {0, 1} are discrete:

φ(x, y) = (1, [x=0][y=0], [x=0][y=1], [x=1][y=0], [x=1][y=1])ᵀ

3:5

Notes on features
• Features “connect” input and output. Each φj (x, y) allows f to capture a cer-
tain dependence between x and y
• If both x and y are discrete, a feature φj (x, y) is typically a joint indicator func-
tion (logical function), indicating a certain “event”
• Each weight βj mirrors how important/frequent/infrequent a certain depen-
dence described by φj (x, y) is
• −f(x, y) is also called energy, and this approach is also called energy-based modelling, esp. in neural modelling
3:6

3.2 Loss functions for classification


3:7

What is a good objective to train a classifier?


• Accuracy, Precision & Recall:
For data size n, false positives (FP), true positives (TP), we define:

– accuracy = (TP+TN)/n
– precision = TP/(TP+FP)   (TP+FP = classifier positives)
– recall (TP-rate) = TP/(TP+FN)   (TP+FN = data positives)
– FP-rate = FP/(FP+TN)   (FP+TN = data negatives)

• Such metrics would be our actual objective. But they are not differentiable.
For the purpose of ML, we need to define a “proxy” objective that is nice to
optimize.

• Bad idea: Squared error regression


3:8

Bad idea: Squared error regression of class indicators


• Train f (x, y) to be the indicator function for class y
that is, ∀y : train f (x, y) on the regression data D = {(xi , I(y = yi ))}n
i=1 :
train f (x, 1) on value 1 for all xi with yi = 1 and on 0 otherwise
train f (x, 2) on value 1 for all xi with yi = 2 and on 0 otherwise
train f (x, 3) on value 1 for all xi with yi = 3 and on 0 otherwise
...

• This typically fails: (see also Hastie 4.2)

Although the optimal separating boundaries are linear and linear discriminating func-
tions could represent them, the linear functions trained on class indicators fail to dis-
criminate.
→ squared error regression on class indicators is the “wrong objective”
3:9

Log-Likelihood
• The discriminative function f(y, x) not only defines the class prediction F(x); we can additionally also define probabilities,

p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}

• Maximizing Log-Likelihood: (minimize neg-log-likelihood, nll)

L^nll(β) = − Σ_{i=1}^n log p(y_i | x_i)
3:10

Cross Entropy
• This is the same as log-likelihood for categorical data, just a notational trick,
really.
• The categorical data y_i ∈ {1, .., M} are class labels. But assume they are encoded in a one-hot-vector

ŷ_i = e_{y_i} = (0, .., 0, 1, 0, .., 0) ,   ŷ_{iz} = [y_i = z]

Then we can write the neg-log-likelihood as

L^nll(β) = − Σ_{i=1}^n Σ_{z=1}^M ŷ_{iz} log p(z | x_i) = Σ_{i=1}^n H(ŷ_i, p(· | x_i))

where H(p, q) = − Σ_z p(z) log q(z) is the so-called cross entropy between two normalized multinomial distributions p and q.

• As a side note, the cross entropy measure would also work if the target ŷi are
probabilities instead of one-hot-vectors.
3:11
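A tiny sketch of this loss, given class probabilities P (rows sum to 1) and one-hot targets Y (the small epsilon is only for numerical safety; the example numbers are illustrative):

import numpy as np

def neg_log_likelihood(Y_onehot, P):
    # L_nll = - sum_i sum_z yhat_iz * log p(z | x_i)
    return -np.sum(Y_onehot * np.log(P + 1e-12))

Y = np.array([[1, 0], [0, 1], [1, 0]])               # 3 data points, M = 2 classes
P = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
print(neg_log_likelihood(Y, P))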

Hinge loss
• For a data point (x, y*), the one-vs-all hinge loss “wants” that f(y*, x) is larger than any other f(y, x), y ≠ y*, by a margin of 1.
In other terms, it penalizes when f(y*, x) < f(y, x) + 1, y ≠ y*.
• It penalizes linearly, therefore the one-vs-all hinge loss is defined as

L^hinge(f) = Σ_{y≠y*} [1 − (f(y*, x) − f(y, x))]_+

• This is related to Support Vector Machines (only data points inside the margin
induce an error and gradient), and also to the Perceptron Algorithm
3:12

3.3 Logistic regression


3:13

Logistic regression: Multi-class case


• Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {1, .., M}
• We choose f(x, y) = φ(x)ᵀβ_y with separate parameters β_y for each y
• Conditional class probabilities

p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}   ↔   f(x, y) − f(x, z) = log [p(y | x) / p(z | x)]

(the discriminative functions model “log-ratios”)

• Given data D = {(x_i, y_i)}_{i=1}^n, we minimize the regularized neg-log-likelihood

L^logistic(β) = − Σ_{i=1}^n log p(y_i | x_i) + λ||β||²

Written as cross entropy (with one-hot encoding ŷ_{iz} = [y_i = z]):

L^logistic(β) = − Σ_{i=1}^n Σ_{z=1}^M [y_i = z] log p(z | x_i) + λ||β||²

3:14

Optimal parameters β
• Gradient:

(∂L^logistic(β)/∂β_c)ᵀ = Σ_{i=1}^n (p_{ic} − y_{ic}) φ(x_i) + 2λIβ_c = Xᵀ(p_c − y_c) + 2λIβ_c

where p_{ic} = p(y = c | x_i)

which is non-linear in β ⇒ ∂_β L = 0 does not have an analytic solution

• Hessian:

H = ∂²L^logistic(β)/(∂β_c ∂β_d) = XᵀW_{cd}X + 2[c = d] λI

where W_{cd} is diagonal with W_{cd,ii} = p_{ic}([c = d] − p_{id})

• Newton algorithm: iterate

β ← β − H⁻¹ (∂L^logistic(β)/∂β)ᵀ
3:15

polynomial (quadratic) ridge 3-class logistic regression:

[plot: training data, p = 0.5 decision boundaries, and class probabilities of the 3-class logistic regression]

./x.exe -mode 3 -d 2 -n 200 -modelFeatureType 3 -lambda 1e+1


3:16

• Note, if we have M discriminative functions f (x, y), w.l.o.g., we can always


choose one of them to be constantly zero. E.g.,

f (x, M ) ≡ 0 or βM ≡ 0
The other functions then have to be greater/less relative to this baseline.

• This is usually not done in the multi-class case, but almost always in the binary
case.
3:17

Logistic regression: Binary case


• In the binary case, we have “two functions” f(x, 0) and f(x, 1). W.l.o.g. we may fix f(x, 0) ≡ 0. Therefore we choose features

φ(x, y) = φ(x) [y = 1]

with arbitrary input features φ(x) ∈ R^k

• We have

f(x, 1) = φ(x)ᵀβ ,   ŷ = argmax_y f(x, y) = { 1 if φ(x)ᵀβ > 0,  0 otherwise }

• and conditional class probabilities

p(1 | x) = e^{f(x,1)} / (e^{f(x,0)} + e^{f(x,1)}) = σ(f(x, 1))

with the logistic sigmoid function σ(z) = e^z/(1 + e^z) = 1/(e^{−z} + 1).
[plot of the sigmoid σ(z) over z ∈ [−10, 10]]

• Given data D = {(x_i, y_i)}_{i=1}^n, we minimize the regularized neg-log-likelihood

L^logistic(β) = − Σ_{i=1}^n log p(y_i | x_i) + λ||β||²
             = − Σ_{i=1}^n [ y_i log p(1 | x_i) + (1 − y_i) log[1 − p(1 | x_i)] ] + λ||β||²
3:18

Optimal parameters β
• Gradient (see exercises):

(∂L^logistic(β)/∂β)ᵀ = Σ_{i=1}^n (p_i − y_i) φ(x_i) + 2λIβ = Xᵀ(p − y) + 2λIβ

where p_i := p(y = 1 | x_i) ,   X = (φ(x_1)ᵀ; ..; φ(x_n)ᵀ) ∈ R^{n×k}

• Hessian H = ∂²L^logistic(β)/∂β² = XᵀWX + 2λI
W = diag(p ∘ (1 − p)), that is, diagonal with W_ii = p_i(1 − p_i)

• Newton algorithm: iterate

β ← β − H⁻¹ (∂L^logistic(β)/∂β)ᵀ
3:19
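A compact sketch of these Newton iterates for binary logistic regression (X has rows φ(x_i)ᵀ; the fixed iteration count is an illustrative simplification):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, lam, iters=20):
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(iters):
        p = sigmoid(X @ beta)                    # p_i = p(y=1 | x_i)
        grad = X.T @ (p - y) + 2 * lam * beta
        H = X.T @ np.diag(p * (1 - p)) @ X + 2 * lam * np.eye(k)
        beta = beta - np.linalg.solve(H, grad)   # Newton step
    return beta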

polynomial (cubic) ridge logistic regression:

[plot: training data, decision boundary, and discriminative function of the cubic ridge logistic regression]

./x.exe -mode 2 -d 2 -n 200 -modelFeatureType 3 -lambda 1e+0


3:20

RBF ridge logistic regression:

[plot: training data, decision boundary, and discriminative function of the RBF ridge logistic regression]

./x.exe -mode 2 -d 2 -n 200 -modelFeatureType 4 -lambda 1e+0 -rbfBias 0 -rbfWidth .2
3:21

Recap
ridge regression vs. logistic regression:

REPRESENTATION:  f(x) = φ(x)ᵀβ   vs.   f(x, y) = φ(x, y)ᵀβ

OBJECTIVE:  L^ls(β) = Σ_{i=1}^n (y_i − f(x_i))² + λ||β||²_I   vs.   L^logistic(β) = − Σ_{i=1}^n log p(y_i | x_i) + λ||β||²_I ,  with p(y | x) ∝ e^{f(x,y)}

SOLVER:  β̂^ridge = (XᵀX + λI)⁻¹Xᵀy   vs.   (binary case) β ← β − (XᵀWX + 2λI)⁻¹ (Xᵀ(p − y) + 2λIβ)

3:22

3.4 Conditional Random Fields**

3:23

Examples for Structured Output

• Text tagging
X = sentence
Y = tagging of each word
http://sourceforge.net/projects/crftagger

• Image segmentation
X = image
Y = labelling of each pixel
http://scholar.google.com/scholar?cluster=13447702299042713582

• Depth estimation
X = single image
Y = depth map
http://make3d.cs.cornell.edu/

3:24

CRFs in image processing

3:25

CRFs in image processing


• Google “conditional random field image”
– Multiscale Conditional Random Fields for Image Labeling (CVPR 2004)
– Scale-Invariant Contour Completion Using Conditional Random Fields (ICCV 2005)
– Conditional Random Fields for Object Recognition (NIPS 2004)
– Image Modeling using Tree Structured Conditional Random Fields (IJCAI 2007)
– A Conditional Random Field Model for Video Super-resolution (ICPR 2006)
3:26

Conditional Random Fields (CRFs)


• CRFs are a generalization of logistic binary and multi-class classification
• The output y may be an arbitrary (usually discrete) thing (e.g., sequence/image/graph
labelling)
• Hopefully we can maximize efficiently

argmax_y f(x, y)

over the output!

→ f (x, y) should be structured in y so this optimization is efficient.

• The name CRF describes that p(y|x) ∝ ef (x,y) defines a probability distribution
(a.k.a. random field) over the output y conditional to the input x. The word
“field” usually means that this distribution is structured (a graphical model;
see later part of lecture).
3:27

CRFs: the structure is in the features


• Assume y = (y1 , .., yl ) is a tuple of individual (local) discrete labels
• We can assume that f(x, y) is linear in features

f(x, y) = Σ_{j=1}^k φ_j(x, y_{∂j}) β_j = φ(x, y)ᵀβ

where each feature φj (x, y∂j ) depends only on a subset y∂j of labels. φj (x, y∂j )
effectively couples the labels y∂j . Then ef (x,y) is a factor graph.
3:28

Example: pair-wise coupled pixel labels


[figure: a W×H grid of pixel labels y_{11}, .., y_{HW} with pairwise couplings between neighboring labels, and each label y_{ij} also coupled to its local observation x_{ij}]

• Each black box corresponds to features φj (y∂j ) which couple neighboring pixel
labels y∂j
• Each gray box corresponds to features φj (xj , yj ) which couple a local pixel
observation xj with a pixel label yj
3:29

CRFs: Core equations

f(x, y) = φ(x, y)ᵀβ

p(y|x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')} = e^{f(x,y) − Z(x,β)}

Z(x, β) = log Σ_{y'} e^{f(x,y')}   (log partition function)

L(β) = − Σ_i log p(y_i|x_i) = − Σ_i [f(x_i, y_i) − Z(x_i, β)]

∇Z(x, β) = Σ_y p(y|x) ∇f(x, y)

∇²Z(x, β) = Σ_y p(y|x) ∇f(x, y) ∇f(x, y)ᵀ − ∇Z ∇Zᵀ

• This gives the neg-log-likelihood L(β), its gradient and Hessian


3:30

Training CRFs
• Maximize conditional likelihood
But Hessian is typically too large (Images: ∼10 000 pixels, ∼50 000 features)
If f (x, y) has a chain structure over y, the Hessian is usually banded → computation
time linear in chain length

Alternative: Efficient gradient method, e.g.:


Vishwanathan et al.: Accelerated Training of Conditional Random Fields with Stochastic Gradient
Methods

• Other loss variants, e.g., hinge loss as with Support Vector Machines
(“Structured output SVMs”)

• Perceptron algorithm: Minimizes hinge loss using a gradient method


3:31

4 Neural Networks
Outline
• Model, Objective, Solver:
– How do NNs represent a function f (x), or discriminative function f (y, x)?
– What are objectives? (standard objectives, different regularizations)
– How are they trained? (Initialization, SGD)
• Computation Graphs & Chain Rules
• Images & Sequences
– CNNs
– LSTMs & GRUs
– Complex architectures (e.g. Mask-RCNN, dense pose prediction, etc)
4:1

Neural Network models


• NNs are a parameterized function fβ : Rd 7→ RM
– β are called weights
– Given a data set D = {(x_i, y_i)}_{i=1}^n, we minimize some loss

β* = argmin_β Σ_{i=1}^n ℓ(f_β(x_i), y_i) + regularization

• In that sense, they just replace our previous model assumption f(x) = φ(x)ᵀβ; the rest is “in principle” the same
4:2

Neural Network models


• A (feed-forward) NN R^{h_0} → R^{h_L} with L layers, each h_l-dimensional, defines the function
1-layer: f_β(x) = W_1 x + b_1 ,   W_1 ∈ R^{h_1×h_0}, b_1 ∈ R^{h_1}
2-layer: f_β(x) = W_2 σ(W_1 x + b_1) + b_2 ,   W_l ∈ R^{h_l×h_{l-1}}, b_l ∈ R^{h_l}
L-layer: f_β(x) = W_L σ(··· σ(W_1 x + b_1) ···) + b_L
• The parameter β = (W_{1:L}, b_{1:L}) is the collection of all weights W_l ∈ R^{h_l×h_{l-1}} and biases b_l ∈ R^{h_l}
• To describe the mapping as an iteration, we introduce notation for the intermediate values:
– the input to layer l is z_l = W_l x_{l-1} + b_l ∈ R^{h_l}
– the activation of layer l is x_l = σ(z_l) ∈ R^{h_l}
Then the L-layer NN model can be computed using the forward propagation:
∀ l = 1,..,L-1:  z_l = W_l x_{l-1} + b_l ,   x_l = σ(z_l)
where x_0 ≡ x is the input, and f_β(x) ≡ z_L the output
4:3
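A plain NumPy sketch of this forward propagation with ReLU activations (shapes follow the slide; the random initialization and network sizes are only illustrative):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, Ws, bs):
    # Ws = [W_1, .., W_L], bs = [b_1, .., b_L]; returns the output z_L
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = relu(W @ x + b)          # x_l = sigma(z_l)
    return Ws[-1] @ x + bs[-1]       # the last layer stays linear: f_beta(x) = z_L

# a 2-layer network R^3 -> R^2 with h_1 = 10 hidden units
Ws = [np.random.randn(10, 3) / np.sqrt(3), np.random.randn(2, 10) / np.sqrt(10)]
bs = [np.zeros(10), np.zeros(2)]
out = forward(np.array([1.0, -0.5, 2.0]), Ws, bs)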

Neural Network models

• The activation function σ(z) is applied element-wise,

rectified linear unit (ReLU):  σ(z) = [z]_+ = max{0, z} = z[z ≥ 0]
leaky ReLU:  σ(z) = max{0.01z, z}   (= 0.01z for z < 0,  z for z ≥ 0)
sigmoid, logistic:  σ(z) = 1/(1 + e^{−z})
tanh:  σ(z) = tanh(z)

• L-layer means L − 1 hidden layers plus 1 output layer. (The input x0 is not
counted.)

• The forward propagation therefore iterates applying


– a linear transformation xl-1 7→ zl , highly parameterized with Wl , bl
– a non-linear transformation zl 7→ xl , element-wise and without parameters

4:4

[diagrams: feature-based regression, feature-based classification (same features for all outputs), and a neural network]
4:5

Neural Network models


• We can think of the second-to-last layer xL-1 as a feature vector

φβ (x) = xL-1
• This aligns NN models with what we discussed so far. But the crucial difference is:
In NNs, the features φβ (x) are also parameterized and trained!
While in previous lectures, we had to fix φ(x) by hand, NNs allow us to learn
features and intermediate representations

• Note: It is a common approach to train NNs as usual, but after training fix the trained features φ(x)
(“remove the head (=output layer) and fix the remaining body of the NN”) and use these trained
features for similar problems or other kinds of ML on top.
4:6

NNs as universal function approximators


• A 1-layer NN is linear in the input
• Already a 2-layer NN with h1 → ∞ hidden neurons is a universal function
approximator
– Corresponds to k → ∞ features φ(x) ∈ Rk that are well tuned
4:7

Objectives to train NNs


– loss functions
– regularization
4:8

Loss functions as usual


• Squared error regression, for h_L-dimensional output:
– for a single data point (x, y*), ℓ(f(x), y*) = (f(x) − y*)²
– the loss gradient is ∂ℓ/∂f = 2(f − y*)ᵀ

• For multi-class classification we have h_L = M outputs, and f_β(x) ∈ R^M represents the discriminative function
• Neg-log-likelihood or cross entropy loss:
– for a single data point (x, y*), ℓ(f(x), y*) = − log p(y*|x)
– the loss gradient at output y is ∂ℓ/∂f_y = p(y|x) − [y = y*]
• One-vs-all hinge loss:
– for a single data point (x, y*), ℓ(f(x), y*) = Σ_{y≠y*} [1 − (f_{y*} − f_y)]_+
– the loss gradient at non-target outputs y ≠ y* is ∂ℓ/∂f_y = [f_{y*} < f_y + 1]
– the loss gradient at the target output y* is ∂ℓ/∂f_{y*} = − Σ_{y≠y*} [f_{y*} < f_y + 1]
4:9

New types of regularization


• Conventional: add an L2 or L1 regularization.
– adds a penalty λW_{l,ij}² (Ridge) or λ|W_{l,ij}| (Lasso) for every weight
– In practice, compute the unregularized gradient as usual, then add λW_{l,ij} (for L2), or λ sign W_{l,ij} (for L1) to the gradient
– Historically, this is called weight decay, as the additional gradient (executed after the unregularized weight update) just decays weights
• Dropout
– Srivastava et al: Dropout: a simple way to prevent neural networks from overfitting, JMLR
2014.
– “a way of approximately combining exponentially many different neural network
architectures efficiently”
– “p can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks”
– on test/prediction time, take true averages
• Others:
– Data Augmentation
– Training ensembles, bagging (averaging bootstrapped models)
– Additional embedding objectives (e.g. semi-supervised embedding)
– Early stopping
4:10

Data Augmentation
• A very interesting form of regularization is to modify the data!
• Generate more data by applying invariances to the given data. The model then
learns to generalize as described by these invariances.
• This is a form of regularization that directly incorporates expert knowledge

4:11

Optimization
4:12

Computing the gradient


• Recall forward propagation in an L-layer NN:

∀ l = 1,..,L-1:  z_l = W_l x_{l-1} + b_l ,   x_l = σ(z_l)

• For a single data point (x, y*), assume we have a loss ℓ(f(x), y*)
We define δ_L ≜ dℓ/df = dℓ/dz_L ∈ R^{1×M} as the gradient (as row vector) w.r.t. the output values z_L.

• Backpropagation: We can recursively compute the gradient dℓ/dz_l ∈ R^{1×h_l} w.r.t. all other layers z_l as:

∀ l = L-1,..,1:  δ_l := dℓ/dz_l = (dℓ/dz_{l+1}) (∂z_{l+1}/∂x_l) (∂x_l/∂z_l) = [δ_{l+1} W_{l+1}] ∘ [σ'(z_l)]ᵀ

where ∘ is an element-wise product. The gradient w.r.t. parameters:

dℓ/dW_{l,ij} = (dℓ/dz_{l,i}) (∂z_{l,i}/∂W_{l,ij}) = δ_{l,i} x_{l-1,j}   or   dℓ/dW_l = δ_lᵀ x_{l-1}ᵀ ,   dℓ/db_l = δ_lᵀ

4:13
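A sketch of these recursions for a single data point with squared-error loss and ReLU activations (the column-vector convention used here is just a transposed version of the slide's row vectors; names are illustrative):

import numpy as np

def backprop(x, y_target, Ws, bs):
    # forward pass, storing activations x_l and pre-activations z_l
    xs, zs = [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ xs[-1] + b
        zs.append(z)
        xs.append(np.maximum(0.0, z) if l < len(Ws) - 1 else z)   # last layer linear
    delta = 2.0 * (xs[-1] - y_target)            # delta_L for squared error
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        dWs[l] = np.outer(delta, xs[l])          # d ell / d W_l
        dbs[l] = delta.copy()                    # d ell / d b_l
        if l > 0:
            delta = (Ws[l].T @ delta) * (zs[l-1] > 0)   # [delta_{l+1} W_{l+1}] o sigma'(z_l)
    return dWs, dbs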

• This forward and backward computations are done for each data point (xi , yi ).
P
• Since the total loss is the sum L(β) = i `(fβ (xi ), yi ), the total gradient is the
sum of gradients per data point.
• Efficient implementations send multiple data points (tensors) simultaneously
through the network (fwd and bwd), which speeds up computations.
4:14

Optimization
• For small data size:
We can compute the loss and its gradient Σ_{i=1}^n ∇_β ℓ(f_β(x_i), y_i).
– Use classical gradient-based optimization methods
– default: L-BFGS, oldish but efficient: Rprop
– Called batch learning (in contrast to online learning)

• For large data size: The sum Σ_{i=1}^n is highly inefficient!
– Adapt weights based on much smaller data subsets, mini batches
4:15

Stochastic Gradient Descent


• Compute the loss and gradient for a mini batch D̂ ⊂ D of fixed size k.

L(β, D̂) = Σ_{i∈D̂} ℓ(f_β(x_i), y_i)

∇_β L(β, D̂) = Σ_{i∈D̂} ∇_β ℓ(f_β(x_i), y_i)

• Naive Stochastic Gradient Descent, iterate

β ← β − η∇β L(β, D̂)

– Choice of learning rate η is crucial for convergence!


– Exponential cooling: η = η_0^t
4:16

Stochastic Gradient Descent

• SGD with momentum:

∆β ← α∆β − η∇β L(β, D̂)


β ← β + ∆β

• Nesterov Accelerated Gradient (“Nesterov Momentum”):

∆β ← α∆β − η∇β L(β + ∆β, D̂)


β ← β + ∆β

Yurii Nesterov (1983): A method for solving the convex programming problem with convergence rate O(1/k²)

4:17

Adam

[algorithm box: the Adam algorithm of Kingma & Ba, arXiv:1412.6980]
(all operations interpreted element-wise)
4:18

Adam & Nadam


• Adam interpretations (everything element-wise!):
– m_t ≈ ⟨g⟩, the mean gradient in the recent iterations
– v_t ≈ ⟨g²⟩, the mean gradient-square in the recent iterations
– m̂_t, v̂_t are bias corrected (check: in the first iteration, t = 1, we have m̂_t = g_t, unbiased, as desired)
– ∆θ ≈ −α ⟨g⟩/⟨g²⟩ would be a Newton step if g² were the Hessian...

• Incorporate Nesterov into Adam: Replace the parameter update by

θ_t ← θ_{t-1} − α/(√v̂_t + ε) · (β_1 m̂_t + (1 − β_1) g_t / (1 − β_1^t))

Dozat: Incorporating Nesterov Momentum into Adam, ICLR’16


4:19
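A sketch of the Adam update just interpreted above (default hyperparameters follow the cited paper; all operations are element-wise):

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # m_t, running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # v_t, running mean of squared gradients
    m_hat = m / (1 - beta1**t)                # bias corrections
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v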

Initialization
• The Initialization of weights is important! Heuristics:

• Choose random weights that don’t grow or vanish the gradient:



– E.g., initialize weight vectors W_{l,i·} with standard deviation 1, i.e., each entry with sdv 1/√h_{l-1}

– Roughly: If each element of z_l has a certain standard deviation, the same should be true for z_{l+1}.

• Choose each weight vector Wl,i· to point in a uniform random direction →


same as above

• Choose biases bl,i randomly so that the ReLU hinges cover the input well (think
of distributing hinge features for continuous piece-wise linear regression)
4:20

Brief Discussion
4:21

Historical discussion
(This is completely subjective.)
• Early (from 40ies):
– McCulloch Pitts, Hebbian learning, Rosenblatt, Werbos (backpropagation)
• 80ies:
– Start of connectionism, NIPS
– ML wants to distinguish itself from pure statistics (“machines”, “agents”)
• ’90-’10:
– More theory, better grounded, Statistical Learning theory
– Good ML is pure statistics (again) (Frequentists, SVM)
– ...or pure Bayesian (Graphical Models, Bayesian X)
– sample-efficiency, great generalization, guarantees, theory
– Great successes, in applications across disciplines; supervised, unsupervised, struc-
tured
• ’10-:
– Big Data. NNs. Size matters. GPUs.
– Disproportionate focus on images
– Software engineering becomes central
4:22

• NNs did not become “better” than they were 20 years ago. What changed are the metrics by which they are evaluated:
• Old:
– Sample efficiency & generalization; get the most from little data

– Guarantees (both, w.r.t. generalization and optimization)


– generalize much better than nearest neighbor
• New:
– Ability to cope with billions of samples
→ no batch processing, but stochastic optimization (Adam) without monotone con-
vergence
→ nearest neighbor methods infeasible, compress data into high-capacity NNs
4:23

NNs vs. nearest neighbor


• Imagine an autonomous car. Instead of carrying a neural net, it carries 1 Petabyte of data (500 hard drives, several billion pictures). Every split second it records an image from a camera and wants to query the database to return the 100 most similar pictures, perhaps with a non-trivial similarity metric. That’s not reasonable!
• In that sense, NNs are much better than nearest neighbor. They store/compress/memorize huge amounts of data. Sample efficiency and the precise generalization behavior become less relevant.
• That’s how the metrics changed from ’90-’10 to nowadays
4:24

4.1 Computation Graphs

– A great collateral benefit of NN research!


– Perhaps a new paradigm to design large scale systems, beyond what software engineer-
ing teaches classically
– [see section 3.2 in “Maths” lecture]
4:25

Example
• Three real-valued quantities x, g and f which depend on each other:

f(x, g) = 3x + 2g   and   g(x) = 2x .

What is ∂f(x, g)/∂x and what is df(x, g)/dx?

• The partial derivative only considers a single function f (a, b, c, ..) and asks how
the output of this single function varies with one of its arguments. (Not caring
that the arguments might be functions of yet something else).
• The total derivative considers full networks of dependencies between quanti-
ties and asks how one quantity varies with some other.
4:26
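Worked out for this example: the partial derivative treats g as an independent argument, ∂f(x, g)/∂x = 3, while the total derivative also follows the dependence of g on x: df/dx = ∂f/∂x + (∂f/∂g)(dg/dx) = 3 + 2·2 = 7.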

Computation Graphs
• A function network or computation graph is a DAG of n quantities xi where
each quantity is a deterministic function of a set of parents π(i) ⊂ {1, .., n}, that
is
xi = fi (xπ(i) )
where xπ(i) = (xj )j∈π(i) is the tuple of parent values
• (This could also be called deterministic Bayes net.)

• Total derivative: Given a variation dx of some quantity, how would all child
quantities (down the DAG) vary?
4:27
Chain rules
• Forward-version: (I use in robotics)

df/dx = Σ_{g∈π(f)} (∂f/∂g) (dg/dx)

Why “forward”? You’ve computed dg/dx already, now you move forward to df/dx.
Note: If x ∈ π(f) is also a direct argument to f, the sum includes the term (∂f/∂x)(dx/dx) ≡ ∂f/∂x. To emphasize this, one could also write df/dx = ∂f/∂x + Σ_{g∈π(f), g≠x} (∂f/∂g)(dg/dx).

• Backward-version: (used in NNs!)

df/dx = Σ_{g: x∈π(g)} (df/dg) (∂g/∂x)

Why “backward”? You’ve computed df/dg already, now you move backward to df/dx.
Note: If f ∈ π(g), the sum includes (df/df)(∂f/∂x) ≡ ∂f/∂x. We could also write df/dx = ∂f/∂x + Σ_{g: x∈π(g), g≠f} (df/dg)(∂g/∂x).

[small diagrams: a chain x → g → f illustrating each version]
4:28

4.2 Images & Time Series


4:29

Images & Time Series


• My guess: 90% of the recent success of NNs is in the areas of images or time
series
• For images, convolutional NNs (CNNs) impose a very sensible prior; the rep-
resentations that emerge in CNNs are in fact similar to representations in the
visual area of our brain.

• For time series, long-short term memory (LSTM) networks represent long-term
dependencies in a way that is well trainable – something that is hard to do with
other model structures.
• Both these structural priors, combined with huge data and capacity, make these
methods very strong.
4:30

Convolutional NNs

• Standard fully connected layer: the full matrix W_i has h_i·h_{i+1} parameters

• Convolutional: Each neuron (entry of z_{i+1}) receives input from a square receptive field, with k × k parameters. All neurons share these parameters → translation invariance. The whole layer only has k² parameters.
• There are often multiple neurons with the same receptive field (“depth” of the layer), to represent different “filters”. Stride leads to downsampling. Padding at borders.

• Pooling applies a predefined operation on the receptive field (no parameters):


max or average. Typically for downsampling.
4:31

Learning to read these diagrams...

AlexNet
4:32

ResNet
4:33

ResNeXt
4:34

Pretrained networks

• ImageNet5k, AlexNet, VGG, ResNet, ResNeXt


4:35

LSTMs

4:36

LSTM
• c is a memory signal, that is multiplied with a sigmoid signal Γf . If that is
saturated (Γf ≈ 1), the memory is preserved; and backpropagation copies gra-
dients back
• If Γi is close to 1, a new signal c̃ is written into memory
• If Γo is close to 1, the memory contributes to the normal neural activations a
4:37

Gated Recurrent Units


• Cleaner and more modern: Gated Recurrent Units
but perhaps just similar performance

• Gated Feedback RNNs


4:38

Deep RL
• Value Network
• Advantage Network
• Action Network
• Experience Replay (prioritized)
• Fixed Q-targets
• etc, etc

4:39

Conclusions
• Conventional feed-forward neural networks are by no means magic. They’re a param-
eterized function, which is fit to data.
• Convolutional NNs do make strong and good assumptions about how information pro-
cessing on images should be structured. The results are great and related to some de-
gree to human visual representations. A large part of the success of deep learning is on
images.
Also LSTMs make good assumptions about how memory signals help represent time
series.
The flexibility of “clicking together” network structures and general differentiable com-
putation graphs is great.
All these are innovations w.r.t. formulating structured models for ML
• The major strength of NNs is in their capacity and that, using massive parallelized com-
putation, they can be trained on tons of data. Maybe they don’t even need to be better
than nearest neighbor lookup, but they can be queried much faster.
4:40

5 Kernelization

Kernel Ridge Regression—the “Kernel Trick”


• Reconsider solution of Ridge regression (using the Woodbury identity):
β̂ ridge = (X>X + λIk )-1 X>y = X>(XX> + λIn )-1 y

• Recall X> = (φ(x1 ), .., φ(xn )) ∈ Rk×n , then:

f ridge (x) = φ(x)>β ridge = φ(x)>X>(XX >


+λI)-1 y
| {z } | {z }
κ(x)> K

K is called kernel matrix and has elements

Kij = k(xi , xj ) := φ(xi )>φ(xj )

κ is the vector: κ(x)> = φ(x)>X> = k(x, x1:n )

The kernel function k(x, x0 ) calculates the scalar product in feature space.
5:1

The Kernel Trick


• We can rewrite kernel ridge regression as:

f^ridge(x) = κ(x)ᵀ(K + λI)⁻¹y


with Kij = k(xi , xj )
κi (x) = k(x, xi )

→ at no place we actually need to compute the parameters β̂


→ at no place we actually need to compute the features φ(xi )
→ we only need to be able to compute k(x, x0 ) for any x, x0

• This rewriting is called kernel trick.


• It has great implications:
– Instead of inventing funny non-linear features, we may directly invent funny kernels
– Inventing a kernel is intuitive: k(x, x') expresses how correlated y and y' should be: it is a measure of similarity, it compares x and x'. Specifying how ‘comparable’ x and x' are is often more intuitive than defining “features that might work”.
5:2
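A sketch of kernel ridge regression with a squared exponential kernel (the kernel choice and γ are illustrative; any positive definite k(x, x') could be plugged in):

import numpy as np

def sq_exp_kernel(A, B, gamma=1.0):
    sqdist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sqdist)

def kernel_ridge_fit(X, y, lam, kernel=sq_exp_kernel):
    K = kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha = (K + lambda I)^-1 y

def kernel_ridge_predict(X_train, alpha, X_query, kernel=sq_exp_kernel):
    kappa = kernel(X_query, X_train)     # rows are kappa(x)^T
    return kappa @ alpha                 # f(x) = kappa(x)^T (K + lambda I)^-1 y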

• Every choice of features implies a kernel.

• But, does every choice of kernel correspond to a specific choice of features?


5:3

Reproducing Kernel Hilbert Space


• Let’s define a vector space H_k, spanned by infinitely many basis elements

{φ_x = k(·, x) : x ∈ R^d}

Vectors in this space are linear combinations of such basis elements, e.g.,

f = Σ_i α_i φ_{x_i} ,   f(x) = Σ_i α_i k(x, x_i)

• Let’s define a scalar product in this space. Assuming k(·, ·) is positive definite, we first define the scalar product for every basis element,

⟨φ_x, φ_y⟩ := k(x, y)

Then it follows

⟨φ_x, f⟩ = Σ_i α_i ⟨φ_x, φ_{x_i}⟩ = Σ_i α_i k(x, x_i) = f(x)

• The φ_x = k(·, x) is the ‘feature’ we associate with x. Note that this is a function and infinite dimensional. Choosing α = (K + λI)⁻¹y represents f^ridge(x) = Σ_{i=1}^n α_i k(x, x_i) = κ(x)ᵀα, and shows that ridge regression has a finite-dimensional solution in the basis elements {φ_{x_i}}. A more general version of this insight is called representer theorem.
5:4

Representer Theorem
• For

f* = argmin_{f∈H_k} L(f(x_1), .., f(x_n)) + Ω(||f||²_{H_k})

where L is an arbitrary loss function, and Ω a monotone regularization, it holds

f* = Σ_{i=1}^n α_i k(·, x_i)

• Proof:
decompose f = f_s + f_⊥ ,   f_s ∈ span{φ_{x_i} : x_i ∈ D}
f(x_i) = ⟨f, φ_{x_i}⟩ = ⟨f_s + f_⊥, φ_{x_i}⟩ = ⟨f_s, φ_{x_i}⟩ = f_s(x_i)
L(f(x_1), .., f(x_n)) = L(f_s(x_1), .., f_s(x_n))
Ω(||f_s + f_⊥||²_{H_k}) ≥ Ω(||f_s||²_{H_k})
5:5

Example Kernels
• Kernel functions need to be positive definite: for any set of points, ∀ z with |z| > 0: zᵀKz > 0
→ K is a positive definite matrix
• Examples:
– Polynomial: k(x, x') = (xᵀx' + c)^d
  Let’s verify for d = 2, φ(x) = (1, √2 x_1, √2 x_2, x_1², √2 x_1x_2, x_2²)ᵀ:
  k(x, x') = ((x_1, x_2) (x'_1, x'_2)ᵀ + 1)²
           = (x_1 x'_1 + x_2 x'_2 + 1)²
           = x_1² x'_1² + 2 x_1 x_2 x'_1 x'_2 + x_2² x'_2² + 2 x_1 x'_1 + 2 x_2 x'_2 + 1
           = φ(x)ᵀφ(x')

– Squared exponential (radial basis function): k(x, x') = exp(−γ ||x − x'||²)


5:6

Example Kernels

• Bag-of-words kernels: let φw (x) be the count of word w in document x; define


k(x, y) = hφ(x), φ(y)i
• Graph kernels (Vishwanathan et al: Graph kernels, JMLR 2010)
– Random walk graph kernels

• Gaussian Process regression will explain that k(x, x0 ) has the semantics of an
(apriori) correlatedness of the yet unknown underlying function values f (x) and
f (x0 )
– k(x, x0 ) should be high if you believe that f (x) and f (x0 ) might be similar
– k(x, x0 ) should be zero if f (x) and f (x0 ) might be fully unrelated
5:7

Kernel Logistic Regression*


For logistic regression we compute β using the Newton iterates

β ← β − (XᵀWX + 2λI)⁻¹ [Xᵀ(p − y) + 2λβ]                    (1)
  = −(XᵀWX + 2λI)⁻¹ Xᵀ[(p − y) − WXβ]                        (2)

Using the Woodbury identity

(XᵀWX + A)⁻¹XᵀW = A⁻¹Xᵀ(XA⁻¹Xᵀ + W⁻¹)⁻¹                      (3)

we can rewrite this as

β ← −(1/2λ) Xᵀ(X (1/2λ) Xᵀ + W⁻¹)⁻¹ W⁻¹ [(p − y) − WXβ]       (4)
  = Xᵀ(XXᵀ + 2λW⁻¹)⁻¹ [Xβ − W⁻¹(p − y)] .                     (5)

We can now compute the discriminative function values f_X = Xβ ∈ R^n at the training points by iterating over those instead of β:

f_X ← XXᵀ(XXᵀ + 2λW⁻¹)⁻¹ [Xβ − W⁻¹(p − y)]                    (6)
    = K(K + 2λW⁻¹)⁻¹ [f_X − W⁻¹(p_X − y)]                      (7)

Note that p_X on the RHS also depends on f_X. Given f_X we can compute the discriminative function values f_Z = Zβ ∈ R^m for a set of m query points Z using

f_Z ← κᵀ(K + 2λW⁻¹)⁻¹ [f_X − W⁻¹(p_X − y)] ,   κᵀ = ZXᵀ        (8)

5:8

6 Unsupervised Learning

Unsupervised learning
• What does that mean? Generally: modelling P (x)
• Instances:
– Finding lower-dimensional spaces
– Clustering
– Density estimation
– Fitting a graphical model
• “Supervised Learning as special case”...
6:1

6.1 PCA and Embeddings


6:2

Principle Component Analysis (PCA)


• Assume we have data D = {xi }ni=1 , xi ∈ Rd .

Intuitively: “We believe that there is an underlying lower-dimensional space


explaining this data”.

• How can we formalize this?


6:3

PCA: minimizing projection error


• For each xi ∈ Rd we postulate a lower-dimensional latent variable zi ∈ Rp

xi ≈ Vp zi + µ

• Optimality: Pn
Find Vp , µ and values zi that minimize i=1 ||xi − (Vp zi + µ)||2
6:4

Optimal Vp

    µ̂, ẑ_1:n = argmin_{µ, z_1:n} Σ_{i=1}^n ||x_i − V_p z_i − µ||²

    ⇒  µ̂ = ⟨x_i⟩ = (1/n) Σ_{i=1}^n x_i ,   ẑ_i = V_p⊤(x_i − µ)

• Center the data x̃_i = x_i − µ̂. Then

    V̂_p = argmin_{V_p} Σ_{i=1}^n ||x̃_i − V_p V_p⊤ x̃_i||²

• Solution via Singular Value Decomposition
  – Let X ∈ R^{n×d} be the centered data matrix containing all x̃_i
  – We compute a sorted Singular Value Decomposition X⊤X = V D V⊤
    D is diagonal with sorted singular values λ_1 ≥ λ_2 ≥ · · · ≥ λ_d
    V = (v_1 v_2 · · · v_d) contains the largest eigenvectors v_i as columns
    V_p := V_{1:d,1:p} = (v_1 v_2 · · · v_p)
6:5

Principle Component Analysis (PCA)

V_p⊤ is the matrix that projects to the largest variance directions of X⊤X

    z_i = V_p⊤(x_i − µ) ,   Z = X V_p

• In the non-centered case: compute the SVD of the variance

    A = Var{x} = ⟨xx⊤⟩ − µµ⊤ = (1/n) X⊤X − µµ⊤

6:6
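A minimal NumPy sketch of PCA via an SVD of the centered data matrix (illustrative only, not from
the slides; the toy data and variable names are made up for this example):

    import numpy as np

    def pca(X, p):
        mu = X.mean(axis=0)                      # center the data
        Xc = X - mu
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # eigenvectors of X^T X
        Vp = Vt[:p].T                            # d x p, top-p principal directions
        Z = Xc @ Vp                              # latent coordinates z_i = Vp^T (x_i - mu)
        return mu, Vp, Z

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.3]])
    mu, Vp, Z = pca(X, p=1)
    X_rec = mu + Z @ Vp.T                        # reconstruction x ~ mu + Vp z
    print("mean squared reconstruction error:", np.mean(np.sum((X - X_rec)**2, axis=1)))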

Example: Digits

6:7

Example: Digits

• The “basis vectors” in V_p are also eigenvectors

  Every data point can be expressed in these eigenvectors

      x ≈ µ + V_p z
        = µ + z_1 v_1 + z_2 v_2 + . . .

  (figure: the mean digit image plus weighted eigen-digit images)

6:8

Example: Eigenfaces

(Viola & Jones)


6:9

Non-linear Autoencoders
• PCA gives the “optimal linear autoencoder”
• We can relax the encoding (V_p) and decoding (V_p⊤) to be non-linear mappings,
  e.g., represented as a neural network

  A NN which is trained to reproduce the input: min Σ_i ||y(x_i) − x_i||²

  The hidden layer (“bottleneck”) needs to find a good representation/compression.
• Stacking autoencoders:

6:10

Augmenting NN training with semi-supervised embedding objectives


• Weston et al. (ICML, 2008)

Mnist1h dataset, deep NNs of 2, 6, 8, 10 and 15 layers; each hidden layer 50 hidden
units

6:11

What are good representations?

– Reproducing/autoencoding data, maintaining maximal information


– Disentangling correlations (e.g., ICA)
– those that are most correlated with desired outputs (PLS, NNs)
– those that maintain the clustering
– those that maintain relative distances (MDS)
...
– those that enable efficient reasoning, decision making & learning in the real world
– How do we represent our 3D environment, enabling physical & geometric reasoning?
– How do we represent things to enable us to invent novel things, machines, technology,
  science?
6:12

Independent Component Analysis**


• Assume we have data D = {x_i}_{i=1}^n , x_i ∈ R^d.
  PCA:             P(x_i | z_i) = N(x_i | W z_i + µ, I) ,  P(z_i) = N(z_i | 0, I)
  Factor Analysis: P(x_i | z_i) = N(x_i | W z_i + µ, Σ) ,  P(z_i) = N(z_i | 0, I)
  ICA:             P(x_i | z_i) = N(x_i | W z_i + µ, I) ,  P(z_i) = Π_{j=1}^d P(z_ij)

  (figure: latent sources y_j are mixed into the observed signals x_i)

• In ICA
  1) We have (usually) as many latent variables as observed: dim(x_i) = dim(z_i)
  2) We require all latent variables to be independent
  3) We allow for latent variables to be non-Gaussian

  Note: without point (3) ICA would be without sense!


6:13

Partial least squares (PLS)**


• Is it really a good idea to just pick the p highest-variance components?

  Why should that be a good idea?

6:14

PLS*
• Idea: The first dimension to pick should be the one most correlated with the
OUTPUT, not with itself!

Input: data X ∈ R^{n×d} , y ∈ R^n
Output: predictions ŷ ∈ R^n
 1: initialize the predicted output: ŷ = ⟨y⟩ 1_n
 2: initialize the remaining input dimensions: X̂ = X
 3: for i = 1, .., p do
 4:   i-th input ‘basis vector’: z_i = X̂ X̂⊤ y
 5:   update prediction: ŷ ← ŷ + Z_i y   where  Z_i = z_i z_i⊤ / (z_i⊤ z_i)
 6:   remove “used” input dimensions: X̂ ← (I − Z_i) X̂
 7: end for

(Hastie, page 81)

Line 4 identifies a new input “coordinate” via maximal correlation between the remaining input
dimensions and y.
Line 5 updates the prediction to include the projection of y onto z_i.
Line 6 removes the projection of the input data X̂ along z_i. All z_i will be orthogonal.
6:15
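A rough NumPy transcription of the PLS pseudo-code above (illustrative; it only produces
training-set predictions ŷ and assumes the inputs are roughly standardized, as in Hastie):

    import numpy as np

    def pls_predict(X, y, p):
        n = X.shape[0]
        y_hat = np.full(n, y.mean())
        X_hat = X.copy()
        for _ in range(p):
            z = X_hat @ (X_hat.T @ y)            # direction most correlated with y
            Z = np.outer(z, z) / (z @ z)         # projector onto z
            y_hat = y_hat + Z @ y                # add projection of y onto z
            X_hat = X_hat - Z @ X_hat            # remove the "used" component
        return y_hat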

PLS for classification*

• Not obvious.

• We’ll try to invent one in the exercises :-)


6:16

• back to linear autoencoding, i.e., PCA - but now linear in RKHS


6:17

“Feature PCA” & Kernel PCA**



• The feature trick: X = (φ(x_1)⊤ ; .. ; φ(x_n)⊤) ∈ R^{n×k} stacks the features as rows

• The kernel trick: rewrite all necessary equations such that they only involve
  scalar products φ(x)⊤φ(x′) = k(x, x′):

  We want to compute eigenvectors of X⊤X = Σ_i φ(x_i)φ(x_i)⊤. We can rewrite this as

      X⊤X v_j = λ v_j
      XX⊤ X v_j = λ X v_j ,    v_j = Σ_i α_ji φ(x_i) ,  i.e.  X v_j = K α_j
      K K α_j = λ K α_j
      K α_j = λ α_j

  where K = XX⊤ with entries K_ij = φ(x_i)⊤φ(x_j).

  → We compute the SVD of the kernel matrix K → gives eigenvectors α_j ∈ R^n.

  Projection: x ↦ z = V_p⊤φ(x) = Σ_i α_{1:p,i} φ(x_i)⊤φ(x) = A κ(x)
  (with matrix A ∈ R^{p×n}, A_ji = α_ji and vector κ(x) ∈ R^n, κ_i(x) = k(x_i, x))

  Since we cannot center the features φ(x) we actually need the “double centered kernel matrix”
  K̃ = (I − (1/n)11⊤) K (I − (1/n)11⊤), where K_ij = φ(x_i)⊤φ(x_j) is uncentered.
6:18
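A compact NumPy sketch of kernel PCA with an RBF kernel (illustrative; the eigenvector scaling is
omitted for brevity and the kernel width gamma is an arbitrary choice):

    import numpy as np

    def kernel_pca(X, p, gamma=1.0):
        n = X.shape[0]
        sqd = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        K = np.exp(-gamma * sqd)                 # uncentered kernel matrix
        J = np.eye(n) - np.ones((n, n)) / n
        K_tilde = J @ K @ J                      # double centered kernel matrix
        lam, A = np.linalg.eigh(K_tilde)         # columns of A are the alpha_j
        A_p = A[:, np.argsort(lam)[::-1][:p]]    # keep the p largest eigenvalues
        return K_tilde @ A_p                     # latent coordinates (rows = z_i)

    rng = np.random.default_rng(0)
    X = np.vstack([0.2 * rng.standard_normal((30, 2)),
                   0.2 * rng.standard_normal((30, 2)) + 3.0])
    print(kernel_pca(X, p=2).shape)              # (60, 2)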

Kernel PCA
red points: data
green shading: eigenvector α_j represented as the function Σ_i α_ji k(x_i, x)

Kernel PCA “coordinates” allow us to discriminate clusters!


6:19

Kernel PCA
• Kernel PCA uncovers quite surprising structure:

While PCA “merely” picks high-variance dimensions


Kernel PCA picks high variance features—where features correspond to basis
functions (RKHS elements) over x

• Kernel PCA may map data xi to latent coordinates zi where clustering is much
easier

• All of the following can be represented as kernel PCA:


– Local Linear Embedding
– Metric Multidimensional Scaling

– Laplacian Eigenmaps (Spectral Clustering)


see “Dimensionality Reduction: A Short Tutorial” by Ali Ghodsi
6:20

Kernel PCA clustering

• Using a kernel function k(x, x′) = e^{−||x−x′||²/c}:

• Gaussian mixtures or k-means will easily cluster this


6:21

Spectral Clustering**
Spectral Clustering is very similar to kernel PCA:
• Instead of the kernel matrix K with entries k_ij = k(x_i, x_j) we construct a
  weighted adjacency matrix, e.g.,

      w_ij = 0                     if x_i is not a kNN of x_j
      w_ij = e^{−||x_i−x_j||²/c}   otherwise

  w_ij is the weight of the edge between data points x_i and x_j.

• Instead of computing maximal eigenvectors of K, we compute minimal eigenvectors of

      L = I − W̃ ,   W̃ = diag(Σ_j w_ij)^-1 W

  (Σ_j w_ij is called the degree of node i, W̃ is the normalized weighted adjacency matrix)
6:22

• Given L = U DV >, we pick the p smallest eigenvectors Vp = V1:n,1:p (perhaps


exclude the trivial smallest eigenvector)

• The latent coordinates for xi are zi = Vi,1:p

• Spectral Clustering provides a method to compute latent low-dimensional co-


ordinates zi = Vi,1:p for each high-dimensional xi ∈ Rd input.

• This is then followed by a standard clustering, e.g., Gaussian Mixture or k-


means
6:23

6:24

• Spectral Clustering is similar to kernel PCA:


– The kernel matrix K usually represents similarity
The weighted adjacency matrix W represents proximity & similarity
– High Eigenvectors of K are similar to low EV of L = I − W

• Original interpretation of Spectral Clustering:


– L = I − W (weighted graph Laplacian) describes a diffusion process:
The diffusion rate Wij is high if i and j are close and similar

– Eigenvectors of L correspond to stationary solutions

• The Graph Laplacian L: For some vector f ∈ R^n, note the following identities:

      (Lf)_i = (Σ_j w_ij) f_i − Σ_j w_ij f_j = Σ_j w_ij (f_i − f_j)

      f⊤Lf = Σ_i f_i Σ_j w_ij (f_i − f_j) = Σ_ij w_ij (f_i² − f_i f_j)
           = Σ_ij w_ij (½ f_i² + ½ f_j² − f_i f_j) = ½ Σ_ij w_ij (f_i − f_j)²

  where the second-to-last = holds if w_ij = w_ji is symmetric.
6:25
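A possible NumPy sketch of the spectral embedding described above (illustrative; the kNN
symmetrization and the parameters k and c are ad-hoc choices, not prescribed by the slides):

    import numpy as np

    def spectral_embedding(X, p, k=10, c=1.0):
        n = X.shape[0]
        sqd = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        W = np.exp(-sqd / c)
        keep = np.zeros((n, n), dtype=bool)          # keep only (symmetrized) kNN edges
        nn = np.argsort(sqd, axis=1)[:, 1:k + 1]
        keep[np.repeat(np.arange(n), k), nn.ravel()] = True
        W = np.where(keep | keep.T, W, 0.0)
        W_tilde = W / W.sum(axis=1, keepdims=True)   # diag(sum_j w_ij)^-1 W
        L = np.eye(n) - W_tilde                      # L = I - W_tilde
        lam, V = np.linalg.eig(L)                    # L is not symmetric -> general eig
        idx = np.argsort(lam.real)[:p]               # p smallest eigenvectors
        return V[:, idx].real                        # latent coordinates z_i = V_{i,1:p}

    # afterwards: run k-means or a Gaussian mixture on the rows of the embedding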

Metric Multidimensional Scaling**


• Assume we have data D = {x_i}_{i=1}^n , x_i ∈ R^d.
  As before we want to identify latent lower-dimensional representations z_i ∈ R^p for this data.

• A simple idea: Minimize the stress

      S_C(z_1:n) = Σ_{i≠j} (d_ij² − ||z_i − z_j||²)²

  We want distances in high-dimensional space to be equal to distances in low-dimensional space.
6:26

Metric Multidimensional Scaling = (kernel) PCA


• Note the relation:

      d_ij² = ||x_i − x_j||² = ||x_i − x̄||² + ||x_j − x̄||² − 2(x_i − x̄)⊤(x_j − x̄)

  This translates a distance into a (centered) scalar product

• If we define

      K̃ = (I − (1/n)11⊤) D (I − (1/n)11⊤) ,   D_ij = −d_ij²/2

  then K̃_ij = (x_i − x̄)⊤(x_j − x̄) is the normal covariance matrix and MDS is equivalent
  to kernel PCA
6:27

Non-metric Multidimensional Scaling


• We can do this for any data (also non-vectorial or not ∈ R^d) as long as we have
  a data set of comparative dissimilarities d_ij

      S(z_1:n) = Σ_{i≠j} (d_ij² − |z_i − z_j|²)²

• Minimize S(z1:n ) w.r.t. z1:n without any further constraints!


6:28

Example for Non-Metric MDS: ISOMAP**


• Construct a kNN graph and label edges with the Euclidean distance
  – Between any two x_i and x_j, compute the “geodesic” distance d_ij
    (shortest path along the graph)
  – Then apply MDS

by Tenenbaum et al.
6:29

The zoo of dimensionality reduction methods


• PCA family:
– kernel PCA, non-neg. Matrix Factorization, Factor Analysis

• All of the following can be represented as kernel PCA:


– Local Linear Embedding
– Metric Multidimensional Scaling
– Laplacian Eigenmaps (Spectral Clustering)

They all use different notions of distance/correlation as input to kernel PCA

see “Dimensionality Reduction: A Short Tutorial” by Ali Ghodsi


6:30

PCA variants*

6:31

PCA variant: Non-negative Matrix Factorization**


• Assume we have data D = {x_i}_{i=1}^n , x_i ∈ R^d.
  As for PCA (where we had x_i ≈ V_p z_i + µ) we search for a lower-dimensional
  space with a linear relation to x_i

• In NMF we require everything to be non-negative: the data x_i, the projection W,
  and the latent variables z_i
  Find W ∈ R^{p×d} (the transposed projection) and Z ∈ R^{n×p} (the latent variables z_i)
  such that
      X ≈ ZW

• Iterative solution: (E-step and M-step like...)

      z_ik ← z_ik · [Σ_{j=1}^d w_kj x_ij/(ZW)_ij] / [Σ_{j=1}^d w_kj]

      w_kj ← w_kj · [Σ_{i=1}^N z_ik x_ij/(ZW)_ij] / [Σ_{i=1}^N z_ik]

6:32
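A minimal NumPy sketch of these multiplicative updates (illustrative; it assumes a non-negative
data matrix X, and the iteration count and the small eps constant are arbitrary choices):

    import numpy as np

    def nmf(X, p, iters=200, eps=1e-9):
        # X ~ Z W, all entries non-negative;  X: n x d,  Z: n x p,  W: p x d
        rng = np.random.default_rng(0)
        n, d = X.shape
        Z = rng.random((n, p)) + 0.1
        W = rng.random((p, d)) + 0.1
        for _ in range(iters):
            R = X / (Z @ W + eps)                            # ratios x_ij / (ZW)_ij
            Z *= (R @ W.T) / (W.sum(axis=1)[None, :] + eps)  # update z_ik
            R = X / (Z @ W + eps)
            W *= (Z.T @ R) / (Z.sum(axis=0)[:, None] + eps)  # update w_kj
        return Z, W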

PCA variant: Non-negative Matrix Factorization*

(from Hastie 14.6)



6:33

PCA variant: Factor Analysis**


Another variant of PCA: (Bishop 12.64)
Allows for different noise in each dimension P (xi | zi ) = N(xi | Vp zi + µ, Σ)
(with Σ diagonal)
6:34

6.2 Clustering
6:35

Clustering
• Clustering often involves two steps:
• First map the data to some embedding that emphasizes clusters
– (Feature) PCA
– Spectral Clustering
– Kernel PCA
– ISOMAP
• Then explicitly analyze clusters
– k-means clustering
– Gaussian Mixture Model
– Agglomerative Clustering
6:36

k-means Clustering
• Given data D = {x_i}_{i=1}^n, find K centers µ_k and a data assignment c: i ↦ k that
  minimize
      min_{c,µ} Σ_i (x_i − µ_c(i))²

• k-means clustering:
  – Pick K data points randomly to initialize the centers µ_k
  – Iterate adapting the assignments c(i) and the centers µ_k:

      ∀i:  c(i) ← argmin_{c(i)} Σ_j (x_j − µ_c(j))² = argmin_k (x_i − µ_k)²

      ∀k:  µ_k ← argmin_{µ_k} Σ_i (x_i − µ_c(i))² = (1/|c^-1(k)|) Σ_{i∈c^-1(k)} x_i

6:37
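A minimal NumPy sketch of this k-means loop (illustrative only; the initialization and stopping
criterion are simple ad-hoc choices):

    import numpy as np

    def kmeans(X, K, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), K, replace=False)]         # K random data points
        for _ in range(iters):
            d2 = ((X[:, None, :] - mu[None, :, :])**2).sum(-1)
            c = d2.argmin(axis=1)                            # assignments c(i)
            mu_new = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                               for k in range(K)])           # centers = cluster means
            if np.allclose(mu_new, mu):
                break
            mu = mu_new
        return c, mu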

k-means Clustering

from Hastie

6:38

k-means Clustering

• Converges to a local minimum → many restarts

• Choosing k? Plot L(k) = min_{c,µ} Σ_i (x_i − µ_c(i))² for different k – choose a trade-off
  between model complexity (large k) and data fit (small loss L(k))

6:39

k-means Clustering for Classification

from Hastie
6:40

Gaussian Mixture Model for Clustering


• GMMs can/should be introduced as a generative probabilistic model of the data:
  – K different Gaussians with parameters µ_k, Σ_k
  – Assignment RANDOM VARIABLE c_i ∈ {1, .., K} with P(c_i = k) = π_k
  – The observed data point x_i with P(x_i | c_i = k; µ_k, Σ_k) = N(x_i | µ_k, Σ_k)
• EM-Algorithm described as a kind of soft-assignment version of k-means
  – Initialize the centers µ_1:K randomly from the data; all covariances Σ_1:K to unit and
    all π_k uniformly.
  – E-step: (probabilistic/soft assignment) Compute

        q(c_i = k) = P(c_i = k | x_i, µ_1:K, Σ_1:K) ∝ N(x_i | µ_k, Σ_k) π_k

  – M-step: Update parameters (centers AND covariances)

        π_k = (1/n) Σ_i q(c_i = k)
        µ_k = 1/(nπ_k) Σ_i q(c_i = k) x_i
        Σ_k = 1/(nπ_k) Σ_i q(c_i = k) x_i x_i⊤ − µ_k µ_k⊤

6:41
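A compact NumPy sketch of this EM loop for a GMM (illustrative; the jitter term added to Σ_k is an
ad-hoc numerical stabilization, not part of the slides):

    import numpy as np

    def gmm_em(X, K, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        mu = X[rng.choice(n, K, replace=False)]      # centers drawn from the data
        Sigma = np.stack([np.eye(d)] * K)            # unit covariances
        pi = np.full(K, 1.0 / K)                     # uniform mixture weights
        for _ in range(iters):
            q = np.zeros((n, K))                     # E-step: q(c_i=k) prop. to N(x_i|mu_k,Sigma_k) pi_k
            for k in range(K):
                diff = X - mu[k]
                inv = np.linalg.inv(Sigma[k])
                norm = ((2 * np.pi)**d * np.linalg.det(Sigma[k]))**-0.5
                q[:, k] = pi[k] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
            q /= q.sum(axis=1, keepdims=True)
            Nk = q.sum(axis=0)                       # M-step
            pi = Nk / n
            mu = (q.T @ X) / Nk[:, None]
            for k in range(K):
                Sigma[k] = (q[:, k, None] * X).T @ X / Nk[k] - np.outer(mu[k], mu[k]) \
                           + 1e-6 * np.eye(d)        # small jitter for numerical stability
        return pi, mu, Sigma, q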

Gaussian Mixture Model

EM iterations for Gaussian Mixture model:

from Bishop

6:42

Agglomerative Hierarchical Clustering

• agglomerative = bottom-up, divisive = top-down


• Merge the two groups with the smallest intergroup dissimilarity
• The dissimilarity of two groups G, H can be measured as
  – Nearest Neighbor (or “single linkage”): d(G, H) = min_{i∈G,j∈H} d(x_i, x_j)
  – Furthest Neighbor (or “complete linkage”): d(G, H) = max_{i∈G,j∈H} d(x_i, x_j)
  – Group Average: d(G, H) = 1/(|G||H|) Σ_{i∈G} Σ_{j∈H} d(x_i, x_j)

6:43

Agglomerative Hierarchical Clustering

6:44

Appendix: Centering & Whitening


• Some prefer to center (shift to zero mean) the data before applying methods:
x ← x − hxi , y ← y − hyi
this spares augmenting the bias feature 1 to the data.

• More interesting: The loss and the best choice of λ depend on the scaling of the
  data. If we always scale the data to the same range, we may have better priors
  about the choice of λ and the interpretation of the loss

      x ← x / √Var{x} ,   y ← y / √Var{y}

• Whitening: Transform the data to remove all correlations and variances.

  Let A = Var{x} = (1/n) X⊤X − µµ⊤ with Cholesky decomposition A = MM⊤.

      x ← M^-1 x ,  with  Var{M^-1 x} = I_d

6:45

7 Local Learning & Ensemble Learning

Local learning & Ensemble learning


• “Simpler is Better”
– We’ve learned about [kernel] ridge — logistic regression
– We’ve learned about high-capacity NN training
– Sometimes one should consider also much simpler methods as baseline

• Content:
– Local learners
– local & lazy learning, kNN, Smoothing Kernel, kd-trees
– Combining weak or randomized learners
– Bootstrap, bagging, and model averaging
– Boosting
– (Boosted) decision trees & stumps, random forests
7:1

7.1 Local & lazy learning


7:2

Local & lazy learning


• Idea of local (or “lazy”) learning:
Do not try to build one global model f (x) from the data. Instead, whenever we
have a query point x∗ , we build a specific local model in the neighborhood of
x∗ .

• Typical approach:
– Given a query point x∗ , find all kNN in the data D = {(xi , yi )}N
i=1
– Fit a local model fx∗ only to these kNNs, perhaps weighted
– Use the local model fx∗ to predict x∗ 7→ ŷ0

• Weighted local least squares:


      L^local(β, x*) = Σ_{i=1}^n K(x*, x_i)(y_i − f(x_i))² + λ||β||²

  where K(x*, x) is called smoothing kernel. The optimum is:

      β̂ = (X⊤WX + λI)^-1 X⊤W y ,   W = diag(K(x*, x_1:n))

7:3

Regression example
kNN smoothing kernel:  K(x*, x_i) = 1 if x_i ∈ kNN(x*), else 0

Epanechnikov quadratic smoothing kernel:  K_λ(x*, x) = D(|x* − x|/λ) ,
    D(s) = (3/4)(1 − s²) if s ≤ 1, else 0

(Hastie, Sec 6.3)


7:4
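A minimal NumPy sketch of weighted local least squares with the Epanechnikov kernel (illustrative;
the bandwidth `width` and the small ridge term are arbitrary choices for this example):

    import numpy as np

    def epanechnikov(s):
        return np.where(np.abs(s) <= 1, 0.75 * (1 - s**2), 0.0)

    def local_fit_predict(X, y, x_star, width=1.0, lam=1e-6):
        w = epanechnikov(np.linalg.norm(X - x_star, axis=1) / width)  # kernel weights
        Phi = np.hstack([np.ones((len(X), 1)), X])                    # bias + linear features
        W = np.diag(w)
        beta = np.linalg.solve(Phi.T @ W @ Phi + lam * np.eye(Phi.shape[1]),
                               Phi.T @ W @ y)        # (X^T W X + lam I)^-1 X^T W y
        return np.concatenate([[1.0], x_star]) @ beta

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (100, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
    print(local_fit_predict(X, y, np.array([0.5])))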

Smoothing Kernels

from Wikipedia
7:5

Which metric to use for NN?


• This is the crucial question – the fundamental question of generalization.
  – Given a query x*, which data points x_i would you consider as being “related”, so
    that the label of x_i is correlated to the correct label of x*?

• Possible answers beyond naive Euclidean distance |x∗ − xi |


– Some other kernel function k(x∗ , xi )
– First encode x into a “meaningful” latent representation z; then use Euclidean dis-
tance there
– Take some off-the-shelf pretrained image NN, chop of the head, use this internal
representation
7:6

kd-trees
• For local & lazy learning it is essential to efficiently retrieve the kNN

Problem: Given data X, a query x∗ , identify the kNNs of x∗ in X.

• Linear time (stepping through all of X) is far too slow.

A kd-tree pre-structures the data into a binary tree, allowing O(log n) retrieval
of kNNs.
7:7

kd-trees

(There are “typos” in this figure... Exercise to find them.)

• Every node plays two roles:


– it defines a hyperplane that separates the data along one coordinate
– it hosts a data point, which lives exactly on the hyperplane (defines the division)
7:8

kd-trees
• Simplest (non-efficient) way to construct a kd-tree:
– hyperplanes divide alternatingly along 1st, 2nd, ... coordinate

– choose random point, use it to define hyperplane, divide data, iterate

• Nearest neighbor search:


– descent to a leave node and take this as initial nearest point
– ascent and check at each branching the possibility that a nearer point exists on the
other side of the hyperplane

• Approximate Nearest Neighbor (libann on Debian..)


7:9

7.2 Combining weak and randomized learners


7:10

Combining learners
• The general idea is:
– Given data D, let us learn various models f1 , .., fM
– Our prediction is then some combination of these, e.g.
      f(x) = Σ_{m=1}^M α_m f_m(x)

• “Various models” could be:

Model averaging: Fully different types of models (using different (e.g. limited)
feature sets; neural nets; decision trees; hyperparameters)

Bootstrap: Models of same type, trained on randomized versions of D

  Boosting: Models of the same type, trained on cleverly designed modifications/reweightings
  of D

• How can we choose the αm ? (You should know that!)


7:11

Bootstrap & Bagging


• Bootstrap:
– Data set D of size n
– Generate M data sets Dm by resampling D with replacement
– Each Dm is also of size n (some samples doubled or missing)

– Distribution over data sets ↔ distribution over β (compare slide 02:13)


– The ensemble {f1 , .., fM } is similar to cross-validation
– Mean and variance of {f1 , .., fM } can be used for model assessment

• Bagging: (“bootstrap aggregation”)


      f(x) = (1/M) Σ_{m=1}^M f_m(x)

7:12
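A generic NumPy sketch of bootstrap aggregation around an arbitrary base learner (illustrative;
the `fit`/`predict` interface is made up for this example):

    import numpy as np

    def bagged_predict(X, y, X_query, fit, predict, M=20, seed=0):
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(M):
            idx = rng.integers(0, len(X), len(X))     # resample D with replacement
            preds.append(predict(fit(X[idx], y[idx]), X_query))
        P = np.stack(preds)
        return P.mean(axis=0), P.var(axis=0)          # bagged prediction and its spread

    # example base learner: least squares with a bias feature
    fit = lambda X, y: np.linalg.lstsq(np.hstack([np.ones((len(X), 1)), X]), y, rcond=None)[0]
    predict = lambda beta, X: np.hstack([np.ones((len(X), 1)), X]) @ beta

The variance over the M predictions can also be used for model assessment, as noted above.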

• Bagging has similar effect to regularization:

(Hastie, Sec 8.7)


7:13

Bayesian Model Averaging


• If f1 , .., fM are very different models
– Equal weighting would not be clever
– More confident models (less variance, less parameters, higher likelihood)
→ higher weight

• Bayesian Averaging

      P(y|x) = Σ_{m=1}^M P(y|x, f_m, D) P(f_m|D)

The term P (fm |D) is the weighting αm : it is high, when the model is likely
under the data (↔ the data is likely under the model & the model has “fewer
parameters”).
7:14

The basis function view: Models are features!


• Compare model averaging f(x) = Σ_{m=1}^M α_m f_m(x) with regression:

      f(x) = Σ_{j=1}^k φ_j(x) β_j = φ(x)⊤β

• We can think of the M models fm as features φj for linear regression!


– We know how to find optimal parameters α
– But beware overfitting!
7:15

Boosting
• In Bagging and Model Averaging, the models are trained on the “same data”
(unbiased randomized versions of the same data)

• Boosting tries to be cleverer:


– It adapts the data for each learner
– It assigns each learner a differently weighted version of the data

• With this, boosting can
  – Combine many “weak” classifiers to produce a powerful “committee”
  – A weak learner only needs to be somewhat better than random
7:16

AdaBoost**
(Freund & Schapire, 1997)
(classical Algo; use Gradient Boosting instead in practice)

• Binary classification problem with data D = {(xi , yi )}ni=1 , yi ∈ {−1, +1}


• We know how to train discriminative functions f (x); let

G(x) = sign f (x) ∈ {−1, +1}



• We will train a sequence of classifiers G_1, .., G_M, each on differently weighted
  data, to yield a classifier

      G(x) = sign Σ_{m=1}^M α_m G_m(x)

7:17

AdaBoost**

(Hastie, Sec 10.1)


7:18

AdaBoost**

Input: data D = {(x_i, y_i)}_{i=1}^n
Output: family of M classifiers G_m and weights α_m
 1: initialize ∀i: w_i = 1/n
 2: for m = 1, .., M do
 3:   Fit classifier G_m to the training data weighted by w_i
 4:   err_m = Σ_i w_i [y_i ≠ G_m(x_i)] / Σ_i w_i
 5:   α_m = log[(1 − err_m)/err_m]
 6:   ∀i: w_i ← w_i exp{α_m [y_i ≠ G_m(x_i)]}
 7: end for

(Hastie, sec 10.1)

Weights remain unchanged for correctly classified points.
Weights are multiplied by (1 − err_m)/err_m > 1 for mis-classified data points.

• Real AdaBoost: A variant exists that combines probabilistic classifiers σ(f (x)) ∈
[0, 1] instead of discrete G(x) ∈ {−1, +1}
7:19
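A minimal NumPy sketch of AdaBoost with decision stumps as weak learners, following the
pseudo-code above (illustrative; labels are assumed to be in {−1,+1} and the brute-force stump
search is deliberately naive):

    import numpy as np

    def fit_stump(X, y, w):
        # weak learner: best single-dimension threshold under weights w
        best = (np.inf, 0, 0.0, 1)
        for a in range(X.shape[1]):
            for t in np.unique(X[:, a]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, a] > t, 1, -1)
                    err = np.sum(w * (pred != y)) / np.sum(w)
                    if err < best[0]:
                        best = (err, a, t, s)
        return best

    def adaboost(X, y, M=20):
        w = np.full(len(X), 1.0 / len(X))
        stumps, alphas = [], []
        for _ in range(M):
            err, a, t, s = fit_stump(X, y, w)
            alpha = np.log((1 - err) / (err + 1e-12))
            pred = s * np.where(X[:, a] > t, 1, -1)
            w = w * np.exp(alpha * (pred != y))       # reweight misclassified points
            stumps.append((a, t, s)); alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        F = sum(alpha * s * np.where(X[:, a] > t, 1, -1)
                for (a, t, s), alpha in zip(stumps, alphas))
        return np.sign(F)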

The basis function view


• In AdaBoost, each model Gm depends on the data weights wm
We could write this as
      f(x) = Σ_{m=1}^M α_m f_m(x, w_m)

  The “features” f_m(x, w_m) now have additional parameters w_m

  We’d like to optimize
      min_{α, w_1, .., w_M} L(f)

  w.r.t. α and all the feature parameters w_m.

• In general this is hard.


But assuming αm̂ and wm̂ fixed, optimizing for αm and wm is efficient.

• AdaBoost does exactly this, choosing wm so that the “feature” fm will best
reduce the loss (cf. PLS)
(Literally, AdaBoost uses exponential loss or neg-log-likelihood; Hastie sec 10.4 & 10.5)
7:20

Gradient Boosting
• AdaBoost generates a series of basis functions by using different data weight-
ings wm depending on so-far classification errors
• We can also generate a series of basis functions fm by fitting them to the gradi-
ent of the so-far loss
7:21

Gradient Boosting
• Assume we want to minimize some loss function

      min_f L(f) = Σ_{i=1}^n L(y_i, f(x_i))

  We can solve this using gradient descent

      f* = f_0 + α_1 ∂L(f_0)/∂f + α_2 ∂L(f_0 + α_1 f_1)/∂f + α_3 ∂L(f_0 + α_1 f_1 + α_2 f_2)/∂f + · · ·

  where the successive gradient terms are approximated by f_1, f_2, f_3, . . .
  – Each f_m approximates the so-far loss gradient
  – We use linear regression to choose α_m (instead of line search)

• More intuitively: ∂L(f)/∂f “points into the direction of the error/residual of f”. It
  shows how f could be improved.
  Gradient boosting uses the next learner f_k ≈ ∂L(f_so far)/∂f to approximate how f can
  be improved.
  Optimizing the α’s does the improvement.
7:22

Gradient Boosting

Input: function class F (e.g., of discriminative functions), data D = {(x_i, y_i)}_{i=1}^n,
       an arbitrary loss function L(y, ŷ)
Output: function f̂ to minimize Σ_{i=1}^n L(y_i, f(x_i))
 1: Initialize a constant f̂ = f_0 = argmin_{f∈F} Σ_{i=1}^n L(y_i, f(x_i))
 2: for m = 1 : M do
 3:   For each data point i = 1 : n compute r_im = − ∂L(y_i, f(x_i))/∂f(x_i) |_{f=f̂}
 4:   Fit a regression f_m ∈ F to the targets r_im, minimizing squared error
 5:   Find optimal coefficients (e.g., via feature logistic regression)
      α = argmin_α Σ_i L(y_i, Σ_{j=0}^m α_j f_j(x_i))
      (often: fix α_0:m-1 and only optimize over α_m)
 6:   Update f̂ = Σ_{j=0}^m α_j f_j
 7: end for

• If F is the set of regression/decision trees, then step 5 usually re-optimizes the termi-
nal constants of all leave nodes of the regression tree fm . (Step 4 only determines the
terminal regions.)
7:23
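A minimal NumPy sketch of gradient boosting with regression stumps and squared loss, where the
negative gradient is simply the residual (illustrative; a fixed shrinkage `nu` replaces the
coefficient optimization of step 5):

    import numpy as np

    def fit_stump_reg(X, r):
        # regression stump: one split, leaf means as predictions
        best = (np.inf, None)
        for a in range(X.shape[1]):
            for t in np.unique(X[:, a])[:-1]:
                left = X[:, a] <= t
                c1, c2 = r[left].mean(), r[~left].mean()
                sse = np.sum((r[left] - c1)**2) + np.sum((r[~left] - c2)**2)
                if sse < best[0]:
                    best = (sse, (a, t, c1, c2))
        return best[1]

    def stump_predict(stump, X):
        a, t, c1, c2 = stump
        return np.where(X[:, a] <= t, c1, c2)

    def gradient_boost(X, y, M=50, nu=0.1):
        f0 = y.mean()
        f = np.full(len(y), f0)
        stumps = []
        for _ in range(M):
            r = y - f                                 # residual = -dL/df for squared loss
            stump = fit_stump_reg(X, r)
            f = f + nu * stump_predict(stump, X)      # shrinkage nu instead of step 5
            stumps.append(stump)
        return f0, stumps

    def gb_predict(f0, stumps, X, nu=0.1):
        return f0 + nu * sum(stump_predict(s, X) for s in stumps)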

Gradient boosting is the preferred method


• Hastie’s book quite “likes” gradient boosting
– Can be applied to any loss function
– No matter if regression or classification
– Very good performance
– Simpler, more general, better than AdaBoost
7:24

Classical examples for boosting


7:25

Decision Trees**
• Decision trees are particularly used in Bagging and Boosting contexts

• Decision trees are “linear in features”, but the features are the terminal regions
of a tree, which are constructed depending on the data

• We’ll learn about


– Boosted decision trees & stumps
– Random Forests
7:26

Decision Trees
• We describe CART (classification and regression tree)
• Decision trees are linear in features:
      f(x) = Σ_{j=1}^k c_j [x ∈ R_j]

where Rj are disjoint rectangular regions and cj the constant prediction in a


region
• The regions are defined by a binary decision tree

7:27

Growing the decision tree


• The constants are the region averages  c_j = Σ_i y_i [x_i ∈ R_j] / Σ_i [x_i ∈ R_j]

• Each split x_a > t is defined by a choice of input dimension a ∈ {1, .., d} and a
  threshold t

• Given a yet unsplit region R_j, we split it by choosing

      min_{a,t} [ min_{c_1} Σ_{i: x_i∈R_j ∧ x_a≤t} (y_i − c_1)² + min_{c_2} Σ_{i: x_i∈R_j ∧ x_a>t} (y_i − c_2)² ]

– Finding the threshold t is really quick (slide along)


– We do this for every input dimension a

7:28

Deciding on the depth (if not pre-fixed)

• We first grow a very large tree (e.g. until at most 5 data points live in each
region)

• Then we rank all nodes using “weakest link pruning”:


Iteratively remove the node that least increases

      Σ_{i=1}^n (y_i − f(x_i))²

• Use cross-validation to choose the eventual level of pruning

  This is equivalent to choosing a regularization parameter λ for
      L(T) = Σ_{i=1}^n (y_i − f(x_i))² + λ|T|
  where the regularization |T| is the tree size

7:29

Example:
CART on the Spam data set
(details: Hastie, p 320)

Test error rate: 8.7%

7:30

Boosting trees & stumps**

• A decision stump is a decision tree with fixed depth 1 (just one split)

• Gradient boosting of decision trees (of fixed depth J) and stumps is very effec-
tive

Test error rates on Spam data set:


full decision tree 8.7%
boosted decision stumps 4.7%
boosted decision trees with J = 5 4.5%
7:31

Random Forests: Bootstrapping & randomized splits**


• Recall that Bagging averages models f1 , .., fM where each fm was trained on a
bootstrap resample Dm of the data D

This randomizes the models and avoids over-generalization

• Random Forests do Bagging, but additionally randomize the trees:


  – When growing a new split, choose the input dimension a only from a random subset of
    m features
  – m is often very small; even m = 1 or m = 3

• Random Forests are the prime example for “creating many randomized weak
learners from the same data D”
7:32

Random Forests vs. gradient boosted trees

(Hastie, Fig 15.1)


7:33

8 Probabilistic Machine Learning

Learning as Inference
• The parametric view

      P(β|Data) = P(Data|β) P(β) / P(Data)

• The function space view

      P(f|Data) = P(Data|f) P(f) / P(Data)

• Today:
– Bayesian (Kernel) Ridge Regression ↔ Gaussian Process (GP)
– Bayesian (Kernel) Logistic Regression ↔ GP classification
– Bayesian Neural Networks (briefly)
8:1

• Beyond learning about specific Bayesian learning methods:

Understand relations between

loss/error ↔ neg-log likelihood

regularization ↔ neg-log prior

cost (reg.+loss) ↔ neg-log posterior


8:2

8.1 Bayesian [Kernel] Ridge|Logistic Regression & Gaussian Pro-


cesses
8:3

Gaussian Process = Bayesian (Kernel) Ridge Regression


8:4

Ridge regression as Bayesian inference


• We have random variables X_1:n, Y_1:n, β

• We observe data D = {(x_i, y_i)}_{i=1}^n and want to compute P(β | D)

• Let’s assume:  (graphical model: x_i and β are parents of y_i, plate over i = 1 : n)
  P(X) is arbitrary
  P(β) is Gaussian: β ∼ N(0, σ²/λ) ∝ e^{−λ/(2σ²) ||β||²}
  P(Y | X, β) is Gaussian: y = x⊤β + ε ,  ε ∼ N(0, σ²)
8:5

Ridge regression as Bayesian inference


• Bayes’ Theorem:

      P(β | D) = P(D | β) P(β) / P(D)

      P(β | x_1:n, y_1:n) = [Π_{i=1}^n P(y_i | β, x_i)] P(β) / Z

  P(D | β) is a product of independent likelihoods for each observation (x_i, y_i)

  Using the Gaussian expressions:

      P(β | D) = (1/Z′) Π_{i=1}^n e^{−(y_i − x_i⊤β)²/(2σ²)} e^{−λ||β||²/(2σ²)}

      − log P(β | D) = 1/(2σ²) [ Σ_{i=1}^n (y_i − x_i⊤β)² + λ||β||² ] + log Z′

      − log P(β | D) ∝ L^ridge(β)

  1st insight: The neg-log posterior P(β | D) is proportional to the cost function L^ridge(β)!
8:6

Ridge regression as Bayesian inference


• Let us compute P(β | D) explicitly:

      P(β | D) = (1/Z′) Π_{i=1}^n e^{−(y_i − x_i⊤β)²/(2σ²)} e^{−λ||β||²/(2σ²)}
               = (1/Z′) e^{−1/(2σ²) Σ_i (y_i − x_i⊤β)²} e^{−λ||β||²/(2σ²)}
               = (1/Z′) e^{−1/(2σ²) [(y − Xβ)⊤(y − Xβ) + λβ⊤β]}
               = (1/Z′) e^{−½ [ (1/σ²) y⊤y + (1/σ²) β⊤(X⊤X + λI)β − (2/σ²) β⊤X⊤y ]}
               = N(β | β̂, Σ)

  This is a Gaussian with covariance and mean

      Σ = σ²(X⊤X + λI)^-1 ,   β̂ = (1/σ²) Σ X⊤y = (X⊤X + λI)^-1 X⊤y
• 2nd insight: The mean β̂ is exactly the classical argminβ Lridge (β).
• 3rd insight: The Bayesian approach not only gives a mean/optimal β̂, but also
a variance Σ of that estimate. (Cp. slide 02:13!)
8:7

Predicting with an uncertain β


• Suppose we want to make a prediction at x. We can compute the predictive
distribution over a new observation y ∗ at x∗ :

      P(y* | x*, D) = ∫_β P(y* | x*, β) P(β | D) dβ
                    = ∫_β N(y* | φ(x*)⊤β, σ²) N(β | β̂, Σ) dβ
                    = N(y* | φ(x*)⊤β̂, σ² + φ(x*)⊤Σφ(x*))

Note, for f (x) = φ(x)>β, we have P (f (x) | D) = N(f (x) | φ(x)>β̂, φ(x)>Σφ(x)) without the σ 2
• So, y ∗ is Gaussian distributed around the mean prediction φ(x∗ )>β̂:

(from Bishop, p176)


8:8
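A minimal NumPy sketch of Bayesian ridge regression, returning the posterior N(β | β̂, Σ) and the
predictive mean/variance (illustrative; the values of σ² and λ and the toy data are arbitrary):

    import numpy as np

    def bayes_ridge(X, y, lam=1.0, sigma2=0.1):
        # posterior P(beta | D) = N(beta | beta_hat, Sigma) for a feature matrix X (n x k)
        A = X.T @ X + lam * np.eye(X.shape[1])
        Sigma = sigma2 * np.linalg.inv(A)
        beta_hat = np.linalg.solve(A, X.T @ y)
        return beta_hat, Sigma

    def predictive(beta_hat, Sigma, x, sigma2=0.1):
        return x @ beta_hat, sigma2 + x @ Sigma @ x   # predictive mean and variance of y*

    rng = np.random.default_rng(0)
    X = np.hstack([np.ones((30, 1)), rng.uniform(-2, 2, (30, 1))])   # bias + linear feature
    y = 0.5 + 1.5 * X[:, 1] + 0.3 * rng.standard_normal(30)
    beta_hat, Sigma = bayes_ridge(X, y)
    print(predictive(beta_hat, Sigma, np.array([1.0, 3.0])))         # larger variance when extrapolating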

Wrapup of Bayesian Ridge regression


• 1st insight: The neg-log posterior P (β | D) is equal to the cost function Lridge (β).

This is a very very common relation: optimization costs correspond to neg-log proba-
bilities; probabilities correspond to exp-neg costs.

• 2nd insight: The mean β̂ is exactly the classical argminβ Lridge (β)
More generally, the most likely parameter argmaxβ P (β|D) is also the least-cost param-
eter argminβ L(β). In the Gaussian case, most-likely β is also the mean.

• 3rd insight: The Bayesian inference approach not only gives a mean/optimal
β̂, but also a variance Σ of that estimate
This is a core benefit of the Bayesian view: It naturally provides a probability distribu-
tion over predictions (“error bars”), not only a single prediction.
8:9

Kernel Bayesian Ridge Regression


• As in the classical case, we can consider arbitrary features φ(x)
• .. or directly use a kernel k(x, x0 ):
      P(f(x) | D) = N(f(x) | φ(x)⊤β̂, φ(x)⊤Σφ(x))

      φ(x)⊤β̂ = φ(x)⊤X⊤(XX⊤ + λI)^-1 y
              = κ(x)(K + λI)^-1 y

      φ(x)⊤Σφ(x) = φ(x)⊤ σ²(X⊤X + λI)^-1 φ(x)
                 = (σ²/λ) φ(x)⊤φ(x) − (σ²/λ) φ(x)⊤X⊤(XX⊤ + λI_n)^-1 Xφ(x)
                 = (σ²/λ) k(x, x) − (σ²/λ) κ(x)(K + λI_n)^-1 κ(x)⊤

  3rd line: As on slide 05:2
  2nd-to-last line: Woodbury identity (A + UBV)^-1 = A^-1 − A^-1 U(B^-1 + VA^-1U)^-1 VA^-1 with A = λI

• In standard conventions λ = σ², i.e. P(β) = N(β|0, 1)


– Regularization: scale the covariance function (or features)
8:10

Gaussian Processes
are equivalent to Kernelized Bayesian Ridge Regression
(see also Welling: “Kernel Ridge Regression” Lecture Notes; Rasmussen & Williams sections 2.1 &
6.2; Bishop sections 3.3.3 & 6)

• But it is insightful to introduce them again from the “function space view”:
GPs define a probability distribution over functions; they are the infinite di-
mensional generalization of Gaussian vectors
8:11

Gaussian Processes – function prior


• The function space view
P (D|f ) P (f )
P (f |D) =
P (D)
• A Gaussian Processes prior P (f ) defines a probability distribution over func-
tions:
– A function is an infinite dimensional thing – how could we define a Gaussian dis-
tribution over functions?
– For every finite set {x1 , .., xM }, the function values f (x1 ), .., f (xM ) are Gaussian
distributed with mean and covariance
E{f (xi )} = µ(xi ) (often zero)
cov{f (xi ), f (xj )} = k(xi , xj )
Here, k(·, ·) is called covariance function
• Second, for Gaussian Processes we typically have a Gaussian data likelihood
P (D|f ), namely
P (y | x, f ) = N(y | f (x), σ 2 )
8:12

Gaussian Processes – function posterior


• The posterior P(f|D) is also a Gaussian Process, with the following mean of f(x) and
  covariance of f(x) and f(x′): (based on slide 8:10, with λ = σ²)

      E{f(x) | D} = κ(x)(K + λI)^-1 y + µ(x)
      cov{f(x), f(x′) | D} = k(x, x′) − κ(x)(K + λI_n)^-1 κ(x′)⊤
8:13
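A compact NumPy sketch of GP regression with a squared-exponential covariance and zero mean,
following the posterior formulas above (illustrative; gamma and σ² are arbitrary choices):

    import numpy as np

    def sqexp(A, B, gamma=1.0):
        sqd = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sqd)

    def gp_posterior(X, y, X_query, gamma=1.0, sigma2=0.1):
        # zero-mean GP; lambda = sigma^2 as in the slides
        K = sqexp(X, X, gamma)
        kappa = sqexp(X_query, X, gamma)
        G = K + sigma2 * np.eye(len(X))
        mean = kappa @ np.linalg.solve(G, y)
        cov = sqexp(X_query, X_query, gamma) - kappa @ np.linalg.solve(G, kappa.T)
        return mean, cov

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (25, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(25)
    Xq = np.linspace(-4, 4, 5)[:, None]
    mean, cov = gp_posterior(X, y, Xq)
    print(mean, np.sqrt(np.diag(cov)))               # error bars grow away from the data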

Gaussian Processes

(from Rasmussen & Williams)


8:14

GP: different covariance functions

(from Rasmussen & Williams)

• These are examples from the γ-exponential covariance function

k(x, x0 ) = exp{−|(x − x0 )/l|γ }

8:15

GP: derivative observations

(from Rasmussen & Williams)


8:16

• Bayesian Kernel Ridge Regression = Gaussian Process

• GPs have become a standard regression method


• If exact GP is not efficient enough, many approximations exist, e.g. sparse and
pseudo-input GPs
8:17

GP classification = Bayesian (Kernel) Logistic Regression


8:18

Bayesian Logistic Regression (binary case)


• f now defines a discriminative function:

      P(X) = arbitrary
      P(β) = N(β | 0, 2/λ) ∝ exp{−λ||β||²}
      P(Y = 1 | X, β) = σ(β⊤φ(x))

• Recall

      L^logistic(β) = − Σ_{i=1}^n log p(y_i | x_i) + λ||β||²

• Again, the parameter posterior is

      P(β|D) ∝ P(D | β) P(β) ∝ exp{−L^logistic(β)}

8:19

Bayesian Logistic Regression


• Use the Laplace approximation (2nd-order Taylor for L) at β* = argmin_β L(β):

      L(β) ≈ L(β*) + β̄⊤∇ + ½ β̄⊤H β̄ ,   β̄ = β − β*

  At β* the gradient ∇ = 0 and L(β*) = const. Therefore

      P̃(β|D) ∝ exp{−½ β̄⊤H β̄}
      ⇒ P(β|D) ≈ N(β | β*, H^-1)

• Then the predictive distribution of the discriminative function is also Gaussian!

      P(f(x) | D) = ∫_β P(f(x) | β) P(β | D) dβ
                  ≈ ∫_β N(f(x) | φ(x)⊤β, 0) N(β | β*, H^-1) dβ
                  = N(f(x) | φ(x)⊤β*, φ(x)⊤H^-1φ(x)) =: N(f(x) | f*, s²)

• The predictive distribution over the label y ∈ {0, 1}:

      P(y(x) = 1 | D) = ∫_{f(x)} σ(f(x)) P(f(x)|D) df
                      ≈ σ((1 + s²π/8)^{-1/2} f*)

  which uses a probit approximation of the convolution.
→ The variance s2 pushes the predictive class probabilities towards 0.5.
8:20

Kernelized Bayesian Logistic Regression


• As with Kernel Logistic Regression, the MAP discriminative function f ∗ can be
found iterating the Newton method ↔ iterating GP estimation on a re-weighted
data set.
• The rest is as above.
8:21

Kernel Bayesian Logistic Regression


is equivalent to Gaussian Process Classification

• GP classification became a standard classification method, if the prediction


needs to be a meaningful probability that takes the model uncertainty into ac-
count.
8:22

8.2 Bayesian Neural Networks


8:23

Bayesian Neural Networks


• Simple ways to get uncertainty estimates:
– Train ensembles of networks (e.g. bootstrap ensembles)
– Treat the output layer fully probabilistic (treat the trained NN body as feature vector
φ(x), and apply Bayesian Ridge/Logistic Regression on top of that)

• Ways to treat NNs inherently Bayesian:


– Infinite single-layer NN → GP (classical work in 80/90ies)
– Putting priors over weights (“Bayesian NNs”, Neil, MacKay, 90ies)
– Dropout (much more recent, see papers below)

• Read
Gal & Ghahramani: Dropout as a bayesian approximation: Representing model uncertainty in
deep learning (ICML’16)

Damianou & Lawrence: Deep gaussian processes (AIS 2013)


8:24

Dropout in NNs as Deep GPs


• Deep GPs are essentially a a chaining of Gaussian Processes
– The mapping from each layer to the next is a GP
– Each GP could have a different prior (kernel)

• Dropout in NNs
– Dropout leads to randomized prediction
– One can estimate the mean prediction from T dropout samples (MC estimate)
– Or one can estimate the mean prediction by averaging the weights of the network
(“standard dropout”)
– Equally one can MC estimate the variance from samples
– Gal & Ghahramani show, that a Dropout NN is a Deep GP (with very special ker-
pl2
nel), and the “correct” predictive variance is this MC estimate plus 2nλ (kernel
length scale l, regularization λ, dropout prob p, and n data points)
8:25

8.3 No Free Lunch**


8:26

No Free Lunch
• Averaged over all problem instances, any algorithm performs equally. (E.g.
equal to random.)
– “there is no one model that works best for every problem”
Igel & Toussaint: On Classes of Functions for which No Free Lunch Results Hold (Information Process-
ing Letters 2003)

• Rigorous formulations formalize this “average over all problem instances”. E.g.
by assuming a uniform prior over problems
  – In black-box optimization, a uniform distribution over underlying objective functions f(x)
  – In machine learning, a uniform distribution over the hidden true function f(x)
  ... and NFL always considers non-repeating queries.

• But what does a uniform distribution over functions mean?

• NFL is trivial: when any previous query yields NO information at all about the results
  of future queries, anything is exactly as good as random guessing
8:27

Conclusions
• Probabilistic inference is a very powerful concept!
– Inferring about the world given data
  – Learning, decision making, and reasoning can be viewed as forms of (probabilistic)
    inference

• We introduced Bayes’ Theorem as the fundamental form of probabilistic infer-


ence

• Marrying Bayes with (Kernel) Ridge (Logisic) regression yields


– Gaussian Processes
– Gaussian Process classification

• We can estimate uncertainty also for NNs


– Dropout
– Probabilistic weights and variational approximations; Deep GPs

• No Free Lunch for ML!


8:28

A Probability Basics

The need for modelling

• Given a real world problem, translating it to a well-defined learning problem


is non-trivial

• The “framework” of plain regression/classification is restricted: input x, out-


put y.

• Graphical models (probabilistic models with multiple random variables and de-
  pendencies) are a more general framework for modelling “problems”; regres-
  sion & classification become a special case; Reinforcement Learning, decision
  making, unsupervised learning, but also language processing, image segmen-
  tation, can be represented.
9:1

Outline

• Basic definitions
– Random variables
– joint, conditional, marginal distribution
– Bayes’ theorem
• Examples for Bayes
• Probability distributions [skipped, only Gauss]
– Binomial; Beta
– Multinomial; Dirichlet
– Conjugate priors
– Gauss; Wichart
– Student-t, Dirak, Particles
• Monte Carlo, MCMC [skipped]
These are generic slides on probabilities I use throughout my lecture. Only parts are
mandatory for the AI course.
9:2

Thomas Bayes (1702-–1761)

“Essay Towards Solving a


Problem in the Doctrine of
Chances”

• Addresses problem of inverse probabilities:


Knowing the conditional probability of B given A, what is the conditional prob-
ability of A given B?

• Example:
40% Bavarians speak dialect, only 1% of non-Bavarians speak (Bav.) dialect
Given a random German that speaks non-dialect, is he Bavarian?
(15% of Germans are Bavarian)
9:3

Inference
• “Inference” = Given some pieces of information (prior, observed variabes) what
is the implication (the implied information, the posterior) on a non-observed
variable

• Decision-Making and Learning as Inference:


– given pieces of information: about the world/game, collected data, assumed
model class, prior over model parameters
– make decisions about actions, classifier, model parameters, etc
9:4

Probability Theory
• Why do we need probabilities?
– Obvious: to express inherent stochasticity of the world (data)

• But beyond this: (also in a “deterministic world”):


– lack of knowledge!
– hidden (latent) variables

– expressing uncertainty
– expressing information (and lack of information)

• Probability Theory: an information calculus


9:5

Probability: Frequentist and Bayesian


• Frequentist probabilities are defined in the limit of an infinite number of trials
Example: “The probability of a particular coin landing heads up is 0.43”

• Bayesian (subjective) probabilities quantify degrees of belief


Example: “The probability of rain tomorrow is 0.3” – not possible to repeat “tomorrow”
9:6

A.1 Basic definitions


9:7

Probabilities & Random Variables


• For a random variable X with discrete domain dom(X) = Ω we write:
      ∀x∈Ω :  0 ≤ P(X = x) ≤ 1
      Σ_{x∈Ω} P(X = x) = 1

Example: A dice can take values Ω = {1, .., 6}.


X is the random variable of a dice throw.
P (X = 1) ∈ [0, 1] is the probability that X takes value 1.

• A bit more formally: a random variable is a map from a measureable space to a domain
(sample space) and thereby introduces a probability measure on the domain (“assigns a
probability to each possible value”)
9:8

Probabilty Distributions
• P (X = 1) ∈ R denotes a specific probability
P (X) denotes the probability distribution (function over Ω)

Example: A dice can take values Ω = {1, 2, 3, 4, 5, 6}.


By P(X) we describe the full distribution over possible values {1, .., 6}. These are 6
numbers that sum to one, usually stored in a table, e.g.: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

• In implementations we typically represent distributions over discrete random


variables as tables (arrays) of numbers

• Notation for summing over a RV:

  In equations we often need to sum over RVs. We then write
      Σ_X P(X) · · ·
  as shorthand for the explicit notation Σ_{x∈dom(X)} P(X = x) · · ·
9:9

Joint distributions
Assume we have two random variables X and Y

• Definitions:
      Joint:        P(X, Y)
      Marginal:     P(X) = Σ_Y P(X, Y)
      Conditional:  P(X|Y) = P(X,Y) / P(Y)

  The conditional is normalized: ∀Y : Σ_X P(X|Y) = 1

• X is independent of Y iff: P(X|Y) = P(X)
  (table thinking: all columns of P(X|Y) are equal)
9:10

Joint distributions
      joint:        P(X, Y)
      marginal:     P(X) = Σ_Y P(X, Y)
      conditional:  P(X|Y) = P(X,Y) / P(Y)

• Implications of these definitions:

  Product rule:    P(X, Y) = P(X|Y) P(Y) = P(Y|X) P(X)

  Bayes’ Theorem:  P(X|Y) = P(Y|X) P(X) / P(Y)

9:11

Bayes’ Theorem

      P(X|Y) = P(Y|X) P(X) / P(Y)

      posterior = likelihood · prior / normalization
9:12

Multiple RVs:
• Analogously for n random variables X1:n (stored as a rank n tensor)
      Joint:        P(X_1:n)
      Marginal:     P(X_1) = Σ_{X_2:n} P(X_1:n)
      Conditional:  P(X_1|X_2:n) = P(X_1:n) / P(X_2:n)

• X is conditionally independent of Y given Z iff:
      P(X|Y, Z) = P(X|Z)

• Product rule and Bayes’ Theorem:
      P(X_1:n) = Π_{i=1}^n P(X_i | X_{i+1:n})
      P(X_1|X_2:n) = P(X_2|X_1, X_3:n) P(X_1|X_3:n) / P(X_2|X_3:n)
      P(X, Z, Y) = P(X|Y, Z) P(Y|Z) P(Z)
      P(X|Y, Z) = P(Y|X, Z) P(X|Z) / P(Y|Z)
      P(X, Y|Z) = P(X, Z|Y) P(Y) / P(Z)
9:13

Example 1: Bavarian dialect


• 40% of Bavarians speak dialect, only 1% of non-Bavarians speak (Bav.) dialect
  Given a random German that speaks non-dialect, is he Bavarian?
  (15% of Germans are Bavarian)

      P(D = 1 | B = 1) = 0.4
      P(D = 1 | B = 0) = 0.01
      P(B = 1) = 0.15

  (graphical model: B → D)

  It follows
      P(B = 1 | D = 0) = P(D=0 | B=1) P(B=1) / P(D=0) = (0.6·0.15)/(0.6·0.15 + 0.99·0.85) ≈ 0.097
9:14

Example 2: Coin flipping

  Observed sequences, e.g.:  HHTHT  or  HHHHH   (data d_1, .., d_5)
  (graphical model: hypothesis H with children d_1, .., d_5)

• What process produces these sequences?

• We compare two hypotheses:

      H = 1 : fair coin          P(d_i = H | H = 1) = 1/2
      H = 2 : always-heads coin  P(d_i = H | H = 2) = 1

• Bayes’ theorem:
      P(H | D) = P(D | H) P(H) / P(D)
9:15

Coin flipping

  D = HHTHT

      P(D | H = 1) = 1/2⁵        P(H = 1) = 999/1000
      P(D | H = 2) = 0           P(H = 2) = 1/1000

      P(H = 1 | D) / P(H = 2 | D) = [P(D | H = 1) P(H = 1)] / [P(D | H = 2) P(H = 2)]
                                  = (1/32 · 999) / (0 · 1) = ∞
9:16

Coin flipping

  D = HHHHH

      P(D | H = 1) = 1/2⁵        P(H = 1) = 999/1000
      P(D | H = 2) = 1           P(H = 2) = 1/1000

      P(H = 1 | D) / P(H = 2 | D) = [P(D | H = 1) P(H = 1)] / [P(D | H = 2) P(H = 2)]
                                  = (1/32 · 999) / (1 · 1) ≈ 30
9:17

Coin flipping

  D = HHHHHHHHHH

      P(D | H = 1) = 1/2¹⁰       P(H = 1) = 999/1000
      P(D | H = 2) = 1           P(H = 2) = 1/1000

      P(H = 1 | D) / P(H = 2 | D) = [P(D | H = 1) P(H = 1)] / [P(D | H = 2) P(H = 2)]
                                  = (1/1024 · 999) / (1 · 1) ≈ 1

9:18

Learning as Bayesian inference

      P(World|Data) = P(Data|World) P(World) / P(Data)

  P(World) describes our prior over all possible worlds. Learning means to infer
  about the world we live in based on the data we have!

• In the context of regression, the “world” is the function f(x)

      P(f|Data) = P(Data|f) P(f) / P(Data)

P (f ) describes our prior over possible functions

Regression means to infer the function based on the data we have


9:19

A.2 Probability distributions


recommended reference: Bishop.: Pattern Recognition and Machine Learning
9:20

Bernoulli & Binomial


• We have a binary random variable x ∈ {0, 1} (i.e. dom(x) = {0, 1})
The Bernoulli distribution is parameterized by a single scalar µ,

      P(x = 1 | µ) = µ ,   P(x = 0 | µ) = 1 − µ
      Bern(x | µ) = µ^x (1 − µ)^{1−x}

• We have a data set of random variables D = {x_1, .., x_n}, each x_i ∈ {0, 1}. If
  each x_i ∼ Bern(x_i | µ) we have

      P(D | µ) = Π_{i=1}^n Bern(x_i | µ) = Π_{i=1}^n µ^{x_i} (1 − µ)^{1−x_i}

      argmax_µ log P(D | µ) = argmax_µ Σ_{i=1}^n [x_i log µ + (1 − x_i) log(1 − µ)] = (1/n) Σ_{i=1}^n x_i

• The Binomial distribution is the distribution over the count m = Σ_{i=1}^n x_i

      Bin(m | n, µ) = (n choose m) µ^m (1 − µ)^{n−m} ,   (n choose m) = n! / [(n − m)! m!]
9:21

Beta
How to express uncertainty over a Bernoulli parameter µ
• The Beta distribution is over the interval [0, 1], typically over the parameter µ of a
  Bernoulli:

      Beta(µ | a, b) = 1/B(a, b) · µ^{a−1} (1 − µ)^{b−1}

  with mean ⟨µ⟩ = a/(a+b) and mode µ* = (a−1)/(a+b−2) for a, b > 1

• The crucial point is:
  – Assume we are in a world with a “Bernoulli source” (e.g., binary bandit), but don’t
    know its parameter µ
  – Assume we have a prior distribution P(µ) = Beta(µ | a, b)
  – Assume we collected some data D = {x_1, .., x_n}, x_i ∈ {0, 1}, with counts
    a_D = Σ_i x_i of [x_i = 1] and b_D = Σ_i (1 − x_i) of [x_i = 0]
  – The posterior is

      P(µ | D) = P(D | µ)/P(D) · P(µ) ∝ Bin(D | µ) Beta(µ | a, b)
               ∝ µ^{a_D} (1 − µ)^{b_D} µ^{a−1} (1 − µ)^{b−1} = µ^{a−1+a_D} (1 − µ)^{b−1+b_D}
               = Beta(µ | a + a_D, b + b_D)

9:22

Beta

The prior is Beta(µ | a, b), the posterior is Beta(µ | a + aD , b + bD )



• Conclusions:
– The semantics of a and b are counts of [xi = 1] and [xi = 0], respectively
– The Beta distribution is conjugate to the Bernoulli (explained later)
– With the Beta distribution we can represent beliefs (state of knowledge) about un-
certain µ ∈ [0, 1] and know how to update this belief given data
9:23

Beta

from Bishop
9:24

Multinomial
• We have an integer random variable x ∈ {1, .., K}
The probability of a single x can be parameterized by µ = (µ1 , .., µK ):

      P(x = k | µ) = µ_k

  with the constraint Σ_{k=1}^K µ_k = 1 (probabilities need to be normalized)

• We have a data set of random variables D = {x_1, .., x_n}, each x_i ∈ {1, .., K}. If
  each x_i ∼ P(x_i | µ) we have

      P(D | µ) = Π_{i=1}^n µ_{x_i} = Π_{i=1}^n Π_{k=1}^K µ_k^{[x_i=k]} = Π_{k=1}^K µ_k^{m_k}

  where m_k = Σ_{i=1}^n [x_i = k] is the count of [x_i = k]. The ML estimator is

      argmax_µ log P(D | µ) = (1/n) (m_1, .., m_K)

• The Multinomial distribution is this distribution over the counts m_k

      Mult(m_1, .., m_K | n, µ) ∝ Π_{k=1}^K µ_k^{m_k}

9:25

Dirichlet
How to express uncertainty over a Multinomial parameter µ
• The Dirichlet distribution is over the K-simplex, that is, over µ_1, .., µ_K ∈ [0, 1]
  subject to the constraint Σ_{k=1}^K µ_k = 1:

      Dir(µ | α) ∝ Π_{k=1}^K µ_k^{α_k−1}

  It is parameterized by α = (α_1, .., α_K), has mean ⟨µ_i⟩ = α_i / Σ_j α_j and mode
  µ*_i = (α_i − 1)/(Σ_j α_j − K) for a_i > 1.

• The crucial point is:
  – Assume we are in a world with a “Multinomial source” (e.g., an integer bandit), but
    don’t know its parameter µ
  – Assume we have a prior distribution P(µ) = Dir(µ | α)
  – Assume we collected some data D = {x_1, .., x_n}, x_i ∈ {1, .., K}, with counts
    m_k = Σ_i [x_i = k]
  – The posterior is

      P(µ | D) = P(D | µ)/P(D) · P(µ) ∝ Mult(D | µ) Dir(µ | α)
               ∝ Π_{k=1}^K µ_k^{m_k} Π_{k=1}^K µ_k^{α_k−1} = Π_{k=1}^K µ_k^{α_k−1+m_k}
               = Dir(µ | α + m)

9:26

Dirichlet

The prior is Dir(µ | α), the posterior is Dir(µ | α + m)

• Conclusions:
– The semantics of α is the counts of [xi = k]
– The Dirichlet distribution is conjugate to the Multinomial
– With the Dirichlet distribution we can represent beliefs (state of knowledge) about
uncertain µ of an integer random variable and know how to update this belief given
data
9:27

Dirichlet
Illustrations for α = (0.1, 0.1, 0.1), α = (1, 1, 1) and α = (10, 10, 10):

from Bishop
9:28

Motivation for Beta & Dirichlet distributions


• Bandits:
– If we have binary [integer] bandits, the Beta [Dirichlet] distribution is a way to
represent and update beliefs
– The belief space becomes discrete: The parameter α of the prior is continuous, but
the posterior updates live on a discrete “grid” (adding counts to α)
– We can in principle do belief planning using this
• Reinforcement Learning:
– Assume we know that the world is a finite-state MDP, but do not know its transition
probability P (s0 | s, a). For each (s, a), P (s0 | s, a) is a distribution over the integer
s0
– Having a separate Dirichlet distribution for each (s, a) is a way to represent our
belief about the world, that is, our belief about P (s0 | s, a)
– We can in principle do belief planning using this → Bayesian Reinforcement Learning
• Dirichlet distributions are also used to model texts (word distributions in text),
images, or mixture distributions in general
9:29

Conjugate priors

• Assume you have data D = {x1 , .., xn } with likelihood

P (D | θ)

that depends on an uncertain parameter θ


Assume you have a prior P (θ)

• The prior P (θ) is conjugate to the likelihood P (D | θ) iff the posterior

P (θ | D) ∝ P (D | θ) P (θ)

is in the same distribution class as the prior P (θ)

• Having a conjugate prior is very convenient, because then you know how to
update the belief given data
9:30

Conjugate priors

      likelihood                      conjugate
      Binomial     Bin(D | µ)         Beta           Beta(µ | a, b)
      Multinomial  Mult(D | µ)        Dirichlet      Dir(µ | α)
      Gauss        N(x | µ, Σ)        Gauss          N(µ | µ_0, A)
      1D Gauss     N(x | µ, λ^-1)     Gamma          Gam(λ | a, b)
      nD Gauss     N(x | µ, Λ^-1)     Wishart        Wish(Λ | W, ν)
      nD Gauss     N(x | µ, Λ^-1)     Gauss-Wishart  N(µ | µ_0, (βΛ)^-1) Wish(Λ | W, ν)
9:31

A.3 Distributions over continuous domain


9:32

Distributions over continuous domain


• Let x be a continuous RV. The probability density function (pdf) p(x) ∈ [0, ∞)
  defines the probability

      P(a ≤ x ≤ b) = ∫_a^b p(x) dx ∈ [0, 1]

  The (cumulative) probability distribution F(y) = P(x ≤ y) = ∫_{−∞}^y dx p(x) ∈ [0, 1]
  is the cumulative integral with lim_{y→∞} F(y) = 1

  (In the discrete domain: probability distribution and probability mass function P(x) ∈ [0, 1] are
  used synonymously.)

• Two basic examples:

  Gaussian:  N(x | µ, Σ) = 1/|2πΣ|^{1/2} e^{−½(x−µ)⊤Σ^-1(x−µ)}

  Dirac or δ (“point particle”):  δ(x) = 0 except at x = 0,  ∫ δ(x) dx = 1
  δ(x) = ∂_x H(x)  where  H(x) = [x ≥ 0] = Heaviside step function
9:33
Gaussian distribution

• 1-dim:  N(x | µ, σ²) = 1/|2πσ²|^{1/2} e^{−½(x−µ)²/σ²}

• n-dim Gaussian in normal form:

      N(x | µ, Σ) = 1/|2πΣ|^{1/2} exp{−½ (x − µ)⊤ Σ^-1 (x − µ)}

  with mean µ and covariance matrix Σ. In canonical form:

      N[x | a, A] = exp{−½ a⊤A^-1a} / |2πA^-1|^{1/2} · exp{−½ x⊤A x + x⊤a}        (9)

  with precision matrix A = Σ^-1 and coefficient a = Σ^-1 µ (and mean µ = A^-1 a).
  Note: |2πΣ| = det(2πΣ) = (2π)^n det(Σ)

• Gaussian identities: see http://ipvs.informatik.uni-stuttgart.de/mlr/marc/


notes/gaussians.pdf
9:34

Gaussian identities
Symmetry: N(x | a, A) = N(a | x, A) = N(x − a | 0, A)

Product:
    N(x | a, A) N(x | b, B) = N[x | A^-1a + B^-1b, A^-1 + B^-1] N(a | b, A + B)
    N[x | a, A] N[x | b, B] = N[x | a + b, A + B] N(A^-1a | B^-1b, A^-1 + B^-1)

“Propagation”:
    ∫_y N(x | a + Fy, A) N(y | b, B) dy = N(x | a + Fb, A + FBF⊤)

Transformation:
    N(Fx + f | a, A) = (1/|F|) N(x | F^-1(a − f), F^-1AF^-⊤)

Marginal & conditional:
    N( (x, y) | (a, b), [[A, C], [C⊤, B]] ) = N(x | a, A) · N(y | b + C⊤A^-1(x − a), B − C⊤A^-1C)

More Gaussian identities: see http://ipvs.informatik.uni-stuttgart.de/mlr/marc/


notes/gaussians.pdf
9:35

Gaussian prior and posterior



• Assume we have data D = {x_1, .., x_n}, each x_i ∈ R^n, with likelihood

      P(D | µ, Σ) = Π_i N(x_i | µ, Σ)

      argmax_µ P(D | µ, Σ) = (1/n) Σ_{i=1}^n x_i

      argmax_Σ P(D | µ, Σ) = (1/n) Σ_{i=1}^n (x_i − µ)(x_i − µ)⊤

• Assume we are initially uncertain about µ (but know Σ). We can express this
  uncertainty using again a Gaussian N[µ | a, A]. Given data we have

      P(µ | D) ∝ P(D | µ, Σ) P(µ) = Π_i N(x_i | µ, Σ) N[µ | a, A]
               = Π_i N[µ | Σ^-1 x_i, Σ^-1] N[µ | a, A] ∝ N[µ | Σ^-1 Σ_i x_i, nΣ^-1 + A]

  Note: in the limit A → 0 (uninformative prior) this becomes

      P(µ | D) = N(µ | (1/n) Σ_i x_i, (1/n) Σ)

  which is consistent with the Maximum Likelihood estimator


9:36

Motivation for Gaussian distributions


• Gaussian Bandits
• Control theory, Stochastic Optimal Control
• State estimation, sensor processing, Gaussian filtering (Kalman filtering)
• Machine Learning
• etc
9:37

Particle Approximation of a Distribution


• We approximate a distribution p(x) over a continuous domain Rn

• A particle distribution q(x) is a weighted set S = {(x^i, w^i)}_{i=1}^N of N particles
  – each particle has a “location” x^i ∈ R^n and a weight w^i ∈ R
  – weights are normalized, Σ_i w^i = 1

      q(x) := Σ_{i=1}^N w^i δ(x − x^i)

  where δ(x − x^i) is the δ-distribution.

• Given weighted particles, we can estimate for any (smooth) f:

      ⟨f(x)⟩_p = ∫_x f(x) p(x) dx ≈ Σ_{i=1}^N w^i f(x^i)

See An Introduction to MCMC for Machine Learning www.cs.ubc.ca/˜nando/


papers/mlintro.pdf
9:38

Particle Approximation of a Distribution


Histogram of a particle representation:

9:39

Motivation for particle distributions


• Numeric representation of “difficult” distributions
– Very general and versatile
– But often needs many samples
• Distributions over games (action sequences), sample based planning, MCTS
• State estimation, particle filters
• etc
9:40

Utilities & Decision Theory



• Given a space of events Ω (e.g., outcomes of a trial, a game, etc) the utility is a
function
U : Ω→R
• The utility represents preferences as a single scalar – which is not always obvi-
ous (cf. multi-objective optimization)
• Decision Theory: making decisions (that determine p(x)) that maximize the expected
  utility
      E{U}_p = ∫_x U(x) p(x)
• Concave utility functions imply risk aversion (and convex, risk-taking)
9:41

Entropy
• The neg-log (− log p(x)) of a distribution reflects something like “error”:
– neg-log of a Guassian ↔ squared error
– neg-log likelihood ↔ prediction error

• The (− log p(x)) is the “optimal” coding length you should assign to a symbol
  x. This will minimize the expected length of an encoding

      H(p) = ∫_x p(x)[− log p(x)]

• The entropy H(p) = Ep(x) {− log p(x)} of a distribution p is a measure of uncer-


tainty, or lack-of-information, we have about x
9:42

Kullback-Leibler divergence
• Assume you use a “wrong” distribution q(x) to decide on the coding length of
  symbols drawn from p(x). The expected length of an encoding is

      ∫_x p(x)[− log q(x)] ≥ H(p)

• The difference

      D(p || q) = ∫_x p(x) log [p(x)/q(x)] ≥ 0

  is called Kullback-Leibler divergence

  Proof of the inequality, using the Jensen inequality:

      − ∫_x p(x) log [q(x)/p(x)] ≥ − log ∫_x p(x) [q(x)/p(x)] = 0
9:43

A.4 Monte Carlo methods


9:44

Monte Carlo methods


• Generally, a Monte Carlo method is a method to generate a set of (potentially
weighted) samples that approximate a distribution p(x).
In the unweighted case, the samples should be i.i.d. xi ∼ p(x)
In the general (also weighted) case, we want particles that allow to estimate
expectations of anything that depends on x, e.g. f (x):
N
X Z
lim hf (x)iq = lim wi f (xi ) = f (x) p(x) dx = hf (x)ip
N →∞ N →∞ x
i=1

In this view, Monte Carlo methods approximate an integral.


• Motivation: p(x) itself is too complicated to express analytically or compute
hf (x)ip directly

• Example: What is the probability that a game of solitaire comes out successful? (Original
story by Stan Ulam.) Instead of trying to compute this analytically, generate many
random solitaire games and count.
• Naming: The method was developed in the 1940s, when computers became faster. Fermi,
Ulam and von Neumann initiated the idea. von Neumann called it “Monte Carlo” as a
code name.
9:45

Rejection Sampling
• How can we generate i.i.d. samples xi ∼ p(x)?
• Assumptions:
– We can sample x ∼ q(x) from a simpler distribution q(x) (e.g., uniform), called the
  proposal distribution
– We can numerically evaluate p(x) for a specific x (even if we don’t have an analytic
  expression of p(x))
– There exists M such that ∀x : p(x) ≤ M q(x) (which implies q has support at least as
  large as p)

• Rejection Sampling:
  – Sample a candidate x ∼ q(x)
  – With probability p(x) / (M q(x)), accept x and add it to S; otherwise reject it
  – Repeat until |S| = n
• This generates an unweighted sample set S to approximate p(x)
9:46
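A minimal rejection-sampling sketch (illustrative, not lecture code): the target p is a mixture of two unit-variance Gaussians, the proposal q is a single broad Gaussian, and M is chosen such that p(x) ≤ M q(x):

import numpy as np

rng = np.random.default_rng(0)

def p(x):   # target density, evaluable pointwise
    return 0.5 * np.exp(-0.5 * (x + 2)**2) / np.sqrt(2 * np.pi) \
         + 0.5 * np.exp(-0.5 * (x - 2)**2) / np.sqrt(2 * np.pi)

def q(x):   # proposal density N(0, 3^2)
    return np.exp(-0.5 * (x / 3)**2) / (3 * np.sqrt(2 * np.pi))

M = 2.5     # a valid bound here: max_x p(x)/q(x) is roughly 1.9

S = []
while len(S) < 1000:
    x = rng.normal(0.0, 3.0)                  # sample a candidate x ~ q(x)
    if rng.random() < p(x) / (M * q(x)):      # accept with probability p(x) / (M q(x))
        S.append(x)
samples = np.array(S)                         # unweighted samples approximating p(x)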

Importance sampling
• Assumptions:
– We can sample x ∼ q(x) from a simpler distribution q(x) (e.g., uniform)
– We can numerically evaluate p(x) for a specific x (even if we don’t have an analytic
expression of p(x))
• Importance Sampling:
  – Sample a candidate x ∼ q(x)
  – Add the weighted sample (x, p(x)/q(x)) to S
  – Repeat n times
• This generates a weighted sample set S to approximate p(x)
  The weights w^i = p(x^i)/q(x^i) are called importance weights

• Crucial for efficiency: a good choice of the proposal q(x)


9:47
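A minimal importance-sampling sketch (illustrative): estimating E_p{f(x)} for p = N(0, 1) from samples of a wider proposal q = N(0, 2²); for f(x) = x² the exact value is 1:

import numpy as np

rng = np.random.default_rng(0)
n = 10000

x = rng.normal(0.0, 2.0, size=n)                      # x ~ q
log_p = -0.5 * x**2       - 0.5 * np.log(2 * np.pi)
log_q = -0.5 * (x / 2)**2 - 0.5 * np.log(2 * np.pi * 4)
w = np.exp(log_p - log_q)                             # importance weights p(x)/q(x)

f = lambda x: x**2
print(np.mean(w * f(x)))                              # ~ E_p{x^2} = 1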

Applications
• MCTS estimates the Q-function at branchings in decision trees or games
• Inference in graphical models (models involving many dependent random variables)
9:48

Some more continuous distributions


Gaussian       N(x | a, A) = (1 / |2πA|^(1/2)) exp(−(1/2) (x−a)^T A^-1 (x−a))
Dirac or δ     δ(x) = ∂_x H(x)
Student’s t    p(x; ν) ∝ [1 + x²/ν]^(−(ν+1)/2)   (= Gaussian for ν → ∞, otherwise heavy tails)
Exponential    p(x; λ) = [x ≥ 0] λ e^(−λx)   (distribution over single event time)
Laplace        p(x; µ, b) = (1/2b) e^(−|x−µ|/b)   (“double exponential”)
Chi-squared    p(x; k) ∝ [x ≥ 0] x^(k/2−1) e^(−x/2)
Gamma          p(x; k, θ) ∝ [x ≥ 0] x^(k−1) e^(−x/θ)
9:49

9 Exercises

9.1 Exercise 1

There will be no credit points for the first exercise – we’ll do them on the fly in class.
For those of you who have had lectures with me before this is redundant; you’re free to
skip the tutorial.

9.1.1 Reading: Pedro Domingos

Read at least until section 5 of Pedro Domingos’s A Few Useful Things to Know about
Machine Learning: http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf.
Be able to explain roughly what generalization and the bias-variance-tradeoff
(Fig. 1) are.

9.1.2 Matrix equations

a) Let X, A be arbitrary matrices, A invertible. Solve for X:

XA + A> = I

b) Let X, A, B, C be arbitrary matrices, (C − 2A>) invertible. Solve for X:

X>C = [2A(X + B)]>

c) Let x ∈ Rn , y ∈ Rd , A ∈ Rd×n . A is obviously not invertible, but let A>A be
invertible. Solve for x:

      (Ax − y)>A = 0>n

d) As above, additionally B ∈ Rn×n , B positive-definite. Solve for x:

(Ax − y)>A + x>B = 0>n

9.1.3 Vector derivatives

Let x ∈ Rn , y ∈ Rd , A ∈ Rd×n .

a) What is ∂/∂x x ? (Of what type/dimension is this thing?)
b) What is ∂/∂x [x>x] ?
c) Let B be symmetric (and pos.def.). What is the minimum of (Ax − y)>(Ax − y) +
x>Bx w.r.t. x?

9.1.4 Coding

Future exercises will need you to code some Machine Learning methods. You are
free to choose your programming language. If you’re new to numerics we rec-
ommend Python (SciPy & scikit-learn) or Matlab/Octave. I’ll support C++, but
recommend it really only to those familiar with C++.
To get started, try to just plot the data set
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/dataQuadReg2D.txt, e.g. in Octave:

D = importdata(’dataQuadReg2D.txt’);
plot3(D(:,1),D(:,2),D(:,3), ’ro’)

Or in Python

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
D = np.loadtxt(’dataQuadReg2D.txt’)
fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’)
ax.plot(D[:,0],D[:,1],D[:,2], ’ro’)
plt.show()

Or you can store the grid data in a file and use gnuplot, e.g.:

splot ’dataQuadReg2D.txt’ with points

For those using C++, download and test https://github.com/MarcToussaint/rai.
In particular, have a look at test/Core/array with many examples on how to
use the array class. Report on problems with installation.

9.2 Exercise 2
9.2.1 Getting Started with Ridge Regression (10 Points)

In the appendix you find starting-point implementations of basic linear regression


for Python, C++, and Matlab. These include also the plotting of the data and model.
Have a look at them, choose a language and understand the code in detail.

On the course webpage there are two simple data sets dataLinReg2D.txt and
dataQuadReg2D.txt. Each line contains a data entry (x, y) with x ∈ R2 and
y ∈ R; the last entry in a line refers to y.
a) The examples demonstrate plain linear regression for dataLinReg2D.txt. Ex-
tend them to include a regularization parameter λ. Report the squared error on the
full data set when trained on the full data set. (3 P)
b) Do the same for dataQuadReg2D.txt while first computing quadratic features.
(4 P)
c) Implement cross-validation (slide 02:17) to evaluate the prediction error of the
quadratic model for a third, noisy data set dataQuadReg2D_noisy.txt. Report
1) the squared error when training on all data (= training error), and 2) the mean
squared error ℓ̂ from cross-validation. (3 P)
Repeat this for different Ridge regularization parameters λ. (Ideally, generate a nice
bar plot of the generalization error, including deviation, for various λ.)
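For orientation, a minimal sketch of part a) (assuming X already contains the prepended 1-column as in the example code below, and lam is the regularization parameter to vary):

import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    I = np.eye(d)
    I[0, 0] = 0                    # optional choice: do not regularize the bias entry
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

def squared_error(X, y, beta):
    return np.sum((X @ beta - y) ** 2)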

Python (by Stefan Otte)

#!/usr/bin/env python
# encoding: utf-8
"""
NOTE: the operators + - * / are element wise operation. If you want
matrix multiplication use ‘‘dot‘‘ or ‘‘mdot‘‘!
"""
from __future__ import print_function
import numpy as np
from numpy import dot
from numpy.linalg import inv
from numpy.linalg import multi_dot as mdot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D
# 3D plotting
###############################################################################
# Helper functions
def prepend_one(X):
    """prepend a one vector to X."""
    return np.column_stack([np.ones(X.shape[0]), X])

def grid2d(start, end, num=50):
    """Create an 2D array where each row is a 2D coordinate.

    np.meshgrid is pretty annoying!
    """
    dom = np.linspace(start, end, num)
    X0, X1 = np.meshgrid(dom, dom)
    return np.column_stack([X0.flatten(), X1.flatten()])
###############################################################################
# load the data
data = np.loadtxt("dataLinReg2D.txt")
print("data.shape:", data.shape)
# split into features and labels
X, y = data[:, :2], data[:, 2]
print("X.shape:", X.shape)
print("y.shape:", y.shape)

# 3D plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’) # the projection arg is important!
ax.scatter(X[:, 0], X[:, 1], y, color="red")
ax.set_title("raw data")
plt.draw()
# show, use plt.show() for blocking

# prep for linear reg.


X = prepend_one(X)
print("X.shape:", X.shape)

# Fit model/compute optimal parameters beta


beta_ = mdot([inv(dot(X.T, X)), X.T, y])
print("Optimal beta:", beta_)
# prep for prediction
X_grid = prepend_one(grid2d(-3, 3, num=30))
print("X_grid.shape:", X_grid.shape)
# Predict with trained model
y_grid = dot(X_grid, beta_)
print("Y_grid.shape", y_grid.shape)
# vis the result
fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’) # the projection part is important
ax.scatter(X_grid[:, 1], X_grid[:, 2], y_grid) # don’t use the 1 infront
ax.scatter(X[:, 1], X[:, 2], y, color="red") # also show the real data
ax.set_title("predicted data")
plt.show()

C++
(by Marc Toussaint)

//install https://github.com/MarcToussaint/rai in $HOME/git and compile ’make -C rai/Core’


//export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/git/rai/lib
//g++ -I$HOME/git/rai/rai -L$HOME/git/rai/lib -fPIC -std=c++0x main.cpp -lCore

#include <Core/array.h>
//===========================================================================
void gettingStarted() {
//load the data
arr D = FILE("dataLinReg2D.txt");
//plot it
FILE("z.1") <<D;
gnuplot("splot ’z.1’ us 1:2:3 w p", true);

//decompose in input and output


uint n = D.d0; //number of data points
arr Y = D.sub(0,-1,-1,-1).reshape(n); //pick last column
arr X = catCol(ones(n,1), D.sub(0,-1,0,-2)); //prepend 1s to inputs
cout <<"X dim = " <<X.dim() <<endl;
cout <<"Y dim = " <<Y.dim() <<endl;
//compute optimal beta
arr beta = inverse_SymPosDef(˜X*X)*˜X*Y;
cout <<"optimal beta=" <<beta <<endl;
//display the function
arr X_grid = grid(2, -3, 3, 30);
X_grid = catCol(ones(X_grid.d0,1), X_grid);
cout <<"X_grid dim = " <<X_grid.dim() <<endl;
arr Y_grid = X_grid * beta;
cout <<"Y_grid dim = " <<Y_grid.dim() <<endl;
FILE("z.2") <<Y_grid.reshape(31,31);
gnuplot("splot ’z.1’ us 1:2:3 w p, ’z.2’ matrix us ($2/5-3):($1/5-3):3 w l", true);
cout <<"CLICK ON THE PLOT!" <<endl;
}

//===========================================================================
int main(int argc, char *argv[]) {
rai::initCmdLine(argc,argv);

gettingStarted();

return 0;
}

Matlab
(by Peter Englert)

clear;

% load the data


load(’dataLinReg2D.txt’);
% plot it
figure(1);clf;hold on;
plot3(dataLinReg2D(:,1),dataLinReg2D(:,2),dataLinReg2D(:,3),’r.’);
% decompose in input X and output Y
n = size(dataLinReg2D,1);
X = dataLinReg2D(:,1:2);
Y = dataLinReg2D(:,3);

% prepend 1s to inputs
X = [ones(n,1),X];
% compute optimal beta
beta = inv(X’*X)*X’*Y;
% display the function
[a b] = meshgrid(-2:.1:2,-2:.1:2);
Xgrid = [ones(length(a(:)),1),a(:),b(:)];
Ygrid = Xgrid*beta;
Ygrid = reshape(Ygrid,size(a));
h = surface(a,b,Ygrid);
view(3);
grid on;

9.3 Exercise 3
(BSc Data Science students may skip b) parts.)

9.3.1 Hinge-loss gradients (5 Points)

The function [z]+ = max(0, z) is called hinge. In ML, a hinge penalizes errors (when
z > 0) linearly, but raises no costs at all if z < 0.
Assume we have a single data point (x, y ∗ ) with class label y ∗ ∈ {1, .., M }, and the
discriminative function f (y, x). We penalize the discriminative function with the
one-vs-all hinge loss
      L_hinge(f) = ∑_{y≠y*} [1 − (f(y*, x) − f(y, x))]_+

a) What is the gradient ∂L_hinge(f)/∂f(y, x) of the loss w.r.t. the discriminative values? For
simplicity, distinguish the cases of taking the derivative w.r.t. f(y*, x) and w.r.t.
f(y, x) for y ≠ y*. (3 P)
b) Now assume the parametric model f(y, x) = φ(x)>β_y, where for every y we
have different parameters β_y ∈ Rd. And we have a full data set D = {(xi, yi)}ni=1
with class labels yi ∈ {1, .., M} and loss

      L_hinge(f) = ∑_{i=1}^n ∑_{y≠yi} [1 − (f(yi, xi) − f(y, xi))]_+

What is the gradient ∂L_hinge(f)/∂β_y ? (2 P)

9.3.2 Log-likelihood gradient and Hessian (5 Points)

Consider a binary classification problem with data D = {(xi, yi)}ni=1, xi ∈ Rd and
yi ∈ {0, 1}. We define

      f(x) = φ(x)>β                                                          (10)
      p(x) = σ(f(x)) ,   σ(z) = 1/(1 + e^−z)                                  (11)
      L_nll(β) = − ∑_{i=1}^n [ yi log p(xi) + (1 − yi) log[1 − p(xi)] ]       (12)

where β ∈ Rd is a vector. (NOTE: the p(x) we defined here is a short-hand for
p(y = 1|x) on slide 03:9.)
a) Compute the derivative ∂/∂β L(β). Tip: use the fact ∂/∂z σ(z) = σ(z)(1 − σ(z)). (3 P)
b) Compute the 2nd derivative ∂²/∂β² L(β). (2 P)
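A generic way to check your derivations from a) and b) numerically (the analytic gradient used here, X>(p − y), is the one that also appears in eq. (15) of the next exercise):

import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))                     # rows play the role of phi(x_i)
y = rng.integers(0, 2, size=n).astype(float)

def L(beta):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(beta):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (p - y)

beta = rng.normal(size=d)
eps = 1e-6
num = np.array([(L(beta + eps * e) - L(beta - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.max(np.abs(num - grad(beta))))         # should be close to zero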

9.4 Exercise 4
At the bottom several ’extra’ exercises are given. These are not part of the tutorial
and only for your interest.

(BSc Data Science students may skip preparing exercise 2.)

(There were questions about an API documentation of the C++ code. See
https://github.com/MarcToussaint/rai-maintenance/blob/master/help/doxygen.md.)

9.4.1 Logistic Regression (6 Points)

On the course webpage there is a data set data2Class.txt for a binary classifi-
cation problem. Each line contains a data entry (x, y) with x ∈ R2 and y ∈ {0, 1}.

a) Compute the optimal parameters β and the mean neg-log-likelihood −(1/n) log L(β)
for logistic regression using linear features. Plot the probability P(y = 1 | x) over a
2D grid of test points. (4 P)

• Recall the objective function, and its gradient and Hessian that we derived in
  the last exercise:

      L(β) = − ∑_{i=1}^n log P(yi | xi) + λ||β||²                                    (13)
           = − ∑_{i=1}^n [ yi log pi + (1 − yi) log[1 − pi] ] + λ||β||²               (14)
      ∇L(β) = ∂L(β)/∂β = ∑_{i=1}^n (pi − yi) φ(xi) + 2λIβ = X>(p − y) + 2λIβ          (15)
      ∇²L(β) = ∂²L(β)/∂β² = ∑_{i=1}^n pi (1 − pi) φ(xi) φ(xi)> + 2λI = X>W X + 2λI    (16)
      where p(x) := P(y = 1 | x) = σ(φ(x)>β),  pi := p(xi),  W := diag(p ◦ (1 − p))   (17)

• Setting the gradient equal to zero can’t be done analytically. Instead, optimal
  parameters can quickly be found by iterating Newton steps: For this, initialize
  β = 0 and iterate
      β ← β − (∇²L(β))^-1 ∇L(β) .                                                     (18)
  You usually need to iterate only a few times (∼10) until convergence.
• As you did for regression, plot the discriminative function f (x) = φ(x)>β or
the class probability function p(x) = σ(f (x)) over a grid.
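A minimal sketch of this Newton iteration with linear features φ(x) = (1, x1, x2) (file name as in the exercise; λ is a choice to tune):

import numpy as np

D = np.loadtxt('data2Class.txt')
X = np.column_stack([np.ones(len(D)), D[:, :2]])      # rows are phi(x_i)
y = D[:, 2]
lam = 1e-2

beta = np.zeros(X.shape[1])
for _ in range(10):                                   # ~10 Newton steps suffice
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (p - y) + 2 * lam * beta             # eq. (15)
    W = p * (1 - p)
    hess = (X.T * W) @ X + 2 * lam * np.eye(X.shape[1])   # eq. (16): X^T W X + 2 lambda I
    beta -= np.linalg.solve(hess, grad)               # eq. (18)

print(beta, -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))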

Useful gnuplot commands:

splot [-2:3][-2:3][-3:3.5] ’model’ matrix \


us ($1/20-2):($2/20-2):3 with lines notitle
plot [-2:3][-2:3] ’data2Class.txt’ \
us 1:2:3 with points pt 2 lc variable title ’train’

b) Compute and plot the same for quadratic features. (2 P)

9.4.2 Structured Output (4 Points)

(Warning: This is one of these exercises that do not have “one correct solution”.)
Consider data of tuples (x, y1 , y2 ) where

• x ∈ Rd is the age and some other features of a spotify user

• y1 ∈ R quantifies how much the user likes HipHop

• y2 ∈ R quantifies how much the user likes Classic

Naively one could train separate regressions x ↦ y1 and x ↦ y2. However, it
seems reasonable that somebody who likes HipHop might like Classic a bit less
than average (anti-correlated).
a) Properly define a representation and objective for modelling the prediction
x ↦ (y1, y2). (2 P)
Tip: Discriminative functions f(y, x) can not only be used to define a class predic-
tion F(x) = argmax_y f(y, x), but equally also continuous predictions where y ∈ R
or y ∈ R^O. How can you set up and parameterize a discriminative function for the
mapping x ↦ (y1, y2)?
b) Assume you would not only like to make a deterministic prediction x ↦ (y1, y2),
but map x to a probability distribution p(y1, y2|x), where presumably y1 and y2
would be anti-correlated. How can this be done? (2 P)
c) Extra: Assume that the data set would contain a third output variable y3 ∈ {0, 1},
e.g. y3 may indicate whether the user pays for music. How could you set up learn-
ing a model x ↦ (y1, y2, y3)?
d) Extra: General discussion: Based on this, how are regression, classification, and
conditional random fields related?

9.4.3 Extra: Discriminative Function in Logistic Regression

Logistic Regression (slide 03:19) defines class probabilities as proportional to the


exponential of a discriminative function:

      P(y|x) = exp f(x, y) / ∑_{y'} exp f(x, y')

Prove that, in the binary classification case, you can assume f(x, 0) = 0 without
loss of generality.
This results in

      P(y = 1|x) = exp f(x, 1) / (1 + exp f(x, 1)) = σ(f(x, 1)).

(Hint: first assume f(x, y) = φ(x, y)>β, and then define a new discriminative func-
tion f' as a function of the old one, such that f'(x, 0) = 0 and for which P(y|x)
maintains the same expressibility.)

9.4.4 Extra: Special cases of CRFs

Slide 03:31 summarizes the core equations for CRFs.


a) Confirm the given equations for ∇_β Z(x, β) and ∇²_β Z(x, β) (i.e., derive them from
the definition of Z(x, β)).
b) Binary logistic regression is clearly a special case of CRFs. Sanity check that the
gradient and Hessian given on slide 03:20 can alternatively be derived from the
general expressions for ∇_β L(β) and ∇²_β L(β) on slide 03:31. (The same is true for the
multi-class case.)
c) Prove that ordinary ridge regression is a special case of CRFs if you choose the
discriminative function f(x, y) = −y² + 2yφ(x)>β.

9.5 Exercise 5

In these two exercises you’ll program a NN from scratch, use neural random fea-
tures for classification, and train it too. Don’t use tensorflow yet, but the same
language you used for standard regression & classification. Take slide 04:14 as ref-
erence for NN equations.
(DS BSc students may skip 2 b-c, i.e. should at least try to code/draft also the back-
ward pass, but ok if no working solutions.)

9.5.1 Programming your own NN – NN initialization and neural random fea-


tures (5 Points)

(Such an approach was (once) also known as Extreme Learning.)


A standard NN structure is described by h0:L , which describes the dimensionality
of the input (h0 ), the dimensionality of all hidden layers (h1:L-1 ), and the dimen-
sionality of the output hL .
a) Code a routine “forward(x, β)” that computes fβ(x), the forward propagation of
the network, for a NN with given structure h0:L . Note that for each layer l = 1, .., L
we have parameters Wl ∈ R^(hl × hl-1) and bl ∈ R^(hl). Use the leaky ReLU activation
function. (2 P)
b) Write a method that initializes all weights such that for each neuron, the zi = 0
hyperplane is located randomly, with random orientation and random offset (fol-
low slide 04:21). Namely, choose each Wl,i· as Gaussian with sdv 1/√(hl-1), and
choose the biases bl,i ∼ U(−1, 1) uniform. (1 P)
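A minimal sketch of a) and b) (assuming a linear output layer; the structure h = [2, 300, 1] is just an example):

import numpy as np

def init_params(h, rng=np.random.default_rng(0)):
    params = []
    for l in range(1, len(h)):
        W = rng.normal(0.0, 1.0 / np.sqrt(h[l-1]), size=(h[l], h[l-1]))
        b = rng.uniform(-1.0, 1.0, size=h[l])
        params.append((W, b))
    return params

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)

def forward(x, params):
    """Return all layer activations x_0, .., x_L (last layer left linear)."""
    xs = [x]
    for l, (W, b) in enumerate(params):
        z = W @ xs[-1] + b
        xs.append(z if l == len(params) - 1 else leaky_relu(z))
    return xs

params = init_params([2, 300, 1])
xs = forward(np.array([0.3, -1.2]), params)     # xs[-2] are the features, xs[-1] the output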
c) Consider again the classification data set data2Class.txt, which we also used
in the previous exercise. In each line it has a two-dimensional input and the output
yi ∈ {0, 1}.

Use your NN to map each input x to features φ(x) = xL-1 , then use these features
as input to logistic regression as done in the previous exercise. (Initialize a separate
β and optimize by iterating Newton steps.)
First consider just L = 2 (just one hidden layer and xL-1 are the features) and h1 =
300. (2 P)
Extra) How does it perform if we initialize all bl = 0? How would it perform if the
input were rescaled x ← 10^5 x? How does the performance vary with h1 and
with L?

9.5.2 Programming your own NN – Backprop & training (5 Points)

We now also train the network using backpropagation and hinge loss. We test again
on data2Class.txt. As this is a binary classification problem we only need one
output neuron fβ (x). If fβ (x) > 0 we classify 1, otherwise we classify 0.
Reuse the “forward(x, β)” coded above.
a) Code a routine “backward(δ_L+1, x, w)”, that performs the backpropagation steps
and collects the gradients dℓ/dW_l.
For this, let us use a hinge loss. In the binary case (when you use only one out-
put neuron), it is simplest to redefine y ∈ {−1, +1}, and define the hinge loss as
ℓ(f, y) = max(0, 1 − f y), which has the loss gradient δ_L = −y[1 − yf > 0] at the
output neuron.
Run forward and backward propagation for each x, y in the dataset, and sum up
the respective gradients. (2 P)
b) Code a routine which optimizes the parameters using gradient descent:

      ∀ l = 1, .., L :   W_l ← W_l − α dℓ/dW_l ,   b_l ← b_l − α dℓ/db_l

with step size α = .01. Run until convergence (should take a few thousand steps).
Print out the loss function ℓ at each 100th iteration, to verify that the parameter
optimization is indeed decreasing the loss. (2 P)
c) Run for h = (2, 20, 1) and visualize the prediction by plotting σ(fβ (x)) over a
2-dimensional grid. (1 P)

9.6 Exercise 6

(DS BSc students can skip the exercise 2b and 3.)



9.6.1 Getting started with tensorflow (4 Points)

Tensorflow (https://www.tensorflow.org/) is one of the state-of-the-art computation
graph libraries mostly used for implementing neural networks. We recommend
using the Python API for this example. Install the tensorflow library with pip
using the command pip install tensorflow --user or follow the instructions
on the webpage for your platform/language.
For the logistic regression, we used the following objective function:

      L(β) = − ∑_{i=1}^n [ yi log pi + (1 − yi) log[1 − pi] ]           (19)
      where p(x) := P(y = 1 | x) = σ(x>β) ,   pi := p(xi)               (20)

a) Implement the loss function by using standard tensor commands like


tf.math.sigmoid(), tf.tensordot(), tf.math.reduce_sum(). Define the
variables X ∈ R100×3 , y ∈ R100 and β ∈ R3 as tf.placeholder. Store the com-
putation graph and display it in a browser using tensorboard. (2 P)
Hints:
– You can save the computation graph by using the following command:
writer = tf.summary.FileWriter(’logs’, sess.graph)
– You can display it by running tensorboard from the command line:
tensorboard --logdir logs
– Then open the given url in a browser.

b) Run a session to compute the loss, gradient and hessian. Feed random values
into the input placeholders. Gradient and Hessian can be calculated by tf.gradients()
and tf.hessians(). Compare it to the analytical solution using the same ran-
dom values. (2 P)
Code calculating the analytical solutions of the loss, the gradient and the hessian in
python:

def numpy_equations(X, beta, y):
    p = 1. / (1. + np.exp(-np.dot(X, beta)))
    L = -np.sum(y * np.log(p) + ((1. - y) * np.log(1.-p)))
    dL = np.dot(X.T, p - y)
    W = np.identity(X.shape[0]) * p * (1. - p)
    ddL = np.dot(X.T, np.dot(W, X))
    return L, dL, ddL

9.6.2 Classification with NNs in tensorflow (6 Points)

Now you will directly use tensorflow commands for creating neural networks.

a) We want to verify our results for classification on the dataset “data2Class.txt” by


implementing the NN using tensorflow. Create two dense layers with ReLU acti-
vation function and h1 = h2 = 100. Map to one output neuron (i.e. h3 = 1) without
activation function. Display the computation graph as in the previous exercise. Use
tf.losses.hinge_loss() as a loss function and the Adam Optimizer to train
the network. Run the training and plot the final result. (3 P)
Hints:
– There are many tutorials online, also on www.tensorflow.org. HOWEVER, most
of them use the keras conventions to first abstractly declare the model structure, then
compile it into actual tensorflow structure (Factory pattern). You can use this, but to
really learn tensorflow we recommend using the direct tensorflow methods to create
models instead.
– Here is an example that declares an input variable, two hidden layers, an output layer,
a target variable, and a loss variable:
input = tf.placeholder(shape=[None,2], dtype=tf.float32)
target_output = tf.placeholder(’float’)

relu_layer_operation = tf.layers.Dense(100,
activation=tf.nn.leaky_relu,
kernel_initializer=tf.initializers.random_uniform(-.1,.1),
bias_initializer=tf.initializers.random_uniform(-1.,1.))

linear_layer_operation = tf.layers.Dense(1,
activation=None,
kernel_initializer=tf.initializers.random_uniform(-.1,.1),
bias_initializer=tf.initializers.random_uniform(-.01,.01))

hidden1 = relu_layer_operation(input)
hidden2 = relu_layer_operation(hidden1)
model_output = linear_layer_operation(hidden2)

loss = tf.reduce_mean(tf.losses.hinge_loss(logits=model_output, labels=target_output))

– Use any tutorial to realize the training of such a model.

b) Now we want to use a neural network on real images. Download the Bel-
giumTS1 dataset from: https://btsd.ethz.ch/shareddata/BelgiumTSC/BelgiumTSC_Training.zip
(training data) and https://btsd.ethz.ch/shareddata/BelgiumTSC/BelgiumTSC_Testing.zip
(test data). The dataset consists of traffic signs according to 62 different classes.
Create a neural network architecture and train it on the training dataset. You can
use any architecture you want but at least use one convolutional layer. Report the
classification error on the test set. (3 P)
Hints: Use tf.layers.Conv2D to create convolutional layers, and
tf.contrib.layers.flatten to reshape an image layer into a vector layer (as
input to a dense layer). The following code can be used to load data, rescale it and
display images:

1 Belgium traffic sign dataset; Radu Timofte*, Markus Mathias*, Rodrigo Benenson, and Luc Van

Gool, Traffic Sign Recognition - How far are we from the solution?, International Joint Conference on
Neural Networks (IJCNN 2013), August 2013, Dallas, USA

import os
import skimage
import skimage.data
from skimage import transform
from skimage.color import rgb2gray
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

def load_data(data_directory):
    directories = [d for d in os.listdir(data_directory)
                   if os.path.isdir(os.path.join(data_directory, d))]
    labels = []
    images = []
    for d in directories:
        label_directory = os.path.join(data_directory, d)
        file_names = [os.path.join(label_directory, f)
                      for f in os.listdir(label_directory)
                      if f.endswith(".ppm")]
        for f in file_names:
            images.append(skimage.data.imread(f))
            labels.append(int(d))
    return np.array(images), np.array(labels)

def plot_data(signs, labels):
    for i in range(len(signs)):
        plt.subplot(4, len(signs)//4 + 1, i+1)   # integer division for Python 3
        plt.axis('off')
        plt.title("Label {0}".format(labels[i]))
        plt.imshow(signs[i])
        plt.subplots_adjust(wspace=0.5)
    plt.show()

images, labels = load_data("./Training")

# display 30 random images
randind = np.random.randint(0, len(images), 30)
plot_data(images[randind], labels[randind])
images = rgb2gray(np.array([transform.resize(image, (50, 50)) for image in images]))  # convert to 50x50 grayscale

9.6.3 Bonus: Stochastic Gradient Descent (3 Points)

(Bonus means: The extra 3 points will count to your total, but not to the required
points (from which you eventually need 50%).)
We test SGD on a squared function f(x) = ½ x>Hx. A Newton method would need
access to the exact Hessian H and directly step to the optimum x* = 0. But SGD
only has access to an estimate of the gradient. Ensuring proper convergence is
much more difficult for SGD.
Let x ∈ Rd for d = 1000. Generate a sparse random matrix J ∈ Rn×d, for n = 10^5,
as follows: In each row, fill in 10 random numbers drawn from N(0, σ²) at random
places. Each row has either σ = 1 or σ = 100, chosen randomly. We now define
H = J>J. Note that H = ∑_{i=1}^n J_i> J_i is a sum of rank-1 matrices.

a) Given this setup, simulate a stochastic gradient descent. (2P)

1. Initialize x = 1d .

2. Choose K = 32 random integers ik ∈ {1, .., n}, where k = 1, .., K. These


indicate which data points we see in this iteration.

3. Compute the stochastic gradient estimate

       g = (1/K) ∑_{k=1}^K J_ik> (J_ik x)

   where J_ik is the ik-th row of J.

4. For logging: Compute the full error ℓ = (1/2n) x>Hx and the stochastic mini-batch
   error ℓ̂ = (1/2K) ∑_{k=1}^K (J_ik x)², and write them to a log file for later plotting.

5. Update x based on g using plain gradient descent with fixed step size, and
   iterate from step 2.
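A sketch of this simulation (scaled down so it runs quickly; the step size is a choice you will have to tune):

import numpy as np

rng = np.random.default_rng(0)
d, n, K, alpha = 100, 10000, 32, 5e-5        # use d=1000, n=10^5 for the actual exercise

J = np.zeros((n, d))                         # sparse random J, 10 nonzeros per row
for i in range(n):
    cols = rng.choice(d, size=10, replace=False)
    sigma = 100.0 if rng.random() < 0.5 else 1.0
    J[i, cols] = rng.normal(0.0, sigma, size=10)

x = np.ones(d)
log = []
for t in range(1000):
    idx = rng.integers(0, n, size=K)         # K random data points
    Ji = J[idx]
    g = Ji.T @ (Ji @ x) / K                  # stochastic gradient estimate
    log.append((0.5 * np.sum((J @ x)**2) / n,        # full error (1/2n) x^T H x
                0.5 * np.sum((Ji @ x)**2) / K))      # mini-batch error
    x -= alpha * g                           # plain gradient step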

b) Plot the learning curves, i.e., the full and the stochastic error. How well do they
match? In what sense does optimization converge? Discuss the stationary distribu-
tion of the optimum. (1P)
c) Extra: Test variants: exponential cooling of the learning rate, Nesterov momen-
tum, and ADAM.

9.7 Exercise 7
(DS BSc students can skip the exercise 3.)

9.7.1 Kernels and Features (4 Points)

Reconsider the equations for Kernel Ridge regression given on slide 05:3, and – if
features are given – the definition of the kernel function and κ(x) in terms of the
features as given on slide 05:2.
a) Prove that Kernel Ridge regression with the linear kernel function k(x, x') = 1 + x>x'
is equivalent to Ridge regression with linear features φ(x) = (1, x)>. (1 P)
b) In Kernel Ridge regression, the optimal function is of the form f(x) = κ(x)>α
and therefore linear in κ(x). In plain ridge regression, the optimal function is of the
form f(x) = φ(x)>β and linear in φ(x). Prove that choosing k(x, x') = (1 + x>x')²
implies that f(x) = κ(x)>α is a second order polynomial over x. (2 P)
c) Equally, note that choosing the squared exponential kernel k(x, x') = exp(−γ |x −
x'|²) implies that the optimal f(x) is linear in radial basis function (RBF) features.
Does this necessarily imply that Kernel Ridge regression with squared exponential
kernel, and plain Ridge regression with RBF features, are exactly equivalent?
(Equivalent means, have the same optimal function.) Distinguish the cases λ = 0
(no regularization) and λ > 0. (1 P)
(Voluntary: Practically test yourself on the regression problem from Exercise 2,
whether Kernel Ridge Regression and RBF features are exactly equivalent.)
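For the voluntary check, a minimal kernel ridge regression sketch with the squared exponential kernel (γ and λ are choices to tune; the data file is the one from Exercise 2):

import numpy as np

D = np.loadtxt('dataQuadReg2D.txt')
X, y = D[:, :2], D[:, 2]
gamma, lam = 1.0, 1e-2

def kernel(A, B):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

K = kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # f(x) = kappa(x)^T alpha

x_test = np.array([[0.0, 0.0]])
print(kernel(x_test, X) @ alpha)                       # prediction at a test point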

9.7.2 Kernel Construction (4 Points)

For a non-empty set X, a kernel is a symmetric function k : X × X → R. Note that


the set X can be arbitrarily structured (real vector space, graphs, images, strings
and so on). A very important class of useful kernels for machine learning are posi-
tive definite kernels. A kernel is called positive definite, if for all arbitrary finite sub-
sets {xi }ni=1 ⊆ X the corresponding kernel matrix K with elements Kij = k(xi , xj )
is positive semi-definite,
α ∈ Rn ⇒ α>Kα ≥ 0 . (21)

Let k1, k2 : X × X → R be two positive definite kernels. Often, one wants to
construct more complicated kernels out of existing ones. Prove that

1. k(x, x') = k1(x, x') + k2(x, x')

2. k(x, x') = c · k1(x, x') for c ≥ 0
3. k(x, x') = k1(x, x') · k2(x, x')
4. k(x, x') = k1(f(x), f(x')) for f : X → X

are positive definite kernels.

9.7.3 Kernel logistic regression (2 Points)

The “kernel trick” is generally applicable whenever the “solution” (which may be
the predictive function f^ridge(x), or the discriminative function, or principal com-
ponents...) can be written in a form that only uses the kernel function k(x, x'), but
never features φ(x) or parameters β explicitly.
Derive a kernelization of Logistic Regression. That is, think about how you could
perform the Newton iterations based only on the kernel function k(x, x').
Tips: Reformulate the Newton iterations

β ← β − (X>W X + 2λI)-1 [X>(p − y) + 2λIβ] (22)


using the two Woodbury identities
      (X>W X + A)^-1 X>W = A^-1 X>(X A^-1 X> + W^-1)^-1                     (23)
      (X>W X + A)^-1 = A^-1 − A^-1 X>(X A^-1 X> + W^-1)^-1 X A^-1           (24)

Note that you’ll need to handle the X>(p − y) and 2λIβ differently.
Then think about what is actually being iterated in the kernelized case: surely we
cannot iteratively update the optimal parameters, because we want to rewrite the
equations to never touch β or φ(x) explicitly.

9.8 Exercise 8
(DS BSc students should nominally achieve 8 Pts on this sheet.)

9.8.1 PCA derived (6 Points)

For data D = {xi }ni=1 , xi ∈ Rd , we introduced PCA as a method that finds lower-
dimensional representations zi ∈ Rp of each data point such that xi ≈ V zi +µ. PCA
chooses V, µ and zi to minimize the reproduction error
      ∑_{i=1}^n ||xi − (V zi + µ)||² .

We derive the solution here step by step.

1. Find the optimal latent representation zi of a data point xi as a function of V


and µ. (1P)
2. Find an optimal offset µ. (1P)
   (Hint: there is a whole subspace of solutions to this problem. Verify that your
   solution is consistent with (i.e. contains) µ = (1/n) ∑_i xi.)

3. Find optimal projection vectors {vi}_{i=1}^p, which make up the projection matrix

       V = (v1 . . . vp)                                                     (25)

   (2P)
   Guide:
   – Given a projection V, any vector can be split into orthogonal components which
     belong to the projected subspace and its complement (which we call W):
     x = V V>x + W W>x.
   – For simplicity, let us work with the centered datapoints x̃i = xi − µ.
   – The optimal projection V is the one which minimizes the discarded components
     W W>x̃i :

         V̂ = argmin_V ∑_{i=1}^n ||W W>x̃i||² = argmin_V ∑_{i=1}^n ||x̃i − V V>x̃i||²     (26)

   – Don’t try to solve this by computing gradients and setting them to zero. Instead, use the
     fact that V V> = ∑_{i=1}^p vi vi>, and the singular value decomposition of
     ∑_{i=1}^n x̃i x̃i> = X̃>X̃ = E D E>.

4. In the above, is the orthonormality of V an essential assumption? (1P)


5. Prove that you can compute the V also from the SVD of X (instead of X>X).
(1P)

9.8.2 PCA and reconstruction on the Yale face database (5 Points)

On the webpage find and download the Yale face database
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/yalefaces.tgz.
(Optionally use yalefaces_c, which is a slightly cleaned version of the same dataset.)
The file contains gif images of 165 faces.

1. Write a routine to load all images into a big data matrix X ∈ R165×77760 , where
each row contains a gray image.
In Octave, images can easily be read using I=imread("subject01.gif");
and imagesc(I);. You can loop over files using files=dir("."); and
files(:).name. Python tips:

import matplotlib.pyplot as plt


import scipy as sp
plt.imshow(plt.imread(file_name))

2. Compute the mean face µ = (1/n) ∑_i xi and center the whole data matrix,
   X̃ = X − 1_n µ>.

3. Compute the singular value decomposition X̃ = U DV > for the centered data
matrix.
In Octave/Matlab, use [U, S, V] = svd(X, "econ"), where the "econ"
ensures you don’t run out of memory.
In python, use

import scipy.sparse.linalg as sla


u, s, vt = sla.svds(X, k=num_eigenvalues)

4. Find the p-dimensional representations Z = X̃Vp , where Vp ∈ R77760×p con-


tains only the first p columns of V (Depending on which language / library
you use, verify that the eigenvectors are returned in eigenvalue-descending
order, otherwise you’ll have to find the correct eigenvectors manually). As-
sume p = 60. The rows of Z represent each face as a p-dimensional vector,
instead of a 77760-dimensional image.

5. Reconstruct all faces by computing X' = 1_n µ> + Z Vp> and display them; do
   they look ok? Report the reconstruction error ∑_{i=1}^n ||xi − x'i||².
   Repeat for various PCA-dimensions p = 5, 10, 15, 20 . . .
Repeat for various PCA-dimensions p = 5, 10, 15, 20 . . ..
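A generic sketch of steps 2–5 via the SVD (X here is a random stand-in for the 165 × 77760 face matrix):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(165, 500))
p = 60

mu = X.mean(axis=0)
Xc = X - mu                                    # centered data X tilde
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Vp = Vt[:p].T                                  # first p right singular vectors
Z = Xc @ Vp                                    # p-dimensional representations
X_rec = mu + Z @ Vp.T                          # reconstruction
print(np.sum((X - X_rec)**2))                  # reconstruction error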

9.9 Exercise 9
(DS BSc students may skip exercise 2, but still please read about Mixture of Gaus-
sians and the explanations below.)

Introduction
There is no lecture on Thursday. To still make progress, please follow this guide to
learn some new material yourself. The subject is k-means clustering and Mixture
of Gaussians.

k-means clustering: The method is fully described on slide 06:36 of the lecture.
Again, I present the method as derived from an optimality principle. Most other
references describe k-means clustering just as a procedure. Also wikipedia
https://en.wikipedia.org/wiki/K-means_clustering gives a more verbose
explanation of this procedure. In my words, this is the procedure:
– We have data D = {xi}ni=1, with xi ∈ Rd. We want to cluster the data into K different
  clusters. K is chosen ad-hoc.
– Each cluster is represented only by its mean (or center) µk ∈ Rd, for k = 1, .., K.
– We initially assign each µk to a random data point, µk ← xi with i ∼ U{1, .., n}.
– The algorithm also maintains an assignment mapping c : {1, .., n} → {1, .., K}, which
  assigns each data point xi to a cluster k = c(i).
– For given centers µk, we update all assignments using

      ∀i :  c(i) ← argmin_{c(i)} ∑_j (xj − µ_c(j))² = argmin_k (xi − µk)² ,

  which means we assign xi to the cluster with the nearest center µk.
– For given assignments c(i), we update all centers using

      ∀k :  µk ← argmin_{µk} ∑_i (xi − µ_c(i))² = (1/|c⁻¹(k)|) ∑_{i∈c⁻¹(k)} xi ,

  that is, we set the centers equal to the mean of the data points assigned to the cluster.
– The last two steps are iterated until the assignment does not vary.

Based on this, solve exercise 1.
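A minimal k-means sketch directly following this procedure (X would be the face data, or its principal components):

import numpy as np

def kmeans(X, K, iters=100, rng=np.random.default_rng(0)):
    mu = X[rng.choice(len(X), K, replace=False)]           # centers at random data points
    for _ in range(iters):
        c = np.argmin(((X[:, None, :] - mu[None])**2).sum(-1), axis=1)   # assignments
        new_mu = np.array([X[c == k].mean(0) if np.any(c == k) else mu[k]
                           for k in range(K)])              # centers = cluster means
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    err = np.sum((X - mu[c])**2)                            # clustering error
    return mu, c, err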

Mixture of Gaussians: Mixture of Gaussians (MoG) is very similar to k-means
clustering. Its objective (expected likelihood maximization) is based on a proba-
bilistic data model, which we do not go into in detail here. Slide 06:40 only gives the
relevant equations. A much more complete derivation of MoG as an instance of
Expectation Maximization is found in
https://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/15-MachineLearning/08-graphicalModels-Learning.pdf.
Bishop’s book https://www.microsoft.com/en-us/research/people/cmbishop/#!prml-book
also gives a very good introduction. But we only need the procedural understanding here:
– We have data D = {xi}ni=1, with xi ∈ Rd. We want to cluster the data into K different
  clusters. K is chosen ad-hoc.
– Each cluster is represented by its mean (or center) µk ∈ Rd and a covariance matrix Σk.
  This covariance matrix describes an ellipsoidal shape of each cluster.
– We initially assign each µk to a random data point, µk ← xi with i ∼ U{1, .., n}, and
  each covariance matrix to uniform (if the data is roughly uniform).
– The core difference to k-means: The algorithm also maintains a probabilistic (or soft)
  assignment mapping qi(k) ∈ [0, 1], such that ∑_{k=1}^K qi(k) = 1. The number qi(k) is the
  probability of assigning data xi to cluster k (or the probability that data xi originates
  from cluster k). So, each data index i is mapped to a probability over k, rather than a
  specific k as in k-means.
– For given cluster parameters µk, Σk, we update all the probabilistic assignments using

      ∀i,k :  qi(k) ← N(xi | µk, Σk) = (1 / |2πΣk|^(1/2)) exp(−(1/2) (xi − µk)> Σk^-1 (xi − µk))
      ∀i,k :  qi(k) ← qi(k) / ∑_{k'} qi(k')

  where the second line normalizes the probabilistic assignments to ensure ∑_{k=1}^K qi(k) = 1.
– For given probabilistic assignments qi(k), we update all cluster parameters using

      ∀k :  µk ← (1 / ∑_i qi(k)) ∑_i qi(k) xi
      ∀k :  Σk ← (1 / ∑_i qi(k)) ∑_i qi(k) xi xi> − µk µk> ,

  where µk is the weighted mean of the data assigned to cluster k (weighted with qi(k)),
  and similarly for Σk.
– In this description, I skipped another parameter, πk , which is less important and we can
discuss in class.

Based on this, solve exercise 2.
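A compact EM sketch that transcribes the two update steps above (the mixing weights πk are omitted, as in the description; the covariance initialization is the robust choice mentioned in exercise 2):

import numpy as np

def em_mog(X, K, iters=50, rng=np.random.default_rng(0)):
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)].astype(float)
    Sigma = np.array([np.cov(X.T) for _ in range(K)])
    for _ in range(iters):
        q = np.zeros((n, K))                       # E-step: probabilistic assignments
        for k in range(K):
            diff = X - mu[k]
            Sinv = np.linalg.inv(Sigma[k])
            norm = np.sqrt(np.linalg.det(2 * np.pi * Sigma[k]))
            q[:, k] = np.exp(-0.5 * np.sum(diff @ Sinv * diff, axis=1)) / norm
        q = np.maximum(q, 1e-300)                  # avoid division by zero
        q /= q.sum(axis=1, keepdims=True)
        for k in range(K):                         # M-step: update cluster parameters
            w = q[:, k]
            mu[k] = (w[:, None] * X).sum(0) / w.sum()
            Sigma[k] = (w[:, None, None] * np.einsum('ni,nj->nij', X, X)).sum(0) / w.sum() \
                       - np.outer(mu[k], mu[k])
    return mu, Sigma, q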


Exercise 3 is extra, meaning, it’s a great exercise, but beyond the default work scope.

9.9.1 Clustering the Yale face database (6 Points)

On the webpage find and download the Yale face database
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/yalefaces_cropBackground.tgz.
The file contains gif images of 136 faces.
We’ll cluster the faces using k-means in K = 4 clusters.

a) Compute a k-means clustering starting with random initializations of the centers.


Repeat k-means clustering 10 times. For each run, report on the clustering error
∑_i (xi − µ_c(i))² and pick the best clustering. Display the center faces µk and
perhaps some samples for each cluster. (3 P)
b) Repeat the above for various K and plot the clustering error over K. (1 P)
c) Repeat the above on the first 20 principal components of the data. Discussion
in the tutorial: Is PCA the best way to reduce dimensionality as a precursor to
k-means clustering? What would be the ‘ideal’ way to reduce dimensionality as
precursor to k-means clustering? (2 P)

9.9.2 Mixture of Gaussians (4 Points)

Download the data set mixture.txt from the course webpage, containing n =
300 2-dimensional points. Load it in a data matrix X ∈ Rn×2 .
a) Implement the EM-algorithm for a Gaussian Mixture on this data set. Choose
K = 3. Initialize by choosing the three means µk to be different randomly selected
data points xi (i random in {1, .., n}) and the covariances Σk = I (a more robust
choice would be the covariance of the whole data). Iterate EM starting with the first
E-step (computing probabilistic assignments) based on these initializations. Repeat
with random restarts—how often does it converge to the optimum? (3 P)
b) Do exactly the same, but this time initialize the posterior qi(k) randomly (i.e.,
assign each point to a random cluster: for each point xi select k' = rand(1 : K) and
set qi(k) = [k = k']); then start EM with the first M-step. Is this better or worse than
the previous way of initialization? (1 P)

9.9.3 Extra: Graph cut objective function & spectral clustering

One of the central messages of the whole course is: To solve (learning) problems,
first formulate an objective function that defines the problem, then derive algo-
rithms to find/approximate the optimal solution. That should also hold for clus-
tering.
k-means finds centers µk and assignments c : i ↦ k to minimize ∑_i (xi − µ_c(i))².

An alternative class of objective functions for clustering are graph cuts. Consider
n data points with similarities wij, forming a weighted graph. We denote by W =
(wij) the symmetric weight matrix, and D = diag(d1, .., dn), with di = ∑_j wij, the
degree matrix. For simplicity we consider only 2-cuts, that is, cutting the graph in
two disjoint clusters, C1 ∪ C2 = {1, .., n}, C1 ∩ C2 = ∅. The normalized cut objective
is
      RatioCut(C1, C2) = (1/|C1| + 1/|C2|) ∑_{i∈C1, j∈C2} wij

a) Let fi = +√(|C2|/|C1|) for i ∈ C1 and fi = −√(|C1|/|C2|) for i ∈ C2 be a kind of
indicator function of the clustering. Prove that

      f>(D − W)f = n RatioCut(C1, C2)

b) Further prove that ∑_i fi = 0 and ∑_i fi² = n.
Note (to be discussed in the tutorial in more detail): Spectral clustering addresses

      min_f f>(D − W)f    s.t.   ∑_i fi = 0 ,  ||f||² = 1

by computing eigenvectors f of the graph Laplacian D − W with smallest eigen-


values. This is a relaxation of the above problem that minimizes over continuous
functions f ∈ Rn instead of discrete clusters C1 , C2 . The resulting eigen functions
are “approximate indicator functions of clusters”. The algorithm uses k-means
clustering in this coordinate system to explicitly decide on the clustering of data
points.

9.10 Exercise 10

(DS BSc students please try to complete the full exercise this time.)

9.10.1 Method comparison: kNN regression versus Neural Networks (5 Points)

k-nearest neighbor regression is a very simple lazy learning method: Given a data
set D = {(xi, yi)}ni=1 and query point x*, first find the k nearest neighbors K ⊂
{1, .., n}. In the simplest case, the output y = (1/K) ∑_{k∈K} yk is then the average of
these k nearest neighbors. In the classification case, the output is the majority vote
of the neighbors.
(To make this smoother, one can weigh each nearest neighbor based on the distance
|x∗ − xk |, and use local linear or polynomial (logistic) regression. But this is not
required here.)
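A minimal kNN majority-vote classifier for the comparison (a plain sketch; k is the hyperparameter to select, e.g. by cross-validation):

import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    preds = []
    for x in X_query:
        d = np.sum((X_train - x)**2, axis=1)       # squared distances to all training points
        nn = np.argsort(d)[:k]                     # indices of the k nearest neighbors
        votes = np.bincount(y_train[nn].astype(int))
        preds.append(np.argmax(votes))             # majority vote
    return np.array(preds)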
On the webpage there is a data set data2ClassHastie.txt. Your task is to com-
pare the performance of kNN classification (with basic kNN majority voting) with
a neural network classifier. (If you prefer, you can compare kNN against another
classifier such as logistic regression with RBF features, instead of neural networks.
The class boundaries are non-linear in x.)
As part of this exercise, discuss how a fair and rigorous comparison between two
ML methods is done.

9.10.2 Gradient Boosting for classification (5 Points)

Consider the following weak learner for classification: Given a data set D = {(xi, yi)}ni=1,
yi ∈ {−1, +1}, the weak learner picks a single i* and defines the discriminative function

      f(x) = α exp(−(x − x_i*)² / 2σ²) ,

with fixed width σ and variable parameter α. Therefore, this weak learner is param-
eterized only by i* and α ∈ R, which are chosen to minimize the neg-log-likelihood

      L_nll(f) = − ∑_{i=1}^n log σ(yi f(xi)) .

a) Write down an explicit pseudo code for gradient boosting with this weak learner.
By “pseudo code” I mean explicit equations for every step that can directly be im-
plemented. This needs to be specific for this particular learner and loss. (3 P)
b) Here is a 1D data set, where ◦ are 0-class, and × 1-class data points. “Simulate”
the algorithm graphically on paper. (2 P)
(Figure omitted: 1D data points, annotated with the kernel “width” and which point to “choose first”.)
Extra) If we replaced the neg-log-likelihood by a hinge loss, what would be
the relation to SVMs?

9.11 Exercise 11

(DS BSc students may skip coding exercise 3, but should be able to draw on the
board what the result would look like.)

9.11.1 Sum of 3 dice (3 Points)

You have 3 dice (potentially fake dice where each one has a different probability
table over the 6 values). You’re given all three probability tables P(D1), P(D2), and
P(D3). Write down the equations and an algorithm (in pseudo code) that computes
the conditional probability P(S|D1) of the sum of all three dice conditioned on the
value of the first die.

9.11.2 Product of Gaussians (3 Points)

A Gaussian distribution over x ∈ Rn with mean µ and covariance matrix Σ is de-
fined as

      N(x | µ, Σ) = (1 / |2πΣ|^(1/2)) exp(−(1/2) (x − µ)> Σ^-1 (x − µ))

Multiplying probability distributions is a fundamental operation, and multiplying
two Gaussians is needed in many models. From the definition of an n-dimensional
Gaussian, prove the general rule

      N(x | a, A) N(x | b, B) ∝ N(x | (A^-1 + B^-1)^-1 (A^-1 a + B^-1 b), (A^-1 + B^-1)^-1) ,

where the proportionality ∝ allows you to drop all terms independent of x.

Note: The so-called canonical form of a Gaussian is defined as N[x | ā, Ā] = N(x | Ā^-1 ā, Ā^-1);
in this convention the product reads much nicer: N[x | ā, Ā] N[x | b̄, B̄] ∝ N[x | ā + b̄, Ā + B̄].
You can first prove this before proving the above, if you like.

9.11.3 Gaussian Processes (5 Points)

Consider a Gaussian Process prior P (f ) over functions defined by the mean func-
tion µ(x) = 0, the γ-exponential covariance function

k(x, x') = exp{−|(x − x')/l|^γ}

and an observation noise σ = 0.1. We assume x ∈ R is 1-dimensional. First consider


the standard squared exponential kernel with γ = 2 and l = 0.2.
a) Assume we have two data points (−0.5, 0.3) and (0.5, −0.1). Display the poste-
rior P (f |D). For this, compute the mean posterior function fˆ(x) and the standard
deviation function σ̂(x) (on the 100 grid points) exactly as on slide 08:10, using
λ = σ 2 . Then plot fˆ, fˆ + σ̂ and fˆ − σ̂ to display the posterior mean and standard
deviation. (3 P)
b) Now display the posterior P(y*|x*, D). This is only a tiny difference from the
above (see slide 08:8). The mean is the same, but the variance of y* additionally
includes the observation noise σ². (1 P)
c) Repeat a) & b) for a kernel with γ = 1. (1 P)
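A sketch of the posterior computation for a), using the standard GP / kernel ridge posterior formulas (double-check the exact form against slide 08:10):

import numpy as np

l, gamma, sigma = 0.2, 2.0, 0.1
lam = sigma**2

X = np.array([-0.5, 0.5])                 # training inputs
y = np.array([0.3, -0.1])                 # training targets
x_grid = np.linspace(-1, 1, 100)

def k(a, b):
    return np.exp(-np.abs((a[:, None] - b[None, :]) / l)**gamma)

K = k(X, X)
A = K + lam * np.eye(len(X))
kappa = k(x_grid, X)                                   # kappa(x) for all grid points
f_hat = kappa @ np.linalg.solve(A, y)                  # posterior mean
var_f = 1.0 - np.sum(kappa * np.linalg.solve(A, kappa.T).T, axis=1)   # k(x,x) = 1 here
std_f = np.sqrt(np.maximum(var_f, 0.0))                # posterior std of f
# for b): the predictive std of y* is np.sqrt(var_f + sigma**2)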

9.12 Exercise 12

All exercises are voluntary, for you to collect extra points.



9.12.1 Autoencoder (7 Points)

On the webpage find and download the Yale face database
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/yalefaces_cropBackground.tgz.
The file contains gif images of 136 faces.
We want to compare two methods (Autoencoder vs PCA) to reduce the dimen-
sionality of this dataset. This means that we want to create and train a neural net-
work to find a lower-dimensional representation of our data. Recall the slides and
exercises about dimensionality reduction, neural networks and especially Autoen-
coders (slide 06:10).
a) Create a neural network using tensorflow (or any other framework, e.g., keras)
which takes the images as input, creates a lower-dimensional representation with
dimensionality p = 60 in the hidden layer (i.e., a layer with 60 neurons) and outputs
the reconstructed images. The loss function should compare the original image
with the reconstructed one. After having trained the network, reconstruct all faces
and display some examples. Report the reconstruction error ∑_{i=1}^n ||xi − x'i||². (5P)
b) Use PCA to reduce the dimensionality of the dataset to p = 60 as well (e.g. use
your code from exercise e08:02). Reconstruct all faces using PCA and display some
examples. Report the reconstruction error ∑_{i=1}^n ||xi − x'i||². Compare the reconstruc-
tions and the error from PCA with the results from the Autoencoder. Which one
works better? (2P)
Extra) Repeat for various dimensions p = 5, 10, 15, 20 . . .

9.12.2 Special cases of CRFs (3 Points)

Slide 03:31 summarizes the core equations for CRFs.


a) Confirm the given equations for ∇_β Z(x, β) and ∇²_β Z(x, β) (i.e., derive them from
the definition of Z(x, β)). (1P)
b) Binary logistic regression is clearly a special case of CRFs. Sanity check that the
gradient and Hessian given on slide 03:20 can alternatively be derived from the
general expressions for ∇_β L(β) and ∇²_β L(β) on slide 03:31. (The same is true for the
multi-class case.) (1P)
c) Prove that ordinary ridge regression is a special case of CRFs if you choose the
discriminative function f(x, y) = −y² + 2yφ(x)>β. (1P)

Exam Preparation Tip: These are the headings of all questions that appeared in
previous exams – no guarantee that similar ones might appear!
– Gymnastics
– True or false?
– Bayesian reasoning
– Dice rolling
– Coin flipping
– Gaussians & Bayes
– Linear Regression
– Features
– Features & Kernels
– Unusual Kernels
– Empirically Estimating Variance
– Logistic regression
– Logistic regression & log-likelihood gradient
– Discriminative Function
– Principle Component Analysis
– Clustering
– Neural Network
– Bayesian Ridge Regression and Gaussian Processes
– Ridge Regression in the Bayesian view
– Bayesian Predictive Distribution
– Bootstrap & combining learners
– Local Learning
– Boosting
– Joint Clustering & Regression

Not relevant this year (was focus in earlier years)


– Probabilistic independence
– Logistic regression & SVM
– Graphical models & Inference
– Graphical models and factor graphs
– Random forests
Index
Accuracy, Precision, Recall (3:8)
AdaBoost** (7:17)
Agglomerative Hierarchical Clustering** (6:43)
Autoencoders (6:10)
Bagging (7:12)
Bayes’ Theorem (9:12)
Bayesian (kernel) logistic regression (8:19)
Bayesian (kernel) ridge regression (8:5)
Bayesian Kernel Logistic Regression (8:22)
Bayesian Kernel Ridge Regression (8:12)
Bayesian Neural Networks (8:24)
Bernoulli and Binomial distributions (9:21)
Beta (9:22)
Boosting (7:16)
Boosting decision trees** (7:31)
Bootstrapping (7:12)
Centering and whitening** (6:45)
Chain rules (4:28)
Computation Graph (4:27)
Conditional distribution (9:10)
Conditional random fields** (3:27)
Conjugate priors (9:30)
Convolutional NN (4:31)
Cross entropy (3:11)
Cross validation (2:17)
data augmentation (4:11)
Decision trees** (7:26)
Dirac distribution (9:33)
Dirichlet (9:26)
Discriminative function (3:3)
Dropout for Uncertainty Prediction (8:25)
Dual formulation of Lasso regression** (2:25)
Dual formulation of ridge regression** (2:24)
Entropy (9:42)
Epanechnikov quadratic smoothing kernel (7:4)
Estimator variance (2:12)
Factor analysis** (6:34)
Feature selection** (2:19)
Features (2:6)
Gated Recurrent Units** (4:38)
Gaussian (9:34)
Gaussian mixture model (6:41)
Gaussian Process (8:12)
Gaussian Process Classification (8:22)
Gradient boosting (7:21)
Hinge loss (3:12)
Historical Discussion** (4:22)
Importance sampling (9:47)
Independent component analysis** (6:13)
Inference: general meaning (9:4)
ISOMAP** (6:29)
Joint distribution (9:10)
k-means clustering (6:37)
kd-trees (7:7)
Kernel PCA** (6:18)
Kernel ridge regression (5:1)
Kernel trick (5:1)
kNN smoothing kernel (7:4)
Kullback-Leibler divergence (9:43)
Lasso regularization (2:19)
Learning as Bayesian inference (9:19)
Learning as Inference (8:1)
Linear regression (2:3)
Local learning and ensemble approaches (7:1)
Log-likelihood (3:10)
Logistic regression (binary) (3:18)
Logistic regression (multi-class) (3:14)
LSTM** (4:36)
Marginal (9:10)
Model averaging (7:14)
Monte Carlo methods (9:45)
Multidimensional scaling** (6:26)
Multinomial (9:25)
Multiple RVs, conditional independence (9:13)
Neg-log-likelihood (3:10)
Neural Network function class (4:3)
NN back propagation (4:13)
NN Dropout (4:10)
NN gradient (4:13)
NN initialization (4:20)
NN loss functions (4:9)
NN regularization (4:10)
No Free Lunch (8:27)
Non-negative matrix factorization** (6:32)
one-hot-vector (3:11)
Partial least squares (PLS)** (6:14)
Particle approximation of a distribution (9:38)
Predictive distribution (8:8)
Principle Component Analysis (PCA) (6:3)
Probability distribution (9:9)
Random forests** (7:32)
Random variables (9:8)
Regularization (2:11)
Rejection sampling (9:46)
Ridge regression (2:15)
Ridge regularization (2:15)
Smoothing kernel (7:3)
Spectral clustering** (6:22)
Stochastic Gradient Descent (4:16)
Structured output and structured input (3:1)
Student’s t, Exponential, Laplace, Chi-squared, Gamma distributions (9:49)
Utilities and Decision Theory (9:41)
Weak learners as features (7:15)
