Lecture Machine Learning
Marc Toussaint
July 11, 2019
This is a direct concatenation and reformatting of all lecture slides and exercises from
the Machine Learning course (summer term 2019, U Stuttgart), including indexing to
help prepare for exams.
Double-starred** sections and slides are not relevant for the exam.
Contents
1 Introduction 4
2 Regression 12
Linear regression (2:3) Features (2:6) Regularization (2:11) Estimator variance (2:12) Ridge regularization (2:15) Ridge regression (2:15) Cross validation (2:17) Lasso regularization (2:19) Feature selection** (2:19) Dual formulation of ridge regression** (2:24) Dual formulation of Lasso regression** (2:25)
4 Neural Networks 35
Neural Network function class (4:3) NN loss functions (4:9) NN regularization (4:10) NN Dropout (4:10) data augmentation (4:11) NN gradient (4:13) NN back propagation (4:13) Stochastic Gradient Descent (4:16) NN initialization (4:20) Historical Discussion** (4:22)
4.1 Computation Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Computation Graph (4:27) Chain rules (4:28)
5 Kernelization 50
Kernel trick (5:1) Kernel ridge regression (5:1)
6 Unsupervised Learning 54
A Probability Basics 95
Inference: general meaning (9:4)
9 Exercises 113
9.1 Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.2 Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.3 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
9.4 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
9.5 Exercise 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.6 Exercise 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.7 Exercise 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.8 Exercise 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.9 Exercise 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
9.10 Exercise 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.11 Exercise 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
9.12 Exercise 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Index 138
1 Introduction
• Given data $D = \{(x_i, y_i)\}_{i=1}^n$, the standard objective is to minimize the "error" on the data,
  $f^* = \text{argmin}_{f\in\mathcal{H}} \sum_{i=1}^n \ell(f(x_i), y_i)$ ,
where $\ell(\hat y, y) > 0$ penalizes a discrepancy between a model output $\hat y$ and the data $y$.
– Squared error $\ell(\hat y, y) = (\hat y - y)^2$
– Classification error $\ell(\hat y, y) = [\hat y \ne y]$
– neg-log likelihood $\ell(\hat y, y) = -\log p(y\,|\,\hat y)$
– etc.
1:5
• Active Learning, where the “ML agent” makes decisions about what data label
to query next
• Bandits, Reinforcement Learning, manipulating the domain (and thereby data
source)
1:6
Face recognition
eigenfaces
keypoints
Gene annotation
Speech recognition
1:11
Spam filters
1:12
Neither alone.
  $\text{argmin}_x\; L(x)$
• Find
  $\min_x ||y - Ax||^2 \;\;\text{s.t.}\;\; x_i \le 1$
  $p(y\,|\,x) = \frac{e^{f(y,x)}}{\sum_{y'} e^{f(y',x)}}$
• Let A be the covariance matrix of a Gaussian. What does the Singular Value
Decomposition A = V DV > tell us?
1:16
• Many exercises will implement algorithms we derived in the lecture and collect
experience on small data sets
• Choice of language is fully free. I support C++; tutors might prefer Python; Octave/Matlab or R are also good choices.
1:17
Books
Trevor Hastie, Robert Tibshirani and Jerome Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition.
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
(recommended: read the introductory chapter)
(this course will not go to the full depth in math of Hastie et al.)
1:18
Books
1:19
• more recently:
– David Barber: Bayesian Reasoning and Machine Learning
– Kevin Murphy: Machine learning: a Probabilistic Perspective
Organization
• Course Webpage:
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/19-MachineLearning/
– Slides, Exercises & Software (C++)
– Links to books and other resources
• For admin things, please first ask:
Carola Stahl, [email protected], Room 2.217
1:21
2 Regression
Linear Regression
• Notation:
– input vector x ∈ Rd
– output value y ∈ R
– parameters β = (β0 , β1 , .., βd )> ∈ Rd+1
– linear model
  $f(x) = \beta_0 + \sum_{j=1}^d \beta_j x_j$
• Given training data $D = \{(x_i, y_i)\}_{i=1}^n$ we define the least squares cost (or "loss")
  $L^{ls}(\beta) = \sum_{i=1}^n (y_i - f(x_i))^2$
2:3
Optimal parameters β
• Augment input vector with a 1 in front: $\bar x = (1, x) = (1, x_1, .., x_d)^\top \in \mathbb{R}^{d+1}$
  $\beta = (\beta_0, \beta_1, .., \beta_d)^\top \in \mathbb{R}^{d+1}$
  $f(x) = \beta_0 + \sum_{j=1}^d \beta_j x_j = \bar x^\top \beta$
• In matrix notation, $L^{ls}(\beta) = ||y - X\beta||^2$ with
  $X = \begin{pmatrix} \bar x_1^\top \\ \vdots \\ \bar x_n^\top \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ \vdots & & & & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,d} \end{pmatrix} , \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$
• Optimum:
  $0_d^\top = \frac{\partial L^{ls}(\beta)}{\partial\beta} = -2(y - X\beta)^\top X \;\iff\; 0_d = X^\top X\beta - X^\top y$
  $\hat\beta^{ls} = (X^\top X)^{-1} X^\top y$
2:4
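A minimal numpy sketch of this closed-form optimum; the random data below is only illustrative (in the exercises you would load e.g. dataLinReg2D.txt instead):

import numpy as np

# illustrative data: n inputs in R^2 and noisy linear outputs
n, d = 100, 2
X_raw = np.random.randn(n, d)
y = X_raw @ np.array([1.0, -2.0]) + 3.0 + 0.1 * np.random.randn(n)

X = np.column_stack([np.ones(n), X_raw])      # prepend the 1 (x-bar)
beta = np.linalg.solve(X.T @ X, X.T @ y)      # least squares optimum (X^T X)^{-1} X^T y
print("beta_ls:", beta)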
Non-linear features
• Replace the inputs $x_i \in \mathbb{R}^d$ by some non-linear features $\phi(x_i) \in \mathbb{R}^k$
  $f(x) = \sum_{j=1}^k \phi_j(x)\,\beta_j = \phi(x)^\top\beta$
• As before, the optimal $\beta$ is computed from the feature matrix, whose rows are now $\phi(x_i)^\top$ instead of $\bar x_i^\top$
• Cubic features: $\phi(x) = (.., x_1^3, x_1^2 x_2, x_1^2 x_3, .., x_d^3) \in \mathbb{R}^{1+d+\frac{d(d+1)}{2}+\frac{d(d+1)(d+2)}{6}}$
[Figure: quadratic feature map — the inputs $(1, x_1, .., x_d)$ are expanded by $\phi$ to $(1, x_1, .., x_d, x_1^2, x_1 x_2, .., x_d^2)$ and linearly combined with weights $\beta$ to give $f(x)$]
2:8
• Special case: use all training inputs $\{x_i\}_{i=1}^n$ as centers,
  $\phi(x) = \big(1,\; b(x, x_1),\; ..,\; b(x, x_n)\big)^\top$   ($n+1$ dimensional)
This is related to “kernel methods” and GPs, but not quite the same—we’ll discuss this
later.
2:9
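A minimal sketch of two such feature maps (quadratic, and RBF with the training inputs as centers); the Gaussian basis function and its width are assumptions, since the slide only names b(x, c):

import numpy as np

def phi_quad(x):
    """quadratic features (1, x_1,..,x_d, all products x_i x_j with i <= j)"""
    d = len(x)
    quad = [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return np.concatenate([[1.0], x, quad])

def phi_rbf(x, centers, width=1.0):
    """(1, b(x,x_1),..,b(x,x_n)) with Gaussian basis functions around the training inputs"""
    b = np.exp(-np.sum((centers - x)**2, axis=1) / (2 * width**2))
    return np.concatenate([[1.0], b])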
Features
• Polynomial
• Piece-wise
• Radial basis functions (RBF)
• Splines (see Hastie Ch. 5)
2:10
• Estimator variance:
When you repeat the experiment (keeping the underlying function fixed), the
regression always returns a different model estimate
2:11
Estimator variance
• Assumption:
– The data was noisy with variance Var{y} = σ 2 In
• In practice we don't know σ, but we can estimate it based on the deviation from the learnt model (with k = dim(β) = dim(φ)):
  $\hat\sigma^2 = \frac{1}{n-k}\sum_{i=1}^n (y_i - f(x_i))^2$
2:12
Estimator variance
• "Overfitting"
  – picking one specific data set $y \sim N(y_{mean}, \sigma^2 I_n)$
  ↔ picking one specific $\hat\beta \sim N(\beta_{mean}, (X^\top X)^{-1}\sigma^2)$
• Optimum:
β̂ ridge = (X>X + λI)-1 X>y
• The objective is now composed of two “potentials”: The loss, which depends
on the data and jumps around (introduces variance), and the regularization
penalty (sitting steadily at zero). Both are “pulling” at the optimal β → the
regularization reduces variance.
  $df(\lambda) = \sum_{j=1}^d \frac{d_j^2}{d_j^2 + \lambda}$   (where the $d_j$ are the singular values of $X$)
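A minimal numpy sketch of the ridge optimum above; leaving the intercept/bias feature unregularized is an assumption that matches the later Lasso formula (sum from j = 2):

import numpy as np

def ridge_fit(X, y, lam):
    """ridge optimum (X^T X + lambda I)^{-1} X^T y, not regularizing the first (bias) feature"""
    k = X.shape[1]
    I = np.eye(k)
    I[0, 0] = 0.0          # assumption: leave the intercept unregularized
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)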
• k-fold cross-validation:
D1 D2 ··· Di ··· Dk
2:18
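A minimal sketch of the k-fold procedure, reusing the ridge_fit sketch above; reporting mean and deviation of the squared test error follows the exercises:

import numpy as np

def cross_validation(X, y, lam, k=10, seed=0):
    """mean and std of the squared test error of ridge regression over k folds"""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[test] @ beta - y[test])**2))
    return np.mean(errors), np.std(errors)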
Lasso: L1 -regularization
$L^{lasso}(\beta) = \sum_{i=1}^n (y_i - \phi(x_i)^\top\beta)^2 + \lambda\sum_{j=2}^k |\beta_j|$
2:19
$L^q(\beta) = \sum_{i=1}^n (y_i - \phi(x_i)^\top\beta)^2 + \lambda\sum_{j=2}^k |\beta_j|^q$
Summary
• Representation: choice of features
f (x) = φ(x)>β
2:22
Summary
• Linear models on non-linear features—extremely powerful
– features: linear, polynomial, piece-wise linear, RBF, kernel
– objectives: Ridge regression, Lasso, classification* (*logistic regression)
• structured output:
Rd → binary class label {0, 1}
Rd → integer class label {1, 2, .., M }
Rd → sequence labelling y1:T
Rd → image labelling y1:W,1:H
Rd → graph labelling y1:N
• structured input:
relational database → R
labelled graph/sequence → R
3:1
Discriminative Function
• Represent a discrete-valued function F : Rd → Y via a discriminative func-
tion
f : Rd × Y → R
such that
  $F: x \mapsto \text{argmax}_y\, f(x, y)$
That is, a discriminative function f(x, y) maps an input x to an output
  $\hat y(x) = \text{argmax}_y\, f(x, y)$
• You can think of f (x, y) as M separate functions, one for each class y ∈ {1, .., M }. The highest one
determines the class prediction ŷ
• More examples: plot[-3:3] -x-2,0,x-2 splot[-3:3][-3:3] -x-y-2,0,x+y-2
3:4
• Example: for a 3-class problem with a scalar input x and quadratic features, φ(x, y) stacks the features (1, x, x²) into the block selected by the class indicators [y = 1], [y = 2], [y = 3],
which is equivalent to $f(x, y) = (1, x, x^2)\,\beta_y$ with separate parameters $\beta_y$ for each class.
3:5
Notes on features
• Features “connect” input and output. Each φj (x, y) allows f to capture a cer-
tain dependence between x and y
• If both x and y are discrete, a feature φj (x, y) is typically a joint indicator func-
tion (logical function), indicating a certain “event”
• Each weight βj mirrors how important/frequent/infrequent a certain depen-
dence described by φj (x, y) is
• −f(x, y) is also called energy, and this approach is also called energy-based modelling, esp. in neural modelling
3:6
– accuracy = (TP+TN)/n
– precision = TP/(TP+FP)   (TP+FP = classifier positives)
– recall (TP-rate) = TP/(TP+FN)   (TP+FN = data positives)
– FP-rate = FP/(FP+TN)   (FP+TN = data negatives)
• Such metrics may be our actual objective. But they are not differentiable.
For the purpose of ML, we need to define a “proxy” objective that is nice to
optimize.
Although the optimal separating boundaries are linear and linear discriminating func-
tions could represent them, the linear functions trained on class indicators fail to dis-
criminate.
→ squared error regression on class indicators is the “wrong objective”
3:9
Log-Likelihood
• The discriminative function f (y, x) not only defines the class prediction F (x);
we can additionally also define probabilities,
  $p(y\,|\,x) = \frac{e^{f(x,y)}}{\sum_{y'} e^{f(x,y')}}$
Cross Entropy
• This is the same as log-likelihood for categorical data, just a notational trick,
really.
• The categorical data yi ∈ {1, .., M } are class labels. But assume they are en-
coded in a one-hot-vector
• As a side note, the cross entropy measure would also work if the target ŷi are
probabilities instead of one-hot-vectors.
3:11
Hinge loss
• For a data point (x, y*), the one-vs-all hinge loss "wants" that f(y*, x) is larger than any other f(y, x), y ≠ y*, by a margin of 1.
  In other terms, it penalizes when f(y*, x) < f(y, x) + 1 for some y ≠ y*.
• It penalizes linearly, therefore the one-vs-all hinge loss is defined as
  $L^{hinge}(f) = \sum_{y\ne y^*} \big[1 - (f(y^*, x) - f(y, x))\big]_+$
• This is related to Support Vector Machines (only data points inside the margin
induce an error and gradient), and also to the Perceptron Algorithm
3:12
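A minimal numpy sketch of this loss and its (sub)gradient w.r.t. the M discriminative values for one data point:

import numpy as np

def one_vs_all_hinge(f, y_star):
    """f: array of values f(x, y) for y = 0..M-1; returns loss and d loss / d f"""
    margins = 1.0 - (f[y_star] - f)          # 1 - (f(y*,x) - f(y,x)) for every y
    margins[y_star] = 0.0                     # the true class does not compete with itself
    active = margins > 0                      # classes violating the margin
    loss = np.sum(margins[active])
    grad = np.zeros_like(f)
    grad[active] = 1.0                        # derivative w.r.t. each violating f(y,x)
    grad[y_star] = -np.sum(active)            # derivative w.r.t. f(y*,x)
    return loss, grad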
  $p(y\,|\,x) = \frac{e^{f(x,y)}}{\sum_{y'} e^{f(x,y')}} \;\;\leftrightarrow\;\; f(x,y) - f(x,z) = \log\frac{p(y\,|\,x)}{p(z\,|\,x)}$
3:14
Optimal parameters β
• Gradient:
  $\frac{\partial L^{logistic}(\beta)}{\partial\beta_c}^\top = \sum_{i=1}^n (p_{ic} - y_{ic})\,\phi(x_i) + 2\lambda I\beta_c = X^\top(p_c - y_c) + 2\lambda I\beta_c$
• Hessian:
  $H = \frac{\partial^2 L^{logistic}(\beta)}{\partial\beta_c\,\partial\beta_d} = X^\top W_{cd} X + 2[c = d]\,\lambda I$
  where $W_{cd}$ is diagonal with $W_{cd,ii} = p_{ic}([c = d] - p_{id})$
f (x, M ) ≡ 0 or βM ≡ 0
The other functions then have to be greater/less relative to this baseline.
• This is usually not done in the multi-class case, but almost always in the binary
case.
3:17
φ(x, y) = φ(x) [y = 1]
  $p(y{=}1\,|\,x) = \frac{e^{f(x,1)}}{1 + e^{f(x,1)}} = \sigma(f(x,1))$
with the logistic sigmoid function $\sigma(z) = \frac{e^z}{1+e^z} = \frac{1}{e^{-z}+1}$.
  $L^{logistic}(\beta) = -\sum_{i=1}^n \log p(y_i\,|\,x_i) + \lambda||\beta||^2$
  $\qquad\qquad\;\; = -\sum_{i=1}^n \Big[ y_i \log p(1\,|\,x_i) + (1-y_i)\log[1 - p(1\,|\,x_i)] \Big] + \lambda||\beta||^2$
3:18
Optimal parameters β
• Gradient (see exercises):
  $\frac{\partial L^{logistic}(\beta)}{\partial\beta}^\top = \sum_{i=1}^n (p_i - y_i)\,\phi(x_i) + 2\lambda I\beta = X^\top(p - y) + 2\lambda I\beta$
  where $p_i := p(y{=}1\,|\,x_i)$ and $X = \begin{pmatrix}\phi(x_1)^\top \\ \vdots \\ \phi(x_n)^\top\end{pmatrix} \in \mathbb{R}^{n\times k}$
• Hessian:
  $H = \frac{\partial^2 L^{logistic}(\beta)}{\partial\beta^2} = X^\top W X + 2\lambda I$
  with $W = \text{diag}(p\circ(1-p))$, that is, diagonal with $W_{ii} = p_i(1-p_i)$
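A minimal numpy sketch that plugs these expressions into Newton iterations (the same scheme as in exercise 4 later in these notes); the fixed number of iterations is an assumption:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, lam, iters=10):
    """X: n x k feature matrix, y in {0,1}^n; returns the regularized logistic beta"""
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(iters):                    # Newton iterations
        p = sigmoid(X @ beta)
        grad = X.T @ (p - y) + 2 * lam * beta
        W = np.diag(p * (1 - p))
        H = X.T @ W @ X + 2 * lam * np.eye(k)
        beta -= np.linalg.solve(H, grad)
    return beta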
Recap
3:22
3:23
• Text tagging
X = sentence
Y = tagging of each word
http://sourceforge.net/projects/crftagger
• Image segmentation
X = image
Y = labelling of each pixel
http://scholar.google.com/scholar?cluster=13447702299042713582
• Depth estimation
X = single image
Y = depth map
http://make3d.cs.cornell.edu/
3:24
3:25
$\text{argmax}_y\, f(x, y)$
• The name CRF describes that p(y|x) ∝ ef (x,y) defines a probability distribution
(a.k.a. random field) over the output y conditional to the input x. The word
“field” usually means that this distribution is structured (a graphical model;
see later part of lecture).
3:27
where each feature φj (x, y∂j ) depends only on a subset y∂j of labels. φj (x, y∂j )
effectively couples the labels y∂j . Then ef (x,y) is a factor graph.
3:28
[Figure: grid-structured factor graph coupling pixel labels $y_{ij}$ and pixel observations $x_{ij}$]
• Each black box corresponds to features φj (y∂j ) which couple neighboring pixel
labels y∂j
• Each gray box corresponds to features φj (xj , yj ) which couple a local pixel
observation xj with a pixel label yj
3:29
  $p(y|x) = \frac{e^{f(x,y)}}{\sum_{y'} e^{f(x,y')}} = e^{f(x,y) - Z(x,\beta)}$
  $Z(x,\beta) = \log\sum_{y'} e^{f(x,y')}$   (log partition function)
  $L(\beta) = -\sum_i \log p(y_i|x_i) = -\sum_i \big[f(x_i, y_i) - Z(x_i, \beta)\big]$
  $\nabla Z(x,\beta) = \sum_y p(y|x)\,\nabla f(x,y)$
  $\nabla^2 Z(x,\beta) = \sum_y p(y|x)\,\nabla f(x,y)\,\nabla f(x,y)^\top - \nabla Z\,\nabla Z^\top$
Training CRFs
• Maximize conditional likelihood
But Hessian is typically too large (Images: ∼10 000 pixels, ∼50 000 features)
If f (x, y) has a chain structure over y, the Hessian is usually banded → computation
time linear in chain length
• Other loss variants, e.g., hinge loss as with Support Vector Machines
(“Structured output SVMs”)
4 Neural Networks
Outline
• Model, Objective, Solver:
– How do NNs represent a function f (x), or discriminative function f (y, x)?
– What are objectives? (standard objectives, different regularizations)
– How are they trained? (Initialization, SGD)
• Computation Graphs & Chain Rules
• Images & Sequences
– CNNs
– LSTMs & GRUs
– Complex architectures (e.g. Mask-RCNN, dense pose prediction, etc)
4:1
• In that sense, they just replace our previous model assumption f(x) = φ(x)ᵀβ; the rest is "in principle" the same
4:2
• L-layer means L − 1 hidden layers plus 1 output layer. (The input x0 is not
counted.)
4:4
feature-based regression
feature-based classification
(same features for all outputs)
neural network
4:5
φβ (x) = xL-1
• This aligns NNs models with what we discussed so far. But the crucial differ-
ence is:
In NNs, the features φβ (x) are also parameterized and trained!
While in previous lectures, we had to fix φ(x) by hand, NNs allow us to learn
features and intermediate representations
• Note: It is a common approach to train NNs as usual, but after training fix the trained features φ(x)
(“remove the head (=output layer) and fix the remaining body of the NN”) and use these trained
features for similar problems or other kinds of ML on top.
4:6
Data Augmentation
• A very interesting form of regularization is to modify the data!
• Generate more data by applying invariances to the given data. The model then
learns to generalize as described by these invariances.
• This is a form of regularization that directly incorporates expert knowledge
4:11
Optimization
4:12
• For a single data point (x, y*), assume we have a loss ℓ(f(x), y*)
  We define $\delta_L \overset{\Delta}{=} \frac{d\ell}{df} = \frac{d\ell}{dz_L} \in \mathbb{R}^{1\times M}$ as the gradient (as row vector) w.r.t. the output values $z_L$.
• Backpropagation: We can recursively compute the gradient $\frac{d\ell}{dz_l} \in \mathbb{R}^{1\times h_l}$ w.r.t. all other layers $z_l$ as:
  $\forall_{l=L\text{-}1,..,1}: \quad \delta_l \overset{\Delta}{=} \frac{d\ell}{dz_l} = \frac{d\ell}{dz_{l+1}}\,\frac{\partial z_{l+1}}{\partial x_l}\,\frac{\partial x_l}{\partial z_l} = [\delta_{l+1} W_{l+1}] \circ [\sigma'(z_l)]^\top$
• The gradients w.r.t. weights and biases are then
  $\frac{d\ell}{dW_{l,ij}} = \frac{d\ell}{dz_{l,i}}\,\frac{\partial z_{l,i}}{\partial W_{l,ij}} = \delta_{l,i}\, x_{l\text{-}1,j}$ , or $\frac{d\ell}{dW_l} = \delta_l^\top x_{l\text{-}1}^\top$ , $\frac{d\ell}{db_l} = \delta_l^\top$
4:13
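A minimal numpy sketch of these equations for a small fully connected network; the forward equations $z_l = W_l x_{l-1} + b_l$, $x_l = \sigma(z_l)$ with a ReLU activation and a linear output layer are assumptions consistent with the notation above, not a literal copy of the slide:

import numpy as np

def relu(z): return np.maximum(0.0, z)
def drelu(z): return (z > 0).astype(float)

def forward(x, Ws, bs):
    """returns lists of activations x_l (x_0 = input) and pre-activations z_l"""
    xs, zs = [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ xs[-1] + b
        zs.append(z)
        xs.append(z if l == len(Ws) - 1 else relu(z))   # no activation on the output layer
    return xs, zs

def backward(delta_L, xs, zs, Ws):
    """delta_L: d loss / d z_L (as a vector); returns gradients w.r.t. all W_l and b_l"""
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    delta = delta_L
    for l in reversed(range(len(Ws))):
        dWs[l] = np.outer(delta, xs[l])                  # d loss / d W_l = delta_l^T x_{l-1}^T
        dbs[l] = delta.copy()                            # d loss / d b_l = delta_l^T
        if l > 0:
            delta = (delta @ Ws[l]) * drelu(zs[l - 1])   # delta_{l-1} = [delta_l W_l] o sigma'(z_{l-1})
    return dWs, dbs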
• These forward and backward computations are done for each data point (xi , yi ).
P
• Since the total loss is the sum L(β) = i `(fβ (xi ), yi ), the total gradient is the
sum of gradients per data point.
• Efficient implementations send multiple data points (tensors) simultaneously
through the network (fwd and bwd), which speeds up computations.
4:14
Optimization
• For small data size:
We can compute the loss and its gradient $\sum_{i=1}^n \nabla_\beta\, \ell(f_\beta(x_i), y_i)$.
– Use classical gradient-based optimization methods
– default: L-BFGS, oldish but efficient: Rprop
– Called batch learning (in contrast to online learning)
• For large data size: the full sum $\sum_{i=1}^n$ is highly inefficient!
– Adapt weights based on much smaller data subsets, mini batches
4:15
• Compute the loss and gradient for a mini batch D̂ ⊂ D of fixed size k.
  $L(\beta, \hat D) = \sum_{i\in\hat D} \ell(f_\beta(x_i), y_i) , \qquad \nabla_\beta L(\beta, \hat D) = \sum_{i\in\hat D} \nabla_\beta\, \ell(f_\beta(x_i), y_i)$
Yurii Nesterov (1983): A method for solving the convex programming problem with convergence rate O(1/k²)
4:17
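A minimal sketch of the resulting mini-batch stochastic gradient descent loop (plain update rule without momentum; the step size and batch size are assumptions):

import numpy as np

def sgd(params, grad_fn, data, k=32, alpha=1e-3, epochs=10, seed=0):
    """params: flat parameter vector; grad_fn(params, batch) returns the summed gradient over the batch"""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, k):
            batch = [data[i] for i in perm[start:start + k]]   # mini batch D_hat of size <= k
            params = params - alpha * grad_fn(params, batch)
    return params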
Adam
arXiv:1412.6980
(all operations interpreted element-wise)
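The element-wise update rules referenced here, sketched in numpy (hyperparameter defaults follow the cited paper):

import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """one Adam update; m, v are running first/second moment estimates, t the step counter (>= 1)"""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v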
4:18
Initialization
• The Initialization of weights is important! Heuristics:
– E.g., initialize weight vectors $W_{l,i\cdot}$ with standard deviation 1, i.e., each entry with sdv $\frac{1}{\sqrt{h_{l\text{-}1}}}$
– Roughly: If each element of $z_l$ has standard deviation 1, the same should be true for $z_{l+1}$.
• Choose biases bl,i randomly so that the ReLU hinges cover the input well (think
of distributing hinge features for continuous piece-wise linear regression)
4:20
Brief Discussion
4:21
Historical discussion
(This is completely subjective.)
• Early (from 40ies):
– McCulloch Pitts, Hebbian learning, Rosenblatt, Werbos (backpropagation)
• 80ies:
– Start of connectionism, NIPS
– ML wants to distinguish itself from pure statistics (“machines”, “agents”)
• ’90-’10:
– More theory, better grounded, Statistical Learning theory
– Good ML is pure statistics (again) (Frequentists, SVM)
– ...or pure Bayesian (Graphical Models, Bayesian X)
– sample-efficiency, great generalization, guarantees, theory
– Great successes, in applications across disciplines; supervised, unsupervised, struc-
tured
• ’10-:
– Big Data. NNs. Size matters. GPUs.
– Disproportionate focus on images
– Software engineering becomes central
4:22
• NNs did not become "better" than they were 20y ago. What changed is the metrics by which they are evaluated:
• Old:
– Sample efficiency & generalization; get the most from little data
Example
• Three real-valued quantities x, g and f which depend on each other:
  What is $\frac{\partial}{\partial x} f(x, g)$ and what is $\frac{d}{dx} f(x, g)$?
• The partial derivative only considers a single function f (a, b, c, ..) and asks how
the output of this single function varies with one of its arguments. (Not caring
that the arguments might be functions of yet something else).
• The total derivative considers full networks of dependencies between quanti-
ties and asks how one quantity varies with some other.
4:26
Computation Graphs
• A function network or computation graph is a DAG of n quantities xi where
each quantity is a deterministic function of a set of parents π(i) ⊂ {1, .., n}, that
is
xi = fi (xπ(i) )
where xπ(i) = (xj )j∈π(i) is the tuple of parent values
• (This could also be called deterministic Bayes net.)
• Total derivative: Given a variation dx of some quantity, how would all child
quantities (down the DAG) vary?
4:27
Chain rules
• Forward-version: (I use in robotics)
  $\frac{df}{dx} = \sum_{g\in\pi(f)} \frac{\partial f}{\partial g}\,\frac{dg}{dx}$
  Why "forward"? You've computed $\frac{dg}{dx}$ already, now you move forward to $\frac{df}{dx}$.
  Note: If $x\in\pi(f)$ is also a direct argument to $f$, the sum includes the term $\frac{\partial f}{\partial x}\frac{dx}{dx} \equiv \frac{\partial f}{\partial x}$. To emphasize this, one could also write $\frac{df}{dx} = \frac{\partial f}{\partial x} + \sum_{g\in\pi(f),\; g\ne x} \frac{\partial f}{\partial g}\,\frac{dg}{dx}$.
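A tiny worked example (the concrete functions are an illustration, not taken from the slide): let $g(x) = 2x$ and $f(x, g) = 3x + 2g$. Then
$$\frac{\partial f}{\partial x} = 3 , \qquad \frac{df}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial g}\,\frac{dg}{dx} = 3 + 2\cdot 2 = 7 ,$$
which illustrates the difference between the partial and the total derivative discussed above.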
• For time series, long-short term memory (LSTM) networks represent long-term
dependencies in a way that is well trainable – something that is hard to do with
other model structures.
• Both these structural priors, combined with huge data and capacity, make these
methods very strong.
4:30
Convolutional NNs
AlexNet
4:32
ResNet
4:33
ResNeXt
4:34
Pretrained networks
LSTMs
4:36
LSTM
• c is a memory signal, that is multiplied with a sigmoid signal Γf . If that is
saturated (Γf ≈ 1), the memory is preserved; and backpropagation copies gra-
dients back
• If Γi is close to 1, a new signal c̃ is written into memory
• If Γo is close to 1, the memory contributes to the normal neural activations a
4:37
Deep RL
• Value Network
• Advantage Network
• Action Network
• Experience Replay (prioritized)
• Fixed Q-targets
• etc, etc
4:39
Conclusions
• Conventional feed-forward neural networks are by no means magic. They’re a param-
eterized function, which is fit to data.
• Convolutional NNs do make strong and good assumptions about how information pro-
cessing on images should be structured. The results are great and related to some de-
gree to human visual representations. A large part of the success of deep learning is on
images.
Also LSTMs make good assumptions about how memory signals help represent time
series.
The flexibility of “clicking together” network structures and general differentiable com-
putation graphs is great.
All these are innovations w.r.t. formulating structured models for ML
• The major strength of NNs is in their capacity and that, using massive parallelized com-
putation, they can be trained on tons of data. Maybe they don’t even need to be better
than nearest neighbor lookup, but they can be queried much faster.
4:40
5 Kernelization
The kernel function k(x, x0 ) calculates the scalar product in feature space.
5:1
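The slide names kernel ridge regression; a minimal numpy sketch using the standard dual solution $f(x) = \kappa(x)^\top (K + \lambda I)^{-1} y$ with $K_{ij} = k(x_i, x_j)$, $\kappa_i(x) = k(x, x_i)$ (this formula is standard and stated here as an assumption rather than quoted from the slide):

import numpy as np

def rbf_kernel(A, B, c=1.0):
    """k(x, x') = exp(-||x - x'||^2 / c) between all rows of A and B"""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / c)

def kernel_ridge(X_train, y, X_query, lam=0.1, c=1.0):
    K = rbf_kernel(X_train, X_train, c)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # dual coefficients
    kappa = rbf_kernel(X_query, X_train, c)
    return kappa @ alpha                                    # f(x) = kappa(x)^T alpha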
Vectors in this space are linear combinations of such basis elements, e.g.,
  $f = \sum_i \alpha_i\,\phi_{x_i} , \qquad f(x) = \sum_i \alpha_i\, k(x, x_i)$
• Let's define a scalar product in this space. Assuming k(·, ·) is positive definite, we first define the scalar product for every basis element,
  $\langle \phi_x, \phi_y \rangle := k(x, y)$
Then it follows
  $\langle \phi_x, f \rangle = \sum_i \alpha_i \langle \phi_x, \phi_{x_i} \rangle = \sum_i \alpha_i\, k(x, x_i) = f(x)$
Representer Theorem
• For
  $f^* = \text{argmin}_{f\in\mathcal{H}_k}\; L(f(x_1), .., f(x_n)) + \Omega(||f||^2_{\mathcal{H}_k})$
  it holds that $f^* = \sum_i \alpha_i\,\phi_{x_i}$ for some $\alpha\in\mathbb{R}^n$, i.e., the optimum lies in the span of the data features.
• Proof:
decompose f = fs + f⊥ , fs ∈ span{φxi : xi ∈ D}
  $f(x_i) = \langle f, \phi_{x_i}\rangle = \langle f_s + f_\perp, \phi_{x_i}\rangle = \langle f_s, \phi_{x_i}\rangle = f_s(x_i)$
  $L(f(x_1), .., f(x_n)) = L(f_s(x_1), .., f_s(x_n))$
  $\Omega(||f_s + f_\perp||^2_{\mathcal{H}_k}) \ge \Omega(||f_s||^2_{\mathcal{H}_k})$
5:5
Example Kernels
• Kernel functions need to be positive definite: for any set of points, the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$ satisfies $\forall_{z:|z|>0}:\; z^\top K z > 0$
  → K is a positive definite matrix
• Examples:
  – Polynomial: $k(x, x') = (x^\top x' + c)^d$
    Let's verify for $d = 2$, $c = 1$, $\phi(x) = \big(1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1x_2, x_2^2\big)^\top$:
    $k(x, x') = (x_1x_1' + x_2x_2' + 1)^2$
    $\qquad\;\;\; = x_1^2 x_1'^2 + 2x_1x_2x_1'x_2' + x_2^2 x_2'^2 + 2x_1x_1' + 2x_2x_2' + 1 = \phi(x)^\top\phi(x')$
Example Kernels
• Gaussian Process regression will explain that k(x, x0 ) has the semantics of an
(apriori) correlatedness of the yet unknown underlying function values f (x) and
f (x0 )
– k(x, x0 ) should be high if you believe that f (x) and f (x0 ) might be similar
– k(x, x0 ) should be zero if f (x) and f (x0 ) might be fully unrelated
5:7
We can now compute the discriminative function values $f_X = X\beta \in \mathbb{R}^n$ at the training points by iterating over those instead of β:
  $f_X \leftarrow XX^\top(XX^\top + 2\lambda W^{-1})^{-1}\big[X\beta - W^{-1}(p - y)\big]$   (6)
  $\qquad = K(K + 2\lambda W^{-1})^{-1}\big[f_X - W^{-1}(p_X - y)\big]$   (7)
Note that $p_X$ on the RHS also depends on $f_X$. Given $f_X$ we can compute the discriminative function values $f_Z = Z\beta \in \mathbb{R}^m$ for a set of m query points Z using
  $f_Z \leftarrow \kappa^\top(K + 2\lambda W^{-1})^{-1}\big[f_X - W^{-1}(p_X - y)\big] , \qquad \kappa^\top = ZX^\top$   (8)
5:8
6 Unsupervised Learning
Unsupervised learning
• What does that mean? Generally: modelling P (x)
• Instances:
– Finding lower-dimensional spaces
– Clustering
– Density estimation
– Fitting a graphical model
• “Supervised Learning as special case”...
6:1
xi ≈ Vp zi + µ
• Optimality: Find $V_p$, µ and values $z_i$ that minimize $\sum_{i=1}^n ||x_i - (V_p z_i + \mu)||^2$
6:4
Optimal Vp
  $\hat\mu, \hat z_{1:n} = \text{argmin}_{\mu, z_{1:n}} \sum_{i=1}^n ||x_i - V_p z_i - \mu||^2$
  $\Rightarrow \hat\mu = \langle x_i\rangle = \frac{1}{n}\sum_{i=1}^n x_i , \qquad \hat z_i = V_p^\top(x_i - \mu)$
• Center the data, $\tilde x_i = x_i - \hat\mu$. Then
  $\hat V_p = \text{argmin}_{V_p} \sum_{i=1}^n ||\tilde x_i - V_p V_p^\top \tilde x_i||^2$
• $V_p^\top$ is the matrix that projects to the largest variance directions of $X^\top X$
  $z_i = V_p^\top(x_i - \mu) , \qquad Z = X V_p$
  $A = \text{Var}\{x\} = \langle xx^\top\rangle - \mu\mu^\top = \frac{1}{n}X^\top X - \mu\mu^\top$
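A minimal numpy sketch of these steps (centering, taking the top-p directions via an SVD, projecting):

import numpy as np

def pca(X, p):
    """X: n x d data matrix; returns the mean, the d x p matrix V_p, and the projections Z"""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vp = Vt[:p].T                                 # largest-variance directions as columns
    Z = Xc @ Vp                                   # z_i = V_p^T (x_i - mu)
    return mu, Vp, Z

# reconstruction: x_i ~ mu + Vp @ z_i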
Example: Digits
6:7
Example: Digits
  $x \approx \mu + V_p z = \mu + z_1 v_1 + z_2 v_2 + \ldots$
  [Figure: a digit image decomposed as the mean image plus weighted eigen-digit images]
6:8
Example: Eigenfaces
Non-linear Autoencoders
• PCA gives the "optimal linear autoencoder"
• We can relax the encoding (Vp ) and decoding (Vp>) to be non-linear mappings,
e.g., represented as a neural network
• Stacking autoencoders:
6:10
Mnist1h dataset, deep NNs of 2, 6, 8, 10 and 15 layers; each hidden layer 50 hidden
units
6:11
[Figure: ICA as a bipartite graphical model, with latent sources $y_0, .., y_3$ mixing into observations $x_1, .., x_6$]
• In ICA
1) We have (usually) as many latent variables as observed dim(xi ) = dim(zi )
2) We require all latent variables to be independent
3) We allow for latent variables to be non-Gaussian
6:14
PLS*
• Idea: The first dimension to pick should be the one most correlated with the
OUTPUT, not with itself!
• Not obvious.
• The kernel trick: rewrite all necessary equations such that they only involve scalar products $\phi(x)^\top\phi(x') = k(x, x')$:
  We want to compute eigenvectors of $X^\top X = \sum_i \phi(x_i)\phi(x_i)^\top$. We can rewrite this as
  $X^\top X v_j = \lambda v_j$
  $\underbrace{XX^\top}_{K}\;\underbrace{X v_j}_{K\alpha_j} = \lambda\;\underbrace{X v_j}_{K\alpha_j} , \qquad v_j = \sum_i \alpha_{ji}\,\phi(x_i)$
  $K\alpha_j = \lambda\alpha_j$
(with matrix $A\in\mathbb{R}^{p\times n}$, $A_{ji} = \alpha_{ji}$ and vector $\kappa(x)\in\mathbb{R}^n$, $\kappa_i(x) = k(x_i, x)$)
Since we cannot center the features φ(x) we actually need "the double centered kernel matrix" $\tilde K = (I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top)\,K\,(I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top)$, where $K_{ij} = \phi(x_i)^\top\phi(x_j)$ is uncentered.
6:18
Kernel PCA
red points: data
P
green shading: eigenvector αj represented as functions i αji k(xj , x)
Kernel PCA
• Kernel PCA uncovers quite surprising structure:
• Kernel PCA may map data xi to latent coordinates zi where clustering is much
easier
• Using a kernel function $k(x, x') = e^{-||x - x'||^2/c}$:
Spectral Clustering**
Spectral Clustering is very similar to kernel PCA:
• Instead of the kernel matrix K with entries $k_{ij} = k(x_i, x_j)$ we construct a weighted adjacency matrix, e.g.,
  $w_{ij} = \begin{cases} 0 & \text{if } x_i \text{ is not a kNN of } x_j \\ e^{-||x_i - x_j||^2/c} & \text{otherwise} \end{cases}$
6:24
• The Graph Laplacian L: For some vector $f\in\mathbb{R}^n$, note the following identities:
  $(Lf)_i = \big(\sum_j w_{ij}\big) f_i - \sum_j w_{ij} f_j = \sum_j w_{ij}(f_i - f_j)$
  $f^\top L f = \sum_i f_i \sum_j w_{ij}(f_i - f_j) = \sum_{ij} w_{ij}(f_i^2 - f_i f_j) = \sum_{ij} w_{ij}\big(\tfrac{1}{2}f_i^2 + \tfrac{1}{2}f_j^2 - f_i f_j\big) = \tfrac{1}{2}\sum_{ij} w_{ij}(f_i - f_j)^2$
  where the second-to-last = holds if $w_{ij} = w_{ji}$ is symmetric.
6:25
• If we define
  $\tilde K = (I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top)\,D\,(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top) , \qquad D_{ij} = -d_{ij}^2/2$
  then $\tilde K_{ij} = (x_i - \bar x)^\top(x_j - \bar x)$ is the normal covariance matrix and MDS is equivalent to kernel PCA
6:27
by Tenenbaum et al.
6:29
PCA variants*
6:31
6:32
6:33
6.2 Clustering
6:35
Clustering
• Clustering often involves two steps:
• First map the data to some embedding that emphasizes clusters
– (Feature) PCA
– Spectral Clustering
– Kernel PCA
– ISOMAP
• Then explicitly analyze clusters
– k-means clustering
– Gaussian Mixture Model
– Agglomerative Clustering
6:36
k-means Clustering
• Given data $D = \{x_i\}_{i=1}^n$, find K centers $\mu_k$ and a data assignment $c: i \mapsto k$ to minimize
  $\min_{c,\mu} \sum_i (x_i - \mu_{c(i)})^2$
• k-means clustering:
  – Pick K data points randomly to initialize the centers $\mu_k$
  – Iterate adapting the assignments c(i) and the centers $\mu_k$:
    $\forall_i:\; c(i) \leftarrow \text{argmin}_{c(i)} \sum_j (x_j - \mu_{c(j)})^2 = \text{argmin}_k (x_i - \mu_k)^2$
    $\forall_k:\; \mu_k \leftarrow \text{argmin}_{\mu_k} \sum_i (x_i - \mu_{c(i)})^2 = \frac{1}{|c^{-1}(k)|}\sum_{i\in c^{-1}(k)} x_i$
6:37
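A minimal numpy sketch of this iteration (a fixed iteration count instead of a convergence test is an assumption):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """X: n x d data; returns centers mu (K x d) and assignments c (n,)"""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)   # init with K random data points
    for _ in range(iters):
        dists = np.sum((X[:, None, :] - mu[None, :, :])**2, axis=2)
        c = np.argmin(dists, axis=1)                              # assignment step
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)                    # center update step
    return mu, c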
k-means Clustering
from Hastie
6:38
k-means Clustering
6:39
from Hastie
6:40
  $\pi_k = \frac{1}{n}\sum_i q(c_i{=}k)$
  $\mu_k = \frac{1}{n\pi_k}\sum_i q(c_i{=}k)\, x_i$
  $\Sigma_k = \frac{1}{n\pi_k}\sum_i q(c_i{=}k)\, x_i x_i^\top - \mu_k\mu_k^\top$
6:41
from Bishop
6:42
6:43
6:44
• More interesting: The loss and the best choice of λ depends on the scaling of the
data. If we always scale the data in the same range, we may have better priors
about choice of λ and interpretation of the loss
  $x \leftarrow \frac{1}{\sqrt{\text{Var}\{x\}}}\, x , \qquad y \leftarrow \frac{1}{\sqrt{\text{Var}\{y\}}}\, y$
  $x \leftarrow M^{-1} x , \quad \text{with } \text{Var}\{M^{-1}x\} = I_d$
6:45
• Content:
– Local learners
– local & lazy learning, kNN, Smoothing Kernel, kd-trees
– Combining weak or randomized learners
– Bootstrap, bagging, and model averaging
– Boosting
– (Boosted) decision trees & stumps, random forests
7:1
• Typical approach:
– Given a query point x*, find all kNN in the data $D = \{(x_i, y_i)\}_{i=1}^N$
– Fit a local model $f_{x^*}$ only to these kNNs, perhaps weighted
– Use the local model $f_{x^*}$ to predict $x^* \mapsto \hat y$
7:3
Regression example
kNN smoothing kernel: $K(x^*, x_i) = \begin{cases} 1 & \text{if } x_i \in \text{kNN}(x^*) \\ 0 & \text{otherwise}\end{cases}$
Epanechnikov quadratic smoothing kernel: $K_\lambda(x^*, x) = D(|x^* - x|/\lambda)$ , $\; D(s) = \begin{cases} \frac{3}{4}(1 - s^2) & \text{if } s \le 1 \\ 0 & \text{otherwise}\end{cases}$
Smoothing Kernels
from Wikipedia
7:5
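A minimal sketch of kernel-weighted local averaging with the Epanechnikov kernel above (the locally constant, weighted-average form is an assumption; a locally linear fit would work analogously):

import numpy as np

def epanechnikov(s):
    return np.where(np.abs(s) <= 1, 0.75 * (1 - s**2), 0.0)

def smooth_predict(x_star, X, y, lam=0.5):
    """locally weighted average prediction at query x_star (1D inputs for simplicity)"""
    w = epanechnikov(np.abs(x_star - X) / lam)
    if w.sum() == 0:
        return y.mean()            # no neighbors inside the kernel support
    return np.sum(w * y) / np.sum(w)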
kd-trees
• For local & lazy learning it is essential to efficiently retrieve the kNN
A kd-tree pre-structures the data into a binary tree, allowing O(log n) retrieval
of kNNs.
7:7
kd-trees
kd-trees
• Simplest (non-efficient) way to construct a kd-tree:
– hyperplanes divide alternatingly along 1st, 2nd, ... coordinate
Combining learners
• The general idea is:
– Given data D, let us learn various models f1 , .., fM
– Our prediction is then some combination of these, e.g.
  $f(x) = \sum_{m=1}^M \alpha_m f_m(x)$
Model averaging: Fully different types of models (using different (e.g. limited)
feature sets; neural nets; decision trees; hyperparameters)
7:12
• Bayesian Averaging
  $P(y|x) = \sum_{m=1}^M P(y|x, f_m, D)\; P(f_m|D)$
The term P (fm |D) is the weighting αm : it is high, when the model is likely
under the data (↔ the data is likely under the model & the model has “fewer
parameters”).
7:14
  $f(x) = \sum_{j=1}^k \phi_j(x)\,\beta_j = \phi(x)^\top\beta$
Boosting
• In Bagging and Model Averaging, the models are trained on the “same data”
(unbiased randomized versions of the same data)
AdaBoost**
(Freund & Schapire, 1997)
(classical Algo; use Gradient Boosting instead in practice)
7:17
AdaBoost**
AdaBoost**
• Real AdaBoost: A variant exists that combines probabilistic classifiers σ(f (x)) ∈
[0, 1] instead of discrete G(x) ∈ {−1, +1}
7:19
• AdaBoost does exactly this, choosing wm so that the “feature” fm will best
reduce the loss (cf. PLS)
(Literally, AdaBoost uses exponential loss or neg-log-likelihood; Hastie sec 10.4 & 10.5)
7:20
Gradient Boosting
• AdaBoost generates a series of basis functions by using different data weight-
ings wm depending on so-far classification errors
• We can also generate a series of basis functions fm by fitting them to the gradi-
ent of the so-far loss
7:21
Gradient Boosting
• Assume we want to minimize some loss function
  $\min_f L(f) = \sum_{i=1}^n L(y_i, f(x_i))$
Gradient Boosting
• If F is the set of regression/decision trees, then step 5 usually re-optimizes the terminal constants of all leaf nodes of the regression tree fm. (Step 4 only determines the terminal regions.)
7:23
Decision Trees**
• Decision trees are particularly used in Bagging and Boosting contexts
• Decision trees are “linear in features”, but the features are the terminal regions
of a tree, which are constructed depending on the data
Decision Trees
• We describe CART (classification and regression tree)
• Decision trees are linear in features:
  $f(x) = \sum_{j=1}^k c_j\, [x \in R_j]$
7:27
• Each split xa > t is defined by a choice of input dimension a ∈ {1, .., d} and a
threshold t
  $\min_{a,t}\Big[\min_{c_1}\sum_{i:\, x_i\in R_j \wedge x_a\le t}(y_i - c_1)^2 \;+\; \min_{c_2}\sum_{i:\, x_i\in R_j \wedge x_a>t}(y_i - c_2)^2\Big]$
7:28
• We first grow a very large tree (e.g. until at most 5 data points live in each
region)
  $\sum_{i=1}^n (y_i - f(x_i))^2$
7:29
Example:
CART on the Spam data set
(details: Hastie, p 320)
7:30
• A decision stump is a decision tree with fixed depth 1 (just one split)
• Gradient boosting of decision trees (of fixed depth J) and stumps is very effec-
tive
• Random Forests are the prime example for “creating many randomized weak
learners from the same data D”
7:32
Learning as Inference
• The parametric view
  $P(\beta|\text{Data}) = \frac{P(\text{Data}|\beta)\; P(\beta)}{P(\text{Data})}$
• The function space view
  $P(f|\text{Data}) = \frac{P(\text{Data}|f)\; P(f)}{P(\text{Data})}$
• Today:
– Bayesian (Kernel) Ridge Regression ↔ Gaussian Process (GP)
– Bayesian (Kernel) Logistic Regression ↔ GP classification
– Bayesian Neural Networks (briefly)
8:1
[Graphical model: parameters β and inputs $x_i$ generate outputs $y_i$, for i = 1 : n]
• Let's assume:
  P(X) is arbitrary
  P(β) is Gaussian: $\beta \sim N(0, \frac{\sigma^2}{\lambda}) \propto e^{-\frac{\lambda}{2\sigma^2}||\beta||^2}$
  P(Y | X, β) is Gaussian: $y = x^\top\beta + \epsilon , \;\; \epsilon \sim N(0, \sigma^2)$
8:5
  $P(y^*\,|\,x^*, D) = \int_\beta P(y^*\,|\,x^*, \beta)\; P(\beta\,|\,D)\; d\beta$
Note, for $f(x) = \phi(x)^\top\beta$, we have $P(f(x)\,|\,D) = N\big(f(x)\,|\,\phi(x)^\top\hat\beta,\; \phi(x)^\top\Sigma\,\phi(x)\big)$ without the $\sigma^2$
• So, y ∗ is Gaussian distributed around the mean prediction φ(x∗ )>β̂:
This is a very very common relation: optimization costs correspond to neg-log proba-
bilities; probabilities correspond to exp-neg costs.
• 2nd insight: The mean β̂ is exactly the classical argminβ Lridge (β)
More generally, the most likely parameter argmaxβ P (β|D) is also the least-cost param-
eter argminβ L(β). In the Gaussian case, most-likely β is also the mean.
• 3rd insight: The Bayesian inference approach not only gives a mean/optimal
β̂, but also a variance Σ of that estimate
This is a core benefit of the Bayesian view: It naturally provides a probability distribu-
tion over predictions (“error bars”), not only a single prediction.
8:9
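A minimal numpy sketch of the resulting posterior, with mean $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$ and covariance $\Sigma = \sigma^2 (X^\top X + \lambda I)^{-1}$; the covariance expression is the standard consequence of the Gaussian prior and likelihood above, stated here as an assumption:

import numpy as np

def bayesian_ridge_posterior(X, y, lam, sigma2):
    """returns posterior mean beta_hat and covariance Sigma over beta"""
    A = X.T @ X + lam * np.eye(X.shape[1])
    beta_hat = np.linalg.solve(A, X.T @ y)
    Sigma = sigma2 * np.linalg.inv(A)
    return beta_hat, Sigma

def predict(phi_x, beta_hat, Sigma, sigma2):
    """predictive mean and variance of y* at features phi(x*)"""
    mean = phi_x @ beta_hat
    var = phi_x @ Sigma @ phi_x + sigma2      # model uncertainty plus observation noise
    return mean, var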
Gaussian Processes
are equivalent to Kernelized Bayesian Ridge Regression
(see also Welling: “Kernel Ridge Regression” Lecture Notes; Rasmussen & Williams sections 2.1 &
6.2; Bishop sections 3.3.3 & 6)
• But it is insightful to introduce them again from the “function space view”:
GPs define a probability distribution over functions; they are the infinite di-
mensional generalization of Gaussian vectors
8:11
Gaussian Processes
8:15
  P(X) = arbitrary
  $P(\beta) = N(\beta\,|\,0, \tfrac{2}{\lambda}) \propto \exp\{-\lambda||\beta||^2\}$
  $P(Y{=}1\,|\,X, \beta) = \sigma(\beta^\top\phi(x))$
• Recall
  $L^{logistic}(\beta) = -\sum_{i=1}^n \log p(y_i\,|\,x_i) + \lambda||\beta||^2$
8:19
• Read
Gal & Ghahramani: Dropout as a bayesian approximation: Representing model uncertainty in
deep learning (ICML’16)
• Dropout in NNs
– Dropout leads to randomized prediction
– One can estimate the mean prediction from T dropout samples (MC estimate)
– Or one can estimate the mean prediction by averaging the weights of the network
(“standard dropout”)
– Equally one can MC estimate the variance from samples
– Gal & Ghahramani show that a Dropout NN is a Deep GP (with a very special kernel), and the "correct" predictive variance is this MC estimate plus $\frac{p\,l^2}{2n\lambda}$ (kernel length scale l, regularization λ, dropout prob p, and n data points)
8:25
No Free Lunch
• Averaged over all problem instances, any algorithm performs equally. (E.g.
equal to random.)
– “there is no one model that works best for every problem”
Igel & Toussaint: On Classes of Functions for which No Free Lunch Results Hold (Information Process-
ing Letters 2003)
• Rigorous formulations formalize this “average over all problem instances”. E.g.
by assuming a uniform prior over problems
– In black-box optimization, a uniform distribution over underlying objective func-
tions f (x)
– In machine learning, a uniform distribution over the hidden true function f(x)
... and NFL always considers non-repeating queries.
• NFL is trivial: when any previous query yields NO information at all about the results of future queries, anything is exactly as good as random guessing
8:27
Conclusions
• Probabilistic inference is a very powerful concept!
– Inferring about the world given data
– Learning, decision making, and reasoning can be viewed as forms of (probabilistic) inference
A Probability Basics
• Graphical models (probabilistic models with multiple random variables and dependencies) are a more general framework for modelling "problems"; regression & classification become a special case; Reinforcement Learning, decision making, unsupervised learning, but also language processing and image segmentation, can be represented.
9:1
Outline
• Basic definitions
– Random variables
– joint, conditional, marginal distribution
– Bayes’ theorem
• Examples for Bayes
• Probability distributions [skipped, only Gauss]
– Binomial; Beta
– Multinomial; Dirichlet
– Conjugate priors
– Gauss; Wishart
– Student-t, Dirac, Particles
• Monte Carlo, MCMC [skipped]
These are generic slides on probabilities I use throughout my lecture. Only parts are
mandatory for the AI course.
9:2
• Example:
40% Bavarians speak dialect, only 1% of non-Bavarians speak (Bav.) dialect
Given a random German that speaks non-dialect, is he Bavarian?
(15% of Germans are Bavarian)
9:3
Inference
• "Inference" = Given some pieces of information (prior, observed variables), what is the implication (the implied information, the posterior) on a non-observed variable
Probability Theory
• Why do we need probabilities?
– Obvious: to express inherent stochasticity of the world (data)
– expressing uncertainty
– expressing information (and lack of information)
• A bit more formally: a random variable is a map from a measurable space to a domain (sample space) and thereby introduces a probability measure on the domain ("assigns a probability to each possible value")
9:8
Probability Distributions
• P (X = 1) ∈ R denotes a specific probability
P (X) denotes the probability distribution (function over Ω)
Joint distributions
Assume we have two random variables X and Y
• Definitions:
  Joint: P(X, Y)
  Marginal: $P(X) = \sum_Y P(X, Y)$
  Conditional: $P(X|Y) = \frac{P(X,Y)}{P(Y)}$
  The conditional is normalized: $\forall_Y: \sum_X P(X|Y) = 1$
Joint distributions
  joint: P(X, Y)
  marginal: $P(X) = \sum_Y P(X, Y)$
  conditional: $P(X|Y) = \frac{P(X,Y)}{P(Y)}$
  Bayes' Theorem: $P(X|Y) = \frac{P(Y|X)\; P(X)}{P(Y)}$
9:11
Bayes’ Theorem
  $P(X|Y) = \frac{P(Y|X)\; P(X)}{P(Y)}$
  posterior = $\frac{\text{likelihood}\cdot\text{prior}}{\text{normalization}}$
9:12
Multiple RVs:
• Analogously for n random variables X1:n (stored as a rank n tensor)
  Joint: $P(X_{1:n})$
  Marginal: $P(X_1) = \sum_{X_{2:n}} P(X_{1:n})$
  Conditional: $P(X_1|X_{2:n}) = \frac{P(X_{1:n})}{P(X_{2:n})}$
  $P(D{=}1\,|\,B{=}1) = 0.4$ ,  $P(D{=}1\,|\,B{=}0) = 0.01$ ,  $P(B{=}1) = 0.15$   [graphical model: B → D]
It follows
  $P(B{=}1\,|\,D{=}0) = \frac{P(D{=}0\,|\,B{=}1)\,P(B{=}1)}{P(D{=}0)} = \frac{0.6\cdot 0.15}{0.6\cdot 0.15 + 0.99\cdot 0.85} \approx 0.097$
9:14
  D = HHTHT  or  D = HHHHH   [graphical model: H → d₁, .., d₅]
• Bayes' theorem:
  $P(H\,|\,D) = \frac{P(D\,|\,H)\,P(H)}{P(D)}$
9:15
Coin flipping
D = HHTHT
  $P(D\,|\,H{=}1) = 1/2^5$ ,  $P(H{=}1) = \frac{999}{1000}$
  $P(D\,|\,H{=}2) = 0$ ,  $P(H{=}2) = \frac{1}{1000}$
  $\frac{P(H{=}1\,|\,D)}{P(H{=}2\,|\,D)} = \frac{P(D\,|\,H{=}1)}{P(D\,|\,H{=}2)}\,\frac{P(H{=}1)}{P(H{=}2)} = \frac{1/32}{0}\cdot\frac{999}{1} = \infty$
9:16
Coin flipping
D = HHHHH
  $P(D\,|\,H{=}1) = 1/2^5$ ,  $P(H{=}1) = \frac{999}{1000}$
  $P(D\,|\,H{=}2) = 1$ ,  $P(H{=}2) = \frac{1}{1000}$
  $\frac{P(H{=}1\,|\,D)}{P(H{=}2\,|\,D)} = \frac{P(D\,|\,H{=}1)}{P(D\,|\,H{=}2)}\,\frac{P(H{=}1)}{P(H{=}2)} = \frac{1/32}{1}\cdot\frac{999}{1} \approx 30$
9:17
Coin flipping
D = HHHHHHHHHH
  $P(D\,|\,H{=}1) = 1/2^{10}$ ,  $P(H{=}1) = \frac{999}{1000}$
  $P(D\,|\,H{=}2) = 1$ ,  $P(H{=}2) = \frac{1}{1000}$
  $\frac{P(H{=}1\,|\,D)}{P(H{=}2\,|\,D)} = \frac{P(D\,|\,H{=}1)}{P(D\,|\,H{=}2)}\,\frac{P(H{=}1)}{P(H{=}2)} = \frac{1/1024}{1}\cdot\frac{999}{1} \approx 1$
9:18
  $P(\text{World}\,|\,\text{Data}) = \frac{P(\text{Data}\,|\,\text{World})\; P(\text{World})}{P(\text{Data})}$
P(World) describes our prior over all possible worlds. Learning means to infer about the world we live in based on the data we have!
  $P(f\,|\,\text{Data}) = \frac{P(\text{Data}\,|\,f)\; P(f)}{P(\text{Data})}$
  $P(x{=}1\,|\,\mu) = \mu , \qquad P(x{=}0\,|\,\mu) = 1 - \mu$
  $\text{Bern}(x\,|\,\mu) = \mu^x (1-\mu)^{1-x}$
• We have a data set of random variables $D = \{x_1, .., x_n\}$, each $x_i \in \{0,1\}$. If each $x_i \sim \text{Bern}(x_i\,|\,\mu)$ we have
  $P(D\,|\,\mu) = \prod_{i=1}^n \text{Bern}(x_i\,|\,\mu) = \prod_{i=1}^n \mu^{x_i}(1-\mu)^{1-x_i}$
  $\text{argmax}_\mu \log P(D\,|\,\mu) = \text{argmax}_\mu \sum_{i=1}^n \big[x_i\log\mu + (1-x_i)\log(1-\mu)\big] = \frac{1}{n}\sum_{i=1}^n x_i$
• The Binomial distribution is the distribution over the count $m = \sum_{i=1}^n x_i$:
  $\text{Bin}(m\,|\,n,\mu) = \binom{n}{m}\,\mu^m (1-\mu)^{n-m} , \qquad \binom{n}{m} = \frac{n!}{(n-m)!\;m!}$
9:21
Beta
How to express uncertainty over a Bernoulli parameter µ
• The Beta distribution is over the interval [0, 1], typically for the parameter µ of a Bernoulli:
  $\text{Beta}(\mu\,|\,a, b) = \frac{1}{B(a,b)}\,\mu^{a-1}(1-\mu)^{b-1}$
• Given data D with $a_D$ observations of $[x_i{=}1]$ and $b_D$ of $[x_i{=}0]$, the posterior is
  $P(\mu\,|\,D) = \frac{P(D\,|\,\mu)}{P(D)}\,P(\mu) \propto \text{Bin}(D\,|\,\mu)\;\text{Beta}(\mu\,|\,a,b)$
  $\qquad\quad\;\; \propto \mu^{a_D}(1-\mu)^{b_D}\;\mu^{a-1}(1-\mu)^{b-1} = \mu^{a-1+a_D}(1-\mu)^{b-1+b_D} = \text{Beta}(\mu\,|\,a + a_D,\, b + b_D)$
9:22
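A minimal numpy sketch of this update, counting ones and zeros in the data:

import numpy as np

def beta_posterior(a, b, data):
    """data: array of 0/1 observations; returns updated Beta parameters (a + a_D, b + b_D)"""
    a_D = int(np.sum(data))          # count of x_i = 1
    b_D = len(data) - a_D            # count of x_i = 0
    return a + a_D, b + b_D

# e.g. uniform prior Beta(1,1) and data HHTHT -> Beta(4, 3), posterior mean 4/7
print(beta_posterior(1, 1, np.array([1, 1, 0, 1, 0])))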
Beta
• Conclusions:
– The semantics of a and b are counts of [xi = 1] and [xi = 0], respectively
– The Beta distribution is conjugate to the Bernoulli (explained later)
– With the Beta distribution we can represent beliefs (state of knowledge) about un-
certain µ ∈ [0, 1] and know how to update this belief given data
9:23
Beta
from Bishop
9:24
Multinomial
• We have an integer random variable x ∈ {1, .., K}
The probability of a single x can be parameterized by $\mu = (\mu_1, .., \mu_K)$:
  $P(x{=}k\,|\,\mu) = \mu_k$
with the constraint $\sum_{k=1}^K \mu_k = 1$ (probabilities need to be normalized)
• We have a data set of random variables $D = \{x_1, .., x_n\}$, each $x_i \in \{1,..,K\}$. If each $x_i \sim P(x_i\,|\,\mu)$ we have
  $P(D\,|\,\mu) = \prod_{i=1}^n \mu_{x_i} = \prod_{i=1}^n\prod_{k=1}^K \mu_k^{[x_i=k]} = \prod_{k=1}^K \mu_k^{m_k}$
where $m_k = \sum_{i=1}^n [x_i{=}k]$ is the count of $[x_i{=}k]$. The ML estimator is
  $\text{argmax}_\mu \log P(D\,|\,\mu) = \frac{1}{n}(m_1, .., m_K)$
9:25
Dirichlet
How to express uncertainty over a Multinomial parameter µ
• The Dirichlet distribution is over the K-simplex, that is, over $\mu_1, .., \mu_K \in [0,1]$ subject to the constraint $\sum_{k=1}^K \mu_k = 1$:
  $\text{Dir}(\mu\,|\,\alpha) \propto \prod_{k=1}^K \mu_k^{\alpha_k - 1}$
It is parameterized by $\alpha = (\alpha_1, .., \alpha_K)$, has mean $\langle\mu_i\rangle = \frac{\alpha_i}{\sum_j\alpha_j}$ and mode $\mu_i^* = \frac{\alpha_i - 1}{\sum_j \alpha_j - K}$ for $\alpha_i > 1$.
• The posterior is
  $P(\mu\,|\,D) = \frac{P(D\,|\,\mu)}{P(D)}\,P(\mu) \propto \text{Mult}(D\,|\,\mu)\;\text{Dir}(\mu\,|\,\alpha)$
  $\qquad\quad\;\; \propto \prod_{k=1}^K \mu_k^{m_k}\,\prod_{k=1}^K \mu_k^{\alpha_k-1} = \prod_{k=1}^K \mu_k^{\alpha_k - 1 + m_k} = \text{Dir}(\mu\,|\,\alpha + m)$
9:26
Dirichlet
• Conclusions:
– The semantics of α is the counts of [xi = k]
– The Dirichlet distribution is conjugate to the Multinomial
– With the Dirichlet distribution we can represent beliefs (state of knowledge) about
uncertain µ of an integer random variable and know how to update this belief given
data
9:27
Dirichlet
Illustrations for α = (0.1, 0.1, 0.1), α = (1, 1, 1) and α = (10, 10, 10):
from Bishop
9:28
Conjugate priors
• A prior P(θ) is conjugate to a likelihood P(D | θ) iff the posterior
  $P(\theta\,|\,D) \propto P(D\,|\,\theta)\; P(\theta)$
is in the same distribution family as the prior.
• Having a conjugate prior is very convenient, because then you know how to
update the belief given data
9:30
Conjugate priors
likelihood — conjugate prior:
– Binomial Bin(D | µ) — Beta Beta(µ | a, b)
– Multinomial Mult(D | µ) — Dirichlet Dir(µ | α)
– Gauss N(x | µ, Σ) — Gauss N(µ | µ₀, A)
– 1D Gauss N(x | µ, λ⁻¹) — Gamma Gam(λ | a, b)
– nD Gauss N(x | µ, Λ⁻¹) — Wishart Wish(Λ | W, ν)
– nD Gauss N(x | µ, Λ⁻¹) — Gauss-Wishart N(µ | µ₀, (βΛ)⁻¹) Wish(Λ | W, ν)
9:31
The (cumulative) probability distribution $F(y) = P(x \le y) = \int_{-\infty}^y p(x)\, dx \in [0,1]$ is the cumulative integral with $\lim_{y\to\infty} F(y) = 1$.
(In discrete domain: probability distribution and probability mass function P (x) ∈ [0, 1] are
used synonymously.)
Gaussian distribution
• 1-dim: $N(x\,|\,\mu,\sigma^2) = \frac{1}{|2\pi\sigma^2|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^2/\sigma^2}$
• n-dim Gaussian in normal form:
  $N(x\,|\,\mu,\Sigma) = \frac{1}{|2\pi\Sigma|^{1/2}}\,\exp\{-\tfrac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\}$
and in canonical form:
  $N[x\,|\,a, A] = \frac{\exp\{-\frac{1}{2}a^\top A^{-1}a\}}{|2\pi A^{-1}|^{1/2}}\,\exp\{-\tfrac{1}{2}x^\top A x + x^\top a\}$   (9)
with precision matrix $A = \Sigma^{-1}$ and coefficient $a = \Sigma^{-1}\mu$ (and mean $\mu = A^{-1}a$).
Note: $|2\pi\Sigma| = \det(2\pi\Sigma) = (2\pi)^n \det(\Sigma)$
Gaussian identities
Symmetry: $N(x\,|\,a, A) = N(a\,|\,x, A) = N(x - a\,|\,0, A)$
Product:
  $N(x\,|\,a, A)\; N(x\,|\,b, B) = N[x\,|\,A^{-1}a + B^{-1}b,\; A^{-1}+B^{-1}]\; N(a\,|\,b, A+B)$
  $N[x\,|\,a, A]\; N[x\,|\,b, B] = N[x\,|\,a+b,\; A+B]\; N(A^{-1}a\,|\,B^{-1}b,\; A^{-1}+B^{-1})$
"Propagation":
  $\int_y N(x\,|\,a + Fy, A)\; N(y\,|\,b, B)\; dy = N(x\,|\,a + Fb,\; A + FBF^\top)$
Transformation:
  $N(Fx + f\,|\,a, A) = \frac{1}{|F|}\, N(x\,|\,F^{-1}(a-f),\; F^{-1}AF^{-\top})$
Marginal & conditional:
  $N\left(\begin{pmatrix}x\\ y\end{pmatrix} \,\middle|\, \begin{pmatrix}a\\ b\end{pmatrix}, \begin{pmatrix}A & C\\ C^\top & B\end{pmatrix}\right) = N(x\,|\,a, A)\cdot N(y\,|\,b + C^\top A^{-1}(x - a),\; B - C^\top A^{-1}C)$
  $P(D\,|\,\mu,\Sigma) = \prod_i N(x_i\,|\,\mu,\Sigma)$
  $\text{argmax}_\mu P(D\,|\,\mu,\Sigma) = \frac{1}{n}\sum_{i=1}^n x_i$
  $\text{argmax}_\Sigma P(D\,|\,\mu,\Sigma) = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^\top$
• Assume we are initially uncertain about µ (but know Σ). We can express this uncertainty using again a Gaussian N[µ | a, A]. Given data we have
  $P(\mu\,|\,D) = N\big(\mu\,\big|\,\tfrac{1}{n}\sum_i x_i,\; \tfrac{1}{n}\Sigma\big)$
  $q(x) := \sum_{i=1}^N w_i\,\delta(x - x^i)$
where $\delta(x - x^i)$ is the δ-distribution.
9:39
• Given a space of events Ω (e.g., outcomes of a trial, a game, etc) the utility is a
function
U : Ω→R
• The utility represents preferences as a single scalar – which is not always obvi-
ous (cf. multi-objective optimization)
• Decision Theory: making decisions (that determine p(x)) that maximize expected utility
  $E\{U\}_p = \int_x U(x)\; p(x)$
• Concave utility functions imply risk aversion (and convex, risk-taking)
9:41
Entropy
• The neg-log (−log p(x)) of a distribution reflects something like "error":
  – neg-log of a Gaussian ↔ squared error
  – neg-log likelihood ↔ prediction error
• The (−log p(x)) is the "optimal" coding length you should assign to a symbol x. This will minimize the expected length of an encoding
  $H(p) = \int_x p(x)\,[-\log p(x)]$
Kullback-Leibler divergence
• Assume you use a "wrong" distribution q(x) to decide on the coding length of symbols drawn from p(x). The expected length of an encoding is
  $\int_x p(x)\,[-\log q(x)] \;\ge\; H(p)$
• The difference
  $D\big(p\,\big\|\,q\big) = \int_x p(x)\,\log\frac{p(x)}{q(x)} \;\ge\; 0$
is called Kullback-Leibler divergence
• Example: What is the probability that a solitaire would come out successful? (Original story by Stan Ulam.) Instead of trying to analytically compute this, generate many random solitaires and count.
• Naming: The method was developed in the 1940s, when computers became faster. Fermi, Ulam and von Neumann initiated the idea. von Neumann called it "Monte Carlo" as a code name.
9:45
Rejection Sampling
• How can we generate i.i.d. samples xi ∼ p(x)?
• Assumptions:
– We can sample x ∼ q(x) from a simpler distribution q(x) (e.g., uniform), called
proposal distribution
– We can numerically evaluate p(x) for a specific x (even if we don’t have an analytic
expression of p(x))
– There exists M such that ∀x : p(x) ≤ M q(x) (which implies q has larger or equal
support as p)
• Rejection Sampling:
– Sample a candidate x ∼ q(x)
– With probability $\frac{p(x)}{M\,q(x)}$ accept x and add to S; otherwise reject
– Repeat until |S| = n
• This generates an unweighted sample set S to approximate p(x)
9:46
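A minimal numpy sketch of this procedure; the target and proposal in the usage example are illustrative assumptions:

import numpy as np

def rejection_sample(p, q_sample, q_pdf, M, n, seed=0):
    """draw n samples from p using proposal q, assuming p(x) <= M q(x) for all x"""
    rng = np.random.default_rng(seed)
    S = []
    while len(S) < n:
        x = q_sample(rng)
        if rng.uniform() < p(x) / (M * q_pdf(x)):   # accept with probability p(x) / (M q(x))
            S.append(x)
    return np.array(S)

# example: (unnormalized) standard normal truncated to [-3, 3] via a uniform proposal
p = lambda x: np.exp(-0.5 * x**2)
samples = rejection_sample(p, lambda rng: rng.uniform(-3, 3), lambda x: 1/6, M=6.0, n=1000)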
Importance sampling
• Assumptions:
– We can sample x ∼ q(x) from a simpler distribution q(x) (e.g., uniform)
– We can numerically evaluate p(x) for a specific x (even if we don’t have an analytic
expression of p(x))
• Importance Sampling:
– Sample a candidate x ∼ q(x)
– Add the weighted sample $\big(x, \frac{p(x)}{q(x)}\big)$ to S
– Repeat n times
• This generates a weighted sample set S to approximate p(x)
  The weights $w_i = \frac{p(x_i)}{q(x_i)}$ are called importance weights
Applications
• MCTS estimates the Q-function at branchings in decision trees or games
• Inference in graphical models (models involving many depending random vari-
ables)
9:48
9 Exercises
9.1 Exercise 1
There will be no credit points for the first exercise (it is done in class) – we'll do them on the fly. For those of you that had lectures with me before this is redundant—you're free to skip the tutorial.
Read at least until section 5 of Pedro Domingos’s A Few Useful Things to Know about
Machine Learning http://homes.cs.washington.edu/˜pedrod/papers/cacm12.
pdf. Be able to explain roughly what generalization and the bias-variance-tradeoff
(Fig. 1) are.
XA + A> = I
Let x ∈ Rn , y ∈ Rd , A ∈ Rd×n .
a) What is $\frac{\partial}{\partial x} x$? (Of what type/dimension is this thing?)
b) What is $\frac{\partial}{\partial x}[x^\top x]$?
c) Let B be symmetric (and pos.def.). What is the minimum of $(Ax - y)^\top(Ax - y) + x^\top Bx$ w.r.t. x?
9.1.4 Coding
Future exercises will need you to code some Machine Learning methods. You are
free to choose your programming language. If you’re new to numerics we rec-
ommend Python (SciPy & scikit-learn) or Matlab/Octave. I’ll support C++, but
recommend it really only to those familiar with C++.
To get started, try to just plot the data set http://ipvs.informatik.uni-stuttgart.
de/mlr/marc/teaching/data/dataQuadReg2D.txt, e.g. in Octave:
D = importdata('dataQuadReg2D.txt');
plot3(D(:,1),D(:,2),D(:,3), 'ro')
Or in Python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
D = np.loadtxt('dataQuadReg2D.txt')
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(D[:,0],D[:,1],D[:,2], 'ro')
plt.show()
Or you can store the grid data in a file and use gnuplot, e.g.:
9.2 Exercise 2
9.2.1 Getting Started with Ridge Regression (10 Points)
On the course webpage there are two simple data sets dataLinReg2D.txt and
dataQuadReg2D.txt. Each line contains a data entry (x, y) with x ∈ R2 and
y ∈ R; the last entry in a line refers to y.
a) The examples demonstrate plain linear regression for dataLinReg2D.txt. Ex-
tend them to include a regularization parameter λ. Report the squared error on the
full data set when trained on the full data set. (3 P)
b) Do the same for dataQuadReg2D.txt while first computing quadratic features.
(4 P)
c) Implement cross-validation (slide 02:17) to evaluate the prediction error of the
quadratic model for a third, noisy data set dataQuadReg2D_noisy.txt. Report
1) the squared error when training on all data (=training error), and 2) the mean squared error $\hat\ell$ from cross-validation. (3 P)
Repeat this for different Ridge regularization parameters λ. (Ideally, generate a nice
bar plot of the generalization error, including deviation, for various λ.)
#!/usr/bin/env python
# encoding: utf-8
"""
NOTE: the operators + - * / are element wise operation. If you want
matrix multiplication use ‘‘dot‘‘ or ‘‘mdot‘‘!
"""
from __future__ import print_function
import numpy as np
from numpy import dot
from numpy.linalg import inv
from numpy.linalg import multi_dot as mdot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D
# 3D plotting
###############################################################################
# Helper functions
def prepend_one(X):
"""prepend a one vector to X."""
return np.column_stack([np.ones(X.shape[0]), X])
def grid2d(start, end, num=50):
"""Create an 2D array where each row is a 2D coordinate.
np.meshgrid is pretty annoying!
"""
dom = np.linspace(start, end, num)
X0, X1 = np.meshgrid(dom, dom)
return np.column_stack([X0.flatten(), X1.flatten()])
###############################################################################
# load the data
data = np.loadtxt("dataLinReg2D.txt")
print("data.shape:", data.shape)
# split into features and labels
X, y = data[:, :2], data[:, 2]
print("X.shape:", X.shape)
print("y.shape:", y.shape)
# 3D plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d') # the projection arg is important!
ax.scatter(X[:, 0], X[:, 1], y, color="red")
ax.set_title("raw data")
plt.draw()
# show, use plt.show() for blocking
C++
(by Marc Toussaint)
#include <Core/array.h>
//===========================================================================
void gettingStarted() {
//load the data
arr D = FILE("dataLinReg2D.txt");
//plot it
FILE("z.1") <<D;
  gnuplot("splot 'z.1' us 1:2:3 w p", true);
}

//===========================================================================
int main(int argc, char *argv[]) {
rai::initCmdLine(argc,argv);
gettingStarted();
return 0;
}
Matlab
(by Peter Englert)
clear;

% load the data (the last column is the output y)
D = importdata('dataLinReg2D.txt');
X = D(:,1:2); Y = D(:,3); n = size(X,1);

% prepend 1s to inputs
X = [ones(n,1),X];

% compute optimal beta
beta = inv(X'*X)*X'*Y;
% display the function
[a b] = meshgrid(-2:.1:2,-2:.1:2);
Xgrid = [ones(length(a(:)),1),a(:),b(:)];
Ygrid = Xgrid*beta;
Ygrid = reshape(Ygrid,size(a));
h = surface(a,b,Ygrid);
view(3);
grid on;
9.3 Exercise 3
(BSc Data Science students may skip b) parts.)
The function [z]+ = max(0, z) is called hinge. In ML, a hinge penalizes errors (when
z > 0) linearly, but raises no costs at all if z < 0.
Assume we have a single data point (x, y ∗ ) with class label y ∗ ∈ {1, .., M }, and the
discriminative function f (y, x). We penalize the discriminative function with the
one-vs-all hinge loss
  $L^{hinge}(f) = \sum_{y\ne y^*} \big[1 - (f(y^*, x) - f(y, x))\big]_+$
a) What is the gradient $\frac{\partial L^{hinge}(f)}{\partial f(y,x)}$ of the loss w.r.t. the discriminative values? For simplicity, distinguish the cases of taking the derivative w.r.t. $f(y^*, x)$ and w.r.t. $f(y, x)$ for $y \ne y^*$. (3 P)
b) Now assume the parametric model $f(y, x) = \phi(x)^\top\beta_y$, where for every y we have different parameters $\beta_y \in \mathbb{R}^d$. And we have a full data set $D = \{(x_i, y_i)\}_{i=1}^n$ with class labels $y_i \in \{1, .., M\}$ and loss
  $L^{hinge}(f) = \sum_{i=1}^n \sum_{y\ne y_i} \big[1 - (f(y_i, x_i) - f(y, x_i))\big]_+$
What is the gradient $\frac{\partial L^{hinge}(f)}{\partial\beta_y}$? (2 P)
9.4 Exercise 4
At the bottom several ’extra’ exercises are given. These are not part of the tutorial
and only for your interest.
(There were questions about an API documentation of the C++ code. See https://
github.com/MarcToussaint/rai-maintenance/blob/master/help/doxygen.md.)
On the course webpage there is a data set data2Class.txt for a binary classifi-
cation problem. Each line contains a data entry (x, y) with x ∈ R2 and y ∈ {0, 1}.
a) Compute the optimal parameters β and the mean neg-log-likelihood $-\frac{1}{n}\log L(\beta)$ for logistic regression using linear features. Plot the probability P(y = 1 | x) over a 2D grid of test points. (4 P)
• Recall the objective function, and its gradient and Hessian that we derived in the last exercise:
  $L(\beta) = -\sum_{i=1}^n \log P(y_i\,|\,x_i) + \lambda||\beta||^2$   (13)
  $\qquad\; = -\sum_{i=1}^n \Big[ y_i \log p_i + (1-y_i)\log[1 - p_i] \Big] + \lambda||\beta||^2$   (14)
  $\nabla L(\beta) = \frac{\partial L(\beta)}{\partial\beta}^\top = \sum_{i=1}^n (p_i - y_i)\,\phi(x_i) + 2\lambda I\beta = X^\top(p - y) + 2\lambda I\beta$   (15)
  $\nabla^2 L(\beta) = \frac{\partial^2 L(\beta)}{\partial\beta^2} = \sum_{i=1}^n p_i(1-p_i)\,\phi(x_i)\,\phi(x_i)^\top + 2\lambda I = X^\top W X + 2\lambda I$   (16)
  where $p(x) := P(y{=}1\,|\,x) = \sigma(\phi(x)^\top\beta)$, $p_i := p(x_i)$, $W := \text{diag}(p\circ(1-p))$   (17)
• Setting the gradient equal to zero can’t be done analytically. Instead, optimal
parameters can quickly be found by iterating Newton steps: For this, initialize
β = 0 and iterate
β ← β − (∇2 L(β))-1 ∇L(β) . (18)
You usually need to iterate only a few times (∼10) til convergence.
• As you did for regression, plot the discriminative function f (x) = φ(x)>β or
the class probability function p(x) = σ(f (x)) over a grid.
(Warning: This is one of these exercises that do not have “one correct solution”.)
Consider data of tuples (x, y1 , y2 ) where
  $P(y|x) = \frac{\exp f(x,y)}{\sum_{y'}\exp f(x,y')}$
Prove that, in the binary classification case, you can assume f (x, 0) = 0 without
loss of generality.
This results in
  $P(y{=}1|x) = \frac{\exp f(x,1)}{1 + \exp f(x,1)} = \sigma(f(x,1))$.
(Hint: first assume f (x, y) = φ(x, y)>β, and then define a new discriminative func-
tion f 0 as a function of the old one, such that f 0 (x, 0) = 0 and for which P (y|x)
maintains the same expressibility.)
9.5 Exercise 5
In these two exercises you’ll program a NN from scratch, use neural random fea-
tures for classification, and train it too. Don’t use tensorflow yet, but the same
language you used for standard regression & classification. Take slide 04:14 as ref-
erence for NN equations.
(DS BSc students may skip 2 b-c, i.e. should at least try to code/draft also the back-
ward pass, but ok if no working solutions.)
Use your NN to map each input x to features φ(x) = x_{L−1}, then use these features
as input to logistic regression as done in the previous exercise. (Initialize a separate
β and optimize by iterating Newton steps.)
First consider just L = 2 (just one hidden layer, so that x_{L−1} are the features) and
h_1 = 300. (2 P)
Extra) How does it perform if we initialize all b_l = 0? How would it perform if the
input were rescaled to x ← 10⁵ x? How does the performance vary with h_1 and
with L?
We now also train the network using backpropagation and hinge loss. We test again
on data2Class.txt. As this is a binary classification problem we only need one
output neuron fβ (x). If fβ (x) > 0 we classify 1, otherwise we classify 0.
Reuse the “forward(x, β)” coded above.
a) Code a routine "backward(δ_{L+1}, x, w)" that performs the backpropagation steps
and collects the gradients dℓ/dw_l.
For this, let us use a hinge loss. In the binary case (when you use only one out-
put neuron), it is simplest to redefine y ∈ {−1, +1}, and define the hinge loss as
`(f, y) = max(0, 1 − f y), which has the loss gradient δL = −y[1 − yf > 0] at the
output neuron.
Run forward and backward propagation for each x, y in the dataset, and sum up
the respective gradients. (2 P)
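A minimal sketch of such forward/backward routines in Python (not an official solution; it assumes a leaky-ReLU activation and stores the parameters as lists Ws, bs of weight matrices and bias vectors rather than a single vector w):

import numpy as np

def leaky_relu(z):
    return np.where(z > 0, z, 0.01 * z)

def leaky_relu_grad(z):
    return np.where(z > 0, 1.0, 0.01)

def forward(x, Ws, bs):
    # returns all layer inputs x_0..x_L and pre-activations z_1..z_L; the last layer is linear
    xs, zs = [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ xs[-1] + b
        zs.append(z)
        xs.append(z if l == len(Ws) - 1 else leaky_relu(z))
    return xs, zs

def backward(delta_out, xs, zs, Ws):
    # delta_out is the loss gradient at the output neuron, e.g. -y*[1 - y*f > 0] for the hinge loss
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    delta = np.atleast_1d(delta_out)
    for l in reversed(range(len(Ws))):
        dWs[l] = np.outer(delta, xs[l])          # dl/dW_l
        dbs[l] = delta                           # dl/db_l
        if l > 0:
            delta = (Ws[l].T @ delta) * leaky_relu_grad(zs[l - 1])
    return dWs, dbs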
b) Code a routine which optimizes the parameters using gradient descent:
∀ l = 1, .., L :   W_l ← W_l − α dℓ/dW_l ,   b_l ← b_l − α dℓ/db_l
with step size α = .01. Run until convergence (should take a few thousand steps).
Print out the loss function ℓ at every 100th iteration, to verify that the parameter
optimization is indeed decreasing the loss. (2 P)
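A sketch of this training loop, reusing the forward/backward sketch above (it assumes data arrays X, Y with labels in {−1, +1} and already initialized Ws, bs):

alpha = 0.01
for step in range(5000):
    dWs_sum = [np.zeros_like(W) for W in Ws]
    dbs_sum = [np.zeros_like(b) for b in bs]
    loss = 0.0
    for x, y in zip(X, Y):                       # y in {-1, +1}
        xs, zs = forward(x, Ws, bs)
        f = xs[-1][0]                            # single output neuron
        loss += max(0.0, 1.0 - y * f)            # hinge loss
        delta_out = -y if (1.0 - y * f) > 0 else 0.0
        dWs, dbs = backward(delta_out, xs, zs, Ws)
        for l in range(len(Ws)):
            dWs_sum[l] += dWs[l]
            dbs_sum[l] += dbs[l]
    for l in range(len(Ws)):                     # gradient descent step
        Ws[l] -= alpha * dWs_sum[l]
        bs[l] -= alpha * dbs_sum[l]
    if step % 100 == 0:
        print(step, loss)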
c) Run for h = (2, 20, 1) and visualize the prediction by plotting σ(fβ (x)) over a
2-dimensional grid. (1 P)
9.6 Exercise 6
b) Run a session to compute the loss, gradient and Hessian. Feed random values
into the input placeholders. The gradient and Hessian can be calculated with tf.gradients()
and tf.hessians(). Compare them to the analytical solution using the same ran-
dom values. (2 P)
Code calculating the analytical solution of the loss, the gradient and the Hessian in
Python:
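A minimal numpy sketch of these analytical quantities, following equations (13)-(16) (it assumes a feature matrix X, labels y ∈ {0, 1}, and regularization λ):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_hessian(X, y, beta, lam):
    # analytical loss, gradient and Hessian of regularized logistic regression
    p = sigmoid(X @ beta)
    L = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + lam * (beta @ beta)
    g = X.T @ (p - y) + 2 * lam * beta
    W = p * (1 - p)
    H = X.T @ (X * W[:, None]) + 2 * lam * np.eye(X.shape[1])
    return L, g, H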
Now you will directly use tensorflow commands for creating neural networks.
# two reusable layer operations: a 100-unit leaky-ReLU layer and a 1-unit linear output layer
relu_layer_operation = tf.layers.Dense(100,
    activation=tf.nn.leaky_relu,
    kernel_initializer=tf.initializers.random_uniform(-.1, .1),
    bias_initializer=tf.initializers.random_uniform(-1., 1.))
linear_layer_operation = tf.layers.Dense(1,
    activation=None,
    kernel_initializer=tf.initializers.random_uniform(-.1, .1),
    bias_initializer=tf.initializers.random_uniform(-.01, .01))

# note: applying the same Dense object twice shares its weights between both hidden layers
hidden1 = relu_layer_operation(input)
hidden2 = relu_layer_operation(hidden1)
model_output = linear_layer_operation(hidden2)
b) Now we want to use a neural network on real images. Download the Bel-
giumTS¹ dataset from: https://btsd.ethz.ch/shareddata/BelgiumTSC/
BelgiumTSC_Training.zip (training data) and https://btsd.ethz.ch/shareddata/
BelgiumTSC/BelgiumTSC_Testing.zip (test data). The dataset consists of
traffic signs according to 62 different classes. Create a neural network architec-
ture and train it on the training dataset. You can use any architecture you want, but
use at least one convolutional layer. Report the classification error on the test set.
(3 P)
Hints: Use tf.layers.Conv2D to create convolutional layers, and
tf.contrib.layers.flatten to reshape an image layer into a vector layer (as
input to a dense layer). The following code can be used to load data, rescale it and
display images:
1 Belgium traffic sign dataset; Radu Timofte*, Markus Mathias*, Rodrigo Benenson, and Luc Van
Gool, Traffic Sign Recognition - How far are we from the solution?, International Joint Conference on
Neural Networks (IJCNN 2013), August 2013, Dallas, USA
import os
import skimage.data
from skimage import transform
from skimage.color import rgb2gray
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

def load_data(data_directory):
    # each subdirectory name is the class label; it contains the .ppm images of that class
    directories = [d for d in os.listdir(data_directory)
                   if os.path.isdir(os.path.join(data_directory, d))]
    labels = []
    images = []
    for d in directories:
        label_directory = os.path.join(data_directory, d)
        file_names = [os.path.join(label_directory, f)
                      for f in os.listdir(label_directory)
                      if f.endswith(".ppm")]
        for f in file_names:
            images.append(skimage.data.imread(f))
            labels.append(int(d))
    return np.array(images), np.array(labels)

def plot_data(signs, labels):
    for i in range(len(signs)):
        plt.subplot(4, len(signs) // 4 + 1, i + 1)   # integer division for the subplot grid
        plt.axis('off')
        plt.title("Label {0}".format(labels[i]))
        plt.imshow(signs[i])
    plt.subplots_adjust(wspace=0.5)
    plt.show()
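A minimal sketch of a possible classification network in the TensorFlow 1.x style used above (the architecture, the assumed preprocessing to 28x28 gray images, and all hyperparameters are assumptions, not a prescribed solution):

# assumed preprocessing: images resized to 28x28 and converted to gray (e.g. with
# transform.resize and rgb2gray), stacked into a [None, 28, 28, 1] tensor
images = tf.placeholder(tf.float32, [None, 28, 28, 1])
labels = tf.placeholder(tf.int32, [None])

conv = tf.layers.Conv2D(32, 3, activation=tf.nn.relu)(images)
pool = tf.layers.MaxPooling2D(2, 2)(conv)
flat = tf.contrib.layers.flatten(pool)
hidden = tf.layers.Dense(128, activation=tf.nn.relu)(flat)
logits = tf.layers.Dense(62, activation=None)(hidden)       # 62 traffic sign classes

loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
correct = tf.equal(tf.cast(tf.argmax(logits, 1), tf.int32), labels)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))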
(Bonus means: The extra 3 points will count to your total, but not to the required
points (from which you eventually need 50%).)
We test SGD on a quadratic function f(x) = ½ x⊤Hx. A Newton method would need
access to the exact Hessian H and directly step to the optimum x* = 0. But SGD
only has access to an estimate of the gradient. Ensuring proper convergence is
much more difficult for SGD.
Let x ∈ R^d for d = 1000. Generate a sparse random matrix J ∈ R^{n×d}, for n = 10⁵,
as follows: In each row, fill in 10 random numbers drawn from N(0, σ²) at random
places. Each row has either σ = 1 or σ = 100, chosen randomly. We now define
H = J⊤J. Note that H = Σ_{i=1}^n J_i⊤ J_i is a sum of rank-1 matrices.
1. Initialize x = 1_d.
b) Plot the learning curves, i.e., the full and the stochastic error. How well do they
match? In what sense does optimization converge? Discuss the stationary distribu-
tion of the optimum. (1P)
c) Extra: Test variants: exponential cooling of the learning rate, Nesterov momen-
tum, and ADAM.
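A sketch of the data generation and of a mini-batch gradient estimate for this exercise (the batch size and step size are assumptions and need tuning):

import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
d, n, k = 1000, 100000, 10
rows = np.repeat(np.arange(n), k)
cols = np.concatenate([rng.choice(d, k, replace=False) for _ in range(n)])
sigmas = np.where(rng.rand(n) < 0.5, 1.0, 100.0)
data = rng.randn(n, k) * sigmas[:, None]
J = sp.csr_matrix((data.ravel(), (rows, cols)), shape=(n, d))

def stochastic_grad(x, batch=32):
    # the full gradient of f(x) = 0.5 x^T H x is H x = J^T (J x); estimate it from a row mini-batch
    idx = rng.randint(0, n, batch)
    Ji = J[idx]
    return (n / batch) * (Ji.T @ (Ji @ x))

x = np.ones(d)
alpha = 1e-7    # the step size is an assumption; it has to be tuned against the spectrum of H
for t in range(1000):
    x = x - alpha * stochastic_grad(x)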
9.7 Exercise 7
(DS BSc students can skip exercise 3.)
Reconsider the equations for Kernel Ridge regression given on slide 05:3, and – if
features are given – the definition of the kernel function and κ(x) in terms of the
features as given on slide 05:2.
a) Prove that Kernel Ridge regression with the linear kernel function k(x, x′) =
1 + x⊤x′ is equivalent to Ridge regression with linear features φ(x) = (1, x)⊤. (1 P)
b) In Kernel Ridge regression, the optimal function is of the form f(x) = κ(x)⊤α
and therefore linear in κ(x). In plain Ridge regression, the optimal function is of the
form f(x) = φ(x)⊤β and linear in φ(x). Prove that choosing k(x, x′) = (1 + x⊤x′)²
implies that f(x) = κ(x)⊤α is a second-order polynomial over x. (2 P)
c) Equally, note that choosing the squared exponential kernel k(x, x′) = exp(−γ |x −
x′|²) implies that the optimal f(x) is linear in radial basis function (RBF) features.
Does this necessarily imply that Kernel Ridge regression with the squared exponen-
tial kernel and plain Ridge regression with RBF features are exactly equivalent?
(Equivalent means: they have the same optimal function.) Distinguish the cases λ = 0
(no regularization) and λ > 0. (1 P)
(Voluntary: Test practically, on the regression problem from Exercise 2, whether
Kernel Ridge regression and RBF features are exactly equivalent.)
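As a quick numerical check of part a), here is a small sketch (toy data; any small data set works) comparing Ridge regression with features (1, x) against Kernel Ridge regression with the linear kernel:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 2)                             # toy inputs
y = rng.randn(30)
lam = 0.1

Phi = np.hstack([np.ones((30, 1)), X])           # linear features (1, x)
beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)

K = 1.0 + X @ X.T                                # linear kernel k(x, x') = 1 + x^T x'
alpha = np.linalg.solve(K + lam * np.eye(30), y)

X_test = rng.randn(5, 2)
f_ridge = np.hstack([np.ones((5, 1)), X_test]) @ beta
f_kernel = (1.0 + X_test @ X.T) @ alpha
print(np.allclose(f_ridge, f_kernel))            # should print True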
The “kernel trick” is generally applicable whenever the “solution” (which may be
the predictive function f ridge (x), or the discriminative function, or principal com-
ponents...) can be written in a form that only uses the kernel function k(x, x0 ), but
never features φ(x) or parameters β explicitly.
Derive a kernelization of Logistic Regression. That is, think about how you could
perform the Newton iterations based only on the kernel function k(x, x0 ).
Tips: Reformulate the Newton iterations (as given in Exercise 4).
Note that you'll need to handle the X⊤(p − y) and the 2λIβ terms differently.
Then think about what is actually being iterated in the kernelized case: surely we
cannot iteratively update the optimal parameters β themselves, because we want to
rewrite the equations so that they never touch β or φ(x) explicitly.
9.8 Exercise 8
(DS BSc students should nominally achieve 8 Pts on this sheet.)
For data D = {x_i}_{i=1}^n, x_i ∈ R^d, we introduced PCA as a method that finds lower-
dimensional representations z_i ∈ R^p of each data point such that x_i ≈ V z_i + µ. PCA
chooses V, µ and z_i to minimize the reproduction error
Σ_{i=1}^n ||x_i − (V z_i + µ)||² .
On the webpage find and download the Yale face database http://ipvs.informatik.
uni-stuttgart.de/mlr/marc/teaching/data/yalefaces.tgz (optionally use yalefaces_c,
which is a slightly cleaned version of the same dataset). The file contains gif images
of 165 faces.
1. Write a routine to load all images into a big data matrix X ∈ R^{165×77760}, where
each row contains a gray image.
In Octave, images can easily be read using I=imread("subject01.gif");
and imagesc(I);. You can loop over files using files=dir("."); and
files(:).name. Python tips: see the sketch after this list.
2. Compute the mean face µ = (1/n) Σ_i x_i and center the whole data matrix, X̃ =
X − 1_n µ⊤.
3. Compute the singular value decomposition X̃ = U D V⊤ for the centered data
matrix.
In Octave/Matlab, use [U, S, V] = svd(X, "econ"), where the "econ"
ensures you don't run out of memory.
In Python, one option is numpy.linalg.svd with full_matrices=False (see the sketch below).
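A minimal Python sketch of steps 1-3 (the file pattern and the PIL-based loading are assumptions about how you unpacked the data; p is chosen arbitrarily here):

import glob
import numpy as np
from PIL import Image                     # one way to read the gif images (an assumption)

files = sorted(glob.glob("yalefaces/subject*"))   # adapt the pattern to where you unpacked the data
X = np.stack([np.asarray(Image.open(f).convert("L"), dtype=float).ravel() for f in files])

mu = X.mean(axis=0)                                # mean face
Xc = X - mu                                        # centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # economy-size SVD
p = 20                                             # number of components (chosen arbitrarily here)
Z = U[:, :p] * S[:p]                               # p-dimensional representations z_i
X_rec = Z @ Vt[:p] + mu                            # reconstructions V z_i + mu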
9.9 Exercise 9
(DS BSc students may skip exercise 2, but still please read about Mixture of Gaus-
sians and the explanations below.)
Introduction
There is no lecture on Thursday. To still make progress, please follow this guide to
learn some new material yourself. The subject is k-means clustering and Mixture
of Gaussians.
k-means clustering: The method is fully described on slide 06:36 of the lecture.
Again, I present the method as derived from an optimality principle. Most other
references describe k-means clustering just as a procedure. Also Wikipedia https:
//en.wikipedia.org/wiki/K-means_clustering gives a more verbose ex-
planation of this procedure. In my words, this is the procedure:
– We have data D = {x_i}_{i=1}^n, with x_i ∈ R^d. We want to cluster the data into K different
clusters. K is chosen ad hoc.
– Each cluster is represented only by its mean (or center) µ_k ∈ R^d, for k = 1, .., K.
– We initially assign each µ_k to a random data point, µ_k ← x_i with i ∼ U{1, .., n}.
– The algorithm also maintains an assignment mapping c : {1, .., n} → {1, .., K}, which
assigns each data point x_i to a cluster k = c(i).
– For given centers µ_k, we update all assignments using
∀i : c(i) ← argmin_{c(i)} Σ_j (x_j − µ_{c(j)})² = argmin_k (x_i − µ_k)².
– For given assignments c(i), we update all centers using
∀k : µ_k ← (1/|{i : c(i) = k}|) Σ_{i : c(i)=k} x_i ,
that is, we set the centers equal to the mean of the data points assigned to the cluster.
– The last two steps are iterated until the assignment does not vary. (A minimal code sketch follows below.)
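A minimal k-means sketch in Python (not an official solution; the vectorized distance computation assumes the data fits comfortably in memory):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.RandomState(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)].astype(float)   # centers initialized at random data points
    c = -np.ones(n, dtype=int)                               # assignment mapping c(i)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c_new = dists.argmin(axis=1)
        if np.array_equal(c_new, c):
            break
        c = c_new
        # center step: each center becomes the mean of its assigned points
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return mu, c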
Mixture of Gaussians: For more background, see http://ipvs.informatik.uni-stuttgart.
de/mlr/marc/teaching/15-MachineLearning/08-graphicalModels-Learning.pdf.
Bishop's book https://www.microsoft.com/en-us/research/people/
cmbishop/#!prml-book also gives a very good introduction. But we only need
the procedural understanding here:
– We have data D = {x_i}_{i=1}^n, with x_i ∈ R^d. We want to cluster the data into K different
clusters. K is chosen ad hoc.
– Each cluster is represented by its mean (or center) µ_k ∈ R^d and a covariance matrix Σ_k.
This covariance matrix describes an ellipsoidal shape of each cluster.
– We initially assign each µ_k to a random data point, µ_k ← x_i with i ∼ U{1, .., n}, and
each covariance matrix to uniform (if the data is roughly uniform).
– The core difference to k-means: The algorithm also maintains a probabilistic (or soft)
assignment mapping q_i(k) ∈ [0, 1], such that Σ_{k=1}^K q_i(k) = 1. The number q_i(k) is the
probability of assigning data x_i to cluster k (or the probability that data x_i originates
from cluster k). So, each data index i is mapped to a probability over k, rather than a
specific k as in k-means.
– For given cluster parameters µ_k, Σ_k, we update all the probabilistic assignments using
∀i,k : q_i(k) ← N(x_i | µ_k, Σ_k) = (1 / |2πΣ_k|^{1/2}) exp(−½ (x_i − µ_k)⊤ Σ_k⁻¹ (x_i − µ_k))
∀i,k : q_i(k) ← q_i(k) / Σ_{k′} q_i(k′) ,
where the second line normalizes the probabilistic assignments to ensure Σ_{k=1}^K q_i(k) = 1.
– For given probabilistic assignments q_i(k), we update all cluster parameters using
∀k : µ_k ← (1 / Σ_i q_i(k)) Σ_i q_i(k) x_i
∀k : Σ_k ← (1 / Σ_i q_i(k)) Σ_i q_i(k) x_i x_i⊤ − µ_k µ_k⊤ ,
where µ_k is the weighted mean of the data assigned to cluster k (weighted with q_i(k)),
and similarly for Σ_k.
– In this description, I skipped another parameter, π_k, which is less important and which we can
discuss in class. (A minimal EM sketch follows below.)
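A minimal EM sketch for a Gaussian Mixture in Python (again not an official solution; the π_k weights are omitted here, matching the procedural description above):

import numpy as np

def em_mog(X, K, iters=100, seed=0):
    rng = np.random.RandomState(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)].astype(float)
    Sigma = np.array([np.eye(d) for _ in range(K)])
    q = np.zeros((n, K))
    for _ in range(iters):
        # E-step: q_i(k) proportional to N(x_i | mu_k, Sigma_k)
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(Sigma[k])
            norm = 1.0 / np.sqrt(np.linalg.det(2 * np.pi * Sigma[k]))
            q[:, k] = norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))
        q /= q.sum(axis=1, keepdims=True)
        # M-step: weighted means and covariances
        for k in range(K):
            w = q[:, k] / q[:, k].sum()
            mu[k] = w @ X
            Sigma[k] = (X * w[:, None]).T @ X - np.outer(mu[k], mu[k])
    return mu, Sigma, q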
On the webpage find and download the Yale face database http://ipvs.informatik.
uni-stuttgart.de/mlr/marc/teaching/data/yalefaces_cropBackground.tgz. The file
contains gif images of 136 faces.
We’ll cluster the faces using k-means in K = 4 clusters.
Download the data set mixture.txt from the course webpage, containing n =
300 2-dimensional points. Load it in a data matrix X ∈ Rn×2 .
a) Implement the EM-algorithm for a Gaussian Mixture on this data set. Choose
K = 3. Initialize by choosing the three means µk to be different randomly selected
data points xi (i random in {1, .., n}) and the covariances Σk = I (a more robust
choice would be the covariance of the whole data). Iterate EM starting with the first
E-step (computing probabilistic assignments) based on these initializations. Repeat
with random restarts—how often does it converge to the optimum? (3 P)
b) Do exactly the same, but this time initialize the posterior qi (k) randomly (i.e.,
assign each point to a random cluster: for each point xi select k 0 = rand(1 : K) and
set qi (k) = [k = k 0 ]); then start EM with the first M-step. Is this better or worse than
the previous way of initialization? (1 P)
One of the central messages of the whole course is: To solve (learning) problems,
first formulate an objective function that defines the problem, then derive algo-
rithms to find/approximate the optimal solution. That should also hold for clus-
tering.
k-means finds centers µ_k and assignments c : i ↦ k to minimize Σ_i (x_i − µ_{c(i)})².
An alternative class of objective functions for clustering are graph cuts. Consider
n data points with similarities w_ij, forming a weighted graph. We denote by W =
(w_ij) the symmetric weight matrix, and by D = diag(d_1, .., d_n), with d_i = Σ_j w_ij, the
degree matrix. For simplicity we consider only 2-cuts, that is, cutting the graph into
two disjoint clusters, C_1 ∪ C_2 = {1, .., n}, C_1 ∩ C_2 = ∅. The ratio cut objective
is
RatioCut(C_1, C_2) = (1/|C_1| + 1/|C_2|) Σ_{i∈C_1, j∈C_2} w_ij
a) Let f_i = +√(|C_2|/|C_1|) for i ∈ C_1 and f_i = −√(|C_1|/|C_2|) for i ∈ C_2 be a kind of
indicator function of the clustering. Prove that
f⊤(D − W) f = n RatioCut(C_1, C_2) .
b) Further prove that Σ_i f_i = 0 and Σ_i f_i² = n.
Note (to be discussed in the tutorial in more detail): Spectral clustering addresses
min_f  f⊤(D − W) f   s.t.   Σ_i f_i = 0 ,  ||f||² = 1
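A small sketch of this relaxation in Python: the constrained minimizer is (up to scale) the eigenvector of D − W with the second-smallest eigenvalue, and a 2-cut can be read off from its signs (a sketch of the idea, not a full spectral clustering implementation):

import numpy as np

def spectral_2cut(W):
    # relaxed ratio cut: take the eigenvector of L = D - W with the second-smallest eigenvalue
    # (the smallest one is the constant vector, which violates sum_i f_i = 0), then cut by sign
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)
    f = vecs[:, 1]
    return f >= 0        # boolean cluster indicator for C_1 vs C_2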
9.10 Exercise 10
(DS BSc students please try to complete the full exercise this time.)
k-nearest neighbor regression is a very simple lazy learning method: Given a data
set D = {(x_i, y_i)}_{i=1}^n and a query point x*, first find the k nearest neighbors K ⊂
{1, .., n}. In the simplest case, the output y = (1/k) Σ_{i∈K} y_i is then the average of
these k nearest neighbors. In the classification case, the output is the majority vote
of the neighbors.
(To make this smoother, one can weight each nearest neighbor based on the distance
|x* − x_i|, and use local linear or polynomial (logistic) regression. But this is not
required here.)
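A minimal sketch of the basic kNN classifier (assuming integer class labels and squared Euclidean distance):

import numpy as np

def knn_classify(X_train, y_train, x_query, k=5):
    # majority vote among the k nearest neighbors
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = np.bincount(y_train[nearest].astype(int))
    return votes.argmax()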
On the webpage there is a data set data2ClassHastie.txt. Your task is to com-
pare the performance of kNN classification (with basic kNN majority voting) with
a neural network classifier. (If you prefer, you can compare kNN against another
classifier such as logistic regression with RBF features, instead of neural networks.
The class boundaries are non-linear in x.)
As part of this exercise, discuss how a fair and rigorous comparison between two
ML methods is done.
Consider the following weak learner for classification: Given a data set D = {(x_i, y_i)}_{i=1}^n,
y_i ∈ {−1, +1}, the weak learner picks a single i* and defines the discriminative function
f(x) = α exp(−(x − x_{i*})² / (2σ²)) ,
with fixed width σ and variable parameter α. Therefore, this weak learner is param-
eterized only by i* and α ∈ R, which are chosen to minimize the neg-log-likelihood
L^nll(f) = −Σ_{i=1}^n log σ(y_i f(x_i)) .
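For concreteness, a small sketch of evaluating this weak learner's neg-log-likelihood for a given candidate (i*, α) on 1D inputs (σ is fixed; choosing the best i* and α is the point of part a) and is not done here):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weak_learner_nll(X, y, i_star, alpha, sigma):
    # X: 1D array of inputs, y in {-1, +1}; f(x) = alpha * exp(-(x - x_{i*})^2 / (2 sigma^2))
    f = alpha * np.exp(-((X - X[i_star]) ** 2) / (2 * sigma ** 2))
    return -np.sum(np.log(sigmoid(y * f)))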
a) Write down an explicit pseudo code for gradient boosting with this weak learner.
By “pseudo code” I mean explicit equations for every step that can directly be im-
plemented. This needs to be specific for this particular learner and loss. (3 P)
b) Here is a 1D data set, where ◦ are 0-class and × are 1-class data points. "Simulate"
the algorithm graphically on paper. (2 P)
[Figure omitted: 1D data points with the annotations "width" and "choose first".]
9.11 Exercise 11
(DS BSc students may skip coding exercise 3, but should be able to draw on the
board what the result would look like.)
You have 3 dice (potentially fake dice, where each one has a different probability
table over the 6 values). You're given all three probability tables P(D_1), P(D_2), and
P(D_3). Write down the equations and an algorithm (in pseudo code) that computes
the conditional probability P(S | D_1) of the sum of all three dice conditioned on the
value of the first die.
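A small numpy sketch of the underlying computation: given D_1 = d_1, the remaining sum D_2 + D_3 is distributed as the convolution of P(D_2) and P(D_3) (uniform tables are used here only as stand-ins):

import numpy as np

# probability tables over the values 1..6 (uniform stand-ins);
# P(D1) itself is not needed once we condition on D1
P2 = np.full(6, 1.0 / 6)
P3 = np.full(6, 1.0 / 6)

P23 = np.convolve(P2, P3)                 # distribution of D2 + D3; index s corresponds to the sum s + 2
P_S_given_D1 = np.zeros((6, 16))          # rows: d1 = 1..6, columns: S = 3..18
for d1 in range(1, 7):
    for s23 in range(2, 13):
        P_S_given_D1[d1 - 1, d1 + s23 - 3] = P23[s23 - 2]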
Consider a Gaussian Process prior P (f ) over functions defined by the mean func-
tion µ(x) = 0, the γ-exponential covariance function
9.12 Exercise 12
On the webpage find and download the Yale face database http://ipvs.informatik.
uni-stuttgart.de/mlr/marc/teaching/data/yalefaces_cropBackground.tgz. The file
contains gif images of 136 faces.
We want to compare two methods (Autoencoder vs PCA) to reduce the dimen-
sionality of this dataset. This means that we want to create and train a neural net-
work to find a lower-dimensional representation of our data. Recall the slides and
exercises about dimensionality reduction, neural networks and especially Autoen-
coders (slide 06:10).
a) Create a neural network using tensorflow (or any other framework, e.g., keras)
which takes the images as input, creates a lower-dimensional representation with
dimensionality p = 60 in the hidden layer (i.e., a layer with 60 neurons), and outputs
the reconstructed images. The loss function should compare the original image
with the reconstructed one. After having trained the network, reconstruct all faces
and display some examples. Report the reconstruction error Σ_{i=1}^n ||x_i − x′_i||². (5P)
b) Use PCA to reduce the dimensionality of the dataset to p = 60 as well (e.g., use
your code from exercise e08:02). Reconstruct all faces using PCA and display some
examples. Report the reconstruction error Σ_{i=1}^n ||x_i − x′_i||². Compare the reconstruc-
tions and the error from PCA with the results from the Autoencoder. Which one
works better? (2P)
Extra) Repeat for various dimensions p = 5, 10, 15, 20 . . .
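A minimal autoencoder sketch for part a) using tf.keras (the data matrix X of flattened gray images, the sigmoid bottleneck activation, and the training hyperparameters are assumptions, not a prescribed solution):

import numpy as np
import tensorflow as tf

# X: data matrix of flattened gray images scaled to [0, 1], shape (n, d), as loaded in Exercise 8
n, d, p = X.shape[0], X.shape[1], 60
model = tf.keras.Sequential([
    tf.keras.layers.Dense(p, activation='sigmoid', input_shape=(d,)),   # bottleneck of 60 neurons
    tf.keras.layers.Dense(d, activation=None),                          # reconstruction
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, X, epochs=200, batch_size=16, verbose=0)

X_rec = model.predict(X)
print("reconstruction error:", np.sum((X - X_rec) ** 2))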
Exam Preparation Tip: These are the headings of all questions that appeared in
previous exams – no guarantee that similar ones will appear!
– Gymnastics
– True or false?
– Bayesian reasoning
– Dice rolling
– Coin flipping
– Gaussians & Bayes
– Linear Regression
– Features
– Features & Kernels
– Unusual Kernels
– Empirically Estimating Variance
– Logistic regression
– Logistic regression & log-likelihood gradient
– Discriminative Function
– Principal Component Analysis
– Clustering
– Neural Network
– Bayesian Ridge Regression and Gaussian Processes
– Ridge Regression in the Bayesian view
– Bayesian Predictive Distribution
– Bootstrap & combining learners
– Local Learning
– Boosting
– Joint Clustering & Regression
Neg-log-likelihood (3:10),
Neural Network function class (4:3),
NN back propagation (4:13),
NN Dropout (4:10),
NN gradient (4:13),
NN initialization (4:20),
NN loss functions (4:9),
NN regularization (4:10),
No Free Lunch (8:27),
Non-negative matrix factorization** (6:32),
one-hot-vector (3:11),