
Introduction to Machine Learning
by dcamenisch

1 Introduction

This document is a summary of the 2022 edition of the lecture Introduction to Machine Learning at ETH Zurich. I do not guarantee correctness or completeness, nor is this document endorsed by the lecturers. If you spot any mistakes or find other improvements, feel free to open a pull request at github.com/DannyCamenisch/iml-summary. This work is published as CC BY-NC-SA.

2 Regression

In this first part we are going to focus on fitting lines to data points. For this we will introduce the machine learning pipeline. It consists of three parts and has the goal of finding the optimal model f̂ for given data D, which we can then use to predict new data. The three parts of the ML pipeline are the function class F, the loss function ℓ and the optimization method. In the coming sections f* will be the ground truth function and f̂ will be used for our (learned) prediction model.

2.1 Linear Regression

Given the data (x_i, y_i) we use models of the form f(x) = w⊤x + b to fit the data. To find the optimal values for w and b we try to minimize the squared loss:

    ℓ(y, f(x)) := (1/n) Σ_{i=1}^n (y_i − f(x_i))² = (1/n) ||y − Xw||₂²

In the matrix notation b is part of w. The closed form solution for linear regression is given by the normal equations:

    ŵ = (X⊤X)⁻¹X⊤y

We can also get the closed form solution by using the fact that the squared loss is a convex function and ŵ is the global minimum of this function. Therefore we can calculate the gradient ∇ℓ(y, f(x)) and solve for 0 to find ŵ. Later, we will see a more efficient way of finding ŵ.

2.1.1 Different Loss Functions

The squared loss penalizes over- and underestimation the same. Further, it puts a large penalty on outliers (it grows quadratically). While this is often good, we might want a different loss function. Some possibilities are:

• Huber loss - ignores outliers (with a = y − f(x)):

    ℓ_δ(y, f(x)) := ½a²  for |a| ≤ δ,   δ·(|a| − ½δ)  otherwise

• Asymmetric losses - weigh over- and underestimation differently

2.2 Nonlinear Functions

Linear functions helped us to keep the calculations "simple" and find good solutions. But often there are problems that are more complex and would require nonlinear functions. To avoid using nonlinear functions we introduce feature mapping. From our input vector x we extract a feature vector φ(x) by using a fixed mapping φ that can consist of any nonlinear functions. On this feature vector we can use the already known methods for linear functions to find a solution.

2.3 Regularization

We will later see that too complex models are not always good, as they use too many features. If we want to reduce the number of features, we can encourage sparsity by introducing a penalty term. We commonly use:

• Lasso Regression: argmin_{w∈Rᵈ} ||y − Φw||₂² + λ||w||₁

• Ridge Regression: argmin_{w∈Rᵈ} ||y − Φw||₂² + λ||w||₂²

Lasso regression sets a lot of weights to zero, while ridge regression just puts the focus on lower weights.

3 Optimization

If the closed form solution is not available or not desirable, as calculating it is expensive, we use the gradient descent algorithm. It works by initializing w⁰ and iteratively moving it towards the optimal solution. We choose the direction by calculating ∇ℓ(w) and then multiply it by the stepsize / learning rate η:

    w^(t+1) = w^t − η_t·∇ℓ(w^t)

Convergence is only guaranteed in the convex case; otherwise we might get stuck at any stationary point. As the slope gets smaller, we want to decrease η, so that we do not overshoot. For the linear regression case we have:

    ||w^t − w*||₂ ≤ ρ^t ||w⁰ − w*||₂,   ρ = ||I − ηX⊤X||_op

Where ρ is the convergence speed for constant stepsize η. This leads to an optimal fixed stepsize of:

    η = 2 / (λ_min + λ_max)

We stop when new iterations do not cause any change anymore (below a certain threshold).
To make gradient descent more stable / robust against ill-conditioned landscapes we might add momentum:

    w^(t+1) = w^t + γ·Δw^(t−1) − η_t·∇ℓ(w^t)

3.1 Stochastic Gradient Descent

When we have a lot of data, it is costly to compute the gradient, so we only use a minibatch S of the dataset D (randomly sampled without replacement). Now the update step looks like this:

    w^(t+1) = w^t − η_t·∇ℓ_S(w^t)

Where the loss is only calculated over the minibatch S. This method also gives us a chance to escape saddle points.
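A minimal sketch of the minibatch update (the gradient function grad_fn, batch size and stepsize are illustrative assumptions):

    import numpy as np

    def sgd(grad_fn, X, y, w0, eta=0.01, epochs=10, batch_size=32):
        """Each step uses grad_fn evaluated on a random minibatch S only."""
        w, n = w0.astype(float), len(y)
        for _ in range(epochs):
            perm = np.random.permutation(n)               # sample without replacement
            for s in range(0, n, batch_size):
                idx = perm[s:s + batch_size]              # minibatch S
                w = w - eta * grad_fn(w, X[idx], y[idx])  # w^(t+1) = w^t - eta * grad_S
        return w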

4 Model Error

We generally want to minimize the estimation error ℓ(f̂(x), f*(x)); since we do not know f*, we can not actually compute this value. Instead, we usually observe y_i = f*(x_i) + ε_i. For each observed sample we can compute the prediction error ℓ(f̂(x), y); in fact we are often interested in the average prediction error or generalization error:

    R(f̂) := E_{x,y}[ℓ(f̂(x), y)] = E_x[ℓ(f̂(x), f*(x))] + ε

The generalization error is computed over all possible (x, y) pairs, weighted by how likely each is. The training loss is often too optimistic to approximate the generalization error. To get a better approximation we split our data into a training and a test set. By only using the training set to fit our model, we can use the test data to get a better estimate of the generalization error.

4.1 Cross-Validation

When choosing between different models, we might choose the model with the lowest test set error, but this may introduce a systematic bias. To prevent this from happening we can split the training set again, creating a validation set. Now the idea is to choose the model with the best validation error and use the test set only to get the estimate for the generalization error. Setting aside so much data can be wasteful, so we introduce k-fold cross-validation. We proceed as follows:

1. For all folds k = 1, ..., K:

   (a) Train f̂_k on D′ − D′_k

   (b) Compute the validation error R_k = (1/|D′_k|) Σ_{(x,y)∈D′_k} ℓ(f̂_k(x), y)

2. Compute the cross-validation error CV = (1/K) Σ_{i=1}^K R_i

3. Pick the model with the lowest cross-validation error CV

4. Evaluate the model using the test set D′′

For K very large, we get the best approximation; if K = |D′| we call it leave-one-out cross-validation (LOOCV).
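A minimal sketch of this procedure (fit and risk are placeholder callables for training a model and evaluating its validation error):

    import numpy as np

    def k_fold_cv(fit, risk, X, y, K=5):
        """Cross-validation error CV = (1/K) * sum of the K fold errors."""
        folds = np.array_split(np.random.permutation(len(y)), K)
        errors = []
        for k in range(K):
            val = folds[k]                                  # D'_k
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            model = fit(X[train], y[train])                 # train f_k on D' - D'_k
            errors.append(risk(model, X[val], y[val]))      # validation error R_k
        return np.mean(errors)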
4.2 Model Complexity

Model complexity is closely related to training and generalization error: with growing complexity the training error keeps decreasing, while the generalization error eventually increases again.

4.3 Bias and Variance

For different datasets D₁, ..., D_K we define:

• Bias - distance of the average model f̄ = (1/K) Σ_{i=1}^K f̂_i to the ground truth: E_x[ℓ(f̄(x), f*(x))]

• Variance - average distance of the models to the average model: E_x[(1/K) Σ_{i=1}^K ℓ(f̂_i(x), f̄(x))]

5 Classification

Instead of predicting y ∈ R, we limit y to be in a finite, discrete set Y (e.g. {−1, +1}). When looking at binary classification we often use the labels −1, +1 and let the predicted value be ŷ = sgn f̂(x). Similar to regression we care about the generalization error:

    R(f̂) = P_{x,y}[y ≠ sgn f̂(x)] = E_{x,y}[ℓ_{0-1}(f̂(x), y)]

Where we call ℓ_{0-1}(f̂(x), y) = I_{y ≠ sgn f̂(x)} the zero-one loss. Since this loss is neither convex nor continuous, we can not efficiently minimize the training error with it. Therefore we introduce different types of loss functions:

• Exponential loss: ℓ_exp(f̂(x), y) = e^{−y f̂(x)}

• Logistic loss: ℓ_log(f̂(x), y) = log(1 + e^{−y f̂(x)})

• Hinge loss: ℓ_hinge(f̂(x), y) = max(0, 1 − y f̂(x))

• Linear loss: ℓ_lin(f̂(x), y) = −y f̂(x)
We will mainly focus on the logistic loss (also called logistic regression), as in practice it is the most used. We can derive that the logistic loss is the negative conditional log likelihood of P[y = +1 | x] or P[y = −1 | x], which is parameterized by f̂(x) via the softmax transformation. We define (similarly for y = −1):

    P[y = +1 | x] = 1 / (1 + e^{−f̂(x)})

Using this we can define the probability vector:

    p̂(x) = (P[y = −1 | x], P[y = +1 | x])

If we want to extend the log loss to multiple classes, we define a vector f̃(x) = (f̂₁(x), ..., f̂_K(x)) and transform it using softmax:

    p̂_k = e^{f̂_k(x)} / Σ_{j=1}^K e^{f̂_j(x)}

For the multiclass case we choose the classifier output to be the maximal entry of p̂, with an error if y ≠ ŷ.

5.1 Linear Classifiers

Linear classifiers use functions from the class F = {f | f(x) = w⊤x, w ∈ Rᵈ}. We already know that this class of functions makes training and prediction simple. The decision boundary of the function is given by {x | f(x) = 0}.

To train our classifier we can use gradient descent. The gradient of the logistic loss is given by:

    ∇ℓ(f̂(x_i), y_i) = −y_i·x_i / (1 + e^{y_i f̂(x_i)})

For linearly separable data, gradient descent on the logistic loss converges to the direction w_MM that maximizes the minimum ℓ₂-distance between the decision boundary and the data points. We call this the maximum-margin solution. In particular we can write:

    w_MM = argmax_{||w||₂=1} min_i y_i·w⊤x_i = argmax_{||w||₂=1} margin(w)

Instead of just linear functions, we can again use feature mapping to receive nonlinear classifiers.
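Putting the pieces together, a minimal sketch of logistic regression trained by gradient descent on this loss (stepsize and iteration count are arbitrary placeholders):

    import numpy as np

    def logistic_regression_gd(X, y, eta=0.1, iters=1000):
        """Gradient descent on the logistic loss; labels y in {-1, +1}."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            margins = np.clip(y * (X @ w), -500, 500)  # y_i * w^T x_i (clipped for exp)
            coeff = -y / (1.0 + np.exp(margins))       # -y_i / (1 + e^{y_i w^T x_i})
            w -= eta * (X.T @ coeff) / n               # average per-sample gradient
        return w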
5.2 Support Vector Machines

For a general w that correctly separates the data, margin(w)/||w||₂ is the minimum distance of any point to the decision boundary. If we use a general w, the solution is not unique anymore. But we can rescale any unit norm w by α = 1/margin(w) such that αw = w̃. So instead of searching within unit norm w to find w_MM with maximum margin, we can search within all w̃ with margin(w̃) = 1 to find the one that maximizes:

    margin(w̃)/||w̃||₂ = 1/||w̃||₂

This is how support vector machines work. More formally:

    ŵ = argmin_w ||w||₂   s.t.  y_i·w⊤x_i ≥ 1 for all i = 1, ..., n

If the data is not linearly separable, we might want to use a soft-margin SVM. Since not all constraints can hold, we want to allow some "slack" in the constraints:

    ŵ = argmin_w ½||w||₂² + λ Σ_{i=1}^n max(0, 1 − y_i·w⊤x_i)

The latter part penalizes any margin violations. To find the optimal λ one might use cross-validation.
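As an illustration, the soft-margin objective above can also be minimized by subgradient descent; this is only a sketch under arbitrary stepsize and iteration choices (dedicated QP or coordinate solvers are the usual tools):

    import numpy as np

    def soft_margin_svm(X, y, lam=1.0, eta=1e-3, iters=1000):
        """Subgradient descent on (1/2)||w||^2 + lam * sum_i max(0, 1 - y_i w^T x_i)."""
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            viol = y * (X @ w) < 1                   # margin violators
            subgrad = w - lam * (y[viol] @ X[viol])  # w + lam * sum(-y_i x_i) over them
            w -= eta * subgrad
        return w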
6 Hypothesis Testing

We focused a lot on deriving good surrogate losses for the 0-1 loss. But is this error really a good metric? Hypothesis testing is a way to express asymmetry in classification tasks. For this we introduce the confusion matrix of true/false positives and negatives (TP, FP, TN, FN). Further we define:

    error₁ / FPR = FP / (TN + FP),   error₂ / FNR = FN / (TP + FN)

We want to find a test that minimizes the FPR, while controlling the FNR. This can be viewed as defining a null hypothesis H₀(x) and then deciding to accept or reject it (H₀ is always the positive class). When choosing H₀ we want it to represent the more crucial class to get right, e.g. it is more important to truly classify a person as sick than to classify them as healthy. To decide whether we accept or reject H₀, we fix a threshold τ, where we accept H₀(x) (ŷ = −1) if p̂(x) < τ and the opposite way around.
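These two error rates are straightforward to compute from predictions; a minimal sketch (assuming labels in {−1, +1} and both classes present):

    import numpy as np

    def error_rates(y_true, y_pred):
        """FPR (error_1) and FNR (error_2) from the confusion matrix entries."""
        tp = np.sum((y_true == +1) & (y_pred == +1))
        tn = np.sum((y_true == -1) & (y_pred == -1))
        fp = np.sum((y_true == -1) & (y_pred == +1))
        fn = np.sum((y_true == +1) & (y_pred == -1))
        return fp / (tn + fp), fn / (tp + fn)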
6.1 AUROC

We want to have a large recall TP/#[y = +1] but also a small FPR. Based on these metrics we can draw the ROC curve by varying τ. We can either choose our model by caring about a specific point, e.g. TPR @ FPR = 5%, or we choose whichever curve gets closer to the ideal curve, that is, maximizing the area under the curve.

7 Kernels

We have previously seen how we can get nonlinear functions via feature maps φ. But there are limits to these feature maps: they can introduce a lot of computational complexity (feature explosion), and there are also infinite feature maps we can not get this way. If we want to avoid these limitations we use the kernel trick. It consists of two steps:

1. We know that the solution ŵ is in the column space of Φ⊤. Therefore among the global minimizers one has the form ŵ = Φ⊤α̂ with α̂ ∈ Rⁿ, so that:

       f̂(x) = ŵ⊤φ(x) = α̂⊤Φφ(x) = Σ_{i=1}^n α̂_i·φ(x_i)⊤φ(x)

   Notice that α̂ only depends on the x_i via inner products φ(x_i)⊤φ(x_j). Using this we can define a symmetric kernel function k(x, z) = φ(x)⊤φ(z) and a corresponding kernel matrix K = ΦΦ⊤.

2. Sometimes we can compute the inner products / evaluate the kernel function more efficiently, e.g. for the feature vector φ(x) = [1, √2x₁, √2x₂, x₁², x₂², √2x₁x₂] the inner product is:

       φ(x)⊤φ(z) = (1 + x₁z₁ + x₂z₂)² = (1 + x⊤z)² = k(x, z)

   This kernel function is a lot less expensive to compute.

7.1 Example for Ridge Regression

Remember w = Φ⊤α and K = ΦΦ⊤; applying this to ridge regression we get:

    (1/n)||y − Φw||₂² + λ||w||₂² = (1/n)||y − ΦΦ⊤α||₂² + λ||Φ⊤α||₂²
                                 = (1/n)||y − Kα||₂² + λα⊤Kα
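Setting the gradient of this kernelized objective to zero gives a minimizer α̂ = (K + nλI)⁻¹y; a minimal sketch with an RBF kernel (the kernel choice and its parameters are placeholders):

    import numpy as np

    def rbf_kernel(A, B, tau=1.0):
        """k(x, z) = exp(-||x - z||^2 / tau), evaluated pairwise."""
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq / tau)

    def kernel_ridge_fit(X, y, lam=0.1, tau=1.0):
        K = rbf_kernel(X, X, tau)
        n = len(y)
        return np.linalg.solve(K + n * lam * np.eye(n), y)  # alpha_hat

    def kernel_ridge_predict(alpha, X_train, X_new, tau=1.0):
        return rbf_kernel(X_new, X_train, tau) @ alpha      # f(x) = sum_i alpha_i k(x_i, x)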

7.2 Different Kernels

A valid kernel must have the following properties:

• K is symmetric because of the inner products: k(x, z) = k(z, x)

• K is positive semi-definite for any choice of inputs x₁, ..., xₙ, i.e. z⊤Kz ≥ 0

Common kernel choices are:

• linear: k(x, z) = x⊤z

• polynomial: k(x, z) = (x⊤z + 1)^m

• rbf: k(x, z) = exp(−||x − z||^α / τ)

An RBF kernel with α = 2 is also called a Gaussian kernel, and one with α = 1 a Laplacian kernel. Special about the RBF kernel is that it corresponds to infinite dimensional features. Given valid kernels we can compose new ones, as validity is conserved under the following rules:

• k = k₁ + k₂

• k = k₁ · k₂

• k = c · k₁ for all c > 0

• k = f(k₁) for f convex

Mercer's Theorem: Any valid kernel can be decomposed into a linear combination of inner products.
8 Other Nonlinear Methods

8.1 KNN Classification

This method does not need any training; classification is done at test time. For a given training set D it works as follows:

1. Pick k and a distance metric d

2. For a given x, find among x₁, ..., xₙ ∈ D the k closest to x → x_{i1}, ..., x_{ik}

3. Output the majority vote of the labels y_{i1}, ..., y_{ik}

This method is very sensitive to k and becomes unstable in high dimensions. We might need large n for good results, but computation can be reduced when allowing for some error probability.
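A minimal sketch of the three steps (with the Euclidean distance as the metric d):

    import numpy as np

    def knn_classify(X_train, y_train, x, k=5):
        """Majority vote among the k nearest training points."""
        dists = np.linalg.norm(X_train - x, axis=1)       # distance metric d
        nearest = np.argsort(dists)[:k]                   # indices i_1, ..., i_k
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]                  # majority label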
8.2 Decision Trees

A decision tree returns a partition of X with sets aligned with the main axes. A given x is assigned the majority class of the partition it lands in. The partitions can be modelled as leaf nodes of a binary tree. Single trees can easily overfit to noise; we have to choose the depth of the tree carefully.
9 Neural Networks

Success in learning crucially depends on the quality of the features. The key idea of neural networks is to parameterize the feature maps and optimize over the parameters. We want to build a complex model out of simple components:

    φ(x, θ) = ϕ(θ⊤x)

Hereby, θ ∈ Rᵈ are the weights and ϕ : R → R is a nonlinear activation function. Possible activation functions are:

• Identity: ϕ(z) = z

• Sigmoid: ϕ(z) = 1 / (1 + exp(−z))

• Tanh: ϕ(z) = tanh z = (exp(z) − exp(−z)) / (exp(z) + exp(−z))

• ReLU: ϕ(z) = max(0, z)

Nesting these components we create networks, where v_i = ϕ(z_i) and z_i is the sum of the inputs times their weights. To deal with biases we introduce a "constant" 1 feature in each layer. Note that we can have as many layers as we want and use different activation functions per layer. Such networks are typically trained via SGD. By the universal approximation theorem, we can approximate any arbitrary smooth target function, given at least one layer with sufficient width.

9.1 Forward Propagation

This is the process of calculating the output for a given input (a minimal sketch follows below):

• For the input layer: v^(0) = [x; 1]

• For each hidden layer l = 1, ..., L − 1: z^(l) = W^(l)·v^(l−1) and v^(l) = [ϕ(z^(l)); 1]

• For the output layer: f = W^(L)·v^(L−1)
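Here is that sketch; the list of layer matrices `weights` and the ReLU activation are illustrative assumptions:

    import numpy as np

    def forward(weights, x, phi=lambda z: np.maximum(0.0, z)):
        """Forward pass; weights is a list [W1, ..., WL]."""
        v = np.append(x, 1.0)            # v^(0) = [x; 1]
        for W in weights[:-1]:           # hidden layers l = 1, ..., L-1
            z = W @ v                    # z^(l) = W^(l) v^(l-1)
            v = np.append(phi(z), 1.0)   # v^(l) = [phi(z^(l)); 1]
        return weights[-1] @ v           # output layer: f = W^(L) v^(L-1)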

9.2 Backpropagation

We can use the loss functions we already know to compute the loss. For multi-output networks, we use the sum of per-output losses for regression tasks and the cross-entropy loss for classification tasks. As mentioned, we use SGD to fit our neural network. We want to jointly optimize over all weights of all layers. This is generally a non-convex optimization problem; nevertheless, we can try to find a local optimum. In order to apply SGD, we need to compute ∇_W ℓ(W; x, y) w.r.t. each weight w^(l)_{i,j}. Notice that we can reuse calculations from the previous layer's forward pass and only have to compute the gradient for each layer.

Since the optimization problem is non-convex, the initialization of the weights matters. With inappropriate weights we can run into exploding or vanishing gradients. To avoid this we randomly initialize the weights based on some distribution assumption for the activation function.

9.3 Overfitting

Since any deep neural network has a lot more parameters than data points to train on, overfitting can happen easily. To avoid this we use:

• Regularization: add a penalty on the weights to the cost function

• Early Stopping: stop training once the validation error stops decreasing

• Dropout: randomly ignore hidden units during training with probability p; after training all units are used and the weights are multiplied by p

• Batch Normalization: normalize the input data (mean 0, variance 1) in each layer

9.4 Convolutional Neural Networks

CNNs are a specialized architecture for neural networks. The idea is that predictions should be unchanged under some transformations of the data, e.g. rotation of images. Each layer is not fully connected but structured. The activation function is applied to the element-wise convolution:

    ϕ(W * v^(l))

The output dimension when applying m different f × f filters to an n × n image with padding p and stride s is:

    l = (n + 2p − f)/s + 1

Additionally we might use average or max pooling layers to aggregate several units into a single one, or use strided layers to skip units and decrease the size.
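This formula is easy to check in code; a small sketch:

    def conv_output_size(n, f, p=0, s=1):
        """Side length after an f x f filter on an n x n image: (n + 2p - f)/s + 1."""
        return (n + 2 * p - f) // s + 1

    # e.g. a 3x3 filter with padding 1 and stride 1 preserves a 32x32 image:
    # conv_output_size(32, 3, p=1, s=1) == 32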
10 Unsupervised Learning

10.1 k-Means Clustering

Given an unlabelled dataset, we try to learn feature similarities based on proximity in feature space. Data points with similar features should then be grouped into the same cluster. k-Means tries to represent each cluster by a single (center) point µ_i. Each data point is assigned by:

    z_j = argmin_i ||x_j − µ_i||₂,   z_j ∈ {1, ..., k}

To pick the optimal centers we try to minimize the sum of squared distances:

    R̂(µ) = Σ_{i=1}^n min_{j∈{1,...,k}} ||x_i − µ_j||₂²

This is a non-convex optimization problem and NP-hard. One way of finding a good solution is Lloyd's heuristic (see the sketch after this section):

1. Initialize the cluster centers µ^(0)

2. While not converged:

   (a) Assign each point to its closest center: z_i ← argmin_{j∈{1,...,k}} ||x_i − µ_j^(t−1)||₂

   (b) Update the centers as the mean of the assigned data points: µ_j^(t) ← (1/n_j) Σ_{i | z_i=j} x_i

This guarantees to monotonically decrease the average squared distance in each iteration and converges to a local optimum. This local optimum is strongly dependent on the initialization. One way to initialize the centers is k-Means++:

1. Start with a random data point as center: µ₁ = x_i where i ~ Unif{1, ..., n}

2. Add centers 2, ..., k randomly, proportionally to the squared distance to the closest already selected center: given µ_{1:j} pick µ_{j+1} = x_i with probability p(i) = (1/z) min_{l∈{1,...,j}} ||x_i − µ_l||₂²

Finding the optimal number of clusters k can not be done by cross-validation, as the loss keeps decreasing with larger k. We can either keep increasing k until we reach a negligible decrease in loss, or we can use regularization to add a penalty term for larger k.
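A minimal sketch of Lloyd's heuristic (random initialization instead of k-Means++, and assuming no cluster goes empty):

    import numpy as np

    def lloyd(X, k, iters=100):
        rng = np.random.default_rng(0)
        mu = X[rng.choice(len(X), size=k, replace=False)]   # random init
        for _ in range(iters):
            # assignment step: closest center for every point
            d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            z = d.argmin(axis=1)
            # update step: mean of the assigned points
            new_mu = np.array([X[z == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_mu, mu):                     # converged
                break
            mu = new_mu
        return mu, z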
10.2 Principal Component Analysis

PCA is used for dimensionality reduction. Given data x_i ∈ Rᵈ, we want to obtain a low-dimensional representation z_i ∈ Rᵏ where k < d. One of the benefits of a low-dimensional representation is that we can visualize data that we otherwise could not. Feature discovery is another use case for PCA; it can help us to discover features from data, e.g. Eigenfaces. We assume that our data is centered around the origin. Our goal is to learn the function f(x) = Ax that maps the high dimensional data to the lower dimensions, while minimizing the reconstruction error. First we will look at the case k = 1:

    min_{w,z} Σ_{i=1}^n ||x_i − z_i·w||₂²   s.t. ||w||₂ = 1

We limit w to be of unit length to guarantee a unique solution. Since for a given w the minimal distance vector x̄_i − x_i is perpendicular to w, we find that the optimal solution is z_i = w⊤x_i. We can now substitute z_i and receive the following optimization goal:

    ŵ = argmin_{||w||₂=1} Σ_{i=1}^n ||x_i − ww⊤x_i||₂²

Which again can be reformulated as:

    ŵ = argmax_{||w||₂=1} Σ_{i=1}^n (w⊤x_i)²   or   ŵ = argmax_{||w||₂=1} w⊤Σw

Where Σ = (1/n) Σ_{i=1}^n x_i·x_i⊤ is the empirical covariance. Since we now have an argmax, this is not a minimization problem anymore and we can not find a solution like in the previous problems. There still exists a closed form solution given by the principal eigenvector of Σ, i.e. w = v₁ where, for λ₁ ≥ ... ≥ λ_d ≥ 0:

    Σ = Σ_{i=1}^d λ_i·v_i·v_i⊤
Up until now everything was for k = 1. For k > 1 we have to change the normalization from ||w||₂ = 1 to W⊤W = I; everything else is basically the same, we just take the first k principal eigenvectors, so that W = [v₁, ..., v_k]. Choosing the optimal k depends on our goal: for feature induction we use cross-validation, otherwise we often pick k so that the variance of our data is mostly explained (other dimensions would add little information).

10.2.1 PCA through SVD

Another way of obtaining the PCA is through singular value decomposition. Recall that we can represent any data matrix X as USV⊤, where S is a diagonal matrix containing the singular values (singular values being the square roots of the eigenvalues). Now the top k principal components are exactly the first k columns of V.
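A minimal sketch of this route (centering included, since PCA assumes centered data):

    import numpy as np

    def pca_svd(X, k):
        """Top-k principal components via the SVD of the centered data matrix."""
        Xc = X - X.mean(axis=0)
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        W = Vt[:k].T                # first k columns of V
        return W, Xc @ W            # components and projections z_i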

10.2.2 Kernel PCA

Again we run into problems trying to work with complex arrangements of data, e.g. circles, swiss roll, etc. Similar to supervised learning, where we worked with kernels, we can take the same approach for unsupervised learning. Since it holds that Σ = (1/n) Σ_{i=1}^n x_i·x_i⊤ = X⊤X, we can apply the kernel trick. We start by assuming w = Φ⊤α; plugging this into our objective and the constraint, we end up with:

    α̂ = argmax_α (α⊤K⊤Kα) / (α⊤Kα)

We arrive at the general closed form solution:

    α^(i) = (1/√λ_i)·v_i   where   K = Σ_{i=1}^n λ_i·v_i·v_i⊤,   λ₁ ≥ ... ≥ λₙ ≥ 0

Given this, a new point x is projected as z, where:

    z_i = Σ_{j=1}^n α_j^(i)·k(x_j, x)

10.3 Autoencoders

Autoencoders are neural networks with a bottleneck layer and d_in = d_out. We want to minimize (1/n) Σ_{i=1}^n ||x_i − x̂_i||₂². The idea is to learn the identity function:

    x̂ = f(x; θ)   where   f(x; θ) = f_dec(f_enc(x; θ_enc); θ_dec)

If linear activation functions and the squared loss between input and output are used, then the encoder learns PCA. Otherwise it learns some nonlinear embedding z of the features.

11 Statistical Perspective

In this part we will explore a statistical perspective on supervised learning by estimating the data distribution and then deriving a decision rule from the distribution. This allows us to express prior knowledge about the data. We start with the fundamental assumption that our data is generated iid. by some unknown distribution (note that this assumption is often violated in practice):

    (x_i, y_i) ~ p(x, y)

We want to find a hypothesis f : X → Y that minimizes the expected loss / prediction error / population risk (over all possible data):

    R(f) = ∫∫ p(x, y)·ℓ(y, f(x)) dx dy = E_{x,y}[ℓ(y, f(x))]

We have already seen that the empirical risk / training error R̂_D(f) often underestimates the population risk. But by the law of large numbers, the empirical risk approaches the population risk. We call the difference |R̂_D(f) − R(f)| the generalization error w.r.t. f.

11.1 Optimal Predictor for the Squared Loss

The population risk for the squared loss is:

    R(f) = E_{x,y}[(y − f(x))²]

Suppose we knew p(x, y); which f minimizes the population risk?

    f* = argmin_f E_{x,y}[(y − f(x))²] = argmin_f E_x[E_y[(y − f(x))² | X = x]]

Since f(x) can be chosen freely for each x, we can minimize inside the expectation. Now we focus on the inner part; suppose we are given a fixed x:

    f*(x) = argmin_ŷ E_y[(ŷ − y)² | X = x] = E[y | X = x]

We have therefore shown that the f* minimizing the population risk is given by the conditional mean, which can be calculated by:

    f*(x) = E[y | X = x] = ∫ y·p(y | x) dy
Note that we only need the conditional distribution p(y | x) and not the full joint distribution p(x, y). Thus one strategy for estimating a predictor from training data is to estimate the conditional distribution p̂(y | x) and then use it to predict labels via the conditional mean. One common approach to estimate the conditional distribution is to choose a particular parametric form and then estimate the parameters θ with maximum (log) likelihood estimation:

    θ* = argmax_θ p̂(y₁, ..., yₙ | x₁, ..., xₙ, θ) = argmin_θ −Σ_{i=1}^n log p(y_i | x_i, θ)

11.1.1 Example: Conditional Linear Gaussian

Let us look at the case where we make the assumption that the noise is Gaussian. We have y = f(x) + ε with ε ~ N(0, σ²) and f(x) = w⊤x. Therefore the conditional probability is p(y | x, w) = N(y; w⊤x, σ²). Then we can find the optimal ŵ by using the definition of the normal distribution (some steps are left out):

    ŵ = argmax_w p̂(y_{1:n} | x_{1:n}, w, σ)
      = argmin_w −Σ_{i=1}^n log N(y_i; w⊤x_i, σ²)
      = argmin_w Σ_{i=1}^n (y_i − w⊤x_i)²

Therefore we have shown that under the conditional linear Gaussian assumption, the MLE is equivalent to the least squares estimation.
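In code, this equivalence just means the MLE under Gaussian noise can be computed with an ordinary least squares solver; a minimal sketch:

    import numpy as np

    def mle_linear_gaussian(X, y):
        """MLE of w under y = w^T x + eps, eps ~ N(0, sigma^2):
        identical to ordinary least squares."""
        w, *_ = np.linalg.lstsq(X, y, rcond=None)  # argmin_w sum_i (y_i - w^T x_i)^2
        return w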
11.1.2 Bias-Variance Tradeoff

Recall that the following holds:

    Prediction Error = Bias² + Variance + Noise

Where we have:

• Bias: excess risk of the best model considered, compared to the minimal achievable risk knowing p(x, y)

• Variance: risk incurred due to estimating the model from limited data

• Noise: risk incurred by the optimal model (irreducible error)

The MLE for linear regression is unbiased; further, it is the minimum variance estimator among all unbiased estimators. However, we have also seen that it can overfit.

11.2 Maximum a Posteriori Estimate

It is often favourable to introduce some bias (make assumptions) to reduce the variance drastically. One such assumption could be that the weights are small. We can capture this assumption with a Gaussian prior w_i ~ N(0, β²). Then the posterior distribution of w is given by:

    p(w | x̄, ȳ) = p(w, x̄, ȳ) / p(x̄, ȳ) = (p(w, ȳ | x̄)·p(x̄)) / (p(ȳ | x̄)·p(x̄)) = (p(w)·p(ȳ | w, x̄)) / p(ȳ | x̄)

Hereby we used that w is a priori independent of x̄ (note that x̄ = x_{1:n}, ȳ = y_{1:n}). Now we want to find the maximum a posteriori estimate (MAP) for w:

    ŵ = argmax_w p(w | x̄, ȳ)
      = argmin_w −log p(w) − log p(ȳ | w, x̄) + log p(ȳ | x̄)
      = argmin_w (σ²/β²)||w||₂² + Σ_{i=1}^n (y_i − w⊤x_i)²

Which is exactly the same as ridge regression with λ = σ²/β². More generally, regularized estimation can often be understood as MAP inference, with different priors (= regularizers) and likelihoods (= loss functions).

11.3 Statistical Models for Classification

We now want to do the same risk minimization for classification. The population risk for the 0-1 loss is:

    R(f) = P[y ≠ f(x)] = E_{x,y}[I_{y≠f(x)}]

Suppose we knew p(x, y); which f minimizes the population risk?

    f*(x) = argmin_ŷ E_y[I_{y≠ŷ} | X = x] = argmax_ŷ p(ŷ | x)

This hypothesis f* minimizing the population risk is given by the most probable class; it is called the Bayes optimal predictor for the 0-1 loss. Similar to the regression case, we can now look at logistic regression and assume that we have iid. Bernoulli noise. Therefore the conditional probability is:

    p(y | x, w) = Ber(y; σ(w⊤x))

Where σ(z) = 1/(1 + exp(−z)) is the sigmoid function. Using MLE we get:

    ŵ = argmax_w p(ȳ | w, x̄) = argmin_w Σ_{i=1}^n log(1 + exp(−y_i·w⊤x_i))

Which is exactly the logistic loss. Instead of solving the MLE we can estimate the MAP, e.g. with a Gaussian prior:

    ŵ = argmax_w p(w | x̄, ȳ) = argmin_w λ||w||₂² + Σ_{i=1}^n log(1 + exp(−y_i·w⊤x_i))

12 Bayesian Decision Theory

We now want to use these estimated models to inform decisions. Suppose we have a given set of actions A. To act under uncertainty we assign each action a cost C : Y × A → R and pick the action with the minimum expected cost (maximum expected utility):

    a* = argmin_{a∈A} E_y[C(y, a) | x]

This is called Bayesian decision theory or the maximum expected utility principle. If we had the true distribution, this decision would implement the Bayes optimal decision. In practice we can only estimate this distribution, e.g. via logistic regression.
12.1 Asymmetric Costs

We can then use this to implement an asymmetric cost function, e.g.:

    C(y, a) = c_FP   if y = −1, a = +1
              c_FN   if y = +1, a = −1
              0      otherwise

12.2 Abstention

Another cost function could be used to decline to make a classification (action D):

    C(y, a) = I_{y≠a}   if a ∈ {−1, +1}
              c         if a = D

12.3 Uncertainty Sampling

Labelling is often expensive, since we need an expert to classify the samples. We want to minimize the actual number of labels that need to be hand classified. There is a simple strategy for this: always pick the sample that we are most uncertain about, by estimating p(y | x), and then ask the expert to label this sample.
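A minimal sketch of picking the expected-cost-minimizing action for the two cost functions above (the cost values and the probability estimate p_plus are placeholders):

    def best_action(p_plus, c_fp=1.0, c_fn=10.0, c_abstain=None):
        """Pick a in A minimizing E_y[C(y, a) | x]; p_plus estimates p(y = +1 | x)."""
        exp_cost = {+1: (1.0 - p_plus) * c_fp,  # predicting +1 costs c_FP if y = -1
                    -1: p_plus * c_fn}          # predicting -1 costs c_FN if y = +1
        if c_abstain is not None:               # optional abstention action D
            exp_cost["D"] = c_abstain
        return min(exp_cost, key=exp_cost.get)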


13 Generative Modeling

In the previous part we looked at discriminative models, with the aim of estimating the conditional distribution p(y | x). Generative models aim to estimate the joint distribution p(x, y). This will help us to model much more complex situations. Remember Bayes' rule:

    p(y | x) = (1/z)·p(y)·p(x | y),   with p(y)·p(x | y) = p(x, y)

Where z is the normalization constant p(x). Generative modeling can be seen as the attempt to infer the process according to which examples are generated.

13.1 Naive Bayes Model

We want to apply generative modeling to classification tasks. We start by making the assumption that, given some class label, each feature is independent of all the other features (therefore "naive"). This helps us estimate p(x | y), as it is then equal to ∏_{i=1}^d p(x_i | y).

13.2 Gaussian Naive Bayes Classifier

We model the features by conditionally independent Gaussians and estimate the parameters via maximum likelihood estimation:

1. MLE for the class prior:

       p(y) = p̂_y = Count(Y = y) / n

2. MLE for the feature distributions:

       p(x_i | y) = N(x_i; µ̂_{y,i}, σ̂²_{y,i})

   Where:

       µ̂_{y,i} = (1/Count(Y = y)) Σ_{j | y_j=y} x_{j,i}
       σ̂²_{y,i} = (1/Count(Y = y)) Σ_{j | y_j=y} (x_{j,i} − µ̂_{y,i})²

Predictions are then made by:

    y = argmax_ŷ p(ŷ | x) = argmax_ŷ p(ŷ)·∏_{i=1}^d p(x_i | ŷ)

This is equivalent to the following decision rule for binary classification:

    y = sgn( log( p(Y = +1 | x) / p(Y = −1 | x) ) )

Where the log-ratio f(x) is called the discriminant function. We can rewrite this and get:

    f(x) = Σ_{i=1}^d (1/σ_i²)(µ_{+1,i} − µ_{−1,i})·x_i + log(p/(1 − p)) + Σ_{i=1}^d (1/(2σ_i²))(µ²_{−1,i} − µ²_{+1,i})
         = w⊤x + w₀

If the conditional independence assumption is violated, we can run into some serious issues, e.g. the classifier can become overconfident.
We drop the independence assumption and model our fea-
Where: tures as generated by a multivariant Gaussian N (x; µy , Σy )
12.3 Uncertainty Sampling 1 󰁛 with:
µy,i = xj,i
Labelling is often expensive since we need an expert to clas- Count(Y = y) 1 󰁛
j | yj =y
sify the samples. We want to minimize the actual number µy = xj
1 󰁛 Count(Y = y)
2
of labels that need to be hand classified. There is a simple σy,i = (xj,i − µ̂y,i )2 j | yj =y
Count(Y = y) 1 󰁛
strategy for this, always pick the sample that we are most j | yj =y
Σy = (xj − µ̂y )(xj − µ̂y )⊤
uncertain about, by estimating p(y | x), and then asking Count(Y = y)
Predictions are then made by: j | yj =y
the expert to label this sample.
d
󰁜 This is also called the quadratic discriminant analysis
y = argmax p(ŷ | x) = argmax p(ŷ) · p(xi | ŷ) (QDA).
ŷ ŷ i=1
If we impose the restriction that Σ₊ = Σ₋, this leads us to linear discriminant analysis (LDA), and if we further restrict p(y) = ½ we get the Fisher LDA. Gaussian Bayes classifiers can also be used for outlier detection, by introducing a threshold τ such that all data points x with p(x) ≤ τ are outliers.

13.4 Avoiding Overfitting

From previous examples we know that MLE is prone to overfitting. We can avoid this by employing the techniques already seen:

• Restricting the Model Class: fewer parameters (e.g. GNB)

• Using Priors: restrict ("smaller") parameter values

Using a prior for the parameters leads us again to MAP estimation.

13.5 Generative vs. Discriminative

Discriminative models:

• Model p(y | x) and do not attempt to model p(x, y)

• Cannot detect outliers

• Are typically more robust, since accurately modeling x may be difficult

Generative models:

• Model the joint distribution p(x, y) and are therefore more ambitious

• Can be more powerful (e.g. detect outliers, missing values) if the model assumptions are met

• Are typically less robust against outliers

14 Gaussian Mixture Model

Gaussian mixture models make the assumption that data is generated from Gaussians, to be more precise from a convex combination of Gaussian distributions:

    p(x | θ) = p(x | µ, Σ, w) = Σ_{j=1}^k w_j·N(x; µ_j, Σ_j)

We do not know the labels z for the data and can only see the resulting density; now we want to cluster this data. The problem we try to solve is to estimate the parameters of the Gaussian distributions (minimize the negative log-likelihood):

    (w*_{1:k}, µ*_{1:k}, Σ*_{1:k}) = argmin −Σ_{i=1}^n log Σ_{j=1}^k w_j·N(x_i | µ_j, Σ_j)

This is a non-convex objective, but we can still try to apply SGD. There is, however, a better way to fit this model. The idea is that fitting a GMM is similar to training a GBC without labels. We want to apply an iterative approach where we first start with some guess for our parameters, predict the unknown labels and then impute the missing data. Now we can get a closed form update for our model, which we then use to refine our parameters.

14.1 Hard-EM Algorithm

First we are going to look at the simpler version of the EM (expectation maximization) algorithm:

• Initialize the parameters θ^(0)

• For t = 1, 2, ...:

  – E-Step: predict the most likely class for each data point:

        z_i^(t) = argmax_z p(z | x_i, θ^(t−1)) = argmax_z p(z | θ^(t−1))·p(x_i | z, θ^(t−1))

  – M-Step: compute the MLE of θ^(t) as for the GBC

There are some problems with this approach: for one, points are assigned a label even though the model is uncertain. Further, it tries to extract too much information from a single point. In practice this may work poorly if clusters are overlapping. Hard-EM with uniform weights and spherical covariances is equivalent to k-Means with Lloyd's heuristic.
14.2 Soft-EM Algorithm

Instead of predicting hard class assignments for each data point, we want to predict class probabilities.

• Initialize the parameters θ^(0)

• For t = 1, 2, ...:

  – E-Step: calculate the cluster membership weights for each point:

        γ_j^(t)(x_i) = p(z_i = j | x_i, θ^(t−1)) = w_j·p(x_i; θ_j^(t−1)) / Σ_k w_k·p(x_i; θ_k^(t−1))

  – M-Step: compute the MLE with the closed form solution:

        w_j^(t) = (1/n) Σ_{i=1}^n γ_j^(t)(x_i)

        µ_j^(t) = Σ_{i=1}^n x_i·γ_j^(t)(x_i) / Σ_{i=1}^n γ_j^(t)(x_i)

        Σ_j^(t) = Σ_{i=1}^n γ_j^(t)(x_i)(x_i − µ_j^(t))(x_i − µ_j^(t))⊤ / Σ_{i=1}^n γ_j^(t)(x_i)

In general, Soft-EM will typically result in higher likelihood values, as it can better deal with "overlapping" clusters. When speaking of EM we usually refer to Soft-EM. The EM algorithm is sensitive to initialization. We usually initialize the weights as uniformly distributed, the means randomly or with k-Means++, and for the variances we use spherical initialization or the empirical covariance of the data. To select k, in contrast to k-Means, we can use cross-validation.

14.3 Degeneracy of GMMs

GMMs can overfit when only having limited data; we want to avoid that the Gaussians get too narrow and fit to a single data point. To avoid this we add v²I to our variance. This makes sure that the variance does not collapse, and it is equivalent to placing a Wishart prior on the covariance matrix and computing the MAP. We choose v by cross-validation.
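A minimal sketch of Soft-EM for a GMM, including the v²I stabilizer from Section 14.3 (initialization choices and v² are placeholders; assumes d > 1):

    import numpy as np

    def gaussian_pdf(X, mu, Sigma):
        """Density N(x; mu, Sigma) evaluated for every row of X."""
        d = len(mu)
        diff = X - mu
        inv = np.linalg.inv(Sigma)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        return np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)) / norm

    def soft_em(X, k, iters=100, v2=1e-6):
        n, d = X.shape
        rng = np.random.default_rng(0)
        w = np.full(k, 1.0 / k)                       # uniform weight init
        mu = X[rng.choice(n, size=k, replace=False)]  # random means (or k-Means++)
        Sigma = np.stack([np.cov(X.T) + v2 * np.eye(d) for _ in range(k)])
        for _ in range(iters):
            # E-step: responsibilities gamma_j(x_i)
            dens = np.stack([w[j] * gaussian_pdf(X, mu[j], Sigma[j])
                             for j in range(k)], axis=1)
            gamma = dens / dens.sum(axis=1, keepdims=True)
            # M-step: the closed-form updates from above
            Nj = gamma.sum(axis=0)
            w = Nj / n
            mu = (gamma.T @ X) / Nj[:, None]
            for j in range(k):
                diff = X - mu[j]
                # v^2 * I keeps the covariances from collapsing (Section 14.3)
                Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j] + v2 * np.eye(d)
        return w, mu, Sigma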

14.4 Gaussian-Mixture Bayes Classifiers

We can also use GMMs for classification tasks, by assuming that the conditional distribution for each class can be modelled by a GMM:

    p(x | y) = Σ_{j=1}^{k_y} w_j^(y)·N(x; µ_j^(y), Σ_j^(y))

We can then use this model for classification, giving us highly complex decision boundaries:

    p(y | x) = (1/z)·p(y)·Σ_{j=1}^{k_y} w_j^(y)·N(x; µ_j^(y), Σ_j^(y))

14.5 GMMs for Density Estimation

So far we used GMMs primarily for clustering and classification. Another natural use case for GMMs is density estimation, which in turn can be used for anomaly detection or data imputation. To determine outliers, we simply compare the estimated density of a data point against a threshold value τ. This allows us to control the FP rate. As we vary the threshold, we trade FPs against FNs. We can use ROC curves as the evaluation criterion and optimize using cross-validation to find the optimal value for τ.

14.6 General EM Algorithm

The framework of soft EM can also be used for more general distributions than Gaussians. We formulate the two steps:

• E-Step: take the expected value over the latent variables to generate a likelihood function Q(θ; θ^(t−1)):

      Q(θ; θ^(t−1)) = E_Z[log p(X, Z | θ) | X, θ^(t−1)] = Σ_{i=1}^n Σ_{z_i=1}^k γ_{z_i}(x_i)·log p(x_i, z_i | θ)

  with γ_z(x) = p(z | x, θ^(t−1))

• M-Step: compute the MLE / maximize:

      θ^(t) = argmax_θ Q(θ; θ^(t−1))

It is important to note that we have guaranteed monotonic convergence: each EM iteration monotonically increases the data likelihood.
15 Generative Adversarial Networks

Until now, the models we explored failed to capture complex, high-dimensional data types (e.g. images and audio). The key idea is to use a neural network to learn a function that takes a "simple" distribution (e.g. Gaussian) and returns a nonlinear distribution. This leads us to the problem that it becomes intractable to compute the likelihood of the data, which is needed for the loss. Therefore we need an alternative objective for training.

We simultaneously train two neural networks: a generator G trying to produce realistic examples and a discriminator D trying to detect "fake" examples. This whole process can be viewed as a game, where the generator and discriminator compete against each other. This leads to the following objective:

    min_{w_G} max_{w_D} E_{x~p_data}[log D(x, w_D)] + E_{z~p_z}[log(1 − D(G(z, w_G), w_D))]

Training a GAN requires finding a saddle point rather than a (local) minimum. For a fixed generator G, the optimal discriminator is such that:

    D_G(x) = p_data(x) / (p_data(x) + p_G(x))

In general it is important that the discriminator is not too powerful, as this could lead to memorization on finite data. Other issues that can occur are oscillation/divergence or mode collapse. Evaluating GANs is still an open research question. One possible performance metric is the so-called duality gap:

    DG(w_G, w_D) = max_{w′_D} M(w_G, w′_D) − min_{w′_G} M(w′_G, w_D)

Where M(w_G, w_D) is the objective used in training.
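A minimal sketch of estimating this training objective on samples (D and G are placeholder callables returning discriminator probabilities and generated examples):

    import numpy as np

    def gan_objective(D, G, x_real, z):
        """M(w_G, w_D) = E_x[log D(x)] + E_z[log(1 - D(G(z)))], sample estimate."""
        return (np.mean(np.log(D(x_real))) +
                np.mean(np.log(1.0 - D(G(z)))))

    # Training alternates: ascend this objective in the discriminator weights,
    # descend it in the generator weights (a saddle-point search, as noted above).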
