IML Summary
by dcamenisch

1 Introduction

This document is a summary of the 2022 edition of the lecture Introduction to Machine Learning at ETH Zurich. I do not guarantee correctness or completeness, nor is this document endorsed by the lecturers. If you spot any mistakes or find other improvements, feel free to open a pull request at github.com/DannyCamenisch/iml-summary. This work is published as CC BY-NC-SA.

2 Regression

In this first part we are going to focus on fitting lines to data points. For this we will introduce the machine learning pipeline. It consists of three parts and has the goal of finding the optimal model f̂ for given data D, which we can use to predict new data. The three parts of the ML pipeline are the function class F, the loss function ℓ and the optimization method. In the coming sections f* will be the ground truth function and f̂ will be used for our (learned) prediction model.

The squared loss is a convex function and ŵ is the global minimum of this function. Therefore we can calculate the gradient ∇ℓ(y, f(x)) and solve for 0 to find ŵ. Later, we will see a more efficient way of finding ŵ.

2.1.1 Different Loss Functions

The square loss penalizes over- and underestimation the same. Further, it puts a large penalty on outliers (grows quadratically). While this is often good, we might want a different loss function. Some possibilities are:

• Huber loss - ignores outliers (a = y − f(x)):

  ℓδ(y, f(x)) := ½a²  for |a| ≤ δ,  δ · (|a| − ½δ)  otherwise

• Asymmetric losses - weigh over- and underestimation differently

Two common regularized variants of least squares are:

• Lasso Regression: argmin_{w∈ℝᵈ} ||y − Φw||₂² + λ||w||₁
• Ridge Regression: argmin_{w∈ℝᵈ} ||y − Φw||₂² + λ||w||₂²

Lasso regression sets a lot of weights to zero, while ridge regression just puts the focus on lower weights.

2.2 Nonlinear Functions

Linear functions helped us to keep the calculations "simple" and find good solutions. But often there are problems that are more complex and would require nonlinear functions. To avoid using nonlinear functions we introduce feature mapping.

3 Optimization

If the closed form is not available or desirable, as calculating it is expensive, we use the gradient descent algorithm. It works by initializing w⁰ and iteratively moving it towards the optimal solution. We choose the direction by calculating ∇ℓ(w) and then multiplying it by the stepsize / learning rate η:

wᵗ⁺¹ = wᵗ − ηₜ · ∇ℓ(wᵗ)

Convergence is only guaranteed for the convex case; otherwise we might get stuck at any stationary point.
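The gradient descent update of section 3 can be sketched in a few lines of numpy. This is a minimal illustration on the ridge objective; the data, λ, step size and iteration count are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def gradient_descent(Phi, y, lam=0.1, eta=0.005, steps=2000):
    """Minimize ||y - Phi w||_2^2 + lam ||w||_2^2 by gradient descent."""
    w = np.zeros(Phi.shape[1])                            # initialize w^0
    for _ in range(steps):
        grad = 2 * Phi.T @ (Phi @ w - y) + 2 * lam * w    # gradient of the objective
        w = w - eta * grad                                # w^{t+1} = w^t - eta_t * grad
    return w

# the objective is convex, so gradient descent should agree with the closed form
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
y = Phi @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=50)
w_gd = gradient_descent(Phi, y)
w_closed = np.linalg.solve(Phi.T @ Phi + 0.1 * np.eye(3), Phi.T @ y)
```

For a convex objective like this one, both routes reach the same minimizer; gradient descent only pays off once computing the closed form becomes too expensive.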
ŵ = argmin_w ½||w||₂² + λ Σ_{i=1}^n max(0, 1 − yᵢ w⊤xᵢ)
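The regularized hinge-loss objective above can be minimized by subgradient descent. A minimal sketch; the toy data, λ and step size are illustrative assumptions:

```python
import numpy as np

def hinge_objective(w, X, y, lam=1.0):
    """0.5 ||w||^2 + lam * sum_i max(0, 1 - y_i w^T x_i)"""
    margins = y * (X @ w)
    return 0.5 * w @ w + lam * np.maximum(0.0, 1.0 - margins).sum()

def svm_subgradient_step(w, X, y, lam=1.0, eta=0.01):
    """One subgradient step on the objective above."""
    margins = y * (X @ w)
    active = margins < 1                                   # points inside the margin
    grad = w - lam * (y[active, None] * X[active]).sum(axis=0)
    return w - eta * grad

# toy linearly separable data (an assumption for illustration)
X = np.array([[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.zeros(2)
for _ in range(200):
    w = svm_subgradient_step(w, X, y)
```

The max term is not differentiable at margin 1, which is why a subgradient (here: treating points exactly on the margin as inactive) is used instead of a gradient.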
10 Unsupervised Learning

µⱼ⁽ᵗ⁾ ← (1/nⱼ) Σ_{i : zᵢ = j} xᵢ
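The centroid update above is the second half of one k-Means (Lloyd's algorithm) iteration: assign each point to its closest centroid, then recompute each centroid as the mean of its assigned points. A minimal numpy sketch; the points and initial centroids are illustrative assumptions:

```python
import numpy as np

def kmeans_step(X, mu):
    """One iteration of Lloyd's algorithm: assign, then recompute centroids."""
    # assignment step: z_i = index of the closest centroid
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    z = d.argmin(axis=1)
    # update step: mu_j <- (1/n_j) * sum_{i: z_i = j} x_i
    new_mu = np.array([X[z == j].mean(axis=0) if np.any(z == j) else mu[j]
                       for j in range(len(mu))])
    return new_mu, z

# two well-separated toy clusters
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
mu = np.array([[0.0, 0.1], [4.0, 4.0]])
mu, z = kmeans_step(X, mu)
```

Keeping an empty cluster's old centroid (the `else mu[j]` branch) is one common convention; implementations differ on how they handle this case.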
We apply the kernel trick. We start by assuming w = Φ⊤α; plugging this into our objective and the constraint, we end up with:

α̂ = argmax_α (α⊤K⊤Kα) / (α⊤Kα)

We arrive at the general closed form solution:

α⁽ⁱ⁾ = (1/√λᵢ) vᵢ,   where K = Σ_{i=1}^n λᵢ vᵢ vᵢ⊤,   λ₁ ≥ ... ≥ λₙ ≥ 0

We start with the fundamental assumption that our data is generated iid. by some unknown distribution (note that this assumption is often violated in practice):

(xᵢ, yᵢ) ∼ p(x, y)

We want to find a hypothesis f : X → Y that minimizes the expected loss / prediction error / population risk (over all possible data):

R(f) = ∫ p(x, y) ℓ(y, f(x)) dx dy = E_{x,y}[ℓ(y, f(x))]

By the tower rule, R(f) = Ex[Ey[ℓ(y, f(x)) | X = x]]. Now we focus on the inner part; suppose we are given a fixed x:

f*(x) = argmin_ŷ Ey[(ŷ − y)² | X = x] = E[y | X = x]

We have therefore shown that the f* minimizing the population risk is given by the conditional mean, which can be calculated by:

f*(x) = E[y | X = x] = ∫ y · p(y | x) dy
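The closed-form solution above just reads off the eigendecomposition of the kernel matrix K. A small numpy sketch of computing the coefficient vectors α⁽ⁱ⁾; the linear-kernel toy data is an illustrative assumption:

```python
import numpy as np

def kernel_pca_coefficients(K):
    """Eigendecompose K = sum_i lambda_i v_i v_i^T and return the
    coefficient vectors alpha^(i) = v_i / sqrt(lambda_i) as columns."""
    lam, V = np.linalg.eigh(K)              # eigh returns ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]          # sort to lambda_1 >= ... >= lambda_n
    keep = lam > 1e-12                      # drop numerically zero directions
    return V[:, keep] / np.sqrt(lam[keep])

# toy kernel matrix from a linear kernel on centered data (illustrative)
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
K = X @ X.T
A = kernel_pca_coefficients(K)              # columns are the alpha^(i)
proj = K @ A                                # projections onto the principal directions
```

Each α⁽ⁱ⁾ is normalized so that α⊤Kα = 1, which is exactly what the 1/√λᵢ factor achieves.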
Note that we only need the conditional distribution p(y | x) and not the full joint distribution p(x, y). Thus one strategy for estimating a predictor from training data is to estimate the conditional distribution p̂(y | x) and then use it to predict labels via the conditional mean. One common approach to estimate the conditional distribution is to choose a particular parametric form and then estimate the parameters θ with maximum (log) likelihood estimation:

θ* = argmax_θ p̂(y₁, ..., yₙ | x₁, ..., xₙ, θ) = argmin_θ − Σ_{i=1}^n log p(yᵢ | xᵢ, θ)

11.1.1 Example: Conditional Linear Gaussian

Let us look at the case where we make the assumption that the noise is Gaussian. We have y = f(x) + ε with ε ∼ N(0, σ²) and f(x) = w⊤x. Therefore the conditional probability is:

p(y | x, w) = N(y; w⊤x, σ²)

Then we can find the optimal ŵ by using the definition of the normal distribution (some steps are left out):

ŵ = argmax_w p̂(y₁:ₙ | x₁:ₙ, w, σ)
  = argmin_w − Σ_{i=1}^n log N(yᵢ; w⊤xᵢ, σ²)
  = argmin_w Σ_{i=1}^n (yᵢ − w⊤xᵢ)²

Therefore we have shown that under the conditional linear Gaussian assumption, the MLE is equivalent to the least squares estimation.

11.1.2 Bias-Variance Tradeoff

Recall that the following holds:

Prediction Error = Bias² + Variance + Noise

Where we have:

• Bias: excess risk of the best model considered compared to the minimal achievable risk knowing p(x, y)
• Variance: risk incurred due to estimating the model from limited data
• Noise: risk incurred by the optimal model (irreducible error)

The MLE for linear regression is unbiased; further, it is the minimum variance estimator among all unbiased estimators. However, we have also seen that it can overfit.

11.2 Maximum a Posteriori Estimate

It is often favourable to introduce some bias (make assumptions) to reduce variance drastically. One such assumption could be that the weights are small. We can capture this assumption with a Gaussian prior wᵢ ∼ N(0, β²). Then, the posterior distribution of w is given by:

p(w | x̄, ȳ) = p(w, x̄, ȳ) / p(x̄, ȳ)
            = p(w, ȳ | x̄) · p(x̄) / (p(ȳ | x̄) · p(x̄))
            = p(w) · p(ȳ | w, x̄) / p(ȳ | x̄)

Hereby we used that w is a priori independent of x̄ (note that x̄ = x₁:ₙ, ȳ = y₁:ₙ). Now we want to find the maximum a posteriori estimate (MAP) for w:

ŵ = argmax_w p(w | x̄, ȳ)
  = argmin_w − log p(w) − log p(ȳ | w, x̄) + log p(ȳ | x̄)
  = argmin_w (σ²/β²) ||w||₂² + Σ_{i=1}^n (yᵢ − w⊤xᵢ)²

Which is exactly the same as ridge regression with λ = σ²/β². More generally, regularized estimation can often be understood as MAP inference, with different priors (= regularizers) and likelihoods (= loss functions).

11.3 Statistical Models for Classification

We now want to do the same risk minimization for classification. The population risk for the 0-1 loss is:

R(f) = P[y ≠ f(x)] = E_{x,y}[I_{y≠f(x)}]

Suppose we knew p(x, y); which f minimizes the population risk?

f*(x) = argmin_ŷ Ey[I_{y≠ŷ} | X = x] = argmax_ŷ p(ŷ | x)

This hypothesis f* minimizing the population risk is given by the most probable class; it is called the Bayes' optimal predictor for the 0-1 loss.

Similar to the regression case, we can now look at logistic regression and assume that we have iid. Bernoulli noise. Therefore the conditional probability is:

p(y | x, w) ∼ Ber(y; σ(w⊤x))

Where σ(z) = 1/(1 + exp(−z)) is the sigmoid function. Using MLE we get:

ŵ = argmax_w p(ȳ | w, x̄) = argmin_w Σ_{i=1}^n log(1 + exp(−yᵢ w⊤xᵢ))

Which is exactly the logistic loss. Instead of solving MLE we can estimate MAP, e.g. with a Gaussian prior:

ŵ = argmax_w p(w | x̄, ȳ) = argmin_w λ||w||₂² + Σ_{i=1}^n log(1 + exp(−yᵢ w⊤xᵢ))

12 Bayesian Decision Theory

We now want to use these estimated models to inform decisions. Suppose we have a given set of actions A. To act under uncertainty, we assign each action a cost C : Y × A → ℝ and pick the action with the maximum expected utility:

a* = argmin_{a∈A} Ey[C(y, a) | x]

This is called Bayesian decision theory or the maximum expected utility principle. If we had the true distribution, this decision implements the Bayes optimal decision. In practice we can only estimate this distribution, e.g. via logistic regression.
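The regularized logistic (MAP) objective from 11.3 can be minimized by plain gradient descent. A minimal numpy sketch; the toy data, λ and step size are illustrative assumptions:

```python
import numpy as np

def logistic_map_loss(w, X, y, lam=0.1):
    """lam ||w||^2 + sum_i log(1 + exp(-y_i w^T x_i)) -- the MAP objective."""
    return lam * w @ w + np.log1p(np.exp(-y * (X @ w))).sum()

def logistic_map_fit(X, y, lam=0.1, eta=0.05, steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))    # sigma(-y_i w^T x_i)
        grad = 2 * lam * w - (s * y) @ X          # gradient of the objective
        w = w - eta * grad
    return w

# toy data with labels in {-1, +1} (an illustrative assumption)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = logistic_map_fit(X, y)
```

Without the λ||w||² term this is plain logistic regression (MLE); the prior keeps the weights from diverging on separable data.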
12.1 Asymmetric Costs

We can then use this to implement an asymmetric cost function, e.g.:

C(y, a) = cFP if y = −1, a = +1;  cFN if y = +1, a = −1;  0 otherwise

13 Generative Modeling

In the previous part we looked at discriminative models with the aim to estimate the conditional distribution p(y | x). Generative models aim to estimate the joint distribution p(x, y). This will help us to model much more complex situations. Remember Bayes' rule:

p(y | x) = (1/z) · p(y) · p(x | y),  where p(y) · p(x | y) = p(x, y)
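Bayes' rule above can be applied directly once we have class priors p(y) and class-conditionals p(x | y). A minimal sketch with 1-d Gaussian class-conditionals (the priors, means and variances are illustrative assumptions):

```python
import numpy as np

def posterior(x, priors, likelihoods):
    """Bayes' rule: p(y | x) = (1/z) * p(y) * p(x | y), z normalizes over y."""
    unnorm = np.array([p * lik(x) for p, lik in zip(priors, likelihoods)])
    return unnorm / unnorm.sum()            # divide by z = p(x)

def gauss(mu, var):
    """1-d Gaussian density p(x | y) as a callable."""
    return lambda x: np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

priors = [0.5, 0.5]
likelihoods = [gauss(-1.0, 1.0), gauss(+1.0, 1.0)]
post = posterior(0.5, priors, likelihoods)  # p(y | x = 0.5) for both classes
```

Classifying by the largest posterior entry recovers the Bayes' optimal predictor for the 0-1 loss from section 11.3.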
In general, Soft-EM will typically result in higher likelihood values, as it can better deal with "overlapping" clusters. When speaking of EM we usually refer to Soft-EM. The EM algorithm is sensitive to initialization. We usually initialize the weights as uniformly distributed, the means randomly or with k-Means++, and for the variances we use spherical initialization or the empirical covariance of the data. To select k, in contrast to k-Means, we can use cross-validation.

14.3 Degeneracy of GMMs

GMMs can overfit when only having limited data; we want to avoid that the Gaussians get too narrow and fit to a single data point. To avoid this we add v²I to our variance. This makes sure that the variance does not collapse and is equivalent to placing a Wishart prior on the covariance matrix and computing the MAP. We choose v by cross-validation.

We can then use this model for classification, giving us highly complex decision boundaries:

p(y | x) = (1/z) · p(y) · Σ_{j=1}^{ky} wⱼ⁽ʸ⁾ N(x; µⱼ⁽ʸ⁾, Σⱼ⁽ʸ⁾)

14.5 GMMs for Density Estimation

So far, we used GMMs primarily for clustering and classification. Another natural use case for GMMs is density estimation, which in turn can be used for anomaly detection or data imputation.

14.6 General EM Algorithm

The framework of soft EM can also be used for more general distributions than Gaussians. We formulate the two steps:

• E-Step: Compute the expected complete-data log-likelihood under the current posterior over the latent variables: Q(θ; θ⁽ᵗ⁻¹⁾) = E_{z ∼ p(z | x, θ⁽ᵗ⁻¹⁾)}[log p(x, z | θ)]
• M-Step: Compute MLE / Maximize: θ⁽ᵗ⁾ = argmax_θ Q(θ; θ⁽ᵗ⁻¹⁾)

It is important to note that we have guaranteed monotonic convergence, where each EM-iteration monotonically increases the data likelihood.

15 Generative Adversarial Networks

Until now, the models we explored failed to capture complex, high-dimensional data types (e.g. images and audio). The key idea is to use a neural network to learn a function that takes a "simple" distribution (e.g. Gaussian) and returns a nonlinear distribution. This leads us to the problem that it becomes intractable to compute the likelihood of the data needed for the loss. Therefore we need an alternative objective for training.

We simultaneously train two neural networks, a generator G trying to produce realistic examples and a discriminator D trying to detect "fake" examples. This whole process can be viewed as a game, where the generator and discriminator compete against each other. This leads to the following objective:

min_{wG} max_{wD} E_{x∼pdata}[log D(x, wD)] + E_z[log(1 − D(G(z, wG), wD))]
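The minimax objective can be evaluated empirically once we have real samples, generator samples and a discriminator. A toy numpy sketch with a fixed logistic discriminator on 1-d data (the discriminator, the Gaussians and their parameters are illustrative assumptions):

```python
import numpy as np

def gan_value(x_real, x_fake, discriminator):
    """Empirical GAN objective
    E_{x~p_data}[log D(x)] + E_z[log(1 - D(G(z)))],
    where the generator samples x_fake = G(z) are already drawn."""
    d_real = discriminator(x_real)
    d_fake = discriminator(x_fake)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def discriminator(x, w=2.0, b=0.0):
    """Fixed logistic discriminator D(x) = sigma(w*x + b)."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

rng = np.random.default_rng(0)
x_real = rng.normal(loc=2.0, size=1000)       # "data" distribution
x_fake = rng.normal(loc=-2.0, size=1000)      # poor generator, easy to detect
v_bad = gan_value(x_real, x_fake, discriminator)
x_fake_good = rng.normal(loc=2.0, size=1000)  # generator matching the data
v_good = gan_value(x_real, x_fake_good, discriminator)
```

A generator whose samples match the data drives the objective down (it fools this discriminator), which is exactly what the outer minimization over wG rewards.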