
Introduction to Machine Learning


Course summary

Contents

1 Introduction
  1.1 Idea
  1.2 Types of Learning
    1.2.1 Supervised Learning
    1.2.2 Unsupervised Learning
    1.2.3 Other types of learning
  1.3 Machine Learning Pipeline
    1.3.1 Function Class
    1.3.2 Training Loss
    1.3.3 Optimization

2 Linear Regression
  2.1 Simple Linear regression in 1D
  2.2 Linear regression for general D

3 Non-linear Regression

4 Optimization
  4.1 Gradient Descent
  4.2 Stopping Conditions
  4.3 Convergence for Linear Regression
  4.4 Speeding up Gradient Descent
  4.5 Stochastic Gradient Descent
  4.6 Model Selection
  4.7 Test Set
  4.8 Validation Set
  4.9 K-fold Cross Validation
  4.10 Model Complexity

5 Bias-variance Tradeoff and Regularization
  5.1 Bias-variance Tradeoff
    5.1.1 Bias
    5.1.2 Variance
    5.1.3 Conclusion
  5.2 Regularization
    5.2.1 Comparing Lasso and Ridge Regression

6 Classification
  6.1 Problem
  6.2 Binary Classification
    6.2.1 Exponential and Logistic loss
    6.2.2 Maximum-Margin and Support Vector Machine
  6.3 Multiclass classification
    6.3.1 One-vs-Rest Approach
    6.3.2 Softmax Function
    6.3.3 Cross-entropy Loss
    6.3.4 Asymmetric Losses
    6.3.5 Receiver Operating Characteristic (ROC)
  6.4 Trustworthy Generalization

7 Kernel Methods
  7.1 Improving Polynomial Regression
    7.1.1 Computational Complexity
    7.1.2 Kernel trick
  7.2 Other nonlinear methods
    7.2.1 k-Nearest Neighbor
    7.2.2 Decision Trees

8 Neural Networks
  8.1 Learning Features

9 Clustering
  9.1 k-means Clustering
  9.2 k-means++ Clustering

10 Unsupervised learning: Dimensionality reduction
  10.1 Principal Component Analysis

11 Probabilistic modeling
  11.1 Maximum-likelihood estimation (Frequentist)
  11.2 Maximum-a-posteriori estimation (Bayesian)
  11.3 Discriminative vs. Generative approach
  11.4 Gaussian Mixture Models

This document is a summary of the Introduction to Machine Learning course that is part of ETH’s Computer Science bachelor program. It was created to summarize and review the key learnings of the course. It probably contains errors. 06.2024 Franz


1 Introduction

1.1 Idea

Machine learning focuses on the development of algorithms that enable computers to learn and improve
from experience without being explicitly programmed. It involves training algorithms on data to recognize
patterns, make predictions, or make decisions without explicit instructions.
Applications range from image classification, language processing, recommendation systems, and anomaly detection (e.g. in medical diagnostics), all the way to self-driving cars and drug discovery.

Clarification
Suppose we want to figure out if a patient is at risk of developing skin cancer. As input data we
consider time spent outside without proper protection (exposure to UV radiation).
▶ Traditional approach: We let the computer perform tasks based on specific rules that we defined.
Example: If UV radiation exposure of the patient is greater than x, then we assume that the patient
is at risk. If it is smaller than x, then we assume that the patient is not at risk.
▶ Machine Learning approach: Instead of providing the rules to the computer, we let the computer
figure out the rules based on example data that we provide to the computer.
Example: We provide the computer with data (xi , yi ) where xi is the UV radiation exposure of patient
i and yi is a boolean value indicating whether patient i developed skin cancer. Instead of specifying
a fixed x to decide if the patient is at risk of developing skin cancer (as in the traditional approach),
we let the computer efficiently find an x̂ that best fits the input data that we provided.

1.2 Types of Learning

1.2.1 Supervised Learning

Definition
Supervised Learning: In supervised learning we provide the computer with a labelled dataset,
meaning that each training example is paired with an output label. Using this input data, the
computer should learn rules which can be applied to new data. The above mentioned "UV radiation"
example is a case of supervised learning (yi in the input data represents the label).
Below are some supervised learning tasks:
▶ Classification: Predict the category to which the given input data belongs (e.g. "spam" or "no spam" for emails, or "0", "1", "2", "3", "4" representing cancer stages in images). So we predict a discrete scalar.
▶ Regression: Predict a continuous scalar for a given input (e.g. house prices based on features like location, size, and number of rooms, or estimating a person’s age from a photo). So we predict a continuous scalar.
▶ Structured Prediction: Predict complex outputs such as sequences, graphs or images (e.g. language translation, speech recognition, or image segmentation).

1.2.2 Unsupervised Learning

Definition
Unsupervised Learning: In unsupervised learning the computer learns patterns from unlabelled
data. This means that the training data does not contain output data, therefore there is no
supervision. It can be used to detect patterns or relationships within the data. Examples are
clustering or anomaly detection.
Below are some unsupervised learning tasks:
▶ Clustering: Given a set of images, identify which images belong together.
▶ Dimensionality reduction: The idea is to reduce the number of input variables (dimensions) while keeping as much information as possible. By reducing the dimensions we increase performance because we process fewer variables. Typically we are interested in removing noise or redundancy. So we transform the original high-dimensional data into a lower-dimensional representation while retaining as much relevant information as possible.
▶ Generative modeling: The idea is to learn the underlying probability distribution of the training data in order to generate new data samples similar to the ones in the training set. By this the machine can create new instances of data that are similar to the originally submitted training data.

1.2.3 Other types of learning


• Semi-supervised learning: Learning from labelled and unlabelled data
• Active learning: Acquire most informative labels for learning
• Online / lifelong / continual learning: Learning from data as it arrives over time
• Reinforcement learning: Learning by interacting with an unknown environment

1.3 Machine Learning Pipeline

Suppose we want to use machine learning to estimate the price of a house. If we can find the prices of other
houses we can use supervised learning and shape a regression task. We can proceed as shown below:

1. Find a numerical representation for the house: Since we cannot feed a house directly to a computer, we need to describe all properties that we are interested in by numerical values (e.g. location, size, number of bedrooms, age). For this simple example, let’s only consider the area of the house.
2. Collect training data: We want to collect labeled data (i.e. sizes and prices of other houses).

3. Learn: Using the collected training data, we want to efficiently find a function fˆ ∈ F that fits our
training data. F is a function class (e.g. linear functions, 2nd degree polynomials, etc.), and fˆ is
one specific function from F .
4. Predict: Finally, we can plug in the numerical representation of our own house into fˆ to get an
estimation of the price of our house (fˆ(area of our house) = price estimation of our house).

Returning to the previous example of estimating the price of a house, we now only consider the area (x),
and the house price (y). Therefore, our training data consists of (xi , yi ) for the ith house.

But how do we find fˆ? For this we need to specify the function class F , the training loss, and the optimization.


1.3.1 Function Class

As mentioned before, a function class is a collection of functions that have certain properties.

Therefore we first need to think about the shape or form of the function fˆ that we are looking for.

1.3.2 Training Loss

As we want to find a function fˆ that best fits our training data, we first have to clarify what "best fit" means.
We want to have low training loss, which means that fˆ(xi ) should be close to yi for most points. We define
the training loss as

$$L(f) = \frac{1}{n}\sum_{i=1}^{n} l(f(x_i), y_i)$$

where $l$ is the pointwise loss function, which represents the "closeness" between $f(x_i)$ and $y_i$ for a single point $(x_i, y_i)$. Intuitive loss functions are $|\hat f(x_i) - y_i|$ (the absolute distance between the label and the estimated result) or $(\hat f(x_i) - y_i)^2$ (squared loss; outliers are more strongly weighted).


Other examples are the Huber loss (weights outliers less strongly than the squared loss), or asymmetric losses (weight over- and underestimation differently).
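As an illustration (not from the course material), here is a minimal Python sketch, assuming numpy, that computes the average training loss for a candidate function under the squared and absolute pointwise losses:

import numpy as np

def training_loss(f, x, y, pointwise="squared"):
    # Average training loss L(f) = 1/n * sum_i l(f(x_i), y_i)
    residuals = f(x) - y
    if pointwise == "squared":   # outliers are weighted more strongly
        losses = residuals ** 2
    else:                        # absolute distance |f(x_i) - y_i|
        losses = np.abs(residuals)
    return losses.mean()

# Hypothetical toy data and a candidate linear function f(x) = 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
print(training_loss(lambda x: 2 * x, x, y))              # squared loss
print(training_loss(lambda x: 2 * x, x, y, "absolute"))  # absolute loss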

1.3.3 Optimization

Optimization describes the way we find the minimum of the training loss function (and therefore find the
best possible weights). There are several approaches:
Closed Form: This is the most straightforward approach: we directly compute the minimum of the training loss function (e.g. by setting the gradient to zero). But this approach cannot be used if the training loss function has many local minima: finding the global minimum by checking all local minima may be inefficient.
Gradient Descent: In this approach we try to find the minimum in an iterative manner. We start at a given point and iteratively move in the direction of the negative gradient (the gradient points in the direction of steepest ascent at a given point, so we go in the opposite direction). This allows us to find a local minimum.
Stochastic Gradient Descent: Instead of computing the exact gradient over the entire dataset at each step, we randomly select a subset of the data and compute the gradient for that subset. This makes each iteration much cheaper and helps to avoid local minima.
Momentum: We add momentum to gradient descent. This means that we move in the direction of the
negative gradient and the accumulated past gradients (momentum). By considering the momentum, we
smooth out the path and accelerate the convergence. Momentum helps to get rid of oscillations and can lead
to a faster and more stable descent.


2 Linear Regression

2.1 Simple Linear regression in 1D

For simple linear regression in 1D we choose the function class of all linear functions $\mathcal{F} = \{w_1 x + w_2 \mid w_1, w_2 \in \mathbb{R}\}$ ($w_1$ and $w_2$ are called weights) and the squared loss approach. Our job is to find the linear function $\hat f \in \mathcal{F}$ that minimizes the training loss (average squared loss):

$$\hat f = \operatorname{argmin}_{f \in \mathcal{F}} L(f) = \operatorname{argmin}_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (f(x_i) - y_i)^2$$

Since $f = w_1 x + w_2$ we can find the correct parameters by

$$\hat w := (\hat w_1, \hat w_2) = \operatorname{argmin}_{w_1, w_2 \in \mathbb{R}} \frac{1}{n}\sum_{i=1}^{n} (w_1 x_i + w_2 - y_i)^2$$

Theorem
Necessary condition for local optimality: Let Ω ⊂ Rn be open and f : Ω → R a continuously
differentiable function. Then we have:
x0 ∈ Ω is a local minimizer =⇒ ∇f (x0 ) = 0
Therefore, when searching for a local minimizer, we only need to consider the set of critical points
defined as {x ∈ Ω|∇f (x) = 0}.

Therefore it follows that the global minimizer $\hat w$ must satisfy $\nabla_w L(\hat w) = 0$. The training loss is $L(w_1, w_2) = \frac{1}{n}\sum_{i=1}^{n} (w_1 x_i + w_2 - y_i)^2$, so we require

$$\nabla_w L(\hat w_1, \hat w_2) = \begin{pmatrix} \frac{\partial L}{\partial w_1}(\hat w_1, \hat w_2) \\ \frac{\partial L}{\partial w_2}(\hat w_1, \hat w_2) \end{pmatrix} = \begin{pmatrix} \frac{2}{n}\sum_{i=1}^{n} (w_1 x_i + w_2 - y_i)\,x_i \\ \frac{2}{n}\sum_{i=1}^{n} (w_1 x_i + w_2 - y_i) \end{pmatrix} \overset{!}{=} 0$$
The number of solutions depends on our set of points.

2.2 Linear regression for general D

Instead of just considering one attribute (the size of the house), we want to include other attributes such as
distance to city center, age of the property, or number of rooms.

Clarification
We denote x[i] to describe the ith attribute of an input vector, and xi to describe the ith input vector.
Example attributes for dimensionality d = 3:
• x[1] = size in square meters
• x[2] = distance to city center in kilometers
• x[3] = age of the property in years
And input data:
• x1 = (100, 3, 40)
• x2 = (200, 6, 70)
• x3 = (50, 1, 3)


Function class: To model linear regression for general $d$ we pick the function class of all linear functions with $d$ features $\mathcal{F} = \{w_0 + \sum_{i=1}^{d} w_i x_{[i]} \mid w_0, \dots, w_d \in \mathbb{R}\}$. In vector notation, this is equal to $\{w_0 + w^T x \mid w_0 \in \mathbb{R}, w = (w_1, \dots, w_d) \in \mathbb{R}^d\}$. This can be visualized as a plane.

Training loss: We again consider the squared loss method. Therefore, the pointwise loss function is $(f(x_i) - y_i)^2$ (for a given point $i$) and the average loss function over all points is $L(f) = \frac{1}{n}\sum_{i=1}^{n} (f(x_i) - y_i)^2$. We look for the optimal function $\hat f \in \mathcal{F}$ such that $\hat f = \operatorname{argmin}_{f \in \mathcal{F}} L(f) = \operatorname{argmin}_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (f(x_i) - y_i)^2$.

We now want to represent this equation using vector notation. Since $f$ is of the form $f(x) = w_0 + w^T x$ we can also minimize over the weights (so the scalar $w_0$ and the vector $w$):

$$(\hat w_0, \hat w) = \operatorname{argmin}_{w_0 \in \mathbb{R}, w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} (w_0 + w^T x_i - y_i)^2$$

Let's define $v_i := w_0 + w^T x_i$, and let $v \in \mathbb{R}^n$ be the vector with entries $v_1, \dots, v_n$.


Going back to our training loss function:

$$L(w_0, w) = \frac{1}{n}\sum_{i=1}^{n} (w_0 + w^T x_i - y_i)^2 \qquad\text{(loss function)}$$
$$= \frac{1}{n}\sum_{i=1}^{n} (v_i - y_i)^2 \qquad\text{(definition of } v_i)$$
$$= \frac{1}{n}\|v - y\|_2^2 \qquad\text{(definition of the squared 2-norm)}$$
$$= \frac{1}{n}\|\mathbf{1}w_0 + Xw - y\|_2^2 \qquad (v = \mathbf{1}w_0 + Xw\text{, where }\mathbf{1}\text{ is the all-ones vector})$$

For simplicity reasons, let's set $w_0 = 0$:

$$L(0, w) = \frac{1}{n}\|Xw - y\|_2^2 \qquad (w_0 = 0)$$
$$= \frac{1}{n}\left(\|Xw\|_2^2 - 2(Xw)^T y + \|y\|_2^2\right) \qquad (\|a+b\|_2^2 = \|a\|_2^2 + 2a^T b + \|b\|_2^2 \text{ with } a = Xw,\ b = -y\text{; see clarification})$$
$$= \frac{1}{n}\left((Xw)^T Xw - 2(Xw)^T y + y^T y\right) \qquad (\|a\|_2^2 = a^T a)$$
$$= \frac{1}{n}\left(w^T X^T X w - 2 w^T X^T y + y^T y\right) \qquad ((AB)^T = B^T A^T)$$


Clarification
Why is $\|a+b\|_2^2 = \|a\|_2^2 + 2a^T b + \|b\|_2^2$?

$$\|a+b\|_2^2 = \left(\sqrt{\textstyle\sum_{i=1}^{n} (a_i + b_i)^2}\right)^2 \qquad\text{(definition of 2-norm)}$$
$$= (a+b)\cdot(a+b) \qquad\text{(the root cancels against the square)}$$
$$= (a+b)^T (a+b) \qquad\text{(definition of dot product)}$$
$$= (a^T + b^T)(a+b) \qquad ((a+b)^T = a^T + b^T)$$
$$= a^T a + a^T b + b^T a + b^T b \qquad\text{(expand)}$$
$$= a^T a + 2a^T b + b^T b \qquad\text{(commutativity of the dot product: } b^T a = a^T b)$$
$$= \|a\|_2^2 + 2a^T b + \|b\|_2^2 \qquad\text{(definition of 2-norm)}$$

Now we take the gradient with respect to $w$:

$$\nabla_w L(0, w) = \frac{1}{n}\left(2X^T X w - 2X^T y\right) = \frac{2}{n} X^T X w - \frac{2}{n} X^T y \overset{!}{=} 0$$

$$\frac{2}{n} X^T X w = \frac{2}{n} X^T y \implies X^T X w = X^T y \implies w = (X^T X)^{-1} X^T y \qquad\text{(if } X^T X \text{ is invertible)}$$
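As a minimal sketch (assuming numpy; the data below is made up), the closed-form solution can be computed directly. In practice, np.linalg.lstsq is numerically preferable to forming $X^T X$ explicitly:

import numpy as np

# Hypothetical data: n = 5 houses, d = 2 features each (w0 set to 0 as above)
X = np.array([[100., 3.], [200., 6.], [50., 1.], [120., 4.], [80., 2.]])
y = np.array([500., 900., 250., 600., 400.])

# Closed-form solution w = (X^T X)^{-1} X^T y (assumes X^T X is invertible)
w = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically more stable alternative yielding the same least-squares solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)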

Geometric interpretation: We can also find a geometric interpretation which may be more intuitive. We want to find $\hat w$.

The set of all possible $Xw$ is equal to the image of $X$ (so $\operatorname{span}(X) = \{Xw \mid w \in \mathbb{R}^d\}$). The closest point to $y$ on $\operatorname{span}(X)$ is the orthogonal projection of $y$ onto $\operatorname{span}(X)$, which we denote $\Pi_X y$. We now want to find the $\hat w$ that satisfies $X\hat w = \Pi_X y$. We know that whatever $w \in \mathbb{R}^d$ we pick, $Xw$ will always be orthogonal to $y - X\hat w$.

$$y - X\hat w \perp Xw \text{ for all } w \implies (y - X\hat w)\cdot Xw = 0 \text{ for all } w \qquad\text{(dot product of orthogonal vectors is 0)}$$
$$\implies (y - X\hat w)^T Xw = 0 \text{ for all } w \qquad\text{(definition of dot product)}$$
$$\implies uw = 0 \text{ for all } w, \text{ where } u := (y - X\hat w)^T X$$
$$\implies u = 0$$
$$\implies (y - X\hat w)^T X = 0 \qquad\text{(by definition of } u)$$
$$\implies y^T X - (X\hat w)^T X = 0 \qquad\text{(linear algebra rules)}$$
$$\implies y^T X = (X\hat w)^T X \qquad (+\,(X\hat w)^T X)$$
$$\implies X^T y = X^T X\hat w \qquad (a^T b = b^T a)$$
$$\implies \hat w = (X^T X)^{-1} X^T y \qquad\text{(if } X^T X \text{ is invertible)}$$

This is the same result as calculated using the gradient!

3 Non-linear Regression

We can also choose a function class of non-linear functions to fit our data points (e.g. the set of polynomials of degree $p-1$: $\mathcal{F} = \{w_0 + w_1 x + \dots + w_{p-1} x^{p-1}\}$).
These non-linear functions can be represented as $f(x) = \sum_{i=1}^{p} w_i \phi_i(x)$, where $\phi(x) = (\phi_1(x), \dots, \phi_p(x))$ is defined as the feature vector. So $f(x)$ is a linear combination of the features. In the example of polynomials, $\phi(x) = (1, x, \dots, x^{p-1})$. But we could also come up with a trigonometric basis $\phi(x) = (1, \sin(2\pi x), \dots, \sin(2\pi(p-1)x))$ or anything else.
We are interested in finding the best weights (consider the squared loss again):

$$\hat w = \operatorname{argmin}_{w \in \mathbb{R}^p} L(w) = \operatorname{argmin}_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n} (y_i - w^T\phi(x_i))^2 = \operatorname{argmin}_{w \in \mathbb{R}^p} \frac{1}{n}\|y - \Phi w\|_2^2$$

where $\Phi$ is the matrix whose $i$th row is $\phi(x_i)^T$.
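A small sketch of this idea (assuming numpy; the data and degree are illustrative): build the feature matrix $\Phi$ for polynomial features and solve the same least-squares problem as before:

import numpy as np

def poly_features(x, p):
    # Feature matrix Phi with rows phi(x_i) = (1, x_i, ..., x_i^(p-1))
    return np.vander(x, N=p, increasing=True)

x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)  # hypothetical noisy samples

Phi = poly_features(x, p=4)                  # cubic polynomial features
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # minimizes ||y - Phi w||_2^2
y_hat = Phi @ w                              # fitted values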

4 Optimization

How do we find the minimum of a training loss function in the most efficient way?

4.1 Gradient Descent

Most of the time we cannot simply compute the minimum of the training loss function (so we cannot follow
the closed-form approach). The reason is that many machine learning models have a very high number of
parameters (often in the millions or even billions). The loss function’s landscape in such a high-dimensional
space is incredibly complex. Analytically solving for the minimum of such functions is too inefficient.
Therefore we have the below iterative algorithm to find a local minimum of a differentiable training loss
function.

Definition
General iterative algorithm to minimize L(w):

1. Start at an initial w0
2. At each step t, calculate wt+1 = wt − η∇w L(wt )
3. Stop at certain condition (e.g. |L(wt+1 ) − L(wt )| ≤ ϵ)


4. Output ŵ = wt

Clarification
1. Start at an initial w0 : To begin the iterative approach, we first need to choose a starting point
w0 . This point can be chosen at random or in a certain range (in case we have some vague idea where
the minimum might be).
2. At each step $t$, calculate $w^{t+1} = w^t - \eta\nabla_w L(w^t)$: How did we come up with this formula? Let's say that in every iteration we want to move $\tilde\eta$ units (called the step size) in a certain direction $v$ ($v$ is a unit vector, so $\|v\| = 1$). This gives us $w^{t+1} = w^t + \tilde\eta v$.
• Step size: It would be great if the step size $\tilde\eta$ were not constant. If it were constant, we might overshoot the minimum of the function and fail to converge. A constant step size may also result in slow convergence. Therefore we set $\tilde\eta = \eta\|\nabla_w L(w^t)\|$ where $\eta$ is a constant. If we are at some steep point of $L$, the 2-norm of the gradient $\|\nabla_w L(w^t)\|$ is large, resulting in a large step size. But if we are at some flat point, the 2-norm of the gradient is small, resulting in a small step size.
• Direction: We know that the gradient points in the direction of steepest ascent. As we want to find a minimum of the loss function, we need to move in the negative direction of the gradient. Therefore we have $v = -\frac{\nabla_w L(w^t)}{\|\nabla_w L(w^t)\|}$ (we divide by the length of the gradient to normalize it; $v$ should be a unit vector).
This gives us:

$$w^{t+1} = w^t + \tilde\eta v = w^t + \eta\|\nabla_w L(w^t)\|\left(-\frac{\nabla_w L(w^t)}{\|\nabla_w L(w^t)\|}\right) = w^t - \eta\nabla_w L(w^t)$$

(the norms cancel out).
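A minimal sketch of the algorithm above (assuming numpy; the quadratic example loss, $\eta$ and $\epsilon$ are illustrative choices):

import numpy as np

def gradient_descent(L, grad_L, w0, eta=0.1, eps=1e-8, max_steps=10000):
    # Iterate w_{t+1} = w_t - eta * grad_L(w_t) until |L(w_{t+1}) - L(w_t)| <= eps
    w = w0
    for _ in range(max_steps):
        w_next = w - eta * grad_L(w)
        if abs(L(w_next) - L(w)) <= eps:  # convergence criterion
            return w_next
        w = w_next
    return w  # a priori termination after max_steps

# Toy example: L(w) = ||w - 3||^2 has its minimum at w = (3, 3)
L = lambda w: np.sum((w - 3.0) ** 2)
grad_L = lambda w: 2.0 * (w - 3.0)
print(gradient_descent(L, grad_L, np.zeros(2)))  # approximately [3. 3.]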


4.2 Stopping Conditions

When do we stop the iterations? Below are three approaches:


▶ Convergence criterion: Stop when the difference between the last two iterations falls below a certain constant $\epsilon$ (so $|L(w^{t+1}) - L(w^t)| \le \epsilon$). This indicates that further iterations are unlikely to improve the solution much.
▶ Gradient norm: Stop when the norm of the gradient falls below a certain constant $\epsilon$ (so $\|\nabla L(w^t)\| \le \epsilon$).
▶ A priori termination: Abort the iterations after $n$ steps, no matter how close we are to the solution.
There are more stopping conditions, each with its own upsides and downsides.

4.3 Convergence for Linear Regression

How can we prove that our iteration converges to ŵ =argminw∈Rd L(w)?


Assume we are at a certain point $w^t$ of our iteration. We need to show that the next iterate $w^{t+1}$ is closer to $\hat w$ than $w^t$ is (formally: $\|w^{t+1} - \hat w\|_2 \le \|w^t - \hat w\|_2$).

$$\|w^{t+1} - \hat w\|_2 = \|w^t - \eta\nabla L(w^t) - \hat w\|_2 \qquad (w^{t+1} = w^t - \eta\nabla L(w^t) \text{ as shown previously})$$
$$= \|w^t - \eta(X^T X w^t - X^T y) - \hat w\|_2 \qquad (\nabla L(w^t) = X^T X w^t - X^T y \text{ for } n = 2\text{, as shown above})$$
$$= \|w^t - \eta(X^T X w^t - X^T X\hat w) - \hat w\|_2 \qquad (X^T X\hat w = X^T y \text{ as shown previously})$$
$$= \|(I - \eta X^T X)(w^t - \hat w)\|_2$$

Definition
Operator norm: For any matrix $A$:

$$\|A\|_{op} = \sup_{z \neq 0} \frac{\|Az\|_2}{\|z\|_2}$$

So $\|A\|_{op}$ is the largest factor by which $A$ can stretch a vector $z$. Or alternatively, $\|A\|_{op}$ is the length of the longest vector we get by plugging in all possible unit vectors $z$.

Note that for all vectors $z'$: $\frac{\|Az'\|_2}{\|z'\|_2} \le \|A\|_{op}$. By multiplying both sides by $\|z'\|_2$ we get $\|Az'\|_2 \le \|A\|_{op}\|z'\|_2$.
Above we derived:

$$\|w^{t+1} - \hat w\|_2 = \|(I - \eta X^T X)(w^t - \hat w)\|_2$$
$$\le \|I - \eta X^T X\|_{op}\,\|w^t - \hat w\|_2 \qquad (\|Az'\|_2 \le \|A\|_{op}\|z'\|_2 \text{ with } A = I - \eta X^T X,\ z' = w^t - \hat w)$$
$$= \rho\,\|w^t - \hat w\|_2 \qquad (\text{let } \rho = \|I - \eta X^T X\|_{op})$$

This means we have $\|w^{t+1} - \hat w\|_2 \le \rho\|w^t - \hat w\|_2$. When $\rho < 1$, then $\|w^{t+1} - \hat w\|_2 < \|w^t - \hat w\|_2$, so every iteration brings us strictly closer to $\hat w$. We can guarantee $\rho < 1$ by choosing the correct step size $\eta$.

4.4 Speeding up Gradient Descent

If the contour lines around the optimum are circles, gradient descent converges faster than if the contour lines around the optimum are stretched ellipses.

The idea is to introduce momentum to GD. Therefore we modify GD as follows: $w^{t+1} = w^t + \beta(w^t - w^{t-1}) - \eta\nabla L(w^t)$ (with $\beta$ we can control the momentum).

This speeds up GD in flat areas and additionally dampens oscillations.
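A sketch of the momentum update (assuming numpy; $\beta$, $\eta$ and the stretched quadratic example are illustrative):

import numpy as np

def momentum_gd(grad_L, w0, eta=0.05, beta=0.9, steps=100):
    # w_{t+1} = w_t + beta * (w_t - w_{t-1}) - eta * grad_L(w_t)
    w_prev, w = w0, w0
    for _ in range(steps):
        w_next = w + beta * (w - w_prev) - eta * grad_L(w)
        w_prev, w = w, w_next
    return w

# Stretched quadratic bowl L(w) = 5*w1^2 + 0.5*w2^2, where plain GD oscillates
grad_L = lambda w: np.array([10.0 * w[0], 1.0 * w[1]])
print(momentum_gd(grad_L, np.array([1.0, 1.0])))  # converges towards [0, 0]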


4.5 Stochastic Gradient Descent

Using GD can be computationally expensive, especially for large datasets (large number of training points $n$). The memory required is $O(nd)$ and the computation time for one iteration is $O(n \cdot \text{time to compute one pointwise gradient})$.
In SGD, we partition all data points into small mini-batches $S_1, \dots, S_m$. In each iteration we consider only the points in one mini-batch $S$ instead of all points: instead of $\nabla L(w)$ we use $\nabla L_S(w) = \frac{1}{|S|}\sum_{i \in S}\nabla l(y_i, f(x_i))$.
This means that our loss function is not the same for every iteration, but just an approximation of the training loss function. This results in inaccuracies, but because SGD is so much faster we can perform more iterations.
As mentioned, SGD works great for large datasets. Additionally, the fact that only the points in the mini-batch are considered introduces some variation. This can help to escape local minima and saddle points more easily.
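A mini-batch SGD sketch for squared-loss linear regression (assuming numpy; batch size, learning rate and epoch count are illustrative):

import numpy as np

def sgd_linear_regression(X, y, eta=0.01, batch_size=8, epochs=50, seed=0):
    # Mini-batch SGD for L(w) = 1/n * ||Xw - y||^2
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        # Shuffle and split the indices into mini-batches S_1, ..., S_m
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)  # gradient on the batch only
            w -= eta * grad
    return w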


4.6 Model Selection

The machine learning engineer must find the best possible model (function class, training loss method and
optimization method) such that it accurately captures the underlying patterns in the data and generalizes
well to unseen data. Therefore we need to find a way to measure how good our estimation fˆ is. We want to
get a sense for how well fˆ does in predicting unseen test samples.
We assume that there is a ground truth function $f^*$ that we would be able to compute if we had all possible data points available and no measurement errors. A good model therefore should have a low estimation error $l(f^*(x), \hat f(x)) = (f^*(x) - \hat f(x))^2$. Sadly we cannot compute the estimation error because we don't know what the ground truth function $f^*$ looks like. Therefore, a good model $\hat f$ should have a low prediction error on unseen test input (a subset of the data that we collected).

Definition

Training error/loss: Training error refers to the loss that fˆ incurs on the data it was trained on.
The training error gives an indication of how well the model fits the training data. A low training
error suggests that the model has learned the patterns in the training data well.
Generalization error/loss: Generalization error (or test error) is the loss on new, unseen data. It is
calculated by evaluating the model’s performance on a separate dataset called the test dataset, which
was not used during training. The generalization error measures how well the model can generalize
its predictions to new, unseen examples. It provides an estimate of how well the model will perform
in real-world applications.

4.7 Test Set

Usually the training loss is too optimistic, because the model is optimized for the training data, which does not necessarily reflect how well it generalizes to unseen data. This phenomenon is called overfitting: intuitively, the model has learned to memorize the training data instead of learning patterns that generalize to unseen data.


So far we used all data points for training. This can lead to overfitting and additionally we are not able to
measure how well the model performs on unseen data.

The main idea is to hold out part of the available data (we call this held-out set "test data") and use the
rest of the data points for training.

We train on all data that is not in the test set and use the test data only for approximating the generalization
error.


When training the model on the training set $D$, we get the estimate $\hat f$. The generalization error can now be calculated similarly to the training loss, but only over the data in the test set $D''$: $\frac{1}{|D''|}\sum_{(x,y)\in D''} l(\hat f(x), y)$.
Usually, the more data we have, the smaller the test set needs to be in relative terms (while still growing in absolute terms).
Note: In some cases it makes sense to split the data randomly (e.g. with housing prices example), whereas
in other cases data must be split in chunks (e.g. time related data like temperature measurements).

4.8 Validation Set

Additionally to the test data, we can hold out another subset of points in the training data to use it as a
pre-validation for a model. Intuitively, it’s a practice exam for the model before the final test against the
test data. This allows us to select the best features and tune hyperparameters without biasing the model on
the unseen test data set.

Definition
Hyperparameters: Hyperparameters are parameters that are set prior to the training process. They
define the structure or behavior of the learning algorithm and are not learned from the data during
training. Examples are the learning rate, activation function, or batch size.

Clarification
Features: Features are the data used as input for models to make predictions or decisions. They can be the attributes describing each piece of data (e.g. the color, size, or shape of an object), or combinations or transformations of attributes.

4.9 K-fold Cross Validation

We partition the training data Dtrain into k subsets of equal size (Dtrain = D1 ∪ D2 ∪ ... ∪ Dk ) as shown
below.


Then we perform the below algorithm:

Definition
K-Fold algorithm: Given a choice of model (hyperparameters and features $\Phi$) and $K \in \mathbb{N}$ (usually 5 or 10), do:
1. For all $k = 1, \dots, K$ (so for each partition):
(a) Compute the $\hat f_k^\Phi$ that minimizes the training loss on $D_{train}\setminus D_k$ using features $\Phi$ (notation: $\hat f_k^\Phi = M(D_k)$).
(b) Compute the validation error on partition $k$: $L_k(\Phi) := \frac{1}{|D_k|}\sum_{(x,y)\in D_k} l(\hat f_k^\Phi(x), y)$
2. Now we have $K$ functions $\hat f_1^\Phi, \dots, \hat f_K^\Phi$, each trained on a different data set. Additionally, we have $K$ validation errors $L_1(\Phi), \dots, L_K(\Phi)$. Now we compute the average validation error, called the "cross-validation error": $CV(\Phi) = \frac{1}{K}\sum_{k=1}^{K} L_k(\Phi)$.
Note:

• For every choice of features and hyperparameters, we now have a corresponding cross-validation error $CV(\Phi)$
• We can now perform the above procedure for other choices of features and hyperparameters and choose the one with the lowest cross-validation error
• Let the optimal features be $\Phi^*$. We are now able to compute the final model $\hat f^{\Phi^*}$ on the entire training data $D_{train}$
• The generalization error of $\hat f^{\Phi^*}$ can be calculated using the held-out test data (generalization error $= L(\hat f^{\Phi^*}, D_{test})$)
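A sketch of the procedure (assuming numpy; fit and pointwise_loss are hypothetical callables standing in for the training routine and the pointwise loss l):

import numpy as np

def cross_validation_error(fit, pointwise_loss, X, y, K=5, seed=0):
    # Average validation error CV = 1/K * sum_k L_k over K folds
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    errors = []
    for k in range(K):
        val = folds[k]                                   # D_k
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        f_hat = fit(X[train], y[train])                  # train on D_train \ D_k
        errors.append(np.mean(pointwise_loss(f_hat(X[val]), y[val])))
    return float(np.mean(errors))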


4.10 Model Complexity

The training loss (the loss that $\hat f$ incurs on the training data) decreases as the complexity of the model increases. On the other hand, the generalization loss (the loss that $\hat f$ incurs on the test data) is U-shaped: it first decreases and then increases again as the model starts to overfit. We want to choose the model complexity that minimizes the generalization loss.

Note that:

• Underfitting models predict...

– ...training data not well


– ...test data not well

• Right models predict...

– ...training data well


– ...test data well

• Overfitting models predict...

– ...training data well


– ...test data not well

5 Bias-variance Tradeoff and Regularization

5.1 Bias-variance Tradeoff

5.1.1 Bias

Definition
Bias: The bias of a model $M$ is the expected squared distance of the average model $\bar f = \frac{1}{K}\sum_{k=1}^{K}\hat f_k^\Phi$ to the ground truth function $f^*$. It can be estimated as $E_X\left[(f^*(X) - \bar f(X))^2\right]$ (though in practice we don't know $f^*$).


▶ Too simple models generally have high bias

▶ Too complex models generally have low bias


5.1.2 Variance

Definition
Variance: The variance of a model $M$ is the expected squared distance of the individual models $\hat f_k^\Phi = M(D_k)$ to the average model $\bar f$. It can be estimated as $E_X\left[\frac{1}{K}\sum_{k=1}^{K}(\hat f_k^\Phi(X) - \bar f(X))^2\right]$.

▶ Too simple models generally have small variance

▶ Too complex models generally have high variance

5.1.3 Conclusion

Bias: Expected squared distance of average function to ground truth function


Variance: Expected squared distance of individual functions to average function


5.2 Regularization

Regularization in machine learning is a technique used to prevent a model from overfitting on the training
data. We know that overfitting occurs when a model learns the noise or random fluctuations in the training
data instead of the actual underlying patterns, making it perform well on the training data but poorly on any
unseen data. Regularization addresses this issue by adding a penalty on the size of the model parameters,
encouraging the model to be simpler.
Example: Consider the ground truth function $f^*(x) = -x + 3x^2 + x^5$ and $n = 35$ noisy samples $y = f^*(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 0.4)$.

We choose the method: nonlinear regression with polynomial features $\phi(x) = (1, x, x^2, \dots, x^m)$ of degree $m$.
If we set $m = 12$ and use the minimizer $\operatorname{argmin}_{w\in\mathbb{R}^d}\|y - \Phi w\|_2^2$, then we get a polynomial of too high a degree and experience overfitting.


We see that the polynomial function is "wiggly", which means that its coefficients are large.

Definition
1-Norm: $\|w\|_1 = \sum_{i=1}^{d} |w_i|$ (the sum of the absolute values of all entries of the vector)

To make the polynomial function less wiggly, we can set a penalty on the coefficients:

$$\operatorname{argmin}_{w\in\mathbb{R}^d}\|y - \Phi w\|_2^2 \quad\text{such that}\quad \|w\|_1 \le C \quad (C \in \mathbb{N})$$

We simply require that the final weight vector does not have too large coefficients.

Theorem
For any $C \in \mathbb{N}$ that we choose, there is a $\lambda$ such that:

$$\hat w_\lambda^{lasso} = \operatorname{argmin}_{w\in\mathbb{R}^d}\|y - \Phi w\|_2^2 + \lambda\|w\|_1$$

(we call this Lasso regression)

$$\hat w_\lambda^{ridge} = \operatorname{argmin}_{w\in\mathbb{R}^d}\|y - \Phi w\|_2^2 + \lambda\|w\|_2^2$$

(we call this Ridge regression). Note: we use the squared 2-norm instead of the 1-norm.

Clarification
The larger λ becomes, the smaller the 1-norm (in the case of lasso) or the squared 2-norm (in the case
of ridge) are. This means that the polynomial function will be less wiggly and thus we get a simpler
model.
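Ridge regression even has a closed-form solution: setting the gradient of $\|y - \Phi w\|_2^2 + \lambda\|w\|_2^2$ to zero gives $\hat w = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T y$. A minimal sketch (assuming numpy); note that Lasso has no closed form and is typically solved iteratively:

import numpy as np

def ridge_fit(Phi, y, lam):
    # Closed-form ridge solution w = (Phi^T Phi + lam * I)^{-1} Phi^T y
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# A larger lam shrinks the weights and yields a less "wiggly" polynomial.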

5.2.1 Comparing Lasso and Ridge Regression

Lasso:


Ridge:

Clarification
Lasso regression helps us find a sparser solution (i.e. a solution where some coefficients are 0). This is because of the sharp diamond shape of the constraint region $\|w\|_1 \le C$. This makes it likely that the closest point to the optimum lies in one of the shape's sharp corners, and that is a point where some weights are 0.


It is always possible to set $\lambda$ in such a way that a sparse solution is found.

Analogously to the previous geometric intuition, ridge regression helps us find a denser solution.

We can use cross-validation to find an optimal λ.

1. Choose $\lambda$, $\Phi$ and $K \in \mathbb{N}$
2. For $k = 1, \dots, K$:
(a) Calculate $\hat f_k^\lambda = \operatorname{argmin}_f \left(L(f, D_{train}\setminus D_k) + \lambda\|f\|\right)$
(b) Calculate the cross-validation loss on $D_k$: $CV_k^\lambda = \frac{1}{|D_k|}\sum_{(x,y)\in D_k} l(\hat f_k^\lambda(x), y)$
3. Compute the average cross-validation loss among all folds: $CV^\lambda = \frac{1}{K}\sum_{k=1}^{K} CV_k^\lambda$
4. Repeat steps 1–3 for other values of $\lambda$ and pick the $\lambda$ with the lowest $CV^\lambda$
5. Train on the entire training set and validate using the test set $D_{test}$.


6 Classification

6.1 Problem

In regression we wanted to find a function that maps our inputs to some real-valued output ($\hat f: \mathbb{R}^d \to \mathbb{R}$). In classification we are interested in mapping our inputs to a finite, discrete set ($\hat f: \mathbb{R}^d \to \{1, 2, \dots, k\}$ for some $k \in \mathbb{N}$).
Examples are:

• Binary classification: Is this email spam (classes: spam / not spam)? Should we evacuate people considering the current weather conditions (classes: evacuation / no evacuation)?
• Multiclass classification: Does this image show a cat, a dog, or a fish (classes: cat / dog / fish)? In which stage is this cancer (classes: 0, 1, 2, 3, 4)?

6.2 Binary Classification


1. We assign numerical values to both classes (e.g. healthy=-1, sick=1)

2. Similarly to regression, we can predict the class as ŷ = sign(fˆ(x)). So if fˆ(x) is positive or zero,
then we assume the classification to be 1. If fˆ(x) is negative, then we assume the classification to
be −1.
3. A good model predicts ŷ = y, where y is the true label

6.2.1 Exponential and Logistic loss

Our goal is to bring the generalization error (error on unseen data) down as much as possible.

Generalization error of $\hat f$: $E_{(x,y)\sim D_{test}}\left[\mathbb{1}\{y \neq \operatorname{sign}(\hat f(x))\}\right]$

Definition
$l_{0-1}$: The pointwise loss function

$$l_{0-1}(f(x), y) = \begin{cases} 0 & \text{if } y = \operatorname{sign}(f(x)) \\ 1 & \text{if } y \neq \operatorname{sign}(f(x)) \end{cases}$$

The idea in classification is to use different loss functions for training and testing. During testing we want the generalization error to be as low as possible, and the generalization error is calculated using the $l_{0-1}$ pointwise loss function. For training the model, however, we want a smooth, differentiable loss function on which we can run gradient-based optimization. But which training loss do we use? One possibility is the exponential loss.


Definition
Exponential loss: $l_{exp}(f(x), y) = e^{-y f(x)}$

Clarification
We want the sign of y · f (x) to be positive (when the sign of y and the sign of f (x) are equal, then
the sign of y · f (x) is positive). Therefore the loss function g(yf (x)) is small when yf (x) is positive
and large when yf (x) is negative. The further right along the x-axis we go, the less the loss should
become - because we want yf (x) to become positive.
Additionally, the loss function should be differentiable so that it can be used in gradient descent.
Strong convexity eliminates the problem of finding local minima or stationary points.

Other options are discussed below:

• A: This loss function points in the wrong direction. It incentivizes $yf(x)$ to be negative, which means that the signs of $y$ and $f(x)$ are different. Therefore, this loss function is not suitable.

• B: This loss function works, however it is not optimal, as it makes us optimize towards a specific $yf(x)$ value (namely the minimum of the function).

• C: This loss function is negative if $yf(x) > 0$. This means that we get a reward the higher $yf(x)$ gets. This leads to problems as displayed below:


The distance to the green class of points is maximized because the linear "reward" dominates, at the expense of mistakes in the red class. This problem does not appear in the logistic loss, because it is strictly positive.

Apart from the exponential loss we discussed, there also is a logistic loss.

Definition
Logistic loss:
llog (f (x), y) = log(1 + e−yf (x) )

Clarification
Why is there a +1 summand? If yf (x) is a large positive number, then −yf (x) is a large negative
number. This means that e−yf (x) is a small positive number and log(e−yf (x) ) is a large negative
number. But this is a problem because the training function should be non-negative, which is why
we add a +1 summand. Since e−yf (x) is a small positive number, 1 + e−yf (x) is slightly larger than 1.
This makes log(1 + e−yf (x) ) positive.


6.2.2 Maximum-Margin and Support Vector Machine

Definition
Linearly separable datasets: A dataset is considered linearly separable if you can draw a straight line (in two dimensions), a plane (in three dimensions), or a hyperplane (in more than three dimensions) that perfectly separates the different classes within the dataset.

Definition
Maximum-Margin Solution: The idea behind the maximum-margin solution is to find a decision
boundary that separates the different classes in the dataset such that the distance between the decision
boundary and the closest data point from each class is maximized. The closest points are called the
support vectors, because they "support" the location and orientation of the decision boundary (i.e.
the decision boundary is orthogonal to the support vector).


Why do we want to maximize the margin? A larger margin implies that the decision boundary is as
far away as possible from any data point. This is desirable because it makes the classifier more robust
to variations in the data (misclassification is less likely). It can be interpreted as the most "confident"
separation between classes, which should lead to better performance on unseen data.

The maximum-margin solution is defined as $w_{mm} = \operatorname{argmax}_{\|w\|_2 = 1}\left(\min_i y_i\langle w, x_i\rangle\right)$ (it maximizes the minimum distance between the decision boundary and the closest point).

Clarification
Why is the distance between the decision boundary and the training sample $(x_i, y_i)$ given by $y_i\langle w, x_i\rangle$?

• $x_i$ are the coordinates of the point and $y_i$ is the label (either $1$ or $-1$)
• $w$ is normalized and therefore has length 1 (see the constraint $\|w\|_2 = 1$ in the argmax above)
• The dot product between two vectors $\langle a, b\rangle$ is defined as $a^T b = \sum_{i=1}^{d} a_i b_i$. There also is a geometric definition, which is $\|a\|\|b\|\cos(\angle ab)$

Therefore,

$$\langle w, x_i\rangle = \|w\|\|x_i\|\cos(\angle w x_i) \qquad\text{(geometric definition of } \langle\cdot,\cdot\rangle)$$
$$= \|x_i\|\cos(\angle w x_i) \qquad (\|w\| = 1)$$

which is the length of the vector $x_i$ projected onto the direction of $w$, i.e. the distance between the decision boundary and the point $x_i$.


This means that when ⟨w, xi ⟩ is positive, then the point (xi , yi ) lies on the same side of the decision
boundary where w points to. This is good if yi is positive as well. But if yi is negative, the point
should be on the other side of the decision boundary. Therefore we multiply by yi to make sure that
each training sample xi is correctly classified according to its label.
With $w_{mm} = \operatorname{argmax}_{\|w\|_2=1}\left(\min_i y_i\langle w, x_i\rangle\right)$ we look for the normalized vector $w$ that maximizes the distance between the decision boundary and the closest point of each class.

6.3 Multiclass classification

We might want to classify data into more than just two classes.

6.3.1 One-vs-Rest Approach

The one-vs-rest approach transforms the problem of multiclass classification into multiple binary classifications. For a classification problem with $k$ classes, the one-vs-rest approach creates $k$ separate binary classifiers ($\hat f_1, \dots, \hat f_k$). Each classifier is trained to distinguish between one of the classes and all other classes combined.

Then we can predict $\hat y(x) = \operatorname{argmax}_{i=1,\dots,k}\hat f_i(x)$.


But this means that we have to run k separate optimization procedures. Is there a way to train them
simultaneously?

6.3.2 Softmax Function

The softmax function is a useful tool that allows us to perform this task. It converts a vector of raw scores
(often called logits) into a vector of probabilities (all elements are between 0 and 1 and the sum over all
elements is 1).

Definition
Softmax: Given a vector $a \in \mathbb{R}^n$ of size $n$, $\operatorname{softmax}(a)$ is defined as:

$$\operatorname{softmax}(a)_i = \frac{e^{a_i}}{\sum_{j=1}^{n} e^{a_j}}$$

Clarification
Example:

$$\operatorname{softmax}\begin{pmatrix}-1\\1\\2\end{pmatrix} = \begin{pmatrix}\frac{e^{-1}}{e^{-1}+e^{1}+e^{2}}\\ \frac{e^{1}}{e^{-1}+e^{1}+e^{2}}\\ \frac{e^{2}}{e^{-1}+e^{1}+e^{2}}\end{pmatrix} = \begin{pmatrix}0.035119\\0.259496\\0.705385\end{pmatrix}$$

• All elements are between 0 and 1
• All elements sum up to 1
• The entry for input 2 is highest, and therefore most likely
• The entry for input −1 is lowest, and therefore least likely

Softmax uses the exponential function because of its properties:
• $\forall x \in \mathbb{R}: e^x > 0$ (always positive; we don't want negative values in our output vector)
• $\forall x, y \in \mathbb{R}: x \ge y \implies e^x \ge e^y$ (monotonically increasing; higher values should lead to higher probabilities)

There also is a hardmax function, which outputs 1 for the largest element of the input vector and 0 for all other elements.


Definition
Temperature of softmax: We can introduce a constant $T$ called the temperature:

$$\operatorname{softmax}(a)_i = \frac{e^{a_i/T}}{\sum_{j=1}^{n} e^{a_j/T}}$$

The temperature parameter scales the logits before applying the softmax function. This allows us to
control the distribution of the probability output.

• T=1: The softmax function behaves normally, as the temperature does not scale the logits.
• T>1: The temperature increases, making the softmax output more uniform (i.e., the proba-
bilities become more similar). This can be useful when we want the model to be less confident
about its predictions or when we wish to encourage more exploration.
• T<1: The temperature decreases, making the softmax output more concentrated (i.e., in-
creasing the disparity between the higher and lower probabilities). This can be helpful when
we want the model to make more confident predictions. The softmax function behaves like
the hardmax function as T converges towards 0.
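A numerically stable sketch of softmax with temperature (assuming numpy; subtracting the maximum logit does not change the output):

import numpy as np

def softmax(a, T=1.0):
    # Softmax with temperature; subtracting the max is for numerical stability
    z = np.asarray(a, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax([-1, 1, 2]))           # ~ [0.0351, 0.2595, 0.7054], as in the example
print(softmax([-1, 1, 2], T=10.0))   # more uniform
print(softmax([-1, 1, 2], T=0.1))    # approaches the hardmax output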

6.3.3 Cross-entropy Loss

Our goal still is to train the k separate optimization procedures simultaneously. The most popular approach
is to run gradient descent on the cross-entropy loss. Cross-entropy loss measures the performance of the
classification model and uses softmax to do so.

Definition
Cross-entropy loss: We feed the vector $f(x) := (\hat f_1(x), \dots, \hat f_k(x)) \in \mathbb{R}^k$ (the outputs of the different binary classifiers) into the cross-entropy loss:

$$l_{CE}(f(x), y) = -\log\left(\frac{e^{f_y(x)}}{\sum_{i=1}^{k} e^{f_i(x)}}\right)$$

Note:


• fy (x) is the output of the classifier y ∈ {1, ..., k} (which is also the label). This means we
plug in the yth element of sof tmax(f (x)) into −log(...). Intuitively this means that we plug
in the probability that the model is correct into −log(...).

• The reason we use −log(...) is because it satisfies the properties a loss function should have:

– It is non-negative for values between 0 and 1 (the yth element of sof tmax(f (x)) is a
probability value between 0 and 1)
– −log(...) becomes monotonically smaller the higher the input value gets (which means
that we get less loss if the probability value becomes larger)
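A sketch of the cross-entropy loss on raw classifier scores (assuming numpy; the log-sum-exp form avoids numerical overflow):

import numpy as np

def cross_entropy_loss(scores, y):
    # l_CE(f(x), y) = -log(e^{f_y} / sum_i e^{f_i}) = -(f_y - logsumexp(f))
    scores = np.asarray(scores, dtype=float)
    log_norm = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
    return -(scores[y] - log_norm)

scores = np.array([-1.0, 1.0, 2.0])      # outputs of k = 3 classifiers
print(cross_entropy_loss(scores, y=2))   # ~0.349: high predicted probability, low loss
print(cross_entropy_loss(scores, y=0))   # ~3.349: low predicted probability, high loss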

6.3.4 Asymmetric Losses

We used the $l_{0-1}$ loss (zero-one loss) to determine how well our model performs on unseen data (the zero-one loss outputs 1 if our predicted classification is wrong, and 0 if it is correct). This loss might not be optimal in real-world applications.
Which covid diagnosis classifier should we use? We test with 32 samples.

If we want to prioritize public health, then we want to use classifier B. For prioritizing individual freedom
we choose classifier A.
Instead of using the l0−1 loss that treats all errors symmetrically, we can weight errors differently by e.g.
giving errors where y = −1 more weight (using constants cF N for "False Negative" and cF P for "False
Positive").


$$\frac{c_{FN}}{\#\{y=1\}}\sum_{(x,y):\,y=1}\mathbb{1}_{\hat y=-1} \;+\; \frac{c_{FP}}{\#\{y=-1\}}\sum_{(x,y):\,y=-1}\mathbb{1}_{\hat y=1} \;=\; c_{FN}\frac{\#FN}{\#\{y=1\}} + c_{FP}\frac{\#FP}{\#\{y=-1\}}$$

Given an $\hat f$, we can obtain a smaller loss $c_{FN}\frac{\#FN}{\#\{y=1\}} + c_{FP}\frac{\#FP}{\#\{y=-1\}}$ by changing $\tau$. $\tau$ is the threshold value that we use to decide whether our prediction $\hat y$ is $1$ or $-1$. We used to have $\tau = 0$:

$$\hat y_\tau = \operatorname{sign}(\hat f(x) - \tau) = \begin{cases} +1 & \text{if } \hat f(x) \ge \tau \\ -1 & \text{if } \hat f(x) < \tau \end{cases}$$

6.3.5 Receiver Operating Characteristic (ROC)

6.4 Trustworthy Generalization

Consider the previous example where we wanted to classify cats, dogs and ships. Oftentimes these classes can be split further into subgroups (e.g. male/female cats or dogs, albino animals, etc.).
It might be that a classifier has a low error when classifying general input data, but a high error on specific subgroups.


• Classifier A: Low overall error, but high error on minority groups


• Classifier B: Difference in error among subgroups is smaller, but overall error is higher

There are several possible notions of "fair" classifiers:

• Equalized odds: All groups have a similar error rate ($P(\hat Y = 1 \mid Y = y, S = g)$ equal for all $g \in G$, $y \in \{0, 1\}$, where $\hat Y$ is the predicted value)
• Demographic parity: Across groups, an equal percentage is predicted as 1 vs. −1 ($P(\hat Y = 1 \mid S = g)$ equal for all $g \in G$)
• Equalized opportunity: All groups should have a similar "false negative rate" ($P(\hat Y = 1 \mid Y = 1, S = g)$ equal for all $g \in G$)
• Worst-group error: The largest error among all groups is small ($\max_{g\in G}\sum_y P(\hat Y = -y \mid Y = y, S = g)$ small)
• Balanced error: The average of all group errors is small ($\frac{1}{|G|}\sum_{g\in G}\sum_y P(\hat Y = -y \mid Y = y, S = g)$)

7 Kernel Methods
7.1 Improving Polynomial Regression

7.1.1 Computational Complexity

How large must $p$ (the dimension of the feature vector below) be to express a degree-$m$ polynomial for input data $x \in \mathbb{R}^d$?

$$\Phi(x) = [1, x_{[1]}, \dots, x_{[d]}, x_{[1]}^2, \dots, x_{[d]}^2, \dots, x_{[1]}x_{[2]}, \dots, x_{[1]}x_{[d]}, \dots, x_{[1]}^2 x_{[2]}, \dots] \in \mathbb{R}^p$$

A polynomial of degree $m$ is a linear combination of monomials of degree smaller or equal to $m$, where the coefficient of at least one monomial with degree $m$ is non-zero (a monomial of degree $m$ can be $x_{[1]}^m$ but also a combination of different attributes such as $x_{[1]}^{m-1}x_{[2]}$).
Therefore, all monomials with degree smaller or equal to $m$ must be present in $\Phi(x)$ (e.g. $1, x_{[1]}, \dots, x_{[1]}^m$, but also $x_{[1]}^{m-1}x_{[2]}$, $x_{[1]}^{m-2}x_{[2]}^2$, ...). This means that $p$ (the size of $\Phi(x)$) grows exponentially as $d$ and $m$ become larger.
The key message is that we quickly obtain a "feature explosion" with large memory and computational requirements, especially when working with high-dimensional data.


7.1.2 Kernel trick

To avoid operating in this huge feature space, there is something called the kernel trick. The kernel trick
helps us to efficiently find non-linear decision boundaries (e.g. polynomial) without explicitly computing the
coordinates of the data in the feature space.

Clarification
Example: Suppose we have a dataset of points in the plane that is obviously not linearly separable, so we need to find a non-linear decision boundary.
The idea is to transform the dataset into a higher dimension in order to make it linearly separable. After the transformation into the higher dimension, the dataset is linearly separable and we can efficiently find a decision boundary.
Afterwards we can reverse the transformation to bring the decision boundary back to the original dimension.

There are many possible functions that can map the data to any number of higher dimensions such that it
becomes linearly separable.


There can be many transformations that allow the data to be linearly separated in higher dimensions, but
not all of these functions are actually "kernels" (defined soon).
We saw how higher dimensional transformations can allow us to separate data in order to make easy clas-
sifications. It seems that for finding the classifiers, we would have to perform operations with the higher
dimensional vectors in the transformed feature space (Φ(x)). But we don’t want to work with these higher
dimensional data because of additional computational cost.
The kernel trick provides a solution to this problem. The “trick” is that kernel methods represent the
data only through a set of pairwise "similarity comparisons" (a.k.a. scalar product) in the original lower
dimensional space, instead of explicitly calculating the transformations Φ(x).

Definition
Kernel: A function that takes vectors in the original space as its inputs and returns the dot product
of the vectors in the feature space.
k(a, b) = ⟨Φ(a), Φ(b)⟩
So the kernel function is some arbitrary function that accepts inputs in the original lower dimen-
sional space and returns the same result as the dot product of the transformed vectors in the higher
dimensional space, without the overhead of having to compute the coordinates of the vectors in the
high-dimensional space.

The trick is to use such a kernel function instead of calculating Φ(a) and Φ(b).

Clarification
Example: For the feature map $\Phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1 x_2, x_1^2, x_2^2)$ (feature space of 2nd-degree polynomials in $x = (x_1, x_2)^T$), $k(a, b) = (1 + a^T b)^2$ is a valid kernel ($a, b \in \mathbb{R}^2$):

$$k(a, b) = (1 + a^T b)^2 \qquad\text{(definition of } k)$$
$$= (1 + a_1 b_1 + a_2 b_2)^2 \qquad\text{(dot product)}$$
$$= 1 + 2a_1 b_1 + 2a_2 b_2 + 2a_1 b_1 a_2 b_2 + a_1^2 b_1^2 + a_2^2 b_2^2 \qquad\text{(expand)}$$
$$= \langle(1, \sqrt{2}a_1, \sqrt{2}a_2, \sqrt{2}a_1 a_2, a_1^2, a_2^2), (1, \sqrt{2}b_1, \sqrt{2}b_2, \sqrt{2}b_1 b_2, b_1^2, b_2^2)\rangle \qquad\text{(rewrite)}$$
$$= \langle\Phi(a), \Phi(b)\rangle \qquad\text{(definition of } \Phi)$$

This means that instead of computing $\Phi(a)$ and $\Phi(b)$ (which is memory-intensive because we bring the data into the higher-dimensional space), we can simply calculate $(1 + a^T b)^2$ to get the dot product of the $\Phi$-vectors.
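A quick numerical check of this identity (assuming numpy; the feature map and kernel are exactly those from the example above):

import numpy as np

def phi(x):
    # Feature map for 2nd-degree polynomials in x = (x1, x2)
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], s * x[0] * x[1], x[0] ** 2, x[1] ** 2])

def k(a, b):
    # Polynomial kernel k(a, b) = (1 + a^T b)^2
    return (1.0 + a @ b) ** 2

a, b = np.array([0.5, -1.0]), np.array([2.0, 0.3])
print(k(a, b), phi(a) @ phi(b))  # both print 2.89 (up to floating-point error)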

To calculate the training loss, we don't need the actual values of $\Phi(x)$ (where $x$ is some data sample); we only need the dot products $\langle\Phi(a), \Phi(b)\rangle$ ($a$ and $b$ are data samples).

Theorem
Mercer’s theorem: A function must satisfy the below two properties to be a kernel:


• Symmetry: $k(a, b) = k(b, a)$. Symmetry must hold because

$$k(a, b) = \langle\Phi(a), \Phi(b)\rangle = \langle\Phi(b), \Phi(a)\rangle = k(b, a)$$

(the dot product is symmetric).

• Positive semi-definiteness: The kernel matrix

$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$$

is positive semi-definite for any choice of inputs $x_1, \dots, x_n$. This means that $\forall c \in \mathbb{R}^n: c^T K c \ge 0$. Positive semi-definiteness can geometrically be interpreted as follows: we take any vector $c$ and transform it using the linear map $K$. Then we take the dot product between the transformed vector $Kc$ and the original vector $c$ to see if $Kc$ still "points in the same direction as $c$" (meaning that the angle between $Kc$ and $c$ is smaller than or equal to $90°$). If this is the case for all $c$, then we call $K$ positive semi-definite.

7.2 Other nonlinear methods

7.2.1 k-Nearest Neighbor

The idea behind the k-NN approach is to make predictions for new data points based on the majority class
(in classification) or the average (in regression) of their k nearest neighbors in the training data.
For a new data point x that we want to classify or predict its value, the algorithm finds the k closest data
points (neighbors) to x in the training data based on some distance metric (e.g. Euclidean distance or
Manhattan distance).

For classification, the algorithm assigns the most frequent class among the k nearest neighbors to the new
data point x. For regression, the algorithm calculates the average (or weighted average) of the target values
of the k nearest neighbors and assigns this value to x.
This means that there is no explicit training performed for k-NN.
The parameter k (the number of neighbors to consider) is a hyperparameter of the algorithm, and the optimal
k can be found using cross-validation. k needs to be chosen carefully because it significantly influences the
model’s behavior. The choice of distance metric (e.g. Euclidean, Manhattan, or others) also affects the
algorithm’s performance. Different distance metrics might be more suitable for different types of data and
problems.
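
The whole prediction rule fits in a few lines; below is a minimal numpy sketch (our own illustration, not library code):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3, classify=True):
        # Euclidean distance from x to every training sample
        dists = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(dists)[:k]              # indices of the k nearest neighbors
        if classify:
            values, counts = np.unique(y_train[nn], return_counts=True)
            return values[np.argmax(counts)]    # majority class
        return y_train[nn].mean()               # average target value (regression)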

k-NN...

• + ...is simple to understand and implement
• + ...does not require a training phase
• + ...makes no assumptions about the underlying distribution of the data, which means it can be
  effective for datasets where the distribution of the test samples is unknown
• - ...is computationally expensive: the algorithm needs to compute distances to all training
  samples, which can be slow and memory-intensive for large datasets
• - ...is sensitive to noise and outliers in the data, which can affect the predictions, especially
  when using a small k
• - ...suffers from the curse of dimensionality: as the number of features (dimensions) increases,
  the distance between data points becomes less meaningful, leading to decreased performance

Clarification
Example: Imagine the below data:

    [0.3  0.5  NaN  0.2  0.7]
    [0.4  0.6  0.2  0.3  0.8]
    [0.2  0.4  0.1  0.3  0.6]
    [0.1  0.3  0.1  0.2  0.5]
    [0.7  0.9  0.5  0.8  0.9]

It has missing data in the first row. We can use the k-NN approach to predict it.
1. Calculate the distance to every other data point (e.g. using Euclidean distance, k = 3),
   ignoring the missing dimension.
   • distance to 2nd row: √((0.3 − 0.4)² + (0.5 − 0.6)² + (0.2 − 0.3)² + (0.7 − 0.8)²) =
     √(0.01 + 0.01 + 0.01 + 0.01) = √0.04 = 0.2
   • distance to 3rd row: √((0.3 − 0.2)² + (0.5 − 0.4)² + (0.2 − 0.3)² + (0.7 − 0.6)²) =
     √(0.01 + 0.01 + 0.01 + 0.01) = √0.04 = 0.2
   • distance to 4th row: √((0.3 − 0.1)² + (0.5 − 0.3)² + (0.2 − 0.2)² + (0.7 − 0.5)²) =
     √(0.04 + 0.04 + 0 + 0.04) = √0.12 ≈ 0.346
   • distance to 5th row: √((0.3 − 0.7)² + (0.5 − 0.9)² + (0.2 − 0.8)² + (0.7 − 0.9)²) =
     √(0.16 + 0.16 + 0.36 + 0.04) = √0.72 ≈ 0.849

2. Identify the k nearest neighbors: rows 2, 3, 4.

3. Predict the missing value by calculating the average of the k nearest neighbors:
   (0.2 + 0.1 + 0.1)/3 ≈ 0.1333.
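
The same computation as a short numpy sketch (reproducing the numbers above; the masking logic is our own illustration):

    import numpy as np

    X = np.array([[0.3, 0.5, np.nan, 0.2, 0.7],
                  [0.4, 0.6, 0.2,    0.3, 0.8],
                  [0.2, 0.4, 0.1,    0.3, 0.6],
                  [0.1, 0.3, 0.1,    0.2, 0.5],
                  [0.7, 0.9, 0.5,    0.8, 0.9]])

    mask = ~np.isnan(X[0])                         # observed dimensions of the first row
    dists = np.linalg.norm(X[1:, mask] - X[0, mask], axis=1)
    nn = np.argsort(dists)[:3] + 1                 # k = 3 nearest rows: 2, 3, 4
    print(dists)                                   # [0.2, 0.2, 0.346, 0.849] (approx.)
    print(X[nn, 2].mean())                         # imputed value: 0.1333...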

7.2.2 Decision Trees

The idea behind decision trees is to learn a series of hierarchical if-else decision rules to predict the value of
a new data point. These if-else decision rules recursively partition the feature space into regions.

Decision trees...

• + ...are intuitive and relatively easy to interpret

• + ...make no assumptions about the underlying distribution of the data, which means they can be
  effective for datasets where the distribution of the test samples is unknown

• - ...are prone to overfitting, especially when the tree depth is large (many if-else decisions) and the
  dataset is noisy

• - ...usually rely on greedy methods to find the decision rules, which can be suboptimal as bad
  decision choices at the top nodes propagate to the leaves
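
As a brief illustration, a depth-limited tree can be fit on a made-up toy dataset with scikit-learn (assuming that library is available; the data is invented):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = np.array([[2.0, 1.0], [1.5, 2.5], [3.0, 0.5], [0.5, 3.0]])
    y = np.array([0, 1, 0, 1])

    # max_depth limits the number of nested if-else rules, which reduces overfitting
    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree))   # prints the learned hierarchy of if-else decision rules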

8 Neural Networks

8.1 Learning Features

So far, the features Φ were fixed during training. In neural networks, the features are not fixed but trained.
Coming up with the best possible features by hand might be hard. For instance: What are the best features
for classifying handwritten digits?

Is it pixels? Groups of pixels? Edges? Others?


Hand-designing features requires domain knowledge, which is not always given. The idea is to parameterize
the features as well.

Clarification
Example: It’s best to explain using a practical example. We want to predict whether an email is
spam or not.
• Inputs:
1. real number between 0 and 1 indicating the ratio of keywords related to spam emails (e.g.
"prince", "free", "winner", etc.).
2. discrete number 0 or 1 indicating whether the email was sent in out-of-office hours (e.g.
1 if email was sent between 6 pm and 7 am).
• Input layer: 2 nodes called a1 and a2 (one node per feature)
• Hidden layer: 1 hidden layer with 2 nodes
• Output layer: 1 node indicating the likelihood of email being spam

Below are the initial weights:


• First layer to second layer:

  w⁽¹⁾ = [ 0.25,  0.25;
          −0.25, −0.25]

• Second layer to third layer: w⁽²⁾ = [0.5, 0.5]
Biases:

• First layer to second layer: b⁽¹⁾ = [0; 0]

• Second layer to third layer: b⁽²⁾ = 0




Note that we have |E| weights (the number of edges in the neural network graph) and |V| − |V_input layer|
biases (the number of nodes excluding the input layer).
Forward Pass:

Calculating the activation of the first neuron in the hidden layer, considering input "input 1: a₁ = 0.1,
input 2: a₂ = 1":

a₁·w₁₁⁽¹⁾ + a₂·w₁₂⁽¹⁾ + b₁⁽¹⁾ = 0.1·0.25 + 1·0.25 + 0 = 0.275

We now want to normalize the output using an activation function, for instance to map the activation
to a value between 0 and 1.

Definition
Examples of activation functions:
• Identity: σ_id(x) = x

• Sigmoid: σ_sig(x) = 1 / (1 + e⁻ˣ)

• Tanh: σ_tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)

• Rectified Linear Unit (ReLU): σ_ReLU(x) = max(0, x)

If we use the sigmoid activation function, then we get σ_sig(0.275) = 0.56832.
To calculate the activations of the first layer, we can also use matrix-vector notation:

σ_sig(w⁽¹⁾a + b⁽¹⁾) = σ_sig([0.25, 0.25; −0.25, −0.25]·[0.1; 1] + [0; 0])
                    = σ_sig([0.275; −0.275])
                    = [0.56832; 0.43168]
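
The same forward pass as a numpy sketch, reproducing the numbers above (we also apply the sigmoid to the output node, which the example leaves open):

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    W1 = np.array([[0.25, 0.25], [-0.25, -0.25]])   # weights input -> hidden layer
    b1 = np.zeros(2)
    W2 = np.array([0.5, 0.5])                        # weights hidden -> output layer
    b2 = 0.0

    a = np.array([0.1, 1.0])                         # inputs a1 = 0.1, a2 = 1
    hidden = sigmoid(W1 @ a + b1)                    # [0.56832, 0.43168]
    output = sigmoid(W2 @ hidden + b2)               # likelihood of the email being spam
    print(hidden, output)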

Theorem
Universal Approximation Theorem: The Universal Approximation Theorem states that a neural
network with at least one hidden layer, using a non-linear activation function, can approximate
any continuous function within a compact set in Rⁿ to any desired degree of accuracy, given a
sufficient number of neurons in the hidden layer.

Several questions arise:

• Which features should be used?

• How many hidden layers should be used?

• How should the weights be initialized?

• What activation function should be used?

We try to answer these questions in the next chapters. First, let's consider another simple example to
understand the backward pass.

• x is the only node in the input layer (feature)


• v is the only node in the hidden layer
• f is the only node in the output layer

Forward pass: v is calculated as v = σ(w′·x), and f as f = w·v (in this example we don't use an activation
function in the last layer). Furthermore, we don't use any biases.
This means our neural network can be described as f(W; x) = f([w′, w]; x) = w·σ(w′·x).

Backward pass: The squared loss function is shown below (we use the factor 1/2 for simplification; it
doesn't change the location of the minimum of the function):

l(W; x, y) = (1/2)·(y − f(x))²       / def square loss
           = (1/2)·(y − w·σ(w′x))²   / def f

Next we calculate the derivative with respect to the weights W.
With respect to w:

dl/dw = (dl/df)·(df/dw)        / chain rule
      = (y − f)·(−1)·σ(w′x)
      = (f − y)·v              / def v

With respect to w′:

dl/dw′ = (dl/df)·(df/dw′)                  / chain rule
       = (dl/df)·(df/dv)·(dv/dz)·(dz/dw′)  / chain rule, z = w′x
       = (f − y)·w·σ′(w′x)·x               / σ′ is the derivative of σ
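
These two gradients translate directly into code. A minimal sketch with a sigmoid activation (the concrete numbers are made up):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def gradients(w_prime, w, x, y):
        z = w_prime * x
        v = sigmoid(z)                                # hidden activation
        f = w * v                                     # prediction (no output activation)
        dl_dw = (f - y) * v                           # dl/dw = (f - y) * v
        sigma_prime = sigmoid(z) * (1 - sigmoid(z))   # derivative of the sigmoid
        dl_dwp = (f - y) * w * sigma_prime * x        # dl/dw' = (f - y) * w * sigma'(w'x) * x
        return dl_dw, dl_dwp

    print(gradients(w_prime=0.5, w=1.0, x=2.0, y=1.0))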

Clarification
Example: Consider a small neural network with two inputs x₁ and x₂, a hidden layer with nodes v₁ and v₂, and a single output node f.

Forward pass: Assume:

• Inputs x1 = 1, x2 = 1
• ReLU activation function

• Loss function: l = (y − f )2 (squared loss)


This leads to:
• v1 = 1 · −0.9 + 1 · 0.9 = 0
• v2 = 1 · 0 + 1 · 0.9 = 0.9

• f = 0 · −0.9 + 0.9 · 0.9 = 0.81


Let W⁽²⁾ be the weights between the second layer and the output layer (W⁽²⁾ = [−0.9, 0.9]). To
perform backward propagation with these weights we calculate:


dl/dW⁽²⁾ = (dl/df)·(df/dW⁽²⁾)          / chain rule
         = (dl/df)·(d(W⁽²⁾v)/dW⁽²⁾)    / f = W⁽²⁾v (matrix multiplication)
         = (dl/df)·vᵀ                  / d(W⁽²⁾v)/dW⁽²⁾ = vᵀ
         = (d(y − f)²/df)·[0, 0.9]     / def of l and vᵀ
         = 2·(1 − 0.81)·(−1)·[0, 0.9]  / label y = 1
         = [0, −0.342]

This is the gradient of the loss function.


Next we perform gradient descent to optimize the weights W⁽²⁾:

W⁽²⁾_new = W⁽²⁾ − η·(dl/dW⁽²⁾)            / def gradient descent
         = [−0.9, 0.9] − 0.2·[0, −0.342]  / step size η = 0.2
         = [−0.9, 0.9684]

With these new weights the neural network is able to predict the true label 1 better than before.
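
The same forward pass, gradient, and update as a short numpy sketch (the hidden-layer weights are read off the example's forward pass):

    import numpy as np

    x = np.array([1.0, 1.0])
    W1 = np.array([[-0.9, 0.9], [0.0, 0.9]])   # weights input -> hidden (one row per neuron)
    W2 = np.array([-0.9, 0.9])                 # weights hidden -> output
    y, eta = 1.0, 0.2

    v = np.maximum(0, W1 @ x)                  # ReLU: v = [0, 0.9]
    f = W2 @ v                                 # f = 0.81
    grad_W2 = 2 * (y - f) * (-1) * v           # dl/dW2 = [0, -0.342]
    W2_new = W2 - eta * grad_W2                # [-0.9, 0.9684]
    print(v, f, grad_W2, W2_new)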

9 Clustering

So far we looked at supervised learning. The idea behind unsupervised learning is that the dataset does not
contain any labels; the model has to learn structure by itself based on some goal (e.g. grouping the data by
similarities, finding meaningful features, etc.).
As the name suggests, the idea behind clustering is to group data points into clusters such that similar
data points are assigned to the same cluster. Sometimes the goal is also to identify data points that don't
fit in any cluster (outliers).

• Hierarchical clustering: The idea of hierarchical clustering is to seek a hierarchy of clusters.


Hierarchical clustering is useful when the relationship among the data points is more important
than just the grouping, as it creates a tree-like diagram that shows these relationships.
• Partitioning approaches: The idea of partitioning approaches is to organize data into several
groups based on a predefined criteria. This means that a "cost" function is introduced which is
iteratively minimized until a satisfactory clustering is achieved.

9.1 k-means Clustering

The idea behind K-means clustering is to divide a given dataset into K distinct, non-overlapping clusters
based on the features present in the data. The goal is to minimize the distances between data points and
their corresponding cluster center.

1. Represent each of the k clusters by one single point µi for i ∈ {1, ..., k} (which denotes its center)

2. Assign datapoints to the closest center µi

Data is given as D = {x₁, x₂, ..., xₙ}, xⱼ ∈ Rᵈ, and each xⱼ is assigned to the closest cluster,
i.e. to argminᵢ ||xⱼ − µᵢ||₂².

The centers are picked in such a way that the sum of all squared distances is minimized. Note that the
squared distance gives more weight to points far away from their center, so outliers and noisy inputs can
strongly influence the final solution.
Finding the optimal solution is NP-hard, and the objective is non-convex. This means that there are local
minima where an optimization algorithm can get stuck in suboptimal solutions.

Definition
Lloyd’s Heuristic:

• Initialize centers µ1 , ...µk randomly or using another approach


• Assign each data point to the nearest center by calculating the distance between each point
and each center and assigning each point to the closest center
• Update each center to be the average of the data points assigned to its cluster:
  ∀i: µᵢ^new = (1/nᵢ)·Σ_{xⱼ assigned to µᵢ} xⱼ

• Repeat steps 2 and 3 until one of the below conditions is met:


– The centers no longer change between iterations
– The assignments of the datapoints no longer change between iterations
– A maximum number of iterations is reached
Runtime: O(nkd)
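
A minimal numpy sketch of Lloyd's heuristic (our own illustration; empty clusters are not handled):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
        for _ in range(max_iter):
            # assignment step: index of the nearest center for every point
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # update step: each center becomes the mean of its assigned points
            new_centers = np.array([X[assign == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centers, centers):               # convergence check
                break
            centers = new_centers
        return centers, assign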

There are a few drawbacks of Lloyd’s Heuristic:

• The algorithm is sensitive to the initial selection of the centers. Different initializations lead to dif-
ferent solutions. Therefore, suboptimal solutions might be reached (in case the algorithm optimized
towards a local minimum)

• k-means assumes that clusters are roughly spherical, which might not be optimal for some
  datasets. Therefore, it should not be applied to clusters with shapes other than spheres.

• k-means requires the number of clusters to be specified in advance, which might not be known before.

9.2 k-means++ Clustering

k-means++ is an alternative to the random initialization of the centers µ₁, ..., µₖ.

Definition
k-means++:
1. Start with a random point as the first center: µ₁ = xᵢ where i ∼ Unif(1, ..., n)
2. For each datapoint x not chosen yet, compute the distance between x and the nearest
   already-chosen center. Then choose one new data point at random as a new center, using a
   weighted probability distribution where a point x is chosen with probability proportional
   to its squared distance to the nearest center. This makes datapoints far away from the
   existing centers more likely to be chosen as new centers.
3. Repeat step 2 until k centers have been chosen.

By biasing the selection of initial centers towards points that are far apart from each other, k-means++ leads
to a wider spread of the initial centers and reduces the likelihood of converging to suboptimal solutions. It
has also been shown that the algorithm's expected loss (sum of squared distances) is at most O(log(k))
times the optimal value (k being the number of clusters).
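
A sketch of the k-means++ initialization (our own illustration), where each new center is sampled with probability proportional to the squared distance to the nearest already-chosen center:

    import numpy as np

    def kmeans_pp_init(X, k, seed=0):
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]      # first center: uniform at random
        for _ in range(k - 1):
            # squared distance of every point to its nearest chosen center
            d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
            probs = d2 / d2.sum()                # far-away points are more likely
            centers.append(X[rng.choice(len(X), p=probs)])
        return np.array(centers)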

10 Unsupervised learning: Dimensionality reduction

Definition
Dimensionality Reduction: Given a data set D = {x₁, x₂, ..., xₙ} ⊂ Rᵈ, obtain an "embedding"
(a low-dimensional representation of the same data) {z₁, z₂, ..., zₙ} ⊂ Rᵏ where k is much smaller than
d.

The goal is to find a mapping Rd → Rk which we can use to create the embeddings. We distinguish between:

• Linear dimensionality reduction (f (x) = Ax)

• Non-Linear dimensionality reduction

Clarification
Example: Let’s perform a simple dimensionality reduction from d = 2 to k = 1.

We want to transform the 2-dimensional data into 1-dimensional data (a line). We first need to make
sure that the data is centered (µ = (1/n)·Σᵢ xᵢ = 0). If it is not centered, then we need to center it
by calculating µ and subtracting it from all xᵢ.
We want to project the points xi onto the line w. Then we just need a 1-dimensional coefficient to
express every point on that line (xi ≈ zi w where w ∈ R2 represents the line and zi ∈ R1 the location
of the projection on the line). Our aim is to identify the w that best explains the data D.

In other words, we want to minimize ||zi w − xi ||22 for all points. Additionally, we enforce ||w||2 = 1 (without
loss of generality). w defines the projection line.
(w*, z*) = argmin_{w, z₁,...,zₙ : ||w||₂=1} Σᵢ₌₁ⁿ ||zᵢ·w − xᵢ||₂²

In our k = 1 case, the optimal z is given by zᵢ* = wᵀxᵢ (the orthogonal projection of xᵢ onto w). Remember
that zᵢ is the factor by which we scale the normalized w vector to end up at the projection of point xᵢ.
This means that zᵢ·w becomes (wᵀxᵢ)·w (since zᵢ = wᵀxᵢ); (wᵀxᵢ)·w are the coordinates of the projected
point. This makes it possible to optimize just over w instead of both w and z:

w* = argmin_{w : ||w||₂=1} Σᵢ₌₁ⁿ ||(wᵀxᵢ)·w − xᵢ||₂²

10.1 Principal Component Analysis

The goal is to reduce the dimensionality of the data. Below is the procedure with an example. Let’s consider
this dataset:

Mouse 1 Mouse 2 Mouse 3 Mouse 4 Mouse 5 Mouse 6


Gene 1 10 11 8 3 1 2
Gene 2 6 4 5 3 2.8 1

Let x₁ = [10; 6], x₂ = [11; 4], etc., and let n = 6 be our number of samples.

1. The first step is to center the data. This means we make sure that (1/n)·Σᵢ₌₁ⁿ xᵢ = [0; 0] holds.
   This is not the case for our example, as (1/n)·Σᵢ₌₁ⁿ xᵢ = [5.8333; 3.6333]. Therefore we subtract
   [5.8333; 3.6333] from every sample.

   Now we have shifted all data points such that their average is equal to the null vector. Shifting the
   data did not change how the data points are positioned relative to each other.

2. As we want to bring the data points to one dimension, we are looking for a line that best captures
the variance of the data points. The data points are projected to the line and the line goes through
the origin.

We want to find a line that maximizes the distances of the projected points to the origin. This is
because we want to capture as much variance as possible.

3. We calculate the covariance matrix S.

   For simplicity, let's transform the dataset into this centered matrix:

           [10   6  ]   [5.8333  3.6333]   [ 4.1666   2.3666]
           [11   4  ]   [5.8333  3.6333]   [ 5.1666   0.3666]
       X = [ 8   5  ] − [5.8333  3.6333] = [ 2.1666   1.3666]
           [ 3   3  ]   [5.8333  3.6333]   [−2.8333  −0.6333]
           [ 1   2.8]   [5.8333  3.6333]   [−4.8333  −0.8333]
           [ 2   1  ]   [5.8333  3.6333]   [−3.8333  −2.6333]

   S = [COV(G1, G1)  COV(G1, G2)]
       [COV(G2, G1)  COV(G2, G2)]                                       / def of COV-matrix

     = (1/n)·[Σᵢ (xᵢ;₁ − µ_G1)²              Σᵢ (xᵢ;₁ − µ_G1)(xᵢ;₂ − µ_G2)]
             [Σᵢ (xᵢ;₂ − µ_G2)(xᵢ;₁ − µ_G1)  Σᵢ (xᵢ;₂ − µ_G2)²]          / def of COV

     = (1/n)·[Σᵢ xᵢ;₁²       Σᵢ xᵢ;₁·xᵢ;₂]
             [Σᵢ xᵢ;₂·xᵢ;₁   Σᵢ xᵢ;₂²]                                  / µ_G1 = 0, µ_G2 = 0 (centered)

     = (1/n)·XᵀX

     = [15.805  5.105]
       [ 5.105  2.605]

S is symmetric because COV (A, B) = COV (B, A). The diagonal elements of S are the variance of
the features.

4. Now we need to calculate the normalized eigenvectors and eigenvalues of S. The results are:

   v₁ = [−0.323; 0.946], λ₁ = 0.861
   v₂ = [0.946; 0.323],  λ₂ = 17.55

   The eigenvector with the highest eigenvalue is the first principal component.

Clarification
• The eigenvalue is equivalent to the average sum of squared distances from the projected points
  to the origin ((1/n)·(Xv₂)ᵀ(Xv₂) = λ₂).
• If we consider k principal components and our data was originally in dimension d, then the
  overall loss is n·Σᵢ₌ₖ₊₁ᵈ λᵢ.
• There is also a connection to the singular value decomposition: when decomposing X into
  X = UΣVᵀ (U, V unitary), the top k principal components are the first k columns of V.
• Principal components are always orthogonal to each other, because the eigenvectors of the
  symmetric matrix S are orthogonal. Intuition: principal components are defined to capture
  the maximum variance in the data, and orthogonality ensures that each principal component
  describes a unique direction in the data.
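
The whole procedure as a numpy sketch, reproducing the numbers of the mouse example:

    import numpy as np

    X = np.array([[10, 6], [11, 4], [8, 5], [3, 3], [1, 2.8], [2, 1]], dtype=float)

    Xc = X - X.mean(axis=0)                # step 1: center the data
    S = Xc.T @ Xc / len(Xc)                # step 3: covariance matrix (1/n) X^T X
    eigvals, eigvecs = np.linalg.eigh(S)   # step 4: eigenvalues/-vectors, ascending order
    print(S)                               # [[15.805, 5.105], [5.105, 2.605]] (approx.)
    print(eigvals)                         # [0.861, 17.55] (approx.)

    w = eigvecs[:, -1]                     # first principal component (largest eigenvalue)
    z = Xc @ w                             # 1-dimensional embedding of the data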

It is important to note that PCA is not always useful when it comes to classification: the direction of
largest variance is not necessarily the direction that separates the classes best. In such cases, the second
principal component can be much better for solving the classification problem than the first.

11 Probabilistic modeling

The idea is to not find the best fitting function fˆ ∈ F , but to find the probability distribution of the unseen
data P̂ ∈ P . Thereby we assume that the training data has the same distribution as the unseen data.
Some applications of probabilistic modeling are:

• conditioned on previous sentences, generate the next word

• given artist and style, generate a new song

• find certain parameters (e.g. mean, variance) that describe nature or society

• detect atypical x (anomaly and out-of-distribution detection)

The first step is to specify a probability class P. For d = 2 (i.e. 2 parameters), this could for example be the class of Gaussian distributions N(µ, σ²).

There are two fundamental opposite paradigms in statistics:

Definition
Frequentist Approach: In the frequentist view, the true distribution is fixed.
• Probability: Probability is interpreted as the long-run frequency of events. Example: the
probability of a coin landing heads is the proportion of heads observed in a large number of
coin flips.
• Parameters: Parameters (e.g. mean or variance) are considered fixed but unknown quanti-
ties. They are not random variables.
• Methods: Maximum-likelihood estimation, confidence intervals, hypothesis testing

Definition
Bayesian Approach: In the Bayesian view, the true distribution is a random draw.
• Probability: In the Bayesian view, probability is a measure of belief or certainty about an
event. It is a matter of personal perspective and can change with new evidence.
• Parameters: Parameters are treated as random variables with their own probability distri-
butions (prior distributions).
• Methods: Maximum-a-posteriori estimation

Let’s take a look at both methods.

11.1 Maximum-likelihood estimation (Frequentist)

Let’s consider searching in a parametric probability class P . This class ideally includes the "ground truth"
probability distribution P̂. The idea behind the maximum-likelihood approach is to find the distribution
P ∈ P under which the observed data samples have the highest probability ("likelihood") of being observed
(compared to any other P ∈ P ).
The likelihood is expressed by P ((x1 , ..., xn ), Θ), where (x1 , ..., xn ) are the data samples and Θ are the pa-
rameters of the probability distribution.
For a fixed probability distribution P_Θ ∈ P: if each training sample is independent and identically
distributed ("iid") with xᵢ ∼ P_Θ, then the random dataset D = {xᵢ}ᵢ₌₁ⁿ has a joint distribution of
p(D; Θ) = p((x₁, ..., xₙ); Θ) = ∏ᵢ₌₁ⁿ p(xᵢ; Θ) (factorization because of independence).
In practice we use the log-maximum-likelihood, because it allows us to rewrite products into sums

60
Introduction to Machine Learning

(log(a · b) = log(a) + log(b)). We are allowed to take the log because taking the log is a monotonic transfor-
mation. This means it preserves the location of the maximum, meaning that the maximum of the likelihood
function and the maximum of its logarithm occur at the same parameter values.
Θ̂_MLE = argmax_Θ p(D; Θ)               / we want maximum probability
       = argmax_Θ log(p(D; Θ))          / log doesn't change optimal parameters
       = argmax_Θ log(∏ᵢ₌₁ⁿ p(xᵢ; Θ))   / assuming iid
       = argmax_Θ Σᵢ₌₁ⁿ log(p(xᵢ; Θ))   / log laws

Clarification
Example: MLE for Gaussian distribution

• Given dataset: D = {(x₁, y₁), ..., (xₙ, yₙ)}

• MLE for class label probability: p(Y = y) = #{i : yᵢ = y}/n = p̂_y
• MLE for feature distribution:
  – p(x|y) = N(x; µ̂_y, Σ̂_y)
  – µ̂_y = (1/#{i : yᵢ = y})·Σ_{i:yᵢ=y} xᵢ
  – Σ̂_y = (1/#{i : yᵢ = y})·Σ_{i:yᵢ=y} (xᵢ − µ̂_y)(xᵢ − µ̂_y)ᵀ (covariance matrix)
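
These estimators are straightforward to implement; below is a numpy sketch (our own illustration) for a labeled dataset:

    import numpy as np

    def gaussian_mle(X, y):
        # per-class MLE: label frequency, mean vector, covariance matrix
        params = {}
        for label in np.unique(y):
            Xy = X[y == label]
            p_hat = len(Xy) / len(X)             # p(Y = y) = #{Y = y} / n
            mu_hat = Xy.mean(axis=0)             # per-class mean
            diff = Xy - mu_hat
            sigma_hat = diff.T @ diff / len(Xy)  # per-class covariance (MLE, factor 1/n)
            params[label] = (p_hat, mu_hat, sigma_hat)
        return params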

Theorem
Bayes’ Theorem:
• p(a|b) = p(a, b)/p(b)

• Therefore: p(b|a) = p(a|b)·p(b)/p(a)

• The joint probability is defined as p(x, y) = p(x)·p(y|x)

11.2 Maximum-a-posteriori estimation (Bayesian)

Maximum-a-posteriori estimation (MAP) can also be used to estimate the unknown parameters of a probabil-
ity distribution. It is an extension of the maximum-likelihood estimation which incorporates prior information
about the parameters through Bayes' theorem. So if we have any prior information about the parameters,
then we can use MAP.

Θ_MAP = argmax_Θ p(Θ|D)        / we want the most probable parameters given the data
      = argmax_Θ p(D|Θ)·p(Θ)   / Bayes' theorem; p(D) does not depend on Θ

Here, p(Θ) encodes any prior knowledge or beliefs about the parameters.
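
As a concrete toy illustration (our own example, not from the course): estimating the bias θ of a coin from observed flips, with a Beta prior on θ. For a Beta(α, β) prior and a Bernoulli likelihood, the MAP estimate has the closed form (h + α − 1)/(n + α + β − 2), where h is the number of heads:

    import numpy as np

    flips = np.array([1, 1, 1, 0, 1])   # 4 heads out of 5 flips
    h, n = flips.sum(), len(flips)
    alpha, beta = 2, 2                  # Beta(2, 2) prior: mild belief in a fair coin

    theta_mle = h / n                                      # 0.8 (ignores the prior)
    theta_map = (h + alpha - 1) / (n + alpha + beta - 2)   # 5/7 ~ 0.714 (pulled towards 0.5)
    print(theta_mle, theta_map)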

11.3 Discriminative vs. Generative approach

There are two approaches:

• Discriminative: The main goal is to estimate the conditional probability P (y|x) (y is the label
and x is the input feature vector). We are only interested in finding p(y|x). One example might be:
given nutritional values (e.g. calories, fat, protein, carbohydrates, etc.), is a meal healthy or not?
→ Simple to compute p(y|x)

• Generative: The main goal is to understand how the data was generated. We aim to find the joint
probability p(x, y) (which is equal to p(x|y)p(y)). This means that we also learn the probability
distribution of x given a y, which means we can reconstruct the input features if we know a label y.
One example might be: We identified someone with a disease (e.g. flu, covid, cold, etc.) and we
want to reconstruct the symptoms (e.g. loss of taste, headache, etc.). Another example is creating
images of digits based on the input y = 1, y = 2, etc. → Simple to compute p(x|y)

Both approaches can give us the joint probability p(x, y). The difference is that discriminative models
parameterize p(y|x) and generative models parameterize p(x|y).

11.4 Gaussian Mixture Models

The idea behind Gaussian Mixture Models (GMMs) is to represent a complex distribution of data as a
combination of multiple Gaussian distributions. Each Gaussian component in the mixture represents a
cluster or subgroup within the overall population.

• A d-dimensional Gaussian distribution is defined by its mean vector µ ∈ Rd and the d×d-dimensional
covariance matrix Σ

• In a mixture model, the probability distribution of data points is modeled as a combination of


different component distributions. For GMMs, these components are Gaussian distributions.

• The different Gaussian distributions are weighted to represent the probability of data points being
  generated by that component. The weights sum up to 1 (for a GMM with k distributions:
  Σᵢ₌₁ᵏ wᵢ = 1).

This results in:

p(x) = Σᵢ₌₁ᵏ wᵢ·N(x | µᵢ, Σᵢ)

The parameters of a GMM are typically estimated using the expectation-maximization algorithm:

• wⱼ: the weight of every Gaussian component

• µⱼ: every Gaussian component has one mean vector µⱼ

• Σⱼ: every Gaussian component has one d × d covariance matrix Σⱼ

Definition
Expectation-maximization (EM) algorithm:
1. Initialization: Start with initial guesses for the parameters. If the GMM consists of k
   Gaussian distributions, then we have k parameter tuples Θ₁, ..., Θₖ where Θⱼ = (wⱼ, µⱼ, Σⱼ) and:
   • wⱼ is a scalar representing the weight of distribution j
   • µⱼ is a d × 1 vector containing the means of distribution j
   • Σⱼ is a d × d covariance matrix of distribution j

2. Expectation Step (E): For each point, calculate the probability of that point being generated
   by the different Gaussian components (calculate the "responsibilities"). This means, calculate
   the probability that point xᵢ is covered by component j:

   γᵢ,ⱼ = wⱼ·N(xᵢ | µⱼ, Σⱼ) / Σₒ₌₁ᵏ wₒ·N(xᵢ | µₒ, Σₒ)

3. Maximization Step (M): Update the parameters based on the current responsibilities:

   wⱼ^new = (1/n)·Σᵢ₌₁ⁿ γᵢ,ⱼ

   µⱼ^new = (Σᵢ₌₁ⁿ γᵢ,ⱼ·xᵢ) / (Σᵢ₌₁ⁿ γᵢ,ⱼ)

   Σⱼ^new = (Σᵢ₌₁ⁿ γᵢ,ⱼ·(xᵢ − µⱼ^new)(xᵢ − µⱼ^new)ᵀ) / (Σᵢ₌₁ⁿ γᵢ,ⱼ)

4. Iteration: Repeat steps 2 and 3 until convergence (i.e. the parameters do not change significantly
   anymore).

The initialization can happen as follows:

• Weights: uniform distribution (wⱼ = 1/k)
• Means: random initialization, e.g. with the k-means++ procedure
• Variances: initialize as spherical, according to the empirical variance in the data (the distributions
  should cover all the data)
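
One EM iteration as a compact numpy/scipy sketch (our own illustration, not course code):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, w, mu, sigma):
        n, k = len(X), len(w)
        # E-step: responsibilities gamma[i, j]
        gamma = np.column_stack([w[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                                 for j in range(k)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update weights, means and covariances
        Nj = gamma.sum(axis=0)
        w_new = Nj / n
        mu_new = (gamma.T @ X) / Nj[:, None]
        sigma_new = np.array([(gamma[:, j, None] * (X - mu_new[j])).T @ (X - mu_new[j]) / Nj[j]
                              for j in range(k)])
        return w_new, mu_new, sigma_new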
