EDAN96 2024 Last Lecture
Lectures
1. Intro to ML
2. Intro to probability and information theory
3. Basic concepts in ML
4. Decision Trees
5. DTs cont. Ensemble Methods
6. Bayesian classifiers
7. Logistic regression
8. Feedforward networks
9. Techniques in neural networks
10. Dimensionality reduction
11. Convolutional Neural Networks
12. CNNs cont. and Explainable AI
13. (Overview of different ML tasks, data)
Intro to ML/Basic concepts
Unsupervised algorithms: k-means clustering, UMAP, PCA, GMMs.
Supervised algorithms: k-NN, linear regression (predicts numerical values), logistic regression, neural networks.
Tasks: (binary) classification and regression.
Intro to ML/Basic concepts
Underfitting: the model does not explain the data; poor performance even on the training set.
Overfitting: the model fits the training data too closely (learns the noise); poor generalization.
Cross-validation (CV)
To get a good estimate, we can repeat the process several times.
K-fold cross-validation
1. Partition the data into K chunks
2. Use K − 1 for training and 1 as validation
3. Compute the performance on the validation chunk 𝒱 (e.g., RMSE) and iterate
4. Run CV for each model and select best average performance
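For concreteness, a minimal K-fold CV loop could look like the sketch below (scikit-learn; the model, data and K = 5 are illustrative choices, not from the slides):

```python
# A minimal sketch of K-fold cross-validation (illustrative model/data/K).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
rmse_per_fold = []
for train_idx, val_idx in kfold.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])        # train on K - 1 chunks
    pred = model.predict(X[val_idx])             # evaluate on the held-out chunk
    rmse_per_fold.append(np.sqrt(mean_squared_error(y[val_idx], pred)))

print("average RMSE:", np.mean(rmse_per_fold))   # compare this value across models
```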
Best practices
Split the dataset into: training, validation (used in CV) and test.
Preprocessing steps should be treated as part of the model (fit them on the training set only; see the sketch after this list).
Use cross-validation to tune the parameters of the model and to select the best model(s).
Train the final model on the full training data.
Do the final testing with your test data.
The test result should be similar to the validation result.
Deploy the model.
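A minimal sketch of the preprocessing point above, assuming scikit-learn (the scaler, the Ridge model and α are illustrative choices): putting the preprocessing in a Pipeline ensures it is fitted only on the training part of each CV fold, so nothing leaks from the validation data.

```python
# Preprocessing treated as part of the model: fitted inside each CV fold.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
```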
Summary: Intro to ML
Some early algorithms:
For classification: k-NN
For clustering: k-means
For regression: linear regression
k-Nearest Neighbors (kNN)
1. Input: points in space. Each point has a label.
2. Save the entire dataset as reference.
3. For each point p to classify: find the k closest points using some distance measure (e.g., Euclidean distance).
4. Give p the most common label among the k neighbors.
Break ties: choose k odd, or increase k.
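A minimal numpy sketch of the algorithm (the toy points and k = 3 are made up for illustration):

```python
# k-NN classification: store the data, vote among the k closest points.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)    # Euclidean distance to every point
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]  # most common label

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```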
Evaluation tools
• Confusion matrix
Wikipedia: https://en.wikipedia.org/wiki/Confusion_matrix
Evaluation tools cont.
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1-score is the harmonic mean of precision and recall:
  F1 = 2 · (precision · recall) / (precision + recall)
Wikipedia: https://en.wikipedia.org/wiki/Precision_and_recall
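A small worked sketch of these formulas (the TP/FP/FN counts are made up):

```python
# Precision, recall and F1 from confusion-matrix counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))  # approximately (0.8, 0.667, 0.727)
```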
k-means
1. Input: points in space.
2. Decide how many clusters k you want.
3. Place k centroids in the space (randomly)
4. Iteratively:
1. Assign each point to the closest centroid.
2. Recalculate the centroids as the average of the points in the
cluster.
3. Repeat until the clusters do not change anymore.
(Figure: random centroids → assign points to the closest centroid → move centroids → assign again → move centroids → stop when the clusters do not change.)
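A minimal numpy sketch of the loop (toy data and k = 2 are illustrative; the sketch assumes no cluster ends up empty):

```python
# k-means: assign points to the closest centroid, move centroids, repeat.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                           # closest centroid per point
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # clusters stopped changing
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```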
Evaluate clustering
If we have labels, we can evaluate the clusters with
homogeneity and completeness
R²-score
R² (coefficient of determination):
R² = 1 − SSres / SStot
where
SSres = Σi (yi − fi)²   (the residual sum of squares)
SStot = Σi (yi − ȳ)²   (the total sum of squares)
A value of 0 means that the model does not predict better than the mean of the target; a value of 1 means the model explains the variance perfectly. Negative values are worse than guessing the mean.
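A minimal numpy sketch of the formula (the toy targets and predictions are made up):

```python
# R2: compare the residual sum of squares to the total sum of squares.
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 1.0, 4.0, 2.0])
print(r2_score(y_true, y_true))                      # 1.0: perfect predictions
print(r2_score(y_true, np.full(4, y_true.mean())))   # 0.0: predicting the mean
```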
Summary: Loss
Loss function: the error function that is minimized for a model during training. Also called cost function.
Summary: Probability
Fundamental Rules in Probability Theory
Sum rule (marginalization): p(x) = ∫ p(x, y) dy   (1)
Product rule: p(x, y) = p(y | x) p(x)
If we consider discrete RVs x, y, the integral in (1) is replaced by a sum. This is where the name comes from.
Summary: Probability
Bayes’ theorem
p(x | y) = p(y | x) p(x) / p(y)
Posterior p(x | y): expresses what we know about x after observing y.
Statistical Independence
x and y are statistically independent iff p(x, y) = p(x) p(y).
For independent x and y, the variance is additive: 𝕍[x + y] = 𝕍[x] + 𝕍[y].
Conditional Independence
x and y are conditionally independent given z iff p(x, y | z) = p(x | z) p(y | z).
Probability Distributions: Gaussian
We often assume that probabilities
are either uniform or Gaussian,
because that makes the math
easier (and it is often a good
description of “reality”)
Summary: Entropy
Entropy: the level of uncertainty in a variable's possible outcomes (measured in bits):
H(X) = − Σx∈𝒳 p(x) log p(x), or 𝔼p[−log p(X)]
Cross-entropy:
H(P, Q) = − Σx∈𝒳 p(x) log q(x)
Entropy is used when choosing splits in Decision Trees.
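A minimal numpy sketch of both quantities, in bits (the distributions p and q are made up and assumed to have strictly positive entries):

```python
# Entropy H(X) and cross-entropy H(P, Q), with log base 2 (bits).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p, q = [0.5, 0.5], [0.9, 0.1]
print(entropy(p))           # 1.0 bit for a fair coin
print(cross_entropy(p, q))  # about 1.74 bits; larger the more q differs from p
```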
Summary: Gini impurity
Gini impurity: measures how often a randomly chosen element of a
set would be incorrectly labeled if it were labeled randomly and
independently according to the distribution of labels in the set. It
reaches its minimum (zero) when all cases in the node fall into a single
target category. It reflects the degree of disorder or impurity in a node.
G(p) = 1 − Σx∈𝒳 p(x)²
Gini impurity is another measure that can be used to calculate the split in Decision Trees.
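A minimal sketch of the formula, starting from the class counts in a node (the counts are made up):

```python
# Gini impurity of a node from its label counts.
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                 # label distribution in the node
    return 1.0 - np.sum(p ** 2)

print(gini([10, 0]))   # 0.0: pure node
print(gini([5, 5]))    # 0.5: maximally mixed binary node
```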
Summary: MLE, MAP, linear regression
p(θ | Data) = p(Data | θ) p(θ) / p(Data)
Maximum Likelihood Estimation, MLE: find the parameters that maximize the likelihood p(Data | θ).
Maximum A Posteriori Estimation, MAP: find the parameters that are most probable given the data, i.e., maximize the posterior p(θ | Data).
Summary: Linear regression
We consider the regression problem
y = f(x) + ϵ
where
x ∈ ℝD are inputs
y ∈ ℝ are observed targets
ϵ ∼ 𝒩(0, σ²) is identically distributed measurement noise
Goal: find an f that is close to the true function.
Linear regression parametrization:
y = f(x) + ϵ = x⊤θ + ϵ, where θ ∈ ℝD are the parameters we seek.
Parameter Estimation
Assume we are given a training set 𝒟 = {xi, yi}, i = 1, …, N
Inputs xi = [1 xi1 xi2 ⋯ xi,d−1] ∈ ℝD, i = 1, …, N
Corresponding observations yi ∈ ℝ, i = 1, …, N
where yi and yj are conditionally independent given xi, xj
yi = θ0 + θ1 xi1 + θ2 xi2 + ⋯ + θd−1 xi,d−1 + ϵi = xi⊤θ + ϵi
Closed-form MLE Solution
We use the loss function:
ℒ(θ) = (1 / (2σ²)) ∥y − Xθ∥²   (minimizes the sum of squared distances to the line)
Set the gradient to 0:
∂ℒ/∂θ = 0 ⇔ (a lot of matrix-derivative math) ⇔ θ* = (X⊤X)−1 X⊤y
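A minimal numpy sketch of the closed-form solution on synthetic data (the data and true parameters are made up; np.linalg.solve is used instead of forming the inverse explicitly):

```python
# MLE for linear regression via the normal equations: theta* = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])   # first column = 1 (bias term)
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=N)           # y = X theta + noise

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)   # close to [2.0, -1.0, 0.5]
```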
Summary: Regularization
The MLE solution is prone to overfitting!
One way to control overfitting is to penalize large parameter values; this technique is called regularization.
A typical example is a loss function where a regularization term is added (Ridge regression):
ℒ(θ) = ∥y − Xθ∥₂² + α ∥θ∥₂², minimized over θ.
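A minimal numpy sketch of Ridge regression using its standard closed form, θ* = (X⊤X + αI)⁻¹X⊤y (the closed form is not shown on the slide; the synthetic data and α = 1.0 are illustrative):

```python
# Ridge regression: the penalty alpha * I shrinks the parameters toward 0.
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
print(ridge_fit(X, y, alpha=1.0))   # slightly shrunk compared to the MLE solution
```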
Summary: Gaussian mixture models
Gaussian mixture models (GMMs): assume that data is generated from a mixture of k Gaussians, each with its own mean (μk) and covariance (Σk). A mixing coefficient πk determines the weight of each Gaussian:
p(x) = Σk πk 𝒩(x | μk, Σk)   (a weighted sum of the Gaussians)
Used for clustering, e.g., customer segmentation in marketing.
The parameters (μk, Σk, πk) are learned using the Expectation-Maximization (EM) algorithm (initialized e.g. with k-means):
E-step: compute the posterior probability that each data point belongs to each cluster (the responsibility).
M-step: update the parameters using the responsibilities from the E-step.
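A minimal sketch of fitting a GMM with scikit-learn's EM implementation (the two-blob data and n_components = 2 are illustrative choices):

```python
# GMM fitted with EM; initialization uses k-means as in the slide.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", init_params="kmeans")
gmm.fit(X)                        # alternates E-steps and M-steps until convergence
print(gmm.weights_)               # mixing coefficients pi_k
print(gmm.means_)                 # means mu_k
print(gmm.predict_proba(X[:3]))   # E-step responsibilities for the first points
```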
Logistic regression
Logistic regression: models the probability of a discrete outcome from continuous input variables. The sigmoid function outputs a value between 0 and 1.
Logistic regression is a special case of an NN: no hidden nodes, one output node with a sigmoid activation function.
Gradient descent update: wk+1 = wk − αk ∇f(wk)
The gradients are calculated for each layer backwards: backpropagation.
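A minimal numpy sketch of logistic regression trained with this gradient descent update (the toy data, learning rate and iteration count are illustrative choices):

```python
# Logistic regression = one sigmoid output unit, trained by gradient descent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b, alpha = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)              # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)     # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= alpha * grad_w                 # w_{k+1} = w_k - alpha * gradient
    b -= alpha * grad_b

print((sigmoid(X @ w + b) > 0.5).astype(int)[:5])   # predictions for class-0 points
```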
Material to understand gradient descent and backpropagation
Beautiful videos by Grant Sanderson (3blue1brown)
But what is a neural network?
Gradient Descent
Backpropagation
Backpropagation calculus
Do you want me to go through the backpropagation?
The matrices for updating the weights
(Figure: a small two-layer network with input x, weights and biases W1, B1 and W2, B2, hidden activations a(1), output values a(2) = (0.1, 0.66) and true values (0, 1).)
We start with the last layer.
Backpropagation
(Same network figure as above.)
First run a forward pass through the network to get all current values.
The gradients are then calculated backwards using the chain rule:
x → z(1) → a(1) → z(2) → a(2) → C
Loss for 1 training example, with a(L) = σ(z(L)):
C0 = (a(L) − y)² = (0.66 − 1)²
For the output node a1(2):
dC/dz1(2) = dC/da1(2) · da1(2)/dz1(2) = 2(a1(2) − y1) · σ′(z1(2))
We write Δzj(l) for dC/dzj(l); the quantity above is Δz1(2).
Backpropagation
(Same network figure.)
x → z(1) → a(1) → z(2) → a(2) → C
zj(2) = Σi wji ai(1) + bj(2)
dC/dwji(l) = dC/dzj(l) · dzj(l)/dwji(l) = Δzj(l) · ai(l−1)
ΔB(l) = Δz(l)
Backpropagation
(Same network figure.)
x → z(1) → a(1) → z(2) → a(2) → C
zj(2) = Σi wji ai(1) + bj(2)
dC/dwji(l) = dC/dzj(l) · dzj(l)/dwji(l) = Δzj(l) · ai(l−1), i.e., in matrix form ΔW(l) = Δz(l) · (a(l−1))T
Backpropagation
(Same network figure.)
x → z(1) → a(1) → z(2) → a(2) → C
Δz(l−1) = ((W(l))T · Δz(l)) * σ′(z(l−1))
Backpropagation
(Same network figure.)
x → z(1) → a(1) → z(2) → a(2) → C
Δz(l−1) = ((W(l))T · Δz(l)) * σ′(z(l−1))
To see where (W(l))T comes from, we look at dC/da1(1):
Backpropagation
(Same network figure.)
The chain rule (a1(1) affects C through both z1(2) and z2(2)):
dC/da1(1) = dC/dz1(2) · dz1(2)/da1(1) + dC/dz2(2) · dz2(2)/da1(1) = w11(2) · Δz1(2) + w21(2) · Δz2(2)
Backpropagation
(Same network figure.)
dC/da1(1) = w11(2) · Δz1(2) + w21(2) · Δz2(2)
Backpropagation
(Same network figure.)
dC/da1(1) = w11(2) · Δz1(2) + w21(2) · Δz2(2)
For the whole layer, in matrix form:
(Δa1(1), Δa2(1), Δa3(1))T = [w11 w21; w12 w22; w13 w23](2) · (Δz1(2), Δz2(2))T, i.e., Δa(1) = (W(2))T · Δz(2)
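A minimal numpy sketch that applies exactly these update equations to a small network with three hidden and two sigmoid output units, using the squared-error loss (the input, the initial weights and the learning rate are made up; the slides only specify the outputs 0.1, 0.66 and the true values 0, 1):

```python
# One forward pass and one backpropagation step with the equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=2)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)    # layer 1: 2 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)    # layer 2: 3 hidden -> 2 outputs
y = np.array([0.0, 1.0])                         # true values

# Forward pass: get all current values.
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
C = np.sum((a2 - y) ** 2)

# Backward pass, following the slide equations (sigma'(z) = a * (1 - a)).
dz2 = 2 * (a2 - y) * a2 * (1 - a2)       # Delta z(2) = dC/da(2) * sigma'(z(2))
dW2 = np.outer(dz2, a1)                  # Delta W(2) = Delta z(2) (a(1))^T
db2 = dz2                                # Delta B(2) = Delta z(2)
dz1 = (W2.T @ dz2) * a1 * (1 - a1)       # Delta z(1) = ((W(2))^T Delta z(2)) * sigma'(z(1))
dW1 = np.outer(dz1, x)
db1 = dz1

alpha = 0.5                              # gradient descent step
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1
```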
Comments about backpropagation
Fully connected layers: O(n²) weights between each layer. Each update depends on the dataset size and the number of weights.
Special-purpose layers for different tasks reduce computation and can still capture the most important patterns.
CNNs and transformers have the advantage that they can be parallelized!
Convolutional Neural Networks
Convolutional neural networks (CNNs): Currently the best performing networks for
analyzing images.
Deep neural networks with (at least one) convolutional layer, pooling layers and fully connected linear layers.
Example: a black-and-white image with 100 x 100 pixels = 10 000 inputs.
Fully connected layers have many weights; with many layers it quickly becomes computationally infeasible.
Idea 1: pixels close by are more relevant to each other than pixels far away.
Idea 2: similar features, for example edges, are relevant all over the image, so we can reuse the parameters that detect them.
We can learn small filters that only look at a small region, e.g., 5 x 5 pixels, use many parallel filters and many layers, and still have few parameters to train.
Convolutions
Mathematical convolution: apply another function (a kernel) around the current values and add up the values.
E.g., adding Gaussian blur to an image: the kernel is a Gaussian. For each pixel, give the surrounding pixels weights proportional to a Gaussian and add them up to create a blurry image. That is, you blend the values.
See Goodfellow: Chapter 9
Convolution Layer: filters
Each neuron in a convolutional layer looks at a small region (receptive field) in the input/previous layer and applies the kernel (the filter) to that region.
Either we pad the border or the next layer is smaller!
We learn the weights of the kernel. The same kernel is used for the whole input.
Conv-layer
Conv-layer: stride
Conv-layer: dilation (gaps in kernel)
Conv-layer: padding
Conv-layer: Multiple channels
Other types of layers
Pooling: selects e.g. the maximum value from a window (max pooling).
Dropout: used during training to randomly set a few neurons to 0. Reduces overfitting.
BatchNorm: normalizes the inputs to a layer over the batch.
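A minimal PyTorch sketch of these layer types (PyTorch and all the sizes are illustrative choices, not prescribed by the slides):

```python
# Conv + BatchNorm + ReLU + MaxPool + Dropout on a batch of 100 x 100 images.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5,
              stride=1, padding=2, dilation=1),   # learned 5 x 5 filters, padded border
    nn.BatchNorm2d(16),                           # normalizes activations over the batch
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                  # keeps the max value in each 2 x 2 window
    nn.Dropout(p=0.1),                            # randomly zeroes activations in training
)

x = torch.randn(8, 3, 100, 100)   # batch of 8 RGB images, 100 x 100 pixels
print(block(x).shape)             # torch.Size([8, 16, 50, 50])
```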
Summary: PCA and UMAP
Uniform Manifold Approximation and Projection, UMAP: used for visualization, non-linear. Captures both local and global structure.
Principal Component Analysis, PCA: preprocessing step, noise reduction, linear. Can be used for dimensionality reduction to speed up training of other models.
Summary: UMAP
UMAP: Uniform Manifold Approximation and Projection. Focuses on
preserving both local and global structure.
1. Construct a high-dimensional graph:
A. compute pair-wise distances.
B. Convert distances to probabilities by defining a local neighborhood for
each point
C. Combine local neighborhoods to a global graph called “fuzzy simplicial
complex”
2. Optimize low-dimensional representation:
A. Initialize a low-dimensional embedding: place the points “randomly”
B. Minimize the difference between the high- and low-dimensional representations using cross-entropy.
C. Adjust the positions of the points in the low-dimensional representation using stochastic gradient descent.
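A minimal usage sketch, assuming the third-party umap-learn package is available (the data and the parameter values are illustrative):

```python
# UMAP embedding to 2 dimensions for visualization.
import numpy as np
import umap  # pip install umap-learn

X = np.random.randn(500, 50)                          # 500 points in 50 dimensions
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
embedding = reducer.fit_transform(X)                  # graph construction + SGD optimization
print(embedding.shape)                                # (500, 2)
```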
PCA: Overview
Find vectors (summary indices) that capture the important
information in the data (principal components).
Keep statistical properties such as variance.
PCA finds lines, planes and hyper-planes in the D-
dimensional space that approximate the data as well as
possible in the least squares sense.
PCA finds the vectors that minimize the reconstruction error, which is equivalent to finding the vectors that maximize the variance.
Summary: PCA: algorithm
1. Compute the mean feature vector μ = (1/N) Σi xi
2. Compute the covariance matrix S = (1/N) Σi (xi − μ)(xi − μ)⊤
3. Compute eigenvalues λi and eigenvectors wi of S using SVD.
4. Arrange the eigenvalues in descending order.
5. Pick the first M.
6. Pick the eigenvectors corresponding to the selected eigenvalues.
7. Arrange the eigenvectors in a matrix W = [w1, …, wM].
8. Extract low-dimensional feature vectors: zi = W⊤ xi
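A minimal numpy sketch of these steps (random data and M = 2 are illustrative; step 3 uses np.linalg.eigh on the covariance matrix instead of an SVD, and the data are centered before projection):

```python
# PCA: eigen-decompose the covariance matrix, project onto the top-M eigenvectors.
import numpy as np

def pca(X, M):
    mu = X.mean(axis=0)                      # 1. mean feature vector
    Xc = X - mu                              #    center the data
    S = (Xc.T @ Xc) / len(X)                 # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # 3. eigenvalues/eigenvectors (S is symmetric)
    order = np.argsort(eigvals)[::-1]        # 4. descending order
    W = eigvecs[:, order[:M]]                # 5-7. first M eigenvectors as columns of W
    Z = Xc @ W                               # 8. low-dimensional feature vectors
    return Z, W, eigvals[order]

X = np.random.default_rng(0).normal(size=(200, 10))
Z, W, eigvals = pca(X, M=2)
print(Z.shape)   # (200, 2)
```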
The exam:
Kårhusets Gasquesal.
Bring calculator.
Approximately 60 points, 30 for passing grade.
1 point per important “point in the answer”.
E.g., Why does PCA need zero mean? (2 pt)
Possible answer: PCA finds the directions that maximize the variance of the data (1 pt), so if the mean is not 0, the directions will be influenced by the offset of the mean and not by the spread of the data (1 pt).
The exam:
Some examples from previous exams (fall 2023): first
question about general concepts.
2.3 Cross-entropy for 2 training examples.
The exam:
Look at the coding problem 8.1; something similar may come.
Look at the study questions. I will add more for CNNs and
data.
Course evaluation
🎄🎁🤶Happy Holidays🎄🎁🤶