EDAN96 2024 Last Lecture
Lectures
1. Intro to ML
2. Intro to probability and information theory
3. Basic concepts in ML
4. Decision Trees
5. DTs cont. Ensemble Methods
6. Bayesian classifiers
7. Logistic regression
8. Feedforward networks
9. Techniques in neural networks
10. Dimensionality reduction
11. Convolutional Neural Networks
12. CNNs cont. and Explainable AI
13. (Overview of different ML tasks, data)
Intro to ML/Basic concepts
Unsupervised algorithms: k-means clustering, UMAP, PCA, GMMs.
Supervised algorithms: k-NN, linear regression (predicts numerical values), logistic regression, neural networks.
Tasks: (binary) classification and regression.
Intro to ML/Basic concepts
Underfitting: the model does not explain the data; poor performance even on the training set.
Overfitting: the model fits the training data too closely (learns the noise); poor generalization.
Cross-validation (CV)
To get a good estimate, we can repeat the process several times.
K-fold cross-validation
1. Partition the data into K chunks
2. Use K − 1 for training and 1 as validation
3. Compute the performance on the validation chunk 𝒱 (e.g., RMSE) and iterate
4. Run CV for each model and select best average performance
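For concreteness, a minimal K-fold CV loop could look like the sketch below (scikit-learn; the model, data and K = 5 are illustrative choices, not from the slides):

```python
# A minimal sketch of K-fold cross-validation (illustrative model/data/K).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
rmse_per_fold = []
for train_idx, val_idx in kfold.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])        # train on K - 1 chunks
    pred = model.predict(X[val_idx])             # evaluate on the held-out chunk
    rmse_per_fold.append(np.sqrt(mean_squared_error(y[val_idx], pred)))

print("average RMSE:", np.mean(rmse_per_fold))   # compare this value across models
```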
Best practices
Split the dataset into: training, validation (used in CV) and test.
Preprocessing steps should be treated as part of the model (fit them on the training set only; see the sketch after this list).
Use cross-validation to tune the parameters of the model and to select the best model(s).
Train the final model on the full training data.
Do the final testing with your test data.
The test result should be similar to the validation result.
Deploy the model.
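A minimal sketch of the preprocessing point above, assuming scikit-learn (the scaler, the Ridge model and α are illustrative choices): putting the preprocessing in a Pipeline ensures it is fitted only on the training part of each CV fold, so nothing leaks from the validation data.

```python
# Preprocessing treated as part of the model: fitted inside each CV fold.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
```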
Summary: Intro to ML
Some early algorithms:
For classification: k-NN
For clustering: k-means
For regression: linear regression
k-Nearest Neighbors (kNN)
1. Input: points in space. Each point has a label.
2. Save the entire dataset as reference.
3. For each point p to classify: find the k closest points using some distance measure (e.g., Euclidean distance).
4. Give p the most common label among the k neighbors.
Break ties: choose k odd, or increase k.
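A minimal numpy sketch of the algorithm (the toy points and k = 3 are made up for illustration):

```python
# k-NN classification: store the data, vote among the k closest points.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)    # Euclidean distance to every point
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]  # most common label

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```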
Evaluation tools
• Confusion matrix
Wikipedia: https://en.wikipedia.org/wiki/Confusion_matrix
Evaluation tools cont.
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1-score is the harmonic mean of precision and recall:
  F1 = 2 · (precision · recall) / (precision + recall)
Wikipedia: https://en.wikipedia.org/wiki/Precision_and_recall
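A small worked sketch of these formulas (the TP/FP/FN counts are made up):

```python
# Precision, recall and F1 from confusion-matrix counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))  # approximately (0.8, 0.667, 0.727)
```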
k-means
1. Input: points in space.
2. Decide how many clusters k you want.
3. Place k centroids in the space (randomly)
4. Iteratively:
1. Assign each point to the closest centroid.
2. Recalculate the centroids as the average of the points in the
cluster.
3. Repeat until the clusters do not change anymore.
(Figure: random centroids → assign points to the closest centroid → move centroids → assign again → move centroids → stop when the clusters do not change.)
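A minimal numpy sketch of the loop (toy data and k = 2 are illustrative; the sketch assumes no cluster ends up empty):

```python
# k-means: assign points to the closest centroid, move centroids, repeat.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                           # closest centroid per point
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # clusters stopped changing
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```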
Evaluate clustering
If we have labels, we can evaluate the clusters with
homogeneity and completeness
R²-score
R² (coefficient of determination):
R² = 1 − SSres / SStot
where
SSres = Σi (yi − fi)²   (the residual sum of squares)
SStot = Σi (yi − ȳ)²   (the total sum of squares)
A value of 0 means that the model does not predict better than the mean of the target; a value of 1 means the model explains the variance perfectly. Negative values are worse than guessing the mean.
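A minimal numpy sketch of the formula (the toy targets and predictions are made up):

```python
# R2: compare the residual sum of squares to the total sum of squares.
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 1.0, 4.0, 2.0])
print(r2_score(y_true, y_true))                      # 1.0: perfect predictions
print(r2_score(y_true, np.full(4, y_true.mean())))   # 0.0: predicting the mean
```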
Summary: Loss
Loss function: the error function that is minimized for a model during training. Also called cost function.
Summary: Probability
Fundamental Rules in Probability Theory
Sum rule (marginalization): p(x) = ∫ p(x, y) dy   (1)
Product rule: p(x, y) = p(y | x) p(x)
If we consider discrete RVs x, y, the integral in (1) is replaced by a sum. This is where the name comes from.
Summary: Probability
Bayes’ theorem
p(x | y) = p(y | x) p(x) / p(y)
Posterior p(x | y): expresses what we know about x after observing y.
Statistical Independence
x and y are statistically independent iff p(x, y) = p(x) p(y).
For independent x and y, the variance is additive: 𝕍[x + y] = 𝕍[x] + 𝕍[y].
Conditional Independence
x and y are conditionally independent given z iff p(x, y | z) = p(x | z) p(y | z).
Probability Distributions: Gaussian
We often assume that probabilities
are either uniform or Gaussian,
because that makes the math
easier (and it is often a good
description of “reality”)
Summary: Entropy
Entropy: the level of uncertainty in a variable's possible outcomes (measured in bits):
H(X) = − Σx∈𝒳 p(x) log p(x), or 𝔼p[−log p(X)]
Cross-entropy:
H(P, Q) = − Σx∈𝒳 p(x) log q(x)
Entropy is used when choosing splits in Decision Trees.
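A minimal numpy sketch of both quantities, in bits (the distributions p and q are made up and assumed to have strictly positive entries):

```python
# Entropy H(X) and cross-entropy H(P, Q), with log base 2 (bits).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p, q = [0.5, 0.5], [0.9, 0.1]
print(entropy(p))           # 1.0 bit for a fair coin
print(cross_entropy(p, q))  # about 1.74 bits; larger the more q differs from p
```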
Summary: Gini impurity
Gini impurity: measures how often a randomly chosen element of a
set would be incorrectly labeled if it were labeled randomly and
independently according to the distribution of labels in the set. It
reaches its minimum (zero) when all cases in the node fall into a single
target category. It reflects the degree of disorder or impurity in a node.
G(p) = 1 − Σx∈𝒳 p(x)²
Gini impurity is another measure that can be used to calculate the split in Decision Trees.
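A minimal sketch of the formula, starting from the class counts in a node (the counts are made up):

```python
# Gini impurity of a node from its label counts.
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                 # label distribution in the node
    return 1.0 - np.sum(p ** 2)

print(gini([10, 0]))   # 0.0: pure node
print(gini([5, 5]))    # 0.5: maximally mixed binary node
```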
Summary: MLE, MAP, linear regression
p(θ | Data) = p(Data | θ) p(θ) / p(Data)
Maximum Likelihood Estimation, MLE: find the parameters that maximize the likelihood p(Data | θ).
Maximum A Posteriori Estimation, MAP: find the parameters that are most probable given the data, i.e., maximize the posterior p(θ | Data).
Summary: Linear regression
We consider the regression problem
y = f(x) + ϵ
where
x ∈ ℝD are inputs
y ∈ ℝ are observed targets
ϵ ∼ 𝒩(0, σ²) is identically distributed measurement noise
Goal: find an f that is close to the true function.
Linear regression parametrization:
y = f(x) + ϵ = x⊤θ + ϵ, where θ ∈ ℝD are the parameters we seek.
Parameter Estimation
Assume we are given a training set 𝒟 = {xi, yi}, i = 1, …, N
Inputs xi = [1 xi1 xi2 ⋯ xi,d−1] ∈ ℝD, i = 1, …, N
Corresponding observations yi ∈ ℝ, i = 1, …, N
where yi and yj are conditionally independent given xi, xj
yi = θ0 + θ1 xi1 + θ2 xi2 + ⋯ + θd−1 xi,d−1 + ϵi = xi⊤θ + ϵi
Closed-form MLE Solution
We use the loss function:
ℒ(θ) = (1 / (2σ²)) ∥y − Xθ∥²   (minimizes the sum of squared distances to the line)
Set the gradient to 0:
∂ℒ/∂θ = 0 ⇔ (a lot of matrix-derivative math) ⇔ θ* = (X⊤X)−1 X⊤y
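A minimal numpy sketch of the closed-form solution on synthetic data (the data and true parameters are made up; np.linalg.solve is used instead of forming the inverse explicitly):

```python
# MLE for linear regression via the normal equations: theta* = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])   # first column = 1 (bias term)
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=N)           # y = X theta + noise

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)   # close to [2.0, -1.0, 0.5]
```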
Summary: Regularization
The MLE solution is prone to overfitting!
One way to control overfitting is to penalize large parameter values; this technique is called regularization.
A typical example is a loss function where a regularization term is added (Ridge regression):
ℒ(θ) = ∥y − Xθ∥₂² + α ∥θ∥₂², minimized over θ.
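A minimal numpy sketch of Ridge regression using its standard closed form, θ* = (X⊤X + αI)⁻¹X⊤y (the closed form is not shown on the slide; the synthetic data and α = 1.0 are illustrative):

```python
# Ridge regression: the penalty alpha * I shrinks the parameters toward 0.
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
print(ridge_fit(X, y, alpha=1.0))   # slightly shrunk compared to the MLE solution
```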
Summary: Gaussian mixture models
Gaussian mixture models (GMMs): assume that data is generated from a mixture of k Gaussians, each with its own mean (μk) and covariance (Σk). A mixing coefficient πk determines the weight of each Gaussian:
p(x) = Σk πk 𝒩(x | μk, Σk)   (a weighted sum of the Gaussians)
Used for clustering, e.g., customer segmentation in marketing.
The parameters (μk, Σk, πk) are learned using the Expectation-Maximization (EM) algorithm (initialized e.g. with k-means):
E-step: compute the posterior probability that each data point belongs to each cluster (the responsibility).
M-step: update the parameters using the responsibilities from the E-step.
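A minimal sketch of fitting a GMM with scikit-learn's EM implementation (the two-blob data and n_components = 2 are illustrative choices):

```python
# GMM fitted with EM; initialization uses k-means as in the slide.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", init_params="kmeans")
gmm.fit(X)                        # alternates E-steps and M-steps until convergence
print(gmm.weights_)               # mixing coefficients pi_k
print(gmm.means_)                 # means mu_k
print(gmm.predict_proba(X[:3]))   # E-step responsibilities for the first points
```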
Logistic regression
Logistic regression: models the probability of a discrete outcome from continuous input variables. The sigmoid function outputs a value between 0 and 1.
Logistic regression is a special case of an NN: no hidden nodes, one output node with a sigmoid activation function.
Gradient descent update: wk+1 = wk − αk ∇f(wk)
The gradients are calculated for each layer backwards: backpropagation.
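A minimal numpy sketch of logistic regression trained with this gradient descent update (the toy data, learning rate and iteration count are illustrative choices):

```python
# Logistic regression = one sigmoid output unit, trained by gradient descent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b, alpha = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)              # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)     # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= alpha * grad_w                 # w_{k+1} = w_k - alpha * gradient
    b -= alpha * grad_b

print((sigmoid(X @ w + b) > 0.5).astype(int)[:5])   # predictions for class-0 points
```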
Material to understand gradient descent and backpropagation
Beautiful videos by Grant Sanderson (3blue1brown)
But what is a neural network?
Gradient Descent
Backpropagation
Backpropagation calculus
Do you want me to go through the backpropagation?
The matrices for updating the weights
(Figure: a small two-layer network with input x, weights and biases W1, B1 and W2, B2, hidden activations a(1), output values a(2) = (0.1, 0.66) and true values (0, 1).)
We start with the last layer.
Backpropagation
(Same network figure as above.)
First run a forward pass through the network to get all current values.
The gradients are then calculated backwards using the chain rule:
x → z(1) → a(1) → z(2) → a(2) → C
Loss for 1 training example, with a(L) = σ(z(L)):
C0 = (a(L) − y)² = (0.66 − 1)²
For the output node a1(2):
dC/dz1(2) = dC/da1(2) · da1(2)/dz1(2) = 2(a1(2) − y1) · σ′(z1(2))
We write Δzj(l) for dC/dzj(l); the quantity above is Δz1(2).
Backpropagation
(Same network figure.)
x → z(1) → a(1) → z(2) → a(2) → C
zj(2) = Σi wji ai(1) + bj(2)
dC/dwji(l) = dC/dzj(l) · dzj(l)/dwji(l) = Δzj(l) · ai(l−1)
ΔB(l) = Δz(l)
Backpropagation
(Same network figure.)
x → z(1) → a(1) → z(2) → a(2) → C
zj(2) = Σi wji ai(1) + bj(2)
dC/dwji(l) = dC/dzj(l) · dzj(l)/dwji(l) = Δzj(l) · ai(l−1), i.e., in matrix form ΔW(l) = Δz(l) · (a(l−1))T
Backpropagation
(Same network figure.)
x → z(1) → a(1) → z(2) → a(2) → C
Δz(l−1) = ((W(l))T · Δz(l)) * σ′(z(l−1))
Backpropagation
(Same network figure.)
x → z(1) → a(1) → z(2) → a(2) → C
Δz(l−1) = ((W(l))T · Δz(l)) * σ′(z(l−1))
To see where (W(l))T comes from, we look at dC/da1(1):
Backpropagation
(Same network figure.)
The chain rule (a1(1) affects C through both z1(2) and z2(2)):
dC/da1(1) = dC/dz1(2) · dz1(2)/da1(1) + dC/dz2(2) · dz2(2)/da1(1) = w11(2) · Δz1(2) + w21(2) · Δz2(2)
Backpropagation
(Same network figure.)
dC/da1(1) = w11(2) · Δz1(2) + w21(2) · Δz2(2)
Backpropagation
(Same network figure.)
dC/da1(1) = w11(2) · Δz1(2) + w21(2) · Δz2(2)
For the whole layer, in matrix form:
(Δa1(1), Δa2(1), Δa3(1))T = [w11 w21; w12 w22; w13 w23](2) · (Δz1(2), Δz2(2))T, i.e., Δa(1) = (W(2))T · Δz(2)
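A minimal numpy sketch that applies exactly these update equations to a small network with three hidden and two sigmoid output units, using the squared-error loss (the input, the initial weights and the learning rate are made up; the slides only specify the outputs 0.1, 0.66 and the true values 0, 1):

```python
# One forward pass and one backpropagation step with the equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=2)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)    # layer 1: 2 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)    # layer 2: 3 hidden -> 2 outputs
y = np.array([0.0, 1.0])                         # true values

# Forward pass: get all current values.
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
C = np.sum((a2 - y) ** 2)

# Backward pass, following the slide equations (sigma'(z) = a * (1 - a)).
dz2 = 2 * (a2 - y) * a2 * (1 - a2)       # Delta z(2) = dC/da(2) * sigma'(z(2))
dW2 = np.outer(dz2, a1)                  # Delta W(2) = Delta z(2) (a(1))^T
db2 = dz2                                # Delta B(2) = Delta z(2)
dz1 = (W2.T @ dz2) * a1 * (1 - a1)       # Delta z(1) = ((W(2))^T Delta z(2)) * sigma'(z(1))
dW1 = np.outer(dz1, x)
db1 = dz1

alpha = 0.5                              # gradient descent step
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1
```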
Comments about backpropagation
Fully connected layers: O(n²) weights between each layer. Each update depends on the dataset size and the number of weights.
Special-purpose layers for different tasks reduce computation and can still capture the most important patterns.
CNNs and transformers have the advantage that they can be parallelized!
Convolutional Neural Networks
Convolutional neural networks (CNNs): Currently the best performing networks for
analyzing images.
Deep neural networks with (at least one) convolutional layer, pooling layers and fully connected linear layers.
Example: a black-and-white image with 100 x 100 pixels = 10 000 inputs.
Fully connected layers have many weights; with many layers it quickly becomes computationally infeasible.
Idea 1: pixels close by are more relevant to each other than pixels far away.
Idea 2: similar features, for example edges, are relevant all over the image, so we can reuse the parameters that detect them.
We can learn small filters that only look at a small region, e.g., 5 x 5 pixels, use many parallel filters and many layers, and still have few parameters to train.
Convolutions
Mathematical convolution: apply another function (a kernel) around the current values and add up the values.
E.g., adding Gaussian blur to an image: the kernel is a Gaussian. For each pixel, give the surrounding pixels weights proportional to a Gaussian and add them up to create a blurry image. That is, you blend the values.
See Goodfellow: Chapter 9
Convolution Layer: filters
Each neuron in a convolutional layer looks at a small region (receptive field) in the input/previous layer and applies the kernel (the filter) to that region.
Either we pad the border or the next layer is smaller!
We learn the weights of the kernel. The same kernel is used for the whole input.
Conv-layer
Conv-layer: stride
Conv-layer: dilation (gaps in kernel)
Conv-layer: padding
Conv-layer: Multiple channels
Other types of layers
Pooling: selects e.g. the maximum value from a window (max pooling).
Dropout: used during training to randomly set a few neurons to 0. Reduces overfitting.
BatchNorm: normalizes the inputs to a layer over the batch.
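A minimal PyTorch sketch of these layer types (PyTorch and all the sizes are illustrative choices, not prescribed by the slides):

```python
# Conv + BatchNorm + ReLU + MaxPool + Dropout on a batch of 100 x 100 images.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5,
              stride=1, padding=2, dilation=1),   # learned 5 x 5 filters, padded border
    nn.BatchNorm2d(16),                           # normalizes activations over the batch
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                  # keeps the max value in each 2 x 2 window
    nn.Dropout(p=0.1),                            # randomly zeroes activations in training
)

x = torch.randn(8, 3, 100, 100)   # batch of 8 RGB images, 100 x 100 pixels
print(block(x).shape)             # torch.Size([8, 16, 50, 50])
```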
Summary: PCA and UMAP
Uniform Manifold Approximation and Projection, UMAP: used for visualization, non-linear. Captures both local and global structure.
Principal Component Analysis, PCA: preprocessing step, noise reduction, linear. Can be used for dimensionality reduction to speed up training of other models.
Summary: UMAP
UMAP: Uniform Manifold Approximation and Projection. Focuses on
preserving both local and global structure.
1. Construct a high-dimensional graph:
A. compute pair-wise distances.
B. Convert distances to probabilities by defining a local neighborhood for
each point
C. Combine local neighborhoods to a global graph called “fuzzy simplicial
complex”
2. Optimize low-dimensional representation:
A. Initialize a low-dimensional embedding: place the points “randomly”
B. Minimize the difference between the high- and low-dimensional representations using cross-entropy.
C. Adjust the positions of the points in the low-dimensional representation using stochastic gradient descent.
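A minimal usage sketch, assuming the third-party umap-learn package is available (the data and the parameter values are illustrative):

```python
# UMAP embedding to 2 dimensions for visualization.
import numpy as np
import umap  # pip install umap-learn

X = np.random.randn(500, 50)                          # 500 points in 50 dimensions
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
embedding = reducer.fit_transform(X)                  # graph construction + SGD optimization
print(embedding.shape)                                # (500, 2)
```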
PCA: Overview
Find vectors (summary indices) that capture the important
information in the data (principal components).
Keep statistical properties such as variance.
PCA finds lines, planes and hyper-planes in the D-
dimensional space that approximate the data as well as
possible in the least squares sense.
PCA finds the vectors that minimize the reconstruction error, which is equivalent to finding the vectors that maximize the variance.
Summary: PCA: algorithm
1. Compute the mean feature vector μ = (1/N) Σi xi
2. Compute the covariance matrix S = (1/N) Σi (xi − μ)(xi − μ)⊤
3. Compute eigenvalues λi and eigenvectors wi of S using SVD.
4. Arrange the eigenvalues in descending order.
5. Pick the first M.
6. Pick the eigenvectors corresponding to the selected eigenvalues.
7. Arrange the eigenvectors in a matrix W = [w1, …, wM].
8. Extract low-dimensional feature vectors: zi = W⊤ xi
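A minimal numpy sketch of these steps (random data and M = 2 are illustrative; step 3 uses np.linalg.eigh on the covariance matrix instead of an SVD, and the data are centered before projection):

```python
# PCA: eigen-decompose the covariance matrix, project onto the top-M eigenvectors.
import numpy as np

def pca(X, M):
    mu = X.mean(axis=0)                      # 1. mean feature vector
    Xc = X - mu                              #    center the data
    S = (Xc.T @ Xc) / len(X)                 # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # 3. eigenvalues/eigenvectors (S is symmetric)
    order = np.argsort(eigvals)[::-1]        # 4. descending order
    W = eigvecs[:, order[:M]]                # 5-7. first M eigenvectors as columns of W
    Z = Xc @ W                               # 8. low-dimensional feature vectors
    return Z, W, eigvals[order]

X = np.random.default_rng(0).normal(size=(200, 10))
Z, W, eigvals = pca(X, M=2)
print(Z.shape)   # (200, 2)
```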
The exam:
Kårhusets Gasquesal.
Bring calculator.
Approximately 60 points, 30 for passing grade.
1 point per important “point in the answer”.
E.g., Why does PCA need zero mean? (2 pt)
Possible answer: PCA finds the directions that maximize the variance of the data (1 pt), so if the mean is not 0, the directions will be influenced by the offset of the mean and not by the spread of the data (1 pt).
The exam:
Some examples from previous exams (fall 2023): first
question about general concepts.
2.3 Cross-entropy for 2 training examples.
The exam:
Look at the coding problem 8.1; something similar may come.
Look at the study questions. I will add more for CNNs and
data.
Course evaluation
🎄🎁🤶Happy Holidays🎄🎁🤶