
EDAN96 2024 Last Lecture-1

The document provides a comprehensive overview of machine learning concepts covered in EDAN96 Lecture 14, including various algorithms, evaluation tools, and best practices for model training and validation. Key topics include supervised and unsupervised learning, cross-validation techniques, loss functions, and ensemble methods. It also discusses Bayesian classifiers and the importance of understanding probability distributions in machine learning.

Uploaded by

Axel Rosenqvist

EDAN96 Lecture 14:

Summary and exam prep


MAJ STENMARK, DEPT. COMPUTER SCIENCE
Content
Summary of the course
Exam prep

Lectures
1. Intro to ML
2. Intro to probability and information theory
3. Basic concepts in ML
4. Decision Trees
5. DTs cont. Ensemble Methods
6. Bayesian classifiers
7. Logistic regression
8. Feedforward networks
9. Techniques in neural networks
10. Dimensionality reduction
11. Convolutional Neural Networks
12. CNNs cont. and Explainable AI
13. (Overview different ML tasks, data)
Intro to ML/Basic concepts
Unsupervised algorithms:
k-means clustering, UMAP, PCA, GMMs.
Supervised algorithms:
k-NN
Linear regression (predicts numerical values)
Logistic regression
Neural networks
(Binary) classification and regression.

Intro to ML/Basic concepts
Underfitting: the model does not explain the data, so performance is poor (also on the training data).
Overfitting: the model fits the training data too closely (it learns the noise), so it generalizes poorly.

Detect it: use a validation set during training as a test set, e.g., using cross-validation.

[Figure legend: blue = training data, red = validation data.]
Cross-validation (CV)
Cross-validation is a technique to assess how well a model generalizes to new data.
Validation set 𝒱: a small subset of the training set that we omit from the training and use for testing during training.

Purpose: assess model performance, optimize hyperparameters, reduce variance in the evaluation, utilize (a small) dataset.

Cross-validation (CV)
To get a good estimate, we can repeat the process several times.
K-fold cross-validation
1. Partition the data into K chunks
2. Use K − 1 for training and 1 as validation
3. Compute the performance on the validation chunk 𝒱 (e.g., RMSE) and iterate until every chunk has been used for validation once
4. Run CV for each model and select the one with the best average performance

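The K-fold steps above can be sketched in plain Python. This is a minimal illustration, not the course's reference implementation: the helper names (`k_fold_indices`, `cross_validate`) and the mean-predicting toy model are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k chunks."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, predict):
    """Return the average validation RMSE over the k folds."""
    scores = []
    for val_idx in k_fold_indices(len(xs), k):
        val_set = set(val_idx)
        train_idx = [i for i in range(len(xs)) if i not in val_set]
        # Train on K - 1 chunks, validate on the remaining chunk.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in val_idx]
        scores.append((sum(errors) / len(errors)) ** 0.5)
    return sum(scores) / len(scores)

# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
avg_rmse = cross_validate(xs, ys, k=5, fit=fit_mean, predict=predict_mean)
```

To compare several models, one would call `cross_validate` once per model and keep the one with the lowest average RMSE.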
Best practices
Split the dataset into: training, validation (in CV) and test.
Preprocessing steps should be treated as part of the model (fit them only on the training set).
Use cross-validation to tune the hyperparameters of the model and select the best model(s).
Train the selected model on the full training set (training + validation data).
Do the final testing with your test data.
The test result should be similar to the validation result.
Deploy model.
Save the test data for final testing.

Summary: Intro to ML
Some early algorithms:
For classification: k-NN
For clustering: k-means
For regression: linear regression

k-Nearest Neighbors (kNN)
1. Input: points in space. Each point has a label.
2. Save the entire dataset as reference.
3. For each point p to classify: find the k closest points using some distance measure (e.g., Euclidean distance).
4. Give p the most common label of the k neighbors.

Break ties: choose k odd (for binary classification) or increase k.

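A minimal sketch of the k-NN steps above, using Euclidean distance and a majority vote. The function name `knn_predict` and the toy points are made up for the illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# The "training" step is only storing the labelled points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(points, labels, (0.5, 0.5), k=3)
```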
Evaluation tools
• Confusion matrix

Wikipedia: https://en.wikipedia.org/wiki/Confusion_matrix

Evaluation tools cont.
• Precision: TP/(TP + FP)
• Recall: TP/(TP + FN)
• F1-score: the harmonic mean of precision and recall

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Wikipedia: https://en.wikipedia.org/wiki/Precision_and_recall

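The three metrics can be computed directly from the confusion-matrix counts. The sketch below assumes binary labels with 1 as the positive class; the example labels are invented.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]   # TP = 3, FP = 1, FN = 1
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```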
k-means
1. Input: points in space.
2. Decide how many clusters k you want.
3. Place k centroids in the space (randomly)
4. Iteratively:
   a. Assign each point to the closest centroid.
   b. Recalculate the centroids as the average of the points in the cluster.
   c. Repeat until the clusters do not change anymore.
[Figure: random centroids → assign points to the closest centroid → move centroids → assign points to the closest again → move centroids → stop when the clusters do not change.]

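The iteration above can be sketched as follows. One assumption differs slightly from the slide: the initial centroids are k randomly chosen data points rather than random positions in space, which is a common and simple initialization.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (tuples) into k clusters; returns centroids and assignment."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # assumption: start from k random data points
    assignment = None
    for _ in range(max_iter):
        # Step a: assign each point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:   # step c: clusters no longer change
            break
        assignment = new_assignment
        # Step b: move each centroid to the average of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:   # keep the old centroid if a cluster became empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(points, k=2)
```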
Evaluate clustering
If we have labels, we can evaluate the clusters with homogeneity and completeness.

Homogeneity is a measure of how similar the samples in a cluster are. The formula includes the ratio between the number of samples labelled c in cluster k and the total number of samples in cluster k.

Completeness measures to what extent similar samples are put in the same cluster.

Normalized mutual information, V-measure:

$$V\text{-measure} = 2 \cdot \frac{h \cdot c}{h + c}$$

[Figure: two example clusterings with clusters 0, 1, 2. First example: homogeneity 1, completeness 0.579. Second example: homogeneity 0.613, completeness 1.]
Visualizing the result

R2-score
R²-score (coefficient of determination):

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where

$$SS_{res} = \sum_i (y_i - f_i)^2 \quad \text{(the residual sum of squares)}$$

$$SS_{tot} = \sum_i (y_i - \bar{y})^2 \quad \text{(the total sum of squares)}$$

A value of 0 means that the model does not predict better than the mean of the target; 1 means that the model explains the variance perfectly. Negative values are worse than guessing the mean.
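A small sketch of the R² computation, with three invented prediction vectors chosen to hit the three cases described above (perfect, mean-only, worse than the mean).

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y_true, y_true)                # predictions equal the targets
mean_only = r2_score(y_true, [2.5] * 4)           # always predicting the mean
worse = r2_score(y_true, [4.0, 3.0, 2.0, 1.0])    # reversed predictions
```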
Summary: Loss
Loss functions: the error function to minimize for the models (during training). Also called cost function.

Usually the log-loss is used to make the calculations easier.
Mean squared error: regression
Cross-entropy: (multi-class) classification

Summary: Probability
Fundamental Rules in Probability Theory

Sum rule: $p(x) = \int p(x, y)\, dy$ (1)
Product rule: $p(x, y) = p(y \mid x)\, p(x)$ (2)

These 2 rules appear everywhere in ML and Bayesian statistics.

$p(x, y)$: joint distribution of the two RVs x, y
$p(x), p(y)$: marginal distributions
$p(y \mid x) = \frac{p(x, y)}{p(x)}$: conditional distribution of y given x

If we consider discrete RVs x, y, the integral in (1) is replaced by a sum. This is where the name "sum rule" comes from.
Summary: Probability
Bayes' theorem

$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$

Interpretation: if we observe y, we can draw some conclusions about x given the observed values of y.
Prior p(x): encapsulates our prior knowledge of x
Likelihood p(y|x): describes how y and x are related
Marginal likelihood or evidence p(y): normalizing constant, the integral of the numerator with respect to x:

$$p(y) = \int p(y \mid x)\, p(x)\, dx = \int \frac{p(y, x)}{p(x)}\, p(x)\, dx = \int p(y, x)\, dx$$

Posterior p(x|y): expresses what we know about x after observing y
Statistical Independence

Intuitively, x and y are independent if the value of y (once known) does not add any additional information about x (and vice versa).

If x, y are (statistically) independent then:

1. $p(y \mid x) = p(y)$
2. $p(x \mid y) = p(x)$
3. $\mathbb{V}[x + y] = \mathbb{V}[x] + \mathbb{V}[y]$
4. $\mathrm{cov}[x, y] = 0$

Conditional Independence

Intuition: once z is known, y does not give any more information about the probability of x, i.e., $p(x \mid y, z) = p(x \mid z)$.

We often assume (conditional) independence just to make the math easier! E.g. in naive Bayes classifiers.

Probability Distributions: Gaussian
We often assume that probabilities are either uniform or Gaussian, because that makes the math easier (and it is often a good description of “reality”).

For a Gaussian distribution, the conditional and marginal distributions are also Gaussian.

[Figure: Gaussian density with μ = 0, σ² = 1.]

Summary: Entropy
Entropy: the level of uncertainty in a variable's possible outcomes (measured in bits when the logarithm has base 2).

$$H(X) = -\sum_{x \in \chi} p(x) \log p(x) \quad \text{or} \quad \mathbb{E}_p[-\log p(X)]$$

Cross-entropy: measures the difference between two probability distributions.

$$H(P, Q) = -\sum_{x \in \chi} p(x) \log q(x)$$

When the labels are known, $p \in \{y, 1 - y\}$ and $q \in \{\hat{y}, 1 - \hat{y}\}$:

$$H(p, q) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$$

Cross-entropy is a popular loss function.

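A short sketch of the entropy and cross-entropy formulas. Assumptions: distributions are given as lists of probabilities, entropy uses base-2 logarithms (bits), and the binary cross-entropy uses the natural logarithm, as is common for loss functions. The value 0.66 is the example output used later in the backpropagation slides.

```python
import math

def entropy(p):
    """H(X) in bits for a discrete distribution given as a list of probabilities."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits; q must be non-zero wherever p is non-zero."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def binary_cross_entropy(y, y_hat):
    """Loss for one example with true label y in {0, 1} and prediction 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

fair_coin = entropy([0.5, 0.5])        # maximal uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])      # less uncertainty than the fair coin
loss = binary_cross_entropy(1, 0.66)   # true label 1, predicted probability 0.66
```

Note that `cross_entropy(p, p)` equals `entropy(p)`, so the cross-entropy is smallest when the predicted distribution matches the true one.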
Summary: Mutual information
Mutual information (MI) is a measure from information theory that quantifies the amount of information one random variable X provides about another random variable Y. It reflects the reduction in uncertainty about one variable due to knowledge of the other.

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

It can also be written as

$$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}$$

Mutual information/information gain can be used to calculate the split in Decision Trees.
Summary: Gini impurity
Gini impurity: measures how often a randomly chosen element of a set would be incorrectly labeled if it were labeled randomly and independently according to the distribution of labels in the set. It reaches its minimum (zero) when all cases in the node fall into a single target category. It reflects the degree of disorder or impurity in a node.

$$G(p) = 1 - \sum_{x \in \mathcal{X}} p(x)^2$$

Gini impurity is another measure that can be used to calculate the split in Decision Trees.
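To make the two split criteria concrete, the sketch below computes Gini impurity and information gain for a toy node. The labels and the two candidate splits are invented for the illustration.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Relative frequency of each class among the labels."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def gini(labels):
    return 1 - sum(p ** 2 for p in class_probabilities(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4
pure_split = [["yes"] * 4, ["no"] * 4]                                   # separates the classes
mixed_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]   # tells us nothing
```

A tree builder would evaluate every candidate attribute this way and pick the split with the highest information gain (or the lowest weighted Gini impurity).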
Summary: MLE, MAP, linear regression

$$p(\theta \mid \text{Data}) = \frac{p(\text{Data} \mid \theta)\, p(\theta)}{p(\text{Data})}$$

Maximum likelihood estimation (MLE): we want to find the parameter values for which the observed data is most probable, that is, maximize p(Data | θ).
Example:
Imagine trying to estimate the bias θ of a coin. Using MLE, you would estimate θ as the proportion of heads in a series of coin flips because this maximizes the probability of observing the results you actually obtained.
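The coin example can be checked numerically. The sketch below uses an invented sequence of flips and a simple grid search over θ (an assumption made for illustration; the maximum can also be derived analytically) to show that the log-likelihood peaks at the proportion of heads.

```python
import math

def log_likelihood(theta, heads, tails):
    """Log of p(Data | theta) for independent coin flips with P(heads) = theta."""
    return heads * math.log(theta) + tails * math.log(1 - theta)

flips = "HHTHHTHHHT"   # hypothetical data: 7 heads, 3 tails
heads = flips.count("H")
tails = flips.count("T")

# Evaluate the log-likelihood on a grid of candidate values for theta.
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: log_likelihood(t, heads, tails))
proportion = heads / len(flips)
```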
Summary: MAP
$$p(\theta \mid \text{Data}) = \frac{p(\text{Data} \mid \theta)\, p(\theta)}{p(\text{Data})}$$

Maximum A Posteriori estimation (MAP): find the parameters that are “most probable” given the data.

We use a prior distribution on the parameters, p(θ), and then select the θ that maximizes the posterior p(θ | X, y).

Summary: Linear regression
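We consider the regression problem
$$y = f(x) + \epsilon$$
$x \in \mathbb{R}^D$ are inputs
$y \in \mathbb{R}$ are observed targets
$\epsilon \sim \mathcal{N}(0, \sigma^2)$ is identically distributed measurement noise
Goal: find a function f that is close to the true function that generated the data.
Linear regression parametrization:
$$y = f(x) + \epsilon = x^\top \theta + \epsilon,$$
where $\theta \in \mathbb{R}^D$ are the parameters we seek.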
Parameter Estimation
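Assume we are given a training set $\mathcal{D} = \{x_i, y_i\}_{i=1}^N$
Inputs $x_i = [1, x_{i1}, x_{i2}, \cdots, x_{i,d-1}]^\top \in \mathbb{R}^D$, $i = 1, \ldots, N$ (so $D = d$)
Corresponding observations $y_i \in \mathbb{R}$, $i = 1, \ldots, N$
where $y_i$ and $y_j$ are conditionally independent given $x_i, x_j$
$$y_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2} + \cdots + \theta_{d-1} x_{i,d-1} + \epsilon_i = x_i^\top \theta + \epsilon_i$$

Goal: find optimal parameters $\theta^* \in \mathbb{R}^D$ for the LR model in
$$y = f(x) + \epsilon = x^\top \theta + \epsilon$$
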
Closed-form MLE Solution
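We use the loss function:
$$\mathcal{L}(\theta) = \frac{1}{2} \|y - X\theta\|^2$$
(minimizing it minimizes the sum of the squared distances to the line)
Set the gradient to 0:
$$\frac{\partial \mathcal{L}}{\partial \theta} = -X^\top (y - X\theta) = 0 \Leftrightarrow X^\top X \theta = X^\top y \Leftrightarrow \theta^* = (X^\top X)^{-1} X^\top y$$
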
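A hedged NumPy sketch of the closed-form solution on synthetic data. The data, the "true" parameters and the noise level are invented for the example. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is numerically preferable but gives the same θ*.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(-1, 1, size=(N, 2))
X = np.hstack([np.ones((N, 1)), x])           # prepend the constant 1 for theta_0
theta_true = np.array([1.0, 2.0, -3.0])       # hypothetical ground-truth parameters
y = X @ theta_true + rng.normal(0, 0.1, N)    # targets with Gaussian noise

# Normal equations: solve (X^T X) theta = X^T y.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)
```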
Summary: Regularization
The MLE solution is prone to overfitting!
One way to control overfitting is to penalize big parameter values; this technique is called regularization.
A typical example is a loss function where a regularization term is added (Ridge regression):
$$\theta^* = \arg\min_\theta \|y - X\theta\|_2^2 + \alpha \|\theta\|_2^2$$

The second term is the regularizer.
It penalizes the amplitude of the parameters θ.
$\alpha \in \mathbb{R}^+$ controls the “strictness” of the regularization.
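For Ridge regression the minimizer also has a closed form, $\theta^* = (X^\top X + \alpha I)^{-1} X^\top y$ (a standard result, not stated on the slide). The sketch below uses invented data and, as a simplifying assumption, penalizes all parameters including any intercept.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form Ridge solution: solve (X^T X + alpha * I) theta = X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))   # hypothetical small dataset
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(0, 0.5, 30)

theta_mle = ridge_fit(X, y, alpha=0.0)      # alpha = 0 recovers the MLE solution
theta_ridge = ridge_fit(X, y, alpha=10.0)   # a larger alpha shrinks the parameters
```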
Summary: Decision trees
Decision trees: flowchart-like structure that partitions the dataset on one attribute in each internal node.
Classification trees: recursively partition the dataset using the attribute that maximizes information gain (ID3) or minimizes Gini impurity (CART). Stop when the dataset is empty, all datapoints belong to the same class, or no attributes are left to split on.
Regression trees: use MSE as the split criterion and use ranges for the splits. The output value is calculated as the mean of the values in the leaf.
Prone to overfitting! Prevent it by setting a maximum height or a minimum number of samples in leaf nodes.
Summary: Ensemble methods (bagging)
Bagging: bootstrap aggregation, e.g., random forests.
Bootstrap the data: randomly sample the dataset with replacement to generate several datasets.
Train several trees independently using the datasets.
Aggregate the results: vote on the class in classification, average the result in regression.
Random forests also choose a random subset of the attributes in each split.
The bootstrapping reduces the variance of the model without causing overfitting.
[Figure from: Hafsa Binte Kibria, Abdul Matin, "The Severity Prediction of The Binary And Multi-Class Cardiovascular Disease - A Machine Learning-Based Fusion Approach".]
Summary: Ensemble methods (stacking)
Stacking: combines the predictions of multiple base models (weak learners) using a meta-model to produce a final prediction.
Train multiple models (DTs, NNs) on the same dataset.
Use the models to make predictions.
Feed the predictions into a meta-model, e.g., linear regression.
The meta-model learns how to combine the predictions.
Often achieves higher accuracy than a single model.
Computationally expensive, and you need to select the models and the meta-model.
Summary: Ensemble methods (boosting)
Boosting: sequentially train models that focus on correcting errors from previous ones. The final prediction is made by aggregating all models (weighted vote/average).
Train a base model on the dataset.
Identify errors (misclassified points or big absolute errors).
Train the next model by giving these errors higher weights.
Repeat a predefined number of times (or until the error is small).
AdaBoost (Adaptive Boosting)
Gradient boosting machines: use a loss function for the errors and gradient descent. Implementations include XGBoost and LightGBM.
Summary: Bayesian Classifiers
Naive: assumes all features are (conditionally) independent.
Example: spam classification using the word "discount".
Steps:
Calculate the prior for each class: P(spam) = (number of spam emails) / (total emails) (same for not spam).
Calculate the likelihood for each class: P(discount | spam) = (number of spam emails with "discount") / (number of spam emails) (same for not spam).
Combine them to calculate the posterior: P(spam | discount) = P(discount | spam) P(spam) / P(discount) (same for not spam).
Assign the class label with the highest posterior probability.

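The steps above, sketched for a tiny invented corpus with a single feature (whether the word "discount" occurs). The emails, the class name "ham" for not spam, and the function name are assumptions for the illustration.

```python
emails = [
    ("big discount today", "spam"),
    ("discount on watches", "spam"),
    ("win money now", "spam"),
    ("meeting at noon", "ham"),
    ("student discount for lunch", "ham"),
    ("project report attached", "ham"),
    ("lunch tomorrow", "ham"),
]

def posterior_given_word(emails, word):
    """Return P(class | word) for every class using Bayes' theorem."""
    total = len(emails)
    evidence = sum(1 for text, _ in emails if word in text.split()) / total
    posteriors = {}
    for cls in {label for _, label in emails}:
        in_class = [text for text, label in emails if label == cls]
        prior = len(in_class) / total
        likelihood = sum(1 for text in in_class if word in text.split()) / len(in_class)
        posteriors[cls] = likelihood * prior / evidence
    return posteriors

posteriors = posterior_given_word(emails, "discount")
predicted = max(posteriors, key=posteriors.get)   # class with the highest posterior
```

Here P(spam) = 3/7 and P(discount | spam) = 2/3, while P(ham) = 4/7 and P(discount | ham) = 1/4, so the posterior for spam is 2/3 and for ham 1/3.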
Summary: Bayesian Classifiers
Gaussian naive Bayes: assumes that each feature has a Gaussian distribution (with different variance and mean).
Calculate the mean and variance of each feature for each class and use the resulting Gaussian densities as the likelihoods when computing the posterior.
For multiple features, we assume conditional independence and can multiply the probabilities to get the posterior.

Summary: Gaussian mixture models
Gaussian mixture models (GMMs): assume that the data is generated from a mixture of k Gaussians, each with its own mean (μk) and covariance (Σk). We have a mixing coefficient πk that determines the weight of each Gaussian:
P(x) = weighted sum of the Gaussians
Used for clustering, e.g., customer segmentation in marketing.

The parameters (μk, Σk, πk) are learned using the Expectation-Maximization (EM) algorithm (initialized, e.g., using k-means):
E-step: compute the posterior probability that each data point belongs to each cluster (responsibility).
M-step: update the parameters using the responsibilities from the E-step.
Logistic regression
Logistic regression: models the probability of a discrete (binary) outcome from continuous input variables. Sigmoid function: outputs a value between 0 and 1.
Logistic regression is a special case of a NN: no hidden nodes, one output node with sigmoid activation function.

Neural networks: have hidden layers and activation functions. They use backpropagation to calculate the gradients and gradient descent to update the weights.
Gradient descent: $w_{k+1} = w_k - \alpha_k \nabla f(w_k)$, where $\alpha_k$ is the learning rate.
Backpropagation: the loss function is defined for the output layer, so when calculating how to update the weights we go backwards through the network.
Activation functions
Add non-linearity to a neural network.
Rectified linear unit (ReLU)
Sigmoid
Softmax (rescales the output to probabilities):
$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
Softmax generalizes logistic regression to multiple classes.
Hyperbolic tangent
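A small sketch of the softmax formula. Subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result; the input values are arbitrary examples.

```python
import math

def softmax(z):
    """Rescale a list of scores to probabilities that sum to 1."""
    m = max(z)   # subtract the max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

probs = softmax([2.0, 1.0, 0.1])
two_class = softmax([1.5, 0.0])   # with two classes, softmax reduces to the sigmoid
```

The two-class case illustrates the link to logistic regression: `two_class[0]` equals `sigmoid(1.5)`.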
Neural Networks
Directed network of nodes (neurons) with weighted connections and activation functions.
In each neuron the incoming values are weighted:
$$z_1^{(2)} = \sum_i w_{1i}^{(2)} a_i^{(1)} + b_1^{(2)}$$
The output: $a_1^{(2)} = \sigma(z_1^{(2)})$
The output is compared to the desired labels with a loss (cost) function.
Between the input and the output layers there can be many hidden layers.

[Figure: example network. The data x enters layer 1 (input layer, $z^{(1)}, a^{(1)}$, neurons $a_1^{(1)}, a_2^{(1)}, a_3^{(1)}$), which is connected through the weights $w_{ji}^{(2)}$ ($j = 1, 2$; $i = 1, 2, 3$) to layer 2 (output layer, $z^{(2)}, a^{(2)}$). Parameters: W1, B1 and W2, B2. Output values: $a_1^{(2)} = 0.1$, $a_2^{(2)} = 0.66$. True values: 0 and 1.]
Neural Networks
The output (for a batch of training data) is compared to the desired labels with a loss (cost) function.
The weights are iteratively updated using gradient descent to “walk” towards the minimum:

$$w_{k+1} = w_k - \alpha_k \nabla f(w_k)$$

The gradients are calculated for each layer backward: backpropagation.

Material to understand gradient descent and backpropagation
Beautiful videos by Grant Sanderson (3blue1brown)
But what is a neural network?
Gradient Descent
Backpropagation
Backpropagation calculus

Do you want me to go through the backpropagation?

The matrices for updating the weights
The last layer:
$$\Delta z^{(2)} = 2 \cdot (a^{(2)} - y) * \sigma'(z^{(2)})$$
where $\Delta z^{(2)}$ is the vector with the derivatives $\frac{dC}{dz_i^{(2)}}$ and $*$ is element-wise multiplication.
Other layers:
$$\Delta B^{(l)} = \Delta z^{(l)}$$
$$\Delta W^{(l)} = \Delta z^{(l)} \cdot (a^{(l-1)})^T$$
$$\Delta z^{(l-1)} = ((W^{(l)})^T \cdot \Delta z^{(l)}) * \sigma'(z^{(l-1)})$$

[Figure: the two-layer example network (layer 1 = input layer, layer 2 = output layer) with output values 0.1 and 0.66 and true values 0 and 1.]


Backpropagation
First run a forward pass through the network to get all current values.
The gradients are calculated backwards using the chain rule:
$$z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$$
$$a^{(L)} = \sigma(z^{(L)})$$
Loss for 1 training example:
$$C_0 = (a^{(L)} - y)^2 = (0.66 - 1)^2$$
Dependency chain: $a^{(L-1)} \to z^{(L)} \to a^{(L)} \to C_0$

[Figure: the two-layer example network with output values 0.1 and 0.66 and true values 0 and 1.]
Backpropagation
Dependency chain: $x \to z^{(1)} \to a^{(1)} \to z^{(2)} \to a^{(2)} \to C$
Loss for 1 training example:
$$C_0 = (a^{(L)} - y)^2 = (0.66 - 1)^2, \qquad a^{(L)} = \sigma(z^{(L)})$$
For the first output neuron ($\Delta z_j^{(l)}$ with $j = 1$, $l = 2$):
$$\Delta z_1^{(2)} = \frac{dC}{dz_1^{(2)}} = \frac{dC}{da_1^{(2)}} \cdot \frac{da_1^{(2)}}{dz_1^{(2)}} = 2(a_1^{(2)} - y_1) \cdot \sigma'(z_1^{(2)})$$

[Figure: the two-layer example network with output values 0.1 and 0.66 and true values 0 and 1.]

Backpropagation
$$\Delta B^{(l)} = \Delta z^{(l)}$$
Since
$$z_j^{(2)} = \sum_i w_{ji} a_i^{(1)} + b_j^{(2)}$$
the derivative with respect to the bias is
$$\frac{dC}{db_j^{(l)}} = \frac{dC}{dz_j^{(l)}} \cdot \frac{dz_j^{(l)}}{db_j^{(l)}} = \Delta z_j^{(l)} \cdot 1$$
Dependency chain: $x \to z^{(1)} \to a^{(1)} \to z^{(2)} \to a^{(2)} \to C$

[Figure: the two-layer example network with output values 0.1 and 0.66 and true values 0 and 1.]
Backpropagation
$$\Delta W^{(l)} = \Delta z^{(l)} \cdot (a^{(l-1)})^T$$
Since
$$z_j^{(2)} = \sum_i w_{ji} a_i^{(1)} + b_j^{(2)}$$
the derivative with respect to a weight is
$$\frac{dC}{dw_{ji}^{(l)}} = \frac{dC}{dz_j^{(l)}} \cdot \frac{dz_j^{(l)}}{dw_{ji}^{(l)}} = \Delta z_j^{(l)} \cdot a_i^{(l-1)}$$
Dependency chain: $x \to z^{(1)} \to a^{(1)} \to z^{(2)} \to a^{(2)} \to C$

[Figure: the two-layer example network with output values 0.1 and 0.66 and true values 0 and 1.]
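Backpropagation
$$\Delta W^{(l)} = \Delta z^{(l)} \cdot (a^{(l-1)})^T$$
Written out for the example network:
$$\begin{pmatrix} \Delta w_{11} & \Delta w_{12} & \Delta w_{13} \\ \Delta w_{21} & \Delta w_{22} & \Delta w_{23} \end{pmatrix} = \begin{pmatrix} \Delta z_1^{(2)} \\ \Delta z_2^{(2)} \end{pmatrix} \begin{pmatrix} a_1^{(1)} & a_2^{(1)} & a_3^{(1)} \end{pmatrix}$$
which follows element by element from
$$\frac{dC}{dw_{ji}^{(l)}} = \frac{dC}{dz_j^{(l)}} \cdot \frac{dz_j^{(l)}}{dw_{ji}^{(l)}} = \Delta z_j^{(l)} \cdot a_i^{(l-1)}$$
Dependency chain: $x \to z^{(1)} \to a^{(1)} \to z^{(2)} \to a^{(2)} \to C$

[Figure: the two-layer example network with output values 0.1 and 0.66 and true values 0 and 1.]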
Backpropagation
$$\Delta z^{(l-1)} = ((W^{(l)})^T \cdot \Delta z^{(l)}) * \sigma'(z^{(l-1)})$$
Dependency chain: $x \to z^{(1)} \to a^{(1)} \to z^{(2)} \to a^{(2)} \to C$
For the first neuron in layer 1:
$$\frac{dC}{dz_1^{(1)}} = \frac{dC}{da_1^{(1)}} \cdot \sigma'(z_1^{(1)})$$
The chain rule:
$$\frac{dC}{da_1^{(1)}} = \frac{dC}{dz_1^{(2)}} \cdot \frac{dz_1^{(2)}}{da_1^{(1)}} + \frac{dC}{dz_2^{(2)}} \cdot \frac{dz_2^{(2)}}{da_1^{(1)}} = w_{11}^{(2)} \cdot \Delta z_1^{(2)} + w_{21}^{(2)} \cdot \Delta z_2^{(2)}$$

[Figure: the two-layer example network with output values 0.1 and 0.66 and true values 0 and 1.]

Backpropagation
$$\frac{dC}{da_1^{(1)}} = w_{11}^{(2)} \cdot \Delta z_1^{(2)} + w_{21}^{(2)} \cdot \Delta z_2^{(2)}$$
The weights $w_{11}^{(2)}$ and $w_{21}^{(2)}$ form the first column of
$$W^{(2)} = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix}$$
so for all neurons in layer 1 we multiply with the transpose:
$$\begin{pmatrix} \Delta a_1^{(1)} \\ \Delta a_2^{(1)} \\ \Delta a_3^{(1)} \end{pmatrix} = \begin{pmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ w_{13} & w_{23} \end{pmatrix} \begin{pmatrix} \Delta z_1^{(2)} \\ \Delta z_2^{(2)} \end{pmatrix}$$
which gives
$$\Delta z^{(l-1)} = ((W^{(l)})^T \cdot \Delta z^{(l)}) * \sigma'(z^{(l-1)})$$

[Figure: the two-layer example network with output values 0.1 and 0.66 and true values 0 and 1.]
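The matrix formulas for backpropagation can be sketched in NumPy for a network shaped like the example (3 neurons in layer 1, 2 outputs in layer 2). Assumptions made for this illustration: a sigmoid activation in both layers, a hypothetical 4-feature input, random weights, and the cost $C = \sum_j (a_j^{(2)} - y_j)^2$, whose derivative with respect to $a^{(2)}$ is $2(a^{(2)} - y)$. A finite-difference check is included to compare one analytic gradient entry with a numerical estimate.

```python
import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1 - s)

def forward(x, W1, B1, W2, B2):
    z1 = W1 @ x + B1
    a1 = sigma(z1)
    z2 = W2 @ a1 + B2
    a2 = sigma(z2)
    return z1, a1, z2, a2

def backward(x, y, W1, B1, W2, B2):
    z1, a1, z2, a2 = forward(x, W1, B1, W2, B2)
    dz2 = 2 * (a2 - y) * sigma_prime(z2)     # last layer: dC/dz^(2)
    dW2 = dz2 @ a1.T                         # dC/dW^(2) = dz^(2) (a^(1))^T
    dz1 = (W2.T @ dz2) * sigma_prime(z1)     # propagate back to layer 1
    dW1 = dz1 @ x.T
    return dW1, dz1, dW2, dz2                # dB^(l) equals dz^(l)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                  # hypothetical input with 4 features
y = np.array([[0.0], [1.0]])                 # true values, as in the example
W1, B1 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
W2, B2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))
dW1, dB1, dW2, dB2 = backward(x, y, W1, B1, W2, B2)

def cost(W2_):
    """Cost as a function of W2 only, used for the numerical check."""
    return float(np.sum((forward(x, W1, B1, W2_, B2)[3] - y) ** 2))

# Finite-difference estimate of dC/dw_12^(2) to compare with dW2[0, 1].
eps = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 1] += eps
W2_minus[0, 1] -= eps
numeric = (cost(W2_plus) - cost(W2_minus)) / (2 * eps)
```

A gradient descent step would then be `W2 -= lr * dW2`, `B2 -= lr * dB2`, and likewise for layer 1, for some learning rate `lr`.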



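How to train your neural network
Averaging over all training examples is the most accurate but computationally expensive:

$$\frac{\partial C}{\partial w^{(L)}} = \frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial w^{(L)}}$$

You can use:

1. The whole dataset: batch gradient descent
2. One random observation: stochastic gradient descent
3. A few observations: mini-batch gradient descent.

    import torch

    X, y = torch.randn(100, 3), torch.randn(100, 1)  # assumed placeholder data (not on the slide)
    model = torch.nn.Linear(3, 1)                    # assumed setup: a linear model
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(250):
        y_pred = model(X)           # forward pass: we compute Xw + b = y_hat
        loss = loss_fn(y_pred, y)   # mean of (y_hat - y)^2
        optimizer.zero_grad()       # reset the gradients from the previous step
        loss.backward()             # we compute the gradients
        optimizer.step()            # we update the weights
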
Comments about backpropagation
Fully connected layers: O(n²) weights between each layer. Each update depends on the dataset size and the number of weights.

The derivative of the activation function affects the gradient:

$$\frac{\partial C_0}{\partial w_{kj}^{(L)}} = \frac{\partial z_j^{(L)}}{\partial w_{kj}^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \cdot \frac{\partial C_0}{\partial a_j^{(L)}}$$

When the gradient is small, it may prevent the values from updating: the vanishing gradient problem. (It might also explode.)
This can prevent deep networks from learning.

[Figures: the hyperbolic tangent and ReLU activation functions.]
Special purpose layers
To limit the computational cost, fully connected layers are used sparingly.

Special purpose layers for different tasks reduce computation and can still capture the most important patterns.

Such as convolutional neural network (CNN) layers and layers for sequence modeling (RNN, transformers).

CNNs and transformers have the advantage that they can be parallelized!
Convolutional Neural Networks
Convolutional neural networks (CNNs): currently the best performing networks for analyzing images.
Deep neural networks with at least one convolutional layer, pooling layers and fully connected linear layers.

Example: a black and white image that has 100 x 100 pixels = 10 000 inputs.
Fully connected layers have many weights; if we have many layers it quickly becomes computationally infeasible.
Idea 1: pixels close by are more relevant to each other than pixels far away.
Idea 2: similar features are relevant all over the image, for example edges, so we can reuse the parameters to detect them.
We can learn small filters that only look at a small region, e.g., 5 x 5 pixels, use many parallel filters and many layers and still have few parameters to train.
Convolutions
Mathematical convolution: apply another function (a kernel) around the current values and add up the values.
E.g., add Gaussian blur to an image: the kernel is a Gaussian. For each pixel, give the surrounding pixels weights proportional to a Gaussian and add them to create a blurry image. That is, you blend the values.

See Goodfellow: Chapter 9
Convolution Layer: filters
Each neuron in a convolutional layer looks at a small region (receptive field) in the input/previous layer and applies the kernel (the filter) to that region.
Either we pad the border or the next layer is smaller!
We learn the weights of the kernel.
The same kernel is used for the whole input.

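A plain-Python sketch of what a convolutional layer computes for one filter and one channel. Assumptions: no padding (so the output is smaller than the input), the cross-correlation form typically used in CNN libraries, and a hand-picked vertical-edge kernel where a real layer would learn the weights.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and sum the weighted values."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]   # hand-picked kernel; a CNN would learn these weights
feature_map = conv2d(image, vertical_edge)   # 3 x 3 output, responds where the edge is
```

The same four kernel weights are reused at every position, which is why the layer needs far fewer parameters than a fully connected one.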
Conv-layer

Conv-layer: stride

Conv-layer: dilation (gaps in kernel)

Conv-layer: padding

Conv-layer: Multiple channels

Other types of layers
Pooling: e.g., max pooling selects the maximum value from a window.
Dropout: used during training to randomly set a few neurons to 0. Reduces overfitting.
BatchNorm: normalizes the input to a layer.

Summary: PCA and UMAP
Uniform Manifold Approximation and Projection, UMAP:
Used for visualization, non-linear. Captures
Principal Component Analysis, PCA: preprocessing
step, noise reduction, linear. Can be used for
dimensionality reduction to speed up training of other
models.
Summary: UMAP
UMAP: Uniform Manifold Approximation and Projection. Focuses on preserving both local and global structure.
1. Construct a high-dimensional graph:
A. Compute pair-wise distances.
B. Convert distances to probabilities by defining a local neighborhood for each point.
C. Combine the local neighborhoods into a global graph called a “fuzzy simplicial complex”.
2. Optimize a low-dimensional representation:
A. Initialize a low-dimensional embedding: place the points “randomly”.
B. Minimize the difference between the high- and low-dimensional representations using cross-entropy.
C. Adjust the positions of the points in the low-dimensional representation using stochastic gradient descent.
PCA: Overview
Find vectors (summary indices) that capture the important information in the data (principal components).
Keep statistical properties such as variance.
PCA finds lines, planes and hyper-planes in the D-dimensional space that approximate the data as well as possible in the least squares sense.
Finds the vectors that minimize the reconstruction error, which is equivalent to finding the vectors that maximize the variance.

Summary: PCA: algorithm
1. Compute mean feature vector $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
2. Compute covariance matrix $S = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^\top$
3. Compute eigenvalues $\lambda_i$ and eigenvectors $w_i$ of S using SVD.
4. Arrange the eigenvalues in descending order
5. Pick the first M
6. Pick eigenvectors corresponding to the selected eigenvalues
7. Arrange eigenvectors in matrix $W = [w_1, \ldots, w_M]$
8. Extract low-dimensional feature vectors: $z_i = W^\top x_i$
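The algorithm can be sketched in NumPy as follows. Two deviations from the listed steps are assumptions of this sketch: it uses `np.linalg.eigh` (an eigendecomposition for symmetric matrices) instead of SVD, and it projects the centered data $x_i - \mu$, a common convention. The synthetic data (points close to a line) is invented for the example.

```python
import numpy as np

def pca(X, M):
    """Project the rows of X onto the M principal components."""
    mu = X.mean(axis=0)                       # step 1: mean feature vector
    Xc = X - mu
    S = Xc.T @ Xc / X.shape[0]                # step 2: covariance matrix (1/N)
    eigvals, eigvecs = np.linalg.eigh(S)      # step 3: eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]     # steps 4-5: the M largest eigenvalues
    W = eigvecs[:, order]                     # steps 6-7: matching eigenvectors as columns
    Z = Xc @ W                                # step 8: low-dimensional feature vectors
    return Z, W, eigvals[order]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=200)])   # points near a line
Z, W, top_eigvals = pca(X, M=1)
```

The variance of the projected data equals the selected eigenvalue, which is the sense in which the principal components maximize the variance.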
The exam:
Kårhusets Gasquesal.
Bring a calculator.
Approximately 60 points, 30 for a passing grade.
1 point per important “point in the answer”.
E.g., Why does PCA need zero mean? (2 pt)
Possible answer: PCA finds the directions that maximize the variance of the data (1 pt), so if the mean is not 0, the direction will be influenced by the offset of the mean and not the spread of the data (1 pt).
The exam:
Some examples from previous exams (fall 2023): first
question about general concepts.
2.3 Cross-entropy for 2 training examples.

The exam:
Look at the coding problem 8.1, something similar may come.

Look at the study questions. I will add more for CNNs and
data.

Course evaluation

Please help me by giving concrete feedback!

🎄🎁🤶Happy Holidays🎄🎁🤶
