Deep Learning Intro Slides
7.1.2025
Pre-requisites
Course schedule
Rules of the course
Communication channels
Course grading
Assignments
Exercise sessions
• The exercise sessions are organized to help you solve the assignments.
• You do not have to attend the exercise sessions.
• There will be four online exercise sessions for each assignment:
1. Friday 10:15-11:45
2. Friday 12:15-13:45
3. Monday 10:15-11:45
4. Monday 12:15-13:45
• No exercise sessions during the exam week (17.–21.2.).
• Please read the protocol for the exercise sessions in MyCourses carefully.
Course material
• We do not have special sessions on PyTorch; you should learn it by following the PyTorch tutorials.
• If you know numpy (a pre-requisite), PyTorch should be easy to learn.
• Deep learning frameworks develop very quickly, so you need to learn new frameworks and features all the time.
• If you need help with PyTorch, please ask for assistance in the exercise sessions.
• Lecture notes, material by Alexander Ilin, are available in MyCourses.
• Lecture slides will be shared in MyCourses before each lecture.
• The lectures will be recorded and made available in MyCourses.
• The Deep Learning book by Goodfellow, Bengio and Courville (2016).
What is deep learning
Feature engineering
• Suppose that you need to solve a custom machine learning problem, for example:
• Spam detection.
• Reading information from scanned invoices.
• We can solve such a problem by designing a set of features from the data and using those features as input to a machine learning model.
• Spam detection: useful features are the counts of certain words.
• Line item extraction from invoices: useful features for classifying a number as a line item or not are its position on the invoice and the words that appear in its proximity.
Data → Feature engineering → Machine learning model (e.g. random forest classifier)
• Benefit of feature engineering: One can use domain knowledge to design features that are robust
(for example, invariant to certain distortions).
• What are the problems with feature engineering?
Feature engineering: Problem 1
Feature engineering: Problem 2
• Handcrafted features are not perfect. There are always examples that are not processed correctly,
which motivates engineering of new features.
[Diagram: feedback loop — features feed a classifier; misclassified examples motivate the engineering of new features.]
Representation learning
Deep learning = artificial neural networks
• Deep learning is a term used for machine learning with artificial neural networks, including modern learning frameworks that are not necessarily neurally inspired.
[Figure: frequency of the phrases "cybernetics", "neural networks" and "deep learning" according to Google Books, 1940–2020.]
Linear classifiers
Logistic regression
• Consider a binary classification problem: our training data consist of examples (x^(1), y^(1)), ..., (x^(N), y^(N)) with m-dimensional inputs x and binary labels y.
• Logistic regression model:
  f(x) = σ( Σ_{j=1}^m w_j x_j + b ) = σ( w⊤x + b )
where m is the number of features in x and σ(x) = 1/(1 + e^(−x)) is the logistic (a.k.a. sigmoid) function.
• Using the logistic function guarantees that the output is between 0 and 1, and it can be seen as the probability that x belongs to one of the classes: p(y = 1 | x) = f(x).
[Figures: training examples in a two-dimensional scatter plot; the logistic function.]
Training of a binary classifier: Optimization problem
• Training of our classifier: find parameters w and b that would classify our training examples as accurately as possible.
• We train the classifier by solving the following optimization problem.
• Assume a Bernoulli distribution for the labels y:
  p(y | x, w, b) = f(x)^y (1 − f(x))^(1−y)
where f(x) is the output of the classifier.
• Write the likelihood function for N training examples:
  p(data | w, b) = Π_{i=1}^N p(y^(i) | x^(i), w, b)
• Maximize the log-likelihood function F(w, b) = log p(data | w, b) or minimize the negative of that:
  L(w, b) = − Σ_{i=1}^N [ y^(i) log f(x^(i)) + (1 − y^(i)) log(1 − f(x^(i))) ]
Toy binary classification problem
• Consider a toy binary classification problem with two parameters w1 and w2 (no bias term):
  f(x) = σ( w1 x1 + w2 x2 ),   σ(x) = 1/(1 + e^(−x))
• The loss function
  L(w1, w2) = − Σ_{i=1}^n [ y^(i) log f(x^(i)) + (1 − y^(i)) log(1 − f(x^(i))) ]
can be visualized in this toy example using a contour plot.
[Figures: training examples (left); contour plot of the loss over (w1, w2) (right).]
Gradient
• The gradient of the loss:
  g(w) = [ ∂L/∂w1, ∂L/∂w2 ]⊤
• The gradient of L points in the direction of the greatest rate of increase of L; its magnitude is the slope of the graph of L in that direction.
[Figure: loss contours with gradient directions.]
Gradient descent
• Gradient descent: move in the direction of the negative gradient. A single step does not usually take us to the minimum, so we need to iterate:
  w_{t+1} = w_t − η_t g(w_t)
where η_t is the learning rate.
[Figure: gradient descent trajectory on the loss contours.]
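To make the update rule concrete, here is a minimal numpy sketch that trains the toy logistic regression model above with plain gradient descent; the synthetic blob data, learning rate and iteration count are illustrative assumptions, not values from the course.

import numpy as np

rng = np.random.default_rng(0)
# Toy data: two Gaussian blobs with labels 0/1 (made-up, not the course data).
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(2)                 # two parameters w1, w2 (no bias term)
eta = 0.1                       # assumed learning rate
for t in range(500):
    f = sigmoid(X @ w)          # model outputs f(x) = sigma(w^T x)
    g = X.T @ (f - y) / len(y)  # gradient of the mean negative log-likelihood
    w -= eta * g                # gradient descent step: w <- w - eta * g(w)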
Multilayer perceptrons
A historical note: First models of neurons
• The first algorithm for training a linear binary classifier was proposed by Rosenblatt in 1958.
• The model was called Perceptron and its training procedure was inspired by neuroscience (Donald
Hebb’s rule) rather than by mathematical optimization.
• The problem with Perceptrons: they are linear classifiers and can solve only a very limited set of
classification problems.
• This problem was well understood already in the 1960s. Minsky and Papert (1969) in their book
called “Perceptrons” argued that more complex (nonlinear) problems have to be solved with
multiple layers of perceptrons.
Multilayer perceptrons
• A multilayer perceptron (MLP) is a neural network that consists of multiple layers of perceptrons (neurons).
Activation functions
Matrix-vector notation
Training of multilayer perceptrons
• Again, we can tune the parameters θ_k (which include W_k, b_k) of the classifier by maximizing the log-likelihood, for example, using gradient descent:
  θ_k ← θ_k − η ∂L/∂θ_k
• The gradient ∂L/∂θ_k can be computed efficiently using the error backpropagation algorithm.
The backpropagation algorithm
(Rumelhart et al., 1986)
Backpropagation: An example with scalars
• Consider a computational graph in which a scalar input x is transformed into an intermediate value h = f1(x, w), then into an output y = f2(h, θ), from which the loss L is computed.
• We can compute the derivatives wrt the model parameters θ and w using the chain rule, re-using the intermediate derivative ∂L/∂h:
  ∂L/∂θ = (∂L/∂y)(∂y/∂θ)
  ∂L/∂w = (∂L/∂y)(∂y/∂h)(∂h/∂w),   where (∂L/∂y)(∂y/∂h) = ∂L/∂h
[Diagram: x → f1(·; w) → h → f2(·; θ) → y → L, with the derivatives ∂L/∂y, ∂L/∂h, ∂L/∂θ, ∂L/∂w computed in the backward direction.]
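A minimal numpy sketch of these scalar chain-rule computations, assuming (hypothetically) f1(x, w) = w·x, f2(h, θ) = θ·h and L(y) = ½y²:

import numpy as np

x, w, theta = 2.0, 0.5, -1.0

# Forward pass through the graph x -> f1 -> h -> f2 -> y -> L
h = w * x               # h = f1(x, w)
y = theta * h           # y = f2(h, theta)
L = 0.5 * y**2          # scalar loss

# Backward pass: propagate dL/dy towards the inputs, re-using dL/dh
dL_dy = y               # dL/dy for L = 0.5 y^2
dL_dh = dL_dy * theta   # dL/dh = dL/dy * dy/dh
dL_dtheta = dL_dy * h   # dL/dtheta = dL/dy * dy/dtheta
dL_dw = dL_dh * x       # dL/dw = dL/dh * dh/dw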
Chain rule for multi-variable functions
• For multi-variable functions, the chain rule can be written in terms of Jacobian matrices. For y = f(u), u = g(x) with y ∈ R^M, u ∈ R^K, x ∈ R^N, the Jacobian matrix is
  J_{f∘g} = [ ∂y_1/∂x_1 ··· ∂y_1/∂x_N ; ... ; ∂y_M/∂x_1 ··· ∂y_M/∂x_N ]
an M × N matrix of partial derivatives.
Backpropagation for multi-variable functions
• We apply the chain rule to compute the derivatives wrt the model parameters (and re-use intermediate derivatives):
  ∂L/∂θ_j = Σ_{k=1}^K (∂L/∂y_k)(∂y_k/∂θ_j)
  ∂L/∂h_l = Σ_{k=1}^K (∂L/∂y_k)(∂y_k/∂h_l)
  ∂L/∂w_i = Σ_{l=1}^L (∂L/∂h_l)(∂h_l/∂w_i)
[Diagram: x → f1(·; w) → h → f2(·; θ) → y → L.]
• We can compute the derivatives sequentially going from the outputs of the network towards the inputs (thus the name of the algorithm: backpropagation).
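Each of these sums is a vector–Jacobian product. A small numpy sketch of one backward step, assuming (hypothetically) an elementwise layer y = tanh(h):

import numpy as np

h = np.array([0.5, -1.0, 2.0])
y = np.tanh(h)                      # layer output
dL_dy = np.array([0.1, -0.2, 0.3])  # gradient arriving from the layer above (made-up)

J = np.diag(1 - y**2)               # Jacobian dy_k/dh_l of elementwise tanh
dL_dh = J.T @ dL_dy                 # dL/dh_l = sum_k (dL/dy_k)(dy_k/dh_l)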
PyTorch
• PyTorch is a programming framework which allows one to create complex multilayer models without the need to implement the optimization procedure and the function gradients: backpropagation is already implemented in the framework.

import torch
import torch.nn as nn

# An MLP with 3 inputs, 5 hidden units and 1 output
mlp = nn.Sequential(
    nn.Linear(3, 5),
    nn.ReLU(),
    nn.Linear(5, 1),
)
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01)
loss_fn = nn.MSELoss()           # stand-in loss function
x = torch.randn(100, 3)          # stand-in inputs
targets = torch.randn(100, 1)    # stand-in targets
for i in range(100):
    optimizer.zero_grad()
    y = mlp(x)                   # forward pass
    loss = loss_fn(y, targets)   # compute loss
    loss.backward()              # backpropagate the gradients
    optimizer.step()             # update the parameters

[Diagram: network with inputs x1, x2, x3 and output y.]
Classification problems: Cross-entropy loss
• Let us extend this loss to the case of K classes. We can represent the target as a one-hot vector y, where y_j = 1 if the example belongs to class j and y_j = 0 otherwise:
  class 1: y = [1, 0]⊤,   class 2: y = [0, 1]⊤,   Σ_{j=1}^K y_j = 1
• Similarly, we can represent the output of the network as a vector f whose j-th element f_j is the modeled probability that input x belongs to class j:
  f = [f_1, f_2]⊤,   0 ≤ f_j ≤ 1,   Σ_{j=1}^K f_j = 1
Classification problems: Cross-entropy loss
• Now we write the cross-entropy loss of N data samples in the following form:
  L(θ) = −(1/N) Σ_{n=1}^N Σ_{j=1}^K y_j^(n) log f_j(x^(n), θ)
• To guarantee that the outputs f_j are valid probabilities, we apply the softmax nonlinearity to the outputs h_j of the last layer of the neural network:
  f_j = exp(h_j) / Σ_{j'=1}^K exp(h_{j'})
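As a sketch, the softmax and this averaged cross-entropy loss take only a few lines of numpy; the logits and one-hot targets below are made-up examples. (In PyTorch, nn.CrossEntropyLoss applies the softmax to raw logits internally.)

import numpy as np

def softmax(h):
    # Subtract the row max for numerical stability; softmax is shift-invariant.
    e = np.exp(h - h.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

H = np.array([[2.0, 0.5, -1.0],   # logits h for two samples (made-up)
              [0.1, 0.2, 3.0]])
Y = np.array([[1, 0, 0],          # one-hot targets y
              [0, 0, 1]])

F = softmax(H)                                  # probabilities f_j
loss = -np.mean(np.sum(Y * np.log(F), axis=1))  # cross-entropy loss L(theta)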
Regression problems: Mean-squared error loss
• The mean-squared error loss:
  L(θ) = (1/N) Σ_{n=1}^N (1/n_y) Σ_{j=1}^{n_y} ( y_j^(n) − f_j(x^(n), θ) )²
where
• y_j^(n) is the j-th element of y^(n)
• f_j is the j-th element of the network output f(x, θ)
• n_y is the number of elements in y^(n)
• N is the number of training examples
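In numpy this is just a mean of squared differences; a sketch with made-up arrays:

import numpy as np

Y = np.array([[1.0, 2.0], [0.0, -1.0]])   # targets y^(n) (made-up)
F = np.array([[0.9, 2.1], [0.2, -0.8]])   # network outputs f(x^(n), theta)
mse = np.mean((Y - F) ** 2)               # average over samples and elements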
Convergence of gradient descent
Toy example: minimizing a quadratic function
• Consider minimizing a quadratic function L(w) = ½ w⊤Aw − b⊤w, whose level sets are ellipses.
• The axes of the ellipses are determined by the eigenvectors of matrix A.
[Figure: contour plot of a quadratic loss.]
Effect of learning rate
• Suppose that we use gradient descent to find the minimum of the loss:
  θ_{t+1} = θ_t − η g(θ_t)
• The learning rate η has a major effect on the convergence of the gradient descent.
[Figures: two gradient descent trajectories on loss contours. Left, small η: too slow convergence. Right, large η: oscillates and can even diverge.]
Convergence of gradient descent (see, e.g., Goh, 2017)
• For the quadratic loss L(w) = ½ w⊤Aw − b⊤w, the optimal learning rate depends on the curvature of the loss.
• If we select the learning rate optimally, the rate of convergence of the gradient descent,
  rate(η) = ‖w_{t+1} − w*‖ / ‖w_t − w*‖,
is determined by the condition number κ(A) of matrix A:
  rate(η*) = (κ(A) − 1) / (κ(A) + 1)
• rate(η*) = 1: no convergence.
[Figure: large κ(A) gives slow convergence because of zigzagging.]
• For non-quadratic functions, the error surface is locally well approximated by a quadratic function:
  L(w) ≈ L(w_t) + g⊤(w − w_t) + ½ (w − w_t)⊤ H (w − w_t)
where H is the Hessian matrix:
  H = [ ∂²L/∂w_1∂w_1 ··· ∂²L/∂w_1∂w_M ; ... ; ∂²L/∂w_M∂w_1 ··· ∂²L/∂w_M∂w_M ]
• For the quadratic loss L(w) = ½ w⊤Aw − b⊤w the Hessian matrix is H = A. Thus, the convergence of the gradient descent is affected by the properties of the Hessian.
[Figure: contour plot of a non-quadratic loss.]
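A numpy sketch of this convergence rate, using an assumed diagonal A (so κ(A) is the ratio of its eigenvalues) and the step size η = 2/(λ_max + λ_min), which is known to be optimal for quadratics (a fact not stated on the slide):

import numpy as np

A = np.diag([10.0, 1.0])      # eigenvalues 10 and 1, so kappa(A) = 10
b = np.zeros(2)               # minimum at w* = 0
eta = 2.0 / (10.0 + 1.0)      # optimal constant step size for this quadratic
w = np.array([1.0, 1.0])
for t in range(5):
    w_new = w - eta * (A @ w - b)  # gradient of 0.5 w^T A w - b^T w
    # Each step contracts the error by ~0.818 = (10 - 1)/(10 + 1)
    print(np.linalg.norm(w_new) / np.linalg.norm(w))
    w = w_new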
Optimization tricks
Why did the deep learning era begin only in 2010–2012?
• Many components of deep learning were invented a long time ago, but the deep learning revolution started only in 2010–2012.
• Better network components have been developed.
• Training of a deep neural network is a non-trivial optimization problem which requires multiple tricks: input normalization, weight initialization, mini-batch training (stochastic gradient descent), improved optimizers, batch normalization.
[Figure: frequency of the phrases "cybernetics", "neural networks" and "deep learning" according to Google Books, 1940–2020.]
Input normalization
• For a linear model trained with the squared-error loss, the Hessian matrix is equal to the second-order moment (correlation matrix) of the data:
  H = (1/N) Σ_{n=1}^N x^(n) x^(n)⊤ = C_x
• For fastest convergence, H = C_x should be equal to the identity matrix I. We can achieve this by decorrelating the input components using principal component analysis:
  x_PCA = D^(−1/2) E⊤ (x − µ)
where µ is the data mean and EDE⊤ is the eigenvalue decomposition of the data covariance matrix.
• Neural networks are nonlinear models, but normalizing their inputs usually improves convergence.
• A simple way of input normalization: center to zero mean and scale to unit variance.
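A numpy sketch of both normalization options, using randomly generated stand-in data:

import numpy as np

X = np.random.default_rng(0).normal(size=(1000, 3)) @ np.diag([5.0, 1.0, 0.1])

# Simple normalization: zero mean, unit variance per input component.
X_simple = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA whitening: x_PCA = D^{-1/2} E^T (x - mu), via the covariance eigendecomposition.
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)       # data covariance matrix
D, E = np.linalg.eigh(cov)          # cov = E diag(D) E^T
X_pca = (X - mu) @ E / np.sqrt(D)   # each row is D^{-1/2} E^T (x - mu)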
Weight initialization in a linear layer
• The weights w_ij of a linear layer are usually initialized with random numbers drawn from some distribution p(w_ij).
• Glorot and Bengio (2010): if we select p(w_ij) carelessly, the magnitudes of the signals in the forward/backward pass can grow/decay as the signals propagate through the network, which may have a negative impact on the optimization landscape.
• Popular initialization schemes balance the magnitudes of the signals in the forward and backward passes:
  Xavier's initialization: w_ij ~ uniform[ −√6/√(N_x + N_y), √6/√(N_x + N_y) ]
where N_x, N_y are the numbers of inputs/outputs of a linear layer.
• Important: initialization schemes assume normalized inputs!
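In PyTorch this scheme is available directly; a sketch with arbitrary layer sizes:

import torch.nn as nn

layer = nn.Linear(256, 128)            # N_x = 256 inputs, N_y = 128 outputs
nn.init.xavier_uniform_(layer.weight)  # uniform in [-sqrt(6/(N_x+N_y)), sqrt(6/(N_x+N_y))]
nn.init.zeros_(layer.bias)             # biases are commonly initialized to zero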
Mini-batch training
(stochastic gradient descent)
Mini-batch training
• The loss function contains N terms corresponding to the training samples, for example:
  L(θ) = −(1/N) Σ_{n=1}^N Σ_{j=1}^K y_j^(n) log f_j(x^(n), θ)
• Large data sets are redundant: gradients computed on two different subsets of the data are likely to be similar. Why waste computation?
• We can compute the gradient using only part of the training data (a mini-batch B_m, a subset of the data):
  ∂L/∂θ ≈ −(1/|B_m|) Σ_{n∈B_m} ∂/∂θ Σ_{j=1}^K y_j^(n) log f_j(x^(n), θ)
• By using mini-batches, we introduce "sampling noise" into the gradient computations, thus the method is called stochastic gradient descent.
Practical considerations for mini-batch training
• Epoch: going through all of the training examples once (usually using mini-batch training).
• It is good to shuffle the data between epochs when forming mini-batches (otherwise the gradient estimates are biased towards a particular mini-batch split).
• Mini-batches need to be balanced with respect to classes.
• The recent trend is to use batches as large as possible (depending on the GPU memory size): larger batch sizes reduce the amount of noise in the gradient estimates.
• Computing the gradient for multiple samples at the same time is computationally efficient (it requires matrix-matrix multiplications, which are efficient, especially on GPUs).
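In PyTorch, mini-batching and between-epoch shuffling are typically handled by DataLoader; a sketch with stand-in tensors:

import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 3)   # stand-in inputs
Y = torch.randn(1000, 1)   # stand-in targets
loader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)

for epoch in range(10):
    for x_batch, y_batch in loader:  # a new shuffle every epoch
        pass                         # compute the mini-batch loss, backprop, update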
Model fine-tuning during mini-batch training
• Because of the sampling noise, stochastic gradient descent keeps fluctuating around the optimum at the end of training.
• One way to reduce this effect is to anneal the learning rate η_t towards the end of training.
• The simplest schedule is to decrease the learning rate after every r updates.
• Another popular trick is to use an exponential moving average of the model parameters as the final model:
  θ'_t = γ θ'_{t−1} + (1 − γ) θ_t
[Figure: SGD trajectory fluctuating around the optimum.]
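A sketch of the exponential moving average trick in PyTorch; the model and the value of γ are stand-ins:

import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 1))  # stand-in model
ema_model = copy.deepcopy(model)        # theta'_0 initialized from theta_0
gamma = 0.99                            # assumed smoothing factor

# Call after every optimizer step: theta'_t = gamma*theta'_{t-1} + (1-gamma)*theta_t
def update_ema():
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(gamma).add_(p, alpha=1 - gamma)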
Improved optimization algorithms
Problems with gradient descent
• When the curvature of the objective function varies substantially in different directions, the optimization trajectory of the gradient descent can zigzag.
[Figure: zigzagging gradient descent trajectory on loss contours.]
Momentum method (Polyak, 1964)
• Idea:
  • We would like to move faster in directions with small but consistent gradients.
  • We would like to move slower in directions with large but inconsistent gradients.
• Implementation: aggregate negative gradients in a momentum vector m_t:
  m_{t+1} = α m_t − η_t g_t
  θ_{t+1} = θ_t + m_{t+1}
[Figure: momentum trajectory on loss contours.]
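The update equations translate directly into code; a numpy sketch on an assumed quadratic toy loss, with assumed values for α and η:

import numpy as np

A = np.diag([10.0, 1.0])   # quadratic toy loss 0.5 w^T A w
alpha, eta = 0.9, 0.01     # assumed momentum factor and learning rate
w = np.array([1.0, 1.0])
m = np.zeros(2)
for t in range(100):
    g = A @ w              # gradient g_t
    m = alpha * m - eta * g  # m_{t+1} = alpha*m_t - eta*g_t
    w = w + m              # theta_{t+1} = theta_t + m_{t+1}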
Adam (Kingma and Ba, 2014)
• The most popular algorithm today is Adam, which uses a unit-less update rule:
  θ_t ← θ_{t−1} − η_t m̂_t / (√v̂_t + ϵ)
• First- and second-order statistics of the gradient are computed using exponential moving averages:
  m_t = β1 m_{t−1} + (1 − β1) g_t
  v_t = β2 v_{t−1} + (1 − β2) g_t²
• A correction is used to improve the estimates at the beginning of training (β^t is β to the power of t):
  m̂_t = m_t / (1 − β1^t),   v̂_t = v_t / (1 − β2^t)
• Since the update rule is unit-less, the optimization procedure is not affected by the scale of the objective function.
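A numpy sketch of the full Adam update on a toy loss; the β, η and ϵ values below are common defaults, and the loss is a made-up example:

import numpy as np

beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

for t in range(1, 101):
    g = 2 * theta                       # gradient of the toy loss ||theta||^2
    m = beta1 * m + (1 - beta1) * g     # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g**2  # second-moment estimate v_t
    m_hat = m / (1 - beta1**t)          # bias correction
    v_hat = v / (1 - beta2**t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)  # unit-less update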
Why Adam works well
  θ_t ← θ_{t−1} − η m̂_t / (√v̂_t + ϵ)
  m_t = β1 m_{t−1} + (1 − β1) g_t
  v_t = β2 v_{t−1} + (1 − β2) g_t²
• In Adam, the effective step size |Δ_t| is bounded. In the most common case:
  |Δ_t| = η |m̂_t / √v̂_t| ≈ η E[g] / √E[g²] ≤ η,   because E[g²] = E[g]² + E[(g − E[g])²]
Thus, we never take too big steps (which can be the case for standard gradient descent).
• We go at the maximum speed (step size η) only if g is the same between updates (mini-batches), that is, when the gradients are consistent.
• At convergence, we start fluctuating around the optimum: E[g] ≈ 0 and E[g²] > 0, so the effective step size gets smaller. Thus, Adam has a mechanism for automatic annealing of the learning rate.
Batch normalization
Batch normalization (Ioffe and Szegedy, 2015)
• Idea: since input normalization has a positive effect on training, can we also normalize the intermediate signals? The problem is that these signals change during training, so we cannot perform the normalization before training.
• The solution is to normalize intermediate signals to zero mean and unit variance in each training mini-batch:
  1. Compute the means and variances of the intermediate signals x from the current mini-batch {x^(1), ..., x^(N)}:
     µ = (1/N) Σ_{i=1}^N x^(i),   σ² = (1/N) Σ_{i=1}^N (x^(i) − µ)²
  2. Normalize the signals to zero mean and unit variance:
     x̃ = (x − µ) / √(σ² + ϵ)
  3. Scale and shift the signals with trainable parameters γ and β:
     y = γ ⊙ x̃ + β
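A numpy sketch of the three steps for one mini-batch of feature vectors; the batch, γ, β and ϵ values are stand-ins:

import numpy as np

X = np.random.default_rng(0).normal(2.0, 3.0, size=(32, 5))  # mini-batch (stand-in)
eps = 1e-5
gamma = np.ones(5)   # trainable scale, typically initialized to 1
beta = np.zeros(5)   # trainable shift, typically initialized to 0

mu = X.mean(axis=0)                      # 1. per-feature batch mean
var = X.var(axis=0)                      #    and variance
X_tilde = (X - mu) / np.sqrt(var + eps)  # 2. normalize
Y = gamma * X_tilde + beta               # 3. scale and shift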
Batch normalization: Training and evaluation modes
• The mean and standard deviation are computed for each mini-batch. What should we do at test time, when we use a trained network on a test input?
• The batch normalization layer keeps track of the batch statistics (mean and variance) during training using running averages:
  µ ← (1 − α)µ + α (1/N) Σ_{i=1}^N x^(i)
  σ² ← (1 − α)σ² + α (1/N) Σ_{i=1}^N (x^(i) − µ)²
Batch normalization: Training and evaluation modes
• PyTorch: if you use a batch normalization layer, the behavior of the network in the training and evaluation modes will be different:
  • Training: use statistics from the mini-batch, update the running statistics µ and σ².
  • Evaluation: use the running statistics µ and σ² and keep them fixed.
• Important to remember: BN introduces dependencies between samples in a mini-batch in the computational graph.

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1, 100),
    nn.BatchNorm1d(100),
    nn.ReLU(),
    nn.Linear(100, 1),
)

# Switch to training mode
model.train()
# train the model
...

# Switch to evaluation mode
model.eval()
# test the model
Home assignments
Assignment 01 mlp
1. Implement the backpropagation algorithm and train a multilayer perceptron (MLP) in numpy, not using PyTorch yet.
  input layer:     x = (x1, x2, x3)
  hidden layer 1:  h1 = ϕ(W1 x + b1)
  hidden layer 2:  h2 = ϕ(W2 h1 + b2)
  output layer:    y = ψ(W3 h2 + b3)
[Diagram: MLP with a three-dimensional input, two hidden layers and output y.]
Assignment 01 mlp
2. Implement backpropagation for a multilayer perceptron network in numpy. For each block f of a neural network, you need to implement the following computations:
  • Forward: given the input x and the parameters θ, compute the output y = f(x, θ).
  • Backward: given ∂L/∂y, compute the gradients ∂L/∂θ and the gradient propagated to the previous block.
[Diagram: block f with input x, output y and parameters θ; the backward pass receives ∂L/∂y and produces the gradients wrt the input and the parameters.]
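One common way to structure such blocks is a small class with forward and backward methods; the Linear block below is a hypothetical illustration of the pattern, not the assignment template.

import numpy as np

class Linear:
    """y = W x + b for a batch of inputs, with a matching backward pass."""
    def __init__(self, in_features, out_features):
        self.W = np.random.randn(out_features, in_features) * 0.01
        self.b = np.zeros(out_features)

    def forward(self, x):
        self.x = x                     # cache the input for the backward pass
        return x @ self.W.T + self.b

    def backward(self, dL_dy):
        self.dL_dW = dL_dy.T @ self.x  # gradient wrt the weights
        self.dL_db = dL_dy.sum(axis=0) # gradient wrt the biases
        return dL_dy @ self.W          # dL/dx, propagated to the previous block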