Deep Learning Intro Slides
7.1.2025
Pre-requisites
Course schedule
Rules of the course
Communication channels
Course grading
Assignments
Exercise sessions
• The exercise sessions are organized to help you solve the assignments.
• You do not have to attend the exercise sessions.
• There will be four online exercise sessions for each assignment:
1. Friday 10:15-11:45
2. Friday 12:15-13:45
3. Monday 10:15-11:45
4. Monday 12:15-13:45
• No exercise sessions during the exam week (17.–21.2.).
• Please read the protocol for the exercise sessions in MyCourses carefully.
Course material
• We do not have special sessions on PyTorch; you should learn it by following the PyTorch tutorials.
• If you know numpy (a pre-requisite), PyTorch should be easy to learn.
• Deep learning frameworks develop very quickly, so you need to learn new frameworks and features all the time.
• If you need help with PyTorch, please ask for assistance in the exercise sessions.
• Lecture notes, material by Alexander Ilin, are available in MyCourses.
• Lecture slides will be shared in MyCourses before each lecture.
• The lectures will be recorded and made available in MyCourses.
• The Deep Learning book by Goodfellow, Bengio and Courville (2016).
What is deep learning
Feature engineering
• Suppose that you need to solve a custom machine learning problem, for example:
• Spam detection.
• Reading information from scanned invoices.
• We can solve such a problem by designing a set of features from the data and using those features as input to a machine learning model.
• Spam detection: useful features are the counts of certain words.
• Line item extraction from invoices: useful features for classifying a number as a line item or not are its position on the invoice and the words that appear in its proximity.
Data → Feature engineering → Machine learning model (e.g. random forest classifier)
• Benefit of feature engineering: One can use domain knowledge to design features that are robust
(for example, invariant to certain distortions).
• What are the problems with feature engineering?
Feature engineering: Problem 1
Feature engineering: Problem 2
• Handcrafted features are not perfect. There are always examples that are not processed correctly,
which motivates engineering of new features.
[Diagram: feedback loop — features feed a classifier; misclassified examples motivate the engineering of new features.]
Representation learning
Deep learning = artificial neural networks
• Deep learning is a term used for machine learning with artificial neural networks, including modern learning frameworks that are not necessarily neurally inspired.
[Figure: frequency of the phrases "cybernetics", "neural networks" and "deep learning" according to Google Books, 1940–2020.]
Linear classifiers
Logistic regression
• Consider a binary classification problem: our training data consist of examples (x^(1), y^(1)), ..., (x^(N), y^(N)) with m-dimensional inputs x and binary labels y.
• Logistic regression model:
  f(x) = σ( Σ_{j=1}^m w_j x_j + b ) = σ( w⊤x + b )
where m is the number of features in x and σ(x) = 1/(1 + e^(−x)) is the logistic (a.k.a. sigmoid) function.
• Using the logistic function guarantees that the output is between 0 and 1, and it can be seen as the probability that x belongs to one of the classes: p(y = 1 | x) = f(x).
[Figures: training examples in a two-dimensional scatter plot; the logistic function.]
Training of a binary classifier: Optimization problem
• Training of our classifier: find parameters w and b that would classify our training examples as accurately as possible.
• We train the classifier by solving the following optimization problem.
• Assume a Bernoulli distribution for the labels y:
  p(y | x, w, b) = f(x)^y (1 − f(x))^(1−y)
where f(x) is the output of the classifier.
• Write the likelihood function for N training examples:
  p(data | w, b) = Π_{i=1}^N p(y^(i) | x^(i), w, b)
• Maximize the log-likelihood function F(w, b) = log p(data | w, b) or minimize the negative of that:
  L(w, b) = − Σ_{i=1}^N [ y^(i) log f(x^(i)) + (1 − y^(i)) log(1 − f(x^(i))) ]
Toy binary classification problem
• Consider a toy binary classification problem with two parameters w1 and w2 (no bias term):
  f(x) = σ( w1 x1 + w2 x2 ),   σ(x) = 1/(1 + e^(−x))
• The loss function
  L(w1, w2) = − Σ_{i=1}^n [ y^(i) log f(x^(i)) + (1 − y^(i)) log(1 − f(x^(i))) ]
can be visualized in this toy example using a contour plot.
[Figures: training examples (left); contour plot of the loss over (w1, w2) (right).]
Gradient
• The gradient of the loss:
  g(w) = [ ∂L/∂w1, ∂L/∂w2 ]⊤
• The gradient of L points in the direction of the greatest rate of increase of L; its magnitude is the slope of the graph of L in that direction.
[Figure: loss contours with gradient directions.]
Gradient descent
• Gradient descent: move in the direction of the negative gradient. A single step does not usually take us to the minimum, so we need to iterate:
  w_{t+1} = w_t − η_t g(w_t)
where η_t is the learning rate.
[Figure: gradient descent trajectory on the loss contours.]
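To make the update rule concrete, here is a minimal numpy sketch that trains the toy logistic regression model above with plain gradient descent; the synthetic blob data, learning rate and iteration count are illustrative assumptions, not values from the course.

import numpy as np

rng = np.random.default_rng(0)
# Toy data: two Gaussian blobs with labels 0/1 (made-up, not the course data).
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(2)                 # two parameters w1, w2 (no bias term)
eta = 0.1                       # assumed learning rate
for t in range(500):
    f = sigmoid(X @ w)          # model outputs f(x) = sigma(w^T x)
    g = X.T @ (f - y) / len(y)  # gradient of the mean negative log-likelihood
    w -= eta * g                # gradient descent step: w <- w - eta * g(w)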
Multilayer perceptrons
A historical note: First models of neurons
• The first algorithm for training a linear binary classifier was proposed by Rosenblatt in 1958.
• The model was called Perceptron and its training procedure was inspired by neuroscience (Donald
Hebb’s rule) rather than by mathematical optimization.
• The problem with Perceptrons: they are linear classifiers and can solve only a very limited set of
classification problems.
• This problem was well understood already in the 1960s. Minsky and Papert (1969) in their book
called “Perceptrons” argued that more complex (nonlinear) problems have to be solved with
multiple layers of perceptrons.
Multilayer perceptrons
• A multilayer perceptron (MLP) is a neural network that consists of multiple layers of perceptrons (neurons).
Activation functions
Matrix-vector notation
Training of multilayer perceptrons
• Again, we can tune the parameters θ_k (which include W_k, b_k) of the classifier by maximizing the log-likelihood, for example, using gradient descent:
  θ_k ← θ_k − η ∂L/∂θ_k
• The gradient ∂L/∂θ_k can be computed efficiently using the error backpropagation algorithm.
The backpropagation algorithm
(Rumelhart et al., 1986)
Backpropagation: An example with scalars
• Consider a computational graph in which a scalar input x is transformed into an intermediate value h = f1(x, w), then into an output y = f2(h, θ), from which the loss L is computed.
• We can compute the derivatives wrt the model parameters θ and w using the chain rule, re-using the intermediate derivative ∂L/∂h:
  ∂L/∂θ = (∂L/∂y)(∂y/∂θ)
  ∂L/∂w = (∂L/∂y)(∂y/∂h)(∂h/∂w),   where (∂L/∂y)(∂y/∂h) = ∂L/∂h
[Diagram: x → f1(·; w) → h → f2(·; θ) → y → L, with the derivatives ∂L/∂y, ∂L/∂h, ∂L/∂θ, ∂L/∂w computed in the backward direction.]
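A minimal numpy sketch of these scalar chain-rule computations, assuming (hypothetically) f1(x, w) = w·x, f2(h, θ) = θ·h and L(y) = ½y²:

import numpy as np

x, w, theta = 2.0, 0.5, -1.0

# Forward pass through the graph x -> f1 -> h -> f2 -> y -> L
h = w * x               # h = f1(x, w)
y = theta * h           # y = f2(h, theta)
L = 0.5 * y**2          # scalar loss

# Backward pass: propagate dL/dy towards the inputs, re-using dL/dh
dL_dy = y               # dL/dy for L = 0.5 y^2
dL_dh = dL_dy * theta   # dL/dh = dL/dy * dy/dh
dL_dtheta = dL_dy * h   # dL/dtheta = dL/dy * dy/dtheta
dL_dw = dL_dh * x       # dL/dw = dL/dh * dh/dw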
Chain rule for multi-variable functions
• For multi-variable functions, the chain rule can be written in terms of Jacobian matrices. For y = f(u), u = g(x) with y ∈ R^M, u ∈ R^K, x ∈ R^N, the Jacobian matrix is
  J_{f∘g} = [ ∂y_1/∂x_1 ··· ∂y_1/∂x_N ; ... ; ∂y_M/∂x_1 ··· ∂y_M/∂x_N ]
an M × N matrix of partial derivatives.
Backpropagation for multi-variable functions
• We apply the chain rule to compute the derivatives wrt the model parameters (and re-use intermediate derivatives):
  ∂L/∂θ_j = Σ_{k=1}^K (∂L/∂y_k)(∂y_k/∂θ_j)
  ∂L/∂h_l = Σ_{k=1}^K (∂L/∂y_k)(∂y_k/∂h_l)
  ∂L/∂w_i = Σ_{l=1}^L (∂L/∂h_l)(∂h_l/∂w_i)
[Diagram: x → f1(·; w) → h → f2(·; θ) → y → L.]
• We can compute the derivatives sequentially going from the outputs of the network towards the inputs (thus the name of the algorithm: backpropagation).
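Each of these sums is a vector–Jacobian product. A small numpy sketch of one backward step, assuming (hypothetically) an elementwise layer y = tanh(h):

import numpy as np

h = np.array([0.5, -1.0, 2.0])
y = np.tanh(h)                      # layer output
dL_dy = np.array([0.1, -0.2, 0.3])  # gradient arriving from the layer above (made-up)

J = np.diag(1 - y**2)               # Jacobian dy_k/dh_l of elementwise tanh
dL_dh = J.T @ dL_dy                 # dL/dh_l = sum_k (dL/dy_k)(dy_k/dh_l)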
PyTorch
• PyTorch is a programming framework which allows one to create complex multilayer models without the need to implement the optimization procedure and the function gradients: backpropagation is already implemented in the framework.

import torch
import torch.nn as nn

# An MLP with 3 inputs, 5 hidden units and 1 output
mlp = nn.Sequential(
    nn.Linear(3, 5),
    nn.ReLU(),
    nn.Linear(5, 1),
)
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01)
loss_fn = nn.MSELoss()           # stand-in loss function
x = torch.randn(100, 3)          # stand-in inputs
targets = torch.randn(100, 1)    # stand-in targets
for i in range(100):
    optimizer.zero_grad()
    y = mlp(x)                   # forward pass
    loss = loss_fn(y, targets)   # compute loss
    loss.backward()              # backpropagate the gradients
    optimizer.step()             # update the parameters

[Diagram: network with inputs x1, x2, x3 and output y.]
Classification problems: Cross-entropy loss
• Let us extend this loss to the case of K classes. We can represent the target as a one-hot vector y, where y_j = 1 if the example belongs to class j and y_j = 0 otherwise:
  class 1: y = [1, 0]⊤,   class 2: y = [0, 1]⊤,   Σ_{j=1}^K y_j = 1
• Similarly, we can represent the output of the network as a vector f whose j-th element f_j is the modeled probability that input x belongs to class j:
  f = [f_1, f_2]⊤,   0 ≤ f_j ≤ 1,   Σ_{j=1}^K f_j = 1
Classification problems: Cross-entropy loss
• Now we write the cross-entropy loss of N data samples in the following form:
  L(θ) = −(1/N) Σ_{n=1}^N Σ_{j=1}^K y_j^(n) log f_j(x^(n), θ)
• To guarantee that the outputs f_j are valid probabilities, we apply the softmax nonlinearity to the outputs h_j of the last layer of the neural network:
  f_j = exp(h_j) / Σ_{j'=1}^K exp(h_{j'})
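As a sketch, the softmax and this averaged cross-entropy loss take only a few lines of numpy; the logits and one-hot targets below are made-up examples. (In PyTorch, nn.CrossEntropyLoss applies the softmax to raw logits internally.)

import numpy as np

def softmax(h):
    # Subtract the row max for numerical stability; softmax is shift-invariant.
    e = np.exp(h - h.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

H = np.array([[2.0, 0.5, -1.0],   # logits h for two samples (made-up)
              [0.1, 0.2, 3.0]])
Y = np.array([[1, 0, 0],          # one-hot targets y
              [0, 0, 1]])

F = softmax(H)                                  # probabilities f_j
loss = -np.mean(np.sum(Y * np.log(F), axis=1))  # cross-entropy loss L(theta)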
Regression problems: Mean-squared error loss
• The mean-squared error loss:
  L(θ) = (1/N) Σ_{n=1}^N (1/n_y) Σ_{j=1}^{n_y} ( y_j^(n) − f_j(x^(n), θ) )²
where
• y_j^(n) is the j-th element of y^(n)
• f_j is the j-th element of the network output f(x, θ)
• n_y is the number of elements in y^(n)
• N is the number of training examples
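In numpy this is just a mean of squared differences; a sketch with made-up arrays:

import numpy as np

Y = np.array([[1.0, 2.0], [0.0, -1.0]])   # targets y^(n) (made-up)
F = np.array([[0.9, 2.1], [0.2, -0.8]])   # network outputs f(x^(n), theta)
mse = np.mean((Y - F) ** 2)               # average over samples and elements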
Convergence of gradient descent
Toy example: minimizing a quadratic function
• Consider minimizing a quadratic function L(w) = ½ w⊤Aw − b⊤w, whose level sets are ellipses.
• The axes of the ellipses are determined by the eigenvectors of matrix A.
[Figure: contour plot of a quadratic loss.]
Effect of learning rate
• Suppose that we use gradient descent to find the minimum of the loss:
  θ_{t+1} = θ_t − η g(θ_t)
• The learning rate η has a major effect on the convergence of the gradient descent.
[Figures: two gradient descent trajectories on loss contours. Left, small η: too slow convergence. Right, large η: oscillates and can even diverge.]
Convergence of gradient descent (see, e.g., Goh, 2017)
• For the quadratic loss L(w) = ½ w⊤Aw − b⊤w, the optimal learning rate depends on the curvature of the loss.
• If we select the learning rate optimally, the rate of convergence of the gradient descent,
  rate(η) = ‖w_{t+1} − w*‖ / ‖w_t − w*‖,
is determined by the condition number κ(A) of matrix A:
  rate(η*) = (κ(A) − 1) / (κ(A) + 1)
• rate(η*) = 1: no convergence.
[Figure: large κ(A) gives slow convergence because of zigzagging.]
• For non-quadratic functions, the error surface is locally well approximated by a quadratic function:
  L(w) ≈ L(w_t) + g⊤(w − w_t) + ½ (w − w_t)⊤ H (w − w_t)
where H is the Hessian matrix:
  H = [ ∂²L/∂w_1∂w_1 ··· ∂²L/∂w_1∂w_M ; ... ; ∂²L/∂w_M∂w_1 ··· ∂²L/∂w_M∂w_M ]
• For the quadratic loss L(w) = ½ w⊤Aw − b⊤w the Hessian matrix is H = A. Thus, the convergence of the gradient descent is affected by the properties of the Hessian.
[Figure: contour plot of a non-quadratic loss.]
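A numpy sketch of this convergence rate, using an assumed diagonal A (so κ(A) is the ratio of its eigenvalues) and the step size η = 2/(λ_max + λ_min), which is known to be optimal for quadratics (a fact not stated on the slide):

import numpy as np

A = np.diag([10.0, 1.0])      # eigenvalues 10 and 1, so kappa(A) = 10
b = np.zeros(2)               # minimum at w* = 0
eta = 2.0 / (10.0 + 1.0)      # optimal constant step size for this quadratic
w = np.array([1.0, 1.0])
for t in range(5):
    w_new = w - eta * (A @ w - b)  # gradient of 0.5 w^T A w - b^T w
    # Each step contracts the error by ~0.818 = (10 - 1)/(10 + 1)
    print(np.linalg.norm(w_new) / np.linalg.norm(w))
    w = w_new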
Optimization tricks
Why did the deep learning era begin only in 2010–2012?
• Many components of deep learning were invented a long time ago, but the deep learning revolution started only in 2010–2012.
• Better network components have been developed.
• Training of a deep neural network is a non-trivial optimization problem which requires multiple tricks: input normalization, weight initialization, mini-batch training (stochastic gradient descent), improved optimizers, batch normalization.
[Figure: frequency of the phrases "cybernetics", "neural networks" and "deep learning" according to Google Books, 1940–2020.]
Input normalization
• For a linear model trained with the squared-error loss, the Hessian matrix is equal to the second-order moment (correlation matrix) of the data:
  H = (1/N) Σ_{n=1}^N x^(n) x^(n)⊤ = C_x
• For fastest convergence, H = C_x should be equal to the identity matrix I. We can achieve this by decorrelating the input components using principal component analysis:
  x_PCA = D^(−1/2) E⊤ (x − µ)
where µ is the data mean and EDE⊤ is the eigenvalue decomposition of the data covariance matrix.
• Neural networks are nonlinear models, but normalizing their inputs usually improves convergence.
• A simple way of input normalization: center to zero mean and scale to unit variance.
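A numpy sketch of both normalization options, using randomly generated stand-in data:

import numpy as np

X = np.random.default_rng(0).normal(size=(1000, 3)) @ np.diag([5.0, 1.0, 0.1])

# Simple normalization: zero mean, unit variance per input component.
X_simple = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA whitening: x_PCA = D^{-1/2} E^T (x - mu), via the covariance eigendecomposition.
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)       # data covariance matrix
D, E = np.linalg.eigh(cov)          # cov = E diag(D) E^T
X_pca = (X - mu) @ E / np.sqrt(D)   # each row is D^{-1/2} E^T (x - mu)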
Weight initialization in a linear layer
• The weights w_ij of a linear layer are usually initialized with random numbers drawn from some distribution p(w_ij).
• Glorot and Bengio (2010): if we select p(w_ij) carelessly, the magnitudes of the signals in the forward/backward pass can grow/decay as the signals propagate through the network, which may have a negative impact on the optimization landscape.
• Popular initialization schemes balance the magnitudes of the signals in the forward and backward passes:
  Xavier's initialization: w_ij ~ uniform[ −√6/√(N_x + N_y), √6/√(N_x + N_y) ]
where N_x, N_y are the numbers of inputs/outputs of a linear layer.
• Important: initialization schemes assume normalized inputs!
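In PyTorch this scheme is available directly; a sketch with arbitrary layer sizes:

import torch.nn as nn

layer = nn.Linear(256, 128)            # N_x = 256 inputs, N_y = 128 outputs
nn.init.xavier_uniform_(layer.weight)  # uniform in [-sqrt(6/(N_x+N_y)), sqrt(6/(N_x+N_y))]
nn.init.zeros_(layer.bias)             # biases are commonly initialized to zero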
Mini-batch training
(stochastic gradient descent)
Mini-batch training
• The loss function contains N terms corresponding to the training samples, for example:
  L(θ) = −(1/N) Σ_{n=1}^N Σ_{j=1}^K y_j^(n) log f_j(x^(n), θ)
• Large data sets are redundant: gradients computed on two different subsets of the data are likely to be similar. Why waste computation?
• We can compute the gradient using only part of the training data (a mini-batch B_m, a subset of the data):
  ∂L/∂θ ≈ −(1/|B_m|) Σ_{n∈B_m} ∂/∂θ Σ_{j=1}^K y_j^(n) log f_j(x^(n), θ)
• By using mini-batches, we introduce "sampling noise" into the gradient computations, thus the method is called stochastic gradient descent.
Practical considerations for mini-batch training
• Epoch: going through all of the training examples once (usually using mini-batch training).
• It is good to shuffle the data between epochs when forming mini-batches (otherwise the gradient estimates are biased towards a particular mini-batch split).
• Mini-batches need to be balanced with respect to classes.
• The recent trend is to use batches as large as possible (depending on the GPU memory size): larger batch sizes reduce the amount of noise in the gradient estimates.
• Computing the gradient for multiple samples at the same time is computationally efficient (it requires matrix-matrix multiplications, which are efficient, especially on GPUs).
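In PyTorch, mini-batching and between-epoch shuffling are typically handled by DataLoader; a sketch with stand-in tensors:

import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 3)   # stand-in inputs
Y = torch.randn(1000, 1)   # stand-in targets
loader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)

for epoch in range(10):
    for x_batch, y_batch in loader:  # a new shuffle every epoch
        pass                         # compute the mini-batch loss, backprop, update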
Model fine-tuning during mini-batch training
• Because of the sampling noise, stochastic gradient descent keeps fluctuating around the optimum at the end of training.
• One way to reduce this effect is to anneal the learning rate η_t towards the end of training.
• The simplest schedule is to decrease the learning rate after every r updates.
• Another popular trick is to use an exponential moving average of the model parameters as the final model:
  θ'_t = γ θ'_{t−1} + (1 − γ) θ_t
[Figure: SGD trajectory fluctuating around the optimum.]
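A sketch of the exponential moving average trick in PyTorch; the model and the value of γ are stand-ins:

import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 1))  # stand-in model
ema_model = copy.deepcopy(model)        # theta'_0 initialized from theta_0
gamma = 0.99                            # assumed smoothing factor

# Call after every optimizer step: theta'_t = gamma*theta'_{t-1} + (1-gamma)*theta_t
def update_ema():
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(gamma).add_(p, alpha=1 - gamma)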
Improved optimization algorithms
Problems with gradient descent
• When the curvature of the objective function varies substantially in different directions, the optimization trajectory of the gradient descent can zigzag.
[Figure: zigzagging gradient descent trajectory on loss contours.]
Momentum method (Polyak, 1964)
• Idea:
  • We would like to move faster in directions with small but consistent gradients.
  • We would like to move slower in directions with large but inconsistent gradients.
• Implementation: aggregate negative gradients in a momentum vector m_t:
  m_{t+1} = α m_t − η_t g_t
  θ_{t+1} = θ_t + m_{t+1}
[Figure: momentum trajectory on loss contours.]
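The update equations translate directly into code; a numpy sketch on an assumed quadratic toy loss, with assumed values for α and η:

import numpy as np

A = np.diag([10.0, 1.0])   # quadratic toy loss 0.5 w^T A w
alpha, eta = 0.9, 0.01     # assumed momentum factor and learning rate
w = np.array([1.0, 1.0])
m = np.zeros(2)
for t in range(100):
    g = A @ w              # gradient g_t
    m = alpha * m - eta * g  # m_{t+1} = alpha*m_t - eta*g_t
    w = w + m              # theta_{t+1} = theta_t + m_{t+1}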
Adam (Kingma and Ba, 2014)
• The most popular algorithm today is Adam, which uses a unit-less update rule:
  θ_t ← θ_{t−1} − η_t m̂_t / (√v̂_t + ϵ)
• First- and second-order statistics of the gradient are computed using exponential moving averages:
  m_t = β1 m_{t−1} + (1 − β1) g_t
  v_t = β2 v_{t−1} + (1 − β2) g_t²
• A correction is used to improve the estimates at the beginning of training (β^t is β to the power of t):
  m̂_t = m_t / (1 − β1^t),   v̂_t = v_t / (1 − β2^t)
• Since the update rule is unit-less, the optimization procedure is not affected by the scale of the objective function.
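A numpy sketch of the full Adam update on a toy loss; the β, η and ϵ values below are common defaults, and the loss is a made-up example:

import numpy as np

beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

for t in range(1, 101):
    g = 2 * theta                       # gradient of the toy loss ||theta||^2
    m = beta1 * m + (1 - beta1) * g     # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g**2  # second-moment estimate v_t
    m_hat = m / (1 - beta1**t)          # bias correction
    v_hat = v / (1 - beta2**t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)  # unit-less update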
Why Adam works well
  θ_t ← θ_{t−1} − η m̂_t / (√v̂_t + ϵ)
  m_t = β1 m_{t−1} + (1 − β1) g_t
  v_t = β2 v_{t−1} + (1 − β2) g_t²
• In Adam, the effective step size |Δ_t| is bounded. In the most common case:
  |Δ_t| = η |m̂_t / √v̂_t| ≈ η E[g] / √E[g²] ≤ η,   because E[g²] = E[g]² + E[(g − E[g])²]
Thus, we never take too big steps (which can be the case for standard gradient descent).
• We go at the maximum speed (step size η) only if g is the same between updates (mini-batches), that is, when the gradients are consistent.
• At convergence, we start fluctuating around the optimum: E[g] ≈ 0 and E[g²] > 0, so the effective step size gets smaller. Thus, Adam has a mechanism for automatic annealing of the learning rate.
Batch normalization
Batch normalization (Ioffe and Szegedy, 2015)
• Idea: since input normalization has a positive effect on training, can we also normalize the intermediate signals? The problem is that these signals change during training, so we cannot perform the normalization before training.
• The solution is to normalize intermediate signals to zero mean and unit variance in each training mini-batch:
  1. Compute the means and variances of the intermediate signals x from the current mini-batch {x^(1), ..., x^(N)}:
     µ = (1/N) Σ_{i=1}^N x^(i),   σ² = (1/N) Σ_{i=1}^N (x^(i) − µ)²
  2. Normalize the signals to zero mean and unit variance:
     x̃ = (x − µ) / √(σ² + ϵ)
  3. Scale and shift the signals with trainable parameters γ and β:
     y = γ ⊙ x̃ + β
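A numpy sketch of the three steps for one mini-batch of feature vectors; the batch, γ, β and ϵ values are stand-ins:

import numpy as np

X = np.random.default_rng(0).normal(2.0, 3.0, size=(32, 5))  # mini-batch (stand-in)
eps = 1e-5
gamma = np.ones(5)   # trainable scale, typically initialized to 1
beta = np.zeros(5)   # trainable shift, typically initialized to 0

mu = X.mean(axis=0)                      # 1. per-feature batch mean
var = X.var(axis=0)                      #    and variance
X_tilde = (X - mu) / np.sqrt(var + eps)  # 2. normalize
Y = gamma * X_tilde + beta               # 3. scale and shift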
Batch normalization: Training and evaluation modes
• The mean and standard deviation are computed for each mini-batch. What should we do at test time, when we use a trained network on a test input?
• The batch normalization layer keeps track of the batch statistics (mean and variance) during training using running averages:
  µ ← (1 − α)µ + α (1/N) Σ_{i=1}^N x^(i)
  σ² ← (1 − α)σ² + α (1/N) Σ_{i=1}^N (x^(i) − µ)²
Batch normalization: Training and evaluation modes
• PyTorch: if you use a batch normalization layer, the behavior of the network in the training and evaluation modes will be different:
  • Training: use statistics from the mini-batch, update the running statistics µ and σ².
  • Evaluation: use the running statistics µ and σ² and keep them fixed.
• Important to remember: BN introduces dependencies between samples in a mini-batch in the computational graph.

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1, 100),
    nn.BatchNorm1d(100),
    nn.ReLU(),
    nn.Linear(100, 1),
)

# Switch to training mode
model.train()
# train the model
...

# Switch to evaluation mode
model.eval()
# test the model
Home assignments
Assignment 01 mlp
1. Implement the backpropagation algorithm and train a multilayer perceptron (MLP) in numpy, not using PyTorch yet.
  input layer:     x = (x1, x2, x3)
  hidden layer 1:  h1 = ϕ(W1 x + b1)
  hidden layer 2:  h2 = ϕ(W2 h1 + b2)
  output layer:    y = ψ(W3 h2 + b3)
[Diagram: MLP with a three-dimensional input, two hidden layers and output y.]
Assignment 01 mlp
2. Implement backpropagation for a multilayer perceptron network in numpy. For each block f of a neural network, you need to implement the following computations:
  • Forward: given the input x and the parameters θ, compute the output y = f(x, θ).
  • Backward: given ∂L/∂y, compute the gradients ∂L/∂θ and the gradient propagated to the previous block.
[Diagram: block f with input x, output y and parameters θ; the backward pass receives ∂L/∂y and produces the gradients wrt the input and the parameters.]
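One common way to structure such blocks is a small class with forward and backward methods; the Linear block below is a hypothetical illustration of the pattern, not the assignment template.

import numpy as np

class Linear:
    """y = W x + b for a batch of inputs, with a matching backward pass."""
    def __init__(self, in_features, out_features):
        self.W = np.random.randn(out_features, in_features) * 0.01
        self.b = np.zeros(out_features)

    def forward(self, x):
        self.x = x                     # cache the input for the backward pass
        return x @ self.W.T + self.b

    def backward(self, dL_dy):
        self.dL_dW = dL_dy.T @ self.x  # gradient wrt the weights
        self.dL_db = dL_dy.sum(axis=0) # gradient wrt the biases
        return dL_dy @ self.W          # dL/dx, propagated to the previous block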