Lecture 02
Deep Learning 1
Part 1 The Perceptron
The Perceptron
F. Rosenblatt (1928–1971)
The Perceptron
Let $x^{(1)}, x^{(2)}, \dots, x^{(N)} \in \mathbb{R}^d$ be our input data and $t^{(1)}, t^{(2)}, \dots, t^{(N)} \in \{-1, 1\}$ our corresponding labels (aka. targets). The goal of the perceptron is to learn a collection of parameters $(w, b)$ such that all points are correctly classified, i.e.

$$\forall_{k=1}^{N} : \; y^{(k)} = t^{(k)}$$
The Perceptron Algorithm
Recall that for each data point, the predictions of our perceptron are computed as:

$$z^{(k)} = w^\top x^{(k)} + b, \qquad y^{(k)} = \mathrm{sign}(z^{(k)})$$

Perceptron algorithm
▶ Iterate (multiple times) over the data points $k = 1, \dots, N$.
▶ If $x^{(k)}$ is correctly classified ($y^{(k)} = t^{(k)}$), continue.
▶ If $x^{(k)}$ is wrongly classified ($y^{(k)} \neq t^{(k)}$), update the perceptron:

$$w \leftarrow w + \eta \cdot x^{(k)} t^{(k)}$$
$$b \leftarrow b + \eta \cdot t^{(k)}$$
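As a concrete illustration, here is a minimal NumPy sketch of this update rule (the function name and defaults are our own; the lecture does not prescribe an implementation):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=10):
    """Perceptron algorithm: X has shape (N, d), t contains labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):                       # iterate multiple times
        for k in range(len(X)):                   # over the data points k = 1..N
            y_k = np.sign(w @ X[k] + b)           # prediction for data point k
            if y_k != t[k]:                       # wrongly classified: update
                w = w + eta * X[k] * t[k]
                b = b + eta * t[k]
    return w, b
```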
Perceptron at Work
[Figure: six snapshots of the perceptron decision boundary on 2D data during training; both axes range from −4 to 4.]
The Perceptron: Optimization View
Proposition
The perceptron can be seen as a gradient descent of the error function

$$E(w, b) = \frac{1}{N} \sum_{k=1}^{N} \underbrace{\max(0, -z^{(k)} t^{(k)})}_{E_k(w, b)}$$

Proof.

$$w - \eta \frac{\partial E_k}{\partial w} = w - \eta \cdot 1_{-z^{(k)} t^{(k)} > 0} \cdot \Big({-t^{(k)}}\Big) \frac{\partial z^{(k)}}{\partial w}$$
$$= w - \eta \cdot 1_{y^{(k)} \neq t^{(k)}} \cdot \Big({-t^{(k)}}\Big) \frac{\partial z^{(k)}}{\partial w}$$
$$= w + \eta \cdot 1_{y^{(k)} \neq t^{(k)}} \cdot x^{(k)} t^{(k)}$$

which is exactly the perceptron update for $w$; the update for $b$ follows analogously.
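To connect the two views numerically, one can check that a finite-difference gradient of $E_k$ points opposite to the perceptron update direction (a small sketch with our own toy values; not part of the lecture):

```python
import numpy as np

# One misclassified toy data point and current parameters (our own values)
x, t = np.array([1.0, -2.0]), -1.0
w, b, eps = np.array([0.5, 0.3]), 0.2, 1e-6

def E_k(w, b):
    return max(0.0, -(w @ x + b) * t)

# Finite-difference gradient of E_k with respect to w
grad_w = np.array([(E_k(w + eps * np.eye(2)[i], b) - E_k(w, b)) / eps
                   for i in range(2)])

# Descending the gradient moves w in the perceptron update direction x * t
print(-grad_w, x * t)  # both approximately [-1.  2.]
```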
Perceptron vs. Nearest Mean Classifier
The nearest mean classifier first builds the means of the two classes:

$$\mu_1 = \frac{1}{|\mathcal{C}_1|} \sum_{k \in \mathcal{C}_1} x^{(k)} \qquad \text{and} \qquad \mu_2 = \frac{1}{|\mathcal{C}_2|} \sum_{k \in \mathcal{C}_2} x^{(k)}$$

Then it predicts the class whose mean is nearest to the input $x$.
Perceptron vs. Nearest Mean Classifier
The nearest mean classifier can then be further developed to become expressible as a linear classifier:

$$y = \mathrm{sign}(\|x - \mu_2\|^2 - \|x - \mu_1\|^2)$$
$$= \mathrm{sign}(\|x\|^2 - 2\mu_2^\top x + \|\mu_2\|^2 - \|x\|^2 + 2\mu_1^\top x - \|\mu_1\|^2)$$
$$= \mathrm{sign}\big(2(\mu_1 - \mu_2)^\top x + \|\mu_2\|^2 - \|\mu_1\|^2\big)$$

i.e. a linear classifier with $w = 2(\mu_1 - \mu_2)$ and $b = \|\mu_2\|^2 - \|\mu_1\|^2$.
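A small NumPy sketch of this equivalence (the toy data is our own):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[-2, 0], size=(50, 2))   # class 1 samples (toy data)
X2 = rng.normal(loc=[+2, 0], size=(50, 2))   # class 2 samples (toy data)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Equivalent linear classifier derived above
w = 2 * (mu1 - mu2)
b = np.sum(mu2**2) - np.sum(mu1**2)

x = np.array([0.5, 1.0])
pred_linear = np.sign(w @ x + b)
pred_nearest = np.sign(np.sum((x - mu2)**2) - np.sum((x - mu1)**2))
assert pred_linear == pred_nearest
```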
Perceptron vs. Nearest Mean Classifier
Question:
▶ Both the perceptron and the nearest mean classifier are linear classifiers. How, then, do the perceptron and the nearest mean classifier differ?

Observation
▶ The perceptron always separates the data when it is linearly separable. This is not always the case for the nearest mean classifier, especially when the data of each class is elongated.

The nearest mean classifier describes the data; the perceptron minimizes the error.
Part 2 Training Multilayer Networks
From Perceptrons to Training Deep Networks
Recap: The perceptron performs a gradient descent of the error function

$$E(w, b) = \frac{1}{N} \sum_{k=1}^{N} \underbrace{\max(0, -z^{(k)} t^{(k)})}_{E_k(w, b)}$$

Idea: Generalize the formulation so that $z$ is not the output of the perceptron, but of any multilayer neural network.

Question: How to compute $\partial E / \partial \theta$, i.e. the gradient of the newly defined error function w.r.t. the model parameters?
Naive Approach: Numerical Differentiation
$$\forall t : \quad \frac{\partial E}{\partial \theta_t} = \lim_{\epsilon \to 0} \frac{E(\theta + \epsilon \cdot \delta_t) - E(\theta)}{\epsilon}$$

where $\delta_t$ denotes the $t$-th unit vector.

Properties:
▶ Can be applied to any error function $E$ (not necessarily the error of a neural network).
▶ Needs to evaluate the function as many times as there are parameters (→ slow when the number of parameters is large).
▶ A neural network typically has between $10^3$ and $10^9$ parameters (→ numerical differentiation infeasible).
▶ Because $\epsilon$ and the numerator are very small, for numerical differentiation to work, one must use high precision (e.g. float64 rather than float32).
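A minimal sketch of this scheme in NumPy (the helper is our own; it makes visible why the cost scales with the number of parameters):

```python
import numpy as np

def numerical_gradient(E, theta, eps=1e-6):
    """Approximate dE/dtheta by forward differences; one call to E per parameter."""
    grad = np.zeros_like(theta)
    E0 = E(theta)
    for t in range(theta.size):          # loops over ALL parameters -> slow
        delta_t = np.zeros_like(theta)
        delta_t[t] = 1.0                 # t-th unit vector
        grad[t] = (E(theta + eps * delta_t) - E0) / eps
    return grad

# Example: E(theta) = ||theta||^2 has gradient 2 * theta
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda th: np.sum(th**2), theta))  # ~ [2, -4, 6]
```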
Better Approach: The Chain Rule
Suppose that some parameter of interest $\theta_q$ (one element of the parameter vector $\theta$) is linked to the output of the network through some sequence of functions:

[Diagram: $\theta_q \to a \to b \to z$, with $a$ in layer 1 and $b$ in layer 2.]

$$\frac{\partial z}{\partial \theta_q} = \frac{\partial z}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial \theta_q}$$

i.e. the derivative w.r.t. the parameter of interest is the product of local derivatives along the path connecting $\theta_q$ to $z$.
The Multivariate Chain Rule
In practice, some parameter of interest may be linked to the output of the network through multiple paths (formed by all neurons between them):

[Diagram: $\theta_q \to (a_1, a_2, \dots) \to (b_1, b_2, \dots) \to z$, with full connections between consecutive layers.]

The chain rule can be extended to this multivariate scenario by enumerating all the paths between $\theta_q$ and $z$:

$$\frac{\partial z}{\partial \theta_q} = \sum_i \sum_j \frac{\partial z}{\partial b_j} \frac{\partial b_j}{\partial a_i} \frac{\partial a_i}{\partial \theta_q}$$

where $\sum_i$ and $\sum_j$ run over the indices of the neurons in the corresponding layers. Its complexity grows exponentially with the number of layers.
Factor Structure in the Multivariate Chain Rule
The sum over paths factorizes over layers:

$$\frac{\partial z}{\partial \theta_q} = \sum_i \frac{\partial a_i}{\partial \theta_q} \sum_j \frac{\partial b_j}{\partial a_i} \frac{\partial z}{\partial b_j}$$

so that, instead of enumerating exponentially many paths, the computation can be performed incrementally, layer by layer, as the sketch below illustrates.
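A small NumPy sketch contrasting the two computations for one parameter in a toy two-layer setting (all names and values are our own):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
u = rng.normal(size=3)        # da_i / dtheta_q  (layer 1 local derivatives)
V = rng.normal(size=(4, 3))   # db_j / da_i      (layer 2 local derivatives)
s = rng.normal(size=4)        # dz / db_j        (output local derivatives)

# Path enumeration: one product of local derivatives per path (i, j)
by_paths = sum(s[j] * V[j, i] * u[i] for i, j in product(range(3), range(4)))

# Factored (backpropagation-style): two small sums instead of 3 * 4 paths
by_factoring = u @ (V.T @ s)

print(np.isclose(by_paths, by_factoring))  # True
```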
Part 3 Worked-Through Example and Formalization
Worked-Through Example

[Diagram: network with inputs $a_1, a_2$; hidden neurons $a_3, a_4$ and $a_5, a_6$; output $z_7$; weights $w_{13}, w_{23}, w_{24}, w_{35}, w_{36}, w_{46}, w_{57}, w_{67}$.]

Forward pass:

$$a_1 = x_1 \qquad a_2 = x_2$$
$$z_3 = a_1 w_{13} + a_2 w_{23} \qquad a_3 = g(z_3)$$
$$z_4 = a_2 w_{24} \qquad a_4 = g(z_4)$$
$$z_5 = a_3 w_{35} \qquad a_5 = g(z_5)$$
$$z_6 = a_3 w_{36} + a_4 w_{46} \qquad a_6 = g(z_6)$$
$$z_7 = a_5 w_{57} + a_6 w_{67}$$
$$E = \max(0, -z_7 t)$$
Backward pass (propagating from the output to the parameters):

$$\delta_7 = \frac{\partial E}{\partial z_7} = 1_{-z_7 t > 0} \cdot (-t)$$

$$\frac{\partial E}{\partial w_{67}} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial w_{67}} = \delta_7 \cdot a_6 \qquad \frac{\partial E}{\partial w_{57}} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial w_{57}} = \delta_7 \cdot a_5$$

$$\delta_6 = \frac{\partial E}{\partial a_6} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial a_6} = \delta_7 \cdot w_{67} \qquad \delta_5 = \frac{\partial E}{\partial a_5} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial a_5} = \delta_7 \cdot w_{57}$$

$$\frac{\partial E}{\partial w_{46}} = \frac{\partial E}{\partial a_6} \frac{\partial a_6}{\partial z_6} \frac{\partial z_6}{\partial w_{46}} = \delta_6 \cdot g'(z_6) \cdot a_4$$
$$\frac{\partial E}{\partial w_{36}} = \frac{\partial E}{\partial a_6} \frac{\partial a_6}{\partial z_6} \frac{\partial z_6}{\partial w_{36}} = \delta_6 \cdot g'(z_6) \cdot a_3$$
$$\frac{\partial E}{\partial w_{35}} = \frac{\partial E}{\partial a_5} \frac{\partial a_5}{\partial z_5} \frac{\partial z_5}{\partial w_{35}} = \delta_5 \cdot g'(z_5) \cdot a_3$$

$$\delta_4 = \frac{\partial E}{\partial a_4} = \frac{\partial E}{\partial a_6} \frac{\partial a_6}{\partial z_6} \frac{\partial z_6}{\partial a_4} = \delta_6 \cdot g'(z_6) \cdot w_{46}$$
$$\delta_3 = \frac{\partial E}{\partial a_3} = \frac{\partial E}{\partial a_6} \frac{\partial a_6}{\partial z_6} \frac{\partial z_6}{\partial a_3} + \frac{\partial E}{\partial a_5} \frac{\partial a_5}{\partial z_5} \frac{\partial z_5}{\partial a_3} = \delta_6 \cdot g'(z_6) \cdot w_{36} + \delta_5 \cdot g'(z_5) \cdot w_{35}$$

$$\frac{\partial E}{\partial w_{24}} = \frac{\partial E}{\partial a_4} \frac{\partial a_4}{\partial z_4} \frac{\partial z_4}{\partial w_{24}} = \delta_4 \cdot g'(z_4) \cdot a_2$$
$$\frac{\partial E}{\partial w_{23}} = \frac{\partial E}{\partial a_3} \frac{\partial a_3}{\partial z_3} \frac{\partial z_3}{\partial w_{23}} = \delta_3 \cdot g'(z_3) \cdot a_2$$
$$\frac{\partial E}{\partial w_{13}} = \frac{\partial E}{\partial a_3} \frac{\partial a_3}{\partial z_3} \frac{\partial z_3}{\partial w_{13}} = \delta_3 \cdot g'(z_3) \cdot a_1$$
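To make the example concrete, here is a NumPy sketch implementing exactly this forward and backward pass (with $g = \tanh$ as an assumed activation; the lecture leaves $g$ generic), checked against finite differences:

```python
import numpy as np

g, g_prime = np.tanh, lambda z: 1 - np.tanh(z)**2   # assumed activation

def forward_backward(w, x1, x2, t):
    w13, w23, w24, w35, w36, w46, w57, w67 = w
    # Forward pass (same equations as above)
    a1, a2 = x1, x2
    z3 = a1*w13 + a2*w23; a3 = g(z3)
    z4 = a2*w24;          a4 = g(z4)
    z5 = a3*w35;          a5 = g(z5)
    z6 = a3*w36 + a4*w46; a6 = g(z6)
    z7 = a5*w57 + a6*w67
    E = max(0.0, -z7*t)
    # Backward pass (same equations as above)
    d7 = float(-z7*t > 0) * (-t)
    d6, d5 = d7*w67, d7*w57
    d4 = d6*g_prime(z6)*w46
    d3 = d6*g_prime(z6)*w36 + d5*g_prime(z5)*w35
    grads = np.array([d3*g_prime(z3)*a1, d3*g_prime(z3)*a2, d4*g_prime(z4)*a2,
                      d5*g_prime(z5)*a3, d6*g_prime(z6)*a3, d6*g_prime(z6)*a4,
                      d7*a5, d7*a6])
    return E, grads

w = np.array([0.2, -0.3, 0.5, 0.4, -0.1, 0.3, 0.7, -0.6])
E, grads = forward_backward(w, x1=1.0, x2=-1.0, t=-1.0)
eps = 1e-6
num = np.array([(forward_backward(w + eps*np.eye(8)[i], 1.0, -1.0, -1.0)[0] - E) / eps
                for i in range(8)])
print(np.allclose(grads, num, atol=1e-4))  # True (the hinge is active here)
```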
Formalization for a Standard Neural Network
The error gradient can be propagated from layer to layer using the chain rule:

$$\underbrace{\frac{\partial E}{\partial a_j}}_{\delta_j} = \sum_k \underbrace{\frac{\partial E}{\partial a_k}}_{\delta_k} \cdot \underbrace{\frac{\partial a_k}{\partial z_k}}_{g'(z_k)} \cdot \underbrace{\frac{\partial z_k}{\partial a_j}}_{w_{jk}}$$

$$\frac{\partial E}{\partial w_{jk}} = \underbrace{\frac{\partial E}{\partial a_k}}_{\delta_k} \cdot \underbrace{\frac{\partial a_k}{\partial z_k}}_{g'(z_k)} \cdot \underbrace{\frac{\partial z_k}{\partial w_{jk}}}_{a_j}$$
Matrix Formulation
Observation:
▶ Backpropagation equations can be written as matrix-vector products, or outer products:

Neuron-wise:
$$\delta_j = \sum_k \delta_k \, g'(z_k) \, w_{jk} \qquad\qquad \frac{\partial E}{\partial w_{jk}} = \delta_k \, g'(z_k) \, a_j$$

Layer-wise:
$$\delta^{(l-1)} = W^{(l-1,l)} \cdot \big(g'(z^{(l)}) \odot \delta^{(l)}\big) \qquad\qquad \frac{\partial E}{\partial W^{(l-1,l)}} = a^{(l-1)} \cdot \big(g'(z^{(l)}) \odot \delta^{(l)}\big)^\top$$

where $W^{(l-1,l)}$ is the weight matrix connecting layer $l-1$ to layer $l$, and $\odot$ denotes the element-wise product.

Note:
▶ Further vectorization can be achieved by computing the gradient for multiple data points at once, in which case we have matrix-matrix products.
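A sketch of the layer-wise equations in NumPy (the shapes and the tanh activation are our own assumptions):

```python
import numpy as np

g, g_prime = np.tanh, lambda z: 1 - np.tanh(z)**2   # assumed activation

# One layer transition as illustration: a_prev -(W)-> z -> a
rng = np.random.default_rng(2)
a_prev = rng.normal(size=5)          # a^(l-1), 5 neurons
W = rng.normal(size=(5, 3))          # W^(l-1,l), connecting 5 -> 3 neurons
z = W.T @ a_prev                     # pre-activations z^(l)
delta = rng.normal(size=3)           # delta^(l), assumed given from the layer above

# Layer-wise backpropagation equations
delta_prev = W @ (g_prime(z) * delta)          # delta^(l-1): matrix-vector product
grad_W = np.outer(a_prev, g_prime(z) * delta)  # dE/dW^(l-1,l): outer product
```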
Part 4 Further Remarks / Advanced Topics
Choice of Nonlinear Activation Function
In practice, for training to proceed, the nonlinear function must be chosen in a way that it is differentiable (so that the gradient can be computed) and that its derivative does not vanish everywhere (otherwise the factors $g'(z_k)$ zero out the backpropagated gradient).
Neural Networks with Shared Parameters
Shared parameters can be handled by treating the original parameters as neurons generated from the new parameters, and applying the chain rule one step further:

[Diagram: the weights $w_{13}, w_{23}, w_{24}, w_{35}, w_{36}, w_{46}, w_{57}, w_{67}$ of the example network are generated from two shared parameters $u$ and $v$.]

For a parameter $u$ shared across several weights, this yields

$$\frac{\partial E}{\partial u} = \sum_{(j,k) \,:\, w_{jk} = u} \frac{\partial E}{\partial w_{jk}}$$

i.e. the gradients of all weights tied to $u$ are accumulated.
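A minimal sketch of this accumulation (our own toy function, with two weights tied to one parameter $u$):

```python
import numpy as np

# Toy output with two connections whose weights are tied: w_a = w_b = u
def E(u, x1=1.0, x2=2.0, t=-1.0):
    z = u * x1 + u * x2          # both weights share the parameter u
    return max(0.0, -z * t)

# Chain rule one step further: dE/du = dE/dw_a + dE/dw_b
u, t = 0.5, -1.0
z = u * 1.0 + u * 2.0
delta = float(-z * t > 0) * (-t)
grad_u = delta * 1.0 + delta * 2.0   # accumulated over the tied weights

eps = 1e-6
print(grad_u, (E(u + eps) - E(u)) / eps)  # both ~ 3.0
```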
Automatic Differentiation
Automatic differentiation:
▶ Automatically generates the backpropagation equations from the forward equations.

Consequences:
▶ In practice we do not need to do backpropagation by hand anymore. We just need to program the forward pass, and the backward pass comes for free.
▶ This has enabled researchers to develop neural networks that are far more complex, and with much more heterogeneous structures (e.g. ResNet, YOLO, transformers, etc.).
▶ Only in a few cases is it still useful to express the gradient analytically (e.g. to theoretically analyze the stability of a gradient descent procedure, such as the vanishing/exploding gradients problem in recurrent neural networks).
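For instance, with an automatic differentiation framework such as PyTorch (a sketch, assuming PyTorch is installed; the lecture does not prescribe a framework), only the forward pass is written explicitly:

```python
import torch

# Forward pass only: the perceptron-style error from the earlier slides
x = torch.tensor([1.0, -2.0])
t = torch.tensor(-1.0)
w = torch.tensor([0.5, 0.3], requires_grad=True)
b = torch.tensor(0.2, requires_grad=True)

z = w @ x + b
E = torch.clamp(-z * t, min=0.0)   # max(0, -z*t)

E.backward()           # backward pass generated automatically
print(w.grad, b.grad)  # dE/dw and dE/db
```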
Simple algorithm for training a neural network
for $t = 1, 2, \dots$ (until convergence) do
$\quad \theta \leftarrow \theta - \gamma \cdot \nabla E(\theta)$
end for
▶ The parameter γ is a learning rate that needs to be set by the user.
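A minimal sketch of this loop on a toy error function (the quadratic and all names are our own; in practice $\nabla E$ would come from backpropagation):

```python
import numpy as np

def gradient_descent(grad_E, theta0, gamma=0.1, steps=100):
    """Plain gradient descent: theta <- theta - gamma * grad E(theta)."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - gamma * grad_E(theta)
    return theta

# Example: E(theta) = ||theta||^2 has gradient 2 * theta and minimum at 0
theta = gradient_descent(grad_E=lambda th: 2 * th,
                         theta0=np.array([3.0, -4.0]))
print(theta)  # close to [0, 0]
```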
Part 5 Neural Network at Work
Neural Network at Work
[Figure: decision boundary of the neural network on 2D data at iteration 0, iteration 1 and iteration 3; both axes range from −4 to 4.]
Observation:
▶ After enough iterations, all points are on the correct side of the decision boundary (as for the perceptron, but even when the data is not linearly separable).
Neural Network at Work
[Figure: decision function of the neural network at iteration 31; both axes range from −4 to 4.]

Considerations for the next lectures:

Optimization (Lectures 3–4)
▶ How to make training faster? (Especially important if we consider large problems with many input variables.)

Regularization (Lectures 5–6)
▶ The decision function doesn't look nice, and is unlikely to work well for new data points. Can we introduce a mechanism in the learning procedure that promotes more regular and well-generalizing decision functions?
Summary
▶ Error backpropagation only extracts the gradient. This is not enough for training a neural network successfully: one still needs to make sure that the gradient descent is carried out efficiently (Lectures 3–4) and that the learned model is robust enough to generalize well to new data points (Lectures 5–6).