Neural Networks
Course 3: Gradient Descent and Backpropagation
Overview
The neurons in the hidden layer and in the output layer are non-linear
neurons. Most of the time they are logistic neurons (sigmoid), but they can also be
tanh, rectified linear units, or based on other non-linear functions
This type of network is also called a Multi-Layer Perceptron (MLP), but this name is
confusing, since the perceptron is rarely used in this kind of architecture
[Figure: the logistic (sigmoid) function of the net input $\sum_i w_i x_i$ and the threshold $\theta$; the output ranges from 0 to 1 and equals 0.5 at net input 0]
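A minimal sketch of these activation functions (function names and sample values are illustrative, not from the course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic neuron, output in (0, 1)

def tanh(z):
    return np.tanh(z)                  # output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # rectified linear unit

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid(z))  # ~[0.002, 0.119, 0.5, 0.881, 0.998]
```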
Gradient Descent
Adjust the weights and biases so as to minimize a cost function
A cost function is a mathematical function that assigns a value (the cost) to
how badly a sample is classified
A commonly used function (that we will also use) is the mean squared error:
$$C(w, b) = \frac{1}{2n} \sum_x \lVert t - a \rVert^2$$

$$C(w, b) = \frac{1}{2n} \sum_x (t - a)^2 \quad \text{(for a single output)}$$
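A minimal NumPy sketch of this cost (function name and sample data are illustrative):

```python
import numpy as np

# Mean squared error over n samples: C(w, b) = 1/(2n) * sum_x (t - a)^2
def mse_cost(targets, activations):
    n = len(targets)
    return np.sum((targets - activations) ** 2) / (2.0 * n)

t = np.array([1.0, 0.0, 1.0])   # targets
a = np.array([0.9, 0.2, 0.6])   # network outputs
print(mse_cost(t, a))           # 0.035
```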
Why use a cost function? (why not just count the correct outputs)
A small update in the weights might not result in a change in the number of
correctly classified samples
What is a gradient?
A gradient is just a fancy word for derivative
The gradient (first derivative) gives the slope of the tangent to the graph of
the function; it points in the direction of the greatest rate of increase
Gradient Descent
$$\Delta C(v_1, v_2) \approx \frac{\partial C}{\partial v_1}\,\Delta v_1 + \frac{\partial C}{\partial v_2}\,\Delta v_2$$
Gradient Descent
$$\Delta v = -\eta \nabla C$$
The learning rate must be small enough not to jump over the minimum
Even though we have used just two variables, the same principle applies to
any differentiable cost function of many variables
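A minimal sketch of this update rule on a made-up two-variable cost (the quadratic cost and learning rate here are assumptions chosen only for illustration):

```python
import numpy as np

def grad_C(v):
    # gradient of the illustrative cost C(v1, v2) = v1^2 + 2*v2^2
    return np.array([2.0 * v[0], 4.0 * v[1]])

v = np.array([3.0, -2.0])
eta = 0.1                      # learning rate: small enough not to overshoot
for _ in range(100):
    v = v - eta * grad_C(v)    # v <- v + delta_v, with delta_v = -eta * grad C
print(v)                       # close to the minimum at (0, 0)
```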
Gradient Descent
Example:
Performing gradient descent for the Adaline perceptron.
[Figure: the Adaline perceptron]
1. Compute the gradient 𝛻𝐶
$$\nabla C = \left( \frac{dC}{dw}, \frac{dC}{db} \right)$$

$$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial y} \cdot \frac{\partial y}{\partial w}$$
$$C' = \frac{1}{2n} \sum_x \left[ (t - y)^2 \right]' = \frac{2}{2n} \sum_x (t - y)(t - y)' = \frac{1}{n} \sum_x (t - y)(-y')$$

$$\frac{dy}{dw} = \frac{d(wx + b)}{dw} = x \qquad \frac{dy}{db} = \frac{d(wx + b)}{db} = 1$$

Therefore:

$$\frac{dC}{dw} = -\frac{1}{n} \sum_x (t - y)\,x \qquad \frac{dC}{db} = -\frac{1}{n} \sum_x (t - y)$$
2. Choose a learning rate $\eta$ and update the parameters: $w \leftarrow w - \eta \frac{dC}{dw}$, $b \leftarrow b - \eta \frac{dC}{db}$
This is the same update rule we used for the Adaline in the previous
course. The formula looks different since now we are averaging over all
samples
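A hedged sketch of batch gradient descent for the Adaline using the gradients derived above (the toy data and hyperparameters are made up for illustration):

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # n samples, 1 feature
t = np.array([0.0, 1.0, 2.0, 3.0])           # targets

w = np.zeros(1)
b = 0.0
eta = 0.1
n = len(t)

for epoch in range(200):
    y = X @ w + b                    # Adaline output: the net input, no threshold
    error = t - y
    dC_dw = -(X.T @ error) / n       # dC/dw = -(1/n) * sum_x (t - y) x
    dC_db = -np.sum(error) / n       # dC/db = -(1/n) * sum_x (t - y)
    w -= eta * dC_dw                 # w <- w - eta * dC/dw
    b -= eta * dC_db                 # b <- b - eta * dC/db
print(w, b)                          # approaches w ~ 1, b ~ 0
```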
Gradient Descent
Using gradient descent we can optimize a perceptron.
However, can we train a network with multiple neurons?
Yes, but not yet!
Why?
Backpropagation
Using gradient descent we can optimize a perceptron.
However, can we train a network with multiple layers?
Yes, but not yet!
Why?
We know the error at the last layer, so we can update the weights that immediately affect that error.
We don't know how much of the error depends on the previous layers. We need to backpropagate the error.
Backpropagation
Some notations:
$w_{ij}^l$ = the weight from neuron $i$ in layer $l-1$ to neuron $j$ in layer $l$
$L$ = the last layer
$y_i^l$ = the activation of neuron $i$ in layer $l$
$b_i^l$ = the bias of neuron $i$ in layer $l$
$z_i^l$ = the net input of neuron $i$ in layer $l$ ($z_i^l = \sum_j w_{ji}^l y_j^{l-1} + b_i^l$)
[Figure: example network illustrating the notation, with the weights $w_{14}^3, w_{24}^3, w_{34}^3, w_{44}^3$ into a neuron of layer 3 and the activations $y_1^2, y_2^2, y_3^2, y_4^2$ of layer 2]
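An illustrative forward pass using this notation (the list-of-matrices layout and the toy 2-3-1 network are assumptions, not the course's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the net inputs z^l and activations y^l of every layer."""
    y = x
    zs, ys = [], [x]
    for W, b in zip(weights, biases):
        z = y @ W + b          # z_i^l = sum_j w_ji^l * y_j^(l-1) + b_i^l
        y = sigmoid(z)         # y_i^l = sigma(z_i^l)
        zs.append(z)
        ys.append(y)
    return zs, ys

# toy 2-3-1 network with random parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((3, 1))]
biases = [np.zeros(3), np.zeros(1)]
zs, ys = forward(np.array([0.5, -1.0]), weights, biases)
print(ys[-1])
```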
Backpropagation
We can change the error (cost) by adjusting the bias and the weights of each
neuron $i$ in each layer $l$. This will change the net input from $z_i^l$ to $(z_i^l + \Delta z_i^l)$
and the activation from $\sigma(z_i^l)$ to $\sigma(z_i^l + \Delta z_i^l)$
In this case, the cost function will change by $\Delta C \approx \frac{\partial C}{\partial z_i^l} \Delta z_i^l$
[Figure: four-layer network; perturbing the net input of one neuron by $\Delta z_i^l$ changes the cost to $C + \frac{\partial C}{\partial z_i^l} \Delta z_i^l$]
Backpropagation
We can minimize $C$ by making $\Delta C = \frac{\partial C}{\partial z_i^l} \Delta z_i^l$ negative:
$\Delta z_i^l$ must have the opposite sign to $\frac{\partial C}{\partial z_i^l}$
Considering that $\Delta z_i^l$ is a small number (since we want to make small changes), the
amount by which the error is reduced depends on how large $\frac{\partial C}{\partial z_i^l}$ is
If $\frac{\partial C}{\partial z_i^l}$ is close to zero, then the cost cannot be reduced further by adjusting this neuron
Thus, we can consider that $\frac{\partial C}{\partial z_i^l}$ represents a measure of the error of neuron $i$ in
layer $l$
We will use $\delta_i^l = \frac{\partial C}{\partial z_i^l}$ to represent this error.
Note that we could also have defined the error with respect to the output $y_i^l$, but
the current choice results in fewer formulas
Backpropagation
We will back-propagate the error to the neurons in the previous layer and
will repeat the above steps
Backpropagation
Compute the error of the last layer (how the cost function depends on the net inputs of the last layer)
$$\frac{\partial C}{\partial z_i^L} = \frac{\partial C}{\partial y_i^L} \cdot \frac{\partial y_i^L}{\partial z_i^L} = \frac{\partial C}{\partial y_i^L} \cdot \sigma'(z_i^L)$$

$$\delta_i^L = \frac{\partial C}{\partial z_i^L} = \frac{\partial C}{\partial y_i^L} \cdot \sigma'(z_i^L)$$

[Figure: the last layer, with net input $z_i^L$, activation $y_i^L$ and cost $C$]
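A small sketch of this last-layer error for the quadratic cost of a single sample, where $\frac{\partial C}{\partial y_i^L} = (y_i^L - t_i)$ (variable names are mine, not the course's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z_L = np.array([0.5, -1.2])              # net inputs of the last layer
y_L = sigmoid(z_L)                        # activations of the last layer
t = np.array([1.0, 0.0])                  # target
delta_L = (y_L - t) * sigmoid_prime(z_L)  # delta^L_i = (y^L_i - t_i) * sigma'(z^L_i)
print(delta_L)
```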
Backpropagation
Backpropagate the error (write the error with respect to the error in the next
layer)
$$\frac{\partial C}{\partial z_i^l} = \frac{\partial C}{\partial y_i^l} \cdot \frac{\partial y_i^l}{\partial z_i^l} = \frac{\partial y_i^l}{\partial z_i^l} \cdot \sum_k \frac{\partial C}{\partial z_k^{l+1}} \cdot \frac{\partial z_k^{l+1}}{\partial y_i^l} = \sigma'(z_i^l) \sum_k \delta_k^{l+1} \cdot w_{ik}^{l+1}$$

$$\delta_i^l = \frac{\partial C}{\partial z_i^l} = \sigma'(z_i^l) \sum_k \delta_k^{l+1} \cdot w_{ik}^{l+1}$$

[Figure: neuron $i$ in layer $l$ (net input $z_i^l$, activation $y_i^l$) connected through the weights $w_{ik}^{l+1}$ to the neurons $z_k^{l+1}$ of layer $l+1$, which feed into $C$]
Backpropagation
$$\frac{\partial C}{\partial w_{ki}^l} = \frac{\partial C}{\partial z_i^l} \cdot \frac{\partial z_i^l}{\partial w_{ki}^l} = \delta_i^l \cdot y_k^{l-1}$$

$$\frac{\partial C}{\partial w_{ki}^l} = \delta_i^l \cdot y_k^{l-1}$$

[Figure: the weight $w_{ki}^l$ connecting the activation $y_k^{l-1}$ of layer $l-1$ to the net input $z_i^l$ of layer $l$]
Backpropagation
$$\frac{\partial C}{\partial b_i^l} = \frac{\partial C}{\partial z_i^l} \cdot \frac{\partial z_i^l}{\partial b_i^l} = \delta_i^l \cdot 1$$

$$\frac{\partial C}{\partial b_i^l} = \delta_i^l$$

[Figure: the bias $b_i^l$ feeding into the net input $z_i^l$]
Backpropagation
1. Backpropagate the error to the current layer:
$$\delta_i^l = \sigma'(z_i^l) \sum_k \delta_k^{l+1} \cdot w_{ik}^{l+1}$$
2. Compute the gradient for the weights in the current layer:
$$\frac{\partial C}{\partial w_{ki}^l} = \delta_i^l \cdot y_k^{l-1}$$
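Putting these formulas together, a hedged sketch of one backward pass; it follows the shapes of the forward-pass sketch above and is not the course's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward(t, zs, ys, weights):
    """Gradients of the quadratic cost of one sample, given a forward pass."""
    grads_W, grads_b = [], []
    # error of the last layer: delta^L = (y^L - t) * sigma'(z^L)
    delta = (ys[-1] - t) * sigmoid_prime(zs[-1])
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(ys[l], delta))   # dC/dw_ki^l = y_k^(l-1) * delta_i^l
        grads_b.insert(0, delta)                     # dC/db_i^l  = delta_i^l
        if l > 0:
            # backpropagate: delta_i^l = sigma'(z_i^l) * sum_k w_ik^(l+1) delta_k^(l+1)
            delta = sigmoid_prime(zs[l - 1]) * (weights[l] @ delta)
    return grads_W, grads_b
```

grads_W and grads_b line up with the weight and bias lists, so a gradient-descent step can subtract $\eta$ times each gradient.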
A network to read digits
Each neuron in the output layer will activate for a certain digit. The digit output
by the network will be given by the output neuron with the highest
confidence (the largest activation)
Training info:
The learning rate used is 𝜂 = 3.0
Learning is performed using SGD with minibatch size of 10
Training is performed for 30 iterations
Training data consists of 50000 images.
Test data consists of 10000 images
Results:
ANN identifies approx. 95% of the tested images
A network made of 10 perceptrons detects approx. 83%
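A sketch of the SGD loop described in the training info above (minibatches of 10, $\eta$ = 3.0, 30 epochs); forward and backward refer to the earlier sketches, and loading the 50000/10000 MNIST images is assumed and not shown:

```python
import random
import numpy as np

def sgd(training_data, weights, biases, epochs=30, batch_size=10, eta=3.0):
    """training_data is assumed to be a list of (input, target) pairs."""
    n = len(training_data)
    for epoch in range(epochs):
        random.shuffle(training_data)
        for start in range(0, n, batch_size):
            batch = training_data[start:start + batch_size]
            sum_W = [np.zeros_like(W) for W in weights]
            sum_b = [np.zeros_like(b) for b in biases]
            for x, t in batch:
                zs, ys = forward(x, weights, biases)
                grads_W, grads_b = backward(t, zs, ys, weights)
                sum_W = [s + g for s, g in zip(sum_W, grads_W)]
                sum_b = [s + g for s, g in zip(sum_b, grads_b)]
            # average the gradients over the minibatch and take one step
            for l in range(len(weights)):
                weights[l] -= (eta / len(batch)) * sum_W[l]
                biases[l] -= (eta / len(batch)) * sum_b[l]
```

For prediction, the recognized digit would be the index of the largest output activation, e.g. np.argmax(ys[-1]) after a forward pass.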
(watch demo)
Questions & Discussion
Bibliography
http://neuralnetworksanddeeplearning.com/
Chris Bishop, “Neural Networks for Pattern Recognition”
http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html
http://deeplearning.net/