
18 October, 2016

Neural Networks
Course 3: Gradient Descent and Backpropagation
Overview

 Feed Forward Network Architecture
 Gradient Descent
 Backpropagation
 A network to read digits
 Conclusions
Feed Forward Network
Architecture
Feed Forward Network Architecture

 Composed of at least 3 layers:

  The first one is called the input layer
  The last one is the output layer
  The ones in the middle are called hidden layers. A hidden layer is composed of
hidden units
 Each unit is connected to all the units in the layer above (there
are no loops)
Feed Forward Network Architecture

 The neurons in the hidden layer and in the output layer are non-linear
neurons. Most of the time they are logistic (sigmoid) neurons, but they can also be
tanh, rectified linear units, or based on some other non-linear function

 This type of network is also called a Multi Layer Perceptron (MLP), but this name is
confusing, since the perceptron is rarely used in this kind of architecture

 Why not use the perceptron?


Feed Forward Network Architecture

 Why not use the perceptron?

  Learning is performed by slightly modifying the weights and observing the output.
However, since the perceptron only outputs 0 or 1, it is possible to update the
weights and see no change at all in the output

[Figure: the perceptron's step function (y jumps from 0 to 1 at the threshold 𝜃)
compared with the smooth sigmoid of Σᵢ 𝑤ᵢ𝑥ᵢ, plotted over the range −6 … 6]
Gradient Descent
Gradient Descent

 Adjust the weights and biases so as to minimize a cost function
 A cost function is a mathematical function that assigns a value (cost) to
how badly a sample is classified
 A commonly used function (that we will also use) is the mean squared error

C(w, b) = (1/2n) Σₓ ‖t − a‖² = (1/2n) Σₓ (t − a)²

w = all weights in the network     b = all biases in the network
t = target output vector for input x     a = output when the input is x, i.e. y(x)
‖v‖ = length of the vector v
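As an illustration, the cost above can be computed in a few lines of NumPy (a minimal sketch; the function name `mse_cost` and the toy values are mine, not from the course):

```python
import numpy as np

def mse_cost(targets, outputs):
    """Quadratic cost C(w, b) = 1/(2n) * sum_x ||t - a||^2.

    targets, outputs: arrays of shape (n_samples, n_outputs)."""
    n = targets.shape[0]
    return np.sum((targets - outputs) ** 2) / (2 * n)

# Two samples, two output units; only the second sample is off by 1 in one unit.
t = np.array([[1.0, 0.0], [0.0, 1.0]])
a = np.array([[1.0, 0.0], [0.0, 0.0]])
print(mse_cost(t, a))  # 1/(2*2) * 1 = 0.25
```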
Gradient Descent

 Why use a cost function? (why not just count the correct outputs?)
  A small update in the weights might not result in a change in the number of
correctly classified samples

 Why use the Mean Squared Error?

  We can interpret the formula as being very similar to the Euclidean distance, so
this can be seen as minimizing the distance between the target and the
output
  The formula also resembles the one for the variance (which computes how far the
elements are from the mean). So, for example in regression, this would be
equivalent to reducing how far the elements stray from the mean (the target
hyperplane)
  It is continuous and easily differentiable (which will be useful later)
Gradient Descent

 What is a gradient?
  A gradient is just a fancy word for derivative 
  The gradient (first derivative) determines the slope of the tangent to the graph of
the function (it points in the direction of the greatest rate of increase)
Gradient Descent

 A function with multiple variables has multiple derivatives:

f(x, y, z) has 3 partial derivatives: ∂f/∂x, ∂f/∂y, ∂f/∂z

∂f/∂x denotes a partial derivative and is obtained by differentiating f with
respect to x while considering all the other variables (y, z) as constants

 If x changes by Δx, y by Δy and z by Δz, then:

Δf ≈ (∂f/∂x)·Δx + (∂f/∂y)·Δy + (∂f/∂z)·Δz
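A partial derivative can be checked numerically by nudging one variable while holding the others fixed. A small sketch, assuming the example function f(x, y, z) = x·y + z² (my choice, not from the slides):

```python
def partial(f, args, i, h=1e-6):
    """Estimate df/d(args[i]) with a central difference, holding the rest fixed."""
    hi = list(args); hi[i] += h
    lo = list(args); lo[i] -= h
    return (f(*hi) - f(*lo)) / (2 * h)

# Example function: f(x, y, z) = x*y + z**2
def f(x, y, z):
    return x * y + z ** 2

grads = [partial(f, (2.0, 3.0, 4.0), i) for i in range(3)]
print(grads)  # approx [3.0, 2.0, 8.0] = (df/dx, df/dy, df/dz) = (y, x, 2z)
```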
Gradient Descent

 Minimizing the Cost function:

Suppose the Cost function has
only two variables (v1 and v2)
and its geometric representation
is a quadratic bowl.

If we move in the direction v1 by
Δv1 and in the direction v2 by
Δv2, then

ΔC(v1, v2) ≈ (∂C/∂v1)·Δv1 + (∂C/∂v2)·Δv2
Gradient Descent

 Minimizing the Cost function:

If we define
∇C = (∂C/∂v1, ∂C/∂v2)
and
Δv = (Δv1, Δv2)
then
ΔC(v1, v2) ≈ ∇C · Δv

Since we want to always move downwards (minimizing the function C) we
want ΔC to be negative. So we choose

Δv = −η∇C

where η is a small number, called the learning rate


Gradient Descent

 Minimizing the Cost function:

  So, adjusting v1 and v2 such that Δv = −η∇C will always lead to a
smaller value of C

  Repeating the above step multiple times drives us to a local minimum

  The learning rate must be small enough not to jump over the minimum

  Even though we have used just two variables, the same principle can be
applied to any differentiable Cost function of many variables
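The loop described above can be sketched in a few lines. This is a minimal illustration on the quadratic bowl C(v1, v2) = v1² + v2²; the bowl, the learning rate, and the step count are assumptions of mine:

```python
# Gradient descent on the quadratic bowl C(v1, v2) = v1^2 + v2^2,
# whose gradient is grad C = (2*v1, 2*v2).
def descend(v1, v2, eta=0.1, steps=100):
    for _ in range(steps):
        g1, g2 = 2 * v1, 2 * v2              # gradient of C at (v1, v2)
        v1, v2 = v1 - eta * g1, v2 - eta * g2  # the update Δv = -eta * grad C
    return v1, v2

v1, v2 = descend(3.0, -4.0)
print(v1, v2)  # both very close to 0, the minimum of the bowl
```

Each step shrinks both coordinates by a constant factor (1 − 2η), which is why a learning rate that is too large (η ≥ 1 here) would overshoot and diverge.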
Gradient Descent

 Example:
  Performing gradient descent for the Adaline perceptron.

The activation function:

y = wx + b

We will use the mean squared error as the cost function:

C(w, b) = (1/2n) Σₓ (t − y)²
Gradient Descent

 Example:
  Performing gradient descent for the Adaline perceptron.
1. Compute the gradient ∇C

∇C = (dC/dw, dC/db)

Remember the chain rule:

C is a function of the output y:  C = (1/2n) Σₓ (t − y)²
y is a function of the variables w and b:  y = wx + b

∂C/∂w = (∂C/∂y) · (∂y/∂w)
Gradient Descent

 Example:
  Performing gradient descent for the Adaline perceptron.
1. Compute the gradient ∇C

C′ = (1/2n) Σₓ [(t − y)²]′ = (2/2n) Σₓ (t − y)·(t − y)′ = (1/n) Σₓ (t − y)·(−y′)

dy/dw = d(wx + b)/dw = x
dy/db = d(wx + b)/db = 1
-----------------------------------------------------------------------------------------------
dC/dw = −(1/n) Σₓ (t − y)·x          dC/db = −(1/n) Σₓ (t − y)
Gradient Descent

 Example:
 Performing gradient descent for the Adaline perceptron.
2. Choose a learning rate η

3. For a fixed number of iterations:

adjust w:  w = w − η·(dC/dw) = w + (η/n) Σₓ (t − y)·x
adjust b:  b = b − η·(dC/db) = b + (η/n) Σₓ (t − y)

  This is the same update rule we used for the Adaline in the previous
course. The formula looks different because now we are averaging over all
samples
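The update rules above can be written as a short training loop. A minimal sketch; the helper name `train_adaline` and the noise-free toy data t = 2x + 1 are assumptions of mine, not from the course:

```python
import numpy as np

def train_adaline(x, t, eta, epochs):
    """Full-batch gradient descent for y = w*x + b under the MSE cost."""
    w, b, n = 0.0, 0.0, len(x)
    for _ in range(epochs):
        y = w * x + b
        w += (eta / n) * np.sum((t - y) * x)  # w = w - eta * dC/dw
        b += (eta / n) * np.sum(t - y)        # b = b - eta * dC/db
    return w, b

# Noise-free line t = 2x + 1: the fit should approach w = 2, b = 1.
x = np.linspace(-1, 1, 50)
t = 2 * x + 1
w, b = train_adaline(x, t, eta=0.5, epochs=500)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```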
Gradient Descent

 We will usually be performing stochastic gradient descent (mini-
batch SGD) instead of full gradient descent.

 The idea is very simple:

  Choose a subset of the training data, of smaller size, whose gradient
approximates ∇C. Use that estimate instead of the real ∇C
  Update the weights using the previously computed value
  Choose another subset from the training data and repeat the previous
steps until all the samples have been processed

This speeds up the learning process by a factor of roughly
(size of training data) / (size of mini-batch)
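The mini-batch procedure above, sketched for the same Adaline model (the function name, shuffling scheme, and toy data are illustrative assumptions, not the course's code):

```python
import numpy as np

def sgd(x, t, eta, batch_size, epochs, seed=0):
    """Mini-batch SGD for the Adaline model y = w*x + b."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        order = rng.permutation(len(x))          # shuffle once per epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            xb, tb = x[idx], t[idx]
            y = w * xb + b
            m = len(xb)
            # gradient estimated on the mini-batch only, not the whole set
            w += (eta / m) * np.sum((tb - y) * xb)
            b += (eta / m) * np.sum(tb - y)
    return w, b

x = np.linspace(-1, 1, 100)
t = 2 * x + 1
w, b = sgd(x, t, eta=0.1, batch_size=10, epochs=50)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

Each epoch makes (size of training data) / (size of mini-batch) updates instead of one, which is where the speedup comes from.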
Backpropagation

 Using gradient descent we can optimize a single neuron (such as the
Adaline). However, can we train a network with multiple layers?
  Yes, but not yet! 
  Why?
Backpropagation

 Using gradient descent we can optimize a single neuron (such as the
Adaline). However, can we train a network with multiple layers?
  Yes, but not yet! 
  Why? We know the error at the last layer, so we can immediately update
the weights that affect that error. However, we do not know how much of
the error depends on the previous layers.

We need to backpropagate the error
Backpropagation
 Some notations:
  wᵢⱼˡ = the weight from neuron i in the l − 1 layer to neuron j in the l layer
  L = the last layer
  yᵢˡ = the activation of neuron i in the l layer
  bᵢˡ = the bias of neuron i in the l layer
  zᵢˡ = the net input of neuron i in the l layer:  zᵢˡ = Σⱼ wⱼᵢˡ · yⱼˡ⁻¹ + bᵢˡ

[Figure: a 4-layer network; the diagram labels the weights w₁₄³ … w₄₄³ going into
neuron 4 of layer 3, its net input z₄³ and activation y₄³, and the activations
y₁² … y₄² of layer 2]
Backpropagation

 We can adjust the error by adjusting the bias and the weights of each
neuron i in each layer l. This will change the net input from zᵢˡ to (zᵢˡ + Δzᵢˡ)
and the activation from σ(zᵢˡ) to σ(zᵢˡ + Δzᵢˡ)

 In this case, the cost function will change by ΔC = (∂C/∂zᵢˡ)·Δzᵢˡ

[Figure: a 4-layer network; perturbing the net input of one neuron by Δzᵢˡ
changes the cost from C to C + (∂C/∂zᵢˡ)·Δzᵢˡ]
Backpropagation

 We can minimize C by making ΔC = (∂C/∂zᵢˡ)·Δzᵢˡ negative:
  Δzᵢˡ must have the opposite sign to ∂C/∂zᵢˡ

Considering that Δzᵢˡ is a small number (since we want to make small changes), the
amount by which the error is reduced depends on how large ∂C/∂zᵢˡ is

If ∂C/∂zᵢˡ is close to zero, then the cost cannot be reduced much further

Thus, we can consider that ∂C/∂zᵢˡ represents a measure of the error of neuron i in
layer l

We will use δᵢˡ = ∂C/∂zᵢˡ to represent the error.

Note that we could have also defined the error with respect to the output yᵢˡ, but
the current variant results in fewer formulas
Backpropagation

 How the backpropagation algorithm works:

  Compute the error for each neuron at the last layer: δᵢᴸ
  For each layer l, starting from the last one down to the first:
    For each neuron i in layer l:
      Compute the error δᵢˡ
      Using this value, compute ∂C/∂bᵢˡ and ∂C/∂wᵢⱼˡ
      Back-propagate the error to the neurons in the previous layer and
repeat the above steps
Backpropagation

 Compute how the cost function depends on the error at the last layer

δᵢᴸ = ∂C/∂zᵢᴸ = (∂C/∂yᵢᴸ) · (∂yᵢᴸ/∂zᵢᴸ) = (∂C/∂yᵢᴸ) · σ′(zᵢᴸ)

[Figure: layer L−1 and the last layer; the net input zᵢᴸ produces the activation
yᵢᴸ, which enters the cost C]
Backpropagation

 Backpropagate the error (write the error with respect to the error in the next
layer)

δᵢˡ = ∂C/∂zᵢˡ = (∂C/∂yᵢˡ) · (∂yᵢˡ/∂zᵢˡ)
    = σ′(zᵢˡ) · Σₖ (∂C/∂zₖˡ⁺¹) · (∂zₖˡ⁺¹/∂yᵢˡ)
    = σ′(zᵢˡ) · Σₖ δₖˡ⁺¹ · wᵢₖˡ⁺¹

[Figure: neuron i in layer l, with activation yᵢˡ feeding through the weights
wᵢₖˡ⁺¹ into the net inputs zₖˡ⁺¹ of layer l + 1, which determine C]
Backpropagation

 Compute how the cost function depends on a weight

∂C/∂wₖᵢˡ = (∂C/∂zᵢˡ) · (∂zᵢˡ/∂wₖᵢˡ) = δᵢˡ · yₖˡ⁻¹

[Figure: the weight wₖᵢˡ connects the activation yₖˡ⁻¹ of neuron k in layer l − 1
to the net input zᵢˡ of neuron i in layer l]
Backpropagation

 Compute how the cost function depends on a bias

∂C/∂bᵢˡ = (∂C/∂zᵢˡ) · (∂zᵢˡ/∂bᵢˡ) = δᵢˡ · 1 = δᵢˡ

[Figure: the bias bᵢˡ feeds directly into the net input zᵢˡ]
Backpropagation

 Doing the math for σ′

dσ/dz = (1/(1 + e⁻ᶻ))′ = ((1 + e⁻ᶻ)⁻¹)′ = −(1 + e⁻ᶻ)′ · (1 + e⁻ᶻ)⁻² =
      = −(−e⁻ᶻ) / (1 + e⁻ᶻ)² = e⁻ᶻ / (1 + e⁻ᶻ)² = (1/(1 + e⁻ᶻ)) · (e⁻ᶻ/(1 + e⁻ᶻ)) =
      = σ(z) · ((1 + e⁻ᶻ − 1)/(1 + e⁻ᶻ)) = σ(z) · (1 − 1/(1 + e⁻ᶻ)) = σ(z) · (1 − σ(z))

So, for a logistic neuron:

σ′(zᵢˡ) = yᵢˡ · (1 − yᵢˡ)
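The identity σ′(z) = σ(z)·(1 − σ(z)) is easy to check numerically (a small sketch; the function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)        # sigma'(z) = sigma(z) * (1 - sigma(z))

z = np.array([-2.0, 0.0, 1.5])
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
print(np.allclose(sigmoid_prime(z), numeric, atol=1e-8))  # True
```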


Backpropagation

 Doing the math for δᵢᴸ

With C = (1/2) Σⱼ (tⱼ − yⱼᴸ)² for a single sample:

δᵢᴸ = (∂C/∂yᵢᴸ) · σ′(zᵢᴸ) = [d/dyᵢᴸ (1/2) Σⱼ (tⱼ − yⱼᴸ)²] · yᵢᴸ(1 − yᵢᴸ) =
    = yᵢᴸ(1 − yᵢᴸ) · (tᵢ − yᵢᴸ) · (−1) = yᵢᴸ(1 − yᵢᴸ)(yᵢᴸ − tᵢ)

δᵢᴸ = yᵢᴸ(1 − yᵢᴸ)(yᵢᴸ − tᵢ)
Backpropagation

 Putting it all together

0. Compute the error for the final layer:
   δᵢᴸ = yᵢᴸ(1 − yᵢᴸ)(yᵢᴸ − tᵢ)

Then, process the layers below:

1. Compute the error for the previous layer:
   δᵢˡ = yᵢˡ(1 − yᵢˡ) · Σₖ δₖˡ⁺¹ · wᵢₖˡ⁺¹

2. Compute the gradient for the weights in the current layer:
   ∂C/∂wₖᵢˡ = δᵢˡ · yₖˡ⁻¹

3. Compute the gradient for the biases in the current layer:
   ∂C/∂bᵢˡ = δᵢˡ

Repeat until we reach the input layer
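Steps 0-3 above can be sketched as a NumPy routine for a single training sample. This is a minimal illustration under my own conventions, not the course's reference code: layer sizes are arbitrary, and `weights[l]` is stored with shape (inputs, outputs) so that z = y @ W + b:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(weights, biases, x, t):
    """Gradients of C = 1/2 * sum_j (t_j - y_j)^2 for a single sample."""
    ys = [x]                                  # activations, layer by layer
    for W, b in zip(weights, biases):
        ys.append(sigmoid(ys[-1] @ W + b))    # forward pass, storing each y
    # Step 0: error at the final layer, delta_L = y(1-y)(y - t).
    delta = ys[-1] * (1 - ys[-1]) * (ys[-1] - t)
    grads_w, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grads_w.insert(0, np.outer(ys[l], delta))  # step 2: dC/dw_ki = delta_i * y_k
        grads_b.insert(0, delta)                   # step 3: dC/db_i  = delta_i
        if l > 0:  # step 1: backpropagate the error to the layer below
            delta = ys[l] * (1 - ys[l]) * (weights[l] @ delta)
    return grads_w, grads_b

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
biases = [rng.normal(size=3), rng.normal(size=2)]
x, t = rng.normal(size=4), np.array([0.0, 1.0])
gw, gb = backprop(weights, biases, x, t)
print([g.shape for g in gw])  # [(4, 3), (3, 2)] -- one gradient per weight matrix
```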
A network to read digits
A network to read digits

 We will train a feed forward network, using the backpropagation algorithm,
that can recognize handwritten digits

 The dataset can be downloaded from here:
http://deeplearning.net/data/mnist/mnist.pkl.gz
A network to read digits

 Each image is 28x28 in size and is represented as a vector of 784 pixels,
each pixel having an intensity

 We will use a network of 3 layers:

  784 neurons for the input layer
  36 neurons for the hidden layer
  10 neurons for the output layer

Each neuron in the output layer will activate for a certain digit. The digit
output by the network will be given by the output neuron that has the highest
confidence (the largest output activation)
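The forward pass and the arg-max decision for the 784-36-10 network can be sketched as follows (the weights below are random placeholders, not a trained model, so the prediction is meaningless; the helper name is mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_digit(image, w_hidden, b_hidden, w_out, b_out):
    """Forward pass through a 784-36-10 network; the predicted digit is
    the index of the output neuron with the largest activation."""
    hidden = sigmoid(image @ w_hidden + b_hidden)   # shape (36,)
    output = sigmoid(hidden @ w_out + b_out)        # shape (10,)
    return int(np.argmax(output))

rng = np.random.default_rng(0)
image = rng.random(784)  # stand-in for one 28x28 image flattened to intensities
w_hidden, b_hidden = rng.normal(size=(784, 36)), rng.normal(size=36)
w_out, b_out = rng.normal(size=(36, 10)), rng.normal(size=10)
digit = predict_digit(image, w_hidden, b_hidden, w_out, b_out)
print(digit)  # some digit in 0..9
```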
A network to read digits

 Training info:
  The learning rate used is η = 3.0
  Learning is performed using SGD with a mini-batch size of 10
  Training is performed for 30 iterations
  Training data consists of 50000 images
  Test data consists of 10000 images

 Results:
  The ANN correctly identifies approx. 95% of the test images
  A network made of 10 perceptrons detects approx. 83%

(watch demo)
Questions & Discussion
Bibliography

http://neuralnetworksanddeeplearning.com/
Chris Bishop, “Neural Networks for Pattern Recognition”
http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html
http://deeplearning.net/
