
The back-propagation algorithm

January 8, 2012

Ryan
The neuron

- The sigmoid equation is what is typically used as the transfer function between neurons. It is similar to the step function, but is continuous and differentiable.

-   σ(x) = 1 / (1 + e^{-x})    (1)

Figure: The Sigmoid Function (plot of σ(x) for x from -5 to 5)

- One useful property of this transfer function is the simplicity of computing its derivative. Let's do that now...
The derivative of the sigmoid transfer function

    d/dx σ(x) = d/dx [ 1 / (1 + e^{-x}) ]
              = e^{-x} / (1 + e^{-x})^2
              = ((1 + e^{-x}) − 1) / (1 + e^{-x})^2
              = (1 + e^{-x}) / (1 + e^{-x})^2 − (1 / (1 + e^{-x}))^2
              = σ(x) − σ(x)^2

    σ' = σ(1 − σ)
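As a quick numerical check of this identity (not part of the original derivation), the short Python sketch below compares σ'(x) = σ(x)(1 − σ(x)) against a finite-difference estimate of the derivative:

import math

def sigmoid(x):
    """The sigmoid transfer function σ(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid via the identity σ' = σ(1 − σ)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Central finite difference as an independent estimate of the derivative.
h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
    print(x, sigmoid_prime(x), numeric)  # the two columns should agree closely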
Single input neuron

Figure: A Single-Input Neuron (input ξ, weight ω, transfer function σ, output O)

The figure above shows a single neuron with only a single input. Without a bias term, the equation defining the figure is

    O = σ(ξω)

Adding a bias term θ gives

    O = σ(ξω + θ)
Multiple input neuron

Figure: A Multiple-Input Neuron (inputs ξ1, ξ2, ξ3 with weights ω1, ω2, ω3, a bias θ, a summation node, the transfer function σ, and output O)

Figure 3 is the diagram representing the following equation:

    O = σ(ω1 ξ1 + ω2 ξ2 + ω3 ξ3 + θ)
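To make the figure concrete, here is a minimal Python sketch of this forward computation; the particular weight, input, and bias values are hypothetical illustration values, not taken from the slides:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(weights, inputs, bias):
    """O = σ(ω1 ξ1 + ω2 ξ2 + ω3 ξ3 + θ), written for any number of inputs."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(activation)

# Hypothetical example: three inputs, three weights, one bias.
print(neuron_output(weights=[0.5, -0.3, 0.8], inputs=[1.0, 2.0, 0.5], bias=0.1))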
A neural network

Figure: A layer

Figure: A neural network (layers labeled I, J, and K)
The back propagation algorithm

Notation
- x_j^ℓ : Input to node j of layer ℓ
- W_ij^ℓ : Weight from layer ℓ−1 node i to layer ℓ node j
- σ(x) = 1 / (1 + e^{-x}) : Sigmoid transfer function
- θ_j^ℓ : Bias of node j of layer ℓ
- O_j^ℓ : Output of node j in layer ℓ
- t_j : Target value of node j of the output layer
The error calculation

Given a set of training targets t_k and output-layer outputs O_k, we can write the error as

    E = (1/2) Σ_{k∈K} (O_k − t_k)^2

We let the error of the network for a single training iteration be denoted by E. We want to calculate ∂E/∂W_jk, the rate of change of the error with respect to a given connection weight, so that we can minimize it.

Now we consider two cases: the node is an output node, or it is in a hidden layer...
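As a small sketch of this error for a single training example, assuming outputs and targets are plain Python lists (the values below are hypothetical):

def network_error(outputs, targets):
    """E = 1/2 · Σ_k (O_k − t_k)^2 over the output layer."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

# Hypothetical two-node output layer.
print(network_error(outputs=[0.8, 0.2], targets=[1.0, 0.0]))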
Output layer node

    ∂E/∂W_jk = ∂/∂W_jk [ (1/2) Σ_{k∈K} (O_k − t_k)^2 ]
             = (O_k − t_k) ∂O_k/∂W_jk
             = (O_k − t_k) ∂σ(x_k)/∂W_jk
             = (O_k − t_k) σ(x_k)(1 − σ(x_k)) ∂x_k/∂W_jk
             = (O_k − t_k) O_k (1 − O_k) O_j

For notation purposes I will define δ_k to be the expression (O_k − t_k) O_k (1 − O_k), so we can rewrite the equation above as

    ∂E/∂W_jk = O_j δ_k

where

    δ_k = O_k (1 − O_k)(O_k − t_k)
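A minimal Python sketch of this result, assuming O_j, O_k, and t_k are already known from a forward pass (the values below are hypothetical):

def output_delta(o_k, t_k):
    """δ_k = O_k (1 − O_k)(O_k − t_k)."""
    return o_k * (1.0 - o_k) * (o_k - t_k)

def output_weight_gradient(o_j, o_k, t_k):
    """∂E/∂W_jk = O_j δ_k."""
    return o_j * output_delta(o_k, t_k)

print(output_weight_gradient(o_j=0.6, o_k=0.8, t_k=1.0))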
Hidden layer node

    ∂E/∂W_ij = ∂/∂W_ij [ (1/2) Σ_{k∈K} (O_k − t_k)^2 ]
             = Σ_{k∈K} (O_k − t_k) ∂O_k/∂W_ij
             = Σ_{k∈K} (O_k − t_k) ∂σ(x_k)/∂W_ij
             = Σ_{k∈K} (O_k − t_k) σ(x_k)(1 − σ(x_k)) ∂x_k/∂W_ij
             = Σ_{k∈K} (O_k − t_k) O_k (1 − O_k) (∂x_k/∂O_j)(∂O_j/∂W_ij)
             = Σ_{k∈K} (O_k − t_k) O_k (1 − O_k) W_jk ∂O_j/∂W_ij
             = (∂O_j/∂W_ij) Σ_{k∈K} (O_k − t_k) O_k (1 − O_k) W_jk
             = O_j (1 − O_j) (∂x_j/∂W_ij) Σ_{k∈K} (O_k − t_k) O_k (1 − O_k) W_jk
             = O_j (1 − O_j) O_i Σ_{k∈K} (O_k − t_k) O_k (1 − O_k) W_jk

But, recalling our definition of δ_k, we can write this as

    ∂E/∂W_ij = O_i O_j (1 − O_j) Σ_{k∈K} δ_k W_jk

Similar to before, we will now define all terms besides O_i to be δ_j, so we have

    ∂E/∂W_ij = O_i δ_j
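The corresponding Python sketch, assuming the output-layer deltas δ_k and the weights W_jk leaving hidden node j are already available (the values below are hypothetical):

def hidden_delta(o_j, deltas_k, weights_jk):
    """δ_j = O_j (1 − O_j) Σ_k δ_k W_jk."""
    return o_j * (1.0 - o_j) * sum(d * w for d, w in zip(deltas_k, weights_jk))

def hidden_weight_gradient(o_i, o_j, deltas_k, weights_jk):
    """∂E/∂W_ij = O_i δ_j."""
    return o_i * hidden_delta(o_j, deltas_k, weights_jk)

print(hidden_weight_gradient(o_i=0.9, o_j=0.6, deltas_k=[0.05, -0.02], weights_jk=[0.4, -0.7]))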
How weights affect errors

For an output layer node k ∈ K

    ∂E/∂W_jk = O_j δ_k

where

    δ_k = O_k (1 − O_k)(O_k − t_k)

For a hidden layer node j ∈ J

    ∂E/∂W_ij = O_i δ_j

where

    δ_j = O_j (1 − O_j) Σ_{k∈K} δ_k W_jk
What about the bias?

If we incorporate the bias term θ into the equation, you will find that

    ∂O/∂θ = O(1 − O) ∂θ/∂θ

and because ∂θ/∂θ = 1, we can view the bias term as a weight attached to a node whose output is always one.

This holds for any layer ℓ we are concerned with; substituting into the previous equations gives us

    ∂E/∂θ_ℓ = δ_ℓ

(because a constant output of 1 replaces O_{ℓ−1}, the output from the "previous layer").
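In code, this observation simply means that the bias gradient is the node's delta itself, i.e. the usual weight gradient with the "previous layer" output fixed at 1. A minimal sketch with a hypothetical delta value:

def bias_gradient(delta_node):
    """∂E/∂θ = δ for the node: a weight gradient whose input is the constant 1."""
    constant_input = 1.0  # the bias behaves like a weight from a node that always outputs 1
    return constant_input * delta_node

print(bias_gradient(0.05))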
The back propagation algorithm

1. Run the network forward with your input data to get the network output.
2. For each output node compute

       δ_k = O_k (1 − O_k)(O_k − t_k)

3. For each hidden node calculate

       δ_j = O_j (1 − O_j) Σ_{k∈K} δ_k W_jk

4. Update the weights and biases as follows. Given

       ∆W = −η δ_ℓ O_{ℓ−1}
       ∆θ = −η δ_ℓ

   apply

       W + ∆W → W
       θ + ∆θ → θ
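Putting the four steps together, here is a minimal NumPy sketch of one training iteration for a network with a single hidden layer (layers I → J → K). All sizes, inputs, targets, and initial weights are made-up illustration values; the variable names follow the notation of these slides rather than any particular library:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
eta = 0.5                              # learning rate η
O_i = np.array([0.3, 0.7])             # outputs of the input layer I
t = np.array([1.0, 0.0])               # targets for the output layer K

W_ij = rng.normal(size=(2, 3))         # weights from layer I to layer J
theta_j = np.zeros(3)                  # biases of layer J
W_jk = rng.normal(size=(3, 2))         # weights from layer J to layer K
theta_k = np.zeros(2)                  # biases of layer K

# 1. Run the network forward.
O_j = sigmoid(O_i @ W_ij + theta_j)
O_k = sigmoid(O_j @ W_jk + theta_k)

# 2. Output-layer deltas: δ_k = O_k (1 − O_k)(O_k − t_k).
delta_k = O_k * (1 - O_k) * (O_k - t)

# 3. Hidden-layer deltas: δ_j = O_j (1 − O_j) Σ_k δ_k W_jk.
delta_j = O_j * (1 - O_j) * (W_jk @ delta_k)

# 4. Update weights and biases: ∆W = −η δ_ℓ O_{ℓ−1}, ∆θ = −η δ_ℓ.
W_jk += -eta * np.outer(O_j, delta_k)
theta_k += -eta * delta_k
W_ij += -eta * np.outer(O_i, delta_j)
theta_j += -eta * delta_j

print(0.5 * np.sum((O_k - t) ** 2))    # the error E for this iteration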
