2024 MTH058 Lecture02 Backpropagation

Backpropagation is an algorithm used to train neural networks by propagating errors backwards from the output to minimize loss. It first propagates inputs forward and calculates outputs, then calculates error of output compared to target using loss function. Errors are then propagated backwards to update weights, reducing errors and improving predictions through multiple iterations.


BACK-PROPAGATION

IN NEURAL NETWORKS

Nguyễn Ngọc Thảo


[email protected]
Outline
• Artificial neural networks
• Perceptron
• Multi-layer perceptron
• Back-propagation algorithm

Artificial neural networks
What is a neural network?
• The biological neural network (NN) is a reasoning model based on the human brain.
• An adult human brain has approximately 86 billion neurons; estimates of the number of synapses range from 100 to 500 trillion.
• It is a highly complex, nonlinear, and parallel information-processing system.
• Learning through experience is an essential characteristic.
• Plasticity: connections leading to the “right answer” are strengthened, while those leading to the “wrong answer” are weakened.

Biological neural network
• There are attempts to emulate the biological neural network in the computer, resulting in artificial neural networks (ANNs).
• ANNs resemble only the learning mechanisms of the brain, not its architecture.
• Megatron-Turing NLG: 530 billion parameters; GPT-3: 175 billion.
ANN: Network architecture
• An ANN has many neurons, arranged in a hierarchy of layers.
• Each neuron is an elementary information-processing unit.
• ANNs improve performance through experience and generalization.
ANN: Applications

ANN: Neurons and Signals
• Each neuron receives several input signals through its connections and produces at most a single output signal.
• Each connection carries a weight, expressing the strength of the input.
• The set of weights is the long-term memory in an ANN → the learning process iteratively adjusts the weights.
Artificial neuron vs. biological neuron

Biological neuron    Artificial neuron
Soma                 Neuron
Dendrite             Input
Axon                 Output
Synapse              Weight

Source: The Asimov Institute


How to build an ANN?
• The network architecture must be decided first.
  • How many neurons are to be used?
  • How are the neurons to be connected to form a network?
• Then determine which learning algorithm to use:
  • Supervised / semi-supervised / unsupervised / reinforcement learning
• And finally train the neural network:
  • How to initialize the weights of the network?
  • How to update them from a set of training examples?
Perceptron
Perceptron (Frank Rosenblatt, 1958)
• A perceptron has a single neuron with adjustable synaptic
weights and a hard limiter.

A single-layer two-input perceptron


How does a perceptron work?
• Divide the n-dimensional space into two decision regions by a hyperplane defined by the linearly separable function

  y = \sum_{i=1}^{n} x_i w_i - \theta
Perceptron learning rule
• Step 1 – Initialization: Initial weights w_1, w_2, \ldots, w_n are randomly assigned to small numbers (usually in [-0.5, 0.5], but not restricted to this range).

• Step 2 – Activation: At iteration p, apply the p-th example, which has inputs x_1(p), x_2(p), \ldots, x_n(p) and desired output Y_d(p), and calculate the actual output

  Y(p) = \sigma\left( \sum_{i=1}^{n} x_i(p) w_i(p) - \theta \right), \qquad \sigma(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}

  where n is the number of perceptron inputs and \sigma is the step activation function.

• Step 3 – Weight training
  • Update the weights: w_i(p+1) = w_i(p) + \Delta w_i(p), where \Delta w_i(p) is the weight correction at iteration p.
  • The delta rule determines how to adjust the weights: \Delta w_i(p) = \eta \times x_i(p) \times e(p), where \eta is the learning rate (0 < \eta < 1) and e(p) = Y_d(p) - Y(p).

• Step 4 – Iteration: Increase iteration p by one, go back to Step 2, and repeat the process until convergence.
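The four steps above can be sketched in a few lines of NumPy, here on the logical AND problem. The threshold θ = 0.2 and learning rate η = 0.1 come from the AND example discussed on the next slide; the seed and the epoch cap are illustrative choices.

```python
import numpy as np

# A minimal sketch of the perceptron learning rule (Steps 1-4 above)
# on the logical AND problem.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_d = np.array([0, 0, 0, 1])          # desired outputs for AND

w = rng.uniform(-0.5, 0.5, size=2)    # Step 1: small random weights
theta = 0.2                           # threshold (from the AND example)
eta = 0.1                             # learning rate

step = lambda x: 1 if x >= 0 else 0   # hard-limiter activation

for epoch in range(100):              # Step 4: iterate until convergence
    errors = 0
    for x, y_d in zip(X, Y_d):
        y = step(np.dot(x, w) - theta)    # Step 2: actual output
        e = y_d - y                       # error e(p)
        w += eta * x * e                  # Step 3: delta rule
        errors += abs(e)
    if errors == 0:                   # converged: all examples correct
        break

print([step(np.dot(x, w) - theta) for x in X])  # → [0, 0, 0, 1]
```

Since AND is linearly separable for this fixed threshold, the perceptron convergence theorem guarantees the loop terminates with a correct classifier.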
Perceptron for the logical AND/OR
• A single-layer perceptron can learn the AND/OR operations.
• The learning of logical AND converged after several iterations (threshold θ = 0.2, learning rate η = 0.1).
Perceptron for the logical XOR
• It cannot be trained to perform the exclusive-OR (XOR).
• Generally, a perceptron can classify only linearly separable patterns, regardless of the activation function used.
• Research works: Shynk, 1990; Shynk and Bershad, 1992.
Perceptron: An example
Suppose there is a high-tech exhibition in the city, and you are thinking about whether to go there. Your decision relies on the below factors:
• Is the weather good?
• Does your friend want to accompany you?
• Is the festival near public transit? (You don't own a car.)

The inputs are: weather, friend wants to go, near public transit.

• w_1 = 6, w_2 = 2, w_3 = 2 → the weather matters to you much more than whether your friend joins you, or the nearness of public transit
• θ = 5 → decisions are made based on the weather only
• θ = 3 → you go to the festival whenever the weather is good, or when both the festival is near public transit and your friend wants to join you
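This decision perceptron can be checked exhaustively: with weights w = (6, 2, 2) and a threshold θ, the output is 1 exactly when the weighted sum of the three binary inputs reaches θ. The sketch below enumerates all eight input combinations.

```python
# A quick check of the exhibition-decision perceptron above: three
# binary inputs (weather, friend, transit) with weights w = (6, 2, 2),
# compared against a threshold theta.
from itertools import product

w = (6, 2, 2)

def decide(x, theta):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# theta = 5: only the weather matters
assert all(decide(x, 5) == x[0] for x in product([0, 1], repeat=3))

# theta = 3: good weather, or friend AND transit together
for x in product([0, 1], repeat=3):
    weather, friend, transit = x
    assert decide(x, 3) == (1 if weather or (friend and transit) else 0)
print("both threshold settings behave as described")
```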
Quiz 01: Perceptron
• Consider the following neural network which receives binary input
values, 𝑥1 and 𝑥2 , and produces a single binary value.

• For every combination (𝑥1 , 𝑥2 ), what are the output values at neurons, 𝐴,
𝐵 and 𝐶?
Multi-layer perceptron
Multi-layer perceptron (MLP)

[Figure: input signals flow forward through a first and a second hidden layer to the output signals.]

• It is a fully connected feedforward network with at least three layers.
• Idea: Map certain input to a specified target value using a cascade of nonlinear transformations.
Learning algorithm: Back-propagation
• The input signals are propagated forward on a layer-by-layer basis.
• The error signals are propagated backward from the output layer to the input layer.
Back-propagation learning rule
• Step 1 – Initialization: Initial weights are assigned to random numbers.
  • The numbers may be uniformly distributed in the range \left( -\frac{2.4}{F_i}, +\frac{2.4}{F_i} \right) (Haykin, 1999), where F_i is the total number of inputs of neuron i.
  • The weight initialization is done on a neuron-by-neuron basis.

• Step 2 – Activation: At iteration p, apply the p-th example, which has inputs x_1(p), x_2(p), \ldots, x_n(p) and desired outputs y_{d,1}(p), y_{d,2}(p), \ldots, y_{d,l}(p).
  • (a) Calculate the actual output of neuron j in the hidden layer, from its n inputs:

    y_j(p) = \sigma\left( \sum_{i=1}^{n} x_i(p) w_{ij}(p) + b_j \right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}

  • (b) Calculate the actual output of neuron k in the output layer, from its m hidden inputs:

    y_k(p) = \sigma\left( \sum_{j=1}^{m} y_j(p) w_{jk}(p) + b_k \right)

  • b_j and b_k are the biases at neurons j and k, respectively.
Back-propagation learning rule
• Step 3 – Weight training: Update the weights in the back-propagation network and propagate backward the errors associated with output neurons.
  • (a) Calculate the error signal being back-propagated for neuron k in the output layer:

    \delta_k(p) = y_k(p) \times [1 - y_k(p)] \times [y_{d,k}(p) - y_k(p)]

    where y_{d,k}(p) - y_k(p) is the error and \delta_k(p) is the error gradient.
    Calculate the weight corrections: \Delta w_{jk}(p) = \eta \times y_j(p) \times \delta_k(p)
    Update the weights at the output neurons: w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)
  • (b) Calculate the error gradient for neuron j in the hidden layer:

    \delta_j(p) = y_j(p) \times [1 - y_j(p)] \times \sum_{k=1}^{l} \delta_k(p) w_{jk}(p)

    Calculate the weight corrections: \Delta w_{ij}(p) = \eta \times x_i(p) \times \delta_j(p)
    Update the weights at the hidden neurons: w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)

• Step 4 – Iteration: Increase iteration p by one, go back to Step 2, and repeat the process until the selected error criterion is satisfied.
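Steps 1-4 can be sketched on the XOR problem with one hidden layer of two sigmoid neurons and per-example (online) updates. The hyperparameters (η, seed, epoch count) are illustrative choices, not values from the slides.

```python
import numpy as np

# A minimal sketch of the back-propagation rule (Steps 1-4 above)
# trained on XOR: a 2-2-1 network of sigmoid neurons, online updates.
rng = np.random.default_rng(1)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y_d = np.array([0., 1., 1., 0.])

W1 = rng.uniform(-1, 1, (2, 2)); b1 = rng.uniform(-1, 1, 2)  # Step 1
W2 = rng.uniform(-1, 1, 2);      b2 = rng.uniform(-1, 1)
eta = 0.5

def sse():
    out = sigma(sigma(X @ W1 + b1) @ W2 + b2)
    return float(np.sum((Y_d - out) ** 2))

sse_before = sse()
for _ in range(5000):                           # Step 4: iterate
    for x, y_d in zip(X, Y_d):
        y_h = sigma(x @ W1 + b1)                # Step 2a: hidden outputs
        y_o = sigma(y_h @ W2 + b2)              # Step 2b: network output
        d_o = y_o * (1 - y_o) * (y_d - y_o)     # Step 3a: output error gradient
        d_h = y_h * (1 - y_h) * W2 * d_o        # Step 3b: hidden error gradients
        W2 = W2 + eta * y_h * d_o;  b2 = b2 + eta * d_o
        W1 = W1 + eta * np.outer(x, d_h); b1 = b1 + eta * d_h

sse_after = sse()
print(sse_before, "->", sse_after)   # the SSE should drop during training
```

As the next slides note, convergence to the global minimum is not guaranteed; a run can settle in a local minimum, but the SSE still decreases from its initial value.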
Back-propagation network for XOR
• The logical XOR problem took
224 epochs or 896 iterations
for network training.

Sum of the squared errors (SSE)
• When the SSE over an entire pass through all training examples is sufficiently small, the network is deemed to have converged.

Learning curve for the logical operation XOR.
Sigmoid neuron vs. Perceptron
• A sigmoid neuron better reflects the fact that small changes in weights and biases cause only a small change in the output.
• A sigmoidal function is a smoothed-out version of a step function.
About back-propagation learning
• Do randomly initialized weights and thresholds lead to different solutions?
  • Starting from different initial conditions yields different weight and threshold values, and the problem is solved within different numbers of iterations.
• Back-propagation learning cannot be viewed as an emulation of brain-like learning.
  • Biological neurons do not work backward to adjust the strengths of their interconnections (synapses).
• The training is slow due to extensive calculations.
  • Improvements: Caudill, 1991; Jacobs, 1988; Stubbs, 1990.
Gradient descent method
• Consider two parameters, w_1 and w_2, in a network, and the error surface of a function f over them (the colors represent the value of f; the optimum \theta^* is its minimum).
• Randomly pick a starting point \theta^0.
• Compute the negative gradient at \theta^0: -\nabla f(\theta^0), where

  \nabla f(\theta^0) = \left[ \partial f(\theta^0)/\partial w_1, \; \partial f(\theta^0)/\partial w_2 \right]^T

• Multiply it by the learning rate \eta: -\eta \nabla f(\theta^0).
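The step above can be sketched on a made-up two-parameter error surface; here f(w_1, w_2) = (w_1 - 1)^2 + (w_2 + 2)^2, and the surface, starting point, and learning rate are all illustrative assumptions.

```python
import numpy as np

# A small sketch of gradient descent on an illustrative quadratic
# error surface f(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2.
def grad_f(theta):
    w1, w2 = theta
    return np.array([2 * (w1 - 1), 2 * (w2 + 2)])  # gradient of f at theta

theta = np.array([4.0, 3.0])   # theta^0: an arbitrary starting point
eta = 0.1                      # learning rate

for _ in range(200):           # theta^{t+1} = theta^t - eta * grad f(theta^t)
    theta = theta - eta * grad_f(theta)

print(theta)  # converges toward the minimum at (1, -2)
```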
Gradient descent method
• Consider two parameters, w_1 and w_2, in a network.
• Repeat the two steps: compute the negative gradient, scale it by the learning rate, and move.

  \theta^1 = \theta^0 - \eta \nabla f(\theta^0), \quad \theta^2 = \theta^1 - \eta \nabla f(\theta^1), \; \ldots

• Eventually, we reach a minimum.

[Figure: the error surface, with successive steps -\eta \nabla f(\theta^0), -\eta \nabla f(\theta^1), -\eta \nabla f(\theta^2) tracing a path toward the minimum.]
Gradient descent method
• Gradient descent never guarantees the global minimum.

[Figure: starting from different initial points \theta^0, gradient descent reaches different minima of f, and thus gives different results.]
Gradient descent method
• It also has issues at plateaus and saddle points.
  • Very slow at a plateau, where \nabla f(\theta) \approx 0
  • Stuck at a saddle point, where \nabla f(\theta) = 0
  • Stuck at a local minimum, where \nabla f(\theta) = 0
Accelerated learning in ANNs
• Use tanh instead of sigmoid: represent the sigmoidal function by a hyperbolic tangent

  Y^{\tanh} = \frac{2a}{1 + e^{-bX}} - a

  where a = 1.716 and b = 0.667 (Guyon, 1991).
Accelerated learning in ANNs
• Generalized delta rule: A momentum term is included in the delta rule (Rumelhart et al., 1986):

  \Delta w_{jk}(p) = \beta \times \Delta w_{jk}(p-1) + \eta \times y_j(p) \times \delta_k(p)

  where \beta = 0.95 is the momentum constant (0 \le \beta \le 1).

• What if we bring the momentum of the physical world into gradient descent?
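Gradient descent with a momentum term can be sketched in the same way: the update mixes the previous step (scaled by β) with the current negative gradient. The quadratic surface and the hyperparameters below are illustrative assumptions.

```python
import numpy as np

# A sketch of gradient descent with momentum, mirroring the generalized
# delta rule above, on an illustrative quadratic error surface.
def grad_f(theta):
    w1, w2 = theta
    return np.array([2 * (w1 - 1), 2 * (w2 + 2)])

theta = np.array([4.0, 3.0])
velocity = np.zeros(2)
eta, beta = 0.1, 0.9

for _ in range(300):
    velocity = beta * velocity - eta * grad_f(theta)  # momentum + negative gradient
    theta = theta + velocity                          # real movement

print(theta)  # approaches the minimum at (1, -2)
```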
Accelerated learning in ANNs
• Momentum still does not guarantee reaching the global minimum, but it gives some hope.

[Figure: on the cost curve, the real movement is the sum of the negative of the gradient and the momentum; at a point where the gradient = 0, the accumulated momentum can carry the movement onward.]
Accelerated learning in ANNs
• Adaptive learning rate: Adjust the learning rate parameter \eta during training.
  • Small \eta → small weight changes through iterations → smooth learning curve
  • Large \eta → speeds up the training process with larger weight changes → possible instability and oscillation
• Heuristic-like approaches for adjusting \eta:
  1. The algebraic sign of the SSE change remains the same for several consecutive epochs → increase \eta.
  2. The algebraic sign of the SSE change alternates for several consecutive epochs → decrease \eta.
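The two sign-of-SSE heuristics can be sketched as follows. The growth/decay factors (1.05, 0.7), the window of "several" epochs (here 3), and the toy SSE trace are all illustrative assumptions, not values from the slides.

```python
# A sketch of the adaptive learning rate heuristics above: grow eta while
# the SSE keeps changing in the same direction; shrink it when the SSE
# change starts alternating in sign.
eta = 0.1
signs = []  # sign of the SSE change per epoch

def adapt(eta, sse_prev, sse_curr, signs, window=3):
    signs.append(1 if sse_curr - sse_prev > 0 else -1)
    recent = signs[-window:]
    if len(recent) == window:
        if all(s == recent[0] for s in recent):           # same sign persists
            eta *= 1.05                                   # -> increase eta
        elif all(recent[i] != recent[i + 1] for i in range(window - 1)):
            eta *= 0.7                                    # alternates -> decrease eta
    return eta

sse = [5.0, 4.0, 3.2, 2.9, 3.1, 2.8, 3.0]  # a made-up SSE trace
for prev, curr in zip(sse, sse[1:]):
    eta = adapt(eta, prev, curr, signs)
print(eta)
```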
Learning with momentum only
Learning with momentum for the logical operation XOR.
Learning with adaptive η only

Learning with adaptive learning rate for the logical operation XOR.
Learning with adaptive η and momentum
Universality Theorem
• Any continuous function 𝑓: 𝑅𝑁 → 𝑅𝑀 can be realized by a
network with one hidden layer, given enough hidden neurons.
• More explanation can be found here.
• In general, the more parameters, the better performance.

• However,…

Universality Theorem
• Deep networks are empirically better at solving many real-world problems.

  Deep networks                       Shallow networks
  Layer × Size   Word Error Rate (%)  Layer × Size   Word Error Rate (%)
  1 × 2k         24.2
  2 × 2k         20.4
  3 × 2k         18.4
  4 × 2k         17.8
  5 × 2k         17.2                 1 × 3772       22.5
  7 × 2k         17.1                 1 × 4634       22.6
                                      1 × 16k        22.1

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.
Quiz 02: Multi-layer perceptron
• Consider the below feedforward network with one hidden layer of units.

• If the network is tested with an input vector 𝑥 = 1.0, 2.0, 3.0 then what
are the activation 𝐻1 of the first hidden neuron and the activation 𝐼3 of the
third output neuron?
Quiz 02: Forward the input signals
• The input vector to the network is x = (x_1, x_2, x_3)^T.
• The vector of hidden layer outputs is y = (y_1, y_2)^T.
• The vector of actual outputs is z = (z_1, z_2, z_3)^T.
• The vector of desired outputs is t = (t_1, t_2, t_3)^T.
• The network has the following weight vectors:

  v_1 = (-2.0, 2.0, -2.0)^T, \quad v_2 = (1.0, 1.0, -1.0)^T
  w_1 = (1.0, -3.5)^T, \quad w_2 = (0.5, -1.2)^T, \quad w_3 = (0.3, 0.6)^T

• Assume that all units use the sigmoid activation function and zero biases.
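The forward pass for this quiz can be checked in a few lines. Note that the grouping of the printed numbers on this slide into v_1, v_2, w_1, w_2, w_3 is one plausible reading of the original layout, so the concrete values below are an assumption.

```python
import numpy as np

# A sketch of the Quiz 02 forward pass for x = (1.0, 2.0, 3.0),
# sigmoid units, zero biases.
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 2.0, 3.0])
V = np.array([[-2.0, 2.0, -2.0],    # v1: weights into hidden unit 1
              [ 1.0, 1.0, -1.0]])   # v2: weights into hidden unit 2
W = np.array([[1.0, -3.5],          # w1: weights into output unit 1
              [0.5, -1.2],          # w2: weights into output unit 2
              [0.3,  0.6]])         # w3: weights into output unit 3

y = sigma(V @ x)    # hidden activations (zero biases)
z = sigma(W @ y)    # output activations

print(y[0])  # H1 = sigma(-2*1 + 2*2 - 2*3) = sigma(-4) ≈ 0.018
print(z[2])  # I3 = sigma(0.3*H1 + 0.6*H2)
```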
Quiz 03: Backpropagation error signals
• The figure shows part of the network described on the previous slide.
• Use the same weights, activation functions and bias values as described.
• A new input pattern is presented to the network and training proceeds as follows. The actual outputs are given by z = (0.15, 0.36, 0.57)^T and the corresponding target outputs are given by t = (1.0, 1.0, 1.0)^T.
• The weights w_{12}, w_{22} and w_{32} are also shown.
• What is the error for each of the output units?
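The output error signals follow directly from the back-propagation rule for sigmoid units, \delta_k = z_k (1 - z_k)(t_k - z_k):

```python
import numpy as np

# A sketch of the Quiz 03 output error signals, using the sigmoid
# delta from the back-propagation rule: delta_k = z_k * (1 - z_k) * (t_k - z_k).
z = np.array([0.15, 0.36, 0.57])   # actual outputs
t = np.array([1.0, 1.0, 1.0])      # target outputs

delta = z * (1 - z) * (t - z)      # error signal per output unit
print(delta)
```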
Back-propagation
algorithm
Back-propagation algorithm
• Consider an MLP with one hidden layer.
• Note the following notations:
  • a_i: the output value of node i in the input layer
  • z_j: the input value to node j in layer h
  • g_j: the activation function for node j in layer h (applied to z_j)
  • a_j = g_j(z_j): the output value of node j in layer h
  • b_j: the bias/offset for unit j in layer h
  • w_{ij}: the weights connecting node i in layer (h - 1) to node j in layer h
  • t_k: the target value for node k in the output layer
Forward the signal
[Figure: signals flow from the input layer through the hidden layer to the output layer. Bias units are not shown.]
The error function
• Training a neural network entails finding parameters \theta = \{W, b\} that minimize the errors.
• The error function is usually the sum of the squared errors between the target values t_k and the network outputs a_k:

  E = \frac{1}{2} \sum_{k=1}^{l} (a_k - t_k)^2

• l is the dimensionality of the target for a single observation.
• This parameter optimization problem can be solved using gradient descent, computing \partial E / \partial \theta for all \theta.
Update the output layer parameters
• Calculating the gradient of the error function with respect to those parameters is straightforward with the chain rule.

  \frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}} \frac{1}{2} \sum_k (a_k - t_k)^2
  = (a_k - t_k) \frac{\partial}{\partial w_{jk}} (a_k - t_k)
  = (a_k - t_k) \frac{\partial a_k}{\partial w_{jk}}  \qquad \text{since } \frac{\partial t_k}{\partial w_{jk}} = 0
  = (a_k - t_k) \frac{\partial}{\partial w_{jk}} g_k(z_k)  \qquad \text{since } a_k = g_k(z_k)
  = (a_k - t_k)\, g_k'(z_k) \frac{\partial z_k}{\partial w_{jk}}
Update the output layer parameters
• Recall that z_k = b_k + \sum_j g_j(z_j) w_{jk}, and hence \frac{\partial z_k}{\partial w_{jk}} = g_j(z_j) = a_j.

• Then,

  \frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\, g_k'(z_k)\, a_j

  i.e., the difference between the network output a_k and the target value t_k, times the derivative of the activation function at z_k, times the output a_j of node j from the hidden layer feeding into the output layer.

• The common activation function is the sigmoid function

  g(z) = \frac{1}{1 + e^{-z}}

  whose derivative is g'(z) = g(z)\,(1 - g(z)).
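The sigmoid derivative identity g'(z) = g(z)(1 - g(z)) is easy to verify numerically against a central finite difference:

```python
import numpy as np

# A quick numerical check of g'(z) = g(z) * (1 - g(z)) for the sigmoid,
# comparing the closed form against a central finite difference.
g = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
analytic = g(z) * (1 - g(z))
h = 1e-6
numeric = (g(z + h) - g(z - h)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # the two agree closely
```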
Update the output layer parameters
• Let \delta_k = (a_k - t_k)\, g_k'(z_k) be the error signal after being back-propagated through the output activation function g_k.
• The delta form of the error function gradient for the output layer weights is

  \frac{\partial E}{\partial w_{jk}} = \delta_k a_j

• The gradient descent update rule for the output layer weights is

  w_{jk} \leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}}  \qquad (\eta \text{ is the learning rate})
  \leftarrow w_{jk} - \eta\, \delta_k a_j
  \leftarrow w_{jk} - \eta\, (a_k - t_k)\, g_k(z_k)\,(1 - g_k(z_k))\, a_j

• Apply similar update rules for the remaining parameters w_{jk}.
Update the output layer biases
• The gradient for the biases is simply the back-propagated error signal \delta_k:

  \frac{\partial E}{\partial b_k} = (a_k - t_k)\, g_k'(z_k) \cdot (1) = \delta_k

• Each bias is updated as b_k \leftarrow b_k - \eta\, \delta_k.

• Note that \frac{\partial z_k}{\partial b_k} = \frac{\partial}{\partial b_k} \left( b_k + \sum_j g_j(z_j) w_{jk} \right) = 1.
  • The biases are weights on activations that are always equal to one, regardless of the feed-forward signal.
  • Thus, the bias gradients aren't affected by the feed-forward signal, only by the error.
Update the hidden layer parameters
• The process starts just the same as for the output layer.

  \frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \frac{1}{2} \sum_k (a_k - t_k)^2 = \sum_k (a_k - t_k) \frac{\partial a_k}{\partial w_{ij}}

• Applying the chain rule again, we obtain:

  \frac{\partial E}{\partial w_{ij}} = \sum_k (a_k - t_k) \frac{\partial}{\partial w_{ij}} g_k(z_k)  \qquad \text{since } a_k = g_k(z_k)
  = \sum_k (a_k - t_k)\, g_k'(z_k) \frac{\partial z_k}{\partial w_{ij}}
Update the hidden layer parameters
• The term z_k can be expanded as follows:

  z_k = b_k + \sum_j a_j w_{jk} = b_k + \sum_j g_j(z_j) w_{jk}  \qquad \text{since } a_j = g_j(z_j)
  = b_k + \sum_j g_j\!\left( b_j + \sum_i a_i w_{ij} \right) w_{jk}  \qquad \text{since } z_j = b_j + \sum_i a_i w_{ij}

• Again, use the chain rule to calculate \frac{\partial z_k}{\partial w_{ij}}:

  \frac{\partial z_k}{\partial w_{ij}} = \frac{\partial z_k}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}} = w_{jk} \frac{\partial g_j(z_j)}{\partial w_{ij}} = w_{jk}\, g_j'(z_j) \frac{\partial z_j}{\partial w_{ij}} = w_{jk}\, g_j'(z_j) \frac{\partial}{\partial w_{ij}} \left( b_j + \sum_i a_i w_{ij} \right) = w_{jk}\, g_j'(z_j)\, a_i
Update the hidden layer parameters
• Thus,

  \frac{\partial E}{\partial w_{ij}} = \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j)\, a_i = \sum_k \delta_k w_{jk}\, g_j'(z_j)\, a_i

  i.e., the error term \delta_k, times the weight w_{jk}, times the derivative of the activation function at z_j, times the output activation signal a_i from the layer below.

• Let \delta_j = g_j'(z_j) \sum_k \delta_k w_{jk} denote the resulting error signal back-propagated to layer j.
• The error function gradient for the hidden layer weights is

  \frac{\partial E}{\partial w_{ij}} = \delta_j a_i

• To calculate the weight gradients at any layer l, we calculate the back-propagated error signal \delta_l that reaches that layer from the layers after it, and weight it by the feed-forward signal at l - 1 feeding into that layer.
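The derived gradients \partial E/\partial w_{jk} = \delta_k a_j and \partial E/\partial w_{ij} = \delta_j a_i can be checked against numerical differentiation on a tiny network. The 2-3-2 architecture, random values, and sampled entries below are illustrative assumptions.

```python
import numpy as np

# A sketch checking the derived back-propagation gradients against
# central finite differences, for a tiny 2-3-2 sigmoid network.
rng = np.random.default_rng(0)
g = lambda z: 1.0 / (1.0 + np.exp(-z))

a_in = rng.normal(size=2)            # input activations a_i
W1 = rng.normal(size=(2, 3)); b1 = rng.normal(size=3)
W2 = rng.normal(size=(3, 2)); b2 = rng.normal(size=2)
t = np.array([0.0, 1.0])             # target values t_k

def loss(W1, W2):
    a_h = g(a_in @ W1 + b1)          # hidden activations a_j
    a_out = g(a_h @ W2 + b2)         # output activations a_k
    return 0.5 * np.sum((a_out - t) ** 2)

# Back-propagated gradients, as derived above
z_h = a_in @ W1 + b1; a_h = g(z_h)
z_out = a_h @ W2 + b2; a_out = g(z_out)
delta_k = (a_out - t) * a_out * (1 - a_out)   # output error signal
delta_j = a_h * (1 - a_h) * (W2 @ delta_k)    # hidden error signal
dW2 = np.outer(a_h, delta_k)                  # dE/dw_jk = delta_k * a_j
dW1 = np.outer(a_in, delta_j)                 # dE/dw_ij = delta_j * a_i

# Numerical gradient for one sample entry of each weight matrix
eps = 1e-6
E1 = np.zeros_like(W1); E1[0, 0] = eps
num1 = (loss(W1 + E1, W2) - loss(W1 - E1, W2)) / (2 * eps)
E2 = np.zeros_like(W2); E2[0, 0] = eps
num2 = (loss(W1, W2 + E2) - loss(W1, W2 - E2)) / (2 * eps)

print(abs(dW1[0, 0] - num1), abs(dW2[0, 0] - num2))  # both differences tiny
```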
Update the hidden layer parameters
• The gradient descent update rule for the hidden layer weights is

  w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}
  \leftarrow w_{ij} - \eta\, \delta_j a_i
  \leftarrow w_{ij} - \eta\, g_j'(z_j) \left( \sum_k \delta_k w_{jk} \right) a_i
  \leftarrow w_{ij} - \eta \sum_k \left[ (a_k - t_k)\, g_k(z_k)\,(1 - g_k(z_k))\, w_{jk} \right] g_j(z_j)\,(1 - g_j(z_j))\, a_i

• Apply similar update rules for the remaining parameters w_{ij}.
Update the hidden layer biases
• Calculating the error gradients with respect to the hidden layer biases b_j follows a very similar procedure to that for the hidden layer weights.

  \frac{\partial E}{\partial b_j} = \sum_k (a_k - t_k) \frac{\partial}{\partial b_j} g_k(z_k) = \sum_k (a_k - t_k)\, g_k'(z_k) \frac{\partial z_k}{\partial b_j}

• Apply the chain rule to solve \frac{\partial z_k}{\partial b_j} = w_{jk}\, g_j'(z_j) \cdot (1).

• The gradient for the biases is the back-propagated error signal \delta_j:

  \frac{\partial E}{\partial b_j} = \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j) = g_j'(z_j) \sum_k \delta_k w_{jk} = \delta_j

• Each bias is updated as b_j \leftarrow b_j - \eta\, \delta_j.
The role of bias
• Bias is like the intercept added in a linear equation; it allows shifting the activation function to either the right or the left.
• The figure illustrates the effect of the weight and the bias on the shape of the activation function.

Video credit: Neural networks and Deep learning
The role of bias
• A bias is necessary to reach any result in some cases.

Left: the non-biased neuron cannot correctly classify the selected points, because no straight line through the center of the coordinate system separates the different values of the neuron response. Right: for the biased neuron, the points are moved onto the edge of a sphere, and thus they can be separated by a plane that passes through the center of the coordinate system.

Image credit: Biased and non-biased neurons


Acknowledgements
• Some parts of the slides are adapted from:
  • Derivation: Error Backpropagation & Gradient Descent for Neural Networks (github.io link)
  • Negnevitsky, Michael. Artificial Intelligence: A Guide to Intelligent Systems. Pearson, 2005. Chapter 6.
  • Machine Learning cơ bản (link)
  • Ruder, Sebastian. "An Overview of Gradient Descent Optimization Algorithms." arXiv preprint arXiv:1609.04747 (2016).
  • ML Glossary (link)
  • Gradient Descent Algorithm and Its Variants (link)
  • A Visual Explanation of Gradient Descent Methods (link)
