2024 MTH058 Lecture02 Backpropagation

Backpropagation is an algorithm used to train neural networks by propagating errors backwards from the output to minimize loss. It first propagates inputs forward and calculates outputs, then calculates error of output compared to target using loss function. Errors are then propagated backwards to update weights, reducing errors and improving predictions through multiple iterations.


BACK-PROPAGATION

IN NEURAL NETWORKS

Nguyễn Ngọc Thảo


[email protected]
Outline
• Artificial neural networks
• Perceptron
• Multi-layer perceptron
• Back-propagation algorithm

Artificial neural networks
What is a neural network?
• The biological neural network (NN) is a reasoning model based on the human brain.
• An adult human brain has approximately 86 billion neurons; estimates of the number of synapses range from 100 to 500 trillion.
• It is a highly complex, nonlinear, and parallel information-processing system.
• Learning through experience is an essential characteristic.
• Plasticity: connections leading to the “right answer” are strengthened, while those leading to the “wrong answer” are weakened.

Biological neural network
• There are attempts to emulate the biological neural network in the computer, resulting in artificial neural networks (ANNs).
• ANNs resemble only the learning mechanisms of the brain, not its architecture.
• Megatron-Turing NLG: 530 billion parameters; GPT-3: 175 billion.
ANN: Network architecture
• An ANN has many neurons, arranged in a hierarchy of layers.
• Each neuron is an elementary information-processing unit.
• ANNs improve performance through experience and generalization.
ANN: Applications

ANN: Neurons and Signals
• Each neuron receives several input signals through its connections and produces at most a single output signal.
• Each connection carries a weight, expressing the strength of the input.
• The set of weights is the long-term memory in an ANN → the learning process iteratively adjusts the weights.
Artificial neuron vs. biological neuron

Biological neuron    Artificial neuron
Soma                 Neuron
Dendrite             Input
Axon                 Output
Synapse              Weight

Source: The Asimov Institute


How to build an ANN?
• The network architecture must be decided first.
  • How many neurons are to be used?
  • How are the neurons to be connected to form a network?
• Then determine which learning algorithm to use:
  • Supervised / semi-supervised / unsupervised / reinforcement learning
• And finally train the neural network:
  • How to initialize the weights of the network?
  • How to update them from a set of training examples?
Perceptron
Perceptron (Frank Rosenblatt, 1958)
• A perceptron has a single neuron with adjustable synaptic
weights and a hard limiter.

A single-layer two-input perceptron


How does a perceptron work?
• Divide the n-dimensional space into two decision regions by a hyperplane defined by the linearly separable function

  y = \sum_{i=1}^{n} x_i w_i - \theta
Perceptron learning rule
• Step 1 – Initialization: Initial weights w_1, w_2, \ldots, w_n are randomly assigned to small numbers (usually in [-0.5, 0.5], but not restricted to this range).

• Step 2 – Activation: At iteration p, apply the p-th example, which has inputs x_1(p), x_2(p), \ldots, x_n(p) and desired output Y_d(p), and calculate the actual output

  Y(p) = \sigma\left( \sum_{i=1}^{n} x_i(p) w_i(p) - \theta \right), \qquad \sigma(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}

  where n is the number of perceptron inputs and \sigma is the step activation function.

• Step 3 – Weight training
  • Update the weights: w_i(p+1) = w_i(p) + \Delta w_i(p), where \Delta w_i(p) is the weight correction at iteration p.
  • The delta rule determines how to adjust the weights: \Delta w_i(p) = \eta \times x_i(p) \times e(p), where \eta is the learning rate (0 < \eta < 1) and e(p) = Y_d(p) - Y(p).

• Step 4 – Iteration: Increase iteration p by one, go back to Step 2, and repeat the process until convergence.
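The four steps above can be sketched in a few lines of NumPy, here on the logical AND problem. The threshold θ = 0.2 and learning rate η = 0.1 come from the AND example discussed on the next slide; the seed and the epoch cap are illustrative choices.

```python
import numpy as np

# A minimal sketch of the perceptron learning rule (Steps 1-4 above)
# on the logical AND problem.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_d = np.array([0, 0, 0, 1])          # desired outputs for AND

w = rng.uniform(-0.5, 0.5, size=2)    # Step 1: small random weights
theta = 0.2                           # threshold (from the AND example)
eta = 0.1                             # learning rate

step = lambda x: 1 if x >= 0 else 0   # hard-limiter activation

for epoch in range(100):              # Step 4: iterate until convergence
    errors = 0
    for x, y_d in zip(X, Y_d):
        y = step(np.dot(x, w) - theta)    # Step 2: actual output
        e = y_d - y                       # error e(p)
        w += eta * x * e                  # Step 3: delta rule
        errors += abs(e)
    if errors == 0:                   # converged: all examples correct
        break

print([step(np.dot(x, w) - theta) for x in X])  # → [0, 0, 0, 1]
```

Since AND is linearly separable for this fixed threshold, the perceptron convergence theorem guarantees the loop terminates with a correct classifier.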
Perceptron for the logical AND/OR
• A single-layer perceptron can learn the AND/OR operations.
• The learning of logical AND converged after several iterations (threshold θ = 0.2, learning rate η = 0.1).
Perceptron for the logical XOR
• It cannot be trained to perform the exclusive-OR (XOR).
• Generally, a perceptron can classify only linearly separable patterns, regardless of the activation function used.
• Research works: Shynk, 1990; Shynk and Bershad, 1992.
Perceptron: An example
Suppose there is a high-tech exhibition in the city, and you are thinking about whether to go there. Your decision relies on the below factors:
• Is the weather good?
• Does your friend want to accompany you?
• Is the festival near public transit? (You don't own a car.)

The inputs are: weather, friend wants to go, near public transit.

• w_1 = 6, w_2 = 2, w_3 = 2 → the weather matters to you much more than whether your friend joins you, or the nearness of public transit
• θ = 5 → decisions are made based on the weather only
• θ = 3 → you go to the festival whenever the weather is good, or when both the festival is near public transit and your friend wants to join you
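This decision perceptron can be checked exhaustively: with weights w = (6, 2, 2) and a threshold θ, the output is 1 exactly when the weighted sum of the three binary inputs reaches θ. The sketch below enumerates all eight input combinations.

```python
# A quick check of the exhibition-decision perceptron above: three
# binary inputs (weather, friend, transit) with weights w = (6, 2, 2),
# compared against a threshold theta.
from itertools import product

w = (6, 2, 2)

def decide(x, theta):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# theta = 5: only the weather matters
assert all(decide(x, 5) == x[0] for x in product([0, 1], repeat=3))

# theta = 3: good weather, or friend AND transit together
for x in product([0, 1], repeat=3):
    weather, friend, transit = x
    assert decide(x, 3) == (1 if weather or (friend and transit) else 0)
print("both threshold settings behave as described")
```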
Quiz 01: Perceptron
• Consider the following neural network which receives binary input
values, 𝑥1 and 𝑥2 , and produces a single binary value.

• For every combination (𝑥1 , 𝑥2 ), what are the output values at neurons, 𝐴,
𝐵 and 𝐶?
Multi-layer perceptron
Multi-layer perceptron (MLP)

[Figure: input signals flow forward through a first and a second hidden layer to the output signals.]

• It is a fully connected feedforward network with at least three layers.
• Idea: Map certain input to a specified target value using a cascade of nonlinear transformations.
Learning algorithm: Back-propagation
• The input signals are propagated forward on a layer-by-layer basis.
• The error signals are propagated backward from the output layer to the input layer.
Back-propagation learning rule
• Step 1 – Initialization: Initial weights are assigned to random numbers.
  • The numbers may be uniformly distributed in the range \left( -\frac{2.4}{F_i}, +\frac{2.4}{F_i} \right) (Haykin, 1999), where F_i is the total number of inputs of neuron i.
  • The weight initialization is done on a neuron-by-neuron basis.

• Step 2 – Activation: At iteration p, apply the p-th example, which has inputs x_1(p), x_2(p), \ldots, x_n(p) and desired outputs y_{d,1}(p), y_{d,2}(p), \ldots, y_{d,l}(p).
  • (a) Calculate the actual output of neuron j in the hidden layer, from its n inputs:

    y_j(p) = \sigma\left( \sum_{i=1}^{n} x_i(p) w_{ij}(p) + b_j \right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}

  • (b) Calculate the actual output of neuron k in the output layer, from its m hidden inputs:

    y_k(p) = \sigma\left( \sum_{j=1}^{m} y_j(p) w_{jk}(p) + b_k \right)

  • b_j and b_k are the biases at neurons j and k, respectively.
Back-propagation learning rule
• Step 3 – Weight training: Update the weights in the back-propagation network and propagate backward the errors associated with output neurons.
  • (a) Calculate the error signal being back-propagated for neuron k in the output layer:

    \delta_k(p) = y_k(p) \times [1 - y_k(p)] \times [y_{d,k}(p) - y_k(p)]

    where y_{d,k}(p) - y_k(p) is the error and \delta_k(p) is the error gradient.
    Calculate the weight corrections: \Delta w_{jk}(p) = \eta \times y_j(p) \times \delta_k(p)
    Update the weights at the output neurons: w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)
  • (b) Calculate the error gradient for neuron j in the hidden layer:

    \delta_j(p) = y_j(p) \times [1 - y_j(p)] \times \sum_{k=1}^{l} \delta_k(p) w_{jk}(p)

    Calculate the weight corrections: \Delta w_{ij}(p) = \eta \times x_i(p) \times \delta_j(p)
    Update the weights at the hidden neurons: w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)

• Step 4 – Iteration: Increase iteration p by one, go back to Step 2, and repeat the process until the selected error criterion is satisfied.
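Steps 1-4 can be sketched on the XOR problem with one hidden layer of two sigmoid neurons and per-example (online) updates. The hyperparameters (η, seed, epoch count) are illustrative choices, not values from the slides.

```python
import numpy as np

# A minimal sketch of the back-propagation rule (Steps 1-4 above)
# trained on XOR: a 2-2-1 network of sigmoid neurons, online updates.
rng = np.random.default_rng(1)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y_d = np.array([0., 1., 1., 0.])

W1 = rng.uniform(-1, 1, (2, 2)); b1 = rng.uniform(-1, 1, 2)  # Step 1
W2 = rng.uniform(-1, 1, 2);      b2 = rng.uniform(-1, 1)
eta = 0.5

def sse():
    out = sigma(sigma(X @ W1 + b1) @ W2 + b2)
    return float(np.sum((Y_d - out) ** 2))

sse_before = sse()
for _ in range(5000):                           # Step 4: iterate
    for x, y_d in zip(X, Y_d):
        y_h = sigma(x @ W1 + b1)                # Step 2a: hidden outputs
        y_o = sigma(y_h @ W2 + b2)              # Step 2b: network output
        d_o = y_o * (1 - y_o) * (y_d - y_o)     # Step 3a: output error gradient
        d_h = y_h * (1 - y_h) * W2 * d_o        # Step 3b: hidden error gradients
        W2 = W2 + eta * y_h * d_o;  b2 = b2 + eta * d_o
        W1 = W1 + eta * np.outer(x, d_h); b1 = b1 + eta * d_h

sse_after = sse()
print(sse_before, "->", sse_after)   # the SSE should drop during training
```

As the next slides note, convergence to the global minimum is not guaranteed; a run can settle in a local minimum, but the SSE still decreases from its initial value.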
Back-propagation network for XOR
• The logical XOR problem took
224 epochs or 896 iterations
for network training.

Sum of the squared errors (SSE)
• When the SSE over an entire pass through all training examples is sufficiently small, the network is deemed to have converged.

Learning curve for the logical operation XOR.
Sigmoid neuron vs. Perceptron
• A sigmoid neuron better reflects the fact that small changes in weights and biases cause only a small change in the output.
• A sigmoidal function is a smoothed-out version of a step function.
About back-propagation learning
• Do randomly initialized weights and thresholds lead to different solutions?
  • Starting from different initial conditions yields different weight and threshold values, and the problem is solved within different numbers of iterations.
• Back-propagation learning cannot be viewed as an emulation of brain-like learning.
  • Biological neurons do not work backward to adjust the strengths of their interconnections (synapses).
• The training is slow due to extensive calculations.
  • Improvements: Caudill, 1991; Jacobs, 1988; Stubbs, 1990.
Gradient descent method
• Consider two parameters, w_1 and w_2, in a network, and the error surface of a function f over them (the colors represent the value of f; the optimum \theta^* is its minimum).
• Randomly pick a starting point \theta^0.
• Compute the negative gradient at \theta^0: -\nabla f(\theta^0), where

  \nabla f(\theta^0) = \left[ \partial f(\theta^0)/\partial w_1, \; \partial f(\theta^0)/\partial w_2 \right]^T

• Multiply it by the learning rate \eta: -\eta \nabla f(\theta^0).
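The step above can be sketched on a made-up two-parameter error surface; here f(w_1, w_2) = (w_1 - 1)^2 + (w_2 + 2)^2, and the surface, starting point, and learning rate are all illustrative assumptions.

```python
import numpy as np

# A small sketch of gradient descent on an illustrative quadratic
# error surface f(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2.
def grad_f(theta):
    w1, w2 = theta
    return np.array([2 * (w1 - 1), 2 * (w2 + 2)])  # gradient of f at theta

theta = np.array([4.0, 3.0])   # theta^0: an arbitrary starting point
eta = 0.1                      # learning rate

for _ in range(200):           # theta^{t+1} = theta^t - eta * grad f(theta^t)
    theta = theta - eta * grad_f(theta)

print(theta)  # converges toward the minimum at (1, -2)
```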
Gradient descent method
• Consider two parameters, w_1 and w_2, in a network.
• Repeat the two steps: compute the negative gradient, scale it by the learning rate, and move.

  \theta^1 = \theta^0 - \eta \nabla f(\theta^0), \quad \theta^2 = \theta^1 - \eta \nabla f(\theta^1), \; \ldots

• Eventually, we reach a minimum.

[Figure: the error surface, with successive steps -\eta \nabla f(\theta^0), -\eta \nabla f(\theta^1), -\eta \nabla f(\theta^2) tracing a path toward the minimum.]
Gradient descent method
• Gradient descent never guarantees the global minimum.

[Figure: starting from different initial points \theta^0, gradient descent reaches different minima of f, and thus gives different results.]
Gradient descent method
• It also has issues at plateaus and saddle points.
  • Very slow at a plateau, where \nabla f(\theta) \approx 0
  • Stuck at a saddle point, where \nabla f(\theta) = 0
  • Stuck at a local minimum, where \nabla f(\theta) = 0
Accelerated learning in ANNs
• Use tanh instead of sigmoid: represent the sigmoidal function by a hyperbolic tangent

  Y^{\tanh} = \frac{2a}{1 + e^{-bX}} - a

  where a = 1.716 and b = 0.667 (Guyon, 1991).
Accelerated learning in ANNs
• Generalized delta rule: A momentum term is included in the delta rule (Rumelhart et al., 1986):

  \Delta w_{jk}(p) = \beta \times \Delta w_{jk}(p-1) + \eta \times y_j(p) \times \delta_k(p)

  where \beta = 0.95 is the momentum constant (0 \le \beta \le 1).

• What if we bring the momentum of the physical world into gradient descent?
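Gradient descent with a momentum term can be sketched in the same way: the update mixes the previous step (scaled by β) with the current negative gradient. The quadratic surface and the hyperparameters below are illustrative assumptions.

```python
import numpy as np

# A sketch of gradient descent with momentum, mirroring the generalized
# delta rule above, on an illustrative quadratic error surface.
def grad_f(theta):
    w1, w2 = theta
    return np.array([2 * (w1 - 1), 2 * (w2 + 2)])

theta = np.array([4.0, 3.0])
velocity = np.zeros(2)
eta, beta = 0.1, 0.9

for _ in range(300):
    velocity = beta * velocity - eta * grad_f(theta)  # momentum + negative gradient
    theta = theta + velocity                          # real movement

print(theta)  # approaches the minimum at (1, -2)
```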
Accelerated learning in ANNs
• Momentum still does not guarantee reaching the global minimum, but it gives some hope.

[Figure: on the cost curve, the real movement is the sum of the negative of the gradient and the momentum; at a point where the gradient = 0, the accumulated momentum can carry the movement onward.]
Accelerated learning in ANNs
• Adaptive learning rate: Adjust the learning rate parameter \eta during training.
  • Small \eta → small weight changes through iterations → smooth learning curve
  • Large \eta → speeds up the training process with larger weight changes → possible instability and oscillation
• Heuristic-like approaches for adjusting \eta:
  1. The algebraic sign of the SSE change remains the same for several consecutive epochs → increase \eta.
  2. The algebraic sign of the SSE change alternates for several consecutive epochs → decrease \eta.
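The two sign-of-SSE heuristics can be sketched as follows. The growth/decay factors (1.05, 0.7), the window of "several" epochs (here 3), and the toy SSE trace are all illustrative assumptions, not values from the slides.

```python
# A sketch of the adaptive learning rate heuristics above: grow eta while
# the SSE keeps changing in the same direction; shrink it when the SSE
# change starts alternating in sign.
eta = 0.1
signs = []  # sign of the SSE change per epoch

def adapt(eta, sse_prev, sse_curr, signs, window=3):
    signs.append(1 if sse_curr - sse_prev > 0 else -1)
    recent = signs[-window:]
    if len(recent) == window:
        if all(s == recent[0] for s in recent):           # same sign persists
            eta *= 1.05                                   # -> increase eta
        elif all(recent[i] != recent[i + 1] for i in range(window - 1)):
            eta *= 0.7                                    # alternates -> decrease eta
    return eta

sse = [5.0, 4.0, 3.2, 2.9, 3.1, 2.8, 3.0]  # a made-up SSE trace
for prev, curr in zip(sse, sse[1:]):
    eta = adapt(eta, prev, curr, signs)
print(eta)
```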
Learning with momentum only
Learning with momentum for the logical operation XOR.
Learning with adaptive η only

Learning with adaptive learning rate for the logical operation XOR.
Learning with adaptive η and momentum
Universality Theorem
• Any continuous function 𝑓: 𝑅𝑁 → 𝑅𝑀 can be realized by a
network with one hidden layer, given enough hidden neurons.
• More explanation can be found here.
• In general, the more parameters, the better performance.

• However,…

Universality Theorem
• Deep networks are empirically better at solving many real-world problems.

  Deep networks                       Shallow networks
  Layer × Size   Word Error Rate (%)  Layer × Size   Word Error Rate (%)
  1 × 2k         24.2
  2 × 2k         20.4
  3 × 2k         18.4
  4 × 2k         17.8
  5 × 2k         17.2                 1 × 3772       22.5
  7 × 2k         17.1                 1 × 4634       22.6
                                      1 × 16k        22.1

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.
Quiz 02: Multi-layer perceptron
• Consider the below feedforward network with one hidden layer of units.

• If the network is tested with an input vector 𝑥 = 1.0, 2.0, 3.0 then what
are the activation 𝐻1 of the first hidden neuron and the activation 𝐼3 of the
third output neuron?
Quiz 02: Forward the input signals
• The input vector to the network is x = (x_1, x_2, x_3)^T.
• The vector of hidden layer outputs is y = (y_1, y_2)^T.
• The vector of actual outputs is z = (z_1, z_2, z_3)^T.
• The vector of desired outputs is t = (t_1, t_2, t_3)^T.
• The network has the following weight vectors:

  v_1 = (-2.0, 2.0, -2.0)^T, \quad v_2 = (1.0, 1.0, -1.0)^T
  w_1 = (1.0, -3.5)^T, \quad w_2 = (0.5, -1.2)^T, \quad w_3 = (0.3, 0.6)^T

• Assume that all units use the sigmoid activation function and zero biases.
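The forward pass for this quiz can be checked in a few lines. Note that the grouping of the printed numbers on this slide into v_1, v_2, w_1, w_2, w_3 is one plausible reading of the original layout, so the concrete values below are an assumption.

```python
import numpy as np

# A sketch of the Quiz 02 forward pass for x = (1.0, 2.0, 3.0),
# sigmoid units, zero biases.
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 2.0, 3.0])
V = np.array([[-2.0, 2.0, -2.0],    # v1: weights into hidden unit 1
              [ 1.0, 1.0, -1.0]])   # v2: weights into hidden unit 2
W = np.array([[1.0, -3.5],          # w1: weights into output unit 1
              [0.5, -1.2],          # w2: weights into output unit 2
              [0.3,  0.6]])         # w3: weights into output unit 3

y = sigma(V @ x)    # hidden activations (zero biases)
z = sigma(W @ y)    # output activations

print(y[0])  # H1 = sigma(-2*1 + 2*2 - 2*3) = sigma(-4) ≈ 0.018
print(z[2])  # I3 = sigma(0.3*H1 + 0.6*H2)
```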
Quiz 03: Backpropagation error signals
• The figure shows part of the network described on the previous slide.
• Use the same weights, activation functions and bias values as described.
• A new input pattern is presented to the network and training proceeds as follows. The actual outputs are given by z = (0.15, 0.36, 0.57)^T and the corresponding target outputs are given by t = (1.0, 1.0, 1.0)^T.
• The weights w_{12}, w_{22} and w_{32} are also shown.
• What is the error for each of the output units?
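The output error signals follow directly from the back-propagation rule for sigmoid units, \delta_k = z_k (1 - z_k)(t_k - z_k):

```python
import numpy as np

# A sketch of the Quiz 03 output error signals, using the sigmoid
# delta from the back-propagation rule: delta_k = z_k * (1 - z_k) * (t_k - z_k).
z = np.array([0.15, 0.36, 0.57])   # actual outputs
t = np.array([1.0, 1.0, 1.0])      # target outputs

delta = z * (1 - z) * (t - z)      # error signal per output unit
print(delta)
```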
Back-propagation
algorithm
Back-propagation algorithm
• Consider an MLP with one hidden layer.
• Note the following notations:
  • a_i: the output value of node i in the input layer
  • z_j: the input value to node j in layer h
  • g_j: the activation function for node j in layer h (applied to z_j)
  • a_j = g_j(z_j): the output value of node j in layer h
  • b_j: the bias/offset for unit j in layer h
  • w_{ij}: the weights connecting node i in layer (h - 1) to node j in layer h
  • t_k: the target value for node k in the output layer
Forward the signal
[Figure: signals flow from the input layer through the hidden layer to the output layer. Bias units are not shown.]
The error function
• Training a neural network entails finding parameters \theta = \{W, b\} that minimize the errors.
• The error function is usually the sum of the squared errors between the target values t_k and the network outputs a_k:

  E = \frac{1}{2} \sum_{k=1}^{l} (a_k - t_k)^2

• l is the dimensionality of the target for a single observation.
• This parameter optimization problem can be solved using gradient descent, computing \partial E / \partial \theta for all \theta.
Update the output layer parameters
• Calculating the gradient of the error function with respect to those parameters is straightforward with the chain rule.

  \frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}} \frac{1}{2} \sum_k (a_k - t_k)^2
  = (a_k - t_k) \frac{\partial}{\partial w_{jk}} (a_k - t_k)
  = (a_k - t_k) \frac{\partial a_k}{\partial w_{jk}}  \qquad \text{since } \frac{\partial t_k}{\partial w_{jk}} = 0
  = (a_k - t_k) \frac{\partial}{\partial w_{jk}} g_k(z_k)  \qquad \text{since } a_k = g_k(z_k)
  = (a_k - t_k)\, g_k'(z_k) \frac{\partial z_k}{\partial w_{jk}}
Update the output layer parameters
• Recall that z_k = b_k + \sum_j g_j(z_j) w_{jk}, and hence \frac{\partial z_k}{\partial w_{jk}} = g_j(z_j) = a_j.

• Then,

  \frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\, g_k'(z_k)\, a_j

  i.e., the difference between the network output a_k and the target value t_k, times the derivative of the activation function at z_k, times the output a_j of node j from the hidden layer feeding into the output layer.

• The common activation function is the sigmoid function

  g(z) = \frac{1}{1 + e^{-z}}

  whose derivative is g'(z) = g(z)\,(1 - g(z)).
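The sigmoid derivative identity g'(z) = g(z)(1 - g(z)) is easy to verify numerically against a central finite difference:

```python
import numpy as np

# A quick numerical check of g'(z) = g(z) * (1 - g(z)) for the sigmoid,
# comparing the closed form against a central finite difference.
g = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
analytic = g(z) * (1 - g(z))
h = 1e-6
numeric = (g(z + h) - g(z - h)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # the two agree closely
```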
Update the output layer parameters
• Let \delta_k = (a_k - t_k)\, g_k'(z_k) be the error signal after being back-propagated through the output activation function g_k.
• The delta form of the error function gradient for the output layer weights is

  \frac{\partial E}{\partial w_{jk}} = \delta_k a_j

• The gradient descent update rule for the output layer weights is

  w_{jk} \leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}}  \qquad (\eta \text{ is the learning rate})
  \leftarrow w_{jk} - \eta\, \delta_k a_j
  \leftarrow w_{jk} - \eta\, (a_k - t_k)\, g_k(z_k)\,(1 - g_k(z_k))\, a_j

• Apply similar update rules for the remaining parameters w_{jk}.
Update the output layer biases
• The gradient for the biases is simply the back-propagated error signal \delta_k:

  \frac{\partial E}{\partial b_k} = (a_k - t_k)\, g_k'(z_k) \cdot (1) = \delta_k

• Each bias is updated as b_k \leftarrow b_k - \eta\, \delta_k.

• Note that \frac{\partial z_k}{\partial b_k} = \frac{\partial}{\partial b_k} \left( b_k + \sum_j g_j(z_j) w_{jk} \right) = 1.
  • The biases are weights on activations that are always equal to one, regardless of the feed-forward signal.
  • Thus, the bias gradients aren't affected by the feed-forward signal, only by the error.
Update the hidden layer parameters
• The process starts just the same as for the output layer.

  \frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \frac{1}{2} \sum_k (a_k - t_k)^2 = \sum_k (a_k - t_k) \frac{\partial a_k}{\partial w_{ij}}

• Applying the chain rule again, we obtain:

  \frac{\partial E}{\partial w_{ij}} = \sum_k (a_k - t_k) \frac{\partial}{\partial w_{ij}} g_k(z_k)  \qquad \text{since } a_k = g_k(z_k)
  = \sum_k (a_k - t_k)\, g_k'(z_k) \frac{\partial z_k}{\partial w_{ij}}
Update the hidden layer parameters
• The term z_k can be expanded as follows:

  z_k = b_k + \sum_j a_j w_{jk} = b_k + \sum_j g_j(z_j) w_{jk}  \qquad \text{since } a_j = g_j(z_j)
  = b_k + \sum_j g_j\!\left( b_j + \sum_i a_i w_{ij} \right) w_{jk}  \qquad \text{since } z_j = b_j + \sum_i a_i w_{ij}

• Again, use the chain rule to calculate \frac{\partial z_k}{\partial w_{ij}}:

  \frac{\partial z_k}{\partial w_{ij}} = \frac{\partial z_k}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}} = w_{jk} \frac{\partial g_j(z_j)}{\partial w_{ij}} = w_{jk}\, g_j'(z_j) \frac{\partial z_j}{\partial w_{ij}} = w_{jk}\, g_j'(z_j) \frac{\partial}{\partial w_{ij}} \left( b_j + \sum_i a_i w_{ij} \right) = w_{jk}\, g_j'(z_j)\, a_i
Update the hidden layer parameters
• Thus,

  \frac{\partial E}{\partial w_{ij}} = \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j)\, a_i = \sum_k \delta_k w_{jk}\, g_j'(z_j)\, a_i

  i.e., the error term \delta_k, times the weight w_{jk}, times the derivative of the activation function at z_j, times the output activation signal a_i from the layer below.

• Let \delta_j = g_j'(z_j) \sum_k \delta_k w_{jk} denote the resulting error signal back-propagated to layer j.
• The error function gradient for the hidden layer weights is

  \frac{\partial E}{\partial w_{ij}} = \delta_j a_i

• To calculate the weight gradients at any layer l, we calculate the back-propagated error signal \delta_l that reaches that layer from the layers after it, and weight it by the feed-forward signal at l - 1 feeding into that layer.
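The derived gradients \partial E/\partial w_{jk} = \delta_k a_j and \partial E/\partial w_{ij} = \delta_j a_i can be checked against numerical differentiation on a tiny network. The 2-3-2 architecture, random values, and sampled entries below are illustrative assumptions.

```python
import numpy as np

# A sketch checking the derived back-propagation gradients against
# central finite differences, for a tiny 2-3-2 sigmoid network.
rng = np.random.default_rng(0)
g = lambda z: 1.0 / (1.0 + np.exp(-z))

a_in = rng.normal(size=2)            # input activations a_i
W1 = rng.normal(size=(2, 3)); b1 = rng.normal(size=3)
W2 = rng.normal(size=(3, 2)); b2 = rng.normal(size=2)
t = np.array([0.0, 1.0])             # target values t_k

def loss(W1, W2):
    a_h = g(a_in @ W1 + b1)          # hidden activations a_j
    a_out = g(a_h @ W2 + b2)         # output activations a_k
    return 0.5 * np.sum((a_out - t) ** 2)

# Back-propagated gradients, as derived above
z_h = a_in @ W1 + b1; a_h = g(z_h)
z_out = a_h @ W2 + b2; a_out = g(z_out)
delta_k = (a_out - t) * a_out * (1 - a_out)   # output error signal
delta_j = a_h * (1 - a_h) * (W2 @ delta_k)    # hidden error signal
dW2 = np.outer(a_h, delta_k)                  # dE/dw_jk = delta_k * a_j
dW1 = np.outer(a_in, delta_j)                 # dE/dw_ij = delta_j * a_i

# Numerical gradient for one sample entry of each weight matrix
eps = 1e-6
E1 = np.zeros_like(W1); E1[0, 0] = eps
num1 = (loss(W1 + E1, W2) - loss(W1 - E1, W2)) / (2 * eps)
E2 = np.zeros_like(W2); E2[0, 0] = eps
num2 = (loss(W1, W2 + E2) - loss(W1, W2 - E2)) / (2 * eps)

print(abs(dW1[0, 0] - num1), abs(dW2[0, 0] - num2))  # both differences tiny
```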
Update the hidden layer parameters
• The gradient descent update rule for the hidden layer weights is

  w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}
  \leftarrow w_{ij} - \eta\, \delta_j a_i
  \leftarrow w_{ij} - \eta\, g_j'(z_j) \left( \sum_k \delta_k w_{jk} \right) a_i
  \leftarrow w_{ij} - \eta \sum_k \left[ (a_k - t_k)\, g_k(z_k)\,(1 - g_k(z_k))\, w_{jk} \right] g_j(z_j)\,(1 - g_j(z_j))\, a_i

• Apply similar update rules for the remaining parameters w_{ij}.
Update the hidden layer biases
• Calculating the error gradients with respect to the hidden layer biases b_j follows a very similar procedure to that for the hidden layer weights.

  \frac{\partial E}{\partial b_j} = \sum_k (a_k - t_k) \frac{\partial}{\partial b_j} g_k(z_k) = \sum_k (a_k - t_k)\, g_k'(z_k) \frac{\partial z_k}{\partial b_j}

• Apply the chain rule to solve \frac{\partial z_k}{\partial b_j} = w_{jk}\, g_j'(z_j) \cdot (1).

• The gradient for the biases is the back-propagated error signal \delta_j:

  \frac{\partial E}{\partial b_j} = \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j) = g_j'(z_j) \sum_k \delta_k w_{jk} = \delta_j

• Each bias is updated as b_j \leftarrow b_j - \eta\, \delta_j.
The role of bias
• Bias is like the intercept added in a linear equation; it allows shifting the activation function to either the right or the left.
• The figure illustrates the effect of the weight and the bias on the shape of the activation function.

Video credit: Neural networks and Deep learning
The role of bias
• A bias is necessary to reach any result in some cases.

Left: the non-biased neuron cannot correctly classify the selected points, because no straight line through the center of the coordinate system separates the different values of the neuron response. Right: for the biased neuron, the points are moved onto the edge of a sphere, and thus they can be separated by a plane that passes through the center of the coordinate system.

Image credit: Biased and non-biased neurons


Acknowledgements
• Some parts of the slides are adapted from:
  • Derivation: Error Backpropagation & Gradient Descent for Neural Networks (github.io link)
  • Negnevitsky, Michael. Artificial Intelligence: A Guide to Intelligent Systems. Pearson, 2005. Chapter 6.
  • Machine Learning cơ bản (link)
  • Ruder, Sebastian. "An Overview of Gradient Descent Optimization Algorithms." arXiv preprint arXiv:1609.04747 (2016).
  • ML Glossary (link)
  • Gradient Descent Algorithm and Its Variants (link)
  • A Visual Explanation of Gradient Descent Methods (link)
