2024 MTH058 Lecture02: Backpropagation in Neural Networks
Artificial neural networks
What is a neural network?
• A neural network (NN) is a reasoning model based on the biological neural network of the human brain.
• The human brain contains approximately 86 billion neurons; estimates of the number of synapses in an adult range from 100 to 500 trillion.
• It is a highly complex, nonlinear, parallel information-processing system.
• Learning through experience is an essential characteristic.
• Plasticity: connections leading to the “right answer” are enhanced,
while those to the “wrong answer” are weakened.
Biological neural network
• There have been attempts to emulate the biological neural network in a computer, resulting in artificial neural networks (ANNs).
ANN: Network architecture
• An ANN has many neurons arranged in a hierarchy of layers.
• Each neuron is an elementary information-processing unit.
ANN: Neurons and Signals
• Each neuron receives several input signals through its
connections and produces at most a single output signal.
[Figure: an artificial neuron compared with a biological neuron; the connection weights express the strength of each input.]
Perceptron
Perceptron (Frank Rosenblatt, 1958)
• A perceptron has a single neuron with adjustable synaptic
weights and a hard limiter.
$$y = \mathrm{step}\!\left(\sum_{i=1}^{n} x_i w_i - \theta\right)$$
Perceptron learning rule
• Step 1 – Initialization: Initial weights $w_1, w_2, \ldots, w_n$ are randomly assigned small values (usually in $[-0.5, 0.5]$, but not restricted to this range).
• Step 2 – Activation: At iteration $p$, apply the $p$-th example, which has inputs $x_1(p), x_2(p), \ldots, x_n(p)$ and desired output $Y_d(p)$, and calculate the actual output
$$Y(p) = \sigma\!\left(\sum_{i=1}^{n} x_i(p)\, w_i(p) - \theta\right), \qquad \sigma(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$
where $n$ is the number of perceptron inputs and $\sigma$ is the step activation function.
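A minimal sketch of Steps 1–2, together with the standard perceptron weight-update step that follows them, in Python/NumPy; the learning rate, random seed, and the AND training data are assumptions for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x):
    # Hard limiter: 1 if x >= 0, else 0
    return (x >= 0).astype(float)

# Training data for the logical AND operation (assumed example)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Yd = np.array([0, 0, 0, 1], dtype=float)

# Step 1 - Initialization: small random weights and threshold in [-0.5, 0.5]
w = rng.uniform(-0.5, 0.5, size=2)
theta = rng.uniform(-0.5, 0.5)
eta = 0.1  # learning rate (assumed value)

for epoch in range(100):
    errors = 0
    for x, yd in zip(X, Yd):
        # Step 2 - Activation: actual output of the perceptron
        y = step(x @ w - theta)
        # Standard perceptron learning rule: w_i <- w_i + eta * e * x_i
        e = yd - y
        if e != 0:
            w += eta * e * x
            theta -= eta * e   # threshold treated as a weight with input -1
            errors += 1
    if errors == 0:
        break

print("weights:", w, "threshold:", theta)
```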
Perceptron for the logical AND/OR
• A single-layer perceptron can learn the AND/OR operations.
[Figure: example network with inputs $x_1$ (weather) and $x_2$ (friend wants to go), feeding neurons $A$, $B$, and $C$.]
• For every combination $(x_1, x_2)$, what are the output values at neurons $A$, $B$, and $C$?
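For instance (illustrative weights, not necessarily those used in the figure), the AND operation is realized by $w_1 = w_2 = 1$ and $\theta = 1.5$: the weighted sum $x_1 + x_2$ reaches the threshold only for the input $(1, 1)$. With the same weights and $\theta = 0.5$, the perceptron computes OR, since any input containing a 1 already exceeds the threshold.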
Multi-layer perceptron
Multi-layer perceptron (MLP)
[Figure: multilayer perceptron with input signals on the left flowing through the first and second hidden layers to the output signals on the right.]
• It is a fully connected feedforward network with at least three
layers.
• Idea: map a given input to a specified target value using a cascade of nonlinear transformations.
Learning algorithm: Back-propagation
The input signals are propagated forward on a layer-by-layer basis.
• Step 2 – Activation: At iteration $p$, apply the $p$-th example, which has inputs $x_1(p), x_2(p), \ldots, x_n(p)$ and desired outputs $y_{d,1}(p), y_{d,2}(p), \ldots, y_{d,l}(p)$.
• (a) Calculate the actual output of neuron $j$ in the hidden layer, which receives the $n$ inputs:
$$y_j(p) = \sigma\!\left(\sum_{i=1}^{n} x_i(p)\, w_{ij}(p) + b_j\right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$
• (b) Calculate the actual output of neuron $k$ in the output layer, which receives $m$ inputs from the hidden layer:
$$y_k(p) = \sigma\!\left(\sum_{j=1}^{m} y_j(p)\, w_{jk}(p) + b_k\right)$$
• Step 3 – Weight training: (a) Calculate the error gradient for neuron $k$ in the output layer:
$$\delta_k(p) = y_k(p) \times \bigl[1 - y_k(p)\bigr] \times \bigl[y_{d,k}(p) - y_k(p)\bigr]$$
where $[y_{d,k}(p) - y_k(p)]$ is the output error and $\delta_k(p)$ is the error gradient.
Calculate the weight corrections: $\Delta w_{jk}(p) = \eta \times y_j(p) \times \delta_k(p)$.
Update the weights at the output neurons: $w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)$.
• (b) Calculate the error gradient for neuron $j$ in the hidden layer:
$$\delta_j(p) = y_j(p) \times \bigl[1 - y_j(p)\bigr] \times \sum_{k=1}^{l} \delta_k(p)\, w_{jk}(p)$$
Calculate the weight corrections $\Delta w_{ij}(p) = \eta \times x_i(p) \times \delta_j(p)$ and update the hidden-layer weights: $w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)$.
• Step 4 – Iteration: Increase iteration $p$ by one, go back to Step 2, and repeat the process until the selected error criterion is satisfied.
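The steps above can be condensed into a short sketch. The following Python/NumPy code is a minimal illustration, assuming a 2–2–1 sigmoid network trained on the XOR data; the learning rate, stopping tolerance, and variable names are assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# XOR training set (assumed example)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Yd = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1 - Initialization
W1 = rng.uniform(-0.5, 0.5, (2, 2))   # input -> hidden weights w_ij
b1 = rng.uniform(-0.5, 0.5, 2)        # hidden biases b_j
W2 = rng.uniform(-0.5, 0.5, (2, 1))   # hidden -> output weights w_jk
b2 = rng.uniform(-0.5, 0.5, 1)        # output biases b_k
eta = 0.5                             # learning rate (assumed)

for epoch in range(5000):
    sse = 0.0
    for x, yd in zip(X, Yd):
        # Step 2 - Activation (forward pass)
        yj = sigmoid(x @ W1 + b1)      # hidden-layer outputs y_j
        yk = sigmoid(yj @ W2 + b2)     # output-layer outputs y_k

        # Step 3 - Weight training (backward pass)
        dk = yk * (1 - yk) * (yd - yk)          # delta_k, output layer
        dj = yj * (1 - yj) * (W2 @ dk)          # delta_j, hidden layer
        W2 += eta * np.outer(yj, dk); b2 += eta * dk
        W1 += eta * np.outer(x, dj);  b1 += eta * dj

        sse += np.sum((yd - yk) ** 2)
    # Step 4 - Iteration: stop when the SSE over all examples is small enough
    if sse < 0.001:
        break
```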
Back-propagation network for XOR
• The logical XOR problem took
224 epochs or 896 iterations
for network training.
Sum of the squared errors (SSE)
• When the SSE over an entire pass through all training examples is sufficiently small, the network is deemed to have converged.
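One common way to write this criterion, using the notation of the algorithm above with $P$ training examples and $l$ output neurons, is
$$E_{\mathrm{SSE}} = \sum_{p=1}^{P} \sum_{k=1}^{l} \bigl[y_{d,k}(p) - y_k(p)\bigr]^2$$
and training stops once $E_{\mathrm{SSE}}$ falls below a chosen small tolerance.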
Sigmoid neuron vs. Perceptron
• A sigmoid neuron better reflects the fact that small changes in the weights and bias cause only a small change in the output.
About back-propagation learning
• Do randomly initialized weights and thresholds lead to different solutions?
• Starting from different initial conditions yields different final weight and threshold values, and the problem is solved within a different number of iterations each time.
• Back-propagation learning cannot be viewed as emulation of
brain-like learning.
• Biological neurons do not work backward to adjust the strengths of their interconnections (synapses).
• The training is slow due to extensive calculations.
• Improvements: Caudill, 1991; Jacobs, 1988; Stubbs, 1990
Gradient descent method
• Consider two parameters, 𝑤1 and 𝑤2 , in a network.
[Figure: error surface over $(w_1, w_2)$; the colors represent the value of the function $f$, and $\theta^*$ marks the minimum.]
• Randomly pick a starting point $\theta^0$.
• Compute the negative gradient at $\theta^0$: $-\nabla f(\theta^0)$, where
$$\nabla f(\theta^0) = \begin{bmatrix} \partial f(\theta^0)/\partial w_1 \\ \partial f(\theta^0)/\partial w_2 \end{bmatrix}$$
• Multiply it by the learning rate $\eta$ and step in that direction: $-\eta \nabla f(\theta^0)$.
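A minimal sketch of this procedure in Python/NumPy, using an assumed toy error surface $f(w_1, w_2)$ rather than the error of a real network:

```python
import numpy as np

def f(theta):
    # Assumed toy error surface over (w1, w2)
    w1, w2 = theta
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 0.5) ** 2

def grad_f(theta):
    # Gradient [df/dw1, df/dw2]
    w1, w2 = theta
    return np.array([2.0 * (w1 - 1.0), 4.0 * (w2 + 0.5)])

theta = np.random.default_rng(0).uniform(-2, 2, size=2)  # random starting point theta^0
eta = 0.1                                                # learning rate

for t in range(200):
    theta = theta - eta * grad_f(theta)  # move along the negative gradient

print("theta* ~", theta, "f(theta*) =", f(theta))
```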
Gradient descent method
• Consider two parameters, 𝑤1 and 𝑤2 , in a network.
• Starting from the randomly picked point $\theta^0$ and repeating the update, we eventually reach a minimum.
[Figure: gradient-descent trajectory on the error surface over $(w_1, w_2)$.]
Gradient descent method
• Gradient descent never guarantees reaching the global minimum: a different initial point $\theta^0$ may lead to a different local minimum of the cost function $f$.
[Figure: two descent trajectories from different starting points converging to different minima.]
Gradient descent method
• It also has trouble at plateaus and saddle points.
[Figure: cost along the parameter space, showing a plateau where $\nabla f(\theta) \approx 0$ and progress is very slow, a saddle point where $\nabla f(\theta) = 0$, and a local minimum where $\nabla f(\theta) = 0$.]
Accelerated learning in ANNs
• Use tanh instead of sigmoid: represent the sigmoidal function
by a hyperbolic tangent
$$Y^{\tanh} = \frac{2a}{1 + e^{-bX}} - a, \qquad \text{where } a = 1.716 \text{ and } b = 0.667 \text{ (Guyon, 1991)}$$
Accelerated learning in ANNs
• Generalized delta rule: A momentum term is
included in the delta rule (Rumelhart et al., 1986)
$$\Delta w_{jk}(p) = \beta \times \Delta w_{jk}(p-1) + \eta \times y_j(p) \times \delta_k(p)$$
where $\beta$ is the momentum constant ($0 \le \beta \le 1$), typically set to $0.95$.
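A sketch of this momentum update for one output-layer weight matrix in Python/NumPy; the function name and default values are illustrative, and the previous correction must be carried over between iterations.

```python
import numpy as np

def momentum_update(W, dW_prev, yj, dk, eta=0.1, beta=0.95):
    """Generalized delta rule: dW(p) = beta * dW(p-1) + eta * y_j(p) * delta_k(p)."""
    dW = beta * dW_prev + eta * np.outer(yj, dk)
    return W + dW, dW  # updated weights and the correction to reuse at p+1
```

Each call returns both the updated weights and the correction, which is passed back in as `dW_prev` on the next iteration.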
Accelerated learning in ANNs
Momentum still does not guarantee reaching the global minimum, but it gives some hope of rolling past flat regions.
[Figure: cost curve with arrows marking the negative gradient, the momentum term, and the real movement; momentum can carry the update past a point where the gradient is 0.]
Accelerated learning in ANNs
• Adaptive learning rate: Adjust the learning rate parameter 𝜼
during training
• Small $\eta$ → small weight changes through the iterations → smooth learning curve
• Large $\eta$ → speeds up the training process with larger weight changes → possible instability and oscillation
• Heuristic-like approaches for adjusting $\eta$ (see the sketch below):
1. If the algebraic sign of the SSE change remains the same for several consecutive epochs → increase $\eta$.
2. If the algebraic sign of the SSE change alternates for several consecutive epochs → decrease $\eta$.
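A minimal sketch of these two heuristics in Python; the window length, scaling factors, and the helper name `adjust_learning_rate` are assumptions for illustration (practical schemes also add safeguards, such as rejecting an epoch whose SSE grows too much, which are not shown here).

```python
def adjust_learning_rate(eta, sse_history, window=4, up=1.05, down=0.7):
    """Heuristic adaptation of eta based on the sign of recent SSE changes."""
    if len(sse_history) < window + 1:
        return eta
    # Signs of the SSE change over the last `window` epochs
    changes = [sse_history[i + 1] - sse_history[i] for i in range(-window - 1, -1)]
    signs = [1 if c > 0 else -1 for c in changes]
    if all(s == signs[0] for s in signs):
        return eta * up    # same sign for several consecutive epochs -> increase eta
    if all(signs[i] != signs[i + 1] for i in range(len(signs) - 1)):
        return eta * down  # alternating sign -> decrease eta
    return eta
```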
Learning with momentum only
[Figure: learning curve with momentum for the logical operation XOR.]
Learning with adaptive learning rate only
Learning with adaptive learning rate and momentum
Universality Theorem
• Any continuous function $f: \mathbb{R}^N \to \mathbb{R}^M$ can be approximated by a network with one hidden layer, given enough hidden neurons.
• More explanation can be found here.
• In general, the more parameters, the better performance.
• However,…
Universality Theorem
• Deep networks are empirically better in solving many real-
world problems
Deep networks (hidden layers of size 2k):

Layers | Size | Word Error Rate (%)
1 | 2k | 24.2
2 | 2k | 20.4
3 | 2k | 18.4
4 | 2k | 17.8
5 | 2k | 17.2
7 | 2k | 17.1

Shallow networks (one hidden layer):

Layers | Size | Word Error Rate (%)
1 | 3772 | 22.5
1 | 4634 | 22.6
1 | 16k | 22.1
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-
Dependent Deep Neural Networks." Interspeech. 2011.
Quiz 02: Multi-layer perceptron
• Consider the feedforward network below with one hidden layer of units.
• If the network is tested with an input vector $x = (1.0, 2.0, 3.0)$, then what are the activation $H_1$ of the first hidden neuron and the activation $I_3$ of the third output neuron?
Quiz 02: Forward the input signals
• The input vector to the network is $x = [x_1, x_2, x_3]^T$
• The vector of hidden layer outputs is $y = [y_1, y_2]^T$
• The vector of actual outputs is $z = [z_1, z_2, z_3]^T$
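A sketch of how this forward pass would be computed in Python/NumPy. The weight matrices, biases, and the choice of sigmoid activations below are hypothetical placeholders; the actual values come from the quiz figure, which is not reproduced here.

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

x = np.array([1.0, 2.0, 3.0])          # input vector from the quiz

# Hypothetical parameters: replace with the values from the quiz figure
W1 = np.array([[0.1, 0.2],             # 3 inputs -> 2 hidden units (w_ij)
               [0.3, 0.4],
               [0.5, 0.6]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.1, 0.2, 0.3],        # 2 hidden units -> 3 outputs (w_jk)
               [0.4, 0.5, 0.6]])
b2 = np.array([0.0, 0.0, 0.0])

y = sigmoid(x @ W1 + b1)               # hidden-layer outputs [y1, y2]
z = sigmoid(y @ W2 + b2)               # actual outputs [z1, z2, z3]
print("H1 =", y[0], " I3 =", z[2])
```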
Quiz 03: Backpropagation error signals
• The figure shows part of the
network described in the
previous slide.
• Use the same weights,
activation functions and bias
values as described.
Forward the signal
[Figure: input layer, hidden layer, and output layer; the signal is forwarded from left to right.]
Update the output layer parameters
• Calculating the gradient of the error function with respect to those
parameters is straightforward with the chain rule.
$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}} \frac{1}{2} \sum_{k} (a_k - t_k)^2 = (a_k - t_k)\,\frac{\partial}{\partial w_{jk}} (a_k - t_k)$$
• Then,
$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\,\frac{\partial}{\partial w_{jk}} a_k \qquad \text{since } \frac{\partial}{\partial w_{jk}} t_k = 0$$
$$= (a_k - t_k)\,\frac{\partial}{\partial w_{jk}} g_k(z_k) \qquad \text{since } a_k = g(z_k)$$
$$= (a_k - t_k)\, g_k'(z_k)\,\frac{\partial}{\partial w_{jk}} z_k$$
Update the output layer parameters
• Recall that $z_k = b_k + \sum_j g_j(z_j)\, w_{jk}$, and hence $\dfrac{\partial}{\partial w_{jk}} z_k = g_j(z_j) = a_j$.
• Then,
$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\, g_k'(z_k)\, a_j$$
where $(a_k - t_k)$ is the difference between the network output $a_k$ and the target value $t_k$, $g_k'(z_k)$ is the derivative of the activation function at $z_k$, and $a_j$ is the output of node $j$ from the hidden layer feeding into the output layer.
Update the output layer parameters
• Let $\delta_k = (a_k - t_k)\, g_k'(z_k)$ be the error signal after being back-propagated through the output activation function $g_k$.
• The delta form of the error function gradient for the output layer weights is
$$\frac{\partial E}{\partial w_{jk}} = \delta_k\, a_j$$
• The gradient descent update rule for the output layer weights is
$$w_{jk} \leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}} = w_{jk} - \eta\, \delta_k\, a_j = w_{jk} - \eta\, (a_k - t_k)\, g_k(z_k)\bigl[1 - g_k(z_k)\bigr]\, a_j$$
where $\eta$ is the learning rate and the last form uses the sigmoid activation, for which $g_k'(z_k) = g_k(z_k)\bigl[1 - g_k(z_k)\bigr]$.
Update the output layer biases
• The gradient for the biases is simply the back-propagated error signal 𝛿𝑘 .
$$\frac{\partial E}{\partial b_k} = (a_k - t_k)\, g_k'(z_k)\,(1) = \delta_k$$
Update the hidden layer parameters
• The process starts just the same as for the output layer.
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \frac{1}{2} \sum_{k} (a_k - t_k)^2 = \sum_{k} (a_k - t_k)\,\frac{\partial}{\partial w_{ij}} a_k$$
Update the hidden layer parameters
• Again, use the chain rule: since $a_k = g_k(z_k)$, we need $\dfrac{\partial}{\partial w_{ij}} z_k$. The term $z_k$ can be expanded as
$$z_k = b_k + \sum_j g_j(z_j)\, w_{jk}, \qquad z_j = b_j + \sum_i x_i\, w_{ij},$$
so that $\dfrac{\partial z_k}{\partial w_{ij}} = w_{jk}\, g_j'(z_j)\, x_i$, and therefore
$$\frac{\partial E}{\partial w_{ij}} = \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j)\, x_i = x_i\, g_j'(z_j) \sum_k \delta_k\, w_{jk} = \delta_j\, x_i$$
with $\delta_j = g_j'(z_j) \sum_k \delta_k\, w_{jk}$ (the back-propagated error signal for hidden neuron $j$).
Update the hidden layer biases
• Calculating the error gradients with respect to the hidden layer biases 𝑏𝑗
follows a very similar procedure to that for the hidden layer weights.
$$\frac{\partial E}{\partial b_j} = \sum_k (a_k - t_k)\,\frac{\partial}{\partial b_j} g_k(z_k) = \sum_k (a_k - t_k)\, g_k'(z_k)\,\frac{\partial z_k}{\partial b_j}$$
• Apply the chain rule to solve $\dfrac{\partial z_k}{\partial b_j} = w_{jk}\, g_j'(z_j)\,(1)$:
$$\frac{\partial E}{\partial b_j} = \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j) = g_j'(z_j) \sum_k \delta_k\, w_{jk} = \delta_j$$
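Collecting the results derived above, with $\delta_k = (a_k - t_k)\,g_k'(z_k)$ and $\delta_j = g_j'(z_j)\sum_k \delta_k w_{jk}$, all four gradients take the same delta form:
$$\frac{\partial E}{\partial w_{jk}} = \delta_k\, a_j, \qquad \frac{\partial E}{\partial b_k} = \delta_k, \qquad \frac{\partial E}{\partial w_{ij}} = \delta_j\, x_i, \qquad \frac{\partial E}{\partial b_j} = \delta_j,$$
where $x_i$ is the $i$-th input to the hidden layer.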
The role of bias
• The bias acts like the intercept in a linear equation: it shifts the activation function to the right or left.
[Figure: the effect of the weight and the bias on the shape of the activation function.]
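As a concrete case with the sigmoid activation used earlier: for $w \neq 0$,
$$\sigma(wx + b) = \sigma\!\left(w\left(x + \tfrac{b}{w}\right)\right),$$
so the midpoint of the curve sits at $x = -b/w$; the bias shifts the curve horizontally while the weight controls its steepness (and flips it when negative).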
The role of bias
• In some cases, a bias is necessary to reach a correct result at all.
Left: the neuron without a bias cannot classify the selected points correctly, because no straight line through the origin of the coordinate system separates the two classes of the neuron's response. Right: with the biased neuron, the points are effectively moved onto the edge of a sphere, and thus they might be separated by a plane passing through the origin of the coordinate system.