Fundamentals of ANN
Fakhri Karray
University of Waterloo
Outline
Introduction
A Brief History
Features of ANNs
Neural Network Topologies
Activation Functions
Learning Paradigms
Fundamentals of ANNs
McCulloch-Pitts Model
Perceptron
Adaline (Adaptive Linear Neuron)
Madaline
Case Study: Binary Classification Using Perceptron
Introduction
History
A Brief History
ANNs were originally designed in the early 1940s for pattern
classification purposes.
⇒ They have evolved so much since then.
ANNs are now used in almost every discipline of science and technology:
Features of ANNs
Activation Functions
Activation Functions
Differentiable functions
Sigmoid function: $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$
Hyperbolic tangent: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
[Plots: sigmoid(x) and tanh(x) for x in [-2, 2]]
Activation Functions
Differentiable functions
Sigmoid derivative: $\mathrm{sigderiv}(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}}$
Linear function: $\mathrm{lin}(x) = x$
[Plots: sigderiv(x) and lin(x) for x in [-2, 2]]
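The four differentiable activation functions above can be written down directly. The following is a minimal sketch using only the Python standard library; the function names mirror the slide labels, and the sample points are chosen here purely for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: (e^x - e^(-x)) / (e^x + e^(-x))."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def sigderiv(x):
    """Derivative of the sigmoid: e^(-x) / (1 + e^(-x))^2."""
    return math.exp(-x) / (1.0 + math.exp(-x)) ** 2

def lin(x):
    """Linear (identity) activation."""
    return x

# Illustrative sample points spanning the plotted range [-2, 2].
samples = [-2.0, 0.0, 2.0]
table = [(x, sigmoid(x), tanh(x), sigderiv(x), lin(x)) for x in samples]
for row in table:
    print("x=%5.1f sigmoid=%.4f tanh=%.4f sigderiv=%.4f lin=%.1f" % row)
```

Note that sigderiv(x) equals sigmoid(x) * (1 - sigmoid(x)), which is the form that learning rules based on the sigmoid usually exploit.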
Learning Paradigms
Supervised Learning
Multilayer perceptrons
Radial basis function networks
Modular neural networks
LVQ (learning vector quantization)
Unsupervised Learning
Competitive learning networks
Kohonen self-organizing networks
ART (adaptive resonance theory)
Others
Autoassociative memories (Hopfield networks)
Learning Paradigms
Supervised Learning
Training by example; i.e., the desired output for each input pattern
is known a priori.
Learning Paradigms
Training Algorithm
2 Use the error through a learning rule (e.g., gradient descent) to adjust the
network’s connection weights
Learning Paradigms
Unsupervised Learning
Unsupervised Training
1 Training data set is presented at the input layer
2 Output nodes are evaluated through a specific criterion
3 Only weights connected to the winner node are adjusted
4 Repeat steps 1 to 3 until maximum number of epochs is reached or the
connection weights reach steady state
Rationale
Competitive learning strengthens the connection between the incoming
pattern at the input layer and the winning output node.
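The four training steps above can be sketched as a winner-take-all loop. This is only an illustration under stated assumptions: the slides do not name the "specific criterion", so squared Euclidean distance is assumed; the update rule (move the winner toward the pattern), the decaying rate `eta`, the tolerance `tol`, and the toy data are all hypothetical choices.

```python
def winner_index(x, weights):
    """Step 2: pick the output node whose weight vector is closest to x
    (squared Euclidean distance is an assumed criterion)."""
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    return dists.index(min(dists))

def train_competitive(patterns, weights, eta=0.5, decay=0.5,
                      max_epochs=100, tol=1e-6):
    weights = [list(w) for w in weights]  # work on a copy
    for epoch in range(1, max_epochs + 1):
        largest_change = 0.0
        for x in patterns:                       # step 1: present the data
            c = winner_index(x, weights)         # step 2: evaluate output nodes
            for i, xi in enumerate(x):           # step 3: adjust the winner only
                change = eta * (xi - weights[c][i])
                weights[c][i] += change
                largest_change = max(largest_change, abs(change))
        if largest_change < tol:                 # step 4: weights are steady
            break
        eta *= decay                             # assumed decaying learning rate
    return weights, epoch

# Hypothetical toy data: two loose groups of 2-D patterns.
patterns = [(0.0, 0.0), (0.0, 0.2), (1.0, 1.0), (1.0, 0.8)]
initial = [[0.1, 0.1], [0.9, 0.9]]
trained, epochs_used = train_competitive(patterns, initial)
print(trained, epochs_used)
```

The loop stops on whichever of the two conditions in step 4 comes first: the epoch limit or a largest weight change below `tol`.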
Learning Paradigms
Reinforcement Learning
The qualitative feedback signal simply informs the network whether or not
the system reacted “well” to the output generated by the network.
Fundamentals of ANNs
Mid-1980s: BPL II and Multi-Layer Perceptron (by Rumelhart and Hinton)
McCulloch-Pitts Model
Overview
No learning capability.
Functionality
McCulloch-Pitts Model
Remarks
Perceptron
Overview
Uses supervised learning to adjust its weights in response to a
comparative signal between the network’s actual output and the target
output.
Remarks
$l$ input signals: $x_1, x_2, \ldots, x_l$
Perceptron
Perceptron: Architecture
Perceptron
Perceptron (cont.)
Perceptron Convergence Theorem
If the training set is linearly separable, there exists a set of weights for which
the training of the Perceptron will converge in a finite time and the training
patterns are correctly classified.
[Figure: decision boundary separating the two classes A and B; Class B is marked (▽); vertical axis x2]
Perceptron
Training Algorithm
1 Initialize weights and thresholds to small random values.
2 Choose an input-output pattern $(x^{(k)}, t^{(k)})$ from the training data.
3 Compute the network’s actual output $o^{(k)} = f\left(\sum_{i=1}^{l} w_i x_i^{(k)} - \theta\right)$.
4 Adjust the weights and bias according to the Perceptron learning rule:
$\Delta w_i = \eta [t^{(k)} - o^{(k)}] x_i^{(k)}$ and $\Delta\theta = -\eta [t^{(k)} - o^{(k)}]$, where $\eta \in [0, 1]$ is the
Perceptron’s learning rate.
If $f$ is the signum function, this becomes equivalent to:
$\Delta w_i = \begin{cases} 2\eta t^{(k)} x_i^{(k)}, & \text{if } t^{(k)} \neq o^{(k)} \\ 0, & \text{otherwise} \end{cases} \qquad \Delta\theta = \begin{cases} -2\eta t^{(k)}, & \text{if } t^{(k)} \neq o^{(k)} \\ 0, & \text{otherwise} \end{cases}$
Perceptron
Example
Problem Statement
Class (1) with target value (−1): $T = [2, 0]^T$, $U = [2, 2]^T$, $V = [1, 3]^T$
Class (2) with target value (+1): $X = [-1, 0]^T$, $Y = [-2, 0]^T$, $Z = [-1, 2]^T$
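The Perceptron learning rule can be tried on these six patterns. The sketch below is not the worked solution from the slides: the initial weights, threshold, and learning rate are not given in the text here, so zero initial values and η = 0.5 are assumed, and sgn(0) is taken as +1. With different starting values the resulting boundary will generally differ.

```python
def sgn(v):
    # Signum activation; treating sgn(0) as +1 is an assumption.
    return 1 if v >= 0 else -1

def train_perceptron(patterns, eta=0.5, max_epochs=100):
    w = [0.0, 0.0]   # assumed initial weights
    theta = 0.0      # assumed initial threshold
    for _ in range(max_epochs):
        errors = 0
        for x, t in patterns:
            o = sgn(sum(wi * xi for wi, xi in zip(w, x)) - theta)
            if o != t:
                errors += 1
                # delta w_i = eta * (t - o) * x_i ; delta theta = -eta * (t - o)
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                theta -= eta * (t - o)
        if errors == 0:  # a full pass without mistakes: training has converged
            break
    return w, theta

patterns = [
    ((2, 0), -1), ((2, 2), -1), ((1, 3), -1),    # class 1: T, U, V
    ((-1, 0), 1), ((-2, 0), 1), ((-1, 2), 1),    # class 2: X, Y, Z
]
w, theta = train_perceptron(patterns)
predictions = [sgn(sum(wi * xi for wi, xi in zip(w, x)) - theta)
               for x, _ in patterns]
print(w, theta, predictions)
```

Because the two classes are linearly separable, the convergence theorem guarantees that the loop exits through the `errors == 0` branch after finitely many updates.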
Perceptron
Example
Solution
Perceptron
[Figure: the training points in the (x1, x2) plane, with (◦) Class 1 = -1 and (△) Class 2 = 1, showing the original boundary $x_2 = x_1 - 1$ and the updated boundary $x_2 = -3x_1$]
Perceptron
Perceptron (cont.)
Remarks
2 Lack of generalization: once trained, it cannot adapt its weights to a new set
of data.
Overview
Adaline (cont.)
Learning in an Adaline
Adaline adjusts its weights according to the least mean squared (LMS)
algorithm (also known as the Widrow-Hoff learning rule) through gradient
descent optimization.
Adaline (cont.)
Training Algorithm
1 Initialize weights and thresholds to small random values.
2 Choose an input-output pattern $(x^{(k)}, t^{(k)})$ from the training data.
3 Compute the linear combiner’s output $r^{(k)} = \sum_{i=1}^{l} w_i x_i^{(k)} - \theta$.
4 Adjust the weights (and bias) according to the LMS rule as:
$\Delta w_i = \eta \left[ t^{(k)} - \sum_{j} w_j x_j^{(k)} \right] x_i^{(k)}$, where $\eta \in [0, 1]$ is the learning rate.
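The LMS steps above can be sketched as follows. Two points are assumptions rather than statements from the slides: the error is measured against the full combiner output r (including θ), and the threshold is updated by analogy with the Perceptron rule, Δθ = −η(t − r). The toy targets follow the hypothetical linear map t = 2·x1 − x2.

```python
def train_adaline(patterns, eta=0.1, epochs=1000):
    n = len(patterns[0][0])
    w = [0.0] * n    # assumed zero initial weights
    theta = 0.0      # assumed zero initial threshold
    for _ in range(epochs):
        for x, t in patterns:
            r = sum(wi * xi for wi, xi in zip(w, x)) - theta  # linear combiner
            err = t - r
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]  # LMS weight update
            theta -= eta * err                                 # assumed bias update
    return w, theta

# Hypothetical training set generated by t = 2*x1 - x2.
patterns = [((1, 0), 2.0), ((0, 1), -1.0), ((1, 1), 1.0), ((0, 0), 0.0)]
w, theta = train_adaline(patterns)
print(w, theta)
```

Unlike the Perceptron, the update uses the linear output r before any thresholding, so the weights keep moving until the squared error itself is small.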
Adaline (cont.)
Easy to implement.
Madaline
Shortcoming of Adaline
The Adaline, while having attractive training capabilities, also suffers (like
the perceptron) from the inability to learn patterns that are not linearly
separable.
When first proposed, this seemingly attractive idea did not lead to much
improvement due to the lack of an existing learning algorithm capable of
adequately updating the synaptic weights of a cascade architecture of
perceptrons.
Madaline: Example
Solving the XOR logic function by combining in parallel two adaline units
using the AND logic gate.
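One way to picture this construction is sketched below, assuming bipolar inputs and outputs (−1 / +1). The weights and thresholds are hand-picked for illustration, not trained and not taken from the slides: one unit acts as OR, the other as NAND, and the AND gate combines them.

```python
def sgn(v):
    return 1 if v >= 0 else -1

def adaline_unit(x, w, theta):
    """Linear combiner followed by a signum quantizer."""
    return sgn(sum(wi * xi for wi, xi in zip(w, x)) - theta)

def and_gate(a, b):
    """Bipolar AND: +1 only when both inputs are +1."""
    return 1 if (a == 1 and b == 1) else -1

def madaline_xor(x1, x2):
    # Hypothetical hand-picked parameters (not from the slides).
    h1 = adaline_unit((x1, x2), (1.0, 1.0), -1.0)     # x1 + x2 + 1 >= 0  -> OR
    h2 = adaline_unit((x1, x2), (-1.0, -1.0), -1.0)   # -x1 - x2 + 1 >= 0 -> NAND
    return and_gate(h1, h2)

truth_table = {(a, b): madaline_xor(a, b) for a in (-1, 1) for b in (-1, 1)}
print(truth_table)
```

Each Adaline draws one straight line in the input plane; the AND gate keeps only the strip between the two lines, which is exactly where the two inputs differ.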
Graphical Solution
Madaline (cont.)
Remarks
We need to train the network using the following set of input and desired
output training vectors:
Epoch 1
$w^{(3)} = w^{(2)}$
Epoch 2
We reuse the training set $(x^{(1)}, t^{(1)})$, $(x^{(2)}, t^{(2)})$ and $(x^{(3)}, t^{(3)})$ as
$(x^{(4)}, t^{(4)})$, $(x^{(5)}, t^{(5)})$ and $(x^{(6)}, t^{(6)})$, respectively.
$w^{(6)} = w^{(5)}$
Epoch 3
We reuse the training set $(x^{(1)}, t^{(1)})$, $(x^{(2)}, t^{(2)})$ and $(x^{(3)}, t^{(3)})$ as
$(x^{(7)}, t^{(7)})$, $(x^{(8)}, t^{(8)})$ and $(x^{(9)}, t^{(9)})$, respectively.
$w^{(8)} = w^{(7)}$
$w^{(9)} = w^{(8)}$
Epoch 4
We reuse the training set $(x^{(1)}, t^{(1)})$, $(x^{(2)}, t^{(2)})$ and $(x^{(3)}, t^{(3)})$ as
$(x^{(10)}, t^{(10)})$, $(x^{(11)}, t^{(11)})$ and $(x^{(12)}, t^{(12)})$, respectively.
$w^{(11)} = w^{(10)}$
Introducing the input vectors for another epoch will result in no change
to the weights, which indicates that $w^{(13)}$ is the solution for this problem.
Outline
Hopfield Network
Background
Background (cont.)
A Two-Stage Algorithm
Objective Function
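$E_c = \sum_{k=1}^{n} E(k) = \frac{1}{2} \sum_{k=1}^{n} \sum_{i=1}^{q} [t_i(k) - o_i(k)]^2$
$\min_w E_c = \min_w \frac{1}{2} \sum_{k=1}^{n} \sum_{i=1}^{q} [t_i(k) - o_i(k)]^2$
$\min_w E(k) = \min_w \frac{1}{2} \sum_{i=1}^{q} [t_i(k) - o_i(k)]^2$
$\Delta w^{(l)} = -\eta \frac{\partial E(k)}{\partial w^{(l)}}$,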
For the case where the layer $(l)$ is the output layer $(L)$:
$\Delta w_{ij}^{(L)} = \eta [t_i - o_i^{(L)}] [f'(tot)_i^{(L)}] o_j^{(L-1)}$; $f'(tot)_i^{(l)} = \frac{\partial f(tot_i^{(l)})}{\partial tot_i^{(l)}}$
By denoting $\delta_i^{(L)} = [t_i - o_i^{(L)}] [f'(tot)_i^{(L)}]$ as being the error
signal of the $i$-th node of the output layer, the weight update
at layer $(L)$ is as follows: $\Delta w_{ij}^{(L)} = \eta \delta_i^{(L)} o_j^{(L-1)}$
Propagating the error backward now, and for the case where
$(l)$ represents a hidden layer $(l < L)$, the expression of $\Delta w_{ij}^{(l)}$
becomes given by: $\Delta w_{ij}^{(l)} = \eta \delta_i^{(l)} o_j^{(l-1)}$,
where $\delta_i^{(l)} = f'(tot)_i^{(l)} \sum_{p=1}^{n_l} \delta_p^{(l+1)} w_{pi}^{(l+1)}$.
Again when $f$ is taken as the sigmoid function, $\delta_i^{(l)}$ becomes
expressed as: $\delta_i^{(l)} = o_i^{(l)} (1 - o_i^{(l)}) \sum_{p=1}^{n_l} \delta_p^{(l+1)} w_{pi}^{(l+1)}$.
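The delta expressions above can be checked numerically on a very small network. The sketch below assumes one hidden layer, sigmoid units, no bias terms, and a single training pattern; the weights, input, and target are arbitrary illustrative numbers. The helper `error` evaluates E(k) so that the update can be compared against −η ∂E(k)/∂w obtained by finite differences.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W1, W2):
    """Forward pass: input -> hidden (W1) -> output (W2), sigmoid units, no biases."""
    hidden = [sigmoid(sum(w * xj for w, xj in zip(row, x))) for row in W1]
    output = [sigmoid(sum(w * hj for w, hj in zip(row, hidden))) for row in W2]
    return hidden, output

def error(x, t, W1, W2):
    """Pattern error E(k) = 1/2 * sum_i (t_i - o_i)^2."""
    _, out = forward(x, W1, W2)
    return 0.5 * sum((ti - oi) ** 2 for ti, oi in zip(t, out))

def backprop_updates(x, t, W1, W2, eta):
    hidden, out = forward(x, W1, W2)
    # Output layer: delta_i = (t_i - o_i) * o_i * (1 - o_i)
    delta_out = [(ti - oi) * oi * (1.0 - oi) for ti, oi in zip(t, out)]
    # Hidden layer: delta_i = o_i * (1 - o_i) * sum_p delta_p * w_pi
    delta_hid = [h * (1.0 - h) * sum(delta_out[p] * W2[p][i]
                                     for p in range(len(W2)))
                 for i, h in enumerate(hidden)]
    dW2 = [[eta * d * h for h in hidden] for d in delta_out]   # eta * delta_i * o_j
    dW1 = [[eta * d * xj for xj in x] for d in delta_hid]
    return dW1, dW2

# Arbitrary illustrative values (not taken from the slides' examples).
x, t, eta = [0.5, -1.0], [1.0], 0.5
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.2, -0.5]]
dW1, dW2 = backprop_updates(x, t, W1, W2, eta)
print(dW1, dW2)
```

Comparing a few entries of `dW1` and `dW2` with finite-difference estimates of −η ∂E(k)/∂w is a quick sanity check that the error signals were propagated in the right direction.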
Momentum
Momentum (cont.)
Example 1
$\delta_6 = f'(tot_6)(t - o_6) = o_6 (1 - o_6)(t - o_6) = 0.0945$
Example 2
Six input/output samples were selected from the range [0, 10]
of the variable x
The first run was made for a network with 3 hidden nodes
Example 2: Remarks
This network (with five nodes) appears to interpolate the nonlinear
behavior of the curve quite well.
Example 3
The first run was made for a network with three samples
Example 3: Remarks
The first run with three samples was not able to provide a
good match with the original curve.
Applications of MLP
Applications
Limitations of MLP
Genetic algorithms,
Simulated annealing.
Uniformly pick six sample points from [0, 2]; use half of them
for training and the rest for testing
Use the sum of regression error (i.e.
$\sum_{i \in \text{test samples}} (\mathrm{Output}(i) - \mathrm{True\ output}(i))$) as performance
measure
Use half of sample data points for training and the rest for
testing
Topology
Topology (cont.)
The center
The width
Topology (cont.)
$g_i(x) = r_i\left(\frac{\| x - v_i \|}{\sigma_i}\right)$
$\sigma_i$ is the width parameter.
$g_i(x) = \exp\left(-\frac{\| x - v_i \|^2}{2\sigma_i^2}\right)$
Learning Algorithm
Hybrid Approach
K-means method,
Once the centers and the widths of radial basis functions are
obtained, the next stage of the training begins.
Least-squares method,
Gradient method.
Because the weights exist only between the hidden layer and
the output layer, it is easy to compute the weight matrix for
the RBFN.
$G = [\{g_{ij}\}]$,
where
$g_{ij} = \exp\left(-\frac{\| x_i - v_j \|^2}{2\sigma_j^2}\right), \quad i, j = 1, \cdots, n$
If $G^{-1}$ exists, we get:
$W = G^{-1} D$
In practice, however, $G$ may be ill-conditioned (close to
singularity) or may even be a non-square matrix (if the
number of radial basis functions is less than the number of
training data); then $W$ is expressed as:
$W = G^{+} D$
We had:
$W = G^{+} D$,
$G^{+} = (G^T G)^{-1} G^T$
Once the weight matrix has been obtained, all elements of the
RBFN are now determined and the network could operate on
the task it has been designed for.
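The output-weight computation can be sketched in plain Python for scalar inputs. Assumptions made here: Gaussian basis functions with one shared width `sigma`, centers placed on the training points (so G is square), and a small hand-written Gaussian elimination (`solve`, a hypothetical helper) applied to the normal equations (GᵀG) W = Gᵀ D instead of forming the pseudo-inverse explicitly. The target is the f(x) = x sin(x) function used in this section's example; the sample points and the width 1.0 are illustrative.

```python
import math

def gaussian(x, v, sigma):
    """Gaussian radial basis function with center v and width sigma."""
    return math.exp(-((x - v) ** 2) / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w

def rbfn_weights(xs, ds, centers, sigma):
    """W = (G^T G)^(-1) G^T D, computed via the normal equations."""
    G = [[gaussian(x, v, sigma) for v in centers] for x in xs]
    m, n = len(centers), len(xs)
    GtG = [[sum(G[k][i] * G[k][j] for k in range(n)) for j in range(m)]
           for i in range(m)]
    GtD = [sum(G[k][i] * ds[k] for k in range(n)) for i in range(m)]
    return solve(GtG, GtD)

def rbfn_output(x, centers, sigma, w):
    return sum(wj * gaussian(x, v, sigma) for wj, v in zip(w, centers))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]           # illustrative training inputs
ds = [x * math.sin(x) for x in xs]       # desired outputs D
centers, sigma = xs, 1.0                 # assumed centers and width
w = rbfn_weights(xs, ds, centers, sigma)
print(w)
```

With fewer centers than training points, the same code returns the least-squares weights; for strongly overlapping basis functions GᵀG becomes badly conditioned, which is the practical concern noted above.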
Example
We use here the same function as the one used in the MLP
section, f (x) = x sin(x).
Three width parameters are used here: 0.5, 2.1, and 8.5.
Example: Comparison
Example: Remarks
Advantages/Disadvantages
Applications
They have been used as well for control systems, audio and
video signals processing, and pattern recognition.
Applications (cont.)
They have also been recently used for chaotic time series
prediction, with particular application to weather and power
load forecasting.
Topology
Unsupervised Learning
Topology (cont.)
Learning
Learning
$I = \| x - w_c \| = \min_{ij} \| x - w_{ij} \|$
Example
(1, 1, 1, 0),
(0, 0, 0, 1),
(1, 1, 0, 0),
(0, 0, 1, 1).
Example
α(0) = 0.3,
α(t + 1) = 0.2α(t).
With only three clusters available and the weights of only one
cluster updated at each step (i.e., $N_c = 0$), find the weight
matrix. Use a single epoch of training.
Example: Step 1
Step 3:
$I(1) = (1 - 0.2)^2 + (1 - 0.3)^2 + (1 - 0.5)^2 + (0 - 0.1)^2 = 1.39$
$I(2) = (1 - 0.4)^2 + (1 - 0.2)^2 + (1 - 0.3)^2 + (0 - 0.1)^2 = 1.5$
$I(3) = (1 - 0.1)^2 + (1 - 0.2)^2 + (1 - 0.5)^2 + (0 - 0.1)^2 = 1.71$
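The distance computation and winner selection for one epoch can be sketched as below. The initial weight vectors are read off the three distance expressions above; the update rule w_c ← w_c + α(x − w_c) is the usual Kohonen rule and is assumed here, as is keeping α = α(0) = 0.3 fixed within the epoch.

```python
patterns = [(1, 1, 1, 0), (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1)]

# One weight vector per cluster, inferred from the distance terms above.
weights = [
    [0.2, 0.3, 0.5, 0.1],
    [0.4, 0.2, 0.3, 0.1],
    [0.1, 0.2, 0.5, 0.1],
]
alpha = 0.3  # alpha(0); assumed constant during the epoch

def squared_distances(x, weights):
    """I(j) = sum_i (x_i - w_ij)^2 for every cluster j."""
    return [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]

def train_one_epoch(patterns, weights, alpha):
    history = []
    for x in patterns:
        d = squared_distances(x, weights)
        c = d.index(min(d))                      # winning cluster
        # N_c = 0: only the winner's weights move toward the pattern.
        weights[c] = [w + alpha * (xi - w) for w, xi in zip(weights[c], x)]
        history.append((d, c))
    return history

history = train_one_epoch(patterns, weights, alpha)
for d, c in history:
    print([round(v, 4) for v in d], "winner: cluster", c + 1)
print(weights)
```

For the next epoch the learning rate would be reduced according to α(t + 1) = 0.2 α(t) before repeating the pass.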
Example: Step 5
Epoch 1 is complete.
Repeat from the start for new epochs until ∆wj becomes
steady for all input patterns or the error is within a tolerable
range.
Applications
Neighborhood size,
Shape (circular, square, diamond),
Learning rate decaying behavior, and
Dimensionality of the neuron array (1-D, 2-D or n-D).
Applications (cont.)
Speech recognition,
Vector coding,
Robotics applications, and
Texture segmentation.
Hopfield Network
Recurrent Topology
Origin
wij = wji
Network Formulation
This value is fed back to all the input units of the network
except to the one corresponding to that output.
θi : threshold value
Hebbian Learning
$W = \{w_{ij}\} = \frac{1}{n} \sum_{k=1}^{q} p_k p_k^T - \frac{q}{n} I$
Major Classes of Neural Networks
Multi-Layer Perceptrons (MLPs) Topology
Radial Basis Function Network Learning Algorithm
Kohonen’s Self-Organizing Network Example
Hopfield Network Applications and Limitations
Learning Algorithm
$o(0) = u$
$o_i(l + 1) = \mathrm{sgn}\left(\sum_{j=1}^{n} w_{ij} o_j(l)\right)$
Example
Problem Statement
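Therefore:
$W = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & -1 \end{bmatrix} - \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 1 & -1 \\ 1 & 0 & 1 & -1 \\ 1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{bmatrix}$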
Retrieval Stage
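$o_i = \mathrm{sgn}\left(\sum_{j=1}^{n} w_{ij} o_j - \theta_i\right)$
$o_1 = \mathrm{sgn}\left(\sum_{j=1}^{4} w_{1j} o_j - \theta_1\right) = \mathrm{sgn}(w_{12} o_2 + w_{13} o_3 + w_{14} o_4 - 0)$
$o_2 = \mathrm{sgn}\left(\sum_{j=1}^{4} w_{2j} o_j - \theta_2\right) = \mathrm{sgn}(w_{21} o_1 + w_{23} o_3 + w_{24} o_4)$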
No transition is observed.
$B = [1, 1, 1, -1]^T\ (-6) \rightarrow B = [1, 1, 1, -1]^T\ (-6)$
It is in its stable state because this state has the lowest energy.
Updating the bit $o_2$, the state becomes $E = [1, -1, -1, 1]^T$
with energy 0.
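The storage and retrieval stages of this example can be sketched as follows. Several points are assumptions: the weights are built from the single stored pattern B without the 1/n scaling (which appears to be what the example uses and does not affect the sign function), all thresholds are zero, the energy is taken as E = −½ Σ w_ij o_i o_j, and a unit whose net input is exactly zero keeps its current value.

```python
def hebbian_weights(patterns):
    """Sum of outer products p p^T with the diagonal forced to zero (unscaled)."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

def energy(W, o):
    """Assumed energy function with zero thresholds."""
    n = len(o)
    return -0.5 * sum(W[i][j] * o[i] * o[j] for i in range(n) for j in range(n))

def retrieve(W, state, max_sweeps=10):
    """Asynchronous updates, one unit at a time, until no unit changes."""
    o = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(o)):
            net = sum(W[i][j] * o[j] for j in range(len(o)))
            new = o[i] if net == 0 else (1 if net > 0 else -1)
            if new != o[i]:
                o[i] = new
                changed = True
        if not changed:
            break
    return o

B = [1, 1, 1, -1]
W = hebbian_weights([B])
print(W)
print(energy(W, B), retrieve(W, B))
print(energy(W, [1, -1, -1, 1]))
```

A state is stable when a full sweep leaves every unit unchanged, which is how `retrieve` decides to stop.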
State D: Remarks
Applications
Optimization problems,
Limitations