Artificial Neural Networks: Dr. Md. Aminul Haque Akhand Dept. of CSE, KUET
1. Introduction
It is hoped that devices based on biological neural networks will exhibit some of
these characteristics.
Why Artificial Neural Networks?
[Figure: a single artificial neuron with inputs $x_1 \dots x_5$, weights $w_1 \dots w_5$ and a bias weight $w_0$.]
The neuron forms a weighted sum of its inputs and applies a threshold:
output $= 1$ if $\sum_{i=0}^{n} w_i x_i \ge 0$, else $0$.
Logical Operators
[Figure: decision boundaries for OR, AND, NOT and XOR in the $(x_1, x_2)$ plane.]
Single neurons using the threshold rule above can realise the basic gates, e.g. OR with weights 1, 1 and bias weight $-0.5$, and AND with weights 1, 1 and bias weight $-1.5$; a NOT gate needs only one (negatively weighted) input.
XOR is not linearly separable, so no single neuron can realise it; it requires a two-layer network, e.g. an OR unit (threshold 0.5) and an AND unit (threshold 1.5) whose outputs are combined with weights 1 and $-2$.
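To make the threshold rule concrete, here is a small illustrative Python sketch using the weight/bias values quoted above; the two-layer XOR combination shown is one standard choice, not necessarily the exact network on the slide:

```python
# Minimal threshold-neuron sketch (illustrative; weights taken from the slide).
def neuron(inputs, weights, bias):
    """Fire (return 1) when the weighted sum plus bias is >= 0."""
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= 0 else 0

def OR(x1, x2):
    return neuron((x1, x2), (1, 1), -0.5)

def AND(x1, x2):
    return neuron((x1, x2), (1, 1), -1.5)

def XOR(x1, x2):
    # Two-layer solution: combine OR and AND units (output weights 1 and -2).
    return neuron((OR(x1, x2), AND(x1, x2)), (1, -2), -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, OR(a, b), AND(a, b), XOR(a, b))
```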
Reinforcement learning:
The learning of an input-output mapping is performed through continued interaction with the environment in order to minimize a scalar index of performance.
Learning Paradigms: Unsupervised
Conceptually, nodes at higher levels successively abstract features from preceding layers.
Geometric interpretation of the role of hidden neurons (HN)
How to obtain an appropriate weight set?
http://en.wikipedia.org/wiki/Backpropagation
Error = Desired Output - Actual Output
[Figure: a neuron forming the weighted sum ∑ of its inputs and passing it through the activation function to give $y_j$.]
Forward Pass: the inputs are propagated through the network to produce the actual output.
Backward Pass: the error is propagated back through the network to adjust the weights.
Weight Update Sequence in BP
[Figure: weight update at an output unit. Inputs $i = 0 \dots m$ (with $i = 0$ the bias) feed unit $j$ through weights $w_{j0} \dots w_{jm}$; its output $y_j$ is compared with the desired value to give the error $e_j = d_j - y_j$.]
BP for Multilayer (With HN)
[Figure: a multilayer network with input layer, hidden units $y_j$ and output units $y_k$.]
$w_{ji}(t+1) = w_{ji}(t) + \Delta w_{ji}$
Matter of Activation Function (AF)
An example of a continuously differentiable nonlinear AF is the sigmoidal nonlinearity, e.g. the logistic function
$f(x) = 1 / (1 + \exp(-\beta x))$, whose derivative is $f'(x) = \beta f(x)(1 - f(x))$.
The derivative reaches its maximum value $0.25\beta$ at $f(x) = 0.5$.
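An illustrative Python sketch (not from the slides) of this AF and its derivative:

```python
import numpy as np

# Logistic sigmoid activation and its derivative (beta is the slope parameter).
def sigmoid(x, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * x))

def sigmoid_prime(x, beta=1.0):
    f = sigmoid(x, beta)
    return beta * f * (1.0 - f)   # maximum value 0.25*beta, reached where f(x) = 0.5

print(sigmoid_prime(0.0))  # 0.25 for beta = 1
```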
BP at a Glance
Weight update: $\Delta w = \eta\, \delta\, x$.
The local gradients of an output unit ($\delta_o$) and a hidden unit ($\delta_h$) are defined by:
$\delta_o = e_o \dfrac{\partial f_o}{\partial x_o}$,  $\delta_h = \dfrac{\partial f_h}{\partial x_h} \sum_o \delta_o w_o$
With the squared-error cost $e(n) = \frac{1}{2}\,(d(n) - f_o(n))^2$, we have $\dfrac{\partial e_o(n)}{\partial f_o(n)} = -(d(n) - f_o(n))$.
For the logistic sigmoid activation function $f_o(n) = 1 / (1 + \exp(-x_o(n)))$, the derivative is $\dfrac{\partial f_o(n)}{\partial x_o(n)} = f_o(n)\,(1 - f_o(n))$.
For the same sigmoid activation function the local gradient of a hidden unit ($\delta_h$) becomes $\delta_h = f_h(n)\,(1 - f_h(n)) \sum_o \delta_o w_o$.
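A brief Python sketch of these local-gradient formulas for one training pattern (array names such as W_o and y_h are mine, not from the slide):

```python
import numpy as np

# Local gradients for logistic-sigmoid units, one training pattern.
# y_o: output-layer activations, d: desired outputs,
# y_h: hidden-layer activations, W_o: hidden-to-output weights (outputs x hidden).
def local_gradients(y_o, d, y_h, W_o):
    e_o = d - y_o                        # output error e_o = d - f_o
    delta_o = e_o * y_o * (1.0 - y_o)    # delta_o = e_o * f_o(1 - f_o)
    delta_h = y_h * (1.0 - y_h) * (W_o.T @ delta_o)  # delta_h = f_h(1 - f_h) * sum_o delta_o w_o
    return delta_o, delta_h
```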
Operational flowchart of a NN with a Single Hidden Layer
[Figure: input → hidden layer H (weights $W_h$) → output layer (weights $W_o$) → actual output; the actual output is compared with the desired output to form the error, which is propagated back to give the weight updates $\Delta W_o$ from $\delta_o$ and $\Delta W_h$ from $\delta_h$.]
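Reading the flowchart as a single training step, a compact illustrative Python version (bias terms omitted for brevity; names Wh, Wo and eta are mine) might look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, d, Wh, Wo, eta=0.1):
    """One forward/backward pass for a single-hidden-layer network."""
    yh = sigmoid(Wh @ x)                          # hidden-layer output
    yo = sigmoid(Wo @ yh)                         # actual output
    e = d - yo                                    # error = desired - actual
    delta_o = e * yo * (1 - yo)                   # output local gradient
    delta_h = yh * (1 - yh) * (Wo.T @ delta_o)    # hidden local gradient
    Wo += eta * np.outer(delta_o, yh)             # Delta Wo from delta_o
    Wh += eta * np.outer(delta_h, x)              # Delta Wh from delta_h
    return Wh, Wo
```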
Operational flowchart of a NN with Two Hidden Layers
[Figure: input → hidden layer H1 (weights $W_1$) → hidden layer H2 (weights $W_2$) → output layer (weights $W_3$) → actual output; the error between actual and desired output is propagated back to give $\Delta W_3$ from $\delta_o$, $\Delta W_2$ from $\delta_{h2}$ and $\Delta W_1$ from $\delta_{h1}$.]
A MLP Simulator for Simple Application
Hossein Khosravi: The Developer of the Simulator
Practical Aspects of BP
Practical Considerations
Training Data: Sufficient and proper training data are required for a proper input-output mapping.
Network Size: Complex problems require more hidden neurons and may perform better with multiple hidden layers.
Weight Initialization: A NN works on numeric data. Before training, all the synaptic weights are initialized with small random values, e.g., -0.5 to +0.5.
Learning Rate (η) and Momentum Constant (α): A small η results in slower convergence, while a larger value gives oscillation. Momentum also speeds up training, but a large α together with a large η causes oscillation.
Stopping Training: Training should stop at the point of best generalization.
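The weight update with momentum is typically $\Delta w(t) = -\eta\, \partial E/\partial w + \alpha\, \Delta w(t-1)$; a minimal illustrative Python sketch (the gradient function here is only a stand-in for the actual backprop gradient):

```python
import numpy as np

rng = np.random.default_rng(0)

# Weight initialization: small random values, e.g. in [-0.5, +0.5].
W = rng.uniform(-0.5, 0.5, size=(3, 5))
dW_prev = np.zeros_like(W)

eta, alpha = 0.1, 0.9   # learning rate and momentum constant

def gradient_of_error(W):
    # Placeholder for the backprop gradient dE/dW of the real network.
    return W * 0.01

for epoch in range(100):
    grad = gradient_of_error(W)
    dW = -eta * grad + alpha * dW_prev   # momentum term reuses the previous update
    W += dW
    dW_prev = dW
```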
Effect of Learning Rate (η) and Momentum Constant (α) values
Learning and Generalization
Learning and generalization are the most important topics in NN research. Learning is the ability to approximate the training data, while generalization is the ability to predict well beyond the training data.
[Figure: the available data are divided into a training set (used for learning) and a testing set (reserved to measure generalization).]
UCI contains raw data that require preprocessing before use in a NN. Some preprocessed data are also available as Proben1 (ftp://ftp.ira.uka.de/pub/neuron/).
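A minimal sketch of reserving a testing set (the arrays here are stand-ins for a real data set such as one from UCI/Proben1):

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.random((100, 4)), rng.integers(0, 2, 100)   # stand-in dataset

# Reserve 25% of the available data to measure generalization.
idx = rng.permutation(len(X))
split = int(0.75 * len(X))
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_test, y_test = X[idx[split:]], y[idx[split:]]
```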
[Figure: architecture of an RBF network. Inputs $x_1, x_2, x_3$ fan out to a hidden layer whose weights correspond to cluster centres and whose output function is usually Gaussian; the output layer ($y_1, y_2$) forms a linear weighted sum. For pattern classification a sigmoid function could be used.]
Clustering
The unique feature of the RBF network is the process
performed in the hidden layer.
The idea is that the patterns in the input space form
clusters.
If the centres of these clusters are known, then the
distance from the cluster centre can be measured.
Furthermore, this distance measure is made non-linear, so
that if a pattern is in an area that is close to a cluster
centre it gives a value close to 1.
Beyond this area, the value drops dramatically.
The notion is that this area is radially symmetrical around
the cluster centre, so that the non-linear function becomes
known as the radial-basis function.
Gaussian function
[Figure: a Gaussian curve plotted over $x \in [-1, 1]$, peaking at 1 at the centre and falling towards 0 away from it.]
The most commonly used radial-basis function is a Gaussian function. In an RBF network, $r$ is the distance from the cluster centre.
Distance Measure
The distance measured from the cluster centre is usually
the Euclidean distance.
For each neuron in the hidden layer, the weights represent
the co-ordinates of the centre of the cluster.
Therefore, when that neuron receives an input pattern, X,
the distance is:
$r_j = \sqrt{\sum_{i=1}^{n} (x_i - w_{ij})^2}$
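As a brief illustration, a hidden unit's computation in Python - the Euclidean distance above followed by a Gaussian output; the form φ(r) = exp(-r²/2σ²) and the value of σ are assumptions here:

```python
import numpy as np

def rbf_hidden_unit(x, centre, sigma):
    """Euclidean distance from the centre, passed through a Gaussian."""
    r = np.sqrt(np.sum((x - centre) ** 2))       # r_j = sqrt(sum_i (x_i - w_ij)^2)
    return np.exp(-(r ** 2) / (2.0 * sigma ** 2))

x = np.array([0.2, 0.7])
print(rbf_hidden_unit(x, centre=np.array([0.0, 1.0]), sigma=0.5))
```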
Width of hidden unit basis function
XOR example: distances from the two cluster centres (at (0,0) and (1,1)) and the corresponding hidden-unit outputs; distances are calculated using the Hamming distance.

 x1  x2 |  r1  r2 |  φ1   φ2
  0   0 |   0   2 |  1    0.1
  0   1 |   1   1 |  0.4  0.4
  1   0 |   1   1 |  0.4  0.4
  1   1 |   2   0 |  0.1  1
XOR Problem Solution
[Figure: the XOR patterns plotted with + and − class markers; after the RBF transformation the two classes become linearly separable.]
This is an ideal solution - the centres were chosen
carefully to show this result.
The learning of an RBF network finds those centre
values
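As a hedged check of the table above in Python (centres at (0,0) and (1,1), Hamming distance, an exponential fall-off whose exact width is my assumption, and a hand-picked linear output unit):

```python
import numpy as np

centres = np.array([[0, 0], [1, 1]])          # cluster centres from the table
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 1, 1, 0])              # XOR truth table

def hidden_outputs(x):
    r = np.abs(x - centres).sum(axis=1)       # Hamming distance to each centre
    return np.exp(-r)                         # Gaussian-like fall-off (width assumed)

# A single linear threshold unit on (phi1, phi2) now solves XOR.
w, bias = np.array([-1.0, -1.0]), 0.95        # hand-picked for illustration
for x, t in zip(inputs, targets):
    phi = hidden_outputs(x)
    y = int(w @ phi + bias >= 0)
    print(x, np.round(phi, 2), y, t)
```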
Training hidden layer
The hidden layer in an RBF network has units whose weights correspond to the vector representation of the centre of a cluster.
These weights are found either using a traditional clustering algorithm such as the k-means algorithm, or adaptively using essentially the Kohonen algorithm.
In either case the training is unsupervised, but the number of clusters that you expect, k, is set in advance. The algorithms then find the best fit to these clusters.
k -means algorithm
1. Initially, k points in the pattern space are randomly set as centres.
2. For each item of data in the training set, the distances from all of the k centres are found.
3. The closest centre is chosen for each item of data - this is the initial classification, so all items of data are assigned a class from 1 to k.
4. For all data found to be class 1, the average or mean value is found for each of the co-ordinates; these become the new values for the centre corresponding to class 1.
5. This is repeated for all data found to be in class 2, then class 3, and so on until class k is dealt with - we now have k new centres.
6. The process of measuring the distance between the centres and each item of data and re-classifying the data is repeated until there is no further change - i.e. the sum of the distances is monitored and training halts when the total distance no longer falls.
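An illustrative batch k-means sketch in Python following these steps (function and variable names are mine):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Batch k-means: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]      # step 1: random initial centres
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # steps 2-3: distance to every centre, assign each item to the closest one
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # steps 4-5: move each centre to the mean of the data assigned to it
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):               # step 6: stop when nothing changes
            break
        centres = new_centres
    return centres, labels
```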
Adaptive k-means
The alternative is to use an adaptive k-means algorithm, which is similar to Kohonen learning.
Input patterns are presented to all of the cluster centres one at a time, and the cluster centres are adjusted after each one. The cluster centre that is nearest to the input data wins, and is shifted slightly towards the new data.
This has the advantage that you don't have to store all of the training data, so it can be done on-line.
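A minimal sketch of the adaptive (on-line) update, with an assumed small step size lr:

```python
import numpy as np

def adaptive_kmeans_step(centres, x, lr=0.05):
    """Shift the winning (nearest) centre slightly towards the new input x."""
    winner = np.linalg.norm(centres - x, axis=1).argmin()
    centres[winner] += lr * (x - centres[winner])
    return centres
```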
Finding radius of Gaussians
Having found the cluster centres, the next step is
determining the radius of the Gaussian curves.
This is usually done using the P-nearest neighbour
algorithm.
A number P is chosen (a typical value for P is 2), and for
each centre, the P nearest centres are found.
The root-mean-squared distance between the current cluster centre and its P nearest neighbours is calculated, and this is the value chosen for σ.
So, if the current cluster centre is $c_j$, the value is:
$\sigma_j = \sqrt{\dfrac{1}{P} \sum_{i=1}^{P} (c_j - c_i)^2}$
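A hedged Python sketch of this P-nearest-neighbour rule (P = 2 by default, as suggested in the text):

```python
import numpy as np

def gaussian_widths(centres, P=2):
    """sigma_j = RMS distance from centre j to its P nearest neighbouring centres."""
    sigmas = []
    for cj in centres:
        d = np.linalg.norm(centres - cj, axis=1)
        nearest = np.sort(d)[1:P + 1]                 # skip d = 0 (the centre itself)
        sigmas.append(np.sqrt(np.mean(nearest ** 2)))
    return np.array(sigmas)
```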
Training output layer
[Figure: the RBF network architecture again - fan-out input layer, hidden layer whose weights correspond to cluster centres with Gaussian output functions, and an output layer forming a linear weighted sum of the hidden-layer outputs; it is these output-layer weights that remain to be trained.]