
Introduction to Soft Computing

Concept of Computing

Figure: Basics of computing. An antecedent x enters the computing block y = f(x), which produces the consequent (control action) y.

y = f(x), where f is a function; f is also called a formal method or an algorithm to solve a problem.
Important Characteristics of Computing

• Should provide a precise solution.
• Control action should be unambiguous and accurate.
• Suitable for problems that are easy to model mathematically.
Hard Computing
• In 1996, Lotfi Aliasker Zadeh introduced the term hard computing.
• According to Zadeh, we term a computing scheme "hard" computing if
– a precise result is guaranteed,
– the control action is unambiguous, and
– the control action is formally defined (i.e. by a mathematical model).
Hard Computing
Examples:

• Solving numerical problems (e.g. roots of polynomials, integration, etc.)
• Searching and sorting techniques
• Solving "computational geometry" problems (e.g. shortest tour in graph theory, finding the closest pair of points, etc.)
Characteristics of Soft Computing
• It does not require mathematical modelling of the problem.
• It may not yield a precise solution.
• Algorithms are adaptive (i.e. they can adjust to changes in a dynamic environment).
• Uses biologically inspired methodologies such as genetics, evolution, ant behaviour, particle swarming, the human nervous system, etc.
What is soft computing?

• The idea of soft computing was initiated in 1981 when Zadeh published his first paper on soft data analysis, "What is Soft Computing" (Soft Computing, Springer-Verlag, Germany/USA, 1997).

• Zadeh defined soft computing as one multidisciplinary system: the fusion of the fields of fuzzy logic, neuro-computing, evolutionary and genetic computing, and probabilistic computing.
What is soft computing?

• Soft computing is the fusion of methodologies designed to model and enable solutions to real world problems which cannot be modelled, or are too difficult to model, mathematically.

• The aim of soft computing is to exploit the tolerance for imprecision, uncertainty, approximate reasoning, and partial truth in order to achieve close resemblance to human-like decision making.
What is soft computing?
Soft Computing (Zadeh 1981) draws on three root fields:
• Neural Network (NN): McCulloch 1943
• Fuzzy Logic (FL): Zadeh 1965
• Evolutionary Computing (EC): Rechenberg 1960, where EC in turn comprises
– Evolutionary Programming (EP): Fogel 1962
– Evolutionary Strategies (ES): Rechenberg 1965
– Genetic Algorithm (GA): Holland 1970
– Genetic Programming (GP): Koza 1992
Definition of Soft Computing

• Lotfi A. Zadeh, 1992: "Soft computing is an emerging approach to computing which parallels the remarkable ability of the human mind to reason and learn in an environment of uncertainty and imprecision."
Constituents of Soft Computing

• Soft computing consists of several computing paradigms, mainly: fuzzy systems, neural networks, and genetic algorithms.
– Fuzzy sets: for knowledge representation via fuzzy if-then rules.
– Neural networks: for learning and adaptation.
– Genetic algorithms: for evolutionary computation.
• These methodologies form the core of SC.
• Hybridization of these three creates a successful synergistic effect; that is, hybridization creates a situation where different entities cooperate advantageously for a final outcome.
• Soft computing is still growing and developing.
• Hence, a clear definite agreement on what comprises
soft computing has not yet been reached.
• More new sciences still merge into soft computing.
Goals of Soft Computing
Soft computing is a new multidisciplinary field whose aim is to construct a new generation of artificial intelligence, known as computational intelligence.
• The main goal of soft computing is to develop intelligent machines that provide solutions to real world problems which cannot be modelled, or are too difficult to model, mathematically.
• Its aim is to exploit the tolerance for approximation, uncertainty, imprecision, and partial truth in order to achieve close resemblance to human-like decision making.
• Approximation: the model features are similar to the real ones, but not the same.
• Uncertainty: we are not sure that the features of the model are the same as those of the entity (belief).
• Imprecision: the model features (quantities) are not the same as the real ones, but close to them.
Importance of Soft Computing
• Soft computing differs from hard (conventional) computing.
• Unlike hard computing, soft computing is tolerant of
– imprecision,
– uncertainty,
– partial truth, and
– approximation.
• The guiding principle of soft computing is to exploit this tolerance to achieve
– tractability,
– robustness, and
– low solution cost.
• In effect, the role model for soft computing is the human mind.
• Soft computing is not a mixture or combination; rather, it is a partnership in which each of the partners contributes a distinct methodology for addressing problems in its domain.
• In principle, the constituent methodologies in soft computing are complementary rather than competitive.
• Hard computing
– Based on the concept of precise modeling and analyzing to
yield accurate results.
– Works well for simple problems, but is bound by the NP-
Complete set.
• Soft computing
– Aims to overcome NP-complete problems.
– Uses inexact methods to give useful but inexact answers to
intractable problems.
– Represents a significant paradigm shift in the aims of
computing - a shift which reflects the human mind.
– Tolerant to imprecision, uncertainty, partial truth, and
approximation.
– Well suited for real world problems where ideal models are
not available.
Difference between Soft and Hard Computing

Hard Computing                                      | Soft Computing
1. Background: mathematics, logic                   | 1. Background: biological processes, fuzzy logic
2. Requires a precisely stated analytical model     | 2. No such model required
3. Often requires a lot of computation time         | 3. Can solve real world problems in reasonably less time
4. Not suited for real world problems for which     | 4. Suitable for real world problems
   no ideal model is present                        |
5. Deterministic in nature                          | 5. Incorporates stochasticity
6. Requires complete truth                          | 6. Can work with partial truth
7. Precise and accurate                             | 7. Imprecise
8. Requires exact input data                        | 8. Can deal with ambiguous or noisy data
9. High cost for solution                           | 9. Low cost for solution
10. No/low Machine Intelligence Quotient (MIQ)      | 10. High MIQ
Unique Features of Soft Computing
• Soft computing is an approach for constructing systems which
– are computationally intelligent,
– possess human-like expertise in a particular domain,
– can adapt to the changing environment,
– can learn to do better, and
– can explain their decisions.
Introduction to fuzzy logic, classical
sets and fuzzy sets
• The human brain interprets
– imprecise and
– incomplete sensory information
provided by perceptive organs.
• Fuzzy set theory helps to deal with such
information linguistically.
• Performs numerical computation by using
linguistic labels stipulated by membership
functions.
• Selection of fuzzy if-then rules forms the key
component of a fuzzy inference system (FIS)
that can effectively model human expertise in
a specific application.
• A classical set is a set with a crisp boundary.
• Ex. a classical set A of real numbers greater than 6 can be expressed as
A = {x | x > 6},
where there is a clear, unambiguous boundary 6 such that if x is greater than this number, then x belongs to set A; otherwise x does not belong to the set.
• Classical sets are an important tool for mathematics and computer science, but they do not reflect the nature of human concepts and thoughts, which tend to be abstract and imprecise.
• Ex. Let the set of tall persons be the collection of persons whose height is more than 6 ft, denoted
A = {x | x > 6},
where A = set of tall persons and x = "height".
• This is an unnatural and inadequate way of representing the concept of "tall person",
• since it classifies a person of 6.001 ft as tall but a person of 5.999 ft as not tall.
• This distinction is intuitively unreasonable.
• The flaw comes from the sharp transition
between inclusion and exclusion in a set.
• In contrast to classical set, a fuzzy set is a set
without a crisp boundary.
• That is, the transition from “belonging to a set” to
“not belonging to a set” is gradual.
• This smooth transition is characterized by
membership functions (MFs).
• MFs give fuzzy sets flexibility in modeling
commonly used linguistic expressions.
• Ex.
“the water is hot” or “the temperature is high.”
• Boolean logic uses sharp distinctions.
• Fuzzy logic reflects how people think.

Ex. How dark an area is?

• Fuzzy logic is a set of mathematical principles for knowledge representation and reasoning based on degrees of membership.
• Fuzzy sets are viewed as a generalization of crisp sets.
NEED OF FUZZY LOGIC

• Based on intuition and judgment.
• No need for a mathematical model.
• Provides a smooth transition between members and nonmembers.
• Relatively simple, fast and adaptive.
• Less sensitive to system fluctuations.
• Can implement design objectives that are difficult to express mathematically, in linguistic or descriptive rules.
CLASSICAL SETS (CRISP SETS)

Conventional or crisp sets are binary: an element either belongs to the set or does not.

{True, False}
{1, 0}
OPERATIONS ON CRISP SETS

• Union: A ∪ B = {x | x ∈ A or x ∈ B}
• Intersection: A ∩ B = {x | x ∈ A and x ∈ B}
• Complement: Ā = {x | x ∉ A, x ∈ U}
• Difference: A | B = A ∩ B̄ (the elements of A not in B)
OPERATIONS ON CRISP SETS: Example

• Let A = {2, 4, 6, 8, 10} and B = {4, 8, 12, 14} in the universe U = {0, 2, 4, 6, 8, 10, 12, 14, 16}. Then
• A union B: A ∪ B = {2, 4, 6, 8, 10, 12, 14}
• A intersection B: A ∩ B = {4, 8}
• Complement of A: Ā = {0, 12, 14, 16}
• A difference B: A | B = {2, 6, 10}
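As a quick check, the example above can be reproduced with Python's built-in set type; this is a minimal sketch of the slide's example, nothing more.

# Crisp set operations on the example universe.
U = {0, 2, 4, 6, 8, 10, 12, 14, 16}
A = {2, 4, 6, 8, 10}
B = {4, 8, 12, 14}

print(sorted(A | B))   # union        -> [2, 4, 6, 8, 10, 12, 14]
print(sorted(A & B))   # intersection -> [4, 8]
print(sorted(U - A))   # complement   -> [0, 12, 14, 16]
print(sorted(A - B))   # difference   -> [2, 6, 10]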
PROPERTIES OF CRISP SETS
The various properties of crisp sets are: commutativity, associativity, distributivity, involution, idempotency, De Morgan's laws, and the excluded middle laws.
Fuzzy set
• A fuzzy set A in the universe of discourse U can be defined as a set of ordered pairs,
A = {(x, μ_A(x)) | x ∈ U},
where μ_A(x) ∈ [0, 1] is the degree of membership of x in A.
Representation of fuzzy sets
• If the universe of discourse U is discrete and finite, then fuzzy set A can be represented as

A = μ_A(x1)/x1 + μ_A(x2)/x2 + ... = Σ_{i=1..n} μ_A(xi)/xi

• NB: i. The horizontal bar in μ_A(xi)/xi is not a quotient but a delimiter.
• ii. The '+' sign does not perform addition; it denotes a function-theoretic union.
• iii. The Σ symbol represents only a collection of items, not summation.
Representation of fuzzy sets cntd.

• If the universe of discourse U is continuous and infinite, then fuzzy set A can be represented as

A = ∫_U μ_A(x)/x

• where the ∫ symbol (integral sign) denotes a continuous function-theoretic union for continuous variables.
OPERATIONS ON FUZZY SETS
Union/Disjunction: The union of fuzzy sets A and B, denoted by A ∪ B, is defined as:

μ_{A∪B}(x) = max[μ_A(x), μ_B(x)] = μ_A(x) ∨ μ_B(x), ∀x ∈ U

Ex. Consider two fuzzy sets A = 1/2 + 0.3/4 + 0.5/6 + 0.2/8 and B = 0.5/2 + 0.4/4 + 0.1/6 + 1/8.

Then A ∪ B = max(1, 0.5)/2 + max(0.3, 0.4)/4 + max(0.5, 0.1)/6 + max(0.2, 1)/8
           = 1/2 + 0.4/4 + 0.5/6 + 1/8
Operations on fuzzy sets cntd.
Intersection/Conjunction: The intersection of fuzzy sets A and B, denoted by A ∩ B, is defined as:

μ_{A∩B}(x) = min[μ_A(x), μ_B(x)] = μ_A(x) ∧ μ_B(x), ∀x ∈ U

Ex. Consider the fuzzy sets A = 1/2 + 0.3/4 + 0.5/6 + 0.2/8 and B = 0.5/2 + 0.4/4 + 0.1/6 + 1/8. Then

A ∩ B = min(1, 0.5)/2 + min(0.3, 0.4)/4 + min(0.5, 0.1)/6 + min(0.2, 1)/8
      = 0.5/2 + 0.3/4 + 0.1/6 + 0.2/8
Operations on fuzzy sets cntd.
• Complement/negation: when μ_A(x) ∈ [0, 1], the complement of A, denoted Ā, is defined as

μ_Ā(x) = 1 − μ_A(x), ∀x ∈ U

Consider the fuzzy set A = 1/2 + 0.3/4 + 0.5/6 + 0.2/8. Then

Ā = (1 − 1)/2 + (1 − 0.3)/4 + (1 − 0.5)/6 + (1 − 0.2)/8
  = 0/2 + 0.7/4 + 0.5/6 + 0.8/8
Few more fuzzy union, intersection and complement

Fuzzy Complement
• A fuzzy complement operator is a continuous function N: [0, 1] → [0, 1] which meets the following axiomatic requirements:
– Boundary: N(0) = 1 and N(1) = 0
– Monotonicity: N(a) ≥ N(b) if a ≤ b
• All functions satisfying these requirements form the general class of fuzzy complements.
• Optional requirement
– Involution: N(N(a)) = a.
Sugeno’s Complement 1 a
N s (a ) 
1  sa
Sugeno's compliment
1

s=-0.95
0.8
s=-0.7

0.6 s=0.0
N(a)

0.4
s=2

0.2
s=20

0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
x=a
Yager's complement:

N_w(a) = (1 − a^w)^{1/w}

Figure: Yager's complement N(a) versus a for w = 0.4, 0.7, 1, 1.5 and 3.
Triangular norm or T-norm
• The intersection of two fuzzy sets A and B is specified in general by a function T: [0,1] × [0,1] → [0,1], which aggregates two membership grades as follows:

μ_{A∩B}(x) = T[μ_A(x), μ_B(x)] = μ_A(x) ∗ μ_B(x)

where ∗ is a binary operator for the function T.

This class of fuzzy intersection operators is referred to as T-norm operators.
Properties of T-norm operators
A T-norm operator is a two-place function T(·, ·) satisfying

1. Boundary: T(0, 0) = 0, T(a, 1) = T(1, a) = a
2. Monotonicity: T(a, b) ≤ T(c, d), if a ≤ c and b ≤ d
3. Commutativity: T(a, b) = T(b, a)
4. Associativity: T(a, T(b, c)) = T(T(a, b), c)
Frequently used T-norm operators
i. Minimum: T_min(a, b) = min(a, b) = a ∧ b
ii. Algebraic product: T_ap(a, b) = ab
iii. Bounded product: T_bp(a, b) = 0 ∨ (a + b − 1)
iv. Drastic product: T_dp(a, b) = a if b = 1; b if a = 1; 0 if a, b < 1

Relation between the operators:

T_dp(a, b) ≤ T_bp(a, b) ≤ T_ap(a, b) ≤ T_min(a, b)

• Consider two fuzzy sets A = 1.0/2 + 0.3/4 + 0.5/6 + 1.0/8 and B = 0.0/2 + 0.4/4 + 0.7/6 + 1.0/8.
• Then A ∩ B using

1. Minimum (T_min(a, b) = min(a, b) = a ∧ b):
A ∩ B = min(1.0, 0.0)/2 + min(0.3, 0.4)/4 + min(0.5, 0.7)/6 + min(1.0, 1.0)/8
      = 0.0/2 + 0.3/4 + 0.5/6 + 1.0/8

2. Algebraic product (T_ap(a, b) = ab):
A ∩ B = (1.0 × 0.0)/2 + (0.3 × 0.4)/4 + (0.5 × 0.7)/6 + (1.0 × 1.0)/8
      = 0.0/2 + 0.12/4 + 0.35/6 + 1.0/8

3. Bounded product (T_bp(a, b) = 0 ∨ (a + b − 1)):
A ∩ B = [0 ∨ (1.0 + 0.0 − 1)]/2 + [0 ∨ (0.3 + 0.4 − 1)]/4 + [0 ∨ (0.5 + 0.7 − 1)]/6 + [0 ∨ (1.0 + 1.0 − 1)]/8
      = 0.0/2 + 0.0/4 + 0.2/6 + 1.0/8

4. Drastic product (T_dp(a, b) = a if b = 1; b if a = 1; 0 if a, b < 1):
A ∩ B = 0.0/2 + 0.0/4 + 0.0/6 + 1.0/8
(at x = 2, a = 1 so T_dp = b = 0.0; at x = 4 and x = 6, a, b < 1 so T_dp = 0; at x = 8, a = b = 1 so T_dp = 1.0)
Figures: surface plots of the four T-norm operators T_min, T_ap, T_bp and T_dp over (a, b) ∈ [0, 1] × [0, 1].
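The four T-norms and the worked example above can be checked with the following Python sketch (the function names t_min, t_ap, t_bp, t_dp simply mirror the slide's notation):

# The four T-norm operators applied to the example fuzzy sets.
def t_min(a, b): return min(a, b)              # minimum
def t_ap(a, b):  return a * b                  # algebraic product
def t_bp(a, b):  return max(0, a + b - 1)      # bounded product
def t_dp(a, b):                                # drastic product
    if b == 1: return a
    if a == 1: return b
    return 0

A = {2: 1.0, 4: 0.3, 6: 0.5, 8: 1.0}
B = {2: 0.0, 4: 0.4, 6: 0.7, 8: 1.0}
for name, T in [("min", t_min), ("ap", t_ap), ("bp", t_bp), ("dp", t_dp)]:
    print(name, {x: round(T(A[x], B[x]), 2) for x in A})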
T-conorm or S-norm
• The union of two fuzzy sets A and B is specified in general by a function S: [0,1] × [0,1] → [0,1], which aggregates two membership grades as follows:

μ_{A∪B}(x) = S[μ_A(x), μ_B(x)] = μ_A(x) +̃ μ_B(x)

where +̃ is a binary operator for the function S.

This class of fuzzy union operators is referred to as T-conorm or S-norm operators.
Properties of S-norm operators
An S-norm operator is a two-place function S(·, ·) satisfying

1. Boundary: S(1, 1) = 1, S(a, 0) = S(0, a) = a
2. Monotonicity: S(a, b) ≤ S(c, d), if a ≤ c and b ≤ d
3. Commutativity: S(a, b) = S(b, a)
4. Associativity: S(a, S(b, c)) = S(S(a, b), c)

Frequently used S-norm operators
i. Maximum: S_max(a, b) = max(a, b) = a ∨ b
ii. Algebraic sum: S_as(a, b) = a + b − ab
iii. Bounded sum: S_bs(a, b) = 1 ∧ (a + b)
iv. Drastic sum: S_ds(a, b) = a if b = 0; b if a = 0; 1 if a, b > 0

Relation between the S-norm operators:

S_max(a, b) ≤ S_as(a, b) ≤ S_bs(a, b) ≤ S_ds(a, b)
• Consider two fuzzy sets A = 1.0/2 + 0.3/4 + 0.5/6 + 1.0/8 and B = 0.0/2 + 0.4/4 + 0.7/6 + 1.0/8.
• Then A ∪ B using

1. Maximum (S_max(a, b) = max(a, b) = a ∨ b):
A ∪ B = max(1.0, 0.0)/2 + max(0.3, 0.4)/4 + max(0.5, 0.7)/6 + max(1.0, 1.0)/8
      = 1.0/2 + 0.4/4 + 0.7/6 + 1.0/8

2. Algebraic sum (S_as(a, b) = a + b − ab):
A ∪ B = (1 + 0 − 1×0)/2 + (0.3 + 0.4 − 0.3×0.4)/4 + (0.5 + 0.7 − 0.5×0.7)/6 + (1 + 1 − 1×1)/8
      = 1.0/2 + 0.58/4 + 0.85/6 + 1.0/8

3. Bounded sum (S_bs(a, b) = 1 ∧ (a + b)):
A ∪ B = [1 ∧ (1.0 + 0.0)]/2 + [1 ∧ (0.3 + 0.4)]/4 + [1 ∧ (0.5 + 0.7)]/6 + [1 ∧ (1.0 + 1.0)]/8
      = 1.0/2 + 0.7/4 + 1.0/6 + 1.0/8

4. Drastic sum (S_ds(a, b) = a if b = 0; b if a = 0; 1 if a, b > 0):
A ∪ B = 1.0/2 + 1.0/4 + 1.0/6 + 1.0/8
(at x = 2, b = 0 so S_ds = a = 1.0; at the other points a, b > 0 so S_ds = 1)
Figures: surface plots of the four S-norm operators S_max, S_as, S_bs and S_ds over (a, b) ∈ [0, 1] × [0, 1].
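Analogously, the four S-norms and the worked example above can be checked with this Python sketch (names s_max, s_as, s_bs, s_ds mirror the slide's notation):

# The four S-norm (T-conorm) operators applied to the example fuzzy sets.
def s_max(a, b): return max(a, b)              # maximum
def s_as(a, b):  return a + b - a * b          # algebraic sum
def s_bs(a, b):  return min(1, a + b)          # bounded sum
def s_ds(a, b):                                # drastic sum
    if b == 0: return a
    if a == 0: return b
    return 1

A = {2: 1.0, 4: 0.3, 6: 0.5, 8: 1.0}
B = {2: 0.0, 4: 0.4, 6: 0.7, 8: 1.0}
for name, S in [("max", s_max), ("as", s_as), ("bs", s_bs), ("ds", s_ds)]:
    print(name, {x: round(S(A[x], B[x]), 2) for x in A})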
Do the Laws of Contradiction and Excluded Middle hold?
• Given a Fuzzy set (A, µA), we have
– The Law of Contradiction: A ∩ Ac = ∅
– The Excluded Middle: A ∪ Ac = X
• However, if A is a non-crisp set, then neither law will hold.
• Indeed, note that for a non-crisp set, there exists some x ∈ A such
that µA(x) ∈ (0, 1), i.e. µA(x) ≠ 0, 1.
• Thus, we have
– µA∩Ac (x) = min{µA(x), 1 − µA(x)} ≠ 0
– µA∪Ac (x) = max{µA(x), 1 − µA(x)} ≠ 1
• Hence, neither law holds for a non-crisp set.
PROPERTIES OF FUZZY SETS

Some terminologies

Support: The support of a fuzzy set A is the set of all points x in X such that μ_A(x) > 0, i.e.
support(A) = {x | μ_A(x) > 0}

Core: The core of a fuzzy set A is the set of all points x in X such that μ_A(x) = 1, i.e.
core(A) = {x | μ_A(x) = 1}

Normality: A fuzzy set A is normal if its core is nonempty, i.e. there is at least one point x ∈ X such that μ_A(x) = 1.

Crossover point: A crossover point of a fuzzy set A is a point x ∈ X at which μ_A(x) = 0.5, i.e.
crossover(A) = {x | μ_A(x) = 0.5}
More than one crossover point is possible.

Fuzzy singleton: A fuzzy set whose support is a single point in X with μ_A(x) = 1 is called a fuzzy singleton.

Figure: membership grades illustrating the core, crossover points and support of a fuzzy set, and a fuzzy singleton at age 45 (where core and support coincide at the single point 45).
α – level set
α-cut or α-level set
The α-cut of a fuzzy set A is a crisp set defined by
A_α = {x | μ_A(x) ≥ α}
where α is a user-specified constant.
Strong α-cut or strong α-level set
The strong α-cut of a fuzzy set A is a crisp set defined by
A'_α = {x | μ_A(x) > α}
where α is a user-specified constant.
Ex.
i. When α = 0, support(A) = A'_0, i.e. {x | μ_A(x) > 0}
ii. When α = 1, core(A) = A_1, i.e. {x | μ_A(x) = 1}
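Both cuts are one-liners over a discrete fuzzy set; this Python sketch also confirms the support/core special cases above.

# Alpha-cuts of a discrete fuzzy set.
A = {2: 1.0, 4: 0.3, 6: 0.5, 8: 0.2}

def alpha_cut(A, alpha):         # ordinary alpha-cut: mu_A(x) >= alpha
    return {x for x, mu in A.items() if mu >= alpha}

def strong_alpha_cut(A, alpha):  # strong alpha-cut: mu_A(x) > alpha
    return {x for x, mu in A.items() if mu > alpha}

print(strong_alpha_cut(A, 0.0))  # support(A) -> {2, 4, 6, 8}
print(alpha_cut(A, 1.0))         # core(A)    -> {2}
print(alpha_cut(A, 0.5))         # {2, 6}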
Convexity
A fuzzy set A is convex if and only if, for any x1, x2 ∈ X and any λ ∈ [0, 1],

μ_A(λx1 + (1 − λ)x2) ≥ min[μ_A(x1), μ_A(x2)]

Ex. x1 = 30, x2 = 60, λ = 0.6, A = "middle aged person":
μ_A(0.6 × 30 + (1 − 0.6) × 60) ≥ min[μ_A(30), μ_A(60)]
μ_A(18 + 24) ≥ min(0.5, 0.5)
μ_A(42) ≥ 0.5
Hence, A is a convex fuzzy set.

Figure: A = "middle aged person", a bell membership function with center c = 45, width a = 15, slope b = 3, over ages 0 to 100.
• Fuzzy numbers: A fuzzy number A is a fuzzy set in the real line that satisfies the conditions for normality and convexity.
• Bandwidth of normal and convex fuzzy sets: For normal and convex fuzzy sets, the width or bandwidth is defined as the distance between the two unique crossover points:
width(A) = |x2 − x1|, where μ_A(x1) = μ_A(x2) = 0.5
Ex. Here the crossover points are 30 and 60, so width(A) = 60 − 30 = 30.
• Symmetry: A fuzzy set A is symmetric if its MF is symmetric around a certain point x = c:
μ_A(c + x) = μ_A(c − x), ∀x ∈ X
Ex. Here, c = 45.

Figure: A = "middle aged person", a bell membership function with center c = 45, width a = 15, slope b = 3.
Open left: A fuzzy set A is open left if
lim_{x→−∞} μ_A(x) = 1 and lim_{x→+∞} μ_A(x) = 0

Open right: A fuzzy set A is open right if
lim_{x→−∞} μ_A(x) = 0 and lim_{x→+∞} μ_A(x) = 1

Closed: A fuzzy set A is closed if
lim_{x→−∞} μ_A(x) = lim_{x→+∞} μ_A(x) = 0

Figures: examples of an open left set, a closed fuzzy set, and an open right set.
Some Membership functions
• Triangular MF: specified by three parameters (a, b, c) with a < b < c, the x-coordinates of the three corners of the underlying triangular MF:

triangle(x; a, b, c) =
  0,                 if x ≤ a
  (x − a)/(b − a),   if a ≤ x ≤ b
  (c − x)/(c − b),   if b ≤ x ≤ c
  0,                 if c ≤ x

OR
triangle(x; a, b, c) = max( min( (x − a)/(b − a), (c − x)/(c − b) ), 0 )

Figure: triangular MF with a = 20, b = 50, c = 65.
Trapezoidal MF: specified by four parameters {a, b, c, d}, with a < b <= c < d, the x-coordinates of the four corners of the underlying trapezoidal MF:

trapezoid(x; a, b, c, d) =
  0,                 if x ≤ a
  (x − a)/(b − a),   if a ≤ x ≤ b
  1,                 if b ≤ x ≤ c
  (d − x)/(d − c),   if c ≤ x ≤ d
  0,                 if d ≤ x

OR
trapezoid(x; a, b, c, d) = max( min( (x − a)/(b − a), 1, (d − x)/(d − c) ), 0 )

Figure: trapezoid MF with a = 20, b = 50, c = 65, d = 80.

Advantages:
- Simple to formulate
- Computationally efficient
- Used extensively
- Preferred for real-time implementation
Disadvantages:
- Composed of straight line segments
- Not smooth at the corner points
Gaussian MF: defined by

gaussian(x; c, σ) = exp( −(1/2) ((x − c)/σ)² )

where
c = the MF's center
σ = the MF's width

Figure: Gaussian MF with c = 50, σ = 12.
Bell MF / Cauchy MF: defined by

bell(x; a, b, c) = 1 / (1 + |(x − c)/a|^{2b})

where
a = width
b = controls the slope at the crossover point (a negative value of b turns the bell upside down)
c = center

Figures: bell MF with a = 20, b = 4, c = 50 and its upside-down counterpart with b = −4; parameter sweeps a = 16:4:24 (b = 4, c = 50), b = 3:3:9 (a = 20, c = 50), and c = 40:10:60 (a = 20, b = 4).
Advantages of Gaussian and Bell MF:
- Smooth
- Concise
- Popular
- Gaussian MFs are invariant under multiplication (i.e. the product of two Gaussians is a Gaussian with a scaling factor).
- The Fourier transform of a Gaussian is still a Gaussian.
- The bell MF has one more parameter than the Gaussian MF, so one more degree of freedom to adjust the steepness at the crossover point.
Disadvantages: Unable to specify asymmetric MFs.
Sigmoidal MF: defined as

sig(x; a, c) = 1 / (1 + e^{−a(x − c)})

where 'a' controls the slope at the crossover point x = c.

Figures: sigmoidal MF with a = 1, c = 5 (open right) and a = −2, c = 5 (open left).
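All five parameterised MFs above fit in a short Python sketch; the functions follow the slides' formulas directly, and the prints evaluate each MF at its center for the slides' parameter choices.

# Parameterised membership functions (following the formulas above).
import math

def triangle(x, a, b, c):
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0)

def trapezoid(x, a, b, c, d):
    return max(min((x - a) / (b - a), 1, (d - x) / (d - c)), 0)

def gaussian(x, c, sigma):
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)

def bell(x, a, b, c):
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def sigmoid(x, a, c):
    return 1.0 / (1.0 + math.exp(-a * (x - c)))

print(triangle(50, 20, 50, 65))        # 1.0
print(trapezoid(50, 20, 50, 65, 80))   # 1.0
print(gaussian(50, 50, 12))            # 1.0
print(bell(50, 20, 4, 50))             # 1.0
print(round(sigmoid(50, 1, 5), 4))     # ~1.0 far right of the crossover x = 5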
Left-right (L-R) MF
Definition: A left-right MF is specified by three parameters {α, β, c}:

LR(x; c, α, β) = F_L((c − x)/α) for x ≤ c, and F_R((x − c)/β) for x ≥ c,

where F_L(x) and F_R(x) are monotonically decreasing functions defined on [0, ∞) with F_L(0) = F_R(0) = 1.

Example. For given choices of F_L and F_R, plot the L-R membership functions for
i. LR(x; 65, 60, 10)
ii. LR(x; 25, 10, 40).
MFs of Two Dimension
• MFs with one input are referred to as one-dimensional (1-D) MFs.
• MFs with two inputs, each in a different universe of discourse, are referred to as two-dimensional (2-D) MFs.
• A 1-D MF is extended to a 2-D MF via cylindrical extension.
Definition: Cylindrical extension of 1-D fuzzy sets.
If A is a fuzzy set in X, then its cylindrical extension in X × Y is a fuzzy set c(A) defined by
c(A) = ∫_{X×Y} μ_A(x) / (x, y)
Example: Plot the cylindrical extension of the fuzzy set A defined as bell(x; 20, 2, 50).
MFs of Two Dimension cntd.
Definition: Projection of fuzzy sets
Let R be a two-dimensional fuzzy set on X × Y. Then the projections of R onto X and Y are defined as

R_X = ∫_X [max_y μ_R(x, y)] / x and R_Y = ∫_Y [max_x μ_R(x, y)] / y,

respectively.
Example: Projection
Figure: (a) a 2-D fuzzy set R; (b) R_X (projection of R onto X); (c) R_Y (projection of R onto Y).
Linguistic variables
• Conventional techniques for system analysis are not suitable for dealing with humanistic systems influenced by judgment, perception and emotions.
• Principle of incompatibility: As the complexity of a system increases, our ability to make precise and yet significant statements about its behavior diminishes until a threshold is reached beyond which precision and significance become almost mutually exclusive characteristics.
• Zadeh proposed the concept of linguistic variables as an alternative approach to model human thinking.
• Linguistic variables approximately summarize information and express it as fuzzy sets.
Linguistic variable
Definition: A linguistic variable is characterized by a quintuple (x, T(x), X, G, M) in which
- x is the name of the variable,
- T(x) is the term set of x, i.e. the set of its linguistic values or linguistic terms,
- X is the universe of discourse,
- G is the syntactic rule which generates the terms in T(x), and
- M is the semantic rule which associates a meaning M(A) with each linguistic value A, where M(A) denotes a fuzzy set in X.

Figure: membership functions of the terms very young, young, middle aged, old and very old over the universe [0, 100].
Ex. If age is interpreted as a linguistic variable, then its term set T(age) could be
T(age) = {young, not young, very young, not very young, …,
middle aged, not middle aged, …,
old, not old, very old, more or less old, not very old, not very young and not very old, …}
where each term in T(age) is characterized by a fuzzy set on the universe of discourse X = [1, 100].
• The syntactic rule specifies how the linguistic values in the term set T(age) are generated.
The term set consists of
- primary terms/operands, e.g. young, middle aged, old
- operators:
i. negation, e.g. not
ii. hedges, e.g. very, more or less, quite, extremely, etc.
iii. connectives, e.g. and, or, either, neither, etc.
• The connectives, the hedges and the negation are operators which change the meaning of their operands in a specified, context-independent manner.
Concentration and Dilation of linguistic variables
• Let A be a linguistic value characterized by a fuzzy set with membership function μ_A. Then A^k is interpreted as a modified version of the original linguistic value, expressed as

A^k = ∫_X [μ_A(x)]^k / x

• The operation of concentration is defined as
CON(A) = A²
and dilation as
DIL(A) = A^{0.5}
• Ex: if A is a fuzzy set for old, then CON(A) is a fuzzy set for very old and DIL(A) is a fuzzy set for more or less old.

Figure: membership functions of old, very old (concentration) and more or less old (dilation) over [0, 100].
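Concentration and dilation are just pointwise powers of the MF; this Python sketch applies them to the bell MF for "old" used on the next slide.

# Concentration (very) and dilation (more or less) of a linguistic value.
def mu_old(x):
    return 1.0 / (1.0 + ((x - 100) / 30.0) ** 6)

def con(mu):   # CON(A) = A^2
    return lambda x: mu(x) ** 2

def dil(mu):   # DIL(A) = A^0.5
    return lambda x: mu(x) ** 0.5

very_old         = con(mu_old)
more_or_less_old = dil(mu_old)
for age in (60, 70, 80, 90):
    print(age, round(mu_old(age), 3), round(very_old(age), 3),
          round(more_or_less_old(age), 3))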
Composite linguistic terms
• Ex. Let the MFs for the linguistic terms young and old be

μ_young(x) = bell(x; 20, 2, 0) = 1 / (1 + (x/20)⁴)

μ_old(x) = bell(x; 30, 3, 100) = 1 / (1 + ((x − 100)/30)⁶)

• More or less old = DIL(old) = old^{0.5} = ∫_X [1 / (1 + ((x − 100)/30)⁶)]^{0.5} / x

Figure: membership functions of old and more or less old.
• Given μ_young(x) = 1 / (1 + (x/20)⁴) and μ_old(x) = 1 / (1 + ((x − 100)/30)⁶),
then not young and not old =

¬young ∩ ¬old = ∫_X [1 − 1/(1 + (x/20)⁴)] ∧ [1 − 1/(1 + ((x − 100)/30)⁶)] / x

Figure: membership functions of young, old, and "not young and not old".
• Given μ_young(x) = 1 / (1 + (x/20)⁴) and μ_old(x) = 1 / (1 + ((x − 100)/30)⁶),
then young but not too young =

young ∩ ¬young² = ∫_X [1/(1 + (x/20)⁴)] ∧ [1 − (1/(1 + (x/20)⁴))²] / x

Figure: membership functions of young, too young, not too young, and "young but not too young".
• Given μ_young(x) = 1 / (1 + (x/20)⁴) and μ_old(x) = 1 / (1 + ((x − 100)/30)⁶),
then extremely old =

CON(CON(CON(old))) = ((old²)²)² = old⁸ = ∫_X [1/(1 + ((x − 100)/30)⁶)]⁸ / x

Figure: membership functions of old, very old, very very old, and extremely old.
• Contrast intensification: The operation of contrast intensification on a linguistic value A is defined by

INT(A) = 2A²,        for 0 ≤ μ_A(x) ≤ 0.5
       = ¬2(¬A)²,    for 0.5 ≤ μ_A(x) ≤ 1

i.e. μ_INT(A)(x) = 2μ_A(x)² below the crossover and 1 − 2(1 − μ_A(x))² above it.

• The contrast intensifier
– increases μ_A if μ_A > 0.5,
– decreases μ_A if μ_A < 0.5.
• It reduces the fuzziness of the linguistic variable.
• Repeated application of INT converts a fuzzy set to a crisp set split at the crossover points.

Figure: contrast intensifier applied to the set A with μ_A(x) = triangle(x; 5, 30, 95), showing A, INT(A) and INT²(A).
Orthogonality

• A term set T = {t1, …, tn} of a linguistic variable x on the universe X is orthogonal if it fulfils the following property:

Σ_{i=1..n} μ_{ti}(x) = 1, ∀x ∈ X

• where the ti's are convex and normal fuzzy sets.
CLASSICAL RELATIONS AND FUZZY RELATIONS
RELATIONS
• Relations represent mappings between sets and connectives in logic.
• A classical binary relation represents the presence or absence of a connection, interaction or association between the elements of two sets.
• Fuzzy binary relations are a generalization of crisp binary relations; they allow various degrees of relationship (association) between elements.
CRISP CARTESIAN PRODUCT
Definition of (crisp) product set:
Let A and B be two non-empty sets; the product set or Cartesian product A × B is defined as follows:

A × B = {(a, b) | a ∈ A, b ∈ B}

(a set of ordered pairs (a, b))

Cartesian product of n sets:

A1 × A2 × … × An = ∏_{i=1..n} Ai = {(a1, a2, …, an) | a1 ∈ A1, …, an ∈ An}
CRISP RELATIONS
Definition of binary relation:
If A and B are two sets and there is a specific property between elements x of A and y of B, this property can be described using the ordered pair (x, y). A set of such (x, y) pairs, with x ∈ A and y ∈ B, is called a relation R:

R = {(x, y) | x ∈ A, y ∈ B}

Definition of n-ary relation:
For sets A1, A2, …, An, the relation among elements x1 ∈ A1, x2 ∈ A2, …, xn ∈ An can be described by the n-tuple (x1, x2, …, xn). A collection of such n-tuples is a relation R among A1, A2, …, An:

(x1, x2, …, xn) ∈ R, R ⊆ A1 × A2 × … × An
CRISP BINARY RELATIONS

Examples of binary relation representations: bipartite graph, coordinate diagram, and relation matrix.

Figure: a relation from A = {a1, a2, a3, a4} to B = {b1, b2, b3} drawn as a bipartite graph and as a coordinate diagram; its relation matrix is

R    b1  b2  b3
a1    1   0   0
a2    0   1   0
a3    0   1   0
a4    0   0   1
Classical Relations: complete relation

• Let R be a relation on the Cartesian universe X × Y.
• The complete relation, denoted E_R, is represented as an m × n matrix of ones, when X and Y contain m and n elements respectively.
• Ex. X = {Apple, Orange, Banana}, Y = {Father, Mother, Child}:

E_R = [ 1 1 1 ]
      [ 1 1 1 ]
      [ 1 1 1 ]
Classical Relations: null relation

• Let R be a relation on the Cartesian universe X × Z.
• The null relation, denoted ∅_R, is represented as an m × n matrix of zeros, when X and Z contain m and n elements respectively.
• Ex. X = {Apple, Orange, Banana}, Z = {Cycle, Car, Bike}:

∅_R = [ 0 0 0 ]
      [ 0 0 0 ]
      [ 0 0 0 ]
OPERATIONS ON CRISP RELATIONS
PROPERTIES OF CRISP RELATIONS

The properties of crisp sets (given below) hold good for crisp relations as well:
• Commutativity,
• Associativity,
• Distributivity,
• Involution,
• Idempotency,
• De Morgan's laws,
• Excluded middle laws.
COMPOSITION ON CRISP RELATIONS
Let R and S be relations on the Cartesian universes X × Y and Y × Z respectively, with X = {Apple, Chips, Sweets}, Y = {Ram, Sita, Baby}, Z = {Cycle, Car, Bike}:

R     Ra  Si  Ba        S     Cy  Ca  Bi
Ap     1   1   1        Ra     0   1   1
Ch     0   0   1        Si     0   1   0
Sw     1   0   1        Ba     1   0   0

Then the max-min composition T = R ∘ S is computed entry by entry; for example,

T(Ap, Cy) = max[ min(R(Ap, Ra), S(Ra, Cy)),
                 min(R(Ap, Si), S(Si, Cy)),
                 min(R(Ap, Ba), S(Ba, Cy)) ]
          = max[ min(1, 0), min(1, 0), min(1, 1) ] = 1

Computing every entry the same way gives

T     Cy  Ca  Bi
Ap     1   1   1
Ch     1   0   0
Sw     1   1   1
FUZZY CARTESIAN PRODUCT
Let R be a fuzzy subset of M and S be a fuzzy subset of N. Then the Cartesian product R × S is a fuzzy subset of M × N such that

μ_{R×S}(a, b) = min[μ_R(a), μ_S(b)]

Example:

Let R be a fuzzy subset of {a, b, c} such that R = 1/a + 0.8/b + 0.2/c, and S be a fuzzy subset of {1, 2, 3} such that S = 1/1 + 0.5/2 + 0.8/3. Then R × S is given by

        1    2    3
a     1.0  0.5  0.8
b     0.8  0.5  0.8
c     0.2  0.2  0.2
FUZZY RELATION
Fuzzy Relation Matrices

• Example: color-ripeness relation for tomatoes

R1(x, y)   unripe  semi ripe  ripe
green         1       0.5      0
yellow      0.3        1      0.4
red           0       0.2      1
Fuzzy Relation Matrices
• Example: Let R be a fuzzy relation between two sets X1 and X2, where X1 is the set of diseases and X2 is the set of symptoms.
X1 = {typhoid, viral fever, common cold}
X2 = {running nose, high temperature, shivering}
The fuzzy relation may be defined as

R(x1, x2)     Running nose  High temperature  Shivering
Typhoid            0.1            0.9            0.8
Viral fever        0.2            0.9            0.7
Common cold        0.9            0.4            0.6
Fuzzy Relation Matrices
• The elements of two sets are X = {3, 4, 5} and Y = {3, 4, 5, 6, 7}. The MF of the fuzzy relation is defined as

μ_R(x, y) = (y − x)/(x + y + 2), if y > x
          = 0,                   if y ≤ x

• Find the fuzzy relation matrix R.

          3    4      5      6      7
R =  3 [  0  0.111  0.2    0.273  0.333 ]
     4 [  0  0      0.091  0.167  0.231 ]
     5 [  0  0      0      0.077  0.143 ]
The Real-Life Relation

• x is close to y
– x and y are numbers
• x depends on y
– x and y are events
• x and y look alike
– x and y are persons or objects
• If x is large, then y is small
– x is an observed reading and y is a
corresponding action
Classical to Fuzzy Relations
• A classical relation is a set of tuples
– Binary relation (x,y)
– Ternary relation (x,y,z)
– N-ary relation (x1,…xn)
– Connection with Cross product
– Married couples
– Nuclear family
– Points on the circumference of a circle
– Sides of a right triangle that are all integers
Example (Approximate Equal)

X = Y = U = {1, 2, 3, 4, 5}

μ_R(u, v) = 1,    if u = v
          = 0.8,  if |u − v| = 1
          = 0.3,  if |u − v| = 2
          = 0,    otherwise

M_R = [ 1    0.8  0.3  0    0   ]
      [ 0.8  1    0.8  0.3  0   ]
      [ 0.3  0.8  1    0.8  0.3 ]
      [ 0    0.3  0.8  1    0.8 ]
      [ 0    0    0.3  0.8  1   ]
OPERATIONS ON FUZZY RELATION
The basic operations on fuzzy sets also apply to fuzzy relations.
Projection

R_X = proj[R; X], R_Y = proj[R; Y]

PROPERTIES OF FUZZY RELATIONS
The properties of fuzzy sets (given below) hold good for fuzzy relations as well:
• Commutativity,
• Associativity,
• Distributivity,
• Involution,
• Idempotency,
• De Morgan's laws,
• Excluded middle laws.
COMPOSITION OF FUZZY RELATIONS
Extension Principle
Introduction

• Extension Principle is the basic concept of the fuzzy set


theory that provides a general procedure for extending crisp
domains of mathematical expressions to fuzzy domains.

• This generalizes a common point-to-point mapping of a


function f(.) to a mapping between fuzzy sets.
Extension Principle
• Suppose f is a function from X to Y, and A is a fuzzy set on X defined as

A = μ_A(x1)/x1 + μ_A(x2)/x2 + … + μ_A(xn)/xn

• Then the extension principle states that the image of fuzzy set A under the mapping f(·) can be expressed as the fuzzy set B:

B = f(A) = μ_A(x1)/y1 + μ_A(x2)/y2 + … + μ_A(xn)/yn, where yi = f(xi), i = 1, …, n

If f(·) is a many-to-one mapping, then there exist x1, x2 ∈ X, x1 ≠ x2, such that f(x1) = f(x2) = y*, y* ∈ Y.
Ex. Let A = {0.9/−1, 0.4/1} and f(x) = x² − 1. Then f(−1) = f(1) = 0 = y*.
In this case the membership grade of B at y = y* is the maximum of the membership grades of A at x = x1 and x = x2, since f(x) = y* may result from either x = x1 or x = x2, i.e.

μ_B(y) = max_{x ∈ f⁻¹(y)} μ_A(x)
Ex. Extension Principle
• Let A = 0.1/−2 + 0.4/−1 + 0.8/0 + 0.9/1 + 0.3/2 and f(x) = x² − 3.
• Find the fuzzy set B using the extension principle.

Applying the extension principle, we have

B = 0.1/((−2)² − 3) + 0.4/((−1)² − 3) + 0.8/(0² − 3) + 0.9/(1² − 3) + 0.3/(2² − 3)
  = 0.1/1 + 0.4/−2 + 0.8/−3 + 0.9/−2 + 0.3/1
  = 0.8/−3 + max(0.9, 0.4)/−2 + max(0.3, 0.1)/1
  = 0.8/−3 + 0.9/−2 + 0.3/1

Figure: the mapping from X through f(x) to Y, showing how −1 and 1 both map to −2 and how −2 and 2 both map to 1.
Extension Principle on fuzzy sets with continuous universes
Example: Let μ_A(x) = bell(x; 1.5, 2, 0.5) and

f(x) = (x − 1)² − 1, if x ≥ 0
     = x,            if x < 0

Find fuzzy set B using the extension principle.

Figure a: fuzzy set A. Figure b: the function y = f(x). Figure c: fuzzy set B induced via the extension principle.
Fuzzy Rule

B. B. Misra
Fuzzy if-then rule / fuzzy rule / fuzzy implication / fuzzy conditional statements
A fuzzy if-then rule assumes the form
if x is A then y is B (i.e. A → B)
where A and B are linguistic values defined by fuzzy sets on the universes of discourse X and Y.
"x is A" is called the antecedent or premise.
"y is B" is called the consequent or conclusion.

Ex.
- If pressure is high, then volume is small.
- If the road is slippery, then driving is dangerous.
- If a tomato is red, then it is ripe.
- If the speed is high, then apply the brake a little.
A fuzzy rule A → B can be interpreted in two ways:

i. A coupled with B
If A is coupled with B, then

R = A → B = A × B = ∫_{X×Y} μ_A(x) ∗ μ_B(y) / (x, y)

where ∗ is a T-norm operator and A → B represents the fuzzy relation R.

ii. A entails B
If A entails B, it can be written as 4 different formulas:
1. Material implication: R = A → B = ¬A ∪ B
2. Propositional calculus: R = A → B = ¬A ∪ (A ∩ B)
3. Extended propositional calculus: R = A → B = (¬A ∩ ¬B) ∪ B
4. Generalization of modus ponens: μ_R(x, y) = sup{c | μ_A(x) ∗ c ≤ μ_B(y) and 0 ≤ c ≤ 1}
where R = A → B and ∗ is a T-norm operator.
These 4 formulas reduce to the identity A → B = ¬A ∪ B.

Figures: the surface min(A, B) over X × Y (A coupled with B) and the corresponding surface for A entails B.
Fuzzy Reasoning / Approximate Reasoning
• Fuzzy reasoning is an inference procedure that derives conclusions from a set of fuzzy rules.
• The compositional rule of inference plays a key role in fuzzy reasoning.

Compositional Rule of Inference
Given the curve y = f(x), from x = a we infer y = b = f(a) (Fig. 1).
If 'a' is an interval, then f(x) is an interval-valued function (Fig. 2). To find the resulting interval y = b corresponding to the interval x = a, we first construct a cylindrical extension of 'a' and then find its intersection I with the interval-valued curve. The projection of I onto the y-axis yields the interval y = b.

Figure 1: a point x = a mapped through y = f(x) to y = b. Figure 2: an interval x = a mapped through an interval-valued curve.
Let F be a fuzzy relation on X × Y, shown at fig. (a).
Let A be a fuzzy set of X, with its cylindrical extension c(A) shown at fig. (b).
The intersection of c(A) and F, fig. (c), forms the analog of the region of intersection I in fig. 2.
By projecting c(A) ∩ F onto the Y-axis, we infer y as a fuzzy set B on the Y-axis, shown at fig. (d).
Let μ_A, μ_{c(A)}, μ_B and μ_F be the membership functions of A, c(A), B, and F respectively, where μ_{c(A)} is related to μ_A through

μ_{c(A)}(x, y) = μ_A(x)

Then

μ_{c(A)∩F}(x, y) = min[μ_{c(A)}(x, y), μ_F(x, y)] = min[μ_A(x), μ_F(x, y)]

By projecting c(A) ∩ F onto the Y-axis, we have

μ_B(y) = max_x min[μ_A(x), μ_F(x, y)] = ∨_x [μ_A(x) ∧ μ_F(x, y)]

This formula reduces to the max-min composition of two relation matrices if both A (a unary fuzzy relation) and F (a binary fuzzy relation) have finite universes of discourse.
Conventionally, B is represented as

B = A ∘ F

where '∘' denotes the composition operator.
Using the compositional rule of inference, we can formalize an inference procedure upon a set of fuzzy if-then rules.
Fuzzy reasoning
• The basic rule of inference in two-valued logic is modus ponens, according to which we can infer the truth of a proposition B from the truth of A and the implication A → B.
• Ex. Let A = "the tomato is red" and B = "the tomato is ripe".
Then if it is true that "the tomato is red", it is also true that "the tomato is ripe".
- The above concept can be illustrated as
Premise 1 (fact): x is A,
Premise 2 (rule): if x is A, then y is B,
-------------------------------------------------
Consequence (conclusion): y is B
In human reasoning, modus ponens is used in an approximate manner.
Ex. Implication rule: "If the tomato is red, then it is ripe." If we know that "the tomato is more or less red", then we may infer that "the tomato is more or less ripe."
Written as:
Premise 1 (fact): x is A',
Premise 2 (rule): if x is A, then y is B,
-------------------------------------------------
Consequence (conclusion): y is B'
where A' is close to A and B' is close to B.
When A, B, A', and B' are fuzzy sets of appropriate universes, the foregoing inference procedure is called approximate reasoning or fuzzy reasoning; it is also called generalized modus ponens (GMP).
• Generalized modus tollens (GMT)

Premise 1 (fact): y is B',
Premise 2 (rule): if x is A, then y is B,
-------------------------------------------------
Consequence (conclusion): x is A'

where A' is close to A and B' is close to B.


Approximate Reasoning (Fuzzy reasoning)
Definition: Let A, A', and B be fuzzy sets of X, X, and Y, respectively. Assume that the fuzzy implication A → B is expressed as a fuzzy relation R on X × Y. Then the fuzzy set B' induced by "x is A'" and the fuzzy rule "if x is A then y is B" is defined by

μ_B'(y) = max_x min[μ_A'(x), μ_R(x, y)] = ∨_x [μ_A'(x) ∧ μ_R(x, y)]

or equivalently

B' = A' ∘ R = A' ∘ (A → B)
Single rule with single antecedent
Premise 1 (fact): x is A'
Premise 2 (rule): if x is A, then y is B
-------------------------------------------------
Consequence (conclusion): y is B'
where A' is close to A and B' is close to B. Then the rule "if x is A, then y is B" gives B' = A' ∘ R. For simplicity it is taken as

μ_B'(y) = {∨_x [μ_A'(x) ∧ μ_A(x)]} ∧ μ_B(y) = w ∧ μ_B(y)

Note that A' and B' are not complements of A and B.

Figure: graphical interpretation of GMP using Mamdani's fuzzy implication and the max-min composition.
Single rule with multiple antecedents
A fuzzy if-then rule with two antecedents is written as
"if x is A and y is B then z is C" or (A × B → C)
The corresponding problem for GMP is expressed as
Premise 1 (fact): x is A' and y is B'
Premise 2 (rule): if x is A and y is B then z is C
-------------------------------------------------------------
Consequence (conclusion): z is C'
This fuzzy rule can be transformed into a ternary fuzzy relation R_m based on Mamdani's fuzzy implication function:

R_m(A, B, C) = (A × B) × C = ∫_{X×Y×Z} μ_A(x) ∧ μ_B(y) ∧ μ_C(z) / (x, y, z)

The resulting C' is expressed as C' = (A' × B') ∘ (A × B → C). Thus,

μ_C'(z) = ∨_{x,y} [μ_A'(x) ∧ μ_B'(y)] ∧ [μ_A(x) ∧ μ_B(y) ∧ μ_C(z)]
        = {∨_x [μ_A'(x) ∧ μ_A(x)]} ∧ {∨_y [μ_B'(y) ∧ μ_B(y)]} ∧ μ_C(z)
        = (w1 ∧ w2) ∧ μ_C(z)

The MF of the resulting C' equals the MF of C clipped by the firing strength w, where w = w1 ∧ w2.

Degree of compatibility: w1 denotes the degree of compatibility between A and A', and w2 that between B and B'.
Firing strength or degree of fulfilment: since the antecedent part of the fuzzy rule is constructed by the connective "and", w1 ∧ w2 is called the firing strength of the fuzzy rule, which represents the degree to which the antecedent part of the rule is satisfied.
Multiple rules with multiple antecedents

Let R1 = A1 × B1 → C1 and R2 = A2 × B2 → C2. Since the max-min composition operator '∘' is distributive over the union operator, it follows that

C' = (A' × B') ∘ (R1 ∪ R2)
   = [(A' × B') ∘ R1] ∪ [(A' × B') ∘ R2]
   = C'1 ∪ C'2
4 steps of fuzzy reasoning
• Degree of Compatibility: Compare the known facts with the
antecedents of the fuzzy rules to find the degree of compatibility
with respect to each antecedent MF.
• Firing Strength: Combine degrees of compatibility with respect to
antecedent MFs in a rule using fuzzy AND or OR operators to form a
firing strength that indicates the degree to which the antecedent
part of the rule is satisfied.
• Qualified (induced) Consequent MFs: Apply the firing
strength to the consequent MF of a rule to generate a qualified
consequent MF. (The qualified consequent MFs represent how the
firing strength gets propagated and used in a fuzzy implication
statement.)
• Overall Output MF: Aggregate all the qualified consequent MFs
to obtain an overall output MF.
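The four steps above can be traced for a single two-antecedent rule in a few lines of Python. The rule and its triangular MFs here are hypothetical, chosen only to make the steps concrete; aggregation (step 4) would take the pointwise max over all rules' qualified MFs.

# The four fuzzy-reasoning steps for one rule "if x is A and y is B then z is C".
def tri(x, a, b, c):
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0)

mu_A = lambda x: tri(x, 0, 5, 10)   # hypothetical antecedent MFs
mu_B = lambda y: tri(y, 0, 4, 8)
mu_C = lambda z: tri(z, 0, 6, 12)   # hypothetical consequent MF

x0, y0 = 4.0, 6.0                   # crisp inputs
w1, w2 = mu_A(x0), mu_B(y0)         # step 1: degrees of compatibility
w = min(w1, w2)                     # step 2: firing strength (fuzzy AND)
mu_C_qualified = lambda z: min(w, mu_C(z))  # step 3: consequent MF clipped by w
print(w1, w2, w, mu_C_qualified(6.0))       # 0.8 0.5 0.5 0.5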
Fuzzy Inference Systems

• Introduction
• Mamdani fuzzy models
• Sugeno fuzzy models
• Tsukamoto fuzzy models
Introduction
• Fuzzy inference is a computing paradigm based on fuzzy set theory, fuzzy if-then rules and fuzzy reasoning.
• Applications: data classification, decision analysis, expert systems, time series prediction, robotics & pattern recognition.
• Different names: fuzzy rule-based system, fuzzy model, fuzzy associative memory, fuzzy logic controller & fuzzy system.
Introduction (cont.)
• Structure
– Rule base: selects the set of fuzzy rules
– Database (or dictionary): defines the membership functions used in the fuzzy rules
– Reasoning mechanism: performs the inference procedure (derives a conclusion from facts & rules!)
• Defuzzification: extraction of a crisp value that best represents a fuzzy set
– Need: it is necessary to have a crisp output in some situations, e.g. where the inference system is used as a controller
Block diagram for a fuzzy inference system: a crisp or fuzzy input x enters r parallel rules ("Rule i: x is Ai → y is Bi"), each firing with strength wi and producing a fuzzy output; an aggregator combines the r fuzzy outputs, and a defuzzifier converts the aggregate into a crisp output.

Introduction (cont.)

• Nonlinearity
– In the case of crisp inputs & outputs, a fuzzy inference system implements a nonlinear mapping from its input space to its output space.
Mamdani Fuzzy models [1975]

• Goal: control a steam engine & boiler combination by a set of linguistic control rules obtained from experienced human operators.

• Illustration: a two-rule Mamdani fuzzy inference system derives the overall output z when subjected to two crisp inputs x & y.

Two fuzzy inference systems were used as two controllers: one to generate the heat input to the boiler and one to regulate the throttle opening of the engine cylinder, so as to control the steam pressure in the boiler and the speed of the engine.

(The Mamdani FIS)


Mamdani Fuzzy models (cont.)

• Defuzzification [definition]
"It refers to the way a crisp value is extracted from a fuzzy set as a representative value."
– There are five methods of defuzzifying a fuzzy set A of a universe of discourse Z:
• Centroid of area z_COA
• Bisector of area z_BOA
• Mean of maximum z_MOM
• Smallest of maximum z_SOM
• Largest of maximum z_LOM
Mamdani Fuzzy models (cont.)
• Centroid of area z_COA:

z_COA = ∫_Z μ_A(z) z dz / ∫_Z μ_A(z) dz

where μ_A(z) is the aggregated output MF.

• Bisector of area z_BOA: this operator satisfies

∫_α^{z_BOA} μ_A(z) dz = ∫_{z_BOA}^β μ_A(z) dz

where α = min{z | z ∈ Z} and β = max{z | z ∈ Z}. The vertical line z = z_BOA partitions the region between z = α, z = β, y = 0 and y = μ_A(z) into two regions with the same area.
Mamdani Fuzzy models (cont.)
• Mean of maximum z_MOM
This operator computes the average of the maximizing z at which the MF reaches its maximum μ*. It is expressed by:

z_MOM = ∫_{Z'} z dz / ∫_{Z'} dz, where Z' = {z | μ_A(z) = μ*}

By definition, if μ_A(z) has a single maximum at z = z*, then z_MOM = z*.
However, if μ_A(z) reaches its maximum exactly at z1 and z2, then z_MOM = (z1 + z2)/2.
Mamdani Fuzzy models (cont.)

• Smallest of maximum z_SOM
Amongst all z that belong to [z1, z2], the smallest is called z_SOM.

• Largest of maximum z_LOM
Amongst all z that belong to [z1, z2], the largest is called z_LOM.
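On a sampled universe, all five defuzzifiers reduce to simple sums and scans. A minimal Python sketch, assuming a uniform grid and a hypothetical trapezoidal aggregated output MF (not the one from the worked example that follows):

# Numerical defuzzification of a sampled aggregated MF.
def trap(z, a, b, c, d):
    return max(min((z - a) / (b - a), 1, (d - z) / (d - c)), 0)

n = 801
Z  = [8 * i / (n - 1) for i in range(n)]
MU = [trap(z, 4, 6, 7, 8) for z in Z]      # hypothetical aggregated output MF

area   = sum(MU)
z_coa  = sum(m * z for m, z in zip(MU, Z)) / area    # centroid of area
peak   = max(MU)
maxima = [z for z, m in zip(Z, MU) if m == peak]
z_som, z_lom = min(maxima), max(maxima)              # smallest / largest of maximum
z_mom  = (z_som + z_lom) / 2                         # mean of maximum (flat top)
print(round(z_coa, 2), z_som, z_lom, z_mom)          # ~6.2 6.0 7.0 6.5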
Various defuzzification schemes for obtaining a crisp output

Figure: the qualified consequent MFs μ_C'1(z), μ_C'2(z) and μ_C'3(z) over Z = [0, 8].

Ex.
Premise 1 (fact): x is A' and y is B'
Premise 2 (rule): if x is A1 and y is B1 then z is C1
Premise 3 (rule): if x is A2 and y is B2 then z is C2
Premise 4 (rule): if x is A3 and y is B3 then z is C3
-----------------------------------------------------
Consequence (conclusion): z is C'

Figure: the aggregated output MF μ_C'(z) over Z = [0, 8].

The defuzzification of C' using the different methods is presented in the subsequent slides.

 ' Z 
c
1
For ease of calculation, total area under C’ is
.9
converted to different segments numbered as 1,
2, …, 9. .8

1. Centroid of area, ZCOA .7


.6
Seg. Zi
No. (z value up to seg.+  C ' Zi  Z i .5 7
Area=  ' Zi 
C z value of the
centroid of seg.) .4 4
1 (0.3*1)/2=0.15 0+1*(2/3)=0.67 0.1005 .3
6
2 0.3*(3.6-1)=0.78 1+(3.6-1)/2=2.3 1.794 .2 5
2 9
3 0.3*(4-3.6)=0.12 3.6+(4-3.6)/2=3.8 0.456 .1 3 8
4 (0.5-0.3)*(4-3.6)/2 3.6+ (4-3.6)*2/3 0.1546 1
0 Z
=0.04 = 3.8667
0 1 2 3 4 5 6 7 8
5 0.5*(5.5-4)=0.75 4+(5.5-4)/2=4.75 3.5625 5.5
3.6
6 0.5*(6-5.5)=0.25 5.5+(6-5.5)/2
=5.75
1.4375 C’
Then, the defuzzied value using
7 (1.0-0.5)*(6-5.5)/2 5.5+(6-5.5)*2/3 0.729
= 0.125 =5.833 centroid of area method is
8 1.0*(7-6)=1 6+(7-6)/2=6.5 6.5  C ' Z Z
ZCOA 
9 1.0*(8-7)/2=0.5 7+(8-7)/3=7.333 3.665  C '  Z 
sum  C ' ( Z )dZ  C ' ( Z ) ZdZ
=3.715 =18.353 =18.353/3.715 =4.9
 ' Z 
c
1
2. Bisector of area, Z BOA .9
Find a vertical line (or point in Z i.e. .8
ZBOA) that divides the entire area into .7
two equal segments.
.6
As segment 4 is above 3 and 7 is above .5 7
6, any vertical line dividing one .4 4
segment will divide the other, hence .3
6
both the segments may be treated as a .2 5
single segment. 2 9
.1 3 8
1
0 Z
0 1 2 3 4 5 6 7 8
Area left to right Area right to left
3.6 C’ 5.5
Difference
Balance
Seg. Area Cum Area Seg.No Area Cum in cum
area
No. . Area area
A B C=C+B D E F=F+E G=C-F H-(B+E)
- 0 0 - 0 0 0 3.715
1 0.15 0.15 9 0.5 0.5 -0.35 3.065
2. Bisector of area, z_BOA
Find a vertical line (a point z_BOA in Z) that divides the entire area into two equal halves.
As segment 4 lies above segment 3, and segment 7 above segment 6, any vertical line dividing one also divides the other, so each pair may be treated as a single segment.

Accumulate area from the left and from the right:

From left              | From right            | Diff. in  | Balance
Seg.   Area   Cum.     | Seg.   Area   Cum.    | cum. area | area
-      0      0        | -      0      0       | 0         | 3.715
1      0.15   0.15     | 9      0.5    0.5     | -0.35     | 3.065
2      0.78   0.93     | 8      1.0    1.5     | -0.57     | 1.285
3&4    0.16   1.09     | -      -      1.5     | -0.41     | 1.125
5      0.75   1.84     | -      -      1.5     | 0.34      | 0.375

A negative difference means the right side still contains more area. The difference turns positive once segment 5 is added on the left, which shows that z_BOA passes through segments 6 & 7.
To evaluate z_BOA = x, equate the area to its left with the area to its right, carrying the accumulated difference:

0.34 + (x − 5.5)×0.5 + (1/2)(x − 5.5)×0.5 = (6 − x)×0.5 + 0.125 − (1/2)(x − 5.5)×0.5

which gives z_BOA = x = 5.5433.
 ' Z 
c
1
3. Smallest of maximumZ SOM
.9
Z SOM = 6 .8
.7
.6
4. Largest of maximum Z LOM 7
.5
Z LOM =7 .4 4
.3
6
.2 5
5. Mean of maximum, Z MOM 2 9
.1 3 8
1
Z  Z SOM 7  6 0
Z MOM  LOM   6.5 Z
2 2 0 1 2 3 4 5 6 7 8
C’ Z MOM
Z SOM Z LOM
Sugeno Fuzzy Models [Takagi, Sugeno & Kang, 1985]

• Goal: generation of fuzzy rules from a given input-output data set.

• A TSK fuzzy rule is of the form:
"If x is A and y is B then z = f(x, y)"
where A & B are fuzzy sets in the antecedent, while z = f(x, y) is a crisp function in the consequent.

• f(·,·) is very often a polynomial function w.r.t. x & y.
Sugeno Fuzzy Models (cont.)
• If f(·,·) is a first-order polynomial, the resulting fuzzy inference system is called a first-order Sugeno fuzzy model.
• If f(·,·) is a constant, it is a zero-order Sugeno fuzzy model (a special case of the Mamdani model).
• Case of two rules with a first-order Sugeno fuzzy model:
– Each rule has a crisp output.
– The overall output is obtained via weighted average.
– No defuzzification is required.
The Sugeno fuzzy model
Sugeno Fuzzy Models (cont.)

Example 1: single-input single-output Sugeno fuzzy model with three rules:

If X is small then Y = 0.1X + 6.4
If X is medium then Y = -0.5X + 4
If X is large then Y = X - 2

If "small", "medium" & "large" are nonfuzzy sets, then the overall input-output curve is piecewise linear.
However, with smooth membership functions (fuzzy rules), the overall input-output curve becomes smooth as well.
Example 2: two-input single-output fuzzy model with 4 rules:

R1: if X is small & Y is small then z = -x + y + 1
R2: if X is small & Y is large then z = -y + 3
R3: if X is large & Y is small then z = -x + 3
R4: if X is large & Y is large then z = x + y + 2

Each rule fires with strength
R1: (x is small) & (y is small) → w1
R2: (x is small) & (y is large) → w2
R3: (x is large) & (y is small) → w3
R4: (x is large) & (y is large) → w4

Aggregated consequent: F[(w1, z1); (w2, z2); (w3, z3); (w4, z4)] = weighted average
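A sketch of Example 2 in Python: the two antecedent MFs "small" and "large" below are hypothetical (the slide does not specify them), and product is used as one common choice of fuzzy AND; the overall output is the weighted average of the four crisp rule outputs, with no defuzzification step.

# First-order Sugeno model: weighted average of crisp rule outputs.
small = lambda v: max(0.0, min(1.0, (5 - v) / 5))   # hypothetical MF on [0, 5]
large = lambda v: max(0.0, min(1.0, v / 5))         # hypothetical MF on [0, 5]

def sugeno(x, y):
    rules = [
        (small(x) * small(y), -x + y + 1),   # R1
        (small(x) * large(y), -y + 3),       # R2
        (large(x) * small(y), -x + 3),       # R3
        (large(x) * large(y),  x + y + 2),   # R4
    ]
    num = sum(w * z for w, z in rules)
    den = sum(w for w, _ in rules)
    return num / den

print(round(sugeno(1.0, 4.0), 3))   # 1.2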
Antecedent & consequent MFs
Overall input-output surface
Tsukamoto Fuzzy models [1979]

• Characterized by the following: the consequent of each fuzzy if-then rule is represented by a fuzzy set with a monotonic MF.

• The inferred output of each rule is a crisp value induced by the rule's firing strength.
The Tsukamoto fuzzy model
Tsukamoto Fuzzy models (cont.)

• Example: single-input Tsukamoto fuzzy model with 3 rules:

if X is small then Y is C1
if X is medium then Y is C2
if X is large then Y is C3
Genetic Algorithm (GA)

B B Misra
• The GA was introduced by Prof. John Holland of the University of Michigan, Ann Arbor, USA, in 1965, although his seminal book was published in 1975. This book laid the foundation of GAs.
• Genetic algorithms are heuristic search and optimization techniques that mimic the process of natural evolution.
• Principle of natural selection:
Better-fit individuals have a higher chance of survival.
An Example
• Giraffes have long necks.
• Giraffes with slightly longer necks could feed on leaves of higher branches when all lower ones had been eaten off.
• They had a better chance of survival.
• The favourable characteristic propagated through generations of giraffes.
• Now the evolved species has long necks.
• These longer necks may initially have arisen through the effect of mutation.
• However, as the trait was favourable, it was propagated over the generations.
Evolution of species

Initial population of animals
↓
Struggle for existence; survival of the fittest
↓
Surviving individuals reproduce, propagating favorable characteristics
↓
Evolved species

• Thus genetic algorithms implement optimization strategies by simulating the evolution of species through natural selection.
Simple Genetic Algorithm
function sga() {
    Initialize population;
    Calculate fitness function;
    while (fitness value != termination criteria) {
        Selection;
        Crossover;
        Mutation;
        Calculate fitness function;
    }
}

GA operators:
• Problem encoding
• Fitness evaluation
• Crossover
• Mutation
• Selection
• Termination
GA example
B B Misra
• Let us consider the problem: Maximize f(x) = x(8-x)
• This is not a typical problem for GA; for GA we normally consider problems for which solutions/mathematical models do not exist, or for which the time required to solve them exactly is very high.
• However, to understand GA, let us consider this simple problem.
• Maximize f(x) = x(8-x)

   x     f(x)
   0      0
   1      7
   2     12
   3     15
   4     16
   5     15
   6     12
   7      7
   8      0
   9     -9
  10    -20
  ...    ...
  -1     -9
  -2    -20
  ...    ...

Graph: plot of f(x) = x(8-x) for -2 ≤ x ≤ 10 (f(x) values ranging from -20 to 20).
• For complex problems, taking all possible inputs in real space and finding the respective solutions is not possible.
• Let us examine the problem here: Maximize f(x) = x(8-x)
• As we want to maximize, let us ignore the -ve values of f(x).
• Then when x = 0, f(x) = 0, and when x = 8, f(x) = 0.
• f(x) has higher +ve values between these two extremes.
• Let us take 0 ≤ x ≤ 8 as our search space for the problem.
Encoding
Problem encoding
• We will discuss binary GA here.
• Our search space for the problem Maximize f(x) = x(8-x) is 0 ≤ x ≤ 8.
• We know that with n bits we can encode 2^n values:

  Bits   Codes                                     Values encoded
  1      0, 1                                      2^1 = 2
  2      00, 01, 10, 11                            2^2 = 4
  3      000, 001, 010, 011, 100, 101, 110, 111    2^3 = 8
  n      ...                                       2^n

• Then for our search space 0 ≤ x ≤ 8, i.e. for 9 values, we need ⌈log2(9)⌉ = 4 bits.
• That is, we will encode our problem in 4 binary bits for manipulation in the GA.
Initialization
• Unlike natural evolution, we cannot allow the algorithm to run for billions of years.
• Hence usually we take a finite number of generations as the upper bound.
• If the problem does not converge, we increase the upper bound; if it converges too early, we reduce the upper bound.
• Similarly we define the number of individuals in the population, i.e. the population size.
• From the encoding we know how many bits are required to specify the chromosome/properties of an individual.
• For hand calculation, take a small population:
• MaxGen = 10 (upper bound of generations)
• PopSz = 6 (population size)
• CrLen = 4 (chromosome length = bits required to encode the problem)
• Pc = 0.8 (probability of crossover)
• Pm = 0.01 (probability of mutation)
• Pop = round(rand(PopSz, CrLen)); (population initialization)
• Ex. Pop = round(rand(PopSz, CrLen)); may give

        0 1 1 0
        1 0 1 1
  Pop = 1 1 0 1
        0 1 0 1
        1 0 1 0
        0 0 1 1
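As a sketch, the same initialization can be written in Python (the names mirror the slide's MATLAB-style parameters; this is an illustration, not the original code):

import random

MAX_GEN = 10    # upper bound of generations
POP_SZ = 6      # population size
CR_LEN = 4      # chromosome length (bits needed for 0 <= x <= 8)
PC = 0.8        # probability of crossover
PM = 0.01       # probability of mutation

# equivalent of Pop = round(rand(PopSz, CrLen)): each gene is a random bit
pop = [[random.randint(0, 1) for _ in range(CR_LEN)] for _ in range(POP_SZ)]
print(pop)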
Function/Fitness evaluation

  Id    g1  g2  g3  g4     x    f(x)
  #1     0   1   1   0     6     12
  #2     1   0   1   1    11    -33
  #3     1   1   0   1    13    -65
  #4     0   1   0   1     5     15
  #5     1   0   1   0    10    -20
  #6     0   0   1   1     3     15

(rows = individuals; g1..g4 = genes in the chromosome)
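A short Python sketch that reproduces this table by decoding each 4-bit chromosome to x and evaluating f(x) = x(8-x):

def decode(chrom):
    # binary chromosome (most significant bit first) -> integer x
    x = 0
    for bit in chrom:
        x = 2 * x + bit
    return x

def f(x):
    return x * (8 - x)

for chrom in [[0,1,1,0], [1,0,1,1], [1,1,0,1], [0,1,0,1], [1,0,1,0], [0,0,1,1]]:
    x = decode(chrom)
    print(chrom, x, f(x))   # e.g. [0, 1, 1, 0] -> x = 6, f(x) = 12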
for gen = 1 to MaxGen
    tempPop = Pop;    % A copy of the population is taken in tempPop; new
                      % offspring/children born after crossover are added to it.
                      % The fitness of tempPop is evaluated, and in the selection
                      % process the individuals selected are stored back in Pop.
    tempSz = PopSz;   % In the natural evolution process there is no control on the
                      % number of individuals produced, and the population size goes
                      % up. A computer has limited memory, hence we go for a
                      % fixed-population GA. After one individual is produced, tempSz
                      % is incremented; to remember the initial decision of population
                      % size we do not manipulate the contents of PopSz.
    Perform crossover
    Perform mutation
    Evaluate fitness
    Perform selection
end
Crossover
• We randomly select two individuals from the mating pool (the selected population) for crossover.
• Ex. Single-point crossover: randomly generate a crossover point (cp) across the chromosome length, and generate two offspring by exchanging the genes across the crossover point.
• Let cp = 2:

       P1(1) P1(2) P1(3) P1(4)
  p1     0     1     1     0

       P2(1) P2(2) P2(3) P2(4)
  p2     1     0     1     1

       P1(1) P1(2) P2(3) P2(4)
  c1     0     1     1     1

       P2(1) P2(2) P1(3) P1(4)
  c2     1     0     1     0
Single point crossover
for p = 1 to PopSz/2
    if rand < Pc                          % rand: real value in [0, 1]
        p1 = 1 + round(rand*(PopSz-1));   % with PopSz = 6: rand*(PopSz-1) is a real
        p2 = 1 + round(rand*(PopSz-1));   % value in [0,5], round(...) an integer in
        cp = 1 + round(rand*(CrLen-1));   % [0,5], and 1+round(...) an integer in [1,6]
        tempSz = tempSz + 1;
        for i = 1 to cp
            tempPop(tempSz, i) = Pop(p1, i);
        end
        for i = cp+1 to CrLen
            tempPop(tempSz, i) = Pop(p2, i);
        end
        tempSz = tempSz + 1;
        for i = 1 to cp
            tempPop(tempSz, i) = Pop(p2, i);
        end
        for i = cp+1 to CrLen
            tempPop(tempSz, i) = Pop(p1, i);
        end
    end
end
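The same operation in Python, as a sketch with chromosomes held as lists of bits:

import random

def single_point_crossover(p1, p2):
    # crossover point in 1..CrLen-1, so both segments are non-empty
    cp = random.randint(1, len(p1) - 1)
    c1 = p1[:cp] + p2[cp:]   # head of parent 1, tail of parent 2
    c2 = p2[:cp] + p1[cp:]   # head of parent 2, tail of parent 1
    return c1, c2

# with cp = 2, p1 = [0,1,1,0] and p2 = [1,0,1,1] give
# c1 = [0,1,1,1] and c2 = [1,0,1,0], as in the example above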
How to use crossover probability to approximate the number of crossover operations
• If the rand() function is called 1000 times and we count how many random numbers fall in each interval 0-0.1, 0.1-0.2, ..., 0.9-1.0, the counts in all intervals are close to one another (the numbers are uniformly distributed).
• Hence with Pc = 0.8, if we use "if rand < Pc", we expect the test to succeed for about 80% of the random numbers generated.
• An evolutionary system is stochastic, so we do not want the number of crossovers rigidly fixed.

Figure: frequency histogram of 1000 uniform random numbers over the ten intervals .1, .2, ..., 1.0.
Mutation
for p = PopSz+1 to tempSz          % In crossover, each gene is inherited from
    for c = 1 to CrLen             % one of the parents, but a mutated gene may
        if rand < Pm               % not belong to either parent.
            if tempPop(p,c) == 1   % Mutation occurs very, very rarely.
                tempPop(p,c) = 0;
            else
                tempPop(p,c) = 1;
            end
        end
    end
end

                                x    f(x)
Before mutation   1 1 0 0      12    -48
After mutation    0 1 0 0       4     16
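A Python sketch of the same bit-flip mutation, flipping each gene independently with probability Pm:

import random

PM = 0.01   # mutation probability

def mutate(chrom, pm=PM):
    # flip a gene (0 <-> 1) whenever the random draw falls below pm
    return [1 - g if random.random() < pm else g for g in chrom]

print(mutate([1, 1, 0, 0]))   # occasionally e.g. [0, 1, 0, 0]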
Selection
• Though the principle behind selection is “survival of the fittest”, the evolutionary system contains both better-fit and less-fit individuals; however, the less-fit individuals die out slowly.

% Binary tournament selection
• Two individuals are selected randomly and their fitness is compared; the individual with the higher fitness is selected.
• Let #2 and #5 be selected randomly:

  Id   g1  g2  g3  g4     x   f(x)
  #2    1   0   1   1    11   -33
  #5    1   0   1   0    10   -20

• Here f(#2) < f(#5), so #5 is selected for the next generation.
Binary tournament selection
for p = 1 to PopSz                      % no. of individuals selected = initial pop size
    p1 = 1 + round(rand*(tempSz-1));    % individuals picked from the recently generated
    p2 = 1 + round(rand*(tempSz-1));    % population, i.e. tempPop of size tempSz
    if f(p1) > f(p2)
        Pop(p, :) = tempPop(p1, :);
    else
        Pop(p, :) = tempPop(p2, :);
    end
end
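A Python sketch of the same binary tournament, assuming fitness is a list aligned with temp_pop:

import random

def tournament_select(temp_pop, fitness, pop_sz):
    new_pop = []
    for _ in range(pop_sz):   # no. of individuals selected = initial pop size
        a = random.randrange(len(temp_pop))
        b = random.randrange(len(temp_pop))
        winner = a if fitness[a] > fitness[b] else b
        new_pop.append(temp_pop[winner][:])   # copy the winning chromosome
    return new_pop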
Encoding for GA
B. B. Misra
• Usually there are only two main components of most genetic algorithms that are problem dependent:
  – the problem encoding, and
  – the evaluation function.
• The problem is viewed as a black box with a series of control dials represented by the parameters.
• The value returned by the evaluation function is considered the black-box output.
• The output indicates how well a particular combination of parameter settings solves the optimization problem.
• The goal is to set the various parameters so as to optimize some output.
• Generally GA is used to solve nonlinear problems.
• Each parameter may not be treated as an independent variable.
• The combined effects of the interacting parameters are considered to optimize the black-box output.
• In the genetic algorithm community, the interaction between variables is sometimes referred to as epistasis.
• The first assumption is that the variables representing parameters can be represented by bit strings.
• That is, the variables are discretized in an a priori fashion, and the range of the discretization corresponds to some power of 2.
• For example, with 10 bits per parameter, we obtain a range with 1024 discrete values.
• When the parameters are continuous, the discretization is not a particular problem.
• The discretization provides enough resolution to make it possible to adjust the output with the desired level of precision.
• It also assumes that the discretization is in some sense representative of the underlying function.
• If some parameter can only take on values from an exact finite set, then the coding issue becomes more difficult.
• For example, suppose there are exactly 1200 discrete values which can be assigned to some variable Xi.
• We need at least 11 bits to cover this range, but this codes for a total of 2048 discrete values.
• The 848 unnecessary bit patterns do not correspond to valid values and may result in no evaluation.
• Solving such coding problems is usually considered to be part of the design of the evaluation function.
Problem of extra search space due to encoding in higher dimension

  Function                              Problem search space           Bits reqd.   GA search space                  Ratio of search space needed
  ↑f(x)=x(8-x)                          0 ≤ x ≤ 8, i.e. 9              4            0 ≤ x ≤ 15, i.e. 16              9/16 ≈ 1/2
  ↑f(x)=x1(8-x1)+x2(8-x2)               0 ≤ x1,x2 ≤ 8, i.e. 9^2        4*2          0 ≤ x1,x2 ≤ 15, i.e. 16^2        (9/16)^2 ≈ (1/2)^2
  ↑f(x)=x1(8-x1)+x2(8-x2)+x3(8-x3)      0 ≤ x1,x2,x3 ≤ 8, i.e. 9^3     4*3          0 ≤ x1,x2,x3 ≤ 15, i.e. 16^3     (9/16)^3 ≈ (1/2)^3
  ↑f(x)=Σxi(8-xi), 1≤i≤10               0 ≤ xi ≤ 8, i.e. 9^10          4*10         0 ≤ xi ≤ 15, i.e. 16^10          (9/16)^10 ≈ (1/2)^10 = 9.765×10^-4
  ↑f(x)=Σxi(8-xi), 1≤i≤100              0 ≤ xi ≤ 8, i.e. 9^100         4*100        0 ≤ xi ≤ 15, i.e. 16^100         (9/16)^100 ≈ (1/2)^100 = 7.888×10^-31
Figure: the problem search space (0 ≤ x ≤ 8) shown as a subset of the GA search space (0 ≤ x ≤ 15) on a number line from 0 to 14; as more variables are encoded, the problem search space becomes an ever smaller fraction of the GA search space.
Key ideas for encoding
• Use a data structure as close as possible to the natural representation
• Write appropriate genetic operators as needed
• If possible, ensure that all genotypes correspond to feasible solutions
• If possible, ensure that genetic operators preserve feasibility
Encoding
• Encoding of chromosomes is one of the first problems you face when starting to solve a problem with GA.
• Encoding depends on the problem.
• Selecting a proper encoding is important for solving the problem.
• Encoding represents a transformation of the solved problem into an N-dimensional space.
Binary Encoding (1)
• Binary encoding is the most common, mainly because the first works on GAs used this type of encoding.
• In binary encoding every chromosome is a string of bits, 0 or 1.
• Chromosome A: 101100101100101011100101
  Chromosome B: 111111100000110000011111
Binary Encoding (2)
• Binary encoding gives many possible chromosomes even with a small number of alleles.
• On the other hand, this encoding is often not natural for many problems, and sometimes corrections must be made after crossover and/or mutation.
Binary Encoding (3)
• Example: Knapsack problem
• The problem: there are certain precious items with given value and size, and a knapsack with a given capacity. Select items so as to maximize the value of the items in the knapsack without exceeding the knapsack capacity.
• Encoding: each bit says whether the corresponding object is in the knapsack.
Integer represented as Binary encoding
• To represent integer values, their binary equivalent can be taken.
• Example: maximize f(x) = x(8-x)
• Values of x may be represented in binary form, e.g. x = 5 as

  0 1 0 1
Linear mapping
• A linear value x in the range [xl, xu] needs to be represented in n binary bits, where xl is the lower and xu is the upper bound of the value x.
• Then the binary value, after conversion to its decimal equivalent xn, can be mapped to the appropriate range using
  x = xl + ((xu - xl) / (2^n - 1)) × xn
Linear mapping contd.
• Ex. Let the minimum mark to pass be 40 and the maximum marks be 100. To optimize the performance of passed students, the range 100-40 = 60 needs 6 bits to encode.
• Let an individual be 0 1 0 0 1 0
• Then xn = 18
• And x = 40 + (100-40)/(64-1) × 18 = 57.14
• The integer value 57 (truncated or rounded off) may be considered for the problem.
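A Python sketch of this mapping, checked against the marks example:

def linear_map(xn, xl, xu, n_bits):
    # map the decoded integer xn in [0, 2^n - 1] to a real value in [xl, xu]
    return xl + (xu - xl) / (2 ** n_bits - 1) * xn

print(linear_map(18, 40, 100, 6))   # 57.14...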
Binary string
• In some cases the decimal conversion of the binary string is not required; to obtain the actual value, the string is compared with a table of values:

  Code   Fibre Angle
  0000     0
  0001    10
  0010    20
  0011    30
  0100    45
  0101    60
  0110   -10
Permutation Encoding 1
• Permutation encoding can be used in ordering problems, such as the travelling salesman problem or a task-ordering problem.
• In permutation encoding, every chromosome is a string of numbers that represents a position in a sequence.
• Chromosome A: 1 5 3 2 6 4 7 9 8
  Chromosome B: 8 5 6 7 2 3 1 4 9
Permutation Encoding 2
• Permutation encoding is only useful for ordering problems.
• For some types of crossover and mutation, corrections must be made to leave the chromosome consistent.
Travelling salesman problem (TSP)
• Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?
• It is an NP-hard problem in combinatorial optimization, important in theoretical computer science and operations research.
• In the theory of computational complexity, the decision version of the TSP (where, given a length L, the task is to decide whether the graph has a tour of at most L) belongs to the class of NP-complete problems.
• Thus, it is possible that the worst-case running time for any algorithm for the TSP increases superpolynomially (but no more than exponentially) with the number of cities.
TSP cntd.
• A physical interpretation of the abstract problem: consider a graph G as a map of n cities, where w(i, j) is the distance between cities i and j.
• A salesman wants to make a tour of the cities which starts and ends at the same city and includes visiting each of the remaining cities once and only once.
• In the graph, if we have n vertices (cities), then there are (n-1)! possible routes, and the total number of Hamiltonian circuits in a complete graph of n vertices will be (n-1)!/2.
TSP cntd.
• The problem: we are given a set of cities and a symmetric distance matrix that indicates the cost of travel from each city to every other city.
• The goal is to find the shortest circular tour, visiting every city exactly once, so as to minimize the total travel cost, which includes the cost of traveling from the last city back to the first city.
TSP encoding
• Every city may be represented with an integer.
• Consider 6 Indian cities: Mumbai, Nagpur, Calcutta, Delhi, Bangalore and Chennai, and assign a number to each:

  Cities       Code
  Mumbai        1
  Nagpur        2
  Calcutta      3
  Delhi         4
  Bangalore     5
  Chennai       6
TSP encoding cntd.
• Thus a path would be represented as a sequence of integers from 1 to 6.
• The path [1 2 3 4 5 6] represents a path from Mumbai to Nagpur, Nagpur to Calcutta, Calcutta to Delhi, Delhi to Bangalore, Bangalore to Chennai, and finally from Chennai back to Mumbai.
• This is an example of permutation encoding, as the position of the elements determines the fitness of the solution.
TSP: Fitness Function
• The fitness function will be the total cost of
the tour represented by each chromosome.
• This can be calculated as the sum of the
distances traversed in each travel segment.
• The less the sum of the distances, the better
fit the solution represented by that
chromosome.
Distance/Cost Matrix For TSP

         1      2      3      4      5      6
  1      0    863   1987   1407    998   1369
  2    863      0   1124   1012   1049   1083
  3   1987   1124      0   1461   1881   1676
  4   1407   1012   1461      0   2061   2095
  5    998   1049   1881   2061      0    331
  6   1389   1083   1676   2095    331      0

Cost matrix for the six-city example; distances in kilometers.
TSP fitness evaluation

  Dist    1(M)   2(N)   3(Ca)   4(D)   5(B)   6(Ch)
  1(M)       0    863    1987   1407    998    1369
  2(N)     863      0    1124   1012   1049    1083
  3(Ca)   1987   1124       0   1461   1881    1676
  4(D)    1407   1012    1461      0   2061    2095
  5(B)     998   1049    1881   2061      0     331
  6(Ch)   1389   1083    1676   2095    331       0

• So, for a chromosome [4 1 3 2 5 6], the total cost of travel or fitness will be calculated as
  Fitness = Dist(4,1) + Dist(1,3) + Dist(3,2) + Dist(2,5) + Dist(5,6) + Dist(6,4)
          = 1407 + 1987 + 1124 + 1049 + 331 + 2095 = 7993 km
• Since our objective is to minimize the distance, the less the total distance, the better fit the solution.
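A Python sketch of this fitness evaluation, using the cost matrix above:

# distance matrix from the slide (km); index 0 = Mumbai, ..., 5 = Chennai
DIST = [
    [   0,  863, 1987, 1407,  998, 1369],
    [ 863,    0, 1124, 1012, 1049, 1083],
    [1987, 1124,    0, 1461, 1881, 1676],
    [1407, 1012, 1461,    0, 2061, 2095],
    [ 998, 1049, 1881, 2061,    0,  331],
    [1389, 1083, 1676, 2095,  331,    0],
]

def tour_cost(tour):
    # total length of the closed tour; tour holds city codes 1..6
    cost = 0
    for i in range(len(tour)):
        a = tour[i] - 1
        b = tour[(i + 1) % len(tour)] - 1   # wrap back to the start city
        cost += DIST[a][b]
    return cost

print(tour_cost([4, 1, 3, 2, 5, 6]))   # 7993, as calculated above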
TSP crossover
• The single-point crossover method randomly selects a crossover point in the string and swaps the substrings.
• This may produce some invalid offspring, as shown below (city 1 appears twice in the first offspring and city 2 twice in the second):

  Parents            Offspring
  4 1 3 | 2 5 6      4 1 3 | 1 5 6
  4 3 2 | 1 5 6      4 3 2 | 2 5 6
Value Encoding
• Direct value encoding can be used in problems where some complicated values, such as real numbers, are used. Use of binary encoding for this type of problem would be very difficult.
• In value encoding, every chromosome is a string of some values. Values can be anything connected to the problem, from numbers, real numbers or chars to more complicated objects.
• Chromosome A: 1.2324 5.3243 0.4556 2.3293 2.4545
  Chromosome B: ABDJEIFJDHDIERJFDLDFLFEGT
  Chromosome C: (back), (back), (right), (forward), (left)
Value Encoding
• Value encoding is very good for some special problems.
• New crossover and mutation operators specific to the problem may be required.
• In value encoding, more than one gene may have the same value in a chromosome (which is not allowed in permutation encoding).
Value Encoding example
• To solve the following problems using GA, make a value encoding of
  i.   the β values in multiple linear regression,
  ii.  the β and λ values in ridge regression,
  iii. the β, λ, and η values in LASSO,
  iv.  the β, λ, η, and r values in elastic net.
Value Encoding example
• Example of problem: finding weights for a neural network.
• The problem: there is some neural network with a given architecture. Find the weights on the inputs of the neurons to train the network for the desired output.
• Encoding: real values in the chromosome represent the corresponding weights for the inputs.
Tree Encoding
• GAs may also be used for program design and construction. In that case chromosome genes represent programming-language commands, mathematical operations and other components of a program.
• In tree encoding every chromosome is a tree of some objects, such as functions or commands in a programming language.
Tree Encoding
Figure: an example chromosome drawn as an expression tree with operator nodes and operand leaves (the variables x, y and the constant 3).
Tree Encoding
• Tree encoding is good for evolving programs. The programming language LISP is often used for this, because programs in it are represented in this form and can be easily parsed as a tree, so crossover and mutation can be done relatively easily.
Tree Encoding
• Example of problem: finding a function fitting given values.
• The problem: some input and output values are given. The task is to find a function which gives the best (closest to the wanted) output for all inputs.
• Encoding: chromosomes are functions represented as trees.
• The order of genes on the chromosome can be important.
• Generally many different codings for the parameters of a solution are possible.
• Good coding is probably the most important factor for the performance of a GA.
• In many cases many possible chromosomes do not code for feasible solutions.
• During coding, take care that the evaluation function remains relatively fast to compute.
Schema
B. B. Misra

GAs: Why Do They Work?
• In this section we take an in-depth look at the working of the standard genetic algorithm, explaining why GAs constitute an effective search procedure.
• For simplicity we discuss the binary string representation of individuals.
Notation (schema)
• {0,1,#} is the symbol alphabet, where # is a special wild card symbol
• A schema is a template consisting of a string composed of these three symbols
• Example: the schema [01#1#] matches the strings [01010], [01011], [01110] and [01111]

Notation (order)
• The order of the schema S (denoted by o(S)) is the number of fixed positions (0 or 1) present in the schema
• Example: for S1 = [01#1#], o(S1) = 3; for S2 = [##1#1010], o(S2) = 5
• The order of a schema is useful to calculate the survival probability of the schema under mutation
• There are 2^(l-o(S)) different strings of length l that match a schema S
Notation (defining length)
• The defining length of schema S (denoted by δ(S)) is the distance between the first and last fixed positions in it
• Example: for S1 = [01#1#], δ(S1) = 4 - 1 = 3; for S2 = [##1#1010], δ(S2) = 8 - 3 = 5
• The defining length of a schema is useful to calculate the survival probability of the schema under crossover
Notation (cont.)
• m(S,t) is the number of individuals in the population belonging to a particular schema S at time t (in terms of generations)
• fS(t) is the average fitness value of strings belonging to schema S at time t
• f(t) is the average fitness value over all strings in the population
The effect of Selection
• Under fitness-proportionate selection the expected number of individuals belonging to schema S at time (t+1) is
  m(S, t+1) = m(S, t) (fS(t)/f(t))
• Assuming that a schema S remains above average by a factor c > 0 (i.e., fS(t) = f(t) + c f(t)), then
  m(S, t) = m(S, 0) (1 + c)^t
• Significance: an “above average” schema receives an exponentially increasing number of strings in the next generation
The effect of Crossover
• The probability of schema S (defined over strings of length l) to survive crossover is
  ps(S) ≥ 1 - pc (δ(S)/(l - 1))
• The combined effect of selection and crossover yields
  m(S, t+1) ≥ m(S, t) (fS(t)/f(t)) [1 - pc (δ(S)/(l - 1))]
• Above-average schemata with short defining lengths would still be sampled at exponentially increasing rates
The effect of Mutation
• The probability of S to survive mutation is
  ps(S) = (1 - pm)^o(S)
• Since pm << 1, this probability can be approximated by
  ps(S) ≈ 1 - pm·o(S)
• The combined effect of selection, crossover and mutation yields
  m(S, t+1) ≥ m(S, t) (fS(t)/f(t)) [1 - pc (δ(S)/(l - 1)) - pm·o(S)]
Example
f(x) = x(8-x); let schema S = [#10#]
Length of string, l = 4
Order of the schema, o(S) = 2
Defining length of the schema, δ(S) = 3 - 2 = 1
Let the probability of crossover Pc = 0.8 and the probability of mutation Pm = 0.01

  #   pop    f(x)   F(x) = f(x)+110   String belongs to schema?
  1   0101    15         125          Yes
  2   1100   -48          62          Yes
  3   0100    16         126          Yes
  4   0001     7         117          No
  5   1110   -84          26          No
  6   0111     7         117          No

Average fitness of strings belonging to schema S at time t: fS(t) = (125+62+126)/3 = 104.33
Average fitness over all strings in the population: f(t) = 95.5
Expected number of individuals belonging to schema S at time (t+1):
  m(S, t+1) = m(S, t) (fS(t)/f(t)) = 3 × 104.33/95.5 = 3.28
Assuming schema S remains above average by c > 0 (i.e., fS(t) = f(t) + c f(t)), then m(S, t) = m(S, 0)(1 + c)^t, with
  c = (fS(t) - f(t))/f(t) = (104.33 - 95.5)/95.5 = 0.0925
The combined effect of selection, crossover and mutation yields
  m(S, t+1) ≥ m(S, t) (fS(t)/f(t)) [1 - Pc (δ(S)/(l - 1)) - Pm·o(S)]
           = 3.28 × [1 - 0.8 × (1/(4-1)) - 0.01 × 2] = 3.28 × [1 - 0.8/3 - 0.02] = 2.34
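These figures can be reproduced with a short Python sketch (the scaled fitness F(x) = f(x) + 110 is used, as in the table):

pop = ["0101", "1100", "0100", "0001", "1110", "0111"]

def fitness(s):
    x = int(s, 2)
    return x * (8 - x) + 110          # F(x) = f(x) + 110

def matches(s, schema="#10#"):
    return all(c == "#" or c == b for c, b in zip(schema, s))

in_schema = [fitness(s) for s in pop if matches(s)]
f_s = sum(in_schema) / len(in_schema)                # 104.33
f_avg = sum(fitness(s) for s in pop) / len(pop)      # 95.5
m_next = len(in_schema) * f_s / f_avg                # 3.28
pc, pm, delta, order, l = 0.8, 0.01, 1, 2, 4
print(m_next * (1 - pc * delta / (l - 1) - pm * order))   # ~2.34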
Schema Theorem
• Short, low-order, above-average schemata
receive exponentially increasing trials in
subsequent generations of a genetic algorithm
• Result: GAs explore the search space by
short, low-order schemata which,
subsequently, are used for information
exchange during crossover
Building Block Hypothesis
• A genetic algorithm seeks near-optimal
performance through the juxtaposition of
short, low-order, high-performance schemata,
called the building blocks.
• The building block hypothesis has been
found to apply in many cases but it
depends on the representation and
genetic operators used.
Building Block Hypothesis (cont)
It is easy to construct examples for which the above
hypothesis does not hold:
S1 = [111#######] and S2 = [########11]
are above average, but their combination
S3 = [111#####11] is much less fit than S4 = [000#####00]
Assume further that the optimal string is S0 =
[1111111111]. A GA may have some difficulties in
converging to S0, since it may tend to converge to points
like [0001111100].
Some building blocks (short, low-order schemata) can
mislead GA and cause its convergence to suboptimal
points
Building Block Hypothesis (cont)
• Dealing with deception:
• Code the fitness function in an appropriate
way (assumes prior knowledge)
or
• Use a third genetic operator, inversion
Selection
B. B. Misra
Theory of Evolution
• Every organism has unique attributes that can be
transmitted to its offspring
• Offspring are unique and have attributes from each
parent
• Selective breeding can be used to manage changes
from one generation to the next
• Nature applies certain pressures that cause
individuals to evolve over time
Evolutionary Pressures
• Environment
– Creatures must work to survive by finding
resources like food and water
• Competition
– Creatures within the same species compete with
each other on similar tasks (e.g. finding a mate)
• Rivalry
– Different species affect each other by direct
confrontation (e.g. hunting) or indirectly by
fighting for the same resources
Natural Selection
• Creatures that are not good at completing tasks like
hunting or mating have fewer chances of having
offspring
• Creatures that are successful in completing basic
tasks are more likely to transmit their attributes to
the next generation since there will be more
creatures born that can survive and pass on these
attributes
• Purpose: to focus the search in promising regions of the space
• Inspiration: Darwin’s theory of “survival of the fittest”
• Trade-off between exploration and exploitation of the search space
• Too strong a fitness selection bias can lead to sub-optimal solutions
• Too little fitness selection bias results in an unfocused and meandering search
• Selection replicates the most successful solutions
found in a population at a rate proportional to their
relative quality
• In natural selection, only the fittest species can
survive, breed, and thereby pass their genes on to
the next generation.
• GAs use a similar approach, but unlike nature, the
size of the chromosome population remains
unchanged from one generation to the next.
Premature convergence
– Fitness bias too large
– Relatively super-fit individuals dominate the population
– The population converges to a local maximum
– Too much exploitation; too little exploration
Slow finishing
– Fitness bias too small
– No selection pressure
– After many generations, the average fitness has converged, but no global maximum is found; there is not sufficient difference between the best and the average fitness
– Too little exploitation; too much exploration
Different types of selection
1. Roulette Wheel selection
2. Tournament selection
3. Rank selection
Roulette Wheel selection
• A commonly used reproduction operator.
• Selects individuals with a probability proportional to their fitness.
• A wheel is segmented in proportion to the fitness values.

Roulette Wheel selection cntd.
• The wheel is spun, and when it stops, one segment remains close to the pointer.
• The individual representing that segment is selected into the mating pool.
• This process is repeated n times, to select n individuals into the mating pool.
• The possibility of selecting clones cannot be avoided.
Roulette Wheel selection cntd.
• The fitness-proportionate representation of string i in the Roulette wheel can be derived as
  Pri = Fi / Σ(j=1..n) Fj
  where n is the population size.

Figure: Roulette wheel segmented as 18%, 13%, 33%, 7%, 25% and 4% — a fitness-proportionate representation of 6 individuals.
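A Python sketch of one spin of such a wheel; note the built-in assumption that every fitness value is positive, which motivates the scaling discussed below:

import random

def roulette_select(pop, fitness):
    total = sum(fitness)                 # all values assumed > 0
    spin = random.uniform(0, total)      # where the pointer stops
    cumulative = 0.0
    for individual, f in zip(pop, fitness):
        cumulative += f
        if spin <= cumulative:
            return individual
    return pop[-1]                       # guard against floating-point rounding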
• Example: Consider maximization of f(x) = x(8-x), with fitness F(x) = f(x).
• Some of the F(x) values may be negative.
• The sum of the fitness Σ(j=1..n) Fj will not represent the fitness of all individuals truly, as +ve and -ve fitness values will cancel out.
• Again, -ve values cannot be represented in the wheel.

  Id     x   F(x)      Pr
  #1     5    15    -0.149
  #2     7     7    -0.069
  #3     3    15    -0.149
  #4     9    -9     0.089
  #5    12   -48     0.475
  #6    15  -105     1.039
  #7     6    12    -0.118
  #8     2    12    -0.118
  Total 59  -101     1
• To overcome this problem, fitness scaling may be done.
• Let us consider a 4-bit representation of the chromosomes. Then the maximum negative fitness value, -105, is obtained for x = 15.
• Can we take F(x) = f(x) + 105? The table below shows the result:

  Id     x   f(x)   F(x)     Pr
  #1     5    15    120    0.162
  #2     7     7    112    0.152
  #3     3    15    120    0.162
  #4     9    -9     96    0.13
  #5    12   -48     57    0.078
  #6    15  -105      0    0
  #7     6    12    117    0.158
  #8     2    12    117    0.158
  Total 59  -101    739    1

• No: Pr6 = 0 cannot be represented in the wheel.
What should be the value of the scaling factor added to the objective function to find the fitness of individuals?

  x     f(x)   F1 = f+110   F2 = f+500    Pr1             Pr2
  15    -105        5           395       0.0151          0.2089
  10     -20       90           480       0.2727          0.2539
   5      15      125           515       0.3787          0.2724
   0       0      110           500       0.3333          0.2645
  Standard deviation                      0.1625          0.0284
  Segment range                           1.5% to 37.8%   20.8% to 27.2%

• Diversity drives changes.
• With the larger scaling factor (F2), the standard deviation is low: this implies insignificant diversity, and selection will lack selective pressure.
• The segments assigned to the worst-fit and the best-fit individuals then do not vary substantially, hence selection does not favour better individuals appropriately.
Then let us consider the scaling factor to be 110, i.e. F(x) = f(x) + 110.

  Id   Pop     x   f(x)   F(x)     Pr      cPr     rand no.   Selected Id   New pop
  #1   0011    3    15    125    0.1935   0.1935     0.21         #2         1001
  #2   1001    9    -9    101    0.1563   0.3498     0.91         #6         0100
  #3   1000    8     0    110    0.1703   0.5201     0.14         #1         0011
  #4   0010    2    12    122    0.1889   0.709      0.2          #2         1001
  #5   1100   12   -48     62    0.096    0.805      0.85         #6         0100
  #6   0100    4    16    126    0.195    1          0.71         #5         1100
  Total            -14    646    1

1. The individuals #3 and #4, though better fit in comparison to #2 and #5, are not selected.
2. The possibility of not selecting the best candidate for the next generation cannot be ruled out.
3. Multiple copies of #2 and #6 are selected.
4. Not only the best-fit candidate, but also comparatively less-fit candidates may get the possibility of more crossovers and of producing more offspring than better-fit candidates.
Figure: Roulette wheel showing the fitness-proportionate segments of the six individuals: 0.1935 (19%), 0.1563 (16%), 0.1703 (17%), 0.1889 (19%), 0.096 (10%) and 0.195 (20%).
Consider a scenario for the problem as in the table; the representation in the Roulette wheel is as shown. What will be its impact on selection?

  Id   pop     x    f(x)   F(x)    Pr     cPr    rand no.   Selected Id
  #1   1110   14     -84     26   0.11    .11       .1          #1
  #2   1111   15    -105      5   0.02    .13       .2          #3
  #3   0110    6      12    122   0.54    .67       .4          #3
  #4   1111   15    -105      5   0.02    .69       .6          #3
  #5   1101   13     -65     45   0.2     .89       .7          #5
  #6   1110   14     -84     26   0.11    1.0       .9          #6

Here one uniform random number is drawn in each partition of width 1/PopSz (partitions 0.0-.17, .17-.34, .34-.50, .50-.67, .67-.84, .84-1.0), giving the numbers 0.1, 0.2, 0.4, 0.6, 0.7, 0.9.

Wheel segments: 11%, 2%, 54%, 2%, 20%, 11%. The chromosome #3, with 54% representation, will contribute copies of #3 amounting to about 54% of the size of the population.
The population presented to the next generation may then be as in the table.

  Id   Pop     x    f(x)   F(x)    Pr     Distinct pop (combined Pr)
  #1   0110    6      12    122   0.28
  #2   0110    6      12    122   0.28    0110 (0.84)
  #3   0110    6      12    122   0.28
  #4   1110   14     -84     26   0.05    1110 (0.05)
  #5   1101   13     -65     45   0.1     1101 (0.1)
  #6   1111   15    -105      5   0.01    1111 (0.01)

• #1, #2 and #3 are copies of one chromosome, which together occupy 84% of the wheel.
• In the next selection, 84%, i.e. 5 copies, of 0110 are expected in the population.
• Soon the population will consist 100% of chromosomes with the value 0110, which is not the optimal solution.
Rank selection
• If the fitness of one individual dominates the others, the Roulette wheel may lead to a local trap.
• In rank selection, individuals are ranked based on fitness, but wheel segments are not proportional to fitness.
• Based on rank, segments of the wheel are allocated to each individual; selection is made by rotating the Roulette wheel as before.

  Fitness   Rank priority   Rank × avg.   Pr (%)
  0.02           1            1 × 6.67       7
  0.05           2            2 × 6.67      13
  0.08           3            3 × 6.67      20
  0.1            4            4 × 6.67      27
  0.75           5            5 × 6.67      33
  Total rank = 15; wheel segment per rank = 100/15 = 6.67

Roulette wheel segments (fitness-proportionate): 2%, 5%, 8%, 10%, 75%.
Rank selection segments: 7%, 13%, 20%, 27%, 33%.
Rank selection cntd.
But when the fitness values are close enough, the rank selection method may be biased. In the example below, the relative fitness gap between the worst and the best individual is only 0.23, yet rank selection allocates about a 5.6 times larger segment to the best in comparison to the worst, which may again lead to improper favour to a candidate.

  Fitness   Rank priority   Rank × avg.   Pr (%)
  14.5           1            1 × 4.76       5
  15.5           2            2 × 4.76      10
  16             3            3 × 4.76      14
  17             4            4 × 4.76      19
  18             5            5 × 4.76      24
  19             6            6 × 4.76      28
  Total rank = 21; wheel segment per rank = 100/21 = 4.76

Roulette wheel segments (fitness-proportionate): 14.5%, 15.5%, 16%, 17%, 18%, 19%.
Rank selection segments: 5%, 10%, 14%, 19%, 24%, 28%.
Linear Rank Selection
In linear rank selection, individuals are assigned a subjective fitness based on their rank within the population:
  sfi = (P - ri)(max - min)/(P - 1) + min
where
  – ri is the rank of individual i (rank 1 = best),
  – P is the population size,
  – max is the fitness to assign to the best individual,
  – min is the fitness to assign to the worst individual.
Then pri = sfi / Σj sfj, and Roulette wheel selection can be performed using the subjective fitnesses.
One disadvantage associated with linear rank selection is that the population must be sorted on each cycle.
  Fitness   Rank    sf      Pr
  273        1      37    0.296
  85         2      31    0.248
  47         3      25    0.2
  23         4      19    0.152
  5          5      13    0.104
  Total            125    1

Pop. size P = 5; max = 37; min = 13.
Linear rank selection wheel: 30%, 25%, 20%, 15%, 10%.
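A Python sketch reproducing the table above from the linear-rank formula:

def subjective_fitness(r, P, f_max, f_min):
    # sf_i = (P - r_i)(max - min)/(P - 1) + min, with rank 1 = best
    return (P - r) * (f_max - f_min) / (P - 1) + f_min

sf = [subjective_fitness(r, P=5, f_max=37, f_min=13) for r in range(1, 6)]
pr = [s / sum(sf) for s in sf]
print(sf)   # [37.0, 31.0, 25.0, 19.0, 13.0]
print(pr)   # [0.296, 0.248, 0.2, 0.152, 0.104]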
Exponential Ranking
• Linear ranking is limited in the selection pressure it can provide.
• Exponential ranking can allocate more than 2 expected copies to the fittest individual.
• The constant factor c is normalised according to the population size.

  Fitness   Rank   Pexp    Pr
  5          1     0.21   0.14
  23         2     0.29   0.20
  47         3     0.32   0.21
  85         4     0.33   0.22
  273        5     0.33   0.23
  Total            1.48   1

Exponential ranking wheel (c = 3): 14%, 20%, 21%, 22%, 23%.
• Two important issues of the evolution process are population diversity and selective pressure (Whiteley, 1989).
• Population diversity: from the genes of the discovered individuals, promising new areas of the search space continue to be explored.
• Selective pressure: the degree to which the better individuals are favoured.
  – Higher selective pressure: better convergence.
  – Very high selective pressure: premature convergence to a locally optimal solution; the population diversity to be exploited is lost.
  – Low selective pressure: slow convergence.
• Disadvantages of proportionate representation
– Stagnation of search because it lacks selective
pressure.
– Premature convergence as search is narrowed
down quickly.
Tournament
• Binary tournament
– Two individuals are randomly chosen; the better fit of the two
is selected as a parent
• Probabilistic binary tournament
– Two individuals are randomly chosen; with a chance p,
0.5<p<1, the better fit of the two is selected as a parent
• Larger tournaments
– n individuals are randomly chosen; the fittest one is
selected as a parent
– By changing n and/or p, the GA can be adjusted
dynamically
Binary tournament

  id   fitness   Random id   Comparison of fitness   Selected id
  #1     23        #2,#5           19 > 17               #2
  #2     19        #4,#1           37 > 23               #4
  #3     48        #3,#2           48 > 19               #3
  #4     37        #5,#1           17 < 23               #1
  #5     17        #4,#3           37 < 48               #3
Probabilistic Binary tournament

Population:
  id   fitness
  #1     23
  #2     19
  #3     48
  #4     37
  #5     17

Selection process:
  Random No.   Random id   Comparison of fitness   Selected id
     .77         #4,#3           37 < 48               #3
     .17           -                -                   -
     .83         #3,#2           48 > 19               #3
     .64         #4,#1           37 > 23               #4
     .48           -                -                   -
     .37           -                -                   -
     .96         #5,#1           17 < 23               #1
     .57         #2,#5           19 > 17               #2
Steady-State GA
• Process
• Select q parents,
• Allow them to create q offspring, and
• Immediately replace the q worst individuals in the
population with the offspring
• Process is repeated until a stopping criterion is
reached
• Notice that on each cycle the steady-state GA will
make q function evaluations while a generational GA
will make P (where P is the population size) function
evaluations.
• Therefore, you must be careful to count only
function evaluations when comparing generational
GAs with steady-state GAs.
Consider the example: maximize f(x) = x(8-x)

Current population:
  pop    fitness
  1011    -33
  0010     12
  1100    -48
  0101     15
  0001      7

Randomly selected parents (crossover point = 2): 1011 and 0001
Offspring produced, with fitness: 1001 (-9) and 0011 (15)

Population for the next generation:
  pop    fitness
  0011     15
  0010     12
  1001     -9
  0101     15
  0001      7

On comparing the fitness of the offspring with the parents, we find that the parents with fitness -48 and -33 have the minimal fitness; hence they are destroyed and the two new offspring are included.
Elitism
• None of the selection techniques discussed so far guarantees that the best individual will not die.
• In elitism, a fraction of the population is guaranteed to survive to the next generation.
• A few of the best chromosomes of the current population are copied to the population of the next generation.
• The rest of the population for the next generation is selected using any selection technique discussed before.
• An elitism rate of 0.99 (given a population size of 100) means 99 individuals are guaranteed to survive, and an elitism rate of 0.01 means 1 individual is guaranteed to survive.
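A minimal Python sketch of elitism; binary tournament is used here to fill the non-elite slots, but any selection technique discussed above could be substituted:

import random

def next_generation(pop, fitness, elitism_rate=0.01):
    n_elite = max(1, round(elitism_rate * len(pop)))
    # indices sorted by fitness, best first
    order = sorted(range(len(pop)), key=lambda i: fitness[i], reverse=True)
    new_pop = [pop[i][:] for i in order[:n_elite]]   # guaranteed survivors
    while len(new_pop) < len(pop):                   # fill the rest by tournament
        a = random.randrange(len(pop))
        b = random.randrange(len(pop))
        new_pop.append(pop[a if fitness[a] > fitness[b] else b][:])
    return new_pop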
Generation Gap
• The fraction of the population that is replaced
in each cycle is called generation gap.
• A generation gap of 1.0 means that the whole
population is replaced by the offspring.
• A generation gap of 0.01 (given a population
size of 100) means one individual is replaced.
Mutation in GA
B. B. Misra
Generally reproduction is of two types:
1. Sexual reproduction, in which parts of the
parents are exchanged to produce two
children, is called crossover and produces
children that in general are different from
both parents but still contain large parts of
each.
2. Asexual reproduction, in which a parent
undergoes some form of transformation to
produce a child very much like itself, is called
mutation.
• The main task of mutation is to provide new
solutions that cannot be generated
otherwise.
• It introduces an element of random search,
termed as exploration, where the selection
and crossover processes focus attention on
promising regions of the search space
referred as exploitation.
• The occurrence of mutation operator is
determined by a user-settable parameter
known as the mutation probability.
• This probability is usually much lower than the
crossover probability to prevent too much
random search. Values of 0.001 to 0.05 are
common.
• In the case of binary strings, a mutation may be the flipping of a randomly chosen bit.

  Random bit selected for mutation: the 5th bit
  Before mutation   0 1 1 0 0 0 1 1
  After mutation    0 1 1 0 1 0 1 1
• In the case of integer or real-coded strings, it
may consist in replacing a number on the
string by a new random value within the
permissible range, or adding a random value
from some distribution to that number.
• In the real-coded strings, care must be taken
to map the new value back into the
permissible range.
• The schema theorem places the greatest
emphasis on the role of crossover and
hyperplane sampling in genetic search.
• To maximize the preservation of hyperplane
samples after selection the disruptive effects
of crossover and mutation should be
minimized.
• This suggests that mutation should perhaps
not be used at all or at least used at very low
levels.
• The motivation for using mutation is to prevent the permanent loss of any particular bit or allele.
• Crossover or selection operations cannot introduce absent or lost genetic material.

A typical population with missing genetic material:

  0 1 0 0 0 1 1
  1 0 0 1 0 1 0
  0 1 0 1 1 1 1
  0 0 0 1 0 1 0
  1 1 0 0 1 1 1
  0 1 0 0 0 1 0
  1 0 0 1 1 1 0
  0 0 0 1 1 1 1

In the 3rd gene position the genetic material 1 is absent (all individuals carry 0); in the 6th position the genetic material 0 is absent (all carry 1).
• After several generations it is possible that selection will drive all the bits in some position to a single value, either 0 or 1.
• If this happens without mutation, the genetic algorithm will converge to a suboptimal solution; this is called premature convergence.

Example: maximize f(x) = x(8-x). After a few generations a typical population may be:

  0 1 0 1
  0 1 0 1
  0 1 0 1
  0 1 0 1
  0 1 0 1
  0 1 0 1

Every individual encodes x = 5 (f = 15), while the optimum is x = 4 (f = 16).
• Premature convergence is particularly a problem if one is working with a small population.
• Without a mutation operator there is no possibility of reintroducing the missing bit value.
• If the target function is nonstationary and the fitness landscape changes over time, which is certainly the case in real biological systems, then there needs to be some source of continuing genetic diversity.
• Mutation therefore acts as a background operator, occasionally changing bit values and allowing alternative alleles and hyperplane partitions to be retested.
Figure: Visualization of two dimensions of an NK fitness landscape. The arrows represent various mutational paths that the population could follow while evolving on the fitness landscape.

The NK model is a mathematical model described by its primary inventor Stuart Kauffman as a "tunably rugged" fitness landscape. "Tunable ruggedness" captures the intuition that both the overall size of the landscape and the number of its local "hills and valleys" can be adjusted via changes to its two parameters, N and K, with N being the length of a string of evolution and K determining the level of landscape ruggedness.
• Mutation can have a significant impact on
convergence and change the number of fixed
points in the space.
• Mutation may introduce invalid values outside
the search region, special care may be required to
avoid this.
• In the search space metaphor, every point in the
space is a genotype.
• Evolutionary variation (such as mutation, sexual
recombination and genetic rearrangements)
identifies the legal moves in this space.
Mutation operators for real coded GA
• Let us suppose C = (c1, ..., ci, ..., cn) is a chromosome and ci ∈ [ai, bi] is a gene to be mutated.
• The gene ci′ resulting from the application of different mutation operators is presented here.
Random Mutation
• The random mutation technique was proposed by Michalewicz in 1992.
• The gene resulting after mutation, ci′, is a random (uniform) number from the domain [ai, bi].
Non Uniform Mutation
If this operator is applied in generation t, and gmax is the maximum number of generations, then

  ci′ = ci + Δ(t, bi - ci)   if τ = 0
  ci′ = ci - Δ(t, ci - ai)   if τ = 1

with τ being a random number which may have a value 0 or 1, and

  Δ(t, y) = y (1 - r^((1 - t/gmax)^b))

where r is a random number in the interval [0,1] and b is a parameter chosen by the user, which determines the degree of dependency on the number of iterations.
Random Mutation
• Here, the mutated solution is obtained from the original solution using the rule given below:
  Pr_mutated = Pr_original + (r - 0.5)Δ
  where r is a random number varying in the range (0.0, 1.0), and Δ is the maximum value of perturbation defined by the user.
Example:
• Let us assume the original parent solution Pr_original = 15.6.
• Determine the mutated solution Pr_mutated corresponding to r = 0.7 and Δ = 2.5.
• The mutated solution is calculated as follows:
  Pr_mutated = 15.6 + (0.7 - 0.5) × 2.5 = 16.1
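A Python sketch of this rule (r and Δ are passed in explicitly so the worked example is reproducible):

def random_mutation(pr_original, r, delta):
    # Pr_mutated = Pr_original + (r - 0.5) * delta
    return pr_original + (r - 0.5) * delta

print(random_mutation(15.6, 0.7, 2.5))   # 16.1, as in the example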
Polynomial Mutation
• Deb and Goyal proposed a mutation operator based on a polynomial distribution.
• The following steps are considered to obtain the mutated solution from an original solution:
  – Step 1: Generate a random number r lying between 0.0 and 1.0.
  – Step 2: Calculate the perturbation factor δ corresponding to r using the equation
      δ = (2r)^(1/(q+1)) - 1           if r < 0.5
      δ = 1 - (2(1 - r))^(1/(q+1))     if r ≥ 0.5
    where q is an exponent (a positive real number).
  – Step 3: The mutated solution is then determined from the original solution as follows:
      Pr_mutated = Pr_original + δ × δmax
    where δmax is the user-defined maximum perturbation allowed between the original and mutated solutions.
Example on Polynomial Mutation
• Let us assume the original parent solution Pr_original = 15.6.
• Determine the mutated solution Pr_mutated, considering r = 0.7, q = 2 and δmax = 1.2.
• Corresponding to r = 0.7 and q = 2, the perturbation factor δ is found to be:
  δ = 1 - (2(1 - r))^(1/(q+1)) = 1 - (0.6)^(1/3) = 0.1565
• The mutated solution is then determined from the original solution:
  Pr_mutated = Pr_original + δ × δmax = 15.6 + 0.1565 × 1.2 = 15.7878
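A Python sketch of the three steps (the random number r is passed in so the worked example can be reproduced):

def polynomial_mutation(pr_original, r, q, delta_max):
    if r < 0.5:
        delta = (2 * r) ** (1 / (q + 1)) - 1
    else:
        delta = 1 - (2 * (1 - r)) ** (1 / (q + 1))
    return pr_original + delta * delta_max

print(polynomial_mutation(15.6, 0.7, 2, 1.2))   # ~15.7878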
Crossovers in GA
B. B. Misra
• Crossover is the main genetic operator.
• Crossover is the genetic operator that mixes
two chromosomes together to form new
offspring.
• The intuition behind crossover is exploration
of new solutions by exploitation of the existing
solutions
Crossover allows the genetic algorithm to explore new areas in the search space and gives the GA the majority of its searching power.

Figure: the 4-bit search space drawn as a 4-D hypercube; the 16 vertices are the chromosomes 0000, 0001, ..., 1111, and edges connect chromosomes at Hamming distance 1.

What are the chromosomes (excepting itself) with which 0101 performs a crossover operation and generates only clones of the parents?
All the chromosomes with Hamming distance ≤ 1 from 0101,
e.g. 1101, 0001, 0111, 0100 (in the figure, all are one link distant from 0101).
• The new offspring comprise different segments from each parent and thereby inherit properties from both parents.
• GAs construct a better solution by mixing good characteristics of chromosomes together.
• Higher-fitness chromosomes have an opportunity to be selected more often than lower-fitness ones, so good solutions survive to the next generation.
• The occurrence of crossover is determined probabilistically by the crossover probability.
• When crossover is not applied, offspring are simply duplicates of the parents, thereby giving each individual a chance of passing on a pure copy of its genes into the gene pool.
Possibilities in crossover
• Two parents produce two offspring
• There is a chance that the chromosomes of
the two parents are copied unmodified as
offspring
• There is a chance that the chromosomes of
the two parents are randomly recombined
(crossover) to form offspring
• Generally the chance of crossover is between
0.6 and 1.0
1-Point Crossover
• Choose a random point across the chromosome length
• Split the parents at this crossover point
• Create children by exchanging tails

Crossover with Single Crossover Point (after the 4th bit):
  Father    0 0 0 0 | 0 0 0 0
  Mother    1 1 1 1 | 1 1 1 1
  Child 1   0 0 0 0 | 1 1 1 1
  Child 2   1 1 1 1 | 0 0 0 0
Multi-point Crossover
• Choose multiple random points across the chromosome length
• Split the parents at these crossover points
• Create children by exchanging bits between these crossover points
NB: ensure that the genes are copied to the same positions of the chromosomes only.

Crossover with Three Crossover Points (marked ↓):
            ↓       ↓       ↓
  Father    0 0 0 0 0 0 0 0
  Mother    1 1 1 1 1 1 1 1
  Child 1   0 1 1 0 0 0 1 1
  Child 2   1 0 0 1 1 1 0 0
Uniform Crossover
• It is the extreme case of multi-point crossover.
• Each bit/gene has probability 0.5 of being selected from either parent.
• The number of effective crossing points is not fixed, but averages to half of the string length.

Uniform Crossover working procedure
• Generally a mask vector of the size of the chromosome length is generated randomly, with a 0 or 1 value for each bit, for each crossover operation.
• To generate each bit of the 1st offspring, the respective bit of the mask is checked.
• If the bit value of the mask is 1, the respective gene of parent 1 is copied; otherwise that of parent 2.
• This process is reversed to generate the 2nd offspring.
Uniform Crossover example

  Parent1      0 0 0 0 0 0 0 0
  Parent2      1 1 1 1 1 1 1 1
  Mask         1 0 1 1 0 0 1 0
  Offspring1   0 1 0 0 1 1 0 1
  Offspring2   1 0 1 1 0 0 1 0
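A Python sketch of uniform crossover with a randomly generated mask:

import random

def uniform_crossover(p1, p2):
    mask = [random.randint(0, 1) for _ in p1]
    # mask bit 1: offspring 1 takes the gene from parent 1; bit 0: from parent 2
    c1 = [a if m else b for m, a, b in zip(mask, p1, p2)]
    # the process is reversed for offspring 2
    c2 = [b if m else a for m, a, b in zip(mask, p1, p2)]
    return c1, c2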
Assignment
• Apply different crossover operations and find all possible offspring that can be generated with the specified parents in the given 4-D search space.

Assignment
• Find the chromosomes that will produce clones after making crossover with the parent P = 01011.
Matrix crossover
• Each individual in the population is a 2-D matrix.
• Crossover points in both dimensions are randomly obtained.
• The segments between the crossover points are swapped.

Matrix crossover Example
Parent1 is an all-0 matrix and Parent2 an all-1 matrix (7 rows × 9 columns). Two row crossover points and two column crossover points are chosen; the segments between them are swapped, giving:

  Offspring1             Offspring2
  0 0 0 1 1 1 1 0 0      1 1 1 0 0 0 0 1 1
  0 0 0 1 1 1 1 0 0      1 1 1 0 0 0 0 1 1
  1 1 1 0 0 0 0 1 1      0 0 0 1 1 1 1 0 0
  1 1 1 0 0 0 0 1 1      0 0 0 1 1 1 1 0 0
  1 1 1 0 0 0 0 1 1      0 0 0 1 1 1 1 0 0
  0 0 0 1 1 1 1 0 0      1 1 1 0 0 0 0 1 1
  0 0 0 1 1 1 1 0 0      1 1 1 0 0 0 0 1 1
Crossover operators for Real code
Linear Crossover
• It was proposed by Wright in 1991.
• To explain its principle, let us consider two parents Pr1 and Pr2 participating in crossover.
• They produce three solutions as follows:
  – 0.5(Pr1 + Pr2),
  – (1.5Pr1 - 0.5Pr2), and
  – (-0.5Pr1 + 1.5Pr2).
• Out of these three solutions, the best two are selected as the children solutions.
Example:
• Let us assume that the parents are: Pr1 = 15.65, Pr2 = 18.83.
• Using the linear crossover operator, the three solutions are:
  – 0.5(15.65 + 18.83) = 17.24,
  – 1.5 × 15.65 - 0.5 × 18.83 = 14.06,
  – -0.5 × 15.65 + 1.5 × 18.83 = 20.42
Blend Crossover (BLX - α)
• This operator was developed by Eshelman and Schaffer in 1993.
• Let us consider two parents Pr1 and Pr2, such that Pr1 < Pr2.
• It creates children solutions lying in the range
  [{Pr1 - α(Pr2 - Pr1)}, {Pr2 + α(Pr2 - Pr1)}]
  where the constant α is to be selected so that the children solutions do not come out of the range.
• Another parameter γ is defined using α and a random number r varying in the range (0.0, 1.0):
  γ = (1 + 2α)r - α
• The children solutions (Ch1, Ch2) are determined from the parents as follows:
  Ch1 = (1 - γ)Pr1 + γPr2
  Ch2 = (1 - γ)Pr2 + γPr1
Example for Blend Crossover (BLX - α)
• Example: let us assume that the parents are Pr1 = 15.65 and Pr2 = 18.83.
• Assume α = 0.5 and r = 0.6.
• The parameter γ is calculated as
  γ = (1 + 2α)r - α = (1 + 2 × 0.5) × 0.6 - 0.5 = 0.7
• The children solutions are then determined as follows:
  Ch1 = (1 - γ)Pr1 + γPr2 = 17.876
  Ch2 = (1 - γ)Pr2 + γPr1 = 16.604
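A Python sketch of BLX-α (r is passed in so the worked example can be reproduced):

def blend_crossover(pr1, pr2, r, alpha=0.5):
    gamma = (1 + 2 * alpha) * r - alpha      # γ = (1 + 2α)r - α
    ch1 = (1 - gamma) * pr1 + gamma * pr2
    ch2 = (1 - gamma) * pr2 + gamma * pr1
    return ch1, ch2

print(blend_crossover(15.65, 18.83, r=0.6))   # (17.876, 16.604)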
Simulated Binary Crossover (SBX)
• It was proposed by Deb and Agrawal in 1995.
• Its search power is represented with the help of a probability distribution of the children solutions generated from the given parents.
• A spread factor α has been introduced to represent the spread of the children solutions with respect to that of the parents:
  α = |(Ch1 - Ch2) / (Pr1 - Pr2)|
  where Pr1, Pr2 represent the parent points and Ch1, Ch2 are the children solutions.
• Three different cases may occur:
  – Case 1: Contracting crossover (α < 1), i.e. the spread of the children solutions is less than that of the parents.
  – Case 2: Expanding crossover (α > 1), i.e. the spread of the children solutions is more than that of the parents.
  – Case 3: Stationary crossover (α = 1), i.e. the spread of the children solutions is exactly the same as that of the parents.
Simulated Binary Crossover (SBX) cntd.
• The probability distributions for creating children solutions from the parents have been assumed to be polynomial in nature, as in the figure.
• The probability distributions depend on the exponent q, which is a non-negative real number.

Fig.: Probability distributions for creating the children solutions from the parents vs. spread factor α.
Simulated Binary Crossover (SBX) cntd.
• For the contracting crossover, the probability distribution is given by:
  C(α) = 0.5(q + 1)α^q
• For the expanding crossover, it is expressed as:
  Ex(α) = 0.5(q + 1)α^-(q+2)
• For small values of q, the children are far away from the parents.
• For high values of q, the children are close to the parents.
• The figure on the previous page shows the variations of the probability distributions for different values of q (say 0, 2 and 10).
• The area under the probability distribution curve in the contracting crossover zone is ∫(0 to 1) C(α) dα = 0.5, and that in the expanding crossover zone is ∫(1 to ∞) Ex(α) dα = 0.5.
Simulated Binary Crossover (SBX) cntd.
• The following steps are used to create two children solutions Ch1 and Ch2 from the parents Pr1 and Pr2:
• Step 1: Create a random number r lying between 0.0 and 1.0.
• Step 2: Determine α′ for which the cumulative probability
  ∫(0 to α′) C(α) dα = r,          if r ≤ 0.5, and
  ∫(1 to α′) Ex(α) dα = r - 0.5,   if r > 0.5
• Step 3: Knowing the value of α′, the children solutions are determined as follows:
  Ch1 = 0.5[(Pr1 + Pr2) - α′(Pr2 - Pr1)]
  Ch2 = 0.5[(Pr1 + Pr2) + α′(Pr2 - Pr1)]
SBX Example
• Let the parents be: Pr1 = 15.65, Pr2 = 18.83.
• Determine the children solutions using SBX. Assume exponent q = 2 and generated random number r = 0.6.
• As r > 0.5, we calculate α′ such that
  ∫(1 to α′) Ex(α) dα = r - 0.5; substituting r = 0.6, we have
  ∫(1 to α′) 0.5(q + 1)α^-(q+2) dα = 0.1; substituting q = 2, we have
  ∫(1 to α′) 1.5 α^-4 dα = 0.1
  => 0.5(1 - α′^-3) = 0.1  =>  α′^-3 = 0.8  =>  α′ = (0.8)^(-1/3) = 1.0772
• Then the children generated are:
  Ch1 = 0.5[(Pr1 + Pr2) - α′(Pr2 - Pr1)] = 0.5[(15.65 + 18.83) - 1.0772(18.83 - 15.65)] = 15.5273
  Ch2 = 0.5[(Pr1 + Pr2) + α′(Pr2 - Pr1)] = 0.5[(15.65 + 18.83) + 1.0772(18.83 - 15.65)] = 18.9527
Crossover is a critical feature of genetic algorithms
• It greatly accelerates search early in the evolution of a population
• It leads to effective combination of schemata (sub-solutions on different chromosomes)
Crossover issue: disruption
• Crossover can be highly disruptive.
• High-performing schemata may not be preserved under crossover.
• A major difficulty is that if solutions must satisfy constraints, these may not be preserved under crossover.
• Arrange the representation and operators so that a valid solution is always achieved, or
• carry out a ‘repair’ operation on chromosomes so that validity is achieved.
• Note that in biology individuals must always be viable.
Fitness Evaluation
B B Misra
• The fitness of the chromosome drives the
process of GA, which relates the qualitative
‘goodness’ of a candidate solution in
quantitative terms.
• The fitness function encapsulates the
problem-specific knowledge.
• The chromosomes are decoded into their
actual representation, analyzed and given a
scalar fitness value to characterize how close
they are to the ideal solution.
• The cost/fitness/objective function determines the environment within which the solutions “live”.
• A fitness value reflects how good the chromosome is.
• An ideal fitness function should correlate closely with the goal of the problem and should be quickly computable.
• A fitness function quantifies the optimality of a solution (chromosome) so that that particular solution may be ranked against all the other solutions.
Purpose of fitness function
• Parent selection
• Measure for convergence
• For Steady state: Selection of individuals to die
• Should reflect the value of the chromosome in
some “real” way
• The most critical part of GA after encoding.
Fitness scaling
• Fitness values are scaled by subtraction and
division so that worst value is close to 0 and
the best value is close to a certain value,
typically 2
– Chance for the most fit individual is 2 times the
average
– Chance for the least fit individual is close to 0
Fitness scaling
• Problems when the original maximum is very
extreme (super-fit) or when the original
minimum is very extreme (super-unfit)
– Can be solved by defining a minimum and/or a
maximum value for the fitness
Perceptron
Ref.: S. Haykins, “Neural networks: a comprehensive foundation”. Pearson Education, India
Perceptron
• The LMS algorithm is built around a linear neuron, but the perceptron is built around a non-linear neuron (namely the McCulloch-Pitts model of a neuron).
• The perceptron consists of a linear combiner followed by a hard limiter (signum function).
• The hard limiter input / induced local field of the neuron is
  v = Σ(i=1..m) wi xi + b    -(0)
• Goal: to correctly classify the set of external stimuli x1, x2, ..., xm into one of the two classes 𝒸1 or 𝒸2.
• The neuron produces +1 as output if the input to the hard limiter is +ve; otherwise the neuron produces -1 as output.

Figure: Signal flow graph of the perceptron.
• Decision rule for classification: assign the point represented by x1, x2, ..., xm (input) to class 𝒸1 if the perceptron output y is +1, and to class 𝒸2 if it is -1.
• To understand the behaviour of a pattern classifier, we plot a map of the decision regions in the m-dimensional signal space spanned by the m input variables x1, x2, ..., xm.
• For the perceptron, the two decision regions are separated by a hyperplane defined by
  Σ(i=1..m) wi xi + b = 0    -(1)
• In the figure, a point (x1, x2) lying above the boundary line is assigned to class 𝒸1, and if the point is below it, it is assigned to class 𝒸2.

Figure: Illustration of the hyperplane as decision boundary for a 2-D, two-class pattern classification problem.
• Example: Considering the decision boundary of the given figure (a line through the points (3,0) and (0,5)), classify the points {(2,2), (1,2), (4,-1), (5,-3), (-1,7), (2,1)}.
We know the equation of a line is y = mx + c. Considering the given coordinates (3,0) and (0,5) on the decision boundary, we have
  0 = 3m + c and 5 = 0 + c, i.e. c = 5; then from the first equation m = -c/3 = -5/3.
The equation of the line is therefore y = -(5/3)x + 5.
In the neuron x2 is used instead of y, so the above eqn. can be rewritten as
  x2 = -(5/3)x1 + 5    -(a)
But the eqn. we use for the neuron is w1x1 + w2x2 + b = 0;
  => if w2 ≠ 0, x2 = -(w1/w2)x1 - b/w2    -(b)
Comparing (a) and (b), we get -(w1/w2) = -(5/3) and -b/w2 = 5 = 15/3
  => w1 = 5, w2 = 3, b = -15
The eqn. of the decision boundary obtained is therefore 5x1 + 3x2 - 15 = 0.
For the point (2,2): 5×2 + 3×2 - 15 = 1 > 0, so the point is above the decision boundary and (2,2) ∈ 𝒸1.
For the point (1,2): 5×1 + 3×2 - 15 = -4 < 0, so the point is below the decision boundary and (1,2) ∈ 𝒸2.
For the point (5,-3): 5×5 + 3×(-3) - 15 = 1 > 0, so the point is above the decision boundary and (5,-3) ∈ 𝒸1.
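A Python sketch checking all six points against the boundary derived above:

def classify(x1, x2, w1=5, w2=3, b=-15):
    # sign of w1*x1 + w2*x2 + b decides the class
    v = w1 * x1 + w2 * x2 + b
    return "class 1" if v > 0 else "class 2"

for point in [(2, 2), (1, 2), (4, -1), (5, -3), (-1, 7), (2, 1)]:
    print(point, classify(*point))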
• The bias b shifts the boundary away from the origin.
• The synaptic weights are adapted on an iteration-by-iteration basis.
• This adaptation uses an error-correction rule known as the perceptron convergence algorithm.
Perceptron convergence theorem
• Let us use the modified signal flow graph as in the figure.
• Let the (m+1)-by-1 input vector be X(n) = [+1, x1(n), ..., xm(n)]^T, where n denotes the iteration step,
• and the (m+1)-by-1 weight vector be W(n) = [w0(n), w1(n), ..., wm(n)]^T.
• The compact form of the linear combiner output is
  v(n) = Σ(i=0..m) wi(n) xi(n) = W^T(n) X(n)    -(2)
  where w0(n) = b(n).
• The perceptron functions properly when the classes 𝒸1 and 𝒸2 are linearly separable.
• In Figure (a), 𝒸1 and 𝒸2 are sufficiently separated to draw a hyperplane as the decision boundary.
• In Figure (b), 𝒸1 and 𝒸2 are too close to each other and become non-linearly separable (they cannot be classified by the perceptron).
• Suppose the input variables originate from two linearly separable classes.
• Let 𝒳1 be the subset of training vectors X1(1), X1(2), ... ∈ 𝒸1 and 𝒳2 be the subset of training vectors X2(1), X2(2), ... ∈ 𝒸2.
• The complete training set is the union of 𝒳1 and 𝒳2.
• Given the input vector 1 and 2 to train the classifier, the training
process involves adjustment of weight vector W such that
WTX > 0 for every input vector X 1
WTX ≤ 0 for every input vector X 2 (3)
Rules for adapting the synaptic weights of the perceptron
1. If x(n) is correctly classified by W(n), no correction is made to W, i.e.
W(n+1) = W(n), if W^T(n)x(n) > 0 and x(n) ∈ 𝒸1
W(n+1) = W(n), if W^T(n)x(n) ≤ 0 and x(n) ∈ 𝒸2   -(4)
2. Otherwise, update the weight:
W(n+1) = W(n) - η(n)x(n), if W^T(n)x(n) > 0 and x(n) ∈ 𝒸2
W(n+1) = W(n) + η(n)x(n), if W^T(n)x(n) ≤ 0 and x(n) ∈ 𝒸1   -(5)
where the learning rate parameter η(n) controls the adjustment applied to the weight vector at iteration n.

Fixed increment adaptation rule
• η(n) = η > 0, where η is a constant learning rate parameter independent of the iteration number n.
• η > 0 merely scales the pattern vectors without affecting their separability.
• Proof: let η = 1, W(0) = 0, and W^T(n)x(n) < 0 for n = 1, 2, … with x(n) ∈ 𝒸1,
i.e. the perceptron incorrectly classifies the vectors x(1), x(2), … (the first condition of eqn. (3) is violated).
• Then, as η = 1, with the 2nd condition of eqn. (5), we have
W(n+1) = W(n) + x(n), x(n) ∈ 𝒸1   -(6)
• Given the initial condition W(0) = 0, we iteratively solve for W(n+1):
W(1) = W(0) + x(1) = x(1), since W^T(0)x(1) ≤ 0 and x(1) ∈ 𝒸1
W(2) = W(1) + x(2) = x(1) + x(2), since W^T(1)x(2) ≤ 0 and x(2) ∈ 𝒸1
W(3) = W(2) + x(3) = x(1) + x(2) + x(3), since W^T(2)x(3) ≤ 0 and x(3) ∈ 𝒸1
…
W(n+1) = x(1) + x(2) + … + x(n)   -(7)
• Since the classes are linearly separable, there exists a solution W0 for which W0^T x > 0 for the vectors x(1), x(2), …, x(n) ∈ 𝒸1.
• For a fixed solution W0, let us define a positive number α as
$\alpha = \min_{x(n) \in \mathcal{X}_1} W_0^T x(n)$   -(8)
• Multiplying both sides of eqn. (7) by the row vector W0^T, we get
W0^T W(n+1) = W0^T x(1) + W0^T x(2) + … + W0^T x(n)
• Using eqn. (8), each term on the right side is at least α, so
W0^T W(n+1) ≥ nα   -(9)
(there are n terms on the right side of the above equation)
• Given two vectors W0 and W(n+1), the Cauchy-Schwarz inequality states that
$\|W_0\|^2 \,\|W(n+1)\|^2 \ge \left[W_0^T W(n+1)\right]^2$   -(10)
where $\|\cdot\|$ is the Euclidean norm and W0^T W(n+1) is a scalar.
• Substituting eqn. (9) in eqn. (10), we get
$\|W_0\|^2 \,\|W(n+1)\|^2 \ge \left[W_0^T W(n+1)\right]^2 \ge n^2\alpha^2$
$\Rightarrow \|W(n+1)\|^2 \ge \frac{n^2\alpha^2}{\|W_0\|^2}$   -(11)
• Eqn. (6) may be rewritten as
W(k+1) = W(k) + x(k), for k = 1, 2, …, n, and x(k) ∈ 𝒸1   -(12)
• Taking the squared Euclidean norm of both sides of eqn. (12):
$\|W(k+1)\|^2 = \|W(k)\|^2 + \|x(k)\|^2 + 2W^T(k)x(k)$   -(13)
• As the perceptron incorrectly classifies the vector x(k) ∈ 𝒸1, we have W^T(k)x(k) < 0; then from eqn. (13) we deduce
$\|W(k+1)\|^2 \le \|W(k)\|^2 + \|x(k)\|^2, \quad k = 1, \ldots, n$   -(14)
(Ex.: 10 = 9 + 5 + 2(-2)  =>  10 ≤ 9 + 5)
• Adding these n inequalities for k = 1, 2, …, n, with W(0) = 0, we have
$\|W(n+1)\|^2 \le \sum_{k=1}^{n} \|x(k)\|^2$
$\Rightarrow \|W(n+1)\|^2 \le \|x(1)\|^2 + \|x(2)\|^2 + \ldots + \|x(n)\|^2 \le \beta + \beta + \ldots + \beta$
$\Rightarrow \|W(n+1)\|^2 \le n\beta$   -(15)
where β is a positive number defined by
$\beta = \max_{x(k) \in \mathcal{X}_1} \|x(k)\|^2$   -(16)
• For large n, eqn. (15) conflicts with eqn. (11), $\|W(n+1)\|^2 \ge n^2\alpha^2 / \|W_0\|^2$.
• Hence n cannot be larger than n_max, where n_max satisfies
$\frac{n_{max}^2 \alpha^2}{\|W_0\|^2} = n_{max}\,\beta$
• Solving this equation for n_max, given a solution vector W0:
$n_{max} = \frac{\beta \,\|W_0\|^2}{\alpha^2}$   -(17)
• This proves that for η(n) = 1 for all n and W(0) = 0, and given that a solution vector W0 exists, the rule for adapting the synaptic weights of the perceptron must terminate after at most n_max iterations.
• From eqns. (8), (16), and (17), it is clear that there is no unique solution for W0 or n_max.
Fixed-increment convergence theorem
• The fixed-increment convergence theorem for the perceptron (Rosenblatt, 1962) can be stated as:
“Let the subsets of training vectors 𝒳1 and 𝒳2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that W(n0) = W(n0+1) = W(n0+2) = … is a solution vector, for n0 ≤ n_max.”
Absolute error correction procedure
• Absolute error correction procedure for the adaptation of a single-layer perceptron, for which η(n) is variable.
• Let η(n) be the smallest integer for which
η(n) x^T(n) x(n) > |W^T(n) x(n)|
• If W^T(n)x(n) at iteration n has an incorrect sign, then W^T(n+1)x(n) at iteration n+1 has the correct sign.
• If W^T(n)x(n) has an incorrect sign, modify the training sequence at iteration n+1 by setting x(n+1) = x(n), i.e. each pattern is presented repeatedly to the perceptron until it is correctly classified.
• Use of an initial value W(0) different from the null condition merely results in a decrease or increase of the number of iterations to convergence; the perceptron is assured to converge regardless of the value assigned to W(0).
Perceptron convergence algorithm
• x(n) = [+1, x1(n), x2(n), …, xm(n)]^T
• W(n) = [b(n), W1(n), W2(n), …, Wm(n)]^T
• b(n) = bias
• y(n) = actual response (quantized)
• d(n) = desired response = +1 if x(n) ∈ 𝒸1, and -1 if x(n) ∈ 𝒸2
• η = learning rate parameter, a positive constant less than 1.
Initialize W(0) = 0
repeat for each pattern n = 1, 2, …
    y(n) = sign[W^T(n)x(n)], where the signum function sign(v) = +1 if v ≥ 0, and -1 if v < 0
    W(n+1) = W(n) + η[d(n) - y(n)]x(n)
Implement logical OR function using the perceptron convergence algorithm
Training patterns (x0 = +1 is the bias input):
x0 x1 x2 | d
 1  1  1 |  1
 1  1 -1 |  1
 1 -1  1 |  1
 1 -1 -1 | -1
Let w = [0 0 0] and η = 1.
pattern | input     | w        | v = w*x' | y  | d  | y==d | w = w + η[d - y]x
1st epoch
1 | [1 1 1]   | [0 0 0]   |  0 |  1 |  1 | yes |
2 | [1 1 -1]  | [0 0 0]   |  0 |  1 |  1 | yes |
3 | [1 -1 1]  | [0 0 0]   |  0 |  1 |  1 | yes |
4 | [1 -1 -1] | [0 0 0]   |  0 |  1 | -1 | no  | [0 0 0] - 2[1 -1 -1] = [-2 2 2]
2nd epoch
1 | [1 1 1]   | [-2 2 2]  |  2 |  1 |  1 | yes |
2 | [1 1 -1]  | [-2 2 2]  | -2 | -1 |  1 | no  | [-2 2 2] + 2[1 1 -1] = [0 4 0]
3 | [1 -1 1]  | [0 4 0]   | -4 | -1 |  1 | no  | [0 4 0] + 2[1 -1 1] = [2 2 2]
4 | [1 -1 -1] | [2 2 2]   | -2 | -1 | -1 | yes |
3rd epoch
1 | [1 1 1]   | [2 2 2]   |  6 |  1 |  1 | yes |
2 | [1 1 -1]  | [2 2 2]   |  2 |  1 |  1 | yes |
3 | [1 -1 1]  | [2 2 2]   |  2 |  1 |  1 | yes |
4 | [1 -1 -1] | [2 2 2]   | -2 | -1 | -1 | yes |
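The epoch table can be reproduced with a short MATLAB sketch of this algorithm (ours; the tie-break sign(0) = +1 matches the table, and η = 1 as in the example):

% Perceptron convergence algorithm on the bipolar OR data (eta = 1).
% Reproduces the epoch table above; the weights settle at w = [2 2 2].
X = [1  1  1;           % each row: [x0 = +1 (bias input), x1, x2]
     1  1 -1;
     1 -1  1;
     1 -1 -1];
d = [1; 1; 1; -1];      % desired responses
w = [0 0 0];            % initial weights [b, w1, w2]
eta = 1;
for epoch = 1:10
    changed = false;
    for n = 1:4
        v = w*X(n,:)';                        % induced local field
        if v >= 0, y = 1; else y = -1; end    % signum, with sign(0) = +1
        if y ~= d(n)
            w = w + eta*(d(n) - y)*X(n,:);    % error-correction update
            changed = true;
        end
    end
    if ~changed, break; end   % stop after an epoch with no misclassification
end
disp(w)   % -> [2 2 2]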
Adaline (Adaptive Linear Neuron)
- Developed by Widrow and Hoff around 1960
- Input-output relation is linear
- Uses a bipolar activation function
- Only one output unit
- Trained with the delta rule, also called the Least Mean Square (LMS) rule or Widrow-Hoff rule
- Similar to the perceptron learning rule, but its origin is different:
- the perceptron learning rule originated from the Hebbian assumption, while the delta rule is derived from the gradient-descent concept.
- Perceptron learning stops after a finite number of steps, but training with the delta rule may continue indefinitely, converging asymptotically to the solution.
- The delta rule updates the weights proportionally to the error for each input pattern:
Δwij = α (dj - yinj) xi
where Δwij is the change in weight, α is the learning rate, dj is the desired output of the jth output neuron, yinj is the net input to the jth output neuron, and xi is the input to the ith input neuron.
1. Initialize the weights and bias. Set the learning rate α.
2. Repeat
3.   for each bipolar input-output pattern pair
4.     Calculate the input to the output unit:
       yin = b + ∑ni=1 xi wi
5.     Update the weight and bias:
       wi(new) = wi(old) + α (d - yin) xi
       b(new) = b(old) + α (d - yin)
6.   endfor
7. until the highest weight change in an epoch is less than a tolerance.
Implement the OR function using an Adaline network. Use bipolar input-output pattern pairs.
x1 x2 | d
 1  1 |  1
 1 -1 |  1
-1  1 |  1
-1 -1 | -1
Let w1 = w2 = b = 0.1, and α = 0.1.
Epoch 1:
1st input: x1 = 1, x2 = 1, d = 1
yin = b + x1w1 + x2w2 = 0.1 + 1×0.1 + 1×0.1 = 0.3
wi(new) = wi(old) + α(d - yin)xi
w1 = w1 + α(d - yin)x1 = 0.1 + 0.1×(1 - 0.3)×1 = 0.17
w2 = w2 + α(d - yin)x2 = 0.1 + 0.1×(1 - 0.3)×1 = 0.17
b = b + α(d - yin) = 0.1 + 0.1×(1 - 0.3) = 0.17
Error, E = (d - yin)² = (1 - 0.3)² = 0.49
2nd input: x1 = 1, x2 = -1, d = 1
yin = b + x1w1 + x2w2 = 0.17 + 1×0.17 - 1×0.17 = 0.17
w1 = w1 + α(d - yin)x1 = 0.17 + 0.1×(1 - 0.17)×1 = 0.253
w2 = w2 + α(d - yin)x2 = 0.17 - 0.1×(1 - 0.17)×1 = 0.087
b = b + α(d - yin) = 0.17 + 0.1×(1 - 0.17) = 0.253
Error, E = (d - yin)² = (1 - 0.17)² = 0.6889
Though the error increases for the 2nd input, it will decrease gradually after a few epochs.
Update the weights for the 3rd and 4th inputs in the same way.
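A minimal MATLAB sketch (ours) of this delta-rule training loop; it reproduces the hand-computed values of epoch 1 and then continues for further epochs:

% Adaline trained with the delta (LMS) rule on the bipolar OR patterns,
% starting from w1 = w2 = b = 0.1 and alpha = 0.1 as in the example above.
X = [ 1  1;  1 -1; -1  1; -1 -1];
d = [ 1;  1;  1; -1];
w = [0.1 0.1];  b = 0.1;  alpha = 0.1;
for epoch = 1:50
    for n = 1:4
        yin = b + X(n,:)*w';        % net input (linear output)
        e   = d(n) - yin;           % error for this pattern
        w   = w + alpha*e*X(n,:);   % delta-rule update
        b   = b + alpha*e;
    end
    if mod(epoch, 10) == 0
        fprintf('epoch %2d: w1=%.3f w2=%.3f b=%.3f\n', epoch, w(1), w(2), b);
    end
end
% The weights approach the least-squares solution w1 = w2 = b = 0.5 and then
% oscillate slightly around it (the error never reaches exactly zero).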


Madaline (Many Adalines)
- Many Adalines in parallel with a single output unit.
- The value of the output unit is based on a certain selection rule, e.g. majority voting, AND rule, etc.
- The figure below shows a simple Madaline architecture with n input units, m Adaline units, and one output Madaline unit.
- The Adaline units may be considered as the hidden layer.
- Hidden layers increase computational capability but complicate the training process.
[Figure: Madaline architecture - inputs X1, …, Xn feed Adaline units Z1, …, Zm through weights wij (biases b1, …, bm); the Adaline outputs feed the output Madaline unit Y through weights v1, …, vm (bias b0).]
- Applied effectively in communication systems for adaptive equalization, adaptive noise cancellation, etc.
Considerations:
i. Weights between the input and hidden layer are adjusted.
ii. Weights between the hidden and output layer are not adjusted; v1, v2, …, vm and b0 may each be assigned the value 0.5.
The activation function for the Adaline (hidden) and Madaline (output) units may be taken as the bipolar step function:
f(v) = +1 if v ≥ 0, and -1 if v < 0
1. Initialize the weights between the input and hidden layer. Set the weights between the hidden and output layer. Set the learning rate α.
2. Repeat
3.   for each training pattern
4.     for each hidden (Adaline) unit
5.       Calculate the net input: zinj = bj + ∑ni=1 xi wij, 1 ≤ j ≤ m
6.       Calculate the output of each hidden unit: zj = f(zinj)
7.     endfor
8.     Find the input to the output unit: yin = b0 + ∑mj=1 zj vj
9.     Calculate the output of the net: y = f(yin)
10.    Update weights:
11.    if d ≠ y
12.      bj(new) = bj(old) + α (d - zinj)
13.      wij(new) = wij(old) + α (d - zinj) xi
14.    endif
15.  endfor
16. until the stopping condition is satisfied (no or minimal weight change, or max. epochs)
Implement the XOR function using a Madaline network. Use bipolar input and output.
x1 x2 | d
 1  1 | -1
 1 -1 |  1
-1  1 |  1
-1 -1 | -1
Let the initial weights be
[w11, w21, b1] = [0.05, 0.2, 0.3]
[w12, w22, b2] = [0.1, 0.2, 0.15]
[v1, v2, b0] = [0.5, 0.5, 0.5], α = 0.5
Epoch 1:
1st input: x1 = 1, x2 = 1, d = -1
zin1 = b1 + x1w11 + x2w21 = 0.3 + 1×0.05 + 1×0.2 = 0.55
zin2 = b2 + x1w12 + x2w22 = 0.15 + 1×0.1 + 1×0.2 = 0.45
z1 = f(zin1) = 1
z2 = f(zin2) = 1
yin = b0 + z1v1 + z2v2 = 0.5 + 1×0.5 + 1×0.5 = 1.5
y = f(yin) = 1
As d ≠ y, update the weights:
b1 = b1 + α(d - zin1) = 0.3 + 0.5(-1 - 0.55) = -0.475
b2 = b2 + α(d - zin2) = 0.15 + 0.5(-1 - 0.45) = -0.575
w11 = w11 + α(d - zin1)x1 = 0.05 + 0.5(-1 - 0.55)×1 = -0.725
w21 = w21 + α(d - zin1)x2 = 0.2 + 0.5(-1 - 0.55)×1 = -0.575
w12 = w12 + α(d - zin2)x1 = 0.1 + 0.5(-1 - 0.45)×1 = -0.625
w22 = w22 + α(d - zin2)x2 = 0.2 + 0.5(-1 - 0.45)×1 = -0.525
Continue this process for the other inputs, and then for subsequent epochs till convergence.
Madaline with fixed weights implementing the XNOR gate:
w11 = 1, w12 = -1, θ1 = 0.5 (Adaline 1)
w21 = 1, w22 = -1, θ2 = -0.5 (Adaline 2)
v1 = 1, v2 = -1, θ3 = -0.5 (output unit)
Truth table of the XNOR gate:
x1 x2 | d
 0  0 |  1
 0  1 | -1
 1  0 | -1
 1  1 |  1
Activation function used: f(v) = +1 if v ≥ 0, and -1 if v < 0
Input | Output | Adaline 1: y1 = f(w11x1 + w12x2 + θ1) = f(x1 - x2 + 0.5) | Adaline 2: y2 = f(w21x1 + w22x2 + θ2) = f(x1 - x2 - 0.5) | Madaline: z = f(v1y1 + v2y2 + θ3) = f(y1 - y2 - 0.5)
0,0 |  1 | y1 = f(0 - 0 + 0.5) = f(0.5) = 1   | y2 = f(0 - 0 - 0.5) = f(-0.5) = -1 | z = f(1 + 1 - 0.5) = f(1.5) = 1
0,1 | -1 | y1 = f(0 - 1 + 0.5) = f(-0.5) = -1 | y2 = f(0 - 1 - 0.5) = f(-1.5) = -1 | z = f(-1 + 1 - 0.5) = f(-0.5) = -1
1,0 | -1 | y1 = f(1 - 0 + 0.5) = f(1.5) = 1   | y2 = f(1 - 0 - 0.5) = f(0.5) = 1   | z = f(1 - 1 - 0.5) = f(-0.5) = -1
1,1 |  1 | y1 = f(1 - 1 + 0.5) = f(0.5) = 1   | y2 = f(1 - 1 - 0.5) = f(-0.5) = -1 | z = f(1 + 1 - 0.5) = f(1.5) = 1
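The table can be verified with a few lines of MATLAB (ours):

% Verify the fixed-weight Madaline for XNOR shown in the table above.
f = @(v) 2*(v >= 0) - 1;          % bipolar step activation
for p = [0 0; 0 1; 1 0; 1 1]'     % iterate over the columns (input pairs)
    x1 = p(1); x2 = p(2);
    y1 = f(x1 - x2 + 0.5);        % Adaline 1: w11=1, w12=-1, theta1=+0.5
    y2 = f(x1 - x2 - 0.5);        % Adaline 2: w21=1, w22=-1, theta2=-0.5
    z  = f(y1 - y2 - 0.5);        % output unit: v1=1, v2=-1, theta3=-0.5
    fprintf('x = (%d,%d) -> z = %+d\n', x1, x2, z);
end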
Multilayer Perceptron (MLP)

Ref.: S. Haykin, “Neural Networks: A Comprehensive Foundation”. Pearson Education, India
Some preliminaries
• Fully connected: each neuron in any layer is connected to all the nodes in the previous layer.
• Signals flow in the forward direction, from left to right, on a layer-by-layer basis.
Fig: Architectural graph of a multilayer perceptron with two hidden layers
Preliminaries cntd.
• The network typically consists of input, hidden, and output layers.
• Such networks are commonly referred to as multilayer perceptrons.
• The popular learning algorithm is the error back-propagation algorithm (or backpropagation, or backprop), which is a generalization of the LMS rule.
– Forward pass: activate the network, layer by layer.
– Backward pass: the error signal back-propagates from output to hidden and hidden to input layers, based on which the weights are updated.
Types of signals
• A multilayer perceptron has two kinds of signals:
– Function signal / input signal: presented at the input end, it propagates forward (neuron by neuron) and emerges at the output end as the output signal.
– Error signal: it originates at an output neuron and propagates backward (layer by layer).
Fig. Illustration of the directions of two basic signal flows in a multilayer perceptron: forward propagation of function signals and back-propagation of error signals.
Notations
• The indices i, j, and k refer to different neurons in the n/w: neuron j lies in a layer to the right of neuron i, and neuron k in a layer to the right of neuron j.
• In iteration (time step) n, the nth training pattern is presented to the n/w.
• E(n) - instantaneous sum of error squares, or error energy, at iteration n.
• Eav - average of E(n) over the entire training set.
• ej(n) - error signal at output neuron j for iteration n.
• dj(n) - desired response for neuron j, used to compute ej(n).
• yj(n) - function signal appearing at the output of neuron j at iteration n.
• wji(n) - synaptic weight connecting the output of neuron i to the input of neuron j at iteration n.
• Δwji(n) - correction applied to wji(n) at iteration n.
Notations cntd.
• vj(n) – induced local field (i.e. weighted sum of all synaptic
weight with bias) of neuron j at iteration n (vj(n) is the signal
applied to the activation function).
• ϕj(.) – activation function describing the input-output functional
relationship of the nonlinearity associated with neuron j.
• bj – bias applied to neuron j.
• xi(n) – ith element of the input vector.
• ok(n) – kth element of the overall output vector
• η – learning rate parameter.
• ml – number of nodes in layer l of the multilayer perceptron. l
= 1, 2, …, L, where L is the depth of n/w.
• m0 – size of input layer.
• mL – size of output layer.
Basics of Back-Propagation Algorithm
• The error signal at output neuron j at iteration n:
ej(n) = dj(n) - yj(n)   (1)
• The instantaneous value of the error energy for neuron j is defined as $\frac{1}{2} e_j^2(n)$.
• The instantaneous value of the total error energy:
$E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)$   (2)
where C is the set of neurons in the output layer.
• With N the total number of patterns in the training set, the average squared error energy is
$E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)$   (3)
• Eav, a function of all free parameters (synaptic weights and biases), represents the cost function as a measure of learning performance.
• Objective: Adjust the free parameters to minimize Eav.
• For this minimization, an approach similar to LMS is used.
Basics of Back-Propagation Algorithm cntd.
• Weights are adjusted pattern-by-pattern.
• The induced local field:
$v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n)$   (4)
where m is the number of inputs applied to neuron j, and wj0 = bias bj corresponds to the fixed input y0 = +1.
• Function signal:
$y_j(n) = \varphi_j(v_j(n))$   (5)
Fig. Signal-flow graph highlighting the details of output neuron j.
Basics of Back-Propagation Algorithm cntd.
• Like the LMS algo., the back-propagation algo. applies a correction Δwji(n) to the synaptic weight wji(n) which is proportional to the partial derivative ∂E(n)/∂wji(n).
• According to the chain rule of calculus, this gradient can be expressed as
$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}$   (6)
• The partial derivative ∂E(n)/∂wji(n) represents a sensitivity factor, determining the direction of search in weight space for synaptic weight wji.
• Differentiating both sides of eq. (2) w.r.t. ej(n), we get
$\frac{\partial E(n)}{\partial e_j(n)} = e_j(n)$   (7)
• Differentiating both sides of eq. (1), ej(n) = dj(n) - yj(n), w.r.t. yj(n), we get
$\frac{\partial e_j(n)}{\partial y_j(n)} = -1$   (8)
Basics of Back-Propagation Algorithm cntd.
• Differentiating both sides of eq. (5), yj(n) = φj(vj(n)), w.r.t. vj(n), we have
$\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi'_j(v_j(n))$   (9)
where the prime in φ'(.) signifies differentiation w.r.t. the argument.
• Finally, differentiating eq. (4) w.r.t. wji(n), we have
$\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)$   (10)
• Using eqs. (7)-(10) in eq. (6), we have
$\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\, \varphi'_j(v_j(n))\, y_i(n)$   (11)
• Delta rule: The correction Δwji(n) applied to wji(n) is defined by the delta rule:
$\Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}(n)}$   (12)
where η is the learning rate parameter.
• The minus sign in eq. (12) is for gradient descent in the weight space.
Basics of Back-Propagation Algorithm cntd.
• Use of eq. (11) in eq. (12) yields
$\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)$   (13)
where the local gradient δj(n) is defined by
$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = e_j(n)\, \varphi'_j(v_j(n))$   (14)
• The local gradient points to the required changes in the synaptic weights.
Case 1 - The calculation in eq. (14) gives the weight adjustment for an output node.
Case 2 - There is no specific desired response for the neurons of a hidden layer.
The error signal for a hidden neuron is determined in terms of the error signals of all the neurons to which it is directly connected.
Basics of Back-Propagation Algorithm cntd.
• According to eq. (14), the local gradient δj(n) for a hidden neuron j can be redefined as
$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)}\, \varphi'_j(v_j(n))$   (15)
where neuron j is a hidden node.
Fig. Signal flow graph highlighting the details of output neuron k connected to hidden neuron j.
Basics of Back-Propagation Algorithm cntd.
• From the Fig., we see that
$E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n)$   (16)
where neuron k is an output node.
• Differentiating eq. (16) w.r.t. yj(n), we get
$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k \frac{\partial e_k(n)}{\partial y_j(n)}$   (17)
• Using the chain rule:
$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}$   (18)
• We know that
ek(n) = dk(n) - yk(n) = dk(n) - φk(vk(n))   (19)
• Hence,
$\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi'_k(v_k(n))$   (20)
• From the Fig. in the previous slide,
$v_k(n) = \sum_{j=0}^{m} w_{kj}(n)\, y_j(n)$   (21)
where m is the number of inputs (excluding the bias) applied to output neuron k, and wk0(n) = bk(n).
• Differentiating eq. (21) w.r.t. yj(n) yields
$\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n)$   (22)
• Using eqs. (20) and (22) in eq. (18), we get
$\frac{\partial E(n)}{\partial y_j(n)} = -\sum_k e_k\, \varphi'_k(v_k(n))\, w_{kj}(n) = -\sum_k \delta_k(n)\, w_{kj}(n)$   (23)
where δk(n) is the local gradient (ref. eq. (14)).
Basics of Back-Propagation Algorithm cntd.
• Substituting eq. (23) in eq. (15), we get the back-propagation formula for the local gradient as
$\delta_j(n) = \varphi'_j(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)$   (24)
where neuron j is in the hidden layer.
• The correction Δwji(n) applied to the synaptic weight connecting neuron i to neuron j is defined by the delta rule:
(weight correction Δwji(n)) = (learning rate parameter η) × (local gradient δj(n)) × (input signal of neuron j, yi(n))
• The local gradient δj(n) depends on the location of neuron j (output node or hidden node):
I. If neuron j is an output node: $\delta_j(n) = e_j(n)\, \varphi'_j(v_j(n))$   (14)
II. If neuron j is a hidden node: $\delta_j(n) = \varphi'_j(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)$   (24)
Fig. Signal-flow graph of a part of the adjoint system pertaining to back-propagation of error signals.
Activation functions for MLP
• To compute δ, φ(.) should be continuous and differentiable.
• Two forms of continuously differentiable non-linear activation functions used for MLP are sigmoidal functions.
I. Logistic function, defined by
$y_j = \varphi_j(v_j(n)) = \frac{1}{1 + \exp(-a v_j(n))}, \quad a > 0 \text{ and } -\infty < v_j(n) < \infty$   (25)
where vj is the induced local field of neuron j and 0 ≤ yj ≤ 1.
$\varphi'_j(v_j(n)) = \frac{a \exp(-a v_j(n))}{[1 + \exp(-a v_j(n))]^2}$   (26)
Let exp(-avj(n)) = z; then
$y_j = \varphi_j(v_j(n)) = \frac{1}{1+z}$, and
$\varphi'_j(v_j(n)) = \frac{az}{(1+z)^2} = a\,\frac{(z+1)-1}{(1+z)^2} = a\left[\frac{1}{1+z} - \frac{1}{(1+z)^2}\right] = a\left[y_j(n) - y_j^2(n)\right]$
$\varphi'_j(v_j(n)) = a\, y_j(n)\,[1 - y_j(n)]$   (27)
For the output layer, yj(n) = oj(n); then
$\delta_j(n) = e_j(n)\, \varphi'_j(v_j(n)) = a\,[d_j(n) - o_j(n)]\, o_j(n)\,[1 - o_j(n)]$   (28)
For the hidden layers,
$\delta_j(n) = \varphi'_j(v_j(n)) \sum_k \delta_k(n) w_{kj}(n) = a\, y_j(n)\,[1 - y_j(n)] \sum_k \delta_k(n)\, w_{kj}(n)$   (29)
Activation functions for MLP cntd.
II. Hyperbolic tangent function:
$y_j = \varphi_j(v_j(n)) = a \tanh(b v_j(n)), \quad (a, b) > 0$   (30)
where a and b are constants.
• Then, differentiating the hyperbolic tangent function, we get
$\varphi'_j(v_j(n)) = ab\,\mathrm{sech}^2(b v_j(n)) = ab\,[1 - \tanh^2(b v_j(n))]$
$= \frac{b}{a}\,[a - a\tanh(b v_j(n))]\,[a + a\tanh(b v_j(n))] = \frac{b}{a}\,[a - y_j(n)]\,[a + y_j(n)]$   (31)
• The equations for the output and hidden layers using the hyperbolic tangent fn. can be written as:
i. For the output layer:
$\delta_j(n) = e_j(n)\, \varphi'_j(v_j(n)) = \frac{b}{a}\,[d_j(n) - o_j(n)]\,[a - o_j(n)]\,[a + o_j(n)]$   (32)
ii. For the hidden layer:
$\delta_j(n) = \varphi'_j(v_j(n)) \sum_k \delta_k(n) w_{kj}(n) = \frac{b}{a}\,[a - y_j(n)]\,[a + y_j(n)] \sum_k \delta_k(n)\, w_{kj}(n)$   (33)
Back-Propagation Algorithm
[Figure: A 2-2-2 MLP example - inputs x1, x2 (layer-0 outputs y01, y02); hidden layer with fields v11, v12 and outputs y11, y12; output layer with fields v21, v22 and outputs y21 = o1, y22 = o2; biases via fixed inputs +1 with weights w110 = b11, w120 = b12, w210 = b21, w220 = b22; errors e1 = d1 - o1, e2 = d2 - o2.]
• One example of an MLP is shown in the Fig.
• Here the depth of the n/w is L = 2.
• Number of neurons in layer:
i. zero, m0 = 2
ii. one, m1 = 2
iii. two, m2 = 2
Back-Propagation Algorithm cntd.
Initialize the weights and biases uniformly at random from [-1, 1].
Set the slope parameter a, learning rate η, and momentum constant α.
For each epoch   // for simplicity the iteration number n is omitted, e.g. $v_j^l$ is written in place of $v_j^l(n)$
  for each example
    calculate the induced local fields
    $v_j^l = \sum_{i=0}^{m_{l-1}} w_{ji}^l\, y_i^{l-1}$
    where $v_j^l$ is the induced local field of neuron j in layer l,
    $y_i^{l-1}$ is the output of neuron i in layer l-1, and
    $w_{ji}^l$ is the weight of neuron j in layer l that is fed from neuron i in layer l-1;
    for i = 0, $y_0^{l-1} = +1$ and $w_{j0}^l = b_j^l$.
    For l > 0, find the output signal $y_j^l = \varphi_j(v_j^l)$; for the first (input) layer, $y_j^0 = x_j$, and for the last (output) layer, $y_j^L = o_j$.
    Compute the error signal $e_j = d_j - o_j$, where dj is the desired response.
    Calculate the local gradients:
    $\delta_j^L = e_j\, \varphi'(v_j^L)$ if output layer,
    $\delta_j^l = \varphi'(v_j^l) \sum_k \delta_k^{l+1} w_{kj}^{l+1}$ if hidden layer.
    Adjust the synaptic weights (using the generalized delta rule):
    $w_{ji}^l(\text{next}) = w_{ji}^l(\text{current}) + \alpha\, \Delta w_{ji}^l(\text{previous}) + \eta\, \delta_j^l\, y_i^{l-1}$
Example on back-propagation learning
• Let the n/w architecture be: L = 2, m0 = 2, m1 = 2, m2 = 1.
[Figure: the 2-2-1 network used in this example, with bias weights $w_{10}^1 = b_1^1$, $w_{20}^1 = b_2^1$, $w_{10}^2 = b_1^2$.]
• Initialization / parameter setting:
$w_{10}^1 = 0.5$, $w_{20}^1 = 0.6$, $w_{10}^2 = 0.4$
$w_{11}^1 = 0.4$, $w_{21}^1 = 0.3$, $w_{11}^2 = 0.7$
$w_{12}^1 = 0.2$, $w_{22}^1 = 0.9$, $w_{12}^2 = 0.8$
a = 0.2, η = 0.1, α = 0.5
input: x1 = 0.1, x2 = 0.9; desired response: d1 = 0.7
• Forward pass
$v_1^1 = w_{10}^1 + w_{11}^1 y_1^0 + w_{12}^1 y_2^0 = 0.5 + 0.4 \times 0.1 + 0.2 \times 0.9 = 0.72$
$v_2^1 = w_{20}^1 + w_{21}^1 y_1^0 + w_{22}^1 y_2^0 = 0.6 + 0.3 \times 0.1 + 0.9 \times 0.9 = 1.44$
$y_1^1 = \varphi(v_1^1) = 1/(1 + \exp(-0.2 \times 0.72)) = 0.5359$
$y_2^1 = \varphi(v_2^1) = 1/(1 + \exp(-0.2 \times 1.44)) = 0.5715$
$v_1^2 = w_{10}^2 + w_{11}^2 y_1^1 + w_{12}^2 y_2^1 = 0.4 + 0.7 \times 0.5359 + 0.8 \times 0.5715 = 1.2323$
$o_1 = y_1^2 = 1/(1 + \exp(-0.2 \times 1.2323)) = 0.5613$
$e_1 = d_1 - o_1 = 0.7 - 0.5613 = 0.1387$
• Backward pass
$\delta_1^2 = e_1 \varphi'(v_1^2) = e_1\, a\, o_1 (1 - o_1) = 0.1387 \times 0.2 \times 0.5613 \times [1 - 0.5613] = 0.0068$
$\delta_1^1 = a\, y_1^1 (1 - y_1^1)\, \delta_1^2 w_{11}^2 = 0.2 \times 0.5359 \times (1 - 0.5359) \times 0.0068 \times 0.7 = 0.0002367$
$\delta_2^1 = a\, y_2^1 (1 - y_2^1)\, \delta_1^2 w_{12}^2 = 0.2 \times 0.5715 \times (1 - 0.5715) \times 0.0068 \times 0.8 = 0.0002664$
Example on back-propagation learning cntd.
• Weights and parameters as before: η = 0.1, α = 0.5; inputs x1 = 0.1 = $y_1^0$, x2 = 0.9 = $y_2^0$.
• Values obtained: $y_1^1 = 0.5359$, $y_2^1 = 0.5715$, $o_1 = y_1^2 = 0.5613$, $\delta_1^1 = 0.0002367$, $\delta_2^1 = 0.0002664$, $\delta_1^2 = 0.0068$.
• Let's update the weights using
$w_{ji}^l(\text{next}) = w_{ji}^l(\text{current}) + \alpha\, \Delta w_{ji}^l(\text{previous}) + \eta\, \delta_j^l\, y_i^{l-1}$
$b_1^2 = w_{10}^2 = 0.4 + 0.1 \times 0.4 + 0.5 \times 0.0068 \times 1 = 0.4434$
$w_{11}^2 = 0.7 + 0.1 \times 0.7 + 0.5 \times 0.0068 \times 0.5359 = 0.7718$
$w_{12}^2 = 0.8 + 0.1 \times 0.8 + 0.5 \times 0.0068 \times 0.5715 = 0.8819$
$b_1^1 = w_{10}^1 = 0.5 + 0.1 \times 0.5 + 0.5 \times 0.0002367 \times 1 = 0.5501$
$w_{11}^1 = 0.4 + 0.1 \times 0.4 + 0.5 \times 0.0002367 \times 0.1 = 0.44$
$w_{12}^1 = 0.2 + 0.1 \times 0.2 + 0.5 \times 0.0002367 \times 0.9 = 0.2201$
$b_2^1 = w_{20}^1 = 0.6 + 0.1 \times 0.6 + 0.5 \times 0.0002664 \times 1 = 0.6601$
$w_{21}^1 = 0.3 + 0.1 \times 0.3 + 0.5 \times 0.0002664 \times 0.1 = 0.33$
$w_{22}^1 = 0.9 + 0.1 \times 0.9 + 0.5 \times 0.0002664 \times 0.9 = 0.9901$
Rate of Learning
• Back-propagation approximates a trajectory in weight space.
• When η is small, the change to the weights is small, the weight-space trajectory is smoother, and learning is slow.
• When η is too large, the weight changes are large and the network may become unstable (i.e. oscillatory).
• A momentum term is included to increase the learning rate without instability, modifying the delta rule as:
$\Delta w_{ji}(n) = \alpha\, \Delta w_{ji}(n-1) + \eta\, \delta_j(n)\, y_i(n)$   (34)
• Eq. (34) is also called the Generalized Delta Rule.
• α - momentum constant, a positive number that controls the feedback loop around Δwji(n), with 0 ≤ α < 1.
• Eq. (34) becomes eq. (13) when α = 0.
• Momentum may prevent the search from being trapped in a local minimum.
• η may be connection dependent: ηji may take different values for different j and i.
• Some of the weights may be kept fixed by taking ηji = 0 for the respective connections.
Fig. Signal flow graph illustrating the effect of the momentum constant α.
Sequential and Batch modes of training
1. Sequential mode of back-propagation learning
– Also referred to as on-line mode, pattern mode, or stochastic mode.
– The weights are updated after presenting each training example.
– It is
• popular,
• simple to implement, and
• provides effective solutions to large and difficult problems.
– The order of presentation of training examples is randomized after each epoch (to avoid the possibility of limit cycles).
2. Batch mode of back-propagation learning
– Weights are updated after the presentation of all training samples.
– Weights are updated once after each epoch.
Stopping criteria of back-propagation learning
• There are no well-defined criteria for stopping.
• Let W* denote the weight vector at a local or global minimum of the error surface.
I. A necessary condition for W* to be a minimum is that the gradient vector g(W) (i.e. the vector of first-order partial derivatives) of the error surface w.r.t. the weight vector W is zero at W = W*.
=> “The back-propagation algo. (BPA) is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.”
Drawbacks:
1. The learning time may be long.
2. It requires computation of the gradient vector g(W).
Stopping criteria of back-propagation learning cntd.
II. The cost function or error measure Eav(W) is stationary at the point W = W*.
=> “The BPA is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small.”
The rate of change of Eav(W) per epoch is typically taken to be from 0.0001 to 0.01.
Drawback: may lead to premature convergence.
III. After each epoch, the network is tested for its generalization performance, and training is stopped when the generalization performance is adequate.
Heuristics for making the BPA perform better
• It is often said that the design of an NN using the BPA is an art, as it involves one's own personal experience.
1. Sequential vs. batch update
Sequential mode is faster than batch mode.
Sequential mode is better if the dataset is large and highly redundant: highly redundant data pose computational problems for the estimation of the Jacobian required by the batch update.
Heuristics for making the BPA perform better cntd.
2. Maximizing information content
– Choose training examples based on their information content.
– Heuristics to search more of the weight space:
• use a pattern/sample that results in the largest error;
• use a sample that is different from all previous samples.
– For the sequential mode, shuffle the order of presentation of samples and ensure that no two consecutive samples belong to the same class.
– Emphasizing scheme: use difficult patterns (i.e. those that yield more error) for training.
– Problems with the scheme:
• The distribution of samples within an epoch is distorted.
• Outliers or mislabeled samples generate maximum error and are a threat to the emphasizing scheme.
Heuristics for making the BPA perform better cntd.
3. Activation function
– An MLP learns faster when the activation function is anti-symmetric than when it is non-symmetric.
– An activation function φ(v) is anti-symmetric (i.e. an odd function of its argument) if φ(-v) = -φ(v).
Anti-symmetric activation function (e.g. hyperbolic tangent fn.); non-symmetric activation function (e.g. logistic fn.)
Heuristics for making the BPA perform better cntd.
3. Activation function cntd.
– A popular example of an anti-symmetric activation fn. is the hyperbolic tangent function, defined by y = φ(v) = a tanh(bv), with φ'(v) = (b/a)[a - y][a + y], where a and b are constants.
– Suitable values for these constants are a = 1.7159 and b = 2/3 (LeCun, 1989, 1993).
– With these values, φ(1) = 1.7159 tanh(2/3) = 1 and φ(-1) = -1.
– At the origin, the slope (i.e. effective gain) of the activation fn. is close to unity, as shown in the fig. in the prev. slide:
with a = 1.7159 and b = 2/3,
φ'(0) = (b/a)[a - φ(0)][a + φ(0)] = (b/a)[a - 0][a + 0] = ab = (2/3) × 1.7159 = 1.14
– The second derivative of φ(v) attains its maximum at v = 1.
Heuristics for making the BPA perform better cntd.
4. Target values
• Target values should be chosen within the range of the sigmoid activation function.
• For the anti-symmetric activation fn. (tanh) with limiting values ±a:
i. for +a, we set dj = a - є
ii. for -a, we set dj = -a + є
where є is a positive constant.
Ex.: for a = 1.7159, setting є = 0.7159 gives a desired response dj of ±1.
• If the offset є is not used, the BPA drives the free parameters towards infinity and slows down the learning process by driving the hidden neurons into saturation.
Heuristics for making the BPA perform better cntd.
5. Normalizing the inputs
• Each input variable should have a mean value, averaged over the entire training set, that is close to zero or small compared to its standard deviation.
• If the input variables are all positive, the weights in the first hidden layer can only increase or decrease together; when these weights must change direction, the search zigzags on the error surface and slows the training process, which should be avoided.
• Input variables should be uncorrelated (e.g. Principal Component Analysis (PCA) may be used to de-correlate the variables).
• The de-correlated input variables should then be scaled so that their covariances are approximately equal, which ensures that the weights learn at roughly the same speed.
Fig. Illustrating the operation of mean removal, de-correlation, and covariance equalization for a two-dimensional input space.
Heuristics for making the BPA perform better cntd.
6. Initialization
• A good choice of initial weights and biases is of great help in a successful n/w design.
• High initial values may drive the neurons into saturation, where the local gradients take small values and the learning process slows down.
• With small initial values, the BPA operates on a flat area around the origin of the error surface (e.g. ref. the tanh fn.).
• The origin is a saddle point (a stationary point where the curvature of the error surface across the saddle is negative and the curvature along the saddle is positive).
• Hence, avoid initializing the weights to values that are either too small or too large.
• Let an MLP use the hyperbolic tangent fn. as its activation fn., and let the bias of each neuron be 0. Then
$v_j = \sum_{i=1}^{m} w_{ji}\, y_i$, where m is the number of synaptic connections to neuron j.
• Let the inputs have zero mean and unit variance, i.e.
$\mu_y = E[y_i] = 0, \; \forall i$   (35)
$\sigma_y^2 = E[(y_i - \mu_i)^2] = E[y_i^2] = 1, \; \forall i$ (using (35), i.e. μi = 0)   (36)
• Let the inputs be uncorrelated, i.e.
$E[y_i y_k] = 1$ for k = i, and $E[y_i y_k] = 0$ for k ≠ i   (37)
Heuristics for making the BPA perform better cntd.
6. Initialization cntd.
• Let the weights be drawn from a uniformly distributed set of numbers with zero mean,
$\mu_w = E[w_{ji}] = 0, \; \forall (j, i) \text{ pairs}$   (38)
Then the mean of the induced local field is
$\mu_v = E[v_j] = E\left[\sum_{i=1}^{m} w_{ji} y_i\right] = \sum_{i=1}^{m} E[w_{ji}]\, E[y_i] = 0$ (using (38), i.e. E[wji] = 0)   (39)
and its variance is
$\sigma_v^2 = E[(v_j - \mu_v)^2] = E[v_j^2] = E\left[\sum_{i=1}^{m}\sum_{k=1}^{m} w_{ji} w_{jk} y_i y_k\right] = \sum_{i=1}^{m}\sum_{k=1}^{m} E[w_{ji} w_{jk}]\, E[y_i y_k] = \sum_{i=1}^{m} E[w_{ji}^2] = m\,\sigma_w^2$ (using (37))   (40)
• For the hyperbolic tangent fn., taking a = 1.7159 and b = 2/3, we set σv = 1; then from (40):
$\sigma_v^2 = m\,\sigma_w^2 = 1 \;\Rightarrow\; \sigma_w = m^{-1/2}$   (41)
• Hence, for a uniform distribution, the weights should have zero mean and variance $\sigma_w^2 = 1/m$,
where m is the number of synaptic connections of a neuron.
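For example, to realize eq. (41) with a uniform distribution, the weights can be drawn from [-A, A] with A = sqrt(3/m), since a uniform variable on [-A, A] has variance A²/3 (a minimal sketch, ours):

% Draw initial weights for a neuron with m incoming connections from a
% uniform distribution with zero mean and variance 1/m (eq. 41).
m = 20;                      % fan-in (example value)
A = sqrt(3/m);               % uniform on [-A, A] has variance A^2/3 = 1/m
w = -A + 2*A*rand(m, 1);     % m weights, uniform on [-A, A]
fprintf('sample variance = %.4f (target %.4f)\n', var(w), 1/m);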


Heuristics for making the BPA perform better cntd.
7. Learning rates
• Ideally, all neurons should learn at the same rate.
• The local gradient is typically
– larger in the last layers and
– smaller in the first layers.
• Hence the value of η should be
– smaller for the last layers and
– larger for the first layers.
• Neurons with
– many inputs → small η
– few inputs → large η
• As suggested by LeCun (1993), the learning rate for layer l can be taken as $\eta_l = 1/m_{l-1}$, where $m_{l-1}$ is the number of synaptic connections from layer l-1 made to a neuron in layer l.
Self-Organizing Maps
(SOM)
Topics to be discussed
1. What is a Self Organizing Map?
2. Topographic Maps
3. Setting up a Self Organizing Map
4. Two Basic Feature-Mapping Models
5. Kohonen Networks
6. Components of Self Organization
7. Overview of the SOM Algorithm
What is a Self Organizing Map?
• So far we have looked at networks with supervised training
techniques, in which there is a target output for each input pattern,
and the network learns to produce the required outputs.
• We now turn to unsupervised training, in which the networks learn
to form their own classifications of the training data without external
help.
• We assume that class membership is broadly defined by the input
patterns sharing common features, and that the network will be able
to identify those features across the range of input patterns.
What is a Self Organizing Map cntd.
• One particularly interesting class of unsupervised system is based on
competitive learning, in which the output neurons compete amongst
themselves to be activated, with the result that only one is activated
at any one time.
• This activated neuron is called a winner-takes-all neuron or simply
the winning neuron.
• Such competition can be induced/implemented by having lateral
inhibition connections (negative feedback paths) between the
neurons.
• The result is that the neurons are forced to organize themselves and
the network is called a Self Organizing Map (SOM).
Topographic Maps
• Neurobiological studies indicate that different sensory inputs (motor,
visual, auditory, etc.) are mapped onto corresponding areas of the
cerebral cortex in an orderly fashion.
• This form of map, known as a topographic map, has two important
properties:
1. At each stage of representation, or processing, each piece of incoming
information is kept in its proper context/neighbourhood.
2. Neurons dealing with closely related pieces of information are kept
close together so that they can interact via short synaptic connections.
Topographic Maps cntd.
• Our interest is in building artificial topographic maps that learn
through self-organization in a neurobiologically inspired manner.
• We shall follow the principle of topographic map formation:
“The spatial location of an output neuron in a topographic map
corresponds to a particular domain or feature drawn from the input
space”
Setting up a Self Organizing Map
• The principal goal of an SOM is to transform an incoming signal
pattern of arbitrary dimension into a one or two dimensional discrete
map, and to perform this transformation adaptively in a topologically
ordered fashion.
• We therefore set up our SOM by placing neurons at the nodes of a
one or two dimensional lattice.
• Higher dimensional maps are also possible, but not so common.
• The output neurons become selectively tuned to various input
patterns (stimuli) or classes of input patterns during the course of the
competitive learning.
Setting up a Self Organizing Map cntd.
• The locations of the neurons so tuned (i.e. the winning neurons)
become ordered and a meaningful coordinate system for the input
features is created on the lattice.
• The SOM thus forms the required topographic map of the input
patterns.
• We can view this as a non-linear generalization of principal
component analysis (PCA).
Organization of the Mapping
• We have points x in the input space mapping to points I(x) in the output space.
• Each point I in the output space will map to a corresponding point w(I) in the input space.
Two Basic Feature-Mapping Models
1. Willshaw-von der Malsburg’s model
2. Kohonen model
• The Willshaw-von der Malsburg model (1976) was proposed on biological grounds to explain the problem of retinotopic mapping from the retina to the visual cortex (in higher vertebrates).
• There are two separate two-dimensional lattices of neurons connected together, one projecting onto the other.
• One lattice represents presynaptic (input) neurons, and the other lattice represents postsynaptic (output) neurons.
• The Willshaw-von der Malsburg model is specialized to mappings for which the input dimension is the same as the output dimension.
Kohonen Networks
• We shall concentrate on the particular kind of SOM known as a
Kohonen Network.
• This SOM has a feed-forward structure with a single computational
layer arranged in rows and columns.
• Each neuron is fully connected to all the source nodes in the input
layer:

• Clearly, a one dimensional map will just have a single row (or a single
column) in the computational layer.
Components of Self Organization
The self-organization process involves four major components:
• Initialization: All the connection weights are initialized with small
random values.
• Competition: For each input pattern, the neurons compute their
respective values of a discriminant function which provides the basis
for competition.
The particular neuron with the smallest value of the discriminant
function is declared the winner.
Components of Self Organization cntd.
• Cooperation: The winning neuron determines the spatial location of
a topological neighbourhood of excited neurons, thereby providing
the basis for cooperation among neighbouring neurons.
• Adaptation: The excited neurons decrease their individual values of
the discriminant function in relation to the input pattern through
suitable adjustment of the associated connection weights, such that
the response of the winning neuron to the subsequent application of
a similar input pattern is enhanced.
The Competitive Process
• If the input space is D dimensional (i.e. there are D input units), we can write the input patterns as x = {xi : i = 1, …, D}, and the connection weights between the input units i and the neurons j in the computation layer can be written as wj = {wji : j = 1, …, N; i = 1, …, D}, where N is the total number of neurons.
• We can then define our discriminant function to be the squared Euclidean distance between the input vector x and the weight vector wj of each neuron j:
$d_j(x) = \sum_{i=1}^{D} (x_i - w_{ji})^2$
The Competitive Process cntd.
• In other words, the neuron whose weight vector comes closest to the
input vector (i.e. is most similar to it) is declared the winner.
• In this way the continuous input space can be mapped to the discrete
output space of neurons by a simple process of competition between
the neurons.
The Cooperative Process
• In neurobiological studies we find that there is lateral interaction within a set of excited neurons.
• When one neuron fires, its closest neighbours tend to get excited more than those further away.
• There is a topological neighbourhood that decays with distance.
• We want to define a similar topological neighbourhood for the neurons in our SOM.
• If Sij is the lateral distance between neurons i and j on the grid of neurons, we take
$T_{j, I(x)} = \exp\left(-S_{j, I(x)}^2 \,/\, 2\sigma^2\right)$
as our topological neighbourhood, where I(x) is the index of the winning neuron.
The Cooperative Process cntd.
• This has several important properties: it is maximal at the winning neuron, it is symmetrical about that neuron, it decreases monotonically to zero as the distance goes to infinity, and it is translation invariant (i.e. independent of the location of the winning neuron).
• A special feature of the SOM is that the size σ of the neighbourhood needs to decrease with time.
• A popular time dependence is an exponential decay:
$\sigma(t) = \sigma_0 \exp(-t / \tau_\sigma)$
Figure. (a) A Kohonen self-organizing network with 2 input and 49 output units; (b) the size of a neighborhood around a winning unit decreases gradually with each iteration.
The Adaptive Process
• Clearly the SOM must involve some kind of adaptive, or learning, process by which the outputs become self-organised and the feature map between inputs and outputs is formed.
• The point of the topographic neighbourhood is that not only the winning neuron gets its weights updated; its neighbours have their weights updated as well, although by not as much as the winner itself.
• In practice, the appropriate weight update equation is
$\Delta w_{ji} = \eta(t)\, T_{j, I(x)}(t)\, (x_i - w_{ji})$
in which we have a time (epoch) t dependent learning rate $\eta(t) = \eta_0 \exp(-t / \tau_\eta)$, and the updates are applied for all the training patterns x over many epochs.
• The effect of each learning weight update is to move the weight vectors wj of the winning neuron and its neighbours towards the input vector x.
• Repeated presentations of the training data thus lead to topological ordering.
Ordering and Convergence
• Provided the parameters (σ0 , τσ , η0 , τη ) are selected properly, we
can start from an initial state of complete disorder, and the SOM
algorithm will gradually lead to an organized representation of
activation patterns drawn from the input space.
• However, it is possible to end up in a metastable state in which the
feature map has a topological defect.
Ordering and Convergence cont.
• There are two identifiable phases of this adaptive process:
1. Ordering or self-organizing phase – during which the topological
ordering of the weight vectors takes place.
Typically this will take as many as 1000 iterations of the SOM
algorithm, and careful consideration needs to be given to the choice of
neighbourhood and learning rate parameters.
2. Convergence phase – during which the feature map is fine tuned and
comes to provide an accurate statistical quantification of the input
space.
Typically the number of iterations in this phase will be at least 500
times the number of neurons in the network, and again the
parameters must be chosen carefully.
Visualizing the Self Organization Process
• Suppose we have four data points (crosses) in our continuous 2D
input space, and want to map this onto four points in a discrete 1D
output space.
• The output nodes (circles) map to points in the input space.
• Random initial weights start the circles at random positions in the centre of the input space.
Visualizing the Self Organization Process cntd.
• We randomly pick one of the data points for training (cross in circle).
• The closest output point represents the winning neuron (solid diamond).
• That winning neuron is moved towards the data point by a certain amount, and the two neighbouring neurons move by smaller amounts (arrows).
Visualizing the Self Organization Process cntd.
• Next we randomly pick another data point for training (cross in
circle).
• The closest output point gives the new winning neuron (solid
diamond).
• The winning neuron moves towards the data point by a certain
amount, and the one neighbouring neuron moves by a smaller
amount (arrows).
Visualizing the Self Organization Process cntd.
• We carry on randomly picking data points for training (cross in circle).
• Each winning neuron moves towards the data point by a certain
amount, and its neighbouring neuron(s) move by smaller amounts
(arrows).
• Eventually the whole output grid unravels itself to represent the
input space.
Overview of the SOM Algorithm
• We have a spatially continuous input space, in which our input
vectors live.
• The aim is to map from this to a low dimensional spatially discrete
output space, the topology of which is formed by arranging a set of
neurons in a grid.
• SOM provides such a non-linear transformation called a feature map.
Overview of the SOM Algorithm cntd.
The stages of the SOM algorithm can be summarized as follows:
1. Initialization – Choose random values for the initial weight vectors
wj .
2. Sampling – Draw a sample training input vector x from the input
space.
3. Matching – Find the winning neuron I(x) with weight vector closest
to input vector.
4. Updating – Apply the weight update equation $\Delta w_{ji} = \eta(t)\, T_{j, I(x)}(t)\, (x_i - w_{ji})$.
5. Continuation – keep returning to step 2 until the feature map stops changing.
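A minimal 1-D SOM sketch in MATLAB (ours; the data set, chain length, and decay constants are illustrative, and implicit array expansion from MATLAB R2016b+/Octave is assumed):

% Map 2-D points on a ring onto a chain of 10 neurons.
N = 10;                                   % neurons on a 1-D chain
W = 0.45 + 0.1*rand(N, 2);                % 1. small random initial weights
sigma0 = 3; tau_s = 300; eta0 = 0.1; tau_e = 1000;
theta = 2*pi*rand(500, 1);
X = 0.5 + 0.45*[cos(theta), sin(theta)];  % training data: points on a ring
for t = 1:1000
    x = X(randi(size(X,1)), :);                 % 2. sampling
    [~, win] = min(sum((W - x).^2, 2));         % 3. matching (squared distance)
    sigma = sigma0*exp(-t/tau_s);               % shrinking neighbourhood size
    eta   = eta0*exp(-t/tau_e);                 % decaying learning rate
    h = exp(-((1:N)' - win).^2/(2*sigma^2));    % T_{j,I(x)} along the chain
    W = W + eta*h.*(x - W);                     % 4. updating all neurons
end
disp(W)   % neighbouring rows end up close together: topological ordering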
Radial Basis Function Network (RBFN)

B. B. Misra
Ref.: Simon Haykin, Neural Networks and Learning Machines, 3rd Ed., Prentice Hall.
Radial basis function
• The (strict) interpolation problem states: Given a set of N different points $\{x_i \in \mathbb{R}^{m_0} \mid i = 1, 2, \ldots, N\}$ and a corresponding set of N real numbers $\{d_i \in \mathbb{R}^1 \mid i = 1, 2, \ldots, N\}$, find a function $F : \mathbb{R}^{m_0} \to \mathbb{R}^1$ that satisfies the interpolation condition:
F(xi) = di, i = 1, 2, …, N   (5.10)
• For strict interpolation, the interpolating surface (i.e., function F) is constrained to pass through all the training data points.
• The radial-basis-functions (RBF) technique consists of choosing a function F that has the form
$F(x) = \sum_{i=1}^{N} w_i\, \varphi(\|x - x_i\|)$   (5.11)
where $\{\varphi(\|x - x_i\|) \mid i = 1, 2, \ldots, N\}$ is a set of N arbitrary (generally nonlinear) functions, known as radial-basis functions, and ‖·‖ denotes the Euclidean norm.
• The known data points $x_i \in \mathbb{R}^{m_0}$, i = 1, 2, …, N, are taken to be the centers of the radial-basis functions.
• Inserting Eq. (5.10) into Eq. (5.11), we obtain a set of simultaneous linear equations for the unknown coefficients (weights) {wi} of the expansion:
$\begin{bmatrix} \varphi_{11} & \varphi_{12} & \cdots & \varphi_{1N} \\ \varphi_{21} & \varphi_{22} & \cdots & \varphi_{2N} \\ \vdots & \vdots & & \vdots \\ \varphi_{N1} & \varphi_{N2} & \cdots & \varphi_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix}$   (5.12)
where $\varphi_{ij} = \varphi(\|x_i - x_j\|)$, i, j = 1, 2, …, N   (5.13)
• Let
$d = [d_1, d_2, \ldots, d_N]^T$
$w = [w_1, w_2, \ldots, w_N]^T$
• The N-by-1 vectors d and w represent the desired response vector and the linear weight vector, respectively, where N is the size of the training sample.
• Let Φ denote the N-by-N interpolation matrix with elements φij:
$\Phi = \{\varphi_{ij}\}_{i,j=1}^{N}$   (5.14)
• Eq. (5.12) can then be written in the compact form
$\Phi w = d$   (5.15)
• Assuming Φ is nonsingular, and therefore Φ⁻¹ exists, we solve eq. (5.15) for the weight vector:
$w = \Phi^{-1} d$   (5.16)
• Is the interpolation matrix nonsingular?
• Yes, for a large class of radial-basis functions and under certain conditions.
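A small MATLAB sketch (ours) of strict interpolation with a Gaussian RBF on four 1-D points, assuming implicit array expansion:

% Strict interpolation (eqs. 5.12-5.16): the N centres coincide with the
% N data points, so Phi is N-by-N.
x = [0; 1; 2; 3];  d = [0; 1; 0; -1];     % toy training data (ours)
sigma = 1.0;
Phi = exp(-(x - x').^2/(2*sigma^2));      % Phi(i,j) = phi(||x_i - x_j||)
w = Phi \ d;                              % solves Phi*w = d (eq. 5.15)
F = @(xq) exp(-(xq - x').^2/(2*sigma^2))*w;   % interpolant F(x), eq. 5.11
fprintf('F at the centres: %s\n', mat2str(F(x)', 4));   % reproduces d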
Micchelli’s Theorem
• In Micchelli (1986), the following theorem is proved:
Let $\{x_i\}_{i=1}^{N}$ be a set of distinct points in $\mathbb{R}^{m_0}$. Then the N-by-N interpolation matrix Φ, whose ij-th element is $\varphi_{ij} = \varphi(\|x_i - x_j\|)$, is nonsingular.
• There is a large class of radial-basis functions that is covered by Micchelli’s theorem.
• It includes the following functions that are of particular interest in the study of RBF networks:
1. Multiquadrics:
$\varphi(r) = (r^2 + c^2)^{1/2}$ for some c > 0 and $r \in \mathbb{R}$   (5.17)
2. Inverse multiquadrics:
$\varphi(r) = \frac{1}{(r^2 + c^2)^{1/2}}$ for some c > 0 and $r \in \mathbb{R}$   (5.18)
3. Gaussian functions:
$\varphi(r) = \exp\left(-\frac{r^2}{2\sigma^2}\right)$ for some σ > 0 and $r \in \mathbb{R}$   (5.19)
• For the interpolation matrix built from these radial-basis functions to be nonsingular, the points $\{x_i\}_{i=1}^{N}$ must all be different (i.e., distinct).
• The inverse multiquadrics and the Gaussian functions are both localized functions, in the sense that $\varphi(r) \to 0$ as $r \to \infty$; in both of these cases the interpolation matrix Φ is positive definite. This is not so for the multiquadrics function.
Radial basis function network
Radial-basis-function (RBF) network structure
An RBF network has 3 layers:
1. Input layer, which consists of m0 source nodes, where m0 is the dimensionality of the input vector x.
2. Hidden layer, which consists of the same number of computation units as the size of the training sample, N; each unit is mathematically described by a radial-basis function
$\varphi_j(x) = \varphi(\|x - x_j\|), \quad j = 1, 2, \ldots, N$
The jth input data point xj defines the center of its radial-basis function, and the vector x is the signal (pattern) applied to the input layer.
Thus, unlike in a multilayer perceptron, the links connecting the source nodes to the hidden units are direct connections with no weights.
3. Output layer, which consists of a single computational unit.
There is no restriction on the size of the output layer, but it is smaller than that of the hidden layer.
• If the Gaussian function is taken as the radial-basis function, each unit in the hidden layer is defined by
$\varphi_j(x) = \varphi(\|x - x_j\|) = \exp\left(-\frac{1}{2\sigma_j^2}\|x - x_j\|^2\right), \quad j = 1, 2, \ldots, N$   (5.20)
where σj is a measure of the width of the jth Gaussian function with center xj.
• Usually all the Gaussian hidden units are assigned a common width σ.
• Hence, the parameter that distinguishes one hidden unit from another is the center xj.
Practical Modifications to the RBF Network
• Issue 1: In practice, the training sample is noisy. Unfortunately, the use of interpolation based on noisy data could lead to misleading results - hence the need for a different approach to the design of an RBF network.
• Issue 2: A hidden layer of the same size as the number of samples could be wasteful of computational resources, particularly when dealing with large training samples. Moreover, some training samples may be highly correlated, leading to redundant hidden nodes - so the size of the hidden layer is made a fraction of the size of the training sample.
• Note: Unlike the case for a multilayer perceptron, the training of an RBF network does not involve the back-propagation of error signals.
Radial basis function network (Practical model)
FIGURE 5.4 Structure of a practical RBF network (hidden layer size K < N, the size of the hidden layer in Fig. 5.3).
• The approximating function realized by both of these two RBF structures has the same mathematical form,
$F(x) = \sum_{j=1}^{K} w_j\, \varphi(\|x - x_j\|)$   (5.21)
where the dimensionality of the input vector x (and therefore that of the input layer) is m0 and each hidden unit is characterized by the radial-basis function $\varphi(\|x - x_j\|)$, j = 1, 2, …, K, with K being smaller than N.
• The output layer, assumed to consist of a single unit, is characterized by the weight vector w, whose dimensionality is also K.
Example 1
• Solve the XOR problem using an RBFN.
• Let there be two hidden neurons using the Gaussian basis function, with c1 = 0.1, c2 = -0.1 and σ1 = 0.2, σ2 = 0.4.
• Let the output neuron use a linear function, with the weights estimated using least-squares estimation (LSE).
[Figure: inputs x1, x2 feed hidden units h1(x), h2(x), whose outputs (and a bias input 1) are combined with weights w1, w2, w3 to give f(x).]
Training patterns (bipolar XOR):
x1 x2 | y
 1  1 | -1
 1 -1 |  1
-1  1 |  1
-1 -1 | -1
Training XOR using RBFN, with the weights obtained using LSE in MATLAB:

clear;
clc;
x = [ 1  1
      1 -1
     -1  1
     -1 -1];          % training data
y = [-1 1 1 -1];      % training target
c = [0.1, -0.1];      % centers of the two Gaussian hidden units
sigma = [0.2, 0.4];   % widths of the two Gaussian hidden units
for i = 1:4
    for j = 1:2
        % Gaussian basis: h(i,j) = exp(-||x_i - c_j||^2 / (2*sigma_j^2))
        h(i,j) = exp(-((x(i,1)-c(j))^2 + (x(i,2)-c(j))^2)/(2*sigma(j)^2));
    end
end
h = [h, ones(4,1)];       % append the bias column
w = inv(h'*h)*h'*y';      % least-squares estimate of [w1; w2; w3]

Weights obtained after training:
w1 = -1616888291.27575
w2 = -446.694275863286
w3 = 1.82756548365902
(The very large magnitudes arise because the chosen centers and widths make the columns of h nearly linearly dependent, so h'*h is close to singular.)
Adaptive Resonance Theory
(ART)
Motivation
• How can we create a machine that can act and navigate in a world
that is constantly changing?
• Such a machine would have to:
– Learn what objects are and where they are located.
– Determine how the environment has changed if need be.
– Handle unexpected events.
– Be able to learn unsupervised.
• Known as the “Stability – Plasticity Dilemma”: How can a system be
adaptive enough to handle significant events while stable enough to
handle irrelevant events?
Stability – Plasticity Dilemma
• More generally, the Stability – Plasticity Dilemma asks: How can a
system retain its previously learned knowledge while incorporating
new information?
• Real world example:
– Suppose you grew up in New York and moved to California for several
years later in life.
– Upon your return to New York, you find that familiar streets and
avenues have changed due to progress and construction.
– To arrive at your specific destination, you need to incorporate this new
information with your existing (if not outdated) knowledge of how to
navigate throughout New York.
– How would you do this?
Adaptive Resonance Theory
• Gail Carpenter and Stephen Grossberg (Boston University)
developed the Adaptive Resonance learning model to answer this
question.
• Essentially, ART (Adaptive Resonance Theory) models incorporate new data by checking for similarity between this new data and data already learned, i.e. “memory”.
• If there is a close enough match, the new data is learned.
• Otherwise, this new data is stored as a “new memory”.
• Some models of Adaptive Resonance Theory are:
– ART1 – Discrete input.
– ART2 – Continuous input.
– ARTMAP – Using two input vectors, transforms the unsupervised ART
model into a supervised one.
– Various others: Fuzzy ART, Fuzzy ARTMAP (FARTMAP), etc…
Competitive Learning Models
• ART models were developed out of Competitive Learning Models.
• Let: I = 1 … M index the input nodes, and J = 1 … N the LTM (long-term memory) category nodes.
• XI = normalized input for node I.
• ZIJ = weight from input node I to LTM category J.
Competitive Learning Models cntd.
• Competitive Learning models follow a “winner take all” approach in
that it searches for a LTM node that will determine how ZIJ is
modified.
Competitive Learning Models cntd.
• Once the appropriate LTM node has been chosen, the weight vector is updated based on the memory contained within the “winning” node.
• One such practice is to replace the existing weight vector with the difference between itself and the normalized input values. That is, let:
Zj = the weight vector for LTM node “J”
Xj = the vector representing the normalized input values for each input node.
• Then the new weight vector is simply Zj(new) = Zj(old) - Xj.
• Changing this weight vector constitutes the “learning process”.
Problems
• If the input pattern is deformed somehow, the weight vector will
become erroneously changed.
• Passing in the same input pattern consecutively can yield unstable
learning.
• If the weight vector is changed considerably enough, it could
discount previously learned information, and would have to be
retrained.
• That is, it does not solve the Stability – Plasticity Dilemma.
Adaptive Resonance Model
• The basic ART model, ART1, is comprised of the following
components:
1. The short term memory layer: F1 – Short term memory.
2. The recognition layer: F2 – Contains the long term memory of the
system.
3. Vigilance Parameter: ρ – A parameter that controls the generality of
the memory.
Larger ρ means more detailed memories, smaller ρ produces more
general memories.

• Training an ART1 model basically consists of four steps.


Adaptive Resonance Model cntd.
Step 1: Send input from the F1 layer to F2 layer for processing.
The first node within the F2 layer is chosen as the closest match to
the input and a hypothesis is formed.
This hypothesis represents what the node will look like after
learning has occurred, assuming it is the correct node to be
updated.
F1 (short term memory) contains a vector
of size M, and there are N nodes within
F2.
Each node within F2 is a vector of size M.
The set of nodes within F2 is referred to
as “y”.
Adaptive Resonance Model cntd.
• Step 2: Once the hypothesis has been formed, it is sent back to the F1 layer for matching.
• Let Tj(I*) represent the level of matching between I and I* for node j.
• Then:
$T_j(I^*) = \frac{|I \wedge I^*|}{|I|}$
where A ∧ B = min(A, B) component-wise and |·| sums the components.
• If Tj(I*) ≥ ρ, the hypothesis is accepted and assigned to that node.
• Otherwise, the process moves on to Step 3.
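A minimal sketch (ours) of this vigilance test in MATLAB:

% ART1-style vigilance test: T = |I ^ Istar| / |I|, accept if T >= rho.
I     = [1 0 1 1 0];        % binary input
Istar = [1 0 1 0 0];        % hypothesis from the chosen F2 node
T = sum(min(I, Istar)) / sum(I);   % component-wise min, then L1 norms
rho = 0.6;
if T >= rho
    disp('hypothesis accepted: learn on this node');
else
    disp('reset: try the next F2 node');
end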
Adaptive Resonance Model cntd.
• Step 3: If the hypothesis is rejected, a “reset” command is sent back
to the F2 layer.
• In this situation, the j th node within F2 is no longer a candidate so
the process repeats for node j+1.
Adaptive Resonance Model cntd.
Step 4:
1. If the hypothesis was accepted, the winning
node assigns its values to it.
2. If none of the nodes accepted the
hypothesis, a new node is created within
F2.
As a result, the system forms a new memory.
• In either case, the vigilance parameter
ensures that the new information does not
cause older knowledge to be forgotten.
ART Extensions
• As mentioned previously, Adaptive Resonance Theory can be categorized into the following:
1. ART1 – Default ART architecture. Can handle discrete (binary)
input.
2. ART2 – An extension of ART1. Can handle continuous input.
3. Fuzzy ART – Introduces fuzzy logic when forming the hypothesis.
4. ARTMAP – An ART network where one ART module attempts to
learn based on another ART module. In a sense, this is supervised
learning.
5. FARTMAP – An ARTMAP architecture with Fuzzy logic included.
Integration of neural
network, fuzzy logic
and genetic algorithm
B. B. Misra
Artificial Neural Network
• Simplified models of the human nervous system
• Mimic the human ability to adapt to circumstances and to learn from
past experience
Fuzzy logic systems
• Address the imprecision or vagueness in input-output descriptions.
• A fuzzy set has no crisp boundary; it provides a smooth transition
between membership and non-membership of elements.
Genetic algorithm
• Inspired by biological evolution
• Adaptive search
• Optimization technique
Each technology provides efficient solutions to a wide
range of problems belonging to different domains.
These technologies synergize, in whole or in part, to
solve problems for which the individual technologies
could not find a solution on their own.
The objective of the synergy or hybridization is to
overcome the weakness of one technology with the
strength of another.
Hybridization has its pitfalls and should be done with
care.
If one technology already solves a problem, then
hybridization may be done
 if a better solution is found, or
 if it provides a better method to get the solution, or
 at worst, to find an alternate solution.
The purpose of hybridization is to investigate better
methods of solving problems.
Inappropriate hybridization may backfire.
When an individual technology gives good results,
expecting a better result from hybridization may not
always be correct.
A poor hybrid may exhibit the weaknesses of the
participating technologies rather than their strengths.
Hybrid systems
• Hybrid systems are those in which more than one
technology is employed to solve the problem.
• Three broad classes:
1. Sequential hybrids
2. Auxiliary hybrids
3. Embedded hybrids
Sequential hybrid systems
[Diagram: Input → Tech. A → Tech. B → Output]
• Makes use of the technologies in a pipeline fashion.
• The output of one technology is the input of the other.
• The two technologies work independently and may not compensate
for the needs of one another.
• It is the weakest form of hybridization.
• Ex.: A GA selects attributes; these attributes are then used by an
ANN to solve a classification problem (a sketch of this pipeline
follows).
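The pipeline shape can be sketched as below; ga_select_attributes and
train_ann_classifier are hypothetical placeholders standing in for a
real GA and a real ANN:

import numpy as np

def ga_select_attributes(X, y):
    # Placeholder for Tech. A: pretend the GA kept the first two attributes.
    return np.array([True, True, False])

def train_ann_classifier(X, y):
    # Placeholder for Tech. B: a real system would train an ANN here.
    return ("ann-model", X.shape)

def sequential_hybrid(X, y):
    mask = ga_select_attributes(X, y)            # Tech. A runs first, alone
    return train_ann_classifier(X[:, mask], y)   # Tech. B consumes A's output

The key property is that information flows one way: Tech. B never
feeds anything back to Tech. A.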
Auxiliary hybrid systems
[Diagram: Input → Tech. A → Output, where Tech. A calls Tech. B as a
subroutine]
• One technology calls the other as a subroutine to process or
manipulate information needed by it.
• The second technology processes the information provided by the
first and hands it back for further use.
• Better than the sequential system, but not the best.
• Ex.: A neuro-genetic system in which the ANN employs a GA to
optimize its structural parameters.
Embedded hybrid systems
[Diagram: Input → Tech. A with Tech. B embedded inside → Output]
• The technologies are integrated in such a way that they appear
intertwined.
• It is a complicated fusion of technologies, giving the feeling that
one technology cannot work without the other to solve the problem.
• Considered the best among the three hybrid classes.
• Ex.: ANN-FL hybrid systems, where the ANN receives fuzzy inputs,
processes them, and extracts fuzzy outputs (a toy sketch follows).
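As a toy sketch of the embedded idea (entirely illustrative; the
triangular membership functions and the weights below are made up,
not taken from the slides), an ANN that consumes fuzzy membership
degrees rather than raw inputs might look like:

import numpy as np

def triangular(x, a, b, c):
    # Triangular membership function (an assumed fuzzifier).
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def fuzzy_ann(x):
    """The fuzzifier is embedded in the network: inputs are first
    converted to membership degrees, and the output is itself a
    fuzzy degree in [0, 1]."""
    mu = np.array([triangular(x, 0.0, 0.25, 0.5),
                   triangular(x, 0.25, 0.5, 0.75),
                   triangular(x, 0.5, 0.75, 1.0)])   # fuzzified input
    w = np.array([0.2, 0.5, 0.9])                    # illustrative weights
    return 1.0 / (1.0 + np.exp(-(mu @ w)))           # fuzzy output degree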
Examples of Neuro-genetic System Hybridization
B. B. Misra
• Let us solve one classification problem.
• For hand calculation and simplicity of understanding, let us take a
data set with a few attributes.
• Let us finalize one neural network architecture to solve it.
• Then we will use GA for hybridization.
Dataset used
• Haberman's Survival Data Set (UCI Machine Learning Repository)
• Data Set Information:
– The dataset contains cases from a study that was conducted between
1958 and 1970 at the University of Chicago's Billings Hospital on the
survival of patients who had undergone surgery for breast cancer.
• Attribute Information:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
1 = the patient survived 5 years or longer
2 = the patient died within 5 years
Data Set Characteristics: Multivariate; Number of Instances: 306
Attribute Characteristics: Integer; Number of Attributes: 3
Associated Tasks: Classification; Missing Values: None
• Ref.: Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International
Biometrics Conference, Boston, pp. 104-122.
• For implementation in the neural network:
– Attributes normalized between [0, 1]
– Class labels modified as 1 -> 0 and 2 -> 1
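A minimal preprocessing sketch, assuming min-max scaling (the slides
only state that attributes were normalized to [0, 1], without naming
the method):

import numpy as np

def preprocess(X, y):
    """Min-max scale each attribute column into [0, 1] (assumed
    method) and relabel classes 1 -> 0 and 2 -> 1."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    X01 = (X - lo) / (hi - lo)
    t = (np.asarray(y) == 2).astype(int)
    return X01, t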
• Let’s consider the following 4 random instances from the 306
instances of Haberman's Survival Data Set for hand calculation and
understanding of the method.

            Attributes              Class
            at1     at2     at3     label
Sample 1    0.23    0.45    0.02    0
Sample 2    0.43    0.64    0.02    1
Sample 3    0.23    0.09    0.04    0
Sample 4    0.57    0.09    0.33    1
[Figure: A 3-2-1 feed-forward network. Inputs x1, x2, x3 connect to
hidden neurons z1 (weights w2, w3, w4, bias w1) and z2 (weights w6,
w7, w8, bias w5); z1 and z2 connect to the output neuron y (weights
w10, w11, bias w9). The logistic activation function is used in the
neurons z1, z2 and y.]

Samples from Haberman's Survival Data Set
            Attributes              Class
            at1     at2     at3     label (t)
Sample 1    0.23    0.45    0.02    0
Sample 2    0.43    0.64    0.02    1
Sample 3    0.23    0.09    0.04    0
Sample 4    0.57    0.09    0.33    1

Forward pass calculations for the BPN:
Z1in = w1 + x1*w2 + x2*w3 + x3*w4,  Z1o = 1/(1 + exp(-Z1in))
Z2in = w5 + x1*w6 + x2*w7 + x3*w8,  Z2o = 1/(1 + exp(-Z2in))
yin = w9 + Z1o*w10 + Z2o*w11,       yo = 1/(1 + exp(-yin))
err = abs(t - yo)

Example of one population using a real-coded GA (each individual is
one set of the 11 weights):
      w1   w2   w3   w4   w5   w6   w7   w8   w9   w10  w11
Id1   .27  .61  .17  .91  .82  .45  .37  .63  .72  .47  .29
Id2   .39  .49  .57  .72  .13  .68  .51  .25  .91  .69  .17
...
Idn   .87  .11  .63  .16  .56  .28  .71  .41  .26  .91  .85

Sample forward pass calculation for Individual 1 (Id1) over all 4
samples:
Sample   Z1in   Z2in   Z1o    Z2o    yin    yo     err
1        0.51   1.10   0.62   0.75   1.23   0.77   0.77
2        0.66   1.26   0.66   0.78   1.26   0.78   0.22
3        0.46   0.98   0.61   0.73   1.22   0.77   0.77
4        0.93   1.32   0.72   0.79   1.29   0.78   0.22
                                     Total error:  1.98

• The GA sends the population (sets of weights) to the ANN.
• The ANN runs only the forward pass (no backward pass of the MLP),
calculates the error, and returns the fitness to the GA:
fitness = 1/(1 + err^2).
• The GA makes the weight adjustments using crossover and mutation,
and stops when the fitness no longer increases or the maximum number
of generations is reached.
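To check the hand calculation above, here is a minimal sketch of the
forward pass and fitness evaluation (the function and variable names
are my own; only the equations and the numbers come from the slides):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_errors(w, X, t):
    """Forward pass of the 3-2-1 network for one GA individual w.

    w : array of the 11 weights w1..w11; w[0], w[4] and w[8] act as
        the biases of z1, z2 and y respectively.
    Returns the per-sample absolute errors |t - yo|.
    """
    z1o = sigmoid(w[0] + X @ w[1:4])           # Z1in -> Z1o
    z2o = sigmoid(w[4] + X @ w[5:8])           # Z2in -> Z2o
    yo = sigmoid(w[8] + w[9]*z1o + w[10]*z2o)  # yin  -> yo
    return np.abs(t - yo)

# The four samples and Individual 1 from the tables above.
X = np.array([[0.23, 0.45, 0.02],
              [0.43, 0.64, 0.02],
              [0.23, 0.09, 0.04],
              [0.57, 0.09, 0.33]])
t = np.array([0.0, 1.0, 0.0, 1.0])
id1 = np.array([.27, .61, .17, .91, .82, .45, .37, .63, .72, .47, .29])

errs = forward_errors(id1, X, t)      # ~ [0.77, 0.22, 0.77, 0.22]
total = errs.sum()                    # ~ 1.98
# Fitness as given on the slides, taking err as the total error over
# the samples (an interpretation; the slides are terse here).
fitness = 1.0 / (1.0 + total**2)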
Simulation parameters
• Maximum generations = 100
• Population size = 60
• Crossover probability Pc = 0.8
• Mutation probability Pm = 0.05
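A minimal GA driver using these settings might look as follows. The
selection, crossover and mutation operators shown (binary tournament,
arithmetic crossover, uniform mutation) are assumptions, since the
slides only give Pc and Pm; it reuses forward_errors, X and t from
the sketch above:

import numpy as np

rng = np.random.default_rng(0)
POP, GENS, PC, PM = 60, 100, 0.8, 0.05     # parameters from the slides

pop = rng.random((POP, 11))                # real-coded weight sets

def fitness(w):
    err = forward_errors(w, X, t).sum()    # forward pass only
    return 1.0 / (1.0 + err**2)

def tournament(pop, fit, rng):
    # Binary tournament selection (assumed; the slides name none).
    a, b = rng.integers(len(pop), size=2)
    return pop[a] if fit[a] > fit[b] else pop[b]

for gen in range(GENS):
    fit = np.array([fitness(w) for w in pop])
    children = []
    while len(children) < POP:
        p1 = tournament(pop, fit, rng).copy()
        p2 = tournament(pop, fit, rng).copy()
        if rng.random() < PC:              # arithmetic crossover (assumed)
            a = rng.random()
            p1, p2 = a*p1 + (1-a)*p2, a*p2 + (1-a)*p1
        for c in (p1, p2):
            m = rng.random(11) < PM        # uniform mutation (assumed)
            children.append(np.where(m, rng.random(11), c))
    pop = np.array(children[:POP])
    # (early stopping when fitness stagnates is omitted for brevity)

best = pop[np.argmax([fitness(w) for w in pop])]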
• Randomly taking 245 samples for training and the remaining 61 for
testing, with the previous architecture, the following results were
obtained:
• Classification accuracy = 63.93%
• The model fails to predict the "patient died" class.
• For better results, more hidden neurons may be used.
Convergence
[Figure: Convergence characteristics of the ANN-GA hybrid model.
Minimum, average and maximum absolute error are plotted against
generations (0 to 120); the absolute error axis runs from about 0.2
to 0.65.]