Principles of Soft Computing Using Python Programming, 2023
IEEE Press
445 Hoes Lane
Piscataway, NJ 08854
IEEE Press Editorial Board
Sarah Spurgeon, Editor in Chief
Gypsy Nandi
Assam Don Bosco University
Guwahati, India
Copyright © 2024 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,
except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without
either the prior written permission of the Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers,
MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to
the Publisher for permission should be addressed to the Permissions Department, John Wiley &
Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley &
Sons, Inc. and/or its affiliates in the United States and other countries and may not be used
without written permission. All other trademarks are the property of their respective owners.
John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created or
extended by sales representatives or written sales materials. The advice and strategies contained
herein may not be suitable for your situation. You should consult with a professional where
appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other
commercial damages, including but not limited to special, incidental, consequential, or other
damages.
For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside the
United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic formats. For more information about Wiley products,
visit our web site at www.wiley.com.
Contents
2 Fuzzy Computing 35
2.1 Fuzzy Sets 37
2.1.1 Features of Fuzzy Membership Functions 38
2.2 Fuzzy Set Operations 41
2.3 Fuzzy Set Properties 42
2.4 Binary Fuzzy Relation 45
2.5 Fuzzy Membership Functions 46
2.6 Methods of Membership Value Assignments 49
2.7 Fuzzification vs. Defuzzification 58
2.8 Fuzzy c-Means 62
Exercises 71
Index 327
Preface
Chapter 4 – Deep Learning delves into the realm of deep neural networks,
which have revolutionized fields such as computer vision, natural language
processing, and speech recognition. This chapter provides an overview of deep
learning techniques, including convolutional neural networks (CNNs), recur-
rent neural networks (RNNs), generative adversarial networks (GANs), and
autoencoders.
Chapter 5 – Probabilistic Reasoning explores the world of probability and
its applications in soft computing. You will delve into random experiments, ran-
dom variables, and different perspectives on probability. Bayesian inference, belief
networks, Markovian models, and their applications in machine learning are also
covered.
Chapter 6 – Population Based Algorithms introduces genetic algorithms and
swarm intelligence techniques. You will discover how genetic algorithms work
and explore their applications in optimization problems. Additionally, you will
dive into swarm intelligence methods, including ant colony optimization (ACO)
and particle swarm optimization (PSO), with practical Python code examples.
Chapter 7 – Rough Set Theory delves into the Pawlak Rough Set Model and its
applications in information systems, decision rules, and decision tables. You will
explore the use of rough sets in various domains such as classification, clustering,
medical diagnosis, image processing, and speech analysis.
Chapter 8 – Hybrid Systems concludes our journey by discussing hybrid sys-
tems that combine different soft computing techniques, including neuro-genetic
systems, fuzzy-neural systems, and fuzzy-genetic systems. You will also explore
their applications in medical devices.
Soft computing is a vital tool used for performing several computing operations.
It uses one or more computational models or techniques to generate optimum
outcomes. To understand this concept, let us first clarify our idea about computa-
tion. In any computation operation, inputs are fed into the computing model for
performing some operations based on which results are accordingly produced.
In the context of computing, the input provided for computation is called
an antecedent, and the output generated is called the consequence. Figure 1.1
illustrates the basics of any computing operation where computing is done using
a control action (series of steps or actions). Here, in this example, the control
action is stated as p = f (q), where “q” is the input, “p” is the output, and “f ” is
the mapping function, which can be any formal method or algorithm to solve a
problem.
Hence, it can be concluded that computing is nothing but a mapping function
that helps in solving a problem to produce an output based on the input provided.
The control action for computing should be precise and definite so as to provide an accurate solution to a given problem.
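As a minimal sketch of this idea, the control action p = f(q) can be written directly in Python; the particular mapping chosen for "f" below (a Celsius-to-Fahrenheit conversion) is only an illustrative assumption:

```python
# A hard-computing control action p = f(q): a precise, fixed mapping
# from input "q" to output "p". The conversion formula is an
# illustrative example of a formal method "f".
def f(q):
    return q * 9 / 5 + 32

p = f(100)   # feeding the input q = 100 produces a definite output
print(p)     # 212.0
```

Because the mapping is deterministic, the same input always produces the same output, which is the hallmark of precise computation.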
Principles of Soft Computing Using Python Programming: Learn How to Deploy Soft Computing Models
in Real World Applications, First Edition. Gypsy Nandi.
© 2024 The Institute of Electrical and Electronics Engineers, Inc. Published 2024 by John Wiley & Sons, Inc.
1 Fundamentals of Soft Computing
Soft computing, as defined by Prof. Lotfi A. Zadeh, is “a collection of methodologies that aim to exploit the tolerance for imprecision and uncertainty to achieve tractability, robustness, and low solution cost.” Prof. Zadeh
also emphasized that “soft computing is likely to play an increasingly important role
in many application areas, including software engineering. The role model for soft
computing is the human mind.” Soft computing mimics the notable ability of the
human mind to reason and make decisions in an environment of improbability
and imprecision. The principal components of soft computing include fuzzy logic,
neurocomputing, and probabilistic reasoning (PR).
If you are wondering in which areas soft computing is being used in our day-
to-day lives, the simplest and most common examples include kitchen appliances
(rice cookers, microwaves, etc.) and home appliances (washing machines, refrig-
erators, etc.). Soft computing also finds its dominance in gaming (chess, poker,
etc.), as well as in robotics work. Prominent research areas such as data compres-
sion, image/video recognition, speech processing, and handwriting recognition
are some of the popular applications of soft computing.
Case A: The car Ziva uses a software program to make movement decisions.
The path coordinates for movement decisions are already included in the soft-
ware program with the help of which Ziva can take a predefined path to arrive
at its destination. Now, suppose, while moving, Ziva encounters an obstacle in
the path. In such a case, the software program can direct it to move either to the right, to the left, or to take a back turn. In this case, the self-driving car
is not modeled to identify the nature and complexity of the obstacle to make a
meaningful and proper decision. In this situation, the computation model used
for the car is deterministic in nature, and the output is also concrete. Undoubtedly,
there is less complexity in solving the problem, but the output is always fixed due
to the rigidness of the computation method.
Case B: The car Ziva uses a software program to make movement decisions.
However, in this case, the program is more complex than the one defined in Case A, as the car is involved in much more complex decision-making. Ziva can now mimic the human brain in making decisions when any kind of obstacle is met during its travel.
Ziva, first of all, assesses the type of the obstacle, then decides whether it can overcome the obstacle by any means, and finally, it checks whether any alternate path can be chosen instead of overcoming the obstacle in the same path. The decision to be taken by Ziva is not very crisp and precise, as there are many alternative solutions that can be followed to reach destination point B. For example, if the obstacle is a small stone, Ziva can easily climb over the stone and continue on the same path, as this leads to a computationally less-expensive solution. However, if the obstacle is a big rock, Ziva may choose an alternative path to reach the destination point.
Case C: Now, let us consider Case C, in which the software program is written
to let the self-driving car reach its destination by initially listing out all the possible
paths available to reach from source A to destination B. For each available path, the cost of travel is calculated, and the paths are sorted by cost so that the destination can be reached in the fastest time possible. Finally, the optimum path is chosen, considering the minimum cost as well as the avoidance of any major obstacle. It can be seen that Case C combines Case A and Case B, inheriting approaches from both cases. It also adds some functionalities to tackle complex scenarios by choosing an optimum decision to finally reach destination point B.
The above three cases can be summarized (as listed in Figure 1.2) to check the
points of differences among each of these cases. From each of the above three
cases, it can be observed that the nature of computation in each of the three cases
is not similar.
Notice that emphasis is given on reaching the destination point in the first case.
As the result is precise and fixed, the computation of the Case A type is termed
hard computing. Now, in the second case, the interest is to arrive at an approximate
result, as a precise result is not guaranteed by this approach. The computation of
the Case B type is termed soft computing. The third case inherits the properties
of both Case A and Case B, and this part of computing is referred to as hybrid
computing. Thus, computing in perspective of computer science can be broadly
categorized, as shown in Figure 1.3.
As we understood that soft computing can deal with imprecision, partial truth,
and uncertainty, its applications are varied, ranging from day-to-day applications
to various applications related to science and engineering.

Table 1.1 Important points of differences between soft computing and hard computing.

1.3 Characteristics of Soft Computing

Some of the dominant characteristics of soft computing are listed in Figure 1.4, and a brief discussion on each of these characteristics is given next:
(a) Human expertise: Soft computing utilizes human expertise by framing fuzzy
if–then rules as well as conventional knowledge representation for solving
real-world problems that may involve some degree of truth or falsehood. In short,
where a concrete decision fails to represent a solution, soft computing tech-
niques work best to provide human-like conditional solutions.
(b) Biologically inspired computational models: Computational learning
models that follow the neural model of the human brain have been studied
and framed for complex problem solving with approximation solutions. A few
such popular neural network models include artificial neural network (ANN), convolutional neural network (CNN), and recurrent neural network (RNN) based models. These models are commonly used for solving classification, pattern recognition, and sentiment analysis problems.
(c) Optimization techniques: Nature-inspired optimization techniques are often employed as soft computing techniques to solve complex optimization problems. For example,
genetic algorithms (GA) can be used to select top-N fit people out of a human
population of a hundred people. The selection of the fittest people is done by using properties such as mutation, inspired by the biological evolution of genes.
(d) Fault tolerant: Fault tolerance of a computational model indicates the
capacity of the model to continue operating without interruption, even if
any software or hardware failure occurs. That is, the normal computational
process is not affected even if any of the software or hardware components fail.
(e) Goal-driven: Soft computing techniques are considered to be goal-driven.
This indicates that emphasis is given more on reaching the goal or destina-
tion than on the path considered to be taken from the current state to reach
the goal. Simulated annealing and GA are good examples of goal-driven soft
computing techniques.
(f) Model-free learning: The training models used in soft computing need not be aware beforehand of all the states in the environment. Learning takes place in due course of the actions taken in the present state. In other words, there is no teacher who specifies beforehand all the precise actions to be taken per condition or state. Instead, the learning algorithm only has a critic that indirectly provides feedback as to whether the action taken should be rewarded or punished. The rewards or punishments given help in better decision-making for future actions.
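The critic-and-reward idea behind model-free learning can be sketched in a few lines of Python; the two-action environment, the learning rate, and the value-update rule below are illustrative assumptions, not taken from the text:

```python
import random

# Minimal model-free learning sketch: the learner has no model of the
# environment; a "critic" only returns a reward after each action, and
# action-value estimates are updated from that feedback alone.
random.seed(0)

actions = ["left", "right"]
value = {a: 0.0 for a in actions}        # learned estimate per action
alpha = 0.1                              # learning rate (assumed)

def critic(action):
    # Hidden environment: "right" is rewarded, "left" is punished.
    return 1.0 if action == "right" else -1.0

for _ in range(200):
    a = random.choice(actions)           # explore an action
    r = critic(a)                        # feedback only, no state model
    value[a] += alpha * (r - value[a])   # move estimate toward the reward

best = max(value, key=value.get)
print(best)                              # the rewarded action wins
```

The learner never inspects how the critic works; it discovers the better action purely from accumulated rewards and punishments.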
1.4 Components of Soft Computing
The three principal components of soft computing include fuzzy logic-based com-
puting, neurocomputing, and GA. These three components form the core of soft
computing. There are a few other components of soft computing often used for
problem solving, such as machine learning (ML), PR, evolutionary reasoning, and
chaos theory. A brief summary of all these components of soft computing techniques is given next, along with an illustrative diagram, as shown in Figure 1.5.
While fuzzy computing involves understanding fuzzy logic and fuzzy sets,
neural networks include the study of several neural network systems such as
artificial neural network (ANN) and CNN. Evolutionary computing (EC) involves
a wide range of techniques such as GA and swarm intelligence. Techniques for
ML are categorized mainly as supervised learning (SL), unsupervised learning,
and reinforcement learning (RL). Soft computing also involves a wide variety of
techniques such as chaos theory, PR, and evolutionary reasoning.
Figure 1.5 Components of soft computing.
Figure 1.6 (a) Boolean (nonfuzzy) and (b) fuzzy logic-based solutions for the question “Is the XYZ courier service profitable?”: (a) Yes (value: 1) or No (value: 0); (b) fully unprofitable (value: 0), moderately unprofitable (value: 0.25), neither profitable nor unprofitable (value: 0.5), moderately profitable (value: 0.75), and fully profitable (value: 1).
Let us understand this simple concept with the help of an example. For instance,
if we consider the question, “Is the XYZ Courier Service Profitable?” the reply to
this question can be simply stated as either “Yes” or “No.” If only two close-ended
choices are provided for this question, it can be considered as value 1 if the answer
given is “Yes” or 0 if the answer given is “No.” However, what if the profit is not remarkably high, and only a moderate profit is incurred from the courier service? If we take a deeper look at the question, there is a possibility that the answer can lie within a range between 0 and 1, as the level of profitability may be neither totally 100% profitable nor totally 100% unprofitable. Here, the role of fuzzy logic comes
into play where the values can be considered in percentages (say, neither profit
nor loss, i.e., 0.5). Thus, fuzzy logic tries to deal with real-world situations, which
consider partial truth as a possible solution for a problem.
Figure 1.6(a) illustrates the two outcomes provided for the question “Is the
XYZ Courier Service Profitable?” The solution provided for the question in
this case is Boolean logic based, as only two extreme choices are provided for
responses. On the other hand, Figure 1.6(b) illustrates the various possibilities of
answers that can be provided for the same question “Is the XYZ Courier Service
Profitable?” Here, the concept of fuzzy logic is applied to the given question by
providing a few possibilities of answers such as “fully unprofitable,” “moderately
unprofitable,” “neither profitable nor unprofitable,” “moderately profitable,” and
“fully profitable.” The class membership is determined by the fuzzy membership
function. As seen in Figure 1.6(b), the membership degree (e.g., 0, 0.25, 0.5, 0.75,
and 1) is taken as output value for each response given.
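The fuzzy responses of Figure 1.6(b) map naturally onto a Python dictionary of membership degrees:

```python
# The five fuzzy responses of Figure 1.6(b) as membership degrees.
# Boolean logic would allow only the two crisp values 0 and 1; fuzzy
# logic admits the whole range in between.
profitability = {
    "fully unprofitable": 0.0,
    "moderately unprofitable": 0.25,
    "neither profitable nor unprofitable": 0.5,
    "moderately profitable": 0.75,
    "fully profitable": 1.0,
}

for answer, degree in profitability.items():
    print(f"{answer}: {degree}")
```

Each response carries a degree of truth in [0, 1] rather than a hard Yes/No.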
One common example of using fuzzy sets in computer science is in the field
of image processing, specifically in edge detection. Edge detection is the process
of identifying boundaries within an image, which are areas of rapid intensity changes. Fuzzy logic can be used to make edge detection more robust and
accurate, especially in cases where the edges are not clearly defined. Let us
consider a grayscale image where each pixel’s intensity value represents its
brightness. To detect edges using fuzzy logic, one might define a fuzzy set for
“edgeness” that includes membership functions like “definitely an edge,” “possibly
an edge,” and “not an edge”. In such a case, the membership functions can be
defined as follows:
(a) Definitely an edge: If the intensity difference is high, the pixel is more likely
to be on an edge.
(b) Possibly an edge: If the intensity difference is moderate, the pixel might be
on an edge.
(c) Not an edge: If the intensity difference is low, the pixel is unlikely to be on
an edge.
Using these membership functions, you can assign degrees of membership to
each pixel for each of these fuzzy sets. For instance, a pixel with a high-intensity
difference would have a high degree of membership in the “definitely an edge”
fuzzy set.
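A minimal sketch of such “edgeness” membership functions in Python; the breakpoint intensities (50 and 150) and the linear shapes are illustrative assumptions, not values from the text:

```python
# Fuzzy membership functions for "edgeness" based on a pixel's
# intensity difference (assumed range 0-255). The breakpoints 50 and
# 150 are illustrative choices.
def not_an_edge(diff):
    if diff <= 50:
        return 1.0
    if diff >= 150:
        return 0.0
    return (150 - diff) / 100            # linearly decreasing membership

def definitely_an_edge(diff):
    if diff <= 50:
        return 0.0
    if diff >= 150:
        return 1.0
    return (diff - 50) / 100             # linearly increasing membership

def possibly_an_edge(diff):
    # Peaks where the other two sets are most ambiguous.
    return 1.0 - abs(definitely_an_edge(diff) - not_an_edge(diff))

for diff in (20, 100, 200):
    print(diff, not_an_edge(diff), possibly_an_edge(diff), definitely_an_edge(diff))
```

A pixel with a high intensity difference (e.g., 200) gets full membership in “definitely an edge,” while a moderate difference (e.g., 100) is graded as “possibly an edge.”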
A crisp set, as you may know, is a set with fixed and well-defined boundaries.
For instance, if the universal set (U) is a set of all states of India, a crisp set may be
a set of all states of North-East India for the universal set (U). A crisp set (A) can
be represented in two ways, as shown in Equations (1.1) and (1.2)
A = {a1 , a2 , a3 , … , an } (1.1)
A = {x|P(x)} (1.2)
Here, in Equation (1.1), the crisp set “A” consists of a collection of elements ranging from a1 to an. Equation (1.2) shows the other way of representing a crisp set “A,” where “A” consists of a collection of values of “x” such that each has the property P(x).
Now, a crisp set can also be represented using a characteristic function, as shown in Equation (1.3):

𝜇A(x) = {1, if x belongs to A; 0, if x does not belong to A} (1.3)
A fuzzy set is a more general concept of the crisp set. It is a potential tool to
deal with uncertainty and imprecision. It is usually represented by an ordered pair
where the first element of the ordered pair represents the element belonging to a
set, and the second element represents the degree of membership of the element
to the set. The membership function value may vary from 0 to 1. Mathematically, a fuzzy set A′ is represented as shown in Equation (1.4):

A′ = {(x, 𝜇A′(x)) | x ∈ X} (1.4)
Here, the membership function value indicates the degree of belongingness and is
denoted by 𝜇A′ (x). Here, in Equation (1.4), “X” indicates the universal set, which
consists of a set of elements “x.” A membership function can either be any standard function (for example, the Gaussian function) or any user-defined function suited to the problem domain. As this membership function is used to represent the degree of truth in fuzzy logic, its value on the universe of discourse “X” is defined as:

𝜇A′(x) ∈ [0, 1] (1.5)
Here, in Equation (1.5), each element “x” of the universal set “X” is mapped to a value between 0 and 1.
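The contrast between the crisp characteristic function of Equation (1.3) and the fuzzy membership of Equation (1.4) can be sketched in Python; the universe of discourse and the membership degrees below are illustrative assumptions:

```python
# Universe of discourse X and a crisp subset with a well-defined boundary.
U = {"tiny", "small", "medium", "large"}
crisp_A = {"tiny", "small"}

def mu_crisp(x):
    # Characteristic function of Equation (1.3): strictly 0 or 1.
    return 1 if x in crisp_A else 0

# Fuzzy set A' = {(x, mu(x)) | x in X}, stored as element -> degree.
# The degrees are illustrative, e.g. membership in the concept "small".
fuzzy_A = {"tiny": 1.0, "small": 0.8, "medium": 0.3, "large": 0.0}

print([mu_crisp(x) for x in sorted(U)])
print(fuzzy_A["medium"])   # partial membership, impossible in a crisp set
```

The crisp set can only answer 0 or 1, whereas the fuzzy set grades “medium” with a partial degree of belongingness.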
The above explanations lead us to the understanding that a fuzzy set does
not have a crisp, clearly defined boundary; rather it contains elements with
only a partial degree of membership. Some of the standard properties of fuzzy
sets include commutative property, associative property, distributive property,
transitivity, and idempotent property. There are a few other properties of fuzzy sets that will be discussed in detail in Chapter 2.
Also, there are three standard fuzzy set operators used in fuzzy logic – fuzzy union, fuzzy intersection, and fuzzy complement. In the case of the complement operation, while a crisp set determines “Who does not belong to the set?,” a fuzzy set determines “How much of an element does not belong to the set?” Again, in the case of the union operation, while a crisp set determines “Which elements belong to either of the sets?,” a fuzzy set determines “How much of an element is in either of the sets?” Lastly, in the case of the intersection operation, while a crisp set determines “Which elements belong to both of the sets?,” a fuzzy set determines “How much of an element is in both of the sets?” These fuzzy operations will also be discussed in detail in Chapter 2.
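These three operators are commonly realized (following Zadeh) as max, min, and 1 − 𝜇; a small sketch with assumed membership degrees:

```python
# Two fuzzy sets over the same elements, with illustrative degrees.
A = {"x1": 0.25, "x2": 0.75, "x3": 1.0}
B = {"x1": 0.5,  "x2": 0.5,  "x3": 0.0}

union        = {x: max(A[x], B[x]) for x in A}  # how much of x is in either set
intersection = {x: min(A[x], B[x]) for x in A}  # how much of x is in both sets
complement_A = {x: 1 - A[x] for x in A}         # how much of x is NOT in A

print(union)         # {'x1': 0.5, 'x2': 0.75, 'x3': 1.0}
print(intersection)  # {'x1': 0.25, 'x2': 0.5, 'x3': 0.0}
print(complement_A)  # {'x1': 0.75, 'x2': 0.25, 'x3': 0.0}
```

With degrees restricted to {0, 1}, these formulas reduce exactly to the familiar crisp set operations.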
Fuzzy logic systems have proved to be extremely helpful in dealing with
situations that involve decision-making. As some problems cannot be solved by simply determining whether the answer is True/Yes or False/No, fuzzy logic is used to offer flexibility in reasoning in order to deal with uncertainty in such situations.
The applications of fuzzy logic are varied, ranging from domestic appliances to
automobiles, aviation industries to robotics.
interconnection of one neuron with other neurons). The different parts of a neuron
are illustrated in Figure 1.7. A neuron gets fired only if certain conditions are met.
The signals received on each synapse may be of excitatory or inhibitory type.
When the excitatory signals exceed the inhibitory signals by certain quantified
threshold value, the neuron gets fired. Accordingly, either positive or negative
weights are assigned to signals – a positive weight is assigned to excitatory signals,
whereas a negative weight is assigned to inhibitory signals. This weight value indicates the amount of impact a signal has on the excitation of the neuron. The signals, multiplied by their weights on all the incoming synapses, are summed up to get a final cumulative value. If this value exceeds the threshold, then the neuron is excited.
This biological model has been mathematically formulated to accomplish optimal
solutions to different problems and is technically termed as “Artificial Neural Net-
work (ANN).” ANN has been applied in a large number of applications such as pat-
tern matching, pattern completion, classification, optimization, and time-series
modeling.
A simple example of an ANN is given in Figure 1.8. The nodes in ANN are orga-
nized in a layered structure (input layer, hidden layer, and output layer) in which
each signal is derived from an input and passes via nodes to reach the output.
Each black circular structure in Figure 1.8 represents a single neuron. The sim-
plest artificial neuron can be considered to be the threshold logic unit (TLU). The
TLU operation performs a weighted sum of its inputs and then outputs either a “0”
or “1.” An output of “1” occurs if the sum value exceeds a threshold value and a
“0” otherwise. TLU thus models the basic “integrate-and-fire” mechanism of real
neurons.
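A TLU as described above can be sketched in a few lines; the weights and threshold below are illustrative choices that happen to make the unit compute a logical AND:

```python
# Threshold logic unit (TLU): weighted sum of inputs, output "1" if the
# sum exceeds a threshold value and "0" otherwise.
def tlu(inputs, weights, threshold):
    s = sum(i * w for i, w in zip(inputs, weights))
    return 1 if s > threshold else 0

# With weights (1, 1) and threshold 1.5, the TLU fires only when both
# inputs are active, i.e. it computes logical AND.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, tlu(x, (1, 1), 1.5))
```

This is the “integrate-and-fire” mechanism in its simplest form: integrate the weighted inputs, then fire if the threshold is crossed.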
The basic building block of every ANN is the artificial neuron. At the entrance
section of an artificial neuron, inputs are assigned weights. For this, every input
value is multiplied by an individual weight (Figure 1.9). In the middle section of
the artificial neuron, a sum function is evaluated to find the sum of all the weighted
inputs and bias. Next, toward the exit of the artificial neuron, the calculated sum
value is passed through an activation function, also called a transfer function.
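The weight–sum–activate pipeline of a single artificial neuron can be sketched as follows; the sigmoid activation and the particular weights and bias are illustrative assumptions:

```python
import math

# One artificial neuron: weight each input, add the bias, then pass the
# sum through an activation (transfer) function - here the sigmoid, a
# common but illustrative choice.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def neuron(inputs, weights, bias):
    z = sum(i * w for i, w in zip(inputs, weights)) + bias  # weighted sum + bias
    return sigmoid(z)                                        # activation at the exit

out = neuron([1.0, 2.0, 3.0], weights=[0.5, -0.25, 0.1], bias=0.05)
print(out)   # a value squashed into the open interval (0, 1)
```

Replacing the sigmoid with a hard step function recovers the TLU as a special case.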
ANN provides a simplified model of the network of neurons that occur in the
human or animal brain. ANN was initially developed with the sole purpose of solving problems in the same way that a human or animal brain does. However, more and more research on ANN has led to its deviation from biology to solve
several challenging tasks such as speech recognition, medical diagnosis, computer
vision, and social network filtering.
over time, and each variant is suitable for more specific types of problems and
data structures. At times, two or more evolutionary algorithms (EA) are applied
together for problem solving in order to generate better results. This makes EC
very popular in computer science, and a lot of research is explored in this area.
In general, EA mimic the behavior of biological species based on Darwin’s theory
of evolution and natural selection mechanism. The four main steps involved in EA
include – initialization, selection, use of genetic operators (crossover and mutation),
and termination. Each of these chronological steps makes an important contribu-
tion to the process of natural selection and also provides easy ways to modularize
implementations of EA. The four basic steps of EA are illustrated in Figure 1.10,
which begins with the initialization process and ends with the termination process.
The initialization step of EA helps in creating an initial population of solutions.
The initial population is either created randomly or created considering the ideal
condition(s). Once the population is created in the first step, the selection step is
carried out to select the top-N population members. This is done using a fitness
function that can accurately select the right members of the population. The next
step involves use of two genetic operators – crossover and mutation – to create
the next generation of population. Simply stated, these two genetic operators help
in creating new offspring from the given population by introducing new genetic
material into the new generation. Lastly, the EA involve the termination step to end
the process. The termination step occurs in either of the cases – the algorithm has
reached some maximum runtime, or the algorithm has reached some threshold
value based on performance.
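The four steps above can be sketched as a toy EA that maximizes f(x) = −(x − 3)² over real numbers; the population size, the averaging crossover, the mutation scale, and the fixed-generation stopping rule are all illustrative choices:

```python
import random

# Toy evolutionary algorithm illustrating the four steps:
# initialization, selection, genetic operators, and termination.
random.seed(1)

def fitness(x):
    return -(x - 3) ** 2          # optimum at x = 3

# 1. Initialization: a random initial population of candidate solutions.
pop = [random.uniform(-10, 10) for _ in range(20)]

for generation in range(100):     # 4. Termination: fixed maximum runtime
    # 2. Selection: keep the top-N fittest members.
    pop = sorted(pop, key=fitness, reverse=True)[:10]
    # 3. Genetic operators: crossover (averaging) and mutation (noise).
    children = []
    for _ in range(10):
        a, b = random.sample(pop, 2)
        children.append((a + b) / 2 + random.gauss(0, 0.1))
    pop += children

best = max(pop, key=fitness)
print(best)                       # close to the optimum x = 3
```

Each pass through the loop is one generation; the modular steps can be swapped independently, which is exactly what the EA variants below do.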
Independent research work on EA led to the development of five main streams of EA, namely, evolutionary programming (EP), evolution strategies (ES), swarm intelligence, genetic algorithms (GA), and differential evolution (DE) (as shown in Figure 1.11). A brief discussion on each of these subareas of EA is given in the later part of this section.
● Evolutionary programming: The concept of EP was originally conceived by
Lawrence J. Fogel in the early 1960s. It is a stochastic optimization strategy
similar to GA. However, a study is made on the behavioral linkage of parents and
offspring in EP, while genetic operators (such as crossover operators) are applied
to produce better offspring from given parents in GA. EP usually involves four
main steps, as mentioned below. Step 1 involves choosing an initial population of trial solutions. Steps 2 and 3 are repeated either until the number of iterations exceeds a threshold or until an adequate solution for the given problem is obtained:
⚬ Step 1: An initial population of trial solutions is chosen at random.
⚬ Step 2: Each solution is replicated into a new population. Each of these off-
spring solutions is mutated.
⚬ Step 3: Each offspring solution is assessed by computing its fitness.
⚬ Step 4: Terminate.
The three common variations of EP include the Classical EP (uses Gaussian
mutation for mutating the genome), the Fast EP (uses the Cauchy distribution
for mutating the genome), and the Exponential EP (uses the double exponential
distribution as the mutation operator). A few of the common application areas
of EP include path planning, traffic routing, game learning, cancer detection,
military planning, combinatorial optimization, and hierarchical system design.
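Classical EP, which uses Gaussian mutation and no crossover, can be sketched on a toy one-dimensional problem; the fitness function and all parameters below are illustrative assumptions:

```python
import random

# Classical EP sketch: each parent is replicated and mutated with
# Gaussian noise (Step 2), offspring are assessed by fitness (Step 3),
# and the best survivors form the next population. No crossover is used.
random.seed(2)

def fitness(x):
    return -abs(x - 5)            # optimum at x = 5

pop = [random.uniform(-20, 20) for _ in range(15)]            # Step 1

for _ in range(200):              # Step 4: terminate after fixed iterations
    offspring = [x + random.gauss(0, 0.5) for x in pop]       # Step 2
    pop = sorted(pop + offspring, key=fitness, reverse=True)[:15]  # Step 3

best = pop[0]
print(best)                       # close to the optimum x = 5
```

Swapping the Gaussian mutation for a Cauchy or double-exponential distribution yields the Fast EP and Exponential EP variants mentioned above.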
● Evolution strategies: Evolution strategies (ES) is yet another optimization
to the optimization problem, and the amount of nectar in the food sources
decides the quality or fitness of the given solution. In fact, the quality of
a food source depends on many factors, such as the amount of food source
available, the ease of extracting its nectar, and also its distance from the nest.
Depending on the number of food sources, the same number of employed
bees is chosen to solve a problem. It is the role of employed bees to carry on
the information about the quality of the food source and share this informa-
tion with the other bees.
The unemployed bees also play an active role in the food hunt. One type
of unemployed bee is the scout, which explores the environment near the
nest in search of food. The other type of unemployed bee is the onlooker,
which waits in the nest to get information about the quality of food sources
from the employed bees and establish the better food sources. Communica-
tion among bees related to the quality of food sources takes place through
the famous “waggle dance” of honey bees. This exchange of information
among the three types of bees is the most vital occurrence in the formation
of collective knowledge.
● Genetic algorithms: The concept of genetic algorithms (GA) was proposed by
John Holland in the 1960s. Later, Holland along with his colleagues and stu-
dents developed the concepts of GA at the University of Michigan in the 1960s
and 1970s as well. A genetic algorithm is a metaheuristic that is inspired by
Charles Darwin’s theory of natural evolution. GA are a part of the larger class of EA that emphasize selecting the fittest individuals for reproduction in order to produce offspring. The generated offspring inherit the characteristics of their parents and are therefore expected to have better fitness if the parents have good fitness values. Such offspring, in turn, will have a better chance of survival. If this process is repeated multiple times, at some point in time, a generation of the fittest individuals will be formed.
There are basically five main phases of GA (as illustrated in Figure 1.13): pop-
ulation initialization, fitness function calculation, parent selection, crossover,
and mutation. Initially, a random population of size “n” consisting of several
individual chromosomes is chosen. Next, the fitness value of each of the indi-
vidual chromosomes is calculated based on a fitness function. The fitness value
plays a vital role in the decision-making of the selection of chromosomes for
crossover.
In the crossover phase, every two individual chromosomes selected are repro-
duced using a standard crossover operator. This results in the generation of two
offspring from each pair of chromosomes. The new offspring generated are then
mutated to produce a better set of individual chromosomes in the newly generated
population. These five phases of GA are repeated until a termination
condition is met. Each iteration of the GA is called a generation, and the entire
set of generations is called a run. The final output (result) is the generation of
the fittest individuals, which have the greatest chance of survival.
● Differential evolution: Differential evolution (DE) is a popular evolutionary
algorithm inspired by Darwin’s theory of evolution and has been widely studied
for diverse optimization applications since its introduction
by Storn and Price in the 1990s. The various steps involved in DE include
population initialization, mutation, crossover, selection, and result generation
(illustrated in Figure 1.14). Prior to applying these basic steps, the parameters
of DE need to be defined, such as the population size, the selection method,
the crossover method, and the perturbation rate (the weight applied to random
differential).
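These steps can be sketched in Python as the classic DE/rand/1/bin scheme. The sphere objective, the search bounds, and the parameter values F (the perturbation weight) and CR (the crossover rate) below are illustrative assumptions, not taken from the text.

```python
import random

random.seed(1)

def sphere(x):
    # Illustrative objective to minimize (not from the text)
    return sum(v * v for v in x)

def differential_evolution(obj, dim=3, pop_size=15, F=0.8, CR=0.9, gens=100):
    # Population initialization within illustrative bounds [-5, 5]
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        for i in range(pop_size):
            # Mutation: base vector plus F-weighted difference of two others
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            mutant = [a[k] + F * (b[k] - c[k]) for k in range(dim)]
            # Crossover: mix components of the target and the mutant
            jrand = random.randrange(dim)
            trial = [mutant[k] if (random.random() < CR or k == jrand)
                     else pop[i][k] for k in range(dim)]
            # Selection: the better of target and trial survives
            if obj(trial) <= obj(pop[i]):
                pop[i] = trial
    # Result generation: return the fittest vector found
    return min(pop, key=obj)

best = differential_evolution(sphere)
print(sphere(best))  # objective value of the best solution (near 0)
```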
There are various popular variants of DE, some of which are mentioned below:
● The standard differential evolution (DE)
● The self-adaptive control parameters differential evolution (JDE)
● The adaptive differential evolution with optional external archive (JADE)
● The composite differential evolution (CODE)
● The self-adaptive differential evolution (SADE)
The applications of DE are varied, including synthesis and optimization of
heat-integrated distillation system, optimization of an alkylation reaction, digital
Figure 1.14 Steps of DE: population initialization, fitness evaluation, mutation, crossover, selection, and result (fittest solution).
Figure 1.15 Machine learning algorithm used for training data to form clusters.
Figure 1.16 Machine learning algorithm used for classifying email as spam or
legitimate.
or not spam. Once the algorithm is fully trained and shows high accuracy in
prediction, it is ready to be used for similar predictions in the future.
The contribution of ML right from solving day-to-day tasks to solving complex
real-life problems is tremendous. In fact, many home appliances, health care
monitoring systems, mobile apps, and internet services heavily rely on using
ML. Also, popular virtual personal assistants such as Alexa, Siri, and Google Now
rely heavily on ML techniques. These virtual assistants are a
perfect example of advanced ML in use, as they offer capabilities
that include voice interaction, playing audio, answering the door, dimming the
lights of a room, and reading the latest headlines. Again, traffic prediction in
GPS navigation services uses ML techniques to provide users with live data on
traffic congestion while traveling. ML algorithms also help companies develop
chatbots to resolve user queries. The contribution of ML is expected to keep
growing in the near future, and researchers will increasingly depend
on ML algorithms to build innovative tools and techniques.
There are mainly four types of ML – SL, unsupervised learning, semi-supervised
learning, and RL. Under the umbrella of SL are classification and regression,
which use a dataset having known inputs and outputs. Unsupervised learning, on
the other hand, uses a dataset to identify hidden patterns. Under the umbrella of
unsupervised learning falls clustering and association analysis. Semi-supervised
learning lies between SL and unsupervised learning and handles the drawbacks
of both these types of ML techniques.
Figure 1.17 illustrates the various types of ML used for different varieties of data
and problems. Each type of ML has a special role to perform, which is
briefly explained next.
● Supervised learning (SL): Most of the ML techniques are based on SL, which
works on the basis of supervision. Basically, supervised learning techniques
train machines using “labeled” datasets. Such datasets have input variables
that are used by a mapping function to derive the output variable(s). This can
be mathematically expressed as P = f (Q), where Q is the input data, and P is the
output. SL thus uses the input variables and the corresponding output having
“labeled” data to train machines by approximating the mapping function.
The training is carried out to such an extent that when new input
data (q1) is fed, the model can accurately predict the output variable
(P) for that data (q1).
To clarify the idea of SL, let us consider a simple example. Assume a
dataset consisting of images of three animals – say, dog, goat, and cat –
which are provided as input to the machine. The output variable, also called
the labeled variable, stores the values “dog,” “goat,” or “cat.” Now, an
SL model is trained with this labeled dataset so that it can differentiate among these
three animals and correctly predict the output. This learning model is illustrated
in Figure 1.18, in which the labeled data are used for training, and the model is tested with
a test dataset (unlabeled data) to check the accuracy of the predicted output once
the training is completed. If the accuracy of prediction is very high, the model
can be said to be trained and ready for future predictions.
There are two main notable techniques of supervised ML – classification and
regression. Both these techniques have a similar goal of predicting the output
(dependent attribute) based on the series of input data provided. Classification
deals with prediction of discrete class label(s), whereas regression deals with
prediction of continuous quantity. Also, a classification problem is mainly evaluated
using accuracy as the evaluation metric, whereas a regression problem is
usually evaluated using root mean square error (RMSE).
(a) Classification: Classification is a type of supervised ML as it considers
“labeled” data to perform the task of prediction of output. The “labeled”
variables act as class labels, which play a major role in training the
algorithm. Classification approximates a mapping function (f) from input variables (Q)
to discrete output variables (P). For a given observation, the primary task of
classification is to predict its class. The technique of classification
is applied in many significant areas where new observations need to
be categorized, such as spam filtering, face detection, credit approval, fraud
detection, optical character recognition, market segmentation, and so on.
As shown in Figure 1.18, the given problem is one of classification, in which
the output to be obtained belongs to one of three classes – cat, dog, or
goat. Another example of a classification problem is one in which the output
to be predicted is either Yes or No, such as “Diabetes” or “No Diabetes,” “Provide
Loan” or “Do not Provide Loan,” “Spam Mail” or “Legitimate Mail,”
and so on.
Classification techniques are further grouped under two models – lazy
learners and eager learners. In the case of the lazy learning model, training
time is comparatively less; however, prediction time is greater compared
to the eager learning model. Examples of lazy learners
include the k-Nearest Neighbor (kNN) classification model and the case-based
learning model. Eager learners construct the classification model from the
given training data even before receiving any test data. Examples of eager
learners include decision trees, Naïve Bayes, and ANNs.
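A minimal sketch of a lazy learner is shown below: a kNN classifier that builds no model in advance and defers all distance computation to prediction time. The 2-D points and their labels are invented for illustration.

```python
from collections import Counter
from math import dist

# Illustrative labeled training data: 2-D points with class labels
train = [((1.0, 1.2), "cat"), ((1.1, 0.9), "cat"),
         ((4.0, 4.2), "dog"), ((4.2, 3.9), "dog"),
         ((8.0, 1.0), "goat"), ((7.8, 1.3), "goat")]

def knn_predict(point, k=3):
    # Lazy learning: no model is built in advance; all distance
    # computation is deferred until a prediction is requested.
    neighbors = sorted(train, key=lambda item: dist(item[0], point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn_predict((1.05, 1.0)))  # → cat
print(knn_predict((4.1, 4.0)))   # → dog
```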
(b) Regression: In regression, the main task is to approximate a mapping
function ( f ) from input variables (X) to a continuous output variable (Y ).
The continuous output variable (Y ) should denote a quantity and has to
be a real value, such as an integer or floating-point value. Given a new
set of input values, regression can make a prediction of the corresponding
quantitative output (y), based on the study of previous corresponding (x, y)
and RL is that SL makes a decision based on the initial input provided, whereas
RL makes decisions sequentially by considering the inputs generated at every
new phase. The decisions made in SL at each step are independent of
each other, whereas the decisions made in RL depend on the decisions made at the
previous step.
This is called Bayes rule and is often used for updating a belief about a hypothesis
Q in the light of new evidence R.
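As a quick numeric sketch of this update, the following Python function applies Bayes rule, P(Q|R) = P(R|Q)·P(Q)∕P(R); the prior and likelihood values used below are invented for illustration.

```python
def bayes_update(prior_q, p_r_given_q, p_r_given_not_q):
    # Bayes rule: P(Q|R) = P(R|Q) P(Q) / P(R), where the evidence term
    # P(R) is expanded by the law of total probability over Q and not-Q.
    p_r = p_r_given_q * prior_q + p_r_given_not_q * (1 - prior_q)
    return p_r_given_q * prior_q / p_r

# Illustrative numbers: hypothesis Q has prior belief 0.3; evidence R is
# observed with probability 0.8 when Q holds and 0.2 otherwise.
posterior = bayes_update(0.3, 0.8, 0.2)
print(round(posterior, 3))  # → 0.632
```

Seeing the evidence raises the belief in Q from 0.3 to about 0.63.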
Bayesian networks provide a natural representation of (causally induced)
conditional independence. They have replaced many rule-based systems by being less
rigid and more robust. However, one major limitation of a Bayesian network
is that it typically requires initial knowledge of many probabilities. Another issue
is that the output is also largely dependent on the quality and extent of prior
knowledge.
● Chaos theory: Chaos theory is well-suited to problems that are
highly sensitive to initial conditions. In such cases, a slight difference in initial
conditions (for example, a change in the second decimal place of an initial value)
leads to highly diverging outcomes. Chaotic behavior can
be observed in nature, for example in changing weather. Robert L. Devaney has
classified a dynamic system as chaotic based on the following three properties:
– It must be sensitive to initial conditions (the “butterfly effect”): data points
in a chaotic system can be arbitrarily close to each other yet follow
significantly different future paths.
– It must be topologically mixing: The topological transitivity or topological mix-
ing relates to the evolution of the system over time such that any given region
may eventually overlap with another region.
– It must have dense periodic orbits: Every point in the space is approached arbi-
trarily closely by periodic orbits.
To sum up, chaos theory as defined by Kellert is “the qualitative study of
unstable aperiodic behavior in deterministic nonlinear systems” (Kellert 1993,
p. 2). As understood, a chaotic system is nonlinear and sensitive to initial
conditions. There is also no periodic behavior in such systems, and the motion
remains random. Considering these characteristics, a few applications based on
chaos theory include observation of weather patterns, stock market predictions,
algorithmic trading, bird migration patterns, observation of social phenomena,
robotics, and study of brain waves.
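The sensitivity to initial conditions can be demonstrated numerically with the logistic map xₙ₊₁ = r·xₙ(1 − xₙ), a standard example that behaves chaotically at r = 4; the two starting values below differ only in the second decimal place.

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    # Iterate the logistic map x_{n+1} = r * x_n * (1 - x_n)
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

# Two trajectories whose starting points differ in the second decimal place
a = logistic_trajectory(0.20)
b = logistic_trajectory(0.21)

print(abs(a[1] - b[1]))                                 # tiny separation after one step
print(max(abs(x - y) for x, y in zip(a[40:], b[40:])))  # large separation later on
```

Despite the map being fully deterministic, the two trajectories become effectively uncorrelated after a few dozen iterations.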
● Evidential reasoning: Evidential reasoning (ER) is a recent approach that
has been developed mainly on the basis of AI, decision theory, statistical
analysis, and computer technology. In decision theory, the ER approach is a
generic evidence-based multi-criteria decision analysis (MCDA) approach
that can deal with problems involving uncertainties such as randomness
and ignorance. ER supports assessments, decision analysis, and evaluation
activities. For instance, ER contributes to the environmental impact assessment
for developing a proposal or a project related to the environment.
The ER approach constitutes the modeling framework of multi-criteria
decision-making (MCDM) problems using the following concepts:
Exercises
Fuzzy Computing
The father of fuzzy logic, Lotfi A. Zadeh, proposed the idea of fuzzy computing
in the year 1965, but fuzzy computing mainly gained popularity during
the 1990s. Some solutions of real-life problems can generate crisp results, like “Yes”
or “No,” “True” or “False,” “0” or “1.” That is, an element can either belong
or not belong to the set of universe U. Now, if the element belongs to the
set of universe U, it is considered 1, else 0. A set of elements having such precise
output is known as a “crisp set,” which can be represented with the help of a
membership function. Figure 2.1 illustrates the difference
between a crisp set and a fuzzy set. While the crisp set displays binary output, the fuzzy
set displays a more realistic, varying output.
If we consider an element y and a set S, the membership function can be
denoted as given in Eq. (2.1). A membership function (𝜇) represents the relation-
ship between the values of elements in a set and their degree of membership in
the set. Here, in Eq. (2.1), the output can be either 1 (y belongs to the set S) or 0
(y does not belong to the set S), based on the given membership function.
𝜇S(y) = { 1, if y ∈ S; 0, if y ∉ S }        (2.1)
However, in real-life problems, solutions may not always be attained in crisp
form. This is where the role of fuzzy logic comes into play. Each element of a fuzzy
set can partially belong to one set and can also partially belong to other sets. Hence,
fuzziness indicates vagueness and uncertainty, as there is no definite output for each
given element. To understand this, let us take a simple example. A male person
is considered tall if his height is equal to or above 5.5 ft; otherwise, he is considered
short. This is an example of crisp logic, as illustrated in Figure 2.2(a), where 0
indicates short and 1 indicates tall.
In real-life circumstances, it is not always possible to consider a person as either
tall or short based on a precise value (5.5 ft in this example). In Figure 2.2(a), if
a person’s height is 5.49 ft, he will be considered as short. This is so because, as
Figure 2.1 (a) Crisp set of values (b) Fuzzy set of values.
Figure 2.2 Membership degree for short (0) vs. tall (1) over height (in feet): (a) crisp set with a sharp threshold at 5.5 ft; (b) fuzzy set with gradual membership.
per the condition, a person is considered tall only if he has a minimum height of
5.5 ft. However, the person has fallen short by only 0.01 ft of being
considered tall.
Figure 2.2(b) illustrates another case in which varying levels of height are
considered, such as very short (here, 𝜇U(x) = 0.2), short (here, 𝜇U(x) = 0.4),
average (here, 𝜇U(x) = 0.5), tall (here, 𝜇U(x) = 0.8), and very tall
(here, 𝜇U(x) = 1). That is, in the fuzzy logic example, we consider the
varying degrees of height, quantified as very short, short, average, tall,
and very tall. This degree of association is technically termed the membership
value. Mathematically, it can be formulated as shown in Eq. (2.2).
Based on the value of height of a male person, we can estimate the degree of his
tallness. In this case, the output is not crisp or precise – tall or not tall (i.e., short).
Rather, the output can be very short, very tall, or even average, considering the
varying value of height. It is a better approach in this case, as one is estimating
the varying possible output by associating it with the degree of tallness. Let 𝜇S(y)
indicate the degree of membership of y in set S. A minimum degree value of 0
indicates that y is least bound to set S, and a value of 1 indicates that y is strongly
bound to set S. Any other value between 0 and 1 indicates the varying degree of
strength by which y is bound to set S.
X 𝝁S (x)
−3 0.1
−2 0.2
−1 0.5
0 1
1 0.5
2 0.2
3 0.1
For x = −1 and x = 1, the membership value is 0.5. For the remaining values
(such as −3, −2, 2, and 3), the membership function values vary between 0 and 1.
The membership degree is close to 1 when the x value is close to 0; as the x value
moves farther away from 0, the membership function value accordingly moves
farther away from 1.
[Figure: the core, support, and boundary regions of a fuzzy membership function 𝜇(x).]
Figure 2.5 (a) Normal fuzzy set; (b) subnormal fuzzy set.
and its height is equal to 1. Also, for a fuzzy set having one and only one
element whose membership value is equal to one, such an element is typically
referred to as the prototypical element. In contrast to the normal fuzzy set, the
subnormal fuzzy set has a height that is always less than 1. Figure 2.5
illustrates the difference between a normal fuzzy set and a subnormal
fuzzy set.
(e) Convex fuzzy set: Here the membership function is either strictly monotonically
increasing, strictly monotonically decreasing, or strictly monotonically
increasing and then strictly monotonically decreasing for increasing values of
the element in the universe of discourse. A fuzzy set A is convex if:
for a, b, c ∈ A and a < b < c, 𝜇A(b) ≥ min[𝜇A(a), 𝜇A(c)]
Figure 2.6 illustrates the difference between a convex fuzzy set and a non-
convex fuzzy set. As can be understood from the figure, in Case (a) the
membership function is initially strictly monotonically increasing and then
strictly monotonically decreasing for increasing value of the element in the
universe of discourse. Hence, it is a case of convex fuzzy set, which is not so
in case of Case (b).
Figure 2.6 (a) Convex fuzzy set; (b) nonconvex fuzzy set.
2.2 Fuzzy Set Operations
Figure 2.7 illustrates the three standard fuzzy set operations – (a) fuzzy union
(𝜇 A U B (x)), (b) fuzzy intersection (𝜇 A ∩ B (x)), and (c) fuzzy complement (𝜇¬A (x)).
These operations are highlighted in grey color. For the union operation in the fuzzy
sets A and B, the result obtained is the maximum value of the membership func-
tion, and for the intersection operation, the result is the minimum of both. In case
of fuzzy complement, the result generated for fuzzy set A includes all elements in
the Universal set that are not in A.
Let us try to find fuzzy union, fuzzy intersection, and fuzzy complement for two
fuzzy sets X and Y , as given in Program 2.1. This is a Python program that ini-
tially creates two fuzzy sets X and Y having four elements each – set X contains
the values 0.3, 0.5, 0.6, and 0.7, and set Y contains the values 0.4, 0.8, 0.2,
and 0.6. Python dictionaries are used in the program to store the membership
values. Also, U, I, and C hold the fuzzy union, fuzzy
intersection, and fuzzy complement of X and Y.
Figure 2.7 Fuzzy set operations: (a) Fuzzy union; (b) fuzzy intersection; and (c) fuzzy
complement.
X = {"x1": 0.3, "x2": 0.5, "x3": 0.6, "x4": 0.7}  # first fuzzy set
Y = {"x1": 0.4, "x2": 0.8, "x3": 0.2, "x4": 0.6}  # second fuzzy set

U = {k: max(X[k], Y[k]) for k in X}     # fuzzy union: max of memberships
I = {k: min(X[k], Y[k]) for k in X}     # fuzzy intersection: min of memberships
C = {k: round(1 - X[k], 2) for k in X}  # fuzzy complement of X

print("Fuzzy Set #1 :", X)
print("Fuzzy Set #2 :", Y)
print("Fuzzy Set Union :", U)
print("Fuzzy Set Intersection :", I)
print("Fuzzy Set Complement :", C)
The output of the Program 2.1 is given next. The output prints both the fuzzy
sets, and also displays the union and intersection of these two sets. Finally, the
complement of the first set is also displayed in the output.
Fuzzy Set #1 : {'x1': 0.3, 'x2': 0.5, 'x3': 0.6, 'x4': 0.7}
Fuzzy Set #2 : {'x1': 0.4, 'x2': 0.8, 'x3': 0.2, 'x4': 0.6}
Fuzzy Set Union : {'x1': 0.4, 'x2': 0.8, 'x3': 0.6, 'x4': 0.7}
Fuzzy Set Intersection : {'x1': 0.3, 'x2': 0.5, 'x3': 0.2, 'x4': 0.6}
Fuzzy Set Complement : {'x1': 0.7, 'x2': 0.5, 'x3': 0.4, 'x4': 0.3}
(a) Involution: The involution property states that the complement of comple-
ment of a set is the set itself.
(X ′ )′ = X
(b) Commutativity: The commutative property for any two fuzzy sets (X and
Y ) states that when applying union or intersection operations, the order of
operands does not alter the result.
(X ∪ Y ) = (Y ∪ X)
(X ∩ Y ) = (Y ∩ X)
(c) Associativity: The associative property for any three fuzzy sets (X, Y , and Z)
states that it can be applied on any two operands followed by the third operand
when applying union or intersection operations. However, the relative order
of operands should not be changed.
(X ∪ Y ) ∪ Z = X ∪ (Y ∪ Z)
(X ∩ Y ) ∩ Z = X ∩ (Y ∩ Z)
(d) Distributivity: The distributive property for any three fuzzy sets (X, Y , and
Z) is explained as given in equations below:
X ∪ (Y ∩ Z) = (X ∪ Y ) ∩ (X ∪ Z)
X ∩ (Y ∪ Z) = (X ∩ Y ) ∪ (X ∩ Z)
(e) Idempotency: For any fuzzy set X, the idempotent property is stated below:
X ∪X =X
X ∩X =X
(f) Identity: If 𝜙 is considered as the null set and U as the Universal set, the
following identity property for a fuzzy set X is given as follow:
X ∪𝜙=X
X ∩𝜙=𝜙
X ∪U =U
X ∩U =X
Here, the union of a fuzzy set X with the null set 𝜙 results in the fuzzy set X,
whereas the intersection of X with 𝜙 results in 𝜙. Also, the union of a fuzzy set X
with the Universal set U results in the Universal set U, while the intersection of X
with U results in the fuzzy set X.
(g) Transitivity: The transitive property for any three fuzzy sets (X, Y , and Z)
states that if X is a subset of Y and Y is a subset of Z, then X is a subset of Z.
This is mathematically explained as:
if X ⊆ Y and Y ⊆ Z then X ⊆ Z
(h) Absorption: The absorption property produces the fuzzy set X after it applies
the union and intersection operations in the below given order for any two
fuzzy sets X and Y .
X ∪ (X ∩ Y ) = X
X ∩ (X ∪ Y ) = X
(i) De Morgan’s Law: De Morgan’s laws for fuzzy sets can be stated as:
a. The complement of a union is the intersection of the complements of the individual sets
(X ∪ Y )′ = X ′ ∩ Y ′
b. The complement of an intersection is the union of the complements of the individual sets
(X ∩ Y )′ = X ′ ∪ Y ′
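With the standard fuzzy operators (max for union, min for intersection, and 1 − 𝜇 for complement), De Morgan's laws can be verified numerically. The two sets below are illustrative examples, not taken from the text.

```python
# Illustrative fuzzy sets over the same universe of elements
X = {"x1": 0.3, "x2": 0.5, "x3": 0.6, "x4": 0.7}
Y = {"x1": 0.4, "x2": 0.8, "x3": 0.2, "x4": 0.6}

def complement(S):
    return {k: round(1 - v, 2) for k, v in S.items()}  # standard complement: 1 - mu

def union(A, B):
    return {k: max(A[k], B[k]) for k in A}             # standard union: max

def intersection(A, B):
    return {k: min(A[k], B[k]) for k in A}             # standard intersection: min

# (X U Y)' = X' ∩ Y'
print(complement(union(X, Y)) == intersection(complement(X), complement(Y)))  # → True
# (X ∩ Y)' = X' U Y'
print(complement(intersection(X, Y)) == union(complement(X), complement(Y)))  # → True
```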
(j) Fuzzy relation: A fuzzy relation is the Cartesian product of two sets X and
Y , in which X and Y are fed as input, and the fuzzy relation is calculated by
finding the Cartesian product of the two sets. Mathematically, the relation can
be stated as follows:
If X ′ is a fuzzy set defined on a set of Universe say X, and Y ′ is a fuzzy set
defined on the set of Universe say Y , then the Cartesian product can be defined
as follows:
X ′ × Y ′ = R′ ⊂ X × Y
Let us try to understand how the Cartesian product is applied on fuzzy sets with the
help of an example. Let the membership function values for X ′ and Y ′ be as follows:
X ′ = {0.2∕x1 + 0.3∕x2 + 1.0∕x3 }
Y ′ = {0.4∕y1 + 0.9∕y2 + 0.1∕y3 }
Now, to find the relation R′ over X ′ × Y ′ , let us calculate the Cartesian product
by plotting the fuzzy relation matrix. Each entry of R′ is obtained by the rule
R′(xi, yj) = min(𝜇X′(xi), 𝜇Y′(yj)):

                     y1    y2    y3
               x1   0.2   0.2   0.1
R′ = X′ × Y′ = x2   0.3   0.3   0.1
               x3   0.4   0.9   0.1

For example, R′(x1, y1) = min(𝜇X′(x1), 𝜇Y′(y1)) = min(0.2, 0.4) = 0.2, and
R′(x3, y2) = min(𝜇X′(x3), 𝜇Y′(y2)) = min(1.0, 0.9) = 0.9.
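The same min rule can be coded directly in Python; the short sketch below reproduces the relation matrix R′ computed above.

```python
X_ = {"x1": 0.2, "x2": 0.3, "x3": 1.0}  # membership values of X'
Y_ = {"y1": 0.4, "y2": 0.9, "y3": 0.1}  # membership values of Y'

# Fuzzy Cartesian product: R'(x, y) = min(mu_X'(x), mu_Y'(y))
R = {(x, y): min(mx, my) for x, mx in X_.items() for y, my in Y_.items()}

for x in X_:
    print(x, [R[(x, y)] for y in Y_])
# → x1 [0.2, 0.2, 0.1]
#   x2 [0.3, 0.3, 0.1]
#   x3 [0.4, 0.9, 0.1]
```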
A binary fuzzy relation is a relation that connects two sets X and Y , usually
denoted by R(X, Y ). When the sets X and Y are not the same, i.e., X ≠ Y , the
relation is referred to as a bipartite graph. On the contrary, if X = Y , it is referred to as
a directed graph or digraph. In fact, as X = Y , the relation can also be written as
R(X, X) or R(X 2 ).
If we consider X = {x1 , x2 , x3 , …, xn } and Y = {y1 , y2 , y3 , …, ym }, the fuzzy relation
can be expressed as a n × m matrix called the fuzzy matrix. The matrix can be
denoted as R(X, Y ), as shown below:
            ⎡ 𝜇R(x1, y1)  𝜇R(x1, y2)  ···  𝜇R(x1, ym) ⎤
R(X, Y ) =  ⎢ 𝜇R(x2, y1)  𝜇R(x2, y2)  ···  𝜇R(x2, ym) ⎥
            ⎢      ⋮           ⋮                ⋮      ⎥
            ⎣ 𝜇R(xn, y1)  𝜇R(xn, y2)  ···  𝜇R(xn, ym) ⎦
For a binary fuzzy relation R(X, Y ), its domain is the fuzzy set dom R(X, Y )
whose membership function is:
𝜇dom R(x) = max_{y∈Y} 𝜇R(x, y), for each x ∈ X
Also, for a fuzzy relation R(X, Y ), its range is the fuzzy set ran R(X, Y )
whose membership function is defined by:
𝜇ran R(y) = max_{x∈X} 𝜇R(x, y), for each y ∈ Y
In addition, the height of a binary fuzzy relation R(X, Y ) is a number h(R), which
is the largest membership grade attained by any pair (x, y) and is defined as:
h(R) = max_{y∈Y} max_{x∈X} 𝜇R(x, y)
To understand further, let us consider a simple example. Given X = {x1 ,
x2 , x3 } and Y = {y1 , y2 }, the Cartesian product of the two sets X and Y results in a
fuzzy relation R that can be expressed as R = X × Y . Let the relation R, expressed in
matrix format, have the fuzzy membership values given in Eq. (2.5):

            ⎡0.6  1.0⎤
R = X × Y = ⎢0.5  0.8⎥        (2.5)
            ⎣0.9  0.3⎦
Now, the domain of relation R (taking the maximum value in each row) can be
defined as:
Dom R = {1.0, 0.8, 0.9}
Next, the range of relation R (considering the maximum value per column) can
be defined as:
Ran R = {0.9, 1.0}
Lastly, the height of relation R (considering the largest membership grade) is:
h(R) = {1.0}
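The domain, range, and height of this relation can be computed directly from the matrix in Eq. (2.5); the sketch below mirrors the max-per-row and max-per-column rules.

```python
# Fuzzy relation R(X, Y) from Eq. (2.5): rows indexed by x, columns by y
R = [[0.6, 1.0],
     [0.5, 0.8],
     [0.9, 0.3]]

dom = [max(row) for row in R]        # max over y for each x  (domain)
ran = [max(col) for col in zip(*R)]  # max over x for each y  (range)
height = max(dom)                    # largest membership grade overall

print(dom, ran, height)  # → [1.0, 0.8, 0.9] [0.9, 1.0] 1.0
```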
[Figures: common membership function shapes – a singleton spike at x = a, a triangular function defined by (a, b, c), a trapezoidal function defined by (a, b, c, d), and a bell-shaped function centered at m; a typical universe of discourse is weight (in kg) from 0 to 100.]
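A triangular membership function, defined by its three parameters (a, b, c), is the simplest of these shapes to implement. The following is a sketch; the “medium weight” parameter values in the usage are invented for illustration.

```python
def triangular(x, a, b, c):
    # Triangular membership: 0 outside [a, c], rising linearly from a to
    # the peak at b, then falling linearly from b to c.
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Illustrative "medium weight" fuzzy set on a 0-100 kg scale
print(round(triangular(40, 20, 50, 80), 2))  # → 0.67
print(round(triangular(50, 20, 50, 80), 2))  # → 1.0
print(round(triangular(95, 20, 50, 80), 2))  # → 0.0
```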
𝜇OIE(A, B, C) = min {𝜇O(A, B, C), 𝜇E(A, B, C), 𝜇I(A, B, C)}
Considering Figure 2.14, if the angle values A = 120∘, B = 40∘, and C = 20∘
are assumed, then
𝜇I(A, B, C) = 0.88, 𝜇E(A, B, C) = 0.80, 𝜇O(A, B, C) = 0.66, and
𝜇OIE(A, B, C) = min(0.88, 0.80, 0.66) = 0.66.
Table 2.2 Pairwise comparison of breakfast items (each entry is the number of people who prefer the column item over the row item).

             Cereal   Sandwich   Pancakes   Sausages
Cereal         —         65         32         20
Sandwich      40         —          38         42
Pancakes      45         50         —          70
Sausages      80         20         98         —
Total        165        135        168        132
Percentage   27.5%      22.5%      28%        22%
Rank order   2nd        3rd        1st        4th
Based on the survey, it is found that 65 people prefer sandwich over cereal,
32 people prefer pancakes over cereal, and 20 people prefer sausages over
cereal. Once all the pairwise comparisons are filled in, it is found that
pancakes get the first preference, having received the highest number of votes
compared to the other three items. Cereal, sandwich, and sausages are
ranked second, third, and fourth, respectively.
Based on the percentage value obtained in Table 2.2, the corresponding
membership function can be drawn by following the inference method, as
given in Figure 2.15.
As can be seen in this example, there are four items – sandwich, pancakes,
sausages, and cereal. Hence the total number of items (n) is 4. Therefore, the
number of judgments or comparisons to be made (N) can be calculated using
the Eq. (2.11):
[Figure 2.15 Membership function inferred from the rank ordering of the breakfast items – cereal, sandwich, pancakes, and sausages.]
Figure 2.16 Linguistic terms and their corresponding 𝜃 values using the angular fuzzy model: positive (𝜃 = 45), neutral (𝜃 = 0), and negative (𝜃 = −45).
parts – training and testing. The training set is used to train the neural
network until it converges, and the testing set is used to validate it.
As seen in Figure 2.19(a), the neural network considers two inputs X1 and
X2, which can be treated as the coordinate values of a data point. The training
is carried out, and the output determines to which of the three classes C1, C2,
and C3 the input (X1, X2) belongs. Table 2.3 shows the set of two-coordinate
input values, each of which should belong to one of the three classes – C1, C2,
or C3. Whichever point (P1, P2, P3, or P4) falls into a given class (C1, C2, or C3)
gets a membership value of 1 for that class and a membership value of 0 for the
other classes. The final result is displayed in graphical format in Figure 2.19(b).
With more appropriate training instances, the classification behavior of the
neural network can be further improved.
While applying the neural network for the testing set, the result that will
be generated for each data point will be fuzzy in nature. For example, a data
point X(x1, x2) may belong to class C1 with a fuzzy value of 0.2, C2 with a
fuzzy value of 0.7, and C3 with a fuzzy value of 0.1. This clearly indicates that
the data point “X” belongs to class C2 with the highest fuzzy value of 0.7.
Figure 2.17 An example of a company’s earnings for a year using the angular fuzzy model.
Figure 2.19 (a) ANN with two inputs and three class outputs; (b) graphical result of the classified output.
[Table 2.3: the coordinate input values for points P1–P4 and their class memberships.]
v. Genetic algorithm: Genetic algorithms (GA) can be used to determine fuzzy
membership functions by mapping a set of input values to the corresponding
output degrees of a membership function. In GA, the membership functions are
coded into bit strings that are then concatenated. Next, an evaluation function is
used to evaluate the fitness of each set of membership functions.
Let us consider an example to understand the concept of GA in determining
fuzzy membership functions. Consider the input and output fuzzy member-
ship function, as shown in Figure 2.20. The linguistic rules can be as follows:
Rule 1: If x is slow, then y is easy
Rule 2: If x is fast, then y is difficult
[Figure: block diagram of a fuzzy system – the crisp input is fuzzified into a fuzzy input, processed by the fuzzy logic module into a fuzzy output, and finally defuzzified into a crisp output.]
Adult = {(15, 0.3), (20, 0.5), (35, 0.9), (50, 0.5), (55, 0.2)}
Now, using the height method, it can be considered that the crisp value for an
adult is 35. Therefore, a person of 35 years can be considered as an adult.
Adult = {(20, 0.3), (25, 0.5), (30, 0.9), (40, 0.9), (45, 0.4), (50, 0.2)}
Now, using the mean-max membership method, it can be considered that the
crisp value for an adult is (30 + 40)/2 = 35. Therefore, a person of 35 years can
be considered as an adult.
iii. Centroid method: This method, also known as the center of mass, center of
area, or center of gravity, is the most commonly used defuzzification method.
The basic principle in this method is to find the point z* where a vertical line
would divide the aggregate area into two equal masses. To find the center of gravity,
the entire area is divided into subregions (such as triangles, trapezoids, rectangles,
etc.). The aggregated output is divided into these regular structures, and the area
under the curve of each structure is found. The areas and centers of gravity of
these subareas are then combined to determine the defuzzified value for the
fuzzy set.
This centroid method of defuzzification can be illustrated with an example
given in Figure 2.25.
For a continuous set, the defuzzified output z* is given by the algebraic
equation shown in Eq. (2.15):

z* = ∫ 𝜇(z) · z dz ∕ ∫ 𝜇(z) dz        (2.15)

Here, ∫ 𝜇(z) dz denotes the area of the region bounded by the curve 𝜇(z).
For a discrete set, the defuzzified output z* is given by the algebraic equation
shown in Eq. (2.16):

z* = (∑ᵢ₌₁ⁿ xᵢ 𝜇(xᵢ)) ∕ (∑ᵢ₌₁ⁿ 𝜇(xᵢ))        (2.16)
[Table: for each subarea – subarea no., area 𝜇(xᵢ), center of gravity xᵢ, and the product xᵢ 𝜇(xᵢ).]
iv. Weighted average method: This method is applicable only for symmetrical
output membership functions. Each membership function is weighted by its
respective maximum membership value. This method can be illustrated with
the help of Figure 2.26.
In the weighted average method, the defuzzified output z* is given by the
algebraic equation shown in Eq. (2.17):

z* = ∑ 𝜇(z) · z ∕ ∑ 𝜇(z)        (2.17)

Here, z is the element with the maximum membership value in each symmetric
membership function, 𝜇(z) is that maximum membership value, and ∑ denotes
algebraic summation.
Let us consider an example of a fuzzy set Z whose elements are paired with
their corresponding maximum membership values, as shown below:
Z = {(60, 0.6), (70, 0.4), (80, 0.2), (90, 0.2)}
Now the defuzzified value z* for the given set Z will be:
z* = [(0.6 × 60) + (0.4 × 70) + (0.2 × 80) + (0.2 × 90)] ∕ (0.6 + 0.4 + 0.2 + 0.2) = 98 ∕ 1.4 = 70
[Figure 2.27: an aggregated membership function 𝜇(z) over z = 0 to 12, attaining its maximum membership at z = 4, 6, and 8.]
Now, using the weighted average method, it can be considered that the crisp
value for the given dataset Z is 70.
v. Maxima methods: The maxima methods consider values with the maximum
membership. There are different maxima methods found, which are discussed
below:
(a) First of maxima method (FOM): Here the smallest value from the domain
with maximum membership value is considered. From the illustration
shown in Figure 2.27, the defuzzified value z* of the fuzzy set is 4.
(b) Last of maxima method (LOM): Here the largest value from the domain
with maximum membership value is considered. From the illustration
shown in Figure 2.27, the defuzzified value z* of the fuzzy set is 8.
(c) Mean of maxima method (MOM): Here the element with the highest mem-
bership value is considered. If there is more than one such element, the mean
value of the maxima is considered. From Figure 2.27, the values of z that have
the maximum membership value are 4, 6, and 8, and hence the cardinality (n)
is 3. The defuzzified value z* is given by z* = (4 + 6 + 8)/3 = 6.
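These discrete defuzzification methods can be sketched in a few lines of Python; the fuzzy set below reuses the weighted-average example, and the function names are illustrative:

```python
# Defuzzification of a discrete fuzzy set given as (element, membership) pairs.
Z = [(60, 0.6), (70, 0.4), (80, 0.2), (90, 0.2)]

def weighted_average(fuzzy_set):
    # z* = sum(mu * z) / sum(mu), Eq. (2.17) for a discrete set
    num = sum(mu * z for z, mu in fuzzy_set)
    den = sum(mu for _, mu in fuzzy_set)
    return num / den

def maxima(fuzzy_set):
    # Elements sharing the maximum membership value
    peak = max(mu for _, mu in fuzzy_set)
    zs = [z for z, mu in fuzzy_set if mu == peak]
    fom = min(zs)             # first of maxima
    lom = max(zs)             # last of maxima
    mom = sum(zs) / len(zs)   # mean of maxima
    return fom, lom, mom

print(round(weighted_average(Z), 2))   # 70.0
print(maxima(Z))                       # (60, 60, 60.0): only one peak here
```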
Definitely, fuzzy c-means takes more time to generate results than the traditional
K-means, as it requires a greater number of steps as explained below.
Let us now understand how the fuzzy c-means algorithm works. The main task
is to group a set of "n" data points into "c" clusters. The algorithm initially requires
randomly selecting the membership values of each data point for each cluster.
It then requires selecting the centroid value of each cluster based on this
random assignment of membership values. The entire algorithm can be divided
into the following steps:
Let us consider a set of four data points – (1, 5), (2, 7), (3, 4), and (4, 6). Also,
let us consider two clusters – cluster 1 and cluster 2. The first step involves randomly
assigning fuzzy membership values (𝛾) to each data point to decide its probabil-
ity of belonging to either of the clusters. This is depicted in Table 2.5, which
shows the fuzzy membership matrix. As can be seen from the table, the
data point (1, 5) belongs to Cluster 1 with a probability (membership value) of 0.8
and to Cluster 2 with a probability (membership value) of 0.2. Similarly, the data
point (2, 7) belongs to Cluster 1 with a probability (membership value) of 0.4 and to
Cluster 2 with a probability (membership value) of 0.6. For the next two data points
too, membership values have been randomly assigned as shown in Table 2.5.
The second step is to find the value of the centroid for both the clusters using
Eq. (2.18):

vᵢⱼ = ∑ₖ₌₁ⁿ 𝛾ᵢₖᵐ · xₖⱼ / ∑ₖ₌₁ⁿ 𝛾ᵢₖᵐ     (2.18)

Here, "𝛾" is the membership value, and "m" is the fuzziness parameter (gener-
ally 1.25 ≤ m ≤ 2). Often the value of "m" is taken as 2. Also, 𝛾ᵢₖ means the
membership value of the kth data point in the ith cluster, and xₖⱼ is the jth
coordinate of the kth data point.
Considering Table 2.5, the centroid for Cluster 1 is calculated as:

v₁₁ = (0.8² × 1 + 0.4² × 2 + 0.9² × 3 + 0.3² × 4) / (0.8² + 0.4² + 0.9² + 0.3²) = 3.75/1.7 = 2.20
v₁₂ = (0.8² × 5 + 0.4² × 7 + 0.9² × 4 + 0.3² × 6) / (0.8² + 0.4² + 0.9² + 0.3²) = 8.1/1.7 = 4.76

Similarly, the centroid for Cluster 2 is calculated as:

v₂₁ = (0.2² × 1 + 0.6² × 2 + 0.1² × 3 + 0.7² × 4) / (0.2² + 0.6² + 0.1² + 0.7²) = 2.75/0.9 = 3.06
v₂₂ = (0.2² × 5 + 0.6² × 7 + 0.1² × 4 + 0.7² × 6) / (0.2² + 0.6² + 0.1² + 0.7²) = 5.7/0.9 = 6.33

Therefore, the centroids for Cluster 1 and Cluster 2 are (2.20, 4.76) and (3.06,
6.33), respectively.
Step 3: Find out the distance of each data point from the centroid of each
cluster.
The first data point considered in our example is (1, 5). Using the Euclidean
distance measure, the distance is measured from this data point to the centroid
values (2.20, 4.76) and (3.06, 6.33).
For the data point (1, 5), distance D11 from the centroid value (2.20, 4.76) is:
D11 = ((1 − 2.20)² + (5 − 4.76)²)^0.5 = 1.22
Again, for the data point (1, 5), distance D12 from the centroid value (3.06,
6.33) is:
D12 = ((1 − 3.06)² + (5 − 6.33)²)^0.5 = 2.45
Let us also find the distances of the second data point (2, 7) from the centroid
values (2.20, 4.76) and (3.06, 6.33).
For the data point (2, 7), distance D21 from the centroid value (2.20, 4.76) is:
D21 = ((2 − 2.20)² + (7 − 4.76)²)^0.5 = 2.25
Again, for the data point (2, 7), distance D22 from the centroid value (3.06,
6.33) is:
D22 = ((2 − 3.06)² + (7 − 6.33)²)^0.5 = 1.25
Similarly, the distances of the remaining two data points (3, 4) and (4, 6) from
the centroids are to be calculated. The final calculation is given in Table 2.6 for
all the data points.
Step 4: Update the membership values of each data point
Based on the distance calculation, the membership values in the membership
matrix (Table 2.5) are now required to be updated.
For the first data point, the distances calculated in the previous steps are D11 = 1.22
and D12 = 2.45. Therefore, the updated membership values can be obtained as:

𝛾₁₁ = [{(1.22)²/(1.22)² + (1.22)²/(2.45)²}^(1/(2−1))]⁻¹ = 0.80
𝛾₁₂ = [{(2.45)²/(2.45)² + (2.45)²/(1.22)²}^(1/(2−1))]⁻¹ = 0.20

As can be seen,
𝛾₁₂ = 1 − 𝛾₁₁ = 0.20
In this case, the updated membership values 𝛾₁₁ and 𝛾₁₂ are 0.80 and 0.20,
respectively. These happen to be exactly the same as the originally randomly
assigned membership values, as can be seen in Table 2.5.
Let us again calculate the new membership values for the second data point
(2, 7). For this data point, the distances calculated in the previous steps are
D21 = 2.25 and D22 = 1.25. Therefore, the updated membership values can be
obtained as:

𝛾₂₁ = [{(2.25)²/(2.25)² + (2.25)²/(1.25)²}^(1/(2−1))]⁻¹ = 0.24
𝛾₂₂ = [{(1.25)²/(1.25)² + (1.25)²/(2.25)²}^(1/(2−1))]⁻¹ = 0.76
Likewise, compute all other membership values for both the clusters and
accordingly update the membership matrix. The final result is displayed in the
Table 2.7.
Step 5: Steps 2–4 are to be repeated until:
• the membership values no longer change, or,
• the difference between two consecutive updates is less than the tolerance
value (a small value up to which the change in membership values is acceptable).
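One iteration of the worked example above (Steps 2–4) can be reproduced in NumPy; the variable names are illustrative:

```python
import numpy as np

# Data points and randomly assigned membership values (Table 2.5)
X = np.array([[1, 5], [2, 7], [3, 4], [4, 6]], dtype=float)
gamma = np.array([[0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.3, 0.7]])  # rows: points, columns: clusters
m = 2  # fuzziness parameter

# Step 2: centroids v_ij = sum_k gamma_ik^m x_kj / sum_k gamma_ik^m (Eq. 2.18)
g = gamma ** m
centroids = (g.T @ X) / g.sum(axis=0)[:, None]
print(np.round(centroids, 2))      # centroids near (2.21, 4.76) and (3.06, 6.33)

# Step 3: Euclidean distance of every point to every centroid
dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# Step 4: membership update, gamma_ik = 1 / sum_j (d_ik/d_jk)^(2/(m-1))
ratio = (dist[:, :, None] / dist[:, None, :]) ** (2 / (m - 1))
gamma_new = 1 / ratio.sum(axis=2)
print(np.round(gamma_new[0], 2))   # memberships of (1, 5) stay near 0.8 and 0.2
```

In a full run, Steps 2–4 would be repeated until the membership matrix stops changing beyond the chosen tolerance.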
#Program 2.2: Fuzzy c-means clustering on the Iris dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

#Alternative: upload and read the CSV version of the dataset in Google Colab
#from google.colab import files
#uploaded = files.upload()
#df = pd.read_csv('Iris.csv').drop(columns=['Id', 'Species'])

#Load Dataset
iris = load_iris()
df = pd.DataFrame(iris.data)
print("Loading the First Five Records")
print(df.head())

#Number of data points
n = len(df)
#Number of clusters
k = 3
#Dimension of data
d = 4
#m parameter (fuzziness)
m = 2
#Number of iterations
MAX_ITERS = 12

def initializeMembershipWeights():
    #Random membership values; each row sums to 1
    weight = np.random.dirichlet(np.ones(k), n)
    weight_arr = np.array(weight)
    return weight_arr

def updateWeights(weight_arr, C):
    #Distance of every data point from each centroid C[i]
    dist = np.zeros((n, k))
    for i in range(k):
        diff = (df.iloc[:, :].values - C[i])**2
        dist[:, i] = np.sqrt(np.sum(diff, axis=1))
    #Inverse-distance ratios raised to 2/(m-1), matching the
    #squared-distance ratios used in Step 4 of the worked example
    power = np.power(1/dist, 2/(m-1))
    denom = np.sum(power, axis=1)
    for i in range(k):
        weight_arr[:, i] = np.divide(power[:, i], denom)
    return weight_arr
The output for Program 2.2 is displayed using visualization graphs.
Figure 2.28 shows the scatter plot diagram for (a) the sepal length versus the sepal
width, and (b) the petal length versus the petal width. The three red diamond-
shaped markers indicate the final centroid values for the three different clusters.
The color of each data point indicates the cluster to which it belongs.
Figure 2.28 Scatter plot for (a) Sepal length versus sepal width, and (b) Petal length
versus petal width.
The Fuzzy c-means is a widely used algorithm for clustering data points
that may belong to more than one cluster. While Program 2.2 uses the
standard Python libraries to implement the Fuzzy c-means algorithm, the skfuzzy
(scikit-fuzzy) library can also be used to implement the same algorithm, as it has
rich built-in functions specifically dedicated to fuzzy operations.
The easiest way to implement fuzzy logic in Python is by using the scikit-fuzzy
package. On Linux, the package can be installed using pip:
pip install -U scikit-fuzzy
Fuzzy logic has wide application in many industrial and home appliances, from
the smooth transition of a building elevator to washing machines, air condi-
tioners, and vacuum cleaners. Most of the recent fuzzy-based control systems are
adaptive, i.e., the membership function shapes and scaling factors are changed
during the course of operation based on constraints and the environmental
influence of the input variables.
Exercises
A) Choose the correct answer from among the alternatives given:
a) Fuzzy logic is based on:
i) Crisp set logic
ii) Fixed-valued logic
iii) Multivalued logic
iv) Binary set logic
b) A fuzzy set whose membership function has at least one value equal to 1 is
called __________
i) Normal fuzzy set
ii) Convex fuzzy set
iii) Support fuzzy set
iv) Membership fuzzy set
72 2 Fuzzy Computing
c) If A = {0.6/p, 0/q, 1/r, 0/s, 0.7/t}, then, support(A) and core(A) are
_________ and _________, respectively.
i) {q, s}, {p, r, t}
ii) {q, s}, {r}
iii) {q, r, s}, {p, t}
iv) {p, r, t}, {r}
d) The _________ property states that the complement of a set is the set itself.
i) associative
ii) involution
iii) commutative
iv) transitivity
e) The Gaussian membership function is defined by two parameters, namely
___________ and ___________.
i) Mean, standard deviation
ii) Mean, fuzzification factor
iii) Standard deviation, fuzzification factor
iv) Lower boundary value, height
f) A sigmoidal membership function is formulated as:
i) 𝜇(x) = 1 / (1 + e^(−a(x−b)))
ii) 𝜇(x) = e^(−(1/2)·((x−f)/σ)²)
iii) 𝜇(θ) = t · tan θ
iv) 𝜇S(x) = 1 / (1 + x²)
g) The region of universe that is characterized by complete membership in the
set is called
i) Core
ii) Support
iii) Boundary
iv) Fuzzy
h) Which among the following is not a method of defuzzification?
i) Max-membership principle
ii) Mean-max membership
iii) Convex-max membership
iv) Weighted average method
i) In genetic algorithm, the membership functions are coded into
____________.
i) Bit strings
9) Consider a fuzzy set for optimum number of hours of sleep per day (SPD)
defined as follows:
SPD = {(5, 0.3), (6, 0.5), (8, 0.9), (12, 0.5), (14, 0.2)}
Find the crisp value for Optimum Sleeping Hours per Day using:
i) The height method
ii) The mean-max membership method
iii) The centroid method
iv) The weighted-average method
76 3 Artificial Neural Network
[Figure: Biological neuron, showing the soma, dendrites, axon, and synapse.]
[Figure: Artificial neuron: inputs Xᵢ are multiplied by weights Wᵢ, a weighted-sum unit Σ produces a, and an activation unit f(a) produces the outputs.]
The tunable parameters of this entire network are the weights (wᵢ) that are
applied to each input. When an input is combined with a weight by a node, it is
either amplified or dampened, thereby assigning significance to inputs with
regard to the task the algorithm is trying to learn. If the signal passes through, the
neuron is said to be "activated."
There are standard mathematical procedures, often called training or
learning, for tuning the weight values. Tuning the weights in a neural network is a
crucial step in the training process. The goal of weight tuning is to find the optimal
set of weights and biases that allow the network to make accurate predictions or
classifications. This process is essential for achieving high performance in various
machine learning tasks. Some of the important goals of weight tuning include:
(a) Minimizing Loss Function: The primary goal of weight tuning is to mini-
mize the loss function, which quantifies the difference between the network’s
predictions and the actual target values. Lower loss indicates better alignment
between predictions and targets.
(b) Increasing Accuracy: Weight tuning aims to increase the accuracy of the
network’s predictions or classifications on both the training data and new,
unseen data.
(c) Avoiding Overfitting: Weight tuning helps prevent overfitting, where the
network memorizes the training data instead of learning the underlying pat-
terns. Overfitting can lead to poor performance on new data.
(d) Achieving Convergence: Properly tuned weights contribute to faster conver-
gence during training. This means the network reaches a satisfactory level of
performance in fewer training iterations.
(e) Ensuring Stable Learning: Well-tuned weights promote stable learning
dynamics and prevent issues like vanishing gradients, which can hinder the
training process in deep networks.
This process of weight tuning is typically achieved through optimization
algorithms that iteratively update the network's parameters. The mathematical
procedures involved in this training process are described next:
(a) Forward Pass: During each iteration of training, a set of input data (often
called a training sample) is fed into the neural network. Each neuron in the
network receives inputs from the previous layer, computes a weighted sum of
those inputs, adds a bias term, and applies an activation function to produce
an output.
(b) Compute Loss: The output of the neural network is compared to the actual
target output for that input using a loss function (also known as a cost func-
tion). The loss function quantifies how far off the network’s predictions are
from the desired outputs. Common loss functions include mean squared error
for regression tasks and cross-entropy for classification tasks.
(c) Backpropagation: Backpropagation is the core of the training process. It cal-
culates the gradients of the loss with respect to the weights and biases of the
network. These gradients indicate how much each weight and bias contribute
to the overall error. The gradients are calculated using the chain rule of calcu-
lus. The process starts from the output layer and propagates backward through
the layers.
(d) Weight Update: With the gradients calculated, optimization algorithms are
used to update the weights and biases. These algorithms adjust the parameters
in a way that reduces the loss function. One common optimization algorithm
is gradient descent. It involves subtracting a fraction (learning rate) of the
gradient from each weight. The direction of the update is determined by the
gradient’s sign.
(e) Batch and Epochs: Instead of updating weights after every single input
(stochastic gradient descent), optimization often involves using batches of
inputs. The gradient is averaged over the batch, and a weight update is
applied. Training is typically organized into epochs, where the entire training
dataset is processed. The network goes through multiple epochs to iteratively
improve its performance.
(f) Regularization: To prevent overfitting, regularization techniques are often
applied during training. Regularization adds a penalty term to the loss function
based on the magnitudes of the weights. L1 and L2 regularization are common
approaches.
(g) Hyperparameters: Learning rate, regularization strength, batch size, and
more are hyperparameters that need to be tuned. They impact the training
process, and can affect convergence and generalization.
(h) Termination: Training continues for a predefined number of epochs or until
a termination criterion is met (e.g., loss drops below a threshold, performance
stabilizes).
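The steps above can be sketched as a minimal batch-gradient-descent loop; the single linear neuron, data, and hyperparameters below are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # training inputs
y = X @ np.array([2.0, -1.0]) + 0.5    # targets from a known linear rule

w = np.zeros(2)                        # weights
b = 0.0                                # bias
lr = 0.1                               # learning rate (a hyperparameter)

for epoch in range(200):               # epochs over the whole dataset
    pred = X @ w + b                   # forward pass
    err = pred - y
    loss = np.mean(err**2)             # compute loss (mean squared error)
    grad_w = 2 * X.T @ err / len(y)    # backpropagation via the chain rule
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w                   # weight update (gradient descent)
    b -= lr * grad_b

print(np.round(w, 2), round(b, 2))     # approaches [2.0, -1.0] and 0.5
```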
The training process is iterative, with each iteration aiming to adjust the weights
and biases in a way that improves the network's performance.
During the forward pass, the weighted sum of inputs is calculated for each
neuron in the hidden layer as follows:
z_h1 = w_h1 * x1 + w_h2 * x2 + b_h1
z_h2 = w_h3 * x1 + w_h4 * x2 + b_h2
z_h3 = w_h5 * x1 + w_h6 * x2 + b_h3
Next, an activation function (e.g., sigmoid) is applied to the hidden layer’s
outputs. This can be done using Python code as follows:
a_h1 = sigmoid(z_h1)
a_h2 = sigmoid(z_h2)
a_h3 = sigmoid(z_h3)
Next, the weighted sum of inputs is calculated for the output neuron as follows:
z_o = w_o1 * a_h1 + w_o2 * a_h2 + w_o3 * a_h3 + b_o
Finally, an activation function is applied to the output neuron’s output
(e.g., linear for regression or sigmoid for binary classification):
a_o = sigmoid(z_o)
Step 2: Compute Loss
Assuming the target output y_target = 0.9, the loss function (mean squared
error) is calculated as:
loss = 0.5 * (y_target - a_o)**2
Step 3: Backpropagation
Now, let us calculate the gradients of the loss with respect to the weights and biases
in reverse order:
(a) Output Layer:
• Calculate the gradients of the weighted sum with respect to the output-layer
weights and bias (d_z_o/dw_o1, d_z_o/dw_o2, d_z_o/dw_o3,
d_z_o/db_o):
d_z_o/dw_o1 = a_h1
d_z_o/dw_o2 = a_h2
d_z_o/dw_o3 = a_h3
d_z_o/db_o = 1
(b) Hidden Layer:
• Use the chain rule to calculate the gradient of the loss with respect to the
hidden layer activations (d_loss/da_h1, d_loss/da_h2, d_loss/
da_h3):
d_a_h1/dz_h1 = sigmoid_derivative(z_h1)
d_a_h2/dz_h2 = sigmoid_derivative(z_h2)
d_a_h3/dz_h3 = sigmoid_derivative(z_h3)
• Calculate the gradients of the weighted sum with respect to the hidden layer
weights and bias (d_z_h1/dw_h1, d_z_h1/dw_h2, d_z_h1/db_h1,
d_z_h2/dw_h3, d_z_h2/dw_h4, d_z_h2/db_h2, d_z_h3/dw_h5,
d_z_h3/dw_h6, d_z_h3/db_h3):
d_z_h1/dw_h1 = x1
d_z_h1/dw_h2 = x2
d_z_h1/db_h1 = 1
3.2 Standard Activation Functions in Neural Networks 81
d_z_h2/dw_h3 = x1
d_z_h2/dw_h4 = x2
d_z_h2/db_h2 = 1
d_z_h3/dw_h5 = x1
d_z_h3/dw_h6 = x2
d_z_h3/db_h3 = 1
Step 4: Weight Update:
Using the calculated gradients, the weights and biases in the network are adjusted
using an optimization algorithm like gradient descent. The goal is to update the
parameters in a way that reduces the loss. This process of calculating gradients
and updating weights is performed iteratively over multiple epochs until the loss
converges to a satisfactory level or the training process is completed.
Backpropagation is a fundamental concept that allows neural networks to learn
from data by adjusting their parameters to minimize prediction errors. While the
example provided is simplified, the actual calculations can be more complex, espe-
cially in deeper networks with more layers and connections.
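The 2-3-1 network walked through above can be implemented end to end; the initial random weights, inputs, and learning rate below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(1)
x = np.array([0.5, 0.1])           # inputs x1, x2
y_target = 0.9

W_h = rng.normal(size=(3, 2))      # hidden weights w_h1..w_h6
b_h = np.zeros(3)                  # hidden biases b_h1..b_h3
W_o = rng.normal(size=3)           # output weights w_o1..w_o3
b_o = 0.0

def forward():
    z_h = W_h @ x + b_h            # weighted sums of the hidden layer
    a_h = sigmoid(z_h)             # hidden activations
    z_o = W_o @ a_h + b_o          # weighted sum of the output neuron
    a_o = sigmoid(z_o)             # network output
    return z_h, a_h, z_o, a_o

z_h, a_h, z_o, a_o = forward()
loss_before = 0.5 * (y_target - a_o)**2

# Backpropagation: chain rule from the output back to the hidden layer
d_a_o = -(y_target - a_o)                  # d_loss/d_a_o
d_z_o = d_a_o * sigmoid_derivative(z_o)    # d_loss/d_z_o
grad_W_o = d_z_o * a_h                     # since d_z_o/dw_o = a_h
grad_b_o = d_z_o
d_a_h = d_z_o * W_o                        # propagate to hidden activations
d_z_h = d_a_h * sigmoid_derivative(z_h)
grad_W_h = np.outer(d_z_h, x)              # since d_z_h/dw_h = x
grad_b_h = d_z_h

# One gradient-descent weight update
lr = 0.1
W_o -= lr * grad_W_o; b_o -= lr * grad_b_o
W_h -= lr * grad_W_h; b_h -= lr * grad_b_h

loss_after = 0.5 * (y_target - forward()[3])**2
print(loss_after < loss_before)    # the step reduces the loss
```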
[Figure 3.3: Binary step activation function f(a) over a from −8 to +8.]
Figure 3.3 shows a case of the binary step function, which results in either of
two outputs – 0 or 1. The x-axis denotes the value of "a," which lies within the
range −8 to +8. When the value of "a" is greater than or equal to 0, the output
f(a) is 1 (as can be seen from the figure), and the output f(a) is 0
when the value of "a" is less than 0. Mathematically, the binary step function for
Figure 3.3 can be represented as given in Equation (3.3):

f(a) = 0 for a < 0;  f(a) = 1 for a ≥ 0     (3.3)
To plot the graph as shown in Figure 3.3, the corresponding Python code is pro-
vided in Program 3.1. The plotted graph displays the binary step function based on
the value of "a." The output generated can be either 0 or 1, based on the condition
given in the program.
import numpy as np
import matplotlib.pyplot as plt

def binaryStep(a):
    return np.heaviside(a, 1)

a = np.linspace(-8, 8)
plt.plot(a, binaryStep(a))
plt.title('Binary Step Activation Function', fontweight='bold')
plt.xlabel("a", fontweight='bold')
plt.ylabel("f(a)", fontweight='bold')
plt.show()
f (a) = a (3.4)
Figure 3.4 shows a case of linear activation function that results in the same
value of “a” for f (a).
To plot the graph as shown in Figure 3.4, the corresponding Python code is pro-
vided in Program 3.2. The plotted graph displays the linear function based on the
value of "a." The output generated is proportional to the value of "a," as given in
the condition of the program.
import numpy as np
import matplotlib.pyplot as plt

def linear(a):
    return a

a = np.linspace(-8, 8)
plt.plot(a, linear(a))
plt.title('Linear Activation Function', fontweight='bold')
plt.xlabel("a", fontweight='bold')
plt.ylabel("f(a)", fontweight='bold')
plt.show()
The linear activation function suffers from a major limitation in neural networks:
because its derivative is a constant, backpropagation cannot be used to determine
how the weights should change based on the errors found.
[Figure: Sigmoid activation function f(a) over a from −8 to +8.]
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(a):
    return 1/(1+np.exp(-a))

a = np.linspace(-8, 8)
plt.plot(a, sigmoid(a))
plt.title('Sigmoid Activation Function', fontweight='bold')
plt.xlabel("a", fontweight='bold')
plt.ylabel("f(a)", fontweight='bold')
plt.show()
The sigmoid function suffers from a major problem called the vanishing
gradient problem, which happens because even large inputs are squashed
into the small output range of 0–1. The derivatives therefore become very
small, and the function does not always give satisfactory output.
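This squashing effect can be checked numerically; the short sketch below (not from the text) evaluates the sigmoid derivative at a few points:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def sigmoid_derivative(a):
    s = sigmoid(a)
    return s * (1 - s)

# The derivative peaks at a = 0 and collapses for large |a|, so gradients
# flowing through saturated sigmoid units nearly vanish during training.
print(sigmoid_derivative(0.0))   # 0.25, the maximum possible value
print(sigmoid_derivative(8.0))   # about 0.0003
```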
The ReLU (rectified linear unit) activation function outputs the input directly if the
input is positive; otherwise, it displays the output as zero (0). Mathematically, the
ReLU activation function can be represented as given in Equation (3.6):
f(a) = max(0, a)     (3.6)
Figure 3.6 shows a case of the ReLU activation function that displays the output as
0 if the input is less than or equal to zero; otherwise, it displays the same value as
the input "a." Thus, the result obtained is within the range 0 to a.
To plot the graph as shown in Figure 3.6, the corresponding Python code is pro-
vided in Program 3.4. A simple if–else statement is used to handle the condition
and display the output accordingly for the given input, which is in the range of −8
to +8.
import numpy as np
import matplotlib.pyplot as plt

def relu(a):
    z = []
    for i in a:
        if i < 0:
            z.append(0)
        else:
            z.append(i)
    return z

a = np.linspace(-8, 8)
plt.plot(a, relu(a))
plt.title('ReLU Activation Function', fontweight='bold')
plt.xlabel("a", fontweight='bold')
plt.ylabel("f(a)", fontweight='bold')
plt.show()
The ReLU function helps resolve the vanishing gradient problem found with
the sigmoid activation function. Also, the ReLU function takes less time to
compute and helps the model minimize errors. However, one issue that may occur
while using the ReLU function is that it will constantly provide the output 0
if the neurons get stuck with negative values. As a result, "dead neurons" are
created that can never recover.
The tanh activation function is often preferred for the hidden layers of a neural
network, as it allows centering of the data by bringing the mean value close to 0.
To plot the graph as shown in Figure 3.7, the corresponding Python code is
provided in Program 3.5. In this program, the simple tanh() function is used to
take the value of “a” as input and display the corresponding output. The input “a”
is expected to be in the range of −8 to +8.
import numpy as np
import matplotlib.pyplot as plt

def tanh(a):
    return np.tanh(a)

a = np.linspace(-8, 8)
plt.plot(a, tanh(a))
plt.title('tanh Activation Function', fontweight='bold')
plt.xlabel("a", fontweight='bold')
plt.ylabel("f(a)", fontweight='bold')
plt.show()
Like the sigmoid function, the tanh function is often used for binary classifi-
cation. However, tanh also suffers from the vanishing gradient problem near the
boundaries, just as in the case of the sigmoid activation function.
The leaky ReLU activation function multiplies negative inputs by a small constant
factor, so that, unlike the negative region of the ReLU function (where f(a) = 0),
the output is not exactly zero but a small value close to zero.
To plot the graph as shown in Figure 3.8, the corresponding Python code is pro-
vided in Program 3.6. Here, in this program, the constant factor used for
multiplying when the input a is less than 0 is 0.05.
import numpy as np
import matplotlib.pyplot as plt

def lkrelu(a):
    z = []
    for i in a:
        if i < 0:
            z.append(0.05 * i)
        else:
            z.append(i)
    return z

a = np.linspace(-8, 8)
plt.plot(a, lkrelu(a))
plt.title('Leaky ReLU Activation Function', fontweight='bold')
plt.xlabel("a", fontweight='bold')
plt.ylabel("f(a)", fontweight='bold')
plt.show()
The leaky ReLU activation function is hardly used in cases where the ReLU
provides an optimal output. However, to avoid the "dying ReLU" problem, leaky
ReLU is often considered the better choice, as it keeps neurons with negative
inputs trainable.
Figure 3.9 shows a case of SoftMax activation function that considers the input
range between −8 and +8. If the number of classes is only two, the SoftMax func-
tion produces the same output as the sigmoid function.
To plot the graph as shown in Figure 3.9, the corresponding Python code is pro-
vided in Program 3.7. The SoftMax function uses the exponential that acts as the
nonlinear function.
import numpy as np
import matplotlib.pyplot as plt

def softmax(a):
    return np.exp(a) / np.sum(np.exp(a), axis=0)

a = np.linspace(-8, 8)
plt.plot(a, softmax(a))
plt.title('SoftMax Activation Function', fontweight='bold')
plt.xlabel("a", fontweight='bold')
plt.ylabel("f(a)", fontweight='bold')
plt.show()
Usually, the ReLU activation function is used in the hidden layers to avoid the
vanishing gradient problem and for better computational performance, and the
SoftMax function is used in the last (output) layer for generating the final output.
Table 3.1 summarizes the list of all the standard activation functions covered in
this section, and provides each of their corresponding equations and plot.
Now, the main query that may arise in one's mind is which activation function
is the best choice to use. Well, it all depends on the nature of the task to be solved.
One basic and easy rule for choosing the appropriate activation function for the
output layer is:
● For linear regression – choose the linear activation function
● For probability prediction – choose the logistic activation function
● For binary classification – choose the logistic or tanh activation function
● For multiclass classification – choose the SoftMax activation function
● For multilabel classification – choose the sigmoid activation function
● For neural network – choose the ReLU and SoftMax activation functions
In the next section, we will learn about the two main types of ANN – feed-
forward neural network and feed-backward neural network. To begin with, the
simplest neural feed-forward network model, namely the Perceptron, is discussed
followed by the more complex models that are often used in many applications.
Sigmoid or Logistic:   f(a) = 1 / (1 + e⁻ᵃ)
Tanh:                  f(a) = (eᵃ − e⁻ᵃ) / (eᵃ + e⁻ᵃ)
SoftMax:               f(a)ᵢ = e^(aᵢ) / ∑ⱼ₌₁ⁿ e^(aⱼ)
models to improve the performance of the model. It is an iterative process for
learning from the existing condition of the network and accordingly enhancing the
ANN's performance. It is a mathematical method that updates the weights and
bias levels of a neural network during its training process.
All learning rules fall under one of three categories – supervised learning
rules, unsupervised learning rules, and reinforcement learning rules. In the case
of supervised learning rules, the desired output is compared with the actual output,
and accordingly the weights are adjusted. Unsupervised learning rules combine the
input vectors of similar type to form clusters. There is no direct feedback provided
from the environment with regard to the desired output. Reinforcement learning
rules function by receiving some feedback from the environment, based on which
the network adjusts its weights to get better critic information
in the future.
Discussed below are the six standard learning rules used in ANNs. While
the Hebbian learning rule and competitive learning rule are unsupervised in nature,
the Perceptron learning rule, delta learning rule, correlation learning rule, and
Outstar learning rule are supervised in nature.
f(x) = 1 if ∑ᵢ₌₁ⁿ wᵢxᵢ ≥ 𝜃;  f(x) = 0 if ∑ᵢ₌₁ⁿ wᵢxᵢ < 𝜃

Here, "w" is the vector of real-valued weights, "𝜃" is the threshold value, and "x"
is the vector of input values.
In the Perceptron learning rule, the predicted output is compared with the
actual/known output. If both these values do not match, an error is propagated
backward, and weight adjustment occurs until there is no difference in both the
desired and actual outputs. This can be explained using two cases as follows:
The weight in the Perceptron learning rule (Wᵢ) is modified during the training
process according to the rule given in Equation (3.12):

Wᵢ = Wᵢ + 𝛿Wᵢ     (3.12)

where,
𝛿Wᵢ = 𝛼 (t − o) xᵢ

Here, 𝛿Wᵢ is the change in the value of the weight (Wᵢ). It can be either a positive or
a negative value. 𝛼 is the positive, constant learning rate that is usually a small
value, which keeps control so that an aggressive change does not occur during the
training of the model. t is the ground truth label (actual output) for the training
set, o is the derived output of the model, and xᵢ is the ith dimension of the input.
The learning of the model stops when o = t, i.e., the model is able to classify or
predict correctly.
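The Perceptron learning rule can be sketched as follows; the AND-gate training set is illustrative, and 𝛼 = 1 is used here (instead of a small value) so that the arithmetic stays exact:

```python
import numpy as np

# AND-gate training set: inputs and ground-truth labels t
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights W_i
b = 0.0           # bias (plays the role of -theta)
alpha = 1.0       # learning rate

for epoch in range(20):
    for xi, ti in zip(X, t):
        o = 1 if w @ xi + b >= 0 else 0   # predicted output
        # Perceptron rule: delta W_i = alpha * (t - o) * x_i
        w += alpha * (ti - o) * xi
        b += alpha * (ti - o)

preds = [1 if w @ xi + b >= 0 else 0 for xi in X]
print(preds)   # [0, 0, 0, 1] once learning has converged
```

Because the AND function is linearly separable, the loop stops making weight changes once every prediction matches its target, exactly as the stopping condition o = t describes.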
the connection between neurons is similar, then the weight between them
increases; otherwise, if the connection shows a negative relationship, then the
weight between them decreases. However, the main difference between the Hebbian
learning rule and the correlation learning rule is that the correlation learning rule is
supervised in nature, while the former is an unsupervised learning rule.
The mathematical representation of the correlation learning rule is given in
Equation (3.14):

𝛿Wᵢ = 𝛼 ⋅ xᵢ ⋅ dⱼ     (3.14)

Here, 𝛿Wᵢ is the change in weight in the ith iteration. 𝛼 is the positive, constant
learning rate that is usually a small value, which keeps control so that an aggressive
change does not occur during the training of the model. xᵢ is the ith dimension of
the input, and dⱼ is the desired output.
outputs are known in advance. The nodes are assumed to be arranged in layers,
and the change in weight is calculated using the mathematical formula as given
in Equation (3.17):

𝛿Wᵢ = 𝛼 ⋅ (dⱼ − wᵢ)     (3.17)
Here, 𝛿W i is the change in weight in the ith iteration. α is the positive and constant
learning rate that is usually a small value that keeps a control so that an aggressive
change does not occur during the training of the model. dj is the jth dimension of
the desired output, and wi is the weight of the presynaptic neuron.
The most primitive model of ANN was introduced in the year 1943 by McCulloch
and Walter Pitts, and it is popularly known as the McCulloch–Pitts neuron model.
In this highly simplified model, the neuron has a set of inputs (input vector)
and one output. All the neurons in a neural network are connected by synapses.
The synaptic strength between two neurons is indicated by a value termed
the weight value. The model only accepts binary input and also
produces a binary output (0 or 1), which is determined by a threshold value.
The diagrammatic representation of the McCulloch–Pitts neuron model is shown
in Figure 3.10. Here, xᵢ is the ith input to the neural network, and wᵢ is the weight
associated with the synapse of the ith input. The weights associated with each
input can be either excitatory (positive) or inhibitory (negative). The same positive
weight value (+w1) is assigned for all the excitatory connections entering a
particular neuron. Similarly, the same negative weight value (−w2) is assigned for
all the inhibitory connections entering a particular neuron. The weighted sum
g(x) is calculated as given in Equation (3.18):

g(x) = ∑ᵢ₌₁ⁿ xᵢwᵢ     (3.18)

Finally, the threshold function is used to compute the output in binary form.
Considering the threshold value as "𝜃," the output (y) is derived as either 0 or 1.
Here, y = 1 if g(x) ≥ 𝜃; else y = 0 if g(x) < 𝜃. The mathematical representa-
tion for deriving the output of the McCulloch–Pitts neuron model is given in
Equation (3.19):

y = f(g(x)) = 1 if g(x) ≥ 𝜃;  y = f(g(x)) = 0 if g(x) < 𝜃     (3.19)
Program 3.8 illustrates the Python code used to generate the McCulloch–Pitts
neuron model. Here, four random binary input values are generated along with
their corresponding weights. Next, the dot product is computed between the vec-
tor of inputs and weights. Finally, the output is generated by comparing the dot
product value (dot) with the threshold value (T). If the dot value is greater than or
equal to T, the neuron fires; otherwise, it does not fire.
else:
return 0
Input vector:[0 1 1 0]
Weight vector:[1 1 1 1]
Dot product: 2
Output: 1
For the above case, if the threshold value is considered as 3 or more than 3, the
final output would have been 0.
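Since Program 3.8 is only partially reproduced above, the sketch below reconstructs its described behavior; the function name and the fixed example vector (used instead of random inputs, to reproduce the printed run) are illustrative:

```python
import numpy as np

def mcp_neuron(inputs, weights, T):
    # McCulloch-Pitts neuron: fire (1) when the dot product of the binary
    # inputs and weights reaches the threshold T; otherwise do not fire (0)
    dot = np.dot(inputs, weights)
    print("Input vector:", np.array(inputs))
    print("Weight vector:", np.array(weights))
    print("Dot product:", dot)
    if dot >= T:
        return 1
    else:
        return 0

# Binary inputs, unit weights, threshold T = 2
out = mcp_neuron([0, 1, 1, 0], [1, 1, 1, 1], 2)
print("Output:", out)   # Output: 1
```

Raising the threshold to 3 or more makes the same call return 0, matching the remark in the text.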
The McCulloch–Pitts neuron model is a primitive model that formed the base
for understanding and applying the concept of neural network. As this model
allows only binary inputs and output, it has limited applications. With time,
several complex and advanced ANN models have been developed to solve the
purpose of real-time applications. Some of these types of ANN models are covered
in the next section of this chapter.
ANN is used in diverse applications, and its complexity and working depend on
the type of ANN model being used. ANNs are mainly classified into feed-forward
neural networks and feedback neural networks. There are many variations of
both types of ANN, as illustrated in Figure 3.11. As can be seen in the figure, the
main variations of the feed-forward neural network are the
single-layer perceptron, the multilayer perceptron, and the radial basis function
network. The main variations of the feedback neural network are the
self-organizing map (SOM), Bayesian regularized neural network, Hopfield
network, and competitive network. Each of these types of ANNs is discussed in
detail next.
The information in a feed-forward neural network always moves in the forward
direction (never backward), and the connections between nodes do not form any
cycle or loop. Data is fed through the input nodes, and output is generated
through the output nodes. In between the input layer and the output layer,
[Figure 3.11: Classification of ANNs. Feed-forward neural networks: single-layer perceptron, multilayer perceptron, and radial basis function network. Feedback neural networks: Kohonen's self-organizing map, Bayesian regularized neural network, Hopfield network, and competitive network.]
there can be zero to any number of hidden layers. Each neuron in the network
receives input from the neurons in the previous layer, computes a weighted sum
of the inputs, and applies an activation function to produce an output. The weights
and biases of the neurons are typically initialized randomly, and then adjusted dur-
ing the training process to minimize the difference between the predicted output
and the actual output.
Feedforward neural networks are commonly used in many applications, such as
regression, image classification, and natural language processing. They also prove
effective at modeling complex relationships between inputs and outputs. Three of
the standard commonly used feed-forward ANNs are discussed next:
[Figure: The Perceptron model. Input signals x1, …, xn enter with their weights, together with a bias input +1 weighted by b; the weighted sum g(x) = Σ_{i=1}^{n} x_i w_i + b (Equation (3.20)) is passed through a threshold function to produce the output y.]
Here, in Equation (3.20), "b" is the bias (an adjustable value), which plays a
significant role in the equation. Suppose all the inputs to the network are
zero (0); then the weighted sum g(x) is also going to be zero (0) in the absence
of the bias value "b." As the weighted output is "0," it can never be greater than
a positive threshold, and the neuron will never be triggered. We can overcome
this issue by introducing the bias variable. Now, if the linear threshold value is
assumed to be "+1," a bias value of "+1" could still make the neuron active even
when the input vector is all "0." In certain cases, for the same reason, the bias
can also be assumed as "−1."
To understand the role of bias, let us consider Figure 3.13. If the activation
function (also known as the transfer function) in this case is assumed to be
f(x) = x (where x is the input), then the output is the weight multiplied by the
input, i.e., wx. Here, the ANN represents a straight line.
[Figure 3.14: Output of the ANN with activation function f(x) = x: a straight line through the origin.]
Figure 3.14 represents the ANN output corresponding to the neural structure
shown in Figure 3.13. If the weight (w) is changed, the line may shift toward the
position of y3 or y2, but it would still pass through the origin; we cannot shift
the line to intercept the y-axis. This may not be optimal for solving many
problems.
By adding a constant "b," an intercept can be formed, or the activation function
can be shifted toward the right or left. Figure 3.15 shows the output of the ANN
after bias inclusion.
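The effect described above can be checked with a tiny numeric sketch (an illustration, not code from the book):

```python
def neuron_output(x, w, b=0.0):
    # Activation f(x) = x, so the output is just the weighted input plus the bias
    return w * x + b

print(neuron_output(0, 2.0))         # 0.0: with zero input and no bias, a positive threshold is never exceeded
print(neuron_output(0, 2.0, b=1.0))  # 1.0: the bias keeps the neuron active even for an all-zero input
```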
Next, the threshold function is used to compute the output of the Perceptron
model in binary form. Considering the threshold value as "θ," the output (y) is
derived as either 0 or 1: y = 1 if f(g(x)) ≥ θ; else y = 0 if f(g(x)) < θ. The
mathematical representation for deriving the output of the Perceptron model is
the same as in the McCulloch–Pitts neuron model, as given in Equation (3.21).
y = { 1, if f(g(x)) ≥ θ
    { 0, if f(g(x)) < θ        (3.21)
Now, to sum up the entire process, the single-layer perceptron takes inputs from
the input layer, multiplies each input with its corresponding weight, sums up
all the products of inputs and weights, and finally passes the weighted sum to
the nonlinear function to produce the output.
102 3 Artificial Neural Network
Program 3.9 creates a function that sets the desired inputs and an epoch value
of 10. The weights are updated over 10 epochs, iterating through the entire
training set. The bias value is inserted into the input while performing the
weight update. Accordingly, the error is computed, and the update rule is applied.
import numpy as np

class Perceptron(object):
    #Implements a perceptron network
    def __init__(self, input_size, lr=1, epochs=10):
        self.W = np.zeros(input_size+1)
        # add one for bias
        self.epochs = epochs
        self.lr = lr

    def predict(self, x):
        # step activation: 1 if the weighted sum >= 0, else 0
        return 1 if self.W.T.dot(x) >= 0 else 0

    def fit(self, X, d):
        for _ in range(self.epochs):
            for i in range(d.shape[0]):
                x = np.insert(X[i], 0, 1)          # insert the bias input
                e = d[i] - self.predict(x)         # compute the error
                self.W = self.W + self.lr * e * x  # update rule

# AND gate training data
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
d = np.array([0, 0, 0, 1])
perceptron = Perceptron(input_size=2)
perceptron.fit(X, d)
#Print the weight vector
print("Result: ")
print(perceptron.W)
Result:
[-3.  2.  1.]
The output displays the weight vector learned from the AND gate data. Here, in
the output, −3 is the bias value, and 2 and 1 are the values of the weights. The
pre-activation is calculated as −3 + 0*2 + 0*1 = −3, considering 0 as both input
values. Applying the activation function gives 0 (for x < 0), which is the result
of 0 AND 0. Similarly, for the input values 1 and 1, the pre-activation is
−3 + 1*2 + 1*1 = 0. Applying the activation function gives 1 (for x >= 0), which
is the result of 1 AND 1.
[Figure: A multilayer Perceptron network, with input units feeding hidden units, whose outputs feed the output units (signals).]
The hidden layer processes the information received from the input layer and,
in the same way, passes it on to the output layer.
The basic characteristics of a multilayer Perceptron are as follows:
i. Each neuron in the network has a nonlinear activation function that is
differentiable.
ii. The network contains one or more hidden layers (i.e., hidden from both
input and output).
iii. The network exhibits a high degree of connectivity.
The single-layer Perceptron cannot solve the XOR problem (validating the XOR
truth table with the network); however, adding an additional layer to the
Perceptron model can actually solve the XOR problem. Figure 3.17 shows a
multilayer Perceptron model that successfully solves the XOR problem.
Considering an input (1,0) (A is 1 and B is 0), the input on C would be −1
* 0.5 + 1*1 + 0*1 = −0.5 + 1 = 0.5. This value exceeds threshold
value 0, and therefore C fires giving output 1. For D, the input would be −1*1
+ 1*1 + 0*1 = −1 + 1 = 0, and, therefore D does not fire, giving output
0. The input on E would be −1*0.5 + 1*1 + 0*−1 = 0.5 and therefore
E fires. It can also be shown that E would not fire for the inputs (1,1)
or (0,0).
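The firing arithmetic above can be verified directly in Python (a sketch using the weights of Figure 3.17 as described in the text; a unit fires when its net input exceeds the threshold 0):

```python
def fires(net_input):
    # A unit outputs 1 when its net input exceeds the threshold 0
    return 1 if net_input > 0 else 0

def xor_net(A, B):
    C = fires(-1 * 0.5 + A * 1 + B * 1)   # hidden unit C (bias weight 0.5)
    D = fires(-1 * 1 + A * 1 + B * 1)     # hidden unit D (bias weight 1)
    E = fires(-1 * 0.5 + C * 1 + D * -1)  # output unit E
    return E

for A, B in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(A, B, "->", xor_net(A, B))  # reproduces the XOR truth table
```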
The intermediate layers added in between the model to enhance the computation
are termed hidden layers. One may ask why they are termed hidden: it is because
the activation function in those units and the associated weights are not
directly observable. If the output for a given input is not the same as the
target, one cannot tell whether the misclassification is caused by wrong weights
in the connections to the next layer or in other layers (neurons).
The plain algorithmic representation of the MLP is mentioned as the following
steps (assuming one hidden layer).
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
# Load data
data = pd.read_csv('HR_comma_sep.csv')
print("Displaying First Few Records of Dataset")
print(data.head())
# Creating labelEncoder
le = preprocessing.LabelEncoder()
# Splitting data
X=data[['satisfaction_level', 'last_evaluation',
'number_project', 'average_montly_hours',
'time_spend_company', 'Work_accident',
'promotion_last_5years', 'Department', 'salary']]
y=data['left']
print("Applying MLPClassifier...")
clf = MLPClassifier(hidden_layer_sizes=(6,5),
random_state=5,verbose=True, learning_rate_init=0.02)
Applying MLPClassifier...
Accuracy Score: 0.95
The function MLPClassifier() uses the hidden_layer_sizes parameter to set
the number of layers and the number of nodes we wish to have in the neural
network. It also uses the random_state parameter to set a seed for reproducing
the same results. The activation parameter is not mentioned in the given
program; hence, the default ReLU activation function is used to train the model.
At the end of the program, the accuracy score is calculated and found to be
95% using the test samples of the given dataset.
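The listing above omits the label encoding, train/test split, and fitting steps. A self-contained sketch with the same classifier settings is shown below; since HR_comma_sep.csv is not bundled here, synthetic stand-in data is used, so the accuracy will differ from the 95% reported for the HR dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the HR dataset (9 features, binary target)
rng = np.random.RandomState(5)
X = rng.rand(500, 9)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5)

clf = MLPClassifier(hidden_layer_sizes=(6, 5), random_state=5,
                    learning_rate_init=0.02, max_iter=500)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print("Accuracy Score:", round(acc, 2))
```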
Here, y is the predicted result, φi is the ith neuron’s output from the hidden layer,
and wi is the weight connection.
Figure 3.18 shows the radial basis function neural network architecture that
consists of mainly three layers – the input layer, the hidden layer, and the out-
put layer. As can be understood from the figure, the input layer receives the
inputs (X1, X2, …, XN) to the network and passes them to the hidden layer. The
hidden layer uses radial basis functions φ1, φ2, …, φN (usually Gaussian
functions) to
[Figure 3.18: The radial basis function network. Inputs x1, …, xN feed the hidden-layer functions φ1, …, φN, whose outputs are weighted (W1, …, WN) and summed to produce the output Y.]
transform the inputs into a set of nonlinear features. Finally, the output layer takes
these features and produces the final output of the network.
RBF networks usually use Gaussian functions as the radial basis functions. Each
node in the hidden layer corresponds to a particular Gaussian function,
and the weights associated with each node determine the shape and position of
the corresponding Gaussian. During training, the weights of the RBF network
are adjusted by minimizing the difference between the predicted outputs and the
actual outputs for a given set of inputs.
Considering the radial distance as ||x − μ||, the Gaussian function in an RBF
network is calculated using Equation (3.23):

φ(x) = exp( −||x − μ||² / (2σ²) )   (3.23)
Here, in Equation (3.23), x is the input to the function (a vector), 𝜇 is the center of
the Gaussian function (another vector), and 𝜎 is a scaling factor that determines
the width of the Gaussian function (𝜎 > 0). The || || notation denotes the Euclidean
distance between the two vectors.
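Equation (3.23) translates directly into a few lines of Python (an illustrative sketch):

```python
import numpy as np

def rbf_gaussian(x, mu, sigma):
    # Equation (3.23): phi(x) = exp(-||x - mu||^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(x - mu) ** 2 / (2 * sigma ** 2))

mu = np.array([1.0, 2.0])
print(rbf_gaussian(np.array([1.0, 2.0]), mu, 1.0))  # 1.0 at the center
print(rbf_gaussian(np.array([3.0, 2.0]), mu, 1.0))  # decays as x moves away from mu
```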
The output of each node is simply the value of the Gaussian function evaluated
at the input to the node. These outputs are passed on to the output layer, where
they are combined to produce the final output of the network.
Program 3.11 illustrates the radial basis neural network that uses the iris
dataset. The Iris dataset is a well-known dataset in the field of machine learning,
and is commonly used for classification and pattern recognition tasks. It was
introduced by the British biologist and statistician Ronald A. Fisher in 1936.
The dataset contains measurements of four features (sepal length, sepal width,
petal length, and petal width).
centers = kmeans.cluster_centers_
# Compute the distances between each data point and the centers
distances = pairwise_distances_argmin_min(X_train, centers)
The output of Program 3.11 is displayed next. The first five records of the Iris
dataset are displayed, and finally the accuracy score of 98% is calculated and
displayed.
Displaying First 5 Records of the Iris Dataset
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
Accuracy Score: 0.98
Multilayer Perceptron (MLP) and Radial Basis Function (RBF) networks are both
popular feedforward neural network architectures. However, the major difference
between MLP and RBF is that an MLP consists of one or several hidden layers,
while an RBF network consists of just one hidden layer. Also, the RBF network
has a faster learning speed compared to the MLP.
Here, α is a positive learning rate (0 < α < 1), x is the input, and y is the
output. However, with the above rule, the entire weight space may become
saturated. To overcome this problem, a forgetting term can be added. The
modified equation is shown in Equation (3.25):

w_i(t+1) = [w_i(t) + α x_i(t) y(t)] / { Σ_{j=1}^{n} [w_j(t) + α x_j(t) y(t)]² }^{1/2}
         ≅ w_i(t) + α(t) y(t) [x_i(t) − y(t) w_i(t)] + O(α²)   (3.25)
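The normalized update of Equation (3.25) can be sketched as follows (an illustration of the rule, not code from the book); the division by the norm is the "forgetting" that keeps the weight vector from saturating:

```python
import numpy as np

def hebbian_normalized_update(w, x, alpha):
    # Numerator of Equation (3.25): a plain Hebbian step
    w_new = w + alpha * x * np.dot(w, x)
    # Denominator: renormalize so the weights cannot grow without bound
    return w_new / np.linalg.norm(w_new)

rng = np.random.RandomState(1)
w = rng.rand(3)
w /= np.linalg.norm(w)
for _ in range(100):
    w = hebbian_normalized_update(w, rng.randn(3), alpha=0.01)
print(np.linalg.norm(w))  # stays at 1.0: the weight space does not saturate
```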
[Figure: A Kohonen self-organizing map, with the input (presynaptic) layer x1, x2, …, xN fully connected to the computational (postsynaptic) layer.]
Whenever an input is presented to the SOM, the distance of the input to every
neuron in the postsynaptic layer is computed. The distance measured is the
Euclidean distance of the input from the neurons of the postsynaptic layer. The
neuron with the smallest distance to the input becomes the winning neuron,
sometimes also called the best matching unit (BMU). The weight updates of the
winning unit occur together with its neighborhood neurons, as defined. The
neighborhood indicates the area of influence around the winning unit, and the
neighborhood function is monotonically decreasing.
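Finding the winning neuron (BMU) described above amounts to a nearest-neighbor search over the grid's weight vectors. A small numpy sketch (illustrative data, not the book's program):

```python
import numpy as np

rng = np.random.RandomState(0)
weights = rng.rand(8, 8, 8)  # an 8 x 8 grid of neurons, each with 8 weights
x = rng.rand(8)              # one input vector

# Euclidean distance of the input to every neuron in the postsynaptic layer
dists = np.linalg.norm(weights - x, axis=2)
# The neuron with the smallest distance is the best matching unit (BMU)
bmu = np.unravel_index(np.argmin(dists), dists.shape)
print("Winning neuron (BMU):", bmu)
```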
Program 3.12 illustrates the Python code used for a SOM using the MiniSom
library. Initially, the input data is randomly generated having 500 data points and
8 unique features. Next, we initialize the SOM using the MiniSom class from the
MiniSom library. Here, we create a SOM with an 8 × 8 grid of neurons, each with
8 input features. After initializing the SOM, it is trained on the input data
for 500 iterations. Finally, the SOM is visualized by creating a figure of
8 × 8 inches and plotting it using the pcolor function. The distance_map
method of the SOM returns a 2D numpy array representing the distance between
each neuron in the SOM and its neighbors, and the bone_r colormap is used to
visualize it as a heatmap.
plt.figure(figsize=(8, 8))
plt.pcolor(som.distance_map().T, cmap='bone_r')
plt.colorbar()
The output of Program 3.12 is given next, which basically displays a figure to plot
the SOM with the given data inputs having 500 records and 8 features (Figure 3.21).
SOMs are used for a wide range of applications in various fields such as feature
extraction, anomaly detection, clustering, data visualization, and dimensionality
reduction. SOM is a powerful tool often used by researchers for data analysis.
As SOM is computationally efficient, it can be trained on large datasets in a
Figure 3.21 A self-organizing model (SOM) having 500 data points and 8 features.
reasonable amount of time. Also, SOMs can handle missing values or outliers in
the input data, and are robust to incomplete or noisy data.
Here, wij represents the weight matrix, which is calculated during the training pro-
cess of the model. wij indicates the weight associated with the connection between
the ith and the jth neuron, P is the number of patterns to be stored, and si (p) and
sj (p) are the ith element and the jth element of the pth pattern, respectively.
In the case of bipolar inputs, the weight update for the input patterns S(p)
[p = 1 to P], with S(p) = s1(p) … si(p) … sn(p), is given by Equation (3.33):

w_ij = Σ_{p=1}^{P} s_i(p) s_j(p), for i ≠ j   (3.33)
HNN Algorithm
Step 1: Initialize weights (wij ) to store patterns obtained from the training
algorithm using the Hebbian principle.
Step 2: Perform steps 3–9 while the activations of the network have not converged.
Step 3: For each input vector X, perform steps 4–8.
Step 4: Make initial activation of the network equal to the external input vector
X as follows:
yi = xi for i = 1 to n
Step 5: For each unit yi , perform steps 6–9.
Step 6: Calculate the total input of the network y_in as follows:

y_in = x_i + Σ_j y_j w_ji
Step 7: For a threshold value θi, apply the activation function over the total
input to calculate the output:

y_i = { 1    if y_in > θ_i
      { y_i  if y_in = θ_i        (3.34)
      { 0    if y_in < θ_i
Step 8: The output yi is broadcast to all the other units. Accordingly, the
activation vectors are updated.
Step 9: Finally, test the network for convergence.
Program 3.13 illustrates the Python code used for a Hopfield neural net-
work. The program defines a HFNW class having two methods – train()
and predict(). The weight matrix is updated based on the patterns. Finally,
the predict() method takes in the input pattern and iteratively updates the
neurons. The update stops either when the maximum number of iterations is
reached or upon convergence. The program chooses binary patterns (consisting
of 0's and 1's) and finally displays a binary output.
import numpy as np
class HFNW:
def __init__(self, n):
self.n = n
self.weights = np.zeros((n, n))
# Example usage
patterns = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [1, 1,
0, 1]])
hn = HFNW(4)
hn.train(patterns)
input_pattern = np.array([0, 0, 1, 0])
predicted_pattern = hn.predict(input_pattern)
print("Input pattern:", input_pattern)
print("Predicted pattern:", predicted_pattern)
The output of Program 3.13 is given below. Here, the input pattern
[0, 0, 1, 0] is closest to the pattern [1, 1, 1, 1], which is the
pattern that the Hopfield network predicts.
Input pattern: [0 0 1 0]
Predicted pattern: [1. 1. 1. 1.]
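The train() and predict() methods omitted from the listing above could plausibly be completed as follows (a sketch using the standard Hebbian rule on bipolar values; the book's exact variant may differ, so the converged pattern need not match the output shown bit for bit):

```python
import numpy as np

class HFNW:
    def __init__(self, n):
        self.n = n
        self.weights = np.zeros((n, n))

    def train(self, patterns):
        # Hebbian learning on bipolar versions of the binary patterns
        for p in patterns:
            b = 2 * np.array(p) - 1        # map {0, 1} -> {-1, +1}
            self.weights += np.outer(b, b)
        np.fill_diagonal(self.weights, 0)  # no self-connections

    def predict(self, pattern, max_iter=100):
        s = 2 * np.array(pattern) - 1
        for _ in range(max_iter):          # iterate until stable or max_iter
            s_new = np.where(self.weights @ s >= 0, 1, -1)
            if np.array_equal(s_new, s):
                break
            s = s_new
        return (s + 1) // 2                # map back to {0, 1}

hn = HFNW(4)
hn.train(np.array([[0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 1]]))
print(hn.predict(np.array([0, 0, 1, 0])))
```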
The input received by the model can be a noisy or incomplete binary pattern.
The Hamming distance is used on the training patterns to calculate the number
of bits that differ between two binary patterns. The network is trained until
it reaches a stable state, after which it generates the stored pattern that
most closely resembles the input pattern. Hopfield neural networks are used in
a variety of applications such as associative memory, image recognition, and
optimization problems. However, one major limitation of this neural network is
that it is sensitive to spurious patterns, which may get retrieved instead of
the pattern that is anticipated.
A real-life example of a Hopfield Neural Network is the use of these networks in
content-addressable memory systems, also known as associative memory or pat-
tern recognition. Hopfield Neural Networks are particularly well-suited for tasks
involving pattern completion and pattern recall. Imagine a scenario where one
wants to store and retrieve patterns in a memory system. A simple example of
pattern recall in image denoising is demonstrated next:
Let’s say there is an image that has been corrupted by adding noise to some of
its pixels. If someone wants to restore the original image by removing the noise,
Hopfield Neural Network can be used as an associative memory to help with
this task.
Storage Phase: One can store the clean version of the image as a pattern in
the Hopfield network. Each pixel’s value is treated as a neuron state (either +1 or
−1). The connections between neurons are adjusted based on the stored patterns
to create an energy landscape that represents attractors for each pattern.
Pattern Recall Phase: The noisy version of the image can be input into the
network. The noisy pixels can be represented as their noisy values (−1 or +1).
The network dynamics evolve based on the interactions between neurons and
the energy landscape. Due to the attractors created during the storage phase, the
network tends to settle into one of the stored patterns, which corresponds to a
denoised version of the image.
In this example, the Hopfield Neural Network is acting as an associative memory
that can recall the stored patterns based on partial or noisy inputs. It demonstrates
the network’s ability to complete patterns and retrieve information even when
the input is incomplete or degraded. Keep in mind that while Hopfield Neural
Networks have certain useful properties like pattern recall, they also have limi-
tations, such as their capacity to store patterns and convergence properties. For
more complex tasks, modern neural network architectures, such as deep learning
models, are often preferred due to their ability to learn more intricate patterns and
features from data.
Exercises
Deep Learning
[Figure: Accuracy/performance versus data size. Deep learning keeps improving as data size grows, while traditional learning levels off.]
[Figure: (a) Traditional machine learning, in which a human expert performs feature extraction before the input (strawberry/leaf) is classified; (b) deep learning, in which classification is learned end to end.]
A deep learning model, in contrast, requires no hand-crafted features and can
itself detect and predict the multiple objects present in the input space, such
as the strawberry as well as the leaf.
Deep learning models are flexible and can be applied to a wide range of tasks,
including real-time applications. They can be trained on large datasets and
generate output with high prediction accuracy. Also, as little human
intervention is required to train a model, the need for manual feature
engineering is reduced. Thus, deep learning has revolutionized many sectors of
today's world, be it healthcare, industry, research, or agriculture.
Deep learning neural networks are distinguished from simple neural networks by
the number of hidden layers. While deep learning models achieve higher accuracy
than simple neural networks, they take more time to train. Some of the standard
deep learning models used in various applications are mentioned in Figure 4.3.
These are the common deep learning models used in recent years. However, with
more and more research and development, new deep learning models are evolving
at a rapid rate.
Some deep learning models such as convolutional neural networks (CNNs),
recurrent neural networks (RNNs), transformers, and autoencoders can be cate-
gorized as either supervised or unsupervised, depending on the type of learning
task they are designed to solve. Generative adversarial networks (GANs) and
deep reinforcement learning (Deep RL) fall under the category of unsupervised
learning.
Four of the standard deep learning models (as mentioned in Figure 4.3), namely
CNN, RNN, GAN, and autoencoders, are discussed in the next subsections.
[Figure 4.3: Standard deep learning models: convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), autoencoders, transformers, and deep reinforcement learning (Deep RL).]
[Figure 4.4: Convolution of a 32 × 32 × 3 image with a 5 × 5 × 3 kernel; each kernel position yields a single number in the feature map.]
[Figure 4.5: The first step of the convolution operation. With the 3 × 3 kernel at the top-left of the 5 × 5 input, the elementwise products sum to 3×1 + 1×2 + 2×0 + 0×2 + 2×3 + 5×1 + 1×3 + 4×0 + 3×1 = 22, the first entry of the convoluted feature.]
Figure 4.4 shows an input image with three color channels (RGB), to which the
kernel is applied across all three channels. The first step of the convolution
operation is shown in Figure 4.5, in which a filter (kernel) of 3 × 3 slides
over an input image of 5 × 5 to produce a partial output feature map.
The convolution operation involves two main steps: elementwise multiplication
and summation. The output in Figure 4.4 of each filter results in a feature map of
32 × 32 × 1. This entire process of convolution allows recognizing low-level fea-
tures such as edges and textures that allow detecting complex features such as
shapes and patterns. The convolution operation requires the following two opera-
tions to be applied:
a) Elementwise multiplication: The filter is initially placed at the top-left
corner of the input image, and an elementwise multiplication is performed
between the filter and the corresponding pixels in the input image. The result
is obtained by summing up these elementwise multiplications.
b) Summation: The sum of the elementwise multiplication is computed, and the
result is assigned to a single pixel in the output feature map. This process is
repeated for all possible positions (pixels) in the input image by sliding the filter
over the image.
We know that every image is considered as a matrix of pixel values. These pixel
values of an image are used for the process of convolution. To illustrate further,
let us consider a grayscale image of 5 × 5 × 1 input data. Here, the input image
dimension is a 5 (height) × 5 (width) image having only one color dimension, as it
is a grayscale image. A filter (kernel) of 3 × 3 is applied to the input data to produce
a 3 × 3 feature map.
The above process is repeated by applying a particular stride value that deter-
mines the step-size with which the convolutional filter moves over the input
image. If a stride value of 1 is applied, the filter moves pixel by pixel,
whereas a stride value of 2 moves the filter two pixels at a time. Stride affects the
size of the output feature map. Figure 4.6 shows the next step of filter application
for a stride value of 1, to obtain the next value of the convoluted feature.
[Figure 4.6: The next step of filter application with a stride value of 1. Shifting the kernel one pixel to the right gives 1×1 + 2×2 + 4×0 + 2×2 + 5×3 + 3×1 + 4×3 + 3×0 + 2×1 = 41, the next entry of the convoluted feature.]
[The complete pass over the 5 × 5 input data with the 3 × 3 kernel yields the 3 × 3 convoluted feature: [[22, 41, 45], [34, 48, 35], [56, 35, 38]].]
Figure 4.7 The convoluted feature after applying the convolution operation.
Moving on, the convolution process moves from left to right and then from top
to bottom of the image with the same stride value, repeating the process until
the entire image is traversed. Continuing the above process of applying a
stride value of 1, the final convoluted feature is calculated to obtain a complete
matrix as illustrated in Figure 4.7. Smaller strides preserve more spatial informa-
tion but may increase the computational cost.
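The sliding-window computation above can be reproduced with a short numpy sketch using the input data and kernel from Figures 4.5–4.7:

```python
import numpy as np

img = np.array([[3, 1, 2, 4, 1],
                [0, 2, 5, 3, 4],
                [1, 4, 3, 2, 3],
                [4, 5, 1, 2, 6],
                [6, 3, 5, 1, 2]])
kernel = np.array([[1, 2, 0],
                   [2, 3, 1],
                   [3, 0, 1]])

def convolve2d(image, k, stride=1):
    kh, kw = k.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * k)  # elementwise multiply, then sum
    return out

print(convolve2d(img, kernel))
# [[22 41 45]
#  [34 48 35]
#  [56 35 38]]
```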
After the convolution operation, an activation function is usually applied ele-
mentwise to the output feature map. This introduces nonlinearity and enables
the network to learn complex relationships between features. Common activation
functions used in CNNs include ReLU (Rectified Linear Unit), which sets negative
values to zero and keeps positive values unchanged, and variants like Leaky ReLU
and Parametric ReLU.
CNNs typically employ multiple filters in each layer. Each filter detects a
specific feature or pattern in the input image. These filters are stacked
together to form the depth dimension of the output feature map. For example,
if a layer uses eight filters, the resulting output feature map will have a
depth of 8, indicating that it captures eight different features.
The next step in a CNN is pooling, which is applied to reduce the size of the
feature maps and to decrease the computational power required to process the
data through dimensionality reduction. Pooling operates on individual feature
maps independently, and helps to extract and retain the most important features
while discarding unnecessary details. The pooling window slides over the
feature map with a predefined stride, just like in convolutional operations.
The stride determines how much the window shifts after each pooling operation.
Common stride values are 1 or 2.
The two most common pooling methods are max pooling and average pooling, of
which max pooling is the more widely used.
● Max Pooling: This pooling method returns the maximum value from the
portion of the image covered by the kernel. It helps capture the most prominent
features within the window and discard less important details.
● Average Pooling: This pooling method returns the average of all the values
from the portion of the image covered by the kernel. It helps in reducing the
spatial dimensions while providing a smoothed-down representation of the
features.
Figure 4.8 illustrates the method of applying both max pooling and aver-
age pooling to obtain the result. In this example, a 2 × 2 filter is applied for
pooling with a stride value of 2. Pooling enables CNNs to effectively process
high-dimensional data, such as images, while extracting and retaining essential
features. The output of the pooling operation is a downsampled feature map with
reduced spatial dimensions. The size of the output feature map depends on the
size of the pooling window, the stride, and the padding (if applied).
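Both pooling methods can be sketched with a few lines of numpy (a 2 × 2 window with stride 2, as in Figure 4.8; the sample feature map is illustrative):

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            # Max pooling keeps the most prominent value; average pooling smooths
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 8, 3],
                 [0, 1, 4, 9]])
print(pool2d(fmap, mode="max"))      # [[6. 4.]
                                     #  [7. 9.]]
print(pool2d(fmap, mode="average"))  # window means: 3.75, 2.25, 2.5, 6.0
```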
The pooled feature map is flattened to convert the resultant two-dimensional
matrix into a single long continuous linear vector. Now, if we consider the
resultant matrix of Figure 4.8, obtained after applying max pooling, the
flattened linear vector is formed as shown in Figure 4.9.
Finally, the flattened linear vector from the pooling layer is fed as input to
the fully connected layer to classify images. The fully connected (FC) layer,
also known as the Dense layer, is a fundamental component of CNN that comes
after the convolutional and pooling layers. It plays a critical role in
capturing high-level feature combinations and mapping them to the final output
classes.
[Figure: The CNN pipeline from input to output: convolution, activation, pooling, flattening, fully connected layer, and softmax function.]
Figure 4.11 Block diagram of CNN architecture. Source: Eric Isselée/Adobe Stock .
b) AlexNet: This architecture won the ImageNet Large Scale Visual Recogni-
tion Challenge in 2012 and consists of five convolutional layers, three fully
connected layers, and uses ReLU activation functions.
c) VGGNet: This architecture uses small convolutional filters (3 × 3) with many
layers (up to 19 layers) and has achieved top performance on the ImageNet
dataset.
d) GoogLeNet (Inception): This architecture uses a network-in-network
approach, with multiple convolutional layers stacked on top of each other
in parallel, and has achieved high accuracy on ImageNet while using fewer
parameters than other networks.
e) ResNet (Residual Network): This architecture introduced the concept of
residual connections, which allow for training of very deep networks (up to
hundreds of layers) without the problem of vanishing gradients.
f) DenseNet: This architecture connects each layer to every other layer in a
feed-forward fashion, resulting in a densely connected network that improves
feature reuse and reduces the number of parameters.
g) EfficientNet: This architecture uses a compound scaling method to scale the
depth, width, and resolution of the network in a balanced way, resulting in a
more efficient and accurate network.
These are just a few examples of prominent CNN architectures, and there are
many other variations and extensions of these architectures as well. Many other
architectures, such as Xception, ResNeXt, and Inception-ResNet, have also made
significant contributions to the field of computer vision. Each architecture has its
own design choices and optimizations, making them suitable for specific tasks or
constraints.
Program 4.1 uses the CIFAR-10 dataset, which consists of 50,000 training images
and 10,000 testing images, each belonging to one of ten classes: airplane,
automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Initially, the
code loads the dataset using the cifar10 module from keras
and performs preprocessing steps such as normalizing the pixel values and con-
verting the labels to one-hot encoded vectors. Here, the input shape mentioned is
(32, 32, 3) to match the dimensions of the CIFAR-10 images.
Next, the program uses a simple CNN architecture with two convolutional layers
followed by max pooling, a flattening layer, and two fully connected layers. The
model is compiled with the Adam optimizer and trained for a fixed number of
epochs. Finally, the model is evaluated on the test dataset, and predictions are
made on unseen data. Here is an overview of the layers added in the code snippet:
● Conv2D layer: Performs 2D convolution on the input image.
● MaxPooling2D layer: Performs downsampling by taking the maximum value
within a defined window.
● Flatten layer: Reshapes the multidimensional output from the previous layer
into a 1D vector.
● Dense layer: Represents a fully connected layer that connects all neurons from
the previous layer to the current layer.
Overall, the sequential model provides a convenient way to define and organize
the layers of a CNN in a sequential manner.
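To make the Conv2D and MaxPooling2D operations concrete, here is a minimal NumPy sketch of a single-channel convolution (strictly, cross-correlation, as deep learning libraries implement it) and non-overlapping max pooling; Keras performs the equivalent computations internally across many channels and filters:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution of a single-channel image with a small kernel."""
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

def max_pool(image, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = image.shape[0] // size, image.shape[1] // size
    return image[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.arange(16.0).reshape(4, 4)     # tiny 4x4 "image"
feat = conv2d(img, np.ones((2, 2)))     # 3x3 feature map
pooled = max_pool(feat)                 # downsampled by taking window maxima
```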
# Make predictions
predictions = model.predict(X_test)

# Plot training and test loss ('history' is the object returned by model.fit)
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Test loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Test Loss')
plt.legend()
plt.show()
During training, the CNN adjusts its internal parameters (weights and biases)
based on the computed loss. The goal is to minimize the loss, which means
reducing the discrepancy between predicted outputs and true outputs. This opti-
mization process is achieved through techniques like backpropagation and
gradient descent, where the gradients of the loss function with respect to the
model parameters are used to update the parameters in a way that decreases
the loss.
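The update rule can be illustrated on a one-parameter toy loss (purely illustrative; a real CNN applies the same idea simultaneously to millions of parameters):

```python
# Minimize the toy loss L(w) = (w - 3)^2 by gradient descent
w = 0.0          # initial parameter value
lr = 0.1         # learning rate
for _ in range(100):
    grad = 2 * (w - 3)   # dL/dw, the gradient of the loss
    w -= lr * grad       # step in the direction that decreases the loss
# w is now very close to 3, the minimizer of the loss
```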
The “epoch” is a term used to represent a complete pass through the entire
training dataset during the training process. In other words, an epoch is
completed when the CNN has processed each training sample once and adjusted
its parameters accordingly. During each epoch, the CNN makes predictions on
the training data, compares them with the true labels, computes the loss, and
updates the model’s parameters based on the gradients.
Typically, training a CNN involves multiple epochs. By going through multi-
ple epochs, the model has the opportunity to learn from the data and improve its
performance iteratively. It can refine its internal representations, adapt its param-
eters, and hopefully converge to a state where the loss is minimized, and the model
achieves better accuracy or performance on the given task.
The output of Program 4.1 is displayed below. The output also displays visual-
ization graphs for training and test loss (as shown in Figure 4.12) and training and
test accuracy (as shown in Figure 4.13).
Epoch 1/10
1563/1563 [==============================] - 83s 52ms/step - loss:
1.4081 - accuracy: 0.4971 - val_loss: 1.1328 - val_accuracy: 0.6090
Epoch 2/10
1563/1563 [==============================] - 78s 50ms/step - loss:
1.0545 - accuracy: 0.6317 - val_loss: 1.0284 - val_accuracy: 0.6380
Epoch 3/10
1563/1563 [==============================] - 79s 51ms/step - loss:
0.9115 - accuracy: 0.6824 - val_loss: 0.9327 - val_accuracy: 0.6781
Epoch 4/10
1563/1563 [==============================] - 78s 50ms/step - loss:
0.8053 - accuracy: 0.7203 - val_loss: 0.8951 - val_accuracy: 0.6941
Epoch 5/10
1563/1563 [==============================] - 78s 50ms/step - loss:
0.7197 - accuracy: 0.7478 - val_loss: 0.8762 - val_accuracy: 0.6948
Epoch 6/10
1563/1563 [==============================] - 77s 49ms/step - loss:
0.6433 - accuracy: 0.7752 - val_loss: 0.8989 - val_accuracy: 0.6975
Epoch 7/10
1563/1563 [==============================] - 80s 51ms/step - loss:
0.5735 - accuracy: 0.8008 - val_loss: 0.9075 - val_accuracy: 0.7020
Epoch 8/10
1563/1563 [==============================] - 78s 50ms/step - loss:
0.5015 - accuracy: 0.8253 - val_loss: 0.9424 - val_accuracy: 0.7054
Epoch 9/10
1563/1563 [==============================] - 77s 50ms/step - loss:
0.4431 - accuracy: 0.8439 - val_loss: 0.9992 - val_accuracy: 0.7022
Epoch 10/10
1563/1563 [==============================] - 81s 52ms/step - loss:
0.3869 - accuracy: 0.8643 - val_loss: 1.0448 - val_accuracy: 0.7071
Test loss: 1.0448317527770996
Test accuracy: 0.707099974155426
In a convolutional neural network (CNN), “test loss” and “test accuracy” are
performance metrics used to evaluate the effectiveness of the trained model on
unseen or test data.
● Test Loss: The test loss is a measurement of how well the trained CNN model
performs on the test dataset. It quantifies the error between the predicted out-
puts and the true outputs for the test samples. The test loss is typically calculated
using a loss function, such as categorical cross-entropy for classification tasks or
mean squared error for regression tasks. The lower the test loss, the better the
model’s predictions align with the true outputs.
● Test Accuracy: The test accuracy is a metric that indicates the proportion of
correctly classified samples in the test dataset. It represents the percentage of
test samples that the model correctly predicts the class label for. A higher test
accuracy signifies better performance. The test accuracy is usually computed by
comparing the predicted class labels with the true class labels and calculating
the accuracy as the ratio of correctly classified samples to the total number of
test samples.
By visualizing the training and test loss as well as training and test accuracy, one
can gain insights into how the model is learning and generalizing. These visual-
izations help in identifying potential issues such as overfitting or underfitting, and
they can guide adjustments to the model or training process if necessary.
(Figure: an RNN shown compactly and unfolded over time, with inputs x(t−1), x(t), x(t+1), hidden states h(t−1), h(t), h(t+1), outputs y(t−1), y(t), y(t+1), input weights U, and output weights W.)
(a) Input and Output:
• The input and output vectors can have different dimensions based on the specific task.
(b) Hidden State:
• The RNN maintains a hidden state vector h(t) at each time step, which
serves as the memory of the network.
• The hidden state captures information from previous time steps and influ-
ences the current prediction.
• Initially, the hidden state is set to zero or a small random vector.
(c) Recurrent Connections:
• The hidden state at each time step is calculated based on the current input
and the previous hidden state.
• The calculation of the hidden state involves two sets of weights: W (for the previous hidden state) and U (for the current input).
• The hidden state is updated using an activation function, typically a nonlin-
ear function like tanh or ReLU.
• The update equation for the hidden state can be expressed as:
h(t) = f (W ∗ h(t − 1) + U ∗ x(t)).
(d) Output Calculations:
• Once the hidden state is updated, the output at each time step is computed
based on the hidden state.
• The output can be calculated using a set of output weights (written V here to distinguish them from the recurrent weights W) that connects the hidden state to the output.
• The output at time step t is obtained as:
y(t) = V ∗ h(t)
(e) Training:
• During training, the RNN is typically trained using gradient-based
optimization methods like backpropagation through time (BPTT).
• The objective is to minimize a loss function that measures the discrepancy
between the predicted output and the target output.
• Gradients are calculated with respect to the various weights of each state,
and updated using gradient descent or its variants.
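The recurrence in steps (a)–(e) can be sketched as a forward pass in NumPy. The dimensions are purely illustrative, and the output weight matrix is named V here to keep it distinct from the recurrent weights:

```python
import numpy as np

def rnn_forward(xs, W, U, V, h0):
    """Unrolled RNN forward pass: h(t) = tanh(W @ h(t-1) + U @ x(t)), y(t) = V @ h(t)."""
    h, ys = h0, []
    for x in xs:
        h = np.tanh(W @ h + U @ x)   # update the hidden state from input and memory
        ys.append(V @ h)             # output computed from the current hidden state
    return ys, h

rng = np.random.default_rng(0)
# Illustrative sizes: hidden dim 4, input dim 3, output dim 2
W, U, V = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
xs = [rng.normal(size=3) for _ in range(5)]          # a length-5 input sequence
ys, h_final = rnn_forward(xs, W, U, V, np.zeros(4))  # initial hidden state is zero
```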
One popular variant of RNNs is the long short-term memory (LSTM) network,
which addresses the vanishing gradient problem by introducing specialized
memory cells and gating mechanisms. LSTMs are capable of capturing long-term
dependencies and have been widely used in tasks such as speech recognition,
machine translation, and sentiment analysis. The LSTM network is discussed in
detail next.
Another variant is the gated recurrent unit (GRU), which simplifies the LSTM architecture by merging the cell state and hidden state and by using only two gates, an update gate and a reset gate. GRUs offer similar capabilities to LSTMs but with fewer parameters, making
them more computationally efficient in certain scenarios.
Overall, RNNs have proven to be powerful tools for modeling and predicting
sequential data. They excel in tasks that involve time series forecasting, language
modeling, speech recognition, and machine translation, among others. Their abil-
ity to capture temporal dependencies makes them well-suited for tasks that require
understanding and generating sequential information.
Long Short-Term Memory (LSTM) Network: LSTM is a type of RNN
architecture that is specifically designed to address the vanishing gradient
problem and capture long-term dependencies in sequential data. LSTMs are
particularly effective in tasks involving long sequences, in which preserving and
updating memory over long time intervals is crucial. The working of an LSTM
can be explained through its key components, and the flow of information during
training and inference:
(d) Input Gate:
• The input gate takes the previous hidden state h(t−1) and the current
input x(t) as inputs, and passes them through sigmoid and tanh activation
functions, respectively.
• The sigmoid output determines which values will be updated, while the
tanh output creates the candidate vector of new values.
(e) Updating the Cell State:
• The cell state is updated based on the outputs of the forget gate and the input
gate.
• The forget gate output f(t) and the previous cell state c(t−1) are multiplied elementwise to forget the irrelevant information.
• The input gate output i(t) and the candidate vector are multiplied elementwise and added to the result of the forget gate operation, updating the cell state.
• The updated cell state c(t) is given by: c(t) = f(t) * c(t − 1) + i(t) * tanh(candidate vector).
(f) Output Gate and Hidden State:
• The output gate determines the information that will be outputted from the
LSTM cell.
• It takes the previous hidden state h(t−1) and the current input x(t) as inputs,
passes them through a sigmoid activation function, and produces an output
between 0 and 1.
• The updated cell state c(t) is passed through a tanh activation function to
squash the values between −1 and 1.
• The output gate output o(t) and the squashed cell state tanh(c(t)) are multiplied elementwise to produce the current hidden state h(t).
• The hidden state carries the relevant information from the cell state, and is
passed to the next time step and also used for generating the output predic-
tion.
(g) Training and Inference:
• During training, LSTMs are typically trained using gradient-based optimiza-
tion methods like BPTT.
• The objective is to minimize a loss function that measures the discrepancy
between the predicted output and the target output.
• Gradients are calculated with respect to the LSTM’s parameters and updated
using gradient descent or its variants.
• During inference or testing, the LSTM can be used to generate predictions
for new input sequences by feeding the inputs one step at a time and updat-
ing the hidden state accordingly.
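One time step of the gating logic above can be sketched in NumPy. This is a minimal illustration with stacked weight matrices and illustrative sizes; real LSTM implementations additionally handle batching, whole sequences, and training:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step following the gate equations in the text.
    p holds weights/biases for the forget (f), input (i), candidate (g)
    and output (o) transforms, each applied to [h(t-1); x(t)]."""
    z = np.concatenate([h_prev, x])        # stack previous hidden state and input
    f = sigmoid(p["Wf"] @ z + p["bf"])     # forget gate
    i = sigmoid(p["Wi"] @ z + p["bi"])     # input gate
    g = np.tanh(p["Wg"] @ z + p["bg"])     # candidate vector
    c = f * c_prev + i * g                 # c(t) = f(t)*c(t-1) + i(t)*candidate
    o = sigmoid(p["Wo"] @ z + p["bo"])     # output gate
    h = o * np.tanh(c)                     # h(t) = o(t) * tanh(c(t))
    return h, c

rng = np.random.default_rng(1)
n_h, n_x = 4, 3                            # illustrative hidden and input sizes
p = {k: rng.normal(size=(n_h, n_h + n_x)) for k in ("Wf", "Wi", "Wg", "Wo")}
p.update({b: np.zeros(n_h) for b in ("bf", "bi", "bg", "bo")})
h, c = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), p)
```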
LSTMs have proven to be highly effective in a wide range of tasks involving
sequential data, such as speech recognition, language translation, sentiment
analysis, and time series prediction, among others. They have enabled the model-
ing of complex dependencies over long sequences and have become a cornerstone
in the field of deep learning for sequential data processing.
To further understand the concept of LSTM in RNN, a simple example to predict
the next odd number in a numerical sequence is considered. Suppose there is a
sequence of odd numbers: [1, 3, 5, 7, 9, 11, 13, 15, 17, 19], the
goal is to train an LSTM to predict the next odd number in the sequence. For input
preparation, the sequence is divided into input (X) and output (Y) pairs. Now, if
the sequence length is set to 3, the input and output will be:
X ∶ [1, 3, 5], [3, 5, 7], [5, 7, 9], … , [13, 15, 17]
Y ∶ 7, 9, 11, … , 19
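The input-preparation step can be sketched in plain Python:

```python
def make_windows(seq, length=3):
    """Split a sequence into (input window, next value) training pairs."""
    X = [seq[i:i + length] for i in range(len(seq) - length)]
    Y = [seq[i + length] for i in range(len(seq) - length)]
    return X, Y

odds = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
X, Y = make_windows(odds)
# X[0] is [1, 3, 5] with target Y[0] = 7; the last pair is [13, 15, 17] -> 19
```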
Next, the LSTM model is to be created using a deep learning framework
like Keras or PyTorch. The LSTM model will have an input shape matching
the sequence length (e.g., 3) and the number of features (e.g., 1 in this case).
The model will also have one or more LSTM layers followed by one or more fully
connected (dense) layers for prediction. The LSTM model is trained using the
prepared input and output pairs (X and Y ). During training, the LSTM learns to
capture the patterns and dependencies in the sequence data. The model optimizes
its internal parameters through the backpropagation algorithm to minimize the
prediction error (e.g., mean squared error).
After training, the LSTM model can be used to make predictions on new
sequences. To predict the next odd number in the sequence, the last three
elements are provided as input to the trained LSTM model. The LSTM processes
the input sequence, updates its internal state, and generates the predicted output.
For example, if the input sequence is [15, 17, 19], the LSTM should predict the
next number, which is 21. By training an LSTM on sequential data, it can learn to
recognize patterns and dependencies specific to the given task. In this case, the
LSTM learns the pattern of incrementing odd numbers and can make accurate
predictions based on that pattern.
Program 4.2 uses an RNN (specifically an LSTM) for sentiment classification
of text data. A small set of sample text is used having corresponding sentiment
labels: positive (1) and negative (0). The goal is to train the RNN to predict
the sentiment of new text data. At first, the text data is tokenized using the
Tokenizer class from keras. The tokenizer assigns a unique index to each
word in the vocabulary. Then, the text data is converted to sequences of inte-
gers using texts_to_sequences function. To ensure equal-length input
sequences, the sequences are padded using pad_sequences function. The
length of the longest sequence determines the maximum sequence length, and
all other sequences are padded or truncated accordingly.
The RNN model is created using the sequential API in keras. It consists of an
embedding layer, an LSTM layer, and a Dense output layer with a sigmoid acti-
vation function for binary classification. The model is compiled with the Adam
optimizer and binary cross-entropy loss. It is then trained on the padded sequences
and corresponding labels using the fit function. For testing, new text data is pro-
vided in the test_texts list. The same preprocessing steps (tokenization and
padding) are applied to the test data. The model then predicts the sentiment for
each test sample using predict function, and the predicted labels are determined
by applying a threshold of 0.5. Finally, the predicted sentiment labels are printed
for each test text.
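The tokenization and padding steps can be mimicked in plain Python. This is a toy stand-in for Keras' Tokenizer and pad_sequences, which additionally handle out-of-vocabulary words, truncation options, and more:

```python
import re

def texts_to_padded(texts, maxlen=None):
    """Lower-case, strip punctuation, map each word to a 1-based index,
    and left-pad every sequence with zeros to a common length."""
    index, seqs = {}, []
    for t in texts:
        words = re.findall(r"[a-z']+", t.lower())
        seqs.append([index.setdefault(w, len(index) + 1) for w in words])
    maxlen = maxlen or max(len(s) for s in seqs)
    padded = [[0] * (maxlen - len(s)) + s[-maxlen:] for s in seqs]
    return padded, index

padded, index = texts_to_padded(["This is a great movie!", "I dislike it."])
# padded -> [[1, 2, 3, 4, 5], [0, 0, 6, 7, 8]]
```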
model = Sequential()
model.add(Embedding(vocab_size, 16))  # embedding dimension chosen for illustration
model.add(LSTM(10))
model.add(Dense(1, activation='sigmoid'))
The output of Program 4.2 is given next; it displays the test texts and their corresponding sentiment predictions. A sentiment label of 1 indicates positive sentiment, whereas a sentiment label of 0 indicates negative sentiment.
According to the output, the program predicts a positive sentiment for the text
“This is a great movie!” and “It’s amazing!”. On the other hand, it predicts a nega-
tive sentiment for the text “I dislike it.”. These predictions are based on the train-
ing of the RNN model using the provided sample texts and their corresponding
sentiment labels.
a) Architecture Setup: The GAN consists of two neural networks: the generator
and the discriminator. The generator takes random noise as input and produces
synthetic data, while the discriminator takes both real and synthetic data as
input, and tries to classify them as either real or fake.
b) Training Data: The GAN is trained on a dataset consisting of real data sam-
ples, such as images, text, or audio that represents the target distribution.
c) Training Process:
• Initialization: The generator and discriminator networks are initialized with
random weights.
• Iterative Training: The training process alternates between two main steps:
(Figure: a noise source z feeds the generator G, which produces fake data x′; the discriminator D receives both x′ and real data X and outputs a true/false decision, with backpropagation updating both networks.)
Program 4.3 shows how the GAN model is used to generate synthetic images that
resemble the real images from the CIFAR-10 dataset. The generator network takes
random noise as input and learns to generate realistic images. The discriminator
network aims to distinguish between real images from the dataset and fake images
generated by the generator. The pictorial demonstration of the program is given in
Figure 4.15.
Initially, in Program 4.3, the dataset is loaded and the training images
(x_train) are preprocessed by scaling the pixel values between −1 and 1.
For defining the generator network, an input layer (generator_input) is
defined with a shape of (100,), representing a random noise vector. The noise
vector is passed through a Dense layer with ReLU activation, followed by a
reshape layer to transform it into a 4D tensor. Convolutional transpose layers
are added to upsample the tensor gradually, creating a generator output tensor
of shape (32, 32, 3). Finally, a generator model (generator) is created using the
Model class, with the input and output layers.
Next, for defining the discriminator network, an input layer (discrimi-
nator_input) is defined with a shape of (32, 32, 3), representing an image.
Convolutional layers are added to downsample the input tensor, extracting
features. The tensor is then flattened, followed by a Dense layer with sigmoid acti-
vation, giving the discriminator’s output. A discriminator model (discriminator) is
created using the Model class, with the input and output layers. The discriminator
model is compiled using binary cross-entropy loss, and the Adam optimizer with
a learning rate of 0.0002 and beta1 value of 0.5. Also, the discriminator’s trainable
parameter is set to False, freezing its weights during GAN training.
Now, the GAN model is defined by using an input layer (gan_input) defined with a shape of (100,), representing the random noise vector. The generator model
is called with the input layer to get the generator output. The discriminator
model is called with the generator output to get the GAN output. A GAN model
(gan) is created using the Model class, with the input layer and GAN output. The
GAN model is compiled using binary cross-entropy loss, and the Adam optimizer
with a learning rate of 0.0002 and beta1 value of 0.5.
Finally, the code proceeds to the training loop, where the GAN model is trained
for a specified number of epochs. In each epoch, the code iterates over batches
of real and generated images. For each batch, random noise is generated as input
for the generator. The generator generates fake images using the random noise
input. Real images are randomly selected from the CIFAR-10 dataset. Real and
fake images are concatenated along with their corresponding labels. The discrim-
inator is trained on this batch of real and fake images, using the concatenated data
and labels. New random noise is generated as input for the generator. Labels for
the generator are set to ones, aiming to maximize the discriminator’s error. The
GAN is trained by updating the generator’s weights using the random noise and
labels. The losses of the discriminator and GAN are printed for each epoch.
# Training loop
epochs = 100
batch_size = 128
steps_per_epoch = len(x_train) // batch_size
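The per-batch bookkeeping of this loop can be sketched with array stand-ins; the zeros and ones arrays below merely stand in for generator output and sampled CIFAR-10 images, and in the real program these arrays are fed to the discriminator and GAN training calls:

```python
import numpy as np

batch_size = 4
# Stand-ins for one iteration: sampled real images, noise, and generator output
real_images = np.ones((batch_size, 32, 32, 3), dtype="float32")
noise = np.random.normal(size=(batch_size, 100)).astype("float32")
fake_images = np.zeros((batch_size, 32, 32, 3), dtype="float32")

# Discriminator step: concatenate real and fake images, labelled 1 and 0
d_images = np.concatenate([real_images, fake_images], axis=0)
d_labels = np.concatenate([np.ones(batch_size), np.zeros(batch_size)])

# Generator step: fresh noise labelled as "real" so the GAN loss rewards fooling D
gan_noise = np.random.normal(size=(batch_size, 100)).astype("float32")
g_labels = np.ones(batch_size)
```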
The output of Program 4.3 will show the discriminator loss and GAN loss for
each epoch during the training process. Additionally, the generated output will be
printed for each epoch. Here is an example of how the output might look:
Epoch: 1, Discriminator Loss: 0.6522918949127197,
GAN Loss: 0.9615811114311218
Epoch: 2, Discriminator Loss: 0.5419158930778503,
GAN Loss: 1.1447433233261108
Epoch: 3, Discriminator Loss: 0.4025923316478729,
GAN Loss: 1.3756206035614014
...
Epoch: 100, Discriminator Loss: 0.1257862150669098,
GAN Loss: 3.3319873809814453
The goal of this training process is for the generator to improve its ability to
generate more realistic images over time, while the discriminator becomes better
at distinguishing between real and fake images. Through this adversarial process,
the generator and discriminator networks learn to improve iteratively, leading to
the generation of higher-quality synthetic images.
The discriminator loss and GAN loss values will vary depending on the dataset,
model architecture, hyperparameters, and the progress of the training process. The
goal is typically to see a decrease in the discriminator loss and an increase in the
GAN loss over the epochs. It is important to note that since the generator network
is generating images, the program does not explicitly display the generated images
in the provided code snippet. However, the code can be modified to save or display
the generated images during or after the training loop if desired.
4.2.4 Autoencoders
Autoencoders are a class of artificial neural networks used in unsupervised
learning and deep learning. They are designed to learn efficient representations
of input data by compressing it into a lower dimensional latent space and then
reconstructing the original data from this compressed representation. The goal of
an autoencoder is to replicate its input at the output layer while minimizing the
reconstruction error.
The architecture of an autoencoder typically consists of two main components:
an encoder and a decoder. The encoder takes the input data and maps it to a lower
dimensional latent representation, which captures the essential features of the
data. The decoder then takes this latent representation and reconstructs the orig-
inal input data. The encoder and decoder components are usually symmetrical,
and the middle layer represents the compressed latent space.
Figure 4.16 shows an example of an autoencoder that takes in a noisy input and
learns to create a compressed representation of the input data to reconstruct the
original data from that representation. This autoencoder is designed for denoising
images by using the encoder and the decoder. The encoder reduces the dimension-
ality of the input image to capture its essential features, while the decoder recon-
structs the clean image from the compressed representation. During the training
process, the noisy images are passed through the autoencoder’s encoder to obtain
the compressed latent representations. The loss is calculated between the noisy
input images and the reconstructed clean images. The loss function can be mean
squared error (MSE) or any other suitable image similarity metric. The loss is then
backpropagated through the decoder, and the model’s parameters are updated
using an optimization algorithm like stochastic gradient descent (SGD) or Adam.
The steps – forward pass, loss calculation, and backpropagation – are repeated for
multiple epochs, iterating over the training sets.
The performance of the trained denoising autoencoder is evaluated on a sepa-
rate test set. The image quality metrics, such as peak signal-to-noise ratio (PSNR)
or structural similarity index (SSIM), is measured to assess the denoising effective-
ness. Once the autoencoder is trained and evaluated, one can apply it to remove
noise from new images. Given a noisy image, pass it through the trained autoen-
coder’s encoder to obtain the compressed representation and then pass it through
the decoder to reconstruct the denoised image.
The training of an autoencoder involves minimizing the difference between the
input data and its reconstruction, typically using a loss function such as mean
Encoder Decoder
Figure 4.16 An autoencoder used for denoising images. Source: Alina Yudina/Adobe Stock.
squared error (MSE). This loss function measures the discrepancy between the
original input and the reconstructed output. By optimizing this loss function, the
autoencoder learns to capture the most important features of the data in the latent
representation. Autoencoders follow a series of steps for obtaining its output,
which is explained in detail next.
a) Data Preparation: First, you need to prepare your input data. This involves
collecting a dataset of examples that represent the data you want the autoen-
coder to learn from. The data can be in various forms such as images, text, or
numerical data.
b) Architecture Design: Decide on the architecture of the autoencoder.
Typically, an autoencoder consists of an encoder and a decoder. The encoder
takes the input data and maps it to a lower dimensional latent representation,
and the decoder reconstructs the original input from the latent representation.
The encoder and decoder can be designed using various types of neural net-
work layers, such as fully connected layers, convolutional layers, or recurrent
layers, depending on the nature of the input data.
c) Training Data Split: Split the dataset into a training set and a validation set.
The training set will be used to train the autoencoder, while the validation
set will be used to monitor the model’s performance during training and tune
hyperparameters.
d) Training Process:
• Forward Pass: Pass the training examples through the autoencoder’s encoder
to obtain the compressed latent representations.
• Loss Calculation: Calculate the loss between the input data and the recon-
structed output. Common loss functions used in autoencoders include mean
squared error (MSE) or binary cross-entropy, depending on the type of data.
• Backpropagation: Propagate the loss backward through the decoder to
update the model’s parameters. This is done using gradient descent
optimization algorithms such as stochastic gradient descent (SGD) or
Adam.
• Repeat: Repeat the forward pass, loss calculation, and backpropagation steps
for multiple epochs, iterating over the training set. During each epoch, the
model gradually learns to reconstruct the input data more accurately.
e) Hyperparameter Tuning: Adjust the hyperparameters of the autoencoder,
such as learning rate, batch size, or the number of hidden layers, based on the
performance on the validation set. This step helps improve the model’s gener-
alization ability and avoid overfitting.
f) Evaluation: Once training is complete, evaluate the performance of the trained
autoencoder using a separate test set. Calculate metrics such as reconstruction
loss, accuracy, or any other relevant evaluation metric based on the specific task
or domain.
g) Application of Autoencoder: After training and evaluation, you can use the
trained autoencoder for various purposes. For example, if an image autoen-
coder is trained, new images can be encoded into the latent space to obtain
their compressed representations or generate new images by sampling from
the learned latent space.
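As a toy illustration of steps (a)–(d), here is a linear autoencoder trained by plain gradient descent in NumPy. The data and dimensions are illustrative; a practical autoencoder uses nonlinear layers and a framework such as Keras, as in Program 4.4 below:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # toy dataset: 200 samples, 8 features
X[:, 4:] = X[:, :4]                  # built-in redundancy: only 4 true degrees of freedom

W_enc = rng.normal(scale=0.1, size=(8, 4))   # encoder: 8 features -> 4 latent dims
W_dec = rng.normal(scale=0.1, size=(4, 8))   # decoder: 4 latent dims -> 8 features

def reconstruction_mse(X, W_enc, W_dec):
    """Mean squared error between the input and its reconstruction."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

mse_before = reconstruction_mse(X, W_enc, W_dec)
lr = 0.01
for _ in range(500):
    Z = X @ W_enc                    # forward pass: encode to the latent space
    err = Z @ W_dec - X              # reconstruction error (decode minus input)
    W_dec -= lr * Z.T @ err / len(X)               # backpropagation: decoder gradient
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)   # backpropagation: encoder gradient
mse_after = reconstruction_mse(X, W_enc, W_dec)
```

Because the 8 features only span 4 independent directions, a 4-dimensional latent space is enough for the reconstruction loss to fall sharply during training.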
The above steps provide a general overview of how autoencoders work. There
are different types of autoencoders, such as sparse autoencoders, denoising
autoencoders, and variational autoencoders, which have additional steps or
modifications in their working process to achieve specific objectives or address
particular challenges.
Program 4.4 shows how the autoencoder model is used to denoise images using
the MNIST dataset that is available online. Initially, the dataset is divided into
training and testing sets, containing image data and corresponding labels. The
pixel values of the images are normalized between 0 and 1, by dividing them by
255.0. The images are reshaped from (28, 28) to a flattened vector of size 784
(28 × 28 = 784).
The autoencoder architecture is defined by setting the input dimension to 784,
corresponding to the flattened image vector size. The encoding dimension is set
to 32, which determines the size of the compressed latent space. An input layer is
created using input with the shape equal to the input dimension. A hidden layer
(encoder) is created using Dense with the encoding dimension and ReLU activa-
tion. An output layer (decoder) is created using Dense with the input dimension
and sigmoid activation. The model is created, taking the input layer and the output
layer as arguments.
Next, the autoencoder model is compiled with the Adam optimizer and the
binary cross-entropy loss function. The fit method is used to train the autoencoder
on the training data. The training is performed for a specified number of epochs
(50) with a batch size of 256. The autoencoder learns to minimize the reconstruc-
tion loss between the input and output.
Finally, after training, a subset of test images (10 examples) is encoded and reconstructed using the trained autoencoder. The autoencoder's predict method is used to reconstruct the selected test images. The original and reconstructed images are then displayed using matplotlib. The images are displayed
using imshow with the “gray” colormap.
ax.get_yaxis().set_visible(False)
plt.show()
Epoch 1/50
235/235 [==============================] - 4s 12ms/step - loss:
0.2761 - val_loss: 0.1916
Epoch 2/50
235/235 [==============================] - 3s 12ms/step - loss:
0.1713 - val_loss: 0.1543
... ...
... ...
Epoch 49/50
235/235 [==============================] - 3s 11ms/step - loss:
0.0927 - val_loss: 0.0915
Epoch 50/50
235/235 [==============================] - 3s 11ms/step - loss:
0.0927 - val_loss: 0.0915
1/1 [==============================] - 0s 87ms/step
Autoencoders, like any other machine learning model, have their own advan-
tages and disadvantages. Autoencoders can learn from unlabeled data without
requiring explicit class labels and are, therefore, used for unsupervised learning.
Autoencoders can also learn compact representations of high-dimensional data
by compressing it into a lower dimensional latent space. This makes them use-
ful for feature extraction and dimensionality reduction tasks, where the learned
representations can capture the most important features of the data. As seen in
the previous program, this model can also be used for image denoising. Lastly,
autoencoders can be used for anomaly detection, where they are trained on nor-
mal data and are capable of identifying unusual or anomalous patterns during the
reconstruction phase.
Exercises
1 2 4 1 4 0 1
0 0 1 6 1 5 5
1 4 4 5 1 4 1
4 1 5 1 6 5 0
1 0 6 5 1 1 8
2 3 1 8 5 8 1
0 9 1 2 3 1 4
Probabilistic Reasoning
Let us consider a throw of two dice that counts as a win if the sum of the two values shown is 10. The set of possible values the random variable X (the sum of the two dice) can take is:
X ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
Thus, a random variable can be defined based on the outcome of a process. The probabilities of a random variable's values need not all be the same. For instance, if the sum of the two dice has to be 2, then there is only one possibility – {(1, 1)}. However, if the sum has to be 10, then there are three possibilities – {(4, 6)}, {(5, 5)}, and {(6, 4)}.
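These counts are easy to verify by enumerating the 36-outcome sample space in Python:

```python
from itertools import product
from collections import Counter

# Enumerate the 36 equally likely outcomes of throwing two dice
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

p_sum_2 = counts[2] / 36    # only (1, 1)
p_sum_10 = counts[10] / 36  # (4, 6), (5, 5) and (6, 4)
```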
5.1.3 Independence
In probability theory, two or more variables are statistically independent if the occurrence of one does not affect the occurrence of the other(s). Two random variables X and Y are said to be independent if either of the following statements is true:
P(X|Y) = P(X)
P(X ∩ Y) = P(X) ⋅ P(Y)
Here, the two conditions are equivalent, and so, if one condition is met, the other
condition will also be met. If any one condition is not met, then the variables X and
Y are said to be dependent. The first statement concerns the conditional probability P(X|Y) and states that "the probability of X, given Y, equals the probability of X." The second statement, sometimes called the product rule, states that the probability of the intersection of X and Y is equal to the product of the probability of X and the probability of Y.
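Both conditions can be checked by enumeration for two fair dice, where the events "the first die shows 6" and "the second die shows 6" are independent:

```python
from itertools import product

outcomes = set(product(range(1, 7), repeat=2))   # sample space of two fair dice
A = {o for o in outcomes if o[0] == 6}           # event: first die shows 6
B = {o for o in outcomes if o[1] == 6}           # event: second die shows 6

def p(event):
    return len(event) / len(outcomes)

product_rule_holds = abs(p(A & B) - p(A) * p(B)) < 1e-12     # P(A ∩ B) = P(A)P(B)
conditional_holds = abs(len(A & B) / len(B) - p(A)) < 1e-12  # P(A|B) = P(A)
```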
● Risks: Risks, on the other hand, refer to the potential harm or negative conse-
quences associated with an event. It refers to the probability of occurrence of an
event or outcome. The range of risk is a decimal number between 0 and 1. It is
interesting to note that odds can be converted to risks, and risks can be converted
to odds using the following formulae:
Risks = Odds / (1 + Odds)
Odds = Risks / (1 − Risks)
While odds focus on the relative likelihood of an event happening compared to
not happening, risks focus on the potential negative outcomes or consequences
associated with the event. Both odds and risks are important considerations
in decision making, risk assessment, and understanding uncertainty. They are
commonly used in various fields, including gambling, finance, insurance, and
healthcare, to evaluate probabilities and make informed choices based on the
potential outcomes and associated risks.
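The two conversion formulae translate directly into Python; a minimal sketch (the helper names risk_from_odds and odds_from_risk are ours, not from Program 5.1):

```python
def risk_from_odds(odds):
    """Convert odds to risk: Risk = Odds / (1 + Odds)."""
    return odds / (1 + odds)

def odds_from_risk(risk):
    """Convert risk to odds: Odds = Risk / (1 - Risk)."""
    return risk / (1 - risk)

# Example: odds of 1/3 (one favorable outcome for every three unfavorable ones)
odds = 1 / 3
risk = risk_from_odds(odds)
print(risk)                  # odds of 1/3 correspond to a risk of 0.25
print(odds_from_risk(risk))  # converting back recovers the original odds
```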
Program 5.1 illustrates a simple Python code to define a sample space, which in this case is a standard 52-card deck. User-defined functions are used to calculate
the odds and risks based on a given event and the sample space. The odds are
calculated as the ratio of favorable outcomes to the total number of outcomes. The
risk is calculated as one minus the sum of event probabilities. Finally, the results
are printed, which provide insights into the odds and risks associated with drawing
a heart or a face card from the deck.
# Odds Calculation
def calculate_odds(event, sample_space):
    favorable_outcomes = len(event)
    total_outcomes = len(sample_space)
    odds = favorable_outcomes / (total_outcomes - favorable_outcomes)
    return odds

# Risk Calculation
def calculate_risk(event_probabilities):
    risk = 1 - sum(event_probabilities)
    return risk
The output of Program 5.1 is displayed below. The output of the program will
vary, since it involves random card selection. The program randomly selects
a card from the deck. In this example, the selected card is the 4 of diamonds.
Then, both the odds and the risk are calculated for the event of drawing a heart. The output 0.33 indicates that the odds of drawing a heart are 0.33, that is, 13 favorable outcomes against 39 unfavorable ones, and the output 0.75 indicates that the risk associated with drawing a heart is 0.75, i.e., a 75% chance of not drawing a heart.
Also, the odds and the risk are calculated for the event of drawing a face card (Jack, Queen, or King). The output 0.3 indicates that the odds of drawing a face card are 0.3, that is, 12 favorable outcomes against 40 unfavorable ones. Lastly, the output 0.769 indicates that the risk associated with drawing a face card is 0.769. This means that there is approximately a 76.9% chance of not selecting a face card from the deck.
Randomly selected card: ('4', 'Diamonds')
Odds of drawing a heart: 0.3333333333333333
Risk of drawing a heart: 0.75
Odds of drawing a face card: 0.3
Risk of drawing a face card: 0.7692307692307693
The output provides insights into the likelihood (odds) and potential negative
outcomes (risk) associated with the events of drawing a heart or a face card from
the deck of cards. Knowledge about probability helps us in making decisions on what is likely to occur, based on an estimate or on previously collected real-time data. A data analyst often uses probability distributions of data for various statistical analyses.

5.2 Four Perspectives on Probability
The expected value (or expectation) of a discrete random variable X is defined as

E(X) = Σ x · P(X = x)

where x ranges over the values that X can take, and P(X = x) represents the corresponding probability.
For example, consider rolling a fair six-sided die. The random variable X represents the outcome of a single roll. The possible values of X are 1, 2, 3, 4, 5, and 6, each with a probability of 1/6. To calculate the expected value of X, we sum the products of each value with its probability:

E(X) = (1 + 2 + 3 + 4 + 5 + 6) × (1/6) = 21/6 = 3.5

Therefore, the expected value of rolling a fair six-sided die is 3.5. This means that over a large number of rolls, the average outcome will tend to be very close to 3.5.
The expected value is a fundamental concept in probability theory and has var-
ious applications. It provides a way to summarize the central tendency or average
behavior of a random variable. It helps in decision making, risk assessment, and
estimating long-term outcomes. The expected value also serves as the basis for
other statistical measures, such as variance and covariance.
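The die-roll calculation above, together with a quick simulation, can be written as follows (an illustrative sketch, not one of the book's numbered programs):

```python
import random
from fractions import Fraction

# Exact expected value of a fair six-sided die: sum of value * probability
values = [1, 2, 3, 4, 5, 6]
expected = sum(v * Fraction(1, 6) for v in values)
print(expected)  # 7/2, i.e. 3.5

# Over a large number of rolls, the sample mean tends toward 3.5
random.seed(0)
rolls = [random.choice(values) for _ in range(100_000)]
print(sum(rolls) / len(rolls))  # close to 3.5
```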
Axiom 1: The first axiom states that the probability of an event is a nonnegative real number. This can be expressed as P(E) ≥ 0, where P(E) refers to the probability of an event E. A probability of 0 corresponds to an event that will never happen.
Axiom 2: The second axiom states that the probability of the sample space is 1. This can be expressed as P(S) = 1, where S denotes the set of all possible outcomes.
Axiom 3: If X and Y are mutually exclusive outcomes, then P(X∪Y ) = P(X) + P(Y ).
This indicates that if X and Y are mutually exclusive outcomes, then the proba-
bility of either X or Y happening is the sum of the probabilities of X happening
and Y happening.
The Bayes’ rule of probability is pictorially depicted in Figure 5.1. In the equation of Bayes’ theorem,

P(H|D) = P(D|H) · P(H) / P(D),

the probability P(D|H) is called the likelihood function, which assesses the probability of the observed data arising from hypothesis H. The likelihood function is a known quantity, as it expresses one's knowledge of how the data are expected to look if the hypothesis H is true. The probability P(H) is called the prior, which encodes one's knowledge or belief before the data are considered. The probability P(D) is called the evidence, which is determined by summing over all possible values of H, weighted by how strongly one believes in each value of H. The probability P(H|D) is called the posterior, which reflects the probability of the hypothesis after consideration of the data.
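These four quantities can be made concrete with a small numerical sketch. The two hypotheses and all numbers below are invented for illustration: a coin is either fair or biased towards heads, and a single head is observed.

```python
# Hypotheses: the coin is fair (H1) or biased towards heads (H2)
prior = {"fair": 0.5, "biased": 0.5}       # P(H): prior beliefs
likelihood = {"fair": 0.5, "biased": 0.8}  # P(D|H): P(heads | hypothesis)

# Evidence P(D): sum over all hypotheses, weighted by the prior
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Posterior P(H|D) = P(D|H) * P(H) / P(D)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)  # observing a head shifts belief towards "biased"
```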
5.3 The Principles of Bayesian Inference 169
(Figure 5.1: Bayes’ theorem combines the prior distribution and the likelihood of the observed data to produce the posterior distribution.)
In Program 5.2, the likelihood of observing heads given a fair coin is set to 0.6, and the likelihood of tails is calculated as 1 minus the likelihood of heads. Next, the coin flips are simulated using np.random.choice(). The code generates 10 coin flips and stores the results in the observed_data array. In this example, since a fair coin is assumed for data generation, heads ("H") and tails ("T") are produced with equal probabilities.
The Bayesian inference is then performed using a loop that iterates through
each observed coin flip. In the Bayesian inference section, an array posterior_heads is initialized to store the posterior probabilities of the coin landing
heads. The length of the array is set to num_flips+1 to account for the prior
probability as the first element. The posterior probabilities of the coin landing
heads are updated based on Bayes’ theorem, taking into account the prior
probability and the observed data. Finally, the code displays the observed data and
the posterior probabilities of the coin landing heads at each step of the Bayesian
inference process.
# Setup (as described above): ten flips, prior probability of heads 0.5
import numpy as np
num_flips = 10
prior_heads = 0.5
observed_data = np.random.choice(["H", "T"], size=num_flips)

# Define likelihoods
likelihood_heads = 0.6  # Likelihood of observing heads (given a fair coin)
likelihood_tails = 1 - likelihood_heads  # Likelihood of observing tails

# Bayesian Inference
posterior_heads = np.zeros(num_flips + 1)  # Posterior probabilities of heads
posterior_heads[0] = prior_heads  # Assign prior probability as the first element
for i in range(num_flips):
    if observed_data[i] == "H":
        # Update posterior probability based on Bayes' theorem
        posterior_heads[i+1] = (likelihood_heads * posterior_heads[i]) / \
            ((likelihood_heads * posterior_heads[i]) +
             (likelihood_tails * (1 - posterior_heads[i])))
    else:
        posterior_heads[i+1] = (likelihood_tails * posterior_heads[i]) / \
            ((likelihood_heads * (1 - posterior_heads[i])) +
             (likelihood_tails * posterior_heads[i]))
The output of Program 5.2 is displayed below. When the code is run, the output
will vary each time due to the random nature of the coin flips. It will display the
observed data and the calculated posterior probabilities of the coin landing heads
at each step of the Bayesian inference process. This code showcases the princi-
ples of Bayesian inference by updating prior probabilities based on observed data,
allowing us to make probabilistic estimations and revise our beliefs over time.
Observed data:['H' 'T' 'H' 'H' 'T' 'H' 'H' 'H' 'H' 'T']
Posterior probabilities of the coin landing heads:
[0.5 0.6 0.5 0.6 0.69230769 0.6
0.69230769 0.77142857 0.83505155 0.88363636 0.83505155]
Belief networks and Markovian networks are both graphical models used in probabilistic modeling and reasoning. While they share some similarities, they have distinct characteristics and are applied in different contexts. Belief networks utilize directed acyclic graphs (DAGs) to capture causal dependencies, while Markovian networks use undirected graphs to represent conditional independence. Belief networks excel in tasks involving Bayesian inference and decision making, while Markovian networks are well suited for modeling spatial or temporal dependencies in tasks such as image analysis and pattern recognition. The two network models are discussed next:
● Belief Network: Belief networks, also known as Bayesian networks, are graph-
ical models that represent probabilistic relationships among a set of variables.
They are based on the principles of Bayesian inference and utilize DAGs to depict
the dependencies between variables.
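To illustrate how a DAG factorizes a joint distribution, consider a hypothetical two-node network A → B. The structure and probabilities below are illustrative and are not the book's Program 5.3:

```python
# A two-node belief network: A -> B
p_a = 0.5                              # P(A = true)
p_b_given_a = {True: 0.7, False: 0.2}  # P(B = true | A)

# The DAG factorizes the joint distribution: P(A, B) = P(A) * P(B | A)
p_joint = {
    (a, b): (p_a if a else 1 - p_a) *
            (p_b_given_a[a] if b else 1 - p_b_given_a[a])
    for a in (True, False) for b in (True, False)
}

# Marginal P(B = true), obtained by summing out A
p_b = sum(p for (a, b), p in p_joint.items() if b)
print(p_b)  # P(B) = 0.5*0.7 + 0.5*0.2 = 0.45
```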
    # From the BeliefNetwork class:
    def visualize_network(self):
        G = nx.DiGraph()
        for node_name, node in self.nodes.items():
            G.add_node(node_name)
            for parent in node["parents"]:
                G.add_edge(parent, node_name)

    # From the MarkovianNetwork class (end of the transition-probability lookup):
        next_states = transition["next_states"]
        probabilities = transition["probabilities"]
        index = next_states.index(next_state)
        return probabilities[index]

    def visualize_network(self):
        G = nx.DiGraph()
        for state, transition in self.transitions.items():
            G.add_node(state)
            for next_state in transition["next_states"]:
                G.add_edge(state, next_state)
The output of Program 5.3 is given next. In this code, the probability of node
"A" is computed based on the evidence given as evidence["A"] = 0 (Node A
is false.) by calling the compute_probability method of the belief_net
object. The result is stored in the prob_A variable, and it is displayed as out-
put. Similarly, the probability of node "B" is computed based on the evidence
given as evidence["B"] = 0 (Node B is false). The result is stored in the
prob_B_given_A variable, and it is displayed as output. By executing this code,
you should see the probabilities of nodes "A" and "B" printed based on the given
evidence in the belief network.
Probability of A: 0.5
Probability of B given A: 0.7
(The program also draws two graphs, titled "Belief Network" and "Markovian Network", showing the network nodes and edges.)
5.4 Belief Network and Markovian Network 177
The above output also displays the graphical representation of the belief
network and Markovian network using network graphs. For this, a new method
visualize_network is added to both the BeliefNetwork and
MarkovianNetwork classes. These methods use the networkx library to
create a directed graph of the network and the matplotlib library to visualize the
graph.
(i) Learning Methods for Markovian Networks: The following are some
standard learning methods for Markovian Networks:
(a) Maximum Likelihood Estimation (MLE): MLE is a commonly used
method for learning Markovian networks. It involves estimating the
parameters (potential functions) of the model that maximize the likeli-
hood of the observed data. MLE aims to find the parameter values that
make the observed data most probable under the Markovian network.
(b) Conditional Random Field (CRF) Learning: Conditional random
fields are a specific type of Markovian network used for structured pre-
diction tasks. Learning CRFs involves optimizing the model parameters
to maximize the conditional likelihood of the observed outputs, given
the inputs. Various optimization algorithms, such as gradient descent or
convex optimization, can be used for CRF learning.
(c) Expectation–Maximization (EM) Algorithm: The EM algorithm is
an iterative optimization method used for learning Markovian networks.
It involves alternating between an expectation step (E-step) and a maxi-
mization step (M-step). The E-step computes the expected values of the
latent variables, while the M-step updates the model parameters based
on these expectations.
(d) Contrastive Divergence: Contrastive divergence is a learning algorithm
used specifically for training restricted Boltzmann machines (RBMs),
which are a type of Markovian network. It involves approximating
the gradient of the log-likelihood function using Gibbs sampling and
stochastic approximation techniques.
(ii) Learning Methods for Belief Networks: The following are some standard
learning methods for Belief Networks:
(a) Parameter Learning: Parameter learning in Belief networks aims to estimate the conditional probability distributions (CPDs) associated with each node. This involves using
observed data to infer the parameters of the CPDs. Various methods can
be employed, including maximum likelihood estimation (MLE), Bayesian
estimation, and Bayesian structure learning.
(b) Bayesian Structure Learning: Bayesian structure learning involves
inferring the structure of the Belief network, including the directed
edges between nodes. This learning method integrates prior knowledge,
such as expert opinions or domain-specific constraints, with observed
data to estimate the most likely network structure. Techniques such
as score-based methods (e.g., Bayesian scoring) and search-and-score
algorithms (e.g., Markov Chain Monte Carlo) can be employed for
Bayesian structure learning.
(c) Constraint-Based Methods: Constraint-based methods learn the
network structure by applying statistical independence tests to iden-
tify conditional independence relationships between variables. These
methods use independence-based criteria, such as the PC algorithm or
the Grow–Shrink algorithm, to construct a network structure that is
consistent with the observed data.
(d) Hybrid Methods: Hybrid learning methods combine different
approaches, such as parameter learning and structure learning, to
simultaneously estimate the parameters and infer the structure of Belief
networks. These methods leverage both data-driven approaches and
domain knowledge to learn accurate and interpretable network models.
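As a concrete illustration of MLE, estimating the transition probabilities of a simple Markov chain reduces to counting observed transitions. The sequence below is invented for illustration:

```python
from collections import Counter

# Observed state sequence
sequence = ["A", "B", "A", "A", "B", "B", "A", "B"]

# Count each observed transition (state_t, state_t+1)
counts = Counter(zip(sequence, sequence[1:]))
totals = Counter(sequence[:-1])

# MLE estimate: P(next | current) = count(current, next) / count(current)
mle = {(s, t): counts[(s, t)] / totals[s] for (s, t) in counts}
print(mle)
```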
The hidden Markov model (HMM) is a probabilistic model that captures tem-
poral dependencies in sequential data, where the underlying states or processes
are not directly observable. HMMs are widely used in various fields, including
speech recognition, natural language processing, bioinformatics, and more. In an
HMM, the system is modeled as a Markov process with hidden states that gener-
ate observed outputs. The model assumes that the observed outputs are influenced
by the underlying hidden states, but the hidden states themselves are not directly
observed. Instead, we observe a sequence of outputs that provide partial informa-
tion about the hidden states.
5.5 Hidden Markov Model 179
Also, the backward probabilities are computed for the observed sequence using
the compute_backward_probabilities method. The backward probabili-
ties represent the probability of observing the remaining part of the sequence given
being in a particular state at each time step. The backward probabilities are also
displayed as output.
Finally, the state probabilities are computed for the observed sequence using
the compute_state_probabilities method. The state probabilities repre-
sent the probability of being in each state at each time step, given the observed
sequence. The state probabilities are also displayed as output.
    return sequence                          # end of the sequence-generation method

    T = len(observations)                    # in compute_forward_probabilities
    alpha = np.zeros((T, len(self.states)))
    return alpha                             # end of compute_forward_probabilities

    return beta                              # end of compute_backward_probabilities
    return state_probs                       # end of compute_state_probabilities
    pos = nx.spring_layout(G)
    labels = {state: state for state in self.states}
    edge_labels = {(u, v): f"{prob:.2f}" for (u, v, prob) in
                   G.edges(data="weight")}
    nx.draw_networkx(G, pos, with_labels=True, labels=labels,
                     node_color="skyblue", node_size=1000, font_size=12,
                     edge_color="gray", arrows=True)
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels,
                                 font_size=10)

forward_probabilities = hmm.compute_forward_probabilities(observed_sequence)
print("Forward probabilities:\n", forward_probabilities)
The output of Program 5.3 is given below. In the output, the observed sequence
is a list of length 10, generated randomly by the HMM. The forward probabilities
represent the probability of being in each state at each time step, given the observed
sequence. In this example, we have a 10 × 2 matrix, where each row represents the
forward probabilities for each time step.
The output also prints the backward probabilities, which represent the proba-
bility of observing the remaining part of the sequence given, being in a particular
state at each time step. It is also a 10 × 2 matrix, where each row represents the
backward probabilities for each time step.
The state probabilities that are displayed at the end of the output represent the
probability of being in each state at each time step, given the observed sequence.
It is computed using the forward and backward probabilities. The result is a 10 × 2
matrix, where each row represents the state probabilities for each time step. By
analyzing the state probabilities, one can observe how the HMM assigns probabil-
ities to each state at each time step, based on the observed sequence.
The Python code adds a visualize_hmm method to the HiddenMarkovModel class that uses the networkx and matplotlib libraries to visualize the
HMM as a graph. After computing the state probabilities, the visualize_hmm
method is called to generate the graph representation of the HMM. The graph
shows the states as nodes and the transition probabilities as weighted edges. The
arrows indicate the direction of transitions.
Observed sequence: ['Rainy', 'Sad', 'Happy', 'Sad',
'Sad', 'Happy', 'Happy', 'Happy', 'Happy', 'Sad']
Forward probabilities:
[[0.00000000e+00 0.00000000e+00]
[2.22044605e-16 2.22044605e-16]
[1.95399252e-16 7.99360578e-17]
[3.37507799e-17 6.39488462e-17]
[9.84101689e-18 2.90967250e-17]
[1.48219215e-17 8.16413603e-18]
[1.09127996e-17 3.73802322e-18]
[7.30733518e-18 2.20666152e-18]
[4.79823939e-18 1.40647899e-18]
[7.84271833e-19 1.37001553e-18]]
Backward probabilities:
[[0.00154341 0.00207201]
[0.00507403 0.00462802]
[0.00690004 0.01008337]
[0.01858355 0.02387968]
[0.06643712 0.05156864]
[0.101632 0.07936 ]
[0.15488 0.12416 ]
[0.232 0.208 ]
[0.32 0.44 ]
[1. 1. ]]
State probabilities:
[[0. 0. ]
[0.52298511 0.47701489]
[0.62585086 0.37414914]
[0.29114471 0.70885529]
[0.30349193 0.69650807]
[0.69924818 0.30075182]
[0.78456311 0.21543689]
[0.78694319 0.21305681]
[0.71273528 0.28726472]
[0.36405163 0.63594837]]
(The graph visualization shows the two states, Rainy and Sunny, as nodes, with transition probabilities such as 0.40 labeling the weighted edges.)
It is to be noted that the observed sequence generated by the HMM in the pro-
gram will be different for each run of the code. The observed sequence is generated
randomly based on the probabilities defined in the HMM. Hence, the program will
display different output values at each run of the code.
HMMs find applications in various domains, such as speech recognition to
model phonemes or words, natural language processing, bioinformatics for
gene finding, protein structure prediction, and sequence alignment, finance for
modeling stock prices, market trends, and trading strategies, and robotics. In
summary, Hidden Markov Models are probabilistic models that capture temporal
dependencies in sequential data with hidden states. They provide a powerful
framework for modeling and analyzing sequential data, allowing for inference,
decoding, and learning tasks.
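The forward recursion that underlies such inference can be sketched in a few lines of NumPy. The two-state parameters below are made up for illustration and are not those of the book's HiddenMarkovModel class:

```python
import numpy as np

# Hypothetical two-state HMM: hidden states Rainy/Sunny, observations Sad/Happy
start = np.array([0.6, 0.4])       # initial state probabilities
trans = np.array([[0.7, 0.3],      # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.8, 0.2],       # P(observation | state)
                 [0.3, 0.7]])
obs = [0, 1, 0]                    # observed: Sad, Happy, Sad

# Forward algorithm: alpha[t, s] = P(obs[0..t], state_t = s)
alpha = np.zeros((len(obs), 2))
alpha[0] = start * emit[:, obs[0]]
for t in range(1, len(obs)):
    alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]

print(alpha.sum(axis=1)[-1])  # total probability of the observed sequence
```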
(d) Reward Function: A function that assigns a real value, known as a reward, to
each state–action pair or state transition. The reward represents the immediate
desirability or utility of being in a particular state or taking a specific action.
(e) Discount Factor: A value between 0 and 1 that determines the importance of
future rewards relative to immediate rewards. It helps in balancing immediate
rewards against long-term rewards, and influences the agent’s preference for
immediate gains or long-term planning.
● State Space: The agent can be in any of the nine grid cells, represented as S1,
S2, ..., S9.
● Action Space: The agent can take four actions: up (U), down (D), left (L), and
right (R).
● Transition Probabilities: When the agent takes an action, there is a 0.8 proba-
bility that it will move in the desired direction and a 0.1 probability of moving in
each of the perpendicular directions. For example, if the agent chooses to move
up from state S2, there is a 0.8 chance that it will end up in S1, 0.1 chance in S2
(staying in the same state), and 0.1 chance in S5 (moving left).
● Reward Function: The agent receives a reward of −1 for each step taken until
it reaches the terminal state, which is the goal. The terminal state has a reward
of +10.
import numpy as np

# Fragment: transition probabilities for state 1 under the four actions
transition_probs[1, 0, 2] = 1.0
transition_probs[1, 1, 1] = 1.0
transition_probs[1, 2, 3] = 1.0
transition_probs[1, 3, 1] = 1.0

# Fragment: inside value iteration, keep the best action found for each state
max_q_value = q_value
policy[s] = a

# Build grid views of the states and of the optimal actions
states = np.arange(num_states).reshape(grid.shape)
for i in range(grid.shape[0]):
    for j in range(grid.shape[1]):
        if grid[i, j] == -1:
            grid_with_states[i, j] = "X"
            grid_with_actions[i, j] = "X"
        elif grid[i, j] == 1:
            grid_with_states[i, j] = "G"
            grid_with_actions[i, j] = "G"
        else:
            grid_with_states[i, j] = f"S{states[i, j]}"
            if policy[states[i, j]] == 0:
                grid_with_actions[i, j] = "U"
            elif policy[states[i, j]] == 1:
                grid_with_actions[i, j] = "D"
            elif policy[states[i, j]] == 2:
                grid_with_actions[i, j] = "L"
            elif policy[states[i, j]] == 3:
                grid_with_actions[i, j] = "R"

print("States:")
print(grid_with_states)
print("Actions:")
print(grid_with_actions)
The output of Program 5.4 is displayed next. The output consists of two parts:
the grid representation with states (grid_with_states) and the grid represen-
tation with actions (grid_with_actions).
States: The original grid is displayed, where each state is labeled as "S" followed
by its state number. Obstacles are labeled as "X", and the goal state is labeled
as "G".
5.7 Machine Learning and Probabilistic Models 191
Actions: The grid is the same as the original grid, but each cell is labeled with the
action to be taken from that state. Here, "U" represents moving Up, "D" repre-
sents moving Down, "L" represents moving Left, and "R" represents moving
Right. Obstacles and the goal state remain the same.
The output displays the grid world environment along with the corresponding
states and actions based on the optimal policy determined through value iteration.
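The value-iteration update that produces such a policy can be sketched compactly. The two-state MDP below is a simplified, illustrative stand-in for the grid world of Program 5.4:

```python
import numpy as np

# Tiny illustrative MDP: 2 states, 2 actions. P[s, a, s'] and R[s, a] are made up.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

V = np.zeros(2)
for _ in range(500):
    # Value iteration: Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged value function
print(V, policy)
```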
"spam" or "not spam" class. The email is then classified based on the higher
probability.
The intersection of machine learning and probabilistic models in this example
allows for the quantification of uncertainty through probability estimates.
It enables the classifier to provide not only the predicted class label but also the
probability or confidence associated with that prediction, indicating how certain
or uncertain the model is about its decision.
Program 5.5 illustrates the Python code for email spam classification using the Naïve Bayes classifier. The email classifier can automatically classify incoming emails as either "spam" or "not spam" based on their content. The Spambase dataset is loaded from the UCI Machine Learning Repository using the provided URL. The dataset is then split into training and testing sets using train_test_split from scikit-learn. Next, the Naive Bayes classifier (MultinomialNB()) is created and trained on the training set. Predictions are made on the test set, and accordingly the accuracy of the classifier is calculated. Finally, the accuracy score and confusion matrix are printed to assess the performance of the spam classification model.
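The same pipeline can be reproduced in miniature without downloading the Spambase dataset. The toy messages below are invented for illustration; only the scikit-learn calls named above are used:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus standing in for the Spambase data
emails = ["win money now", "cheap pills win", "meeting at noon",
          "project status report", "win a free prize", "lunch meeting today"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

# Turn raw text into word-count feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train the Naive Bayes classifier and classify a new email
clf = MultinomialNB()
clf.fit(X, labels)
prediction = clf.predict(vectorizer.transform(["win free money"]))
print(prediction)  # [1] -> classified as spam
```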
The output of Program 5.5 is displayed next. The output consists of two parts:
the accuracy and the confusion matrix. The accuracy represents the percentage
of correctly classified instances (emails) out of the total instances in the test set.
The confusion matrix provides a detailed breakdown of the predicted and actual
labels, showing the true positives, true negatives, false positives, and false nega-
tives in a tabular format. By examining the accuracy and the confusion matrix,
you can gain a better understanding of how well the spam classification model is
performing and analyze the model’s predictions in more detail.
Accuracy: 0.7861020629750272
Confusion Matrix:
[[445 86]
[111 279]]
In summary, probabilistic models and machine learning are intertwined con-
cepts that leverage probability theory and statistical inference to learn from data,
make predictions, and quantify uncertainty. Machine learning algorithms can
be based on probabilistic models, and probabilistic models provide a principled
framework for handling uncertainty and performing inference in machine
learning tasks.
This chapter provides a comprehensive foundation for understanding and
applying probabilistic reasoning techniques in various domains. The chapter
aims to equip readers with the necessary knowledge and tools to reason under
uncertainty, model probabilistic relationships, and make informed decisions
based on available evidence.
Exercises
i) 0.3
ii) 0.7
iii) 0.43
iv) 1.33
b) In the classical perspective, the probability of drawing a heart from a
standard deck of 52 playing cards is:
i) 1/4
ii) 1/13
iii) 4/13
iv) 1/52
c) In Bayes’ theorem, P(A|B) represents:
i) The prior probability of event A.
ii) The prior probability of event B.
iii) The conditional probability of event A, given event B.
iv) The conditional probability of event B, given event A.
d) Bayes’ theorem involves the multiplication of:
i) The prior probabilities
ii) The posterior probabilities
iii) The prior and posterior probabilities
iv) The conditional probabilities
e) In a Markov network, cliques represent:
i) Random variables
ii) Evidence or observations
iii) Joint distributions
iv) Maximal fully connected subsets of nodes
f) In a belief network, the probability of a variable is calculated based on:
i) Its parents and their conditional probabilities
ii) Its children and their conditional probabilities
iii) Evidence or observations
iv) Random sampling from the network
g) The main assumption in a HMM is:
i) All hidden states are dependent on each other
ii) The observations are independent of each other
iii) The transition probabilities are constant over time
iv) The hidden states satisfy the Markov property
h) In an HMM, the emission probabilities represent:
i) The probabilities of transitioning between hidden states
ii) The probabilities of observing the hidden states
iii) The probabilities of transitioning between observable states
iv) The probabilities of observing the observable states
Population-Based Algorithms
Genetic algorithms (GAs) were invented by John Holland in the 1960s. Later, Holland, along with his colleagues and students, developed the concepts of the GA at the University of Michigan during the 1960s and 1970s. A GA is a metaheuristic inspired by Charles Darwin's theory of natural evolution. GAs are part of the larger class of evolutionary algorithms (EAs), which focus on selecting the fittest individuals for reproduction in order to produce offspring. Since the offspring inherit the characteristics of the parents, it is expected that if the parents have good fitness, the offspring will have better fitness than the parents and, in turn, a better chance of survival. If this process is repeated multiple times, at some point a generation of the fittest individuals will be formed.
(Figure: Population-based algorithms are broadly divided into evolutionary algorithms and swarm intelligence algorithms.)
(Figure: Flowchart of a genetic algorithm: Start, Population initialization, Parent selection, Crossover, Mutation, then a check of the terminating condition; if the condition is not met, the loop returns to parent selection, otherwise the optimal solution is displayed and the algorithm stops.)
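The flowchart above maps directly onto a short program skeleton. The fitness function (counting 1-bits) and all parameters below are illustrative choices:

```python
import random

random.seed(1)
POP_SIZE, GENES, GENERATIONS, MUTATION_RATE = 20, 10, 50, 0.01

def fitness(chrom):
    # Illustrative fitness function: count of 1-bits in the chromosome
    return sum(chrom)

# Population initialization
population = [[random.randint(0, 1) for _ in range(GENES)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):  # terminating condition: fixed generation count
    new_population = []
    for _ in range(POP_SIZE):
        # Parent selection: binary tournament (pick two, keep the fitter)
        p1 = max(random.sample(population, 2), key=fitness)
        p2 = max(random.sample(population, 2), key=fitness)
        # Crossover: single-point
        point = random.randint(1, GENES - 1)
        child = p1[:point] + p2[point:]
        # Mutation: flip each bit with probability MUTATION_RATE
        child = [1 - g if random.random() < MUTATION_RATE else g
                 for g in child]
        new_population.append(child)
    population = new_population

best = max(population, key=fitness)
print(best, fitness(best))  # the fittest individual found
```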
(Figure: A population of four chromosomes C1 to C4; each chromosome is a bit string, such as 1 0 0 1 1, and each bit position within it is a gene.)
The fitness proportionate selection can be implemented in many ways. Two such variations of the fitness proportionate selection method are the Roulette wheel selection and the Stochastic universal sampling.
● The Roulette wheel selection: In this method, a circular wheel is divided into
“n” pies (i.e., “n” chromosomes), and a fixed point is initially placed at one of
the pies. This is illustrated in Figure 6.4, which considers a total of five chromo-
somes – A1 to A5. The wheel is then rotated, and, once it stops, the fixed point is
checked to find on which pie it stopped. The region of the pie that gets selected by
considering the fixed point is then chosen as the parent chromosome (A1 in case
of Figure 6.5). The process is repeated to choose the next parent chromosome.
The idea behind using the concept of the Roulette wheel is that a larger slice of the pie is owned by a fitter individual, which therefore has a greater chance of ending up at the fixed point. Thus, each chromosome of the current population is selected with probability proportional to its fitness.
● The Stochastic universal sampling (SUS): This method is very similar to the
Roulette wheel selection method discussed above, except that the rotating wheel
possesses more than one fixed point instead of having just one fixed point. This is
illustrated in Figure 6.6, which considers a total of five chromosomes – A1 to
A5. In such a case, more than one parent gets selected by rotating the wheel
just once (A1 and A3 in case of Figure 6.6). This method leads to selection of
highly fit individuals at least once. SUS allows even the weaker members of the
population a chance to get selected. However, it can perform badly when one of
the members of the population has a significantly large fitness value compared
to the rest of the members of chromosomes in the population.
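Both fitness-proportionate variants can be sketched in a few lines of Python. The fitness values below are illustrative:

```python
import random

random.seed(42)
fitness = {"A1": 8, "A2": 4, "A3": 6, "A4": 1, "A5": 2}

def roulette_wheel(fitness):
    """Select one parent with probability proportional to fitness."""
    total = sum(fitness.values())
    pick = random.uniform(0, total)  # where the 'fixed point' lands
    cumulative = 0
    for chrom, f in fitness.items():
        cumulative += f
        if pick <= cumulative:
            return chrom

def stochastic_universal_sampling(fitness, n):
    """Select n parents using n equally spaced pointers and one spin."""
    total = sum(fitness.values())
    step = total / n
    start = random.uniform(0, step)
    pointers = [start + i * step for i in range(n)]
    chosen, cumulative, idx = [], 0, 0
    items = list(fitness.items())
    for p in pointers:
        # Advance to the slice of the wheel containing this pointer
        while cumulative + items[idx][1] < p:
            cumulative += items[idx][1]
            idx += 1
        chosen.append(items[idx][0])
    return chosen

print(roulette_wheel(fitness))
print(stochastic_universal_sampling(fitness, 2))
```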
(Figures: A Roulette wheel divided among chromosomes A1 to A5 with a single fixed point, where A1 is selected after a spin; and a wheel with two fixed points, where A1 and A3 are selected in a single spin.)
(Figure: Tournament selection over ten chromosomes A to J with fitness values 2, 5, 1, 8, 4, 9, 3, 1, 6, and 4; four chromosomes are selected at random, and the best parent, F (fitness 9), is chosen based on its fitness value.)
lowest fitness value, and so on. The probability p(x) of a chromosome “x” being selected using the rank selection method is given by:

p(x) = rank(x) / (n(n − 1))
Figure 6.8 shows an example of five chromosomes being sorted and ranked
according to their fitness value from rank 1 to rank 5.
The rank selection method considers only the rank values of the chromosomes rather than their absolute fitness. In turn, this method maintains selection pressure even when the fitness variance is low.
(Figure 6.8: Five chromosomes sorted according to their fitness values and assigned ranks.)

Chromosome   Fitness value   Rank
A5           34              5
A2           31              4
A4           27              3
A1           22              2
A3           18              1
(Figure: Single-point crossover of two parent bit strings; the bits after the crossover point are exchanged to produce two offspring.)
6.2.4 Crossover
The most significant phase is the crossover, which can be carried out in several
different ways. One way of mating of parents is by randomly choosing a crossover
point from within the genes. A new offspring is created by exchanging the genes
of parents up to the crossover point, which is then added to the population.
There are several crossover operators that can be chosen based on the nature of the problem being dealt with. Four such standard crossover genetic operators are discussed next. Whichever crossover operator is used, the basic idea is to combine the genetic information of the parents to produce new offspring.
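The first two operators, single-point and two-point crossover, can be sketched as follows (the parent bit strings are illustrative, not those of the figures):

```python
import random

random.seed(7)
parent1 = [1, 0, 0, 1, 1, 0, 1]
parent2 = [0, 1, 0, 1, 0, 1, 1]

# Single-point crossover: exchange the tails after a random crossover point
point = random.randint(1, len(parent1) - 1)
child1 = parent1[:point] + parent2[point:]
child2 = parent2[:point] + parent1[point:]

# Two-point crossover: exchange the segment between two crossover points
a, b = sorted(random.sample(range(1, len(parent1)), 2))
child3 = parent1[:a] + parent2[a:b] + parent1[b:]
child4 = parent2[:a] + parent1[a:b] + parent2[b:]

print(child1, child2)
print(child3, child4)
```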
(Figure: Two-point crossover; the segments lying between crossover point 1 and crossover point 2 are exchanged between the two parents to create two offspring.)
or not in the offspring. Hence, each bit of the offspring is independently chosen from one of the parents, rather than a continuous segment of the bit array being copied. This operation is explained with the help of the example given in Figure 6.11. The highlighted cells are the randomly chosen cells whose bit information is swapped to generate the new offspring.
● Partially mapped crossover: This operator initially follows a similar approach
as the two-point crossover in which two random crossover points are chosen,
and then the data between the crossover points are interchanged to get new
offsprings. This operation is explained with the help of an example given in
Figure 6.12, which initially considers two-point crossover and accordingly
defines four mappings: 4 ↔ 1, 5 ↔ 8, 6 ↔ 7, and 7 ↔ 6.
The rest of the gene values (initially marked as “x” in Figure 6.12 to indicate
currently unknown value) of the offsprings that do not fall between the crossover
points are filled by partial mapping. In case of partial mapping, the additional
gene values are filled from the original parent values for those “x” for which
there is no conflict. Usually, a conflict arises if the gene value matches one of the
values considered for mapping (4, 1, 5, 8, 6, and 7).
6.2.5 Mutation
The mutation phase is mainly carried out to maintain diversity in the genetic popu-
lation. After the generation of offsprings due to crossover, these offspring chromo-
somes can be mutated with an assigned mutation probability. For this, a randomly
[Figure 6.12: Partially mapped crossover of Parent 1 (1 2 3 4 5 6 7 8 9) and Parent 2 (4 5 2 1 8 7 6 9 3). The symbol "x" is to be interpreted as "currently unknown value."]
selected floating-point value is chosen, and the mutation process is carried out on
the offspring if this value is found to be less than the mutation probability; other-
wise, no mutation occurs. The following are the various mutation operators that
can be chosen for performing GA implementation:
● Bit flip mutation: This operator selects one or more random bits and flips those
bit values. This operation is mostly used for binary encoded GA. An example of
bit flip mutation is given in Figure 6.13 in which the highlighted cells indicate
the randomly selected bits in the bit array.
● Random resetting: This operator follows a similar principle as the bit flip
mutation, with the only difference that it works for integer representations.
The random resetting operation selects one or more random integer values and
resets each of them to some value within an acceptable range. An example of
random resetting mutation is given in Figure 6.14, in which the highlighted
cells indicate the randomly selected values in the integer array.
[Figure 6.13: Bit flip mutation, e.g. (0 1 1 0 0 1 1) → (0 1 0 0 1 1 1).]
[Figure 6.14: Random resetting, e.g. (1 1 3 2 1 2 3) → (1 3 3 2 1 4 3).]
[Figure 6.15: Swap mutation, e.g. (3 2 2 1 4 1 3) → (3 2 3 1 4 1 2).]
[Figure 6.16: Scramble mutation, e.g. (3 2 3 1 4 1 3) → (3 3 4 2 1 1 3).]
[Figure 6.17: Inverse mutation, e.g. (3 2 3 1 4 1 3) → (3 4 1 3 2 1 3).]
● Swap mutation: This operator randomly chooses exactly two gene values in a
chromosome and swaps those two values. An example of swap mutation is given
in Figure 6.15, in which the highlighted cells indicate the two chosen gene values
in the integer array.
● Scramble mutation: This operator follows a similar principle as the swap
mutation except that a subset of gene values is chosen instead of swapping only
two positions, and those values are randomly shuffled. An example of scramble
mutation is given in Figure 6.16 in which the highlighted cells indicate the
subset of gene values chosen for shuffling.
● Inverse mutation: This operator, like the scramble mutation, selects a subset of
genes, but the values of the entire subset are inverted from right to left instead
of being shuffled. An example of inverse mutation is given in Figure 6.17, in
which the highlighted cells indicate the subset of gene values chosen for the
mutation operation. The four gene values selected are directly inverted from
right to left: (2, 3, 1, 4) → (4, 1, 3, 2), and the rest of the gene values remain
unaltered.
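The mutation operators above can be sketched for string- or list-encoded chromosomes as follows (an illustrative sketch; the function names are our own):

```python
import random

def bit_flip(bits, i):
    """Flip the bit at index i (for binary-encoded chromosomes)."""
    return bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]

def swap_mutation(genes, i, j):
    """Exchange the gene values at positions i and j."""
    genes = list(genes)
    genes[i], genes[j] = genes[j], genes[i]
    return genes

def scramble_mutation(genes, start, end):
    """Randomly shuffle the genes in the slice [start, end)."""
    genes = list(genes)
    segment = genes[start:end]
    random.shuffle(segment)
    return genes[:start] + segment + genes[end:]

def inverse_mutation(genes, start, end):
    """Reverse the genes in the slice [start, end)."""
    genes = list(genes)
    return genes[:start] + genes[start:end][::-1] + genes[end:]
```

For example, inverse_mutation([1, 2, 3, 1, 4, 1], 1, 5) inverts the subset (2, 3, 1, 4) to (4, 1, 3, 2), matching the inverse mutation example above.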
Table 6.1 Initial population with fitness values.

Chromosome   Bit string     Fitness value
C1           1100100010     f(C1) = 4
C2           0101001110     f(C2) = 5
C3           1110011011     f(C3) = 7
C4           0100010011     f(C4) = 4
C5           1101101111     f(C5) = 8
C6           1010010000     f(C6) = 3

Table 6.2 Population after Roulette wheel selection.

Selected     Bit string     Alias
C3           1110011011     C1′
C5           1101101111     C2′
C2           0101001110     C3′
C5           1101101111     C4′
C3           1110011011     C5′
C1           1100100010     C6′
The total fitness of the entire population is 31. After the calculation of fitness
values of the initial population, the next task is to carry out selection. For this,
let us choose the Roulette wheel selection. Now, let us design the Roulette wheel
based on the six fitness values calculated. Cluster C5 has the highest fitness value of
8 out of a total fitness value of 31, which is equal to approximately 26%. Similarly,
for clusters C1 and C4, fitness percentage is 13% each. The fitness percentage is
15% for cluster C2, and it is 23% for cluster C3. Lastly, it is only 10% for cluster C6.
As can be seen in Figure 6.18, the area of each pie of the wheel is proportional to
the fitness of the corresponding individual.
The next task is to spin the Roulette wheel to perform selection. The chances of
fitter individuals being selected are high, as those individuals cover major portions
of the wheel. Suppose the spins of the wheel select the following individuals in
order: C3, C5, C2, C5, C3, and C1. This is tabulated in Table 6.2. The number of
spins equals the number of individuals in the population (6 in our example). The
selected clusters are given alias names for easy reference (C1′ to C6′).
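A minimal sketch of this Roulette wheel selection, using the six fitness values of the MAXONE example (the names are illustrative; the standard library call random.choices(names, weights=fits, k=6) achieves the same effect):

```python
import random

def roulette_select(chromosomes, fitnesses):
    """Spin the wheel once: each chromosome occupies a slice of the
    wheel proportional to its fitness value."""
    total = sum(fitnesses)
    spin = random.uniform(0, total)
    acc = 0.0
    for chrom, fit in zip(chromosomes, fitnesses):
        acc += fit
        if acc >= spin:
            return chrom
    return chromosomes[-1]

# Fitness values of the six clusters from the MAXONE example.
names = ["C1", "C2", "C3", "C4", "C5", "C6"]
fits = [4, 5, 7, 4, 8, 3]   # total fitness = 31
new_population = [roulette_select(names, fits) for _ in range(6)]
```

Over many spins, C5 (8/31, about 26% of the wheel) is selected most often and C6 (3/31, about 10%) least often, mirroring the pie areas of Figure 6.18.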
[Figure 6.18: Roulette wheel for the six clusters C1–C6, with the area of each pie proportional to the cluster's fitness percentage.]
6.3 How Genetic Algorithm Works? 211
[Figure 6.19: Two-point crossover (with crossover points 2 and 8) between the selected parents C1′ and C2′ and between C3′ and C4′, producing the offspring O1–O4.]
The next task is to perform crossover between each couple based on a prede-
fined crossover probability (say, pc = 0.6). This indicates that for a population
size of 6 (as in our example), any four individuals can be chosen for mating. Let
us then consider mating of C1′ with C2′ , and mating of C3′ with C4′ , as shown
in Figure 6.19. As discussed before, there are several crossover operations, and
let us choose two-point crossover by considering the crossover point values of
2 and 8.
The final step is to now apply mutation with a predefined mutation probability
based on the length of each bit string (say, pm = 1/t = 1/10 = 0.1). Before applying
mutation, the values of all the individuals are tabulated in Table 6.3. The bit flip
mutation is applied to each individual (cluster) in which one bit chosen at random
(as pm = 0.1) is flipped for each individual. However, mutation is applied only to
new offsprings and not to chromosomes that do not undergo crossover.
For this, let us consider that the first two clusters are randomly chosen to get
flipped at their third locus, and the fifth and sixth clusters are again randomly
chosen to get flipped at their sixth locus. Table 6.3 highlights those bit values that
are randomly flipped. After the flipping phase is over, one iteration of GA can be
considered to be complete, and the new population generated is now ready for the
second iteration.
As can be seen from the above example of the MAXONE problem, there has been
a huge change in the fitness value after completion of the first iteration itself.
The initial population had a total fitness value of 31, which rose to 41 after the first
generation of GA. Each of these phases is repeated many times until a stopping
criterion is met. Usually, a typical GA is run for 50 to 500 generations. Thus, the
MAXONE problem can be easily addressed by using GA to raise the number of 1s
in the final result.
GAs can be applied in various fundamental and important application areas for
solving optimization problems. Their relevance and prominence have been realized
in various real-life applications. Two such application areas in which GAs are
frequently used are the travelling salesman problem and the vehicle routing
problem, both briefly discussed in this section.
at the end. The problem is easy to state but not so easy to solve, especially when
there are many cities to be covered.
The first phase or task of GA is to represent individuals of the initial population,
i.e. possible solutions or tours in this case. In TSP, each city can be encoded as
a ⌈log2(n)⌉-bit string, so that each individual of the population (a complete
route) is a string of length n⌈log2(n)⌉. For simplicity, let us consider that only
five cities need to be covered by the salesman, namely C1, C2, C3, C4, and C5.
Each city can then be represented in 3-bit format, but only five of the eight bit
patterns will be used, namely 000, 001, 010, 011, and 100. The other bit patterns
will not represent any city, and hence the offspring generated may not lead to
legal tours after performing a crossover. For such concerns, the binary
representation of TSP is hardly used in practice.
It has been realized that the best way to represent a tour is by using path rep-
resentation. So, if we again consider five cities to be travelled, namely C1, C2,
C3, C4, and C5, the tour C1 → C3 → C5 → C4 → C2 → C1 can be represented as
(135421), in which each decimal number represents a city. If city x is placed in
yth position, this indicates that city x will be the yth city to be visited during the
tour. The total number of tours possible considering “n” cities will be (n − 1)!/2.
Thus, if five cities are considered, as in our example, the total number of tours is
4!/2 = 12. As the value of "n" increases, the number of possible solutions grows
rapidly. Choosing the
Table 6.4 Randomly generated weights for each city.

City   Randomly generated weight
C2     0.6
C3     0.8
C4     0.3
C5     0.5
best solution that decides the shortest path is thus a challenging problem that can
be easily addressed by GAs.
Now, let us first of all set the initial population of randomly generated tours. For
this, let us consider a two-dimensional array consisting of cities and randomly gen-
erated decimal numbers between 0 and 1 for representing the weights of each city.
The array will consist of (n − 1) city names (as one of the cities will be considered
as the starting point, say city C1) and their associated weight values. An example is
given in Table 6.4. Next, the weights can be arranged in ascending order to decide
a tour path. Based on values provided in Table 6.4, it can be considered that the
tour will be C1 → C4 → C5 → C2 → C3 → C1, which can be mentioned as (145231).
This process will be repeated “n” number of times to store “n” individual tour
information in the initial population, where “n” value has to be determined by the
user as an input parameter.
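The construction of a random tour from random weights, as described above, can be sketched as follows (the names are illustrative; a fresh set of weights is drawn for each individual in the initial population):

```python
import random

def random_tour(cities, start="C1"):
    """Assign a random weight in (0, 1) to every city except the
    starting one, then visit the cities in ascending order of weight,
    returning to the starting city at the end."""
    others = [c for c in cities if c != start]
    weights = {c: random.random() for c in others}
    ordered = sorted(others, key=lambda c: weights[c])
    return [start] + ordered + [start]
```

With the weights of Table 6.4 (C2: 0.6, C3: 0.8, C4: 0.3, C5: 0.5), ascending order gives exactly the tour C1 → C4 → C5 → C2 → C3 → C1, i.e. (145231).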
Once the initial population has been set, the next phase of GA is fitness calcula-
tion for finding the fitness value of each individual (tour) in the given population.
This can be easily done by summing up the costs of each pair of adjacent genes.
Since TSP is a minimization problem (the lower the cost function, the better will
be the possible solution), the fitness value will be calculated as:
fitness = 1/z
where z is the total path length of the tour. For instance, for the ordering of
tour (145231) mentioned above, the distance between each pair of genes (cities)
will help in calculating the fitness value. Thus, the distance between the cities
represented by "1" and "4" is first found and stored in a variable "sum"; then the
distance between the cities represented by "4" and "5" is computed and added to
the previous "sum" value, and so on. The tour's fitness value is ultimately derived
from the overall sum of the distances between 1 and 4, 4 and 5, 5 and 2, 2 and 3,
and, lastly, 3 and 1.
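A sketch of this fitness calculation follows (the distance lookup used here is hypothetical; any symmetric city-to-city distance table would do):

```python
def tour_length(tour, dist):
    """Sum the distances between consecutive cities on the tour;
    the tour list already ends at its starting city."""
    total = 0.0
    for a, b in zip(tour, tour[1:]):
        total += dist[(a, b)]
    return total

def tour_fitness(tour, dist):
    """TSP is a minimization problem, so the fitness is 1/z,
    where z is the total path length of the tour."""
    return 1.0 / tour_length(tour, dist)
```

A tour with total length 120 km thus gets fitness 1/120 ≈ 0.008, and one of 180 km gets 1/180 ≈ 0.006, as in the comparison of tours T1 and T2 below.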
6.4 Application Areas of Genetic Algorithms 215
It is obvious that the smaller the value of the sum (overall distance), the higher
will be the fitness of the individual (tour). If one tour T1 has an overall distance
of 120 (km) and another tour T2 has an overall distance of 180 (km), the fitness
values of T1 and T2 will be approximately 0.008 and 0.006, respectively. As the
fitness value of tour T1 is higher than that of tour T2, the chances of T1 being
selected for crossover are higher than those of T2. After the calculation of fitness
values of the initial population, the next task is to carry out selection. For this, let
us choose the Roulette wheel selection, which considers fitness value of each tour
and accordingly selects n/2 parents for a population size of “n.”
After the selection process, the next task is the crossover operation. The main
consideration to be made is that the new offspring generated should not result in
visiting of a city more than once. For this, any classical crossover operator, such
as one-point crossover or multipoint crossover, cannot be used to solve the TSP, as
these crossover operators may often lead to a penalty of selecting the same city for
a single tour. The better choice would be to create offsprings without repeating the
gene (city) values. For instance, let us consider two parents chosen for crossover
as given in Figure 6.21. By using this crossover operation, the offsprings generated
do not result in duplication of city values.
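The book's figure does not name its operator, but one standard crossover that creates offspring without repeating city values is the order crossover (OX); the following is an illustrative sketch of such a non-duplicating scheme:

```python
def order_crossover(p1, p2, start, end):
    """Order crossover (OX): copy the slice [start, end) from the
    first parent, then fill the remaining positions with the cities
    of the second parent in their original order, skipping any city
    already present. No city is ever duplicated."""
    child = [None] * len(p1)
    child[start:end] = p1[start:end]
    fill = [c for c in p2 if c not in child]
    for i in range(len(child)):
        if child[i] is None:
            child[i] = fill.pop(0)
    return child
```

For Parent 1 (1 2 3 4 5 6 7 8 9) and Parent 2 (4 5 2 1 8 7 6 9 3) with the slice [3, 6), the child is (2 1 8 4 5 6 7 9 3), with every city appearing exactly once.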
Lastly, mutation needs to be carried out for completing one generation of TSP
using GA. However again, a conventional mutation method that replaces a city
(gene) value with another randomly chosen city value will result in duplication of
city values, which is not permissible. To solve this issue, two random gene values
of an individual are selected, and these two values are swapped to maintain unique
city values per tour. An example of mutation for TSP is shown in Figure 6.22.
After the swapping in mutation phase is over, one iteration of GA can be consid-
ered to be complete, and the new population generated is now ready for the second
iteration. Each of the phases of GA is repeated many times until a stopping
[Figure 6.21: Crossover for TSP. The last three gene values of Parent 1 (1 2 3 4 5 6 7 8 9) and Parent 2 (4 5 2 1 8 7 6 9 3) are chosen for mating without duplicating any city.]
[Figure 6.22: Mutation for TSP. Two random gene values of the tour (1 2 4 5 7 8 9 3 6) are swapped, giving (1 2 3 5 7 8 9 4 6).]
criterion is met. The final result is expected to find one of the best solutions for
the classical TSP.
● Each vehicle may have a maximum capacity (weight or volume) of goods to
be carried. For instance, tankers that carry petrol have a limitation on the
amount of petrol being carried.
● Each vehicle may have a time period within which it must leave the depot or
return after providing the delivery service.
[Figure: Example vehicle routes serving destination points from Depot 1 and Depot 2.]
● The vehicles may follow working hours (providing rest to drivers during the
nonworking hours). Also, the departure and arrival timings of a vehicle from
the depot and back to the depot need to be maintained.
The complexity of the VRP can be tackled using the GA approach. In case of
VRP, each chromosome of the population represents a route, which can be a
possible solution to the VRP. There can be several variations of the VRP.
Let us consider one such variation called the varying-capacity single central depot
system for the VRP in which there are a number of destination points to be covered
by a fixed vehicle parked at a central depot. The vehicle has to follow the given
objectives and constraints:
Objectives
● Minimizing the distance travelled by each vehicle (finding the shortest route)
● Minimizing the total number of vehicles to be used for covering all routes
Constraints
● A vehicle has to start its journey from the central depot and should come
back to the central depot again after covering a route.
● Vehicle capacity constraint is taken into consideration.
[Figure: Flowchart of the GA approach to the VRP. Selection, crossover, and mutation are repeated until the terminating condition is satisfied, after which the optimal solution is displayed.]
and ending points of each route. Care should be taken while building the initial
population that every route must be feasible in terms of capacity constraint.
The various notations used for finding the solution of the problem are as follows:
yk : takes the value 0 or 1. When yk = 1, the vehicle k helps in delivering goods;
when yk = 0, the vehicle k is not involved in delivering goods.
yijk : takes the value 1 if the vehicle k passes through the edge (xi , xj ), and 0
otherwise
dij : the distance covered by edge (xi , xj )
qi : the quantity required by the customer at destination point xi
nk : the number of destination points covered by the vehicle k
pk : the total quantity of goods delivered by the vehicle k
cm : the total cost of one solution
Now, the basic target of single central-depot varying capacity vehicles in the
VRP is:
f (m) = min (cm )
where,
cm = Σk Σi Σj yk xij dij yijk   (i, j = 1, …, n)
For choosing the shortest route g(xi , xj ), the next destination point to be selected
should be:
next(xi ) = {xj | xij = 1, qj > 0}
Fitness evaluation: To evaluate the fitness value of each chromosome, it will be
measured as the total distance covered by the individual chromosome, the num-
ber of destination points covered by the chromosome, as well as the total quantity
dispatched by the vehicle (which should be less than or equal to the vehicle
capacity).
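A sketch of this fitness evaluation for a candidate VRP solution follows (all names are illustrative; the depot is labeled "x0" here and is excluded from the demand sum):

```python
def route_cost(routes, dist):
    """Total distance of a candidate VRP solution, where `routes` is a
    list of per-vehicle routes, each starting and ending at the depot."""
    total = 0.0
    for route in routes:
        for a, b in zip(route, route[1:]):
            total += dist[(a, b)]
    return total

def feasible(routes, demand, capacity):
    """Capacity constraint: the quantity dispatched on each route must
    not exceed the vehicle capacity."""
    return all(sum(demand[p] for p in route if p != "x0") <= capacity
               for route in routes)
```

A solution with a smaller total distance and a feasible load on every route is considered fitter, in line with the objectives and constraints listed above.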
Selection: In a traditional GA approach, a pair of individual chromosomes is
selected to reproduce offsprings. However, this traditional approach will fail to per-
form correctly in case of VRP, as each individual chromosome produces a complete
solution. If fragments of chromosomes are interchanged, neither of the parent
chromosomes will then probably cover all the destination points. Hence, a new
approach has to be used for crossovers of two chromosomes by selecting one of the
poorest quality chromosomes (based on quality evaluation function) and another
randomly selected chromosome from the rest of the population.
Crossover: For the crossover to occur for the VRP solution, the typical crossover
operators will fail to give a valid result, as it might not lead to a valid shortest route.
Next, a destination point is randomly chosen from a chromosome having poor-
est fitness quality, and it is simply removed or destroyed from that chromosome.
There can be two cases, based on which one of the following crossover operations is performed.
As mentioned before, g(xi , xj ) represents the shortest route between xi and xj . In the
example considered in Figure 6.25, g (x4 , x1 ) = x4 x7 x1 ; g(x3 , x6 ) = x3 x6 ; g (x1 , x4 ) =
x1 x7 x4 ; g(x3 , x1 ) = x3 x6 x1 .
Case I: If the removed destination point belongs to the collection of all destination
points in the selected chromosome, it will be accepted. In this case, the desti-
nation point x3 is picked as the poorest point that is making the chromosome
weak. This crossover operation is explained in Figure 6.25(a).
Case II: If the removed destination point does not belong to the collection of all
destination points in the selected chromosome, it will be inserted into
the selected chromosome using the shortest routes to keep the tour smooth. In
this case, the destination point x1 is picked as the poorest point that is making
the chromosome weak. This crossover operation is explained in Figure 6.25(b).
Mutation: The last step of GA is the mutation process. For this, the unlawful
offspring (those that violate any given condition of the VRP) produced after
crossover are split at the central depot. This results in the production of new
chromosomes. After mutation, a new population is generated, which is ready
for the next iteration of GA.
6.5 Python Code for Implementing a Simple Genetic Algorithm 221
[Figure 6.25: Crossover operation for VRP using Genetic Algorithm approach. (a) Case I for Crossover: the destination point x3 removed from the poorest chromosome already belongs to the selected chromosome (x0 x3 x4 x5 x2 x6 x0), which is accepted unchanged. (b) Case II for Crossover: the point x1 removed from the poorest chromosome (x0 x7 x4 x3 x1 x6 x0 → x0 x7 x4 x3 x6 x0) does not belong to the selected chromosome (x0 x3 x4 x5 x2 x6 x0), so it is inserted using the shortest routes, giving the offspring x0 x3 x6 x1 x7 x4 x5 x2 x6 x0.]
def random_char():
    """
    Return a random character between ASCII 32 and 126 (i.e. spaces,
    symbols, letters, and digits). All characters returned will be
    nicely printable.
    """
    return chr(random.randrange(32, 127))  # 32..126 inclusive
def random_population():
    """
    Return a list of POP_SIZE individuals, each randomly
    generated via iterating DNA_SIZE times to generate a string of
    random characters with random_char().
    """
    pop = []
    for i in range(POP_SIZE):
        dna = ""
        for c in range(DNA_SIZE):
            dna += random_char()
        pop.append(dna)
    return pop
# GA functions
# These make up the bulk of the actual GA algorithm.
def fitness(dna):
    """
    For each gene in the DNA, this function calculates the
    difference between it and the character in the same position in the
    OPTIMAL string. These values are summed and then returned.
    """
    fitness = 0
    for c in range(DNA_SIZE):
        fitness += abs(ord(dna[c]) - ord(OPTIMAL[c]))
    return fitness
def mutate(dna):
    """
    For each gene in the DNA, there is a small chance (the mutation
    probability) that it will be replaced with a random character.
    """
# Main driver
# Generate a population and simulate GENERATIONS generations.
if __name__ == "__main__":
    # Generate initial population. This will create a list of
    # POP_SIZE strings, each initialized to a sequence of random
    # characters.
    population = random_population()
    # ...
    fitness_val = fitness(individual)
    weighted_population.append(pair)
    # ...
    population = []
    # ...
    # Crossover
    ind1, ind2 = crossover(ind1, ind2)
The above Python code is run and tested to generate the fittest string after every
1000 iterations, and the output is generated as given in Figure 6.26.
6.6 Introduction to Swarm Intelligence 225
Figure 6.27 (a) A colony of ants. Source: Backiris/Adobe Stock. (b) A swarm of honey
bees. Source: garten-gg/Pixabay. (c) A school of fish. Source: Milos Prelevic/Unsplash.
(d) A flock of birds. Source: Ethan/Adobe Stock.
This characteristic is found in many insects, animals, and birds, such as ants, bats,
glow-worms, bees, monkeys, and lions. These agents (ants, bees, etc.) have limited,
unsophisticated individual capabilities, but a task becomes simple and easy when
they interact and cooperate with one another.
The social collaborations among the swarm agents can be of two types – direct
or indirect. In case of direct interactions, agents interact with each other through
chemical, audio, or visual contact. For example, honey bees perform the waggle
dance, which creates a buzzing sound, to communicate with one another while
searching for food. On the other hand, if interactions among swarm agents are
indirect, it is known as stigmergy. Indirect communications occur through the
environment in which one individual agent may create a change in the environ-
ment, and the rest of the agents adapt to such changes for an easy problem solving
or survival strategy. For example, termites coordinate among themselves through
the process of stigmergy to build complex nests for themselves. To do so, a ter-
mite initially creates a small mud ball from its environment and then deposits
pheromones on the mud-ball.
6.7 Few Important Aspects of Swarm Intelligence 227
Let us now examine some common aspects of swarm intelligence (Figure 6.29),
which are adapted from the common swarm behavioral features of biological
systems. These aspects are based on common behavioral characteristics of many
animals, insects, and birds that collectively manage to bring a stimulating task
to completion.
represents the energetic costs of the foraging trip. The number of dancing rounds
needed to be performed during this foraging trip is based on this quality value of
the food source.
6.7.3 Stigmergy
Stigmergy is the mechanism adapted by agents or individuals to indirectly interact
among themselves via the environment to achieve a particular task. The term
"stigmergy" was coined in 1959 by the French biologist Grassé and was
gradually adopted by biological swarm intelligence researchers. The principle
adapted by the agents in stigmergy is very simple. An agent usually leaves some
traces in the environment by carrying out an action that stimulates the other
agents to repeat similar action for completing the task.
Stigmergy is experienced in real-life mainly by social insects such as bees, ants,
and termites. For instance, when ants move in colonies in search of food, they
leave behind traces of pheromones on their way back once the food source is found.
Initially, the ants move in random directions, leading to different paths of
different lengths between nest and food source. However, once a shorter path is
realized, more pheromone is deposited on that route, allowing the rest of the ants
to follow the path with the stronger pheromone trail. This example indicates that
some tasks do not require explicit planning to be completed but rather depend on
the state of the medium. The medium is hence one of the most important
components of the process of stigmergy.
In general, stigmergic algorithms use the information provided by the medium,
and the behavior of the agents is managed accordingly. These non-intelligent
agents use very simple algorithms to exhibit collective intelligence.
The actions performed for completion of the task are repetitive, which enriches
the medium with information to allow collective behavior to complete the task
efficiently.
The concept of stigmergy has been adapted by swarm robots. One such instance
is burying RFID (Radio Frequency Identification) tags under the floor in the
form of grids, which act as stigmergic medium. The robots are installed with
RFID readers to access the information of the stigmergic medium. By doing so,
various interesting tasks can be completed by multiple robots. One such task is
building and storing stigmergic maps with the help of multiple robot cooperation.
Stigmergic maps are navigation maps that are stored in the environment through
the stigmergic medium.
and general workers. Specialized workers are given special tasks, which will not
be possible for the general workers to accomplish. A good example of division of
labour is exhibited by ants that form colonies to divide the work among them-
selves. Ant colonies have to undergo several different tasks such as foraging of
resources, maintaining nest, brood feeding, and defending the colony. Usually,
the forager ants perform the task of searching of food, while the nurse ants per-
form the task of feeding and tending the brood. In this way, the entire colony of
ants is divided to perform certain tasks to maintain equilibrium of work in the
environment.
A similar strategy of division of labor is encountered in “job shop” problem.
In this problem, the task is to complete a given number of jobs on a fixed number
of machines, considering time as one of the factors for on-time completion.
Algorithms for “job shop” problem allow each machine to bid for jobs by con-
sidering its internal threshold value. If a machine has a low threshold value, it is
more likely to bid for a job compared to a busy machine having higher threshold
value. The algorithm is set in such a way to maintain a balance for completion
of the number of jobs among machines by keeping a check on the work load of
each machine. However, there is no central controller to decide which machine
can perform a particular job. It is solely managed by each machine itself by proper
bidding system.
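The threshold-based bidding described above is commonly formalized in the swarm-intelligence literature as a response-threshold rule; the following sketch (the probability function and all names are assumptions, not from the book) lets a lightly loaded machine with a low threshold bid more readily than a busy machine with a high threshold:

```python
import random

def bid_probability(stimulus, threshold, steepness=2):
    """Response-threshold rule: the probability of bidding rises with
    the job stimulus and falls with the machine's internal threshold."""
    return stimulus**steepness / (stimulus**steepness + threshold**steepness)

def assign_job(stimulus, thresholds):
    """Each machine bids with its threshold-based probability; the job
    goes to a machine chosen among the bidders, or to the machine with
    the lowest threshold if no machine bids. There is no central
    controller deciding the allocation."""
    bidders = [m for m, t in thresholds.items()
               if random.random() < bid_probability(stimulus, t)]
    if bidders:
        return random.choice(bidders)
    return min(thresholds, key=thresholds.get)
```

Raising a machine's threshold while it is busy and lowering it when idle keeps the workload balanced across machines without any central coordination.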
● Pushing: This task was initially experimented for robots by using a bottom-up
approach to design the controller of a robot. Various issues were addressed to
solve the pushing problem, such as motion coordination, shape of the object
being pushed, and stagnation. One of the most common problems faced while
pushing an object by robots toward the destination is the collision problem. In
such cases, collision needs to be detected, and accordingly necessary actions are
needed to be taken for smooth movement. If the relative angle between a robot
and the object crosses a certain threshold (computed using odometry and an
omnidirectional camera), the robot realigns itself to a suitable position for the
correct angle of movement.
● Pulling: This task involves physical mechanisms to firstly connect a number
of robots to the object. The robots then pull the object toward the destination
using appropriate amount of strength. A common robot built using this tech-
nology is the S-bot. S-bots are simple robots equipped with a number of sensors
and motors that allow these robots to carry out collective tasks of movement of
objects.
● Caging: It is a special case of the pushing strategy in which several robots
assemble in an organized manner to cage or trap the object within the robot
formation. The complexity of this process depends on the characteristic features
of the object and the number of robots available for the task.
6.7.6 Self-Organization
Self-organization (SO) can be defined as “a process in which patterns at the global
level of a system emerge solely from numerous interactions among the lower level com-
ponents of the system.” SO is based on four basic rules: positive feedback, negative
feedback, randomness, and multiple interactions.
● Positive feedback (amplification): Positive feedback in self-organizing sys-
tem results in recurrent influence, which amplifies the initial change. A simple
example of positive feedback loop is the birth rate that increases a given popula-
tion for a species, and yet more births lead to a greater population, as illustrated
in Figure 6.30.
It is clear from the above example that when a positive feedback loop takes
place in nature, the product of a reaction causes an increase in that reaction.
[Figure 6.30: The positive feedback loop between births and population.]
[Figure 6.31: The positive feedback loop for clotting of wounded tissues. Wounded tissues release chemicals, the chemicals signal platelet activation, and the activated platelets release further chemicals until the wound is clotted.]
[Figure: The negative feedback loop between deaths and population.]
6.8 Swarm Intelligence Techniques 233
[Figure: Ants establishing a pheromone trail, shown in three scenes (a), (b), and (c).]
The shortest path is found easily: the ants on the shorter path return to the nest
faster, reinforcing its pheromone concentration, whereas the ants on the longer
paths take more time to return, so the pheromone trail on those paths evaporates
before it can be reinforced. At each fork, ants decide in favor of the shorter path
by the smell of the stronger pheromone concentration, which helps the colony
discover the shortest path between food source and nest in an optimized way.
This principle of shortest path finding by ants has been adapted in many
algorithms to solve optimization problems.
● Phase 2 – Local search: For every individual problem, a local search can
improve the constructed solution. However, it is an optional step, as it is highly
variable according to problems.
● Phase 3 – Pheromones update: Pheromone levels increase on promising
solutions and decrease on undesired solutions due to pheromone evaporation.
[Figure 6.35: Flowchart of the TSP using ACO. Each ant keeps selecting cities while more cities remain to be visited, then returns to the initial city; the pheromone level is updated using the tour cost for each ant, and once all ants are considered, the best tour is printed.]
A simple flowchart for the TSP using ACO is given in Figure 6.35. The pseu-
docode of the TSP algorithm is explained in Algorithm 6.2. In general, the “n”
cities to be covered are divided into “m” groups. Each group visits a city only once.
The two factors taken into consideration while computing the probability that a
city j will be selected by an ant k after visiting city i are:
● The quantity of pheromone trails distributed on the path
● The visibility of city j from city i
The probability of a city j being selected by ant k after visiting city i is computed
based on Eq. (6.1). Here, in the equation, unvisitedk is the set of cities not yet
visited by ant k. The updating rule for 𝜑ij per iteration is followed as given in
Eqs. (6.2)–(6.4).
begin
    Initialize ant positions
    For c = 1 to iteration number do
        For k = 1 to x do
            Repeat until ant k has completed a tour
                Select the city j to be visited next
                with probability pij given by Eq. (6.1)
            Calculate Lk
            Update the trail levels according to Eqs. (6.3)–(6.5).
end
The ACO algorithm uses a stopping criterion based on one or more of the
following conditions:
● A predefined maximum number of iterations has been executed
● a specific level of solution quality has been reached
● the best solution has not changed over a certain number of iterations
It is interesting to realize how the pheromone levels play a major role in path
finding. Pheromone evaporation leads to a stronger pheromone smell on shorter
edges and a weaker smell on longer edges.
Based on these facts, many optimization problems have been easily solved by
controlling the behavior of the virtual ants.
#Python Code#
#This source code is available on GitHub
import numpy as np
from numpy.random import choice as np_choice

class AntColony(object):
    def __init__(self, distances, n_ants, n_best, n_iterations,
                 decay, alpha=1, beta=1):
        self.distances = distances
        self.pheromone = np.ones(self.distances.shape) / len(distances)
        self.all_inds = range(len(distances))
        self.n_ants = n_ants
        self.n_best = n_best
        self.n_iterations = n_iterations
        self.decay = decay
        self.alpha = alpha
        self.beta = beta

    def run(self):
        shortest_path = None
        all_time_shortest_path = ("placeholder", np.inf)
        for i in range(self.n_iterations):
            all_paths = self.gen_all_paths()
            self.spread_pheromone(all_paths, self.n_best,
                                  shortest_path=shortest_path)
            shortest_path = min(all_paths, key=lambda x: x[1])
            if i % 10 == 0:
                print("Shortest Path - ", shortest_path)
            if shortest_path[1] < all_time_shortest_path[1]:
                all_time_shortest_path = shortest_path
            # pheromone evaporation (the original listing dropped the assignment)
            self.pheromone = self.pheromone * self.decay
        return all_time_shortest_path

    def spread_pheromone(self, all_paths, n_best, shortest_path):
        sorted_paths = sorted(all_paths, key=lambda x: x[1])
        for path, dist in sorted_paths[:n_best]:
            for move in path:
                self.pheromone[move] += 1.0 / self.distances[move]

    def gen_path_dist(self, path):
        total_dist = 0
        for ele in path:
            total_dist += self.distances[ele]
        return total_dist

    def gen_all_paths(self):
        all_paths = []
        for i in range(self.n_ants):
            path = self.gen_path(0)
            all_paths.append((path, self.gen_path_dist(path)))
        return all_paths

    def gen_path(self, start):
        path = []
        visited = set()
        visited.add(start)
        prev = start
        for i in range(len(self.distances) - 1):
            move = self.pick_move(self.pheromone[prev],
                                  self.distances[prev], visited)
            path.append((prev, move))
            prev = move
            visited.add(move)
        path.append((prev, start))  # going back to where we started
        return path

    def pick_move(self, pheromone, dist, visited):
        pheromone = np.copy(pheromone)
        pheromone[list(visited)] = 0  # never revisit a city
        row = pheromone ** self.alpha * ((1.0 / dist) ** self.beta)
        norm_row = row / row.sum()
        move = np_choice(self.all_inds, 1, p=norm_row)[0]
        return move

distances = np.array([[np.inf, 2, 2, 5, 7],
                      [2, np.inf, 4, 8, 2],
                      [2, 4, np.inf, 1, 3],
                      [5, 8, 1, np.inf, 2],
                      [7, 2, 3, 2, np.inf]])

ant_colony = AntColony(distances, 1, 1, 100, 0.95, alpha=1, beta=1)
shortest_path = ant_colony.run()
print()
print("Final Shortest Path: {}".format(shortest_path))
Shortest Path - ([(0, 2), (2, 3), (3, 4), (4, 1), (1, 0)], 9.0)
Shortest Path - ([(0, 2), (2, 3), (3, 4), (4, 1), (1, 0)], 9.0)
Shortest Path - ([(0, 2), (2, 3), (3, 4), (4, 1), (1, 0)], 9.0)
Shortest Path - ([(0, 2), (2, 3), (3, 4), (4, 1), (1, 0)], 9.0)
Shortest Path - ([(0, 1), (1, 4), (4, 2), (2, 3), (3, 0)], 13.0)
Shortest Path - ([(0, 2), (2, 3), (3, 4), (4, 1), (1, 0)], 9.0)
Shortest Path - ([(0, 2), (2, 3), (3, 4), (4, 1), (1, 0)], 9.0)
Shortest Path - ([(0, 2), (2, 3), (3, 4), (4, 1), (1, 0)], 9.0)
Shortest Path - ([(0, 1), (1, 4), (4, 2), (2, 3), (3, 0)], 13.0)
Shortest Path - ([(0, 1), (1, 4), (4, 2), (2, 3), (3, 0)], 13.0)
Final Shortest Path: ([(0, 2), (2, 3), (3, 4), (4, 1), (1, 0)], 9.0)
PSO algorithms mimic the social behavior of fish schooling and bird flocking.
Flocking behavior is commonly exhibited by flocks of birds while foraging for
food, and similar collective behavior is seen in herds of animals, swarms of
insects, and schools of fish. For instance, a flock of birds circles around an area
when it discovers a source of food. The birds closest to the food source chirp
the loudest, so that the rest of the flock can move toward that direction.
In this way, the cluster grows tighter and more compact, until the food is shared
by all. The PSO approach can be applied to bring a solution to
many applications such as the scheduling problem, sequential ordering problem,
vehicle routing problem, combinatorial optimization problem, power system
optimization problem, fuzzy neural networks, and signature verification.
[Figure 6.37 Velocity and position updates of a particle in the PSO algorithm: the current position p_k and current velocity v_k combine with the best particle performance p_i^best and the best swarm performance p_gBest to give the new velocity v_{k+1} and the new position p_{k+1}.]
the best past location of the whole swarm of particles. In each iteration, a particle's
velocity gets updated using the formula:

v_i(t + 1) = v_i(t) + c1 × rand() × (p_i^best − p_i(t)) + c2 × rand() × (p_gBest − p_i(t))    (6.5)
Here,
● vi (t + 1) is the new velocity of the ith particle
● c1 and c2 are positive constants, and represent the weighting coefficients for the
personal best and global best positions, respectively
● pi (t) is the ith particle’s position at time t
● p_i^best is the ith particle's best known position
● pgBest is the best position known to the swarm
● rand() is a function to generate a uniformly random variable between 0 and 1.
The values of c1 and c2 play a major role in deciding the search ability of the PSO
approach. High values of c1 and c2 generate new positions in relatively distant
regions of the search space. This makes the particles diverge in different
directions and ultimately leads to better global exploration. When the values of
c1 and c2 are small, there is less movement of particles, which ultimately leads to
a more refined local search. Also, when c1 > c2, the search behavior is more
prone to produce results based on particles' historically best experiences. Again,
when c1 < c2, the search behavior is more prone to produce results based on
the swarm's globally best experience. A particle's position is accordingly updated
using the newly computed velocity:

p_i(t + 1) = p_i(t) + v_i(t + 1)    (6.6)
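To make the two update equations concrete, here is a minimal single-step sketch in Python; the coefficient values and the fixed stand-ins for the two rand() draws are illustrative assumptions, not values from the text:

```python
# One velocity-and-position update for a single one-dimensional particle,
# following Eqs. (6.5) and (6.6). The rand() draws are fixed here so the
# arithmetic is reproducible.
c1, c2 = 2.0, 2.0        # weighting coefficients (illustrative values)
v, p = 0.5, 3.0          # current velocity and position of particle i
p_best_i = 2.0           # particle's best known position
p_gbest = 1.0            # swarm's best known position
r1, r2 = 0.5, 0.25       # stand-ins for the two rand() draws

v_new = v + c1 * r1 * (p_best_i - p) + c2 * r2 * (p_gbest - p)  # Eq. (6.5)
p_new = p + v_new                                               # Eq. (6.6)
print(v_new, p_new)  # -1.5 1.5
```

Since both best positions lie below the current position, the cognitive and the social terms both pull the velocity downward, moving the particle toward the promising region.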
According to the aforementioned Eqs. (6.5) and (6.6), the basic flow of the pseu-
docode of the PSO algorithm is explained in Algorithm 6.3. Also, the flowchart to
demonstrate the PSO technique is given in Figure 6.38. The basic PSO algorithm
consists of three main steps:
i. Evaluate fitness of each particle
ii. Update individual and global bests
iii. Update velocity and position of each particle
[Figure 6.38 (flowchart): start; initialize particles; evaluate fitness, update bests, and update velocity and position; repeat until the target or the maximum number of epochs is reached; stop.]
These steps are repeated until some stopping condition is met. The value of
velocity is calculated based on how far an individual's data is from the target:
the farther it is, the larger the value of velocity.
According to the pseudocode of the PSO algorithm, at the beginning the par-
ticles are initialized by randomly assigning each particle to an arbitrarily initial
velocity and an arbitrary position in each dimension of the solution space. Next,
the desired fitness function to be optimized is evaluated for each particle’s posi-
tion. Next, for each individual particle, update its historically best position based
246 6 Population-Based Algorithms
on the current fitness value. Also, update the swarm’s globally best particle that has
the swarm's best fitness value. The velocities of all particles are also updated using
Eq. (6.5). Each particle is moved to its new position using Eq. (6.6). All these steps
are repeated until a stopping criterion is met. As mentioned earlier, the criteria
to stop could be that the maximum number of allowed iterations is reached; a
sufficiently good fitness value is achieved; or the algorithm has not improved its
performance for a number of consecutive iterations.
iv. Makespan evaluation: Calculate the makespan for each permutation by using
job-based representation (the particle having the least makespan becomes
the personal best for that iteration).
v. Job-based representation: Construct a schedule according to the sequence
of jobs.
vi. Upgrade counter: Upgrade the counter to next iteration (k = k + 1).
vii. Upgrade inertia weight: Upgrade the inertia weight by the formula
wk = wk−1 × 𝛼, (𝛼: decrement factor).
viii. Update velocity value: The velocity is updated as given in Eq. (6.5).
ix. Update position: The position is updated as given in Eq. (6.6).
x. Change the sequence of jobs: This is done based on the updated particle
position.
xi. Check new personal best value: Find the new personal best and compare
with the previous personal best (if value is low, update it as personal best
value).
xii. Find the global best: The minimum value of personal best among all the
personal best gives the global best, and the arrangement of jobs that give the
global best will be adopted.
xiii. Check stopping criteria: Stop if the number of iterations exceeds the max-
imum number of iterations or the run exceeds the maximum CPU time.
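The inertia-weight decrement in step vii can be sketched in a few lines; the starting weight w = 0.9 and the decrement factor α = 0.98 below are illustrative assumptions:

```python
w, alpha = 0.9, 0.98   # illustrative starting inertia weight and decrement factor
ws = []
for k in range(5):     # five iterations of w_k = w_(k-1) × α
    w = w * alpha
    ws.append(round(w, 4))
print(ws)  # [0.882, 0.8644, 0.8471, 0.8301, 0.8135]
```

A gradually shrinking inertia weight shifts the search from global exploration in early iterations toward local refinement later on.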
PSO has many similarities with evolutionary computation techniques such as
GAs. However, unlike GA, the concept of PSO is simple and does not involve
evolution operators such as crossover and mutation. In PSO, the main objects
considered are particles that fly through the problem space by following the
current optimum particles. To conclude, PSO is easy to implement, as it
requires adjustment of only a few parameters. This is why PSO has been
successfully applied in many areas such as fuzzy system control, artificial
neural network training, function optimization, and several other areas where GA
can also be applied.
in each iteration, and accordingly the particle’s position is updated based on new
velocity updates.
#Python Code#
#This source code is available on GitHub
# Note: the printed listing was fragmentary; evaluate(), update_position(),
# the swarm setup, and the cost function below are reconstructed so that the
# example runs end to end.
from __future__ import division
import random

#--- COST FUNCTION ------------------------------------------------------+
# func1 is assumed to be the sphere function (it was not shown in the
# original listing); the minimum is f(0, 0) = 0.
def func1(x):
    return sum(xi ** 2 for xi in x)

#--- MAIN ---------------------------------------------------------------+
class Particle:
    def __init__(self, x0):
        self.position_i = []    # particle position
        self.velocity_i = []    # particle velocity
        self.pos_best_i = []    # best position individual
        self.err_best_i = -1    # best error individual
        self.err_i = -1         # error individual
        for i in range(0, num_dimensions):
            self.velocity_i.append(random.uniform(-1, 1))
            self.position_i.append(x0[i])

    # evaluate the current fitness
    def evaluate(self, costFunc):
        self.err_i = costFunc(self.position_i)
        # check whether the current position is an individual best
        if self.err_i < self.err_best_i or self.err_best_i == -1:
            self.pos_best_i = list(self.position_i)
            self.err_best_i = self.err_i

    # update the particle velocity (Eq. (6.5), with an inertia weight w)
    def update_velocity(self, pos_best_g):
        w = 0.5    # constant inertia weight
        c1 = 1     # cognitive constant
        c2 = 2     # social constant
        for i in range(0, num_dimensions):
            r1 = random.random()
            r2 = random.random()
            vel_cognitive = c1 * r1 * (self.pos_best_i[i] - self.position_i[i])
            vel_social = c2 * r2 * (pos_best_g[i] - self.position_i[i])
            self.velocity_i[i] = w * self.velocity_i[i] + vel_cognitive + vel_social

    # update the particle position (Eq. (6.6)), clamped to the bounds
    def update_position(self, bounds):
        for i in range(0, num_dimensions):
            self.position_i[i] = self.position_i[i] + self.velocity_i[i]
            if self.position_i[i] > bounds[i][1]:
                self.position_i[i] = bounds[i][1]
            if self.position_i[i] < bounds[i][0]:
                self.position_i[i] = bounds[i][0]

class PSO():
    def __init__(self, costFunc, x0, bounds, num_particles, maxiter,
                 verbose=False):
        global num_dimensions
        num_dimensions = len(x0)
        err_best_g = -1     # best error for group
        pos_best_g = []     # best position for group
        # establish the swarm
        swarm = [Particle(x0) for _ in range(num_particles)]
        # begin the optimization loop
        i = 0
        while i < maxiter:
            # evaluate fitness and update personal and global bests
            for j in range(0, num_particles):
                swarm[j].evaluate(costFunc)
                if swarm[j].err_i < err_best_g or err_best_g == -1:
                    pos_best_g = list(swarm[j].position_i)
                    err_best_g = float(swarm[j].err_i)
            # update velocities and positions
            for j in range(0, num_particles):
                swarm[j].update_velocity(pos_best_g)
                swarm[j].update_position(bounds)
            if verbose:
                print('iter: {0:>4d}, best solution: {1:10.6f}'.format(i, err_best_g))
            i += 1
        print('\nFinal best position: {}'.format(pos_best_g))
        print('Final best error: {}'.format(err_best_g))

#--- RUN ----------------------------------------------------------------+
initial = [5, 5]                 # initial starting location [x1, x2...]
bounds = [(-10, 10), (-10, 10)]  # input bounds [(x1_min, x1_max), ...]
PSO(func1, initial, bounds, num_particles=15, maxiter=30, verbose=True)
The above Python code is run and tested to generate the best solution for every
iteration. The number of iterations considered for the output is 30, which can be
altered by changing the parameter value. The snapshot of the output is displayed
in Figure 6.39.
Exercises
A) Choose the correct answer from among the alternatives given:
a) Which of the following is not found in Genetic Algorithms?
i) selection
ii) mutation
iii) load balancing
iv) crossover
b) A collection of genes form a/an ________
i) chromosome
ii) population
iii) offspring
iv) destination
c) The ___________ phase is mainly carried out to maintain diversity in the
genetic population
i) selection
ii) crossover
iii) mutation
iv) fitness calculation
d) In case of _________ mutation, a subset of gene values is chosen, and those
values are randomly shuffled.
i) bit-flip mutation
ii) scramble
iii) random resetting
iv) inverse
e) Which among the following is not considered as an input parameter for the
genetic algorithm?
i) Size of population
ii) Crossover probability
iii) Mutation probability
iv) Selection probability
f) _________ is a simulated program that exhibits swarm behavior.
i) Ark
ii) Boids
iii) Swark
iv) Crawl
g) __________ is the task of searching of food resources by animals and
insects.
i) Self-organization
ii) Stigmergy
iii) Foraging
iv) Food scheduling
h) The ___________ smell indicates the other worker ants about the presence
of food in a nearby area.
i) sweet
ii) pungent
iii) sour
iv) pheromone
i) Homeostasis is another example of __________ feedback in biological sys-
tems.
i) negative
ii) positive
iii) amplifying
iv) neutral
j) Which among the following is not one of the main phases of the ACO tech-
nique?
i) Edge construction
ii) Solution construction
iii) Local search
iv) Pheromone updates
k) Which among the following is not one of the main steps of the particle
swarm optimization algorithm?
i) Evaluate fitness of each particle
ii) Update individual and global bests
iii) Evaluate the weight of each particle
iv) Update velocity and position of each particle
l) The three different collective transport strategies used by swarm robots are:
i) Pushing, pulling, and caging
ii) Pushing, pulling, and grasping
iii) Moving, pulling, and throwing
iv) Pushing, throwing, and grasping
B) Answer the following questions:
1) Discuss, in detail, all the five phases of the genetic algorithm.
2) Explain any two types of:
   a) Crossover operator
   b) Mutation operator
   c) Selection operator
3) What is the role of the fitness function in a genetic algorithm? Explain with an
   example.
4) For the given two chromosomes, perform the:
   Chromosome 1: 6 5 4 1 3 8 9 7
   Chromosome 2: 3 7 1 9 4 8 2 5
   a) One-point crossover at the middle of the strings
Rough set theory, like the fuzzy logic theory, is a mathematical approach to deal
with imperfect knowledge. The concepts of rough sets were introduced in the field
of computer science in the year 1982 by a Polish mathematician and computer
scientist, Zdzisław I. Pawlak. Applications of rough sets are varied and can be
used in several areas such as machine learning, expert systems, pattern recogni-
tion, decision analysis, knowledge discovery from databases, image processing,
voice recognition, and many more. Rough sets adopt a nonstatistical approach for
data analysis and are used for analyzing and classifying imprecise, ambiguous,
or inadequate information and knowledge. The fruitful applications of rough set
theory in various domains of problems have rightly demonstrated its practicality
and versatility.
Chapter 7 provides an expansive overview of rough set theory, its core princi-
ples, and its applicability in real-world scenarios. The chapter underscores the
versatility of rough set theory in handling uncertainty and imprecision, making
it a valuable tool for data analysis and decision support across various domains.
By combining theoretical explanations, measures, and practical applications,
the chapter equips readers with a comprehensive understanding of rough set
theory and its significance. The fundamental concepts within the Pawlak rough
set model, measures of approximation, decision rules, and various application
areas where rough set theory proves valuable are also thoroughly covered in this
chapter.
Principles of Soft Computing Using Python Programming: Learn How to Deploy Soft Computing Models
in Real World Applications, First Edition. Gypsy Nandi.
© 2024 The Institute of Electrical and Electronics Engineers, Inc. Published 2024 by John Wiley & Sons, Inc.
256 7 Rough Set Theory
[Figure 7.1: The lower approximation, the boundary region, the negative region, and the set S.]
If the BR of a set S is found to be empty, the set is called a crisp set, and it is called
a rough set if the BR of the set S is nonempty. Rough sets are identified by
approximations, namely the lower approximation and the upper approximation,
as well as the BR (as illustrated in Figure 7.1). To understand the concept of
approximations, let us understand a few of the following basic terms connected
with the rough set.
Z̲(S) ⊆ Z̲(Z̄(S))
Z̲(S) ⊆ Z̄(Z̲(S))
Z-upper: Z̄(S) = {s ∈ U : [s]_Z ∩ S ≠ ∅}
A few of the properties that the upper approximation Z̄ satisfies for any two given
subsets, S and T, such that S, T ⊆ U, are stated below (Z̲ denotes the lower
approximation):
Z̄(U) = U, Z̄(∅) = ∅
Z̄(S) = ∼Z̲(∼S)
Z̄(S ∪ T) = Z̄(S) ∪ Z̄(T)
Z̄(S ∩ T) ⊆ Z̄(S) ∩ Z̄(T)
S ⊆ T ⇒ Z̄(S) ⊆ Z̄(T)
S ⊆ Z̄(S)
Z̲(Z̲(S)) ⊆ S
Z̲(Z̄(S)) ⊆ Z̄(S)
Z̄(Z̄(S)) ⊆ Z̄(S)
BR(S) = Z̄(S) − Z̲(S)
7.1 The Pawlak Rough Set Model 259
Object A1 A2 A3 A4 A5
Ob1 B C A B Y
Ob2 C A A B X
Ob3 A A B C Y
Ob4 C B A C X
Ob5 A A B C Z
Ob6 C A A B X
Ob7 A B C C X
Ob8 C B A C Z
Ob9 B C A B Y
Ob10 C A A B X
(j) Core: Core is the set of attributes that is common to all reducts formed for an
information system. In case of Table 7.1, the two reducts that can be formed are
R1 = {A1, A2, A5} and R2 = {A3, A4, A5}. Hence, core is the set of one attribute
{A5}, which is common to both reducts R1 and R2. If we remove the core set
of attributes from a reduct, it will hamper the equivalence class structures
formed from the information table. Hence, the core can also be regarded as the
set of indispensable attributes of an information system.
Example 7.1 Considering Table 7.2, for set S = {s|A3(s) = 1}, the accuracy and
roughness measures can be calculated as follows:
Accuracy(S) = |{Ob2, Ob3}| / |{Ob2, Ob3, Ob5, Ob6, Ob7, Ob8}| = 1/3

Roughness(S) = |{Ob5, Ob6, Ob7, Ob8}| / |{Ob2, Ob3, Ob5, Ob6, Ob7, Ob8}| = 2/3
Object A1 A2 A3
Ob1 1 1 2
Ob2 1 2 1
Ob3 1 3 1
Ob4 2 1 2
Ob5 2 2 2
Ob6 2 3 1
Ob7 2 2 1
Ob8 2 3 2
The accuracy measure represents the "degree of completeness of our knowledge
about the set X," whereas the roughness measure "represents the degree of
incompleteness." These two numerical measures of rough set theory help in
estimating the imprecision of the approximate characterization of a set.
To analyze data, rough set theory uses a simple data representation scheme called
the information system. In such a system, an information table is studied to
find the indiscernibility among objects. Each cell of the information table gives
the value of an object on an attribute. An information table can be expressed
in the form:

T = {U, A, V_a, I_a}

where U is the finite, nonempty set called the Universe, consisting of all objects
or entities; A is the nonempty set of all attributes; V_a is the nonempty set of
values for an attribute a ∈ A; and I_a is the information function that maps objects
or entities in U to values in V_a. Typically, in an information system, rows rep-
resent objects and columns represent attributes. If the information function I_a is
missing for one or more objects, the information table is said to be incomplete.
Example 7.2 Table 7.2 shows a simple information table of the form T = {U, A,
V_a, I_a}, where A = {A1, A2, A3} and U is the set of all objects, i.e., U = {Ob1, Ob2,
Ob3, Ob4, Ob5, Ob6, Ob7, Ob8}. V_A1 refers to the set of values for attribute A1 and
can be written as V_A1 = {1, 2}. Similarly, V_A2 = {1, 2, 3} and V_A3 = {1, 2}. The infor-
mation function I_A1(Ob1) = 1; similarly, I_A2(Ob3) = 3, I_A3(Ob4) = 2, and so on.
Now, if two attributes – A1 and A2 – are taken into consideration, such that
A = {A1, A2}, we get six equivalence classes – {Ob1}, {Ob2} {Ob3}, {Ob4}, {Ob5,
Ob7}, and {Ob6, Ob8} – in which each equivalence class has the same set of values
for all the considered attributes. Again, if we consider a set of only two attributes
such that A = {A1, A3}, then, we get only four equivalence classes – {Ob1}, {Ob2,
Ob3}, {Ob4, Ob5, Ob8}, and {Ob6, Ob7}. The selection of attributes for set A
becomes the deciding factor for the formation of equivalence classes. This leads
to an indiscernible relation, which is itself an equivalence relation formed by
the intersection of some equivalence relations. Two objects are indiscernible
(equivalent) if and only if they consist of the same set of values for every attribute
considered in set A. In Table 7.2, the set {Ob5, Ob7} is indiscernible in terms
of attributes {A1, A2}. Again, the set {Ob1, Ob2, Ob3} is indiscernible in terms of
attribute {A1}.
7.3 Decision Rules and Decision Tables 263
Therefore, the BR, the PR, and the NR for the above case will be:
BR(S) = Z̄(S) − Z̲(S) = {Ob5, Ob6, Ob7, Ob8}
PR = Z-lower: Z̲(S) = {Ob2, Ob3}
NR = U − Z̄(S) = {Ob1, Ob4}
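The approximations and regions for Table 7.2 can be verified with a short Python sketch; the dictionary encoding of the table and the helper names below are our own, not part of the original text:

```python
from collections import defaultdict

# Information table 7.2: object -> values of (A1, A2, A3)
table = {
    "Ob1": (1, 1, 2), "Ob2": (1, 2, 1), "Ob3": (1, 3, 1), "Ob4": (2, 1, 2),
    "Ob5": (2, 2, 2), "Ob6": (2, 3, 1), "Ob7": (2, 2, 1), "Ob8": (2, 3, 2),
}

def equivalence_classes(table, attrs):
    """Group objects that are indiscernible on the chosen attribute indices."""
    classes = defaultdict(set)
    for obj, values in table.items():
        classes[tuple(values[a] for a in attrs)].add(obj)
    return list(classes.values())

def approximations(table, attrs, target):
    """Return the (lower, upper) approximations of target w.r.t. attrs."""
    lower, upper = set(), set()
    for eq in equivalence_classes(table, attrs):
        if eq <= target:          # class fully contained in the target set
            lower |= eq
        if eq & target:           # class overlaps the target set
            upper |= eq
    return lower, upper

# Target set S = {s | A3(s) = 1}, approximated using A = {A1, A2}
S = {obj for obj, v in table.items() if v[2] == 1}
lower, upper = approximations(table, attrs=[0, 1], target=S)

boundary = upper - lower                  # BR(S)
negative = set(table) - upper             # NR(S)
print("Lower:", sorted(lower))            # ['Ob2', 'Ob3']
print("Boundary:", sorted(boundary))      # ['Ob5', 'Ob6', 'Ob7', 'Ob8']
print("Negative:", sorted(negative))      # ['Ob1', 'Ob4']
print("Accuracy:", len(lower) / len(upper))    # 2/6 = 1/3
```

The printed values match the accuracy and roughness computed in Example 7.1 and the regions derived above.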
A decision table is like an information table but contains two types of attributes –
the condition attribute and the decision attribute. In rough set theory, a decision
table is represented by T = {U, A, C, D}, where U is the nonempty finite Universe
set, A is the set of all attributes, C is the set of condition attributes, and D is the set
of decision attributes such that C, D ⊂ A. Table 7.3 illustrates a decision table for
customer information of a bank that have availed a loan. In the table, the condition
attributes are Marital_Status and Age_Group, and the decision attribute is
Loan_Paid.
From each row of a decision table, a condition can be formed by checking
whether the condition attributes are satisfied or not. However, many a time, deci-
sion rules having the same conditions may give different decisions. In such a case,
the decision rules are said to be conflicting or inconsistent. In case the decision
rules having the same conditions give the same decisions, the rules are said to be
nonconflicting or consistent. For example, for Customer_Id 102 and 106 (rows 2
and 6), the values of the condition attributes are the same (Married and 31-40),
but the decision attribute values of these two records differ: Yes for Customer_Id
102 and No for Customer_Id 106. Hence, the decision rule is conflicting in this case.
Decision rules are also called "if...then..." rules, as they can be written in
the form of “if...then” statements. For example, if we consider row 1 from Table 7.3,
it can be represented as an implication:
if (Marital_Status, Single) and (Age_Group, 31–40), then
(Loan_Paid, No)
Similarly, if we consider row 8 from Table 7.3, it can be represented as another
logical implication:
if (Marital_Status, Married) and (Age_Group, 31–40) then
(Loan_Paid, Yes)
Such a set of decision rules occurring in a decision table forms a decision algorithm.
as there are two inconsistent rules for Customer_Id 102 and 104 out of a total of
seven rules of the table.
Similarly, the property (7.2) can be proven from Table 7.4 by considering Rules 1,
5, and 7 (where the decision attribute value is “No”) for which the value of the
coverage factor sums up to 1.
∑_{q∈ψ} Cov_q(C, D) = 1    (7.2)
Formulae (7.3) and (7.4) together are referred to as the total probability theorem,
which relates to the strength of decision rules. Here, strength typically refers to
the level of dependency or significance between the conditions (antecedent) and
the class (consequent) of the rule. Strong rules have a higher level of certainty in
their conclusions, while weak rules might have a lower level of certainty.
𝜆(C(s)) = ∑_{q∈ψ} Strength_q(C, D)    (7.3)

𝜆(D(s)) = ∑_{q∈Ω} Strength_q(C, D)    (7.4)
Formulae (7.5) and (7.6) together are referred to as Bayes’ theorem, which is a
powerful tool for reasoning about uncertainty. Bayesian reasoning can be applied
to decision rules in order to incorporate new evidence and update probabilities.
Cert_s(C, D) = [Cov_s(C, D) · 𝜆(D(s))] / [∑_{q∈ψ} Cov_q(C, D) · 𝜆(D(q))] = Strength_s(C, D) / 𝜆(C(s))    (7.5)

Cov_s(C, D) = [Cert_s(C, D) · 𝜆(C(s))] / [∑_{q∈Ω} Cert_q(C, D) · 𝜆(C(q))] = Strength_s(C, D) / 𝜆(D(s))    (7.6)
It can be noticed that knowing the strength of a decision rule is good enough to
calculate its certainty and coverage factors.
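As a small sketch of how strength, certainty, and coverage relate, the following fragment computes them for a hypothetical set of loan records; the records are illustrative only and are not the book's Table 7.3 or 7.4. Strength is the fraction of all objects matching a rule, and dividing it by 𝜆(C(s)) or 𝜆(D(s)) yields the certainty and coverage factors, as in Eqs. (7.5) and (7.6):

```python
from collections import Counter

# Hypothetical records: (Marital_Status, Age_Group) form the condition C,
# and Loan_Paid is the decision D (illustrative values only).
records = [
    ("Single",  "31-40", "No"),
    ("Married", "31-40", "Yes"),
    ("Single",  "21-30", "Yes"),
    ("Married", "41-50", "Yes"),
    ("Single",  "31-40", "No"),
    ("Married", "31-40", "No"),
    ("Married", "41-50", "Yes"),
]

N = len(records)
cond_counts = Counter((m, a) for m, a, d in records)   # support of each condition
dec_counts = Counter(d for m, a, d in records)         # support of each decision
rule_counts = Counter(records)                         # support of each full rule

def strength(cond, dec):
    # fraction of all objects matching both the condition and the decision
    return rule_counts[(*cond, dec)] / N

def certainty(cond, dec):
    # strength / λ(C(s)): of objects matching the condition, how many
    # also match the decision
    return rule_counts[(*cond, dec)] / cond_counts[cond]

def coverage(cond, dec):
    # strength / λ(D(s)): of objects with this decision, how many this
    # condition explains
    return rule_counts[(*cond, dec)] / dec_counts[dec]

rule = (("Married", "31-40"), "Yes")
print(round(strength(*rule), 3), certainty(*rule), coverage(*rule))  # 0.143 0.5 0.25
```

Summing the coverage factor over all rules with the same decision gives 1, which illustrates property (7.2).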
7.4.1 Classification
Dimensionality reduction for complex datasets is usually handled by appropri-
ate feature selection to reduce the high complexity of data. Attribute reduction or
feature selection has always been considered as an important preprocessing step
for removing irrelevant data and increasing data accuracy. Feature selection for
dimensionality reduction can be targeted by finding a minimal subset of attributes
that provides the same classification as the entire dataset.
7.4 Application Areas of Rough Set Theory 269
The minimal subset, called a reduct in rough set theory, provides the minimal
relevant features for making a correct classification decision. However, finding
reducts for a dataset is considered an NP-hard problem, and researchers have
worked in this area to find near-optimal solutions. For instance, if
the total number of features is "n," the total number of candidate subsets will be
2^n, which results in an exhaustive search even for a moderate value of "n."
Hence, the fewer the features chosen, the faster the execution
time for classification (Figure 7.2).
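The exhaustive search over the 2^n candidate subsets can be sketched by brute force for a tiny, hypothetical information table (feasible only for small "n"; the table below is illustrative and is not Table 7.1):

```python
from itertools import combinations

# Tiny hypothetical information table: object -> values of (A1, A2, A3)
table = {
    "o1": ("a", "x", 1),
    "o2": ("a", "y", 1),
    "o3": ("b", "x", 2),
    "o4": ("b", "y", 2),
}
attrs = (0, 1, 2)

def partition(attr_subset):
    """Indiscernibility partition induced by a subset of attribute indices."""
    classes = {}
    for obj, values in table.items():
        classes.setdefault(tuple(values[a] for a in attr_subset), set()).add(obj)
    return frozenset(frozenset(c) for c in classes.values())

full = partition(attrs)

# Brute-force search over all 2^n - 1 nonempty attribute subsets
preserving = [set(s) for r in range(1, len(attrs) + 1)
              for s in combinations(attrs, r) if partition(s) == full]
# A reduct is a preserving subset with no proper preserving subset
reducts = [s for s in preserving
           if not any(p < s for p in preserving)]
core = set.intersection(*reducts)
print("Reducts:", reducts)   # [{0, 1}, {1, 2}] -> {A1, A2} and {A2, A3}
print("Core:", core)         # {1} -> attribute A2
```

For this toy table, two reducts exist and their intersection, the core, is the single attribute A2; removing it would collapse the equivalence class structure.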
In the feature selection process using rough set theory, every time a new subset
is generated, it is compared with the previous best subset generated. This process is
repeated until a predefined number of iterations are reached or a predefined num-
ber of features are selected. In the best case, the process is stopped when adding
or removing a feature does not yield a better subset of features. At the end, the
selected best features are validated using standard evaluation tests.
The Rough Set Exploration System (RSES) tool can be used for calculation of
reducts using the concept of rough set theory. RSES is a graphical user interface
(GUI)-based package and can perform the following sequence of steps, as shown
in Figure 7.3, based on the rough set approach to the analysis of data. Here is a general
guideline on how you might use RSES to calculate reducts:
(a) Download and install RSES: Obtain the latest version of RSES from its offi-
cial website or repository. Follow the installation instructions provided for
your operating system.
(b) Load your dataset: Open RSES and load the dataset you want to work with.
Typically, datasets are loaded from text files or CSV files.
(c) Define attributes and decision attributes: In RSES, specify which
attributes are condition attributes and which one is the decision attribute.
This will be necessary for reduct calculation.
[Figure 7.3 (sequence of steps): loading data; splitting data; calculation of reducts; inducing rules based on reducts; model classifier; model evaluation using cross validation.]
(d) Calculate reducts: To calculate reducts using RSES, follow these steps:
i. Select the “Attributes” or “Feature Selection” section in the RSES interface.
ii. Choose the dataset you loaded earlier.
iii. Specify the condition attributes and decision attribute.
iv. Select an algorithm for reduct calculation. RSES might offer different
algorithms such as Quick Reduct Algorithm (QRA) or Genetic Algorithm
(GA).
v. Initiate the reduct calculation process by clicking a "Calculate" or "Find
Reducts" button.
(e) View and analyze reducts: Once the reduct calculation is complete, RSES
will likely display a list of reducts or a single reduct, depending on the
algorithm used. You can analyze the reducts generated to understand
the minimal subsets of attributes that preserve the decision-making capability
of the original dataset.
(f) Save Results: You might have the option to save the calculated reducts or
analysis results for further reference.
(g) Experiment and Explore: RSES might offer additional features to explore
other rough set concepts, perform rule induction, analyze dependencies, and
more. Experiment with these features to deepen your understanding of rough
set theory.
Keep in mind that the steps provided are a general guideline based on how
similar tools work, and the actual interface and options might vary depending on
the specific version of RSES you are using. Always refer to the official documen-
tation or user guides provided by the RSES developers for detailed and accurate
instructions.
7.4.2 Clustering
The theory of rough sets has been widely accepted and used for classification,
and it has been equally found favorable in applications of clustering. Clustering
algorithms are mainly used in data mining and several other application areas,
such as marketing and bioinformatics. In most real-life practical cluster-
ing approaches, the clusters do not have crisp boundaries, as an object or node may
belong to more than one cluster. This results in overlapping clusters in which an
object may belong to more than one group. For instance, a particular symptom,
say headache, may belong to both the clusters of diagnoses: migraine and sinus.
Hence, the concept of rough sets can be incorporated into various existing standard
clustering techniques, as it can deal with overlapping objects efficiently. In this
section, we will discuss the application of rough set theory to k-means clustering,
as k-means clustering is one of the most frequently used clustering algorithms in
real-life domains.
In rough clustering, the upper and lower approximations necessarily follow some
of the basic rough set properties, such as:
i. An object v can be a member of at most one lower approximation.
ii. If an object v is a member of a lower approximation of a set, it will be also a
part of its upper approximation. This results in the lower approximation of the
set being a subset of its corresponding upper approximation.
iii. If an object v is not a member of any lower approximation, it will belong to
two or more upper approximations. This results in an object belonging to more
than one BR.
In k-means clustering, "k" clusters are formed from "n" objects, in
which the objects are represented by q-dimensional vectors. Initially, the cluster-
ing technique begins by randomly choosing "k" objects as the centroids of the
k clusters. An object is assigned to one of the "k" clusters by finding
the minimum value of the distance D(v, c) between the object vector v = (v1, v2,
v3, …, vx, …, vq) and the cluster vector c = (c1, c2, c3, …, cx, …, cq). After each
assignment pass, the centroid of each cluster c is recalculated as:

c_x = ( ∑_{v∈c} v_x ) / |c|, where 1 ≤ x ≤ q

where |c| is the size of cluster c. The assignment and centroid recalculation are
repeated many times until the centroids of the clusters stabilize. This is logically
realized when the centroid values are almost identical in both the previous
iteration and the current iteration.
Rough sets approach is incorporated in k-means clustering by including the con-
cepts of lower and upper approximations. In such a case, the centroids calculation
for k-means clustering using rough set approach is done as shown:
if Z̲(c) ≠ ∅ and Z̄(c) − Z̲(c) = ∅:
    c_x = ( ∑_{v∈Z̲(c)} v_x ) / |Z̲(c)|
else if Z̲(c) = ∅ and Z̄(c) − Z̲(c) ≠ ∅:
    c_x = ( ∑_{v∈(Z̄(c)−Z̲(c))} v_x ) / |Z̄(c) − Z̲(c)|
else:
    c_x = p_lower × ( ∑_{v∈Z̲(c)} v_x ) / |Z̲(c)| + p_upper × ( ∑_{v∈(Z̄(c)−Z̲(c))} v_x ) / |Z̄(c) − Z̲(c)|
where 1 ≤ x ≤ q. The two parameters p_lower and p_upper relate to the importance
of the lower and upper approximations, respectively, and the sum of p_lower and
p_upper is 1. While conducting experiments, various pairs of values of
(p_lower, p_upper) can be applied, such as (0.60, 0.40) and (0.70, 0.30).
After conducting the experiment several times with varying values of p_lower and
p_upper, the process can be stopped when the resulting intervals provide good
representations of the clusters. However, the above centroid calculation reduces to
the conventional k-means centroid calculation when the lower and upper
approximations are equal, as in the first condition when Z̄(c) − Z̲(c) = ∅.
The next step to be followed for k-means clustering using rough set approach
is to find whether each object belongs to either lower approximation or upper
approximation of a cluster. An object “v” will be assigned to the lower approximation of a cluster when its distance to the center of that cluster is the smallest compared with its distances to the remaining cluster centers.
Let us consider the distance D(v, cx ) between a vector v and the centroid of a
cluster cx . Now, to determine the membership of object v, the following two steps
are to be followed:
i. Find the nearest centroid (refer to Figure 7.4) by considering:
$$D_{min} = D(v, c_x) = \min_{1 \le j \le k} D(v, c_j)$$
ii. Check if further centroids are not significantly farther away than the closest
one. For checking this, let us consider T = {u: D(v, cu )/D(v, cx ) ≤ threshold and
u ≠ x}. In such a case:
a. If T = 𝜙, then there is no other centroid that is similarly close to object v.
b. If T ≠ 𝜙, then there is at least one other centroid that is similarly close to
object v.
7.4 Application Areas of Rough Set Theory 273
Figure 7.4 Object v and its distances (such as D1) to the cluster centroids.
By considering the above two steps, the following rule can be applied for an
object v to the approximations:
if T ≠ 𝜙:
$v \in \overline{Z}S(c_u)\ \forall\, u \in T$, and $v \in \overline{Z}S(c_x)$
else:
$v \in \underline{Z}S(c_x)$ and $v \in \overline{Z}S(c_x)$
The rough clustering approach forms clusters in which an object can belong to more than one cluster. The rough k-means described above thus augments the traditional k-means clustering technique by accommodating real-life instances of cluster formation in which objects do belong to more than one cluster.
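The two-step assignment and the weighted centroid update described above can be sketched in Python. This is a minimal illustration rather than the book's code: the function name, the use of Euclidean distance, and the parameter values are all assumptions.

```python
import math

def rough_kmeans_step(objects, centroids, p_lower=0.7, p_upper=0.3,
                      threshold=1.3):
    """One rough k-means iteration: assign objects to lower/upper
    approximations, then recompute centroids with the weighted rule."""
    k, dim = len(centroids), len(objects[0])
    lower = [set() for _ in range(k)]
    upper = [set() for _ in range(k)]
    for idx, v in enumerate(objects):
        dists = [math.dist(v, c) for c in centroids]
        x = min(range(k), key=lambda j: dists[j])
        # T: other centroids that are almost as close as the nearest one
        T = [] if dists[x] == 0 else [
            u for u in range(k)
            if u != x and dists[u] / dists[x] <= threshold]
        if T:                            # shared object: upper approximations only
            upper[x].add(idx)
            for u in T:
                upper[u].add(idx)
        else:                            # object firmly belongs to cluster x
            lower[x].add(idx)
            upper[x].add(idx)

    def mean(ids):
        return [sum(objects[i][d] for i in ids) / len(ids)
                for d in range(dim)]

    new_centroids = []
    for j in range(k):
        boundary = upper[j] - lower[j]
        if lower[j] and not boundary:    # lower approximation only
            new_centroids.append(mean(lower[j]))
        elif boundary and not lower[j]:  # boundary objects only
            new_centroids.append(mean(boundary))
        elif lower[j]:                   # weighted combination of both
            ml, mb = mean(lower[j]), mean(boundary)
            new_centroids.append([p_lower * a + p_upper * b
                                  for a, b in zip(ml, mb)])
        else:                            # empty cluster: keep the old centroid
            new_centroids.append(list(centroids[j]))
    return new_centroids, lower, upper

objects = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
centroids, lower, upper = rough_kmeans_step(objects, [(0, 0), (10, 10)],
                                            threshold=1.5)
```

Here the object (5, 5) is equidistant from both centroids, so it enters only the upper approximations of both clusters and influences each centroid with weight pupper.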
The rough set approach builds a pattern-matching system for these MRI images to provide a meaningful diagnosis.
The concept of decision rules and decision trees, as discussed in Section 7.3, can
also be applied in case of medical data analysis. For this, let us consider the deci-
sion table as given in Table 7.7. In the table, the condition attributes are {Headache,
Temperature, Heartbeat}, and the decision attribute is {Flu}.
Now if we consider a relation A as:
A = {Headache, Temperature, Heartbeat}, then,
Indiscernibility(A) = IR(A) = {{P1, P8}, {P2, P3}, {P4}, {P5, P6}, {P7}}.
Again, if we consider the set S = {s | Flu(s) = “Yes”}, we have S = {P1, P2, P3, P7}. In this case, the lower and upper approximations will be:
Z-lower ∶ ZS = {P2, P3, P7}
Z-upper ∶ ZS = {P1, P8, P2, P3, P7}
Therefore, the BR, the PR, and the NR for the above case will be:
BR(S) = ZS − ZS = {P1, P8}
PR = Z-lower ∶ ZS = {P2, P3, P7}
NR = U − ZS = {P4, P5, P6}
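The approximations and regions above can be verified with a few lines of Python. This is a small helper written for this example; the function name is not from the text.

```python
def approximations(universe, partition, target):
    """Lower/upper approximations and the BR/PR/NR regions of a target
    set, given the partition induced by the indiscernibility relation."""
    lower, upper = set(), set()
    for block in partition:
        if block <= target:        # block lies wholly inside the target set
            lower |= block
        if block & target:         # block overlaps the target set
            upper |= block
    return {"lower": lower, "upper": upper,
            "BR": upper - lower,                 # boundary region
            "PR": lower,                         # positive region
            "NR": set(universe) - upper}         # negative region

# the flu example: IR(A) = {{P1,P8},{P2,P3},{P4},{P5,P6},{P7}}
U = {f"P{i}" for i in range(1, 9)}
IR_A = [{"P1", "P8"}, {"P2", "P3"}, {"P4"}, {"P5", "P6"}, {"P7"}]
S = {"P1", "P2", "P3", "P7"}
regions = approximations(U, IR_A, S)
```

Running this reproduces the sets derived above; for instance, regions["BR"] is {P1, P8}.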
We have also studied in the previous section of this chapter that a decision table,
as a whole, may be considered as consistent or inconsistent, based on whether all
the decision rules formed from the decision table are consistent or inconsistent.
This is found by measuring the consistency factor Con(C, D), where C refers to
the condition attributes, and D refers to the decision attributes. From Table 7.5,
we can calculate that
Con(C, D) = 6∕8 = 0.75
as there are two inconsistent rules, for patients P1 and P8, out of a total of eight rules in the table.
Again, considering the condition attributes as {Headache, Temperature, Heart-
beat} and the decision attribute as {Flu}, the value of support for the decision
Table 7.5 will be:
Supports(C, D) = 4
This value is calculated based on the following two identical rules fetched from
Table 7.5.
(Headache, Yes) ∧ (Temperature, High) ∧ (Heartbeat, Abnormal) → (Flu, Yes)
The various decision rules that can be formed by considering the three condition
attributes and the one decision attribute of Table 7.6 are mentioned below:
Now, the support Supports(C, D), the strength Strengths(C, D), the certainty factor Certs(C, D), and the coverage factor Covs(C, D) of each of the five decision rules for Table 7.6 are calculated as shown in Table 7.7.
The certainty factors from Table 7.7 lead us to the following conclusions:
(a) 50% of the patients who have headache and high temperature but normal heart
beat suffer from flu.
(b) 50% of the patients who have headache and high temperature but normal heart
beat may not suffer from flu.
(c) 100% of the patients who have headache, high temperature, and an abnormal
heart beat suffer from flu.
(d) 100% of the patients who do not have headache but high temperature and a
normal heart beat do not suffer from flu.
(e) 100% of the patients who have headache but normal temperature and an
abnormal heart beat do not suffer from flu.
(f) 100% of the patients who have headache, low temperature, and a normal heart
beat suffer from flu.
In this way, the inverse decision rules can also be formed using the coverage
factor values. Thus, rough set theory can be used to form decision algorithms
by classifying the information system into lower and upper approximations.
This kind of research study can be applied to form an accurate and reliable expert
system that can help physicians build a diagnosis system.
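The rule measures used above can be computed mechanically from any decision table. The sketch below uses a small hypothetical table (not Table 7.5 or 7.6; the attribute values and counts are invented for illustration, and the function names are mine):

```python
from collections import Counter, defaultdict

def rule_metrics(table, condition_attrs, decision_attr):
    """Support, strength, certainty and coverage for each distinct
    decision rule (condition values -> decision value)."""
    n = len(table)
    cond_counts, dec_counts, rule_counts = Counter(), Counter(), Counter()
    for row in table:
        cond = tuple(row[a] for a in condition_attrs)
        dec = row[decision_attr]
        cond_counts[cond] += 1
        dec_counts[dec] += 1
        rule_counts[(cond, dec)] += 1
    return {(cond, dec): {"support": s,
                          "strength": s / n,
                          "certainty": s / cond_counts[cond],
                          "coverage": s / dec_counts[dec]}
            for (cond, dec), s in rule_counts.items()}

def consistency_factor(table, condition_attrs, decision_attr):
    """Con(C, D): fraction of objects whose condition values determine
    a unique decision value."""
    decisions, counts = defaultdict(set), Counter()
    for row in table:
        cond = tuple(row[a] for a in condition_attrs)
        decisions[cond].add(row[decision_attr])
        counts[cond] += 1
    consistent = sum(c for cond, c in counts.items()
                     if len(decisions[cond]) == 1)
    return consistent / len(table)

# hypothetical decision table in the spirit of the flu example
table = [
    {"Headache": "Yes", "Temperature": "High",   "Flu": "Yes"},
    {"Headache": "Yes", "Temperature": "High",   "Flu": "Yes"},
    {"Headache": "Yes", "Temperature": "High",   "Flu": "No"},
    {"Headache": "No",  "Temperature": "Normal", "Flu": "No"},
]
m = rule_metrics(table, ["Headache", "Temperature"], "Flu")
con = consistency_factor(table, ["Headache", "Temperature"], "Flu")
```

For this toy table, the rule (Headache, Yes) ∧ (Temperature, High) → (Flu, Yes) has support 2, strength 0.5, certainty 2/3, and coverage 1, and the table's consistency factor is 0.25.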
● Step 1: Considering the three primary color components – red, green, and
blue – the histogram is to be plotted for each color component (can follow the
procedure for plotting of one-dimensional entities). These histograms are often
referred to as base histograms.
● Step 2: Plot the histon of the base histogram. For this, pixels with similar color
values are to be found and accordingly grouped to form regions with similar
color. A sphere of similar color points is formed around each pixel: a point with color intensities (r, g, b) belongs to the sphere of a pixel having color intensities (r′, g′, b′) if
$$\sqrt{(r - r')^2 + (g - g')^2 + (b - b')^2} \le R$$
where R represents the radius of the sphere of the region in the neighborhood of that pixel. By varying the value of R, a set of histons can be built based on varying degrees of color similarity.
● Step 3: Carry out the segmentation of the troughs and valleys for the histon. This
is done since the histon can accentuate the regions having similar color values
by differentiating the regions with respect to dissimilar color values.
● Step 4: Consider the segments formed by the Green histon as the primary seg-
ments, and support the segments with the segments obtained from Red and Blue
histons. The green component is treated as primary because the human eye is most receptive to the green component of white light.
Let us consider an example to further illustrate the steps involved in image seg-
mentation using rough set theory. For this, we consider a grayscale image of a
landscape.
● Step 1: Plotting the base histogram: Assume that the base histogram of the
grayscale image shows three prominent peaks, indicating the presence of sky,
trees, and buildings. Higher peaks in the histogram might indicate regions of
interest.
● Step 2: Plotting the histon: The histon of the base histogram reveals that there
are transitions in pixel intensity frequencies at certain points, suggesting poten-
tial segment boundaries.
● Step 3: Segmentation of troughs and valleys for the histon: Identify three
troughs and valleys in the histon, indicating potential segmentation points.
Troughs and valleys represent regions of the image where there are transitions
in pixel intensity frequencies. These transitions can be interpreted as potential
boundaries between segments. Each trough and valley suggests a potential segmentation point.
● Step 4: Primary segments: The segments formed by the histon troughs and
valleys correspond to the primary segments, representing different regions of
the image. These segments are labeled as sky, trees, and buildings.
Rough set theory might be considered in cases where image data exhibits sig-
nificant uncertainty or vagueness, which requires handling indiscernibility and
imprecision.
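For a grayscale image, steps 1 and 2 can be sketched as below. This is a simplified interpretation of the histon idea, not the chapter's exact formulation: a pixel contributes an extra count when its 4-neighbourhood is similar within a tolerance, standing in for the RGB sphere of radius R, and the function names are mine.

```python
def base_histogram(image, levels=256):
    """Step 1: base histogram of gray-level counts."""
    hist = [0] * levels
    for row in image:
        for g in row:
            hist[g] += 1
    return hist

def histon(image, radius=1, levels=256):
    """Step 2 (simplified): a pixel adds an extra count at its gray
    level when every 4-neighbour is within `radius` of it, so the
    histon lies on or above the base histogram."""
    h, w = len(image), len(image[0])
    hst = base_histogram(image, levels)
    for i in range(h):
        for j in range(w):
            g = image[i][j]
            neighbours = [image[i + di][j + dj]
                          for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                          if 0 <= i + di < h and 0 <= j + dj < w]
            if all(abs(n - g) <= radius for n in neighbours):
                hst[g] += 1
    return hst

# tiny 4x4 grayscale image with two flat regions
img = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]
```

On this toy image the histon exceeds the base histogram exactly at the interior of each flat region, which is what makes the similar-color regions stand out for segmentation.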
Rough set theory has also been applied in speech recognition systems, which broadly involve the following steps:
● signal preprocessing,
● feature extraction, and
● vector quantization.
● Vector quantization (VQ): This step is mainly used for data compression and
is mainly applied to form a discrete or semi-continuous hidden Markov model
(HMM) based speech recognition system. VQ can be viewed as the generalization of scalar quantization to vectors. For this, the VQ encoding–decoding
techniques are used. The role of VQ encoder is to encode a set of n-dimensional
data vectors with a small subset C, which is known as the codebook. The
elements of C are represented as Ci and are called codevectors or codewords.
Next, the index i of codevector is passed to the decoder for the decoding process
based on the table look-up procedure.
Figure 7.7 Training phase of the hybrid vector quantizer: the training feature vectors are used to build a codebook with k-means; the codebook and the classified training feature vectors are discretized, and reducts are then computed using rough sets to yield the feature vector indices.
A hybrid approach to building a vector quantizer has been developed using the concepts of k-means and rough sets (refer to Figure 7.7), and testing of this vector quantizer showed good performance in terms of recognition rate and time. In this approach, the k-means clustering algorithm is used in the training phase to form clusters of training feature vectors. The best representing vector is chosen from each cluster to form the codebook.
Since using k-means clustering alone results in a final codebook that is highly dependent on the initial codebook, rough set theory has been combined with k-means clustering to address this issue. The classified feature vectors formed by the codebook are then used to train a rough set engine; because the engine requires discrete data, a discretization algorithm is applied first. After this, the discretized classified training feature vectors are used to compute reducts. Classification rules are then generated to classify the feature vectors of input words.
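The encode/decode half of this pipeline is straightforward to sketch: a plain k-means codebook builder plus a nearest-neighbour encoder. The function names and the deterministic seeding are my own choices, not from the text.

```python
def vq_encode(v, codebook):
    """Encoder: index of the nearest codevector (squared distance)."""
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2
                                 for a, b in zip(v, codebook[j])))

def vq_decode(i, codebook):
    """Decoder: a simple table look-up."""
    return codebook[i]

def build_codebook(vectors, k, iterations=10):
    """Build a k-entry codebook with plain k-means, seeded (for
    determinism) from the first k training vectors."""
    codebook = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[vq_encode(v, codebook)].append(v)
        for j, members in enumerate(clusters):
            if members:   # leave an empty cell's codevector unchanged
                codebook[j] = [sum(col) / len(members)
                               for col in zip(*members)]
    return codebook

# four 2-D feature vectors forming two natural clusters
vectors = [(0.0, 0.1), (0.9, 1.0), (0.1, 0.0), (1.0, 0.9)]
cb = build_codebook(vectors, 2)
```

Seeding the codebook from the first k training vectors also illustrates the initial-codebook dependence mentioned above: a poor seed can lock k-means into a poor final codebook, which is exactly the issue the rough set stage is meant to mitigate.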
The upper approximation (ZS) of a set S with respect to relation Z is the set of all objects that can possibly be classified as S with respect to Z. The lower and upper approximations are given by:
Z-lower ∶ ZS = {s ∈ U ∶ [s]Z ⊆ S}
Z-upper ∶ ZS = {s ∈ U ∶ [s]Z ∩ S ≠ ∅}
Figure 7.10 Display of BANK.ISF dataset (having discretized values for all attributes).
Methods Tree window. The Approximations option in the Methods window has
to be dragged and dropped on the MUSHROOM.ISF icon available in the Project
List window. This will result in opening a dialog box, as shown in Figure 7.12, for
choosing the decision attribute.
Once the “Ok” button with default options is clicked, the list of classes, the
number of objects in each class, and their corresponding lower and upper approx-
imation values are displayed as shown in Figure 7.13. The result also shows a
classification accuracy of 1.0, which is 100%.
Exercises
A) Choose the correct answer from among the alternatives given:
a) The concepts of rough sets were introduced in the field of computer
science by
i) Dominik Slezak
ii) Zdzisław I. Pawlak
iii) Dennis Ritchie
iv) James Gosling
b) __________ region is basically the difference between the upper and the
lower approximations.
i) Positive
ii) Negative
iii) Boundary
iv) Approximation
c) The ___________ region is the same as the lower approximation of a set.
i) Positive
ii) Negative
iii) Boundary
iv) Approximation
d) The _________ is the complete set of objects that are possibly members of
a target set “S.”
i) Lower approximation
ii) Upper approximation
iii) Approximation space
iv) Boundary region (BR)
e) Core is the set of attributes that is common to all ____________ formed for
an information system.
i) Decision rules
ii) Equivalence relations
iii) Reducts
iv) Approximation sets
f) For a given set “S,” if the lower approximation is equal to its upper approx-
imation, the accuracy and roughness measures will be equal to
i) 0 and 0
ii) 1 and 1
iii) 1 and 0
iv) 0 and 1
g) For measuring the consistency factor, the two types of attributes to be con-
sidered are ________
i) Simple attributes and multiple attributes
ii) Simple attributes and complex attributes
iii) Composite attributes and distributed attributes
iv) Condition attributes and decision attributes
B) Answer the following questions:
1) What is a rough set? For which applications can rough set theory be used?
2) Define the following terms:
i) Approximation space
ii) Lower approximation
iii) Upper approximation
iv) Boundary region (BR)
v) Positive region (PR)
vi) Negative region (NR)
3) What is meant by core and reduct? Mention any three reducts for the given
Table A. Also, find the upper approximation, lower approximation, bound-
ary region (BR), positive region (PR), and negative region (NR) for Table A.
4) For each rule of Table A, find the support, strength, certainty factor, and
coverage factor.
5) When can a boundary region (BR) be empty? Mention any three important
properties each for upper approximation and for lower approximation of a
set S.
6) What does an information table contain? How are equivalence classes
formed from an information table?
7) Mention any four important probabilistic properties of a decision table.
8) Explain the two main accuracy measures of rough set approximations.
9) For Table A, calculate the accuracy and roughness measures. What do these
values signify?
10) What do the certainty factor and coverage factor signify? When can a deci-
sion rule be considered certain?
11) What does the strength of a decision rule signify? What can be the possible
range of values for the strength of a decision rule? What should be the sum
total value of the strength of all decision rules of a decision table?
12) What is the role of reduct in the classification process? Explain, in detail,
the process of feature set selection using rough set approach.
13) Differentiate between basic k-means clustering and rough set-based
k-means clustering.
14) What is meant by histon? How can rough set concept be used for segmen-
tation of a colored image?
15) Discuss any two important application areas of rough set theory.
16) Consider a simple binary dataset given below:
dataset = [
{"attr1": 1, "attr2": 0},
{"attr1": 0, "attr2": 1},
{"attr1": 1, "attr2": 1},
{"attr1": 0, "attr2": 0},
]
Explain the process for finding the lower and upper approximations for this
dataset using the ROSE tool.
17) Develop a rough clustering algorithm that groups similar objects based on
their lower and upper approximations in a dataset.
18) Given the below dataset, consider the decision attribute to be "Exam Result"
and the set of condition attributes as {Attendance, Study Hours}.
288 7 Rough Set Theory
Student ID | Attendance | Study Hours | Exam Result
Calculate the attribute reduction for the given set of condition attributes
using the discernibility matrix approach. Show your calculations step
by step.
Hybrid Systems
Principles of Soft Computing Using Python Programming: Learn How to Deploy Soft Computing Models
in Real World Applications, First Edition. Gypsy Nandi.
© 2024 The Institute of Electrical and Electronics Engineers, Inc. Published 2024 by John Wiley & Sons, Inc.
In the context of hybrid systems in soft computing, various types of hybrid sys-
tems can be classified based on their characteristics and functionality. Here are a
few commonly recognized types of hybrid systems:
(a) Sequential Hybrid Systems: Sequential hybrid systems involve the com-
bination of different soft computing techniques in a sequential manner.
The output of one technique serves as the input to another, forming a pipeline
or a cascaded structure. For example, a sequential hybrid system might
employ fuzzy logic for preprocessing, followed by NNs for feature extraction,
and, finally, genetic algorithms (GAs) for optimization.
(b) Embedded Hybrid Systems: Embedded hybrid systems refer to the integra-
tion of soft computing techniques into a larger existing system or framework.
In this type of hybrid system, soft computing methods are embedded or incor-
porated into traditional algorithms or systems to enhance their capabilities.
For example, incorporating a fuzzy logic controller into a conventional control
system or embedding NNs within a decision support system.
(c) Auxiliary Hybrid Systems: Auxiliary hybrid systems use one soft computing
technique as an auxiliary or supporting tool to improve the performance
of another technique. The auxiliary technique assists in enhancing the
effectiveness, efficiency, or robustness of the primary technique. For instance,
using GAs to tune the parameters of an artificial neural network or using
fuzzy logic to guide the exploration of an evolutionary algorithm. Figure 8.1
illustrates the three types of hybrid systems, namely, (a) sequential hybrid system, (b) embedded hybrid system, and (c) auxiliary hybrid system.
(d) Cooperative Hybrid Systems: Cooperative hybrid systems involve the
collaboration and cooperation of multiple soft computing techniques or mod-
els to jointly solve a problem. These techniques work in parallel, and share
information or intermediate results to arrive at a final solution. Examples
include swarm intelligence algorithms, where multiple individuals in the
swarm contribute to finding optimal solutions collectively.
(e) Ensemble Hybrid Systems: Ensemble hybrid systems combine the predic-
tions or outputs of multiple individual soft computing models to produce a
final result. Each individual model may use a different algorithm or technique,
and the ensemble system aggregates their outputs to make a collective deci-
sion. Ensemble methods like bagging, boosting, or stacking are examples of
this type of hybrid system.
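The sequential arrangement in (a) can be pictured as a simple pipeline in which each stage's output feeds the next. The sketch below uses trivial stand-in functions for the three techniques; all the names and stages are hypothetical placeholders, not real fuzzy, NN, or GA components.

```python
def sequential_hybrid(stages, x):
    """Sequential hybrid system: pass the data through each soft
    computing stage in turn."""
    for stage in stages:
        x = stage(x)
    return x

# hypothetical stand-ins for fuzzy preprocessing, NN-based feature
# extraction, and GA-based refinement
def normalize(xs):            # scale values into [0, 1]
    peak = max(xs)
    return [v / peak for v in xs]

def extract(xs):              # collapse to a single summary feature
    return sum(xs) / len(xs)

def refine(v):                # round off the final value
    return round(v, 2)

result = sequential_hybrid([normalize, extract, refine], [2.0, 4.0, 8.0])
```

The pipeline shape is the whole point: replacing each placeholder with a real fuzzy, neural, or genetic component yields the cascaded structure described above without changing the glue code.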
Applications of neurogenetic systems span various domains, including pattern
recognition, data mining, optimization, control systems, image and signal process-
ing, and bioinformatics. These hybrid systems have been successfully employed
in tasks, such as classification, regression, prediction, optimization, and feature
selection.
8.2 Neurogenetic Systems 291
Figure 8.1 (a) Sequential hybrid system, (b) embedded hybrid system, and (c) auxiliary
hybrid system.
Hybrid systems may exhibit characteristics of multiple types. The specific type
of hybrid system used depends on the problem at hand, the available techniques,
and the desired outcome. The objective is to leverage the strengths of different soft
computing approaches and create synergistic solutions that outperform individual
methods.
(a) Encoding: Each solution in the population represents a set of weights for
the neural network. The weights are encoded as a chromosome or a string of
binary values, where each value corresponds to a weight in the neural network.
(b) Initialization: A population of candidate solutions (chromosomes) is
randomly generated. Each chromosome corresponds to a set of weights for
the neural network.
(c) Evaluation: Each chromosome is decoded to obtain a set of weights. The neu-
ral network is then trained using these weights on a training dataset, and the
performance of the network is evaluated using a fitness function. The fitness
function measures how well the neural network performs on the task at hand,
such as classification accuracy or mean squared error.
(d) Selection: Chromosomes with higher fitness values have a higher probability
of being selected for reproduction, mimicking the process of natural selection.
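The encoding-to-selection steps (a)–(d) can be sketched without any GA library. The sketch below is not Program 8.1 (which the text describes using DEAP); it encodes three weights of a single linear neuron as 8-bit genes and evolves them by tournament selection, one-point crossover, and bit-flip mutation. All the names, sizes, and the toy target are assumptions.

```python
import random

BITS = 8                  # bits per weight
W_MIN, W_MAX = -1.0, 1.0  # weight range
N_WEIGHTS = 3             # toy network: y = w0*x0 + w1*x1 + b

def decode(chrom):
    """Step (a): map each 8-bit gene back to a real weight."""
    weights = []
    for i in range(N_WEIGHTS):
        gene = chrom[i * BITS:(i + 1) * BITS]
        frac = int("".join(map(str, gene)), 2) / (2 ** BITS - 1)
        weights.append(W_MIN + frac * (W_MAX - W_MIN))
    return weights

def fitness(chrom, data):
    """Step (c): negative MSE of the decoded neuron (higher is better)."""
    w0, w1, b = decode(chrom)
    err = sum((w0 * x0 + w1 * x1 + b - y) ** 2 for (x0, x1), y in data)
    return -err / len(data)

def tournament(pop, data, k=3):
    """Step (d): fitter chromosomes are more likely to reproduce."""
    return max(random.sample(pop, k), key=lambda c: fitness(c, data))

def evolve(data, pop_size=40, generations=80, p_mut=0.02):
    random.seed(0)
    # Step (b): random initial population of bit-string chromosomes
    pop = [[random.randint(0, 1) for _ in range(N_WEIGHTS * BITS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        nxt = [max(pop, key=lambda c: fitness(c, data))]   # elitism
        while len(nxt) < pop_size:
            p1, p2 = tournament(pop, data), tournament(pop, data)
            cut = random.randrange(1, N_WEIGHTS * BITS)    # one-point crossover
            nxt.append([bit ^ 1 if random.random() < p_mut else bit
                        for bit in p1[:cut] + p2[cut:]])   # bit-flip mutation
        pop = nxt
    return max(pop, key=lambda c: fitness(c, data))

# toy target neuron: y = 0.5*x0 - 0.25*x1 + 0.1
data = [((x0, x1), 0.5 * x0 - 0.25 * x1 + 0.1)
        for x0 in (-1, 0, 1) for x1 in (-1, 0, 1)]
best = evolve(data)
```

The elitist copy in each generation guarantees the best fitness never decreases, which is a common practical addition even though step (d) alone does not require it.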
• The GA loop starts, where each generation undergoes the following steps:
⚬ The fitness of each individual in the population is evaluated using the
evaluate() function.
⚬ The selection operator (toolbox.select()) is applied to select indi-
viduals for the next generation.
⚬ Crossover and mutation operators (toolbox.mate() and toolbox.
mutate()) are applied to create new offspring.
⚬ The population is replaced with the offspring.
The loop continues for the specified number of generations.
(e) Best Individual and Evaluation:
• The best individual from the final population is selected using the
tools.selBest() function.
• The best individual’s weights are evaluated on the test set using the
evaluate() function to obtain the test accuracy.
(f) Output:
• The code prints the best individual (chromosome) found during the evolu-
tion process.
• It also prints the test accuracy achieved by the best individual on the unseen
test data.
The output of Program 8.1 is displayed next. The output of the code includes
the best individual, represented by a list of weights (chromosome), and the
corresponding test accuracy achieved by the neural network with those
weights. The best individual represents the optimized weights for the multilayer
feed-forward neural network. These weights should result in a higher accuracy
on the test set compared to other individuals in the final population.
Best Individual: [0.0678403418233603, 0.04174335906052762,
0.4597601600227963, -0.7659771052317565,
0.6354974130832376, -0.6133952438054482]
Test Accuracy: 0.86875
The actual output may vary due to the stochastic nature of the GA. The per-
formance of the algorithm depends on various factors such as the dataset, the
configuration of the GA, and the specific problem being solved.
By combining these features, NEAT can effectively explore the space of possible
neural network architectures and optimize them for specific tasks. It allows for
the automatic discovery of neural network structures that perform well on a given
problem. NEAT has been successfully applied to various tasks, including control
problems, game playing, robotics, and more. It has demonstrated the ability to
evolve NNs that outperform hand-designed architectures in several domains.
Overall, NEAT provides a flexible and powerful framework for automatically
evolving NNs, allowing for the discovery of novel and efficient solutions to
complex problems.
Program 8.2 shows a simple Python implementation of NEAT that evolves feed-forward NNs for solving a specific task through crossover and mutation. The program defines a simple feed-forward neural network structure using the NeuralNetwork class. It has an __init__ method
that initializes the input size, output size, and random weights for the network. It
also has a predict method that performs the forward pass of the neural network.
A fitness function, fitness_function, is defined to evaluate the performance
of each neural network. In this example, it calculates the mean squared error
(MSE) between the neural network’s predictions and the target values for a set
of input examples.
The evolve_neural_networks function handles the NEAT evolution pro-
cess. It takes the number of generations as input. The function initializes the input
size, output size, and an empty population list to hold the NNs. A for loop is used
to generate an initial population of NNs. In this example, 10 NNs are created with
random weights. Another ‘for loop’ in the program code runs for the specified
number of generations. Inside the loop, the fitness of each neural network in the
population is evaluated using the fitness_function. The fitness scores are
stored in a list called fitness_scores.
The fitness_scores list is sorted based on the fitness scores in descending
order. This is done using the sort method and a lambda function as the sorting
key. The population list is then updated by extracting the NNs from the sorted
fitness_scores list. The top-performing NNs (parents) are selected from the
population. In this example, the top five NNs are chosen as parents. New NNs
(children) are created through crossover and mutation. For each child, two parents
are randomly selected without replacement using random.sample.
The weights of the children are obtained by averaging the weights of their
parents and adding a random mutation. The old population is replaced with the
new one, which consists of the parents and children. The best fitness score in the
current generation is printed. It is obtained from the first element of the sorted
fitness_scores list. The evolve_neural_networks function is called
with the specified number of generations.
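A compact version of the evolution loop just described might look as follows. It mirrors the structure attributed to Program 8.2 (a population of 10, the top five as parents, averaging crossover plus Gaussian mutation), but it is a reconstruction, not the book's listing: the network here is a single linear layer and the toy fitness target is my own.

```python
import random

class NeuralNetwork:
    """Minimal feed-forward net: a single linear layer."""
    def __init__(self, input_size, output_size, weights=None):
        self.input_size, self.output_size = input_size, output_size
        if weights is None:
            weights = [random.uniform(-1, 1)
                       for _ in range(input_size * output_size)]
        self.weights = weights

    def predict(self, inputs):
        return [sum(w * x for w, x in zip(
                    self.weights[o * self.input_size:
                                 (o + 1) * self.input_size], inputs))
                for o in range(self.output_size)]

def fitness_function(net, examples):
    """Negative mean squared error (higher is better)."""
    se = sum((net.predict(inp)[0] - target) ** 2
             for inp, target in examples)
    return -se / len(examples)

def evolve_neural_networks(generations, examples):
    random.seed(1)
    population = [NeuralNetwork(2, 1) for _ in range(10)]
    for _ in range(generations):
        # sort by fitness, best first
        population.sort(key=lambda n: fitness_function(n, examples),
                        reverse=True)
        parents = population[:5]              # top performers survive
        children = []
        while len(children) < 5:
            p1, p2 = random.sample(parents, 2)
            # child weights: average of the parents plus a mutation
            w = [(a + b) / 2 + random.gauss(0, 0.1)
                 for a, b in zip(p1.weights, p2.weights)]
            children.append(NeuralNetwork(2, 1, w))
        population = parents + children
    return max(population, key=lambda n: fitness_function(n, examples))

# toy target: y = x1 - x2
examples = [((x1, x2), x1 - x2) for x1 in range(-2, 3)
            for x2 in range(-2, 3)]
best = evolve_neural_networks(100, examples)
```

Note that, like the simplified Program 8.2 itself, this keeps the network topology fixed and evolves only the weights; full NEAT would additionally evolve nodes and connections.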
The output of Program 8.2 is displayed next. The output indicates the progress of
the NEAT evolution process over the specified number of generations. Each line
displays the current generation number and the corresponding best fitness score
achieved by the top-performing neural network in that generation.
The fitness score represents the performance of the neural network in min-
imizing the mean squared error (MSE) between its predictions and the target
values. The best fitness score should generally improve over the generations, as
the evolution process aims to optimize the NNs. However, keep in mind that the
exact values and trends may vary due to the random initialization and mutation
steps in the process.
Generation 0: Best Fitness = -0.5670
Generation 1: Best Fitness = -0.3788
Generation 2: Best Fitness = -0.3788
Generation 3: Best Fitness = -0.3735
Generation 4: Best Fitness = -0.3735
Generation 5: Best Fitness = -0.3518
Generation 6: Best Fitness = -0.3518
By observing the best fitness scores, one can assess how well the NEAT algo-
rithm is progressing in optimizing the NNs for the given task. In a more complex
NEAT implementation, additional information such as the neural network topol-
ogy, the connections between nodes, and other parameters could also be tracked
and analyzed during the evolution process.
(Figure: block diagram of a fuzzy system — Input → Fuzzifier → Inference engine → Defuzzifier → Output, with the inference engine driven by a rule base.)
(a) Fuzzification: The fuzzy neuron first converts the crisp input into fuzzy membership values. These membership values represent the degree to which the input belongs to particular linguistic terms or fuzzy sets. Fuzzification is achieved using membership functions, which map the input values to membership degrees.
(b) Membership Functions: Membership functions are fundamental to fuzzy
neurons. They define the degree of membership for each linguistic term or
fuzzy set. These functions can take various shapes, such as triangular, trape-
zoidal, or Gaussian, depending on the specific application. The fuzzy neu-
ron evaluates the input against the membership functions and calculates the
membership degrees for each linguistic term.
(c) Fuzzy Inference: After fuzzification, the fuzzy neuron performs fuzzy
inference to process the fuzzy input and make decisions. Fuzzy infer-
ence involves applying fuzzy rules that describe how to combine the fuzzified
inputs to produce the desired output. These rules are typically expressed in
IF-THEN format and are determined based on expert knowledge or data-
driven learning.
(d) Defuzzification: The final step in fuzzy neurons is defuzzification, which
converts the fuzzy output obtained from fuzzy inference back into crisp val-
ues. Defuzzification can be achieved using various methods, such as centroid,
mean of maxima, or weighted average.
Program 8.3 shows the Python code to demonstrate fuzzy neuron for temper-
ature control. In this code, fuzzy logic is used to control the air conditioning
system based on the temperature input. The program starts by defining three
fuzzy membership functions: membership_cold, membership_moderate,
and membership_hot. These functions represent the degree of membership
of the input temperature to the linguistic terms "Cold", "Moderate", and
"Hot", respectively. Triangular membership functions are used for simplicity.
Next, the program defines fuzzy inference rules using the fuzzy_or and
fuzzy_and functions. The fuzzy_or function represents the logical OR
operation, and the fuzzy_and function represents the logical AND operation.
The rule-based fuzzy inference determines the strength of activation for each
linguistic term based on the input temperature’s membership values. After fuzzy
inference, the program performs defuzzification to convert the fuzzy output
values back into crisp values. The centroid method is used for defuzzification,
which calculates the center of mass of the fuzzy set. The result is a crisp value
representing the degree of activation of the air conditioning system.
The fuzzy_neuron function combines the fuzzification, fuzzy inference, and
defuzzification steps. It takes the input temperature as an argument, fuzzifies
it using the membership functions, applies fuzzy inference rules, and performs
defuzzification to obtain the air conditioning system’s output. The fuzzy_not
function represents the logical NOT operation, which complements the membership value (1 − x). The main function demonstrates the use of the fuzzy neuron for a sample input temperature.
8.3 Fuzzy-Neural Systems 305
def membership_moderate(x):
    if x > 20 and x < 25:
        return (x - 20) / (25 - 20)
    elif x >= 25 and x <= 30:
        return 1.0
    elif x > 30 and x < 35:
        return (35 - x) / (35 - 30)
    else:
        return 0.0

def membership_hot(x):
    if x >= 30 and x <= 35:
        return (x - 30) / (35 - 30)
    elif x > 35:
        return 1.0
    else:
        return 0.0

    # Defuzzification (inside fuzzy_neuron)
    result = [0] * 100
    for i in range(100):
        result[i] = fuzzy_or(air_cold * membership_cold(i),
                             air_moderate * membership_moderate(i),
                             air_hot * membership_hot(i))
    return defuzzification(result)

# Fuzzy inference (inside main)
air_conditioning = fuzzy_neuron(input_temp)
print(f"Input Temperature: {input_temp}°C")
print(f"Air Conditioning Output: {air_conditioning:.2f}")

if __name__ == "__main__":
    main()
The output of Program 8.3 is displayed next. The input temperature is set to 25 °C. The fuzzy neuron applies the fuzzy inference rules to determine the degree
of activation for each linguistic term (cold, moderate, and hot) based on the
input temperature’s membership values. Since the input temperature falls within
the range where "Moderate" has the highest membership value (around 1), the
fuzzy neuron activates the "Moderate" air conditioning level. The output from
the defuzzification step represents the degree of activation for the air conditioning
system, with 0 indicating no activation and 100 indicating full activation.
Input Temperature: 25°C
Air Conditioning Output: 15.79
The air conditioning output (15.79, in the given output) will be a numeric value
between 0 and 100, indicating the degree of activation for the air conditioning
system. The higher the value, the more the air conditioning system will be
activated.
The code shows a simplified demonstration of a fuzzy neuron for temperature
control. In real-world applications, more sophisticated membership functions
and fuzzy rules would be used for accurate control. The example provides a basic
understanding of how fuzzy neurons can be used for control systems based on
fuzzy logic principles.
Overall, fuzzy neurons are essential components of Fuzzy-Neural systems,
providing the capability to handle uncertain and imprecise data effectively,
making them valuable tools in various real-world applications.
The ANFIS model employs a hybrid learning algorithm that combines gradient
descent and least-squares estimation to adaptively adjust the parameters (weights
and fuzzy membership function parameters) of the system. This learning process
minimizes the error between the ANFIS model’s predicted output and the target
output from the training data.
Program 8.4 illustrates the Python code used to demonstrate the use of Adaptive
Neuro-Fuzzy Inference System (ANFIS). The program is a simplified imple-
mentation of an ANFIS model to predict outputs from inputs using Gaussian
membership functions and gradient descent for parameter adaptation.
The gaussmf function defines a Gaussian membership function, which is used
for fuzzification in ANFIS. It takes x, mean, and sigma as input and returns the
Gaussian membership values. The predict_anfis function is responsible for
predicting the ANFIS output, given the input x and the ANFIS rules. It loops
through the fuzzy rules, calculates the Gaussian membership values for each rule
using gaussmf, and aggregates the weighted contributions to predict the output.
The train_anfis function trains the ANFIS model using a gradient descent
approach. It initializes two fuzzy rules with initial parameters (mean, sigma, and
coeff). The training process iterates for a specified number of epochs. In each
epoch, it calculates the prediction using the current rules, and updates the rule
coefficients to minimize the error between the predicted and actual output. Ran-
dom input data x is generated for demonstration purposes. Additionally, random
noise is added to the sine function to create the output data y.
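The two helper functions described above might look like this. This is a minimal sketch consistent with the fragment shown below, assuming each rule is a dictionary with keys 'mean', 'sigma', and 'coeff', and that each rule contributes its firing strength times a scalar coefficient:

```python
import numpy as np

def gaussmf(x, mean, sigma):
    # Gaussian membership function used for fuzzification
    return np.exp(-((x - mean) ** 2) / (2 * sigma ** 2))

def predict_anfis(x, rules):
    # Sum each rule's weighted contribution: the firing strength
    # (Gaussian membership) times the rule's output coefficient
    y_pred = np.zeros_like(x, dtype=float)
    for rule in rules:
        w = gaussmf(x, rule['mean'], rule['sigma'])
        y_pred += w * rule['coeff']
    return y_pred

# Illustrative rules (parameter values are assumptions)
rules = [{'mean': 0.0, 'sigma': 1.0, 'coeff': 0.5},
         {'mean': 2.0, 'sigma': 1.0, 'coeff': -0.5}]
print(predict_anfis(np.array([0.0, 2.0]), rules))
```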
The train_anfis function is called with the generated input x and output y.
The ANFIS model is trained for 50 epochs with a learning rate of 0.1. The trained
ANFIS rules are used to predict the output values for the input data x. The program
prints the actual output values y and the predicted output values y_pred for com-
parison. The output of the program will show the actual and predicted output
values for the random data generated.
for i in range(len(rules)):
    w = gaussmf(x, rules[i]['mean'], rules[i]['sigma'])
    delta_coeff = learning_rate * np.sum(w * error)
    rules[i]['coeff'] += delta_coeff
return rules
The output of Program 8.4 is given next. The output displays arrays containing
the actual output values y and the predicted output values y_pred. These val-
ues are compared for each corresponding input x generated earlier. The predicted
values (y_pred) are generated using the trained ANFIS model, which has learned
the fuzzy rules and their coefficients from the training data. The actual values (y)
were generated by adding random noise to the sine function.
Actual Values: [-0.83041379 0.85357806 -0.20621284 -0.89448679
-0.74008257 0.36444128
-0.82603257 0.46760626 -0.31734381 -0.53326634 0.95768151
-0.71609827
-0.54607402 0.26567484 0.68774382 0.83581937 0.2018593
1.06914918
1.01007045 0.70307721 -0.16527418 0.85577159 -1.12228985
1.09575455
... ...
... ...
-0.70127374 0.55302958 -0.96779522 0.65272999 -0.3813023
0.16312885
0.0480033 0.41291054 1.05389453 0.70569158 0.14158738
1.03426159
-0.48632557 0.13076826 0.86114029 0.04868543]
Since a simple ANFIS implementation is used with just two fuzzy rules and
random data, the prediction accuracy may not be high. In practice, ANFIS models
with more fuzzy rules and real-world data would be used for better results.
The purpose of this program is to demonstrate the basic concept of Adaptive
Neuro-Fuzzy Inference System (ANFIS), where the ANFIS model learns to map
inputs to outputs by adapting fuzzy rules based on the training data.
ANFIS has found applications in various fields, including function
approximation, system modeling and control, time series prediction, and pattern
recognition. ANFIS automatically adjusts its parameters based on the input–
output data, making it adaptable to different problems. ANFIS is a powerful and
versatile technique that combines the strengths of fuzzy logic and NNs, making
it suitable for a wide range of applications where interpretability and adaptability
are essential.
def fuzzify(x_value):
    return [
        fuzz.interp_membership(x, x_membership_low, x_value),
        fuzz.interp_membership(x, x_membership_medium, x_value),
        fuzz.interp_membership(x, x_membership_high, x_value)
    ]
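The fuzzify function above depends on scikit-fuzzy and on globals (the universe x and the three membership arrays) defined elsewhere in Program 8.5. For readers without scikit-fuzzy installed, the same interpolation can be sketched with plain NumPy; the triangular set shapes below are illustrative assumptions, not the book's actual definitions:

```python
import numpy as np

# Universe of discourse and illustrative triangular membership arrays
# (assumed shapes; Program 8.5's actual sets may differ)
x = np.linspace(-10, 10, 201)
x_membership_low = np.interp(x, [-10, 0], [1, 0])
x_membership_medium = np.interp(x, [-5, 0, 5], [0, 1, 0])
x_membership_high = np.interp(x, [0, 10], [0, 1])

def fuzzify(x_value):
    # np.interp plays the role of fuzz.interp_membership here
    return [
        np.interp(x_value, x, x_membership_low),
        np.interp(x_value, x, x_membership_medium),
        np.interp(x_value, x, x_membership_high),
    ]

print([float(v) for v in fuzzify(0.0)])  # [0.0, 1.0, 0.0]
```

At x = 0 the input is fully "medium" and not at all "low" or "high", which is the kind of membership-degree vector the fuzzy fitness evaluation consumes.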
# Fuzzy-Genetic System
varbound = np.array([[-10, 10]])
algorithm_param = {'max_num_iteration': 100, 'population_size': 10,
                   'mutation_probability': 0.1, 'elit_ratio': 0.01,
                   'parents_portion': 0.3, 'crossover_probability': 0.5,
                   'crossover_type': 'uniform',
                   'max_iteration_without_improv': None}
model = ga(function=objective_function, dimension=1,
           variable_type='real', variable_boundaries=varbound,
           function_timeout=20, algorithm_parameters=algorithm_param)
The output of Program 8.5 will vary slightly in each run due to the stochastic
nature of GAs. However, it will generally display the following details, and only
the values may differ in each run.
The best solution found:
[2.02769079]
Objective function:
0.0007667798834551931
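Since x² − 4x + 4 = (x − 2)², the result can be checked by hand: the true minimizer is x = 2, and evaluating the objective at the GA's solution reproduces the reported fitness (the values here come from the sample run above; your run will differ):

```python
def objective_function(x):
    # The objective from Program 8.5: x^2 - 4x + 4 = (x - 2)^2,
    # minimized at x = 2 with minimum value 0
    return x ** 2 - 4 * x + 4

best_x = 2.02769079  # best solution reported in the sample run above
print(objective_function(best_x))  # close to the reported 0.000766779...
```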
[Figure: Genetic algorithm convergence plot. The objective function value falls from about 0.035 toward 0.000 over 100 iterations.]
Here, in the output, the optimized value of x is the value of x that minimizes the
objective function (x ** 2 - 4 * x + 4). The optimized value of the objec-
tive function is the fitness value corresponding to the optimized value of x, which
is the result of evaluating the objective function at the optimal x. Next, the mem-
bership degrees for the optimized x are the membership degrees for the optimized
x in the fuzzy sets (low, medium, and high). They indicate the degree of truth for
each linguistic term based on the optimized value of x. The fuzzy logic membership
functions provide a more flexible and interpretable representation of the optimiza-
tion problem, and the GA adapts to find the optimal solution based on the fuzzy
fitness evaluation.
Program 8.6 simulates a hybrid medical device that monitors the patient's heart
rate and delivers medication if the heart rate exceeds a certain threshold (in this
case, 90 beats per minute).
class HeartRateSensor:
    def measure_heart_rate(self):
        # Simulate heart rate measurement (replace this with real sensor data)
        return random.randint(60, 100)

class MedicationDeliverySystem:
    def deliver_medication(self):
        print("Delivering medication...")
        # Simulate the medication delivery process
        time.sleep(2)
        print("Medication delivered.")

class HybridMedicalDevice:
    def __init__(self):
        self.heart_rate_sensor = HeartRateSensor()
        self.medication_delivery_system = MedicationDeliverySystem()
        time.sleep(1)

if __name__ == "__main__":
    hybrid_device = HybridMedicalDevice()
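The excerpt above omits the loop that drives the simulation. A minimal self-contained sketch of what such a loop might look like follows; the method name monitor, the THRESHOLD constant, and the omission of the sleeps (dropped so the sketch runs instantly) are all assumptions, and the book's full listing may differ:

```python
import random

class HeartRateSensor:
    def measure_heart_rate(self):
        # Simulated reading; a real device would query the sensor hardware
        return random.randint(60, 100)

class HybridMedicalDevice:
    THRESHOLD = 90  # beats per minute (assumed constant)

    def __init__(self):
        self.heart_rate_sensor = HeartRateSensor()

    def monitor(self, iterations=5):
        # Check the heart rate a fixed number of times, delivering
        # medication whenever the reading exceeds the threshold
        deliveries = 0
        for _ in range(iterations):
            rate = self.heart_rate_sensor.measure_heart_rate()
            print(f"Heart rate: {rate}")
            if rate > self.THRESHOLD:
                print("Delivering medication...")
                deliveries += 1
        return deliveries

device = HybridMedicalDevice()
device.monitor(iterations=5)
```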
A sample output for Program 8.6 is displayed next. In the given output, the heart
rate exceeded 90 twice, and the medication was delivered on those occasions. The
simulation runs for five iterations (as specified by iterations=5), and the pro-
gram terminates after the 5th iteration.
Heart rate: 80
Heart rate: 93
Delivering medication...
Heart rate: 62
Heart rate: 87
Heart rate: 95
Delivering medication...
Heart rate: 68
Heart rate: 72
Heart rate: 74
class Accelerometer:
    def measure_acceleration(self):
        # Simulate acceleration measurement (replace this with real sensor data)
        return 0.5  # Example: Simulated acceleration value of 0.5 m/s^2

class Actuator:
    def provide_feedback(self, force):
        print(f"Applying force: {force:.2f} N")
        # Simulate the actuator providing feedback (e.g., vibrating or resisting)
        time.sleep(1)
        print("Feedback applied.")

class HybridRehabilitationDevice:
    def __init__(self):
        self.accelerometer = Accelerometer()
        self.actuator = Actuator()
        time.sleep(0.5)

if __name__ == "__main__":
    rehab_device = HybridRehabilitationDevice()
    rehab_device.perform_rehabilitation(duration=10)
A sample output for Program 8.7 is displayed next. In the given output, the
acceleration exceeds the threshold (0.7 m/s^2) twice, triggering the actuator
to provide feedback. The simulation runs for 10 seconds (as specified by
duration=10), and the program terminates after the 10-second duration.
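As with Program 8.6, the driving loop is not shown in the excerpt. The sketch below illustrates what perform_rehabilitation might do; the injected reading sequence, the force calculation, and the handling of the 0.7 m/s^2 threshold are assumptions (and the sleeps are dropped so the sketch runs instantly), so the book's full listing may differ:

```python
class Accelerometer:
    def __init__(self, readings):
        # Inject a fixed sequence of simulated readings for the sketch
        self._readings = iter(readings)

    def measure_acceleration(self):
        return next(self._readings)

class Actuator:
    def provide_feedback(self, force):
        print(f"Applying force: {force:.2f} N")

class HybridRehabilitationDevice:
    THRESHOLD = 0.7  # m/s^2 (threshold from the text)

    def __init__(self, accelerometer):
        self.accelerometer = accelerometer
        self.actuator = Actuator()

    def perform_rehabilitation(self, duration=10):
        # One reading per second of the session; resist fast movements
        # in proportion to the excess over the threshold
        feedback_count = 0
        for _ in range(duration):
            a = self.accelerometer.measure_acceleration()
            print(f"Acceleration: {a:.2f} m/s^2")
            if a > self.THRESHOLD:
                self.actuator.provide_feedback(force=10 * (a - self.THRESHOLD))
                feedback_count += 1
        return feedback_count

readings = [0.5, 0.8, 0.6, 0.9, 0.5, 0.4, 0.3, 0.6, 0.5, 0.2]
device = HybridRehabilitationDevice(Accelerometer(readings))
device.perform_rehabilitation(duration=10)  # feedback triggered twice
```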
Exercises
A) Choose the correct answer from among the alternatives given:
a) A type of hybrid system that uses one soft computing technique as a sup-
porting tool to improve the performance of another technique:
i) Sequential hybrid system
ii) Embedded hybrid system
iii) Auxiliary hybrid system
iv) Ensemble hybrid system
b) What is the primary goal of using a genetic algorithm (GA) for weight deter-
mination in a Multilayer Feed-forward Neural Network?
i) To minimize the number of hidden layers in the network.
ii) To maximize the number of neurons in the output layer.
iii) To find the optimal set of weights that minimizes the network’s error
function.
iv) To increase the learning rate of the network.
c) What is the role of a fitness function in the genetic algorithm?
i) It determines the probability of a genetic mutation occurring.
ii) It measures how well an individual solution performs the task at hand.
iii) It controls the crossover operation between parents.
iv) It keeps track of the total number of generations in the algorithm.
d) In NEAT, how are neural network structures represented?
i) As a fixed, predefined architecture.
ii) As a directed acyclic graph (DAG) with no recurrent connections.
iii) As a single-layer perceptron.
iv) As a stack of fully connected layers.
e) What is speciation in the context of NEAT?
i) The process of selecting the best-performing neural networks.
ii) The technique of adding new layers to the neural network.
iii) The process of dividing the population into species based on their sim-
ilarity.
iv) The process of gradually reducing the size of the neural network over
generations.
f) Fuzzy-Neural systems combine the principles of fuzzy logic and neural
networks to:
i) Process crisp inputs and generate crisp outputs.
ii) Handle only binary classification tasks.
iii) Represent knowledge in the form of rules and linguistic variables.
iv) Implement complex mathematical operations efficiently.
g) Which of the following techniques is used to represent membership func-
tions in fuzzy neurons?
i) Heaviside function
ii) Step function
iii) Sigmoid function
iv) Gaussian function
h) Which statement best describes the purpose of the "defuzzification"
process in fuzzy neurons?
i) To convert fuzzy output into crisp values
ii) To calculate the error gradient during backpropagation
iii) To initialize the connection weights in the neural network
iv) To combine fuzzy inputs into a single fuzzy output
i) Which technique is commonly used to update the connection weights in
Fuzzy-Neural networks during training?
i) Gradient Descent
ii) K-Means Clustering
iii) Random Forest
iv) Breadth-First Search
j) What is the main function of the "Fuzzification Layer" in an
ANFIS architecture?
i) It converts the crisp inputs into fuzzy sets using membership
functions.
ii) It updates the connection weights during training.
iii) It calculates the output using a defuzzification method.
iv) It applies the activation function to the weighted sum of inputs.
k) What is the primary function of the "Consequent Layer" in ANFIS?
i) It converts the fuzzy rule outputs into crisp values.
ii) It calculates the overall error of the system.
iii) It performs the forward pass during training.
iv) It normalizes the fuzzy rule firing strengths.
7) Explain the learning process in ANFIS. How are the parameters of the fuzzy
sets and the neural network weights adapted during the training phase to
optimize the model’s performance?
8) Discuss the advantages and limitations of ANFIS compared to other
machine learning models, such as traditional fuzzy systems, neural net-
works, or support vector machines. In what scenarios does ANFIS excel,
and when might other models be more appropriate choices?
9) What is Neuro Evolution of Adaptive Topologies (NEAT), and how does
it differ from traditional neuroevolution or genetic algorithms for evolving
neural networks?
10) Describe the key components and steps involved in the NEAT algorithm.
How does it maintain innovation and prevent premature convergence
during the evolution process?
Index
a
Activation Function (AF) 81
  binary step AF 81
  leaky ReLU 88
  linear AF 83
  ReLU AF 85
  sigmoid/logistic AF 84
  SoftMax AF 90
  tanh AF 87
AlexNet 131
ANFIS 307
Angular Fuzzy Sets 53
ANN Learning Rule 91
Ant Colony Optimization 15, 233
Antecedent 1
Approximation space 257
Artificial Bee Colony 16
Artificial Neural Network 11, 75
Association Analysis 26
  confidence 27
  minimum confidence 27
  minimum support 27
  support 27
Attribute discretization 280
Autoencoders 149
Average Pooling 129
Axon 10–11, 75

b
Bayesian Inference (BI) 168
Bayesian Machine Learning 191
Bayesian Network 171
Bayesian Structure Learning 178
Bayes Theorem 168
Belief Network 171
Bias 100
Binary Fuzzy Relation 45
Binary Step Activation Function 81
Boundary Region 258

c
Chaos Theory 30
Chromosome 198
Classification 21
Clustering 25, 271
  centroid-based clustering 25
  density-based clustering 25
  distribution-based clustering 25
  hierarchical clustering 25
Collective Sorting 228
Collective Transport 230
Competitive Learning Rule 95
Conditional Independence 172
Conditional Random Field Learning 177
Principles of Soft Computing Using Python Programming: Learn How to Deploy Soft Computing Models
in Real World Applications, First Edition. Gypsy Nandi.
© 2024 The Institute of Electrical and Electronics Engineers, Inc. Published 2024 by John Wiley & Sons, Inc.
Consequence 1
Constrained-Based Method 178
Contrastive Divergence 177
Control Action 1
Convolutional Neural Network 125
  convolutional layer 126
  feature map 126
  kernel/filter 126
  pooling 128
Correlation Learning Rule 94
Crisp Set 35
Crossover 204
  multi-point crossover 204
  one-point crossover 204
  partially mapped crossover 205
  uniform crossover 204

d
Decision Rule 263
Decision Table 263
  certainty factor 265
  consistency factor 264
  coverage factor 266
  probabilistic properties 266
  strength 265
  support 265
Deep Learning 123–125
Defuzzification 58
  centroid method 60
  maxima methods 62
  max-membership principle 58
  mean-max membership 59
  weighted average method 61
Delta Learning Rule 94
Dendrite 10–11, 75
DenseNet 131
Differential Equation 18
Division of Labor 229

e
Efficient Net 131
Equivalence Class 257
Equivalence Relation 256
Evidential Reasoning 30
Evolutionary Computing 12
Evolutionary Programming 13
Evolutionary Strategies 14
Expectation Maximization 177
Expected Values 165

f
Feature Map 126
Feature Selection 268
Feedback Neural Network 111
Feed-Forward Neural Network 98
Filter/Kernel 126
Fitness Function 199
Foraging Behavior 228
Fuzzification 58
Fuzzy c-Means 62
Fuzzy Computing 7, 35–37
Fuzzy Genetic Systems 311
Fuzzy Membership Functions 38, 46–49
Fuzzy Neural Systems 302
Fuzzy Neuron 303
Fuzzy Set 35, 37–38
Fuzzy Set Operations 41
Fuzzy Set Properties 42

g
Gaussian Membership Function 48
Generative Adversarial Network 144
Genetic Algorithms 17, 56–57, 197
  chromosome 198
  crossover 204
  fitness function 199
  mutation 205
o
Odds 162
Optimization Algorithm 235
Outstar Learning Rule 95

p
Parameter Learning 177
Particle Swarm Intelligence 15
Pawlak Rough Set 255
Perceptron 99
  multi-layer perceptron 103
  single-layer perceptron 99
Perceptron Learning Rule 93
Pheromone 227
Pooling 128
  average pooling 129
  max pooling 129
Population-Based Algorithms 197
Positive Region 259
Probabilistic Models 191
Probabilistic Reasoning 29, 159
Probability 159
Probability Perspectives 165
  axiomatic approach 167
  classical approach 166
  empirical approach 166
  subjective approach 167

r
Radial Basis Function Network 107
Random Experiment 160
Random Variables 160
  continuous random variables 161
  discrete random variables 161
Recurrent Neural Network 137
Reduct 259
Regression 22
  Lasso regression 23
  linear regression 23
  logistic regression 23
  polynomial regression 23
  Ridge regression 23
  support vector regression 24
Reinforcement Learning 28
ReLU 85
ResNET 131
Risks 162
ROSE Tool 280
Rough Set Theory 255
  accuracy 261
  roughness 260

s
Sample Space 162
Self-Organization 231
Self-Organizing Map 111
Semi-Supervised Learning 27
Sigmoid Activation Function 84
Sigmoidal Membership Function 48
Single-Layer Perceptron 99
Singleton Membership Function 46
SoftMax Activation Function 90
Soma 10–11, 75
Speech Analysis 278
Stigmergy 229
Supervised Learning 20
Swarm Intelligence 15, 225
Synapse 10–11, 75

t
Tanh Activation Function 87
Test Accuracy 136
Test Loss 136
Threshold Logic Unit 11
Trapezoidal Membership Function 48