ASC NOTES II Sem
● In this diagram, we observe that there is a function, f, which takes some input, x, and produces an
output, y. The input is referred to as the Antecedent, while the output is called the Consequent.
● The function f can also be called a formal method, algorithm, or a mapping function. It represents the
logic or steps used to process the input and generate the output.
● The middle part of this diagram is known as the computing unit, where the function resides. This is
where we feed the input, and a process occurs to convert that input into the desired output y.
● In the computing process, the steps that guide how the input is manipulated are called control actions.
These control actions ensure that the input gradually approaches the desired output. The moment the
process completes, we obtain the final result, known as the Consequent.
● Basic Characteristics Related to Computing:
1. Precise Solution: Computing is used to produce a solution that is exact and definitive.
2. Unambiguous & Accurate: The control actions within the computing unit must be unambiguous,
meaning they must have only one interpretation. Each control action should also be accurate to
ensure that the process is valid and reliable.
3. Mathematical Model: A well-defined algorithm is a requirement for solving any problem through
computing. This algorithm is essentially a mathematical model that represents the problem and its
solution process.
● In computing, there are two main types:
1. Hard Computing
2. Soft Computing
Hard Computing
In 1996, L.A. Zadeh (the pioneer of fuzzy logic) introduced the term hard computing. Hard computing
refers to a traditional computing approach where the results are always precise and exact. This is because
hard computing relies on well-defined mathematical models and algorithms to solve problems. There is no
room for approximation, as every calculation or action leads to a deterministic output.
Soft Computing
● Soft Computing is an approach to computing that models the human mind’s ability to make decisions in
an uncertain, imprecise, or complex environment. Unlike traditional, or "hard," computing which relies
on exact binary logic (0s and 1s), soft computing deals with approximation, flexibility, and learning from
experience to solve complex real-world problems.
● Soft computing techniques focus on developing systems that can handle ambiguity, uncertainty, and
approximation, making them well-suited for fields like artificial intelligence (AI), pattern recognition, and
robotics.
● Dr. L.A. Zadeh quoted that “The guiding principle of soft computing is to exploit the tolerance for
imprecision, uncertainty, and partial truth to achieve tractability, robustness, low solution cost, better
rapport with reality” i.e. The main idea of soft computing is to make use of its ability to handle
imprecision, uncertainty, and incomplete information. This helps in finding practical solutions that are
reliable, cost-effective, and closer to real-world situations. He pointed out that soft computing is not a
single method, but instead it is a combination of several methods, such as fuzzy logic, neural networks,
and genetic algorithms. These methods are not competitive but complementary to each other, and
they can be used together to solve a given problem.
Real-life examples to understand soft computing and hard computing
i. A manual transmission car relies on the driver to make precise decisions and execute gear shifts,
making it more aligned with the principles of hard computing (exact, deterministic input and output). In
contrast, an automatic transmission car adapts to varying conditions like speed and load, using fuzzy
logic or other soft computing techniques to decide when to shift gears smoothly.
ii. A manual air conditioner requires the user to set precise temperature and fan speed. The device works
rigidly based on these fixed inputs without adapting to external conditions. It is an example of hard
computing but an automatic air conditioner with smart features uses fuzzy logic to maintain a
comfortable room temperature. It adapts to inputs like room size, occupancy, and external weather
conditions to adjust cooling intensity dynamically. It is an example of soft computing.
Characteristics of soft computing
1. Handles Uncertainty and Approximation:
● Traditional algorithms often require exact inputs, but real-world problems don't always have precise
data. Soft computing methods can handle uncertainty, making them more suitable for practical
applications.
● Example: In medical diagnosis, data such as symptoms might be uncertain or incomplete, and soft
computing helps make decisions despite that.
2. Ease of Implementation:
● Soft computing techniques like fuzzy logic and genetic algorithms are often easier to implement for
complex, real-world problems compared to traditional methods. They don’t require a precise model of
the system but instead learn from examples.
● Example: In control systems, fuzzy logic can be quickly applied to model human-like decision-making
without needing complex mathematical models.
An Example
A step fan regulator is an example of crisp logic. In a step fan regulator, the fan speed can only be set to fixed
levels like Low, Medium, or High.
Rules:
a. If temperature ≤ 20°C, Fan speed = Low
b. If 20°C < temperature ≤ 30°C, Fan speed = Medium
c. If temperature > 30°C, Fan speed = High
How it works:
a. At 25°C the fan runs at Medium speed.
b. At 35°C the fan runs at High speed.
c. The change between speeds is sudden and abrupt. The fan speed will be different for 30°C and 31°C.
A smooth fan regulator is an example of fuzzy logic. In a smooth fan regulator, the fan speed can vary
continuously based on the temperature, not just fixed levels.
Rules:
a. If the temperature is Low, fan speed is set to Low.
b. If the temperature is Medium, fan speed is set to Medium.
c. If the temperature is High, fan speed is set to High.
How it works:
a. At 25°C the fan adjusts smoothly between Low and Medium speeds (e.g. 70% of Medium speed).
b. At 28°C the fan gradually increases speed further (e.g. 90% of Medium speed).
c. The changes in speed are gradual and smooth, mimicking human judgment.
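The contrast between the step (crisp) and smooth (fuzzy-style) regulators can be sketched in Python; the 15°C and 35°C breakpoints of the smooth regulator are illustrative assumptions, not values from the notes.

```python
def crisp_fan_speed(temp_c):
    """Step regulator: fixed levels with abrupt jumps (crisp logic)."""
    if temp_c <= 20:
        return "Low"
    elif temp_c <= 30:
        return "Medium"
    return "High"

def fuzzy_fan_speed(temp_c):
    """Smooth regulator: speed fraction (0..1) varies continuously."""
    # Linear ramp between 15 C (fan off) and 35 C (full speed);
    # the 15/35 breakpoints are illustrative assumptions.
    lo, hi = 15.0, 35.0
    frac = (temp_c - lo) / (hi - lo)
    return max(0.0, min(1.0, frac))

print(crisp_fan_speed(30), crisp_fan_speed(31))  # Medium High  (abrupt jump)
print(fuzzy_fan_speed(30), fuzzy_fan_speed(31))  # 0.75 0.8     (gradual change)
```

Note how the crisp regulator jumps a whole level between 30°C and 31°C, while the fuzzy-style regulator changes by only a small fraction.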
In fuzzy logic, the knowledge base is represented by if-then rules of fuzzy descriptors. An example of a fuzzy
rule would be
"If the speed is slow and the target is far then moderately increase the power.”
It contains the fuzzy descriptors slow, far, and moderate. A fuzzy descriptor represents a qualitative or
imprecise concept. It may be represented by a membership function which gives a membership grade between
0 and 1 for each possible value of the fuzzy descriptor it represents. Mathematically, a fuzzy set A is
represented by a membership function, or a "possibility function," of the form:

Fz[x ∈ A] = μA(x) ∈ R = [0, 1], x ∈ X
● The symbol Fz in Fz[x ∈ A] stands for the fuzzy set membership evaluation process. It assesses how
much an element x belongs to a fuzzy set A. A fuzzy set is a set where elements can have varying
degrees of membership, rather than being fully included or excluded. It is defined by a membership
function, which assigns each element a membership grade between 0 and 1.
● The x represents an element drawn from the universal set (the domain of discourse).
● The A represents the fuzzy set under consideration. This fuzzy set is a subset of the universal set.
Note that although x belongs crisply to the universal set, it belongs to A only fuzzily: it may belong
fully, belong to some extent, or not belong at all.
● The μA is the membership function for the fuzzy set A. It assigns a membership grade (μ) to each
element x, indicating how much x belongs to A.
● R = [0, 1] is the range of the membership function μA(x), where each value indicates the grade of
membership for a given x.
0: x does not belong to A.
1: x fully belongs to A.
0 < μA(x) < 1: x partially belongs to A.
The translation from x to μA(x) is known as fuzzification. This topic will be covered in detail later.
2. Neural Network
A neural network consists of a set of nodes, usually organized into layers, and connected through weight
elements called synapses. At each node, the weighted inputs are summed (aggregated), thresholded, and
subjected to an activation function in order to generate the output of that node.
If the weighted sum of the inputs to a node (neuron) exceeds a threshold value w0, then the neuron fires and
an output y(t) is generated according to

y(t) = f[Σi wi xi − w0]

where xi are the neuron inputs, wi are the synaptic weights, and f[·] is the activation function.
There are two main classes of neural networks known as feedforward networks (or static networks) and
feedback networks (or recurrent networks).
In a feedforward network, the signal flow from a node to another node takes place in the forward direction
only. There are no feedback paths. In a feedforward neural network, learning is achieved through example.
This is known as supervised learning. Specifically, first a set of input–output data of the actual process is
determined (e.g., by measurement). The input data are fed into the NN. The network output is compared with
the desired output (experimental data) and the synaptic weights of the NN are adjusted using an algorithm until
the desired output is achieved.
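The supervised-learning loop described above can be sketched with a single neuron; the AND dataset, learning rate, and fixed threshold w0 = 0.5 are illustrative assumptions, not values from the notes.

```python
# Minimal sketch of supervised learning in a single-neuron feedforward
# network: synaptic weights are adjusted from input-output examples
# until the desired output is reproduced.
def step(v, w0=0.5):            # neuron fires if weighted sum exceeds w0
    return 1 if v > w0 else 0

def train(samples, epochs=20, eta=0.1):
    w = [0.0, 0.0]              # synaptic weights w_i
    for _ in range(epochs):
        for x, desired in samples:
            y = step(w[0]*x[0] + w[1]*x[1])
            err = desired - y   # compare network output with desired output
            w = [w[0] + eta*err*x[0], w[1] + eta*err*x[1]]
    return w

# Learn logical AND from measured input-output examples
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train(data)
print([step(w[0]*x[0] + w[1]*x[1]) for x, _ in data])  # [0, 0, 0, 1]
```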
In a feedback NN, the outputs of one or more nodes (say in the output layer) are fed back to one or more
nodes in a previous layer (say hidden layer or input layer) or even to the same node. The feedback provides
the capability of "memory" to the network.
3. Genetic Algorithm
Genetic algorithms represent an optimization approach where a search is made to "evolve" a solution
algorithm, which will retain the "most fit" components in a procedure. In the present context of intelligent
machines, Intellectual fitness is important for the evolutionary process rather than physical fitness. Evolutionary
computing can play an important role in the development of an optimal and self-improving intelligent machine.
Evolutionary computing has the following characteristics:
a. It is based on multiple searching points or solution candidates (population based search).
b. It uses evolutionary operations such as crossover and mutation.
c. It is based on probabilistic operations.
A genetic algorithm works with a population of individuals, each representing a possible solution to a given
problem. Each individual is assigned a fitness score according to how good its solution to the problem is. The
highly fit (in an intellectual sense) individuals are given opportunities to reproduce by crossbreeding with other
individuals in the population. This produces new individuals as offspring, who share some features taken from
each parent. The least fit members of the population are less likely to get selected for reproduction and will
eventually die out. An entirely new population of possible solutions is produced in this manner, by mating the
best individuals (i.e., individuals with best solutions) from the current generation. The new generation will
contain a higher proportion of the characteristics possessed by the "fit" members of the previous generation. In
this way, over many generations, desirable characteristics are spread throughout the population, while being
mixed and exchanged with other desirable characteristics in the process. By favoring the mating of the
individuals who are more fit (i.e., who can provide better solutions), the most promising areas of the search
space would be exploited. A GA determines the next set of searching points using the fitness values of the
current searching points, which are widely distributed throughout the searching space. It uses the mutation
operation to escape from a local minimum.
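The population / fitness / crossover / mutation cycle can be sketched as follows; the OneMax fitness (count of 1-bits) and all numeric parameters are illustrative assumptions, not part of the notes.

```python
import random
random.seed(1)  # fixed seed so the probabilistic run is repeatable

# Minimal genetic-algorithm sketch: population-based search with
# crossover and mutation, keeping the most fit individuals.
def fitness(ind):                         # fitness score of an individual
    return sum(ind)                       # OneMax: count of 1-bits

def crossover(p1, p2):                    # one-point crossover of parents
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.05):               # random bit flips help escape local minima
    return [1 - b if random.random() < rate else b for b in ind]

pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for gen in range(40):
    pop.sort(key=fitness, reverse=True)   # most fit first
    parents = pop[:10]                    # least fit members die out
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(20)]
best = max(pop, key=fitness)
print(fitness(best))                      # typically close to the maximum, 20
```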
4. Associative Memory
It is a type of memory structure that allows for the retrieval of information based on content rather than a
specific address (as in traditional memory systems). It is inspired by the way the human brain recalls
information, where a partial or noisy input can trigger the recall of a complete memory. In soft computing,
associative memory is often implemented using neural networks or fuzzy systems, which are capable of
handling imprecise, uncertain, or incomplete data.
Types of Associative Memory
1. Auto-associative Memory: This type of memory retrieves a stored pattern that most closely matches the
input pattern. It is useful for pattern completion or noise reduction. For example, if you input a noisy or
incomplete version of a pattern, the auto-associative memory can recall the original, clean version.
Example: If you train an auto-associative memory with the word "HELLO" and later input "H*LLO", it
can recall "HELLO".
2. Hetero-associative Memory: This type of memory retrieves a different pattern from the input pattern.
The output may differ in content, type, or format. It is useful for pattern association or mapping tasks.
For example, mapping a name to a face or a word to its meaning.
Example: If you train a hetero-associative memory with the input "cat" and output "animal", then
inputting "cat" will retrieve "animal".
5. Adaptive Resonance Theory (ART)
1. Stability-Plasticity Dilemma: ART was designed to address a fundamental challenge in neural networks:
how to learn new information without catastrophically forgetting previously learned knowledge.
"Stability" refers to preserving old knowledge, while "plasticity" refers to the ability to adapt to new input.
ART achieves a balance between these two.
2. Clustering: ART networks excel at unsupervised learning and clustering. They can automatically group
similar input patterns into clusters without needing predefined categories.
3. Resonance: The "resonance" aspect is crucial. When an input is presented, the network searches for a
matching cluster. If a good match is found, a state of "resonance" occurs, strengthening the
connections between the input and the cluster. If no match is found, a new cluster is created.
4. Vigilance Parameter: This parameter controls the network's sensitivity to new patterns. A high vigilance
means the network is more likely to create new clusters (fine-grained distinctions), while a low vigilance
means it's more likely to assimilate new inputs into existing clusters (broader generalizations).
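A heavily simplified, illustrative sketch of the resonance/vigilance idea (not the actual ART equations): each input either joins a sufficiently similar existing cluster or founds a new one, and a higher vigilance produces more, finer-grained clusters. The match score is a plain bit-agreement fraction, an assumption made for brevity.

```python
# Simplified ART-style clustering: if the best matching prototype
# clears the vigilance threshold, the input "resonates" with that
# cluster; otherwise a new cluster is created.
def match(proto, x):
    """Fraction of agreeing bits between prototype and input."""
    return sum(p == v for p, v in zip(proto, x)) / len(x)

def art_like_cluster(inputs, vigilance=0.75):
    clusters = []                             # list of cluster prototypes
    labels = []
    for x in inputs:
        scores = [match(p, x) for p in clusters]
        if scores and max(scores) >= vigilance:
            labels.append(scores.index(max(scores)))  # resonance: join cluster
        else:
            clusters.append(list(x))                  # no match: new cluster
            labels.append(len(clusters) - 1)
    return labels

data = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
print(art_like_cluster(data, vigilance=0.75))  # [0, 0, 1, 1]  (broad clusters)
print(art_like_cluster(data, vigilance=0.9))   # [0, 1, 2, 3]  (fine-grained)
```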
Types of ART Networks
1. ART1: Handles binary-valued input data.
2. ART2: Handles continuous-valued data.
3. ART3: An extension with hierarchical processing for large-scale applications.
4. Fuzzy ART: Incorporates fuzzy logic to process uncertainty in data.
ART networks are valuable tools in soft computing due to their ability to handle uncertainty and learn from data
without strict supervision. Here are some examples:
1. Image Recognition: Imagine a system that needs to categorize images of different types of flowers.
ART can learn to cluster images of roses, lilies, and tulips without being explicitly told what these
flowers are. It identifies patterns and similarities in the image data to form its own categories.
2. Anomaly Detection: In a network security context, ART can be used to detect unusual patterns in
network traffic. By learning the "normal" patterns, it can identify deviations that might indicate an attack,
even if the attack is new and unseen before.
3. Data Mining: ART can be used to discover hidden patterns and structures in large datasets. For
example, in customer behavior analysis, it can group customers based on their purchasing habits,
revealing distinct customer segments.
4. Robotics: In robotics, ART can help a robot learn to navigate an environment without a detailed map.
The robot can use its sensors to perceive the environment and cluster similar sensory experiences,
allowing it to build a mental representation of the space.
Apart from those listed above, there are a few more examples of applications of soft computing.
Here are a few important questions
1. List a few examples of Hard and Soft Computing.
2. Identify the characteristics of Hard and Soft Computing.
3. Differentiate between Hard and Soft Computing.
4. What are the major applications of AI?
5. Identify major techniques of Soft Computing.
6. List a few examples of Fuzzy Logic, Artificial Neural Network and Genetic Algorithms.
7. Identify a real-time problem that can be resolved through Soft Computing.
*****
Let us come back to the fuzzy logic system; this time we will discuss it in detail.
● Fuzzy logic was first developed by L.A. Zadeh in the mid-1960s for representing some types of
"approximate" knowledge that cannot be represented by conventional, crisp methods.
● Fuzzy logic is an extension of crisp bivalent (two-state) logic in the sense that it provides a platform for
handling approximate knowledge. Fuzzy logic is based on fuzzy set theory.
● A fuzzy set is represented by a membership function. A particular "element" value in the range of
definition of the fuzzy set will have a grade of membership which gives the degree to which the
particular element belongs to the set.
● The October 19, 1987 issue of Nikkei Industrial News stated: "Toshiba has developed an AI system
which controls machinery and tools using Fuzzy Logic. It controls rules, simulation, and valuation.
Toshiba will add Expert System function to it and accomplish the synthetic AI. Toshiba is going to turn it
into practical uses in the field of industrial products, traffic control, and nuclear energy.” This news item
is somewhat ironic and significant, because around the same time at the Information Engineering
Division of the University of Cambridge a similar application of fuzzy logic was developed for a robot.
This work was subsequently extended in the Industrial Automation Laboratory of the University of
British Columbia, where the applications centered around the fish processing industry.
● Popular applications are Process temperature control by OMRON, Air conditioner by Mitsubishi,
Vacuum cleaner by Panasonic, Automatic transmission system by Nissan, Subaru, Mitsubishi, Antilock
braking system by Nissan.
Fuzzy Sets
A fuzzy set is a set without clear or sharp (crisp) boundaries or without binary membership characteristics.
Unlike an ordinary set where each object (or element) either belongs or does not belong to the set, partial
membership in a fuzzy set is possible. In other words, there is a "softness" associated with the membership of
elements in a fuzzy set.
The membership in a fuzzy set need not be complete, i.e., members of one fuzzy set can also be members of
other fuzzy sets in the same universe. Vagueness is introduced in a fuzzy set by eliminating the sharp
boundaries that divide members from nonmembers in the group. There is a gradual transition between
membership and nonmembership, not abrupt transition.
Consider the variable "temperature". It can take a fuzzy value (e.g., cold, cool, tepid, warm, hot). A fuzzy value
such as "warm" is a fuzzy descriptor. It may be represented by a fuzzy set because any temperature that is
considered to represent "warm" belongs to this set and any other temperature does not belong to the set. Still,
one cannot realistically identify a precise temperature interval (e.g., 25°C to 30°C), which is a crisp set, to
represent warm temperatures.
Let X be a set that contains every set of interest in the context of a given class of problems. This is called the
universe of discourse (or simply universe), whose elements are denoted by x. A fuzzy set A in X may be
represented by a Venn diagram as given below.
Note: Generally, the elements x are not numerical quantities. For analytical convenience, however, the
elements x are assigned real numerical values.
A fuzzy set may be represented by a membership function. This function gives the grade (degree) of
membership within the set, of any element of the universe of discourse. The membership function maps the
elements of the universe to numerical values in the interval [0, 1].
μA(x): X → [0, 1]
where μA(x) is the membership function of the fuzzy set A in the universe X. Stated another way, fuzzy set
A is a set of ordered pairs:
A = {(x, μA(x)); x ∈ X, μA(x) ∈ [0, 1]}
The membership function μA(x) represents the grade of possibility that an element x belongs to the set A.
● A membership function value of zero implies that the corresponding element is definitely not an element
of the fuzzy set.
● A membership function value of unity means that the corresponding element is definitely an element of
the fuzzy set.
● A grade of membership greater than 0 and less than 1 corresponds to a non-crisp (or fuzzy)
membership, and the corresponding elements fall on the fuzzy boundary of the set. The closer μA(x)
is to 1, the more x is considered to belong to A; the closer it is to 0, the less it is considered
to belong to A.
Let us talk about the symbolic representation of the fuzzy set; here are few key terms-
● A universe of discourse (i.e. complete set of possible values under consideration) and a membership
function which spans the universe, completely define a fuzzy set. A fuzzy set may be symbolically
represented as:
A = {x|μA(x)}
● The element that has a membership grade greater than 0 but less than 1 is called a boundary element.
The crisp set of all elements in the universe that have a non-zero membership grade (i.e. μ > 0) is
called a support set.
Support set: SA = {x | μA(x) > 0}
● The support set is a crisp subset (means it is either a subset or not a subset, no scope for degree of
membership when it comes to subset) of the universe. A fuzzy set is clearly a subset of its support set.
● The core is the subset of elements with a membership grade of 1.
○ Note: We are not usually interested in the elements with zero grade of membership.
An Example
Universe of Discourse: All possible temperatures, e.g. [0°C, 50°C].
Fuzzy Set: "Warm temperatures."
Membership Function: Define the membership grades for "warm" temperatures as follows:
μ(10°C) = 0 (definitely not warm)
μ(20°C) = 0.2 (slightly warm)
μ(25°C) = 0.5 (moderately warm)
μ(30°C) = 0.8 (quite warm)
μ(35°C) = 1.0 (perfectly warm)
μ(40°C) = 0.6 (still warm but getting hot)
μ(50°C) = 0 (definitely not warm)
Note: A fuzzy set may have fewer elements than the support set when specific conditions, such as
thresholds or filtering criteria, are applied to define the fuzzy set.
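The support and core of the "warm temperatures" example can be computed directly from the membership grades:

```python
# The "warm temperatures" fuzzy set from the example above,
# as a mapping from temperature (degrees C) to membership grade.
warm = {10: 0.0, 20: 0.2, 25: 0.5, 30: 0.8, 35: 1.0, 40: 0.6, 50: 0.0}

support = {x for x, mu in warm.items() if mu > 0}   # crisp set: grade > 0
core    = {x for x, mu in warm.items() if mu == 1}  # crisp set: grade == 1

print(sorted(support))   # [20, 25, 30, 35, 40]
print(sorted(core))      # [35]
```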
● For fuzzy sets, "subset" is defined through membership grades, not element containment: A ⊆ B if
and only if μA(x) ≤ μB(x) for all x ∈ X. So, for fuzzy sets, A ⊆ B does not simply mean that B
lists all the elements of A.
An Example
A = {(20, 0.2), (25, 0.5)}
B = {(20, 0.2), (25, 0.7), (30, 0.8)}
Since μA(x) ≤ μB(x) for all x ∈ X, we conclude that A is a subset of B, yet A is not equal to B.
● Two fuzzy sets A and B are said to be equal fuzzy sets if μA(x) = μB(x) for all x ∈ X.
An Example
A={(20,0.2),(25,0.5),(30,0.8),(35,1.0)}
B={(20,0.2),(25,0.5),(30,0.8),(35,1.0)}
Since μA(x) = μB(x) for all x ∈ X, we conclude that A = B.
Now let us consider the relationship among a fuzzy set (A), its support set (S), and the
universal set (X).
S.No. Statement Remark
1. A is a subset of S. This holds according to crisp set theory (it is not directly applicable in
fuzzy set theory, because S carries no membership grades). S is derived from A by adding every
element of A to S, so A is considered a subset of S.
2. A is not equal to S. This holds because A carries more information than S: S has elements only,
while A has elements plus membership grades.
3. S is not a subset of A. This holds, because if it did hold, then A would have to equal S.
4. A is a subset of, or equal to, X. This holds according to crisp set theory as well as fuzzy set
theory.
5. S is a subset of, or equal to, X. This holds according to crisp set theory (again, membership
grades are not involved, since S is crisp).
● A fuzzy set is a universal set/whole fuzzy set if and only if the value of the membership function is 1 for
all the members under consideration.
● A fuzzy set A is said to be an empty fuzzy set if and only if the value of the membership function is 0 for
all possible members under consideration.
● The collection of all fuzzy sets and fuzzy subsets on universe X is called the fuzzy power set P(X).
Since fuzzy sets can overlap, the cardinality of the fuzzy power set, nP(X), is infinite. So we can say
A ⊆ X ⟹ μA(x) ≤ μX(x)
Also for x ∈ X; μϕ(x) = 0; μx(x) = 1
A note of Cardinality
- The cardinality of a crisp set refers to the number of distinct elements in the set.
An Example:
A={1,2,3}
The cardinality of A, denoted |A|, is 3.
- In fuzzy sets, elements are associated with membership grades between 0 and 1, so cardinality takes
these grades into account:

|A| = Σx∈U μA(x)

where μA(x) is the membership grade of x in the fuzzy set A, and U is the universe of discourse.
An Example:
Consider a fuzzy set A = {(1,0.5), (2,1.0), (3,0.2)}
Cardinality: |A| = 0.5 + 1.0 + 0.2 = 1.7
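The fuzzy cardinality of the example set is just the sum of its membership grades:

```python
# Fuzzy cardinality: sum of the membership grades of all elements.
A = {1: 0.5, 2: 1.0, 3: 0.2}
card = sum(A.values())
print(round(card, 2))   # 1.7
```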
● A discrete universe consists of distinct, individual elements that are countable and separate from each
other. You can list all elements explicitly. The set of letters, set of natural numbers is an example of a
discrete universe.
● A continuous universe consists of values that form a continuum and can take any value within a given
range. The elements are not distinct; they vary smoothly. You cannot list the elements explicitly, but
you can define a range. Temperature within a range and time within a range are examples of a
continuous universe.
If the universe is discrete with elements xi, then a fuzzy set may be specified using a convenient form of
notation due to Zadeh, in which each element is paired with its grade of membership in the form of a "formal
series" as

A = μA(x1)/x1 + μA(x2)/x2 + ... + μA(xn)/xn

or

A = Σi μA(xi)/xi
Important Note: Here "+" signifies combination, not arithmetic. The "+" symbol is just a way to combine
all the pairs. In computational systems, it might mean iterating over or processing each (xi, μA(xi)) pair
independently. There is no summation or computation happening here.
If the universe is continuous, an equivalent form of notation is given in terms of a "symbolic" integration:

A = ∫X μA(x)/x
Important Note: For continuous fuzzy sets, the integration symbol is also symbolic and is not intended to
perform an actual numerical integration.
For example, a membership function peaked at x = a defines a fuzzy set A whose elements x vaguely
represent those satisfying the crisp relation x = a. This fuzzy set corresponds to a fuzzy relation.
In this notation, a crisp set S and a fuzzy set F on the universe X may be written as
S = {s | s ∈ X}
F = {(s, μF(s)) | s ∈ X, where μF(s) is the degree of membership of s}
Crisp Sets vs. Fuzzy Sets
● Crisp sets do not handle uncertainty or vagueness. They are rigid and precise, making them suitable
for well-defined, clear-cut categories. Fuzzy sets are designed to handle uncertainty and vagueness;
they are useful for representing concepts that are not clearly defined or have gradations.
● Crisp sets are used in traditional logic, computer science (e.g., binary decisions), and mathematics,
where clear boundaries are required. Fuzzy sets are used in areas like control systems, artificial
intelligence, decision-making, and pattern recognition, where human-like reasoning and handling of
ambiguity are needed.
Let’s consider a universe U = {1, 2, 3, 4, 5} and a crisp set A representing "Even Numbers". Membership
function of crisp set A: μA(1) = 0, μA(2) = 1, μA(3) = 0, μA(4) = 1, μA(5) = 0. This is essentially a fuzzy set with
only two possible values, 0 or 1, making it a crisp set.
A general fuzzy set cannot always be a crisp set because a fuzzy set allows membership values between 0
and 1, while a crisp set only allows 0 or 1.
An Example: Consider a fuzzy set A representing the "likelihood of passing an exam" for students in a class of
5 students:
A={(S1,0.2),(S2,0.5),(S3,1.0),(S4,0.7),(S5,0.9)}
Here, the fuzzy set A has 5 elements, so its cardinality is finite.
An Example: A fuzzy set representing "tall heights" over all real numbers:
A = {(x, μA(x)) | x ∈ R, μA(x) = e^(−(180 − x)²/50)}
Here, the universal set is the set of all real numbers, making the cardinality infinite.
Types of membership functions
The fuzzy membership function is the graphical way of visualizing the degree of membership of any value in a
given fuzzy set. In the graph, the X-axis represents the universe of discourse and the Y-axis represents the
degree of membership in the range [0, 1].
1. Triangular membership function: This is one of the most widely accepted and used membership
functions (MF) in fuzzy controller design. The triangle that fuzzifies the input is defined by three
parameters a, b and c, where a and c define the feet (base) of the triangle and b defines its peak.
2. Trapezoidal membership function: The trapezoidal membership function is defined by four parameters:
a, b, c and d. The span from b to c takes the highest membership value; if x lies in (a, b) or (c, d),
its membership value lies between 0 and 1.
3. Gaussian membership function: A Gaussian MF is specified by two parameters {m, σ} and can be
defined as follows:

μ(x) = exp(−(x − m)² / (2σ²))
In this function, m represents the mean / center of the gaussian curve and σ represents the spread of
the curve. This is a more natural way of representing the data distribution, but due to mathematical
complexity, it is not much used for fuzzification.
4. Generalized bell-shaped function: A generalized bell MF is specified by three parameters {a, b, c} and
can be defined as follows:

μ(x) = 1 / (1 + |(x − c)/a|^(2b))
5. Sigmoid membership function: Sigmoid functions are widely used in classification tasks in machine
learning. It is controlled by parameters a and c, where a controls the slope at the crossover point x = c.
Mathematically, it is defined as

μ(x) = 1 / (1 + e^(−a(x − c)))
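The five membership functions above can be sketched directly from their defining formulas; the parameter values in the sample calls are illustrative assumptions.

```python
import math

# Standard membership functions; each returns a grade in [0, 1].
def triangular(x, a, b, c):
    # a and c are the feet of the triangle, b is the peak
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def trapezoidal(x, a, b, c, d):
    # grade is 1 on [b, c], ramps over (a, b) and (c, d)
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def gaussian(x, m, sigma):
    # m is the center, sigma the spread
    return math.exp(-((x - m) ** 2) / (2 * sigma ** 2))

def bell(x, a, b, c):
    # generalized bell with width a, slope b, center c
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def sigmoid(x, a, c):
    # a controls the slope at the crossover point x = c
    return 1.0 / (1.0 + math.exp(-a * (x - c)))

print(triangular(25, 20, 30, 40))   # 0.5 (halfway up the left slope)
print(gaussian(30, 30, 5))          # 1.0 (at the center m)
print(sigmoid(30, 1, 30))           # 0.5 (at the crossover point x = c)
```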
A graphic (membership function) representation of the complement of a fuzzy set (or negation of a fuzzy
state), μA′(x) = 1 − μA(x), is given as
An Example: Consider the fuzzy set of "hot temperatures". This may be represented by the membership
function shown using a solid line in Figure given below. This is the set containing all values of hot temperature
in a specified universe. In fuzzy logic then, this membership function can represent the fuzzy logic state "hot"
or the fuzzy statement, "the temperature is hot". The complement of the fuzzy set is represented by the dotted
line in the same Figure. This is the set containing all temperature values that are not hot, in the given universe.
The union corresponds to a logical OR operation (called Disjunction), and is denoted by A ∨ B, where A and B
are fuzzy states or fuzzy propositions. It is computed as μA∨B(x) = max[μA(x), μB(x)]. The rationale for using
max to represent a fuzzy-set union is that, because element x may belong to one set or the other, the larger of
the two membership grades should govern the outcome (union). A graphic (membership function)
representation of the union of two fuzzy sets (or the logical combination OR of two fuzzy states in the same
universe) is given in the following Figure.
An Example
Consider a universe representing the driving speeds on a highway, in km/h. Suppose that the fuzzy logic state
“Fast” is given by the discrete membership function
F = 0.6/80 + 0.8/90 + 1.0/100 + 1.0/110 + 1.0/120
Then the combined fuzzy condition “Fast OR Medium” is given by the membership function
F ∨ M = 0.6/50 + 0.8/60 + 1.0/70 + 1.0/80 + 0.8/90 + 1.0/100 + 1.0/110 + 1.0/120
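The union can be computed pointwise with max. F is the "Fast" set from the text; M ("Medium") is not listed explicitly in the notes, so the values below are assumptions chosen so that the computed union matches the F ∨ M membership function given above.

```python
# Fuzzy union (OR) via max over membership grades.
F = {80: 0.6, 90: 0.8, 100: 1.0, 110: 1.0, 120: 1.0}
M = {50: 0.6, 60: 0.8, 70: 1.0, 80: 1.0, 90: 0.8}   # assumed "Medium" set

# Elements missing from a set are taken to have grade 0.
union = {x: max(F.get(x, 0.0), M.get(x, 0.0)) for x in set(F) | set(M)}
print({x: union[x] for x in sorted(union)})
# {50: 0.6, 60: 0.8, 70: 1.0, 80: 1.0, 90: 0.8, 100: 1.0, 110: 1.0, 120: 1.0}
```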
Even though set union is defined for sets in a common universe, a logical "OR" may also be applied to
concepts in different universes. In particular, when the operands belong to different universes, orthogonal
axes have to be used to represent them in a common membership function.
An Example
Consider a universe representing room temperature (in °C) and another universe representing relative
humidity (%). Suppose that an acceptable temperature is given by the membership function
T = 0.4/16 + 0.8/18 + 1.0/20 + 1.0/22 + 0.8/24 + 0.5/26
Then the fuzzy condition "Acceptable Temperature OR Acceptable Humidity" is given by the following
membership function with a two-dimensional universe:
                          Temperature (°C)
                       16    18    20    22    24    26
Relative        0     0.4   0.8   1.0   1.0   0.8   0.5
Humidity       20     0.8   0.8   1.0   1.0   0.8   0.8
(%)            40     1.0   1.0   1.0   1.0   1.0   1.0
               60     0.6   0.8   1.0   1.0   0.8   0.6
               80     0.4   0.8   1.0   1.0   0.8   0.8
Consider two fuzzy sets A and B in the same universe X. Their intersection is a fuzzy set containing all
the elements that are common to both sets.
The intersection corresponds to a logical AND operation (called Conjunction), and is denoted by A ∧ B, where
A and B are fuzzy states or fuzzy propositions. The rationale for the use of min to represent fuzzy-set
intersection is that, because the element x must simultaneously belong to both sets, the smaller of the two
membership grades should govern the outcome (intersection).
A graphic (membership function) representation of the intersection of two fuzzy sets (or the logical combination
AND of two fuzzy states in the same universe) is given in Figure below-
An Example
Universe of discourse (X): Temperatures in degrees Celsius.
Fuzzy set A: "Warm temperatures."
A = {20:0.2, 25:0.5, 30:0.8, 35:1.0, 40:0.6}
Intersection (A ∩ B): "Warm and hot temperatures."
A ∩ B = {20:0.0, 25:0.0, 30:0.4, 35:0.7, 40:0.6, 45:0.0, 50:0.0}
Two fuzzy sets A and B are said to be equal (A = B) if μA(x) = μB(x) for every element x in the universe.
An Example
Universe of Discourse, temperatures: X={20,25,30,35}
Fuzzy Set A: "Warm temperatures"
A={(20,0.2), (25,0.5), (30,0.8), (35,1.0)}
Fuzzy Set B (with the same membership values):
B={(20,0.2), (25,0.5), (30,0.8), (35,1.0)}
Since every element has an identical membership value in both sets, we can say that A = B.
1. Algebraic Sum (+): The algebraic sum of two fuzzy sets A and B is defined as:
μA+B(x) = μA(x) + μB(x) − μA(x)·μB(x)
This represents the union of fuzzy sets A and B, accounting for their overlap.
An Example:
Let μA(x) = 0.6 and μB(x) = 0.4
μA+B(x) = 0.6 + 0.4 − (0.6 × 0.4) = 1.0 − 0.24 = 0.76
2. Algebraic Product (·): The algebraic product of two fuzzy sets A and B is defined as:
μA·B(x) = μA(x) · μB(x)
An Example:
Let μA(x) = 0.6 and μB(x) = 0.4
μA·B(x) = 0.6 × 0.4 = 0.24
3. Bounded Sum (⊕): The bounded sum of two fuzzy sets A and B is defined as:
μA⊕B(x) = min(1, μA(x) + μB(x))
This ensures that the combined membership value does not exceed 1.
An Example
Let μA(x) = 0.8 and μB(x) = 0.5.
μA⊕B(x) = min(1, 0.8 + 0.5) = min(1, 1.3) = 1
4. Bounded Difference (⊖): The bounded difference of two fuzzy sets A and B is defined as:
μA⊖B(x) = max(0, μA(x) − μB(x))
This represents the difference between the memberships of A and B, ensuring the result is non-negative.
An Example:
Let μA(x) = 0.7 and μB(x) = 0.4.
μA⊖B(x) = max(0, 0.7 − 0.4) = 0.3
5. Bounded Product (☉): The bounded product of two fuzzy sets A and B is defined as:
μA☉B(x) = max(0, μA(x) + μB(x) − 1)
An Example:
Let μA(x) = 0.7 and μB(x) = 0.4.
μA☉B(x) = max(0, 0.7 + 0.4 - 1) = 0.1
Additional Note: The scalar product of a fuzzy set A with a scalar α (0 ≤ α ≤ 1) is defined as:
μαA(x) = α · μA(x)
where:
❖ α is a scalar constant.
❖ μA(x) is the membership function of the fuzzy set A.
❖ μαA(x) is the new membership function after applying the scalar multiplication.
An Example
Let’s consider a fuzzy set A representing "membership in a good student category" based on some
performance levels:
A={(S1,0.2),(S2,0.5),(S3,0.7),(S4,1.0)}
Apply the scalar multiplication using α=0.6
Solution:
Computing for each element:
A′ = {(S1,0.6×0.2),(S2,0.6×0.5),(S3,0.6×0.7),(S4,0.6×1.0)} = {(S1,0.12),(S2,0.3),(S3,0.42),(S4,0.6)}
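The five operations above can be sketched as simple pointwise functions; the worked numbers from the text serve as a check:

```python
# A minimal sketch of the pointwise fuzzy-set operations described above.
def alg_sum(a, b):      # algebraic sum: a + b - a*b
    return a + b - a * b

def alg_prod(a, b):     # algebraic product: a*b
    return a * b

def bounded_sum(a, b):  # capped at 1: min(1, a + b)
    return min(1.0, a + b)

def bounded_diff(a, b): # floored at 0: max(0, a - b)
    return max(0.0, a - b)

def bounded_prod(a, b): # max(0, a + b - 1)
    return max(0.0, a + b - 1.0)

print(alg_sum(0.6, 0.4))      # approximately 0.76
print(alg_prod(0.6, 0.4))     # approximately 0.24
print(bounded_sum(0.8, 0.5))  # 1.0
print(bounded_diff(0.7, 0.4)) # approximately 0.3
print(bounded_prod(0.7, 0.4)) # approximately 0.1
```

Applying any of these functions across every element of two discrete fuzzy sets (as in the DIY exercise that follows) is a simple loop over the common universe.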
DIY:
Q.1 Given the two fuzzy sets
B1 = {1/1.0 + 0.75/1.5 + 0.3/2.0 + 0.15/2.5 + 0/3.0}
B2 = {1/1.0 + 0.6/1.5 + 0.2/2.0 + 0.1/2.5 + 0/3.0}
Find the algebraic sum, algebraic product, bounded sum, bounded difference and bounded product of
the given fuzzy sets.
Basic laws of fuzzy logic
Consider three general fuzzy sets A, B, and C defined in a common universe X. Let ϕ denote the null set (a set
with no elements, and hence having a membership function of zero value). With this notation, some important
properties of fuzzy sets are summarized here
Exclusion:
Law of excluded middle: A ∪ A′ ≠ X
Law of contradiction: A ∩ A′ ≠ ∅
(In crisp/classical sets A ∪ A′ = X and A ∩ A′ = ∅, i.e., an element must either belong to set A or to its complement A′, but never to both; fuzzy sets violate both laws.)
Distributivity
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
DeMorgan’s Laws
(A ∪ B)′ = A′ ∩ B′
(A ∩ B)′ = A′ ∪ B′
Commutativity
A ∪ B = B ∪ A;  A ∩ B = B ∩ A
Associativity
(A ∪ B) ∪ C = A ∪ (B ∪ C);  (A ∩ B) ∩ C = A ∩ (B ∩ C)
Absorption
A ∪ (A ∩ B) = A;  A ∩ (A ∪ B) = A
Idempotency
A ∪ A = A;  A ∩ A = A
(Idem = same; potent = power)
(Similar to unity or identity operation)
Boundary conditions
A ∪ X = X;  A ∩ X = A;  A ∪ ϕ = A;  A ∩ ϕ = ϕ
Transitivity: If A ⊆ B ⊆ C, then A ⊆ C
● DeMorgan’s Laws are particularly useful in simplifying (processing) expressions of sets (and logic).
● Law of Excluded Middle violation: Consider a fuzzy set of "tall people." Someone who is 5'10" (1.78 m)
might have a membership value of 0.6 in the "tall" set and hence 0.4 in the "not tall" set (its
complement). Taking the union with the max operation gives max(0.6, 0.4) = 0.6 ≠ 1, and the same
happens for other heights, e.g., max(0.3, 0.7) = 0.7 ≠ 1. This shows that A ∪ A′ ≠ X (where A′ is the
complement of A), violating the law of excluded middle.
● Law of Contradiction violation: Using the same "tall people" example, someone who is 5'10" can
simultaneously belong to both the "tall" (0.6) and "not tall" (0.4) sets. The intersection is not empty:
min(0.6, 0.4) = 0.4 ≠ 0, so A ∩ A′ ≠ ∅. This is perfectly valid in fuzzy logic because we are dealing
with approximate reasoning and linguistic variables.
● To see how the law of excluded middle is violated by fuzzy sets, consider the example shown in Figure below.
Here a broken line is used to represent a fuzzy set A. The complement A′ is shown by a dotted line,
which is obtained by subtracting the original membership function from 1. Next, the union of A and A′ is
performed using the max operation. The resulting membership function is given by the solid line. Note
that the result is not uniformly equal to 1 (i.e., not equal to μX). In particular, the middle part of μX is
excluded in the result.
Law 2: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
Let us take an example to verify this second distributive law of fuzzy sets:
A = {(1, 0.5), (2, 0.8), (3, 0.6)}
B = {(1, 0.7), (2, 0.4), (3, 0.9)}
C = {(1, 0.6), (2, 0.9), (3, 0.3)}
Result: (A ∪ B)′ = A′ ∩ B′
Law 2: (A ∩ B)′ = A′ ∪ B′
Result: (A ∩ B)′ = A′ ∪ B′
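The distributive and DeMorgan laws can be verified numerically for the sets A, B and C given above, using max for union, min for intersection and 1 − μ for the complement; a minimal sketch:

```python
# Verify the distributive and DeMorgan laws for the fuzzy sets A, B, C above
# (union = max, intersection = min, complement = 1 - mu).
A = {1: 0.5, 2: 0.8, 3: 0.6}
B = {1: 0.7, 2: 0.4, 3: 0.9}
C = {1: 0.6, 2: 0.9, 3: 0.3}
X = [1, 2, 3]

union = lambda p, q: {x: max(p[x], q[x]) for x in X}
inter = lambda p, q: {x: min(p[x], q[x]) for x in X}
comp  = lambda p:    {x: 1 - p[x] for x in X}

# Distributivity: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
lhs = inter(A, union(B, C))
rhs = union(inter(A, B), inter(A, C))
print(lhs == rhs)  # True

# DeMorgan: (A ∩ B)' = A' ∪ B'
print(comp(inter(A, B)) == union(comp(A), comp(B)))  # True
```

Note that the excluded-middle and contradiction laws would fail under the same operations, exactly as discussed earlier in this section.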
An Example
Suppose that A denotes the set of "my true statements". Then, in the universe of "all my statements", the
complement of A (i.e., A′) denotes the set of "my false statements".
a) Using A, A′, and the operations of union (∪) and equality (=) of sets, write an equation (in sets) that is
equivalent to the logic proposition: "All my statements are false."
b) Show that according to bivalent, crisp logic (i.e., assuming that A is a crisp set) the statement in (a)
is a contradiction.
c) Suppose that fuzzy logic (using fuzzy sets) is used with the statement in (a), where the complement
operation is given by μA′(x) = 1 − μA(x) and the union operation is represented by “max”. For what
membership values of A does the statement in (a) hold?
A fuzzy set whose support is a single element in X with μ A(x) = 1 is referred to as a fuzzy singleton i.e. only
one element has a membership value of 1, and all other elements have a membership value of 0.
An Example
Let X = {1,2,3,4,5} and define a fuzzy set A as:
A={(1,0),(2,0),(3,1),(4,0),(5,0)}
Here, only element 3 has full membership (μA(3)=1), while all others have membership 0 hence A is a fuzzy
singleton.
A fuzzy set whose membership function has at least one element x in the universe whose membership value is
unity is called a normal fuzzy set. The element of the normal fuzzy set for which the membership is equal to 1
is called the prototypical element.
An Example
A = {(1,0.2),(2,0.5),(3,1),(4,0.8)}
Subnormal fuzzy set
Note:
1. The maximum value of the membership function in a fuzzy set A is called the height of the fuzzy set.
For a normal fuzzy set, the height is equal to 1 because the maximum value of the membership
function allowed is 1. Thus, if the height of a fuzzy set is less than 1, then the fuzzy set is called a
subnormal fuzzy set.
2. A prototypical element can also refer to the most representative element of the fuzzy set, even if it does
not necessarily have a membership value of 1 (like in the case of a subnormal fuzzy set where no core
element exists). In such cases, the prototypical element is chosen based on the context, such as the
central value in a fuzzy number (e.g., the peak of a triangular membership function).
The element in the universe for which a particular fuzzy set A has its membership value equal to 0.5 is called a
crossover point of the membership function, i.e., μA(x) = 0.5. For a fuzzy set, the bandwidth (or width) is
defined as the distance between the two unique crossover points:
Bandwidth(A) = |x1 − x2|, where μA(x1) = μA(x2) = 0.5
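A small sketch can illustrate height, normality, crossover points and bandwidth together; the triangular membership function below is an assumed example (it is not taken from the notes):

```python
# Illustration of height, crossover points and bandwidth, using a
# hypothetical triangular membership function peaking at x = 5 (an assumed
# example for demonstration).
def mu(x):
    # Triangle with support [3, 7] and peak mu = 1 at x = 5.
    if 3 <= x <= 5:
        return (x - 3) / 2
    if 5 < x <= 7:
        return (7 - x) / 2
    return 0.0

xs = [i / 100 for i in range(0, 1001)]  # sample the universe [0, 10]
height = max(mu(x) for x in xs)
print("height:", height)  # 1.0, so this is a normal fuzzy set

# Crossover points: mu(x) = 0.5 at x = 4 and x = 6, so bandwidth = |4 - 6| = 2.
crossovers = [x for x in xs if abs(mu(x) - 0.5) < 1e-9]
bw = abs(crossovers[0] - crossovers[-1])
print("crossovers:", crossovers)  # [4.0, 6.0]
print("bandwidth:", bw)           # 2.0
```

If the peak value were lowered below 1, the same code would report a height below 1, i.e., a subnormal fuzzy set.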
A convex fuzzy set has a membership function whose membership values are strictly monotonically
increasing, or strictly monotonically decreasing, or strictly monotonically increasing and then strictly
monotonically decreasing, with increasing values of the elements in the universe.
The convex normal fuzzy set can be defined in the following way. For elements x1, x2 and x3 in a fuzzy set A
with x1 < x2 < x3, if the relation
μA(x2) ≥ min(μA(x1), μA(x3))
holds, then A is said to be a convex fuzzy set. That is, the membership of the middle element x2 should be
greater than or equal to the smaller of the memberships of x1 and x3.
A fuzzy set possessing characteristics opposite to those of a convex fuzzy set is called a non-convex fuzzy set, i.e.,
the membership values of the membership function are not strictly monotonically increasing, decreasing, or
increasing and then decreasing.
convex normal fuzzy set (left) and non-convex normal fuzzy set (right)
Tip: The intersection of two convex fuzzy sets is also a convex fuzzy set.
When the fuzzy set A is a convex single-point normal fuzzy set defined on the real line, then A is termed a
fuzzy number.
An ordered r-tuple is an ordered sequence of r elements expressed in the form (a1, a2, a3, ..., ar). An unordered
tuple is a collection of r elements without any restriction on order. For r = 2, the r-tuple is called an ordered
pair.
For crisp sets A₁, A₂, ... , Aᵣ, the set of all r-tuples (a₁, a₂, a₃, ... , aᵣ), where a₁ ∈ A₁, a₂ ∈ A₂ ... , aᵣ ∈ Aᵣ, is
called the Cartesian product of A₁, A₂ ... , Aᵣ and is denoted by A₁ × A₂ × . . . × Aᵣ. The Cartesian product of
two or more sets is not the same as the arithmetic product of two or more sets. If all the Aᵣ's are identical and
equal to A, then the Cartesian product A₁ × A₂ × · · · × Aᵣ is denoted as Aʳ.
An r-ary relation over A₁, A₂, ... , Aᵣ is a subset of the Cartesian product A₁ × A₂ × · · · × Aᵣ. When r = 2, the
relation is a subset of the Cartesian product A₁ × A₂. This is called a binary relation from A₁ to A₂. When three,
four or five sets are involved in the subset of a full Cartesian product then the relations are called ternary,
quaternary and quinary respectively.
X × Y = {(x, y) | x ∈ X, y ∈ Y}
Here the Cartesian product forms an ordered pair of every x ∈ X with every y ∈ Y. Every element in X is
completely related to every element in Y. The characteristic function, denoted by χ, gives the strength of the
relationship between ordered pairs of elements in each universe. If it takes unity as its value, then a complete
relationship is found; if the value is zero, then there is no relationship, i.e.
χR(x, y) = 1 if (x, y) ∈ R, and χR(x, y) = 0 if (x, y) ∉ R
When the universes or sets are finite, then the relation is represented by a matrix called relation matrix. An r-
dimensional relation matrix represents an r-ary relation. Thus, binary relations are represented by two-
dimensional matrices.
An Example
Consider the elements defined in the universes X and Y as follows:
X = {2,4,6}
Y = {p,q,r}
The Cartesian product X×Y consists of all ordered pairs (x,y), where x ∈ X and y ∈ Y.
X × Y = {(2,p), (2,q), (2,r), (4,p), (4,q), (4,r), (6,p), (6,q), (6,r)}
Consider the subset (relation) R = {(2,p), (4,q), (6,r)} of X × Y. It can be represented in the form of a matrix as follows:
p q r
2 1 0 0
4 0 1 0
6 0 0 1
Each subset pair is represented as a point in the Cartesian plane, where the X-axis corresponds to elements of
X and the Y-axis corresponds to elements of Y. Points like (2,p), (4,q), and (6,r) are plotted. The relationship
between X and Y is depicted with arrows connecting elements of X to their corresponding elements in Y as per
the subset pairs. The arrows indicate the mapping between these elements.
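The Cartesian product and the relation matrix above can be sketched as:

```python
# Build the Cartesian product X x Y and the relation matrix for the
# subset {(2,p), (4,q), (6,r)} discussed above.
X = [2, 4, 6]
Y = ['p', 'q', 'r']

cartesian = [(x, y) for x in X for y in Y]
print(cartesian)  # all 9 ordered pairs

R = {(2, 'p'), (4, 'q'), (6, 'r')}  # the relation, a subset of X x Y
# Characteristic function: 1 when the pair is in the relation, else 0.
matrix = [[1 if (x, y) in R else 0 for y in Y] for x in X]
for row in matrix:
    print(row)
```

The printed matrix reproduces the identity-like pattern shown in the table above: each element of X maps to exactly one element of Y.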
Do It Yourself
The elements in two sets are given as
A = {2, 4}
B = {a, b, c}
Find the following Cartesian product of two sets
(i) A × B (ii) B × A (iii) A × A (iv) B × B
This notation R : X → Y represents a relation R from set X to set Y. R associates elements of X (called the
domain of the relation) with elements of Y (called the codomain), i.e., it consists of ordered pairs (x, y) with x
∈ X and y ∈ Y. R specifies which elements of X relate to which
elements of Y, but it does not require that every x ∈ X must be related to a y ∈ Y, or vice versa.
● A Constrained relation is one where the elements in the domain X are related to elements in the
codomain Y based on specific conditions or rules. Let X = {1, 2, 3} and Y = {2, 4, 6}. The relation
R={(x,y) ∣ y = 2x} is a constrained relation because y must equal 2x.
● An Unconstrained relation is one that has no specific condition or rule governing the relationship
between elements of X and Y. Any element of X can be related to any element of Y. Example: X =
{1,2,3}, Y={2,4,6} and R={(1,2), (2,4), (3,6), (1,4)}.
● A Universal/Complete relation is a relation where every element of X is related to every element of Y.
In other words, it includes all possible ordered pairs (x, y) where x ∈ X and y ∈ Y. Let X={1, 2} and
Y={a, b}. The universal relation R={(1,a), (1,b), (2,a), (2,b)}.
● An Identity relation relates every element of a set X to itself, and only to itself. Let X = {1, 2, 3}. The
identity relation I = {(1, 1), (2, 2), (3, 3)}.
● The null relation is a relation where no pair of elements from the sets is related. It is represented by an
empty set. If R is a relation between two sets X and Y, the null relation is: R=∅. For X={1, 2} and Y={a,
b}, the null relation is: R=∅
The cardinality of a relation refers to the number of ordered pairs in the relation. A relation R ⊆ X×Y is a
subset of the Cartesian product of two sets X and Y.
Example:
Let X = {1,2} and Y = {a,b}.
The Cartesian product X × Y = {(1, a), (1, b), (2, a), (2, b)}, so any relation R from X to Y has cardinality at most 4.
The power set of a set S is the set of all possible subsets of S. If |S| = n, then the power set P(S) has 2ⁿ
elements.
Example:
Let S = {1,2}.
The power set P(S) = {∅, {1}, {2}, {1,2}}.
The cardinality of P(S) is |P(S)| = 2² = 4.
R1 ∩ R2 = ∅ (Null Relation)
The matrix representation is as follows:
R1 ⊈ R2 and R2 ⊈ R1
Law of Contradiction: R ∩ Rᶜ = ∅
This means that R ∘ S contains a pair (x, z) if there exists some element y in Y such that x is related to y in R
and y is related to z in S.
An Example
Consider the sets X = {1, 2}, Y = {a, b}, Z = {p, q} & relation R from X to Y:
R = {(1, a),(2, b)}
& Relation S from Y to Z:
S = {(a, p),(b, q)}
The matrices for relations R and S are as follows (rows correspond to X, columns correspond to Y):
Max-Min Composition
Mathematical definition for relations represented as matrices:
T(i, k) = maxⱼ [ min( R(i, j), S(j, k) ) ]
Where:
● i indexes rows (elements of X),
● j indexes columns in R and rows in S (elements of Y),
● k indexes columns (elements of Z).
The "min" operation finds the weakest link in each path (as it takes the smallest value); then "max" selects the
strongest among these weak links. It is like finding the most reliable path by considering the weakest point in
each possible route.
An Example
Using the same matrices for R and S (from the classical-relation composition example above):
For (1,p):
min(R(1,a), S(a,p)) = min(1,1) = 1
min(R(1,b), S(b,p)) = min(0,0) = 0
max(1,0) = 1
For (1,q):
min(R(1,a), S(a,q)) = min(1,0) = 0
min(R(1,b), S(b,q)) = min(0,1) = 0
max(0,0) = 0
For (2,p):
min(R(2,a), S(a,p)) = min(0,1) = 0
min(R(2,b), S(b,p)) = min(1,0) = 0
max(0,0) = 0
For (2,q):
min(R(2,a), S(a,q)) = min(0,0) = 0
min(R(2,b), S(b,q)) = min(1,1) = 1
max(0,1) = 1
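The worked max-min composition above can be checked with a short sketch:

```python
# Max-min composition of the crisp relations R (X to Y) and S (Y to Z)
# from the example above.
X, Y, Z = [1, 2], ['a', 'b'], ['p', 'q']
R = {(1, 'a'): 1, (1, 'b'): 0, (2, 'a'): 0, (2, 'b'): 1}
S = {('a', 'p'): 1, ('a', 'q'): 0, ('b', 'p'): 0, ('b', 'q'): 1}

def max_min(R, S, X, Y, Z):
    """T(x, z) = max over y of min(R(x, y), S(y, z))."""
    return {(x, z): max(min(R[(x, y)], S[(y, z)]) for y in Y)
            for x in X for z in Z}

T = max_min(R, S, X, Y, Z)
print(T)  # {(1, 'p'): 1, (1, 'q'): 0, (2, 'p'): 0, (2, 'q'): 1}
```

The same function works unchanged for fuzzy relations, since it only relies on min and max.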
Max-Product Composition
The Max-Product Composition replaces the min operation with multiplication:
T(i, k) = maxⱼ [ R(i, j) × S(j, k) ]
The "product" operation gives the combined strength of the entire path; then "max" selects the path with the
highest combined strength. It is like finding the most efficient path by considering the overall performance of
each route.
An Example
Using the same matrices for R and S:
For (1,p):
(R(1,a) × S(a,p)) = 1 × 1 = 1
(R(1,b) × S(b,p)) = 0 × 0 = 0
max(1,0) = 1
For (1,q):
(R(1,a) × S(a,q)) = 1 × 0 = 0
(R(1,b) × S(b,q)) = 0 × 1 = 0
max(0,0) = 0
For (2,p):
(R(2,a) × S(a,p)) = 0 × 1 = 0
(R(2,b) × S(b,p)) = 1 × 0 = 0
max(0,0) = 0
For (2,q):
(R(2,a) × S(a,q)) = 0 × 0 = 0
(R(2,b) × S(b,q)) = 1 × 1 = 1
max(0,1) = 1
Important note
● If the relation values are binary (0 or 1), both methods give the same result, since min(a, b) = a × b
when a, b ∈ {0, 1}.
● If the relation values are real numbers between 0 and 1, the multiplication step in Max-Product
Composition often leads to different values than Max-Min Composition.
Properties of Composition
Associativity: (R ∘ S) ∘ T = R ∘ (S ∘ T)
Non-Commutativity: R ∘ S ≠ S ∘ R (in general)
Inverse Relation: (R ∘ S)⁻¹ = S⁻¹ ∘ R⁻¹
Fuzzy Relation
A fuzzy relation is an extension of classical (crisp) relations where elements are not just related or unrelated,
but rather partially related with a degree of membership ranging between 0 and 1. It allows modeling of
vague, uncertain, or imprecise relationships. A fuzzy relation is a fuzzy set defined on the Cartesian product of
classical sets {X1, X2, X3, ..., Xn}, where tuples (x1, x2, x3, ..., xn) may have varying degrees of membership µR(x1, x2,
x3, ..., xn) within the relation.
A fuzzy relation R from set X to set Y is represented as a fuzzy subset of the Cartesian product X×Y.
Mathematically, it is defined as:
μR : X × Y → [0, 1]
Where μR(x,y) is the membership function that assigns a degree of relation between x ∈ X and y ∈ Y.
Let A be a fuzzy set on universe X and B be a fuzzy set on universe Y. The Cartesian product over A and B
results in fuzzy relation R and is contained within the entire (complete) Cartesian space, i.e., A x B = R where
R ⊂ X x Y. The membership function of the fuzzy relation is given by
μR(x, y) = μA×B(x, y) = min( μA(x), μB(y) )
If, for example, A contains three elements (a 3 × 1 column) and B contains four elements (a
1 × 4 row), the resulting fuzzy relation R will be represented by a matrix of size 3 × 4, i.e., R will have three rows
and four columns.
An Example
Consider the two fuzzy sets
A = {(x1, 0.3), (x2, 0.7), (x3, 1.0)}
B = {(y1, 0.4), (y2, 0.9)}
Perform the Cartesian product on these sets.
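One way to carry out this Cartesian product (using the min rule for the fuzzy relation) is:

```python
# Min-based fuzzy Cartesian product R = A x B for the two fuzzy sets above;
# R comes out as a 3 x 2 relation matrix.
A = {'x1': 0.3, 'x2': 0.7, 'x3': 1.0}
B = {'y1': 0.4, 'y2': 0.9}

R = {(x, y): min(A[x], B[y]) for x in A for y in B}
for x in A:
    print([R[(x, y)] for y in B])
# [0.3, 0.3]
# [0.4, 0.7]
# [0.4, 0.9]
```

Each entry takes the smaller of the two memberships, so every row is capped by μA(xi) and every column by μB(yj).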
For a fuzzy relation R on X × Y, the domain is the set of all elements in X that have at least one nonzero
relation with some element in Y: μdom(R)(x) = maxy μR(x, y)
Example: For the matrix given above; the domain is: {x1, x2, x3} because all these elements have at least one
nonzero entry.
The range is the set of all elements in Y that have at least one nonzero relation with some element in X: μran(R)(y) = maxx μR(x, y)
The range is: {y1, y2} since each column has at least one nonzero value.
A fuzzy graph is a graphical representation of a binary fuzzy relation. Each element in X and Y corresponds to
a node in the fuzzy graph. The connection links are established between the nodes by the elements of X x Y
with non-zero membership grades in R(X, Y). The links may also be present in the form of arcs. These links
are labeled with the membership values as μR(xi, yj).
Consider the following universe X = {x1, x2, x3, x4} and the binary fuzzy relation on X as
The bipartite graph and the simple fuzzy graph are as follows:
Let X = {x1, x2, x3, x4} and Y = {y1, y2, y3, y4}, and let R be a relation from X to Y given by
The cardinality of a fuzzy set or a fuzzy relation depends on whether the underlying universe is finite or infinite.
If the universe is infinite, the cardinality will be infinite; otherwise, it can be finite.
The basic operations on fuzzy sets also apply on fuzzy relations. Let R and S, be fuzzy relations on the
Cartesian space X x Y. The operations that can be performed on these fuzzy relations are described below,
consider the following sets and relations
We define relations over sets X={x1,x2} and Y={y1,y2}; Relation R1 and R2 are as follow-
1. Union: µR∪S(x, y) = max[ µR(x, y), µS(x, y) ]
2. Intersection: µR∩S(x, y) = min[ µR(x, y), µS(x, y) ]
3. Complement: µR̄(x, y) = 1 − µR(x, y)
4. Containment: R ⊆ S ⟺ µR(x, y) ≤ µS(x, y)
5. Inverse: The inverse of fuzzy relation R on X x Y is denoted by R -1. It is a relation on Y x X defined by R -1(y,
x) = R(x, y) for all pairs (y, x) ∈ Y x X
6. Projection: For a fuzzy relation R(X, Y), let [R ↓ Y] denote the projection of R onto Y. Then [R ↓ Y] is a fuzzy relation in Y whose membership function is μ[R↓Y](y) = maxx μR(x, y).
Similarly, let [R ↓ X] denote the projection of R onto X. Then [R ↓ X] is a fuzzy relation in X whose membership function is μ[R↓X](x) = maxy μR(x, y).
An Example
Consider the fuzzy relation:
R={(0.3,(x1,y1)),(0.7,(x1,y2)),(0.4,(x2,y1)),(0.8,(x2,y2))}
Note: The projection operation always provides a fuzzy set from the fuzzy relation.
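A short sketch of the projection operation applied to this relation:

```python
# Projections of the fuzzy relation R given above onto X and onto Y,
# taking the max over the other variable in each case.
R = {('x1', 'y1'): 0.3, ('x1', 'y2'): 0.7,
     ('x2', 'y1'): 0.4, ('x2', 'y2'): 0.8}
Xs = ['x1', 'x2']
Ys = ['y1', 'y2']

proj_X = {x: max(R[(x, y)] for y in Ys) for x in Xs}  # [R projected onto X]
proj_Y = {y: max(R[(x, y)] for x in Xs) for y in Ys}  # [R projected onto Y]
print(proj_X)  # {'x1': 0.7, 'x2': 0.8}
print(proj_Y)  # {'y1': 0.4, 'y2': 0.8}
```

As the note says, each projection is an ordinary fuzzy set, obtained by collapsing one dimension of the relation with max.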
Fuzzy Composition
Classical relations work with crisp values 0 and 1, while fuzzy relations work with membership values in [0, 1]
that require different operations to handle these partial truths. In fuzzy relations, we replace AND with min and
OR with max. Regular matrix multiplication is not used, since it would not properly capture the fuzzy nature of
the relationships between elements.
For example:
Classical: 1 × 1 = 1 (AND)
Fuzzy: min(0.7, 0.8) = 0.7
Let's consider two fuzzy relations R and S: R is a relation from set X = {x₁, x₂} to set Y = {y₁, y₂} S is a relation
from set Y = {y₁, y₂} to set Z = {z₁, z₂}
Given:
R = {((x₁,y₁), 0.8), ((x₁,y₂), 0.3), ((x₂,y₁), 0.2), ((x₂,y₂), 0.9)}
S = {((y₁,z₁), 0.4), ((y₁,z₂), 0.6), ((y₂,z₁), 0.1), ((y₂,z₂), 0.7)}
Max-min composition R ∘ S:
For element (x₁,z₁): max[min(0.8, 0.4), min(0.3, 0.1)] = max(0.4, 0.1) = 0.4
For element (x₁,z₂): max[min(0.8, 0.6), min(0.3, 0.7)] = max(0.6, 0.3) = 0.6
For element (x₂,z₁): max[min(0.2, 0.4), min(0.9, 0.1)] = max(0.2, 0.1) = 0.2
For element (x₂,z₂): max[min(0.2, 0.6), min(0.9, 0.7)] = max(0.2, 0.7) = 0.7
Max-product composition R ∘ S:
For element (x₁, z₁): max[(0.8 × 0.4), (0.3 × 0.1)] = max(0.32, 0.03) = 0.32
For element (x₁, z₂): max[(0.8 × 0.6), (0.3 × 0.7)] = max(0.48, 0.21) = 0.48
For element (x₂, z₁): max[(0.2 × 0.4), (0.9 × 0.1)] = max(0.08, 0.09) = 0.09
For element (x₂, z₂): max[(0.2 × 0.6), (0.9 × 0.7)] = max(0.12, 0.63) = 0.63
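Both compositions above can be checked with a short sketch:

```python
# Max-min and max-product composition of the fuzzy relations R and S above.
Y = ['y1', 'y2']
R = {('x1', 'y1'): 0.8, ('x1', 'y2'): 0.3,
     ('x2', 'y1'): 0.2, ('x2', 'y2'): 0.9}
S = {('y1', 'z1'): 0.4, ('y1', 'z2'): 0.6,
     ('y2', 'z1'): 0.1, ('y2', 'z2'): 0.7}

def compose(R, S, combine):
    """T(x, z) = max over y of combine(R(x, y), S(y, z))."""
    return {(x, z): max(combine(R[(x, y)], S[(y, z)]) for y in Y)
            for x in ['x1', 'x2'] for z in ['z1', 'z2']}

max_min  = compose(R, S, min)                 # fuzzy AND = min
max_prod = compose(R, S, lambda a, b: a * b)  # AND replaced by product
print(max_min)   # entries 0.4, 0.6, 0.2, 0.7 as worked above
print(max_prod)  # entries approximately 0.32, 0.48, 0.09, 0.63
```

Only the inner combining operation changes between the two compositions; the outer max is the same in both.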
Note: The properties of composition are the same for both classical and fuzzy relations.
Do it yourself
Q.1 Two fuzzy relations are given by
Obtain fuzzy relation T as a composition (for max-min and max-product) between the fuzzy relations.
Q.2 For a speed control of DC motor, the membership functions of series resistance, armature current and
speed are given as follows:
Compute relation T for relating series resistance to motor speed, i.e., R to N. Perform max-min composition
only.
(a) Find the fuzzy relation for the Cartesian product of A and B, i.e. R = A X B
(b) Introduce a fuzzy set C given by
Find the relation between C and B using Cartesian product i.e. find S = C x B
(c) Find C ∘ R using max-min composition
(d) Find C ∘ S using max-min composition
● A relation is said to be symmetric if for every edge pointing from vertex i to vertex j, there is an edge
pointing in the opposite direction, i.e., from vertex j to vertex i where i, j = 1, 2, 3, ....
● A relation is said to be transitive if, for every pair of edges in the graph, one pointing from vertex i to
vertex j and the other pointing from vertex j to vertex k, there is an edge pointing from vertex i to
vertex k.
Equivalence/Similarity Relation
Classical Equivalence/Similarity Relation
Let R be a relation on universe X, i.e., a relation from X to X. R is an equivalence relation if the following three
properties are satisfied:
1. Reflexive: ∀a ∈ X: (a,a) ∈ R
2. Symmetric: ∀a,b ∈ X: (a,b) ∈ R ⟺ (b,a) ∈ R
3. Transitive: ∀a,b,c ∈ X: ((a,b) ∈ R and (b,c) ∈ R) ⟹ (a,c) ∈ R
An Example
Let's consider a set A = {1, 2, 3, 4} and define a relation R based on "having the same parity" (both odd or both
even).
● Observe that (1, 1), (2, 2), (3, 3), and (4, 4) all belong to R. Since every element in A is related to itself
in R, the relation R is reflexive.
● (1, 3) has its counterpart (3, 1), (2, 4) has its counterpart (4, 2), (1, 1), (2, 2), (3, 3), and (4, 4) are their
own counterparts. For every pair (a, b) in R, the pair (b, a) also exists in R. Therefore, the relation R is
symmetric.
● Consider the following triples in R, where the first two pairs require the third by transitivity:
(1,3), (3,1) ⟹ (1,1);  (1,3), (3,3) ⟹ (1,3);  (1,1), (1,3) ⟹ (1,3)
(2,4), (4,2) ⟹ (2,2);  (2,4), (4,4) ⟹ (2,4);  (2,2), (2,4) ⟹ (2,4)
(3,1), (1,1) ⟹ (3,1);  (3,3), (3,1) ⟹ (3,1)
(4,2), (2,2) ⟹ (4,2);  (4,4), (4,2) ⟹ (4,2)
In every case the required pair is in R, so the relation is transitive.
● So the relation given above is a classical equivalence or classical similarity relation.
An Example
Let A = {1, 2, 3} and define a fuzzy relation R = {((1,1), 1.000), ((1,2), 0.800), ((1,3), 0.800), ((2,1), 0.800),
((2,2), 1.000), ((2,3), 0.800), ((3,1), 0.800), ((3,2), 0.800), ((3,3), 1.000)}.
● Observe that μR(1,1) = 1.000, μR(2,2) = 1.000 and μR(3,3) = 1.000 hence this relation is reflexive
● Consider μR(1,2) = μR(2,1) = 0.800 and μR(1,3) = μR(3,1) = 0.800 and μR(2,3) = μR(3,2) = 0.800 hence
the relation is symmetrical
● Observe that for x = 1, y = 2, z = 3: μR(1,3) = 0.800 and min(μR(1,2), μR(2,3)) = min(0.800, 0.800) =
0.800, so 0.800 ≥ 0.800 holds. Similarly, for x = 1, y = 3, z = 2: μR(1,2) = 0.800 and min(μR(1,3),
μR(3,2)) = min(0.800, 0.800) = 0.800, so 0.800 ≥ 0.800 is again true. This relation is transitive because
all values outside the diagonal are equal, so min(μR(x,y), μR(y,z)) will never be greater than μR(x,z).
● So the relation given above is a fuzzy equivalence or fuzzy similarity relation.
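The three checks above can be sketched programmatically:

```python
# Check reflexivity, symmetry and max-min transitivity for the fuzzy
# relation on A = {1, 2, 3} given above.
mu = {(1, 1): 1.0, (1, 2): 0.8, (1, 3): 0.8,
      (2, 1): 0.8, (2, 2): 1.0, (2, 3): 0.8,
      (3, 1): 0.8, (3, 2): 0.8, (3, 3): 1.0}
A = [1, 2, 3]

reflexive  = all(mu[(a, a)] == 1.0 for a in A)
symmetric  = all(mu[(a, b)] == mu[(b, a)] for a in A for b in A)
# Max-min transitivity: mu(x, z) >= max over y of min(mu(x, y), mu(y, z)).
transitive = all(mu[(x, z)] >= max(min(mu[(x, y)], mu[(y, z)]) for y in A)
                 for x in A for z in A)
print(reflexive, symmetric, transitive)  # True True True
```

All three tests pass, so the relation is a fuzzy equivalence (similarity) relation; the same `transitive` test fails for the fuzzy tolerance relation discussed next.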
Tolerance/Proximity Relation
Classical Tolerance/Proximity Relation
A tolerance relation is a relation that is reflexive & symmetric. It does NOT need to be transitive, unlike
equivalence relation.
An Example
Let A = {1, 2, 3, 4} and define relation R as "numbers differing by at most 1". R = {(1,1), (1,2), (2,1), (2,2), (2,3),
(3,2), (3,3), (3,4), (4,3), (4,4)}
● Observe (1,1), (2,2), (3,3), (4,4) means every element is related to itself hence this is a reflexive
relation.
● Observing (1,2), (2,1), (2,3), (3,2), (3,4), (4,3), every pair (a, b) in R has its counterpart (b, a) in R,
hence this is a symmetric relation.
● Observe that (1,2), (2,3) is in relation but the (1, 3) is not in the relation hence it is not transitive.
● Hence it is a classical tolerance relation.
An equivalence relation can be formed from a tolerance relation R1 by (n − 1) compositions of R1 with itself,
where n is the cardinality of the set that defines R1:
R1^(n−1) = R1 ∘ R1 ∘ ... ∘ R1 = R
Now compute R ∘ R.
This matrix now represents a fully connected relation, meaning no more pairs can be added; it is reflexive, symmetric
and transitive, hence it is an equivalence relation.
R = {((1,1), 1.000), ((1,2), 0.750), ((1,3), 0.500), ((1,4), 0.250), ((2,1), 0.750), ((2,2), 1.000), ((2,3), 0.750),
((2,4), 0.500), ((3,1), 0.500), ((3,2), 0.750), ((3,3), 1.000), ((3,4), 0.750), ((4,1), 0.250), ((4,2), 0.500), ((4,3),
0.750), ((4,4), 1.000)}
The matrix representation is as follow-
The fuzzy tolerance relation can be reformed into fuzzy equivalence relation in the same way as a crisp
tolerance relation is reformed into crisp equivalence relation.
R1^(n−1) = R1 ∘ R1 ∘ ... ∘ R1 = R
Note: Every equivalence relation is a tolerance relation (since it satisfies reflexivity and symmetry). Not
every tolerance relation is an equivalence relation.
The above relation is reflexive and symmetric but not transitive, because μR(1,3) = 0.750 and μR(3,4) = 0.750 but
μR(1,4) = 0.500; if max-min transitivity is applied, the requirement 0.500 ≥ min(0.750, 0.750) = 0.750 is false.
Let us now find the composition R ∘ R ∘ R in the same way as before.
The resulting relation is reflexive, symmetric and transitive, because μR(1,3) = 0.750 and μR(3,4) = 0.750 and now
μR(1,4) = 0.750; applying max-min transitivity, 0.750 ≥ min(0.750, 0.750) = 0.750 is true.
Do it yourself
Q.1 The following figure shows three relations on the universe X = {a, b, c}. Are these relations equivalence
relations?
Noninteractive Fuzzy Sets
Since "Tallness" and "Intelligence" are unrelated attributes, they are non-interactive fuzzy sets. Their
combination follows basic fuzzy set operations without influencing each other.
A fuzzy set A defined on the Cartesian space X = X1 × X2 is separable into two non-interactive fuzzy sets,
called orthogonal projections, if and only if:
A = OPrx1(A) × OPrx2(A)
This equation means that the membership function of A can be written as the (min-based) Cartesian product of
the membership functions of its projections:
μA(x1, x2) = min( μOPrx1(A)(x1), μOPrx2(A)(x2) )
This implies independence of x1 & x2. It means that knowing x1 tells us nothing about x2 and vice versa.
An Example
Let's define our spaces: X1 = {1, 2}, X2 = {a, b}, so X = X1 × X2 = {(1,a), (1,b), (2,a), (2,b)}, and a fuzzy set A that is
separable: A = {((1,a), 0.3), ((1,b), 0.3), ((2,a), 0.6), ((2,b), 0.6)}
The orthogonal projections (max over the other coordinate) are OPrx1(A) = {(1, 0.3), (2, 0.6)} and OPrx2(A) = {(a, 0.6), (b, 0.6)}.
Now, let's reconstruct A using the Cartesian product: μA(x1, x2) = min(μOPrx1(A)(x1), μOPrx2(A)(x2)); for each point:
● (1,a): min(0.3, 0.6) = 0.3
● (1,b): min(0.3, 0.6) = 0.3
● (2,a): min(0.6, 0.6) = 0.6
● (2,b): min(0.6, 0.6) = 0.6
The reconstructed values match our original set A, confirming that A is separable into orthogonal projections.
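The separability check above can be sketched as:

```python
# Check that the fuzzy set A above is separable into its orthogonal
# projections (projection = max over the other coordinate; reconstruction
# via the min-based Cartesian product).
A = {(1, 'a'): 0.3, (1, 'b'): 0.3, (2, 'a'): 0.6, (2, 'b'): 0.6}
X1 = [1, 2]
X2 = ['a', 'b']

proj1 = {x1: max(A[(x1, x2)] for x2 in X2) for x1 in X1}  # {1: 0.3, 2: 0.6}
proj2 = {x2: max(A[(x1, x2)] for x1 in X1) for x2 in X2}  # {'a': 0.6, 'b': 0.6}

reconstructed = {(x1, x2): min(proj1[x1], proj2[x2])
                 for x1 in X1 for x2 in X2}
print(reconstructed == A)  # True, so A is separable
```

Running the same check on the DIY set B below shows how a non-separable set fails the reconstruction test.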
Do it yourself
Q.1 For the same sets X1 and X2; Let B = {((1,a), 0.3), ((1,b), 0.8), ((2,a), 0.6), ((2,b), 0.4)}. Check if B is
separable into orthogonal projections.
Fuzzification
Fuzzification is the process of transforming a crisp set to a fuzzy set or a fuzzy set to a fuzzier set, i.e., crisp
quantities are converted to fuzzy quantities. For example, when one is told that the temperature is 9°C, the
person translates this crisp input value into linguistic variables such as cold or warm according to one's
knowledge and then makes a decision about needing to wear a jacket. If one fails to fuzzify, then it is not
possible to continue the decision process, or an erroneous decision may be reached.
For a fuzzy set A, a common fuzzification algorithm is performed by keeping μi constant and transforming xi
to a fuzzy set Q(xi) depicting the expression about xi. The fuzzy set Q(xi) is referred to as the
kernel of fuzzification. The fuzzified set Ã can be expressed as
Ã = μ1 Q(x1) + μ2 Q(x2) + ... + μn Q(xn)
Intuition
The intuition method is based upon the common intelligence of humans. It is the capacity of humans to develop
membership functions on the basis of their own intelligence and understanding.
Consider the figure below, which shows various membership curves for the weights of people (in kilograms) in the
universe. Each curve is a membership function corresponding to a fuzzy (linguistic) variable, such as
very light, light, normal, heavy and very heavy. The curves depend on context and on the person
developing them. For example, if the weights refer to the range of thin persons we get one set of
curves, and if they refer to the range of normal-weight persons we get another set, and so on. The
main characteristic of these curves for practical use is their overlap.
Do it yourself
Q.1 Using your own intuition, plot the fuzzy membership function for the age of people.
Inference
The inference method uses knowledge to perform deductive reasoning. Deduction achieves conclusion by
means of forward inference. There are various methods for performing deductive reasoning. Here the
knowledge of geometrical shapes and geometry is used for defining membership values. The membership
functions may be defined by various shapes: triangular, trapezoidal, bell-shaped, Gaussian and so on. The
inference method here is discussed via triangular shape.
Consider a triangle, where X, Y and Z are the angles such that X ≥ Y ≥ Z ≥ 0, and let U be the universe of triangles:
U = {(X, Y, Z) | X ≥ Y ≥ Z ≥ 0; X + Y + Z = 180°}
There are various types of triangles available. Here a few are considered to explain inference methodology:
l = Isosceles triangle (approximate) E = Equilateral triangle (approximate)
R = Right-angle triangle (approximate) IR = Isosceles and right-angle triangle (approximate)
T = Other triangle
By the method of inference, we can obtain the membership values for all the above-mentioned triangles, since we possess knowledge about the geometry of triangles.
If X = Y or Y = Z, the membership value of the approximate isosceles triangle is equal to 1. On the other hand,
if X = 120°, Y = 60° and Z = 0°, we get a membership value of 0.
If X = 90°, the membership value of a right-angle triangle is 1, and if X = 180°, the membership value μR
becomes 0:
X = 90° ⟹ μR = 1
X = 180° ⟹ μR = 0
The membership value of the approximate isosceles right-angle triangle is obtained by taking the logical
intersection (min) of the approximate isosceles and approximate right-angle triangle membership functions, i.e.
μIR = min(μI, μR)
To check this, you can substitute the values X = 80°, Y = 70° and Z = 30°.
The membership function of all other triangles, denoted by T, is the complement of the logical union of I, R and E,
i.e.
T = (I ∪ R ∪ E)’
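The membership formulas for the individual triangle shapes were lost in this copy of the notes; the sketch below uses the standard inference-method formulas (an assumption based on common textbook treatments), checked on the worked angles X = 80°, Y = 70°, Z = 30°:

```python
# Sketch of the inference method for triangle shapes, using the standard
# membership formulas (assumed here, since the original equations were lost):
#   mu_I  = 1 - (1/60)  * min(X - Y, Y - Z)   (approximate isosceles)
#   mu_R  = 1 - (1/90)  * |X - 90|            (approximate right angle)
#   mu_E  = 1 - (1/180) * (X - Z)             (approximate equilateral)
#   mu_IR = min(mu_I, mu_R)
#   mu_T  = min(1 - mu_I, 1 - mu_R, 1 - mu_E) (complement of the union)
def triangle_memberships(X, Y, Z):
    assert X >= Y >= Z >= 0 and X + Y + Z == 180
    mu_I = 1 - min(X - Y, Y - Z) / 60
    mu_R = 1 - abs(X - 90) / 90
    mu_E = 1 - (X - Z) / 180
    mu_IR = min(mu_I, mu_R)
    mu_T = min(1 - mu_I, 1 - mu_R, 1 - mu_E)
    return mu_I, mu_R, mu_E, mu_IR, mu_T

# The worked angles X = 80, Y = 70, Z = 30 from the text:
print(triangle_memberships(80, 70, 30))
```

With these formulas the degenerate triangle (120°, 60°, 0°) indeed gives μI = 0, and X = 90° gives μR = 1, matching the boundary cases stated above.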
Do it yourself
Q.1 Using the inference approach, find the membership values for the triangular shapes I,R, E, IR, and T for a
triangle with angles 45°, 55° and 80°.
Q.2 Using the inference approach, obtain the membership values for the triangular shapes (I, R, T) for a
triangle with angles 40°, 60° and 80°.
Rank Ordering
The formation of a government is based on the polling concept; to identify the best student, ranking may be
performed; to buy a car, one can ask for several opinions; and so on. All the above-mentioned activities are
carried out on the basis of the preferences made by an individual, a committee, a poll and other opinion
methods. This methodology can be adapted to assign the membership values to a fuzzy variable. Pairwise
comparisons enable us to determine preferences, and this results in the determination of the membership.
An Example
Imagine a group of students is voting for the best basketball player in their school. Each student ranks the top
players based on their skills, teamwork, and performance in past games.
Angular Fuzzy Sets
Consider the pH value of wastewater from a dyeing industry. These pH readings are assigned linguistic labels,
such as high base, medium acid, etc., to understand the quality of the polluted water. The pH value should be
taken care of because the waste from the dyeing industry should not be hazardous to the environment. As is
known, the neutral solution has a pH value of 7. The linguistic variables are built in such a way that a "neutral
(N)" solution corresponds to θ = 0 rad, and "exact base (EB)" and "exact acid (EA)" corresponds to θ=π/2 rad
and θ=-π/2 rad, respectively. The levels of pH between 7 and 14 can be termed "very base" (VB), "medium
base" (MB) and so on, and are represented between 0 and π/2 rad. Levels of pH between 0 and 7 can be termed
"very acid" (VA), "medium acid" (MA) and so on, and are represented between 0 rad and -π/2 rad. The model of the
angular fuzzy set using these linguistic labels for pH is shown in the figure below
The values of the linguistic variable with varying θ and their membership values lie on the μ(θ) axis. The
membership value corresponding to a linguistic term can be obtained from the following equation
μ(θ) = |z · tan(θ)|
where z is the horizontal projection of the radial vector. Angular fuzzy sets are best in cases with polar
coordinates or in cases where the value of the variable is cyclic. In Angular Fuzzy Sets, the membership
function depends on an angle θ, and projections are determined using trigonometric functions like cosine and
sine.
● Horizontal Projection (on the X-axis) is given by: μx(x) = μ(θ)·cos(θ)
● Vertical Projection (on the Y-axis) is given by: μy(y) = μ(θ)·sin(θ)
An Example
Q.1 The energy E of a particle spinning in a magnetic field B is given by the equation
E = μB sinθ
Where μ is the magnetic moment of a spinning particle and θ is the complement angle of the magnetic moment
with respect to the direction of the magnetic field. Assume the magnetic field B and magnetic moment μ to be
constant, and the linguistic terms for the complement angle of magnetic moment be given as
Find the membership values using the angular fuzzy set approach for these linguistic labels and plot these
values versus θ.
Now calculate the angular fuzzy membership values as shown in the table below
The plot for the membership function shown in this table is given in Figure
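The angular membership formula μ(θ) = |z·tan(θ)| can be sketched as follows; taking z = cos(θ), the horizontal projection of a unit radial vector, is an assumption made here for illustration (it reduces μ to |sin(θ)|):

```python
from math import tan, cos, radians

def angular_mu(theta_deg):
    """Angular fuzzy membership mu(theta) = |z * tan(theta)|,
    assuming z = cos(theta) for a unit radial vector."""
    theta = radians(theta_deg)
    z = cos(theta)  # horizontal projection of the radial vector (assumed)
    return abs(z * tan(theta))

# Sample linguistic-angle positions between "neutral" and "exact base"
for deg in (0, 30, 45, 90):
    print(deg, round(angular_mu(deg), 4))
```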
Neural Networks
The neural network can be used to obtain fuzzy membership values. Consider a case where fuzzy
membership functions are to be created for the fuzzy classes of an input data set. The input data set is collected and
divided into a training data set and a testing data set. The training data set trains the neural network. Consider an
input training data set as shown in figure below-
Consider part (A) of the figure: the data set contains several data points. The data points are
first divided into different classes by conventional clustering techniques. It can be noticed that the data points
are divided into three classes, RA, RB and RC. Consider data point 1 having input coordinate values Xi = 0.6
and Xj = 0.8. This data point lies in the region RB; hence we assign a complete membership of 1 to class RB and 0
to classes RA and RC. In a similar manner, the other data points are given membership values of 1 for the class
they initially belong to.
Consider part (B) of the figure: a neural network is created which uses data point 1 and the
corresponding membership values in the different classes to train itself to simulate the relationship between
coordinate location and the membership values.
The output of the neural network is shown in part-(C) of the figure, which classifies data points into one of the
three regions. The neural network uses the next data set of data values and membership values for further
training processes.
The process is continued until the neural network simulates the entire set of input-output values. The network
performance is then tested using the testing data set. When the neural network is ready in its final version (as shown in the figure below),
it can be used to determine the membership values of any input data in the different regions (classes). A
complete mapping of the membership of various data points in various fuzzy classes can be derived to
determine the overlap of the different classes. The overlap of the three fuzzy classes is shown in the hatched
portion of the figure. In this manner, the neural network is used to determine the fuzzy membership function.
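A minimal sketch of the idea, assuming a single-layer softmax network trained with plain gradient descent; the 2-D coordinates, class regions RA/RB/RC and learning rate below are illustrative assumptions, not taken from the figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training points (Xi, Xj) with crisp one-hot memberships
# in three classes RA, RB, RC, mimicking the clustering step described above.
X = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],   # class RA
              [0.60, 0.80], [0.70, 0.70], [0.65, 0.90],   # class RB
              [0.90, 0.20], [0.80, 0.10], [0.95, 0.30]])  # class RC
T = np.repeat(np.eye(3), 3, axis=0)                       # one-hot targets

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Train a single-layer softmax network with plain gradient descent.
W = rng.normal(scale=0.1, size=(2, 3))
b = np.zeros(3)
for _ in range(2000):
    P = softmax(X @ W + b)
    grad = P - T                      # gradient of the cross-entropy loss
    W -= 0.5 * X.T @ grad / len(X)
    b -= 0.5 * grad.mean(axis=0)

# The trained network now outputs graded memberships in [0, 1] for any
# coordinate pair; across the three classes they sum to 1.
mu = softmax(np.array([[0.6, 0.8]]) @ W + b)[0]
print(mu.round(3))
```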
Genetic Algorithm
The genetic algorithm is based on Darwin's theory of evolution; the basic rule is “survival of the fittest”. The
genetic algorithm is used here to determine the fuzzy membership functions. This can be done using the
following steps:
1. For a particular functional mapping system, the same membership functions and shapes are assumed
for various fuzzy variables to be defined.
2. These chosen membership functions are then coded into bit strings.
3. Then these bit strings are concatenated together.
4. The fitness function to be used here is noted. In genetic algorithms, fitness function plays a major role
similar to that played by activation function in neural networks.
5. The fitness function is used to evaluate the fitness of each set of membership functions.
6. These membership functions define the functional mapping of the system.
The process of generating and evaluating strings is carried out until we get a convergence to the solution
within a generation, i.e., we obtain the membership functions with best fitness value. Thus, fuzzy membership
functions can be obtained from genetic algorithms.
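A toy sketch of the procedure under simplifying assumptions: a single triangular membership function with a fixed half-width is encoded as an 8-bit string, and the fitness is the negative squared error against sampled membership data (the target peak, population size and generation count are illustrative):

```python
import random

random.seed(1)

# Target: recover the peak of a triangular membership function on [0, 10]
# from sampled (x, mu) data. A fixed half-width w = 3 is assumed.
def tri(x, c, w=3.0):
    return max(0.0, 1.0 - abs(x - c) / w)

TRUE_C = 6.0
samples = [(x / 2, tri(x / 2, TRUE_C)) for x in range(21)]

def decode(bits):          # 8-bit string -> peak position c in [0, 10]
    return int(bits, 2) / 255 * 10

def fitness(bits):         # negative squared error against the samples
    c = decode(bits)
    return -sum((tri(x, c) - m) ** 2 for x, m in samples)

pop = [''.join(random.choice('01') for _ in range(8)) for _ in range(20)]
for gen in range(60):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                          # keep the fittest half
    children = []
    while len(children) < 10:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, 8)            # one-point crossover
        child = a[:cut] + b[cut:]
        i = random.randrange(8)                 # bit-flip mutation
        child = child[:i] + ('1' if child[i] == '0' else '0') + child[i + 1:]
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print(decode(best))  # close to the true peak 6.0
```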
Induction Reasoning
Induction is used to deduce causes by means of backward inference. The characteristics of inductive
reasoning can be used to generate membership functions. Induction employs the entropy minimization
principle, which clusters the parameters corresponding to the output classes. To perform an inductive
reasoning method, a well-defined database for the input–output relationship should exist. The inductive
reasoning can be applied for complex systems where the data are abundant and static. For dynamic data sets,
this method is not best suited, because the membership functions continually change with time. There exist
three laws of induction
1. Given a set of irreducible outcomes of an experiment, the induced probabilities are those probabilities
consistent with all available information that maximize the entropy of the set.
2. The induced probability of a set of independent observations is proportional to the probability density of
the induced probability of a single observation.
3. The induced rule is that rule consistent with all available information that minimizes the entropy.
The third law stated above is widely used for the development of membership functions. The membership
functions using inductive reasoning are generated as follows:
1. A fuzzy threshold is to be established between classes of data.
2. Using the entropy minimization screening method, first determine the threshold line.
3. Then start the segmentation process.
4. The segmentation process results in two classes.
5. By partitioning the first two classes one more time, we obtain three different classes.
6. The partitioning is repeated with threshold value calculations, which lead us to partition the data set into
a number of classes or fuzzy sets.
7. Then on the basis of the shape, membership function is determined
Thus the membership function is generated on the basis of the partitioning or analog screening concept. This
draws a threshold line between two classes of sample data. The idea behind drawing the threshold line is to
classify the samples when minimizing the entropy for optimum partitioning.
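The entropy-minimization screening step can be sketched as follows, assuming hypothetical 1-D samples labelled with two output classes; the threshold with the lowest weighted entropy is chosen:

```python
from math import log2

# Hypothetical 1-D samples, each labelled with an output class.
data = [(1.0, 'A'), (1.5, 'A'), (2.0, 'A'), (2.2, 'A'),
        (5.0, 'B'), (5.5, 'B'), (6.0, 'B'), (6.5, 'B')]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * log2(p)
    return h

def best_threshold(points):
    """Pick the midpoint threshold minimizing the weighted entropy."""
    xs = sorted(x for x, _ in points)
    cands = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in cands:
        left = [c for x, c in points if x <= t]
        right = [c for x, c in points if x > t]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(points)
        if best is None or h < best[1]:
            best = (t, h)
    return best

t, h = best_threshold(data)
print(t, h)  # the threshold 3.6 separates the classes with zero entropy
```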
Defuzzification
Defuzzification is the process of converting fuzzy values into clear, precise values that can be used for real-
world applications. Since many control systems require specific, non-fuzzy actions, defuzzification helps
translate fuzzy control decisions into exact outputs.
This process takes a fuzzy control decision, represented by a range of possible values, and selects the best
single value that represents it. Defuzzification can turn a fuzzy set into a single crisp value, transform a fuzzy
matrix into a precise matrix, or convert a fuzzy number into a definite number.
Mathematically, the defuzzification process may also be termed as “rounding it off”. Fuzzy set with a collection
of membership values or a vector of values on the unit interval may be reduced to a single scalar quantity
using the defuzzification process.
Lambda-Cuts for Fuzzy Sets
The set Aλ is called a weak lambda-cut set if it consists of all the elements of a fuzzy set whose membership
functions have values greater than or equal to a specified value:
Aλ = {x | μA(x) ≥ λ}
The set Aλ+ is called a strong lambda-cut set if it consists of all the elements of a fuzzy set whose membership
functions have values strictly greater than a specified value. A strong λ-cut set is given by
Aλ+ = {x | μA(x) > λ}
All the λ-cut sets form a family of crisp sets. It is important to note that the λ-cut set Aλ (or Aλ+, for a strong λ-cut) does not carry a tilde, because it is a crisp set derived from the fuzzy set.
Note: Typically, λ is taken from the interval (0,1] rather than [0,1] because λ-cuts are not defined for λ = 0 (as it
would include all elements of the universal set). In some cases, λ can be taken from [0,1], where A₀ includes
all elements of the fuzzy set, and A₁ includes only elements with full membership (μA(x) = 1).
An Example
Consider a fuzzy set A representing "tall people" with elements:
A={(150,0.2),(160,0.5),(170,0.8),(180,1.0),(190,0.9)}
Let us take an example of the fourth property; A is a fuzzy set defined over a universe X={x 1, x2, x3}. The
membership function μA(x) assigns values to elements of X.
The figure above shows the features of the membership functions. The core of the fuzzy set A is the λ = 1 cut set A1.
The support of A is the strong λ-cut set A0+, where λ = 0+, and it can be defined as
A0+ = {x | μA(x) > 0}
The interval [A0+, A1] forms the boundaries of the fuzzy set A, i.e., the regions with membership values
between 0 and 1, i.e., for λ = 0 to 1.
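Weak and strong λ-cuts of a discrete fuzzy set, such as the "tall people" set above, can be computed directly:

```python
# The discrete "tall people" fuzzy set from the example above.
A = {150: 0.2, 160: 0.5, 170: 0.8, 180: 1.0, 190: 0.9}

def weak_cut(fset, lam):    # elements with mu(x) >= lambda
    return {x for x, mu in fset.items() if mu >= lam}

def strong_cut(fset, lam):  # elements with mu(x) > lambda
    return {x for x, mu in fset.items() if mu > lam}

print(weak_cut(A, 0.5))    # {160, 170, 180, 190}
print(strong_cut(A, 0.5))  # {170, 180, 190}
print(weak_cut(A, 1.0))    # core of A: {180}
print(strong_cut(A, 0.0))  # support of A: all five elements
```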
Do it yourself
Q.1 Consider the fuzzy sets A and B both defined on X
Q.2 Consider the discrete fuzzy set defined on the universe X = {a,b,c,d,e} as
Using Zadeh’s notation, find the λ-cut sets for λ =1, 0.9,0.6,0.3,0 + and 0.
Lambda-Cuts for Fuzzy Relations
Rλ = {(x, y) | μR(x, y) ≥ λ}
where Rλ is a λ-cut relation of the fuzzy relation R. Since R is defined as a two-dimensional array on the universes X and Y, a pair (x, y) belongs to Rλ if its membership value is
greater than or equal to λ. Similar to the properties of λ-cut fuzzy sets, the λ-cuts on fuzzy relations also obey
certain properties. They are listed as follows. For two fuzzy relations R and S the following properties should
hold:
1. For any λ ≤ α, where 0 ≤ λ ≤ α ≤ 1, it is true that Rα ⊆ Rλ
2. (R ∪ S)λ = Rλ ∪ Sλ
3. (R ∩ S)λ = Rλ ∩ Sλ
4. (R’)λ ≠ (Rλ)’ except when λ = 0.5
Do it yourself
Q.1 Determine the crisp λ-cut relation when λ =0.1, 0+, 0.3 and 0.9 for the following relation R
Note:
1. Any lambda-cut relation of a fuzzy tolerance relation results in a crisp tolerance relation.
2. Any lambda-cut relation of a fuzzy equivalence relation results in a crisp equivalence relation.
Defuzzification Methods
Defuzzification is the process of conversion of a fuzzy quantity into a precise quantity. The output of a fuzzy
process may be a union of two or more fuzzy membership functions defined on the universe of discourse of the
output variable.
Consider a fuzzy output comprising two parts: the first part, C1, has a triangular membership shape (as shown in
figure A), and the second part, C2, a trapezoidal shape (as shown in figure B). The union of these two membership
functions, i.e. C = C1 ∪ C2, is obtained using the max operator and forms the outer envelope of the two shapes shown
in (C). A fuzzy output process may involve many output parts, and the membership function representing each
part of the output can have any shape. The membership function of the fuzzy output need not always be
normal. In general, we have
1. Max-membership Principle
This method, also known as the height method, is limited to peaked output functions (i.e., those with a clear, single peak of maximum membership). It is given by the algebraic expression
μA(x*) ≥ μA(x) for all x ∈ X
Note: if there are multiple peaks, the result will be ambiguous. To overcome this we have to use
either the first of maxima, the last of maxima, or the mean-max membership.
An Example:
Consider the following fuzzy set for ripeness
Ripe = {(0, 0), (2, 0.3), (4, 0.6), (5, 0.8), (6, 1), (7, 0.7), (8, 0.5), (10, 0)}
The maximum membership of 1.0 occurs at a ripeness of 6; hence the defuzzified value is 6.
Do it yourself
Q.1 Find the defuzzified value for the following fuzzy set
Dosage = {(50, 0), (75, 0.4), (100, 0.7), (125, 0.9), (150, 1), (175, 0.8), (200, 0.5), (225, 0.2)}
2. First of Maxima
This method determines the smallest value of the domain with maximum membership value.
An Example:
Consider the following fuzzy set for speed
Speed = {(40, 0),(50, 0.5),(60, 0.9),(70, 1),(75, 1),(80, 0.8),(90, 0.2)}
The maximum membership is 1.0. The lowest speed at which this occurs is 70; hence the defuzzified value is 70.
Do it yourself
Q.1 Find the defuzzified value for the following fuzzy set
Consider the following fuzzy set for volume
Volume = {(1, 0), (3, 0.3), (5, 0.7), (6, 1), (7, 1), (8, 0.8), (10, 0.2)}
3. Last of maxima
This method determines the largest value of the domain with maximum membership value.
An Example:
Consider the following fuzzy set for quality
Quality = {(2, 0.1), (4, 0.5), (6, 0.8), (7, 1), (8, 1), (9, 0.7), (10, 0.3)}
The maximum membership is 1.0. The last quality value at which this occurs is 8 hence the defuzzified value is
8.
Do it yourself
Q.1 Find the defuzzified value for the following fuzzy set
Sweetness = {(1, 0.1), (3, 0.4), (5, 0.8), (7, 1), (8, 1), (9, 0.7), (10, 0.3)}
4. Mean-max Membership
In this method, the defuzzified value is taken as the element with the highest membership value. When
more than one element has the maximum membership value, the mean value of the maxima is taken. Let
A be a fuzzy set with membership function μA(x) defined over x ∈ X, where X is a universe of discourse. The
defuzzified value x* of the fuzzy set is defined as
x* = (Σxi∈M xi) / |M|
Here, M = {xi | μA(xi) is equal to the height of the fuzzy set A} and |M| is the cardinality of the set M.
An Example:
Consider the following fuzzy set for temperature
Temperature = {(18, 0.2), (20, 0.7), (22, 1), (23, 1), (24, 1), (26, 0.6), (28, 0.1)}
The maximum membership is 1.0, occurring at 22°C, 23°C, and 24°C. Mean Calculation is (22 + 23 + 24)/3 =
23 hence the defuzzified value is 23°C.
Do it yourself
Q.1 Find the defuzzified value for the following fuzzy set
Temperature = {(18, 0.3), (20, 0.6), (21, 1), (22, 1), (23, 1), (25, 0.7), (27, 0.2)}
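The three maxima-based defuzzifiers can be sketched together, applied to the temperature example above:

```python
# First-of-maxima, last-of-maxima and mean-max defuzzifiers for a
# discrete fuzzy set, applied to the temperature example in the text.
temp = [(18, 0.2), (20, 0.7), (22, 1.0), (23, 1.0), (24, 1.0), (26, 0.6), (28, 0.1)]

def maxima(fset):
    """Elements at which the membership reaches its peak value."""
    peak = max(mu for _, mu in fset)
    return [x for x, mu in fset if mu == peak]

def first_of_maxima(fset):   # smallest domain value at the peak
    return min(maxima(fset))

def last_of_maxima(fset):    # largest domain value at the peak
    return max(maxima(fset))

def mean_max(fset):          # mean of all domain values at the peak
    m = maxima(fset)
    return sum(m) / len(m)

print(first_of_maxima(temp))  # 22
print(last_of_maxima(temp))   # 24
print(mean_max(temp))         # 23.0
```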
5. Weighted Average Method
The defuzzified value x* is given by
x* = Σ μ(x̄i)·x̄i / Σ μ(x̄i)
Here Σ denotes the algebraic summation and x̄i is the element with the maximum membership function value in the i-th output set.
An Example
Let A be a fuzzy set that tells about a student the elements with corresponding maximum membership values
are also given.
Here, the linguistic variable P represents a Pass student, F stands for a Fair student, G represents a Good
student, VG represents a Very Good student and E for an Excellent student.
The defuzzified value for the fuzzy set A with weighted average method represents a Fair student.
Do it yourself
Q.1 Consider a fuzzy set for Fan Speed = {(1, 0.2), (3, 0.5), (5, 0.8), (7, 0.6), (9, 0.3)} Find the defuzzified value
using weighted average method.
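A minimal sketch of the weighted average computation, using an illustrative fuzzy set (not the exercise above):

```python
# Weighted average defuzzification: sum of mu(x)*x over sum of mu(x).
# The (element, membership) pairs below are an illustrative assumption.
fset = [(2, 0.4), (4, 0.8), (6, 0.6)]

num = sum(m * x for x, m in fset)   # sum of mu(x̄i)·x̄i
den = sum(m for _, m in fset)       # sum of mu(x̄i)
print(round(num / den, 4))          # 4.2222
```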
6. Center of sums
This is one of the most commonly used defuzzification techniques. In this method, the overlapping area is counted
twice: the method employs the algebraic sum of the individual fuzzy subsets instead of their union. The
defuzzified value x* is defined as:
x* = (Σi=1..N xi · Σk=1..n μAk(xi)) / (Σi=1..N Σk=1..n μAk(xi))
Here, n is the number of fuzzy sets, N is the number of fuzzy variables, and μAk(xi) is the membership function for
the k-th fuzzy set. In terms of areas, the defuzzified value x* is defined as:
x* = (Σi=1..k Ai·x̄i) / (Σi=1..k Ai)
Here, Ai represents the firing area of the i-th rule, k is the total number of rules fired, and x̄i represents
the center of area of the i-th clipped output.
An Example: This example is for the case when symmetric shapes are created by the fuzzy set.
Given the following three fuzzy output sets, find the crisp value corresponding to them.
(The areas and centers for the first, second and third fuzzy sets are computed in the figure.)
Another Example: This example is for the case when asymmetric shapes are created by the fuzzy set.
Consider the union of two fuzzy sets
The aggregated fuzzy set of the two fuzzy sets C1 and C2 is shown in the figure above. Let the areas of these two fuzzy
sets be A1 and A2.
Now, the center of area of the fuzzy set C1 is 5 (the exact value is 4.73; 5 is taken only for simplicity of
calculation, since fuzzy methods work on approximation and a slight adjustment does not affect the result much),
which is the value of the variable x̄1, and the center of area of the fuzzy set C2 is 8, which is the value of the
variable x̄2. Now the defuzzified value is
Do it yourself
Q.1 Consider the union of following fuzzy sets
C1 = {(1, 0.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 0.0)}
C2 = {(3, 0.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)}
C3 = {(2, 0.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 0.0)}
Find the defuzzified value for the same using the center of sums method.
Hint: The x coordinate for the centroid of second set C2 is 5. (The Answer is 4.5 approximately)
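A discrete center-of-sums sketch, using two illustrative symmetric triangular sets; note that the memberships are summed, so any overlap is counted twice:

```python
# Discrete center of sums: individual memberships are summed rather
# than combined with max. The two triangular sets are illustrative.
xs = [0, 1, 2, 3, 4, 5, 6]
mu1 = [0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0]   # triangle centred at 2
mu2 = [0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0]   # triangle centred at 4

num = sum(x * (m1 + m2) for x, m1, m2 in zip(xs, mu1, mu2))
den = sum(m1 + m2 for m1, m2 in zip(mu1, mu2))
print(num / den)  # 3.0, midway between the two symmetric triangles
```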
7. Center of Gravity (COG) Method
For a discrete membership function, the defuzzified value, denoted x*, using COG is defined as:
x* = (Σ xi·μ(xi)) / (Σ μ(xi))
Here xi indicates the sample element, μ(xi) is the membership function, and n represents the number of
elements in the sample.
For a continuous membership function, x* is defined as:
x* = (∫ x·μ(x) dx) / (∫ μ(x) dx)
An Example
Find the defuzzified value for the fuzzy set shown in the figure using the center of gravity method.
First, we apply the center of gravity method for the discrete membership function.
The numerator part:
∑xi⋅μ(xi) = (−30)(0.4) + (−15)(0.6) + (15)(0.6) + (30)(0.4) + (75)(0.3) = 22.5
The denominator part:
∑μ(xi) = 0.4 + 0.6 + 0.6 + 0.4 + 0.3 = 2.3
The defuzzified value using the Center of Gravity (COG) method for the discrete membership function is
therefore 22.5/2.3 ≈ 9.78.
Now we have to apply the center of gravity method to find the defuzzified value for continuous membership
function.
We will find the equation of the line for each segment using the two-point form:
μ(x) = μ1 + ((μ2 − μ1)/(x2 − x1))·(x − x1)
From x = -15 to x = 15
From x = 15 to x = 30
From x = 30 to x = 75
From x = 75 to x = 90
From x = -15 to x = 15
From x = 15 to x = 30
From x = 30 to x = 75
From x = 75 to x = 90
After performing the integrations, the defuzzified value using the continuous membership function is
approximately 17.34.
1. Discrete Membership Function:
o When to Use: Use this approach when the membership function is defined by discrete points, and the
intervals between points are small or the function is piecewise constant.
o Advantages: Simpler calculations, especially for small datasets.
o Disadvantages: Less accurate for continuous or smoothly varying membership functions.
2. Continuous Membership Function:
o When to Use: Use this approach when the membership function is continuous or smoothly varying,
and you need higher accuracy.
o Advantages: More accurate for continuous functions, better representation of real-world systems.
o Disadvantages: More complex calculations, especially for piecewise linear functions.
Do it yourself
Q.1 Consider the union of following fuzzy sets
C1 = {(0, 0.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 0.0)}
C2 = {(3, 0.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 0.0)}
Find the defuzzified value for the same using the center of area method for both the discrete membership
function and the continuous membership function.
The Answer is 4.151 approximately for continuous membership function.
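The discrete COG computation shown earlier (points at −30, −15, 15, 30 and 75) can be verified with a short script:

```python
# Verifying the discrete center-of-gravity computation from the text.
pts = [(-30, 0.4), (-15, 0.6), (15, 0.6), (30, 0.4), (75, 0.3)]

num = sum(x * m for x, m in pts)   # numerator: sum of x*mu(x) = 22.5
den = sum(m for _, m in pts)       # denominator: sum of mu(x) = 2.3
print(round(num / den, 2))         # 9.78
```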
8. Bisector of Area Method
This method finds the vertical line that divides the area under the membership function into two equal halves.
An Example
Consider a fuzzy set {(1, 0), (3, 0.5), (7, 0.8), (10, 0)}
Left Triangle (1 to 3)
Base = 3 − 1 = 2, Height = 0.5
Area = ½ × 2 × 0.5 = 0.5
Middle Trapezoid (3 to 7)
Parallel sides = 0.5 (left height) and 0.8 (right height), Width = 7 − 3 = 4
Area = ½ × (0.5 + 0.8) × 4 = 2.6
Right Triangle (7 to 10)
Base = 10 − 7 = 3, Height = 0.8
Area = ½ × 3 × 0.8 = 1.2
Total area = 0.5 + 2.6 + 1.2 = 4.3
The bisector should divide this total area into two equal halves of 2.15 each:
Linear interpolation is a method of estimating an unknown value within a given range based on two known
points. It assumes that the change between two values is linear. Suppose the bisecting area is located
between xp and xq; then
x* = xp + ((A/2 − Ap) / (Aq − Ap)) × (xq − xp)
where:
● A is the total area,
● Ap is the cumulative area up to xp,
● Aq is the cumulative area up to xq.
Do it yourself
Q.1 Consider a fuzzy set {(2, 0), (4, 1), (7, 1) & (9, 0)} Find the defuzzified value using bisectors of the area
method.
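A numeric sketch of the bisector method for a piecewise-linear membership function given as (x, μ) breakpoints; the breakpoints below are the shape from the "Do it yourself" question, so this also checks that exercise:

```python
# Bisector of area via dense sampling: accumulate trapezoidal areas and
# return the first x where the cumulative area reaches half the total.
pts = [(2, 0.0), (4, 1.0), (7, 1.0), (9, 0.0)]

def interp(x, pts):
    """Piecewise-linear membership from (x, mu) breakpoints."""
    for (x0, m0), (x1, m1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)
    return 0.0

def bisector(pts, step=0.001):
    x0, x1 = pts[0][0], pts[-1][0]
    n = int((x1 - x0) / step)
    xs = [x0 + i * step for i in range(n + 1)]
    mus = [interp(x, pts) for x in xs]
    cum = [0.0]                       # cumulative trapezoidal areas
    for i in range(1, len(xs)):
        cum.append(cum[-1] + (mus[i - 1] + mus[i]) / 2 * step)
    half = cum[-1] / 2
    for i, a in enumerate(cum):
        if a >= half:
            return xs[i]
    return x1

print(round(bisector(pts), 2))  # 5.5 splits the area into equal halves
```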
9. Center of Largest Area Method
If the output fuzzy set has two or more convex sub-regions, the defuzzified value is taken as the center of gravity of the sub-region with the largest area:
x* = (∫ab μA(x)·x dx) / (∫ab μA(x) dx)
Where:
● x* is the defuzzified crisp output.
● μA(x) is the membership function of the fuzzy set.
● [a, b] is the range of the largest continuous region with the highest membership value.
An Example
Consider a fuzzy set A = {(10, 0), (20, 1), (30, 0)} and another fuzzy set B = {(25, 0), (30, 1), (40, 1), (45, 0)}.
The union of both these is shown below
We take the union of Set A and Set B, and the resulting fuzzy region looks like:
● From x = 10 to x = 20, the membership comes from Set A.
● From x = 20 to x = 30, the maximum membership dominates (Set A or Set B).
● From x = 30 to x = 40, the membership is 1 (from Set B).
● From x = 40 to x = 45, the membership gradually decreases (from Set B).
The largest continuous high-membership area is identified as the trapezoidal part of Set B,
30 ≤ x ≤ 40. For this largest area (30 ≤ x ≤ 40), where μA(x) = 1:
For the numerator: ∫ from 30 to 40 of x dx = (40² − 30²)/2 = 350
For the denominator: ∫ from 30 to 40 of dx = 40 − 30 = 10
So x* = 350/10 = 35
Do it yourself
Q.1 Consider a fuzzy sets A = {(5, 0), (10, 1), (15, 0)}, B = {(12, 0),(18, 1),(25, 1),(30, 0)} & C = {(28, 0),(35, 1),
(42, 0)}. Find the defuzzified value using the center of the largest area method.
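A grid-based sketch of the center-of-largest-area idea for the union of sets A and B from the example above: sample the union, find the longest contiguous run at maximum membership, and take its centroid:

```python
def pw_linear(x, pts):
    """Piecewise-linear membership from (x, mu) breakpoints."""
    for (x0, m0), (x1, m1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)
    return 0.0

A = [(10, 0.0), (20, 1.0), (30, 0.0)]                # triangular set A
B = [(25, 0.0), (30, 1.0), (40, 1.0), (45, 0.0)]     # trapezoidal set B

step = 0.01
xs = [10 + i * step for i in range(3501)]            # grid over [10, 45]
mu = [max(pw_linear(x, A), pw_linear(x, B)) for x in xs]
peak = max(mu)

# Find contiguous runs at (numerically) full membership.
runs, start = [], None
for i, m in enumerate(mu):
    if m >= peak - 1e-9:
        start = i if start is None else start
    elif start is not None:
        runs.append((start, i - 1))
        start = None
if start is not None:
    runs.append((start, len(mu) - 1))

i0, i1 = max(runs, key=lambda r: r[1] - r[0])        # longest run
center = (xs[i0] + xs[i1]) / 2
print(round(center, 2))  # centroid of the 30..40 plateau: 35.0
```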
Since mathematical operations require numerical values, the solution is mapping non-numerical elements to
numerical values. Here’s how it works:
● "Hot" → 40°C
Once we assign numerical values, we can apply mathematical formulas. Like in the Center of Area (CoA)
method, we compute:
Substituting values:
So the defuzzified output is 30.83°C, which makes sense as a compromise between warm and hot.
Fuzzy Logic Controller
The Fuzzy Logic Controller (FLC) was first developed by Mamdani and Assilian around 1975. Human beings have
the natural quest to know input-output relationships of a process. The behavior of a human being is modeled
artificially for designing a suitable FLC. The performance of an FLC depends on its Knowledge Base (KB),
which consists of both Data Base (DB) (that is, data related to membership function distributions of the
variables of the process to be controlled) as well as Rule Base (RB). However, designing a proper KB of an
FLC is a difficult task, which can be implemented in one of the following ways:
➔ Optimization of the DB only,
➔ Optimization of the RB only,
➔ Optimization of the DB and RB in stages,
➔ Optimization of the DB and RB simultaneously.
The membership function distributions are assumed to be either linear (such as triangular, trapezoidal) or non-linear
(namely Gaussian, bell-shaped, sigmoid). To design and develop a suitable FLC for controlling a process, its
variables need to be expressed in the form of some linguistic terms (such as VN: Very Near, VFR: Very Far, A:
Ahead etc.) and the relationships between input (antecedent) and output (consequent) variables are expressed in
the form of some rules. For example, a rule can be expressed as follows
IF I1 is NR AND I2 is A THEN O is ART.
It is obvious that the number of rules to be present in the rule base will increase, as the number of linguistic
terms used to represent the variables increases (to ensure a better accuracy in prediction). Moreover,
computational complexity of the controller will increase with the number of rules. For an easy implementation in
either the software or hardware, the number of rules present in the RB should be as small as possible.
The working principles of both these approaches are briefly explained below.
An FLC consists of four modules, namely a rule base, an inference engine, fuzzification and de-fuzzification.
Linguistic fuzzy modeling, such as “Mamdani Approach” is characterized by its high interpretability and low
accuracy, whereas the aim of precise fuzzy modeling like “Takagi and Sugeno’s Approach” is to obtain high
accuracy but at the cost of interpretability.
Let us discuss the "Mamdani" approach first. Assume, for simplicity, that only two fuzzy control
rules (out of the many rules present in the rule base) are fired for a set of inputs (s1*, s2*).
Let us try to understand the fuzzy reasoning process via the following diagram
The inputs for the fuzzy variables s1 and s2 are s1* and s2*, respectively. If μA1 and μB1 are the membership functions
for A1 and B1, respectively, then the grades of membership of s1* in A1 and of s2* in B1 are
represented by μA1(s1*) and μB1(s2*), respectively, for rule 1. Similarly, for rule 2, μA2(s1*) and μB2(s2*) are used
to represent the membership function values. Here f’ is the output variable, with output sets C1 and C2
for rules 1 and 2, respectively. The firing strengths of the first and second rules are calculated as follows:
α1 = min(μA1(s1*), μB1(s2*))
α2 = min(μA2(s1*), μB2(s2*))
The firing strengths α1 and α2 clip the output sets C1 and C2, i.e., μ*Ci(f’) = min(αi, μCi(f’)). The membership value of the combined
control action C is given by
μc(f’) = max(μ*c1(f’), μ*c2(f’))
Once the above processing is done then the defuzzification is applied to get the crisp value. The popular
choices for the defuzzification are center of sum, center of area, mean of maxima etc.
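A minimal Mamdani sketch with two hypothetical rules; the triangular membership functions, universes and inputs below are illustrative assumptions, with min for AND, max for aggregation and a discrete centroid for defuzzification:

```python
# A minimal Mamdani sketch with two hypothetical rules of the form
# "IF s1 is A AND s2 is B THEN f' is C". All membership functions and
# universes below are illustrative assumptions, not from the text.
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

A1 = lambda x: tri(x, 0, 2, 4)    # rule 1 antecedents
B1 = lambda x: tri(x, 0, 3, 6)
A2 = lambda x: tri(x, 2, 5, 8)    # rule 2 antecedents
B2 = lambda x: tri(x, 3, 6, 9)
C1 = lambda f: tri(f, 0, 5, 10)   # consequent sets on the f' universe
C2 = lambda f: tri(f, 5, 10, 15)

def mamdani(s1, s2):
    alpha1 = min(A1(s1), B1(s2))  # firing strength of rule 1 (AND = min)
    alpha2 = min(A2(s1), B2(s2))  # firing strength of rule 2
    fs = [i * 0.01 for i in range(1501)]  # discretised f' universe [0, 15]
    # clip each consequent at its firing strength, aggregate with max
    mu = [max(min(alpha1, C1(f)), min(alpha2, C2(f))) for f in fs]
    # defuzzify with the discrete centroid (center of gravity)
    return sum(f * m for f, m in zip(fs, mu)) / sum(mu)

print(round(mamdani(3.0, 4.0), 2))  # a crisp f' between the two consequent peaks
```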
An Example
A typical problem scenario relates to the navigation of a mobile robot in the presence of four moving obstacles.
The directions of movement of the obstacles are shown in the figure.
The obstacle O2 is found to be the most critical one. Our aim is to develop a fuzzy logic-based motion planner
that will be able to generate a collision-free path for the robot. There are two inputs, namely the distance
between the robot and the obstacle (D) and the angle (∠GSO2), for the motion planner, and it will generate one
output, that is, deviation. Distance is represented using four linguistic terms, namely Very Near (VN), Near
(NR), Far (FR) and Very Far (VFR), whereas the input: angle and output: deviation are expressed with the help
of five linguistic terms, such as Ahead (A), Ahead Left (AL), Left (LT), Ahead Right (ART) and Right (RT). The
figure given below shows the DB of the FLC.
The rule base of the fuzzy logic-based motion planner is given in table below-
Determine the output deviation for the set of inputs: distance D = 1.04 m and angle ∠GSO2 = 30 degrees, using
the Mamdani approach. Use different methods of defuzzification.
Solution:
The inputs are: Distance = 1.04 m, Angle = 30 degrees. The distance of 1.04 m may be called either NR (Near)
or FR (Far). Similarly, the input angle of 30 degrees can be declared either A (Ahead) or ART (Ahead Right).
Consider the figure below used to determine the membership value, corresponding to the distance of 1.04 m.
Using the principle of similar triangles, we can write the following relationship:
From the above expression, x is found to be equal to 0.6571. Thus, the distance of 1.04 m may be declared
NR with a membership value of 0.6571, that is, μNR = 0.6571. Similarly, the distance of 1.04 m can also be
called FR with a membership value of 0.3429, that is, μFR = 0.3429. In the same way, an input angle of 30
degrees may be declared either A with a membership value of 0.3333 (that is, μA = 0.3333) or ART with a
membership value of 0.6667 (that is, μART = 0.6667). For the above set of inputs, the following four rules are
being fired from a total of 20:
1. If Distance is NR AND Angle is A Then Deviation is RT
2. If Distance is NR AND Angle is ART Then Deviation is A
3. If Distance is FR AND Angle is A Then Deviation is ART
4. If Distance is FR AND Angle is ART Then Deviation is A
The fuzzified outputs corresponding to above four fired rules are shown in Figure below-
The union of the fuzzified outputs, corresponding to the above four fired rules is shown below-
The above fuzzified output cannot be used as a control action and its crisp value has been determined using
the following methods of defuzzification:
1. Center of Sums Method
The shaded region corresponding to each fired rule is shown in this figure. The values of area and
center of area of the above shaded regions are also given in this figure, in a tabular form. The crisp
output U of above four fired rules can be calculated as follows-
Therefore, the robot should deviate by 19.5809 degrees towards the right with respect to the line joining the
present position of the robot and the goal to avoid collision with the obstacle.
2. Centroid method:
The shaded region representing the combined output corresponding to above four fired rules has been divided
into a number of regular sub-regions, whose area and center of area can easily be determined. For example,
the shaded region of Figure has been divided into four sub-regions (two triangles and two rectangles). The
values related to area and center of area of the above four sub-regions are shown in this figure, in a tabular
form. In this method, the crisp output U can be obtained like the following
Where A = 9.7151 × (−25.2860) + 20.2788 × 0.0 + 2.3588 × 20.2870 + 24.8540 × 52.7153 and B = 9.7151 +
20.2788 + 2.3588 + 24.8540. Therefore, U turns out to be equal to 19.4450. Thus, the robot should deviate by
19.4450 degrees towards the right with respect to the line joining the present position of the robot and the goal
to avoid collision with the obstacle.
3. Mean of Maxima
It is observed that the maximum value of membership (that is, 0.6571) occurs over a range of deviation
angles from −15.4305 to 15.4305 degrees. Thus, its mean comes out to be equal to 0.0. Therefore,
the crisp output U of the controller becomes equal to 0.0, that is, U = 0.0. The robot will move along the line
joining its present position and the goal.
Now let us talk about "Takagi and Sugeno’s Approach". Here, a rule is composed of a fuzzy antecedent and a
functional consequent part. Thus, a rule (say the i-th) can be represented as follows:
If x1 is A1^i and x2 is A2^i and … and xn is An^i, then y^i = a0^i + a1^i·x1 + … + an^i·xn
where a0^i, a1^i, …, an^i are the coefficients. It is to be noted that i does not represent a power; it is
only a superscript. In this way, a nonlinear system is considered to be a combination of several linear systems.
The weight of the i-th rule can be determined for a set of inputs (x1, x2, …, xn) as follows:
w^i = μA1^i(x1) × μA2^i(x2) × … × μAn^i(xn)
where A1^i, A2^i, …, An^i indicate the membership function distributions of the linguistic terms used to represent the
input variables and μ denotes the membership function value. Thus, the combined control action can be
determined as follows:
y = (Σi w^i·y^i) / (Σi w^i)
A Numerical Example:
A fuzzy logic-based expert system is to be developed that will work based on Takagi and Sugeno’s approach
to predict the output of a process. The DB of the FLC is shown in Figure below-
As there are two inputs, I1 and I2, and each input is represented using three linguistic terms (for example, LW,
M, H for I1 and NR, FR, VFR for I2), there is a maximum of 3 × 3 = 9 feasible rules. The output of the i-th rule, that
is, y^i (i = 1, 2, …, 9), is expressed as follows:
y^i = aj^i·I1 + bk^i·I2
y^i = a_j^i I1 + b_k^i I2
where j, k = 1, 2, 3; a_1^i = 1, a_2^i = 2 and a_3^i = 3 if I1 is found to be LW, M and H, respectively; b_1^i = 1,
b_2^i = 2 and b_3^i = 3 if I2 is seen to be NR, FR and VFR, respectively. Calculate the output of the FLC for the
inputs: I1 = 6.0, I2 = 2.2.
Solution
The inputs are: I1 = 6.0 & I2 = 2.2. The input I1 of 6.0 units can be called either LW (Low) or M (Medium).
Similarly, the second input I2 of 2.2 units may be declared either FR (Far) or VFR (Very Far). Figure below
shows a schematic view used to determine the membership value corresponding to the first input I1 = 6.0.
Using the principle of similar triangle, we can write the following relationship:
From the above expression, x is coming out to be equal to 0.8. Thus, the input I1 = 6.0 may be called LW with a
membership value of 0.8, that is, μLW = 0.8. Similarly, the same input I1 = 6.0 may also be called M with a
membership value of 0.2, that is, μM = 0.2. In the same way, the input I2 = 2.2 may be declared either FR with a
membership value of 0.8 (that is, μFR = 0.8) or VFR with a membership value of 0.2 (that is, μVFR = 0.2). For the
above set of inputs, the following four combinations of input variables are being fired from a total of nine.
1. I1 is LW and I2 is FR,
2. I1 is LW and I2 is VFR,
3. I1 is M and I2 is FR,
4. I1 is M and I2 is VFR.
Now, the weights: w1, w2, w3 and w4 of the first, second, third and fourth combination of fired input variables,
respectively, have been calculated as follows:
w1 = μLW × μFR = 0.8 × 0.8 = 0.64
w2 = μLW × μVFR = 0.8 × 0.2 = 0.16
w3 = μM × μFR = 0.2 × 0.8 = 0.16
w4 = μM × μVFR = 0.2 × 0.2 = 0.04
The functional consequent values: y1, y2, y3 and y4 of the first, second, third and fourth combination of fired
input variables can be determined like the following:
y1 = I1 + 2I2 = 6.0 + 2 × 2.2 = 10.4
y2 = I1 + 3I2 = 6.0 + 3 × 2.2 = 12.6
y3 = 2I1 + 2I2 = 2× 6.0 + 2 × 2.2 = 16.4
y4 = 2I1 + 3I2 = 2× 6.0 + 3 × 2.2 = 18.6
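The worked example stops after the y values; completing it, the crisp output is the weighted average of the functional consequents. Since the four weights sum to 1.0, the output is 0.64 × 10.4 + 0.16 × 12.6 + 0.16 × 16.4 + 0.04 × 18.6 = 12.04, which the short check below reproduces:

```python
# Takagi-Sugeno crisp output: weighted average of the rule consequents
weights = [0.64, 0.16, 0.16, 0.04]   # w1..w4 of the four fired rules
outputs = [10.4, 12.6, 16.4, 18.6]   # y1..y4 of the four fired rules

# y = sum(w_i * y_i) / sum(w_i)
y = sum(w * o for w, o in zip(weights, outputs)) / sum(weights)
print(round(y, 2))  # 12.04
```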
Let us see the difference between the Mamdani FIS and the Sugeno FIS.
Mamdani FIS | Sugeno FIS
Output membership function is present. | No output membership function is present.
The output surface is discontinuous. | The output surface is continuous.
Distributes the output by combining the outputs with the rule strengths. | No distribution of the output, only a mathematical combination of the outputs and the rule strengths.
The crisp result is obtained through defuzzification of the rule consequents. | No defuzzification here; the crisp result is obtained using a weighted average of the rule consequents.
Expressive power and interpretable rule consequents. | There is a loss of interpretability.
Mamdani FIS possesses less flexibility in the system design. | Sugeno FIS possesses more flexibility in the system design.
It has more accuracy in security evaluation of block cipher algorithms. | It has less accuracy in security evaluation of block cipher algorithms.
It is used in MISO (Multiple Input and Single Output) and MIMO (Multiple Input and Multiple Output) systems. | It is used only in MISO (Multiple Input and Single Output) systems.
The Mamdani inference system is well suited to human input. | The Sugeno inference system is well suited to mathematical analysis.
Application: Medical Diagnosis System. | Application: to keep track of the change in aircraft performance with altitude.
Artificial Neuron
Consider the schematic view of an artificial neuron, in which a biological neuron has been modeled artificially.
Let us suppose that there are n inputs (such as I1, I2, . . . , In) to a neuron j. The weights connecting n number of
inputs to jth neuron are represented by [W] = [W1j, W2j, ..., Wnj]. The function of summing junctions of an artificial
neuron is to collect the weighted inputs and sum them up. Thus, it is similar to the function of combined
dendrites and soma. The activation function (also known as the transfer function) performs the task of the axon
and synapse. The output of the summing junction may sometimes become equal to zero; to prevent such a
situation, a bias of fixed value bj is added to it. Thus, the input to the transfer function f is determined as

net_j = b_j + Σ (i = 1 to n) I_i W_ij

The output of the summing junction is also called the linear combiner output / induced local field / net input /
pre-activation value. The output of the jth neuron, that is, Oj, can be obtained as follows:

O_j = f(net_j)
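The summing junction and activation can be sketched in a few lines; the input, weight, and bias values below are hypothetical, and tanh stands in for an arbitrary transfer function:

```python
import math

def neuron_output(inputs, weights, bias, f=math.tanh):
    """Summing junction plus activation: O_j = f(b_j + sum_i I_i * W_ij)."""
    net = bias + sum(i * w for i, w in zip(inputs, weights))
    return f(net)

# Hypothetical three-input neuron j with bias 0.1
print(neuron_output([0.5, 0.2, 0.1], [0.3, 0.7, 0.4], bias=0.1))
```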
Do it yourself
Q.1 For the network shown in Figure below, calculate the net input to the output neuron.
Q.2 For the network shown in Figure below, calculate the net input to the output neuron.
Difference between Artificial Neural Network (ANN) and Biological Neural Network (BNN):
Artificial Neural Network (ANN) | Biological Neural Network (BNN)
Processing speed is fast; the cycle time for execution is in nanoseconds. | Slow in processing information; the cycle time for execution is in milliseconds.
It can perform massive parallel operations simultaneously, like the BNN. | It can perform massive parallel operations simultaneously.
Size and complexity depend on the chosen application, but it is less complex than the BNN. | The size and complexity of the BNN are greater than those of the ANN, with about 10^11 neurons and 10^15 interconnections.
To store new information, old information is deleted if there is a shortage of storage. | Any new information is stored in the interconnections, and the old information is stored with lesser strength.
There is no fault tolerance in the ANN; corrupted information cannot be processed. | It has fault-tolerance capability; it can store and retrieve information even if an interconnection is disconnected.
The control unit processes the information. | The chemicals present in the neurons do the processing.
Threshold is a set value based upon which the final output of the network may be calculated. The threshold
value is used in the activation function. A comparison is made between the calculated net input and the
threshold to obtain the network output. For each and every application, there is a threshold limit. Consider a
direct current (DC) motor. If its maximum speed is 1500 rpm then the threshold based on the speed is 1500
rpm. If the motor is run at a speed higher than its set threshold, it may damage the motor coils. Similarly, in
neural networks, the activation functions are defined based on the threshold value and the output is calculated.
An activation function using a threshold can be defined as

f(net) = 1 if net ≥ θ, −1 if net < θ

where θ is the fixed threshold value.
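This bipolar threshold activation can be sketched directly:

```python
def threshold_activation(net, theta):
    """Bipolar threshold: +1 when the net input reaches theta, -1 otherwise."""
    return 1 if net >= theta else -1

print(threshold_activation(0.7, theta=0.5))   # 1
print(threshold_activation(0.3, theta=0.5))   # -1
```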
Hidden layers | Output layer
Extract and transform intermediate features from the input data; the learned features are often abstract and hard to interpret. | Produces the final prediction (e.g., class probabilities, regression values), which is human-readable (e.g., class labels, scalars).
Introduce non-linearity (via activation functions) to model complex relationships; typically use non-linear activations like ReLU or Tanh. | Maps the learned features to the target format using task-specific activations, such as sigmoid for binary classification and softmax for multi-class classification.
In case of error, they propagate the error backward. | In case of error, it computes the initial loss gradient.
v. Multilayer recurrent network
A processing element output can be directed back to the nodes in a preceding layer, forming a
multilayer recurrent network. Also, in these networks, a processing element output can be directed back
to the processing element itself and to other processing elements in the same layer.
Note: Maxnet is a type of neural network used for competitive learning, specifically to determine the maximum
activation among a set of neurons. It is commonly used in winner-take-all (WTA) networks.
The training or learning rules adopted for updating and adjusting the connection weights
The main property of an ANN is its capability to learn. Learning or training is a process by means of which a
neural network adapts itself to a stimulus by making proper parameter adjustments, resulting in the production
of desired response. Broadly, there are two kinds of learning in ANNs:
1. Parameter learning: It updates the connecting weights in a neural net.
2. Structure learning: It focuses on the change in network structure (which includes the number of
processing elements as well as their connection types).
The above two types of learning can be performed simultaneously or separately. Apart from these two
categories of learning, the learning in an ANN can be generally classified into three categories as: supervised
learning; unsupervised learning & reinforcement learning.
1. Supervised Learning
Each input vector requires a corresponding target vector, which represents the desired output. The
input vector along with the target vector is called a training pair. The network here is informed precisely
about what should be emitted as output.
During training, the input vector is presented to the network, which results in an output vector. This
output vector is the actual output vector. Then the actual output vector is compared with the desired
(target) output vector. If there exists a difference between the two output vectors then an error signal is
generated by the network. This error signal is used for adjustment of weights until the actual output
matches the desired (target) output. In this type of training, a supervisor or teacher is required for error
minimization. Hence, the network trained by this method is said to be using supervised training
methodology. In supervised learning, it is assumed that the correct "target" output values are known for
each input pattern.
Key Features:
Requires labelled training data.
Uses loss functions to measure prediction accuracy.
Common algorithms: Neural Networks, Support Vector Machines, Decision Trees
Scenario: You want to predict the price of a house based on its size (in square feet) and other features
like the number of bedrooms, location, and age of the house.
Input Features:
● Size of the house (square feet)
● Number of bedrooms
● Location
● Age of the house
Output: House price (a continuous value)
A regression model, such as Linear Regression, can be used to predict the house price. The model
learns the relationship between the input features and the house price during training. For example, it
might learn that larger houses with more bedrooms in desirable locations tend to have higher prices.
Equation: In simple linear regression, the relationship can be represented as y = β0 + β1x, where y is the
predicted price, x is an input feature such as size, β0 is the intercept, and β1 is the slope learned during training.
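A minimal sketch of fitting such a line by closed-form least squares, on made-up (size, price) pairs:

```python
# Least-squares fit of price = b0 + b1 * size on hypothetical data
sizes  = [1000, 1500, 2000, 2500]   # square feet (made-up data)
prices = [200, 275, 350, 425]       # price in $1000s (made-up data)

n = len(sizes)
mx = sum(sizes) / n
my = sum(prices) / n

# slope = covariance(x, y) / variance(x); intercept from the means
b1_num = sum((x - mx) * (y - my) for x, y in zip(sizes, prices))
b1_den = sum((x - mx) ** 2 for x in sizes)
b1 = b1_num / b1_den
b0 = my - b1 * mx

print(b0, b1)            # intercept and slope
print(b0 + b1 * 1800)    # predicted price for an 1800 sq-ft house
```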
Scenario: You want to classify animals into different categories based on their features, such as the
number of legs, type of skin covering, and whether they can fly.
Input Features:
● Number of legs
● Type of skin covering (e.g., fur, feathers, scales)
● Ability to fly (yes/no)
Output: Animal category (e.g., mammal, bird, reptile, amphibian)
A classification model, such as Logistic Regression, Decision Trees, or Support Vector Machines, can
be used to classify the animals. The model learns the relationship between the input features and the
animal category during training. For example, it might learn that animals with feathers and the ability to
fly are likely to be birds.
Decision Boundary: The model creates a decision boundary that separates the different classes. For
instance, it might determine that if an animal has feathers and can fly, it should be classified as a bird.
2. Un-supervised Learning
The input vectors of similar type are grouped without the use of training data to specify how a member
of each group looks or to which group a member belongs. In the training process, the network receives
the input patterns and organizes these patterns to form clusters. When a new input pattern is applied,
the neural network gives an output response indicating the class to which the input pattern belongs. If
for an input, a pattern class cannot be found then a new class is generated.
It is clear that there is no feedback from the environment to inform what the outputs should be or
whether the outputs are correct. In this case, the network must itself discover patterns, regularities,
features, or categories from the input data, and the relations of the input data to the output. While
discovering all these features, the network undergoes change in its parameters. This process is called
self-organizing in which exact clusters will be formed by discovering similarities and dissimilarities
among the objects.
Example: Clustering, Anomaly detection
The two popular learning algorithms are self organizing maps (SOMs) and k-means Clustering.
Topological preservation refers to the ability of the Kohonen Self-Organizing Map (KSOM) to
maintain the spatial relationships between input data points when mapping them onto a lower-
dimensional space (typically a 1D or 2D grid).
● Similar input vectors should be mapped to neighboring neurons in the output map.
● The network should retain the structure of the input data after training.
To depict this, a typical network structure where each component of the input vector x is connected to
each of the nodes is shown in figure below
On the other hand, if the input vector is two-dimensional, the inputs, say x(a, b), can arrange
themselves in a two-dimensional array defining the input space (a, b) as in Figure below; Here, the two
layers are fully connected
The architecture consists of two layers: input layer and output layer (cluster). There are “n” units in the
input layer and “m” units in the output layer. Basically, here the winner unit is identified by using either
dot product or Euclidean distance method and the weight update using Kohonen learning rules is
performed over the winning cluster unit. At the time of self-organization, the weight vector of the cluster
unit which matches the input pattern very closely is chosen as the winner unit. The closeness of the
weight vector of the cluster unit to the input pattern may be based on the square of the minimum
Euclidean distance. The weights are updated for the winning unit and its neighboring units.
The steps involved in the training algorithm are as shown below.
Step-1: Initialize the weights wij: Random values may be assumed. They can be chosen as the same
range of values as the components of the input vector. If information related to distribution of clusters is
known, the initial weights can be taken to reflect that prior knowledge. Initialize the learning rate α: It
should be a slowly decreasing function of time.
Step 2: Perform Steps 3–8 when the stopping condition is false.
Step-3: Take the sample training input vector x from the input layer.
Step 4: Compute the square of the Euclidean distance for each cluster unit, i.e., for each j = 1 to m,

D(j) = Σ (i = 1 to n) (x_i − w_ij)²

Find the winning unit index J, so that D(J) is minimum. (In Steps 3; dot product method can also be
used to find the winner, which is basically the calculation of net input, and the winner will be the one
with the
largest dot product.)
Step-5: For all units j within a specific neighborhood of J and for all i, calculate the new weights:
wij(new) = wij(old) + α[xi − wij(old)]
or
wij(new) = (1 - α)wij(old) + αxi
Step-6: Repeat steps 3–5 until the update in the weights is negligible.
Step-7: Update the learning rate using the formula α(t +1)= 0.5α(t).
Step 8: Test for stopping condition of the network.
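Steps 4 and 5 of the algorithm can be sketched in a few lines; the weight vectors and input below are hypothetical:

```python
def som_step(x, weights, alpha):
    """One Kohonen update: find the winner by squared Euclidean distance,
    then pull the winner's weight vector toward the input."""
    # Step 4: D(j) = sum_i (x_i - w_ij)^2 for every cluster unit j
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    winner = dists.index(min(dists))
    # Step 5: w(new) = w(old) + alpha * [x - w(old)] for the winning unit
    weights[winner] = [wi + alpha * (xi - wi)
                       for xi, wi in zip(x, weights[winner])]
    return winner, weights

# Hypothetical 2-cluster net with 4 input units, learning rate 0.5
w = [[0.2, 0.6, 0.5, 0.9], [0.8, 0.4, 0.7, 0.3]]
winner, w = som_step([0.0, 0.0, 1.0, 1.0], w, alpha=0.5)
print(winner, w[winner])
```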
An Example
Construct a Kohonen self-organizing map to cluster the four given vectors, [0 0 1 1], [1 0 0 0], [0 1 1 0]
and [0 0 0 1]. The number of clusters to be formed is two. Assume an initial learning rate of 0.5.
Do it yourself
Consider a Kohonen self-organizing net with two cluster units and five input units. The weight vectors
for the cluster units are given by
w1 = [1.0 0.9 0.7 0.5 0.3]
w2 = [0.3 0.5 0.7 0.9 1.0]
Use the square of the Euclidean distance to find the winning cluster unit for the input pattern x = [0.0 0.5
1.0 0.5 0.0]. Using a learning rate of 0.25, find the new weights for the winning unit.
Applications of SOMs
● Data Clustering: Identifying patterns in customer behavior, genetics, and more.
● Anomaly Detection: Detecting fraud or unusual patterns in financial transactions.
● Feature Extraction: Reducing data dimensions for visualization and analysis.
● Image Recognition: Organizing images based on similarities.
K-means clustering
Clustering is the task of grouping similar data points together based on their features.
K-means clustering is an iterative algorithm that divides an unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group of similar properties. The goal is to
maximize intra-cluster similarity and minimize inter-cluster similarity. Intra-cluster similarity means
elements in the same cluster should be close to one another, i.e., the Euclidean distance between them
should be as small as possible; inter-cluster similarity means the Euclidean distance between the
centroids of two clusters should be as large as possible, i.e., there should be no common element between two clusters.
The number of clusters is represented using letter K. This algorithm discovers patterns without prior
knowledge of groups i.e. it falls under the category of unsupervised learning.
1. Choose the Number of Clusters: Decide the value of K.
2. Initialize Centroids: Randomly select K data points as the initial centroids.
3. Assign Clusters:
Calculate Euclidean distance between each point and centroids.
Assign each point to the nearest centroid.
4. Update Centroids: Recompute centroids as the mean of all points in the cluster.
5. Repeat step 3 & 4: Reassign points and update centroids until convergence (no further
changes).
The step of computing the centroid and assigning all the points to the cluster based on their distance
from the centroid is a single iteration. There are essentially three stopping criteria that can be adopted
to stop the K-means algorithm:
1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. Maximum number of iterations is reached
We have to understand the effect of choosing the value of K. Before that, let us understand the meaning
of inertia, which is the sum of squared distances of points to their centroid (it measures cluster
compactness).
● If the value of k is too small, the clusters will be big, which results in high inertia; it will give us poor
insights because distinct groups will be merged into the same cluster.
● If the value of k is too large, the clusters will be small, which results in low inertia; but natural groups
will be split into fragmented clusters, making the result hard to interpret.
The impact of increasing the value of K can be understood like stretching a rubber band: Initial effort
(low k) yields big changes; later effort (high k) barely stretches it further.
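Steps 3–5 and the first stopping criterion can be sketched as follows; the small customer sample at the end is illustrative only:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: repeatedly assign each point to its nearest
    centroid (step 3), then move each centroid to the mean of its
    assigned points (step 4), until the centroids stop changing."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # squared Euclidean distance to every centroid
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # recompute each centroid; keep the old one if its cluster is empty
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
               for cl, c in zip(clusters, centroids)]
        if new == centroids:          # stopping criterion 1: no change
            break
        centroids = new
    return centroids, clusters

# Illustrative data: (spending, visits) pairs for six customers
customers = [(5, 2), (10, 4), (8, 3), (50, 15), (55, 18), (60, 20)]
centroids, clusters = kmeans(customers, k=2)
print(centroids)
```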
An Example
Dataset: 12 Customers with Annual Spending ($1000) and Visits/Year
+-------------------------------------------+
| Customer | Spending ($1000) | Visits/Year |
|----------|------------------|-------------|
| 1 | 5 | 2 |
| 2 | 10 | 4 |
| 3 | 8 | 3 |
| 4 | 50 | 15 |
| 5 | 55 | 18 |
| 6 | 60 | 20 |
| 7 | 100 | 40 |
| 8 | 95 | 35 |
| 9 | 110 | 45 |
| 10 | 120 | 50 |
| 11 | 4 | 1 |
| 12 | 6 | 2 |
+-------------------------------------------+
Iteration-2
For step-03: Assign Clusters; Compute Euclidean distance (squared) for all points:
Clusters remain unchanged, so the algorithm converges.
3. Reinforcement Learning
Reinforcement learning is a form of supervised learning because the network receives some feedback
from its environment. However, the feedback obtained here is only evaluative and not instructive. The
external reinforcement signals are processed in the critic signal generator, and the obtained critic
signals are sent to the ANN for adjustment of weights properly so as to get better critic feedback in
future. The critic signal is like a reward or a penalty. The reinforcement learning is also called learning
with a critic as opposed to learning with a teacher, which indicates supervised learning.
Key Features:
● Trial-and-error learning.
● Uses rewards and penalties as feedback.
A network is generally trained using either an incremental (also known as a sequential) or a batch mode, the
principles of which are discussed below.
Let us consider the incremental training of an NN using a number of scenarios (say 20), sent one after
another. There is a chance that the optimal network obtained after passing the 20-th training scenario
will be too different from that obtained after using the 1-st training scenario.
Note: It is important to mention that incremental training is easier to implement and computationally
faster than the batch mode of training.
1. Identity function: The output here remains the same as the input. The input layer uses the identity
activation function, defined as f(x) = x for all x.
2. Binary step function: f(x) = 1 if net ≥ θ, 0 if net < θ, where θ represents the threshold value. This
function is most widely used in single-layer nets to convert the net input to a binary output (1 or 0).
3. Bipolar step function: f(x) = 1 if net ≥ θ, −1 if net < θ, where θ represents the threshold value. This
function is also used in single-layer nets to convert the net input to a bipolar output (+1 or −1).
4. Sigmoidal functions: The sigmoidal functions are widely used in back-propagation nets because of
the relationship between the value of the functions at a point and the value of the derivative at that point
which reduces the computational burden during training. Sigmoidal functions are of two types:
a. Binary sigmoid function: It is also termed as logistic sigmoid function or unipolar sigmoid function.
It can be defined as
f(x) = 1 / (1 + e^(−λx))

where λ is the steepness parameter. For the standard binary sigmoid, λ = 1. If more than one input is
available to the neuron, then use the summed output value for x. Here the range of the sigmoid
function is from 0 to 1.
Its derivative can be expressed as f′(x) = λ f(x)[1 − f(x)]. This derivative is important in neural networks because it is used during backpropagation, which is the process of updating the weights to minimize the error.
Do it yourself
Q.2 Assume that λ = 1 and x = 0.53; compute f(x) and f′(x).
[ Answer: f(0.53) ≈ 0.629 & f′(0.53) ≈ 0.233 ]
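A handy property of the binary sigmoid is that its derivative can be computed from the function value itself, f′(x) = λ f(x)[1 − f(x)]; the sketch below checks the Q.2 numbers:

```python
import math

def binary_sigmoid(x, lam=1.0):
    """Binary (logistic) sigmoid: f(x) = 1 / (1 + e^(-lam * x))."""
    return 1.0 / (1.0 + math.exp(-lam * x))

def binary_sigmoid_deriv(x, lam=1.0):
    """f'(x) = lam * f(x) * (1 - f(x)), reusing the function value."""
    fx = binary_sigmoid(x, lam)
    return lam * fx * (1.0 - fx)

print(round(binary_sigmoid(0.53), 3))        # ~0.629
print(round(binary_sigmoid_deriv(0.53), 3))  # ~0.233
```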
b. Bipolar sigmoid function: It can be defined as f(x) = (1 − e^(−λx)) / (1 + e^(−λx)), where λ is the
steepness parameter. For the standard bipolar sigmoid, λ = 1. If more than one input is available to the
neuron, then use the summed output value for x. Here the sigmoid function range is between −1 and +1.
Do it yourself
Q.1 Obtain the output of the neuron Y for the network shown in Figure below using bipolar
sigmoidal activation function.
The bipolar sigmoidal function is closely related to the hyperbolic tangent function, which is written as

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
If the network uses binary data, it is better to convert it to bipolar form and use the bipolar sigmoidal
activation function or hyperbolic tangent function.
Ramp function: f(x) = 1 if x > 1; x if 0 ≤ x ≤ 1; 0 if x < 0.
Other than those discussed above, we also have the following types of neural networks.
The input layer is where the image data is fed into the network. Each pixel in the image is represented as a
value, and these values form the input to the network.
The convolutional layer to be constructed possesses "m" filters of size r × r × q, where "r" tends to be smaller than the dimension of the image.
An Example
A 5×5 grayscale image might have pixel values like:
Each value represents intensity (0 = black, 255 = white). Suppose we apply a filter over the pixel matrix, say a
3×3 edge-detection kernel.
We take the top-left 3×3 region from the image and apply the kernel. Extracted 3×3 Region from Image
Second Convolution Operation (Next 3×3 region, sliding right) Now, move the filter one step to the right. New
3×3 Region from Image
Continuing for Other Regions; Following the same process for the entire 3×3 sliding process, we fill up the
feature map. The Final Feature Map output will be
Continuing the example, we apply max-pooling over every 2×2 region with stride = 1 (the stride specifies how
far the filter slides at each step).
Region 1 (top-left 2×2): max value = -6
and so on; the final max-pooled output is
Since all values were the same, the result remains unchanged.
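Because the actual pixel and kernel matrices appear only in the figures, here is a minimal sketch of the same sliding-window convolution and 2×2 max-pooling, with assumed values (a vertical edge in a 5×5 image and a 3×3 edge-detection kernel):

```python
def conv2d_valid(img, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    sum the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(img[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(fm, size=2, stride=1):
    """Take the maximum of each size x size window of the feature map."""
    return [[max(fm[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fm[0]) - size + 1, stride)]
            for i in range(0, len(fm) - size + 1, stride)]

# Assumed 5x5 image with a vertical edge and a 3x3 edge-detection kernel
image = [[10, 10, 10, 0, 0]] * 5
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
feature_map = conv2d_valid(image, kernel)   # 3x3 feature map
pooled = max_pool(feature_map)              # 2x2 max-pooled output
print(feature_map)
print(pooled)
```

The kernel responds strongly (value 30) wherever its window straddles the intensity edge and not at all (value 0) on the flat region.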
After several convolutional and pooling layers, the final output is flattened into a single vector and passed
through one or more fully connected layers. These layers are similar to those in a regular neural network and
are used to combine the features extracted by the previous layers to make final predictions, such as classifying
the image into different categories.
The output layer produces the final output of the network, such as the class scores for classification tasks. The
number of neurons in this layer corresponds to the number of classes the network is trying to predict.
What makes CNN suitable for image processing tasks over ANN?
Convolutional Neural Networks (CNNs) are highly effective for image processing due to their ability to
automatically learn spatial hierarchies of features. Here’s why they work so well:
1. Unlike traditional Artificial Neural Networks (ANNs), CNNs do not require manually extracted features.
CNN’s convolutional layers automatically detect edges, textures, patterns, and complex structures
without human intervention.
2. CNNs use pooling layers (e.g., max pooling) to reduce spatial dimensions while keeping the most
important features. This makes CNNs robust to position changes (i.e., an object can be anywhere in the
image, and CNN can still detect it).
3. Instead of fully connecting each pixel (like ANN), CNNs use small filters (kernels) that slide over the
image. This reduces the number of parameters, making CNNs computationally efficient. A 100×100
image with ANN requires 10,000 neurons, but CNN just needs a few filters to process it.
4. CNNs learn directly from raw pixel data and adjust filters automatically using backpropagation. They do
not require handcrafted features, making them highly adaptable.
In an RNN, information is fed back into the system after each step. Think of it like reading a sentence: when
you're trying to predict the next word, you don't just look at the current word but also need to remember the
words that came before to make an accurate guess. RNNs allow the network to “remember” past information by feeding
the output from one step into the next step. This helps the network understand the context of what has already
happened and make better predictions based on that. For example when predicting the next word in a
sentence the RNN uses the previous words to help decide what word is most likely to come next.
The fundamental processing unit in RNN is a Recurrent Unit. Recurrent units hold a hidden state that
maintains information about previous inputs in a sequence. Recurrent units can “remember” information from
prior steps by feeding back their hidden state, allowing them to capture dependencies across time. RNN
unfolding or unrolling is the process of expanding the recurrent structure over time steps. During unfolding
each step of the sequence is represented as a separate layer in a series illustrating how information flows
across each time step.
This unrolling enables “backpropagation through time (BPTT)” which is a learning process where errors are
propagated across time steps to adjust the network’s weights enhancing the RNN’s ability to learn
dependencies within sequential data. RNNs share similarities in input and output structures with other deep
learning architectures but differ significantly in how information flows from input to output. Unlike traditional
deep neural networks, where each dense layer has distinct weight matrices, RNNs use shared weights across
time steps, allowing them to remember information over sequences.
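A single recurrent unit with shared weights can be sketched as follows; the scalar weights are hypothetical (real RNNs use weight matrices), but the structure h_t = tanh(w_x·x_t + w_h·h_{t−1} + b) is the same at every time step:

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent unit with a scalar hidden state:
    h_t = tanh(w_x * x_t + w_h * h_prev + b).
    The same weights are reused (shared) at every time step."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

# Unroll over a short sequence; the hidden state carries context forward
h = 0.0
for x in [1.0, 0.5, -0.5]:
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0)
print(h)
```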
Application areas include Natural Language Processing, Time Series Prediction, Music Generation, and more.
● The discriminator learns to distinguish the generator's fake data from real data. The discriminator
penalizes the generator for producing implausible (i.e., unconvincing fake) results.
The discriminator data comes from two sources: Real data instances, such as real pictures of people.
The discriminator uses these instances as positive examples during training. Fake data instances
created by the generator. The discriminator uses these instances as negative examples during training.
During discriminator training the generator does not train. Its weights remain constant while it produces
examples for the discriminator to train on. During discriminator training:
1. The discriminator classifies both real data and fake data from the generator.
2. The discriminator loss penalizes the discriminator for misclassifying a real instance as fake or a
fake instance as real.
3. The discriminator updates its weights through backpropagation from the discriminator loss
through the discriminator network.
When training begins, the generator produces obviously fake data, and the discriminator quickly learns to tell
that it's fake:
As training progresses, the generator gets closer to producing output that can fool the discriminator:
Finally, if generator training goes well, the discriminator gets worse at telling the difference between real and
fake. It starts to classify fake data as real, and its accuracy decreases.
A GAN can have two loss functions: one for generator training and one for discriminator training. Among
multiple implementations, the common loss function is the minimax loss. The generator tries to minimize the
following function while the discriminator tries to maximize it:

min_G max_D V(D, G) = E_x[log D(x)] + E_z[log(1 − D(G(z)))]
In this function:
❖ D(x) is the discriminator's estimate of the probability that real data instance x is real.
❖ Ex is the expected value over all real data instances.
❖ G(z) is the generator's output when given noise z.
❖ D(G(z)) is the discriminator's estimate of the probability that a fake instance is real.
❖ Ez is the expected value over all random inputs to the generator (in effect, the expected value over all
generated fake instances G(z)).
The generator can't directly affect the log(D(x)) term in the function, so, for the generator, minimizing the loss is
equivalent to minimizing log(1 - D(G(z))).
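The minimax value can be evaluated numerically; the discriminator outputs below are hypothetical, chosen only to show that a confident discriminator attains a higher (less negative) value than a fooled one:

```python
import math

def minimax_value(d_real, d_fake):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], averaged over
    small batches of discriminator output probabilities."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical outputs early in training: D is confident and spots the fakes
confident = minimax_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# Later, the generator fools D more often, dragging the value down
fooled = minimax_value(d_real=[0.6, 0.55], d_fake=[0.5, 0.45])
print(confident, fooled)
```

The discriminator, as the maximizing player, prefers the first situation; the generator, as the minimizing player, pushes toward the second.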
The architecture for the radial basis function network (RBFN) is here-
Architecture of RBF
The architecture consists of two layers whose output nodes form a linear combination of the kernel (or basis)
functions computed by means of the RBF nodes or hidden layer nodes. The basis function (nonlinearity) in the
hidden layer produces a significant nonzero response to the input stimulus it has received only when the input
of it falls within a small localized region of the input space. This network can also be called a localized
receptive field network.
The training algorithm describes in detail all the calculations involved in the training process depicted in the
flowchart. The training is started in the hidden layer with an unsupervised learning algorithm. The training is
continued in the output layer with a supervised learning algorithm. Simultaneously, we can apply supervised
learning algorithms to the hidden and output layers for fine-tuning of the network. The training algorithm is
given as follows.
Step 0: Set the weights to small random values.
Step 1: Perform Steps 2-8 when the stopping condition is false.
Step 2: Perform Steps 3-7 for each input.
Step 3: Each input unit (xi for all i = 1 to n) receives input signals and transmits to the next hidden layer unit.
Step 4: Calculate the radial basis function.
Step 5: Select the centers for the radial basis function. The centers are selected from the set of input vectors. It
should be noted that a sufficient number of centers have to be selected to ensure adequate sampling of the
input vector space.
Step 6: Calculate the output from the hidden layer unit (a Gaussian basis function):

v_i(x) = exp(−‖x − c_i‖² / (2σ_i²))

where
x: input vector (e.g., xj1, xj2, …, xjn)
c_i: center of the i-th RBF unit
σ_i: width (spread) of the i-th RBF unit
‖x − c_i‖: Euclidean distance between x and c_i
Step 7: Calculate the output of the network unit as a weighted sum of the hidden-layer responses:

y = w_0 + Σ (i = 1 to m) w_im v_i(x)

where
m: the number of hidden layer nodes (RBF functions)
w_im: weight connecting the i-th hidden unit to the m-th output node
w_0: bias term (optional)
Step 8: Calculate the error and test for the stopping condition. The stopping condition may be a fixed number of
epochs or the weight change becoming sufficiently small.
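The hidden-layer response and the output combination can be sketched as follows; the centers, widths, and weights below are hypothetical:

```python
import math

def rbf_output(x, center, sigma):
    """Gaussian basis: phi_i(x) = exp(-||x - c_i||^2 / (2 * sigma_i^2)).
    The response is near 1 close to the center and decays toward 0."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist_sq / (2.0 * sigma ** 2))

def rbfn_predict(x, centers, sigmas, weights, bias=0.0):
    """Output node: a linear combination of the hidden-layer responses."""
    return bias + sum(w * rbf_output(x, c, s)
                      for w, c, s in zip(weights, centers, sigmas))

# Hypothetical 2-input network with two RBF units
centers = [(0.0, 0.0), (1.0, 1.0)]
sigmas  = [0.5, 0.5]
weights = [1.0, -1.0]
print(rbfn_predict((0.1, 0.1), centers, sigmas, weights))
```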
Applications of RBFN:
RBFNs are primarily used for classification tasks, but they can also be applied to regression and function
approximation problems. Some common application areas include:
1. Pattern Recognition: RBFNs are effective in recognizing patterns in data, making them useful in
image and speech recognition.
2. Time Series Prediction: They can be used to predict future values in a time series based on past data.
3. Control Systems: RBFNs are used in adaptive control systems to model and control dynamic systems.
4. Medical Diagnosis: They can assist in diagnosing diseases by classifying medical data.
Classification functions
Self-Organizing Maps: clustering, data visualization; unsupervised learning, topology-based.
The inputs from x1 to xn possess excitatory weighted connections and inputs from xn+1 to xn+m possess inhibitory
weighted interconnections. Since the firing of the output neuron is based upon the threshold, the activation
function here is defined as
For inhibition to be absolute, the threshold of the activation function should satisfy the following condition:
θ > nw − P
In the above equation, P refers to the total contribution from all inhibitory inputs. The above equation applies when all the inhibitory inputs are active, i.e., in the case of weak absolute inhibition.
For strong absolute inhibition, i.e., when only one inhibitory input is active, the equation should be modified as
θ > nw − Pmin
Here Pmin refers to the minimum contribution from the inhibitory inputs (e.g., the weight of a single inhibitory input).
Do not get confused with the firing condition, which is that the neuron fires if its net input, nw − P, is greater than or equal to the threshold θ.
The output will fire if it receives, say, “k” or more excitatory inputs but no inhibitory inputs, where
kw ≥ θ > (k − 1)w
If the neuron receives k excitatory inputs, the net input (kw) will be greater than or equal to the threshold, causing the neuron to fire; with only k − 1 excitatory inputs, the net input stays below the threshold.
The M–P neuron has no particular training algorithm. An analysis has to be performed to determine the values
of the weights and the threshold. Here the weights of the neuron are set along with the threshold to make the
neuron perform a simple logic function. The M-P neurons are used as building blocks on which we can model
any function or phenomenon, which can be represented as a logic function.
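As a sketch of such an analysis, here is an M–P neuron with one hand-chosen set of weights and thresholds realizing the AND and ANDNOT logic functions (these particular values are one valid choice among several, picked for illustration; they are not prescribed by the notes).

```python
def mp_neuron(inputs, weights, theta):
    """McCulloch-Pitts neuron: fires (outputs 1) iff the net input reaches the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= theta else 0

# AND: two excitatory inputs of weight 1; both must be on, so set theta = 2
AND = lambda x1, x2: mp_neuron([x1, x2], [1, 1], theta=2)

# ANDNOT: x2 acts through an inhibitory weight of -1; set theta = 1,
# so the neuron fires only when x1 = 1 and x2 = 0
ANDNOT = lambda x1, x2: mp_neuron([x1, x2], [1, -1], theta=1)
```

XOR, being non-linearly separable, cannot be realized by a single such unit; it needs a combination of M–P neurons, which is the point of the third exercise below.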
Do it yourself
Q.1 Implement AND function using McCulloch–Pitts neuron (take binary data).
Q.2 Implement ANDNOT function using McCulloch–Pitts neuron (use binary data representation). In the case
of the ANDNOT function, the response is true if the first input is true and the second input is false. For all other
input variations, the response is false.
Q.3 Implement XOR function using McCulloch–Pitts neuron (use binary data representation).
Hebb Network
Donald Hebb stated in 1949
“When an axon of cell A is near enough to excite cell B, and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”
According to the Hebb rule, the weight vector is found to increase proportionately to the product of the input
and the learning signal which is equal to the neuron’s output. In Hebb learning, if two interconnected neurons
are ‘on’ simultaneously then the weights associated with these neurons can be increased by the modification
made in their synaptic gap (strength). The weight update in the Hebb rule is given by wi(new) = wi(old) + xi·y.
The Hebb rule is more suited for bipolar data than binary data.
Flowchart of Training Algorithm
❖ Step 0: First initialize the weights. Basically in this
network they may be set to zero, i.e., wi = 0 for i = 1
to “n” where n may be the total number of input
neurons.
❖ Step 1: Steps 2–4 have to be performed for each
input training vector and target output pair, s : t.
❖ Step 2: Input unit activations are set. Generally, the
activation function of the input layer is the identity function:
xi = si for i = 1 to n.
❖ Step 3: The output unit activation is set: y = t.
❖ Step 4: Weight adjustments and bias adjustments
are performed:
wi(new) = wi(old) + xiy
b(new)= b(old) + y
wi: weight of the ith connection between input and output neuron
η: Learning rate (equal to 1 in the update above)
x: Input value from the input neuron
y: Output value from the output neuron
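The training steps above can be sketched in a few lines. The bipolar AND data below is a common illustration (an assumption, not taken from the notes), and the rule is applied with η = 1, so the update reduces to wi(new) = wi(old) + xi·y.

```python
def hebb_train(samples, n_inputs):
    """Hebb rule training: w_i(new) = w_i(old) + x_i * y, b(new) = b(old) + y."""
    w = [0.0] * n_inputs       # Step 0: weights start at zero
    b = 0.0
    for x, t in samples:       # Steps 1-2: one pass over the training pairs s : t
        y = t                  # Step 3: output unit is clamped to the target
        for i in range(n_inputs):
            w[i] += x[i] * y   # Step 4: weight adjustment
        b += y                 # Step 4: bias adjustment
    return w, b

# Bipolar AND: target is +1 only for the input (+1, +1)
and_data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w, b = hebb_train(and_data, 2)
```

After one pass over the four pairs, the weights grow to w = [2, 2] with b = −2, which is the classic textbook result for the bipolar AND function; this also illustrates why the rule suits bipolar data better than binary data, where zero inputs would leave weights unchanged.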
The Hebbian learning rule can thus be summarized as Δwi = η·xi·y.
Perceptron Networks
Let us understand the linear separability first with an example. Imagine you have a table with a bunch of fruits:
apples and oranges. Your task is to separate the apples from the oranges using a straight stick (like a ruler).
Summarization: Linear separability means you can draw a straight line (or a flat plane in higher dimensions)
to separate two groups of things (like apples and oranges). If you cannot draw such a straight line, the data is
not linearly separable.
The perceptron is the simplest form of a neural network used for the classification of patterns said to be linearly
separable (i.e., patterns that lie on opposite sides of a hyperplane). Basically, it consists of a single neuron with
adjustable synaptic weights and bias.
Rosenblatt proved that if the patterns (vectors) used to train the perceptron are drawn from two linearly
separable classes, then the perceptron algorithm converges (i.e., it eventually finds a solution) and positions the
decision surface in the form of a hyperplane between the two classes. The proof of convergence of the
algorithm is known as the perceptron convergence theorem.
The perceptron built around a single neuron is limited to performing pattern classification with only two classes
(hypotheses). By expanding the output (computation) layer of the perceptron to include more than one neuron,
we may correspondingly perform classification with more than two classes.
The goal of the perceptron is to correctly classify the set of externally applied stimuli (i.e. input data) x1, x2 ... xm
into one of two classes C1 and C2. The decision rule for the classification is to assign the point represented by
the inputs x1, x2, ..., xm to class C1 if the perceptron output y is +1 and to class C2 if it is -1.
The synaptic weights of the perceptron are denoted by w1, w2, ..., wm. Correspondingly, the inputs applied to the
perceptron are denoted by x1, x2, ..., xm. The externally applied bias is denoted by b. From the model, we find
that the hard limiter input, or induced local field, of the neuron is v = w1x1 + w2x2 + … + wmxm + b.
To develop insight into the behavior of a pattern classifier, it is customary to plot a map of the decision regions
in the m-dimensional signal space spanned by the m input variables x1, x2, ..., xm. In the simplest form of the
perceptron, there are two decision regions separated by a hyperplane, which is defined by w1x1 + w2x2 + … + wmxm + b = 0.
Take a look at the figure for the case of two input variables x1 and x2, for which the decision boundary takes the
form of a straight line.
A point (x1, x2) that lies above the boundary line is assigned to class C1, and a point (x1, x2) that lies below the
boundary line is assigned to class C2. Note also that the effect of the bias b is merely to shift the decision
boundary away from the origin. The synaptic weights w1, w2, ..., wm of the perceptron can be adapted on an
iteration-by-iteration basis.
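A minimal sketch of this iteration-by-iteration adaptation is the classic error-driven perceptron rule, where weights change only on misclassified points. The AND data (a linearly separable problem) and the learning rate below are illustrative assumptions, not values from the notes.

```python
def perceptron_train(samples, alpha=1.0, max_epochs=100):
    """Perceptron learning rule with bipolar targets: +1 for class C1, -1 for class C2.

    The output is the hard-limited sign of w.x + b; weights are adjusted only
    when a training point is misclassified.
    """
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
            if y != t:                                        # update only on mistakes
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                b += alpha * t
                errors += 1
        if errors == 0:                                       # every point classified correctly
            return w, b
    return w, b

# AND is linearly separable, so the algorithm converges, as the
# perceptron convergence theorem guarantees
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = perceptron_train(and_data)
```

On a non-linearly-separable set such as XOR, the same loop would never reach an error-free epoch, which is the failure mode discussed later in the XOR section.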
For the perceptron to function properly, the two classes C1 and C2 must be linearly separable. This, in turn,
means that the patterns to be classified must be sufficiently separated from each other to ensure that the
decision surface consists of a hyperplane. This requirement is illustrated in the figure below for the case of a
two-dimensional perceptron. In part (a) of the figure, the two classes C1 and C2 are sufficiently separated from
each other for us to draw a hyperplane (in this case, a straight line) as the decision boundary. If, however, the
two classes C1 and C2 are allowed to move too close to each other, as in part (b) of the figure, they become
nonlinearly separable, a situation that is beyond the computing capability of the perceptron.
1. Cost Function
The Mean Squared Error (MSE) cost function is
MSE = (1/n) Σ (yi − ŷi)²
Where:
yi : Actual value
^y i : Predicted value
n : Number of data points.
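Using the variables listed above (actual values yi, predicted values ŷi, and n data points), the MSE can be computed directly; the data values below are illustrative, not from the notes.

```python
def mse(ys, preds):
    """Mean Squared Error: the average of the squared differences
    between actual values and predicted values."""
    n = len(ys)
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / n

# Illustrative actual vs. predicted values; each prediction is off by 0.5
error = mse([2, 4, 6, 8], [2.5, 3.5, 6.5, 7.5])
```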
2. Gradient Vector
A partial derivative measures how a function changes when you vary only one variable, while keeping all
other variables constant.
Consider f(x, y) = x² + 3y. Partial derivative with respect to x: the derivative of x² with respect to x is 2x, and since 3y is treated as a constant, its derivative is 0, so ∂f/∂x = 2x.
Partial derivative with respect to y:
The term x² is treated as a constant, so its derivative is 0. The derivative of 3y with respect to y is 3, so ∂f/∂y = 3.
The gradient vector is simply a vector of partial derivatives and points in the direction of the steepest ascent.
The gradient vector (denoted as ∇f, pronounced “nabla f”) is formed by collecting all partial derivatives: ∇f = (∂f/∂x, ∂f/∂y) = (2x, 3).
The function changes most rapidly in the direction of (2x,3). If we move in the direction of this gradient, the
function f(x,y) increases fastest.
This represents the rate of change of the cost function with respect to w i.
3. Chain Rule
The chain rule helps us find the derivative of a function that is composed of two or more functions. In simple
terms, it tells us how to take the derivative of a "function inside a function."
● If y=f(g(x)), then f is the "outer function," and g is the "inner function."
● The chain rule helps us find the derivative of y with respect to x.
In words:
● Take the derivative of the outer function (f) with respect to the inner function (g).
● Multiply it by the derivative of the inner function (g) with respect to x.
An Example
Let’s say:
Here
● The outer function is f(g) = g2
● The inner function is g(x) = 3x + 2.
Substitute g = 3x + 2
So, the derivative of y is dy/dx = 2(3x + 2) · 3 = 6(3x + 2) = 18x + 12.
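Combining the outer derivative 2(3x + 2) with the inner derivative 3 gives dy/dx = 6(3x + 2), which can be checked numerically against a finite-difference approximation:

```python
def y(x):
    g = 3 * x + 2          # inner function g(x) = 3x + 2
    return g ** 2          # outer function f(g) = g^2

def dy_dx(x):
    # Chain rule: f'(g) * g'(x) = 2*(3x + 2) * 3 = 6*(3x + 2)
    return 6 * (3 * x + 2)

# Numerical check with a central finite difference at an arbitrary point
x0, h = 1.5, 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)
```

At x0 = 1.5 the analytic derivative is 6 · 6.5 = 39, and the finite difference agrees to high precision.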
Note:
In Gradient Descent, we use the chain rule to compute the gradient of the cost function. For example:
● The cost function J(w) depends on the predicted value ^y .
● The predicted value ^y depends on the weight w.
We use the chain rule
Here
4. Learning Rate (α)
The learning rate (α) affects the convergence of the ANN. It controls the size of the steps taken during
parameter updates. The range of α is typically from 0 to 1.
Step-1: Initialize Weights: Start with random values for the weights (wi), bias (b) and learning rate (α).
Step-2: Compute Gradient: Calculate the gradient of the cost function with respect to each weight and bias:
Step-3: Update Weights: Adjust the weights in the opposite direction of the gradient:
Step-4: Repeat: Repeat steps 2 and 3 until one of the stopping criteria is met
● Maximum number of iterations is reached.
● The step size becomes smaller than a predefined tolerance.
An Example
Input: House sizes (x) in square feet
Output: House prices (y) in thousands of dollars.
Model: Linear regression model ^y = wx + b, where: w is weight (slope) and b is bias (intercept)
Goal: Use Gradient Descent to find the optimal values of w and b that minimize the Mean Squared Error (MSE)
cost function.
Solution
Step-1: Let us initialize w = 0 and b = 0 and α = 0.1
Step-2 & 3:
Iteration-01
Compute predicted output using formula ^y = wx + b
ŷ1 = 0·1 + 0 = 0  ŷ2 = 0·2 + 0 = 0  ŷ3 = 0·3 + 0 = 0  ŷ4 = 0·4 + 0 = 0
Update Parameters:
Iteration-02
Compute predicted output using formula ^y = wx + b
ŷ1 = 1.5·1 + 0.5 = 2  ŷ2 = 1.5·2 + 0.5 = 3.5  ŷ3 = 1.5·3 + 0.5 = 5  ŷ4 = 1.5·4 + 0.5 = 6.5
Update Parameters:
Iteration-03
Compute predicted output using formula ^y = wx + b
ŷ1 = 1.75·1 + 0.575 = 2.325  ŷ2 = 1.75·2 + 0.575 = 4.075
ŷ3 = 1.75·3 + 0.575 = 5.825  ŷ4 = 1.75·4 + 0.575 = 7.575
Update Parameters:
● Continue iterating until the changes in w and b become very small (e.g. <0.001).
● After several iterations, w and b will converge to their optimal values. For this example, they will
approach w = 2, b = 0 & the final model will be ^y =2 x
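The full loop can be reproduced in a few lines. The dataset (x = [1, 2, 3, 4], y = [2, 4, 6, 8]) and the cost J = (1/2n)·Σ(y − ŷ)² are assumptions chosen to be consistent with the worked iterations above: with them the first update yields w = 1.5 and b = 0.5, the second yields w = 1.75 and b = 0.575, and the loop converges toward w = 2, b = 0.

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Batch gradient descent for y_hat = w*x + b with cost J = (1/2n) * sum((y - y_hat)^2)."""
    w, b, n = 0.0, 0.0, len(xs)                             # Step 1: initialize parameters
    for _ in range(iters):
        preds = [w * x + b for x in xs]                     # forward predictions
        errs = [y - p for y, p in zip(ys, preds)]
        grad_w = -sum(x * e for x, e in zip(xs, errs)) / n  # Step 2: dJ/dw
        grad_b = -sum(errs) / n                             # Step 2: dJ/db
        w -= alpha * grad_w                                 # Step 3: move against the gradient
        b -= alpha * grad_b
    return w, b

# Hypothetical house-size/price data consistent with the iterations: price = 2 * size
xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]
w, b = gradient_descent(xs, ys)
```

Running a single iteration reproduces the first hand-computed update exactly, and after enough iterations the parameters settle at the optimal w ≈ 2, b ≈ 0.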
● After the first four iterations (where you’ve used all four data points), you simply start over from the first
data point and continue the process. This is called cycling through the dataset.
Do it Yourself
Redo the same question adjusting the weight only, not the bias.
● Optimal: balanced steps that converge efficiently to the minimum.
Advantages: 1. Fast and stable convergence. 2. Efficient use of computational resources.
Disadvantage: 1. Requires tuning to find the right value.
● Too High: large steps that may overshoot the minimum.
Advantage: 1. Faster initial progress.
Disadvantages: 1. Oscillations around the minimum. 2. Risk of divergence (moving away from the minimum).
Regression: 1 neuron (linear activation)
Multi-Class Classification: n neurons
An MLP is a fully connected feedforward neural network, meaning that each neuron in one layer is connected
to every neuron in the next layer. It uses activation functions such as Sigmoid or Tanh to introduce
non-linearity, enabling it to learn complex patterns in data.
● Feature Learning
Lower Layers: Detect simple patterns, such as edges, textures, or basic shapes.
Deeper Layers: Detect abstract concepts, such as objects or high-level features.
● Non-Linearity: Introducing non-linearity using activation functions, which allows the model to solve
complex problems. Without non-linearity, an MLP would be equivalent to a linear model, incapable of
solving complex problems.
● Representation Learning: Hidden layers transform raw input data into meaningful representations that
make it easier for the output layer to perform classification or regression.
● Capturing Relationships: Hidden layers can capture complex relationships between input features
that are not easily separable in lower-dimensional space.
Applications of MLP
● Classification: Image classification, spam detection, sentiment analysis.
● Regression: Predicting house prices, stock prices, or temperature.
● Pattern Recognition: Handwriting recognition, speech recognition.
● Function Approximation: Approximating complex mathematical functions.
BACK-PROPAGATION NETWORK
A back-propagation neural network is a multilayer, feed-forward neural network consisting of an input layer, a
hidden layer and an output layer. The neurons present in the hidden and output layers have biases, which are
the connections from the units whose activation is always 1. The bias terms also act as weights.
The figure above shows the architecture of a BPN, depicting only the direction of information flow for the
feed-forward phase. During the back-propagation phase of learning, signals are sent in the reverse direction.
The inputs are sent to the BPN and the output obtained from the net could be either binary (0, 1) or bipolar (–1,
+1). The activation function could be any function which increases monotonically and is also differentiable.
δk: Error correction weight adjustment for Wjk that is due to an error at output unit yk, which is back-propagated
to the hidden units that feed into unit yk
δj: Error correction weight adjustment for vij that is due to the back-propagation of error to the hidden unit zj.
Also, it should be noted that the commonly used activation functions are binary sigmoidal and bipolar sigmoidal
activation functions. The range of binary sigmoid is from 0 to 1, and for bipolar sigmoid it is from –1 to +1.
These functions are used in the BPN because of the following characteristics
(i) continuity
(ii) differentiability
(iii) nondecreasing monotony
The error back-propagation learning algorithm can be outlined in the following algorithm:
Step 0: Initialize weights and learning rate (take some small random values).
Step 1: Perform Steps 2–9 when stopping condition is false.
Step 2: Perform Steps 3–8 for each training pair.
Step 3: Each input unit receives input signal xi and transmits it to the hidden units.
Step 4: Each hidden unit zj (j = 1 to p) sums its weighted input signals to obtain the net input zinj. Calculate the output of the hidden unit by applying its activation function over zinj (binary or bipolar sigmoidal
activation function), zj = f(zinj),
and send the output signal from the hidden unit to the input of the output layer units.
Step 5: For each output unit yk (k = 1 to m), calculate the net input:
Step 6: Each output unit computes its error correction term, δk = (tk − yk) f′(yink). On the basis of the calculated error correction term, update the change in weights and bias: ΔWjk = α δk zj and ΔW0k = α δk.
Step 7: Each hidden unit (zj, j = 1 to p) sums its delta inputs from the output units: δinj = Σk δk Wjk.
The term δinj gets multiplied with the derivative of f(zinj) to calculate the error term: δj = δinj f′(zinj).
On the basis of the calculated δj, update the change in weights and bias: Δvij = α δj xi and Δv0j = α δj.
Step 8: Each output unit (yk, k = 1 to m) updates its weights and bias, Wjk(new) = Wjk(old) + ΔWjk, and each hidden unit (zj, j = 1 to p) updates its weights and bias, vij(new) = vij(old) + Δvij.
Step 9: Check for the stopping condition. The stopping condition may be a certain number of epochs reached
or when the actual output equals the target output.
The above algorithm uses the incremental approach for updating the weights, i.e., the weights are changed
immediately after a training pattern is presented (this works like online training). There is another way of
training, called batch-mode training, where the weights are changed only after all the training patterns are
presented. Batch-mode training requires additional local storage for each connection to maintain the
immediate weight changes.
The training of a BPN is based on the choice of various parameters. Also, the convergence of the BPN is
based on some important learning factors such as the initial weights, the learning rate, the updation rule, the
size and nature of the training set, and the architecture (number of layers and number of neurons per layer).
An Example
Using a back-propagation network, find the new weights for the net shown in the figure below. It is presented
with the input pattern [0, 1] and the target output is 1. Use a learning rate α = 0.25 and the binary sigmoidal
activation function.
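The figure with the initial weights is not reproduced in these notes, so the sketch below performs one feed-forward plus back-propagation pass for a 2-2-1 network with hypothetical starting weights (an assumption for illustration); the δk, δj, and weight-change formulas follow the algorithm above with the binary sigmoid.

```python
import math

def sigmoid(z):
    # Binary sigmoid; its derivative is f(z) * (1 - f(z))
    return 1.0 / (1.0 + math.exp(-z))

def bpn_one_step(x, t, v, b_hidden, w, b_out, alpha=0.25):
    """One feed-forward plus back-propagation pass for a 2-2-1 network."""
    # Feed-forward: hidden layer, then output layer
    z_in = [v[0][j] * x[0] + v[1][j] * x[1] + b_hidden[j] for j in range(2)]
    z = [sigmoid(s) for s in z_in]
    y_in = w[0] * z[0] + w[1] * z[1] + b_out
    y = sigmoid(y_in)
    # Output error term: delta_k = (t - y) * f'(y_in)
    delta_k = (t - y) * y * (1 - y)
    # Hidden error terms: delta_inj = delta_k * w[j], multiplied by f'(z_inj)
    delta_j = [delta_k * w[j] * z[j] * (1 - z[j]) for j in range(2)]
    # Weight and bias updates
    w_new = [w[j] + alpha * delta_k * z[j] for j in range(2)]
    b_out_new = b_out + alpha * delta_k
    v_new = [[v[i][j] + alpha * delta_j[j] * x[i] for j in range(2)] for i in range(2)]
    b_hidden_new = [b_hidden[j] + alpha * delta_j[j] for j in range(2)]
    return w_new, b_out_new, v_new, b_hidden_new, y

# Hypothetical starting weights (NOT the figure's values, which are not shown here)
v = [[0.6, -0.3], [-0.1, 0.4]]   # v[i][j]: input i -> hidden j
b_hidden = [0.3, 0.5]
w = [0.4, 0.1]                   # hidden j -> output
b_out = -0.2
w_new, b_out_new, v_new, b_hidden_new, y = bpn_one_step([0, 1], 1, v, b_hidden, w, b_out)
```

Because the target 1 exceeds the sigmoid output y, the error term δk is positive and all hidden-to-output weights move upward, which is the qualitative behaviour to check when doing the exercise by hand.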
Do it yourself
Find the new weights, using a back-propagation network, for the network shown in the figure below. The network
is presented with the input pattern [−1, 1] and the target output is +1. Use a learning rate of α = 0.25 and the
bipolar sigmoidal activation function.
● Model type: simple linear models (e.g., linear regression) vs. complex models (e.g., neural networks).
● Gradient computation: approximates the gradient, because it is computed using a single data point, vs. computes the exact gradient using the chain rule over all data points.
● Learning type: online learning (real-time adaptation) vs. batch or mini-batch learning.
● Use cases: online learning, real-time systems vs. deep learning, multi-layer networks.
XOR Problem
In Rosenblatt’s single-layer perceptron, there are no hidden neurons. Consequently, it cannot classify input patterns that
are not linearly separable. However, nonlinearly separable patterns commonly occur. For example, this situation arises in
the exclusive-OR (XOR) problem, which may be viewed as a special case of a more general problem, namely, that of
classifying points in the unit hypercube. (An n-dimensional hypercube has 2^n vertices; here the space is two-dimensional,
so it has 4 vertices. The "unit" in unit hypercube means that each dimension is constrained to values between 0 and 1,
and since we have binary inputs here, the values will be exactly 0 and 1.)
However, in the special case of the XOR problem, we need to consider only the four corners of a unit square that correspond
to the input patterns (0,0), (0,1), (1,1), and (1,0), where a single bit (i.e., binary digit) changes as we move from one corner
to the next.
Where ⨁ denotes the exclusive-OR boolean function operator. The input patterns (0,0) and (1,1) are at opposite corners
of the unit square, yet they produce the identical output 0. On the other hand, the input patterns (0,1) and (1,0) are also at
opposite corners of the square, but they are in class 1, as shown by
1⨁0=1
and
0⨁1=1
We first recognize that the use of a single neuron with two inputs results in a straight line for a decision boundary in the
input space. For all points on one side of this line, the neuron outputs 1; for all points on the other side of the line, it
outputs 0. The position and orientation of the line in the input space are determined by the synaptic weights of the neuron
connected to the input nodes and the bias applied to the neuron. With the input patterns (0,0) and (1,1) located on opposite
corners of the unit square, and likewise for the other two input patterns (0,1) and (1,0), it is clear that we cannot construct
a straight line for a decision boundary so that (0,0) and (1,1) lie in one decision region and (0,1) and (1,0) lie in the other
decision region. In other words, the single-layer perceptron cannot solve the XOR problem.
However, we may solve the XOR problem by using a single hidden layer with two neurons (as in figure below along with
its diagram of signal flow)
Architectural graph of network for solving the XOR problem Signal-flow graph of the network
o w11: Weight from x1 to Neuron 1.
o w12: Weight from x1 to Neuron 2.
o w21: Weight from x2 to Neuron 1.
o w22: Weight from x2 to Neuron 2.
● bi: Bias term for neuron i.
The slope of the decision boundary constructed by this hidden neuron is equal to -1. Here is the calculation
For an input pattern (x1,x2), the weighted sum z1 for Neuron 1 is calculated as:
z1=w11⋅x1 + w21⋅x2 + b1
The decision boundary is the line where the neuron's output transitions from 0 to 1. This occurs when the
weighted sum z1 is exactly 0:
z1 = 0 ⟹ x1 + x2 − 1.5=0
x2 = -x1 + 1.5
This is the equation of the decision boundary line for Neuron 1. It has:
● A slope of −1 (the coefficient of x1 is −1)
● An intercept of 1.5 on the x2-axis (when x1 = 0, x2 = 1.5)
The decision boundary is a straight line that passes through the points (0, 1.5) and (1.5, 0), positioned as shown in the figure.
The slope of the decision boundary constructed by this hidden neuron is equal to -1. Here is the calculation
For an input pattern (x1,x2), the weighted sum z2 for Neuron 2 is calculated as:
z2=w12⋅x1 + w22⋅x2 + b2
The decision boundary is the line where the neuron's output transitions from 0 to 1. This occurs when the
weighted sum z2 is exactly 0:
z2 = 0 ⟹ x1 + x2 − 0.5=0
x2 = -x1 + 0.5
This is the equation of the decision boundary line for Neuron 2. It has:
● A slope of −1 (the coefficient of x1 is −1)
● An intercept of 0.5 on the x2-axis (when x1 = 0, x2 = 0.5)
The orientation and position of the decision boundary constructed by this second hidden neuron are as shown in the figure.
Say the output from Neuron 1 is a1 and the output from Neuron 2 is a2; the net input to the output neuron (Neuron 3) is then
z3 = −2·a1 + 1·a2 − 0.5
The function of the output neuron is to construct a linear combination of the decision boundaries formed by the
two hidden neurons. The result of this computation is as follows.
The activation function for the neuron is assumed to be a step function, which outputs 1 if the weighted sum of
the inputs is greater than or equal to 0, and 0 otherwise.
Input: (0,0)
Neuron 1: z₁ = 1⋅0 + 1⋅0 - 1.5 = -1.5 < 0 ⟹ a₁ = 0
Neuron 2: z₂ = 1⋅0 + 1⋅0 - 0.5 = -0.5 < 0 ⟹ a₂ = 0
Output Neuron (Neuron 3): z₃ = (-2)⋅0 + 1⋅0 - 0.5 = -0.5 < 0 ⟹ Output = 0. Matches XOR: 0 ⊕ 0 = 0
Input: (0,1)
Neuron 1: z₁ = 1⋅0 + 1⋅1 - 1.5 = -0.5 < 0 ⟹ a1 = 0
Neuron 2: z₂ = 1⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ a₂ = 1
Output Neuron (Neuron 3): z3 = (-2)⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ Output = 1. Matches XOR: 0 ⊕ 1 = 1
Input: (1,0)
Neuron 1: z₁ = 1⋅1 + 1⋅0 - 1.5 = -0.5 < 0 ⟹ a1 = 0
Neuron 2: z2 = 1⋅1 + 1⋅0 - 0.5 = +0.5 ≥ 0 ⟹ a2 = 1
Output Neuron (Neuron 3): z3 = (-2)⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ Output = 1. Matches XOR: 1 ⊕ 0 = 1
Input: (1,1)
Neuron 1: z1 = 1⋅1 + 1⋅1 - 1.5 = +0.5 ≥ 0 ⟹ a1 = 1
Neuron 2: z2 = 1⋅1 + 1⋅1 - 0.5 = +1.5 ≥ 0 ⟹ a2 = 1
Output Neuron (Neuron 3): z₃ = (-2)⋅1 + 1⋅1 - 0.5 = -1.5 < 0 ⟹ Output = 0. Matches XOR: 1 ⊕ 1 = 0
The bottom hidden neuron has an excitatory (positive) connection to the output neuron, whereas the top
hidden neuron has an inhibitory (negative) connection to the output neuron. When both hidden neurons are off,
which occurs when the input pattern is (0,0), the output neuron remains off. When both hidden neurons are on,
which occurs when the input pattern is (1,1), the output neuron is switched off again because the inhibitory
effect of the larger negative weight connected to the top hidden neuron overpowers the excitatory effect of the
positive weight connected to the bottom hidden neuron. When the top hidden neuron is off and the bottom
hidden neuron is on, which occurs when the input pattern is (0,1) or (1,0), the output neuron is switched on
because of the excitatory effect of the positive weight connected to the bottom hidden neuron.
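The whole construction above can be verified in a few lines, using the step (hard-limit) activation and the weights from the hand calculations (w11 = w21 = w12 = w22 = 1, b1 = −1.5, b2 = −0.5, and output weights −2 and +1 with bias −0.5):

```python
def step(z):
    # Hard-limit activation: fires (1) when the weighted sum is >= 0
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    a1 = step(1 * x1 + 1 * x2 - 1.5)     # hidden neuron 1: boundary x1 + x2 = 1.5
    a2 = step(1 * x1 + 1 * x2 - 0.5)     # hidden neuron 2: boundary x1 + x2 = 0.5
    return step(-2 * a1 + 1 * a2 - 0.5)  # output neuron combines the two boundaries

# Evaluate all four corners of the unit square
truth_table = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```

The resulting truth table matches XOR exactly, confirming that the inhibitory −2 weight from the top hidden neuron switches the output off precisely when both hidden neurons fire.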
● Deep learning models can incorporate time-series data to predict acute cardiac events
2. Cancer Detection and Classification
● Convolutional Neural Networks (CNNs) identify malignant patterns in imaging data
● Particularly successful in breast cancer detection from mammograms and skin cancer
identification from dermatological images
● Research shows some AI systems matching or exceeding dermatologist accuracy in melanoma
detection
3. Diabetes Risk Assessment
● ANNs predict diabetes onset by analyzing blood glucose patterns, BMI, age, and other
biomarkers
● Recurrent Neural Networks (RNNs) can track changes over time to predict progression from
pre-diabetes to diabetes
Case Study: Diabetic Retinopathy Detection
Google's DeepMind developed a system using CNNs to identify diabetic retinopathy from retinal scans.
The system achieved over 90% accuracy, comparable to human ophthalmologists, potentially allowing
earlier intervention in areas with limited specialist access.
● Neural networks optimize for risk-adjusted returns across various market conditions
● Can incorporate multiple objectives like volatility minimization and return maximization
JPMorgan developed the LOXM (Limit Order Execution) system using deep learning to execute equity trades
at optimal prices. The system analyzes market conditions and historical patterns to minimize market impact
while achieving best execution prices, outperforming human traders in many scenarios.
● Retinal scan analysis for diabetic retinopathy and other eye conditions
3. Autonomous Vehicles
● Object detection and classification (pedestrians, vehicles, road signs)
2. Transcription Services
● Real-time meeting transcription
1. Manufacturing Automation
● Visual inspection and quality control
● Anomaly detection in assembly lines
3. Agriculture
● Autonomous harvesting robots
4. Autonomous Vehicles
● Self-driving cars and trucks
NVIDIA has developed Isaac Sim, a robotics simulation platform that uses neural networks to generate
synthetic training data. This enables sim-to-real transfer learning, where robots train in virtual environments
before deploying skills in the physical world.
2. Text Classification
● Sentiment analysis for product reviews and social media monitoring
● Topic classification for news articles and documents
● Spam detection and content moderation
● Intent recognition for conversational systems
3. Question Answering
● Extractive QA systems locate answers within reference documents
● Generative QA systems formulate original answers based on knowledge
● Domain-specific systems for customer support and information retrieval
● Open-domain QA for general knowledge questions
OpenAI's GPT models (and subsequently similar models like Claude) demonstrated that neural networks
trained on massive text corpora can generate coherent, contextually appropriate text across diverse topics.
These models showcase emergent abilities including complex reasoning, code generation, and creative
writing, highlighting how scale and architecture innovations can produce systems with capabilities beyond their
explicit training objectives.
Keras, on the other hand, is the user-friendly interface built on top of TensorFlow, designed to simplify the
process of creating neural networks. Originally an independent library, Keras is now TensorFlow’s official
high-level API, offering intuitive tools to construct models with minimal code. Imagine Keras as the "smart home
system" that lets you control the electric grid with a simple app. Instead of wiring circuits manually (coding
low-level math), you use preconfigured switches (layers like Dense or Conv2D) to build models effortlessly.
Together, TensorFlow and Keras form a seamless partnership: TensorFlow handles the gritty details of
optimization and hardware acceleration, while Keras provides a clean, modular way to design experiments.
This combo is why they dominate industries—from healthcare (diagnosing diseases) to entertainment (Netflix
recommendations). For students, Keras lowers the barrier to entry, while TensorFlow ensures your skills scale
to real-world challenges.
"If TensorFlow is the engine and gears of a high-performance car, Keras is the steering wheel and dashboard
—giving you control without needing to be a mechanical engineer."
A simple code example to check the TensorFlow version:
import tensorflow as tf
print(tf.__version__)
Click on the run button available on the left-hand side of the code; you will get the TensorFlow version, which is
2.18.0
A note about the MNIST (Modified National Institute of Standards and Technology) Database
The MNIST dataset is the quintessential starting point for anyone learning machine learning and computer
vision. It consists of 70,000 handwritten digits (0–9), split into 60,000 training images and 10,000 test images,
each grayscale and sized at 28×28 pixels.
The MNIST dataset has become the quintessential starting point for machine learning and computer vision due
to its simplicity, accessibility, and well-structured format. Its small image size (28x28 pixels) and grayscale
format reduce computational complexity, making it ideal for beginners to experiment with algorithms without
needing high-end hardware. The dataset's clean, centered digits and balanced class distribution allow
newcomers to focus on core concepts like data preprocessing, model training, and evaluation metrics without
getting bogged down by noise or class imbalances. Additionally, MNIST's integration into popular libraries like
TensorFlow and PyTorch ensures easy access, enabling rapid prototyping and benchmarking.
Despite its widespread use, MNIST has notable limitations. Its simplicity, while great for beginners, fails to
capture real-world challenges like varying backgrounds, lighting conditions, or distorted handwriting,
leading to inflated accuracy scores (often >99%) that don't translate to practical applications. The dataset's
uniformity also means models trained on MNIST struggle with more complex tasks, exposing a gap between
academic exercises and real-world problems.
https://colab.research.google.com/drive/1WO6Cq2ihoipDjkaq2YhUf2HMGgFXBoRm?usp=sharing
The next topic is the perceptron convergence theorem, but before talking about this, let us discuss a
few important terms.
❖ A vector is a one-dimensional array of numbers, either representing inputs, weights, or outputs. A
vector can indeed be thought of as a matrix of order n×1, where n is the number of elements in the
vector. A vector is often represented as a column vector, which is a matrix with n rows and 1 column
(n×1). For example, a vector v with 3 elements can be written as:
❖ A Hyperplane is a geometric entity that separates a space into two distinct parts. In n-dimensional space, a hyperplane is defined by the equation w1x1 + w2x2 + … + wnxn + b = 0
Where:
● w1, w2, …, wn are the weights
● x1, x2, …, xn are the input features
● b is the bias term.
A. In one-dimensional space, a hyperplane is simply a point. For example, on a number line, a point x =
c can separate the space into two regions: x < c and x > c.
B. In two-dimensional space, a hyperplane is a line.
C. In three-dimensional space, a hyperplane is a plane.
The specific type of norm depends on the subscript or context. For example:
||W||1: L1 norm (sum of absolute values).
||W||2: L2 norm (Euclidean norm).
||W||p: Lp norm (generalized norm).
This represents the "straight-line distance" from the origin to the point defined by the vector W in
Euclidean space.
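These norms can be sketched for a small example vector (the vector itself is illustrative):

```python
def lp_norm(w, p):
    """Generalized Lp norm: (sum of |w_i|^p) raised to the power 1/p."""
    return sum(abs(x) ** p for x in w) ** (1.0 / p)

w = [3.0, -4.0]        # illustrative 2-D weight vector
l1 = lp_norm(w, 1)     # L1 norm: sum of absolute values -> 7.0
l2 = lp_norm(w, 2)     # L2 (Euclidean) norm: straight-line length -> 5.0
```

The L2 value 5.0 is exactly the 3-4-5 right-triangle hypotenuse, i.e., the straight-line distance from the origin to the point (3, −4).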
|⟨u, v⟩| ≤ ||u||2 · ||v||2
(The same can be written as |⟨u, v⟩| ≤ ||u|| · ||v|| if the Euclidean norm is meant by default.)
Where:
⟨u, v⟩ represents the inner product of vectors u and v, and |⟨u, v⟩| its absolute value
||u||2 and ||v||2 represent the Euclidean norms of the vectors u and v respectively
Therefore: 32 ≤ 32.84
The inequality holds, as expected. Note that the values aren't exactly equal, which tells us that these
two vectors aren't scalar multiples of each other.
Note: The equality holds if and only if one vector is a scalar multiple of the other (meaning they're linearly dependent; when vectors are linearly dependent, the angle between them is either 0° (same direction) or 180° (opposite direction)).
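The inequality is easy to verify numerically. The vectors u and v below are assumed for illustration; their inner product happens to be 32, matching the worked value above (the right-hand side comes out to about 32.8):

```python
import numpy as np

# Assumed example vectors; their inner product is 32, and the product of their
# Euclidean norms is about 32.83, so Cauchy-Schwarz holds with strict inequality.
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

lhs = abs(np.dot(u, v))                        # |<u, v>|
rhs = np.linalg.norm(u) * np.linalg.norm(v)    # ||u||2 * ||v||2

print(lhs, rhs)   # not equal, so u and v are not scalar multiples of each other
```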
❖ General Strategy for Tightening Inequalities: The idea of eliminating redundant terms to tighten an
inequality is based on the following points:
1. Redundancy: If a term in an inequality is already included in another term (e.g., as part of a
sum), explicitly writing it separately does not provide additional information.
2. Monotonicity of Inequalities: If A ≤ B + C and C is already included in B (i.e., B = C + D), then A ≤ B + C can be replaced by the tighter inequality A ≤ B, since the extra term C adds no new information.
The principle given above is useful when applying the Cauchy–Schwarz inequality in the Perceptron Convergence Theorem, where we try to remove unnecessary terms to tighten the inequalities.
To derive the error-correction learning algorithm for the perceptron, we find it more convenient to work with the
modified signal-flow graph model in figure below
The only difference here is that the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1. We may thus define the (m + 1)-by-1 input vector
X(n) = [+1, x1(n), x2(n), …, xm(n)]^T
and, correspondingly, the (m + 1)-by-1 weight vector
W(n) = [b(n), w1(n), w2(n), …, wm(n)]^T
The T in the superscript stands for the transpose operation. The n denotes the time-step when the algorithm is
applied. A time-step (denoted by n) represents a specific iteration or update in the algorithm. It is the point at
which the algorithm processes a data point, updates its parameters (e.g., weights), and moves closer to finding
a solution.
Accordingly, the linear combiner output is written in the compact form
v(n) = w0(n)x0(n) + w1(n)x1(n) + … + wm(n)xm(n)
i.e.
v(n) = W^T(n)X(n)
In the first line, w0(n), corresponding to i = 0, represents the bias b. For fixed n, the equation W^T X = 0, plotted in an m-dimensional space (and for some prescribed bias) with coordinates x1, x2, ..., xm, defines a hyperplane as the decision surface between two different classes of inputs.
Suppose then that the input variables of the perceptron originate from two linearly separable classes. Let H1 be
the subspace of training vectors X1(1), X1(2), ... that belong to class C1, and let H2 be the subspace of training
vectors X2(1), X2(2), ... that belong to class C2. The union of H1 and H2 is the complete space denoted by H.
Given the sets of vectors H1 and H2 to train the classifier, the training process involves the adjustment of the weight vector W in such a way that the two classes C1 and C2 are linearly separable. That is, there exists a weight vector W such that we may state
W^T X > 0 for every input vector X belonging to class C1
W^T X ≤ 0 for every input vector X belonging to class C2 ---------- (4)
Note:
If we used strict inequalities for both classes:
● W^T X > 0 for C1.
● W^T X < 0 for C2.
This would leave input vectors with W^T X = 0 unclassified, which is undesirable. The perceptron algorithm needs to classify all input vectors, so it uses:
● W^T X > 0 for C1.
● W^T X ≤ 0 for C2.
This ensures that every input vector is assigned to one of the two classes.
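This decision rule can be sketched as a tiny function; the weight vector below is an assumed value for illustration:

```python
import numpy as np

def classify(w, x):
    """Assign x to C1 if w^T x > 0, otherwise (w^T x <= 0) to C2."""
    return "C1" if np.dot(w, x) > 0 else "C2"

w = np.array([1.0, 1.0])                    # assumed weight vector
print(classify(w, np.array([1.0, 2.0])))    # w^T x = 3 > 0  -> C1
print(classify(w, np.array([-1.0, -2.0])))  # w^T x = -3 <= 0 -> C2
print(classify(w, np.array([1.0, -1.0])))   # w^T x = 0: boundary case -> C2
```

Note how the boundary case w^T x = 0 falls to C2, so every input gets a class, exactly as the note above requires.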
Given the subsets of training vectors H1 and H2, the training problem for the perceptron is then to find a weight
vector w such that the two inequalities of above statements are satisfied.
The algorithm for adapting the weight vector of the elementary perceptron may now be formulated as follows:
1. If the nth member of the training set, X(n), is correctly classified by the weight vector W(n) computed at the nth iteration of the algorithm, no correction is made to the weight vector of the perceptron, in accordance with the rule:
W(n + 1) = W(n) if W^T(n)X(n) > 0 and X(n) belongs to class C1
W(n + 1) = W(n) if W^T(n)X(n) ≤ 0 and X(n) belongs to class C2 ---------- (5)
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule
W(n + 1) = W(n) - η(n)X(n) if W^T(n)X(n) > 0 and X(n) belongs to class C2
W(n + 1) = W(n) + η(n)X(n) if W^T(n)X(n) ≤ 0 and X(n) belongs to class C1 ---------- (6)
Note: The learning rate is denoted by η (the Greek letter eta). It controls the amount of weight adjustment at each step of training. The learning rate, ranging from 0 to 1, determines the rate of learning at each time step and plays a significant role in how fast or slow a neural network learns: if the learning rate is low, the neuron learns slowly; if the learning rate is high, the neuron learns quickly.
We are using the fixed-increment adaptation rule for the perceptron in which we are keeping the η fixed i.e. it is
a constant independent of the iteration number n. This means the learning rate does not change over time.
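The fixed-increment rule above can be sketched as a small training loop. The toy data set, the +1/-1 class coding, and the epoch limit below are assumptions chosen for illustration, not part of the notes; the leading +1 in each input vector plays the role of the fixed bias input:

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, max_epochs=100):
    """Fixed-increment perceptron rule with a constant learning rate eta.

    X: input vectors with a leading +1 entry (bias treated as weight w0 = b).
    d: desired outputs, +1 for class C1 and -1 for class C2.
    """
    w = np.zeros(X.shape[1])           # initial condition W(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, d):
            y = 1.0 if np.dot(w, x) > 0 else -1.0   # W^T x > 0 -> C1, else C2
            if y != target:            # misclassified: apply the correction
                w = w + eta * target * x
                errors += 1
        if errors == 0:                # a full pass with no corrections: done
            break
    return w

# Assumed toy data: two linearly separable clusters; leading 1 is the bias input.
X = np.array([[1, 2, 2], [1, 1, 3], [1, -1, -1], [1, -2, -3]], dtype=float)
d = np.array([1, 1, -1, -1], dtype=float)
w = train_perceptron(X, d)
print(w)
```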
Proof of the perceptron convergence algorithm is presented for the initial condition W(0) = 0. Suppose that
WT(n)X(n) < 0 for n = 1, 2, ..., and the input vector X(n) belongs to the subset H1. That is, the perceptron
incorrectly classifies the vectors X(1), X(2) ..., since the first condition of equation (4) is violated. Then, with the
constant η(n) = 1, we may use the second line of equation (6) to write
W(n + 1) = W(n) + X(n) for X(n) belonging to class C1 ---------- (7)
Note: The update rule W(n + 1) = W(n) + X(n) means that the weight vector at the next iteration, W(n + 1), is computed from the current input X(n) and the current weight vector W(n). This ensures that the algorithm processes each input X(n) sequentially and updates the weights accordingly. The variable n is used in two different ways in the perceptron algorithm:
- For the weights, n starts at 0: W(0) is the initial weight vector, and each subsequent W(n) is built from the previous one.
- For the inputs, n runs from 1 to the total number of inputs.
For iteration n = 0, the weight vector is initialized as W(0) = 0.
For iteration n = 1, the algorithm uses W(0) to classify X(1); it then computes W(1) from X(1):
W(1) = W(0) + X(1) [because W(0) = 0]
hence W(1) = X(1)
For iteration n = 2, the algorithm uses W(1) to classify X(2); it then computes W(2) from X(2):
W(2) = W(1) + X(2) [because W(1) = X(1)]
hence W(2) = X(1) + X(2)
For iteration n = 3, the algorithm uses W(2) to classify X(3); it then computes W(3) from X(3):
W(3) = W(2) + X(3) [because W(2) = X(1) + X(2)]
hence W(3) = X(1) + X(2) + X(3)
From the above calculation we can say that, for W(0) = 0, we may iteratively solve this equation for W(n + 1), obtaining the result
W(n + 1) = X(1) + X(2) + . . . + X(n) - - - - - - - - - - (8)
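Equation (8) is easy to verify numerically: repeatedly applying the update W(n + 1) = W(n) + X(n) of equation (7) from W(0) = 0 (assuming, as in the proof, that every presentation is a misclassified vector from C1 and η = 1) yields the running sum of the inputs. The example inputs are assumed values:

```python
import numpy as np

# W(0) = 0; apply W(n + 1) = W(n) + X(n), treating each presentation as a
# misclassification from C1 with eta = 1 (assumed example inputs).
X = [np.array([1.0, 2.0]), np.array([0.5, 1.0]), np.array([2.0, 0.0])]
w = np.zeros(2)
for x in X:
    w = w + x          # equation (7)

print(w)               # equals X(1) + X(2) + X(3), as equation (8) states
```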
Since the classes C1 and C2 are assumed to be linearly separable, there exists a solution Wo for which Wo^T X(n) > 0 for all the vectors X(1), ..., X(n) belonging to the subset H1. For a fixed solution Wo, we may then define a positive number α as
α = min over all X(n) in H1 of Wo^T X(n) ---------- (9)
Hence, multiplying both sides of Eq. (8) by the row vector Wo^T, we get
Wo^T W(n + 1) = Wo^T X(1) + Wo^T X(2) + Wo^T X(3) + … + Wo^T X(n)
Since each term on the right-hand side is at least α (by the definition of α), it follows that
Wo^T W(n + 1) ≥ nα ---------- (10)
Next we make use of an inequality known as the Cauchy–Schwarz inequality. Given two vectors Wo and W(n + 1), the Cauchy–Schwarz inequality states that
||Wo||² ||W(n + 1)||² ≥ [Wo^T W(n + 1)]²
Since Wo^T W(n + 1) ≥ nα, it follows that
||Wo||² ||W(n + 1)||² ≥ n²α² ---------- (11)
Here ||Wo||² is the squared Euclidean norm; strictly it could be written ||Wo||2², but for the sake of simplicity we write it ||Wo||².
or, equivalently,
||W(n + 1)||² ≥ n²α² / ||Wo||² ---------- (12)
We next follow another development route. For X(k) belonging to class C1 and misclassified, the update is W(k + 1) = W(k) + X(k). Taking the squared Euclidean norm of both sides gives
||W(k + 1)||² = ||W(k)||² + 2W^T(k)X(k) + ||X(k)||² ---------- (13)
But W^T(k)X(k) < 0, since the perceptron misclassified X(k), so
||W(k + 1)||² ≤ ||W(k)||² + ||X(k)||² ---------- (14)
or equivalently
||W(k + 1)||² - ||W(k)||² ≤ ||X(k)||² ---------- (15)
Summing these inequalities for k = 1, …, n and invoking the initial condition W(0) = 0: the left-hand side is a telescoping sum, meaning most terms cancel out. So we will get
||W(n + 1)||² ≤ ||X(1)||² + ||X(2)||² + … + ||X(n)||²
The above can be written as follows after applying the General Strategy for Tightening Inequalities:
||W(n + 1)||² ≤ nβ ---------- (16)
where β is a positive number defined by
β = max over all X(k) in H of ||X(k)||²
The inequality of equation (16) is in conflict with the inequality of equation (12) for sufficiently large values of n because
1. The upper bound (16) grows linearly with n
2. The lower bound (12) grows quadratically with n
For sufficiently large n, the quadratic term will eventually exceed the linear term which would violate the upper
bound. This is why the two inequalities appear to be in conflict for large n.
The Perceptron algorithm is guaranteed to converge (i.e., find a solution) after a finite number of updates (nmax) if the data is linearly separable. This means that the inequalities are only relevant for n ≤ nmax, where nmax is the maximum number of updates required for convergence.
We have thus proved that, for η(n) = 1 for all n and W(0) = 0, and given that a solution vector Wo exists, the rule for adapting the synaptic weights of the perceptron must terminate after at most nmax iterations. We may now state the fixed-increment convergence theorem for the perceptron as follows:
Let the subsets of training vectors H1 and H2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that
W(n0) = W(n0 + 1) = W(n0 + 2) = …
is a solution vector for n0 ≤ nmax.
The Perceptron Convergence Algorithm guarantees that if the data is linearly separable, the algorithm will find
a solution (i.e., a weight vector that correctly classifies all training examples) in a finite number of iterations.
The goal is indeed to determine the value of n0 (or nmax) such that the algorithm will surely converge within n0 iterations. Equating the quadratic lower bound of Eq. (12) with the linear upper bound of Eq. (16) at n = nmax, i.e. nmax²α² / ||Wo||² = nmax·β, we find
nmax = β ||Wo||² / α²
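The quantities α, β, and the resulting bound nmax = β||Wo||²/α² can be computed for a concrete case. The training subset H1 and the solution vector Wo below are assumed values chosen so the arithmetic is easy to follow:

```python
import numpy as np

# Assumed training subset H1 and an assumed solution vector Wo (illustrative only).
H1 = [np.array([1.0, 1.0]), np.array([2.0, 1.0]), np.array([1.0, 3.0])]
Wo = np.array([1.0, 1.0])

alpha = min(np.dot(Wo, x) for x in H1)    # alpha: smallest margin Wo^T X(n) over H1
beta = max(np.dot(x, x) for x in H1)      # beta: largest squared norm ||X(k)||^2
n_max = beta * np.dot(Wo, Wo) / alpha**2  # n_max = beta * ||Wo||^2 / alpha^2

print(alpha, beta, n_max)
```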
Consider, for example, the function f(x, y) = x² + 3y. Taking the partial derivative with respect to x: the derivative of x² with respect to x is 2x, and since 3y is treated as a constant, its derivative is 0.
Taking the partial derivative with respect to y: the term x² is treated as a constant, so its derivative is 0, and the derivative of 3y with respect to y is 3.
The gradient vector (denoted ∇f, pronounced "nabla f") is simply the vector formed by collecting all the partial derivatives, and it points in the direction of steepest ascent:
∇f = (∂f/∂x, ∂f/∂y) = (2x, 3)
The function changes most rapidly in the direction of (2x, 3). If we move in the direction of this gradient, the function f(x, y) increases fastest.
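The partial derivatives can be sanity-checked with central finite differences; the function f(x, y) = x² + 3y and the evaluation point below are assumed for illustration:

```python
# f(x, y) = x^2 + 3y, with gradient (2x, 3) as worked out above.
def f(x, y):
    return x**2 + 3*y

def grad_f(x, y):
    return (2*x, 3)        # partial derivatives

# Central finite differences at an arbitrary point as a sanity check
x, y, h = 2.0, 1.0, 1e-6
df_dx = (f(x + h, y) - f(x - h, y)) / (2*h)
df_dy = (f(x, y + h) - f(x, y - h)) / (2*h)
print(grad_f(x, y), df_dx, df_dy)
```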
Convergence is made faster if a momentum factor is added to the weight-updation process. This is generally done in the backpropagation network. If momentum is to be used, the weight changes from one or more previous training patterns must be saved. Momentum allows the net to make reasonably large weight adjustments as long as the corrections are in the same general direction for several patterns.
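A minimal sketch of a momentum update on a one-dimensional quadratic error E(w) = w²; the learning rate η, the momentum factor μ, and the starting point are all assumed values, not from the notes:

```python
# E(w) = w^2 is an assumed toy error surface; eta, mu, and w0 are assumed values.
def momentum_step(w, grad, prev_delta, eta=0.1, mu=0.9):
    # The saved previous weight change is added to the current gradient step,
    # so successive corrections in the same direction build up speed.
    delta = -eta * grad + mu * prev_delta
    return w + delta, delta

w, prev = 1.0, 0.0
for _ in range(3):
    grad = 2 * w                            # dE/dw for E(w) = w^2
    w, prev = momentum_step(w, grad, prev)
print(w)                                    # moves toward the minimum at w = 0
```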
The vigilance parameter is denoted by ρ. It is generally used in adaptive resonance theory (ART) networks to control the degree of similarity required for patterns to be assigned to the same cluster unit. To perform useful work in controlling the number of clusters, the vigilance parameter is typically chosen in the range of approximately 0.7 to 1.
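A hedged sketch of an ART-1-style vigilance test on binary patterns; the match rule |x AND w| / |x| ≥ ρ and the example patterns below are assumptions for illustration, not the full ART algorithm:

```python
# Binary patterns; the match rule and example values are illustrative assumptions.
def passes_vigilance(x, w, rho):
    overlap = sum(a & b for a, b in zip(x, w))   # |x AND w|
    return overlap / sum(x) >= rho               # similarity vs. vigilance rho

x = [1, 1, 1, 0, 1]    # input pattern, |x| = 4
w = [1, 1, 1, 0, 0]    # cluster prototype, overlap = 3, ratio = 0.75
print(passes_vigilance(x, w, 0.8))   # 0.75 < 0.8: pattern rejected
print(passes_vigilance(x, w, 0.7))   # 0.75 >= 0.7: pattern accepted
```

Raising ρ toward 1 demands closer matches and so produces more, finer clusters; lowering it merges patterns into fewer, coarser clusters.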