ASC NOTES II Sem
● In this diagram, we observe that there is a function, f, which takes some input, x, and produces an
output, y. The input is referred to as the Antecedent, while the output is called the Consequent.
● The function f can also be called a formal method, algorithm, or a mapping function. It represents the
logic or steps used to process the input and generate the output.
● The middle part of this diagram is known as the computing unit, where the function resides. This is
where we feed the input, and a process occurs to convert that input into the desired output y.
● In the computing process, the steps that guide how the input is manipulated are called control actions.
These control actions ensure that the input gradually approaches the desired output. The moment the
process completes, we obtain the final result, known as the Consequent.
● Basic Characteristics Related to Computing:
1. Precise Solution: Computing is used to produce a solution that is exact and definitive.
2. Unambiguous & Accurate: The control actions within the computing unit must be unambiguous,
meaning they must have only one interpretation. Each control action should also be accurate to
ensure that the process is valid and reliable.
3. Mathematical Model: A well-defined algorithm is a requirement for solving any problem through
computing. This algorithm is essentially a mathematical model that represents the problem and its
solution process.
● In computing, there are two main types:
1. Hard Computing
2. Soft Computing
Hard Computing
In 1996, L.A. Zadeh (the pioneer of fuzzy logic) introduced the term hard computing. Hard computing
refers to a traditional computing approach where the results are always precise and exact. This is because
hard computing relies on well-defined mathematical models and algorithms to solve problems. There is no
room for approximation, as every calculation or action leads to a deterministic output.
Soft Computing
● Soft Computing is an approach to computing that models the human mind’s ability to make decisions in
an uncertain, imprecise, or complex environment. Unlike traditional, or "hard," computing which relies
on exact binary logic (0s and 1s), soft computing deals with approximation, flexibility, and learning from
experience to solve complex real-world problems.
● Soft computing techniques focus on developing systems that can handle ambiguity, uncertainty, and
approximation, making them well-suited for fields like artificial intelligence (AI), pattern recognition, and
robotics.
● Dr. L.A. Zadeh quoted that “The guiding principle of soft computing is to exploit the tolerance for
imprecision, uncertainty, and partial truth to achieve tractability, robustness, low solution cost, better
rapport with reality” i.e. The main idea of soft computing is to make use of its ability to handle
imprecision, uncertainty, and incomplete information. This helps in finding practical solutions that are
reliable, cost-effective, and closer to real-world situations. He pointed out that soft computing is not a
single method, but instead it is a combination of several methods, such as fuzzy logic, neural networks,
and genetic algorithms. These methods are not competitive but complementary to each other, and
they can be used together to solve a given problem.
Real-life examples to understand soft computing and hard computing
i. A manual transmission car relies on the driver to make precise decisions and execute gear shifts,
making it more aligned with the principles of hard computing (exact, deterministic input and output). In
contrast, an automatic transmission car adapts to varying conditions like speed and load, using fuzzy
logic or other soft computing techniques to decide when to shift gears smoothly.
ii. A manual air conditioner requires the user to set precise temperature and fan speed. The device works
rigidly based on these fixed inputs without adapting to external conditions. It is an example of hard
computing but an automatic air conditioner with smart features uses fuzzy logic to maintain a
comfortable room temperature. It adapts to inputs like room size, occupancy, and external weather
conditions to adjust cooling intensity dynamically. It is an example of soft computing.
Characteristics of soft computing
1. Handles Uncertainty and Approximation:
● Traditional algorithms often require exact inputs, but real-world problems don't always have precise
data. Soft computing methods can handle uncertainty, making them more suitable for practical
applications.
● Example: In medical diagnosis, data such as symptoms might be uncertain or incomplete, and soft
computing helps make decisions despite that.
2. Ease of Implementation:
● Soft computing techniques like fuzzy logic and genetic algorithms are often easier to implement for
complex, real-world problems compared to traditional methods. They don’t require a precise model of
the system but instead learn from examples.
● Example: In control systems, fuzzy logic can be quickly applied to model human-like decision-making
without needing complex mathematical models.
An Example
A step fan regulator is an example of crisp logic. In a step fan regulator, the fan speed can only be set to fixed
levels like Low, Medium, or High.
Rules:
a. If temperature ≤ 20°C, Fan speed = Low
b. If 20°C < temperature ≤ 30°C, Fan speed = Medium
c. If temperature > 30°C, Fan speed = High
How it works:
a. At 25°C the fan runs at Medium speed.
b. At 35°C the fan runs at High speed.
c. The change between speeds is sudden and abrupt. The fan speed will be different for 30°C and 31°C.
A smooth fan regulator is an example of fuzzy logic. In a smooth fan regulator, the fan speed can vary
continuously based on the temperature, not just fixed levels.
Rules:
a. If the temperature is Low, fan speed is set to Low.
b. If the temperature is Medium, fan speed is set to Medium.
c. If the temperature is High, fan speed is set to High.
How it works:
a. At 25°C the fan adjusts smoothly between Low and Medium speeds (e.g. 70% of Medium speed).
b. At 28°C the fan gradually increases speed further (e.g. 90% of Medium speed).
c. The changes in speed are gradual and smooth, mimicking human judgment.
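The contrast between the step (crisp) and smooth (fuzzy-style) regulators can be sketched in Python; the 15°C and 35°C breakpoints of the smooth regulator are illustrative assumptions, not values from the notes.

```python
def crisp_fan_speed(temp_c):
    """Step regulator: fixed levels with abrupt jumps (crisp logic)."""
    if temp_c <= 20:
        return "Low"
    elif temp_c <= 30:
        return "Medium"
    return "High"

def fuzzy_fan_speed(temp_c):
    """Smooth regulator: speed fraction (0..1) varies continuously."""
    # Linear ramp between 15 C (fan off) and 35 C (full speed);
    # the 15/35 breakpoints are illustrative assumptions.
    lo, hi = 15.0, 35.0
    frac = (temp_c - lo) / (hi - lo)
    return max(0.0, min(1.0, frac))

print(crisp_fan_speed(30), crisp_fan_speed(31))  # Medium High  (abrupt jump)
print(fuzzy_fan_speed(30), fuzzy_fan_speed(31))  # 0.75 0.8     (gradual change)
```

Note how the crisp regulator jumps a whole level between 30°C and 31°C, while the fuzzy-style regulator changes by only a small fraction.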
In fuzzy logic, the knowledge base is represented by if-then rules of fuzzy descriptors. An example of a fuzzy
rule would be
"If the speed is slow and the target is far then moderately increase the power.”
It contains the fuzzy descriptors slow, far, and moderate. A fuzzy descriptor represents a qualitative or
imprecise concept. It may be represented by a membership function which gives a membership grade between
0 and 1 for each possible value of the fuzzy descriptor it represents. Mathematically, a fuzzy set A is
represented by a membership function, or a "possibility function," of the form:

Fz[x ∈ A] = μA(x) ∈ R = [0, 1], x ∈ X
● The symbol Fz in Fz[x ∈ A] stands for the fuzzy set membership evaluation process. It assesses how
much an element x belongs to a fuzzy set A. A fuzzy set is a set where elements can have varying
degrees of membership, rather than being fully included or excluded. It is defined by a membership
function, which assigns each element a membership grade between 0 and 1.
● The x represents an element drawn from the universal set (the domain of discourse).
● The A represents the fuzzy set under consideration. This fuzzy set is a subset of the universal set.
Note that although x belongs crisply to the universal set, it belongs to A only fuzzily: it may belong
fully, belong to some extent, or not belong at all.
● The μA is the membership function for the fuzzy set A. It assigns a membership grade (μ) to each
element x, indicating how much x belongs to A.
● R = [0, 1] is the range of the membership function μA(x), where each value indicates the grade of
membership for a given x.
0: x does not belong to A.
1: x fully belongs to A.
0 < μA(x) < 1: x partially belongs to A.
The translation from x to μA(x) is known as fuzzification. This topic will be covered in detail later.
2. Neural Network
A neural network consists of a set of nodes, usually organized into layers, and connected through weight
elements called synapses. At each node, the weighted inputs are summed (aggregated), thresholded, and
subjected to an activation function in order to generate the output of that node.
If the weighted sum of the inputs to a node (neuron) exceeds a threshold value w0, then the neuron fires and
an output y(t) is generated according to

y(t) = f[Σi wi xi − w0]

where xi are the neuron inputs, wi are the synaptic weights, and f[·] is the activation function.
There are two main classes of neural networks known as feedforward networks (or static networks) and
feedback networks (or recurrent networks).
In a feedforward network, the signal flow from a node to another node takes place in the forward direction
only. There are no feedback paths. In a feedforward neural network, learning is achieved through example.
This is known as supervised learning. Specifically, first a set of input–output data of the actual process is
determined (e.g., by measurement). The input data are fed into the NN. The network output is compared with
the desired output (experimental data) and the synaptic weights of the NN are adjusted using an algorithm until
the desired output is achieved.
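The supervised-learning loop described above can be sketched with a single neuron; the AND dataset, learning rate, and fixed threshold w0 = 0.5 are illustrative assumptions, not values from the notes.

```python
# Minimal sketch of supervised learning in a single-neuron feedforward
# network: synaptic weights are adjusted from input-output examples
# until the desired output is reproduced.
def step(v, w0=0.5):            # neuron fires if weighted sum exceeds w0
    return 1 if v > w0 else 0

def train(samples, epochs=20, eta=0.1):
    w = [0.0, 0.0]              # synaptic weights w_i
    for _ in range(epochs):
        for x, desired in samples:
            y = step(w[0]*x[0] + w[1]*x[1])
            err = desired - y   # compare network output with desired output
            w = [w[0] + eta*err*x[0], w[1] + eta*err*x[1]]
    return w

# Learn logical AND from measured input-output examples
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train(data)
print([step(w[0]*x[0] + w[1]*x[1]) for x, _ in data])  # [0, 0, 0, 1]
```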
In a feedback NN, the outputs of one or more nodes (say in the output layer) are fed back to one or more
nodes in a previous layer (say hidden layer or input layer) or even to the same node. The feedback provides
the capability of "memory" to the network.
3. Genetic Algorithm
Genetic algorithms represent an optimization approach where a search is made to "evolve" a solution
algorithm, which will retain the "most fit" components in a procedure. In the present context of intelligent
machines, Intellectual fitness is important for the evolutionary process rather than physical fitness. Evolutionary
computing can play an important role in the development of an optimal and self-improving intelligent machine.
Evolutionary computing has the following characteristics:
a. It is based on multiple searching points or solution candidates (population based search).
b. It uses evolutionary operations such as crossover and mutation.
c. It is based on probabilistic operations.
A genetic algorithm works with a population of individuals, each representing a possible solution to a given
problem. Each individual is assigned a fitness score according to how good its solution to the problem is. The
highly fit (in an intellectual sense) individuals are given opportunities to reproduce by crossbreeding with other
individuals in the population. This produces new individuals as offspring, who share some features taken from
each parent. The least fit members of the population are less likely to get selected for reproduction and will
eventually die out. An entirely new population of possible solutions is produced in this manner, by mating the
best individuals (i.e., individuals with best solutions) from the current generation. The new generation will
contain a higher proportion of the characteristics possessed by the "fit" members of the previous generation. In
this way, over many generations, desirable characteristics are spread throughout the population, while being
mixed and exchanged with other desirable characteristics in the process. By favoring the mating of the
individuals who are more fit (i.e., who can provide better solutions), the most promising areas of the search
space would be exploited. A GA determines the next set of searching points using the fitness values of the
current searching points, which are widely distributed throughout the searching space. It uses the mutation
operation to escape from a local minimum.
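The population / fitness / crossover / mutation cycle can be sketched as follows; the OneMax fitness (count of 1-bits) and all numeric parameters are illustrative assumptions, not part of the notes.

```python
import random
random.seed(1)  # fixed seed so the probabilistic run is repeatable

# Minimal genetic-algorithm sketch: population-based search with
# crossover and mutation, keeping the most fit individuals.
def fitness(ind):                         # fitness score of an individual
    return sum(ind)                       # OneMax: count of 1-bits

def crossover(p1, p2):                    # one-point crossover of parents
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.05):               # random bit flips help escape local minima
    return [1 - b if random.random() < rate else b for b in ind]

pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for gen in range(40):
    pop.sort(key=fitness, reverse=True)   # most fit first
    parents = pop[:10]                    # least fit members die out
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(20)]
best = max(pop, key=fitness)
print(fitness(best))                      # typically close to the maximum, 20
```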
4. Associative Memory
It is a type of memory structure that allows for the retrieval of information based on content rather than a
specific address (as in traditional memory systems). It is inspired by the way the human brain recalls
information, where a partial or noisy input can trigger the recall of a complete memory. In soft computing,
associative memory is often implemented using neural networks or fuzzy systems, which are capable of
handling imprecise, uncertain, or incomplete data.
Types of Associative Memory
1. Auto-associative Memory: This type of memory retrieves a stored pattern that most closely matches the
input pattern. It is useful for pattern completion or noise reduction. For example, if you input a noisy or
incomplete version of a pattern, the auto-associative memory can recall the original, clean version.
Example: If you train an auto-associative memory with the word "HELLO" and later input "H*LLO", it
can recall "HELLO".
2. Hetero-associative Memory: This type of memory retrieves a different pattern from the input pattern.
The output may differ in content, type, or format. It is useful for pattern association or mapping tasks.
For example, mapping a name to a face or a word to its meaning.
Example: If you train a hetero-associative memory with the input "cat" and output "animal", then
inputting "cat" will retrieve "animal".
5. Adaptive Resonance Theory (ART)
1. Stability-Plasticity Dilemma: ART was designed to address a fundamental challenge in neural networks:
how to learn new information without catastrophically forgetting previously learned knowledge.
"Stability" refers to preserving old knowledge, while "plasticity" refers to the ability to adapt to new input.
ART achieves a balance between these two.
2. Clustering: ART networks excel at unsupervised learning and clustering. They can automatically group
similar input patterns into clusters without needing predefined categories.
3. Resonance: The "resonance" aspect is crucial. When an input is presented, the network searches for a
matching cluster. If a good match is found, a state of "resonance" occurs, strengthening the
connections between the input and the cluster. If no match is found, a new cluster is created.
4. Vigilance Parameter: This parameter controls the network's sensitivity to new patterns. A high vigilance
means the network is more likely to create new clusters (fine-grained distinctions), while a low vigilance
means it's more likely to assimilate new inputs into existing clusters (broader generalizations).
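A heavily simplified, illustrative sketch of the resonance/vigilance idea (not the actual ART equations): each input either joins a sufficiently similar existing cluster or founds a new one, and a higher vigilance produces more, finer-grained clusters. The match score is a plain bit-agreement fraction, an assumption made for brevity.

```python
# Simplified ART-style clustering: if the best matching prototype
# clears the vigilance threshold, the input "resonates" with that
# cluster; otherwise a new cluster is created.
def match(proto, x):
    """Fraction of agreeing bits between prototype and input."""
    return sum(p == v for p, v in zip(proto, x)) / len(x)

def art_like_cluster(inputs, vigilance=0.75):
    clusters = []                             # list of cluster prototypes
    labels = []
    for x in inputs:
        scores = [match(p, x) for p in clusters]
        if scores and max(scores) >= vigilance:
            labels.append(scores.index(max(scores)))  # resonance: join cluster
        else:
            clusters.append(list(x))                  # no match: new cluster
            labels.append(len(clusters) - 1)
    return labels

data = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
print(art_like_cluster(data, vigilance=0.75))  # [0, 0, 1, 1]  (broad clusters)
print(art_like_cluster(data, vigilance=0.9))   # [0, 1, 2, 3]  (fine-grained)
```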
Types of ART Networks
1. ART1: Handles binary-valued input data.
2. ART2: Handles continuous-valued data.
3. ART3: An extension with hierarchical processing for large-scale applications.
4. Fuzzy ART: Incorporates fuzzy logic to process uncertainty in data.
ART networks are valuable tools in soft computing due to their ability to handle uncertainty and learn from data
without strict supervision. Here are some examples:
1. Image Recognition: Imagine a system that needs to categorize images of different types of flowers.
ART can learn to cluster images of roses, lilies, and tulips without being explicitly told what these
flowers are. It identifies patterns and similarities in the image data to form its own categories.
2. Anomaly Detection: In a network security context, ART can be used to detect unusual patterns in
network traffic. By learning the "normal" patterns, it can identify deviations that might indicate an attack,
even if the attack is new and unseen before.
3. Data Mining: ART can be used to discover hidden patterns and structures in large datasets. For
example, in customer behavior analysis, it can group customers based on their purchasing habits,
revealing distinct customer segments.
4. Robotics: In robotics, ART can help a robot learn to navigate an environment without a detailed map.
The robot can use its sensors to perceive the environment and cluster similar sensory experiences,
allowing it to build a mental representation of the space.
Apart from those listed above, there are a few more examples of applications of soft computing.
Here are a few important questions
1. List a few examples of Hard and Soft Computing.
2. Identify the characteristics of Hard and Soft Computing.
3. Differentiate between Hard and Soft Computing.
4. What are the major applications of AI?
5. Identify major techniques of Soft Computing.
6. List a few examples of Fuzzy Logic, Artificial Neural Network and Genetic Algorithms.
7. Identify a real-time problem that can be resolved through Soft Computing.
*****
Let us come back to the fuzzy logic system; this time we will discuss it in detail.
● Fuzzy logic was first developed by L.A. Zadeh in the mid-1960s for representing some types of
"approximate" knowledge that cannot be represented by conventional, crisp methods.
● Fuzzy logic is an extension of crisp bivalent (two-state) logic in the sense that it provides a platform for
handling approximate knowledge. Fuzzy logic is based on fuzzy set theory.
● A fuzzy set is represented by a membership function. A particular "element" value in the range of
definition of the fuzzy set will have a grade of membership which gives the degree to which the
particular element belongs to the set.
● The October 19, 1987 issue of Nikkei Industrial News stated: "Toshiba has developed an AI system
which controls machinery and tools using Fuzzy Logic. It controls rules, simulation, and valuation.
Toshiba will add Expert System function to it and accomplish the synthetic AI. Toshiba is going to turn it
into practical uses in the field of industrial products, traffic control, and nuclear energy.” This news item
is somewhat ironic and significant, because around the same time at the Information Engineering
Division of the University of Cambridge a similar application of fuzzy logic was developed for a robot.
This work was subsequently extended in the Industrial Automation Laboratory of the University of
British Columbia, where the applications centered around the fish processing industry.
● Popular applications are Process temperature control by OMRON, Air conditioner by Mitsubishi,
Vacuum cleaner by Panasonic, Automatic transmission system by Nissan, Subaru, Mitsubishi, Antilock
braking system by Nissan.
Fuzzy Sets
A fuzzy set is a set without clear or sharp (crisp) boundaries or without binary membership characteristics.
Unlike an ordinary set where each object (or element) either belongs or does not belong to the set, partial
membership in a fuzzy set is possible. In other words, there is a "softness" associated with the membership of
elements in a fuzzy set.
The membership in a fuzzy set need not be complete, i.e., members of one fuzzy set can also be members of
other fuzzy sets in the same universe. Vagueness is introduced in a fuzzy set by eliminating the sharp
boundaries that divide members from nonmembers in the group. There is a gradual transition between
membership and nonmembership, not abrupt transition.
Consider the variable "temperature". It can take a fuzzy value (e.g., cold, cool, tepid, warm, hot). A fuzzy value
such as "warm" is a fuzzy descriptor. It may be represented by a fuzzy set because any temperature that is
considered to represent "warm" belongs to this set and any other temperature does not belong to the set. Still,
one cannot realistically identify a precise temperature interval (e.g., 25°C to 30°C), which is a crisp set, to
represent warm temperatures.
Let X be a set that contains every set of interest in the context of a given class of problems. This is called the
universe of discourse (or simply universe), whose elements are denoted by x. A fuzzy set A in X may be
represented by a Venn diagram as given below.
Note: Generally, the elements x are not numerical quantities. For analytical convenience, however, the
elements x are assigned real numerical values.
A fuzzy set may be represented by a membership function. This function gives the grade (degree) of
membership within the set, of any element of the universe of discourse. The membership function maps the
elements of the universe to numerical values in the interval [0, 1].
μA(x): X → [0, 1]
where μA(x) is the membership function of the fuzzy set A in the universe X. Stated another way, fuzzy set
A is a set of ordered pairs:
A = {(x, μA(x)); x ∈ X, μA(x) ∈ [0, 1]}
The membership function μA(x) represents the grade of possibility that an element x belongs to the set A.
● A membership function value of zero implies that the corresponding element is definitely not an element
of the fuzzy set.
● A membership function value of unity means that the corresponding element is definitely an element of
the fuzzy set.
● A grade of membership greater than 0 and less than 1 corresponds to a non-crisp (or fuzzy)
membership, and the corresponding elements fall on the fuzzy boundary of the set. The closer μA(x)
is to 1, the more x is considered to belong to A; the closer it is to 0, the less it is considered
to belong to A.
Let us talk about the symbolic representation of the fuzzy set; here are few key terms-
● A universe of discourse (i.e. complete set of possible values under consideration) and a membership
function which spans the universe, completely define a fuzzy set. A fuzzy set may be symbolically
represented as:
A = {x|μA(x)}
● The element that has a membership grade greater than 0 but less than 1 is called a boundary element.
The crisp set of all elements in the universe that have a non-zero membership grade (i.e. μ > 0) is
called a support set.
Support set: SA = {x | μA(x) > 0}
● The support set is a crisp subset (means it is either a subset or not a subset, no scope for degree of
membership when it comes to subset) of the universe. A fuzzy set is clearly a subset of its support set.
● The core is the subset of elements with a membership grade of 1.
○ Note: We are not usually interested in the elements with zero grade of membership.
An Example
Universe of Discourse: All possible temperatures, e.g. [0°C, 50°C].
Fuzzy Set: "Warm temperatures."
Membership Function: Define the membership grades for "warm" temperatures as follows:
μ(10°C) = 0 (definitely not warm)
μ(20°C) = 0.2 (slightly warm)
μ(25°C) = 0.5 (moderately warm)
μ(30°C) = 0.8 (quite warm)
μ(35°C) = 1.0 (perfectly warm)
μ(40°C) = 0.6 (still warm but getting hot)
μ(50°C) = 0 (definitely not warm)
Note: A fuzzy set may have fewer elements than the support set when specific conditions, such as
thresholds or filtering criteria, are applied to define the fuzzy set.
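The support and core of the "warm temperatures" example can be computed directly from the membership grades:

```python
# The "warm temperatures" fuzzy set from the example above,
# as a mapping from temperature (degrees C) to membership grade.
warm = {10: 0.0, 20: 0.2, 25: 0.5, 30: 0.8, 35: 1.0, 40: 0.6, 50: 0.0}

support = {x for x, mu in warm.items() if mu > 0}   # crisp set: grade > 0
core    = {x for x, mu in warm.items() if mu == 1}  # crisp set: grade == 1

print(sorted(support))   # [20, 25, 30, 35, 40]
print(sorted(core))      # [35]
```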
● For fuzzy sets, "subset" is defined through membership grades, not element containment: A ⊆ B if
and only if μA(x) ≤ μB(x) for all x ∈ X. So, for fuzzy sets, A ⊆ B does not simply mean that B
lists all the elements of A.
An Example
A = {(20, 0.2), (25, 0.5)}
B = {(20, 0.2), (25, 0.7), (30, 0.8)}
Since μA(x) ≤ μB(x) for all x ∈ X, we conclude that A is a subset of B, yet A is not equal to B.
● Two fuzzy sets A and B are said to be equal fuzzy sets if μA(x) = μB(x) for all x ∈ X.
An Example
A={(20,0.2),(25,0.5),(30,0.8),(35,1.0)}
B={(20,0.2),(25,0.5),(30,0.8),(35,1.0)}
Since μA(x) = μB(x) for all x ∈ X, we conclude that A = B.
Now let us consider the relationship among a fuzzy set (A), its support set (S), and the
universal set (X).
S.No. Statement Remark
1. A is a subset of S. This holds according to crisp set theory (it is not directly applicable in
fuzzy set theory, because S carries no membership grades). S is derived from A by adding every
element of A to S, so A is considered a subset of S.
2. A is not equal to S. This holds because A carries more information than S: S has elements only,
while A has elements plus membership grades.
3. S is not a subset of A. This holds, because if it did hold, then A would have to equal S.
4. A is a subset of, or equal to, X. This holds according to crisp set theory as well as fuzzy set
theory.
5. S is a subset of, or equal to, X. This holds according to crisp set theory (again, membership
grades are not involved, since S is crisp).
● A fuzzy set is a universal set/whole fuzzy set if and only if the value of the membership function is 1 for
all the members under consideration.
● A fuzzy set A is said to be an empty fuzzy set if and only if the value of the membership function is 0 for
all possible members under consideration.
● The collection of all fuzzy sets and fuzzy subsets on universe X is called the fuzzy power set P(X).
Since fuzzy sets can overlap, the cardinality of the fuzzy power set, nP(X), is infinite. So we can say
A ⊆ X ⟹ μA(x) ≤ μX(x)
Also for x ∈ X; μϕ(x) = 0; μx(x) = 1
A note of Cardinality
- The cardinality of a crisp set refers to the number of distinct elements in the set.
An Example:
A={1,2,3}
The cardinality of A, denoted |A|, is 3.
- In fuzzy sets, elements are associated with membership grades between 0 and 1, so cardinality takes
these grades into account:

|A| = Σx∈U μA(x)

where μA(x) is the membership grade of x in the fuzzy set A, and U is the universe of discourse.
An Example:
Consider a fuzzy set A = {(1,0.5), (2,1.0), (3,0.2)}
Cardinality: |A| = 0.5 + 1.0 + 0.2 = 1.7
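The fuzzy cardinality of the example set is just the sum of its membership grades:

```python
# Fuzzy cardinality: sum of the membership grades of all elements.
A = {1: 0.5, 2: 1.0, 3: 0.2}
card = sum(A.values())
print(round(card, 2))   # 1.7
```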
● A discrete universe consists of distinct, individual elements that are countable and separate from each
other. You can list all elements explicitly. The set of letters, set of natural numbers is an example of a
discrete universe.
● A continuous universe consists of values that form a continuum and can take any value within a given
range. The elements are not distinct; they vary smoothly. You cannot list the elements explicitly, but
you can define a range. Temperature within a range and time within a range are examples of a
continuous universe.
If the universe is discrete with elements xi, then a fuzzy set may be specified using a convenient form of
notation due to Zadeh, in which each element is paired with its grade of membership in the form of a "formal
series" as

A = μA(x1)/x1 + μA(x2)/x2 + ... + μA(xn)/xn

or

A = Σi μA(xi)/xi
Important Note: Here "+" signifies combination, not arithmetic. The "+" symbol is just a way to combine
all the pairs. In computational systems, it might mean iterating over or processing each (xi, μA(xi)) pair
independently. There is no summation or computation happening here.
If the universe is continuous, an equivalent form of notation is given in terms of a "symbolic" integration:

A = ∫X μA(x)/x
Important Note: For continuous fuzzy sets, the integration symbol is also symbolic and is not intended to
perform an actual numerical integration.
For example, a membership function peaked at x = a defines a fuzzy set A whose elements x vaguely
represent those satisfying the crisp relation x = a. This fuzzy set corresponds to a fuzzy relation.
In this notation, a crisp set S and a fuzzy set F on the universe X may be written as
S = {s | s ∈ X}
F = {(s, μF(s)) | s ∈ X, where μF(s) is the degree of membership of s}
Crisp Sets vs. Fuzzy Sets
● Crisp sets do not handle uncertainty or vagueness. They are rigid and precise, making them suitable
for well-defined, clear-cut categories. Fuzzy sets are designed to handle uncertainty and vagueness;
they are useful for representing concepts that are not clearly defined or have gradations.
● Crisp sets are used in traditional logic, computer science (e.g., binary decisions), and mathematics,
where clear boundaries are required. Fuzzy sets are used in areas like control systems, artificial
intelligence, decision-making, and pattern recognition, where human-like reasoning and handling of
ambiguity are needed.
Let’s consider a universe U = {1, 2, 3, 4, 5} and a crisp set A representing "Even Numbers". Membership
function of crisp set A: μA(1) = 0, μA(2) = 1, μA(3) = 0, μA(4) = 1, μA(5) = 0. This is essentially a fuzzy set with
only two possible values, 0 or 1, making it a crisp set.
A general fuzzy set cannot always be a crisp set because a fuzzy set allows membership values between 0
and 1, while a crisp set only allows 0 or 1.
An Example: Consider a fuzzy set A representing the "likelihood of passing an exam" for students in a class of
5 students:
A={(S1,0.2),(S2,0.5),(S3,1.0),(S4,0.7),(S5,0.9)}
Here, the fuzzy set A has 5 elements, so its cardinality is finite.
An Example: A fuzzy set representing "tall heights" over all real numbers:
A = {(x, μA(x)) | x ∈ R, μA(x) = e^(−(180 − x)²/50)}
Here, the universal set is the set of all real numbers, making the cardinality infinite.
Types of membership functions
The fuzzy membership function is the graphical way of visualizing the degree of membership of any value in a
given fuzzy set. In the graph, the X-axis represents the universe of discourse and the Y-axis represents the
degree of membership in the range [0, 1].
1. Triangular membership function: This is one of the most widely accepted and used membership
functions (MF) in fuzzy controller design. The triangle that fuzzifies the input is defined by three
parameters a, b and c, where a and c define the feet (base) of the triangle and b defines its peak.
2. Trapezoidal membership function: The trapezoidal membership function is defined by four parameters:
a, b, c and d. The span from b to c takes the highest membership value; if x lies in (a, b) or (c, d),
its membership value lies between 0 and 1.
3. Gaussian membership function: A Gaussian MF is specified by two parameters {m, σ} and can be
defined as follows:

μ(x) = exp(−(x − m)² / (2σ²))
In this function, m represents the mean / center of the gaussian curve and σ represents the spread of
the curve. This is a more natural way of representing the data distribution, but due to mathematical
complexity, it is not much used for fuzzification.
4. Generalized bell-shaped function: A generalized bell MF is specified by three parameters {a, b, c} and
can be defined as follows:

μ(x) = 1 / (1 + |(x − c)/a|^(2b))
5. Sigmoid membership function: Sigmoid functions are widely used in classification tasks in machine
learning. It is controlled by parameters a and c, where a controls the slope at the crossover point x = c.
Mathematically, it is defined as

μ(x) = 1 / (1 + e^(−a(x − c)))
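The five membership functions above can be sketched directly from their defining formulas; the parameter values in the sample calls are illustrative assumptions.

```python
import math

# Standard membership functions; each returns a grade in [0, 1].
def triangular(x, a, b, c):
    # a and c are the feet of the triangle, b is the peak
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def trapezoidal(x, a, b, c, d):
    # grade is 1 on [b, c], ramps over (a, b) and (c, d)
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def gaussian(x, m, sigma):
    # m is the center, sigma the spread
    return math.exp(-((x - m) ** 2) / (2 * sigma ** 2))

def bell(x, a, b, c):
    # generalized bell with width a, slope b, center c
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def sigmoid(x, a, c):
    # a controls the slope at the crossover point x = c
    return 1.0 / (1.0 + math.exp(-a * (x - c)))

print(triangular(25, 20, 30, 40))   # 0.5 (halfway up the left slope)
print(gaussian(30, 30, 5))          # 1.0 (at the center m)
print(sigmoid(30, 1, 30))           # 0.5 (at the crossover point x = c)
```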
A graphic (membership function) representation of the complement of a fuzzy set (or negation of a fuzzy
state), μA′(x) = 1 − μA(x), is given as
An Example: Consider the fuzzy set of "hot temperatures". This may be represented by the membership
function shown using a solid line in Figure given below. This is the set containing all values of hot temperature
in a specified universe. In fuzzy logic then, this membership function can represent the fuzzy logic state "hot"
or the fuzzy statement, "the temperature is hot". The complement of the fuzzy set is represented by the dotted
line in the same Figure. This is the set containing all temperature values that are not hot, in the given universe.
The union corresponds to a logical OR operation (called Disjunction), and is denoted by A ∨ B, where A and B
are fuzzy states or fuzzy propositions. It is computed as μA∨B(x) = max[μA(x), μB(x)]. The rationale for using
max to represent a fuzzy-set union is that, because element x may belong to one set or the other, the larger of
the two membership grades should govern the outcome (union). A graphic (membership function)
representation of the union of two fuzzy sets (or the logical combination OR of two fuzzy states in the same
universe) is given in the following Figure.
An Example
Consider a universe representing the driving speeds on a highway, in km/h. Suppose that the fuzzy logic state
“Fast” is given by the discrete membership function
F = 0.6/80 + 0.8/90 + 1.0/100 + 1.0/110 + 1.0/120
Then the combined fuzzy condition “Fast OR Medium” is given by the membership function
F ∨ M = 0.6/50 + 0.8/60 + 1.0/70 + 1.0/80 + 0.8/90 + 1.0/100 + 1.0/110 + 1.0/120
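The union can be computed pointwise with max. F is the "Fast" set from the text; M ("Medium") is not listed explicitly in the notes, so the values below are assumptions chosen so that the computed union matches the F ∨ M membership function given above.

```python
# Fuzzy union (OR) via max over membership grades.
F = {80: 0.6, 90: 0.8, 100: 1.0, 110: 1.0, 120: 1.0}
M = {50: 0.6, 60: 0.8, 70: 1.0, 80: 1.0, 90: 0.8}   # assumed "Medium" set

# Elements missing from a set are taken to have grade 0.
union = {x: max(F.get(x, 0.0), M.get(x, 0.0)) for x in set(F) | set(M)}
print({x: union[x] for x in sorted(union)})
# {50: 0.6, 60: 0.8, 70: 1.0, 80: 1.0, 90: 0.8, 100: 1.0, 110: 1.0, 120: 1.0}
```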
Even though set union is defined for sets in a common universe, a logical "OR" may also be applied to
concepts in different universes. In particular, when the operands belong to different universes, orthogonal
axes have to be used to represent them in a common membership function.
An Example
Consider a universe representing room temperature (in °C) and another universe representing relative
humidity (%). Suppose that an acceptable temperature is given by the membership function
T = 0.4/16 + 0.8/18 + 1.0/20 + 1.0/22 + 0.8/24 + 0.5/26
Then the fuzzy condition "Acceptable Temperature OR Acceptable Humidity" is given by the following
membership function with a two-dimensional universe:
                          Temperature (°C)
                       16    18    20    22    24    26
Relative        0     0.4   0.8   1.0   1.0   0.8   0.5
Humidity       20     0.8   0.8   1.0   1.0   0.8   0.8
(%)            40     1.0   1.0   1.0   1.0   1.0   1.0
               60     0.6   0.8   1.0   1.0   0.8   0.6
               80     0.4   0.8   1.0   1.0   0.8   0.8
Consider two fuzzy sets A and B in the same universe X. Their intersection is a fuzzy set containing all
the elements that are common to both sets.
The intersection corresponds to a logical AND operation (called Conjunction), and is denoted by A ∧ B, where
A and B are fuzzy states or fuzzy propositions. The rationale for the use of min to represent fuzzy-set
intersection is that, because the element x must simultaneously belong to both sets, the smaller of the two
membership grades should govern the outcome (intersection).
A graphic (membership function) representation of the intersection of two fuzzy sets (or the logical combination
AND of two fuzzy states in the same universe) is given in Figure below-
An Example
Universe of discourse (X): Temperatures in degrees Celsius.
Fuzzy set A: "Warm temperatures."
A = {20:0.2, 25:0.5, 30:0.8, 35:1.0, 40:0.6}
Intersection (A ∩ B): "Warm and hot temperatures."
A ∩ B = {20:0.0, 25:0.0, 30:0.4, 35:0.7, 40:0.6, 45:0.0, 50:0.0}
Two fuzzy sets A and B are said to be equal (A = B) if μA(x) = μB(x) for every element x in the universe.
An Example
Universe of Discourse, temperatures: X={20,25,30,35}
Fuzzy Set A: "Warm temperatures"
A={(20,0.2), (25,0.5), (30,0.8), (35,1.0)}
Fuzzy Set B (with the same membership values):
B={(20,0.2), (25,0.5), (30,0.8), (35,1.0)}
Since every element has an identical membership value in both sets, we can say that A = B.
1. Algebraic Sum (+): The algebraic sum of two fuzzy sets A and B is defined as:
μA+B(x) = μA(x) + μB(x) − μA(x)·μB(x)
This represents the union of fuzzy sets A and B, accounting for their overlap.
An Example:
Let μA(x) = 0.6 and μB(x) = 0.4
μA+B(x) = 0.6 + 0.4 − (0.6 × 0.4) = 1.0 − 0.24 = 0.76
2. Algebraic Product (·): The algebraic product of two fuzzy sets A and B is defined as:
μA·B(x) = μA(x) · μB(x)
An Example:
Let μA(x) = 0.6 and μB(x) = 0.4
μA·B(x) = 0.6 × 0.4 = 0.24
3. Bounded Sum (⊕): The bounded sum of two fuzzy sets A and B is defined as:
μA⊕B(x) = min(1, μA(x) + μB(x))
This ensures that the combined membership value does not exceed 1.
An Example
Let μA(x) = 0.8 and μB(x) = 0.5.
μA⊕B(x) = min(1, 0.8 + 0.5) = min(1, 1.3) = 1
4. Bounded Difference (⊖): The bounded difference of two fuzzy sets A and B is defined as:
μA⊖B(x) = max(0, μA(x) − μB(x))
This represents the difference between the memberships of A and B, ensuring the result is non-negative.
An Example:
Let μA(x) = 0.7 and μB(x) = 0.4.
μA⊖B(x) = max(0, 0.7 − 0.4) = 0.3
5. Bounded Product (☉): The bounded product of two fuzzy sets A and B is defined as:
μA☉B(x) = max(0, μA(x) + μB(x) − 1)
An Example:
Let μA(x) = 0.7 and μB(x) = 0.4.
μA☉B(x) = max(0, 0.7 + 0.4 - 1) = 0.1
Additional Note: The scalar product of a fuzzy set A with a scalar α (0 ≤ α ≤ 1) is defined as:
μαA(x) = α · μA(x)
where:
❖ α is a scalar constant.
❖ μA(x) is the membership function of the fuzzy set A.
❖ μαA(x) is the new membership function after applying the scalar multiplication.
An Example
Let’s consider a fuzzy set A representing "membership in a good student category" based on some
performance levels:
A={(S1,0.2),(S2,0.5),(S3,0.7),(S4,1.0)}
Apply the scalar multiplication using α=0.6
Solution:
Computing for each element:
A′ = {(S1,0.6×0.2),(S2,0.6×0.5),(S3,0.6×0.7),(S4,0.6×1.0)} = {(S1,0.12),(S2,0.3),(S3,0.42),(S4,0.6)}
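The five operations above can be sketched as simple pointwise functions; the worked numbers from the text serve as a check:

```python
# A minimal sketch of the pointwise fuzzy-set operations described above.
def alg_sum(a, b):      # algebraic sum: a + b - a*b
    return a + b - a * b

def alg_prod(a, b):     # algebraic product: a*b
    return a * b

def bounded_sum(a, b):  # capped at 1: min(1, a + b)
    return min(1.0, a + b)

def bounded_diff(a, b): # floored at 0: max(0, a - b)
    return max(0.0, a - b)

def bounded_prod(a, b): # max(0, a + b - 1)
    return max(0.0, a + b - 1.0)

print(alg_sum(0.6, 0.4))      # approximately 0.76
print(alg_prod(0.6, 0.4))     # approximately 0.24
print(bounded_sum(0.8, 0.5))  # 1.0
print(bounded_diff(0.7, 0.4)) # approximately 0.3
print(bounded_prod(0.7, 0.4)) # approximately 0.1
```

Applying any of these functions across every element of two discrete fuzzy sets (as in the DIY exercise that follows) is a simple loop over the common universe.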
DIY:
Q.1 Given the two fuzzy sets
B1 = {1/1.0 + 0.75/1.5 + 0.3/2.0 + 0.15/2.5 + 0/3.0}
B2 = {1/1.0 + 0.6/1.5 + 0.2/2.0 + 0.1/2.5 + 0/3.0}
Find the algebraic sum, algebraic product, bounded sum, bounded difference and bounded product of
the given fuzzy sets.
Basic laws of fuzzy logic
Consider three general fuzzy sets A, B, and C defined in a common universe X. Let ϕ denote the null set (a set
with no elements, and hence having a membership function of zero value). With this notation, some important
properties of fuzzy sets are summarized here
Exclusion:
Law of excluded middle: A ∪ A′ ≠ X
Law of contradiction: A ∩ A′ ≠ ∅
(In crisp/classical sets A ∪ A′ = X and A ∩ A′ = ∅, i.e., an element must either belong to set A or to its complement A′, but never to both; fuzzy sets violate both laws.)
Distributivity
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
DeMorgan’s Laws
(A ∪ B)′ = A′ ∩ B′
(A ∩ B)′ = A′ ∪ B′
Commutativity
A ∪ B = B ∪ A;  A ∩ B = B ∩ A
Associativity
(A ∪ B) ∪ C = A ∪ (B ∪ C);  (A ∩ B) ∩ C = A ∩ (B ∩ C)
Absorption
A ∪ (A ∩ B) = A;  A ∩ (A ∪ B) = A
Idempotency
A ∪ A = A;  A ∩ A = A
(Idem = same; potent = power)
(Similar to unity or identity operation)
Boundary conditions
A ∪ X = X;  A ∩ X = A;  A ∪ ϕ = A;  A ∩ ϕ = ϕ
Transitivity: If A ⊆ B ⊆ C, then A ⊆ C
● DeMorgan’s Laws are particularly useful in simplifying (processing) expressions of sets (and logic).
● Law of Excluded Middle violation: Consider a fuzzy set of "tall people." Someone who is 5'10" (1.78 m)
might have a membership value of 0.6 in the "tall" set and hence 0.4 in the "not tall" set (its
complement). Taking the union with the max operation gives max(0.6, 0.4) = 0.6 ≠ 1, and the same
happens for other heights, e.g., max(0.3, 0.7) = 0.7 ≠ 1. This shows that A ∪ A′ ≠ X (where A′ is the
complement of A), violating the law of excluded middle.
● Law of Contradiction violation: Using the same "tall people" example, someone who is 5'10" can
simultaneously belong to both the "tall" (0.6) and "not tall" (0.4) sets. The intersection is not empty:
min(0.6, 0.4) = 0.4 ≠ 0, so A ∩ A′ ≠ ∅. This is perfectly valid in fuzzy logic because we are dealing
with approximate reasoning and linguistic variables.
● To see how the law of excluded middle is violated by fuzzy sets, consider the example shown in Figure below.
Here a broken line is used to represent a fuzzy set A. The complement A′ is shown by a dotted line,
which is obtained by subtracting the original membership function from 1. Next, the union of A and A′ is
performed using the max operation. The resulting membership function is given by the solid line. Note
that the result is not uniformly equal to 1 (i.e., not equal to μX). In particular, the middle part of μX is
excluded in the result.
Law 2: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
Let us take an example to verify this second distributive law of fuzzy sets:
A = {(1, 0.5), (2, 0.8), (3, 0.6)}
B = {(1, 0.7), (2, 0.4), (3, 0.9)}
C = {(1, 0.6), (2, 0.9), (3, 0.3)}
Result: (A ∪ B)′ = A′ ∩ B′
Law 2: (A ∩ B)′ = A′ ∪ B′
Result: (A ∩ B)′ = A′ ∪ B′
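The distributive and DeMorgan laws can be verified numerically for the sets A, B and C given above, using max for union, min for intersection and 1 − μ for the complement; a minimal sketch:

```python
# Verify the distributive and DeMorgan laws for the fuzzy sets A, B, C above
# (union = max, intersection = min, complement = 1 - mu).
A = {1: 0.5, 2: 0.8, 3: 0.6}
B = {1: 0.7, 2: 0.4, 3: 0.9}
C = {1: 0.6, 2: 0.9, 3: 0.3}
X = [1, 2, 3]

union = lambda p, q: {x: max(p[x], q[x]) for x in X}
inter = lambda p, q: {x: min(p[x], q[x]) for x in X}
comp  = lambda p:    {x: 1 - p[x] for x in X}

# Distributivity: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
lhs = inter(A, union(B, C))
rhs = union(inter(A, B), inter(A, C))
print(lhs == rhs)  # True

# DeMorgan: (A ∩ B)' = A' ∪ B'
print(comp(inter(A, B)) == union(comp(A), comp(B)))  # True
```

Note that the excluded-middle and contradiction laws would fail under the same operations, exactly as discussed earlier in this section.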
An Example
Suppose that A denotes the set of "my true statements". Then, in the universe of "all my statements", the
complement of A (i.e., A′) denotes the set of "my false statements".
a) Using A, A′, and the operations of union (∪) and equality (=) of sets, write an equation (in sets) that is
equivalent to the logic proposition: "All my statements are false."
b) Show that according to bivalent, crisp logic (i.e., assuming that A is a crisp set) the statement in (a)
is a contradiction.
c) Suppose that fuzzy logic (using fuzzy sets) is used with the statement in (a), where the complement
operation is given by μA′(x) = 1 − μA(x) and the union operation is represented by “max”. For what
membership values of A does the statement in (a) hold?
A fuzzy set whose support is a single element in X with μ A(x) = 1 is referred to as a fuzzy singleton i.e. only
one element has a membership value of 1, and all other elements have a membership value of 0.
An Example
Let X = {1,2,3,4,5} and define a fuzzy set A as:
A={(1,0),(2,0),(3,1),(4,0),(5,0)}
Here, only element 3 has full membership (μA(3)=1), while all others have membership 0 hence A is a fuzzy
singleton.
A fuzzy set whose membership function has at least one element x in the universe whose membership value is
unity is called a normal fuzzy set. The element of the normal fuzzy set for which the membership is equal to 1
is called the prototypical element.
An Example
A = {(1,0.2),(2,0.5),(3,1),(4,0.8)}
Subnormal fuzzy set
Note:
1. The maximum value of the membership function in a fuzzy set A is called the height of the fuzzy set.
For a normal fuzzy set, the height is equal to 1 because the maximum value of the membership
function allowed is 1. Thus, if the height of a fuzzy set is less than 1, then the fuzzy set is called a
subnormal fuzzy set.
2. A prototypical element can also refer to the most representative element of the fuzzy set, even if it does
not necessarily have a membership value of 1 (like in the case of a subnormal fuzzy set where no core
element exists). In such cases, the prototypical element is chosen based on the context, such as the
central value in a fuzzy number (e.g., the peak of a triangular membership function).
The element in the universe for which a particular fuzzy set A has its membership value equal to 0.5 is called a
crossover point of the membership function, i.e., μA(x) = 0.5. For a fuzzy set, the bandwidth (or width) is
defined as the distance between the two unique crossover points:
Bandwidth(A) = |x1 − x2|, where μA(x1) = μA(x2) = 0.5
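A small sketch can illustrate height, normality, crossover points and bandwidth together; the triangular membership function below is an assumed example (it is not taken from the notes):

```python
# Illustration of height, crossover points and bandwidth, using a
# hypothetical triangular membership function peaking at x = 5 (an assumed
# example for demonstration).
def mu(x):
    # Triangle with support [3, 7] and peak mu = 1 at x = 5.
    if 3 <= x <= 5:
        return (x - 3) / 2
    if 5 < x <= 7:
        return (7 - x) / 2
    return 0.0

xs = [i / 100 for i in range(0, 1001)]  # sample the universe [0, 10]
height = max(mu(x) for x in xs)
print("height:", height)  # 1.0, so this is a normal fuzzy set

# Crossover points: mu(x) = 0.5 at x = 4 and x = 6, so bandwidth = |4 - 6| = 2.
crossovers = [x for x in xs if abs(mu(x) - 0.5) < 1e-9]
bw = abs(crossovers[0] - crossovers[-1])
print("crossovers:", crossovers)  # [4.0, 6.0]
print("bandwidth:", bw)           # 2.0
```

If the peak value were lowered below 1, the same code would report a height below 1, i.e., a subnormal fuzzy set.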
A convex fuzzy set has a membership function whose membership values are strictly monotonically
increasing, or strictly monotonically decreasing, or strictly monotonically increasing and then strictly
monotonically decreasing, with increasing values of the elements in the universe.
The convex normal fuzzy set can be defined in the following way. For elements x1, x2 and x3 in a fuzzy set A
with x1 < x2 < x3, if the relation
μA(x2) ≥ min(μA(x1), μA(x3))
holds, then A is said to be a convex fuzzy set. That is, the membership of the middle element x2 should be
greater than or equal to the smaller of the memberships of x1 and x3.
A fuzzy set possessing characteristics opposite to those of a convex fuzzy set is called a non-convex fuzzy set, i.e.,
the membership values of the membership function are not strictly monotonically increasing, decreasing, or
increasing and then decreasing.
convex normal fuzzy set (left) and non-convex normal fuzzy set (right)
Tip: The intersection of two convex fuzzy sets is also a convex fuzzy set.
When the fuzzy set A is a convex single-point normal fuzzy set defined on the real line, then A is termed a
fuzzy number.
An ordered r-tuple is an ordered sequence of r elements expressed in the form (a1, a2, a3, ..., ar). An unordered
tuple is a collection of r elements without any restriction on order. For r = 2, the r-tuple is called an ordered
pair.
For crisp sets A₁, A₂, ... , Aᵣ, the set of all r-tuples (a₁, a₂, a₃, ... , aᵣ), where a₁ ∈ A₁, a₂ ∈ A₂ ... , aᵣ ∈ Aᵣ, is
called the Cartesian product of A₁, A₂ ... , Aᵣ and is denoted by A₁ × A₂ × . . . × Aᵣ. The Cartesian product of
two or more sets is not the same as the arithmetic product of two or more sets. If all the Aᵣ's are identical and
equal to A, then the Cartesian product A₁ × A₂ × · · · × Aᵣ is denoted as Aʳ.
An r-ary relation over A₁, A₂, ... , Aᵣ is a subset of the Cartesian product A₁ × A₂ × · · · × Aᵣ. When r = 2, the
relation is a subset of the Cartesian product A₁ × A₂. This is called a binary relation from A₁ to A₂. When three,
four or five sets are involved in the subset of a full Cartesian product then the relations are called ternary,
quaternary and quinary respectively.
X × Y = {(x, y) | x ∈ X, y ∈ Y}
Here the Cartesian product forms an ordered pair of every x ∈ X with every y ∈ Y. Every element in X is
completely related to every element in Y. The characteristic function, denoted by χ, gives the strength of the
relationship between ordered pairs of elements in each universe. If it takes unity as its value, then a complete
relationship is found; if the value is zero, then there is no relationship, i.e.
χR(x, y) = 1 if (x, y) ∈ R, and χR(x, y) = 0 if (x, y) ∉ R
When the universes or sets are finite, then the relation is represented by a matrix called relation matrix. An r-
dimensional relation matrix represents an r-ary relation. Thus, binary relations are represented by two-
dimensional matrices.
An Example
Consider the elements defined in the universes X and Y as follows:
X = {2,4,6}
Y = {p,q,r}
The Cartesian product X×Y consists of all ordered pairs (x,y), where x ∈ X and y ∈ Y.
X × Y = {(2,p), (2,q), (2,r), (4,p), (4,q), (4,r), (6,p), (6,q), (6,r)}
Consider the subset (relation) R = {(2,p), (4,q), (6,r)} of X × Y. It can be represented in the form of a matrix as follows:
p q r
2 1 0 0
4 0 1 0
6 0 0 1
Each subset pair is represented as a point in the Cartesian plane, where the X-axis corresponds to elements of
X and the Y-axis corresponds to elements of Y. Points like (2,p), (4,q), and (6,r) are plotted. The relationship
between X and Y is depicted with arrows connecting elements of X to their corresponding elements in Y as per
the subset pairs. The arrows indicate the mapping between these elements.
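The Cartesian product and the relation matrix above can be sketched as:

```python
# Build the Cartesian product X x Y and the relation matrix for the
# subset {(2,p), (4,q), (6,r)} discussed above.
X = [2, 4, 6]
Y = ['p', 'q', 'r']

cartesian = [(x, y) for x in X for y in Y]
print(cartesian)  # all 9 ordered pairs

R = {(2, 'p'), (4, 'q'), (6, 'r')}  # the relation, a subset of X x Y
# Characteristic function: 1 when the pair is in the relation, else 0.
matrix = [[1 if (x, y) in R else 0 for y in Y] for x in X]
for row in matrix:
    print(row)
```

The printed matrix reproduces the identity-like pattern shown in the table above: each element of X maps to exactly one element of Y.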
Do It Yourself
The elements in two sets are given as
A = {2, 4}
B = {a, b, c}
Find the following Cartesian product of two sets
(i) A × B (ii) B × A (iii) A × A (iv) B × B
This notation R : X → Y represents a relation R from set X to set Y. R associates elements of X (called the
domain of the relation) with elements of Y (called the codomain), i.e., it consists of ordered pairs (x, y) with x
∈ X and y ∈ Y. R specifies which elements of X relate to which
elements of Y, but it does not require that every x ∈ X must be related to a y ∈ Y, or vice versa.
● A Constrained relation is one where the elements in the domain X are related to elements in the
codomain Y based on specific conditions or rules. Let X = {1, 2, 3} and Y = {2, 4, 6}. The relation
R={(x,y) ∣ y = 2x} is a constrained relation because y must equal 2x.
● An Unconstrained relation is one that has no specific condition or rule governing the relationship
between elements of X and Y. Any element of X can be related to any element of Y. Example: X =
{1,2,3}, Y={2,4,6} and R={(1,2), (2,4), (3,6), (1,4)}.
● A Universal/Complete relation is a relation where every element of X is related to every element of Y.
In other words, it includes all possible ordered pairs (x, y) where x ∈ X and y ∈ Y. Let X={1, 2} and
Y={a, b}. The universal relation R={(1,a), (1,b), (2,a), (2,b)}.
● An Identity relation relates every element of a set X to itself, and only to itself. Let X = {1, 2, 3}. The
identity relation I = {(1, 1), (2, 2), (3, 3)}.
● The null relation is a relation where no pair of elements from the sets is related. It is represented by an
empty set. If R is a relation between two sets X and Y, the null relation is: R=∅. For X={1, 2} and Y={a,
b}, the null relation is: R=∅
The cardinality of a relation refers to the number of ordered pairs in the relation. A relation R ⊆ X×Y is a
subset of the Cartesian product of two sets X and Y.
Example:
Let X = {1,2} and Y = {a,b}.
The Cartesian product X × Y = {(1, a), (1, b), (2, a), (2, b)}, so any relation R from X to Y has cardinality at most 4.
The power set of a set S is the set of all possible subsets of S. If |S| = n, then the power set P(S) has 2ⁿ
elements.
Example:
Let S = {1,2}.
The power set P(S) = {∅, {1}, {2}, {1,2}}.
The cardinality of P(S) is |P(S)| = 2² = 4.
R1 ∩ R2 = ∅ (Null Relation)
The matrix representation is as follows:
R1 ⊈ R2 and R2 ⊈ R1
Law of Contradiction: R ∩ Rᶜ = ∅
This means that R ∘ S contains a pair (x, z) if there exists some element y in Y such that x is related to y in R
and y is related to z in S.
An Example
Consider the sets X = {1, 2}, Y = {a, b}, Z = {p, q} & relation R from X to Y:
R = {(1, a),(2, b)}
& Relation S from Y to Z:
S = {(a, p),(b, q)}
The matrices for relations R and S are as follows (rows correspond to X, columns correspond to Y):
Max-Min Composition
Mathematical definition for relations represented as matrices:
T(i, k) = maxⱼ [ min( R(i, j), S(j, k) ) ]
Where:
● i indexes rows (elements of X),
● j indexes columns in R and rows in S (elements of Y),
● k indexes columns (elements of Z).
The "min" operation finds the weakest link in each path (as it takes the smallest value); then "max" selects the
strongest among these weak links. It is like finding the most reliable path by considering the weakest point in
each possible route.
An Example
Using the same matrices for R and S (from the classical-relation composition example above):
For (1,p):
min(R(1,a), S(a,p)) = min(1,1) = 1
min(R(1,b), S(b,p)) = min(0,0) = 0
max(1,0) = 1
For (1,q):
min(R(1,a), S(a,q)) = min(1,0) = 0
min(R(1,b), S(b,q)) = min(0,1) = 0
max(0,0) = 0
For (2,p):
min(R(2,a), S(a,p)) = min(0,1) = 0
min(R(2,b), S(b,p)) = min(1,0) = 0
max(0,0) = 0
For (2,q):
min(R(2,a), S(a,q)) = min(0,0) = 0
min(R(2,b), S(b,q)) = min(1,1) = 1
max(0,1) = 1
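The worked max-min composition above can be checked with a short sketch:

```python
# Max-min composition of the crisp relations R (X to Y) and S (Y to Z)
# from the example above.
X, Y, Z = [1, 2], ['a', 'b'], ['p', 'q']
R = {(1, 'a'): 1, (1, 'b'): 0, (2, 'a'): 0, (2, 'b'): 1}
S = {('a', 'p'): 1, ('a', 'q'): 0, ('b', 'p'): 0, ('b', 'q'): 1}

def max_min(R, S, X, Y, Z):
    """T(x, z) = max over y of min(R(x, y), S(y, z))."""
    return {(x, z): max(min(R[(x, y)], S[(y, z)]) for y in Y)
            for x in X for z in Z}

T = max_min(R, S, X, Y, Z)
print(T)  # {(1, 'p'): 1, (1, 'q'): 0, (2, 'p'): 0, (2, 'q'): 1}
```

The same function works unchanged for fuzzy relations, since it only relies on min and max.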
Max-Product Composition
The Max-Product Composition replaces the min operation with multiplication:
T(i, k) = maxⱼ [ R(i, j) × S(j, k) ]
The "product" operation gives the combined strength of the entire path; then "max" selects the path with the
highest combined strength. It is like finding the most efficient path by considering the overall performance of
each route.
An Example
Using the same matrices for R and S:
For (1,p):
(R(1,a) × S(a,p)) = 1 × 1 = 1
(R(1,b) × S(b,p)) = 0 × 0 = 0
max(1,0) = 1
For (1,q):
(R(1,a) × S(a,q)) = 1 × 0 = 0
(R(1,b) × S(b,q)) = 0 × 1 = 0
max(0,0) = 0
For (2,p):
(R(2,a) × S(a,p)) = 0 × 1 = 0
(R(2,b) × S(b,p)) = 1 × 0 = 0
max(0,0) = 0
For (2,q):
(R(2,a) × S(a,q)) = 0 × 0 = 0
(R(2,b) × S(b,q)) = 1 × 1 = 1
max(0,1) = 1
Important note
● If the relation values are binary (0 or 1), both methods give the same result, since min(a, b) = a × b
when a, b ∈ {0, 1}.
● If the relation values are real numbers between 0 and 1, the multiplication step in Max-Product
Composition often leads to different values than Max-Min Composition.
Properties of Composition
Associativity: (R ∘ S) ∘ T = R ∘ (S ∘ T)
Non-Commutativity: R ∘ S ≠ S ∘ R (in general)
Inverse Relation: (R ∘ S)⁻¹ = S⁻¹ ∘ R⁻¹
Fuzzy Relation
A fuzzy relation is an extension of classical (crisp) relations where elements are not just related or unrelated,
but rather partially related with a degree of membership ranging between 0 and 1. It allows modeling of
vague, uncertain, or imprecise relationships. A fuzzy relation is a fuzzy set defined on the Cartesian product of
classical sets {X1, X2, X3, ..., Xn}, where tuples (x1, x2, x3, ..., xn) may have varying degrees of membership µR(x1, x2,
x3, ..., xn) within the relation.
A fuzzy relation R from set X to set Y is represented as a fuzzy subset of the Cartesian product X×Y.
Mathematically, it is defined as:
μR : X × Y → [0, 1]
Where μR(x,y) is the membership function that assigns a degree of relation between x ∈ X and y ∈ Y.
Let A be a fuzzy set on universe X and B be a fuzzy set on universe Y. The Cartesian product over A and B
results in fuzzy relation R and is contained within the entire (complete) Cartesian space, i.e., A x B = R where
R ⊂ X x Y. The membership function of the fuzzy relation is given by
μR(x, y) = μA×B(x, y) = min( μA(x), μB(y) )
If, for example, A contains three elements (a 3 × 1 column) and B contains four elements (a
1 × 4 row), the resulting fuzzy relation R will be represented by a matrix of size 3 × 4, i.e., R will have three rows
and four columns.
An Example
Consider the two fuzzy sets
A = {(x1, 0.3), (x2, 0.7), (x3, 1.0)}
B = {(y1, 0.4), (y2, 0.9)}
Perform the Cartesian product on these sets.
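One way to carry out this Cartesian product (using the min rule for the fuzzy relation) is:

```python
# Min-based fuzzy Cartesian product R = A x B for the two fuzzy sets above;
# R comes out as a 3 x 2 relation matrix.
A = {'x1': 0.3, 'x2': 0.7, 'x3': 1.0}
B = {'y1': 0.4, 'y2': 0.9}

R = {(x, y): min(A[x], B[y]) for x in A for y in B}
for x in A:
    print([R[(x, y)] for y in B])
# [0.3, 0.3]
# [0.4, 0.7]
# [0.4, 0.9]
```

Each entry takes the smaller of the two memberships, so every row is capped by μA(xi) and every column by μB(yj).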
For a fuzzy relation R on X × Y, the domain is the set of all elements in X that have at least one nonzero
relation with some element in Y: μdom(R)(x) = maxy μR(x, y)
Example: For the matrix given above; the domain is: {x1, x2, x3} because all these elements have at least one
nonzero entry.
The range is the set of all elements in Y that have at least one nonzero relation with some element in X: μran(R)(y) = maxx μR(x, y)
The range is: {y1, y2} since each column has at least one nonzero value.
A fuzzy graph is a graphical representation of a binary fuzzy relation. Each element in X and Y corresponds to
a node in the fuzzy graph. The connection links are established between the nodes by the elements of X x Y
with non-zero membership grades in R(X, Y). The links may also be present in the form of arcs. These links
are labeled with the membership values as μR(xi, yj).
Consider the following universe X = {x1, x2, x3, x4} and the binary fuzzy relation on X as
The bipartite graph and the simple fuzzy graph are as follows:
Let X = {x1, x2, x3, x4} and Y = {y1, y2, y3, y4}, and let R be a relation from X to Y given by
The cardinality of a fuzzy set or a fuzzy relation depends on whether the underlying universe is finite or infinite.
If the universe is infinite, the cardinality will be infinite; otherwise, it can be finite.
The basic operations on fuzzy sets also apply on fuzzy relations. Let R and S, be fuzzy relations on the
Cartesian space X x Y. The operations that can be performed on these fuzzy relations are described below,
consider the following sets and relations
We define relations over sets X={x1,x2} and Y={y1,y2}; Relation R1 and R2 are as follow-
1. Union: µR∪S(x, y) = max[ µR(x, y), µS(x, y) ]
2. Intersection: µR∩S(x, y) = min[ µR(x, y), µS(x, y) ]
3. Complement: µR̄(x, y) = 1 − µR(x, y)
4. Containment: R ⊆ S ⟺ µR(x, y) ≤ µS(x, y)
5. Inverse: The inverse of fuzzy relation R on X x Y is denoted by R -1. It is a relation on Y x X defined by R -1(y,
x) = R(x, y) for all pairs (y, x) ∈ Y x X
6. Projection: For a fuzzy relation R(X, Y), let [R ↓ Y] denote the projection of R onto Y. Then [R ↓ Y] is a fuzzy relation in Y whose membership function is μ[R↓Y](y) = maxx μR(x, y).
Similarly, let [R ↓ X] denote the projection of R onto X. Then [R ↓ X] is a fuzzy relation in X whose membership function is μ[R↓X](x) = maxy μR(x, y).
An Example
Consider the fuzzy relation:
R={(0.3,(x1,y1)),(0.7,(x1,y2)),(0.4,(x2,y1)),(0.8,(x2,y2))}
Note: The projection operation always provides a fuzzy set from the fuzzy relation.
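A short sketch of the projection operation applied to this relation:

```python
# Projections of the fuzzy relation R given above onto X and onto Y,
# taking the max over the other variable in each case.
R = {('x1', 'y1'): 0.3, ('x1', 'y2'): 0.7,
     ('x2', 'y1'): 0.4, ('x2', 'y2'): 0.8}
Xs = ['x1', 'x2']
Ys = ['y1', 'y2']

proj_X = {x: max(R[(x, y)] for y in Ys) for x in Xs}  # [R projected onto X]
proj_Y = {y: max(R[(x, y)] for x in Xs) for y in Ys}  # [R projected onto Y]
print(proj_X)  # {'x1': 0.7, 'x2': 0.8}
print(proj_Y)  # {'y1': 0.4, 'y2': 0.8}
```

As the note says, each projection is an ordinary fuzzy set, obtained by collapsing one dimension of the relation with max.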
Fuzzy Composition
Classical relations work with crisp values 0 and 1, while fuzzy relations work with membership values in [0, 1]
that require different operations to handle these partial truths. In fuzzy relations, we replace AND with min and
OR with max. Regular matrix multiplication is not used, since it would not properly capture the fuzzy nature of
the relationships between elements.
For example:
Classical: 1 × 1 = 1 (AND)
Fuzzy: min(0.7, 0.8) = 0.7
Let's consider two fuzzy relations R and S: R is a relation from set X = {x₁, x₂} to set Y = {y₁, y₂} S is a relation
from set Y = {y₁, y₂} to set Z = {z₁, z₂}
Given:
R = {((x₁,y₁), 0.8), ((x₁,y₂), 0.3), ((x₂,y₁), 0.2), ((x₂,y₂), 0.9)}
S = {((y₁,z₁), 0.4), ((y₁,z₂), 0.6), ((y₂,z₁), 0.1), ((y₂,z₂), 0.7)}
Max-min composition R ∘ S:
For element (x₁,z₁): max[min(0.8, 0.4), min(0.3, 0.1)] = max(0.4, 0.1) = 0.4
For element (x₁,z₂): max[min(0.8, 0.6), min(0.3, 0.7)] = max(0.6, 0.3) = 0.6
For element (x₂,z₁): max[min(0.2, 0.4), min(0.9, 0.1)] = max(0.2, 0.1) = 0.2
For element (x₂,z₂): max[min(0.2, 0.6), min(0.9, 0.7)] = max(0.2, 0.7) = 0.7
Max-product composition R ∘ S:
For element (x₁, z₁): max[(0.8 × 0.4), (0.3 × 0.1)] = max(0.32, 0.03) = 0.32
For element (x₁, z₂): max[(0.8 × 0.6), (0.3 × 0.7)] = max(0.48, 0.21) = 0.48
For element (x₂, z₁): max[(0.2 × 0.4), (0.9 × 0.1)] = max(0.08, 0.09) = 0.09
For element (x₂, z₂): max[(0.2 × 0.6), (0.9 × 0.7)] = max(0.12, 0.63) = 0.63
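Both compositions above can be checked with a short sketch:

```python
# Max-min and max-product composition of the fuzzy relations R and S above.
Y = ['y1', 'y2']
R = {('x1', 'y1'): 0.8, ('x1', 'y2'): 0.3,
     ('x2', 'y1'): 0.2, ('x2', 'y2'): 0.9}
S = {('y1', 'z1'): 0.4, ('y1', 'z2'): 0.6,
     ('y2', 'z1'): 0.1, ('y2', 'z2'): 0.7}

def compose(R, S, combine):
    """T(x, z) = max over y of combine(R(x, y), S(y, z))."""
    return {(x, z): max(combine(R[(x, y)], S[(y, z)]) for y in Y)
            for x in ['x1', 'x2'] for z in ['z1', 'z2']}

max_min  = compose(R, S, min)                 # fuzzy AND = min
max_prod = compose(R, S, lambda a, b: a * b)  # AND replaced by product
print(max_min)   # entries 0.4, 0.6, 0.2, 0.7 as worked above
print(max_prod)  # entries approximately 0.32, 0.48, 0.09, 0.63
```

Only the inner combining operation changes between the two compositions; the outer max is the same in both.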
Note: The properties of composition are the same for both classical and fuzzy relations.
Do it yourself
Q.1 Two fuzzy relations are given by
Obtain fuzzy relation T as a composition (for max-min and max-product) between the fuzzy relations.
Q.2 For a speed control of DC motor, the membership functions of series resistance, armature current and
speed are given as follows:
Compute relation T for relating series resistance to motor speed, i.e., R to N. Perform max-min composition
only.
(a) Find the fuzzy relation for the Cartesian product of A and B, i.e. R = A X B
(b) Introduce a fuzzy set C given by
Find the relation between C and B using Cartesian product i.e. find S = C x B
(c) Find C ∘ R using max-min composition
(d) Find C ∘ S using max-min composition
● A relation is said to be symmetric if for every edge pointing from vertex i to vertex j, there is an edge
pointing in the opposite direction, i.e., from vertex j to vertex i where i, j = 1, 2, 3, ....
● A relation is said to be transitive if, for every pair of edges in the graph, one pointing from vertex i to
vertex j and the other pointing from vertex j to vertex k, there is an edge pointing from vertex i to
vertex k.
Equivalence/Similarity Relation
Classical Equivalence/Similarity Relation
Let R be a relation on universe X, i.e., a relation from X to X. R is an equivalence relation if the following three
properties are satisfied:
1. Reflexive: ∀a ∈ X: (a,a) ∈ R
2. Symmetric: ∀a,b ∈ X: (a,b) ∈ R ⟺ (b,a) ∈ R
3. Transitive: ∀a,b,c ∈ X: ((a,b) ∈ R and (b,c) ∈ R) ⟹ (a,c) ∈ R
An Example
Let's consider a set A = {1, 2, 3, 4} and define a relation R based on "having the same parity" (both odd or both
even).
● Observe that (1, 1), (2, 2), (3, 3), and (4, 4) all belong to R. Since every element in A is related to itself
in R, the relation R is reflexive.
● (1, 3) has its counterpart (3, 1), (2, 4) has its counterpart (4, 2), (1, 1), (2, 2), (3, 3), and (4, 4) are their
own counterparts. For every pair (a, b) in R, the pair (b, a) also exists in R. Therefore, the relation R is
symmetric.
● Consider the following triples in R, where the first two pairs require the third by transitivity:
(1,3), (3,1) ⟹ (1,1);  (1,3), (3,3) ⟹ (1,3);  (1,1), (1,3) ⟹ (1,3)
(2,4), (4,2) ⟹ (2,2);  (2,4), (4,4) ⟹ (2,4);  (2,2), (2,4) ⟹ (2,4)
(3,1), (1,1) ⟹ (3,1);  (3,3), (3,1) ⟹ (3,1)
(4,2), (2,2) ⟹ (4,2);  (4,4), (4,2) ⟹ (4,2)
In every case the required pair is in R, so the relation is transitive.
● So the relation given above is a classical equivalence or classical similarity relation.
An Example
Let A = {1, 2, 3} and define a fuzzy relation R = {((1,1), 1.000), ((1,2), 0.800), ((1,3), 0.800), ((2,1), 0.800),
((2,2), 1.000), ((2,3), 0.800), ((3,1), 0.800), ((3,2), 0.800), ((3,3), 1.000)}.
● Observe that μR(1,1) = 1.000, μR(2,2) = 1.000 and μR(3,3) = 1.000 hence this relation is reflexive
● Consider μR(1,2) = μR(2,1) = 0.800 and μR(1,3) = μR(3,1) = 0.800 and μR(2,3) = μR(3,2) = 0.800 hence
the relation is symmetrical
● Observe that for x = 1, y = 2, z = 3: μR(1,3) = 0.800 and min(μR(1,2), μR(2,3)) = min(0.800, 0.800) =
0.800, so 0.800 ≥ 0.800 holds. Similarly, for x = 1, y = 3, z = 2: μR(1,2) = 0.800 and min(μR(1,3),
μR(3,2)) = min(0.800, 0.800) = 0.800, so 0.800 ≥ 0.800 is again true. This relation is transitive because
all values outside the diagonal are equal, so min(μR(x,y), μR(y,z)) will never be greater than μR(x,z).
● So the relation given above is a fuzzy equivalence or fuzzy similarity relation.
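The three checks above can be sketched programmatically:

```python
# Check reflexivity, symmetry and max-min transitivity for the fuzzy
# relation on A = {1, 2, 3} given above.
mu = {(1, 1): 1.0, (1, 2): 0.8, (1, 3): 0.8,
      (2, 1): 0.8, (2, 2): 1.0, (2, 3): 0.8,
      (3, 1): 0.8, (3, 2): 0.8, (3, 3): 1.0}
A = [1, 2, 3]

reflexive  = all(mu[(a, a)] == 1.0 for a in A)
symmetric  = all(mu[(a, b)] == mu[(b, a)] for a in A for b in A)
# Max-min transitivity: mu(x, z) >= max over y of min(mu(x, y), mu(y, z)).
transitive = all(mu[(x, z)] >= max(min(mu[(x, y)], mu[(y, z)]) for y in A)
                 for x in A for z in A)
print(reflexive, symmetric, transitive)  # True True True
```

All three tests pass, so the relation is a fuzzy equivalence (similarity) relation; the same `transitive` test fails for the fuzzy tolerance relation discussed next.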
Tolerance/Proximity Relation
Classical Tolerance/Proximity Relation
A tolerance relation is a relation that is reflexive & symmetric. It does NOT need to be transitive, unlike
equivalence relation.
An Example
Let A = {1, 2, 3, 4} and define relation R as "numbers differing by at most 1". R = {(1,1), (1,2), (2,1), (2,2), (2,3),
(3,2), (3,3), (3,4), (4,3), (4,4)}
● Observe (1,1), (2,2), (3,3), (4,4) means every element is related to itself hence this is a reflexive
relation.
● Observing (1,2), (2,1), (2,3), (3,2), (3,4), (4,3), every pair (a, b) in R has its counterpart (b, a) in R,
hence this is a symmetric relation.
● Observe that (1,2), (2,3) is in relation but the (1, 3) is not in the relation hence it is not transitive.
● Hence it is a classical tolerance relation.
An equivalence relation can be formed from a tolerance relation R1 by (n − 1) compositions of R1 with itself,
where n is the cardinality of the set that defines R1:
R1^(n−1) = R1 ∘ R1 ∘ ... ∘ R1 = R
Now compute R ∘ R.
This matrix now represents a fully connected relation, meaning no more pairs can be added; it is reflexive, symmetric
and transitive, hence it is an equivalence relation.
R = {((1,1), 1.000), ((1,2), 0.750), ((1,3), 0.500), ((1,4), 0.250), ((2,1), 0.750), ((2,2), 1.000), ((2,3), 0.750),
((2,4), 0.500), ((3,1), 0.500), ((3,2), 0.750), ((3,3), 1.000), ((3,4), 0.750), ((4,1), 0.250), ((4,2), 0.500), ((4,3),
0.750), ((4,4), 1.000)}
The matrix representation is as follow-
The fuzzy tolerance relation can be reformed into fuzzy equivalence relation in the same way as a crisp
tolerance relation is reformed into crisp equivalence relation.
R1^(n−1) = R1 ∘ R1 ∘ ... ∘ R1 = R
Note: Every equivalence relation is a tolerance relation (since it satisfies reflexivity and symmetry). Not
every tolerance relation is an equivalence relation.
The above relation is reflexive and symmetric but not transitive, because μR(1,3) = 0.750 and μR(3,4) = 0.750 but
μR(1,4) = 0.500; if max-min transitivity is applied, the requirement 0.500 ≥ min(0.750, 0.750) = 0.750 is false.
Let us now find the composition R ∘ R ∘ R in the same way as before.
The resulting relation is reflexive, symmetric and transitive, because μR(1,3) = 0.750 and μR(3,4) = 0.750 and now
μR(1,4) = 0.750; applying max-min transitivity, 0.750 ≥ min(0.750, 0.750) = 0.750 is true.
Do it yourself
Q.1 The following figure shows three relations on the universe X = {a, b, c}. Are these relations equivalence
relations?
Noninteractive Fuzzy Sets
Since "Tallness" and "Intelligence" are unrelated attributes, they are non-interactive fuzzy sets. Their
combination follows basic fuzzy set operations without influencing each other.
A fuzzy set A defined on the Cartesian space X = X1 × X2 is separable into two non-interactive fuzzy sets,
called orthogonal projections, if and only if:
A = OPrx1(A) × OPrx2(A)
This equation means that the membership function of A can be written as the (min-based) Cartesian product of
the membership functions of its projections:
μA(x1, x2) = min( μOPrx1(A)(x1), μOPrx2(A)(x2) )
This implies independence of x1 & x2. It means that knowing x1 tells us nothing about x2 and vice versa.
An Example
Let's define our spaces: X1 = {1, 2}, X2 = {a, b}, so X = X1 × X2 = {(1,a), (1,b), (2,a), (2,b)}, and a fuzzy set A that is
separable: A = {((1,a), 0.3), ((1,b), 0.3), ((2,a), 0.6), ((2,b), 0.6)}
The orthogonal projections (max over the other coordinate) are OPrx1(A) = {(1, 0.3), (2, 0.6)} and OPrx2(A) = {(a, 0.6), (b, 0.6)}.
Now, let's reconstruct A using the Cartesian product: μA(x1, x2) = min(μOPrx1(A)(x1), μOPrx2(A)(x2)); for each point:
● (1,a): min(0.3, 0.6) = 0.3
● (1,b): min(0.3, 0.6) = 0.3
● (2,a): min(0.6, 0.6) = 0.6
● (2,b): min(0.6, 0.6) = 0.6
The reconstructed values match our original set A, confirming that A is separable into orthogonal projections.
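The separability check above can be sketched as:

```python
# Check that the fuzzy set A above is separable into its orthogonal
# projections (projection = max over the other coordinate; reconstruction
# via the min-based Cartesian product).
A = {(1, 'a'): 0.3, (1, 'b'): 0.3, (2, 'a'): 0.6, (2, 'b'): 0.6}
X1 = [1, 2]
X2 = ['a', 'b']

proj1 = {x1: max(A[(x1, x2)] for x2 in X2) for x1 in X1}  # {1: 0.3, 2: 0.6}
proj2 = {x2: max(A[(x1, x2)] for x1 in X1) for x2 in X2}  # {'a': 0.6, 'b': 0.6}

reconstructed = {(x1, x2): min(proj1[x1], proj2[x2])
                 for x1 in X1 for x2 in X2}
print(reconstructed == A)  # True, so A is separable
```

Running the same check on the DIY set B below shows how a non-separable set fails the reconstruction test.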
Do it yourself
Q.1 For the same sets X1 and X2; Let B = {((1,a), 0.3), ((1,b), 0.8), ((2,a), 0.6), ((2,b), 0.4)}. Check if B is
separable into orthogonal projections.
Fuzzification
Fuzzification is the process of transforming a crisp set to a fuzzy set or a fuzzy set to a fuzzier set, i.e., crisp
quantities are converted to fuzzy quantities. For example, when one is told that the temperature is 9°C, the
person translates this crisp input value into linguistic variables such as cold or warm according to one's
knowledge and then makes a decision about needing to wear a jacket. If one fails to fuzzify, then it is not
possible to continue the decision process, or an erroneous decision may be reached.
For a fuzzy set A, a common fuzzification algorithm is performed by keeping μi constant and transforming xi
to a fuzzy set Q(xi) depicting the expression about xi. The fuzzy set Q(xi) is referred to as the
kernel of fuzzification. The fuzzified set Ã can be expressed as
Ã = μ1 Q(x1) + μ2 Q(x2) + ... + μn Q(xn)
Intuition
The intuition method is based upon the common intelligence of humans. It is the capacity of humans to develop
membership functions on the basis of their own intelligence and understanding.
Consider the figure below, which shows various membership curves for the weights of people (in kilograms) in the
universe. Each curve is a membership function corresponding to a fuzzy (linguistic) variable, such as
very light, light, normal, heavy and very heavy. The curves depend on context and on the person
developing them. For example, if the weights refer to the range of thin persons we get one set of
curves, and if they refer to the range of normal-weight persons we get another set, and so on. The
main characteristic of these curves for practical use is their overlap.
Do it yourself
Q.1 Using your own intuition, plot the fuzzy membership function for the age of people.
Inference
The inference method uses knowledge to perform deductive reasoning. Deduction achieves conclusion by
means of forward inference. There are various methods for performing deductive reasoning. Here the
knowledge of geometrical shapes and geometry is used for defining membership values. The membership
functions may be defined by various shapes: triangular, trapezoidal, bell-shaped, Gaussian and so on. The
inference method here is discussed via triangular shape.
Consider a triangle, where X, Y and Z are the angles such that X ≥ Y ≥ Z ≥ 0, and let U be the universe of triangles:
U = {(X, Y, Z) | X ≥ Y ≥ Z ≥ 0; X + Y + Z = 180°}
There are various types of triangles available. Here a few are considered to explain inference methodology:
l = Isosceles triangle (approximate) E = Equilateral triangle (approximate)
R = Right-angle triangle (approximate) IR = Isosceles and right-angle triangle (approximate)
T = Other triangle
By the method of inference, we can obtain the membership values for all the above-mentioned triangles, since we possess knowledge about the geometry of triangles.
If X = Y or Y = Z, the membership value of the approximate isosceles triangle is equal to 1. On the other hand,
if X = 120°, Y = 60° and Z = 0°, we get a membership value of 0.
If X = 90°, the membership value of a right-angle triangle is 1, and if X = 180°, the membership value μR
becomes 0:
X = 90° ⟹ μR = 1
X = 180° ⟹ μR = 0
The membership value of the approximate isosceles right-angle triangle is obtained by taking the logical
intersection (min) of the approximate isosceles and approximate right-angle triangle membership functions, i.e.
μIR = min(μI, μR)
To check this, you can substitute the values X = 80°, Y = 70° and Z = 30°.
The membership function of all other triangles, denoted by T, is the complement of the logical union of I, R and E,
i.e.
T = (I ∪ R ∪ E)’
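The membership formulas for the individual triangle shapes were lost in this copy of the notes; the sketch below uses the standard inference-method formulas (an assumption based on common textbook treatments), checked on the worked angles X = 80°, Y = 70°, Z = 30°:

```python
# Sketch of the inference method for triangle shapes, using the standard
# membership formulas (assumed here, since the original equations were lost):
#   mu_I  = 1 - (1/60)  * min(X - Y, Y - Z)   (approximate isosceles)
#   mu_R  = 1 - (1/90)  * |X - 90|            (approximate right angle)
#   mu_E  = 1 - (1/180) * (X - Z)             (approximate equilateral)
#   mu_IR = min(mu_I, mu_R)
#   mu_T  = min(1 - mu_I, 1 - mu_R, 1 - mu_E) (complement of the union)
def triangle_memberships(X, Y, Z):
    assert X >= Y >= Z >= 0 and X + Y + Z == 180
    mu_I = 1 - min(X - Y, Y - Z) / 60
    mu_R = 1 - abs(X - 90) / 90
    mu_E = 1 - (X - Z) / 180
    mu_IR = min(mu_I, mu_R)
    mu_T = min(1 - mu_I, 1 - mu_R, 1 - mu_E)
    return mu_I, mu_R, mu_E, mu_IR, mu_T

# The worked angles X = 80, Y = 70, Z = 30 from the text:
print(triangle_memberships(80, 70, 30))
```

With these formulas the degenerate triangle (120°, 60°, 0°) indeed gives μI = 0, and X = 90° gives μR = 1, matching the boundary cases stated above.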
Do it yourself
Q.1 Using the inference approach, find the membership values for the triangular shapes I,R, E, IR, and T for a
triangle with angles 45°, 55° and 80°.
Q.2 Using the inference approach, obtain the membership values for the triangular shapes (I, R, T) for a
triangle with angles 40°, 60° and 80°.
Rank Ordering
The formation of a government is based on the polling concept; to identify the best student, ranking may be
performed; to buy a car, one can ask for several opinions; and so on. All the above-mentioned activities are
carried out on the basis of the preferences made by an individual, a committee, a poll and other opinion
methods. This methodology can be adapted to assign the membership values to a fuzzy variable. Pairwise
comparisons enable us to determine preferences, and this results in the determination of the membership.
An Example
Imagine a group of students is voting for the best basketball player in their school. Each student ranks the top
players based on their skills, teamwork, and performance in past games.
Angular Fuzzy Sets
Consider the pH value of wastewater from a dyeing industry. These pH readings are assigned linguistic labels,
such as high base, medium acid, etc., to understand the quality of the polluted water. The pH value should be
taken care of because the waste from the dyeing industry should not be hazardous to the environment. As is
known, the neutral solution has a pH value of 7. The linguistic variables are built in such a way that a "neutral
(N)" solution corresponds to θ = 0 rad, and "exact base (EB)" and "exact acid (EA)" corresponds to θ=π/2 rad
and θ=-π/2 rad, respectively. The levels of pH between 7 and 14 can be termed "very base" (VB), "medium
base" (MB) and so on, and are represented between 0 and π/2 rad. Levels of pH between 0 and 7 can be termed
"very acid" (VA), "medium acid" (MA) and so on, and are represented between 0 rad and -π/2 rad. The model of the
angular fuzzy set using these linguistic labels for pH is shown in the figure below
The values of the linguistic variable with varying θ and their membership values lie on the μ(θ) axis. The
membership value corresponding to a linguistic term can be obtained from the following equation
μ(θ) = |z · tan(θ)|
where z is the horizontal projection of the radial vector. Angular fuzzy sets are best in cases with polar
coordinates or in cases where the value of the variable is cyclic. In Angular Fuzzy Sets, the membership
function depends on an angle θ, and projections are determined using trigonometric functions like cosine and
sine.
● Horizontal Projection (on the X-axis) is given by: μx(x) = μ(θ)·cos(θ)
● Vertical Projection (on the Y-axis) is given by: μy(y) = μ(θ)·sin(θ)
An Example
Q.1 The energy E of a particle spinning in a magnetic field B is given by the equation
E = μB sinθ
Where μ is the magnetic moment of a spinning particle and θ is the complement angle of the magnetic moment
with respect to the direction of the magnetic field. Assume the magnetic field B and magnetic moment μ to be
constant, and the linguistic terms for the complement angle of magnetic moment be given as
Find the membership values using the angular fuzzy set approach for these linguistic labels and plot these
values versus θ.
Now calculate the angular fuzzy membership values as shown in the table below
The plot for the membership function shown in this table is given in Figure
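The angular membership formula μ(θ) = |z·tan(θ)| can be sketched as follows; taking z = cos(θ), the horizontal projection of a unit radial vector, is an assumption made here for illustration (it reduces μ to |sin(θ)|):

```python
from math import tan, cos, radians

def angular_mu(theta_deg):
    """Angular fuzzy membership mu(theta) = |z * tan(theta)|,
    assuming z = cos(theta) for a unit radial vector."""
    theta = radians(theta_deg)
    z = cos(theta)  # horizontal projection of the radial vector (assumed)
    return abs(z * tan(theta))

# Sample linguistic-angle positions between "neutral" and "exact base"
for deg in (0, 30, 45, 90):
    print(deg, round(angular_mu(deg), 4))
```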
Neural Networks
The neural network can be used to obtain fuzzy membership values. Consider a case where fuzzy
membership functions are to be created for the fuzzy classes of an input data set. The input data set is collected and
divided into a training data set and a testing data set. The training data set trains the neural network. Consider an
input training data set as shown in figure below-
Consider part (A) of the figure: the data set contains several data points. The data points are
first divided into different classes by conventional clustering techniques. It can be noticed that the data points
are divided into three classes, RA, RB and RC. Consider data point 1 having input coordinate values Xi = 0.6
and Xj = 0.8. This data point lies in the region RB; hence we assign a complete membership of 1 to class RB and 0
to classes RA and RC. In a similar manner, the other data points are given membership values of 1 for the class
they initially belong to.
Consider part (B) of the figure: a neural network is created which uses data point 1 and the
corresponding membership values in the different classes to train itself to simulate the relationship between
coordinate location and the membership values.
The output of the neural network is shown in part-(C) of the figure, which classifies data points into one of the
three regions. The neural network uses the next data set of data values and membership values for further
training processes.
The process is continued until the neural network simulates the entire set of input-output values. The network
performance is then tested using the testing data set. When the neural network is ready in its final version (as shown in the figure below),
it can be used to determine the membership values of any input data in the different regions (classes). A
complete mapping of the membership of various data points in various fuzzy classes can be derived to
determine the overlap of the different classes. The overlap of the three fuzzy classes is shown in the hatched
portion of the figure. In this manner, the neural network is used to determine the fuzzy membership function.
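A minimal sketch of the idea, assuming a single-layer softmax network trained with plain gradient descent; the 2-D coordinates, class regions RA/RB/RC and learning rate below are illustrative assumptions, not taken from the figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training points (Xi, Xj) with crisp one-hot memberships
# in three classes RA, RB, RC, mimicking the clustering step described above.
X = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],   # class RA
              [0.60, 0.80], [0.70, 0.70], [0.65, 0.90],   # class RB
              [0.90, 0.20], [0.80, 0.10], [0.95, 0.30]])  # class RC
T = np.repeat(np.eye(3), 3, axis=0)                       # one-hot targets

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Train a single-layer softmax network with plain gradient descent.
W = rng.normal(scale=0.1, size=(2, 3))
b = np.zeros(3)
for _ in range(2000):
    P = softmax(X @ W + b)
    grad = P - T                      # gradient of the cross-entropy loss
    W -= 0.5 * X.T @ grad / len(X)
    b -= 0.5 * grad.mean(axis=0)

# The trained network now outputs graded memberships in [0, 1] for any
# coordinate pair; across the three classes they sum to 1.
mu = softmax(np.array([[0.6, 0.8]]) @ W + b)[0]
print(mu.round(3))
```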
Genetic Algorithm
The genetic algorithm is based on Darwin's theory of evolution; the basic rule is “survival of the fittest”. The
genetic algorithm is used here to determine the fuzzy membership functions. This can be done using the
following steps:
1. For a particular functional mapping system, the same membership functions and shapes are assumed
for various fuzzy variables to be defined.
2. These chosen membership functions are then coded into bit strings.
3. Then these bit strings are concatenated together.
4. The fitness function to be used here is noted. In genetic algorithms, fitness function plays a major role
similar to that played by activation function in neural networks.
5. The fitness function is used to evaluate the fitness of each set of membership functions.
6. These membership functions define the functional mapping of the system.
The process of generating and evaluating strings is carried out until we get a convergence to the solution
within a generation, i.e., we obtain the membership functions with best fitness value. Thus, fuzzy membership
functions can be obtained from genetic algorithms.
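A toy sketch of the procedure under simplifying assumptions: a single triangular membership function with a fixed half-width is encoded as an 8-bit string, and the fitness is the negative squared error against sampled membership data (the target peak, population size and generation count are illustrative):

```python
import random

random.seed(1)

# Target: recover the peak of a triangular membership function on [0, 10]
# from sampled (x, mu) data. A fixed half-width w = 3 is assumed.
def tri(x, c, w=3.0):
    return max(0.0, 1.0 - abs(x - c) / w)

TRUE_C = 6.0
samples = [(x / 2, tri(x / 2, TRUE_C)) for x in range(21)]

def decode(bits):          # 8-bit string -> peak position c in [0, 10]
    return int(bits, 2) / 255 * 10

def fitness(bits):         # negative squared error against the samples
    c = decode(bits)
    return -sum((tri(x, c) - m) ** 2 for x, m in samples)

pop = [''.join(random.choice('01') for _ in range(8)) for _ in range(20)]
for gen in range(60):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                          # keep the fittest half
    children = []
    while len(children) < 10:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, 8)            # one-point crossover
        child = a[:cut] + b[cut:]
        i = random.randrange(8)                 # bit-flip mutation
        child = child[:i] + ('1' if child[i] == '0' else '0') + child[i + 1:]
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print(decode(best))  # close to the true peak 6.0
```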
Induction Reasoning
Induction is used to deduce causes by means of backward inference. The characteristics of inductive
reasoning can be used to generate membership functions. Induction employs the entropy minimization
principle, which clusters the parameters corresponding to the output classes. To perform an inductive
reasoning method, a well-defined database for the input–output relationship should exist. The inductive
reasoning can be applied for complex systems where the data are abundant and static. For dynamic data sets,
this method is not best suited, because the membership functions continually change with time. There exist
three laws of induction
1. Given a set of irreducible outcomes of an experiment, the induced probabilities are those probabilities
consistent with all available information that maximize the entropy of the set.
2. The induced probability of a set of independent observations is proportional to the probability density of
the induced probability of a single observation.
3. The induced rule is that rule consistent with all available information that minimizes the entropy.
The third law stated above is widely used for the development of membership functions. The membership
functions using inductive reasoning are generated as follows:
1. A fuzzy threshold is to be established between classes of data.
2. Using the entropy minimization screening method, first determine the threshold line.
3. Then start the segmentation process.
4. The segmentation process results in two classes.
5. By partitioning the first two classes one more time, we obtain three different classes.
6. The partitioning is repeated with threshold value calculations, which lead us to partition the data set into
a number of classes or fuzzy sets.
7. Then on the basis of the shape, membership function is determined
Thus the membership function is generated on the basis of the partitioning or analog screening concept. This
draws a threshold line between two classes of sample data. The idea behind drawing the threshold line is to
classify the samples when minimizing the entropy for optimum partitioning.
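The entropy-minimization screening step can be sketched as follows, assuming hypothetical 1-D samples labelled with two output classes; the threshold with the lowest weighted entropy is chosen:

```python
from math import log2

# Hypothetical 1-D samples, each labelled with an output class.
data = [(1.0, 'A'), (1.5, 'A'), (2.0, 'A'), (2.2, 'A'),
        (5.0, 'B'), (5.5, 'B'), (6.0, 'B'), (6.5, 'B')]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * log2(p)
    return h

def best_threshold(points):
    """Pick the midpoint threshold minimizing the weighted entropy."""
    xs = sorted(x for x, _ in points)
    cands = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in cands:
        left = [c for x, c in points if x <= t]
        right = [c for x, c in points if x > t]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(points)
        if best is None or h < best[1]:
            best = (t, h)
    return best

t, h = best_threshold(data)
print(t, h)  # the threshold 3.6 separates the classes with zero entropy
```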
Defuzzification
Defuzzification is the process of converting fuzzy values into clear, precise values that can be used for real-
world applications. Since many control systems require specific, non-fuzzy actions, defuzzification helps
translate fuzzy control decisions into exact outputs.
This process takes a fuzzy control decision, represented by a range of possible values, and selects the best
single value that represents it. Defuzzification can turn a fuzzy set into a single crisp value, transform a fuzzy
matrix into a precise matrix, or convert a fuzzy number into a definite number.
Mathematically, the defuzzification process may also be termed as “rounding it off”. Fuzzy set with a collection
of membership values or a vector of values on the unit interval may be reduced to a single scalar quantity
using the defuzzification process.
Lambda-Cuts for Fuzzy Sets
The set Aλ is called a weak lambda-cut set if it consists of all the elements of a fuzzy set whose membership
functions have values greater than or equal to a specified value:
Aλ = {x | μA(x) ≥ λ}
The set Aλ+ is called a strong lambda-cut set if it consists of all the elements of a fuzzy set whose membership
functions have values strictly greater than a specified value. A strong λ-cut set is given by
Aλ+ = {x | μA(x) > λ}
All the λ-cut sets form a family of crisp sets. It is important to note that the λ-cut set Aλ (or Aλ+, for a strong λ-cut) does not carry a tilde, because it is a crisp set derived from the fuzzy set.
Note: Typically, λ is taken from the interval (0,1] rather than [0,1] because λ-cuts are not defined for λ = 0 (as it
would include all elements of the universal set). In some cases, λ can be taken from [0,1], where A₀ includes
all elements of the fuzzy set, and A₁ includes only elements with full membership (μA(x) = 1).
An Example
Consider a fuzzy set A representing "tall people" with elements:
A={(150,0.2),(160,0.5),(170,0.8),(180,1.0),(190,0.9)}
Let us take an example of the fourth property; A is a fuzzy set defined over a universe X={x 1, x2, x3}. The
membership function μA(x) assigns values to elements of X.
The figure above shows the features of the membership functions. The core of the fuzzy set A is the λ = 1 cut set A1.
The support of A is the strong λ-cut set A0+, where λ = 0+, and it can be defined as
A0+ = {x | μA(x) > 0}
The interval [A0+, A1] forms the boundaries of the fuzzy set A, i.e., the regions with membership values
between 0 and 1, i.e., for λ = 0 to 1.
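Weak and strong λ-cuts of a discrete fuzzy set, such as the "tall people" set above, can be computed directly:

```python
# The discrete "tall people" fuzzy set from the example above.
A = {150: 0.2, 160: 0.5, 170: 0.8, 180: 1.0, 190: 0.9}

def weak_cut(fset, lam):    # elements with mu(x) >= lambda
    return {x for x, mu in fset.items() if mu >= lam}

def strong_cut(fset, lam):  # elements with mu(x) > lambda
    return {x for x, mu in fset.items() if mu > lam}

print(weak_cut(A, 0.5))    # {160, 170, 180, 190}
print(strong_cut(A, 0.5))  # {170, 180, 190}
print(weak_cut(A, 1.0))    # core of A: {180}
print(strong_cut(A, 0.0))  # support of A: all five elements
```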
Do it yourself
Q.1 Consider the fuzzy sets A and B both defined on X
Q.2 Consider the discrete fuzzy set defined on the universe X = {a,b,c,d,e} as
Using Zadeh’s notation, find the λ-cut sets for λ =1, 0.9,0.6,0.3,0 + and 0.
Lambda-Cuts for Fuzzy Relations
Rλ = {(x, y) | μR(x, y) ≥ λ}
where Rλ is a λ-cut relation of the fuzzy relation R. Since R is defined as a two-dimensional array on the universes X and Y, a pair (x, y) belongs to Rλ if its membership value is
greater than or equal to λ. Similar to the properties of λ-cut fuzzy sets, the λ-cuts on fuzzy relations also obey
certain properties. They are listed as follows. For two fuzzy relations R and S the following properties should
hold:
1. For any λ ≤ α, where 0 ≤ λ ≤ α ≤ 1, it is true that Rα ⊆ Rλ
2. (R ∪ S)λ = Rλ ∪ Sλ
3. (R ∩ S)λ = Rλ ∩ Sλ
4. (R’)λ ≠ (Rλ)’ except when λ = 0.5
Do it yourself
Q.1 Determine the crisp λ-cut relation when λ =0.1, 0+, 0.3 and 0.9 for the following relation R
Note:
1. Any lambda-cut relation of a fuzzy tolerance relation results in a crisp tolerance relation.
2. Any lambda-cut relation of a fuzzy equivalence relation results in a crisp equivalence relation.
Defuzzification Methods
Defuzzification is the process of conversion of a fuzzy quantity into a precise quantity. The output of a fuzzy
process may be a union of two or more fuzzy membership functions defined on the universe of discourse of the
output variable.
Consider a fuzzy output comprising two parts: the first part, C1, has a triangular membership shape (as shown in
figure A), and the second part, C2, a trapezoidal shape (as shown in figure B). The union of these two membership
functions, i.e. C = C1 ∪ C2, is obtained using the max operator and forms the outer envelope of the two shapes shown
in (C). A fuzzy output process may involve many output parts, and the membership function representing each
part of the output can have any shape. The membership function of the fuzzy output need not always be
normal. In general, we have
1. Max-membership Principle
This method, also known as the height method, is limited to peaked output functions (i.e., those with a clear, single peak of maximum membership). It is given by the algebraic expression
μA(x*) ≥ μA(x) for all x ∈ X
Note: if there are multiple peaks, the result will be ambiguous. To overcome this we have to use
either the first of maxima, the last of maxima, or the mean-max membership.
An Example:
Consider the following fuzzy set for ripeness
Ripe = {(0, 0), (2, 0.3), (4, 0.6), (5, 0.8), (6, 1), (7, 0.7), (8, 0.5), (10, 0)}
The maximum membership of 1.0 occurs at a ripeness of 6; hence the defuzzified value is 6.
Do it yourself
Q.1 Find the defuzzified value for the following fuzzy set
Dosage = {(50, 0), (75, 0.4), (100, 0.7), (125, 0.9), (150, 1), (175, 0.8), (200, 0.5), (225, 0.2)}
2. First of Maxima
This method determines the smallest value of the domain with maximum membership value.
An Example:
Consider the following fuzzy set for speed
Speed = {(40, 0),(50, 0.5),(60, 0.9),(70, 1),(75, 1),(80, 0.8),(90, 0.2)}
The maximum membership is 1.0. The lowest speed at which this occurs is 70; hence the defuzzified value is 70.
Do it yourself
Q.1 Find the defuzzified value for the following fuzzy set
Consider the following fuzzy set for volume
Volume = {(1, 0), (3, 0.3), (5, 0.7), (6, 1), (7, 1), (8, 0.8), (10, 0.2)}
3. Last of maxima
This method determines the largest value of the domain with maximum membership value.
An Example:
Consider the following fuzzy set for quality
Quality = {(2, 0.1), (4, 0.5), (6, 0.8), (7, 1), (8, 1), (9, 0.7), (10, 0.3)}
The maximum membership is 1.0. The last quality value at which this occurs is 8 hence the defuzzified value is
8.
Do it yourself
Q.1 Find the defuzzified value for the following fuzzy set
Sweetness = {(1, 0.1), (3, 0.4), (5, 0.8), (7, 1), (8, 1), (9, 0.7), (10, 0.3)}
4. Mean-max Membership
In this method, the defuzzified value is taken as the element with the highest membership value. When
more than one element has the maximum membership value, the mean value of the maxima is taken. Let
A be a fuzzy set with membership function μA(x) defined over x ∈ X, where X is a universe of discourse. The
defuzzified value x* of the fuzzy set is defined as
x* = (Σxi∈M xi) / |M|
Here, M = {xi | μA(xi) is equal to the height of the fuzzy set A} and |M| is the cardinality of the set M.
An Example:
Consider the following fuzzy set for temperature
Temperature = {(18, 0.2), (20, 0.7), (22, 1), (23, 1), (24, 1), (26, 0.6), (28, 0.1)}
The maximum membership is 1.0, occurring at 22°C, 23°C, and 24°C. Mean Calculation is (22 + 23 + 24)/3 =
23 hence the defuzzified value is 23°C.
Do it yourself
Q.1 Find the defuzzified value for the following fuzzy set
Temperature = {(18, 0.3), (20, 0.6), (21, 1), (22, 1), (23, 1), (25, 0.7), (27, 0.2)}
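The three maxima-based defuzzifiers can be sketched together, applied to the temperature example above:

```python
# First-of-maxima, last-of-maxima and mean-max defuzzifiers for a
# discrete fuzzy set, applied to the temperature example in the text.
temp = [(18, 0.2), (20, 0.7), (22, 1.0), (23, 1.0), (24, 1.0), (26, 0.6), (28, 0.1)]

def maxima(fset):
    """Elements at which the membership reaches its peak value."""
    peak = max(mu for _, mu in fset)
    return [x for x, mu in fset if mu == peak]

def first_of_maxima(fset):   # smallest domain value at the peak
    return min(maxima(fset))

def last_of_maxima(fset):    # largest domain value at the peak
    return max(maxima(fset))

def mean_max(fset):          # mean of all domain values at the peak
    m = maxima(fset)
    return sum(m) / len(m)

print(first_of_maxima(temp))  # 22
print(last_of_maxima(temp))   # 24
print(mean_max(temp))         # 23.0
```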
5. Weighted Average Method
The defuzzified value x* is given by
x* = Σ μ(x̄i)·x̄i / Σ μ(x̄i)
Here Σ denotes the algebraic summation and x̄i is the element with the maximum membership function value in the i-th output set.
An Example
Let A be a fuzzy set that tells about a student the elements with corresponding maximum membership values
are also given.
Here, the linguistic variable P represents a Pass student, F stands for a Fair student, G represents a Good
student, VG represents a Very Good student and E for an Excellent student.
The defuzzified value for the fuzzy set A with weighted average method represents a Fair student.
Do it yourself
Q.1 Consider a fuzzy set for Fan Speed = {(1, 0.2), (3, 0.5), (5, 0.8), (7, 0.6), (9, 0.3)} Find the defuzzified value
using weighted average method.
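A minimal sketch of the weighted average computation, using an illustrative fuzzy set (not the exercise above):

```python
# Weighted average defuzzification: sum of mu(x)*x over sum of mu(x).
# The (element, membership) pairs below are an illustrative assumption.
fset = [(2, 0.4), (4, 0.8), (6, 0.6)]

num = sum(m * x for x, m in fset)   # sum of mu(x̄i)·x̄i
den = sum(m for _, m in fset)       # sum of mu(x̄i)
print(round(num / den, 4))          # 4.2222
```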
6. Center of sums
This is one of the most commonly used defuzzification techniques. In this method, the overlapping area is counted
twice: the method employs the algebraic sum of the individual fuzzy subsets instead of their union. The
defuzzified value x* is defined as:
x* = (Σi=1..N xi · Σk=1..n μAk(xi)) / (Σi=1..N Σk=1..n μAk(xi))
Here, n is the number of fuzzy sets, N is the number of fuzzy variables, and μAk(xi) is the membership function for
the k-th fuzzy set. In terms of areas, the defuzzified value x* is defined as:
x* = (Σi=1..k Ai·x̄i) / (Σi=1..k Ai)
Here, Ai represents the firing area of the i-th rule, k is the total number of rules fired, and x̄i represents
the center of area of the i-th clipped output.
An Example: This example is for the case when symmetric shapes are created by the fuzzy set.
Given the following three fuzzy output sets, find the crisp value corresponding to them.
(The areas and centers for the first, second and third fuzzy sets are computed in the figure.)
Another Example: This example is for the case when asymmetric shapes are created by the fuzzy set.
Consider the union of two fuzzy sets
The aggregated fuzzy set of the two fuzzy sets C1 and C2 is shown in the figure above. Let the areas of these two fuzzy
sets be A1 and A2.
Now, the center of area of the fuzzy set C1 is 5 (the exact value is 4.73; 5 is taken only for simplicity of
calculation, since fuzzy methods work on approximation and a slight adjustment does not affect the result much),
which is the value of the variable x̄1, and the center of area of the fuzzy set C2 is 8, which is the value of the
variable x̄2. Now the defuzzified value is
Do it yourself
Q.1 Consider the union of following fuzzy sets
C1 = {(1, 0.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 0.0)}
C2 = {(3, 0.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)}
C3 = {(2, 0.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 0.0)}
Find the defuzzified value for the same using the center of sums method.
Hint: The x coordinate for the centroid of second set C2 is 5. (The Answer is 4.5 approximately)
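A discrete center-of-sums sketch, using two illustrative symmetric triangular sets; note that the memberships are summed, so any overlap is counted twice:

```python
# Discrete center of sums: individual memberships are summed rather
# than combined with max. The two triangular sets are illustrative.
xs = [0, 1, 2, 3, 4, 5, 6]
mu1 = [0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0]   # triangle centred at 2
mu2 = [0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0]   # triangle centred at 4

num = sum(x * (m1 + m2) for x, m1, m2 in zip(xs, mu1, mu2))
den = sum(m1 + m2 for m1, m2 in zip(mu1, mu2))
print(num / den)  # 3.0, midway between the two symmetric triangles
```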
7. Center of Gravity (COG) Method
For a discrete membership function, the defuzzified value, denoted x*, using COG is defined as:
x* = (Σ xi·μ(xi)) / (Σ μ(xi))
Here xi indicates the sample element, μ(xi) is the membership function, and n represents the number of
elements in the sample.
For a continuous membership function, x* is defined as:
x* = (∫ x·μ(x) dx) / (∫ μ(x) dx)
An Example
Find the defuzzified value for the fuzzy set shown in the figure using the center of gravity method.
First, we apply the center of gravity method for the discrete membership function.
The numerator part:
∑xi⋅μ(xi) = (−30)(0.4) + (−15)(0.6) + (15)(0.6) + (30)(0.4) + (75)(0.3) = 22.5
The denominator part:
∑μ(xi) = 0.4 + 0.6 + 0.6 + 0.4 + 0.3 = 2.3
The defuzzified value using the Center of Gravity (COG) method for the discrete membership function is
therefore 22.5/2.3 ≈ 9.78.
Now we have to apply the center of gravity method to find the defuzzified value for continuous membership
function.
We will find the equation of the line for each segment using the two-point form:
μ(x) = μ1 + ((μ2 − μ1)/(x2 − x1))·(x − x1)
From x = -15 to x = 15
From x = 15 to x = 30
From x = 30 to x = 75
From x = 75 to x = 90
From x = -15 to x = 15
From x = 15 to x = 30
From x = 30 to x = 75
From x = 75 to x = 90
After performing the integrations, the defuzzified value using the continuous membership function is
approximately 17.34.
1. Discrete Membership Function:
o When to Use: Use this approach when the membership function is defined by discrete points, and the
intervals between points are small or the function is piecewise constant.
o Advantages: Simpler calculations, especially for small datasets.
o Disadvantages: Less accurate for continuous or smoothly varying membership functions.
2. Continuous Membership Function:
o When to Use: Use this approach when the membership function is continuous or smoothly varying,
and you need higher accuracy.
o Advantages: More accurate for continuous functions, better representation of real-world systems.
o Disadvantages: More complex calculations, especially for piecewise linear functions.
Do it yourself
Q.1 Consider the union of following fuzzy sets
C1 = {(0, 0.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 0.0)}
C2 = {(3, 0.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 0.0)}
Find the defuzzified value for the same using the center of area method for both the discrete membership
function and the continuous membership function.
The Answer is 4.151 approximately for continuous membership function.
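The discrete COG computation shown earlier (points at −30, −15, 15, 30 and 75) can be verified with a short script:

```python
# Verifying the discrete center-of-gravity computation from the text.
pts = [(-30, 0.4), (-15, 0.6), (15, 0.6), (30, 0.4), (75, 0.3)]

num = sum(x * m for x, m in pts)   # numerator: sum of x*mu(x) = 22.5
den = sum(m for _, m in pts)       # denominator: sum of mu(x) = 2.3
print(round(num / den, 2))         # 9.78
```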
8. Bisector of Area Method
This method finds the vertical line that divides the area under the membership function into two equal halves.
An Example
Consider a fuzzy set {(1, 0), (3, 0.5), (7, 0.8), (10, 0)}
Left Triangle (1 to 3)
Base = 3 − 1 = 2, Height = 0.5
Area = ½ × 2 × 0.5 = 0.5
Middle Trapezoid (3 to 7)
Parallel sides = 0.5 (left height) and 0.8 (right height), Width = 7 − 3 = 4
Area = ½ × (0.5 + 0.8) × 4 = 2.6
Right Triangle (7 to 10)
Base = 10 − 7 = 3, Height = 0.8
Area = ½ × 3 × 0.8 = 1.2
Total area = 0.5 + 2.6 + 1.2 = 4.3
The bisector should divide this total area into two equal halves of 2.15 each:
Linear interpolation is a method of estimating an unknown value within a given range based on two known
points. It assumes that the change between two values is linear. Suppose the bisecting area is located
between xp and xq; then
x* = xp + ((A/2 − Ap) / (Aq − Ap)) × (xq − xp)
where:
● A is the total area,
● Ap is the cumulative area up to xp,
● Aq is the cumulative area up to xq.
Do it yourself
Q.1 Consider a fuzzy set {(2, 0), (4, 1), (7, 1) & (9, 0)} Find the defuzzified value using bisectors of the area
method.
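A numeric sketch of the bisector method for a piecewise-linear membership function given as (x, μ) breakpoints; the breakpoints below are the shape from the "Do it yourself" question, so this also checks that exercise:

```python
# Bisector of area via dense sampling: accumulate trapezoidal areas and
# return the first x where the cumulative area reaches half the total.
pts = [(2, 0.0), (4, 1.0), (7, 1.0), (9, 0.0)]

def interp(x, pts):
    """Piecewise-linear membership from (x, mu) breakpoints."""
    for (x0, m0), (x1, m1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)
    return 0.0

def bisector(pts, step=0.001):
    x0, x1 = pts[0][0], pts[-1][0]
    n = int((x1 - x0) / step)
    xs = [x0 + i * step for i in range(n + 1)]
    mus = [interp(x, pts) for x in xs]
    cum = [0.0]                       # cumulative trapezoidal areas
    for i in range(1, len(xs)):
        cum.append(cum[-1] + (mus[i - 1] + mus[i]) / 2 * step)
    half = cum[-1] / 2
    for i, a in enumerate(cum):
        if a >= half:
            return xs[i]
    return x1

print(round(bisector(pts), 2))  # 5.5 splits the area into equal halves
```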
9. Center of Largest Area Method
If the output fuzzy set has two or more convex sub-regions, the defuzzified value is taken as the center of gravity of the sub-region with the largest area:
x* = (∫ab μA(x)·x dx) / (∫ab μA(x) dx)
Where:
● x* is the defuzzified crisp output.
● μA(x) is the membership function of the fuzzy set.
● [a, b] is the range of the largest continuous region with the highest membership value.
An Example
Consider a fuzzy set A = {(10, 0), (20, 1), (30, 0)} and another fuzzy set B = {(25, 0), (30, 1), (40, 1), (45, 0)}.
The union of both these is shown below
We take the union of Set A and Set B, and the resulting fuzzy region looks like:
● From x = 10 to x = 20, the membership comes from Set A.
● From x = 20 to x = 30, the maximum membership dominates (Set A or Set B).
● From x = 30 to x = 40, the membership is 1 (from Set B).
● From x = 40 to x = 45, the membership gradually decreases (from Set B).
The largest continuous high-membership area is identified as the trapezoidal part of Set B,
30 ≤ x ≤ 40. For this largest area (30 ≤ x ≤ 40), where μA(x) = 1:
For the numerator: ∫ from 30 to 40 of x dx = (40² − 30²)/2 = 350
For the denominator: ∫ from 30 to 40 of dx = 40 − 30 = 10
So x* = 350/10 = 35
Do it yourself
Q.1 Consider a fuzzy sets A = {(5, 0), (10, 1), (15, 0)}, B = {(12, 0),(18, 1),(25, 1),(30, 0)} & C = {(28, 0),(35, 1),
(42, 0)}. Find the defuzzified value using the center of the largest area method.
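A grid-based sketch of the center-of-largest-area idea for the union of sets A and B from the example above: sample the union, find the longest contiguous run at maximum membership, and take its centroid:

```python
def pw_linear(x, pts):
    """Piecewise-linear membership from (x, mu) breakpoints."""
    for (x0, m0), (x1, m1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)
    return 0.0

A = [(10, 0.0), (20, 1.0), (30, 0.0)]                # triangular set A
B = [(25, 0.0), (30, 1.0), (40, 1.0), (45, 0.0)]     # trapezoidal set B

step = 0.01
xs = [10 + i * step for i in range(3501)]            # grid over [10, 45]
mu = [max(pw_linear(x, A), pw_linear(x, B)) for x in xs]
peak = max(mu)

# Find contiguous runs at (numerically) full membership.
runs, start = [], None
for i, m in enumerate(mu):
    if m >= peak - 1e-9:
        start = i if start is None else start
    elif start is not None:
        runs.append((start, i - 1))
        start = None
if start is not None:
    runs.append((start, len(mu) - 1))

i0, i1 = max(runs, key=lambda r: r[1] - r[0])        # longest run
center = (xs[i0] + xs[i1]) / 2
print(round(center, 2))  # centroid of the 30..40 plateau: 35.0
```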
Since mathematical operations require numerical values, the solution is mapping non-numerical elements to
numerical values. Here’s how it works:
● "Hot" → 40°C
Once we assign numerical values, we can apply mathematical formulas. Like in the Center of Area (CoA)
method, we compute:
Substituting values:
So the defuzzified output is 30.83°C, which makes sense as a compromise between warm and hot.
Fuzzy Logic Controller
The Fuzzy Logic Controller (FLC) was first developed by Mamdani and Assilian around 1975. Human beings have
the natural quest to know input-output relationships of a process. The behavior of a human being is modeled
artificially for designing a suitable FLC. The performance of an FLC depends on its Knowledge Base (KB),
which consists of both Data Base (DB) (that is, data related to membership function distributions of the
variables of the process to be controlled) as well as Rule Base (RB). However, designing a proper KB of an
FLC is a difficult task, which can be implemented in one of the following ways:
➔ Optimization of the DB only,
➔ Optimization of the RB only,
➔ Optimization of the DB and RB in stages,
➔ Optimization of the DB and RB simultaneously.
The membership function distributions are assumed to be either linear (such as triangular, trapezoidal) or non-linear
(namely Gaussian, bell-shaped, sigmoid). To design and develop a suitable FLC for controlling a process, its
variables need to be expressed in the form of some linguistic terms (such as VN: Very Near, VFR: Very Far, A:
Ahead etc.) and the relationships between input (antecedent) and output (consequent) variables are expressed in
the form of some rules. For example, a rule can be expressed as follows
IF I1 is NR AND I2 is A THEN O is ART.
It is obvious that the number of rules to be present in the rule base will increase, as the number of linguistic
terms used to represent the variables increases (to ensure a better accuracy in prediction). Moreover,
computational complexity of the controller will increase with the number of rules. For an easy implementation in
either the software or hardware, the number of rules present in the RB should be as small as possible.
The working principles of both these approaches are briefly explained below.
An FLC consists of four modules, namely a rule base, an inference engine, fuzzification and de-fuzzification.
Linguistic fuzzy modeling, such as “Mamdani Approach” is characterized by its high interpretability and low
accuracy, whereas the aim of precise fuzzy modeling like “Takagi and Sugeno’s Approach” is to obtain high
accuracy but at the cost of interpretability.
Let us discuss the "Mamdani" approach first. Assume, for simplicity, that only two fuzzy control
rules (out of the many rules present in the rule base) are fired for a set of inputs (s1*, s2*).
Let us try to understand the fuzzy reasoning process via the following diagram
The inputs for the fuzzy variables s1 and s2 are s1* and s2*, respectively. If μA1 and μB1 are the membership functions
for A1 and B1, respectively, then the grades of membership of s1* in A1 and of s2* in B1 are
represented by μA1(s1*) and μB1(s2*), respectively, for rule 1. Similarly, for rule 2, μA2(s1*) and μB2(s2*) are used
to represent the membership function values. Here f’ is the output variable, with output sets C1 and C2
for rules 1 and 2, respectively. The firing strengths of the first and second rules are calculated as follows:
α1 = min(μA1(s1*), μB1(s2*))
α2 = min(μA2(s1*), μB2(s2*))
The firing strengths α1 and α2 clip the output sets C1 and C2, i.e., μ*Ci(f’) = min(αi, μCi(f’)). The membership value of the combined
control action C is given by
μc(f’) = max(μ*c1(f’), μ*c2(f’))
Once the above processing is done then the defuzzification is applied to get the crisp value. The popular
choices for the defuzzification are center of sum, center of area, mean of maxima etc.
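A minimal Mamdani sketch with two hypothetical rules; the triangular membership functions, universes and inputs below are illustrative assumptions, with min for AND, max for aggregation and a discrete centroid for defuzzification:

```python
# A minimal Mamdani sketch with two hypothetical rules of the form
# "IF s1 is A AND s2 is B THEN f' is C". All membership functions and
# universes below are illustrative assumptions, not from the text.
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

A1 = lambda x: tri(x, 0, 2, 4)    # rule 1 antecedents
B1 = lambda x: tri(x, 0, 3, 6)
A2 = lambda x: tri(x, 2, 5, 8)    # rule 2 antecedents
B2 = lambda x: tri(x, 3, 6, 9)
C1 = lambda f: tri(f, 0, 5, 10)   # consequent sets on the f' universe
C2 = lambda f: tri(f, 5, 10, 15)

def mamdani(s1, s2):
    alpha1 = min(A1(s1), B1(s2))  # firing strength of rule 1 (AND = min)
    alpha2 = min(A2(s1), B2(s2))  # firing strength of rule 2
    fs = [i * 0.01 for i in range(1501)]  # discretised f' universe [0, 15]
    # clip each consequent at its firing strength, aggregate with max
    mu = [max(min(alpha1, C1(f)), min(alpha2, C2(f))) for f in fs]
    # defuzzify with the discrete centroid (center of gravity)
    return sum(f * m for f, m in zip(fs, mu)) / sum(mu)

print(round(mamdani(3.0, 4.0), 2))  # a crisp f' between the two consequent peaks
```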
An Example
A typical problem scenario relates to the navigation of a mobile robot in the presence of four moving obstacles.
The directions of movement of the obstacles are shown in the figure.
The obstacle O2 is found to be the most critical one. Our aim is to develop a fuzzy logic-based motion planner
that will be able to generate a collision-free path for the robot. There are two inputs, namely the distance
between the robot and the obstacle (D) and the angle (∠GSO2), for the motion planner, and it will generate one
output, that is, deviation. Distance is represented using four linguistic terms, namely Very Near (VN), Near
(NR), Far (FR) and Very Far (VFR), whereas the input: angle and output: deviation are expressed with the help
of five linguistic terms, such as Ahead (A), Ahead Left (AL), Left (LT), Ahead Right (ART) and Right (RT). The
figure given below shows the DB of the FLC.
The rule base of the fuzzy logic-based motion planner is given in table below-
Determine the output deviation for the set of inputs: distance D = 1.04 m and angle ∠GSO2 = 30 degrees, using
the Mamdani approach. Use different methods of defuzzification.
Solution:
The inputs are: Distance = 1.04 m, Angle = 30 degrees. The distance of 1.04 m may be called either NR (Near)
or FR (Far). Similarly, the input angle of 30 degrees can be declared either A (Ahead) or ART (Ahead Right).
Consider the figure below used to determine the membership value, corresponding to the distance of 1.04 m.
Using the principle of similar triangles, we can write the following relationship:
From the above expression, x is found to be equal to 0.6571. Thus, the distance of 1.04 m may be declared
NR with a membership value of 0.6571, that is, μNR = 0.6571. Similarly, the distance of 1.04 m can also be
called FR with a membership value of 0.3429, that is, μFR = 0.3429. In the same way, an input angle of 30
degrees may be declared either A with a membership value of 0.3333 (that is, μA = 0.3333) or ART with a
membership value of 0.6667 (that is, μART = 0.6667). For the above set of inputs, the following four rules are
being fired from a total of 20:
1. If Distance is NR AND Angle is A Then Deviation is RT
2. If Distance is NR AND Angle is ART Then Deviation is A
3. If Distance is FR AND Angle is A Then Deviation is ART
4. If Distance is FR AND Angle is ART Then Deviation is A
The fuzzified outputs corresponding to above four fired rules are shown in Figure below-
The union of the fuzzified outputs, corresponding to the above four fired rules is shown below-
The above fuzzified output cannot be used as a control action and its crisp value has been determined using
the following methods of defuzzification:
1. Center of Sums Method
The shaded region corresponding to each fired rule is shown in this figure. The values of area and
center of area of the above shaded regions are also given in this figure, in a tabular form. The crisp
output U of above four fired rules can be calculated as follows-
Therefore, the robot should deviate by 19.5809 degrees towards the right with respect to the line joining the
present position of the robot and the goal to avoid collision with the obstacle.
2. Centroid method:
The shaded region representing the combined output corresponding to above four fired rules has been divided
into a number of regular sub-regions, whose area and center of area can easily be determined. For example,
the shaded region of Figure has been divided into four sub-regions (two triangles and two rectangles). The
values related to area and center of area of the above four sub-regions are shown in this figure, in a tabular
form. In this method, the crisp output U can be obtained like the following
Where A = 9.7151 × (−25.2860) + 20.2788 × 0.0 + 2.3588 × 20.2870 + 24.8540 × 52.7153 and B = 9.7151 +
20.2788 + 2.3588 + 24.8540. Therefore, U turns out to be equal to 19.4450. Thus, the robot should deviate by
19.4450 degrees towards the right with respect to the line joining the present position of the robot and the goal
to avoid collision with the obstacle.
3. Mean of Maxima
It is observed that the maximum value of membership (that is, 0.6571) occurs over a range of deviation
angles from −15.4305 to 15.4305 degrees. Thus, its mean comes out to be equal to 0.0. Therefore,
the crisp output U of the controller becomes equal to 0.0, that is, U = 0.0. The robot will move along the line
joining its present position and the goal.
Now let us talk about "Takagi and Sugeno’s Approach". Here, a rule is composed of a fuzzy antecedent and a
functional consequent part. Thus, a rule (say the i-th) can be represented as follows:
If x1 is A1^i and x2 is A2^i and … and xn is An^i, then y^i = a0^i + a1^i·x1 + … + an^i·xn
where a0^i, a1^i, …, an^i are the coefficients. It is to be noted that i does not represent a power; it is
only a superscript. In this way, a nonlinear system is considered to be a combination of several linear systems.
The weight of the i-th rule can be determined for a set of inputs (x1, x2, …, xn) as follows:
w^i = μA1^i(x1) × μA2^i(x2) × … × μAn^i(xn)
where A1^i, A2^i, …, An^i indicate the membership function distributions of the linguistic terms used to represent the
input variables and μ denotes the membership function value. Thus, the combined control action can be
determined as follows:
y = (Σi w^i·y^i) / (Σi w^i)
A Numerical Example:
A fuzzy logic-based expert system is to be developed that will work based on Takagi and Sugeno’s approach
to predict the output of a process. The DB of the FLC is shown in Figure below-
As there are two inputs, I1 and I2, and each input is represented using three linguistic terms (for example, LW,
M, H for I1 and NR, FR, VFR for I2), there is a maximum of 3 × 3 = 9 feasible rules. The output of the i-th rule, that
is, y^i (i = 1, 2, …, 9), is expressed as follows:
y^i = aj^i·I1 + bk^i·I2
y^i = a_j^i I1 + b_k^i I2
where j, k = 1, 2, 3; a_1^i = 1, a_2^i = 2 and a_3^i = 3 if I1 is found to be LW, M and H, respectively; b_1^i = 1,
b_2^i = 2 and b_3^i = 3 if I2 is seen to be NR, FR and VFR, respectively. Calculate the output of the FLC for the
inputs: I1 = 6.0, I2 = 2.2.
Solution
The inputs are: I1 = 6.0 & I2 = 2.2. The input I1 of 6.0 units can be called either LW (Low) or M (Medium).
Similarly, the second input I2 of 2.2 units may be declared either FR (Far) or VFR (Very Far). Figure below
shows a schematic view used to determine the membership value corresponding to the first input I1 = 6.0.
Using the principle of similar triangle, we can write the following relationship:
From the above expression, x is coming out to be equal to 0.8. Thus, the input I1 = 6.0 may be called LW with a
membership value of 0.8, that is, μLW = 0.8. Similarly, the same input I1 = 6.0 may also be called M with a
membership value of 0.2, that is, μM = 0.2. In the same way, the input I2 = 2.2 may be declared either FR with a
membership value of 0.8 (that is, μFR = 0.8) or VFR with a membership value of 0.2 (that is, μVFR = 0.2). For the
above set of inputs, the following four combinations of input variables are being fired from a total of nine.
1. I1 is LW and I2 is FR,
2. I1 is LW and I2 is VFR,
3. I1 is M and I2 is FR,
4. I1 is M and I2 is VFR.
Now, the weights: w1, w2, w3 and w4 of the first, second, third and fourth combination of fired input variables,
respectively, have been calculated as follows:
w1 = μLW × μFR = 0.8 × 0.8 = 0.64
w2 = μLW × μVFR = 0.8 × 0.2 = 0.16
w3 = μM × μFR = 0.2 × 0.8 = 0.16
w4 = μM × μVFR = 0.2 × 0.2 = 0.04
The functional consequent values: y1, y2, y3 and y4 of the first, second, third and fourth combination of fired
input variables can be determined like the following:
y1 = I1 + 2I2 = 6.0 + 2 × 2.2 = 10.4
y2 = I1 + 3I2 = 6.0 + 3 × 2.2 = 12.6
y3 = 2I1 + 2I2 = 2× 6.0 + 2 × 2.2 = 16.4
y4 = 2I1 + 3I2 = 2× 6.0 + 3 × 2.2 = 18.6
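The worked example stops after the y values; completing it, the crisp output is the weighted average of the functional consequents. Since the four weights sum to 1.0, the output is 0.64 × 10.4 + 0.16 × 12.6 + 0.16 × 16.4 + 0.04 × 18.6 = 12.04, which the short check below reproduces:

```python
# Takagi-Sugeno crisp output: weighted average of the rule consequents
weights = [0.64, 0.16, 0.16, 0.04]   # w1..w4 of the four fired rules
outputs = [10.4, 12.6, 16.4, 18.6]   # y1..y4 of the four fired rules

# y = sum(w_i * y_i) / sum(w_i)
y = sum(w * o for w, o in zip(weights, outputs)) / sum(weights)
print(round(y, 2))  # 12.04
```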
Let us see the difference between the Mamdani FIS and the Sugeno FIS.
Mamdani FIS | Sugeno FIS
Output membership function is present. | No output membership function is present.
The output surface is discontinuous. | The output surface is continuous.
Distributes the output by combining the outputs with the rule strengths. | No distribution of the output, only a mathematical combination of the outputs and the rule strengths.
The crisp result is obtained through defuzzification of the rule consequents. | No defuzzification here; the crisp result is obtained using a weighted average of the rule consequents.
Expressive power and interpretable rule consequents. | There is a loss of interpretability.
Mamdani FIS possesses less flexibility in the system design. | Sugeno FIS possesses more flexibility in the system design.
It has more accuracy in security evaluation of block cipher algorithms. | It has less accuracy in security evaluation of block cipher algorithms.
It is used in MISO (Multiple Input and Single Output) and MIMO (Multiple Input and Multiple Output) systems. | It is used only in MISO (Multiple Input and Single Output) systems.
The Mamdani inference system is well suited to human input. | The Sugeno inference system is well suited to mathematical analysis.
Application: Medical Diagnosis System. | Application: to keep track of the change in aircraft performance with altitude.
Artificial Neuron
Consider the schematic view of an artificial neuron, in which a biological neuron has been modeled artificially.
Let us suppose that there are n inputs (such as I1, I2, . . . , In) to a neuron j. The weights connecting n number of
inputs to jth neuron are represented by [W] = [W1j, W2j, ..., Wnj]. The function of summing junctions of an artificial
neuron is to collect the weighted inputs and sum them up. Thus, it is similar to the function of combined
dendrites and soma. The activation function (also known as the transfer function) performs the task of the axon
and synapse. The output of the summing junction may sometimes become equal to zero; to prevent such a
situation, a bias of fixed value bj is added to it. Thus, the input to the transfer function f is determined as

net_j = b_j + Σ (i = 1 to n) I_i W_ij

The output of the summing junction is also called the linear combiner output / induced local field / net input /
pre-activation value. The output of the jth neuron, that is, Oj, can be obtained as follows:

O_j = f(net_j)
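The summing junction and activation can be sketched in a few lines; the input, weight, and bias values below are hypothetical, and tanh stands in for an arbitrary transfer function:

```python
import math

def neuron_output(inputs, weights, bias, f=math.tanh):
    """Summing junction plus activation: O_j = f(b_j + sum_i I_i * W_ij)."""
    net = bias + sum(i * w for i, w in zip(inputs, weights))
    return f(net)

# Hypothetical three-input neuron j with bias 0.1
print(neuron_output([0.5, 0.2, 0.1], [0.3, 0.7, 0.4], bias=0.1))
```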
Do it yourself
Q.1 For the network shown in Figure below, calculate the net input to the output neuron.
Q.2 For the network shown in Figure below, calculate the net input to the output neuron.
Difference between Artificial Neural Network (ANN) and Biological Neural Network (BNN):
Artificial Neural Network (ANN) | Biological Neural Network (BNN)
Processing speed is fast; the cycle time for execution is in nanoseconds. | Slow in processing information; the cycle time for execution is in milliseconds.
It can perform massive parallel operations simultaneously, like the BNN. | It can perform massive parallel operations simultaneously.
Size and complexity depend on the chosen application, but it is less complex than the BNN. | The size and complexity of the BNN are greater than those of the ANN, with about 10^11 neurons and 10^15 interconnections.
To store new information, old information is deleted if there is a shortage of storage. | Any new information is stored in the interconnections, and the old information is stored with lesser strength.
There is no fault tolerance in the ANN; corrupted information cannot be processed. | It has fault-tolerance capability; it can store and retrieve information even if an interconnection is disconnected.
The control unit processes the information. | The chemicals present in the neurons do the processing.
Threshold is a set value based upon which the final output of the network may be calculated. The threshold
value is used in the activation function. A comparison is made between the calculated net input and the
threshold to obtain the network output. For each and every application, there is a threshold limit. Consider a
direct current (DC) motor. If its maximum speed is 1500 rpm then the threshold based on the speed is 1500
rpm. If the motor is run at a speed higher than its set threshold, it may damage the motor coils. Similarly, in
neural networks, the activation functions are defined based on the threshold value and the output is calculated.
An activation function using a threshold can be defined as

f(net) = 1 if net ≥ θ, −1 if net < θ

where θ is the fixed threshold value.
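This bipolar threshold activation can be sketched directly:

```python
def threshold_activation(net, theta):
    """Bipolar threshold: +1 when the net input reaches theta, -1 otherwise."""
    return 1 if net >= theta else -1

print(threshold_activation(0.7, theta=0.5))   # 1
print(threshold_activation(0.3, theta=0.5))   # -1
```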
Hidden layers | Output layer
Extract and transform intermediate features from the input data; the learned features are often abstract and hard to interpret. | Produces the final prediction (e.g., class probabilities, regression values), which is human-readable (e.g., class labels, scalars).
Introduce non-linearity (via activation functions) to model complex relationships; typically use non-linear activations like ReLU or Tanh. | Maps the learned features to the target format using task-specific activations, such as sigmoid for binary classification and softmax for multi-class classification.
In case of error, they propagate the error backward. | In case of error, it computes the initial loss gradient.
v. Multilayer recurrent network
A processing element output can be directed back to the nodes in a preceding layer, forming a
multilayer recurrent network. Also, in these networks, a processing element output can be directed back
to the processing element itself and to other processing elements in the same layer.
Note: Maxnet is a type of neural network used for competitive learning, specifically to determine the maximum
activation among a set of neurons. It is commonly used in winner-take-all (WTA) networks.
The training or learning rules adopted for updating and adjusting the connection weights
The main property of an ANN is its capability to learn. Learning or training is a process by means of which a
neural network adapts itself to a stimulus by making proper parameter adjustments, resulting in the production
of desired response. Broadly, there are two kinds of learning in ANNs:
1. Parameter learning: It updates the connecting weights in a neural net.
2. Structure learning: It focuses on the change in network structure (which includes the number of
processing elements as well as their connection types).
The above two types of learning can be performed simultaneously or separately. Apart from these two
categories of learning, the learning in an ANN can be generally classified into three categories as: supervised
learning; unsupervised learning & reinforcement learning.
1. Supervised Learning
Each input vector requires a corresponding target vector, which represents the desired output. The
input vector along with the target vector is called a training pair. The network here is informed precisely
about what should be emitted as output.
During training, the input vector is presented to the network, which results in an output vector. This
output vector is the actual output vector. Then the actual output vector is compared with the desired
(target) output vector. If there exists a difference between the two output vectors then an error signal is
generated by the network. This error signal is used for adjustment of weights until the actual output
matches the desired (target) output. In this type of training, a supervisor or teacher is required for error
minimization. Hence, the network trained by this method is said to be using supervised training
methodology. In supervised learning, it is assumed that the correct "target" output values are known for
each input pattern.
Key Features:
Requires labelled training data.
Uses loss functions to measure prediction accuracy.
Common algorithms: Neural Networks, Support Vector Machines, Decision Trees
Scenario: You want to predict the price of a house based on its size (in square feet) and other features
like the number of bedrooms, location, and age of the house.
Input Features:
● Size of the house (square feet)
● Number of bedrooms
● Location
● Age of the house
Output: House price (a continuous value)
A regression model, such as Linear Regression, can be used to predict the house price. The model
learns the relationship between the input features and the house price during training. For example, it
might learn that larger houses with more bedrooms in desirable locations tend to have higher prices.
Equation: In simple linear regression, the relationship can be represented as y = β0 + β1x, where y is the
predicted price, x is an input feature such as size, β0 is the intercept, and β1 is the slope learned during training.
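A minimal sketch of fitting such a line by closed-form least squares, on made-up (size, price) pairs:

```python
# Least-squares fit of price = b0 + b1 * size on hypothetical data
sizes  = [1000, 1500, 2000, 2500]   # square feet (made-up data)
prices = [200, 275, 350, 425]       # price in $1000s (made-up data)

n = len(sizes)
mx = sum(sizes) / n
my = sum(prices) / n

# slope = covariance(x, y) / variance(x); intercept from the means
b1_num = sum((x - mx) * (y - my) for x, y in zip(sizes, prices))
b1_den = sum((x - mx) ** 2 for x in sizes)
b1 = b1_num / b1_den
b0 = my - b1 * mx

print(b0, b1)            # intercept and slope
print(b0 + b1 * 1800)    # predicted price for an 1800 sq-ft house
```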
Scenario: You want to classify animals into different categories based on their features, such as the
number of legs, type of skin covering, and whether they can fly.
Input Features:
● Number of legs
● Type of skin covering (e.g., fur, feathers, scales)
● Ability to fly (yes/no)
Output: Animal category (e.g., mammal, bird, reptile, amphibian)
A classification model, such as Logistic Regression, Decision Trees, or Support Vector Machines, can
be used to classify the animals. The model learns the relationship between the input features and the
animal category during training. For example, it might learn that animals with feathers and the ability to
fly are likely to be birds.
Decision Boundary: The model creates a decision boundary that separates the different classes. For
instance, it might determine that if an animal has feathers and can fly, it should be classified as a bird.
2. Un-supervised Learning
The input vectors of similar type are grouped without the use of training data to specify how a member
of each group looks or to which group a member belongs. In the training process, the network receives
the input patterns and organizes these patterns to form clusters. When a new input pattern is applied,
the neural network gives an output response indicating the class to which the input pattern belongs. If
for an input, a pattern class cannot be found then a new class is generated.
It is clear that there is no feedback from the environment to inform what the outputs should be or
whether the outputs are correct. In this case, the network must itself discover patterns, regularities,
features, or categories from the input data, and the relations of the input data to the output. While
discovering all these features, the network undergoes change in its parameters. This process is called
self-organizing in which exact clusters will be formed by discovering similarities and dissimilarities
among the objects.
Example: Clustering, Anomaly detection
The two popular learning algorithms are self organizing maps (SOMs) and k-means Clustering.
Topological preservation refers to the ability of the Kohonen Self-Organizing Map (KSOM) to
maintain the spatial relationships between input data points when mapping them onto a lower-
dimensional space (typically a 1D or 2D grid).
● Similar input vectors should be mapped to neighboring neurons in the output map.
● The network should retain the structure of the input data after training.
To depict this, a typical network structure where each component of the input vector x is connected to
each of the nodes is shown in figure below
On the other hand, if the input vector is two-dimensional, the inputs, say x(a, b), can arrange
themselves in a two-dimensional array defining the input space (a, b) as in Figure below; Here, the two
layers are fully connected
The architecture consists of two layers: input layer and output layer (cluster). There are “n” units in the
input layer and “m” units in the output layer. Basically, here the winner unit is identified by using either
dot product or Euclidean distance method and the weight update using Kohonen learning rules is
performed over the winning cluster unit. At the time of self-organization, the weight vector of the cluster
unit which matches the input pattern very closely is chosen as the winner unit. The closeness of the
weight vector of the cluster unit to the input pattern may be based on the square of the minimum
Euclidean distance. The weights are updated for the winning unit and its neighboring units.
The steps involved in the training algorithm are as shown below.
Step-1: Initialize the weights wij: Random values may be assumed. They can be chosen as the same
range of values as the components of the input vector. If information related to distribution of clusters is
known, the initial weights can be taken to reflect that prior knowledge. Initialize the learning rate α: It
should be a slowly decreasing function of time.
Step 2: Perform Steps 3–8 when the stopping condition is false.
Step-3: Take the sample training input vector x from the input layer.
Step 4: Compute the square of the Euclidean distance for each cluster unit, i.e., for each j = 1 to m,

D(j) = Σ (i = 1 to n) (x_i − w_ij)²

Find the winning unit index J, so that D(J) is minimum. (In Steps 3; dot product method can also be
used to find the winner, which is basically the calculation of net input, and the winner will be the one
with the
largest dot product.)
Step-5: For all units j within a specific neighborhood of J and for all i, calculate the new weights:
wij(new) = wij(old) + α[xi − wij(old)]
or
wij(new) = (1 - α)wij(old) + αxi
Step-6: Repeat steps 3–5 until the update in the weights is negligible.
Step-7: Update the learning rate using the formula α(t +1)= 0.5α(t).
Step 8: Test for stopping condition of the network.
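Steps 4 and 5 of the algorithm can be sketched in a few lines; the weight vectors and input below are hypothetical:

```python
def som_step(x, weights, alpha):
    """One Kohonen update: find the winner by squared Euclidean distance,
    then pull the winner's weight vector toward the input."""
    # Step 4: D(j) = sum_i (x_i - w_ij)^2 for every cluster unit j
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    winner = dists.index(min(dists))
    # Step 5: w(new) = w(old) + alpha * [x - w(old)] for the winning unit
    weights[winner] = [wi + alpha * (xi - wi)
                       for xi, wi in zip(x, weights[winner])]
    return winner, weights

# Hypothetical 2-cluster net with 4 input units, learning rate 0.5
w = [[0.2, 0.6, 0.5, 0.9], [0.8, 0.4, 0.7, 0.3]]
winner, w = som_step([0.0, 0.0, 1.0, 1.0], w, alpha=0.5)
print(winner, w[winner])
```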
An Example
Construct a Kohonen self-organizing map to cluster the four given vectors, [0 0 1 1], [1 0 0 0], [0 1 1 0]
and [0 0 0 1]. The number of clusters to be formed is two. Assume an initial learning rate of 0.5.
Do it yourself
Consider a Kohonen self-organizing net with two cluster units and five input units. The weight vectors
for the cluster units are given by
w1 = [1.0 0.9 0.7 0.5 0.3]
w2 = [0.3 0.5 0.7 0.9 1.0]
Use the square of the Euclidean distance to find the winning cluster unit for the input pattern x = [0.0 0.5
1.0 0.5 0.0]. Using a learning rate of 0.25, find the new weights for the winning unit.
Applications of SOMs
● Data Clustering: Identifying patterns in customer behavior, genetics, and more.
● Anomaly Detection: Detecting fraud or unusual patterns in financial transactions.
● Feature Extraction: Reducing data dimensions for visualization and analysis.
● Image Recognition: Organizing images based on similarities.
K-means clustering
Clustering is the task of grouping similar data points together based on their features.
K-means clustering is an iterative algorithm that divides an unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group of similar properties. The goal is to
maximize intra-cluster similarity and minimize inter-cluster similarity. Intra-cluster similarity means
elements in the same cluster should be close to one another, i.e., the Euclidean distance between them
should be as small as possible; inter-cluster similarity means the Euclidean distance between the
centroids of two clusters should be as large as possible, i.e., there should be no common element between two clusters.
The number of clusters is represented using letter K. This algorithm discovers patterns without prior
knowledge of groups i.e. it falls under the category of unsupervised learning.
1. Choose the Number of Clusters: Decide the value of K.
2. Initialize Centroids: Randomly select K data points as the initial centroids.
3. Assign Clusters:
Calculate Euclidean distance between each point and centroids.
Assign each point to the nearest centroid.
4. Update Centroids: Recompute centroids as the mean of all points in the cluster.
5. Repeat step 3 & 4: Reassign points and update centroids until convergence (no further
changes).
The step of computing the centroid and assigning all the points to the cluster based on their distance
from the centroid is a single iteration. There are essentially three stopping criteria that can be adopted
to stop the K-means algorithm:
1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. Maximum number of iterations is reached
We have to understand the effect of choosing the value of K. Before that, let us understand the meaning
of inertia, which is the sum of squared distances of points to their centroid (it measures cluster
compactness).
● If the value of k is too small, the clusters will be big, which results in high inertia; it will give us poor
insights because distinct groups will be merged into the same cluster.
● If the value of k is too large, the clusters will be small, which results in low inertia; but natural groups
will be split into fragmented clusters, making the result hard to interpret.
The impact of increasing the value of K can be understood like stretching a rubber band: Initial effort
(low k) yields big changes; later effort (high k) barely stretches it further.
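Steps 3–5 and the first stopping criterion can be sketched as follows; the small customer sample at the end is illustrative only:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: repeatedly assign each point to its nearest
    centroid (step 3), then move each centroid to the mean of its
    assigned points (step 4), until the centroids stop changing."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # squared Euclidean distance to every centroid
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # recompute each centroid; keep the old one if its cluster is empty
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
               for cl, c in zip(clusters, centroids)]
        if new == centroids:          # stopping criterion 1: no change
            break
        centroids = new
    return centroids, clusters

# Illustrative data: (spending, visits) pairs for six customers
customers = [(5, 2), (10, 4), (8, 3), (50, 15), (55, 18), (60, 20)]
centroids, clusters = kmeans(customers, k=2)
print(centroids)
```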
An Example
Dataset: 12 Customers with Annual Spending ($1000) and Visits/Year
+-------------------------------------------+
| Customer | Spending ($1000) | Visits/Year |
|----------|------------------|-------------|
| 1 | 5 | 2 |
| 2 | 10 | 4 |
| 3 | 8 | 3 |
| 4 | 50 | 15 |
| 5 | 55 | 18 |
| 6 | 60 | 20 |
| 7 | 100 | 40 |
| 8 | 95 | 35 |
| 9 | 110 | 45 |
| 10 | 120 | 50 |
| 11 | 4 | 1 |
| 12 | 6 | 2 |
+-------------------------------------------+
Iteration-2
For step-03: Assign Clusters; Compute Euclidean distance (squared) for all points:
Clusters remain unchanged, so the algorithm converges.
3. Reinforcement Learning
Reinforcement learning is a form of supervised learning because the network receives some feedback
from its environment. However, the feedback obtained here is only evaluative and not instructive. The
external reinforcement signals are processed in the critic signal generator, and the obtained critic
signals are sent to the ANN for adjustment of weights properly so as to get better critic feedback in
future. The critic signal is like a reward or a penalty. The reinforcement learning is also called learning
with a critic as opposed to learning with a teacher, which indicates supervised learning.
Key Features:
● Trial-and-error learning.
● Uses rewards and penalties as feedback.
A network is generally trained using either an incremental (also known as a sequential) or a batch mode, the
principles of which are discussed below.
Let us consider the incremental training of an NN using a number of scenarios (say 20), sent one after
another. There is a chance that the optimal network obtained after passing the 20-th training scenario
will be too different from that obtained after using the 1-st training scenario.
Note: It is important to mention that incremental training is easier to implement and computationally
faster than the batch mode of training.
1. Identity function: The output here remains the same as the input. The input layer uses the identity
activation function, defined as f(x) = x for all x.
2. Binary step function: f(x) = 1 if net ≥ θ, 0 if net < θ, where θ represents the threshold value. This
function is most widely used in single-layer nets to convert the net input to a binary output (1 or 0).
3. Bipolar step function: f(x) = 1 if net ≥ θ, −1 if net < θ, where θ represents the threshold value. This
function is also used in single-layer nets to convert the net input to a bipolar output (+1 or −1).
4. Sigmoidal functions: The sigmoidal functions are widely used in back-propagation nets because of
the relationship between the value of the functions at a point and the value of the derivative at that point
which reduces the computational burden during training. Sigmoidal functions are of two types:
a. Binary sigmoid function: It is also termed as logistic sigmoid function or unipolar sigmoid function.
It can be defined as
f(x) = 1 / (1 + e^(−λx))

where λ is the steepness parameter. For the standard binary sigmoid, λ = 1. If more than one input is
available to the neuron, then use the summed output value for x. Here the range of the sigmoid
function is from 0 to 1.
Its derivative can be expressed as f′(x) = λ f(x)[1 − f(x)]. This derivative is important in neural networks because it is used during backpropagation, which is the process of updating the weights to minimize the error.
Do it yourself
Q.2 Assume that λ = 1 and x = 0.53; compute f(x) and f′(x).
[ Answer: f(0.53) ≈ 0.629 & f′(0.53) ≈ 0.233 ]
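A handy property of the binary sigmoid is that its derivative can be computed from the function value itself, f′(x) = λ f(x)[1 − f(x)]; the sketch below checks the Q.2 numbers:

```python
import math

def binary_sigmoid(x, lam=1.0):
    """Binary (logistic) sigmoid: f(x) = 1 / (1 + e^(-lam * x))."""
    return 1.0 / (1.0 + math.exp(-lam * x))

def binary_sigmoid_deriv(x, lam=1.0):
    """f'(x) = lam * f(x) * (1 - f(x)), reusing the function value."""
    fx = binary_sigmoid(x, lam)
    return lam * fx * (1.0 - fx)

print(round(binary_sigmoid(0.53), 3))        # ~0.629
print(round(binary_sigmoid_deriv(0.53), 3))  # ~0.233
```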
b. Bipolar sigmoid function: It can be defined as f(x) = (1 − e^(−λx)) / (1 + e^(−λx)), where λ is the
steepness parameter. For the standard bipolar sigmoid, λ = 1. If more than one input is available to the
neuron, then use the summed output value for x. Here the sigmoid function range is between −1 and +1.
Do it yourself
Q.1 Obtain the output of the neuron Y for the network shown in Figure below using bipolar
sigmoidal activation function.
The bipolar sigmoidal function is closely related to the hyperbolic tangent function, which is written as

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
If the network uses binary data, it is better to convert it to bipolar form and use the bipolar sigmoidal
activation function or hyperbolic tangent function.
Ramp function: f(x) = 1 if x > 1; x if 0 ≤ x ≤ 1; 0 if x < 0.
Other than those discussed above, we also have the following types of neural networks.
The input layer is where the image data is fed into the network. Each pixel in the image is represented as a
value, and these values form the input to the network.
The convolutional layer to be constructed possesses "m" filters of size r × r × q, where "r" tends to be smaller than the dimension of the image.
An Example
A 5×5 grayscale image might have pixel values like:
Each value represents intensity (0 = black, 255 = white). Suppose we apply a filter over the pixel matrix, say a
3×3 edge-detection kernel.
We take the top-left 3×3 region from the image and apply the kernel. Extracted 3×3 Region from Image
Second Convolution Operation (Next 3×3 region, sliding right) Now, move the filter one step to the right. New
3×3 Region from Image
Continuing for Other Regions; Following the same process for the entire 3×3 sliding process, we fill up the
feature map. The Final Feature Map output will be
Continuing the example, we apply max-pooling over every 2×2 region with stride = 1 (the stride specifies how
far the filter slides at each step).
Region 1 (top-left 2×2): max value = -6
and so on; the final max-pooled output is
Since all values were the same, the result remains unchanged.
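Because the actual pixel and kernel matrices appear only in the figures, here is a minimal sketch of the same sliding-window convolution and 2×2 max-pooling, with assumed values (a vertical edge in a 5×5 image and a 3×3 edge-detection kernel):

```python
def conv2d_valid(img, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    sum the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(img[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(fm, size=2, stride=1):
    """Take the maximum of each size x size window of the feature map."""
    return [[max(fm[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fm[0]) - size + 1, stride)]
            for i in range(0, len(fm) - size + 1, stride)]

# Assumed 5x5 image with a vertical edge and a 3x3 edge-detection kernel
image = [[10, 10, 10, 0, 0]] * 5
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
feature_map = conv2d_valid(image, kernel)   # 3x3 feature map
pooled = max_pool(feature_map)              # 2x2 max-pooled output
print(feature_map)
print(pooled)
```

The kernel responds strongly (value 30) wherever its window straddles the intensity edge and not at all (value 0) on the flat region.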
After several convolutional and pooling layers, the final output is flattened into a single vector and passed
through one or more fully connected layers. These layers are similar to those in a regular neural network and
are used to combine the features extracted by the previous layers to make final predictions, such as classifying
the image into different categories.
The output layer produces the final output of the network, such as the class scores for classification tasks. The
number of neurons in this layer corresponds to the number of classes the network is trying to predict.
What makes CNN suitable for image processing tasks over ANN?
Convolutional Neural Networks (CNNs) are highly effective for image processing due to their ability to
automatically learn spatial hierarchies of features. Here’s why they work so well:
1. Unlike traditional Artificial Neural Networks (ANNs), CNNs do not require manually extracted features.
CNN’s convolutional layers automatically detect edges, textures, patterns, and complex structures
without human intervention.
2. CNNs use pooling layers (e.g., max pooling) to reduce spatial dimensions while keeping the most
important features. This makes CNNs robust to position changes (i.e., an object can be anywhere in the
image, and CNN can still detect it).
3. Instead of fully connecting each pixel (like ANN), CNNs use small filters (kernels) that slide over the
image. This reduces the number of parameters, making CNNs computationally efficient. A 100×100
image with ANN requires 10,000 neurons, but CNN just needs a few filters to process it.
4. CNNs learn directly from raw pixel data and adjust filters automatically using backpropagation. They do
not require handcrafted features, making them highly adaptable.
In an RNN, information is fed back into the system after each step. Think of it like reading a sentence: when
you're trying to predict the next word, you don't just look at the current word but also need to remember the
words that came before to make an accurate guess. RNNs allow the network to “remember” past information by feeding
the output from one step into the next step. This helps the network understand the context of what has already
happened and make better predictions based on that. For example when predicting the next word in a
sentence the RNN uses the previous words to help decide what word is most likely to come next.
The fundamental processing unit in RNN is a Recurrent Unit. Recurrent units hold a hidden state that
maintains information about previous inputs in a sequence. Recurrent units can “remember” information from
prior steps by feeding back their hidden state, allowing them to capture dependencies across time. RNN
unfolding or unrolling is the process of expanding the recurrent structure over time steps. During unfolding
each step of the sequence is represented as a separate layer in a series illustrating how information flows
across each time step.
This unrolling enables “backpropagation through time (BPTT)” which is a learning process where errors are
propagated across time steps to adjust the network’s weights enhancing the RNN’s ability to learn
dependencies within sequential data. RNNs share similarities in input and output structures with other deep
learning architectures but differ significantly in how information flows from input to output. Unlike traditional
deep neural networks, where each dense layer has distinct weight matrices, RNNs use shared weights across
time steps, allowing them to remember information over sequences.
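A single recurrent unit with shared weights can be sketched as follows; the scalar weights are hypothetical (real RNNs use weight matrices), but the structure h_t = tanh(w_x·x_t + w_h·h_{t−1} + b) is the same at every time step:

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent unit with a scalar hidden state:
    h_t = tanh(w_x * x_t + w_h * h_prev + b).
    The same weights are reused (shared) at every time step."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

# Unroll over a short sequence; the hidden state carries context forward
h = 0.0
for x in [1.0, 0.5, -0.5]:
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0)
print(h)
```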
Application areas include Natural Language Processing, Time Series Prediction, Music Generation, and more.
● The discriminator learns to distinguish the generator's fake data from real data. The discriminator
penalizes the generator for producing implausible (i.e., unconvincing fake) results.
The discriminator data comes from two sources: Real data instances, such as real pictures of people.
The discriminator uses these instances as positive examples during training. Fake data instances
created by the generator. The discriminator uses these instances as negative examples during training.
During discriminator training the generator does not train. Its weights remain constant while it produces
examples for the discriminator to train on. During discriminator training:
1. The discriminator classifies both real data and fake data from the generator.
2. The discriminator loss penalizes the discriminator for misclassifying a real instance as fake or a
fake instance as real.
3. The discriminator updates its weights through backpropagation from the discriminator loss
through the discriminator network.
When training begins, the generator produces obviously fake data, and the discriminator quickly learns to tell
that it's fake:
As training progresses, the generator gets closer to producing output that can fool the discriminator:
Finally, if generator training goes well, the discriminator gets worse at telling the difference between real and
fake. It starts to classify fake data as real, and its accuracy decreases.
A GAN can have two loss functions: one for generator training and one for discriminator training. Among
multiple implementations, the common loss function is the minimax loss. The generator tries to minimize the
following function while the discriminator tries to maximize it:

min_G max_D V(D, G) = E_x[log D(x)] + E_z[log(1 − D(G(z)))]
In this function:
❖ D(x) is the discriminator's estimate of the probability that real data instance x is real.
❖ Ex is the expected value over all real data instances.
❖ G(z) is the generator's output when given noise z.
❖ D(G(z)) is the discriminator's estimate of the probability that a fake instance is real.
❖ Ez is the expected value over all random inputs to the generator (in effect, the expected value over all
generated fake instances G(z)).
The generator can't directly affect the log(D(x)) term in the function, so, for the generator, minimizing the loss is
equivalent to minimizing log(1 - D(G(z))).
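The minimax value can be evaluated numerically; the discriminator outputs below are hypothetical, chosen only to show that a confident discriminator attains a higher (less negative) value than a fooled one:

```python
import math

def minimax_value(d_real, d_fake):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], averaged over
    small batches of discriminator output probabilities."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical outputs early in training: D is confident and spots the fakes
confident = minimax_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# Later, the generator fools D more often, dragging the value down
fooled = minimax_value(d_real=[0.6, 0.55], d_fake=[0.5, 0.45])
print(confident, fooled)
```

The discriminator, as the maximizing player, prefers the first situation; the generator, as the minimizing player, pushes toward the second.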
The architecture for the radial basis function network (RBFN) is here-
Architecture of RBF
The architecture consists of two layers whose output nodes form a linear combination of the kernel (or basis)
functions computed by means of the RBF nodes or hidden layer nodes. The basis function (nonlinearity) in the
hidden layer produces a significant nonzero response to the input stimulus it has received only when the input
of it falls within a small localized region of the input space. This network can also be called a localized
receptive field network.
The training algorithm describes in detail all the calculations involved in the training process depicted in the
flowchart. The training is started in the hidden layer with an unsupervised learning algorithm. The training is
continued in the output layer with a supervised learning algorithm. Simultaneously, we can apply supervised
learning algorithms to the hidden and output layers for fine-tuning of the network. The training algorithm is
given as follows.
Step 0: Set the weights to small random values.
Step 1: Perform Steps 2-8 when the stopping condition is false.
Step 2: Perform Steps 3-7 for each input.
Step 3: Each input unit (xi for all i = 1 to n) receives input signals and transmits to the next hidden layer unit.
Step 4: Calculate the radial basis function.
Step 5: Select the centers for the radial basis function. The centers are selected from the set of input vectors. It
should be noted that a sufficient number of centers have to be selected to ensure adequate sampling of the
input vector space.
Step 6: Calculate the output from the hidden layer unit (a Gaussian basis function):

v_i(x) = exp(−‖x − c_i‖² / (2σ_i²))

where
x: input vector (e.g., xj1, xj2, …, xjn)
c_i: center of the i-th RBF unit
σ_i: width (spread) of the i-th RBF unit
‖x − c_i‖: Euclidean distance between x and c_i
Step 7: Calculate the output of the network unit as a weighted sum of the hidden-layer responses:

y = w_0 + Σ (i = 1 to m) w_im v_i(x)

where
m: the number of hidden layer nodes (RBF functions)
w_im: weight connecting the i-th hidden unit to the m-th output node
w_0: bias term (optional)
Step 8: Calculate the error and test for the stopping condition. The stopping condition may be a fixed number of
epochs or the weight change becoming sufficiently small.
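The hidden-layer response and the output combination can be sketched as follows; the centers, widths, and weights below are hypothetical:

```python
import math

def rbf_output(x, center, sigma):
    """Gaussian basis: phi_i(x) = exp(-||x - c_i||^2 / (2 * sigma_i^2)).
    The response is near 1 close to the center and decays toward 0."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist_sq / (2.0 * sigma ** 2))

def rbfn_predict(x, centers, sigmas, weights, bias=0.0):
    """Output node: a linear combination of the hidden-layer responses."""
    return bias + sum(w * rbf_output(x, c, s)
                      for w, c, s in zip(weights, centers, sigmas))

# Hypothetical 2-input network with two RBF units
centers = [(0.0, 0.0), (1.0, 1.0)]
sigmas  = [0.5, 0.5]
weights = [1.0, -1.0]
print(rbfn_predict((0.1, 0.1), centers, sigmas, weights))
```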
Applications of RBFN:
RBFNs are primarily used for classification tasks, but they can also be applied to regression and function
approximation problems. Some common application areas include:
1. Pattern Recognition: RBFNs are effective in recognizing patterns in data, making them useful in
image and speech recognition.
2. Time Series Prediction: They can be used to predict future values in a time series based on past data.
3. Control Systems: RBFNs are used in adaptive control systems to model and control dynamic systems.
4. Medical Diagnosis: They can assist in diagnosing diseases by classifying medical data.
Classification functions
Self-Organizing Maps: clustering, data visualization; unsupervised learning, topology-based.
The inputs from x1 to xn possess excitatory weighted connections and inputs from xn+1 to xn+m possess inhibitory
weighted interconnections. Since the firing of the output neuron is based upon the threshold, the activation
function here is defined as
For inhibition to be absolute, the threshold of the activation function should satisfy the following condition:
θ > nw − P
In the above equation, P refers to the total contribution from all inhibitory inputs. The above equation applies when all the inhibitory inputs are active, i.e., in the case of weak absolute inhibition.
For strong absolute inhibition, i.e., when only one inhibitory input is active, the equation should be modified as
θ > nw − Pmin
Here Pmin refers to the minimum contribution from the inhibitory inputs (e.g., the weight of a single inhibitory input).
Do not get confused with the firing condition, which is that the neuron fires if its net input, nw − P, is greater than or equal to the threshold θ.
The output will fire if it receives, say, “k” or more excitatory inputs but no inhibitory inputs, where
kw ≥ θ > (k − 1)w
If the neuron receives k excitatory inputs, the net input (kw) will be greater than or equal to the threshold, causing the neuron to fire; with only k − 1 excitatory inputs, the net input stays below the threshold.
The M–P neuron has no particular training algorithm. An analysis has to be performed to determine the values
of the weights and the threshold. Here the weights of the neuron are set along with the threshold to make the
neuron perform a simple logic function. The M-P neurons are used as building blocks on which we can model
any function or phenomenon, which can be represented as a logic function.
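As a sketch of such an analysis, here is an M–P neuron with one hand-chosen set of weights and thresholds realizing the AND and ANDNOT logic functions (these particular values are one valid choice among several, picked for illustration; they are not prescribed by the notes).

```python
def mp_neuron(inputs, weights, theta):
    """McCulloch-Pitts neuron: fires (outputs 1) iff the net input reaches the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= theta else 0

# AND: two excitatory inputs of weight 1; both must be on, so set theta = 2
AND = lambda x1, x2: mp_neuron([x1, x2], [1, 1], theta=2)

# ANDNOT: x2 acts through an inhibitory weight of -1; set theta = 1,
# so the neuron fires only when x1 = 1 and x2 = 0
ANDNOT = lambda x1, x2: mp_neuron([x1, x2], [1, -1], theta=1)
```

XOR, being non-linearly separable, cannot be realized by a single such unit; it needs a combination of M–P neurons, which is the point of the third exercise below.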
Do it yourself
Q.1 Implement AND function using McCulloch–Pitts neuron (take binary data).
Q.2 Implement ANDNOT function using McCulloch–Pitts neuron (use binary data representation). In the case
of the ANDNOT function, the response is true if the first input is true and the second input is false. For all other
input variations, the response is false.
Q.3 Implement XOR function using McCulloch–Pitts neuron (use binary data representation).
Hebb Network
Donald Hebb stated in 1949
“When an axon of cell A is near enough to excite cell B, and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”
According to the Hebb rule, the weight vector is found to increase proportionately to the product of the input
and the learning signal which is equal to the neuron’s output. In Hebb learning, if two interconnected neurons
are ‘on’ simultaneously then the weights associated with these neurons can be increased by the modification
made in their synaptic gap (strength). The weight update in the Hebb rule is given by wi(new) = wi(old) + xi·y.
The Hebb rule is more suited for bipolar data than binary data.
Flowchart of Training Algorithm
❖ Step 0: First initialize the weights. Basically in this
network they may be set to zero, i.e., wi = 0 for i = 1
to “n” where n may be the total number of input
neurons.
❖ Step 1: Steps 2–4 have to be performed for each
input training vector and target output pair, s : t.
❖ Step 2: Input unit activations are set. Generally, the
activation function of the input layer is the identity function:
xi = si for i = 1 to n.
❖ Step 3: The output unit activation is set: y = t.
❖ Step 4: Weight adjustments and bias adjustments
are performed:
wi(new) = wi(old) + xiy
b(new)= b(old) + y
wi: weight of the ith connection between input and output neuron
η: Learning rate (equal to 1 in the update above)
x: Input value from the input neuron
y: Output value from the output neuron
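The training steps above can be sketched in a few lines. The bipolar AND data below is a common illustration (an assumption, not taken from the notes), and the rule is applied with η = 1, so the update reduces to wi(new) = wi(old) + xi·y.

```python
def hebb_train(samples, n_inputs):
    """Hebb rule training: w_i(new) = w_i(old) + x_i * y, b(new) = b(old) + y."""
    w = [0.0] * n_inputs       # Step 0: weights start at zero
    b = 0.0
    for x, t in samples:       # Steps 1-2: one pass over the training pairs s : t
        y = t                  # Step 3: output unit is clamped to the target
        for i in range(n_inputs):
            w[i] += x[i] * y   # Step 4: weight adjustment
        b += y                 # Step 4: bias adjustment
    return w, b

# Bipolar AND: target is +1 only for the input (+1, +1)
and_data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w, b = hebb_train(and_data, 2)
```

After one pass over the four pairs, the weights grow to w = [2, 2] with b = −2, which is the classic textbook result for the bipolar AND function; this also illustrates why the rule suits bipolar data better than binary data, where zero inputs would leave weights unchanged.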
The Hebbian learning rule can thus be summarized as Δwi = η·xi·y.
Perceptron Networks
Let us understand the linear separability first with an example. Imagine you have a table with a bunch of fruits:
apples and oranges. Your task is to separate the apples from the oranges using a straight stick (like a ruler).
Summarization: Linear separability means you can draw a straight line (or a flat plane in higher dimensions)
to separate two groups of things (like apples and oranges). If you cannot draw such a straight line, the data is
not linearly separable.
The perceptron is the simplest form of a neural network used for the classification of patterns said to be linearly
separable (i.e., patterns that lie on opposite sides of a hyperplane). Basically, it consists of a single neuron with
adjustable synaptic weights and bias.
Rosenblatt proved that if the patterns (vectors) used to train the perceptron are drawn from two linearly
separable classes, then the perceptron algorithm converges (i.e., it eventually finds a solution) and positions the
decision surface in the form of a hyperplane between the two classes. The proof of convergence of the
algorithm is known as the perceptron convergence theorem.
The perceptron built around a single neuron is limited to performing pattern classification with only two classes
(hypotheses). By expanding the output (computation) layer of the perceptron to include more than one neuron,
we may correspondingly perform classification with more than two classes.
The goal of the perceptron is to correctly classify the set of externally applied stimuli (i.e. input data) x1, x2 ... xm
into one of two classes C1 and C2. The decision rule for the classification is to assign the point represented by
the inputs x1, x2, ..., xm to class C1 if the perceptron output y is +1 and to class C2 if it is -1.
The synaptic weights of the perceptron are denoted by w1, w2, ..., wm. Correspondingly, the inputs applied to the
perceptron are denoted by x1, x2, ..., xm. The externally applied bias is denoted by b. From the model, we find
that the hard limiter input, or induced local field, of the neuron is v = w1x1 + w2x2 + … + wmxm + b.
To develop insight into the behavior of a pattern classifier, it is customary to plot a map of the decision regions
in the m-dimensional signal space spanned by the m input variables x1, x2, ..., xm. In the simplest form of the
perceptron, there are two decision regions separated by a hyperplane, which is defined by w1x1 + w2x2 + … + wmxm + b = 0.
Take a look at the figure for the case of two input variables x1 and x2, for which the decision boundary takes the
form of a straight line.
A point (x1, x2) that lies above the boundary line is assigned to class C1, and a point (x1, x2) that lies below the
boundary line is assigned to class C2. Note also that the effect of the bias b is merely to shift the decision
boundary away from the origin. The synaptic weights w1, w2, ..., wm of the perceptron can be adapted on an
iteration-by-iteration basis.
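A minimal sketch of this iteration-by-iteration adaptation is the classic error-driven perceptron rule, where weights change only on misclassified points. The AND data (a linearly separable problem) and the learning rate below are illustrative assumptions, not values from the notes.

```python
def perceptron_train(samples, alpha=1.0, max_epochs=100):
    """Perceptron learning rule with bipolar targets: +1 for class C1, -1 for class C2.

    The output is the hard-limited sign of w.x + b; weights are adjusted only
    when a training point is misclassified.
    """
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
            if y != t:                                        # update only on mistakes
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                b += alpha * t
                errors += 1
        if errors == 0:                                       # every point classified correctly
            return w, b
    return w, b

# AND is linearly separable, so the algorithm converges, as the
# perceptron convergence theorem guarantees
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = perceptron_train(and_data)
```

On a non-linearly-separable set such as XOR, the same loop would never reach an error-free epoch, which is the failure mode discussed later in the XOR section.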
For the perceptron to function properly, the two classes C1 and C2 must be linearly separable. This, in turn,
means that the patterns to be classified must be sufficiently separated from each other to ensure that the
decision surface consists of a hyperplane. This requirement is illustrated in the figure below for the case of a
two-dimensional perceptron. In part (a) of the figure, the two classes C1 and C2 are sufficiently separated from
each other for us to draw a hyperplane (in this case, a straight line) as the decision boundary. If, however, the
two classes C1 and C2 are allowed to move too close to each other, as in part (b) of the figure, they become
nonlinearly separable, a situation that is beyond the computing capability of the perceptron.
1. Cost Function
The Mean Squared Error (MSE) cost function is
MSE = (1/n) Σ (yi − ŷi)²
Where:
yi : Actual value
^y i : Predicted value
n : Number of data points.
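Using the variables listed above (actual values yi, predicted values ŷi, and n data points), the MSE can be computed directly; the data values below are illustrative, not from the notes.

```python
def mse(ys, preds):
    """Mean Squared Error: the average of the squared differences
    between actual values and predicted values."""
    n = len(ys)
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / n

# Illustrative actual vs. predicted values; each prediction is off by 0.5
error = mse([2, 4, 6, 8], [2.5, 3.5, 6.5, 7.5])
```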
2. Gradient Vector
A partial derivative measures how a function changes when you vary only one variable, while keeping all
other variables constant.
Consider f(x, y) = x² + 3y. Partial derivative with respect to x: the derivative of x² with respect to x is 2x, and since 3y is treated as a constant, its derivative is 0, so ∂f/∂x = 2x.
Partial derivative with respect to y:
The term x² is treated as a constant, so its derivative is 0. The derivative of 3y with respect to y is 3, so ∂f/∂y = 3.
The gradient vector is simply a vector of partial derivatives and points in the direction of the steepest ascent.
The gradient vector (denoted as ∇f, pronounced “nabla f”) is formed by collecting all partial derivatives: ∇f = (∂f/∂x, ∂f/∂y) = (2x, 3).
The function changes most rapidly in the direction of (2x,3). If we move in the direction of this gradient, the
function f(x,y) increases fastest.
This represents the rate of change of the cost function with respect to w i.
3. Chain Rule
The chain rule helps us find the derivative of a function that is composed of two or more functions. In simple
terms, it tells us how to take the derivative of a "function inside a function."
● If y=f(g(x)), then f is the "outer function," and g is the "inner function."
● The chain rule helps us find the derivative of y with respect to x.
In words:
● Take the derivative of the outer function (f) with respect to the inner function (g).
● Multiply it by the derivative of the inner function (g) with respect to x.
An Example
Let’s say:
Here
● The outer function is f(g) = g2
● The inner function is g(x) = 3x + 2.
Substitute g = 3x + 2
So, the derivative of y is dy/dx = 2(3x + 2) · 3 = 6(3x + 2) = 18x + 12.
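Combining the outer derivative 2(3x + 2) with the inner derivative 3 gives dy/dx = 6(3x + 2), which can be checked numerically against a finite-difference approximation:

```python
def y(x):
    g = 3 * x + 2          # inner function g(x) = 3x + 2
    return g ** 2          # outer function f(g) = g^2

def dy_dx(x):
    # Chain rule: f'(g) * g'(x) = 2*(3x + 2) * 3 = 6*(3x + 2)
    return 6 * (3 * x + 2)

# Numerical check with a central finite difference at an arbitrary point
x0, h = 1.5, 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)
```

At x0 = 1.5 the analytic derivative is 6 · 6.5 = 39, and the finite difference agrees to high precision.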
Note:
In Gradient Descent, we use the chain rule to compute the gradient of the cost function. For example:
● The cost function J(w) depends on the predicted value ^y .
● The predicted value ^y depends on the weight w.
We use the chain rule
Here
4. Learning Rate (α)
The learning rate (α) affects the convergence of the ANN. It controls the size of the steps taken during
parameter updates. The range of α is typically from 0 to 1.
Step-1: Initialize Weights: Start with random values for the weights (wi), bias (b) and learning rate (α).
Step-2: Compute Gradient: Calculate the gradient of the cost function with respect to each weight and bias:
Step-3: Update Weights: Adjust the weights in the opposite direction of the gradient:
Step-4: Repeat: Repeat steps 2 and 3 until one of the stopping criteria is met
● Maximum number of iterations is reached.
● The step size becomes smaller than a predefined tolerance.
An Example
Input: House sizes (x) in square feet
Output: House prices (y) in thousands of dollars.
Model: Linear regression model ^y = wx + b, where: w is weight (slope) and b is bias (intercept)
Goal: Use Gradient Descent to find the optimal values of w and b that minimize the Mean Squared Error (MSE)
cost function.
Solution
Step-1: Let us initialize w = 0 and b = 0 and α = 0.1
Step-2 & 3:
Iteration-01
Compute predicted output using formula ^y = wx + b
ŷ1 = 0·1 + 0 = 0  ŷ2 = 0·2 + 0 = 0  ŷ3 = 0·3 + 0 = 0  ŷ4 = 0·4 + 0 = 0
Update Parameters:
Iteration-02
Compute predicted output using formula ^y = wx + b
ŷ1 = 1.5·1 + 0.5 = 2  ŷ2 = 1.5·2 + 0.5 = 3.5  ŷ3 = 1.5·3 + 0.5 = 5  ŷ4 = 1.5·4 + 0.5 = 6.5
Update Parameters:
Iteration-03
Compute predicted output using formula ^y = wx + b
ŷ1 = 1.75·1 + 0.575 = 2.325  ŷ2 = 1.75·2 + 0.575 = 4.075
ŷ3 = 1.75·3 + 0.575 = 5.825  ŷ4 = 1.75·4 + 0.575 = 7.575
Update Parameters:
● Continue iterating until the changes in w and b become very small (e.g. <0.001).
● After several iterations, w and b will converge to their optimal values. For this example, they will
approach w = 2, b = 0 & the final model will be ^y =2 x
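The full loop can be reproduced in a few lines. The dataset (x = [1, 2, 3, 4], y = [2, 4, 6, 8]) and the cost J = (1/2n)·Σ(y − ŷ)² are assumptions chosen to be consistent with the worked iterations above: with them the first update yields w = 1.5 and b = 0.5, the second yields w = 1.75 and b = 0.575, and the loop converges toward w = 2, b = 0.

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Batch gradient descent for y_hat = w*x + b with cost J = (1/2n) * sum((y - y_hat)^2)."""
    w, b, n = 0.0, 0.0, len(xs)                             # Step 1: initialize parameters
    for _ in range(iters):
        preds = [w * x + b for x in xs]                     # forward predictions
        errs = [y - p for y, p in zip(ys, preds)]
        grad_w = -sum(x * e for x, e in zip(xs, errs)) / n  # Step 2: dJ/dw
        grad_b = -sum(errs) / n                             # Step 2: dJ/db
        w -= alpha * grad_w                                 # Step 3: move against the gradient
        b -= alpha * grad_b
    return w, b

# Hypothetical house-size/price data consistent with the iterations: price = 2 * size
xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]
w, b = gradient_descent(xs, ys)
```

Running a single iteration reproduces the first hand-computed update exactly, and after enough iterations the parameters settle at the optimal w ≈ 2, b ≈ 0.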
● After the first four iterations (where you’ve used all four data points), you simply start over from the first
data point and continue the process. This is called cycling through the dataset.
Do it Yourself
Redo the same question adjusting the weight only, not the bias.
● Optimal: balanced steps that converge efficiently to the minimum.
Advantages: 1. Fast and stable convergence. 2. Efficient use of computational resources.
Disadvantage: 1. Requires tuning to find the right value.
● Too High: large steps that may overshoot the minimum.
Advantage: 1. Faster initial progress.
Disadvantages: 1. Oscillations around the minimum. 2. Risk of divergence (moving away from the minimum).
Regression: 1 neuron (linear activation)
Multi-Class Classification: n neurons
An MLP is a fully connected feedforward neural network, meaning that each neuron in one layer is connected
to every neuron in the next layer. It uses activation functions such as Sigmoid or Tanh to introduce
non-linearity, enabling it to learn complex patterns in data.
● Feature Learning
Lower Layers: Detect simple patterns, such as edges, textures, or basic shapes.
Deeper Layers: Detect abstract concepts, such as objects or high-level features.
● Non-Linearity: Introducing non-linearity using activation functions, which allows the model to solve
complex problems. Without non-linearity, an MLP would be equivalent to a linear model, incapable of
solving complex problems.
● Representation Learning: Hidden layers transform raw input data into meaningful representations that
make it easier for the output layer to perform classification or regression.
● Capturing Relationships: Hidden layers can capture complex relationships between input features
that are not easily separable in lower-dimensional space.
Applications of MLP
● Classification: Image classification, spam detection, sentiment analysis.
● Regression: Predicting house prices, stock prices, or temperature.
● Pattern Recognition: Handwriting recognition, speech recognition.
● Function Approximation: Approximating complex mathematical functions.
BACK-PROPAGATION NETWORK
A back-propagation neural network is a multilayer, feed-forward neural network consisting of an input layer, a
hidden layer and an output layer. The neurons present in the hidden and output layers have biases, which are
the connections from the units whose activation is always 1. The bias terms also act as weights.
The figure above shows the architecture of a BPN, depicting only the direction of information flow for the
feed-forward phase. During the back-propagation phase of learning, signals are sent in the reverse direction.
The inputs are sent to the BPN and the output obtained from the net could be either binary (0, 1) or bipolar (–1,
+1). The activation function could be any function which increases monotonically and is also differentiable.
δk: Error correction weight adjustment for Wjk that is due to an error at output unit yk, which is back-propagated
to the hidden units that feed into unit yk
δj: Error correction weight adjustment for vij that is due to the back-propagation of error to the hidden unit zj.
Also, it should be noted that the commonly used activation functions are binary sigmoidal and bipolar sigmoidal
activation functions. The range of binary sigmoid is from 0 to 1, and for bipolar sigmoid it is from –1 to +1.
These functions are used in the BPN because of the following characteristics
(i) continuity
(ii) differentiability
(iii) nondecreasing monotony
The error back-propagation learning algorithm can be outlined in the following algorithm:
Step 0: Initialize weights and learning rate (take some small random values).
Step 1: Perform Steps 2–9 when stopping condition is false.
Step 2: Perform Steps 3–8 for each training pair.
Step 3: Each input unit receives input signal xi and transmits it to the hidden units.
Step 4: Each hidden unit zj (j = 1 to p) sums its weighted input signals to obtain the net input zinj. Calculate the output of the hidden unit by applying its activation function over zinj (binary or bipolar sigmoidal
activation function), zj = f(zinj),
and send the output signal from the hidden unit to the input of the output layer units.
Step 5: For each output unit yk (k = 1 to m), calculate the net input:
Step 6: Each output unit computes its error correction term, δk = (tk − yk) f′(yink). On the basis of the calculated error correction term, update the change in weights and bias: ΔWjk = α δk zj and ΔW0k = α δk.
Step 7: Each hidden unit (zj, j = 1 to p) sums its delta inputs from the output units: δinj = Σk δk Wjk.
The term δinj gets multiplied with the derivative of f(zinj) to calculate the error term: δj = δinj f′(zinj).
On the basis of the calculated δj, update the change in weights and bias: Δvij = α δj xi and Δv0j = α δj.
Step 8: Each output unit (yk, k = 1 to m) updates its weights and bias, Wjk(new) = Wjk(old) + ΔWjk, and each hidden unit (zj, j = 1 to p) updates its weights and bias, vij(new) = vij(old) + Δvij.
Step 9: Check for the stopping condition. The stopping condition may be a certain number of epochs reached
or when the actual output equals the target output.
The above algorithm uses the incremental approach for updating the weights, i.e., the weights are changed
immediately after a training pattern is presented (this works like online training). There is another way of
training, called batch-mode training, where the weights are changed only after all the training patterns are
presented. Batch-mode training requires additional local storage for each connection to maintain the
immediate weight changes.
The training of a BPN is based on the choice of various parameters. Also, the convergence of the BPN is
based on some important learning factors such as the initial weights, the learning rate, the updation rule, the
size and nature of the training set, and the architecture (number of layers and number of neurons per layer).
An Example
Using a back-propagation network, find the new weights for the net shown in the figure below. It is presented
with the input pattern [0, 1] and the target output is 1. Use a learning rate α = 0.25 and the binary sigmoidal
activation function.
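The figure with the initial weights is not reproduced in these notes, so the sketch below performs one feed-forward plus back-propagation pass for a 2-2-1 network with hypothetical starting weights (an assumption for illustration); the δk, δj, and weight-change formulas follow the algorithm above with the binary sigmoid.

```python
import math

def sigmoid(z):
    # Binary sigmoid; its derivative is f(z) * (1 - f(z))
    return 1.0 / (1.0 + math.exp(-z))

def bpn_one_step(x, t, v, b_hidden, w, b_out, alpha=0.25):
    """One feed-forward plus back-propagation pass for a 2-2-1 network."""
    # Feed-forward: hidden layer, then output layer
    z_in = [v[0][j] * x[0] + v[1][j] * x[1] + b_hidden[j] for j in range(2)]
    z = [sigmoid(s) for s in z_in]
    y_in = w[0] * z[0] + w[1] * z[1] + b_out
    y = sigmoid(y_in)
    # Output error term: delta_k = (t - y) * f'(y_in)
    delta_k = (t - y) * y * (1 - y)
    # Hidden error terms: delta_inj = delta_k * w[j], multiplied by f'(z_inj)
    delta_j = [delta_k * w[j] * z[j] * (1 - z[j]) for j in range(2)]
    # Weight and bias updates
    w_new = [w[j] + alpha * delta_k * z[j] for j in range(2)]
    b_out_new = b_out + alpha * delta_k
    v_new = [[v[i][j] + alpha * delta_j[j] * x[i] for j in range(2)] for i in range(2)]
    b_hidden_new = [b_hidden[j] + alpha * delta_j[j] for j in range(2)]
    return w_new, b_out_new, v_new, b_hidden_new, y

# Hypothetical starting weights (NOT the figure's values, which are not shown here)
v = [[0.6, -0.3], [-0.1, 0.4]]   # v[i][j]: input i -> hidden j
b_hidden = [0.3, 0.5]
w = [0.4, 0.1]                   # hidden j -> output
b_out = -0.2
w_new, b_out_new, v_new, b_hidden_new, y = bpn_one_step([0, 1], 1, v, b_hidden, w, b_out)
```

Because the target 1 exceeds the sigmoid output y, the error term δk is positive and all hidden-to-output weights move upward, which is the qualitative behaviour to check when doing the exercise by hand.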
Do it yourself
Find the new weights, using a back-propagation network, for the network shown in the figure below. The network
is presented with the input pattern [−1, 1] and the target output is +1. Use a learning rate of α = 0.25 and the
bipolar sigmoidal activation function.
● Model type: simple linear models (e.g., linear regression) vs. complex models (e.g., neural networks).
● Gradient computation: approximates the gradient, because it is computed using a single data point, vs. computes the exact gradient using the chain rule over all data points.
● Learning type: online learning (real-time adaptation) vs. batch or mini-batch learning.
● Use cases: online learning, real-time systems vs. deep learning, multi-layer networks.
XOR Problem
In Rosenblatt’s single-layer perceptron, there are no hidden neurons. Consequently, it cannot classify input patterns that
are not linearly separable. However, nonlinearly separable patterns commonly occur. For example, this situation arises in
the exclusive-OR (XOR) problem, which may be viewed as a special case of a more general problem, namely, that of
classifying points in the unit hypercube. (An n-dimensional hypercube has 2^n vertices; here the space is two-dimensional,
so it has 4 vertices. The "unit" in unit hypercube means that each dimension is constrained to values between 0 and 1,
and since we have binary inputs here, the values will be exactly 0 and 1.)
However, in the special case of the XOR problem, we need to consider only the four corners of a unit square that correspond
to the input patterns (0,0), (0,1), (1,1), and (1,0), where a single bit (i.e., binary digit) changes as we move from one corner
to the next.
Where ⨁ denotes the exclusive-OR boolean function operator. The input patterns (0,0) and (1,1) are at opposite corners
of the unit square, yet they produce the identical output 0. On the other hand, the input patterns (0,1) and (1,0) are also at
opposite corners of the square, but they are in class 1, as shown by
1⨁0=1
and
0⨁1=1
We first recognize that the use of a single neuron with two inputs results in a straight line for a decision boundary in the
input space. For all points on one side of this line, the neuron outputs 1; for all points on the other side of the line, it
outputs 0. The position and orientation of the line in the input space are determined by the synaptic weights of the neuron
connected to the input nodes and the bias applied to the neuron. With the input patterns (0,0) and (1,1) located on opposite
corners of the unit square, and likewise for the other two input patterns (0,1) and (1,0), it is clear that we cannot construct
a straight line for a decision boundary so that (0,0) and (1,1) lie in one decision region and (0,1) and (1,0) lie in the other
decision region. In other words, the single-layer perceptron cannot solve the XOR problem.
However, we may solve the XOR problem by using a single hidden layer with two neurons (as in figure below along with
its diagram of signal flow)
Architectural graph of network for solving the XOR problem Signal-flow graph of the network
o w11: Weight from x1 to Neuron 1.
o w12: Weight from x1 to Neuron 2.
o w21: Weight from x2 to Neuron 1.
o w22: Weight from x2 to Neuron 2.
● bi: Bias term for neuron i.
The slope of the decision boundary constructed by this hidden neuron is equal to -1. Here is the calculation
For an input pattern (x1,x2), the weighted sum z1 for Neuron 1 is calculated as:
z1=w11⋅x1 + w21⋅x2 + b1
The decision boundary is the line where the neuron's output transitions from 0 to 1. This occurs when the
weighted sum z1 is exactly 0:
z1 = 0 ⟹ x1 + x2 − 1.5=0
x2 = -x1 + 1.5
This is the equation of the decision boundary line for Neuron 1. It has:
● A slope of −1 (the coefficient of x1 is −1)
● An intercept of 1.5 on the x2-axis (when x1 = 0, x2 = 1.5)
The decision boundary is a straight line that passes through the points (0, 1.5) and (1.5, 0), positioned as shown in the figure.
The slope of the decision boundary constructed by this hidden neuron is equal to -1. Here is the calculation
For an input pattern (x1,x2), the weighted sum z2 for Neuron 2 is calculated as:
z2=w12⋅x1 + w22⋅x2 + b2
The decision boundary is the line where the neuron's output transitions from 0 to 1. This occurs when the
weighted sum z2 is exactly 0:
z2 = 0 ⟹ x1 + x2 − 0.5=0
x2 = -x1 + 0.5
This is the equation of the decision boundary line for Neuron 2. It has:
● A slope of −1 (the coefficient of x1 is −1)
● An intercept of 0.5 on the x2-axis (when x1 = 0, x2 = 0.5)
The orientation and position of the decision boundary constructed by this second hidden neuron are as shown in the figure.
Say the output from Neuron 1 is a1 and the output from Neuron 2 is a2; the net input to the output neuron (Neuron 3) is then
z3 = −2·a1 + 1·a2 − 0.5
The function of the output neuron is to construct a linear combination of the decision boundaries formed by the
two hidden neurons. The result of this computation is as follows.
The activation function for the neuron is assumed to be a step function, which outputs 1 if the weighted sum of
the inputs is greater than or equal to 0, and 0 otherwise.
Input: (0,0)
Neuron 1: z₁ = 1⋅0 + 1⋅0 - 1.5 = -1.5 < 0 ⟹ a₁ = 0
Neuron 2: z₂ = 1⋅0 + 1⋅0 - 0.5 = -0.5 < 0 ⟹ a₂ = 0
Output Neuron (Neuron 3): z₃ = (-2)⋅0 + 1⋅0 - 0.5 = -0.5 < 0 ⟹ Output = 0. Matches XOR: 0 ⊕ 0 = 0
Input: (0,1)
Neuron 1: z₁ = 1⋅0 + 1⋅1 - 1.5 = -0.5 < 0 ⟹ a1 = 0
Neuron 2: z₂ = 1⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ a₂ = 1
Output Neuron (Neuron 3): z3 = (-2)⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ Output = 1. Matches XOR: 0 ⊕ 1 = 1
Input: (1,0)
Neuron 1: z₁ = 1⋅1 + 1⋅0 - 1.5 = -0.5 < 0 ⟹ a1 = 0
Neuron 2: z2 = 1⋅1 + 1⋅0 - 0.5 = +0.5 ≥ 0 ⟹ a2 = 1
Output Neuron (Neuron 3): z3 = (-2)⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ Output = 1. Matches XOR: 1 ⊕ 0 = 1
Input: (1,1)
Neuron 1: z1 = 1⋅1 + 1⋅1 - 1.5 = +0.5 ≥ 0 ⟹ a1 = 1
Neuron 2: z2 = 1⋅1 + 1⋅1 - 0.5 = +1.5 ≥ 0 ⟹ a2 = 1
Output Neuron (Neuron 3): z₃ = (-2)⋅1 + 1⋅1 - 0.5 = -1.5 < 0 ⟹ Output = 0. Matches XOR: 1 ⊕ 1 = 0
The bottom hidden neuron has an excitatory (positive) connection to the output neuron, whereas the top
hidden neuron has an inhibitory (negative) connection to the output neuron. When both hidden neurons are off,
which occurs when the input pattern is (0,0), the output neuron remains off. When both hidden neurons are on,
which occurs when the input pattern is (1,1), the output neuron is switched off again because the inhibitory
effect of the larger negative weight connected to the top hidden neuron overpowers the excitatory effect of the
positive weight connected to the bottom hidden neuron. When the top hidden neuron is off and the bottom
hidden neuron is on, which occurs when the input pattern is (0,1) or (1,0), the output neuron is switched on
because of the excitatory effect of the positive weight connected to the bottom hidden neuron.
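The whole construction above can be verified in a few lines, using the step (hard-limit) activation and the weights from the hand calculations (w11 = w21 = w12 = w22 = 1, b1 = −1.5, b2 = −0.5, and output weights −2 and +1 with bias −0.5):

```python
def step(z):
    # Hard-limit activation: fires (1) when the weighted sum is >= 0
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    a1 = step(1 * x1 + 1 * x2 - 1.5)     # hidden neuron 1: boundary x1 + x2 = 1.5
    a2 = step(1 * x1 + 1 * x2 - 0.5)     # hidden neuron 2: boundary x1 + x2 = 0.5
    return step(-2 * a1 + 1 * a2 - 0.5)  # output neuron combines the two boundaries

# Evaluate all four corners of the unit square
truth_table = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```

The resulting truth table matches XOR exactly, confirming that the inhibitory −2 weight from the top hidden neuron switches the output off precisely when both hidden neurons fire.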
● Deep learning models can incorporate time-series data to predict acute cardiac events
2. Cancer Detection and Classification
● Convolutional Neural Networks (CNNs) identify malignant patterns in imaging data
● Particularly successful in breast cancer detection from mammograms and skin cancer
identification from dermatological images
● Research shows some AI systems matching or exceeding dermatologist accuracy in melanoma
detection
3. Diabetes Risk Assessment
● ANNs predict diabetes onset by analyzing blood glucose patterns, BMI, age, and other
biomarkers
● Recurrent Neural Networks (RNNs) can track changes over time to predict progression from
pre-diabetes to diabetes
Case Study: Diabetic Retinopathy Detection
Google's DeepMind developed a system using CNNs to identify diabetic retinopathy from retinal scans.
The system achieved over 90% accuracy, comparable to human ophthalmologists, potentially allowing
earlier intervention in areas with limited specialist access.
● Neural networks optimize for risk-adjusted returns across various market conditions
● Can incorporate multiple objectives like volatility minimization and return maximization
JPMorgan developed the LOXM (Limit Order Execution) system using deep learning to execute equity trades
at optimal prices. The system analyzes market conditions and historical patterns to minimize market impact
while achieving best execution prices, outperforming human traders in many scenarios.
● Retinal scan analysis for diabetic retinopathy and other eye conditions
3. Autonomous Vehicles
● Object detection and classification (pedestrians, vehicles, road signs)
2. Transcription Services
● Real-time meeting transcription
1. Manufacturing Automation
● Visual inspection and quality control
● Anomaly detection in assembly lines
3. Agriculture
● Autonomous harvesting robots
4. Autonomous Vehicles
● Self-driving cars and trucks
NVIDIA has developed Isaac Sim, a robotics simulation platform that uses neural networks to generate
synthetic training data. This enables sim-to-real transfer learning, where robots train in virtual environments
before deploying skills in the physical world.
2. Text Classification
● Sentiment analysis for product reviews and social media monitoring
● Topic classification for news articles and documents
● Spam detection and content moderation
● Intent recognition for conversational systems
3. Question Answering
● Extractive QA systems locate answers within reference documents
● Generative QA systems formulate original answers based on knowledge
● Domain-specific systems for customer support and information retrieval
● Open-domain QA for general knowledge questions
OpenAI's GPT models (and subsequently similar models like Claude) demonstrated that neural networks
trained on massive text corpora can generate coherent, contextually appropriate text across diverse topics.
These models showcase emergent abilities including complex reasoning, code generation, and creative
writing, highlighting how scale and architecture innovations can produce systems with capabilities beyond their
explicit training objectives.
Keras, on the other hand, is the user-friendly interface built on top of TensorFlow, designed to simplify the
process of creating neural networks. Originally an independent library, Keras is now TensorFlow’s official
high-level API, offering intuitive tools to construct models with minimal code. Imagine Keras as the "smart home
system" that lets you control the electric grid with a simple app. Instead of wiring circuits manually (coding
low-level math), you use preconfigured switches (layers like Dense or Conv2D) to build models effortlessly.
Together, TensorFlow and Keras form a seamless partnership: TensorFlow handles the gritty details of
optimization and hardware acceleration, while Keras provides a clean, modular way to design experiments.
This combo is why they dominate industries—from healthcare (diagnosing diseases) to entertainment (Netflix
recommendations). For students, Keras lowers the barrier to entry, while TensorFlow ensures your skills scale
to real-world challenges.
"If TensorFlow is the engine and gears of a high-performance car, Keras is the steering wheel and dashboard
—giving you control without needing to be a mechanical engineer."
A simple code example to check the TensorFlow version:
import tensorflow as tf
print(tf.__version__)
Click on the run button available on the left-hand side of the code; you will get the TensorFlow version, which is
2.18.0
A note about the MNIST (Modified National Institute of Standards and Technology) Database
The MNIST dataset is the quintessential starting point for anyone learning machine learning and computer
vision. It consists of 70,000 handwritten digits (0–9), split into 60,000 training images and 10,000 test images,
each grayscale and sized at 28×28 pixels.
The MNIST dataset has become the quintessential starting point for machine learning and computer vision due
to its simplicity, accessibility, and well-structured format. Its small image size (28x28 pixels) and grayscale
format reduce computational complexity, making it ideal for beginners to experiment with algorithms without
needing high-end hardware. The dataset's clean, centered digits and balanced class distribution allow
newcomers to focus on core concepts like data preprocessing, model training, and evaluation metrics without
getting bogged down by noise or class imbalances. Additionally, MNIST's integration into popular libraries like
TensorFlow and PyTorch ensures easy access, enabling rapid prototyping and benchmarking.
Despite its widespread use, MNIST has notable limitations. Its simplicity, while great for beginners, fails to
capture real-world challenges like varying backgrounds, lighting conditions, or distorted handwriting,
leading to inflated accuracy scores (often >99%) that don't translate to practical applications. The dataset's
uniformity also means models trained on MNIST struggle with more complex tasks, exposing a gap between
academic exercises and real-world problems.
https://colab.research.google.com/drive/1WO6Cq2ihoipDjkaq2YhUf2HMGgFXBoRm?usp=sharing
The next topic is the perceptron convergence theorem, but before talking about this, let us discuss a
few important terms.
❖ A vector is a one-dimensional array of numbers, either representing inputs, weights, or outputs. A
vector can indeed be thought of as a matrix of order n×1, where n is the number of elements in the
vector. A vector is often represented as a column vector, which is a matrix with n rows and 1 column
(n×1). For example, a vector v with 3 elements can be written as:
❖ A Hyperplane is a geometric entity that separates a space into two distinct parts. In n-dimensional space, a hyperplane is defined by the equation w1x1 + w2x2 + … + wnxn + b = 0
Where:
● w1, w2, …, wn are the weights
● x1, x2, …, xn are the input features
● b is the bias term.
A. In one-dimensional space, a hyperplane is simply a point. For example, on a number line, a point x =
c can separate the space into two regions: x < c and x > c.
B. In two-dimensional space, a hyperplane is a line.
C. In three-dimensional space, a hyperplane is a plane.
The specific type of norm depends on the subscript or context. For example:
||W||1: L1 norm (sum of absolute values).
||W||2: L2 norm (Euclidean norm).
||W||p: Lp norm (generalized norm).
This represents the "straight-line distance" from the origin to the point defined by the vector W in
Euclidean space.
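These norms can be sketched for a small example vector (the vector itself is illustrative):

```python
def lp_norm(w, p):
    """Generalized Lp norm: (sum of |w_i|^p) raised to the power 1/p."""
    return sum(abs(x) ** p for x in w) ** (1.0 / p)

w = [3.0, -4.0]        # illustrative 2-D weight vector
l1 = lp_norm(w, 1)     # L1 norm: sum of absolute values -> 7.0
l2 = lp_norm(w, 2)     # L2 (Euclidean) norm: straight-line length -> 5.0
```

The L2 value 5.0 is exactly the 3-4-5 right-triangle hypotenuse, i.e., the straight-line distance from the origin to the point (3, −4).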
|⟨u, v⟩| ≤ ||u||2 · ||v||2
(The same can be written as |⟨u, v⟩| ≤ ||u|| · ||v|| if the Euclidean norm is meant by default.)
Where:
⟨u, v⟩ represents the inner product of vectors u and v, and |⟨u, v⟩| its absolute value
||u||2 and ||v||2 represent the Euclidean norms of the vectors u and v respectively
Therefore: 32 ≤ 32.84
The inequality holds, as expected. Note that the values aren't exactly equal, which tells us that these
two vectors aren't scalar multiples of each other.
Note: The equality holds if and only if one vector is a scalar multiple of the other (meaning they're linearly dependent; when vectors are linearly dependent, the angle between them is either 0° (same direction) or 180° (opposite direction)).
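The inequality is easy to verify numerically. The vectors u and v below are assumed for illustration; their inner product happens to be 32, matching the worked value above (the right-hand side comes out to about 32.8):

```python
import numpy as np

# Assumed example vectors; their inner product is 32, and the product of their
# Euclidean norms is about 32.83, so Cauchy-Schwarz holds with strict inequality.
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

lhs = abs(np.dot(u, v))                        # |<u, v>|
rhs = np.linalg.norm(u) * np.linalg.norm(v)    # ||u||2 * ||v||2

print(lhs, rhs)   # not equal, so u and v are not scalar multiples of each other
```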
❖ General Strategy for Tightening Inequalities: The idea of eliminating redundant terms to tighten an
inequality is based on the following points:
1. Redundancy: If a term in an inequality is already included in another term (e.g., as part of a
sum), explicitly writing it separately does not provide additional information.
2. Monotonicity of Inequalities: If A ≤ B + C and C is already included in B (i.e., B = C + D), then A ≤ B + C can be replaced by the tighter inequality A ≤ B, since the extra term C adds no new information.
The principle given above is useful when applying the Cauchy–Schwarz inequality in the Perceptron Convergence Theorem, where we try to remove unnecessary terms to tighten the inequalities.
To derive the error-correction learning algorithm for the perceptron, we find it more convenient to work with the
modified signal-flow graph model in figure below
The only difference here is that the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1. We may thus define the (m + 1)-by-1 input vector
X(n) = [+1, x1(n), x2(n), …, xm(n)]^T
and, correspondingly, the (m + 1)-by-1 weight vector
W(n) = [b(n), w1(n), w2(n), …, wm(n)]^T
The T in the superscript stands for the transpose operation. The n denotes the time-step when the algorithm is
applied. A time-step (denoted by n) represents a specific iteration or update in the algorithm. It is the point at
which the algorithm processes a data point, updates its parameters (e.g., weights), and moves closer to finding
a solution.
Accordingly, the linear combiner output is written in the compact form
v(n) = w0(n)x0(n) + w1(n)x1(n) + … + wm(n)xm(n)
i.e.
v(n) = W^T(n)X(n)
In the first line, w0(n), corresponding to i = 0, represents the bias b. For fixed n, the equation W^T X = 0, plotted in an m-dimensional space (and for some prescribed bias) with coordinates x1, x2, ..., xm, defines a hyperplane as the decision surface between two different classes of inputs.
Suppose then that the input variables of the perceptron originate from two linearly separable classes. Let H1 be
the subspace of training vectors X1(1), X1(2), ... that belong to class C1, and let H2 be the subspace of training
vectors X2(1), X2(2), ... that belong to class C2. The union of H1 and H2 is the complete space denoted by H.
Given the sets of vectors H1 and H2 to train the classifier, the training process involves the adjustment of the weight vector W in such a way that the two classes C1 and C2 are linearly separable. That is, there exists a weight vector W such that we may state
W^T X > 0 for every input vector X belonging to class C1
W^T X ≤ 0 for every input vector X belonging to class C2 ---------- (4)
Note:
If we used strict inequalities for both classes:
● W^T X > 0 for C1.
● W^T X < 0 for C2.
This would leave input vectors with W^T X = 0 unclassified, which is undesirable. The perceptron algorithm needs to classify all input vectors, so it uses:
● W^T X > 0 for C1.
● W^T X ≤ 0 for C2.
This ensures that every input vector is assigned to one of the two classes.
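This decision rule can be sketched as a tiny function; the weight vector below is an assumed value for illustration:

```python
import numpy as np

def classify(w, x):
    """Assign x to C1 if w^T x > 0, otherwise (w^T x <= 0) to C2."""
    return "C1" if np.dot(w, x) > 0 else "C2"

w = np.array([1.0, 1.0])                    # assumed weight vector
print(classify(w, np.array([1.0, 2.0])))    # w^T x = 3 > 0  -> C1
print(classify(w, np.array([-1.0, -2.0])))  # w^T x = -3 <= 0 -> C2
print(classify(w, np.array([1.0, -1.0])))   # w^T x = 0: boundary case -> C2
```

Note how the boundary case w^T x = 0 falls to C2, so every input gets a class, exactly as the note above requires.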
Given the subsets of training vectors H1 and H2, the training problem for the perceptron is then to find a weight
vector w such that the two inequalities of above statements are satisfied.
The algorithm for adapting the weight vector of the elementary perceptron may now be formulated as follows:
1. If the nth member of the training set, X(n), is correctly classified by the weight vector W(n) computed at the nth iteration of the algorithm, no correction is made to the weight vector of the perceptron, in accordance with the rule:
W(n + 1) = W(n) if W^T(n)X(n) > 0 and X(n) belongs to class C1
W(n + 1) = W(n) if W^T(n)X(n) ≤ 0 and X(n) belongs to class C2 ---------- (5)
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule
W(n + 1) = W(n) - η(n)X(n) if W^T(n)X(n) > 0 and X(n) belongs to class C2
W(n + 1) = W(n) + η(n)X(n) if W^T(n)X(n) ≤ 0 and X(n) belongs to class C1 ---------- (6)
Note: The learning rate is denoted by η (the Greek letter eta). It controls the amount of weight adjustment at each step of training. The learning rate, ranging from 0 to 1, determines the rate of learning at each time step and plays a significant role in how fast or slow a neural network learns: if the learning rate is low, the neuron learns slowly; if the learning rate is high, the neuron learns quickly.
We are using the fixed-increment adaptation rule for the perceptron in which we are keeping the η fixed i.e. it is
a constant independent of the iteration number n. This means the learning rate does not change over time.
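The fixed-increment rule above can be sketched as a small training loop. The toy data set, the +1/-1 class coding, and the epoch limit below are assumptions chosen for illustration, not part of the notes; the leading +1 in each input vector plays the role of the fixed bias input:

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, max_epochs=100):
    """Fixed-increment perceptron rule with a constant learning rate eta.

    X: input vectors with a leading +1 entry (bias treated as weight w0 = b).
    d: desired outputs, +1 for class C1 and -1 for class C2.
    """
    w = np.zeros(X.shape[1])           # initial condition W(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, d):
            y = 1.0 if np.dot(w, x) > 0 else -1.0   # W^T x > 0 -> C1, else C2
            if y != target:            # misclassified: apply the correction
                w = w + eta * target * x
                errors += 1
        if errors == 0:                # a full pass with no corrections: done
            break
    return w

# Assumed toy data: two linearly separable clusters; leading 1 is the bias input.
X = np.array([[1, 2, 2], [1, 1, 3], [1, -1, -1], [1, -2, -3]], dtype=float)
d = np.array([1, 1, -1, -1], dtype=float)
w = train_perceptron(X, d)
print(w)
```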
Proof of the perceptron convergence algorithm is presented for the initial condition W(0) = 0. Suppose that
WT(n)X(n) < 0 for n = 1, 2, ..., and the input vector X(n) belongs to the subset H1. That is, the perceptron
incorrectly classifies the vectors X(1), X(2) ..., since the first condition of equation (4) is violated. Then, with the
constant η(n) = 1, we may use the second line of equation (6) to write
W(n + 1) = W(n) + X(n) for X(n) belonging to class C1 ---------- (7)
Note: The update rule W(n + 1) = W(n) + X(n) means that the weight vector at the next iteration, W(n + 1), is computed from the current input X(n) and the current weight vector W(n). This ensures that the algorithm processes each input X(n) sequentially and updates the weights accordingly. The variable n is used in two different ways in the perceptron algorithm:
- For the weights, n starts at 0: W(0) is the initial weight vector, and each subsequent W(n) is built from the previous one.
- For the inputs, n runs from 1 to the total number of inputs.
For iteration n = 0, the weight vector is initialized as W(0) = 0.
For iteration n = 1, the algorithm uses W(0) to classify X(1); it then computes W(1) from X(1):
W(1) = W(0) + X(1) [because W(0) = 0]
hence W(1) = X(1)
For iteration n = 2, the algorithm uses W(1) to classify X(2); it then computes W(2) from X(2):
W(2) = W(1) + X(2) [because W(1) = X(1)]
hence W(2) = X(1) + X(2)
For iteration n = 3, the algorithm uses W(2) to classify X(3); it then computes W(3) from X(3):
W(3) = W(2) + X(3) [because W(2) = X(1) + X(2)]
hence W(3) = X(1) + X(2) + X(3)
From the above calculation we can say that, for W(0) = 0, we may iteratively solve this equation for W(n + 1), obtaining the result
W(n + 1) = X(1) + X(2) + . . . + X(n) - - - - - - - - - - (8)
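Equation (8) is easy to verify numerically: repeatedly applying the update W(n + 1) = W(n) + X(n) of equation (7) from W(0) = 0 (assuming, as in the proof, that every presentation is a misclassified vector from C1 and η = 1) yields the running sum of the inputs. The example inputs are assumed values:

```python
import numpy as np

# W(0) = 0; apply W(n + 1) = W(n) + X(n), treating each presentation as a
# misclassification from C1 with eta = 1 (assumed example inputs).
X = [np.array([1.0, 2.0]), np.array([0.5, 1.0]), np.array([2.0, 0.0])]
w = np.zeros(2)
for x in X:
    w = w + x          # equation (7)

print(w)               # equals X(1) + X(2) + X(3), as equation (8) states
```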
Since the classes C1 and C2 are assumed to be linearly separable, there exists a solution Wo for which Wo^T X(n) > 0 for all the vectors X(1), ..., X(n) belonging to the subset H1. For a fixed solution Wo, we may then define a positive number α as
α = min over all X(n) in H1 of Wo^T X(n) ---------- (9)
Hence, multiplying both sides of Eq. (8) by the row vector Wo^T, we get
Wo^T W(n + 1) = Wo^T X(1) + Wo^T X(2) + Wo^T X(3) + … + Wo^T X(n)
Since each term on the right-hand side is at least α (by the definition of α), it follows that
Wo^T W(n + 1) ≥ nα ---------- (10)
Next we make use of an inequality known as the Cauchy–Schwarz inequality. Given two vectors Wo and W(n + 1), the Cauchy–Schwarz inequality states that
||Wo||² ||W(n + 1)||² ≥ [Wo^T W(n + 1)]²
Since Wo^T W(n + 1) ≥ nα, it follows that
||Wo||² ||W(n + 1)||² ≥ n²α² ---------- (11)
Here ||Wo||² is the squared Euclidean norm; strictly it could be written ||Wo||2², but for the sake of simplicity we write it ||Wo||².
or, equivalently,
||W(n + 1)||² ≥ n²α² / ||Wo||² ---------- (12)
We next follow another development route. For X(k) belonging to class C1 and misclassified, the update is W(k + 1) = W(k) + X(k). Taking the squared Euclidean norm of both sides gives
||W(k + 1)||² = ||W(k)||² + 2W^T(k)X(k) + ||X(k)||² ---------- (13)
But W^T(k)X(k) < 0, since the perceptron misclassified X(k), so
||W(k + 1)||² ≤ ||W(k)||² + ||X(k)||² ---------- (14)
or equivalently
||W(k + 1)||² - ||W(k)||² ≤ ||X(k)||² ---------- (15)
Summing these inequalities for k = 1, …, n and invoking the initial condition W(0) = 0: the left-hand side is a telescoping sum, meaning most terms cancel out. So we will get
||W(n + 1)||² ≤ ||X(1)||² + ||X(2)||² + … + ||X(n)||²
The above can be written as follows after applying the General Strategy for Tightening Inequalities:
||W(n + 1)||² ≤ nβ ---------- (16)
where β is a positive number defined by
β = max over all X(k) in H of ||X(k)||²
The inequality of equation (16) is in conflict with the inequality of equation (12) for sufficiently large values of n because
1. The upper bound (16) grows linearly with n
2. The lower bound (12) grows quadratically with n
For sufficiently large n, the quadratic term will eventually exceed the linear term which would violate the upper
bound. This is why the two inequalities appear to be in conflict for large n.
The Perceptron algorithm is guaranteed to converge (i.e., find a solution) after a finite number of updates (nmax) if the data is linearly separable. This means that the inequalities are only relevant for n ≤ nmax, where nmax is the maximum number of updates required for convergence.
We have thus proved that, for η(n) = 1 for all n and W(0) = 0, and given that a solution vector Wo exists, the rule for adapting the synaptic weights of the perceptron must terminate after at most nmax iterations. We may now state the fixed-increment convergence theorem for the perceptron as follows:
Let the subsets of training vectors H1 and H2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that
W(n0) = W(n0 + 1) = W(n0 + 2) = …
is a solution vector for n0 ≤ nmax.
The Perceptron Convergence Algorithm guarantees that if the data is linearly separable, the algorithm will find
a solution (i.e., a weight vector that correctly classifies all training examples) in a finite number of iterations.
The goal is indeed to determine the value of n0 (or nmax) such that the algorithm will surely converge within n0 iterations. Equating the quadratic lower bound of Eq. (12) with the linear upper bound of Eq. (16) at n = nmax, i.e. nmax²α² / ||Wo||² = nmax·β, we find
nmax = β ||Wo||² / α²
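The quantities α, β, and the resulting bound nmax = β||Wo||²/α² can be computed for a concrete case. The training subset H1 and the solution vector Wo below are assumed values chosen so the arithmetic is easy to follow:

```python
import numpy as np

# Assumed training subset H1 and an assumed solution vector Wo (illustrative only).
H1 = [np.array([1.0, 1.0]), np.array([2.0, 1.0]), np.array([1.0, 3.0])]
Wo = np.array([1.0, 1.0])

alpha = min(np.dot(Wo, x) for x in H1)    # alpha: smallest margin Wo^T X(n) over H1
beta = max(np.dot(x, x) for x in H1)      # beta: largest squared norm ||X(k)||^2
n_max = beta * np.dot(Wo, Wo) / alpha**2  # n_max = beta * ||Wo||^2 / alpha^2

print(alpha, beta, n_max)
```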
Consider, for example, the function f(x, y) = x² + 3y. Taking the partial derivative with respect to x: the derivative of x² with respect to x is 2x, and since 3y is treated as a constant, its derivative is 0.
Taking the partial derivative with respect to y: the term x² is treated as a constant, so its derivative is 0, and the derivative of 3y with respect to y is 3.
The gradient vector (denoted ∇f, pronounced "nabla f") is simply the vector formed by collecting all the partial derivatives, and it points in the direction of steepest ascent:
∇f = (∂f/∂x, ∂f/∂y) = (2x, 3)
The function changes most rapidly in the direction of (2x, 3). If we move in the direction of this gradient, the function f(x, y) increases fastest.
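The partial derivatives can be sanity-checked with central finite differences; the function f(x, y) = x² + 3y and the evaluation point below are assumed for illustration:

```python
# f(x, y) = x^2 + 3y, with gradient (2x, 3) as worked out above.
def f(x, y):
    return x**2 + 3*y

def grad_f(x, y):
    return (2*x, 3)        # partial derivatives

# Central finite differences at an arbitrary point as a sanity check
x, y, h = 2.0, 1.0, 1e-6
df_dx = (f(x + h, y) - f(x - h, y)) / (2*h)
df_dy = (f(x, y + h) - f(x, y - h)) / (2*h)
print(grad_f(x, y), df_dx, df_dy)
```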
Convergence is made faster if a momentum factor is added to the weight-updation process. This is generally done in the backpropagation network. If momentum is to be used, the weight changes from one or more previous training patterns must be saved. Momentum allows the net to make reasonably large weight adjustments as long as the corrections are in the same general direction for several patterns.
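A minimal sketch of a momentum update on a one-dimensional quadratic error E(w) = w²; the learning rate η, the momentum factor μ, and the starting point are all assumed values, not from the notes:

```python
# E(w) = w^2 is an assumed toy error surface; eta, mu, and w0 are assumed values.
def momentum_step(w, grad, prev_delta, eta=0.1, mu=0.9):
    # The saved previous weight change is added to the current gradient step,
    # so successive corrections in the same direction build up speed.
    delta = -eta * grad + mu * prev_delta
    return w + delta, delta

w, prev = 1.0, 0.0
for _ in range(3):
    grad = 2 * w                            # dE/dw for E(w) = w^2
    w, prev = momentum_step(w, grad, prev)
print(w)                                    # moves toward the minimum at w = 0
```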
The vigilance parameter is denoted by ρ. It is generally used in adaptive resonance theory (ART) networks to control the degree of similarity required for patterns to be assigned to the same cluster unit. To perform useful work in controlling the number of clusters, the vigilance parameter is typically chosen in the range of approximately 0.7 to 1.
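A hedged sketch of an ART-1-style vigilance test on binary patterns; the match rule |x AND w| / |x| ≥ ρ and the example patterns below are assumptions for illustration, not the full ART algorithm:

```python
# Binary patterns; the match rule and example values are illustrative assumptions.
def passes_vigilance(x, w, rho):
    overlap = sum(a & b for a, b in zip(x, w))   # |x AND w|
    return overlap / sum(x) >= rho               # similarity vs. vigilance rho

x = [1, 1, 1, 0, 1]    # input pattern, |x| = 4
w = [1, 1, 1, 0, 0]    # cluster prototype, overlap = 3, ratio = 0.75
print(passes_vigilance(x, w, 0.8))   # 0.75 < 0.8: pattern rejected
print(passes_vigilance(x, w, 0.7))   # 0.75 >= 0.7: pattern accepted
```

Raising ρ toward 1 demands closer matches and so produces more, finer clusters; lowering it merges patterns into fewer, coarser clusters.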