MCS-224 Artificial Intelligence and Machine Learning
Prof. (Retd.) S.K. Gupta, IIT, Delhi
Prof. Ela Kumar, IGDTUW, Delhi
Prof. T.V. Vijay Kumar, JNU, New Delhi
Prof. Gayatri Dhingra, GVMITM, Sonipat
Mr. Milind Mahajan, Impressico Business Solutions, New Delhi
Sh. Shashi Bhushan Sharma, Associate Professor, SOCIS, IGNOU
Sh. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. P. Venkata Suresh, Associate Professor, SOCIS, IGNOU
Dr. V.V. Subrahmanyam, Associate Professor, SOCIS, IGNOU
Sh. M.P. Mishra, Assistant Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU
SOCIS FACULTY
Prof. P. Venkata Suresh, Director, SOCIS, IGNOU
Prof. V.V. Subrahmanyam, SOCIS, IGNOU
Dr. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. Naveen Kumar, Associate Professor, SOCIS, IGNOU (on EOL)
Dr. M.P. Mishra, Associate Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU
Dr. Manish Kumar, Assistant Professor, SOCIS, IGNOU
PREPARATION TEAM
Dr. Sudhansh Sharma (Writer: Unit 1), Assistant Professor, SOCIS, IGNOU
(Unit 1 partially adapted from MCSE-003 Artificial Intelligence & Knowledge Management)
Mr. Anant Kumar Jayswal (Writer: Units 2 and 3), Assistant Professor, Amity School of Engineering and Technology
Prof. Ela Kumar (Content Editor), Department of Computers & Engg., IGDTUW, Delhi
Prof. Parmod Kumar (Language Editor), SOH, IGNOU, New Delhi
Print Production
Sh. Sanjay Aggarwal, Assistant Registrar, MPDD
© Indira Gandhi National Open University, 2022
ISBN-
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission
in writing from the Indira Gandhi National Open University.
Further information on the Indira Gandhi National Open University courses may be obtained from the University’s office at
Maidan Garhi, New Delhi-110068.
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by MPDD, IGNOU.
UNIT 1 INTRODUCTION TO ARTIFICIAL
INTELLIGENCE
Structure
1.1 Introduction
1.2 Objectives
1.3 Basics of Artificial Intelligence (AI)
1.4 Brief history of Artificial Intelligence
1.5 Components of Intelligence
1.6 Approaches to Artificial Intelligence
1.7 Comparison between Artificial Intelligence (AI),
Machine Learning (ML) and Deep Learning (DL)
1.8 Application Areas of Artificial Intelligence Systems
1.9 Intelligent Agents
1.9.1 Stimulus - Response Agents
1.10 Summary
1.11 Solutions/Answers
1.12 Further Readings
1.1 INTRODUCTION
Today, artificial intelligence is used in a wide variety of applications, including engineering,
technology, the military, opinion mining, sentiment analysis, and many more. It is also used in
more advanced domains, such as language processing and applications for aerospace.
AI is everywhere in today's world, and people are gradually becoming accustomed to its
presence. It is utilised in systems that recognise both voices and faces. In addition to this, it can
provide you with shopping recommendations that are tailored to your own purchasing
preferences. Finding spam and preventing fraudulent use of credit cards is made much easier
when you have this skill. The most cutting-edge technology currently on the market are virtual
assistants like Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and Google's own Google
Assistant. It's possible that you're already familiar with the technology involved in artificial
intelligence (AI). Are you?
AI has become very popular all over the world today. It imitates human intelligence in machines
by programming them to do the same things people do. As a technology, AI is going to have a
bigger impact on how people live their daily lives. Everyone wants to connect to Artificial
Intelligence as a technology these days. Before we can understand AI, we need to know and talk
about some basic things. For example, what is the difference between knowledge and
intelligence? The key to starting this unit is the answer to this question.
The accumulation of information and abilities that a person has gained through their life
experiences is known as knowledge, while intelligence refers to one's capacity to put that
knowledge into practice. To put it simply, knowledge is what we have learned over the years,
and it expands as time passes. Because of this, it represents the culmination of everything that we
have realised over the course of our lives. It is important to highlight that having information
does not automatically make one intelligent; rather, intelligence is what makes one smart.
There is a well-known proverb that says "marks are not the measure of intelligence." This is due
to the fact that intelligence is not a measurement of how much information one possesses. In fact,
it is the measure of how much we comprehend and put into practise. People who are
knowledgeable may gain a lot of information, but an intelligent person understands how to
comprehend, analyze, and use the information. You could have a lot of knowledge but still be the
least intelligent person in the room. Knowledge and intelligence are inextricably linked, and each
contributes to the other's development. Knowledge enables one to learn the understandings that
others have of things, whereas intelligence is the foundation for one's ability to grasp the things
themselves.
Now that we have an understanding of the distinction between intelligence and knowledge, our
next issue is: what exactly is artificial intelligence? In a nutshell, incorporating intelligence into a
machine is what the field of Artificial Intelligence is concerned with, and the two concepts of
knowledge representation and knowledge engineering form the basis of traditional AI research.
This topic will be discussed in section 1.3 of this unit. Knowledge engineering is a subfield of
artificial intelligence (AI) that applies rules to data in order to simulate the way in which an
expert would think about the information. It does this by analysing the structure of a task or a
decision in order to figure out how one arrives at a conclusion.
In the subsequent units of this course, you will learn about some of the concepts that are essential
for knowledge representation, such as frames, scripts, and other related topics. In addition, this
course will address the issues that are associated with the knowledge representation for uncertain
situations, such as employing the method of fuzzy logic, rough sets, and the Dempster Shafer
theory, among other relevant topics.
The following is a list of eight definitions of artificial intelligence that have been provided by
well-known authors of artificial intelligence textbooks.
1) According to Haugeland (1985), AI is "the exciting new effort to make computers
think ... machines with minds, in the full and literal sense."
2) According to Bellman (1978), AI is "[the automation of] activities that we associate
with human thinking, activities such as decision-making, problem solving, learning ..."
3) According to Charniak and McDermott (1985), AI is "the study of mental faculties
through the use of computational models."
4) According to Winston (1992), AI is "the study of the computations that make it
possible to perceive, reason, and act."
5) According to Kurzweil (1990), AI is "the art of building machines that execute
functions that demand intellect when performed by people."
6) According to Rich and Knight (1991), AI is "the study of how to make computers do
things at which, at the moment, people are better."
7) According to Schalkoff (1990), AI is "a field of study that aims to explain and
replicate intelligent behaviour in terms of computational processes."
8) According to Luger and Stubblefield (1993), AI is "the discipline of computer science
that is concerned with the automation of intelligent behaviour."
According to the concepts presented earlier, there are four distinct objectives that might be pursued in
the field of artificial intelligence. These objectives are as follows:
• The creation of systems that think in the same way as people do.
• The creation of systems that are capable of logical thought.
• The creation of machines that can mimic human behaviour.
• The creation of systems that behave in a logical manner.
In addition, we learnt through our discussion in the earlier section 1.1 of this Unit that
Artificial Intelligence (AI) is the intelligence that is incorporated into machines; in other words,
AI is the ability of a machine to display human-like capabilities such as reasoning, learning,
planning, and creativity. Keeping in mind the emerging AI technologies to sense,
interpret, and act according to the circumstances, relevant exemplary solutions are summarised in
Figure 1, which attempts to encapsulate the understanding of the question "What is Artificial
Intelligence?"
Figure 1: What is Artificial Intelligence? (Emerging AI technologies that sense, interpret and
act, e.g., machine learning, analytics, data visualization and expert systems)
Here’s a brief introduction to the first type of AI, i.e., Type 1 AI. Following are the three stages of
Type 1 Artificial Intelligence:
a) Artificial Narrow Intelligence-(ANI)
b) Artificial General Intelligence-(AGI)
c) Artificial Super Intelligence-(ASI)
(Figure: Types of Artificial Intelligence: Artificial Narrow Intelligence (ANI), Artificial General
Intelligence (AGI) and Artificial Super Intelligence (ASI))
a) Artificial Narrow Intelligence (ANI), also called Weak AI or Narrow AI: Weak AI is a term
for thinking that is "simulated." Such systems seem to act intelligently, but they don't have
any awareness of what they are doing. For example, a chatbot might talk to you in a way
that seems natural, but it doesn't know who it is or why it's talking to you. Narrow AI, in short,
is a system that was built to do one specific job.
b) Artificial General Intelligence (AGI): Strong or General Artificial Intelligence, also called
"actual" thinking. That is, acting like a smart human and thinking like one with a conscious,
subjective mind. For instance, when two humans talk, they probably both know who they
are, what they're doing, and why.
Systems with strong or general artificial intelligence can do things that people can do. These
systems tend to be harder to understand and more complicated. They are set up to handle
situations where they might need to solve problems on their own without help from a
person. Uses for these kinds of systems include self-driving cars and operating rooms in
hospitals.
c) Artificial Super Intelligence (ASI): The term "super intelligence" usually refers to a level
of general and strong AI that is smarter than humans, if that's even possible. ASI is seen as
the logical next step after AGI because it can do more than humans can, including making
rational decisions and even things like building emotional relationships. There is only a
marginal difference between AGI and ASI.
……………………………………………………………………………………………
……………………………………………………………………………………………
……………………………………………………………………………………………
Q2 Classify AI on the basis of the functionalities of AI
……………………………………………………………………………………………
……………………………………………………………………………………………
……………………………………………………………………………………………
Q3 Compare ANI, AGI and ASI, in context of AI
……………………………………………………………………………………………
……………………………………………………………………………………………
1.4 BRIEF HISTORY - ARTIFICIAL INTELLIGENCE
AI's ideas come from early research into how people learn and think. Also very old is the idea
that a computer could act like a person. Greek mythology is where the idea of machines that can
think for themselves comes from.
• Aristotle (384 BC to 322 BC) developed an informal system of syllogistic logic. This is
where the first formal system of deductive reasoning got its start.
• At the start of the 17th century, Descartes proposed that animal bodies are just complex
machines.
• Pascal made the first mechanical digital calculator in the year 1642.
• In the 1800s, George Boole came up with a binary algebra that represented (some)
"laws of thought."
• In the late 19th century and early 20th century, mathematicians and philosophers like
Gottlob Frege, Bertrand Russell, Alfred North Whitehead, and Kurt Gödel built on Boole's
first ideas about logic to make mathematical representations of logic problems.
• The arrival of electronic computers was a big step forward in how we could study
intelligence.
• McCulloch and Pitts made a Boolean circuit model of the brain in 1943. They wrote about
how neural networks can be used to do computation in the paper "A Logical Calculus of the
Ideas Immanent in Nervous Activity."
• In 1950, Turing wrote a paper called "Computing Machinery and Intelligence." This article
gave a good overall picture of AI. To learn more about Alan Turing, go to
http://www.turing.org.uk/turing.
Turing's paper talked about a lot of things, one of which was how to solve problems by using
heuristics as guides to look through the space of possible solutions. He used the game of chess to
explain how his ideas about how machines can think work. He even said that the machine could
change its own instructions so that machines could learn from what they do.
The SNARC was built by Marvin Minsky and Dean Edmonds in 1951. It was the first randomly
wired neural network learning machine (SNARC stands for Stochastic Neural Analog
Reinforcement Computer). It was a computer with a network of 40 neurons and 3000 vacuum
tubes.
• Between 1952 and 1956, Samuel developed a series of programmes that played checkers and improved with experience.
In 1956, Dartmouth was the site of a well-known meeting. At the conference, the people who
came up with the idea of AI met for the first time. At this meeting, the name "Artificial
Intelligence" was chosen.
• Newell and Simon presented the Logic Theorist. Many people think it was the first
programme to use artificial intelligence.
In 1959, Gelernter built a geometry theorem-proving programme. In 1961, James Slagle's Ph.D.
dissertation at MIT described a programme called SAINT. It was written in LISP, and it could
solve calculus problems at the level of a first-year college student.
Thomas Evans wrote a programme called ANALOGY in 1963 that could solve analogy problems
like those on an IQ test. The first collection of articles about artificial intelligence, called
"Computers and Thought," was put together by Edward A. Feigenbaum and Julian Feldman and
released in 1963.
In 1965, J. Alan Robinson came up with a way to prove things mechanically. He called it the
Resolution Method. This made it possible for formal logic to work well as a language for
representing programmes. In 1967, Feigenbaum, Lederberg, Buchanan, and Sutherland at
Stanford showed how the Dendral programme could be used to understand the mass spectra of
organic chemical compounds. This was the first programme that worked well and was based on
scientific knowledge. The SRI robot Shakey showed in 1969 that it was possible to move, see,
and solve problems all at the same time.
From 1969 to 1979, the first knowledge-based systems were developed.
• In 1974, MYCIN showed how powerful rule-based systems can be for representing and
drawing conclusions about knowledge in medical diagnosis and treatment.
• Around the same time, Minsky proposed frames as a scheme for knowledge
representation, and logic-based programming languages like Prolog and Planner appeared.
• In the 1980s, Lisp machines were developed and sold. In 1985, neural networks became
popular once again, and in 1988, probabilistic and decision-theoretic methods came back
into use.
Early AI was based on general systems that didn't know much. AI researchers realised that for
machines to be able to reason about complex tasks, they need to know a lot about a narrow field.
Dean Pomerleau made ALVINN at CMU in 1989. Autonomous Land Vehicle in a Neural
Network is what ALVINN stands for. This is a system that learns to drive by watching someone
else do it. It has a neural network whose input is a 30x32-pixel image from a forward-looking
camera. The output layer tells the vehicle where it needs to go. The system drove a car from
the East Coast to the West Coast of the United States, which is about 2,850 miles. A person only
drove about 50 of these miles. The system took care of the rest.
In the 1990s, AI made a lot of progress, especially in machine learning, data mining, intelligent
tutoring, case-based reasoning, multi-agent planning and scheduling, uncertain reasoning,
understanding and translating natural language, vision, virtual reality, games, and other areas.
Rod Brooks' COG Project at MIT made a lot of progress toward making a humanoid robot with
the help of a lot of people.
In the 1990s,
• 1997 was the year of the first official Robo-Cup soccer game. It was played on a tabletop
with 40 teams of robots talking to each other.
• As more and more people use the web, web crawlers and other AI-based programmes
that pull information from it became more and more important.
• In 1997, Garry Kasparov, the world chess champion at the time, lost to IBM's Deep Blue
chess programme.
In 2000,
• The Nomad robot travelled to remote parts of Antarctica to look for meteorite samples.
• Robotic space probes can work on their own to learn more about space. They keep an
eye on what's going on around them, make decisions, and take action to get where they
want to go. In April 2004, the first three-month missions of NASA's Mars rovers went
well. The Spirit rover examined a group of hills on Mars that took two months to reach,
finding strangely eroded rocks that might be new pieces of the puzzle that is the history
of the area. Spirit's twin, Opportunity, examined the layers of rock in a crater.
Internet agents: As the Internet grows quickly, more people want to use Internet agents to
keep track of what users are doing, find the information they need, and figure out which
information is the most useful. The reader can learn more about AI by reading about it in
the news.
1.5 COMPONENTS OF INTELLIGENCE
According to the dominant school of thought in psychology, human intelligence should not be
viewed as a singular talent or cognitive process but rather as a collection of distinct components.
The majority of attention in the field of artificial intelligence research has been paid to the
following aspects of intelligence: learning, reasoning, problem-solving, perception, and language
comprehension.
Learning: There are numerous approaches to developing a learning system. Making mistakes is the
simplest way to learn. A basic programme that solves "mate in one" chess problems, for example,
might test different moves until it finds one that solves the problem. The programme
remembers which move worked so that the next time the computer is given the identical
situation, it can provide an immediate response. The simple act of memorising things like
answers to problems, words in a vocabulary list, and so on is known as "rote learning" or
memorization.
We'll now talk about another classification, one that doesn't depend on the way knowledge is
represented. According to this classification, the main ways of learning are the following:
Rote learning: Rote learning is the simplest method of learning, since it involves the least amount
of interpretation. The information is simply copied into a database in this method of learning.
This is the technique used for memorising multiplication tables.
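To make the idea concrete, here is a minimal Python sketch of rote learning as simple caching; the multiplication problem and the names used are illustrative assumptions, not part of any standard library.

```python
# Rote learning as simple caching: solved problems are stored verbatim,
# so an identical problem can be answered immediately without re-solving.
memory = {}  # problem -> remembered answer

def solve(problem):
    """Stands in for the work of actually computing an answer."""
    a, b = problem
    return a * b

def rote_agent(problem):
    if problem not in memory:          # never seen: solve it and memorise it
        memory[problem] = solve(problem)
    return memory[problem]             # seen before: recall it instantly

print(rote_agent((7, 8)))  # computed, then stored
print(rote_agent((7, 8)))  # recalled from memory, no re-computation
```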
Learning by analogy: When you learn by analogy, you generate new ideas by connecting
previously learned concepts. Textbooks frequently employ this method of instruction. For
example, in the text, some problems are solved as examples, and students are subsequently given
problems that are comparable to the examples. This type of learning also occurs when someone
who can drive a light car attempts to drive a heavy vehicle.
Learning by induction: Learning through induction is the most common method of learning.
This is a method of learning that employs inductive reasoning, a style of reasoning that
involves drawing a conclusion from a large number of good instances. If we encounter a lot of
cows, we might notice that they have four legs, are white, and have two horns in the same
position on their head, for example. Even though inductive reasoning frequently leads to valid
conclusions, the conclusions are not always unarguable. For example, with the above-mentioned
concept of cow, we might come across a black cow, a three-legged cow who has lost one leg in
an accident, or a single-horn cow.
Learning by deduction: Finally, we discuss deductive learning, which is founded on deductive
inference, a non-debatable mode of thinking. By irrefutable method of reasoning, we mean that if
the hypotheses (or given facts) are accurate, the conclusion arrived through deductive (i.e., any
irrefutable) reasoning is always correct. This is the most common method of thinking in
mathematics.
Inductive learning is a crucial component of an agent's learning architecture. An agent's learning
depends on:
• What is being learnt, such as concepts, problem-solving techniques, or game-playing
techniques, etc.
• The representation employed, such as predicate calculus, frames, scripts, and other
schemes.
• The critic, which provides feedback on the agent's performance.
Learning based on feedback is normally categorized as:
Supervised learning
Unsupervised learning
Reinforcement Learning.
Supervised Learning: Here, a function is learned from example pairs of inputs and their desired outputs.
Some examples of this kind of learning are figuring out useful things about the world from what you see,
making a map from the current state's conditions to actions, and learning how the world changes over
time.
Unsupervised Learning: There is no way to know what the inputs are and what the expected outputs are
in this type of learning. So, the learning system has to figure out on its own which properties of objects it
doesn't know about are important. For example, figuring out the shortest way to get from one city to
another in a country you know nothing about.
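The contrast between the two paradigms can be illustrated with a small sketch. The following Python fragment assumes the third-party scikit-learn library is available; the toy fruit data and labels are invented purely for illustration.

```python
# Supervised vs. unsupervised learning on toy data, using scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Supervised: every input comes with the desired output (a label).
X = [[150, 0], [170, 1], [160, 0], [180, 1]]   # e.g., [weight, texture]
y = ["orange", "apple", "orange", "apple"]      # labels supplied by a "teacher"
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[165, 1]]))                  # applies the learned mapping

# Unsupervised: no labels; the learner must discover structure on its own.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(clusters)                                 # e.g., [0 1 0 1] group indices
```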
Reinforcement Learning (learning from rewards): In some problems, the task or problem can only be seen, not
said. Also, the job may be an ongoing one. The user tells the agent how happy or unhappy he or she is
with the agent's work by sometimes giving the agent positive or negative rewards (i.e., reinforcements).
The agent's job is to get as many rewards (or reinforcements) as possible. In a simple goal-attainment
problem, the agent can be rewarded when it reaches the goal and punished when it doesn't.
You need an action plan to get the most out of this kind of task. But when it comes to tasks that never
end, the future reward might be endless, making it hard to decide how to get the most out of it. One way
to move forward in this kind of situation is to ignore future rewards after a certain point. That is, the agent
may want rewards that will come soon more than rewards that will come a long time from now.
Delayed-reinforcement learning is the process of determining how to behave in situations when rewards
are contingent on previous actions.
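The idea of preferring near-term rewards can be captured with a discount factor. Below is a minimal, illustrative Q-learning sketch on an assumed five-cell corridor world, where a reward arrives only at the goal; the parameter values are arbitrary choices for demonstration.

```python
import random

# Q-learning on a 5-cell corridor: reward appears only at the far end, so credit
# for early moves arrives late (delayed reinforcement). The discount factor
# gamma < 1 makes rewards that come a long time from now count for less.
N, GOAL = 5, 4
alpha, gamma, epsilon = 0.5, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(N) for a in (-1, +1)}  # state, move left/right

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)            # reward only at the goal

for _ in range(500):                                    # training episodes
    s = 0
    while s != GOAL:
        if random.random() < epsilon:                   # occasional exploration
            a = random.choice((-1, +1))
        else:                                           # otherwise act greedily
            a = max((-1, +1), key=lambda m: Q[(s, m)])
        s2, r = step(s, a)
        # Temporal-difference update: the future reward is discounted by gamma.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, -1)], Q[(s2, +1)]) - Q[(s, a)])
        s = s2

print([round(max(Q[(s, -1)], Q[(s, +1)]), 2) for s in range(N)])
# State values rise as the state gets closer to the goal.
```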
Reasoning: To reason is to draw conclusions that are appropriate for the situation. Both
deductive and inductive reasoning can be used to make conclusions. An example of a deductive
inference is, "Fred is either in the museum or in the cafe. He isn't in the cafe, so he must be in the
museum." An example of inductive inference is, "In the past, accidents just like this one have
been caused by instrument failure." The difference between the two is that in the deductive case,
the truth of the premises guarantees the truth of the conclusion, while in the inductive case, the
truth of the premises supports the conclusion that instrument failure caused the accident, but
more research could show that the conclusion is actually false, even though the premises are true.
Programming computers to make inferences, especially deductive inferences, has had a lot of
success. But you can't say that a programme can reason just because it can draw conclusions. To
reason, you have to draw conclusions that make sense for the task or situation at hand. Giving
computers the ability to tell what is important and what isn't is one of the hardest problems AI
has to face.
Problem-solving: Problems usually go like this: given these data, find x. AI is used to solve a
very wide range of problems. Some examples are finding the best way to win a board game,
figuring out who someone is from a picture, and planning a series of steps that will allow a robot
to do a certain task.
Methods for solving problems can be either specific or general. A special-purpose method is
made to solve a specific problem and often takes advantage of very specific parts of the situation
where the problem is happening. A general method can be used to solve many different kinds of
problems. The difference between the current state and the goal state can be reduced step by step
with means-end analysis, which is a technique used in AI. The programme chooses actions from
a list of ways, which for a simple robot might include pick up, put down, move forward, move
back, move left, and move right, until the current state is changed into the goal state.
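A toy sketch of means-end analysis for the simple robot described above is given below; the action names and the use of Manhattan distance as the "difference" measure are illustrative assumptions.

```python
# Means-end analysis: repeatedly pick the action that most reduces the
# "difference" (here, Manhattan distance) between current state and goal state.
ACTIONS = {
    "move forward": (0, 1), "move back": (0, -1),
    "move left":   (-1, 0), "move right": (1, 0),
}

def difference(state, goal):
    return abs(goal[0] - state[0]) + abs(goal[1] - state[1])

def means_end(state, goal):
    plan = []
    while state != goal:
        # choose the action whose resulting state is closest to the goal
        name, delta = min(
            ACTIONS.items(),
            key=lambda kv: difference((state[0] + kv[1][0],
                                       state[1] + kv[1][1]), goal))
        state = (state[0] + delta[0], state[1] + delta[1])
        plan.append(name)
    return plan

print(means_end((0, 0), (2, 1)))
# -> ['move forward', 'move right', 'move right'] (one shortest plan)
```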
Perception: Perception involves scanning the surroundings with numerous sense organs,
genuine or artificial, and internal processes for analyzing the scene into objects, their features,
and relationships. The fact that one and the same item can have various appearances depending
on the angle from which it is viewed, whether or not parts of it are projecting shadows, and so
on, complicates analysis.
Artificial perception has progressed to the point where a self-controlled car-like device can drive
at modest speeds on the open road, and a mobile robot can search a suite of bustling offices for
and remove empty drink cans. FREDDY, a stationary robot with a moving TV 'eye' and a pincer
'hand,' was one of the first systems to merge perception and action (constructed at Edinburgh
University during the period 1966-1973 under the direction of Donald Michie). FREDDY could
recognise a wide range of items and could be taught to create simple artefacts from a jumble of
parts, such as a toy automobile.
Language-understanding: A language is a set of signs with predetermined meaning. For
example, traffic signs establish a mini-language; it is a matter of convention that the hazard-
ahead sign signifies trouble ahead. This language-specific meaning-by-convention is distinct
from what is known as natural meaning, as evidenced by phrases like "Those clouds signify rain"
and "The drop in pressure suggests the valve is malfunctioning."
The productivity of full-fledged human languages, such as English, separates them from other
types of communication, such as bird sounds and traffic sign systems. A productive language is
one that is rich enough to allow for the creation of an infinite number of different sentences.
1.6 APPROACHES TO ARTIFICIAL INTELLIGENCE
Perhaps, in the future, we will reach a point where AI can behave like humans, but what
guarantees do we have that this will continue? Is it possible to test whether a system that has been
made really acts like a human? The following approaches constitute the
foundation for evaluating an AI entity's human-likeness:
Turing Test
Approach of The Cognitive Modelling
Approach of The Law of Thought
Approach of The Rational Agent
In the past, researchers have worked hard to reach all four of these goals. But it is hard to find a
good balance between approaches that focus on people and approaches that focus on logic.
People are often "irrational" in the sense of being "emotionally unstable," so it's important to tell
the difference between human and rational behaviour.
Researchers have found through their studies that a human-centered approach must be an
empirical science with hypotheses and experiments to prove them. In a rationalist approach,
math and engineering are used together. People in each group sometimes say bad things about
the work done by the other groups, but the truth is that each way has led to important discoveries.
Let's take a closer look at each one.
Acting humanly i.e. the approach of the Turing Test: Let's try to answer the question, "What is
the Turing Test in AI?" Alan Turing, the most well-known name among the pioneers, thought
about how to test an A.I. product to see if it was intelligent, and in 1950 he came up with the test
that is now known as the Turing Test (Turing, 1950). Below is a summary of the Turing test. See
figure 2 for more details.
The Turing Test involves three participants: a computer A, a human respondent B, and a human
interrogator C. The only way for the three of them to talk to each other is through computer terminals.
This means that whether A is the computer or B is the person can only be determined by how intelligent
or otherwise the responses are, not by any other human or machine traits. If C can't figure out which
participant is the computer, then the computer must be smart. More accurately, the computer is smart if
it can hide its identity from C.
Note that for a computer to be considered smart, it should be smart enough not to answer too quickly, at
least not in less than a hundredth of a second, even if it can do something like find the sum of two
numbers with more than 20 digits each.
Criticism of the Turing Test: There have been a number of criticisms of the Turing test as a machine
intelligence test. The Chinese Room Test, developed by John Searle, is one of the most well-known
criticisms. The crux of the Chinese Room Test, which we'll discuss below, is that the ability of a system,
say A, to convince us that it possesses qualities of another system, say B, does not imply that system A
actually possesses those qualities. For example, a male human's ability to persuade people that he is a
woman does not imply that he is capable of bearing children like a woman.
The scenario for the Chinese Room Test takes place in a single room with two windows. A Shakespeare
scholar who knows English but not Chinese is sitting in the room with a kind of Shakespeare
encyclopaedia. The encyclopaedia is printed so that for every pair of pages next to each other, one page is
written in Chinese characters and the other page is an English translation of the Chinese page. Through
one of the windows, Chinese characters with questions about Shakespeare's writing are sent to the person
inside. The person looks through the encyclopaedia and, when he or she finds the exact copy of the
sequence of characters sent in, reads the English translation, thinks of the answer, and writes it down in
English for his or her own understanding. The person then looks in the encyclopaedia for the
corresponding sequence of Chinese characters and sends the sequence of Chinese characters through the
other window. Now, Searle says that even though the scholar acts as though he or she knows Chinese, this
is not the case. Just because a system can mimic a quality doesn't mean that it has that quality.
Thinking humanly i.e. the cognitive modelling approach: From this point of view, the Artificial
Intelligence model is based on human cognition, which is the core of the human mind. This is done
through three approaches, which are as follows:
• Introspection, which means to look at our own thoughts and use those thoughts to build a model.
• Psychological Experiments, which means running tests on people and looking at how they act.
• Brain imaging, which means to use an MRI to study how the brain works in different situations
and then copy that through code.
Thinking rationally i.e. the laws of thought approach: This approach relates to using the laws of
thought to think logically. The Laws of Thought are a long list of logical statements that govern the
working of our minds. By implementing them in artificial intelligence algorithms, these laws can be
written down and made operational. But solving a problem by strictly following formal laws is very
different from solving a problem in the real world, and this is the biggest limitation of the approach.
Acting rationally i.e. The rational agent approach: In every situation, a rational agent approach tries to
find the best possible outcome. This means that it tries to make the best decision it can given the
circumstances. It means that the agent approach is much more flexible and open to change. The Laws of
Thought approach, on the other hand, says that a thing must act in a way that makes sense. But there are
some situations where there is no provably right thing to do and there is more than one way to solve the
problem, each with different results and trade-offs. In such situations, the rational agent approach works well.
1.7 COMPARISON BETWEEN ARTIFICIAL INTELLIGENCE (AI), MACHINE LEARNING (ML) AND DEEP LEARNING (DL)
Algorithms for machine learning are based on ways that people learn from their experiences.
This means that they are programmed to learn from what they do and get better at what they do.
They don't need to be told what to do to get the desired results. They are set up so that they can
learn by looking at the data sets they can access and comparing what they see to examples of the
final results. They also look for patterns in the output and try to figure out how to use the
different parts to get the output they want.
ML shows a machine how to draw conclusions and make decisions based on what it has learned
in the past. It looks for patterns and analyses past data to figure out what these patterns mean so
that a possible conclusion can be reached without the need for human experience. Businesses
save time and make better decisions by using automation to evaluate data and come to
conclusions.
Machine learning provides predictions and prescriptions. The types of analytics, in order of
increasing complexity, are descriptive, predictive and prescriptive analytics.
A neural network is made up of layers of software-based calculators called "neurons" that are
linked together. This neural network can take in a huge amount of data and process it through
many layers. At each layer, the network learns more complex features of the data. The network
can then decide what to do with the data, find out if it was right, and use what it has learned to
decide what to do with new data. Deep learning is a way of programming computers that uses the
way neural networks work to teach computers to do things that humans do naturally. So, Deep
Learning is a way to teach a computer model to run classification algorithms based on an image,
text, or sound. Once a neural network knows what an object looks like, it can spot that object in a
new picture.
Deep learning is becoming more popular because its models can get better results. It uses large
sets of labeled data and neural network architectures to train the models.
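As an illustration of layers of connected "neurons" learning from labeled data, here is a minimal NumPy sketch that trains a tiny two-layer network on the XOR problem; the layer sizes, learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# A tiny neural network: a hidden layer learns intermediate features of the
# input, and an output layer combines them. Trained by gradient descent on XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)          # labeled examples

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)            # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)            # output layer: 1 neuron
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(5000):
    # forward pass: each layer computes features of the previous layer's output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the error and adjust the weights
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2).ravel())   # approaches [0, 1, 1, 0] after training
```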
Q7 Compare Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).
……………………………………………………………………………………………
……………………………………………………………………………………………
Q8 Compare Descriptive, Predictive and Prescriptive analytics performed under Machine
Learning.
……………………………………………………………………………………………
……………………………………………………………………………………………
1.8 APPLICATION AREAS OF ARTIFICIAL INTELLIGENCE SYSTEMS
Artificial intelligence is the most important factor in transforming economies straight from the
ground up, and it is contributing as an efficient alternative. It has a lot of potential to perform
optimization in any industry, whether it is smart cities, the health sector, agriculture or any other
prospective sector of relevance. Below we have included a few of the sectors in which AI is
functioning as a major source of competitive advantage:
a) Healthcare: The application of AI in healthcare can help address issues of high barriers to
access to healthcare facilities, particularly in rural areas that suffer from poor
connectivity and a limited supply of healthcare professionals. This is especially true in
areas where the supply of healthcare professionals is limited. The deployment of use
cases such as AI-driven diagnostics, personalised treatment, early diagnosis of potential
pandemics, and imaging diagnostics, amongst others, is one way to accomplish this goal.
(Figure: AI and Robotics in healthcare: keeping well, early detection, diagnosis, decision
making, treatment, end-of-life care, research and training)
b) Agriculture: AI has the potential to bring in a food revolution while simultaneously satisfying
the ever-increasing need for food (global need to produce 50 percent more food and cater to an
additional 2 billion people by 2050 as compared to today). It also has the ability to resolve issues
such as inadequate demand prediction, a lack of secure irrigation, and the abuse or misuse of
pesticides and fertilizers. These are only some of the problems that it could solve. The increase
of crop output through real-time advising is one example of a use case. Other use cases include
the advanced detection of pest infestations and the forecast of crop prices to advise sowing
methods.
Figure 5: AI across the agricultural value chain:
• Farming data: Vast farm data is stored on the cloud, fed to an advanced analytics engine, and
used by agro-input companies to customise services and by farmers to make timely operating
decisions that enhance yield and profitability.
• Connected livestock: Sensors monitor animal health and food intake, and send alerts on health
anomalies or reductions in food/water intake.
• Smart drones: Drones survey fields; map weeds, yield and soil variations; enable application of
inputs and map productivity. Drones are also used for applying pesticide and herbicide.
• Autonomous tractor: A GPS-controlled autonomous tractor charts its route automatically,
ploughs the land, saves fuel, reduces soil erosion and maintains the soil.
All of the stages of the agricultural value chain indicated above in figure 5 have the potential for the
application of artificial intelligence and other associated technologies to have an impact on the levels of
production and efficiency at those stages.
c) Smart Mobility, including Transports and Logistics: Autonomous fleets for ride sharing, semi-
autonomous features such as driver assistance, and predictive engine monitoring and maintenance are all
possible use cases for smart mobility, which includes transportation and logistics. Other areas where AI
can have a positive impact include self-driving trucks and delivery, as well as better traffic control.
d) Retail: The retail industry was one of the first to use AI solutions. For example, personalised
suggestions, browsing based on user preferences, and image-based product search have all been used to
improve the user experience. Other use cases include predicting what customers will want, keeping track
of inventory better, and managing deliveries more efficiently.
e) Manufacturing: AI-based solutions are expected to help the manufacturing industry the most. This will
make possible the "Factory of the Future" by allowing flexible and adaptable technical systems to
automate processes and machinery that can respond to new or unexpected situations by making smart
decisions. Impact areas include engineering (AI for R&D), supply chain management (predicting
demand), production (AI can cut costs and increase efficiency), maintenance (predictive maintenance and
better use of assets), quality assurance (e.g., vision systems with machine learning algorithms to find
flaws and differences in product features), and in-plant logistics and warehousing.
f) Energy: In the energy sector, possible use cases include modelling and forecasting the energy system to
make it less unpredictable and make balancing and using power more efficient. In renewable energy
systems, AI can help store energy through smart meters and intelligent grids. It can also make
photovoltaic energy more reliable and less expensive. AI could also be used to predict maintenance of
grid infrastructure, just like it is in manufacturing.
g) Smart Cities: Integrating AI into newly built smart cities and infrastructure could also help meet the
needs of a population that is moving to cities quickly and improve the quality of life for those people.
Some possible use cases include controlling traffic to reduce traffic jams and managing crowds better to
improve security.
h) Education and Skilling: Quality and access problems in the education sector might be fixed by AI.
Possible uses include adding to and improving the learning experience through personalized learning,
automating and speeding up administrative tasks, and predicting when a student needs help to keep them
from dropping out or to suggest vocational training.
i) Financial industry: The financial industry also uses AI. For example, it helps the fraud department of a
bank find and flag suspicious banking and finance activities like unusual debit card use and large account
deposits. AI is also used to make trading easier and more efficient. This is done by making it easier to
figure out how many securities are being bought and sold and how much they cost.
There are various ways to use artificial intelligence, and the technology can be used in different
industries and sectors. The adoption of AI by different sectors has been shaped by technical
and regulatory challenges, but the biggest factor has been its impact on business.
1.9 INTELLIGENT AGENTS
An agent may be thought of as an entity that acts, generally on behalf of someone else. More precisely, an
agent is an entity that perceives its environment through sensors and acts on the environment through
actuators. Some experts in the field require an agent to be additionally autonomous and goal directed also.
A percept may be thought of as an input to the agent through its sensors, over a unit of time, sufficient
enough to make some sense of the input.
Percept sequence is a sequence of percepts, generally long enough to allow the agent to initiate some
action.
In order to further have an idea about what a computer agent is, let us consider one of the first
definitions of agent, which was coined by John McCarthy and his colleagues at MIT.
A software agent is a system which, when given a goal to be achieved, could carry out the details of the
appropriate (computer) operations and further, in case it gets stuck, it can ask for advice and can receive it
from humans, may even evaluate the appropriateness of the advice and then act suitably.
Essentially, a computer agent is a computer software that additionally has the following attributes:
(i) it has autonomous control i.e., it operates under its own control
(ii) it is perceptive, i.e., it is capable of perceiving its own environment
(iii) it persists over a long period of time
(iv) it is adaptive to changes in the environment and
(v) it is capable of taking over others’ goals.
As the concept of Intelligent Agents is relatively new, different pioneers and other experts have been
conceiving and using the term in different ways. There are two distinct but related approaches for
defining an agent. The first approach treats an agent as an ascription i.e., the perception of a person
(which includes expectations and points of view) whereas the other approach defines an agent on the basis
of the description of the properties that the agent to be designed is expected to possess.
Let us first discuss the definition of agent according to first approach. Among the people who consider an
agent as an ascription, a popular slogan is “Agent is that agent does”. In everyday context, an agent is
expected to act on behalf of someone to carry out a particular task, which has been delegated to it. But to
perform its task successfully, the agent must have knowledge about the domain in which it is operating
and also about the properties of its current user in question. In the course of normal life, we hire different
agents for different jobs based on the required expertise for each job. Similarly, a non-human intelligent
agent also is embedded with the required expertise of the domain as per the requirements of the job under
consideration. For example, a football-playing agent would be different from an email-managing agent,
although both will have the common attribute of modeling their user.
According to the second approach, an agent is defined as an entity, which functions continuously and
autonomously, in a particular environment, which may have other agents also. By continuity and
autonomy of an agent, it is meant that the agent must be able to carry out its job in a flexible and
intelligent fashion and further is expected to adapt to the changes in its environment without requiring
constant human guidance or intervention. Ideally, an agent that functions continuously in an environment
over a long period of time would also learn from its experience. In addition, we expect an agent, which
lives in a multi-agent environment, to be able to communicate and cooperate with them, and perhaps
move from place to place in doing so.
According to the second approach to defining agent, an agent is supposed to possess some or all of the
following properties:
Reactivity: The ability of sensing the environment and then acting accordingly.
Autonomy: The ability of moving towards its goal, changing its moves or strategy, if required,
without much human intervention.
Communicating ability: The ability to communicate with other agents and humans.
Ability to coexist by cooperating: The ability to work in a multi-agent environment to achieve a
common goal.
Ability to adapt to a new situation: Ability to learn, change and adapt to the situations in the world
around it.
Ability to draw inferences: The ability to infer or conclude facts, which may be useful, but are not
available directly.
Temporal continuity: The ability to work over long periods of time.
Personality: Ability to impersonate or simulate someone, on whose behalf the agent is acting.
Mobility: Ability to move from one environment to another.
Task environments or problem environments are the environments, which include all the elements
involved in the problems for which agents are thought of as solutions. Task environments will vary with
every new task or problem for which an agent is being designed. Specifying the task environment is a
long process which involves looking at different measures or parameters. Next, we discuss a standard set
of measures or parameters for specifying a task environment under the heading PEAS.
The four parameters, viz., Performance measure, Environment, Actuators and Sensors, may be collectively called PEAS. We explain these parameters further, through an
example of an automated agent, which we will preferably call automated public road transport driver.
This is a much more complex agent than the simple boundary following robot which we have already
discussed.
Example (An Automated Public Road Transport Driver Agent)
Performance Measures: Some of the performance measures which can easily be perceived of an
automated public road transport driver would be: a safe and comfortable trip, minimum travel time
and fuel consumption, adherence to traffic rules, and maximum profit.
Environment (or the world around the agent): We must remember that the environment or the world
around the agent is extremely uncertain or open ended. There are unlimited combinations of possibilities
of the environment situations, which such an agent could face. Let us enumerate some of the possibilities
or circumstances which an agent might face:
Variety of roads, e.g., from 12-lane expressways and freeways to dusty rural bumpy roads; different
road rules, including the ones requiring left-hand drive in some parts of the world and right-hand
drive in other parts.
The degree of knowledge of various places through which and to which driving is to be done.
All kind of other traffic possibly including heavy vehicles, ultra-modern cars, three-wheelers and even
bullock carts.
Sensors: The agent acting as automated public road transport driver must have some way of sensing the
world around it i.e., the traffic around it, the distance between the automobile and the automobiles ahead
of it and its speed, the speeds of neighboring vehicles, the condition of the road, any turn ahead etc. It
may use sensors like odometer, speedometer, sensors telling the different parameters of the engine,
Global Positioning System (GPS) to understand its current location and the path ahead. Also, there should
be some sort of sensors to calculate its distance from other vehicles etc.
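The PEAS description of the automated public road transport driver can be summarised in a simple data structure, sketched below in Python; since the unit's list of performance measures and actuators is only partially given, the field values here are plausible assumptions rather than a definitive specification.

```python
from dataclasses import dataclass

# A PEAS specification captured as a plain data structure; the field values
# paraphrase the automated public road transport driver discussed above.
@dataclass
class PEAS:
    performance: list  # how the agent's success is judged
    environment: list  # the world the agent operates in
    actuators: list    # the agent's means of acting on the world
    sensors: list      # the agent's means of perceiving the world

driver = PEAS(
    performance=["safe trip", "legal driving", "passenger comfort",
                 "minimum time and fuel", "profit"],
    environment=["roads (expressways to rural tracks)", "other traffic",
                 "pedestrians", "left/right-hand drive rules"],
    actuators=["steering", "accelerator", "brake", "horn", "indicators"],
    sensors=["odometer", "speedometer", "engine sensors", "GPS",
             "distance sensors"],
)
print(driver.performance)
```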
We must remember that the example agent, the automated public road transport driver, which we have
considered above, is quite difficult to implement. However, there are many other agents, which operate in
comparatively simpler and less dynamic environments, e.g., a game playing robot, an assembly line robot
control, and an image processing agent etc.
In respect of the design and development of intelligent agents, with the passage of time, the momentum
seems to have shifted from hardware to software, the latter being thought of as a major source of
intelligence. But, obviously, some sort of hardware is essentially needed as a home to the intelligent
agent. An intelligent agent thus needs:
• A (hardware) device with sensors and actuators in which that agent will reside, called the
architecture of the agent.
• An agent program that will convert or map the percepts into actions.
Also, the agent program and its architecture are related in the sense that for a different agent architecture a
different type of agent program is required and vice-versa. For example, in case of a boundary following
robot, if the robot does not have the capability of sensing adjacent cells to the right, then the agent
program for the robot has to be changed.
Next, we discuss different categories of agents, which are differentiated from each other on the basis of
their agent programs. Capability to write efficient agent programs is the key to the success for developing
efficient rational agents. Although the table driven approach (in which an agent acts on the basis of the set
of all possible percepts by storing these percepts in tables) to design agents is possible yet the approach of
developing equivalent agent programs is found much more efficient.
Next, we discuss some of the general categories of agents based on their agent programs. Agents can be
grouped into the following classes based on their degree of perceived intelligence and capability. All these
agents can improve their performance and generate better action over time. These are given below:
SR (Simple Reflex) agents
Model Based reflex agents
Goal-based agents
Utility based agents
Stimulus-Response Agents
Learning agents
SR (Simple Reflex) agents: These are the agents or machines that have no internal state (i.e., they don’t
remember anything) and simply react to the current percepts in their environments. An interesting set of
agents can be built, the behaviour of the agents in which can be captured in the form of a simple set of
functions of their sensory inputs. One of the earliest implemented agents of this category was called
Machina Speculatrix. This was a device with wheels, a motor, photo cells and vacuum tubes, and was
designed to move in the direction of light of lower intensity and to avoid the direction of
bright light. A boundary following robot is also an SR agent. For an automobile-driving agent also,
some aspects of its behavior like applying brakes immediately on observing either the vehicle
immediately ahead applying brakes or a human being coming just in front of the automobile suddenly,
show the simple reflex capability of the agent. Such a simple reflex action in the agent program of the
agent can be implemented with the help of simple condition-action rules.
Although the implementation of SR agents is simple, on the negative side this type of agent has very
limited intelligence, because such agents do not store or remember anything. As a consequence, they
cannot make use of any previous experience; in summary, they do not learn. Also, they are capable of
operating correctly only if the environment is fully observable.
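A condition-action rule table of the kind mentioned above can be sketched in a few lines of Python; the percept fields and actions are invented for illustration.

```python
# A simple reflex (SR) agent: condition-action rules map the current percept
# directly to an action, with no memory of past percepts.
RULES = [
    (lambda p: p["vehicle_ahead_braking"], "apply brakes"),
    (lambda p: p["pedestrian_in_front"],   "apply brakes"),
    (lambda p: p["signal"] == "green",     "move forward"),
]

def sr_agent(percept):
    for condition, action in RULES:
        if condition(percept):        # first rule whose condition fires wins
            return action
    return "do nothing"               # default when no rule applies

print(sr_agent({"vehicle_ahead_braking": False,
                "pedestrian_in_front": False,
                "signal": "green"}))  # -> 'move forward'
```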
Model Based Reflex agents: Simple Reflex agents are not capable of handling task environments that are
not fully observable. In order to handle such environments properly, in addition to reflex capabilities, the
agent should maintain some sort of internal state in the form of a function of the sequence of percepts
received up to the time of action by the agent. Using the percept sequence, the internal state is
determined in such a manner that it reflects some of the aspects of the unobservable environment. Further,
in order to reflect properly the unobserved environment, the agent is expected to have a model of the task
environment encoded in the agent’s program, where the model has the knowledge about–
(i) the process by which the task environment evolves independent of the agent and
(ii) the effects that the actions of the agent have on the environment.
Thus, in order to handle properly the partial observability of the environment, the agent should have a
model of the task environment in addition to reflex capabilities. Such agents are called Model-based
Reflex Agents.
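The following small Python sketch illustrates the idea of an internal state that tracks an aspect of the environment (whether the vehicle ahead is slowing down) which is not directly observable from any single percept; the percept fields are assumptions made for the example.

```python
# A model-based reflex agent: the internal state summarizes the percept
# history, letting the agent infer facts no single percept reveals.
state = {"its_speed_falling": False}

def update_state(state, percept):
    """Fold the newest percept into the internal state (the 'model' step)."""
    if percept.get("distance_ahead") is not None:
        prev = state.get("last_distance")
        # inferred from two successive observations, not one percept:
        state["its_speed_falling"] = (prev is not None
                                      and percept["distance_ahead"] < prev)
        state["last_distance"] = percept["distance_ahead"]
    return state

def model_based_agent(percept):
    update_state(state, percept)
    return "slow down" if state["its_speed_falling"] else "keep speed"

print(model_based_agent({"distance_ahead": 50}))  # -> 'keep speed'
print(model_based_agent({"distance_ahead": 40}))  # -> 'slow down' (gap shrinking)
```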
Goal Based Agents: In order to design an appropriate agent for a particular type of task, we know that
the nature of the task environment plays an important role. Also, it is desirable that the complexity of the
agent should be minimum and just sufficient to handle the task in a particular environment. In this regard,
first we discussed the simplest type of agents, viz., Simple Reflex Agents. The action of this type of agent
is decided by the current percept only. Next, we discussed the Model-Based Reflex Agents, for which an
action is decided by taking into consideration not only the latest percept, but the whole percept history,
summarized in the form of internal state. Also, action for this type of agent is also decided by taking into
consideration the knowledge of the task environment, represented by a model of the environment and
encoded into the agent’s program. However, in respect of a number of tasks, even this much knowledge
may not be sufficient for appropriate action. For example, when we are going from city A to city B, in
order to take appropriate action, it is not enough to know the summary of actions and path which has
taken us to some city C between A and B. We also have to remember the goal of reaching city B.
Goal based agents are driven by the goal they want to achieve, i.e., their actions are based on the
information regarding their goal, in addition to, of course, other information in the current state.
This goal information is also a part of the current state description and it describes everything that is
desirable to achieve the goal. As mentioned earlier, an example of a goal-based agent is an agent that is
required to find the path to reach a city. In such a case, if the agent is an automobile driver agent, and if
the road is splitting ahead into two roads, then the agent has to decide which way to go to achieve its goal
of reaching its destination. Further, if there is a crossing ahead then the agent has to decide, whether to go
straight, to go to the left or to go to the right. In order to achieve its goal, the agent needs some
information regarding the goal which describes the desirable events and situations to reach the goal. The
agent program would then use this goal information to decide the set of actions to take in order to reach
its goal.
Another desirable capability which a good goal-based agent should have is that if the agent finds that a
part of the sequence of its previous steps has taken it away from its goal, then it should be able to
retract and restart its actions from a point which may take it toward the goal.
In order to take appropriate action, decision-making process in goal-based agents may be simple or quite
complex depending on the problem. Also, the decision-making required by the agents of this kind
needs some sort of looking into the future. For example, it may analyze the possible outcome of a
particular action before it actually performs that action. In other words, we can say that the agent would
perform some sort of reasoning of if-then-else type, e.g., an automobile driver agent having one of its
goals as not to hit any vehicle in front of it, when finds the vehicle immediately ahead of it slowing down
may not apply brakes with full force and instead may apply brakes slowly so that the vehicles following it
may not hit it.
As the goal-based agents may have to reason before they take an action, these agents might be slower
than other types of agents but will be more flexible in taking actions as their decisions are based on the
acquired knowledge which can be modified also. Hence, as compared to SR agents which may require
rewriting of all the condition-action rules in case of change in the environment, the goal-based agents can
adapt easily when there is any change in its goal.
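The "looking into the future" of a goal-based agent can be illustrated by a simple breadth-first search over a hypothetical road map; the cities and connections below are invented for the example.

```python
from collections import deque

# A goal-based agent plans ahead: it searches a map of road connections for a
# sequence of moves that reaches the goal city before acting.
ROADS = {"A": ["C", "D"], "C": ["B", "E"], "D": ["E"], "E": ["B"], "B": []}

def plan_route(start, goal):
    frontier = deque([[start]])
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:                  # the goal test drives the agent
            return path
        for nxt in ROADS.get(path[-1], []):
            if nxt not in path:               # avoid retracing our own steps
                frontier.append(path + [nxt])
    return None                               # no route exists

print(plan_route("A", "B"))   # -> ['A', 'C', 'B']
```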
Utility Based Agents: A goal-based agent's success or failure is judged in terms of its capability for
achieving or not achieving its goal. A goal-based agent, for a given pair of environment state and possible
input, only knows whether the pair will lead to the goal state or not. Such an agent will not be able to
decide in which direction to proceed when there are more than one conflicting goals. Also, in a goal-
based agent, there is no concept of partial success or somewhat satisfactory success. Further, if there are
more than one method of achieving a goal, then no mechanism is incorporated in a Goal-based agent of
choosing or finding the method which is faster and more efficient one, out of the available ones, to reach
its goal.
A more general way to judge the success or happiness of an agent is to assign to each state a number as
an approximate measure of its success in reaching the goal from that state. If the agent is embedded with
such a capability of assigning numbers to states, then it can choose, out of the states reachable in the next
move, the state with the highest assigned number, indicating possibly the best chance of reaching the
goal.
This allows the goal to be achieved more efficiently. Such an agent will be more useful, i.e., will have
more utility. A utility-based agent uses a utility function, which maps each of the world states of the
agent to some degree of success. If it is possible to define the utility function accurately, then the agent
will be able to reach the goal quite efficiently. Also, a utility-based agent is able to make decisions in case
of conflicting goals, generally choosing the goal with the higher success rating or value. Further, in
environments with multiple goals, the utility-based agent quite likely chooses the goal with the least cost
or higher utility out of the multiple goals.
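The following is a minimal sketch of utility-driven action selection, assuming a hypothetical utility function and a hypothetical model of action outcomes; it is illustrative, not a prescribed implementation.

# A minimal sketch of utility-based action selection (illustrative only).
# The utility function, states, and successor model are assumptions.

def choose_action(state, actions, result, utility):
    """Pick the action whose resulting state has the highest utility.

    result(state, action) -> successor state (the agent's model)
    utility(state) -> number measuring degree of success in that state
    """
    return max(actions(state), key=lambda a: utility(result(state, a)))

# Tiny usage example: the state is a distance from the goal at 0, and
# the utility prefers being closer to the goal.
if __name__ == "__main__":
    actions = lambda s: ["left", "right"]
    result = lambda s, a: s - 1 if a == "right" else s + 1
    utility = lambda s: -abs(s)          # goal at 0: closer is better
    print(choose_action(5, actions, result, utility))  # -> 'right'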
Stimulus-Response Agents: A stimulus-response agent (or reactive agent) takes input from the world
through sensors and then acts on those inputs through actuators. Between the stimulus and the response
there is a processing unit that can be arbitrarily complex. An example of such an agent is one that
controls a vehicle in a racing game: the agent "looks" at the road and nearby vehicles, and then decides
how much to turn and brake. Such agents (stimulus-response agents are the reactive agents) represent a
special category of agents which do not possess internal, symbolic models of their environments;
instead, they act/respond in a stimulus-response manner to the present state of the environment in which
they are embedded. These agents are relatively simple and they interact with other agents in basic ways.
Nevertheless, complex patterns of behavior emerge from the interactions when the ensemble of agents is
viewed globally.
Learning Agents: It is not possible to encode in advance all the knowledge required by a rational agent
for optimal performance during its lifetime. This is especially true of real-life, and not just theoretical,
environments. These environments are dynamic in the sense that the environmental conditions change,
not only due to the actions of the agents under consideration, but due to other environmental factors also.
For example, all of a sudden, a pedestrian comes just in front of the moving vehicle, even when there is a
green signal for the vehicle. In a multi-agent environment, all the possible decisions and actions an agent
is required to take are generally unpredictable in view of the decisions taken and actions performed
simultaneously by other agents. Hence, the ability of an agent to succeed in an uncertain and
unknown environment depends on its learning capability, i.e., its capability to appropriately update
its knowledge of the environment. A learning agent has the following components:
(i) Learning Component: For an agent with learning capability, some initial knowledge is coded into the
agent program, and after the agent starts operating, it learns from its actions, the evolving environment,
the actions of its competitors or adversaries, etc., so as to improve its performance in an ever-changing
environment. If an appropriate learning component is incorporated in the agent, then the knowledge of
the agent gradually increases after each action, starting from the initial knowledge which was manually
coded into it at the start.
(ii) Performance Component: It is the component from which all actions originate, on the basis of
external percepts and the knowledge provided by the learning component.
The design of the learning component and the design of the performance component are very much
related to each other, because a learning component is of no use unless the performance component can
be designed to convert the newly acquired knowledge into better actions.
(iii) Critic Component: This component finds out how well the agent is doing with respect to a certain
fixed performance standard, and it is also responsible for any future modifications in the performance
component. The critic is necessary to judge the agent's success with respect to the chosen
performance standard, especially in a dynamic environment. For example, in order to check
whether a certain job is accomplished, the critic will not depend on external percepts only, but will
also compare the current state to the state which indicates the completion of that task.
(iv) Problem Generator Component: This component is responsible for suggesting actions (some of
which may not be optimal) in order to gain some fresh and innovative experiences. Thus, this
component allows the agent to experiment a little by sometimes traversing uncharted territory through
new and suboptimal actions. This may be useful because actions which seem suboptimal in the short
run may turn out to be much better in the long run.
In the case of an automobile driver agent, the agent would be of little use if it did not have learning
capability, as the environment in which it has to operate is totally dynamic and unpredictable in nature.
Once the automobile driver agent starts operating, it keeps on learning from its experiences, both
positive and negative. If faced with a totally new and previously unknown situation, e.g., encountering a
vehicle coming from the opposite direction on a one-way road, the problem generator component of the
driver agent might suggest some innovative action to tackle this new situation. Moreover, learning is
more difficult in the case of an automobile driver agent, because the environment is only partially
observable.
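A minimal sketch of how the four components of a learning agent fit together is given below; every class and method name is an illustrative assumption, not a fixed architecture.

# A minimal sketch of the four learning-agent components working together.
import random

class LearningAgent:
    def __init__(self):
        self.knowledge = {}            # initially hand-coded knowledge

    def performance(self, percept):
        """Performance component: choose an action from current knowledge."""
        return self.knowledge.get(percept, "default-action")

    def critic(self, percept, action, outcome, standard=0.0):
        """Critic: compare the outcome against a fixed performance standard."""
        return outcome - standard      # positive feedback means 'doing well'

    def learn(self, percept, action, feedback):
        """Learning component: update knowledge using the critic's feedback."""
        if feedback < 0:
            self.knowledge[percept] = "revised-action"

    def problem_generator(self, percept):
        """Problem generator: occasionally try a suboptimal, exploratory action."""
        if random.random() < 0.1:      # explore 10% of the time
            return "exploratory-action"
        return self.performance(percept)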
Different Forms of Learning in Agents: The purpose of embedding learning capability in an agent is
that it should not depend totally on the knowledge initially encoded in it and on the external percepts for
its actions. The agent learns by evaluating its own decisions and/or making observations of new situations
it encounters in the ever-changing environment.
There may be various criteria for developing learning taxonomies. The criteria may be based on –
The type of knowledge learnt, e.g., concepts, problem-solving or game playing,
The type of representation used, e.g., predicate calculus, rules or frames,
The area of application, e.g., medical diagnosis, scheduling or prediction.
Q10 What are Task environments? Briefly discuss the standard set of measures or parameters for
specifying a task environment under the heading PEAS.
……………………………………………………………………………………………
……………………………………………………………………………………………
1.10 SUMMARY
In this unit, we learned about the difference between knowledge and intelligence and also pointed
out the meaning of Artificial Intelligence (AI), along with the applications of AI systems in various
fields. The unit also covers the historical development of the field of AI. Along with the
development of AI as a discipline, the need for classification of AI systems was felt, and hence
the unit discussed the classification of AI systems in detail. Further, the unit discussed
the concepts of Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL).
Finally, the unit discussed the components of intelligence, which was extended to the
understanding of the concept of intelligent agents, with special emphasis on Stimulus-Response
Agents.
1.11 SOLUTIONS/ANSWERS
Q5 How do we measure whether Artificial Intelligence is making a machine behave, act or perform like a
human being or not?
Q10 What are Task environments? Briefly discuss the standard set of measures or parameters for
specifying a task environment under the heading PEAS.
2.0 Introduction
2.1 Objectives
2.2 Introduction to State Space Search
2.2.1 Problem Formulation
2.2.2 Structure of a State space
2.2.3 Problem solution of State space
2.2.4 Searching for solution in state spaces formulated
2.3 Formulation of 8 puzzle problem from AI perspective
2.4 N-queen's problem - Formulation and Solution
2.4.1 Formulation of 8 Queen’s problem
2.4.2 State space tree for 4-Queen’s problem
2.4.3 Backtracking approach to solve N Queen’s problem
2.5 Two agent search: Adversarial search
2.5.1 Elements of Game playing search
2.5.2 Types of algorithms in Adversarial search
2.6 Minimax search strategy
2.6.1 Minimax algorithm
2.6.2 Working of Minimax algorithm
2.6.3 Properties of Minimax algorithm
2.6.4 Advantages and Disadvantages of Minimax search
2.7 Alpha-Beta Pruning algorithm
2.7.1 Working of Alpha-Beta pruning
2.7.2 Move Ordering of Alpha-Beta pruning
2.8 Summary
2.9 Solutions/Answers
2.10 Further readings
2.0 INTRODUCTION
Many AI-based applications need to figure out how to solve problems. In the world, there are
two types of problems. First, the problem which can be solved by using deterministic procedure
and the success is guaranteed. But most real-world problems can be solved only by searching a
solution. AI is concerned with these second types of problems solving.
To build a system to solve a particular problem, we need to:
Define the problem precisely - find the initial and final configurations for an acceptable solution
to the problem.
Analyse the problem - find a few important features that may have an impact on the
appropriateness of various possible techniques for solving the problem.
Isolate and represent the task knowledge necessary to solve the problem.
Choose the best problem-solving technique(s) and apply it to the particular problem.
a. Define a state space that contains all the possible configurations of the relevant objects.
b. Specify one or more states that describe possible situations from which the problem-
solving process may start. These states are called initial states.
c. Specify one or more states that would be acceptable as solutions to the problem. These
states are called goal states.
d. Define a set of rules for the actions (operators) that can be taken.
The problem can then be solved by using the rules, in combination with an appropriate control
strategy, to move through the problem space until a path from an initial state to a goal state is
found. This process is known as ‘search’. Thus, search is fundamental to the problem-solving
process. Search is a general mechanism that can be used when a more direct method is not
known. Search provides the framework into which more direct methods for solving subparts of a
problem can be embedded. All AI problems are formulated as search problems.
A problem space is represented by a directed graph, where nodes represent search states and
paths represent the operators applied to change the state. To simplify search algorithms, it is
often convenient to logically and programmatically represent a problem space as a tree. A tree
usually decreases the complexity of a search, at a cost. Here, the cost is due to duplicating some
nodes in the tree that were linked multiple times in the graph, e.g., node B and node D in the
example below.
[Figure: a directed graph over nodes A, B, C, D and the corresponding tree expansion, in which
nodes B and D appear more than once because they are reachable along several paths in the graph.]
A tree is a graph in which any two vertices are connected by exactly one path. Alternatively, any
connected graph with no cycles is a tree.
Before an AI problem can be solved it must be represented as a state space. Here, a state means a
representation of the elements at a given moment. Among all possible states, there are two special
states called the initial state (the start point) and the final state (the goal state). A successor function (a
set of operators) is used to change the state, i.e., to move from one state to another. A state
space is the set of all states reachable from the initial state. A state space essentially consists of a
set of nodes representing each state of the problem, arcs between nodes representing the legal
moves from one state to another, an initial state, and a goal state. Each state space takes the form
of a tree or a graph. In AI, a wide range of problems can be formulated as search problems. The
process of searching means finding a sequence of actions that takes you from an initial state to a goal
state, as shown in the following figure 1.
So, the state space is one of the methods to represent a problem in AI. The set of all possible
states reachable from the initial state by taking some sequence of actions (using some operators)
for a given problem is known as the state space of the problem. A state space represents a
problem in terms of states and operators that change states.
In this unit we examine the concept of a state space and the different search processes that can be
used to explore the search space in order to find a solution (goal) state. In the worst case, search
explores all possible paths between the initial state and the goal state.
For a better understanding of the definitions described above, consider the following 8-puzzle
problem:
Start state:
3 1 2
4 _ 5
6 7 8
Goal state (tiles in a specific order):
_ 1 2
3 4 5
6 7 8
The only operator is "move the blank". For instance, moving the blank up in the start state gives
3 _ 2
4 1 5
6 7 8
while moving the blank left and then up leads to the goal state.
The structures of a state space are trees and graphs. A tree has one and only one path from any
point to any other point. A graph consists of a set of nodes (vertices) and a set of edges (arcs). Arcs
establish relationships (connections) between the nodes, i.e., a graph may have several paths to a given
node. Operators are directed arcs between nodes.
The method of solving a problem through AI involves the process of defining the search space,
deciding the start and goal states, and then finding a path from the start state to the goal state through
the search space.
The search process explores the state space. In the worst case, the search explores all possible paths
between the initial state and the goal state.
In a state space, a solution is a path from the initial state to a goal state or, sometimes, just a goal
state. A numeric cost is assigned to each path; it is the cost of applying the operators to
the states. A path cost function is used to measure the quality of a solution, and out of all possible
solutions, an optimal solution has the lowest path cost. The importance of cost depends on the
problem and the type of solution asked for.
2.2.3 Problem formulation:
Many problems can be represented as state spaces. The state space of a problem includes: an
initial state, one or more goal states, and a set of state transition operators (or a set of production
rules), used to change the current state to another state; these are also known as actions. A control
strategy is used that specifies the order in which the rules will be applied, for example, depth-first
search (DFS), breadth-first search (BFS), etc. It helps to find the goal state or a path to the
goal state.
In general, a state space is represented by a 4-tuple as follows: Ss : [S, s0, O, G], where S is the set of
all possible states, s0 ∈ S is the start (initial) state, O is the set of state transition operators (production
rules), and G is the set of goal states.
The production rules are represented in the form of pairs. Each pair consists of a left side that
determines the applicability of the rule and a right side that describes the action to be performed,
if the rule is applied.
The sequence of actions (or operators) is called a solution path. It is a path from the initial state
to a goal state. This sequence of actions leads through a number of states, starting from the initial
state and ending in a goal state, as {s0, s1, s2, … , sn ∈ G}. Such a sequence of states is called a path.
The cost of a path is a positive number; in most cases the path cost is computed as the sum of the
costs of the individual actions.
The following figure 3 shows a search process in a given state space.
[Figure 3: a state space in which an initial state is connected by actions to a goal state.]
Searching is applied to a search tree which is generated through state expansion, that is, by applying
the successor function to the current state; note that here, by a state we mean a node in the search
tree.
Generally, search is about selecting one option and putting the others aside for later, in case the
first option does not lead to a solution. The choice of which option to expand first is determined
by the search strategy used.
Thus, the problem is solved by using the rules (operators), in combination with an appropriate
control strategy, to move through the problem space until a path from the initial state to a goal state
is found. This process is known as search. A solution path is a path in the state space from s0 (the
initial state) to G (a goal state).
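Before turning to concrete problems, the following is a minimal sketch of state-space search driven by a control strategy; the function and parameter names are illustrative assumptions. A FIFO frontier gives breadth-first search; swapping in a LIFO frontier would give depth-first search.

# A minimal sketch of state-space search (breadth-first control strategy).
from collections import deque

def search(initial_state, goal_test, successors):
    """Return a solution path (list of states) from the initial state
    to a goal state, or None if the state space is exhausted."""
    frontier = deque([[initial_state]])      # frontier holds whole paths
    visited = {initial_state}
    while frontier:
        path = frontier.popleft()            # FIFO -> breadth-first control
        state = path[-1]
        if goal_test(state):
            return path
        for nxt in successors(state):        # apply the operators (rules)
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None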
Problem statement:
Given two jugs, a 4-gallon and a 3-gallon, neither of which has any measuring indicator on it. The
jugs can be filled with water with the help of a pump that is available (as shown in figure 4).
The question is: "How can you get exactly 2 gallons of water into the 4-gallon jug?"
Production Rules
The following two solutions were found for the problem "how can you get exactly 2 gallons of water
into the 4-gallon jug", as shown in Table-2 and Table-3.
Solution-1:
Table-2: Getting exactly 2 gallons of water into the 4-gallon jug (solution 1)
A state space tree for the water jug problem (WJP), with all possible solutions, is shown in figure 7.
[Figure 7: state-space tree rooted at [0,0], with successors [4,0] and [0,3], and deeper states such as
[1,0], [3,3], [0,1], [4,2], [4,1], [0,2], [2,3] and [2,0] on the two solution paths.]
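Assuming the production rules follow the standard formulation (fill either jug, empty either jug, or pour one jug into the other), the water jug problem can be run through the generic search() sketch given earlier. The state (x, y) records the gallons in the 4-gallon and 3-gallon jugs.

# A sketch of the water jug problem using the generic search() above.
def jug_successors(state):
    x, y = state
    moves = {
        (4, y),                        # fill the 4-gallon jug
        (x, 3),                        # fill the 3-gallon jug
        (0, y),                        # empty the 4-gallon jug
        (x, 0),                        # empty the 3-gallon jug
        (x - min(x, 3 - y), y + min(x, 3 - y)),   # pour 4 -> 3
        (x + min(y, 4 - x), y - min(y, 4 - x)),   # pour 3 -> 4
    }
    moves.discard(state)               # ignore rules that change nothing
    return moves

path = search((0, 0), lambda s: s[0] == 2, jug_successors)
# one shortest solution, e.g. [(0,0), (0,3), (3,0), (3,3), (4,2), (0,2), (2,0)]
print(path)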
The eight-tile puzzle consists of a 3 × 3 square frame board which holds eight (8) movable tiles,
numbered 1 to 8. One square is empty, allowing the adjacent tiles to be shifted. The objective of the
puzzle is to find a sequence of tile movements that leads from a starting configuration to a goal
configuration.
Operator: Slide tiles (move the blank) to reach the goal (as shown below). There are 4 operators,
that is, ways of "moving the blank":
Move the blank UP,
Move the blank DOWN,
Move the blank LEFT and
Move the blank RIGHT.
3 1 2        3 1 2        _ 1 2
4 _ 5   ->   _ 4 5   ->   3 4 5
6 7 8        6 7 8        6 7 8
(start)     (blank LEFT)  (blank UP: goal)
Path Cost: the sum of the costs of each step on a path from the initial state to the goal state. Here, the
cost of each action (blank move) = 1, so the cost of a sequence of actions = the number of actions.
An optimal solution is one which has the lowest-cost path.
Performing State-Space Search - basic idea:
Consider the successors (and their successors, ...) of the initial state until you find a goal state.
Different search strategies consider the states in different orders. They may use different data
structures to store the states that have yet to be considered.
The predecessor references connect the search nodes, creating a data structure known as a tree.
[Search tree: the initial state (3 1 2 / 4 _ 5 / 6 7 8) at the root; at the next level, its four successors
obtained by moving the blank up, down, left and right; and so on, until the goal state
(_ 1 2 / 3 4 5 / 6 7 8) appears at a deeper level.]
When we reach a goal, we trace up the tree to get the solution, i.e., the sequence of actions from
the initial state to the goal.
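As a hedged illustration, the "move the blank" operators can be written as a successor function and fed to the generic search() sketch given earlier. States are encoded here, by assumption, as 9-tuples read row by row with 0 standing for the blank.

# A sketch of the 8-puzzle successor function (cost 1 per blank move).
def puzzle_successors(state):
    i = state.index(0)                 # position of the blank
    row, col = divmod(i, 3)
    succ = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # up, down, left, right
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            j = 3 * r + c
            s = list(state)
            s[i], s[j] = s[j], s[i]    # slide the adjacent tile into the blank
            succ.append(tuple(s))
    return succ

start = (3, 1, 2, 4, 0, 5, 6, 7, 8)    # the start state shown above
goal = (0, 1, 2, 3, 4, 5, 6, 7, 8)     # the goal state shown above
path = search(start, lambda s: s == goal, puzzle_successors)
print(len(path) - 1)    # number of blank moves = path cost at cost 1 per move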
Q.1 Find the minimum cost path for the 8-puzzle problem, where the start and goal state are
given as follows:
PATH COST=5
2.4 N-Queen's Problem: Formulation and Solution
The N-queens problem was originally proposed in 1848 by the chess player Max Bezzel, and over the
years many mathematicians, including Gauss, have worked on this puzzle. In 1874, S. Gunther proposed
a method of finding solutions by using determinants, and J.W.L. Glaisher refined this approach.
Solutions that differ only by symmetry operations (rotations and reflections) of the board are
counted as one. For the 4-queens problem, there are 16 possible arrangements on a 4 × 4
chessboard and only 2 possible solutions. Note that there is only 1 unique solution out of the 2
possible solutions, as the second solution is just a mirror image of the first.
Similarly, one possible solution for the 8-queens problem is shown in figure 11.
The 8-queens problem is computationally very expensive, since the total number of possible
arrangements of 8 queens on an 8 × 8 chessboard is C(64, 8) = 64!/(56! × 8!) ≈ 4.4 × 10^9. Note that
the 8-queens problem has 92 distinct solutions and 12 unique solutions, as shown in table-5.
column:   1 2 3 4 5 6 7 8
row 1:    . . . Q . . . .
row 2:    . . . . . Q . .
row 3:    . . . . . . . Q
row 4:    . Q . . . . . .
row 5:    . . . . . . Q .
row 6:    Q . . . . . . .
row 7:    . . Q . . . . .
row 8:    . . . . Q . . .
8-tuple = (4, 6, 8, 2, 7, 1, 3, 5)
The following table-4 summarizes both the distinct and the unique solutions for the problems from
1 queen up to 26 queens. In general, there is no known formula giving the exact number of solutions
of the N-queens problem.
Table-4 Solution of N Queen’s problem for N=1 to N=26, both Unique and Distinct
For the initial state, there are 16 successors. At the next level, each of the states has 15
successors, and so on down the line. This search tree can be restricted by considering only those
successors where no queens attack each other. To do that, we have to check the new
queen against all the other queens on the board. In this way, the answer is found at depth 4. For the
sake of simplicity, you can consider the 4-queens problem and see how it is solved using the
concept of "backtracking".
Fig 12 State-space tree showing all possible ways to place a queen on a 4x4 chess board
So, to reduce the size (not anywhere on the chess board, since there are C(16, 4) = 1820
possibilities), we place queens row by row, with no two queens in the same column. This tree is
called a permutation tree (here we avoid repeating rows or columns but still allow shared diagonals).
Total nodes = 1 + 4 + 4 × 3 + 4 × 3 × 2 + 4 × 3 × 2 × 1 = 65
The edges are labeled by the possible values of xi. Edges from level 1 to level 2 specify the
values of x1; in general, edges from level i to level i + 1 are labeled with the values of xi.
The solution space is defined by all paths from the root node to a leaf node. There are 4! = 24 leaf
nodes in the tree, and the nodes are numbered in depth-first-search order. The state space tree for the
4-queens problem (avoiding repeated rows and columns but allowing shared diagonals) is shown in
figure 13.
Fig 13 State space tree for the 4-queens problem (allowing same diagonal but not same row or
same column)
We can further reduce the search space by avoiding shared diagonals as well. Now you avoid the
same row, the same column and the same diagonal while placing any queen. In this case, the state
space tree looks as shown in figure 15.
Fig 15 State space tree for the 4-queens problem (avoiding same row, column, and diagonal)
Note that queens are placed row by row, that is, Q1 in row 1, Q2 in row 2 and so on. In the tree,
node (1,1) is a promising node (no queen attacks) as Q1 is placed in the 1st row and 1st column. Node
(2,1) is a non-promising node, because we cannot place Q2 in the same column (as Q1 is already
placed in column 1). Non-promising nodes are marked ×. So, we try (2,2): again non-promising (same
diagonal); next we try (2,3): it is a promising node, so we proceed and try to place the 3rd queen in the
3rd row. But in the 3rd row, all positions (3,1), (3,2), (3,3) and (3,4) are non-promising, and we cannot
place Q3 in any of these positions. So, we backtrack and try (2,4), and so on. The backtracking
approach gives all possible solutions. One possible solution for the 4-queens problem is
{(1,2), (2,4), (3,1), (4,3)}, which can also be written as (x1, x2, x3, x4) = (2,4,1,3). There are 2
possible solutions of the 4-queens problem; the other solution is (x1, x2, x3, x4) = (3,1,4,2), which is
a mirror image of the first.
Consider the chessboard squares as indices of the 2-dimensional array [1…n, 1…n]. We observe
that every element on the same diagonal that runs from the upper left to the lower right has the same
(row − column) value; these are called left diagonals. Similarly, every element on the same diagonal
that goes from the upper right to the lower left has the same (row + column) value; these are called
right diagonals. For example, consider a 5 × 5 chessboard as [1…5, 1…5] (as shown in figure 16).
Case 1 (left diagonal): suppose queens are placed on the same diagonal, in locations
(1,2), (2,3), (3,4), (4,5), or (1,4), (2,5), or any other set with the same left-diagonal value. Observe
that every element on the same diagonal has the same (row − column) value. Similarly,
Case 2 (right diagonal): suppose queens are placed on the same diagonal, in locations
(1,3), (2,2), (3,1), or (1,4), (2,3), (3,2), (4,1), or any other set with the same right-diagonal value.
Observe that every element on the same diagonal has the same (row + column) value.
Suppose two queens are placed at positions (i, j) and (k, l); then they are on the same diagonal if
and only if:
(i − j) = (k − l), i.e., (j − l) = (i − k) ----------- (1) [left diagonal]
(i + j) = (k + l), i.e., (j − l) = −(i − k) ----------- (2) [right diagonal]
From equations (1) and (2), we can combine and write one condition to check both diagonals:
abs(j − l) = abs(i − k).
Algorithm NQueen(k, n)
// Using backtracking, this algorithm places queens in rows k to n so that
// they are non-attacking; it is invoked as NQueen(1, n). x[ ] is a global
// array whose first k−1 values have already been set.
{
   for i := 1 to n do
   {
      if Place(k, i) then
      {
         x[k] := i;
         if (k = n) then write (x[1 : n]);
         else NQueen(k + 1, n);
      }
   }
}
Algorithm Place(k, i)
// This algorithm returns true if a queen can be placed in the kth row and
// ith column; otherwise it returns false. x[ ] is a global array, and
// Abs(r) returns the absolute value of r.
{
   for j := 1 to k − 1 do
   {
      if (x[j] = i)                        // two queens in the same column
         or (Abs(x[j] − i) = Abs(j − k))   // or on the same diagonal
      then return false;
   }
   return true;
}
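The following is a runnable Python rendering of the same backtracking scheme, sketched here with rows and columns numbered from 0; x[k] holds the column of the queen in row k.

def place(x, k, i):
    """True if a queen can go in row k, column i, given rows 0..k-1."""
    for j in range(k):
        if x[j] == i or abs(x[j] - i) == abs(j - k):   # same column / diagonal
            return False
    return True

def n_queens(k, n, x, solutions):
    """Try every column of row k; recurse on row k+1, backtrack on failure."""
    for i in range(n):
        if place(x, k, i):
            x[k] = i
            if k == n - 1:
                solutions.append(x[:])       # a complete placement found
            else:
                n_queens(k + 1, n, x, solutions)

sols = []
n_queens(0, 4, [0] * 4, sols)
print(sols)    # [[1, 3, 0, 2], [2, 0, 3, 1]]: the two 4-queens solutions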
Q.1 What are the various factors that need to be taken into consideration when developing a
state-space representation?
Q.2 Consider the following missionaries and cannibals problem:
Three missionaries and three cannibals are on one side of a river, along with a boat that can hold one
or two people. Find a way to get everyone to the other side, without ever leaving a group of
missionaries on either bank outnumbered by cannibals.
a) Formulate the missionaries and cannibals problem.
b) Solve the problem formulated in part (a).
c) Draw the state-space search graph for solving this problem.
Q.3 Draw a state space tree representation to solve the Tower of Hanoi problem. (Hint: you can take
the number of disks n = 2 or 3.)
Q.4: Draw the state space tree for the following 8-puzzle problem, where the start and goal states
are given below. Also find the minimum cost path for this 8-puzzle problem; each blank move
has cost = 1.
Q.5 Discuss a backtracking algorithm to solve the N-queens problem. Draw a state space tree to
solve the 4-queens problem.
To play a game, we use a game tree to enumerate all the possible choices and to pick the best one.
A game-playing problem has the following elements:
S0: It is the initial state from where a game begins.
PLAYER (s): It defines which player is having the current turn to make a move in the
state.
ACTIONS (s): It defines the set of legal moves that a player can make to change the
state.
RESULT (s, a): It is a transition model which defines the result of a move.
TERMINAL-TEST (s): It defines that the game has ended (or over) and returns true.
States where the game has ended are called terminal states.
UTILITY (s, p): It defines the final value with which the game has ended. This function
is also known as the objective function or payoff function. It gives a numeric value for the
outcome of a game.
For example, in chess or tic-tac-toe, we have three possible outcomes: win, lose, or draw,
which we can represent by the values +1, −1 or 0. In other words, the utility is −1 if the
PLAYER loses, +1 if the PLAYER wins, and 0 if there is a draw between the PLAYERS.
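The elements listed above can be written as a Python interface, sketched below; the method names mirror S0, PLAYER, ACTIONS, RESULT, TERMINAL-TEST and UTILITY, but this skeleton is an illustrative assumption, not a prescribed API.

from abc import ABC, abstractmethod

class Game(ABC):
    @abstractmethod
    def initial_state(self):            # S0: where the game begins
        ...

    @abstractmethod
    def player(self, s):                # PLAYER(s): whose turn it is in s
        ...

    @abstractmethod
    def actions(self, s):               # ACTIONS(s): legal moves in s
        ...

    @abstractmethod
    def result(self, s, a):             # RESULT(s, a): transition model
        ...

    @abstractmethod
    def terminal_test(self, s):         # TERMINAL-TEST(s): game over?
        ...

    @abstractmethod
    def utility(self, s, p):            # UTILITY(s, p): payoff, e.g. +1/0/-1
        ...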
Let's understand the working of these elements with the help of a game tree designed for tic-tac-
toe. Here, the nodes represent game states and the edges represent the moves taken by the players.
The root of the tree is the initial state. The next level contains all of MAX's moves, the level after
that all of MIN's moves, and so on. Note that the root has 9 blank squares (MAX to move), level 1
has 8 blank squares (MIN to move), level 2 has 7 blank squares (MAX to move), and so on.
Objective: Player 1 (MAX) maximizes the outcome and Player 2 (MIN) minimizes it.
Terminal (goal) state utility: −1, 0, +1 (that is, a win for X is +1, a win for O is −1, and a draw is 0).
The number on each leaf node indicates the utility value of the terminal state from the point of view
of MAX. High values are assumed to be good for MAX and bad for MIN (which is how the players
get their names). It is MAX's job to use the search tree to determine the best move.
In tic-tac-toe game playing, as shown in figure 17, we have the following elements:
INITIAL STATE (S0): The top node in the game tree represents the initial state and
shows all the possible choices from which to pick one.
PLAYER (s): There are two players, MAX and MIN. MAX begins the game by picking
one best move and placing X in an empty square.
ACTIONS (s): Both players can make moves in the empty boxes, turn by turn.
RESULT (s, a): The moves made by MIN and MAX will decide the outcome of the
game.
TERMINAL-TEST (s): When all the empty boxes are filled, it is a terminating
state of the game.
UTILITY: At the end, we get to know who wins, MAX or MIN, and accordingly
the payoff is assigned: if MAX wins the utility value is +1, if MIN wins the
utility value is −1, and if it is a draw the utility value is 0.
2.5.2 Issues in Adversarial search
In a normal search, we follow a sequence of actions to reach the goal or to finish the game
optimally. But in an adversarial search, the result depends on the players, who together decide the
outcome of the game. Each player will try to win the game in the shortest path and
in limited time. The Minimax search algorithm is an example of adversarial search, and alpha-beta
pruning is used in it to reduce the search space.
Minimax is a two-player (MAX and MIN) game strategy: if one wins, the other loses. MIN tries to
decrease the chances of MAX winning the game, while MAX tries to increase his own chances of
winning. They both play the game alternately, turn by turn, following this strategy: if
one wins, the other will definitely lose. Both players look at one another as competitors and
will try to defeat one another, giving their best.
In the minimax strategy, the result of the game, the utility value, is generated by a heuristic
function and propagated from the leaf (terminal) nodes up to the root node. The method follows the
backtracking technique, backtracking to find the best choice. MAX will choose the path which
increases its utility value, and MIN will choose the opposite path, which helps it to minimize MAX's
utility value.
The algorithm:
Generate the whole game tree, all the way down to the terminal states.
Apply the utility function to each terminal state to get its value.
Use the utilities of the terminal states to determine the utilities of the nodes one level higher
up in the search tree.
Continue backing up the values from the leaf nodes toward the root, one layer at a time.
Eventually, the backed-up values reach the top of the tree; at that point, MAX chooses
the move that leads to the highest value.
This is called a minimax decision, because it maximizes the utility under the assumption that the
opponent will play perfectly to minimize it. To better understand the concept, consider the game tree
(search tree) shown in figure 18.
In figure 18, there are two players, MAX and MIN. MAX starts the game by choosing one path and
propagating all the nodes of that path. Then MAX backtracks to the initial node and chooses the best
path, where his utility value will be the maximum. After this, it is MIN's turn. MIN will also
propagate through a path and again backtrack, but MIN will choose the path which minimizes
MAX's winning chances, i.e., the utility value.
So, if the level is a minimizing level, the node accepts the minimum value from its successor
nodes; if the level is a maximizing level, the node accepts the maximum value from its successors.
In other words, minimax is a decision rule algorithm, represented as a game tree; it has applications
in decision theory, game theory, statistics and philosophy. Minimax is applied in two-player games:
one player is MIN and the other is MAX. By convention, the root of the game tree represents the
MAX player. It is assumed that each player aims to make the best move for himself, and therefore
the worst move for his opponent, in order to win the game. The question may arise: "How do we
deal with the contingency problem?" The answer is:
Assuming that the opponent is rational and always optimizes its behaviour (in opposition to
ours), we consider the opponent's best response.
Then the minimax algorithm determines the best move.
2.6.2 Working of Minimax Algorithm:
Minimax is applicable to decision making in two-agent systems operating in a competitive
environment. The two players P1 and P2, also known as the MIN and MAX players, respectively
minimize and maximize the utility value of the heuristic function. The algorithm uses recursion to
search through the game tree and compute the minimax decision for the current state. We traverse
the complete game tree in a depth-first search (DFS) manner to explore the nodes. The MAX player
always selects the maximum value and the MIN player always selects the minimum value from its
successor nodes. The initial values are set to MAX = −∞ and MIN = +∞. These are the worst values
assigned initially; as the algorithm progresses these values change, and finally we get the optimal
value.
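A minimal recursive sketch of this procedure follows; the tree encoding (a terminal is a number, an internal node is a list of subtrees, with MAX and MIN levels alternating from a MAX root) is an assumption chosen here for brevity. The demonstration tree reuses the leaf values 3, 12, 8 and 2, 15, 6 that appear in the alpha-beta example later in this unit.

def minimax(node, maximizing=True):
    if isinstance(node, (int, float)):          # terminal state: utility value
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Root A is MAX; its children B and C are MIN nodes over terminal values.
print(minimax([[3, 12, 8], [2, 15, 6]]))        # -> 3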
Example 1: Let us take an example of a two-player game tree search (shown in figure 19(a)) to
understand the working of the minimax algorithm.
The value at node F (a MAX node) is MAX = max(−∞, 2) = 2, then max(2, 1) = 2, and
the value at node G (also a MAX node) is MAX = max(−∞, −3) = −3, and then max(−3, 4) = 4.
Thus, node C, which is at a MIN level, selects the minimum of the values of its successor nodes F
and G: MIN = min(2, 4) = 2.
Now, the values at nodes B and C are −1 and 2 respectively. Thus, finally, the value at node A,
which is at the MAX level, is MAX = max(−1, 2) = 2.
The final game tree, with the max or min value at each node and the optimal path A→C→F→2
shaded, is shown in figure 19(b).
Fig 19(b) Game tree with final value at each node with optimal path
Example 2: Consider the following two-player game tree search. The working of the minimax
algorithm is illustrated step by step in figures (a) to (k).
For example, in the game of chess, b ≈ 35 and m ≈ 100 for "reasonable" games. In this case an exact
solution is completely infeasible.
Disadvantages:
It is completely infeasible in practice for games of realistic size.
When the search tree is too large, we need to limit the search depth and apply an
evaluation function to the cut-off states.
Note: The alpha-beta pruning technique can be applied to trees of any depth, and it is often possible to
prune entire sub-trees.
Two parameters are defined for alpha-beta pruning, namely alpha (α) and beta (β). The initial values
are set to α = −∞ and β = +∞. As the algorithm progresses, their values change accordingly. Note that
in alpha-beta pruning, at any node in the tree, if α ≥ β, then we prune (cut) the next branch; otherwise
the search is continued. Note the following points for alpha-beta pruning (a code sketch follows this
list):
The MAX player will only update the value of α (on a MAX level).
The MIN player will only update the value of β (on a MIN level).
We only pass the α and β values from top to bottom (that is, from a parent to a child node,
never from a child to a parent node).
While backtracking the tree, the node values are passed to the upper nodes, not the values
of α and β.
Before going to the next branch of a node in the tree, we check the values of α and β. If
α ≥ β, then we prune (cut) the next (unnecessary) branches (i.e., there is no need to search
the remaining branches where the condition α ≥ β is satisfied); otherwise the search is
continued.
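The sketch below applies these rules to the same explicit tree representation used in the minimax sketch earlier; it is a minimal illustration, not the only way to organize the bookkeeping.

import math

def alphabeta(node, alpha=-math.inf, beta=math.inf, maximizing=True):
    if isinstance(node, (int, float)):          # terminal: utility value
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)           # MAX updates alpha only
            if alpha >= beta:                   # beta cut: prune the rest
                break
        return value
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)             # MIN updates beta only
            if alpha >= beta:                   # alpha cut: prune the rest
                break
        return value

# On figure 22's tree: after seeing G = 2 under C, beta = 2 <= alpha = 3,
# so the terminals H = 15 and I = 6 are never examined.
print(alphabeta([[3, 12, 8], [2, 15, 6]]))      # -> 3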
Consider the example of a game tree where P and Q are two players. The game is played
alternately, turn by turn. Let P be the player who will try to win the game by maximizing his
winning chances, and Q the player who will try to minimize P's winning chances. Here, α will
represent the maximum value of the nodes, which will be the value for P as well, and β will
represent the minimum value of the nodes, which will be the value for Q.
Any one player will start the game. Following the DFS order, the player will choose one
path and reach its depth, i.e., where he will find a TERMINAL value.
If the game is started by player P, he will choose the maximum value in order to increase
his winning chances with the maximum utility value.
If the game is started by player Q, he will choose the minimum value in order to decrease
the winning chances of P with the best possible minimum utility value.
Both will play the game alternately.
The evaluation starts from the last level of the game tree, and the values are chosen
accordingly.
As in the figure, the game is started by player Q. He picks the leftmost TERMINAL value
and fixes it as β. Then the next TERMINAL value is compared with this β-value; if the
new value is smaller than or equal to the β-value, it replaces the current β-value, otherwise
no replacement is needed.
After completing one part, the achieved β-value is moved to its upper node and fixed as the
other threshold value, i.e., α.
Now it is P's turn; he will pick the best maximum value. P will move to explore the next
part only after comparing the values with the current α-value: if a value is equal to or
greater than the current α-value it replaces it; otherwise we prune the branches.
The steps are repeated until the result is obtained.
So, the number of pruned nodes in the above example is four, and MAX wins the game with
the maximum UTILITY value, i.e., 3.
The rule which is followed is: "Explore nodes only if necessary; otherwise prune the
unnecessary ones."
Note: It is obvious that the result has the same UTILITY value that we would get from the
MINIMAX strategy.
Example 1: Let us take an example of a two-player search tree (figure 21) to understand the
working of alpha-beta pruning.
Fig 21 Two-player search tree
We start the search by setting the initial values α = −∞ and β = +∞ at the root node A.
While backtracking the tree (from bottom to top), the node values are passed to the upper nodes,
not the values of α and β. The MAX player only updates the value of α (on a MAX level), and the
MIN player only updates the value of β (on a MIN level).
Step 1: We traverse the tree in a depth-first search (DFS) manner and pass these values of α and β
down to the subsequent node B and then to node D as [α = −∞, β = +∞].
Now at node D [α = −∞, β = +∞]: since node D is at a MAX level, only the α value can change. At
D, we first check the left child (a terminal node) with value 2. This node returns a value of 2, and the
value of α at node D is calculated as α = max(−∞, 2) = 2. So the modified values at node D are
[α = 2, β = +∞]. To decide whether it is worth looking at the right child or not, we check α ≥ β. The
answer is NO, since 2 ≱ +∞. So we proceed, and the search is continued for the right child of
node D.
The right child (a terminal node with value 3) of D returns the value 3. Now at D, the value of α is
compared with the terminal node value 3, that is, α = max(2, 3) = 3. Now the value of node D = 3,
and the final values of α and β at node D are [α = 3, β = +∞], as shown in figure 21(a).
[Fig 21(a): node A with α = −∞, β = +∞; its child B with α = −∞, β = +∞; B's child D with
α = 3, β = +∞ and node value 3; D's terminal children 2 and 3.]
Step 2: We backtrack from node D to B. Note that, while backtracking in the tree, the node value
of D (= 3) is passed to the upper node B, not the values of α and β. Now the value of node B =
node D = 3. Since B is at a MIN level, only the β value changes: at node B, [α = −∞, β = 3] (note
that β changes from +∞ to 3). Here we again check α ≥ β. It is false, so the search is continued on
the right side of B, as shown in figure 21(b).
[Fig 21(b): node B with α = −∞, β = 3 and node value 3; node D with α = 3, β = +∞ and value 3;
terminal children 2 and 3 under D.]
Step 3: B now calls E. We pass the α and β values from the top node B down to node E as
[α = −∞, β = 3]. Since node E is at a MAX level, only the α value can change. At E, we first check
the left child (a terminal node) with value 5; it returns 5. The value of α at node E becomes
α = max(−∞, 5) = 5, so the value of node E = 5 and the modified values of α and β at node E are
[α = 5, β = 3]. To decide whether it is worth looking at the right child or not, we check α ≥ β. The
answer is YES, since 5 ≥ 3. So we prune (cut) the right branch of E, as shown in figure 21(c).
[Fig 21(c): node B with α = −∞, β = 3; node D with α = 3, β = +∞; node E with α = 5, β = 3 and
value 5; E's right terminal child 9 is pruned.]
Step 4: We backtrack from node E to B. While backtracking, the node value of E (= 5) is passed
to the upper node B, not the values of α and β. E returns the value 5 to B. Since B is at a MIN
level, only the β value can change. Previously, at node B, [α = −∞, β = 3]; now
β = min(3, 5) = 3, so there is no change in the β value, and the value of node B is still 3. Thus,
the final values at node B are [α = −∞, β = 3].
We backtrack from node B to A. Again, while backtracking the tree, the value of node B = 3 is
passed to the upper node A, not the values of α and β. Now the value of node A = 3.
[Fig 21(d): node B with α = −∞, β = 3 and value 3; node D with α = 3, β = +∞; node E with
α = 5, β = 3 and value 5; terminals 2, 3 under D and 5, 9 under E.]
Step 5: Now at node C, we pass the α and β values from the top node A down to node C as
[α = 3, β = +∞]. Check α ≥ β: the answer is NO, so the search is continued. Now pass the α and β
values from node C down to node F as [α = 3, β = +∞]. Since F is at a MAX level, only the α value
can change.
At F, we first check the left child (a terminal node) with value 0; it returns 0. The value of α at
node F is calculated as α = max(3, 0) = 3, so the modified values at node F are [α = 3, β = +∞]. To
decide whether it is worth looking at the right child or not, we check α ≥ β. The answer is NO,
since 3 ≱ +∞. So we proceed, and the search is continued for the right child of node F.
The right child (a terminal node with value 1) of F returns the value 1, so finally the value of
node F = 1. At F, the value of α is compared with the terminal node value 1, that is,
α = max(3, 1) = 3, and the final values of α and β at node F are [α = 3, β = +∞], as shown in
figure 21(e).
[Fig 21(e): node A (MAX) with α = 3, β = +∞ and value 3; node B (MIN) with α = −∞, β = 3 and
value 3; node C (MIN) with α = 3, β = +∞; node F (MAX) with α = 3, β = +∞ and value 1;
terminal nodes at the bottom level.]
Step 6: We backtrack from node F to C. While backtracking the tree, the node value of F (= 1) is
passed to the upper node C. Now the value of node C = node F = 1.
Since C is at a MIN level, only the β value can change. Previously, at node C, [α = 3, β = +∞].
Now the old value β = +∞ is compared with the value of node F = 1, that is, β = min(+∞, 1) = 1.
Thus, finally, at node C, [α = 3, β = 1]. Now we check α ≥ β. It is TRUE, so we prune (cut) the
right branch of node C. That is, node G is pruned, and the algorithm stops searching the right
subtree of node C.
Finally, we backtrack from node C to A, and node C returns the value 1 to node A. Since A is a
MAX node, only its α value can change. Previously, at node A, [α = 3, β = +∞]; comparing the
value of node C = 1 with the old value of α at node A gives α = max(3, 1) = 3. Thus, finally, at
node A, [α = 3, β = +∞] and the value of node A = 3. With this, the right subtree of A is also
completed.
The following is the final game tree, showing the nodes which were computed and the nodes which
were pruned (cut) during the alpha-beta search. Here the optimal value for the maximizer is 3, and
three terminal nodes were pruned (9, 7 and 5). The optimal search path is A→B→D→3.
[Fig 21(f): final tree. Node A (MAX) with α = 3, β = +∞ and value 3; node B (MIN) with value 3;
node C (MIN) with α = 3, β = 1 and value 1; MAX nodes D, E, F with terminal children 2, 3 under
D, 5, 9 under E, and 0, 1 under F; the pruned branches are shown cut off.]
Example 2: Consider the following game tree (figure 22), in which the root is a maximizing node and
children are visited from left to right. Find which nodes are pruned by alpha-beta pruning.
Solution:
Step 1: We start the search by setting the initial values for node A as [α = −∞, β = +∞]. We
traverse the nodes in depth-first search (DFS) manner, so the same values of α and β are assigned
to node B: [α = −∞, β = +∞]. Since node B is at a MIN level, only the β value can change at node B.
B first looks at its left child D (a terminal node), which returns the value 3. So we compare the old
value of β at node B with this terminal value 3, that is, β = min(+∞, 3) = 3. The modified values of
α and β at node B are [α = −∞, β = 3].
To decide whether it is worth looking at the rest of B's children, we check α ≥ β. The answer is NO,
since −∞ ≱ 3. So we proceed, and the search is continued for the next child of node B, that is, E.
The terminal value of E is 12. This value is compared with the previous value β = 3, that is,
β = min(3, 12) = 3, so there is no change in the β value, and the current values of α and β at node B
remain [α = −∞, β = 3]. Again we check α ≥ β: the answer is NO, so we proceed to the next child of
node B, that is, F. The terminal value of F is 8; comparing with the previous value β = 3 gives
β = min(8, 3) = 3, so again no change. So, finally, the value of node B = 3 and the values of α and β
at node B are [α = −∞, β = 3], as shown in figure 22(a).
Step 2: We backtrack from node B to A. While backtracking the tree, the node value of B (= 3) is
passed to the upper node A, not the values of α and β. Now the value of node A = node B = 3.
Since A is at a MAX level, only the α value can change. Previously, at node A, [α = −∞, β = +∞];
comparing the value of node B = 3 with the old value of α at node A gives α = max(−∞, 3) = 3.
Thus, at node A, [α = 3, β = +∞] and the value of node A = 3.
To decide whether it is worth looking at the right child of A or not, we check α ≥ β. The answer is
NO, since 3 ≱ +∞. So we proceed, and the search is continued for the right child of node A. We
have now completed the left subtree of A and proceed to the right subtree.
[Figs 22(a) and 22(b): node B (MIN) with α = −∞, β = 3 and value 3; node A (MAX) updated from
α = −∞, β = +∞ to α = 3, β = +∞ with value 3.]
Step 3: Now at node C, we pass the α and β values from the top node A down to node C as
[α = 3, β = +∞]. Check α ≥ β: the answer is NO, so the search continues on the right side. Since C
is at a MIN level, only the β value can change.
We first check the left child (terminal) of node C, that is, G = 2. We compare the old value of β
at node C with this terminal value 2, that is, β = min(+∞, 2) = 2. So the value of node C = 2, and
the modified values of α and β at node C are [α = 3, β = 2]. Before proceeding further, we again
check α ≥ β. The answer is YES, so we prune (cut) the right branches of node C. That is, nodes H
and I are pruned, and the algorithm stops searching the right subtree of node C, as shown in
figure 22(c).
[Fig 22(c): node A (MAX) with α = 3, β = +∞; node B (MIN) with α = −∞, β = 3 and value 3;
node C (MIN) with α = 3, β = 2 and value 2; terminal children D = 3, E = 12, F = 8 under B and
G = 2, H = 15, I = 6 under C, with H and I pruned.]
Step 4: Finally, we backtrack from node C to A. While backtracking in the tree, the node value of
C (= 2) is passed to the upper node A, not the values of α and β. The previous value node A = 3 is
compared with this new value node C = 2; the best value at node A is α = max(3, 2) = 3. Since A
is at a MAX level, only the α value could change, and since max(3, 2) = 3, there is no change in
the α value either.
Thus, finally, the α and β values at node A are [α = 3, β = +∞] and the value of node A = 3. So,
the optimal value for the maximizer is 3, and two terminal nodes were pruned (H and I). The
optimal search path is A→B→D (as shown in figure 22(d)).
[Fig 22(d): final tree. Node A (MAX) with α = 3, β = +∞ and value 3; node B (MIN) with
α = −∞, β = 3 and value 3; node C (MIN) with α = 3, β = 2 and value 2; terminals D = 3, E = 12,
F = 8, G = 2, H = 15, I = 6, with H and I pruned.]
The effectiveness of alpha-beta pruning is highly dependent on the order in which the nodes are
examined. Move ordering is therefore an important aspect of alpha-beta pruning. There are two
extreme types of move ordering:
Worst-case ordering: In some cases, the alpha-beta pruning algorithm does not prune any of the
leaves of the tree and works exactly like the minimax algorithm; it then consumes more time because
of the bookkeeping for the alpha and beta factors. Such an ordering is called a worst ordering. The
time complexity for such an order is O(b^m), where b is the branching factor and m is the depth of
the tree.
Best (ideal) case ordering: The ideal ordering for alpha-beta pruning occurs when a lot of pruning
happens in the tree and the best moves occur at the left side of the tree. Since we apply DFS, it
searches the left of the tree first and can go twice as deep as the minimax algorithm in the same
amount of time. The time complexity for the best-case order is O(b^(m/2)).
Note that pruning does not affect the final result, and good move ordering improves the
effectiveness of pruning. With ideal ordering, the time complexity is O(b^(m/2)).
In alpha-beta pruning:
- The α value can never decrease, and the β value can never increase. Search can be discontinued
at a node if:
- it is a MAX node and α ≥ β (a beta cutoff), or
- it is a MIN node and β ≤ α (an alpha cutoff).
Q.3 Apply the alpha-beta pruning algorithm on the following graph and find which node(s) are
pruned.
Q.4: Consider the following Minimax game tree search (figure 1(a)) in which the root is a
maximizing node and children are visited from left to right.
[Figure 1(a): root A (MAX); below it MIN nodes B, C, D; below them MAX nodes E, F, G, H, I, J,
K; terminal values 4 3 6 2 2 1 4 5 3 1 5 4 7 5.]
(a) Find the value of the root node of the game tree?
(b) Find all the nodes pruned in the tree?
(c) Find the optimal path for the maximizer in a tree?
Q.5: Consider the following Minimax game tree search in which root is maximizing node and
children are visited from left to right. Find what will be the value propagated at the root?
Q.6: Consider the following Minimax game tree search in which root is maximizing node and
children are visited from left to right. Find the value of the root node of the game tree?
Multiple choice Question
Q.7: Consider the following Minimax game tree search in which root is maximizing node and
children are visited from left to right. Find the value of the root node of the game tree?
A. 14 B. 17 C. 111 D. 112
2.8 Summary
Before an AI problem can be solved it must be represented as a state space. Among all
possible states, there are two special states called initial state (the start point) and final
state (the goal state).
A successor function (a set of operators) is used to change the state. It is used to move
from one state to another.
A state space is the set of all possible states of a problem.
A state space essentially consists of a set of nodes representing each state of the problem,
arcs between nodes representing the legal moves from one state to another, an initial
state, and a goal state. Each state space takes the form of a tree or a graph.
The process of searching means a sequence of action that take you from an initial state to
a goal state.
Search is fundamental to the problem-solving process. Search means the problem is
solved by using the rules, in combination with an appropriate control strategy, to move
through the problem space until a path from an initial state to a goal state is found.
A problem space is represented by a directed graph, where nodes represent search
states and paths represent the operators applied to change the state.
In general, a state space is represented by a 4-tuple as follows: Ss : [S, s0, O, G], where S is the
set of all possible states (possibly infinite); s0 is the start state (initial configuration) of the
problem, s0 ∈ S; O is the set of production rules (or the set of state transition operators) used to
change one state into another, i.e., the set of arcs (or links) between nodes; and G is the set of
goal states.
Adversarial search is a game-playing technique where the agents are surrounded by a
competitive environment. A conflicting goal is given to the agents (multiagent). These
agents compete with one another and try to defeat one another in order to win the game.
Such conflicting goals give rise to the adversarial search.
In a normal search, we follow a sequence of actions to reach the goal or to finish the
game optimally. But in an adversarial search, the result depends on the players, who
together decide the outcome of the game. Each player will try to win the game in the
shortest path and in limited time.
There are 2 types of adversarial search: Minimax Algorithm and Alpha-beta Pruning.
Minimax is a two-player (namely MAX and MIN) game strategy where, if one wins, the
other loses the game. This strategy simulates the games we play in our day-to-day life:
if two persons are playing chess, the result will be in favour of one player and against
the other. MIN decreases the chances of MAX winning the game, while MAX increases
his own chances of winning. They both play the game alternately, turn by turn,
following this strategy: if one wins, the other will definitely lose. Both players look at
one another as competitors and will try to defeat one another, giving their best.
In the minimax strategy, the result of the game, the utility value, is generated by a heuristic
function by propagating values from the leaf nodes up to the root node. It follows
the backtracking technique, backtracking to find the best choice. MAX will choose
the path which increases its utility value, and MIN will choose the opposite path,
which helps it to minimize MAX's utility value.
The drawback of minimax strategy is that it explores each node in the tree deeply to
provide the best path among all the paths. This increases its time complexity.
If b is the branching factor and d is the depth of the tree, then the time complexity of the
MINIMAX algorithm is O(b^d), i.e., exponential.
Alpha-beta pruning is an advanced version of the MINIMAX algorithm. Alpha-beta
pruning reduces the drawback of the minimax strategy by exploring fewer nodes of the
search tree.
The alpha-beta pruning method cut-off the search by exploring a smaller number of
nodes. It makes the same moves as a minimax algorithm does, but it prunes the unwanted
branches using the pruning technique.
Alpha-beta pruning works on two threshold values, i.e., α (alpha) and β (beta). α is the
best (highest) value that the MAX player can have; its initial value is set to negative
infinity, α = −∞, and as the algorithm progresses its value may change, finally reaching
the best (highest) value. β is the best (lowest) value that the MIN player can have; its
initial value is set to positive infinity, β = +∞, and as the algorithm progresses its value
may change, finally reaching the best (lowest) value.
So, each MAX node has an α value, which never decreases, and each MIN node has a β
value, which never increases. The main condition required for alpha-beta pruning
is α ≥ β: if α ≥ β, then prune (cut) the branches; otherwise the search is continued.
Two parameters are defined for alpha-beta pruning, namely alpha (α) and beta (β). The
initial values are set to α = −∞ and β = +∞. As the algorithm progresses, their values
change accordingly. Note that in alpha-beta pruning, at any node in the tree, if α ≥ β,
then we prune (cut) the next branch; otherwise the search is continued.
The effectiveness of Alpha-beta pruning is highly dependent on the order in which each
node is examined.
Worst-case ordering: In some cases, the alpha-beta pruning algorithm does not prune
any of the leaves of the tree and works exactly like the minimax algorithm. In this case,
the time complexity is O(b^m), where b is the branching factor and m is the depth of
the tree.
Best (ideal) case ordering: The ideal ordering for alpha-beta pruning occurs when a lot
of pruning happens in the tree and the best moves occur at the left side of the tree. The
time complexity for the best-case order is O(b^(m/2)).
2.9 Solutions/Answers
Answer1: A number of factors need to be taken into consideration when developing a statespace
representation. Factors that must be addressed are:
Best solution vs. Good enough solution - For some problems a good enough solutionis
sufficient. For example: theorem proving eight squares.
However, some problems require a best or optimal solution, e.g., the traveling salesman
problem.
Answer 2:
State: (#M, #C, 0/1)
where #M represents the number of missionaries on the left bank (i.e., the left side of the river),
#C represents the number of cannibals on the left bank (i.e., the left side of the river), and
0/1 indicates the position of the boat: 0 indicates the boat is on the left side of the river and
1 indicates it is on the right side.
Start state: (3,3,0)
Operators: The state is changed by moving missionaries and/or cannibals from one side to the
other using the boat, so a move can be represented by the number of persons of each kind carried
across. Note that the boat carries at most 2 persons.
Boat carries: (1,0), (0,1), (1,1), (2,0) or (0,2), where in (i, j), i represents the number of
missionaries and j the number of cannibals.
The legal moves in this state space involve moving one ring from one pole to another, moving one ring at
a time, and ensuring that a larger ring is not placed on a smaller ring.
Answer 4:
a) If the queen can be placed safely in this row, then mark this [row, column] as part of the solution and recursively check whether placing a queen here leads to a solution.
b) If placing the queen in [row, column] leads to a solution, then return true.
c) If placing the queen doesn't lead to a solution, then unmark this [row, column] (backtrack) and go to step (a) to try other rows.
4) If all rows have been tried and nothing worked, return false to trigger backtracking.
State-space tree for the 4-Queens problem
A state-space tree (SST) can be constructed to show the solution to this problem. The following SST (figure 1) shows one possible solution {x1, x2, x3, x4} = {2, 4, 1, 3} for the 4-Queens problem.
[Figure 1: state-space tree for the 4-Queens problem, explored by depth-first search; the path x1 = 2, x2 = 4, x3 = 1, x4 = 3 leads to the solution, and dead (non-promising) nodes are marked B.]
B denotes a dead node (non-promising node). Figure 1 shows the implicit tree for the 4-Queens problem for the solution <2,4,1,3>. The root represents the initial state, and the nodes reflect the specific choices made for the components of a solution. The state-space tree is explored using depth-first search: DFS "prunes" non-promising nodes, i.e., it stops exploring the subtree rooted at a node leading to no solution and then backtracks to its parent node.
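The backtracking procedure just described can be sketched in a few lines of Python (the board representation, with one queen per column, and the helper names are illustrative):

# Illustrative backtracking solver for the N-Queens problem.
# solution[c] holds the row chosen for the queen in column c.

def is_safe(solution, col, row):
    # A placement is safe if no earlier queen shares the row or a diagonal.
    for c in range(col):
        if solution[c] == row or abs(solution[c] - row) == abs(c - col):
            return False
    return True

def solve(n, col=0, solution=None):
    if solution is None:
        solution = [0] * n
    if col == n:                      # all queens placed: a solution is found
        return solution[:]
    for row in range(n):              # try each row in the current column
        if is_safe(solution, col, row):
            solution[col] = row       # mark [row, col] as part of the solution
            result = solve(n, col + 1, solution)
            if result:                # placing here led to a full solution
                return result
            # otherwise unmark (backtrack) and try the next row
    return None                       # no row worked: trigger backtracking

print(solve(4))   # [1, 3, 0, 2], i.e. rows 2, 4, 1, 3 in 1-based numbering

Note how the returned placement corresponds to the solution <2,4,1,3> of the state-space tree above.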
Answer 1: Alpha-beta pruning is a way of finding the optimal minimax solution while avoiding searching subtrees of moves which won't be selected. The effectiveness of alpha-beta pruning is highly dependent on the order in which the nodes are examined. The ideal ordering for alpha-beta pruning occurs when a large amount of pruning happens in the tree and the best moves occur on the left side of the tree. Since we apply DFS, the search first explores the left of the tree and can go roughly twice as deep as the minimax algorithm in the same amount of time. The time complexity for the best-case ordering is O(b^(m/2)) (since we search only the left subtree, not the right subtree).
Answer 2: The final game tree with max and min value at each node is shown in the following
figure.
Answer 3:
Solution: Solve the question as shown in Example 1.
The initial call starts from A. We initially start the search by setting the initial values α = −∞ and β = +∞ for root node A. These values are passed down to subsequent nodes in the tree. At A the maximizer must choose the max of B and C, so A calls B first. At B the minimizer must choose the min of D and E, and hence calls D first. At D, it looks at its left child, which is a leaf node. This node returns a value of 3. Now the value of alpha at D is max(−∞, 3) = 3. To decide whether it is worth looking at its right node or not, it checks the condition α ≥ β. This is false, since α = 3 and β = +∞. So, it continues the search.
D now looks at its right child, which returns a value of 5. At D, alpha = max(3, 5) = 5, so the value of node D is 5. With the value at node D = 5, we move up to node B. Now at node B, the β value is modified as
β = min(+∞, 5) = 5.
B now calls E; we pass the α and β values from the top node B down to node E as [α = −∞, β = 5]. Since node E is at a MAX level, only the α value can change. Now, at E, it first checks the left child (which is a terminal node) with value 6. This node returns a value of 6. The value of α at node E is calculated as α = max(−∞, 6) = 6, so the value of node E is 6 and the modified values of α and β at node E are [α = 6, β = 5]. To decide whether it is worth looking at its right node or not, we check α ≥ β. The answer is YES, since 6 ≥ 5. So, we prune (cut) the right branch of E, as shown in figure (a).
Figure (a) game tree after applying alpha-beta pruning on left side of node A
Similarly, we solve the right subtree of node A [refer to Example 1 and solve the right subtree part]. The final tree, with the value at every node, is shown in figure (b). Thus, finally, the α and β values at node A are [α = 5, β = +∞] and the best value at node A = max(5, 2) = 5. So, the optimal value for the maximizer is 5, and 3 terminal nodes are pruned (9, 0 and −1). The optimal search path is A-B-D (terminal value 5), as shown in figure (b).
Figure (b) Final tree with node value on every node with prune branches
Answer 4: The value at the root (MAX node A) is 4.
[Figure: game tree for Answer 4 annotated with the α and β values at each node; A is a MAX node, B, C and D are MIN nodes, E to K are MAX nodes above the terminal values 4, 3, 6, 2, 2, 1, 4, 5, 3, 1, 5, 4, 7, 5, and the pruned branches are marked.]
Answer 5: 5
Answer 6: 7
3.0 Introduction
3.1 Objectives
3.2 Formulating search in state space
3.2.1 Evaluation of search Algorithm
3.3 Uninformed Search
3.3.1 Breadth-First Search (BFS)
3.3.2 Time and space complexity of BFS
3.3.3 Advantages & disadvantages of BFS
3.3.4 Depth First search (DFS)
3.3.5 Performance of DFS algorithm
3.3.6 Advantages and disadvantages of DFS
3.3.7 Comparison of BFS and DFS
3.4 Iterative Deepening Depth First search (IDDFS)
3.4.1 Time and space complexity of IDDFS
3.4.2 Advantages and Disadvantages of IDDFS
3.5 Bidirectional search
3.6 Comparison of Uninformed search strategies
3.7 Informed (heuristic) search
3.7.1 Strategies for providing heuristics information
3.7.2 Formulation of Informed (heuristic) search problem as state space
3.7.3 Best-First search
3.7.4 Greedy Best first search
3.8 A* Algorithm
3.8.1 Working of A* algorithm
3.8.2 Advantages and disadvantages of A* algorithm
3.8.3 Admissibility properties of A* algorithm
3.8.4 Properties of heuristic algorithm
3.8.5 Results on A* algorithm
3.9 Problem reduction search
3.9.1 Problem definition in AND-OR graph
3.9.2 AO* algorithm
3.9.3 Advantages of AO* algorithm
3.10 Memory Bound heuristic search
3.10.1 Iterative Deepening A* (IDA*)
3.10.2 Working of IDA*
3.10.3 Analysis of IDA*
3.10.4 Comparison of A* and IDA* algorithm
3.11 Recursive Best First search (RBFS)
3.11.1 Advantages and disadvantages of RBFS
3.12 Summary
3.13 Solutions/Answers
3.14 Further readings
3.0 INTRODUCTION
Before an AI problem can be solved it must be represented as a state space. In AI, a wide range of problems can be formulated as search problems. The process of searching means finding a sequence of actions that takes you from an initial state to a goal state, as shown in the following figure 1.
In Unit 2 we have already examined the concept of a state space and the adversarial (game-playing) search strategy. In many applications there might be multiple agents or persons searching for solutions in the same solution space. In adversarial search we need a path of actions towards the winning direction, and to find such a path we need different types of search algorithms.
Search algorithms are one of the most important areas of Artificial Intelligence. This unit will
explain all about the search algorithms in AI which explore the search space to find a solution.
One disadvantage of state space representation is that it is not possible to visualize all states for
a given problem. Also, the resources of the computer system are limited to handle huge state
space representation. But many problems in AI take the form of state-space search.
The states might be legal board configurations in a game, towns and cities in some sort of
route map, collections of mathematical propositions, etc.
The state-space is the configuration of the possible states and how they connect to each other
e.g., the legal moves between states.
When we don't have an algorithm which tells us definitively how to negotiate the state space, we need to search the state space to find an optimal path from a start state to a goal state. We can only decide what to do (or where to go) by considering the possible moves from the current state, and trying to look ahead as far as possible. Chess, for example, is a very difficult state-space search problem.
Searching is the process of looking for the solution of a problem through a set of possibilities (the state space). In general, the searching process starts from the initial state (root node) and proceeds by performing the following steps:
1. Check whether the current state is the goal state or not.
2. Expand the current state to generate new states.
3. Choose one of the newly generated states for search, depending upon the search strategy (for example BFS, DFS etc.).
4. Repeat steps 1 to 3 until the goal state is reached or there are no more states to be expanded.
Evaluation (properties) of search strategies: A search strategy is characterized by the sequence in which nodes are expanded. Search algorithms are commonly evaluated according to the following four criteria, which are used to compare their efficiency.
Completeness: A search algorithm is said to be complete if it guarantees to return a solution, if one exists.
Optimality/Admissibility: If the solution found by an algorithm is guaranteed to be the best solution (lowest path cost) among all solutions, then it is said to be an optimal solution.
Time complexity: A measure of the time an algorithm takes to complete its task, usually measured in terms of the number of nodes expanded during the search.
Space complexity: The maximum storage space required at any point during the search, usually measured in terms of the maximum number of nodes in memory at a time.
Time and space complexity are measured in terms of:
b - max branching factor of the search tree
d - depth of the least-cost solution
m - max depth of the search tree (may be infinity)
In all search algorithms, the order in which nodes are expanded distinguishes them from one another. There are two broad classes of search methods: uninformed (blind) search and informed (heuristic) search.
Uninformed search is also called Brute force search or Blind search or Exhaustive search. It is
called blind search because of the way in which search tree is searched without using any
information about the search space. It is called Brute force because it assumes no additional
knowledge other than how to traverse the search tree and how to identify the leaf nodes and goal
nodes. This search ultimately examines every node in the tree until it finds a goal.
Informed search is also called heuristic (or guided) search. These are search techniques where additional information about the problem is provided in order to guide the search in a specific direction. A heuristic is a method that might not always find the best solution but is guaranteed to find a good solution in reasonable time. By sacrificing completeness, it increases efficiency.
The following table summarizes the differences between uninformed and informed search:
[Table: differences between uninformed and informed search - not reproduced here.]
3.1 OBJECTIVES
After studying this unit, you should be able to:
Differentiate between uninformed and informed search algorithms
Formulate a search problem in the form of a state space
Explain the differences between various uninformed search approaches such as BFS, DFS, IDDFS and Bi-directional search
Evaluate the various uninformed search algorithms with respect to the Time, Space and Optimality/Admissibility criteria
Explain informed search, such as Best-First search and the A* algorithm
Differentiate between the advantages and disadvantages of the heuristic searches: the A* and AO* algorithms
Differentiate between the memory-bound searches: Iterative Deepening A* and Recursive Best-First Search
3.2 Formulating Search in State Space
A state space is a graph (V, E), where V is a set of nodes and E is a set of arcs, and each arc is directed from one node to another node.
V: each node is a data structure that contains a state description plus, optionally, other information such as the parent of the node, the operation used to generate the node from that parent, and other bookkeeping data.
E: each arc corresponds to an applicable action/operation. The source and destination nodes of an arc are called the parent (immediate predecessor) and child (immediate successor) nodes with respect to each other; nodes reachable through a chain of arcs are ancestors (also called predecessors) and descendants (also called successors). Each arc has a fixed, non-negative cost associated with it, corresponding to the cost of the action.
Each node has a set of successor nodes, corresponding to all the operators (actions) that can be applied at the source node's state. Expanding a node means generating its successor nodes and adding them (and the associated arcs) to the state-space graph. One or more nodes may be designated as start nodes. A goal-test predicate is applied to a node to determine whether its associated state is a goal state. A solution is a sequence of operations that is associated with a path in the state space from a start node to a goal node. The cost of a solution is the sum of the arc costs on the solution path.
State-space search is the process of searching through a state space for a solution by making
explicit a sufficient portion of an implicit state-space graph to include a goal node.
Hence, initially V={S}, where S is the start node; when S is expanded, its successors are
generated, and those nodes are added to V and the associated arcs are added to E. This process
continues until a goal node is generated (included in V) and identified (by goal test).
To implement any uninformed search algorithm, we always initialize and maintain a list called OPEN and put the start node of G in OPEN. If, after some time, we find OPEN is empty without having reached a goal node, we terminate with failure. We select a node n from OPEN and, if n ∈ Goal nodes, terminate with success; else we generate the successors of n (using the operators O) and insert them in OPEN. We repeat this process until the search is successful or unsuccessful.
Search strategies differ mainly on how to select an OPEN node for expansion at each step of
search.
A general search algorithm
1. Initialize: Set OPEN = {s}, where s is the start state.
2. Fail: If OPEN = { }, terminate with failure.
3. Select: Select a state, n, from OPEN.
4. Terminate: If n ∈ Goal nodes, terminate with success.
5. Expand: Generate the successors of n using the operators O and insert them in OPEN.
6. Loop: Go to step 2.
The problem with the above search algorithm is that it does not ensure that an already visited node is not revisited; that is, it does not maintain the part of the state space that has already been visited. So we have an extension of the same algorithm in which we save the explicit state space. To save the explicit space, we maintain another list called CLOSED. Thus, to implement any uninformed search algorithm efficiently, the two lists OPEN and CLOSED are used. Now we select a node from OPEN and save it in CLOSED; the CLOSED list keeps a record of the nodes that have been opened. The major difference from the previous algorithm is that when we generate a successor node, we check whether it is already in (OPEN ∪ CLOSED). If it is already in (OPEN ∪ CLOSED) we do not insert it in OPEN again; otherwise we insert it. The following modified algorithm saves the explicit space using the list CLOSED.
Note that initially the OPEN list is initialized with the start state of G (e.g., OPEN = {s}) and the CLOSED list is empty (e.g., CLOSED = {}).
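The loop just described can be sketched in Python. The code below is a minimal illustration in which the insertion strategy is passed in as a parameter, since that is the only point where the uninformed strategies differ (the function names are assumptions, not a standard API):

# Illustrative skeleton of the general search loop with OPEN and CLOSED.
# 'insert' decides where successors go in OPEN, which is what
# distinguishes one uninformed strategy from another.

def general_search(start, is_goal, successors, insert):
    open_list = [start]                       # 1. Initialize: OPEN = {s}
    closed = set()                            #    CLOSED = {}
    while open_list:                          # 2. Fail if OPEN is empty
        n = open_list.pop(0)                  # 3. Select a node from OPEN
        if is_goal(n):                        # 4. Terminate with success
            return n
        closed.add(n)                         #    save the expanded node
        for m in successors(n):               # 5. Expand n
            if m not in closed and m not in open_list:
                insert(open_list, m)          #    strategy-specific insertion
    return None                               # 2. terminate with failure

# BFS inserts at the right end; DFS would insert at the left end.
bfs_insert = lambda open_list, m: open_list.append(m)
dfs_insert = lambda open_list, m: open_list.insert(0, m)

# Example over a small hypothetical graph:
g = {'A': ['B', 'C'], 'B': ['D'], 'C': ['G'], 'D': [], 'G': []}
print(general_search('A', lambda n: n == 'G',
                     lambda n: g[n], bfs_insert))    # prints G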
In any search algorithm, we select a node and generate its successors. Search strategies differ mainly in how they select an OPEN node for expansion at each step of the search; the insertion into or deletion from the OPEN list also depends on the specific strategy. Search algorithms are commonly evaluated according to the following criteria, which measure the performance of a search algorithm:
Time complexity: How long (worst or average case) does it take to find a solution? Usually measured in terms of the number of nodes expanded.
Space complexity: How much space is used by the algorithm? Usually measured in terms of the maximum size that the OPEN list reaches during the search.
The time and space complexity are measured in terms of: b, the branching factor or maximum number of successors of any node; d, the depth of the shallowest goal node (depth of the least-cost solution); and m, the maximum depth (length) of any path in the state space (which may be infinite).
The search process constructs a search tree, where the root is the initial state S, and the leaf nodes are nodes not yet expanded (i.e., they are in the OPEN list) or having no successors (i.e., they are "dead ends"). The search tree may be infinite because of loops, even if the state space is small. Search strategies mainly differ in how they select a node from OPEN. Each node represents a partial solution path (and the cost of that partial solution path) from the start node to the given node. In general, from this node there are many possible paths (and therefore solutions) that have this partial path as a prefix.
All search algorithms are distinguished by the order in which nodes are expanded. There are two broad classes of search methods: uninformed search and heuristic search. Let us first discuss uninformed search.
3.3 Uninformed Search
The following uninformed search strategies will be discussed:
Breadth-first search
Depth-first search
Uniform cost search
Iterative deepening depth-first search
Bidirectional Search
3.3.1 Breadth-First Search (BFS)
It is the simplest form of blind search. In this technique the root node is expanded first, then all its
successors are expanded and then their successors and so on. In general, in BFS, all nodes are
expanded at a given depth in the search tree before any nodes at the next level are expanded. It
means that all immediate children of nodes are explored before any of the children’s children are
considered. The search tree generated by BFS is shown below in Fig 2.
Root
A B C
D E F G
Goal Node
Fig 2 Search tree for BFS
Note that BFS is a brute-force search, so it generates all the nodes for identifying the goal, and note that we are using the convention that the alternatives are tried in left-to-right order. A BFS algorithm uses a queue data structure, which works on the FIFO principle. This queue holds all generated but still unexplored nodes. Please remember that the order in which nodes are placed on the queue for removal and exploration determines the type of search.
We can implement it by using two lists called OPEN and CLOSED. The OPEN list contains
those states that are to be expanded and CLOSED list keeps track of state already expanded.
Here OPEN list is used as a queue.
BFS is effective when the search tree has a low branching factor.
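A minimal Python sketch of BFS with OPEN as a FIFO queue and parent pointers for recovering the path (the adjacency dictionary below is a hypothetical graph, not the exact graph of the figure):

from collections import deque

# Illustrative BFS using OPEN (a FIFO queue) and CLOSED (visited set).
graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['D', 'G'],
         'D': ['F'], 'E': ['H'], 'F': [], 'G': [], 'H': []}

def bfs(start, goal):
    open_list = deque([start])            # OPEN: generated but unexplored
    parent = {start: None}                # parent pointers recover the path
    closed = set()                        # CLOSED: already expanded nodes
    while open_list:
        n = open_list.popleft()           # remove from the front (FIFO)
        if n == goal:                     # goal test on selection
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return path[::-1]             # path from start to goal
        closed.add(n)
        for m in graph[n]:                # expand n; successors go to the back
            if m not in closed and m not in parent:
                parent[m] = n
                open_list.append(m)
    return None                           # OPEN empty: failure

print(bfs('A', 'G'))    # ['A', 'C', 'G']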
Example 1: Consider the following graph and its corresponding state-space tree representation. Note that A is the start state and G is a goal state.
[Figure 4: a graph with start state A and goal state G, and its corresponding state-space tree - not reproduced.]
Step 1: Initially OPEN contains only one node, corresponding to the source state A.
OPEN = {A}, CLOSED = {}
Step 2: A is removed from OPEN. The node is expanded, and its children B and C are generated and placed at the back of OPEN.
OPEN = {B, C}, CLOSED = {A}
Step 3: Node B is removed from OPEN and is expanded. Its children D and E are generated and put at the back of OPEN.
OPEN = {C, D, E}, CLOSED = {A, B}
Step 4: Node C is removed from OPEN and is expanded. Its children D and G are added to the back of OPEN.
OPEN = {D, E, D, G}, CLOSED = {A, B, C}
Step 5: Node D is removed from OPEN. Its children C and F are generated and added to the back of OPEN.
OPEN = {E, D, G, C, F}, CLOSED = {A, B, C, D}
Step 6: Node E is removed from OPEN and expanded.
OPEN = {D, G, C, F}, CLOSED = {A, B, C, D, E}
Step 7: D is expanded, and its children B and F are put in OPEN.
OPEN = {G, C, F, B, F}, CLOSED = {A, B, C, D, E, D}
Step 8: G is selected for expansion. It is found to be a goal node, so the algorithm returns the path A-C-G by following the parent pointers of the node corresponding to G, and then terminates.
OPEN = {C, F, B, F}, CLOSED = {A, B, C, D, E, G}
3.3.2 Time and space complexity of BFS
Time complexity: The root generates b nodes at the first level, each of which generates b more nodes, giving b^2 nodes at the second level, and so on up to b^d nodes at depth d.
[Figure 3: search tree with branching factor b - level 1 has b nodes, level 2 has b^2 nodes, ..., level d has b^d nodes.]
For example, consider a complete search tree of depth 12, where every node at depths 0, 1, ..., 11 has 10 children (branching factor b = 10) and every node at depth 12 has 0 children. Then there are
1 + 10 + 10^2 + 10^3 + ... + 10^12 = (10^13 − 1)/(10 − 1) = O(10^12)
nodes in the complete search tree.
• BFS is suitable for problems with shallow solutions
Space complexity:
BFS has to remember each and every node it has generated (this is the maximum length of the OPEN list). So the space complexity is given by:
1 + b + b^2 + b^3 + ... + b^d = O(b^d).
Performance of BFS:
The time required by BFS for a tree of branching factor b and depth d is O(b^d).
The space (memory) requirement for a tree with branching factor b and depth d is also O(b^d).
The BFS algorithm is complete (if b is finite).
The BFS algorithm is optimal (if the cost is 1 per step).
Disadvantages: BFS has certain disadvantages too. They are given below:
1. The time complexity and space complexity are both O(b^d), i.e., exponential. This is a major hurdle.
2. All nodes are generated in BFS, so even unwanted nodes have to be remembered (stored in the queue), which is of no practical use to the search.
3.3.4 Depth First Search (DFS)
In depth-first search we go as far down as possible into the search tree/graph before backing up and trying alternatives. It works by always generating a descendant of the most recently expanded node, until some depth cut-off is reached, and then backtracks to the next most recently expanded node and generates one of its descendants. So only the path of nodes from the initial node to the current node has to be stored in order to execute the algorithm. For example, consider the following tree and see how the nodes are expanded using the DFS algorithm.
Example 1:
S (Root)
A B
C D E F
(Goal Node)
Fig. 5 Search tree for DFS
After searching root node S, then A and C, the search backtracks and tries another path from A. Nodes are explored in the order S, A, C, D, B, E, F.
Here again we use the list OPEN, this time as a STACK, to implement DFS. If we find that the first element of OPEN is the goal state, then the search terminates successfully. The algorithm is the same as the general search algorithm given earlier, except for the expansion step:
1. Initialize: Set OPEN = {s}, where s is the start state.
2. Fail: If OPEN = { }, terminate with failure.
3. Select: Select the first state, n, from OPEN.
4. Terminate: If n ∈ Goal nodes, terminate with success.
5. Expand: Generate the successors of n and insert them at the left end of OPEN.
6. Loop: Go to step 2.
Note: The only difference between BFS and DFS is in Expand (step 5). In BFS we always insert the generated successors at the right end of the OPEN list, whereas in DFS we insert them at the left end of the OPEN list.
3.3.5 Performance of DFS algorithm
The time required by DFS for a tree of branching factor b and maximum depth m is O(b^m).
The space (memory) requirement for a tree with branching factor b and maximum depth m is only O(bm).
3.3.6 Advantages and disadvantages of DFS
Advantages:
If depth-first search finds a solution without exploring much of the search space, then the time and space it takes will be very small.
The advantage of depth-first search is that its memory requirement is only linear with respect to the search graph. This is in contrast with breadth-first search, which requires much more space.
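For comparison with the BFS sketch above, here is the corresponding DFS sketch; the only essential change is that successors are inserted at the left end of OPEN (the graph is the same hypothetical one):

# Illustrative DFS: identical to the BFS sketch except that OPEN behaves
# as a stack, i.e. successors are inserted at the front (left end) of OPEN.
graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['D', 'G'],
         'D': ['F'], 'E': ['H'], 'F': [], 'G': [], 'H': []}

def dfs(start, goal):
    open_list = [(start, [start])]        # OPEN holds (node, path-so-far)
    closed = set()
    while open_list:
        n, path = open_list.pop(0)        # take from the left end (LIFO)
        if n == goal:
            return path
        if n in closed:
            continue
        closed.add(n)
        # push successors at the left end, preserving left-to-right order
        open_list = [(m, path + [m]) for m in graph[n]
                     if m not in closed] + open_list
    return None

print(dfs('A', 'G'))   # ['A', 'C', 'G'], found after exhausting
                       # the deeper A-B-D-F and A-B-E-H branches first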
Example 2: Consider again the graph of Example 1 above, with start state A and goal state G, and its corresponding state-space tree.
[Figure 6: the graph and its state-space tree - not reproduced.]
Step 1: Initially OPEN contains only the node corresponding to the source state A.
OPEN = {A}, CLOSED = {}
Step 2: A is removed from OPEN. The node is expanded, and its children B and C are generated and placed at the front of OPEN.
OPEN = {B, C}, CLOSED = {A}
Step 3: Node B is removed from OPEN and is expanded. Its children D and E are generated and put at the front of OPEN.
OPEN = {D, E, C}, CLOSED = {A, B}
Step 4: Node D is removed from OPEN and expanded. Its children C and F are added to the front of OPEN.
OPEN = {C, F, E, C}, CLOSED = {A, B, D}
Step 5: Node C is removed from OPEN. Its children G and F are added to the front of OPEN.
OPEN = {G, F, F, E, C}, CLOSED = {A, B, D, C}
Step 6: Node G is expanded and found to be a goal node. The solution path A-B-D-C-G is returned and the algorithm terminates.
3.3.7 Comparison of BFS and DFS
BFS goes level-wise, but requires more space compared to DFS. The space required by DFS is O(d), where d is the depth of the tree, but the space required by BFS is O(b^d).
[Figure: an infinite tree rooted at node 0, with the node to be searched (node 2) close to the root; the path followed by DFS dives deep down the leftmost branch before ever reaching node 2.]
DFS: The problem with this approach is that if there is a node close to the root, but not in the first few subtrees explored by DFS, then DFS reaches that node very late. Also, DFS may not find the shortest path to a node (in terms of the number of edges).
Suppose we want to find node 2 of the given infinite undirected graph/tree. A DFS starting from node 0 will dive left, towards node 1, and so on. An Iterative Deepening Depth-First Search overcomes this and quickly finds the required node.
3.4 Iterative Deepening Depth First Search (IDDFS)
Iterative Deepening Depth First Search (IDDFS) suffers neither the drawbacks of BFS nor those of DFS on trees; it takes the advantages of both strategies. It begins by performing DFS to a depth of zero, then a depth of one, then a depth of two, and so on, until a solution is found or some maximum depth is reached.
It is like BFS in that it explores a complete layer of new nodes at each iteration before going to the next layer; within a single iteration it behaves like DFS.
It is preferred when there is a large search space and the depth of a solution is not known, although it performs wasted computation before reaching the goal depth. Since IDDFS expands all nodes at a given depth before expanding any nodes at a greater depth, it is guaranteed to find a shortest-length (path) solution from the initial state to a goal state.
At any given time it is performing a DFS and never searches deeper than depth d; hence it uses the same space as DFS. The disadvantage of IDDFS is that it performs wasted computation prior to reaching the goal depth.
Algorithm (IDDFS)
Initialize d = 1 /* depth of search tree */, Found = false
While (Found = false) do {
    Perform a depth-first search from the start node to depth d.
    If a goal state is obtained, then Found = true
    else d = d + 1
}
Example: Consider a search tree in which A is the root, B and C are its children, D, E, F, G are at depth 2, and H, I, J, K, L, M, N, O are at depth 3. The iterative deepening search proceeds as follows:
[Figure: IDDFS iterations - with limit = 0 only A is visited; with limit = 1 the search visits A, B, C; with limit = 2 it visits A, B, D, E, C, F, G; with limit = 3 it visits A, B, D, H, I, E, J, K, C, F, L, M, G, N, O.]
3.4.1 Time and space complexities of IDDFS
The time and space complexities of the IDDFS algorithm are O(b^d) and O(d) respectively.
It can be shown that depth first iterative deepening is asymptotically optimal, among brute
force tree searches, in terms of time, space and length of the solution. In fact, it is linear in its
space complexity like DFS, and is asymptotically optimal to BFS in terms of the number of
nodes expanded.
Please note that, in general, iterative deepening is the preferred uninformed search method when there is a large search space and the depth of the solution is unknown. Also note that iterative deepening search is analogous to BFS in that it explores a complete layer of nodes before going to the next layer.
Advantages:
1. It combines the benefits of BFS and DFS search algorithms in terms of fast search and
memory efficiency.
2. It is guaranteed to find a shortest path solution.
3. It is the preferred uninformed search method when the search space is large and the depth of the solution is not known.
Disadvantages
1. The main drawback of IDDFS is that it repeats all the work of the previous phase; that is, it performs wasted computation before reaching the goal depth.
2. The time complexity is O(b^d), i.e., still exponential.
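A minimal Python sketch of IDDFS, assuming a small hypothetical tree stored as an adjacency dictionary:

# Illustrative IDDFS: repeated depth-limited DFS with an increasing bound.
graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
         'D': [], 'E': [], 'F': [], 'G': []}

def depth_limited_dfs(node, goal, limit, path):
    if node == goal:
        return path
    if limit == 0:                     # depth bound reached: cut off
        return None
    for child in graph[node]:
        found = depth_limited_dfs(child, goal, limit - 1, path + [child])
        if found:
            return found
    return None

def iddfs(start, goal, max_depth=20):
    for d in range(max_depth + 1):     # d = 0, 1, 2, ... until goal or bound
        result = depth_limited_dfs(start, goal, d, [start])
        if result:                     # goal found at the shallowest depth d
            return result
    return None

print(iddfs('A', 'G'))   # ['A', 'C', 'G']

Only the current path is kept in memory at any time, which is where the O(d) space bound comes from.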
3.5 Bidirectional Search
This search is used when a problem has a single goal state that is given explicitly and all the node-generation operators have inverses. It is used to find the shortest path from an initial node to a goal node, returning the path itself rather than just the goal. It works by searching forward from the initial node and backward from the goal node simultaneously, hoping that the two searches meet in the middle:
Check at each stage whether the nodes of one search have been generated by the other, i.e., whether they meet in the middle.
If so, the concatenation of the two paths is the solution.
Thus, the bidirectional search algorithm is applicable when generating predecessors is easy in both the forward and backward directions and there exist only one or a few goal states. The following figure illustrates how bidirectional search is executed.
[Figure 7: Bidirectional search - the root node 1 and the goal node 16 are searched simultaneously; the forward search expands nodes 1, 4, 8, 9 and the backward search expands nodes 16, 12, 10, 9, meeting at the intersection node 9.]
Fig 7 Bidirectional search
We have node 1 as the start/root node and node 16 as the goal node. The algorithm divides the search tree into two sub-trees: from start node 1 we do a forward search and, at the same time, we do a backward search from goal node 16. The forward search traverses nodes 1, 4, 8 and 9, whereas the backward search traverses nodes 16, 12, 10 and 9. We see that the forward and backward searches meet at node 9, called the intersection node. So the path traced by the forward search, concatenated with the path traced by the backward search, is the optimal solution. This is how the bidirectional search algorithm is implemented.
Advantages:
Since bidirectional search can use various techniques like DFS, BFS, depth-limited search (DLS) etc., it is efficient and requires less memory.
Disadvantages:
Implementation of the bidirectional search tree is difficult.
In bidirectional search, one should know the goal state in advance.
Practically inefficient due to additional overhead to perform insertion operation at each
point of search.
Time complexity:
The total number of nodes expanded in bidirectional search is 2b^(d/2) = O(b^(d/2)), where b is the branching factor and d is the depth of the shallowest goal node.
Complete? Yes
Time complexity: O(b^(d/2))
Space complexity: O(b^(d/2))
Optimal: Yes (if the step cost is uniform in both the forward and backward directions)
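The meeting-in-the-middle idea can be sketched as two breadth-first frontiers expanded in alternation; the graph below is a hypothetical one wired to mimic figure 7 (forward search through 1, 4, 8 and backward search through 16, 12, 10, meeting at 9):

# Illustrative bidirectional search on an undirected graph: two BFS
# frontiers, one from the start and one from the goal, expanded in
# alternation until they share a node (the intersection node).
graph = {1: [2, 4], 2: [3], 3: [6], 6: [5], 4: [8],
         8: [9], 9: [10], 10: [12], 12: [16]}

# Build an undirected adjacency map so the backward search can run too.
adj = {}
for u, vs in graph.items():
    for v in vs:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

def bidirectional(start, goal):
    f_seen, b_seen = {start}, {goal}            # forward / backward visited
    f_front, b_front = {start}, {goal}          # current frontiers
    while f_front and b_front:
        if f_seen & b_seen:
            return (f_seen & b_seen).pop()      # intersection node
        f_front = {v for u in f_front for v in adj.get(u, ())} - f_seen
        f_seen |= f_front                       # expand forward one level
        if f_seen & b_seen:
            return (f_seen & b_seen).pop()
        b_front = {v for u in b_front for v in adj.get(u, ())} - b_seen
        b_seen |= b_front                       # expand backward one level
    return None

print(bidirectional(1, 16))   # 9, the intersection node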
3.6 Comparison of Uninformed search strategies
The following table 1 compares the efficiency of the uninformed search algorithms on the measures used to evaluate the performance of search algorithms:
[Table 1: comparison of uninformed search strategies - not reproduced here.]
[Check Your Progress - Q.1 to Q.3 are exercises based on figures not reproduced here: an 8-puzzle configuration with its goal state, a search tree rooted at A, and a graph in which A is the start state and G is the final or goal state to be searched.]
Q.4 Compare the uninformed search algorithms with respect to time, space, optimality and completeness.
3.7 Informed (Heuristic) Search
Heuristic information can be used to make a search more informed in three ways:
Deciding which node to expand next, instead of doing the expansion in a strictly breadth-first or depth-first order;
In the course of expanding a node, deciding which successor or successors to generate, instead of blindly generating all possible successors at one time;
Deciding that certain nodes should be discarded, or pruned, from the search space.
Informed search algorithms use domain knowledge: in an informed search, problem information is available which can guide the search. Informed search strategies can find a solution more efficiently than an uninformed search strategy. Informed search is also called heuristic search.
A heuristic is guesswork, or additional information about the problem. It may miss the solution if a wrong heuristic is supplied; however, in almost all problems, with correct heuristic information it provides a good solution in reasonable time. Informed search can solve much more complex problems than could be solved in any other way. We have the following informed search algorithms:
1. Best-First Search
2. A* algorithm
3. Iterative Deepening A*
Informed search algorithms are more useful for large search spaces. All informed search algorithms use the idea of a heuristic, so informed search is also called heuristic search.
"Heuristics are criteria, methods or principles for deciding which among several alternative courses of action promises to be the most effective in order to achieve some goal."
Heuristic function:
It takes the current state of the agent as its input and produces an estimate of how close the agent is to the goal. A heuristic function estimates how close a state is to the goal; it is represented by h(n). The heuristic method might not always give the best solution, but it is guaranteed to find a good solution in reasonable time.
An informed search defines a heuristic function, h(n), that estimates the "goodness" of a node n. The heuristic function is an estimate, based on domain-specific information computable from the current state description, of how close we are to a goal. Specifically,
h(n) = estimated cost (or distance) of the minimal-cost path from state n to a goal state.
In other words, a heuristic function at a node n is an estimate of the optimum cost from the current node to a goal node, denoted by h(n):
h(n) = estimated cost of the cheapest path from node n to a goal node.
For example, suppose you want to find the shortest path from Kolkata to Guwahati; then a heuristic for Guwahati may be the straight-line distance between Kolkata and Guwahati.
3.7.2 Formulation of Informed (heuristic) search problem as state space
[Figure: an agent is transformed from the initial state to the goal state through a sequence of actions.]
We need to find a sequence of actions which transforms the agent from the initial state s to the goal state G. A state space is commonly defined as a directed graph, or a tree, in which each node is a state and each arc represents the application of an operator transforming a state to a successor state. Thus, the problem is solved by using the rules (operators), in combination with an appropriate control strategy, to move through the problem space until a path from the initial state to a goal state is found. This process is known as search. A solution path is a path in the state space from s (the initial state) to G (the goal state).
We have already seen that an OPEN list is used to implement an uninformed (blind) search (section 3.2). The problem with using only the one list OPEN is that it is not possible to keep track of the nodes which have already been visited, i.e., to maintain the part of the state space that has already been visited. To save the explicit space, we maintain another list called CLOSED: we select a node from OPEN and save it in CLOSED. When we generate a successor node, we check whether it is already in (OPEN ∪ CLOSED). If it is already in (OPEN ∪ CLOSED) we do not insert it in OPEN again; otherwise we insert it.
3.7.3 Best-First Search
Best-first search uses an evaluation function f(n) that gives, for each node, an indication of which node to expand next. Every node in the search space has an evaluation (heuristic) function associated with it; the heuristic value h(n) of a node indicates how far the node is from the goal node. Note that the evaluation function is a heuristic cost function (in the case of a minimization problem) or an objective function (in the case of maximization). The decision of which node to expand depends on the value of the evaluation function: the evaluation value is the cost/distance of the current node from the goal node, and for the goal node the evaluation function value is 0.
Based on the evaluation function, f(n), Best-first search can be categorized into the following
categories:
1) Greedy Best first search
2) A* search
The following two lists (OPEN and CLOSED) are maintained to implement these two algorithms:
1. OPEN - all those nodes that have been generated and have had the heuristic function applied to them, but have not yet been examined.
2. CLOSED - contains all nodes that have already been examined.
3.7.4 Greedy Best-First Search
The greedy best-first search algorithm always selects the path which appears best at that moment. It is a combination of depth-first search and breadth-first search, guided by the heuristic function. Best-first search allows us to take the advantages of both algorithms: with its help, at each step, we can choose the most promising node. In the greedy best-first search algorithm, we expand the node which is closest to the goal node, where the closeness is estimated by the heuristic function alone, i.e.,
f(n) = h(n).
The greedy best-first algorithm is implemented using a priority queue (ordered by the heuristic function value).
Best first search can switch between BFS and DFS, thus gaining the advantages of both
the algorithms.
Consider the following example for a better understanding of the greedy best-first search algorithm.
Example 1: Consider the following graph with heuristic function values h(n), which illustrates greedy best-first search. Note that A is the start node and G is the goal node.
[Figure: graph on nodes A, B, C, D, E, F, G, H with edge lengths, together with the heuristic value h(n) from each node n to the goal node G - not reproduced.]
The nodes added to and deleted from the OPEN and CLOSED lists using the best-first search algorithm are shown below:
OPEN            CLOSED
[A]             []
[C, B, D]       [A]
[F, E, B, D]    [A, C]
[G, E, B, D]    [A, C, F]
[E, B, D]       [A, C, F, G]
[Example 2: a graph with heuristic values at each node - figure not reproduced.]
Time complexity: The worst-case time complexity of greedy best-first search is O(b^m), where m is the maximum depth of the search space.
Space complexity: The worst-case space complexity of greedy best-first search is O(b^m), where m is the maximum depth of the search space.
Complete: Greedy best-first search is incomplete, even if the given state space is finite.
Example 3: Apply the greedy best-first search algorithm on the following graph (L is the goal node).
[Figure: graph with start node S and goal node L; the evaluation (heuristic) value of each node is shown - S has children A (h = 2), B (h = 6) and C (h = 5); A has children D (h = 10) and E (h = 8); B has children F (h = 13) and G (h = 14); C has child H (h = 7); H has children I (h = 5) and J (h = 6); I has children K (h = 1), L (h = 0) and M (h = 2).]
Working: We start with the start node S. Now, S has three children, A, B and C, with heuristic function values 2, 6 and 5 respectively. These values show approximately how far they are from the goal node. So we write the children of S as (A:2), (B:6), (C:5).
Out of these, the node with the minimum value is (A:2). So we select A, and its children are explored (generated): (D:10) and (E:8).
The search process now has four nodes to search, namely (B:6), (C:5), (D:10) and (E:8). Out of these, node C has the minimal value of 5, so we select and expand it, getting (H:7) as its child. Now the nodes to search are (B:6), (D:10), (E:8) and (H:7), and so on.
Step | Node expanded | Children (on expansion) | Available nodes (to search) | Node chosen
1 | S | (A:2), (B:6), (C:5) | (A:2), (B:6), (C:5) | (A:2)
2 | A | (D:10), (E:8) | (B:6), (C:5), (D:10), (E:8) | (C:5)
3 | C | (H:7) | (B:6), (D:10), (E:8), (H:7) | (B:6)
4 | B | (F:13), (G:14) | (D:10), (E:8), (H:7), (F:13), (G:14) | (H:7)
5 | H | (I:5), (J:6) | (D:10), (E:8), (F:13), (G:14), (I:5), (J:6) | (I:5)
6 | I | (K:1), (L:0), (M:2) | (D:10), (E:8), (F:13), (G:14), (J:6), (K:1), (L:0), (M:2) | Goal node (L:0) is found, so the search stops.
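The expansion order of Example 3 can be reproduced with a short Python sketch in which OPEN is a priority queue keyed on h(n) alone:

import heapq

# Illustrative greedy best-first search: OPEN is a priority queue ordered
# by the heuristic value h(n) alone, i.e. f(n) = h(n).
# The graph and h-values follow Example 3 above.
graph = {'S': ['A', 'B', 'C'], 'A': ['D', 'E'], 'B': ['F', 'G'],
         'C': ['H'], 'H': ['I', 'J'], 'I': ['K', 'L', 'M'],
         'D': [], 'E': [], 'F': [], 'G': [],
         'J': [], 'K': [], 'L': [], 'M': []}
h = {'S': 9, 'A': 2, 'B': 6, 'C': 5, 'D': 10, 'E': 8, 'F': 13,
     'G': 14, 'H': 7, 'I': 5, 'J': 6, 'K': 1, 'L': 0, 'M': 2}

def greedy_best_first(start, goal):
    open_list = [(h[start], start, [start])]   # (h-value, node, path)
    closed = set()
    while open_list:
        _, n, path = heapq.heappop(open_list)  # node with minimum h(n)
        if n == goal:
            return path
        closed.add(n)
        for m in graph[n]:
            if m not in closed:
                heapq.heappush(open_list, (h[m], m, path + [m]))
    return None

print(greedy_best_first('S', 'L'))   # ['S', 'C', 'H', 'I', 'L']

(The value h(S) = 9 is an assumption for the start node, since the figure does not constrain it; it never affects the result because S is expanded first.)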
3.8 A* Algorithm
A* search is the most commonly known form of best-first search. It uses the heuristic function h(n) and the cost g(n) to reach node n from the start state. It combines features of uniform cost search (UCS) and greedy best-first search, by which it solves problems efficiently. The A* search algorithm finds the shortest path through the search space using the heuristic function; it expands a smaller search tree and provides an optimal result faster.
In the A* search algorithm we use the search heuristic as well as the cost to reach the node. Hence we can combine both costs as follows; this sum is called the fitness number (evaluation function).
f(n) = g(n) + h(n)
where g(n) is the cost to reach node n from the start state, h(n) is the estimated cost to reach a goal node from node n, and f(n) is the estimated cost of the cheapest solution through n.
If h(n) is admissible, then the search will find an optimal solution. Admissible means that it underestimates the cost of any solution which can be reached from a node; in other words, a heuristic is called admissible if it always under-estimates, that is, we always have h(n) ≤ h*(n), where h*(n) denotes the minimum distance to a goal state from state n.
A* search begins at the root node, and the search continues by next visiting the node which has the least evaluation value f(n). It evaluates nodes using the evaluation function
f(n) = h(n) + g(n) = estimated cost of the cheapest solution through n,
where
g(n) is the actual shortest distance travelled from the initial node to the current node - it helps avoid expanding paths that are already expensive, and
h(n) is the estimated (or "heuristic") distance from the current node to the goal - it estimates which node is closest to the goal node.
Nodes are visited in this manner until a goal is reached.
Suppose s is the start state; then the calculation of the evaluation function f(n) for any node n is shown in figure 10.
[Figure 10: a path from the start node s to node n of cost g(n), with an estimated remaining cost h(n) from n to the goal, giving f(n) = g(n) + h(n).]
Algorithm A*
In step 5 we generate the successors of n; for each successor m, if it does not belong to OPEN or CLOSED, that is m ∉ [OPEN ∪ CLOSED], then we insert it in OPEN with the cost g(n) + C(n, m), i.e., the cost up to n plus the additional cost from n to m.
If m ∈ [OPEN ∪ CLOSED], then we set g(m) to the minimum of the original cost and the new cost [g(n) + C(n, m)]: if we arrive at some state by another path which has less cost than the original one, then we replace the existing cost with this minimum cost.
If we find that f(m) has decreased (if it is larger we ignore it) and m ∈ CLOSED, then we move m from CLOSED back to OPEN.
Note that the implementation of the A* algorithm involves maintaining two lists, OPEN and CLOSED. The list OPEN contains those nodes that have been evaluated by the heuristic function but have not yet been expanded into successors, and the list CLOSED contains those nodes that have already been visited.
3.8.1 Working of A* algorithm
See the following steps for the working of the A* algorithm:
Step 1: Define a list OPEN. Initially, OPEN consists of a single node, the start node S.
Step 2: If the list is empty, return failure and exit.
Step 3: Remove the node n with the smallest value of f(n) from OPEN and move it to the list CLOSED. If node n is a goal state, return success and exit.
Step 4: Expand node n.
Step 5: If any successor of n is the goal node, return success and report the solution by tracing the path from the goal node to S. Otherwise, go to Step 6.
Step 6: For each successor node, apply the evaluation function f to the node. If the node has not been in either list, add it to OPEN.
Step 7: Go back to Step 2.
Example 1: Let us consider the following graph to understand the working of the A* algorithm. The numbers written on the edges represent the distances between the nodes, and the numbers written on the nodes represent the heuristic values. Find the most cost-effective path from the start state A to the final state J using the A* algorithm.
[Figure: weighted graph on nodes A, B, C, D, E, F, G, H, I, J with edge distances and heuristic values at the nodes - not reproduced.]
Step 1: We start with node A. Node B and node F can be reached from node A. The A* algorithm calculates f(B) and f(F) with the estimated cost f(n) = g(n) + h(n):
f(B) = 6 + 8 = 14
f(F) = 3 + 6 = 9
[The remaining steps of this example follow the same pattern and are not reproduced here.]
Example 2: For the following graph (figure not reproduced) with start node S and goal node G, the heuristic values are:
State  h(n)
S      5
A      3
B      4
C      2
D      6
G      0
Solution:
S→A = 1 + 3 = 4
S→G = 10 + 0 = 10
S→A→B = 1 + 2 + 4 = 7
S→A→C = 1 + 1 + 2 = 4
S→A→C→D = 1 + 1 + 3 + 6 = 11
S→A→C→G = 1 + 1 + 4 = 6
S→A→B→D = 1 + 2 + 5 + 6 = 14
S→A→C→D→G = 1 + 1 + 3 + 2 = 7
S→A→B→D→G = 1 + 2 + 5 + 2 = 10
The path with the lowest cost is S→A→C→G, with cost 6.
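The enumeration above fixes the edge costs of the graph, so the search can be checked with a short A* sketch in Python (the adjacency dictionary is reconstructed from the listed path costs and should be treated as illustrative):

import heapq

# Illustrative A* sketch: OPEN is a priority queue ordered by
# f(n) = g(n) + h(n). 'graph' stores edge costs; 'h' is the admissible
# heuristic of the example above.
graph = {'S': {'A': 1, 'G': 10}, 'A': {'B': 2, 'C': 1},
         'B': {'D': 5}, 'C': {'D': 3, 'G': 4}, 'D': {'G': 2}, 'G': {}}
h = {'S': 5, 'A': 3, 'B': 4, 'C': 2, 'D': 6, 'G': 0}

def a_star(start, goal):
    open_list = [(h[start], 0, start, [start])]   # (f, g, node, path)
    best_g = {start: 0}
    while open_list:
        f, g, n, path = heapq.heappop(open_list)  # node with least f(n)
        if n == goal:
            return path, g                        # optimal path and its cost
        for m, cost in graph[n].items():
            g2 = g + cost                         # g(m) when reached via n
            if m not in best_g or g2 < best_g[m]: # keep only improved costs
                best_g[m] = g2
                heapq.heappush(open_list, (g2 + h[m], g2, m, path + [m]))
    return None

print(a_star('S', 'G'))   # (['S', 'A', 'C', 'G'], 6)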
3.8.2 Advantages and disadvantages of A* algorithm
Advantages:
The A* search algorithm performs better than other search algorithms.
The A* search algorithm is optimal and complete.
This algorithm can solve very complex problems.
Disadvantages:
It does not always produce the shortest path, as it is partly based on heuristics and approximation.
The A* search algorithm has some complexity issues.
The main drawback of A* is its memory requirement: it keeps all generated nodes in memory, so it is not practical for various large-scale problems.
In other words, if the heuristic function h always underestimates the true cost h* (that is, the heuristic cost h(n) is smaller than the true cost h*(n)), then A* is guaranteed to find an optimal solution. If there is a path from s to a goal state, A* terminates (even when the state space is infinite). Algorithm A* is admissible; that is, if there is a path from s to a goal state, A* terminates by finding an optimal path. If we are given two or more admissible heuristics, we can take their max to get a stronger admissible heuristic.
Admissibility condition:
By an admissible algorithm we mean that the algorithm is sure to find the most optimal solution if one exists. Please note that this is possible only when the evaluation function value never overestimates the distance of the node to the goal. Also note that if the evaluation function value, which is a heuristic, is exactly the same as the distance of the node to the goal, then this algorithm will immediately give the solution. For example, the A* algorithm discussed above is admissible. There are three conditions to be satisfied for A* to be admissible. They are as follows:
1. Each node in the graph has a finite number of successors (possibly zero).
2. All arcs in the graph have costs greater than some positive amount (say C).
3. For each node n in the graph, h'(n) ≤ h(n), where h'(n) is the heuristic estimate and h(n) is the true cost from n to the goal.
This implies that the heuristic guess of the cost of getting from node n to the goal is never an overestimate; this is known as the heuristic condition. Only if these three conditions are satisfied is A* guaranteed to find an optimal (least-cost) path. Please note that the A* algorithm is admissible because on such a path h'(n) is always less than or equal to h(n); this is possible only when the evaluation function value never overestimates the distance of the node to the goal. Although the admissibility condition requires h'(n) to be a lower bound on h(n), it is expected that the more closely h'(n) approaches h(n), the better is the performance of the algorithm.
If h'(n) = h(n), an optimal solution path would be found without ever expanding a node off the path (we assume that one optimal solution exists). If h'(n) = 0, then A* reduces to the blind uniform-cost (or breadth-first) algorithm.
Please note that admissible heuristics are by nature optimistic, because they think that the cost of solving the problem is less than it actually is; g(n) is the exact cost of reaching n. Also note that f(n) should never overestimate the true cost of a solution through n.
For example, consider a network of roads and cities, with roads connecting the cities. Our problem is to find a path between two cities such that the mileage/fuel cost is minimal. An admissible heuristic would be to use the straight-line (air) distance to estimate the cost from a given city to the goal city. Naturally, the air distance will either be equal to the real distance or underestimate it, i.e., h'(n) ≤ h(n).
3.8.5 Results on A* algorithm
1. A* is admissible: algorithm A* is admissible; that is, if there is a path from S to a goal state, A* terminates by finding an optimal solution.
2. A* is complete: if there is a path from S to a goal state, A* terminates (even when the state space is infinite).
3. Dominance property: if A1 and A2 are two admissible versions of A* such that A1 is more informed than A2, then A2 expands at least as many states as A1 does (so A1 dominates A2, because it has a better heuristic than A2). If we are given two or more admissible heuristics, we can take their max to get a stronger admissible heuristic.
3.9 Problem Reduction Search
Problem reduction search is broadly defined as planning how best to solve a problem that can be recursively decomposed into sub-problems in multiple ways. There may be many ways to decompose a problem; we have to find the best decomposition, the one for which the cost of the search is minimum.
We already know the divide-and-conquer strategy: a solution to a problem can be obtained by decomposing it into smaller sub-problems. Each of these sub-problems can then be solved to get a sub-solution, and the sub-solutions can be recombined to get a solution as a whole. This is called problem reduction. This method generates arcs which are called AND arcs. One AND arc may point to any number of successor nodes, all of which must be solved for the arc to point to a solution.
When a problem can be divided into a set of sub problems, where each sub problem can be
solved separately and a combination of these will be a solution, AND-OR graphs or AND - OR
trees are used for representing the solution. The decomposition of the problem or problem
reduction generates AND arcs. Consider the following example to understand the AND-OR
graph (figure-11).
[Figure 11: an AND-OR graph - an OR choice between 'Steal a bike' and the AND combination of 'Get some money' and 'Buy a bike'.]
Fig 11 AND-OR graph
Figure 11 shows an AND-OR graph. In an AND-OR graph, an OR node represents a choice between possible decompositions, and an AND node represents a given decomposition. For example, to get a bike we have two options: either
1. steal a bike,
OR
2. get some money AND buy a bike.
In this graph we are given two choices: steal a bike, or get some money AND buy a bike. When we have more than one choice and we have to pick one, we apply the OR condition to choose one (that is what we did here). The arc joining branches denotes the AND condition: here the 'get some money' and 'buy a bike' branches are joined by an arc because, by getting some money, the possibility of buying a bike is greater than that of stealing one.
The AO* search algorithm is based on the AND-OR graph, hence its name. The AO* algorithm is basically based on problem decomposition (breaking the problem down into small pieces). The main difference between the A* (A star) and AO* (AO star) algorithms is that the A* algorithm is an OR-graph algorithm used to find a single solution (either this or that), whereas the AO* algorithm is an AND-OR graph algorithm used to find more than one solution by ANDing more than one branch. A* guarantees an optimal solution, while AO* does not always do so, since AO* does not explore all other solutions once it has found one.
Given [G, s, T]
where G: an implicitly specified AND/OR graph,
s: the start node of the AND/OR graph,
T: the set of terminal nodes (called SOLVED),
h(n): a heuristic function estimating the cost of solving the sub-problem at n.
Example 1:
Let us see one example with a heuristic value present at every node (see fig 12). The estimated heuristic value is given at each node; the heuristic value h(n) at any node indicates that from this node at least h(n) cost is required to find a solution. Here we assume the edge cost (i.e., the g(n) value) for each edge is 1. Remember that at an OR node we always mark the successor node which indicates the best path to a solution.
[Figure 12: AND-OR graph with a heuristic value at each node - start node S; S-A-B is an AND arc with h(A) = 7 and h(B) = 12, and S-C is a single arc with h(C) = 13; A is an OR node with successors D (h = 5) and E (h = 6); C is an AND node with successors F (h = 5) and G (h = 7); D has the single successor H (h = 2). All edge costs are 1.]
Fig 12: AND-OR graph with heuristic value at each node
Note that, in the graph given in fig 12, there are two candidate solution paths from the start state S: either S-A-B or S-C. To calculate the cost of a path we use the formula f(n) = g(n) + h(n) [note that here the g(n) value is 1 for every edge].
Path 1: f(S-A-B) = 1 + 1 + 7 + 12 = 21
Path 2: f(S-C) = 1 + 13 = 14
Since min(21, 14) = 14, we select successor node C: its cost is minimum, so it indicates the best path to a solution.
Note that C is an AND node, so we consider both successor nodes of C. The cost of node C is f(C-F-G) = 1 + 1 + 5 + 7 = 14; so the revised cost of node C is 14, and the revised cost of node S is f(S-C) = 1 + 14 = 15.
Note that once the cost (the f value) of any node is revised, we propagate this change backward through the graph to decide the current best path.
Now let us explore the other path and check whether we get a lesser cost than this or not.
f(A-D) = 1 + 5 = 6 and f(A-E) = 1 + 6 = 7; since A is an OR node, the best successor node is D, since min(6, 7) = 6. So the revised cost of node A is 6 instead of 7, that is f(A) = 6. The next selected node is D, and D has only the one successor H, so f(D-H) = 1 + 2 = 3; the revised cost of node D is 3, and the revised cost of node A becomes f(A-D-H) = 1 + 3 = 4. This path is better than f(A-E) = 7, so the final revised cost of node A is 4. Now the final revised cost of f(S-A-B) = 1 + 1 + 4 + 12 = 18.
Thus the final revised costs are:
Path 1: f(S-A-B) = 18, and
Path 2: f(S-C) = 15.
So the optimal cost is 15.
Example 2:
Consider the following AND-OR graph with an estimated heuristic cost at every node. Note that A, D and E are AND nodes and B and C are OR nodes. The edge costs (i.e., the g(n) values) are also given. Apply the AO* algorithm and find the optimal-cost path.
[Figure: AND-OR graph - root A (h = 7) is an AND node with successors B (h = 4, edge cost 2) and C (h = 3, edge cost 1); B and C are OR nodes; below them are D (h = 2), E (h = 6), F (h = 9) and G (h = 7), and at the bottom H, I, J and K with the heuristic values shown.]
A heuristic (estimated) cost is given at every node. For example, the heuristic cost at node A is h = 7, which means at least 7 units of cost are required to find a solution from A. Since A is an AND node, we have to solve both of its successor nodes, B and C:
Cost of node A, i.e., f(A-B-C) = (2 + 4) + (1 + 3) = 10.
We perform cost revision in bottom-up fashion.
[Figure: the graph after cost revision - the revised costs of B and C are 4 and 7 respectively, the revised cost of A is 14, and the marked solution tree has cost 18.]
Note that all the leaf nodes in the marked tree are SOLVED. So the best way to solve the problem is to follow the marked tree and solve the marked sub-problems; the best cost to solve the problem is 18.
Note that for an AND node, if all its successors (unit problems) are solved, then we declare it SOLVED; and for an OR node, if its single best successor is SOLVED, then we declare it SOLVED.
Real-life situations cannot be exactly decomposed into either an AND tree or an OR tree, but are always a combination of both. So we need the AO* algorithm, where O stands for 'ordered'. Instead of the two lists OPEN and CLOSED of the A* algorithm, we use a single structure GRAPH in the AO* algorithm. It represents the part of the search graph that has been explicitly generated so far. Please note that each node in the graph points both down to its immediate successors and up to its immediate predecessors. Also note that each node has some h'(n) value associated with it. But unlike A* search, g(n) is not stored: it is not possible to compute a single value of g(n), because there may be many paths to the same state. It is not required either, as we are doing top-down traversal along the best-known path. This guarantees that only those nodes that are on the best path are considered for expansion; hence h'(n) serves as the only estimate of the goodness of a node. Next, we present the AO* algorithm.
Algorithm AO*
1. Initialize: Set G* = {s}, f(s) = h(s). If s ∈ T, label s as SOLVED.
2. Terminate: If s is SOLVED, then terminate.
3. Select: Select a non-terminal leaf node n from the marked sub-tree.
4. Expand: Make explicit the successors of n. For each new successor m: set f(m) = h(m); if m is terminal, label m SOLVED.
5. Cost revision: Call Cost-Revise(n).
6. Loop: Go to step 2.
Cost revision in AO*: Cost-Revise(n)
1. Create Z = {n}.
2. If Z = { }, return.
3. Select a node m from Z such that m has no descendants in Z.
4. If m is an AND node with successors r1, r2, ..., rk:
   Set f(m) = Σi [f(ri) + c(m, ri)]
   Mark the edge to each successor of m.
   If every successor is labelled SOLVED, then label m as SOLVED.
5. If m is an OR node with successors r1, r2, ..., rk:
   Set f(m) = mini {f(ri) + c(m, ri)}
   Mark the edge to the best successor of m.
   If that successor is labelled SOLVED, then label m as SOLVED.
6. If the cost or label of m has changed, insert into Z those predecessors of m for which m is a marked successor, and go to step 2.
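The two revision rules (the AND case and the OR case) amount to a sum over successors at an AND node and a minimum at an OR node, as this small illustrative Python snippet shows:

# Illustrative cost revision for a single node of an AND-OR graph.
# 'children' is a list of (f_value, edge_cost) pairs for the successors.

def revised_cost(children, is_and_node):
    if is_and_node:
        # AND node: all sub-problems must be solved, so the costs add up
        return sum(f + c for f, c in children)
    # OR node: pick the cheapest alternative decomposition
    return min(f + c for f, c in children)

# Example 1 above: C is an AND node with successors F (f = 5) and G (f = 7),
# each reached by an edge of cost 1, so f(C) = (5 + 1) + (7 + 1) = 14.
print(revised_cost([(5, 1), (7, 1)], is_and_node=True))    # 14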
Note that AO* will always find a minimum-cost solution, if one exists, provided that h'(n) ≤ h(n) and all arc costs are positive. The efficiency of this algorithm depends on how closely h'(n) approximates h(n). Also note that AO* is guaranteed to terminate even on graphs that have cycles.
Note: When the graph has only OR nodes, the AO* algorithm works just like the A* algorithm.
3.10 Memory Bound Heuristic Search
The following are the commonly used memory-bound heuristic searches:
1. Iterative Deepening A* (IDA*)
2. Recursive Best-First Search (RBFS)
3.10.1 Iterative Deepening A* (IDA*)
IDA* is a variant of the A* search algorithm which uses iterative deepening to keep the memory usage lower than in A*. It is an informed search based on the idea of the uninformed iterative deepening search. Iterative deepening A*, or IDA*, is similar to iterative-deepening depth-first search, but with the following modification: instead of a depth bound, each iteration is a depth-first search bounded by an f-cost limit. The search expands all nodes whose value f(n) = g(n) + h(n) does not exceed the current bound, and the bound for the next iteration is the smallest f-value that exceeded the current bound.
3.10.2 Working of IDA*
[Figure: successive f-cost contours f1 < f2 < f3 < f4 explored by IDA*; in the example tree, node P (f = 100) has successors A, B, C (f = 120, 130, 120) whose children D, G, E, F have f = 140, 125, 140, 125; each iteration is an f-limited (f-bounded) depth-first search over the SUCC nodes within the current bound.]
3.10.3 Analysis of IDA*
IDA* is complete, optimal, and optimally efficient (assuming a consistent, admissible heuristic), and requires only a polynomial amount of storage in the worst case: about b·f*/δ nodes of storage, where f* is the optimal path cost to a goal, b is the branching factor, and δ is the minimum operator step cost.
Note that IDA* is complete and optimal, and its space usage is linear in the depth of the solution. Each iteration is a depth-first search, and thus it does not require a priority queue.
Iterative Deepening Search (IDS) is essentially BFS plus DFS for tree search, and the IDA* algorithm is a "complete and optimal" algorithm.
BFS and A* are good for optimality, but not for memory; DFS is good for memory, O(bd), but not for optimality.
In the worst case, only one new state is expanded in each iteration of IDA*. If A* expands N states, then IDA* can expand:
1 + 2 + 3 + ... + N = O(N^2)
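A minimal IDA* sketch in Python, reusing the hypothetical graph and heuristic of the A* sketch above; note that each iteration is a plain depth-first search bounded by the current f-limit:

import math

# Illustrative IDA*: depth-first search bounded by an f-cost limit; each
# iteration raises the bound to the smallest f-value that exceeded it.
graph = {'S': {'A': 1, 'G': 10}, 'A': {'B': 2, 'C': 1},
         'B': {'D': 5}, 'C': {'D': 3, 'G': 4}, 'D': {'G': 2}, 'G': {}}
h = {'S': 5, 'A': 3, 'B': 4, 'C': 2, 'D': 6, 'G': 0}

def ida_star(start, goal):
    bound = h[start]                          # initial f-bound = h(start)
    while True:
        next_bound = math.inf
        stack = [(start, 0, [start])]         # plain DFS: no priority queue
        while stack:
            n, g, path = stack.pop()
            f = g + h[n]
            if f > bound:                     # over the limit: remember the
                next_bound = min(next_bound, f)   # smallest excess f-value
                continue
            if n == goal:
                return path, g
            for m, cost in graph[n].items():
                if m not in path:             # avoid cycles along the path
                    stack.append((m, g + cost, path + [m]))
        if next_bound == math.inf:            # nothing exceeded the bound:
            return None                       # no solution exists
        bound = next_bound                    # deepen the f-limit and retry

print(ida_star('S', 'G'))   # (['S', 'A', 'C', 'G'], 6)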
3.11 Recursive Best First Search (RBFS)
The idea of recursive best-first search is to simulate A* search with O(bd) memory, where b is the branching factor and d is the solution depth.
It is a memory-bound, simple recursive algorithm that works like a standard best-first search but takes up only linear space. Some things make it different from recursive DFS: it keeps track of f, the value of the best alternative path available from any ancestor of the current node, instead of continuing indefinitely down the current path.
RBFS mimics the operation of the standard best-first search algorithm. RBFS keeps track of the f-value of the best alternative path available from any ancestor of the current node. If the current node exceeds this limit, the recursion unwinds back to the alternative path. As the recursion unwinds, RBFS replaces the f-value of each node along the path with the best f-value of its children. In this way, RBFS remembers the f-value of the best leaf in the forgotten subtree and can therefore decide whether it is worth re-expanding that subtree at some later time.
RBFS is somewhat more efficient than IDA*, but still suffers from excessive node regeneration. A* and RBFS are optimal algorithms if the heuristic function h(n) is admissible.
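A compact sketch of RBFS in Python, following the standard formulation and reusing the same hypothetical graph and heuristic; the backed-up f-values of forgotten siblings are what the recursion returns:

import math

# Illustrative RBFS: recursive best-first search keeping only the current
# path plus the f-values of its siblings (linear space).
graph = {'S': {'A': 1, 'G': 10}, 'A': {'B': 2, 'C': 1},
         'B': {'D': 5}, 'C': {'D': 3, 'G': 4}, 'D': {'G': 2}, 'G': {}}
h = {'S': 5, 'A': 3, 'B': 4, 'C': 2, 'D': 6, 'G': 0}

def rbfs(start, goal):
    def recurse(n, g, f_n, f_limit, path):
        if n == goal:
            return path, g
        succs = []
        for m, cost in graph[n].items():
            if m not in path:
                # inherit the parent's backed-up f-value if it is larger
                succs.append([max(g + cost + h[m], f_n), g + cost, m])
        if not succs:
            return None, math.inf
        while True:
            succs.sort(key=lambda s: s[0])
            best = succs[0]
            if best[0] > f_limit:             # best is now worse than the
                return None, best[0]          # alternative: unwind, back up f
            alternative = succs[1][0] if len(succs) > 1 else math.inf
            result, best[0] = recurse(best[2], best[1], best[0],
                                      min(f_limit, alternative),
                                      path + [best[2]])
            if result is not None:
                return result, best[0]

    return recurse(start, 0, h[start], math.inf, [start])

print(rbfs('S', 'G'))   # (['S', 'A', 'C', 'G'], 6)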
[Check Your Progress - Q.1 to Q.3 are exercises based on figures not reproduced here: Q.1 gives a weighted graph on nodes 1 to 12 with start node 1 and goal node 12, on which Uniform Cost Search is to be applied; Q.2 gives a weighted graph on nodes 1 to 6 (including a negative edge cost) with start node 1 and goal node 6; Q.3 gives a further weighted graph.]
Q.4 Apply the AO* algorithm on the following graph. A heuristic value is given at every node; assume the cost of each edge is 1.
[Figure: AND-OR graph for Q.4 — root A (h = 9); its successors B (h = 3), C (h = 4) and D (h = 5); their successors E (h = 5), F (h = 7), G (h = 4) and H (h = 4); every edge has cost 1.]
Q.5: Apply the AO* algorithm on the following graph. The heuristic value is given at every node;
assume that the cost of each edge is 1.
[Figure: AND-OR graph for Q.5 — root P; its successors q (h = 5), r (h = 11) and s (h = 8); the next level has t (h = 4), u (h = 7), v (h = 1), w (h = 3) and x (h = 3), with y (h = 1) below; every edge has cost 1.]
Example 6: Given the 3 matrices A1, A2, A3 with dimensions (3 × 5), (5 × 6), (6 × 10),
consider the problem of solving this chain matrix multiplication. Apply the concept of an AND-
OR graph and find a minimum cost solution tree.
3.12 Summary
As the name ‘Uninformed Search’ suggests, the machine blindly follows the algorithm,
regardless of whether it is right or wrong, efficient or inefficient.
These algorithms are brute-force operations, and they don’t have extra information about
the search space; the only information they have is how to traverse or visit the nodes
in the tree. Thus, uninformed search algorithms are also called blind search algorithms.
The search algorithm produces the search tree without using any domain knowledge,
and is brute-force in nature. Such algorithms don’t have any background information on how to
approach the goal. But they are the basis of search algorithms in AI.
The different types of uninformed search algorithms are as follows:
Depth First Search
Breadth-First Search
Depth Limited Search
Uniform Cost Search
Iterative Deepening Depth First Search
Bidirectional Search (if applicable)
To evaluate and compare the efficiency of any search algorithm, the following four
properties are used: completeness, optimality, time complexity, and space complexity.
3.13 Solutions/Answers
Answer 1:
The OPEN and CLOSED lists are shown below. Nodes with their f(n) values are inserted in the OPEN list, and
the node whose f(n) value is minimum is expanded next.
CLOSED
1(12) 2(12) 6(12) 5(13) 10(13) 11(13) 12(13)
OPEN
1(12)
2(12) 5(13)
5(13) 3(14) 6(12)
5(13) 3(14) 7(17) 10(13)
3(19) 7(17) 10(13)
3(19) 7(17) 10(13) 9(14)
3(19) 7(17) 9(14) 11(13)
3(19) 7(17) 9(14) 12(13)
Note that only 6 nodes are expanded to reach a goal node. The optimal cost to reach from the start state (1) to
the goal node (12) is 13.
Note: If all the edge costs are positive, then the Uniform Cost Search (UCS) algorithm is the same as Dijkstra’s
algorithm. Dijkstra’s algorithm fails if the graph has a negative weight cycle. The A* algorithm allows
negative edge costs as well, i.e., the A* algorithm works for negative (-ve) edge costs too. If some edge cost is
negative, then at any point in the successive iterations we cannot say that the cost found up to a node is optimal
(because of the negative cost). So, in this case (-ve edge cost), nodes come back from CLOSED to OPEN.
Let us see one example (Example 2) having negative edge costs, where you can also see how nodes come
back from CLOSED to OPEN.
Answer 2:
The OPEN and CLOSED lists are shown below. Nodes with their f(n) values are inserted in the OPEN list, and
the node whose f(n) value is minimum is expanded next.
CLOSED
1(15) 2(7) 4(9) 5(11) 3(25) 4(7) 5(9)
OPEN
1(15)
2(7) 3(25)
3(25) 4(9)
3(25) 5(11)
3(25) 6(28)   (6 is the goal node, but we cannot pick it yet, because a better cost/path may exist)
4(7) 6(28)    (the cost of node 4 decreased from 9 to 7, so node 4 is brought back from CLOSED to OPEN)
5(9) 6(28)    (note: 6(28) is not the minimum cost in OPEN, so we do not pick it)
4(9) 6(26) 6(28)   (by the same logic, until the goal is finally picked with optimal cost 26)
The optimal cost to reach from the start state (1) to the goal node (6) is 26.
Answer 4:
[Figure: AO* working for Q.4 — the AND-OR graph after successive expansions; the heuristic value of D is revised from 5 to 10, and the currently best (marked) successors are shown at each step.]
Note that the AO* algorithm does not explore all solution paths once it finds a solution.
[Figure: AND-OR graph for multiplying A1 A2 A3 — the root (result dimension 3 × 10) is an OR node over the two parenthesizations; the leaf costs are A1 (0), A2 × A3 (300, dimension 5 × 10), A1 × A2 (90, dimension 3 × 6) and A3 (0).]
In this AND-OR graph, the parent (root) node indicates the given problem of multiplying A1A2A3.
The next level of the tree (2 successor nodes) indicates the 2 choices (or ways) of multiplying (or
parenthesizing) A1A2A3; the first way is 𝐴1 × (𝐴2 × 𝐴3) and the other way is (𝐴1 × 𝐴2) × 𝐴3.
Since either one of these two choices can be the solution, there is an OR node here. At an
OR node, we always mark the current best successor node. At the next level we have AND nodes.
For any AND node, we must add the costs of both of its successor nodes.
Cost of multiplying (𝐴2 × 𝐴3) = 5 × 6 × 10 = 300 and dimension of 𝐴2 × 𝐴3 is (5 × 10).
Since the dimension of A1 is (3 × 5) and the dimension of 𝐴2 × 𝐴3 is (5 × 10), so the cost of
multiplying 𝐴1 × (𝐴2 × 𝐴3) = 3 × 5 × 10 = 150. Thus the total cost will be 300+150=450.
Similarly,
The cost of multiplying (𝐴1 × 𝐴2) = 3 × 5 × 6 = 90 and dimension of 𝐴1 × 𝐴2 will be
(3 × 6). Since the dimension of 𝐴1 × 𝐴2 is (3 × 6) and the dimension of A3 is (6 × 10), so the
cost of multiplying (𝐴1 × 𝐴2) × 𝐴3 = 3 × 6 × 10 = 180. Thus the total cost will be
90+180=270. So, the best way of multiplying 𝐴1 × 𝐴2 × 𝐴3 is (𝐴1 × 𝐴2) × 𝐴3, and the
minimum cost of multiplying 𝐴1 × 𝐴2 × 𝐴3 is 270.
Answer 6: Option C
Answer 7: Option A
4.1 Introduction
4.2 Objectives
4.3 Introduction to Propositional Logic
4.4 Syntax of Propositional Logic
4.4.1 Atomic Propositions
4.4.2 Compound Propositions
4.5 Logical Connectives
4.5.1 Conjunction
4.5.2 Disjunction
4.5.3 Negation
4.5.4 Implication
4.5.5 Bi-Conditional
4.6 Semantics
4.6.1 Negation Truth Table
4.6.2 Conjunction/Disjunction/Implication/Biconditional
Truth Table
4.6.3 Truth Table with three variables
4.7 Propositional Rules of Inference
4.7.1 Modus Ponens (MP)
4.7.2 Modus Tollens (MT)
4.7.3 Disjunctive Syllogism (DS)
4.7.4 Addition
4.7.5 Simplification
4.7.6 Conjunction
4.7.7 Hypothetical Syllogism (HS)
4.7.8 Absorption
4.7.9 Constructive Dilemma (CD)
4.8 Propositional Rules of Replacement
4.9 Validity and Satisfiability
4.10 Introduction to Predicate Logic
4.11 Inferencing in Predicate Logic
4.12 Proof Systems
4.13 Natural Deduction
4.14 Propositional Resolution
4.14.1 Clausal Form
4.14.2 Determining Unsatisfiability
4.15 Answers/Solutions
4.16 Further Readings
4.1 INTRODUCTION
Logic is the study and analysis of the nature of the valid argument, the
reasoning tool by which valid inferences can be drawn from a given set of facts
and premises. It is the basis on which all the sciences are built, and the
mathematical theory of logic is called symbolic logic. The English
mathematician George Boole (1815–1864) made the first serious study and
development of this theory.
The reason why the subject-matter of the study is called Symbolic Logic is that
symbols are used to denote facts about objects of the domain and relationships
between these objects. Then the symbolic representations and not the original
facts and relationships are manipulated in order to make conclusions or to solve
problems.
Using symbolic logic, we can formalize our arguments and logical reasoning in
a manner that can easily show if the reasoning is valid, or is a fallacy. How we
symbolize the reasoning is what is presented in this unit.
4.2 OBJECTIVES
Apart from its application in mathematics, logic also helps in various other
tasks related to computer science. It is widely used in designing electronic
circuitry, programming Android applications, applying artificial intelligence
to different tasks, etc. In simple terms, a proposition is a statement which is
either true or false.
Consider the following statements:
1. Earth revolves around the sun.
2. Water freezes at 100° Celsius.
3. An hour has 3600 seconds.
4. 2 is the only even prime number.
5. Mercury is the closest planet to the Sun in the solar system.
6. The USA lies on the continent of North America.
7. 1 + 2 = 4.
8. 15 is a prime number.
9. Moon rises in the morning and sets in the evening.
10. Delhi is the capital of India.
For all the above statements, one can easily conclude whether the particular
statement is true or false so these are propositions. First statement is a universal
truth. Second statement is false as the water freezes at 0° Celsius. Third
statement is again a universal truth. Fourth statement is true. Fifth statement is
also true as it is again a universal truth. On similar lines, the sixth statement is
also true. Seventh and eighth statements are false again as they deny the basic
mathematical rules. The Ninth statement is a negation of the universal truth so it
is a false statement. The Tenth Statement is also true.
Now consider the following statements:
1. What is your name?
2. a + 5 = b.
3. Who is the prime minister of India?
4. p is less than 5.
5. Pay full attention while you are in the class.
6. Let’s play football in the evening.
7. Don’t behave like a child, you are grown up now!
8. How much do you earn?
9. ∠X is an acute angle greater than 27°.
For all the above statements, we can’t say anything about their truthfulness so
they are not propositions. First and third statements are neither true nor false as
they are interrogative in nature. Also, we can’t say anything about the
truthfulness of the second statement until and unless we have the values of a and
b. Similar reasoning applies to the fourth statement as well. We can’t say
anything about the fifth, sixth and seventh statements, as they are imperative or suggestive
statements. The eighth statement is again interrogative in nature. Again, we can’t
say anything about the truthfulness of the ninth statement until and unless we
have the value of ∠X.
Propositional logic has the following facts:
1. Propositional statements can be either true or false, they can’t be
both simultaneously.
2. Propositional logic is also referred to as binary logic as it works
only on the two values 1 (True) and 0 (False).
3. Symbols or symbolic variables such as x, y, z, P, Q, R, etc. are
used for representing the logic and propositions.
4. Any proposition or statement which is always valid (true) is known
as a tautology.
5. Any proposition or statement which is always invalid (false) is
known as a contradiction.
6. A table listing all the possible truth values of a proposition is
known as a truth table.
7. Objects, relations (or functions) and logical connectives are the
basic building blocks of propositional logic.
8. Logical connectives are also referred to as logical operators.
9. Statements which are interrogative, informative or opinions such as
“Where is Chandni Chowk located?”, “Mumbai is a good city to
live in”, “Result will be declared on 31st March” are not
propositions.
Logical connectives are the operators used to join two or more atomic
propositions (operands). The joining should be done in a way that the logic and
truth value of the obtained compound proposition is dependent on the input
atomic propositions and the connective used.
4.5.1 Conjunction
A proposition “A ∧ B” with connective ∧ is known as the conjunction of A and B. It
is a proposition (or operation) which is true only when both the constituent
propositions are true. Even if one of the input propositions is false then the
output is also false. It is also referred to as AND-ing the propositions. Example:
Ram is a playful boy and he loves to play football. It can be written as:
A = Ram is a playful boy.
B = Ram loves to play football.
A ∧ B = Ram is a playful boy and he loves to play football.
4.5.2 Disjunction
A proposition “A ∨ B” with connective ∨ is known as the disjunction of A and B. It
is a proposition (or operation) which is true when at least one of the constituent
propositions are true. The output is false only when both the input propositions
are false. It is also referred to as OR-ing the propositions. Example:
I will go to her house or she will come to my house. It can be written as:
A = I will go to her house.
B = She will come to my house.
A ∨ B = I will go to her house or she will come to my house.
4.5.3 Negation
The proposition ¬ A (or ~A) with ¬ (or ~) connective is known as negation of
A. The purpose of negation is to negate the logic of given proposition. If A is
true, its negation will be false, and if A is false, its negation will be true.
Example:
University is closed. It can be written as:
A = University is closed.
¬ A = University is not closed.
4.5.4 Implication
The proposition A → B with → connective is known as A implies B. It is also
called if-then proposition. Here, the second proposition is a logical consequence
of the first proposition. For example, “If Mary scores good in examinations, I
will buy a mobile phone for her”. In this case, it means that if Mary scores
good, she will definitely get the mobile phone but it doesn’t mean that if she
performs bad, she won’t get the mobile phone. In set notation, we can also say
that A ⊆ B i.e., if something exists in the set A, then it necessarily exists in the
set B. Another example:
If you score above 90%, you will get a mobile phone.
A = You score above 90%.
B = You will get a mobile phone.
A → B = If you score above 90%, you will get a mobile phone.
4.5.5 Bi-conditional
A proposition A ⟷ B with connective ⟷ is known as a biconditional or if-and-
only-if proposition. It is true when both the atomic propositions are true or both
are false. A classic example of a biconditional is “A triangle is equilateral if and
only if all its angles are 60° each”. This statement means that if a triangle is an
equilateral triangle, then all of its angles are 60° each. There is one more
associated meaning of this statement, which is that if all the interior
angles of a triangle are 60° each, then it is an equilateral triangle. Example:
You will succeed in life if and only if you work hard.
A = You will succeed in life.
B = You work hard.
A ⟷ B = You will succeed in life if and only if you work hard.
Which of the following propositions are atomic and which are compound?
1. The first battle of Panipat was fought in 1556.
2. Jack either plays cricket or football.
3. Posthumously, at the age of 22, Neerja Bhanot became the youngest recipient of the
Ashok Chakra award which is India's highest peacetime gallantry decoration.
4. Chandigarh is the capital of the Indian states Haryana and Punjab.
5. Earth takes 365 days, 5 hours, 59 minutes and 16 seconds to complete one revolution
around the Sun.
6. Dermatology is the branch of medical science which deals with the skin.
7. Indian sportspersons won 7 medals at the 2020 Summer Olympics and 19 medals at
the 2020 Summer Paralympics both held at the Japanese city of Tokyo.
8. Harappan civilization is considered to be the oldest human civilization and it lies in
the parts of present-day India, Pakistan and Afghanistan.
9. IGNOU is a central university and offers courses through the distance learning mode.
10. Uttarakhand was carved out of the Indian state of Uttar Pradesh in the year 2000.
4.6 SEMANTICS
You have already learned many of the concepts to be covered in this unit in MCS-212, i.e.,
Discrete Mathematics; here is a quick revision of those concepts, and we will extend our
discussion to the advanced concepts which are useful for our field of work, i.e., Artificial
Intelligence. In MCS-212, i.e., Discrete Mathematics, you learned that propositions are the
declarative sentences or statements which are either true or false, but not both; such sentences
are either universally true or universally false.
On the other hand, consider the declarative sentence ‘Women are more intelligent than men’.
Some people may think it is true while others may disagree. So, it is neither universally true
nor universally false. Such a sentence is not acceptable as a statement or proposition in
mathematical logic.
In propositional logic, as mentioned earlier also, symbols are used to denote propositions. For
instance, we may denote the propositions discussed above as follows:
The symbols, such as P, Q, and R, that are used to denote propositions, are called atomic
formulas, or atoms. In this case, the truth-value of P is False, the truth-value of Q is True,
and the truth-value of R, though not known yet, is exactly one of ‘True’ or ‘False’,
depending on whether Ram is actually a Ph.D. or not.
At this stage, it may be noted that once symbols are used in place of given statements in, say,
English, then the propositional system, and, in general, a symbolic system, is aware only of
symbolic representations and the associated truth values. The system operates only on these
representations and, except for a possible final translation, is not aware of the original
statements, generally given in some natural language, say, English.
When you’re talking to someone, do you use very simple sentences only? Don’t you use
more complicated ones which are joined by words like ‘and’, ‘or’, etc.? In the same way,
most statements in mathematical logic are combinations of simpler statements joined by
words and phrases like ‘and’, ‘or’, ‘if … then’, ‘if and only if’, etc. We can build, from
atoms, more complex propositions, sometimes called compound propositions, by using
logical connectives.
The Logical Connectives are used to frame compound propositions, and they are as follows:
a) Disjunction: The disjunction of two propositions p and q is the compound statement ‘p or
q’, denoted by p ∨ q.
The exclusive disjunction of two propositions p and q is the statement ‘Either of the two
(i.e. p or q) can be true, but both can’t be true’. We denote this by p ⊕ q.
Let p and q be two propositions. The compound statement (p → q) ∧ (q → p) is the bi-
conditional of p and q. We denote it by p ↔ q, and read it as ‘p if and only if q’.
The rule of precedence: The order of preference in which the connectives are applied in a
formula of propositions that has no brackets is
i) ~
ii) ∧
iii) ∨ and ⊕
iv) → and ↔
Note that the ‘inclusive or’ (∨) and the ‘exclusive or’ (⊕) are both third in the order of preference.
However, if both of these appear in a statement, we first apply the leftmost one. So, for
instance, in p ∨ q ⊕ ~p, we first apply ∨ and then ⊕. The same applies to the ‘implication’
and the ‘biconditional’, which are both fourth in the order of preference.
Let’s see the working of the various concepts learned above with the help of
truth tables. In the following truth tables, we write every TRUE value as T and
every FALSE value as F.
4.6.1 Negation Truth Table
α      ~α
F(0)   T(1)
T(1)   F(0)
Using these logical connectives, we can transform any sentence into its equivalent
mathematical representation in symbolic logic, and that representation is referred to as a Well-
Formed Formula (WFF). You have already learned a lot about WFFs in MCS-212; let’s briefly
discuss them here also, as they have wide applications in Artificial Intelligence.
A Well-formed formula, or wff or formula in short, in the propositional logic is defined
recursively as follows:
1. An atom is a wff.
2. If A is a wff, then (~A) is a wff.
3. If A and B are wffs, then each of (~A), (A ∧ B), (A ∨ B), (A → B), and (A ↔ B) is a wff.
4. Any wff is obtained only by applying the above rules.
From the above recursive definition of a wff, it is not difficult to decide whether a given
expression is a wff. Further, it is easy to see that, according to the recursive definition of a wff,
an expression with a missing operand or misplaced connective, such as (P → (Q ∧)), is not a wff.
A ∧ B and A ∨ B may be used instead of the given wffs (A ∧ B) and (A ∨ B),
respectively. We can omit the use of parentheses by assigning priorities, in increasing order,
to the connectives as follows:
↔, →, ∨, ∧, ~.
Thus, ‘↔’ has the least priority and ‘~’ has the highest priority. Further, if in an expression there are
no parentheses and two connectives between three atomic formulas are used, then the
operator with higher priority will be applied first and the other operator will be applied later.
For example: Let us be given the wff P ∨ Q ∧ ~R without parentheses. Then, among the
operators appearing in the wff, the operator ‘~’ has the highest priority. Therefore, ~R is replaced by
(~R). The equivalent expression becomes P ∨ Q ∧ (~R). Next, out of the two operators, viz.
‘∨’ and ‘∧’, the operator ‘∧’ has the higher priority. Therefore, by applying parentheses
appropriately, the new expression becomes P ∨ (Q ∧ (~R)). Finally, only one operator is
left. Hence the fully parenthesized expression becomes (P ∨ (Q ∧ (~R)))
Following are the rules of finding the truth value or meaning of a wff, when truth values of
the atoms appearing in the wff are known or given.
1. The wff ~A is True when A is False, and ~A is False when A is True. The wff
~A is called the negation of A.
2. The wff (A ∧ B) is True if A and B are both True; otherwise, the wff (A ∧ B) is False. The
wff (A ∧ B) is called the conjunction of A and B.
3. The wff (A ∨ B) is True if at least one of A and B is True; otherwise, (A ∨ B) is False. (A
∨ B) is called the disjunction of A and B.
4. The wff (A → B) is False if A is True and B is False; otherwise, (A → B) is True. The
wff (A → B) is read as “If A, then B,” or “A implies B.” The symbol ‘→’ is called
implication.
5. The wff (A ↔ B) is True whenever A and B have the same truth values; otherwise (A ↔
B) is False. The wff (A ↔ B) is read as “A if and only if B.”
4.7.4 Addition
The rule states that if a proposition is true, then its disjunction with any other
proposition is also true.
Rule 1: α_1 => α_1 ∨ α_2
Rule 2: α_2 => α_1 ∨ α_2
4.7.5 Simplification
Simplification means that if we have a conjunction, then both the constituent
propositions are also true.
Rule 1: α_1 ∧ α_2 => α_1
Rule 2: α_1 ∧ α_2 => α_2
4.7.6 Conjunction
Conjunction states if two propositions are true, then their conjunction is also
true. It is written as:
α_1, α_2 => α_1 ∧ α_2
4.7.8 Absorption
The rule states that if the literal α_1 conditionally implies another literal α_2,
i.e., α_1 → α_2 is true, then α_1 → (α_1 ∧ α_2) also holds.
α_1 → α_2 => α_1 → (α_1 ∧ α_2)
Note: ∃ is the symbol used for the existential quantifier and ∀ is used for the universal
quantifier.
In propositional logic, a replacement rule is used to replace an argument or a set of
arguments with an equivalent argument. By equivalent arguments, we mean that
the logical interpretation of the arguments is the same. These rules are used to
manipulate propositions. Also, the axioms and the propositional rules of
inference are used as an aid to generate the replacement rules. Given below is
a table summarizing the different replacement rules over the propositions α_1,
α_2 and α_3.
Replacement Rule        Proposition              Equivalent
Double Negation          ~(~α_1)                 α_1
De Morgan’s Law          ~(α_1 ∧ α_2)            ~α_1 ∨ ~α_2
De Morgan’s Law          ~(α_1 ∨ α_2)            ~α_1 ∧ ~α_2
Commutation              α_1 ∨ α_2               α_2 ∨ α_1
Distribution             α_1 ∧ (α_2 ∨ α_3)       (α_1 ∧ α_2) ∨ (α_1 ∧ α_3)
Distribution             α_1 ∨ (α_2 ∧ α_3)       (α_1 ∨ α_2) ∧ (α_1 ∨ α_3)
Material Implication     α_1 → α_2               ~α_1 ∨ α_2
Material Equivalence     α_1 ↔ α_2               (α_1 → α_2) ∧ (α_2 → α_1)
It is noteworthy that the argument α_1 ∨ (α_2 ∨ α_3) ∨ (~α_2 ∧ ~α_3) is a valid argument,
as it has the value true for all possible combinations of the premises α_1, α_2 and α_3.
Example 2: Consider the expression α_1 ∧ ((α_2 ∧ α_3) ∨ (α_1 ∧ α_3)) over the
propositions α_1, α_2, and α_3, whose truth table is given below.
α_1   α_2   α_3   α_2∧α_3   α_1∧α_3   (α_2∧α_3)∨(α_1∧α_3)   α_1∧((α_2∧α_3)∨(α_1∧α_3))
F(0)  F(0)  F(0)  F(0)      F(0)      F(0)                  F(0)
F(0)  F(0)  T(1)  F(0)      F(0)      F(0)                  F(0)
F(0)  T(1)  F(0)  F(0)      F(0)      F(0)                  F(0)
F(0)  T(1)  T(1)  T(1)      F(0)      T(1)                  F(0)
T(1)  F(0)  F(0)  F(0)      F(0)      F(0)                  F(0)
T(1)  F(0)  T(1)  F(0)      T(1)      T(1)                  T(1)
T(1)  T(1)  F(0)  F(0)      F(0)      F(0)                  F(0)
T(1)  T(1)  T(1)  T(1)      T(1)      T(1)                  T(1)
In this example, as the argument α_1 ∧ ((α_2 ∧ α_3) ∨ (α_1 ∧ α_3)) is true for a
few combinations of the premises α_1, α_2, and α_3, it is a satisfiable
argument.
4.10 INTRODUCTION TO PREDICATE LOGIC
Now it’s time to understand the difference between a proposition and a predicate (also
known as a propositional function). In short, a proposition is a specialized statement whereas
a predicate is a generalized statement. To be more specific, propositions use logical
connectives only, and predicates use logical connectives and quantifiers (universal and
existential), both.
Note: ∃ is the symbol used for the existential quantifier and ∀ is used for the universal
quantifier.
Let’s understand the difference through some more detail, as given below.
So, if p(x) is ‘x > 5’, then p(x) is not a proposition. But when we give x particular values, say
x = 6 or x = 0, then we get propositions. Here, p(6) is a true proposition and p(0) is a false
proposition.
Similarly, if q(x) is ‘x has gone to Patna.’, then replacing x by ‘Taj Mahal’ gives us a false
proposition.
Note that a predicate is usually not a proposition. But, of course, every proposition is a
propositional function in the same way that every real number is a real-valued function,
namely, the constant function.
Now, can all sentences be written in symbolic form by using only the logical connectives?
What about sentences like ‘x is prime and x + 1 is prime for some x.’? How would you
symbolize the phrase ‘for some x’, which we can rephrase as ‘there exists an x’? You must
have come across this term often while studying mathematics. We use the symbol ‘∃’ to
denote this quantifier, ‘there exists’. The way we use it is, for instance, to rewrite ‘There is
at least one child in the class.’ as ‘(∃x in U)p(x)’,
where p(x) is the sentence ‘x is in the class.’ and U is the set of all children.
Now suppose we take the negative of the proposition we have just stated. Wouldn’t it be
‘There is no child in the class.’? We could symbolize this as ‘for all x in U, q(x)’, where x
ranges over all children and q(x) denotes the sentence ‘x is not in the class.’, i.e., q(x) ≡ ~p(x).
We have a mathematical symbol for the quantifier ‘for all’, which is ‘∀’. So, the
proposition above can be written as
‘(∀x ∈ U)q(x)’, or ‘q(x), ∀x ∈ U’.
(∃x ∈ R) (x + 1 > 0), which is read as ‘There exists an x in R for which x + 1 > 0.’.
(∃x ∈ N) (x − 1/2 = 0), which is read as ‘There exists an x in N for which x − 1/2 = 0.’.
An example of the use of the universal quantifier is (∀x ∈ N) (x² > x), which is read as ‘for
every x in N, x² > x.’. Of course, this is a false statement, because there is at least one x ∈ N
(e.g., x = 1) for which it is false.
This is one of the rules for negation that relate ∀ and ∃. The two rules are:
~(∀x ∈ U) p(x) ≡ (∃x ∈ U) (~p(x))
~(∃x ∈ U) p(x) ≡ (∀x ∈ U) (~p(x))
In general, we are given a set of arguments in predicate logic. Then, using the
rules of inference, we can deduce other arguments (or predicates) based on the
given arguments (or predicates). This process is known as entailment, as we
entail new arguments (or predicates). The inference rules you learned in MCS-
212, and also in Section 4.7 of this unit, are applicable here as well for the
process of entailment or making inferences. Now, with the help of the following
example, we will learn how the rules of inference discussed above can be used
to solve problems.
Example : There is a village that consists of two types of people – those who always tell the
truth, and those who always lie. Suppose that you visit the village and two villagers A and B
come up to you. Further, suppose
∴ B is a truth-teller.
∴ A and B are of the same type, i.e., both of them always lie.
Let us now consider the problem of showing that a statement is false, i.e., counterexamples:
A common situation in which we look for counterexamples is to disprove statements of the
form p → q. A counterexample to p → q needs to be an example where p ∧ ~q is true, i.e.,
p is true and ~q is true, i.e., the hypothesis p holds but the conclusion q does not hold.
For instance, to disprove the statement ‘If n is an odd integer, then n is prime.’, we need to
look for an odd integer which is not a prime number. 15 is one such integer. So, n = 15 is
a counterexample to the given statement.
Notice that a counterexample to a statement p proves that p is false, i.e., ~p is true.
Example: Symbolize the following statements and thereafter construct a proof for the
following valid argument:
(i) If the BOOK_X is literally true, then the Earth was made in six days.
(ii) If the Earth was made in six days, then carbon dating is useless and
Scientists/Researchers are liars.
(iii) Scientists/Researchers are not liars.
(iv) The BOOK_X is literally true, Hence
(v) God does not exist.
Let B: the BOOK_X is literally true, E: the Earth was made in six days, C: carbon dating is
useless, and S: Scientists/Researchers are liars. Then the premises are symbolized as:
(i) B → E
(ii) E → C ∧ S
(iii) ~S
(iv) B
Remarks: (iii) and (viii) contradict each other in the above deduction. In general, if
we come across two statements (like S and ~S) that contradict each other during the process
of deduction, we can deduce any statement, even a statement that can never be True in any
way. So, we can conclude that any statement is true if both S and ~S have already occurred in
the process of derivation.
In an earlier section, we introduced eight different inference rules that can be used in
propositional logic to help derive logical inferences. The methods of drawing valid
conclusions that have been discussed up until this point are examples of an approach to
drawing valid conclusions that is called the natural deduction approach of making
inferences. This is an approach to drawing valid conclusions in which the reasoning system
starts the reasoning process from the axioms, uses inferencing rules, and, if the conclusion
can be validly drawn, then it ultimately reaches the conclusion that was intended. On the
other hand, there is a method of obtaining legitimate conclusions that is known as the
Refutation approach. This method, which will be covered in the following part, will be
addressed.
The normal forms (CNF and DNF) also play a vital role in both the natural deduction and
resolution approaches. To understand the normal forms, we need to start with the basic
concepts of clauses, literals, etc.
For example, (i) (~A ∨ B) ∧ (A ∨ ~B ∨ C) is in conjunctive normal form (CNF), while
(ii) (A ∧ B) ∨ (~B ∧ ~A) is in disjunctive normal form (DNF).
Using the table of equivalent formulas given above, any valid Propositional Logic formula can be
transformed into CNF as well as DNF.
(i) E ↔ G = (E → G) ∧ (G → E)
(ii) E → G = ~E ∨ G
(iii) ~(~E) = E
(v) ~(E ∧ G) = ~E ∨ ~G
(vi) ~(E ∨ G) = ~E ∧ ~G
(vii) E ∨ (G ∧ H) = (E ∨ G) ∧ (E ∨ H)
(viii) E ∧ (G ∨ H) = (E ∧ G) ∨ (E ∧ H)
Hence, ~(A → (~B ∧ C)) = ~(~A ∨ (~B ∧ C))   (using E → G = ~E ∨ G)
= ~(~A) ∧ ~(~B ∧ C)   (using ~(E ∨ F) = ~E ∧ ~F)
= A ∧ (B ∨ ~C)   (using ~(~E) = E and ~(E ∧ F) = ~E ∨ ~F)
in which (B ∨ ~C) is a disjunct.
Example: Obtain the Conjunctive Normal Form (CNF) for the formula: D → (A → (B ∧ C))
Consider
D → (A → (B ∧ C))
= ~D ∨ (~A ∨ (B ∧ C))   (using E → G = ~E ∨ G, twice)
= (~D ∨ ~A) ∨ (B ∧ C)   (using the associative law for disjunction)
= (~D ∨ ~A ∨ B) ∧ (~D ∨ ~A ∨ C)   (using the distributive law)
The last line denotes the Conjunctive Normal Form of D → (A → (B ∧ C)).
Note: If we stop at the last but one step, then we obtain (~D ∨ ~A) ∨ (B ∧ C) = ~D ∨ ~A ∨ (B ∧ C), which is
a Disjunctive Normal Form for the given formula: D → (A → (B ∧ C)).
For the most part, there are two distinct strategies that can be implemented in order to
demonstrate the correctness of a theorem or derive a valid conclusion from a given collection
of axioms:
i) natural deduction
ii) the method of refutation
In the method known as natural deduction, one begins with a given set of axioms, applies
various rules of inference, and ultimately arrives at a conclusion. This method is strikingly
similar to the intuitive reasoning that is characteristic of humans.
When using a refutation approach, one begins with the denial of the conclusion that is to be
drawn and then proceeds to deduce a contradiction, or the value “false”. Since we are able to deduce
a contradiction as a result of having presupposed that the conclusion is incorrect, the
premise that the conclusion is incorrect is itself incorrect. Therefore, the argument concerning
the technique of resolution leads to the correctness of the conclusion. In this part of the
unit, we will talk about a different method known as the Resolution Method, which was
proposed by Robinson in 1965 and is based on the refutation approach. The Robinson
technique has served as the foundation for numerous computerized theorem provers, which
highlights the significance of the method in question.
Propositional resolution is a sound, complete and powerful rule of inference used in
Propositional Logic. It is used to prove the unsatisfiability of a given set of statements. This is
done using a strategy called Resolution Refutation, which uses the Resolution rule described
below.
Resolution Rule: The rule states that, given two clauses {α1, α2, α3, …, αm, β} and
{γ1, γ2, γ3, …, γn, ~β} containing a complementary pair of literals β and ~β, the conclusion
(resolvent) is {α1, α2, α3, …, αm, γ1, γ2, γ3, …, γn}.
For example, the statements {A, B} and {C, ~B} lead to the conclusion {A, C}.
Resolution Refutation:
a) Convert all the given statements to Conjunctive Normal Form (CNF). It is
also described as an AND of ORs. For example, (A ∨ B) ∧ (~A ∨ B) ∧ (~B ∨ A)
is a CNF.
b) Obtain the negation of the given conclusion
c) Apply the resolution rule until either the contradiction is obtained or the
resolution rule cannot be applied anymore.
4.14.1 Clausal Form
Any atomic sentence or its negation is called a literal. A literal, or the
disjunction of at least two literals, is called the clausal form or clause
expression. Next, a clause is defined as the set of literals in the clause form or
clause expression. For example, consider two atomic statements represented
using the literals X and Y. Their clausal expressions are X, ~X and (X ∨ Y), and
the clauses for these expressions are {X}, {~X} and {X, Y}. It is noteworthy
that the empty set {} is always a clause; it represents an empty disjunction and
hence is unsatisfiable. Kindly note that the conjunctive normal form (CNF) of an
expression represents the corresponding clausal form.
Now, we shall first understand certain rules for converting the statements to the
clause form as given below.
i) Operator: (β1 ∨ β2 ∨ β3 ∨ … ∨ βm) => {β1, β2, β3, …, βm}
(β1 ∧ β2 ∧ β3 ∧ … ∧ βm) => {β1}, {β2}, {β3}, …, {βm}
ii) Negation: same as Double Negation and De Morgan’s Law in section
4.8
iii) Distribution: as in sub-section 4.8
iv) Implications: as Material implication and Material Equivalence
(section 4.8)
Example 1: Convert A∧ (B->C) to clausal expression.
Step 1: A ∧ (~B V C) (using rule Material Implication to eliminate ->)
Step 2: {A}, {~B V C} (using rule Operator to eliminate ∧)
Example 2:Use propositional resolution to derive the goal from the given
knowledge base.
a) Either it is a head, or Lisa wins.
b) If Lisa wins, then Mary will go.
c) If it is a head, then the game is over.
d) The game is not over.
Conclusion: Mary will go.
Proof: First, consider propositions to represent each of the statements in the
knowledge base.
Let H: It is a head
L: Lisa wins
M: Mary will go
G: Game is over.
Re-writing the given knowledge base using the propositions defined.
a) H∨L
b) L -> M
c) H -> G
d) ~G
Conclusion: M
Step 1: H∨L (Premise)
Step 2: L -> M (Premise)
Step 3: ~L ∨ M (Step 2, Material Implication)
Step 4: H -> G (Premise)
Step 5: ~H ∨ G (Step 4, Material Implication)
Step 6: ~G (Premise)
Step 7: ~M (Negated conclusion as Premise)
Step 8: H ∨ M (Resolution principle on Step 1 and 3)
Step 9: M ∨ G (Resolution principle on Step 8 and 5)
Step 10: M (Resolution principle on Step 9 and 6)
Step 11: {} (Resolution principle on Step 10 and 7)
After applying Proof by Refutation i.e., contradicting the conclusion, the
problem is terminated with an empty clause ({}). Hence, the conclusion is
derived.
Example 3: Show that ~S1 follows from S1 -> S2 and ~(S1 ∧ S2).
Proof:
Step 1: S1 -> S2 (Premise)
Step 2: ~S1 ∨ S2 (Material Implication, Step 1)
Step 3: ~(S1 ∧ S2) (Premise)
Step 4: ~S1 ∨~S2 (De Morgan’s, Step 3)
Step 5: ~S1 (Resolution, Step 2, 4)
The resolution mechanism in PL is not used until after the given statements or wffs have been
converted into clausal forms. To obtain the clausal form of a wff, one must first convert the
wff into Conjunctive Normal Form (CNF). We are already familiar with the fact that a
clause is a formula (and only a formula) of the form A1 ∨ A2 ∨ … ∨ An, where each Ai is
either an atomic formula or the negation of an atomic formula.
The method of resolution is actually a generalization of Modus Ponens, whose expression is
P, P → Q ⊢ Q,
which can be written in the equivalent form
P, ~P ∨ Q ⊢ Q
(i.e., by using the relation P → Q = ~P ∨ Q).
If we are given that both P and ~P ∨ Q are true, then we may safely conclude that Q is also
true. This is a straightforward application of the general resolution principle that is
covered in more detail in this unit.
The construction of a truth table can be used to demonstrate the validity of a resolution
process (generally). In order to talk about the resolution process, we will first talk about some
of the applications of that method.
Example: Let C1: Q ∨ R and C2: ~Q ∨ S be two given clauses, so that one of the literals, i.e., Q,
occurs in one of the clauses (in this case C1) and its negation (~Q) occurs in the other clause
C2. Then the resolution method in this case tells us to take the disjunction of the
remaining parts of the given clauses C1 and C2, i.e., to take C3: R ∨ S as a deduction from C1 and
C2. Then C3 is called a resolvent of C1 and C2.
The two literals Q and (~Q), which occur in two different clauses, are called complementary
literals.
In the previous case, a complementary pair of literals, viz. Q and ~Q, occurs in the two clauses C1 and C2.
If two clauses do not have any complementary pair of literals, then no resolvent can be deduced from
them. Next, consider the clauses:
C1: R
C2: ~R ∨ S
C3: ~S
From clauses C1 and C2 we get the resolvent
C4: S
From clauses C4 and C3 we get the resolvent FALSE.
However, a resolvent FALSE can be deduced only from an unsatisfiable set of clauses.
Hence, the set of clauses C1, C2 and C3 is an unsatisfiable set of clauses.
Next, consider the set of clauses:
C1: R ∨ S
C2: ~R ∨ S
C3: R ∨ ~S
C4: ~R ∨ ~S
Then, from clauses C1 and C2 we get the resolvent
C5: S ∨ S = S
From clauses C3 and C4 we get the resolvent
C6: ~S
From clauses C5 and C6 we get the resolvent
C7: FALSE
Thus, the given set of four clauses is unsatisfiable. Also, a superset of any unsatisfiable set of
clauses is unsatisfiable.
Example: Show that the following set of clauses:
C1: R ∨ S
C2: ~S ∨ W
C3: ~R ∨ S
C4: ~W
is unsatisfiable.
From clauses C1 and C3 we get the resolvent
C7: S ∨ S = S
From clauses C7 and C2 we get the resolvent
C8: W
From clauses C8 and C4 we get
C9: FALSE
Hence, the given set of clauses is unsatisfiable.
Solution of a problem using the Resolution Method: As was discussed before, the
resolution process can also be understood as a refutation approach. The proving technique
proceeds as follows for problem solving:
After the symbolic representation of the issue at hand, an additional premise, in the form of
the negation of the wff which stands for the conclusion, is added. From this augmented set of
premises and axioms, we try to infer FALSE, i.e., a contradiction. If we are able to deduce
FALSE, then the conclusion that was required to be drawn is correct, and the problem is
solved. If, despite our best efforts, we are unable to deduce FALSE, then we are unable to
determine whether or not the conclusion is correct; in that case, the problem cannot be solved
using the axioms that have been provided and the conclusion that has been proposed.
Let's go on to the next step and apply Resolution Method to the issues we discussed earlier.
Example: If the interest rate goes up, stock prices go down. Also, most people are unhappy
when stock prices go down. Suppose that the interest rate goes up. Show that we can draw
the conclusion that most people are unhappy.
Let A: the interest rate goes up, S: stock prices go down, and U: most people are unhappy.
Then the premises, together with the negation of the conclusion, give the clauses:
(i) ~A ∨ S
(ii) ~S ∨ U
(iii) A
(iv) ~U
Then from (i) and (iii), through resolution, we get the clause
(v) S.
From (ii) and (iv), through resolution, we get the clause
(vi) ~S.
Finally, from (v) and (vi), through resolution, we get
(vii) FALSE.
Hence, the conclusion ‘most people are unhappy’ follows.
From the above solution using the resolution method, we may notice that clausal
conversion is a major step that takes a lot of time after translation to wffs. Most of the time,
once the clause form is known, the proof is easy to see, at least by a person.
4.15 ANSWERS/SOLUTIONS
2. First introduce notation set for the given knowledge base as:
S: Mary goes to school
L: Mary eats lunch
F: It is Friday
Corresponding knowledge base is:
KB1: S -> L
KB2:F -> (S V L)
Conclusion: F -> L
PREPARATION TEAM
Dr. Sudhansh Sharma (Writer – Unit 5, 6), Assistant Professor, SOCIS, IGNOU
(Units 5, 6 partially adapted from MCSE-003 Artificial Intelligence & Knowledge Management)
Prof. Ela Kumar (Content Editor), Department of Computers & Engg., IGDTUW, Delhi
Prof. Parmod Kumar (Language Editor), SOH, IGNOU, New Delhi
Dr. Manish Kumar (Writer – Unit 7), Assistant Professor, SOCIS, IGNOU
UNIT 5 FIRST ORDER PREDICATE LOGIC
Structure
5.0 Introduction
5.1 Objectives
5.2 Syntax of First Order Predicate Logic(FOPL)
5.3 Interpretations in FOPL
5.4 Semantics of Quantifiers
5.5 Inference & Entailment in FOPL
5.6 Conversion to clausal form
5.7 Resolution & Unification
5.8 Summary
5.9 Solutions/Answers
5.10 Further Readings
5.0 INTRODUCTION
In the previous unit, we discussed how propositional logic helps us in solving problems. However, one of
the major problems with propositional logic is that, sometimes, it is unable to capture even an elementary
type of reasoning or argument, as represented by the following statements:
Every man is mortal.
Raman is a man.
Hence, he is mortal.
The above reasoning is intuitively correct. However, if we attempt to simulate the reasoning through
Propositional Logic and, for this purpose, use the symbols P, Q and R to denote the statements
given above as:
P: Every man is mortal,
Q: Raman is a man,
R: Raman is mortal,
then, once the statements in the argument in English are symbolized in order to apply the tools of
propositional logic, we just have three symbols P, Q and R available with us, and apparently no link or
connection to the original statements or to each other. The connections, which would have helped in
solving the problem, become invisible. In Propositional Logic, there is no way to conclude the symbol R
from the symbols P and Q. However, as we mentioned earlier, even in a natural language, the conclusion
of the statement denoted by R from the statements denoted by P and Q is obvious. Therefore, we search
for some symbolic system of reasoning that helps us in discussing argument forms of the above-mentioned
type, in addition to those forms which can be discussed within the framework of propositional logic.
First Order Predicate Logic (FOPL) is the most well-known symbolic system for the purpose.
The symbolic system of FOPL does not treat an atomic statement as an indivisible unit. Rather, FOPL not
only treats an atomic statement as divisible into subject and predicate, but even deeper structures of
an atomic statement are considered, in order to handle a larger class of arguments. How, and to what extent,
FOPL symbolizes and establishes validity/invalidity and consistency/inconsistency of arguments is the
subject matter of this unit.
5.1 OBJECTIVES
After studying this unit, you should be able to:
define the syntax of First Order Predicate Logic (FOPL);
explain interpretations in FOPL and the semantics of quantifiers;
convert FOPL formulas to clausal form; and
apply resolution and unification for making inferences in FOPL.
5.2 SYNTAX OF FIRST ORDER PREDICATE LOGIC (FOPL)
Let us recall the argument with which this unit began:
Every man is mortal.
Raman is a man.
Hence, he is mortal.
In order to derive the validity of the above simple argument, instead of looking at an atomic statement as
indivisible, we begin by dividing each statement into subject and predicate. The two predicates which
occur in the above argument are:
IN: is_man, and
IL: is_mortal.
The argument is then symbolized as:
(∀x) (IN(x) → IL(x)),
IN (Raman).
Hence, IL (Raman).
More generally, relations of the form greater-than (x, y), denoting the phrase ‘x is greater than y’,
is_brother_of (x, y), denoting ‘x is brother of y’, between (x, y, z), denoting the phrase ‘x lies between
y and z’, and is_tall (x), denoting ‘x is tall’, are some examples of predicates. The variables x, y, z, etc.,
which appear in a predicate, are called parameters of the predicate.
The parameters may be given appropriate values such that, after substitution of a value for each of the
variables, the predicates become statements, for each of which we can say whether it is ‘True’ or it is
‘False’.
For example, for the predicate greater-than (x, y), if x is given the value 3, then we obtain greater-than (3, y),
for which it is still not possible to tell whether it is True or False. Hence, ‘greater-than (3, y)’ is also a
predicate. Further, if the variable y is given the value 5, then we get greater-than (3, 5) which, as we know, is
False. Hence, it is possible to give its truth-value, which is False in this case. Thus, from the predicate
greater-than (x, y), we get the statement greater-than (3, 5) by assigning the values 3 to the variable x and 5
to the variable y. These values 3 and 5 are called parametric values or arguments of the predicate greater-
than.
Similarly, we can represent the phrase x likes y by the predicate LIKE (x, y). Then Ram likes Mohan can
be represented by the statement LIKE (RAM, MOHAN).
Also, function symbols can be used in first-order logic. For example, we can use product (x, y) to
denote x × y and father (x) to mean ‘the father of x’. The statement ‘Mohan’s father loves Mohan’ can be
symbolized as LOVE (father (Mohan), Mohan). Thus, we need not know the name of the father of Mohan, and
still we can talk about him. A function serves such a role.
We may note that LIKE (Ram, Mohan) and LOVE (father (Mohan), Mohan) are atoms or atomic
statements, in the sense that one can associate a truth-value True or False with each of these, and
each of these does not involve a logical operator like ~, ∧, ∨, → or ↔.
Summarizing the above discussion: LIKE (Ram, Mohan) and LOVE (father (Mohan), Mohan) are
atoms; GREATER, LOVE and LIKE are predicate symbols; x and y are variables; 3, Ram
and Mohan are constants; and father and product are function symbols.
i) Individual symbols or constant symbols: These are usually names of objects, such as Ram, Mohan,
numbers like 3, 5 etc.
ii) Variable symbols: These are usually lowercase unsubscripted or subscripted letters, like x, y, z, x3.
iii) Function symbols: These are usually lowercase letters like f, g, h,….or strings of lowercase letters
such as father and product.
iv) Predicate symbols: These are usually uppercase letters like P, Q, R,….or strings of lowercase
letters such as greater-than, is_tall etc.
A function symbol or predicate symbol takes a fixed number of arguments. If a function symbol f takes n
arguments, f is called an n-place function symbol. Similarly, if a predicate symbol P takes m arguments, P
is called an m-place predicate symbol. For example, father is a one-place function symbol, and
GREATER and LIKE are two-place predicate symbols. However, father_of in father_of (x, y) is a two-
place predicate symbol.
The symbolic representation of an argument of a function or a predicate is called a term where a term is
defined recursively as follows:
i) A variable is a term.
ii) A constant is a term.
iii) If f is an n-place function symbol, and t1….tn are terms, then f(t1,….,tn) is a term.
iv) Any term can be generated only by the application of the rules given above.
For example: Since y and 3 are both terms and plus is a two-place function symbol, plus (y, 3) is a term
according to the above definition.
Furthermore, we can see that plus (plus (y, 3), y) and father (father (Mohan)) are also terms; the former
denotes (y + 3) + y and the latter denotes the grandfather of Mohan.
A predicate can be thought of as a function that maps a list of constant arguments to T or F. For example,
GREATER is a predicate with GREATER (5, 2) as T, but GREATER (1, 3) as F.
We already know that in PL, an atom or atomic statement is an indivisible unit for representing and
validating arguments. Atoms in PL are denoted generally by symbols like P, Q, R, etc. But in FOPL,
an atom has an internal structure, given by the following definition.
Definition: An Atom is an expression of the form P(t1, …, tn), where P is an n-place predicate symbol
and t1, …, tn are terms.
Once the atoms are defined, by using the logical connectives defined in Propositional Logic, and
assuming they have similar meanings in FOPL, we can build complex formulas of FOPL. Two special
symbols ∀ and ∃ are used to denote quantifications in FOPL. The symbols ∀ and ∃ are called, respectively,
the universal quantifier and the existential quantifier. For a variable x, (∀x) is read as ‘for all x’, and (∃x) is
read as ‘there exists an x’. Next, we consider some examples to illustrate the concepts discussed above.
Let us denote ‘x is a rational number’ by Q(x), ‘x is a real number’ by R(x), and ‘x is less than y’ by LESS(x,
y). Then the statements ‘every rational number is a real number’, ‘there exists a number which is rational’,
and ‘for every number x, there exists a number y such that x is less than y’ may be symbolized, respectively, as
(i) (∀x) (Q(x) → R(x))
(ii) (∃x) Q(x)
(iii) (∀x) (∃y) LESS(x, y)
Each of the expressions (i), (ii), and (iii) is called a formula or a well-formed formula or wff.
To understand the semantics of quantifiers, we need to first recall the difference between a
proposition and a predicate (also known as a propositional function), discussed in Unit 4: a proposition is a
specialized statement, whereas a predicate is a generalized statement. To be more specific, propositions
use logical connectives only, while predicates use logical connectives and quantifiers (universal
and existential), both.
Next, we discuss three new concepts, viz. scope of an occurrence of a quantified variable, bound occurrence
of a quantified variable (or quantifier), and free occurrence of a variable.
Before discussing these concepts, we should know the difference between a variable and an occurrence
of a variable in a quantifier expression: a single variable may occur any number of times in a formula
(for instance, a variable y may have only one occurrence, and a variable z zero occurrences, in a given formula).
Next, we define the three concepts mentioned above.
Scope of an occurrence of a quantifier is the smallest but complete formula following the quantifier,
sometimes delimited by a pair of parentheses. For example, Q(x) is the scope of (∀x) in the formula
(∀x) Q(x) ∧ P(x, y). But the scope of (∀x) in the formula (∀x) (Q(x) ∧ P(x, y)) is (Q(x) ∧ P(x, y)).
In a formula with two occurrences of (∀x), say (∀x) (P(x) ∨ Q(x, y)) ∧ (∀x) (P(x) ∨ R(x, 3)),
the scope of the first occurrence of (∀x) is the formula (P(x) ∨ Q(x, y)) and the scope of the second occurrence
of (∀x) is the formula
(P(x) ∨ R(x, 3)).
As another example, the scope of the only occurrence of the quantifier (∃y) in
(∀x) ((P(x) ∧ Q(x)) ∨ (∃y) (Q(x) ∧ R(y))) is (Q(x) ∧ R(y)), but the scope of the only occurrence of
the quantifier (∀x) in the same formula is the whole of the rest of the formula.
An occurrence of a variable in a formula is bound if and only if the occurrence is within the scope of a
quantifier employing the variable, or is the occurrence in that quantifier. An occurrence of a variable in a
formula is free if and only if this occurrence of the variable is not bound.
Thus, in the formula (∀x) P(x, y) ∧ Q(x), there are three occurrences of x, out of which the first two
occurrences of x are bound, whereas the last occurrence of x is free, because the scope of (∀x) in the above
formula is P(x, y). The only occurrence of y in the formula is free. Thus, x is both a bound and a free
variable in the above formula, and y is only a free variable in the formula. So far, we have talked of an
occurrence of a variable as free or bound. Now, we talk of (only) a variable as free or bound. A variable
is free in a formula if at least one occurrence of it is free in the formula. A variable is bound in a formula
if at least one occurrence of it is bound.
It may be noted that a variable can be both free and bound in a formula. In order to further elucidate the
concepts of scope, free and bound occurrences of a variable, we consider a similar but different formula
for the purpose: (∀x) (P(x, y) ∧ Q(x)).
In this formula, the scope of the only occurrence of the quantifier (∀x) is the whole of the rest of the formula,
viz. the scope of (∀x) in the given formula is (P(x, y) ∧ Q(x)).
Also, all three occurrences of the variable x are bound. The only occurrence of y is free.
Remarks: It may be noted that a bound variable x is just a place holder or a dummy variable, in the
sense that all occurrences of a bound variable x may be replaced by another variable, say y, which
does not occur in the formula. However, once x is replaced by y, then y becomes bound. For example,
(∀x) (f(x)) is the same as (∀y) (f(y)). It is something like
∫₁² x² dx = ∫₁² y² dy = (2³ − 1³)/3 = 7/3.
Replacing a bound variable x by another variable y, under the restrictions mentioned above, is called
renaming of the variable x.
Having defined an atomic formula of FOPL, next, we consider the definition of a general formula
formally in terms of atoms, logical connectives, and quantifiers.
Definition: A well-formed formula, wff, or just formula in FOPL is defined recursively as follows:
i) An atom is a wff.
ii) If E and G are wffs, then each of ~(E), (E ∧ G), (E ∨ G), (E → G), (E ↔ G) is a wff.
iii) If E is a wff and x is a free variable in E, then (∀x)E and (∃x)E are wffs.
iv) A wff can be obtained only by applications of (i), (ii), and (iii) given above.
We may drop pairs of parentheses by agreeing that quantifiers have the least scope. For example,
(∀x) P(x, y) ∧ Q(x) stands for ((∀x) P(x, y)) ∧ Q(x).
Example
Translate the statement: Every man is mortal. Raman is a man. Therefore, Raman is mortal.
As discussed earlier, let us denote ‘x is a man’ by MAN (x), and ‘x is mortal’ by MORTAL (x). Then
‘every man is mortal’ can be represented by
(∀x) (MAN(x) → MORTAL(x)),
‘Raman is a man’ by
MAN (Raman),
and ‘Raman is mortal’ by
MORTAL (Raman).
The whole argument can also be written as a single statement:
((∀x) (MAN(x) → MORTAL(x)) ∧ MAN(Raman)) → MORTAL(Raman).
In order to further explain symbolisation, let us recall the axioms of natural numbers:
(1) For every number, there is one and only one immediate successor,
(2) There is no number for which 0 is the immediate successor,
(3) For every number other than 0, there is one and only one immediate predecessor.
Let the immediate successor and predecessor of x, respectively, be denoted by f(x) and g(x).
Let E(x, y) denote x is equal to y. Then the axioms of natural numbers are represented respectively by the formulas:
(i) (∀x)(∃y)(E(y, f(x)) ∧ (∀z)(E(z, f(x)) → E(y, z)))
(ii) ~((∃x) E(0, f(x))) and
(iii) (∀x)(~E(x, 0) → (∃y)(E(y, g(x)) ∧ (∀z)(E(z, g(x)) → E(y, z)))).
From the semantics (i.e., meaning or interpretation) point of view, the wffs of FOPL may be divided into two categories: closed formulas and open formulas.
The wffs of FOPL in which there is no occurrence of a free variable are like the wffs of PL, in the sense that we can call each of these wffs True, False, consistent, inconsistent, valid, invalid etc. Each such formula is called a closed formula. However, when a wff involves a free occurrence of a variable, then it is not possible to call such a wff True, False etc. Each such formula is called an open formula.
For example: each of the formulas greater(x, y), greater(x, 3), (∀y) greater(x, y) has a free occurrence of the variable x. Hence, each is an open formula.
Each of the formulas (∀x)(∃y) greater(x, y), (∃y) greater(y, 1), greater(9, 2) does not have a free occurrence of any variable. Therefore, each of these formulas is a closed formula.
The following hold for any two formulas P(x) and Q(x):
(i) (∀x)(P(x) ∧ Q(x)) = (∀x) P(x) ∧ (∀x) Q(x)
(ii) (∃x)(P(x) ∨ Q(x)) = (∃x) P(x) ∨ (∃x) Q(x)
However,
(iii) (∀x)(P(x) ∨ Q(x)) ≠ (∀x) P(x) ∨ (∀x) Q(x)
(iv) (∃x)(P(x) ∧ Q(x)) ≠ (∃x) P(x) ∧ (∃x) Q(x)
For example, over the natural numbers, let P(x) denote “x is odd” and Q(x) denote “x is even”. Then the L.H.S of (iii) above states that every natural number is either odd or even, which is correct. But the R.H.S of (iii) states that every natural number is odd or every natural number is even, which is not correct.
Next, the L.H.S. of (iv) states that there is a natural number which is both even and odd, which is not correct. However, the R.H.S. of (iv) says there is a natural number which is odd and there is a natural number which is even, which is correct.
R = Q ∧ ~Q = False
Hence, the proof.
(ii) Consider
(∀x) P(x) → (∃y) P(y)
Replacing ‘→’ we get
= ~((∀x) P(x)) ∨ (∃y) P(y)
= (∃x) ~P(x) ∨ (∃y) P(y)
= (∃x) ~P(x) ∨ (∃x) P(x) (renaming y as x in the second disjunct)
In other words,
= (∃x)(~P(x) ∨ P(x)) (using equivalence (ii) above)
The last formula states: there is at least one element, say b, for which ~P(b) ∨ P(b) holds, i.e., for b, either P(b) is False or P(b) is True.
But, as P is a predicate symbol and b is a constant, ~P(b) ∨ P(b) must be True. Hence, the proof.
Ex. 1 Let P(x) and Q(x) represent “x is a rational number” and “x is a real number,” respectively.
Symbolize the following sentences:
Ex. 2 Let C(x) mean “x is a used-car dealer,” and H(x) mean “x is honest.” Translate each of the
following into English:
(i) (∃x) C(x)
Before introducing and discussing the Quantifier rules, we briefly discuss why, at all, these rules are required. For this purpose, let us recall the argument discussed earlier, which Propositional Logic could not handle: (i’) every man is mortal, and (ii’) Raman is a man; therefore, (iii’) Raman is mortal.
Had we been able to derive from (i’) the statement (iv) MAN(Raman) → MORTAL(Raman), then, using Modus Ponens on (ii’) & (iv) in Propositional Logic, we would have obtained (iii’) MORTAL(Raman).
However, from (i’) & (ii’) we cannot derive (iii’) in Propositional Logic. This suggests that there should be mechanisms for dropping and introducing quantifiers appropriately, i.e., in such a manner that the validity of arguments is not violated. Without discussing their validity-preserving characteristics, we introduce the four Quantifier rules.
(∀x) P(x)
(U.I.)
P(a)
where a is an arbitrary constant.
The rule states that if (∀x) P(x) is True, then we can assume P(a) as True for any constant a (where a constant a is like Raman). It can be easily seen that the rule associates a formula P(a) of Propositional Logic with a formula (∀x) P(x) of FOPL. The significance of the rule lies in the fact that once we obtain a formula like P(a), the reasoning process of Propositional Logic may be used. The rule may be used whenever its application seems appropriate.
P(a), for all a
(U.G.)
(∀x) P(x)
The rule says that if it is known that, for all constants a, the statement P(a) is True, then we can, instead, use the formula (∀x) P(x).
The rule associates with a set of formulas P(a), for all a, of Propositional Logic a formula (∀x) P(x) of FOPL.
Before using the rule, we must ensure that P(a) is True for all a; otherwise it may lead to wrong conclusions.
(∃x) P(x)
(E.I.)
P(a)
The rule says that if the Truth of (∃x) P(x) is known, then we can assume the Truth of P(a) for some fixed a.
The rule, again, associates a formula P(a) of Propositional Logic with a formula (∃x) P(x) of FOPL.
An inappropriate application of this rule may lead to wrong conclusions. The source of possible errors lies in the fact that the choice of ‘a’ in the rule is not arbitrary and cannot be known at the time of deducing P(a) from (∃x) P(x).
If, during the process of deduction, some other (∃y) Q(y) or (∃x) R(x) or even another (∃x) P(x) is encountered, then each time a new constant, say b, c etc., should be chosen, to infer Q(b) from (∃y) Q(y), or R(c) from (∃x) R(x), or P(d) from (∃x) P(x).
P(a)
(E.G.)
(∃x) P(x)
The rule states that if P(a), a formula of Propositional Logic, is True, then the Truth of (∃x) P(x), a formula of FOPL, may be assumed.
The Universal Generalisation (U.G.) and Existential Instantiation (E.I.) rules should be applied with utmost care; the other two rules may be applied whenever it appears appropriate.
Next, the purpose of the two instantiation rules, viz.,
(i) from (∀x) P(x) infer P(a) (U.I.), and (ii) from (∃x) P(x) infer P(a) (E.I.),
is to associate formulas of Propositional Logic (PL) with formulas of FOPL in such a manner that the validity of arguments due to these associations is not disturbed. Once we get formulas of PL, then any of the eight rules of inference of PL may be used to validate conclusions and solve problems requiring logical reasoning for their solutions.
The purpose of the other two quantification rules, viz. the generalisation rules
(iii) from P(a), for all a, infer (∀x) P(x) (U.G.), and (iv) from P(a) infer (∃x) P(x) (E.G.),
is that the conclusion to be drawn in FOPL is generally not a formula of PL but a formula of FOPL. While making inferences, we may first associate formulas of PL with formulas of FOPL and then use the inference rules of PL to conclude formulas in PL. But the conclusion to be made in the problem may correspond to a formula of FOPL. These two generalisation rules help us in associating formulas of FOPL with formulas of PL.
Example: Tell, supported with reasons, which of the following are correct inferences and which are not.
(i) To conclude (F(a) → G(a)) ∧ (H(a) → I(a)) from (∀x)(F(x) → G(x)) ∧ (H(x) → I(x)), using U.I.
The above inference or conclusion is incorrect, in view of the fact that the scope of the universal quantification is only the formula F(x) → G(x) and not the whole of the formula.
The occurrences of x in H(x) → I(x) are free occurrences. Thus, one of the correct inferences would have been:
(F(a) → G(a)) ∧ (H(x) → I(x))
(iii) To conclude ~F(a) for an arbitrary a, from ~((∀x) F(x)), using U.I.
The conclusion is incorrect, because actually ~((∀x) F(x)) = (∃x) ~F(x).
Thus, the inference is not a case of U.I., but of Existential Instantiation (E.I.).
Further, as per the restrictions on E.I., we cannot say for which a, ~F(a) is True. Of course, ~F is true for some constant, but not necessarily for a pre-assigned constant a: the constant to be substituted for x cannot be assumed to be the same constant b, given in advance, as an argument of F.
Steps for using Predicate Calculus as a Language for Representing Knowledge & for Reasoning:
Step 1: Conceptualisation: First of all, all the relevant entities and the relations that exist between these entities are explicitly enumerated. Some of the implicit facts, like ‘a person dead once is dead for ever’, have to be made explicit.
Step 2: Nomenclature & Translation: Giving appropriate names to objects and relations, and then translating the given English sentences to formulas in FOPL. Appropriate names are essential in order to guide a reasoning system based on FOPL. It is well-established that no reasoning system is complete; in other words, a reasoning system may need help in arriving at a desired conclusion.
Step 3: Finding an appropriate sequence of reasoning steps, involving selection of an appropriate rule and of the appropriate FOPL formulas to which the selected rule is to be applied, to reach the conclusion.
Example: Symbolize the following and then construct a proof for the argument:
(i) Anyone who repairs his own car is highly skilled and saves a lot of money on repairs.
(ii) Some people who repair their own cars have menial jobs. Therefore,
(iii) Some people with menial jobs are highly skilled.
Let us use the symbols:
P(x) : x is a person
R(x) : x repairs his own car
H(x) : x is highly skilled
S(x) : x saves money on repairs
M(x) : x has a menial job
Then the premises and the conclusion may be symbolized as:
(i) (∀x)(R(x) → (H(x) ∧ S(x)))
(ii) (∃x)(R(x) ∧ M(x))
(iii) (∃x)(M(x) ∧ H(x))
From (ii), using Existential Instantiation (E.I.), we get, for some fixed a,
(iv) R(a) ∧ M(a)
By specialisation of (iv) we get
(v) R(a)
From (i), using Universal Instantiation (U.I.), we get
(vi) R(a) → (H(a) ∧ S(a))
Using Modus Ponens on (v) and (vi), we get
(vii) H(a) ∧ S(a)
By specialisation of (vii) we get
(viii) H(a)
By specialisation of (iv) we get
(ix) M(a)
By conjunction of (viii) & (ix) we get
(x) M(a) ∧ H(a)
Finally, by Existential Generalisation (E.G.), we get (∃x)(M(x) ∧ H(x)), i.e., some people with menial jobs are highly skilled.
Example:
(i) Some juveniles who commit minor offences are thrown into prison, and any juvenile thrown into
prison is exposed to all sorts of hardened criminals.
(ii) A juvenile who is exposed to all sorts of hardened criminals will become bitter and learn more
techniques for committing crimes.
(iii) Any individual who learns more techniques for committing crimes is a menace to society, if he is
bitter.
(iv) Therefore, some juveniles who commit minor offences will be menaces to society.
Solution: Let us symbolize the statements in the given argument as follows:
(viii) J(b)
(ix) C(b) and
(x) P(b)
Using Universal Instantiation on (vi), we get
(xii) E(b)
Using conjunction for (viii) & (xii) we get
(xvii) M(b)
Using conjunction for (viii), (ix) and (xvii) we get
Remark: It may be noted that the occurrence of quantifiers is not, in general, commutative, i.e.,
(∀x)(∃y) P(x, y) ≠ (∃y)(∀x) P(x, y) …(A)
The occurrence of (∃y) on the L.H.S depends on x, i.e., the y on the L.H.S is a function of x. However, the occurrence of (∃y) on the R.H.S is independent of x; hence, the y on the R.H.S is not a function of x.
For example, if P(x, y) stands for the relation ‘y > x’ over the integers, then the L.H.S of (A) above states: for each x there is a y such that y > x.
On the other hand, the R.H.S of (A) above states: there is an integer y which is greater than x, for all x. The first statement is true, while the second is false.
Logical consequence (also called entailment) is the fundamental concept in logical reasoning: it describes the relationship between statements in which one statement holds true because it logically follows from one or more other statements.
A valid logical argument is one in which the conclusion is entailed by the premises, because the conclusion is the consequence of the premises. The philosophical analysis of logical consequence involves the questions: In what sense does a conclusion follow from its premises? And what does it mean for a conclusion to be a consequence of premises? All of philosophical logic is meant to provide accounts of the nature of logical consequence and the nature of logical truth.
Logical consequence is both necessary and formal, and is explicated by way of formal proofs and models of interpretation. A sentence is said to be a logical consequence of a set of sentences, for a given language, if and only if, using only logic (i.e., without regard to any personal interpretations of the sentences), the sentence must be true if every sentence in the set is true.
5.6 CONVERSION TO CLAUSAL FORM
In order to facilitate problem solving through Propositional Logic, we discussed two normal forms, viz. the conjunctive normal form (CNF) and the disjunctive normal form (DNF). In FOPL, there is a normal form called the prenex normal form. Further, a statement in prenex normal form is required to be skolemized to get the clausal form, which can be used for the purpose of Resolution.
So, the first step towards the clausal form is to bring the formula into Prenex Normal Form (PNF), and the second step is skolemization, which will be discussed after PNF.
Prenex Normal Form (PNF): In a broad sense, it relates to re-alignment of the quantifiers, i.e., bringing all the quantifiers to the beginning of the expression. Then, for skolemization, i.e., to bring the statement into clausal form, the existential quantifiers are replaced by constants and functions.
The use of a prenex normal form of a formula simplifies the proof procedures, to be discussed.
Definition: A formula G in FOPL is said to be in a prenex normal form if and only if the formula G is in the form
(Q1 x1) … (Qn xn) P
where each (Qi xi), for i = 1, …, n, is either (∀xi) or (∃xi), and P is a quantifier-free formula. The expression (Q1 x1) … (Qn xn) is called the prefix and P is called the matrix of the formula G.
Next, we consider a method of transforming a given formula into a prenex normal form. For this, first we discuss equivalence of formulas in FOPL. Let us recall that two formulas E and G are equivalent, denoted by E = G, if and only if the truth values of E and G are identical under every interpretation. The pairs of equivalent formulas given in the Table of Equivalent Formulas of the previous unit are still valid, as these are quantifier-free formulas of FOPL. However, there are pairs of equivalent formulas of FOPL that contain quantifiers. Next, we discuss these additional pairs of equivalent formulas. We introduce some notation specific to FOPL: the symbol G denotes a formula that does not contain any free variable x. Then we have the following pairs of equivalent formulas, where Q denotes a quantifier which is either ∀ or ∃.
Next, we introduce the following laws for pairs of equivalent formulas.
In the rest of the discussion of FOPL, P[x] is used to denote the fact that x is a free variable in the formula P; for example, P[x] = (∀y) P(x, y). Similarly, R[x, y] denotes that the variables x and y occur as free variables in the formula R. Some of these equivalences we have discussed earlier.
(i) (Qx) P[x] ∨ G = (Qx)(P[x] ∨ G)
(ii) (Qx) P[x] ∧ G = (Qx)(P[x] ∧ G)
(iii) ~((∀x) P[x]) = (∃x)(~P[x])
(iv) ~((∃x) P[x]) = (∀x)(~P[x])
(v) (∀x) P[x] ∧ (∀x) H[x] = (∀x)(P[x] ∧ H[x])
(vi) (∃x) P[x] ∨ (∃x) H[x] = (∃x)(P[x] ∨ H[x])
That is, the universal quantifier ∀ and the existential quantifier ∃ can be distributed respectively over ∧ and ∨.
But we must be careful about the following (we have already mentioned these inequalities):
(vii) (∀x) P[x] ∨ (∀x) H[x] ≠ (∀x)(P[x] ∨ H[x]) and
(viii) (∃x) P[x] ∧ (∃x) H[x] ≠ (∃x)(P[x] ∧ H[x])
Step 1: Remove the connectives ‘↔’ and ‘→’ using the equivalences
P ↔ G = (P → G) ∧ (G → P)
P → G = ~P ∨ G
Step 2: Remove double negations using
~(~P) = P
Step 3: Apply De Morgan’s laws in order to bring the negation signs immediately before atoms:
~(P ∨ G) = ~P ∧ ~G and ~(P ∧ G) = ~P ∨ ~G
Step 4: Bring the negation signs across the quantifiers using
~((∀x) P[x]) = (∃x)(~P[x]) and ~((∃x) P[x]) = (∀x)(~P[x])
Step 5: Bring the quantifiers to the left, before any predicate symbol appears in the formula. This is achieved by using (i) to (vi) discussed above.
We have already discussed that, if all occurrences of a bound variable are replaced uniformly throughout by another variable not occurring in the formula, then the equivalence is preserved. Also, we mentioned under (vii) that ∀ does not distribute over ∨, and under (viii) that ∃ does not distribute over ∧. In such cases, in order to bring the quantifiers to the left of the rest of the formula, we may have to first rename one of the bound variables, say x, as some variable z which occurs neither free nor bound in the other component formulas, and then use the following equivalences:
(Q1 x) P[x] ∨ (Q2 x) H[x] = (Q1 x)(Q2 z)(P[x] ∨ H[z])
(Q3 x) P[x] ∧ (Q4 x) H[x] = (Q3 x)(Q4 z)(P[x] ∧ H[z])
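Steps 1 to 4 above are purely mechanical, and may be pictured in code. The following minimal Python sketch (using the same illustrative tuple representation as before; the ‘↔’ case is omitted for brevity) removes ‘→’ and pushes the negation signs inward, using the double-negation, De Morgan and quantifier-negation laws:

# Minimal sketch of Steps 1-4 (reduction to negation normal form).
def nnf(f):
    op = f[0]
    if op == 'implies':                            # P -> G  =  ~P v G
        return ('or', nnf(('not', f[1])), nnf(f[2]))
    if op == 'not':
        g = f[1]
        if g[0] == 'not':                          # ~(~P) = P
            return nnf(g[1])
        if g[0] == 'and':                          # De Morgan: ~(P ^ G) = ~P v ~G
            return ('or', nnf(('not', g[1])), nnf(('not', g[2])))
        if g[0] == 'or':                           # De Morgan: ~(P v G) = ~P ^ ~G
            return ('and', nnf(('not', g[1])), nnf(('not', g[2])))
        if g[0] == 'forall':                       # ~(forall x)P[x] = (exists x)(~P[x])
            return ('exists', g[1], nnf(('not', g[2])))
        if g[0] == 'exists':                       # ~(exists x)P[x] = (forall x)(~P[x])
            return ('forall', g[1], nnf(('not', g[2])))
        return f                                   # a negated atom stays as it is
    if op in ('and', 'or'):
        return (op, nnf(f[1]), nnf(f[2]))
    if op in ('forall', 'exists'):
        return (op, f[1], nnf(f[2]))
    return f                                       # an atom stays as it is

# (forall x) P(x) -> (exists x) Q(x)  becomes  (exists x)(~P(x)) v (exists x) Q(x)
f = ('implies', ('forall', 'x', ('pred', 'P', 'x')), ('exists', 'x', ('pred', 'Q', 'x')))
print(nnf(f))

Step 5 (pulling the quantifiers to the left) would then be applied to the result, using the equivalences (i) to (vi), with renaming where necessary.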
Part (i)
Step 1: By removing ‘→’, we get
Part (ii)
As the first component formula Q(x, y) does not involve z, and S(x) involves neither y nor z, and ~R(z) does not involve y, we may take out (∃y) and (∀z), so that we get
(∀x)(∃y)(∀z)(Q(x, y) ∨ (~R(z) ∧ S(x))), which is the required formula in prenex normal form.
Part (iii)
Step 3: As the variables z, u and v do not occur in the rest of the formula, except in the component formulas within their respective scopes, we can take all the quantifiers outside, preserving the order of their occurrences. Thus we get
(∀x)(∃y)(∀z)(∀u)(∃v)(Q(x, y, z) ∨ (~R(x, u) ∧ R(y, v)))
Skolemization: A further refinement of Prenex Normal Form (PNF), called (Skolem) Standard Form, is the basis of problem solving through the Resolution Method, which will be discussed next.
The Standard Form of a formula of FOPL is obtained through the following three steps:
(1) Convert the given formula to Prenex Normal Form (PNF);
(2) Convert the matrix of the PNF, i.e., the quantifier-free part of the PNF, into conjunctive normal form (CNF);
(3) Skolemization: eliminate the existential quantifiers using skolem constants and functions.
Before illustrating the process of conversion of a formula of FOPL to Standard Form through examples, we briefly discuss skolem functions.
Skolem Function
We mentioned earlier that, in general,
(∃x)(∀y) P(x, y) ≠ (∀y)(∃x) P(x, y) …(1)
For example, if P(x, y) stands for the relation ‘x > y’ in the set of integers, then the L.H.S. of the inequality (1) above states: some (fixed) integer x is greater than all integers y. This statement is False.
On the other hand, the R.H.S. of the inequality (1) states: for each integer y, there is an integer x so that x > y. This statement is True.
The difference in meaning of the two sides of the inequality arises from the fact that, on the L.H.S., the x in (∃x) is independent of the y in (∀y), whereas on the R.H.S. x is dependent on y. In other words, x on the L.H.S. of the inequality can be replaced by some constant, say ‘c’, whereas on the right hand side x is some function, say f(y), of y.
Therefore, after skolemization, the two sides of the inequality (1) above may be written respectively as
(∀y) P(c, y) and (∀y) P(f(y), y).
The above argument, in essence, explains what is meant by each of the terms skolem constant, skolem function and skolemization.
The constants and functions which replace existential variables are respectively called skolem constants and skolem functions. The process of replacing all existential variables by skolem constants and functions is called skolemization.
The formula obtained from a given formula after (i) conversion to PNF, (ii) conversion of the matrix to CNF, and (iii) applying skolemization, is called the Skolem Standard Form, or just the Standard Form.
We explain through examples the skolemization process, after PNF and CNF have already been obtained.
Example: Skolemize each of the following formulas:
(i) (∃x1)(∃x2)(∀y1)(∀y2)(∃x3) P(x1, x2, x3, y1, y2)
(ii) (∃x1)(∀y1)(∃x2)(∀y2)(∃x3) P(x1, x2, x3, y1, y2) ∧ (∃x1)(∀y3)(∃x2)(∀y4) Q(x1, x2, y3, y4)
Solution (i): As the existential quantifiers (∃x1) and (∃x2) precede all the universal quantifiers, x1 and x2 are to be replaced by constants, but by distinct constants, say by ‘c’ and ‘d’ respectively. As the existential variable x3 is preceded by the universal quantifiers (∀y1) and (∀y2), x3 is replaced by some function f(y1, y2) of the variables y1 and y2. After making these substitutions and dropping the universal and existential quantifiers, we get the skolemized form of the given formula as
P(c, d, f(y1, y2), y1, y2).
Solution (ii): As a first step, we must bring all the quantifications to the beginning of the formula through Prenex Normal Form reduction. Also, since the quantifiers (∃x1) and (∃x2) each occur twice, we rename the second occurrences of the quantifiers (∃x1) and (∃x2), renaming the variables as x5 and x6. Hence, after renaming and pulling all the quantifications to the left, we get
(∃x1)(∀y1)(∃x2)(∀y2)(∃x3)(∃x5)(∀y3)(∃x6)(∀y4)(P(x1, x2, x3, y1, y2) ∧ Q(x5, x6, y3, y4))
Then the existential variable x1 is independent of all the universal quantifiers; hence, x1 may be replaced by a constant, say ‘c’. Next, x2 is preceded by the universal quantifier (∀y1); hence, x2 may be replaced by f(y1). The existential quantifier (∃x3) is preceded by the universal quantifiers (∀y1) and (∀y2); hence, x3 may be replaced by g(y1, y2). The existential quantifier (∃x5) is again preceded by the universal quantifiers (∀y1) and (∀y2); in other words, x5 is also a function of y1 and y2, but we have to use a different function symbol, say h, and replace x5 by h(y1, y2). Similarly, x6, being preceded by (∀y1), (∀y2) and (∀y3), may be replaced by yet another function, say k(y1, y2, y3). Thus, finally, we get the standard form
(∀y1)(∀y2)(∀y3)(∀y4)(P(c, f(y1), g(y1, y2), y1, y2) ∧ Q(h(y1, y2), k(y1, y2, y3), y3, y4)).
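The replacement policy used in the above solutions can be summarised in a short program. The following minimal Python sketch assumes the formula is already in prenex normal form, given as a prefix, i.e., a list of (quantifier, variable) pairs; the representation and names are our own. Each existential variable is replaced by a fresh constant if no universal quantifier precedes it, and otherwise by a fresh function of the preceding universal variables:

# Minimal skolemization sketch; the PNF prefix is a list such as
# [('exists', 'x1'), ('forall', 'y1'), ('exists', 'x2'), ...].
import itertools

fresh = ('f%d' % i for i in itertools.count(1))   # fresh skolem names f1, f2, ...

def skolemize(prefix):
    universals, subst = [], {}
    for quant, var in prefix:
        if quant == 'forall':
            universals.append(var)                # universal variables seen so far
        else:                                     # existential: replace by a skolem term
            name = next(fresh)
            subst[var] = name if not universals else (name, tuple(universals))
    return universals, subst                      # the matrix is then rewritten using subst

prefix = [('exists', 'x1'), ('forall', 'y1'), ('forall', 'y2'), ('exists', 'x3')]
print(skolemize(prefix))
# (['y1', 'y2'], {'x1': 'f1', 'x3': ('f2', ('y1', 'y2'))}):
# x1 becomes a skolem constant, while x3 becomes f2(y1, y2), as in Solution (i).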
Ex. 4 (i) Transform the formula (∀x) P(x) → (∃x) Q(x) into prenex normal form.
The complexity of the resolution method for FOPL mainly results from the fact that a clause in FOPL is generally of the form P(x) ∨ Q(f(x), x, y) ∨ …, in which the variables x, y, z may assume any one of the values of their domain.
Thus, the atomic formula (∀x) P(x), which after dropping of the universal quantifier is written as just P(x), stands for P(a1) ∧ P(a2) ∧ … ∧ P(an), where the set {a1, a2, …, an} is assumed here to be the domain of x.
However, in order to resolve two clauses, one containing, say, P(x) and the other containing ~P(y), where x and y are universally quantified variables, possibly having some restrictions, we have to know which values of x and y satisfy both the clauses. For this purpose, we need the concepts of substitution and unification, as defined and discussed in the rest of the section.
Instead of giving formal definitions of substitution, unification, unifier, most general unifier, resolvent and resolution of clauses in FOPL, we illustrate the concepts through examples and minimal definitions, where required.
To conclude
MORTAL(Raman)
from
(iv) ~MAN(x) ∨ MORTAL(x) and
(v) MAN(Raman):
in the above, x varies over the set of human beings, including Raman. Hence, one special instance of (iv) becomes
(vi) ~MAN(Raman) ∨ MORTAL(Raman).
(a) MAN(Raman) and MORTAL(Raman) do not contain any variables and, hence, their truth or falsity can be determined directly. Hence, each behaves like a formula of PL. A term or formula which does not contain any variable is called a ground term or ground formula.
(b) Treating MAN(Raman) as a formula of PL and using the resolution method on (v) and (vi), we conclude MORTAL(Raman).
Unification: In the process of solution of the problem discussed above, we tried to make the two expressions MAN(x) and MAN(Raman) identical. An attempt to make two or more expressions identical is called unification.
In order to unify MAN(x) and MAN(Raman), we noted that one of the possible values of x is Raman. And, hence, we replaced x by one of its possible values: Raman.
This replacement of a variable like x by a term (which may be another variable also) which is one of the possible values of x, is called substitution. The substitution, in this case, is denoted formally as {Raman/x}.
A substitution, in general, is notationally of the form {t1/x1, t2/x2, …, tm/xm}, where x1, x2, …, xm are variables and t1, t2, …, tm are terms, and ti replaces the variable xi in some expression.
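The term-by-term matching with substitutions, which the following example carries out by hand, is precisely what a unification algorithm does. The next few lines give a minimal Python sketch (the term representation is our own: strings beginning with ‘?’ are variables, other strings are constants, and tuples like ('f', '?x') stand for f(x); the occurs-check is omitted for brevity). It computes a unifying substitution of two terms, or reports failure:

# Minimal unification sketch; the representation is illustrative.
def is_var(t):
    return isinstance(t, str) and t.startswith('?')

def substitute(t, s):
    """Apply the substitution s to the term t."""
    if is_var(t):
        return substitute(s[t], s) if t in s else t
    if isinstance(t, tuple):
        return (t[0],) + tuple(substitute(a, s) for a in t[1:])
    return t

def unify(t1, t2, s=None):
    s = dict(s or {})
    t1, t2 = substitute(t1, s), substitute(t2, s)
    if t1 == t2:
        return s
    if is_var(t1):
        s[t1] = t2; return s                        # the substitution {t2/t1}
    if is_var(t2):
        s[t2] = t1; return s
    if isinstance(t1, tuple) and isinstance(t2, tuple) and \
       t1[0] == t2[0] and len(t1) == len(t2):       # same symbol, same number of arguments
        for a, b in zip(t1[1:], t2[1:]):            # match term by term, from the left
            s = unify(a, b, s)
            if s is None:
                return None
        return s
    return None                                     # disagreement: no unifier exists

# Love(x, f(x)) vs Love(a, LK): x unifies with a, but f(x) and LK cannot be unified
print(unify(('Love', '?x', ('f', '?x')), ('Love', 'a', 'LK')))   # None
# Love(x, LK) vs Love(a, LK): unified by the substitution {a/x}
print(unify(('Love', '?x', 'LK'), ('Love', 'a', 'LK')))          # {'?x': 'a'}

The two calls above anticipate exactly the two unification attempts made by hand in the example that follows.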
Example: (i) Assume Lord Krishna is loved by everyone who loves someone. (ii) Also assume that no one loves nobody. Deduce that Lord Krishna is loved by everyone.
Let us symbolize:
Love(x, y) : x loves y
LK : Lord Krishna
so that, after conversion to standard form, we get
(iv) ~Love(x, y) ∨ Love(x, LK) (from (i))
(v) Love(x, f(x)) (from (ii): everyone loves someone, f being a skolem function)
and, for the conclusion, we take its negation, (∃x) ~Love(x, LK). As the existential quantifier x is not preceded by any universal quantification, x may be substituted by a constant a, i.e., we use the substitution {a/x} to get the standard form:
(vi) ~Love(a, LK)
Thus, to solve the problem, we have the standard form formulas (iv), (v) and (vi) for resolution. The pairs that may possibly be resolved are ((v), (vi)) and ((iv), (vi)).
These possibilities exist because, in each pair, the predicate Love occurs in complemented form in one member of the pair.
For this purpose, we attempt to make the two formulas Love(x, f(x)) and Love(a, LK) identical, through unification involving substitutions. We start from the left, matching the two formulas term by term. The first place where matching may fail is where ‘x’ occurs in one formula and ‘a’ occurs in the other formula. As one of these happens to be a variable, the substitution {a/x} can be used to unify the portions so far.
The next possible disagreement through term-by-term matching is obtained when we get the two disagreeing terms f(x) and LK from the two formulas. As neither f(x) nor LK is a variable (note that f(x) involves a variable but is itself not a variable), no unification and, hence, no resolution of (v) and (vi) is possible.
Next, we attempt unification of ~Love(a, LK) of (vi) with Love(x, LK) of (iv).
The first term-by-term possible disagreement occurs where the corresponding terms are ‘a’ and ‘x’ respectively. As one of these is a variable, the substitution {a/x} unifies the parts of the formulas so far. Next, the two occurrences of LK, one in each of the two formulas, match. Hence, the whole of each of the two formulas can be unified through the substitution {a/x}. Though the unification has been attempted on corresponding smaller parts, the substitution has to be carried out in the whole of the formula, in this case in the whole of (iv). Thus, after substitution, (iv) becomes
(viii) ~Love(a, y) ∨ Love(a, LK)
Resolving (vi) and (viii), we get
(ix) ~Love(a, y)
In order to resolve (v) and (ix), we attempt to unify Love(x, f(x)) of (v) with Love(a, y) of (ix). The first possible disagreement occurs between ‘x’ and ‘a’. As one of these is a variable, the substitution {a/x} will unify the portions considered so far.
The next possible disagreement occurs between f(x) of (v) and y of (ix). As one of these, viz. y, is a variable, we can unify the two terms through the substitution {f(x)/y}. Thus, the complete substitution {a/x, f(x)/y} is required to match the formulas. Making the substitutions, (v) becomes Love(a, f(x)) and (ix) becomes ~Love(a, f(x)). Resolving these gives the empty clause, i.e., a contradiction; hence, the conclusion that Lord Krishna is loved by everyone stands proved.
In another example of unification, suppose the next pair of terms from the two formulas being matched is z and h(v, v). These are also unifiable, because one of the terms is a variable, and the required substitution for unification is {h(v, v)/z}.
The next pair of terms at corresponding positions is again {w, x}, for which we have already determined the substitution {w/x}. Thus, the substitution {w/x, h(v, v)/z} unifies the two formulas. Using the substitutions, (i) and (ii) become the respective resolvable clauses.
5.8 SUMMARY
In this unit, we initially discuss how PL is inadequate for solving even simple problems, and how this requires some extension of PL, or some other formal inferencing system, to compensate for the inadequacy. First Order Predicate Logic (FOPL) is such an extension of PL, and it is discussed in the unit.
Next, the syntax of a proper structure of a formula of FOPL is discussed. In this respect, a number of new concepts, including those of quantifier, variable, constant, term, free and bound occurrences of variables, closed and open wffs, consistency/validity of wffs etc., are introduced.
Next, two normal forms, viz. Prenex Normal Form (PNF) and Skolem Standard Form, are introduced. Finally, the tools and techniques developed in the unit are used to solve problems involving logical reasoning.
5.9 SOLUTIONS/ANSWERS
Check Your Progress - 1
Ex. 2
(i) There is (at least) one (person) who is a used-car dealer.
(v) There is at least one thing in the universe (for which it can be said that) if that something is honest, then that something is a used-car dealer.
Note: the above translation is not the same as: someone who is honest is a used-car dealer.
Now P(a) is an atom in PL which may assume any value, T or F. On taking P(a) as F, the given formula becomes T; hence, it is consistent.
(∀x) P(x) ∨ ~((∀x) P(x)), by taking the negation outside the second disjunct and then renaming.
The formula (∀x) P(x), being closed, is either T or F, and hence can be treated as a formula of PL.
Let (∀x) P(x) be denoted by Q. Then the given formula may be denoted by Q ∨ ~Q = True (always).
Therefore, the formula is valid.
Ex. 4 (i) (∀x) P(x) → (∃x) Q(x) = ~((∀x) P(x)) ∨ (∃x) Q(x) (by removing the connective →)
= (∃x)(~P(x)) ∨ (∃x) Q(x)
= (∃x)(~P(x) ∨ Q(x))
Therefore, a prenex normal form of (∀x) P(x) → (∃x) Q(x) is (∃x)(~P(x) ∨ Q(x)).
(ii) (∀x)(∀y)((∃z)(P(x, y) ∧ P(y, z)) → (∃u) Q(x, y, u))
= (∀x)(∀y)(~((∃z)(P(x, y) ∧ P(y, z))) ∨ (∃u) Q(x, y, u)) (removing the connective →)
= (∀x)(∀y)((∀z)(~P(x, y) ∨ ~P(y, z)) ∨ (∃u) Q(x, y, u))
= (∀x)(∀y)(∀z)(∃u)(~P(x, y) ∨ ~P(y, z) ∨ Q(x, y, u))
Therefore, we obtain the last formula as a prenex normal form of the first formula.
Ex 5 (i) In the given formula, (∃x) is not preceded by any universal quantification. Therefore, we replace the variable x by a (skolem) constant c in the formula and drop (∃x).
Next, the existential quantifier (∃z) is preceded by two universal quantifiers, viz., (∀v) and (∀y); we replace the variable z in the formula by some function, say, f(v, y), and drop (∃z). Finally, the existential variable (∃u) is preceded by three universal quantifiers, viz., (∀y), (∀v) and (∀w). Thus, we replace the variable u in the formula by some function g(y, v, w) and drop the quantifier (∃u). Finally, we obtain the standard form for the given formula as
(∀y)(∀v)(∀w) P(c, y, f(v, y), g(y, v, w), v, w)
Next, in the formula there are two existential quantifiers, viz., (∃y) and (∃z). Each of these is preceded by the only universal quantifier, viz. (∀x).
Thus, each of the variables y and z is replaced by a function of x; but the two functions of x, for y and z, must be different functions. Let us assume the variable y is replaced in the formula by f(x) and the variable z is replaced by g(x). Thus, the initially given formula, after dropping of the existential quantifiers, is in the standard form:
(∀x)((~P(x, f(x)) ∨ R(x, f(x), g(x))) ∧ (Q(x, g(x)) ∨ R(x, f(x), g(x))))
Check Your Progress - 3
6.0 INTRODUCTION
Computer Science is the study of how to create models that can be represented in and executed by some computing equipment. In this respect, the task for a computer scientist is to create, in addition to a model of the problem domain, a model of an expert of the domain, i.e., of a problem solver who is highly skilled in solving problems from the domain under consideration. The concerned field is the field of Expert Systems.
First of all, we must understand that an expert system is nothing but a computer program, or a set of computer programs, which contains the knowledge and some inference capability of an expert, most generally a human expert, in a particular domain. An expert system is supposed to have the capability to lead to some conclusion based on the inputs provided; the system already contains some pre-existing information, which is processed to infer some conclusion. Expert systems belong to the branch of Computer Science called Artificial Intelligence.
Taking into consideration all the points discussed above, one of the many possible definitions of an Expert System is: “An Expert System is a computer program that possesses or represents knowledge in a particular domain, and has the capability of processing/manipulating or reasoning with this knowledge with a view to solving a problem or achieving some specific goal.”
The Artificial Intelligence programs written to achieve expert-level competence in solving problems of different domains are more generally called knowledge-based systems. A knowledge-based system is any system which performs a job or task by applying rules of thumb to a symbolic representation of knowledge, instead of employing mostly algorithmic or statistical methods. Often the term expert system is reserved for programs whose knowledge base contains the knowledge used by human experts, in contrast to knowledge gathered from textbooks or non-experts. But, more often than not, the two terms, expert systems and knowledge-based systems, are taken as synonyms. Together they represent the most widespread type of AI application.
One of the underlying assumptions in Artificial Intelligence is that intelligent behaviour can be
achieved through the manipulation of symbol structures (representing bits of knowledge). One of the
main issues in AI is to find appropriate representation of problem elements and available actions as
symbol structures so that the representation can be used to intelligently solve problems. In AI, an important criterion for knowledge representation schemes or languages is that they should support inference. For intelligent action, the inferencing capability is essential, in view of the fact that we can’t represent explicitly everything that the system might ever need to know; some things have to be left implicit, to be inferred/deduced by the system as and when needed in problem solving.
In general, a good knowledge representation scheme should have the following features:
It should allow us to express the knowledge we wish to represent in the language. For example, the mathematical statement “every symmetric and transitive relation on a domain need not be reflexive” is not expressible in First Order Logic.
It should allow new knowledge to be inferred from a basic set of facts, as discussed above.
It should have well-defined syntax and semantics.
Building an expert system is known as knowledge engineering, and its practitioners are called knowledge engineers. It is the job of the knowledge engineer to ensure that the computer has all the knowledge needed to solve a problem. The knowledge engineer must choose one or more forms in which to represent the required knowledge, i.e., s/he must choose one or more knowledge representation schemes.
A number of knowledge representation schemes, like predicate logic, semantic nets, frames, scripts and rule-based systems, exist, and we will discuss them in this unit.
6.1 OBJECTIVES
We know that Planning is the process that exploits the structure of the problem under consideration for designing a sequence of actions in order to solve that problem.
In order to plan a solution to a problem, one should have knowledge of the nature and the structure of the problem domain under consideration. For the purpose of planning, problem environments are divided into two categories, viz., classical planning environments and non-classical planning environments. The classical planning environments/domains are fully observable, deterministic, finite, static and discrete. On the other hand, non-classical planning environments may be only partially observable and/or stochastic. In this unit, we discuss planning only for classical environments.
Rather than representing knowledge in a declarative and somewhat static way (as a set of statements, each of which is true), rule-based systems represent knowledge in terms of a set of rules, each of which specifies the conclusion that may be reached or derived under given conditions or in different situations. A rule-based system consists of:
(i) a rule base, which is a set of IF-THEN rules,
(ii) a collection of facts, and
(iii) an interpreter of the facts and rules, which is the mechanism that decides which rule to apply, based on the set of available facts. The interpreter also initiates the action suggested by the rule selected for application.
In a forward chaining system, we start with the initial facts and keep using the rules to draw new intermediate conclusions (or take certain actions), given those facts. The process terminates when the final conclusion is established. In a backward chaining system, we start with some goal statement, which is intended to be established, and keep looking for rules that would allow us to conclude it, setting new sub-goals in the process of reaching the ultimate goal. In the next round, the subgoals become the new goals to be established. The process terminates when, in this process, all the subgoals are given facts. Forward chaining systems are primarily data-driven, while backward chaining systems are goal-driven. We will discuss each in detail.
Next, we discuss in detail some of the issues involved in a rule-based system.
Advantages of a Rule-base
A basic principle of rule-based systems is that each rule is an independent piece of knowledge. In an IF-THEN rule, the IF-part contains all the conditions for the application of the rule under consideration. The THEN-part tells the action to be taken by the interpreter. The interpreter need not search anywhere else except within the rule itself for the conditions required for application of the rule.
Disadvantages
The main problem with rule-based systems is that, when the rule-base grows and becomes very large, checking whether a new rule intended to be added is redundant, i.e., already covered by some of the earlier rules, becomes quite difficult. Still worse, as the rule-base grows, checking the consistency of the rule-base also becomes quite difficult. By consistency, we mean that there may be two rules having similar conditions, such that the actions suggested by the two rules conflict with each other.
Let us first define working memory, before we study forward and backward chaining systems.
Sources of Uncertainty
Two important sources of uncertainty in rule based systems are:
The theory of the domain may be vague or incomplete so the methods to generate exact or accurate
knowledge are not known.
Case data may be imprecise or unreliable and evidence may be missing or in conflict.
So, even though methods to generate exact knowledge may be known, they are impractical due to lack of data, imprecision of data, or problems related to data collection.
So rule-based deduction system developers often build some sort of certainty or probability computing procedure on top of the normal condition-action format of rules. Certainty computing procedures attach a probability between 0 and 1 to each assertion or fact. Each probability reflects how certain an assertion is, where a certainty factor of 0 indicates that the assertion is definitely false and a certainty factor of 1 indicates that the assertion is definitely true.
Example 1: In the example discussed below, the assertion (at-home ram) may have a certainty factor of, say, 0.7 attached to it.
Example 2: In MYCIN, a rule-based expert system (which we will discuss later), a rule in which statements linking evidence to hypotheses are expressed as decision criteria may look like:
The interpreter controls the application of the rules, given the working memory, thus controlling the system’s activity. It is based on a cycle of activity sometimes known as a recognize-act cycle. The system first checks to find all the rules whose condition parts are satisfied, i.e., those rules which are applicable, given the current state of working memory. (A rule is applicable if each of the literals in its antecedent, i.e., the condition part, can be unified with a corresponding fact using consistent substitutions. This restricted form of unification is called pattern matching.) It then selects one and performs the actions in the action part of the rule, which may involve the addition or deletion of facts. The actions result in a new, i.e., updated, working memory, and the cycle starts again. (When more than one rule is applicable, some sort of external conflict resolution scheme is used to decide which rule will be applied. But, when there are a large number of rules and facts, the number of unifications that must be tried becomes prohibitive.) This cycle is repeated until either there is no rule which fires, or the required goal is reached.
Rule-based systems vary greatly in their details and syntax; let us take the following example, in which we use forward chaining:
Example
Let us assume that the working memory initially contains the following facts :
(day monday)
(at-home ram)
(does-not-like ram)
R1 : IF (day monday)
THEN ADD to working memory the fact : (working-with ram)
R2 : IF (day monday)
THEN ADD to working memory the fact : (talking-to ram)
R3 : IF (talking-to X) AND (working-with X)
THEN ADD to working memory the fact : (busy-at-work X)
R4 : IF (busy-at-work X) OR (at-office X)
THEN ADD to working memory the fact : (not-at-home X)
R5 : IF (not-at-home X)
THEN DELETE from working memory the fact : (happy X)
R6 : IF (working-with X)
THEN DELETE from working memory the fact : (does-not-like X)
Now, to start the process of inference through forward chaining, the rule-based system will first search for all the rules whose antecedent parts are satisfied by the current set of facts in the working memory. In this example, we can see that rules R1 and R2 are satisfied, so the system will choose one of them using its conflict resolution strategies. Let rule R1 be chosen. So (working-with ram) is added to the working memory, which now looks like:
(working-with ram)
(day monday)
(at-home ram)
(does-not-like ram)
Now the cycle begins again: the system looks for rules that are satisfied, and it finds rules R2 and R6. Let the system choose rule R2. So now (talking-to ram) is added to the working memory, which now contains the following:
(talking-to ram)
(working-with ram)
(day monday)
(at-home ram)
(does-not-like ram)
Now in the next cycle, rule R3 fires, so now (busy-at-work ram) is added to working memory, which now
looks like:
(busy-at-work ram)
(talking-to ram)
(working-with ram)
(day monday)
(at-home ram)
(does-not-like ram)
Now the antecedent parts of rules R4 and R6 are satisfied. Let rule R4 fire, so (not-at-home ram) is added to the working memory, which now looks like:
(not-at-home ram)
(busy-at-work ram)
(talking-to ram)
(working-with ram)
(day monday)
(at-home ram)
(does-not-like ram)
In the next cycle, rule R5 fires, so (at-home ram) is removed from the working memory:
(not-at-home ram)
(busy-at-work ram)
(talking-to ram)
(working-with ram)
(day monday)
(does-not-like ram)
The forward chaining will continue like this. But we have to be aware of one thing: the ordering of the rule firings is important. A change in the ordering sequence of rule firings may result in a different working memory.
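The recognize-act cycle traced above can be expressed compactly in code. The following minimal Python sketch is only an illustration of this particular example: the encodings of rules and facts are our own, conflict resolution is simply ‘first applicable rule in textual order’, the (at-office X) OR-branch of R4 is dropped, and R5 is taken as deleting (at-home X), in line with the trace above:

# Minimal forward chaining sketch for the working-memory example above.
wm = {('day', 'monday'), ('at-home', 'ram'), ('does-not-like', 'ram')}

rules = [  # (name, conditions, action, template); 'X' is the pattern variable
    ('R1', [('day', 'monday')],                          'ADD',    ('working-with', 'ram')),
    ('R2', [('day', 'monday')],                          'ADD',    ('talking-to', 'ram')),
    ('R3', [('talking-to', 'X'), ('working-with', 'X')], 'ADD',    ('busy-at-work', 'X')),
    ('R4', [('busy-at-work', 'X')],                      'ADD',    ('not-at-home', 'X')),
    ('R5', [('not-at-home', 'X')],                       'DELETE', ('at-home', 'X')),
    ('R6', [('working-with', 'X')],                      'DELETE', ('does-not-like', 'X')),
]

def bindings(conds):
    """Yield values of X (or {} for ground rules) making every condition a fact in wm."""
    cands = [{}] if not any('X' in c for c in conds) else [{'X': v} for (_, v) in wm]
    for b in cands:
        if all(tuple(b.get(t, t) for t in c) in wm for c in conds):
            yield b

fired = True
while fired:                                  # one recognize-act cycle per iteration
    fired = False
    for name, conds, action, template in rules:
        for b in bindings(conds):
            fact = tuple(b.get(t, t) for t in template)
            if action == 'ADD' and fact not in wm:
                wm.add(fact); print(name, 'adds', fact); fired = True
            elif action == 'DELETE' and fact in wm:
                wm.discard(fact); print(name, 'deletes', fact); fired = True
            if fired:
                break                         # restart the cycle after each firing
        if fired:
            break
print(wm)                                     # final working memory

Running the sketch reproduces the trace above (R1, R2, R3, R4, R5) and then continues with R6, which deletes (does-not-like ram). Reordering the rules in the list may change the firing sequence and the resulting working memory, which is exactly the point made in the preceding paragraph.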
If we know what the conclusion would be, or have some specific hypothesis to test, forward chaining systems may be inefficient. In forward chaining, we keep on moving ahead until no more rules apply or we have added our hypothesis to the working memory. But, in the process, the system is likely to do a lot of additional and irrelevant work, adding uninteresting or irrelevant conclusions to the working memory. Let us say that, in the example discussed before, we want to find out whether “ram is at home”. We could repeatedly fire rules, updating the working memory and checking each time whether (at-home ram) is found in the new working memory. But maybe we have a whole batch of rules for drawing conclusions about what happens when ram is working, or what happens on Monday; we really don’t care about these, and would rather draw only the conclusions that are relevant to the goal.
This can be done by backward chaining from the goal state, or from some hypothesized state that we are interested in. This is essentially how Prolog works. Given a goal state to try and prove, for example (at-home ram), the system will first check to see if the goal matches the initial facts given. If it does, then that goal succeeds. If it doesn’t, the system will look for rules whose conclusions, i.e., actions, match the goal. One such rule will be chosen, and the system will then try to prove any facts in the preconditions of the rule using the same procedure, setting these as new goals to prove. We should note that a backward chaining system does not need to update a working memory. Instead, it needs to keep track of the goals it needs to prove in order to establish its main hypothesis. So we can say that, in a backward chaining system, the reasoning proceeds “backward”, beginning with the goal to be established, chaining through rules, and finally anchoring in facts.
Although, in principle, the same set of rules can be used for both forward and backward chaining, in practice we may choose to write the rules slightly differently for backward chaining. In backward chaining, we are concerned with matching the conclusion of a rule against some goal that we are trying to prove. So the ‘then’, or consequent, part of the rule is usually not expressed as an action to take (e.g., add/delete), but as a state which will be true if the premises are true.
To learn more, let us take a different example, in which we use backward chaining. (The system is used to identify an animal based on its properties stored in the working memory.)
Example
Let us assume that the working memory initially contains facts such as (has-hair raja), (long-pointed-teeth raja), (claws raja) and (big-mouth raja), and that the rule base contains antecedent-consequent rules including:
1. IF (gives-milk X)
THEN (mammal X)
2. IF (has-hair X)
THEN (mammal X)
Now, to start the process of inference through backward chaining, the rule-based system will first form a hypothesis and then use the antecedent-consequent rules (previously called condition-action rules) to work backward toward hypothesis-supporting assertions or facts.
Let us take the initial hypothesis that “raja is a lion” and then reason about whether this hypothesis is viable, using the backward chaining approach explained below:
The system searches a rule, which has the initial hypothesis in the consequent part that someone i.e.,
raja is a lion, which it finds in rule 8.
The system moves from the consequent to the antecedent part of rule 8 and finds the first condition, i.e., the first part of the antecedent, which says that “raja must be carnivorous”.
Next, the system searches for a rule whose consequent part declares that someone, i.e., raja, is carnivorous; two such rules are found, i.e., rule 3 and rule 4. We assume that the system tries rule 3 first.
To satisfy the consequent part of rule 3, which has now become the system’s new hypothesis, the system moves to the first part of the antecedent, which says that X, i.e., raja, has to be a mammal.
So a new sub-goal is created in which the system has to check that “raja is a mammal”. It does so by
hypothesizing it and tries to find a rule having a consequent that someone or X is a mammal. Again
the system finds two rules, rule 1 and rule 2. Let us assume that the system tries rule 1 first.
In rule 1, the system now moves to the first antecedent part, which says that X, i.e., raja, must give milk for it to be a mammal. The system cannot tell this, because this hypothesis is neither supported by any of the rules nor found among the existing facts in the working memory. So the system abandons rule 1 and tries to use rule 2 to establish that “raja is a mammal”.
In rule 2, it moves to the antecedent, which says that X, i.e., raja, must have hair for it to be a mammal. The system already knows this, as it is one of the facts in the working memory. So the antecedent part of rule 2 is satisfied, and the consequent, that “raja is a mammal”, is established.
Now the system backtracks to rule 3, whose first antecedent part is satisfied. In the second condition of the antecedent, it finds its new sub-goal and, in turn, forms a new hypothesis: that X, i.e., raja, eats meat. The system tries to find a supporting rule or an assertion in the working memory which says that “raja eats meat”, but it finds none. So the system abandons rule 3 and tries to use rule 4 to establish that “raja is carnivorous”.
In rule 4, the first part of antecedent says that raja must be a mammal for it to be carnivorous. The
system already knows that “raja is a mammal” because it was already established when trying to
satisfy the antecedents in rule 3.
The system now moves to second part of antecedent in rule 4 and finds a new sub-goal in which the
system must check that X i.e., raja has long-pointed-teeth which now becomes the new hypothesis.
This is already established as “ raja has long-pointed-teeth” is one of the assertions of the working
memory.
In the third part of the antecedent of rule 4, the system’s new hypothesis is that “raja has claws”. This is also already established, because it is also one of the assertions in the working memory.
Now, as all the parts of the antecedent of rule 4 are established, its consequent, i.e., “raja is carnivorous”, is established.
The system now backtracks to rule 8, where the second part of the antecedent says that X, i.e., raja, must have a big mouth, which now becomes the new hypothesis. This is already established, because the system has the assertion that “raja has a big mouth”.
Now, as the whole antecedent of rule 8 is satisfied, the system concludes that “raja is a lion”.
We have seen that the system was able to work backward through the antecedent-consequent rules, using desired conclusions to decide what assertions it should look for, and ultimately establishing the initial hypothesis.
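The goal-directed search just traced can also be sketched in a few lines. The rule and fact encodings below are a reconstruction of our own, including only the rules and facts actually used in the trace (the unit’s full rule list is larger), with the single pattern variable handled by simple substitution:

# Minimal backward chaining sketch for the animal identification example.
facts = {('has-hair', 'raja'), ('long-pointed-teeth', 'raja'),
         ('claws', 'raja'), ('big-mouth', 'raja')}

rules = [  # (consequent, antecedents); 'X' is the pattern variable
    (('mammal', 'X'),      [('gives-milk', 'X')]),                         # rule 1
    (('mammal', 'X'),      [('has-hair', 'X')]),                           # rule 2
    (('carnivorous', 'X'), [('mammal', 'X'), ('eats-meat', 'X')]),         # rule 3
    (('carnivorous', 'X'), [('mammal', 'X'), ('long-pointed-teeth', 'X'),
                            ('claws', 'X')]),                              # rule 4
    (('lion', 'X'),        [('carnivorous', 'X'), ('big-mouth', 'X')]),    # rule 8
]

def prove(goal, depth=0):
    """The goal succeeds if it is a known fact, or if some rule whose
    consequent matches it has all its antecedents provable."""
    print('  ' * depth + 'goal:', goal)
    if goal in facts:                        # anchored in the working memory
        return True
    individual = goal[1]
    for consequent, antecedents in rules:
        if consequent[0] == goal[0]:         # a rule concluding the goal's predicate
            subgoals = [(pred, individual) for pred, _ in antecedents]
            if all(prove(g, depth + 1) for g in subgoals):
                return True                  # all antecedents established
    return False

print(prove(('lion', 'raja')))               # True: established via rules 2, 4 and 8

The printed goal trace mirrors the steps listed above: rule 3 fails on (eats-meat raja), after which rule 4 establishes that raja is carnivorous, and rule 8 completes the proof.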
How do we choose between forward and backward chaining for a given problem?
Many rule-based deduction systems can chain either forward or backward, but which of these approaches is better for a given problem is the point under discussion.
First, let us learn some basic things about rules, i.e., how a rule relates its inputs (facts) to its outputs (conclusions). Whenever, in a rule base, a particular set of facts can lead to many conclusions, the rule base is said to have a high degree of fan out, and it is a strong candidate for backward chaining. On the other hand, whenever a particular hypothesis can lead to many questions to be answered for the hypothesis to be established, the rule base is said to have a high degree of fan in, and a high degree of fan in makes it a strong candidate for forward chaining.
To summarize, the following points should help in choosing the type of chaining for reasoning purposes:
If the set of facts, which we either already have or may establish, can lead to a large number of conclusions or outputs, but the number of ways or input paths to reach the particular conclusion in which we are interested is small, then the degree of fan out is more than the degree of fan in. In such a case, backward chaining is the preferred choice.
But, if the number of ways or input paths to reach the particular conclusion in which we are interested is large, while the number of conclusions that we can reach using the facts through the rules is small, then the degree of fan in is more than the degree of fan out. In such a case, forward chaining is the preferred choice.
For cases where the degrees of fan out and fan in are approximately the same, if not many facts are available and the problem is to check whether one of the many possible conclusions is true, backward chaining is the preferred choice.
Rule-based systems vary greatly in their details and syntax. When more than one rule is applicable in a cycle, a conflict resolution strategy is used to decide which rule to fire. Typical conflict resolution strategies include: firing the first applicable rule in textual order; preferring the most specific rule; and preferring the rule that matches the most recently added facts.
These strategies may help in getting reasonable behaviour from a forward chaining system, but the most important thing is how we write the rules. They should be carefully constructed, with the preconditions specifying as precisely as possible when different rules should fire. Otherwise, we will have little idea or control of what will happen.
Check Your Progress - 1
Exercise 1: In the “Animal Identifier System” discussed above, use forward chaining to try to identify the animal called “raja”.
6.3 SEMANTIC NETWORKS
For example, the fact (a piece of knowledge) “Mohan struck Nita in the garden with a sharp knife last week” is represented by the semantic network shown in Figure 1.1.
[Figure 1.1: Semantic network for the fact. The node ‘struck’ is linked to ‘strike’ by a past-of link; ‘strike’ has agent Mohan, object Nita, instrument knife, place garden and time last week; ‘sharp’ is a property of knife.]
As information in semantic networks is clustered together through relational links, the knowledge required for the performance of some task is generally available within a short spatial span of the semantic network. This type of knowledge organisation, in some ways, resembles the way knowledge is stored and retrieved by human beings.
Subclass and instance relations allow us to use inheritance to infer new facts/relations from the explicitly represented ones. We have already mentioned that the graphical portrayal of knowledge in semantic networks, being visual, is easier for human beings to comprehend than other representation schemes. This fact helps human beings to guide the expert system, whenever required. This is perhaps the reason for the popularity of semantic networks.
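Computationally, a semantic network is just a set of (node, relation, node) triples, and inheritance along instance and subclass links is a simple traversal of the graph. The following minimal Python sketch (the triples are our own encoding of a fragment of Figure 1.1 together with a small taxonomy; only the first inheritance link of a node is followed) illustrates inheritance-based retrieval:

# Minimal semantic-network sketch; the triples are illustrative.
triples = {
    ('struck', 'past_of', 'strike'),  ('strike', 'agent', 'Mohan'),
    ('strike', 'object', 'Nita'),     ('strike', 'instrument', 'knife'),
    ('knife', 'property', 'sharp'),
    ('Raja', 'instance', 'Lion'),     ('Lion', 'subclass', 'Mammal'),
    ('Mammal', 'warm_blooded', 'yes'),
}

def value(node, relation):
    """Look the relation up on the node, inheriting along instance/subclass links."""
    for s, r, o in triples:
        if s == node and r == relation:
            return o                           # found directly on the node
    for s, r, o in triples:
        if s == node and r in ('instance', 'subclass'):
            return value(o, relation)          # climb one inheritance link
    return None

print(value('Raja', 'warm_blooded'))           # 'yes', inherited via Lion -> Mammal
print(value('knife', 'property'))              # 'sharp', stored directly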
6.4 FRAMES
Frames are a variant of semantic networks, and are one of the popular ways of representing non-procedural knowledge in an expert system. In a frame, all the information relevant to a particular concept is stored in a single complex entity, called a frame. Frames resemble the record data structure. Frames support inheritance. They are often used to capture knowledge about typical objects or events, such as a car, or even a mathematical object like a rectangle. As mentioned earlier, a frame is a structured object, and different names, like Schema, Script, Prototype, and even Object, are used instead of frame in the computer science literature.
Mammal :
subclass : Animal
warm_blooded : yes
Lion :
subclass : Mammal
eating-habit : carnivorous
size : medium
Raja :
instance : Lion
colour : dull-yellow
owner : Amar Circus
Sheru :
instance : Lion
size : small
A particular frame (such as Lion) has a number of attributes or slots such as eating-habit and size. Each
of these slots may be filled with particular values, such as the eating-habit for lion may be filled up as
carnivorous.
Sometimes a slot contains additional information such as how to apply or use the slot values. Typically, a
slot contains information such as (attribute, value) pairs, default values, conditions for filling a slot,
pointers to other related frames, and also procedures that are activated when needed for different
purposes.
In the case of frame representation of knowledge, inheritance is simple if an object has a single parent class and if each slot takes a single value. For example, if a mammal is warm blooded, then automatically a lion, being a mammal, will also be warm blooded.
But in the case of multiple inheritance, i.e., in the case of an object having more than one parent class, we have to decide which parent to inherit from. For example, a lion may inherit from “wild animals” or “circus animals”. In general, both the slots and the slot values may themselves be frames, and so on.
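The frame hierarchy above maps naturally onto dictionaries with a parent link. The following minimal Python sketch (the slot names follow the Mammal/Lion/Raja example above; single inheritance only) shows how a frame-local slot value overrides an inherited one:

# Minimal frame sketch mirroring the Mammal/Lion/Raja/Sheru frames above.
frames = {
    'Animal': {},
    'Mammal': {'parent': 'Animal', 'warm_blooded': 'yes'},
    'Lion':   {'parent': 'Mammal', 'eating-habit': 'carnivorous', 'size': 'medium'},
    'Raja':   {'parent': 'Lion',   'colour': 'dull-yellow', 'owner': 'Amar Circus'},
    'Sheru':  {'parent': 'Lion',   'size': 'small'},
}

def slot(frame, name):
    """Return the slot value, searching up the parent chain if absent locally."""
    while frame in frames:
        if name in frames[frame]:
            return frames[frame][name]        # the local value wins
        frame = frames[frame].get('parent')   # otherwise inherit from the parent
    return None

print(slot('Raja', 'eating-habit'))    # 'carnivorous', inherited from Lion
print(slot('Sheru', 'size'))           # 'small': the local value overrides Lion's 'medium'
print(slot('Raja', 'warm_blooded'))    # 'yes', inherited from Mammal

With multiple parents, the only change needed is that 'parent' becomes a list, and slot() must fix a search order over the parents; this is precisely the which-parent-to-inherit-from decision mentioned above.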
Frame systems are fairly complex and sophisticated knowledge representation tools. This representation has become so popular that special high-level frame-based representation languages have been developed. Most of these languages use LISP as the host language. It is also possible to represent frame-like structures using object-oriented programming languages, including extensions to the programming language LISP.
6.5 SCRIPTS
A script is a structured representation describing a stereotyped sequence of events in a particular context. Scripts are used in natural language understanding systems to organize a knowledge base in terms of the situations that the system should understand. Scripts use a frame-like structure to represent commonly occurring experiences, like going to the movies, eating in a restaurant, shopping in a supermarket, or visiting an ophthalmologist.
Thus, a script is a structure that prescribes a set of circumstances that could be expected to follow on from one another.
Scripts are beneficial because:
Events tend to occur in known runs or patterns.
A causal relationship between events exists.
An entry condition exists which allows an event to take place.
Prerequisites exist for events taking place.
Components of a script
The components of a script include:
Entry condition: These are basic condition which must be fulfilled before events in the script
can occur.
Results: Condition that will be true after events in script occurred.
Props: Slots representing objects involved in events
Roles: These are the actions that the individual participants perform.
Track: Variations on the script. Different tracks may share components of the same scripts.
Scenes: The sequence of events that occur.
In describing a script, special symbols are used for the primitive actions that make up its events.
6.6 SUMMARY
This unit discussed the various knowledge representation mechanisms used in Artificial Intelligence. The unit began with a discussion of Rule Based Systems and the related concepts of Forward chaining and Backward chaining; later, the concept of Conflict resolution was discussed. The unit also covered other techniques of knowledge representation, such as Semantic nets, Frames and Scripts, along with relevant examples for each.
6.7 SOLUTIONS/ANSWERS
Check Your Progress – 1
Exercise 1: Refer to section 6.2
Check Your Progress – 2
Exercise 2: Refer to section 6.3
Check Your Progress – 3
Exercise 3: Refer to section 6.4
6.8 FURTHER READINGS
1. Ela Kumar, "Artificial Intelligence", IK International Publications
2. E. Rich and K. Knight, "Artificial Intelligence", Tata McGraw Hill Publications
3. N.J. Nilsson, "Principles of AI", Narosa Publishing House
4. John J. Craig, "Introduction to Robotics", Addison Wesley
5. D.W. Patterson, "Introduction to AI and Expert Systems", Pearson
UNIT 7 PROBABILISTIC REASONING
Structure
7.0 Introduction
7.1 Objectives
7.2 Reasoning with Uncertain Information
7.3 Review of Probability Theory
7.4 Introduction to Bayesian Theory
7.5 Bayes Networks
7.6 Probabilistic Inference
7.7 Basic Idea of Inferencing with Bayes Networks
7.8 Other Paradigms of Uncertain Reasoning
7.9 Dempster-Shafer Theory
7.10 Summary
7.11 Solutions/Answers
7.12 Further Readings
7.0 INTRODUCTION
This unit is dedicated to probability theory and its use in decision making for various problems. Contrary to classical decision making, where propositions are simply True or False, here decisions are made using the probability that a proposition is true. The inclusion of such a probabilistic approach is quite relevant, since uncertainty is quite common in the real world.
As we know, the probability of an uncertain event E is basically a measure of the degree of likelihood of the occurrence of E. Let the set of all possible outcomes be represented as the sample space S. The measure of probability is a function P() mapping each event E from the sample space S to a real number and satisfying the following conditions:
(i) 0 ≤ P(E) ≤ 1, for every event E;
(ii) P(S) = 1;
(iii) For Ei ∩ Ej = ϕ for all i ≠ j (the Ei are mutually exclusive), P(E1 ∪ E2 ∪ ...) = P(E1) + P(E2) + ...
Using the above mentioned three conditions, we can derive the basic laws of probability. It is also to be noted that these three conditions alone are not enough to compute the probability of an outcome; this additionally requires the collection of experimental data for estimating the underlying distribution.
7.1 OBJECTIVES
After going through this unit, you should be able to:
Trials, Sample Space, Events : You must have often observed that a random experiment may comprise a
series of smaller sub-experiments. These are called trials. Consider for instance the following situations.
Example 1: Suppose the experiment consists of observing the results of three successive tosses of a coin. Each toss is a trial, and the experiment consists of three trials, so that it is completed only after the third toss (trial) is over.
Example 2: Suppose from a lot of manufactured items, ten items are chosen successively following a
certain mechanism for checking. The underlying experiment is completed only after the selection of the
tenth item is completed; the experiment obviously comprises 10 trials.
Example 3: If you consider Example 1 once again, you would notice that each toss (trial) results in either a head (H) or a tail (T). In all there are 8 possible outcomes of the experiment, viz., s1 = (H,H,H), s2 = (H,H,T), s3 = (H,T,H), s4 = (T,H,H), s5 = (T,T,H), s6 = (T,H,T), s7 = (H,T,T) and s8 = (T,T,T).
Let ζ be a fixed sample space. We have already defined an event as a collection of sample points from ζ.
Imagine that the (conceptual) experiment underlying ζ is being performed. The phrase "the event E
occurs" would mean that the experiment results in an outcome that is included in the event E. Similarly,
non-occurrence of the event E would mean that the experiment results in an outcome that is not an element of the event E. Thus, the collection of all sample points that are not included in the event E is also an event, which is complementary to E and is denoted as Ec. The event Ec is therefore the event which contains all those sample points of ζ which are not in E. As such, it is easy to see that the event E occurs if and only if the event Ec does not take place. The events E and Ec are complementary events, and taken together they comprise the entire sample space, i.e., E ∪ Ec = ζ.
You may recall that ζ is an event which consists of all the sample points. Hence, its complement is an
empty set in the sense that it does not contain any sample point and is called the null event, usually
denoted as ø so that ζc = ø.
Let us once again consider Example 3. Consider the event E that the three tosses produce at least one head. Thus, E = {s1, s2, s3, s4, s5, s6, s7}, so that the complementary event Ec = {s8}, which is the event of not scoring a head at all. Again, in an experiment of selecting marbles without replacement, the event that the white marble is picked up at least once is defined as E = {(r1,w), (r2,w), (w,r2), (w,r1)}. Hence, Ec = {(r1,r2), (r2,r1)}, i.e., the event of not picking the white marble at all.
Let us now consider two events E and F. We write E ∪ F, read as E "union" F, to denote the collection of sample points which are responsible for the occurrence of either E or F or both. Thus, E ∪ F is a new event, and it occurs if and only if either E or F or both occur, i.e., if and only if at least one of the events E or F occurs. Generalizing this idea, we can define a new event E1 ∪ E2 ∪ ... ∪ Ek, read as the "union" of the k events E1, E2, ..., Ek, as the event which consists of all sample points that are in at least one of the events E1, E2, ..., Ek; it occurs if and only if at least one of the events E1, E2, ..., Ek occurs.
Again, let E and F be two given events. We write E ∩ F, read as E "intersection" F, to denote the collection of sample points whose occurrence implies the occurrence of both E and F. Thus, E ∩ F is a new event, and it occurs if and only if both the events E and F occur. Generalizing this idea, we can define a new event E1 ∩ E2 ∩ ... ∩ Ek, read as the "intersection" of the k events E1, E2, ..., Ek, as the event which consists of the sample points that are common to each of the events E1, E2, ..., Ek; it occurs only if all the k events E1, E2, ..., Ek occur simultaneously. Further, two events E and F are said to be mutually exclusive or disjoint if they do not have a common sample point, i.e., E ∩ F = ø.
Two mutually exclusive events then cannot occur simultaneously. In the coin-tossing experiment for
instance, the two events, heads and tails, are mutually exclusive: if one occurs, the other cannot occur. To
have a better understanding of these events, let us once again look at Example 3. Let E be the event of scoring an odd number of heads and F be the event that a tail appears in the first two tosses, so that E = {s1, s5, s6, s7} and F = {s5, s8}. Now E ∩ F = {s5}, the event that only the third toss yields a head. Thus events E and F are not mutually exclusive.
The above relations between events can be best viewed through a Venn diagram. A rectangle is drawn to
represent the sample space ζ. All the sample points are represented within the rectangle by means of
points. An event is represented by the region enclosed by a closed curve containing all the sample points
leading to that event. The space inside the rectangle but outside the closed curve representing E represents the complementary event Ec (see Fig. 1(a)). Similarly, in Fig. 1(b), the space inside the curve drawn with the broken line represents the event E ∪ F, and the shaded portion represents E ∩ F.
As is clear by now, the outcome of a random experiment being uncertain, none of the various events
associated with a sample space can be predicted with certainty before the underlying experiment is
performed and the outcome of it is noted. However, some events may intuitively seem to be more likely
than the rest. For example, talking about human beings, the event that a person will live 20 years seems to
be more likely compared to the event that the person will live 200 years. Such thoughts motivate us to
explore if one can construct a scale of measurement to distinguish between likelihoods of various events.
Towards this, a small but extremely significant fact comes to our help. Before we elaborate on this, we
need a couple of definitions.
Consider an event E associated with a random experiment; suppose the experiment is repeated n times
under identical conditions and suppose the event E (which is not likely to occur with every performance
of the experiment) occurs fn(E) times in these n repetitions. Then, fn(E) is called the frequency of the event E in n repetitions of the experiment, and rn(E) = fn(E)/n is called the relative frequency of the event E in n repetitions of the experiment. Let us consider the following example.
Example 4: Consider the experiment of throwing a coin. Suppose we repeat the process of throwing a
coin 5 times and suppose the frequencies of occurrence of head is tabulated below in Table-1:
No. of repetitions (n)    Frequency of head fn(H)    Relative frequency of head rn(H)
1 0 0
2 1 1/2
3 2 2/3
4 3 3/4
5 3 3/5
Notice that the third column in Table 1 gives the relative frequencies rn(H) of heads. We can keep on increasing the number of repetitions n and continue calculating the values of rn(H) in Table 1. Merely to fix ideas regarding the concept of probability of an event, we have presented a very naive approach which is in no way rigorous, but it helps to see things better at this stage.
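To see the long-run behaviour of the relative frequency concretely, here is a small illustrative Python simulation (ours; the seed and repetition counts are arbitrary) which tosses a fair coin n times and prints rn(H) = fn(H)/n; the values settle near 0.5 as n grows.

import random

random.seed(42)                        # arbitrary seed, for reproducibility
heads = 0
for n in range(1, 10001):
    if random.random() < 0.5:          # model one toss of a fair coin
        heads += 1
    if n in (10, 100, 1000, 10000):
        print(n, heads / n)            # relative frequency r_n(H)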
Check Your Progress- 1
Problem 1: In each of the following exercises, an experiment is described. Specify the relevant sample spaces:
a) A machine manufactures a certain item. An item produced by the machine is tested to determine whether or not it is defective.
b) An urn contains six balls, which are colored differently. A ball is drawn from the urn and its color is noted.
c) An urn contains ten cards numbered 1 through 10. A card is drawn, its number noted and the card is replaced. Another card is drawn and its number is noted.
Problem 2: Suppose a six-faced die is thrown twice. Describe each of the following events:
i) The maximum score is 6.
ii) The total score is 9.
iii) Each throw results in an even score.
iv) Each throw results in an even score larger than 2.
v) The scores on the two throws differ by at least 2.
Example 5: Suppose E and F are such that F ⊆ E, so that the occurrence of F would automatically imply the occurrence of E. Thus, with the information that the event F has taken place in view, it is plausible to assign probability 1 to the occurrence of E, irrespective of its original probability.
Example 6: Suppose E and F are two mutually exclusive events, and thus they cannot occur together.
Thus whenever we come to know that the event F has taken place, we can rule out the occurrence of E.
Therefore, in such a situation, it will be appropriate to assign probability 0 to the occurrence of E.
Example 7: Suppose a pair of balanced dice A and B are rolled simultaneously, so that each of the 36 possible outcomes is equally likely to occur and hence has probability 1/36. Let E be the event that the sum of the two scores is 10 or more and F be the event that exactly one of the two scores is 5.
Then E = {(4,6), (5,5), (5,6), (6,4), (6,5), (6,6)}, so that P(E) = 6/36 = 1/6.
Also, F = {(1,5), (2,5), (3,5), (4,5), (6,5), (5,1), (5,2), (5,3), (5,4), (5,6)}.
Now suppose we are told that the event F has taken place (note that this is only partial information
relating to the outcome of the experiment). Since each of the outcome originally had the same probability
of occurring, they should still have equal probabilities. Thus, given that exactly one of the two scores is 5, each of the 10 outcomes of the event F has probability 1/10, while the probability of each of the remaining 26 points in the sample space is 0.
In the light of the information that the event F has taken place the sample points (4,6), (6,4), (5,5) and
(6,6) in the event E must not have materialized. One of the two sample points (5,6) or (6,5) must have
materialized. Therefore the probability of E would no longer be 1/6. Since all the 10 sample points in F
are equally likely, the revised probability of E given the occurrence of F, which occurs through the
materialization of one of the two sample points (6,5) or (5,6) should be 2/10 = 1/5.
The probability just obtained is called the conditional probability that E occurs given that F has occurred
and is denoted by P(E|F). We shall now derive a general formula for calculating P(E|F).
Table 2
Events    E    Ec
F         p    q
Fc        r    s
In Table 2, P(E ∩ F) = p, P(Ec ∩ F) = q, P(E ∩ Fc) = r and P(Ec ∩ Fc) = s. Hence, P(E) = P((E ∩ F) ∪ (E ∩ Fc)) = P(E ∩ F) + P(E ∩ Fc) = p + r, and similarly, P(F) = p + q.
Now suppose that the underlying random experiment is being repeated a large number of times, say N times. Thus, taking a cue from the long term relative frequency interpretation of probability, the approximate number of times the event F is expected to take place will be NP(F) = N(p + q). Under the condition that the event F has taken place, the number of times the event E is expected to take place
would be NP(E ∩ F) as both E and F must occur simultaneously. Thus, the long term relative frequency
of E under the condition of occurrence of F, i.e. the probability of occurrence of E under the condition of
occurrence of F, should be NP(E ∩ F)/NP(F) = P(E ∩ F)/P(F). This is the proportion of times E occurs
out of the repetitions where F takes place. With the above background, we are now ready to define
formally the conditional probability of an event given another.
Definition: Let E and F be two events from a sample space ζ. The conditional probability of the event E
given the event F, denoted by P(E|F), is defined as P(E|F) = P(E ∩ F)/P(F), whenever P(F) > 0.
When P(F) = 0, we say that P(E|F) is undefined. From this definition, we can also write P(E ∩ F) = P(E|F)P(F).
Referring back to Example 7, we see that P(E) = 6/36 and P(F) = 10/36; since E ∩ F = {(5,6), (6,5)}, P(E ∩ F) = 2/36, so P(E|F) = (2/36)/(10/36) = 2/10 = 1/5, which is the same as that obtained in Example 7. The relation P(E ∩ F) = P(E|F)P(F) can be generalized to k events E1, E2, ..., Ek, where k > 2, as P(E1 ∩ E2 ∩ ... ∩ Ek) = P(E1)P(E2|E1) ... P(Ek|E1 ∩ ... ∩ Ek-1).
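As a quick cross-check, the following Python sketch (ours) recomputes P(E), P(F) and P(E|F) for Example 7 by brute-force enumeration of the 36 equally likely outcomes.

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))          # all 36 equally likely pairs
E = {o for o in outcomes if o[0] + o[1] >= 10}           # sum of the two scores is 10 or more
F = {o for o in outcomes if (o[0] == 5) != (o[1] == 5)}  # exactly one of the scores is 5

P = lambda event: len(event) / len(outcomes)
print(P(E))                 # 6/36  = 0.1667
print(P(E & F) / P(F))      # (2/36)/(10/36) = 0.2, i.e. P(E|F) = 1/5

And now, an exercise for you.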
Problem 1: In a class, three students tossed one coin each 3 times. Write down all the possible outcomes which can be obtained in this experiment.
Problem 2: In Problem 1, what is the probability of getting 2 or more heads at a time? Also write the probability of getting three tails at a time.
Joint Probability: This refers to the probability of two or more events occurring simultaneously, e.g., P(A and B) or P(A, B).
Marginal Probability: It is the probability of an event occurring irrespective of the outcomes of the other random variables, e.g., P(A).
The conditional probability can also be written in terms of the joint probability as P(A|B) = P(A, B)/P(B). In other words, if one conditional probability is given, the other can be calculated as P(A|B) = P(B|A)·P(A)/P(B).
Let S be a sample space under consideration, and let A1, A2, ..., An be a set of mutually exclusive and exhaustive events in S. Let B be an event from the sample space S with P(B) > 0. Then, according to Bayes' theorem,
P(Ai|B) = P(B|Ai)P(Ai) / [P(B|A1)P(A1) + P(B|A2)P(A2) + ... + P(B|An)P(An)], for i = 1, 2, ..., n.
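A minimal numeric sketch of the theorem in Python follows; the priors and likelihoods are invented purely for illustration.

# Bayes' theorem over a partition A1, A2, A3 of the sample space.
priors = [0.5, 0.3, 0.2]          # P(A1), P(A2), P(A3) -- hypothetical values
likelihoods = [0.1, 0.4, 0.8]     # P(B|A1), P(B|A2), P(B|A3) -- hypothetical values

p_b = sum(p * l for p, l in zip(priors, likelihoods))    # total probability P(B)
posteriors = [p * l / p_b for p, l in zip(priors, likelihoods)]
print(posteriors)                 # the P(Ai|B); the three values sum to 1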
While representing a Bayes network graphically, nodes represent the probability distributions of random variables. The edges in the graph represent the relationships among random variables. The key benefits of a Bayes network are model visualization, explicit relationships among random variables, and the computation of complex probabilities.
Example 8: Let us now create a Bayesian network for an example problem. Let us consider three random variables A, B and C. It is given that A is dependent on B, and C is dependent on B. The conditional dependence can be stated as P(A|B) and P(C|B) for the two given statements respectively. Similarly, the conditional independence of A and C given B means that P(A|B, C) = P(A|B), i.e., A is unaffected by C once B is known. Also, the joint probability of A and C given B can be written as the product of conditional probabilities: P(A, C|B) = P(A|B) * P(C|B). Now, using Bayes' theorem, the joint probability P(A, B, C) can be written as P(A, B, C) = P(A|B) * P(C|B) * P(B).
The corresponding graph is shown below in Figure 1. Here each random variable is represented as a node, and the edges between nodes carry the conditional probabilities.
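A small Python sketch of this factorisation is given below; the conditional probability tables are invented for illustration.

# Joint probability of the network A <- B -> C: P(A,B,C) = P(A|B)*P(C|B)*P(B).
p_b = {True: 0.6, False: 0.4}             # P(B) -- hypothetical
p_a_given_b = {True: 0.7, False: 0.2}     # P(A = true | B) -- hypothetical
p_c_given_b = {True: 0.9, False: 0.3}     # P(C = true | B) -- hypothetical

def joint(a, b, c):
    pa = p_a_given_b[b] if a else 1 - p_a_given_b[b]
    pc = p_c_given_b[b] if c else 1 - p_c_given_b[b]
    return pa * pc * p_b[b]

total = sum(joint(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False))
print(joint(True, True, True), total)     # 0.378, and the eight cases sum to 1.0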
When an experiment is repeated a large number of times (say n), the above expression can be given a frequency interpretation. Let the number of occurrences of an event F be represented as No.(F), and the number of occurrences of the joint event of E and F as No.(E ∩ F). The relative frequencies of these two events can be computed as:
fn(E ∩ F) = No.(E ∩ F)/n, and similarly, fn(F) = No.(F)/n.
Here, if n is large, the ratio of the above two expressions, No.(E ∩ F)/No.(F), represents the proportion of times the event E occurs relative to the occurrence of F. This can be understood as an approximation to the conditional probability of E given F.
We can also write the conditional probability of event F, given that event E has already occurred, as
P(F|E) = P(E|F)P(F)/P(E)
The above expression is one form of Bayes' rule. The notion is simple: the probability of the event F occurring, given that E has occurred, can be computed from the probability of E occurring given F, together with the individual probabilities P(F) and P(E).
The other ways of dealing with uncertainty are ones with no theoretical proof. These are mostly based on intuition, and they are selected over formal methods as a pragmatic solution to a particular problem when the formal methods impose difficult or impossible conditions. One such ad hoc procedure is used in MYCIN, a system that diagnoses meningitis and infectious blood diseases. MYCIN uses if-then rules to assess various forms of patient evidence, and it measures both belief and disbelief to represent the degree of confirmation and disconfirmation, respectively, in a given hypothesis. Ad hoc methods have been used in a larger number of knowledge-based systems than formal methods. This is due to the difficulty of acquiring a large number of reliable probabilities for the given domain and to the complexity of the ensuing calculations.
Another paradigm is to use heuristic reasoning methods. These are based on the use of procedures, rules and other forms of encoded knowledge to achieve specified goals under uncertainty. Using both domain-specific and general heuristics, one of several alternative conclusions may be chosen through the strength of positive versus negative evidence, presented in the form of justifications or endorsements. An in-depth and detailed discussion of this is beyond the scope of this unit/course.
Let us now define a few terms used in D-S theory which will be useful for us.
7.9.1 Evidence
These are events related to one hypothesis or a set of hypotheses. Here, a relation is not permitted between various pieces of evidence or sets of hypotheses. Also, the relation between the set of hypotheses and a piece of evidence is only quantified by a source of data. In the context of D-S theory, we have four types of evidence, as follows:
a) Consonant Evidence: The subsets appear in a nested structure, where each subset is included in the next bigger subset, and so on. With each increase in subset size, the information refines the evidentiary set over time.
b) Consistent Evidence: This assures the presence of at least one common element to all the subsets.
c) Arbitrary Evidence: A situation where there is no element common to all the subsets, though some of the subsets may have a few common element(s).
d) Disjoint Evidence: There is no subset having common elements.
All these four evidence types can be understood by looking at Figure 2 (a-d) below.
Figure 2 (a-d): The four types of evidence
The source of information can be an entity or person giving some relevant state information. Here, the information source is assumed to be an unbiased source of information. The information received from such sources is combined to provide more reliable information for further use. D-S theory models are able to handle varying precision in the information, and hence no additional assumptions are needed to represent the information.
Example 9: Let θ = {a, b, c}; then the power set is 2^θ = {∅, {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}}. The information source assigned the m-values as m({a}) = 0.2, m({c}) = 0.1 and m({a,b}) = 0.4. The three mentioned subsets are the focal elements; the remaining mass, 1 - 0.7 = 0.3, is assigned to the whole frame θ, so that the masses sum to 1.
Example 10: While assessing the grades of a class of 100 students, two of the class teachers assessed the overall result as follows. The first teacher assessed that 40 students will get grade A and 20 students will get grade B amongst the 60 students he interviewed, whereas the second teacher stated that 30 students will get grade A and 30 students will get either A or B amongst the 60 students he interviewed. To combine both pieces of evidence into the resultant evidence, we do the following calculations. Here the frame of
discernment contains the grades A and B, and the basic probability assignments are m1(A) = 0.4, m1(B) = 0.2, m1(θ) = 0.4 from the first teacher, and m2(A) = 0.3, m2(A, B) = 0.3, m2(θ) = 0.4 from the second.
D-S Rule of Combination: Table 3 shows the combination of the concordant evidence using D-S theory. Each cell carries the product of the corresponding masses, assigned to the intersection of the two focal elements; an empty intersection (here A ∩ B = ∅) contributes its mass to the conflict.

                 m1(A) = 0.4      m1(B) = 0.2      m1(θ) = 0.4
m2(A) = 0.3      m1-2(A)  0.12    m1-2(∅)  0.06    m1-2(A)   0.12
m2(A,B) = 0.3    m1-2(A)  0.12    m1-2(B)  0.06    m1-2(A,B) 0.12
m2(θ) = 0.4      m1-2(A)  0.16    m1-2(B)  0.08    m1-2(θ)   0.16

Normalizing by 1 - k = 1 - 0.06 = 0.94 gives the combined masses
m1-2(A) = 0.52/0.94 = 0.553, m1-2(B) = 0.14/0.94 = 0.149, m1-2(A, B) = 0.12/0.94 = 0.128 and m1-2(θ) = 0.16/0.94 = 0.170, which sum to 1.
Pl1-2(A, B) = m1-2(A) + m1-2(B) + m1-2(A, B) + m1-2(θ) = 0.553 + 0.149 + 0.128 + 0.170 = 1.0
According to the rule of combination, the concluded range is that 55 to 85 students will get grade A, since Bel(A) = m1-2(A) = 0.553 and Pl(A) = m1-2(A) + m1-2(A, B) + m1-2(θ) = 0.851.
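The combination above can be reproduced in a few lines of Python. The sketch below (ours) implements Dempster's rule generically, with focal elements represented as frozensets. One assumption is ours: the frame is taken to contain a third grade C, so that the focal element {A, B} stays distinct from the whole frame θ, as the worked figures require.

from itertools import product

theta = frozenset({"A", "B", "C"})        # assumed frame; {A, B} is then a proper subset
m1 = {frozenset({"A"}): 0.4, frozenset({"B"}): 0.2, theta: 0.4}
m2 = {frozenset({"A"}): 0.3, frozenset({"A", "B"}): 0.3, theta: 0.4}

combined, conflict = {}, 0.0
for (x, px), (y, py) in product(m1.items(), m2.items()):
    z = x & y
    if z:
        combined[z] = combined.get(z, 0.0) + px * py
    else:
        conflict += px * py               # mass falling on the empty intersection

k = 1.0 - conflict                        # normalisation constant, here 0.94
for z in combined:
    combined[z] = round(combined[z] / k, 3)
print(combined)     # {A}: 0.553, {B}: 0.149, {A,B}: 0.128, theta: 0.170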
Problem 1: Differentiate between joint, marginal and conditional probability, with an example of each.
Problem 3: What are the different types of evidence? Give a suitable example of each.
7.10 SUMMARY
This unit discussed reasoning with uncertain information, which involves a review of probability theory and an introduction to Bayesian theory. The unit also covered the concept of Bayes networks, which is later used for the purpose of inferencing. Finally, the unit discussed other paradigms of uncertain reasoning, including Dempster-Shafer theory.
7.11 SOLUTIONS/ANSWERS
Check Your Progress- 1
Problem -1. In each of the following exercises, an experiment is described. Specify the relevant sample
spaces:
a) A machine manufactures a certain item. An item produced by the machine is tested to
determine whether or not it is defective.
b) An urn contains six balls, which are colored differently. A ball is drawn from the urn and its
color is noted.
c) An urn contains ten cards numbered 1 through 10. A card is drawn, its number noted and the
card is replaced. Another card is drawn and its number is noted.
Solution - *Please refer to section 7.3 to answer these problems.
Problem 2. Suppose a six-faced die is thrown twice. Describe each of the following events:
i) The maximum score is 6.
ii) The total score is 9.
iii) Each throw results in an even score.
iv) Each throw results in an even score larger than 2.
v) The scores on the two throws differ by at least 2.
Solution - *Please refer to section 7.3 to answer these problems.
Check Your Progress 2
Problem 1: In a class, three students tossed one coin each 3 times. Write down all the possible outcomes which can be obtained in this experiment.
Solution - *Please refer to example 4 and section 7.3 to solve these problems
Problem 2: In Problem 1, what is the probability of getting 2 or more heads at a time? Also write the probability of getting three tails at a time.
Solution - *Please refer to example 4 and section 7.3 to solve these problems
Problem 3: What are the different types of evidence? Give a suitable example of each.
Solution - *Please refer to section 7.9 and example 10 to answer these problems.
UNIT 8 FUZZY AND ROUGH SETS
Structure
8.0 Introduction
8.1 Objectives
8.2 Fuzzy Systems
8.3 Introduction to Fuzzy Sets
8.4 Fuzzy Set Representation
8.5 Fuzzy Reasoning
8.6 Fuzzy Inference
8.7 Rough Set Theory
8.8 Summary
8.9 Solutions/Answers
8.10 Further Readings
8.0 INTRODUCTION
In the earlier units, we discussed PL and FOPL systems for making inferences and
solving problems requiring logical reasoning. However, these systems assume that the
domain of the problems under consideration is complete, precise and consistent. But,
in the real world, the knowledge of the problem domains is generally neither precise
nor consistent and is hardly complete.
In this unit, we discuss a number of techniques and formal systems that attempt to handle some of these blemishes. To begin with, we discuss fuzzy systems, which attempt to handle imprecision in knowledge bases, especially imprecision due to the use of natural language words like hot, good, tall, etc.
8.1 OBJECTIVES
After going through this unit, you should be able to:
enumerate various formal methods, which deal with different types of blemishes
like incompleteness, imprecision and inconsistency in a knowledge base;
discuss, why fuzzy systems are required;
discuss, develop and use fuzzy arithmetic tools in solving problems, the
descriptions of which involve imprecision;
discuss default reasoning as a tool for handling incompleteness of knowledge;
discuss Closed World Assumption System, as another tool for handling
incompleteness of knowledge, and
discuss and use non-deductive inference rules like abduction and induction, as
tools for solving problems from everyday experience.
8.2 FUZZY SYSTEMS
In the symbolic Logic systems like, PL and FOPL, that we have studied so far, any
(closed) formula has a truth-value which must be binary, viz., True or False.
However, in our everyday experience, we encounter problems whose descriptions involve certain words because of which it is not possible to assign a truth value, True or False, to statements of situations. For example, consider the statement:
If the water is too hot, add normal water to make it comfortable for taking a bath.
In the above statement, for a number of words/phrases including ‘too hot’ ‘add’,
‘comfortable’ etc., it is not possible to tell when exactly water is too hot, when water
is (at) normal (temperature), when exactly water is comfortable for taking a bath.
For example, we cannot tell the temperature T such that, for water at temperature T or less, the truth value False can be associated with the statement 'Water is too hot', while at the same time the truth value True can be associated with the same statement when the temperature of the water is, say, T + 1 or T + 2 degrees.
Healthy Person: we cannot even enumerate all the parameters that determine health.
Further, it is even more difficult to tell for what value of a particular parameter, one is
healthy or otherwise.
Old/young person: It is not possible to tell exactly up to what age one is young, such that by the addition of just one day to the age, one becomes old. We age gradually; aging is a continuous process.
Sweet Milk: Add one small sugar cube at a time to a glass of milk, and go on adding up to, say, 100 small cubes. Initially, without sugar, we may take the milk as not sweet. However, with the addition of each small sugar cube, the sweetness gradually increases. It is not possible to say that after the addition of 100 small cubes of sugar the milk becomes sweet, while until the addition of 99 small cubes it was not sweet.
Pool, Pond, Lake,….., Sea, Ocean: for different sized water bodies, we can not say
when exactly a pool becomes a pond, when exactly a pond becomes a lake and so on.
One of the reasons, for this type of problem of our inability to associate one of the
two-truth values to statements describing everyday situations, is due to the use of
natural language words like hot, good, beautiful etc. Each of these words does not
denote something constant, but is a sort of linguistic variable. The context of a
particular usage of such a word may delimit the scope of the word as a linguistic
variable. The range of values, in some cases, for some phrases or words, may be very
large as can be seen through the following three statements:
Dinosaurs ruled the earth for a long period (about millions of years)
It has not rained for a long period (say about six months).
I had to wait for the doctor for a long period (about six hours).
Fuzzy theory provides means to handle such situations. A Fuzzy theory may be
thought as a technique of providing ‘continuization’ to the otherwise binary
disciplines like Set Theory, PL and FOPL.
Further, we explain how, using fuzzy concepts and rules in situations like the ones quoted below, we human beings solve problems despite ambiguity in language.
Let us recall the case of crossing a road discussed in Unit 1 of Block 1. We mentioned that a step-by-step method of crossing a road may consist of
(i) Knowing (exactly) the distances of various vehicles from the path to be
followed to cross over.
(ii) Knowing the velocities and accelerations of the various vehicles moving on the
road within a distance of, say, one kilometer.
(iii) Using Newton's laws of motion and their derivatives like s = ut + (1/2)at², and calculating the time that would be taken by each of the various vehicles to reach the path intended to be followed to cross over.
(iv) Adjusting dynamically our speeds on the path so that no collision takes place
with any of the vehicle moving on the road.
But, we know that human beings not only do not follow the above precise method, but cannot follow it. We human beings rather feel more comfortable with fuzziness than with precision. We feel comfortable if the instruction for crossing a road is given as follows:
Look on both your left-hand and right-hand sides, particularly, in the beginning, to your right-hand side. If there is no vehicle within a reasonable distance, then attempt to cross the road. You may have to retreat while crossing, from somewhere on the road. Then, try again.
The above instruction has a number of words like left, right (it may be 45° to the right or 90° to the right) and reasonable, each of which does not have a definite meaning. But we feel more comfortable with it than with the earlier instruction involving precise terms.
Let us consider another example of our being comfortable with imprecision than
precision. The statement: ‘The sky is densely clouded’ is more comprehensible to
human beings than the statement: ‘The cloud cover of the sky is 93.5 %’.
This is because of the fact that we human beings are still better than computers at qualitative reasoning. Because of our better qualitative reasoning capabilities:
just by looking at the eyes only and/or nose only, we may recognize a person.
just by taking and feeling a small number of grains from cooking rice bowl, we
can tell whether the rice is properly cooked or not.
just by looking at few buildings, we can identify a locality or a city.
We know that for any problem, the plan of the proposed solution and the relevant
information is fed in the computer in a form acceptable to the computer.
However, the problems to be solved with the help of computers are, in the first place,
felt by the human beings. And then, the plan of the solution is also prepared by human
beings.
It is conveyed to the computer mainly for execution, because computers have much better executional speed.
(i) We, the human beings, sense problems, desire the problems to be solved and
express the problems and the plan of a solution using imprecise words of a natural
language.
(ii) We use computers to solve the problems, because of their executional power.
(iii) Computers function better, when the information is given to the computer in
terms of mathematical entities like numbers, sets, relations, functions, vectors,
matrices graphs, arrays, trees, records, etc., and when the steps of solution are
generally precise, involving no ambiguity.
Thus, to bridge the gap between
(i) the imprecision of natural language, with which human beings are comfortable, and in which human beings feel a problem and plan its solution, and
(ii) the precision of a formal system, with which computers operate efficiently, and in which computers execute the solution generally planned by human beings,
a new formal system, viz., the fuzzy system, based on the concept of 'fuzzy', was suggested for the first time in 1965 by L. Zadeh.
In order to initiate the study of Fuzzy systems, we quote two statements to recall the
difference between a precise statement and an imprecise statement.
A precise Statement is of the form: ‘If income is more than 2.5 lakhs then tax is 10%
of the taxable income’.
An imprecise statement may be of the form: ‘If the forecast about the rain being
slightly less than previous year is believed, then there is around 30% probability that
economy may suffer heavily’.
Next, we explain how fuzzy sets are defined, using mathematical entities, to capture imprecise concepts, through an example of the concept: tall. The next step is to model 'definitely tall', 'not at all tall', 'a little bit tall', 'slightly tall', 'reasonably tall', etc. in terms of mathematical entities, e.g., numbers, sets, etc.
In modelling a vague concept like 'tall' through fuzzy sets, the numbers in the closed set [0, 1] of reals may be used on the following lines:
(i) 'Definitely tall' may be represented as 'tallness having value 1'.
(ii) 'Not at all tall' may be represented as 'tallness having value 0'.
(iii) 'A little bit tall' may be represented as 'tallness having value say .2'.
(iv) 'Slightly tall' may be represented as 'tallness having value say .4'.
(v) 'Reasonably tall' may be represented as 'tallness having value say .7'.
and so on.
Similarly, the values of other concepts or, rather, other linguistic variables like
sweet, good, beautiful, etc. may be considered in terms of real numbers between
0 and 1.
Coming back to the imprecise concept of tall, let us think of five male persons of an organisation, viz., Mohan, Sohan, John, Abdul and Abrahm, with heights 5'2", 6'4", and so on. Had we talked only of a crisp set of tall persons, we would have included in the set only those persons whose height crosses some fixed value. But a fuzzy set representing tall persons includes all the persons, along with their respective degrees of tallness. Thus, in terms of fuzzy sets, we may write, for example:
Tall = {Mohan/.5, Sohan/.9, John/.7, Abdul/0, Abrahm/.2}
First, recall the corresponding notions for crisp sets. Let
A = {1, 4, 3, 5}, B = {4, 1, 3, 5} and C = {1, 4, 2, 5}.
Since the order of listing elements does not matter, A = B; however, A ≠ C, because 2 ∈ C while 2 ∉ A. Further, for C = {4, 8} we have C ⊄ A, since 8 ∉ A.
If A = {1, 4, 3, 5}, then each of 1, 4, 3 and 5 is called an element or member of A, and the fact that 1 is a member of A is denoted by 1 ∈ A.
In order to define, for fuzzy sets, the concepts corresponding to the concepts of equality of sets, subset and membership of a set, considered so far only for crisp sets, we first illustrate the concepts through an example.
Note: In every fuzzy set, all the elements of X appear, with their corresponding membership values from 0 to 1.
(ii) Equality of Fuzzy Sets: Let A, B and C be fuzzy sets defined on X. If the degree of each element is the same in A as in B, we say that fuzzy set A equals fuzzy set B, denoted as A = B. If some element has different degrees in A and C, then A ≠ C.
(iii) Subset/Superset
Intuitively, we know
(i) The set of 'Very Tall' people should be a subset of the set of 'Tall' people.
Then, in view of the fact that, for each element, the degree in A is greater than or equal to the degree in B, B is a subset of A, denoted as B ⊆ A.
Let A and B be fuzzy sets on the universal set X = {x1, x2, ..., xn} (X is called the Universe or Universal set) such that A = {x1/v1, x2/v2, ..., xn/vn} and B = {x1/w1, x2/w2, ..., xn/wn}, with 0 ≤ vi, wi ≤ 1. Then fuzzy set A equals fuzzy set B, denoted as A = B, if and only if vi = wi for all i = 1, 2, ..., n. Further, if wi ≤ vi for all i, then B is a fuzzy subset of A.
Definition: The support set of a fuzzy set, say C, is the crisp set, say D, containing all the elements of the universe X for which the degree of membership in the fuzzy set C is positive.
Definition: Fuzzy Singleton is a fuzzy set in which there is exactly one element
which has positive membership value.
Example:
Let us define a fuzzy set OLD on a universal set X of persons, in which the degree of OLD is 0 if a person in X is below 20 years, the degree of OLD is .2 if a person is between 20 and 25 years, and so on, with higher degrees for higher age groups.
Ex. 1: Discuss the equality and subset relationships for the following fuzzy sets defined on the universal set X = {a, b, c, d, e}:
A = {a/.3, b/.6, c/.4, d/0, e/.7}; B = {a/.4, b/.8, c/.9, d/.4, e/.7}; C = {a/.3, b/.7, c/.3, d/.2, e/.6}
The concepts of union, intersection and complementation for crisp sets may be extended to fuzzy sets, after observing that for crisp sets A and B:
(2) While taking the union of crisp sets, members of both sets are included, and none else. However, in each fuzzy set, all members of the universal set occur, and their degrees determine the level of membership in the fuzzy set.
The Union of two fuzzy sets A and B, is the set C with the same universe as that of A
and B such that, the degree of an element of C is equal to the MAXIMUM of degrees
of the element, in the two fuzzy sets.
(If the Universe of A ≠ the Universe of B, then take the Universe of C as the union of the Universe of A and the Universe of B.)
The Intersection C of two fuzzy sets A and B is the fuzzy set in which, the degree
of an element of C is equal to the MINIMUM of degrees in the two fuzzy sets.
Example: For the fuzzy sets A = {a/.5, b/.6, c/.3, d/0, e/.9} and B = {a/.3, b/.7, c/.6, d/.3, e/.6} (see also Ex. 2 below),
A ∪ B = {a/.5, b/.7, c/.6, d/.3, e/.9} and A ∩ B = {a/.3, b/.6, c/.3, d/0, e/.6}.
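The following Python sketch (ours) applies these MAX/MIN definitions, together with the complement 1 - degree, to the two example sets above.

# Fuzzy union, intersection and complement over a common universe.
A = {"a": 0.5, "b": 0.6, "c": 0.3, "d": 0.0, "e": 0.9}
B = {"a": 0.3, "b": 0.7, "c": 0.6, "d": 0.3, "e": 0.6}

union        = {x: max(A[x], B[x]) for x in A}        # degree = MAX of the two degrees
intersection = {x: min(A[x], B[x]) for x in A}        # degree = MIN of the two degrees
complement_A = {x: round(1 - A[x], 2) for x in A}     # degree = 1 - degree in A

print(union)          # {'a': 0.5, 'b': 0.7, 'c': 0.6, 'd': 0.3, 'e': 0.9}
print(intersection)   # {'a': 0.3, 'b': 0.6, 'c': 0.3, 'd': 0.0, 'e': 0.6}
print(complement_A)   # {'a': 0.5, 'b': 0.4, 'c': 0.7, 'd': 1.0, 'e': 0.1}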
The following properties, which hold for ordinary sets, also hold for fuzzy sets.
Commutativity
(i) A ∪ B = B ∪ A
(ii) A ∩ B = B ∩ A
We prove only (i) above, just to explain how the involved equalities may be proved in general: for every element x of the universe, the degree of x in A ∪ B is max(degA(x), degB(x)) = max(degB(x), degA(x)), which is exactly the degree of x in B ∪ A.
Rest of the properties are stated without proof.
Associativity
(i) (A ∪ B) ∪ C = A ∪ (B ∪ C)
(ii) (A ∩ B) ∩ C = A ∩ (B ∩ C)
Distributivity
(i) A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
(ii) A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
DeMorgan's Laws
(i) (A ∪ B)' = A' ∩ B'
(ii) (A ∩ B)' = A' ∪ B'
Involution
(A')' = A
Idempotence
A ∪ A = A
A ∩ A = A
Identity
A ∪ U = U; A ∩ U = A
A ∪ ∅ = A; A ∩ ∅ = ∅
where ∅ is the empty fuzzy set {x/0 : x ∈ U} and U is the universe {x/1 : x ∈ U}.
Check Your Progress - 2
Next, we discuss three operations, viz., concentration, dilation and normalization, that
are relevant only to fuzzy sets and can not be discussed for (crisp) sets.
Example: For A = {Mohan/.5, Sohan/.9, John/.7, Abdul/0, Abrahm/.2},
CON(A) = {Mohan/.25, Sohan/.81, John/.49, Abdul/0, Abrahm/.04}.
In respect of concentration, it may be noted that the associated values, being between 0 and 1, become smaller on squaring. In other words, the values concentrate towards zero. This fact may be used for giving increased emphasis to a concept: if the brightness of articles is being discussed, then Very Bright may be obtained in terms of CON(Bright).
The resulting fuzzy set, called the normal (or normalized) fuzzy set, has a maximum membership value of 1.
Example:
Norm(A) = {Mohan/(.5/.9 ≈ .55), Sohan/1, John/(.7/.9 ≈ .77), Abdul/0, Abrahm/(.2/.9 ≈ .22)}
Note: If one of the members already has value 1, then Norm(A) = A.
Relation & Fuzzy Relation
We know from our earlier background in Mathematics that a relation from a set A to a
set B is a subset of A x B.
For example, the relation 'father of' may be written as {(Dasrath, Ram), ...}, which is a subset of A × B, where A and B are sets of persons, living or dead.
Fuzzy Relation
In fuzzy sets, every element of the universal set occurs with some degree of
membership. A fuzzy relation may be defined in different ways. One way of
defining fuzzy relation is to assume the underlying sets as crisp sets. We will discuss
only this case.
For example, consider the fuzzy relation UNCLE defined between crisp sets of persons. Now suppose that
Ram is UNCLE of Mohan with degree 1, Majid is UNCLE of Abdul with degree .7, Peter is UNCLE of John with degree .7, and Ram is UNCLE of John with degree .4. Then the fuzzy relation is
UNCLE = {(Ram, Mohan, 1), (Majid, Abdul, .7), (Peter, John, .7), (Ram, John, .4)}.
As in the case of ordinary relations, we can use matrices and graphs to represent
FUZZY relations, e.g., the relation of UNCLE discussed above, may be graphically
denoted as
[Figure: fuzzy graph of the UNCLE relation, with labelled edges Ram→Mohan (1), Ram→John (.4), Majid→Abdul (.7) and Peter→John (.7)]
Fuzzy Reasoning
In the rest of this section, we take just a fleeting glance at fuzzy reasoning. The usual logical connectives have fuzzy counterparts:
(i) AND
(ii) OR
(iii) NOT
(iv) IF P THEN Q
Here deg(P) = 0 denotes that P is False, and deg(P) = 1 denotes that P is True. For example, let P be 'Mohan is tall' with degree .7, and Q be 'Mohan is educated' with degree .4.
Then P ∧ Q denotes 'Mohan is tall and educated', with degree min{.7, .4} = .4.
And P ∨ Q denotes 'Mohan is tall or educated', with degree max{.7, .4} = .7.
However, in everyday life, we may have to revise our earlier knowledge in the light of new facts that become available. For example, consider a sort of deductive argument in FOPL: birds can fly over long distances; Tweety is a bird; hence, Tweety can fly over long distances. However, later on, we come to know that Tweety is actually a hen, and a hen cannot fly long distances. Therefore, we have to revise our belief that Tweety can fly over long distances.
This type of situation is not handled by any monotonic reasoning system, including PL and FOPL. It is appropriately handled by Non-Monotonic Reasoning Systems (NMRS), which are discussed next.
The KB contains information, facts, rules, procedures etc. relevant to the type of
problems that are expected to be solved by the system. The component IE of NMRS
gets facts from KB to draw new inferences and sends the new facts discovered by it
(i.e., IE) to KB. The component TMS, after the addition of new facts to KB, either from
the environment or through the user or through IE, checks for validity of the KB. It
may happen that the new fact from the environment or inferred by the IE may
conflict/contradict some of the facts already in the KB. In other words, an
inconsistency may arise. In case of inconsistencies, the TMS retracts some facts from the KB. This may lead to a chain of retractions, which may require interactions between the KB and the TMS. Also, some new fact, either from the environment or from the IE, may invalidate some earlier retractions, requiring reintroduction of the earlier retracted facts; this may lead to a chain of reintroductions. These retractions and reintroductions are taken care of by the TMS, and the IE is completely relieved of this responsibility. The main job of the IE is to conclude new facts when it is supplied a set of facts.
[Figure: architecture of an NMRS, showing the Inference Engine (IE), the Truth Maintenance System (TMS) and the Knowledge Base (KB)]
Let us assume the KB has two facts, P and ~Q → ~P, and a rule called Modus Tollens.
When IE is supplied these knowledge items, it concludes Q and sends Q to KB.
However, through interaction with the environment, KB is later supplied with the
information that ~ P is more appropriate than P. Then TMS, on the addition of ~ P to
KB, finds that the KB is no longer consistent, at least with P. The knowledge that ~P is
more appropriate, suggests that P be retracted. Further Q was concluded assuming P
as True. But, in the new situation in which P is assumed to be not appropriate, Q also
becomes inappropriate. P and Q are not deleted from KB, but are just marked as
dormant or ineffective. This is done in view of the fact that later on, if again, it is
found appropriate to include P or Q or both, then, instead of requiring some
mechanism for adding P and Q, we just remove marks that made these dormant.
2) DEFAULT REASONING
In the previous section, we discussed uncertainty due to beliefs (which are not
necessarily facts), where beliefs are changeable. Here, we discuss another form of uncertainty that occurs as a result of the incompleteness of the available knowledge at a particular point of time.
Whenever, for any entity relevant to the application, information is not in the KB, then
a default value for that type of entity, is assumed and is assigned to the entity. The
default assignment is not arbitrary but is based on experiments, observations or some
other rational grounds. However, the typical value for the entity is removed if some
information contradictory to the assumed or default value becomes available.
The advantage of this type of a reasoning system is that we need not store all facts
regarding a situation. Reiter has given one theory of default reasoning, in which a default rule is expressed as

a(x) : M b1(x), ..., M bk(x)
----------------------------        (A)
c(x)

Suppose we have

Bird(x) : M fly(x)
------------------        (i)
Fly(x)
M fly(x) stands for a statement of the form 'the KB does not have any statement which says that x does not have wings, etc., because of which x may not be able to fly'. In other words, Bird(x) : M fly(x) may be taken to stand for the statement 'if x is a normal bird, and the normality of x is not contradicted by other facts and rules in the KB, then we can assume that x can fly'. Combining this with Bird(Twitty), we conclude that if the KB does not have any facts and rules from which it can be inferred that Twitty cannot fly, then we can conclude that Twitty can fly.
From these two facts in the KB, it is concluded that Twitty, being an ostrich, cannot fly. In the light of this knowledge, the fact that Twitty can fly has to be withdrawn. Thus, Fly(Twitty) would be blocked, because the default M fly(Twitty) is now inconsistent.
Adult(x) : M drive(x)
---------------------
Drive(x)
If a person x is an adult and in the knowledge base there is no fact (e.g., x is blind, or
x has both of his/her hands cut in an accident etc) which tells us something making x
incapable of driving, then x can drive, is assumed.
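The flavour of default reasoning can be conveyed by a toy Python sketch (ours; it is not Reiter's formal system). The default conclusion Fly(x) is drawn for any bird x unless the KB already holds a blocking fact, such as Ostrich(x); the particular blocking predicates are our own illustrative choices.

kb_facts = {("bird", "tweety"), ("bird", "twitty"), ("ostrich", "twitty")}
blockers = {"ostrich", "penguin", "wings_broken"}   # facts that contradict M fly(x)

def can_fly(x):
    # Apply the default Bird(x) : M fly(x) / Fly(x).
    if ("bird", x) not in kb_facts:
        return False
    return not any((b, x) in kb_facts for b in blockers)

print(can_fly("tweety"))   # True  -- nothing in the KB blocks the default
print(can_fly("twitty"))   # False -- Ostrich(twitty) blocks M fly(twitty)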
3) CLOSED WORLD ASSUMPTION (CWA): Under the closed world assumption, if a ground atom P(a) is not provable, then we assume ~P(a). A predicate like LESS(x, y) becomes a ground atom when the variables x and y are replaced by constants, say x by 2 and y by 3, so that we get the ground atom LESS(2, 3).
A KB is complete if, for each ground atom P(a), either P(a) or ~P(a) can be proved. By the use of the CWA, any incomplete KB becomes complete by the addition of the meta-rule: if P(a) is not provable from the KB, then add ~P(a). For example, suppose the KB contains only:
(i) P(a).
(ii) P(b).
The above KB is incomplete, as we cannot say anything about Q(b) (or ~Q(b)) from the given KB; by the CWA meta-rule, ~Q(b) is added, making the KB complete. However, the CWA may also make the KB inconsistent, i.e., it may come to contain two mutually conflicting wffs. For example, suppose our KB contains P(a) ∨ Q(b). (Note: from P(a) ∨ Q(b), we cannot conclude either of P(a) and Q(b) with definiteness.) As neither P(a) nor Q(b) is provable, we add ~P(a) and ~Q(b) by using the CWA. But then the set {P(a) ∨ Q(b), ~P(a), ~Q(b)} is inconsistent.
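A toy Python sketch (ours) of CWA completion over a finite language follows: every ground atom not derivable from the KB is assumed false. The extra fact Q(a) is our own addition, included only so that the predicate Q occurs in the language.

kb = {("P", "a"), ("P", "b"), ("Q", "a")}            # explicit ground facts
predicates, constants = {"P", "Q"}, {"a", "b"}

completed = {}
for pred in predicates:
    for c in constants:
        completed[(pred, c)] = (pred, c) in kb       # unprovable atoms become false

print(completed[("Q", "b")])   # False -- ~Q(b) has been added by the CWA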
Abduction Rule: (P → Q, Q) / P
Note that the abductive inference rule is different from the Modus Ponens inference rule. In diagnosing a disease (say P), the doctor asks for the symptoms (say Q); the doctor knows that for a given disease, say Malaria (P), the symptoms include high fever starting with a feeling of cold, etc. (Q). The doctor then attempts to diagnose the disease (i.e., P) from the symptoms. However, it should be noted that the conclusion of the disease from the symptoms may not always be correct. Abductive reasoning often leads to correct conclusions, but the conclusions may also be incorrect; in other words, abductive reasoning is not a valid form of reasoning.
For example, we may conclude that all cows are white after observing a large number of white cows. However, this conclusion may have exceptions, in the sense that we may come across a black cow also. Inductive reasoning, like abductive reasoning, Closed World Assumption reasoning and default reasoning, is not irrefutable. In other words, these reasoning rules lead to conclusions which may be true, but not necessarily always.
However, all the rules discussed under Propositional Logic (PL) and FOPL, including
Modus Ponens etc are deductive i.e., lead to irrefutable conclusions.
In this section, we will try to understand the rough set theory approach to managing imprecise knowledge, which was proposed by Z. Pawlak. The theory is quite comprehensive and may be treated as an independent discipline. It is closely connected with other theories, and hence with various fields like AI, machine learning, cognitive science, data mining, pattern recognition, etc.
Rough set theory:
• Facilitates the user with efficient tools and techniques to detect hidden patterns in data.
• Supports data reduction, i.e., it reduces the original data and finds minimal datasets carrying the same knowledge as the original dataset.
• Supports mechanisms to derive decision rules from the data automatically.
4) Each rough set has boundary-line cases, i.e., objects which cannot be classified with certainty, using the available knowledge, as members of the set or of its complement. Obviously, rough sets, in contrast to precise sets, cannot be characterized in terms of information about their elements. With any rough set, a pair of precise sets, called the lower and the upper approximation of the rough set, is associated.
Note: The lower approximation consists of all objects which surely belong to
the set and the upper approximation contains all objects which possibly
belong to the set. The difference between the upper and the lower
approximation constitutes the boundary region of the rough set.
Approximations are fundamental concepts of rough set theory.
5) Rough set based data analysis starts from a data table called a decision table,
columns of which are labeled by attributes, rows – by objects of interest and
entries of the table are attribute values.
6) Attributes of the decision table are divided into two disjoint groups called
condition and decision attributes, respectively. Each row of a decision table
induces a decision rule, which specifies decision (action, results, outcome,
etc.) if some conditions are satisfied. If a decision rule uniquely determines
decision in terms of conditions – the decision rule is certain. Otherwise the
decision rule is uncertain.
Note: Decision rules are closely connected with approximations. Roughly
speaking, certain decision rules describe lower approximation of decisions in
terms of conditions, whereas uncertain decision rules refer to the boundary
region of decisions.
7) With every decision rule two conditional probabilities, called the certainty and
the coverage coefficient, are associated.
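The lower and upper approximations described in points 4) to 6) can be computed directly from a decision table. The following Python sketch (ours, over an invented toy table) groups objects into equivalence classes on their condition attributes and approximates the set X of objects whose decision is "yes".

# object: (condition attribute values, decision)
table = {
    1: (("high", "yes"), "yes"),
    2: (("high", "yes"), "yes"),
    3: (("high", "no"),  "no"),
    4: (("low",  "no"),  "yes"),
    5: (("low",  "no"),  "no"),
}
X = {o for o, (_, d) in table.items() if d == "yes"}

classes = {}                                   # indiscernibility classes
for o, (cond, _) in table.items():
    classes.setdefault(cond, set()).add(o)

lower = set().union(*(c for c in classes.values() if c <= X))   # surely in X
upper = set().union(*(c for c in classes.values() if c & X))    # possibly in X
print(lower, upper, upper - lower)   # {1, 2} {1, 2, 4, 5} and boundary {4, 5}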
8.8 SUMMARY
In this unit, fuzzy systems were discussed, along with an introduction to fuzzy sets and their representation. Later, a conceptual understanding of fuzzy reasoning was built, and the same was used to perform fuzzy inference. The unit finally discussed the concept of rough set theory as well.
8.9 SOLUTIONS/ANSWERS
Check Your Progress - 1
Ex. 1: Discuss equality and subset relationship for the following fuzzy sets defined on
the Universal set X = { a, b , c, d, e}
A = { a/.3, b/.6, c/.4 d/0, e/.7} ; B = {a/.4, b/.8, c/.9, d/.4, e/.7}; C = {a/.3, b/.7, c/.3,
d/.2, e/.6}
SOLUTION: Both A and C are subsets of the fuzzy set B, because deg(x in A) ≤ deg(x in B) and deg(x in C) ≤ deg(x in B), for all x ∈ X.
Ex. 2: For the following fuzzy sets A = {a/.5, b/.6, c/.3, d/0, e/.9} and B = {a/.3, b/.7, c/.6, d/.3, e/.6}, find the fuzzy sets A ∪ B, A ∩ B and (A ∪ B)'.
SOLUTION: A ∪ B = {a/.5, b/.7, c/.6, d/.3, e/.9}; A ∩ B = {a/.3, b/.6, c/.3, d/0, e/.6}; hence (A ∪ B)' = {a/.5, b/.3, c/.4, d/.7, e/.1}.
UNIT 9 INTRODUCTION TO MACHINE LEARNING
METHODS
Structure
9.0 Introduction
9.1 Objectives
9.2 Introduction to Machine Learning
9.3 Techniques of Machine Learning
9.4 Reinforcement Learning and Algorithms
9.5 Deep Learning and Algorithms
9.6 Ensemble Methods
9.7 Summary
9.8 Solutions/Answers
9.9 Further Readings
9.0 INTRODUCTION
After artificial intelligence was introduced in the computing world, there was a need for machines that would automatically make things better. This need has to be kept in check, so there should be some rules that apply to all learning processes.
The main goal of machine learning, even at its most basic level, is for a system to analyse and adapt to data on its own and make decisions based on calculations and analyses. Machine learning is a way to try to improve computers by imitating how the human brain learns. A computer that doesn't have intelligence is just a fast machine for processing data; devices without AI or ML are just data processing units that use the information they are given. Machine learning is what we need to make devices that can make decisions based on data.
To get to this level of intelligence, you need to put algorithms and data into a machine in a way that lets it
make decisions.
For example, real-time GPS data is used by Maps applications on devices to show the quickest and fastest route. Several algorithms, such as the shortest path algorithm (Dijkstra's algorithm) and heuristics for the travelling salesman problem (for instance, the water flow algorithm, WFA), can be used to make the decision. These algorithms have been in use for a long time and can still be improved, but they are useful for learning. Here, we can see that the machine, which is your computer or mobile device, uses GPS coordinates, traffic data based on density, and predefined map routes to figure out the fastest way to get from point A to point B.
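For concreteness, the following Python sketch (ours; the road graph and weights are invented) shows the kind of shortest-path computation, here Dijkstra's algorithm, that underlies such route decisions.

import heapq

def dijkstra(graph, source):
    # Return the minimum travel cost from source to every reachable node.
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                          # stale queue entry, skip it
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

roads = {"A": [("B", 4), ("C", 1)], "C": [("B", 2), ("D", 7)], "B": [("D", 1)]}
print(dijkstra(roads, "A"))   # {'A': 0, 'C': 1, 'B': 3, 'D': 4}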
This is one of the simplest examples that can help us understand how machine learning enables independent decision making by devices, and how it can make such decision making easier and more accurate.
The accuracy of the data as a whole is a topic of debate: decisions based on the data might be accurate, but whether or not they are acceptable within the given limitations is a separate issue. Consequently, it is necessary to set these boundaries for all machine learning algorithms and engines.
The simplest example would be if we instruct a self-driving car to reach a destination at a specified time: it should also work within the legal boundaries of the land, and not break traffic rules to achieve the desired result. Boundaries and restrictions cannot be ignored, as they are very important for any self-learning system.
Data is the soul of every business, and it has been a key component of decision making since the prehistoric era: the more data you have, the higher the probability of making the right decision. Machine learning is the key to unlocking a new world, where customer data, corporate data, demographic data, or any related data relevant to the decision can help you make the right, more informed decision and stay ahead of the competition.
Both artificial intelligence and statistics research groups contributed to the development of machine
learning. Companies like Google, Microsoft, Facebook, and Amazon all use machine learning as part of
their decision-making processes.
The most common applications of machine learning nowadays are to interpret and investigate cyber
phenomena, to extract and project the future values of those phenomena, and to detect anomalies.
There are a number of open-source solutions for machine learning that can be used with API calls or without programming; examples include Weka, Orange, and RapidMiner. To see how data that has been processed by an algorithm looks, you can put the results into tools like Tableau, Pivotal, or Spotfire and use them to make dashboards and workflow strategies.
Michie et al. (D. Michie, 1994) say that machine learning usually refers to automatic computing procedures, based on logical or binary operations, that learn how to do a task by looking at a series of examples. Machine learning is used in a lot of ways today, but whether all of them are mature is up for debate. There is a lot of room for improvement when it comes to accuracy, which is a process that never ends and changes every day.
9.1 OBJECTIVES
Understanding data, describing the characteristics of a data collection, and locating hidden connections
and patterns within that data are all necessary steps in the process of developing a model. These steps can
be accomplished through the application of statistics, data mining, and machine learning. When it comes
to finding solutions to business issues, the methods and tools that are employed by these fields share a lot
in common with one another.
The more conventional forms of statistical investigation are the origin of a great deal of the prevalent data
mining and machine learning techniques. Data scientists have a background in technology and also have
expertise in areas such as statistics, data mining, and machine learning. This allows them to collaborate
effectively across all fields.
The process of "data mining" refers to the extraction of information from data that is latent, previously
unknown, and has the potential to be beneficial. Building computer algorithms that can automatically
search through large databases for recurring structures or patterns is the goal of this project. In the event
that robust patterns are discovered, it is likely that they will generalise to enable accurate predictions on
future data.
The subject is thoroughly covered in the renowned book "Data Mining: Practical Machine Learning Tools and Techniques" by Ian Witten and Eibe Frank. "Data mining" is the practice of locating patterns within data. The procedure needs to be fully automatic or, at the very least, semi-automatic. The patterns found have to be significant in the sense that they lead to some kind of benefit, most commonly an economic one, and the data is invariably present in substantial quantities.
Machine learning, on the other hand, is the core of data mining's technical infrastructure. It is used to
extract information from the raw data that is stored in databases; this information is then expressed in a
form that is understandable and can be applied to a range of situations.
Most learning algorithms use statistical tests to build rules or trees and fix models that are "overfitted," or
too dependent on the details of the examples that were used to make them. So, a lot of statistical thinking
goes into the techniques we will talk about in this unit. Statistical tests are used to evaluate and validate
machine learning models and algorithms.
Machine learning is when a computer learns how to do a task by using algorithms that are logical and can be turned into usable models. The artificial intelligence community has been the main driver of machine learning's growth. The most important factor contributing to this expansion was that it assisted in the collection of statistical and computational methods that could automatically construct usable models from data. Companies such as Google, Microsoft, Facebook, and Netflix have been putting in consistent effort over the past decade to make machine learning more accurate and mature.
The primary function or application of machine learning algorithms can be summarized as follows:
(a) To gain an understanding of the cyber phenomenon that produced the data that is being
investigated;
(b) To abstract the understanding of underlying phenomena in the form of a model;
(c) To predict the future values of a phenomenon by using the model that was just generated; and
(d) To identify anomalous behavior exhibited by a phenomenon that is being observed.
There are various open-source implementations of machine learning algorithms that can be utilised either through application programming interface (API) calls or in non-programmatic applications, and these methods can also be used in conjunction with each other. Weka, Orange, and RapidMiner are a few instances of open-source implementations. The outputs of these algorithms can be fed into visual analytics tools like Tableau and Spotfire, which can then be used to build dashboards and actionable pipelines.
Almost all of the frameworks have emphasised decision-tree techniques, in which classification is determined by a series of logical steps. Given enough data (which may be a lot!), these are capable of representing even the most complex problems. Other techniques, such as genetic algorithms and inductive logic procedures (ILP), are currently in development and, in theory, would allow us to deal with a wider range of data, including cases where the number and type of attributes vary, where additional layers of learning are superimposed, and where attributes and classes are organised hierarchically, and so on. Machine learning seeks to produce classification rules that are simple enough for humans to understand. They must be able to sufficiently simulate human reasoning in order to provide insight into the decision-making process. Background knowledge, including statistical techniques, can be used during development, but operation is assumed to proceed without human interference.
Machine learning approaches are needed to make prediction models more accurate. Depending on the
type and amount of data and the business problem being solved, there are different ways to approach the
problem. In this section, we talk about the machine learning cycle.
The Machine Learning Cycle: Building a machine learning application is an iterative process. You can't just train a model once and leave it alone, because data changes, preferences change, and new competitors come along. So, once your model goes into production, you need to keep it updated. Even though this won't need as much effort as when you first created the model, don't expect it to run on its own.
Figure 2: Machine Learning Cycle at a Glance (1. ACCESS and load the data; 2. PREPROCESS the data; 3. derive features; 4. TRAIN models using the features derived in step 3; 5. ITERATE to find the best model; 6. integrate the best model into a production system, e.g., on a mobile device).
To use machine learning techniques effectively, you need to know how they work; you can't just apply them blindly and expect good results. Different techniques work for different kinds of problems, but it's not always clear which technique will work in a given situation, so you need to know something about the range of available solutions. A large number of techniques are discussed in this unit.
One step in the machine learning cycle is choosing the right machine learning algorithm. So, let's look at the steps of the machine learning cycle (a minimal end-to-end sketch follows this list):
1. Data identification
2. Data preparation
3. Selection of a machine learning algorithm
4. Training the algorithm to develop a model
5. Evaluating the model
6. Deploying the model
7. Performing prediction
8. Assessing the predictions
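The cycle can be made concrete with a short, hedged sketch in Python using scikit-learn. The dataset (Iris), the scaler, and the logistic regression model are illustrative assumptions, not the only valid choices:

# A minimal sketch of the machine learning cycle with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: identify and prepare the data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)   # reuse the training statistics on unseen data

# Steps 3-4: select an algorithm and train it to develop a model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: evaluate the model before deploying it.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Steps 6-8: in production you would deploy the model, perform predictions
# on fresh data, and keep assessing those predictions over time.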
When your model has reached the point where it can make accurate predictions, you can restart the process by re-evaluating it with questions such as: Is all of the information important? Do any more data sets exist that could improve the accuracy of the predictions? You can maintain the usefulness of your machine-learning-based applications by continually improving the models and assessing new approaches.
When should you use machine learning? Think about using machine learning when you have a hard task
or problem that involves a lot of data and many different factors but no formula or equation to solve it.
For example, machine learning is a good choice if you need to deal with situations like face and speech
recognition, fraud detection by analysing transaction records, automated trading, energy demand
forecasting, predicting shopping trends, and many more.
When it comes to machine learning, there's rarely a straight line from the beginning to the end. Instead,
you'll find yourself constantly iterating and trying out new ideas and methods.
This unit talks about a step-by-step process for machine learning and points out some important decision
points along the way. The most common problem with machine learning is getting your data in order and
finding the right model. Here are some of the most important things to worry about with the data:
• Data comes in all shapes and sizes: Real-world datasets can be messy, with missing values, and may arrive in different formats. You might have simple numeric data, but sometimes you have to combine different kinds of data, such as sensor signals, text, and streaming images from a camera.
• Preprocessing your data might require specialized knowledge and tools: For example, you need to know a lot about image processing to choose features for training an object detection algorithm. Different kinds of data need different kinds of preprocessing.
• It takes time to find the best model to fit the data: Finding the right model is like walking a tightrope. Highly flexible models tend to overfit the data by modelling small variations that could just be noise, while models that are too simple might assume too much. Model speed, accuracy, and complexity are always at odds with one another.
Does it appear to be a challenge? Try not to let this discourage you. Keep in mind that the process of
machine learning relies on trial and error. You merely go on to the next method or algorithm in the event
that the first one does not succeed. On the other hand, a well-organized workflow will assist you in
getting off to a good start.
a) Supervised learning requires training a model on data whose inputs and outputs are already known, so that the model can predict future outputs, such as whether an email is genuine or spam, or whether a tumor is cancerous. Classification models classify given data into categories. Medical imaging, speech recognition, and credit scoring are typical applications.
b) Unsupervised learning analyses data to uncover previously unknown patterns or structures. It is used to infer conclusions from data sets that contain inputs but no tagged answers. The most prevalent unsupervised method is clustering, which performs exploratory data analysis to uncover hidden patterns or groups in data. Clustering can be used for gene sequence analysis, market research, and object recognition.
Note: In semi-supervised learning, algorithms are trained on small sets of labelled data and then applied to unlabeled data, as in unsupervised learning. This method is frequently used when there is a dearth of quality labelled data.
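To make the contrast concrete, here is a minimal, hedged sketch in which the same tiny feature matrix is handled once with labels (supervised) and once without (unsupervised); the numbers are invented for illustration:

# Supervised vs unsupervised learning on the same inputs (toy data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1]])

# Supervised: inputs AND known outputs (labels) train a predictor.
y = np.array([0, 0, 1, 1])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[1.2, 0.9]]))      # predicts class 0

# Unsupervised: only inputs; the algorithm uncovers the structure itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                     # two discovered groups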
Figure: Machine learning techniques.
Supervised Learning:
• Classification: Support Vector Machines, Discriminant Analysis, Naive Bayes, Nearest Neighbor, Neural Networks
• Regression: Linear Regression (GLM), SVR, GPR, Ensemble Methods, Decision Trees, Neural Networks
Unsupervised Learning:
• Clustering (partitioning algorithms): K-Means, K-Medoids, Fuzzy C-Means, Hierarchical, Gaussian Mixture, Hidden Markov Model, Neural Networks
"How Do You Choose Which Algorithm to Use?" is a crucial question. There are numerous supervised
and unsupervised machine learning algorithms, each with its own learning strategy. This can make
picking the appropriate one difficult. There is no alternative solution or strategy that will work for
everyone. It takes some trial and error to find the proper algorithm. Even the most seasoned data scientists
can't predict whether or not an algorithm would work without putting it to the test. However, the size and
type of data you're working with, the insights you want to gain from the data, and how those insights will
be used all go into the algorithm you choose.
• If you need to train a model to produce a forecast (such as the future value of a continuous variable like temperature or a stock price) or a classification (such as identifying the kind of automobile in a webcam video), go with supervised learning.
• If you want to explore your data and train a model to find an appropriate internal representation of it, for example by grouping it, use unsupervised learning.
The purpose of supervised machine learning is to create a model capable of making predictions based on
data even when there is ambiguity. A supervised learning technique trains a model to generate good
predictions about the response to new data using a known set of input data and previous responses to the
data (output).
Using Supervised Learning to Predict Heart Attacks as an Example: Assume doctors want to determine whether someone will suffer a heart attack within the coming year. They have data on former patients, including age, weight, height, and blood pressure, and they know whether each of those patients had a heart attack within a year. The challenge is to build a model from the existing data that can predict whether a new person will have a heart attack in the coming year.
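A small illustrative sketch of this example follows. The patient records below are made-up numbers, not real medical data, and logistic regression is just one reasonable model choice:

from sklearn.linear_model import LogisticRegression

# Each row: [age, weight_kg, height_cm, systolic_bp] (fabricated records)
X = [[63, 92, 170, 160],
     [41, 70, 175, 118],
     [58, 88, 165, 150],
     [35, 64, 180, 112],
     [66, 95, 168, 165],
     [45, 72, 172, 120]]
y = [1, 0, 1, 0, 1, 0]   # 1 = had a heart attack within a year

model = LogisticRegression(max_iter=1000).fit(X, y)

new_patient = [[55, 85, 170, 145]]
print(model.predict(new_patient))         # predicted class for the new person
print(model.predict_proba(new_patient))   # estimated class probabilities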
Supervised Learning Techniques: Every supervised learning method can be placed into one of two categories: classification or regression. Both are used in supervised learning to develop models that can forecast the future.
• Regression methods: Regression algorithms make predictions about continuous quantities, such as shifts in temperature or changes in the amount of power consumed. Typical applications include stock price prediction, handwriting recognition, electricity load forecasting, acoustic signal processing, and similar tasks.
Note:
Can your data be tagged or categorised? Use classification techniques if your data can be divided into distinct groups or classes.
Working with a range of values? Use regression techniques if your answer is a real number, such as a temperature or the time until a piece of equipment fails.
Binary vs. Multiclass Classification: Before you start working on a classification problem, figure out whether it is binary or multiclass. In a binary classification task, a single training or test item (instance) can be assigned to only one of two classes, for example determining whether an email is genuine or spam. A multiclass classification problem has more than two categories, for example training a model to categorise an image as a dog, cat, or other animal. Remember that a multiclass problem is harder to solve, since it requires a more sophisticated model. Certain techniques (such as logistic regression) are designed specifically for binary classification situations, and these methods are more efficient than multiclass algorithms during training.
Now it's time to talk about the role of algorithms in machine learning. Algorithms are central to how machine learning works, so the two must be discussed together; they are the most important part of learning. In the world of computers, algorithms have long been used to solve hard problems. They are a set of computer instructions for working with, transforming, and interacting with data. An algorithm can be as simple as adding a column of numbers or as complicated as recognising a face in a picture. For an algorithm to work, it must be written as a program that a computer can understand. Machine learning algorithms are usually written in Java, Python, or R, and each of these languages has machine learning libraries that support a wide range of algorithms.
Active user communities for these languages share code and discuss ideas, problems, and ways to solve business problems. Machine learning algorithms differ from other algorithms: normally, a programmer starts by writing the algorithm, but machine learning turns the process around. With machine learning, the data itself shapes the model. As more data is fed to a machine learning algorithm, the resulting model becomes more complex, and its predictions become more accurate.
Choosing the right kind of machine learning algorithm is a mix of science and art. If you ask two data scientists to solve the same business problem, they might do it in different ways. But data scientists who know the different kinds of algorithms can work out which ones are likely to perform best. So, the most important step after getting the data into the right format is to choose the right machine learning algorithm. As discussed earlier, choosing the right algorithm is a process of trial and error, and there are also trade-offs between certain aspects of the algorithms, such as speed of training, accuracy, and model complexity.
Let’s take a closer look at the most commonly used machine learning algorithms.
Commonly used machine learning algorithms, grouped by family:
• Bayesian: Naive Bayes, Gaussian Naive Bayes, Multinomial Naive Bayes, Averaged One-Dependence Estimators (AODE), Bayesian Belief Network (BBN), Bayesian Network (BN)
• Deep Learning: Deep Boltzmann Machine (DBM), Deep Belief Networks (DBN), Convolutional Neural Network (CNN), Stacked Auto-Encoders
• Instance Based: k-Nearest Neighbour (kNN), Learning Vector Quantization (LVQ), Self-Organizing Map (SOM), Locally Weighted Learning (LWL)
• Regression: Linear Regression, Ordinary Least Squares Regression (OLSR), Stepwise Regression, Multivariate Adaptive Regression Splines (MARS), Locally Estimated Scatterplot Smoothing (LOESS), Logistic Regression
• Clustering: k-Means, k-Medians, Expectation Maximization, Hierarchical Clustering
Bayesian: Bayesian algorithms allow data scientists to encode prior beliefs about what models should look like, regardless of what the data shows. Given how much attention is usually devoted to letting the data shape the model, you might ask why anyone would be interested in Bayesian algorithms. The answer is that Bayesian techniques come in handy when you don't have much data to work with.
A Bayesian algorithm makes sense if you already know something about a part of the model and can encode that knowledge directly. Consider a medical imaging system that looks for signs of lung disease: if a study published in a journal estimates the likelihood of various lung diseases based on a person's lifestyle, those estimates can be incorporated into the model.
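A minimal sketch of this idea, assuming scikit-learn: Gaussian Naive Bayes accepts class priors, which is one simple way to inject existing knowledge such as a published disease rate. The features, labels, and the 20% prior are assumptions for illustration only:

from sklearn.naive_bayes import GaussianNB

# Each row: [age, smoker (0/1)]; 1 in y means the disease was present.
X = [[25, 0], [60, 1], [45, 1], [30, 0], [70, 1], [50, 0]]
y = [0, 1, 1, 0, 1, 0]

# Encode a prior belief: assume the disease occurs in 20% of the population.
model = GaussianNB(priors=[0.8, 0.2]).fit(X, y)
print(model.predict_proba([[55, 1]]))   # estimated probability of each class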
Decision tree: Decision tree algorithms use a branching structure to show what will happen when a choice is made. A decision tree shows all the possible outcomes at each branch, and the likelihood of each outcome is shown as a percentage at each node.
Decision trees are sometimes used in online sales. For example, you might want to work out who is most likely to use a 50%-off coupon before sending it out. Customers can be split into four groups:
a) Customers who are likely to use the code if they get a personal message.
b) Customers who will buy no matter what.
c) Customers who will never buy.
d) Customers who are likely to be upset if someone tries to reach out to them.
If you run a campaign, you obviously don't want to send offers to the last three groups, since they will either buy anyway, ignore the offer, or respond negatively. You'll get the best return on investment (ROI) if you target only the first group. A decision tree will help you identify these four customer categories and organise prospects and customers according to who will respond best to the marketing campaign.
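A hedged sketch of the coupon example, assuming scikit-learn; the features, labels, and group encoding are invented purely for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [past_purchases, opened_last_email (0/1)]
X = [[5, 1], [0, 0], [8, 0], [1, 1], [0, 1], [9, 1]]
# Classes: 0 = buys anyway, 1 = responds to a personal message, 2 = never buys
y = [0, 2, 0, 1, 2, 1]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Print the learned branching structure in readable form.
print(export_text(tree, feature_names=["purchases", "opened_email"]))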
Instance based: Instance-based algorithms classify new data points on the basis of the stored training data. Because there is no explicit training phase, these algorithms are called "lazy learners": they simply compare new data to the training data and classify it by similarity. Data sets with random variation, irrelevant attributes, or missing values are not well suited to instance-based learning.
Instance-based learning is used, for example, in spatial and chemical structure analysis, and many instance-based algorithms are applied in biology, pharmacology, chemistry, and engineering.
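A minimal k-Nearest Neighbours sketch (an instance-based, "lazy" learner), with arbitrary data:

from sklearn.neighbors import KNeighborsClassifier

X = [[1.0], [1.2], [3.1], [2.9], [5.0], [5.2]]   # e.g. some measured property
y = ["A", "A", "B", "B", "C", "C"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                # "lazy": fitting mostly just stores the instances
print(knn.predict([[3.0]]))  # classified by comparison with its nearest neighbours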
Neural networks and deep learning : A neural network is an artificial intelligence system that
attempts to solve problems in the same way that the human brain does. This is accomplished by
the utilisation of many layers of interconnected units that acquire knowledge from data and infer
linkages. In a neural network, the layers can be connected to one another in various ways. When
referring to the process of learning that takes place within a neural network with multiple hidden
layers, the term "deep learning" is frequently used. Models built with neural networks are able to
adapt to new information and gain knowledge from it. Neural networks are frequently utilised in
situations in which the data in question is not tagged or is not organised in a particular fashion.
The field of computer vision is quickly becoming one of the most important applications for
neural networks. Today, one can find applications for deep learning in a diverse range of
contexts.
Deep learning is used to help self-driving cars understand their surroundings: deep learning algorithms analyse the unstructured data collected by on-board cameras as they capture pictures of the environment, allowing the system to make judgements in essentially real time. Deep learning is also an integral part of the applications radiologists use to analyse medical images.
Linear regression: Regression algorithms are important in machine learning and are often used for statistical analysis. They help analysts work out how data points are related, and they can measure how strongly two variables in a data set are linked to each other. Regression analysis can also be used to predict future values of data based on their past values. But it is important to remember that regression captures correlation, not necessarily causation; if you treat a correlation as a cause without understanding the context of the data, regression analysis can lead to wrong conclusions.
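A minimal regression sketch with invented numbers; note that a strong fit demonstrates correlation, not causation:

import numpy as np
from sklearn.linear_model import LinearRegression

hours_studied = np.array([[1], [2], [3], [4], [5]])
exam_score = np.array([52, 58, 65, 70, 78])

reg = LinearRegression().fit(hours_studied, exam_score)
print(reg.coef_, reg.intercept_)   # strength and direction of the relationship
print(reg.predict([[6]]))          # projecting a future value from past values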
Rule-based machine learning: Rule-based machine learning algorithms describe data using rules about relationships. A rule-based system differs from a typical machine learning system, which builds one model that applies to all the data. Rule-based systems are, in general, easy to understand: if X data comes in, do Y. However, a rule-based approach can become very complicated as systems grow. For example, a system might start with 100 predefined rules, and as it receives more data and learns how to use it, hundreds of additional rules are likely to be needed. When building a rule-based approach, it is important to ensure it does not become so complicated that it stops being clear. Think about how hard it would be to build a rule-based algorithm to apply the GST codes.
Q5 Compare the concepts of Classification, Regression, and Clustering, and list the algorithms in the respective categories.
………………………………………………………………………………………………………………
……………………………………………………………………
As we observed in the previous section, learning can be broken down into three main categories: supervised, unsupervised, and semi-supervised. In addition to these categories, there are other types of learning, such as reinforcement learning (RL), deep learning (DL), adaptive learning, and so on.
The figure below depicts the various branches and sub-branches of machine learning, including the various algorithms involved in each sub-branch. Let's understand them in brief, as full coverage of all these machine learning techniques is beyond the scope of this unit. We will begin our discussion with reinforcement learning.
Figure: Branches and sub-branches of machine learning.
• Classical Learning:
- Supervised: Regression (Linear, Polynomial, Ridge/Lasso), Classification
- Unsupervised: Clustering, Pattern Search (Apriori, FP-Growth, Euclat), Dimension Reduction / generalization (t-SNE, PCA, LSA, LDA, SVD)
• Ensemble Methods: Stacking, Bagging (Random Forest), Boosting (XGBoost, LightGBM, AdaBoost, CatBoost)
• Reinforcement Learning: Genetic Algorithm, Q-Learning, SARSA, Deep Q-Network (DQN), A3C
• Neural Nets and Deep Learning: Perceptrons (MLP), Convolutional Neural Networks (CNN, DCNN), Autoencoders, Recurrent Neural Networks (RNN: LSTM, GRU, LSM), Seq2seq, Generative Adversarial Networks (GAN)
In Reinforcement Learning (RL), algorithms are given a set of instructions and rules and then figure out how to handle a task by trying things out and seeing what works and what doesn't. To help the AI find the best way to solve a problem, decisions are either rewarded or punished. Through reinforcement learning, machine learning models are taught to make a sequence of decisions. The setup consists of an Agent interacting with an Environment.
Reinforcement Learning (RL) is a type of machine learning in which the agent receives a delayed reward in the next time step as an evaluation of its previous action. It has mostly been used in games, such as Atari and Mario, where it can perform as well as or better than a human. Since neural networks have been added to the algorithm, it has been able to handle more complicated tasks.
In reinforcement learning, an AI system is placed in a game-like situation (i.e. a simulation) and tries actions until it finds a solution to the problem. Slowly but surely, the agent learns how to reach a goal in an uncertain, potentially complicated environment, but we cannot expect the agent to stumble upon the perfect solution by accident. This is where the interactions come into play: the Agent is provided with the State of the Environment, which becomes the input on which the Agent bases its Action. An Action first gives the Agent a Reward (note that rewards can be positive or negative, depending on the fitness function for the problem). Based on this reward, the Policy (the ML model) inside the Agent adapts and learns. Second, the Action affects the Environment and changes its State, which means the input for the next cycle changes.
This cycle continues until the best Agent is created; it tries to imitate the way organisms learn over the course of their lives. Usually, the Environment is reset after a certain number of cycles or if something goes wrong. Note that you can run more than one Agent at the same time to reach the solution faster, but each Agent runs on its own, independently.
Typically, an RL setup is composed of two components, an agent and an environment, interacting in a loop.
Figure: The reinforcement learning loop. The Agent performs an Action on the Environment; the Environment returns a Reward (or Penalty) and the Next State to the Agent.
The following are the meanings of the different parts of reinforcement learning:
1. AGENT: The agent is the entity that learns and makes decisions.
2. ENVIRONMENT: The environment is the world in which the agent learns and decides what to do.
3. ACTION: The set of things that the agent can do.
4. STATE: The current situation of the environment as observed by the agent; it is the input on which the agent bases its next action.
5. REWARD: The environment gives the agent a reward for each action it chooses, usually a scalar value.
6. POLICY: The policy is the agent's way of deciding what to do (its control strategy): a mapping from situations to actions.
7. VALUE FUNCTION: A mapping from states to real numbers, where the value of a state is the long-term reward that can be earned by starting in that state and following a certain policy.
8. FUNCTION APPROXIMATOR: The problem of inferring a function from training examples. Decision trees, neural networks, and nearest-neighbour methods are all examples of standard approximators.
9. MODEL: The agent's view of the environment, which maps state-action pairs to probability distributions over states. Note that not every agent that learns from its environment uses a model of that environment.
Although there is a large number of RL algorithms, there does not appear to be an exhaustive comparison of them, and it is quite challenging to determine which algorithm should be used for which type of task. This section attempts to introduce several well-known algorithms.
RL algorithms fall into two broad families: model-free RL and model-based RL.
Model-Free vs Model-Based RL: In model-based RL, the model is used to simulate the dynamics of the environment. In other words, the model learns the transition probability T(s1 | s0, a) from the pair of the present state s0 and action a to the next state s1. If the agent successfully learns the transition probability, it knows how likely it is to reach a particular state given the present state and action. However, as the state space and the action space grow, model-based algorithms become impractical.
On the other hand, model-free algorithms acquire new information through an iterative process
of trial and error. As a consequence of this, it does not need any additional space in order to store
every possible combination of states and actions.
Within model-free RL, policy optimisation is a subclass, and it comprises two distinct kinds of policies, i.e. On-Policy vs Off-Policy: an on-policy agent learns the value based on its current action a, derived from the current policy, whereas an off-policy agent learns the value based on an action a* obtained from another policy (in Q-learning, this other policy is the greedy policy).
Q-learning, or value iteration, is the next subcategory of model-free RL. Q-learning learns the action-value function: how advantageous is it to take a particular action in a particular state? In its most basic form, a scalar value is assigned to an action a given the state s. The algorithm is summarised by the following steps:
1. Initialize the Q-table.
2. Choose an action a.
3. Perform the action.
4. Measure the reward.
5. Update Q, then repeat from step 2 until learning converges (a toy implementation follows this list).
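These steps can be made concrete with a toy tabular Q-learning sketch in Python. The five-state corridor environment, its rewards, and the hyperparameters below are all assumptions chosen for illustration:

import numpy as np

n_states, n_actions = 5, 2             # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))    # step 1: initialize the Q-table
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Step 2: choose an action (epsilon-greedy policy).
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        # Step 3: perform the action in the environment.
        s_next = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
        # Step 4: measure the reward (only the goal state pays).
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Step 5: update Q using the Bellman update rule.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))   # learned policy: move right in every state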
Let's extend our discussion to some more reinforcement learning algorithms, i.e. DQN and SARSA.
Deep Q Neural Network (DQN): DQN is Q-learning with neural networks. The motivation behind it relates to environments with large state spaces, where defining a Q-table would be a very complex, challenging, and time-consuming task. Instead of a Q-table, neural networks approximate the Q-values for each action based on the state.
On Policy: The learning agent learns the value function according to the current action, derived from the policy currently in use.
Off Policy: The learning agent learns the value function according to an action derived from another policy.
Q-learning is considered an off-policy technique, since it employs the greedy strategy to learn the Q-value. The SARSA approach, on the other hand, is on-policy: it uses the action performed by the current policy to learn the Q-value.
Text mining, facial recognition, city planning, and targeted marketing are some applications that are implementations of unsupervised learning algorithms. In a similar manner, the classification methods under the supervised learning umbrella have applications in fraud detection, spam detection, diagnostics, image classification, and score prediction. Similarly, reinforcement learning has a wide range of applications in fields including the gaming industry, manufacturing, inventory management, and the financial sector, among many others.
Deep learning is a type of machine learning that uses artificial neural networks and representation learning; it is also called deep structured learning or differentiable programming. Deep learning is a way for machines to learn through deep neural networks. It is widely used to solve practical problems in fields like computer vision (images), natural language processing (text), and automated speech recognition (audio). Machine learning is often thought of as a toolbox of several algorithms; deep learning is the subset of approaches that mostly use neural networks, a type of algorithm loosely based on the human brain.
A deep learning model learns to perform classification tasks directly from images, text, or sound. A neural network architecture is commonly used to implement deep learning. The number of layers in a network defines its depth: the more layers, the deeper the network. Traditional neural networks have two or three layers, whereas deep neural networks can include hundreds. Deep learning is especially well suited to identification applications such as face recognition, text translation, voice recognition, and advanced driver assistance systems, including lane classification and traffic sign recognition.
Figure: Deep Learning is a subset of Machine Learning, which is in turn a subset of Artificial Intelligence.
As the figure above shows, machine learning (ML), deep learning (DL), and artificial intelligence (AI) are all related. Deep learning is a collection of algorithms inspired by the workings of the human brain in processing data and creating patterns for use in decision making; these algorithms expand, improve, and refine the idea of a single model architecture termed the Artificial Neural Network (ANN). Later in this course we shall go deeper into neural networks; for now, a quick overview of neural networks is provided below, followed by a discussion of the various deep learning algorithms, such as CNN, RNN, auto-encoders, GAN, and others.
Neural Networks: Just like the human brain, neural networks consist of neurons. Each neuron takes in signals as input, multiplies them by weights, adds them together, and then applies a non-linear function. These neurons are arranged in layers that are stacked next to each other.
Neural networks have proven to be effective function approximators. We can presume that every behaviour and system can be represented mathematically at some level (sometimes by an incredibly complex function). If we can find that function, we know everything there is to know about the system; however, locating the function can be difficult, so we use neural networks to estimate it.
A deep neural network is one that incorporates several nonlinear processing layers, makes use of
simple pieces that work in parallel, and takes its cues from the biological nervous systems of
living things. There is an input layer, numerous hidden layers, and an output layer that make up
this structure. Each hidden layer takes as its input the information that was output by the layer
that came before it and is connected to the other layers via nodes, also known as neurons.
To understand basic deep neural networks, we need a brief understanding of the various algorithms, which are given below.
To understand further, let's look at an example. Say we need to recognise pictures that contain a tree. Photos are fed into the network, and the system produces results. We can evaluate those results against the known answers and adjust the network accordingly. As more photographs are passed through the network, the number of errors decreases. We can then feed it an unknown image, and it will tell us whether or not that image contains a tree.
Feed-forward neural networks (FNN): Typically, feed-forward neural networks (FNN) are fully connected, which means that each neuron in one layer is connected to every neuron in the next layer. The structure described here is called a "Multilayer Perceptron" (MLP). A multilayer perceptron can learn non-linear associations between the data, in contrast to a single-layer perceptron, which can only learn linearly separable patterns. FNNs perform exceptionally well on tasks like classification and regression. Compared with other machine learning algorithms, they do not converge so easily, but the more data they have, the higher their accuracy.
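The numeric sketch below shows a single forward pass through a small fully connected network in numpy: every neuron computes a weighted sum of its inputs and applies a non-linear function. The layer sizes and random weights are arbitrary:

import numpy as np

def relu(z):
    return np.maximum(0, z)   # a common non-linear activation

rng = np.random.default_rng(1)
x = rng.random(4)                           # 4 input features

W1, b1 = rng.random((8, 4)), np.zeros(8)    # hidden layer with 8 neurons
W2, b2 = rng.random((3, 8)), np.zeros(3)    # output layer with 3 neurons

h = relu(W1 @ x + b1)    # weighted sums plus non-linearity
out = W2 @ h + b2        # raw output scores
print(out)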
Convolutional Neural Networks (CNN): The term "convolution" refers to the mathematical operation utilised by convolutional neural networks (CNN). The idea that underlies them is that, rather than linking each neuron with all of the neurons in the next layer, we connect it with only a select few (its receptive field). They act as a regularised form of feed-forward network, which helps avoid overfitting, the situation in which a model cannot generalise because it has only memorised the data it has already seen. Because of this, CNNs are particularly skilled at capturing the spatial relationships in data. As a result, computer vision is their primary application, including image classification, video recognition, medical image analysis, and self-driving automobiles, tasks where they achieve near-superhuman results.
Due to their adaptability, they are also ideal for combining with other types of models, such as recurrent networks and auto-encoders. The recognition of sign languages is one such example.
Recurrent Neural Networks (RNN) are utilised in time series forecasting because they are
ideal for time-related data. They employ some type of feedback, in which the output is fed back
into the input. You can think of it as a loop that passes data back to the network from the output
to the input. As a result, they are able to recall previous data and use it to make predictions.
Researchers have transformed the original neuron into more complicated structures such as GRU
units and LSTM Units to improve performance. Language translation, speech production, and
text to speech synthesis have all employed LSTM units extensively in natural language
processing.
Recursive Neural Networks: Another type of recurrent network is the recursive neural network, which is structured in a tree-like manner and can therefore model the hierarchical structure of the training dataset.
They are frequently utilised in NLP applications like audio-to-text transcription and sentiment analysis, because these tasks relate to binary trees, contexts, and natural-language parsers. They are, however, typically much slower than recurrent networks.
Auto-Encoders (Auto Encoder Neural Networks) are a type of unsupervised technique that
is used to reduce dimensionality and compress data. Their technique is to try and make the
output equal to the input. They are attempting to recreate the data.
An encoder and a decoder are included in Auto-Encoders. The encoder receives the input
and encodes it in a lower-dimensional latent space. Whereas, the decoder is used
to decode that vector back to the original input.
Figure: An auto-encoder. The Encoder compresses the Input into a Code, and the Decoder reconstructs the Output from that Code.
Restricted Boltzmann Machines (RBM) are stochastic neural networks that can learn a probability distribution over their inputs, and so have generative capabilities. They differ from other networks in that they have only input and hidden layers (no outputs).
In the forward phase of training, they take the input and create a representation of it; in the backward pass, they rebuild the original input from that representation (similar to an auto-encoder, but within a single network).
Several RBMs can be stacked on top of each other to form a Deep Belief Network. These look like fully connected layers, but they are trained differently.
Generative Adversarial Networks (GANs): Ian Goodfellow introduced Generative Adversarial Networks (GANs) in 2014, and they are built on a simple but elegant idea: suppose you need to generate data, such as photos. What do you do?
You construct two models. You train the first to make fake data (the generator) and the second to tell the difference between real and fake data (the discriminator). Then you turn them against one another.
The generator becomes better and better at image generation, as its ultimate purpose is to mislead the discriminator. The discriminator becomes better and better at distinguishing fake from real images, as its purpose is to avoid being tricked. The result is that we now get extremely realistic fake data from the generator.
Video games, astronomical imagery, interior design, and fashion are all examples of Generative Adversarial Networks in action. Essentially, you can utilise GANs whenever your field involves images. Do you recall deepfake videos? Those were all created by GANs.
Transformers are also very new; they are mostly employed in language applications and are making recurrent networks obsolete. They are based on the concept of "attention," which instructs the network to focus on particular pieces of the data.
Instead of complicating LSTM units further, you can use attention mechanisms to assign varying weights to different regions of the input based on their importance. The attention mechanism is simply another weighted layer whose sole purpose is to adjust the weights so that some parts of the input are given greater weight than others.
In practice, transformers are made up of stacked encoders (the encoder layer) and stacked decoders (the decoder layer), including several attention layers (self-attention and encoder-decoder attention).
Graph Neural Networks: Deep learning does not generally operate well with unstructured data, and there are many real-world circumstances in which such data is organised as a graph: consider social networks, chemical molecules, knowledge graphs, and location information.
Graph neural networks are used to model graph data. They locate the connections between nodes in a graph and convert them into numeric vectors, as if producing an embedding. As a result, their output can be used in any other machine learning model to perform tasks such as clustering, classification, and so on.
Ensemble learning is a general meta-approach to machine learning that combines predictions from different models to improve predictive performance.
Although you can create an apparently infinite number of ensembles for any predictive modelling problem, the subject of ensemble learning is dominated by three methods: bagging, stacking, and boosting. These are the three primary classes of ensemble learning methods, and it is essential to understand each one thoroughly.
• Bagging Ensemble learning is the process of fitting multiple decision trees to various
samples of the same dataset and averaging the results.
• Stacking Ensemble learning is fitting multiple types of models to the same data and
then using another model to learn how to combine the predictions in the best way
possible.
• Boosting Ensemble Learning entails successively adding ensemble members that
correct prior model predictions and produce a weighted average of the predictions.
(I) Bagging Ensemble Learning: Bagging ensemble learning involves fitting numerous decision trees to various samples of the same dataset, and then averaging the results of those fits to produce a final prediction.
In most cases, this is accomplished with a single machine learning method, almost invariably an unpruned decision tree, training each model on a different sample drawn from the same training dataset. After that, straightforward statistical approaches such as voting or averaging are used to aggregate the predictions generated by each individual member of the ensemble.
The most essential component of the technique is the manner in which each data sample is prepared to train members of the ensemble. Every model receives its own customised portion of the dataset: rows (examples) are selected at random from the dataset, with replacement. When a row is selected, it is put back into the pool so that it can be selected again for the same training sample. This means that within a given training sample, a particular row of data may be selected zero times, once, or multiple times.
This type of sample is known as a bootstrap sample. In statistics, this approach is a way of estimating a statistical quantity from a limited data sample, and it is typically applied to fairly small data sets. If you draw a number of distinct bootstrap samples, estimate the statistical quantity on each, and then average the estimates, you get a better overall estimate of the desired quantity than if you estimated it once from the full dataset.
In the same way, several training samples can be compiled, used to fit predictive models, and then used to produce predictions. Most of the time, it is preferable to average the predictions made by all of the models rather than to fit a single model directly to the training dataset.
The following is a concise summary of the most important aspects of bagging:
• Take bootstrap samples of the training dataset.
• Fit an unpruned decision tree on each sample.
• Combine the predictions by voting or averaging.
In a nutshell, bagging works because it varies the training data used to fit each member of the ensemble, which results in models that are skilful yet different from one another.
Figure: Bagging ensemble learning. The input (X) is passed to several models trained on different bootstrap samples, and their predictions are combined to produce the output (y).
Bagging is a general strategy that is easy to extend. For instance, further alterations can be made to the training dataset, the method used to fit the training data can be changed, and the manner in which predictions are combined can be altered.
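A minimal bagging sketch, assuming a recent scikit-learn (1.2 or later, where the parameter is named 'estimator'); the dataset is an illustrative choice:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # an unpruned tree (no depth limit)
    n_estimators=50,
    bootstrap=True,                       # sample rows with replacement
    random_state=0,
)
print(cross_val_score(bagging, X, y, cv=5).mean())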
(II) Stacking Ensemble Learning: Stacked Generalization, often known simply as "stacking," is an ensemble strategy that seeks a diverse group of members by varying the types of models fitted to the training data, and then uses another model to aggregate their predictions. It requires fitting several different kinds of models to the same data, and then using a further model to learn how best to integrate their predictions.
There is a specific vocabulary for stacking. The individual models that comprise an ensemble are
referred to as level-0 models, whereas the model that integrates all of the predictions is referred
to as a level-1 model.
Although there are often only two levels of models applied, you are free to apply as many levels
as you see fit. For instance, instead of a single level-1 model, we might have three or five level-1
models and a single level-2 model that integrates the forecasts of level-1 models to generate a
prediction. This would allow us to make more accurate predictions.
Any machine learning model can be used to integrate the predictions, but most users prefer linear models, such as linear regression for regression tasks and logistic regression for binary classification. Because of this, the more difficult parts of the modelling are handled by the lower-level ensemble members, while a straightforward model learns how to combine their various predictions.
Consequently, it is advisable to use a variety of models that learn or are constructed in very different ways. This ensures that they make different assumptions and, as a consequence, that their prediction errors are less likely to be correlated with one another.
Figure: Stacking ensemble learning. The input (X) is passed to several different level-0 models, and a level-1 model combines their predictions to produce the output (y).
Many popular ensemble algorithms are based on this approach, including (a small sketch follows this list):
• Stacked Models (canonical stacking)
• Blending
• Super Ensemble
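A minimal stacking sketch, assuming scikit-learn: three diverse level-0 models with logistic regression as the level-1 combiner, as the text above recommends. The dataset is an illustrative choice:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
level0 = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("svm", SVC(probability=True, random_state=0)),
]
stack = StackingClassifier(estimators=level0,
                           final_estimator=LogisticRegression(max_iter=1000))
print(cross_val_score(stack, X, y, cv=5).mean())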
(III) Boosting Ensemble Learning: Boosting is an ensemble strategy that alters the training data so that it focuses on the examples that earlier models got wrong. It works by adding members to the ensemble one at a time, each new member correcting the predictions produced by the models that came before it. The final result is a weighted average of the predictions.
The single most important advantage of boosting ensembles is their ability to correct errors in the predictions. The models are fitted and added to the ensemble one at a time, so the second model attempts to correct what the first model predicted, and so on.
Most of the time, this is accomplished using weak learners: relatively straightforward decision trees that make only one or a few decisions. The forecasts of the weak learners are merged by simple voting or averaging, with each learner's contribution weighted according to how well it performs. The objective is to create a "strong learner" out of a number of purpose-built "weak learners."
The training dataset is usually left unchanged; instead, the learning algorithm is adjusted to pay more or less attention to specific examples (rows of data) depending on how well they were predicted by the ensemble members added earlier. For instance, a weight can be assigned to each row of data to indicate how much focus the learning algorithm must give it while fitting the model.
• Give more weight to examples that are hard to guess when training.
• Add members of the ensemble one at a time to correct the predictions of earlier models.
• Use a weighted average of models to combine their predictions.
The idea of turning a group of weak learners into a strong learner was first developed in theory, and many early algorithms failed to realise it well. It was not until the Adaptive Boosting (AdaBoost) algorithm was developed that boosting was demonstrated to be an effective way of combining methods. Since AdaBoost, many boosting methods have been created, and some, like stochastic gradient boosting, may be among the best techniques for classification and regression on tabular (structured) data.
Figure: Boosting ensemble learning. Models are added sequentially (Model 1, Model 2, Model 3, ...), each trained on a weighted sample that emphasises the examples earlier models mispredicted; their weighted predictions are combined to produce the output (y).
To summarize, many popular ensemble algorithms are based on this approach, including (a small sketch follows this list):
• AdaBoost (Adaptive Boosting)
• Gradient Boosting Machines, including Stochastic Gradient Boosting
• XGBoost, LightGBM, and CatBoost
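A minimal AdaBoost sketch, assuming a recent scikit-learn (1.2 or later for the 'estimator' parameter name); decision stumps serve as the weak learners, and the dataset is an illustrative choice:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),   # a weak "stump" learner
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
print(cross_val_score(boost, X, y, cv=5).mean())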
In this unit we discussed the basic concepts of machine learning and the various machine learning algorithms. The unit also covered reinforcement learning and its related algorithms. Thereafter we discussed the concept of deep learning and the various techniques involved in it. The unit finally discussed ensemble learning and its related methods.
9.8 SOLUTIONS/ANSWERS
Q2 Briefly discuss the major function or use of machine learning algorithms.
Solution: Refer to Section 9.2.
Q5 Compare the concepts of Classification, Regression, and Clustering, and list the algorithms in the respective categories.
Solution: Refer to Section 9.3.
9.9 FURTHER READINGS
Prof. Ela Kumar, "Artificial Intelligence", First Edition, Dreamtech Press, 2020. ISBN: 9789389795134.
Stephen Marsland, "Machine Learning: An Algorithmic Perspective", 2nd Edition, CRC Press, 2015.
Tom Mitchell, "Machine Learning", 1st Edition, McGraw-Hill, 1997.
Peter Flach, "Machine Learning: The Art and Science of Algorithms that Make Sense of Data", 1st Edition, Cambridge University Press, 2012.
UNIT 10 CLASSIFICATION
Structure
10.1 Introduction
10.2 Objectives
10.3 Understanding of Supervised Learning
10.4 Introduction to Classification
10.5 Classification Algorithms
10.5.1 Naïve Bayes
10.5.2 K-Nearest Neighbour (K-NN)
10.5.3 Decision Trees
10.5.4 Logistic Regression
10.5.5 Support Vector Machines
10.6 Summary
10.7 Solutions/Answers
10.8 Further Readings
10.1 INTRODUCTION
What exactly does learning entail, anyway? What exactly is meant by "machine learning"? These are philosophical questions, but we won't focus too much on philosophy in this lesson; the whole focus will be on gaining a solid understanding of how things work in practice. Many of the ideas covered in this unit, such as classification and clustering, are also addressed in the subject of data mining, so we will investigate those concepts here once again. Therefore, in order to achieve a better understanding, the first step is to differentiate between the two fields of study known as data mining and machine learning.
It's possible that, at their core, data mining and machine learning are both about learning from data and improving decision-making. However, they approach things in different ways. To get started, let's begin with the most important question: what exactly is the difference between Data Mining and Machine Learning?
What is data mining? Data mining is a subset of business analytics that involves exploring an
existing huge dataset in order to discover previously unknown patterns, correlations, and
anomalies that are present in the data. This process is referred to as "data exploration." It enables
us to come up with wholly original ideas and perspectives.
What exactly is meant by "machine learning"? Machine learning is a subfield of artificial intelligence (AI). It involves computers performing analyses on large data sets, after which the computers "learn" patterns that help them make predictions about new data sets. A person does not need to interact with the computer for it to learn from the data; the initial programming and possibly some fine-tuning are all that are required.
There are a number of parallels between the two ideas, Data Mining and Machine Learning, but there are also some important distinctions. The following are some of the most important distinctions between the two:
• Machine learning goes beyond what has happened in the past to make predictions about future events based on the pre-existing data. Data mining, on the other hand, consists of looking for patterns that already exist in the data.
• At the beginning of the data mining process, the 'rules' or patterns to be found are unknown. In contrast, in machine learning the computer is typically provided with some rules or variables to follow in order to comprehend the data and learn from it.
• Data mining is a more manual process that depends on human involvement and decision making. With machine learning, by contrast, once the foundational principles are established, the process of extracting information, "learning," and refining is fully automated and does not require human participation. To put it another way, the machine becomes more intelligent by itself.
• Data mining finds patterns in an existing dataset (like a data warehouse). Machine learning, on the other hand, is trained on a "training" data set, which teaches the computer how to make sense of data and then how to make predictions about fresh data sets.
The approaches to data mining problems depend on the type of information or knowledge to be mined. We will emphasise three different approaches: Classification, Clustering, and Association Rules.
The classification task puts data into groups or classes that have been set up in advance. The value of a user-specified goal attribute indicates the class of a tuple. Tuples are made up of one or more predicating attributes and a goal attribute. The task is to find some kind of relationship between the predicating attributes and the goal attribute, so that the discovered information or knowledge can be used to predict the class of new tuples.
The purpose of the clustering process is to create distinct classes from groups of tuples that share characteristic values. Clustering defines a mapping that takes as input a database of tuples and an integer value k, and assigns the tuples to the various clusters.
The idea is to increase the degree of similarity within a class while decreasing the degree of similarity between classes. There is no goal attribute in the clustering process: clustering is an example of unsupervised classification, in contrast to classification, which is supervised by the goal attribute.
The goal of association rule mining is to find interesting connections between elements in a data set. Its initial use was for "market basket" data. A rule is written as X => Y, where X and Y are two disjoint sets of items. Support and confidence are the two metrics for any rule. The aim is to identify, given a user-specified minimum support and minimum confidence, the rules whose support and confidence are above those minimums.
The distance measure determines the distance between items, or their dissimilarity. The following
are the measures used in this unit:
Euclidean distance: dis(ti, tj) = √( Σh=1..k (tih − tjh)² )
Manhattan distance: dis(ti, tj) = Σh=1..k |tih − tjh|
where ti and tj are tuples and h indexes the different attributes, which can take values from 1 to k.
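As a minimal sketch, these two measures translate directly into Python (the function names and
sample tuples below are our own):

    import math

    def euclidean(ti, tj):
        # square root of the sum of squared attribute differences
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(ti, tj)))

    def manhattan(ti, tj):
        # sum of the absolute attribute differences
        return sum(abs(a - b) for a, b in zip(ti, tj))

    # Example with k = 3 attributes:
    print(euclidean([1, 2, 3], [4, 6, 3]))   # 5.0
    print(manhattan([1, 2, 3], [4, 6, 3]))   # 7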
There are some clear differences between the two, though. But as businesses try to get better at
predicting the future, machine learning and data mining may merge more in the future. For
example, more businesses may want to use machine learning algorithms to improve their data
mining analytics.
Machine learning algorithms use computational methods to "learn" information directly from
data, without using an equation as a model. As more examples are available for learning, the
algorithms get better and better at what they do.
Machine learning algorithms look for patterns in data that occur naturally. This gives you
more information and helps you make better decisions and forecasts. They are used every
day to make important decisions in diagnosing medical conditions, trading stocks,
predicting energy load, and more. Machine learning is used by media sites to sort through
millions of options and suggest songs or movies. It helps retailers figure out what their
customers buy and how they buy it. With the rise of "big data," machine learning has
become very important for solving problems in areas like:
• Computational finance, for applications such as credit scoring and algorithmic trading
• Image processing and computer vision, for face identification, motion detection, and
object detection
• Computational biology, for tumor detection, drug development, and DNA sequencing
• Production of energy, for pricing and load forecasting
• Automotive, aerospace, and manufacturing, for predictive maintenance
• Natural language processing
In general, Classical Machine Learning Algorithms can be put into two groups: Supervised
Learning Algorithms, which use data that has been labelled, and Un-Supervised Learning
Algorithms, which use data that has not been labelled and are used for Clustering. We will talk
more about Clustering in Unit 15, which is part of Block 4 of this course.
In this unit, we will discuss the Supervised Learning Algorithms, which are mainly used for
classification.
10.2 OBJECTIVES
To use machine learning techniques effectively, you need to know how they work. You can't just
use them without knowing how they work and expect to get good results. Different techniques
work for different kinds of problems, but it's not always clear which techniques will work in a
given situation. You need to know something about the different kinds of solutions.
Every machine learning workflow starts with a few basic questions: how much data are you
working with, what is the nature of that data, and what insights do you want to get from it?
Your responses to these questions will assist you in determining whether supervised or
unsupervised learning is best for you.
Workflow at a Glance:
1. ACCESS and load the data.
2. PREPROCESS the data.
3. DERIVE features from the preprocessed data.
4. TRAIN models using the features derived in step 3.
5. ITERATE to find the best model.
Technically, supervised learning means learning a function that gives an output for a given input
based on a set of input-output pairs that have already been defined. It does this with the help of
something called "training data," which is a set of examples for training.
In supervised learning, the data used for training is labelled. For example, every shoe is labelled
as a shoe, and the same goes for every pair of socks. This way, the system knows the labels, and
if it sees a new type of shoe, it will recognise it as a "shoe" even though it was never told to do
so explicitly.
In the example above, the picture of shoes and the word "shoes" together form a training
input-output pair: the picture is the input and the word "shoes" is the output. After learning
from hundreds or thousands of different pictures of shoes and socks together with the labels
"shoes" and "socks," our system will know what to do when given only a new picture of shoes:
it will output the label "shoes".
Supervised ML is often represented by the function y = f(x), where x is the input data and y is the
output variable, a function of x that needs to be predicted. In training data, each example pair is
made up of an input, which is usually a vector of features describing a sample, and the desired
output value, which is called the "supervisory signal", a name whose meaning is clear.
In fact, the goal of supervised machine learning is to build a model that can make predictions
based on evidence even when there is uncertainty. A supervised learning algorithm uses a known
set of input data and known responses to the data (output) to train a model to make reasonable
predictions about the response to new data.
Example: Predicting heart attacks with the help of supervised learning: Let's say doctors want to
know if someone will have a heart attack in the next year. They have information about the age,
weight, height, and blood pressure of past patients. They know if the patients who were there
before had heart attacks within a year. So the problem is making a model out of the existing data
that can tell if a new person will have a heart attack in the next year.
The following steps are involved in supervised learning, and they are self-explanatory:
1. Determine the Type of Training Examples
2. Gather the Training Data Set
3. Determine the Input Feature Representation of the Learned Function
4. Determine the Structure of the Learned Function and the Corresponding Learning Algorithm
5. Run the Learning Algorithm on the Training Set
6. Evaluate the Accuracy of the Learned Function Using Values from Test Set
There are some common issues which are generally faced when one applies the Supervised Learning, and
they are listed below:
(i) Training and classifying require a lot of computer time, especially when big data is involved.
(ii) Overfitting: The model may learn so much from the noise in the data that it treats the noise
as a concept to be learned rather than as error.
(iii) Unlike unsupervised learning, if an input does not fit into any known class, the model will
still force it into one of the existing classes instead of creating a new one.
Let's discuss some of the practical applications of supervised machine learning. For beginners
at least, knowing what supervised learning achieves is probably as important as knowing what
supervised learning is.
A very large number of practical applications of the method can be outlined, but the following
are some of the common ones:
a) Detection of spam
b) Detection of fraudulent banking or other activities
c) Medical Diagnosis
d) Image recognition
e) Predictive maintenance
With increasing applications each day in all the fields, machine learning knowledge is an
essential skill.
☞ Check Your Progress 1
1. Compare between Supervised and Un-Supervised Learning.
……………………………………………………………………………………………
……………………………………………………………………………………………
2. List the Steps Involved in Supervised Learning
……………………………………………………………………………………………
……………………………………………………………………………………………
3. What are the Common Issues Faced While Using Supervised Learning
……………………………………………………………………………………………
……………………………………………………………………………………………
Before moving ahead, let us understand some of the key terms that will occur frequently in this
course. They are listed below:
Classification: The process of organizing data into a predetermined number of categories
is referred to as classification. Finding out which category or class a new collection of
data falls under is the primary objective of a classification problem. Both structured and
unstructured data can be used for classification:
Structured data (data that sits in a fixed field in a file or record is called "structured
data"). Most structured data is kept in a relational database (RDBMS).
Unstructured data (unstructured data may have a natural structure, but it is not
organized in a predictable way). There is no data model, and the data is stored in the
format in which it was created. Rich media, text, social media activity, surveillance
images, and so on, are all types of unstructured data.
Following are some of the terminologies frequently encountered in machine learning
classification:
Classifier: A classifier is an algorithm that assigns the data you give it to a certain
category, i.e. an algorithm that performs classification on a dataset.
Classification model: A classification model uses the training data to work out how the
input values relate to the classes. It then predicts the class labels or categories of the
new data.
Feature: A property of any object (real or virtual) that can be measured on its own is
called a feature.
Classification predictive modeling involves assigning a class label to input examples. Some of
the most popular algorithms used for classification are:
k-Nearest Neighbors
Decision Trees
Naive Bayes
Random Forest
Gradient Boosting
Binary classification algorithms can be changed to work for problems with more than two
classes. This is done by fitting multiple binary classification models for each class vs. all other
classes (called "one-vs-rest") or one model for each pair of classes (called one-vs-one).
One versus the Rest: Fit one binary classification model for each class versus all of the
other classes in the dataset.
One-versus-one: Fit one binary classification model for each pair of classes using the
one-on-one comparison method.
Binary classification techniques such as logistic regression and support vector machine are two
examples of those that are capable of using these strategies for multi-class classification.
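A hedged sketch of both strategies, assuming the scikit-learn library is available (the iris data
set here is only an illustrative three-class problem):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

    X, y = load_iris(return_X_y=True)   # three classes

    # One-vs-Rest: one binary model per class vs. all other classes
    ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    # One-vs-One: one binary model per pair of classes
    ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

    print(len(ovr.estimators_))   # 3 (one per class)
    print(len(ovo.estimators_))   # 3 (one per pair: 3C2)

With m classes, one-vs-rest fits m models while one-vs-one fits m(m−1)/2, which is why
one-vs-rest is usually cheaper when there are many classes.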
Multi-label classification: Multi-label classification is a classification task in which each
sample is mapped to a collection of target labels, i.e. the task predicts one or more classes for
each sample. For example, a news story can be about Games, People, and a Location all at the
same time.
Note : Classification algorithms used for binary or multi-class classification cannot be used
directly for multi-label classification.
Specialized versions of standard classification algorithms can be used, so-called multi-label
versions of the algorithms, including multi-label Decision Trees, multi-label Random Forests,
and multi-label Gradient Boosting. Another approach is to use a separate classification
algorithm to predict the labels for each class.
The different types of classifications discussed above, have to deal with different type of
learners, and Learners in Classification Problems are categorized into following two types :
1. Lazy Learners: Lazy Learner will first store the training dataset, and then it will wait until it is given
the test dataset. In the case of the Lazy learner, classification is done based on the information in the
training dataset that is most relevant to the question at hand. Less time is needed for training, but more
time is needed for making predictions. K-NN algorithm and Case-based reasoning are two examples.
2. Eager Learners: Eager Learners build a classification model from the training dataset before they
receive a test dataset. In contrast to lazy learners, eager learners spend more time on training and less
time on making predictions. Decision Trees, Naive Bayes, and ANN are some examples.
In the lazy learner approach, by contrast, the learner waits until the last moment to build a
model for classifying a given test tuple; that is, the lazy learner simply stores the training
tuples and does not generalize until it is given a test tuple. After receiving the test tuple, it
classifies the tuple based on how similar it is to the stored training tuples. Lazy learning
methods therefore do less work when a training tuple is presented but more work when
classifying or making a prediction. Because lazy learners keep the training tuples, which are
also called "instances," they are also called "instance-based learners".
When classifying or making a prediction, lazy learners can take a lot of processing power. They
need efficient storage techniques and are well suited to implementation on parallel hardware.
They offer little explanation or insight into the structure of the data. Lazy learners do, however,
naturally support incremental learning. They can make models of complex decision spaces with
hyper-polygonal shapes that other learning algorithms may not be able to do as well (such as
hyper-rectangular shapes modeled by decision trees). The k-nearest neighbour classifier and the
case-based reasoning classifier are both types of lazy learners.
☞ Check Your Progress 2
4. Compare between Multi Class and Multi Label Classification
……………………………………………………………………………………
……………………………………………………………………………………
5. Compare between structured and unstructured data
……………………………………………………………………………………
……………………………………………………………………………………
6. Compare between Lazy learners and Eager Learners algorithms for machine learning.
……………………………………………………………………………………
……………………………………………………………………………………
The Classification algorithm is a type of Supervised Learning that uses the training data to figure
out the category of new observations. This method is used to figure out what kind of thing a new
observation is. Classification is the process by which a computer programme learns from a set of
data or observations and then sorts new observations into different classes or groups. "Yes" or
"No," "0" or "1," "Spam" or "Not Spam," "Cat" or "Dog," and so on are all good examples.
Classes are also referred to by other names, such as categories, targets, and labels.
In classification, the output variable is not a value but a category, such as "Green or Blue," "Fruit
or Animal," etc. This is different from regression, where the output variable is a value. Since the
classification method is a supervised learning method, it needs data that has been labelled in
order to work. This means that the implementation of the algorithm includes both the input and
the output that go with it.
As the name suggests, classification algorithms do the job of predicting a label or putting a
variable into a category (categorization). For example, classifying something as "socks" or
"shoes" from our last example. Classification Predictive Algorithm is used every day in the spam
detector in emails. It looks for features that help it decide if an email is spam or not spam.
The primary objective of the Classification algorithm is to determine the category of the dataset
that is being provided, and these algorithms are primarily utilised to forecast the output for the
data that is categorical in nature. A discrete output function, denoted by y, is mapped to an input
variable, denoted by x, in a classification process. Therefore, y = function (x), where y denotes
the categorical output. The best example of an ML classification algorithm is Email Spam
Detector.
The diagram below can be used to get a better understanding of classification methods. It
depicts two classes, Class A and Class B, each of whose members share characteristics with one
another that also distinguish them from the other class.
[Figure: a two-dimensional plot with two groups of points, Class A and Class B, separated by a
decision boundary]
The question arises as a result of the existence of a variety of algorithms under both supervised
and unsupervised learning. How Should One Choose Which Algorithm to Employ? The task of
selecting the appropriate machine learning algorithm can appear to be insurmountable because
there are dozens of supervised and unsupervised machine learning algorithms, and each takes a
unique approach to the learning process. There is no single approach that is superior or
universally applicable. Finding the appropriate algorithm requires some amount of trial and
error; even highly experienced data scientists are unable to determine whether or not an
algorithm will work without first putting it to the test themselves. However, the choice of
algorithm also depends on the quantity and nature of the data being worked with, as well as the
insights that are desired from the data and the applications to which those insights will be put.
• Choose supervised learning if you need to train a model to make a prediction. Use
regression techniques when predicting the future value of a continuous variable, such as
a temperature or a stock price, and use classification techniques in situations such as
identifying makes of cars from webcam video footage or identifying spam emails.
• Choose unsupervised learning if you need to investigate your data and want to train a
model to find a decent internal representation, such as by dividing the data into clusters.
This type of learning allows for more freedom in exploring and representing the data.
Note : The algorithm for Supervised Machine Learning can be broken down into two basic
categories: regression algorithms and classification algorithms. We have been able to forecast the
output for continuous values using the Regression methods; but, in order to predict the
categorical values, we will need to use the Classification algorithms.
Let’s take a closer look at the most commonly used algorithms for supervised machine learning.
Classification algorithms can be further divided into two main categories, Linear Models and
Non-Linear Models, each of which includes various algorithms:
Linear Models: Logistic Regression, Support Vector Machines
Non-Linear Models: K-Nearest Neighbours, Kernel SVM, Naive Bayes, Decision Tree
Classification, Random Forest Classification
Once a classification model is built, its performance has to be evaluated. The main evaluation
methods are listed below:
1. Log Loss or Cross-Entropy Loss:
It is used for evaluating a classifier whose output is a probability value between 0 and 1.
For a good model, the value of log loss should be near 0.
2. Confusion Matrix:
The confusion matrix tells us how well the model works and gives us a matrix or table as
Output.
Sometimes, this kind of structure is called the error matrix.
The matrix is a summary of the results of the predictions. It shows how many predictions
were right and how many were wrong.
3. AUC-ROC curve:
The letters AUC and ROC stand for "Area Under the Curve" and "Receiver Operating
Characteristics curve," respectively.
This graph shows how well the classification model works at several different thresholds,
and the AUC-ROC Curve is also used to see how well a multi-class classification model
is doing.
The ROC curve is plotted using the True Positive Rate (TPR) on the Y-axis and the
False Positive Rate (FPR) on the X-axis.
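A small sketch of two of these evaluation tools, again assuming scikit-learn is available; the
true labels, hard predictions, and scores below are invented purely for illustration:

    from sklearn.metrics import confusion_matrix, roc_auc_score

    y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
    y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard class labels
    y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]    # predicted probabilities

    print(confusion_matrix(y_true, y_pred))
    # [[3 1]
    #  [1 3]]   rows = actual class, columns = predicted class

    print(roc_auc_score(y_true, y_score))   # area under the ROC curve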
Classification algorithms have several applications, Following are some popular applications or
use cases of Classification Algorithms:
o Detecting Email Spam
o Recognizing Speech
o Detection of Cancer tumor cells.
o Classifying Drugs
o Biometric Identification, etc.
☞ Check Your Progress 3
4. List the classification algorithms under the categories of Linear and Non-Linear Models. Also
Discuss the various methods used for evaluating a classification model
……………………………………………………………………………
……………………………………………………………………………
This is an example of a statistical classification, which estimates the likelihood that a particular sample
belongs to a particular class. It is founded on the Bayes theorem. When applied to big databases,
Bayesian classification demonstrates both improved accuracy and increased speed. In this section, we
will talk about the most basic kind of Bayesian classification.
"The effect of a given attribute value on a certain class is unaffected by the values of the other
attributes, i.e. the attributes are independent" is the fundamental underlying assumption of the naïve
Bayesian classification, which is the simplest form of Bayesian classification. This basic assumption is
also called class conditional independence.
Let's go into greater depth about the naïve Bayesian classification, shall we? But before we get into it,
let's take a moment to define the fundamental theorem that underpins this classification i.e. the Bayes
Theorem.
Bayes Theorem: In order to understand this theorem, let us first understand the meaning of the
following symbols:
X : a data sample whose class label is unknown
H : a hypothesis, such as "the sample X belongs to class C"
P(H) : the prior probability of H, i.e. the probability of H before the sample X is observed
P(X) : the probability that the data sample X is observed
P(X | H) : the probability of observing the sample X, given that hypothesis H holds
P(H | X) : the posterior probability of H, i.e. the probability that H holds given the observed sample X
Bayes' theorem relates these quantities as:
P(H | X) = P(X | H) P(H) / P(X)
Note: From the data sample X and the training data, we can estimate the parameters P(X), P(X | H),
and P(H). P(H | X) is the quantity that actually defines the likelihood that X belongs to a class C, but it
cannot be estimated directly from the data. Bayes' theorem serves exactly this purpose: it lets us
compute P(H | X) from the other three quantities.
Now after defining the Bayes theorem, let us explain the Bayesian classification with the help of an
example.
i) Consider the sample having an n-dimensional feature vector. For our example, it is a 3-dimensional
(Department, Age, Salary) vector with training data as given in the Figure 3.
ii) Assume that there are m classes C1 to Cm and an unknown sample X. The problem is to determine
the class to which X belongs. As per Bayesian classification, the sample is assigned to the class Ci if
the following holds:
P(Ci | X) > P(Cj | X) for all j such that 1 ≤ j ≤ m and j ≠ i
In other words, the class for the data sample X will be the class which has the maximum probability for
the unknown sample. Please note: P(Ci | X) will be found using:
P(Ci | X) = P(X | Ci) P(Ci) / P(X)
iii) The value of P(X) is constant for all the classes, therefore only P(X | Ci) P(Ci) needs to be
maximized. Also, if the classes are equally likely, then
P(C1) = P(C2) = ….. = P(Cm), and we only need to maximise P(X | Ci).
In our example,
P(C1) = 5/11 and P(C2) = 6/11
so P(C1) ≠ P(C2).
iv) Calculating P(X | Ci) may be computationally expensive if there is a large number of attributes. To
simplify the evaluation, in the naïve Bayesian classification we use the condition of class conditional
independence, that is, the values of the attributes are assumed to be independent of each other. In such a
situation:
P(X | Ci) = Πk=1..n P(xk | Ci) ….(4)
x3 as Salary = "Medium_Range". The two products P(X | Ci) P(Ci) work out to 0.032727 and
0.030303 respectively. Since the first of these two probabilities is higher, the sample data may be
classified into the Boss position. Kindly check that you obtain the same result from the decision tree.
The Naive Bayes classifier can also be implemented from scratch by following the steps below:
Step 1: Handling the Data: Load the data and split it into a training set and a test set.
Step 2: Summarizing the Data: Summarise the properties in the training data set so that the
probabilities can be calculated and predictions made.
Step 3: Making a Prediction: A particular prediction is made using the summary of the data set.
Step 4: Making all the Predictions: Generate predictions for the whole test data set using the
summarised training data.
Step 5: Evaluate Accuracy: Evaluate the accuracy of the prediction model on the test data set as the
percentage of correct predictions out of all predictions made.
Step 6: Tying it all Together: Finally, we tie all the steps together and form our own model of the
Naive Bayes Classifier.
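The steps above collapse into a few lines when a library implementation is used. A hedged sketch
with scikit-learn's Gaussian Naive Bayes (the data set and split ratio are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)      # Step 1: handle the data

    model = GaussianNB().fit(X_train, y_train)    # Step 2: summarise the training data
    preds = model.predict(X_test)                 # Steps 3-4: make the predictions
    print(accuracy_score(y_test, preds))          # Step 5: evaluate accuracy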
With the help of the following example, you can see how Naive Bayes' Classifier works:
Example: Let's say we have a list of weather conditions (Outlook) and a corresponding target
variable called "Play". Using this data set, we need to decide whether or not to play on a given
day, based on the weather.
If it is Sunny, should the player play?
To solve this problem, we follow these steps:
1. Convert the given data set into frequency tables.
2. Make a Likelihood table by working out the probability of each feature.
3. Use Bayes' theorem to calculate the posterior probability of each class.
Day   Outlook    Play
0     Rainy      Yes
1     Sunny      Yes
2     Overcast   Yes
3     Overcast   Yes
4     Sunny      No
5     Rainy      Yes
6     Sunny      Yes
7     Overcast   Yes
8     Rainy      No
9     Sunny      No
10    Sunny      Yes
11    Rainy      No
12    Overcast   Yes
13    Overcast   Yes
Frequency table for the weather conditions:
Weather      No    Yes
Overcast     0     5
Rainy        2     2
Sunny        2     3
Total        4     10
Likelihood table for the weather conditions:
Weather      No            Yes
Overcast     0             5              5/14 = 0.35
Rainy        2             2              4/14 = 0.29
Sunny        2             3              5/14 = 0.35
All          4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
P(Sunny | Yes) = 3/10 = 0.30, P(Yes) = 0.71, P(Sunny) = 0.35
So P(Yes | Sunny) = 0.30 * 0.71 / 0.35 ≈ 0.60
P(No | Sunny) = P(Sunny | No) * P(No) / P(Sunny)
P(Sunny | No) = 2/4 = 0.50, P(No) = 0.29, P(Sunny) = 0.35
So P(No | Sunny) = 0.50 * 0.29 / 0.35 ≈ 0.41
Since P(Yes | Sunny) > P(No | Sunny), on a Sunny day the player can play the game.
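The hand calculation above can be reproduced in a few lines of Python; all numbers come
straight from the likelihood table:

    p_sunny, p_yes, p_no = 0.35, 0.71, 0.29
    p_sunny_given_yes = 3 / 10    # 3 of the 10 "Yes" days are Sunny
    p_sunny_given_no = 2 / 4      # 2 of the 4 "No" days are Sunny

    p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
    p_no_given_sunny = p_sunny_given_no * p_no / p_sunny
    print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))   # 0.61 0.41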
8. Predicting a class label using naïve Bayesian classification. We wish to predict the class label
of a tuple using naïve Bayesian classification, given the training data as shown in Table-1 Below.
The data tuples are described by the attributes age, income, student, and credit rating.
The class label attribute known as "buys computer" can take on one of two distinct values—
specifically, "yes" or "no." Let's say that C1 represents the class buying a computer and C2
represents the class deciding not to buy a computer. We are interested in classifying X as having
the following characteristics: (age = youth, income = medium, student status = yes, credit rating
= fair).
This approach places items in the class to which they are "closest". It must determine the distance
between an item and a class. Classes are represented by a centroid (central value) and the
individual points. One of the algorithms that uses this idea is K-Nearest Neighbours.
We know that the classification task maps data into predefined groups or classes. Given a database/dataset
D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping
f : D → C where each ti is assigned to one class; that is, it divides the database/dataset D into the classes
specified in the set C.
Some of the most common techniques used for classification include Decision Trees, K-NN, etc. Most of
these techniques are based on finding distances or use statistical methods.
The distance measure finds the distance or dissimilarity between objects. The measures used in
this unit are as follows:
Euclidean distance: dis(ti, tj) = √( Σh=1..k (tih − tjh)² )
Manhattan distance: dis(ti, tj) = Σh=1..k |tih − tjh|
where ti and tj are tuples and h indexes the different attributes, which can take values from 1 to k.
In this section, we look at the distance based classifier i.e. the k-nearest neighbor classifiers.
A test tuple is compared to training tuples that are used in the classification process that are
similar to it. This is how nearest-neighbor classifiers work. There are n different characteristics
that can be used to define the training tuples. Each tuple can be thought of as a point located in a
space that has n dimensions. In this method, each and every one of the training tuples is
preserved within an n-dimensional pattern space. A K-nearest-neighbor classifier searches the
pattern space in order to find the k training tuples that are the most comparable to an unknown
tuple. These k training tuples are referred to as the "k nearest neighbours" of the unknown tuple.
A distance metric, like Euclidean distance, is used to define "closeness." The Euclidean distance
between two points or tuples, say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is
dist(X1, X2) = √( Σi=1..n (x1i − x2i)² )      (eq. 1)
In other words, for each numeric attribute we take the difference between the corresponding
values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it.
The square root of the accumulated sum gives the distance. In most cases, prior to making use
of eq. (1), we first normalize the values of every attribute. This helps prevent attributes with
initially large ranges (like income, for example) from outweighing attributes with initially
smaller ranges (such as binary attributes). Min-max normalization, for example, can be used to
change the value v of a numeric attribute A to v' in the range [0, 1]:
v' = (v − minA) / (maxA − minA)      (eq. 2)
where minA and maxA are the minimum and maximum values of attribute A.
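Eq. (2) translates directly into a one-line helper (the income values below are illustrative):

    def min_max_normalize(v, min_a, max_a):
        # maps v from the range [min_a, max_a] into [0, 1]
        return (v - min_a) / (max_a - min_a)

    print(min_max_normalize(30000, 10000, 90000))   # income 30000 -> 0.25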
For the purpose of k-nearest-neighbor classification, the unknown tuple is assigned to the class
that has the highest frequency among its k closest neighbours. When k equals 1, the class of the
training tuple that is assigned to the unknown tuple is the one that is most similar to the unknown
tuple in pattern space. It is also possible to utilise nearest neighbour classifiers for prediction,
which means that they can be used to deliver a real-valued forecast for a given unknown tuple.
The result that the classifier produces in this scenario is the weighted average of the real-valued
labels that are associated with the unknown tuple's k nearest neighbours.
Classification Using Distance (K-Nearest Neighbours) - Some of the basic points to be noted
about this algorithm are:
The training set includes classes along with other attributes. (Please refer to the training data given in
the Table given below).
The value of the K defines the number of near items (items that have less distance to the attributes of
concern) that should be used from the given set of training data (just to remind you again, training
data is already classified data). This is explained in point (2) of the following example.
A new item is placed in the class in which the most number of close items are placed. (Please refer to
point (3) in the following example).
The value of K should be ≤ Number_of_training_items. However, in our example, to limit
the size of the sample data, we have not followed this formula.
Example: Consider the following data, which tells us the person’s class depending upon gender and
height
2) Let us take only the height attribute for distance calculation and suppose K=5 then the following are
the near five tuples to the data that is to be classified (using Manhattan distance as a measure on the
height attribute).
3) On examination of the tuples above, we classify the tuple <Ram, M, 1.6> into the Short class, since
most of the tuples above belong to the Short class.
Example: To classify whether a special paper tissue is Fine or not, we use data from a questionnaire
survey (to get people's opinions) and objective testing with two properties (acid durability and strength).
Here are four training examples:
X1 = Acid Durability (seconds)   X2 = Strength (gram/cm²)   Y = Classification
7                                7                          Poor
7                                4                          Poor
3                                4                          Fine
1                                4                          Fine
Now, the firm is producing a new kind of paper tissue that is successful in the laboratory and has the
values X1 = 3 and X2 = 7 respectively. Can we make an educated judgement about the classification of
this novel tissue without doing yet another expensive survey?
1. Determine the value of the parameter K, the number of nearest neighbours. Suppose we use K = 3.
2. Compute the distance between the query instance and each of the training samples. The coordinates
of the query instance are (3, 7); rather than computing the distance itself, we compute the squared
distance, which is a more efficient calculation (no square root):
X1   X2   Square distance to query instance (3, 7)
7    7    (7−3)² + (7−7)² = 16
7    4    (7−3)² + (4−7)² = 25
3    4    (3−3)² + (4−7)² = 9
1    4    (1−3)² + (4−7)² = 13
3. Sort the distances and determine the nearest neighbours based on the K-th minimum distance:
X1   X2   Square distance to query (3, 7)   Rank (minimum distance)   Included in 3 nearest neighbours?
7    7    16                                3                         Yes
7    4    25                                4                         No
3    4    9                                 1                         Yes
1    4    13                                2                         Yes
4. Gather the category (Y) of the nearest neighbours. Notice in the second row of the last column that
the category of the nearest neighbour (Y) is not included, because the rank of this data point is greater
than 3 (= K):
X1   X2   Square distance to query (3, 7)   Rank   Included in 3 nearest neighbours?   Y = Category
7    7    16                                3      Yes                                 Poor
7    4    25                                4      No                                  -
3    4    9                                 1      Yes                                 Fine
1    4    13                                2      Yes                                 Fine
5. Use the simple majority of the categories of the nearest neighbours as the prediction value of the
query instance.
Since we have 2 Fine and 1 Poor (2 > 1), we can say that a new paper tissue that passed the lab test
with X1 = 3 and X2 = 7 is in the Fine category.
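A from-scratch sketch of this walk-through (K = 3, squared Euclidean distance, simple majority
vote); the function and variable names are our own:

    from collections import Counter

    training = [((7, 7), "Poor"), ((7, 4), "Poor"),
                ((3, 4), "Fine"), ((1, 4), "Fine")]
    query, k = (3, 7), 3

    def sq_dist(a, b):
        # squared Euclidean distance (no square root needed for ranking)
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # keep the k training tuples nearest to the query
    nearest = sorted(training, key=lambda t: sq_dist(t[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    print(votes.most_common(1)[0][0])   # Fine (2 Fine vs. 1 Poor)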
"However, distance cannot be determined using qualities that are categorical, as opposed to
quantitative, such as color." The preceding description operates under the presumption that all of
the attributes that are used to describe the tuples are numeric. When dealing with categorical
attributes, a straightforward way is to contrast the value of the attribute that corresponds to tuple
X1 with the value that corresponds to tuple X2. If there is no difference between the two (for
example, If tuple X1 and X2 both contain the blue color, then the difference between the two is
regarded as being equal to zero. If the two are distinct from one another (for example, if tuple X1
carries blue and tuple X2 carries red), then the comparison between them is counted as 1. It's
possible that other ways will incorporate more complex systems for differentiating grades (such
as in a scenario in which a higher difference score is provided, say, for blue and white than for
blue and black).
"What about the missing values?" If the value of a certain attribute A is absent from either the
tuple X1 or the tuple X2, we will, as a rule, assume the greatest feasible disparity between the
two. Imagine for a moment that each of the traits has been mapped to the interval [0, 1]. When it
comes to categorical attributes, the difference value is set to 1 if either one of the related values
of A or both of them are absent. If A is a number and it is absent from both the tuple X1 and the
tuple X2, then the difference is also assumed to be 1. If there is just one value that is absent and
the other value (which we will refer to as v 0) is present and has been normalised, Consequently,
we can either take the difference to be |1 – v' | or |0 – v' | (i.e., 1–v' or v'), depending on which of
the two is larger.
Nearest-neighbour classifiers use distance-based comparisons that give each attribute an equal
weight, so they can be less accurate when attributes are noisy or irrelevant. The method has,
however, been modified to incorporate attribute weighting and the pruning of noisy data tuples.
The choice of distance metric can be very important; the Manhattan (city block) distance or
another measure may also be used.
Nearest-neighbour classifiers can be very slow when classifying test tuples. If D is a training
database containing |D| tuples and k = 1, then in order to classify a given test tuple, it must be
compared with |D| training tuples. The total number of comparisons can be reduced to
O(log |D|) by first arranging the stored tuples into search trees and then performing the
comparisons. With a parallel implementation, the running time can be reduced to O(1), a
constant that is independent of the size of D. Partial distance calculations and editing of the
stored tuples can also be used to cut down classification time. In the partial distance method, we
use only some of the n attributes to compute the distance between two tuples. If this distance
exceeds a specified threshold, the procedure abandons the current stored tuple and continues
with the next one. The editing procedure removes training tuples that are not required; this
strategy is also known as pruning or condensing, because it minimises the number of stored
tuples.
☞ Check Your Progress 5
9. Apply KNN classification algorithm to the following data and predict value for (10,7)
for K = 3
Feature 1 Feature 2 Class
1 1 A
2 3 A
2 4 A
5 3 A
8 6 B
8 8 B
9 6 B
11 7 B
……………………………………………………………………………………………
……………………………………………………………………………………………
Given a data set D = {t1, t2, …, tn} where ti = <ti1, …, tih>, that is, each tuple is represented by h
attributes, assume that the database schema contains the attributes {A1, A2, …, Ah}. Also, let us suppose
that the classes are C = {C1, …, Cm}. Then a decision tree is a tree associated with D that has the
following properties:
• Each internal node is labelled with an attribute Ai.
• Each arc is labelled with a predicate that can be applied to the attribute at its parent node.
• Each leaf node is labelled with a class Cj.
Decision Tree Induction is the process of learning a classification using the inductive approach. During
this process, we create a decision tree from the training data. This decision tree can then be used for
making classifications. To define this, we need the following.
Let us assume that we are given probabilities p1, p2, .., ps whose sum is 1. Let us also define the term
Entropy, which is the measure of the amount of randomness, surprise, or uncertainty. Our basic goal in
the classification process is that the entropy of a classification should be zero: if there is no surprise,
the entropy is equal to zero. Entropy is defined as:
H(p1, p2, …, ps) = Σi=1..s ( pi * log(1/pi) )      (1)
This algorithm creates a tree using the algorithm given below and tries to reduce the expected number of
comparisons.
Algorithm: ID3 algorithm for creating a decision tree from the given training data.
Input: The training data and the attribute-list.
Output: A decision tree.
Step 1: Create a node N;
Step 2: If all of the sample data belong to the same class C (i.e. the probability of that class is 1),
then return N as a leaf node labelled with the class C.
Step 3: Return N as a leaf node if the attribute-list is empty, and label it with the most common class
in the training data; // majority voting
Step 4: Select the split-attribute, which is the attribute in the attribute-list with the highest
information gain;
Step 5: Label node N with the split-attribute;
Step 6: For each known value Ai of the split-attribute: // partition the samples
Let xi be the set of data from the training data that satisfies the condition split-attribute = Ai;
if xi is empty then
attach a leaf labelled with the most common class in the prior set of training data;
else
attach the node returned by a recursive call to the algorithm with xi as the training data and
attribute-list minus the split-attribute as the attribute list;
End of Algorithm.
Please note: The algorithm given above chooses the split attribute with the highest information gain,
which is calculated as follows:
Gain(D, S) = H(D) − Σi=1..s ( P(Di) * H(Di) )      (2)
where S is the set of new states {D1, D2, D3, …, Ds} and H(D) measures the amount of order
(entropy) in state D.
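Both formulas are easy to express as small helpers; a minimal sketch (probabilities are passed
in directly, and log base 2 is used, as the example below also does):

    import math

    def entropy(probs):
        # eq. (1): H(p1, ..., ps) = sum of pi * log(1/pi)
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    def gain(h_parent, partitions):
        # eq. (2); partitions is a list of (P(Di), H(Di)) pairs
        return h_parent - sum(w * h for w, h in partitions)

    print(round(entropy([9 / 14, 5 / 14]), 2))   # 0.94, as in the example below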
Consider the following data in which Position attribute acts as class
Now let us calculate gain for the departments using the formula at (2)
Since age has the maximum gain, this attribute is selected as the first splitting attribute. In the age
range 31-40 the class is not decided, while for the other ranges it is.
So we have to again calculate the splitting attribute for this age range (31-40). The tuples that belong
to this range are as follows:
Department    Position
Personnel     Boss
Admin         Boss
Admin         Assistant
Again, in the Personnel department all persons are Boss, while in the Admin department there is a tie
between the classes. So a person can be either Boss or Assistant in the Admin department.
Now the decision tree will be as follows:
Age?
- 21-30 → Assistant
- 41-50 → Boss
- 31-40 → Salary?
    - Low Range → Assistant
    - High Range → Boss
    - Medium Range → Department?
        - Personnel → Boss
        - Administration → Assistant/Boss
Figure 4: The decision tree using the ID3 algorithm for the sample data
Now we will take a new dataset and again apply the decision tree procedure described above to
classify the class of each tuple.
Here there are four independent variables that determine the dependent variable. The independent
variables are Outlook, Temperature, Humidity, and Wind. The dependent variable is whether to
play football or not.
As the first step, we have to find the parent node for our decision tree. For that, follow these steps:
Find the entropy of the class variable: E(S) = −[(9/14)log(9/14) + (5/14)log(5/14)] = 0.94
Note: here we typically take log to base 2. In total there are 14 yes/no examples, out of which 9 are
yes and 5 are no.
From the above data, for Outlook we can easily arrive at the following table:
Outlook      Play = Yes   Play = No   Total
Sunny        3            2           5
Overcast     4            0           4
Rainy        2            3           5
Total        9            5           14
Now we have to calculate the average weighted entropy, i.e. the sum, over the values of the
feature, of the weight of each value multiplied by its entropy.
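As a worked illustration (the individual entropies follow from eq. (1) with log base 2):
E(Sunny) = −(3/5)log(3/5) − (2/5)log(2/5) ≈ 0.971
E(Overcast) = 0 (all four examples are Yes)
E(Rainy) = −(2/5)log(2/5) − (3/5)log(3/5) ≈ 0.971
Weighted average: E(S, Outlook) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) ≈ 0.693
Information gain: IG(Outlook) = E(S) − E(S, Outlook) = 0.94 − 0.693 ≈ 0.247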
Now select the feature having the largest information gain. Here it is Outlook, so it forms the
root node of the decision tree.
Since Overcast contains only examples of class 'Yes', we can set it as Yes. That means if
Outlook is Overcast, football will be played. Our decision tree now has Outlook at the root,
with the Overcast branch ending in Yes.
The next step is to find the next node in our decision tree. We will now find the node under
Sunny. We have to determine which of Temperature, Humidity, or Wind has the higher
information gain.
Temperature   Play = Yes   Play = No   Total
Hot           0            2           2
Cool          1            1           2
Mild          1            0           1
Total         2            3           5
Similarly, we get the information gains for the other features. Here IG(Sunny, Humidity) is the
largest value, so Humidity is the node that comes under Sunny.
Humidity   Play = Yes   Play = No   Total
High       0            3           3
Normal     2            0           2
Total      2            3           5
For humidity, from the above table we can say that play will occur if humidity is normal, and will
not occur if humidity is high.
Logistic regression, which is part of the Supervised Learning method, is one of the most
popular Machine Learning algorithms. It is used to predict the categorical dependent
variable based on a set of independent variables.
Logistic regression predicts the outcome of a dependent variable that has a "yes" or "no"
answer. Because of this, the result must be a discrete or categorical value. It can be Yes
or No, 0 or 1, true or false, etc., but instead of giving the exact value as 0 or 1, it gives the
probabilistic values that lie between 0 and 1.
Logistic Regression is a lot like Linear Regression, but the way they are used is different.
Linear regression is used to solve regression problems, while logistic regression is used to
solve classification problems.
In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, which predicts two maximum values (0 or 1).
The curve from the logistic function shows how likely something is, like whether the
cells are cancerous or not, whether a mouse is overweight or not based on its weight, etc.
Logistic Regression is an important machine learning algorithm because it can use both
continuous and discrete datasets to give probabilities and classify new data.
Logistic regression can be used to classify observations based on different types of data, and it is
easy to figure out which variables are the most useful for classifying. The logistic function is
shown in the picture below:
[Figure: the S-shaped logistic curve, rising from 0 to 1 along the Y-axis; the threshold value is
0.5, so a point such as y = 0.8 is classified as 1 and a point such as y = 0.3 is classified as 0]
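A tiny sketch of the logistic (sigmoid) function behind that curve; the sample inputs are arbitrary:

    import math

    def sigmoid(z):
        # output always lies strictly between 0 and 1; 0.5 is the usual threshold
        return 1 / (1 + math.exp(-z))

    for z in (-3, 0, 3):
        print(z, round(sigmoid(z), 3))   # -3 -> 0.047, 0 -> 0.5, 3 -> 0.953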
Note: Logistic regression is based on the idea of predictive modeling as regression, which is why
it is called "logistic regression." However, it is used to classify samples, so it belongs to the
classification algorithms.
The statistical technique known as simple linear regression concerns the relationship between a
numerical response and a single numerical or categorical predictor, while multiple regression
concerns the relationship between a single numerical response and several numerical and/or
categorical predictors. What should be done, however, when the predictors are odd (nonlinear,
with an intricate dependence structure, and so on), or when the response is unusual (categorical,
count data, and so on)? In such cases we deal with odds, which are another way of measuring
the likelihood of an event, frequently applied in the context of gambling (and logistic
regression).
Logistic regression is a statistical approach for modelling a binary categorical variable using
numerical and categorical predictors, and this idea of Odds is commonly employed in it. We
suppose the outcome variable was generated by a binomial distribution, and we wish to build a
model with p as the probability of success for a given collection of predictors. There are other
alternatives, but the logit function is the most popular.
Example-1: In a survey of 250 customers of an auto dealership's service department, customers
were asked whether they would recommend the service to a friend. The number of people who
said "yes" was 210. Let "p" be the proportion of customers in the population from which the
sample was taken who would answer "yes" to the question. Find the sample proportion and the
sample odds.
Solution: The number of customers who would respond Yes in a simple random sample (SRS)
of size n has the binomial distribution with parameters n and p. The sample size is n = 250, and
the number who responded Yes is the count X = 210. Therefore, the sample proportion is
p' = 210/250 = 0.84.
Logistic regression works with odds rather than proportions. The odds are simply the ratio of
the proportions for the two possible outcomes. If p' is the proportion for one outcome, then
1 − p' is the proportion for the second outcome:
odds = p' / (1 − p')
A similar formula for the population odds is obtained by substituting p for p' in this expression.
Odds of responding Yes: For the customer service data, the proportion of customers who would
recommend the service in the sample is p' = 0.84, so the proportion of customers who would not
recommend the service department is 1 − p' = 1 − 0.84 = 0.16. The odds are therefore
0.84 / 0.16 = 5.25.
When people speak about odds, they often round to integers or fractions. If we round 5.25 to 5 =
5/1, we would say that the odds are approximately 5 to 1 that a customer would recommend the
service to a friend. In a similar way, we could describe the odds that a customer would not
recommend the service as 1 to 5.
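The whole calculation fits in a few lines of Python:

    n, x = 250, 210
    p = x / n               # sample proportion of "Yes" answers: 0.84
    odds = p / (1 - p)      # 0.84 / 0.16 = 5.25, i.e. roughly 5 to 1
    print(p, round(odds, 2))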
Support Vector Machine, also called Support Vector Classification, is a supervised and linear
Machine Learning technique that is most often used to solve classification problems. In this
section, we will take a look at Support Vector Machines, a new approach for categorising data
that has a lot of potential and can be used for both linear and nonlinear datasets. A support vector
machine, often known as an SVM, is a type of algorithm that transforms the primary training
data into a new format with a higher dimension by making use of a nonlinear mapping. It
searches for the ideal linear separating hyperplane in this new dimension. This hyperplane
is referred to as a "decision boundary" since it separates the tuples of one class from those of
another. A hyperplane can always be used to split data from two classes if an appropriate
nonlinear mapping to a high enough dimension is used. The SVM locates this hyperplane
through the use of support vectors, also known as the important training tuples, and margins
(defined by the support vectors). We'll go into further detail about these fresh concepts in the
next paragraphs.
While studying machine learning, one of the classifiers we come across is the Support Vector
Machine, or SVM for short. It is one of the most common approaches for categorising data in
the field of machine learning, and it performs admirably on both small and large datasets.
SVMs can be utilised for both classification and regression jobs; however, their performance is
superior when applied to classification scenarios. When they were first introduced in the 1990s,
they quickly became quite popular, and even now, with only minor adjustments, they are the
solution of choice when a high-performing algorithm is required.
• Linear Support Vector Machine: You can only use Linear SVM if the data can be completely
separated into linear categories. The ability to separate a set of data points into two classes using
just one straight line is what is meant when we talk about something being "completely linearly
separable" (if 2D).
• Non-Linear Support Vector Machine: When the data is not linearly separable, we can use a
Non-Linear SVM, which means that we apply advanced techniques like kernel tricks to
categorise data points that cannot be divided into two classes by a straight line (in 2D). In the
majority of real-world applications we do not find data points that are linearly separable, so we
use the kernel approach to solve such problems.
Support Vectors: The data points that lie closest to the hyperplane are referred to as support
vectors. These points are used to determine the boundary (separating line) between the two
groups.
Margin: The margin is the distance between the hyperplane and the observations nearest to the
hyperplane (the support vectors). SVM considers a large margin to be a favourable margin.
[Figure: the maximum-margin hyperplane in the X1-X2 plane, lying midway between a positive
hyperplane and a negative hyperplane, with the support vectors lying on those two margins]
Working of SVM
SVM is defined solely in terms of the support vectors: because the margin is calculated from the support
vectors, the points closest to the hyperplane, we do not need to be concerned with any other observations.
This is in contrast to logistic regression, in which the classifier is defined over all of the points. Because
of this, SVM is able to take advantage of some natural speedups.
To further understand how SVM operates, let's look at an example. Suppose we have a dataset that has
two different classes (green and blue). It is necessary for us to decide whether the new data point should
be classified as blue or green.
There are many ways to put these points into groups, but the question is which is the best and how do we
find it?
NOTE: We call this decision boundary a "straight line" because we are plotting data points on a two-
dimensional graph. If there are more dimensions, we call it a "hyperplane."
[Figure: candidate hyperplanes separating the two classes]
SVM is all about finding the hyperplane with the most space between the two classes; such a hyperplane
is the best hyperplane. This is done by considering many hyperplanes that fit the labels and then picking
the one that is farthest from the data points, i.e. the one with the biggest margin.
In geometry, a hyperplane is the name given to a subspace that is one dimension smaller than the ambient
space. Although this definition is accurate, it is not very intuitive, so instead of using it we will
concentrate on lines in order to better understand what a hyperplane is. If you recall the mathematics you
studied in high school, you presumably know that a line has an equation of the form y = ax + b, that the
constant a is called the slope, and that b is where the line crosses the y-axis. The linear equation
y = ax + b involves two variables, denoted by the letters y and x, but we are free to give them any names
we like. The equation holds for a whole collection of points (x, y), and we refer to that collection of
points as a line.
Another notation for the equation of a line may be obtained if we define the two-dimensional vectors
x = (x1, x2) and w = (a, −1):
w.x + b = 0
where w.x is the dot product of w and x.
Now we need to locate a hyperplane: specifically, the hyperplane with the largest margin (a margin is a
buffer zone around the hyperplane), working toward having the largest margin with the fewest points
inside it (the points on the margin are the support vectors). To put it another way, "the goal is to
maximise the minimum distance" from the hyperplane to the data points. If a point from the positive
group is substituted into the hyperplane equation while generating predictions on training data that was
binary classified into positive and negative groups, we will get a value larger than 0 (zero);
mathematically, wT(Φ(x)) + b > 0. Predictions for the negative group in the hyperplane equation
would give a negative value: wT(Φ(x)) + b < 0. These signs refer to the training data, which is how we
train our model: a positive sign for the positive class and a negative sign for the negative class.
When testing this model on test data, if we correctly predict a positive class (a positive sign, or a
greater-than-zero value) as positive, then two positives make a positive, and hence a greater-than-zero
result. The same is true if we correctly forecast the negative group, because two negatives also make a
positive. However, if the model incorrectly identifies the positive group as the negative group, one plus
and one minus make a minus, resulting in a value less than zero. Summarising, the product of the
predicted and actual labels is greater than 0 (zero) on a correct prediction, and less than zero otherwise.
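A minimal linear SVM sketch, assuming scikit-learn is available; the six toy points are our own
and form two linearly separable groups:

    from sklearn.svm import SVC

    X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
    y = [0, 0, 0, 1, 1, 1]

    clf = SVC(kernel="linear").fit(X, y)
    print(clf.support_vectors_)            # the training points that define the margin
    print(clf.predict([[3, 2], [7, 6]]))   # [0 1]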
11. Suppose you are using a Linear SVM classifier for a 2-class classification problem. You have been
given data in which some points are circled in red; these represent the support vectors. If you remove
any one of the red points from the data, will the decision boundary change?
A) Yes
B) No
10.11 SOLUTIONS/ANSWERS
☞ Check Your Progress 1
1. Compare between Supervised and Un-Supervised Learning.
Solution : Refer to section 10.3
2. List the Steps Involved in Supervised Learning
Solution : Refer to section 10.3
3. What are the Common Issues Faced While Using Supervised Learning
Solution : Refer to section 10.3
8. Using naive Bayesian classification, predict a class label. Given the training data
in Table-1 below, we want to use naive Bayesian classification to predict the class
label of a tuple. The characteristics age, income, student, and credit rating
characterise the data tuples.
For i = 1, 2, we must maximise P(X|Ci)P(Ci). The prior probability for each class, P(Ci), can be
calculated using the training tuples:
Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
9. Apply KNN classification algorithm to the following data and predict the value for (10,7) for K = 3:
Feature 1   Feature 2   Class
1           1           A
2           3           A
2           4           A
5           3           A
8           6           B
8           8           B
9           6           B
11          7           B
Solution: Refer to section 10.5.2
11. Will the decision boundary change if one of the support vectors is removed?
A) Yes
B) No
Solution: A
The effectiveness of an SVM depends upon which of the following?
A) Selection of Kernel
B) Kernel Parameters
C) Soft Margin Parameter C
D) All of the above
Solution: D
The effectiveness of an SVM depends upon choosing the three basic requirements mentioned above in
such a way that efficiency is maximised while error and overfitting are reduced.
1. Machine Learning: An Algorithmic Perspective, Stephen Marsland, 2nd Edition, CRC Press, 2015.
2. Machine Learning, Tom Mitchell, 1st Edition, McGraw-Hill, 1997.
3. Machine Learning: The Art and Science of Algorithms that Make Sense of Data, Peter Flach, 1st
Edition, Cambridge University Press, 2012.
UNIT 11 REGRESSION
11.0 Introduction
11.1 Objectives
11.2 Regression Algorithm
11.3 Linear Regression
11.4 Polynomial Regression
11.5 Support Vector Regression
11.6 Summary
11.7 Solution/Answers
11.0 INTRODUCTION
In 1886 the British scientist Francis Galton investigated the relationship between two variables to study
the hereditary height of children. In his research he categorised parents into two groups based on height:
the first group consisted of parents whose height was smaller than the average parental height, and the
second of parents whose height was greater than the average. The "regression toward mediocrity" that he
observed gave these statistical methods their name; today, the term regression primarily describes the
relationship between variables.
Simple regression y = m*x + C describes the relationship between one independent and one dependent
variable, where the variable y varies with the value of x and is thus the dependent variable; the variable x
is not affected by any other variable and is hence the independent variable; and m is a constant (the
slope).
Consider the following parent-children data set:
Parent 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5 72.5
Children 65.8 66.7 67.2 67.6 68.2 68.9 69.5 69.9 72.2
The mean height of the children is 68.44 whereas the mean height for the parents is 68.5.
The linear equation for the parents and children is
height_child = 21.52 + 0.69 * height_parent
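This line can be verified with an ordinary least-squares fit; a sketch assuming numpy is available:

    import numpy as np

    parent = np.array([64.5, 65.5, 66.5, 67.5, 68.5, 69.5, 70.5, 71.5, 72.5])
    children = np.array([65.8, 66.7, 67.2, 67.6, 68.2, 68.9, 69.5, 69.9, 72.2])

    # degree-1 polynomial fit returns (slope, intercept)
    slope, intercept = np.polyfit(parent, children, 1)
    print(f"{intercept:.2f} {slope:.3f}")   # 21.52 0.685, i.e. the 21.52 + 0.69x line above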
Mathematically, simple linear regression can be defined as y = b*x + c + ϵ, where b is the slope of the
regression line and x is the variable which can change the value of y but cannot be affected by another
variable, whereas y is the variable which varies with a change in the value of x, and ϵ is the error term,
which measures the difference between the actual value and the predicted value. Variable y is described
as the dependent variable or response variable, and variable x is described as the explanatory or predictor
variable.
Regression is a supervised machine learning model which describes the relationships between the
response variable and the predictor variables. So, a regression model is used when it is required to
determine the value of one variable using another variable.
If the variable to be predicted depends on a single variable, then the regression equation will be y = a + bx.
To determine the value of the dependent variable y, we need to determine the slope b and the constant a;
then, by substituting different values for the variable x, we can get different values of the variable y.
When x = 0, y = a, which means that when the independent variable contributes nothing, the predicted
variable takes the constant value a. Suppose we have multiple independent variables x1, x2, x3, x4, …, xn.
Then the regression equation will be y = a + b1x1 + b2x2 + b3x3 + ……. + bnxn.
The regression line is also called the best-fit line because the regression line aims to fit all the points in
such a way that the overall error will be minimum. Regression is a linear regression when there is one
predictor variable and we can apply a linear regression model. The multiple linear regression model
comes into play when the number of predictor variables is more than one. When the relationship between
the variables y and x is not linear, we can apply a non-linear regression model.
Following are the ways used by regression analysis to determine the relationships between the response variable and predictor variables:
Find the relationship: It is required to determine the relationship between the predictor variable and the response variable. If a change in the independent variable results in a change in the dependent variable, then a relationship exists.
Strength of relationship: How much one variable changes when the value of the other variable is changed determines the strength of the relationship.
Formation of relationship: If a change in the value of one variable results in a change in the other variable, then formulate a mathematical equation to represent the relationship between both variables.
Prediction: After formulation of the mathematical equation, find the predicted value.
Other independent variables: There may be other independent variables which have an impact on the dependent variable. If they exist, then formulate the mathematical equation using these variables also.
Uses of Regression
In a business scenario, when it is required to determine the impact of several different independent variables on a target value, regression can be used.
When we want to represent a problem in mathematical form, or we want to model a problem to determine the impact of different variables.
Business logic is very easy to explain with the help of regression; the model can be communicated to a non-specialist easily.
When the target variable is normally distributed, regression is very effective.
Examples of Regression
Relationship between uploading a picture on Facebook page and number of likes by the friends.
Relationship between the height of the child and their parents’ heights.
Relationship between the average food intake and weight gain.
Relationship between the numbers of hours studied and marks scored by the students.
Relationship between product consumption and an increase in the product price.
11.1 OBJECTIVES
11.2 REGRESSION ALGORITHM
Several types of regression algorithms are in common use; the important ones are described below.
Multiple Linear Regression: When there is only one dependent variable and more than one independent
variables, then it results in multiple linear regression i.e. y=a+bx1+cx2+dx3. ; example weight = a+b *
(daily meal)+ c* (daily exercise)
Logistic regression: In logistic regression algorithm dependent variable is binary in nature (False/True).
This algorithm is generally used under cases like testing of the medicines, to detect the bank fraud etc.We
had already discussed the concept of logistic regression in unit no. 10 of this course.
Polynomial regression: Polynomial regression is described with the help of polynomial equation where
the occurrence of independent variable is more than one. There is no linear relationships between the
dependent and independent variables. It results in a curved line instead of a straight line i.e. y=c+
a*x+b*x2
Ecological regression: The ecological regression algorithm is used when the data belongs to groups. The data is divided into different groups and regression is performed on each group separately. Ecological regression is mostly used in political research, e.g. party_votes% = 0.2 + 0.5*(below_poverty_people_votes)
Ridge regression: It is a type of regularization. When data variables are highly correlated ridge
regression is used. Using some constraints on regression coefficients, it is used to reduce the error and
lower the bias. Mostly used in feature selection.
Lasso regression: In the Least Absolute Shrinkage and Selection Operator (LASSO) regression algorithm a penalty is assigned to the coefficients. Lasso regression uses a shrinkage technique where data values are shrunk towards a mean.
Logic regression: In logic regression predictor variable and response variable both are binary in nature
and applicable to both classification and regression problem.
Bayesian regression: Bayesian regression algorithm is based on Bayesian statistics. Random variables
are used as a parameter to estimates. In this algorithm if the data is absent then some prior data is taken as
an input.
Quantile regression: This is used when a particular quantile (boundary) of the response is of interest. When overweight and underweight groups are considered for a health analysis, it is a case for quantile regression.
Cox regression: The Cox regression algorithm is used when the output of a variable depends on a set of independent variables, for example patient_survival_after_surgery (Survived/Died) = f(age, condition, BMI)
11.3 LINEAR REGRESSION
If DV is the dependent variable and IV is the independent variable, then a positive linear relationship results when the dependent variable (DV) on the y-axis increases with an increase in the value of the independent variable (IV) on the x-axis. For example, the distance traversed by a car increases when the speed of the car increases; thus the distance traversed by the car depends on the speed of the car.
A negative linear relationship results when the dependent variable (DV) on the y-axis decreases with an increase in the independent variable (IV) on the x-axis. For example, the time taken by a car decreases with an increase in the speed of the car.
[Figure: the dataset is fed to the learning algorithm, which produces the hypothesis hθ.]
Average error = (1/n) Σ (ŷ(i) − y(i))
J(θ) = (1/2n) Σ (ŷ(i) − y(i))²
ŷ(i) = hθ(x(i)) = θ1·x(i) + θ0
Now it is required to minimise our loss function J(θ). A gradient descent approach will be used to minimise the loss function; a small sketch follows.
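As an illustration only (a minimal sketch, not prescribed by the unit; the data, learning rate and iteration count are assumptions for demonstration), gradient descent on J(θ) can be written in Python with NumPy:

import numpy as np

# Minimal gradient descent for J(theta) = (1/2n) * sum((y_hat - y)^2).
x = np.array([1, 2, 3, 4, 5], dtype=float)   # toy data (same as Example 1 below)
y = np.array([3, 4, 2, 4, 5], dtype=float)

theta1, theta0 = 0.0, 0.0   # slope and intercept, initialised to zero
lr, n = 0.01, len(x)        # illustrative learning rate and sample size

for _ in range(5000):
    y_hat = theta1 * x + theta0
    grad1 = np.sum((y_hat - y) * x) / n   # dJ/dtheta1
    grad0 = np.sum(y_hat - y) / n         # dJ/dtheta0
    theta1 -= lr * grad1
    theta0 -= lr * grad0

print(theta1, theta0)   # converges near 0.4 and 2.4 for this data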
Linear regression using least square method
Mathematical function is used to find the sum of squares (square of the distance of the points and the
regression line) of all the data points. Least square method is a statistical method given by Carl Friedrich
Gauss used to determine the best fit line or the regression line by minimizing the sum of squares. Least
square method is used to find the line having minimum value of the sum of squares and this line is the
best-fit regression line.
Regression line is y=m*x+c where
y= Estimated or predicted value (Dependent Variable)
x= Value of x for observation (Independent variable)
c= Intercept with the y-axis.
m= Slope of the line
Example 1:
Consider the following set of data points (x,y), find the regression line for the given data points.
X 1 2 3 4 5
Y 3 4 2 4 5
Solution:
x	y	(x − x̄)	(y − ȳ)	(x − x̄)²	(x − x̄)(y − ȳ)
1	3	−2	−0.6	4	1.2
2	4	−1	0.4	1	−0.4
3	2	0	−1.6	0	0
4	4	1	0.4	1	0.4
5	5	2	1.4	4	2.8
x̄ = 3	ȳ = 3.6	0	0	10	4
The slope is m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
m = 4/10 = 0.4
x̄ = mean of x = 3
ȳ = mean of y = 3.6
For the line y = mx + c, the intercept is c = ȳ − m·x̄ = 3.6 − 0.4 × 3 = 2.4
m = 0.4
c = 2.4
So the regression line is y = 0.4x + 2.4
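The same result can be checked numerically (a sketch assuming NumPy is available; np.polyfit is used only as a cross-check, it is not part of the unit's method):

import numpy as np

# Cross-check of Example 1: manual least squares vs NumPy's degree-1 fit.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 4, 2, 4, 5], dtype=float)

m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()
print(m, c)                        # 0.4 and 2.4, as computed above

m_fit, c_fit = np.polyfit(x, y, 1) # agrees with the manual result
print(m_fit, c_fit)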
In the corresponding scatter plot, the blue points are the actual points and the yellow points are the predicted points obtained using the least square method. Some blue points lie above the line while others lie below it, whereas the yellow (predicted) points lie on the line. Every point not lying on the line is at some distance from it; thus the actual blue data points and the predicted yellow data points are separated by some distance. This distance, or difference, between the data points represents an error.
Cost function is used to find the distance between the actual data point value lying other than the
regression line and the predicted value of data points lying on the regression line. Cost function optimizes
the regression coefficient or weights. It measures how a linear regression model is performing.
The difference between the actual value y on the y-axis and the predicted value ŷ is (y − ŷ), and cost = (y − ŷ)².
If there are n data points, then the cost function will be
cost = (1/2n) Σ (y − ŷ)²
or
cost = (1/n) Σ |y − ŷ|
Since cost function provide the error between the actual value and predicted value so minimizing the
value of cost function will improve the prediction value. Higher the cost function value will degrade the
performance.
Mean Squared Error (MSE): The average of the squared distances between the actual data points (lying off the line) and the predicted data points (lying on the line) is called the mean squared error. It is written as:
MSE = (1/N) Σ (yᵢ − (a₁xᵢ + a₀))²
Mean Absolute Error (MAE) is calculated as the sum of all absolute errors divided by the total number of predictions in a group. While considering a group of data points, the directions of the errors are not important. In other words, it is the mean of the absolute differences between actual values and predicted values, where all individual deviations have equal importance:
MAE = (1/N) Σ |y(i) − ŷ(i)|
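Both measures are straightforward to compute; a minimal sketch (assuming NumPy, with the Example 1 line used as the predictor) is:

import numpy as np

# MSE and MAE for actual values y_true and model predictions y_pred.
y_true = np.array([3, 4, 2, 4, 5], dtype=float)
y_pred = 0.4 * np.array([1, 2, 3, 4, 5]) + 2.4   # predictions from Example 1's line

mse = np.mean((y_true - y_pred) ** 2)    # mean squared error
mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error
print(mse, mae)                          # 0.72 and 0.64 for this data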
Check your progress 1
ŷ = a₀ + b₁x₁ + b₂x₂ + b₃x₃ + … + bₙxₙ = a₀ + Σ (bᵢxᵢ)
11.4 POLYNOMIAL REGRESSION
For a second-order polynomial ŷ = a₀ + a₁x + a₂x², the sum of squared errors is
S = Σ (yᵢ − (a₀ + a₁xᵢ + a₂xᵢ²))²
S = Σ (yᵢ − a₀ − a₁xᵢ − a₂xᵢ²)² … (i)
δS/δa₀ = −2 Σ (yᵢ − a₀ − a₁xᵢ − a₂xᵢ²)
Since δS/δa₀ = 0,
⇒ −2 Σ yᵢ + 2na₀ + 2a₁ Σ xᵢ + 2a₂ Σ xᵢ² = 0
⇒ na₀ + a₁ Σ xᵢ + a₂ Σ xᵢ² = Σ yᵢ … (ii)
δS/δa₁ = −2 Σ xᵢ (yᵢ − a₀ − a₁xᵢ − a₂xᵢ²)
Since δS/δa₁ = 0,
⇒ −2 Σ xᵢyᵢ + 2a₀ Σ xᵢ + 2a₁ Σ xᵢ² + 2a₂ Σ xᵢ³ = 0
⇒ a₀ Σ xᵢ + a₁ Σ xᵢ² + a₂ Σ xᵢ³ = Σ xᵢyᵢ … (iii)
δS/δa₂ = −2 Σ xᵢ² (yᵢ − a₀ − a₁xᵢ − a₂xᵢ²)
Since δS/δa₂ = 0,
⇒ −2 Σ xᵢ²yᵢ + 2a₀ Σ xᵢ² + 2a₁ Σ xᵢ³ + 2a₂ Σ xᵢ⁴ = 0
⇒ a₀ Σ xᵢ² + a₁ Σ xᵢ³ + a₂ Σ xᵢ⁴ = Σ xᵢ²yᵢ … (iv)
Example 2. Consider the following set of data points (x, y). Find the second-order polynomial y = a₀ + a₁x + a₂x², and using polynomial regression determine the value of y when x = 40.
X 40 10 -20 -88 -150 -170
Y 5.89 5.99 5.98 5.54 4.3 3.33
Σx = −378, Σy = 31.03, Σx² = 61244, Σx³ = −8912472, Σx⁴ = 1404159536, Σxy = −1522.7, Σx²y = 248304
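These sums feed the normal equations (ii)–(iv). As a quick numerical check (a sketch assuming NumPy; np.polyfit solves the same least squares problem directly), the polynomial can also be fitted as follows:

import numpy as np

# Fit y = a0 + a1*x + a2*x^2 to the data of Example 2 and predict at x = 40.
x = np.array([40, 10, -20, -88, -150, -170], dtype=float)
y = np.array([5.89, 5.99, 5.98, 5.54, 4.3, 3.33])

a2, a1, a0 = np.polyfit(x, y, 2)      # polyfit returns the highest power first
print(a0, a1, a2)                     # the fitted coefficients
print(a0 + a1 * 40 + a2 * 40 ** 2)    # predicted y at x = 40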
(b) Height of regression line is used to determine the intercept in multiple regression.
(c) Multiple regression is used when dependent variable does not depend on more than
11.5 SUPPORT VECTOR REGRESSION
If all the data points had to be classified exactly by the marginal lines, the machine would overfit, and this does not happen in a real scenario: it is not always possible for every data point to lie on the correct side of the classification. As shown in the figure, one of the red data points lies below the positive margin and one of the blue data points lies above the negative margin of the hyperplane; these two data points lie in the opposite plane area. If ξᵢ is the distance of such a data point from its respective marginal line, we need to account for the error ξᵢ for such points:
yᵢ − (w·xᵢ + b) ≤ ε + ξᵢ, for each ξᵢ ≥ 0
(w·xᵢ + b) − yᵢ ≤ ε + ξᵢ
The total error computed is C Σ ξᵢ, where C is a penalty parameter and the ξᵢ are the error values.
Since maximising the margin 2/||w|| is equivalent to minimising ||w||²/2, it is thus required to find
(w*, b*) minimising ||w||²/2 + C Σ ξᵢ
Example 3. For the given points of two classes red and blue:
Blue: { (1,1), (2,1), (1,-1), (2,-1)}
Red : { (4,0), (5,1), (5,-1), (6,0)}
Plot a graph for the red and blue categories. Find the support vectors and the optimal separating line.
Solution.
From the plot, the support vectors are the points of each class closest to the boundary: (2, 1) and (2, −1) from the blue class, and (4, 0) from the red class.
The first support vector, with x-coordinate 2 and y-coordinate 1, is represented by
SV1 = (2, 1)ᵀ
Similarly, support vector SV2 with x-coordinate 2 and y-coordinate −1, and SV3 with x-coordinate 4 and y-coordinate 0, are represented by
SV2 = (2, −1)ᵀ and SV3 = (4, 0)ᵀ
Adding 1 as an input bias to the support vectors SV1, SV2 and SV3:
S̃1 = (2, 1, 1)ᵀ, S̃2 = (2, −1, 1)ᵀ, and S̃3 = (4, 0, 1)ᵀ
To determine the values of α1, α2 and α3 from the linear equations below, we take support vectors SV1 and SV2 as belonging to the negative class and support vector SV3 to the positive class:
α1 S̃1·S̃1 + α2 S̃2·S̃1 + α3 S̃3·S̃1 = −1 (−ve class)
α1 S̃1·S̃2 + α2 S̃2·S̃2 + α3 S̃3·S̃2 = −1 (−ve class)
α1 S̃1·S̃3 + α2 S̃2·S̃3 + α3 S̃3·S̃3 = +1 (+ve class)
Substituting the vectors and taking the dot products:
6α1 + 4α2 + 9α3 = −1
4α1 + 6α2 + 9α3 = −1
9α1 + 9α2 + 17α3 = 1
After solving the above three equations, we get
α1 = α2 = −3.25 and α3 = 3.5
The hyperplane that discriminates the positive class from the negative class is given by
w̃ = Σ αᵢ S̃ᵢ
w̃ = α1 (2, 1, 1)ᵀ + α2 (2, −1, 1)ᵀ + α3 (4, 0, 1)ᵀ
w̃ = (−3.25)(2, 1, 1)ᵀ + (−3.25)(2, −1, 1)ᵀ + (3.5)(4, 0, 1)ᵀ = (1, 0, −3)ᵀ
The hyperplane equation is w·x + b = 0, where w = (1, 0) and b = −3; that is, x − 3 = 0, a line parallel to the y-axis which separates the red and blue categories.
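The arithmetic can be verified with a few lines of NumPy (a sketch only; the alphas are taken from the solution above):

import numpy as np

# Rebuild w from the alphas and the bias-augmented support vectors.
S1 = np.array([2, 1, 1])
S2 = np.array([2, -1, 1])
S3 = np.array([4, 0, 1])

alpha1 = alpha2 = -3.25
alpha3 = 3.5

w_aug = alpha1 * S1 + alpha2 * S2 + alpha3 * S3
print(w_aug)   # [ 1.  0. -3.] -> w = (1, 0), b = -3

# Decision rule sign(w . x + b); e.g. blue point (2, 1) falls in the -ve class:
print(np.sign(np.dot([1, 0], [2, 1]) - 3))   # -1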
Applications of Support Vector Regression
Used to solve supervised regression problems.
Can be used with both linear and non-linear types of data.
Prediction of forest fires during weather changes.
Prediction of electric power demand.
11.6 SUMMARY
This unit focused on regression, including polynomial regression and how to plot a polynomial curve. The concepts of overfitting and underfitting were also discussed. The support vector regression algorithm was covered, and the concepts of the hyperplane, the marginal hyperplane and the marginal distance were illustrated with an example.
11.7 SOLUTIONS/ANSWERS
ŷ = Σ mᵢxᵢ + C
3. a. T
b. T
c. F
w̃ = Σ αᵢ S̃ᵢ
w̃ = α1 (1, 0, 1)ᵀ + α2 (3, 1, 1)ᵀ + α3 (3, −1, 1)ᵀ
w̃ = (−3.5)(1, 0, 1)ᵀ + (0.75)(3, 1, 1)ᵀ + (0.75)(3, −1, 1)ᵀ = (1, 0, −2)ᵀ
The hyperplane equation is w·x + b = 0, where w = (1, 0) and
b = −2; that is, x − 2 = 0, a line parallel to the y-axis which separates both classes.
12.1 INTRODUCTION
Jain et al. (1996) mentioned in their work that a neuron is a unique biological cell that has the capability of information processing. Figure 1 describes a biological neuron's structure, consisting of a cell body and tree-like branches called axons and dendrites. A neuron works by receiving signals from other neurons through its dendrites, processing the signals in its cell body, and finally passing the signals on to other neurons via its axon. The synapse connects two neurons, attaching to the axon of the first neuron and the dendrite of the second. A synapse can either enhance or reduce the signal value. If the total signal exceeds a particular value, called a threshold, then the neuron fires; otherwise it does not fire.
12.2 OBJECTIVES
An artificial neural network (ANN) is a computing system that simulates the way the human brain analyses and processes information. It belongs to the branch of artificial intelligence (AI) and solves problems that may be difficult or impossible for humans to solve directly. In addition, ANNs have the potential for self-learning, providing better results as more data becomes available.
ANN is similar to the biological neural networks as both perform the functions collectively and
in parallel. Artificial Neural Network (ANN) is a general term used in various applications, such
as weather predictions, pattern recognitions, recommendation systems, and regression problems.
Figure 2 describes three neurons that perform "AND" logical operations. In this case, the output
neuron will fire if both input neurons are fired. The output neurons use a threshold value (T),
T=3/2 in this case. If none or only one input neuron is fired, then the total input to the output
becomes less than 1.5 and firing for output is not possible. Take another scenario where both
input neurons are firing, and the total input becomes 1+1=2, which is greater than the threshold
value of 1.5, then output neurons will fire. Similarly, we can perform the "OR” logical operation
with the help of the same architecture but set the new threshold to 0.5. In this case, the output
neurons will be fired if at least one input is fired.
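As a quick check (a minimal sketch, not from the unit), the same AND/OR behaviour can be reproduced in Python with a threshold unit:

# Threshold neuron with both input weights equal to 1; the threshold T
# decides which logical operation is computed.
def threshold_neuron(x1, x2, T):
    total = 1 * x1 + 1 * x2           # weighted sum of the two inputs
    return 1 if total > T else 0      # fire only when the sum exceeds T

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              threshold_neuron(x1, x2, 1.5),   # AND (T = 3/2)
              threshold_neuron(x1, x2, 0.5))   # OR  (T = 1/2)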
[Figure A: A single unit with three inputs x1, x2, x3, weights W1, W2, W3, weighted sum V and output Y = Φ(V); for the AND network above, the threshold is T = 3/2.]
The node has three inputs x = (x1, x2, x3) that receive only binary signals (either 0 or 1). How many different input patterns can this node receive? What if the node had four inputs? Or five? Can you give a formula that computes the number of binary input patterns for a given number of inputs?
x1 : 0 1 0 1 0 1 0 1
x2 : 0 0 1 1 0 0 1 1
x3 : 0 0 0 0 1 1 1 1
x1 : 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
x2 : 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x3 : 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x4 : 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
You may check that for five inputs the number of combinations will be 32. Note that 8 = 2³, 16 = 2⁴ and 32 = 2⁵ (for three, four and five inputs).
Thus, the formula for the number of binary input patterns is 2ⁿ, where n is the number of inputs.
[Figure A-1: A single unit with two inputs x1, x2, weights W1, W2, weighted sum V, and output Y = Φ(V).]
The node has two inputs x = (x1, x2) that receive only binary signals (either 0 or 1). How many different input patterns can this node receive?
……………………………………………………………………………………………
……………………………………………………………………………………………
A multilayer feed-forward neural network consists of the interconnection of various layers, named the input layer, hidden layer(s), and output layer. The number of hidden layers is not fixed; it depends upon the requirements and complexity of the problem. The simplest neural network, one with a single input layer and an output layer, is known as a perceptron. A perceptron accepts inputs, moderates them with certain weight values, then applies the transformation function to output the final result. The word perceptron is used here because every connection has a certain weight, and through these connections one layer is connected to the next layer.
The model works as follows: all inputs are multiplied by their weights and the weighted sum is calculated. This sum is then passed to the activation function, whose result is the output of an individual layer; this output becomes the input to the next layer. Various activation functions exist, such as sigmoid, tanh, and ReLU. After getting the output, the predicted output is compared with the actual output.
[Figure: A multilayer feed-forward neural network with an input layer, a hidden layer, and an output layer.]
Now, we multiply the m features (x1, x2, …, xm) with the weights (w1, w2, …, wm) and then compute the sum of these multiplicative terms. Finally, we define it as a dot product:
w·x = w1x1 + w2x2 + … + wmxm = Σ wᵢxᵢ
Now, referring to the above steps, you can understand the working of the multiple-layer model. When you train such networks on more extensive datasets with many input features, this process consumes a lot of computing resources. This is why deep learning was not popular in the early days, when limited computing resources were available; once better-configured hardware became available, deep learning captured the attention of researchers. The procedure for forwarding the input features to the hidden layer, and the hidden layer to the output layer, is shown below in Figure 4.
[Figure 4: Input data with m features is forwarded through a multilayer neural network with one hidden layer to the output layer; each layer has its own weight vector W: (w1, w2, …, wm).]
Now you can understand exactly how multiple layers work; a small sketch of one forward pass is given below.
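This is an illustrative sketch only (the layer sizes, random weights and sigmoid activation are assumptions for demonstration):

import numpy as np

# One forward pass through a network with a single hidden layer.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random(4)          # m = 4 input features
W1 = rng.random((3, 4))    # input -> hidden weights (3 hidden neurons)
W2 = rng.random((1, 3))    # hidden -> output weights (1 output neuron)

h = sigmoid(W1 @ x)        # hidden-layer activations
y = sigmoid(W2 @ h)        # network output
print(y)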
[Figure B: A single unit with three inputs x1, x2, x3, weights W1, W2, W3, weighted sum V, and output Y = Φ(V).]
Suppose that the weights corresponding to the three inputs have the following values:
w1 = 2 ; w2 = -4 ; w3 = 1
and the activation of the unit is given by the step-function:
Φ(V) = 1 for V>=0 and Φ(V) = 0 Otherwise
Calculate what will be the output value y of the unit for each of the following input patterns:
Pattern P1 P2 P3 P4
X1 1 0 1 1
X2 0 1 0 1
X3 0 1 1 1
Answer:
To find the output value y for each pattern we have to:
a) Calculate the weighted sum: v = ∑ wi xi = w1 · x1 + w2 · x2 + w3 · x3
b) Apply the activation function to v, the calculations for each input pattern are:
P1 : v = 2 · 1 − 4 · 0 + 1 · 0 = 2 , (2 > 0) , y = ϕ(2) = 1
P2 : v = 2 · 0 − 4 · 1 + 1 · 1 = −3 , (−3 < 0) , y = ϕ(−3) = 0
P3 : v = 2 · 1 − 4 · 0 + 1 · 1 = 3 , (3 > 0) , y = ϕ(3) = 1
P4 : v = 2 · 1 − 4 · 1 + 1 · 1 = −1 , (−1 < 0) , y = ϕ(−1) = 0
Example - 3: Logical operators (i.e. NOT, AND, OR, XOR, etc) are the building blocks of any
computational device. Logical functions return only two possible values, true or false, based on the truth
or false values of their arguments. For example, operator AND returns true only when all its arguments
are true, otherwise (if any of the arguments is false) it returns false. If we denote truth by 1 and false by 0,
then logical function AND can be represented by the following table:
x1 : 0 0 1 1
x2 : 0 1 0 1
x1 AND x2 : 0 0 0 1
[Figure C: A single unit with two inputs x1, x2, weights W1, W2, weighted sum V, and output Y = Φ(V).]
if the weights are w1 = 1 and w2 = 1 and the activation of the unit is given by the step-function:
Φ(V) = 1 for V>=2 and Φ(V) = 0 Otherwise
Answer (a):
P1 : v = 1 · 0 + 1 · 0 = 0 , (0 < 2) , y = ϕ(0) = 0
P2 : v = 1 · 1 + 1 · 0 = 1 , (1 < 2) , y = ϕ(1) = 0
P3 : v = 1 · 0 + 1 · 1 = 1 , (1 < 2) , y = ϕ(1) = 0
P4 : v = 1 · 1 + 1 · 1 = 2 , (2 = 2) , y = ϕ(2) = 1
b) Suggest how to change either the weights or the threshold level of this single unitto implement
the logical OR function (true when at least one of the arguments is true):
x1 : 0 0 1 1
x2 : 0 1 0 1
x1 OR x2 : 0 1 1 1
c) The XOR function (exclusive or) returns true only when one of the arguments is true and
another is false. Otherwise, it returns always false. This can be represented by the following table:
x1 : 0 0 1 1
x2 : 0 1 0 1
x1 XOR x2 : 0 1 1 0
Do you think it is possible to implement this function using a single unit? A network of several
units?
Answer (c): This is a difficult question, and it puzzled scientists for some time, because it is impossible to implement the XOR function with either a single unit or a single-layer feed-forward network (a single-layer perceptron). This was known as the XOR problem. The solution was found using a feed-forward network with a hidden layer: the XOR network uses two hidden nodes and one output node.
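One workable choice of weights and thresholds for such a network is shown below (a sketch only; this is one of many valid configurations, not the unit's prescribed one): the two hidden units compute OR and AND, and the output fires for "OR but not AND".

def step(v, threshold):
    return 1 if v >= threshold else 0

def xor(x1, x2):
    h1 = step(1 * x1 + 1 * x2, 1)     # hidden unit computing OR
    h2 = step(1 * x1 + 1 * x2, 2)     # hidden unit computing AND
    return step(1 * h1 - 1 * h2, 1)   # fires when OR = 1 and AND = 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))    # prints the XOR truth table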
[Figure B (repeated): A single unit with three inputs x1, x2, x3, weights W1, W2, W3, and output Y = Φ(V).]
Suppose that the weights corresponding to the three inputs have the following values:
w1 = 1 ; w2 = -1 ; w3 = 2
and the activation of the unit is given by the step-function:
Φ(V) = 1 for V>=1 and Φ(V) = 0 Otherwise
Calculate what will be the output value y of the unit for each of the following input patterns:
Pattern P1 P2 P3 P4
X1 1 0 1 1
X2 0 1 0 1
X3 0 1 1 1
……………………………………………………………………………………………
……………………………………………………………………………………………
Question - 3: The universal operators (NAND, NOR) are the building blocks of any computational device. Logical functions return only two possible values, true or false, based on the truth values of their arguments. For example, operator NAND returns False only when all its arguments are True; otherwise (if any of the arguments is false) it returns True. If we denote truth by 1 and false by 0, then the logical function NAND can be represented by the following table:
x1 : 0 0 1 1
x2 : 0 1 0 1
x1 NAND x2 : 1 1 1 0
[Figure C1: A single unit with two inputs x1, x2, weights W1, W2, weighted sum V, and output Y = Φ(V).]
if the weights are w1 = 1 and w2 = 1 and the activation of the unit is given by the step-function:
Φ(V) = 1 for V>=2 and Φ(V) = 0 Otherwise
x1 : 0 0 1 1
x2 : 0 1 0 1
x1 NOR x2 : 1 0 0 0
……………………………………………………………………………………………
……………………………………………………………………………………………
So far, we have paid attention to the neural network model, how it works, and the role of hidden layers. Now we need to emphasise activation functions and their role in neural networks. The activation function is a mathematical function that decides the threshold value for a neuron; it may be linear or nonlinear. The purpose of an activation function is to add non-linearity to the neural network. If you have a linear activation function, then the number of hidden layers does not matter: the final output remains a linear combination of the input data. Such linearity cannot help in solving complex problems, like patterns separated by curves, where nonlinear activation is required.
Moreover, a step activation function does not have a helpful derivative, as its derivative is 0 everywhere it is defined. Therefore it does not work for backpropagation, a fundamental and valuable concept in the multilayer perceptron.
Now, as we've covered the essential concepts, let's go over the most popular neural network activation functions.
Binary Step Function: The binary step function depends on a threshold value that decides whether a neuron should be activated or not. The input fed to the activation function is compared to a certain threshold; if the input is greater than it, then the neuron is activated, else it is deactivated, meaning that its output is not passed on to the next hidden layer.
[Figure: The binary step activation function t(z), which jumps from 0 to 1 at z = 0.]
The idea of the step function/activation will be clear from the following analogy. Suppose we have a perceptron whose activation function isn't very "stable" as a relationship candidate: say some person has bipolar issues. One day (z < 0) s/he gives no responses and is quiet, and the next day (z ≥ 0) the mood changes and s/he becomes very talkative, speaking non-stop in front of you. There is no smooth transition between the moods, and you don't know in advance whether s/he will be quiet or talking. This is exactly how a step function behaves.
So, a minor change in the weights of the input layer of our model may activate a neuron by flipping its output from 0 to 1, which impacts the working of the hidden layer, and then the outcome may be affected. We want a model that can be improved gradually by adjusting its weights; however, this is not possible with such an activation function. Without a better activation function, the task cannot be accomplished by simply making small changes to the weights.
[Figure: a small change Δw in a weight should produce only a small change Δy in the output.]
So, we need to say goodbye to the perceptron model with this step activation function.
We need a new activation function that accomplishes the task for our neural network: the sigmoid function. We change only one thing, the activation function, and it meets our requirement of handling sudden changes smoothly. We define the weighted input by
Z = Σ wᵢxᵢ + bias
and the sigmoidal function by
σ(z) = 1 / (1 + e⁻ᶻ)
The function σ(z) is called the sigmoid function. First the value Z is computed, then the sigmoid function is applied to Z. If this looks abstract or strange, and you don't have a good knowledge of mathematics, do not worry: Figure 7 explains its curve and its derivative. Here are some observations:
1. The output of the sigmoid function behaves like the step function in that the output remains between 0 and 1. The curve marks 0.5 at z = 0, so we can make a straightforward rule: if the sigmoid neuron's output is greater than or equal to 0.5, the output is taken as 1; otherwise it is taken as 0.
2. The sigmoid function is continuous, and its derivative, σ′(z) = σ(z)(1 − σ(z)), exists everywhere on the curve.
The sigmoid activation function introduces non-linearity, the essential ingredient, into our model. This non-linearity means that the output — obtained by taking the dot product of the inputs x (x1, x2, …, xm) and the weights w (w1, w2, …, wm), adding the bias, and then applying the sigmoid function — can no longer be represented linearly. The idea is that the nonlinear activation function allows us to classify nonlinear decision boundaries in our data.
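A minimal sketch of the sigmoid and its derivative, as defined above (assuming NumPy):

import numpy as np

# sigma(z) = 1 / (1 + e^(-z)) and sigma'(z) = sigma(z) * (1 - sigma(z)).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))              # [0.119..., 0.5, 0.880...] -- exactly 0.5 at z = 0
print(sigmoid_derivative(z))   # maximal (0.25) at z = 0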
We use hidden layers in our model by replacing perceptrons with sigmoid-activation neurons. Now the question arises: what is the requirement for hidden layers? Are they useful? The answer is yes. Hidden layers help us handle complex problems that single-layer neurons cannot solve.
Hidden layers transform the problem so that it can be rewritten, providing easy solutions to complex problems such as pattern recognition. For example, Figure 8 shows a classic textbook problem, recognition of handwritten digits, that can help you understand the workings of hidden layers.
[Figure 8: Sample handwritten digits from the MNIST dataset.]
The digits in Figure 8 are taken from a well-known dataset called MNIST. It has 70,000 examples of digits written by humans. A picture of 28×28 pixels represents every digit; that is, 28×28 = 784 pixels. Every pixel takes a value between 0 and 255 (a greyscale code): zero means the colour is white and 255 means the colour is black.
Now, can a computer really "see" a digit the way a human sees it? The answer is no. Therefore, we need proper training to recognise these digits. The computer can't understand an image as a human can; instead, it analyses how the pixel values represent an image. Here we dissect an image into an array of 784 numbers, such as [0, 0, 180, …, 77, 0, 0, 0], and then feed the array into our model.
[Figure: The input layer has 784 = 28×28 neurons x1 … x784, each holding a pixel value between 0 and 255; the output layer has one neuron per digit 0–9.]
The backpropagation algorithm is a supervised learning algorithm for training the neural network model. The algorithm was first introduced in the 1960s but was not popular; in 1986 it was popularised by Rumelhart, Hinton, and Williams, who used the concept in a paper titled "Learning representations by back-propagating errors". It is one of the most fundamental building blocks of any neural network: if you have multiple layers in the neural network, it is used to adjust the weights in the backward direction.
When designing a neural network, we initially set the weights and biases to some random values. Our model, through the backpropagation algorithm, then adjusts these values whenever the difference between the actual output and the predicted output — the error — is large.
This algorithm trains the neural network model based on the chain rule method. In simple terms, after every forward pass through the network, the backpropagation algorithm performs a backward pass to adjust the weight and bias parameters of the model. It repeatedly adjusts the weights and biases of all the edges between the layers so that the error, i.e. the difference between the predicted output and the real output, is minimised.
In other words, you can conclude that backpropagation is used to minimise the error/cost function by repeatedly adjusting the network's weights and biases. The level of adjustment is calculated from the gradients of the error function with respect to the weight and bias parameters. In short, we change the weight and bias parameters so that the error becomes very small. Figure 10 below explains the working of the backpropagation algorithm.
[Figure 10: The training loop — forward pass, calculate the error, backward pass to update the weights.]
1. First, some random value 'W' is initialised as the weights and propagated forward.
2. Then the error is found, and we try to reduce it by propagating backward and increasing the value of the weight 'W'.
3. After that, observe whether the error has increased; if it has, do not increase the value of 'W'.
4. Instead, propagate backward again and, this time, decrease the value of 'W'.
5. Now check whether the error has been reduced or not.
The weights that minimise the cost/error are reported. The detailed working is given by:
Calculate the error – the difference between the output produced by the model and the actual output.
Minimise the error – check whether the error is minimal or not.
Tune the parameters – if the error value is substantial, the weights and biases must be updated; if a significant error is observed again, this process is repeated until the error becomes very small.
Check whether the model is ready for prediction – if the error becomes significantly small, you can give some inputs to your model to get the output.
Now we have learned about the need for backpropagation and the meaning of training the model, and how the weight values are adjusted to reduce the error. We determine whether an increment or decrement in a weight is required; after knowing it, we keep updating the weight in that direction to reduce the error. At some point, updating the weights further would increase the error again; we stop at that moment, and this becomes the final weight value.
[Figure: Squared error plotted against a weight; the weight is increased or decreased until the minimum of the curve is reached.]
Backpropagation Algorithm:
initialise the network weights to small random values
do
    compute Δw(h) for all weights from the hidden layer to the output layer   // backward pass
    compute Δw(i) for all weights from the input layer to the hidden layer    // backward pass continued
    update the network weights accordingly                                    // the input layer is not modified
until all examples are correctly classified or another stopping criterion is met
12.6.1 How does Backpropagation work?
Now you may consider below Neural Network for a better understanding:
[Figure: A 2-2-2 network. Inputs i1 = 0.05, i2 = 0.10; input-to-hidden weights w1 = 0.15, w2 = 0.20, w3 = 0.25, w4 = 0.30 with bias b1 = 0.35; hidden-to-output weights w5 = 0.40, w6 = 0.45, w7 = 0.50, w8 = 0.55 with bias b2 = 0.60; target outputs o1 = 0.01, o2 = 0.99.]
Step 1: the forward pass through the hidden layer. For hidden neuron h1,
net_h1 = w1·i1 + w2·i2 + b1 = 0.15 × 0.05 + 0.20 × 0.10 + 0.35 = 0.3775
Out_h1 = 1/(1 + e^(−net_h1)) = 1/(1 + e^(−0.3775)) = 0.5933
Similarly, Out_h2 = 0.5969
We also repeat the process for the output layer neurons. Hidden layer neurons outputs become
the inputs.
[Figure: Output neuron o1 receiving Out_h1 and Out_h2 through w5 and w6, with bias b2, net input net_o1, output out_o1 and error E_o1.]
net_o1 = w5·Out_h1 + w6·Out_h2 + b2 = 0.40 × 0.5933 + 0.45 × 0.5969 + 0.60 = 1.1059
out_o1 = 1/(1 + e^(−1.1059)) = 0.7514; similarly, out_o2 = 0.7729
After applying backpropagation, we find the total error with respect to output-1 (o1) and output-2 (o2):
E_total = ½(target_o1 − out_o1)² + ½(target_o2 − out_o2)²
δE_total/δout_o1 = −(target_o1 − out_o1) = −(0.01 − 0.7513) = 0.74136
Now, we need to propagate backward to find the change in out_o1 with respect to its total net input:
out_o1 = 1/(1 + e^(−net_o1))
δout_o1/δnet_o1 = out_o1(1 − out_o1) = 0.75136507 × (1 − 0.75136507) = 0.186815602
Now, we check the change of net_o1 with respect to weight w5. Since net_o1 = w5·Out_h1 + w6·Out_h2 + b2,
δnet_o1/δw5 = Out_h1 + 0 + 0 = 0.593269
Step 3: We put all the values together to calculate the updated weight. By the chain rule,
δE_total/δw5 = (δE_total/δout_o1) × (δout_o1/δnet_o1) × (δnet_o1/δw5) = 0.74136 × 0.186816 × 0.593269 = 0.082167041
W5⁺ = W5 − η × δE_total/δw5, with learning rate η = 0.5
W5⁺ = 0.4 − 0.5 × 0.082167041
Updated w5 = 0.35891648
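The whole calculation can be reproduced in a few lines (a sketch assuming NumPy; the inputs, weights, targets and learning rate are those listed for this example):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

i1, i2, b1, b2 = 0.05, 0.10, 0.35, 0.60
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6 = 0.40, 0.45
target_o1, lr = 0.01, 0.5

out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)          # 0.5933
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)          # 0.5969
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)  # 0.7514

# Chain rule: dE/dw5 = dE/dout_o1 * dout_o1/dnet_o1 * dnet_o1/dw5
dE_dout = -(target_o1 - out_o1)                   # 0.7414
dout_dnet = out_o1 * (1 - out_o1)                 # 0.1868
dnet_dw5 = out_h1                                 # 0.5933

w5_new = w5 - lr * dE_dout * dout_dnet * dnet_dw5
print(w5_new)                                     # ~0.35891648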
A feed-forward neural network is used for various problems, including classification, regression, and pattern encoding. In the first case, the network returns a value z = f(w, x) which is very close to the target value y. In the pattern-encoding case, the target becomes the input itself, i.e. the network learns to reproduce x through f(w, x). To deal with multi-class classification, we can use either of the following techniques.
The left-hand-side network mentioned above is a modular architecture: every class connects with three distinct hidden neurons. The right-hand-side network is a fully connected network, which is used for a richer classification process. The left-side network is advantageous as it is modular and supports the gradual construction of classifiers: whenever we want to add a new class, the fully connected network requires training afresh, while the modular network only involves training a new module. The same issue also holds for regression. However, it is worth mentioning that the output neurons are typically linear in regression tasks, since there is no need to approximate any code.
As we have mentioned, a neural network (NN) has several hidden layers, and every layer consists of multiple neurons/nodes. Each node has connections on the input side and on the output side, and every connection is assigned a different weight, all the way to the final output layer. Before feeding data into the NN, the dataset should be normalised and then processed. Training a neural network means adjusting the weights so that the error is minimal. After training the NN, we can apply it to new data for classification or regression purposes.
Example - 4: The following diagram represents a feed-forward neural network with one hidden
layer:
A weight on connection between nodes i and j is denoted by wij , such as w13 is the weight on
the connection between nodes 1 and 3. The following table lists all the weights in the network:
w13 = -2, w23 = 3 ; w14=4,w24=-1 ; w35=1,w45=-1 ; w36 = -1, w46 = 1
Where, v denotes the weighted sum of a node. Each of the input nodes (1 and 2) can only receive
binary values (either 0 or 1). Calculate the output of the network (y5 and y6) for each of the input
patterns:
Pattern : P1 P2 P3 P4
Node 1 : 0 1 0 1
Node 2 : 0 0 1 1
Answer: In order to find the output of the network it is necessary to calculate weighted sums of
hidden nodes 3 and 4:
v3 = w13x1 + w23x2 , v4 = w14x1 + w24x2
Then find the outputs from hidden nodes using activation function ϕ:
y3 = ϕ(v3) , y4 = ϕ(v4) .
Use the outputs of the hidden nodes y3 and y4 as the input values to the output layer (nodes 5
and 6), and find weighted sums of output nodes 5 and 6:
v5 = w35y3 + w45y4 , v6 = w36y3 + w46y4 .
Finally, find the outputs from nodes 5 and 6 (also using ϕ):
y5 = ϕ(v5) , y6 = ϕ(v6) .
The output pattern will be (y5, y6). Perform this calculation for each input pattern. For pattern P1 (x1 = 0, x2 = 0):
v3 = −2 · 0 + 3 · 0 = 0, y3 = ϕ(0) = 1
v4 = 4 · 0 − 1 · 0 = 0, y4 = ϕ(0) = 1
v5 = 1 · 1 − 1 · 1 = 0, y5 = ϕ(0) = 1
v6 = −1 · 1 + 1 · 1 = 0, y6 = ϕ(0) = 1
The output of the network for P1 is (1, 1); the remaining patterns are computed in the same way.
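All four patterns can be computed at once (a sketch assuming NumPy, and assuming the step activation ϕ(v) = 1 for v ≥ 0, consistent with ϕ(0) = 1 in the answer above):

import numpy as np

def phi(v):
    return (v >= 0).astype(int)   # step activation: 1 when v >= 0

W_hidden = np.array([[-2, 3],     # weights into node 3 (w13, w23)
                     [ 4, -1]])   # weights into node 4 (w14, w24)
W_out    = np.array([[ 1, -1],    # weights into node 5 (w35, w45)
                     [-1,  1]])   # weights into node 6 (w36, w46)

for x in ([0, 0], [1, 0], [0, 1], [1, 1]):
    y_hidden = phi(W_hidden @ np.array(x))   # outputs of nodes 3 and 4
    y_out = phi(W_out @ y_hidden)            # outputs of nodes 5 and 6
    print(x, tuple(y_out))                   # P1 gives (1, 1), as above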
A weight on connection between nodes i and j is denoted by wij , such as w23 is the weight on
the connection between nodes 2 and 3. The following table lists all the weights in the network:
w13 = -3, w23 = 2 ; w14=3,w24= - 2 ; w35=4,w45= - 3 ; w36 = -2, w46 = 2
Deep learning is a subset of artificial intelligence, commonly called AI, that imitates the workings of the human brain in processing data and finding patterns for decision making. Deep learning is capable of learning, unsupervised, from unstructured or unlabeled data. It can be described as an AI function that simulates the workings of the human brain in processing data to detect objects, recognise speech, translate languages, and make decisions. The data involved is so vast, and primarily unstructured, that it could take humans decades or centuries to understand it or extract meaningful decisions from it. As mentioned earlier, a deep learning model works similarly to the multilayer perceptron models. There are various such models, for example the convolutional neural network (CNN) and long short-term memory (LSTM). The exact working of CNN and LSTM is out of scope, but you can refer to the working of the multilayer perceptron to understand the working of a deep learning model.
For example, if a digital payments company wants to detect fraud in its system, it might use machine learning tools. An algorithm built on machine learning techniques will process all transactions on the digital platform, try to find patterns, and flag the anomalies detected through those patterns.
Deep learning, termed a subset of machine learning algorithms, uses a hierarchical structure of neural network models to carry out the same process as the machine learning algorithm above. The neural networks, like the human brain, connect like a web. A program built on classical machine learning uses data linearly, while deep learning systems enable a nonlinear approach to processing the data.
As we have mentioned earlier, each layer of the neural network takes the inputs from the
previous layer; for example, the input layer has the parameters like sender information, data from
social media, a credit score of the customer, using IP address, and others and passed the output to
next layer for decision making. The final layer, the output layer, decides whether fraud has been
detected or not.
Deep learning, a prevalent technology, is used across all major industries for various tasks,
including decision making. Other examples may include commercial apps for image
recognition, apps for a recommendation system, and medical research tools for exploring the
possibility of reusing drugs.
12.9 SUMMARY
In this unit we learned about the fundamental concepts of neural networks and various concepts related to the area of neural networks and deep learning, including the activation function, the backpropagation algorithm, feed-forward networks, and more. The concepts were simplified with the help of numerical examples, which will help you map the theoretical concepts of neural networks to their implementation.
12.10 SOLUTIONS/ANSWERS
[Figure A-1: A single unit with two inputs x1, x2, weights W1, W2, and output Y = Φ(V).]
The node has two inputs x = (x1, x2) that receive only binary signals (either 0 or 1). How many different input patterns can this node receive?
[Figure B: A single unit with three inputs x1, x2, x3, weights W1, W2, W3, and output Y = Φ(V).]
Suppose that the weights corresponding to the three inputs have the following values:
w1 = 1 ; w2 = -1 ; w3 = 2
and the activation of the unit is given by the step-function:
Φ(V) = 1 for V>=1 and Φ(V) = 0 Otherwise
Calculate what will be the output value y of the unit for each of the following input patterns:
Pattern P1 P2 P3 P4
X1 1 0 1 1
X2 0 1 0 1
X3 0 1 1 1
Solution: Refer to section 12.4
Question - 3: The universal operators (NAND, NOR) are the building blocks of any computational device. Logical functions return only two possible values, true or false, based on the truth values of their arguments. For example, operator NAND returns False only when all its arguments are True; otherwise (if any of the arguments is false) it returns True. If we denote truth by 1 and false by 0, then the logical function NAND can be represented by the following table:
x1 : 0 0 1 1
x2 : 0 1 0 1
x1 NAND x2 : 1 1 1 0
[Figure C1: A single unit with two inputs x1, x2, weights W1, W2, and output Y = Φ(V).]
if the weights are w1 = 1 and w2 = 1 and the activation of the unit is given by the step-function:
Φ(V) = 1 for V>=2 and Φ(V) = 0 Otherwise
Where, v denotes the weighted sum of a node. Each of the input nodes (1 and 2) can only receive
binary values (either 0 or 1). Calculate the output of the network (y5 and y6) for each of the input
patterns:
Pattern : P1 P2 P3 P4
Node 1 : 0 1 0 1
Node 2 : 0 0 1 1
Solution: Refer to Section 12.7
PREPARATION TEAM
Dr. Sudhansh Sharma (Writer – Unit 13), Assistant Professor, SOCIS, IGNOU
Prof. Anjana Gosain (Content Editor), USICT-GGSIPU, Delhi
UNIT 13 FEATURE SELECTION AND EXTRACTION
13.1 Introduction
13.2 Dimensionality Reduction
13.2.1 Feature Selection
13.2.2 Feature extraction
13.3 Principal Component Analysis
13.4 Linear Discriminant Analysis
13.5 Singular Value Decomposition.
13.6 Summary
13.7 Solutions/Answers
13.8 Further Readings
13.1 INTRODUCTION
Data sets are made up of numerous data columns, which are also referred to as data attributes. These data columns can be interpreted as dimensions of an n-dimensional feature space, and data rows can be interpreted as points inside that space. One can gain a better understanding of a dataset by applying geometry in this manner. In point of fact, several of these attributes are often measurements of the same underlying entity; such overlaps can get muddled inside the algorithm's logic and change how well the model functions.
Input variables are the columns of data that are fed into a model in order to produce a forecast for a target variable. If your data is given in the form of rows and columns, such as in a spreadsheet, then "features" is another term that can be used interchangeably with input variables.
The presence of a large number of dimensions in the feature space may mean that the volume of that space is enormous, so that the points (data rows) in that space represent only a small and non-representative sample of its contents. The performance of machine learning algorithms can degrade when they are applied to data with an excessive number of input variables; this phenomenon is referred to as the "curse of dimensionality". As a consequence, one of the most common goals is to cut down the number of input features. The process of decreasing the number of dimensions that characterise a feature space is referred to as "dimensionality reduction", a phrase made up specifically to describe this operation.
At this point we are required to put the concept of dimensionality reduction into practice. This can be done in one of two ways: by feature selection or by feature extraction; both approaches are broken down in greater detail below. The dimension reduction step is one of the preprocessing phases of data mining, and it can be beneficial in minimising the impacts of noise, correlation, and excessive dimensionality.
Some more examples are presented below to let you understand What does dimensionality
reduction have to do with machine learning and predictive modelling?
● A simple issue concerning the classification of e-mails, in which we are tasked with deciding whether or not a certain email constitutes spam, can serve as a practical illustration of the concept of dimensionality reduction. The features can include elements like whether or not the email has a standard subject line, the content of the email, whether or not it uses a template, and so on. However, some of these features may overlap with one another.
● A classification problem that involves humidity and rainfall can sometimes be simplified
down to just one underlying feature as a result of the strong relationship that exists
between the two variables. As a direct consequence of this, the number of characteristics
could get cut down in some circumstances.
● A classification problem in three dimensions can be difficult to visualise, whereas it may be translated to a fundamental space with two dimensions, and a two-dimensional problem can in turn be mapped to a one-dimensional line. This concept is depicted in the diagram that follows, which shows how a three-dimensional feature space can be broken down into lower-dimensional feature spaces, with the number of features being reduced even further if some of them are discovered to be related.
[Figure: Dimensionality reduction — a feature space is projected onto progressively fewer dimensions.]
In the context of dimensionality reduction, various techniques like Principal Component Analysis, Linear Discriminant Analysis, and Singular Value Decomposition are frequently used. In this unit we will discuss all of these concepts related to dimensionality reduction.
13.2 DIMENSIONALITY REDUCTION
Both data mining and machine learning methodologies face processing challenges when working with big amounts of data (many attributes). In point of fact, the dimensions of the feature space used by the approach, often referred to as the model attributes, play the most important role: processing algorithms grow more difficult and time-consuming to implement as the dimensionality of the processing space increases.
These elements, also known as the model attributes, are the fundamental qualities, and they can be either variables or features. When there are more features, it is more difficult to visualise them all, and as a result the work on the training set becomes more complex as well. The complexity increases further when a significant number of the characteristics are correlated, which can make the classification unreliable. In circumstances like these, strategies for decreasing the number of dimensions can prove highly beneficial. In a nutshell, "the process of making a set of major variables from a huge number of random variables is what is referred to as dimension reduction." When conducting data mining, the dimension reduction step can be helpful as a preprocessing step to lessen the negative effects of noise, correlation, and excessive dimensionality.
Feature selection: In this approach, a subset of the complete set of variables is selected; as a result, the number of attributes used to describe the problem is narrowed down. It is normally done in one of three ways:
o Filter method
o Wrapper method
o Embedded method
Feature extraction: This takes data from a space with many dimensions and transforms it into another space with fewer dimensions.
13.2.1 Feature selection: It is the process of selecting some attributes from a given collection of
prospective features, and then discarding the rest of the attributes that were considered. The use
of feature selection can be done for one of two reasons: either to get a limited number of
characteristics in order to prevent overfitting or to avoid having features that are redundant or
irrelevant. For data scientists, the ability to pick features is a vital asset. It is essential to the
success of the machine learning algorithm that you have a solid understanding of how to choose
the most relevant features to analyse. Features that are irrelevant, redundant, or noisy can
contaminate an algorithm, which can have a detrimental impact on the learning performance,
accuracy, and computing cost. The importance of feature selection is only going to increase as
the size and complexity of the typical dataset continues to balloon at an exponential rate.
Feature Selection Methods: Feature selection methods can be divided into two categories: supervised, which are appropriate for use with labelled data, and unsupervised, which are appropriate for use with unlabeled data. These approaches fall into four further categories: filter methods, wrapper methods, embedded methods, and hybrid methods:
● Filter methods: Filter methods choose features based on statistics instead of how well they
perform in feature selection cross-validation. Using a chosen metric, irrelevant attributes are
found and recursive feature selection is done. Filter methods can be either univariate, in
which an ordered ranking list of features is made to help choose the final subset of features,
or multivariate, in which the relevance of all the features as a whole is evaluated to find
features that are redundant or not important.
● Wrapper methods: Wrapper feature selection methods look at the choice of a set of
features as a search problem. Their quality is judged by preparing, evaluating, and
comparing a set of features to other sets of features. This method makes it easier to find
possible interactions between variables. Wrapper methods focus on subsets of features that
will help improve the quality of the results from the clustering algorithm used for the
selection. Popular examples are Boruta feature selection and Forward feature selection.
Among all approaches, the most conventional feature selection method is forward feature selection.
Forward feature selection: The first step in the process of feature selection is to evaluate each
individual feature and choose the one that results in the most effective algorithm model. This is
referred to as "forward feature selection." After that step, each possible combination of the
feature that was selected and a subsequent feature is analysed, and then a second feature is
selected, and so on, until the required specified number of features is chosen. The operation of
the forward feature selection algorithm is depicted here in the figure.
[Figure: Selecting the best subset — generate a subset from the set of all features, feed it to the learning algorithm, and measure its performance.]
1. Train the model with each feature being treated as a separate entity, and then evaluate
its overall performance.
2. Select the variable that results in the highest level of performance.
3. Carry on with the process while gradually introducing each variable.
4. The variable that produced the greatest amount of improvement is the one that gets
kept.
5. Perform the entire process once more until the performance of the model does not
show any meaningful signs of improvement.
Here, a fitness level prediction based on the three independent variables is used to show how
forward feature selection works.
ID Calories_burnt Gender Plays_Sport? Fitness Level
1 121 M Yes Fit
2 230 M No Fit
3 342 F No Unfit
4 70 M Yes Fit
5 278 F Yes Unfit
6 146 M Yes Fit
7 168 F No Unfit
8 231 F Yes Fit
9 150 M No Fit
10 190 F No Fit
So, the first step in forward feature selection is to train n models, one per feature, and judge how well each works by looking at each feature on its own. With three independent variables, we train three models, one for each of these three features. Let's say we trained a model using the Calories_burnt feature and the Fitness Level target variable and got an accuracy of 87 percent.
Next we train the model using the Gender feature and obtain an accuracy of 80%, and using the Plays_Sport? feature an accuracy of 91%. The feature giving the best performance (here 91%) is selected first, and the process then continues with pairs that include it; a small sketch of this procedure follows.
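This is an illustrative sketch assuming scikit-learn is available; SequentialFeatureSelector greedily adds features, mirroring steps 1–5 above (the encoding of Gender and Plays_Sport? as 0/1 and the estimator choice are assumptions):

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Toy fitness data from the table above: [Calories_burnt, Gender, Plays_Sport?].
X = np.array([[121, 1, 1], [230, 1, 0], [342, 0, 0], [70, 1, 1],
              [278, 0, 1], [146, 1, 1], [168, 0, 0], [231, 0, 1],
              [150, 1, 0], [190, 0, 0]])
y = np.array([1, 1, 0, 1, 0, 1, 0, 1, 1, 1])   # 1 = Fit, 0 = Unfit

selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=2,
                                     direction="forward", cv=2)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected features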
13.2.2 Feature extraction: The process of reducing the number of resources needed to describe a large amount of data is called "feature extraction". One of the main problems in complicated data analysis is the large number of variables to keep track of. A large number of variables requires a lot of memory and processing power, and it can also cause a classification algorithm to overfit to training examples and fail to generalise to new samples. Feature extraction is a broad term for the different ways of combining variables to get around these problems while still giving a true picture of the data. Many people who work with machine learning think that extracting features in the best way possible is the key to making good models. The features must represent the information in the data in a way that fits the needs of the algorithm that will be used to solve the problem. Some "inherent" features can be taken straight from the raw data, but most of the time we need to use these "inherent" features to derive "relevant" features that we can use to solve the problem. In simple terms, feature extraction can be described as a technique for defining a set of features, or qualities, that best represent the information. Feature extraction techniques such as PCA, ICA, LDA, LLE, t-SNE and AE (autoencoders) are some of the common examples in machine learning.
13.3 PRINCIPAL COMPONENT ANALYSIS
Karl Pearson was the first person to come up with this idea. It is based on the principle that when data from a higher-dimensional space is mapped into a lower-dimensional space, the lower-dimensional space should retain the maximum variation of the data. In simple terms, principal component analysis (PCA) is a way to extract important variables (in the form of components) from a large set of variables in a data set. It tends to find the direction in which the data is most spread out. PCA is most useful when you have data with three or more dimensions.
[Figure: PCA finds new orthogonal axes e1 and e2, aligned with the directions of maximum variance of the data in the (f1, f2) feature space.]
The PCA method offers the following advantages:
● It helps to compress data, which reduces the amount of space needed to store it and the amount of time it takes to process it.
● It also helps to get rid of any redundant features.
Problem-01:
Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7) and (7, 8). Compute the principal component of the given data using the PCA algorithm.
Answer:
The given feature vectors are x1, x2, x3, x4, x5, x6 with the values:
x: 2 3 4 5 6 7
y: 1 5 3 6 7 8
Step-2: Find the mean vector (µ):
µ = ((2+3+4+5+6+7)/6, (1+5+3+6+7+8)/6) = (4.5, 5)
Step-3: Form the matrices mᵢ = (xᵢ − µ)(xᵢ − µ)ᵀ:
m1 = (−2.5, −4)ᵀ(−2.5, −4) = [[6.25, 10], [10, 16]]
m2 = (−1.5, 0)ᵀ(−1.5, 0) = [[2.25, 0], [0, 0]]
m3 = (−0.5, −2)ᵀ(−0.5, −2) = [[0.25, 1], [1, 4]]
m4 = (0.5, 1)ᵀ(0.5, 1) = [[0.25, 0.5], [0.5, 1]]
m5 = (1.5, 2)ᵀ(1.5, 2) = [[2.25, 3], [3, 4]]
m6 = (2.5, 3)ᵀ(2.5, 3) = [[6.25, 7.5], [7.5, 9]]
Step-4: The covariance matrix is the average of these:
Covariance matrix = (1/6) [[17.5, 22], [22, 34]] = [[2.92, 3.67], [3.67, 5.67]]
Step-05: Find the eigenvalues of the covariance matrix M by solving |M − λI| = 0; this gives λ1 ≈ 8.22 and λ2 ≈ 0.38.
Clearly, the second eigenvalue is very small compared to the first eigenvalue, so the eigenvector corresponding to the greatest eigenvalue is the principal component for the given data set. It is found from
MX = λX
On simplification, we get
Eigenvector: (X1, X2) = (2.55, 3.67)
Thus, the PCA result for the given problem is
Principal component: (X1, X2) = (2.55, 3.67)
Lastly, we project the data points onto the new subspace.
[Figure: Projection of the data points onto the principal axis x1 = 0.69·x2.]
Problem -02
Use PCA Algorithm to transform the pattern (2, 1) onto the eigen vector in the previous
question.
Solution-
The given feature vector is (2, 1)ᵀ.
The feature vector gets transformed to:
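The PCA computation above, including the projection of (2, 1), can be checked numerically (a sketch assuming NumPy; it assumes the projection is taken as the dot product of the mean-centred vector with the unit eigenvector, and the eigenvector's sign may differ from the hand calculation):

import numpy as np

# PCA for Problem-01: covariance divides by n, as in the worked example.
X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)

mu = X.mean(axis=0)                      # mean vector (4.5, 5)
D = X - mu
cov = (D.T @ D) / len(X)                 # [[2.92, 3.67], [3.67, 5.67]]

eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
print(eigvals)                           # ~[0.38, 8.21]
pc = eigvecs[:, -1]                      # unit principal component, ~(0.57, 0.82)
print(pc)
print((np.array([2, 1]) - mu) @ pc)      # projection of (2, 1) onto the axis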
13.4 LINEAR DISCRIMINANT ANALYSIS
The linear classification algorithm known as logistic regression is both straightforward and robust. On the other hand, it has a few restrictions or shortcomings that highlight the need for more capable linear classification algorithms. Some of the problems are listed below:
Binary-class problems. Logistic regression is designed for problems that involve binary classification, i.e. two classes. It can be extended to handle multi-class classification, but in practice this is not very common.
Unstable with well-separated classes. When the classes are extremely distinct from one another, logistic regression may become unstable.
Unstable with few examples. When there are not enough examples from which to estimate the parameters, the logistic regression model may become unstable.
In view of the limitations of logistic regression that were discussed earlier, the linear
discriminant analysis is one of the prospective linear methods that can be used for multi-class
classification. This is because it addresses each of the aforementioned concerns in their totality,
which is the primary reason for its success (i.e. flaws of logistic regression). When dealing with
issues that include binary categorization, two statistical methods that could be effective are
logical regression and linear discriminant analysis. Both of these techniques are linear and
regression-based.
Understanding LDA Models : In order to simplify the analysis of your data and make it
more accessible, LDA will make the following assumptions about it:
1. The distribution of your data is Gaussian, and when plotted, each variable appears to be a bell
curve.
2. Each feature has the same variance, which indicates that the values of each feature vary by the
same amount on average in relation to the mean.
On the basis of these presumptions, the LDA model generates estimates for both the mean and
the variance of each class. In the case where there is only one input variable, which is known as
the univariate scenario, it is straightforward to think about this.
When the sum of values is divided by the total number of values, we are able to compute the
mean value, or mu, of each input, or x, for each class(k), in the following manner.
muk = 1/nk * sum(x)
Where,
muk represents the average value of x for class k and
nk represents the total number of occurrences that belong to class k.
When calculating the variance across all classes, the average squared difference of each
individual result's distance from the mean is employed.
sigma^2 = 1 / (n-K) * sum((x – mu)^2)
Where sigma^2 represents the variance of all inputs (x), n represents the number of instances, K
represents the number of classes, and mu is the mean for input x.
LDA makes predictions by estimating the probability that a new set of inputs belongs to each class. The output class selected is the one with the highest probability. The model uses Bayes' Theorem to estimate these probabilities. Using the prior probability of each class as well as the probability of the data belonging to each class, Bayes' Theorem estimates the probability of the output class (k) given the input (x) as:
P(Y = k | X = x) = (PIk * fk(x)) / sum over all classes l of (PIl * fl(x))
The base probability of each class (k) that can be found in your training data is denoted by the
symbol PIk (e.g. 0.5 for a 50-50 split in a two class problem). This concept is referred to as the
prior probability within Bayes' Theorem.
PIk = nk/n
Here f(x) is the estimated probability of x belonging to the class, and we use a Gaussian distribution function for f(x). Plugging the Gaussian into the equation above and simplifying, we arrive at the equation presented below. This type of function is referred to as a discriminant function, and the output classification (y) is the class with the greatest value:
Dk(x) = x * (muk/sigma^2) – (muk^2/(2*sigma^2)) + ln(PIk)
Where Dk(x) is the discriminant function for class k given input x, and muk, sigma^2 and PIk are all estimated from your data.
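The univariate discriminant function above translates directly into code. The following Python sketch (our own illustration, not a library routine; the one-dimensional training data is made up) estimates muk, sigma^2 and PIk from the data and classifies a new point by picking the class with the largest Dk(x):

import numpy as np

# Made-up one-dimensional training data for two classes
# (an assumption for illustration only).
x = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])
y = np.array([0,   0,   0,   1,   1,   1])

classes = np.unique(y)
n, K = len(x), len(classes)

mu = {k: x[y == k].mean() for k in classes}          # muk = 1/nk * sum(x)
pi = {k: (y == k).mean() for k in classes}           # PIk = nk / n
# Pooled variance: sigma^2 = 1/(n-K) * sum((x - mu)^2)
sigma2 = sum(((x[y == k] - mu[k]) ** 2).sum() for k in classes) / (n - K)

def discriminant(x_new, k):
    # Dk(x) = x * (muk / sigma^2) - muk^2 / (2 * sigma^2) + ln(PIk)
    return x_new * mu[k] / sigma2 - mu[k] ** 2 / (2 * sigma2) + np.log(pi[k])

x_new = 2.2
print(max(classes, key=lambda k: discriminant(x_new, k)))   # predicts class 0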
Now to perform the above task we need to prepare our data first, so the question arises,
How to prepare data suitable for LDA?
This section gives you some ideas to think about when getting your data ready to use with LDA.
Classification problems: LDA is used to solve classification problems where the output variable is categorical. It handles both two-class and multi-class problems.
Gaussian Distribution: The standard way to use the model assumes that the input variables
have a Gaussian distribution. Think about looking at the univariate distributions of each attribute
and using transformations to make them look more like Gaussian distributions (e.g. log and root
for exponential distributions and Box-Cox for skewed distributions).
Remove Outliers: Think about removing outliers from your data. These things can mess up the
basic statistics like the mean and the standard deviation that LDA uses to divide classes.
Same Variance: LDA assumes that the variance of each input variable is the same. Before using
LDA, you should almost always normalise your data so that it has a mean of 0 and a standard
deviation of 1.
Problem-2: Compute the Linear Discriminant projection for the following two-dimensional dataset: X1 = (x1, x2) = {(4,1), (2,4), (2,3), (3,6), (4,4)} and X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}.
[Figure: the two classes plotted in the (X1, X2) plane, with the LDA projection direction wLDA]
The class statistics are:
µ1 = [3.0; 3.6], µ2 = [8.4; 7.6]
S1 = [0.8 −0.4; −0.4 2.64], S2 = [1.84 −0.04; −0.04 2.64]
giving the within-class scatter Sw = S1 + S2 and the between-class scatter SB = (µ1 − µ2)(µ1 − µ2)ᵀ.
The LDA projection is then obtained as the solution of the generalized eigenvalue problem:
Sw⁻¹ SB v = λv ⇒ |Sw⁻¹ SB − λI| = 0 ⇒ |[11.89 − λ, 8.81; 5.08, 3.76 − λ]| = 0 ⇒ λ = 15.65
[11.89 8.81; 5.08 3.76] [v1; v2] = 15.65 [v1; v2] ⇒ [v1; v2] = [0.91; 0.39]
Or directly, from w = Sw⁻¹ (µ1 − µ2), which gives the same direction.
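This computation is easy to reproduce in NumPy. The sketch below (our own illustration) forms the scatter matrices for the two classes above and extracts the projection direction:

import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)       # within-class scatter of class 1
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)       # within-class scatter of class 2
Sw = S1 + S2
SB = np.outer(mu1 - mu2, mu1 - mu2)            # between-class scatter

vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ SB)
w = vecs[:, np.argmax(vals.real)].real         # eigen vector of the largest eigen value
print(vals.real)                               # approx. [15.65, 0]
print(w / np.linalg.norm(w))                   # approx. +/- [0.91, 0.39]

# Or directly: w is proportional to Sw^-1 (mu1 - mu2)
print(np.linalg.inv(Sw) @ (mu1 - mu2))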
Singular value decomposition (SVD) is a method of decomposing a matrix into the product of three smaller matrices:
A = U S Vᵀ
Where:
● A is an m × n matrix
● U is an m × n matrix with orthonormal columns
● S is an n × n diagonal matrix
● V is an n × n orthogonal matrix
The rationale for transposing the last matrix, and the meaning of the word "orthogonal" (in case your algebra is rusty), will be revealed as the discussion proceeds; the two outer matrices share this orthogonality property.
If the diagonal matrix S is flattened into a vector of its diagonal entries, the formula reduces to a single summation, A = Σ si ui viᵀ. The singular values si are conventionally arranged from largest to smallest.
For λ = 0, the reduced matrix is [1 −1/3; 0 0], so v2 = (1/√10) [1; 3].
For λ = 90, the reduced matrix is [1 3; 0 0], so v1 = (1/√10) [−3; 1].
Now we can find the reduced SVD right away, since σ1 = √90 and u1 = (1/σ1) A v1 = (1/3) [1; −2; −2].
To complete U, the remaining columns are the eigen vectors of AAᵀ for λ = 0:
AAᵀ = [10 −20 −20; −20 40 40; −20 40 40] ~ [1 −2 −2; 0 0 0; 0 0 0] ⇒ [2; 1; 0] and [2; 0; 1].
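In practice one rarely carries out these eigen computations by hand; NumPy performs the factorization directly. A minimal sketch follows (the 3 × 2 matrix A below is our own assumed example, chosen to be consistent with the eigen values 90 and 0 used above; it is not given in the text):

import numpy as np

# An illustrative rank-1 matrix whose A^T A has eigen values 90 and 0
A = np.array([[-3.0,  1.0],
              [ 6.0, -2.0],
              [ 6.0, -2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # reduced SVD: A = U S V^T
print(s)                    # approx. [9.4868, 0], i.e. [sqrt(90), 0]
print(U[:, 0])              # approx. +/- [1/3, -2/3, -2/3]
print(Vt[0])                # approx. +/- [-3, 1] / sqrt(10)

# Verify the reconstruction
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True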
13.7 SOLUTIONS/ANSWERS
Ans2. The goal of feature extraction is to reduce the number of features in a dataset by making new features from the ones that are already there (and then discarding the original features).
Ans1. The LDA technique is a multi-class classification method that may be utilised to automatically perform dimensionality reduction. LDA cuts the number of features down from the initial number of features. LDA projects the data into a new linear feature space; naturally, the classifier will have a high level of accuracy if the data can be linearly separated.
Ans2. Some of the limitations of logistic regression are as follows: it is meant for two-class (binary) problems; it may become unstable when the classes are very well separated; and it may become unstable when there are too few examples to estimate the parameters from.
Ans1. The singular value decomposition is a factorization technique that can be used in linear algebra for real or complex matrices. It extends the eigen decomposition of a square normal matrix to any m × n matrix by using an orthonormal eigen basis. It is related to the polar decomposition.
Structure
14.1 Introduction
14.2 Objectives
14.3 What are Association Rules?
14.3.1 Basic Concepts
14.3.2 Association rules: Binary Representation
14.3.3 Association rules Discovery
14.4 Apriori Algorithm
14.4.1 Frequent Itemsets Generation
14.4.2 Case Study
14.4.3 Generating Association Rules using Frequent Itemset
14.5 FP Tree Growth
14.5.1 FP Tree Construction
14.6 Pincer Search
14.7 Summary
14.8 Solutions to Check Your Progress
14.9 Further Readings
14.1 INTRODUCTION
Imagine a salesperson at a shop. If you buy a product from the shop, the salesperson recommends you more items related to the product you have purchased. He may also suggest the items that are frequently bought with the product you have purchased. The salesperson may also try to figure out your choices from the products you observe at the shop and recommend further items. One of the common examples of this situation is market basket analysis, as illustrated in two cases in Figure 1. If you add bread and butter to your basket at the store, the salesperson may recommend that you add cookies, eggs and milk to your shopping cart too. Similarly, if a customer puts vegetables like onion and potato in their cart, the salesman at the shop may suggest adding other vegetables like tomato to the basket. If the salesman notices a male customer, then he may suggest adding beer too, based on his historical observation that male customers often prefer to buy beer as well.
Figure 1: Two different cases of Market Basket Analysis
The salesperson thus analyses the purchasing habits of the customers and tries to analyse the correlations among the items/products that are frequently bought together by them. The analysis of such correlations helps the retailers develop marketing strategies to increase the sale of products in the shop. Discovering the correlations among all the items sold in a shop helps businesses make decisions regarding designing catalogues, organizing the store, and analysing customer shopping behaviour.
14.2 OBJECTIVES
In machine learning, association rules are one of the important concepts that is widely applied in problems like market basket analysis. Consider a supermarket, where all the related items such as grocery items, dairy items, cosmetics, stationery items etc. are kept together in the same aisle. This helps the customers find their required items quickly. It further helps them remember items they might have forgotten to purchase, or items they may like to purchase if suggested. Association rules thus enable one to correlate various products from a huge set of available items. Analysing the items customers buy together also helps the retailers identify the items they can offer at a discount. For example, a retailer may sell baby lotion and baby shampoo at MRP, but offer a discount on their combination. A customer who wished to buy only the shampoo or only the lotion may now think of buying the combination. Other factors too can contribute to the purchase of a combination of products. Another strategy can be to keep related products at the opposite ends of the shelf, prompting the customer to scan through the entire shelf in the hope that he might add a few more items to his cart.
It is important to note that association rules do not extract the customer's preference about the items, but find the relations among the items that are generally bought together. The rules only identify the frequent associations between the items. A rule works with an antecedent (if) and a consequent (then), both referring to sets of items. For example, if a person buys a pizza, then he may buy a cold drink too, because there is a strong relation between pizza and cold drinks. Association rules help to find the dependency of one item on another by considering the history of customers' transaction patterns.
There are a few terms that one should understand before studying the algorithms.
a. k-Itemset: A set of k items. For example, a 2-itemset can be {pencil, eraser} or {bread, butter}, and a 3-itemset can be {bread, butter, milk}.
b. Support: The frequency with which an item appears across all the considered transactions is called the support of the item. Mathematically, the support of an item x is defined as:
support(x) = (number of transactions containing x) / (total number of transactions)
c. Confidence: Confidence is defined as the likelihood of obtaining item y along with item x. Mathematically, it is defined as the ratio of the frequency of transactions containing both items x and y to the frequency of transactions that contain item x:
confidence(x => y) = support(x ∪ y) / support(x)
d. Frequent Itemset: An itemset whose support is at least the minimum support threshold is known as a frequent itemset. For example, if the minimum support threshold is 10, then an itemset with support score 11 is a frequent itemset but an itemset with support score 9 is not.
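These two measures are simple to compute. The short Python sketch below (our own illustration, with a made-up transaction list) computes support and confidence for the rule {milk} => {bread}:

# Support and confidence over a small made-up transaction list
# (an assumption for illustration only).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "bread", "egg"},
    {"butter", "egg"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # confidence(x => y) = support(x U y) / support(x)
    return support(x | y) / support(x)

print(support({"milk"}))                    # 0.75
print(confidence({"milk"}, {"bread"}))      # 1.0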
Let I = {I1, I2, I3, ..., In} be a set of n items and T = {T1, T2, T3, ..., Tt} be a set of t transactions, where each transaction Ti contains a non-null set of items purchased by a customer, such that Ti ⊆ I. Let each item Ii be represented by a binary variable whose value, 0 or 1, represents the absence or presence of the item in a transaction.
For example, consider a set of four transactions T1, T2, T3 and T4 with the following items:
The binary representations for the transaction set are shown in Table 1.
Consider the rule Milk => Bread with support 75% (or, as a value, 0.75) and confidence 100% (or, as a value, 1):
Support(Milk => Bread) = frequency(Milk, Bread) / total transactions = 3/4 = 0.75
Confidence(Milk => Bread) = support(Milk ∪ Bread) / support(Milk) = 0.75 / 0.75 = 1
This means that out of all the purchases made at the store, 75% of the time milk and bread are purchased together. A confidence of 100% implies that of all the customers who bought milk, every one of them also bought bread. Thus, support and confidence are the two important measures of interest in the items: association rules are considered important if they satisfy the minimum thresholds of support and confidence. These thresholds can be set by the experts.
Let A and B be two itemsets such that A ⊆ I and B ⊆ I, with A ≠ Ø, B ≠ Ø and A ∩ B = Ø. For a given transaction set T, the rule A => B holds with support s and confidence c as defined in section 14.3.1. Support s is defined as the percentage of transactions that contain the items of set A as well as of set B, i.e. of the union A ∪ B. Confidence c is defined as the percentage of transactions containing A that also contain B. When written as percentages, support and confidence range between 0 and 100; when expressed as ratios, their values range between 0 and 1.
Association rules can further be defined as strong association rules if they satisfy the minimum threshold support, called min_sup, and the minimum threshold confidence, called min_conf.
The problem of discovery of association rules can be stated as: Given a set of transactions T, find
the rules whose support and confidence are greater than equal to the minimum support and
confidence threshold.
The traditional approach to generating association rules is to compute the support and confidence for every possible combination of items. But this approach is computationally infeasible, as the number of combinations of items can be exponentially large. To avoid such a large number of computations, the basic approach is to skip the needless combinations without computing their support and confidence scores. For example, we can observe from Table 1 that the combination {milk, egg} can be ignored, as the combination is infrequent. Hence, we prune the rule Milk => Egg without computing the support and confidence for these items.
Therefore, the steps for obtaining association rules can be summarized as:
1. Find all frequent item sets: By definition, obtain the frequent itemset as set of items whose
support score is at least the min_sup.
2. Generate strong association rules: By definition, obtain rules from the itemsets obtained in step 1 with support at least min_sup and confidence at least min_conf. These rules are called strong rules.
Challenge in obtaining association rules:
Low threshold value: If the minimum support threshold (min_sup) is set quite low, then a very large number of itemsets can belong to the frequent itemsets.
Solution to the problem is to define a closed frequent itemset and maximal frequent itemset.
1. Closed Frequent Itemset: A frequent itemset A is said to be a closed frequent itemset if it is closed and has a support score at least equal to min_sup. An itemset A is said to be closed in a data set T if there does not exist any proper superset B of A such that support(B) equals support(A).
2. Maximal Frequent Itemset: A frequent itemset A is said to be a maximal frequent
itemset in T if A is frequent and there exists no super set B such that A⊂B and B is
frequent in T.
We use a Venn diagram to understand the concept, as shown in Figure 2.
Figure 2: Venn Diagram to demonstrate the relationship between Closed item sets, frequent item sets,
closed frequent item sets and maximal frequent item sets.
This illustrates that the maximal frequent itemsets are a subset of the closed frequent itemsets, which in turn are a subset of the frequent itemsets.
R. Agrawal and R. Srikant proposed the Apriori algorithm in the year 1994. The algorithm is used to obtain the frequent itemsets for association rules. The algorithm is named so as it needs prior knowledge of the frequent itemsets. This section discusses the generation of frequent patterns as observed in the analysis of the market basket problem. Section 14.4.1 presents the generation of frequent itemsets using the Apriori algorithm, section 14.4.2 presents a case study, and section 14.4.3 discusses generating strong association rules from the frequent itemsets.
For a given set of n items, there are 2^n − 1 possible combinations of items. Consider an itemset I = {A, B, C, D} with four items; there are 15 combinations of items: {{A}, {B}, {C}, {D}, {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}, {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}, {A, B, C, D}}. This can be represented by a lattice diagram, as shown in Figure 3.
Figure 3: Lattice representing 15 combinations of items
The Apriori algorithm searches the items level by level; to find the (k+1)-itemsets, it uses the k-itemsets. To determine the frequent itemsets, the algorithm first finds the candidate itemsets from the lattice representation. But, as explained, for a given set of n items the lattice can contain up to 2^n − 1 itemsets, so one needs to control the search over this exponentially growing space and increase the efficiency of the algorithm. For this, two important principles are given below.
Definition 1: Apriori Principle: If an itemset is frequent, then all of its subsets must be frequent.
To understand the principle, let us assume the itemset {A, B, C} is a frequent itemset. Then its subsets {A, B}, {A, C}, {B, C}, {A}, {B} and {C} are also frequent itemsets (marked in yellow), as shown in Figure 4. This is because if a transaction contains the itemset {A, B, C}, then the transaction also contains all the subsets of this itemset.
On the other hand, if an itemset, say {B, D}, is not a frequent itemset, then all the supersets containing {B, D}, i.e. {A, B, D}, {B, C, D} and {A, B, C, D}, are also not frequent itemsets (marked in blue), as shown in Figure 4. This is because if an itemset X is not frequent, then any other itemset containing X will also not be frequent. Recall the definition: an itemset is frequent if and only if its support is greater than or equal to the minimum support threshold; conversely, an itemset whose support is less than the minimum support threshold is infrequent.
This Apriori property helps in pruning the itemsets, thus reducing the search space and increasing the efficiency of the algorithm. This type of pruning technique is known as support-based pruning.
Figure 4: Frequent (yellow nodes) and Infrequent Item sets (blue nodes)
Definition 2: Antimonotone Property: This property states that if an itemset X does not satisfy a constraint, then any proper superset of X will also not satisfy the constraint. The property is called antimonotone because it is monotonic in terms of not satisfying the constraint. This property further helps in reducing the search space.
Given the set of transactions with the minimum support and confidence scores, perform the following steps to generate the frequent itemsets using the Apriori algorithm.
Let C(k) be the set of candidate k-itemsets and F(k) the set of frequent k-itemsets.
Step 1: Find F(1), i.e. the frequent 1-itemsets, by computing the support for each item in the set of transactions.
i) Join Operation: For k ≥ 2, generate C(k), the new candidate k-itemsets, based on the frequent itemsets obtained in the previous step. This can be done in one of two ways:
1. Compute F(k-1) * F(1) to extend each F(k-1) itemset by joining it with each F(1) itemset. This join operation results in the C(k) candidate itemsets.
For example, let k = 2 and the three 1-itemsets be {A}, {B} and {C}. Two 1-itemsets can be joined to obtain 2-itemsets such as {A, B}, {A, C} and {B, C}. Further, to obtain 3-itemsets we join 2-itemsets with 1-itemsets, e.g. joining {A, B} with {C} to obtain {A, B, C}.
However, this method may generate duplicate itemsets. For example, joining F(2) = {A, B} with F(1) = {C} generates C(3) = {A, B, C}; joining F(2) = {A, C} with F(1) = {B} also generates C(3) = {A, B, C}. One way to avoid generating duplicate itemsets is to keep the items within each itemset sorted in lexicographic order. For example, the itemsets {A, B}, {B, C}, {A, C} and {A, B, C} are sorted in lexicographic order, but {C, B} and {A, C, B} are not. An itemset is then only extended with items that follow its last item in lexicographic order; a join that would violate the lexicographic order is not performed. A working example of generating the candidate F(k) itemsets using this approach is shown in Figure 5.
a) F(1) joins F(1) to give C(2) (b) F(2) joins F(1) to give C(3)
Figure 5: Example of Augmenting F(k−1) and F(1) to generate candidate F(k) itemset
2. F(k-1) * F(k-1): A distinct pair of (k-1)-itemsets X = {A1, A2, A3, ..., Ak-1} and Y = {B1, B2, ..., Bk-1}, each sorted in lexicographic order, are joined iff:
Ai = Bi for i = 1, 2, ..., (k-2), but Ak-1 ≠ Bk-1.
The itemsets X and Y are joined only if their last items, Ak-1 and Bk-1, are in lexicographic order. For example, let the frequent itemsets be X(3) = {A, B, C} and Y(3) = {A, B, E}. X is joined with Y to generate the candidate itemset C(4) = {A, B, C, E}. Now consider another frequent itemset Z = {A, B, D}. Itemsets X and Z can be joined to generate the candidate set {A, B, C, D}, but itemsets Y and Z cannot be joined, as their last items, E and D, are not arranged in lexicographic order. As a result, the F(k−1) × F(k−1) candidate generation method merges a frequent itemset only with the ones that contain items in the sorted order, thus saving some computations. A working example demonstrating candidate generation using F(k−1) × F(k−1) is shown in Figure 6.
ii) Prune Operation: Scan the entire generated candidate itemset C(k) to compute the support score of each itemset in the set. Any itemset with support score less than the minimum support threshold is pruned from the candidate set. The itemsets left after pruning form the frequent itemset F(k), such that F(k) ⊆ C(k). To prune itemsets, the Apriori property is used: an infrequent (k-1)-itemset cannot be a subset of a frequent k-itemset. Thus, if any (k-1)-subset of a candidate itemset in C(k) is not in F(k-1), then the candidate cannot be frequent and is removed from C(k).
14.4.2 Case Study: Find the frequent itemset from the given set of items
Consider the set of transactions represented in binary form in Table 1 as given below. Assume
minimum support threshold to be 2.
Transaction List of items
Id
T1 Milk, Cookies, Bread
Step 1: Arrange the items in lexicographic order. Call this candidate set C(1).
Step 2: Obtain the support score for each item in the candidate itemset C(1).
Step 3: Prune the items whose support score is less than the minimum support threshold. This results in the frequent 1-itemsets, F(1).
Step 4: Generate the candidate 2-itemsets C(2) by joining F(1) with F(1), and compute the support score of each itemset.
As the support score of each itemset is at least 2, none of the itemsets is pruned. So, the table above represents the frequent 2-itemsets, F(2).
Step 5: Generate 3-candidate itemsets by joining F(2) and F(1) to obtain C(3). Compute the
support score of each itemset.
Step 6: Prune the itemsets in C(3) if their support is less than the minimum support threshold.
We then obtain F(3).
Step 7: Similarly we obtain C(4) using F(3) and prune the candidate itemset to obtain frequent
itemset F(4).
S.No Itemset Support
As the only itemset in C(4) has support count 1, it is pruned and F(4) = Ø.
The Apriori algorithm may be summarized in pseudocode as follows:

F(1) ← {frequent 1-itemsets}
k ← 2
while F(k−1) is not empty
    C(k) ← Apriori_gen(F(k−1), k)
    for each transaction t in T
        Dt ← {c in C(k) : c ⊆ t}
        for each candidate c in Dt
            count[c] ← count[c] + 1
    F(k) ← {c in C(k) : count[c] ≥ min_sup}
    k ← k + 1
return Union of all F(k)

Apriori_gen(F, k)
    for all X ∈ F, Y ∈ F where X1 = Y1, X2 = Y2, ..., Xk-2 = Yk-2 and Xk-1, Yk-1 are in lexicographic order
        c ← X ∪ {Yk-1}
        if every (k−1)-subset u of c is in F    (prune step)
            result ← append(result, c)
    return result
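The pseudocode above maps almost line for line onto Python. The sketch below (our own minimal implementation over a toy transaction list, not an optimized library routine) generates all frequent itemsets for a given minimum support count:

from itertools import combinations

def apriori(transactions, min_sup):
    # Return {itemset: support_count} for all frequent itemsets.
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def count(cands):
        # support counting pass over the transaction set
        return {c: sum(c <= t for t in transactions) for c in cands}

    # F(1): frequent 1-itemsets
    F = {c: n for c, n in count({frozenset([i]) for i in items}).items()
         if n >= min_sup}
    result, k = dict(F), 2
    while F:
        prev = set(F)
        # Join step: F(k-1) * F(k-1), then prune by the Apriori property
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        F = {c: n for c, n in count(cands).items() if n >= min_sup}
        result.update(F)
        k += 1
    return result

T = [{"milk", "bread", "cookies"}, {"milk", "bread"},
     {"bread", "butter"}, {"milk", "bread", "butter"}]
print(apriori(T, min_sup=2))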
Once the frequent itemsets F(k) are generated from set of transactions T, next step is to generate
strong association rules from them. Recall that strong association rules satisfy both minimum
support as well as minimum confidence. Theoretically, confidence is defined as the ratio of
frequency of transactions containing items x and y to the frequency of transactions that contained
item x.
Based on this definition, for each frequent itemset f, generate all rules A => (f − A) for every non-empty proper subset A of f, and retain a rule only if
support(f) / support(A) ≥ min_conf
Consider the set of transactions and the generated frequent itemsets described in section 14.4.2. One of the frequent itemsets f belonging to the set F(3) is:
where all itemsets are sorted in lexicographic order. The generated association rules with their
confidence scores are stated:
Rule Confidence
Depending on the minimum confidence score, the obtained association rules are preserved and the rest are pruned.
Ques 2: For the following given Transaction Data-set, Generate Rules using Apriori Algorithm. Consider
the values as Support=15% and Confidence=45%
Transaction Set of
Id Items
T1 A, B, C
T2 B, D
T3 B, E
T4 A, B, D
T5 A, E
T6 B, E
T7 A, E
T8 A, B, C, E
T9 A, B, E
Although there has been a significant reduction in the number of candidate itemsets, the Apriori algorithm can still be slow due to its repeated scans of the entire transaction set. The Frequent Pattern growth method, also called FP growth, instead employs a divide-and-conquer strategy to generate frequent itemsets.
Working of the FP growth algorithm:
1. Create the Frequent Pattern tree, or FP-tree, by compressing the transaction database. Along with preserving the information about the itemsets, the tree structure also retains the associations among the itemsets.
2. Divide the compressed database into a set of conditional databases, each associated with one frequent item or "pattern fragment", and examine each database separately.
3. For each "pattern fragment", examine only its associated itemsets. This approach can substantially reduce the size of the itemsets to be searched, while tracking the "growth" of patterns.
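In practice, ready-made implementations exist; for instance, if the third-party mlxtend library is installed, FP growth can be run in a few lines (the transaction list below is a made-up example, not the one from this unit):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["bread", "cookies"],
                ["bread", "butter", "milk"]]

te = TransactionEncoder()
onehot = te.fit_transform(transactions)            # binary representation
df = pd.DataFrame(onehot, columns=te.columns_)

# Frequent itemsets with support >= 50%
print(fpgrowth(df, min_support=0.5, use_colnames=True))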
We shall be representing each item with its S.No. in the description below. The items are sorted in
descending order of their support score as given below.
iv) Examining the last transaction T4 generates the final FP tree, as shown in Figure 10.
v) For each item, find its Conditional Pattern Base, i.e. the prefix paths in the FP tree leading to that item, together with their counts:
I4: {I1, I2}: 1
I2: {I1}: 3
I1: (no prefix; I1 lies at the root of the paths)
vi) For each item, build the Conditional FP tree by taking the set of elements common to all paths in the Conditional Pattern Base of that item. Further, calculate its support score by summing the support scores of all the paths in the Conditional Pattern Base:
I2: Conditional Pattern Base {I1}: 3, Conditional FP tree {I1}: 3
vii) The frequent patterns generated are: {I1, I2, I5}, i.e. {Bread, Butter, Milk}; {I1, I3}, i.e. {Bread, Cookies}; and {I1, I2}, i.e. {Bread, Butter}.
Ques 1: Generate FP Tree for the following Transaction Dataset, with minimum support 40%.
T6 I4, I2
T7 I1, I4, I5
T8 I2, I3
Ques 2: Find the frequent itemset using FP Tree, from the given set of Transaction Dataset with
minimum support score as 2.
T2 Apple, Carrot
T3 Apple, Peas
T6 Apple, Peas
T7 Banana, Peas
T8 Banana, Apple,
Peas, Tea
In the Apriori algorithm, the computation begins from the smallest frequent itemsets and proceeds until it reaches the maximum frequent itemset. But the algorithm's efficiency decreases as it passes through many iterations. A solution to this problem is to generate the frequent itemsets using a bottom-up and a top-down approach together. Thus, the Pincer search algorithm works bidirectionally: it attempts to find the frequent itemsets in a bottom-up manner but, at the same time, maintains a list of maximal frequent itemsets. While iterating over the transaction set, it computes the support scores of the candidate maximal frequent itemsets to record the frequent itemsets. This way, the algorithm outputs the subsets of the maximal frequent itemsets and, hence, need not maintain their support scores in the next iteration. Thus, Pincer search is advantageous over the Apriori algorithm when the frequent itemsets are long.
In each iteration, while computing the support counts of candidates in the bottom-up direction, the algorithm also computes the support scores of some itemsets in the top-down direction. These itemsets are called the Maximal Frequent Candidate Set (MFCS). Consider an itemset x ∈ MFCS, with cardinality greater than k, found frequent in the kth iteration. Then all the subsets of x must also be frequent, and all these subset itemsets can be pruned from the candidate sets in the bottom-up direction, increasing the algorithm's efficiency. Similarly, when the algorithm finds an infrequent itemset in the bottom-up direction, the MFCS is updated in order to remove these infrequent itemsets.
The MFCS is initialized to a singleton: the itemset of cardinality n that contains all the items of the transaction set. If some m infrequent 1-itemsets are observed after the first iteration, then these items are pruned, thereby reducing the cardinality of the element to n − m. This is an efficient algorithm, as it may arrive at a maximal frequent set within a few iterations.
Ques 1: What is the advantage of using the Pincer algorithm over the Apriori algorithm?
14.7 Summary
This unit presented the concepts of association rules. In this unit, we briefly described about
what are these association rules and how various models help to analyse data for patterns, or co-
occurrences, in a transaction database. These rules are created by finding the frequent itemsets in
the database using the support and the confidence. Support indicates the frequency of occurrence of an itemset in the entire transaction database; a high support score indicates a popular itemset among the users. Confidence of a rule indicates the rule's reliability; a higher confidence indicates that the consequent occurs more often in the transactions containing the antecedent.
A typical example that has been used to study association rules is market basket analysis.
The unit presented different concepts like frequent itemsets, closed items and maximal itemsets.
Various algorithms viz, Apriori algorithm, FP tree and Pincer algorithm are discussed in detail in
this unit, to generate the association rules by identifying the relationship between the items that
are often purchased or occurred together.
Ques 1
a) Binary Representation of the Transaction set:
Transaction X Y Z
{X, Y} 1 1 0
{X, Y, Z} 1 1 1
{X, Z} 1 0 1
{Z} 0 0 1
b) {A} is not closed as frequency of {A, E} (superset of {A}) is same as frequency of {A} =
4.
c) {C} is not closed as frequency of {C, E} (superset of {C}) is same as frequency of {C} =
5.
d) {D} and {E} are closed as they do not have any superset with same frequency, but {D}
and {E} are not maximal as they both have frequent supersets as {C, D} and {C, E}
respectively.
e) {A, C} is not closed as frequency of its superset {A, C, E} is same as that of {A, C}.
f) {A, D} is not closed as frequency of its superset {A, D, E} is same as that of {A, D}.
g) {A, E} is closed as it has no superset with same frequency. But {A, E} is not maximal as
it has a frequent superset i.e {A, D, E}.
h) {C, D} is not closed due to its superset {C, D, E}.
i) {C, E} is closed but not maximal due to its frequent superset {C, D, E}.
j) {D, E} is not closed due to its superset {C, D, E}.
k) {A, C, E} is a maximal itemset as it has no frequent super itemset. It is also closed as it
has no superset with same frequency.
l) {A, D, E} is also a maximal and closed itemset.
Step ii) Remove the items whose support is less than the minimum support (=50%)
Step iii) Form the two-item candidate set and find their frequency and support.
Item Frequency Support %
Pen, Notebook 2 2/5 = 40%
Pen, Colours 3 3/5 = 60%
Pen, Eraser 2 2/5 = 40%
Notebook, Colours 3 3/5 = 60%
Notebook, Eraser 1 1/5 = 20%
Colours, Eraser 2 2/5 = 40%
Step (iv) Remove the pair of items whose support is less than the minimum support
Ques 2
Given set of transactions as:
T1 A, B, C
T2 B, D
T3 B, E
T4 A, B, D
T5 A, E
T6 B, E
T7 A, E
T8 A, B, C, E
T9 A, B, E
Step i) Find the frequency and support of each item in the transaction set, where
support = frequency of item / number of transactions
A, B 4 4/9 = 44.44%
A, C 2 2/9 = 22.22%
A, D 1 1/9 = 11.11%
A, E 3 3/9 = 33.33%
B, C 6 6/9 = 66.67%
B, D 2 2/9 = 22.22%
B, E 4 4/9 = 44.44%
C, D 0 0
C, E 1 1/9 = 11.11%
D, E 0 0
Step iii) Prune the two-itemsets whose support is less than the minimum support (15%)
Item Frequency Support
A, B 4 4/9 = 44.44%
A, C 2 2/9 = 22.22%
A, E 3 3/9 = 33.33%
B, C 6 6/9 = 66.67%
B, D 2 2/9 = 22.22%
B, E 4 4/9 = 44.44%
Step iv) Now, find the three-itemsets and their support scores.
Pruning the three-itemsets whose support is less than the minimum support, we get:
Item Frequency Support
A, B, C 2 2/9 = 22.22%
A, B, E 2 2/9 = 22.22%
Step v) Generate the rules from each three itemset and compute the confidence of
each rule as:
Confidence(x->y) = support(x∪y)/support(x)
For three itemset (A, B, C):
Rule Confidence
(A, B)->C (2/9) / (4/9) = 1/2 = 50%
(A, C) -> B (2/9) / (2/9) = 1 = 100%
(B, C) -> A (2/9) / (2/9) = 1 = 100%
A -> (B, C) (2/9) / (6/9) = 2/6 = 33.33%
B -> (A, C) (2/9) / (7/9) = 2/7 = 28.57%
C -> (A, B) (2/9) / (2/9) = 1 = 100%
From the generated set of rules, the rules with confidence greater than or equal to the
minimum confidence (45%) are the valid rules.
i) (A, B) -> C
ii) (B, C) -> A
iii) (A, C) -> B
iv) C -> (A, B)
Step i) From the given transaction set, find the frequency of each item and
arrange them in decreasing order of their frequencies.
T8 I2, I3 I2, I3
Step iii) Create the FP tree now.
Create a NULL node.
From ordered items in T1, we get FP tree as:
Follow transaction T6. This leads to increase in count of I2 and I4 as the path in
tree already exists.
Follow Transaction T7
Finally follow transaction T8
As the minimum support given is 3, all the itemsets with support score greater than or equal to 3 form the frequent itemsets:
{Apple, Banana}, {Apple, Peas}, and {Banana, Peas}
Structure
15.1 Introduction to Clustering
15.2 Types of Clustering
15.3 Partition Based
15.4 Hierarchical Based
15.5 Density Based Clustering Techniques
15.6 Clustering Algorithms
K-Means,
Agglomerative and Divisive,
DBSCAN,
Introduction to Fuzzy Clustering
15.7 Summary
15.8 Solutions to Check Your Progress
15.1 INTRODUCTION
Clustering or cluster analysis is a method for dividing a group of data objects into subgroups based on observed similarities. Here, each cluster is a subset of the data in which objects with resemblance or similar properties are grouped: objects are similar to each other within a cluster but differ from the objects in other clusters. In simple terms, it is the effort of dividing a population into multiple groups so that the data points of one group can easily be distinguished from the data points of another group; that is, to separate the groups with identical features and assign them into clusters. This process is performed by machines, not by humans, and is known as unsupervised learning, because clustering is a form of learning by observation. Clustering is often confused with classification in data analysis: in classification the separation of data happens based on class labels, while in clustering the partitioning of large data sets into groups is based on similarity.
Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Certainly not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.
“Clustering is the process of dividing the entire data into groups (also known as clusters) based on the
patterns in the data.”
Data clustering is a prominent technique in data analysis and is applied in various research areas, including data mining, statistics and machine learning. It is also applied in the world of financial services, health information systems, web mining and many more. Cluster analysis is an active area of research in data analysis due to the massive volumes of data produced in databases. An example of clustering is outlier detection, where credit card fraud and criminal activities are monitored. Clustering can be used in image recognition to find clusters or patterns in image or text recognition systems. Clustering has a lot of uses in web search as well: due to the enormous quantity of online pages, a keyword search may frequently produce a huge number of hits (i.e., pages relevant to the search), and clustering can be used to organize the results. Thus, clustering is a very promising machine learning process and has proved to be one of the most pragmatic data mining tools. In this unit, you will learn about the various types of clustering techniques and algorithms.
Fig 1. Clustering
A bank can potentially have millions of customers. Would it make sense to look at the details of
each customer separately and then decide? Certainly not! It is a manual process and will take a
huge amount of time.
So, what can the bank do? Clustering comes to the rescue in these situations where the banks can
group the customers based on their income, as shown:
Applications of Clustering in Real-World Scenarios
Clustering is a widely used technique in the industry. It is being used in almost every domain,
ranging from banking to recommendation engines, document clustering to image segmentation.
Customer Segmentation
Customer segmentation is one of the most common applications of clustering. And it isn’t just
limited to banking. This strategy is across functions, including telecom, e-commerce, sports,
advertising, sales, etc.
Document Clustering
Document Clustering is another common application of clustering. Let’s assume that you have
multiple documents, and you need to cluster similar documents together. Clustering helps us
group these documents such that similar documents are in the same clusters.
Image Segmentation
We can also use clustering to perform image segmentation. Here, we try to club similar pixels in
the image together. We can apply clustering to create clusters having similar pixels in the same
group.
There are many clustering algorithms in the literature. Traditionally, documents are grouped based on how similar they are to other documents. Similarity-based algorithms define a function for computing document similarity and use it as the basis for assigning documents to clusters. Each cluster should have data that are comparable to one another but different from those in other clusters. Clustering algorithms fall into different categories based on the underlying methodology of the algorithm (agglomerative or partitional), the structure of the final solution (flat or hierarchical), or whether they are density based. All the above-mentioned clustering types are discussed in detail in the rest of this unit.
The partition algorithm is one of the most applied clustering algorithms. It has been widely used in many applications due to its simple structure and easy implementation compared to other clustering algorithms. This clustering method classifies the information into multiple groups based on the characteristics and similarities of the data. In the partitioning method, when a database (D) contains multiple (N) objects, the partitioning approach divides the data into a user-specified number (K) of partitions, each of which represents a cluster covering a specific region. That is, it divides the data into K groups or partitions in such a manner that each group has at least one object from the existing data. To put it another way, it splits the data items into non-overlapping subsets (clusters) so that each data object fits into exactly one of them.
Partitioning approaches require a set of starting seeds (or clusters), which are then improved iteratively by transferring objects from one group to another to improve the partitioning. Objects in the same cluster are "near" or related to one another, whereas objects in different clusters are "far apart" or significantly distinct, according to the most prevalent notion of good partitioning. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications).
Hierarchical clustering analysis is an algorithm that groups data points with similar properties; these groups are termed "clusters". As a result of hierarchical clustering, we get a set of clusters, and these clusters are always distinct from each other. Hierarchical clustering creates the clusters in a proper order (or hierarchy).
Example:
The most common everyday example is how we order our files and folders on our computer in a proper hierarchy.
As mentioned, hierarchical clustering is classified into two types, i.e. Agglomerative clustering (AGNES) and Divisive clustering (DIANA).
[Figure: hierarchical clustering branches into Agglomerative (AGNES) and Divisive (DIANA)]
Qn2. What are sub-types of Hierarchical Clustering and what is the difference between them?
Clusters are formed in the density-based clustering technique based on the density of the data points represented in the data space. Regions that become dense as a result of the large number of data points residing there are termed clusters.
How it works:
1. It starts with a random unvisited starting data point. All points within a distance Ɛ (Epsilon) classify as neighbourhood points.
2. A minimum number of points is required within the neighbourhood to start the clustering process. If it is met, the current data point becomes the first point of the cluster; otherwise, the point is regarded as 'Noise'. In either case, the current point becomes a visited point.
3. All points within the distance Ɛ become part of the same cluster. This process is repeated for all the newly added data points in the cluster group.
4. Continue with the process until you visit and label each point within the Ɛ neighbourhood of the cluster.
5. On completion of the process, start again with a new unvisited point, thereby leading to the discovery of more clusters or noise. At the end of the process, ensure that each point is marked as either cluster or noise.
a) K-Means
Among clustering algorithms, the k-means clustering technique is an algorithm that tries to minimize the distance of the points in a cluster from their centroid.
K-means is a centroid-based algorithm. It can also be called a distance-based technique, where distances between points are calculated to allocate a point to a cluster. Each cluster in K-Means is paired with a centroid.
The K-Means algorithm's main goal is to reduce the sum of distances between the points and their corresponding cluster centroids.
In figure 5 the data is presented in two stages. The first figure shows the data in its raw stage, not yet clustered by k-means; here all types of data are clubbed together, and it is thus impossible to differentiate them into their original categories. The second figure shows three clusters of three different colours: red, green and blue. These clusters, formed after applying k-means clustering, show the data in three different categories.
The k-means clustering technique is a powerful clustering algorithm for grouping similar types of data into groups, the so-called clusters. It is the simplest and most commonly used technique in machine learning, extensively used for data analysis. It can easily locate the similarities between different data items and group them into clusters. The working of the k-means clustering algorithm proceeds in the following three steps:
1. Selection of k values.
2. Initialization of the centroids.
3. Selection of the cluster and finding the average.
Let us understand the above steps with the help of Fig 6. As an example of the distance computation involved, the Manhattan distance between the points (2, 10) and (6, 8) is
|6 − 2| + |8 − 10| = 4 + 2 = 6,
while the distance of a point from itself is 0.
Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3
clusters: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
The distance matrix based on the Euclidean distance is given below:
Assume that the first seeds (cluster centres) are A1, A4, and A7. Run the k-means method for one epoch only. Show: a) the new clusters (i.e., the examples belonging to each cluster); b) the new clusters' centres at the end of this epoch; c) a 10 by 10 space with all 8 points, illustrating the clusters and the new centroids after the first epoch; d) how many more iterations are needed to reach convergence? Draw the result for each epoch.
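A compact way to check such an exercise is to run one epoch of k-means yourself. The NumPy sketch below (our own illustration, not part of the official solution) performs exactly one assign-and-update iteration on the eight points with the given seeds:

import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centroids = points[[0, 3, 6]].copy()     # seeds A1, A4, A7

# One epoch of k-means: assign each point to its nearest centroid
# (Euclidean distance), then recompute the centroids as cluster means.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(3)])

print(labels)          # cluster of each of A1..A8 after the first epoch
print(new_centroids)   # the new cluster centres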
b) Agglomerative clustering
In this type of clustering, the hierarchical decomposition is done with the help of a bottom-up strategy: it starts by creating atomic (small) clusters by adding one data object at a time and then merges them together to form one big cluster at the end, which meets all the termination conditions. The procedure is iterated until all the data points are brought under one single big cluster:
1. Treat each data point as a single cluster.
2. Merge the two closest clusters.
3. Repeat until only one cluster remains.
AGNES (Agglomerative Nesting) is a type of agglomerative clustering that combines the data objects into a cluster based on similarity. The result of this algorithm is a tree-based structure called a dendrogram. It uses distance metrics to decide which data points should be combined with which cluster: basically, it constructs a distance matrix, checks for the pair of clusters with the smallest distance, and combines them. Figure 7, given below, shows a dendrogram.
Fig 7. Agglomerative clustering
We begin at the bottom with 25 data points, each of which is assigned to a different cluster. The two nearest clusters are then merged, repeatedly, until only one cluster remains at the topmost position. The height in the dendrogram at which two clusters are merged represents the distance between those two clusters in the data space. Based on how the distance between each cluster is measured, we can have three different methods:
Single linkage: the shortest distance between two points in each cluster is taken as the distance between the clusters.
Complete linkage: the longest distance between the points of each cluster is taken as the distance between the clusters.
Average linkage: the average of the distances from each point in one cluster to every point in the other is taken.
AGNES has a few limitations. One is that it has a time complexity of at least O(n²), hence it does not scale well. Another major drawback is that whatever has been done can never be undone: if we incorrectly group any cluster in an earlier stage of the algorithm, then we will not be able to change the outcome or modify it. But the algorithm also has a bright side: since many smaller clusters are formed, it can be helpful in the process of discovery, and it produces an ordering of objects that is very helpful in visualization.
Problem-03: Hierarchical clustering (to be done on your own time). Use single-link, complete-link and average-link agglomerative clustering, as well as medoid- and centroid-based clustering, to cluster the following 8 examples: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). The distance matrix is the same as the one in the previous exercise. Show the dendrograms.
Solution:
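The dendrograms themselves are left for you to draw; one convenient way to produce or check them is SciPy's hierarchical clustering utilities. A minimal sketch (SciPy and Matplotlib are assumptions; change the method parameter to 'single', 'complete' or 'average' for the three linkages):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]])

# method can be 'single', 'complete' or 'average'
Z = linkage(points, method='single', metric='euclidean')
dendrogram(Z, labels=['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8'])
plt.show()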
c) Divisive Clustering (DIANA)
DIANA basically stands for Divisive Analysis; this is another type of hierarchical clustering which works on the principle of the top-down approach (the inverse of AGNES): the algorithm begins by forming one big cluster and recursively divides the most dissimilar cluster into two, and this goes on until all the similar data points belong to their respective clusters. These divisive algorithms can produce more accurate hierarchies than the agglomerative approach, but they are computationally expensive.
Fig 8. Divisive clustering step by step process
The distance between objects is used to cluster things in most partitioning methods. Such algorithms can only locate spherical-shaped clusters and have trouble finding clusters of other forms. DBSCAN can form clusters of different shapes; this type of algorithm is most suitable when the dataset contains noise or outliers. It depends on a density-based concept of cluster: a cluster is defined as a maximal densely connected set of points.
Other clustering algorithms based on the concept of density have been developed. Their basic idea is to keep growing a cluster as long as the density of data points in the "neighbourhood" exceeds a certain threshold.
For instance, every data point present in a given cluster should meet the requirement of having
minimum number of points in the neighborhood of a given radius. This type of method is
extremely useful for detecting outliers or to filter out noise. This is also useful for finding
clusters of arbitrary shape.
The best part in density-based methods is that they can divide a set of objects into multiple
exclusive clusters, or a hierarchy of clusters. Density-based techniques often only evaluate
exclusive clusters and ignore fuzzy clusters. Furthermore, density-based clustering algorithms
can be extended from entire space to subspace.
If a point has more than a given number of points (MinPts) within distance Eps, it is considered a core point.
▪ Such points lie in a dense region, in the interior of a cluster.
A border point has fewer points than MinPts within Eps, but lies close to a core point.
Any point that is neither a core point nor a border point is referred to as a noise point.
DBSCAN Algorithm
1. Label the points as core, border and noise points.
2. Eliminate the noise points.
3. For every core point p that has not yet been assigned to a cluster, create a new cluster with p and all the points that are density-connected to p.
4. Assign the border points to the cluster of the closest core point.
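If scikit-learn is available, DBSCAN can be applied directly; the eps and min_samples parameters below correspond to Eps and MinPts, and the data is a made-up example with two dense groups and one stray point:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],     # dense group 1
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],     # dense group 2
              [4.5, 4.5]])                             # stray point

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; the label -1 marks noise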
e) Fuzzy Clustering
Fuzzy clustering is one of the most widely used clustering approaches in the real world. It is a sort of clustering in which each data point can be assigned to multiple clusters: the algorithm allows data points to belong to more than one cluster, unlike hard clustering, where each data point can belong to only one cluster.
Illustrative Example
A Guava can either be Yellow or Green (hard clustering), but a Guava can also be Yellow and Green at the same time (this is fuzzy clustering). Here, the Guava can be Yellow to a certain degree as well as Green to a certain degree. Instead of the Guava belonging to Green [green = 1] and not Yellow [yellow = 0], the Guava can belong to Green [green = 0.5] and Yellow [yellow = 0.5]. These values are normalized between 0 and 1; however, they do not represent probabilities, so the two values do not need to add up to 1.
Given a finite set of data points r1, ..., rn, the algorithm returns a list of cluster centres F = {f1, ..., fc} and a partition matrix W = (wij), where wij is the degree to which point ri belongs to cluster fj. The centres are chosen to minimize
argmin over F: Σi Σj (wij)^m ∥ri − fj∥²,
where:
wij = 1 / Σk ( ∥ri − fj∥ / ∥ri − fk∥ )^(2/(m−1))
and m > 1 is the fuzziness parameter.
Illustrative Example
To better understand this principle, a classic example of mono-dimensional data is given below
on an x axis.
This data set can be conventionally grouped into two clusters, and this is done by selecting a
threshold on the x-axis.
The resulting two clusters are labelled 'A' and 'B', as seen in the image below. Each point belonging to the data set would therefore have a membership coefficient of either 0 or 1. The membership coefficient of each data point is represented on the y-axis.
In fuzzy clustering, each data point can have membership in multiple clusters. By relaxing the definition of membership coefficients from strictly 0 or 1, these values can range anywhere from 0 to 1 inclusive. The following image shows the data set from the previous clustering, but now with fuzzy c-means clustering applied. First, a new threshold value defining two clusters is generated. Next, new membership coefficients for each data point are generated based on the cluster centroids and the distance from each cluster centroid. As we can see, the middle data point belongs to both cluster A and cluster B; the value of 0.3 is this data point's membership coefficient for cluster A.
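The update equations above can be implemented in a few lines. The sketch below (our own minimal fuzzy c-means loop on made-up one-dimensional data, with fuzzifier m = 2) alternates between recomputing the cluster centres and the membership matrix:

import numpy as np

# Made-up mono-dimensional data and two clusters (assumptions for illustration)
r = np.array([1.0, 1.2, 1.4, 4.6, 4.8, 5.0, 3.0])
c, m = 2, 2.0
rng = np.random.default_rng(0)
w = rng.random((len(r), c))
w /= w.sum(axis=1, keepdims=True)        # memberships of each point sum to 1

for _ in range(100):
    f = (w ** m).T @ r / (w ** m).sum(axis=0)          # cluster centres
    d = np.abs(r[:, None] - f[None, :]) + 1e-12        # distances ||ri - fj||
    w = 1.0 / (d ** (2 / (m - 1)))
    w /= w.sum(axis=1, keepdims=True)                  # wij update formula

print(f)           # two cluster centres, roughly near 1.2 and 4.8 (in some order)
print(w.round(2))  # the middle point (3.0) has partial membership in both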
Qn2. What is Agglomerative Clustering? What are 3 methods that can be used for Agglomerative
Clustering?
Qn3. Briefly explain fuzzy clustering and provide a Real-World Example for the same.
15.7 SUMMARY
The technique of separating a set of data objects into subgroups based on some observation is known as clustering or cluster analysis. Each subset is referred to as a 'cluster', with objects that are related to one another but different from those in other clusters. In simple terms, it is the process of separating a population or set of data points into several groups so that the data points from one group can easily be distinguished from the data points of other groups. Clustering is often confused with classification in data analysis: in classification the separation of data happens on the basis of class labels, while in clustering the partitioning of large data sets into groups happens on the basis of similarity.
Clustering algorithms fall into different categories based on the underlying methodology of the algorithm, the structure of the final solution, or whether they are density based.
In the density-based clustering technique, the clusters are created based upon the density of the data points represented in the data space.
K-Means: Among clustering algorithms, the k-means clustering technique is one that tries to minimize the distance of the points in a cluster from their centroid. The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroids.
AGNES is a type of agglomerative clustering that combines the data objects into a cluster based on
similarity.
Fuzzy Clustering is the most widely used clustering technique in practice, and it is a sort of clustering in
which each data point belongs to many clusters.
Clustering is a widely used technique in the industry. It is being used in almost every domain, ranging
from banking to recommendation engines, document clustering to image segmentation.
Customer Segmentation
Document Clustering
Image Segmentation
PAM (K-Medoids),
Space complexity: When the number of data points is large, the space required by the hierarchical clustering technique is large, since the similarity matrix must be stored in RAM. The space complexity is of the order of n².
Time complexity: The time complexity is also very high, because we have to execute n iterations and update and restore the similarity matrix in each iteration. The time complexity is of the order of n³.
Answer 2.
Refer to Page 8
Check Your Progress - 5
For detailed answers refer to Section 15.6.
Answer 1.
a) K-Means
Among clustering algorithms, the k-means clustering technique is an algorithm that tries to minimize the distance of the points in a cluster from their centroid.
K-means is a centroid-based or distance-based technique in which the distances between points are calculated to allocate a point to a cluster. Each cluster in K-Means is paired with a centroid.
b) AGNES (Agglomerative Nesting)
AGNES is a type of agglomerative clustering that combines the data objects into a cluster based
on similarity. The result of this algorithm is a tree-based structure called Dendrogram. Here it
uses the distance metrics to decide which data points should be combined with which cluster.
Basically, it constructs a distance matrix and checks for the pair of clusters with the smallest
distance and combines them.
c) Divisive Clustering (DIANA)
DIANA basically stands for Divisive Analysis; this is another type of hierarchical clustering
which works on the principle of a top-down approach (the inverse of AGNES). The algorithm
begins by forming one big cluster and recursively divides the most dissimilar cluster into two,
continuing until each observation forms its own cluster.
d) Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Most partitioning methods cluster objects based on the distance between them. Such
approaches can find only spherical-shaped clusters and have difficulty finding clusters of
arbitrary shape. DBSCAN can create clusters of various shapes, and this technique is best
suited to datasets containing noise or outliers.
Fuzzy Clustering
The most widely used clustering technique in the real world is fuzzy clustering, which is a sort of
clustering in which each data point can belong to more than one cluster. The algorithm allows a
data point to belong to more than one cluster, unlike hard clustering, where a data point can
belong to only one cluster.
Answer 2.
a) AGNES (AGglomerativeNESting)
AGNES is a type of agglomerative clustering that combines the data objects into a cluster based on
similarity. The result of this algorithm is a tree-based structure called Dendrogram. Here it uses the
distance metrics to decide which data points should be combined with which cluster. Basically, it
constructs a distance matrix and checks for the pair of clusters with the smallest distance and combines
them.
Based on how the distance between each cluster is measured, we can have 3 different methods (a code sketch follows this list):
Single linkage: Where the shortest distance between the two points in each cluster is defined as
the distance between the clusters.
Complete linkage: In this case, we will consider the longest distance between each cluster’s
points as the distance between the clusters.
Average linkage: In this situation, we'll take the average of the distances from each point in one
cluster to every point in the other cluster as the distance between the clusters.
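As a minimal sketch of these three linkage options, assuming the SciPy library is available, the following uses a small, made-up set of 2-D points:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])   # made-up 2-D data

# method may be 'single', 'complete' or 'average', i.e. the three linkages above
Z = linkage(X, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")       # cut the dendrogram into two clusters
print(labels)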
Answer 3.
The most widely used clustering technique in the real world is fuzzy clustering, which is a type of
clustering in which each object can belong to more than one cluster, unlike hard clustering, where a data
point can belong to only one cluster.
Real-World Example:
A Guava can be either Yellow or Green (hard clustering), but a Guava can also be both Yellow and Green
(this is fuzzy clustering). Here, the Guava is Yellow to a certain degree as well as Green to a certain
degree. Instead of the Guava belonging to Green [Green = 1] and not Yellow [Yellow = 0], the Guava can
belong to Green [Green = 0.5] and Yellow [Yellow = 0.5]. These values are normalized between 0 and 1;
however, they do not represent probabilities, so the two values do not need to add up to 1.
Answer 4:
DIANA is like the reverse of AGNES. It begins with the root, in which all observations are included in a
single cluster. At each step of the algorithm, the current cluster is split into two clusters that are
considered most heterogeneous. The process is iterated until all observations are in their own cluster.
Answer 5:
DBSCAN can form clusters in different shapes; this type of algorithm is most suitable when the dataset
contains noise or outliers.
Also, it relies on a density-based notion of cluster: a cluster is defined as a maximal set of
density-connected points.
Answer 6:
Refer Page 10
MULTIPLE CHOICE QUESTIONS
Q1. The objective of clustering is to:
A. Sort the data points into categories
B. Classify the objects into different classes
C. Predict the values of input data points and generate output
D. All of the above
Solution: (A)
Q2. Clustering is a form of:
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None
Solution:(B)
Q3. Which of the following clustering algorithms is the most sensitive to outliers?
A. K-Means clustering algorithm
B. K-Medians clustering algorithm
C. K-Modes clustering algorithm
D. K-Medoids clustering algorithm
Solution: (A)
Explanation: The K-Means clustering approach, which employs the mean of cluster data points to locate
the cluster center, is the most sensitive to outliers of all the options.
Q4. You saw the dendrogram below after doing K-Means clustering analysis on a dataset. Which
of the following conclusions can be drawn from the dendrogram?
A. There were 32 data points in the clustering analysis
B. The best number of clusters for the data points analysed is 4
C. The proximity function used here is average-link clustering
D. The above interpretation of a dendrogram is not possible for K-Means clustering analysis
Solution: (D)
Explanation: A dendrogram is not possible for K-Means clustering analysis. However, one can create a
clustergram based on K-Means clustering analysis.
Q5. What are the two types of Hierarchical Clustering?
A. Top-Down Clustering (Divisive)
B. Bottom-Up Clustering (Agglomerative)
C. Dendrogram
D. K-means
Solution: Both A & B
Q6. Is it reasonable to assume that two K-Means clustering runs will produce the same clustering results?
A. Yes
B. No
Solution: (B)
Explanation: The K-Means clustering technique converges to a local minimum, which may or may not
coincide with the global minimum. As a result, it is a good idea to run the K-Means method
several times before drawing any conclusions about the clusters.
It's worth noting, though, that by using the same seed value for each run, you may get the same clustering
results via K-Means. However, this is accomplished simply by instructing the algorithm to select the same
set of random numbers for each iteration.
Q7. Which of the following clustering techniques has a difficulty with local optima convergence?
A. Agglomerative clustering algorithm
B. K- Means clustering algorithm
C. Expectation-Maximization clustering algorithm
D. Diverse clustering algorithm
Options:
A. A only
B. B and C
C. B and D
D. A and D
Solution: (B)
Out of the options given, only the K-Means clustering algorithm and the EM clustering algorithm have the
drawback of converging to local minima.
Q8. Which of the following is a bad characteristic of a dataset for clustering analysis?
A. Objects with outliers
B. Objects with different densities
C. Objects with non-convex shapes
D. All of the above
Solution: (D)
16.0 INTRODUCTION
In this unit we will see the implementation of various machine learning algorithms
learned in this course. To understand the code you need an understanding of the
respective machine learning algorithms; along with that, an understanding of Python
programming is a must. The code makes use of various libraries of the Python
programming language, viz. Scikit-learn, Matplotlib, NumPy etc.; you can execute
the code through any of the Python programming tools. Most of the machine
learning algorithms you learned in this course are implemented here; just try to
execute them and analyse the results.
16.1 OBJECTIVES
The starting units of this course primarily focused on the various classification
algorithms, viz. Naïve Bayes classifiers, K-Nearest Neighbour (K-NN), Decision
Trees, Logistic Regression and Support Vector Machines. The theoretical aspects of
the same are already discussed in the respective units; now we will see the
implementation part of the mentioned classifiers in the Python programming language.
16.2.2 K-Nearest Neighbour (K-NN) Implementation
You have already discussed this classifier in detail in Block 3 Unit 10 of this
course; you may refer to Block 3 Unit 10 to understand the concept.
We learned that, supposing the value of K is 3, the KNN algorithm starts by
calculating the distance of a query point X from all the points in the training data.
It then finds the 3 nearest points with the least distance to point X, and assigns X
the class that is most common among those neighbours.
Now, in this section, we will see how Python's Scikit-Learn library can be
used to implement the KNN algorithm.
Implementation code in Python
The screenshot of the executed code is given below
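A minimal runnable sketch of such an implementation is given here; the built-in Iris dataset of scikit-learn is assumed as illustrative sample data, so the executed code in the screenshot may differ in dataset and parameters.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # sample data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()                              # K-NN is distance based, so scale the features
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=3)              # K = 3, as in the example above
knn.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))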
16.2.3 Decision Tree Implementation
A decision tree is a type of supervised machine learning algorithm that may be
used for both regression and classification tasks. It is one of the most popular
and widely used machine learning techniques.
In this case, the decision tree method creates a node for each attribute present
in the dataset, with the attribute considered to be the most significant placed at
the top of the tree. When we first get started, we treat the entire training set as
the root. The feature values need to be categorical; if the values are continuous,
they are discretized before the model is built. Records are then distributed
recursively according to their attribute values. A statistical method (for example,
information gain) is utilised to determine which attributes should be placed at
the tree's root and which should be placed at the internal nodes.
You have already discussed this classifier in detail in Block 3 Unit 10 of this
course; you may refer to Block 3 Unit 10 to understand the concept.
Implementation code in Python
The screenshot of the executed code is given below
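A minimal runnable sketch is given here; the built-in Iris dataset of scikit-learn and the entropy splitting criterion are illustrative assumptions, so the executed code in the screenshot may differ.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# entropy-based splits, i.e. information gain decides the root and internal nodes
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))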
16.2.4 Logistic Regression
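Logistic Regression is a supervised learning technique used for classification; it
models the probability that an observation belongs to a class by passing a linear
combination of the features through the sigmoid (logistic) function.
Implementation code in Python
A minimal runnable sketch is given below; the built-in breast-cancer dataset of scikit-learn is assumed here as illustrative sample data.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_breast_cancer(return_X_y=True)       # binary classification data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5000)          # extra iterations so the solver converges on raw features
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))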
16.2.5 Support Vector Machine
Support Vector Machine, more commonly referred to as SVM, is a supervised,
linear machine learning technique that is most frequently utilised for the
purpose of addressing classification problems; Support Vector Classification is
another name for SVM. In addition, there is a variant of SVM known as SVR,
which stands for Support Vector Regression; SVR applies similar concepts
to the problem-solving process when addressing regression problems. SVM also
offers a method known as the kernel method, also called the kernel
SVM, which enables us to deal with non-linearity.
The steps involved in the implementation are, broadly: load the dataset, split it
into training and test sets, fit the SVM classifier on the training set, predict on
the test set, and evaluate the results.
You have already discussed this classifier in detail in Block 3 Unit 10 of this
course; you may refer to Block 3 Unit 10 to understand the concept.
Implementation code in Python
The screenshot of the executed code is given below
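A minimal runnable sketch following the steps listed above is given here; the built-in Iris dataset of scikit-learn and the RBF kernel are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# step 1: load the dataset
X, y = load_iris(return_X_y=True)
# step 2: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# step 3: fit the classifier (kernel="rbf" gives a kernel SVM; "linear" gives a linear SVC)
svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train, y_train)
# steps 4 and 5: predict on the test set and evaluate
print("Accuracy:", accuracy_score(y_test, svm.predict(X_test)))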
16.3 REGRESSION
We learned about the basic concept of regression in the respective unit of this course;
in this unit we will implement Linear regression and Polynomial regression in the
Python language. Let's start with Linear regression.
Add up the individual costs that you have determined for each of the data points to
obtain the total cost of the model (for linear regression, the sum of squared errors).
You have already discussed this technique in detail in Block 3 Unit 11 of this
course; you may refer to Block 3 Unit 11 to understand the concept.
Implementation code in Python
The screenshot of the executed code is given below
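A minimal runnable sketch of Linear regression, followed by Polynomial regression, is given here; the small synthetic dataset (y roughly equal to 3x + 4 plus noise) is an illustrative assumption.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))             # one feature
y = 3 * X.ravel() + 4 + rng.normal(0, 1, 50)     # y is roughly 3x + 4 plus noise

lin = LinearRegression().fit(X, y)
print("slope:", lin.coef_[0], "intercept:", lin.intercept_)
# the total cost adds up the individual squared errors, as described above
print("MSE:", mean_squared_error(y, lin.predict(X)))

# Polynomial regression: expand the feature into powers, then fit a linear model
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
print("R^2 (degree 2):", poly.score(X_poly, y))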
16.4 FEATURE SELECTION AND EXTRACTION
Feature selection and extraction are one of the most important steps that must be
performed in order for machine learning to be successful. While we covered the
theoretical aspects of this process in the earlier units of this course, it is now time to
understand the implementation part of the mechanisms that we have learned for
Feature selection and extraction. Let's begin with dimensionality reduction, which is
the process of lowering the number of random variables that are being considered by
generating a set of primary variables. Dimensionality reduction may be seen of as a
way to streamline the analysis process.
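A minimal runnable sketch is given here; it shows feature selection with SelectKBest and feature extraction with PCA, assuming the built-in Iris dataset of scikit-learn as illustrative sample data.

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# feature selection: keep the 2 best features by the chi-squared score
X_selected = SelectKBest(chi2, k=2).fit_transform(X, y)
print("selected shape:", X_selected.shape)

# feature extraction: project the standardized data onto 2 principal components
X_scaled = StandardScaler().fit_transform(X)      # PCA is variance based, so standardize first
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)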
16.5 ASSOCIATION RULE MINING
We discussed the Apriori algorithm and the FP-Growth algorithm while studying the topic of
Association Rules. These algorithms are frequently used in frequent-pattern mining. Since FP-Growth
is a step ahead of the Apriori algorithm, we are discussing the implementation of the
Apriori algorithm only.
16.5.1 Apriori Algorithm
The Apriori algorithm is a data mining technique that is used for mining
frequent itemsets and the corresponding association rules; it works with
measures such as support and confidence. We focused on the definitions of
association rule mining and the Apriori algorithm, as well as the application of
the Apriori algorithm, in the relevant unit of this course. In this
section, we will construct one Apriori model utilising the Python
programming language and a hypothetical situation involving a small firm.
The algorithm does have some limitations, the effects of which can be mitigated
using a variety of different approaches. Data mining and pattern recognition
are two of the many applications in which the method sees widespread use.
The model described below produces the candidate set by merging the set of
frequent items from the previous step. It then tests the subsets and removes the
infrequent itemsets from the candidate set, and finally calculates the frequent
itemset by retaining the items that meet the minimum support requirement.
You have already discussed this algorithm in detail in Block 4 Unit 14 of this
course; you may refer to Block 4 Unit 14 to understand the concept.
Implementation code in Python
The screenshot of the executed code is given below
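A minimal runnable sketch is given here. It assumes the mlxtend library (a third-party library, not part of scikit-learn) for the Apriori implementation, and the small list of transactions is a made-up example standing in for the hypothetical small firm mentioned above.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# made-up transactions of a small firm
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "jam"],
    ["milk", "butter"],
    ["bread", "milk", "butter"],
    ["bread", "milk", "jam"],
]

te = TransactionEncoder()                                  # one-hot encode the transactions
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# keep only the itemsets that meet the minimum support requirement
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(frequent_itemsets)
print(rules[["antecedents", "consequents", "support", "confidence"]])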
16.6 CLUSTERING ALGORITHMS
You have already discussed these clustering techniques in detail in Block 4 Unit 15 of this
course; you may refer to Block 4 Unit 15 to understand the concepts.
Implementation code in Python
The screenshot of the executed code is given below
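A minimal runnable sketch is given here; it applies K-Means, one of the clustering techniques from Unit 15, to synthetic data generated with scikit-learn. The dataset and the choice of algorithm are illustrative assumptions.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # synthetic 2-D data with 4 groups

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

centers = kmeans.cluster_centers_
plt.scatter(X[:, 0], X[:, 1], c=labels, s=15)                  # points coloured by cluster
plt.scatter(centers[:, 0], centers[:, 1], marker="x", c="red") # cluster centroids
plt.title("K-Means clustering (K = 4)")
plt.show()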
Check Your Progress - 4
16.8 SUMMARY
16.9 FURTHER READINGS
● https://www.kaggle.com/
● https://www.github.com/
● https://towardsdatascience.com
● https://machinelearningmastery.com