MCS-224 Artificial Intelligence and Machine Learning

PROGRAMME DESIGN COMMITTEE

Prof. (Retd.) S.K. Gupta, IIT, Delhi
Prof. Ela Kumar, IGDTUW, Delhi
Prof. T.V. Vijay Kumar, JNU, New Delhi
Prof. Gayatri Dhingra, GVMITM, Sonipat
Mr. Milind Mahajan, Impressico Business Solutions, New Delhi
Sh. Shashi Bhushan Sharma, Associate Professor, SOCIS, IGNOU
Sh. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. P. Venkata Suresh, Associate Professor, SOCIS, IGNOU
Dr. V.V. Subrahmanyam, Associate Professor, SOCIS, IGNOU
Sh. M.P. Mishra, Assistant Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU

COURSE DESIGN COMMITTEE


Prof. T.V. Vijay Kumar, JNU, New Delhi
Prof. S. Balasundaram, JNU, New Delhi
Prof. D.P. Vidyarthi, JNU, New Delhi
Prof. Anjana Gosain, USICT, GGSIPU, New Delhi
Dr. Ayesha Choudhary, JNU, New Delhi
Sh. Shashi Bhushan Sharma, Associate Professor, SOCIS, IGNOU
Sh. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. P. Venkata Suresh, Associate Professor, SOCIS, IGNOU
Dr. V.V. Subrahmanyam, Associate Professor, SOCIS, IGNOU
Sh. M.P. Mishra, Assistant Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU

SOCIS FACULTY
Prof. P. Venkata Suresh, Director, SOCIS, IGNOU
Prof. V.V. Subrahmanyam, SOCIS, IGNOU
Dr. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. Naveen Kumar, Associate Professor, SOCIS, IGNOU (on EOL)
Dr. M.P. Mishra, Associate Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU
Dr. Manish Kumar, Assistant Professor, SOCIS, IGNOU

PREPARATION TEAM
Dr. Sudhansh Sharma (Writer – Unit 1), Assistant Professor, SOCIS, IGNOU
(Unit 1 partially adapted from MCSE-003 Artificial Intelligence & Knowledge Management)

Mr. Anant Kumar Jayswal (Writer – Units 2 & 3), Assistant Professor, Amity School of Engineering and Technology

Prof. Arvind (Writer – Unit 4), Department of Mathematics, Hansraj College, University of Delhi

Prof. Ela Kumar (Content Editor), Department of Computers & Engg., IGDTUW, Delhi

Prof. Parmod Kumar (Language Editor), SOH, IGNOU, New Delhi

Course Coordinator: Dr. Sudhansh Sharma

Print Production
Sh. Sanjay Aggarwal, Assistant Registrar, MPDD

© Indira Gandhi National Open University, 2022
ISBN-
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission
in writing from the Indira Gandhi National Open University.
Further information on the Indira Gandhi National Open University courses may be obtained from the University’s office at
Maidan Garhi, New Delhi-110068.
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by MPDD, IGNOU.
UNIT 1 INTRODUCTION TO ARTIFICIAL
INTELLIGENCE
Structure
1.1 Introduction
1.2 Objectives
1.3 Basics of Artificial Intelligence (AI)
1.4 Brief history of Artificial Intelligence
1.5 Components of Intelligence
1.6 Approaches to Artificial Intelligence
1.7 Comparison between Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL)
1.8 Application Areas of Artificial Intelligence Systems
1.9 Intelligent Agents
1.9.1 Stimulus - Response Agents
1.10 Summary
1.11 Solutions/Answers
1.12 Further Readings

1.1 INTRODUCTION
Today, artificial intelligence is used in a wide variety of applications, including engineering, technology, the military, opinion mining, sentiment analysis, and many more. It is also used in more advanced domains, such as language processing and aerospace applications.
AI is everywhere in today's world, and people are gradually becoming accustomed to its presence. It is utilised in systems that recognise both voices and faces, and it can provide you with shopping recommendations that are tailored to your own purchasing preferences. The same capability makes it much easier to detect spam and to prevent fraudulent use of credit cards. Virtual assistants such as Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and Google's own Google Assistant are among the most cutting-edge technologies currently on the market. It is possible that you are already familiar with the technology involved in artificial intelligence (AI). Are you?

AI has become very popular all over the world today. It imitates human intelligence in machines by programming them to do the things people do. As a technology, AI is going to have an ever bigger impact on how people live their daily lives, and everyone wants to be connected with it. Before we can understand AI, however, we need to discuss some basic questions. For example, what is the difference between knowledge and intelligence? The answer to this question is the key to starting this unit.

Knowledge is the accumulation of information and abilities that a person has gained through life experience, while intelligence refers to one's capacity to put that knowledge into practice. To put it simply, knowledge is what we have learned over the years, and it expands as time passes; it represents the culmination of everything that we have realised over the course of our lives. It is important to highlight that having information does not automatically make one intelligent; rather, intelligence is what makes one smart.

There is a well-known proverb that says "marks are not the measure of intelligence." This is because intelligence is not a measurement of how much information one possesses; it is the measure of how much we comprehend and put into practice. A knowledgeable person may gather a lot of information, but an intelligent person understands how to comprehend, analyse, and use that information. You could have a lot of knowledge but still be the least intelligent person in the room. Knowledge and intelligence are inextricably linked, and each contributes to the other's development: knowledge enables one to learn the understandings that others have of things, whereas intelligence is the foundation for one's ability to grasp the things themselves.

Now that we have an understanding of the distinction between intelligence and knowledge, our next question is: what exactly is artificial intelligence? Incorporating intelligence into a machine is the concern of the field of Artificial Intelligence, and both the representation of knowledge and its engineering are the basis of traditional AI research. This topic will be discussed in section 1.3 of this unit. Knowledge engineering is a subfield of artificial intelligence (AI) that applies rules to data in order to simulate the way in which an expert would think about the information. It does this by analysing the structure of a task or a decision in order to figure out how one arrives at a conclusion.

In the subsequent units of this course, you will learn about some of the concepts that are essential for knowledge representation, such as frames, scripts, and other related topics. In addition, this course will address the issues associated with knowledge representation for uncertain situations, employing methods such as fuzzy logic, rough sets, and the Dempster-Shafer theory, among other relevant topics.

The following are some of the prerequisites to get started with this subject:

a) A strong understanding of basic concepts of Mathematics, viz. algebra, calculus, probability and statistics.
b) Experience in programming using Python or Java.
c) A good understanding of algorithms.
d) Reasonably good data analytics skills.
e) Fundamental knowledge of discrete mathematics.
f) Finally, the will to learn.
1.2 OBJECTIVES
After going through this unit, you will be able to:
• Understand the difference between knowledge and intelligence
• Answer the question: what is AI?
• Identify the various approaches to AI
• Compare Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL)
• Understand the concept of agents in AI

1.3 BASICS OF ARTIFICIAL INTELLIGENCE (AI)


Knowledge and intelligence are two important concepts, and we have gained an understanding of the fundamental distinction between the two terms. Let us now discuss what artificial intelligence actually is. The meaning of Artificial Intelligence will be covered in this section; before we get started, it is important to note that the field of Artificial Intelligence is related to the process of incorporating intelligence into machines. The specifics of this process, as well as the mechanism itself, will be covered in this unit as well as in the subsequent units of this course.

The following is a list of eight definitions of artificial intelligence that have been provided by well-known authors of artificial intelligence textbooks.

1) "The exciting new effort to make computers think... machines with minds, in the full and literal sense." (Haugeland, 1985)

2) "[The automation of] activities that we associate with human thinking, activities such as decision-making, problem solving, learning..." (Bellman, 1978)

3) "The study of mental faculties through the use of computational models." (Charniak and McDermott, 1985)

4) "The study of the computations that make it possible to perceive, reason, and act." (Winston, 1992)

5) "The art of creating machines that perform functions that require intelligence when performed by people." (Kurzweil, 1990)

6) "The study of how to make computers do things at which, at the moment, people are better." (Rich and Knight, 1991)

7) "A field of study that seeks to explain and emulate intelligent behaviour in terms of computational processes." (Schalkoff, 1990)

8) "The branch of computer science that is concerned with the automation of intelligent behaviour." (Luger and Stubblefield, 1993)

According to the definitions presented above, there are four distinct objectives that might be pursued in the field of artificial intelligence. These objectives are as follows:

• The creation of systems that think in the same way as people do.
• The creation of systems that are capable of logical thought.
• The creation of machines that can mimic human behaviour.
• The creation of systems that behave in a logical manner.

In addition, we discovered through our discussion in section 1.1 of this unit that Artificial Intelligence (AI) is the intelligence that is incorporated into machines; in other words, AI is the ability of a machine to display human-like capabilities such as reasoning, learning, planning, and creativity. Taking into account the emerging AI technologies to sense, comprehend, and act according to the circumstances, the relevant illustrative solutions are summarised in Figure 1, which attempts to encapsulate the understanding of the question "What is Artificial Intelligence?"

AI technologies and illustrative solutions:

Sense – Computer Vision; Audio Processing
Comprehend – Natural Language Processing; Knowledge Representation
Act – Machine Learning; Expert Systems

Illustrative solutions include Virtual Agents, Identity Analytics, Cognitive Robotics, Speech Analytics, Recommendation Systems and Data Visualization.

Figure-1: What is Artificial Intelligence? – Emerging AI Technologies


The applications of artificial intelligence (AI) shown in Figure 1 do not all require the same kinds of AI. In a general sense, one may say that AI can be categorised into the following levels:
• Software level, which consists of things like search engines, virtual assistants, speech and facial recognition systems, picture analysis tools, and the like.
• Hardware level (Embodied or Embedded AI), which includes robots, autonomous vehicles, drones, the Internet of Things, and other such technologies.
On the basis of its functionalities, AI can be classified into Type 1 and Type 2 (see Figure 1(a)).

Figure 1(a): Classification of Artificial Intelligence

Here is a brief introduction to the first type of AI, i.e., Type 1 AI. The following are the three stages of Type 1 Artificial Intelligence:
a) Artificial Narrow Intelligence (ANI)
b) Artificial General Intelligence (AGI)
c) Artificial Super Intelligence (ASI)

Figure 1(b) summarises the three stages. Stage 1, ANI (machine learning), specialises in one area and solves one problem. Stage 2, AGI (machine intelligence), refers to a computer that is as smart as a human across the board. Stage 3, ASI (machine consciousness), is an intellect that is much smarter than the best human brains in practically every field.

Figure 1(b): Three Stages of Type-1 Artificial Intelligence


The various categories of Artificial Intelligence are discussed as follows:

a) Artificial Narrow Intelligence (ANI), also called Weak AI or Narrow AI: Weak AI is a term for "simulated" thinking. Such systems seem to act intelligently, but they do not have any awareness of what they are doing. For example, a chatbot might talk to you in a way that seems natural, but it does not know who it is or why it is talking to you. Narrow AI is a system that is built to do one specific job.

b) Artificial General Intelligence (AGI): Strong or General Artificial Intelligence, also called "actual" thinking. That is, acting like a smart human and thinking like one, with a conscious, subjective mind. For instance, when two humans talk, they probably both know who they are, what they are doing, and why.

Systems with strong or general artificial intelligence can do the things that people can do. These systems tend to be harder to understand and more complicated. They are set up to handle situations in which they might need to solve problems on their own without help from a person. Uses for these kinds of systems include self-driving cars and operating rooms in hospitals.

c) Artificial Super Intelligence (ASI): The term "super intelligence" usually refers to a level of general and strong AI that is smarter than humans, if that is even possible. ASI is seen as the logical next step after AGI because it can do more than humans can, including making rational decisions and even things like building emotional relationships. There is only a marginal difference between AGI and ASI.

☞ Check Your Progress 1


Q1 How does knowledge differ from intelligence? What do you understand by the term Artificial Intelligence (AI)? List the various AI technologies and their corresponding illustrative solutions.

……………………………………………………………………………………………
……………………………………………………………………………………………
……………………………………………………………………………………………
Q2 Classify AI on the basis of its functionalities.

……………………………………………………………………………………………
……………………………………………………………………………………………
……………………………………………………………………………………………
Q3 Compare ANI, AGI and ASI in the context of AI.

……………………………………………………………………………………………
……………………………………………………………………………………………
1.4 BRIEF HISTORY - ARTIFICIAL INTELLIGENCE
AI's ideas come from early research into how people learn and think. The idea that a computer could act like a person is also very old; indeed, the notion of machines that can think for themselves goes back to Greek mythology.

• Aristotle (384 BC - 322 BC) developed an informal system of syllogistic logic, from which the first formal systems of deductive reasoning later grew.

At the start of the 17th century, Descartes said that animal bodies are just complex machines.

• Pascal made the first mechanical digital calculator in the year 1642.

In the 1800s, George Boole developed a binary algebra (Boolean algebra) that represented (some) "laws of thought."

• Charles Babbage and Ada Byron worked on programmable mechanical calculators.

In the late 19th century and early 20th century, mathematicians and philosophers like Gottlob
Frege, Bertrand Russell, Alfred North Whitehead, and Kurt Gödel built on Boole's first ideas
about logic to make mathematical representations of logic problems.

The advent of electronic computers was a big step forward in the study of intelligence.

• In 1943, McCulloch and Pitts proposed a Boolean circuit model of the brain. In their paper "A Logical Calculus of the Ideas Immanent in Nervous Activity", they showed how neural networks can perform computation.

• In 1950, Turing published the paper "Computing Machinery and Intelligence", which gave a good overall picture of AI. To learn more about Alan Turing, go to http://www.turing.org.uk/turing.

Turing's paper discussed many ideas, one of which was solving problems by searching through the space of possible solutions, guided by heuristics. He used the game of chess to illustrate his ideas about machine thinking, and he even proposed that a machine could change its own instructions, so that machines could learn from what they do.

The SNARC was built by Marvin Minsky and Dean Edmonds in 1951. It was the first randomly
wired neural network learning machine (SNARC stands for Stochastic Neural Analog
Reinforcement Computer). It was a computer with a network of 40 neurons and 3000 vacuum
tubes.

Between 1952 and 1956, Arthur Samuel developed a series of checkers-playing programmes.
In 1956, a well-known meeting was held at Dartmouth, where the people who founded the field of AI met for the first time. At this meeting, the name "Artificial Intelligence" was chosen.

• Newell and Simon presented the Logic Theorist, which many people regard as the first artificial intelligence programme.

In 1959, Gelernter built his Geometry Theorem Prover. In 1961, James Slagle's Ph.D. dissertation at MIT described a programme called SAINT, written in LISP, which could solve calculus problems at the level of a first-year college student.

In 1963, Thomas Evans wrote a programme called ANALOGY that could solve analogy problems like those on an IQ test. Also in 1963, Edward A. Feigenbaum and Julian Feldman released "Computers and Thought", the first collection of articles about artificial intelligence.

In 1965, J. Alan Robinson invented a mechanical proof procedure, the Resolution Method, which allowed formal logic to work well as a language for representing programmes. In 1967, Feigenbaum, Lederberg, Buchanan, and Sutherland at Stanford demonstrated the Dendral programme, which could interpret the mass spectra of organic chemical compounds. This was the first successful programme based on scientific knowledge. In 1969, the SRI robot Shakey demonstrated that it was possible to combine locomotion, perception, and problem-solving.

From 1969 to 1979, the first knowledge-based systems were developed.

• In 1974, MYCIN demonstrated the power of rule-based systems for representing knowledge and drawing inferences in medical diagnosis and treatment. Minsky also proposed frames as a means of knowledge representation, and logic-based programming languages such as Prolog and Planner appeared. In the 1980s, Lisp machines were developed and marketed; in 1985, neural networks returned to popularity; and in 1988, probabilistic and decision-theoretic methods came back into use.

Early AI was based on general-purpose systems that knew little. AI researchers realised that, for machines to be able to reason about complex tasks, they need a lot of knowledge about a narrow domain.

In 1989, Dean Pomerleau at CMU created ALVINN (Autonomous Land Vehicle In a Neural Network), a system that learns to drive by watching a person drive. Its neural network receives a 30x32-unit image from a camera, and its output layer indicates the direction in which the vehicle should steer. The system drove a car from the East Coast to the West Coast of the United States, about 2,850 miles; a person drove only about 50 of those miles, and the system took care of the rest.

In the 1990s, AI made a lot of progress, especially in machine learning, data mining, intelligent tutoring, case-based reasoning, multi-agent planning and scheduling, uncertain reasoning, natural-language understanding and translation, vision, virtual reality, games, and other areas. Rod Brooks' COG Project at MIT, with the help of many collaborators, made significant progress toward building a humanoid robot.

In the 1990s,

• In 1997, the first official RoboCup soccer match was played on a tabletop, with 40 teams of robots communicating with each other.
• As more and more people use the web, web crawlers and other AI-based programmes that extract information from it are becoming more and more important.
• In 1997, IBM's Deep Blue chess programme defeated Gary Kasparov, the reigning world chess champion.

In 2000,

• The Nomad robot explores remote regions of Antarctica, looking for meteorite samples.
• Robotic space probes can work on their own to explore space: they monitor their surroundings, make decisions, and act to reach their goals. In April 2004, the first three-month missions of NASA's Mars rovers were completed successfully. The Spirit rover was examining a range of hills on Mars that took two months to reach, finding strangely eroded rocks that may be new pieces of the puzzle of the region's history. Spirit's twin, Opportunity, was examining the layers of rock in a crater.
• Internet agents: As the Internet grows quickly, more people want to use Internet agents to keep track of what users are doing, find the information they need, and figure out which information is most useful. The reader can learn more about AI by following it in the news.

1.5 COMPONENTS OF INTELLIGENCE

According to the dominant school of thought in psychology, human intelligence should not be viewed as a single talent or cognitive process but rather as a collection of distinct components. Most attention in the field of artificial intelligence research has been paid to the following components of intelligence: learning, reasoning, problem-solving, perception, and language comprehension.

Learning: There are numerous ways to build a learning system. The simplest way to learn is by trial and error. For example, a basic programme that solves "mate in one" chess problems might try different moves until it finds one that solves the problem. The programme then remembers the move that worked, so that the next time the computer is given the identical position it can respond immediately. The simple act of memorising things like answers to problems, words in a vocabulary list, and so on is known as "rote learning" or memorisation.
We shall now discuss another classification, one that does not depend on how knowledge is represented. According to this classification, there are five ways to learn:

(i) Rote learning or memorising
(ii) Learning by instruction
(iii) Learning by analogy
(iv) Learning by induction
(v) Learning by deduction

Rote learning: Rote learning is the simplest method of learning, since it involves the least amount of interpretation: the information is simply copied into a database. This is the technique by which we memorise multiplication tables.

On a computer, rote learning is relatively simple to implement. What is more difficult to implement is what is known as generalisation. Generalised learning allows the learner to perform well in situations it has not faced before. A programme that learns the past tenses of regular English verbs by rote will not be able to form the past tense of "jump" until it has been presented with "jumped" at least once, whereas a programme that can generalise from examples will be able to learn the "add -ed" rule and thus form the past tense of "jump" even if it has never encountered this verb before. Modern techniques allow programmes to generalise complex rules from data.
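The contrast between rote learning and generalisation is easy to see in code. The following is a minimal sketch; the function names and the toy "add -ed" rule are illustrative, not a full treatment of English morphology:

```python
# Rote learner: a lookup table of memorised (verb, past tense) pairs.
rote_table = {"walk": "walked", "talk": "talked"}

def past_tense_rote(verb):
    """Answers only for verbs that have already been seen."""
    return rote_table.get(verb)          # None for unseen verbs

def past_tense_generalised(verb):
    """Applies the generalised 'add -ed' rule to any regular verb."""
    return verb + "ed"

print(past_tense_rote("jump"))           # None - "jumped" was never memorised
print(past_tense_generalised("jump"))    # 'jumped' - the rule covers new verbs
```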

Learning by instruction: The next method of learning is learning by being instructed. Because new knowledge must be integrated into an existing knowledge base in order to be useful, this type of learning requires more inference. This is the type of learning that occurs when a teacher instructs a student.

Learning by analogy: When you learn by analogy, you generate new ideas by connecting
previously learned concepts. Textbooks frequently employ this method of instruction. For
example, in the text, some problems are solved as examples, and students are subsequently given
problems that are comparable to the examples. This type of learning also occurs when someone
who can drive a light car attempts to drive a heavy vehicle.

Learning by induction: Learning through induction is the most common method of learning. It employs inductive reasoning, a style of reasoning in which a general conclusion is drawn from a large number of positive instances. For example, if we encounter many cows, we might conclude that cows have four legs, are white, and have two horns in the same position on their head. Even though inductive reasoning frequently leads to valid conclusions, the conclusions are not always beyond dispute: with the above-mentioned concept of a cow, we might still come across a black cow, a three-legged cow that has lost one leg in an accident, or a single-horned cow.
Learning by deduction: Finally, we discuss deductive learning, which is founded on deductive inference, an irrefutable mode of reasoning. By an irrefutable mode of reasoning, we mean that if the hypotheses (or given facts) are true, the conclusion arrived at through deductive reasoning is always correct. This is the most common mode of reasoning in mathematics.
Inductive learning is a crucial component of an agent's learning architecture. An agent learns based on:
• What it is learning, such as concepts, problem-solving techniques, or game-playing techniques.
• The representation employed, such as predicate calculus, frames, or scripts.
• The critic, which evaluates the performance of the agent.
Learning based on feedback is normally categorized as:
• Supervised learning
• Unsupervised learning
• Reinforcement learning

Supervised Learning: Here a function is learned from example inputs and their corresponding outputs. Examples of this kind of learning include inferring useful properties of the world from percepts, learning a mapping from the conditions of the current state to actions, and learning how the world changes over time.

Unsupervised Learning: In this type of learning, the inputs come with no expected outputs. The learning system therefore has to figure out on its own which properties of the unknown objects are important. An example is figuring out the shortest way to get from one city to another in a country you know nothing about.
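The distinction can be made concrete in a few lines. Below is a minimal sketch using scikit-learn (an assumed dependency); the tiny data set and its feature meanings are invented purely for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Supervised: every input comes with a known output (a label).
X = [[150, 45], [170, 75], [155, 50], [180, 80]]   # e.g. [height, weight]
y = ["small", "large", "small", "large"]            # labels from a "teacher"
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[165, 70]]))                     # -> ['large']

# Unsupervised: inputs only; the system must find structure itself.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                                   # cluster assignment per point
```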

Reinforcement Learning (learning from rewards): In some problems, the task can only be observed, not specified, and the job may be an ongoing one. The user tells the agent how satisfied or unsatisfied he or she is with the agent's work by occasionally giving the agent positive or negative rewards (i.e., reinforcements). The agent's job is to collect as many rewards (or reinforcements) as possible. In a simple goal-attainment problem, the agent can be rewarded when it reaches the goal and punished when it does not.

You need an action plan to get the most out of this kind of task. For tasks that never end, however, the total future reward might be unbounded, making it hard to decide how to maximise it. One way forward in such situations is to discount rewards beyond a certain point: that is, the agent may value rewards that will come soon more than rewards that will come a long time from now.
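This preference for near-term rewards is usually implemented by weighting each reward by a discount factor raised to the power of its delay. A minimal sketch follows; the discount factor gamma = 0.9 is an arbitrary illustrative choice:

```python
# Each reward is weighted by gamma**t, so rewards arriving later count
# for less, and an unending stream of rewards still sums to a finite value.
def discounted_return(rewards, gamma=0.9):
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(discounted_return([1] * 100))   # ~10, i.e. close to 1 / (1 - 0.9)
```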

Delayed-reinforcement learning is the process of determining how to behave in situations when rewards
are contingent on previous actions.
Reasoning: To reason is to draw conclusions that are appropriate for the situation. Both deductive and inductive reasoning can be used to draw conclusions. An example of a deductive inference is: "Fred is either in the museum or in the cafe. He isn't in the cafe, so he must be in the museum." An example of an inductive inference is: "In the past, accidents just like this one have been caused by instrument failure; therefore this accident was caused by instrument failure." The difference between the two is that in the deductive case the truth of the premises guarantees the truth of the conclusion, while in the inductive case the truth of the premises supports the conclusion, but further research could show that the conclusion is actually false even though the premises are true.

Programming computers to make inferences, especially deductive inferences, has met with considerable success. But a programme cannot be said to reason merely because it can draw conclusions. Reasoning requires drawing conclusions that are relevant to the task or situation at hand, and giving computers the ability to tell what is relevant from what is not is one of the hardest problems AI has to face.
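Deductive steps like the "Fred" example above are mechanical enough to code directly. Here is a minimal sketch of the underlying rule, a disjunctive syllogism (from "A or B" and "not B", conclude "A"); the function name is illustrative:

```python
def disjunctive_syllogism(disjuncts, ruled_out):
    """Return what must hold once one disjunct is ruled out."""
    remaining = [d for d in disjuncts if d != ruled_out]
    return remaining[0] if len(remaining) == 1 else remaining

# "Fred is in the museum or the cafe" + "Fred is not in the cafe"
print(disjunctive_syllogism(["museum", "cafe"], "cafe"))   # -> museum
```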

Problem-solving: Problems usually go like this: given these data, find x. AI is used to solve a
very wide range of problems. Some examples are finding the best way to win a board game,
figuring out who someone is from a picture, and planning a series of steps that will allow a robot
to do a certain task.

Methods for solving problems can be either special-purpose or general. A special-purpose method is designed for a specific problem and often exploits very specific features of the situation in which the problem occurs. A general method can be applied to many different kinds of problems. One general technique used in AI is means-end analysis, in which the difference between the current state and the goal state is reduced step by step. The programme selects actions from a list of means, which for a simple robot might include pick up, put down, move forward, move back, move left, and move right, until the current state is transformed into the goal state.
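A minimal sketch of means-end analysis for such a robot follows: at each step the programme picks whichever action most reduces the remaining difference (here, grid distance) between the current state and the goal. The grid world and the action set are illustrative assumptions:

```python
actions = {"move forward": (0, 1), "move back": (0, -1),
           "move left": (-1, 0), "move right": (1, 0)}

def means_end(state, goal):
    plan = []
    while state != goal:
        # Evaluate each action by the difference that would remain after it.
        name, (dx, dy) = min(
            actions.items(),
            key=lambda a: (abs(goal[0] - (state[0] + a[1][0]))
                           + abs(goal[1] - (state[1] + a[1][1]))))
        state = (state[0] + dx, state[1] + dy)
        plan.append(name)
    return plan

print(means_end((0, 0), (2, 1)))   # a 3-step plan reaching the goal
```

The loop terminates because, on a grid, some action always reduces the remaining difference by one step.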

Perception: Perception involves scanning the environment with various sense organs, natural or artificial, and internal processes that analyse the scene into objects, their features, and their relationships. Analysis is complicated by the fact that one and the same object can have different appearances depending on the angle from which it is viewed, whether or not parts of it are casting shadows, and so on.

Artificial perception has progressed to the point where a self-controlled car-like device can drive at modest speeds on the open road, and a mobile robot can search a suite of bustling offices for empty drink cans and remove them. FREDDY, a stationary robot with a moving TV 'eye' and a pincer 'hand' (constructed at Edinburgh University during the period 1966-1973 under the direction of Donald Michie), was one of the first systems to integrate perception and action. FREDDY could recognise a wide range of objects and could be taught to assemble simple artefacts, such as a toy automobile, from a jumble of parts.
Language-understanding: A language is a set of signs with predetermined meaning. For
example, traffic signs establish a mini-language; it is a matter of convention that the hazard-
ahead sign signifies trouble ahead. This language-specific meaning-by-convention is distinct
from what is known as natural meaning, as evidenced by phrases like "Those clouds signify rain"
and "The drop in pressure suggests the valve is malfunctioning."

The productivity of full-fledged human languages, such as English, separates them from other
types of communication, such as bird sounds and traffic sign systems. A productive language is
one that is rich enough to allow for the creation of an infinite number of different sentences.

1.6 APPROACHES TO ARTIFICIAL INTELLIGENCE


In the previous sections of this unit we learned about various concepts of Artificial Intelligence, but now the question is: how do we measure whether Artificial Intelligence is making a machine behave, act or perform like a human being?

Perhaps, in the future, we will reach a point where AI can behave like humans, but what guarantee do we have of that? How can we test whether a system really acts like a human? The following approaches constitute the foundation for evaluating an AI entity's human-likeness:

• The Turing Test
• The Cognitive Modelling approach
• The Laws of Thought approach
• The Rational Agent approach

Let’s take a look at how these approaches perform:

In the past, researchers have worked hard to reach all four of these goals. But it is hard to find a
good balance between approaches that focus on people and approaches that focus on logic.
People are often "irrational" in the sense of being "emotionally unstable," so it's important to tell
the difference between human and rational behaviour.

Researchers have found through their studies that a human-centered approach must be an
empirical science with hypotheses and experiments to prove them. In a rationalist approach,
math and engineering are used together. People in each group sometimes say bad things about
the work done by the other groups, but the truth is that each way has led to important discoveries.
Let's take a closer look at each one.

Acting humanly, i.e. the Turing Test approach: Alan Turing, the most well-known name among the pioneers, thought about how to test an A.I. product to see whether it was intelligent, and in 1950 he devised a test for this purpose, now known as the Turing Test. A summary of the Turing Test follows; see Figure 2 for more details.

Figure 2: Turing Test


For the purpose of the test, three rooms are used. In one of the rooms is a computer system that is claimed to be intelligent. In each of the other two rooms sits one person. One of the people, whom we shall call C, is supposed to ask questions of the computer and of the other person, whom we shall call B, without knowing which respondent each question reaches, with the goal of figuring out which respondent is the computer. The computer, on the other hand, replies in a way that keeps C from finding out its identity.

The only way for the three of them to communicate is through computer terminals. This means that the identity of the computer or of person B can only be determined from how intelligent the responses are, not from any other human or machine traits. If C cannot figure out which respondent is the computer, then the computer is considered intelligent. More accurately, the computer is intelligent if it can hide its identity from C.

Note that, to be considered intelligent, the computer should be smart enough not to answer too quickly, at least not within a hundredth of a second, even if it can do something like find the sum of two numbers of more than 20 digits each in that time.

Criticism of the Turing Test: There have been a number of criticisms of the Turing Test as a test of machine intelligence. The Chinese Room Test, developed by John Searle, is one of the most well-known. The crux of the Chinese Room Test, which we discuss below, is that the ability of a system, say A, to convince others that it possesses the qualities of another system, say B, does not imply that system A actually possesses those qualities. For example, a male human's ability to persuade people that he is a woman does not imply that he is capable of bearing children like a woman.

The scenario for the Chinese Room Test takes place in a single room with two windows. A Shakespeare scholar who knows English but not Chinese sits in the room with a kind of Shakespeare encyclopaedia. The encyclopaedia is printed so that, for every pair of facing pages, one page is written in Chinese characters and the other page is an English translation of the Chinese page. Through one of the windows, questions about Shakespeare's writing, written in Chinese characters, are passed to the person inside. The person looks through the encyclopaedia and, upon finding the exact copy of the sequence of characters sent in, reads the English translation, thinks of the answer, and writes it down in English for his or her own understanding. The person then looks up the corresponding sequence of Chinese characters in the encyclopaedia and passes that sequence out through the other window. Now, Searle argues that even though the scholar acts as though he or she knows Chinese, this is not the case: just because a system can mimic a quality does not mean that it actually has that quality.

Thinking humanly, i.e. the cognitive modelling approach: From this point of view, the Artificial Intelligence model is based on human cognition, the core of the human mind. This is done through three approaches, which are as follows:

• Introspection: observing our own thoughts and building a model from them.

• Psychological experiments: conducting experiments on humans and observing their behaviour.

• Brain imaging: using MRI to study how the brain works in different situations and then replicating that through code.

Thinking rationally, i.e. the laws of thought approach: This approach uses the laws of thought to think logically. The Laws of Thought are a long list of logical statements that govern the operation of our mind, and the "Thinking Rationally" approach is based on them: the laws can be codified and implemented in artificial intelligence algorithms. But solving a problem by following the laws is very different from solving a problem in the real world, and this is the biggest drawback of this approach.

Acting rationally, i.e. the rational agent approach: In every situation, a rational agent tries to achieve the best possible outcome; that is, it tries to make the best decision it can given the circumstances. This makes the agent approach much more flexible and adaptable. The laws of thought approach, by contrast, requires an entity to act in a way that is logically correct; but there are situations in which there is no logically right thing to do and more than one way to solve the problem, each with different results and trade-offs. At that point, the rational agent approach works well.

☞ Check Your Progress 2


Q4 Briefly discuss the various components of intelligence
……………………………………………………………………………………………
……………………………………………………………………………………………
Q5 How do we measure whether Artificial Intelligence makes a machine behave, act or perform like a human being?
……………………………………………………………………………………………
……………………………………………………………………………………………
Q6 What is the Turing Test? What are the criticisms of the Turing Test?
……………………………………………………………………………………………
……………………………………………………………………………………………
1.7 COMPARISON - ARTIFICIAL INTELLIGENCE,
MACHINE LEARNING & DEEP LEARNING
Artificial intelligence is a big field that includes many different approaches, from top-down (knowledge representation) to bottom-up (machine learning). In recent years, three related ideas have often been discussed together: artificial intelligence (AI), machine learning (ML), and deep learning (DL). AI is the most general term, machine learning is a part of AI, and deep learning is a type of machine learning. Figure 2(a) shows how these three ideas are related to each other, and Figure 2(b) shows that AI is a broad field with many different subdomains. AI's recent rise in popularity is largely due to how well machine learning, especially deep learning, works. So, this section will focus on these two areas of AI: Machine Learning (ML) and Deep Learning (DL).

Fig 2(a): AI, ML, DL – Deep Learning is a subset of Machine Learning, which in turn is a subset of Artificial Intelligence.

Figure 2(b): Various Sub-Domains of Artificial Intelligence – Deep Learning, Natural Language Processing, Machine Learning, Neural Networks and Computer Vision; Machine Learning is further divided into Supervised, Unsupervised, Reinforcement and Semi-supervised Learning.

To make a system that is artificially intelligent, you have to carefully reverse-engineer human traits and machine abilities. Also, to understand how an AI system really works, you have to get a good grasp of the different parts of AI and how they can be applied in different industries or industrial fields.
Introduction to Machine Learning (ML): Machine learning is a branch of artificial intelligence (AI). It embodies one of the most important ideas in AI: learning through experience rather than through explicit instruction. One of the most recent advances in AI, this way of learning is made possible by applying machine learning to very large data sets. Machine learning algorithms find patterns and learn how to make predictions and recommendations by using data and experience instead of explicit programming instructions. The algorithms also adapt as they receive new data and learn from their experiences, which makes them more effective over time.

Machine learning algorithms are modelled on the ways people learn from their experiences: they are programmed to learn from what they do and to get better at it. They do not need to be told how to produce the desired results. Instead, they learn by examining the data sets available to them and comparing what they find with examples of the final results, looking for patterns in the output and working out how to combine the different parts to produce the output they want.

ML teaches a machine how to draw conclusions and make decisions based on what it has learned in the past. It looks for patterns and analyses past data to figure out what these patterns mean, so that a conclusion can be reached without the need for human experience. Businesses save time and make better decisions by using such automation to evaluate data and reach conclusions.
Machine learning provides predictions and prescriptions. In order of increasing complexity, there are three types of analytics:

• Descriptive analytics: describes what happened; employed heavily across all industries, and in data-driven organizations as a key source of insight.
• Predictive analytics: anticipates what will happen (inherently probabilistic); together with prescriptive analytics, this is the focus of machine learning.
• Prescriptive analytics: provides recommendations on what to do to achieve goals; employed heavily by leading data and internet companies.

Figure 3: Machine Learning – Descriptive, Predictive and Prescriptive Analytics


Introduction to Deep Learning (DL): Deep Learning is a subfield of machine learning that focuses on algorithms called Artificial Neural Networks (ANNs), which are inspired by the structure and function of the brain. Deep Learning can handle a wider range of data sources, needs less pre-processing of data, and often gives more accurate results than traditional machine learning methods.

A neural network is made up of interconnected layers of software-based calculators called "neurons". The network can take in a huge amount of data and process it through many layers, learning increasingly complex features of the data at each layer. The network can then make a decision about the data, find out whether its decision was right, and use what it has learned to make decisions about new data. Deep learning thus uses the workings of neural networks to teach computers to do things that humans do naturally: a computer model learns to run classification tasks directly on images, text, or sound. Once a neural network has learned what an object looks like, it can spot that object in a new picture.

Deep learning is becoming more popular because its models can achieve better results; it uses large sets of labeled data and neural network architectures to train the models.
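The layered processing described above can be made concrete with a tiny forward pass. The following is a minimal sketch using NumPy (an assumed dependency); the layer sizes are arbitrary, and the random weights stand in for values a real network would learn from labeled data:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # layer 1: 4 inputs -> 3 neurons
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)   # layer 2: 3 neurons -> 2 classes

def forward(x):
    h = np.maximum(0, x @ W1 + b1)                # each layer extracts features
    scores = h @ W2 + b2
    return np.exp(scores) / np.exp(scores).sum()  # probabilities per class

print(forward(np.array([0.5, -1.0, 2.0, 0.1])))   # e.g. [p_class_0, p_class_1]
```

Training would adjust W1 and W2 from labeled examples rather than leaving them random; this sketch shows only how data flows through the layers.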

☞ Check Your Progress 3

Q7 Compare Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).

……………………………………………………………………………………………
……………………………………………………………………………………………
Q8 Compare Descriptive, Predictive and Prescriptive analytics performed under Machine
Learning.

……………………………………………………………………………………………
……………………………………………………………………………………………

1.8 APPLICATION AREAS OF ARTIFICIAL INTELLIGENCE SYSTEMS

Artificial intelligence is the most important factor in transforming economies from the ground up, and it is contributing as an efficient alternative. It has great potential to perform optimisation in any industry, whether smart cities, the health sector, agriculture or any other prospective sector of relevance. Below we have included a few of the areas in which AI is functioning as a major source of competitive advantage:
a) Healthcare: The application of AI in healthcare can help address the high barriers to access to healthcare facilities, particularly in rural areas that suffer from poor connectivity and a limited supply of healthcare professionals. This can be accomplished through the deployment of use cases such as AI-driven diagnostics, personalised treatment, early identification of potential pandemics, and imaging diagnostics, amongst others.

Figure 4: Potential Use of AI in Health Care – AI and robotics can support the whole care cycle: keeping well, early detection, diagnosis, decision making, treatment, end-of-life care, research and training.

b) Agriculture: AI has the potential to bring about a food revolution while satisfying the ever-increasing need for food (globally, we will need to produce 50 percent more food and cater to an additional 2 billion people by 2050 as compared to today). It also has the potential to resolve issues such as inadequate demand prediction, a lack of assured irrigation, and the overuse or misuse of pesticides and fertilizers, among other problems. Example use cases include increasing crop yield through real-time advisories, advanced detection of pest infestations, and forecasting crop prices to guide sowing practices.
The figure below illustrates AI and related technologies across the precision-farming cycle:

• Farming data: vast farm data is stored on the cloud, fed to an advanced analytics engine, and used by agro-input companies to customise services and by farmers to make timely operating decisions that enhance yield and profitability.
• Connected livestock: sensors monitor animal health and food intake, and send alerts on health anomalies or reduction in food/water intake.
• Smart drones: survey fields; map weeds, yield and soil variations; enable precise application of inputs and map productivity. Drones are also used for applying pesticide and herbicide.
• Autonomous tractor: a GPS-controlled autonomous tractor charts its route automatically and ploughs the land, saving fuel, reducing soil erosion and maintaining the soil.
• Crowd sourcing: establishes agribusiness communities of practice to share insights or videos/pictures, and shares information with other farmers in rural areas.
• Fleet of agribots: agribots tend to crops, performing weeding, fertilization and harvesting; they reduce fertilizer cost by up to 90% and eliminate manual labour.
• Soil sensors: provide information for ground-truthing and fine-tuning irrigation practices, avoiding under- and over-irrigation and saving crops from yield loss, water-related diseases, nutrient losses and leach-outs.
• Weather forecasts: enable decisions about when to plant, what area and crop variety to plant, and when to apply fertilizers.

Figure 5: AI for Precision Farming

All of the stages of the agricultural value chain indicated in Figure 5 above have the potential to be impacted, in terms of levels of production and efficiency, by the application of artificial intelligence and other associated technologies.

c) Smart Mobility, including Transports and Logistics: Autonomous fleets for ride sharing, semi-
autonomous features such as driver assistance, and predictive engine monitoring and maintenance are all
possible use cases for smart mobility, which includes transportation and logistics. Other areas where AI
can have a positive impact include self-driving trucks and delivery, as well as better traffic control.

d) Retail: The retail industry was one of the first to use AI solutions. For example, personalised
suggestions, browsing based on user preferences, and image-based product search have all been used to
improve the user experience. Other use cases include predicting what customers will want, keeping track
of inventory better, and managing deliveries more efficiently.

e) Manufacturing: AI-based solutions are expected to help the manufacturing industry the most. This will
make possible the "Factory of the Future" by allowing flexible and adaptable technical systems to
automate processes and machinery that can respond to new or unexpected situations by making smart
decisions. Impact areas include engineering (AI for R&D), supply chain management (predicting
demand), production (AI can cut costs and increase efficiency), maintenance (predictive maintenance and
better use of assets), quality assurance (e.g., vision systems with machine learning algorithms to find
flaws and differences in product features), and in-plant logistics and warehousing.

f) Energy: In the energy sector, possible use cases include modelling and forecasting the energy system to make it less unpredictable and to make balancing and using power more efficient. In renewable energy systems, AI can help store energy through smart meters and intelligent grids, and it can also make photovoltaic energy more reliable and less expensive. AI could also be used for predictive maintenance of grid infrastructure, just as it is in manufacturing.

g) Smart Cities: Integrating AI into newly built smart cities and infrastructure could also help meet the
needs of a population that is moving to cities quickly and improve the quality of life for those people.
Some possible use cases include controlling traffic to reduce traffic jams and managing crowds better to
improve security.

h) Education and Skilling: Quality and access problems in the education sector might be fixed by AI.
Possible uses include adding to and improving the learning experience through personalized learning,
automating and speeding up administrative tasks, and predicting when a student needs help to keep them
from dropping out or to suggest vocational training.

i) Financial industry: The financial industry also uses AI. For example, it helps the fraud department of a
bank find and flag suspicious banking and finance activities like unusual debit card use and large account
deposits. AI is also used to make trading easier and more efficient. This is done by making it easier to
figure out how many securities are being bought and sold and how much they cost.

Top Used Applications of Artificial Intelligence

• Plagiarism checkers and detection tools
• Facial recognition
• AI autopilot on commercial planes
• Ride-sharing applications (e.g., Uber, Lyft)
• E-mail spam filters, voice-to-text features and search suggestions
• AI-based predictions (e.g., Google Maps)
• Fraud protection and prevention
• Smart personal assistants (e.g., Siri, Alexa)

There are various ways to use artificial intelligence. The technology can be applied in different industries and sectors, but the adoption of AI by different sectors has been affected by technical and regulatory challenges, and the biggest factor has been its impact on business.
1.9 INTELLIGENT AGENTS
An agent may be thought of as an entity that acts, generally on behalf of someone else. More precisely, an
agent is an entity that perceives its environment through sensors and acts on the environment through
actuators. Some experts in the field require an agent to be additionally autonomous and goal directed also.

A percept may be thought of as an input to the agent through its sensors, over a unit of time, sufficient to make some sense of the input.

Percept sequence is a sequence of percepts, generally long enough to allow the agent to initiate some
action.

In order to get a further idea of what a computer agent is, let us consider one of the first definitions of an agent, which was coined by John McCarthy and his friends at MIT:

A software agent is a system which, when given a goal to be achieved, can carry out the details of the appropriate (computer) operations and, in case it gets stuck, can ask for and receive advice from humans; it may even evaluate the appropriateness of the advice and then act suitably.

Essentially, a computer agent is a computer software that additionally has the following attributes:

(i) it has autonomous control i.e., it operates under its own control
(ii) it is perceptive, i.e., it is capable of perceiving its own environment
(iii) it persists over a long period of time
(iv) it is adaptive to changes in the environment and
(v) it is capable of taking over others’ goals.
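The percept-to-action cycle described above can be sketched in a few lines of code. The following is a minimal illustrative sketch; the environment, percepts, and actions are invented stand-ins, not a standard API:

```python
# The agent perceives its environment through "sensors" (perceive) and
# acts on it through "actuators" (the action returned by act).
class SimpleAgent:
    def __init__(self):
        self.percept_sequence = []                # history of percepts

    def perceive(self, percept):
        self.percept_sequence.append(percept)     # input via sensors

    def act(self):
        # Decide on an action from the latest percept.
        if self.percept_sequence and self.percept_sequence[-1] == "obstacle":
            return "turn"
        return "move forward"

agent = SimpleAgent()
for percept in ["clear", "clear", "obstacle"]:    # simulated environment
    agent.perceive(percept)
    print(agent.act())                            # command sent to actuators
```

Such an agent reacts only to its most recent percept; richer agents would use the whole percept sequence and exhibit the other properties discussed below.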

As the concept of Intelligent Agents is relatively new, different pioneers and other experts have been conceiving and using the term in different ways. There are two distinct but related approaches to defining an agent. The first approach treats an agent as an ascription, i.e., the perception of a person (which includes expectations and points of view), whereas the other approach defines an agent on the basis of a description of the properties that the agent to be designed is expected to possess.

Let us first discuss the definition of an agent according to the first approach. Among the people who consider an agent as an ascription, a popular slogan is "Agent is that agent does". In the everyday context, an agent is expected to act on behalf of someone to carry out a particular task that has been delegated to it. But to perform its task successfully, the agent must have knowledge about the domain in which it is operating and also about the properties of its current user. In the course of normal life, we hire different agents for different jobs based on the expertise required for each job. Similarly, a non-human intelligent agent is embedded with the required domain expertise as per the requirements of the job under consideration. For example, a football-playing agent would be different from an email-managing agent, although both will have the common attribute of modelling their user.
According to the second approach, an agent is defined as an entity, which functions continuously and
autonomously, in a particular environment, which may have other agents also. By continuity and
autonomy of an agent, it is meant that the agent must be able to carry out its job in a flexible and
intelligent fashion and further is expected to adapt to the changes in its environment without requiring
constant human guidance or intervention. Ideally, an agent that functions continuously in an environment
over a long period of time would also learn from its experience. In addition, we expect an agent, which
lives in a multi-agent environment, to be able to communicate and cooperate with them, and perhaps
move from place to place in doing so.

According to the second approach to defining agent, an agent is supposed to possess some or all of the
following properties:

• Reactivity: The ability to sense the environment and then act accordingly.
• Autonomy: The ability to move towards its goal, changing its moves or strategy if required, without much human intervention.
• Communicating ability: The ability to communicate with other agents and humans.
• Ability to coexist by cooperating: The ability to work in a multi-agent environment to achieve a common goal.
• Ability to adapt to a new situation: The ability to learn, change and adapt to the situations in the world around it.
• Ability to draw inferences: The ability to infer or conclude facts which may be useful but are not available directly.
• Temporal continuity: The ability to work over long periods of time.
• Personality: The ability to impersonate or simulate someone on whose behalf the agent is acting.
• Mobility: The ability to move from one environment to another.

Task environments or problem environments are the environments, which include all the elements
involved in the problems for which agents are thought of as solutions. Task environments will vary with
every new task or problem for which an agent is being designed. Specifying the task environment is a
long process which involves looking at different measures or parameters. Next, we discuss a standard set
of measures or parameters for specifying a task environment under the heading PEAS.

PEAS (Performance, Environment, Actuators, Sensors)


For designing an agent, the first requirement is to specify the task environment to the maximum extent
possible. The task environment for an agent to solve one type of problems, may be described by the four
major parameters namely, performance (which is actually the expected performance), environment (i.e.,
the world around the agent), actuators (which include entities through which the agent may perform
actions) and sensors (which describes the different entities through which the agent will gather
information about the environment).

The four parameters may be collectively called as PEAS. We explain these parameters further, through an
example of an automated agent, which we will preferably call automated public road transport driver.
This is a much more complex agent than the simple boundary following robot which we have already
discussed.
Example (An Automated Public Road Transport Driver Agent)

We describe the task environment of the agent on the basis of PEAS.

Performance Measures: Some of the performance measures which can easily be perceived of an
automated public road transport driver would be:

 Maximizing safety of passengers


 Maximizing comfort of passengers
 Ability to reach correct destination
 Ability to minimize the time to reach the destination
 Obeying traffic rules
 Causing minimum discomfort or disturbance to other agents
 Minimizing costs, etc.

Environment (or the world around the agent) We must remember that the environment or the world
around the agent is extremely uncertain or open ended. There are unlimited combinations of possibilities
of the environment situations, which such an agent could face. Let us enumerate some of the possibilities
or circumstances which an agent might face:

 Variety of roads, e.g., from 12-lane expressways and freeways to dusty, bumpy rural roads; different road
rules, including the ones requiring left-hand drive in some parts of the world and right-hand drive in
other parts.
 The degree of knowledge of various places through which and to which driving is to be done.

 Various kinds of passengers, ranging from the highly cultured to near-ruffians, etc.

 All kind of other traffic possibly including heavy vehicles, ultra-modern cars, three-wheelers and even
bullock carts.

Actuators: These include the following:


 Handling steering wheel, brakes, gears and accelerator
 Understanding the display screen
 A device or devices for all communication required

Sensors: The agent acting as automated public road transport driver must have some way of sensing the
world around it i.e., the traffic around it, the distance between the automobile and the automobiles ahead
of it and its speed, the speeds of neighboring vehicles, the condition of the road, any turn ahead etc. It
may use sensors like odometer, speedometer, sensors telling the different parameters of the engine,
Global Positioning System (GPS) to understand its current location and the path ahead. Also, there should
be some sort of sensors to calculate its distance from other vehicles etc.

We must remember that the example agent, the automated public road transport driver, which we have
considered above, is quite difficult to implement. However, there are many other agents which operate in
comparatively simpler and less dynamic environments, e.g., a game-playing robot, an assembly-line robot
controller, and an image-processing agent.
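Since a PEAS description is just a structured checklist, it can be recorded directly as data. The following minimal Python sketch writes down the PEAS specification of the driver agent discussed above; the dictionary layout and every entry are illustrative assumptions, not a standard representation:

# A minimal, illustrative PEAS record for the automated public road
# transport driver agent; field names and entries are assumptions
# drawn from the discussion above, not a standard data structure.
driver_peas = {
    "Performance": ["passenger safety", "passenger comfort",
                    "correct destination", "minimum travel time",
                    "obey traffic rules", "minimum cost"],
    "Environment": ["roads of all kinds", "other traffic",
                    "various kinds of passengers", "road conditions"],
    "Actuators":   ["steering wheel", "brakes", "gears",
                    "accelerator", "display", "communication devices"],
    "Sensors":     ["odometer", "speedometer", "GPS",
                    "engine sensors", "distance sensors"],
}

for part, items in driver_peas.items():
    print(part, "->", ", ".join(items))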

In respect of the design and development of intelligent agents, with the passage of time, the momentum
seems to have shifted from hardware to software, the latter being thought of as a major source of
intelligence. But, obviously, some sort of hardware is essentially needed as a home for the intelligent
agent.

There are two parts of an agent or its structure:

 A (hardware) device with sensors and actuators in which that agent will reside,
called the architecture of the agent.
 An agent program that will convert or map the percepts in to actions.

Also, the agent program and its architecture are related in the sense that for a different agent architecture a
different type of agent program is required and vice-versa. For example, in case of a boundary following
robot, if the robot does not have the capability of sensing adjacent cells to the right, then the agent
program for the robot has to be changed.
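In code, this two-part structure shows up as a sense-act loop (the architecture) repeatedly invoking a percept-to-action function (the agent program). The following is a minimal sketch; the Environment interface and all names are assumptions made only for illustration:

class Agent:
    """The agent program: maps the current percept to an action."""
    def program(self, percept):
        raise NotImplementedError   # supplied by each agent type

def run(agent, environment, steps=10):
    """A toy 'architecture': sense through the sensors, run the
    agent program, and act through the actuators."""
    for _ in range(steps):
        percept = environment.percept()    # assumed sensor interface
        action = agent.program(percept)    # percepts -> actions
        environment.execute(action)        # assumed actuator interface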

Next, we discuss different categories of agents, which are differentiated from each other on the basis of
their agent programs. Capability to write efficient agent programs is the key to the success for developing
efficient rational agents. Although the table driven approach (in which an agent acts on the basis of the set
of all possible percepts by storing these percepts in tables) to design agents is possible yet the approach of
developing equivalent agent programs is found much more efficient.

Next, we discuss some of the general categories of agents based on their agent programs. Agents can be
grouped into the following classes based on their degree of perceived intelligence and capability. All these agents
can improve their performance and generate better actions over time:
 SR (Simple Reflex) agents
 Model Based reflex agents
 Goal-based agents
 Utility based agents
 Stimulus-Response Agents
 Learning agents

SR (Simple Reflex) agents: These are the agents or machines that have no internal state (i.e., they don’t
remember anything) and simply react to the current percepts in their environments. An interesting set of
agents can be built whose behaviour can be captured in the form of a simple set of
functions of their sensory inputs. One of the earliest implemented agents of this category was called
Machina Speculatrix. This was a device with wheels, motor, photo cells and vacuum tubes and was
designed to move in the direction of light of less intensity and was designed to avoid the direction of the
bright light. A boundary following robot is also an SR agent. For an automobile-driving agent also,
some aspects of its behavior like applying brakes immediately on observing either the vehicle
immediately ahead applying brakes or a human being coming just in front of the automobile suddenly,
show the simple reflex capability of the agent. Such a simple reflex action in the agent program of the
agent can be implemented with the help of simple condition-action rules.

For example: IF a human being comes in front of the automobile suddenly


THEN apply brakes immediately.

Although the implementation of SR agents is simple, on the negative side this type of agent has very
limited intelligence, because it does not store or remember anything. As a consequence, SR agents cannot
make use of any previous experience; in short, they do not learn. Also, they are capable of
operating correctly only if the environment is fully observable.
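Condition-action rules of the kind shown above translate almost directly into code. Below is a minimal sketch of a simple reflex fragment of the driver agent; the percept fields and action names are illustrative assumptions:

def simple_reflex_driver(percept):
    """A simple reflex agent program: no internal state; the action
    depends only on the current percept (here, a dict of readings)."""
    if percept.get("human_in_front"):          # condition ...
        return "apply_brakes_immediately"      # ... action
    if percept.get("vehicle_ahead_braking"):
        return "apply_brakes"
    return "keep_driving"

print(simple_reflex_driver({"human_in_front": True}))   # apply_brakes_immediately
print(simple_reflex_driver({}))                         # keep_driving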

Model-Based Reflex agents: Simple Reflex agents are not capable of handling task environments that are
not fully observable. In order to handle such environments properly, in addition to reflex capabilities, the
agent should maintain some sort of internal state in the form of a function of the sequence of percepts
received up to the time of action by the agent. Using the percept sequence, the internal state is
determined in such a manner that it reflects some of the aspects of the unobservable environment. Further,
in order to reflect properly the unobserved environment, the agent is expected to have a model of the task
environment encoded in the agent’s program, where the model has the knowledge about–

(i) the process by which the task environment evolves independent of the agent and
(ii) effects of the actions of the agent have on the environment.
Thus, in order to handle properly the partial observability of the environment, the agent should have a
model of the task environment in addition to reflex capabilities. Such agents are called Model-Based
Reflex Agents.
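The only structural difference from a simple reflex agent is the internal state and the model used to update it from the percept history. A rough sketch follows, in which the update rule and percept fields are placeholder assumptions:

class ModelBasedReflexAgent:
    """Maintains an internal state summarizing the percept history,
    so it can act sensibly in a partially observable environment."""

    def __init__(self):
        self.state = {}   # internal state: what the agent believes

    def update_state(self, percept):
        # Placeholder model: remember the latest reading of every
        # sensor, so facts seen earlier but currently unobservable
        # remain available for decision making.
        self.state.update(percept)

    def program(self, percept):
        self.update_state(percept)
        if self.state.get("obstacle_seen_recently"):
            return "slow_down"
        return "keep_driving"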

Goal-Based Agents: In order to design an appropriate agent for a particular type of task, we know the
nature of the task environment plays an important role. Also, it is desirable that the complexity of the
agent should be minimum and just sufficient to handle the task in a particular environment. In this regard,
first we discussed the simplest type of agents, viz., Simple Reflex Agents. The action of this type of agent
is decided by the current percept only. Next, we discussed the Model-Based Reflex Agents, for which an
action is decided by taking into consideration not only the latest percept, but the whole percept history
summarized in the form of an internal state. Also, action for this type of agent is decided by taking into
consideration the knowledge of the task environment, represented by a model of the environment and
encoded into the agent’s program. However, in respect of a number of tasks, even this much knowledge
may not be sufficient for appropriate action. For example, when we are going from city A to city B, in
order to take appropriate action, it is not enough to know the summary of actions and the path which has
taken us to some city C between A and B. We also have to remember the goal of reaching city B.

Goal based agents are driven by the goal they want to achieve, i.e., their actions are based on the
information regarding their goal, in addition to, of course, other information in the current state.
This goal information is also a part of the current state description and it describes everything that is
desirable to achieve the goal. As mentioned earlier, an example of a goal-based agent is an agent that is
required to find the path to reach a city. In such a case, if the agent is an automobile driver agent, and if
the road is splitting ahead into two roads, then the agent has to decide which way to go to achieve its goal
of reaching its destination. Further, if there is a crossing ahead then the agent has to decide, whether to go
straight, to go to the left or to go to the right. In order to achieve its goal, the agent needs some
information regarding the goal which describes the desirable events and situations to reach the goal. The
agent program would then use this goal information to decide the set of actions to take in order to reach
its goal.
Another desirable capability which a good goal-based agent should have is that if the agent finds that a
part of the sequence of the previous steps has taken it away from its goal, then it should be able to
retract and restart its actions from a point which may take it toward the goal.
In order to take appropriate action, decision-making process in goal-based agents may be simple or quite
complex depending on the problem. Also, the decision-making required by the agents of this kind
needs some sort of looking into the future. For example, it may analyze the possible outcome of a
particular action before it actually performs that action. In other words, we can say that the agent would
perform some sort of reasoning of if-then-else type, e.g., an automobile driver agent having one of its
goals as not to hit any vehicle in front of it, when finds the vehicle immediately ahead of it slowing down
may not apply brakes with full force and instead may apply brakes slowly so that the vehicles following it
may not hit it.

As the goal-based agents may have to reason before they take an action, these agents might be slower
than other types of agents but will be more flexible in taking actions as their decisions are based on the
acquired knowledge which can be modified also. Hence, as compared to SR agents which may require
rewriting of all the condition-action rules in case of change in the environment, the goal-based agents can
adapt easily when there is any change in its goal.

Utility-Based Agents: A goal-based agent’s success or failure is judged in terms of its capability for
achieving or not achieving its goal. A goal-based agent, for a given pair of environment state and possible
input, only knows whether the pair will lead to the goal state or not. Such an agent will not be able to
decide in which direction to proceed when there are two or more conflicting goals. Also, in a goal-based
agent, there is no concept of partial or somewhat satisfactory success. Further, if there is
more than one method of achieving a goal, then no mechanism is incorporated in a goal-based agent for
choosing or finding the method which is the faster and more efficient one, out of the available ones, to reach
its goal.

A more general way to judge the success or happiness of an agent may be, through assigning to each state
a number as an approximate measure of its success in reaching the goal from the state. In case, the agent
is embedded with such a capability of assigning such numbers to states, then it can choose, out of the
reachable states in the next move, the state with the highest assigned number, out of the numbers assigned
to various reachable states, indicating possibly the best chance of reaching the goal.

It will allow the goal to be achieved more efficiently. Such an agent will be more useful, i.e., will have
more utility. A utility-based agent uses a utility function, which maps each of the world states of the
agent to some degree of success. If it is possible to define the utility function accurately, then the agent
will be able to reach the goal quite efficiently. Also, a utility-based agent is able to make decisions in case
of conflicting goals, generally choosing the goal with the higher success rating or value. Further, in
environments with multiple goals, the utility-based agent quite likely chooses, out of the multiple goals,
the goal with the least cost or the higher utility.
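At its core, such an agent needs only a utility function over states and a rule that picks the reachable successor of highest utility. A minimal sketch follows, assuming a successors(state) function yielding (action, next_state) pairs and a numeric utility(state); both are supplied by the agent designer:

def utility_based_choice(state, successors, utility):
    """Choose the action leading to the successor of highest utility.
    `successors` and `utility` are assumed to be supplied by the
    agent designer for the problem at hand."""
    best_action, best_value = None, float("-inf")
    for action, next_state in successors(state):
        value = utility(next_state)
        if value > best_value:
            best_action, best_value = action, value
    return best_action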

Stimulus-Response Agents: A stimulus-response agent (or a reactive agent) takes input from the world
through sensors, and then acts, based on those inputs, through actuators. Between the stimulus and the
response there is a processing unit that can be arbitrarily complex. An example of such an agent is one
that controls a vehicle in a racing game: the agent “looks” at the road and nearby vehicles, and then
decides how much to turn and brake. Such agents (Stimulus-Response agents are the Reactive agents)
represent a special category of agents which do not possess internal, symbolic models of their
environments; instead, they act/respond in a stimulus-response manner to the present state of the
environment in which they are embedded. These agents are relatively simple and they interact with other
agents in basic ways. Nevertheless, complex patterns of behavior emerge from the interactions when the
ensemble of agents is viewed globally.

Learning Agents :It is not possible to encode all the knowledge in advance, required by a rational agent
for optimal performance during its lifetime. This is especially true of the real life, and not just theoretical,
environments. These environments are dynamic in the sense that the environmental conditions change,
not only due to the actions of the agents under considerations, but due to other environmental factors also.
For example, all of a sudden, a pedestrian comes just in front of the moving vehicle, even when there is
green signal for the vehicle. In a multi-agent environment, all the possible decisions and actions an agent
is required to take, are generally unpredictable in view of the decisions taken and actions performed
simultaneously by other agents. Hence, the ability of an agent to succeed in an uncertain and
unknown environment depends on its learning capability, i.e., its capability to appropriately change
its knowledge of the environment. For an agent with learning capability, some initial knowledge is coded
in the agent program and, after the agent starts operating, it learns from its actions, the evolving
environment, the actions of its competitors or adversaries, etc. so as to improve its performance in an ever-
changing environment. If an appropriate learning component is incorporated in the agent, then the
knowledge of the agent gradually increases after each action, starting from the initial knowledge which was
manually coded into it at the start.

Conceptually the learning agent consists of four components:


(i) Learning Component: It is the component of the agent, which on the basis of the percepts and the
feedback from the environment, gradually improves the performance of the agent.

(ii) Performance Component: It is the component from which all actions originate on the basis of
external percepts and the knowledge provided by the learning component.

The design of learning component and the design of performance element are very much related to
each other because a learning component is of no use unless the performance component can be designed
to convert the newly acquired knowledge into better useful actions.

(iii) Critic Component: This component finds out how well the agent is doing with respect to a certain
fixed performance standard and is also responsible for any future modifications in the performance
component. The critic is necessary to judge the agent’s success with respect to the chosen
performance standard, especially in a dynamic environment. For example, in order to check
whether a certain job is accomplished, the critic will not depend on external percepts only, but will
also compare the current state to the state which indicates the completion of that task.

(iv) Problem Generator Component: This component is responsible for suggesting actions (some of
which may not be optimal) in order to gain some fresh and innovative experience. Thus, this
component allows the agent to experiment a little, traversing sometimes uncharted territory by
choosing new and possibly suboptimal actions. This may be useful because actions which
seem suboptimal in the short run may turn out to be much better in the long run.

In the case of an automobile driver agent, this agent would be of little use if it does not have learning
capability, as the environment in which it has to operate is totally dynamic and unpredictable in nature.
Once the automobile driver agent starts operating, it keeps on learning from its experiences, both
positive and negative. If faced with a totally new and previously unknown situation, e.g., encountering a
vehicle coming from the opposite direction on a one-way road, the problem generator component of the
driver agent might suggest some innovative action to tackle this new situation. Moreover, the learning
becomes more difficult in the case of an automobile driver agent, because the environment is only
partially observable.
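The four components can be wired together in one decision cycle: the problem generator occasionally proposes exploratory actions, the critic grades the outcome against the performance standard, and the learning component updates the knowledge that the performance component uses. The following rough sketch assumes all four components are supplied as functions; every name here is an illustrative placeholder:

import random

def learning_agent_step(percept, knowledge, performance, critic,
                        learner, problem_generator, explore=0.1):
    """One cycle of a learning agent built from the four components
    described above. `performance(percept, knowledge)` returns an
    action, `problem_generator` proposes a possibly suboptimal
    exploratory action, `critic` scores the outcome against a fixed
    standard, and `learner` folds the feedback into the knowledge."""
    if random.random() < explore:
        action = problem_generator(percept, knowledge)  # explore
    else:
        action = performance(percept, knowledge)        # exploit
    feedback = critic(percept, action)
    knowledge = learner(knowledge, percept, action, feedback)
    return action, knowledge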
Different Forms of Learning in Agents: The purpose of embedding learning capability in an agent is
that it should not depend totally on the knowledge initially encoded in it and on the external percepts for
its actions. The agent learns by evaluating its own decisions and/or making observations of new situations
it encounters in the ever-changing environment.

There may be various criteria for developing learning taxonomies. The criteria may be based on –
 The type of knowledge learnt, e.g., concepts, problem-solving or game playing,
 The type of representation used, e.g., predicate calculus, rules or frames,
 The area of application, e.g., medical diagnosis, scheduling or prediction.

☞ Check Your Progress 4


Q9 What are Intelligent agents in AI? Briefly discuss the properties of Agents.

……………………………………………………………………………………………
……………………………………………………………………………………………
Q10 What are Task environments? Briefly discuss the standard set of measures or parameters for
specifying a task environment under the heading PEAS.

……………………………………………………………………………………………
……………………………………………………………………………………………
1.10 SUMMARY

In this unit we learned about the difference between knowledge and intelligence and also pointed
out the meaning of Artificial Intelligence (AI), along with the application of AI systems in various
fields. The unit also covered the historical development of the field of AI. Along with the
development of AI as a discipline, the need for classification of AI systems was felt, and hence
the unit discussed the classification of AI systems in detail. Further, the unit discussed the
concepts of Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL).
Finally, the unit discussed the components of Intelligence, which was extended to the
understanding of the concepts of Intelligent Agents, with special emphasis on Stimulus-Response
Agents.

1.11 SOLUTIONS/ANSWERS

☞ Check Your Progress 1


Q1 How Knowledge differs from intelligence? What do you understand by the term Artificial
Intelligence (AI) ? List the various technologies and their corresponding illustrative solutions.

Sol- Refer section 1.3

Q2 Classify AI on the basis of the functionalities of AI

Sol- Refer section 1.3

Q3 Compare ANI, AGI and ASI, in context of AI

Sol- Refer section 1.3

☞ Check Your Progress 2


Q4 Briefly discuss the various components of intelligence

Sol – Refer Section 1.5

Q5 How do we measure whether Artificial Intelligence is making a machine behave, act or perform like a
human being or not?

Sol – Refer Section 1.6

Q6 What is Turing Test? What is the Criticism to the Turing Test ?

Sol – Refer Section 1.6

☞ Check Your Progress 3


Q7 Compare Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).

Sol – Refer Section 1.7


Q8 Compare Descriptive, Predictive and Prescriptive analytics performed under Machine
Learning.

Sol – Refer Section 1.7

☞ Check Your Progress 4


Q9 What are Intelligent agents in AI? Briefly discuss the properties of Agents.

Sol – Refer Section 1.9

Q10 What are Task environments? Briefly discuss the standard set of measures or parameters for
specifying a task environment under the heading PEAS.

Sol – Refer Section 1.9

1.12 FURTHER READINGS

1. Ela Kumar, “Artificial Intelligence”, IK International Publications


2. E. Rich and K. Knight, “Artificial Intelligence”, Tata McGraw Hill Publications
3. N.J. Nilsson, “Principles of AI”, Narosa Publ. House Publications
4. John J. Craig, “Introduction to Robotics”, Addison Wesley publication
5. D.W. Patterson, “Introduction to AI and Expert Systems" Pearson publication
UNIT 2 PROBLEM SOLVING USING SEARCH
Structure

2.0 Introduction
2.1 Objectives
2.2 Introduction to State Space Search
2.2.1 Structure of a State Space
2.2.2 Problem Solution
2.2.3 Problem Formulation
2.2.4 Searching for Solutions
2.3 Formulation of 8-Puzzle Problem from AI Perspective
2.4 N-Queen’s Problem - Formulation and Solution
2.4.1 Formulation of 4-Queen’s Problem
2.4.2 State Space Tree for 4-Queen’s Problem
2.4.3 Backtracking Approach to Solve N-Queen’s Problem
2.5 Two Agent Search: Adversarial Search
2.5.1 Elements of Game Playing Search
2.5.2 Issues in Adversarial Search
2.6 Minimax Search Strategy
2.6.1 Minimax Algorithm
2.6.2 Working of Minimax Algorithm
2.6.3 Properties of Minimax Algorithm
2.6.4 Advantages and Disadvantages of Minimax Search
2.7 Alpha-Beta Pruning Algorithm
2.7.1 Working of Alpha-Beta Pruning
2.7.2 Move Ordering of Alpha-Beta Pruning
2.8 Summary
2.9 Solutions/Answers
2.10 Further Readings

2.0 INTRODUCTION
Many AI-based applications need to figure out how to solve problems. There are two types of
problems in the world. First, problems which can be solved by using a deterministic procedure,
where success is guaranteed. But most real-world problems can be solved only by searching for a
solution. AI is concerned with this second type of problem solving.

To build a system to solve a problem, we need to

 Define the problem precisely-find initial and final configuration for acceptable solution
to the problem.

 Analyse the problem-find few important features that may have impact on the
appropriateness of various possible techniques for solving the problem
 Isolate and represent task knowledge necessary to solve the problem

 Choose the best problem-solving technique(s) and apply it to the particular problem.

To provide a formal description of a problem, we need to do the following:

a. Define a state space that contains all the possible configurations of the relevant objects.

b. Specify one or more states that describe possible situations, from which the problem-
solving process may start. These states are called initial states.

c. Specify one or more than one goal states.

d. Defining a set of rules for the actions (operators) that can be taken.

The problem can then be solved by using the rules, in combination with an appropriate control
strategy, to move through the problem space until a path from an initial state to a goal state is
found. This process is known as ‘search’. Thus, search is fundamental to the problem-solving
process. Search is a general mechanism that can be used when a more direct method is not
known. Search provides the framework into which more direct methods for solving subparts of a
problem can be embedded. All AI problems are formulated as search problems.

A problem space is represented by a directed graph, where nodes represent search state and
paths represent the operators applied to change the state. To simplify search algorithms, it is
often convenient to logically and programmatically represent a problem space as a tree. A tree
usually decreases the complexity of a search at a cost. Here, the cost is due to duplicating some
nodes on the tree that were linked numerous times in the graph, e.g., node B and node D shown
in example below.

[Figure: a graph over the nodes A, B, C and D and the corresponding search tree; nodes B and D, which are reachable along several paths in the graph, appear more than once in the tree]
A tree is a graph in which any two vertices are connected by exactly one path. Alternatively, any
connected graph with no cycles is a tree.

Before an AI problem can be solved it must be represented as a state space. Here state means
representation of elements at a given moment. Among all possible states, there are two special
states called initial state (the start point) and final state (the goal state). A successor function (a
set of operators)is used to change the state. It is used to move from one state to another. A state
space is the set of all states reachable from the initial state. A state space essentially consists of a
set of nodes representing each state of the problem, arcs between nodes representing the legal
moves from one state to another, an initial state, and a goal state. Each state space takes the form
of a tree or a graph. In AI, a wide range of problems can be formulated as search problems. The
process of searching means a sequence of actions that take you from an initial state to a goal state,
as shown in the following figure 1.

Fig 1: A sequence of action in a search space from initial to goal state.

So, state space is one of the methods to represent a problem in AI. A set of all possible
states reachable from the initial state by taking some sequence of actions (using some operators)
for a given problem is known as the state space of the problem. “A state space represents a
problem in terms of states and operators that change states.”

In this unit we examine the concept of a state space and the different search process that can be
used to explore the search space in order to find a solution (Goal) state. In the worst case, search
explores all possible paths between the initial state and the goal state.

For a better understanding of the definitions described above, consider the following 8-puzzle
problem:

Eight-Puzzle problem Formulation from AI perspectives

initial state: some configuration of the 8-tiles on a 9-cell board.


operators (Actions): it is easier if we focus on the blank. There are 4 operators, that is, “moving
the blank”: UP, DOWN, LEFT and RIGHT.

3 1 2                          3   2
4   5      Move the blank      4 1 5
6 7 8      (UP)                6 7 8
Goal state: Tiles in a specific order
1 2
3 4 5
6 7 8

Solution: Optimal sequence of operators (Actions)


Path costs:
cost of each action = 1
cost of a sequence of actions= the number of actions

A state space representation of 8-puzzle problem is shown in figure-2

8-Puzzle
Start Goal

3 1 2 1 2

4 5 3 4 5

6 7 8 6 7 8

up down Left right

3 2 3 1 2 3 1 2 3 1 2

4 1 5 4 7 5 4 5 4 5

6 7 8 6 8 6 7 8 6 7 8

up

1 2
Goal 3 4 5

6 7 8

Fig 2: A state space representation of the 8-puzzle problem generated by the “Move Blank” operator
2.1 OBJECTIVES
After studying this unit, you should be able to:
 Understand the state space search
 Formulate the problems in the form of state space
 Understand how implicit state space can be unfolded during search
 Explain State space search representation for Water-Jug, 8-puzzle and N-Queen’s
problem.
 Solve N Queen’s Problem using Backtracking approach
 Understand adversarial search (two agent search)
 Differentiate between Minimax and Alpha-beta pruning search algorithm.

2.2 Introduction to State Space Search


It is necessary to represent an AI problem in the form of a state space before it can be solved. A
state space is the set of all states reachable from the initial state. A state space forms a graph in
which the nodes are states and the arcs between nodes are actions. In state space, a path is a
sequence of states connected by a sequence of actions.

2.2.1 Structure of a state space:

The structures of state space are trees and graphs. A tree has one and only one path from any
point to any other point. Graph consists of a set of nodes (vertices) and a set of edges (arcs). Arcs
establish relationship (connections) between the nodes, i.e., a graph has several paths to a given
node. Operators are directed arcs between nodes.
The method of solving a problem through AI involves the process of defining the search space,
deciding the start and goal states and then finding the path from the start state to the goal state
through the search space.
Search process explores the state space. In the worst case, the search explores all possible paths
between the initial state and the goal state.

2.2.2 Problem Solution:

In a state space, a solution is a path from the initial state to a goal state or, sometimes, just a goal
state. A numeric cost is assigned to each path. It also gives the cost of applying the operators to
the states. A path cost function is used to measure the quality of a solution and, out of all possible
solutions, an optimal solution has the lowest path cost. The importance of cost depends on the
problem and the type of solution asked for.
2.2.3 Problem formulation:
Many problems can be represented as state spaces. The state space of a problem includes: an
initial state, one or more goal states, and a set of state transition operators (or a set of production rules)
used to change the current state to another state. These operators are also known as actions. A control
strategy is used that specifies the order in which the rules will be applied, for example, Depth-
first search (DFS), Breadth-first search (BFS), etc. It helps to find the goal state or a path to the
goal state.
In general, a state space is represented by a 4-tuple as follows: Ss : [S, s0, O, G]

Where S: Set of all possible states.


s0 : start state (initial configuration) of the problem, s0 ∈ S.
O: Set of production rules (or set of state transition operator)
used to change the state from one state to another. It is the set
of arcs (or links) between nodes.

The production rule is represented in the form of a pair. Each pair consists of a left side that
determines the applicability of the rule and a right side that describes the action to be performed,
if the rule is applied.

G: Set of goal states, G ⊆ S.

The sequence of actions (or operators) is called a solution path. It is a path from the initial state
to a goal state. This sequence of actions leads to a sequence of states, starting from the initial state
and ending in a goal state, {s0, s1, s2, …, sn}, with sn ∈ G. A sequence of states is called a path. The cost of a path is a
positive number; in most cases the path cost is computed as the sum of the costs of the individual
actions.
The following figure 3 shows a search process in a given state space.

State Space

Initial state

actions

Goal State

Fig 3 State space with initial and goal node


We need to identify a sequence of actions that will turn the initial state s0 into the desired goal
state G. A state space is commonly defined as a directed graph or as a tree in which each node is a
state and each arc represents the application of an operator transforming a state into a successor
state.
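The 4-tuple [S, s0, O, G] maps directly onto a small programming interface, and a search routine needs nothing more. The sketch below is one possible rendering (the class and function names are our own, and breadth-first order is chosen only as an example of a control strategy):

from collections import deque

class StateSpaceProblem:
    """A state space as a 4-tuple: s0 is the start state, the
    operators O are given as a successor function yielding
    (action, next_state) pairs, and G is given as a goal test.
    S, the set of all states, is implicit: everything reachable
    from s0."""
    def __init__(self, s0, successors, is_goal):
        self.s0 = s0
        self.successors = successors
        self.is_goal = is_goal

def search(problem):
    """Explore the state space outward from s0 (here breadth-first)
    and return the sequence of actions reaching a goal state."""
    frontier = deque([(problem.s0, [])])
    visited = {problem.s0}
    while frontier:
        state, path = frontier.popleft()
        if problem.is_goal(state):
            return path
        for action, nxt in problem.successors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, path + [action]))
    return None   # the reachable state space contains no goal

For instance, the Water-Jug problem of the next example fits this interface directly.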

2.2.4 SEARCHING FOR SOLUTIONS


After formulating our problem, we are ready to solve it. This can be done by searching through
the state space for a solution; this search is applied on a search tree, or generally a graph,
that is generated using the initial state and the successor function.

Searching is applied to a search tree which is generated through state expansion, that is, applying
the successor function to the current state (note that here by a state we mean a node in the search
tree).
Generally, search is about selecting an option and putting the others aside for later, in case the
first option does not lead to a solution. The choice of which option to expand first is determined
by the search strategy used.
Thus, the problem is solved by using the rules (operators), in combination with an appropriate
control strategy, to move through the problem space until a path from the initial state to a goal state
is found. This process is known as search. A solution path is a path in the state space from s0 (the initial
state) to G (a goal state).

Example1: State space representation of Water-Jug problem (WJP):

Problem statement:
Given two jugs, a 4-gallon and a 3-gallon, neither of which has any measuring indicator on
it. The jugs can be filled with water with the help of a pump that is available (as shown in
figure 4).

The question is: “How can you get exactly 2 gallons of water into the 4-gallon jug?”

The water Jug Problem


4gl

3gl

Fig 4 The Water Jug problem


The state space of this problem can be described as a collection of ordered pairs
of integers (x, y), where x = 0, 1, 2, 3 or 4 and y = 0, 1, 2 or 3.
In the ordered pair (x, y), x is the amount of water in the four-gallon jug and y is the amount of water in
the three-gallon jug.
State space: all possible combinations of (x, y)
The start state is (0, 0) and
the goal state is (2, n), where n = 0, 1, 2, 3.
The following table-1 shows the set of production rules (actions) that can be used to change one
state to another.
Table-1 : Production rules for Water Jug problem

Production Rules

Rule No Production Meaning


R1 (𝑥, 𝑦 | 𝑥 < 4) → (4, 𝑦) Fill 4-gallon jug
R2 (𝑥, 𝑦 | 𝑦 < 3) → (𝑥, 3) Fill 3-gallon jug
R3 (𝑥, 𝑦 | 𝑥 > 0) → (0, 𝑦) Empty 4-gallon jug
R4 (𝑥, 𝑦 | 𝑦 > 0) → (𝑥, 0) Empty 3-gallon jug
R5 (𝑥, 𝑦|𝑥 + 𝑦 ≥ 4 𝑎𝑛𝑑 𝑦 > 0) Pour water from 3-gallon jug into 4-gallon
→ (4, 𝑦 − (4 − 𝑥)) jug until 4-gallon jug is full
R6 (𝑥, 𝑦|𝑥 + 𝑦 ≥ 3 𝑎𝑛𝑑 𝑥 > 0) Pour water from 4-gallon jug into 3-gallon
→ (𝑥 − (3 − 𝑦), 3) jug until 3-gallon jug is full
R7 (𝑥, 𝑦|𝑥 + 𝑦 ≤ 4 𝑎𝑛𝑑 𝑦 > 0) → (𝑥 + 𝑦, 0) Pour all water from 3-gallon jug into 4-
gallon jug.
R8 (𝑥, 𝑦|𝑥 + 𝑦 ≤ 3 𝑎𝑛𝑑 𝑥 > 0) → (0, 𝑥 + 𝑦) Pour all water from 4-gallon jug into 3-
gallon jug.
R9 (𝑥, 𝑦 | 𝑥 > 0) → (𝑥 − 𝑑, 𝑦) Pour some water d out from the 4-gallon jug
R10 (𝑥, 𝑦 | 𝑦 > 0) → (𝑥, 𝑦 − 𝑑) Pour some water d out from the 3-gallon jug

The following 2 solutions are found for the problem “how can you get exactly 2 gallons of water
into the 4-gallon jug”, as shown in Table-2 and Table-3.

Solution-1:
Table-2 Getting exactly 2 gallons of water into the 4-gallon jug (solution 1)

4-gal jug 3-gal jug Rule applied


0 0 Initial state
4 0 R1 {fill 4-gal jug}
1 3 R6 {Pour water from 4-gal jug into 3-gal jug until it is full}
1 0 R4 {empty 3-gal jug}
0 1 R8 {Pour all water from 4-gal jug into 3-gal jug}
4 1 R1 {Fill 4-gal jug}
2 3 R6 {Pour water from 4-gal jug to 3-gal jug until it is full}
2 0 R4 {empty 3-gal jug}
Solution-2
Table-3 Getting exactly 2 gallons of water into the 4-gallon jug (solution 2)
4-gal jug 3-gal jug Rule applied
0 0 Initial state
0 3 R2 {fill 3-gal jug}
3 0 R7 {Pour all water from 3-gal jug into 4-gal jug}
3 3 R2 {fill 3-gal jug}
4 2 R5 {Pour from 3 to 4-gal jug until it is full}
0 2 R3 {Empty 4-gal jug}
2 0 R7 {Pour all water from 3-gal jug to 4-gal jug}

A state space tree for the WJP with all possible solutions is shown in figure 5.

[0,0]

[4,0] [0,3]

[4,3] [0,0] [1,3] [4,3] [0,0] [3,0]

[1,0] [3,3]

[0,1] [4,2]

[4,1] [0,2]

[2,3] [2,0]

Fig 5 All possible solutions of the WJP shown as a state space tree
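The production rules R1-R8 of Table-1 can be encoded as a successor function and the state space searched mechanically. The sketch below uses breadth-first search as the control strategy; rules R9 and R10, which pour out an arbitrary amount d, are omitted since they never help in reaching the goal. All function names are our own:

from collections import deque

def wj_successors(state):
    """Successor states of a Water-Jug state (x, y), where x is the
    water in the 4-gallon jug and y in the 3-gallon jug (rules R1-R8)."""
    x, y = state
    candidates = [
        ("R1 fill 4-gal",           (4, y)),
        ("R2 fill 3-gal",           (x, 3)),
        ("R3 empty 4-gal",          (0, y)),
        ("R4 empty 3-gal",          (x, 0)),
        ("R5 pour 3->4 until full", (4, y - (4 - x))),
        ("R6 pour 4->3 until full", (x - (3 - y), 3)),
        ("R7 pour all 3->4",        (x + y, 0)),
        ("R8 pour all 4->3",        (0, x + y)),
    ]
    # Keep only legal, state-changing moves.
    return [(r, s) for r, s in candidates
            if 0 <= s[0] <= 4 and 0 <= s[1] <= 3 and s != state]

def solve_water_jug(start=(0, 0)):
    """Breadth-first search for a state (2, n): 2 gallons in the
    4-gallon jug. Returns the list of rules applied."""
    frontier, visited = deque([(start, [])]), {start}
    while frontier:
        state, path = frontier.popleft()
        if state[0] == 2:
            return path
        for rule, nxt in wj_successors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, path + [rule]))

print(solve_water_jug())   # one shortest rule sequence

Because breadth-first search expands states level by level, the path it returns is a shortest one, matching the 6-step solutions in Tables 2 and 3.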

2.3 Formulation of 8 Puzzle problem

The eight-tile puzzle consists of a 3-by-3 (3 × 3) square frame board which holds eight (8)
movable tiles numbered 1 to 8. One square is empty, allowing the adjacent tiles to be shifted.
The objective of the puzzle is to find a sequence of tile movements that leads from a starting
configuration to a goal configuration.

The Eight Puzzle Problem formulation:

Given a 3 × 3 grid with 8 sliding tiles and one “blank”


Initial state: some other configuration of the tiles, for example
3 1 2
4 5
6 7 8
Goal state:
1 2
3 4 5
6 7 8

Operator: Slide tiles (Move Blank) to reach the goal (as shown below). There are 4 operators
that is, “Moving the blank”:
Move the blank UP,
Move the blank DOWN,
Move the blank LEFT and
Move the blank RIGHT.

3 1 2 3 1 2 1 2
4 5 4 5 3 4 5
6 7 8 6 7 8 6 7 8

Fig 6 Moving the blank LEFT and then UP

Path Cost: Sum of the costs of each step from the initial state to the goal state. Here the cost of each action (blank
move) = 1, so the cost of a sequence of actions = the number of actions. An optimal solution is one
which has the lowest-cost path.
Performing State-Space Search: Basic idea:

If the initial state is a goal state, return it.


If not, apply the operators to generate all states that are one step from the initial state (its
successors)
3 1 2 3 2 3 1 2 3 1 2 3 1 2
4 5 4 1 5 4 7 5 4 5 4 5
6 7 8 6 7 8 6 8 6 7 8 6 7 8
initial state Its successors

Fig 7 All possible successors for a given initial state

Consider the successors (and their successors, and so on) until you find a goal state.
Different search strategies consider the states in different orders, and they may use different data
structures to store the states that have yet to be considered.
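Generating the successors amounts to locating the blank and applying each legal “move the blank” operator. A minimal sketch follows, representing a board as a tuple of 9 entries read row by row, with 0 standing for the blank (this encoding is our own choice):

def puzzle_successors(board):
    """Successors of an 8-puzzle state: slide the blank UP, DOWN,
    LEFT or RIGHT whenever the move stays on the 3 x 3 board."""
    i = board.index(0)               # position of the blank
    row, col = divmod(i, 3)
    moves = {"UP": (-1, 0), "DOWN": (1, 0),
             "LEFT": (0, -1), "RIGHT": (0, 1)}
    result = []
    for name, (dr, dc) in moves.items():
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            j = 3 * r + c
            nxt = list(board)
            nxt[i], nxt[j] = nxt[j], nxt[i]   # swap blank and tile
            result.append((name, tuple(nxt)))
    return result

start = (3, 1, 2, 4, 0, 5, 6, 7, 8)   # the initial state of Fig 7
for name, s in puzzle_successors(start):
    print(name, s)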

State-Space Search Tree for 8-Puzzle problem:

The predecessor reference connects the search nodes, creating a data structure known as a tree.
3 1 2
4 5 Initial state
6 7 8

3 2 3 1 2 3 1 2 3 1 2
4 1 5 4 7 5 4 5 4 5
6 7 8 6 8 6 7 8 6 7 8
. . . . . . . . .
1 2 3 1 2 3 1 2
3 4 5 6 4 5 4 5
6 7 8 7 8 6 7 8

Fig 8 Tracing the tree bottom-up from the Goal state to the initial state

When we reach a goal, we trace up the tree to get the solution i.e., the sequence of actions from
the initial state to the goal.

Q.1 Find the minimum-cost path for the 8-puzzle problem, where the start and goal states are
given as follows:

Initial State: {(1,2,3),(4,8,-),(7,6,5)}

Successor State: {(1,2,3),(4,8,5),(7,6,-)}; Move 5 up (i.e., blank down)

{(1,2,3),(4,8,5),(7,-,6)}; Move 6 right (i.e., blank left)

{(1,2,3),(4,-,5),(7,8,6)}; Move 8 down (i.e., blank up)

{(1,2,3),(4,5,-),(7,8,6)}; Move 5 left (i.e., blank right)

Goal State: {(1,2,3),(4,5,6),(7,8,-)}; Move 6 up (i.e., blank down)

PATH COST = 5
2.4 N Queen’s Problem-Formulation and Solution

The N-Queens problem is the problem of placing N queens (Q1, Q2, Q3, …, Qn) on an N × N chessboard so


that no two queens attack each other. The colour of the queens is meaningless in this puzzle, and any
queen is assumed to be able to attack any other. So, a solution requires that no two queens share the same
row, column, or diagonal.
The N-queen problem must follow the following rules:
1. There is at most one queen in each column.
2. There is at most one queen in each row.
3. There is at most one queen in each diagonal.

Fig-9 No two queens placed on same row, column or diagonal

The N-Queens problem was originally proposed in 1848 by the chess player Max Bezzel, and over the
years, many mathematicians, including Gauss, have worked on this puzzle. In 1874, S. Gunther proposed
a method of finding solutions by using determinants, and J.W.L. Glaisher refined this approach.
Solutions that differ only by symmetry operations (rotations and reflections) of the board are
counted as one. For the 4-Queens problem, out of all possible arrangements of 4 queens on the 16
squares of a 4 × 4 chessboard, there are only 2 possible solutions. Note that there is only
1 unique solution out of the 2 possible solutions, as the second solution is just a mirror image of the first
solution.

Fig 10 Two possible solutions of 4-Queen’s problem

Similarly, one possible solution for the 8-Queens problem is shown in figure 11.
The 8-Queens problem is computationally very expensive, since the total number of possible
arrangements of 8 queens on an 8 × 8 chessboard is ⁶⁴C₈ = 64!/(56! × 8!) ≈ 4.4 × 10⁹. Note that the 8-
Queens problem has 92 distinct solutions and 12 unique solutions, as shown in Table-4.
column
1 2 3 4 5 6 7 8
1 Q
2 Q
Row 3 Q
4 Q
5 Q
6 Q
7 Q
8 Q

Fig 11 One possible solution of 8 Queen’s problem

8-tuple = (4, 6, 8, 2, 7, 1, 3, 5)

The following Table-4 summarizes both the unique and the distinct solutions for the problems from 1 queen
up to 26 queens. In general, there is no known formula for the exact number of solutions
to the N-Queens problem.

Table-4 Solution of N Queen’s problem for N=1 to N=26, both Unique and Distinct

No. of Queens    Unique Solutions    Total Distinct Solutions
1 1 1
2 0 0
3 0 0
4 1 2
5 2 10
6 1 4
7 6 40
8 12 92
9 46 352
10 92 724
11 341 2,680
12 1787 14,200
13 9233 73,712
14 45,752 365,596
15 285,053 2,279,184
16 1,846,955 14,772,184
17 11,977,939 95,815,104
18 83,263,591 666,090,624
19 621,012,754 4,968,057,848
20 4,878,666,808 39,029,188,884
21 39,333,324,973 314,666,222,712
22 336,376,244,042 2,691,008,701,644
23 3,029,242,658,210 24,233,937,684,440
24 28,439,272,956,934 227,514,171,973,736
25 275,986,683,743,434 2,207,893,435,808,352
26 2,789,712,466,510,289 22,317,699,616,364,044

2.4.1 Formulation of 4-Queen’s problem:

States: any arrangement of 0 to 4 queens on the board

Initial state: 0 queens on the board

Successor function: Add queen in any square

Goal test: 4 queens on the board, none attacked

For the initial state, there are 16 successors. At the next level, each of the states has 15
successors, and so on down the line. This search tree can be restricted by considering only those
successors where no queens are attacking each other. To do that, we have to check the new
queen against all the other queens on the board. In this way, the answer is found at a depth of 4. For the
sake of simplicity, consider the 4-Queens problem and see how it is solved using the concept of
“Backtracking”.

2.4.2 State space tree for 4 Queen’s Problem


We place queens row by row (i.e., Q1 in row 1, Q2 in row 2 and so on).
Backtracking gives “all possible solutions”. If you want an optimal solution, then go for Dynamic
Programming.
Let us see the backtracking method: there are ¹⁶C₄ ways to place 4 queens on a 4 × 4 chessboard, as
shown in the following state space tree (figure 12). In the tree, the value (i, j) means that in the i-th row,
the queen is placed in the j-th column.

Fig 12 State-space tree showing all possible ways to place a queen on a 4x4 chess board
So, to reduce the size of the tree (the queens cannot be placed just anywhere on the chessboard, since there are ¹⁶C₄
possibilities), we place queens row by row, with no two queens in the same column. The resulting tree is
called a permutation tree (here we avoid the same row and the same column but allow diagonals).

Total nodes=1 + 4 + 4 × 3 + 4 × 3 × 2 + 4 × 3 × 2 × 1 = 65

The edges are labeled by the possible values of xi. Edges from level 1 to level 2 nodes specify the
values for x1; in general, edges from level i to level i + 1 are labeled with the values of xi.
The solution space is defined by all paths from the root node to a leaf node. There are 4! = 24 leaf
nodes in the tree, and the nodes are numbered in depth-first search order. The state space tree for the 4-
Queens problem (avoiding the same row and the same column but allowing diagonals) is shown in figure
13.

Fig 13 State space tree for the 4-Queens problem (allowing the same diagonal but not the same row or
column)

The two solutions found in the tree are:

(x1, x2, x3, x4) = (2, 4, 1, 3) and
(x1, x2, x3, x4) = (3, 1, 4, 2), which are shown in figure 14.

Fig 14 Two possible solutions of 4-Queen’s problem


Note that the second solution is just a mirror image of the first solution.

We can further reduce the search space by avoiding diagonals as well. Now we avoid the same row,
the same column and the same diagonal while placing any queen. In this case, the state space tree
looks as shown in figure 15.

Fig 15 State space tree for 4 queen’s problem (avoiding same row, columns, and diagonal)

Note that queens are placed row by row, that is, Q1 in row 1, Q2 in row 2 and so on. In the tree,
node (1,1) is a promising node (no queen attacks another) as Q1 is placed in the 1st row and 1st column. Node
(2,1) is a non-promising node, because we cannot place Q2 in the same column (Q1 is already
placed in column 1). Note that a non-promising node is marked with ×. So, we try (2,2), again
non-promising (due to the same diagonal), and next try (2,3); it is a promising node, so we proceed and try to
place the 3rd queen in the 3rd row. But in the 3rd row, all positions (3,1), (3,2), (3,3) and (3,4) are non-
promising and we cannot place Q3 in any of these positions. So, we backtrack and try (2,4), and so on.
The backtracking approach gives “all possible solutions”. Figure 15 shows one
possible solution for the 4-Queens problem as {(1,2),(2,4),(3,1),(4,3)}. This can also be written as
(x1, x2, x3, x4) = (2, 4, 1, 3). There are 2 possible solutions of the 4-Queens problem. The other
solution is (x1, x2, x3, x4) = (3, 1, 4, 2), which is a mirror image of the 1st solution.

2.4.3 Backtracking Approach to solve N queen’s Problem:

Consider the chessboard squares as the indices of the 2-dimensional array [1…n, 1…n]. We observe
that every element on the same diagonal that runs from the upper left to the lower right has the same
(row − column) value; these are called left diagonals. Similarly, every element on the same diagonal
that goes from the upper right to the lower left has the same (row + column) value; these are called
right diagonals. For example, consider a 5 × 5 chessboard as [1…5, 1…5] (as shown in figure 16).
Case 1 (left diagonal): suppose queens are placed on the same diagonal at locations
(1,2), (2,3), (3,4), (4,5), or at (1,4), (2,5), or on any other common left diagonal. Observe that every
element on the same diagonal has the same (row − column) value. Similarly,
Case 2 (right diagonal): suppose queens are placed on the same diagonal at locations
(1,3), (2,2), (3,1), or at (1,4), (2,3), (3,2), (4,1), or on any other common right diagonal. Observe that
every element on the same diagonal has the same (row + column) value.

Fig 16 Left diagonal and right diagonal for 5 × 5 chessboard.

Suppose two queens are placed at positions (i, j) and (k, l); then they are on the same diagonal if
and only if:
(i − j) = (k − l) or (j − l) = (i − k) -----------(1) [left diagonal]

(i + j) = (k + l) or (j − l) = (k − i) -----------(2) [right diagonal]

Combining equations (1) and (2), we can write a single condition to check both diagonals:
abs(j − l) = abs(i − k).

Algorithm NQueen(k, n)
// This procedure prints all possible placements of n queens on an n × n
// chessboard so that they are non-attacking.
{
  for i := 1 to n do
  {
    if Place(k, i) then
    {
      x[k] := i;
      if (k = n) then print(x[1..n]);
      else
        NQueen(k + 1, n);
    }
  }
}

Algorithm Place(k, i)
// This algorithm returns true if a queen can be placed in the kth row and
// ith column; otherwise it returns false. x[] is a global array whose first
// (k − 1) values have already been set. Abs(r) returns the absolute value of r.
{
  for j := 1 to k − 1 do
  {
    if (x[j] = i)                          // two queens in the same column
       or (Abs(x[j] − i) = Abs(j − k))     // two queens in the same diagonal
    then return false;
  }
  return true;
}
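For readers who prefer a runnable version, the following Python transcription of NQueen and Place above uses the combined diagonal test abs(j − l) = abs(i − k); for N = 4 it prints exactly the two solutions found earlier:

def place(x, k, i):
    """True if a queen may go in row k, column i, given the queens
    already placed in rows 1..k-1 at columns x[1..k-1]."""
    for j in range(1, k):
        if x[j] == i or abs(x[j] - i) == abs(j - k):
            return False       # same column or same diagonal
    return True

def n_queens(k, n, x, solutions):
    """Place queens row by row; x[r] holds the column of the queen
    in row r (index 0 of x is unused). Collects every complete
    non-attacking placement."""
    for i in range(1, n + 1):
        if place(x, k, i):
            x[k] = i
            if k == n:
                solutions.append(x[1:])     # copy the placement
            else:
                n_queens(k + 1, n, x, solutions)

sols = []
n_queens(1, 4, [0] * 5, sols)
print(sols)    # [[2, 4, 1, 3], [3, 1, 4, 2]]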

☞ Check Your Progress 1

Q.1 What are the various factors that need to be taken into consideration when developing a
state space representation?
Q.2 Consider the following Missionaries and Cannibals problem:
Three missionaries and three cannibals are on one side of a river, along with a boat that can hold one or
two people. Find a way to get everyone to the other side, without ever leaving a group of
missionaries on either bank outnumbered by the cannibals there.
a) Formulate the missionaries and cannibal problem.
b) Solve the problem formulated in part (a)
c) Draw the state- space search graph for solving this problem.

Q.3 Draw a state space tree representation to solve the Tower of Hanoi problem. (Hint: You can take
the number of disks n = 2 or 3.)

Q.4 Draw the state space tree for the following 8-puzzle problem, where the start and goal states
are given below. Also find the minimum-cost path for this 8-puzzle problem. Each blank move
has cost = 1.
Q.5 Discuss a backtracking algorithm to solve the N-Queens problem. Draw a state space tree to
solve the 4-Queens problem.

2.5 Adversarial Search-Two agent search


In computer science, a search algorithm is an algorithm for finding an item with specified
properties among a collection of items. The items may be stored individually as records in a
database; or may be elements of a search space defined by a mathematical formula or procedure.
Adversarial search is a game-playing technique where the agents are surrounded by a
competitive environment. A conflicting goal is given to the agents (multiagent). These agents
compete with one another and try to defeat one another in order to win the game. Such
conflicting goals give rise to adversarial search. Here, game playing means discussing those
games where human intelligence and the logic factor are used, excluding other factors such as luck.
Tic-tac-toe, chess, checkers, etc., are games of this type, where no luck factor works; only the mind
works.
Mathematically, this search is based on the concept of ‘Game Theory’. According to game
theory, a game is played between two players; to complete the game, one has to win
and the other loses automatically.

Techniques required to get the best optimal solution


There is always a need to choose those algorithms which provide the best optimal solution in a
limited time. So, we use the following techniques, which fulfil these requirements:
 Pruning: A technique which allows ignoring the unwanted portions of a search tree
which make no difference in its final result.
 Heuristic Evaluation Function: It allows to approximate the cost value at each level of
the search tree, before reaching the goal node.

2.5.1 Elements of Game Playing search

To play a game, we use a game tree to know all the possible choices and to pick the best one.
The following are the elements of game playing:
 S0: It is the initial state from where a game begins.
 PLAYER (s): It defines which player is having the current turn to make a move in the
state.
 ACTIONS (s): It defines the set of legal moves that a player can make to change the
state.
 RESULT (s, a): It is a transition model which defines the result of a move.
 TERMINAL-TEST (s): It defines that the game has ended (or over) and returns true.
States where the game has ended are called terminal states.
 UTILITY (s, p): It defines the final value with which the game has ended. This function
is also known as the Objective function or Payoff function. The utility function gives a
numeric value for the outcome of a game.

For example, in chess or tic-tac-toe, we have two or three possible outcomes: win, lose, or draw,
which we can represent by the values +1, −1 or 0. In other words, the value is −1 if the PLAYER
loses, +1 if the PLAYER wins, and 0 if there is a draw between the PLAYERS.
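These elements can be written down as plain functions. The sketch below formalizes them for tic-tac-toe, encoding a state as a tuple of 9 cells holding 'X', 'O' or None; the function names mirror the elements above, but the encoding itself is our own illustrative choice:

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
S0 = (None,) * 9                 # initial state: the empty board

def player(s):                   # whose turn it is (X moves first)
    return "X" if s.count("X") == s.count("O") else "O"

def actions(s):                  # legal moves: indices of empty squares
    return [i for i in range(9) if s[i] is None]

def result(s, a):                # transition model
    t = list(s)
    t[a] = player(s)
    return tuple(t)

def winner(s):                   # 'X', 'O' or None
    for i, j, k in WINS:
        if s[i] is not None and s[i] == s[j] == s[k]:
            return s[i]
    return None

def terminal_test(s):            # game over?
    return winner(s) is not None or all(c is not None for c in s)

def utility(s):                  # +1 if X (MAX) wins, -1 if O wins, 0 draw
    return {"X": +1, "O": -1, None: 0}[winner(s)]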

Let us understand the working of these elements with the help of a game tree designed for tic-tac-
toe. Here, the nodes represent game states and the edges represent the moves taken by the players.
The root of the tree is the initial state. The next level holds all of MAX’s moves, the level after that all of
MIN’s moves, and so on. Note that the root has 9 blank squares (MAX), level 1 has 8 blank squares
(MIN), level 2 has 7 blank squares (MAX), and so on.

Objective:
Player 1: maximize the outcome; Player 2: minimize the outcome.
Terminal (goal) state:
Utility: −1, 0, +1 (that is, a win for X is +1, a win for O is −1, and a draw is 0). The number on each
leaf node indicates the utility value of the terminal state from the point of view of MAX. High
values are assumed to be good for MAX and bad for MIN (which is how the players get their
names). It is MAX’s job to use the search tree to determine the best move.

Fig 17 A game tree for tic-tac-toe

In a tic-tac-toe game playing, as shown in figure 17, we have the following elements:

 INITIAL STATE (S0): The top node in the game-tree represents the initial state in the
tree and shows all the possible choice to pick out one.
 PLAYER (s): There are two players, MAX and MIN. MAX begins the game by picking
one best move and place X in the empty square box.
 ACTIONS (s): Both the players can make moves in the empty boxes chance by chance.
 RESULT (s, a): The moves made by MIN and MAX will decide the outcome of the
game.
 TERMINAL-TEST (s): When all the empty boxes have been filled, or one player has
completed a line, the game has reached its terminating state.
 UTILITY: At the end, we get to know who wins, MAX or MIN, and the prize is awarded
accordingly: if MAX wins, the utility value is +1; if MIN wins, the utility value is −1; and if
the game is a draw, the utility value is 0.
2.5.2 Issues in Adversarial search

In a normal search, we follow a sequence of actions to reach the goal or to finish the game
optimally. But in an adversarial search, the result depends on the players, who together decide the
result of the game. It is also expected that the solution for the goal state will be an optimal
one, because each player tries to win the game along the shortest path and within limited
time. The Minimax search algorithm is an example of adversarial search, and Alpha-Beta pruning
is used with it to reduce the search space.

2.6 Min-Max search strategy


In artificial intelligence, minimax is a decision-making strategy under game theory, which is
used to minimize the chances of losing a game and to maximize the chances of winning. This
strategy is also known as ‘Min-Max’, ‘MM’, or ‘Saddle point’. Basically, it is a two-player
game strategy where if one player wins, the other loses the game. This strategy simulates the games
that we play in our day-to-day life. For instance, if two persons are playing chess, the result will be in
favour of one player and against the other. The person who makes the best effort, with the most
cleverness, will win.
We can easily understand this strategy via a game tree, where nodes are the states of the game and
edges are the moves made by the players. There are two players, namely:
 MIN: Decrease the chances of MAX to win the game.
 MAX: Increases his chances of winning the game.

They both play the game alternately, i.e., turn by turn, following the above strategy: if
one wins, the other loses. Both players look at one another as competitors and
try to defeat each other, giving their best.
In the minimax strategy, the result of the game, i.e., the utility value, is generated by a heuristic
function and propagated from the leaf nodes up to the root node. It follows the backtracking
technique and backtracks to find the best choice. MAX will choose the path which increases
its utility value, and MIN will choose the opposite path, which helps it minimize MAX’s
utility value.

2.6.1 MINIMAX Algorithm:


The MINIMAX algorithm is a backtracking algorithm in which we backtrack to pick the best move out
of several choices. The MINIMAX strategy follows the DFS (depth-first search) concept. Here, we
have two players, MIN and MAX, and the game is played alternately between them, i.e.,
when MAX makes a move, the next turn is MIN’s. The move made by MAX is
fixed; he cannot change it. The same concept is followed in the DFS strategy, i.e., we follow the
same path and cannot change it in the middle. That is why the MINIMAX algorithm uses DFS instead of
BFS. The following steps are used in the MINIMAX algorithm:

 Generate the whole game tree, all the way down to the terminal states.
 Apply the utility function to each terminal state to get its value.
 Use the utility of the terminal states to determine the utility of the nodes one level higher
up in the search tree.
 Continue backing up the values from the leaf nodes toward the root, one layer at a time.
 Eventually, the backed-up values reach the top of the tree; at that point, MAX chooses
the move that leads to the highest value.
This is called a minimax decision, because it maximizes the utility under the assumption that the
opponent will play perfectly to minimize it. To better understand the concept, consider the
following game tree or search tree as shown in figure 18.

Fig 18 Two player game tree

In the above figure 18, the two players MAX and MIN are shown. MAX starts the game by
choosing one path and propagating all the nodes of that path. Then MAX backtracks to the
initial node and chooses the best path, the one where its utility value will be the maximum. After this,
it is MIN's turn. MIN will also propagate through a path and again backtrack, but MIN will
choose the path which minimizes MAX's winning chances, i.e., the utility value.
So, if the level is a minimizing level, the node accepts the minimum value from its successor
nodes; if the level is a maximizing level, the node accepts the maximum value from its successors.

In other words, we can say that Minimax is a decision rule algorithm, which is represented as a
game-tree. It has applications in decision theory, game theory, statistics and philosophy.
Minimax is applied in two-player games, where one player is MIN and the other is the MAX player.
By convention, the root of the game-tree represents the MAX player. It is assumed that each
player aims to make the best move for himself, and therefore the worst move for his opponent, in
order to win the game. The question may arise: “How do we deal with the contingency problem?”
The answer is:

 Assuming that the opponent is rational and always optimizes its behaviour (in opposition
to us), we consider the opponent's best response.
 The minimax algorithm then determines the best move.
2.6.2 Working of Minimax Algorithm:
Minimax is applicable to decision making for two-agent systems participating in a competitive
environment. The two players P1 and P2, also known as the MIN and MAX players, minimize
and maximize the utility value of the heuristic function. The algorithm uses recursion to search through
the game tree and compute the minimax decision for the current state. We traverse the complete game tree
in a depth-first search (DFS) manner to explore the nodes. The MAX player always selects the
maximum value and MIN always selects the minimum value from its successor nodes. The initial
values of MAX and MIN are set as 𝑀𝐴𝑋 = −∞ and 𝑀𝐼𝑁 = +∞. These are the worst values each
player can have; as the algorithm progresses these values are updated and finally we obtain the optimal
value.
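As a concrete illustration of this recursion, here is a minimal Python sketch (our own, not part of
the original text; it assumes the game tree is given as nested lists, with plain numbers as terminal
utility values):

def minimax(node, maximizing):
    """Backed-up minimax value of `node` (a number = terminal utility,
    or a list of child nodes). The tree is traversed depth-first, and the
    MAX/MIN flag flips at every level because the players alternate."""
    if not isinstance(node, list):                  # terminal state
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)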
Example 1: Let’s take the example of the two-player game tree search (shown in figure 19a) to
understand the working of the Minimax algorithm.

Fig 19a Two player game tree


The initial values of MAX and MIN are set as 𝑀𝐴𝑋 = −∞ and 𝑀𝐼𝑁 = +∞. The tree is
traversed in a DFS manner. So, we start from node A, then move to node B and then D.
Now at node D [𝑀𝐴𝑋 = −∞], it first checks the left child (which is a terminal node)
with value −1. This node returns a value of 𝑀𝐴𝑋 = 𝑚𝑎𝑥(−∞, −1) = −1. So, the modified value at node
D is [𝑀𝐴𝑋 = −1]. Next, we proceed to the right child of node D (which has terminal value 8) and
compare this value (8) with the previous value at node D, that is, 𝑀𝐴𝑋 = 𝑚𝑎𝑥(−1, 8) = 8. So the final
value at node D is 8.
Similarly, the value at node E (which is a MAX node) is
𝑀𝐴𝑋 = 𝑚𝑎𝑥(−∞, −3) = −3, then 𝑚𝑎𝑥(−3, −1) = −1.
So, node B, which is at the MIN level, selects the minimum value from its successor nodes D and E
as 𝑀𝐼𝑁 = 𝑚𝑖𝑛(8, −1) = −1.

Similarly, the value at node F (which is also a MAX node) is 𝑀𝐴𝑋 = 𝑚𝑎𝑥(−∞, 2) = 2,
then 𝑚𝑎𝑥(2, 1) = 2, and
the value at node G (which is also a MAX node) is
𝑀𝐴𝑋 = 𝑚𝑎𝑥(−∞, −3) = −3, and then 𝑚𝑎𝑥(−3, 4) = 4.
Thus, node C, which is also at the MIN level, selects the minimum value from its successor nodes F
and G as 𝑀𝐼𝑁 = 𝑚𝑖𝑛(2, 4) = 2.
Now, the value at node B and C is -1 and 2 respectively.
Thus, finally, the value at node A, which is at MAX level, is
𝑀𝐴𝑋 = 𝑚𝑎𝑥(−1,2) = 2.
The final game tree, with the max or min value at each node and the optimal path (shown by the
shaded line A–C–F–2), is shown in the following figure 19(b).

Fig 19(b) Game tree with final value at each node with optimal path
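Assuming the `minimax` sketch given above and a nested-list encoding of figure 19(a) (the
encoding is our assumption), the computation can be checked mechanically:

# Figure 19(a): A (MAX) -> B, C (MIN) -> D, E, F, G (MAX) -> terminal values.
tree = [[[-1, 8], [-3, -1]],   # B: D = [-1, 8], E = [-3, -1]
        [[2, 1], [-3, 4]]]     # C: F = [2, 1],  G = [-3, 4]

print(minimax(tree, maximizing=True))   # prints 2, as in figure 19(b)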
Example 2: Consider another two-player game tree search. The working of the Minimax
algorithm on it is illustrated step by step in figures (a) to (k).
[Figures (a)–(k): step-by-step propagation of the minimax values through the game tree — omitted.]

2.6.3 Properties of Minimax Algorithm:


1. Complete: The Minimax algorithm is complete if the tree is finite.
2. Optimal: If the solution found by an algorithm is guaranteed to be the best solution (lowest
path cost) among all other solutions, then such a solution is said to be an optimal solution.
Minimax is optimal (against an optimal opponent).
3. Time complexity: O(b^m), where b is the branching factor and m is the maximum depth of the
game tree.
4. Space complexity: O(bm).

For example, in a chess-playing game b ≈ 35 and m ≈ 100 for “reasonable” games. In this case an
exact solution is completely infeasible.

2.6.4 Advantages and disadvantages of Minimax search


Advantages:
 Returns an optimal action, assuming perfect opponent play.

 Minimax is the simplest possible (reasonable) game search algorithm.

Disadvantages:
 It is completely infeasible in practice for games with large trees.
 When the search tree is too large, we need to limit the search depth and apply an
evaluation function to the cut-off states.

2.7 Alpha-beta Pruning


The drawback of the Minimax strategy is that it explores each node in the tree deeply to find the
best path among all the paths. This increases its time complexity: if b is the branching factor and
d is the depth of the tree, then the time complexity of the MINIMAX algorithm is O(b^d), which is
exponential. But, as we know, the performance measure is the first consideration for any optimal
algorithm. Alpha-beta pruning is a method to reduce (prune) the search space. Using Alpha-Beta
pruning, the Minimax algorithm is modified. Alpha-beta pruning therefore reduces this
drawback of the minimax strategy by exploring fewer nodes of the search tree.
The method used in alpha-beta pruning is that it cuts off the search by exploring a smaller
number of nodes. It makes the same moves as the minimax algorithm does, but it prunes the
unwanted branches using the pruning technique (discussed in adversarial search). Alpha-beta
pruning works on two threshold values, i.e., 𝛼 (alpha) and 𝛽 (beta).
 𝜶 : It is the best (highest) value that the MAX player can guarantee so far. The initial value of
𝜶 is set to negative infinity, that is,
𝜶 = −∞. As the algorithm progresses its value may change, finally reaching the best
(highest) value.
 𝜷 : It is the best (lowest) value that the MIN player can guarantee so far. The initial value of
𝜷 is set to positive infinity, that is,
𝜷 = +∞. As the algorithm progresses its value may change, finally reaching the best (lowest)
value.
So, each MAX node has an 𝜶-value, which never decreases, and each MIN node has a 𝜷-value,
which never increases. The main condition required for alpha-beta pruning is 𝜶 ≥ 𝜷: if 𝜶 ≥ 𝜷,
then prune (cut) the branches; otherwise proceed.

Note: The Alpha-beta pruning technique can be applied to trees of any depth, and it is often
possible to prune entire sub-trees, not just leaves.

2.7.1 Working of Alpha-beta Pruning

As we know, two parameters are defined for Alpha-beta pruning, namely alpha (𝜶) and
beta (𝜷). The initial values of alpha and beta are set as 𝜶 = −∞ and 𝜷 = +∞. As the algorithm
progresses, their values change accordingly. Note that in Alpha-beta pruning (cut), at any node
in the tree, if 𝜶 ≥ 𝜷, then the next branch is pruned (cut); otherwise the search is continued. Note the
following points for alpha-beta pruning:
 The MAX player will only update the value of 𝜶 (on MAX levels).
 The MIN player will only update the value of 𝜷 (on MIN levels).
 We only pass the 𝜶 and 𝜷 values from top to bottom (that is, from a parent to a child
node, never from a child to a parent node).
 While backtracking the tree, the node values are passed to the upper node, instead of the
values of 𝜶 and 𝜷.
 Before going to the next branch of a node in the tree, we check the values of 𝜶 and 𝜷. If
𝜶 ≥ 𝜷, then prune (cut) the next (unnecessary) branches (i.e., there is no need to search
the remaining branches where the condition 𝜶 ≥ 𝜷 is satisfied); otherwise the search is continued.

Consider the example below of a game tree where P and Q are two players. The game is
played alternately, i.e., turn by turn. Let P be the player who tries to win the game by
maximizing his winning chances, and Q the player who tries to minimize P's winning chances.
Here, 𝜶 will represent the maximum value of the nodes, which will be the value for P as
well, and 𝜷 will represent the minimum value of the nodes, which will be the value for Q.

Fig 20 Alpha-beta pruning

 Any one player will start the game. Following the DFS order, the player chooses one
path and follows it down to its full depth, i.e., to where he finds a TERMINAL value.
 If the game is started by player P, he will choose the maximum value, in order to increase
his winning chances with the maximum utility value.
 If the game is started by player Q, he will choose the minimum value, in order to decrease
the winning chances of P with the best possible minimum utility value.
 Both will play the game alternately.
 The evaluation starts from the last level of the game tree, and the values are chosen
accordingly.
 As in figure 20, the game is started by player Q. He picks the leftmost value of
the TERMINAL level and fixes it as beta (𝜷). Then the next TERMINAL value is
compared with the 𝜷-value: if the value is smaller than or equal to the 𝜷-value,
it replaces the current 𝜷-value; otherwise there is no need to replace the value.
 After completing one part, move the achieved 𝜷-value to its upper node and fix it as the
other threshold value, i.e., 𝜶.
 Now it is P's turn. He will pick the best maximum value. P will move to explore the next
part only after comparing the values with the current 𝜶-value: if the value is equal to or
greater than the current 𝜶-value, then only will it be replaced; otherwise we prune the
values.
 The steps are repeated until the result is obtained.
 So, the number of pruned nodes in the above example is four, and MAX wins the game with
the maximum UTILITY value, i.e., 3.

The rule which will be followed is: “Explore nodes, if necessary, otherwise prune the
unnecessary nodes.”
Note: It is obvious that the result will have the same UTILITY value that we may get from the
MINIMAX strategy.

Alpha beta cut-off (or pruning):

1. For each node, store the limits [𝛼, 𝛽].

2. Update [𝛼, 𝛽], where 𝛼 is the lower bound at a max node (it cannot decrease) and
𝛽 is the upper bound at a min node (it cannot increase).

3. If the 𝛼 value of a max node is greater than or equal to the 𝛽 value of its parent (𝛼 ≥ 𝛽),
the subtree of that max node need not be evaluated (i.e., it is pruned).

4. If the 𝛽 value of a min node is less than or equal to the 𝛼 value of its parent (𝛽 ≤ 𝛼),
the subtree of that min node need not be evaluated (i.e., it is pruned).
(A code sketch of these cut-off rules follows.)
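These rules translate into a short recursive routine. The Python sketch below is our own
illustration (it reuses the nested-list tree encoding of the earlier minimax sketch); the single test
α ≥ β realizes both cut-offs:

import math

def alphabeta(node, alpha, beta, maximizing):
    """Minimax value of `node` with alpha-beta cut-offs. `alpha` is the best
    value MAX can guarantee so far (it never decreases); `beta` is the best
    value MIN can guarantee so far (it never increases). Once alpha >= beta,
    the remaining siblings cannot affect the result and are pruned."""
    if not isinstance(node, list):                   # terminal state
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)                # MAX updates only alpha
            if alpha >= beta:                        # beta cut-off
                break
        return value
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)                  # MIN updates only beta
            if alpha >= beta:                        # alpha cut-off
                break
        return value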

Example 1: Let’s take the example of the two-player search tree (figure 21) to understand the
working of alpha-beta pruning.
Fig 21 Two player search tree

We initially start the search by setting the initial values 𝜶 = −∞ and 𝜷 = +∞ at the root node A.

Note the following important points when applying Alpha-beta pruning:

 We only pass the 𝜶 and 𝜷 values from top to bottom (that is,
from a parent to a child node), never from a child to a parent
node.

 While backtracking the tree (from a bottom node to a top node), the node
values are passed to the upper node, instead of the values of 𝜶 and 𝜷.

 Before exploring the next branch in the tree, we check whether 𝜶 ≥ 𝜷. If
YES, then prune (cut) the next (unnecessary) branches (i.e., there is no
need to search the remaining branches where the condition
𝜶 ≥ 𝜷 is satisfied); otherwise the search is continued.

 The MAX player will only update the value of 𝜶 (on MAX levels)
and the MIN player will only update the value of 𝜷 (on MIN
levels).

Step 1: We traverse the tree in a depth-first search (DFS) manner and assign (pass) the values of
𝜶 and 𝜷 down to the subsequent node B and then to node D as [𝜶 = −∞; 𝜷 = +∞].
Now at node D [𝜶 = −∞, 𝜷 = +∞]: since node D is at a MAX level, only the 𝜶 value can be
changed. At D, we first check the left child (which is a terminal node) with value 2. This
node returns a value of 2. Now, the value of 𝜶 at node D is calculated as 𝛼 = 𝑚𝑎𝑥(−∞, 2) = 2.
So the modified value at node D is [𝛼 = 2, 𝛽 = +∞]. To decide whether it is worth looking at the
right node or not, we check 𝜶 ≥ 𝜷. The answer is NO, since 2 ≱ +∞. So we proceed, and the search
is continued for the right child of node D.
The right child (a terminal node with value 3) of D returns the value 3. Now at D, the value
of 𝛼 is compared with the terminal node value 3, that is, 𝛼 = 𝑚𝑎𝑥(2, 3) = 3. Now the value of
node(D) = 3, and the final values of 𝛼 and 𝛽 at node D are updated as [𝛼 = 3, 𝛽 = +∞], as shown in
figure 21(a).
Fig 21(a): partial game tree after Step 1 — node D = 3 with [𝛼 = 3, 𝛽 = +∞]; terminals 2 and 3 under D.

Step 2: We backtrack from node D to B. Note that, while backtracking a node in the tree, the
node value of D (= 3) is passed to the upper node B instead of the values of 𝜶 and 𝜷. Now the value
of node(B) = node(D) = 3. Since B is at a MIN level, only the 𝜷 value can be changed. Now at node B,
[𝜶 = −∞, 𝜷 = 𝟑] (note that 𝛽 has changed from +∞ to 3). Here we again check 𝛼 ≥ 𝛽. It is false,
so the search is continued on the right side of B, as shown in figure 21(b).

Fig 21(b): subtree after Step 2 — node B = 3 with [𝛼 = −∞, 𝛽 = 3]; node D = 3.

Step 3: B now calls E. We pass the 𝜶 and 𝜷 values from the top node B to the bottom node E as [𝜶 =
−∞, 𝜷 = 𝟑]. Since node E is at a MAX level, only the 𝜶 value can change. Now, at E, we first
check the left child (which is a terminal node) with value 5. This node returns a value of 5.
Now, the value of 𝜶 at node E is calculated as 𝛼 = 𝑚𝑎𝑥(−∞, 5) = 5, so the value of node(E) = 5 and
the modified values of 𝜶 and 𝜷 at node E are [𝛼 = 5, 𝛽 = 3]. To decide whether it is worth looking at the
right node or not, we check 𝜶 ≥ 𝜷. The answer is YES, since 5 ≥ 3. So we prune (cut) the right
branch of E, as shown in figure 21(c).

Fig 21(c): subtree after Step 3 — node E = 5 with [𝛼 = 5, 𝛽 = 3]; the right branch of E is pruned.
Step 4: We backtrack from node E to B. Note that, while backtracking, the node value of E (= 5)
is passed to the upper node B instead of the values of 𝜶 and 𝜷; E returns the value 5 to B. Since B is
at a MIN level, only the 𝜷 value can be changed. Previously, at node B, [𝜶 = −∞, 𝜷 = 𝟑]; now
𝜷 = 𝑚𝑖𝑛(3, 5) = 3, so there is no change in the 𝜷 value and the value of node(B) is still 3. Thus,
finally, the modified values at node B are [𝜶 = −∞, 𝜷 = 𝟑].
We backtrack from node B to A. Again note that, while backtracking the tree, the value of
node(B) = 3 is passed to the upper node A, instead of the values of 𝜶 and 𝜷. Now the value of
node(A) = 3.

Since A is at a MAX level, only the 𝜶 value can be changed. Previously, at node A, [𝛼 = −∞, 𝛽 =
+∞]; after comparing the value of node(B) = 3 with the old value of 𝛼 at node A, we get 𝛼 =
𝑚𝑎𝑥(−∞, 3) = 3. Thus, finally, at node A, [𝛼 = 3, 𝛽 = +∞] and the value of node(A) = 3. We check
𝛼 ≥ 𝛽; it is false, so we proceed on the right side. We have now completed the left subtree of A and
proceed towards the right subtree, as shown in figure 21(d).
Fig 21(d): tree after Step 4 — node A = 3 with [𝛼 = 3, 𝛽 = +∞]; node B = 3; node D = 3; node E = 5.
Step 5: Now at node C, we pass the 𝜶 and 𝜷 values from the top node A to the bottom node C as
[𝜶 = 𝟑, 𝜷 = +∞]. Check 𝛼 ≥ 𝛽: the answer is NO, so the search is continued. Now we pass the 𝜶
and 𝜷 values from the top node C to the bottom node F as [𝜶 = 𝟑, 𝜷 = +∞]. Since F is at a MAX
level, only the 𝜶 value can be changed.
Now, at F, we first check the left child (which is a terminal node) with value 0. This node returns
a value of 0. Now, the value of 𝜶 at node F is calculated as 𝛼 = 𝑚𝑎𝑥(3, 0) = 3. So the modified
values at node F are [𝛼 = 3, 𝛽 = +∞]. To decide whether it is worth looking at the right node or not,
we check 𝜶 ≥ 𝜷. The answer is NO, since 3 ≱ +∞. So we proceed, and the search is continued for the
right child of node F.
The right child (a terminal node with value 1) of F returns the value 1, so finally the value of
node(F) = 1. Now at F, the value of 𝛼 is compared with the terminal node value 1, that is, 𝛼 =
𝑚𝑎𝑥(3, 1) = 3, and the final values of 𝛼 and 𝛽 at node F are updated as [𝛼 = 3, 𝛽 = +∞], as
shown in figure 21(e).

Fig 21(e): tree after Step 5 — A (MAX) = 3; B = 3 and C at the MIN level; node F = 1 with
[𝛼 = 3, 𝛽 = +∞] at the MAX level; terminal nodes at the bottom.

Step 6: We backtrack from node F to C. Note that, while backtracking the tree, the node value of
F (= 1) is passed to the upper node C. Now the value of node(C) = node(F) = 1.
Since C is at a MIN level, only the 𝛽 value can be changed. Previously, at node C, [𝛼 = 3, 𝛽 =
+∞]. Now the old value 𝜷 = +∞ is compared with the value of node(F) = node(C) = 1, that is,
𝜷 = 𝑚𝑖𝑛(+∞, 1) = 1. Thus, finally, at node C, [𝜶 = 𝟑, 𝜷 = 𝟏]. Now we check 𝛼 ≥ 𝛽. It is
TRUE, so we prune (cut) the right branch of node C. That is, node G is pruned and the
algorithm stops searching the right subtree of node C.
Thus, finally, we backtrack from node C to A, and node C returns the value 1 to node A. Since A is
a MAX node, only the 𝜶 value can be changed.
Previously, at node A, [𝛼 = 3, 𝛽 = +∞]; after comparing the value of node(C) = 1 with the old value
of 𝛼 at node A, we get 𝛼 = 𝑚𝑎𝑥(3, 1) = 3. Thus, finally, at node A, [𝛼 = 3, 𝛽 = +∞] and the value
of node(A) = 3. We have now completed the right subtree of A as well.
The final game tree below shows the nodes which were computed and the nodes which were
pruned (cut) during the search process of Alpha-beta pruning. Here the optimal value for the
maximizer is 3, and three terminal nodes are pruned (9, 7 and 5). The optimal search path is
A–B–D (terminal value 3).

Fig 21(f): final game tree — node A = 3 (MAX level), nodes B = 3 and C = 1 (MIN level), node F = 1;
the right child of E (9) and the subtree under G (7 and 5) are pruned.
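The whole walkthrough can be replayed with the `alphabeta` sketch given earlier (assuming that
sketch, including its import of math, is in scope; the nested-list encoding of figure 21 is, again,
our assumption):

# Figure 21: A (MAX) -> B, C (MIN) -> D, E, F, G (MAX) -> terminal values.
tree = [[[2, 3], [5, 9]],    # B: D = [2, 3], E = [5, 9]
        [[0, 1], [7, 5]]]    # C: F = [0, 1], G = [7, 5]

print(alphabeta(tree, -math.inf, math.inf, True))   # prints 3
# The terminals 9, 7 and 5 are never examined, matching figure 21(f).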

Example 2: Consider the following game tree (figure 22), in which the root is a maximizing node and
children are visited from left to right. Find which nodes are pruned by Alpha-beta pruning.

Fig 22 Two Player search tree

Solution:
Step 1: We start the search by setting the initial values for node A as [𝜶 = −∞, 𝜷 = +∞]. We
traverse the nodes in a depth-first search (DFS) manner, so we assign the same values of 𝜶 and 𝜷 to
node B [𝜶 = −∞, 𝜷 = +∞]. Since node B is at a MIN level, only the 𝜷 value can be changed at node B.
Now B looks at its left child D (a terminal node), which returns the value 3. So we compare
the old value of 𝛽 at node B with this terminal value 3, that is, 𝛽 = 𝑚𝑖𝑛(+∞, 3) = 3. So the
modified values of 𝜶 and 𝜷 at node B are [𝜶 = −∞, 𝛽 = 3].
To decide whether it is worth looking at the right subtree of B, we check
𝜶 ≥ 𝜷. The answer is NO, since −∞ ≱ 3. So we proceed, and the search is continued for the next child
of node B, that is, E.

The terminal value of E is 12. This value is compared with
the previous old value 𝛽 = 3, that is, 𝛽 = 𝑚𝑖𝑛(3, 12) = 3, so there is no change in the 𝛽 value. At
present, the current modified values of 𝜶 and 𝜷 at node B are unchanged: [𝜶 = −∞, 𝛽 = 3]. Again
check 𝛼 ≥ 𝛽. The answer is NO. So we proceed, and the search is continued for the last child of node
B, that is, F. The terminal value of F is 8. This value is compared with the previous old
value 𝛽 = 3, that is, 𝛽 = 𝑚𝑖𝑛(8, 3) = 3, so there is no change in the 𝛽 value. So, finally, the value of
node(B) = 3 and the modified values of 𝜶 and 𝜷 at node B are [𝜶 = −∞, 𝛽 = 3], as shown in figure 22(a).

Step 2: We backtrack from node B to A. Note that, while backtracking the tree, the node value
of B (= 3) is passed to the upper node A, instead of the values of 𝜶 and 𝜷. Now the value of
node(A) = node(B) = 3. Since A is at a MAX level, only the 𝜶 value can be changed. Previously, at
node A, [𝛼 = −∞, 𝛽 = +∞]; after comparing the value of node(B) = 3 with the old value of 𝛼 at node
A, we get 𝛼 = 𝑚𝑎𝑥(−∞, 3) = 3. Thus, finally, at node A, [𝜶 = 𝟑, 𝜷 = +∞] and the value of
node(A) = 3.
To decide whether it is worth looking at the right node of A or not, we check 𝜶 ≥ 𝜷. The answer
is NO, since 3 ≱ +∞. So we proceed, and the search is continued for the right child of node A. We
have now completed the left subtree of A and proceed towards the right subtree.
Fig 22(a): subtree after Step 1 — node B = 3 with [𝛼 = −∞, 𝛽 = 3]; terminals D = 3, E = 12, F = 8.
Fig 22(b): tree after Step 2 — node A = 3 with [𝛼 = 3, 𝛽 = +∞] above node B = 3.

Step 3: Now at node C, we pass the 𝜶 and 𝜷 values from the top node A to the bottom node C as [𝜶 =
𝟑, 𝜷 = +∞]. Check 𝛼 ≥ 𝛽: the answer is NO, so we continue the search on the right side. Since C is
at a MIN level, only the 𝜷 value can be changed.
Now we first check the left child (terminal) of node C, that is, G = 2. We compare the old
value of 𝜷 at node C with this terminal value 2, that is, 𝜷 = 𝑚𝑖𝑛(+∞, 2) = 2. So the value
of node(C) = 2 and the modified values of 𝜶 and 𝜷 at node C are [𝜶 = 𝟑, 𝛽 = 2]. Now, before
proceeding further, we again check 𝛼 ≥ 𝛽. The answer is YES. So we prune (cut) the remaining
branches of node C: nodes H and I are pruned (cut) and the algorithm stops searching the right
subtree of node C, as shown in figure 22(c).
C, as shown in figure 22(c).
Fig 22(c): tree after Step 3 — node C = 2 with [𝛼 = 3, 𝛽 = 2]; terminals H = 15 and I = 6 under C are pruned.

Step 4: Finally, we backtrack from node C to A. Note that, while backtracking a node in the tree,
the node value of C (= 2) is passed to the upper node A, instead of the values of 𝜶 and 𝜷. The
previous value node(A) = 3 is compared with this new value node(C) = 2; the best value at
node(A) = 𝛼 = 𝑚𝑎𝑥(3, 2) = 3.
The previous 𝛼 and 𝛽 values at node A are [𝛼 = 3, 𝛽 = +∞]. Since A is at a MAX level, only the 𝛼
value can change. So we compare the old value 𝛼 = 3 with the value of node(C) = 2, that is,
𝛼 = 𝑚𝑎𝑥(3, 2) = 3. Thus, there is no change in the 𝛼 value either.
Thus, finally, the 𝜶 and 𝜷 values at node A are [𝛼 = 3, 𝛽 = +∞] and the value of node(A) = 3. So the
optimal value for the maximizer is 3, and two terminal nodes (H and I) are pruned. The optimal
search path is A–B–D (as shown in figure 22(d)).

Fig 22(d): final tree — node A = 3 (MAX), node B = 3 and node C = 2 (MIN); terminals H = 15 and
I = 6 pruned; the optimal path is A–B–D.

2.7.2 Move ordering of Alpha-beta pruning

The effectiveness of Alpha-beta pruning is highly dependent on the order in which the nodes are
examined; move ordering is therefore an important aspect of alpha-beta pruning. There are two extreme
cases of move ordering:

Worst-case ordering: In some cases, the alpha-beta pruning algorithm does not prune any of the leaves
of the tree and works exactly like the Minimax algorithm, while consuming extra time on the alpha-beta
bookkeeping; such an ordering is called a worst ordering. The time complexity for such an order is
O(b^m), where b is the branching factor and m is the depth of the tree.

Best (ideal) case ordering: The ideal ordering for alpha-beta pruning occurs when a lot of pruning
happens in the tree, because the best moves occur on the left side of the tree. We apply DFS, so the
algorithm searches the left part of the tree first and, in the same amount of time, can go twice as deep as
the minimax algorithm. The time complexity for the best-case order is O(b^(m/2)) (intuitively, perfect
ordering lets the cut-offs discard most of the branches to the right).

Note that pruning does not affect the final result; good move ordering only improves the
effectiveness of pruning. With ideal ordering, the time complexity is O(b^(m/2)). The effect of move
ordering can be measured directly, as the sketch after this note shows.
In Alpha-beta pruning:
- The α value can never decrease and the β value can never increase. Search
can be discontinued at a node if:
- it is a Max node and α ≥ β — this is a beta cut-off;
- it is a Min node and β ≤ α — this is an alpha cut-off.
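A minimal sketch of such a measurement, assuming the figure-21 tree and the nested-list encoding
used earlier (all of the code is our own instrumentation, not part of the original text):

import math

def count_leaves(node, alpha, beta, maximizing, counter):
    """Alpha-beta search that also counts how many terminal nodes it evaluates."""
    if not isinstance(node, list):
        counter[0] += 1                         # one more leaf examined
        return node
    value = -math.inf if maximizing else math.inf
    for child in node:
        v = count_leaves(child, alpha, beta, not maximizing, counter)
        if maximizing:
            value = max(value, v)
            alpha = max(alpha, value)
        else:
            value = min(value, v)
            beta = min(beta, value)
        if alpha >= beta:                       # cut-off: skip remaining siblings
            break
    return value

def mirror(node):
    """Reverse the child order at every level, to simulate a worse ordering."""
    return node if not isinstance(node, list) else [mirror(c) for c in reversed(node)]

tree = [[[2, 3], [5, 9]], [[0, 1], [7, 5]]]
for label, t in (("given order", tree), ("reversed order", mirror(tree))):
    counter = [0]
    value = count_leaves(t, -math.inf, math.inf, True, counter)
    print(label, "-> value:", value, ", leaves evaluated:", counter[0])

Run on this tree, both orderings back up the same root value (3), but the given left-to-right order
evaluates only 5 of the 8 leaves, while the mirrored order evaluates all 8: ordering changes the
amount of work, never the result.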

☞ Check Your Progress 2


Q.1: Compare the MINIMAX and Alpha-Beta pruning algorithms with respect to time
complexity.
Q.2: Consider the following Minimax game tree search, in which the root is a maximizing node and
children are visited from left to right. Find the value of the root node of the game tree.

Q.3: Apply the Alpha-Beta pruning algorithm on the following graph and find which node(s) are
pruned.
Q.4: Consider the following Minimax game tree search (figure 1(a)), in which the root is a maximizing
node and children are visited from left to right.

Figure 1(a): a three-level game tree — root A at the MAX level; B, C, D at the MIN level; E, F, G, H,
I, J, K at the MAX level; terminal values (left to right): 4 3 6 2 2 1 4 5 3 1 5 4 7 5.

(a) Find the value of the root node of the game tree?
(b) Find all the nodes pruned in the tree?
(c) Find the optimal path for the maximizer in a tree?
Q.5: Consider the following Minimax game tree search in which root is maximizing node and
children are visited from left to right. Find what will be the value propagated at the root?

Q.6: Consider the following Minimax game tree search in which root is maximizing node and
children are visited from left to right. Find the value of the root node of the game tree?
Multiple choice Question

Q.7: Consider the following Minimax game tree search in which root is maximizing node and
children are visited from left to right. Find the value of the root node of the game tree?

A. 14 B. 17 C. 111 D. 112

2.8 Summary

 Before an AI problem can be solved it must be represented as a state space. Among all
possible states, there are two special states called initial state (the start point) and final
state (the goal state).
 A successor function (a set of operators) is used to change the state, i.e., to move
from one state to another.
 A state space is set of all possible states of a problem.
 A state space essentially consists of a set of nodes representing each state of the problem,
arcs between nodes representing the legal moves from one state to another, an initial
state, and a goal state. Each state space takes the form of a tree or a graph.
 The process of searching means a sequence of actions that take you from an initial state to
a goal state.
 Search is fundamental to the problem-solving process. Search means the problem is
solved by using the rules, in combination with an appropriate control strategy, to move
through the problem space until a path from an initial state to a goal state is found.
 A problem space is represented by a directed graph, where nodes represent search
states and paths represent the operators applied to change the state.
 In general, a state space is represented by a 4-tuple as follows: 𝑺𝒔 : [𝑺, 𝒔𝟎, 𝑶, 𝑮], where S is the
set of all possible states (possibly infinite); 𝒔𝟎 is the start state (initial configuration) of the
problem, 𝑠₀ ∈ 𝑆; O is the set of production rules (or state-transition operators) used to
change one state into another, i.e., the set of arcs (or links) between nodes; and G is the set of goal states.
 Adversarial search is a game-playing technique where the agents are surrounded by a
competitive environment. A conflicting goal is given to the agents (multiagent). These
agents compete with one another and try to defeat one another in order to win the game.
Such conflicting goals give rise to the adversarial search.
 In a normal search, we follow a sequence of actions to reach the goal or to finish the
game optimally. But in an adversarial search, the outcome also depends on the opponent's
moves, which the searching player cannot control. The solution found for the goal state
is still an optimal solution, because each player tries to win the game along the shortest
path and under limited time.
 There are 2 types of adversarial search: the Minimax algorithm and Alpha-beta pruning.
 Minimax is a two-player (namely MAX and MIN) game strategy where if one player wins, the
other loses the game. This strategy simulates games that we play in our day-to-day
life. For example, if two persons are playing chess, the result will be in favour of one player and
will go against the other one. MIN decreases the chances of MAX winning the game, and
MAX increases his chances of winning the game. They both play the game alternatively,
i.e., turn by turn, following the above strategy: if one wins, the other will
definitely lose. Both players look at one another as competitors and will try to defeat
one another, giving their best.
 In the minimax strategy, the result of the game, or the utility value, is generated by a heuristic
function and propagated from the leaf (terminal) nodes up to the root node. The strategy follows
the backtracking technique and backtracks to find the best choice. MAX chooses
the path which increases its utility value, and MIN chooses the opposite path,
which helps it to minimize MAX’s utility value.
 The drawback of minimax strategy is that it explores each node in the tree deeply to
provide the best path among all the paths. This increases its time complexity.
 If b is the branching factor and d is the depth of the tree, then the time complexity of the
MINIMAX algorithm is O(b^d), which is exponential.
 Alpha-beta pruning is an advanced version of the MINIMAX algorithm; it reduces
the drawback of the minimax strategy by exploring fewer nodes of the
search tree.
 The alpha-beta pruning method cuts off the search by exploring a smaller number of
nodes. It makes the same moves as the minimax algorithm does, but it prunes the unwanted
branches using the pruning technique.
 Alpha-beta pruning works on two threshold values, i.e., 𝛼 (alpha) and 𝛽 (beta). 𝜶 is
the best (highest) value the MAX player can have; its initial value is set to negative
infinity, that is, 𝜶 = −∞, and as the algorithm progresses its value may change,
finally reaching the best (highest) value. 𝜷 is the best (lowest) value the MIN player can
have; its initial value is set to positive infinity, that is, 𝜷 = +∞, and as the algorithm
progresses its value may change, finally reaching the best (lowest) value.
 So, each MAX node has an 𝜶 value, which never decreases, and each MIN node has a 𝜷
value, which never increases. The main condition required for alpha-beta pruning
is 𝜶 ≥ 𝜷: if 𝜶 ≥ 𝜷, then prune (cut) the branches; otherwise the search is continued.
 The initial values of alpha and beta are thus set as 𝜶 = −∞ and 𝜷 = +∞, and as
the algorithm progresses they change accordingly. Note that in Alpha-beta
pruning (cut), at any node in the tree, if 𝜶 ≥ 𝜷, then the next branch is pruned (cut);
otherwise the search is continued.
 The effectiveness of Alpha-beta pruning is highly dependent on the order in which each
node is examined.
 Worst-case ordering: In some cases, the alpha-beta pruning algorithm does not prune any of
the leaves of the tree and works exactly like the Minimax algorithm. In this case the time
complexity is O(b^m), where b is the branching factor and m is the depth of the tree.
 Best (ideal) case ordering: The ideal ordering for alpha-beta pruning occurs when a lot
of pruning happens in the tree and the best moves occur on the left side of the tree. The time
complexity for the best-case order is O(b^(m/2)).

2.9 Solutions/Answers

Check your progress 1:

Answer 1: A number of factors need to be taken into consideration when developing a state-space
representation. Factors that must be addressed are:

 What is the goal to be achieved?

 What are the legal moves or actions?

 What knowledge needs to be represented in the state description?


 Type of problem - There are basically three types of problems. Some problems only need
a representation, e.g., crossword puzzles. Other
problems require a yes or no response indicating whether a solution can be found or not.
Finally, the last type of problems are those that require a solution path as an output, e.g.,
mathematical theorem proving or the Towers of Hanoi. In these cases we know the goal state and we
need to know how to attain this state.

 Best solution vs. good-enough solution - For some problems a good-enough solution is
sufficient, for example in theorem proving or the eight-puzzle.
However, some problems require a best or optimal solution, e.g., the travelling salesman
problem.

Answer 2

(a) Formulation of Missionaries and Cannibal problem:

State: (#M,#C,0/1)

Where #M represents Number of missionaries in the left side bank (i.e., left side of the river)

#C : represents the number of cannibals in the left side bank (i.e., left side of the river)

0/1 : indicates the position of the boat. 0 indicates the boat is on the left side of the river and
1 indicates the boat is on the right side.

Start state:(3,3,0)

Goal State: (0,0,1)

Operator: The state is changed by moving missionaries and/or cannibals from one side to the
other using the boat. So an operator can be represented by the number of persons of each kind carried
by the boat. Note that the boat can carry at most 2 persons.

Boat carries: (1,0) or (0,1) or (1,1) or (2,0) or (0,2). Here, in (i, j), i represents the number of
missionaries and j the number of cannibals in the boat.
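A minimal sketch of how this formulation yields the legal successor states (the helper names below
are our own, purely for illustration):

BOAT_LOADS = [(1, 0), (0, 1), (1, 1), (2, 0), (0, 2)]  # (missionaries, cannibals)

def is_safe(m, c):
    """A configuration is safe if, on each bank, missionaries are absent or
    at least as numerous as cannibals ((3 - m, 3 - c) is the other bank)."""
    return (m == 0 or m >= c) and ((3 - m) == 0 or (3 - m) >= (3 - c))

def successors(state):
    """Legal states reachable from (#M on left bank, #C on left bank, boat)."""
    m, c, boat = state
    sign = -1 if boat == 0 else +1      # boat on left (0): people leave the left bank
    next_states = []
    for dm, dc in BOAT_LOADS:
        nm, nc = m + sign * dm, c + sign * dc
        if 0 <= nm <= 3 and 0 <= nc <= 3 and is_safe(nm, nc):
            next_states.append((nm, nc, 1 - boat))
    return next_states

print(successors((3, 3, 0)))            # [(3, 2, 1), (2, 2, 1), (3, 1, 1)]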

(b) Solution of Missionaries and Cannibal problem:

Start state:(3,3,0)

Goal State: (0,0,1)


Figure: Solution of Missionaries and Cannibal problem

A state space tree for this problem is shown below:


Figure: A state space tree showing all possible solution of missionaries and cannibal problem.
Answer 3: Towers of Hanoi. A possible state space representation of the Towers of Hanoi problem
using a graph is indicated in Figure 1.

Figure 1: Towers of Hanoi state space representation for n = 2

The legal moves in this state space involve moving one ring from one pole to another, moving one ring at
a time, and ensuring that a larger ring is not placed on a smaller ring.
Answer 4:

Minimum Cost path for solution= 6


Answer 5: Backtracking algorithm: The idea is to place queens one by one in different
columns, starting from the leftmost column. When we place a queen in a column, we check for
clashes with the already placed queens. In the current column, if we find a row for which there is no
clash, we mark this row and column as part of the solution. If we do not find such a row, due to
clashes, then we backtrack and return false. (A compact code sketch of this procedure is given after
the steps below.)

1) Start in the leftmost column

2) If all queens are placed


return true

3) Try all rows in the current column.


Do following for every tried row.

a) If the queen can be placed safely in this row, then mark this [row, column] as part of the
solution and recursively check if placing the queen here leads to a solution.
b) If placing the queen in [row, column] leads to a solution, then return true.
c) If placing the queen doesn't lead to a solution, then unmark this [row, column] (backtrack) and
go to step (a) to try other rows.
4) If all rows have been tried and nothing worked, return false to trigger backtracking.
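Here is a compact Python sketch of the procedure (our own illustration). It places one queen per
row — symmetric to the column-by-column description above, since the board is square — with
cols[i] playing the role of x_(i+1):

def solve_queens(n):
    """Place one queen per row; cols[i] is the column of the queen in row i.
    Returns the first solution found (1-indexed), or None."""
    cols = []

    def safe(col):
        row = len(cols)
        # Clash if same column or same diagonal as any earlier queen.
        return all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(cols))

    def place(row):
        if row == n:                         # all queens placed
            return True
        for col in range(n):                 # try every column in this row
            if safe(col):
                cols.append(col)             # tentative placement
                if place(row + 1):
                    return True
                cols.pop()                   # clash downstream: backtrack
        return False                         # trigger backtracking in caller

    return [c + 1 for c in cols] if place(0) else None

print(solve_queens(4))    # [2, 4, 1, 3] — the solution shown in the tree below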
State space tree for the 4-Queens problem

A state-space tree (SST) can be constructed to show the solution to this problem. The following SST
(figure 1) shows one possible solution, {𝑥1, 𝑥2, 𝑥3, 𝑥4} = {2, 4, 1, 3}, for the 4-Queens problem.

Or we can also denote the State space tree as follows:


1
x1=1 x1=2

2 18
x2=4
x2=2 x2=3 x2=4

3 8 13 19 24 29

B B B x3=1

30
9 11 14 16 x4=3
B B B
31

15
B denotes the Dead Node (nonpromising node). The figure 1 shows the Implicit tree for 4 queen
problem for solution <2,4,1,3>. The Root represents an initial state. The Nodes reflect the
specific choices made for the components of a solution. Explore The state space tree using
depth-first search. "Prune" non-promising nodesdfs stops exploring subtree rooted at nodes
leading to no solutions and then backtracks to its parent node

Check your progress 2


Answer1:
Solution: Alpha-beta pruning is an advanced version of the MINIMAX algorithm. The drawback of
the minimax strategy is that it explores each node in the tree deeply to provide the best path among
all the paths, which increases its time complexity. If b is the branching factor and d is the depth of
the tree, then the time complexity of the MINIMAX algorithm is O(b^d), which is exponential.

Alpha-Beta pruning is a way of finding the optimal Minimax solution while avoiding searching
subtrees of moves which won't be selected. The effectiveness of Alpha-beta pruning is highly
dependent on the order in which the nodes are examined. The ideal ordering for alpha-beta pruning
occurs when a lot of pruning happens in the tree and the best moves occur on the left side of the tree.
We apply DFS, so the algorithm searches the left part of the tree first and goes twice as deep as the
minimax algorithm in the same amount of time. The time complexity for the best-case order is
O(b^(d/2)).

Answer 2: The final game tree with max and min value at each node is shown in the following
figure.
Answer3:
Solution: Solve the question as shown in Example 1.
The initial call starts from A. We initially start the search by setting the initial values 𝜶 =
−∞ and 𝜷 = +∞ for the root node A. These values are passed down to subsequent nodes in the tree.
At A, the maximizer must choose the max of B and C, so A calls B first. At B, the minimizer must
choose the min of D and E, and hence B calls D first. At D, it looks at its left child, which is a leaf node.
This node returns a value of 3. Now the value of alpha at D is max(−∞, 3) = 3. To decide
whether it is worth looking at the right node or not, it checks the condition 𝛼 ≥ 𝛽. This is false,
since 𝜶 = 3 and 𝜷 = +∞, and 3 ≱ +∞. So it continues the search.
D now looks at its right child, which returns a value of 5. At D, alpha = max(3, 5) = 5.
Now the value of node D is 5; we move up to node B (= 5). Now at node B, the 𝛽
value is modified as
𝛽 = 𝑚𝑖𝑛(+∞, 5) = 5.
B now calls E. We pass the 𝜶 and 𝜷 values from the top node B to the bottom node E as [𝜶 = −∞, 𝜷 = 𝟓].
Since node E is at a MAX level, only the 𝜶 value can change. Now, at E, it first checks the left
child (which is a terminal node) with value 6. This node returns a value of 6.
Now, the value of 𝜶 at node E is calculated as 𝛼 = 𝑚𝑎𝑥(−∞, 6) = 6, so the value of node(E) = 6
and the modified values of 𝜶 and 𝜷 at node E are [𝛼 = 6, 𝛽 = 5]. To decide whether it is worth looking
at the right node or not, we check 𝜶 ≥ 𝜷. The answer is YES, since 6 ≥ 5. So we prune (cut)
the right branch of E, as shown in figure (a).

Figure (a) game tree after applying alpha-beta pruning on left side of node A

Similarly, we solve for the right subtree of node A [refer to Example 1 and solve the right-subtree part].
The final tree, with the node value at every node, is shown in figure (b).
Thus, finally, the 𝜶 and 𝜷 values at node A are [𝛼 = 5, 𝛽 = +∞] and the best value at node(A) =
max(5, 2) = 5. So the optimal value for the maximizer is 5, and three terminal nodes are pruned
(9, 0 and −1). The optimal search path is A–B–D (terminal value 5), as shown in figure (b).

Figure (b): Final tree with the node value at every node and the pruned branches
Answer 4: (a) The value of the root node A is 4. For (b) and (c), see the annotated tree (figure omitted),
which marks the α/β values and the pruned branches at each node: B = 4, C = 2 and D = 5 at the MIN
level, with E = 4 below B; the optimal path for the maximizer runs from A through B and E to the
terminal value 4.
Answer 5: 5

Answer 6: 7

Answer 7: Option (B) 17

2.10 FURTHER READINGS

1. Ela Kumar, “ Artificial Intelligence”, IK International Publications


2. E. Rich and K. Knight, “Artificial intelligence”, Tata Mc Graw Hill Publications
3. N.J. Nilsson, “Principles of AI”, Narosa Publ. House Publications
4. John J. Craig, “Introduction to Robotics”, Addison Wesley publication
5. D.W. Patterson, “Introduction to AI and Expert Systems" Pearson publication
UNIT 3 UNINFORMED & INFORMED SEARCH
Structure

3.0 Introduction
3.1 Objectives
3.2 Formulating search in state space
3.2.1 Evaluation of search Algorithm
3.3 Uninformed Search
3.3.1 Breadth-First search (BFS)
3.3.2 Time and space complexity of BFS
3.3.3 Advantages & disadvantages of BFS
3.3.4 Depth First search (DFS)
3.3.5 Performance of DFS algorithm
3.3.6 Advantages and disadvantages of DFS
3.3.7 Comparison of BFS and DFS
3.4 Iterative Deepening Depth First search (IDDFS)
3.4.1 Time and space complexity of IDDFS
3.4.2 Advantages and Disadvantages of IDDFS
3.5 Bidirectional search
3.6 Comparison of Uninformed search strategies
3.7 Informed (heuristic) search
3.7.1 Strategies for providing heuristics information
3.7.2 Formulation of Informed (heuristic) search problem as state space
3.7.3 Best-First search
3.7.4 Greedy Best first search
3.8 A* Algorithm
3.8.1 Working of A* algorithm
3.8.2 Advantages and disadvantages of A* algorithm
3.8.3 Admissibility properties of A* algorithm
3.8.4 Properties of heuristic algorithm
3.8.5 Results on A* algorithm
3.9 Problem reduction search
3.9.1 Problem definition in AND-OR graph
3.9.2 AO* algorithm
3.9.3 Advantages of AO* algorithm
3.10 Memory Bound heuristic search
3.10.1 Iterative Deepening A* (IDA*)
3.10.2 Working of IDA*
3.10.3 Analysis of IDA*
3.10.4 Comparison of A* and IDA* algorithm
3.11 Recursive Best First search (RBFS)
3.11.1 Advantages and disadvantages of RBFS
3.12 Summary
3.13 Solutions/Answers
3.14 Further readings
3.0 INTRODUCTION
Before an AI problem can be solved, it must be represented as a state space. In AI, a wide range
of problems can be formulated as search problems. The process of searching means a sequence of
actions that take you from an initial state to a goal state, as shown in the following figure 1.

Fig 1: A sequence of action in a search space from initial to goal state.

In Unit 2 we have already examined the concept of a state space and the adversarial (game-
playing) search strategy. In many applications there might be multiple agents or persons
searching for solutions in the same solution space. In adversarial search, we need a path that takes
actions towards the winning direction, and for finding a path we need different types of search
algorithms.

Search algorithms are one of the most important areas of Artificial Intelligence. This unit will
explain all about the search algorithms in AI which explore the search space to find a solution.

In Artificial Intelligence, search techniques are universal problem-solving methods. Rational
agents or problem-solving agents in AI mostly use these search strategies or algorithms to
solve a specific problem and provide the best result. Problem-solving agents are goal-based
agents and use an atomic representation.

One disadvantage of the state space representation is that it is not possible to visualize all states for
a given problem. Also, the resources of the computer system are limited when handling huge state-
space representations. Nevertheless, many problems in AI take the form of state-space search:

 The states might be legal board configurations in a game, towns and cities in some sort of
route map, collections of mathematical propositions, etc.
 The state-space is the configuration of the possible states and how they connect to each other
e.g., the legal moves between states.
 When we don't have an algorithm which tells us definitively how to negotiate the state-space,
we need to search the state-space to find an optimal path from a start state to a goal state.
We can only decide what to do (or where to go) by considering the possible moves from the
current state and trying to look ahead as far as possible. Chess, for example, is a very difficult
state-space search problem.

Searching is the process of looking for the solution of a problem through a set of possibilities (the
state space). In general, the searching process starts from the initial state (root node) and proceeds by
performing the following steps:
 Check whether the current state is the goal state or not.
 Expand the current state to generate new sets of states.
 Choose one of the new states generated for search, depending upon the search strategy
(for example BFS, DFS, etc.).
 Repeat steps 1 to 3 until the goal state is reached or there are no more states to be
expanded.
Evaluation (properties) of search strategies: A search strategy is characterized by the
sequence in which nodes are expanded. Search algorithms are commonly evaluated
according to the following four criteria; these four essential properties are used to compare
the efficiency of different search algorithms.
Completeness: A search algorithm is said to be complete if it guarantees to return a solution, if one
exists.
Optimality/Admissibility: If the solution found by an algorithm is guaranteed to be the best
solution (lowest path cost) among all other solutions, then such a solution is said to be an
optimal solution.
Time Complexity: Time complexity is a measure of the time an algorithm takes to complete its task,
usually measured in terms of the number of nodes expanded during the search.
Space Complexity: It is the maximum storage space required at any point during the search,
usually measured in terms of the maximum number of nodes in memory at a time.
Time and space complexity are measured in terms of:
 b - max branching factor of the search tree
 d - depth of the least-cost solution
 m - max depth of the search tree (may be infinity)
In all search algorithms, the order in which nodes are expanded distinguishes them from one
another. There are two broad classes of search methods:

 Uninformed/Blind search: breadth-first search, uniform cost search, depth-first search,
depth-limited search, and iterative deepening depth-first search.
 Informed search: best-first search and A* search.

Uninformed/Blind search
Uninformed search is also called Brute force search or Blind search or Exhaustive search. It is
called blind search because of the way in which search tree is searched without using any
information about the search space. It is called Brute force because it assumes no additional
knowledge other than how to traverse the search tree and how to identify the leaf nodes and goal
nodes. This search ultimately examines every node in the tree until it finds a goal.

Informed search is also called Heuristic (or guided) search. These are the search techniques
where additional information about the problem is provided in order to guide the search in a
specific direction. A heuristic is a method that might not always find the best solution but is
guaranteed to find a good solution in reasonable time. By sacrificing completeness, it increases
efficiency.

The following table summarizes the differences between uninformed and informed search:

Uninformed Search | Informed Search
No information about the path or cost from the current state to the goal state; it does not use domain-specific knowledge for the searching process. | The path cost from the current state to the goal state is calculated, to select the minimum-cost path as the next state; it uses domain-specific knowledge for the searching process.
Finds a solution slowly, compared with informed search. | Finds a solution more quickly.
Less efficient. | More efficient.
Cost is high. | Cost is low.
No suggestion is given regarding the solution; the problem is to be solved with the given information only. | Provides direction regarding the solution; additional information can be added as assumptions to solve the problem.
Examples: Depth-First Search, Breadth-First Search, Depth-Limited Search, Iterative Deepening DFS, Bidirectional Search. | Examples: Best-first search, Greedy search, A* search.

3.1 OBJECTIVES
After studying this unit, you should be able to:
 Differentiate between uninformed and informed search algorithms;
 Formulate a search problem in the form of a state space;
 Explain the differences between various uninformed search approaches such as BFS,
DFS, IDDFS and bidirectional search;
 Evaluate the various uninformed search algorithms with respect to the time, space and
optimality/admissibility criteria;
 Explain informed search such as Best-First search and the A* algorithm;
 Differentiate between the advantages and disadvantages of heuristic search: the A* and AO*
algorithms;
 Differentiate between memory-bounded searches: Iterative Deepening A* and Recursive
Best-First Search.

3.2 Formulating search in state space

A state space is a graph, (V, E) where V is a set of nodes and E is a set of arcs, where each arc is
directed from one node to another node.

 V: a node is a data structure that contains a state description plus, optionally, other
information related to the parent of the node, the operation that generates the node from that
parent, and other bookkeeping data.
 E: each arc corresponds to an applicable action/operation. The source and destination
nodes are called the parent (immediate predecessor) and child (immediate successor)
nodes with respect to each other; further up or down they are ancestors (also called
predecessors) and descendants (also called successors). Each arc has a fixed, non-negative
cost associated with it, corresponding to the cost of the action.

Each node has a set of successor nodes. Corresponding to all operators (actions) that can apply at
source node’s state. Expanding a node is generating successor nodes and adding them (and
associated arcs) to the state-space graph. One or more nodes may be designated as start nodes.

A goal test predicate is applied to a node to determine if its associated state is a goal state. A
solution is a sequence of operations that is associated with a path in the state space from a start
node to a goal node. The cost of a solution is the sum of the arc costs on the solution path.

State-space search is the process of searching through a state space for a solution by making
explicit a sufficient portion of an implicit state-space graph to include a goal node.

Hence, initially V={S}, where S is the start node; when S is expanded, its successors are
generated, and those nodes are added to V and the associated arcs are added to E. This process
continues until a goal node is generated (included in V) and identified (by goal test).

To implement any Uninformed search algorithm, we always initialize and maintain a list called
OPEN and put start node of G in OPEN. If after some time, we find OPEN is empty and we are
not getting “goal node”, then terminate with failure. We select a node 𝒏 from OPEN and if
𝒏 ∈ 𝑮𝒐𝒂𝒍 𝒏𝒐𝒅𝒆 , then terminate with success, else we generate the successor of n (using
operator O) and insert them in OPEN. In this way we repeat the process till search is successful
or unsuccessful.

Search strategies differ mainly on how to select an OPEN node for expansion at each step of
search.
A general search algorithm
1. Initialize: Set 𝑂𝑃𝐸𝑁 = {𝑠}, where s is the start state.
2. Fail: If 𝑂𝑃𝐸𝑁 = { }, terminate with failure.
3. Select: Select a state, n, from OPEN.
4. Terminate: If 𝑛 ∈ 𝐺𝑜𝑎𝑙 𝑛𝑜𝑑𝑒𝑠, terminate with success.
5. Expand: Generate the successors of n using the operator set O and insert them in OPEN.
6. Loop: Go to Step 2.
But there is a problem with the above search algorithm: it does not say that when a node has already
been visited, it should not be revisited, that is, "how can we keep track of the part of the state
space that has already been visited?". So we have an extension of the same algorithm, in which we
save the explicit state space. To save the explicit space, we maintain another list, called
CLOSED.

Thus, to implement any Uninformed search algorithm efficiently, two list OPEN and CLOSED
are used.

Now we can select a node from OPEN and save it in CLOSED. The CLOSED list keeps record
of nodes that are Opened. The major difference of this algorithm with the previous algorithm is
that when we generate successor node from CLOSED, we check whether it is already in
(𝑂𝑃𝐸𝑁 ∪ 𝐶𝐿𝑂𝑆𝐸𝐷). If it is already in (𝑂𝑃𝐸𝑁 ∪ 𝐶𝐿𝑂𝑆𝐸𝐷), we will not insert in OPEN again,
otherwise insert. The following modified algorithm is used to save the explicit space using the
list CLOSED.

Modified search algorithm to saving the explicit space

1. Initialize: Set 𝑂𝑃𝐸𝑁 = {𝑠}, 𝐶𝐿𝑂𝑆𝐸𝐷 = { }.

2. Fail: If 𝑂𝑃𝐸𝑁 = { }, terminate with failure.
3. Select: Select a state, n, from OPEN and save n in CLOSED.
4. Terminate: If 𝑛 ∈ 𝐺𝑜𝑎𝑙 𝑛𝑜𝑑𝑒𝑠, terminate with success.
5. Expand: Generate the successors of n using the operator set O.
For each successor, m, insert m in OPEN only if 𝑚 ∉ (𝑂𝑃𝐸𝑁 ∪ 𝐶𝐿𝑂𝑆𝐸𝐷).
6. Loop: Go to Step 2.

Here the OPEN and CLOSED list are used as follows:

OPEN: Nodes are yet to be visited.

CLOSED: Keeps track of all the nodes visited already

Note that initially OPEN list initializes with start state of G (e.g.,𝑶𝑷𝑬𝑵 = {𝒔}) and CLOSED list as
empty (e.g.,𝑪𝑳𝑶𝑺𝑬𝑫 = {}).

Insertion or removal of any node in OPEN depends on specific search strategy.
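A minimal Python skeleton of this OPEN/CLOSED scheme might look as follows (the
adjacency-dictionary graph encoding and the `select` parameter are our own illustrative
choices, not from the original text):

def graph_search(graph, start, goals, select):
    """Generic state-space search with OPEN and CLOSED lists.

    `graph` maps each node to its successors; `select(open_list)` removes and
    returns one node from OPEN — this single choice fixes the strategy
    (e.g. select = lambda o: o.pop(0) gives BFS; o.pop() gives DFS).
    """
    open_list, closed = [start], set()          # Initialize
    while open_list:                            # Fail when OPEN is empty
        n = select(open_list)                   # Select, and save n in CLOSED
        closed.add(n)
        if n in goals:                          # Terminate with success
            return n
        for m in graph.get(n, []):              # Expand
            if m not in open_list and m not in closed:
                open_list.append(m)
    return None                                 # terminate with failure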


3.2.1 Evaluation of Search Algorithms:

In any search algorithm, we select a node and generate its successors. Search strategies differ
mainly in how an OPEN node is selected for expansion at each step of the search; likewise, the
insertion or deletion of a node from the OPEN list depends on the specific search strategy. Search
algorithms are commonly evaluated according to the following four criteria, which measure the
performance of the search algorithms:

 Completeness: Guarantees finding a solution whenever one exists.

 Time Complexity: How long (worst or average case) does it take to find a solution?
Usually measured in terms of the number of nodes expanded.

 Space Complexity: How much space is used by the algorithm? Usually measured in
terms of the maximum size that the OPEN list reaches during the search. Time
and space complexity are measured in terms of b: the branching factor, or maximum
number of successors of any node; d: the depth of the shallowest goal node (depth of the
least-cost solution); and m: the maximum depth (length) of any path in the state space
(which may be infinite).

 Optimality/Admissibility: If a solution is found, is it guaranteed to be an optimal one?


For example, is it the one with minimum cost?

The search process constructs a search tree, where the root is the initial state S and the leaf nodes are
nodes that have not yet been expanded (i.e., they are in the OPEN list) or that have no successors
(i.e., they are "dead ends"). The search tree may be infinite because of loops, even if the state space
is small. Search strategies mainly differ in how they select a node from OPEN. Each node represents
a partial solution path (and the cost of that partial path) from the start node to the given node; in
general, from such a node there are many possible paths, and therefore many solutions, that have
this partial path as a prefix.
All search algorithms are distinguished by the order in which nodes are expanded. There are
two broad classes of search methods: uninformed search and heuristic search. Let us first
discuss uninformed search.

3.3 UNINFORMED SEARCH


Uninformed search does not use any domain knowledge, such as closeness to or the location of
the goal. It operates in a brute-force way, as it only includes information about how to traverse
the tree and how to identify leaf and goal nodes. Uninformed search applies a scheme in which the
search tree is searched without any information about the search space, like the initial state, operators
and a test for the goal, so it is also called blind search. It examines each node of the tree until it
reaches the goal node. Sometimes we may not have much relevant information to solve a
problem.
For example, suppose we have lost our car key and are not able to recall where we left it. We have
to search for the key with what little information we have, such as the places where we usually keep
it: it may be a trouser pocket, or maybe the table drawer. If it is not there, then we must search the
whole house to get it, from the table to the wardrobe. Here we need to search blindly, with hardly
any clue. This type of search is called uninformed search or blind search.
Based on the order in which nodes are expanded, we have the following types of uninformed
search algorithms:

 Breadth-first search
 Depth-first search
 Uniform cost search
 Iterative deepening depth-first search
 Bidirectional Search

3.3.1 Breadth-first search (BFS):

It is the simplest form of blind search. In this technique the root node is expanded first, then all its
successors are expanded and then their successors and so on. In general, in BFS, all nodes are
expanded at a given depth in the search tree before any nodes at the next level are expanded. It
means that all immediate children of nodes are explored before any of the children’s children are
considered. The search tree generated by BFS is shown below in Fig 2.

Fig 2: Search tree for BFS — the root is expanded first, then its children A, B, C at level 1, and then
their children D, E, F, G at level 2, where G is the goal node.

Note that BFS is a brute-force search, so it generates all the nodes for identifying the goal, and note
that we are using the convention that the alternatives are tried in left-to-right order.

A BFS algorithm uses a queue data structure that works on the FIFO principle. This queue holds
all generated but still unexplored nodes. Please remember that the order in which nodes are
placed on the queue for removal and exploration determines the type of search.
We can implement it by using two lists called OPEN and CLOSED. The OPEN list contains
those states that are to be expanded and CLOSED list keeps track of state already expanded.
Here OPEN list is used as a queue.

BFS is effective when the search tree has a low branching factor.

Breadth-First Search (BFS) Algorithm

1. Initialize: Set OPEN = {s}, where s is the start state.
2. Fail: If OPEN = { }, terminate with failure.
3. Select: Remove the left-most state (say a) from OPEN.
4. Terminate: If a ∈ Goal nodes, terminate with success, else
5. Expand: Generate the successors of node a; discard a successor of a if it is already in OPEN; insert only the remaining successors at the right end of OPEN [i.e., as a QUEUE].
6. Loop: Go to Step 2
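
For concreteness, the loop above can be sketched in Python. This is an illustrative sketch only, not part of the unit's formal statement: the graph is assumed to be given as an adjacency dictionary, and parent pointers are kept so that the solution path can be returned.

from collections import deque

def bfs(graph, start, goal):
    # Breadth-first search: OPEN is a FIFO queue, parent pointers rebuild the path.
    open_list = deque([start])            # OPEN list used as a queue
    parent = {start: None}                # also records every generated node
    while open_list:                      # Fail: OPEN empty means no solution
        a = open_list.popleft()           # Select: remove the left-most state
        if a == goal:                     # Terminate: goal test on selection
            path = []
            while a is not None:
                path.append(a)
                a = parent[a]
            return list(reversed(path))
        for m in graph.get(a, []):        # Expand: generate successors of a
            if m not in parent:           # discard successors already generated
                parent[m] = a
                open_list.append(m)       # insert at the right end (QUEUE)
    return None

# The state-space graph of Example 1 below, in an assumed adjacency encoding:
graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['D', 'G'],
         'D': ['C', 'F'], 'E': [], 'F': ['B'], 'G': []}
print(bfs(graph, 'A', 'G'))               # prints ['A', 'C', 'G']

The printed path A-C-G agrees with the trace worked out step by step in Example 1 below.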
Let us take an example to see how this algorithm works.

Example1: Consider the following graph in fig-1 and its corresponding state space tree
representation in fig-2. Note that A is a start state and G is a Goal state.
[Figure: Fig-1 State space graph with start state A and goal state G; Fig-2 the corresponding state space tree with root A, children B and C, next level D, E (under B) and D, G (under C), then C, F and B, F, and leaves G, G, H, E, G, H]

Step 1: Initially OPEN contains only one node, corresponding to the source state A.
OPEN = [A]    CLOSED = { }
Step 2: A is removed from OPEN. The node is expanded, and its children B and C are generated and placed at the back of OPEN.
OPEN = [B, C]    CLOSED = {A}
Step 3: Node B is removed from OPEN and expanded. Its children D, E are generated and put at the back of OPEN.
OPEN = [C, D, E]    CLOSED = {A, B}
Step 4: Node C is removed from OPEN and expanded. Its children D and G are added to the back of OPEN.
OPEN = [D, E, D, G]    CLOSED = {A, B, C}
Step 5: Node D is removed from OPEN. Its children C and F are generated and added to the back of OPEN.
OPEN = [E, D, G, C, F]    CLOSED = {A, B, C, D}
Step 6: Node E is removed from OPEN. It has no children.
OPEN = [D, G, C, F]    CLOSED = {A, B, C, D, E}
Step 7: D is expanded; B and F are put in OPEN.
OPEN = [G, C, F, B, F]    CLOSED = {A, B, C, D, E, D}
Step 8: G is selected for expansion. It is found to be a goal node, so the algorithm returns the path A-C-G by following the parent pointers of the node corresponding to G, and terminates.
OPEN = [C, F, B, F]    CLOSED = {A, B, C, D, E, G}

3.3.2 Time and Space complexity of BFS:


Consider a complete search tree of depth d where each non-leaf node has b children (i.e., branching factor b). Such a tree has a total of
1 + b + b^2 + b^3 + ... + b^d = (b^(d+1) - 1)/(b - 1) nodes.
Time complexity is measured by the number of nodes generated, so the time complexity of the BFS algorithm is O(b^d).
[Figure: Fig 3 Search tree with branching factor b; level 0 has 1 node (the start node S), level 1 has b nodes, level 2 has b^2 nodes, ..., level d has b^d nodes]
For example, consider a complete search tree of depth 12, where every node at depths 0, 1, ..., 11 has 10 children (branching factor b = 10) and every node at depth 12 has 0 children. Then there are
1 + 10 + 10^2 + ... + 10^12 = (10^13 - 1)/(10 - 1) = O(10^12) nodes in the complete search tree.
 BFS is suitable for problems with shallow solutions.

Space complexity:

BFS has to remember each and every node it has generated. The space complexity (maximum length of the OPEN list) is therefore given by:
1 + b + b^2 + b^3 + ... + b^d = O(b^d).

Performance of BFS:

 Time required for BFS on a tree with branching factor b and depth d is O(b^d).
 Space (memory) requirement for a tree with branching factor b and depth d is also O(b^d).
 The BFS algorithm is complete (if b is finite).
 The BFS algorithm is optimal (if cost = 1 per step).

Space is the bigger problem in BFS as compared to DFS.

3.3.3 Advantages and disadvantages of BFS:

Advantages: BFS has some advantages, given below:

1. BFS will never get trapped exploring a blind alley.

2. It is guaranteed to find a solution if one exists.

Disadvantages: BFS has certain disadvantages also. They are given below:

1. Time complexity and space complexity are both O(b^d), i.e., exponential. This is a major hurdle.

2. All nodes are generated in BFS, so even unwanted nodes must be remembered (stored in the queue), which is of no practical use to the search.

3.3.4 Depth First Search


A Depth-First Search (DFS) explores a path all the way to a leaf before backtracking and exploring another path. That is, it expands the deepest unexpanded node (the most recently generated deepest node) first.

The search tree generated by DFS is shown in the figure below:

Fig 4 Depth first search (DFS) tree

In depth-first search we go as far down as possible into the search tree/graph before backing up and trying alternatives. It works by always generating a descendant of the most recently expanded node until some depth cut-off is reached, and then backtracks to the next most recently expanded node and generates one of its descendants. So only the path of nodes from the initial node to the current node has to be stored in order to execute the algorithm. For example, consider the following tree and see how the nodes are expanded using the DFS algorithm.
Example1:

[Figure: Fig. 5 Search tree for DFS, with root S, children A and B, leaves C, D under A and E, F under B, where F is the goal node]

After searching root node S, then A and C, the search backtracks and tries another path from
A. Nodes are explored in the order 𝑆, 𝐴, 𝐶, 𝐷, 𝐵, 𝐸, 𝐹.

Here again we use the list OPEN, this time as a STACK, to implement DFS. If we find that the first element of OPEN is the goal state, then the search terminates successfully.

Depth-First Search (DFS) Algorithm

1. Initialize: Set OPEN = {s}, where s is the start state.

2. Fail: If OPEN = { }, terminate with failure.

3. Select: Remove the left-most state (say a) from OPEN.

4. Terminate: If a ∈ Goal nodes, terminate with success, else

5. Expand: Generate the successors of node a; discard a successor of a if it is already in OPEN; insert the remaining successors at the left end of OPEN [i.e., as a STACK].

6. Loop: Go to Step 2

Note: The only difference between BFS and DFS is in the Expand step (Step 5). In BFS we always insert the generated successors at the right end of the OPEN list, whereas in DFS we insert them at the left end.
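
For illustration, the same loop can be sketched in Python with OPEN used as a stack; only the Expand step differs from the BFS sketch above. As before, this is an illustrative sketch assuming an adjacency-dictionary encoding.

def dfs(graph, start, goal):
    # Depth-first search: the same loop as BFS, but OPEN is a LIFO stack.
    open_list = [start]                   # OPEN list used as a stack
    parent = {start: None}
    while open_list:
        a = open_list.pop()               # Select: remove the most recent state
        if a == goal:
            path = []
            while a is not None:
                path.append(a)
                a = parent[a]
            return list(reversed(path))
        # Expand: push successors; reversed() preserves left-to-right try order
        for m in reversed(graph.get(a, [])):
            if m not in parent:
                parent[m] = a
                open_list.append(m)       # insert at the left/top end (STACK)
    return None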

Properties of DFS Algorithm:

Suppose b (branching factor) is the maximum number of successors of any node and m is the maximum depth of a leaf node. Then the number of nodes generated (in the worst case) is:

1 + b + b^2 + ... + b^m = O(b^m)


3.3.5 Performance of DFS:

 Time required for DFS on a tree with branching factor b and maximum depth m is O(b^m).

 Space (memory) requirement for a tree with branching factor b and maximum depth m is only O(b·m), since only the current path (plus the siblings of its nodes) has to be stored.

 The DFS algorithm is not complete in general (it is complete only if the search tree is finite).

 The DFS algorithm is not optimal.

Space is the advantage of DFS as compared to BFS.


3.3.6 Advantages and disadvantages of DFS:

Advantages:

 If depth-first search finds a solution without exploring much of the space, then the time and space it takes will be very small.

 The advantage of depth-first search is that its memory requirement is only linear in the depth of the search tree. This is in contrast with breadth-first search, which requires much more space.

Example1: Consider the following graph in fig-1 and its corresponding state space tree
representation in fig-2. Note that A is a start state and G is a Goal state.

[Figure: Fig-1 State space graph with start state A and goal state G; Fig-2 the corresponding state space tree with root A, children B and C, next level D, E (under B) and D, G (under C), then C, F and B, F, and leaves G, G, H, E, G, H]


Step 1: Initially OPEN contains only one node, corresponding to the source state A.
OPEN = [A]    CLOSED = { }
Step 2: A is removed from OPEN. The node A is expanded, and its children B and C are generated and placed at the front of OPEN.
OPEN = [B, C]    CLOSED = {A}
Step 3: Node B is removed from OPEN and expanded. Its children D, E are generated and put at the front of OPEN.
OPEN = [D, E, C]    CLOSED = {A, B}
Step 4: Node D is removed from OPEN and expanded. Its children C and F are added to the front of OPEN.
OPEN = [C, F, E, C]    CLOSED = {A, B, D}
Step 5: Node C is removed from OPEN. Its children G and F are generated; F is already in OPEN, so only G is added to the front of OPEN.
OPEN = [G, F, E, C]    CLOSED = {A, B, D, C}
Step 6: Node G is expanded and found to be a goal node. The solution path A-B-D-C-G is returned and the algorithm terminates.
OPEN = [F, E, C]    CLOSED = {A, B, D, C, G}


3.3.7 Comparison of BFS and DFS

BFS goes level by level, but requires more space as compared to DFS. The space required by DFS is O(d), where d is the depth of the tree, but the space required by BFS is O(b^d).

DFS: The problem with this approach is that if there is a node close to the root, but not in the first few subtrees explored by DFS, then DFS reaches that node very late. Also, DFS may not find the shortest path to a node (in terms of the number of edges).
[Figure: Fig 6 Path sequence in DFS; an infinite tree rooted at node 0 with children 1 and 2 and further nodes 3, 4, 5, 6, where the node to be searched is 2 but the path followed by DFS first dives left towards node 1]


Suppose we want to find node 2 of the given infinite undirected graph/tree. A DFS starting from node 0 will dive left, towards node 1 and its descendants, whereas node 2 is just adjacent to the root.

Hence, a DFS wastes a lot of time before coming back to node 2.

An Iterative Deepening Depth First Search overcomes this and quickly finds the required node.

3.4 Iterative Deepening Depth First Search (IDDFS)

Iterative Deepening Depth First Search (IDDFS) suffers neither the drawbacks of BFS nor those of DFS on trees; it takes the advantages of both strategies.
It begins by performing DFS to a depth of zero, then a depth of one, a depth of two, and so on, until a solution is found or some maximum depth is reached.
It is like BFS in that it explores a complete layer of new nodes at each iteration before going to the next layer; it is like DFS within a single iteration.
It is preferred when there is a large search space and the depth of a solution is not known. However, it performs some wasted computation before reaching the goal depth. Since IDDFS expands all nodes at a given depth before expanding any nodes at a greater depth, it is guaranteed to find a shortest-length (path) solution from the initial state to a goal state.
At any given time it is performing a DFS, and it never searches deeper than the current depth d. Hence, it uses the same space as DFS.
The disadvantage of IDDFS is that it performs wasted computation prior to reaching the goal depth.

Algorithm (IDDFS)
Initialize d = 1 /* depth of search tree */, Found = false
While (Found = false)
Do {
    Perform a depth-first search from start to depth d.
    If a goal state is obtained
        then Found = true
    else
        discard the nodes generated in the search to depth d
        and set d = d + 1 (deepen the search by one level)
} /* end while */

Report the solution, if obtained

Stop
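
A Python sketch of this loop is given below. The depth-limited helper and the adjacency-dictionary encoding are illustrative assumptions, not part of the unit's formal algorithm.

def depth_limited(graph, node, goal, limit, path):
    # DFS from node down to the given depth limit; returns a path or None.
    if node == goal:
        return path
    if limit == 0:
        return None
    for child in graph.get(node, []):
        if child not in path:             # avoid loops along the current path
            found = depth_limited(graph, child, goal, limit - 1, path + [child])
            if found is not None:
                return found
    return None

def iddfs(graph, start, goal, max_depth=50):
    # Iterative deepening: repeat depth-limited DFS with d = 0, 1, 2, ...
    for d in range(max_depth + 1):
        found = depth_limited(graph, start, goal, d, [start])
        if found is not None:             # goal found at the shallowest depth d
            return found
    return None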
Let us consider the following example to understand the IDDFS:

 Here the initial state is A and the goal state is M:

[Figure: a complete binary tree with root A; children B, C; grandchildren D, E, F, G; and leaves H, I, J, K, L, M, N, O]
The iterative deepening search proceeds as follows:

Iterative deepening search, L = 0 (limit = 0): only A is visited.

Iterative deepening search, L = 1 (limit = 1): A is expanded, and the nodes are visited in the order A, B, C.

Iterative deepening search, L = 2 (limit = 2): the nodes are visited in the order A, B, D, E, C, F, G.

Iterative deepening search, L = 3 (limit = 3): the nodes are visited in the order A, B, D, H, I, E, J, K, C, F, L, M; the goal M is found at depth 3 and the search stops.
3.4.1 Time and space complexities of IDDFS

The time and space complexities of the IDDFS algorithm are O(b^d) and O(d) respectively.

It can be shown that depth-first iterative deepening is asymptotically optimal among brute-force tree searches in terms of time, space and length of the solution. In fact, it is linear in its space complexity like DFS, and is asymptotically optimal to BFS in terms of the number of nodes expanded.
Please note that, in general, iterative deepening is the preferred uninformed search method when there is a large search space and the depth of the solution is unknown. Also note that iterative deepening search is analogous to BFS in that it explores a complete layer of nodes before going to the next layer.

3.4.2 Advantages and Disadvantages of IDDFS:

Advantages:

1. It combines the benefits of the BFS and DFS search algorithms in terms of fast search and memory efficiency.
2. It is guaranteed to find a shortest-path solution.
3. It is the preferred uninformed search method when the search space is large and the depth of the solution is not known.
Disadvantages

1. The main drawback of IDDFS is that it repeats all the work of the previous phase. That is, it performs wasted computation before reaching the goal depth.
2. The time complexity is O(b^d), i.e., still exponential.

The following summarizes when to use which algorithm:

DFS: many solutions exist, and we know (or have a good estimate of) the depth of the solution.
BFS: some solutions are known to be shallow.
IDDFS: space is limited and the shortest solution path is required.

3.5 Bidirectional Search


This search technique expands nodes from the start and goal state simultaneously. Check at each
stage if the nodes of one have been generated by the other. If so, the path concatenation is the
solution.

 This search is used when a problem has a single goal state that is given explicitly and all the node generation operators have inverses.
 It is therefore used to find the shortest path from an initial node to a goal node, returning the path rather than just the goal itself.
 It works by searching forward from the initial node and backward from the goal node
simultaneously, by hoping that two searches meet in the middle.
 Check at each stage if the nodes of one have been generated by the other, i.e., they meet in
the middle.
 If so, the path concatenation is the solution.

Thus, the BS algorithm is applicable when generating predecessors is easy in both the forward and backward directions and there exists a single, explicitly given goal state. The following figure illustrates how bidirectional search is executed.

[Figure: Fig 7 Bidirectional search; a search tree with root node 1 and goal node 16, where a forward search from node 1 (through nodes 4 and 8) and a backward search from node 16 (through nodes 12 and 10) meet at the intersection node 9]

We have node 1 as the start/root node and node 16 as the goal node. The algorithm divides the search tree into two sub-trees. So, from start node 1 we do a forward search, and at the same time we do a backward search from goal node 16. The forward search traverses nodes 1, 4, 8 and 9, whereas the backward search traverses nodes 16, 12, 10 and 9. We see that the forward and backward searches meet at node 9, called the intersection node. So the path traced by the forward search, concatenated with the path traced by the backward search, is the optimal solution. This is how the BS algorithm is implemented.

Advantages:
 Since BS uses various techniques like DFS, BFS, depth-limited search (DLS), etc., it is efficient and requires less memory.
Disadvantages:
 Implementation of the bidirectional search tree is difficult.
 In bidirectional search, one should know the goal state in advance.
 It can be practically inefficient due to the additional overhead of checking, at each step of the search, whether the two frontiers intersect.
Time complexity:

The total number of nodes expanded in bidirectional search is 2·b^(d/2) = O(b^(d/2)), where b is the branching factor and d is the depth of the shallowest goal node.

[Figure: two search frontiers, one growing from the initial state and one from the final state, each to depth d/2]
Finally, the Bidirectional search is:

 Complete? Yes
 Time Complexity: O(bd/2)
 Space complexity: O(bd/2)
 Optimal: Yes (if step cost is uniform in both forward and backward directions)
3.6 Comparison of Uninformed search strategies

The following Table-1 compares the efficiency of the uninformed search algorithms. These are the measures used to evaluate the performance of the search algorithms:

Table-1 Performance of uninformed search algorithms

             BFS       DFS       IDDFS     Bidirectional Search (if applicable)
Time         O(b^d)    O(b^m)    O(b^d)    O(b^(d/2))
Space        O(b^d)    O(b·m)    O(b·d)    O(b^(d/2))
Optimum?     Yes       No        Yes       Yes
Complete?    Yes       No        Yes       Yes

where
b = branching factor
d = depth of the shallowest goal state
m = maximum depth of the search space

☞ Check Your Progress 1


Q.1 Apply the BFS and DFS algorithms on the following graph; clearly show the contents of the OPEN and CLOSED lists.

[Graph figure: nodes 2 to 9 with edges as shown, and the goal state as marked]

Q.2 Apply BFS algorithm on the following tree (M is goal node)

[Tree figure: a tree whose second level contains B, C, D; third level E, F, G, H, I, J; fourth level K, L, M, N, O, P, Q, R; and leaves S, T, U; M is the goal node]

Q.3 Apply BFS algorithm on the following graph.

[Graph figure: nodes A, B, C, D, E, F, G, H with edges as shown]
Let A be the start state and G be the final or goal state to be searched.

Q.4 Compare the uninformed search algorithms with respect to time, space, optimality and completeness.

3.7 INFORMED (HEURISTIC) SEARCH


Uninformed (blind) search is inefficient in most cases because it does not have any domain-specific knowledge about the goal state. Heuristic search uses domain-dependent (heuristic) information beyond the definition of the problem itself in order to search the space more efficiently.

The following are some ways of using heuristic information:

 Deciding which node to expand next, instead of doing the expansion in a strictly breadth-
first or depth-first order;
 In the course of expanding a node, deciding which successor or successors to generate,
instead of blindly generating all possible successors at one time:
 Deciding that certain nodes should be discarded, or pruned, from the search space.

Informed search algorithms use domain knowledge. In an informed search, problem information is available which can guide the search. Informed search strategies can find a solution more efficiently than an uninformed search strategy. Informed search is also called heuristic search.
A heuristic is guesswork, or additional information about the problem. It may miss the solution if wrong heuristics are supplied. However, in almost all problems, with correct heuristic information it provides a good solution in reasonable time.
Informed search can solve much more complex problems which could not be solved in any other way. We have the following informed search algorithms:

1. Best-First Search
2. A* algorithm
3. Iterative Deepening A*

3.7.1 Strategies for providing heuristic information:

Informed search algorithms are most useful for large search spaces. All informed search algorithms use the idea of a heuristic, so informed search is also called heuristic search.

“Heuristics are criteria, methods or principles for deciding which among several alternative
courses of action promises to be the most effective in order to achieve some goal”

Heuristic Function:

Heuristic information is provided in the form of a function called the heuristic function.


 A heuristic is a function used in informed search that identifies the most promising path.

 It takes the current state of the agent as its input and produces an estimate of how close the agent is to the goal.

 The heuristic function estimates how close a state is to the goal. It is represented by h(n), and it estimates the cost of an optimal path between the pair of states.

 The heuristic method might not always give the best solution, but it is guaranteed to find a good solution in reasonable time.

 This technique is used to find a solution quickly.

 Informed search defines a heuristic function, h(n), that estimates the "goodness" of a node n.
 The heuristic function is an estimate, based on domain-specific information computable from the current state description, of how close we are to a goal.
 Specifically, h(n) = estimated cost (or distance) of a minimal-cost path from state n to a goal state.
A heuristic function at a node n is an estimate of the optimum cost from the current node to a goal. It is denoted by h(n):

h(n) = estimated cost of the cheapest path from node n to a goal node

For example, suppose you want to find a shortest path from Kolkata to Guwahati. Then the heuristic value of a city may be taken as the straight-line distance from that city to Guwahati, e.g.,

h(Kolkata) = euclideanDistance(Kolkata, Guwahati)
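
As a tiny illustration, such a straight-line heuristic could be computed as below. The coordinates are hypothetical values made up for the example, not real map data.

import math

def h_sld(node, goal, coords):
    # Straight-line (Euclidean) distance heuristic h(n).
    (x1, y1), (x2, y2) = coords[node], coords[goal]
    return math.hypot(x2 - x1, y2 - y1)

# Hypothetical planar coordinates (illustrative only):
coords = {'Kolkata': (0.0, 0.0), 'Guwahati': (500.0, 180.0)}
print(round(h_sld('Kolkata', 'Guwahati', coords)))   # prints 531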

3.7.2 Formulation of informed (heuristic) search problem as State Space


Informed search problems can be represented as a state space. The state space of a problem includes an initial state, one or more goal states, a set of state transition operators O (a set of rules) used to change the current state to another state, and a heuristic function h.
In general, a state space is represented by a 5-tuple: S: [S, s0, O, G, h]
where S: (implicitly specified) set of all possible states (possibly infinite);
s0: start state of the problem, s0 ∈ S;
O: set of state transition operators, each having the same cost; these are used to change one state into another, and form the set of arcs (or links) between nodes;
G: set of goal states, G ⊆ S;
h(): a heuristic function, estimating the distance to a goal node.
[Figure: Fig 8 State space with initial and goal node; actions lead from the initial state through intermediate states of the state space to the goal state]

We need to find a sequence of actions which transforms the agent from the initial state s0 to a goal state in G. A state space is commonly defined as a directed graph or as a tree in which each node is a state and each arc represents the application of an operator transforming a state to a successor state.
Thus, the problem is solved by using the rules (operators), in combination with an appropriate control strategy, to move through the problem space until a path from the initial state to a goal state is found. This process is known as search. A solution path is a path in the state space from s0 (initial state) to G (goal state).

We have already seen that an OPEN list is used to implement an uninformed (blind) search (Section 3.2). The problem with using only the one list OPEN is that it is not possible to keep track of nodes that have already been visited, i.e., to maintain the part of the state space that has already been explored. To record the explored space, we maintain another list called CLOSED. We select a node from OPEN and save it in CLOSED. Now, when we generate a successor of a node from CLOSED, we check whether it is already in (OPEN ∪ CLOSED). If it is already in (OPEN ∪ CLOSED), we do not insert it in OPEN again; otherwise we insert it.

3.7.3 Best-First Search

Best-first search uses an evaluation function f(n) that gives, for each node, an indication of which node to expand next. Every node in the search space has an evaluation (heuristic) function associated with it. The heuristic value h(n) of a node indicates how far the node is from the goal node. Note that the evaluation function is a heuristic cost function in the case of a minimization problem, or an objective function in the case of maximization. The decision of which node to expand next depends on the value of the evaluation function. The evaluation value is the cost/distance of the current node from the goal node, so for the goal node the evaluation function value is 0.

Based on the evaluation function, f(n), Best-first search can be categorized into the following
categories:
1) Greedy Best first search
2) A* search

The following two lists (OPEN and CLOSED) are maintained to implement these two algorithms:
1. OPEN - all those nodes that have been generated and have had the heuristic function applied to them, but have not yet been examined.
2. CLOSED - contains all nodes that have already been examined.

3.7.4 Greedy Best-First Search:

The greedy best-first search algorithm always selects the path which appears best at that moment. It combines aspects of depth-first and breadth-first search, ordering the search by a heuristic function. Best-first search allows us to take the advantages of both algorithms: at each step we can choose the most promising node. In the greedy best-first search algorithm, we expand the node which appears closest to the goal node, where the closeness is estimated by the heuristic function, i.e.,

f(n) = h(n)

where h(n) = estimated cost from node n to the goal.

The greedy best-first algorithm is implemented using a priority queue (to order nodes by their heuristic function value).

Best first search algorithm:


1. Initialize: Set OPEN = {s}, CLOSED = { };
   f(s) = h(s)
2. Fail: If OPEN = { }, terminate with failure.
3. Select: Select the minimum-cost state n from OPEN; save n in CLOSED.
4. Terminate: If n ∈ Goal nodes, terminate with success and return f(n), else
5. Expand: For each successor m of n,
   If m ∉ [OPEN ∪ CLOSED]
      Set f(m) = h(m) and insert m in OPEN
   ………
6. Loop: Go to Step 2

OR Pseudocode (for Best-First Search algorithm)

Step 1: Place the starting node into the OPEN list.


Step 2: If the OPEN list is empty, Stop and return failure.
Step 3: Remove the node n, from the OPEN list which has the lowest value of h(n),
and places it in the CLOSED list.
Step 4: Expand the node n, and generate the successors of node n.
Step 5: Check each successor of node n, and find whether any node is a goal node
or not. If any successor node is goal node, then return success and terminate the
search, else proceed to Step 6.
Step 6: For each successor node, the algorithm computes the evaluation function f(n), and then checks whether the node is in either the OPEN or the CLOSED list. If the node is in neither list, add it to the OPEN list.
Step 7: Return to Step 2.
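
The pseudocode above can be turned into a short Python sketch using a priority queue (heapq). This is an illustration only; the adjacency encoding of Example 1 below (and its h-values) is an assumption made for the demonstration.

import heapq

def greedy_best_first(graph, h, start, goal):
    # OPEN is a priority queue ordered by f(n) = h(n) alone.
    open_list = [(h[start], start, [start])]     # (f-value, node, path so far)
    closed = set()
    while open_list:                             # Step 2: fail if OPEN is empty
        f, n, path = heapq.heappop(open_list)    # Step 3: lowest h(n) first
        if n == goal:                            # Step 5: goal test
            return path
        closed.add(n)
        for m in graph.get(n, []):               # Steps 4 and 6
            if m not in closed and all(m != node for _, node, _ in open_list):
                heapq.heappush(open_list, (h[m], m, path + [m]))
    return None

# Example 1 below, in an assumed adjacency encoding with its h-values:
graph = {'A': ['B', 'C', 'D'], 'B': [], 'C': ['E', 'F'],
         'D': [], 'E': [], 'F': ['G'], 'G': []}
h = {'A': 40, 'B': 32, 'C': 25, 'D': 35, 'E': 19, 'F': 17, 'G': 0}
print(greedy_best_first(graph, h, 'A', 'G'))     # prints ['A', 'C', 'F', 'G']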

Advantages of Best-First search:

 Best first search can switch between BFS and DFS, thus gaining the advantages of both
the algorithms.

 This algorithm is more efficient than BFS and DFS algorithms.

Disadvantages of Best-First search:

 Chances of getting stuck in a loop are higher.

 It can behave as an unguided depth-first search in the worst-case scenario.

Consider the following example for better understanding of greedy Best-First search algorithm.

Example 1: Consider the following example (graph) with heuristic function values h(n), which illustrates greedy best-first search. Note that in this example the heuristic function is defined as

h_SLD(n) = straight-line distance from n to the goal

[Graph figure: nodes A, B, C, D, E, F, H and goal G, with edge costs as shown, e.g., A-C = 14, C-F = 10, F-G = 20]

Let the heuristic function value h(n) of each node n (its straight-line distance to the goal node G) be defined as:

h(A) = 40, h(B) = 32, h(C) = 25, h(D) = 35, h(E) = 19, h(F) = 17, h(H) = 10, h(G) = 0

h(n) = straight-line distance from node n to G. Note that h(G) = 0.

The nodes added/deleted from OPEN and CLOSED list using Best-First Search algorithm are
shown below.

OPEN                    CLOSED
[A]                     [ ]
[C, B, D]               [A]
[F, E, B, D]            [A, C]
[G, E, B, D]            [A, C, F]
[E, B, D]               [A, C, F, G]

The explanation is as follows:

Step 1: Initially the OPEN list starts with the start state A, and the CLOSED list is empty.
Step 2: Children of A = {C[25], B[32], D[35]}, so OPEN = {C[25], B[32], D[35]}; therefore Best = C, so expand node C next.
Step 3: Children of C = {E[19], F[17]}, so OPEN = {F[17], E[19], B[32], D[35]}; therefore Best = F, so expand node F next.
Step 4: Children of F = {G[0]}, therefore OPEN = {G[0], E[19], B[32], D[35]}; Best = G, and this is a goal node, so stop.
Finally, we get the path A-C-F-G, whose cost is 14 + 10 + 20 = 44.
Example 2: Consider the search problem below, which we will traverse using greedy best-first search. At each iteration, each node is expanded using the evaluation function f(n) = h(n), which is given in the following table.

node   h(n)
S      13
A      12
B      4
C      7
D      3
E      8
F      2
H      4
I      9
G      0

[Graph figure: start node S with children A and B; A has children C and D; B has children E and F; F has children I and G, where G is the goal node with h(G) = 0]

Here we use two lists, OPEN and CLOSED. The following are the iterations for traversing the above example.

Initialization: OPEN = [A, B], CLOSED = [S]

Iteration 1: OPEN = [A], CLOSED = [S, B]

Iteration 2: OPEN = [E, F, A], CLOSED = [S, B]
             then OPEN = [E, A], CLOSED = [S, B, F]

Iteration 3: OPEN = [I, G, E, A], CLOSED = [S, B, F]
             then OPEN = [I, E, A], CLOSED = [S, B, F, G]

Hence the final solution path will be S --> B --> F --> G.

Evaluation of the greedy best-first search algorithm:

Time complexity: The worst-case time complexity of greedy best-first search is O(b^m).

Space complexity: The worst-case space complexity of greedy best-first search is also O(b^m), where m is the maximum depth of the search space.

Complete: Greedy best-first search is incomplete, even if the given state space is finite.

Optimal: The greedy best-first search algorithm is not optimal.

Example3: Apply Greedy Best-First Search algorithm on the following graph (L is a goal node).
[Graph figure: Fig. 2.17 Greedy best-first search graph; start node S with children (A:2), (B:6), (C:5); A has children (D:10), (E:8); B has children (F:13), (G:14); C has child (H:7); H has children (I:5), (J:6); I has children (K:1), (L:0), (M:2), where L is the goal node and the numbers are evaluation (heuristic) function values]

Working: We start with the start node S. Now, S has three children, A, B and C, with heuristic function values 2, 6 and 5 respectively. These values indicate approximately how far the nodes are from the goal node. So we write the children of S as (A:2), (B:6), (C:5).
Of these, the node with minimum value is (A:2). So we select A, and its children are explored (generated). Its children are (D:10) and (E:8).
The search process now has four nodes to search, namely (B:6), (C:5), (D:10) and (E:8).
Of these, node C has the minimal value of 5. So we select it and expand it, obtaining (H:7) as its child.
Now the nodes to search are as follows: (B:6), (D:10), (E:8) and (H:7), and so on.

The working of the algorithm can be represented in tabular form as follows:

Step | Node being expanded | Children (on expansion) | Available nodes (to search) | Node chosen
1 | S | (A:2), (B:6), (C:5) | (A:2), (B:6), (C:5) | (A:2)
2 | A | (D:10), (E:8) | (B:6), (C:5), (D:10), (E:8) | (C:5)
3 | C | (H:7) | (B:6), (D:10), (E:8), (H:7) | (B:6)
4 | B | (F:13), (G:14) | (D:10), (E:8), (H:7), (F:13), (G:14) | (H:7)
5 | H | (I:5), (J:6) | (D:10), (E:8), (F:13), (G:14), (I:5), (J:6) | (I:5)
6 | I | (K:1), (L:0), (M:2) | (D:10), (E:8), (H:7), (F:13), (G:14), (J:6), (K:1), (L:0), (M:2) | Goal node L is found, so the search stops
3.8 A* Algorithm

A* search is the most commonly known form of best-first search. It uses heuristic function h(n),
and cost to reach the node n from the start state g(n). It has combined features of uniform cost
search (UCS) and greedy best-first search, by which it solves the problem efficiently. A* search
algorithm finds the shortest path through the search space using the heuristic function. This
search algorithm expands less search tree and provides optimal result faster.
In the A* search algorithm, we use the search heuristic as well as the cost to reach the node. Hence, we can combine both costs as follows, and this sum is called the fitness number (evaluation function):
f(n) = g(n) + h(n)

(estimated cost of the cheapest solution) = (cost to reach node n from the start state) + (cost to reach the goal node from node n)
Fig 9 Evaluation function f(n) in A* Algorithm

The A* algorithm evaluation function f(n) is defined as f(n) = g(n) + h(n)

where g(n) = sum of edge costs from the start state to n,
and h(n) = estimate of the lowest-cost path from node n to a goal node.

If h(n) is admissible then the search will find an optimal solution. Admissible means that h never overestimates the cost of any solution which can be reached from the node. In other words, a heuristic is called admissible if it always under-estimates, that is, we always have h(n) ≤ h*(n), where h*(n) denotes the minimum distance to a goal state from state n.
A* search begins at the root node, and the search continues by visiting next the node which has the least evaluation value f(n).
It evaluates nodes using the evaluation function
f(n) = g(n) + h(n) = estimated cost of the cheapest solution through n,
where
g(n): the actual shortest distance travelled from the initial node to the current node; it helps to avoid expanding paths that are already expensive;
h(n): the estimated (or "heuristic") distance from the current node to the goal; it estimates which node is closest to the goal node.
Nodes are visited in this manner until a goal is reached.
Nodes are visited in this manner until a goal is reached.

Suppose s is a start state. The calculation of the evaluation function f(n) for any node n is shown in Figure 10.

[Figure: Fig 10 Calculation of the evaluation function f(n); the known path cost g(n) from s to n is added to the heuristic estimate h(n) from n onward to the goal, giving f(n) = g(n) + h(n)]

Algorithm A*

1. Initialize: Set OPEN = {s}, CLOSED = { }, g(s) = 0, f(s) = h(s)
2. Fail: If OPEN = { }, terminate and fail.
3. Select: Select the minimum-cost state, n, from OPEN; save n in CLOSED.
4. Terminate: If n ∈ G, terminate with success, and return f(n).
5. Expand: For each successor, m, of n
      If m ∉ [OPEN ∪ CLOSED]
         Set g(m) = g(n) + C(n,m)
         Set f(m) = g(m) + h(m)
         Insert m in OPEN
      If m ∈ [OPEN ∪ CLOSED]
         Set g(m) = min {g(m), g(n) + C(n,m)}
         Set f(m) = g(m) + h(m)
         If f(m) has decreased and m ∈ CLOSED,
            move m to OPEN
6. Loop: Go to Step 2.

In Step 5, we generate the successors of n. For each successor m, if it does not belong to OPEN or CLOSED, that is m ∉ [OPEN ∪ CLOSED], then we insert it in OPEN with the cost g(n) + C(n,m), i.e., the cost up to n plus the additional cost from n to m.
If m ∈ [OPEN ∪ CLOSED] then we set g(m) to the minimum of the original cost and the new cost [g(n) + C(n,m)]. If we arrive at some state by another path which has less cost than the original one, then we replace the existing cost with this minimum cost.
If we find that f(m) has decreased (if it is larger, we ignore it) and m ∈ CLOSED, then we move m from CLOSED to OPEN.
Note that, the implementation of A* Algorithm involves maintaining two lists- OPEN and
CLOSED. The list OPEN contains those nodes that have been evaluated by the heuristic
function but have not expanded into successors yet and the list CLOSED contains those nodes
that have already been visited.
See the following steps for working of A* algorithm:
Step-1: Define a list OPEN. Initially, OPEN consists of a single
node, the start node S.
Step-2: If the list is empty, return failure and exit.
Step-3: Remove node n with the smallest value of f(n) from OPEN
and move it to list CLOSED.
If node n is a goal state, return success and exit.
Step-4: Expand node n.
Step-5: If any successor to n is the goal node, return success and report the solution by tracing the path from the goal node to S. Otherwise, go to Step-6.
Step-6: For each successor node,
    apply the evaluation function f to the node;
    if the node has not been in either list, add it to OPEN.
Step-7: Go back to Step-2.
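
For illustration, here is a minimal Python sketch of A*. Instead of explicitly moving nodes back from CLOSED to OPEN, it uses the common lazy-deletion variant: a node is simply re-inserted into the priority queue whenever a cheaper g-value is found, and stale entries are skipped on selection. The adjacency encoding (node -> list of (successor, cost) pairs) is an assumption for the sketch.

import heapq

def a_star(graph, h, start, goal):
    # A*: OPEN ordered by f(n) = g(n) + h(n); graph maps node -> [(succ, cost)].
    g = {start: 0}
    parent = {start: None}
    open_list = [(h[start], start)]       # (f-value, node)
    closed = set()
    while open_list:
        f, n = heapq.heappop(open_list)   # Select the minimum-f node
        if n in closed:
            continue                      # stale queue entry; skip it
        if n == goal:                     # Terminate and rebuild the path
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return list(reversed(path)), g[goal]
        closed.add(n)
        for m, c in graph.get(n, []):     # Expand the successors of n
            if g[n] + c < g.get(m, float('inf')):
                g[m] = g[n] + c           # a cheaper path to m was found
                parent[m] = n
                heapq.heappush(open_list, (g[m] + h[m], m))
    return None, float('inf')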

3.8.1 Working of A* algorithm

Example 1: Let us consider the following graph to understand the working of the A* algorithm. The numbers written on the edges represent the distances between the nodes; the numbers written on the nodes represent the heuristic values. Find the most cost-effective path to reach from the start state A to the final state J using the A* algorithm.

[Graph figure: start node A, goal node J; edge costs include A-B = 6, A-F = 3, F-G = 1, F-H = 7, G-I = 3, I-E = 5, I-H = 2, I-J = 3; heuristic values include h(B) = 8, h(F) = 6, h(G) = 5, h(H) = 3, h(I) = 1, h(E) = 3, h(J) = 0]
Step-1:
We start with node A. Node B and node F can be reached from node A. The A* algorithm calculates f(B) and f(F). The estimated cost f(n) = g(n) + h(n) for node B and node F is:
f(B) = 6 + 8 = 14
f(F) = 3 + 6 = 9

Since f(F) < f(B), the algorithm decides to go to node F.

 Closed list: (F)
Path: A → F
Step-2: Node G and node H can be reached from node F.
The A* algorithm calculates f(G) and f(H):
f(G) = (3+1) + 5 = 9
f(H) = (3+7) + 3 = 13
Since f(G) < f(H), it decides to go to node G.
 Closed list: (G)
Path: A → F → G
Step-3: Node I can be reached from node G.
The A* algorithm calculates f(I):
f(I) = (3+1+3) + 1 = 8; it decides to go to node I.
 Closed list: (I)
Path: A → F → G → I
Step-4: Node E, node H and node J can be reached from node I.
The A* algorithm calculates f(E), f(H), f(J):
f(E) = (3+1+3+5) + 3 = 15
f(H) = (3+1+3+2) + 3 = 12
f(J) = (3+1+3+3) + 0 = 10
Since f(J) is least, it decides to go to node J.
 Closed list: (J)
Shortest path: A → F → G → I → J
Path cost: 3+1+3+3 = 10
Example 2: Consider the following graph, apply the A* algorithm, and find the most cost-effective path to reach from the start state S to the final state G. The heuristic function value of each node n is defined in the table given below.

State   h(n)
S       5
A       3
B       4
C       2
D       6
G       0

[Graph figure: edges S-A = 1, S-G = 10, A-B = 2, A-C = 1, B-D = 5, C-D = 3, C-G = 4, D-G = 2]
Solution: The f-values of the candidate paths are:
S→A = 1+3 = 4
S→G = 10+0 = 10
S→A→B = 1+2+4 = 7
S→A→C = 1+1+2 = 4
S→A→C→D = 1+1+3+6 = 11
S→A→C→G = 1+1+4 = 6
S→A→B→D = 1+2+5+6 = 14
S→A→C→D→G = 1+1+3+2 = 7
S→A→B→D→G = 1+2+5+2 = 10
A* therefore returns the minimum-cost path S→A→C→G, with cost 1+1+4 = 6.
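
The same computation can be checked mechanically with the a_star() sketch given earlier; the adjacency encoding below is an assumption made for illustration.

# Example 2 encoded for the a_star() sketch above (assumed encoding):
graph = {'S': [('A', 1), ('G', 10)], 'A': [('B', 2), ('C', 1)],
         'B': [('D', 5)], 'C': [('D', 3), ('G', 4)], 'D': [('G', 2)]}
h = {'S': 5, 'A': 3, 'B': 4, 'C': 2, 'D': 6, 'G': 0}
print(a_star(graph, h, 'S', 'G'))   # prints (['S', 'A', 'C', 'G'], 6)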
3.8.2 Advantages and disadvantages of A* algorithm
Advantages:
 The A* search algorithm is better than the other search algorithms discussed so far.
 A* search is optimal and complete (given an admissible heuristic).
 This algorithm can solve very complex problems.

Disadvantages:
 It does not always produce the shortest path when the heuristic is poor, as it relies on heuristics and approximation.
 The A* search algorithm has some complexity issues.
 The main drawback of A* is its memory requirement: it keeps all generated nodes in memory, so it is not practical for various large-scale problems.

3.8.3 Admissibility Properties of A* algorithm


A heuristic is called admissible if it always under-estimates, that is, we always have h(n) ≤ h*(n), where h*(n) denotes the minimum distance to a goal state from state n. For finite state spaces, A* always terminates.

In other words, if the heuristic function h always underestimates the true cost h* (that is, the heuristic cost h(n) is smaller than the true cost h*(n)), then A* is guaranteed to find an optimal solution.

If there is a path from s to a goal state, A* terminates (even when the state space is infinite). Algorithm A* is admissible; that is, if there is a path from s to a goal state, A* terminates by finding an optimal path. If we are given two or more admissible heuristics, we can take their max to get a stronger admissible heuristic.

Admissibility condition:
By an admissible algorithm we mean that the algorithm is sure to find an optimal solution if one exists. Please note that this is possible only when the heuristic value never overestimates the distance of the node to the goal. Also note that if the heuristic value is exactly the distance of the node to the goal, then the algorithm heads straight for the solution. For example, the A* algorithm discussed above is admissible. There are three conditions to be satisfied for A* to be admissible.
They are as follows:
1. Each node in the graph has a finite number of successors (possibly 0).
2. All arcs in the graph have costs greater than some positive amount ε.
3. For each node n in the graph, h'(n) ≤ h(n), where h'(n) denotes the heuristic estimate and h(n) the true cost from n to the goal.

This implies that the heuristic guess of the cost of getting from node n to the goal is never an overestimate. This is known as the heuristic condition. Only if these three conditions are satisfied is A* guaranteed to find an optimal (least-cost) path. Please note that the A* algorithm is admissible because, for any node n on such a path, h'(n) is always less than or equal to h(n). This is possible only when the evaluation function value never overestimates the distance of the node to the goal. Although the admissibility condition requires h'(n) to be a lower bound on h(n), it is expected that the more closely h'(n) approaches h(n), the better the performance of the algorithm.

If h'(n) = h(n), an optimal solution path is found without expanding any node off the path (we assume that one optimal solution exists). If h'(n) = 0, then A* reduces to the blind uniform-cost algorithm (or breadth-first algorithm).

Please note that admissible heuristics are by nature optimistic, because they assume the cost of solving the problem is less than it actually is (g(n) being the exact cost for each n). Also note that f(n) should never overestimate the true cost of a solution through n.
For example, consider a network of roads and cities, with roads connecting these cities. Our problem is to find a path between two cities such that the mileage/fuel cost is minimal. Then an admissible heuristic would be to use the straight-line (air) distance to estimate the cost from a given city to the goal city. Naturally, the air distance will either be equal to the real distance or underestimate it, i.e., h'(n) ≤ h(n).
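
As a small illustrative check, writing h for the estimate and h* for the true remaining cost (the h' and h of the conditions above), a heuristic passes the admissibility test only if h(n) ≤ h*(n) at every node. The data below re-uses the h-table of Example 2 in Section 3.8.1, with h* worked out by hand from that example's edge costs; note that the estimate for D (6) exceeds the true cost h*(D) = 2, so that particular table is, strictly speaking, not admissible.

# Illustrative admissibility check (assumed data, from Example 2 above):
h      = {'S': 5, 'A': 3, 'B': 4, 'C': 2, 'D': 6, 'G': 0}   # estimates
h_star = {'S': 6, 'A': 5, 'B': 7, 'C': 4, 'D': 2, 'G': 0}   # true costs (by hand)

violations = [n for n in h if h[n] > h_star[n]]
print(violations)   # prints ['D']: h(D) = 6 overestimates h*(D) = 2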

3.8.4 Properties of heuristic Algorithm:

1. Admissibility condition: Algorithm A* is admissible if it is guaranteed to return an optimal solution when one exists. A heuristic function h is called admissible if it never overestimates, i.e., we always have h(n) ≤ h*(n).
2. Completeness condition: Algorithm A* is complete if it always terminates with a solution when one exists.
3. Dominance property: If A1 and A2 are two admissible versions of algorithm A* such that A1 is more informed than A2, then h1(n) > h2(n) for all non-goal nodes n.
4. Optimality property: Algorithm A* is optimal over a class of algorithms if it dominates all members of the class.

3.8.5 Results on A* Algorithm

1. A* is admissible: if there is a path from S to a goal state, A* terminates by finding an optimal solution.
2. A* is complete: if there is a path from S to a goal state, A* terminates (even when the state space is infinite).
3. Dominance property: if A1 and A2 are two admissible versions of A* such that A1 is more informed than A2, then A2 expands at least as many states as A1 does. (So A1 dominates A2 here, because it has a better heuristic than A2.)

If we are given two or more admissible heuristics, we can take their max to get a stronger admissible heuristic.

3.9 Problem Reduction Search

Problem reduction search is, broadly, planning how best to solve a problem that can be recursively decomposed into sub-problems in multiple ways. There may be many ways to decompose a problem, and we have to find the best decomposition, i.e., the one for which the cost of the search is minimum.
We already know the divide-and-conquer strategy: a solution to a problem can be obtained by decomposing it into smaller sub-problems. Each of these sub-problems can then be solved to get its sub-solution. These sub-solutions can then be recombined to get a solution as a whole. This is called problem reduction. This method generates arcs which are called AND arcs. One AND arc may point to any number of successor nodes, all of which must be solved for the arc to point to a solution.
When a problem can be divided into a set of sub-problems, where each sub-problem can be solved separately and a combination of them constitutes a solution, AND-OR graphs or AND-OR trees are used for representing the solution. The decomposition of the problem, or problem reduction, generates AND arcs. Consider the following example to understand the AND-OR graph (Figure 11).

[Figure: Fig 11 AND-OR graph; the goal "Get a bike" has an OR choice between the sub-goal "Steal a bike" and an AND arc joining the sub-goals "Get some money" and "Buy a bike"]

Figure 11 shows an AND-OR graph. In an AND-OR graph, an OR node represents a choice between possible decompositions, and an AND node represents a given decomposition. For example, to get a bike we have two options: either
1. steal a bike,
OR
2. get some money AND buy a bike.

In this graph we are given two choices: first, steal a bike, or, get some money AND buy a bike. When we have more than one choice and we have to pick one, we apply the OR condition to choose one (that is what we did here). The arc joining the branches denotes the AND condition.
Here the AND arc joins "Get some money" and "Buy a bike", because by getting some money the possibility of obtaining a bike is better than by stealing.
The AO* search algorithm is based on the AND-OR graph, which is why it is called the AO* search algorithm. The AO* algorithm is basically based on problem decomposition (breaking the problem down into small pieces).
The main difference between the A* (A star) and AO* (AO star) algorithms is that the A* algorithm works on an OR graph and is used to find a single solution (either this or that), whereas the AO* algorithm works on an AND-OR graph and can find a solution that combines more than one branch (by ANDing).
The A* algorithm guarantees an optimal solution, while AO* does not, since AO* does not explore all other solutions once it has found one.

3.9.1 Problem definition in AND-OR graph:

Given [G, s, T]
where G: implicitly specified AND/OR graph;
s: start node of the AND/OR graph;
T: set of terminal nodes (called SOLVED);
h(n): heuristic function estimating the cost of solving the sub-problem at n.

Example1:
Let us see one example with a heuristic value present at every node (see Fig 12). The estimated heuristic value is given at each node. The heuristic value h(n) at any node indicates that from this node at least h(n) cost is required to find a solution. Here we assume the edge cost (i.e., the g(n) value) for each edge is 1. Remember that at an OR node we always mark the successor node which indicates the best path to a solution.

[Figure: Fig 12 AND-OR graph with heuristic value at each node; the start node S has an AND arc to A (h = 7) and B (h = 12), and an OR branch to C (h = 13); A has OR branches to D (h = 5) and E (h = 6); C has an AND arc to F (h = 5) and G (h = 7); D has a child H (h = 2)]
Note that in the graph given in Fig 12 there are two candidate decompositions from the start state S: either S-A-B or S-C. To calculate the cost of a path we use the formula f(n) = g(n) + h(n) [note that here the g(n) value is 1 for every edge].
Path 1: f(S-A-B) = 1+1+7+12 = 21
Path 2: f(S-C) = 1+13 = 14

Since min(21, 14) = 14, we select successor node C; as its cost is minimum, it indicates the best path to a solution.
Note that C is an AND node, so we consider both successor nodes of C. The cost of node C is f(C-F-G) = 1+1+5+7 = 14; so the revised cost of node C is 14, and now the revised cost of node S is f(S-C) = 1+14 = 15 (revised).
Note that once the cost (that is, the f value) of any node is revised, we propagate this change backward through the graph to decide the current best path.
Now let us explore the other path and check whether we get a lesser cost compared to this cost or not.
f(A-D) = 1+5 = 6 and f(A-E) = 1+6 = 7; since A is an OR node, the best successor node is D, since min(6, 7) = 6. So the revised cost of node A will be 6 instead of 7, that is f(A) = 6 (revised). The next selected node is D, and D has only one child H, so f(D-H) = 1+2 = 3; thus the revised cost of node D is 3, and now the revised cost of node A is f(A-D-H) = 1+3 = 4. This path is better than f(A-E) = 7, so the final revised cost of node A is 4. Now the final revised cost of f(S-A-B) = 1+1+4+12 = 18 (revised).
Thus, the final revised costs are:
Path 1: f(S-A-B) = 18, and
Path 2: f(S-C) = 15.
So the optimal cost is 15.

Example2:
Consider the following AND-OR graph with an estimated heuristic cost at every node. Note that A, D, E are AND nodes and B, C are OR nodes. The edge costs (i.e., g(n) values) are also given. Apply the AO* algorithm and find the optimal-cost path.

[Figure: Fig-1 AND-OR graph with heuristic function values; A (h = 7) is an AND node with successors B (h = 4, edge cost 2) and C (h = 3, edge cost 1); B is an OR node with successors D (h = 2, edge cost 2) and E (h = 6, edge cost 1); C is an OR node with successors F (h = 9, edge cost 0) and G (h = 7, edge cost 0); D is an AND node with successors H (h = 3, edge cost 1) and I (h = 10, edge cost 0); E is an AND node with successors J (h = 4, edge cost 0) and K (h = 3, edge cost 0)]

The heuristic (estimated) cost at every node is given. For example, the heuristic cost at node A is h = 7, which means at least 7 units of cost are required to find a solution from A.
Since A is an AND node, we have to solve both of its successor nodes, B and C.
Cost of node A: f(A-B-C) = (2+4) + (1+3) = 10.
We perform cost revision in bottom-up fashion, so the heuristic value of node A is revised to 10 [Fig (a)].

Now let us look at the right-hand side first. Node C is an OR node, so
f(C-F) = 0+9 = 9
and f(C-G) = 0+7 = 7.
So the best successor of C is G, and we again perform cost revision in bottom-up fashion [Fig (b)]: the revised cost of node C is 7.

The revised cost of node A is now (2+4) + (1+7) = 14. So, up to this point, the best path to solve the problem is G-C-A. Next we expand the left child of the root, i.e., node B.

Since B is an OR node, we have to see which successor is best, i.e., D or E.

The revised cost of node D = (3+1) + (10+0) = 14,
and the revised cost of node E = (4+0) + (3+0) = 7.
So the most promising successor node is E.
Now, performing cost revision in bottom-up fashion, the revised cost of B is 1+7 = 8, and the final revised cost of A is (2+8) + (1+7) = 18.

Note that all the leaf nodes in the marked tree are SOLVED. So the best way to solve the problem is to follow the marked tree and solve the marked sub-problems. The best cost to solve the problem is 18.
Note that an AND node is declared SOLVED only if all of its successors (unit problems) are solved, while an OR node is declared SOLVED if its single best successor is SOLVED.

3.9.2 AO* algorithm

Real-life situations cannot usually be decomposed exactly into either an AND tree or an OR tree, but are always a combination of both. So we need the AO* algorithm, where the 'O' stands for 'ordered'. Instead of the two lists OPEN and CLOSED of the A* algorithm, we use a single structure GRAPH in the AO* algorithm. It represents the part of the search graph that has been explicitly generated so far. Please note that each node in the graph will point both down to its immediate successors and up to its immediate predecessors. Also note that each node will have some h'(n) value associated with it. But unlike A* search, g(n) is not stored: it is not possible to compute a single value of g(n), due to the many paths to the same state, and it is not required either, as we do a top-down traversal along the best-known path. This guarantees that only those nodes that are on the best path are considered for expansion. Hence h'(n) will serve as the only estimate of the goodness of a node.
Next, we present the AO* algorithm.

Algorithm AO*
1. Initialize: Set G* = {s}, f(s) = h(s).
   If s ∈ T, label s as SOLVED.
2. Terminate: If s is SOLVED, then terminate.
3. Select: Select a non-terminal leaf node n from the marked sub-tree.
4. Expand: Make explicit the successors of n. For each new successor m:
       set f(m) = h(m);
       if m is terminal, label m SOLVED.
5. Cost revision: Call Cost-Revise(n).
6. Loop: Go to Step 2.

Cost revision in AO*: Cost-Revise(n)
1. Create Z = {n}.
2. If Z = { }, return.
3. Select a node m from Z such that m has no descendants in Z.
4. If m is an AND node with successors r1, r2, ..., rk:
       set f(m) = Σᵢ [f(rᵢ) + c(m, rᵢ)];
       mark the edge to each successor of m;
       if each successor is labelled SOLVED, then label m as SOLVED.

5. If m is an OR node with successors r1, r2, ..., rk:

       set f(m) = minᵢ {f(rᵢ) + c(m, rᵢ)};
       mark the edge to the best successor of m;
       if the marked successor is labelled SOLVED, label m as SOLVED.
6. If the cost or label of m has changed, then insert into Z those parents of m for which m is a marked successor.
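
The bottom-up cost-revision idea can be illustrated with a small recursive Python sketch for AND-OR trees. This is a simplification of the full AO* algorithm above (which interleaves expansion and revision); the encoding of Example 2, with its edge costs and leaf h-values, is an assumption made for the demonstration.

def cost_revise(node, graph, h):
    # Bottom-up cost revision on an AND-OR tree (simplified sketch).
    # graph[node] = (kind, [(child, edge_cost), ...]) with kind 'AND' or 'OR';
    # nodes absent from graph are leaves whose h-value is taken as final.
    if node not in graph:
        return h[node]
    kind, arcs = graph[node]
    costs = [c + cost_revise(child, graph, h) for child, c in arcs]
    # AND node: every sub-problem must be solved, so the costs add up;
    # OR node: only the single best decomposition is kept.
    return sum(costs) if kind == 'AND' else min(costs)

# Example 2 above, in an assumed encoding:
graph = {'A': ('AND', [('B', 2), ('C', 1)]),
         'B': ('OR',  [('D', 2), ('E', 1)]),
         'C': ('OR',  [('F', 0), ('G', 0)]),
         'D': ('AND', [('H', 1), ('I', 0)]),
         'E': ('AND', [('J', 0), ('K', 0)])}
h = {'F': 9, 'G': 7, 'H': 3, 'I': 10, 'J': 4, 'K': 3}
print(cost_revise('A', graph, h))   # prints 18, matching the worked example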

3.9.3 Advantage of AO* algorithm

Note that AO* will always find a minimum-cost solution if one exists, provided that h'(n) ≤ h(n) and that all arc costs are positive. The efficiency of this algorithm depends on how closely h'(n) approximates h(n). Also note that AO* is guaranteed to terminate even on graphs that have cycles.

Note: When the graph has only OR nodes, the AO* algorithm works just like the A* algorithm.

3.10 Memory Bound Heuristic Search

The following are the commonly used memory-bounded heuristic searches:

1. Iterative deepening A* (IDA*)


2. Recursive Best-First search (RBFS)
3. Memory bound A* (MBA*)

3.10.1: Iterative Deepening A* (IDA*)

IDA* is a variant of the A* search algorithm which uses iterative deepening to keep the memory usage lower than in A*. It is an informed search based on the idea of the uninformed iterative deepening search. Iterative deepening A*, or IDA*, is similar to iterative-deepening depth-first search, but with the following modifications:

The depth bound is modified to be an f-limit:

1. Start with limit = h(start).
2. Prune any node if f(node) > f-limit.
3. Set the next f-limit to the minimum cost of any node pruned.
Iterative deepening is a kind of uninformed search strategy that combines the benefits of depth-first and breadth-first search. The advantage of IDA* is that it is optimal and complete like breadth-first search, with modest memory requirements like depth-first search.
IDA* algorithm
1. Set C = f(s).
2. Perform DFBB (depth-first branch and bound) with cut-off C:
   expand a state, n, only if its f-value is less than or equal to C;
   if a goal is selected for expansion, then return C and terminate.
3. Update C to the minimum f-value which exceeded C among the states which were examined, and go to Step 2.
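
The cut-off loop above can be sketched in Python as follows. This is an illustrative sketch under an assumed encoding (graph maps node -> list of (successor, cost) pairs), not the unit's official implementation.

def ida_star(graph, h, start, goal):
    # IDA*: repeated f-limited DFS; graph maps node -> [(successor, cost)].
    def dfs(node, g, bound, path):
        f = g + h[node]
        if f > bound:
            return f, None                # prune; report the exceeded f-value
        if node == goal:
            return f, path
        minimum = float('inf')
        for succ, c in graph.get(node, []):
            if succ not in path:          # avoid cycles on the current path
                t, found = dfs(succ, g + c, bound, path + [succ])
                if found is not None:
                    return t, found
                minimum = min(minimum, t)
        return minimum, None

    bound = h[start]                      # Step 1: C = f(s) = h(s)
    while True:
        t, found = dfs(start, 0, bound, [start])
        if found is not None:
            return found, t
        if t == float('inf'):             # nothing was pruned: no solution
            return None, float('inf')
        bound = t                         # Step 3: new cut-off = min pruned f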

3.10.2 Working of IDA*


 Perform depth-first search LIMITED to some f-bound.
 If goal found then ok.
 Else: increase the f-bound and restart.

How is the f-bound established?

Initially it is f(S). Then:
 generate all successors;
 record the minimal f(succ) > f(S);
 continue with this minimal f(succ) as the new bound, instead of f(S).

[Figure: successive f-contours f1 < f2 < f3 < f4; each IDA* iteration explores one more contour]

Consider the following example to understand IDA*. The tree has root S with f = 100 and children A (f = 120), B (f = 130) and C (f = 120); the children of A are D (f = 140) and G (f = 125), and the children of C are E (f = 140) and F (f = 125).

First pass, f-limited with f-bound = 100: only S qualifies; the smallest f-value that exceeded the bound is 120, so f-new = 120.
Second pass, f-limited with f-bound = 120: S, A and C are expanded, while their successors (f = 140, 125, 140, 125) are pruned; the smallest pruned f-value is 125, so f-new = 125.
Third pass, f-limited with f-bound = 125: the search now reaches the successors with f = 125 and terminates there.
3.10.3 Analysis of IDA*
IDA* is complete, optimal, and optimally efficient (assuming a consistent, admissible heuristic),
and requires only a polynomial amount of storage in the worst case:

If f* is the cost of the optimal path to a goal, b the branching factor, and δ the minimum operator step cost, then IDA* requires O(b·f*/δ) nodes of storage in the worst case.

Note that IDA* is complete and optimal, and its space usage is linear in the depth of the solution. Each iteration is a depth-first search, and thus it does not require a priority queue.

3.10.4 Comparison of A* and IDA* algorithm:

 Iterative Deepening Search (IDS) is essentially BFS plus DFS for tree search.
 The IDA* algorithm is a "complete and optimal" algorithm.
 BFS and A* are good for optimality, but not for memory.
 DFS is good for memory, O(b·d), but not for optimality.
 In the worst case, only one new state is expanded in each iteration of IDA*. If A* expands N states, then IDA* can expand 1 + 2 + 3 + ... + N = O(N²) states.

3.11 Recursive Best first search (RBFS)

The idea of recursive best-first search is to simulate A* search with only O(b·d) memory, where b is the branching factor and d is the solution depth.
It is a memory-bounded, simple recursive algorithm that works like a standard best-first search but takes up only linear space. There are some things that make it different from recursive DFS: it keeps track of f, the value of the best alternative path available from any ancestor of the current node, instead of continuing indefinitely down the current path.
RBFS mimics the operation of the standard best-first search algorithm. RBFS keeps track of the f-value of the best alternative path available from any ancestor of the current node. If the current node exceeds this limit, the recursion unwinds back to the alternative path. As the recursion unwinds, RBFS replaces the f-value of each node along the path with the best f-value of its children. In this way, RBFS remembers the f-value of the best leaf in the forgotten subtree and can therefore decide whether it is worth re-expanding the subtree at some later time.
RBFS is somewhat more efficient than IDA*, but still suffers from excessive node regeneration. A* and RBFS are optimal algorithms if the heuristic function h(n) is admissible.

[Figure: Fig 12 Algorithm for Recursive Best-First Search]
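
A compact Python sketch of this idea, following the standard textbook formulation (an assumption, since the figure itself is not reproduced here), is given below; graph maps each node to a list of (successor, step-cost) pairs.

import math

def rbfs(graph, h, start, goal):
    # Recursive best-first search: linear space, backed-up f-values.
    def search(node, g, f_limit, path):
        if node == goal:
            return path, g                      # solution found with cost g
        succs = []
        for s, c in graph.get(node, []):
            if s not in path:                   # only the current path is kept
                succs.append([g + c + h[s], g + c, s])
        if not succs:
            return None, math.inf
        while True:
            succs.sort()                        # lowest backed-up f first
            best_f, best_g, best = succs[0]
            if best_f > f_limit or best_f == math.inf:
                return None, best_f             # unwind to the alternative path
            alt = succs[1][0] if len(succs) > 1 else math.inf
            result, value = search(best, best_g, min(f_limit, alt), path + [best])
            if result is not None:
                return result, value
            succs[0][0] = value                 # remember the revised f-value

    return search(start, 0, math.inf, [start])

Note how the recursive call is limited by min(f_limit, alt): the search down the best child stops as soon as its backed-up f exceeds the best alternative, which is exactly the unwinding behaviour described above.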

3.11.1 Advantages and Disadvantages of RBFS:

 It is more efficient than IDA* and still optimal.
 It is a best-first search based on the next-best f-contour, with fewer regenerations of nodes.
 It exploits the results of the search at a specific f-contour by saving the next f-contour associated with a node whose successors have been explored.
 Like IDA*, it still suffers from excessive node regeneration.
 IDA* and RBFS are not good for graphs.
 They cannot check for repeated states other than those on the current path.
 Both are hard to characterize in terms of expected time complexity.
☞ Check Your Progress 2

Q.1 Apply A* on the following graph:-

[Graph figure: a weighted graph over nodes 1 to 12 with edge costs and heuristic values as shown; node 12 is the goal with h = 0]

Q.2 Apply A* algorithm on the following graph

[Graph figure: a weighted graph over nodes 1 to 6 with edge costs and heuristic values as shown; node 6 is the goal with h = 0]
Q.3 Differentiate between the A* and AO* algorithm.

Q.4: Apply AO* algorithm on the following graph. Heuristic value is also given at every node
and assume the edge cost value of each node is 1.
[Figure: AND-OR graph with root A (h = 9); children B (h = 3), C (h = 4), D (h = 5); and leaves E (h = 5), F (h = 7), G (h = 4), H (h = 4); each edge cost is 1]
Q.5 Apply AO* algorithm on the following graph. Heuristic value is also given at every node
and assume the edge cost value of each node is 1.
[Figure: AND-OR graph with root P; children q (h = 5), r (h = 11), s (h = 8); next level t (h = 4), u (h = 7), v (h = 1), w (h = 3), x (h = 3); and leaf y (h = 1); each edge cost is 1]
Example 6: Given the 3 matrices A1, A2, A3 with dimensions (3 × 4), (4 × 10), (10 × 1), consider the problem of solving this chain matrix multiplication. Apply the concept of AND-OR graphs and find a minimum-cost solution tree.

(Multiple choice Questions)


Q.6 The A* algorithm is guaranteed to find an optimal solution if
A. h’ is always 0
B. g is always 1
C. h’ never overestimates h
D. h’ never underestimate h
Q.7 The A* algorithm uses f* = g + h* to estimate the cost of getting from the initial state to the goal state, where g is a measure of the cost of getting from the initial state to the current node and the function h* is an estimate of the cost of getting from the current node to the goal state. To find a path involving the fewest number of steps, we should set
A. g=1
B. g=0
C. ℎ∗ = 0
D. ℎ∗ = 1

3.12 Summary
 As the name 'uninformed search' suggests, the machine blindly follows the algorithm regardless of whether it is right or wrong, efficient or inefficient.
 These algorithms are brute-force operations, and they don't have extra information about the search space; the only information they have is how to traverse or visit the nodes in the tree. Thus, uninformed search algorithms are also called blind search algorithms.
 The search algorithm produces the search tree without using any domain knowledge, i.e., it is brute-force in nature. These algorithms don't have any background information on how to approach the goal, but they are the basics of search algorithms in AI.
 The different types of uninformed search algorithms are as follows:
 Depth First Search
 Breadth-First Search
 Depth Limited Search
 Uniform Cost Search
 Iterative Deepening Depth First Search
 Bidirectional Search (if applicable)
 The following terms are frequently used in any search algorithms:

 State: It provides all the information about the environment.


 Goal State: The desired resulting condition in a given problem; the kind of state the search algorithm is looking for.
 Goal Test: The test to determine whether a particular state is a goal state.
 Path/Step Cost: These are integers that represent the cost to move from one node to
another node.

 To evaluate and compare the efficiency of any search algorithm, the following 4
properties are used:

 Completeness: A search algorithm is said to be complete if it guarantees to return a solution, if one exists.
 Optimality/Admissibility: If the solution found by an algorithm is guaranteed to be the best solution (lowest path cost) among all solutions, then it is said to be an optimal solution.
 Space Complexity: A function describing the amount of space(memory) an algorithm
takes in terms of input to the algorithm. That is how much space is used by the
algorithm? Usually measured in terms of the maximum number of nodes in memory at a
time.
 Time Complexity: A function describing the amount of time the algorithm takes in terms
of input to the algorithm. That is, how long (worst or average case) does it take to find a
solution?
 Time and space complexity are measured in terms of:‘b‘ – maximum branching factor
(Max number of successor (child) of any node) in a tree, ‘d‘ – the depth of the shallowest
goal node, and ‘m‘ – maximum depth of the search tree (maybe infinity).
 The following table summarizes the 4 properties (Completeness, Optimality/Admissibility, Space Complexity, Time Complexity) of the search algorithms.

             BFS       DFS       IDDFS     Bidirectional Search (if applicable)
Time         b^d       b^m       b^d       b^(d/2)
Space        b^d       bm        bd        b^(d/2)
Optimum?     Yes       No        Yes       Yes
Complete?    Yes       No        Yes       Yes
 The advantage of DFS is that it requires very little memory as compared to BFS, as it only needs to store a stack of the nodes on the path from the root node to the current node. The disadvantages of DFS are as follows: there is the possibility that many states keep reoccurring, and there is no guarantee of finding the solution. The DFS algorithm searches deep down the tree, and sometimes it may run into an infinite loop.
 On the other hand, BFS gives a guarantee to get an optimal solution, if any solution exists
(Completeness) and if there is more than one solution for a given problem, then BFS will
provide the minimal solution which requires the least number of steps (Optimal), but one
major drawback of BFS is that it requires lots of memory space since each level of the
tree must be saved into memory to expand the next level.
 Iterative Deepening DFS (IDDFS) combines the benefits of both BFS and DFS search
algorithms in terms of fast search and memory efficiency. It is better than DFS and needs
less space than BFS. But the main drawback of IDDFS is that it repeats all the work from
the previous phase.
 The advantage of Bidirectional Search (BS) is that it can use various techniques like DFS, BFS, DLS, etc., so it is efficient and requires less memory. However, the implementation of the bidirectional search tree is difficult, and in bidirectional search one should know the goal state in advance.
 In Informed search, domain dependent (heuristic) information is used in order to search the space more efficiently.
 Informed search includes the following searches:
 Best-first search: Order the agenda based on some measure of how ‘good’ each state is.
 Uniform-cost search: Cost of getting to the current state from the initial state = g(n)
 Greedy search: Estimated cost of reaching the goal from the current state – heuristic evaluation function, h(n)
 A* search: f(n) = g(n) + h(n)
 Admissibility: h(n) never overestimates the actual cost of getting to the goal state.
 Informedness: A search strategy which searches less of the state space in order to find a goal state is more informed.
 A* algorithm avoids expanding paths that are already expensive (a minimal sketch of A* is given at the end of this summary).
 In A*, the evaluation function is f(n) = g(n) + h(n), where g(n) = cost so far to reach n, h(n) = estimated cost to goal from n, and f(n) = estimated total cost of the path through n to the goal.
 A∗ search uses an admissible heuristic function, i.e., h(n) ≤ h∗(n) where h∗(n) is the true
cost of cheapest solution from n.
 A∗ search has a very good property: A∗ search is optimal! So, if there is any solution, A∗
search is guaranteed to find a least cost solution. Remember, this needs an admissible
heuristic.
 The commonly used memory bound heuristics search are Iterative deepening A* (IDA*),
Recursive Best-First search (RBFS) and Memory bound A* (MBA*).
 Iterative Deepening Search (IDS) is nothing but BFS plus DFS for tree search. The IDA* algorithm is a complete and optimal algorithm.
 The idea of recursive best first search is to simulate A* search with O(bd) memory,
where b is the branching factor and d is the solution depth.
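As the summary centres on the evaluation function f(n) = g(n) + h(n), here is a minimal Python sketch of A* (the names a_star, successors and h are our own, for illustration; it assumes an admissible h and non-negative edge costs):

import heapq, itertools

def a_star(start, goal, successors, h):
    """Minimal A* sketch: always expands the OPEN node with the least
    f(n) = g(n) + h(n); assumes an admissible h and non-negative costs."""
    tie = itertools.count()              # tie-breaker so the heap never compares nodes
    open_list = [(h(start), next(tie), 0, start, [start])]
    best_g = {start: 0}                  # cheapest g found so far for each node
    while open_list:
        f, _, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return path, g               # least-cost path (optimal for admissible h)
        for child, cost in successors(node):
            g2 = g + cost
            if g2 < best_g.get(child, float("inf")):
                best_g[child] = g2       # found a cheaper path to child
                heapq.heappush(open_list,
                               (g2 + h(child), next(tie), g2, child, path + [child]))
    return None, float("inf")            # OPEN exhausted: no solution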

3.13 Solutions/Answers

Check your progress 1:

Answer 1:

Table 1: Open and closed list for BFS


OPEN LIST CLOSED LIST
1 1
2,3,4 2
3,4,6 3
4,6,5 4
6,5,7 5
5,7 6
7, 7
8-Goal State -

Table 2: Open and closed list for DFS


OPEN LIST CLOSED LIST
1 1
4,3,2 2
4,3,6 6
4,3 3
4,5 5
4,7 7
4,8-Goal State -

Answer 2

1. open = [A]; closed = [ ]. A is not the goal; expand it.

2. open = [B,C,D]; closed = [A]. Put A's children onto the queue.

3. open = [C,D,E,F]; closed = [B,A]. B is done; put it in closed.

4. open = [D,E,F,G,H]; closed = [C,B,A].
5. open = [E,F,G,H,I,J]; closed = [D,C,B,A]
6. open = [F,G,H,I,J,K,L]; closed = [E,D,C,B,A]
7. open = [G,H,I,J,K,L,M,N] (as L is already on open); closed = [F,E,D,C,B,A]
8. open = [H,I,J,K,L,M,N]; closed = [G,F,E,D,C,B,A]
9. and so on, until either the goal is reached or open is empty.

Answer 3: Refer BFS section for solution.

Answer 4

             BFS       DFS       IDDFS     Bidirectional Search (if applicable)
Time         b^d       b^m       b^d       b^(d/2)
Space        b^d       bm        bd        b^(d/2)
Optimum?     Yes       No        Yes       Yes
Complete?    Yes       No        Yes       Yes

Check your progress 2

Answer 1:
The OPEN and CLOSED lists are shown below. Nodes with their f(n) values are inserted in the OPEN list, and the node whose f(n) value is minimum is expanded next.

CLOSED
1(12) 2(12) 6(12) 5(13) 10(13) 11(13) 12(13)

OPEN

1(12)
2(12) 5(13)
5(13) 3(14) 6(12)
5(13) 3(14) 7(17) 10(13)
3(19) 7(17) 10(13)
3(19) 7(17) 10(13) 9(14)
3(19) 7(17) 9(14) 11(13)
3(19) 7(17) 9(14) 12(13)

Note that only 6 nodes are expanded to reach a goal node. The optimal cost to reach from the start state (1) to the goal node (12) is 13.

Note: If all the edge costs are positive, then the Uniform Cost Search (UCS) algorithm is the same as Dijkstra's algorithm. Dijkstra's algorithm fails if the graph has a negative weight cycle. The A* algorithm allows negative edge costs as well, i.e., the A* algorithm works for negative (-ve) edge costs also. If some edge cost is negative, then at any point of a successive iteration we cannot say that the cost found up to that node is optimal (because of the negative cost). So, in this case (-ve edge cost), nodes come back from CLOSED to OPEN. The next answer (to Q.2) involves a negative edge cost, and you can also see there how nodes come back from CLOSED to OPEN.

Answer 2:
The OPEN and CLOSED lists are shown below. Nodes with their f(n) values are inserted in the OPEN list, and the node whose f(n) value is minimum is expanded next.

CLOSED
1(15) 2(7) 4(9) 5(11) 3(25) 4(7) 5(9)

OPEN

1(15)
2(7) 3(25)
3(25) 4(9)
3(25) 5(11)
3(25) 6(28)    6 is a goal node, but we cannot pick it yet, because a better cost/path may exist.
4(7) 6(28)     The cost of node 4 has decreased from 9 to 7, so node 4 is brought back from CLOSED to OPEN.
5(9) 6(28)     (Note: 6(28) is not the minimum cost in OPEN, so we do not pick it.)
6(26) 6(28)    By the same logic, 6(26) is now picked: it is the goal, with optimal cost 26.

The optimal cost to reach from the start state (1) to the goal node (6) is 26.

Answer 3: An A* algorithm represents an OR graph algorithm that is used to find a single solution (either this or that), whereas an AO* algorithm represents an AND-OR graph algorithm that is used to find more than one solution by AND-ing more than one branch.
Answer 4 :
[Figure: the AND-OR graph of Q.4 — root A (h = 9) with an AND arc to (B, C) and an AND arc to (C, D); B has OR successors E (h = 5) and F (h = 7); D has an AND arc to (G, H); h(B) = 3, h(C) = 4, h(D) = 5, h(G) = h(H) = 4; all edge costs are 1.]

Path 1: f(A-B-C) = 1+1+3+4 = 9

f(B-E) = 1+5 = 6; f(B-F) = 1+7 = 8
Revised f(A-B-C) = 1+1+6+4 = 12

Path 2: f(A-C-D) = 1+1+4+5 = 11

f(D-G-H) = 1+1+4+4 = 10
Revised f(A-C-D) = 1+1+4+10 = 16

The AND-OR graph with the revised costs is shown in the figure. [Figure: the same AND-OR graph with the revised values — f(A-B-C) revised to 12 and f(A-C-D) revised to 16, with h(D) revised from 5 to 10.]

So, the optimal cost f(A-B-C)= 12

Note that the AO* algorithm does not explore all the solution paths once it finds a solution.

Answer 5: Similar to Q.4


Answer to Example 6:
Given a chain of matrices A1, A2, …, An, where each matrix Ai has dimension p_(i−1) × p_i, the problem is to determine the order of multiplication, i.e., the way of parenthesizing the product of the matrices, so that the number of required operations (i.e., scalar multiplications) is minimized.
There are only 2 ways of parenthesizing the given 3 matrices A1, A2, A3:
𝐴1 × (𝐴2 × 𝐴3)
𝑜𝑟
(𝐴1 × 𝐴2) × 𝐴3
As we know, if matrix A is of dimension (𝑝 × 𝑞) and matrix B is of dimension (𝑞 × 𝑟), then the
cost of multiplying A to B that is (𝐴 × 𝐵) is (𝑝 × 𝑞 × 𝑟) and the final dimension of the resultant
matrix (𝐴 × 𝐵) is (𝑝 × 𝑟).
Let us see how AND-OR graph is used to get the solution of this problem.
[Figure 1: AND/OR graph for multiplying A1 × A2 × A3, with A1 (3×5), A2 (5×6), A3 (6×10) — the root node A1A2A3 is an OR node with the two choices A1(A2A3), of cost 150+300 = 450, and (A1A2)A3, of cost 90+180 = 270.]

In this AND-OR graph, parent (root) node indicates the given problem for multiplying A1A2A3.
Next level of the tree (2 successors node) indicates the 2 choices (or ways) of multiplying (or
parenthesizing) the A1A2A3; first way is 𝐴1 × (𝐴2 × 𝐴3) and another way is (𝐴1 × 𝐴2) × 𝐴3.
Since, out of these two choices, any one will be the solution, there is an OR node here. In an OR node, we always mark the current best successor node. At the next level we have an AND node. For an AND node we must add the costs of both successor nodes.
Cost of multiplying (𝐴2 × 𝐴3) = 5 × 6 × 10 = 300 and dimension of 𝐴2 × 𝐴3 is (5 × 10).
Since the dimension of A1 is (3 × 5) and the dimension of 𝐴2 × 𝐴3 is (5 × 10), so the cost of
multiplying 𝐴1 × (𝐴2 × 𝐴3) = 3 × 5 × 10 = 150. Thus the total cost will be 300+150=450.
Similarly,
The cost of multiplying (𝐴1 × 𝐴2) = 3 × 5 × 6 = 90 and dimension of 𝐴1 × 𝐴2 will be
(3 × 6). Since the dimension of 𝐴1 × 𝐴2 is (3 × 6) and the dimension of A3 is (6 × 10), so the
cost of multiplying (𝐴1 × 𝐴2) × 𝐴3 = 3 × 6 × 10 = 180. Thus the total cost will be
90+180=270. So, the best way to multiplying 𝐴1 × 𝐴2 × 𝐴3 is (𝐴1 × 𝐴2) × 𝐴3 and the
minimum cost of multiplying 𝐴1 × 𝐴2 × 𝐴3
is 270.
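The computation above can be cross-checked with a short Python sketch (the names dims and mult_cost are our own, for illustration):

# dims[i-1] x dims[i] is the shape of Ai, so dims = [3, 5, 6, 10]
# encodes A1 (3x5), A2 (5x6), A3 (6x10).
dims = [3, 5, 6, 10]

def mult_cost(p, q, r):
    """Scalar multiplications for a (p x q) times (q x r) product."""
    return p * q * r

# A1 x (A2 x A3): multiply A2 by A3 first, then A1 by the (5 x 10) result
cost1 = mult_cost(dims[1], dims[2], dims[3]) + mult_cost(dims[0], dims[1], dims[3])
# (A1 x A2) x A3: multiply A1 by A2 first, then the (3 x 6) result by A3
cost2 = mult_cost(dims[0], dims[1], dims[2]) + mult_cost(dims[0], dims[2], dims[3])

print(cost1, cost2)   # 450 270 -> (A1 x A2) x A3 is the minimum cost order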

Multiple Choice Questions

Answer 6: Option C

Answer 7: Option A

3.14 FURTHER READINGS

1. Ela Kumar, “ Artificial Intelligence”, IK International Publications


2. E. Rich and K. Knight, “Artificial intelligence”, Tata Mc Graw Hill Publications
3. N.J. Nilsson, “Principles of AI”, Narosa Publ. House Publications
4. John J. Craig, “Introduction to Robotics”, Addison Wesley publication
5. D.W. Patterson, “Introduction to AI and Expert Systems”, Pearson publication
UNIT 4 PREDICATE AND PROPOSITIONAL LOGIC

4.1 Introduction
4.2 Objectives
4.3 Introduction to Propositional Logic
4.4 Syntax of Propositional Logic
4.4.1 Atomic Propositions
4.4.2 Compound Propositions
4.5 Logical Connectives
4.5.1 Conjunction
4.5.2 Disjunction
4.5.3 Negation
4.5.4 Implication
4.5.5 Bi-Conditional
4.6 Semantics
4.6.1 Negation Truth Table
4.6.2 Conjunction/Disjunction/Implication/Biconditional
Truth Table
4.6.3 Truth Table with three variables
4.7 Propositional Rules of Inference
4.7.1 Modus Ponens (MP)
4.7.2 Modus Tollens (MT)
4.7.3 Disjunctive Syllogism (DS)
4.7.4 Addition
4.7.5 Simplification
4.7.6 Conjunction
4.7.7 Hypothetical Syllogism (HS)
4.7.8 Absorption
4.7.9 Constructive Dilemma (CD)
4.8 Propositional Rules of Replacement
4.9 Validity and Satisfiability
4.10 Introduction to Predicate Logic
4.11 Inferencing in Predicate Logic
4.12 Proof Systems
4.13 Natural Deduction
4.14 Propositional Resolution
4.14.1 Clausal Form
4.14.2 Determining Unsatisfiability
4.15 Answers/Solutions
4.16 Further Readings
4.1 INTRODUCTION

Logic is the study and analysis of the nature of the valid argument, the
reasoning tool by which valid inferences can be drawn from a given set of facts
and premises. It is the basis on which all the sciences are built, and this
mathematical theory of logic is called symbolic logic. The English
mathematician George Boole (1815-1864) seriously studied and developed this
theory, called symbolic logic.

The reason why the subject-matter of the study is called Symbolic Logic is that
symbols are used to denote facts about objects of the domain and relationships
between these objects. Then the symbolic representations and not the original
facts and relationships are manipulated in order to make conclusions or to solve
problems.

The basic building blocks of arguments in symbolic logic are declarative


sentences, called propositions or statements. In MCS – 212 i.e., Discrete
Mathematics you learned about predicates and propositions and ways of
combining them to form more complex propositions. Also, you learned about
the propositions that contain the quantifiers ‘for All’ and ‘there exists’. In
symbolic logic, the goal is to determine which propositions are true and which
are false. The truth table, a tool to find out all possible outcomes of a proposition’s truth value, was also discussed in MCS-212.

Logical rules have the power to give accuracy to mathematical statements.


These rules come to rescue when there is a need to differentiate between valid
and invalid mathematical arguments. Symbolic logic may be thought of as a
formal language for representing facts about objects and relationships between
objects of a problem domain along with a precise inferencing mechanism for
reasoning and deduction.

Using symbolic logic, we can formalize our arguments and logical reasoning in
a manner that can easily show if the reasoning is valid, or is a fallacy. How we
symbolize the reasoning is what is presented in this unit.
4.2 OBJECTIVES

After going through this unit, you should be able to:


1. Understand the meaning of propositional logic.
2. Differentiate between atomic and compound propositions.
3. Know different types of connectives, their associated semantics and
corresponding truth tables.
4. Define propositional rules of inference and replacement.
5. Differentiate between valid and satisfiable arguments.

4.3 INTRODUCTION TO PROPOSITIONAL


LOGIC

Apart from the application of logic in mathematics, it also helps in various other
tasks related to computer science. It is widely used to design the electronic
circuitry, programming of android applications, applying artificial intelligence
to different tasks, etc. In simple terms, a proposition is a statement which is
either true or false.
Consider the following statements:
1. Earth revolves around the sun.
2. Water freezes at 100° Celsius.
3. An hour has 3600 seconds.
4. 2 is the only even prime number.
5. Mercury is the closest planet to the Sun in the solar system.
6. The USA lies on the continent of North America.
7. 1 + 2 = 4.
8. 15 is a prime number.
9. Moon rises in the morning and sets in the evening.
10. Delhi is the capital of India.
For all the above statements, one can easily conclude whether the particular
statement is true or false so these are propositions. First statement is a universal
truth. Second statement is false as the water freezes at 0° Celsius. Third
statement is again a universal truth. Fourth statement is true. Fifth statement is
also true as it is again a universal truth. On similar lines, the sixth statement is
also true. Seventh and eighth statements are false again as they deny the basic
mathematical rules. The Ninth statement is a negation of the universal truth so it
is a false statement. The Tenth Statement is also true.
Now consider the following statements:
1. What is your name?
2. a + 5 = b.
3. Who is the prime minister of India?
4. p is less than 5.
5. Pay full attention while you are in the class.
6. Let’s play football in the evening.
7. Don’t behave like a child, you are grown up now!
8. How much do you earn?
9. ∠X is an acute angle greater than 27°.
For all the above statements, we can’t say anything about their truthfulness so
they are not propositions. First and third statements are neither true nor false as
they are interrogative in nature. Also, we can’t say anything about the
truthfulness of the second statement until and unless we have the values of a and
b. Similar reasoning applies to the fourth statement as well. We can’t say
anything about fifth, sixth and seventh statements again as they are informative
statements. Eighth statement is again interrogative in nature. Again, we can’t
say anything about the truthfulness of the ninth statement until and unless we
have the value of ∠X.
Propositional logic has the following facts:
1. Propositional statements can be either true or false, they can’t be
both simultaneously.
2. Propositional logic is also referred to as binary logic as it works
only on the two values 1 (True) and 0 (False).
3. Symbols or symbolic variables such as x, y, z, P, Q, R, etc. are
used for representing the logic and propositions.
4. Any proposition or statement which is always valid (true) is known
as a tautology.
5. Any proposition or statement which is always invalid (false) is
known as a contradiction.
6. A table listing all the possible truth values of a proposition is
known as a truth table.
7. Objects, relations (or functions) and logical connectives are the
basic building blocks of propositional logic.
8. Logical connectives are also referred to as logical operators.
9. Statements which are interrogative, informative or opinions such as
“Where is Chandni Chowk located?”, “Mumbai is a good city to
live in”, “Result will be declared on 31st March” are not
propositions.

4.4 SYNTAX OF PROPOSITIONAL LOGIC

The syntax of propositional logic allows two types of sentences to represent


knowledge. The two types are as follows:
4.4.1 Atomic Propositions
These are simplest propositions containing a single proposition symbol and are
either true or false. Some of the examples of atomic propositions are as follows:
1. “Venus is the closest planet to the Sun in the solar system” is an atomic proposition since it is a false fact.
2. “7 – 3 = 4” is an atomic proposition as it is a true fact.
4.4.2 Compound Propositions
They are formed by a collection of atomic propositions joined with logical
connectives or logical operators. Some of the examples of compound
propositions are as follows:
1. The Sun is very bright today and it's very hot outside.
2. Diana studies in class 8th and her school is in Karol Bagh.
☞ Check Your Progress 1

Which of the following statements are propositions? Write yes or no.


1. How are you?
2. Sachin Tendulkar is one of the best cricketers in India.
3. The honorable Ram Nath Kovind is the 10th and current president of
India.
4. Lord Ram of the kingdom of Ayodhya is an example of a people's king.
5. In which year did prophet Muhammad received verbal revelations from
the Allah in the cave Mount Hira presently located in the Saudi Arabia?
6. Akbar was the founder of the Mughal dynasty in India.
7. In the year 2019, renowned actor Shri Amitabh Bachchan was awarded
with the Padma Vibhushan which is the second highest civilian honour of
the republic of India.
8. One should avoid eating fast food in order to maintain good health.
9. What is your age?
10. The first case of COVID-19 was detected in China.
11. Name the author of the book series “Shiva Trilogy”.
12. Former prime minister of India, Shri Atal Bihari Vajpayee was a member
of which political party?
13. Wing Commander Rakesh Sharma is the only Indian citizen to travel in
space till date.
14. Where do you live?

4.5 LOGICAL CONNECTIVES

Logical connectives are the operators used to join two or more atomic
propositions (operands). The joining should be done in a way that the logic and
truth value of the obtained compound proposition is dependent on the input
atomic propositions and the connective used.
4.5.1 Conjunction
A proposition “A ∧ B” with the connective ∧ is known as the conjunction of A and B. It
is a proposition (or operation) which is true only when both the constituent
propositions are true. Even if one of the input propositions is false then the
output is also false. It is also referred to as AND-ing the propositions. Example:
Ram is a playful boy and he loves to play football. It can be written as:
A = Ram is a playful boy.
B = Ram loves to play football.
A ∧ B = Ram is a playful boy and he loves to play football.

4.5.2 Disjunction
A proposition “A ∨ B” with the connective ∨ is known as the disjunction of A and B. It
is a proposition (or operation) which is true when at least one of the constituent
propositions are true. The output is false only when both the input propositions
are false. It is also referred to as OR-ing the propositions. Example:
I will go to her house or she will come to my house. It can be written as:
A = I will go to her house.
B = She will come to my house.
A ∨ B = I will go to her house or she will come to my house.

4.5.3 Negation
The proposition ¬ A (or ~A) with ¬ (or ~) connective is known as negation of
A. The purpose of negation is to negate the logic of given proposition. If A is
true, its negation will be false, and if A is false, its negation will be true.
Example:
University is closed. It can be written as:
A = University is closed.
¬ A = University is not closed.
4.5.4 Implication
The proposition A → B with → connective is known as A implies B. It is also
called if-then proposition. Here, the second proposition is a logical consequence
of the first proposition. For example, “If Mary scores good in examinations, I
will buy a mobile phone for her”. In this case, it means that if Mary scores
good, she will definitely get the mobile phone but it doesn’t mean that if she
performs bad, she won’t get the mobile phone. In set notation, we can also say
that A ⊆ B i.e., if something exists in the set A, then it necessarily exists in the
set B. Another example:
If you score above 90%, you will get a mobile phone.
A = You score above 90%.
B = You will get a mobile phone.
A → B = If you score above 90%, you will get a mobile phone.

4.5.5 Bi-conditional
A proposition A ⟷ B with connective ⟷ is known as a biconditional or if-and-
only-if proposition. It is true when both the atomic propositions are true or both
are false. A classic example of a biconditional is “A triangle is equilateral if and only if all its angles are 60° each”. This statement means that if a triangle is an equilateral triangle, then all of its angles are 60° each. There is one more associated meaning with this statement, namely that if all the interior angles of a triangle are of 60° each then it is an equilateral triangle. Example:
You will succeed in life if and only if you work hard.
A = You will succeed in life.
B = You work hard.
A ⟷ B = You will succeed in life if and only if you work hard.
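The five connectives can also be modelled directly as Python functions over the truth values True and False (a small sketch; the function names are our own, not a standard API):

# Modelling the connectives as functions over True/False.
def NOT(a):        return not a             # negation  ~A
def AND(a, b):     return a and b           # conjunction  A ∧ B
def OR(a, b):      return a or b            # disjunction  A ∨ B
def IMPLIES(a, b): return (not a) or b      # implication  A → B
def IFF(a, b):     return a == b            # biconditional  A ⟷ B

# "You score above 90% (True) but get no phone (False)" makes A → B False:
print(IMPLIES(True, False))   # False
print(IFF(False, False))      # True: A ⟷ B is true when both sides are false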

☞ Check Your Progress 2

Which of the following propositions are atomic and which are compound?
1. The first battle of Panipat was fought in 1556.
2. Jack either plays cricket or football.
3. Posthumously, at the age of 22, Neerja Bhanot became the youngest recipient of the
Ashok Chakra award which is India's highest peacetime gallantry decoration.
4. Chandigarh is the capital of the Indian states Haryana and Punjab.
5. Earth takes 365 days, 5 hours, 59 minutes and 16 seconds to complete one revolution
around the Sun.
6. Dermatology is the branch of medical science which deals with the skin.
7. Indian sportspersons won 7 medals at the 2020 Summer Olympics and 19 medals at
the 2020 Summer Paralympics both held at the Japanese city of Tokyo.
8. Harappan civilization is considered to be the oldest human civilization and it lies in
the parts of present-day India, Pakistan and Afghanistan.
9. IGNOU is a central university and offers courses through the distance learning mode.
10. Uttarakhand was carved out of the Indian state of Uttar Pradesh in the year 2000.

4.6 SEMANTICS

You have already learned many of the concepts to be covered in this unit in MCS-212, i.e., Discrete Mathematics; here is a quick revision of those concepts, and we will extend our discussion to the advanced concepts which are useful for our field of work, i.e., Artificial Intelligence. In MCS-212 you learned that propositions are declarative sentences or statements which are either true or false, but not both; such sentences are either universally true or universally false.

On the other hand, consider the declarative sentence ‘Women are more intelligent than men’.
Some people may think it is true while others may disagree. So, it is neither universally true
nor universally false. Such a sentence is not acceptable as a statement or proposition in
mathematical logic.

Note that a proposition should be either uniformly true or uniformly false.

In propositional logic, as mentioned earlier also, symbols are used to denote propositions. For
instance, we may denote the propositions discussed above as follows:

P : The sun rises in the west,


Q : Sugar is sweet,
R : Ram has a Ph.D. degree.

The symbols, such as P, Q, and R, that are used to denote propositions, are called atomic formulas, or atoms. In this case, the truth-value of P is False, the truth-value of Q is True, and the truth-value of R, though not known yet, is exactly one of ‘True’ or ‘False’, depending on whether Ram is actually a Ph.D. or not.
At this stage, it may be noted that once symbols are used in place of given statements in, say,
English, then the propositional system, and, in general, a symbolic system is aware only of
symbolic representations, and the associated truth values. The system operates only on these
representations. And, except for possible final translation, is not aware of the original
statements, generally given in some natural language, say, English.

When you’re talking to someone, do you use very simple sentences only? Don’t you use
more complicated ones which are joined by words like ‘and’, ‘or’, etc? In the same way,
most statements in mathematical logic are combinations of simpler statements joined by
words and phrases like ‘and’. ‘or’, ‘if … then’. ‘If and only if’, etc. We can build, from
atoms, more complex propositions, sometimes called compound propositions, by using logical connectives.

The Logical Connectives are used to frame compound propositions, and they are as follows:

a) Disjunction: The disjunction of two propositions p and q is the compound statement ‘p or q’, denoted by p ∨ q.

The exclusive disjunction of two propositions p and q is the statement ‘Either of the two (i.e. p or q) can be true, but both can’t be true’. We denote this by p ⊕ q.

b) Conjunction: We call the compound statement ‘p and q’ the conjunction of the statements p and q. We denote this by p ∧ q.

c) Negation The negation of a proposition p is ‘not p’, denoted by ~p.

d) Implication (Conditional Connectives): Given any two propositions p and q, we denote the statement ‘If p, then q’ by p → q. We also read this as ‘p implies q’, or ‘p is sufficient for q’, or ‘p only if q’. We also call p the hypothesis and q the conclusion. Further, a statement of the form p → q is called a conditional statement or a conditional proposition.

Let p and q be two propositions. The compound statement (p → q) ∧ (q → p) is the bi-conditional of p and q. We denote it by p ⟷ q, and read it as ‘p if and only if q’.

Note: The two connectives → and ⟷ are called conditional connectives.

The rule of precedence: The order of preference in which the connectives are applied in a formula of propositions that has no brackets is

i) ~

ii) ∧

iii) ∨ and ⊕

iv) → and ⟷

Note that the ‘inclusive or’ and ‘exclusive or’ are both third in the order of preference. However, if both of these appear in a statement, we first apply the left-most one. So, for instance, in p ∨ q ⊕ ~p, we first apply ∨ and then ⊕. The same applies to the ‘implication’ and the ‘biconditional’, which are both fourth in the order of preference.

Let’s see the working of the various concepts learned above with the help of truth tables. In the following truth tables, we write every TRUE value as T and every FALSE value as F.
4.6.1 Negation Truth Table
α      ~α
F      T
T      F

4.6.2 Conjunction/Disjunction/Implication /Biconditional


Truth Table
α_1   α_2   Conjunction   Disjunction   Implication   Biconditional
            α_1 ∧ α_2     α_1 ∨ α_2     α_1 → α_2     α_1 ⟷ α_2
F     F     F             F             T             T
F     T     F             T             T             F
T     F     F             T             F             F
T     T     T             T             T             T

4.6.3 Conjunction and Disjunction with three variables


α_1  α_2  α_3  (α_1∧α_2)  (α_2∧α_3)  (α_1∨α_2)  (α_2∨α_3)  (α_1∧α_2)∧α_3 or  (α_1∨α_2)∨α_3 or
                                                           α_1∧(α_2∧α_3)     α_1∨(α_2∨α_3)
T    T    T    T          T          T          T          T                 T
T    T    F    T          F          T          T          F                 T
T    F    T    F          F          T          T          F                 T
T    F    F    F          F          T          F          F                 T
F    T    T    F          T          T          T          F                 T
F    T    F    F          F          T          T          F                 T
F    F    T    F          F          F          T          F                 T
F    F    F    F          F          F          F          F                 F
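Such tables can also be generated mechanically. The following is a small Python sketch (truth_table is our own helper, for illustration) that enumerates all assignments with itertools and prints a T/F row for each; here it confirms the associativity of ∧ shown in the table above:

from itertools import product

def truth_table(expr, names):
    """Print a truth table for expr, a function of len(names) boolean
    arguments, writing T for True and F for False as in the tables above."""
    print(*names, 'result')
    for values in product([True, False], repeat=len(names)):
        row = ['T' if v else 'F' for v in values]
        print(*row, 'T' if expr(*values) else 'F')

# Associativity of conjunction, as in the three-variable table:
truth_table(lambda a1, a2, a3: ((a1 and a2) and a3) == (a1 and (a2 and a3)),
            ['a1', 'a2', 'a3'])       # the result column is T in every row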

Using these logical connectives, we can transform any sentence into its equivalent mathematical representation in symbolic logic, and that representation is referred to as a Well-Formed Formula (WFF). You have already learned a lot about WFFs in MCS-212; let us briefly discuss them here also, as they have wide applications in Artificial Intelligence.
A Well-formed formula, or wff or formula in short, in the propositional logic is defined
recursively as follows:

1. An atom is a wff.
2. If A is a wff, then (~A) is a wff.
3. If A and B are wffs, then each of (A ∧ B), (A ∨ B), (A → B), and (A ⟷ B) is a wff.
4. Any wff is obtained only by applying the above rules.

From the above recursive definition of a wff it is not difficult to see that the expression:

((P ∧ (Q ∨ (~R))) → S) is a wff; because, to begin with, each of P, Q, (~R) and S, by definition, is a wff. Then, by recursive application, the expression (Q ∨ (~R)) is a wff. Again, by another recursive application, the expression (P ∧ (Q ∨ (~R))) is a wff. And, finally, the expression given initially is a wff.

Further, it is easy to see that, according to the recursive definition of a wff, each of the expressions (P → (Q ∧)) and (P (Q ∧ R)) is not a wff: the first lacks a second operand and the second lacks a connective.

Some pairs of parentheses may be dropped, for simplification. For example,

A ∧ B and A ∨ B respectively may be used instead of the given wffs (A ∧ B) and (A ∨ B). We can omit the use of parentheses by assigning priorities, in increasing order, to the connectives as follows:

⟷, →, ∨, ∧, ~.

Thus, ‘⟷’ has the least priority and ‘~’ has the highest priority. Further, if in an expression there are no parentheses and two connectives between three atomic formulas are used, then the operator with higher priority will be applied first and the other operator will be applied later.

For example: Let us be given the wff P ∨ Q ∧ ~R without parentheses. Then, among the operators appearing in the wff, the operator ‘~’ has the highest priority. Therefore, ~R is replaced by (~R). The equivalent expression becomes P ∨ Q ∧ (~R). Next, out of the two operators viz. ‘∨’ and ‘∧’, the operator ‘∧’ has higher priority. Therefore, by applying parentheses appropriately, the new expression becomes P ∨ (Q ∧ (~R)). Finally, only one operator is left. Hence the fully parenthesized expression becomes (P ∨ (Q ∧ (~R))).

Following are the rules of finding the truth value or meaning of a wff, when truth values of
the atoms appearing in the wff are known or given.

1. The wff ~A is True when A is False, and ~A is False when A is True. The wff ~A is called the negation of A.
2. The wff (A ∧ B) is True if A and B are both True; otherwise, the wff (A ∧ B) is False. The wff (A ∧ B) is called the conjunction of A and B.
3. The wff (A ∨ B) is True if at least one of A and B is True; otherwise, (A ∨ B) is False. (A ∨ B) is called the disjunction of A and B.
4. The wff (A → B) is False if A is True and B is False; otherwise, (A → B) is True. The wff (A → B) is read as “If A, then B,” or “A implies B.” The symbol ‘→’ is called implication.
5. The wff (A ⟷ B) is True whenever A and B have the same truth values; otherwise (A ⟷ B) is False. The wff (A ⟷ B) is read as “A if and only if B.”

☞ Check Your Progress 3

Q1. Draw the truth table for the following:


a) α_2⟷(~ α_1 → (α_1∨α_2))
b) (~α_1⟷ (α_2⟷α_3) ) ∨ (α_3 ∧α_2)
c) ((α_1 ∧α_2) → α_3) ∨ ~α_4
d) ((α_1 → ~ α_2) ⟷α_3) → ~ (α_1 V α_1)
Q2. Verify the De Morgan’s Laws using Truth Tables
Q3. Write WFF for the following statements:
a) Every person has a mother
b) There is a woman and she is the mother of Siya

4.7 PROPOSITIONAL RULES OF INFERENCE

We need intelligent computers based on the concept of artificial intelligence


which are able to infer new “knowledge” or logic from the existing logic using
the theory of inference. Inference rules help us to infer new propositions and
conclusions based on existing propositions and logic. They act as templates to
generate new arguments from the premises or predicates. We deduce new
statements from the statements whose truthfulness is already known. These
rules come to the rescue when we need to prove something logically. In general,
inference rules preserve the truth. Depending on the problem, some or all of
these rules may be applied to infer new propositions. The procedure of
determining whether a proposition is a conclusion of the given propositions is
known as inferring a proposition. The inference rules are described below.
4.7.1 Modus Ponens (MP)
It states that if the propositions A → B and A are true, then B is also true.
Modus Ponens is also referred to as the implication elimination because it
eliminates the implication A → B and results in only the proposition B. It also
affirms the truthfulness of antecedent. It is written as:
α_1 → α_2, α_1 =>α_2
4.7.2 Modus Tollens (MT)
It states that if A → B and ¬ B are true then ¬ A is also true. Modus Tollens is
also referred to as denying the consequent as it denies the truthfulness of the
consequent. The rule is expressed as:
α_1 → α_2, ~α_2 =>~α_1
4.7.3 Disjunctive Syllogism (DS)
Disjunctive Syllogism affirms the truthfulness of the other proposition if one of
the propositions in a disjunction is false.
Rule 1:α_1 ∨α_2, ~α_1=>α_2
Rule 2:α_1 ∨α_2, ~α_2 =>α_1

4.7.4 Addition
The rule states that if a proposition is true, then its disjunction with any other
proposition is also true.
Rule 1:α_1=>α_1∨α_2
Rule 2:α_2=>α_1 ∨α_2

4.7.5 Simplification
Simplification means that if we have a conjunction, then both the constituent
propositions are also true.
Rule 1:α_1 ∧α_2 =>α_1
Rule 2:α_1 ∧α_2 =>α_2
4.7.6 Conjunction
Conjunction states if two propositions are true, then their conjunction is also
true. It is written as:
α_1, α_2 =>α_1 ∧α_2

4.7.7 Hypothetical Syllogism (HS)


The rule says that the conclusion α_1 → α_3 is true whenever the conditional statements α_1 → α_2 and α_2 → α_3 hold. This rule also shows the transitive nature of the implication operator.
α_1 → α_2, α_2 → α_3 =>α_1 → α_3

4.7.8 Absorption
The rule states that if the literal α_1 conditionally implies another literal α_2
i.e.,α_1 → α_2 is true, then α_1 → (α_1 ∧α_2) also holds.
α_1 → α_2 =>α_1 → (α_1 ∧α_2)

4.7.9 Constructive Dilemma (CD)


According to the rule, if the propositions (α_1 ∨ α_3) and ((α_1 → α_2) ∧ (α_3 → α_4)) have true values, then the well-formed formula (α_2 ∨ α_4) also holds a true value.
(α_1 ∨α_3 ), (α_1 → α_2) ∧ (α_3 → α_4) =>α_2 ∨α_4
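Each of these rules can be checked mechanically by enumerating all truth assignments: a rule is valid exactly when every assignment that makes all of its premises true also makes its conclusion true. A small Python sketch of such a check follows (is_valid_rule is our own helper, for illustration):

from itertools import product

implies = lambda a, b: (not a) or b

def is_valid_rule(premises, conclusion, n):
    """A rule over n propositional variables is valid iff every assignment
    that makes all the premises true also makes the conclusion true."""
    for vals in product([False, True], repeat=n):
        if all(p(*vals) for p in premises) and not conclusion(*vals):
            return False
    return True

# Modus Ponens: from a1 -> a2 and a1, infer a2 -- valid
print(is_valid_rule([lambda a1, a2: implies(a1, a2), lambda a1, a2: a1],
                    lambda a1, a2: a2, 2))            # True

# "Affirming the consequent": from a1 -> a2 and a2, infer a1 -- NOT a rule
print(is_valid_rule([lambda a1, a2: implies(a1, a2), lambda a1, a2: a2],
                    lambda a1, a2: a1, 2))            # False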

4.8 PROPOSITIONAL RULES OF


REPLACEMENT

We learned the concepts of Predicate and Propositional logic in MCS-212 (Discrete Mathematics); just to refresh the understanding, recall that “a proposition is a specialized statement whereas a Predicate is a generalized statement”. To be more specific, the propositions use the logical connectives only, and the predicates use both logical connectives and quantifiers (universal and existential).

Note: ∃ is the symbol used for the Existential quantifier and ∀ is used for the Universal quantifier.
In propositional logic, a replacement rule is used to replace an argument or a set of arguments with an equivalent argument. By equivalent arguments, we mean that the logical interpretation of the arguments is the same. These rules are used to manipulate propositions. Also, the axioms and the propositional rules of inference are used as an aid to generate the replacement rules. Given below is a table summarizing the different replacement rules over the propositions α_1, α_2 and α_3.
Replacement Rule                            Proposition            Equivalent

Tautology                                   α_1 ∧ α_1              α_1
(Conjunction of a statement with
itself always implies the statement)
Double Negation (DN)                        (~(~α_1))              α_1
(Also called negation of negation)
Commutativity                               α_1 ∧ α_2              α_2 ∧ α_1
(Valid for Conjunction and Disjunction)     α_1 ∨ α_2              α_2 ∨ α_1
Associativity                               (α_1 ∧ α_2) ∧ α_3      α_1 ∧ (α_2 ∧ α_3)
(Valid for Conjunction and Disjunction)     (α_1 ∨ α_2) ∨ α_3      α_1 ∨ (α_2 ∨ α_3)
De Morgan’s Laws                            ~ (α_1 ∧ α_2)          (~α_1) ∨ (~α_2)
                                            ~ (α_1 ∨ α_2)          (~α_1) ∧ (~α_2)
Transposition                               α_1 → α_2              (~α_2) → (~α_1)
(Defined over implication)
Exportation                                 α_1 → (α_2 → α_3)      (α_1 ∧ α_2) → α_3
Distribution                                α_1 ∧ (α_2 ∨ α_3)      (α_1 ∧ α_2) ∨ (α_1 ∧ α_3)
(AND over OR, and OR over AND)              α_1 ∨ (α_2 ∧ α_3)      (α_1 ∨ α_2) ∧ (α_1 ∨ α_3)
Material Implication (MI)                   α_1 → α_2              ~α_1 ∨ α_2
(Defined over implication)
Material Equivalence (ME)                   α_1 ⟷ α_2              (α_1 → α_2) ∧ (α_2 → α_1)
(Defined over biconditional)

4.9 VALIDITY AND SATISFIABILITY

An argument is called valid in propositional logic if the argument is a tautology, i.e., it is true for all possible combinations of truth values of the given premises. An argument is called satisfiable if the argument is true for at least one combination of the premises. Consider the examples given below. In the truth tables, every TRUE value is written as T and every FALSE value as F.
Example 1: Consider the expression α_1 ∨ (α_2 ∨ α_3) ∨ (~α_2 ∧ ~α_3), whose truth table is given below.

α_1   α_2   α_3   ~α_2   ~α_3   α_2 ∨ α_3   ~α_2 ∧ ~α_3   α_1 ∨ (α_2 ∨ α_3) ∨ (~α_2 ∧ ~α_3)
F     F     F     T      T      F           T             T
F     F     T     T      F      T           F             T
F     T     F     F      T      T           F             T
F     T     T     F      F      T           F             T
T     F     F     T      T      F           T             T
T     F     T     T      F      T           F             T
T     T     F     F      T      T           F             T
T     T     T     F      F      T           F             T

It is noteworthy from the above truth table that the argument α_1 ∨ (α_2 ∨α_3)
∨ (~α_2 ∧ ~α_3) is a valid argument as it has true values for all possible
combinations of the premises α_1, α_2 and α_3.
Example 2: Consider the expression α_1 ∧ ((α_2 ∧ α_3) ∨ (α_1 ∧ α_3)) for the propositions α_1, α_2 and α_3, whose truth table is given below.

α_1   α_2   α_3   α_1 ∧ α_3   α_2 ∧ α_3   (α_2 ∧ α_3) ∨ (α_1 ∧ α_3)   α_1 ∧ ((α_2 ∧ α_3) ∨ (α_1 ∧ α_3))
F     F     F     F           F           F                           F
F     F     T     F           F           F                           F
F     T     F     F           F           F                           F
F     T     T     F           T           T                           F
T     F     F     F           F           F                           F
T     F     T     T           F           T                           T
T     T     F     F           F           F                           F
T     T     T     T           T           T                           T
In this example, as the argument α_1 ∧ ((α_2 ∧α_3) ∨ (α_1 ∧α_3)) is true for a
few combinations of the premises α_1, α_2, and α_3, hence it is a satisfiable
argument.
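The same checks can be automated. The sketch below (classify is our own helper, for illustration) enumerates all 2^n truth assignments and classifies a formula as valid, satisfiable or unsatisfiable; it reproduces the conclusions of both examples above:

from itertools import product

def classify(expr, n):
    """Classify a formula of n variables as 'valid' (true everywhere),
    'satisfiable' (true somewhere) or 'unsatisfiable' (true nowhere)."""
    outcomes = [expr(*vals) for vals in product([False, True], repeat=n)]
    if all(outcomes):
        return 'valid'
    return 'satisfiable' if any(outcomes) else 'unsatisfiable'

# Example 1: α_1 ∨ (α_2 ∨ α_3) ∨ (~α_2 ∧ ~α_3)
print(classify(lambda a1, a2, a3: a1 or (a2 or a3) or ((not a2) and (not a3)), 3))
# -> 'valid'

# Example 2: α_1 ∧ ((α_2 ∧ α_3) ∨ (α_1 ∧ α_3))
print(classify(lambda a1, a2, a3: a1 and ((a2 and a3) or (a1 and a3)), 3))
# -> 'satisfiable'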
4.10 INTRODUCTION TO PREDICATE LOGIC

Now it’s time to understand the difference between a Proposition and a Predicate (also known as a propositional function). In short, a proposition is a specialized statement whereas a Predicate is a generalized statement. To be more specific, the propositions use the logical connectives only, and the predicates use both logical connectives and quantifiers (universal and existential).

Note: ∃ is the symbol used for the Existential quantifier and ∀ is used for the Universal quantifier.

Let’s understand the difference through some more detail, as given below.

A propositional function, or a predicate, is a sentence p(x) involving a variable x; it becomes a proposition when we give x a definite value from the set of values it can take. We usually denote such functions by p(x), q(x), etc. The set of values x can take is called the universe of discourse.

So, if p(x) is ‘x > 5’, then p(x) is not a proposition. But when we give x particular values, say
x = 6 or x = 0, then we get propositions. Here, p(6) is a true proposition and p(0) is a false
proposition.

Similarly, if q(x) is ‘x has gone to Patna.’, then replacing x by ‘Taj Mahal’ gives us a false
proposition.

Note that a predicate is usually not a proposition. But, of course, every proposition is a propositional function in the same way that every real number is a real-valued function, namely, the constant function.

Now, can all sentences be written in symbolic form by using only the logical connectives? What about sentences like ‘x is prime and x + 1 is prime for some x.’? How would you symbolize the phrase ‘for some x’, which we can rephrase as ‘there exists an x’? You must have come across this term often while studying mathematics. We use the symbol ‘∃’ to denote this quantifier, ‘there exists’. The way we use it is, for instance, to rewrite ‘There is at least one child in the class.’ as ‘(∃ x ∈ U)p(x)’,

where p(x) is the sentence ‘x is in the class.’ and U is the set of all children.

Now suppose we take the negative of the proposition we have just stated. Wouldn’t it be ‘There is no child in the class.’? We could symbolize this as ‘for all x in U, q(x)’ where x ranges over all children and q(x) denotes the sentence ‘x is not in the class.’, i.e., q(x) ≡ ~p(x).

We have a mathematical symbol for the quantifier ‘for all’, which is ‘∀’. So, the proposition above can be written as ‘(∀ x ∈ U)q(x)’, or ‘q(x), ∀ x ∈ U’.

An example of the use of the existential quantifier is the true statement

(∃ x ∈ R) (x + 1 > 0), which is read as ‘There exists an x in R for which x + 1 > 0.’.

Another example is the false statement

(∃ x ∈ N) (x − 1/2 = 0), which is read as ‘There exists an x in N for which x − 1/2 = 0.’.

An example of the use of the universal quantifier is (∀ x ∈ N) (x² > x), which is read as ‘for every x in N, x² > x.’. Of course, this is a false statement, because there is at least one x ∈ N (namely x = 1) for which it is false.

As you have already read in the example of a child in the class,

(∃ x ∈ U)p(x) is logically equivalent to ~(∀ x ∈ U)(~p(x)). Therefore,

~(∃ x ∈ U)p(x) ≡ ~~(∀ x ∈ U)(~p(x)) ≡ (∀ x ∈ U)(~p(x)).

This is one of the rules for negation that relate ∃ and ∀. The two rules are

~(∀ x ∈ U)p(x) ≡ (∃ x ∈ U)(~p(x)), and

~(∃ x ∈ U)p(x) ≡ (∀ x ∈ U)(~p(x)),

where U is the set of values that x can take.

To sum up: “a proposition is a specialized statement whereas a Predicate is a generalized statement”. To be more specific, the propositions use the logical connectives only, and the predicates use both logical connectives and quantifiers (universal and existential).
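Over a finite universe of discourse, the two quantifiers behave exactly like Python’s all() and any(). The following is a small sketch (the finite stand-in universe N = {1, …, 10} is our own choice, for illustration):

# Over a finite universe, ∃ behaves like any() and ∀ behaves like all().
N = range(1, 11)        # a small finite stand-in universe, for illustration

p = lambda x: x + 1 > 0
print(any(p(x) for x in N))          # (∃ x ∈ N)(x + 1 > 0)  -> True

q = lambda x: x * x > x
print(all(q(x) for x in N))          # (∀ x ∈ N)(x² > x)     -> False (x = 1 fails)

# The negation rule ~(∀ x)q(x) ≡ (∃ x)(~q(x)) holds on this universe:
print((not all(q(x) for x in N)) == any(not q(x) for x in N))   # True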

By interpretation of symbolic logic, we mean assigning meaning to the symbols of a formal language. By default, the statements do not have any meaning associated with them; but once we assign truth values to the symbols, a meaning, i.e., TRUE (T) or FALSE (F), is associated with them.
Example 1: Consider the following statements
a) The weather is not hot today and it is windier than yesterday.
b) Kids will go to the park only if it is not hot.
c) If kids do not go to the park, then they can play near the pool.
d) If kids play near the pool, then we can offer them a juice.
Let the statements be represented using literals as follows:
H: The weather is hot today
W: It is windier than yesterday
K: Kids will go to the park
P: They can play near the pool
J: We can offer them juice
The given statements can be interpreted as:
a) ~H ∧ W
b) K -> ~H
c) ~K -> P
d) P -> J
These interpretations may be verified with the help of truth tables.

4.11 INFERENCING IN PREDICATE LOGIC

In general, we are given a set of arguments in predicate logic. Now, using the rules of inference, we can deduce other arguments (or predicates) based on the given arguments (or predicates). This process is known as entailment, as we entail new arguments (or predicates). The inference rules you learned in MCS-212, and also in Section 4.7 of this unit, are applicable here as well for the process of entailment or making inferences. Now, with the help of the following example, we will learn how the rules of inference discussed above can be used to solve problems.
Example : There is a village that consists of two types of people – those who always tell the
truth, and those who always lie. Suppose that you visit the village and two villagers A and B
come up to you. Further, suppose

A says, “B always tells the truth,” and

B says, “A and I are of opposite types”.

What types are A and B ?

Solution: Let us start by assuming A is a truth-teller.

∴ What A says is true.

∴ B is a truth-teller.

∴ What B says is true.

∴ A and B are of opposite types.


This is a contradiction, because our premises say that A and B are both truth-tellers.

∴ The assumption we started with is false.

∴ A always tells lies.

∴ What A has told you is lie.

∴ B always tells lies.

∴ A and B are of the same type, i.e., both of them always lie.

Let us now consider the problem of showing that a statement is false, i.e., counterexamples: A common situation in which we look for counterexamples is to disprove statements of the form p → q. A counterexample to p → q needs to be an example where p ∧ ~q is true, i.e., p is true and ~q is true, i.e., the hypothesis p holds but the conclusion q does not hold.

For instance, to disprove the statement ‘If n is an odd integer, then n is prime.’, we need to
look for an odd integer which is not a prime number. 15 is one such integer. So, n = 15 is
a counterexample to the given statement.
Notice that a counterexample to a statement p proves that p is false, i.e., ~p is true.

Example: The following statements are to be symbolized, and thereafter a proof is to be constructed for the following valid argument:

(i) If the BOOK_X is literally true, then the Earth was made in six days.
(ii) If the Earth was made in six days, then carbon dating is useless and
Scientists/Researchers are liars.
(iii) Scientists/Researchers are not liars.
(iv) The BOOK_X is literally true, Hence
(v) God does not exist.

Solution: Let us symbolize as follows:

B : BOOK_X is literally true


E : The Earth was created in six days
C : Carbon dating is useless
S : Scientists/Researchers are liars
G : God exists
Therefore, the statements in the given argument are symbolically represented as:
(i) B → E

(ii) E → C ∧ S

(iii) ~S
(iv) B

(v) ~G (to show)

Using Modus Ponens on (i) and (iv), we get expression (vi) E

Using Modus Ponens on (ii) & (vi), we get expression (vii) C ∧ S

Using Simplification on (vii), we get expression (viii) S

Using Addition on (viii), we get expression (ix) S ∨ ~G

Using Disjunctive Syllogism (D.S.) on (iii) & (ix), we get expression (x) ~G

The last statement is what is to be proved.

Remarks: (iii) and (viii) contradict each other in the above deduction. In general, if we come across two statements (like S and ~S) that contradict each other during the process of deduction, we can deduce any statement, even a statement that could never be true otherwise. So, we can conclude any statement once both S and ~S have appeared in the process of derivation.
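This entailment can also be verified by brute force over all truth assignments (a sketch; the loop below is our own, for illustration). Since the four premises are jointly unsatisfiable, every model that satisfies them — there are none — also satisfies ~G, so the argument is valid:

from itertools import product

implies = lambda a, b: (not a) or b

entailed = True
for B, E, C, S, G in product([False, True], repeat=5):
    # the four premises: B -> E, E -> (C ∧ S), ~S, B
    premises = implies(B, E) and implies(E, C and S) and (not S) and B
    if premises and G:              # a model of the premises where ~G fails?
        entailed = False
print(entailed)                     # True: ~G follows from the premises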

4.12 PROOF SYSTEMS

A sequence of statements that follows logically from the previous set of


statements or observations are the mathematical proofs in propositional logic.
The last statements in the proof becomes the theorem. This proof system
symbolizes the science of valid inference.
There are several styles of proof systems for propositional logic, but the two main styles are:
a) Axiomatic Proof Systems and
b) Natural Deduction Systems

Axiomatic Proof Systems: This is a system where the conclusion is derived either from the given hypotheses or by using assumed premises, which are considered as true. Such a hypothesis or assumed premise is an axiom.
Example 1: Show that (α_1 ∨α_2) is a logical consequence of (α_2 ∧α_1) using
proof systems.
Proof:
Step 1: (α_2 ∧α_1) (Premise)
Step 2: α_1 (Simplification, Step 1)
Step 3: α_2 (Simplification, Step 1)
Step 4: α_1 ∨α_2 (Addition on either Step 2 or Step 3)
We observe here that each step is either a truth or a logical consequence of
previously established truths. In general, there can be more than one proof for
establishing a given conclusion.
Thus, the proof system in propositional logic is quite similar to the proofs in
mathematics, which follows a step-wise derivation of consequent from the
given hypothesis. However, in propositional logic, we use well-formed formulas
obtained from literals and connectors, rather than using English statements. We
use the rules of inference (Section 4.7) and replacement rules (Section 4.8) to
prove our conclusion.
Example 2: Show that α_1 -> ~α_4 can be derived from the given premises:
(α_1 -> (α_2 ∨ α_3)), α_2 -> ~α_1, α_4 -> ~α_3
Proof:
Step 1: (α_1 -> (α_2 ∨ α_3)) (Premise)
Step 2: α_1 (Assumed Premise)
Step 3: α_2 ∨ α_3 (Modus Ponens, Steps 1, 2)
Step 4: α_2 -> ~α_1 (Premise)
Step 5: ~α_2 (Modus Tollens, Steps 2, 4)
Step 6: α_3 (Disjunctive Syllogism, Steps 3, 5)
Step 7: α_4 -> ~α_3 (Premise)
Step 8: ~α_4 (Modus Tollens, Steps 6, 7)
Step 9: α_1 -> ~α_4 (Conditional proof: discharging the assumed premise of Step 2, using Step 8)
Hence proved.
The discussion over Natural deduction Systems is given in section 4.13 below
4.13 NATURAL DEDUCTION

So far, we have discussed methods of solving problems requiring reasoning in propositional logic that were based on

i) Truth-table construction
ii) Use of inference rules,

and these follow, directly or indirectly, the natural deduction approach.

In order to determine whether or not a conclusion C in an argument is valid or invalid based on a given set of facts or axioms A1, A2, …, An, the only thing we currently know is that either a truth table should be constructed for the formula P: A1 ∧ A2 ∧ … ∧ An → C, or this formula should be converted to CNF or DNF by substituting equivalent formulas and simplifying it. There are other possible options available as well. The trouble with these methods, on the other hand, is that as the number n of axioms increases, the formula becomes more complicated (imagine n being equal to 50), and the number of variables involved, say k, also normally increases. When there are k different variables to consider in an argument, the size of the truth table grows to 2^k. For big values of k, the number of rows, 2^k, approaches an almost unmanageable level. As a result, it is necessary for us to look for alternative approaches that, rather than processing the entire argument as a single formula, process each of the individual formulas A1, A2, …, An and C of the argument, as well as their derivatives, by applying some principles that ensure the validity of the results.

In an earlier section, we introduced nine different inference rules that can be used in propositional logic to help derive logical inferences. The methods of drawing valid conclusions that have been discussed up until this point are examples of the natural deduction approach to making inferences. This is an approach in which the reasoning system starts the reasoning process from the axioms, uses inference rules, and, if the conclusion can be validly drawn, ultimately reaches the intended conclusion. On the other hand, there is a method of obtaining legitimate conclusions known as the Refutation approach, which will be covered in the following section.

The normal forms (CNF and DNF) also play a vital role in both Natural deductions and
Resolution approach. To understand the normal forms, we need to start with the basic
concepts of clauses, literals etc.

Some Definitions: A clause is a disjunction of literals. For example, (E ∨ ~F ∨ ~G) is a clause. But (E ∧ ~F ∧ ~G) is not a clause. A literal is either an atom, say A, or its negation, say ~A.

Definition: A formula E is said to be in Conjunctive Normal Form (CNF) if and only if E has the form E : E1 ∧ … ∧ En, n ≥ 1, where each of E1, …, En is a disjunction of literals.

Definition: A formula E is said to be in Disjunctive Normal Form (DNF) if and only if E has the form E : E1 ∨ E2 ∨ … ∨ En, where each Ei is a conjunction of literals.
Examples: Let A, B and C be atoms. Then F: (~A ∧ B) ∨ (A ∧ ~B ∧ ~C) is a formula in a disjunctive normal form.

Example: Again G: (~A ∨ B) ∧ (A ∨ ~B ∨ ~C) is a formula in Conjunctive Normal Form, because it is a conjunction of the two disjunctions of literals, viz. (~A ∨ B) and (A ∨ ~B ∨ ~C).

Example: Each of the following is neither in CNF nor in DNF:

(i) (~A ∧ B) ∨ (A ∧ (~B ∨ C))
(ii) (A → B) ∧ (~B → ~A)
Using the table of equivalent formulas given above, any valid Propositional Logic formula can be transformed into CNF as well as DNF.

The steps for conversion to DNF are as follows:

Step 1: Use the equivalences to remove the logical operators ‘⟷’ and ‘→’:

(i) E ⟷ G = (E → G) ∧ (G → E)

(ii) E → G = ~E ∨ G

Step 2: Remove ~’s, if they occur consecutively more than once, using

(iii) ~(~E) = E

(iv) Use De Morgan’s laws to take ‘~’ nearest to the atoms

(v) ~(E ∧ G) = ~E ∨ ~G

(vi) ~(E ∨ G) = ~E ∧ ~G

Step 3: Use the distributive laws repeatedly

(vii) E ∧ (G ∨ H) = (E ∧ G) ∨ (E ∧ H)

(viii) E ∨ (G ∧ H) = (E ∨ G) ∧ (E ∨ H)

Example: Obtain a disjunctive normal form for the formula ~(A → (~B ∧ C)).

Consider A → (~B ∧ C) = ~A ∨ (~B ∧ C)     (Using (E → F) = (~E ∨ F))

Hence, ~(A → (~B ∧ C)) = ~(~A ∨ (~B ∧ C))

= ~(~A) ∧ (~(~B ∧ C))     (Using ~(E ∨ F) = ~E ∧ ~F)

= A ∧ (B ∨ (~C))     (Using ~(~E) = E and ~(E ∧ F) = ~E ∨ ~F)

= (A ∧ B) ∨ (A ∧ (~C))     (Using E ∧ (F ∨ G) = (E ∧ F) ∨ (E ∧ G))

However, if we are to obtain the CNF of ~(A → (~B ∧ C)), then at the last but one step we obtain

~(A → (~B ∧ C)) = A ∧ (B ∨ ~C), which is in CNF, because each of A and (B ∨ ~C) is a disjunction of literals.

Example: Obtain the Conjunctive Normal Form (CNF) for the formula: D → (A → (B ∧ C))

Consider

D → (A → (B ∧ C))     (using E → F = ~E ∨ F for the inner implication)

= D → (~A ∨ (B ∧ C))     (using E → F = ~E ∨ F for the outer implication)

= ~D ∨ (~A ∨ (B ∧ C))

= (~D ∨ ~A) ∨ (B ∧ C)     (using the Associative law for disjunction)

= (~D ∨ ~A ∨ B) ∧ (~D ∨ ~A ∨ C)     (using distributivity of ∨ over ∧)

The last line denotes the Conjunctive Normal Form of D → (A → (B ∧ C)).

Note: If we stop at the last but one step, then we obtain (~D ∨ ~A) ∨ (B ∧ C) = ~D ∨ ~A ∨ (B ∧ C), which is a Disjunctive Normal Form for the given formula: D → (A → (B ∧ C)).
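Both conversions can be cross-checked mechanically, assuming the SymPy library is available (the expected outputs in the comments are up to reordering of terms):

from sympy import Implies, Not
from sympy.abc import A, B, C, D
from sympy.logic.boolalg import to_cnf, to_dnf

print(to_dnf(Not(Implies(A, Not(B) & C))))     # expected: (A & B) | (A & ~C)
print(to_cnf(Implies(D, Implies(A, B & C))))   # expected: (B | ~A | ~D) & (C | ~A | ~D)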

Sum up of the technique of natural deduction - Apart from constructing a truth table to show that the conclusion follows from the given set of premises, another method exists, known as natural deduction. In this case, we assume that the conclusion is not valid. We consider the negated conclusion as a premise along with the other given premises. We apply certain implications, equivalences and replacement rules to derive a contradiction to our assumption. Once we obtain a contradiction, this proves that the given argument is true.
Example 1: Show that Z is a valid conclusion from the premises X,X->Y and
Y->Z.
Proof: Step 1: ~Z (Negated conclusion as Premise)
Step 2: X (Premise)
Step 3:X-> Y (Premise)
Step 4: Y (using Step 2,3andModus Ponens)
Step 5: Y -> Z (Premise)
Step 6: Z (using Step 4, 5 and Modus Ponens)
We obtain the conclusion Z from the given premises, which is a contradiction to
our assumption ~Z. Thus, the given conclusion is valid.
Example 2: Show that ~P can be concluded from R ∨ S, S -> ~Q, P -> Q and R -> ~Q.
Proof: Step 1: ~(~P) (Negated conclusion as Premise)
Step 2: P (Step 1, Double Negation)
Step 3: P -> Q (Premise)
Step 4: Q (Steps 2, 3, Modus Ponens)
Step 5: S -> ~Q (Premise)
Step 6: ~S (Steps 4, 5, Modus Tollens)
Step 7: R ∨ S (Premise)
Step 8: R (Steps 6, 7, Disjunctive Syllogism)
Step 9: R -> ~Q (Premise)
Step 10: ~Q (Steps 8, 9, Modus Ponens)
We obtain ~Q in Step 10 and Q in Step 4, which is a contradiction. Thus, our assumption is not valid. Hence, the conclusion follows from the set of premises.

4.14 PROPOSITIONAL RESOLUTION

For the most part, there are two distinct strategies that can be implemented in order to
demonstrate the correctness of a theorem or derive a valid conclusion from a given collection
of axioms:
i) natural deduction
ii) the method of refutation

In the method known as natural deduction, one begins with a given set of axioms, applies
various rules of inference, and ultimately arrives at a conclusion. This method is strikingly
similar to the intuitive reasoning that is characteristic of humans.

When using a refutation approach, one begins with the negation of the conclusion that is to be
drawn and then proceeds to deduce a contradiction, i.e., "false". A contradiction can be deduced
only because we have presupposed that the conclusion is incorrect; hence, the
premise that the conclusion is incorrect is itself incorrect. Therefore, the argument by the
technique of resolution leads to the correctness of the conclusion. In this section,
we discuss such a method, known as the Resolution Method, which was
proposed by Robinson in 1965 and is based on the refutation approach. The Robinson
technique has served as the foundation for numerous computerised theorem provers, which
highlights the significance of the method in question.
Propositional resolution is a sound, complete and powerful rule of inference used in
Propositional Logic. It is used to prove the unsatisfiability of a given set of statements. This is
done using a strategy called Resolution Refutation, which uses the Resolution rule described
below.

Resolution Rule: The rule states that given two clauses {φ, α1, α2, α3, …, αm} and
{~φ, γ1, γ2, γ3, …, γn}, which contain a complementary pair of literals φ and ~φ, the
conclusion (resolvent) is {α1, α2, α3, …, αm, γ1, γ2, γ3, …, γn}.
For example, the statements {A, B} and {C, ~B} lead to the conclusion {A, C}.
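The rule is easy to implement if a clause is represented as a set of literal strings; the following is a minimal Python sketch (the function name resolve is our own choice):

def resolve(c1, c2):
    # Return all resolvents of clauses c1, c2 (frozensets of literals
    # such as 'B' or '~B'), one per complementary pair found.
    resolvents = []
    for lit in c1:
        comp = lit[1:] if lit.startswith('~') else '~' + lit
        if comp in c2:
            resolvents.append((c1 - {lit}) | (c2 - {comp}))
    return resolvents

print(resolve(frozenset({'A', 'B'}), frozenset({'C', '~B'})))
# [frozenset({'A', 'C'})]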
Resolution Refutation:
a) Convert all the given statements to Conjunctive Normal Form (CNF). It is
also described as an AND of ORs. For example, (A ∨ B) ∧ (~A ∨ B) ∧ (~B ∨ A)
is a CNF.
b) Obtain the negation of the given conclusion.
c) Apply the resolution rule until either a contradiction is obtained or the
resolution rule cannot be applied anymore.
4.14.1 Clausal Form
Any atomic sentence or its negation is called a literal. A literal, or the
disjunction of at least two literals, is called a clausal form or clause
expression. Next, a clause is defined as the set of literals in the clause form or
clause expression. For example, consider two atomic statements represented
using the literals X and Y. Their clausal expressions are X, ~X and (X V Y), and
the clauses for these expressions are {X}, {~X} and {X, Y}. It is noteworthy
that the empty set {} is always a clause; it represents an empty disjunction and
hence is unsatisfiable. Kindly note that the conjunctive normal form (CNF) of an
expression represents the corresponding clausal form.
Now, we shall first understand certain rules for converting the statements to the
clause form as given below.
i) Operator: (β1 V β2 V β3 V … V βm) => {β1, β2, β3, …, βm}
(β1 ∧ β2 ∧ β3 ∧ … ∧ βm) => {β1}, {β2}, {β3}, …, {βm}
ii) Negation: same as Double Negation and De Morgan’s Law in section
4.8
iii) Distribution: as in sub-section 4.8
iv) Implications: as Material implication and Material Equivalence
(section 4.8)
Example 1: Convert A ∧ (B -> C) to clausal expressions.
Step 1: A ∧ (~B V C) (using rule Material Implication to eliminate ->)
Step 2: {A}, {~B V C} (using rule Operator to eliminate ∧)

Example 2: Derive the clausal form or conjunctive normal form of X <-> Y.


Step 1: Replace bi-condition using Material equivalence rule:
(X → Y) ∧ (Y → X)
Step 2: Use Material Implication replacement rule to replace the
implication:
(¬ X ∨ Y) ∧ (¬ Y ∨ X)

Example 3: Derive the CNF of ~(~Z ∧ ((~X) -> (~Y)))


Step 1: Replace implication using Material Implication(MI):
~(~Z ∧ (~(~X) ∨~Y))
Step 2: Use double negation (DN):
~(~Z∧ (X ∨~Y))
Step 3: Apply DeMorgan’s Law:
~~Z ∨ ~(X ∨~Y)
Step 4: Apply Double Negation:
Z ∨~(X∨~Y)
Step 5: Apply DeMorgan’s Law again:
Z ∨ (~X ∧~~Y)
Step 6: Apply Double Negation on ~~Y:
Z ∨ (~X ∧Y)
Step 7: Lastly, apply Distributive law to obtain CNF:
(Z∨ ~X) ∧ (Z ∨ Y), which is the AND of OR’s form.
(Z∨ ~X), (Z ∨ Y) are the clausal forms for the given expression.
4.14.2 Determining Unsatisfiability
If the obtained set of clauses is not satisfiable, then one can derive an empty
clause using the resolution principle as described above. In other words, to
determine whether a set of propositions or premises P logically entails a
conclusion C, write P ∪ {¬C} in clausal form and try to derive the empty clause,
as explained in the examples below.
Example 1: Given a set of propositions : X -> Y, Y -> Z. Prove X -> Z.
Proof: To prove the conclusion, we add the negation of conclusion i.e ~(X ->
Z) to the set of premises, and derive an empty clause.
Step 1: X -> Y (Premise)
Step 2: ~X ∨ Y (Premise, Material Implication)
Step 3: Y -> Z (Premise)
Step 4: ~Y ∨ Z (Premise, Material Implication)
Step 5: ~(X -> Z) (Premise, Negated Conclusion)
Step 6: ~(~X ∨ Z) (Premise, Material Implication)
Step 7: X ∧ ~Z (Premise, De Morgan's)
Step 8: X (Clausal form, Operator)
Step 9: ~Z (Clausal form, Operator)
Step 10: Y (Resolution rule on Steps 2 and 8)
Step 11: Z (Resolution rule on Steps 10 and 4)
Step 12: {} (Resolution rule on Steps 11 and 9)
Thus, the given set of premises entails the conclusion.

Example 2: Use propositional resolution to derive the goal from the given
knowledge base.
a) Either it is a head, or Lisa wins.
b) If Lisa wins, then Mary will go.
c) If it is a head, then the game is over.
d) The game is not over.
Conclusion: Mary will go.
Proof: First consider propositions to represent each of the statement in
knowledge base.
Let H: It is a head
L: Lisa wins
M: Mary will go
G: Game is over.
Re-writing the given knowledge base using the propositions defined.
a) H∨L
b) L -> M
c) H -> G
d) ~G
Conclusion: M
Step 1: H ∨ L (Premise)
Step 2: L -> M (Premise)
Step 3: ~L ∨ M (Step 2, Material Implication)
Step 4: H -> G (Premise)
Step 5: ~H ∨ G (Step 4, Material Implication)
Step 6: ~G (Premise)
Step 7: ~M (Negated conclusion as premise)
Step 8: H ∨ M (Resolution principle on Steps 1 and 3)
Step 9: M ∨ G (Resolution principle on Steps 8 and 5)
Step 10: M (Resolution principle on Steps 9 and 6)
Step 11: {} (Resolution on Steps 10 and 7)
After applying Proof by Refutation i.e., contradicting the conclusion, the
problem is terminated with an empty clause ({}). Hence, the conclusion is
derived.

Example 3: Show that ~S1 follows from S1 -> S2 and ~(S1 ∧ S2).
Proof:
Step 1: S1 -> S2 (Premise)
Step 2: ~S1 ∨ S2 (Material Implication, Step 1)
Step 3: ~(S1 ∧ S2) (Premise)
Step 4: ~S1 ∨~S2 (De Morgan’s, Step 3)
Step 5: ~S1 (Resolution, Step 2, 4)
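The whole refutation strategy can be mechanised by repeatedly resolving pairs of clauses until the empty clause appears or no new clause can be generated. A naive sketch, reusing the resolve() helper given earlier, applied to the clauses of Example 3:

def refutable(clauses):
    # Saturate the clause set under resolution; report whether the
    # empty clause (a contradiction) is ever derived.
    clauses = set(clauses)
    while True:
        new = set()
        for c1 in clauses:
            for c2 in clauses:
                if c1 != c2:
                    for r in resolve(c1, c2):
                        if not r:          # empty clause derived
                            return True
                        new.add(r)
        if new <= clauses:                 # saturation: nothing new
            return False
        clauses |= new

kb = [frozenset({'~S1', 'S2'}),    # S1 -> S2, by Material Implication
      frozenset({'~S1', '~S2'}),   # ~(S1 & S2), by De Morgan's
      frozenset({'S1'})]           # negation of the conclusion ~S1
print(refutable(kb))               # True, so ~S1 follows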

The resolution mechanism in PL is not used until after the given statements or wffs have been
converted into clausal forms. To obtain the clausal form of a wff, one must first convert the
wff into Conjunctive Normal Form (CNF). We are already familiar with the fact that a
clause is a formula (and only a formula) of the form A1 ∨ A2 ∨ … ∨ An, where each Ai is
either an atomic formula or its negation.

The method of resolution is actually a generalization of Modus Ponens:

from P and P → Q, conclude Q,

which can be written in the equivalent form

from P and ~P ∨ Q, conclude Q (using the relation P → Q = ~P ∨ Q).

If we are given that both P and ~P ∨ Q are true, then we may safely conclude that Q is also
true. This is a straightforward application of the general resolution principle that will be
covered in more detail in this unit.
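Using the resolve() sketch from earlier, Modus Ponens is literally one resolution step:

# P together with ~P | Q (i.e., P -> Q) resolves to Q
print(resolve(frozenset({'P'}), frozenset({'~P', 'Q'})))
# [frozenset({'Q'})]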

The validity of the resolution process can, in general, be demonstrated by constructing a truth
table. Before discussing the resolution method further, we first consider some of its
applications.

Example: Let C1:QR and C2: ~QS be two given clauses, so that, one of the literals i.e., Q
occurs in one of the clauses (in this case C1)and its negation (~Q) occurs in the other clause
C2. Then application of resolution method in this case tells us to take disjunction of the
remaining parts of the given clause C1 and C2, i.e., to take C3:RS as deduction from C1 and
C2. Then C3 is called a resolvent of C1 and C2.

The two literals Q and (~Q) which occur in two different clauses are called complementary
literals.

In order to illustrate resolution method, we consider another example.

Example: Let us be given the clauses C1:~S~QR and C2:~PQ.

In this case, complementary pair of literals viz. Q and ~Q occur in the two clause C1 and C2.

Hence, the resolution method states: ConcludeC3:~SR(~P)

Example: Let us be given the clauses C1:~QR and C2:~QS

Then, in this case, the clauses do not have any complementary pair of literals and hence,

resolution method cannot be applied.

Example: Consider a set of three clauses

C1: R

C2: ~R ∨ S

C3: ~S

Then, from C1 and C2 we conclude, through resolution:

C4: S

From C3 and C4, we conclude

C5: FALSE

However, the resolvent FALSE can be deduced only from an unsatisfiable set of clauses.
Hence, the set of clauses C1, C2 and C3 is an unsatisfiable set of clauses.

Example: Consider the set of clauses

C1: R ∨ S
C2: ~R ∨ S
C3: R ∨ ~S
C4: ~R ∨ ~S
Then, from clauses C1 and C2 we get the resolvent
C5: S ∨ S = S

From C3 and C4 we get the resolvent

C6: ~S

From C5 and C6 we get the resolvent

C7: FALSE

Thus, again, the set of clauses C1, C2, C3 and C4 is unsatisfiable.

Note: The resolvent FALSE was obtained here from the two clauses C5 and C6, which were
themselves derived from the four given clauses. Observe also that a superset of any
unsatisfiable set of clauses is itself unsatisfiable.

Example: Show that the set of clauses

C1: R ∨ S
C2: ~S ∨ W
C3: ~R ∨ S
C4: ~W is unsatisfiable.
From clauses C1 and C3 we get the resolvent
C7: S
From the clauses C7 and C2 we get the resolvent
C8: W
From the clauses C8 and C4 we get
C9: FALSE
Hence, the given set of clauses is unsatisfiable.
Solution of Problems Using the Resolution Method: As was discussed before, the
resolution process can also be understood as a refutation approach. The proving technique
proceeds as follows:

After symbolically representing the problem at hand, an additional premise, viz. the negation
of the wff which stands for the conclusion, is added. From this enhanced set of premises and
axioms, we attempt to infer FALSE, i.e., a contradiction. If we are able to deduce FALSE,
then the conclusion that was required to be drawn is correct, and the problem is solved. If,
despite our best efforts, we are unable to deduce FALSE, then we cannot determine
whether or not the conclusion is correct; in that case, the problem cannot be solved using
the axioms that have been provided and the conclusion that has been drawn.

Let us now apply the Resolution Method to the problems we discussed earlier.

Example: Suppose that if the interest rate goes up, stock prices go down. Also, suppose that
most people are unhappy when stock prices go down. Assume that the interest rate goes up.
Show that we can draw the conclusion that most people are unhappy.

To show the above conclusion, let us denote the statements as follows:

 A : Interest rate goes up,


 S : Stock prices go down
 U : Most people are unhappy

The problem has the following four statements


1) If the interest rate goes up, stock prices go down.
2) If stock prices go down, most people are unhappy.
3) The interest rate goes up.
4) Most people are unhappy. (to conclude)

These statements are first symbolized as wffs of PL as follows:


(1) A → S
(2) S → U
(3) A
(4) U (to conclude)
Converting to clausal form, we get
(i) ~A ∨ S
(ii) ~S ∨ U
(iii) A
(iv) U (to be concluded)
As per resolution method, assume (iv) as false, i.e., assume ~U as initially given statement,
i.e., an axiom.
Thus, the set of axioms in clausal form is:

(i) ~A ∨ S
(ii) ~S ∨ U
(iii) A
(iv) ~U

Then from (i) and (iii), through resolution, we get the clause
(v) S.
From (ii) and (iv), through resolution, we get the clause

(vi)~S

From (vi) and (v), through resolution, we get

(vii) FALSE

Hence, the conclusion, i.e., (iv) U: Most people are unhappy, is valid.

From the above solution using the resolution method, we may notice that clausal
conversion is a major, time-consuming step after translation to wffs. Most of the time,
once the clausal form is known, the proof is easy to see, at least for a person.

☞ Check Your Progress 4

Ques 1. Prove that the given set of premises is unsatisfiable:


a) {X, Y}, {~X, Z}, {~X,~Z}, {X, ~Y}
Ques 2. Consider the given knowledge base to prove the conclusion.
KB1. If Mary goes to school, then Mary eats lunch.
KB 2. If it is Friday, then Mary goes to school or eats lunch.
Conclusion: If it is Friday, then Mary eats lunch.

4.15 ANSWERS/SOLUTIONS

Check Your Progress 1


Ques No Ans Ques No Ans
1 No 8 No
2 No 9 No
3 Yes 10 Yes
4 Yes 11 No
5 No 12 No
6 Yes 13 Yes
7 Yes 14 No
Check Your Progress 2
Ques No Ans Ques No Ans
1 Atomic 6 Atomic
2 Compound 7 Compound
3 Compound 8 Compound
4 Compound 9 Compound
5 Atomic 10 Atomic
Check Your Progress 3
1. (~α_1 → (α_1 ∨α_2)) ⟷α_2
α_1 α_2 α_1 ∨α_2 ~ α_1 → (α_1 ∨α_2) (~α_1 → (α_1 ∨α_2)) ⟷α_2
_F(0)_ _F(0)_ _F(0)_ _F(0)_ _T(1)_
_F(0)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_
_T(1)_ _F(0)_ _T(1)_ _T(1)_ _F(0)_
_T(1)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_

2. (~α_1 ⟷ (α_2 ⟷α_3) ) ∨ (α_3 ∧α_2)


α_1 α_2 α_3 α_2 ⟷ α_3 ~α_1 ⟷ (α_2 ⟷ α_3) α_3 ∧ α_2 (~α_1 ⟷ (α_2 ⟷ α_3)) ∨ (α_3 ∧ α_2)
_F(0)_ _F(0)_ _F(0)_ _T(1)_ _T(1)_ _F(0)_ _T(1)_
_T(1)_ _F(0)_ _F(0)_ _T(1)_ _F(0)_ _F(0)_ _F(0)_
_F(0)_ _F(0)_ _T(1)_ _F(0)_ _F(0)_ _F(0)_ _F(0)_
_T(1)_ _F(0)_ _T(1)_ _F(0)_ _T(1)_ _F(0)_ _T(1)_
_F(0)_ _T(1)_ _F(0)_ _F(0)_ _F(0)_ _F(0)_ _F(0)_
_T(1)_ _T(1)_ _F(0)_ _F(0)_ _T(1)_ _F(0)_ _T(1)_
_F(0)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_
_T(1)_ _T(1)_ _T(1)_ _T(1)_ _F(0)_ _T(1)_ _T(1)_

3. ((α_1 ∧α_2) → α_3) ∨ ~α_4


α_1 α_2 α_1 ∧ α_2 α_3 (α_1 ∧ α_2) → α_3 α_4 ((α_1 ∧ α_2) → α_3) ∨ ~α_4
_F(0)_ _F(0)_ _F(0)_ _F(0)_ _T(1)_ _F(0)_ _T(1)_
_F(0)_ _F(0)_ _F(0)_ _F(0)_ _T(1)_ _T(1)_ _T(1)_
_F(0)_ _F(0)_ _F(0)_ _T(1)_ _T(1)_ _F(0)_ _T(1)_
_F(0)_ _F(0)_ _F(0)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_
_F(0)_ _T(1)_ _F(0)_ _F(0)_ _T(1)_ _F(0)_ _T(1)_
_F(0)_ _T(1)_ _F(0)_ _F(0)_ _T(1)_ _T(1)_ _T(1)_
_F(0)_ _T(1)_ _F(0)_ _T(1)_ _T(1)_ _F(0)_ _T(1)_
_F(0)_ _T(1)_ _F(0)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_
_T(1)_ _F(0)_ _F(0)_ _F(0)_ _T(1)_ _F(0)_ _T(1)_
_T(1)_ _F(0)_ _F(0)_ _F(0)_ _T(1)_ _T(1)_ _T(1)_
_T(1)_ _F(0)_ _F(0)_ _T(1)_ _T(1)_ _F(0)_ _T(1)_
_T(1)_ _F(0)_ _F(0)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_
_T(1)_ _T(1)_ _T(1)_ _F(0)_ _F(0)_ _F(0)_ _T(1)_
_T(1)_ _T(1)_ _T(1)_ _F(0)_ _F(0)_ _T(1)_ _F(0)_
_T(1)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_ _F(0)_ _T(1)_
_T(1)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_
4. ((α_1 → ~ α_2) ⟷α_3) → ~ (α_1 V α_1)
α_1 α_2 α_1 → ~α_2 α_3 ~(α_1 ∨ α_1) (α_1 → ~α_2) ⟷ α_3 ((α_1 → ~α_2) ⟷ α_3) → ~(α_1 ∨ α_1)
_F(0)_ _F(0)_ _F(0)_ _F(0)_ _T(1)_ _T(1)_ _T(1)_
_F(0)_ _F(0)_ _F(0)_ _T(1)_ _T(1)_ _F(0)_ _F(0)_
_F(0)_ _T(1)_ _T(1)_ _F(0)_ _T(1)_ _F(0)_ _F(0)_
_F(0)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_ _T(1)_
_T(1)_ _F(0)_ _T(1)_ _F(0)_ _F(0)_ _F(0)_ _T(1)_
_T(1)_ _F(0)_ _T(1)_ _T(1)_ _F(0)_ _T(1)_ _F(0)_
_T(1)_ _T(1)_ _T(1)_ _F(0)_ _F(0)_ _F(0)_ _F(0)_
_T(1)_ _T(1)_ _T(1)_ _T(1)_ _F(0)_ _T(1)_ _T(1)_

Check Your Progress 4


1. For the given set of premises, derive an empty clause {} using propositional
resolution to show that the given set of premises is unsatisfiable.
Step 1: X V Y (Premise)
Step 2: ~X V Z (Premise)
Step 3: ~X V ~Z (Premise)
Step 4: X V ~Y (Premise)
Step 5: X (Resolution on Steps 1 and 4)
Step 6: ~X (Resolution on Steps 2 and 3)
Step 7: {} (Using Steps 5 and 6)
Thus, the given set of premises is unsatisfiable.

2. First introduce notation set for the given knowledge base as:
S: Mary goes to school
L: Mary eats lunch
F: It is Friday
Corresponding knowledge base is:
KB1: S -> L
KB2:F -> (S V L)
Conclusion: F -> L

Proof: Step 1: ~S V L (Premise, Material Implication)


Step 2: ~F V (S V L) (Premise, Material Implication)
Step 3: F (Negated conclusion)
Step 4: ~L (Negated conclusion)
Step 5: (S V L) (Resolution on Step 2 and 3)
Step 6: L (Resolution on Step 1 and 5)
Step 7: {} (Resolution on Steps 4 and 6)
4.16 FURTHER READINGS

1. C. L. Liu & D. P. Mohapatra, Elements of Discrete Mathematics: A


Computer Oriented Approach, Fourth Edition, 2017, Tata McGraw Hill
Education.
2. David J. Hunter, Essentials of Discrete Mathematics, Third Edition, 2016,
Jones and Bartlett Publishers.
3. James L. Hein, Discrete Structures, Logic and Computability, Fourth
Edition, 2015, Jones and Bartlett Publishers.
4. Kenneth H. Rosen, Discrete Mathematics and Its Applications, Eighth
Edition, 2021, Tata McGraw Hill Education.
5. Thomas Koshy, Discrete Mathematics with Applications, 2012, Elsevier
Academic Press.
UNIT 5 FIRST ORDER PREDICATE LOGIC
Structure
5.0 Introduction
5.1 Objectives
5.2 Syntax of First Order Predicate Logic (FOPL)
5.3 Interpretations in FOPL
5.4 Semantics of Quantifiers
5.5 Inference & Entailment in FOPL
5.6 Conversion to clausal form
5.7 Resolution & Unification
5.8 Summary
5.9 Solutions/Answers
5.10 Further Readings

5.0 INTRODUCTION

In the previous unit, we discussed how propositional logic helps us in solving problems. However, one of
the major problems with propositional logic is that, sometimes, it is unable to capture even elementary
types of reasoning or argument, as represented by the following statements:

Every man is mortal.

Raman is a man.

Hence, he is mortal.

The above reasoning is intuitively correct. However, if we attempt to simulate the reasoning through
Propositional Logic and further, for this purpose, we use symbols P, Q and R to denote the statements
given above as:

P: Every man is mortal,

Q: Raman is a man,

R: Raman is mortal.

Once, the statements in the argument in English are symbolised to apply tools of propositional logic, we
just have three symbols P, Q and R available with us and apparently no link or connection to the original
statements or to each other. The connections, which would have helped in solving the problem become
invisible. In Propositional Logic, there is no way, to conclude the symbol R from the symbols P and Q.
However, as we mentioned earlier, even in a natural language, the conclusion of the statement denoted by
R from the statements denoted by P and Q is obvious. Therefore, we search for some symbolic system of
reasoning that helps us in discussing argument forms of the above-mentioned type, in addition to those
forms which can be discussed within the framework of propositional logic. First Order Predicate Logic
(FOPL) is the most well-known symbolic system for the purpose.

The symbolic system of FOPL does not treat an atomic statement as an indivisible unit. Rather, FOPL not
only treats an atomic statement as divisible into subject and predicate, but even deeper structures of
an atomic statement are considered in order to handle a larger class of arguments. How, and to what extent,
FOPL symbolizes and establishes validity/invalidity and consistency/inconsistency of arguments is the
subject matter of this unit.

5.1 OBJECTIVES
After studying this unit, you should be able to:

 explain why FOPL is required over and above PL;


 define, and give appropriate examples for, each of the new concepts required for FOPL including
those of quantifier, variable, constant, term, free and bound occurrences of variables, closed and open
wff;
 check consistency/validity, if any, of closed formulas;
 reduce a given formula of FOPL to normal forms: Prenex Normal Form (PNF) and (Skolem) Standard
Form, and convert it to clausal form;
 use the tools and techniques of FOPL, developed in the unit, to solve problems requiring logical
reasoning
 perform the unification and resolution mechanisms.

5.2 SYNTAX OF FIRST ORDER PREDICATE LOGIC


We learned about the concept of propositions in Unit 4 of Block 1. Now it is
time to understand the difference between a Proposition and a Predicate (also known as a propositional
function). In short, a proposition is a specialized statement, whereas a predicate is a generalized statement.
To be more specific, propositions use logical connectives only, while predicates use both logical
connectives and quantifiers (universal and existential).

Note: ∃ is the symbol used for the existential quantifier and ∀ is used for the universal quantifier.

Let’s understand the difference through some more detail, as given below.

A propositional function, or a predicate, in a variable x is a sentence p(x) involving x that becomes a


proposition when we give x a definite value from the set of values it can take. We usually denote such
functions by p(x), q(x), etc. The set of values x can take is called the universe of discourse.

So, if p(x) is ‘x > 5’, then p(x) is not a proposition. But when we give x particular values, say x = 6 or x =
0, then we get propositions. Here, p(6) is a true proposition and p(0) is a false proposition.

Similarly, if q(x) is ‘x has gone to Patna.’, then replacing x by ‘Taj Mahal’ gives us a false proposition.

Note that a predicate is usually not a proposition. But, of course, every proposition is a propositional
function, in the same way that every real number is a real-valued function, namely, the constant function.

Now, can all sentences be written in symbolic form by using only the logical connectives? What about
sentences like ‘x is prime and x + 1 is prime for some x.’? How would you symbolize the phrase ‘for
some x’, which we can rephrase as ‘there exists an x’? You must have come across this term often while
studying mathematics. We use the symbol ‘∃’ to denote this quantifier, ‘there exists’. The way we
use it is, for instance, to rewrite ‘There is at least one child in the class.’ as ‘(∃x in U)p(x)’,

where p(x) is the sentence ‘x is in the class.’ and U is the set of all children.
Now suppose we take the negation of the proposition we have just stated. Wouldn’t it be ‘There is no
child in the class.’? We could symbolize this as ‘for all x in U, q(x)’, where x ranges over all children and
q(x) denotes the sentence ‘x is not in the class.’, i.e., q(x) = ~p(x).

We have a mathematical symbol for the quantifier ‘for all’, which is ‘∀’. So the proposition above
can be written as

‘(∀x ∈ U)q(x)’, or ‘q(x), ∀x ∈ U’.

An example of the use of the existential quantifier is the true statement

(∃x ∈ R) (x + 1 > 0), which is read as ‘There exists an x in R for which x + 1 > 0.’.

Another example is the false statement

(∃x ∈ N) (x − 1/2 = 0), which is read as ‘There exists an x in N for which x − 1/2 = 0.’.

An example of the use of the universal quantifier is (∀x ∉ N) (x² > x), which is read as ‘for every x not
in N, x² > x.’. Of course, this is a false statement, because there is at least one x ∉ N, x ∈ R, for which it is
false.

As you have already read in the example of a child in the class,

(∀x ∈ U)p(x) is logically equivalent to ~(∃x ∈ U)(~p(x)). Therefore,

~(∀x ∈ U)p(x) = ~~(∃x ∈ U)(~p(x)) = (∃x ∈ U)(~p(x)).

This is one of the rules for negation that relate ∀ and ∃. The two rules are

~(∀x ∈ U)p(x) = (∃x ∈ U)(~p(x)), and

~(∃x ∈ U)p(x) = (∀x ∈ U)(~p(x)),

where U is the set of values that x can take.
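On a finite universe of discourse these two negation rules can be checked directly: Python's all() plays the role of ∀ and any() the role of ∃. A small sketch (the universe and predicate chosen here are purely illustrative):

U = range(-3, 4)                 # a small finite universe of discourse
p = lambda x: x + 1 > 0          # an illustrative predicate p(x)

# ~(for all x)p(x) = (exists x)(~p(x))
print((not all(p(x) for x in U)) == any(not p(x) for x in U))   # True

# ~(exists x)p(x) = (for all x)(~p(x))
print((not any(p(x) for x in U)) == all(not p(x) for x in U))   # True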

5.3 INTERPRETATIONS IN FOPL


In order to have a glimpse at how FOPL extends propositional logic, let us again discuss the earlier
argument.

Every man is mortal. Raman is a man.

Hence, he is mortal.

In order to derive the validity of above simple argument, instead of looking at an atomic statement as
indivisible, to begin with, we divide each statement into subject and predicate. The two predicates which
occur in the above argument are:

‘is mortal’ and ‘is man’.

Let us use the notation


IL: is_mortal and

IN: is_man.

In view of the notation, the argument on paraphrasing becomes:


For all x, if IN (x) then IL (x).

IN (Raman).

Hence, IL (Raman)

More generally, relations of the form greater-than (x, y) denoting the phrase ‘x is greater than y’,
is_brother_ of (x, y) denoting ‘x is brother of y,’ Between (x, y, z) denoting the phrase that ‘x lies between
y and z’, and is_tall (x) denoting ‘x is tall’ are some examples of predicates. The variables x, y, z etc
which appear in a predicate are called parameters of the predicate.

The parameters may be given some appropriate values such that after substitution of appropriate value
from all possible values of each of the variables, the predicates become statements, for each of which we
can say whether it is ‘True’ or it is ‘False’.

For example, for the predicate greater-than (x, y), if x is given the value 3 then we obtain greater-than (3, y),
for which it is still not possible to tell whether it is True or False. Hence, ‘greater-than (3, y)’ is also a
predicate. Further, if the variable y is given the value 5, then we get greater-than (3, 5) which, as we know, is
False. Hence, it is possible to give its truth-value, which is False in this case. Thus, from the predicate
greater-than (x, y), we get the statement greater-than (3, 5) by assigning the values 3 to the variable x and 5
to the variable y. These values 3 and 5 are called parametric values or arguments of the predicate greater-
than.

(Please note ‘argument of a function/predicate’ is a mathematical concept, different from logical


argument)

Similarly, we can represent the phrase x likes y by the predicate LIKE (x, y). Then Ram likes Mohan can
be represented by the statement LIKE (RAM, MOHAN).

Also, function symbols can be used in first-order logic. For example, we can use product (x, y) to
denote x × y and father (x) to mean ‘the father of x’. The statement ‘Mohan’s father loves Mohan’ can be
symbolised as LOVE (father (Mohan), Mohan). Thus, we need not know the name of the father of Mohan and
still we can talk about him. A function serves such a role.

We may note that LIKE (Ram, Mohan) and LOVE (father (Mohan), Mohan) are atoms or atomic
statements of PL, in the sense that one can associate a truth-value True or False with each of these, and
each of these does not involve a logical operator like ~, ∧, ∨, → or ↔.

Summarizing the above discussion: LIKE (Ram, Mohan) and LOVE (father (Mohan), Mohan) are
atoms, whereas GREATER, LOVE and LIKE are predicate symbols; x and y are variables; 3, Ram
and Mohan are constants; and father and product are function symbols.

From the above discussion we learned the following concepts of symbols.

i) Individual symbols or constant symbols: These are usually names of objects, such as Ram, Mohan,
numbers like 3, 5 etc.
ii) Variable symbols: These are usually lowercase unsubscripted or subscripted letters, like x, y, z, x3.
iii) Function symbols: These are usually lowercase letters like f, g, h,….or strings of lowercase letters
such as father and product.
iv) Predicate symbols: These are usually uppercase letters like P, Q, R,….or strings of lowercase
letters such as greater-than, is_tall etc.

A function symbol or predicate symbol takes a fixed number of arguments. If a function symbol f takes n
arguments, f is called an n-place function symbol. Similarly, if a predicate symbol P takes m arguments, P
is called an m-place predicate symbol. For example, father is a one-place function symbol, and
GREATER and LIKE are two-place predicate symbols. However, father_of in father_of (x, y) is a two-
place predicate symbol.

The symbolic representation of an argument of a function or a predicate is called a term where a term is
defined recursively as follows:

i) A variable is a term.
ii) A constant is a term.
iii) If f is an n-place function symbol, and t1….tn are terms, then f(t1,….,tn) is a term.
iv) Any term can be generated only by the application of the rules given above.
For example: Since, y and 3 are both terms and plus is a two-place function symbol, plus (y, 3) is a term
according to the above definition.

Furthermore, we can see that plus (plus (y, 3), y) and father (father (Mohan)) are also terms; the former
denotes (y + 3) + y and the latter denotes the grandfather of Mohan.

A predicate can be thought of as a function that maps a list of constant arguments to T or F. For example,
GREATER is a predicate with GREATER (5, 2) as T, but GREATER (1, 3) as F.
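This view of a predicate as a truth-valued function, and of a function symbol as an ordinary function used to build terms, is directly expressible in code; a small illustrative sketch:

# A predicate maps arguments to True/False; a function symbol maps
# arguments to objects of the domain, which may serve as new arguments.
GREATER = lambda x, y: x > y
plus = lambda x, y: x + y          # the two-place function symbol 'plus'

print(GREATER(5, 2))               # True
print(GREATER(1, 3))               # False
print(GREATER(plus(2, 3), 3))      # the term plus(2, 3) as an argument: True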

We already know that in PL, an atom or atomic statement is an indivisible unit for representing and
validating arguments. Atoms in PL are denoted generally by symbols like P, Q, and R etc. But in FOPL,

Definition: An Atom is

(i) either an atom of Propositional Logic, or


(ii) is obtained from an n-place predicate symbol P, and terms t1,….tn so that
P (t1,….,tn) is an atom.

Once the atoms are defined, by using the logical connectives defined in Propositional Logic, and
assuming they have similar meanings in FOPL, we can build complex formulas of FOPL. Two special
symbols ∀ and ∃ are used to denote quantification in FOPL. The symbols ∀ and ∃ are called, respectively,
the universal quantifier and the existential quantifier. For a variable x, (∀x) is read as ‘for all x’, and (∃x) is
read as ‘there exists an x’. Next, we consider some examples to illustrate the concepts discussed above.

In order to symbolize the following statements:

i) There exists a number that is rational.


ii) Every rational number is a real number
iii) For every number x, there exists a number y, which is greater than x.

let us denote x is a rational number by Q(x), x is a real number by R(x), and x is less than y by LESS(x,
y). Then the above statements may be symbolized respectively, as
(i) (x) Q(x)

(ii) (x) (Q(x)  R (x))

(iii) (x) (y) LESS(x, y).

Each of the expressions (i), (ii), and (iii) is called a formula or a well-formed formula or wff.

5.4 SEMANTICS OF QUANTIFIERS

Recall from Section 5.2 the difference between a proposition and a predicate: a predicate p(x) becomes a proposition only when the variable x is given a definite value from its universe of discourse U, and the two quantifiers are related through the negation rules ~(∀x ∈ U)p(x) = (∃x ∈ U)(~p(x)) and ~(∃x ∈ U)p(x) = (∀x ∈ U)(~p(x)).

Next, we discuss three new concepts, viz Scope of occurrence of a quantified variable, Bound occurrence
of a quantifier variable or quantifier and Free occurrence of a variable.

Before discussion of these concepts, we should know the difference between a variable and occurrence
of a variable in a quantifier expression.

The variable x has THREE occurrences in the formula

(∀x) Q(x) ∧ P(x, y).

Also, the variable y has only one occurrence and the variable z has zero occurrence in the above formula.
Next, we define the three concepts mentioned above.

Scope of an occurrence of a quantifier is the smallest but complete formula following the quantifier,
sometimes delimited by a pair of parentheses. For example, Q(x) is the scope of (∀x) in the formula

(∀x) Q(x) ∧ P(x, y).

But the scope of (∀x) in the formula (∀x) (Q(x) ∧ P(x, y)) is (Q(x) ∧ P(x, y)).

Further, in the formula

(∀x) (P(x) → Q(x, y)) ∧ (∀x) (P(x) ∧ R(x, 3)),

the scope of the first occurrence of (∀x) is the formula (P(x) → Q(x, y)) and the scope of the second occurrence
of (∀x) is the formula
(P(x) ∧ R(x, 3)).

As another example, the scope of the only occurrence of the quantifier (∀y) in

(∃x) ((P(x) ∧ Q(x)) → (∀y) (Q(x) → R(y))) is (Q(x) → R(y)). But the scope of the only occurrence of
the existential quantifier (∃x) in the same formula is the formula:

(P(x) ∧ Q(x)) → (∀y) (Q(x) → R(y))

An occurrence of a variable in a formula is bound if and only if the occurrence is within the scope of a
quantifier employing the variable, or is the occurrence in that quantifier. An occurrence of a variable in a
formula is free if and only if this occurrence of the variable is not bound.

Thus, in the formula (∀x) P(x, y) ∧ Q(x), there are three occurrences of x, out of which the first two
occurrences of x are bound, whereas the last occurrence of x is free, because the scope of (∀x) in the above
formula is P(x, y). The only occurrence of y in the formula is free. Thus, x is both a bound and a free
variable in the above formula, and y is only a free variable in the formula. So far, we talked of an
occurrence of a variable as free or bound. Now, we talk of (only) a variable as free or bound. A variable
is free in a formula if at least one occurrence of it is free in the formula. A variable is bound in a formula
if at least one occurrence of it is bound.

It may be noted that a variable can be both free and bound in a formula. In order to further elucidate the
concepts of scope, and free and bound occurrences of a variable, we consider a similar but different formula
for the purpose:

(∀x) (P(x, y) ∧ Q(x)).

In this formula, the scope of the only occurrence of the quantifier (∀x) is the whole of the rest of the formula,
viz. the scope of (∀x) in the given formula is (P(x, y) ∧ Q(x)).

Also, all three occurrences of the variable x are bound. The only occurrence of y is free.

Remarks: It may be noted that a bound variable x is just a place holder or a dummy variable, in the
sense that all occurrences of a bound variable x may be replaced by another variable, say y, which
does not occur in the formula. However, once x is replaced by y, then y becomes bound. For example,
(∀x) (f(x)) is the same as (∀y) f(y). It is something like

∫₁² x² dx = ∫₁² y² dy = 2³/3 − 1³/3 = 7/3.

Replacing a bound variable x by another variable y under the restrictions mentioned above is called
renaming of the variable x.

Having defined an atomic formula of FOPL, next, we consider the definition of a general formula
formally in terms of atoms, logical connectives, and quantifiers.

Definition: A well-formed formula (wff), or just a formula, in FOPL is defined recursively as follows:

i) An atom or atomic formula is a wff.

ii) If E and G are wffs, then each of ~(E), (E ∧ G), (E ∨ G), (E → G), (E ↔ G) is a wff.
iii) If E is a wff and x is a free variable in E, then (∀x)E and (∃x)E are wffs.

iv) A wff can be obtained only by applications of (i), (ii), and (iii) given above.

We may drop pairs of parentheses by agreeing that quantifiers have the least scope. For example,
(∀x) P(x, y) ∧ Q(x) stands for

((∀x) P(x, y)) ∧ Q(x)

We may note the following two cases of translation:

(i) ‘for all x, P(x) is Q(x)’ is translated as

(∀x) (P(x) → Q(x))
(the other possibility, (∀x) (P(x) ∧ Q(x)), is not valid.)

(ii) ‘for some x, P(x) is Q(x)’ is translated as (∃x) (P(x) ∧ Q(x))

(the other possibility, (∃x) (P(x) → Q(x)), is not valid.)

Example
Translate the statement: Every man is mortal. Raman is a man. Therefore, Raman is mortal.

As discussed earlier, let us denote “x is a man” by MAN(x), and “x is mortal” by MORTAL(x). Then
“every man is mortal” can be represented by

(∀x) (MAN(x) → MORTAL(x)),

“Raman is a man” by

MAN(Raman).

The whole argument can now be represented by

(∀x) (MAN(x) → MORTAL(x)) ∧ MAN(Raman) → MORTAL(Raman)

as a single statement.

In order to further explain symbolisation let us recall the axioms of natural numbers:

(1) For every number, there is one and only one immediate successor,

(2) There is no number for which 0 is the immediate successor.

(3) For every number other than 0, there is one and only one immediate predecessor.

Let the immediate successor and predecessor of x, respectively, be denoted by f(x) and g(x).

Let E(x, y) denote ‘x is equal to y’. Then the axioms of natural numbers are represented respectively by the
formulas:

(i) (∀x) (∃y) (E(y, f(x)) ∧ (∀z) (E(z, f(x)) → E(y, z)))
(ii) ~((∃x) E(0, f(x))), and

(iii) (∀x) (~E(x, 0) → (∃y) (E(y, g(x)) ∧ (∀z) (E(z, g(x)) → E(y, z)))).

From the semantics (for meaning or interpretation) point of view, the wff of FOPL may be divided into
two categories, each consisting of

(i) wffs, in each of which, all occurrences of variables are bound.


(ii) wffs, in each of which, at least one occurrence of a variable is free.

The wffs of FOPL in which there is no occurrence of a free variable are like wffs of PL, in the sense that
we can call each of these wffs True, False, consistent, inconsistent, valid, invalid, etc. Each such
formula is called a closed formula. However, when a wff involves a free occurrence, then it is not possible
to call such a wff True, False, etc. Each such formula is
called an open formula.

For example: Each of the formulas greater(x, y), greater(x, 3) and (∃y) greater(x, y) has a free
occurrence of the variable x. Hence, each is an open formula.
Each of the formulas (∀x) (∃y) greater(x, y), (∃y) greater(y, 1) and greater(9, 2) does not have a free
occurrence of any variable. Therefore, each of these formulas is a closed formula.
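Over a fixed finite domain, a closed formula indeed evaluates to a definite truth value, with nested quantifiers becoming nested all()/any() calls; a small sketch using the illustrative domain {0, 1, …, 9}:

D = range(10)                      # an illustrative finite domain
greater = lambda x, y: x > y

# (for all x)(exists y) greater(x, y): False here, since x = 0
# has no smaller y in D
print(all(any(greater(x, y) for y in D) for x in D))   # False

# (exists y) greater(y, 1): True, e.g. y = 2
print(any(greater(y, 1) for y in D))                   # True

print(greater(9, 2))                                   # True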

Next, we discuss some equivalences and inequalities.

The following equivalences hold for any two formulas P(x) and Q(x):

(i) (∀x) P(x) ∧ (∀x) Q(x) = (∀x) (P(x) ∧ Q(x))

(ii) (∃x) P(x) ∨ (∃x) Q(x) = (∃x) (P(x) ∨ Q(x))

But the following inequalities hold, in general:

(iii) (∀x) (P(x) ∨ Q(x)) ≠ (∀x) P(x) ∨ (∀x) Q(x)

(iv) (∃x) (P(x) ∧ Q(x)) ≠ (∃x) P(x) ∧ (∃x) Q(x)

We justify (iii) and (iv) below:

Let P(x): x is an odd natural number,

Q(x): x is an even natural number.

Then the L.H.S. of (iii) above states that every natural number is either odd or even, which is correct. But the
R.H.S. of (iii) states that every natural number is odd or every natural number is even, which is not
correct.

Next, the L.H.S. of (iv) states that there is a natural number which is both even and odd, which is not
correct. However, the R.H.S. of (iv) says that there is a number which is odd and there is a number which is
even, which is correct.
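The justification can be replayed on a finite initial segment of the natural numbers; a brief sketch with P = 'odd' and Q = 'even':

D = range(10)
P = lambda x: x % 2 == 1           # x is odd
Q = lambda x: x % 2 == 0           # x is even

# (iii): L.H.S. True, R.H.S. False, so the two sides differ
print(all(P(x) or Q(x) for x in D))                    # True
print(all(P(x) for x in D) or all(Q(x) for x in D))    # False

# (iv): L.H.S. False, R.H.S. True
print(any(P(x) and Q(x) for x in D))                   # False
print(any(P(x) for x in D) and any(Q(x) for x in D))   # True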

Equivalences involving Negation of Quantifiers

(v) ~(∀x) P(x) = (∃x) ~P(x)

(vi) ~(∃x) P(x) = (∀x) ~P(x)
Examples: For each of the following closed formulas, prove:
(i) (∀x) P(x) ∧ (∃y) ~P(y) is inconsistent.

(ii) (∀x) P(x) → (∃y) P(y) is valid.

Solution: (i) Consider
(∀x) P(x) ∧ (∃y) ~P(y)
= (∀x) P(x) ∧ ~(∀y) P(y) (taking negation out)
But we know that for each bound occurrence a variable is a dummy, and can be replaced in the whole scope of
the variable uniformly by another variable. Hence,

R = (∀x) P(x) ∧ ~(∀x) P(x)

Each conjunct of the formula is either
True or False and, hence, can be thought of as a formula of PL instead of a formula of FOPL. Let us
replace (∀x) P(x) by Q, a formula of PL. Then

R = Q ∧ ~Q = False
Hence, the proof.

(ii) Consider
(∀x) P(x) → (∃y) P(y)
Replacing ‘→’, we get
= ~(∀x) P(x) ∨ (∃y) P(y)
= (∃x) ~P(x) ∨ (∃y) P(y)
= (∃x) ~P(x) ∨ (∃x) P(x) (renaming y as x in the second disjunct)

In other words,
= (∃x) (~P(x) ∨ P(x)) (using equivalence (ii))
The last formula states: there is at least one element, say b, for which ~P(b) ∨ P(b) holds, i.e., for b, either P(b)
is False or P(b) is True.
But, as P is a predicate symbol and b is a constant, ~P(b) ∨ P(b) must be True. Hence, the proof.
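For (ii), validity means truth under every interpretation. On a small finite domain we can enumerate all interpretations of P and confirm the implication never fails; a minimal sketch over an illustrative two-element domain:

from itertools import product

domain = (0, 1)
# Every interpretation of P assigns True/False to each domain element
for values in product([False, True], repeat=len(domain)):
    P = dict(zip(domain, values))
    all_P = all(P[x] for x in domain)     # (for all x) P(x)
    some_P = any(P[y] for y in domain)    # (exists y) P(y)
    assert (not all_P) or some_P          # the implication holds
print('valid in every interpretation over this domain')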

Check Your Progress - 1

Ex. 1 Let P(x) and Q(x) represent “x is a rational number” and “x is a real number,” respectively.
Symbolize the following sentences:

(i) Every rational number is a real number.

(ii) Some real numbers are rational numbers.

(iii) Not every real number is a rational number.

Ex. 2 Let C(x) mean “x is a used-car dealer,” and H(x) mean “x is honest.” Translate each of the
following into English:

(i) (x)C(x)

(ii) (x) H(x)


(iii) (x)C(x)  ~ H (x))

(iv) (x) (C(x)  H(x))

(v) (x) (H(x)  C(x)).

Ex. 3 Prove the following:

(i) P(a) ∧ ~((∀x) P(x)) is consistent.

(ii) (∀x) P(x) ∨ ((∃y) ~P(y)) is valid.

5.5 INFERENCING & ENTAILMENT IN FOPL


In the previous unit, we discussed eight inferencing rules of Propositional Logic (PL) and further
discussed applications of these rules in exhibiting validity/invalidity of arguments in PL. In this section,
the earlier eight rules are extended to include four more rules involving quantifiers for inferencing. Each
of the new rules, is called a Quantifier Rule. The extended set of 12 rules is then used for validating
arguments in First Order Predicate Logic (FOPL).

Before introducing and discussing the Quantifier rules, we briefly discuss why, at all, these rules are
required. For this purpose, let us recall the argument discussed earlier, which Propositional Logic could
not handle:

(i) Every man is mortal.


(ii) Raman is a man.
(iii) Raman is mortal.
The equivalent symbolic form of the argument is given by:

(i′) (∀x) (Man(x) → Mortal(x))

(ii′) Man(Raman)

(iii′) Mortal(Raman)

If, instead of (i′), we were given

(iv) Man(Raman) → Mortal(Raman),

(which is a formula of Propositional Logic also)

then, using Modus Ponens on (ii′) and (iv) in Propositional Logic, we would have obtained (iii′) Mortal(Raman).

However, from (i′) and (ii′) we cannot derive (iii′) in Propositional Logic. This suggests that there should
be mechanisms for dropping and introducing quantifiers appropriately, i.e., in such a manner that the validity
of arguments is not violated. Without discussing the validity-preserving characteristics, we introduce the
four Quantifier rules.

(i) Universal Instantiation Rule (U.I.):

(∀x) P(x)
----------
P(a)

where a is an arbitrary constant.

The rule states that if (∀x) P(x) is True, then we can assume P(a) to be True for any constant a (where a constant
a is like Raman). It can be easily seen that the rule associates a formula P(a) of Propositional Logic to a
formula (∀x) P(x) of FOPL. The significance of the rule lies in the fact that once we obtain a formula like
P(a), then the reasoning process of Propositional Logic may be used. The rule may be used whenever
its application seems to be appropriate.

(ii) Universal Generalisation Rule (U.G.)

P(a), for all a
----------
(∀x) P(x)

The rule says that if it is known that for all constants a the statement P(a) is True, then we can, instead,
use the formula (∀x) P(x).

The rule associates with a set of formulas P(a), for all a, of Propositional Logic a formula (∀x) P(x) of
FOPL.

Before using the rule, we must ensure that P(a) is True for all a;
otherwise it may lead to wrong conclusions.

(iii) Existential Instantiation Rule (E.I.)

(∃x) P(x)
---------- (E.I.)
P(a)

The rule says that if the Truth of (∃x) P(x) is known, then we can assume the Truth of P(a) for some fixed a.
The rule, again, associates a formula P(a) of Propositional Logic to a formula (∃x) P(x) of FOPL.

An inappropriate application of this rule may lead to wrong conclusions. The source of possible errors
lies in the fact that the choice of ‘a’ in the rule is not arbitrary and cannot be known at the time of deducing
P(a) from (∃x) P(x).

If during the process of deduction some other (∃y) Q(y) or (∃x) R(x), or even another (∃x) P(x), is
encountered, then each time a new constant, say b, c, etc., should be chosen to infer Q(b) from (∃y) Q(y),
or R(c) from (∃x) R(x), or P(d) from (∃x) P(x).

(iv) Existential Generalisation Rule (E.G.)

P(a)
---------- (E.G.)
(∃x) P(x)

The rule states that if P(a), a formula of Propositional Logic, is True, then the Truth of (∃x) P(x), a
formula of FOPL, may be assumed to be True.

The Universal Generalisation (U.G.) and Existential Instantiation (E.I.) rules should be applied with
utmost care; the other two rules may be applied whenever it appears to be appropriate.
Next, the purpose of the two rules, viz.

(i) Universal Instantiation Rule (U.I.) and

(iii) Existential Instantiation Rule (E.I.),

is to associate formulas of Propositional Logic (PL) to formulas of FOPL in such a manner that the validity
of arguments due to these associations is not disturbed. Once we get formulas of PL, then any of the eight
rules of inference of PL may be used to validate conclusions and solve problems requiring logical
reasoning for their solutions.

The purpose of the other two quantification rules, viz. the generalisation rules

(ii) P(a), for all a / (∀x) P(x) and

(iv) P(a) / (∃x) P(x),

is that the conclusion to be drawn in FOPL is generally not a formula of PL but a formula of FOPL.
While making inferences, we may first associate formulas of PL with formulas of FOPL and
then use inference rules of PL to conclude formulas in PL. But the conclusion to be made in the problem
may correspond to a formula of FOPL. These two generalisation rules help us in associating formulas of
FOPL with formulas of PL.
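On a finite universe the four rules are easy to visualise: U.I. reads off one instance from an all(), and E.G. wraps one true instance into an any(). A toy sketch (names and data are purely illustrative):

U = ['Raman', 'Sita', 'Mohan']                 # a toy universe
is_man = {'Raman': True, 'Sita': False, 'Mohan': True}
is_mortal = {x: True for x in U}               # everyone is mortal here

# U.I.: from (for all x) Mortal(x), conclude Mortal(a) for any a
assert all(is_mortal[x] for x in U)
assert is_mortal['Raman']

# E.G.: from Man('Raman'), conclude (exists x) Man(x)
assert is_man['Raman']
assert any(is_man[x] for x in U)
print('U.I. and E.G. verified on this universe')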

Example: Tell, supported with reasons, which of the following are correct inferences and which are not.

(i) To conclude F(a) ∧ G(a) → H(a) ∧ I(a)

from (∀x) (F(x) ∧ G(x)) → (H(x) ∧ I(x))

using Universal Instantiation (U.I.).

The above inference or conclusion is incorrect in view of the fact that the scope of the universal
quantification is only the formula F(x) ∧ G(x), and not the whole of the formula.

The occurrences of x in H(x) ∧ I(x) are free occurrences. Thus, one of the correct inferences would
have been:

F(a) ∧ G(a) → H(x) ∧ I(x)

(ii) To conclude F(a) ∧ G(a) → H(a) ∧ I(a) from

(∀x) (F(x) ∧ G(x) → H(x) ∧ I(x)) using U.I.

The conclusion is correct, in view of the argument given in (i) above.

(iii) To conclude ~F(a), for an arbitrary a, from ~(∀x) F(x) using U.I.
The conclusion is incorrect, because actually

~(∀x) F(x) = (∃x) ~F(x)

Thus, the inference is not a case of U.I., but of Existential Instantiation (E.I.).

Further, as per the restrictions, we cannot say for which a, ~F(a) is True. Of course, ~F(x) is True for
some constant, but not necessarily for a pre-assigned constant a.

(iv) To conclude (F(b) ∧ G(b)) ∧ H(c)

from (∃x) ((F(b) ∧ G(x)) ∧ H(c))

using E.I. is not correct.

The reason is that the constant to be substituted for x cannot be assumed to be the same constant b,
given in advance as an argument of F. However,

to conclude (F(b) ∧ G(a)) ∧ H(c)

from (∃x) ((F(b) ∧ G(x)) ∧ H(c)) is correct.

Steps for using Predicate Calculus as a Language for Representing Knowledge and for Reasoning:

Step 1: Conceptualisation: First of all, all the relevant entities and the relations that exist between these
entities are explicitly enumerated. Some of the implicit facts, like ‘a person dead once is dead forever’,
have to be made explicit.

Step 2: Nomenclature and Translation: Giving appropriate names to objects and relations, and then
translating the given sentences in English to formulas in FOPL. Appropriate names are essential in
order to guide a reasoning system based on FOPL. It is well established that no reasoning system is
complete. In other words, a reasoning system may need help in arriving at a desired conclusion.

Step 3: Finding an appropriate sequence of reasoning steps, involving selection of an appropriate rule and
the appropriate FOPL formulas to which the selected rule is to be applied, to reach the conclusion.

Next, we consider applications of the 12 inference rules (8 of Propositional Logic and 4 involving quantifiers).

Example: Symbolize the following and then construct a proof for the argument:

(i) Anyone who repairs his own car is highly skilled and saves a lot of money on repairs
(ii) Some people who repair their own cars have menial jobs. Therefore,
(iii) Some people with menial jobs are highly skilled.

Solution: Let us use the notation:

P(x) : x is a person
S(x) : x saves money on repairs

M(x) : x has a menial job

R(x) : x repairs his own car

H(x) : x is highly skilled.

Therefore, (i), (ii) and (iii) can be symbolized as:

(i) (∀x) (R(x) → (H(x) ∧ S(x)))

(ii) (∃x) (R(x) ∧ M(x))
(iii) (∃x) (M(x) ∧ H(x)) (to be concluded)

From (ii), using Existential Instantiation (E.I.), we get, for some fixed a,

(iv) R(a) ∧ M(a)

Then, by the simplification rule of Propositional Logic, we get

(v) R(a)
From (i), using Universal Instantiation (U.I.), we get

(vi) R(a) → H(a) ∧ S(a)

Using Modus Ponens w.r.t. (v) and (vi), we get

(vii) H(a) ∧ S(a)
By simplification of (vii), we get

(viii) H(a)
By simplification of (iv), we get

(ix) M(a)
By conjunction of (viii) and (ix), we get

M(a) ∧ H(a)

By Existential Generalisation, we get

(∃x) (M(x) ∧ H(x))

Hence, (iii) is concluded.

Example:

(i) Some juveniles who commit minor offences are thrown into prison, and any juvenile thrown into
prison is exposed to all sorts of hardened criminals.
(ii) A juvenile who is exposed to all sorts of hardened criminals will become bitter and learn more
techniques for committing crimes.
(iii) Any individual who learns more techniques for committing crimes is a menace to society, if he is
bitter.
(iv) Therefore, some juveniles who commit minor offences will be menaces to the society.
Solution: Let us symbolize the statements in the given argument as follows:

(i) J(x) : x is juvenile.

(ii) C(x) : x commits minor offences.

(iii) P(x) : x is thrown into prison.

(iv) E(x) : x is exposed to hardened criminals.

(v) B(x) : x becomes bitter.

(vi) T(x) : x learns more techniques for committing crimes.

(vii) M(x) : x is a menace to society.

The statements of the argument may be translated as:

(i) (∃x) (J(x) ∧ C(x) ∧ P(x)) ∧ (∀y) (J(y) ∧ P(y) → E(y))

(ii) (∀x) (J(x) ∧ E(x) → B(x) ∧ T(x))
(iii) (∀x) (T(x) → (B(x) → M(x)))
Therefore,

(iv) (∃x) (J(x) ∧ C(x) ∧ M(x))

By simplification, (i) becomes

(v) (∃x) (J(x) ∧ C(x) ∧ P(x)) and

(vi) (∀y) (J(y) ∧ P(y) → E(y))

From (v), through Existential Instantiation, for some fixed b, we get

(vii) J(b) ∧ C(b) ∧ P(b)

Through simplification, (vii) becomes

(viii) J(b)
(ix) C(b) and
(x) P(b)
Using Universal Instantiation on (vi), we get

(xi) J(b) ∧ P(b) → E(b)

Using Modus Ponens on (vii) and (xi), we get

(xii) E(b)
Using conjunction for (viii) and (xii), we get

(xiii) J(b) ∧ E(b)

Using Universal Instantiation on (ii), we get

(xiv) J(b) ∧ E(b) → B(b) ∧ T(b)

Using Modus Ponens for (xiii) and (xiv), we get
(xv) B(b) ∧ T(b)
Using Universal Instantiation for (iii), we get

(xvi) T(b) → (B(b) → M(b))

Using Modus Ponens (twice, after simplification of (xv)) with (xv) and (xvi), we get

(xvii) M(b)
Using conjunction for (viii), (ix) and (xvii), we get

(xviii) J(b) ∧ C(b) ∧ M(b)

From (xviii), through Existential Generalisation, we get the required (iv), i.e.,

(∃x) (J(x) ∧ C(x) ∧ M(x))

Remark: It may be noted that the occurrence of quantifiers is not, in general, commutative, i.e.,

(Q₁x) (Q₂y) ≠ (Q₂y) (Q₁x)

For example,

(∀x) (∃y) F(x, y) ≠ (∃y) (∀x) F(x, y) … (A)

The occurrence of (∃y) on the L.H.S. depends on x, i.e., the occurrence of y on the L.H.S. is a function of x. However,
the occurrence of (∃y) on the R.H.S. is independent of x; hence, the occurrence of y on the R.H.S. is not a function of
x.

For example, if we take F(x, y) to mean:

y and x are numbers such that y > x,

then the L.H.S. of (A) above states: For each x there is a y such that y > x.

This statement is true in the domain of real numbers.

On the other hand, the R.H.S. of (A) above states that: There is a number y which is greater than x, for all x.

This statement is not true in the domain of real numbers.

When logical statements are interconnected in such a manner that one is a consequence of the others, such
logical consequence (also called entailment) is the fundamental concept of logical reasoning: it
describes the relationship between statements that holds when one statement logically follows
from one or more other statements.
A valid logical argument is one in which the conclusion is entailed by the premises, because the
conclusion is a consequence of the premises. The philosophical analysis of logical consequence
involves the questions: In what sense does a conclusion follow from its premises? And what does it mean
for a conclusion to be a consequence of premises? All of philosophical logic is meant to provide accounts
of the nature of logical consequence and the nature of logical truth.
Logical consequence is necessary and formal, and is explicated by way of formal
proof and models of interpretation. A sentence is said to be a logical consequence of a set of sentences,
for a given language, if and only if, using only logic (i.e., without regard to any personal interpretations of
the sentences), the sentence must be true if every sentence in the set is true.
5.6 CONVERSION TO CLAUSAL FORM
In order to facilitate problem solving through Propositional Logic, we discussed two normal forms, viz. the conjunctive normal form (CNF) and the disjunctive normal form (DNF). In FOPL, there is a normal form called the prenex normal form. Further, a statement in Prenex Normal Form is required to be skolemised to get the clausal form, which can be used for the purpose of Resolution.

So, the first step towards the clausal form is to obtain the Prenex Normal Form (PNF), and the second step is skolemisation, which will be discussed after PNF.

Prenex Normal Form (PNF): In a broad sense, this relates to re-alignment of the quantifiers, i.e., bringing all the quantifiers to the beginning of the expression. Thereafter, the replacement of existentially quantified variables by constants and functions is performed during skolemisation, to bring the statement into clausal form.

The use of a prenex normal form of a formula simplifies the proof procedures to be discussed.

Definition: A formula G in FOPL is said to be in a prenex normal form if and only if the formula G is of the form

(Q1 x1) … (Qn xn) P

where each (Qi xi), for i = 1, …, n, is either (∀xi) or (∃xi), and P is a quantifier-free formula. The expression (Q1 x1) … (Qn xn) is called the prefix and P is called the matrix of the formula G.

Examples of some formulas in prenex normal form:

(i) (∀x) (∀y) (R(x, y) ∧ Q(y)), (∀x) (∃y) (~ P(x, y) ∨ S(y)),

(ii) (∀x) (∀y) (∃z) (P(x, y) → R(z)).

Next, we consider a method of transforming a given formula into a prenex normal form. For this, we first discuss the equivalence of formulas in FOPL. Let us recall that two formulas E and G are equivalent, denoted by E = G, if and only if the truth values of E and G are identical under every interpretation. The pairs of equivalent formulas given in the Table of Equivalent Formulas of the previous unit are still valid, as these are quantifier-free formulas of FOPL. However, there are pairs of equivalent formulas of FOPL that contain quantifiers. Next, we discuss these additional pairs of equivalent formulas. We introduce some notation specific to FOPL: the symbol G denotes a formula that does not contain any free occurrence of the variable x. Then we have the following pairs of equivalent formulas, where Q denotes a quantifier which is either ∀ or ∃. Next, we introduce the following laws for pairs of equivalent formulas.

In the rest of the discussion of FOPL, P[x] is used to denote the fact that x is a free variable in the formula P; for example, P[x] = (∃y) P(x, y). Similarly, R[x, y] denotes that the variables x and y occur as free variables in the formula R. Some of these equivalences we have discussed earlier.

Then, the following laws involving quantifiers hold good in FOPL:

(i) (Qx) P[x] ∨ G = (Qx) (P[x] ∨ G).

(ii) (Qx) P[x] ∧ G = (Qx) (P[x] ∧ G).

In the above two formulas, Q may be either ∀ or ∃.

(iii) ~ ((∀x) P[x]) = (∃x) (~ P[x]).

(iv) ~ ((∃x) P[x]) = (∀x) (~ P[x]).

(v) (∀x) P[x] ∧ (∀x) H[x] = (∀x) (P[x] ∧ H[x]).

(vi) (∃x) P[x] ∨ (∃x) H[x] = (∃x) (P[x] ∨ H[x]).

That is, the universal quantifier ∀ and the existential quantifier ∃ can be distributed respectively over ∧ and ∨.

But we must be careful about (we have already mentioned these inequalities):

(vii) (∀x) P[x] ∨ (∀x) H[x] ≠ (∀x) (P[x] ∨ H[x]) and
(viii) (∃x) P[x] ∧ (∃x) H[x] ≠ (∃x) (P[x] ∧ H[x])

Steps for Transforming an FOPL Formula into Prenex Normal Form

Step 1: Remove the connectives '↔' and '→' using the equivalences

P ↔ G = (P → G) ∧ (G → P)

P → G = ~ P ∨ G

Step 2: Use the equivalence ~ (~ P) = P to remove even numbers of ~'s.

Step 3: Apply De Morgan's laws in order to bring the negation signs immediately before atoms:

~ (P ∨ G) = ~ P ∧ ~ G
~ (P ∧ G) = ~ P ∨ ~ G

and the quantification laws:

~ ((∀x) P[x]) = (∃x) (~ P[x])
~ ((∃x) P[x]) = (∀x) (~ P[x])

Step 4: Rename bound variables if necessary.

Step 5: Bring the quantifiers to the left, before any predicate symbol appears in the formula. This is achieved by using the equivalences (i) to (vi) discussed above.

We have already discussed that, if all occurrences of a bound variable are replaced uniformly throughout by another variable not occurring in the formula, then the equivalence is preserved. Also, we mentioned under (vii) that ∀ does not distribute over ∨, and under (viii) that ∃ does not distribute over ∧. In such cases, in order to bring the quantifiers to the left of the rest of the formula, we may have to first rename one of the bound variables: say x may be renamed as z, where z is a variable that occurs neither free nor bound in the other component formula. Then we may use the following equivalences:

(Q1 x) P[x] ∨ (Q2 x) H[x] = (Q1 x) (Q2 z) (P[x] ∨ H[z])

(Q3 x) P[x] ∧ (Q4 x) H[x] = (Q3 x) (Q4 z) (P[x] ∧ H[z])
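
Steps 1 to 3 above are purely mechanical and can be programmed directly. The following Python sketch (an illustration of ours, not part of the original text; the tuple representation of formulas is our own assumption) removes '→' and pushes '~' inward using the double-negation law, De Morgan's laws and the quantification laws; elimination of '↔' is omitted for brevity:

# A formula is an atom (any string) or a tuple:
#   ('not', P), ('and', P, Q), ('or', P, Q), ('imp', P, Q),
#   ('forall', x, P), ('exists', x, P).
def nnf(f):
    """Apply Steps 1-3: remove '->' and push '~' inward, so that
    negation ends up applied only to atoms."""
    if isinstance(f, str):                       # an atom is already done
        return f
    op = f[0]
    if op == 'imp':                              # P -> Q  =  ~P v Q
        return nnf(('or', ('not', f[1]), f[2]))
    if op in ('and', 'or'):
        return (op, nnf(f[1]), nnf(f[2]))
    if op in ('forall', 'exists'):
        return (op, f[1], nnf(f[2]))
    g = f[1]                                     # here op == 'not'
    if isinstance(g, str):
        return f                                 # ~atom: nothing to do
    if g[0] == 'not':                            # ~(~P) = P         (Step 2)
        return nnf(g[1])
    if g[0] == 'and':                            # De Morgan         (Step 3)
        return ('or', nnf(('not', g[1])), nnf(('not', g[2])))
    if g[0] == 'or':
        return ('and', nnf(('not', g[1])), nnf(('not', g[2])))
    if g[0] == 'forall':                         # ~(Ax)P = (Ex)(~P)
        return ('exists', g[1], nnf(('not', g[2])))
    if g[0] == 'exists':                         # ~(Ex)P = (Ax)(~P)
        return ('forall', g[1], nnf(('not', g[2])))
    return nnf(('not', nnf(g)))                  # ~(P -> Q): rewrite P -> Q first

# ~(Ey)(Q(x,y) -> S(x)) becomes (Ay)(Q(x,y) and ~S(x)):
print(nnf(('not', ('exists', 'y', ('imp', 'Qxy', 'Sx')))))
# -> ('forall', 'y', ('and', 'Qxy', ('not', 'Sx')))

After this, Step 5 (pulling the quantifiers to the left) completes the conversion to prenex normal form.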


Example: Transform the following formulas into prenex normal forms:

(i) (∀x) (Q(x) → (∃x) R(x, y))

(ii) (∀x) (~ (∃y) Q(x, y) → ((∃z) R(z) → S(x)))

(iii) (∀x) (∀y) ((∃z) Q(x, y, z) ∧ ((∃u) R(x, u) → (∀v) R(y, v))).

Part (i)

Step 1: By removing '→', we get

(∀x) (~ Q(x) ∨ (∃x) R(x, y))

Step 2: By renaming x as z in (∃x) R(x, y), the formula becomes

(∀x) (~ Q(x) ∨ (∃z) R(z, y))

Step 3: As ~ Q(x) does not involve z, we get

(∀x) (∃z) (~ Q(x) ∨ R(z, y))

Part (ii)

(∀x) (~ (∃y) Q(x, y) → ((∃z) R(z) → S(x)))

Step 1: Removing the outer '→' we get

(∀x) (~ (~ ((∃y) Q(x, y))) ∨ ((∃z) R(z) → S(x)))

Step 2: Removing the inner '→', and simplifying ~ (~ ( )), we get

(∀x) ((∃y) Q(x, y) ∨ (~ ((∃z) R(z)) ∨ S(x)))

Step 3: Taking '~' innermost, we get

(∀x) ((∃y) Q(x, y) ∨ ((∀z) ~ R(z) ∨ S(x)))

The first component formula Q(x, y) does not involve z, S(x) involves neither y nor z, and ~ R(z) does not involve y. Therefore, we may take out (∃y) and (∀z), so that we get

(∀x) (∃y) (∀z) (Q(x, y) ∨ ~ R(z) ∨ S(x)), which is the required formula in prenex normal form.

Part (iii)

(∀x) (∀y) ((∃z) Q(x, y, z) ∧ ((∃u) R(x, u) → (∀v) R(y, v)))

Step 1: Removing '→', we get

(∀x) (∀y) ((∃z) Q(x, y, z) ∧ (~ ((∃u) R(x, u)) ∨ (∀v) R(y, v)))

Step 2: Taking '~' innermost, we get

(∀x) (∀y) ((∃z) Q(x, y, z) ∧ ((∀u) ~ R(x, u) ∨ (∀v) R(y, v)))

Step 3: As the variables z, u and v do not occur in the rest of the formula outside their respective scopes, we can take all the quantifiers outside, preserving the order of their occurrence. Thus we get

(∀x) (∀y) (∃z) (∀u) (∀v) (Q(x, y, z) ∧ (~ R(x, u) ∨ R(y, v)))

Skolemisation: A further refinement of the Prenex Normal Form (PNF), called the (Skolem) Standard Form, is the basis of problem solving through the Resolution Method. The Resolution Method will be discussed next.

The Standard Form of a formula of FOPL is obtained through the following three steps:

(1) The given formula should be converted to Prenex Normal Form (PNF), and then

(2) the matrix of the PNF, i.e., the quantifier-free part of the PNF, is converted into conjunctive normal form (CNF), and then

(3) Skolemisation: the existential quantifiers are eliminated using Skolem constants and functions.

Before illustrating, through examples, the process of conversion of a formula of FOPL to the Standard Form, we briefly discuss Skolem functions.

Skolem Function

We mentioned earlier that, in general, (∃x) (∀y) P(x, y) ≠ (∀y) (∃x) P(x, y) …(1)

For example, if P(x, y) stands for the relation 'x > y' in the set of integers, then the L.H.S. of the inequality (1) above states: some (fixed) integer x is greater than all integers y. This statement is False.

On the other hand, the R.H.S. of the inequality (1) states: for each integer y, there is an integer x such that x > y. This statement is True.

The difference in meaning of the two sides of the inequality arises from the fact that on the L.H.S. x in (∃x) is independent of y in (∀y), whereas on the R.H.S. x is dependent on y. In other words, x on the L.H.S. of the inequality can be replaced by some constant, say 'c', whereas on the right hand side x is some function, say f(y), of y.

Therefore, the two sides of the inequality (1) above may be written as

L.H.S. of (1) = (∃x) (∀y) P(x, y) = (∀y) P(c, y),

dropping (∃x) because there is no x appearing in (∀y) P(c, y).

R.H.S. of (1) = (∀y) (∃x) P(x, y) = (∀y) P(f(y), y)

The above argument, in essence, explains what is meant by each of the terms Skolem constant, Skolem function and skolemisation.

The constants and functions which replace existentially quantified variables are respectively called Skolem constants and Skolem functions. The process of replacing all existential variables by Skolem constants and functions is called skolemisation.

The form of a formula which is obtained after applying the steps of

(i) reduction to PNF, and then

(ii) reduction of the matrix to CNF, and then

(iii) skolemisation

is called the Skolem Standard Form, or just the Standard Form.

We explain through examples the skolemisation process, after the PNF and CNF have already been obtained.

Example: Skolemise the following:

(i) (∃x1) (∃x2) (∀y1) (∀y2) (∃x3) (∀y3) P(x1, x2, x3, y1, y2, y3)

(ii) (∃x1) (∀y1) (∃x2) (∀y2) (∃x3) P(x1, x2, x3, y1, y2) ∧ (∃x1) (∀y3) (∃x2) (∀y4) Q(x1, x2, y3, y4)

Solution (i): As the existential quantifiers (∃x1) and (∃x2) precede all universal quantifiers, x1 and x2 are to be replaced by constants, but by distinct constants, say by 'c' and 'd' respectively. As the existential variable x3 is preceded by the universal quantifiers (∀y1) and (∀y2), x3 is replaced by some function f(y1, y2) of the variables y1 and y2. After making these substitutions and dropping the existential quantifiers, we get the skolemised form of the given formula as

(∀y1) (∀y2) (∀y3) P(c, d, f(y1, y2), y1, y2, y3).

Solution (ii): As a first step, we must bring all the quantifications to the beginning of the formula through Prenex Normal Form reduction. Also, since

(∃x) … P(x, …) ∧ (∃x) … Q(x, …) ≠ (∃x) (… P(x, …) ∧ … Q(x, …)),

we rename the second occurrences of the quantified variables x1 and x2 as x5 and x6. Hence, after renaming and pulling all the quantifications to the left, we get

(∃x1) (∀y1) (∃x2) (∀y2) (∃x3) (∃x5) (∀y3) (∃x6) (∀y4) (P(x1, x2, x3, y1, y2) ∧ Q(x5, x6, y3, y4))

Now, the existential variable x1 is not preceded by any universal quantifier. Hence, x1 may be replaced by a constant, say 'c'. Next, x2 is preceded by the universal quantifier (∀y1); hence, x2 may be replaced by f(y1). The existential variable x3 is preceded by the universal quantifiers (∀y1) and (∀y2); hence, x3 may be replaced by g(y1, y2). The existential variable x5 is also preceded by the universal quantifiers (∀y1) and (∀y2); in other words, x5 is also a function of y1 and y2, but we have to use a different function symbol, say h, and replace x5 by h(y1, y2). Similarly, x6 is preceded by (∀y1), (∀y2) and (∀y3) and may be replaced by j(y1, y2, y3).

Thus, the (Skolem) Standard Form becomes

(∀y1) (∀y2) (∀y3) (∀y4) (P(c, f(y1), g(y1, y2), y1, y2) ∧ Q(h(y1, y2), j(y1, y2, y3), y3, y4)).

Check Your Progress - 2

Ex. 4: (i) Transform the formula (∀x) P(x) → (∃x) Q(x) into prenex normal form.

(ii) Obtain a prenex normal form for the formula

(∀x) (∀y) ((∃z) (P(x, z) ∧ P(y, z)) → (∃u) Q(x, y, u))

Ex. 5: Obtain a (Skolem) standard form for each of the following formulas:

(i) (∃x) (∀y) (∀v) (∃z) (∀w) (∃u) P(x, y, z, u, v, w)

(ii) (∀x) (∃y) (∃z) ((~ P(x, y) ∧ Q(x, z)) ∨ R(x, y, z))
5.7 RESOLUTION & UNIFICATION
In the beginning of the previous section, we mentioned that the resolution method for FOPL requires discussion of a number of complex new concepts. Also, we discussed the (Skolem) Standard Form and how to obtain the Standard Form for a given formula of FOPL. In this section, along with Resolution, we introduce two new, and again complex, concepts, viz., substitution and unification.

The complexity of the resolution method for FOPL mainly results from the fact that a clause in FOPL is generally of the form P(x) ∨ Q(f(x), x, y) ∨ …, in which the variables x, y, z may assume any one of the values of their domain.

Thus, the atomic formula (∀x) P(x), which after dropping of the universal quantifier is written as just P(x), stands for P(a1) ∧ P(a2) ∧ … ∧ P(an), where the set {a1, a2, …, an} is assumed here to be the domain of x.

Similarly, (∃x) P(x) stands for P(a1) ∨ P(a2) ∨ … ∨ P(an).

However, in order to resolve two clauses – one containing, say, P(x) and the other containing ~ P(y), where x and y are universally quantified variables, possibly with some restrictions – we have to know which values of x and y satisfy both the clauses. For this purpose we need the concepts of substitution and unification, as defined and discussed in the rest of the section.

Instead of giving formal definitions of substitution, unification, unifier, most general unifier, resolvent and resolution of clauses in FOPL, we illustrate the concepts through examples, with minimal definitions where required.

Example: Let us consider our old problem:

To conclude

(i) Raman is mortal

From the following two statements:

(ii) Every man is mortal and

(iii) Raman is a man

Using the notations

MAN (x) : x is a man

MORTAL (x) : x is mortal,

the problem can be formulated in symbolic logic as: Conclude

MORTAL(Raman)

from

(ii) (∀x) (MAN(x) → MORTAL(x))

(iii) MAN(Raman).

As resolution is a refutation method, assume

(iv) ~ MORTAL(Raman)

After skolemisation and dropping (∀x), (ii) in standard form becomes

(v) ~ MAN(x) ∨ MORTAL(x)

In the above, x varies over the set of human beings, including Raman. Hence, one special instance of (v) is

(vi) ~ MAN(Raman) ∨ MORTAL(Raman)

At this stage, we may observe that:

(a) MAN(Raman) and MORTAL(Raman) do not contain any variables and, hence, their truth or falsity can be determined directly; each thus behaves like a formula of PL. In general, a term or formula which does not contain any variable is called a ground term or ground formula.

(b) Treating these as formulas of PL and using the resolution method on (iii) and (vi), we conclude

(vii) MORTAL(Raman)

Resolving (iv) and (vii), we get False. Hence, the solution.
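
Because the clauses involved at this stage are ground, the resolution step itself reduces to simple manipulation of sets of literals. The following Python sketch (our own illustration, not part of the original text; literals are strings and clauses are sets) reproduces the refutation above:

def resolve(c1, c2):
    """Resolve two ground clauses (sets of literal strings; '~' marks
    negation).  Returns the resolvent, or None when the clauses
    contain no pair of complementary literals."""
    for lit in c1:
        comp = lit[1:] if lit.startswith('~') else '~' + lit
        if comp in c2:
            return (c1 - {lit}) | (c2 - {comp})
    return None

step1 = resolve({'~MAN(Raman)', 'MORTAL(Raman)'},   # clause (vi)
                {'MAN(Raman)'})                     # clause (iii)
print(step1)                                        # {'MORTAL(Raman)'}
print(resolve(step1, {'~MORTAL(Raman)'}))           # set(): the empty clause, i.e. False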

Unification: In the process of solving the problem discussed above, we tried to make the two expressions MAN(x) and MAN(Raman) identical. The attempt to make two or more expressions identical is called unification.

In order to make MAN(x) and MAN(Raman) identical, we used the fact that Raman is one of the possible values of x and, hence, we replaced x by one of its possible values: Raman.

This replacement of a variable like x by a term (which may be another variable also) which is one of the possible values of x, is called substitution. The substitution, in this case, is denoted formally as {Raman/x}.

A substitution, in general, is notationally of the form {t1/x1, t2/x2, …, tm/xm}, where x1, x2, …, xm are variables and t1, t2, …, tm are terms, and ti replaces the variable xi in some expression.
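
As a small illustration (a sketch of ours, with terms represented as nested tuples and variables written with a leading '?'), applying a substitution to an expression can be programmed as follows:

def substitute(expr, subst):
    """Apply a substitution {variable: term} to an expression.  An
    expression is a variable ('?x'), a constant ('Raman'), or a
    compound (functor, arg1, ..., argn) represented as a tuple."""
    if isinstance(expr, tuple):      # compound: substitute in the arguments
        return (expr[0],) + tuple(substitute(a, subst) for a in expr[1:])
    return subst.get(expr, expr)     # variable (if bound) or constant

# The substitution {Raman/x} applied to MAN(x):
print(substitute(('MAN', '?x'), {'?x': 'Raman'}))    # -> ('MAN', 'Raman')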

Example: (i) Assume Lord Krishna is loved by everyone who loves someone. (ii) Also assume that no one loves nobody. Deduce that Lord Krishna is loved by everyone.

Solution: Let us use the symbols

Love(x, y) : x loves y (or y is loved by x)

LK : Lord Krishna

Then the given problem is formalised as:

(i) (∀x) ((∃y) Love(x, y) → Love(x, LK))

(ii) ~ (∃x) ((∀y) ~ Love(x, y))

To show: (∀x) (Love(x, LK))

As resolution is a refutation method, assume the negation of the last statement as an axiom:

(iii) ~ (∀x) Love(x, LK)

The formula in (i) above is reduced to standard form as follows:

(∀x) (~ (∃y) Love(x, y) ∨ Love(x, LK))

= (∀x) ((∀y) ~ Love(x, y) ∨ Love(x, LK))

= (∀x) (∀y) (~ Love(x, y) ∨ Love(x, LK))

(as y does not occur in Love(x, LK))

After dropping the universal quantifications, we get

(iv) ~ Love(x, y) ∨ Love(x, LK)

Formula (ii) can be reduced to standard form as follows:

(ii) = (∀x) (∃y) Love(x, y)

Through skolemisation, y is replaced by f(x), so that we get

(∀x) Love(x, f(x))

Dropping the universal quantification, we get

(v) Love(x, f(x))

The formula in (iii) can be brought into standard form as follows:

(iii) = (∃x) (~ Love(x, LK))

As the existential quantifier (∃x) is not preceded by any universal quantification, x may be substituted by a constant a, i.e., we use the substitution {a/x} in (iii), to get the standard form

(vi) ~ Love(a, LK).

Thus, to solve the problem, we have the following standard form formulas for resolution:

(iv) ~ Love(x, y) ∨ Love(x, LK)

(v) Love(x, f(x))

(vi) ~ Love(a, LK).

Two possibilities of resolution exist, for two pairs of formulas:

First possibility: resolving (v) and (vi).

Second possibility: resolving (iv) and (vi).

These possibilities exist because, in each pair, the predicate Love occurs in complemented form.

Next, we attempt to resolve (v) and (vi).

For this purpose we attempt to make the two formulas Love(x, f(x)) and Love(a, LK) identical, through unification involving substitutions. We start from the left, matching the two formulas term by term. The first place where matching may fail is where 'x' occurs in one formula and 'a' occurs in the other. As one of these happens to be a variable, the substitution {a/x} can be used to unify the portions matched so far.

The next possible disagreement through term-by-term matching arises at the two disagreeing terms f(x) and LK. As neither f(x) nor LK is a variable (note that f(x) involves a variable but is itself not a variable), no unification, and hence no resolution, of (v) and (vi) is possible.

Next, we attempt unification of Love(a, LK) of (vi) with Love(x, LK) of (iv).

The first term-by-term possible disagreement occurs where the corresponding terms are 'a' and 'x' respectively. As one of these is a variable, the substitution {a/x} unifies the parts of the formulas matched so far. Next, the two occurrences of LK, one in each of the two formulas, match. Hence, the whole of each of the two formulas can be unified through the substitution {a/x}. Though the unification has been attempted on corresponding smaller parts, the substitution has to be carried out in the whole of the formula, in this case in the whole of (iv). Thus, after substitution, (iv) becomes

(viii) ~ Love(a, y) ∨ Love(a, LK)

Resolving (viii) with (vi) we get

(ix) ~ Love(a, y)

In order to resolve (v) and (ix), we attempt to unify Love(x, f(x)) of (v) with Love(a, y) of (ix).

The term-by-term matching leads to a possible disagreement of a of (ix) with x of (v). As one of these is a variable, the substitution {a/x} will unify the portions considered so far.

The next possible disagreement may occur between f(x) of (v), which becomes f(a) under the substitution {a/x}, and y of (ix). As one of these, viz. y, is a variable, we can unify the two terms through the substitution {f(a)/y}. Thus, the complete substitution {a/x, f(a)/y} is required to match the formulas. Making the substitutions, (v) becomes Love(a, f(a)) and (ix) becomes ~ Love(a, f(a)).

Resolving these formulas, we get False. Hence, the proof.
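
The term-by-term matching used above is exactly the unification algorithm. The following Python sketch (an illustrative implementation of ours, with an occurs check; the '?'-prefix convention for variables is our assumption) reproduces both attempts made in the example:

def is_var(t):
    return isinstance(t, str) and t.startswith('?')

def occurs(v, t, s):
    """Occurs check: does variable v occur in term t (under s)?"""
    if is_var(t) and t in s:
        t = s[t]
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:])

def unify(t1, t2, s):
    """Return a most general unifier extending s, or None on failure."""
    if s is None:
        return None
    if is_var(t1) and t1 in s:
        return unify(s[t1], t2, s)
    if is_var(t2) and t2 in s:
        return unify(t1, s[t2], s)
    if is_var(t1):
        return None if occurs(t1, t2, s) else {**s, t1: t2}
    if is_var(t2):
        return unify(t2, t1, s)
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        for a, b in zip(t1[1:], t2[1:]):
            s = unify(a, b, s)
        return s
    return s if t1 == t2 else None

# Love(x, f(x)) vs Love(a, LK): fails, f(x) cannot match the constant LK
print(unify(('Love', '?x', ('f', '?x')), ('Love', 'a', 'LK'), {}))    # None
# Love(x, LK) vs Love(a, LK): succeeds with the substitution {a/x}
print(unify(('Love', '?x', 'LK'), ('Love', 'a', 'LK'), {}))           # {'?x': 'a'}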

Check Your Progress - 3

Ex. 6: Unify, if possible, the following three formulas:

(i) Q(u, f(y, z)),
(ii) Q(u, a),
(iii) Q(u, g(h(k(u))))

Ex. 7: Determine whether the following formulas are unifiable or not:

(i) Q(f(a), g(x))

(ii) Q(x, y)

Example: Find resolvents, if possible, for the following pairs of clauses:

(i) ~ Q(x, z, x) ∨ Q(w, z, w) and

(ii) Q(w, h(v, v), w)

Solution: As two literals with the predicate Q occur mutually negated in (i) and (ii), there is a possibility of resolving ~ Q(x, z, x) of (i) with Q(w, h(v, v), w) of (ii). We attempt to unify Q(x, z, x) and Q(w, h(v, v), w), if possible, by finding an appropriate substitution. The first terms, x and w, of the two formulas are both variables and are, hence, unifiable with either of the substitutions {x/w} or {w/x}. Let us take {w/x}.

The next pair of terms from the two formulas, viz. z and h(v, v), is also unifiable, because one of the terms is a variable; the required substitution for unification is {h(v, v)/z}.

The next pair of terms at corresponding positions is again (x, w), for which we have already determined the substitution {w/x}. Thus, the substitution {w/x, h(v, v)/z} unifies the two formulas. Using these substitutions, (i) and (ii) become, respectively,

(iii) ~ Q(w, h(v, v), w) ∨ Q(w, h(v, v), w)

(iv) Q(w, h(v, v), w)

Resolving, we get

Q(w, h(v, v), w),

which is the required resolvent.

5.8 SUMMARY
In this unit, we initially discuss how PL is inadequate for solving even simple problems, and hence requires some extension of PL, or some other formal inferencing system, to compensate for this inadequacy. First Order Predicate Logic (FOPL) is such an extension of PL, and is discussed in the unit.

Next, the syntax of a proper structure of a formula of FOPL is discussed. In this respect, a number of new concepts, including those of quantifier, variable, constant, term, free and bound occurrences of variables, closed and open wff, and consistency/validity of wffs, are introduced.

Next, two normal forms, viz. the Prenex Normal Form (PNF) and the Skolem Standard Form, are introduced. Finally, the tools and techniques developed in the unit are used to solve problems involving logical reasoning.

5.9 SOLUTIONS/ANSWERS
Check Your Progress - 1

Ex. 1: (i) (∀x) (P(x) → Q(x))

(ii) (∃x) (P(x) ∧ Q(x))

(iii) ~ (∃x) (Q(x) ∧ P(x))

Ex. 2
(i) There is (at least) one (person) who is a used-car dealer.

(ii) There is (at least) one (person) who is honest.

(iii) All used-car dealers are dishonest.

(iv) (At least) one used-car dealer is honest.

(v) There is at least one thing in the universe, (for which it can be said that) if that something is
Honest then that something is a used-car dealer

Note: the above translation is not the same as: Someone who is honest is a used-car dealer.

Ex. 3: (i) After removal of '→', the given formula

= ~ P(a) ∨ ~ ((∃x) P(x))

= ~ P(a) ∨ (∀x) (~ P(x))

Now P(a) is an atom in PL, which may assume either value T or F. On taking P(a) as F, the given formula becomes T; hence, it is consistent.

(ii) The formula can be written as

(∃x) P(x) ∨ ~ ((∃x) P(x)), by taking the negation outside the second disjunct and then renaming.

The formula (∃x) P(x), being closed, is either T or F, and hence can be treated as a formula of PL. Let (∃x) P(x) be denoted by Q. Then the given formula may be denoted by Q ∨ ~ Q = True (always). Therefore, the formula is valid.

Check Your Progress - 2

Ex. 4: (i) (∀x) P(x) → (∃x) Q(x)

= ~ ((∀x) P(x)) ∨ (∃x) Q(x) (by removing the connective '→')

= (∃x) (~ P(x)) ∨ (∃x) Q(x) (by taking '~' inside)

= (∃x) (~ P(x) ∨ Q(x)) (by the distributivity of (∃x) over ∨)

Therefore, a prenex normal form of (∀x) P(x) → (∃x) Q(x) is (∃x) (~ P(x) ∨ Q(x)).

(ii) (∀x) (∀y) ((∃z) (P(x, z) ∧ P(y, z)) → (∃u) Q(x, y, u))

= (∀x) (∀y) (~ ((∃z) (P(x, z) ∧ P(y, z))) ∨ (∃u) Q(x, y, u)) (removing the connective '→')

= (∀x) (∀y) ((∀z) (~ P(x, z) ∨ ~ P(y, z)) ∨ (∃u) Q(x, y, u)) (using De Morgan's laws)

= (∀x) (∀y) (∀z) (∃u) (~ P(x, z) ∨ ~ P(y, z) ∨ Q(x, y, u)) (as z and u do not occur in the rest of the formula outside their respective scopes)

Therefore, we obtain the last formula as a prenex normal form of the first formula.

Ex. 5: (i) In the given formula, (∃x) is not preceded by any universal quantification. Therefore, we replace the variable x by a (Skolem) constant c in the formula and drop (∃x). Next, the existential quantifier (∃z) is preceded by the two universal quantifiers (∀y) and (∀v); we replace the variable z in the formula by some function, say f(v, y), and drop (∃z). Finally, the existential quantifier (∃u) is preceded by the three universal quantifiers (∀y), (∀v) and (∀w); thus, we replace the variable u in the formula by some function g(y, v, w) and drop the quantifier (∃u). Finally, we obtain the standard form for the given formula as

(∀y) (∀v) (∀w) P(c, y, f(v, y), g(y, v, w), v, w)

(ii) First of all, we reduce the matrix to CNF:

(~ P(x, y) ∧ Q(x, z)) ∨ R(x, y, z)
= (~ P(x, y) ∨ R(x, y, z)) ∧ (Q(x, z) ∨ R(x, y, z))

Next, in the formula there are two existential quantifiers, viz. (∃y) and (∃z). Each of these is preceded by the only universal quantifier, viz. (∀x).

Thus, each of the variables y and z is replaced by a function of x, but the two functions of x for y and z must be different functions. Let us assume the variable y is replaced in the formula by f(x) and the variable z is replaced by g(x). Thus the initially given formula, after dropping of the existential quantifiers, is in the standard form:

(∀x) ((~ P(x, f(x)) ∨ R(x, f(x), g(x))) ∧ (Q(x, g(x)) ∨ R(x, f(x), g(x))))
Check Your Progress - 3

Ex 6 : Refer to section 5.7

Ex 7 : Refer to section 5.7

5.10 FURTHER READINGS

1. Ela Kumar, "Artificial Intelligence", IK International Publications.
2. E. Rich and K. Knight, "Artificial Intelligence", Tata McGraw Hill Publications.
3. N.J. Nilsson, "Principles of AI", Narosa Publishing House.
4. John J. Craig, "Introduction to Robotics", Addison Wesley.
5. D.W. Patterson, "Introduction to AI and Expert Systems", Pearson.
6. Thomas J. McKay, "Modern Formal Logic", Macmillan Publishing Company, 1989.
7. Harry J. Gensler, "Symbolic Logic: Classical and Advanced Systems", Prentice Hall of India, 1990.
8. Virginia Klenk, "Understanding Symbolic Logic", Prentice Hall of India, 1983.
9. Irving M. Copi & Carl Cohen, "Introduction to Logic", 9th edition, Prentice Hall of India, 2001.
UNIT 6 RULE BASED SYSTEMS AND OTHER
FORMALISM
Structure
6.0 Introduction
6.1 Objectives
6.2 Rule Based Systems
6.2.1 Forward chaining
6.2.2 Backward chaining
6.2.3 Conflict resolution
6.3 Semantic nets
6.4 Frames
6.5 Scripts
6.6 Summary
6.7 Solutions/Answers
6.8 Further Readings

6.0 INTRODUCTION

Computer Science is the study of how to create models that can be represented in and executed by some computing equipment. In this respect, one task for a computer scientist is to create, in addition to a model of the problem domain, a model of an expert of the domain, i.e., of a problem solver who is highly skilled in solving problems from the domain under consideration. The concerned field is that of Expert Systems.

First of all, we must understand that an expert system is nothing but a computer program, or a set of computer programs, which contains the knowledge and some inference capability of an expert, most generally a human expert, in a particular domain. An expert system is supposed to have the capability to reach some conclusion based on the inputs provided together with the pre-existing information the system already contains, which is processed to infer the conclusion. Expert systems belong to the branch of Computer Science called Artificial Intelligence.

Taking into consideration all the points discussed above, one of the many possible definitions of an Expert System is: "An Expert System is a computer program that possesses or represents knowledge in a particular domain, and has the capability of processing/manipulating or reasoning with this knowledge with a view to solving a problem or achieving some specific goal."

The Artificial Intelligence programs written to achieve expert-level competence in solving problems of different domains are more generally called knowledge-based systems. A knowledge-based system is any system which performs a job or task by applying rules of thumb to a symbolic representation of knowledge, instead of employing mostly algorithmic or statistical methods. Often the term expert system is reserved for programs whose knowledge base contains the knowledge used by human experts, in contrast to knowledge gathered from textbooks or non-experts. But more often than not, the two terms, expert systems and knowledge-based systems, are taken as synonyms. Together they represent the most widespread type of AI application.

One of the underlying assumptions in Artificial Intelligence is that intelligent behaviour can be achieved through the manipulation of symbol structures (representing bits of knowledge). One of the main issues in AI is to find appropriate representations of problem elements and available actions as symbol structures, so that the representation can be used to intelligently solve problems. In AI, an important criterion for knowledge representation schemes or languages is that they should support inference. For intelligent action, the inferencing capability is essential in view of the fact that we can't represent explicitly everything that the system might ever need to know – some things have to be left implicit, to be inferred/deduced by the system as and when needed in problem solving.

In general, a good knowledge representation scheme should have the following features:

 It should allow us to express the knowledge we wish to represent in the language. For example, the mathematical statement "every symmetric and transitive relation on a domain need not be reflexive" is not expressible in First Order Logic.
 It should allow new knowledge to be inferred from a basic set of facts, as discussed above.
 It should have well-defined syntax and semantics.
Building an expert system is known as knowledge engineering, and its practitioners are called knowledge engineers. It is the job of the knowledge engineer to make sure that the computer has all the knowledge needed to solve a problem. The knowledge engineer must choose one or more forms in which to represent the required knowledge, i.e., s/he must choose one or more knowledge representation schemes.

A number of knowledge representation schemes, like predicate logic, semantic nets, frames, scripts and rule-based systems, exist; we will discuss them in this unit. Some popular knowledge representation schemes are:

 First order logic,


 Semantic networks,
 Frames,
 Scripts and,
 Rule-based systems.
As predicate logic has been discussed in the previous units, we will discuss the remaining knowledge representation schemes here in this unit.

6.1 OBJECTIVES

After going through this unit, you should be able to:

 understand the basics of expert systems;

 understand the basics of knowledge-based systems;

 discuss the various knowledge representation schemes, like rule-based systems, semantic nets, frames, and scripts.

6.2 RULE BASED SYSTEMS

We know that planning is the process that exploits the structure of the problem under consideration for designing a sequence of actions in order to solve that problem.

In order to plan a solution to the problem, one should have the knowledge of the nature and the structure
of the problem domain, under consideration. For the purpose of planning, the problem environments are
divided into two categories, viz., classical planning environments and non-classical planning
environments. The classical planning environments/domains are fully observable, deterministic, finite,
static and discrete. On the other hand, non-classical planning environments may be only partially
observable and/or stochastic.

Let us begin with rule-based systems:

Rather than representing knowledge in a declarative and somewhat static way (as a set of statements, each of which is true), rule-based systems represent knowledge in terms of a set of rules, each of which specifies the conclusion that may be reached or derived under given conditions or in different situations. A rule-based system consists of:

(i) a rule base, which is a set of IF-THEN rules;
(ii) a bunch of facts; and
(iii) an interpreter of the facts and rules, which is the mechanism that decides which rule to apply based on the set of available facts. The interpreter also initiates the action suggested by the rule selected for application.

A rule base may be of the form:

R1: If A is an animal and A barks, then A is a dog.
F1: Rocky is an animal.
F2: Rocky barks.

The rule interpreter, after scanning the above rule base, may conclude: Rocky is a dog. After this interpretation, the rule base becomes:

R1: If A is an animal and A barks, then A is a dog.
F1: Rocky is an animal.
F2: Rocky barks.
F3: Rocky is a dog.
There are two broad kinds of rule-based systems:
 Forward chaining systems,
 Backward chaining systems.

In a forward chaining system we start with the initial facts and keep using the rules to draw new intermediate conclusions (or take certain actions), given those facts. The process terminates when the final conclusion is established. In a backward chaining system, we start with some goal statements, which are intended to be established, and keep looking for rules that would allow us to conclude them, setting new sub-goals in the process. In the next round, the sub-goals become the new goals to be established. The process terminates when all the sub-goals thus generated are among the given facts. Forward chaining systems are primarily data-driven, while backward chaining systems are goal-driven. We will discuss each in detail.
Next, we discuss in detail some of the issues involved in a rule-based system.

Advantages of Rule-base

A basic principle of rule-based systems is that each rule is an independent piece of knowledge. In an IF-THEN rule, the IF-part contains all the conditions for the application of the rule under consideration. The THEN-part tells the action to be taken by the interpreter. The interpreter need not search anywhere else except within the rule itself for the conditions required for application of the rule.

Another important consequence of the above-mentioned characteristic of a rule-based system is that no rule can call upon any other; the rules are thus ignorant, and hence independent, of each other. This gives a highly modular structure to rule-based systems. Because of this highly modular structure of the rule base, the addition, deletion and modification of a rule can be done without any danger of side effects.

Disadvantages

The main problem with rule-based systems is that when the rule base grows and becomes very large, checking whether a new rule intended to be added is redundant, i.e., already covered by some of the earlier rules, becomes difficult. Still worse, as the rule base grows, checking the consistency of the rule base also becomes quite difficult. By consistency we mean the absence of situations in which two rules with similar conditions suggest actions that conflict with each other.

Let us first define working memory, before we study forward and backward chaining systems.

Working Memory: A working memory is a representation in which:

 lexically, there are application-specific symbols;

 structurally, assertions are lists of application-specific symbols;

 semantically, assertions denote facts;

 assertions can be added to or removed from the working memory.

Rule-based systems usually work in domains where conclusions are rarely certain, even when we are careful enough to try and include everything we can think of in the antecedent or condition parts of the rules.

Sources of Uncertainty
Two important sources of uncertainty in rule-based systems are:

 The theory of the domain may be vague or incomplete, so the methods to generate exact or accurate knowledge are not known.
 Case data may be imprecise or unreliable, and evidence may be missing or in conflict.

So, even when methods to generate exact knowledge are known, they may be impractical due to lack of data, imprecision of data, or problems related to data collection.

So rule-based deduction system developers often build some sort of certainty or probability computing procedure on top of the normal condition-action format of rules. Certainty computing procedures attach a probability between 0 and 1 to each assertion or fact. Each probability reflects how certain an assertion is, where a certainty factor of 0 indicates that the assertion is definitely false and a certainty factor of 1 indicates that the assertion is definitely true.

Example 1: In the example discussed above, the assertion (at-home ram) may have a certainty factor of, say, 0.7 attached to it.

Example 2: In MYCIN, a rule-based expert system (which we will discuss later), a rule linking evidence to a hypothesis, expressed as decision criteria, may look like:

IF the patient has symptoms s1, s2, s3 and s4
AND certain background conditions t1, t2 and t3 hold
THEN the patient has disease d6 with certainty 0.75
For a detailed discussion of certainty factors, the reader may refer to probability theory, fuzzy sets, possibility theory, Dempster-Shafer theory, etc.
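
As a simple illustration of one such certainty computing procedure (this is only one common convention, sketched by us, and not the full MYCIN calculus), the certainty of a rule's conclusion may be computed from the certainties of its antecedents and of the rule itself:

def rule_certainty(antecedent_cfs, rule_cf):
    """Certainty of a rule's conclusion: the certainty of the weakest
    antecedent, scaled by the certainty attached to the rule itself.
    (Real systems also define how to combine several rules that
    support the same conclusion.)"""
    return min(antecedent_cfs) * rule_cf

# Symptoms s1..s4 known with certainties 0.9, 0.8, 1.0 and 0.95, and a
# rule certainty of 0.75, as in the MYCIN-style rule above:
print(rule_certainty([0.9, 0.8, 1.0, 0.95], 0.75))    # -> 0.6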

6.2.1 Forward Chaining Systems


In a forward chaining system the facts are represented in a working memory which is continually updated, so, on the basis of the rule currently being applied, the number of facts may either increase or decrease. Rules in the system represent possible actions to be taken when specified conditions hold on items in the working memory – they are sometimes called condition-action or antecedent-consequent rules. The conditions are usually patterns that must match items in the working memory, while the actions usually involve adding items to or deleting items from the working memory. So we can say that forward chaining proceeds forward, beginning with facts, chaining through rules, and finally establishing the goal. Forward chaining systems usually represent rules in standard implicational form, with an antecedent or condition part consisting of positive literals, and a consequent or conclusion part consisting of a positive literal.

The interpreter controls the application of the rules, given the working memory, thus controlling the system's activity. It is based on a cycle of activity sometimes known as a recognise-act cycle. The system first checks to find all the rules whose condition parts are satisfied, i.e., those rules which are applicable, given the current state of the working memory. (A rule is applicable if each of the literals in its antecedent, i.e., the condition part, can be unified with a corresponding fact using consistent substitutions. This restricted form of unification is called pattern matching.) It then selects one and performs the actions in the action part of the rule, which may involve the addition or deletion of facts. The actions result in a new, i.e., updated, working memory, and the cycle starts again. (When more than one rule is applicable, some sort of external conflict resolution scheme is used to decide which rule will be fired. But when there are large numbers of rules and facts, the number of unifications that must be tried becomes prohibitive.) This cycle is repeated until either no rule fires, or the required goal is reached.

Rule-based systems vary greatly in their details and syntax. Let us take the following example, in which we use forward chaining:

Example

Let us assume that the working memory initially contains the following facts :

(day monday)
(at-home ram)
(does-not-like ram)

Let the existing set of rules be:

R1 : IF (day monday)
THEN ADD to working memory the fact : (working-with ram)

R2 : IF (day monday)
THEN ADD to working memory the fact : (talking-to ram)
R3 : IF (talking-to X) AND (working-with X)
THEN ADD to working memory the fact : (busy-at-work X)

R4 : IF (busy-at-work X) OR (at-office X)
THEN ADD to working memory the fact : (not-at-home X)

R5 : IF (not-at-home X)
THEN DELETE from working memory the fact : (at-home X)

R6 : IF (working-with X)
THEN DELETE from working memory the fact : (does-not-like X)

Now, to start the process of inference through forward chaining, the rule-based system will first search for all the rules whose antecedent parts are satisfied by the current set of facts in the working memory. In this example, we can see that rules R1 and R2 are satisfied, so the system will choose one of them using its conflict resolution strategies. Let rule R1 be chosen. So (working-with ram) is added to the working memory (after substituting "ram" in place of X). The working memory now looks like:

(working-with ram)
(day monday)
(at-home ram)
(does-not-like ram)

Now this cycle begins again: the system looks for rules that are satisfied, and it finds rules R2 and R6. Let the system choose rule R2. So now (talking-to ram) is added to working memory, which now contains the following:

(talking-to ram)
(working-with ram)
(day monday)
(at-home ram)
(does-not-like ram)

Now in the next cycle, rule R3 fires, so now (busy-at-work ram) is added to working memory, which now
looks like:

(busy-at-work ram)
(talking-to ram)
(working-with ram)
(day monday)
(at-home ram)
(does-not-like ram)

Now the antecedent parts of rules R4 and R6 are satisfied. Let rule R4 fire, so (not-at-home ram) is added to working memory, which now looks like:

(not-at-home ram)
(busy-at-work ram)
(talking-to ram)
(working-with ram)
(day monday)
(at-home ram)
(does-not-like ram)
In the next cycle, rule R5 fires so (at-home ram) is removed from the working memory :

(not-at-home ram)
(busy-at-work ram)
(talking-to ram)
(working-with ram)
(day monday)
(does-not-like ram)

The forward chaining will continue like this. But we have to be aware of one thing: the order in which the rules fire is important. A change in the ordering sequence of rule firings may result in a different working memory.
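
The recognise-act cycle traced above can be captured in a few lines of code. The following Python sketch (our own illustration, not part of the original text; R4's OR is split into two rules, and rule order serves as the conflict resolution strategy) reproduces the example, with facts as tuples and 'X' as the only variable:

def match(pattern, fact, binding):
    """Match a single condition pattern against a fact; 'X' is the
    only variable.  Returns an extended binding, or None."""
    if len(pattern) != len(fact):
        return None
    for p, f in zip(pattern, fact):
        if p == 'X':
            if binding.get('X', f) != f:
                return None
            binding = {**binding, 'X': f}
        elif p != f:
            return None
    return binding

def satisfied(conditions, memory):
    """Find a binding under which every condition matches some fact."""
    def search(conds, binding):
        if not conds:
            return binding
        for fact in memory:
            b = match(conds[0], fact, binding)
            if b is not None:
                result = search(conds[1:], b)
                if result is not None:
                    return result
        return None
    return search(conditions, {})

def fire_one(rules, memory):
    """One recognise-act cycle: fire the first rule whose conditions
    are satisfied and whose action actually changes the memory."""
    for conditions, action, pattern in rules:
        b = satisfied(conditions, memory)
        if b is None:
            continue
        fact = tuple(b.get(t, t) for t in pattern)
        if action == 'ADD' and fact not in memory:
            memory.add(fact)
            return True
        if action == 'DELETE' and fact in memory:
            memory.remove(fact)
            return True
    return False

memory = {('day', 'monday'), ('at-home', 'ram'), ('does-not-like', 'ram')}
rules = [
    ([('day', 'monday')], 'ADD', ('working-with', 'ram')),          # R1
    ([('day', 'monday')], 'ADD', ('talking-to', 'ram')),            # R2
    ([('talking-to', 'X'), ('working-with', 'X')],
     'ADD', ('busy-at-work', 'X')),                                 # R3
    ([('busy-at-work', 'X')], 'ADD', ('not-at-home', 'X')),         # R4, 1st disjunct
    ([('at-office', 'X')], 'ADD', ('not-at-home', 'X')),            # R4, 2nd disjunct
    ([('not-at-home', 'X')], 'DELETE', ('at-home', 'X')),           # R5
    ([('working-with', 'X')], 'DELETE', ('does-not-like', 'X')),    # R6
]
while fire_one(rules, memory):
    pass
print(memory)    # the working memory after forward chaining terminates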

6.2.2 Backward Chaining Systems


In forward chaining systems we have seen how rule-based systems are used to draw new conclusions from existing data and add these conclusions to a working memory. The forward chaining approach is most useful when we know all the initial facts, but don't have much idea what the conclusion might be.

If we know what the conclusion might be, or have some specific hypothesis to test, forward chaining systems may be inefficient. In forward chaining we keep moving ahead until no more rules apply or we have added our hypothesis to the working memory. But in the process the system is likely to do a lot of additional and irrelevant work, adding uninteresting or irrelevant conclusions to the working memory. Let us say that, in the example discussed before, we want to find out whether "ram is at home". We could repeatedly fire rules, updating the working memory and checking each time whether (at-home ram) is found in the new working memory. But maybe we had a whole batch of rules for drawing conclusions about what happens when someone is working, or what happens on a Monday – we really don't care about these, and would rather only draw the conclusions that are relevant to the goal.


This can be done by backward chaining from the goal state, or from some hypothesised state that we are interested in. This is essentially how Prolog works. Given a goal state to try and prove, for example (at-home ram), the system will first check to see if the goal matches the initial facts given. If it does, then that goal succeeds. If it doesn't, the system will look for rules whose conclusions, i.e., actions, match the goal. One such rule will be chosen, and the system will then try to prove any facts in the preconditions of the rule using the same procedure, setting these as new goals to prove. We should note that a backward chaining system does not need to update a working memory. Instead it needs to keep track of what goals it needs to prove in order to prove its main hypothesis. So we can say that in a backward chaining system, the reasoning proceeds "backward", beginning with the goal to be established, chaining through rules, and finally anchoring in facts.


Although, in principle, the same set of rules can be used for both forward and backward chaining, in backward chaining, in practice, we may choose to write the rules slightly differently. In backward chaining we are concerned with matching the conclusion of a rule against some goal that we are trying to prove. So the 'then' or consequent part of the rule is usually not expressed as an action to take (e.g., add/delete), but as a state which will be true if the premises are true.
To learn more, let us take a different example in which we use backward chaining (The system is used to
identify an animal based on its properties stored in the working memory):

Example

Let us assume that the working memory initially contains the following facts:

(has-hair raja) representing the fact “raja has hair”


(big-mouth raja) representing the fact “raja has a big mouth”
(long-pointed-teeth raja) representing the fact “raja has long pointed teeth”
(claws raja) representing the fact “raja has claws”

Let the existing set of rules be:

1. IF (gives-milk X)
THEN (mammal X)

2. IF (has-hair X)
THEN (mammal X)

3. IF (mammal X) AND (eats-meat X)


THEN (carnivorous X)

4. IF (mammal X) AND (long-pointed-teeth X) AND (claws X)


THEN (carnivorous X)

5. IF (mammal X) AND (does-not-eat-meat X)


THEN (herbivorous X)

6. IF (carnivorous X) AND (dark-spots X)
THEN (cheetah X)

7. IF (herbivorous X) AND (long-legs X) AND (long-neck X) AND (dark-spots X)
THEN (giraffe X)

8. IF (carnivorous X) AND (big-mouth X)
THEN (lion X)

9. IF (herbivorous X) AND (long-trunk X) AND (big-size X)
THEN (elephant X)

10. IF (herbivorous X) AND (white-color X) AND (black-stripes X)
THEN (zebra X)


Now, to start the process of inference through backward chaining, the rule-based system will first form a hypothesis and then use the antecedent-consequent rules (previously called condition-action rules) to work backward toward hypothesis-supporting assertions or facts.

Let us take the initial hypothesis that "raja is a lion" and then reason about whether this hypothesis is viable, using the backward chaining approach explained below:
 The system searches for a rule which has the initial hypothesis, that someone, i.e., raja, is a lion, in its consequent part; it finds this in rule 8.

 The system moves from the consequent to the antecedent part of rule 8, and it finds the first condition, i.e., the first part of the antecedent, which says that "raja must be carnivorous".

 Next, the system searches for a rule whose consequent part declares that someone, i.e., raja, is carnivorous; two such rules are found, i.e., rule 3 and rule 4. We assume that the system tries rule 3 first.

 To satisfy the consequent part of rule 3, which has now become the system's new hypothesis, the system moves to the first part of the antecedent, which says that X, i.e., raja, has to be a mammal.

 So a new sub-goal is created, in which the system has to check that "raja is a mammal". It does so by hypothesising it and trying to find a rule having as consequent that someone, i.e., X, is a mammal. Again the system finds two rules, rule 1 and rule 2. Let us assume that the system tries rule 1 first.

 In rule 1, the system now moves to the first antecedent part, which says that X, i.e., raja, must give milk for it to be a mammal. The system cannot establish this, because the hypothesis is neither supported by any of the rules nor found among the existing facts in the working memory. So the system abandons rule 1 and tries to use rule 2 to establish that "raja is a mammal".

 In rule 2, it moves to the antecedent which says that X i.e., raja must have hair for it to be a mammal.
The system already knows this as it is one of the facts in working memory. So the antecedent part of
rule 2 is satisfied and so the consequent that “raja is a mammal” is established.

 Now the system backtracks to rule 3, whose first antecedent part is satisfied. In the second condition of the antecedent it finds its new sub-goal, and in turn forms a new hypothesis that X, i.e., raja, eats meat.

 The system tries to find a supporting rule or an assertion in the working memory which says that "raja eats meat", but it finds none. So the system abandons rule 3 and tries to use rule 4 to establish that "raja is carnivorous".

 In rule 4, the first part of antecedent says that raja must be a mammal for it to be carnivorous. The
system already knows that “raja is a mammal” because it was already established when trying to
satisfy the antecedents in rule 3.

 The system now moves to second part of antecedent in rule 4 and finds a new sub-goal in which the
system must check that X i.e., raja has long-pointed-teeth which now becomes the new hypothesis.
This is already established as “ raja has long-pointed-teeth” is one of the assertions of the working
memory.

 In third part of antecedent in rule 4 the system’s new hypothesis is that “raja has claws”. This also is
already established because it is also one the assertions in the working memory.

 Now as all the parts of the antecedent in rule 4 are established so its consequent i.e., “raja is
carnivorous” is established.

 The system now backtracks to rule 8, where the second part of the antecedent says that X, i.e., raja, must have a big mouth; this now becomes the new hypothesis. It is already established, because the system has the assertion that "raja has a big mouth".
 Now as the whole antecedent of rule 8 is satisfied so the system concludes that “raja is a lion”.

We have seen that the system was able to work backward through the antecedent-consequent rules, using desired conclusions to decide what assertions it should look for, and ultimately establishing the initial hypothesis.
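
The goal-directed search traced above can be captured by a short recursive procedure. The following Python sketch (our own illustration; since only one animal is involved, the variable X is dropped, only some of the rules are encoded, and no loop-checking is done):

facts = {'has-hair', 'big-mouth', 'long-pointed-teeth', 'claws'}    # all about raja

rules = [
    ('mammal',      ['gives-milk']),                                # rule 1
    ('mammal',      ['has-hair']),                                  # rule 2
    ('carnivorous', ['mammal', 'eats-meat']),                       # rule 3
    ('carnivorous', ['mammal', 'long-pointed-teeth', 'claws']),     # rule 4
    ('cheetah',     ['carnivorous', 'dark-spots']),                 # rule 6
    ('lion',        ['carnivorous', 'big-mouth']),                  # rule 8
]

def prove(goal):
    """A goal holds if it is a known fact, or if some rule concludes
    it and all of that rule's antecedents can in turn be proved."""
    if goal in facts:
        return True
    return any(consequent == goal and all(prove(a) for a in antecedents)
               for consequent, antecedents in rules)

print(prove('lion'))      # -> True: raja is a lion
print(prove('cheetah'))   # -> False: no evidence of dark spots

Note how the trace matches the narrative: rule 1 fails (no gives-milk fact), rule 2 establishes mammal, rule 3 fails on eats-meat, and rule 4 then establishes carnivorous, letting rule 8 conclude lion.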

How do we choose between forward and backward chaining for a given problem?

Many rule-based deduction systems can chain either forward or backward, but which of these approaches is better for a given problem deserves discussion.

First, let us note some basic facts about rules, i.e., how a rule relates its inputs (facts) to its outputs (conclusions). Whenever a particular set of facts can lead to many conclusions, the rule base is said to have a high degree of fan-out, and it is a strong candidate for backward chaining. On the other hand, whenever a particular hypothesis can lead to many questions to be answered for the hypothesis to be established, the rule base is said to have a high degree of fan-in, and a high degree of fan-in makes it a strong candidate for forward chaining.

To summarise, the following points should help in choosing the type of chaining for reasoning purposes:

 If the set of facts, either those we already have or those we may establish, can lead to a large number of conclusions or outputs, but the number of ways or input paths to reach the particular conclusion in which we are interested is small, then the degree of fan-out is more than the degree of fan-in. In such a case, backward chaining is the preferred choice.
 But if the number of ways or input paths to reach the particular conclusion in which we are interested is large, and the number of conclusions that we can reach using the facts through the rules is small, then the degree of fan-in is more than the degree of fan-out. In such a case, forward chaining is the preferred choice.

For cases where the degrees of fan-out and fan-in are approximately the same, if not many facts are available and the problem is to check whether one of the many possible conclusions is true, backward chaining is the preferred choice.


6.2.3 Conflict Resolution


As mentioned earlier, in a forward chaining system more than one rule may be applicable in a given recognise-act cycle. Since each rule is an independent piece of knowledge, the interpreter can determine a rule's applicability from the rule itself; the remaining question, of which of the applicable rules should actually be fired, is called conflict resolution.


Some of the conflict resolution strategies which are used to decide which rule to fire are given below:

 Don’t fire a rule twice on the same data.


 Fire rules on more recent working memory elements before older ones. This allows the system to
follow through a single chain of reasoning, rather than keeping on drawing new conclusions from old
data.
 Fire rules with more specific preconditions before ones with more general preconditions. This allows
us to deal with non-standard cases.

These strategies may help in getting reasonable behaviour from a forward chaining system, but the most important thing is how we write the rules. They should be carefully constructed, with the preconditions specifying as precisely as possible when different rules should fire. Otherwise we will have little idea or control of what will happen.

Check Your Progress - 1

Exercise 1: In the "Animal Identifier System" discussed above, use forward chaining to try to identify the animal called "raja".

6.3 SEMANTIC NETS

Semantic network representations provide a structured knowledge representation. In such a network, parts of knowledge are clustered into semantic groups. In semantic networks, the concepts and entities/objects of the problem domain are represented by nodes, and relationships between these entities are shown by arrows, generally directed arrows. A semantic network representation is thus a pictorial depiction of objects, their attributes, and the relationships that exist between these objects and other entities. A semantic net is just a graph, where the nodes represent concepts and the arcs are labelled and represent binary relationships between concepts. These networks provide a more natural way, as compared to other representation schemes, of mapping to and from natural language.

For example, the fact (a piece of knowledge): Mohan struck Nita in the garden with a sharp knife last week, is represented by the semantic network shown in Figure 1.1.


[Figure 1.1 Semantic Network: the node struck is linked by a “past of” arc to the concept strike; arcs labelled agent, object, time, place and instrument connect it to Mohan, Nita, last week, garden and knife respectively, and knife carries a “property of” arc to sharp.]


The two most important relations between concepts are (i) subclass relation between a class and its
superclass, and (ii) instance relation between an object and its class. Other relations may be has-part,
color etc. As mentioned earlier, relations are indicated by labeled arcs.

As information in semantic networks is clustered together through relational links, the knowledge
required for the performance of some task is generally available within a short spatial span of the
semantic network. This type of knowledge organisation, in some way, resembles the way knowledge is
stored and retrieved by human beings.

Subclass and instance relations allow us to use inheritance to infer new facts/relations from the explicitly
represented ones. We have already mentioned that the graphical portrayal of knowledge in semantic
networks, being visual, is easier than other representation schemes for the human beings to comprehend.
This fact helps the human beings to guide the expert system, whenever required. This is perhaps the
reason for the popularity of semantic networks.
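As a small illustration, such a network can be sketched as a set of labelled triples (node, relation, node); the node and relation names below mirror Figure 1.1 and are illustrative only.

```python
# A semantic net as (subject, relation, object) triples, following Figure 1.1.
triples = {
    ("struck", "past-of", "strike"),
    ("struck", "agent", "Mohan"),
    ("struck", "object", "Nita"),
    ("struck", "time", "last-week"),
    ("struck", "place", "garden"),
    ("struck", "instrument", "knife"),
    ("knife", "property-of", "sharp"),
}

def related(node, relation):
    # All objects reachable from `node` along an arc labelled `relation`.
    return [o for s, r, o in triples if s == node and r == relation]

print(related("struck", "agent"))       # ['Mohan']
print(related("knife", "property-of"))  # ['sharp']
```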

Check Your Progress – 2


Exercise 2: Draw a semantic network for the following English statement:
Mohan struck Nita and Nita’s mother struck Mohan.

6.4 FRAMES

Frames are a variant of semantic networks and are one of the popular ways of representing non-
procedural knowledge in an expert system. All the information relevant to a particular concept
is stored in a single complex entity, called a frame. Frames resemble the record data structure. Frames
support inheritance. They are often used to capture knowledge about typical objects or events, such as a
car, or even a mathematical object like a rectangle. As mentioned earlier, a frame is a structured object,
and different names like Schema, Script, Prototype, and even Object are used instead of frame in the
computer science literature.

We may represent some knowledge about a lion in frames as follows:

Mammal :
Subclass : Animal
warm_blooded : yes

Lion :
subclass : Mammal
eating-habit : carnivorous
size : medium

Raja :
instance : Lion
colour : dull-Yellow
owner : Amar Circus

Sheru :
instance : Lion
size : small
A particular frame (such as Lion) has a number of attributes or slots such as eating-habit and size. Each
of these slots may be filled with particular values, such as the eating-habit for lion may be filled up as
carnivorous.

Sometimes a slot contains additional information such as how to apply or use the slot values. Typically, a
slot contains information such as (attribute, value) pairs, default values, conditions for filling a slot,
pointers to other related frames, and also procedures that are activated when needed for different
purposes.

In the case of frame representation of knowledge, inheritance is simple if an object has a single parent
class, and if each slot takes a single value. For example, if a mammal is warm blooded then automatically
a lion being a mammal will also be warm blooded.
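The slot lookup with inheritance described above can be sketched in a few lines of Python; the Frame class and the slot names below are illustrative (Python identifiers use underscores in place of hyphens), not a standard frame language.

```python
# A minimal frame sketch: each frame stores its own slots plus a link to a
# parent frame; slot lookup climbs the subclass/instance chain.
class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name, self.parent, self.slots = name, parent, slots

    def get(self, slot):
        # Inherited lookup: use a local value if present, else ask the parent.
        if slot in self.slots:
            return self.slots[slot]
        if self.parent is not None:
            return self.parent.get(slot)
        return None

animal = Frame("Animal")
mammal = Frame("Mammal", parent=animal, warm_blooded=True)
lion   = Frame("Lion", parent=mammal, eating_habit="carnivorous", size="medium")
raja   = Frame("Raja", parent=lion, colour="dull-yellow", owner="Amar Circus")

print(raja.get("warm_blooded"))   # True -- inherited from Mammal via Lion
print(raja.get("eating_habit"))   # carnivorous -- inherited from Lion
```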

But in case of multiple inheritance i.e., in case of an object having more than one parent class, we have
to decide which parent to inherit from. For example, a lion may inherit from “wild animals” or “circus
animals”. In general, both the slots and slot values may themselves be frames and so on.

Frame systems are complex and sophisticated knowledge representation tools. This
representation has become so popular that special high-level frame-based representation languages have
been developed. Most of these languages use LISP as the host language. It is also possible to represent
frame-like structures using object-oriented programming languages, or using extensions to the
programming language LISP.

Check Your Progress – 3


Exercise 3: Define a frame for the entity date, which consists of day, month and year, each of which is a
number with the well-known restrictions. Assume a procedure named compute-day-of-week is already
defined.

6.5 SCRIPTS
A script is a structured representation describing a stereotyped sequence of events in a particular context.
Scripts are used in natural language understanding systems to organize a knowledge base in terms of the
situations that the system should understand. Scripts use a frame-like structure to represent commonly
occurring experiences like going to the movies, eating in a restaurant, shopping in a supermarket, or
visiting an ophthalmologist.
Thus, a script is a structure that prescribes a set of circumstances that could be expected to follow on from
one another.
Scripts are beneficial because:
 Events tend to occur in known runs or patterns.
 A causal relationship between events exists.
 An entry condition exists which allows an event to take place.
 Prerequisites exist for events taking place.

Components of a script
The components of a script include:
 Entry conditions: These are basic conditions which must be fulfilled before events in the script
can occur.
 Results: Condition that will be true after events in script occurred.
 Props: Slots representing objects involved in events
 Roles: These are the actions that the individual participants perform.
 Track: Variations on the script. Different tracks may share components of the same scripts.
 Scenes: The sequence of events that occur.
To describe a script, special action symbols are used. These are:

Symbol    Meaning                                        Example

ATRANS    transfer a relationship                        give
PTRANS    transfer the physical location of an object    go
PROPEL    apply physical force to an object              push
MOVE      move a body part by its owner                  kick
GRASP     grab an object by an actor                     hold
INGEST    take in an object by an animal                 eat, drink
EXPEL     expel from an animal's body                    cry
MTRANS    transfer mental information                    tell
MBUILD    mentally make new information                  decide
CONC      conceptualize or think about an idea           think
SPEAK     produce sound                                  say
ATTEND    focus a sense organ                            listen
Example: Script for going to the bank to withdraw money.
SCRIPT : Withdraw money
TRACK : Bank
PROPS : Money
Counter
Form
Token
Roles :
P= Customer
E= Employee
C= Cashier
Entry conditions: P has little or no money.
The bank is open.
Results : P has more money.
Scene 1: Entering
P PTRANS P into the Bank
P ATTEND eyes to E
P MOVE P to E
Scene 2: Filling the form
P MTRANS signal to E
E ATRANS form to P
P PROPEL form for writing
P ATRANS form to E
Scene 3: Withdrawing money
P ATTEND eyes to counter
P PTRANS P to queue at the counter
P PTRANS token to C
C ATRANS money to P
Scene 4: Exiting the bank
P PTRANS P to out of bank
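Such a script can also be represented as a nested data structure; the Python field names below mirror the components listed earlier and are illustrative, and only two of the four scenes are shown for brevity.

```python
# The withdraw-money script as a nested dict; scenes are lists of
# (actor, primitive action, argument) triples.
withdraw_money = {
    "track": "Bank",
    "props": ["money", "counter", "form", "token"],
    "roles": {"P": "customer", "E": "employee", "C": "cashier"},
    "entry_conditions": ["P has little or no money", "the bank is open"],
    "results": ["P has more money"],
    "scenes": {
        "entering":    [("P", "PTRANS", "P into the bank"),
                        ("P", "ATTEND", "eyes to E")],
        "withdrawing": [("P", "PTRANS", "token to C"),
                        ("C", "ATRANS", "money to P")],
    },
}

print(withdraw_money["scenes"]["withdrawing"])
```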
Advantages of Scripts
 Ability to predict events.
 A single coherent interpretation may be built up from a collection of observations.
Disadvantages of Scripts
 Less general than frames.
 May not be suitable for representing all kinds of knowledge.

6.6 SUMMARY
This unit discussed the various knowledge representation mechanisms used in Artificial Intelligence.
The unit began with a discussion of rule-based systems and the related concepts of forward chaining and
backward chaining; later, the concept of conflict resolution was discussed. The unit also covered other
techniques of knowledge representation, namely semantic nets, frames and scripts, along with relevant
examples of each.

6.7 SOLUTIONS/ANSWERS
Check Your Progress – 1
Exercise 1: Refer to section 6.2
Check Your Progress – 2
Exercise 2: Refer to section 6.3
Check Your Progress – 3
Exercise 3: Refer to section 6.4
6.8 FURTHER READINGS
1. Ela Kumar, “ Artificial Intelligence”, IK International Publications
2. E. Rich and K. Knight, “Artificial Intelligence”, Tata McGraw Hill Publications
3. N.J. Nilsson, “Principles of AI”, Narosa Publ. House Publications
4. John J. Craig, “Introduction to Robotics”, Addison Wesley publication
5. D.W. Patterson, “Introduction to AI and Expert Systems" Pearson publication
UNIT 7 PROBABILISTIC REASONING
Structure

7.0 Introduction
7.1 Objectives
7.2 Reasoning with Uncertain Information
7.3 Review of Probability Theory
7.4 Introduction to Bayesian Theory
7.5 Bayes’ Networks
7.6 Probabilistic Inference
7.7 Basic Idea of Inferencing with Bayes’ Networks
7.8 Other Paradigms of Uncertain Reasoning
7.9 Dempster-Shafer Theory
7.10 Summary
7.11 Solutions/Answers
7.12 Further Readings

7.0 INTRODUCTION
This unit is dedicated to probability theory and its use in decision making for various problems.
Contrary to classical decision making with True and False propositions, here the truth of a proposition is
held with a certain probability, and decisions are made on that basis. The inclusion of such a
probabilistic approach is quite relevant, since uncertainties are quite common in the real world.

As we know, the probability of an (uncertain) event E is basically a measure of the degree of likelihood
of the occurrence of E. Let the set of all possible outcomes be represented as the sample space S. The
measure of probability is a function P(⋅) mapping events E from the sample space S to real numbers and
satisfying the following conditions:

(i) 0 ≤ P(E) ≤ 1 for any event E ⊆ S,

(ii) P(S) = 1, i.e. a certain outcome has probability 1, and

(iii) for Ei ∩ Ej = ϕ for all i ≠ j (i.e. the events Ei are mutually exclusive),
P(E1 ∪ E2 ∪ …) = P(E1) + P(E2) + …

Using the above three conditions, we can derive the basic laws of probability. It is also to be noted
that these three conditions alone are not enough to compute the probability of an outcome; this
additionally requires the collection of experimental data for estimating the underlying distribution.

7.1 OBJECTIVES
After going through this unit, you should be able to:

 Understand the role of probabilistic reasoning in AI
 Understand the concepts of Bayesian theory and Bayesian networks
 Perform probabilistic inference through Bayesian networks
 Understand other paradigms of uncertain reasoning and the Dempster-Shafer theory
7.2 REASONING WITH UNCERTAIN INFORMATION
Reasoning is an important step in decision making. The amount of information available and its
correctness play a crucial role in reasoning. Decision making is easier when we have certain information,
i.e. when the correctness of the information can be ascertained. When the certainty of the information
cannot be ascertained, the decision-making process is likely to be erroneous. How decisions are made
under such uncertainty (with uncertain information) is the core concern of this unit. The sources of
uncertainty in information are various, including experimental error, instrument fault and unreliable
sources. When we have to make decisions based on uncertain information, we cannot rely on models
which assume certain information. One potential solution for such scenarios is probabilistic reasoning:
we can use probabilistic models to reason with uncertain information, attaching a probability to each
conclusion. Let us first review the basic probability concepts before discussing probabilistic reasoning.

7.3 REVIEW OF PROBABILITY THEORY


Now you are familiar with reasoning and how it can benefit from probability theory. Before we dive
deeper into Bayes’ theorem and its applications, let us review some of the basic concepts of probability
theory. These concepts will help us understand the other topics of this unit.

Trials, Sample Space, Events: You must often have observed that a random experiment may comprise a
series of smaller sub-experiments. These are called trials. Consider, for instance, the following situations.

Example 1: Suppose the experiment consists of observing the results of three successive tosses of a coin.
Each toss is a trial, and the experiment consists of three trials, so that it is completed only after the third
toss (trial) is over.

Example 2: Suppose from a lot of manufactured items, ten items are chosen successively following a
certain checking mechanism. The underlying experiment is completed only after the selection of the
tenth item; the experiment obviously comprises 10 trials.

Example 3: If you consider Example 1 once again, you would notice that each toss (trial) results in
either a head (H) or a tail (T). In all, there are 8 possible outcomes of the experiment, viz., s1 = (H,H,H),
s2 = (H,H,T), s3 = (H,T,H), s4 = (T,H,H), s5 = (T,T,H), s6 = (T,H,T), s7 = (H,T,T) and s8 = (T,T,T).

Let ζ be a fixed sample space. We have already defined an event as a collection of sample points from ζ.
Imagine that the (conceptual) experiment underlying ζ is being performed. The phrase "the event E
occurs" would mean that the experiment results in an outcome that is included in the event E. Similarly,
non-occurrence of the event E would mean that the experiment results into an outcome that is not an
element of the event E. Thus, the collection of all sample points that are not included in the event E is also
an event which is complementary to E and is denoted as Ec. The event Ec is therefore the event which
contains all those sample points of ζ which are not in E. As such, it is easy to see that the event E occurs if
and only if the event Ec does not take place. The events E and Ec are complementary events, and taken
together they comprise the entire sample space, i.e., E ∪ Ec = ζ.
You may recall that ζ is an event which consists of all the sample points. Hence, its complement is an
empty set in the sense that it does not contain any sample point and is called the null event, usually
denoted as ø so that ζc = ø.

Let us once again consider Example 3. Consider the event E that the three tosses produce at least one
head. Thus, E = {s1, s2, s3, s4, s5, s6, s7}, so that the complementary event Ec = {s8} is the event of not
scoring a head at all. Again, consider a selection of two marbles without replacement from an urn
containing two red marbles r1, r2 and one white marble w; the event that the white marble is picked at
least once is E = {(r1,w), (r2,w), (w,r1), (w,r2)}, and hence Ec = {(r1,r2), (r2,r1)}, i.e. the event of not
picking the white marble at all.

Let us now consider two events E and F. We write E ∪ F, read as E “union” F, to denote the collection of
sample points which are responsible for the occurrence of either E or F or both. Thus, E ∪ F is a new
event, and it occurs if and only if either E or F or both occur, i.e. if and only if at least one of the events E
or F occurs. Generalizing this idea, we can define a new event E1 ∪ E2 ∪ … ∪ Ek, read as the “union” of
the k events E1, E2, …, Ek, as the event which consists of all sample points that are in at least one of the
events E1, E2, …, Ek; it occurs if and only if at least one of the events E1, E2, …, Ek occurs.

Again, let E and F be two given events. We write E ∩ F, read as E “intersection” F, to denote the
collection of sample points any of whose occurrence implies the occurrence of both E and F. Thus, E ∩ F
is a new event, and it occurs if and only if both the events E and F occur. Generalizing this idea, we can
define a new event E1 ∩ E2 ∩ … ∩ Ek, read as the “intersection” of the k events E1, E2, …, Ek, as the
event which consists of the sample points that are common to each of the events E1, E2, …, Ek; it occurs
only if all the k events E1, E2, …, Ek occur simultaneously. Further, two events E and F are said to be
mutually exclusive or disjoint if they do not have a common sample point, i.e. E ∩ F = ø.

Two mutually exclusive events then cannot occur simultaneously. In the coin-tossing experiment for
instance, the two events, heads and tails, are mutually exclusive: if one occurs, the other cannot occur. To
have a better understanding of these events let us once again look at Example 3. Let E be the event of
scoring an odd number of heads and F be the event that a tail appears in the first two tosses, so that
E = {s1, s5, s6, s7} and F = {s5, s8}. Now E ∩ F = {s5}, the event that only the third toss yields a head.
Thus, the events E and F are not mutually exclusive.

[Fig. 1(a) and Fig. 1(b): Venn diagrams illustrating an event E with its complement Ec, and two events E and F with E ∪ F and E ∩ F]

The above relations between events can be best viewed through a Venn diagram. A rectangle is drawn to
represent the sample space ζ. All the sample points are represented within the rectangle by means of
points. An event is represented by the region enclosed by a closed curve containing all the sample points
leading to that event. The space inside the rectangle but outside the closed curve representing E represents
the complementary event Ec (see Fig. 1(a) above). Similarly, in Fig. 1(b), the space inside the curve
drawn with a broken line represents the event E ∪ F, and the shaded portion represents E ∩ F.

As is clear by now, the outcome of a random experiment being uncertain, none of the various events
associated with a sample space can be predicted with certainty before the underlying experiment is
performed and the outcome of it is noted. However, some events may intuitively seem to be more likely
than the rest. For example, talking about human beings, the event that a person will live 20 years seems to
be more likely compared to the event that the person will live 200 years. Such thoughts motivate us to
explore if one can construct a scale of measurement to distinguish between likelihoods of various events.
Towards this, a small but extremely significant fact comes to our help. Before we elaborate on this, we
need a couple of definitions.

Consider an event E associated with a random experiment; suppose the experiment is repeated n times
under identical conditions and suppose the event E (which is not likely to occur with every performance
of the experiment) occurs fn(E) times in these n repetitions. Then, fn(E) is called the frequency of the
event E in n repetitions of the experiment, and rn(E) = fn(E)/n is called the relative frequency of the event
E in n repetitions of the experiment. Let us consider the following example.

Example 4: Consider the experiment of tossing a coin. Suppose we repeat the toss 5 times, and suppose
the frequencies of occurrence of a head are tabulated below in Table 1:

No. of repetitions (n)    Frequency of head fn(H)    Relative frequency of head rn(H)
1                         0                          0
2                         1                          1/2
3                         2                          2/3
4                         3                          3/4
5                         3                          3/5
Notice that the third column in Table 1 gives the relative frequencies rn(H) of heads. We can keep on
increasing the number of repetitions n and continue calculating the values of rn(H) in Table 1. Merely to
fix ideas regarding the concept of the probability of an event, we present here a very naive approach
which is in no way rigorous, but which helps to see things better at this stage.
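As a small illustration of this long-run behaviour, the following Python sketch (assuming a fair coin and an arbitrary fixed seed; both are illustrative choices) prints the relative frequency of heads for increasing values of n; the values settle near 0.5.

```python
# Simulating Example 4: the relative frequency r_n(H) = f_n(H)/n of heads
# in n tosses of a fair coin, for growing n.
import random

random.seed(1)                                # fixed seed, chosen arbitrarily
for n in (10, 100, 1000, 10000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)                       # r_n(H) approaches 0.5
```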
Check Your Progress- 1

Problem 1: In each of the following exercises, an experiment is described. Specify the relevant sample
spaces:

a) A machine manufactures a certain item. An item produced by the machine is tested to


determine whether or not it is defective.

b) An urn contains six balls, which are colored differently. A ball is drawn from the urn and its
color is noted.

c) An urn contains ten cards numbered 1 through 10. A card is drawn, its number noted and the
card is replaced. Another card is drawn and its number is noted.

Problem 2. Suppose a six-faced die is thrown twice. Describe each of the following events:

i) The maximum score is 6.


ii) The total score is 9.
iii) Each throw results in an even score.
iv) Each throw results in an even score larger than 2.
v) The scores on the two throws differ by at least 2.
7.3.1 Conditional Probability and Independent Events
Let ζ be the sample space corresponding to an experiment, and let E and F be two events of ζ. Suppose
the experiment is performed and the outcome is known only partially, to the effect that the event F has
taken place. There still remains scope for speculation about the occurrence of the other event E. Keeping
this additional piece of information confirming the occurrence of F in view, it would be appropriate to
modify the probability of occurrence of E suitably. That such modifications are necessary can be
readily appreciated through two simple instances, as follows:

Example 5: Suppose E and F are such that F ⊆ E, so that the occurrence of F would automatically imply
the occurrence of E. Thus, with the information that the event F has taken place, it is plausible to
assign probability 1 to the occurrence of E, irrespective of its original probability.

Example 6: Suppose E and F are two mutually exclusive events, so that they cannot occur together.
Whenever we come to know that the event F has taken place, we can rule out the occurrence of E.
Therefore, in such a situation, it is appropriate to assign probability 0 to the occurrence of E.

Example 7: Suppose a pair of balanced dice A and B are rolled simultaneously, so that each of the 36
possible outcomes is equally likely to occur and hence has probability 1/36. Let E be the event that the
sum of the two scores is 10 or more, and F be the event that exactly one of the two scores is 5.

Then E = {(4,6), (5,5), (5,6), (6,4), (6,5), (6,6)}, so that P(E) = 6/36 = 1/6.

Also, F = {(1,5), (2,5), (3,5), (4,5), (6,5), (5,1), (5,2), (5,3), (5,4), (5,6)}.

Now suppose we are told that the event F has taken place (note that this is only partial information
relating to the outcome of the experiment). Since each of the outcomes originally had the same
probability of occurring, they should still have equal probabilities. Thus, given that exactly one of the
two scores is 5, each of the 10 outcomes in event F has probability 1/10, while the probability of the
remaining 26 points in the sample space is 0.

In the light of the information that the event F has taken place, the sample points (4,6), (6,4), (5,5) and
(6,6) in the event E cannot have materialized; for E to occur, one of the two sample points (5,6) or (6,5)
must have materialized. Therefore, the probability of E would no longer be 1/6. Since all the 10 sample
points in F are equally likely, the revised probability of E given the occurrence of F, which occurs
through the materialization of one of the two sample points (5,6) or (6,5), should be 2/10 = 1/5.

The probability just obtained is called the conditional probability that E occurs given that F has occurred
and is denoted by P(E|F). We shall now derive a general formula for calculating P(E|F).

Consider the following probability table:

Table 2

Events    E    Ec
F         p    q
Fc        r    s

In Table 2, P(E ∩ F) = p, P(Ec ∩ F) = q, P(E ∩ Fc) = r and P(Ec ∩ Fc) = s. Hence,
P(E) = P((E ∩ F) ∪ (E ∩ Fc)) = P(E ∩ F) + P(E ∩ Fc) = p + r and, similarly, P(F) = p + q.
Now suppose that the underlying random experiment is repeated a large number of times, say N
times. Taking a cue from the long-term relative frequency interpretation of probability, the
approximate number of times the event F is expected to take place is NP(F) = N(p+q). Under the
condition that the event F has taken place, the number of times the event E is expected to take place
is NP(E ∩ F), as both E and F must occur simultaneously. Thus, the long-term relative frequency
of E under the condition of the occurrence of F, i.e. the probability of occurrence of E given the
occurrence of F, should be NP(E ∩ F)/NP(F) = P(E ∩ F)/P(F). This is the proportion of times E occurs
out of the repetitions in which F takes place. With the above background, we are now ready to
formally define the conditional probability of an event given another.

Definition: Let E and F be two events from a sample space ζ. The conditional probability of the event E
given the event F, denoted by P(E|F), is defined as P(E|F) = P(E ∩ F)/P(F), whenever P(F) > 0.

When P(F) = 0, we say that P(E|F) is undefined. From this definition, we can also write P(E ∩ F) =
P(E|F)P(F).

Referring back to Example 7, we see that P(E) = 6/36 and P(F) = 10/36; since E ∩ F = {(5,6), (6,5)},
P(E ∩ F) = 2/36, so P(E|F) = (2/36)/(10/36) = 2/10 = 1/5, which is the same as that obtained in
Example 7. The result P(E ∩ F) = P(E|F)P(F) can be generalized to k events E1, E2, …, Ek, where k > 2.
And now an exercise for you.
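The computation above can be checked by brute-force enumeration. The following Python sketch lists the 36 equally likely outcomes of Example 7 and evaluates P(E|F) = P(E ∩ F)/P(F) directly; since all outcomes are equally likely, a ratio of counts suffices.

```python
# Verifying Example 7 by enumeration over the 36 outcomes of two dice throws.
from itertools import product

space = list(product(range(1, 7), repeat=2))
E = {s for s in space if s[0] + s[1] >= 10}           # sum of scores is 10 or more
F = {s for s in space if (s[0] == 5) != (s[1] == 5)}  # exactly one score is 5

print(len(F), len(E & F))        # 10 outcomes in F, 2 of them also in E
print(len(E & F) / len(F))       # P(E|F) = 2/10 = 0.2
```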

Check Your Progress 2

Problem 1: In a class, three students tossed one coin each, 3 times. Write down all the possible
outcomes which can be obtained in this experiment.

Problem 2: In Problem 1, what is the probability of getting more than 2 heads at a time? Also write the
probability of getting three tails at a time.

Problem 3: In Problem 1, calculate the relative frequency of tails, rn(T).

7.4 INTRODUCTION TO BAYESIAN THEORY


Bayes’ theorem is widely used to calculate the conditional probabilities of events when a joint
probability is not directly available. It is also used to calculate conditional probabilities where intuition
fails. In simple terms, the probability of a given hypothesis H conditional on evidence E can be defined
as PE(H) = P(H and E)/P(E), where P(E) > 0 and the term P(H and E) exists. Here PE is referred to as a
probability function. To understand Bayes’ theorem simply, have a look at the following definitions.

Joint Probability: This refers to the probability of two or more events occurring simultaneously, e.g.
P(A and B) or P(A, B).

Marginal Probability: It is the probability of an event occurring irrespective of the outcomes of the other
random variables, e.g. P(A).

Conditional Probability: A conditional probability is defined as the probability of occurrence of an event
given that another event has occurred, e.g. P(A | B).

The conditional probability can also be written in terms of the joint probability as P(A|B) = P(A, B)/P(B).
In the other direction, if one conditional probability is given, the other can be calculated as
P(A|B) = P(B|A)P(A)/P(B). Let S be the sample space under consideration, and let A1, A2, …, An be a
set of mutually exclusive events in S. Let B be an event from the sample space S with P(B) > 0. Then,
according to Bayes’ theorem,

P(Ak | B) = P(Ak ∩ B) / [P(A1 ∩ B) + P(A2 ∩ B) + … + P(An ∩ B)],

which, in terms of conditional probabilities, can also be written as

P(Ak | B) = P(Ak)P(B|Ak) / [P(A1)P(B|A1) + P(A2)P(B|A2) + … + P(An)P(B|An)].
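As a worked illustration of the second form of the theorem with two mutually exclusive hypotheses H and ¬H, the following Python sketch uses hypothetical numbers (a 1% prior, P(E|H) = 0.99 and P(E|¬H) = 0.05); only the arithmetic of Bayes' theorem is being demonstrated.

```python
# Bayes' theorem with two hypotheses:
# P(H|E) = P(E|H)P(H) / [P(E|H)P(H) + P(E|~H)P(~H)]
p_h      = 0.01                     # prior P(H); hypothetical value
p_e_h    = 0.99                     # P(E|H); hypothetical value
p_e_noth = 0.05                     # P(E|~H); hypothetical value

p_e   = p_e_h * p_h + p_e_noth * (1 - p_h)   # denominator: total probability of E
p_h_e = p_e_h * p_h / p_e                    # posterior P(H|E)
print(round(p_h_e, 3))              # 0.167: strong evidence, but the low prior dominates
```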

7.5 BAYES’ NETWORKS


Probabilistic models are used to define the relationships among variables and to calculate
probabilities. A Bayes’ network is a simpler form of applying Bayes’ theorem to complex real-world
problems. It uses a probabilistic graphical model which captures conditional dependence explicitly,
represented using directed edges in a graph. If we take fully conditional models, we may need a large
amount of data to address all possible events/cases, and in such a scenario the probabilities may not be
practically computable. On the other hand, simple assumptions such as the conditional independence of
random variables may turn out to be effective, giving way to the Bayes’ network.

When a Bayes’ network is represented graphically, the nodes represent the probability distributions of
random variables. The edges in the graph represent the relationships among the random variables. The
key benefits of a Bayes’ network are model visualization, explicit relationships among random variables
and the computation of complex probabilities.

Example 8: Let us now create a Bayesian network for an example problem. Consider three random
variables A, B and C. It is given that A is dependent on B, and C is dependent on B. The conditional
dependences can be stated as P(A|B) and P(C|B) for the two given statements respectively. Similarly, A
is conditionally independent of C given B, and C is conditionally independent of A given B.

Here, we can write P(A|C, B) = P(A|B), as A is unaffected by C once B is known. Also, the joint
probability of A and C given B can be written as the product of conditional probabilities:
P(A, C|B) = P(A|B) * P(C|B).

Now, using Bayes’ theorem, the joint probability P(A, B, C) can be written as P(A,B,C) =
P(A|B)*P(C|B)*P(B).

The corresponding graph has a node for each random variable, with directed edges from B to A and
from B to C representing the conditional dependences.
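The factorization P(A,B,C) = P(A|B)P(C|B)P(B) of Example 8 can be sketched directly in Python; the conditional probability tables below are hypothetical numbers chosen only for illustration.

```python
# The B -> A, B -> C network of Example 8 with hypothetical tables.
p_b = {True: 0.3, False: 0.7}                 # P(B)
p_a_given_b = {True: 0.9, False: 0.2}         # P(A=True | B)
p_c_given_b = {True: 0.6, False: 0.1}         # P(C=True | B)

def joint(a, b, c):
    # P(A, B, C) = P(A|B) * P(C|B) * P(B)
    pa = p_a_given_b[b] if a else 1 - p_a_given_b[b]
    pc = p_c_given_b[b] if c else 1 - p_c_given_b[b]
    return pa * pc * p_b[b]

# Sanity check: the joint distribution sums to 1 over all eight assignments.
print(sum(joint(a, b, c) for a in (True, False)
                          for b in (True, False)
                          for c in (True, False)))
```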

7.6 PROBABILISTIC INFERENCE


Probabilistic inference depends very much on the conditional probability of specified events, given that
information about the occurrence of other events is available. For example, for two events E and F such
that P(F) > 0, the conditional probability of the event E when F has occurred can be written as:

P(E|F) = P(E ∩ F) / P(F)

When an experiment is repeated a large number of times (say n), the above expression can be given a
frequency interpretation. Let the number of occurrences of the event F be denoted No.(F), and the
number of joint occurrences of E and F be denoted No.(E ∩ F). The relative frequencies of these events
can be computed as:

fn(E ∩ F) = No.(E ∩ F) / n and, similarly, fn(F) = No.(F) / n

If n is large, the ratio of the above two expressions represents the proportion of times the event E occurs
relative to the occurrence of F; this approximates the conditional occurrence of E given F:

fn(E ∩ F) / fn(F) ≃ P(E ∩ F) / P(F)

We can also write the conditional probability of the event F, given that the event E has already
occurred, as

P(F|E) = P(E ∩ F) / P(E)

Using the above two equations, we can also write

P(F|E) = P(E|F) P(F) / P(E)

The above expression is one form of Bayes’ rule. The notion is simple: the probability of the event F
given that E has occurred can be obtained from the probability of E given F, together with the
individual probabilities P(F) and P(E).

7.7 BASIC IDEA OF INFERENCING WITH BAYES’ NETWORKS
We are now aware of Bayes’ theorem, probability and Bayes’ networks. Let us now discuss how
inferences can be made using Bayes’ networks. A network here represents the degrees of belief of
propositions and their causal interdependence. Inference in a network is done by propagating the
given probabilities of related information through the network, giving the output at one of the conclusion
nodes. The network representation also reduces the time and space requirements of the huge
computations involving probabilities over uncertain knowledge of propositional variables; without it,
one cannot make inferences from such large data in real time. Here the network of nodes represents
variables connected by edges which represent the causal influences (dependencies) among nodes, and
the edge weights can be used to represent the strengths of the influences, in other words the conditional
probabilities.
To use this type of probabilistic inference model, one first needs to assign probabilities to all the basic
facts in the underlying knowledge base. This requires the definition of an appropriate sample space and
the assignment of a priori and conditional probabilities. In addition, some method must be selected to
compute the combined probabilities when pooling evidence in a sequence of inference steps. In the end,
when the outcome of an inference chain results in one or more proposed conclusions, the alternatives
must be compared and one chosen on the basis of likelihood.
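As a minimal illustration of these steps, the sketch below performs inference by enumeration on the small network of Example 8, reusing the same hypothetical conditional probability tables: the posterior P(B = True | A = True) is the normalized sum of the joint probabilities of all assignments consistent with the evidence. Practical systems replace this brute-force enumeration with efficient propagation algorithms.

```python
# Inference by enumeration on the B -> A, B -> C network of Example 8
# (hypothetical tables, as in the earlier sketch).
p_b = {True: 0.3, False: 0.7}                 # P(B)
p_a_given_b = {True: 0.9, False: 0.2}         # P(A=True | B)
p_c_given_b = {True: 0.6, False: 0.1}         # P(C=True | B)

def joint(a, b, c):
    pa = p_a_given_b[b] if a else 1 - p_a_given_b[b]
    pc = p_c_given_b[b] if c else 1 - p_c_given_b[b]
    return pa * pc * p_b[b]

# P(B=T | A=T) = sum over C of P(A=T,B=T,C) / sum over B,C of P(A=T,B,C)
num = sum(joint(True, True, c) for c in (True, False))
den = sum(joint(True, b, c) for b in (True, False) for c in (True, False))
print(num / den)                              # ~0.659
```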

7.8 OTHER PARADIGMS OF UNCERTAIN REASONING

Other ways of dealing with uncertainty are those with no theoretical proof; they are mostly based
on intuition. They are selected over formal methods as pragmatic solutions to particular problems,
when the formal methods impose difficult or impossible conditions. One such ad hoc procedure is used
to diagnose meningitis and infectious blood diseases in the system called MYCIN. MYCIN uses if-then
rules to assess various forms of patient evidence, and it measures both belief and disbelief, representing
the degrees of confirmation and disconfirmation respectively of a given hypothesis. Ad hoc methods
have been used in a larger number of knowledge-based systems than formal methods. This is due to the
difficulties encountered in acquiring a large number of reliable probabilities for the given domain, and
to the complexity of the ensuing calculations.

Another paradigm is to use heuristic reasoning methods. These are based on the use of procedures,
rules and other forms of encoded knowledge to achieve specified goals under uncertainty. Using both
domain-specific and general heuristics, one of several alternative conclusions may be chosen through the
strength of positive versus negative evidence, presented in the form of justifications or endorsements.

An in-depth and detailed discussion of these methods is beyond the scope of this unit.

7.9 DEMPSTER-SHAFER THEORY


Let us now discuss a mathematical theory based only on evidence, known as the Dempster-Shafer (D-S)
theory, given by Dempster and extended by Shafer in “A Mathematical Theory of Evidence”. It uses a
belief function to combine separate and independent pieces of evidence to quantify the belief in a
statement. The D-S theory is a generalization of Bayesian probability theory in which probabilities are
assigned to multiple possible events (sets of outcomes), as opposed to mutually exclusive singletons.
The D-S theory assumes the existence of ignorance in knowledge, creating uncertainty, which in turn
induces belief; the uncertainty of a hypothesis is represented by the belief function. The main
characteristics of the theory are:

1. Multiple possible events are permitted to be assigned probabilities.

2. These events should be exhaustive and exclusive.

Here, the information from multiple sources is assigned degrees of belief and then aggregated using
the D-S combination rule. The theory can become computationally intensive, because of the lack of
independence assumptions over a large number of information sources, and this limits its use.

Let us now define a few terms used in D-S theory which will be useful for us.

7.9.1 Evidence
These are events related to one hypothesis or a set of hypotheses. Here, a relation is not permitted between
the various pieces of evidence or sets of hypotheses; the relation between a set of hypotheses and a
piece of evidence is quantified only by a source of data. In the context of D-S theory, we have four types
of evidence, as follows:
a) Consonant Evidence: The evidentiary sets appear in a nested structure, where each subset is
included in the next bigger subset, and so on. With each increase in subset size, the
information refines the evidentiary set over time.
b) Consistent Evidence: This assures the presence of at least one element common to all the subsets.
c) Arbitrary Evidence: A situation where there is no element common to all the subsets,
though some of the subsets may have a few common element(s).
d) Disjoint Evidence: No two subsets have common elements.
All four evidence types are illustrated in Figure 2(a-d).

[Figure 2(a-d): the four types of evidence — consonant, consistent, arbitrary and disjoint]

The source of information can be an entity or person giving some relevant state information. Here the
information source is assumed to be an unbiased source of information. The information received from
such sources is combined to provide more reliable information for further use. D-S theory models are
able to handle varying precision in the information, and hence no additional assumptions are needed to
represent the information.

7.9.2 Frame of Discernment

Let us consider a random variable whose true value is not known. Let θ = {θ1, θ2, …, θn}
represent the mutually exclusive, discretized values of its possible outcomes.
Conventionally, the uncertainty is expressed by assigning a probability pi to each element
θi, i = 1, …, n, satisfying Σ pi = 1. In the case of D-S theory, probabilities (masses) are assigned to the
subsets of θ along with the individual elements θi.

7.9.3 The Power Set P(θ) = 2^θ

This is defined as the set of all subsets of θ, including the singletons; θ itself defines the frame of
discernment. A subset of this power set may contain a single hypothesis or a conjunction of hypotheses.
Here, with respect to the power set, the complete probability assignment is called the basic probability
assignment.
The core functions in D-S theory are:

1. Basic Probability Assignment Function

This is represented by m and maps the power set to the interval [0, 1]. The basic probability
assignment (bpa) of the null set is 0, and the bpa values over all subsets of the power set sum to 1.
For a given set A, m(A) represents the measure of belief assigned by the available evidence in
support of A, where A ∈ 2^θ. Mathematically, the bpa satisfies:

1. m : 2^θ → [0, 1]
2. m(ϕ) = 0
3. m(A) ≥ 0, ∀ A ∈ 2^θ
4. Σ {m(A) : A ∈ 2^θ} = 1

Note that an element of the power set with m(A) > 0 is termed a focal element.

Example 9: Let θ = {a, b, c}; then the power set is P(θ) = {ϕ, (a), (b), (c), (a,b), (a,c), (b,c),
(a,b,c)}. Suppose an information source assigns the m-values m(a) = 0.2, m(c) = 0.1 and m(a,b) = 0.4;
the three mentioned subsets are then the focal elements. (For the masses to sum to 1, the remaining
0.3 would typically be assigned to θ itself, representing ignorance.)

2. The Belief Function

From the basic probability assignment we can define the lower and upper bounds of an interval
containing the precise probability of a set. The interval is bounded by two continuous, nonadditive
measures known as Belief and Plausibility.
The lower bound (belief) for a set A is defined as the sum of the basic probability assignments of
all subsets B of A. The belief function measures the amount of support given by the information source
to a specific element as the correct one; mathematically, Bel(A) = Σ {m(B) : B ⊆ A}, ∀ A ⊆ θ.

3. The Plausibility Function

The upper bound (plausibility) for a set A is defined as the sum of the basic probability
assignments of all sets B intersecting A; mathematically, Pl(A) = Σ {m(B) : B ∩ A ≠ ϕ}. The
plausibility function measures the extent to which the information from a source fails to contradict A as
the correct answer.
Apart from the above-mentioned functions, one more term requires attention when referring to
D-S theory: the Uncertainty Interval [Bel(A), Pl(A)] shows the range in which the true probability may
be found; its width is calculated as the difference Pl(A) − Bel(A).

7.9.4 Rule of Combination

In D-S theory, the measures of Plausibility and Belief are taken from the combined assignments.
The D-S rule of combination takes multiple belief functions and combines them via their respective
basic probability assignments m. The D-S combination rule is basically a conjunctive (AND) operation.
The joint assignment m12 is obtained by aggregating two basic probability assignments m1 and m2 as
follows:

m12(A) = (1 / (1 − K)) Σ {m1(B) m2(C) : B ∩ C = A}, for A ≠ ϕ,

m12(ϕ) = 0,

where K = Σ {m1(B) m2(C) : B ∩ C = ϕ}.

In the above expressions, K is the basic probability mass associated with conflict, calculated as the
sum of the products of the basic probability assignments of all pairs of sets having a null intersection.
The normalization factor 1 − K appears in the denominator. The rule is associative and commutative,
but neither continuous nor idempotent in nature.
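A minimal Python sketch of this combination rule is given below, using frozensets as focal elements; the masses are those of Example 10 that follows (note that in that example the subset (A, B) and the frame θ denote the same set {A, B}, so the sketch merges their masses, 0.3 + 0.4 = 0.7). The helper functions bel and pl implement the belief and plausibility definitions of the previous subsections.

```python
# Dempster's rule of combination over frozenset focal elements.
from itertools import product

def combine(m1, m2):
    raw, K = {}, 0.0
    for (B, x), (C, y) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            raw[A] = raw.get(A, 0.0) + x * y
        else:
            K += x * y                      # conflicting (empty) intersections
    return {A: v / (1 - K) for A, v in raw.items()}

def bel(m, A):                              # Bel(A) = sum of m(B) for B ⊆ A
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):                               # Pl(A) = sum of m(B) for B ∩ A ≠ ∅
    return sum(v for B, v in m.items() if B & A)

m1 = {frozenset("A"): 0.4, frozenset("B"): 0.2, frozenset("AB"): 0.4}
m2 = {frozenset("A"): 0.3, frozenset("AB"): 0.7}   # (A, B) and θ merged

m12 = combine(m1, m2)
print(round(bel(m12, frozenset("A")), 3))   # 0.553
print(round(pl(m12, frozenset("A")), 3))    # 0.851
```

The printed values match the combined belief and plausibility of grade A worked out by hand in Example 10 below.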
Example 10: Two teachers independently assessed the expected grades of a class of 100 students. The
first teacher assessed that 40 students will get grade A and 20 students will get grade B, committing on
60 of the 100 students. The second teacher stated that 30 students will get grade A and 30 students will
get either grade A or grade B, again committing on 60 students. To combine both pieces of evidence
into a resultant body of evidence, we perform the following calculations. Here the frame of discernment
is θ = {A, B} and the power set is 2^θ = {∅, A, B, (A, B)}. The two bodies of evidence are:

Evidence 1 (m1)      Evidence 2 (m2)

m1(A) = 0.4          m2(A) = 0.3
m1(B) = 0.2          m2(A, B) = 0.3
m1(θ) = 0.4          m2(θ) = 0.4

Plausibility function (Pl):

A ∩ A = A ≠ ∅, hence m1(A) = 0.4

A ∩ B = ∅

A ∩ θ = A ≠ ∅, hence m1(θ) = 0.4

Pl1(A) = m1(A) + m1(θ) = 0.4 + 0.4 = 0.8

A ∩ A = A ≠ ∅, hence m2(A) = 0.3

A ∩ B = ∅

A ∩ (A, B) = A ≠ ∅, hence m2(A, B) = 0.3

A ∩ θ = A ≠ ∅, hence m2(θ) = 0.4

Pl2(A) = m2(A) + m2(A, B) + m2(θ) = 0.3 + 0.3 + 0.4 = 1.0

B ∩ A = ∅

B ∩ B = B ≠ ∅, hence m1(B) = 0.2

B ∩ θ = B ≠ ∅, hence m1(θ) = 0.4

Pl1(B) = m1(B) + m1(θ) = 0.2 + 0.4 = 0.6


(A, B) ∩ A = A ≠ ∅, hence m2(A) = 0.3

(A, B) ∩ B = B ≠ ∅, but m2(B) = 0

(A, B) ∩ (A, B) = (A, B) ≠ ∅, hence m2(A, B) = 0.3

(A, B) ∩ θ = (A, B) ≠ ∅, hence m2(θ) = 0.4

Pl2(A, B) = m2(A) + m2(A, B) + m2(θ) = 0.3 + 0.3 + 0.4 = 1.0

θ ∩ A = A ≠ ∅, hence m1(A) = 0.4

θ ∩ B = B ≠ ∅, hence m1(B) = 0.2

θ ∩ θ = θ ≠ ∅, hence m1(θ) = 0.4

Pl1(θ) = m1(A) + m1(B) + m1(θ) = 0.4 + 0.2 + 0.4 = 1.0

θ ∩ A = A ≠ ∅, hence m2(A) = 0.3

θ ∩ (A, B) = (A, B) ≠ ∅, hence m2(A, B) = 0.3

θ ∩ θ = θ ≠ ∅, hence m2(θ) = 0.4

Pl2(θ) = m2(A) + m2(A, B) + m2(θ) = 0.3 + 0.3 + 0.4 = 1.0

D-S Rule of Combination: Table 3 shows the combination of the concordant evidences using the D-S rule.

Evidences        m1(A) = 0.4       m1(B) = 0.2       m1(θ) = 0.4
m2(A) = 0.3      m1-2(A) 0.12      m1-2(∅) 0.06      m1-2(A) 0.12
m2(A,B) = 0.3    m1-2(A) 0.12      m1-2(B) 0.06      m1-2(A,B) 0.12
m2(θ) = 0.4      m1-2(A) 0.16      m1-2(B) 0.08      m1-2(θ) 0.16

K = 0.06 and 1 − K = 0.94. The combined masses are worked out as m1-2(A) = 0.52/0.94 = 0.553,
m1-2(B) = 0.14/0.94 = 0.149, m1-2(A, B) = 0.12/0.94 = 0.128 and m1-2(θ) = 0.16/0.94 = 0.170. Hence:

Bel1-2(A) = m1-2(A) = 0.553

Bel1-2(B) = m1-2(B) = 0.149

Bel1-2(A, B) = m1-2(A) + m1-2(B) + m1-2(A, B) = 0.553 + 0.149 + 0.128 = 0.83

Bel1-2(θ) = m1-2(A) + m1-2(B) + m1-2(A, B) + m1-2(θ) = 0.553 + 0.149 + 0.128 + 0.170 = 1.0

Pl1-2(A) = m1-2(A) + m1-2(A, B) + m1-2(θ) = 0.553 + 0.128 + 0.170 = 0.851
(at most 85 students in grade A)

Pl1-2(B) = m1-2(B) + m1-2(A, B) + m1-2(θ) = 0.149 + 0.128 + 0.170 = 0.447
(at most 45 students in grade B)

Pl1-2(A, B) = m1-2(A) + m1-2(B) + m1-2(A, B) + m1-2(θ) = 0.553 + 0.149 + 0.128 + 0.170 = 1.0

Pl1-2(θ) = m1-2(A) + m1-2(B) + m1-2(A, B) + m1-2(θ) = 0.553 + 0.149 + 0.128 + 0.170 = 1.0
(100 students in total)

According to the rule of combination, the concluded ranges are that 55 to 85 students will get
grade A and 15 to 45 students will get grade B.

Key advantages of D-S theory:

● The level of uncertainty reduces with the addition of information.
● The addition of more evidence reduces ignorance.
● Diagnostic hierarchies can be represented using D-S theory.

Check Your Progress 3

Problem 1: Differentiate between joint, marginal and conditional probability, with an example of each.

Problem 2: Explain the Dempster-Shafer theory with a suitable example.

Problem 3: What are the different types of evidence? Give a suitable example of each.

7.10 SUMMARY
This unit discussed reasoning with uncertain information, which involved a review of probability
theory and an introduction to Bayesian theory. The unit also covered the concept of Bayes’ networks,
which were later used for the purpose of inferencing. Finally, the unit discussed other paradigms of
uncertain reasoning, including the Dempster-Shafer theory.
7.11 SOLUTIONS/ANSWERS
Check Your Progress- 1
Problem -1. In each of the following exercises, an experiment is described. Specify the relevant sample
spaces:
a) A machine manufactures a certain item. An item produced by the machine is tested to
determine whether or not it is defective.
b) An urn contains six balls, which are colored differently. A ball is drawn from the urn and its
color is noted.
c) An urn contains ten cards numbered 1 through 10. A card is drawn, its number noted and the
card is replaced. Another card is drawn and its number is noted.
Solution - *Please refer to section 7.3 to answer these problems.
Problem 2. Suppose a six-faced die is thrown twice. Describe each of the following events:
i) The maximum score is 6.
ii) The total score is 9.
iii) Each throw results in an even score.
iv) Each throw results in an even score larger than 2.
v) The scores on the two throws differ by at least 2.
Solution - *Please refer to section 7.3 to answer these problems.
Check Your Progress 2
Problem 1: In a class, three students tossed one coin each, 3 times. Write down all the possible
outcomes which can be obtained in this experiment.
Solution - *Please refer to Example 4 and Section 7.3 to solve these problems.

Problem 2: In Problem 1, what is the probability of getting more than 2 heads at a time? Also write the
probability of getting three tails at a time.
Solution - *Please refer to Example 4 and Section 7.3 to solve these problems.

Problem 3: In Problem 1, calculate the relative frequency of tails, rn(T).

Solution - *Please refer to Example 4 and Section 7.3 to solve these problems.

Check Your Progress 3


Problem 1: Differentiate between joint, marginal and conditional probability, with an example of each.
Solution - *Please refer to Section 7.4 to answer this problem.

Problem 2: Explain the Dempster-Shafer theory with a suitable example.

Solution - *Please refer to Section 7.9 and Example 10 to answer this problem.

Problem 3: What are the different types of evidence? Give a suitable example of each.
Solution - *Please refer to Section 7.9 and Example 10 to answer this problem.

7.12 FURTHER READINGS


1. David Barber, “Bayesian Reasoning and Machine Learning”, Cambridge University Press
2. John J. Craig, “Introduction to Robotics”, Addison Wesley publication
3. Ela Kumar, “ Artificial Intelligence”, IK International Publications
4. Ela Kumar, “ Knowledge Engineering ”, IK International Publications
UNIT 8 FUZZY AND ROUGH SETS

Structure

8.0 Introduction
8.1 Objectives
8.2 Fuzzy Systems
8.3 Introduction to Fuzzy Sets
8.4 Fuzzy Set Representation
8.5 Fuzzy Reasoning
8.6 Fuzzy Inference
8.7 Rough Set Theory
8.8 Summary
8.9 Solutions/Answers
8.10 Further Readings

8.0 INTRODUCTION
In the earlier units, we discussed PL and FOPL systems for making inferences and
solving problems requiring logical reasoning. However, these systems assume that the
domain of the problems under consideration is complete, precise and consistent. But,
in the real world, the knowledge of the problem domains is generally neither precise
nor consistent and is hardly complete.

In this unit, we discuss a number of techniques and formal systems that attempt to
handle some of these blemishes. To begin with we discuss the fuzzy systems that
attempt to handle imprecision in knowledge bases, specially, due to use of natural
language words like hot, good, tall etc.

Then, we discuss non-monotonic systems, which deal with the indefiniteness of knowledge in
knowledge bases. The significance of these systems lies in the fact that most of the statements in
knowledge bases are actually based on the beliefs of the concerned persons or actors. These beliefs get
revised as better evidence for some other beliefs becomes available, where the later beliefs may be in
conflict with the earlier ones. In such cases, the earlier beliefs may have to be temporarily suspended
or permanently excluded from further consideration.

Subsequently, we will discuss two formal systems that attempt to handle the incompleteness of the
available information. These systems are called Default Reasoning Systems and Closed World
Assumption Systems. Finally, we discuss some inference rules, viz., the abductive and inductive
inference rules, which, though not deductive, are quite useful in solving problems arising out of
everyday experience.

8.1 OBJECTIVES
After going through this unit, you should be able to:
 enumerate various formal methods, which deal with different types of blemishes
like incompleteness, imprecision and inconsistency in a knowledge base;
 discuss, why fuzzy systems are required;
 discuss, develop and use fuzzy arithmetic tools in solving problems, the
descriptions of which involve imprecision;
 discuss default reasoning as a tool for handling incompleteness of knowledge;
 discuss Closed World Assumption System, as another tool for handling
incompleteness of knowledge, and
 discuss and use non-deductive inference rules like abduction and induction, as
tools for solving problems from everyday experience.
8.2 FUZZY SYSTEMS

In symbolic logic systems like PL and FOPL, which we have studied so far, any (closed) formula has a
truth-value which must be binary, viz., True or False. However, in our everyday experience, we
encounter problems whose descriptions involve words because of which it is not possible to assign a
truth value, True or False, to statements of situations. For example, consider the statement:

If the water is too hot, add normal water to make it comfortable for taking a bath.

In the above statement, for a number of words/phrases, including ‘too hot’, ‘add’ and ‘comfortable’, it is
not possible to tell when exactly the water is too hot, when the water is at a normal temperature, or
when exactly the water is comfortable for taking a bath.

For example, we cannot fix a temperature T such that the truth value False is associated with the
statement ‘Water is too hot’ for water at temperature T or less, while the truth value True is associated
with the same statement when the temperature of the water is, say, T + 1, T + 2, etc.

Some other cases of Fuzziness in a Natural Language

Healthy Person: we cannot even enumerate all the parameters that determine health.
Further, it is even more difficult to tell for what value of a particular parameter, one is
healthy or otherwise.

Old/young person: It is not possible to tell exactly up to what age one is young such that, by the
addition of just one day to the age, one becomes old. We age gradually; aging is a continuous process.

Sweet Milk: Add small sugar cube one at a time to glass of milk, and go on adding
upto, say, 100 small cubes.

Initially, without sugar, we may take the milk as not sweet. However, with the addition of each small
sugar cube, the sweetness gradually increases. It is not possible to say that after the addition of 100
small cubes of sugar the milk becomes sweet, while until the addition of the 99th cube it was not sweet.

Pool, Pond, Lake, …, Sea, Ocean: for water bodies of different sizes, we cannot say when exactly a pool
becomes a pond, when exactly a pond becomes a lake, and so on.

One of the reasons for this type of problem, of our inability to associate one of the two truth values with
statements describing everyday situations, is the use of natural language words like hot, good, beautiful,
etc. Each of these words does not denote something constant, but is a sort of linguistic variable. The
context of a particular usage of such a word may delimit the scope of the word as a linguistic variable.
The range of values, in some cases, for some phrases or words, may be very large, as can be seen
through the following three statements:

 Dinosaurs ruled the earth for a long period (about millions of years)
 It has not rained for a long period (say about six months).
 I had to wait for the doctor for a long period (about six hours).

Fuzzy theory provides means to handle such situations. A Fuzzy theory may be
thought as a technique of providing ‘continuization’ to the otherwise binary
disciplines like Set Theory, PL and FOPL.

Further, we explain how using fuzzy concepts and rules, in situation like the ones
quoted below, we, the human beings solve problems, despite ambiguity in language.
Let us recall the case of crossing a road discussed in Unit 1 of Block 1. We
mentioned that a step by step method of crossing a road may consist of

(i) Knowing (exactly) the distances of various vehicles from the path to be
followed to cross over.

(ii) Knowing the velocities and accelerations of the various vehicles moving on the
road within a distance of, say, one kilometer.
 2
(iii) Using Newton’s Laws of motion and their derivatives like s = ut + at , and

calculating the time that would be taken by each of the various vehicles to reach
the path intended to be followed to cross over.

(iv) Adjusting dynamically our speeds on the path so that no collision takes place
with any of the vehicle moving on the road.

But, we know the human beings not only do not follow the above precise method but
cannot follow the above precise method. We, the human beings rather feel
comfortable with fuzziness than precision. We feel comfortable, if the instruction
for crossing a road is given as follows:

Look on both your left hand and right hand sides, particularly in the beginning, to
your right hand side. If there is no vehicle within reasonable distance, then attempt to
cross the road. You may have to retreat back while crossing, from somewhere on the
road. Then, try again.

The above instruction has a number of words like left, right (it may be 45° to the right or 90° to the
right) and reasonable, each of which does not have a definite meaning. But we feel more comfortable
with it than with the earlier instruction involving precise terms.

Let us consider another example of our being comfortable with imprecision than
precision. The statement: ‘The sky is densely clouded’ is more comprehensible to
human beings than the statement: ‘The cloud cover of the sky is 93.5 %’.

This is because we, the human beings, are still better than computers
in qualitative reasoning. Because of our better qualitative reasoning capabilities:

 just by looking at the eyes only and/or nose only, we may recognize a person.
 just by taking and feeling a small number of grains from cooking rice bowl, we
can tell whether the rice is properly cooked or not.
 just by looking at few buildings, we can identify a locality or a city.

Achieving Human Capability

In order that computers achieve human capability in solving such problems,


computers must be able to solve problems for which only incomplete and/or
imprecise information/knowledge is available.

Modelling of Solutions and Data/Information/Knowledge

We know that for any problem, the plan of the proposed solution and the relevant
information is fed in the computer in a form acceptable to the computer.

However, the problems to be solved with the help of computers are, in the first place,
felt by the human beings. And then, the plan of the solution is also prepared by human
beings.
It is conveyed to the computer mainly for execution, because computers have much
better executional speed.

Summarizing the discussion, we conclude the following facts

(i) We, the human beings, sense problems, desire the problems to be solved and
express the problems and the plan of a solution using imprecise words of a natural
language.

(ii) We use computers to solve the problems, because of their executional power.

(iii) Computers function better, when the information is given to the computer in
terms of mathematical entities like numbers, sets, relations, functions, vectors,
matrices graphs, arrays, trees, records, etc., and when the steps of solution are
generally precise, involving no ambiguity.

In order to meet the mutually conflicting requirements:

(i) Imprecision of natural language, with which the human beings are comfortable,
where human beings feel a problem and plan its solution.

(ii) Precision of a formal system, with which computers operate efficiently, where
computers execute the solution, generally planned by human beings

a new formal system, viz. the Fuzzy system, based on the concept of ‘fuzzy’, was
suggested for the first time in 1965 by L. Zadeh.

In order to initiate the study of Fuzzy systems, we quote two statements to recall the
difference between a precise statement and an imprecise statement.

A precise Statement is of the form: ‘If income is more than 2.5 lakhs then tax is 10%
of the taxable income’.

An imprecise statement may be of the form: ‘If the forecast about the rain being
slightly less than previous year is believed, then there is around 30% probability that
economy may suffer heavily’.

The concept of 'Fuzzy', when applied as a prefix/adjective to mathematical
entities like set, relation, function, tree, etc., helps us in modelling imprecise
data, information or knowledge through mathematical tools.

Crisp Set/Relation vs. Fuzzy Set/Relation: In order to differentiate the sets
normally used so far from the fuzzy sets to be introduced soon, we may call the
sets used so far crisp sets.

Next, we explain how fuzzy sets are defined, using mathematical entities, to
capture imprecise concepts, through an example of the concept 'tall'.

In Indian context, we may say, a male adult, is

(i) definitely tall if his height > 6 feet,

(ii) not at all tall if his height < 5 feet,

(iii) a little bit tall if his height = 5' 2",

(iv) slightly tall if his height = 5' 6",

(v) reasonably tall if his height = 5' 9", etc.

The next step is to model 'definitely tall', 'not at all tall', 'a little bit tall', 'slightly tall',
'reasonably tall', etc. in terms of mathematical entities, e.g., numbers, sets, etc.
In modelling a vague concept like 'tall' through fuzzy sets, the numbers in the
closed interval [0, 1] of reals may be used on the following lines:

(i) ‘Definitely tall’ may be represented as ‘tallness having value 1’

(ii) ‘Not at all tall’ may be represented as ‘Tallness having value 0’

other adjectives/adverbs may have values between 0 and 1 as follows:

(iii) ‘A little bit tall’ may be represented as ‘tallness having value say .2’.

(iv) ‘Slightly tall’ may be represented as ‘tallness having value say .4’.

(v) ‘Reasonably tall’ may be represented as ‘tallness having value say .7’.

and so on.

Similarly, the values of other concepts or, rather, other linguistic variables like
sweet, good, beautiful, etc. may be considered in terms of real numbers between
0 and 1.

Coming back to the imprecise concept of tall, let us think of five male persons of an
organisation, viz., Mohan, Sohan, John, Abdul, Abrahm, with heights 5' 2”, 6' 4”,

5' 9”, 4' 8”, 5' 6” respectively.

Then, had we talked only of the crisp set of tall persons, we would have written:

Set of tall persons in the organisation = {Sohan}

But a fuzzy set representing tall persons includes all the persons along with their
respective degrees of tallness. Thus, in terms of fuzzy sets, we write:

Tall = {Mohan/.2; Sohan/1; John/.7; Abdul/0; Abrahm/.4}.


The values .2, 1, .7, 0, .4 are called membership values or degrees:

Note: Those elements which have value 0 may be dropped e.g.

Tall may also be written as Tall = {Mohan/.2; Sohan/1; John/.7; Abrahm/.4},


neglecting Abdul, with associated degree zero.

If we define Short (Diminutive) as the exact opposite of Tall, we may say

Short = {Mohan/.8; Sohan/0; John/.3; Abdul/1; Abrahm/.6}
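
As an aside, this degree-based representation maps naturally onto a dictionary data structure. The following is a minimal Python sketch (not part of the original text) using the Tall example above; the element names and degrees are taken from that example.

```python
# A fuzzy set can be modelled as a mapping: element -> membership degree in [0, 1].
tall = {"Mohan": 0.2, "Sohan": 1.0, "John": 0.7, "Abdul": 0.0, "Abrahm": 0.4}

# Short (Diminutive), defined as the exact opposite of Tall: degree = 1 - degree.
short = {name: round(1 - degree, 2) for name, degree in tall.items()}

print(short)  # {'Mohan': 0.8, 'Sohan': 0.0, 'John': 0.3, 'Abdul': 1.0, 'Abrahm': 0.6}
```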

8.3 INTRODUCTION TO FUZZY SETS


In the case of Crisp sets, we have the concepts of Equality of sets, Subset of a set, and
Member of a set, as illustrated by the following examples:

(i) Equality of two sets

Let A = {1, 4, 3, 5}

B = {4, 1, 3, 5}

C = {1, 4, 2, 5}

be three given sets.

Then, set A is equal to set B, denoted by A = B. But A is not equal to C, denoted by
A ≠ C.
(ii) Subset

Consider sets A = {1, 2, 3, 4, 5, 6, 7}

B = {4, 1, 3, 5}

C = {4, 8}

Then B is a subset of A, denoted by B ⊆ A. Also, C is not a subset of A, denoted by
C ⊄ A.

(iii) Belongs to/is a member of

If A = {1, 4, 3, 5}

Then each of 1, 4, 3 and 5 is called an element or member of A, and the fact that 1 is a
member of A is denoted by 1 ∈ A.

Corresponding Definitions/ concepts for Fuzzy Sets

In order to define for fuzzy sets, the concepts corresponding to the concepts of
Equality of Sets, Subset and Membership of a Set considered so far only for crisp sets,
first we illustrate the concepts through an example:

Let X be the set on which fuzzy sets are to be defined, e.g.,

X = {Mohan, Sohan, John, Abdul, Abrahm}.

Then X is called the Universal Set.

Note: In every fuzzy set, all the elements of X appear with their corresponding
membership values from 0 to 1.

(i) Degree of Membership: In respect of fuzzy sets, we do not speak of just


‘membership’, but speak of ‘degree of membership’.

In the set

A = {Mohan/.2; Sohan/1; John/.7; Abrahm/.4},

degree (Mohan) = .2, degree (John) = .7

(ii) Equality of Fuzzy Sets: Let A, B and C be fuzzy sets defined on X as follows:

Let A = {Mohan/.2; Sohan/1; John/.7; Abrahm/.4}

B = {Abrahm/.4, Mohan/.2; Sohan/1; John/.7}.

Then, as the degrees of each element in the two sets are equal, we say fuzzy set A equals
fuzzy set B, denoted as A = B.

However, if C = {Abrahm/.2, Mohan/.4; Sohan/1; John/.7}, then

A ≠ C.

(iii) Subset/Superset

Intuitively, we know
(i) The set of 'Very Tall' people should be a subset of the set of Tall people.

(ii) If the degree of 'tallness' of a person is, say, .5, then the degree of 'very
tallness' for the person should be lesser, say .3.

Combining the above two ideas, we may say that if

A = {Mohan/.2; Sohan/1; John/.7; Abrahm/.4} and

B = {Mohan/.2, Sohan/.9, John/.6, Abrahm/.4} and further,

C = {Mohan/.3, Sohan/.9, John/.5, Abrahm/.4},

then, in view of the fact that for each element the degree in A is greater than or equal to
the degree in B, B is a subset of A, denoted as B ⊆ A.

However, degree (Mohan) = .3 in C while degree (Mohan) = .2 in A;
therefore, C is not a subset of A.

On the other hand degree (John) = .5 in C and degree (John) = .7 in A,

therefore, A is also not a subset of C.

We now generalize the ideas illustrated through the examples above.

Let A and B be fuzzy sets on the universal set X = {x1, x2, …, xn} (X is called the
Universe or Universal set) such that

A = {x1/v1, x2/v2, …., xn/vn} and B = {x1/w1, x2/w2, …., xn/wn}

with 0 ≤ vi, wi ≤ 1. Then fuzzy set A equals fuzzy set B, denoted as A = B, if and
only if vi = wi for all i = 1, 2, …, n. Further, if wi ≤ vi for all i, then B is a fuzzy subset
of A.

Example: Let X = {Mohan, Sohan, John, Abdul, Abrahm}

A = {Mohan/.2; Sohan/1; John/.7; Abrahm/.4}

B = {Mohan/.2, Sohan/.9, John/.6, Abrahm/.4}

Then B is a fuzzy subset of A.
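
A minimal sketch of these two tests, assuming both fuzzy sets are stored as Python dictionaries over the same universe (the helper names fuzzy_equal and fuzzy_subset are illustrative, not from the text):

```python
def fuzzy_equal(a, b):
    # A = B iff every element has the same membership degree in both sets.
    return a.keys() == b.keys() and all(a[x] == b[x] for x in a)

def fuzzy_subset(b, a):
    # B is a fuzzy subset of A iff deg(x in B) <= deg(x in A) for every x.
    return all(b[x] <= a[x] for x in b)

A = {"Mohan": 0.2, "Sohan": 1.0, "John": 0.7, "Abrahm": 0.4}
B = {"Mohan": 0.2, "Sohan": 0.9, "John": 0.6, "Abrahm": 0.4}

print(fuzzy_equal(A, B))   # False: Sohan and John have different degrees
print(fuzzy_subset(B, A))  # True: every degree in B is <= the degree in A
```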

In respect of fuzzy sets vis-à-vis (crisp) sets, we may note that:

• Corresponding to the concept of 'belongs to' for (crisp) sets, we use the concept of 'degree of membership' for fuzzy sets.

• It may be noted that every crisp set may be thought of as a fuzzy set, but not conversely. For example, if the universal set is
X = {Mohan, Sohan, John, Abdul, Abrahm} and

A = set of those members of X who are at least graduates, say,

= {Mohan, John, Abdul}

then we can rewrite A as a fuzzy set as follows:

A = {Mohan/1; Sohan/0; John/1; Abdul/1; Abrahm/0}, in which degree of each


member of the crisp set, is taken as one and degree of each element of the universal
set which does not appear in the set A, is taken as zero.
However, conversely, a fuzzy set may not be written as a crisp set. Let C be a fuzzy
set denoting Educated People, where degree of education is defined as follows:

degree of education (Ph.D. holders) = 1

degree of education (Masters degree holders) = 0.85

degree of education (Bachelors degree holders) = .6

degree of education (10 + 2 level) = 0.4

degree of education (8th Standard) = 0.1

degree of education (less than 8th) = 0.

Let C = {Mohan/.85; Sohan/.4; John/.6; Abdul/1; Abrahm/0}.

Then, we cannot think of C as a crisp set.

Next, we define some more concepts in respect of fuzzy sets.

Definition: Support set of a Fuzzy Set, say C, is a crisp set, say D, containing all the
elements of the universe X for which degree of membership in Fuzzy set is positive.
Let us consider again

C = {Mohan/.85; Sohan/.4; John/.6; Abdul/1; Abrahm/0}.

Support of C = D = {Mohan, Sohan, John, Abdul}, where the element
Abrahm does not belong to D because it has degree 0 in C.

Definition: Fuzzy Singleton is a fuzzy set in which there is exactly one element
which has positive membership value.

Example:

Let us define a fuzzy set OLD on the universal set X in which the degree of OLD is zero if a
person in X is below 20 years, and the degree of OLD is .2 if a person is between 20 and 25
years. Further, suppose that

Old = C = {Mohan/0; Sohan/0; John/.2; Abdul/0; Abrahm/0},

then support of old = {John} and hence old is a fuzzy singleton.

Check Your Progress - 1

Ex. 1: Discuss the equality and subset relationships for the following fuzzy sets defined on
the universal set X = {a, b, c, d, e}

A = {a/.3, b/.6, c/.4, d/0, e/.7}

B = {a/.4, b/.8, c/.9, d/.4, e/.7}

C = {a/.3, b/.7, c/.3, d/.2, e/.6}

8.4 FUZZY SET REPRESENTATION


For Crisp sets, we have the operations of Union, intersection & complementation,
as illustrated by the example:

Let X = {x1, x2, …, x10}

A = {x2, x3, x4, x5}


B = {x1, x3, x5, x7, x9}
Then A ∪ B = {x1, x2, x3, x4, x5, x7, x9}

A ∩ B = {x3, x5}

A' (or X ~ A) = {x1, x6, x7, x8, x9, x10}

The concepts of Union, intersection and complementation for crisp sets may be
extended to FUZZY sets after observing that for crisp sets A and B, we have

(i) A ∪ B is the smallest subset of X containing both A and B.

(ii) A ∩ B is the largest subset of X contained in both A and B.

(iii) The complement A' is such that

(a) A and A' do not have any element in common and

(b) Every element of the universal set is in either A or A'.

Fuzzy Union, Intersection, Complementation:

In order to motivate proper definitions of these operations, we may recall

(1) when a crisp set is treated as a fuzzy set then

(i) membership in a crisp set is indicated by degree/value of membership as 1 (one) in


the corresponding Fuzzy set,

(ii) non-membership of a crisp set is indicated by degree/value of membership as zero


in the corresponding Fuzzy Set.

Thus, the smaller the value of the degree of membership, the lesser the extent to which
the element belongs to the fuzzy set.

(2) While taking union of Crisp sets, members of both sets are included, and none
else. However, in each Fuzzy set, all members of the universal set occur but their
degrees determine the level of membership in the fuzzy set.

The facts under (1) and (2) above, lead us to define:

The Union of two fuzzy sets A and B, is the set C with the same universe as that of A
and B such that, the degree of an element of C is equal to the MAXIMUM of degrees
of the element, in the two fuzzy sets.

(if Universe A ≠ Universe B, then take Universe C as the union of Universe A and
Universe B)

The Intersection C of two fuzzy sets A and B is the fuzzy set in which, the degree
of an element of C is equal to the MINIMUM of degrees in the two fuzzy sets.

Example:

A = {Mohan/.85; Sohan/.4; John/.6; Abdul/1; Abrahm/0}

B = {Mohan/.75; Sohan/.6; John/0; Abdul/.8; Abrahm/.3}

Then

A ∪ B = {Mohan/.85; Sohan/.6; John/.6; Abdul/1; Abrahm/.3}

A ∩ B = {Mohan/.75; Sohan/.4; John/0; Abdul/.8; Abrahm/0}


and the complement of A, denoted by A', is given by

A' = {Mohan/.15; Sohan/.6; John/.4; Abdul/0; Abrahm/1}

Properties of Union, Intersection and Complement of Fuzzy Sets:

The following properties which hold for ordinary sets, also, hold for fuzzy sets

Commutativity

(i) A ∪ B = B ∪ A

(ii) A ∩ B = B ∩ A

We prove only (i) above, just to explain how such equalities may be proved in
general.

Let U = {x1, x2, …, xn} be the universe for fuzzy sets A and B.

If y ∈ A ∪ B, then y is of the form xi/di for some i, where ei is the degree of xi in A,
fi is its degree in B, and di = max{ei, fi}.

Since max{ei, fi} = max{fi, ei}, the degree of xi in B ∪ A is also di; hence y ∈ B ∪ A.
Rest of the properties are stated without proof.

Associativity
(i) (A  B )  C = A  (B  C)

(ii) (A  B )  C = A  (B  C)

Distributivity

(i) A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

(ii) A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

DeMorgan’s Laws

(A ∪ B)' = A' ∩ B'

(A ∩ B)' = A' ∪ B'

Involution or Double Complement

(A')' = A

Idempotence

AA=A

AA=A

Identity

AU =U AU=A

A =A  A= 14
The First Order
where  : empty fuzzy set = {x/0 with xU} and U: universe = {x/1 with xU}
Predicate Logic
Check Your Progress - 2

Ex. 2: For the following fuzzy sets

A = {a/.5, b/.6, c/.3, d/0, e/.9} and

B = { a/.3, b/.7, c/.6, d/.3, e/.6},

find the fuzzy sets A ∩ B, A ∪ B and (A ∩ B)'

Next, we discuss three operations, viz., concentration, dilation and normalization, that
are relevant only to fuzzy sets and can not be discussed for (crisp) sets.

(1) Concentration of a fuzzy set A is defined as

CON(A) = {x / (mA(x))² | x ∈ U},

where mA(x) denotes the degree of membership of x in A.

Example:

If A = {Mohan/.5; Sohan/.9; John/.7; Abdul/0; Abrahm/.2}

then

CON (A) = {Mohan/.25; Sohan/.81; John/.49; Abdul/0; Abrahm/.04}.

In respect of concentration, it may be noted that the associated values being between 0
and 1, on squaring, become smaller. In other words, the values concentrate towards
zero. This fact may be used for giving increased emphasis on a concept. If Brightness
of articles is being discussed, then Very bright may be obtained in terms of

CON(Bright).

(2) Dilation (the opposite of concentration) of a fuzzy set A is defined as

DIL(A) = {x / √(mA(x)) | x ∈ U}


Example:
If A = {Mohan/.5; Sohan/.9; John/.7; Abdul/0; Abrahm/.2}
then
DIL (A) = {Mohan/.7; Sohan/.95; John/.84; Abdul/0; Abrahm/.45}
The associated values, that are between 0 and 1, on taking square-root get increased,
e.g., if the value associated with x was .01 before dilation, then the value associated
with x after dilation becomes .1, i.e., ten times of the original value.
This fact may be used for decreased emphasis. For example, if colour say ‘yellow’ has
been considered already, then ‘light yellow’ may be considered in terms of already
discussed ‘yellow’ through Dilation.
(3) Normalization of a fuzzy set A is defined as

NORM(A) = {x / (mA(x) / Max) | x ∈ U},

where Max is the maximum membership value occurring in A. NORM(A) is a fuzzy set in
which membership values are obtained by dividing the values of the membership
function of A by the maximum membership value.

The resulting fuzzy set, called the normal, (or normalized) fuzzy set, has the
maximum of membership function value of 1.

Example:

If A = {Mohan/.5; Sohan/.9; John/.7; Abdul/0; Abrahm/.2}

Norm(A) = {Mohan/(.5 ÷ .9 ≈ .56); Sohan/1; John/(.7 ÷ .9 ≈ .78); Abdul/0;
Abrahm/(.2 ÷ .9 ≈ .22)}
Note: If one of the members has value 1, then Norm(A) = A.
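
The three operations above can be sketched in a few lines of Python. This is an illustrative implementation (not from the source text), assuming the same dictionary representation as before and reusing the example set A:

```python
import math

def concentrate(a):
    # CON(A): square each membership degree (values shrink towards 0).
    return {x: round(d ** 2, 2) for x, d in a.items()}

def dilate(a):
    # DIL(A): take the square root of each degree (values grow towards 1).
    return {x: round(math.sqrt(d), 2) for x, d in a.items()}

def normalize(a):
    # NORM(A): divide every degree by the maximum degree in the set.
    m = max(a.values())
    return {x: round(d / m, 2) for x, d in a.items()}

A = {"Mohan": 0.5, "Sohan": 0.9, "John": 0.7, "Abdul": 0.0, "Abrahm": 0.2}
print(concentrate(A))  # {'Mohan': 0.25, 'Sohan': 0.81, 'John': 0.49, ...}
print(dilate(A))       # {'Mohan': 0.71, 'Sohan': 0.95, 'John': 0.84, ...}
print(normalize(A))    # {'Mohan': 0.56, 'Sohan': 1.0, 'John': 0.78, ...}
```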
Relation & Fuzzy Relation

We know from our earlier background in mathematics that a relation from a set A to a
set B is a subset of A × B.

For example, the relation of father may be written as {(Dasrath, Ram), …}, which is
a subset of A × B, where A and B are sets of persons living or dead.

The relation of Age may be written as

{(Mohan, 43.7), (Sohan, 25.6), …},

where A is set of living persons and B is set of numbers denoting years.

Fuzzy Relation
In fuzzy sets, every element of the universal set occurs with some degree of
membership. A fuzzy relation may be defined in different ways. One way of
defining fuzzy relation is to assume the underlying sets as crisp sets. We will discuss
only this case.

Thus, a fuzzy relation from A to B, where we assume A and B to be crisp sets, is a fuzzy
set in which each element of A × B is associated with a degree of membership
between zero and one.

For example:

We may define the relation of UNCLE as follows:

(i) x is an UNCLE of y with degree 1 if x is a brother of y's mother or father,

(ii) x is an UNCLE of y with degree .7 if x is a brother of an UNCLE of y, and x is
not covered above,

(iii) x is an UNCLE of y with degree .6 if x is the son of an UNCLE of y's mother or
father.

Now suppose

Ram is UNCLE of Mohan with degree 1, Majid is UNCLE of Abdul with degree .7,
Peter is UNCLE of John with degree .7, and Ram is UNCLE of John with degree .4.

Then the relation of UNCLE can be written as a set of ordered-triples as follows:

{(Ram, Mohan, 1), (Majid, Abdul, .7), (Peter, John, .7), (Ram, John, .4)}.

As in the case of ordinary relations, we can use matrices and graphs to represent
FUZZY relations, e.g., the relation of UNCLE discussed above, may be graphically
denoted as
[Figure: Fuzzy graph of the UNCLE relation, with edges Ram→Mohan (degree 1), Ram→John (.4), Majid→Abdul (.7) and Peter→John (.7)]
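
Besides matrices and graphs, the same relation can be held in a program as a mapping from ordered pairs to degrees. A minimal sketch (illustrative, not from the text), using the UNCLE triples above:

```python
# A fuzzy relation stored as a mapping from ordered pairs to membership degrees.
uncle = {
    ("Ram", "Mohan"): 1.0,
    ("Majid", "Abdul"): 0.7,
    ("Peter", "John"): 0.7,
    ("Ram", "John"): 0.4,
}

# Degree to which Ram is an uncle of John; unlisted pairs default to degree 0.
print(uncle.get(("Ram", "John"), 0))     # 0.4
print(uncle.get(("Peter", "Mohan"), 0))  # 0
```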
Fuzzy Reasoning

In the rest of this section, we take a brief, fleeting glance at fuzzy reasoning.

Let us recall the well-known Crisp Reasoning Operators

(i) AND

(ii) OR

(iii) NOT

(iv) IF P THEN Q

(v) P IF AND ONLY IF Q

Corresponding to each of these operators, there is a fuzzy operator discussed and


defined below. For this purpose, we assume that P and Q are fuzzy propositions with
associated degrees, respectively, deg (P) and deg (Q) between 0 and 1.

Here deg(P) = 0 denotes that P is False and deg(P) = 1 denotes that P is True.

Then the operators are defined as follows:

(i) Fuzzy AND, denoted by ∧, is defined as follows:

For given fuzzy propositions P and Q, the expression P ∧ Q denotes a fuzzy
proposition with deg(P ∧ Q) = min(deg(P), deg(Q))

Example: Let P: Mohan is tall with deg (P) = .7

Q: Mohan is educated with deg (Q) = .4

Then P ∧ Q denotes 'Mohan is tall and educated' with degree min{.7, .4} = .4

(ii) Fuzzy OR, denoted by ∨, is defined as follows:

For given fuzzy propositions P and Q, P ∨ Q is a fuzzy proposition with

deg(P ∨ Q) = max(deg(P), deg(Q))

Example: Let P: Mohan is tall with deg (P) = .7

Q: Mohan is educated with deg (Q) = .4

Then P ∨ Q denotes 'Mohan is tall or educated' with degree max{.7, .4} = .7
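
Since both operators reduce to min and max over degrees, each is a one-liner in code. The sketch below (illustrative, not from the text) reproduces the two examples above:

```python
def fuzzy_and(deg_p, deg_q):
    # deg(P AND Q) = min(deg(P), deg(Q))
    return min(deg_p, deg_q)

def fuzzy_or(deg_p, deg_q):
    # deg(P OR Q) = max(deg(P), deg(Q))
    return max(deg_p, deg_q)

deg_tall, deg_educated = 0.7, 0.4          # degrees of propositions P and Q
print(fuzzy_and(deg_tall, deg_educated))   # 0.4: 'Mohan is tall and educated'
print(fuzzy_or(deg_tall, deg_educated))    # 0.7: 'Mohan is tall or educated'
```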

8.5 FUZZY REASONING

Fuzzy reasoning is, in general, handled by the following systems:

1) Non-Monotonic Reasoning Systems

2) Default Reasoning Systems

3) Closed World Assumption Systems

Let us start our discussion with an understanding of Non-Monotonic Reasoning
Systems.

1) NON-MONOTONIC REASONING SYSTEMS


Monotonic Reasoning: The conclusions drawn in PL and FOPL are arrived at only through
(valid) deductive methods. When an axiom is added to a PL or FOPL system,
then, through deduction, we can draw more conclusions; hence, additional facts
become available in the knowledge base with the addition of each axiom. Adding
axioms to the knowledge base increases the amount of knowledge contained in the
knowledge base. Therefore, the set of facts derivable through inference in such systems can
only grow larger with the addition of each axiomatic fact; adding new facts cannot
reduce the size of the KB. Thus, the amount of knowledge monotonically increases with the
number of independent premises, i.e., with the new facts that become available.

However, in everyday life, many times in the light of new facts that become available,
we may have to revise our earlier knowledge. For example, we consider a sort of
deductive argument in FOPL:

(i) Every bird can fly long distances.

(ii) Every pigeon is a bird.

(iii) Tweety is a pigeon.

Therefore, Tweety can fly long distances.

However, later on, we come to know that Tweety is actually a hen and a hen cannot
fly long distances. Therefore, we have to revise our belief that Tweety can fly over
long distances.

This type of situation is not handled by any monotonic reasoning system, including PL
and FOPL. It is appropriately handled by Non-Monotonic Reasoning Systems,
which are discussed next.

A non-monotonic reasoning system is one which allows the retraction of old
knowledge on the discovery of new facts which contradict or invalidate a part of the
current knowledge base. Such systems also take care that retracting one fact may
necessitate a chain of further retractions from the knowledge base, or even the reintroduction of
facts retracted earlier. Thus, chains of retractions from the KB and
the reintroduction of earlier retracted facts are part of a non-monotonic reasoning
system.
To meet the requirements of reasoning in the real world, we need non-monotonic
reasoning systems in addition to the monotonic ones. This is true especially in
view of the fact that it is not reasonable to expect that all the knowledge needed for a
set of tasks can be acquired, validated, and loaded into the system at the outset.
In general, initial knowledge is an incomplete set of partially true facts. The set may
also be redundant and may contain inconsistencies and other sources of uncertainty.

Major components of a Non-Monotonic Reasoning System

A typical non-monotonic reasoning system (NMRS) consists of the following three
major components:

(1) Knowledge base (KB),

(2) Inference Engine (IE),

(3) Truth-Maintenance System (TMS).

The KB contains information, facts, rules, procedures, etc. relevant to the type of
problems that the system is expected to solve. The IE component of the NMRS
gets facts from the KB to draw new inferences and sends the new facts it discovers
back to the KB. The TMS component, after the addition of new facts to the KB, whether from
the environment, from the user or from the IE, checks the validity of the KB. It
may happen that a new fact from the environment, or one inferred by the IE,
conflicts with or contradicts some of the facts already in the KB; in other words, an
inconsistency may arise. In case of inconsistencies, the TMS retracts some facts from the
KB. This may lead to a chain of retractions, which may require interactions
between the KB and the TMS. Also, some new fact, either from the environment or from the IE,
may invalidate some earlier retractions, requiring reintroduction of the earlier retracted
facts; this may lead to a chain of reintroductions. These retractions and reintroductions
are taken care of by the TMS. The IE is completely relieved of this responsibility; its main
job is to conclude new facts when it is supplied a set of facts.

[Figure: interaction among the Inference Engine (IE), Truth-Maintenance System (TMS) and Knowledge Base (KB)]

Next, we explain the ideas discussed above through an example.

Let us assume the KB has two facts, P and ~Q → ~P, and a rule called Modus Tollens.
When the IE is supplied these knowledge items, it concludes Q and sends Q to the KB.
However, through interaction with the environment, the KB is later supplied with the
information that ~P is more appropriate than P. Then the TMS, on the addition of ~P to
the KB, finds that the KB is no longer consistent, at least with P. The knowledge that ~P is
more appropriate suggests that P be retracted. Further, Q was concluded assuming P
to be True; in the new situation, in which P is assumed to be inappropriate, Q also
becomes inappropriate. P and Q are not deleted from the KB, but are just marked as
dormant or ineffective. This is done in view of the fact that if, later on, it is again
found appropriate to include P or Q or both, then, instead of requiring some
mechanism for re-adding P and Q, we just remove the marks that made them dormant.

Non-monotonic Reasoning Systems deal with:

1) Revisable belief systems

2) Incomplete KBs, handled through Default Reasoning and the Closed World Assumption

2) DEFAULT REASONING

In the previous section, we discussed uncertainty due to beliefs (which are not
necessarily facts), where beliefs are changeable. Here, we discuss another form of
uncertainty that occurs as a result of the incompleteness of the available knowledge at a
particular point of time.

One method of handling uncertainty due to an incomplete KB is through default
reasoning, which is also a form of non-monotonic reasoning and is based on the
following mechanism:

Whenever information about an entity relevant to the application is not in the KB,
a default value for that type of entity is assumed and assigned to the entity. The
default assignment is not arbitrary, but is based on experiments, observations or some
other rational grounds. However, the typical value for the entity is removed if some
information contradictory to the assumed or default value becomes available.

The advantage of this type of reasoning system is that we need not store all facts
regarding a situation. Reiter has given one theory of default reasoning, in which a
default rule is expressed as

a ( x ) : M b ( x ),....., Mb k ( x )
(A)
C( x )

where M is a consistency operator.


The inference rule (A) states that if a(x) is true and none of the conditions b1(x), ..., bk(x) is in
conflict or contradiction with the KB, then you can deduce the statement C(x).

The idea of default reasoning is explained through the following example:

Suppose we have

    Bird(x) : M Fly(x)
    ------------------     (i)
          Fly(x)

(ii) Bird(Tweety)

M Fly(x) stands for a statement of the form 'the KB does not have any statement
which says that x lacks wings, etc., because of which x may not be able to fly'.
In other words, Bird(x) : M Fly(x) may be taken to stand for the statement 'if x is a
normal bird and the normality of x is not contradicted by other facts and rules in the
KB, then we can assume that x can fly'. Combining this with Bird(Tweety), we conclude
that if the KB does not have any facts or rules from which it can be inferred that Tweety
cannot fly, then we can conclude that Tweety can fly.

Further, suppose the KB also contains

(i) Ostrich(Tweety)

(ii) Ostrich(x) → ~Fly(x).

From these two facts in the KB, it is concluded that Tweety, being an ostrich, cannot
fly. In the light of this knowledge, the fact that Tweety can fly has to be withdrawn.
Thus, Fly(Tweety) would be blocked, because the default M Fly(Tweety) is now
inconsistent.

Let us consider another example:

    Adult(x) : M Drive(x)
    ---------------------
          Drive(x)

The above can be interpreted in default theory as:

If a person x is an adult and the knowledge base contains no fact (e.g., that x is blind, or
that x lost both hands in an accident) which makes x incapable of driving, then it is
assumed that x can drive.
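
The firing condition of a default rule can be sketched in a few lines of Python. This is only an illustration, assuming facts are stored as strings in a set; the predicate name CannotDrive is a hypothetical stand-in for any fact that contradicts the justification M Drive(x):

```python
def apply_default(kb, prerequisite, blocking_fact, conclusion):
    """Fire a default rule a(x) : M b(x) / C(x): if the prerequisite holds
    and nothing in the KB contradicts the justification, add the conclusion."""
    if prerequisite in kb and blocking_fact not in kb:
        kb.add(conclusion)

kb = {"Adult(John)"}
apply_default(kb, "Adult(John)", "CannotDrive(John)", "Drive(John)")
print("Drive(John)" in kb)   # True: no fact blocks the default

kb2 = {"Adult(Mary)", "CannotDrive(Mary)"}   # hypothetical blocking fact present
apply_default(kb2, "Adult(Mary)", "CannotDrive(Mary)", "Drive(Mary)")
print("Drive(Mary)" in kb2)  # False: the justification is contradicted
```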

3) CLOSED WORLD ASSUMPTION


Another mechanism of handling incompleteness of a KB is called ‘Closed World
Assumption’ (CWA).
This mechanism is useful in applications where most of the facts are known, and it is
therefore reasonable to assume that if a proposition cannot be proved, then it is
FALSE. This is called CWA with negation as failure.

This means if a ground atom P(a) is not provable, then assume ~ P(a). A predicate like

LESS (x, y) becomes a ground atom when the variables x and y are replaced by
constants say x by 2 and y by 3, so that we get the ground atom LESS (2, 3).

An example of an application where CWA is reasonable is airline reservation,
where city-to-city flights not explicitly entered in the flight schedule or timetable are
assumed not to exist.

A KB is complete if, for each ground atom P(a), either P(a) or ~P(a) can be proved.

By the use of CWA, any incomplete KB becomes complete by the addition of the
meta-rule:

If P(a) cannot be proved, then assume ~P(a).

Example of an incomplete KB: Let our KB contain only

(i) P(a).

(ii) P(b).

(iii) P(a) → Q(a).

(iv) Rule of Modus Ponens: From P and P → Q, conclude Q.

The above KB is incomplete, as we cannot say anything about Q(b) (or ~Q(b)) from
the given KB.
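
The CWA meta-rule itself is easy to sketch in code. The following illustrative fragment (assuming atoms are plain strings and that the provable atoms have already been computed, e.g., by Modus Ponens as above) adds the negation of every unprovable ground atom:

```python
def cwa_closure(provable_atoms, all_ground_atoms):
    # CWA (negation as failure): any ground atom that cannot be proved
    # is assumed to be false.
    return {"~" + a for a in all_ground_atoms if a not in provable_atoms}

# From the KB above: P(a) and P(b) are given; Q(a) follows by Modus Ponens.
provable = {"P(a)", "P(b)", "Q(a)"}
universe = {"P(a)", "P(b)", "Q(a)", "Q(b)"}

print(cwa_closure(provable, universe))  # {'~Q(b)'}: Q(b) is assumed false
```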

Remarks: In general, a KB augmented by CWA need not be consistent, i.e.,
it may contain two mutually conflicting wffs. For example, suppose our KB contains
only P(a) ∨ Q(b).

(Note: from P(a) ∨ Q(b), we cannot conclude either of P(a) and Q(b) with
definiteness.)

As neither P(a) nor Q(b) is provable, we add ~P(a) and ~Q(b) by using
CWA.

But then the set {P(a) ∨ Q(b), ~P(a), ~Q(b)} is inconsistent.

8.6 FUZZY INFERENCE


PL and FOPL are deductive inferencing systems, i.e., the conclusions drawn are
invariably true whenever the premises are true. However, due to the limitations of these
systems for making inferences, as discussed earlier, we must have other systems
of inference. In addition to Default Reasoning systems and Closed World Assumption
systems, we have the following useful reasoning systems:

1) Abductive Inference System, which is based on the use of causal knowledge to
explain and justify a (possibly invalid) conclusion.

Abduction Rule: (P → Q, Q) / P

Note that the abductive inference rule is different from the Modus Ponens inference rule in
that, in the abductive rule, the consequent of P → Q, i.e., Q, is assumed to be
given as True, and the antecedent of P → Q, i.e., P, is inferred.

Abductive inference is useful in diagnostic applications. For example, while
diagnosing a disease (say P), the doctor asks for the symptoms (say Q). Also, the
doctor knows that, given the disease, say malaria (P), the symptoms include high
fever starting with a feeling of cold, etc. (Q);

i.e., the doctor knows P → Q.

The doctor then attempts to diagnose the disease (i.e., P) from the symptoms. However, it
should be noted that the conclusion of the disease from the symptoms may not always
be correct. In general, abductive reasoning often leads to correct conclusions, but the
conclusions may also be incorrect. In other words, abductive reasoning is not a valid
form of reasoning.

Inductive Reasoning is a method of generalisation from a finite number of instances.
The rule, generally denoted as

    P(a1), P(a2), ..., P(an)
    ------------------------
           (∀x) P(x)

states that from n instances P(ai) of a predicate/property P(x), we infer that P(x) is True for all x.

Thus, from a finite number of observations about some property of objects, we


generalize, i.e., make a general statement about all the elements of the domain in
respect of the property.

For example, we may conclude that all cows are white after observing a large
number of white cows. However, this conclusion may have exceptions, in the
sense that we may also come across a black cow. Inductive reasoning, like abductive
reasoning, Closed World Assumption reasoning and default reasoning, is not
infallible. In other words, these reasoning rules lead to conclusions which may be
True, but not necessarily always.

However, all the rules discussed under Propositional Logic (PL) and FOPL, including
Modus Ponens, etc., are deductive, i.e., they lead to irrefutable conclusions.

8.7 ROUGH SET THEORY


Rough set theory can be regarded as a new mathematical tool for imperfect data
analysis. The theory has found applications in many domains, such as decision
support, engineering, environment, banking, medicine and others. It is a mechanism to
deal with imprecise/vague knowledge; dealing with such knowledge is a
particularly active area of research for scientists working in the field of Artificial
Intelligence. There are various approaches to handling imprecise knowledge; among the
most successful is fuzzy logic, proposed by L. Zadeh, which we
discussed in the earlier sections of this unit.

In this section, we will try to understand the rough set theory approach to managing
imprecise knowledge, which was proposed by Z. Pawlak. This theory is quite
comprehensive and may be treated as an independent discipline. It is closely connected
with other theories and hence with various fields like AI, machine
learning, cognitive science, data mining, pattern recognition, etc.

Rough set theory is quite comprehensive because of the following reasons:

• It requires no preliminary or additional information about the data, unlike probability distributions in statistics or membership grades in fuzzy set theory.

• It facilitates the user with efficient tools and techniques to detect hidden patterns.

• It promotes data reduction, i.e., it reduces the original data and finds minimal datasets with the same knowledge as the original dataset.

• It helps to evaluate the significance of data.

• It supports a mechanism to derive decision rules from the data automatically.

• It is easy to understand, well suited to concurrent, parallel or distributed processing, and offers a straightforward interpretation of the obtained results.

The following are the basic/elementary concepts of rough set theory:


1) Some information (data, knowledge) is associated with every object of the
universe of discourse.

2) Objects characterized by the same information are indiscernible or similar in


view of the available information about them. The indiscernibility relation
generated in this way is the mathematical basis of rough set theory. Any set of
all indiscernible (similar) objects is called an elementary set, and forms a
basic granule (atom) of knowledge about the universe.

3) Any union of some elementary sets is referred to as a crisp (precise) set –


otherwise the set is rough (imprecise, vague).

4) Each rough set has boundary-line cases, i.e., objects which cannot be with
certainty classified, by employing the available knowledge, as members of the
set or its complement. Obviously rough sets, in contrast to precise sets, cannot
be characterized in terms of information about their elements. With any rough
set a pair of precise sets, called the lower and the upper approximation of the
rough set, is associated.
Note: The lower approximation consists of all objects which surely belong to
the set, and the upper approximation contains all objects which possibly
belong to the set. The difference between the upper and the lower
approximation constitutes the boundary region of the rough set.
Approximations are fundamental concepts of rough set theory (see the sketch after this list).

5) Rough set based data analysis starts from a data table called a decision table,
columns of which are labeled by attributes, rows – by objects of interest and
entries of the table are attribute values.

6) Attributes of the decision table are divided into two disjoint groups called
condition and decision attributes, respectively. Each row of a decision table
induces a decision rule, which specifies decision (action, results, outcome,
etc.) if some conditions are satisfied. If a decision rule uniquely determines
decision in terms of conditions – the decision rule is certain. Otherwise the
decision rule is uncertain.
Note: Decision rules are closely connected with approximations. Roughly
speaking, certain decision rules describe lower approximation of decisions in
terms of conditions, whereas uncertain decision rules refer to the boundary
region of decisions.

7) With every decision rule two conditional probabilities, called the certainty and
the coverage coefficient, are associated.

a. The certainty coefficient expresses the conditional probability that an


object belongs to the decision class specified by the decision rule,
given it satisfies conditions of the rule.
b. The coverage coefficient gives the conditional probability of the reasons
for a given decision. It turns out that the certainty and coverage
coefficients satisfy Bayes' theorem. This gives a new look into the
interpretation of Bayes' theorem, and offers a new method to
draw conclusions from data.
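
To make the notions of lower and upper approximation concrete, here is a minimal Python sketch (referred to in point 4 above). It is purely illustrative: the universe, its partition into elementary sets, and the target concept are all hypothetical.

```python
def approximations(partition, target):
    """Lower/upper approximations of a target set, given the partition of
    the universe into elementary (indiscernible) sets."""
    lower, upper = set(), set()
    for block in partition:
        if block <= target:    # the whole block surely belongs to the target
            lower |= block
        if block & target:     # the block possibly belongs to the target
            upper |= block
    return lower, upper

# Hypothetical universe of six objects, partitioned by indiscernibility.
partition = [{"o1", "o2"}, {"o3"}, {"o4", "o5"}, {"o6"}]
target = {"o1", "o2", "o4"}    # the (rough) concept to approximate

lower, upper = approximations(partition, target)
print(lower)           # {'o1', 'o2'}: objects that surely belong
print(upper)           # {'o1', 'o2', 'o4', 'o5'}: objects that possibly belong
print(upper - lower)   # {'o4', 'o5'}: the boundary region
```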

8.8 SUMMARY
In this unit, fuzzy systems were discussed, along with an introduction to fuzzy sets
and their representation. Later, a conceptual understanding of fuzzy reasoning was
built, and the same was used to perform fuzzy inference. Finally, the unit also discussed
the concept of rough set theory.
8.9 SOLUTIONS/ANSWERS
Check Your Progress - 1

Ex. 1: Discuss the equality and subset relationships for the following fuzzy sets defined on
the universal set X = {a, b, c, d, e}

A = {a/.3, b/.6, c/.4, d/0, e/.7}; B = {a/.4, b/.8, c/.9, d/.4, e/.7}; C = {a/.3, b/.7, c/.3,
d/.2, e/.6}

SOLUTION: Both A and C are subsets of the fuzzy set B, because deg(x in A) ≤
deg(x in B) for all x ∈ X.

Similarly, degree(x in C) ≤ degree(x in B) for all x ∈ X.

Further, A is not a subset of C, because deg(c in A) = .4 > .3 = degree(c in C).

Also, C is not a subset of A, because degree(b in C) = .7 > .6 = degree(b in A).

Check Your Progress - 2

Ex. 2: For the following fuzzy sets A = {a/.5, b/.6, c/.3, d/0, e/.9} and B = {a/.3, b/.7,
c/.6, d/.3, e/.6}, find the fuzzy sets A ∩ B, A ∪ B and (A ∩ B)'.

Solution: A ∩ B = {a/.3, b/.6, c/.3, d/0, e/.6},

where degree(x in A ∩ B) = min{degree(x in A), degree(x in B)}.

A ∪ B = {a/.5, b/.7, c/.6, d/.3, e/.9},

where degree(x in A ∪ B) = max{degree(x in A), degree(x in B)}.

The fuzzy set (A ∩ B)' is obtained from A ∩ B by the rule:

degree(x in (A ∩ B)') = 1 − degree(x in A ∩ B).

Hence

(A ∩ B)' = {a/.7, b/.4, c/.7, d/1, e/.4}

8.10 FURTHER READINGS


1. Ela Kumar, "Artificial Intelligence", IK International Publications
2. E. Rich and K. Knight, "Artificial Intelligence", Tata McGraw Hill Publications
3. N.J. Nilsson, "Principles of AI", Narosa Publishing House
4. John J. Craig, "Introduction to Robotics", Addison Wesley Publication
5. D.W. Patterson, "Introduction to AI and Expert Systems", Pearson Publication

UNIT 9 INTRODUCTION TO MACHINE LEARNING
METHODS
Structure
9.0 Introduction
9.1 Objectives
9.2 Introduction to Machine Learning
9.3 Techniques of Machine Learning
9.4 Reinforcement Learning and Algorithms
9.5 Deep Learning and Algorithms
9.6 Ensemble Methods
9.7 Summary
9.8 Solutions/Answers
9.9 Further Readings

9.0 INTRODUCTION

After Artificial Intelligence was introduced in the computing world, there was a need for machines that
would automatically improve at their tasks. Such learning needs to be kept in check, so there should be some rules that
apply to all learning processes.

The main goal of machine learning, even at its most basic level, is to be able to analyse and adapt data on
its own and make decisions based on calculations and analyses. Machine learning is a way to try to
improve computers by imitating how the human brain learns. A computer that doesn't have intelligence is
just a fast machine for processing data. The devices that don't have AI or ML are just data processing
units that use the information they are given. Machine Learning is what we need to make devices that can
make decisions based on data.

To get to this level of intelligence, you need to put algorithms and data into a machine in a way that lets it
make decisions.

For example, real-time GPS data is used by Maps applications on devices to show the quickest and fastest route.
Several algorithms, such as the shortest-path algorithm (Dijkstra's algorithm) and algorithms for the travelling
salesman problem (e.g., the water flow algorithm, WFA), can be used to make the decision. These are algorithms that
have been used for years and can still be improved, but they are useful for learning. Here, we can see that the
machine, which is your computer or mobile device, uses GPS coordinates, traffic density data,
and predefined map routes to figure out the fastest way to get from point A to point B.

This is one of the simplest examples that can help us understand how Machine Learning can help in
independent decision making by devices and how it can help in making decision making easier and more
accurate.

The accuracy of the data as a whole is a topic of debate: decisions based on the data might be
accurate, but whether they are acceptable within given constraints is a separate issue.
Consequently, it is necessary to set these boundaries for all machine learning algorithms
and engines.

The simplest example would be instructing a self-driving car to reach a destination at a specified time. It
should also work within the legal boundaries of the land and not break traffic rules to achieve the desired
result. Such boundaries and restrictions cannot be ignored, as they are very important for any self-learning
system.

Data/inputs are the soul of every business. Data has been a key component in making any decision; it has
been the key to success since the prehistoric era. The more data you have, the higher the probability of making
the right decision. Machine learning is the key to unlocking a new world in which customer data, corporate data,
demographic data, or other data relevant to the decision can help you make the right, more
informed decisions and stay ahead of the competition.

Both artificial intelligence and statistics research groups contributed to the development of machine
learning. Companies like Google, Microsoft, Facebook, and Amazon all use machine learning as part of
their decision-making processes.

The most common applications of machine learning nowadays are to interpret and investigate cyber
phenomena, to extract and project the future values of those phenomena, and to detect anomalies.

There are a number of open-source solutions for machine learning that can be used with API calls or
without programming. Examples of open-source machine learning projects include Weka, Orange, and
RapidMiner. To see how data processed by an algorithm looks, you can feed the results into
tools like Tableau, Pivotal, or Spotfire and use them to build dashboards and workflow strategies.

Michie et al. (D. Michie, 1994) say that machine learning usually refers to automatic computing
procedures, based on logical or binary operations, that learn how to do a task from a series of
examples. Machine learning is used in many ways today, but whether all of these uses are mature is up for
debate. There is a lot of room for improvement when it comes to accuracy; improving it is a process that never
ends and changes every day.

9.1 OBJECTIVES

After going through this unit, you should be able to:

• Understand the basics of machine learning
• Identify various techniques of machine learning
• Understand the concept of reinforcement learning
• Understand the concept of deep learning
• Understand ensemble methods
9.2 INTRODUCTION TO MACHINE LEARNING

Understanding data, describing the characteristics of a data collection, and locating hidden connections
and patterns within that data are all necessary steps in the process of developing a model. These steps can
be accomplished through the application of statistics, data mining, and machine learning. When it comes
to finding solutions to business issues, the methods and tools that are employed by these fields share a lot
in common with one another.

The more conventional forms of statistical investigation are the origin of a great deal of the prevalent data
mining and machine learning techniques. Data scientists have a background in technology and also have
expertise in areas such as statistics, data mining, and machine learning. This allows them to collaborate
effectively across all fields.

The process of "data mining" refers to the extraction of information from data that is latent, previously
unknown, and has the potential to be beneficial. Building computer algorithms that can automatically
search through large databases for recurring structures or patterns is the goal of this project. In the event
that robust patterns are discovered, it is likely that they will generalise to enable accurate predictions on
future data.

In the renowned book "Data Mining: Practical Machine Learning Tools and Techniques", written by Ian
Witten and Eibe Frank, the subject matter is thoroughly covered. The activity known as "data mining"
refers to the practice of locating patterns within data. The procedure needs to be fully automatic or, at the
very least, semiautomatic. The patterns that are found have to be significant, in the sense that they lead to
some kind of benefit, most commonly an economic one. The data are invariably present in substantial
quantities.

Machine learning, on the other hand, is the core of data mining's technical infrastructure. It is used to
extract information from the raw data that is stored in databases; this information is then expressed in a
form that is understandable and can be applied to a range of situations.

[Figure 1: The overlap among Statistics, Data Mining and Machine Learning]


We learned the differences between machine learning and data mining from the discussion above;
nevertheless, because all three fields, machine learning, data mining, and statistics, are intertwined, we must
grasp their relationships. So, how do machine learning and statistics differ? In reality, there is no clear
dividing line between machine learning and statistics, because data analysis techniques form a continuum, and a
multidimensional one at that. Some are derived from statistical skills, while others are more strongly
linked to the type of machine learning that has emerged from computer science. Historically, the two sides have had
quite diverse traditions. If forced to choose one point of emphasis, statistics may have been
more concerned with testing hypotheses, whereas machine learning has been more interested in
articulating the generalisation process as a search through possible hypotheses. This, however, is an
exaggeration: many machine learning algorithms do not require any searching at all, and statistics is
significantly more than just hypothesis testing.

Most learning algorithms use statistical tests to build rules or trees and to correct models that are "overfitted",
i.e., too dependent on the details of the examples that were used to make them. So, a lot of statistical thinking
goes into the techniques we will talk about in this unit. Statistical tests are used to evaluate and validate
machine learning models and algorithms.

Machine learning is when a computer learns how to do a task by using algorithms that are logical and can
be turned into usable models. The artificial intelligence community has been the main driver of
machine learning's growth. The most important factor contributing to this expansion was that it assisted
in assembling statistical and computational methods that could automatically construct usable
models from data. Companies such as Google, Microsoft, Facebook, and Netflix have been putting in
consistent effort over the past decade to make this more accurate and mature.

The primary function or application of machine learning algorithms can be summarized as follows:

(a) To gain an understanding of the cyber phenomenon that produced the data that is being
investigated;
(b) To abstract the understanding of underlying phenomena in the form of a model;
(c) To predict the future values of a phenomenon by using the model that was just generated; and
(d) To identify anomalous behavior exhibited by a phenomenon that is being observed.

There are various open-source implementations of machine learning algorithms that can be utilised either
through application programming interface (API) calls or in non-programmatic applications, and these methods
can also be used in conjunction with each other. Weka, Orange, and RapidMiner are a few instances of
open-source toolkits. These algorithms' outputs can be fed into visual
analytics tools like Tableau and Spotfire, which can then be used to build dashboards and actionable
pipelines.

Almost all of the frameworks have emphasised decision-tree techniques, in which classification is
determined by a series of logical steps. Given enough data (which may be a lot!), these are capable
of representing even the most complex problems. Other techniques, such as genetic algorithms and
inductive logic programming (ILP), are currently in development and, in theory, would allow us to deal with
a wider range of data, including cases where the number and type of attributes vary, where additional
layers of learning are superimposed, and where attributes and classes are organised hierarchically, and so
on. Machine learning seeks to provide classification expressions that are simple enough for humans to
understand. They must be able to sufficiently simulate human reasoning in order to provide insight into
the decision-making process. Background knowledge, including statistical techniques, can be used in
development, but operation is assumed to be without human interference.

The expression "to learn" can be understood as:

• To acquire knowledge: to gain knowledge or understanding of something (some art or practice) through experiencing it or through study.

• To gain experience with something, learn a new skill, or master a talent.

• To memorize something, or to acquire something through experience, example, or practice.
Machine learning is a methodology for automatically improving computer systems through experience,
by implementing a learning process. There are various techniques for
imparting machine learning, and we will learn about a few of them in the subsequent sections.

Check Your Progress - 1

Q1. How does machine learning differ from Artificial Intelligence?


………………………………………………………………………………………………………………
……………………………………………………

Q2. Briefly discuss the major functions or uses of machine learning algorithms.

………………………………………………………………………………………………………………
……………………………………………………

9.3 TECHNIQUES OF MACHINE LEARNING


Machine learning uses various algorithms to improve, describe, and predict outcomes by
repeated learning from data. It is possible to make models that are more accurate as the algorithms learn
from the training data. A machine learning model is what you get when you use data to train your
machine learning algorithm. After it has been trained, a model will give you an output when you give it
an input. A predictive model is made, for example, by a predictive algorithm. Then, when you put data
into the predictive model, you'll get a prediction based on the data that was used to train the model. At the
moment, analytics models can't be made without machine learning.

Machine learning approaches are needed to make prediction models more accurate. Depending on the
type and amount of data and the business problem being solved, there are different ways to approach the
problem. In this section, we talk about the machine learning cycle.

The Machine Learning Cycle: Making a machine learning application is similar to making a machine
learning algorithm work, which is an iterative process. You can't just train a model once and leave it
alone, because data changes, preferences change, and new competitors come along. So, when your model
goes into production, you need to keep it updated. Even though you won't need as much training as when
you created the model, don't expect it to run on its own.
Figure 2: Machine Learning Cycle at a Glance

1. ACCESS and load the data.
2. PREPROCESS the data.
3. DERIVE features using the pre-processed data.
4. TRAIN models using the features derived in step 3.
5. ITERATE to find the best model.
6. INTEGRATE the best-trained model into a production system.

To use machine learning techniques effectively, you need to know how they work. You can't just use
them without knowing how they work and expect to get good results. Different techniques work for
different kinds of problems, but it's not always clear which techniques will work in a given situation. You
need to know something about the different kinds of solutions. We talk about a very large number of
techniques.

One step in the machine learning cycle is choosing the right machine learning algorithm. So, let's look at
how the machine learning cycle works.

The steps in the machine learning cycle are as follows:

1. Data identification
2. Data preparation
3. Selection of a machine learning algorithm
4. Training the algorithm to develop a model
5. Evaluating the model
6. Deploying the model
7. Performing prediction
8. Assessing the predictions
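
To make the cycle concrete, here is a minimal sketch (not from the original text) using the open-source scikit-learn library and its bundled Iris dataset; it compresses steps 2 to 8 above into a few lines, and the dataset choice is purely illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: identify and prepare the data (Iris is already clean and numeric).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Steps 3-4: select an algorithm and train it to develop a model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: evaluate the model on held-out data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Steps 6-8: deploy (here, simply reuse the object), predict, and assess.
print("predicted class:", model.predict(X_test[:1]))
```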

When your model has reached the point where it can make accurate predictions, you can restart the
process by re-evaluating it using questions such as: Is all of the information important? Do any more
data sets exist that could be used to improve the accuracy of the predictions? You can maintain the usefulness
of your machine-learning-based applications by continually improving the models and
assessing new approaches.

When should you use machine learning? Think about using machine learning when you have a hard task
or problem that involves a lot of data and many different factors but no formula or equation to solve it.
For example, machine learning is a good choice if you need to deal with situations like face and speech
recognition, fraud detection by analysing transaction records, automated trading, energy demand
forecasting, predicting shopping trends, and many more.

When it comes to machine learning, there's rarely a straight line from the beginning to the end. Instead,
you'll find yourself constantly iterating and trying out new ideas and methods.

This unit talks about a step-by-step process for machine learning and points out some important decision
points along the way. The most common problem with machine learning is getting your data in order and
finding the right model. Here are some of the most important things to worry about with the data:

• Data comes in all shapes and sizes : There are many different kinds of data. Datasets from the
real world can be messy, with missing values, and may be in different formats. You might just
have simple numeric data. But sometimes you have to combine different kinds of data, like sensor
signals, text, and images from a camera that are being sent in real time.

• Preprocessing your data might require specialized knowledge and tools : You might need
specialised tools and knowledge to prepare your data before you use it. For example, you need to
know a lot about image processing to choose features to train an object detection algorithm.
Preprocessing needs to be done in different ways for different kinds of data.

• It takes time to find the best model to fit the data : Finding the best model to fit the data
takes time. Finding the right model is like walking a tightrope. Highly flexible models tend to fit
the data too well by trying to explain small differences that could just be noise. On the other hand,
models that are too simple might assume too much. Model speed, accuracy, and complexity are
always at odds with each other.

Does it appear to be a challenge? Try not to let this discourage you. Keep in mind that the process of
machine learning relies on trial and error. You merely go on to the next method or algorithm in the event
that the first one does not succeed. On the other hand, a well-organized workflow will assist you in
getting off to a good start.

Every machine learning workflow begins with three questions:

 What kind of data do you have to work with?
 What do you want to learn from it?
 How and where will those insights be applied?
Your answers to these questions help you decide whether to use a supervised or an unsupervised learning
algorithm; before proceeding to other details, we will discuss these two types of learning algorithms.

Fundamentally Machine learning involves two classes of Learning algorithms:

a) Supervised learning, which requires training a model on data whose inputs and outputs are
already known, so that the model can predict future outputs, such as whether an email is
genuine or spam, or whether a tumor is cancerous. Classification models classify given data into
categories. Medical imaging, speech recognition, and credit scoring are a few examples of
typical applications.
b) Unsupervised learning analyses data to uncover previously unknown patterns or structures. It is
used to infer conclusions from sets of data that contain inputs but no tagged answers. The most
prevalent unsupervised learning method is clustering. Exploratory data analysis is used to
uncover hidden patterns or groups in data. Clustering can be used for gene sequence
analysis, market research, and object recognition.
Note: In semi-supervised learning, algorithms are trained on small sets of labelled data before being
applied to unlabelled data, as in unsupervised learning. This method is frequently used when there is a
dearth of quality labelled data.

MACHINE LEARNING
 Supervised Learning
   - Classification: Support Vector Machines, Discriminant Analysis, Naive Bayes, Nearest Neighbour, Neural Networks
   - Regression: Linear Regression (GLM), SVR, GPR, Ensemble Methods, Decision Trees, Neural Networks
 Unsupervised Learning
   - Clustering (partitioning algorithms): K-Means, K-Medoids, Fuzzy C-Means, Hierarchical, Gaussian Mixture, Neural Networks, Hidden Markov Model

Figure – 3 Machine Learning Algorithms

"How Do You Choose Which Algorithm to Use?" is a crucial question. There are numerous supervised
and unsupervised machine learning algorithms, each with its own learning strategy. This can make
picking the appropriate one difficult. There is no alternative solution or strategy that will work for
everyone. It takes some trial and error to find the proper algorithm. Even the most seasoned data scientists
can't predict whether or not an algorithm would work without putting it to the test. However, the size and
type of data you're working with, the insights you want to gain from the data, and how those insights will
be used all go into the algorithm you choose.

• If you need to train a model to produce a forecast, such as the future value of a continuous variable like
temperature or a stock price, or a classification, such as determining what kind of automobile is in
webcam footage, go with supervised learning.

• If you want to explore your data and train a model to identify an appropriate way to represent it
internally, for example by grouping it, use unsupervised learning.

The purpose of supervised machine learning is to create a model capable of making predictions based on
data even when there is ambiguity. A supervised learning technique trains a model to generate good
predictions about the response to new data using a known set of input data and previous responses to the
data (output).

Using Supervised Learning to Predict Heart Attacks as an Example: Assume doctors want to determine if
someone will suffer a heart attack in the coming year. They have information on former patients' age,
weight, height, and blood pressure. They know if any of the previous patients had heart attacks within a
year. The challenge is to create a model using existing data that can predict if a new person will have a
heart attack in the coming year.

Supervised Learning Techniques: Every supervised learning method may be broken down into one of
two categories: classification or regression. These methods, employed in supervised learning, are used
to develop models that are able to forecast future outcomes.

• Classification techniques : Classification methods make predictions about discrete outcomes,
such as whether an e-mail is genuine or spam, or whether a tumour is cancerous. Classification
models classify incoming data into categories. Medical imaging, speech recognition,
and credit scoring are a few examples of typical applications.

• Regression methods : Predictions can be made with regression algorithms about things like
shifts in temperature or alterations in the quantity of power consumed. The most typical
applications are stock price predicting, handwriting recognition, electricity load forecasting,
acoustic signal processing, and other similar tasks.

Note:

 Is it possible to tag or categorise your data? Use classification techniques if your data can be
divided into distinct groups or classes.

 Working with a collection of data? Use regression techniques if your answer is a real number,
such as the temperature or the time until a piece of equipment fails.

 Binary vs. Multiclass Classification: Before you start working on a classification problem, figure
out whether it's a binary or multiclass problem. A single training or test item (instance) can only
be classified into two classes in a binary classification task, such as determining whether an email
is real or spam. If you wish to train a model to categorise a picture as a dog, cat, or other animal,
for example, a multiclass classification problem might be separated into more than two
categories. Remember that a multiclass classification problem is more difficult to solve since it
necessitates a more sophisticated model. Certain techniques (such as logistic regression) are
specifically intended for binary classification situations. These methods are more efficient than
multiclass algorithms during training.
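To make the classification/regression distinction concrete, here is a small sketch in Python with scikit-learn; the tiny datasets are invented purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Classification: the target is a discrete class (e.g., spam = 1, genuine = 0)
y_class = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[2.5]]))       # -> a class label, 0 or 1

# Regression: the target is a continuous quantity (e.g., a temperature)
y_reg = np.array([10.0, 19.5, 30.2, 41.0])
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[2.5]]))       # -> a real-valued estimate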

Now it is time to talk about the role of algorithms in machine learning. Algorithms are central to how
machine learning works, so the two must be discussed together; they are the heart of learning. In the
world of computing, algorithms have long been used to help us solve hard problems. They are a
set of computer instructions for working with, transforming, and interacting with data. An algorithm can be as
simple as adding up a column or as complicated as recognising anyone's face in a picture.

For an algorithm to work, it must be written as a programme that a computer can understand. Machine
learning algorithms are usually written in either Java, Python, or R. Each of these languages has machine
learning libraries that support a wide range of machine learning algorithms.

Active user communities for these languages share code and talk about ideas, problems, and ways to solve
business problems. Machine learning algorithms are different from other algorithms. Usually, a
programmer starts by writing the algorithm; machine learning turns the process around. With machine
learning, the data itself creates the model. As more data is fed in, the model becomes more refined, and
as the machine learning algorithm receives more and more information, it can produce more
accurate models.

Choosing the right kind of machine learning algorithm is a mix of science and art. If you ask two data
scientists to solve the same business problem, they might do it in different ways. But data scientists can
figure out which machine learning algorithms work best if they know the different kinds. So, the most
important step after getting the data in the right format is to choose the right machine learning algorithm.

As a result of our earlier discussion, we understood that choosing the right algorithm for machine learning
is a process of trial and error. There are also trade-offs between certain characteristics of the algorithms,
such as:

 the amount of time spent in training;
 the amount of memory needed;
 the accuracy with which predictions are made on new data; and
 the level of transparency or interpretability (how easily you can understand the reasons an
algorithm makes its predictions).

Let’s take a closer look at the most commonly used machine learning algorithms.
 Bayesian: Naive Bayes, Gaussian Naive Bayes, Multinomial Naive Bayes, Averaged One-Dependence Estimators (AODE), Bayesian Belief Network (BBN), Bayesian Network (BN)
 Deep Learning: Deep Boltzmann Machine (DBM), Deep Belief Networks (DBN), Convolutional Neural Network (CNN), Stacked Auto-Encoders
 Ensemble: Random Forest, Gradient Boosting Machines (GBM), Boosting, Bootstrapped Aggregation (Bagging), AdaBoost, Stacked Generalization (Blending), Gradient Boosted Regression Trees (GBRT)
 Decision Tree: Classification and Regression Tree (CART), Iterative Dichotomiser 3 (ID3), C4.5, C5.0, Chi-squared Automatic Interaction Detection (CHAID), Decision Stump, Conditional Decision Trees, M5
 Neural Networks: Radial Basis Function Network (RBFN), Perceptron, Back-Propagation, Hopfield Network
 Regularization: Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), Elastic Net, Least Angle Regression (LARS)
 Dimensionality Reduction: Principal Component Analysis (PCA), Partial Least Squares Regression (PLSR), Sammon Mapping, Multidimensional Scaling (MDS), Projection Pursuit, Principal Component Regression (PCR), Partial Least Squares Discriminant Analysis, Mixture Discriminant Analysis (MDA), Quadratic Discriminant Analysis (QDA), Regularized Discriminant Analysis (RDA), Flexible Discriminant Analysis (FDA), Linear Discriminant Analysis (LDA)
 Rule System: Cubist, One Rule (OneR), Zero Rule (ZeroR), Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
 Regression: Linear Regression, Ordinary Least Squares Regression (OLSR), Stepwise Regression, Multivariate Adaptive Regression Splines (MARS), Locally Estimated Scatterplot Smoothing (LOESS), Logistic Regression
 Instance Based: k-Nearest Neighbour (kNN), Learning Vector Quantization (LVQ), Self-Organizing Map (SOM), Locally Weighted Learning (LWL)
 Clustering: k-Means, k-Medians, Expectation Maximization, Hierarchical Clustering

Figure – 4 : Types of machine learning algorithms


A brief discussion of the main types of machine learning algorithms is given below:

 Bayesian: Bayesian algorithms let data scientists encode prior beliefs about what models should
look like, regardless of what the data shows. Given how much attention is devoted to how the data
shapes the model, you might ask why anyone would be interested in Bayesian algorithms. The
answer is that Bayesian techniques come in handy when you don't have much data to work with.

If you already knew something about a part of the model and could code that part directly, a
Bayesian algorithm might make sense. Consider a medical imaging system that looks for signs of
lung disease. These estimates can be incorporated into the model if a study published in a journal
calculates the likelihood of various lung diseases based on a person's lifestyle.
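As a hedged sketch of a Bayesian classifier, the following trains Gaussian Naive Bayes in scikit-learn; the tiny patient-style dataset is invented purely for illustration:

from sklearn.naive_bayes import GaussianNB

# features: [age, resting blood pressure]; labels: 1 = disease, 0 = healthy
X = [[63, 145], [37, 130], [41, 130], [56, 120], [57, 140], [44, 120]]
y = [1, 0, 1, 0, 1, 0]

model = GaussianNB().fit(X, y)
print(model.predict([[50, 135]]))        # predicted class for a new patient
print(model.predict_proba([[50, 135]]))  # posterior probabilities per class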

 Clustering : Clustering is an easy-to-understand approach: objects with comparable properties
are grouped together (in a cluster). A cluster's contents are more similar to each other than to
those of other clusters. Because the data are not labelled, clustering is a form of unsupervised
learning. Based on the parameters, the algorithm determines the characteristics of each item and
assigns it to the appropriate group.
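A short k-means clustering sketch (the 2-D points and scikit-learn are illustrative assumptions):

from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)            # cluster assigned to each unlabelled point
print(kmeans.cluster_centers_)   # the learned cluster centres
print(kmeans.predict([[0, 0]]))  # assign a new point to the nearest cluster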

 Decision tree : Decision tree algorithms show what will happen when a choice is made by using
a structure with branches. Decision trees can be used to show all the possible outcomes of a
choice. A decision tree shows all the possible outcomes at each branch. The likelihood of the
outcome is shown as a percentage for each node.

Sometimes, online sales use decision trees. You might want to figure out who is most likely to
use a 50% off coupon before sending it to them. Customers can be split into four groups:

a) Customers who are likely to use the code if they get a personal message.
b) Customers who will buy no matter what.
c) Customers who will never buy.
d) Customers who are likely to be upset if someone tries to reach out to them.

If you send out a campaign, you obviously do not want to target three of the four groups, since they
will either ignore it or respond negatively. You will get the best return on investment (ROI) if you go
after the first group: the customers who are likely to respond.

A decision tree will assist you in identifying these four client categories and organising prospects
and customers according to who will respond best to the marketing campaign.

 Dimensionality reduction : Dimensionality reduction allows systems to eliminate redundant
data. These approaches remove data that is redundant, made up of outliers, or otherwise useless.
Sensor data and other IoT use cases can benefit from dimensionality reduction: the status of a
sensor in an IoT system can be communicated using thousands of data points, and storing and
analysing readings that merely repeat that a sensor is "on" wastes storage and processing.
Furthermore, minimising redundant data improves machine learning system performance.
Finally, data visualisation also benefits from dimensionality reduction.
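A minimal dimensionality-reduction sketch with principal component analysis (PCA); the iris dataset is an illustrative stand-in for high-dimensional sensor data:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # shape (150, 4)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)              # shape (150, 2): redundant dimensions removed

print(pca.explained_variance_ratio_)     # information retained by the 2 components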

 Instance based : Instance-based algorithms are used to classify new data points based on training
data. Because there is no real training phase, these algorithms are called "lazy learners." Instead,
instance-based algorithms compare new data to the training data and classify it based on how
similar it is. Data sets with random variation, irrelevant attributes, or missing values are not well
suited to instance-based learning. However, these algorithms can be quite good at finding
patterns. For example, instance-based learning is used in spatial and chemical structure analysis;
many instance-based algorithms are used in biology, pharmacology, chemistry, and engineering.
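A minimal instance-based sketch with k-nearest neighbours in scikit-learn; note that fit() essentially just stores the training examples:

from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # 'lazy learner': stores the data
print(knn.predict([[1.1]]))   # label chosen by a majority of the 3 nearest neighbours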

 Neural networks and deep learning : A neural network is an artificial intelligence system that
attempts to solve problems in the same way that the human brain does. This is accomplished by
the utilisation of many layers of interconnected units that acquire knowledge from data and infer
linkages. In a neural network, the layers can be connected to one another in various ways. When
referring to the process of learning that takes place within a neural network with multiple hidden
layers, the term "deep learning" is frequently used. Models built with neural networks are able to
adapt to new information and gain knowledge from it. Neural networks are frequently utilised in
situations in which the data in question is not tagged or is not organised in a particular fashion.
The field of computer vision is quickly becoming one of the most important applications for
neural networks. Today, one can find applications for deep learning in a diverse range of
contexts.

The process of deep learning is utilised to assist self-driving autos in figuring out what is going
on in their surroundings. Deep learning algorithms analyse the unstructured data that is being
collected by the cameras as they capture pictures of the environment around them. This allows the
system to make judgments in what is essentially real time. The apps that radiologists use to better
analyse medical images also include deep learning as an integral part of their design.

 Linear regression : Regression algorithms are important in machine learning and are often used
for statistical analysis. Regression algorithms help analysts figure out how data points are related.

Regression algorithms can measure how strongly two variables in a set of data are linked to each
other. Regression analysis can also be used to predict future data values based on their past
values. But it is important to remember that correlation does not imply causation: regression
analysis can lead to wrong conclusions if you do not understand the context of the data.
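A small regression sketch; the yearly sales figures are invented purely for illustration:

from sklearn.linear_model import LinearRegression

years = [[2018], [2019], [2020], [2021]]
sales = [120, 135, 149, 166]

model = LinearRegression().fit(years, sales)
print(model.coef_, model.intercept_)   # strength and shape of the linear relationship
print(model.predict([[2022]]))         # predicted future value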

 Regularization to avoid over-fitting : Regularisation is the process of modifying models so that
they no longer fit the training data too closely. Any model used for machine learning can benefit
from regularisation; for example, you can regularise a decision tree model. Models that are
excessively intricate and tend to overfit can become easier to grasp with the help of
regularisation. A model that has been overfit to the available data will fail to produce accurate
predictions when new data sets are given to it: overfitting is when a model developed for a
particular set of data is unable to produce accurate predictions when applied to a more general
set of data.
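A hedged sketch of regularisation in scikit-learn: Ridge and Lasso penalise large weights so a flexible model cannot chase noise (the data is synthetic, and only the first feature actually matters):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                        # 10 features, 9 irrelevant
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=30)     # target depends on feature 0 only

print(Ridge(alpha=1.0).fit(X, y).coef_.round(2))     # all weights shrunk
print(Lasso(alpha=0.1).fit(X, y).coef_.round(2))     # irrelevant weights driven to 0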

 Rule-based machine learning : Rule-based machine learning algorithms describe data with the
help of rules about relationships. A rule-based system is different from a machine learning
system, which builds a model that can be used on all the data. Rule-based systems are easy to
understand in general: if X data is put in, do Y. A rule-based approach to machine learning, on
the other hand, can become very complicated as systems grow. For example, a system might
start with 100 predefined rules; as the system receives more and more data and learns how to
use it, the rule base is likely to grow by hundreds of rules. When building a rule-based approach,
it is important to make sure it does not become so complicated that it stops being clear.

Think about how hard it would be to make an algorithm based on rules to apply the GST codes.

Check your progress - 2

Q3. Discuss the various phases of Machine Learning.

………………………………………………………………………………………………………………
……………………………………………………………………

Q4 When should we use machine learning ?

………………………………………………………………………………………………………………
……………………………………………………………………

Q5 Compare the concept of Classification, Regression and Clustering? List the algorithms in respective
categories.

………………………………………………………………………………………………………………
……………………………………………………………………

9.4 REINFORCEMENT LEARNING AND ALGORITHMS

According to what we observed in the previous section, learning can be broken down into three main
categories: supervised, unsupervised, and semi-supervised. However, in addition to these categories,
there are also other types of learning, such as reinforcement learning (RL), deep learning (DL), adaptive
learning, and so on.

The graph shown below depicts the various branches and sub-branches of machine learning, including
the various algorithms involved in each sub-branch. Let us understand them in brief, as full coverage
of these machine learning techniques is beyond the scope of this unit. We will begin our discussion
with reinforcement learning.

 CLASSICAL LEARNING
   - Supervised
     Classification: K-NN, SVM, Decision Trees, Logistic Regression, Naïve Bayes
     Regression: Linear Regression, Polynomial Regression, Ridge/Lasso Regression
   - Unsupervised
     Clustering: K-Means, Mean-Shift, DBSCAN, Agglomerative, Fuzzy C-Means
     Pattern Search: Euclat, Apriori, Fp-Growth
     Dimension Reduction (generalization): t-SNE, PCA, LSA, LDA, SVD
 REINFORCEMENT LEARNING: Genetic Algorithm, Q-Learning, SARSA, Deep Q-Network (DQN), A3C
 ENSEMBLE METHODS: Stacking, Bagging (Random Forest), Boosting (XGBoost, LightGBM, AdaBoost, CatBoost)
 NEURAL NETS AND DEEP LEARNING: Perceptrons (MLP), Convolutional Neural Networks (CNN, DCNN), Autoencoders, Recurrent Neural Networks (RNN: LSTM, GRU, LSM), seq2seq, Generative Adversarial Networks (GAN)

In Reinforcement Learning (RL), algorithms get a set of instructions and rules and then figure
out how to handle a task by trying things out and seeing what works and what doesn't. To help
the AI find the best way to solve a problem, decisions are either rewarded or punished. Through
reinforcement learning, machine learning models are taught to make a series of decisions; the
setup consists of an Agent interacting with an Environment.

Reinforcement Learning (RL) is a type of Machine Learning in which the agent receives a delayed
reward in the next time step as an evaluation of how well it did in the previous time step. It was
mostly used in games, like Atari and Mario, where it could do as well as or better than a person.
Since Neural Networks have been added to the algorithm, it has been able to do more complicated tasks.

In reinforcement learning, an AI system is put in a game-like situation (i.e. a simulation).
The AI system tries until it finds a solution to the problem. Slowly but surely, the agent learns
how to reach a goal in an uncertain, potentially complicated environment, but we cannot expect the
agent to stumble upon the perfect solution by accident. This is where the interactions come into play:
the Agent is provided with the State of the Environment, which becomes the input/basis for the
Agent to take an Action. An Action first gives the Agent a Reward. (Note that rewards can be both
positive and negative depending on the fitness function for the problem.) Based on this reward,
the Policy (ML model) inside the Agent adapts and learns. Second, the Action affects the Environment
and changes its State, which means the input for the next cycle changes.

This cycle will keep going until the best Agent is created. This cycle tries to imitate the way that
organisms learn over the course of their lives. Most of the time, the Environment is reset after a
certain number of cycles or if something goes wrong. Note that you can run more than one Agent
at the same time to get to the solution faster, but each Agent runs on its own, independently.

Reinforcement Learning (RL) refers to a kind of Machine Learning method in which the agent receives a
delayed reward in the next time step to evaluate its previous action. It was mostly used in games.
Typically, a RL setup is composed of two components, an agent and an environment.

[Figure: the agent-environment loop. The Agent performs an Action on the Environment; the Environment returns a Reward or Penalty together with the Next State to the Agent.]
The following are the meanings of the different parts of reinforcement learning:

1. AGENT : The agent is the entity that learns and makes decisions.

2. ENVIRONMENT : The agent's environment is where it learns and decides what to do.

3. ACTION : The set of things that the agent can do.

4. STATE : The agent's current situation in its environment.

5. REWARD : The environment gives the agent a reward for each action they choose. Usually a scalar
value.

6. POLICY : Policy is the agent's way of deciding what to do (its control strategy), which is a mapping
from situations to actions.

7. VALUE FUNCTION : A mapping from states to real numbers, where the value of a state is the long-term
reward that can be earned by starting in that state and following a certain policy.

8. FUNCTION APPROXIMATOR : The problem of inferring a function from training examples.
Decision trees, neural networks, and nearest-neighbour methods are all examples of standard
approximators.

9. MODEL : The agent's view of the environment, which maps state-action pairs to probability
distributions over states. Note that not every agent that learns from its environment uses a model of its
environment.
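The interaction between these components can be sketched as a loop; the environment, reward, and policy below are toy placeholders rather than a real RL algorithm:

import random

state = 0                                     # STATE: the agent's situation
for step in range(10):
    action = random.choice([-1, +1])          # POLICY: here, act at random
    next_state = state + action               # ENVIRONMENT: state transition
    reward = 1 if next_state > state else -1  # REWARD: scalar feedback
    # a learning agent would update its policy / value function here
    state = next_state
print("final state:", state)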

In spite of the fact that there is a large number of RL algorithms, it does not appear that there is
a comparison that is exhaustive of each of them. It is quite challenging to determine which
algorithms should be used for which type of activity. This section will attempt to provide an
introduction to several well-known algorithms.

RL algorithms may be grouped as follows:

 Model-free RL: Policy Optimisation, Q-Learning
 Model-based RL: Learn the Model, the Model is given

The algorithms for reinforcement learning may be broken down into two broad categories:
model-free and model-based. In this section, we will analyse the key differences between these
two types of reinforcement learning algorithms.

Model-Free vs Model-Based RL: In model-based RL, the model is used to perform a simulation of
the dynamic processes that take place in the environment. In other words, the model learns the
transition probability T(s1 | s0, a) from the present state s0 and action a to the next state s1. If the
agent is able to successfully learn the transition probability, then the agent will be aware of how
probable it is to reach a particular state given the present state and action. On the other hand, as
the state space and the action space grow, model-based algorithms become less practical.

On the other hand, model-free algorithms acquire new information through an iterative process
of trial and error. As a consequence of this, it does not need any additional space in order to store
every possible combination of states and actions.

Within the realm of Model-Free RL, policy optimisation serves as a subclass, and it distinguishes
two sorts of policies, i.e. On-Policy vs Off-Policy: an on-policy agent learns the value based on its
current action "a", derived from the current policy, whereas its off-policy counterpart learns the
value based on the action "a*" obtained from another policy. In Q-learning, this other policy is
referred to as the greedy policy.

The Q-learning or value-iteration methods are the next subcategory included in Model-Free RL.
Q-learning is responsible for learning the action-value function: how advantageous would it be to
perform a certain action in a certain state? In its most basic form, the action "a" receives a scalar
value that is determined by the state "s". The algorithm is summarised in the following chart.

1. Initialise the Q table.
2. Choose an action a.
3. Perform the action.
4. Measure the reward.
5. Update Q.
6. Repeat steps 2 to 5; at the end of the training, a good Q* table is obtained.
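The chart above can be turned into a runnable sketch. The following tabular Q-learning example uses a toy 5-cell corridor where cell 4 is the goal; the environment and all hyperparameters are illustrative assumptions:

import numpy as np

n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))        # step 1: initialise the Q table
alpha, gamma, eps = 0.5, 0.9, 0.5          # learning rate, discount, exploration

for episode in range(300):
    s = 0
    for _ in range(100):                   # cap the episode length
        # step 2: choose an action (epsilon-greedy on the current Q table)
        a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
        # step 3: perform the action in the toy corridor
        s2 = max(0, s - 1) if a == 0 else min(4, s + 1)
        # step 4: measure the reward (reaching cell 4 pays +1)
        r = 1.0 if s2 == 4 else 0.0
        # step 5: update Q towards the greedy (off-policy) target
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
        if s == 4:
            break

print(Q.round(2))   # 'right' should dominate 'left' in every state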

Lets extend our discussion to some more Reinforcement Learning Algorithms i.e. DQN and
SARSA
Deep Q Neural Network (DQN): DQN is Q-learning with Neural Networks. The motivation behind
it relates to large state-space environments, where defining a Q-table would be a very complex,
challenging and time-consuming task. Instead of a Q-table, Neural Networks approximate the
Q-values for each action based on the state.
State–action–reward–state–action (SARSA): The SARSA algorithm is a slight variation of the
popular Q-Learning algorithm. For a learning agent in any Reinforcement Learning algorithm,
its policy can be of two types:

 On Policy: In this, the learning agent learns the value function according to the current action
derived from the policy currently being used.

 Off Policy: In this, the learning agent learns the value function according to the action derived
from another policy.

The Q-Learning technique is considered an Off-Policy technique, as it employs the greedy
learning strategy in order to acquire knowledge of the Q-value. On the other hand, the SARSA
approach is On-Policy: it makes use of the action that is being performed by the current policy
in order to learn the Q-value.
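Reusing the toy corridor from the previous sketch, the SARSA version below differs from Q-learning only in its update target, which uses the action the current policy actually takes next (all choices remain illustrative):

import numpy as np

def env_step(s, a):                        # same toy corridor as above
    s2 = max(0, s - 1) if a == 0 else min(4, s + 1)
    return s2, (1.0 if s2 == 4 else 0.0)

def choose(Q, s, eps=0.5):                 # epsilon-greedy behaviour policy
    return np.random.randint(2) if np.random.rand() < eps else int(Q[s].argmax())

Q = np.zeros((5, 2))
alpha, gamma = 0.5, 0.9
for episode in range(300):
    s, a = 0, choose(Q, 0)
    for _ in range(100):
        s2, r = env_step(s, a)
        a2 = choose(Q, s2)                 # the action actually taken next
        # SARSA target uses Q[s2, a2] (on-policy), not max Q[s2] (off-policy)
        Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
        s, a = s2, a2
        if s == 4:
            break

print(Q.round(2))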

Text mining, facial recognition, city planning, and targeted marketing are applications that
implement unsupervised learning algorithms. In a similar manner, the classification methods that
fall under the supervised learning umbrella have applications in the areas of fraud detection,
spam detection, diagnostics, picture classification, and score prediction. Similarly, reinforcement
learning has a wide range of applications in a variety of fields, including the gaming industry,
manufacturing, inventory management, and the financial sector, among many others.

Check Your Progress – 3

Q6 What is Reinforcement Learning ? List the components involved in it.


………………………………………………………………………………………………………
……………………………………………………………
Q7 Briefly discuss the various algorithms of Reinforcement Learning.
………………………………………………………………………………………………………
……………………………………………………………
9.5 DEEP LEARNING AND ALGORITHMS

Deep learning is a type of machine learning that uses artificial neural networks and representation
learning. It is also called deep structured learning or differentiable programming. Deep learning is a
way for machines to learn through deep neural networks. It is used widely to solve practical problems
in fields like computer vision (image), natural language processing (text), and automated speech
recognition (audio). Machine learning is often thought of as a toolbox of several algorithms; deep
learning is actually just the subset of approaches that mostly use neural networks, which are a type of
algorithm loosely based on the human brain.

A deep learning model learns to solve classification tasks directly from images, text, or sound. A neural
network architecture is commonly used to implement deep learning. The number of layers in a network
defines the depth of the network; the more layers, the deeper the network. Traditional neural networks
have two or three layers, whereas deep neural networks include hundreds.

Deep learning is especially well-suited to identification applications such as face recognition, text
translation, voice recognition, and advanced driver assistance systems, including lane classification
and traffic sign recognition.

Relation Between Machine learning , Deep Learning and Artificial Intelligence

[Figure: a Venn diagram in which Deep Learning is a subset of Machine Learning, which in turn is a subset of Artificial Intelligence.]

As seen in the diagram above, machine learning (ML), deep learning (DL), and artificial intelligence (AI)
are all related. Deep Learning is a collection of algorithms, inspired by the workings of the human brain
in processing data and creating patterns for use in decision making, that expand and refine a single model
architecture termed the Artificial Neural Network (ANN). Later in this course, we shall go deeper into
neural networks. For now, a quick overview of neural networks is provided below, followed by a
discussion of the various Deep Learning algorithms, such as CNN, RNN, Auto-Encoders, GAN, and others.

Neural Networks: Just like the human brain, Neural Networks consist of Neurons. Each Neuron
takes in signals as input, multiplies them by weights, adds them together, and then applies a non-
linear function. These neurons are arranged in layers and stacked close to each other.
Neural Networks have proven to be effective function approximators. We can presume that every
behaviour and system can be represented mathematically at some point (sometimes by an incredibly
complex function). If we can find that function, we will know everything there is to know about the
system. However, locating the function can be difficult. As a result, we must use Neural
Networks to estimate it.

A deep neural network is one that incorporates several nonlinear processing layers, makes use of
simple pieces that work in parallel, and takes its cues from the biological nervous systems of
living things. There is an input layer, numerous hidden layers, and an output layer that make up
this structure. Each hidden layer takes as its input the information that was output by the layer
that came before it and is connected to the other layers via nodes, also known as neurons.

To understand the basic deep neural networks we need to have brief understanding of various
algorithms, the same are given below:

Back Propagation Neural Networks : Back propagation is an iterative
process that allows neural networks to learn the desired function by utilising large quantities of
data and learning from their previous mistakes. We feed the network data, and in return, it
provides us with an output. We start by comparing the output to what we want using a loss
function. Then, based on the gap that we find, we iteratively adjust the weights of the various
variables. This non-linear optimisation procedure, used to make the necessary modifications to
the weights, is termed stochastic gradient descent.
After some time, the network improves its ability to produce the desired output, and the training
is complete: we have come close to approximating our function. From then on, if we give the
network an input for which we do not know the corresponding output, it will provide an answer
based on the approximated function.
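A minimal numeric sketch of this forward-and-backward cycle, written in plain NumPy: a one-hidden-layer network trained by back-propagation and gradient descent to learn XOR (the architecture, learning rate, and iteration count are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)      # desired outputs (XOR)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # hidden layer weights/biases
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # output layer weights/biases
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                 # forward pass: hidden layer
    y = sigmoid(h @ W2 + b2)                 # forward pass: output layer
    d_out = (y - t) * y * (1 - y)            # error at the output (loss gradient)
    d_hid = (d_out @ W2.T) * h * (1 - h)     # error propagated backwards
    # gradient-descent weight updates (learning rate 0.5)
    W2 -= 0.5 * (h.T @ d_out)
    b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * (X.T @ d_hid)
    b1 -= 0.5 * d_hid.sum(axis=0, keepdims=True)

print(y.round(2))   # should approach [[0], [1], [1], [0]]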

To understand further, let's look at an example. Say we need to recognise pictures that contain
a tree. Photos are input into the network, and the system produces results. We evaluate the
results against the known answers and adjust the network accordingly.

As more photographs are passed through the network, the number of errors decreases. We can
then feed it an unknown image, and it will tell us whether or not the image contains a tree.

Feed-forward neural networks (FNN) : Typically, feed-forward neural networks, also known as FNNs,
are fully connected, which means that each neuron in one layer is connected to every neuron in the next
layer. The structure described here is called a "Multilayer Perceptron". A multilayer perceptron can
learn non-linear associations in the data, in contrast to a single-layer perceptron, which can only learn
linearly separable patterns. FNNs perform exceptionally well on tasks like classification and regression.
Contrary to other machine learning algorithms, they do not converge so easily; however, the more data
they have, the higher their accuracy.
Convolutional Neural Networks (CNN) : The term "convolution" refers to the function that is utilised by
convolutional neural networks (CNN). The idea that underlies them is that rather than linking each neuron
with all of the ones in the next layer, we connect it only with a select few (its receptive field). They strive
to regularise feed-forward networks in order to avoid overfitting, which is when the model is unable to
generalise its findings since it can only learn from the data it has already seen. Because of this, they are
particularly skilled at capturing spatial relationships in the data. As a result, computer vision is their
primary application, which includes image classification, video identification, medical image analysis,
and self-driving automobiles. These are the types of tasks where they achieve near-superhuman results.

Due to their adaptability, they are also ideal for merging with other types of models, such as Recurrent
Networks and Auto-encoders. The recognition of sign languages is one such example.

[Figure: face recognition based on a convolutional neural network.]

Recurrent Neural Networks (RNN) are utilised in time series forecasting because they are
ideal for time-related data. They employ some type of feedback, in which the output is fed back
into the input. You can think of it as a loop that passes data back to the network from the output
to the input. As a result, they are able to recall previous data and use it to make predictions.

Researchers have transformed the original neuron into more complicated structures such as GRU
units and LSTM Units to improve performance. Language translation, speech production, and
text to speech synthesis have all employed LSTM units extensively in natural language
processing.
Recursive Neural Networks : Another type of recurrent network is the recursive neural
network, which is set up in a tree-like manner. As a result, they can model the hierarchical
structure of the training dataset.

They're frequently utilised in NLP applications like audio-to-text transcription and sentiment
analysis because they're related to binary trees, contexts, and natural-language-based parsers.
They are, however, typically much slower than Recurrent Networks.

Auto-Encoders (Auto Encoder Neural Networks) are a type of unsupervised technique that
is used to reduce dimensionality and compress data. Their technique is to try and make the
output equal to the input. They are attempting to recreate the data.

An encoder and a decoder are included in Auto-Encoders. The encoder receives the input
and encodes it in a lower-dimensional latent space. Whereas, the decoder is used
to decode that vector back to the original input.

[Figure: an auto-encoder. The Encoder compresses the Input into a low-dimensional Code; the Decoder reconstructs the Output from that Code.]
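A hedged auto-encoder sketch in Keras (this assumes TensorFlow is installed; the 64-dimensional random data and the 8-dimensional code size are illustrative):

import numpy as np
from tensorflow import keras

inputs = keras.Input(shape=(64,))
code = keras.layers.Dense(8, activation="relu")(inputs)       # encoder: 64 -> 8
outputs = keras.layers.Dense(64, activation="sigmoid")(code)  # decoder: 8 -> 64

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, 64)
autoencoder.fit(X, X, epochs=5, verbose=0)   # the target equals the input

encoder = keras.Model(inputs, code)          # reuse the trained encoder alone
print(encoder.predict(X[:1]).shape)          # (1, 8): the compressed representation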

Restricted Boltzmann Machines (RBM) are stochastic neural networks that can learn a
probability distribution over their inputs, and so have generative capabilities. They differ from
other networks in that they only have input and hidden layers (no output layer).

They take the input and create a representation of it in the forward phase of the training. They
rebuild the original input from the representation in the backward pass. (This is similar to
autoencoders, but in a single network.)

Several RBMs are piled on top of each other to form a Deep Belief Network. They have the
same appearance as Fully Connected layers, but they are trained differently.
Generative Adversarial Networks (GANs): Ian Goodfellow introduced Generative Adversarial
Networks (GANs) in 2014, and they are built on a basic but elegant idea: you need to create
data, such as photos. What exactly do you do?

You must construct two models. You teach the first one to make up fake data (the generator) and
the second one to tell the difference between real and fake data (the discriminator). And you turn
them against one another.

The generator gets better and better at image production, as its ultimate purpose is to
mislead the discriminator. The discriminator, whose purpose is to avoid being tricked, improves
its ability to tell fake images from real ones. As a result, we end up with extremely realistic fake
data from the generator.

Video games, astronomical imagery, interior design, and fashion are all examples of Generative
Adversarial Networks in action. Essentially, you can utilise GANs if you have photos in your
field. Do you recall Deep Fakes? Those were all created by GANs.

Transformers are also very new, and they are mostly employed in language applications
because recurrent networks are becoming obsolete. They are based on the concept of "attention,"
which instructs the network to focus on a certain data piece.

Instead of complicating LSTM units, you may use Attention mechanisms to assign varying
weights to different regions of the input based on their importance. The attention mechanism is
simply another weighted layer whose sole purpose is to change the weights such that some parts
of the inputs are given greater weight than others.

In actuality, transformers are made up of stacked encoders (the encoder layer) and stacked
decoders (the decoder layer), including several attention layers (self-attentions and
encoder-decoder attentions).

Graph Neural Networks: Deep Learning does not operate well with unstructured data in
general. And there are many circumstances in which unstructured data is organised as a graph in
the actual world. Consider social networks, chemical molecules, knowledge graphs, and location
information.

Graph Neural Networks are used to model graph data. This means they locate the connections
between nodes in a graph and convert them into numeric representations, as if each were an
embedding. As a result, the output can be used in any other machine learning model to perform
tasks such as clustering, classification, and so on.

Check Your Progress – 4


Q8 What is Deep Learning? How does Deep Learning relate to AI & ML?
………………………………………………………………………………………………………
……………………………………………………………
Q9 Briefly discuss the various algorithms of Deep Learning.
………………………………………………………………………………………………………
……………………………………………………………

9.6 ENSEMBLE METHODS

Ensemble learning is a general meta approach to machine learning that combines predictions
from different models to improve predictive performance.

Although you can create an apparently infinite number of ensembles for any predictive
modelling problem, the subject of ensemble learning is dominated by three methods: bagging,
stacking, and boosting. They are the three primary classes of ensemble learning methods, and it is
essential to understand each one thoroughly.

• Bagging Ensemble learning is the process of fitting multiple decision trees to various
samples of the same dataset and averaging the results.
• Stacking Ensemble learning is fitting multiple types of models to the same data and
then using another model to learn how to combine the predictions in the best way
possible.
• Boosting Ensemble Learning entails successively adding ensemble members that
correct prior model predictions and produce a weighted average of the predictions.

Now let’s discuss each of the learning method in some detail

(I) Bagging Ensemble learning Bagging ensemble learning involves fitting numerous decision trees to
various samples of the same dataset, and then averaging the results of those tree fittings to provide a final
prediction.

In most cases, this is accomplished by making use of a single machine learning method, nearly
invariably an unpruned decision tree, and by training each model on a separate sample from the same
training dataset. After that, straightforward statistical approaches such as voting or averaging are used
in order to aggregate the predictions generated by each individual member of the ensemble.

The manner in which each individual data sample is prepared to train members of the ensemble
constitutes the most essential component of the technique. Every model receives its own unique,
customised portion of the dataset to train on. Rows (examples) are selected at random from the
dataset, with replacement.

When a row is selected, it is returned to the dataset it was drawn from, so that it can be selected
once more for the same training dataset. This means that within a specific training dataset, a row of
data may be selected zero times, one time, or multiple times.

This type of sample is known as a bootstrap sample. In the field of statistics, this approach is a way for
estimating the statistical value of a limited data sample. It is typically applied to somewhat limited data
sets. You can get a better overall estimate of the desired quantity if you make a number of distinct
bootstrap samples, estimate a statistical quantity, and then determine the average of the estimates. This is
in comparison to the situation in which you would just estimate the quantity based on the dataset.
In the same way, several training datasets can be compiled, put to use in the process of estimating a
predictive model, and then put to use in order to produce predictions. The majority of the time, it is
preferable to take the average of the predictions made by all of the models rather than to fit a single model
directly to the dataset used for training.
The following is a concise summary of the most important aspects of bagging:
• Take samples of the training dataset using bootstrapping.
• Unpruned decision trees fit on each sample.
• Voting or taking the average of all the predictions.
In a nutshell, bagging has an effect because it modifies the training data that is used to fit each individual
member of the ensemble. This results in skillful but unique models.
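A minimal bagging sketch in scikit-learn: fifty trees fit on bootstrap samples of the same data, with predictions combined by voting (the iris dataset and the parameter choices are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bag = BaggingClassifier(
    DecisionTreeClassifier(),   # the base learner: a decision tree
    n_estimators=50,            # 50 bootstrap samples, 50 trees
    bootstrap=True,             # sample rows with replacement
    random_state=0,
).fit(X, y)
print(bag.predict(X[:3]))       # combined (voted) predictions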

[Figure: bagging. The Input (X) is bootstrapped into Sample 1, Sample 2, Sample 3, ...; a tree (Tree 1, Tree 2, Tree 3, ...) is fit to each sample; their predictions are combined into the Output (Y).]
Bagging ensemble learning
It is a comprehensive strategy that is simple to expand upon. For instance, additional alterations
can be made to the dataset that was used for training, the method that was used to fit the training
data can be modified, and the manner in which predictions are constructed can be altered.

Many popular ensemble algorithms are based on this approach, including:

 Bagged Decision Trees (canonical bagging)


 Random Forest
 Extra Trees

(II) Stacking Ensemble learning: Stacked Generalization, often known as "stacking" for short, is
an ensemble strategy that seeks a diverse group of members by varying the types of models that
are fitted to the training data and utilising a model to aggregate the predictions. It requires fitting
various kinds of models to the same data, and then using another model to find out how to
integrate the predictions in the best way possible.

There is a specific vocabulary for stacking. The individual models that comprise an ensemble are
referred to as level-0 models, whereas the model that integrates all of the predictions is referred
to as a level-1 model.

Although there are often only two levels of models applied, you are free to apply as many levels
as you see fit. For instance, instead of a single level-1 model, we might have three or five level-1
models and a single level-2 model that integrates the forecasts of level-1 models to generate a
prediction. This would allow us to make more accurate predictions.

It is possible to integrate the predictions using any machine learning model, but the majority of
users prefer linear models, such as linear regression for regression and logistic regression for
binary classification. Because of this, it is more likely that the more difficult components of the
model will be included in the lower-level ensemble member models, and that straightforward
models will be used to learn how to apply the various predictions.

The key elements of stacking are summarized below:


 Unchanged training dataset.
 Separate machine learning algorithms for respective ensemble member.
 Machine learning model to learn how to combine predictions in the best way.
The diversity of the ensemble is a direct result of the many diverse machine learning models that
serve as the ensemble's members.

As a consequence of this, it is recommended to make use of a variety of models that can be learnt
or constructed in a wide variety of methods. Because of this, it is ensured that they will make
separate assumptions, and as a consequence, it is less probable that their errors in prediction
would be linked to one another.
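A minimal stacking sketch in scikit-learn, with two diverse level-0 models and a logistic regression as the level-1 combiner (all model and dataset choices are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier())],        # level-0 members
    final_estimator=LogisticRegression(max_iter=1000),   # level-1 combiner
).fit(X, y)
print(stack.predict(X[:3]))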
[Figure: stacking. The same Input (X) is fed to Model 1, Model 2, Model 3, ... (the level-0 members); a further Model (level-1) combines their predictions into the Output (y).]
Stacking Ensemble Learning
Many popular ensemble algorithms are based on this approach, including:
 Stacked Models (canonical stacking)
 Blending
 Super Ensemble

(III) Boosting Ensemble learning: Boosting is an ensemble strategy that alters the training data to
focus on examples that earlier fitted models got wrong. It works by adding members to the
ensemble one at a time; each new member refines the predictions produced by the models that
came before it, and the final result is a weighted average of the predictions.

The fact that boosting ensembles can correct errors in forecasts is their single most important
advantage. The models are fitted and added to the ensemble one at a time, which means the
second model attempts to correct the errors of the first, and so on.

The majority of the time, this is accomplished using weak learners, which are relatively
straightforward decision trees that only make a single or a few decisions at a time. The forecasts
of the weak learners are merged by simple voting or by average, but the importance of each
learner's input is weighted according to how well they performed or how much they know. The
objective is to create a "strong-learner" out of a number of "weak-learners," each of which was
designed to accomplish a particular task.

Most of the time, the training dataset is left unchanged; instead, the learning algorithm is
adjusted to pay more or less attention to certain examples (rows of data) depending on how well
they were predicted by ensemble members added earlier. For instance, a weight can be assigned
to each row of data to indicate how much focus the learning algorithm should place on it while
fitting the model.

The key elements of boosting are summarized below:

• Give more weight to examples that are hard to guess when training.
• Add members of the ensemble one at a time to correct the predictions of earlier models.
• Use a weighted average of models to combine their predictions.

The idea of turning a group of weak learners into a single strong learner was first proposed in
theory, and many algorithms were tried without much success. It was not until the Adaptive
Boosting (AdaBoost) algorithm was developed that boosting was shown to be an effective way
to combine a group of methods.

Since AdaBoost, many boosting methods have been made, and some, like stochastic gradient
boosting, may be among the best ways to use tabular (structured) data for classification and
regression.
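A minimal boosting sketch with AdaBoost (canonical boosting) in scikit-learn; by default its weak learners are decision stumps, and the dataset is illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)
# stumps are added one at a time, each reweighting the hard examples
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(boost.predict(X[:3]))   # weighted combination of the weak learners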
[Figure: boosting. Weighted Sample 1 trains Model 1, Weighted Sample 2 trains Model 2, Weighted Sample 3 trains Model 3, and so on; the models' predictions are combined into the Output (y).]

Boosting Ensemble Learning

To summarize, many popular ensemble algorithms are based on this approach, including:

 AdaBoost (canonical boosting)


 Gradient Boosting Machines
 Stochastic Gradient Boosting (XGBoost and similar)

This completes our tour of the standard ensemble learning techniques.

Check Your Progress – 5


Q10 What is Ensemble Learning ?
………………………………………………………………………………………………………
……………………………………………………………
Q11 Briefly discuss the various Ensemble Methods.
………………………………………………………………………………………………………
……………………………………………………………
9.7 SUMMARY

In this unit we discussed the basic concepts of machine learning as well as the various machine
learning algorithms. The unit also covered reinforcement learning and its related algorithms.
Thereafter we discussed the concept of deep learning and the various techniques involved in it.
The unit finally discussed ensemble learning and its related methods.

9.8 SOLUTIONS/ANSWERS

Check Your Progress - 1

Q1. How does machine learning differ from Artificial Intelligence?

Solution: Refer to Section 9.2

Q2 Briefly discuss the major function or use of Machine Learning algorithms.
Solution: Refer to Section 9.2

Check your progress - 2

Q3. Discuss the various phases of Machine Learning.

Solution: Refer to Section 9.3

Q4 When should we use machine learning ?

Solution: Refer to Section 9.3

Q5 Compare the concepts of Classification, Regression and Clustering. List the algorithms in the
respective categories.

Solution: Refer to Section 9.3

Check Your Progress – 3

Q6 What is Reinforcement Learning ? List the various components involved in Reinforcement Learning.
Solution: Refer to Section 9.4
Q7 Briefly discuss the various algorithms of Reinforcement Learning.
Solution: Refer to Section 9.4

Check Your Progress – 4

Q8 What is Deep Learning? How does Deep Learning relate to AI & ML?


Solution: Refer to Section 9.5
Q9 Briefly discuss the various algorithms of Deep Learning.
Solution: Refer to Section 9.5

Check Your Progress – 5


Q10 What is Ensemble Learning ?
Solution: Refer to Section 9.6
Q11 Briefly discuss the various Ensemble Methods.
Solution: Refer to Section 9.6

9.9 FURTHER READINGS

 Prof. Ela Kumar, "Artificial Intelligence", First Edition, Dreamtech Press, 2020, ISBN: 9789389795134
 Machine Learning: An Algorithmic Perspective, Stephen Marsland, 2nd Edition, CRC Press, 2015.
 Machine Learning, Tom Mitchell, 1st Edition, McGraw-Hill, 1997.
 Machine Learning: The Art and Science of Algorithms that Make Sense of Data, Peter
Flach, 1st Edition, Cambridge University Press, 2012.
UNIT 10 CLASSIFICATION

Structure
10.1 Introduction
10.2 Objectives
10.3 Understanding of Supervised Learning
10.4 Introduction to Classification
10.5 Classification Algorithms
10.5.1 Naïve Bayes
10.5.2 K-Nearest Neighbour (K-NN)
10.5.3 Decision Trees
10.5.4 Logistic Regression
10.5.5 Support Vector Machines

10.6 Summary
10.7 Solutions/Answers
10.8 Further Readings

10.1 INTRODUCTION

What exactly does learning entail, anyway? What exactly is meant by "machine learning"? These
are philosophical problems, but we won't focus too much on philosophy in this lesson; the whole
focus will be on gaining a solid understanding of how things work in practice. Many of the ideas
in this unit, such as classification and clustering, are also addressed in the subject of data mining,
so here we are going to investigate those concepts once again. Therefore, in order to achieve a
better understanding, the first step is to differentiate between the two fields of study known as
data mining and machine learning.

It's possible that, at their core, data mining and machine learning are both about learning from
data and improving one's decision-making. On the other hand, they approach things in a different
manner. To get things started, let's start with the most important question, What exactly is the
difference between Data Mining and Machine Learning?
What is data mining? Data mining is a subset of business analytics that involves exploring an
existing huge dataset in order to discover previously unknown patterns, correlations, and
anomalies that are present in the data. This process is referred to as "data exploration." It enables
us to come up with wholly original ideas and perspectives.
What exactly is meant by "machine learning"? The field of artificial intelligence (AI) includes
the subfield of machine learning. Machine learning involves computers performing analyses on
large data sets, after which the computers "learn" patterns that will assist them in making
predictions regarding additional data sets. It is not necessary for a person to interact with the
computer for it to learn from the data; the initial programming and possibly some fine-tuning are
all that are required.
It has come to our attention that there are a number of parallels between the two ideas, namely
Data Mining and Machine Learning. These parallels include the following:

 Both are considered to be analytical processes;
 Both are effective at recognising patterns;
 Both focus on gaining knowledge from data in order to enhance decision-making
capabilities;
 Both need a substantial quantity of information in order to be precise.
Due to the mentioned similarities between the two, people generally confuse the two concepts.
So, to clearly demarcate them, one should understand the key differences between the two, i.e.
between Data Mining and Machine Learning.

The following are some of the most important distinctions between the two:
 Machine learning goes beyond what has happened in the past to make predictions about
future events based on the pre-existing data. Data mining, on the other hand, consists of
just looking for patterns that already exist in the data.
 At the beginning of the process of data mining, the 'rules' or patterns that will be used are
unknown. In contrast, when it comes to machine learning, the computer is typically
provided with some rules or variables to follow in order to comprehend the data and learn
from it.
 The mining of data is a more manual process that is dependent on the involvement and
choice-making of humans. With machine learning, on the other hand, once the
foundational principles have been established, the process of information extraction, as
well as "learning" and refining, is fully automated and does not require the participation
of a human. To put it another way, the machine is able to improve its own level of
intelligence.
 Finding patterns in an existing dataset (like a data warehouse) can be accomplished
through the process of data mining. On the other hand, machine learning is trained on a
data set referred to as a "training" data set, which teaches the computer how to make
sense of data and then how to make predictions about fresh data sets.
The approaches to data mining problems are based on the type of information/ knowledge to be
mined. We will emphasis on three different approaches: Classification, Clustering, and
Association Rules.

The classification task puts data into groups or classes that have already been set up. The value
of a user-specified goal attribute shows what type of thing a tuple is. Tuples are made up of one
or more predicating attributes and one or more goal attributes. The task is to find some kind of
relationship between the predicating attributes and the goal attribute, so that the information or
knowledge found can be used to predict the class of new tuple(s).

The purpose of the clustering process is to create distinct classes from groups of tuples that share
characteristic values. Clustering is the process of defining a mapping, using as input a database
containing tuples and an integer value k, in such a way that the tuples are mapped to various
clusters.
The idea entails increasing the degree of similarity within a class while decreasing the degree of similarity between classes. There is no target attribute in the clustering process. Clustering is therefore an example of unsupervised classification, in contrast to classification, which is supervised by the target attribute.
The goal of association rule mining is to find interesting connections between elements in a data set. Its initial use was for "market basket data". A rule is written as X → Y, where X and Y are two sets of items that do not intersect. Support and confidence are the two metrics for any rule. The aim is to identify rules whose support and confidence are above the user-specified minimum support and minimum confidence.
The distance measure determines the distance between items or their dissimilarity. The following are the measures used in this unit:

• Euclidean distance: dis(ti, tj) = √( Σh=1..k (tih - tjh)² )
• Manhattan distance: dis(ti, tj) = Σh=1..k |tih - tjh|

where ti and tj are tuples and h indexes the different attributes, which can take values from 1 to k.
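As a small illustration, here is a minimal Python sketch of these two measures (the sample tuples are made up):

```python
# A minimal sketch of the two distance measures defined above.
def euclidean_distance(ti, tj):
    """Euclidean distance between two tuples of k numeric attributes."""
    return sum((a - b) ** 2 for a, b in zip(ti, tj)) ** 0.5

def manhattan_distance(ti, tj):
    """Manhattan (city-block) distance between two tuples."""
    return sum(abs(a - b) for a, b in zip(ti, tj))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
print(manhattan_distance((1, 2), (4, 6)))  # 7
```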
Despite these clear differences, as businesses try to get better at predicting the future, machine learning and data mining may converge further; for example, more businesses may want to use machine learning algorithms to improve their data mining analytics.
Machine learning algorithms use computational methods to "learn" information directly from
data, without using an equation as a model. As more examples are available for learning, the
algorithms get better and better at what they do.
Machine learning algorithms look for patterns in data that occur naturally. This gives you
more information and helps you make better decisions and forecasts. They are used every
day to make important decisions in diagnosing medical conditions, trading stocks,
predicting energy load, and more. Machine learning is used by media sites to sort through
millions of options and suggest songs or movies. It helps retailers figure out what their
customers buy and how they buy it. With the rise of "big data," machine learning has
become very important for solving problems in areas like:
• Computational finance, for applications such as credit scoring and algorithmic trading
• Image processing and computer vision, for face identification, motion detection, and object detection
• Computational biology, for tumour detection, drug development, and DNA sequencing
• Energy production, for pricing and load forecasting
• Automotive, aerospace, and manufacturing, for predictive maintenance
• Natural language processing
In general, Classical Machine Learning Algorithms can be put into two groups: Supervised
Learning Algorithms, which use data that has been labelled, and Un-Supervised Learning
Algorithms, which use data that has not been labelled and are used for Clustering. We will talk
more about Clustering in Unit 15, which is part of Block 4 of this course.
[Figure: Machine learning techniques. Unsupervised learning (clustering) groups and interprets data based only on input data, while supervised learning (classification and regression) develops predictive models based on both input and output data.]
In this unit, we will discuss supervised learning algorithms, which are mainly used for classification.
10.2 OBJECTIVES

After completing this unit, you should be able to:

• Understand Supervised Learning
• Understand Un-Supervised Learning
• Understand various Classification Algorithms
10.3 UNDERSTANDING OF SUPERVISED LEARNING
To use machine learning techniques effectively, you need to know how they work; you cannot simply apply them blindly and expect good results. Different techniques suit different kinds of problems, but it is not always clear which technique will work in a given situation, so you need to know something about the different kinds of solutions.
Every workflow for machine learning starts with the following three questions:

• What kind of data do you have available to work with?
• What kinds of insights are you hoping to derive from it?
• In what ways and contexts will those insights be utilised?
Your responses to these questions will assist you in determining whether supervised or
unsupervised learning is best for you.
Workflow at a Glance:

1. ACCESS and load the data.
2. PREPROCESS the data.
3. DERIVE features using the preprocessed data.
4. TRAIN models using the features derived in step 3.
5. ITERATE to find the best model.
6. INTEGRATE the best-trained model into a production system.
In an interesting way, supervised machine learning is like how humans and animals learn
"concepts" or "categories." This is defined as "the search for and listing of attributes that can be
used to tell exemplars from non-exemplars of different categories."
Technically, supervised learning means learning a function that gives an output for a given input
based on a set of input-output pairs that have already been defined. It does this with the help of
something called "training data," which is a set of examples for training.
In supervised learning, the data used for training is labelled. For example, every shoe is labelled as a shoe, and the same goes for every pair of socks. This way, the system knows the labels, and if it sees a new type of shoe, it will recognise it as a "shoe" even though it has never seen that particular shoe before.
In the example above, the picture of a shoe is the input and the label "shoes" is the output. After learning from hundreds or thousands of different pictures labelled "shoes" or "socks", our system will know what to do when given only a new, unlabelled picture of shoes.
Supervised ML is often represented by the function y = f(x), where x is the input data and y is the output variable, which is a function of x that needs to be predicted. In the training data, each example pair is made up of an input, which is usually a vector (a collection of features describing a sample), and an output. The desired output value is called the "supervisory signal", whose meaning is clear from the name.
In fact, the goal of supervised machine learning is to build a model that can make predictions
based on evidence even when there is uncertainty. A supervised learning algorithm uses a known
set of input data and known responses to the data (output) to train a model to make reasonable
predictions about the response to new data.
[Figure: Supervised learning. A supervisor provides the desired output for the training data set; the algorithm processes the raw input data against this desired output and produces the trained output.]
Example: Predicting heart attacks with the help of supervised learning. Suppose doctors want to know whether someone will have a heart attack within the next year. They have information about the age, weight, height, and blood pressure of past patients, and they know whether those earlier patients had heart attacks within a year. The problem is to build a model from the existing data that can tell whether a new person will have a heart attack within the next year.
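As a hedged sketch of this example, the code below fits a scikit-learn classifier on invented patient records; the feature names (age, weight, height, blood pressure) follow the example, but every value and label is hypothetical, made up purely for illustration:

```python
# Hypothetical sketch of the heart-attack example (synthetic data, not clinical advice).
from sklearn.linear_model import LogisticRegression

# Each record: [age, weight_kg, height_cm, systolic_bp]; label 1 = had a heart attack.
X_train = [[63, 92, 170, 160], [45, 70, 175, 120], [58, 88, 168, 150],
           [36, 64, 180, 115], [52, 85, 165, 145], [29, 60, 172, 110]]
y_train = [1, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(X_train, y_train)

# Predict for a new patient (age 50, 80 kg, 170 cm, BP 140).
print(model.predict([[50, 80, 170, 140]]))        # predicted class (0 or 1)
print(model.predict_proba([[50, 80, 170, 140]]))  # class probabilities
```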
The following steps are involved in supervised learning, and they are self-explanatory:

1. Determine the Type of Training Examples
2. Prepare/Gather the Training Data
3. Determine the Relation Between the Input Features and the Representation of the Learned Function
4. Select a Learning Algorithm
5. Run the Selected Algorithm on the Training Data
6. Evaluate the Accuracy of the Learned Function Using Values from the Test Set
There are some common issues that are generally faced when one applies supervised learning; they are listed below:

(i) Training and classifying require a lot of computing time, especially when big data is involved.

(ii) Overfitting: the model may learn so much from the noise in the training data that it treats the noise as a learning concept rather than as error.

(iii) Unlike unsupervised learning, if an input does not fit into any of the predefined classes, the model will assign it to one of the existing classes instead of creating a new one.
Let's discuss some of the practical applications of supervised machine learning. For beginners at least, knowing 'what does supervised learning achieve' is probably equally or more important than simply knowing 'what is supervised learning'.
A very large number of practical applications of the method can be outlined, but the following are some of the common ones:
a) Detection of spam
b) Detection of fraudulent banking or other activities
c) Medical Diagnosis
d) Image recognition
e) Predictive maintenance
With increasing applications each day in all the fields, machine learning knowledge is an
essential skill.
☞ Check Your Progress 1
1. Compare between Supervised and Un-Supervised Learning.
……………………………………………………………………………………………
……………………………………………………………………………………………
2. List the Steps Involved in Supervised Learning
……………………………………………………………………………………………
……………………………………………………………………………………………
3. What are the common issues faced while using supervised learning?
……………………………………………………………………………………………
……………………………………………………………………………………………
10.4 INTRODUCTION TO CLASSIFICATION

Every supervised learning approach is either a regression or a classification method, depending on the nature of the data being analysed. Supervised learning makes the creation of predictive models possible by utilising classification and regression methods.
• Classification techniques: Classification methods predict discrete outcomes, such as whether an email is legitimate or spam, or whether a tumour is malignant or benign. Classification models classify incoming data into categories. Typical applications include medical imaging, speech recognition, and credit scoring.
• Regression techniques: Regression techniques predict continuous responses, such as shifts in temperature or variations in the amount of power required. Typical applications include stock price prediction, handwriting recognition, power-load forecasting, and acoustic signal processing.
Note: It is important to know whether a problem is a classification problem or a regression problem.

• Can your data be tagged or put into groups? Use classification algorithms if your data can be put into distinct groups or classes.
• Is your answer a continuous value? Use regression techniques if your answer is a real number, like a TEMP. reading or the time until a piece of equipment fails.
Before moving ahead, let us understand some of the key terms that will occur frequently in this course; they are listed below:

• Classification: The process of organising data into a predetermined number of categories is referred to as classification. The primary objective of a classification problem is to find out which category or class a new collection of data falls under. Both structured and unstructured data can be used for classification:

Structured data (data that resides in a fixed field within a file or record is called "structured data"). Most structured data is kept in a relational database (RDBMS).

Unstructured data (unstructured data may have a natural structure, but it is not organised in a predictable way). There is no data model, and the data is stored in the format in which it was created. Rich media, text, social media activity, surveillance images, and so on are all types of unstructured data.

The following are some of the terminologies frequently encountered in machine learning classification:
• Classifier: A classifier is an algorithm that puts the data you give it into a certain category, i.e., an algorithm that performs classification on a dataset.
• Classification model: A classification model tries to draw conclusions from the values used for training, and predicts the class labels or categories of new data.
• Feature: A property of any object (real or virtual) that can be measured on its own is called a feature.
• Classification predictive modelling involves assigning a class label to input examples.
This section covers the following types of classification:

• Binary classification is a task for which there are only two possible outcomes; it means predicting which of two classes an example belongs to, for example, dividing people into male and female.

Some popular algorithms that can be used for binary classification are:
• Logistic Regression
• k-Nearest Neighbours
• Decision Trees
• Support Vector Machine
• Naive Bayes

Major application areas of binary classification (a hedged code sketch follows this list):
• Detection of email spam
• Prediction of churn
• Prediction of purchase or conversion (buy or not)
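A minimal sketch of binary classification, assuming scikit-learn and using synthetic data in place of a real spam dataset:

```python
# Hedged sketch of binary classification on synthetic data (all values made up).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 200 samples, 2 classes (think "spam" = 1 vs "not spam" = 0).
X, y = make_classification(n_samples=200, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = SVC().fit(X_train, y_train)  # Support Vector Machine, one of the listed algorithms
print(clf.score(X_test, y_test))   # fraction of test examples classified correctly
```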
• Multi-class classification: Multi-class classification means putting things into more than two groups. In multi-class classification, there is only one target label for each sample; the task is to predict which of more than two classes the sample belongs to. An animal, for instance, can be either a cat or a dog, but not both. Face classification, plant species classification, and optical character recognition are some examples.

Popular algorithms that can be used for multi-class classification include:
• k-Nearest Neighbours
• Decision Trees
• Naive Bayes
• Random Forest
• Gradient Boosting

Binary classification algorithms can be adapted to problems with more than two classes. This is done by fitting multiple binary classification models, either one for each class versus all other classes (called "one-vs-rest") or one model for each pair of classes (called "one-vs-one").
• One-vs-rest: Fit one binary classification model for each class versus all of the other classes in the dataset.
• One-vs-one: Fit one binary classification model for each pair of classes.

Binary classification techniques such as logistic regression and support vector machines are examples of algorithms that can use these strategies for multi-class classification; a hedged sketch follows.
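The sketch below assumes scikit-learn's OneVsRestClassifier and OneVsOneClassifier wrappers and uses the built-in iris data as a stand-in three-class problem:

```python
# Sketch: wrapping a binary classifier for multi-class via one-vs-rest and one-vs-one.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

# One binary model per class vs. the rest:
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
# One binary model per pair of classes:
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```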
• Multi-label classification: Multi-label classification is a classification task in which each sample is mapped to a set of target labels, i.e., more than one class per sample. This task involves predicting one or more classes for each example; for instance, a news story can be about games, people, and a location all at the same time.

Note: Classification algorithms used for binary or multi-class classification cannot be used directly for multi-label classification. Specialised versions of standard classification algorithms, so-called multi-label versions of the algorithms, can be used instead, including:

• Multi-label Decision Trees
• Multi-label Random Forests
• Multi-label Gradient Boosting

Another approach is to use a separate classification algorithm to predict the labels for each class, as in the sketch below.
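A minimal sketch of this separate-classifier-per-label approach, assuming scikit-learn and synthetic multi-label data:

```python
# Sketch: multi-label classification by fitting one classifier per label.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data: each sample can carry several of 3 labels at once.
X, Y = make_multilabel_classification(n_samples=100, n_classes=3, random_state=0)

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(clf.predict(X[:2]))  # one 0/1 flag per label, e.g. [[1 0 1] [0 1 0]]
```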
• Imbalanced classification is a task in which the number of examples in each class is not the same. Examples include fraud detection, outlier detection, and medical diagnostic tests. These problems are modelled as binary classification tasks, but they may require specialised methods.
The different types of classification discussed above deal with different types of learners. Learners in classification problems are categorised into the following two types:

1. Lazy learners: A lazy learner first stores the training dataset and then waits until it is given the test dataset. In the case of a lazy learner, classification is done based on the information in the training dataset that is most relevant to the question at hand. Less time is needed for training, but more time is needed for making predictions. The K-NN algorithm and case-based reasoning are two examples.

2. Eager learners: Eager learners use the training dataset to build a classification model before they receive a test dataset. In contrast to lazy learners, eager learners spend more time on training and less time on making predictions. Decision trees, Naive Bayes, and ANN are some examples.
Examples of eager learners include Bayesian classification, decision tree induction, rule-based classification, classification by back-propagation, support vector machines, and classification based on association rule mining. Eager learners, when presented with a collection of training tuples, construct a generalisation model (also known as a classification model) before being presented with new tuples, also known as test tuples, to classify. One way to think about the learnt model is as one that is prepared and eager to categorise tuples that have not been seen before.

In the lazy learner approach, on the other hand, the learner waits until the final moment to develop a model for classifying a given test tuple: the lazy learner simply stores the training tuples and does no generalisation until it is given a test tuple, after which it classifies the tuple based on how similar it is to the stored training tuples. Lazy learning methods thus do less work when a training tuple is presented but more work when classifying or making a prediction. Because lazy learners keep the training tuples, which are also called "instances", they are also known as "instance-based learners".
When classifying or making a prediction, lazy learners can require a lot of processing power. They need efficient ways to store information and are well suited to implementation on parallel hardware. They offer little explanation or insight into how the data is structured. Lazy learners do, however, naturally support incremental learning. They can model complex decision spaces with hyper-polygonal shapes that other learning algorithms may not handle as well (such as the hyper-rectangular shapes modelled by decision trees). The k-nearest-neighbour classifier and the case-based reasoning classifier are both types of lazy learners.
☞ Check Your Progress 2
4. Compare between Multi Class and Multi Label Classification
……………………………………………………………………………………
……………………………………………………………………………………
5. Compare between structured and unstructured data
……………………………………………………………………………………
……………………………………………………………………………………
6. Compare between Lazy learners and Eager Learners algorithms for machine learning.
……………………………………………………………………………………
……………………………………………………………………………………
10.5 CLASSIFICATION ALGORITHMS
The classification algorithm is a type of supervised learning that uses the training data to determine the category of new observations. Classification is the process by which a computer programme learns from a set of data or observations and then sorts new observations into different classes or groups, such as "Yes" or "No", "0" or "1", "Spam" or "Not Spam", "Cat" or "Dog", and so on. Classes are also referred to as categories, targets, or labels.
In classification, the output variable is not a value but a category, such as "Green or Blue," "Fruit
or Animal," etc. This is different from regression, where the output variable is a value. Since the
classification method is a supervised learning method, it needs data that has been labelled in
order to work. This means that the implementation of the algorithm includes both the input and
the output that go with it.
As the name suggests, classification algorithms do the job of predicting a label or putting a
variable into a category (categorization). For example, classifying something as "socks" or
"shoes" from our last example. Classification Predictive Algorithm is used every day in the spam
detector in emails. It looks for features that help it decide if an email is spam or not spam.
The primary objective of the Classification algorithm is to determine the category of the dataset
that is being provided, and these algorithms are primarily utilised to forecast the output for the
data that is categorical in nature. A discrete output function, denoted by y, is mapped to an input
variable, denoted by x, in a classification process. Therefore, y = function (x), where y denotes
the categorical output. The best example of an ML classification algorithm is Email Spam
Detector.
The diagram below helps in understanding classification methods. It depicts two classes, Class A and Class B, whose members share characteristics with one another while differing from the members of the other class.

[Figure: A scatter plot of two classes, Class A and Class B, forming two separable groups of points in the feature space.]
Because there is a variety of algorithms under both supervised and unsupervised learning, the question arises: how should one choose which algorithm to employ? Selecting the appropriate machine learning algorithm can appear insurmountable, because there are dozens of supervised and unsupervised machine learning algorithms and each takes a unique approach to learning. No single approach is superior or universally applicable. Finding the appropriate algorithm requires some trial and error; even highly experienced data scientists cannot determine whether an algorithm will work without putting it to the test. The choice of algorithm also depends on the quantity and nature of the data being worked with, the insights desired from the data, and the applications to which those insights will be put.
• Choose supervised learning if you need to train a model to make a prediction: use regression techniques to predict the future value of a continuous variable, such as a TEMP. reading or a stock price, and use classification techniques in situations such as identifying makes of cars from webcam video footage or identifying spam among emails.

• Choose unsupervised learning if you need to investigate your data and want to train a model to find a good internal representation, such as by dividing the data into clusters. This type of learning allows more freedom in exploring and representing the data.
Note: Supervised machine learning algorithms can be broken down into two basic categories: regression algorithms and classification algorithms. Regression methods forecast the output for continuous values, but in order to predict categorical values we need classification algorithms.
Let us take a closer look at the most commonly used algorithms for supervised machine learning. Classification algorithms can be further divided into two main categories, linear models and non-linear models, which include the various algorithms listed below:

o Linear Models: Logistic Regression, Support Vector Machines
o Non-linear Models: K-Nearest Neighbours, Kernel SVM, Naïve Bayes, Decision Tree Classification, Random Forest Classification
In order to build a classification model, the following steps are to be followed (a hedged code sketch follows the list):

1. Initialise the classifier to be used.
2. Train the classifier: to fit the model (training), all classifiers in scikit-learn use a method called fit(X, y), which takes as input the training data X and the training labels y.
3. Predict the target: given an unlabelled observation X, the predict(X) method returns the predicted label.
4. Conduct an analysis of the classification model.
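A minimal sketch of these four steps with scikit-learn; the choice of the iris data and of a decision tree classifier here is only for illustration:

```python
# A minimal sketch of the four steps above using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier()          # Step 1: initialise the classifier
clf.fit(X_train, y_train)               # Step 2: train with fit(X, y)
y_pred = clf.predict(X_test)            # Step 3: predict the target
print(accuracy_score(y_test, y_pred))   # Step 4: analyse the model
```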
EVALUATING A CLASSIFICATION MODEL: Now that we have a classification model, let us learn how to evaluate it. Once we have finished developing a model, whether it is a regression or a classification model, we need to assess how well it works. The following are some of the methods that can be used to evaluate a classification model:
1. Log Loss or Cross-Entropy Loss:

o It is used to measure the performance of a classifier whose output is a probability value between 0 and 1.
o For a good binary classification model, the value of log loss should be close to 0.
o The value of log loss increases when the predicted probability diverges from the actual value.
o The lower the log loss, the more accurate the model.
o For binary classification, the cross-entropy is calculated from the actual output y and the predicted probability p using the formula below:

-(y·log(p) + (1 - y)·log(1 - p))
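A small sketch of this formula in Python (the probability values are made up):

```python
# Sketch: binary cross-entropy for one prediction.
import math

def log_loss_single(y, p):
    """-(y*log(p) + (1-y)*log(1-p)) for actual label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(log_loss_single(1, 0.9))  # ~0.105 -- confident and correct, low loss
print(log_loss_single(1, 0.1))  # ~2.303 -- confident but wrong, high loss
```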
2. Confusion Matrix:

• The confusion matrix describes how well the model performs and gives us a matrix or table as output.
• This kind of structure is sometimes called the error matrix.
• The matrix is a summary of the prediction results; it shows how many predictions were right and how many were wrong.

For binary classification, the matrix looks like the table below:

                    Predicted: Positive     Predicted: Negative
Actual: Positive    True Positive (TP)      False Negative (FN)
Actual: Negative    False Positive (FP)     True Negative (TN)
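A minimal sketch, assuming scikit-learn's confusion_matrix helper and made-up labels:

```python
# Sketch: building a confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[3 1]    row 0: actual negatives -> 3 TN, 1 FP
#  [1 3]]   row 1: actual positives -> 1 FN, 3 TP
```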
3. AUC-ROC Curve:

• AUC stands for "area under the curve" and ROC for "receiver operating characteristics" curve.
• This graph shows how well the classification model works at several different thresholds.
• The AUC-ROC curve is also used to see how well a multi-class classification model is doing.
• The ROC curve is plotted using the True Positive Rate (TPR) on the Y-axis and the False Positive Rate (FPR) on the X-axis.
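A small sketch, assuming scikit-learn's ROC helpers and made-up scores:

```python
# Sketch: scoring a classifier's probability outputs with ROC AUC.
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # points of the ROC curve
print(roc_auc_score(y_true, y_scores))              # 0.75
```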
Classification algorithms have several applications; the following are some popular applications or use cases of classification algorithms:
o Detecting Email Spam
o Recognizing Speech
o Detection of Cancer tumor cells.
o Classifying Drugs
o Biometric Identification, etc.
☞ Check Your Progress 3
7. List the classification algorithms under the categories of linear and non-linear models. Also discuss the various methods used for evaluating a classification model.
……………………………………………………………………………
……………………………………………………………………………
10.5.1 NAÏVE BAYES

This is an example of statistical classification, which estimates the likelihood that a given sample belongs to a particular class. It is founded on the Bayes theorem. When applied to big databases, Bayesian classification demonstrates both good accuracy and high speed. In this section, we will discuss the most basic kind of Bayesian classification.

"The effect of a given attribute value on a certain class is unaffected by the values of the other attributes, i.e. the attributes are independent" is the fundamental underlying assumption of the naïve Bayesian classification, which is the simplest form of Bayesian classification. This assumption is also called class conditional independence.

Let us go into greater depth about the naïve Bayesian classification. But before we do, let us define the fundamental theorem that underpins this classification, i.e. the Bayes theorem.
Bayes Theorem: In order to understand this theorem, let us first understand the meaning of the following symbols and assumptions:

• X is a data sample whose class needs to be determined.
• H refers to the hypothesis that the data sample X falls into the class C.
• P(H | X) is the probability that the hypothesis H holds for the data sample X, i.e. the likelihood that X belongs to the class C. It is referred to as the posterior probability of H conditioned on X.
• P(H) denotes the prior probability of the hypothesis H, based on the training data.
• P(X | H) denotes the probability of observing the sample X given that the hypothesis H holds, i.e. the posterior probability of X conditioned on H.
• P(X) denotes the prior probability of the sample X.
Note: From the data sample X and the training data, we can estimate P(X), P(X | H), and P(H). However, P(H | X) is the quantity that defines the likelihood that X belongs to a class C, and it cannot be computed directly. Bayes' theorem serves exactly this purpose.
Bayes' theorem states:

P(H | X) = P(X | H) P(H) / P(X)
Now after defining the Bayes theorem, let us explain the Bayesian classification with the help of an
example.
i) Consider the sample having an n-dimensional feature vector. For our example, it is a 3-dimensional
(Department, Age, Salary) vector with training data as given in the Figure 3.
ii) Assume that there are m classes C1 to Cm and an unknown sample X. The problem is to determine the class to which X belongs. As per Bayesian classification, the sample is assigned to the class Ci if the following holds:

P(Ci|X) > P(Cj|X) for all j from 1 to m, j ≠ i
In other words, the class assigned to the data sample X will be the class with the maximum probability for the unknown sample. Please note: P(Ci|X) will be found using:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)    (3)
In our example, we are trying to classify the following data:

X = (Department = "_PERSONNEL_", Age = "31 – 40" and Salary = "Medium_Range")

into two classes (based on position): C1 = _BOSS_ or C2 = _ASSISTANT_.
iii) The value of P(X) is constant for all the classes; therefore, only P(X|Ci) P(Ci) needs to be maximised. Also, if the classes are equally likely, i.e. P(C1) = P(C2) = ... = P(Cm), then we only need to maximise P(X|Ci).
How is P(Ci) calculated?

P(Ci) = (Number of training samples of class Ci) / (Total number of training samples)

In our example,

P(C1) = 5/11 and P(C2) = 6/11

so P(C1) ≠ P(C2).
iv) Calculating P(X|Ci) may be computationally expensive if there is a large number of attributes. To simplify the evaluation, in the naïve Bayesian classification we use the assumption of class conditional independence, that is, the values of the attributes are independent of each other. In such a situation:

P(X|Ci) = ∏k=1..n P(xk|Ci)    (4)

where xk represents a single dimension or attribute.
P(xk|Ci) can be calculated using a mathematical function if the attribute is continuous; otherwise, if it is categorical, this probability can be calculated as:

P(xk|Ci) = (Number of training samples of class Ci having the value xk for the attribute Ak) / (Number of training samples belonging to Ci)
For our example, we have x1 as Department = "_PERSONNEL_", x2 as Age = "31 – 40", and x3 as Salary = "Medium_Range".

P(Department = "_PERSONNEL_" | Position = "_BOSS_") = 1/5
P(Department = "_PERSONNEL_" | Position = "_ASSISTANT_") = 2/6
P(Age = "31 – 40" | Position = "_BOSS_") = 3/5
P(Age = "31 – 40" | Position = "_ASSISTANT_") = 2/6
P(Salary = "Medium_Range" | Position = "_BOSS_") = 3/5
P(Salary = "Medium_Range" | Position = "_ASSISTANT_") = 3/6
Using equation (4) we obtain:

P(X | Position = "_BOSS_") = 1/5 * 3/5 * 3/5
P(X | Position = "_ASSISTANT_") = 2/6 * 2/6 * 3/6

Thus, the probabilities:

P(X | Position = "_BOSS_") P(Position = "_BOSS_") = (1/5 * 3/5 * 3/5) * 5/11 = 0.032727
P(X | Position = "_ASSISTANT_") P(Position = "_ASSISTANT_") = (2/6 * 2/6 * 3/6) * 6/11 = 0.030303

Since the first of the two probabilities above is higher, the sample data is classified into the _BOSS_ position. Kindly check that you obtain the same result from the decision tree.
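The arithmetic above can be reproduced with a few lines of Python; the probabilities below are simply the values read off the training data in this example:

```python
# Sketch: reproducing the naive Bayes arithmetic of the example above.
p_boss      = 5 / 11
p_assistant = 6 / 11

# P(X | class) = P(dept | class) * P(age | class) * P(salary | class)
p_x_given_boss      = (1/5) * (3/5) * (3/5)
p_x_given_assistant = (2/6) * (2/6) * (3/6)

score_boss      = p_x_given_boss * p_boss            # ~0.032727
score_assistant = p_x_given_assistant * p_assistant  # ~0.030303
print("BOSS" if score_boss > score_assistant else "ASSISTANT")  # BOSS
```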
Naïve Bayes: Steps to perform the Naïve Bayes algorithm:

Step 1: Handling Data: data is loaded from the CSV file and split into training and test sets.

Step 2: Summarising the Data: summarise the properties in the training data set to calculate the probabilities needed for making predictions.

Step 3: Making a Prediction: a particular prediction is made using the summaries of the data set.

Step 4: Making all the Predictions: generate predictions given a test data set and a summarised training data set.

Step 5: Evaluating Accuracy: the accuracy of the prediction model for the test data set is reported as the percentage of correct predictions out of all the predictions made.

Step 6: Tying it all Together: finally, we tie all the steps together and form our own Naive Bayes classifier.
With the help of the following example, you can see how the Naive Bayes classifier works.
Example: suppose we have a list of _WEATHER_ conditions and a corresponding target variable called "Play". Using this data set, we need to decide whether or not to play on a given day based on the _WEATHER_.
If it is _SUNNY_, should the player play?
To solve this problem, we take the following steps:
1. Make frequency tables out of the given dataset.
2. Make a likelihood table by calculating how likely each feature is.
3. Use Bayes' theorem to calculate the posterior probability.

To work this out, first look at the dataset given below:
_OUTLOOK_ _PLAY_
0 _RAINY_ YES
1 _SUNNY_ YES
2 _OVERCAST_ YES
3 _OVERCAST_ YES
4 _SUNNY_ NO
5 _RAINY_ YES
6 _SUNNY_ YES
7 _OVERCAST_ YES
8 _RAINY_ NO
9 _SUNNY_ NO
10 _SUNNY_ YES
11 _RAINY_ NO
12 _OVERCAST_ YES
13 _OVERCAST_ YES
Frequency table for the _WEATHER_ Conditions:

_WEATHER_ NO YES
_OVERCAST_ 0 5
_RAINY_ 2 2
_SUNNY_ 2 3
TOTAL 4 10
Likelihood table _WEATHER_ condition:

_WEATHER_ NO YES
_OVERCAST_ 0 5 5/14=0.35
_RAINY_ 2 2 4/14=0.29
_SUNNY_ 2 3 5/14=0.35
ALL 4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:

P(Yes|_SUNNY_)= P(_SUNNY_|Yes)*P(Yes)/P(_SUNNY_)

P(_SUNNY_|Yes)= 3/10= 0.3

P(_SUNNY_)= 0.35

P(Yes)=0.71

So P(Yes|_SUNNY_) = 0.3*0.71/0.35= 0.60

P(No|_SUNNY_)= P(_SUNNY_|No)*P(No)/P(_SUNNY_)

P(_SUNNY_|NO)= 2/4=0.5

P(No)= 0.29

P(_SUNNY_)= 0.35

So P(No|_SUNNY_)= 0.5*0.29/0.35 = 0.41

So as we can see from the above calculation that P(Yes|_SUNNY_)>P(No|_SUNNY_)

Hence on a _SUNNY_ day, Player can play the game.
☞ Check Your Progress 4
8. Predicting a class label using naïve Bayesian classification. We wish to predict the class label
of a tuple using naïve Bayesian classification, given the training data as shown in Table-1 Below.
The data tuples are described by the attributes age, income, student, and credit rating.
The class label attribute known as "buys computer" can take on one of two distinct values—
specifically, "yes" or "no." Let's say that C1 represents the class buying a computer and C2
represents the class deciding not to buy a computer. We are interested in classifying X as having
the following characteristics: (age = youth, income = medium, student status = yes, credit rating
= fair).
10.5.2 K-NEAREST NEIGHBOURS (K-NN)

This approach places items in the class to which they are "closest" to their neighbours. It must determine the distance between an item and a class; classes are represented by a centroid (central value) and the individual points. One of the algorithms that uses this idea is K-Nearest Neighbours.

We know that the classification task maps data into predefined groups or classes. Given a database/dataset D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to one class; that is, it divides the database/dataset D into the classes specified in the set C.
A few very simple examples to elucidate classification could be:

• Teachers classify students' marks into a set of grades such as A, B, C, D, or F.
• Classification of the heights of a set of persons into the classes tall, medium or short.
The basic approaches to classification are:

• To create specific models by evaluating training data, which is basically old data that has already been classified using the knowledge of domain experts;
• To then apply the model developed to the new data.

Please note that in classification, the classes are predefined.
Some of the most common techniques used for classification include decision trees, K-NN, etc. Most of these techniques are based on finding distances or use statistical methods.
The distance measure finds the distance or dissimilarity between objects. The measures used in this unit are as follows:

• Euclidean distance: dis(ti, tj) = √( Σh=1..k (tih - tjh)² )
• Manhattan distance: dis(ti, tj) = Σh=1..k |tih - tjh|

where ti and tj are tuples and h indexes the different attributes, which can take values from 1 to k.
In this section, we look at a distance-based classifier, i.e. the k-nearest-neighbour classifier.

Nearest-neighbour classifiers work by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes, so each tuple can be thought of as a point in an n-dimensional space. In this method, all of the training tuples are stored in an n-dimensional pattern space. Given an unknown tuple, a k-nearest-neighbour classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are referred to as the "k nearest neighbours" of the unknown tuple.
"Closeness" is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

dist(X1, X2) = √( Σi=1..n (x1i - x2i)² )    (eq. 1)
In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it. The square root is then taken of the total accumulated distance. Typically, before using equation (eq. 1), we normalise the values of every attribute. This helps prevent attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes). Min-max normalisation, for example, can be used to transform a value v of a numeric attribute A to v' in the range [0, 1] by computing

v' = (v - minA) / (maxA - minA)    (eq. 2)

where minA and maxA are the minimum and maximum values of attribute A.
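A one-function sketch of min-max normalisation (the income values are made up):

```python
# Sketch of min-max normalisation (eq. 2): rescale attribute A to [0, 1].
def min_max_normalize(v, min_a, max_a):
    return (v - min_a) / (max_a - min_a)

incomes = [20000, 35000, 50000, 80000]
print([min_max_normalize(v, min(incomes), max(incomes)) for v in incomes])
# [0.0, 0.25, 0.5, 1.0]
```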
For the purpose of k-nearest-neighbour classification, the unknown tuple is assigned the most common class among its k closest neighbours. When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space. Nearest-neighbour classifiers can also be used for prediction, which means that they can deliver a real-valued forecast for a given unknown tuple. In this scenario, the classifier returns the weighted average of the real-valued labels associated with the unknown tuple's k nearest neighbours.
Classification Using Distance (K-Nearest Neighbours) - Some basic points to be noted about this algorithm are:

• The training set includes classes along with other attributes (please refer to the training data given in the table below).
• The value of K defines the number of near items (items with the smallest distance to the attributes of concern) that should be used from the given set of training data (recall that the training data is already classified). This is explained in point (2) of the following example.
• A new item is placed in the class in which the largest number of close items is placed (please refer to point (3) in the following example).
• The value of K should be <= √(Number_of_training_items). However, in our example, to limit the size of the sample data, we have not followed this formula.
Example: Consider the following data, which tells us the person’s class depending upon gender and
height
Name Gender Height Class
Sunita F 1.6m Short
Ram M 2m Tall
Namita F 1.9m Medium
Radha F 1.88m Medium
Jully F 1.7m Short
Arun M 1.85m Medium
Shelly F 1.6m Short
Avinash M 1.7m Short
Sachin M 2.2m Tall
Manoj M 2.1m Tall
Sangeeta F 1.8m Medium
Anirban M 1.95m Medium
Krishna F 1.9m Medium
Kavita F 1.8m Medium
Pooja F 1.75m Medium
1) You have to classify the tuple <Ram, M, 1.6> from the training data that is given to you.
2) Let us take only the height attribute for distance calculation and suppose K=5 then the following are
the near five tuples to the data that is to be classified (using Manhattan distance as a measure on the
height attribute).
Name Gender Height Class
Sunita F 1.6m Short
Jully F 1.7m Short
Shelly F 1.6m Short
Avinash M 1.7m Short
Pooja F 1.75m Medium
3) On examination of the tuples above, we classify the tuple <Ram, M, 1.6> into the Short class, since most of the tuples above belong to the Short class (a small code sketch of this computation follows).
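A minimal sketch of this computation in Python, using only the height attribute, Manhattan distance, and K = 5, as in the example:

```python
# Sketch: k-NN (k=5) on the height attribute with Manhattan distance,
# reproducing the classification of <Ram, M, 1.6> from the example above.
from collections import Counter

training = [("Sunita", 1.6, "Short"), ("Ram", 2.0, "Tall"), ("Namita", 1.9, "Medium"),
            ("Radha", 1.88, "Medium"), ("Jully", 1.7, "Short"), ("Arun", 1.85, "Medium"),
            ("Shelly", 1.6, "Short"), ("Avinash", 1.7, "Short"), ("Sachin", 2.2, "Tall"),
            ("Manoj", 2.1, "Tall"), ("Sangeeta", 1.8, "Medium"), ("Anirban", 1.95, "Medium"),
            ("Krishna", 1.9, "Medium"), ("Kavita", 1.8, "Medium"), ("Pooja", 1.75, "Medium")]

def knn_classify(height, k=5):
    # Sort by Manhattan (absolute) distance on height, take the k nearest.
    nearest = sorted(training, key=lambda t: abs(t[1] - height))[:k]
    votes = Counter(cls for _, _, cls in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify(1.6))  # Short -- the majority class of the 5 nearest tuples
```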
Example: To classify whether a special paper tissue is Fine or not, we use data from a questionnaire survey (to get people's opinions) and objective testing with two properties (acid durability and strength). Here are four training examples:

X1 = Acid_Durability_(seconds)   X2 = Strength (gram/Cm2)   Y = Classification
7                                7                          Poor
7                                4                          Poor
3                                4                          Fine
1                                4                          Fine
Now the firm is producing a new kind of paper tissue that passed the laboratory test with X1 = 3 and X2 = 7. Can we make an educated judgement about the classification of this new tissue without doing yet another expensive survey?

1. Fix the value of the parameter K, the number of nearest neighbours. Suppose we use K = 3.
2. Determine the distance that separates the query instance from each of the training samples.

The coordinates of the query instance are (3, 7). Rather than computing the distance itself, we compute the squared distance, which is a more efficient calculation (no square root):
X1 = Acid_Durability_(seconds)   X2 = Strength (gram/Cm2)   Squared distance to query instance (3, 7)
7                                7                          (7-3)² + (7-7)² = 16
7                                4                          (7-3)² + (4-7)² = 25
3                                4                          (3-3)² + (4-7)² = 9
1                                4                          (1-3)² + (4-7)² = 13
3. Sort the distances and determine the nearest neighbours based on the K-th minimum distance:

X1   X2   Squared distance to (3, 7)   Rank (minimum distance)   Included in 3-nearest neighbours?
7    7    16                           3                         Yes
7    4    25                           4                         No
3    4    9                            1                         Yes
1    4    13                           2                         Yes
4. Gather the categories (Y) of the nearest neighbours. Notice in the second row of the last column below that the category of the nearest neighbour (Y) is not included, because the rank of this data point is greater than 3 (= K).

X1   X2   Squared distance to (3, 7)   Rank (minimum distance)   Included in 3-nearest neighbours?   Y = Category of nearest neighbour
7    7    16                           3                         Yes                                 Poor
7    4    25                           4                         No                                  -
3    4    9                            1                         Yes                                 Fine
1    4    13                           2                         Yes                                 Fine
5. Use the simple majority of the categories of the nearest neighbours as the prediction value of the query instance.

Since we have 2 Fine and 1 Poor among the three nearest neighbours, we conclude that the new paper tissue that passed the lab test with X1 = 3 and X2 = 7 is in the Fine category.
"However, distance cannot be determined using qualities that are categorical, as opposed to
quantitative, such as color." The preceding description operates under the presumption that all of
the attributes that are used to describe the tuples are numeric. When dealing with categorical
attributes, a straightforward way is to contrast the value of the attribute that corresponds to tuple
X1 with the value that corresponds to tuple X2. If there is no difference between the two (for
example, If tuple X1 and X2 both contain the blue color, then the difference between the two is
regarded as being equal to zero. If the two are distinct from one another (for example, if tuple X1
carries blue and tuple X2 carries red), then the comparison between them is counted as 1. It's
possible that other ways will incorporate more complex systems for differentiating grades (such
as in a scenario in which a higher difference score is provided, say, for blue and white than for
blue and black).
"What about the missing values?" If the value of a certain attribute A is absent from either the
tuple X1 or the tuple X2, we will, as a rule, assume the greatest feasible disparity between the
two. Imagine for a moment that each of the traits has been mapped to the interval [0, 1]. When it
comes to categorical attributes, the difference value is set to 1 if either one of the related values
of A or both of them are absent. If A is a number and it is absent from both the tuple X1 and the
tuple X2, then the difference is also assumed to be 1. If there is just one value that is absent and
the other value (which we will refer to as v 0) is present and has been normalised, Consequently,
we can either take the difference to be |1 – v' | or |0 – v' | (i.e., 1–v' or v'), depending on which of
the two is larger.
Nearest-neighbour classifiers use distance-based comparisons that give each attribute an equal weight. They can therefore be less accurate when the attributes are noisy or irrelevant. The method has, however, been extended to incorporate attribute weighting and the pruning of noisy data tuples. The choice of distance measure can be very important; the city block (Manhattan) distance or another distance measure may also be used.
Nearest-neighbour classifiers can be very slow when classifying test tuples. If D is a training database containing |D| tuples and k = 1, then O(|D|) comparisons are required to classify a given test tuple. The total number of comparisons can be reduced to O(log |D|) by first arranging the stored tuples into search trees. The running time can be reduced to O(1), a constant independent of |D|, if a parallel implementation is used. Partial distance calculations and editing of the stored tuples can also cut down classification time. In the partial distance method, we compute the distance using only a subset of the n attributes; if this partial distance exceeds a specified threshold, the computation for the current stored tuple is halted and the process moves on to the next stored tuple. The editing method removes training tuples that prove useless; this strategy is also known as pruning or condensing, because it reduces the number of stored tuples.
☞ Check Your Progress 5
9. Apply KNN classification algorithm to the following data and predict value for (10,7)
for K = 3
Feature 1 Feature 2 Class

1 1 A
2 3 A
2 4 A
5 3 A
8 6 B
8 8 B
9 6 B

11 7 B
……………………………………………………………………………………………
……………………………………………………………………………………………
10.5.3 DECISION TREES

Given a data set D = {t1, t2, …, tn} where ti = <ti1, …, tih>, that is, each tuple is represented by h attributes, assume that the database schema contains the attributes {A1, A2, …, Ah}. Also, let us suppose that the classes are C = {C1, …, Cm}. Then:

A Decision or Classification Tree is a tree associated with D such that:

• each internal node is labelled with an attribute Ai;
• each arc is labelled with a predicate that can be applied to the attribute at the parent node;
• each leaf node is labelled with a class Cj.
The basic steps in the decision tree approach are as follows:

• Build the tree by using the training dataset/database.
• Apply the tree to the new dataset/database.
Decision Tree Induction is the process of learning the classification using the inductive approach. During this process, we create a decision tree from the training data; this decision tree can then be used for making classifications. To define this, we need to define the following.
Let us assume that we are given probabilities p1, p2, …, ps whose sum is 1. Let us also define the term Entropy, which is the measure of the amount of randomness, surprise or uncertainty. Our basic goal in the classification process is that the entropy for a classification should be zero: if there is no surprise, the entropy is equal to zero. Entropy is defined as:

H(p1, p2, …, ps) = Σi=1..s ( pi * log(1/pi) )    ……. (1)
ID3 Algorithm for Classification

This algorithm creates a tree using the procedure given below and tries to reduce the expected number of comparisons.
Algorithm: ID3 algorithm for creating a decision tree from the given training data.

Input: The training data and the attribute-list.

Output: A decision tree.

Process:

Step 1: Create a node N.

Step 2: If all of the sample data belong to the same class C (which means the probability is 1), then return N as a leaf node with the class C label.

Step 3: If the attribute-list is empty, return N as a leaf node and label it with the most common class within the training data; // majority voting

Step 4: Select the split-attribute, which is the attribute in the attribute-list with the highest information gain;

Step 5: Label node N with the split-attribute;

Step 6: For each known value Ai of the split-attribute: // partition the samples

    Create a branch from node N for the condition: split-attribute = Ai;
    // Now consider a partition and recursively create the decision tree:
    Let xi be the set of data from the training data that satisfies the condition split-attribute = Ai;
    if the set xi is empty then
        attach a leaf labelled with the most common class in the prior set of training data;
    else
        attach the node returned after a recursive call to the program with training data = xi and
        new attribute-list = present attribute-list - split-attribute;

End of Algorithm.
Please note: the algorithm given above chooses the split attribute with the highest information gain, which is calculated as follows:

Gain(D, S) = H(D) - Σi=1..s ( P(Di) * H(Di) )    ……… (2)

where S is the new set of states {D1, D2, D3, …, Ds} and H(D) measures the amount of order in that state.
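The entropy and gain formulas (1) and (2) can be sketched in a few lines of Python; log base 10 is assumed here because the worked example below uses it:

```python
# Sketch: the entropy and information-gain computations used by ID3.
import math

def entropy(counts):
    """H over class counts, e.g. entropy([6, 5]) for 6 ASSISTANT / 5 BOSS tuples."""
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

def gain(parent_counts, partitions):
    """Information gain of a split: H(D) - sum(P(Di) * H(Di))."""
    total = sum(parent_counts)
    weighted = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - weighted

print(entropy([6, 5]))                         # ~0.29923 (initial entropy below)
print(gain([6, 5], [[4, 0], [2, 3], [0, 2]]))  # ~0.1664 (gain of splitting on Age)
```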
Consider the following data, in which the Position attribute acts as the class:
Department     Age     Salary          Position

_PERSONNEL_    31-40   Medium_Range    _BOSS_
_PERSONNEL_    21-30   Low_Range       _ASSISTANT_
_PERSONNEL_    31-40   Low_Range       _ASSISTANT_
_MIS_          21-30   Medium_Range    _ASSISTANT_
_MIS_          31-40   High_Range      _BOSS_
_MIS_          21-30   Medium_Range    _ASSISTANT_
_MIS_          41-50   High_Range      _BOSS_
_ADMIN_        31-40   Medium_Range    _BOSS_
_ADMIN_        31-40   Medium_Range    _ASSISTANT_
_SECURITY_     41-50   Medium_Range    _BOSS_
_SECURITY_     21-30   Low_Range       _ASSISTANT_

Figure 3: Sample data for classification
We apply the ID3 algorithm on the above dataset as follows:

The initial entropy of the dataset, using formula (1), is

H(initial) = (6/11)log(11/6) + (5/11)log(11/5) = 0.29923

where the first term corresponds to _ASSISTANT_ (6 tuples) and the second to _BOSS_ (5 tuples).

Now let us calculate the gain for Department using formula (2):

Gain(Department) = H(initial) - [ P(_PERSONNEL_) * H(_PERSONNEL_) + P(_MIS_) * H(_MIS_) + P(_ADMIN_) * H(_ADMIN_) + P(_SECURITY_) * H(_SECURITY_) ]

= 0.29923 - { (3/11)[(1/3)log 3 + (2/3)log(3/2)] + (4/11)[(2/4)log 2 + (2/4)log 2] + (2/11)[(1/2)log 2 + (1/2)log 2] + (2/11)[(1/2)log 2 + (1/2)log 2] }

= 0.29923 - 0.2943
= 0.0049
Similarly:
Gain(Age) = 0.29923- { (4/11)[(4/4)log(4/4)] + (5/11)[(3/5)log(5/3) + (2/5)log (5/2)] +
(2/11)[(2/2) log (2/2)] }
= 0.29923 - 0.1328
= 0.1664
Gain(Salary) = 0.29923 – { (3/11)[(3/3)log 3] + (6/11)[(3/6) log2 + (3/6)log 2] +
(2/11) [(2/2 log(2/2) ) }
= 0.29923 – 0.164
= 0.1350
Since Age has the maximum gain, this attribute is selected as the first splitting attribute. In the age range 31-40 the class is not defined, while for the other ranges it is defined.

So, we have to again calculate the splitting attribute for this age range (31-40). Now, the tuples that belong to this range are as follows:
Department     Salary          Position

_PERSONNEL_    Medium_Range    _BOSS_
_PERSONNEL_    Low_Range       _ASSISTANT_
_MIS_          High_Range      _BOSS_
_ADMIN_        Medium_Range    _BOSS_
_ADMIN_        Medium_Range    _ASSISTANT_
Again, the initial entropy = (2/5)log(5/2) + (3/5)log(5/3) = 0.29922, where the first term corresponds to _ASSISTANT_ and the second to _BOSS_.

Gain(Department) = 0.29922 - { (2/5)[(1/2)log 2 + (1/2)log 2] + (1/5)[(1/1)log 1] + (2/5)[(1/2)log 2 + (1/2)log 2] }
= 0.29922 - 0.240
= 0.05922

Gain(Salary) = 0.29922 - { (1/5)[(1/1)log 1] + (3/5)[(1/3)log 3 + (2/3)log(3/2)] + (1/5)[(1/1)log 1] }
= 0.29922 - 0.1658
= 0.13335
The gain is maximum for the Salary attribute, so we take Salary as the next splitting attribute. For the medium salary range the class is not defined, while for the other ranges it is defined. So, we have to again calculate the splitting attribute for this middle range. Since only Department is left, Department will be the next splitting attribute. Now, the tuples that belong to this salary range are as follows:
Department     Position
_PERSONNEL_    _BOSS_
_ADMIN_        _BOSS_
_ADMIN_        _ASSISTANT_
Again, in the _PERSONNEL_ department all persons are _BOSS_, while in _ADMIN_ there is a tie between the classes; so a person in the _ADMIN_ department can be either _BOSS_ or _ASSISTANT_. Now the decision tree will be as follows:
Age?
  21-30 -> _ASSISTANT_
  41-50 -> _BOSS_
  31-40 -> Salary?
    Low_Range -> _ASSISTANT_
    High_Range -> _BOSS_
    Medium_Range -> Department?
      _PERSONNEL_ -> _BOSS_
      _ADMIN_ -> _ASSISTANT_/_BOSS_

Figure 4: The decision tree using the ID3 algorithm for the sample data
Now, we will take a new dataset and we will classify the class of each tuple by applying the decision tree
that we have built above.
Steps of the decision tree algorithm:

1. Data pre-processing step
2. Fitting a decision-tree algorithm to the training set
3. Predicting the test result
4. Testing the accuracy of the result (creation of a confusion matrix)
5. Visualising the test set result.
Example : Problem on Decision Tree - Consider whether a dataset based on which we will
determine whether to play football or not.
_OUTLOOK_ TEMP. HUMIDITY WIND PLAY
FOOTBALL(YES/NO)
_SUNNY_ _HOT_ HIGH WEAK NO
_SUNNY_ _HOT_ HIGH STRONG NO
_OVERCAST_ _HOT_ HIGH WEAK YES
_RAINY_ _MILD_ HIGH WEAK YES
_RAINY_ _COOL_ NORMAL WEAK YES
_RAINY_ _COOL_ NORMAL STRONG NO
_OVERCAST_ _COOL_ NORMAL STRONG YES
_SUNNY_ _MILD_ HIGH WEAK NO
_SUNNY_ _COOL_ NORMAL WEAK YES
_RAINY_ _MILD_ NORMAL WEAK YES
_SUNNY_ _MILD_ NORMAL STRONG YES
_OVERCAST_ _MILD_ HIGH STRONG YES
_OVERCAST_ _HOT_ NORMAL WEAK YES
_RAINY_ _MILD_ HIGH STRONG NO
Here there are four independent variables that determine the dependent variable. The independent variables are _OUTLOOK_, TEMP., Humidity, and Wind, and the dependent variable is whether to play football or not.

As the first step, we have to find the parent (root) node for our decision tree. For that, follow the steps below.

Find the entropy of the class variable: E(S) = -[(9/14)log(9/14) + (5/14)log(5/14)] = 0.94

Note: here we typically take the log to base 2. In total there are 14 yes/no outcomes, of which 9 are yes and 5 are no; the probabilities above are based on these counts.
From the above data, for _OUTLOOK_ we can easily arrive at the following table:

                         PLAY: YES   NO   TOTAL
_OUTLOOK_   _SUNNY_      3           2    5
            _OVERCAST_   4           0    4
            _RAINY_      2           3    5
                                          14
Now we have to calculate the average weighted entropy, i.e. the total of the weight of each feature multiplied by its probability:

E(S, _OUTLOOK_) = (5/14)*E(3,2) + (4/14)*E(4,0) + (5/14)*E(2,3)
= (5/14)(-(3/5)log(3/5) - (2/5)log(2/5)) + (4/14)(0) + (5/14)(-(2/5)log(2/5) - (3/5)log(3/5))
= 0.693

The next step is to find the information gain. It is the difference between the parent entropy and the average weighted entropy found above:

IG(S, _OUTLOOK_) = 0.94 - 0.693 = 0.247
Similarly, find the information gain for TEMP., Humidity, and Wind:

IG(S, TEMP.) = 0.940 - 0.911 = 0.029
IG(S, Humidity) = 0.940 - 0.788 = 0.152
IG(S, Wind) = 0.940 - 0.8932 = 0.048
Now select the feature with the largest information gain. Here it is _OUTLOOK_, so it forms the first (root) node of our decision tree. Now our data looks as follows:
_OUTLOOK_ TEMP. HUMIDITY WIND PLAYED(YES/NO)
_SUNNY_ _HOT_ HIGH WEAK NO
_SUNNY_ _HOT_ HIGH STRONG NO
_SUNNY_ _MILD_ HIGH WEAK NO
_SUNNY_ _COOL_ NORMAL WEAK YES
_SUNNY_ _MILD_ NORMAL STRONG YES
_OUTLOOK_ TEMP. HUMIDITY WIND PLAYED(YES/NO)
_OVERCAST_ _HOT_ HIGH WEAK YES
_OVERCAST_ _COOL_ NORMAL STRONG YES
_OVERCAST_ _MILD_ HIGH STRONG YES
_OVERCAST_ _HOT_ NORMAL WEAK YES

_OUTLOOK_   TEMP.    HUMIDITY   WIND     PLAYED(YES/NO)

_RAINY_     _MILD_   HIGH       WEAK     YES
_RAINY_     _COOL_   NORMAL     WEAK     YES
_RAINY_     _COOL_   NORMAL     STRONG   NO
_RAINY_     _MILD_   NORMAL     WEAK     YES
_RAINY_     _MILD_   HIGH       STRONG   NO
Since the _OVERCAST_ branch contains only examples of class 'Yes', we can set it as Yes; that means that if _OUTLOOK_ is _OVERCAST_, football will be played. Our decision tree now has _OUTLOOK_ at the root, with the _OVERCAST_ branch ending in Yes. The next step is to find the next node in our decision tree. We will now find the node under _SUNNY_; we have to determine which of TEMP., Humidity or Wind has the higher information gain.
_OUTLOOK_   TEMP.    HUMIDITY   WIND     PLAYED(YES/NO)

_SUNNY_     _HOT_    HIGH       WEAK     NO
_SUNNY_     _HOT_    HIGH       STRONG   NO
_SUNNY_     _MILD_   HIGH       WEAK     NO
_SUNNY_     _COOL_   NORMAL     WEAK     YES
_SUNNY_     _MILD_   NORMAL     STRONG   YES
Calculate the parent entropy: E(_SUNNY_) = -(3/5)log(3/5) - (2/5)log(2/5) = 0.971.

Now calculate the information gain of TEMP., IG(_SUNNY_, TEMP.):
PLAY
YES NO TOTAL
_HOT_ 0 2 2
TEMP. _COOL_ 1 1 2
_MILD_ 1 0 1
5

E(_SUNNY_, TEMP.) = (2/5)*E(0,2) + (2/5)*E(1,1) + (1/5)*E(1,0)=2/5=0.4

Now calculate information gain.

IG(_SUNNY_, TEMP.) = 0.971–0.4 =0.571

Similarly, we get

IG(SUNNY, HUMIDITY) = 0.971

IG(SUNNY, WIND) = 0.020

Here IG(SUNNY, HUMIDITY) is the largest value, so HUMIDITY is the node that comes under SUNNY.
                        PLAY
                  YES   NO   TOTAL
HUMIDITY  HIGH     0     3     3
          NORMAL   2     0     2
                               5

From the above table for HUMIDITY, we can say that play will occur if humidity is normal and will not occur if it is high. Similarly, find the nodes under RAINY.

Note: A branch with entropy more than 0 needs further splitting.

Finally, our decision tree will look as below, with OUTLOOK at the root node, HUMIDITY under the SUNNY branch, and WIND under the RAINY branch.
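
As a cross-check of the hand calculations above, here is a minimal Python sketch (the function name is our own) that computes the entropy and information gain for OUTLOOK:

import math

def entropy(counts):
    # Entropy of a class distribution given raw counts, log taken to base 2.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

E_S = entropy([9, 5])     # parent entropy: 9 yes, 5 no
# OUTLOOK splits the 14 rows into SUNNY (3 yes, 2 no), OVERCAST (4, 0), RAINY (2, 3).
E_outlook = (5/14) * entropy([3, 2]) + (4/14) * entropy([4, 0]) + (5/14) * entropy([2, 3])
print(E_S, E_outlook, E_S - E_outlook)   # approximately 0.94, 0.693 and 0.247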

10.5.4 LOGISTIC REGRESSION

Logistic Regression in Machine Learning

 Logistic regression, which is part of the Supervised Learning method, is one of the most
popular Machine Learning algorithms. It is used to predict the categorical dependent
variable based on a set of independent variables.
 Logistic regression predicts the outcome of a dependent variable that has a "yes" or "no"
answer. Because of this, the result must be a discrete or categorical value. It can be Yes
or No, 0 or 1, true or false, etc., but instead of giving the exact value as 0 or 1, it gives the
probabilistic values that lie between 0 and 1.
 Logistic Regression is a lot like Linear Regression, but the way they are used is different.
Linear regression is used to solve regression problems, while logistic regression is used to
solve classification problems.
 In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, which predicts two maximum values (0 or 1).
 The curve from the logistic function shows how likely something is, like whether the
cells are cancerous or not, whether a mouse is overweight or not based on its weight, etc.

Logistic Regression is an important machine learning algorithm because it can use both
continuous and discrete datasets to give probabilities and classify new data.

Logistic regression can be used to classify observations based on different types of data, and it is
easy to figure out which variables are the most useful for classifying. The logistic function is
shown in the picture below:

[Figure: S-shaped logistic (sigmoid) curve rising from 0 to 1 on the Y axis, with a threshold value of 0.5; an example point y = 0.8 lies above the threshold and y = 0.3 lies below it]

Note: Logistic regression is based on the idea of predictive modeling as regression, so that's why
it's called "logistic regression." However, it's used to classify samples, so it's a part of the
classification algorithm.

Logistic Function (Sigmoid Function):


 The mathematical function called the "sigmoid function" turns predicted values into
probabilities.
 The sigmoid function is a special case of the logistic function. It is usually written as σ(x)
or sig(x). The formula for it is: σ(x) = 1/(1+exp(-x)) (a short sketch in code follows this list).
 It turns any real number into a number between 0 and 1.
 The value of the logistic regression must be between 0 and 1, and it cannot go beyond
that range. Because of this, it forms a curve that looks like the letter "S". This S-shaped
curve is called the sigmoid function or the logistic function.
 The threshold value decides between the outcomes 0 and 1 in logistic regression: values
above the threshold are mapped to 1, and values below it are mapped to 0.
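
A minimal Python sketch of the sigmoid function and the threshold rule described above (the 0.5 threshold is the usual default, assumed here):

import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)); maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, threshold=0.5):
    # Values at or above the threshold map to class 1, values below to class 0.
    return 1 if sigmoid(x) >= threshold else 0

print(sigmoid(0))       # 0.5, exactly at the threshold
print(classify(2.0))    # 1, since sigmoid(2) is about 0.88
print(classify(-2.0))   # 0, since sigmoid(-2) is about 0.12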
Assumptions for Logistic Regression:
 The dependent variable must be a categorical one.
 The independent variables should not be highly correlated with each other (no multicollinearity).
Types of Logistic Regression: Based on the categories, Logistic Regression can be divided into
three types:
• Binomial: In binomial logistic regression, the dependent variables can only be either 0
or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, the dependent variable can be one of
three or more types that are not in order, such as "cats," "dogs," or "sheep."
• Ordinal: In ordinal Logistic regression, the dependent variables can be ranked, such as
"low," "medium," or "high."
Logistic Regression Equation: The Logistic Regression equation can be derived from the Linear Regression equation. The mathematical steps to obtain it are:
 We know the equation of the straight line can be written as:
y = b0+b1X1+b2X2+…….+bnXn
 In Logistic Regression y can be between 0 and 1 only, so let's divide the above
equation by (1-y): y/(1-y); this gives 0 for y=0 and infinity for y=1
 But we need a range from -[infinity] to +[infinity], so taking the logarithm of the equation
gives: Log[y/(1-y)] = b0+b1X1+b2X2+…….+bnXn

The above equation is the final equation for Logistic Regression.
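
A hedged sketch of fitting such a model with scikit-learn (library choice assumed; the toy arrays are placeholders). The fitted intercept_ and coef_ correspond to b0 and b1, ..., bn in the equation above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, assumed for illustration: one feature, binary label.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# intercept_ is b0 and coef_ holds b1..bn of log[y/(1-y)] = b0 + b1x1 + ... + bnxn
print(model.intercept_, model.coef_)
print(model.predict_proba([[3.5]]))   # probabilities for classes 0 and 1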

The relationship between a numerical response and a numerical or categorical predictor is the
subject of the statistical technique known as simple linear regression: it looks at the relationship
between a single numerical response and a single predictor, while multiple regression looks at
the relationship between a single numerical response and a number of different numerical and/or
categorical predictors. What should be done, however, when the predictors are odd (nonlinear,
intricate dependence structure, and so on), or when the response is unusual (categorical, count
data, and so on)? When this occurs, we deal with odds, which are another method of measuring
the likelihood of an event and are frequently applied in the context of gambling (and logistic
regression).

Odds: For some event E, the odds are expressed as

odds(E) = P(E)/P(E^c) = P(E)/(1 − P(E))

Similarly, if we are told the odds of E are x to y, then

odds(E) = x/y = {x/(x + y)}/{y/(x + y)}

which implies P(E) = x/(x + y) and P(E^c) = y/(x + y).

Logistic regression is a statistical approach for modelling a binary categorical variable using
numerical and categorical predictors, and this idea of Odds is commonly employed in it. We
suppose the outcome variable was generated by a binomial distribution, and we wish to build a
model with p as the probability of success for a given collection of predictors. There are other
alternatives, but the logit function is the most popular.

Logit function: logit (p) = log {p/(1 – p)} , for 0 ≤ p ≤ 1

Example-1: In a survey of 250 customers of an auto dealership, the service department was
asked if they would tell a friend about it. The number of people who said "yes" was 210, where
"p" is the percentage of customers in the group from which the sample was taken who would
answer "yes" to the question. Find the sample odds and sample proportion.

Solution: The number of customers who would respond Yes in a simple random sample (SRS)
of size n has the binomial distribution with parameters n and p. The sample size of customers is n
= 250, and the number who responded Yes is the count X = 210. Therefore, the sample
proportion is p’ = 210/250 = 0.84

Since logistic regression works with odds rather than proportions, we need to calculate the odds.
The odds are simply the ratio of the proportions for the two possible outcomes: if p’ is the
proportion for one outcome, then 1 – p’ is the proportion for the second outcome, and

odds = p’/(1 – p’)

A similar formula for the population odds is obtained by substituting p for p’ in this expression

Odds of responding Yes. For the customer service data, the proportion of customers who would
recommend the service in the sample of customers is p’ = 0.84, so the proportion of customers
who would not recommend the service department will be 1 – p’ i.e. 1 – p’ = 1 - 0.84 = 0.16

Therefore, the odds of recommending the service department are

odds = p’/(1 – p’) = 0.84/0.16 = 5.25

When people speak about odds, they often round to integers or fractions. If we round 5.25 to 5 =
5/1, we would say that the odds are approximately 5 to 1 that a customer would recommend the
service to a friend. In a similar way, we could describe the odds that a customer would not
recommend the service as 1 to 5.
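
The proportion-to-odds arithmetic from this example takes only a few lines of Python:

p_hat = 210 / 250                  # sample proportion of "Yes" responses
odds_yes = p_hat / (1 - p_hat)     # odds of recommending the service
odds_no = (1 - p_hat) / p_hat      # odds of not recommending it
print(p_hat, odds_yes, odds_no)    # 0.84, 5.25, about 0.19 (roughly 1 to 5)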

☞ Check Your Progress 6


Q1 Odds of drawing a heart. If you deal one card from a standard deck, the probability
that the card is a heart is 13/52 = 1/4.
(a) Find the odds of drawing a heart.
(b) Find the odds of drawing a card that is not a heart.

10.5.5 SUPPORT VECTOR MACHINES

Support Vector Machine, also called Support Vector Classification, is a supervised and linear
Machine Learning technique that is most often used to solve classification problems. In this
section, we will take a look at Support Vector Machines, an approach for categorising data
that has a lot of potential and can be used for both linear and nonlinear datasets. A support vector
machine, often known as an SVM, is a type of algorithm that transforms the primary training
data into a new format that has a higher dimension by making use of a nonlinear mapping. It
searches for the ideal linear separating hyperplane in this additional dimension. This hyperplane
is referred to as a "decision boundary" since it separates the tuples of one class from those of
another. A hyperplane can always be used to split data from two classes if the appropriate
nonlinear mapping to a high enough dimension is used. This hyperplane is located by the SVM
through the use of support vectors, also known as important training tuples, and margins (defined
by the support vectors). We'll go into further detail about these fresh concepts in the next
paragraphs.

While studying machine learning, one of the classifiers that we come across is the Support
Vector Machine, or SVM for short. It is one of the most common approaches for categorising data in
the field of machine learning, and it performs admirably on both small and large datasets.
SVMs can be utilised for both classification and regression jobs; however, their performance is
superior when applied to classification scenarios. When they were first introduced in the 1990s,
they quickly became quite popular, and even now, with only minor adjustments, they are the
solution of choice when a high-performing algorithm is required.

Fig 1. Data points separated into two groups


There are two main categories that can be applied to SVM:

• Linear Support Vector Machine: You can only use Linear SVM if the data can be completely
separated into linear categories. The ability to separate a set of data points into two classes using
just one straight line is what is meant when we talk about something being "completely linearly
separable" (if 2D).
• Non-Linear Support Vector Machine: When the data isn't linearly separable, i.e., the data points
can't be divided into two classes using a straight line (in 2D), we use Non-Linear SVM, which
applies advanced techniques like the kernel trick to categorise such data points. In the majority
of real-world applications we don't find linearly separable data points, so we use the kernel
approach to solve them instead.
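
A minimal sketch contrasting the two categories with scikit-learn (library choice assumed); kernel="linear" fits a straight-line boundary, while kernel="rbf" applies a kernel trick for data that is not linearly separable:

import numpy as np
from sklearn.svm import SVC

# XOR-style toy data (assumed for illustration): not linearly separable in 2D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear accuracy:", linear_svm.score(X, y))  # typically ~0.5: XOR has no separating line
print("rbf accuracy:", rbf_svm.score(X, y))        # typically 1.0: the kernel trick separates it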

Let's take a look of some SVM terminology.

 Support Vectors: The data points closest to the hyperplane are referred to as support vectors.
These data points are used to determine the boundary (separating line) between the two groups.
 Margin: The margin is the distance between the hyperplane and the nearest observations to the
hyperplane (the support vectors). SVM considers a large margin to be a favourable margin.
[Figure: maximum-margin hyperplane lying between a positive hyperplane and a negative hyperplane, with support vectors on the margins; axes X1 and X2]

Fig2. Diagram of Support Vector Machine

Working of SVM

SVM is defined solely in terms of support vectors; we do not need to be concerned with any other
observations, because the margin is calculated based on the support vectors, the points closest to the
hyperplane. This is in contrast to logistic regression, in which the classifier is defined over all of the
points. Because of this, SVM is able to take advantage of some natural speedups.
To further understand how SVM operates, let's look at an example. Suppose we have a dataset that has
two different classes (green and blue). It is necessary for us to decide whether the new data point should
be classified as blue or green.

Fig 3. Dataset with two classes.

There are many ways to put these points into groups, but the question is which is the best and how do we
find it?

NOTE: We call this decision boundary a "straight line" because we are plotting data points on a two-
dimensional graph. If there are more dimensions, we call it a "hyperplane."

Fig4. Hyperplane
SVM is all about finding the hyperplane with the greatest distance between the two classes; such a
hyperplane is the best hyperplane. This is done by finding many hyperplanes that fit the labels and then
picking the one that is farthest from the data points, i.e., the one with the biggest margin.

[Fig 5. Optimal hyperplane, with a support vector lying on each margin]

In geometry, a hyperplane is the name given to a subspace that is one dimension smaller than the ambient
space. Despite the fact that this definition is accurate, it is not very clear. Instead, we will concentrate on
lines in order to better understand what a hyperplane is. If you can recall the mathematics that you studied
in high school, you presumably know that a line has an equation of the form y = a x + b, that the constant
a is called the slope, and that b is where the line crosses the y-axis. It is important to note that this linear
equation involves two variables, denoted by the letters y and x, but we are free to give them any name we
like. The equation holds for a whole range of points (x, y), and we refer to the collection of those points
as a line.

Another notation for the equation of a line may be obtained if we define the two-dimensional vectors
x = (x1, x2) and w = (a, -1):
w.x + b = 0
where w.x is the dot product of w and x.

Now we need to locate a hyperplane: we look for the hyperplane with the largest margin (a margin is a buffer
zone around the hyperplane equation), working toward having the largest margin while having the fewest
points possible inside it (the points defining the margin are known as support vectors).

"The goal is to maximise the minimum distance," to put it another way. for the sake of distance. If the
point from the positive group is substituted in the hyperplane equation while generating predictions on the
training data that was binary classified as positive and negative groups, we will get a value larger than 0.
(zero), Mathematically, wT(Φ(x)) + b > 0 And predictions from the negative group in the hyperplane
equation would give negative value as wT(Φ(x)) + b < 0. The indicators, on the other hand, were about
training data, which is how we're training our model. Give a positive sign for a positive class and a
negative sign for a negative class.

Now, while testing this model on test data, if we correctly predict a positive class (positive sign, a
greater-than-zero score), then two positives multiply to a positive, hence a greater-than-zero result. The
same is true if we correctly forecast the negative group, because two negatives also multiply to a positive.
However, if the model incorrectly identifies the positive group as the negative group, one plus and one
minus multiply to a minus, resulting in a result that is less than zero. Summarising, the product of the
predicted score and the actual label is greater than 0 (zero) on a correct prediction, and less than zero
otherwise.
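
A tiny sketch of this sign rule: with labels coded as +1/-1, the product of the actual label and the predicted score w.x + b is positive exactly when the prediction is correct (w, b and the points below are made-up values):

import numpy as np

w, b = np.array([1.0, -1.0]), 0.0                        # assumed hyperplane parameters
points = np.array([[2.0, 0.5], [0.5, 2.0], [0.5, 1.5]])  # made-up test points
labels = np.array([+1, -1, +1])                          # true classes coded as +1 / -1

scores = points @ w + b        # f(x) = w.x + b for each point
print(labels * scores)         # positive entries are correct; the last is negative: misclassified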

☞ Check Your Progress 7

11. Suppose you are using a Linear SVM classifier on a 2-class classification problem. You are given
data in which some points, circled red, represent the support vectors. If you remove any one of the red
points from the data, will the decision boundary change?

A) Yes
B) No

12. The effectiveness of an SVM depends upon:


A) Selection of Kernel
B) Kernel Parameters
C) Soft Margin Parameter C
D) All of the above

13. SVMs are less effective when:


A) The data is linearly separable
B) The data is clean and ready to use
C) The data is noisy and contains overlapping points

10.11 SOLUTIONS/ANSWERS
☞ Check Your Progress 1
1. Compare between Supervised and Un-Supervised Learning.
Solution : Refer to section 10.3
2. List the Steps Involved in Supervised Learning
Solution : Refer to section 10.3
3. What are the Common Issues Faced While Using Supervised Learning
Solution : Refer to section 10.3

☞ Check Your Progress 2


4. Compare between Multi Class and Multi Label Classification
Solution : Refer to section 10.4
5. Compare between structured and unstructured data
Solution : Refer to section 10.4
6. Compare between Lazy learners and Eager Learners algorithms for machine learning.
Solution : Refer to section 10.4

☞ Check Your Progress 3


7. List the classification algorithms under the categories of Linear and Non-Linear Models. Also
Discuss the various methods used for evaluating a classification model
Solution : Refer to section 10.5

☞ Check Your Progress 4

8. Using naive Bayesian classification, predict a class label. Given the training data
in Table-1 below, we want to use naive Bayesian classification to predict the class
label of a tuple. The characteristics age, income, student, and credit rating
characterise the data tuples.

Solution : Also Refer to section 10.5.1


The buys computer class label attribute has two unique values (yes and no). Let C1 represent the
buys computer = yes class and C2 represent the buys computer = no class. The tuple we want to
categorise is

X = (age = youth, income = medium, student = yes, credit rating = fair)

For i = 1, 2, we must maximise P(X|Ci)P(Ci). The prior probability for each class, P(Ci), can be
calculated using the training tuples:

P(buys computer = yes) = 9/14 = 0.643


P(buys computer = no) = 5/14 = 0.357
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
P(age = youth | buys computer = yes) = 2/9 = 0.222
P(age = youth | buys computer = no) = 3/5 = 0.600
P(income = medium | buys computer = yes) = 4/9 = 0.444
P(income = medium | buys computer = no) = 2/5 = 0.400
P(student = yes | buys computer = yes) = 6/9 = 0.667
P(student = yes | buys computer = no) = 1/5 = 0.200
P(credit rating = fair | buys computer = yes) = 6/9 = 0.667
P(credit rating = fair | buys computer = no) = 2/5 = 0.400
Using the above probabilities, we obtain

P(X|buys computer = yes) = P(age = youth | buys computer = yes) ×


P(income = medium | buys computer = yes) ×
P(student = yes | buys computer = yes) ×
P(credit rating = fair | buys computer = yes)
= 0.222×0.444×0.667×0.667 = 0.044.

Similarly, P(X|buys computer = no) = 0.600×0.400×0.200×0.400 = 0.019.

To find the class, Ci , that maximizes P(X|Ci)P(Ci), we compute


P(X|buys computer = yes)P(buys computer = yes) = 0.044×0.643 = 0.028
P(X|buys computer = no)P(buys computer = no) = 0.019×0.357 = 0.007

Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.

☞ Check Your Progress 5


9. Apply KNN classification algorithm to the following data and predict value for (10,7)
for K = 3

Feature 1 Feature 2 Class

1 1 A
2 3 A
2 4 A
5 3 A
8 6 B

8 8 B
9 6 B
11 7 B
Solution : Refer to section 10.5.2

☞ Check Your Progress 6


10. Odds of drawing a heart. If you deal one card from a standard deck, the
probability that the card is a heart is 13/52 = 1/4.

(a) Find the odds of drawing a heart.


(b) Find the odds of drawing a card that is not a heart.

Solution : Refer to section 10.5.4

☞ Check Your Progress 7


11. Suppose you are using a Linear SVM classifier on a 2-class classification problem. You are
given data in which some points, circled red, represent the support vectors. If you remove any
one of the red points from the data, will the decision boundary change?

A) Yes
B) No
Solution: A

12. The effectiveness of an SVM depends upon:

A) Selection of Kernel
B) Kernel Parameters
C) Soft Margin Parameter C
D) All of the above
Solution: D
The SVM effectiveness depends upon how you choose the basic 3 requirements mentioned above in such
a way that it maximises your efficiency, reduces error and overfitting.

13. SVMs are less effective when:

A) The data is linearly separable


B) The data is clean and ready to use
C) The data is noisy and contains overlapping points
Solution: C
When the data has noise and overlapping points, there is a problem in drawing a clear hyperplane without
misclassifying.

10.12 FURTHER READINGS

1. Machine Learning: An Algorithmic Perspective, Stephen Marsland, 2nd Edition, CRC Press, 2015.
2. Machine Learning, Tom Mitchell, 1st Edition, McGraw-Hill, 1997.
3. Machine Learning: The Art and Science of Algorithms that Make Sense of Data, Peter Flach, 1st
Edition, Cambridge University Press, 2012.
UNIT 11 REGRESSION

11.0 Introduction
11.1 Objectives
11.2 Regression Algorithm
11.3 Linear Regression
11.4 Polynomial Regression
11.5 Support Vector Regression
11.6 Summary
11.7 Solution/Answers

11.0 INTRODUCTION

In the late nineteenth century, British scientist Francis Galton investigated the relationship between two
variables to study the hereditary heights of children. In his research he categorised parents into two
categories based on height: the 1st category contained parents whose height was smaller than the average
height, and the 2nd category contained parents whose height was greater than the average height. This
"regression toward mediocrity" gave these statistical methods their name; primarily, the term regression
describes the relationship between variables.
Simple regression y = m*x + C describes the relationship between one independent and one dependent
variable, where the variable y varies with the value of x and is thus the dependent variable; the value of
the variable x is not affected by any other variable, hence x is the independent variable, and m is a
constant (the slope).
Consider the following parent-children data set:

Parent 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5 72.5
Children 65.8 66.7 67.2 67.6 68.2 68.9 69.5 69.9 72.2

The mean height of the children is 68.44 whereas the mean height for the parents is 68.5.
The linear equation for the parents and children is
height_child = 21.52 + 0.69 * height_parent
Mathematically, simple linear regression can be defined as y = bx + c + ϵ, where b is the slope of the
regression line and x is the variable which can change the value of y but cannot be affected by another
variable, whereas y is a variable which varies with a change in the value of x. The term ϵ is known as the
error value, measured between the actual value and the predicted value. Variable y is described as the
dependent variable or response variable, and variable x is defined as an explanatory or predictor variable.
Regression is a supervised machine learning model which describes the relationships between the
response variable and predictor variables. So, a regression model is used when it is required to determine
the value of one variable using another variable.
If the prediction depends on a single independent variable, then the regression equation will be y = a + bx.
To determine the value of the dependent variable y we need to determine the slope b and the constant a;
then, by substituting different values for the variable x, we get different values of the variable y.
When x = 0 then y = a, which means that when the independent variable is absent, the predicted variable
takes the constant value. Suppose we have multiple independent variables x1, x2, x3, x4, ..., xn. Then the
regression equation will be y = a + b1x1 + b2x2 + b3x3 + ... + bnxn.

The regression line is also called the best-fit line because the regression line aims to fit all the points so
that the total error will be minimum. Regression is linear regression when there is one predictor variable
and we can apply a linear regression model. The multiple linear regression model is used when the
number of predictor variables is more than one. When the relationship between variables y and x is not
linear, we can apply a nonlinear regression model.

Following are the ways used by regression analysis to determine the relationships between the response
variable and predictor variables:
 Find the relationship: It is required to determine the relationships between the predictor variables
and the response variable. If a change in the independent variable results in a change in the
dependent variable, then a relationship exists.
 Strength of relationship: How much one variable changes when the value of the other variable
changes determines the strength of the relationship.
 Formation of relationship: If a change in the value of the independent variable results in a
change in the dependent variable, then formulate a mathematical equation to represent the
relationship between the two variables.
 Prediction: After formulation of the mathematical equation, find the predicted value.
 Other independent variables: There may be other independent variables which have an impact
on the dependent variable. If such variables exist, then formulate the mathematical equation using
these variables also.
Uses of Regression
 In a business scenario, regression can be used when it is required to determine the impact of
different independent variables on a target value.
 When we want to represent a problem in mathematical form, or we want to model a problem to
determine the impact of different variables.
 It is very easy to explain business logic with the help of regression; the logic can be explained
very easily to a non-specialist.
 When the target variable is normally distributed with known characteristics, regression is very
effective.

Examples of Regression
Relationship between uploading a picture on Facebook page and number of likes by the friends.
Relationship between the height of the child and their parents’ heights.
Relationship between the average food intake and weight gain.
Relationship between the numbers of hours studied and marks scored by the students.
Relationship between product consumption and an increase in the product price.

Terminologies used in Regression Analysis


 Dependent Variable: The variable used to predict the output. It is also called the target variable.
 Independent Variable: The variable which has an impact on the dependent variable is called
the independent variable. There may be one or more independent variables. This variable is also
called the predictor variable. For example, the salary of an employee depends on age,
qualification and experience. Here salary is the dependent variable, and age, qualification and
experience are independent variables.
 Outlier: An outlier is a value which affects our output; a very high value or very low value will
distort the result. In regression, we first have to remove the outliers.
 Multicollinearity: If two variables in our dataset are more correlated to each other than to the
other variables, such a condition is called multicollinearity. Example: age and date of birth are
correlated to each other, so we have to avoid one of them.
 Underfitting and Overfitting: Overfitting results when our machine learning model works well
with the training data set but does not work well with the test data set. Underfitting results when
our model does not perform well even on the training data set.

11.1 OBJECTIVES

After completing this unit you will be able to:


 Understand the Regression Algorithm
 Understand and apply Linear Regression
 Understand and apply Polynomial Regression
 Understand and apply Support Vector Regression
11.2 REGRESSION ALGORITHM

Following are various types of regression algorithms.


Linear regression: The linear regression algorithm is used when there is only one dependent
variable, while the independent variables can be one or more in number. If there is a single independent
variable, then it is called simple linear regression. In linear regression the relationship between the
dependent and independent variables is linear, i.e., of the type yi = a + b*xi, where yi is the dependent
variable and xi is the independent variable. Here b is the slope of the line and a is the intercept with the
y-axis. Example: child height = a + b * (parent height)

Multiple Linear Regression: When there is only one dependent variable and more than one independent
variables, then it results in multiple linear regression i.e. y=a+bx1+cx2+dx3. ; example weight = a+b *
(daily meal)+ c* (daily exercise)

Logistic regression: In the logistic regression algorithm, the dependent variable is binary in nature
(False/True). This algorithm is generally used in cases like the testing of medicines, detection of bank
fraud, etc. We have already discussed the concept of logistic regression in Unit 10 of this course.

Polynomial regression: Polynomial regression is described with the help of a polynomial equation in
which the power of the independent variable is more than one. There is no linear relationship between
the dependent and independent variables; it results in a curved line instead of a straight line, i.e., y = c +
a*x + b*x^2
Ecological regression: The ecological regression algorithm is used when the data belongs to groups.
Thus, the data is divided into different groups and regression is performed on the different groups.
Ecological regression is mostly used in political research, e.g., party_votes % = .2 + .5*(below_poverty_people_votes)

Ridge regression: It is a type of regularization. When data variables are highly correlated ridge
regression is used. Using some constraints on regression coefficients, it is used to reduce the error and
lower the bias. Mostly used in feature selection.

Lasso regression: In the least absolute shrinkage and selection operator (lasso) regression algorithm, a
penalty is assigned to the coefficients. Lasso regression uses a shrinkage technique where data values are
shrunk towards a mean.
Logic regression: In logic regression, the predictor variable and response variable are both binary in
nature; it is applicable to both classification and regression problems.

Bayesian regression: The Bayesian regression algorithm is based on Bayesian statistics. Random
variables are used as the parameters to estimate. In this algorithm, if data is absent then some prior
information is taken as input.

Quantile regression: This is used when the boundary of a quantile is of interest. When overweight and
underweight categories are considered for health analysis, quantile regression can be used.

Cox regression: The Cox regression algorithm is used when the output variable depends on a set of
independent variables, for example patient_survival_after_surgery(Survived, Died) = (age, condition, BMI)

11.3 LINEAR REGRESSION


Linear regression is a mathematical method implemented where we want to find the relationship between
the response variable and predictor variables. When the relationship is linear then it is called linear
regression; otherwise it is called nonlinear regression. Linear regression makes predictions for
continuous/real or numerical variables like age, salary, price, etc.

As shown in the figure, the x-axis represents the independent variable
and the y-axis represents the dependent variable. A line with some
slope, called the linear regression line, shows the
relationship between the independent and dependent
variables, and the dots represent the points of the data set, where
some points lie on the line and other points lie above
and below the line.

If DV is the dependent variable and IV is the independent variable, then a Positive Linear relationship
results when the dependent variable (DV) on the y-axis increases with an increase in the value of the
independent variable (IV) on the x-axis. For example, the distance traversed by a car increases when the
speed of the car increases; thus, the distance traversed by the car depends on the speed of the car.
And a Negative Linear relationship results when the dependent variable (DV) on the y-axis decreases
with an increase in the independent variable (IV) on the x-axis. For example, the time taken by a car
decreases with an increase in the speed of the car.

Now consider the following data points:

(xi, yi) = {(20,41), (30,83), (38,62), (52,100)} for i = 1, 2, 3 and 4.
(x1, y1) = (20, 41)
(x2, y2) = (30, 83)
(x3, y3) = (38, 62)
(x4, y4) = (52, 100)
where x = area given in square metres
and y = price of the land.
The following graph draws the linear relationship between x and y.
As shown in the diagram below, a set of data is given as input and the learning algorithm will generate an
output function conventionally known as a hypothesis (h). The role of the hypothesis function is to estimate the
price by taking the area as an input to the function. The mapping function h will map from area of land to price of
land.

Dataset

Learning Algorithm

Area of Land Hypothesis Land Price

h(x) = Ɵ0 + Ɵ1x, where h is a hypothesis mapping from x to y.


We assume every point is described by our line on the xy plane.

Total error = Σi |ypred(i) − y(i)|

where ypred(i) is the predicted data point and y(i) is the actual data point.

Average error = (1/n) Σi |ypred(i) − y(i)|

But, as we know, the absolute error function is not differentiable everywhere on -∞ < x < ∞,

so the loss function will be

J(Ɵ) = (1/2n) Σi (ypred(i) − y(i))²

where ypred(i) = hƟ(x(i)) = Ɵ1x(i) + Ɵ0.
Now it is required to minimize our loss function J(Ɵ). A gradient descent approach will be used to
minimize the loss function.
Linear regression using the least square method
A mathematical function is used to find the sum of squares (the squares of the distances between the
points and the regression line) over all the data points. The least square method is a statistical method,
given by Carl Friedrich Gauss, used to determine the best-fit line (the regression line) by minimizing the
sum of squares. The least square method finds the line having the minimum value of the sum of squares,
and this line is the best-fit regression line.
Regression line is y=m*x+c where
y= Estimated or predicted value (Dependent Variable)
x= Value of x for observation (Independent variable)
c= Intercept with the y-axis.
m= Slope of the line
Example 1:

Consider the following set of data points (x,y), find the regression line for the given data points.

X 1 2 3 4 5
Y 3 4 2 4 5
Solution:

x       y       (x − x̄)   (y − ȳ)   (x − x̄)²   (x − x̄)(y − ȳ)
1       3       -2        -0.6      4          1.2
2       4       -1         0.4      1         -0.4
3       2        0        -1.6      0          0
4       4        1         0.4      1          0.4
5       5        2         1.4      4          2.8
x̄ = 3   ȳ = 3.6   0         0        10         4

where m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

m = 4/10 = 0.4
x̄ = mean of x = 3
ȳ = mean of y = 3.6
y = mx + c, so c = ȳ − m·x̄ = 3.6 − 0.4 × 3 = 2.4
y = 0.4x + 2.4
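
A quick numerical check of Example 1 with NumPy; np.polyfit performs the same least-squares fit:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 4, 2, 4, 5], dtype=float)

m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()
print(m, c)                   # 0.4, 2.4 -- matches the hand calculation
print(np.polyfit(x, y, 1))    # degree-1 fit returns [slope, intercept]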

In the above figure, the blue points are the actual points and the yellow points are the predicted points
obtained using the least square method. Some of the blue points lie above the line while others lie below
it, whereas the yellow points lie on the line. Every point not lying on the line is some distance away from
the line. Thus, between an actual blue data point and the corresponding predicted yellow data point there
is some distance, and this distance or difference between the data points represents an error.

The cost function is used to find the distance between the actual data point values (lying off the
regression line) and the predicted values of the data points (lying on the regression line). The cost
function optimizes the regression coefficients or weights; it measures how well a linear regression model
is performing. The difference between the actual value y on the y-axis and the predicted value ŷ is
(y − ŷ), and cost = (y − ŷ)².
If there are n data points, then the cost function will be

cost = (1/2n) Σi (yi − ŷi)²

or

cost = (1/n) Σi |yi − ŷi|
Since the cost function gives the error between the actual value and the predicted value, minimizing the
value of the cost function will improve the prediction. A higher cost function value means degraded
performance.

Mean Squared Error (MSE): The average of the squared distances measured between the actual data
points (lying off the line) and the predicted data points (lying on the line) is called the mean squared
error. It is written as:

MSE = (1/N) Σi (yi − (a1 xi + a0))²

Where N = total number of data points

yi = Actual value
(a1 xi + a0) = Predicted value, with slope a1 and intercept a0

Mean Absolute Error (MAE) is calculated as the sum of all absolute errors divided by the total
number of predictions. While considering a group of data points, their directions are not important. In
other words, it is the mean of the absolute differences between the actual values and the predicted
values, where all individual deviations have equal importance.

MAE = (1/N) Σi |y(i) − ŷ(i)|
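
The two error measures as small Python functions (the function names are our own); the predictions below come from the fitted line y = 0.4x + 2.4 of Example 1:

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of the squared deviations.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error: average of the absolute deviations.
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

y_true = [3, 4, 2, 4, 5]
y_pred = [2.8, 3.2, 3.6, 4.0, 4.4]   # values of y = 0.4x + 2.4 at x = 1..5
print(mse(y_true, y_pred), mae(y_true, y_pred))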
Check your progress 1

1. What is regression? Define linear regression.


2. Describe overfitting and underfitting.
----------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------

3. State True or False T F

(a)To determine relationship between numeric variables Linear Regression is used.


(b) With the help of logistic regression, 0/1value attributes are predicted.
(c) In Linear Regression, Least Square Error is used to find the line fitted best.
(d) If the x-axis represents the independent variable and the y-axis represents the dependent variable,
then the vertical offset diagram shown below is used for the least square line fit.

11.4 POLYNOMIAL REGRESSION ALGORITHM


A linear model can be applied to a data set that is linear in nature; however, if we have a data set that is
nonlinear in nature, then a nonlinear model is to be applied.
As shown in the first figure, all the data points are linear in nature: all points are close to the line, so a
linear regression model can be applied to the data set. In figure 2, the data points are nonlinear in nature,
so a linear model cannot fit all the data points; only 2 or 3 data points can be fitted to the linear model and
all the other points are far away from the line. The loss value for this graph will be very high and the
accuracy will be reduced.
y1 = a0 + bx is the equation of linear regression with slope b, where a0 is the intercept with the y-axis.

y2 = a0 + b1x1 + b2x2 + b3x3 + ... + bnxn = a0 + Σi (bi xi)

where y2 is a multiple regression equation with n independent variables.

The above two equations y1 and y2 are polynomial equations with degree 1.
Consider a stock price S as a polynomial function of time T:

S = a0 + a1 T + a2 T² + a3 T³ + ε

where S is a polynomial function and ε is an error, and we need to find the values of
a0, a1, a2 and a3 such that the difference between the value obtained from the above equation and from
the model will be minimum.
Now we have data points (T1, S1), (T2, S2), (T3, S3), ..., (TN, SN).
Thus, the equation becomes

yi = a0 + a1 ui + a2 vi + a3 wi + ε

where yi = Si, ui = Ti, vi = Ti², wi = Ti³.

For a polynomial degree m << N, the coefficient vector is

a = (XᵀX)⁻¹ XᵀY

where (XᵀX)⁻¹ Xᵀ is known as the left inverse of the matrix X.


Squared Error (SE) is the error between the predicted values and the actual values used for the
polynomial regression line. It is written as:

SE = Σi (yi − (a0 + a1 xi + a2 xi²))²

Where n = total number of data points

yi = Actual value
a0 + a1 xi + a2 xi² = Predicted value

SE = Σi (yi − a0 − a1 xi − a2 xi²)²   ... (i)

To minimize the error: ∂SE/∂a0 = 0, ∂SE/∂a1 = 0 and ∂SE/∂a2 = 0   ... (ii)

On differentiating equation (i) we get

∂SE/∂a0 = −2 Σi (yi − a0 − a1 xi − a2 xi²)

Since ∂SE/∂a0 = 0,

⇒ −2 Σ yi + 2n a0 + 2a1 Σ xi + 2a2 Σ xi² = 0

⇒ n a0 + a1 Σ xi + a2 Σ xi² = Σ yi   ... (iii)

Finding ∂SE/∂a1 from equation (i),

∂SE/∂a1 = −2 Σi xi (yi − a0 − a1 xi − a2 xi²)

Since ∂SE/∂a1 = 0,

⇒ −2 Σ xi yi + 2a0 Σ xi + 2a1 Σ xi² + 2a2 Σ xi³ = 0

⇒ a0 Σ xi + a1 Σ xi² + a2 Σ xi³ = Σ xi yi   ... (iv)

Finding ∂SE/∂a2 from equation (i),

∂SE/∂a2 = −2 Σi xi² (yi − a0 − a1 xi − a2 xi²)

Since ∂SE/∂a2 = 0,

⇒ −2 Σ xi² yi + 2a0 Σ xi² + 2a1 Σ xi³ + 2a2 Σ xi⁴ = 0

⇒ a0 Σ xi² + a1 Σ xi³ + a2 Σ xi⁴ = Σ xi² yi   ... (v)

Equations (iii), (iv) and (v) form the normal equations for a0, a1 and a2, where the value of i varies from 1 to n.

Example 2. Consider the following set of data points (x, y). Find the 2nd-order polynomial
y = a0 + a1x + a2x², and using polynomial regression determine the value of y when x is 50.
x   40     10     -20    -88     -150    -170
y   5.89   5.99   5.98   5.54    4.3     3.33

Solution. From the given data points (x, y):

x      y      x²      x³         x⁴           xy        x²y
40     5.89   1600    64000      2560000      235.6     9424
10     5.99   100     1000       10000        59.9      599
-20    5.98   400     -8000      160000       -119.6    2392
-88    5.54   7744    -681472    59969536     -487.52   42901.8
-150   4.3    22500   -3375000   506250000    -645      96750
-170   3.33   28900   -4913000   835210000    -566.1    96237

Σx = -378, Σy = 31.03, Σx² = 61244, Σx³ = -8912472, Σx⁴ = 1404159536, Σxy = -1522.7, Σx²y = 248304

Substituting these sums into the normal equations (iii)-(v) gives the matrix system:

[ 6        -378        61244       ] [a0]   [ 31.03   ]
[ -378     61244       -8912472    ] [a1] = [ -1522.7 ]
[ 61244    -8912472    1404159536  ] [a2]   [ 248304  ]

By solving the above matrix equation, the values of a0, a1 and a2 will be

a0 = 6.07647
a1 = -0.00253987
a2 = -0.000104319
y = 6.07647 − 0.00253987 x − 0.000104319 x²
y(50) = 6.07647 − 0.00253987 × 50 − 0.000104319 × 2500
      = 5.68
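
A numerical check of Example 2 with NumPy; np.polyfit solves the same normal equations and returns the coefficients from the highest degree down:

import numpy as np

x = np.array([40, 10, -20, -88, -150, -170], dtype=float)
y = np.array([5.89, 5.99, 5.98, 5.54, 4.3, 3.33])

a2, a1, a0 = np.polyfit(x, y, 2)    # highest-degree coefficient first
print(a0, a1, a2)                   # approximately 6.0765, -0.00254, -0.000104
print(a0 + a1 * 50 + a2 * 50**2)    # approximately 5.68, as in the worked solution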
Check your progress 2

1. Define polynomial regression.
------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------
2. Write down the general equation for the polynomial curve fitting.
------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------
3. State True or False

(a) A quadratic regression equation can be represented by ŷ = b0 + b1x1 + b2x2², where

x1 and x2 are independent variables and y is one dependent variable.

(b) Height of regression line is used to determine the intercept in multiple regression.

(c) Multiple regression is used when dependent variable does not depend on more than

one independent variable.

11.5 SUPPORT VECTOR REGRESSION


The support vector machine is used for solving both classification and
regression problems. Consider a classification problem having
two different categories, as shown in the figure. It is easy to separate
these two categories by using a line between the two: there is a
hyperplane between these two categories which will separate
them from each other. This hyperplane is used to divide the
points into the different categories lying on opposite sides of the line. Besides the hyperplane, there are
two marginal lines parallel to the hyperplane at some distance from it. These two marginal lines are
at a certain distance from the hyperplane so that all the points can be easily categorised.
Thus, we can say that there are three hyperplanes, i.e., the two lines at a marginal distance are also
hyperplanes. These two marginal hyperplanes must pass through at least one of the closest data points;
these data points are called support vectors. More than one support vector can pass through a marginal
hyperplane. The support vectors determine the marginal distances of these two lines from the hyperplane
(i.e., d1 and d2). There can be more than one candidate marginal hyperplane for a given data set; we have
to choose the marginal hyperplanes so that the distance d1 + d2 will be the maximum, max(d1 + d2).
Considering the data points of the given graph of Fig a, it is not possible to divide the points into two
categories by using a linear hyperplane. Thus, we need to convert this graph into a three-dimensional
graph. The SVM kernel converts the two-dimensional data points into 3- or 4-dimensional data points,
and as shown in Fig b, the hyperplane divides the three-dimensional data points into two separate categories.

Consider the following graph


As shown in the figure, the equation of the hyperplane is y = wTx + b, where b is a constant having value
zero (since the line passes through the coordinate (0,0)) and the slope of the line m is -1.
y = wTx (since b = 0)
Any point that lies below the hyperplane (wTx) contributes a positive
value and so is an example of the blue data points, while any data
point that lies above the hyperplane (wTx) contributes a negative
value and so is an example of the red data points.
For a given margin value M we can say that any point x where
wTx ≥ M lies among the blue points, and any point x where wTx ≤ −M lies
among the red points. Now consider a point x+ that lies on the positive margin of
the hyperplane; then wTx+ = M. Here x+ is a support vector. Travelling
in the opposite direction, perpendicular to the positive margin, we will
reach the point closest to the negative margin of the hyperplane, called x−.
If x1 and x2 are negative and positive support vectors lying on the marginal lines wTx + b = −1 and
wTx + b = +1 respectively, the distance between the marginal lines can be determined from

wT(x2 − x1) = 2

⇒ (wT/||w||)(x2 − x1) = 2/||w||

We have to maximize 2/||w|| subject to wTxi + b ≥ +1 for the positive class and wTxi + b ≤ −1 for the
negative class, for all i = 1, 2, ..., n. In generalized form, yi (wTxi + b) ≥ +1, and we need to
minimize ||w||/2 (the reciprocal of 2/||w||).

If all the data points are classified exactly by the marginal lines, then the machine will overfit, and this
does not happen in a real scenario: it is not always possible for every data point to lie on the right
side of the classification. As shown in the figure, one of the red data points lies below the positive margin
and one of the blue data points lies above the negative margin of the hyperplane. These two data
points lie in the opposite plane area. If ξi is the distance of such a data point from its respective marginal
line, we need to account for the error ξi for such points:

yi − (wTxi + b) ≤ ε + ξi, for each ξi ≥ 0
(wTxi + b) − yi ≤ ε + ξi

Error computed = C Σ ξi, where C is a penalty weight on the errors and ξi is the error value.

Thus it is required to find (w*, b*) by minimizing ||w||/2 + C Σ ξi

where * represents the optimal value.

Example 3. For the given points of two classes, red and blue:
Blue: {(1,1), (2,1), (1,-1), (2,-1)}
Red: {(4,0), (5,1), (5,-1), (6,0)}
Plot a graph for the red and blue categories. Find the support vectors and the optimal separating
line.
Solution.
The first support vector SV1, with x-coordinate 2 and y-coordinate 1, is represented by
SV1 = (2, 1)T
Similarly, the support vector SV2 with x-coordinate 2 and y-coordinate -1, and SV3 with x-coordinate 4
and y-coordinate 0, are represented by
SV2 = (2, -1)T and SV3 = (4, 0)T
Adding 1 as an input bias to the support vectors SV1, SV2 and SV3:
SV1 = (2, 1, 1)T, SV2 = (2, -1, 1)T, and SV3 = (4, 0, 1)T
To determine the values of α1, α2 and α3 from the given linear equations, we assume that the support
vectors SV1, SV2 belong to the negative class and the support vector SV3 belongs to the positive class:
α1 SV1·SV1 + α2 SV2·SV1 + α3 SV3·SV1 = -1 (-ve class)
α1 SV1·SV2 + α2 SV2·SV2 + α3 SV3·SV2 = -1 (-ve class)
α1 SV1·SV3 + α2 SV2·SV3 + α3 SV3·SV3 = +1 (+ve class)
Substituting the vectors and taking dot products (e.g., SV1·SV1 = 4 + 1 + 1 = 6), we get
6α1 + 4α2 + 9α3 = -1
4α1 + 6α2 + 9α3 = -1
9α1 + 9α2 + 17α3 = 1
On solving these three equations, we get
α1 = α2 = -3.25 and α3 = 3.5
The hyperplane that discriminates the positive class from the negative class is given by

w = Σi αi SVi

w = (-3.25) × (2, 1, 1)T + (-3.25) × (2, -1, 1)T + (3.5) × (4, 0, 1)T = (1, 0, -3)T

The hyperplane equation is y = wx + b, where w = (1, 0) and b = -3. The separating line is therefore
x1 - 3 = 0, i.e., x1 = 3, a line parallel to the y-axis which separates the red and blue categories.
Applications of Support Vector Regression
Used to solve supervised regression problems.
Can be used on both linear and nonlinear types of data.
Prediction of forest fires during weather changes.
Prediction of electric power demand.
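
A minimal sketch of support vector regression with scikit-learn (library choice assumed; the toy data is made up). The parameter epsilon sets the width of the tube around the fit, and C penalises points lying outside it:

import numpy as np
from sklearn.svm import SVR

# Toy one-dimensional data (assumed): y roughly follows sin(x) with noise.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(model.predict([[2.5]]))   # prediction near sin(2.5), about 0.6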

Check Your Progress 3


1. Define hyperplane.
---------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
2. Explain about support vector.
---------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
3. With the given set of points in two classes:
Class A: (0, 1), (0, -1), (1, 0), (-1, 0)
Class B: (3, 1), (6, 1), (3, -1), (6, -1)
Plot these two classes and find the line separating them. Determine the margin and the support
vectors of the two classes.
11.6 SUMMARY
In this unit, we discussed the concepts of regression: linear regression and nonlinear regression.
We discussed how to find the relationship between the response variable and predictor variables. Various
terminologies used in regression were discussed with examples. The concepts of dependent and
independent variables, and how to find the relationships between them, were defined in this unit.
Different types of regression were also discussed.

This unit also focused on polynomial regression and how to plot a polynomial curve is also discussed.
Concepts of overfitting and underfitting are also discussed in this unit.

In this unit support vector regression algorithm is discussed. Concept of hyperplane, marginal hyperplane
and marginal distance are discussed with an example.

11.7 SOLUTIONS / ANSWERS


Check Your Progress 1
1. Regression is a supervised machine learning model which describes the relationships between
response variable and predictor variables.So, regression model is used when it is required to
determine the value of one variable using another variable.
Mathematically, simple linear regression can be defined as y = bx + c + ϵ, where b is the slope of the
regression line, x is the variable which can change the value of y but cannot be affected by another
variable, and y is a variable which varies with a change in the value of x. And ϵ is known as the
error value, measured between the actual value and the predicted value.
2. Overfitting results when a model is unable to generalize well to new data: it gives high performance on
the training data but poor performance on the test data. Underfitting results in poor performance even on
the training data set.
3. a. T
b. T
c. T
d. T
Check Your Progress 2
1. Polynomial regression is a type of regression algorithm in which specifies the relationships
between independent and dependent variable. But here the independent variables are of n th
degree polynomial.
2. General equation for the polynomial curve fitting
𝑦 = 𝑚 𝑥 + 𝑚 𝑥 + 𝑚 𝑥 +. . . … … . . + 𝑚 𝑥

𝑦= 𝑚 𝑥 +𝐶

3. a. T
b. T
c. F

Check Your Progress 3


1. The hyperplane is the line which categorises the data points into two categories.
2. The data points lying on the two marginal hyperplanes are called support vectors.
3. The first support vector SV1, with x-coordinate 1 and y-coordinate 0, is represented by
SV1 = (1, 0)T
Similarly, the support vectors SV2 and SV3 are represented by
SV2 = (3, 1)T and SV3 = (3, -1)T
Adding 1 as an input bias to the support vectors SV1, SV2 and SV3:
SV1 = (1, 0, 1)T, SV2 = (3, 1, 1)T, and SV3 = (3, -1, 1)T
We need to find the three parameters α1, α2 and α3 from the given linear equations, assuming that
the support vector SV1 belongs to the negative class and the support vectors SV2, SV3 belong to the
positive class:
α1 SV1·SV1 + α2 SV2·SV1 + α3 SV3·SV1 = -1 (Negative class)
α1 SV1·SV2 + α2 SV2·SV2 + α3 SV3·SV2 = +1 (Positive class)
α1 SV1·SV3 + α2 SV2·SV3 + α3 SV3·SV3 = +1 (Positive class)
Substituting the vectors and taking dot products (e.g., SV1·SV1 = 1 + 0 + 1 = 2), we get
2α1 + 4α2 + 4α3 = -1
4α1 + 11α2 + 9α3 = 1
4α1 + 9α2 + 11α3 = 1
On solving these three equations, we get
α1 = -3.5 and α2 = α3 = 0.75
The hyperplane that discriminates the positive class from the negative class is given by

w = Σi αi SVi

w = (-3.5) × (1, 0, 1)T + (0.75) × (3, 1, 1)T + (0.75) × (3, -1, 1)T = (1, 0, -2)T

The hyperplane equation is y = wx + b, where w = (1, 0) and b = -2. The separating line is therefore
x1 - 2 = 0, i.e., x1 = 2, a line parallel to the y-axis which separates both classes.

11.8 FURTHER READINGS


1. Machine Learning: An Algorithmic Perspective, Stephen Marsland, 2nd Edition, CRC Press, 2015.
2. Machine Learning, Tom Mitchell, 1st Edition, McGraw- Hill, 1997.
3. Machine Learning: The Art and Science of Algorithms that Make Sense of Data, Peter Flach, 1st
Edition, Cambridge University Press, 2012.
UNIT 12 NEURAL NETWORKS AND DEEP LEARNING
Structure
12.1 Introduction
12.2 Objectives
12.3 Overview of Neural Network
12.4 Multilayer Feedforward Neural networks with Sigmoid activation functions
12.4.1 Neural Networks with Hidden Layers
12.5 Sigmoid Neurons: An Introduction
12.6 Back propagation Algorithm:
12.6.1 How Backpropagation Works?
12.7 Feed forward networks for Classification and Regression
12.8 Deep Learning
12.8.1 How Deep Learning Works
12.8.2 Deep Learning vs. Machine Learning
12.8.3 A Deep Learning Example
12.9 Summary
12.10 Solutions/ Answers
12.11 Further Reading

12.1 INTRODUCTION
Jain et al. (1996) mentioned in their work that a neuron is a unique biological cell that has the
capability of information processing. Figure 1 describes a biological neuron's structure,
consisting of a cell body and tree-like branches called axons and dendrites. The working of
neurons is based on receiving signals from other neurons through their dendrites, processing
the signals through the cell body, and finally passing the signals to other neurons via the axon. The
synapse is responsible for connecting two neurons, attaching to the axon of the first neuron and a
dendrite of the second neuron. A synapse can either enhance or reduce the signal value. If the
combined signal exceeds a particular value, called a threshold, then the neuron fires; otherwise it
does not fire.

Biological Neuron Artificial Neuron


Cell Nucleus (Soma) Node
Dendrites Input
Synapse Weights or interconnections
Axon Output
Table1: Biological Neuron and Artificial Neuron
Figure 1: Biological Neuron and Artificial Neuron

12.2 OBJECTIVES

After completing this unit, you will be able to:


 Understand the concept of Neural Networks
 Understand Feed forward Neural networks
 Understand Back propagation Algorithm
 Understand the concept of Deep Learning

12.3 OVERVIEW OF ARTIFICIAL NEURAL NETWORKS

An artificial neural network (ANN) is a computing system that simulates how the human
brain analyzes and processes information. It belongs to the field of artificial intelligence (AI) and
solves problems that may be difficult or impossible for humans to solve. In addition,
ANNs have the potential for self-learning, providing better results as more data becomes
available.

Artificial neurons consist of the following things:

 Interconnecting model of neurons. The neuron is the elementary component


connected with other neurons to form the network.
 Learning algorithm to train the network. Various learning algorithms are
available in the literature to train the model. Each layer consists of neurons, and
these neurons are connected to other layers. Weight is also assigned to each
layer, and these weights are changed at each iteration for training purpose.

An artificial neural network (ANN) consists of an interconnected group of artificial neurons that
process information through input, hidden and output layers and use a connectionist approach to
computation. Neural networks use nonlinear statistical data modeling tools to solve complex
problems by finding the complex relationships between inputs and outputs. After finding this
relation, we can predict the outcome or classify our problems.

ANN is similar to the biological neural networks as both perform the functions collectively and
in parallel. Artificial Neural Network (ANN) is a general term used in various applications, such
as weather predictions, pattern recognitions, recommendation systems, and regression problems.

Figure 2 describes three neurons that perform the "AND" logical operation. In this case, the output
neuron will fire only if both input neurons fire. The output neuron uses a threshold value (T),
T = 3/2 in this case. If none or only one input neuron fires, then the total input to the output
neuron is less than 1.5 and the output cannot fire. Take the other scenario where both
input neurons fire: the total input becomes 1 + 1 = 2, which is greater than the threshold
value of 1.5, so the output neuron will fire. Similarly, we can perform the "OR" logical operation
with the same architecture by setting the new threshold to 0.5. In this case, the output
neuron will fire if at least one input fires.
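
A minimal Python sketch of the threshold neuron just described, with a weight of 1 on each connection and the thresholds from the text (1.5 for AND, 0.5 for OR):

def threshold_neuron(inputs, threshold):
    # Fires (returns 1) when the weighted sum of the inputs exceeds the threshold.
    total = sum(inputs)                  # each connection weight is 1 here
    return 1 if total > threshold else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              threshold_neuron([x1, x2], 1.5),   # AND: fires only for (1, 1)
              threshold_neuron([x1, x2], 0.5))   # OR: fires if any input fires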

[Figure 2: Three neurons implementing a logical operation; the output neuron has threshold T = 3/2]

Example 1: Below is a diagram of a single artificial neuron (unit):

X1
W1

X2 W2
V Y=Φ(V)
W3
X3
Figure A: Single unit with three inputs.
The node has three inputs x = (x1, x2, x3) that receive only binary signals (either 0 or 1). How many
different input patterns can this node receive? What if the node had four inputs? Or five? Can you
give a formula that computes the number of binary input patterns for a given number of inputs?

Answer - 1: For three inputs the number of combinations of 0 and 1 is 8:

x1 : 0 1 0 1 0 1 0 1
x2 : 0 0 1 1 0 0 1 1
x3 : 0 0 0 0 1 1 1 1

and for four inputs the number of combinations is 16:

x1 : 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
x2 : 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x3 : 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x4 : 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

You may check that for five inputs the number of combinations will be 32. Note that 8 = 2^3, 16 = 2^4 and
32 = 2^5 (for three, four and five inputs).

Thus, the formula for the number of binary input patterns is 2^n, where n is the number of inputs.
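The 2^n formula is easy to verify with a short Python sketch (illustrative only), which enumerates the binary patterns for several values of n:

from itertools import product

for n in (2, 3, 4, 5):
    patterns = list(product((0, 1), repeat=n))  # all binary input patterns
    print(n, "inputs ->", len(patterns), "patterns")  # 4, 8, 16, 32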

☞ Check Your Progress 1


Question -1: Below is a diagram of a single artificial neuron (unit):

X1
W1
V Y=Φ(V)
W2
X2
Figure A-1: Single unit with two inputs.
The node has two inputs x = (x1, x2) that receive only binary signals (either 0 or 1). How many different
input patterns can this node receive?
……………………………………………………………………………………………
……………………………………………………………………………………………

12.4 MULTILAYER FEEDFORWARD NEURAL NETWORKS WITH SIGMOID ACTIVATION FUNCTIONS

A multilayer feed forward neural network consists of the interconnection of various layers: an
input layer, one or more hidden layers, and an output layer. The number of hidden layers is not fixed; it depends
upon the requirements and complexity of the problem. The simplest neural network, with a
single input layer connected directly to an output layer, is known as a perceptron. A perceptron accepts inputs,
moderates them with certain weight values, then applies a transformation function to produce the
final result. Every connection has a certain weight, and
through these connections one layer is connected to the next layer.

The working of the model is as follows. Each input is multiplied by its weight, and
the weighted sum is calculated. This sum is then passed through an activation function, and the result is
the output of that neuron. This output becomes an input to the next layer. We have
various activation functions, such as sigmoid, tanh, and ReLU. After producing the output, the
predicted output is compared with the actual output.
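The computation just described, a weighted sum followed by an activation function, can be sketched in Python as follows. The helper names and example numbers are our own illustrative assumptions, not values from the unit:

import math

def weighted_sum(xs, ws, bias=0.0):
    return sum(w * x for w, x in zip(ws, xs)) + bias

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def tanh(v):
    return math.tanh(v)

def relu(v):
    return max(0.0, v)

v = weighted_sum([0.5, 0.8], [0.4, -0.2])
print(sigmoid(v), tanh(v), relu(v))  # the unit's output under each activation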

12.4.1 Neural Networks with Hidden Layers


Figure 3 describes the hidden layers of a neural network by adding more neurons in between the
input and output layers. There may be a single hidden layer or multiple hidden layers.

Figure 3: Neural network with a hidden layer (input layer, hidden layer, output layer)


Data/input is labeled in the input layer as x with subscripts 1, 2, 3, …, m, while
neurons in the hidden layer are labeled as h with subscripts 1, 2, 3, …, n. Here n and m may be
different, as the number of hidden neurons and the number of input neurons need not be the same. Also, since
there may be several hidden layers, the first hidden layer carries superscript 1, while the second
hidden layer carries superscript 2, and so on. The output is labeled as y with a hat, i.e., ŷ.
The input data/features with m dimensions are represented as (x1, x2, …, xm). A
feature is simply an independent variable that significantly influences a specific
outcome (the dependent variable).

Now, we multiply the m features (x1, x2, …, xm) with the weights (w1, w2, …, wm), and then the
sum of these products is computed. This is the dot product:

w·x = w1x1 + w2x2 + … + wmxm = ∑i wi xi

There are the following important observations:

1. For m features, we need exactly m weights to compute one dot product.
2. The same computation is performed at the hidden layer. With n hidden
neurons at the hidden layer, you need n weight vectors (w1, w2, …, wn) to
compute the n dot products.
3. With one hidden layer, the hidden output is h: (h1, h2, …, hn).
4. Now imagine working with a single-layer perceptron on top: you may
consider the hidden output h: (h1, h2, …, hn) as input data and perform a dot product with
weights (w1, w2, …, wn) to get the final output ŷ.

Referring to the above steps, you can understand how a multi-layer model works.
When you train such networks on more extensive datasets with many
input features, this process consumes a lot of computing resources. Therefore, deep learning
was not popular in the early days, when only limited computing resources were available. Once
better-configured hardware became available, deep learning drew the attention of researchers. The
procedure for forwarding the input features to the hidden layer and from the hidden layer to the output
layer is shown below in Figure 4.

Figure 4: Multilayer neural network with one hidden layer: input data with m features flows through the hidden layer to the output layer, with a weight vector W = (w1, w2, …) on each layer's connections

With this, you can understand exactly how multiple layers work together.
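A minimal NumPy sketch of this forward pass, with illustrative sizes (m = 4 input features, n = 3 hidden neurons) and random weights of our own choosing, might look like this:

import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                      # m input features, n hidden neurons
x = rng.random(m)                # input vector (x1, ..., xm)
W1 = rng.random((n, m))          # one weight row per hidden neuron
W2 = rng.random(n)               # weights from hidden layer to output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = sigmoid(W1 @ x)              # hidden-layer outputs (h1, ..., hn)
y_hat = sigmoid(W2 @ h)          # final output
print(h, y_hat)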

Example – 2: Consider the unit shown below.


X1
W1

X2 W2
V Y=Φ(V)
W3
X3
Figure B: Single unit with three inputs.

Suppose that the weights corresponding to the three inputs have the following values:
w1 = 2 ; w2 = -4 ; w3 = 1
and the activation of the unit is given by the step-function:
Φ(V) = 1 for V>=0 and Φ(V) = 0 Otherwise
Calculate what will be the output value y of the unit for each of the following input patterns:
Pattern P1 P2 P3 P4
X1 1 0 1 1
X2 0 1 0 1
X3 0 1 1 1
Answer:
To find the output value y for each pattern we have to:
a) Calculate the weighted sum: v = ∑ wi xi = w1 · x1 + w2 · x2 + w3 · x3
b) Apply the activation function to v, the calculations for each input pattern are:
P1 : v = 2 · 1 − 4 · 0 + 1 · 0 = 2 , (2 > 0) , y = ϕ(2) = 1
P2 : v = 2 · 0 − 4 · 1 + 1 · 1 = −3 , (−3 < 0) , y = ϕ(−3) = 0
P3 : v = 2 · 1 − 4 · 0 + 1 · 1 = 3 , (3 > 0) , y = ϕ(3) = 1
P4 : v = 2 · 1 − 4 · 1 + 1 · 1 = −1 , (−1 < 0) , y = ϕ(−1) = 0
Example - 3: Logical operators (i.e. NOT, AND, OR, XOR, etc) are the building blocks of any
computational device. Logical functions return only two possible values, true or false, based on the truth
or false values of their arguments. For example, operator AND returns true only when all its arguments
are true, otherwise (if any of the arguments is false) it returns false. If we denote truth by 1 and false by 0,
then logical function AND can be represented by the following table:

x1 : 0 0 1 1
x2 : 0 1 0 1
x1 AND x2 : 0 0 0 1

This function can be implemented by a single-unit with two inputs:

X1 W1

V Y=Φ(V)
W2
X2
Figure C: Single unit with two inputs.

if the weights are w1 = 1 and w2 = 1 and the activation of the unit is given by the step-function:
Φ(V) = 1 for V>=2 and Φ(V) = 0 Otherwise

Note that the threshold level is 2 (v ≥ 2).

a) Test how the neural AND function works.

Answer (a):
P1 : v = 1 · 0 + 1 · 0 = 0 , (0 < 2) , y = ϕ(0) = 0
P2 : v = 1 · 1 + 1 · 0 = 1 , (1 < 2) , y = ϕ(1) = 0
P3 : v = 1 · 0 + 1 · 1 = 1 , (1 < 2) , y = ϕ(1) = 0
P4 : v = 1 · 1 + 1 · 1 = 2 , (2 = 2) , y = ϕ(2) = 1

b) Suggest how to change either the weights or the threshold level of this single unit to implement
the logical OR function (true when at least one of the arguments is true):

x1 : 0 0 1 1
x2 : 0 1 0 1
x1 OR x2 : 0 1 1 1

Answer(b): One solution is to increase the weights of the unit: w1 = 2 and w2 = 2:


P1 : v = 2 · 0 + 2 · 0 = 0 , (0 < 2) , y = ϕ(0) = 0
P2 : v = 2 · 1 + 2 · 0 = 2 , (2 = 2) , y = ϕ(2) = 1
P3 : v = 2 · 0 + 2 · 1 = 2 , (2 = 2) , y = ϕ(2) = 1
P4 : v = 2 · 1 + 2 · 1 = 4 , (4 > 2) , y = ϕ(4) = 1
Alternatively, we could reduce the threshold to 1:
Φ(V) = 1 for V>=1 and Φ(V) = 0 Otherwise

c) The XOR function (exclusive or) returns true only when one of the arguments is true and
another is false. Otherwise, it returns always false. This can be represented by the following table:
x1 : 0 0 1 1
x2 : 0 1 0 1
x1 XOR x2 : 0 1 1 0

Do you think it is possible to implement this function using a single unit? A network of several
units?

Answer(c): This is a difficult question, and it puzzled scientists for some time, because it is
impossible to implement the XOR function either by a single unit or by a single-layer feed-
forward network (single-layer perceptron). This was known as the XOR problem. The solution
was found using a feed-forward network with a hidden layer. The XOR network uses two hidden
nodes and one output node (a sketch of one such network follows).
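One possible hidden-layer solution: the two hidden units compute OR and NAND of the inputs, and the output unit computes their AND. These particular weights and thresholds are one valid choice among many, not values given in the unit:

def step_unit(inputs, weights, threshold):
    v = sum(w * x for w, x in zip(weights, inputs))
    return 1 if v >= threshold else 0

def xor_net(x1, x2):
    h1 = step_unit((x1, x2), (1, 1), threshold=1)     # hidden unit 1: OR
    h2 = step_unit((x1, x2), (-1, -1), threshold=-1)  # hidden unit 2: NAND
    return step_unit((h1, h2), (1, 1), threshold=2)   # output unit: AND

for pattern in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(pattern, "->", xor_net(*pattern))  # prints 0, 1, 1, 0: the XOR table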

☞ Check Your Progress 2


Question-2 : Consider the unit shown below.
X1
W1

X2 W2
V Y=Φ(V)
W3
X3
Figure B: Single unit with three inputs.

Suppose that the weights corresponding to the three inputs have the following values:
w1 = 1 ; w2 = -1 ; w3 = 2
and the activation of the unit is given by the step-function:
Φ(V) = 1 for V>=1 and Φ(V) = 0 Otherwise
Calculate what will be the output value y of the unit for each of the following input patterns:
Pattern P1 P2 P3 P4
X1 1 0 1 1
X2 0 1 0 1
X3 0 1 1 1
……………………………………………………………………………………………
……………………………………………………………………………………………

Question - 3: Logical operators NAND and NOR are the universal building blocks of any computational device. Logical
functions return only two possible values, true or false, based on the truth or false values of their
arguments. For example, operator NAND returns false only when all its arguments are true; otherwise (if
any of the arguments is false) it returns true. If we denote truth by 1 and false by 0, then the logical function
NAND can be represented by the following table:

x1 : 0 0 1 1
x2 : 0 1 0 1
x1 NAND x2 : 1 1 1 0

This function can be implemented by a single unit with two inputs:

X1 W1

V Y=Φ(V)
W2
X2
Figure C1: Single unit with two inputs.

if the weights are w1 = -1 and w2 = -1 and the activation of the unit is given by the step-function:
Φ(V) = 1 for V >= -1 and Φ(V) = 0 Otherwise

Note that the threshold level is -1 (v ≥ -1).

a) Test how the neural NAND function works.


b) Suggest how to change either the weights or the threshold level of this single unit in order to
implement the logical NOR function (true only when both arguments are false):

x1 : 0 0 1 1
x2 : 0 1 0 1
x1 NOR x2 : 1 0 0 0
……………………………………………………………………………………………
……………………………………………………………………………………………

12.5 SIGMOID NEURONS: AN INTRODUCTION

So far, we have paid attention to the neural network model, how it works, and the role of
hidden layers. Now we need to emphasize activation functions and their role in
neural networks. The activation function is a mathematical function that decides whether a neuron
fires; it may be linear or nonlinear. The purpose of an activation function is to add
non-linearity to the neural network. If you use a linear activation function, then the number of
hidden layers does not matter, because the final output remains a linear combination of the input data.
Such linearity cannot help in solving complex problems, like patterns separated by curves,
where a nonlinear activation is required.
Moreover, the binary step function does not have a helpful derivative, as its derivative is 0
everywhere (and undefined at the threshold). Therefore, it doesn't work with backpropagation, a fundamental and valuable
concept in the multilayer perceptron.

Now, as we’ve covered the essential concepts, let’s go over the most popular neural networks
activation functions.

Binary Step Function:Binary step function depends on a threshold value that decides whether a
neuron should be activated or not. The input fed to the activation function is compared to a certain
threshold; if the input is greater than it, then the neuron is activated, else it is deactivated, meaning
that its output is not passed on to the next hidden layer.

Figure 6: Binary step function

Mathematically it can be represented as:


F(x) = 0 for all x<0
F(x) = 1 for all x >=0

Here are some of the limitations of binary step function:


 It cannot provide multi-value outputs—for example, it cannot be used for multi-class
classification problems.
 The gradient of the step function is zero, which causes a hindrance in the backpropagation
process.

The idea of the step function/activation will become clear from this paragraph. Consider a
perceptron whose activation function isn't very "stable".

For example, say some person has bipolar issues. One day (z < 0), s/he is quiet and gives no
response, and the next day (z ≥ 0), s/he changes mood, becomes
very talkative, and speaks non-stop in front of you. There is no smooth transition between the moods, and you
cannot tell when s/he will be quiet or talkative. This abrupt flip is exactly how the nonlinear
step function behaves.

So, a minor change in a weight of the input layer may flip a neuron's output from 0 to 1,
which impacts the working of the hidden layers, and the final outcome may change drastically.
We want a model that improves our existing neural network gradually as we adjust the weights:
a small change in a weight,

w → w + ∆w

should produce only a small change in the output,

y → y + ∆y

However, this is not possible with the step activation function; without a smoother activation function, this task cannot be accomplished by simply changing the weights.

So, we need to say goodbye to the perceptron model with its step activation function.

We find a new activation function that accomplishes this task for our neural network:
the sigmoid function. We change only one thing, the activation function, and it
meets our requirement of avoiding sudden jumps in the output. First we define the pre-activation

Z = ∑i wi xi + bias

and then apply the sigmoid function:

σ(z) = 1 / (1 + e^(−z))

Figure 7: Sigmoidal Function

The function σ(z) is called the sigmoid function. First, the value Z is computed, then the sigmoid
function is applied to Z. If this looks abstract or strange, those
who don't have a strong background in mathematics need not worry. Figure 7 shows its curve and
its derivative. Here are some observations:

1. The output of the sigmoid function is compatible with that of the step function in that
the output remains between 0 and 1. The curve passes through 0.5 at z = 0,
so we can make a straightforward rule: if the sigmoid neuron's output
is greater than or equal to 0.5, we treat its output as 1; otherwise we treat it as 0.
2. The sigmoid function is continuous and differentiable everywhere on the curve, with
derivative σ′(z) = σ(z) (1 − σ(z)).

3. If z is a large negative value, then the output is approximately 0; if z is a
large positive value, the output is approximately 1 (these observations are checked in the sketch below).
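A small illustrative sketch of the sigmoid and its derivative (helper names are our own):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative: sigma(z) * (1 - sigma(z))

for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 4), round(sigmoid_derivative(z), 4))
# sigmoid(-10) ~ 0, sigmoid(0) = 0.5, sigmoid(10) ~ 1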

The sigmoid activation function introduces non-linearity, the essential ingredient, into our
model. This non-linearity means that the output, obtained by taking the dot product of the
inputs x (x1, x2, …, xm) with the weights w (w1, w2, …, wm), adding the bias, and then applying the sigmoid function,
can no longer be represented as a linear function of the inputs. The nonlinear activation function allows us to
learn nonlinear decision boundaries in our data.
We build hidden layers into our model by replacing perceptrons with sigmoid-activation
neurons. Now, the question arises: what is the requirement for hidden layers? Are they useful?
The answer is yes. Hidden layers help us handle complex problems that single-layer networks
cannot solve.

Hidden layers transform the problem, re-representing it so that complex problems, such as pattern
recognition problems, admit simpler solutions. For example, Figure 8 shows a classic
textbook problem, the recognition of handwritten digits, that can help you understand the workings of
hidden layers and how they work.

Figure 8: Sample digits (6, 0, 4, 3, 8, 6, 2) from the MNIST dataset

The digits in Figure 8 are taken from a well-known dataset called MNIST. It has 70,000 examples
of digits written by humans. Every digit is represented by a picture of 28x28 pixels,
i.e., 28*28 = 784 pixels. Every pixel takes a value between 0 and 255 (a grayscale
intensity code), where 0 means the color is white and 255 means black.

Now, can a computer really "see" a digit the way a human sees it? The answer is no.
Therefore, we need proper training for it to recognize these digits. The computer can't understand an
image as a human can; instead, it must analyze how the pixel
values represent the image. For this purpose, we flatten an image into an array of
784 numbers, e.g., [0, 0, 180, …, … 77, 0, 0, 0], and after that, we
feed this array into our model.

Figure 9: Neural network for MNIST: an input layer of 784 neurons (the 28 x 28 pixel values, each between 0 and 255) feeding an output layer of 10 neurons (digits 0 to 9)


We set up a neural network (Figure 9) for the problem mentioned above. It consists of 784 input neurons
for the 28x28 pixel values, a hidden layer of, say, 16 neurons, and
ten output neurons. The ten output neurons return an array whose
values classify the digit from 0 to 9. For example, if the neural network finds that the
handwritten digit is a zero, the output array [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] is
returned: the first output neuron fires for a zero, while the rest of the output-layer neurons
stay at 0. Similarly, if the neural network decides that the handwritten
digit is a 5, the output array is [0, 0, 0, 0, 0, 1, 0, 0, 0, 0], with the sixth entry set to 1 while the rest
of the values are 0's. Now, you can easily work out the sequence for any other digit.
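The output encoding described here is usually called one-hot encoding. A tiny illustrative helper (the name one_hot is our own, not from the unit) makes the idea concrete:

def one_hot(digit, n_classes=10):
    # one output neuron per class; only the neuron for `digit` fires
    vec = [0] * n_classes
    vec[digit] = 1
    return vec

print(one_hot(0))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot(5))  # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]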

☞ Check Your Progress 3


Question-4 Discuss the utility of Sigmoid function in neural networks. Compare Sigmoid
function with the Binary Step function.
---------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------

12.6 BACK PROPAGATION ALGORITHM

The backpropagation algorithm is a supervised learning algorithm for training a neural network
model. The idea was first introduced in the 1960s but was not popular; it was popularized in 1986
by Rumelhart, Hinton, and Williams, who used this concept in a paper
titled "Learning representations by back-propagating errors". It is one of the most fundamental
building blocks of any neural network: if the network has multiple layers, backpropagation
is used to adjust the weights in the backward direction.

When designing a neural network, we initially set the weights and biases to some
random values. The model, through the backpropagation algorithm, then adjusts these values
whenever the difference between the actual output and the predicted output, i.e., the error, is large.

This algorithm trains the neural network model using the chain rule of calculus. In simple
terms, after every forward pass through the network, the backpropagation
algorithm performs a backward pass that adjusts the weight and bias parameters of the
model. It repeatedly adjusts the weights and biases of all the connections between the layers so that
the error, i.e., the difference between the predicted output and the real output, becomes minimal.

In other words, backpropagation is used to minimize the error/cost
function by repeatedly adjusting the network's weights and biases. The amount of adjustment is
calculated from the gradients of the error function with respect to the weight and bias
parameters. In short, we keep changing the weight and bias parameters so that the error becomes very
small. Figure 10 below explains the working of the backpropagation algorithm.
Training: calculate the error; if the error is not yet minimal, update the parameters and repeat; once the error is minimal, the model is ready to make predictions.

Figure 10: Backpropagation Model

The steps of the backpropagation model are given below:

1. First, some random value 'W' is initialized as the weights, and the inputs are propagated forward
accordingly.
2. Then, the error is computed, and we propagate backward while increasing
the value of the weight 'W'.
3. After that, observe whether the error has increased. If it has, then
don't increase the value of 'W' any further.
4. Once again propagate backward and, this time, decrease the value of 'W'.
5. Now, notice the error, and check whether it has been reduced or not.

The weights that minimize the cost/error function are reported. The detailed working is given by:

 Calculate the error – the difference between the output produced by the model and
the actual output.
 Minimize the error – check whether the error is at its minimum or not.
 Tune the parameters – if the error value is substantial, the weights and biases must be
updated. After that, the error is computed again, and this process is repeated until
the error becomes very small.
 Check whether the model is ready for prediction – once the error becomes significantly
small, you can give inputs to your model to get its outputs.

We have now seen the need for the backpropagation model and what training the model means.

Next, let us understand how the weight values are adjusted to reduce the error. We first determine
whether an increment or a decrement in a weight is required. Knowing this, we keep
updating the weight in the direction that reduces the error. At some point we reach the
exact position where the error would increase if the weight were updated any further. We stop at
that moment, and this becomes the final weight value.

Figure 12: Error calculation: the squared error plotted against a weight; the weight is increased or decreased until the global loss minimum is reached

Backpropagation Algorithm:

initialize the network weights with small random values

do

    for every training example ex

        predicted_output = neural_net_output(network, ex)    // forward pass

        actual_output = target_output(ex)

        compute the error (predicted_output − actual_output) at the output units

        compute Δwh for all weights from the hidden layer to the output layer    // backward pass

        compute Δwi for all weights from the input layer to the hidden layer    // backward pass, continued

        update the network weights accordingly    // the input layer is not modified

until all examples are correctly classified or another stopping criterion is met
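To make the loop above concrete, here is a compact NumPy sketch that trains a small sigmoid network with backpropagation on the XOR data. The 2-3-1 architecture, the learning rate of 0.5, and the epoch count are illustrative assumptions of ours, not values prescribed by the unit:

import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # training inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

W1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))   # input -> hidden
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))   # hidden -> output
lr = 0.5                                             # assumed learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: gradients via the chain rule, as in the text
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # update weights and biases in the direction that reduces the error
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))   # should approach [[0], [1], [1], [0]]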
12.6.1 How does Backpropagation work?
Consider the neural network below for a better understanding. The figure showed a small network with the following values:

Inputs: i1 = 0.05, i2 = 0.10; target outputs: o1 = 0.01, o2 = 0.99.
Weights input to hidden: w1 = 0.15, w2 = 0.20, w3 = 0.25, w4 = 0.30.
Weights hidden to output: w5 = 0.40, w6 = 0.45, w7 = 0.50, w8 = 0.55.
Biases: b1 = 0.35, b2 = 0.60.

Figure 13: Neural Network Example

This network contains:

1. Two neurons in the input layer
2. Two neurons in one hidden layer
3. Two neurons in the output layer

The following steps are used in the Backpropagation:

Step 1: We need to use forward propagation

Step 2: After that, we have to follow backward propagation

Step 3: We put all the values to calculate the updated weight

Step 1: We use forward propagation

We start by propagating the inputs forward through the network.


First the hidden-layer outputs are computed:

net_h1 = w1 × i1 + w2 × i2 + b1 × 1 = 0.15 × 0.05 + 0.20 × 0.10 + 0.35 × 1 = 0.3775

Out_h1 = 1 / (1 + e^(−net_h1)) = 1 / (1 + e^(−0.3775)) = 0.5933, and similarly Out_h2 = 0.5969

We repeat the process for the output-layer neurons; the hidden-layer outputs become their inputs:

net_o1 = w5 × out_h1 + w6 × out_h2 + b2 × 1 = 0.4 × 0.5933 + 0.45 × 0.5969 + 0.6 × 1 = 1.1059

Out_o1 = 1 / (1 + e^(−net_o1)) = 1 / (1 + e^(−1.1059)) = 0.7514

Out_o2 = 0.7729

Error for o1: E_o1 = ½ (target_o1 − out_o1)² = ½ (0.01 − 0.7514)² = 0.2748

Error for o2: E_o2 = ½ (0.99 − 0.7729)² = 0.02356

Total Error: E_total = E_o1 + E_o2 = 0.2748 + 0.02356 = 0.29836

Step 2: Follow backward propagation

Now, we use backpropagation to reduce the error by adjusting the weights and biases.

Consider the weight w5. By the chain rule (w5 influences the error only through net_o1 and out_o1):

∂E_total/∂w5 = ∂E_total/∂out_o1 × ∂out_o1/∂net_o1 × ∂net_o1/∂w5

First we find the change of the total error with respect to the output out_o1:

E_total = ½ (target_o1 − out_o1)² + ½ (target_o2 − out_o2)²

∂E_total/∂out_o1 = −(target_o1 − out_o1) = −(0.01 − 0.7514) = 0.74136

Next, we propagate backward to find the change of out_o1 with respect to its total net input:

out_o1 = 1 / (1 + e^(−net_o1))

∂out_o1/∂net_o1 = out_o1 (1 − out_o1) = 0.75136507 × (1 − 0.75136507) = 0.186815602
Now, we find the change of net_o1 with respect to the weight w5:

net_o1 = w5 × out_h1 + w6 × out_h2 + b2 × 1

∂net_o1/∂w5 = 1 × out_h1 + 0 + 0 = 0.593269
Step 3: We put all the values together to calculate the updated weight

Putting all the values together:

∂E_total/∂w5 = ∂E_total/∂out_o1 × ∂out_o1/∂net_o1 × ∂net_o1/∂w5 = 0.74136 × 0.186816 × 0.593270 = 0.082167041
Now, find the updated value of the weight w5, using the learning rate η = 0.5:

w5(new) = w5 − η × ∂E_total/∂w5 = 0.4 − 0.5 × 0.082167041

Updated w5 = 0.35891648

 Similarly, we calculate the updated values of the other weights.
 After that, we propagate forward, compute the output, and recalculate the error.
 If the computed error is tiny, we stop. Otherwise, we propagate backward again and adjust the weights accordingly.
 This process continues until the error becomes significantly small (a sketch reproducing the w5 update follows).
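The worked numbers above for w5 can be reproduced with a short Python sketch. The variable names are ours; the values and the learning rate η = 0.5 are those of the example:

import math

i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6 = 0.40, 0.45
b1, b2 = 0.35, 0.60
target_o1 = 0.01
eta = 0.5                        # learning rate used in the example

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)           # ~0.5933
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)           # ~0.5969
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)   # ~0.7514

# chain rule: dE/dw5 = dE/dout_o1 * dout_o1/dnet_o1 * dnet_o1/dw5
grad_w5 = (out_o1 - target_o1) * out_o1 * (1 - out_o1) * out_h1
w5_new = w5 - eta * grad_w5
print(round(grad_w5, 9), round(w5_new, 8))         # 0.082167041 0.35891648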

☞ Check Your Progress 4


Question 5: Write Back Propagation algorithm, and showcase its execution on a neural network
of your choice (make suitable assumptions if any)
…………………………………………………………………………………………………………………………
………………………………………………………………………………………………………………………..
12.7 FEED FORWARD NETWORKS FOR
CLASSIFICATION AND REGRESSION

Feed forward neural networks are used for various problems, including classification, regression,
and pattern encoding. In the first two cases, the network returns a value z = f(w, x) that should be very
close to the target value y, while in pattern encoding the target is the input itself, so the network
learns f(w, x) ≈ x. To deal with multi-class classification, we can use either of the two architectures shown in Figure 14.

Figure 14: Multi-classification

The left-hand network in the figure is a modular architecture: every class
connects with three distinct hidden neurons. The right-hand network is a
fully connected network, which supports a richer classification process. The left-hand network
is advantageous because it is modular and supports the gradual construction of classifiers. Whenever we
need to add a new class, the fully connected network requires retraining as a whole, while the modular
network only requires training the new module. The same consideration also holds for regression.
However, it is worth mentioning that the output neurons are typically linear in regression tasks,
since there is no need to approximate any code.

As we have mentioned, a neural network (NN) has one or more hidden layers, and every layer consists of
multiple neurons/nodes. Each node has connections from the previous layer and to the next
layer, and every connection is assigned a different weight. Before
feeding the data into the NN, the dataset should be normalized and
preprocessed. Training a neural network means adjusting the weights so that the error is
minimized. After training the NN, we can apply it to new data for classification or regression
purposes.

Example - 4: The following diagram represents a feed-forward neural network with one hidden
layer:
A weight on the connection between nodes i and j is denoted by wij; for example, w13 is the weight on
the connection between nodes 1 and 3. The following table lists all the weights in the network:
w13 = -2, w23 = 3; w14 = 4, w24 = -1; w35 = 1, w45 = -1; w36 = -1, w46 = 1

Each of the nodes 3, 4, 5 and 6 uses the following activation function:


Φ(V) = 1 for V>=0 and Φ(V) = 0 Otherwise

Where, v denotes the weighted sum of a node. Each of the input nodes (1 and 2) can only receive
binary values (either 0 or 1). Calculate the output of the network (y5 and y6) for each of the input
patterns:
Pattern : P1 P2 P3 P4
Node 1 : 0 1 0 1
Node 2 : 0 0 1 1

Answer: In order to find the output of the network it is necessary to calculate weighted sums of
hidden nodes 3 and 4:
v3 = w13x1 + w23x2 , v4 = w14x1 + w24x2

Then find the outputs from hidden nodes using activation function ϕ:
y3 = ϕ(v3) , y4 = ϕ(v4) .

Use the outputs of the hidden nodes y3 and y4 as the input values to the output layer (nodes 5
and 6), and find weighted sums of output nodes 5 and 6:
v5 = w35y3 + w45y4 , v6 = w36y3 + w46y4 .
Finally, find the outputs from nodes 5 and 6 (also using ϕ):
y5 = ϕ(v5) , y6 = ϕ(v6) .
The output pattern will be (y5, y6). Perform this calculation for each input pattern:

P1: Input pattern (0, 0)

v3 = −2 · 0 + 3 · 0 = 0, y3 = ϕ(0) = 1
v4 = 4 · 0 − 1 · 0 = 0, y4 = ϕ(0) = 1
v5 = 1 · 1 − 1 · 1 = 0, y5 = ϕ(0) = 1
v6 = −1 · 1 + 1 · 1 = 0, y6 = ϕ(0) = 1
The output of the network is (1, 1)

P2: Input pattern (1, 0)


v3 = −2 · 1 + 3 · 0 = −2, y3 = ϕ(−2) = 0
v4 = 4 · 1 − 1 · 0 = 4, y4 = ϕ(4) = 1
v5 = 1 · 0 − 1 · 1 = −1, y5 = ϕ(−1) = 0
v6 = −1 · 0 + 1 · 1 = 1, y6 = ϕ(1) = 1
The output of the network is (0, 1).

P3: Input pattern (0, 1)


v3 = −2 · 0 + 3 · 1 = 3, y3 = ϕ(3) = 1
v4 = 4 · 0 − 1 · 1 = −1, y4 = ϕ(−1) = 0
v5 = 1 · 1 − 1 · 0 = 1, y5 = ϕ(1) = 1
v6 = −1 · 1 + 1 · 0 = −1, y6 = ϕ(−1) = 0
The output of the network is (1, 0).

P4: Input pattern (1, 1)


v3 = −2 · 1 + 3 · 1 = 1, y3 = ϕ(1) = 1
v4 = 4 · 1 − 1 · 1 = 3, y4 = ϕ(3) = 1
v5 = 1 · 1 − 1 · 1 = 0, y5 = ϕ(0) = 1
v6 = −1 · 1 + 1 · 1 = 0, y6 = ϕ(0) = 1
The output of the network is (1, 1).
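The whole calculation of Example 4 can be verified with a few lines of Python (an illustrative sketch, not part of the original answer):

def step(v):
    # the activation used in Example 4: 1 for v >= 0, else 0
    return 1 if v >= 0 else 0

w13, w23 = -2, 3
w14, w24 = 4, -1
w35, w45 = 1, -1
w36, w46 = -1, 1

for x1, x2 in ((0, 0), (1, 0), (0, 1), (1, 1)):
    y3 = step(w13 * x1 + w23 * x2)     # hidden node 3
    y4 = step(w14 * x1 + w24 * x2)     # hidden node 4
    y5 = step(w35 * y3 + w45 * y4)     # output node 5
    y6 = step(w36 * y3 + w46 * y4)     # output node 6
    print((x1, x2), "->", (y5, y6))    # (1,1), (0,1), (1,0), (1,1)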
☞ Check Your Progress 5
Question-6 The following diagram represents a feed-forward neural network with one hidden
layer:

A weight on the connection between nodes i and j is denoted by wij; for example, w23 is the weight on
the connection between nodes 2 and 3. The following table lists all the weights in the network:
w13 = -3, w23 = 2; w14 = 3, w24 = -2; w35 = 4, w45 = -3; w36 = -2, w46 = 2

Each of the nodes 3, 4, 5 and 6 uses the following activation function:


Φ(V) = 1 for V>=1 and Φ(V) = 0 Otherwise
Where, v denotes the weighted sum of a node. Each of the input nodes (1 and 2) can only receive
binary values (either 0 or 1). Calculate the output of the network (y5 and y6) for each of the input
patterns:
Pattern : P1 P2 P3 P4
Node 1 : 0 1 0 1
Node 2 : 0 0 1 1
……………………………………………………………………………………………
……………………………………………………………………………………………

12.8 DEEP LEARNING

Deep learning is a subset of artificial intelligence (AI) that mimics the workings
of the human brain in processing data and creating patterns for decision making. Deep learning is
capable of learning, without supervision, from data that is unstructured or unlabeled. It can be
described as the family of AI functions that simulate the workings of the human brain in
processing data to detect objects, recognize speech, translate languages, and make decisions.

12.8.1 How Deep Learning Works


Initially, computing resources were limited, and the concept of deep learning was not
so popular. Once these resources became available, deep learning drew the attention of
researchers. Deep learning can handle all forms of data from all regions of the world. This data is
available in massive amounts, termed big data, and is drawn from various sources, including
social media, search engines, e-commerce platforms, and other multimedia sources. Big data is
accessible through various applications, such as cloud computing.

However, this data is so vast and largely unstructured that it could take humans decades or
centuries to understand it or extract meaningful decisions from it. As mentioned earlier, a deep
learning model works like a multilayer perceptron model. There are various architectures,
such as the convolutional neural network (CNN) and long short-term memory (LSTM). The exact
working of CNN and LSTM is out of scope here, but you can refer to the working of the multilayer
perceptron to understand the working of a deep learning model.

12.8.2 Deep Learning vs. Machine Learning


One of the popular AI techniques available for processing big data is machine learning.

For example, if a digital payments company wants to detect fraud in its
system, it might use machine learning tools. The algorithm, built on
machine learning techniques, will process all transactions on the digital platform, look for
patterns, and flag any anomalies detected in those patterns.
Deep learning, a subset of machine learning, uses a hierarchical structure of
neural networks to carry out the same process. The
neural networks, like the human brain, connect like a web. Programs built on classical machine
learning use the data linearly, while deep learning systems enable a nonlinear approach to processing
the data.

12.8.3 A Deep Learning Example


For example, the fraud detection system mentioned above can also be built with deep learning.
While a machine learning system works with parameters such as the dollar amounts an
account holder sends or receives, the deep learning method can use the same
parameters, but it processes them through a neural network.

As we have mentioned earlier, each layer of the neural network takes its inputs from the
previous layer; for example, the input layer receives parameters like sender information, data from
social media, the customer's credit score, the IP address used, and others, and passes its output to the
next layer for decision making. The final layer, the output layer, decides whether fraud has been
detected or not.

Deep learning, a prevalent technology, is used across all major industries for various tasks,
including decision making. Other examples include commercial apps for image
recognition, recommendation-system apps, and medical research tools for exploring the
possibility of reusing drugs.

☞ Check Your Progress 6


Question-7 Compare between Deep Learning and Machine Learning
……………………………………………………………………………………………
……………………………………………………………………………………………

12.9 SUMMARY

In this unit we learned the fundamental concepts of neural networks and various related concepts
from the area of neural networks and deep learning, including the activation function, the
backpropagation algorithm, feed forward networks, and more. In this unit the concepts are
illustrated with the help of numerical examples, which will help you map the theoretical concepts of neural
networks to their implementation.
12.10 SOLUTIONS/ANSWERS

☞ Check Your Progress 1


Question -1 : Below is a diagram of a single artificial neuron (unit):

X1
W1
V Y=Φ(V)
W2
X2
Figure A-1: Single unit with two inputs.
The node has two inputs x = (x1, x2) that receive only binary signals (either 0 or 1). How many different
input patterns can this node receive?

Solution:Refer to Section 12.3

☞ Check Your Progress 2


Question-2 : Consider the unit shown below.
X1
W1

X2 W2
V Y=Φ(V)
W3
X3
Figure B: Single unit with three inputs.

Suppose that the weights corresponding to the three inputs have the following values:
w1 = 1 ; w2 = -1 ; w3 = 2
and the activation of the unit is given by the step-function:
Φ(V) = 1 for V>=1 and Φ(V) = 0 Otherwise
Calculate what will be the output value y of the unit for each of the following input patterns:
Pattern P1 P2 P3 P4
X1 1 0 1 1
X2 0 1 0 1
X3 0 1 1 1
Solution: Refer to section 12.4
Question - 3: Logical operators NAND and NOR are the universal building blocks of any computational device. Logical
functions return only two possible values, true or false, based on the truth or false values of their
arguments. For example, operator NAND returns false only when all its arguments are true; otherwise (if
any of the arguments is false) it returns true. If we denote truth by 1 and false by 0, then the logical function
NAND can be represented by the following table:
x1 : 0 0 1 1
x2 : 0 1 0 1
x1 NAND x2 : 1 1 1 0

This function can be implemented by a single unit with two inputs:

X1 W1

V Y=Φ(V)
W2
X2
Figure C1: Single unit with two inputs.

if the weights are w1 = -1 and w2 = -1 and the activation of the unit is given by the step-function:
Φ(V) = 1 for V >= -1 and Φ(V) = 0 Otherwise

Note that the threshold level is -1 (v ≥ -1).


a) Test how the neural NAND function works.
b) Suggest how to change either the weights or the threshold level of this single unit in order to
implement the logical NOR function (true only when both arguments are false):
x1 : 0 0 1 1
x2 : 0 1 0 1
x1 NOR x2 : 1 0 0 0
Solution: Refer to section 12.4

☞ Check Your Progress 3


Question-4 Discuss the utility of Sigmoid function in neural networks. Compare Sigmoid
function with the Binary Step function.
Solution: Refer to Section 12.5

☞ Check Your Progress 4


Question 5:Write Back Propagation algorithm, and showcase its execution on a neural network
of your choice (make suitable assumptions if any)
Solution: Refer to Section 12.6

☞ Check Your Progress 5


Question-6 The following diagram represents a feed-forward neural network with one hidden
layer:
A weight on connection between nodes i and j is denoted by wij , such as w23 is the weight on
the connection between nodes 2 and 3. The following table lists all the weights in the network:
w13 = -3, w23 = 2 ; w14=3,w24= - 2 ; w35=4,w45= - 3 ; w36 = -2, w46 = 2

Each of the nodes 3, 4, 5 and 6 uses the following activation function:


Φ(V) = 1 for V>=1 and Φ(V) = 0 Otherwise

Where, v denotes the weighted sum of a node. Each of the input nodes (1 and 2) can only receive
binary values (either 0 or 1). Calculate the output of the network (y5 and y6) for each of the input
patterns:
Pattern : P1 P2 P3 P4
Node 1 : 0 1 0 1
Node 2 : 0 0 1 1
Solution: Refer to Section 12.7

☞ Check Your Progress 6


Question-7 Compare between Deep Learning and Machine Learning
Solution: Refer to Section 12.8

12.11 FURTHER READINGS

1) K. Uma Rao, "Artificial Intelligence and Neural Networks", Pearson Education, January 2011.
2) Tariq Rashid, "Make Your Own Neural Network: A Gentle Journey Through the
Mathematics of Neural Networks, and Making Your Own Using the Python Computer
Language".
3) Stuart J. Russell and Peter Norvig, "Artificial Intelligence: A Modern Approach", Pearson,
January 2015.
4) F. Acar Savaci (Ed.), "Artificial Intelligence and Neural Networks", Springer, 2006 edition
(18 July 2006).
5) Vladimir Golovko and Akira Imada (Eds.), "Neural Networks and Artificial
Intelligence: 8th International Conference, ICNNAI 2014".
6) Toshinori Munakata, "Fundamentals of the New Artificial Intelligence", Springer, 2nd ed. 2008
edition (February 2008).
PREPARATION TEAM
Dr. Sudhansh Sharma (Writer - Unit 13), Assistant Professor, SOCIS, IGNOU
Ms. Divya Kwatra (Writer - Unit 14), Assistant Professor, Department of Computer Science, Hansraj College, University of Delhi
Ms. Sofia Goel (Writer - Unit 15), Research Scholar, SOCIS, IGNOU
Ms. Sanyukta Kesharwani (Writer - Unit 16), Director (Research), Scholastic Seed Inc., Delhi
Prof. Anjana Gosain (Content Editor), USICT, GGSIPU, Delhi
Dr. Rajesh Kumar (Language Editor), SOH, IGNOU, New Delhi

Course Coordinator: Dr. Sudhansh Sharma

UNIT 13 FEATURE SELECTION AND EXTRACTION

13.1 Introduction
13.2 Dimensionality Reduction
13.2.1 Feature Selection
13.2.2 Feature extraction
13.3 Principal Component Analysis
13.4 Linear Discriminant Analysis
13.5 Singular Value Decomposition
13.6 Summary
13.7 Solutions/Answers
13.8 Further Readings

13.1 INTRODUCTION

Data sets are made up of numerous data columns, which are also referred to as data attributes.
These data columns can be interpreted as dimensions of an n-dimensional feature space, and
data rows can be interpreted as points inside that space. One can gain a better understanding of a
dataset by applying geometry in this manner. In practice, several of these attributes often
measure aspects of the same underlying quantity; such overlap can confuse the algorithm's logic
and change how well the model performs.

Input variables are the columns of data that are fed into a model in order to produce a forecast for
a target variable. If your data is given in the form of rows and columns, such as in a
spreadsheet, then "features" is another term used interchangeably with input variables.
A large number of dimensions in the feature space implies that
the volume of that space is enormous; as a result, the points (data rows) in that space may represent a
small and non-representative sample of the space's contents. The performance of
machine learning algorithms can degrade when there is an excessive number of input variables:
an excessive number of input attributes has a significant impact on the
efficiency with which machine learning algorithms function, and this phenomenon, observed when
an algorithm is applied to data with a large number of input attributes, is referred to as the "curse of
dimensionality". As a consequence, one of the most common goals is to cut down the
number of input features. The process of decreasing the number of dimensions that characterise a
feature space is referred to as "dimensionality reduction".

An excessive amount of information can sometimes hinder the usefulness of data mining.
Some of the data columns (attributes) compiled for the purpose of constructing and testing a model do not offer any
information that is significant to the model, and some actually reduce the
reliability and precision of the model.
For instance, suppose you want to build a model that forecasts the incomes of people
employed in their respective fields. Data columns like cellphone number and house
number will not contribute any value to the model, and they can therefore be
omitted: irrelevant attributes introduce noise into the data and affect the accuracy of
the model. Additionally, because of the noise, the size of the model as well as the amount of
time and system resources required for model construction and scoring are increased.

At this point we need to put the concept of dimensionality reduction into
practice. This can be done in one of two ways: by feature selection or by
feature extraction. Both of these approaches are described in greater detail
below. Dimension reduction is one of the preprocessing steps in
data mining, and it can be beneficial in minimising the effects of noise, correlation, and
excessive dimensionality.

Some examples are presented below to show what dimensionality reduction has to do with
machine learning and predictive modelling:
● A simple e-mail classification problem, in which we must decide
whether or not a given email is spam, serves as a practical
illustration of dimensionality reduction. The features can include
whether or not the email has a generic subject line, the content of the email, whether or
not it uses a template, and so on. However, some of these features may overlap with one
another.
● A classification problem that involves both humidity and rainfall can sometimes be reduced
to just one underlying feature, because of the strong correlation that exists
between the two variables. In such circumstances, the number of features
can be cut down.
● A 3-dimensional classification problem can be hard to visualise, whereas a
2-dimensional problem can be mapped to a simple 2-dimensional space and a
1-dimensional problem to a line. The diagram that follows depicts this concept:
a 3-dimensional feature space is split into 2-dimensional feature
spaces, and the number of features can be reduced even further if some of them are found
to be correlated.
Figure: Dimensionality reduction of a 3-D feature space into 2-D feature spaces
In the context of dimensionality reduction, techniques such as Principal Component Analysis,
Linear Discriminant Analysis, and Singular Value Decomposition are frequently used. In this unit we will
discuss all of these concepts related to dimensionality reduction.

13.2 Dimensionality Reduction

Data mining and machine learning methodologies both face processing challenges when
working with large amounts of data (many attributes). In fact, the dimensionality of the
feature space used by the approach, i.e., the number of model attributes, plays the most
important role: processing algorithms become more difficult and time-consuming to implement
as the dimensionality of the processing space increases.

These elements, also known as the model attributes, are the fundamental qualities of the data, and they can
be called either variables or features. When there are more features, it is more difficult to visualise
them all, and the work on the training set becomes more complex as well. This complexity
increases further when a significant number of the features are correlated, which can make the
classification unreliable. In circumstances like these, strategies for
decreasing the number of dimensions can prove highly beneficial. In a nutshell, dimension
reduction is "the process of deriving a set of principal variables from a large number of
random variables". In data mining, dimension
reduction is a preprocessing step that can lessen the negative effects of noise,
correlation, and excessive dimensionality.

Dimension reduction can be accomplished in two ways:

 Feature selection: In this approach, a subset of the complete set of variables is
selected; as a result, the number of attributes that are used to describe the problem is
narrowed down. It is normally done in one of three ways:
o Filter method
o Wrapper method
o Embedded method

 Feature extraction: This approach takes data from a space with many dimensions and transforms it
into another space with fewer dimensions.

13.2.1 Feature selection: It is the process of selecting some attributes from a given collection of
prospective features, and then discarding the rest of the attributes. Feature selection is used for
one of two reasons: either to keep a limited number of
characteristics in order to prevent overfitting, or to avoid features that are redundant or
irrelevant. For data scientists, the ability to pick features is a vital skill. It is essential to the
success of a machine learning algorithm that you have a solid understanding of how to choose
the most relevant features to analyse. Features that are irrelevant, redundant, or noisy can
contaminate an algorithm and have a detrimental impact on learning performance,
accuracy, and computing cost. The importance of feature selection will only increase as
the size and complexity of typical datasets continue to grow at an exponential rate.

Feature Selection Methods: Feature selection methods can be divided into two categories:
supervised methods, which are appropriate for use with labelled data, and unsupervised methods, which are
appropriate for use with unlabeled data. These approaches are commonly grouped into four families:
filter methods, wrapper methods, embedded methods, and hybrid methods:

● Filter methods: Filter methods choose features based on statistical measures rather than on
cross-validated feature-selection performance. Using a chosen metric, irrelevant attributes are
identified and recursive feature selection is performed. Filter methods can be univariate, in
which an ordered ranking list of features is produced to help choose the final subset of features,
or multivariate, in which the relevance of all the features as a whole is evaluated to find
features that are redundant or unimportant.

● Wrapper methods: Wrapper feature selection methods treat the choice of a set of
features as a search problem: candidate feature subsets are prepared, evaluated, and
compared with other subsets. This approach makes it easier to detect
possible interactions between variables. Wrapper methods focus on subsets of features that
improve the quality of the results of the model or clustering algorithm used for the
selection. Popular examples are Boruta feature selection and forward feature selection.

● Embedded methods: Embedded feature selection approaches incorporate feature
selection as an integral component of the learning algorithm, so
classification and feature selection take place simultaneously within the method.
At each iteration of model training, careful consideration is given to extracting the
characteristics that contribute most. Common examples
of embedded approaches are LASSO feature selection, random
forest feature selection, and decision tree feature selection.

Among all these approaches, the most conventional one is forward feature selection.

Forward feature selection: The first step is to evaluate each
individual feature and choose the one that results in the best-performing model. After
that, every combination of the selected feature with each remaining
feature is evaluated, and a second feature is selected, and so on, until the required
number of features is chosen. The operation of
the forward feature selection algorithm is depicted in the figure below.
the forward feature selection algorithm is depicted here in the figure.
Set of all features → generate a subset → learning algorithm → performance evaluation, repeated until the best subset is selected
The procedure for carrying out forward feature selection (a code sketch follows the steps):

1. Train the model with each feature treated as a separate entity, and then evaluate
its performance.
2. Select the variable that results in the highest performance.
3. Continue the process, adding one variable at a time.
4. The variable that produces the greatest improvement is retained.
5. Repeat the entire process until the performance of the model does not
show any meaningful signs of improvement.
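A minimal sketch of that loop, assuming a pandas DataFrame X of numeric features and a target series y. The function name, the logistic-regression model, and cv = 3 are illustrative assumptions; categorical columns such as Gender would first need numeric encoding:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, n_features):
    selected, remaining = [], list(X.columns)
    while remaining and len(selected) < n_features:
        # score each candidate feature added to the current subset
        scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[selected + [f]], y, cv=3).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)   # keep the biggest improvement
        selected.append(best)
        remaining.remove(best)
    return selected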

Here, a fitness-level prediction based on three independent variables is used to show how
forward feature selection works.
ID Calories_burnt Gender Plays_Sport? Fitness Level
1 121 M Yes Fit
2 230 M No Fit
3 342 F No Unfit
4 70 M Yes Fit
5 278 F Yes Unfit
6 146 M Yes Fit
7 168 F No Unfit
8 231 F Yes Fit
9 150 M No Fit
10 190 F No Fit

So, the first step in forward feature selection is to train n models, one for each feature on its
own, and judge how well they work. With three independent variables, we train
three models, one on each of these three features. Say we trained the model using the
Calories_burnt feature and the Fitness Level target variable and obtained an accuracy of 87 percent.
ID Calories_burnt Gender Plays_Sport? Fitness Level
1 121 M Yes Fit
2 230 M No Fit
3 342 F No Unfit
4 70 M Yes Fit
5 278 F Yes Unfit
6 146 M Yes Fit
7 168 F No Unfit
8 231 F Yes Fit
9 150 M No Fit
10 190 F No Fit
Accuracy = 87%
Next, we use the Gender feature to train the model, and we obtain an accuracy of 80% –

ID Calories_burnt Gender Plays_Sport? Fitness Level


1 121 M Yes Fit
2 230 M No Fit
3 342 F No Unfit
4 70 M Yes Fit
5 278 F Yes Unfit
6 146 M Yes Fit
7 168 F No Unfit
8 231 F Yes Fit
9 150 M No Fit
10 190 F No Fit
Accuracy = 80%
And similarly, the Plays_sport variable gives us an accuracy of 85%–

ID Calories_burnt Gender Plays_Sport? Fitness Level


1 121 M Yes Fit
2 230 M No Fit
3 342 F No Unfit
4 70 M Yes Fit
5 278 F Yes Unfit
6 146 M Yes Fit
7 168 F No Unfit
8 231 F Yes Fit
9 150 M No Fit
10 190 F No Fit
Accuracy = 85%
At this point, we select the variable that produced the best result. Looking at
this table, the Calories_burnt variable alone gives an accuracy of 87 percent, Gender gives
80 percent, and Plays_Sport? gives 85 percent. Comparing the three, the winner is,
unsurprisingly, Calories_burnt. We therefore select this variable.
Variable used Accuracy
Calories_burnt 87.00%
Gender 80.00%
Plays_Sport? 85.00%
Next, we repeat the same steps, this time adding one more variable at a time while keeping
the already selected Calories_burnt variable. For example, adding Gender gives an accuracy
of 88 percent –
ID Calories_burnt Gender Plays_Sport? Fitness Level
1 121 M Yes Fit
2 230 M No Fit
3 342 F No Unfit
4 70 M Yes Fit
5 278 F Yes Unfit
6 146 M Yes Fit
7 168 F No Unfit
8 231 F Yes Fit
9 150 M No Fit
10 190 F No Fit
Accuracy = 88%
Combining Plays_Sport? with Calories_burnt gives an accuracy of 91 percent. The
variable that yields the greatest improvement is kept, which makes sense: since combining
Plays_Sport? with Calories_burnt gives the better result, we keep it and use it in our model.
We keep repeating the process until adding further features no longer improves the model's
performance.

ID Calories_burnt Gender Plays_Sport? Fitness Level


1 121 M Yes Fit
2 230 M No Fit
3 342 F No Unfit
4 70 M Yes Fit
5 278 F Yes Unfit
6 146 M Yes Fit
7 168 F No Unfit
8 231 F Yes Fit
9 150 M No Fit
10 190 F No Fit

Accuracy = 91%

13.2.2 Feature extraction:

The process of reducing the number of resources needed to describe a large amount of data is called "feature extraction". One of the main problems with complex data analysis is the large number of variables involved. A large number of variables requires a lot of memory and processing power, and it can also cause a classification algorithm to overfit the training examples and generalise poorly to new samples. Feature extraction is a broad term for the different ways of combining variables to get around these problems while still giving a faithful picture of the data. Many machine learning practitioners believe that properly designed feature extraction is the key to building good models. The features must represent the information in the data in a form that fits the needs of the algorithm that will be used to solve the problem. Some "inherent" features can be taken straight from the raw data, but most of the time we need to use these inherent features to derive "relevant" features that we can use to solve the problem.

In simple terms, "feature extraction" can be described as a technique for defining a set of features, or measurable qualities, that best represent the information. Feature extraction techniques such as PCA, ICA, LDA, LLE, t-SNE and AE (autoencoders) are some of the common examples in machine learning.

Feature extraction fulfils the following requirement: it takes raw data attributes, called features, and turns them into useful information by reformatting, combining, and transforming the primary features into new ones. This process continues until a new set of data is created that the machine learning models can use to achieve their goals.
Methods of Dimensionality Reduction: The following are two well-known and widely used dimension reduction techniques:
● Principal Component Analysis (PCA)
● Fisher Linear Discriminant Analysis (LDA)
The reduction of dimensionality can be linear or non-linear, depending on the method used. The
most common linear method is called Principal Component Analysis, or PCA.

Check Your Progress - 1


Qn1. Define the term feature selection.
Qn2. What is the purpose of feature extraction in machine learning?
Qn3. Expand the following terms: PCA, LDA, GDA
Qn4. Name components of dimensionality reduction.

13.3 Principal Component Analysis

Karl Pearson was the first to propose this technique. It is based on the idea that when data from a higher-dimensional space is mapped into a lower-dimensional space, the lower-dimensional space should retain the maximum variance of the data. In simple terms, principal component analysis (PCA) is a way to extract important variables (in the form of components) from a large set of variables in a data set. It finds the directions in which the data is most spread out. PCA is most useful for data with three or more dimensions.

[Figure: data points in the (f1, f2) feature space, with the principal directions e1 and e2]
When applying the PCA method, the following are the primary steps to be followed:

1. Obtain the dataset you need.
2. Calculate the mean vector (µ).
3. Subtract the mean vector from each of the given data vectors.
4. Compute the covariance matrix.
5. Determine the eigenvectors and eigenvalues of the covariance matrix.
6. Create a feature vector by deciding which components are the major ones, i.e. the principal components.
7. Derive the new data set by projecting the data onto the chosen eigenvectors. As a result, we keep a smaller number of eigenvectors, and some information may be lost in the process; however, the retained eigenvectors should preserve the most significant variances.
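These steps can be expressed in a few lines of numpy. The sketch below is illustrative: it uses the six two-dimensional patterns of Problem-01 (worked out later in this section) and, like that worked solution, divides by n when forming the covariance matrix.

import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)

mu = X.mean(axis=0)                    # step 2: mean vector
Xc = X - mu                            # step 3: subtract the mean
C = (Xc.T @ Xc) / len(X)               # step 4: covariance matrix
vals, vecs = np.linalg.eigh(C)         # step 5: eigenvalues/eigenvectors
order = np.argsort(vals)[::-1]         # step 6: rank components by variance
W = vecs[:, order[:1]]                 # keep the top principal component
Z = Xc @ W                             # step 7: project the data
print(vals[order])                     # approx. [8.22, 0.38]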

Merits of Dimensionality Reduction

● It helps to compress data, which reduces the amount of space needed to store it and the
amount of time it takes to process it.
● If there are any redundant features, it also helps to get rid of them.

Limitations of Dimensionality Reduction

 You might lose some data.
 PCA fails when the mean and covariance are not enough to describe a dataset.
 We do not know in advance how many principal components to keep; in practice, we follow some rules of thumb.

Below is a practice question for Principal Component Analysis (PCA):

Problem-01: 2, 3, 4, 5, 6, 7; 1, 5, 3, 6, 7, 8 are the given data. Using the PCA algorithm, calculate the principal component.

OR

Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7) and (7, 8). Using the PCA algorithm, calculate the principal component.

OR

Calculate the principal component of the following data:

      Class 1 values    Class 2 values
X     2, 3, 4           5, 6, 7
Y     1, 5, 3           6, 7, 8

Answer:

Step-01: Get the data.

The given feature vectors x1, x2, x3, x4, x5, x6 are the columns of

[ 2  3  4  5  6  7 ]
[ 1  5  3  6  7  8 ]

Step-02: Find the mean vector (µ).

Mean vector (µ) = ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6) = (4.5, 5)

Thus, Mean vector (µ) = [ 4.5 ]
                        [ 5   ]
Step-03: Subtract the mean vector (µ) from each of the given feature vectors.

● x1 − µ = (2 − 4.5, 1 − 5) = (−2.5, −4), and similarly for the others.

The feature vectors (xi − µ) generated after subtraction are the columns of

[ −2.5  −1.5  −0.5  0.5  1.5  2.5 ]
[ −4     0    −2    1    2    3   ]
Step-04: Find the covariance matrix.

Covariance Matrix M = (1/n) Σ (xi − µ)(xi − µ)ᵀ

m1 = (x1 − µ)(x1 − µ)ᵀ = [ −2.5 ] [−2.5  −4] = [ 6.25  10 ]
                         [ −4   ]              [ 10    16 ]

m2 = [ −1.5 ] [−1.5  0] = [ 2.25  0 ]
     [  0   ]             [ 0     0 ]

m3 = [ −0.5 ] [−0.5  −2] = [ 0.25  1 ]
     [ −2   ]              [ 1     4 ]

m4 = [ 0.5 ] [0.5  1] = [ 0.25  0.5 ]
     [ 1   ]            [ 0.5   1   ]

m5 = [ 1.5 ] [1.5  2] = [ 2.25  3 ]
     [ 2   ]            [ 3     4 ]

m6 = [ 2.5 ] [2.5  3] = [ 6.25  7.5 ]
     [ 3   ]            [ 7.5   9   ]

Covariance Matrix = (1/6) [ 17.5  22 ] = [ 2.92  3.67 ]
                          [ 22    34 ]   [ 3.67  5.67 ]
Step-05: Find the eigenvalues and eigenvectors of the covariance matrix.

| 2.92 − λ   3.67     |
| 3.67       5.67 − λ | = 0

From here,

(2.92 − λ)(5.67 − λ) − (3.67 × 3.67) = 0

16.56 − 2.92λ − 5.67λ + λ² − 13.47 = 0

λ² − 8.59λ + 3.09 = 0

Solving this quadratic equation, we get λ = 8.22, 0.38.

Thus, the two eigenvalues are λ1 = 8.22 and λ2 = 0.38.

Clearly, the second eigenvalue is very small compared to the first eigenvalue, so the second eigenvector can be left out.

The eigenvector corresponding to the greatest eigenvalue is the principal component of the given data set.

So, we find the eigenvector corresponding to the eigenvalue λ1 = 8.22.

We use the following equation to find the eigenvector:

MX = λX

where M = covariance matrix, X = eigenvector, and λ = eigenvalue.

Substituting the values in the above equation, we get

[ 2.92  3.67 ] [ X1 ] = 8.22 [ X1 ]
[ 3.67  5.67 ] [ X2 ]        [ X2 ]

Solving these, we get

2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2

On simplification, we get

5.3X1 = 3.67X2 ……… (1)
3.67X1 = 2.55X2 ……… (2)

From (1) and (2), X1 = 0.69X2.

From (2), the eigenvector is

Eigenvector: [ X1 ] = [ 2.55 ]
             [ X2 ]   [ 3.67 ]

Thus, the principal component for the given problem is

Principal Component: [ X1 ] = [ 2.55 ]
                     [ X2 ]   [ 3.67 ]
Lastly, we project the data points onto the new subspace.

[Figure: projection of the data points onto the principal direction x1 = 0.69 x2]

Problem-02: Use the PCA algorithm to transform the pattern (2, 1) onto the eigenvector obtained in the previous question.

Solution:

The given feature vector is (2, 1), i.e. Feature Vector = [ 2 ]
                                                          [ 1 ]

The feature vector gets transformed to:

= Transpose of Eigenvector × (Feature Vector − Mean Vector)

= [2.55  3.67] × ( [ 2 ] − [ 4.5 ] ) = [2.55  3.67] × [ −2.5 ] = −21.055
                 ( [ 1 ]   [ 5   ] )                  [ −4   ]
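The transformation can be verified in a couple of lines of numpy, assuming the (unnormalised) eigenvector and the mean vector computed in Problem-01:

import numpy as np

e = np.array([2.55, 3.67])    # principal eigenvector from Problem-01
mu = np.array([4.5, 5.0])     # mean vector from Problem-01
x = np.array([2.0, 1.0])      # the pattern to be transformed
print(e @ (x - mu))           # -> -21.055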
Check Your Progress - 2

Qn1. What are the advantages of dimensionality reduction?

Qn2. What are the disadvantages of dimensionality reduction?

13.4 Linear Discriminant Analysis


In most cases, the application of logistic regression is restricted to problems involving two classes. Linear Discriminant Analysis, on the other hand, is the linear classification method recommended when there are more than two classes.

Logistic regression is a linear classification algorithm known for being both straightforward and robust. However, it has a few restrictions or shortcomings that highlight the need for alternative linear classification algorithms. The following is a list of some of the problems:

 Binary class problems. Logistic regression is used for problems that involve binary classification, i.e. two classes. It can be extended to handle multi-class classification, but in practice this is not very common.
 Instability with well-separated classes. Logistic regression may become unstable when the classes are extremely distinct from one another.
 Instability with few examples. When there are not enough examples from which to estimate the parameters, the logistic regression model may become unstable.

In view of the limitations of logistic regression discussed above, linear discriminant analysis is one of the prospective linear methods for multi-class classification; the primary reason for its success is that it addresses each of the aforementioned flaws of logistic regression. For problems involving binary classification, both logistic regression and linear discriminant analysis are effective linear statistical methods.

Understanding LDA Models: In order to simplify the analysis of your data and make it more accessible, LDA makes the following assumptions about it:

1. The distribution of your data is Gaussian, and when plotted, each variable appears to be a bell
curve.

2. Each feature has the same variance, which indicates that the values of each feature vary by the
same amount on average in relation to the mean.

On the basis of these presumptions, the LDA model generates estimates for both the mean and
the variance of each class. In the case where there is only one input variable, which is known as
the univariate scenario, it is straightforward to think about this.

The mean value, or mu, of each input x for each class k is computed by dividing the sum of the values by the total number of values, in the following manner:
muk = 1/nk * sum(x)
Where,
muk represents the average value of x for class k and
nk represents the total number of occurrences that belong to class k.
When calculating the variance across all classes, the average squared difference of each
individual result's distance from the mean is employed.
sigma^2 = 1 / (n-K) * sum((x – mu)^2)

Where sigma^2 represents the variance of all inputs (x), n represents the number of instances, K
represents the number of classes, and mu is the mean for input x.

Now we will discuss how to use LDA to make predictions.

LDA generates predictions by estimating the probability that a new input belongs to each class; the prediction is the output class with the highest probability. Bayes' theorem is incorporated into the model in order to calculate the probabilities involved. Utilizing the prior probability of each class as well as the probability of the data belonging to that class, Bayes' theorem may be used to estimate the probability of the output class k given the input x. This is accomplished by using the following formula:

P(Y=k|X=x) = (PIk * fk(x)) / sum(PIl * fl(x))

The base probability of each class (k) that can be found in your training data is denoted by the
symbol PIk (e.g. 0.5 for a 50-50 split in a two class problem). This concept is referred to as the
prior probability within Bayes' Theorem.

PIk = nk/n

The value fk(x) is the estimated likelihood that x is a member of class k; we use a Gaussian distribution function for f(x). By simplifying the previous equation and then introducing the Gaussian, we arrive at the equation presented below. This type of function is referred to as a discriminant function, and the output classification (y) is determined by selecting the class with the greatest value:

Dk(x) = x * (muk/sigma^2) – (muk^2/(2*sigma^2)) + ln(PIk)

where Dk(x) is the discriminant function for class k given input x, and muk, sigma^2, and PIk are all estimated from your data.
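Below is a minimal sketch of these estimates and of the discriminant function Dk(x), assuming two small one-dimensional classes with made-up values; the class means, the pooled variance 1/(n−K) * sum((x − muk)^2) and the priors nk/n follow the equations above.

import numpy as np

x0 = np.array([1.0, 1.5, 2.0, 2.5])       # class 0 samples (illustrative)
x1 = np.array([4.0, 4.5, 5.0, 5.5])       # class 1 samples (illustrative)
n, K = len(x0) + len(x1), 2

mu = np.array([x0.mean(), x1.mean()])     # muk: per-class means
sigma2 = (((x0 - mu[0])**2).sum()
          + ((x1 - mu[1])**2).sum()) / (n - K)   # pooled variance
pi = np.array([len(x0) / n, len(x1) / n])        # priors PIk = nk / n

def discriminant(x):
    # Dk(x) = x*muk/sigma^2 - muk^2/(2*sigma^2) + ln(PIk); predict the argmax.
    return x * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(pi)

print(discriminant(4.2).argmax())         # -> 1 (closer to class 1)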

Now to perform the above task we need to prepare our data first, so the question arises,
How to prepare data suitable for LDA?

This section gives you some ideas to think about when getting your data ready to use with LDA.

Classification Problems: LDA is used to solve classification problems where the output variable is categorical. It works with both two-class and multi-class problems.

Gaussian Distribution: The standard way to use the model assumes that the input variables
have a Gaussian distribution. Think about looking at the univariate distributions of each attribute
and using transformations to make them look more like Gaussian distributions (e.g. log and root
for exponential distributions and Box-Cox for skewed distributions).

Remove Outliers: Think about removing outliers from your data. These things can mess up the
basic statistics like the mean and the standard deviation that LDA uses to divide classes.

Same Variance: LDA assumes that the variance of each input variable is the same. Before using
LDA, you should almost always normalise your data so that it has a mean of 0 and a standard
deviation of 1.

Below is a practice problem based on Linear Discriminant Analysis (LDA):

Problem-02: Compute the linear discriminant projection for the following two-dimensional dataset: X1 = (x1, x2) = {(4,1), (2,4), (2,3), (3,6), (4,4)} and X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}.

[Figure: scatter plot of classes X1 and X2 in the (x1, x2) plane, with the LDA projection direction wLDA]
 The class statistics are:

S1 = [ 0.80   −0.40 ]    S2 = [ 1.84   −0.04 ]
     [ −0.40   2.60 ]         [ −0.04   2.64 ]

µ1 = [3.00  3.60]    µ2 = [8.40  7.60]

 The between-class and within-class scatter matrices are:

SB = [ 29.16  21.60 ]    SW = [ 2.64   −0.44 ]
     [ 21.60  16.00 ]         [ −0.44   5.28 ]

 The LDA projection is then obtained as the solution of the generalized eigenvalue problem:

SW⁻¹SB v = λv  ⇒  |SW⁻¹SB − λI| = 0  ⇒  | 11.89 − λ   8.81     | = 0  ⇒  λ = 15.65
                                        | 5.08        3.76 − λ |

[ 11.89  8.81 ] [ v1 ] = 15.65 [ v1 ]  ⇒  [ v1 ] = [ 0.91 ]
[ 5.08   3.76 ] [ v2 ]         [ v2 ]     [ v2 ]   [ 0.39 ]

 Or directly by

w* = SW⁻¹(µ1 − µ2) = [−0.91  −0.39]ᵀ

Check Your Progress - 3


Q1. Define LDA.
Q2. Write any two limitations of LDA.

13.5 Singular Value Decomposition


The Singular Value Decomposition (SVD) method is a well-known technique for decomposing a matrix into constituent matrices. This method is valuable since it reveals many of the interesting and helpful characteristics of the initial matrix. We can use SVD to discover the optimal lower-rank approximation to a matrix, determine the rank of a matrix, or test a linear system's sensitivity to numerical error.

Singular value decomposition is a method of decomposing a matrix into a product of three matrices:

A = USVᵀ

Where:
● A is an m × n matrix
● U is an m × n matrix with orthonormal columns
● S is an n × n diagonal matrix
● V is an n × n orthogonal matrix

A matrix is orthogonal when its columns are mutually orthogonal unit vectors; this is why the two outer factors share this property, and writing the last factor as a transpose, Vᵀ, is the standard convention. Because S is diagonal, the decomposition can equivalently be written as a single summation, A = Σ si ui viᵀ, where the singular values si are generally arranged from largest to smallest.

Below is a practice problem based on singular value decomposition.

Problem-03: Find the SVD of the matrix A = [ −3   1 ]
                                           [  6  −2 ]
                                           [  6  −2 ]

Solution: First, we work with AᵀA = [ 81   −27 ]. The eigenvalues are λ = 0, 90.
                                    [ −27    9 ]

For λ = 0, the reduced matrix is [ 1  −1/3 ], so v2 = (1/√10) [ 1 ]
                                 [ 0   0   ]                  [ 3 ]

For λ = 90, the reduced matrix is [ 1  3 ], so v1 = (1/√10) [ −3 ]
                                  [ 0  0 ]                  [  1 ]

The nonzero singular value is σ1 = √90 = 3√10. Now, we can find the reduced SVD right away, since

u1 = (1/σ1) A v1 = (1/3) [  1 ]
                         [ −2 ]
                         [ −2 ]

We now need a basis for the null space of

AAᵀ = [ 10   −20  −20 ]   [ 1  −2  −2 ]        [ 2 ]   [ 2 ]
      [ −20   40   40 ] ~ [ 0   0   0 ]   ⇒   [ 1 ] , [ 0 ]
      [ −20   40   40 ]   [ 0   0   0 ]        [ 0 ]   [ 1 ]

Now the full SVD is given by:

A = [ −3   1 ]   [ 1/3    2/√5   2/√5 ] [ 3√10   0 ] [ −3/√10   1/√10 ]
    [  6  −2 ] = [ −2/3   1/√5   0    ] [ 0      0 ] [ 1/√10    3/√10 ]
    [  6  −2 ]   [ −2/3   0      1/√5 ] [ 0      0 ]
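The factorisation can be checked with numpy's built-in SVD routine (note that numpy may choose different sign conventions for the singular vectors than the hand computation above):

import numpy as np

A = np.array([[-3.0, 1.0], [6.0, -2.0], [6.0, -2.0]])
U, s, Vt = np.linalg.svd(A)           # full SVD: A = U @ S @ Vt
print(s)                              # singular values: [3*sqrt(10), 0]
S = np.zeros_like(A)
S[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ S @ Vt))     # True: the factors reproduce A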

Check Your Progress - 4

Qn1. Define SVD

13.6 SUMMARY

In this unit we learned about the concept of dimensionality reduction, wherein we understood the basics of feature selection and feature extraction techniques. Thereafter, an explicit discussion of Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Singular Value Decomposition (SVD) was given.

13.7 SOLUTIONS TO CHECK YOUR PROGRESS


Check Your Progress – 1

Refer Section 13.2 for detailed Solutions


Ans1. In machine learning and statistics, feature selection is the process of choosing a subset of
relevant features to use when making a model. It is also called variable selection, attribute
selection, or variable subset selection.

Ans2. The goal of Feature Extraction is to reduce the number of features in a dataset by making
new features from the ones that are already there (and then discarding the original features).

Refer Section 13.2 for detailed Solutions

Ans3.Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and


Generalized Discriminant Analysis (GDA)

Ans4. Feature selection and feature extraction.

Check Your Progress - 2

Refer Section 13.3 for detailed Solutions

Ans1. Advantages of dimensionality reduction

● It reduces the time and space complexity.


● Getting rid of multicollinearity makes it easier to interpret the machine learning model's parameters.
● When there are only two or three dimensions, it's easier to see the data.

Ans2. Dimensionality Reduction's Drawbacks


 PCA likes to find linear relationships between variables, which isn't always a good thing.
 PCA doesn't work when the mean and covariance aren't enough to describe a dataset.
 In some problems, dimension reduction could also cause data loss.

Check Your Progress - 3

Refer Section 13.4 for detailed Solutions

Ans1. The LDA technique is a multi-class classification method that can also be utilised to perform dimensionality reduction automatically. LDA reduces the number of features from the initial feature set.

LDA projects the data into a new linear feature space, and the classifier will clearly have a high level of accuracy if the data can be linearly separated.

Ans2. Some of the limitations of logistic regression, which motivate the use of LDA, are as follows:

 Logistic regression is typically applied to problems involving binary or two-class classification. Even though it is possible to extrapolate it and apply it to multi-class classification, this is rarely done in practice. On the other hand, Linear Discriminant Analysis is regarded as the superior option whenever multi-class classification is necessary; in the case of binary classification, both logistic regression and LDA are used in the analysis process.

 Instability with clearly delineated classes: logistic regression is known to be unreliable in situations where the classes are clearly differentiated from one another.

Check Your Progress - 4

Refer Section 13.5 for detailed Solutions

Ans1. The singular value decomposition is a factorization technique that can be used in linear algebra for real or complex matrices. It extends the eigendecomposition of a square normal matrix to any m × n matrix by using an orthonormal basis. It is related to the polar decomposition.

13.8 FURTHER READINGS


1. Machine learning an algorithm perspective, Stephen Marshland, 2nd Edition, CRC Press, 2015.
2. Machine Learning, Tom Mitchell, 1st Edition, McGraw- Hill, 1997.
3. Machine Learning: The Art and Science of Algorithms that Make Sense of Data, Peter Flach, 1st
Edition, Cambridge University Press, 2012.
UNIT 14 ASSOCIATION RULES

Structure
14.1 Introduction
14.2 Objectives
14.3 What are Association Rules?
14.3.1 Basic Concepts
14.3.2 Association rules: Binary Representation
14.3.3 Association rules Discovery
14.4 Apriori Algorithm
14.4.1 Frequent Itemsets Generation
14.4.2 Case Study
14.4.3 Generating Association Rules using Frequent Itemset
14.5 FP Tree Growth
14.5.1 FP Tree Construction
14.6 Pincer Search
14.7 Summary
14.8 Solutions to Check Your Progress
14.9 Further Readings

14.1 INTRODUCTION

Imagine a salesperson at a shop. If you buy a product from the shop, the salesperson recommends more items related to the product you have purchased. He may also suggest items that are frequently bought with the product you have purchased. The salesperson may also try to figure out your choices from the products you observe at the shop and recommend accordingly. A common example of this situation is market basket analysis, as illustrated by the two cases in Figure 1. If you add bread and butter to your basket at the store, the salesperson may recommend adding cookies, eggs, or milk to your shopping cart too. Similarly, if a customer puts vegetables like onion and potato in their cart, the salesman may suggest adding other vegetables like tomato to the basket. If the salesman notices a male customer, he may suggest adding beer too, from his historical analysis of male customers preferring to buy beer as well.
Figure 1: Two different cases of Market Basket Analysis

The salesperson thus analyses the purchasing habits of the customers and tries to identify the correlations among the items/products that are frequently bought together by them. The analysis of such correlations helps the retailers to develop marketing strategies to increase the sale of products in the shop. Discovering the correlations among all the items sold in a shop helps businesses in making decisions regarding designing catalogues, organizing the store, and analysing customer shopping behaviour.

14.2 OBJECTIVES

After going through this unit, you should be able to:


 Understand the purpose of association rules.
 Understand the purpose of pattern search and,
 Describe various algorithms for pattern search
14.3 WHAT ARE ASSOCIATION RULES?

In machine learning, association rules are one of the important concepts that is widely applied in problems like market basket analysis. Consider a supermarket, where all the related items such as grocery items, dairy items, cosmetics, stationery items etc. are kept together in the same aisle. This helps the customers to find their required items quickly. It further helps them to remember items to purchase they might have forgotten, or items they may like to purchase if suggested. Association rules thus enable one to correlate various products from a huge set of available items. Analysing the items customers buy together also helps the retailers to identify the items they can offer on discount. For example, consider a retailer selling baby lotion and baby shampoo at MRP, but offering a discount on their combination: a customer who wished to buy only the shampoo or only the lotion may now consider buying the combination. Other factors too can contribute to the purchase of a combination of products. Another strategy can be to keep related products at opposite ends of the shelf to prompt the customer to scan through the entire shelf, hoping that he might add a few more items to his cart.
It is important to note that the association rules do not extract the customer's preference about the items but find the relations among the items that are generally bought together. The rules only identify the frequent associations between the items. A rule works with an antecedent (if) and a consequent (then), both referring to sets of items. For example, if a person buys pizza, then he may buy a cold drink too, because there is a strong relation between pizza and cold drink. Association rules help to find the dependency of one item on another by considering the history of customers' transaction patterns.

14.3.1 Basic Concepts

There are a few terms that one should understand before studying the algorithms.

a. k-Itemset: A set of k items. For example, a 2-itemset can be {pencil, eraser} or {bread, butter}, and a 3-itemset can be {bread, butter, milk}.

b. Support: The frequency with which an item appears across all the considered transactions is called the support of the item. Mathematically, the support of an item x is defined as:

support(x) = (Number of transactions containing x) / (Total number of considered transactions)

c. Confidence: Confidence is defined as the likelihood of obtaining item y along with item x. Mathematically, it is defined as the ratio of the frequency of transactions containing both items x and y to the frequency of transactions containing item x:

confidence(x => y) = (Number of transactions containing x and y) / (Number of transactions containing x)

Confidence can also be defined as the probability of occurrence of y, given the occurrence of x:

confidence(x => y) = P(y/x)

where x is the antecedent and y is the consequent. In terms of support, confidence can be described as:

confidence(x => y) = support(x ∪ y) / support(x)

d. Frequent Itemset: An itemset whose support is at least the minimum support threshold is known as a frequent itemset. For example, if the minimum support threshold is 10, then an itemset with support score 11 is a frequent itemset but an itemset with support score 9 is not.
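As a quick illustration, support and confidence can be computed directly from a list of transactions; the sketch below uses the four market-basket transactions introduced in the next subsection.

transactions = [
    {"milk", "cookies", "bread"},
    {"milk", "bread", "egg", "butter"},
    {"milk", "cookies", "bread", "butter"},
    {"cookies", "bread", "butter"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # confidence(x => y) = support(x U y) / support(x)
    return support(x | y) / support(x)

print(support({"milk"}))                 # 0.75
print(confidence({"milk"}, {"bread"}))   # 1.0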

14.3.2 Association Rules: Binary Representation

Let I = {I1, I2, I3, ..., In} be a set of n items and T = {T1, T2, T3, ..., Tt} be a set of t transactions, where each transaction Ti contains a non-null set of items purchased by a customer such that Ti ⊆ I. Let each item Ii be represented by a binary variable B, whose value 1 or 0 represents the presence or absence of the item in a transaction:

B(i) = 1, if item Ii is present in the transaction
       0, otherwise

For example, consider a set of four transactions T1, T2, T3 and T4 with the following items:

T1 = {milk, cookies, bread},


T2 = {milk, bread, egg, butter},
T3 = {milk, cookies, bread, butter}, and
T4 = {cookies, bread, butter}.

The binary representations for the transaction set are shown in Table 1.

Table1: Binary representation of the transactions

Transaction Milk Cookies Bread Butter Egg


id
T1 1 1 1 0 0
T2 1 0 1 1 1
T3 1 1 1 1 0
T4 0 1 1 1 0
The binary variables can also be used to analyze the purchasing patterns of the customers. One can analyze a basket in terms of the binary values of the items that the customer has purchased. Items that often occur together are said to be frequently bought together, or associated with each other. These binary patterns are represented in the form of association rules. For example, the customers who buy milk also tend to buy bread at the same time. This can be represented using support and confidence as:

Milk => Bread with support 75% (or 0.75) and confidence 100% (or 1).

Support(milk) = 3/4 = 0.75

Confidence(milk => bread) = support(milk ∪ bread) / support(milk) = 0.75 / 0.75 = 1

This means that out of all the purchases made at the store, in 75% of the transactions milk and bread are purchased together. Confidence 100% implies that of all the customers who bought milk, all of them also bought bread. Thus, support and confidence are the two important measures of interest in items. Association rules are considered important if they satisfy the minimum thresholds of support and confidence. These thresholds can be set by the experts.

Let A and B be two itemsets such that A ⊆ I and B ⊆ I, with A ≠ Ø, B ≠ Ø and A ∩ B = Ø. For a given transaction set T, the rule A => B holds with support s and confidence c as defined in section 14.3.1. Support s is defined as the percentage of transactions that contain the items of both set A and set B, i.e. the union A ∪ B. Confidence c is defined as the percentage of transactions containing A that also contain B. When written as percentages, support and confidence range between 0 and 100; when expressed as ratios, their values range between 0 and 1.

Association rules can further be defined as strong association rules if they satisfy the minimum
threshold support called as min_sup and minimum threshold confidence called as min_conf.

14.3.3 Association Rules Discovery

The problem of discovery of association rules can be stated as: given a set of transactions T, find the rules whose support and confidence are greater than or equal to the minimum support and confidence thresholds.
The traditional approach to generating association rules is to compute the support and confidence for every possible combination of items. But this approach is computationally infeasible, as the number of combinations of items can be exponentially large. To avoid such a large number of computations, the basic approach is to ignore the needless combinations without computing their support and confidence scores. For example, we can observe from Table 1 that the combination {milk, egg} can be ignored as the combination is infrequent. Hence, we prune the rule Milk => Egg without computing the support and confidence for these items.
Therefore, the steps for obtaining association rules can be summarized as:

1. Find all frequent item sets: By definition, obtain the frequent itemset as set of items whose
support score is at least the min_sup.
2. Generate strong association rules: By definition, obtain rules for the item sets obtained in step
1, with support score as min_sup and confidence score as min_conf. These rules are called as
strong rules.
Challenge in obtaining association rules:
Low threshold value: If minimum threshold support (min_sup) is set quite low, then many items
can belong to the frequent itemset.
Solution to the problem is to define a closed frequent itemset and maximal frequent itemset.
1. Closed Frequent Itemset: A frequent itemset A is said to be a closed frequent itemset if it
is closed and has support score at least equal to min_sup. An itemset is said to be closed
in a data set T if there does not exist any superset B such that support(B) equals support
(A).
2. Maximal Frequent Itemset: A frequent itemset A is said to be a maximal frequent
itemset in T if A is frequent and there exists no super set B such that A⊂B and B is
frequent in T.
We use Venn diagram to understand the concept as shown in Figure 2.

Figure 2: Venn Diagram to demonstrate the relationship between Closed item sets, frequent item sets,
closed frequent item sets and maximal frequent item sets.
This illustrates that all the maximal frequent item sets are subsets of closed items sets and frequent
item sets.

Check Your Progress 1


Ques 1: Given the following set of transactions for three items X, Y and Z as {{X, Y}, {X,Y,
Z}, {X, Z}, {X, Z}, {Z}, {Z}, {Z}, {Z}}.
a) Give the binary representation of the transaction set.

b) Find the support and confidence for the following rules:


i) X =>Y
ii) X =>Z
iii) Y =>Z
Ques 2: Find the maximal frequent itemsets and closed frequent itemsets from the following set
of transactions:
T1:A,B,C,E
T2:A,C,D,E
T3: B,C,E
T4:A,C,D,E
T5:C,D,E
T6:A,D,E

14.4 APRIORI ALGORITHM

R Aggarwal and R.Srikant proposed Apriori algorithm in the year 1994. The algorithm is used to
obtain the frequent item sets for association rules. The algorithm is names so, as it needs the prior
knowledge of the frequent item sets. This section discusses about the generation of frequent patterns
as observed in the analysis of market basket problem. Section 1.4.1 presents Apriori algorithm, used
to obtain the frequent item sets. Section 1.4.2 talks about generating strong association rules from the
frequent item sets generated. Finally, section 1.4.3 presents variations of Apriori algorithm.

14.4.1 Frequent Item sets Generation

For a given set of n items, there are 2^n − 1 possible non-empty combinations of items. Consider an itemset I = {A, B, C, D} with four items; there are 15 combinations of items, namely {{A}, {B}, {C}, {D}, {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}, {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}, {A, B, C, D}}. This can be represented by a lattice diagram as shown in Figure 3.
Figure 3: Lattice representing 15 combinations of items

The Apriori algorithm searches the items level by level; to find the (k+1)-itemsets, it uses the k-itemsets. To determine the frequent itemsets, the algorithm first finds the candidate itemsets from the lattice representation. But, as explained, for a given set of n items the number of itemsets in the lattice can be as large as 2^n − 1, so one needs to control the search over the exponentially growing space of itemsets and increase the efficiency of the algorithm. For this, two important principles are given below.

Definition 1: Apriori Principle: If an itemset is frequent, then all of its subsets must be frequent.
To understand the principle, let’s consider the itemset {A, B, C} is the frequent itemset,
then its subsets {A, B}, {A, C}, {B, C}, {A}, {B} and {C} are also the frequent item sets
(marked in yellow) as shown in Figure 4. This is because if a transaction contains the itemset {A,
B, C}, then the transaction will also contain all the subsets of this itemset.
On the other hand, if an itemset say {B, D} is not a frequent itemset, then all the
supersets containing {B, D} i.e {A, B, D}, {B, C, D} and {A, B, C, D} are also not the frequent
item sets (marked in blue) as shown in Figure 4. This is because if an itemset X is not a frequent
itemset, then any other itemset containing X will also not be a frequent itemset. Recalling the
definition, an itemset is frequent if and only if its support is greater than or equal to the minimum support threshold; conversely, an itemset whose support is less than the minimum support threshold is infrequent.
This Apriori property helps in pruning the item sets, thus reducing the search space and
also increases the efficiency of the algorithm. This type of pruning technique is known as
support-based pruning.

Figure 4: Frequent (yellow nodes) and Infrequent Item sets (blue nodes)

Definition 2: Antimonotone Property: This property states that if an itemset X does not satisfy a
function, then any proper set containing X will also not satisfy the function. This property is
called as antimonotone property as it is monotonic in terms of not satisfying a function. This
property further helps in reducing the search space.

Given the set of transactions with the minimum support and confidence scores, perform the
following steps to generate the frequent item sets using for Apriori algorithm.
Let C(k) be the set of k candidate item, F(k) be the set of k frequent items.

Step 1: Find F (1) i.e., frequent 1- item sets by computing the support for each item in the set of
transactions.

Step 2: This is a twofold step as described below:

i) Join Operation: For k ≥2, it generates C(k), new candidate k- itemset based on
frequent item sets obtained in the previous step. This can be done in one of the two
ways:

1. Compute F(k-1) * F(1) to extend each F(k-1) itemset by joining it with each F(1)
itemset. This join operation results in C(k) candidate item sets.

For example, let k = 2, and consider three 1-itemsets F(1) = {A}, F(1) = {B} and F(1) = {C}. Two 1-itemsets can be augmented together to obtain 2-itemsets such as F(2) = {A, B}, F(2) = {A, C} and F(2) = {B, C}. Further, to obtain 3-itemsets, we augment 2-itemsets with 1-itemsets, such as joining {A, B} with {C} to obtain {A, B, C}.

However, this method may generate duplicate F(k) itemsets. For example, augmenting F(2) = {A, B} with F(1) = {C} generates C(3) = {A, B, C}; augmenting F(2) = {A, C} with F(1) = {B} also generates C(3) = {A, B, C}. One way to avoid the generation of duplicate itemsets is to sort the items in the itemsets in lexicographic order. For example, the itemsets {A, B}, {B, C}, {A, C} and {A, B, C} are sorted in lexicographic order, but {C, B} and {A, C, B} are not. Thus, the itemsets {A, B} and {C, D} can be augmented, but the itemsets {A, B} and {A, C} cannot be augmented, as that would violate the lexicographic order. A working example of generating candidate F(k) itemsets using this approach is shown in Figure 5.

a) F(1) joins F(1) to give C(2) (b) F(2) joins F(1) to give C(3)

Figure 5: Example of Augmenting F(k−1) and F(1) to generate candidate F(k) itemset

2. F(k−1) * F(k−1): A distinct pair of (k−1)-itemsets X = {A1, A2, A3, ..., Ak−1} and Y = {B1, B2, ..., Bk−1}, each sorted in lexicographic order, are augmented iff:

Ai = Bi for i = 1, 2, ..., (k−2), but Ak−1 ≠ Bk−1.

The itemsets X and Y are augmented only if the items Ak−1 and Bk−1 are in lexicographic order. For example, let the frequent itemsets be X(3) = {A, B, C} and Y(3) = {A, B, E}. X is joined with Y to generate the candidate itemset C(4) = {A, B, C, E}. On the other hand, consider another frequent itemset Z = {A, B, D}. The itemsets X and Z can be joined to generate the candidate set C(4) = {A, B, C, D}, but the itemsets Y and Z cannot be joined, as their last items, E and D, are not arranged in lexicographic order. As a result, the F(k−1) × F(k−1) candidate generation method merges a frequent itemset only with the ones that follow it in the sorted list, thus saving some computations. A working example demonstrating candidate generation using F(k−1) × F(k−1) is shown in Figure 6.

Figure 6: Example of joining F(k−1), F(k−1) to generate candidate F(k) itemset

ii) Prune Operation: Scan the generated entire candidate itemsets C(k) to compute the
support score of each item in the set. Any itemset with support score less than
minimum support threshold is pruned from the candidate itemset. Thus, the itemsets
left after pruning would form the Frequent itemset F(K) such that F(k) ⊆ C(k). To
prune the itemsets from the dataset, a priori property is used. According to the
property, an infrequent (k-1)-itemset is not a subset of a frequent k-itemset. Thus, if
any (k -1)-subset of a candidate itemset C(k) is not in F(k-1), then the candidate
cannot be frequent and so is removed from C(k).

14.4.2 Case Study: Find the frequent itemset from the given set of items
Consider the set of transactions represented in binary form in Table 1 as given below. Assume
minimum support threshold to be 2.
Transaction List of items
Id
T1 Milk, Cookies, Bread

T2 Milk, Bread, Egg, Butter

T3 Milk, Cookies, Bread, Butter

T4 Cookies, Bread, Butter

Step 1: Arrange the items in lexicographic order. Call this candidate set C(1)

Transaction Id List of items


T1 Bread, Cookies, Milk
T2 Bread, Butter, Egg, Milk
T3 Bread, Butter, Cookies, Milk
T4 Bread, Butter, Cookies

Step 2: Obtain support score for each item in candidate itemset C(1) as:

S.No Item Support


1 Bread 4
2 Butter 3
3 Cookies 3
4 Egg 1
5 Milk 3

Step 3: Prune the items whose support score is less than the minimum support threshold. This
results in 1-frequent itemset, F(1).

S.No Item Support


1 Bread 4
2 Butter 3
3 Cookies 3
4 Milk 3
Step 4: Generate 2-candidate itemsets from F(1) obtained in the previous step and obtain the support score of each itemset, i.e. the frequency of each itemset in the original transaction set. As the support score of each itemset is at least 2, none of the itemsets is pruned; so the table below also represents the 2-frequent itemsets, F(2).

S.No Itemset Support


1 Bread, Butter 3
2 Bread, Cookies 3
3 Bread, Milk 3
4 Butter, Cookies 2
5 Butter, Milk 2
6 Cookies, Milk 2

Step 5: Generate 3-candidate itemsets by joining F(2) and F(1) to obtain C(3). Compute the
support score of each itemset.

S.No Itemsets Count


1 Bread, Butter, Cookies 2
2 Bread, Butter, Milk 2
3 Bread, Cookies, Milk 2
4 Butter, Cookies, Milk 1

Step 6: Prune the itemsets in C(3) if their support is less than the minimum support threshold.
We then obtain F(3).

S.No Itemsets Count


1 Bread, Butter, Cookies 2
2 Bread, Butter, Milk 2
3 Bread, Cookies, Milk 2

Step 7: Similarly we obtain C(4) using F(3) and prune the candidate itemset to obtain frequent
itemset F(4).
S.No Itemset Support

1 Bread, Butter, Cookies, Milk 1

As the only itemset in C(4) has support count 1, so it is pruned and F(4) =Ø.

Now, the iteration stops.

Detail of the algorithm is given in Figure 7.

Input: T: a set of transactions; minsup: minimum support threshold


Output: F(k): set of k-frequent itemsets (result)

1. F1 ← {large 1 - itemsets}
2. k←2
3. while F(k−1) is not empty
4. C(k)← Apriori_gen(F(k−1), k)
5. for transactions t in T
6. Dt ← {c in C(k) : c ⊆ t}
7. for candidates c in Dt
8. count[c] ← count[c] + 1
9.
10. F(k) ← {c in C(k): count[c] ≥ minsup}
11. k←k+1
12.
13. return Union(F(k))
14.
15. Apriori_gen(F, k)
16.   for all X ⊆ F, Y ⊆ F where X1 = Y1, X2 = Y2, ..., Xk-2 = Yk-2 and Xk-1,
      Yk-1 are in lexicographic order
17.     c = X ∪ {Yk-1}
18.     if every (k-1)-subset u of c is in F        (prune step)
19.       result ← append(result, c)
20.   return result

Figure 7: Apriori Algorithm
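A compact, runnable Python rendering of the algorithm in Figure 7 is sketched below. It treats minsup as an absolute count, generates candidate k-itemsets by unioning pairs of frequent (k−1)-itemsets, and then applies the subset-based prune step; run on the case-study transactions with minsup = 2, it reproduces the frequent itemsets obtained above.

from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    def count(c):
        return sum(c <= t for t in transactions)
    # F(1): frequent 1-itemsets
    Fk = {frozenset([i]) for i in items if count(frozenset([i])) >= minsup}
    frequent, k = set(Fk), 2
    while Fk:
        # join step: union pairs of frequent (k-1)-itemsets of size k
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        Fk = {c for c in Ck if count(c) >= minsup}   # support counting
        frequent |= Fk
        k += 1
    return frequent

T = [{"bread", "cookies", "milk"},
     {"bread", "butter", "egg", "milk"},
     {"bread", "butter", "cookies", "milk"},
     {"bread", "butter", "cookies"}]
for f in sorted(apriori(T, 2), key=len):
    print(set(f))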

14.4.3 Generating Association Rules using Frequent Itemset

Once the frequent itemsets F(k) are generated from set of transactions T, next step is to generate
strong association rules from them. Recall that strong association rules satisfy both minimum
support as well as minimum confidence. Theoretically, confidence is defined as the ratio of
frequency of transactions containing items x and y to the frequency of transactions that contained
item x.
Based on the definition, follow the given steps to generate strong association rules:

1. For each itemset f 𝜖 F(k), generate subsets of f.

2. For every non-empty proper subset x ∈ f, generate the rule x => (f − x) if

confidence(x => f − x) = support(f) / support(x) ≥ min_conf

Consider the set of transactions and the generated frequent itemsets described in section 14.4.2. One of the frequent itemsets f belonging to the set F(3) is:

f = {Bread, Butter, Cookies}.

Following the steps given above, the non-empty proper subsets x of f are:

x= {{Bread}, {Butter}, {Cookies}, {Bread, Butter}, {Bread, Cookies}, {Butter, Cookies}}

where all itemsets are sorted in lexicographic order. The generated association rules with their
confidence scores are stated:

Rule                            Confidence

{Bread, Butter} => {Cookies}    support(Bread, Butter, Cookies) / support(Bread, Butter) = 2/3 = 66.67%

{Bread, Cookies} => {Butter}    support(Bread, Butter, Cookies) / support(Bread, Cookies) = 2/3 = 66.67%

{Butter, Cookies} => {Bread}    support(Bread, Butter, Cookies) / support(Butter, Cookies) = 2/2 = 100%

{Butter} => {Bread, Cookies}    support(Bread, Butter, Cookies) / support(Butter) = 2/3 = 66.67%

{Bread} => {Butter, Cookies}    support(Bread, Butter, Cookies) / support(Bread) = 2/4 = 50%

{Cookies} => {Bread, Butter}    support(Bread, Butter, Cookies) / support(Cookies) = 2/3 = 66.67%

Depending on the minimum confidence scores, the obtained association rules are preserved, and
rest are pruned.
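The rule-generation step can be sketched as follows, enumerating every non-empty proper subset of one frequent itemset from the case study; the min_conf value of 0.7 here is an illustrative assumption.

from itertools import combinations

transactions = [{"bread", "cookies", "milk"},
                {"bread", "butter", "egg", "milk"},
                {"bread", "butter", "cookies", "milk"},
                {"bread", "butter", "cookies"}]

def support(itemset):
    # absolute support count of the itemset
    return sum(itemset <= t for t in transactions)

f = frozenset({"bread", "butter", "cookies"})
min_conf = 0.7
for r in range(1, len(f)):
    for antecedent in combinations(f, r):
        x = frozenset(antecedent)
        conf = support(f) / support(x)
        status = "kept" if conf >= min_conf else "pruned"
        print(set(x), "=>", set(f - x), round(conf, 4), status)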

Check Your Progress 2


Ques 1: For the following given Transaction Data-set, Generate Rules using Apriori Algorithm. Consider
the values as Support=50% and Confidence=75%
Transaction Id Set of Items

T1 Pen, Notebook, Pencil, Colours

T2 Pen, Notebook, Colours

T3 Pen, Eraser, Scale

T4 Pen, Colours, Eraser

T5 Notebook, Colours, Eraser

Ques 2: For the following given Transaction Data-set, Generate Rules using Apriori Algorithm. Consider
the values as Support=15% and Confidence=45%

Transaction Set of
Id Items

T1 A, B, C

T2 B, D

T3 B, E

T4 A, B, D
T5 A, E

T6 B, E

T7 A, E

T8 A, B, C, E

T9 A, B, E

14.5 FP TREE GROWTH

Although there is a significant reduction in the number of candidate itemsets, the Apriori algorithm can still be slow due to the repeated scanning of the entire transaction set. So Frequent Pattern growth, also called the FP growth method, employs a divide and conquer strategy to generate frequent itemsets.
Working of FP growth algorithm:

1. Create Frequent Pattern Tree, or FP-tree by compressing the transaction database. Along with
preserving the information about the itemsets, the tree structure also retains the association
among the itemsets.

2. Divide the transaction database into a set of conditional databases, each associated with one frequent item or "pattern fragment", and examine each database separately.

3. For each “pattern fragment,” examine its associated itemsets only. Therefore, this approach
may substantially reduce the size of the itemsets to be searched, along with examining the
“growth” of patterns.

Advantages of FP growth over Apriori algorithm:

1. Efficient than Apriori algorithm


2. No candidate itemset generation
3. Only two passes over the transaction set

14.5.1 FP Tree Construction


1. Obtain the 1−itemsets and their support score as computed in the first step of Apriori algorithm.
2. Sort the itemsets in descending order of their support score.
3. Create the root of tree as NULL.
4. Examine the first transaction in the set T, create the nodes of tree as the itemset with the
maximum support score at the top, the next itemset with lower count and so on. At the end, the
leaves of the tree will have the itemsets with the least support score.
5. Examine every transaction in the set, whose itemsets are also sorted in descending order of their
support score. If any itemset of this transaction is already existing in the tree, then they share the
same node, but count is incremented by 1. Else, a new node and branch will be created for a new
itemset.
6. Divide the FP tree obtained at the end of step 5 into conditional FP trees. For this, examine the leaf (lowest) nodes of the FP tree; the lowest nodes represent frequency patterns of length 1. For each item, traverse the paths in the FP tree to create its conditional pattern base, which consists of the prefix path labels of all paths leading to any node of the given item in the FP tree.
7. The conditional pattern base generates the frequent patterns; thus, one conditional tree is generated for one frequent item. Perform this step recursively to generate all the conditional FP trees.
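Below is a short sketch of the preprocessing in steps 1, 2 and 5: count the item supports, drop the infrequent items and reorder every transaction by descending support before it would be inserted into the FP tree (the tree construction itself is omitted for brevity).

from collections import Counter

transactions = [["bread", "cookies", "milk"],
                ["bread", "butter", "egg", "milk"],
                ["bread", "butter", "cookies", "milk"],
                ["bread", "butter", "cookies"]]
minsup = 2

counts = Counter(i for t in transactions for i in t)   # item supports
ordered = [sorted((i for i in t if counts[i] >= minsup),
                  key=lambda i: (-counts[i], i))       # descending support
           for t in transactions]
print(counts)    # bread: 4, butter: 3, cookies: 3, milk: 3, egg: 1
print(ordered)   # transactions ready for FP tree insertion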

Consider the following transaction set T given in section 14.4.2:

Transaction Id List of items


T1 Bread, Cookies, Milk
T2 Bread, Butter, Egg, Milk
T3 Bread, Butter, Cookies, Milk
T4 Bread, Butter, Cookies

We shall be representing each item with its S.No. in the description below. The items are sorted in
descending order of their support score as given below.

S.No Item Support


I1 Bread 4
I2 Butter 3
I3 Cookies 3
I4 Milk 3
I5 Egg 1

i) Create the null node.


null
ii) For transaction T1, item in descending order of their support score is I1 (4), I3
(3) and I4 (3). Create a branch connecting these three nodes in the tree as shown in Figure 8.

Figure 8: Creating FP Tree from transaction T1


iii) Examine the transaction T2 andT3, and further expand the FP tree as shown in Figure 9

Figure 9: FP Tree creation from transactions T2 and T3

iv) Examining the last transaction T 4 generates the final FP tree as shown in Figure 10.

Figure 10: Final FP tree obtained after scanning transaction T4


v) Now, obtain the conditional FP tree for each item I1…..I5by scanning the path from the lowest
leaf node as shown in the table below:

Item  Conditional Pattern Base

I5    {I1, I2, I4}: 1

I4    {I1, I3}: 1, {I1, I2}: 1, {I1, I2, I3}: 1

I3    {I1}: 1, {I1, I2}: 2

I2    {I1}: 3

I1    —

vi) For each item, build the conditional FP tree from its conditional pattern base by keeping the elements whose summed support score (over all the paths in the conditional pattern base) meets the minimum support threshold of 2 used earlier. I5 (Egg) has support 1, so it is infrequent and produces no conditional tree.

Item  Conditional Pattern Base                       Conditional FP Tree

I4    {I1, I3}: 1, {I1, I2}: 1, {I1, I2, I3}: 1      {I1: 3, I2: 2}

I3    {I1}: 1, {I1, I2}: 2                           {I1: 3, I2: 2}

I2    {I1}: 3                                        {I1: 3}

I1    —                                              —

vii) The frequent patterns generated are: {I1, I4}, {I2, I4} and {I1, I2, I4}, i.e. {Bread, Milk}, {Butter, Milk} and {Bread, Butter, Milk}; {I1, I3}, {I2, I3} and {I1, I2, I3}, i.e. {Bread, Cookies}, {Butter, Cookies} and {Bread, Butter, Cookies}; and {I1, I2}, i.e. {Bread, Butter}.

Check Your Progress 3

Ques 1: Generate FP Tree for the following Transaction Dataset, with minimum support 40%.

Transaction Id Set of Items

T1 I5, I1, I4, I2

T2 I4, I1, I3, I5, I2

T3 I3, I1, I2, I5


T4 I2, I1, I4
T5 I4

T6 I4, I2

T7 I1, I4, I5

T8 I2, I3

Ques 2: Find the frequent itemset using FP Tree, from the given set of Transaction Dataset with
minimum support score as 2.

Transaction Id Set of Items

T1 Banana, Apple, Tea

T2 Apple, Carrot

T3 Apple, Peas

T4 Banana, Apple, Carrot


T5 Banana, Peas

T6 Apple, Peas

T7 Banana, Peas

T8 Banana, Apple,
Peas, Tea

T9 Banana, Apple, Peas

14.6 Pincer Search

In the Apriori algorithm, the computation begins from the smallest frequent itemsets and proceeds until it reaches the maximum frequent itemset, but the algorithm's efficiency decreases as it passes through many iterations. The solution to the problem is to generate the frequent itemsets using the bottom-up and top-down approaches together. The Pincer search algorithm thus works on a bidirectional approach: it attempts to find the frequent itemsets in a bottom-up manner but, at the same time, maintains a list of maximal frequent itemsets. While iterating over the transaction set, it computes the support scores of the candidate maximal frequent itemsets. Any subset of a maximal frequent itemset found this way is known to be frequent, and hence its support need not be counted in later iterations. Thus, Pincer search is advantageous over the Apriori algorithm when the frequent itemsets are long.

In each iteration, while computing the support counts of candidates in the bottom-up direction, the algorithm also computes the support scores of some itemsets in the top-down direction. These itemsets form the Maximal Frequent Candidate Set (MFCS). Consider an itemset x ∈ MFCS, with cardinality greater than k, found frequent in the kth iteration. Then all the subsets of x must also be frequent, so all these subset itemsets can be pruned from the candidate sets in the bottom-up direction, increasing the algorithm's efficiency. Similarly, when the algorithm finds an infrequent itemset in the bottom-up direction, the MFCS is updated in order to remove supersets of these infrequent itemsets.

The MFCS is initialized to a singleton: the itemset of cardinality n that contains all the items of the transaction set. If some m infrequent 1-itemsets are observed after the first iteration, these infrequent items are pruned from the set, thereby reducing the cardinality of the singleton to n − m. This is an efficient algorithm, as it may arrive at the maximal frequent set within a few iterations.

Check Your Progress 4

Ques 1: What is the advantage of using the Pincer algorithm over the Apriori algorithm?

Ques 2: What is the Maximal Frequent Candidate Set?

Ques 3: What happens if an item is a maximal frequent item?

Ques 4: What is the bi-directional approach of the Pincer algorithm?

14.7 Summary

This unit presented the concepts of association rules. In this unit, we briefly described about
what are these association rules and how various models help to analyse data for patterns, or co-
occurrences, in a transaction database. These rules are created by finding the frequent itemsets in
the database using the support and the confidence. Support indicates the frequency of occurrence
of an item given the entire transaction database. High support score indicates a popular item
among the users. Confidence of a rule indicates the rule’s reliability. Higher confidence indicates
more occurrence of item(s) in a set.
A typical example that has been used to study association rules is market basket analysis.

The unit presented different concepts like frequent itemsets, closed items and maximal itemsets.
Various algorithms viz, Apriori algorithm, FP tree and Pincer algorithm are discussed in detail in
this unit, to generate the association rules by identifying the relationship between the items that
are often purchased or occurred together.

14.8 Solutions to Check your progress

Check Your Progress 1

Ques 1
a) Binary Representation of the Transaction set:
Transaction   X  Y  Z
{X, Y}        1  1  0
{X, Y, Z}     1  1  1
{X, Z}        1  0  1
{X, Z}        1  0  1
{Z}           0  0  1
{Z}           0  0  1
{Z}           0  0  1
{Z}           0  0  1

b) Finding support and confidence of each rule:


Support = frequency of item/ number of transactions
Confidence(A->B) = support(A∪B)/support(A)

Support(X) = 4/8 = 50%


Support(Y) = 2/8 = 25%
Support(Z) = 7/8 = 87.5%
Support (X, Y) = 2/8 = 25%
Support (X, Z) = 3/8 = 37.5%
Support (Y, Z) = 1/8 = 12.5%
i) Confidence(X-> Y) = support (X, Y)/support(X) = (2/8)/(4/8) = 2/4 = 50%
ii) Confidence(X-> Z) = support (X, Z)/support(X) = (3/8)/(4/8) = 3/4 = 75%
iii) Confidence(Y-> Z) = support (Y, Z)/support(Y) = (1/8)/(2/8) = 1/2 = 50%
Ques 2 For finding the maximal frequent itemsets and the closed frequent itemsets:
a) Frequent itemset X 𝜖 F is maximal if it does not have any frequent supersets.
b) Frequent itemset X 𝜖 F is closed if it has no superset with the same frequency.

Step i) Count the occurrence of each item set:

Itemset Frequency Itemset Frequency

{A} 4 {D, E} 4
{B} 2 {A, B, C} 1
{C} 5 {A, B, D} 0
{D} 4 {A, B, E} 1
{E} 6 {A, C, D} 2
{A, B} 1 {A, C, E} 3
{A, C} 3 {A, D, E} 3
{A, D} 3 {B, C, D} 0
{A, E} 4 {B, C, E} 2
{B, C} 2 {C, D, E} 3
{B, D} 0 {A, B, C, D} 0
{B, E} 2 {A, B, C, E} 1
{C, D} 3 {B, C, D, E} 0
{C, E} 5

Let minimum support = 3.


a) {B}, {A, B}, {B, C}, {B, D}, {B, E}, {A, B, C}, {A, B, D}, {A, B, E}, {A, C, D}, {B,
C, D}, {B, C, E}, {A, B, C, D}, {A, B, C, E} and {B, C, D, E} are not frequent itemsets
as their support count is less than the minimum support score. Hence, we prune these
itemsets.

b) {A} is not closed as frequency of {A, E} (superset of {A}) is same as frequency of {A} =
4.
c) {C} is not closed as frequency of {C, E} (superset of {C}) is same as frequency of {C} =
5.
d) {D} is not closed, as the frequency of {D, E} (a superset of {D}) is the same as the frequency of {D} = 4. {E} is closed as it has no superset with the same frequency, but {E} is not maximal as it has the frequent superset {C, E}.
e) {A, C} is not closed as frequency of its superset {A, C, E} is same as that of {A, C}.
f) {A, D} is not closed as frequency of its superset {A, D, E} is same as that of {A, D}.
g) {A, E} is closed as it has no superset with same frequency. But {A, E} is not maximal as
it has a frequent superset i.e {A, D, E}.
h) {C, D} is not closed due to its superset {C, D, E}.
i) {C, E} is closed but not maximal due to its frequent superset {C, D, E}.
j) {D, E} is closed, as no superset of it has frequency 4, but it is not maximal due to its frequent supersets {A, D, E} and {C, D, E}.
k) {A, C, E} is a maximal itemset as it has no frequent super itemset. It is also closed as it
has no superset with same frequency.
l) {A, D, E} is also a maximal and closed itemset.
m) {C, D, E} is likewise frequent, closed (no superset has frequency 3) and maximal (no frequent superset).

Check Your Progress 2


Ques 1:
Step i) Find Frequent itemset and its support, where
support = frequency of item/number of transactions
Item Frequency Support %
Pen 4 4/5 = 80%
Notebook 3 3/5 = 60%
Pencil 1 1/5 = 20%
Colours 4 4/5 = 80%
Eraser 3 3/5 = 60%
Scale 1 1/5 = 20%

Step ii) Remove the items whose support is less than the minimum support (=50%)

Item Frequency Support %


Pen 4 4/5 = 80%
Notebook 3 3/5 = 60%
Colours 4 4/5 = 80%
Eraser 3 3/5 = 60%

Step iii) Form the two-item candidate set and find their frequency and support.
Item Frequency Support %
Pen, Notebook 2 2/5 = 40%
Pen, Colours 3 3/5 = 60%
Pen, Eraser 2 2/5 = 40%
Notebook, Colours 3 3/5 = 60%
Notebook, Eraser 1 1/5 = 20%
Colours, Eraser 2 2/5 = 40%

Step (iv) Remove the pair of items whose support is less than the minimum support

Item Frequency Support %


Pen, Colours 3 3/5 = 60%
Notebook, Colours 3 3/5 = 60%
Step (v) Generate the rule:

For rules, consider the following item pairs:


1. (Pen, Colours): Pen -> Colours and Colours -> Pen
2. (Notebook, colours): Notebook -> Colours and Colours -> Notebook
Confidence(A->B) = support (A∪ B)/support(A)
Hence,

Rule 1: Confidence (Pen -> Colours) = support (Pen, Colours)/support (Pen)


= (3/5) / (4/5) = 3/4 = 75%
Rule 2: Confidence (Colours -> Pen) = support (Colours, Pen)/support (Colours)
= (3/5) / (4/5) = 3/4 = 75%
Rule 3: Confidence (Notebook->Colours)
= support (Notebook, Colours) / support (Notebook)
= (3/5) / (3/5) = 1 = 100%
Rule 4: Confidence (Colours -> Notebook)
= support (Notebook, Colours) / support (Colours)
= (3/5) / (4/5) = 3/4 = 75%
All the 4 generated rules are accepted as the confidence score of each rule is greater than or
equal to the minimum confidence given in the question.

Ques 2
Given set of transactions as:

Transaction Id Set of Items

T1 A, B, C

T2 B, D

T3 B, E

T4 A, B, D
T5 A, E

T6 B, E

T7 A, E

T8 A, B, C, E

T9 A, B, E

Step i) Find the frequency and support of each item in the transaction set, where
support = frequency of item / number of transactions

Item Frequency Support


A 6 6/9 = 66.67%
B 7 7/9 = 77.78%
C 2 2/9 = 22.22%
D 2 2/9 = 22.22%
E 6 6/9 = 66.67%
Step ii) Prune the items whose support is less than the minimum support: As support of each
item is greater than the minimum support (15%), hence, no item is pruned from the set. Now,
form the two-item candidate set and find their support score.
Item Frequency Support

A, B 4 4/9 = 44.44%
A, C 2 2/9 = 22.22%
A, D 1 1/9 = 11.11%
A, E 4 4/9 = 44.44%
B, C 2 2/9 = 22.22%
B, D 2 2/9 = 22.22%
B, E 4 4/9 = 44.44%
C, D 0 0
C, E 1 1/9 = 11.11%
D, E 0 0
Step iii) Prune the two-itemsets whose support is less than the minimum support (15%):

Item Frequency Support
A, B 4 4/9 = 44.44%
A, C 2 2/9 = 22.22%
A, E 4 4/9 = 44.44%
B, C 2 2/9 = 22.22%
B, D 2 2/9 = 22.22%
B, E 4 4/9 = 44.44%

Step iv) Now, find the three-item candidate sets and their support scores:

Item Frequency Support

A, B, C 2 2/9 = 22.22%
A, B, D 1 1/9 = 11.11%
A, B, E 2 2/9 = 22.22%
A, C, E 1 1/9 = 11.11%
B, C, D 0 0
B, C, E 1 1/9 = 11.11%
B, D, E 0 0

Pruning the three item sets whose support is less than the minimum support, we get:
Item Frequency Support
A, B, C 2 2/9 = 22.22%
A, B, E 2 2/9 = 22.22%

Step v) Generate the rules from each three itemset and compute the confidence of
each rule as:
Confidence(x->y) = support(x∪y)/support(x)
For three itemset (A, B, C):

Rule Confidence
(A, B)->C (2/9) / (4/9) = 1/2 = 50%
(A, C) -> B (2/9) / (2/9) = 1 = 100%
(B, C) -> A (2/9) / (2/9) = 1 = 100%
A -> (B, C) (2/9) / (6/9) = 2/6 = 33.33%
B -> (A, C) (2/9) / (7/9) = 2/7 = 28.57%
C -> (A, B) (2/9) / (2/9) = 1 = 100%

From the generated set of rules, the rules with confidence greater than or equal to the
minimum confidence (45%) are the valid rules.

Thus, the valid rules from (A, B, C) are:

i) (A, B) -> C
ii) (B, C) -> A
iii) (A, C) -> B
iv) C -> (A, B)

Rules from the other frequent three-itemset, (A, B, E), are generated and filtered in
exactly the same way.

Check Your Progress 3


Ques 1

Step i) From the given transaction set, find the frequency of each item and
arrange them in decreasing order of their frequencies.

Item  Frequency        Item (sorted by frequency)  Frequency

I1    5                I2                          6
I2    6                I4                          6
I3    3                I1                          5
I4    6                I5                          4
I5    4                I3                          3
Step ii) For every transaction set, arrange the item in the decreasing order of their
frequency
Transaction Items Ordered Items
ID

T1 I5, I1, I4, I2 I2, I4, I1, I5


T2 I4, I1, I3, I5, I2 I2, I4, I1, I5, I3
T3 I3, I1, I2, I5 I2, I1, I5, I3
T4 I2, I1, I4 I2, I4, I1
T5 I4 I4
T6 I4, I2 I2, I4
T7 I1, I4, I5 I4, I1, I5

T8 I2, I3 I2, I3
Step iii) Create the FP tree now.
 Create a NULL root node.
 From the ordered items in T1, build the first branch of the tree.
 Follow transaction T2: increase the count of the existing items and add the new items
to the tree.
 Follow transaction T3 and repeat the above process.
 Follow transaction T4; this increases the counts of I2, I4 and I1.
 Follow transaction T5.
 Follow transaction T6; this increases the counts of I2 and I4, as the path already
exists in the tree.
 Follow transaction T7.
 Finally, follow transaction T8.

The final FP tree is obtained after all eight transactions have been processed in this way.

Ques 2 Step i) FP Tree for the given Transaction set is:


Step ii) Create the conditional pattern base as:

Item     Conditional Pattern Base                        Conditional FP Tree                  Frequent Patterns Generated
Tea      {{Apple, Banana: 1}, {Apple, Banana, Peas: 1}}  {Apple: 2, Banana: 2}                {Apple, Tea: 2}, {Banana, Tea: 2}, {Apple, Banana, Tea: 2}
Carrot   {{Apple, Banana: 1}, {Apple: 1}}                {Apple: 2}                           {Apple, Carrot: 2}
Peas     {{Apple, Banana: 2}, {Apple: 2}, {Banana: 2}}   {Apple: 4, Banana: 2}, {Banana: 2}   {Apple, Peas: 4}, {Banana, Peas: 4}, {Apple, Banana, Peas: 2}
Banana   {{Apple: 4}}                                    {Apple: 4}                           {Apple, Banana: 4}

As the minimum support given is 3, all the itemsets with a support count greater than or equal to
3 form the frequent itemsets:
{Apple, Banana}, {Apple, Peas}, and {Banana, Peas}
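
On larger datasets the same FP-Growth computation is normally done with a library.
The following is a minimal sketch using the mlxtend package; the transaction list
here is a hypothetical illustration (it is not the dataset of this question):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Hypothetical grocery transactions, for illustration only
transactions = [
    ["Apple", "Banana", "Peas"],
    ["Apple", "Banana", "Tea"],
    ["Apple", "Peas"],
    ["Apple", "Banana", "Peas", "Tea"],
    ["Banana", "Peas"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine itemsets appearing in at least 60% of the transactions (3 out of 5)
print(fpgrowth(df, min_support=0.6, use_colnames=True))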

14.9 FURTHER READINGS


1. Stephen Marsland, Machine Learning: An Algorithmic Perspective, 2nd Edition, CRC Press, 2015.
2. Tom Mitchell, Machine Learning, 1st Edition, McGraw-Hill, 1997.
3. Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data, 1st
Edition, Cambridge University Press, 2012.
4. Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition,
Morgan Kaufmann, 2011.
UNIT 15 CLUSTERING

Structure
15.1 Introduction
15.2 Types of Clustering
15.3 Partition Based Clustering
15.4 Hierarchical Based Clustering
15.5 Density Based Clustering Techniques
15.6 Clustering Algorithms
      K-Means,
      Agglomerative and Divisive,
      DBSCAN,
      Introduction to Fuzzy Clustering
15.7 Summary
15.8 Solutions to Check Your Progress

15.1 INTRODUCTION

Clustering or cluster analysis is a method for dividing a group of data objects into subgroups
based on some measure of similarity. Each cluster is a subset of the data in which objects with
resembling or similar properties are grouped together: objects are similar to one another within
a cluster but differ from the objects in other clusters. In simple terms, it is the effort of
dividing a population into multiple groups such that data points in one group are more comparable
to each other than to data points in other groups; the aim is to separate groups with identical
features and assign them to clusters. This process is performed by machines, not by humans, and is
known as unsupervised learning, because clustering is a form of learning by observation.
Clustering is often confused with classification in data analysis: in classification, the
separation of data happens based on class labels, while in clustering the partitioning of large
data sets into groups happens based on similarity.

Let’s understand this with an example. Suppose you are the head of a rental store and wish to
understand the preferences of your customers to scale up your business. Is it possible for you to
look at the details of each customer and devise a unique business strategy for each one of them?
Certainly not. But what you can do is cluster all of your customers into, say, 10 groups based on
their purchasing habits and use a separate strategy for the customers in each of these 10 groups.
And this is what we call clustering.

“Clustering is the process of dividing the entire data into groups (also known as clusters) based on the
patterns in the data.”

Data clustering is a prominent technique in data analysis and is applied in various research areas
including data mining, statistics and machine learning. It is also applied in financial services,
health information systems, web mining and many other sectors. Cluster analysis is an active area
of research in data analysis due to the massive volumes of data produced in databases. An example
of clustering is outlier detection, where credit card fraud and criminal activities are monitored.
Clustering can be used in image recognition to find clusters or patterns in image or text
recognition systems. Clustering has a lot of uses in web search as well: due to the enormous
quantity of web pages, a keyword search may frequently produce a huge number of hits (i.e., pages
relevant to the search), which can be organised into groups of related results. Thus, clustering
is a very promising machine learning process and has proved to be one of the most pragmatic data
mining tools. In this unit, you will learn about the various types of clustering techniques and
algorithms.

Fig 1. Clustering

Real World Example:

A bank can potentially have millions of customers. Would it make sense to look at the details of
each customer separately and then decide? Certainly not! It is a manual process and will take a
huge amount of time.

So, what can the bank do? Clustering comes to the rescue in these situations: the bank can
group the customers based on, say, their income.
Applications of Clustering in Real-World Scenarios

Clustering is a widely used technique in the industry. It is being used in almost every domain,
ranging from banking to recommendation engines, document clustering to image segmentation.

 Customer Segmentation

Customer segmentation is one of the most common applications of clustering. And it isn’t just
limited to banking. This strategy is used across functions and industries, including telecom,
e-commerce, sports, advertising, sales, etc.

 Document Clustering

Document Clustering is another common application of clustering. Let’s assume that you have
multiple documents, and you need to cluster similar documents together. Clustering helps us
group these documents such that similar documents are in the same clusters.

 Image Segmentation

We can also use clustering to perform image segmentation. Here, we try to club similar pixels in
the image together. We can apply clustering to create clusters having similar pixels in the same
group.

15.2 Types of Clustering

There are many clustering algorithms in the literature. Traditionally, objects are grouped
based on how similar they are to other objects. Similarity-based algorithms define a function
for computing similarity and use it as the basis for assigning objects to clusters.
Each cluster should contain data that are comparable to one another but different from those in
other clusters. Clustering algorithms fall into different categories based on the underlying
methodology of the algorithm (agglomerative or partitional), the structure of the final solution
(flat or hierarchical), or whether they are density based. All the above-mentioned clustering
types are discussed in detail in the rest of this unit.

Check Your Progress - 1

Qn1. What do you understand by the term “Clustering”?

Qn2. Where is Clustering used in present day scenario?


15.3 Partition Based Clustering

The partitioning algorithm is one of the most widely applied clustering approaches. It has been
used in many applications due to its simple structure and easy implementation as compared to other
clustering algorithms. This clustering method classifies the information into multiple groups
based on the characteristics and similarities of the data. In the partitioning method, when a
database (D) contains multiple (N) objects, the partitioning approach divides the data into
user-specified (K) partitions, each of which represents a cluster and a specific region. That is,
it divides the data into K groups or partitions in such a manner that each group contains at least
one object from the existing data. To put it another way, it splits the data items into
non-overlapping subsets (clusters) so that each data object belongs to exactly one of them.

Fig 2. Partition based clustering

Partitioning approaches require a set of starting seeds (or clusters), which are then refined
iteratively by transferring objects from one group to another to improve the partitioning. Objects
in the same cluster are "near" or related to one another, whereas objects in different clusters
are "far apart" or significantly distinct, according to the most prevalent criterion of good
partitioning.

Many algorithms come under the partitioning method; some of the popular ones are K-Means,
PAM (K-Medoids), and CLARA (Clustering Large Applications).

Check Your Progress - 2

Qn1. Briefly explain how Partition Based Clustering works.


Qn2. Name three Partition Based Clustering methods.

15.4 Hierarchical Based Clustering

Hierarchical Clustering analysis is an algorithm that groups the data points with similar
properties and these groups are termed “clusters”. As a result of hierarchical clustering, we get a
set of clusters, and these clusters are always different from each other. Clustering of this data into
clusters is classified as:

 Agglomerative Clustering (building the hierarchy with a bottom-up merging strategy)

 Divisive Clustering (decomposing clusters with a top-down splitting strategy)

Hierarchical clustering helps in creating clusters in the proper order (or hierarchy).

Example:

The most common everyday example we see is how we order our files and folders in our
computer by proper hierarchy.

As mentioned, hierarchical clustering is classified into two types, i.e., agglomerative
clustering (AGNES) and divisive clustering (DIANA).

Fig 3. Hierarchical clustering: AGNES (agglomerative) vs DIANA (divisive)

The hierarchical clustering technique in terms of space and time complexity:


 Space complexity: When the number of data points is large, the space required for the
Hierarchical Clustering Technique is large since the similarity matrix must be stored in
RAM. The space complexity is measured by the order of the square of n.

Space complexity = O(n²) where n is the number of data points.

 Time complexity: The time complexity is also very high because we have to execute n
iterations and update and restore the similarity matrix in each iteration. The order of the
cube of n is the time complexity.

Time complexity = O(n³) where n is the number of data points.

Limitations of Hierarchical Clustering Technique:

1. Hierarchical clustering does not have a mathematical objective function.

2. The methodologies used for calculating the similarity between clusters do not apply
equally well in every situation; each technique has its own merits and demerits.
3. Due to its high space and time complexity, this clustering algorithm is not suitable for
huge datasets.

Check Your Progress - 3

Qn1. Differentiate between Hierarchical Clustering and Partition Based Clustering.

Qn2. What are sub-types of Hierarchical Clustering and what is the difference between them?

Qn3. What is the space and time complexity of Hierarchical clustering?

Qn4. List some of the limitations of Hierarchical clustering

15.5 Density Based Clustering Technique

Clusters are formed in the density based clustering technique based on the density of the data
points represented in the data space. Those regions that become dense as a result of the large
number of data points residing there are termed clusters.

How it works:
1. It starts with a random unvisited starting data point. All points within a distance
‘Epsilon’ (Ɛ) classify as neighbourhood points.

2. We need a minimum number of points within the neighbourhood to start the clustering
process. In that case, the current data point becomes the first point of the cluster;
otherwise, the point is regarded as ‘noise’. In either case, the current point becomes a
visited point.

3. All points within the distance Ɛ become part of the same cluster. This process is repeated
for all the newly added data points in the cluster group.

4. Continue with the process until you visit and label each point within the Ɛ neighbourhood
of the cluster.

5. On completion of the process, start again with a new unvisited point, thereby leading to
the discovery of more clusters or noise. At the end of the process, you ensure that each
point is marked as either belonging to a cluster or being noise.

Fig 4. Density-connected points

Check Your Progress - 4

Qn1. What do you understand by the term “Density Based Clustering”?

Qn2. How is “Density Based Clustering” performed?

15.6 Clustering Algorithms

a) K-Means

K-Means is a clustering algorithm that tries to minimize the distance of the points in a
cluster from their centroid.

K-means is a centroid-based algorithm. It can also be called a distance-based technique, where
distances between points are calculated to allocate a point to a cluster. Each cluster in K-Means
is paired with a centroid.

The K-Means algorithm's main goal is to reduce the sum of distances between points and their
corresponding cluster centroids.

Real World Example:


Let's have a look at an example. Assume you went to a bookstore to purchase some books. There
are several types of books that can be found there. One thing you'll notice is that the books are
sorted into groups based on their category: all of the literature books are kept in one
location, while the science books are organized by type. Grouping similar items together in this
way is exactly what a clustering algorithm such as K-Means does with data points.

Now we will understand this with the help of these figures.

Fig 5. Before and after K-means clustering

In Figure 5 the data is presented in two stages. The first plot shows the data in its raw stage,
before k-means clustering. Here all types of data are clubbed together, so it is impossible to
differentiate them into their original categories.

The second plot shows three clusters in three different colours: red, green, and blue. These
clusters are formed after applying k-means clustering, separating the data into three different
categories, which are called clusters.

Working of the K-means clustering algorithm

The k-means clustering technique is a widely used algorithm to group similar types of data into
groups, the so-called clusters. It is one of the simplest and most commonly used techniques in
machine learning, used extensively for data analysis. It can easily locate the similarities
between different data items and group them into clusters. The working of the K-means clustering
algorithm consists of the following three steps:
1. Selection of the k value.
2. Initialization of the centroids.
3. Assignment of each point to its nearest cluster and recomputation of the cluster averages,
repeated until the assignments no longer change.

Let us understand the above steps with the help of Fig 6.
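
These three steps can also be reproduced in a few lines of library code. The following is a
minimal sketch using scikit-learn's KMeans; the eight 2-D points are the ones used in the
practice problems below, and k = 3 follows those problems:

import numpy as np
from sklearn.cluster import KMeans

# The eight (x, y) points used in the practice problems that follow
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])

# Step 1: select k; Steps 2-3: initialise the centroids, then repeatedly assign
# points to the nearest centroid and recompute the cluster averages
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # the cluster index assigned to each point
print(kmeans.cluster_centers_)  # the final centroids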

Below is a Practice Problem Based on K-Means Clustering Algorithm:


Problem-01: The following eight points (with (x, y) denoting places) should be grouped into
three clusters:
A1(2, 10), A2(2, 15), A3(8, 5), A4(6, 8), A5(7, 9), A6(6, 3), A7(1, 4), A8(4, 8)
The first cluster centers will be: A1(2, 10), A4(6, 8) and A7(1, 4).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
Solution: We follow the above discussed K-Means clustering algorithm.
Iteration-01:
 We measure the distance between each point and the center of each of the three
clusters.
 The specified distance function is used to determine the distance.
The calculation of the distance between point A1(2, 10) and each of the centers of the three
clusters is shown below.
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)

= |x2 – x1| + |y2 – y1|

= |2 – 2| + |10 – 10|

= 0

Calculating Distance Between A1(2, 10) and C2(6, 8)-


Ρ (A1, C2)

= |x2 – x1| + |y2 – y1|

= |6 – 2| + |8 – 10|
=4+2

=6

Calculating Distance Between A1(2, 10) and C3(1, 4)-


Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |4 – 10|
=1+6
=7
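
The distances from the remaining points to the three centers, the resulting cluster
assignments, and the centroid updates of both iterations follow exactly the same arithmetic.
A minimal NumPy sketch that automates the two iterations for this problem is given below (it
uses the Manhattan distance defined above and updates each center as the mean of its assigned
points):

import numpy as np

# Points and initial centers of Problem-01
points = np.array([[2, 10], [2, 15], [8, 5], [6, 8],
                   [7, 9], [6, 3], [1, 4], [4, 8]], dtype=float)
centers = np.array([[2, 10], [6, 8], [1, 4]], dtype=float)

for it in range(2):  # the question asks for the centers after the second iteration
    # Manhattan distance of every point to every center
    dist = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dist.argmin(axis=1)  # assign each point to its nearest center
    # Recompute each center as the mean of the points assigned to it
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print("Centers after iteration", it + 1)
    print(centers)

Running the script prints the three cluster centers after each of the two iterations.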
Problem-02 (Self-Test)

Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3
clusters: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
The distance matrix based on the Euclidean distance is given below:
Assume that the first seeds (cluster centers) are A1, A4, and A7. Run the k-means method for
one epoch only. Show: a) the new clusters (i.e., the examples belonging to each cluster); b) the
centers of the new clusters at the end of this epoch; c) a drawing of a 10 by 10 space with all 8
points, illustrating the clusters and the new centroids after the first epoch; d) how many more
iterations are needed to reach convergence. Draw the result for each epoch.

b) Agglomerative clustering

In this type of clustering, the hierarchical decomposition is built with a bottom-up
strategy: it starts by creating atomic (small) clusters, adding one data object at a time,
and then merges them together step by step until one big cluster is formed at the end, where
this cluster meets all the termination conditions. This procedure is iterated until all the data
points are brought under one single big cluster.

Basic algorithm of agglomerative clustering:

1. Compute the proximity matrix.

2. Let each data point be its own cluster.

3. Repeat:

4. Merge the two closest clusters.

5. Update the proximity matrix.

6. Until only one cluster remains.

AGNES (Agglomerative Nesting) is a type of agglomerative clustering that combines the data
objects into a cluster based on similarity. The result of this algorithm is a tree-based structure
called a dendrogram. It uses distance metrics to decide which data points should be combined with
which cluster: basically, it constructs a distance matrix, checks for the pair of clusters with
the smallest distance and combines them. Figure 7 below shows such a dendrogram.
Fig 7. Agglomerative clustering

We begin at the bottom with 25 data points, each of which is assigned to a different cluster.
The two nearest clusters are then repeatedly merged until only one cluster remains at the
topmost position. The height in the dendrogram at which two clusters are merged represents the
distance between those two clusters in the data space. Based on how the distance between
clusters is measured, we can have 3 different methods:

 Single linkage: the shortest distance between two points, one in each cluster, is
defined as the distance between the clusters.

 Complete linkage: in this case, we consider the longest distance between the two
clusters' points as the distance between the clusters.

 Average linkage: in this situation, we take the average distance of each point in one
cluster to every point in the other.

AGNES has a few limitations. It has a time complexity of at least O(n²), hence it does not
scale well; another major drawback is that whatever has been done can never be undone, i.e.,
if we incorrectly group any cluster in an earlier stage of the algorithm, we will not be able
to change or modify the outcome. But the algorithm also has a bright side: since many smaller
clusters are formed, it can be helpful in the process of discovery, and it produces an ordering
of objects that is very helpful in visualization.

Problem-03: Hierarchical clustering (to be done in your own time, not in class). Use single-link,
complete-link and average-link agglomerative clustering, as well as medoid- and centroid-based
clustering, to cluster the following 8 examples: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8),
A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). The distance matrix is the same as the one in
Problem-02. Show the dendrograms.

Solution:
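A sketch of this solution using SciPy's hierarchical clustering routines is given below; it
computes the single, complete and average linkages for the eight examples and draws the
corresponding dendrograms (the centroid variant can be obtained similarly with
method="centroid"):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# The eight examples of Problem-03
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
names = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"]

# One dendrogram per linkage criterion
for i, method in enumerate(["single", "complete", "average"], start=1):
    plt.subplot(1, 3, i)
    dendrogram(linkage(X, method=method), labels=names)
    plt.title(method + " linkage")
plt.tight_layout()
plt.show()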
c) Divisive Clustering (DIANA)

DIANA basically stands for Divisive Analysis. This is another type of hierarchical clustering
which works on the principle of the top-down approach (the inverse of AGNES): the algorithm
begins by forming one big cluster and recursively divides the most dissimilar cluster into two,
and this goes on until all similar data points belong to their respective clusters. These divisive
algorithms result in more accurate hierarchies than the agglomerative approach, but they are
computationally expensive.
Fig 8. Divisive clustering step by step process

d) Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

In most partitioning methods, the distance between objects is used to cluster them. Such
algorithms can only locate spherical-shaped clusters and have trouble finding clusters of other
shapes. DBSCAN can form clusters of different shapes; this type of algorithm is most suitable
when the dataset contains noise or outliers. It relies on a density-based notion of a cluster:
a cluster is defined as a maximal set of density-connected points.

Other clustering algorithms based on the concept of density have also been developed. Their
basic idea is to keep growing a cluster as long as the density of data points in its
"neighbourhood" exceeds a certain threshold.

For instance, every data point present in a given cluster should have at least a minimum
number of points in a neighbourhood of a given radius. This type of method is extremely useful
for detecting outliers or filtering out noise, and it is also useful for finding clusters of
arbitrary shape.

The best part of density-based methods is that they can divide a set of objects into multiple
exclusive clusters, or a hierarchy of clusters. Density-based techniques usually only evaluate
exclusive clusters and ignore fuzzy clusters. Furthermore, density-based clustering algorithms
can be extended from the entire space to subspaces.

DBSCAN finds clusters of arbitrary shape in spatial databases that contain noise.

In density-based clustering we partition points into dense regions separated by not-so-dense
regions. The points are characterized in the following manner:

▪ If a point has more than a given number of points (MinPts) within a distance Eps, it is
considered a core point. These points lie in a dense region, in the interior of a cluster.

▪ A border point has fewer points than MinPts within Eps, but it lies in the neighbourhood
of a core point.

▪ Any point that is neither a core point nor a border point is referred to as a noise point.

Fig 9. Core point, Border point and Noise point in DBScan

DBSCAN Algorithm
1. Label the points as core, border and noise points.
2. Eliminate the noise points.
3. For every core point p that has not yet been assigned to a cluster, create a new cluster
with the point p and all the points that are density-connected to p.
4. Assign the border points to the cluster of the closest core point.
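
A minimal scikit-learn sketch of DBSCAN is given below; the two-moons data, the Eps value and
the MinPts value are illustrative assumptions, chosen to show non-spherical clusters:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters that K-Means handles badly
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps plays the role of Ɛ and min_samples plays the role of MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))                         # cluster ids; -1 marks noise points
print(np.sum(db.labels_ == -1), "noise points")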

e) Fuzzy Clustering

Fuzzy clustering is a widely used clustering approach in the real world and is a sort of
clustering in which each data point can be assigned to multiple clusters. The algorithm allows
data points to belong to more than one cluster, unlike hard clustering, where a data point can
actually belong to only one cluster.

Illustrative Example
A Guava can either be Yellow or Green (hard clustering), but a Guava can also be both Yellow
and Green (this is fuzzy clustering). Here, the Guava can be Yellow to a certain degree as well as
Green to a certain degree. Instead of the Guava belonging to Green [Green = 1] and not Yellow
[Yellow = 0], the Guava can belong to Green [Green = 0.5] and Yellow [Yellow = 0.5]. These
values are normalized between 0 and 1; however, they do not represent probabilities, so the two
values do not need to add up to 1.

Fig 10. Understanding the algorithm

The fuzzy c-means (FCM) algorithm attempts to partition a finite set of n items
R = {r_1, ..., r_n} into a collection of c fuzzy clusters with respect to some given criterion.

Given a finite set of data, the algorithm returns a list of c cluster centres
F = {f_1, ..., f_c} and a partition matrix W = (w_ij), with w_ij ∈ [0, 1] for i = 1, ..., n and
j = 1, ..., c, where each element w_ij tells the degree to which element r_i belongs to
cluster f_j.

FCM aims to minimize the objective function

argmin over F and W of:  Σ_{i=1..n} Σ_{j=1..c} (w_ij)^m ||r_i − f_j||²

where m > 1 is the fuzziness parameter (fuzzifier) and the memberships are given by

w_ij = 1 / Σ_{k=1..c} ( ||r_i − f_j|| / ||r_i − f_k|| )^(2/(m−1))

Illustrative Example

To better understand this principle, a classic example of mono-dimensional data is given below
on an x axis.

This data set can be conventionally grouped into two clusters, and this is done by selecting a
threshold on the x-axis.
The resulting two clusters are labelled 'A' and 'B', as seen in the image below. Each point
belonging to the data set would therefore have a membership coefficient of either 0 or 1. This
membership coefficient of each corresponding data point is represented by the inclusion of the y-
axis.

In fuzzy clustering, each data point can have membership in multiple clusters. By relaxing the
definition of membership coefficients from strictly 0 or 1, these values can take any value
between 0 and 1 inclusive. The following image shows the data set from the previous clustering,
but now fuzzy c-means clustering has been applied. First, a new threshold value defining two
clusters is generated. Next, new membership coefficients for each data point are generated based
on the cluster centroids, as well as the distance from each cluster centroid.
As we can see, the middle data point belongs to both cluster A and cluster B. The value of 0.3
is this data point's membership coefficient for cluster A.
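
A from-scratch sketch of the fuzzy c-means updates described above is given below (pure NumPy;
the data, the number of clusters c = 2 and the fuzzifier m = 2 are illustrative assumptions):

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    # X: (n, d) data matrix; c: number of clusters; m: fuzzifier (> 1)
    rng = np.random.default_rng(seed)
    # Random initial partition matrix W whose rows sum to 1
    W = rng.random((X.shape[0], c))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centre update: membership-weighted mean of the points
        Wm = W ** m
        centres = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        # Membership update from the distances to the centres
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)  # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        W = inv / inv.sum(axis=1, keepdims=True)
    return centres, W

# Mono-dimensional data, as in the illustrative example above
X = np.array([[1.0], [1.5], [2.0], [3.5], [5.0], [5.5], [6.0]])
centres, W = fuzzy_c_means(X)
print(centres)        # the two cluster centres
print(W.round(2))     # degree of membership of each point in each cluster

The middle point (3.5) receives an intermediate membership in both clusters, just like the
middle data point in the example above.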

Check Your Progress - 5

Qn1. Briefly explain 5 types of clustering.

Qn2. What is Agglomerative Clustering? What are 3 methods that can be used for Agglomerative
Clustering?

Qn3. Briefly explain fuzzy clustering and provide a Real-World Example for the same.

Qn4. Differentiate Between AGNES and DIANA

Qn5. What is DBSCAN?

Qn6. Explain how K-Means Algorithm works.

15.7 SUMMARY

The technique of separating a set of data objects into subgroups based on some measure of similarity is
known as clustering or cluster analysis. Each subset is referred to as a 'cluster', with objects that are
related to one another but different from those in other clusters. In simple terms, it is the process of
separating a population or set of data points into several groups so that data points from the same group
are more comparable to each other than to data points from different groups. Clustering is often confused
with classification in data analysis: in classification, the separation of data happens on the basis of
class labels, while in clustering the partitioning of large data sets into groups happens on the basis of
similarity.
Clustering algorithms fall into different categories based on the underlying methodology of the algorithm,
the structure of the final solution, or whether they are density based.

In Density Based Clustering Technique, the clusters are created based upon the density of the data points
which are represented in the data space.

K-Means is a clustering algorithm that tries to minimize the distance of the points in a cluster from
their centroid; the main objective of the K-Means algorithm is to minimize the sum of distances between
the points and their respective cluster centroids.

AGNES is a type of agglomerative clustering that combines the data objects into a cluster based on
similarity.

DBSCAN finds clusters of arbitrary shape in spatial databases that contain noise.

Fuzzy clustering is a widely used clustering technique in practice; it is a sort of clustering in
which each data point can belong to many clusters.

15.8 SOLUTIONS TO CHECK YOUR PROGRESS

Check Your Progress - 1


For detailed answers refer to Sections 15.1 and 15.2.
Answer 1.
Clustering is the process of grouping data into groups (also known as clusters) based on patterns found in
the data. To put it another way, it is the work of dividing a population or set of data points into groups
so that data points in the same group are more comparable to one another than to data points in other
groups.
Answer 2.

Clustering is a widely used technique in the industry. It is being used in almost every domain, ranging
from banking to recommendation engines, document clustering to image segmentation.

 Customer Segmentation

 Document Clustering

 Image Segmentation

Check Your Progress - 2


For detailed answers refer to Section 15.3
Answer 1.
Partitioning approaches start with a set of beginning seeds (or clusters), which are then improved
iteratively by shifting objects from one group to another. According to the general criterion of good
partitioning, objects in the same cluster are "close" or connected to one another, whereas objects in other
clusters are "far away" or notably distinct.
Answer 2.
Popular partitions based on clustering are:
 K-Mean,

 PAM (K-Medoids),

 CLARA algorithm (Clustering Large Applications)

Check Your Progress - 3


For detailed answers refer to Section 15.4.
Answer 1:
Differences in assumptions, runtime, input, and output are all factors. Partitional clustering is, on average,
faster than hierarchical clustering. Stronger assumptions are required for partitional clustering.
Hierarchical clustering, on the other hand, simply requires a similarity metric. There are no input
parameters required for hierarchical clustering, but partitional clustering techniques require a number of
clusters to begin. Clusters are divided more meaningfully and subjectively using hierarchical clustering.
Partitional clustering, on the other hand, produces k clusters.
Answer 2:
Hierarchical clustering has two sub-types: agglomerative clustering (AGNES) and divisive clustering
(DIANA). In agglomerative clustering, each data point begins as a singleton or individual cluster, and the
nearest clusters are joined with each iteration; this technique is repeated until only one cluster remains.
Divisive clustering works in the opposite direction: it starts with all observations in a single cluster
and recursively splits the most heterogeneous cluster until each observation is in its own cluster (or a
stopping criterion is reached).
Answer 3:

Hierarchical clustering Technique in terms of space and time complexity:

 Space complexity: When the number of data points is large, the space required for the
Hierarchical Clustering Technique is large since the similarity matrix must be stored in
RAM. The space complexity is measured by the order of the square of n.

Space complexity = O(n²) where n is the number of data points.

 Time complexity: The time complexity is also very high because we have to execute n
iterations and update and restore the similarity matrix in each iteration. The order of the
cube of n is the time complexity.

Time complexity = O(n³) where n is the number of data points.


Answer 4:

Limitations of Hierarchical Clustering Technique:

1. Hierarchical clustering does not have a mathematical objective function.

2. Every method for calculating cluster similarity has its own set of drawbacks.
3. Hierarchical clustering has a high spatial and temporal complexity. As a result, when we have a
lot of data, we cannot apply this clustering approach.
Check Your Progress - 4
For detailed answers refer to Section 15.5.
Answer 1.
Density Based Clustering is a technique in which the clusters are created based upon the density of the
data points represented in the data space. The regions that become dense due to the huge number of data
points residing in them are considered as clusters.

Answer 2.
Refer to the working steps given in Section 15.5.
Check Your Progress - 5
For detailed answers refer to Section 15.6.
Answer 1.
a) K-Means

K-Means is a clustering algorithm that tries to minimize the distance of the points in a
cluster from their centroid.
K-means is a centroid-based or distance-based technique in which the distances between points
are calculated to allocate a point to a cluster. Each cluster in K-Means is paired with a centroid.
b) AGNES (Agglomerative Nesting)
AGNES is a type of agglomerative clustering that combines the data objects into a cluster based
on similarity. The result of this algorithm is a tree-based structure called Dendrogram. Here it
uses the distance metrics to decide which data points should be combined with which cluster.
Basically, it constructs a distance matrix and checks for the pair of clusters with the smallest
distance and combines then.
c) Divisive Clustering (DIANA)

DIANA basically stands for Divisive Analysis; this is another type of hierarchical clustering
which works on the principle of the top-down approach (the inverse of AGNES), where the
algorithm begins by forming one big cluster and recursively divides the most heterogeneous
cluster into two, until all similar data points belong to their respective clusters.
d) Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
The distance between objects is used to cluster things in most partitioning methods. Only
spherical-shaped clusters can be found using these approaches, and clusters of any shape are
difficult to find. DBSCAN may create clusters of various shapes; this technique is best suited to
datasets with noise or outliers.
e) Fuzzy Clustering

Fuzzy clustering is a widely used clustering technique in the real world; it is a sort of
clustering in which each data point can belong to many clusters. The algorithm allows the
data points to belong to more than one cluster, unlike hard clustering, where data points can
actually belong to only one cluster.

Answer 2.

a) AGNES (AGglomerative NESting)
AGNES is a type of agglomerative clustering that combines the data objects into a cluster based on
similarity. The result of this algorithm is a tree-based structure called a dendrogram. It uses the
distance metrics to decide which data points should be combined with which cluster. Basically, it
constructs a distance matrix, checks for the pair of clusters with the smallest distance and combines
them.

Based on how the distance between each cluster is measured, we can have 3 different methods

 Single linkage: Where the shortest distance between the two points in each cluster is defined as
the distance between the clusters.
 Complete linkage: In this case, we will consider the longest distance between each cluster’s
points as the distance between the clusters.
 Average linkage: In this situation, we'll take the average of each point in one cluster compared to
every other point in the other.

Answer 3.
Fuzzy clustering is a widely used clustering technique in the real world; it is a type of
clustering in which each object can belong to many clusters. The algorithm allows the data points to
belong to more than one cluster, unlike hard clustering, where data points can actually belong to
only one cluster.

Real-World Example:

A Guava can either be Yellow or Green (hard clustering), but a Guava can also be both Yellow and Green
(this is fuzzy clustering). Here, the Guava can be Yellow to a certain degree as well as Green to a certain
degree. Instead of the Guava belonging to Green [Green = 1] and not Yellow [Yellow = 0], the Guava can
belong to Green [Green = 0.5] and Yellow [Yellow = 0.5]. These values are normalized between 0 and 1;
however, they do not represent probabilities, so the two values do not need to add up to 1.

Answer 4:
DIANA is like the reverse of AGNES. It begins with the root, in which all observations are included in a
single cluster. At each step of the algorithm, the current cluster is split into two clusters that are
considered most heterogeneous. The process is iterated until all observations are in their own cluster.

Answer 5:
DBSCAN can form clusters in different shapes; this type of algorithm is most suitable when the dataset
contains noise or outliers.
Also, it depends on a density-based notion of a cluster: a cluster is defined as a maximal set of
density-connected points.
Answer 6:
Refer to the working of the K-Means algorithm in Section 15.6.
MULTIPLE CHOICE QUESTIONS
Q1. The objective of clustering is to-
A. Sort the data points into categories.
B. To classify the objects into different classes
C. To predict the values of input data points and generate output.
D. All of the above

Solution: (A)
Q2. Clustering is a-
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None
Solution:(B)

Q3. Which of the following clustering algorithm is most sensitive to outliers?


A. K-means
B. K-modes
C. K-medians
D. K-medoids

Solution: (A)
Explanation: The K-Means clustering approach, which employs the mean of cluster data points to locate
the cluster center, is the most sensitive to outliers of all the options.

Q4. You saw the dendrogram below after doing K-Means Clustering analysis on a dataset. From
the dendrogram, which of the following conclusions can be drawn?
A. There were 32 data points in clustering analysis
B. The data points analyzed has best number of clusters is 4
C. The proximity function used here is Average-link clustering
D. The above interpretation of a dendrogram is not possible for K-Means clustering analysis
Solution: (D)

Explanation: A dendrogram is not possible for K-Means clustering analysis. However, one can create a
clustergram based on K-Means clustering analysis.
Q5. What are the two types of Hierarchical Clustering?
A. Top-Down Clustering (Divisive)
B. Bottom-Top Clustering (Agglomerative)
C. Dendrogram
D. K-means
Solution: Both A & B

Q6. Is it reasonable to assume the same clustering results from two K-Means clustering runs?
A. Yes
B. No
Solution: (B)

Explanation: The K-Means clustering technique converges to local minima, which may or may not
coincide with the global minimum. As a result, it is a good idea to run the K-Means method several
times before drawing any conclusions about the clusters.
It's worth noting, though, that by using the same seed value for each run, you may get the same clustering
results via K-means. However, this is accomplished by simply instructing the algorithm to select the same
set of random numbers for each iteration.
Q7. Which of the following clustering techniques has a difficulty with local optima convergence?
A. Agglomerative clustering algorithm
B. K- Means clustering algorithm
C. Expectation-Maximization clustering algorithm
D. Divisive clustering algorithm

Options:
A. A only
B. B and C
C. B and D
D. A and D
Solution: (B)
Out of the options given, only K-Means clustering algorithm and EM clustering algorithm has the
drawback of converging at local minima.

Q8. Which of the following is a bad characteristic of a dataset for clustering analysis-?
A. Objects with outliers
B. Objects with different densities
C. Objects with non-convex shapes
D. All of the above
Solution: (D)

Q9. Which of the following statements is incorrect?

A. k-means clustering is a method of vector quantization.
B. k-means clustering groups n observations into k clusters.
C. k-means is the same as k-nearest neighbour.
D. None
Solution: (C)

Q10. What do you understand by dendrogram?


A. A hierarchical structure
B. A diagrammatical view
C. A graph
D. None
Solution: (A)
UNIT 16 MACHINE LEARNING –
PROGRAMMING USING PYTHON
Structure
16.0 Introduction
16.1 Objectives
16.2 Classification Algorithms
16.2.1 Naïve Bayes
16.2.2 K-Nearest Neighbour (K-NN)
16.2.3 Decision Trees
16.2.4 Logistic Regression
16.2.5 Support Vector Machines
16.3 Regression Algorithms
16.3.1 Linear Regression
16.3.2 Polynomial Regression
16.4 Feature Selection and Extraction
16.4.1 Principal Component Analysis
16.5 Association Rules
16.5.1 Apriori Algorithm
16.6 Clustering Algorithms
16.6.1 K-Means
16.7 Summary
16.8 Solutions/Answers
16.9 Further Readings

16.0 INTRODUCTION
In this unit we will see the implementation of various machine learning algorithms
learned in this course. To understand the code you need an understanding of the
respective machine learning algorithms; along with that, an understanding of Python
programming is a must. The programs use various libraries of the Python
programming language, viz. Scikit-Learn, Matplotlib, NumPy etc.; you can execute
them through any of the Python programming tools. Most of the machine learning
algorithms you learned in this course are implemented here; just try to execute
them and analyse the results.

16.1 OBJECTIVES

After going through this unit, you should be able to:

● Understand the implementation aspect of various machine learning algorithms

16.2 CLASSIFICATION ALGORITHMS

The starting units of this course primarily focused on the various classification
algorithms, viz. Naïve Bayes classifiers, K-Nearest Neighbour (K-NN), Decision
Trees, Logistic Regression and Support Vector Machines. The theoretical aspects of
the same are already discussed in the respective units; now we will see the
implementation part of the mentioned classifiers in the Python programming language.

16.2.1 NAIVE BAYES


It is a method of classification that is founded on Bayes' Theorem and
makes the assumption that predictors act independently of
one another. A Naive Bayes classifier, to put it in layman's words,
makes the assumption that the presence of one particular characteristic
in a class is unrelated to the presence of any other feature.
We have already discussed this classifier in detail in Block 3 Unit 10
of this course; you may refer to it to understand the concept.
The following procedures need to be carried out in order to classify
data using the Naive Bayes method.

 In the first step, we begin by importing the dataset as well
as any necessary dependencies.
 The second step is to compute the prior probability P(y) of
each class.
 The third step is to determine the likelihood of each
characteristic from the frequency table of the training data.
 The fourth and final step is to calculate the posterior
probability for each class by applying the Naive Bayes
equation.
Implementation code in Python
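
A minimal sketch consistent with the output shown below, assuming the Iris dataset
and scikit-learn's GaussianNB (the dataset choice and the split parameters are
assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

# Load the Iris dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# Fit a Gaussian Naive Bayes model and predict the test set
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

print("Gaussian Naive Bayes model accuracy(in %):",
      metrics.accuracy_score(y_test, y_pred) * 100)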

OUTPUT :

Gaussian Naive Bayes model accuracy(in %): 95.0

16.2.2 K-Nearest Neighbour (K-NN)

You have already discussed this classifier in detail in Block 3 Unit 10 of this
course; you may refer to it to understand the concept.
We learned that if the value of K is 3, the KNN algorithm starts by
calculating the distance of a point X from all the points, and then finds the
3 nearest points with the least distance to point X.

In the example shown below the following steps are performed:

• Step 1: import the k-nearest neighbour algorithm from the scikit-learn package.
• Step 2: create the feature variables and the target variables.
• Step 3: separate the data into test data and training data.
• Step 4: generate a k-NN model using a neighbours value.
• Step 5: train the model on the data, or adjust the model based on the data.
• Step 6: make a prediction.

Now, in this section, we will see how Python's Scikit-Learn library can be
used to implement the KNN algorithm.
Implementation code in Python
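
A minimal sketch following the six steps above, assuming the Iris dataset and
scikit-learn's KNeighborsClassifier (the dataset and parameter choices are
assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Steps 2-3: feature/target variables and the train/test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Steps 4-5: build a k-NN model with K = 3 and train it
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Step 6: make a prediction and evaluate it
y_pred = knn.predict(X_test)
print("K-NN accuracy:", accuracy_score(y_test, y_pred))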

16.2.3 Decision Tree Implementation
A decision tree is a type of supervised machine learning algorithm that may be
used for both regression and classification tasks. It is one of the most popular
and widely used machine learning techniques.
The decision tree method creates a node for each attribute present in the
dataset, with the attribute that is considered to be the most significant being
placed at the top of the tree. When we first get started, we treat the entire
training set as the root. The feature values should be categorical; if they
are continuous, they are discretized before the model is built. Records are
then distributed recursively according to their attribute values, and a
statistical method is utilised to determine which attributes should be placed
at the tree's root and which should be placed at internal nodes.
You have already discussed this classifier in detail in Block 3 Unit 10 of this
course; you may refer to it to understand the concept.
Implementation code in Python
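
A minimal sketch, assuming the Iris dataset and scikit-learn's
DecisionTreeClassifier (the dataset and the hyper-parameters are assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the data and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit a decision tree; "entropy" corresponds to information-gain based splits
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))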

16.2.4 Logistic Regression

Logistic Regression (LR) is a classification algorithm that is used in machine
learning to predict the likelihood of a categorical dependent variable. The
dependent variable in logistic regression is a binary variable, which means
that it comprises data recorded as either 1 (yes, success, etc.) or 0 (no,
failure, etc.).
It should be noted that the Naive Bayes model is a generative model, whereas
the LR model is a discriminative model. LR performs better than Naive Bayes in
the presence of collinearity, because Naive Bayes expects all of the features
to be independent while LR does not. Naive Bayes works well with small
datasets.
You have already discussed this classifier in detail in Block 3 Unit 10 of this
course; you may refer to it to understand the concept.
Implementation code in Python
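
A minimal sketch, assuming the breast cancer dataset bundled with scikit-learn and
its LogisticRegression class (the dataset and parameters are assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Binary classification data: tumours labelled 0 or 1
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the logistic regression model (max_iter raised so the solver converges)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred))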

16.2.5 Support Vector Machine
Support Vector Machine, usually referred to as SVM, is a supervised, linear
machine learning technique that is most frequently utilised for addressing
classification problems; in that role it is also called Support Vector
Classification. In addition, there is a variant of SVM known as SVR, which
stands for Support Vector Regression and applies similar concepts to
regression problems. SVM also offers a method known as the kernel method
(kernel SVM), which enables us to deal with non-linearity.
The following are the steps involved in the implementation:

 Import the libraries.
 Load the dataset.
 Divide the dataset into X and Y.
 Create a training set and a test set from the X and Y datasets.
 Scale the features.
 Fit the SVM to the training set.
 Predict the test set results.
 Construct the confusion matrix.

You have already discussed this classifier in detail in Block 3 Unit 10 of this
course; you may refer to it to understand the concept.
Implementation code in Python
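
A minimal sketch of the steps listed above, assuming the Iris dataset and
scikit-learn's SVC with a linear kernel (the dataset, kernel and split are
assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

# Load and split the data into X and Y, training set and test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit the SVM classifier to the training set
clf = SVC(kernel="linear", random_state=0)
clf.fit(X_train, y_train)

# Predict the test set results and construct the confusion matrix
y_pred = clf.predict(X_test)
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))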


Check Your Progress - 1

1. Make Suitable assumptions and modify the python code of following


Classification algorithms:
a. K-NN
b. Decision Tree
c. Logistic Regression
d. Support Vector Machines

16.3 REGRESSION ALGORITHMS

We learned about the basic concepts of regression in the respective unit of this course;
here we will implement linear regression and polynomial regression in Python.
Let's start with linear regression.

16.3.1 Linear Regression


The purpose of a linear regression model is to determine whether or not there
is a connection between one or more features (also known as independent
variables) and a continuous target variable (the dependent variable). Linear
regression is referred to as univariate linear regression when there is only
one feature, and as multiple linear regression when there are multiple
features.
The following are the stages involved in the implementation of a linear
regression model:

 First, initialise the parameters.
 Given the value of an independent variable, predict the value of the
dependent variable.
 Determine the error of the prediction for each data point.
 Calculate the partial derivatives with respect to a0 and a1.
 Add up the individual costs determined for each of the data points.

You have already discussed this technique in detail in Block 3 Unit 11 of this
course; you may refer to it to understand the concept.
Implementation code in Python
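
A minimal sketch, assuming small made-up one-dimensional data and scikit-learn's
LinearRegression (the data is an illustrative assumption):

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y is roughly 2x + 1 with a little noise
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8, 13.2])

model = LinearRegression().fit(X, y)

print("Slope (a1):", model.coef_[0])
print("Intercept (a0):", model.intercept_)
print("Prediction for x = 7:", model.predict([[7]])[0])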


16.3.2 Polynomial Regression


Polynomial regression is a type of linear regression in which the relationship
between the independent variable x and the dependent variable y is modelled
as an nth degree polynomial; this type of regression can be seen as "extended"
linear regression. Polynomial regression is used to model a nonlinear
relationship between the value of an independent variable x and the
conditional mean of a dependent variable y, denoted E(y | x). Because the
relationship between the dependent and independent variables is non-linear,
we simply add some polynomial terms to linear regression to convert it into
polynomial regression.
You have already discussed this technique in detail in Block 3 Unit 11 of this
course; you may refer to it to understand the concept.
Implementation code in Python
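
A minimal sketch, assuming made-up data following a quadratic trend and
scikit-learn's PolynomialFeatures with LinearRegression (the data and the degree
are assumptions):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Made-up data following roughly y = x^2
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.1, 3.8, 9.2, 15.9, 25.3, 35.8])

# Expand x into [1, x, x^2], then fit an ordinary linear model on those features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)

print("Coefficients:", model.coef_)
print("Prediction for x = 7:", model.predict(poly.transform([[7]]))[0])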


Check Your Progress - 2

2. Make Suitable assumptions and modify the python code of following


Regression algorithms:
a. Linear regression
b. Polynomial regression

16.4 FEATURE SELECTION AND EXTRACTION

Feature selection and extraction are among the most important steps that must be
performed in order for machine learning to be successful. We covered the
theoretical aspects of this process in the earlier units of this course; it is now
time to understand the implementation part of the mechanisms that we have learned
for feature selection and extraction. Let's begin with dimensionality reduction,
which is the process of lowering the number of random variables under consideration
by deriving a set of principal variables. Dimensionality reduction may be seen as a
way to streamline the analysis process.

16.4.1 Principal Component Analysis (PCA)


You have already discussed this technique in detail in Block 4 Unit 13 of this
course; you may refer to it to understand the concept.
Among the various techniques, Principal Component Analysis (PCA) is the
most frequently used, and its implementation is given below:
Implementation code in Python
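
A minimal sketch, assuming the Iris dataset (4 features) reduced to 2 principal
components with scikit-learn's PCA (the dataset and the component count are
assumptions):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardise the 4 Iris features, then project them onto 2 principal components
X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Shape before:", X.shape, "after:", X_pca.shape)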


Check Your Progress - 3

3. Make Suitable assumptions and modify the python code of Principal


Component Analysis, for dimensionality reduction.

16.5 ASSOCIATION RULES

We discussed the Apriori algorithm and the FP-Growth algorithm while studying the topic of
Association Rules. These algorithms are frequently used in pattern matching. Since FP-Growth
is a step ahead of the Apriori algorithm, we are discussing the implementation of the Apriori
algorithm only.
16.5.1 Apriori Algorithm
The Apriori algorithm is a data mining technique that is used for mining
frequent itemsets and the corresponding association rules. We discussed the
definitions of association rule mining and the Apriori algorithm, as well as
its applications, in the relevant unit of this course. In this section, we
will construct one Apriori model using the Python programming language and a
hypothetical situation involving a small firm. The algorithm does have some
limitations, the effects of which can be mitigated using a variety of
different approaches; even so, the method sees widespread use in data mining
and pattern recognition.
The model described below produces the candidate set by merging the set of
frequent items from the previous step, tests the subsets and removes the
candidates that contain infrequent itemsets, and then calculates the final
frequent itemset from the items that meet the minimum support requirement.
You have already discussed this algorithm in detail in Block 4 Unit 14 of this
course; you may refer to it to understand the concept.
Implementation code in Python
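
A minimal sketch, assuming the mlxtend package and a hypothetical transaction list
for a small firm (the data, min_support and min_threshold values are assumptions):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions of a small firm
transactions = [
    ["Pen", "Notebook", "Colours"],
    ["Pen", "Colours", "Eraser"],
    ["Pen", "Notebook"],
    ["Notebook", "Colours", "Eraser"],
    ["Pen", "Colours"],
]

# One-hot encode the transactions and mine itemsets with minimum support 50%
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)
frequent = apriori(df, min_support=0.5, use_colnames=True)

# Generate association rules with minimum confidence 70%
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])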

16.6 CLUSTERING ALGORITHMS

We learned about the theoretical aspects of various clustering algorithms like K-Means,
DBSCAN etc. in the respective unit of this course. The K-Means algorithm is quite simple,
and hence its implementation is given below:

16.6.1 K-Means - Implementation code in Python

You have already discussed this algorithm in detail in Block 4 Unit 15 of this
course; you may refer to it to understand the concept.
Implementation code in Python
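
A minimal sketch, assuming made-up 2-D data and scikit-learn's KMeans with k = 2
(the data and the value of k are assumptions):

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two loose groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Labels:", kmeans.labels_)
print("Centroids:")
print(kmeans.cluster_centers_)
print("Cluster for a new point [0, 0]:", kmeans.predict([[0.0, 0.0]])[0])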

Check Your Progress - 4

4. Make Suitable assumptions and modify the python code of K-Means


algorithm

16.7 SUMMARY

In this unit we understood the implementation of various machine learning algorithms
for classification, regression, dimensionality reduction and clustering. The
theoretical aspects of each algorithm were already discussed in the respective
units of this course.
16.8 SOLUTIONS/ANSWERS

Check Your Progress - 1

1. Make Suitable assumptions and modify the python code of following


Classification algorithms:
a. K-NN
b. Decision Tree
c. Logistic Regression
d. Support Vector Machines
Solution : Refer to section 16.2

Check Your Progress - 2

2. Make Suitable assumptions and modify the python code of following


Regression algorithms:
a. Linear regression
b. Polynomial regression
Solution : Refer to section 16.3

Check Your Progress - 3

3. Make Suitable assumptions and modify the python code of Principal


Component Analysis, for dimensionality reduction.
Solution : Refer to section 16.4

Check Your Progress - 4

4. Make Suitable assumptions and modify the python code of K-Means


algorithm
Solution : Refer to section 16.6

16.9 FURTHER READINGS

● https://www.kaggle.com/
● https://www.github.com/
● https://towardsdatascience.com
● https://machinelearningmastery.com

