
Semester : V

Subject : Artificial Intelligence and Machine Learning


Subject Code : 21CS54

Compiled by:
Dr. Girijamma H A, Dr. Sudhamani M J, Meenakshi S J
Department of CSE RNSIT
VISION AND MISSION OF INSTITUTION

Vision
Building RNSIT into a World Class Institution
Mission
To impart high quality education in Engineering, Technology and Management
with a Difference, Enabling Students to Excel in their Career by
1. Attracting quality Students and preparing them with a strong foundation in fundamentals
so as to achieve distinctions in various walks of life leading to outstanding contributions
2. Imparting value based, need based, choice based and skill based professional education to
the aspiring youth and carving them into disciplined, World class Professionals with social
responsibility
3. Promoting excellence in Teaching, Research and Consultancy that galvanizes academic
consciousness among Faculty and Students
4. Exposing Students to emerging frontiers of knowledge in various domains and make them
suitable for Industry, Entrepreneurship, Higher studies, and Research & Development
5. Providing freedom of action and choice for all the Stake holders with better visibility
VISION AND MISSION OF DEPARTMENT

Vision

Preparing Better Computer Professionals for a Real World

Mission

The Department of CSE will make every effort to promote an intellectual and ethical
environment by
1. Imparting solid foundations and applied aspects in both Computer Science Theory and
Programming practices
2. Providing training and encouraging R&D and Consultancy Services in frontier areas of
Computer Science and Engineering with a Global outlook
3. Fostering the highest ideals of ethics, values and creating awareness of the role of
Computing in Global Environment
4. Educating and preparing the graduates, highly sought after, productive, and well-
respected for their work culture
5. Supporting and inducing lifelong learning
V Semester

ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING


Course Code: 21CS54                          CIE Marks: 50
Teaching Hours/Week (L:T:P:S): 3:0:0:0       SEE Marks: 50
Total Hours of Pedagogy: 40                  Total Marks: 100
Credits: 03                                  Exam Hours: 03
Course Learning Objectives
CLO 1. Gain a historical perspective of AI and its foundations
CLO 2. Become familiar with basic principles of AI toward problem solving
CLO 3. Familiarize with the basics of Machine Learning & Machine Learning process, basics of
Decision Tree, and probability learning
CLO 4. Understand the working of Artificial Neural Networks and basic concepts of clustering
algorithms
Teaching-Learning Process (General Instructions)

These are sample Strategies, which teachers can use to accelerate the attainment of the various course
outcomes.
1. Lecturer method (L) need not be only a traditional lecture method; alternative
effective teaching methods could be adopted to attain the outcomes.
2. Use of Video/Animation to explain functioning of various concepts.
3. Encourage collaborative (Group Learning) Learning in the class.
4. Ask at least three HOT (Higher order Thinking) questions in the class, which promotes
critical thinking.
5. Adopt Problem Based Learning (PBL), which fosters students’ analytical skills and develops
design-thinking skills such as the ability to design, evaluate, generalize, and analyze
information rather than simply recall it.
6. Introduce Topics in manifold representations.
7. Show the different ways to solve the same problem with different logic and encourage the
students to come up with their own creative ways to solve them.
8. Discuss how every concept can be applied to the real world - and when that's possible, it
helps improve the students' understanding.
Module-1
Introduction: What is AI? Foundations and History of AI

Problem‐solving: Problem‐solving agents, Example problems, Searching for Solutions, Uninformed


Search Strategies: Breadth First search, Depth First Search,

Textbook 1: Chapter 1- 1.1, 1.2, 1.3


Textbook 1: Chapter 3- 3.1, 3.2, 3.3, 3.4.1, 3.4.3

Teaching-Learning Process Chalk and board, Active Learning. Problem based learning
Module-2
Informed Search Strategies: Greedy best-first search, A*search, Heuristic functions.
Introduction to Machine Learning, Understanding Data

Textbook 1: Chapter 3 - 3.5, 3.5.1, 3.5.2, 3.6


Textbook 2: Chapter 1 and 2

Teaching-Learning Process Chalk and board, Active Learning, Demonstration


Module-3
Basics of Learning theory
Similarity Based Learning
Regression Analysis
Textbook 2: Chapter 3 - 3.1 to 3.4, Chapter 4, Chapter 5 - 5.1 to 5.4

Teaching-Learning Process Chalk and board, Problem based learning, Demonstration


Module-4
Decision Tree learning
Bayesian Learning

Textbook 2: Chapter 6 and 8

Teaching-Learning Process Chalk and board, Problem based learning, Demonstration


Module-5
Artificial neural Network
Clustering Algorithms

Textbook 2: Chapter 10 and 13

Teaching-Learning Process Chalk and board, Active Learning.


Course Outcomes (Course Skill Set)
At the end of the course the student will be able to:
CO 1. Apply the knowledge of searching and reasoning techniques for different applications.
CO 2. Have a good understanding of machine learning in relation to other fields, and of the fundamental
issues and challenges of machine learning.
CO 3. Apply the knowledge of classification algorithms on various datasets and compare results
CO 4. Model the neuron and Neural Network, and to analyze ANN learning and its applications.
CO 5. Identify the suitable clustering algorithm for different patterns

Assessment Details (both CIE and SEE)

The weightage of Continuous Internal Evaluation (CIE) is 50% and for Semester End Exam (SEE) is 50%.
The minimum passing mark for the CIE is 40% of the maximum marks (20 marks). A student shall be
deemed to have satisfied the academic requirements and earned the credits allotted to each subject/
course if the student secures not less than 35% (18 Marks out of 50) in the semester-end examination
(SEE), and a minimum of 40% (40 marks out of 100) in the sum total of the CIE (Continuous Internal
Evaluation) and SEE (Semester End Examination) taken together

Continuous Internal Evaluation:

Three Unit Tests each of 20 Marks (duration 01 hour)

1. First test at the end of 5th week of the semester


2. Second test at the end of the 10th week of the semester
3. Third test at the end of the 15th week of the semester
Two assignments each of 10 Marks

4. First assignment at the end of 4th week of the semester


5. Second assignment at the end of 9th week of the semester
Group discussion/Seminar/Quiz: any one of the three, suitably planned to attain the COs and POs, for 20
Marks (duration 01 hour), OR suitable programming experiments based on the syllabus contents
can be given to the students to submit as laboratory work (for example: implementation of
concept learning, implementation of the decision tree learning algorithm for a suitable data set, etc.)

6. At the end of the 13th week of the semester


The sum of three tests, two assignments, and quiz/seminar/group discussion will be out of 100 marks
and will be scaled down to 50 marks
(To have a less stressful CIE, the portion of the syllabus should not be common/repeated for any of the
methods of the CIE. Each method of CIE should cover a different portion of the course syllabus.)

CIE methods /question paper has to be designed to attain the different levels of Bloom’s
taxonomy as per the outcome defined for the course.

Semester End Examination:

Theory SEE will be conducted by University as per the scheduled timetable, with common question
papers for the subject (duration 03 hours)

1. The question paper will have ten questions. Each question is set for 20 marks. Marks scored
shall be proportionally reduced to 50 marks.
2. There will be 2 questions from each module. Each of the two questions under a module (with a
maximum of 3 sub-questions), should have a mix of topics under that module.
The students have to answer 5 full questions, selecting one full question from each module.
Suggested Learning Resources:
Textbooks
1. Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd Edition, Pearson, 2015
2. S. Sridhar and M. Vijayalakshmi, Machine Learning, Oxford, 2021
Reference:
1. Elaine Rich and Kevin Knight, Artificial Intelligence, 3rd Edition, Tata McGraw Hill, 2013
2. George F. Luger, Artificial Intelligence: Structures and Strategies for Complex Problem Solving,
5th Edition, Pearson Education, 2011
3. Tom Mitchell, Machine Learning, McGraw Hill
Weblinks and Video Lectures (e-Resources):
1. https://www.kdnuggets.com/2019/11/10-free-must-read-books-ai.html
2. https://www.udacity.com/course/knowledge-based-ai-cognitive-systems--ud409
3. https://nptel.ac.in/courses/106/105/106105077/
4. https://www.javatpoint.com/history-of-artificial-intelligence
5. https://www.tutorialandexample.com/problem-solving-in-artificial-intelligence
6. https://techvidvan.com/tutorials/ai-heuristic-search/
7. https://www.analyticsvidhya.com/machine-learning/
8. https://www.javatpoint.com/decision-tree-induction
9. https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/ml-
decision-tree/tutorial/
10. https://www.javatpoint.com/unsupervised-artificial-neural-networks
Activity Based Learning (Suggested Activities in Class)/ Practical Based learning

Role play for search strategies (DFS & BFS); outlier detection in banking and insurance transactions for
identifying fraudulent behaviour, etc.; uncertainty and reasoning problems, e.g., the reliability of a sensor
used to detect pedestrians, using Bayes’ rule
COURSE OUTCOMES:
At the end of this course, students are able to:
CO1 Apply the knowledge of searching and reasoning techniques for different applications.
CO2 Analyze issues and challenges in machine learning within a broader interdisciplinary context
CO3 Apply the knowledge of classification algorithms on various datasets and compare results
CO4 Model the neuron and Neural Network, and to analyze ANN learning and its applications.
CO5 Identify the suitable clustering algorithm for different patterns

CO-PO MATRIX
COURSE OUTCOMES   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3  PSO4
CO1 3 2 2 2 1 1 1 1 2 3 3 3
CO2 3 3 3 3 1 1 1 1 2 3 3 3
CO3 3 3 3 3 1 1 1 1 2 3 3 3
CO4 3 3 3 3 1 1 1 1 2 3 3 3
CO5 3 3 3 3 1 1 1 1 2 3 3 3

Module 1
Introduction to Artificial Intelligence

AI is one of the newest fields in science and engineering. Work started in earnest soon after World War
II, and the name itself was coined in 1956. AI currently encompasses a huge variety of subfields, ranging
from the general (learning and perception) to the specific, such as playing chess, proving mathematical
theorems, writing poetry, driving a car on a crowded street, and diagnosing diseases. AI is relevant to
any intellectual task; it is truly a universal field.

1.1 WHAT IS AI?


In Figure 1.1 we see eight definitions of AI, laid out along two dimensions. The definitions on top are
concerned with thought processes and reasoning, whereas the ones on the bottom address behavior.
The definitions on the left measure success in terms of fidelity to human performance, whereas the
ones on the right measure against an ideal performance measure, called rationality. A system is rational
if it does the “right thing,” given what it knows. Historically, all four approaches to AI have been followed,
each by different people with different methods. A human-centered approach must be in part an
empirical science, involving observations and hypotheses about human behavior. A rationalist approach
involves a combination of mathematics and engineering.

1.1.1 Acting humanly: The Turing Test approach

The Turing Test, proposed by Alan Turing (1950), was designed to provide a satisfactory operational
definition of intelligence. A computer passes the test if a human interrogator, after posing some written
questions, cannot tell whether the written responses come from a person or from a computer. The
computer would need to possess the following capabilities:
• natural language processing to enable it to communicate successfully in English;
• knowledge representation to store what it knows or hears;
• automated reasoning to use the stored information to answer questions and to draw new conclusions;
• machine learning to adapt to new circumstances and to detect and extrapolate patterns.

Turing’s test deliberately avoided direct physical interaction between the interrogator and the
computer, because physical simulation of a person is unnecessary for intelligence. However, the so-
called total Turing Test includes a video signal so that the interrogator can test the subject’s perceptual
abilities, as well as the opportunity for the interrogator to pass physical objects “through the hatch.” To
pass the total Turing Test, the computer will need
• computer vision to perceive objects, and
• robotics to manipulate objects and move about.

1.1.2 Thinking humanly: The cognitive modelling approach

If we are going to say that a given program thinks like a human, we must have some way of
determining how humans think. We need to get inside the actual workings of human minds. There are
three ways to do this: through introspection—trying to catch our own thoughts as they go by; through
psychological experiments—observing a person in action; and through brain imaging—observing the
brain in action. Once we have a sufficiently precise theory of the mind, it becomes possible to express
the theory as a computer program. If the program’s input–output behavior matches corresponding
human behavior, that is evidence that some of the program’s mechanisms could also be operating in
humans.

1.1.3 Thinking rationally: The “laws of thought” approach

The Greek philosopher Aristotle was one of the first to attempt to codify “right thinking,” that is, certain
reasoning processes. His syllogisms provided patterns for argument structures that always yielded
correct conclusions when given correct premises. Logicians in the 19th century developed a precise
notation for statements about all kinds of objects in the world and the relations among them. The so-
called logicist tradition within artificial intelligence hopes to build on such programs to create
intelligent systems. There are two main obstacles to this approach. First, it is not easy to take informal
knowledge and state it in the formal terms required by logical notation, particularly when the
knowledge is less than 100% certain. Second, there is a big difference between solving a problem “in
principle” and solving it in practice.

1.1.4 Acting rationally: The rational agent approach

An agent is just something that acts (agent comes from the Latin agere, to do). Of course, all computer
programs do something, but computer agents are expected to do more: operate autonomously, perceive
their environment, persist over a prolonged time period, adapt to change, and create and pursue goals.
A rational agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best
expected outcome. In the “laws of thought” approach to AI, the emphasis was on correct inferences.
Making correct inferences is sometimes part of being a rational agent, because one way to act rationally
is to reason logically to the conclusion that a given action will achieve one’s goals and then to act on that
conclusion.
The rational-agent approach has two advantages over the other approaches. First, it is more general
than the “laws of thought” approach because correct inference is just one of several possible
mechanisms for achieving rationality. Second, it is more amenable to scientific development than are
approaches based on human behavior or human thought. The standard of rationality is mathematically
well defined and completely general, and can be “unpacked” to generate agent designs that provably
achieve it.

1.2 THE FOUNDATIONS OF ARTIFICIAL INTELLIGENCE

In this section, we provide a brief history of the disciplines that contributed ideas, viewpoints, and
techniques to AI.

1.2.1 Philosophy
Aristotle (384–322 B.C.), was the first to formulate a precise set of laws governing the rational part of
the mind. He developed an informal system of syllogisms for proper reasoning, which in principle
allowed one to generate conclusions mechanically, given initial premises. Much later, Ramon Lull (d. 1315)
had the idea that useful reasoning could actually be carried out by a mechanical artifact. Thomas
Hobbes (1588–1679) proposed that reasoning was like numerical computation. Around 1500,
Leonardo da Vinci (1452–1519) designed but did not build a mechanical calculator. Gottfried Wilhelm
Leibniz (1646–1716) built a mechanical device intended to carry out operations on concepts rather
than numbers, but its scope was rather limited. René Descartes (1596–1650) gave the first clear
discussion of the distinction between mind and matter and of the problems that arise. Given a physical
mind that manipulates knowledge, the next problem is to establish the source of knowledge. The
empiricism movement, starting with Francis Bacon’s (1561–1626) Novum Organum, is characterized
by a dictum of John Locke (1632–1704): “Nothing is in the understanding, which was
not first in the senses.” David Hume’s (1711–1776) A Treatise of Human Nature (Hume, 1739)
proposed what is now known as the principle of induction: that general rules are acquired by exposure
to repeated associations between their elements. Building on the work of Ludwig Wittgenstein (1889–
1951) and Bertrand Russell (1872–1970), the famous Vienna Circle, led by Rudolf Carnap (1891–1970),
developed the doctrine of logical positivism. The confirmation theory of Carnap and Carl Hempel
(1905–1997) attempted to analyze the acquisition of knowledge from experience.
The final element in the philosophical picture of the mind is the connection between knowledge and
action. This question is vital to AI because intelligence requires action as well as reasoning. Moreover,
only by understanding how actions are justified can we understand how to build an agent whose
actions are justifiable (or rational).

1.2.2 Mathematics

Philosophers staked out some of the fundamental ideas of AI, but the leap to a formal science required
a level of mathematical formalization in three fundamental areas: logic, computation, and probability.
The idea of formal logic can be traced back to the philosophers of ancient Greece, but its mathematical
development really began with the work of George Boole (1815–1864), who worked out the details of
propositional, or Boolean, logic (Boole, 1847). In 1879, Gottlob Frege (1848–1925) extended Boole’s
logic to include objects and relations, creating the first-order logic that is used today. Alfred Tarski
(1902–1983) introduced a theory of reference that shows how to relate the objects in a logic to objects
in the real world.
The next step was to determine the limits of what could be done with logic and computation.
The first nontrivial algorithm is thought to be Euclid’s algorithm for computing greatest common
divisors. The word algorithm (and the idea of studying them) comes from al-Khowarazmi, a Persian
mathematician of the 9th century, whose writings also introduced Arabic numerals and algebra to
Europe. Boole and others discussed algorithms for logical deduction, and, by the late 19th century,
efforts were under way to formalize general mathematical reasoning as logical deduction. In 1930, Kurt
Gödel (1906–1978) showed that there exists an effective procedure to prove any true statement in the
first-order logic of Frege and Russell, but that first-order logic could not capture the principle of
mathematical induction needed to characterize the natural numbers.
Besides logic and computation, the third great contribution of mathematics to AI is the theory
of probability. The Italian Gerolamo Cardano (1501–1576) first framed the idea of probability,
describing it in terms of the possible outcomes of gambling events. In 1654, Blaise Pascal (1623–1662),
in a letter to Pierre Fermat (1601–1665), showed how to predict the future of an unfinished gambling
game and assign average payoffs to the gamblers. Probability quickly became an invaluable part of all
the quantitative sciences, helping to deal with uncertain measurements and incomplete theories. James
Bernoulli (1654–1705), Pierre Laplace (1749–1827), and others advanced the theory and introduced
new statistical methods. Thomas Bayes (1702–1761), proposed a rule for updating probabilities in the
light of new evidence. Bayes’ rule underlies most modern approaches to uncertain reasoning in AI
systems.
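As a brief added illustration (not part of the textbook excerpt): Bayes’ rule computes the probability of a
hypothesis h after seeing evidence e from quantities that are often easier to estimate directly:

P(h | e) = P(e | h) · P(h) / P(e)

For instance, in the pedestrian-sensor activity suggested later in this material, h could be “a pedestrian is
present” and e “the sensor fires”; the rule combines the sensor’s hit rate P(e | h) with the prior P(h) to
yield the updated belief P(h | e).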

1.2.3 Economics
The science of economics got its start in 1776, when Scottish philosopher Adam Smith (1723–1790)
published An Inquiry into the Nature and Causes of the Wealth of Nations. While the ancient Greeks and
others had made contributions to economic thought, Smith was the first to treat it as a science, using
the idea that economies can be thought of as consisting of individual agents maximizing their own
economic well-being. Most people think of economics as being about money, but economists will say
that they are really studying how people make choices that lead to preferred outcomes.
Decision theory, which combines probability theory with utility theory, provides a formal and complete
framework for decisions (economic or otherwise) made under uncertainty

1.2.4 Neuroscience

How do brains process information? Neuroscience is the study of the nervous system, particularly the
brain. Although the exact way in which the brain enables thought is one of the great mysteries of
science, the fact that it does enable thought has been appreciated for thousands of years because of the
evidence that strong blows to the head can lead to mental incapacitation. Nicolas Rashevsky (1936,
1938) was the first to apply mathematical models to the study of the nervous system.

We now have some data on the mapping between areas of the brain and the parts of the body that they
control or from which they receive sensory input. Such mappings are able to change radically over the
course of a few weeks, and some animals seem to have multiple maps. Moreover, we do not fully
understand how other areas can take over functions when one area is damaged. There is almost no
theory on how an individual memory is stored. The measurement of intact brain activity began in 1929
with the invention by Hans Berger of the electroencephalograph (EEG). The recent development of
functional magnetic resonance imaging (fMRI) (Ogawa et al., 1990; Cabeza and Nyberg, 2001) is giving
neuroscientists unprecedentedly detailed images of brain activity, enabling measurements that
correspond in interesting ways to ongoing cognitive processes. These are augmented by advances in
single-cell recording of neuron activity. Individual neurons can be stimulated electrically, chemically, or
even optically (Han and Boyden, 2007), allowing neuronal input– output relationships to be mapped.
Despite these advances, we are still a long way from understanding how cognitive processes actually
work. The truly amazing conclusion is that a collection of simple cells can lead to thought, action, and
consciousness or, in the pithy words of John Searle (1992), brains cause minds.

1.2.5 Psychology
How do humans and animals think and act? The origins of scientific psychology are usually traced to
the work of the German physicist Hermann von Helmholtz (1821–1894) and his student Wilhelm
Wundt (1832–1920). Helmholtz applied the scientific method to the study of human vision, and his
Handbook of Physiological Optics is even now described as “the single most important treatise on the
physics and physiology of human vision” (Nalwa, 1993, p.15).

1.2.6 Computer engineering


How can we build an efficient computer?
For artificial intelligence to succeed, we need two things: intelligence and an artifact. The computer has
been the artifact of choice. The modern digital electronic computer was invented independently and
almost simultaneously by scientists in three countries embattled in World War II. The first operational
computer was the electro-mechanical Heath Robinson, built in 1940 by Alan Turing’s team for a single
purpose: deciphering German messages.
Each generation of computer hardware has brought an increase in speed and capacity and a decrease in
price. Performance doubled every 18 months or so until around 2005, when power dissipation
problems led manufacturers to start multiplying the number of CPU cores rather than the clock speed.

1.2.7 Control theory and cybernetics


How can artifacts operate under their own control?
Ktesibios of Alexandria (c. 250 B.C.) built the first self-controlling machine: a water clock with a
regulator that maintained a constant flow rate. This invention changed the definition of what an artifact
could do. Previously, only living things could modify their behavior in response to changes in the
environment.
Modern control theory, especially the branch known as stochastic optimal control, has as its
goal the design of systems that maximize an objective function over time. This roughly matches our
view of AI: designing systems that behave optimally. Why, then, are AI and control theory two different
fields, despite the close connections among their founders? The answer lies in the close coupling
between the mathematical techniques that were familiar to the participants and the corresponding sets
of problems that were encompassed in each world view. Calculus and matrix algebra, the tools of
control theory, lend themselves to systems that are describable by fixed sets of continuous variables,
whereas AI was founded in part as a way to escape from these perceived limitations. The tools of logical
inference and computation allowed AI researchers to consider problems such as language, vision, and
planning that fell completely outside the control theorist’s purview.

1.2.8 Linguistics
How does language relate to thought? In 1957, B. F. Skinner published Verbal Behavior. This was a
comprehensive, detailed account of the behaviorist approach to language learning, written by the
foremost expert in the field. But curiously, a review of the book became as well known as the book itself,
and served to almost kill off interest in behaviorism.

Modern linguistics and AI, then, were “born” at about the same time, and grew up together, intersecting
in a hybrid field called computational linguistics or natural language processing. The problem of
understanding language soon turned out to be considerably more complex than it seemed in 1957.
Understanding language requires an understanding of the subject matter and context, not just an
understanding of the structure of sentences. This might seem obvious, but it was not widely
appreciated until the 1960s. Much of the early work in knowledge representation (the study of how to
put knowledge into a form that a computer can reason with) was tied to language and informed by
research in linguistics, which was connected in turn to decades of work on the philosophical analysis of
language.

1.3 THE HISTORY OF ARTIFICIAL INTELLIGENCE

 The gestation of artificial intelligence (1943–1955)


 The birth of artificial intelligence (1956)
 Early enthusiasm, great expectations (1952–1969)
 A dose of reality (1966–1973)
 Knowledge-based systems: The key to power? (1969–1979)
 AI becomes an industry (1980–present)
 The return of neural networks (1986–present)
 AI adopts the scientific method (1987–present)
 The emergence of intelligent agents (1995–present)
 The availability of very large data sets (2001–present)

3.1 Problem solving agents


Intelligent agents are supposed to maximize their performance measure. Goal formulation, based on
the current situation and the agent’s performance measure, is the first step in problem solving. The
agent’s task is to find out how to act, now and in the future, so that it reaches a goal state. Before it can
do this, it needs to decide (or we need to decide on its behalf) what sorts of actions and states it should
consider.
Problem formulation is the process of deciding what actions and states to consider, given a goal.
The agent will not know which of its possible actions is best, because it does not yet know enough
about the state that results from taking each action. If the agent has no additional information—i.e., if
the environment is unknown in the sense defined, then it has no choice but to try one of the actions at
random.
The process of looking for a sequence of actions that reaches the goal is called search. A search
algorithm takes a problem as input and returns a solution in the form of an action sequence. Once a
solution is found, the actions it recommends can be carried out. This is called the execution phase. Thus,
we have a simple “formulate, search, execute” design for the agent, as shown in Figure 3.1. After
formulating a goal and a problem to solve, the agent calls a search procedure to solve it. It then uses the
solution to guide its actions, doing whatever the solution recommends as the next thing to do—typically,
the first action of the sequence—and then removing that step from the sequence. Once the solution has
been executed, the agent will formulate a new goal. Notice that while the agent is executing the solution
sequence it ignores its percepts when choosing an action because it knows in advance what they will be.
An agent that carries out its plans with its eyes closed, so to speak, must be quite certain of what is
going on. Control theorists call this an open-loop system, because ignoring the percepts breaks the loop
between agent and environment. We first describe the process of problem formulation, and then devote
the bulk of the chapter to various algorithms for the SEARCH function. We do not discuss the workings
of the UPDATE-STATE and FORMULATE-GOAL functions further in this chapter.
3.1.1 Well-defined problems and solutions
A problem can be defined formally by five components:
• The initial state that the agent starts in. For example, the initial state for our agent in Romania might
be described as In(Arad).
• A description of the possible actions available to the agent. Given a particular state s, ACTIONS(s)
returns the set of actions that can be executed in s. We say that each of these actions is applicable in s.
For example, from the state In(Arad), the applicable actions are {Go(Sibiu), Go(Timisoara), Go(Zerind)}.
• A description of what each action does; the formal name for this is the transition model, specified by
a function RESULT(s, a) that returns the state that results from doing action a in state s. We also use the
term successor to refer to any state reachable from a given state by a single action. For example, we
have
RESULT(In(Arad), Go(Zerind)) = In(Zerind).
Together, the initial state, actions, and transition model implicitly define the state space of the
problem—the set of all states reachable from the initial state by any sequence of actions. The state
space forms a directed network or graph in which the nodes are states and the links between nodes are
actions. (The map of Romania shown in Figure 3.2 can be interpreted as a state-space graph if we view
each road as standing for two driving actions, one in each direction.) A path in the state space is a
sequence of states connected by a sequence of actions.
• The goal test, which determines whether a given state is a goal state. Sometimes there is an explicit
set of possible goal states, and the test simply checks whether the given state is one of them. The agent’s
goal in Romania is the singleton set {In(Bucharest)}.

Sometimes the goal is specified by an abstract property rather than an explicitly enumerated set of
states. For example, in chess, the goal is to reach a state called “checkmate,” where the opponent’s king
is under attack and can’t escape.
• A path cost function that assigns a numeric cost to each path. The problem-solving agent chooses a
cost function that reflects its own performance measure. For the agent trying to get to Bucharest, time
is of the essence, so the cost of a path might be its length in kilometers. In this chapter, we assume that
the cost of a path can be described as the sum of the costs of the individual actions along the path. The
step cost of taking action a in state s to reach state s’ is denoted by c(s, a, s’). The step costs for Romania
are shown in Figure 3.2 as route distances. We assume that step costs are nonnegative.
The preceding elements define a problem and can be gathered into a single data structure that is given as input to a
problem-solving algorithm. A solution to a problem is an action sequence that leads from the initial
state to a goal state. Solution quality is measured by the path cost function, and an optimal solution has
the lowest path cost among all solutions.
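
To make the five components concrete, the following is a minimal Python sketch of the Romania
route-finding problem (illustrative code, not the textbook’s; only a fragment of the map is shown, and
the class and method names are our own):

    # A problem as five components: initial state, actions, transition model,
    # goal test, and step costs. Distances are in kilometers.
    ROMANIA = {
        "Arad": {"Sibiu": 140, "Timisoara": 118, "Zerind": 75},
        "Zerind": {"Arad": 75, "Oradea": 71},
        "Sibiu": {"Arad": 140, "Fagaras": 99, "Oradea": 151, "RimnicuVilcea": 80},
    }

    class RouteProblem:
        def __init__(self, initial, goal, graph):
            self.initial = initial          # initial state, e.g. "Arad"
            self.goal = goal                # goal state, e.g. "Bucharest"
            self.graph = graph              # implicitly defines the state space

        def actions(self, s):               # ACTIONS(s): applicable actions in s
            return ["Go(%s)" % city for city in self.graph.get(s, {})]

        def result(self, s, a):             # RESULT(s, a): the transition model
            return a[3:-1]                  # "Go(Sibiu)" -> "Sibiu"

        def goal_test(self, s):             # is s a goal state?
            return s == self.goal

        def step_cost(self, s, a, s2):      # c(s, a, s'), nonnegative
            return self.graph[s][s2]

    problem = RouteProblem("Arad", "Bucharest", ROMANIA)
    print(problem.actions("Arad"))               # ['Go(Sibiu)', 'Go(Timisoara)', 'Go(Zerind)']
    print(problem.result("Arad", "Go(Zerind)"))  # Zerind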
3.1.2 Formulating problems
In the preceding section we proposed a formulation of the problem of getting to Bucharest in terms of
the initial state, actions, transition model, goal test, and path cost. This formulation seems reasonable,
but it is still a model—an abstract mathematical description—and not the real thing. Compare the
simple state description we have chosen, In(Arad), to an actual cross-country trip, where the state of the
world includes so many things: the traveling companions, the current radio program, the scenery out of
the window, the proximity of law enforcement officers, the distance to the next rest stop, the condition
of the road, the weather, and so on. All these considerations are left out of our state descriptions
because they are irrelevant to the problem of finding a route to Bucharest. The process of removing
detail from a representation is called abstraction. In addition to abstracting the state description, we
must abstract the actions themselves. A driving action has many effects. Besides changing the location
of the vehicle and its occupants, it takes up time, consumes fuel, generates pollution, and changes the
agent (as they say, travel is broadening). Our formulation takes into account only the change in location.
Also, there are many actions that we omit altogether: turning on the radio, looking out of the window,
slowing down for law enforcement officers, and so on. And of course, we don’t specify actions at the
level of “turn steering wheel to the left by one degree.” Can we be more precise about defining the
appropriate level of abstraction? Think of the abstract states and actions we have chosen as
corresponding to large sets of detailed world states and detailed action sequences. Now consider a
solution to the abstract problem: for example, the path from Arad to Sibiu to Rimnicu Vilcea to Pitesti to
Bucharest. This abstract solution corresponds to a large number of more detailed paths. For example,
we could drive with the radio on between Sibiu and Rimnicu Vilcea, and then switch it off for the rest of
the trip. The abstraction is valid if we can expand any abstract solution into a solution in the more
detailed world; a sufficient condition is that for every detailed state that is “in Arad,” there is a detailed
path to some state that is “in Sibiu,” and so on. The abstraction is useful if carrying out each of the
actions in the solution is easier than the original problem; in this case they are easy enough that they
can be carried out without further search or planning by an average driving agent. The choice of a good
abstraction thus involves removing as much detail as possible while retaining validity and ensuring that
the abstract actions are easy to carry out. Were it not for the ability to construct useful abstractions,
intelligent agents would be completely swamped by the real world.
3.2 EXAMPLE PROBLEMS
The problem-solving approach has been applied to a vast array of task environments. We list some of
the best known here, distinguishing between toy and real-world problems. A toy problem is intended to
illustrate or exercise various problem-solving methods. It can be given a concise, exact description and
hence is usable by different researchers to compare the performance of algorithms. A real-world
problem is one whose solutions people actually care about. Such problems tend not to have a single
agreed-upon description, but we can give the general flavor of their formulations.

3.2.1 Toy problems


The first example we examine is the vacuum world first introduced in Chapter 2. (See Figure 2.2.) This
can be formulated as a problem as follows:
• States: The state is determined by both the agent location and the dirt locations. The agent is in one
of two locations, each of which might or might not contain dirt. Thus, there are 2 × 2^2 = 8 possible
world states. A larger environment with n locations has n · 2^n states.
• Initial state: Any state can be designated as the initial state.
• Actions: In this simple environment, each state has just three actions: Left, Right, and Suck. Larger
environments might also include Up and Down.
• Transition model: The actions have their expected effects, except that moving Left in the leftmost
square, moving Right in the rightmost square, and Sucking in a clean square have no effect. The
complete state space is shown in Figure 3.3.
• Goal test: This checks whether all the squares are clean.
• Path cost: Each step costs 1, so the path cost is the number of steps in the path.
Compared with the real world, this toy problem has discrete locations, discrete dirt, reliable cleaning,
and it never gets any dirtier. Chapter 4 relaxes some of these assumptions.
The 8-puzzle, an instance of which is shown in Figure 3.4, consists of a 3×3 board with eight numbered
tiles and a blank space. A tile adjacent to the blank space can slide into the space. The object is to reach
a specified goal state, such as the one shown on the right of the figure. The standard formulation is as follows:

• States: A state description specifies the location of each of the eight tiles and the blank in one of the
nine squares.
• Initial state: Any state can be designated as the initial state. Note that any given goal can be reached
from exactly half of the possible initial states (Exercise 3.4).
• Actions: The simplest formulation defines the actions as movements of the blank space Left, Right, Up,
or Down. Different subsets of these are possible depending on where the blank is.
• Transition model: Given a state and action, this returns the resulting state; for example, if we apply
Left to the start state in Figure 3.4, the resulting state has the 5 and the blank switched.
• Goal test: This checks whether the state matches the goal configuration shown in Figure 3.4. (Other
goal configurations are possible.)
• Path cost: Each step costs 1, so the path cost is the number of steps in the path.
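
As an illustrative sketch (not the textbook’s code), this formulation can be written down directly by
representing a state as a tuple of nine entries, read row by row, with 0 standing for the blank:

    MOVES = {"Left": -1, "Right": +1, "Up": -3, "Down": +3}

    def actions(state):
        # Movements of the blank that stay on the 3x3 board.
        i = state.index(0)                  # position of the blank
        acts = []
        if i % 3 > 0: acts.append("Left")
        if i % 3 < 2: acts.append("Right")
        if i // 3 > 0: acts.append("Up")
        if i // 3 < 2: acts.append("Down")
        return acts

    def result(state, action):
        # Slide the adjacent tile into the blank, i.e., move the blank.
        i = state.index(0)
        j = i + MOVES[action]
        s = list(state)
        s[i], s[j] = s[j], s[i]
        return tuple(s)

    start = (7, 2, 4, 5, 0, 6, 8, 3, 1)     # the start state of Figure 3.4
    print(actions(start))                    # ['Left', 'Right', 'Up', 'Down']
    print(result(start, "Left"))             # the 5 and the blank switched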
What abstractions have we included here? The actions are abstracted to their beginning and final
states, ignoring the intermediate locations where the block is sliding. We have abstracted away actions
such as shaking the board when pieces get stuck and ruled out extracting the pieces with a knife and
putting them back again. We are left with a description of the rules of the puzzle, avoiding all the details
of physical manipulations. The 8-puzzle belongs to the family of sliding-block puzzles, which are often
used as test problems for new search algorithms in AI. This family is known to be NP-complete, so one
does not expect to find methods significantly better in the worst case than the search algorithms
described in this chapter and the next. The 8-puzzle has 9!/2 = 181,440 reachable states and is easily
solved. The 15-puzzle (on a 4×4 board) has around 1.3 trillion states, and random instances can be
solved optimally in a few milliseconds by the best search algorithms. The 24-puzzle (on a 5×5 board)
has around 10^25 states, and random instances take several hours to solve optimally.
The goal of the 8-queens problem is to place eight queens on a chessboard such that no queen attacks
any other. (A queen attacks any piece in the same row, column or diagonal.) Figure 3.5 shows an
attempted solution that fails: the queen in the rightmost column is attacked by the queen at the top left.

Although efficient special-purpose algorithms exist for this problem and for the whole n-queens family,
it remains a useful test problem for search algorithms. There are two main kinds of formulation. An
incremental formulation involves operators that augment the state description, starting with an empty
state; for the 8-queens problem, this means that each action adds a queen to the state. A complete-state
formulation starts with all 8 queens on the board and moves them around. In either case, the path cost
is of no interest because only the final state counts. The first incremental formulation one might try is
the following:
• States: Any arrangement of 0 to 8 queens on the board is a state.
• Initial state: No queens on the board.
• Actions: Add a queen to any empty square.
• Transition model: Returns the board with a queen added to the specified square.
• Goal test: 8 queens are on the board, none attacked.
In this formulation, we have 64 · 63 ··· 57 ≈ 1.8 × 10^14 possible sequences to investigate. A better
formulation would prohibit placing a queen in any square that is already attacked:
• States: All possible arrangements of n queens (0 ≤ n ≤ 8), one per column in the leftmost n columns,
with no queen attacking another.
• Actions: Add a queen to any square in the leftmost empty column such that it is not attacked by any
other queen.

This formulation reduces the 8-queens state space from 1.8 × 10^14 to just 2,057, and solutions are
easy to find. On the other hand, for 100 queens the reduction is from roughly 10^400 states to about
10^52 states (Exercise 3.5)—a big improvement, but not enough to make the problem tractable. Section
4.1 describes the complete-state formulation, and Chapter 6 gives a simple algorithm that solves even
the million-queens problem with ease.
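
A small Python sketch of the second, improved formulation (illustrative; a state is a tuple of queen
rows, one per occupied column from the left):

    def attacks(rows, new_row):
        # Would a queen placed at new_row in the next column be attacked?
        col = len(rows)
        return any(r == new_row or abs(r - new_row) == abs(c - col)
                   for c, r in enumerate(rows))

    def solutions(n=8, rows=()):
        # Enumerate all complete, mutually non-attacking placements.
        if len(rows) == n:
            yield rows
            return
        for new_row in range(n):
            if not attacks(rows, new_row):   # only applicable actions
                yield from solutions(n, rows + (new_row,))

    print(sum(1 for _ in solutions()))       # 92 distinct 8-queens solutions

Because every action already avoids attacked squares, this search walks exactly the reduced state
space described above.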
Our final toy problem was devised by Donald Knuth (1964) and illustrates how infinite state spaces can
arise. Knuth conjectured that, starting with the number 4, a sequence of factorial, square root, and floor
operations will reach any desired positive integer. For example, we can reach 5 from 4 as follows:

floor(sqrt(sqrt(sqrt(sqrt(sqrt((4!)!)))))) = 5

The problem definition is very simple:


• States: Positive numbers.
• Initial state: 4.
• Actions: Apply factorial, square root, or floor operation (factorial for integers only).
• Transition model: As given by the mathematical definitions of the operations.
• Goal test: State is the desired positive integer.
To our knowledge there is no bound on how large a number might be constructed in the process of
reaching a given target—for example, the number 620,448,401,733,239,439,360,000 is generated in the
expression for 5—so the state space for this problem is infinite. Such state spaces arise frequently in
tasks involving the generation of mathematical expressions, circuits, proofs, programs, and other
recursively defined objects.
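
A brief sketch of the successor function for this problem (illustrative; a practical search would also
need to bound how large the generated numbers may grow, precisely because the state space is infinite):

    import math

    def successors(x, max_factorial=100):
        # States are positive numbers; factorial applies only to integers.
        # max_factorial is an artificial cap to keep this sketch finite.
        succ = []
        if x == int(x) and 2 < x <= max_factorial:
            succ.append(math.factorial(int(x)))   # factorial
        succ.append(math.sqrt(x))                 # square root
        if x != int(x):
            succ.append(float(math.floor(x)))     # floor
        return succ

    print(successors(4))       # [24, 2.0]
    print(successors(5.54))    # [2.353..., 5.0]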
3.2.2 Real-world problems
We have already seen how the route-finding problem is defined in terms of specified locations and
transitions along links between them. Route-finding algorithms are used in a variety of applications.
Some, such as Web sites and in-car systems that provide driving directions, are relatively
straightforward extensions of the Romania example. Others, such as routing video streams in computer
networks, military operations planning, and airline travel-planning systems, involve much more
complex specifications. Consider the airline travel problems that must be solved by a travel-planning
Web site:
• States: Each state obviously includes a location (e.g., an airport) and the current time. Furthermore,
because the cost of an action (a flight segment) may depend on previous segments, their fare bases, and
their status as domestic or international, the state must record extra information about these
“historical” aspects.
• Initial state: This is specified by the user’s query.
• Actions: Take any flight from the current location, in any seat class, leaving after the current time,
leaving enough time for within-airport transfer if needed.
• Transition model: The state resulting from taking a flight will have the flight’s destination as the
current location and the flight’s arrival time as the current time.
• Goal test: Are we at the final destination specified by the user?
• Path cost: This depends on monetary cost, waiting time, flight time, customs and immigration
procedures, seat quality, time of day, type of airplane, frequent-flyer mileage awards, and so on

Commercial travel advice systems use a problem formulation of this kind, with many additional
complications to handle the byzantine fare structures that airlines impose. Any seasoned traveler
knows, however, that not all air travel goes according to plan. A really good system should include
contingency plans—such as backup reservations on alternate flights— to the extent that these are
justified by the cost and likelihood of failure of the original plan.
Touring problems are closely related to route-finding problems, but with an important difference.
Consider, for example, the problem “Visit every city in Figure 3.2 at least once, starting and ending in
Bucharest.” As with route finding, the actions correspond to trips between adjacent cities. The state
space, however, is quite different. Each state must include not just the current location but also the set
of cities the agent has visited. So the initial state would be In(Bucharest), Visited({Bucharest}), a typical
intermediate state would be In(Vaslui), Visited({Bucharest, Urziceni, Vaslui}), and the goal test would
check whether the agent is in Bucharest and all 20 cities have been visited.

The traveling salesperson problem (TSP) is a touring problem in which each city must be visited
exactly once. The aim is to find the shortest tour. The problem is known to be NP-hard, but an enormous
amount of effort has been expended to improve the capabilities of TSP algorithms. In addition to
planning trips for traveling salespersons, these algorithms have been used for tasks such as planning
movements of automatic circuit-board drills and of stocking machines on shop floors.
A VLSI layout problem requires positioning millions of components and connections on a chip to
minimize area, minimize circuit delays, minimize stray capacitances, and maximize manufacturing yield.
The layout problem comes after the logical design phase and is usually split into two parts: cell layout
and channel routing. In cell layout, the primitive components of the circuit are grouped into cells, each
of which performs some recognized function. Each cell has a fixed footprint (size and shape) and
requires a certain number of connections to each of the other cells. The aim is to place the cells on the
chip so that they do not overlap and so that there is room for the connecting wires to be placed between
the cells. Channel routing finds a specific route for each wire through the gaps between the cells. These
search problems are extremely complex, but definitely worth solving. Later in this chapter, we present
some algorithms capable of solving them.
Robot navigation is a generalization of the route-finding problem described earlier. Rather than
following a discrete set of routes, a robot can move in a continuous space with (in principle) an infinite
set of possible actions and states. For a circular robot moving on a flat surface, the space is essentially
two-dimensional. When the robot has arms and legs or wheels that must also be controlled, the search
space becomes many-dimensional. Advanced techniques are required just to make the search space
finite. We examine some of these methods in Chapter 25. In addition to the complexity of the problem,
real robots must also deal with errors in their sensor readings and motor controls.
Automatic assembly sequencing of complex objects by a robot was first demonstrated by FREDDY
(Michie, 1972). Progress since then has been slow but sure, to the point where the assembly of intricate
objects such as electric motors is economically feasible. In assembly problems, the aim is to find an
order in which to assemble the parts of some object. If the wrong order is chosen, there will be no way
to add some part later in the sequence without undoing some of the work already done. Checking a step
in the sequence for feasibility is a difficult geometrical search problem closely related to robot
navigation. Thus, the generation of legal actions is the expensive part of assembly sequencing. Any
practical algorithm must avoid exploring all but a tiny fraction of the state space. Another important
assembly problem is protein design, in which the goal is to find a sequence of amino acids that will fold
into a three-dimensional protein with the right properties to cure some disease.

3.3 SEARCHING FOR SOLUTIONS


Having formulated some problems, we now need to solve them. A solution is an action sequence, so
search algorithms work by considering various possible action sequences. The possible action
sequences starting at the initial state form a search tree with the initial state NODE at the root; the
branches are actions and the nodes correspond to states in the state space of the problem. Figure 3.6
shows the first few steps in growing the search tree for finding a route from Arad to Bucharest. The root
node of the tree corresponds to the initial state, In(Arad). The first step is to test whether this is a goal
state. (Clearly it is not, but it is important to check so that we can solve trick problems like “starting in
Arad, get to Arad.”) Then we need to consider taking various actions. We do this by expanding the
current state; that is, applying each legal action to the current state, thereby generating a new set of
states. In this case, we add three branches from the parent node In(Arad) leading to three new child
nodes: In(Sibiu), In(Timisoara), and In(Zerind). Now we must choose which of these three possibilities
to consider further.
This is the essence of search—following up one option now and putting the others aside for later, in
case the first choice does not lead to a solution. Suppose we choose Sibiu first. We check to see whether
it is a goal state (it is not) and then expand it to get In(Arad), In(Fagaras), In(Oradea), and
In(RimnicuVilcea). We can then choose any of these four or go back and choose Timisoara or Zerind.
Each of these six nodes is a leaf node, that is, a node with no children in the tree. The set of all leaf nodes
available for expansion at any given point is called the frontier. (Many authors call it the open list, which
is both geographically less evocative and less accurate, because other data structures are better suited
than a list.) In Figure 3.6, the frontier of each tree consists of those nodes with bold outlines.

The process of expanding nodes on the frontier continues until either a solution is found or there are
no more states to expand. The general TREE-SEARCH algorithm is shown informally in Figure 3.7.
Search algorithms all share this basic structure; they vary primarily according to how they choose
which state to expand next—the so-called search strategy.
The eagle-eyed reader will notice one peculiar thing about the search tree shown in Figure 3.6: it
includes the path from Arad to Sibiu and back to Arad again! We say that In(Arad) is a repeated state in
the search tree, generated in this case by a loopy path. Considering such loopy paths means that the
complete search tree for Romania is infinite because there is no limit to how often one can traverse a
loop. On the other hand, the state space—the map shown in Figure 3.2—has only 20 states. As we
discuss in Section 3.4, loops can cause certain algorithms to fail, making otherwise solvable problems
unsolvable. Fortunately, there is no need to consider loopy paths. We can rely on more than intuition for
this: because path costs are additive and step costs are nonnegative, a loopy path to any given state is
never better than the same path with the loop removed. Loopy paths are a special case of the more
general concept of redundant paths, which exist whenever there is more than one way to get from one
state to another. Consider the paths Arad–Sibiu (140 km long) and Arad–Zerind–Oradea–Sibiu (297 km
long). Obviously, the second path is redundant—it’s just a worse way to get to the same state. If you are
concerned about reaching the goal, there’s never any reason to keep more than one path to any given
state, because any goal state that is reachable by extending one path is also reachable by extending the
other. In some cases, it is possible to define the problem itself so as to eliminate redundant paths. For
example, if we formulate the 8-queens problem (page 71) so that a queen can be placed in any column,
then each state with n queens can be reached by n! different paths; but if we reformulate the problem
so that each new queen is placed in the leftmost empty column, then each state can be reached only
through one path.

In other cases, redundant paths are unavoidable. This includes all problems where the actions are
reversible, such as route-finding problems and sliding-block puzzles. Route finding on a rectangular
grid (like the one used later for Figure 3.9) is a particularly important example in computer games. In
such a grid, each state has four successors, so a search tree of depth d that includes repeated states has
4^d leaves; but there are only about 2d^2 distinct states within d steps of any given state. For d = 20, this
means about a trillion nodes but only about 800 distinct states. Thus, following redundant paths can
cause a tractable problem to become intractable. This is true even for algorithms that know how to
avoid infinite loops. As the saying goes, algorithms that forget their history are doomed to repeat it. The
way to avoid exploring redundant paths is to remember where one has been. To do this, we augment
the TREE-SEARCH algorithm with a data structure called the explored set (also known as the closed
list), which remembers every expanded node. Newly generated nodes that match previously generated
nodes—ones in the explored set or the frontier—can be discarded instead of being added to the frontier.
The new algorithm, called GRAPH-SEARCH, is shown informally in Figure 3.7. The specific algorithms in
this chapter draw on this general design. Clearly, the search tree constructed by the GRAPH-SEARCH
algorithm contains at most one copy of each state, so we can think of it as growing a tree directly on the
state-space graph, as shown in Figure 3.8. The algorithm has another nice property: the frontier
separates the state-space graph into the explored region and the unexplored region, so that every path
from the initial state to an unexplored state has to pass through a state in the frontier. (If this seems
completely obvious, try Exercise 3.13 now.) This property is illustrated in Figure 3.9. As every step
moves a state from the frontier into the explored region while moving some states from the unexplored
region into the frontier, we see that the algorithm is systematically examining the states in the state
space, one by one, until it finds a solution.
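To make the distinction concrete, the following is a minimal Python sketch of GRAPH-SEARCH. The problem interface (initial_state, actions, result, goal_test) is an assumption standing in for the formal problem components defined earlier, and the FIFO frontier shown here is just one possible search strategy:

    from collections import deque

    def graph_search(problem):
        # Generic GRAPH-SEARCH: like TREE-SEARCH, but discard any newly
        # generated state already in the explored set or on the frontier.
        frontier = deque([(problem.initial_state, [])])   # (state, action path)
        frontier_states = {problem.initial_state}
        explored = set()                                  # the "closed list"
        while frontier:
            state, path = frontier.popleft()
            frontier_states.discard(state)
            if problem.goal_test(state):
                return path
            explored.add(state)
            for action in problem.actions(state):
                child = problem.result(state, action)
                if child not in explored and child not in frontier_states:
                    frontier.append((child, path + [action]))
                    frontier_states.add(child)
        return None   # failure: no more states to expand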


3.3.1 Infrastructure for search algorithms


Search algorithms require a data structure to keep track of the search tree that is being constructed.
For each node n of the tree, we have a structure that contains four components:
• n.STATE: the state in the state space to which the node corresponds;
• n.PARENT: the node in the search tree that generated this node;
• n.ACTION: the action that was applied to the parent to generate the node;
• n.PATH-COST: the cost, traditionally denoted by g(n), of the path from the initial state to the node, as
indicated by the parent pointers.

Given the components for a parent node, it is easy to see how to compute the necessary components for
a child node. The function CHILD-NODE takes a parent node and an action and returns the resulting
child node:
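A minimal sketch in Python, assuming the same hypothetical problem interface as before (problem.result, plus a problem.step_cost(s, a, s') helper, both assumptions rather than fixed names):

    class Node:
        # Search-tree bookkeeping: STATE, PARENT, ACTION, PATH-COST.
        def __init__(self, state, parent=None, action=None, path_cost=0.0):
            self.state = state
            self.parent = parent
            self.action = action
            self.path_cost = path_cost   # g(n)

    def child_node(problem, parent, action):
        # CHILD-NODE: apply an action to the parent's state.
        state = problem.result(parent.state, action)
        cost = parent.path_cost + problem.step_cost(parent.state, action, state)
        return Node(state, parent, action, cost)

    def solution(node):
        # SOLUTION: follow PARENT pointers back to the root and
        # return the sequence of actions in order.
        actions = []
        while node.parent is not None:
            actions.append(node.action)
            node = node.parent
        return list(reversed(actions))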


The node data structure is depicted in Figure 3.10. Notice how the PARENT pointers string the nodes
together into a tree structure. These pointers also allow the solution path to be extracted when a goal
node is found; we use the SOLUTION function to return the sequence of actions obtained by following
parent pointers back to the root. Up to now, we have not been very careful to distinguish between nodes
and states, but in writing detailed algorithms it’s important to make that distinction. A node is a
bookkeeping data structure used to represent the search tree. A state corresponds to a configuration of
the world. Thus, nodes are on particular paths, as defined by PARENT pointers, whereas states are not.
Furthermore, two different nodes can contain the same world state if that state is generated via two
different search paths. Now that we have nodes, we need somewhere to put them. The frontier needs to
be stored in such a way that the search algorithm can easily choose the next node to expand according
to its preferred strategy. The appropriate data structure for this is a queue. The operations on a queue
are as follows:
• EMPTY?(queue) returns true only if there are no more elements in the queue.
• POP(queue) removes the first element of the queue and returns it.
• INSERT(element, queue) inserts an element and returns the resulting queue.
Queues are characterized by the order in which they store the inserted nodes. Three common variants
are the first-in, first-out or FIFO queue, which pops the oldest element of the queue; the last-in, first-
out or LIFO queue (also known as a stack), which pops the newest element of the queue; and the
priority queue, which pops the element of the queue with the highest priority according to some
ordering function. The explored set can be implemented with a hash table to allow efficient checking for
repeated states. With a good implementation, insertion and lookup can be done in roughly constant
time no matter how many states are stored. One must take care to implement the hash table with the
right notion of equality between states. For example, in the traveling salesperson problem (page 74),
the hash table needs to know that the set of visited cities {Bucharest,Urziceni,Vaslui} is the same as
{Urziceni,Vaslui,Bucharest}. Sometimes this can be achieved most easily by insisting that the data
structures for states be in some canonical form; that is, logically equivalent states should map to the
same data structure. In the case of states described by sets, for example, a bit-vector representation or a
sorted list without repetition would be canonical, whereas an unsorted list would not.
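As an illustration, the three queue disciplines and the canonical-form idea map directly onto standard Python containers; a short sketch:

    import heapq
    from collections import deque

    fifo = deque()                        # FIFO queue (used by breadth-first search)
    fifo.append('A'); fifo.append('B')
    assert fifo.popleft() == 'A'          # oldest element comes out first

    lifo = []                             # LIFO queue, i.e., a stack
    lifo.append('A'); lifo.append('B')
    assert lifo.pop() == 'B'              # newest element comes out first

    pq = []                               # priority queue ordered by a key
    heapq.heappush(pq, (3, 'C')); heapq.heappush(pq, (1, 'A'))
    assert heapq.heappop(pq) == (1, 'A')  # lowest key comes out first

    # Canonical form for set-valued states: a frozenset hashes identically
    # regardless of insertion order, so explored-set lookups behave correctly.
    assert frozenset({'Bucharest', 'Urziceni', 'Vaslui'}) == \
           frozenset({'Urziceni', 'Vaslui', 'Bucharest'})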
3.3.2 Measuring problem-solving performance
Before we get into the design of specific search algorithms, we need to consider the criteria that might
be used to choose among them. We can evaluate an algorithm’s performance in four ways:
• Completeness: Is the algorithm guaranteed to find a solution when there is one?
• Optimality: Does the strategy find the optimal solution, as defined on page 68?
• Time complexity: How long does it take to find a solution?
• Space complexity: How much memory is needed to perform the search?
Time and space complexity are always considered with respect to some measure of the problem
difficulty. In theoretical computer science, the typical measure is the size of the state space graph, |V| +
|E|, where V is the set of vertices (nodes) of the graph and E is the set of edges (links). This is
appropriate when the graph is an explicit data structure that is input to the search program. (The map
of Romania is an example of this.) In AI, the graph is often represented implicitly by the initial state,
actions, and transition model and is frequently infinite. For these reasons, complexity is expressed in
terms of three quantities: b, the branching factor or maximum number of successors of any node; d, the
depth of the shallowest goal node (i.e., the number of steps along the path from the root); and m, the
maximum length of any path in the state space. Time is often measured in terms of the number of nodes
generated during the search, and space in terms of the maximum number of nodes stored in memory.
For the most part, we describe time and space complexity for search on a tree; for a graph, the answer
depends on how “redundant” the paths in the state space are. To assess the effectiveness of a search
algorithm, we can consider just the search cost— which typically depends on the time complexity but


can also include a term for memory usage—or we can use the total cost, which combines the search
cost and the path cost of the solution found. For the problem of finding a route from Arad to Bucharest,
the search cost is the amount of time taken by the search and the solution cost is the total length of the
path in kilometers. Thus, to compute the total cost, we have to add milliseconds and kilometers. There
is no “official exchange rate” between the two, but it might be reasonable in this case to convert
kilometers into milliseconds by using an estimate of the car’s average speed (because time is what the
agent cares about). This enables the agent to find an optimal tradeoff point at which further
computation to find a shorter path becomes counterproductive. The more general problem of tradeoffs
between different goods is taken up in Chapter 16.

3.4 UNINFORMED SEARCH STRATEGIES


This section covers several search strategies that come under the heading of uninformed search (also
called blind search). The term means that the strategies have no additional information about states
beyond that provided in the problem definition. All they can do is generate successors and distinguish a
goal state from a non-goal state. All search strategies are distinguished by the order in which nodes are
expanded. Strategies that know whether one non-goal state is “more promising” than another are called
informed search or heuristic search strategies; they are covered in Section 3.5.
3.4.1 Breadth-first search
Breadth-first search is a simple strategy in which the root node is expanded first, then all the
successors of the root node are expanded next, then their successors, and so on. In general, all the
nodes are expanded at a given depth in the search tree before any nodes at the next level are expanded.
Breadth-first search is an instance of the general graph-search algorithm (Figure 3.7) in which the
shallowest unexpanded node is chosen for expansion. This is achieved very simply by using a FIFO
queue for the frontier. Thus, new nodes (which are always deeper than their parents) go to the back of
the queue, and old nodes, which are shallower than the new nodes, get expanded first. There is one
slight tweak on the general graph-search algorithm, which is that the goal test is applied to each node
when it is generated rather than when it is selected for expansion. This decision is explained below,
where we discuss time complexity. Note also that the algorithm, following the general template for
graph search, discards any new path to a state already in the frontier or explored set; it is easy to see
that any such path must be at least as deep as the one already found. Thus, breadth-first search always
has the shallowest path to every node on the frontier. Pseudocode is given in Figure 3.11. Figure 3.12
shows the progress of the search on a simple binary tree. How does breadth-first search rate according
to the four criteria from the previous section? We can easily see that it is complete—if the shallowest
goal node is at some finite depth d, breadth-first search will eventually find it after generating all
shallower nodes (provided the branching factor b is finite). Note that as soon as a goal node is
generated, we know it is the shallowest goal node because all shallower nodes must have been
generated already and failed the goal test. Now, the shallowest goal node is not necessarily the optimal
one;


technically, breadth-first search is optimal if the path cost is a nondecreasing function of the depth of
the node. The most common such scenario is that all actions have the same cost. So far, the news about
breadth-first search has been good. The news about time and space is not so good. Imagine searching a
uniform tree where every state has b successors. The root of the search tree generates b nodes at the
first level, each of which generates b more nodes, for a total of b^2 at the second level. Each of these
generates b more nodes, yielding b^3 nodes at the third level, and so on. Now suppose that the solution
is at depth d. In the worst case, it is the last node generated at that level. Then the total number of nodes
generated is b + b^2 + b^3 + ··· + b^d = O(b^d). (If the algorithm were to apply the goal test to nodes
when selected for expansion, rather than when generated, the whole layer of nodes at depth d would be
expanded before the goal was detected and the time complexity would be O(b^(d+1)).)
As for space complexity: for any kind of graph search, which stores every expanded node in the
explored set, the space complexity is always within a factor of b of the time complexity. For breadth-first
graph search in particular, every node generated remains in memory. There will be O(b^(d-1)) nodes in the
explored set and O(b^d) nodes in the frontier,

so the space complexity is O(b^d), i.e., it is dominated by the size of the frontier. Switching to a tree
search would not save much space, and in a state space with many redundant paths, switching could
cost a great deal of time. An exponential complexity bound such as O(b^d) is scary. Figure 3.13 shows
why. It lists, for various values of the solution depth d, the time and memory required for a breadth-first
search with branching factor b = 10. The table assumes that 1 million nodes can be generated per
second and that a node requires 1000 bytes of storage. Many search problems fit roughly within these
assumptions (give or take a factor of 100) when run on a modern personal computer.

Two lessons can be learned from Figure 3.13. First, the memory requirements are a bigger problem for
breadth-first search than is the execution time. One might wait 13 days for the solution to an important
problem with search depth 12, but no personal computer has the petabyte of memory it would take.
Fortunately, other strategies require less memory. The second lesson is that time is still a major factor. If
your problem has a solution at depth 16, then (given our assumptions) it will take about 350 years for
breadth-first search (or indeed any uninformed search) to find it. In general, exponential-complexity
search problems cannot be solved by uninformed methods for any but the smallest instances.
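A sketch of breadth-first graph search in Python, reusing the Node, child_node, and solution helpers sketched earlier; note the goal test applied when a node is generated, as discussed above:

    from collections import deque

    def breadth_first_search(problem):
        node = Node(problem.initial_state)
        if problem.goal_test(node.state):
            return solution(node)
        frontier = deque([node])              # FIFO queue
        frontier_states = {node.state}
        explored = set()
        while frontier:
            node = frontier.popleft()         # shallowest node first
            frontier_states.discard(node.state)
            explored.add(node.state)
            for action in problem.actions(node.state):
                child = child_node(problem, node, action)
                if child.state not in explored and child.state not in frontier_states:
                    if problem.goal_test(child.state):   # test at generation time,
                        return solution(child)           # saving a whole layer of expansions
                    frontier.append(child)
                    frontier_states.add(child.state)
        return None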

3.4.3 Depth-first search


Depth-first search always expands the deepest node in the current frontier of the search tree. The
progress of the search is illustrated in Figure 3.16. The search proceeds immediately to the deepest
level of the search tree, where the nodes have no successors. As those nodes are expanded, they are


dropped from the frontier, so then the search “backs up” to the next deepest node that still has
unexplored successors. The depth-first search algorithm is an instance of the graph-search algorithm in
Figure 3.7; whereas breadth-first search uses a FIFO queue, depth-first search uses a LIFO queue. A
LIFO queue means that the most recently generated node is chosen for expansion. This must be the
deepest unexpanded node because it is one deeper than its parent—which, in turn, was the deepest
unexpanded node when it was selected. As an alternative to the GRAPH-SEARCH-style implementation,
it is common to implement depth-first search with a recursive function that calls itself on each of its
children in turn. (A recursive depth-first algorithm incorporating a depth limit is shown in Figure 3.17.)

The properties of depth-first search depend strongly on whether the graph-search or tree-search
version is used. The graph-search version, which avoids repeated states and redundant paths, is
complete in finite state spaces because it will eventually expand every node. The tree-search version, on
the other hand, is not complete.
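A recursive depth-limited sketch in the spirit of Figure 3.17, again reusing the earlier helpers; the special 'cutoff' value distinguishes hitting the depth limit from genuine failure:

    def depth_limited_search(problem, limit):
        def recurse(node, limit):
            if problem.goal_test(node.state):
                return solution(node)
            if limit == 0:
                return 'cutoff'
            cutoff_occurred = False
            for action in problem.actions(node.state):
                result = recurse(child_node(problem, node, action), limit - 1)
                if result == 'cutoff':
                    cutoff_occurred = True
                elif result is not None:
                    return result
            return 'cutoff' if cutoff_occurred else None
        return recurse(Node(problem.initial_state), limit)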


Module 2
INFORMED SEARCH STRATEGIES
3.5 INFORMED (HEURISTIC) SEARCH STRATEGIES
This section shows how an informed search strategy—one that uses problem-specific
knowledge beyond the definition of the problem itself—can find solutions more efficiently
than can an uninformed strategy. The general approach we consider is called best-first
search. Best-first search is an instance of the general TREE-SEARCH or GRAPH-SEARCH
algorithm in which a node is selected for expansion based on an evaluation function, f(n).
The evaluation function is construed as a cost estimate, so the node with the lowest
evaluation is expanded first. The implementation of best-first graph search is identical to
that for uniform-cost search (Figure 3.14), except for the use of f instead of g to order the
priority queue. The choice of f determines the search strategy. (For example, as Exercise
3.21 shows, best-first tree search includes depth-first search as a special case.)
Most best-first algorithms include as a component of f a heuristic function, denoted
h(n): h(n) = estimated cost of the cheapest path from the state at node n to a goal state.

3.5.1 Greedy best-first search


Greedy best-first search tries to expand the node that is closest to the goal, on the grounds
that this is likely to lead to a solution quickly. Thus, it evaluates nodes by using just the
heuristic function; that is, f(n) = h(n). Let us see how this works for route-finding problems
in Romania; we use the straight line distance heuristic, which we will call hSLD . If the goal
is Bucharest, we need to know the straight-line distances to Bucharest, which are shown
in Figure 3.22. For example, hSLD (In(Arad)) = 366. Notice that the values of hSLD cannot
be computed from the problem description itself. Moreover, it takes a certain amount of
experience to know that hSLD is correlated with actual road distances and is, therefore, a
useful heuristic. Figure 3.23 shows the progress of a greedy best-first search using hSLD
to find a path from Arad to Bucharest. The first node to be expanded from Arad will be
Sibiu because it is closer to Bucharest than either Zerind or Timisoara. The next node to
be expanded will be Fagaras because it is closest. Fagaras in turn generates Bucharest,
which is the goal. For this particular problem, greedy best-first search using hSLD finds a
solution without ever expanding a node that is not on the solution path; hence, its search
cost is minimal. It is not optimal, however: the path via Sibiu and Fagaras to Bucharest is
32 kilometers longer than the path through Rimnicu Vilcea and Pitesti. This shows why
the algorithm is called “greedy”—at each step it tries to get as close to the goal as it can.
Greedy best-first tree search is also incomplete even in a finite state space, much like
depth-first search. Consider the problem of getting from Iasi to Fagaras. The heuristic
suggests that Neamt be expanded first because it is closest to Fagaras, but it is a dead end.
The solution is to go first to Vaslui—a step that is actually farther from the goal according
to the heuristic—and then to continue to Urziceni, Bucharest, and Fagaras. The algorithm
will never find this solution, however, because expanding Neamt puts Iasi back into the
frontier, Iasi is closer to Fagaras than Vaslui is, and so Iasi will be expanded again, leading


to an infinite loop. (The graph search version is complete in finite spaces, but not in infinite
ones.) The worst-case time and space complexity for the tree version is O(b^m), where m is
the maximum depth of the search space. With a good heuristic function, however, the
complexity can be reduced substantially. The amount of the reduction depends on the
particular problem and on the quality of the heuristic.
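A sketch of best-first graph search ordered by an arbitrary evaluation function f; passing f(n) = h(n) gives greedy best-first search. The h_sld name in the final comment is a hypothetical stand-in for the straight-line-distance table of Figure 3.22:

    import heapq

    def best_first_search(problem, f):
        root = Node(problem.initial_state)
        frontier = [(f(root), 0, root)]   # counter breaks ties between equal f-values
        counter = 1
        explored = set()
        while frontier:
            _, _, node = heapq.heappop(frontier)
            if problem.goal_test(node.state):
                return solution(node)
            if node.state in explored:
                continue                  # a better copy of this state was expanded already
            explored.add(node.state)
            for action in problem.actions(node.state):
                child = child_node(problem, node, action)
                if child.state not in explored:
                    heapq.heappush(frontier, (f(child), counter, child))
                    counter += 1
        return None

    # Greedy best-first search on the Romania problem might then read:
    # best_first_search(romania, f=lambda n: h_sld(n.state))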

3.5.2 A* search: Minimizing the total estimated solution cost


The most widely known form of best-first search is called A∗ search (pronounced “A-star
search”). It evaluates nodes by combining g(n), the cost to reach the node, and h(n), the
cost to get from the node to the goal:
f(n) = g(n) + h(n) .
Since g(n) gives the path cost from the start node to node n, and h(n) is the estimated cost
of the cheapest path from n to the goal, we have
f(n) = estimated cost of the cheapest solution through n . Thus, if we are trying to find the
cheapest solution, a reasonable thing to try first is the node with the lowest value of g(n)
+ h(n). It turns out that this strategy is more than just reasonable: provided that the
heuristic function h(n) satisfies certain conditions, A∗ search is both complete and optimal.
The algorithm is identical to UNIFORM-COST-SEARCH except that A∗ uses g + h instead of
g.
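Given the best_first_search sketch above, A∗ is a one-line specialization (h is whatever heuristic function the caller supplies):

    def a_star_search(problem, h):
        # A*: order the frontier by f(n) = g(n) + h(n).
        return best_first_search(problem, f=lambda n: n.path_cost + h(n.state))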


Conditions for optimality: Admissibility and consistency The first condition we require for
optimality is that h(n) be an admissible heuristic. An admissible heuristic is one that never
overestimates the cost to reach the goal. Because g(n) is the actual cost to reach n along
the current path, and f(n) = g(n) + h(n), we have as an immediate consequence that f(n)
never overestimates the true cost of a solution along the current path through n.
Admissible heuristics are by nature optimistic because they think the cost of solving the
problem is less than it actually is. An obvious example of an admissible heuristic is the
straight-line distance hSLD that we used in getting to Bucharest. Straight-line distance is
admissible because the shortest path between any two points is a straight line, so the
straight line cannot be an overestimate. In Figure 3.24, we show the progress of an A∗ tree
search for Bucharest. The values of g are computed from the step costs in Figure 3.2, and


the values of hSLD are given in Figure 3.22. Notice in particular that Bucharest first appears
on the frontier at step (e), but it is not selected for expansion because its f-cost (450) is
higher than that of Pitesti (417). Another way to say this is that there might be a solution
through Pitesti whose cost is as low as 417, so the algorithm will not settle for a solution
that costs 450. A second, slightly stronger condition called consistency (or sometimes
monotonicity) is required only for applications of A∗ to graph search. A heuristic h(n) is
consistent if, for every node n and every successor n' of n generated by any action a, the
estimated cost of reaching the goal from n is no greater than the step cost of getting to
n' plus the estimated cost of reaching the goal from n':
h(n) ≤ c(n, a, n') + h(n').
This is a form of the general triangle inequality, which stipulates that each side of a triangle
cannot be longer than the sum of the other two sides. Here, the triangle is formed by n, n’,
and the goal Gn closest to n. For an admissible heuristic, the inequality makes perfect
sense: if there were a route from n to Gn via n’ that was cheaper than h(n), that would
violate the property that h(n) is a lower bound on the cost to reach Gn. It is fairly easy to
show (Exercise 3.29) that every consistent heuristic is also admissible. Consistency is
therefore a stricter requirement than admissibility, but one has to work quite hard to
concoct heuristics that are admissible but not consistent. All the admissible heuristics we
discuss in this chapter are also consistent. Consider, for example, hSLD . We know that the
general triangle inequality is satisfied when each side is measured by the straight-line
distance and that the straight-line distance between n and n’ is no greater than c(n, a, n’ ).
Hence, hSLD is a consistent heuristic.
Optimality of A* As we mentioned earlier, A∗ has the following properties: the tree-search
version of A∗ is optimal if h(n) is admissible, while the graph-search version is optimal if
h(n) is consistent. We show the second of these two claims since it is more useful. The
argument essentially mirrors the argument for the optimality of uniform-cost search, with
g replaced by f—just as in the A∗ algorithm itself. The first step is to establish the following:
if h(n) is consistent, then the values of f(n) along any path are nondecreasing. The proof
follows directly from the definition of consistency. Suppose n' is a successor of n; then
g(n') = g(n) + c(n, a, n') for some action a, and we have
f(n') = g(n') + h(n') = g(n) + c(n, a, n') + h(n') ≥ g(n) + h(n) = f(n).
The next step is to prove that whenever A∗ selects a node n for expansion, the optimal
path to that node has been found. Were this not the case, there would have to be another
frontier node n' on the optimal path from the start node to n, by the graph separation
property of


Figure 3.9; because f is nondecreasing along any path, n' would have lower f-cost than n and
would have been selected first. From the two preceding observations, it follows that the
sequence of nodes expanded by A∗ using GRAPH-SEARCH is in nondecreasing order of f(n).
Hence, the first goal node selected for expansion must be an optimal solution because f is
the true cost for goal nodes (which have h = 0) and all later goal nodes will be at least as
expensive. The fact that f-costs are nondecreasing along any path also means that we can
draw contours in the state space, just like the contours in a topographic map. Figure 3.25
shows an example. Inside the contour labeled 400, all nodes have f(n) less than or equal to
400, and so on. Then, because A∗ expands the frontier node of lowest f-cost, we can see
that an A∗ search fans out from the start node, adding nodes in concentric bands of
increasing f-cost. With uniform-cost search (A∗ search using h(n)=0), the bands will be
“circular” around the start state. With more accurate heuristics, the bands will stretch
toward the goal state and become more narrowly focused around the optimal path. If C∗ is
the cost of the optimal solution path, then we can say the following:
• A∗ expands all nodes with f(n) < C∗.
• A∗ might then expand some of the nodes right on the “goal contour” (where f(n) = C∗)
before selecting a goal node. Completeness requires that there be only finitely many nodes
with cost less than or equal to C∗, a condition that is true if all step costs exceed some finite ε
and if b is finite. Notice that A∗ expands no nodes with f(n) > C∗—for example, Timisoara
is not expanded in Figure 3.24 even though it is a child of the root. We say that the subtree
below
Timisoara is pruned; because hSLD is admissible, the algorithm can safely ignore this
subtree while still guaranteeing optimality. The concept of pruning—eliminating
possibilities from consideration without having to examine them—is important for many
areas of AI. One final observation is that among optimal algorithms of this type—
algorithms that extend search paths from the root and use the same heuristic


information—A∗ is optimally efficient for any given consistent heuristic. That is, no other
optimal algorithm is guaranteed to expand fewer nodes than A∗ (except possibly through
tie-breaking among nodes with f(n) = C∗). This is because any algorithm that does not
expand all nodes with f(n) < C∗ runs the risk of missing the optimal solution. That A∗
search is complete, optimal, and optimally efficient among all such algorithms is rather
satisfying. Unfortunately, it does not mean that A∗ is the answer to all our searching needs.
The catch is that, for most problems, the number of states within the goal contour search
space is still exponential in the length of the solution. The details of the analysis are beyond
the scope of this book, but the basic results are as follows. For problems with constant step
costs, the growth in run time as a function of the optimal solution depth d is analyzed in
terms of the absolute error or the relative error of the heuristic. The absolute error is
defined as Δ ≡ h∗ − h, where h∗ is the actual cost of getting from the root to the goal, and
the relative error is defined as ε ≡ (h∗ − h)/h∗. The complexity results depend very strongly
on the assumptions made about the state space. The simplest model studied is a state
space that has a single goal and is essentially a tree with reversible actions. (The 8-puzzle
satisfies the first and third of these assumptions.) In this case, the time complexity of A∗ is
exponential in the maximum absolute error, that is, O(b^Δ). For constant step costs, we can
write this as O(b^(εd)), where d is the solution depth. For almost all heuristics in practical
use, the absolute error is at least proportional to the path cost h∗, so ε is constant or growing
and the time complexity is exponential in d. We can also see the effect of a more accurate
heuristic: O(b^(εd)) = O((b^ε)^d), so the effective branching factor (defined more formally in
the next section) is b^ε. When the state space has many goal states—particularly near-
optimal goal states—the search process can be led astray from the optimal path and there
is an extra cost proportional to the number of goals whose cost is within a factor of the
optimal cost. Finally, in the general case of a graph, the situation is even worse. There can
be exponentially many states with f(n) < C∗ even if the absolute error is bounded by a
constant. For example, consider a version of the vacuum world where the agent can clean
up any square for unit cost without even having to visit it: in that case, squares can be
cleaned in any order. With N initially dirty squares, there are 2N states where some subset
has been cleaned and all of them are on an optimal solution path—and hence satisfy f(n)
< C∗—even if the heuristic has an error of 1. The complexity of A∗ often makes it
impractical to insist on finding an optimal solution. One can use variants of A∗ that find
suboptimal solutions quickly, or one can sometimes design heuristics that are more
accurate but not strictly admissible. In any case, the use of a good heuristic still provides
enormous savings compared to the use of an uninformed search. In Section 3.6, we look at
the question of designing good heuristics. Computation time is not, however, A∗’s main
drawback. Because it keeps all generated nodes in memory (as do all GRAPH-SEARCH
algorithms), A∗ usually runs out of space long before it runs out of time. For this reason,
A∗ is not practical for many large-scale problems. There are, however, algorithms that
overcome the space problem without sacrificing optimality or completeness, at a small
cost in execution time. We discuss these next.


3.6 HEURISTIC FUNCTIONS


In this section, we look at heuristics for the 8-puzzle, in order to shed light on the nature
of heuristics in general. The 8-puzzle was one of the earliest heuristic search problems. As
mentioned in Section 3.2, the object of the puzzle is to slide the tiles horizontally or
vertically into the empty space until the configuration matches the goal configuration
(Figure 3.28). The average solution cost for a randomly generated 8-puzzle instance is
about 22 steps. The branching factor is about 3. (When the empty tile is in the middle, four
moves are possible; when it is in a corner, two; and when it is along an edge, three.) This
means that an exhaustive tree search to depth 22 would look at about 3^22 ≈ 3.1 × 10^10
states. A graph search would cut this down by a factor of about 170,000 because only 9!/2
= 181,440 distinct states are reachable. (See Exercise 3.4.) This is a manageable number,
but the corresponding number for the 15-puzzle is roughly 10^13, so the next order of
business is to find a good heuristic function. If we want to find the shortest solutions by
using A∗, we need a heuristic function that never overestimates the number of steps to the
goal. There is a long history of such heuristics for the 15-puzzle; here are two commonly
used candidates:
• h1 = the number of misplaced tiles. For Figure 3.28, all of the eight tiles are out of position,
so the start state would have h1 = 8. h1 is an admissible heuristic because it is clear that
any tile that is out of place must be moved at least once.
• h2 = the sum of the distances of the tiles from their goal positions. Because tiles cannot
move along diagonals, the distance we will count is the sum of the horizontal and vertical
distances. This is sometimes called the city block distance or Manhattan distance. h2 is
also admissible because all any move can do is move one tile one step closer to the goal.
Tiles 1 to 8 in the start state give a Manhattan distance of
h2 = 3 + 1 + 2 + 2 + 2 + 3 + 3 + 2 = 18. As expected, neither of these overestimates the true
solution cost, which is 26.
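Both heuristics are cheap to compute. A sketch, assuming states are encoded as 9-tuples in row-major order with 0 for the blank (an assumed encoding, not one fixed by the text):

    GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)

    def h1(state, goal=GOAL):
        # Number of misplaced tiles (the blank is not counted).
        return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

    def h2(state, goal=GOAL):
        # Sum of Manhattan (city-block) distances of tiles from their goal squares.
        total = 0
        for idx, tile in enumerate(state):
            if tile == 0:
                continue
            goal_idx = goal.index(tile)
            total += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
        return total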


3.6.1 The effect of heuristic accuracy on performance


One way to characterize the quality of a heuristic is the effective branching factor b∗. If the
total number of nodes generated by A∗ for a particular problem is N and the solution depth
is d, then b∗ is the branching factor that a uniform tree of depth d would have to have in
order to contain N + 1 nodes. Thus,
N + 1 = 1 + b∗ + (b∗)^2 + ··· + (b∗)^d .
For example, if A∗ finds a solution at depth 5 using 52 nodes, then the effective branching
factor is 1.92. The effective branching factor can vary across problem instances, but usually
it is fairly constant for sufficiently hard problems. (The existence of an effective branching
factor follows from the result, mentioned earlier, that the number of nodes expanded by
A∗ grows exponentially with solution depth.) Therefore, experimental measurements of
b∗ on a small set of problems can provide a good guide to the heuristic’s overall usefulness.
A well-designed heuristic would have a value of b∗ close to 1, allowing fairly large problems
to be solved at reasonable computational cost.
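Since this equation has no closed-form solution for b∗, it can be solved numerically; a bisection sketch that reproduces the worked example above:

    def effective_branching_factor(n_generated, d, tol=1e-6):
        # Solve N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d for b* by bisection.
        target = n_generated + 1
        lo, hi = 1.0, float(n_generated)
        while hi - lo > tol:
            mid = (lo + hi) / 2
            total = sum(mid ** i for i in range(d + 1))
            if total < target:
                lo = mid
            else:
                hi = mid
        return lo

    print(round(effective_branching_factor(52, 5), 2))   # 1.92, as in the text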
To test the heuristic functions h1 and h2, we generated 1200 random problems with
solution lengths from 2 to 24 (100 for each even number) and solved them with iterative
deepening search and with A∗ tree search using both h1 and h2. Figure 3.29 gives the
average number of nodes generated by each strategy and the effective branching factor.
The results suggest that h2 is better than h1, and is far better than using iterative
deepening search. Even for small problems with d = 12, A∗ with h2 is 50,000 times more
efficient than uninformed iterative deepening search.


One might ask whether h2 is always better than h1. The answer is “Essentially, yes.” It is
easy to see from the definitions of the two heuristics that, for any node n, h2(n) ≥ h1(n).
We thus say that h2 dominates h1. Domination translates directly into efficiency: A∗ using
h2 will never expand more nodes than A∗ using h1 (except possibly for some nodes with
f(n) = C∗). The argument is simple. Recall the observation on page 97 that every node with
f(n) < C∗ will surely be expanded. This is the same as saying that every node with h(n) <
C∗ − g(n) will surely be expanded. But because h2 is at least as big as h1 for all nodes, every
node that is surely expanded by A∗ search with h2 will also surely be expanded with h1,
and h1 might cause other nodes to be expanded as well. Hence, it is generally better to use
a heuristic function with higher values, provided it is consistent and that the computation
time for the heuristic is not too long.

3.6.2 Generating admissible heuristics from relaxed problems


We have seen that both h1 (misplaced tiles) and h2 (Manhattan distance) are fairly good
heuristics for the 8-puzzle and that h2 is better. How might one have come up with h2? Is
it possible for a computer to invent such a heuristic mechanically? h1 and h2 are estimates
of the remaining path length for the 8-puzzle, but they are also perfectly accurate path
lengths for simplified versions of the puzzle. If the rules of the puzzle were changed so that
a tile could move anywhere instead of just to the adjacent empty square, then h1 would
give the exact number of steps in the shortest solution. Similarly, if a tile could move one
square in any direction, even onto an occupied square, then h2 would give the exact
number of steps in the shortest solution. A problem with fewer restrictions on the actions
is called a relaxed problem. The state-space graph of the relaxed problem is a supergraph
of the original state space because the removal of restrictions creates added edges in the
graph. Because the relaxed problem adds edges to the state space, any optimal solution in
the original problem is, by definition, also a solution in the relaxed problem; but the
relaxed problem may have better solutions if the added edges provide short cuts. Hence,
the cost of an optimal solution to a relaxed problem is an admissible heuristic for the
original problem. Furthermore, because the derived heuristic is an exact cost for the


relaxed problem, it must obey the triangle inequality and is therefore consistent (see page
95). If a problem definition is written down in a formal language, it is possible to construct
relaxed problems automatically. For example, if the 8-puzzle actions are described as A
tile can move from square A to square B if A is horizontally or vertically adjacent to B and
B is blank, we can generate three relaxed problems by removing one or both of the
conditions:
(a) A tile can move from square A to square B if A is adjacent to B.
(b) A tile can move from square A to square B if B is blank.
(c) A tile can move from square A to square B.
From (a), we can derive h2 (Manhattan distance). The reasoning is that h2 would be the
proper score if we moved each tile in turn to its destination. The heuristic derived from (b)
is discussed in Exercise 3.31. From (c), we can derive h1 (misplaced tiles) because it would
be the proper score if tiles could move to their intended destination in one step. Notice
that it is crucial that the relaxed problems generated by this technique can be solved
essentially without search, because the relaxed rules allow the problem to be decomposed
into eight independent subproblems. If the relaxed problem is hard to solve, then the
values of the corresponding heuristic will be expensive to obtain. A program called
ABSOLVER can generate heuristics automatically from problem definitions, using the
“relaxed problem” method and various other techniques (Prieditis, 1993). ABSOLVER
generated a new heuristic for the 8-puzzle that was better than any preexisting heuristic
and found the first useful heuristic for the famous Rubik’s Cube puzzle. One problem with
generating new heuristic functions is that one often fails to get a single “clearly best”
heuristic. If a collection of admissible heuristics h1, ..., hm is available for a problem and
none of them dominates any of the others, which should we choose? As it turns out, we
need not make a choice. We can have the best of all worlds, by defining
h(n) = max{h1(n),...,hm(n)} .

This composite heuristic uses whichever function is most accurate on the node in question.
Because the component heuristics are admissible, h is admissible; it is also easy to prove
that h is consistent. Furthermore, h dominates all of its component heuristics.
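A sketch of the composite heuristic; h1 and h2 refer to the 8-puzzle heuristics sketched earlier:

    def composite_heuristic(*heuristics):
        # h(n) = max{h1(n), ..., hm(n)}: admissible if every component is,
        # consistent if every component is, and dominates each component.
        return lambda state: max(h(state) for h in heuristics)

    h = composite_heuristic(h1, h2)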
3.6.3 Generating admissible heuristics from subproblems: Pattern databases
Admissible heuristics can also be derived from the solution cost of a subproblem of a given
problem. For example, Figure 3.30 shows a subproblem of the 8-puzzle instance in Figure


3.28. The subproblem involves getting tiles 1, 2, 3, 4 into their correct positions. Clearly,
the cost of the optimal solution of this subproblem is a lower bound on the cost of the
complete problem. It turns out to be more accurate than Manhattan distance in some cases.
The idea behind pattern databases is to store these exact solution costs for every possible
subproblem instance—in our example, every possible configuration of the four tiles and
the blank. (The locations of the other four tiles are irrelevant for the purposes of solving
the subproblem, but moves of those tiles do count toward the cost.) Then we compute an
admissible heuristic hDB for each complete state encountered during a search simply by
looking up the corresponding subproblem configuration in the database. The database
itself is constructed by searching back from the goal and recording the cost of each new
pattern encountered; the expense of this search is amortized over many subsequent
problem instances. The choice of 1-2-3-4 is fairly arbitrary; we could also construct
databases for 5-6-7-8, for 2-4-6-8, and so on. Each database yields an admissible heuristic,
and these heuristics can be combined, as explained earlier, by taking the maximum value.
A combined heuristic of this kind is much more accurate than the Manhattan distance; the
number of nodes generated when solving random 15-puzzles can be reduced by a factor
of 1000. One might wonder whether the heuristics obtained from the 1-2-3-4 database
and the 5-6-7-8 could be added, since the two subproblems seem not to overlap. Would
this still give an admissible heuristic? The answer is no, because the solutions of the 1-2-
3-4 subproblem and the 5-6-7-8 subproblem for a given state will almost certainly share
some moves—it is unlikely that 1-2-3-4 can be moved into place without touching 5-6-7-
8, and vice versa. But what if we don’t count those moves? That is, we record not the total
cost of solving the 1-2-3-4 subproblem, but just the number of moves involving 1-2-3-4.
Then it is easy to see that the sum of the two costs is still a lower bound on the cost of
solving the entire problem. This is the idea behind disjoint pattern databases. With such
databases, it is possible to solve random 15-puzzles in a few milliseconds—the number of
nodes generated is reduced by a factor of 10,000 compared with the use of Manhattan
distance. For 24-puzzles, a speedup of roughly a factor of a million can be obtained. Disjoint
pattern databases work for sliding-tile puzzles because the problem can be divided up in
such a way that each move affects only one subproblem—because only one tile is moved
at a time. For a problem such as Rubik’s Cube, this kind of subdivision is difficult because
each move affects 8 or 9 of the 26 cubies. More general ways of defining additive,
admissible heuristics have been proposed that do apply to Rubik’s cube (Yang et al., 2008),
but they have not yielded a heuristic better than the best nonadditive heuristic for the
problem.
3.6.4 Learning heuristics from experience
A heuristic function h(n) is supposed to estimate the cost of a solution beginning from the
state at node n. How could an agent construct such a function? One solution was given in
the preceding sections—namely, to devise relaxed problems for which an optimal solution
can be found easily. Another solution is to learn from experience. “Experience” here means
solving lots of 8-puzzles, for instance. Each optimal solution to an 8-puzzle problem
provides examples from which h(n) can be learned. Each example consists of a state from
the solution path and the actual cost of the solution from that point. From these examples,


a learning algorithm can be used to construct a function h(n) that can (with luck) predict
solution costs for other states that arise during search. Techniques for doing just this using
neural nets, decision trees, and other methods are demonstrated in Chapter 18. (The
reinforcement learning methods described in Chapter 21 are also applicable.) Inductive
learning methods work best when supplied with features of a state that are relevant to
predicting the state’s value, rather than with just the raw state description. For example,
the feature “number of misplaced tiles” might be helpful in predicting the actual distance
of a state from the goal. Let’s call this feature x1(n). We could take 100 randomly generated
8-puzzle configurations and gather statistics on their actual solution costs. We might find
that when x1(n) is 5, the average solution cost is around 14, and so on. Given these data,
the value of x1 can be used to predict h(n). Of course, we can use several features. A second
feature x2(n) might be “number of pairs of adjacent tiles that are not adjacent in the goal
state.” How should x1(n) and x2(n) be combined to predict h(n)? A common approach is
to use a linear combination:
h(n) = c1x1(n) + c2x2(n) .
The constants c1 and c2 are adjusted to give the best fit to the actual data on solution costs.
One expects both c1 and c2 to be positive because misplaced tiles and incorrect adjacent
pairs make the problem harder to solve. Notice that this heuristic does satisfy the condition
that h(n)=0 for goal states, but it is not necessarily admissible or consistent.
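A sketch of fitting such a linear combination by least squares; the feature values and solution costs below are hypothetical numbers, loosely echoing the x1 = 5 → cost ≈ 14 example in the text:

    import numpy as np

    # Hypothetical training data: each row holds (x1, x2) for one solved state,
    # and y holds the actual solution cost observed from that state.
    X = np.array([[5, 3], [8, 6], [2, 1], [6, 4]], dtype=float)
    y = np.array([14.0, 22.0, 6.0, 17.0])

    # Fit h(n) = c1*x1(n) + c2*x2(n) with no intercept, so that h = 0
    # whenever both features are 0, as they are in the goal state.
    c, *_ = np.linalg.lstsq(X, y, rcond=None)

    def h_learned(x1, x2):
        return c[0] * x1 + c[1] * x2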

INTRODUCTION TO MACHINE LEARNING
1.1 NEED FOR MACHINE LEARNING
Business organizations use huge amounts of data for their daily activities. They have now
started to use the latest technology, machine learning, to manage this data.
Machine learning has become popular for three reasons:


1. High volume of available data to manage: Big companies such as Facebook, Twitter,
and YouTube generate huge amounts of data that grow at a phenomenal rate. It is
estimated that this data approximately doubles every year.
2. Reduced cost of storage: Hardware costs have also dropped. Therefore, it is easier
now to capture, process, store, distribute, and transmit digital information.
3. Availability of complex algorithms: Especially with the advent of deep learning,
many powerful algorithms are now available for machine learning.
Let us establish the terms data, information, knowledge, intelligence, and wisdom
using the knowledge pyramid shown in Figure 1.1.

Figure 1.1: The Knowledge Pyramid


• All facts are data. Data can be numbers or text that can be processed by a computer.
Today, organizations are accumulating vast and growing amounts of data with data
sources such as flat files, databases, or data warehouses in different storage
formats.
• Processed data is called information. This includes patterns, associations, or
relationships among data. For example, sales data can be analyzed to extract
information such as which product is selling fastest.
• Condensed information is called knowledge. For example, the historical patterns
and future trends obtained from the above sales data can be called knowledge. Unless
knowledge is extracted, data is of no use. Similarly, knowledge is not useful
unless it is put into action.
• Intelligence is applied knowledge for actions. An actionable form of
knowledge is called intelligence. Computer systems have been successful up to this
stage.
• The ultimate objective of the knowledge pyramid is wisdom, which represents the
maturity of mind that is, so far, exhibited only by humans.


The objective of machine learning is to process this archival data so that
organizations can take better decisions, design new products, improve
business processes, and develop effective decision support systems.

1.2 MACHINE LEARNING EXPLAINED


Machine learning is an important sub-branch of Artificial Intelligence (AI). A
frequently quoted definition of machine learning was given by Arthur Samuel, one of the
pioneers of Artificial Intelligence. He stated that “Machine learning is the field of
study that gives computers the ability to learn without being explicitly
programmed.”
The key to this definition is that the system should learn by itself without explicit
programming. How is this possible? It is widely known that to perform a computation,
one needs to write programs that teach the computers how to do that computation.
In conventional programming, after understanding the problem, a detailed
design of the program such as a flowchart or an algorithm needs to be created and
converted into programs using a suitable programming language. This approach
could be difficult for many real-world problems such as puzzles, games, and complex
image recognition applications. Initially, artificial intelligence aimed to understand
these problems and develop general-purpose rules manually. These rules were then
formulated into logic and implemented in a program to create intelligent systems.
This idea of developing intelligent systems by using logic and reasoning, by converting
an expert’s knowledge into a set of rules and programs, is called an expert system. An
expert system like MYCIN was designed for medical diagnosis after converting the
expert knowledge of many doctors into a system. However, this approach did not
progress much as programs lacked real intelligence. The word MYCIN is derived from
the fact that most antibiotic names end with ‘mycin’.
The above approach was impractical in many domains, as programs still
depended on human expertise and hence did not truly exhibit intelligence. Then, the
momentum shifted to machine learning in the form of data-driven systems. The focus
of AI is now to develop intelligent systems by using a data-driven approach, where data is
used as input to develop intelligent models. The models can then be used to make
predictions on new inputs. Thus, the aim of machine learning is to learn a model or set of
rules from the given dataset automatically so that it can predict unknown data correctly.
Just as humans take decisions based on experience, computers make models based
on patterns extracted from the input data and then use these models for
prediction and decision making. For computers, the learnt model is equivalent to
human experience. This is shown in Figure 1.2.


Figure 1.2: (a) A Learning System for Humans (b) A Learning System for Machine Learning
Often, the quality of data determines the quality of experience and, therefore, the quality of the
learning system. In statistical learning, the relationship between the input x and output y is
modeled as a function in the form y = f(x). Here, f is the learning function that maps the input x to
output y. Learning of the function f is the crucial aspect of forming a model in statistical learning. In
machine learning, this is simply called mapping of input to output.
The learning program summarizes the raw data in a model. Formally stated, a model is an explicit
description of patterns within the data in the form of:
1. Mathematical equation
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters
In summary, a model can be a formula, procedure, or representation that can generate decisions from
data. The difference between a pattern and a model is that the former is local and applicable only to
certain attributes but the latter is global and fits the entire dataset. For example, a model can be helpful
to examine whether a given email is spam or not. The point is that the model is generated automatically
from the given data.
Another pioneer of AI, Tom Mitchell, defines machine learning as follows: “A computer
program is said to learn from experience E, with respect to task T and some performance measure
P, if its performance on T, measured by P, improves with experience E.” The important components of this
definition are experience E, task T, and performance measure P.
For example, the task T could be detecting an object in an image. The machine can gain
knowledge of the object using a training dataset of thousands of images; this is the experience E. So,
the focus is to use this experience E for the task of object detection T. The ability of the system to detect
the object is measured by performance measures like precision and recall. Based on the performance
measures, course correction can be done to improve the performance of the system.
Models of computer systems are equivalent to human experience. Experience is based on data.
Humans gain experience by various means. They gain knowledge by rote learning. They observe others
and imitate them. Humans gain a lot of knowledge from teachers and books. We learn many things by trial and
error. Once the knowledge is gained, when a new problem is encountered, humans search for similar
past situations and then formulate heuristics and use them for prediction. But, in systems,
experience is gathered by these steps:
1. Collection of data


2. Once data is gathered, abstract concepts are formed out of that data. Abstraction is used to
generate concepts. This is equivalent to humans’ idea of objects; for example, we have some
idea about what an elephant looks like.
3. Generalization converts the abstraction into an actionable form of intelligence. It can
be viewed as an ordering of all possible concepts. So, generalization involves ranking of concepts,
inferencing from them, and formation of heuristics, an actionable aspect of intelligence.
Heuristics are educated guesses for all tasks. For example, if one runs on encountering danger,
it is the result of human experience or heuristic formation. In machines, it happens
the same way.
4. Heuristics normally work! But, occasionally, they may fail too. That is not the fault of
heuristics, as a heuristic is just a ‘rule of thumb’. The course correction is done by taking
evaluation measures. Evaluation checks the thoroughness of the models and does course
correction, if necessary, to generate better formulations.

1.3 MACHINE LEARNING IN RELATION TO OTHER FIELDS


Machine learning primarily uses the concepts of Artificial Intelligence, Data Science, and Statistics. It is
the result of combined ideas from diverse fields.

1.3.1 Machine Learning and Artificial Intelligence


Machine learning is an important branch of AI, which is a much broader subject. The aim of AI is to
develop intelligent agents. An agent can be a robot, a human, or any autonomous system. Initially, the
idea of AI was ambitious, that is, to develop intelligent systems like human beings. The focus was on
logic and logical inference. AI has seen many ups and downs; the down periods were called AI
winters.
The resurgence in AI happened due to the development of data-driven systems. The aim is to find
relations and regularities present in the data. Machine learning is the sub-branch of AI whose aim is to
extract patterns for prediction. It is a broad field that includes learning from examples and other
areas like reinforcement learning. The relationship of AI and machine learning is shown in Figure 1.3.
The model can take an unknown instance and generate results.
Figure 1.3: Relationship of AI with Machine Learning


Deep learning is a sub-branch of machine learning. In deep learning, the models are constructed using
neural network technology. Neural networks are based on the human neuron model. Many neurons
form a network, connected through activation functions that trigger further neurons to perform tasks.


1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
Data science is an ‘Umbrella’ term that encompasses many fields. Machine learning starts with data.
Therefore, data science and machine learning are interlinked. Machine learning is a branch of data
science. Data science deals with the gathering of data for analysis. It is a broad field that includes:
Big Data: Data science is concerned with the collection of data. Big data is a field of data science that deals
with data having the following characteristics:
1. Volume: Huge amounts of data are generated by big companies like Facebook, Twitter,
and YouTube.
2. Variety: Data is available in a variety of forms like images and videos, and in different formats.
3. Velocity: It refers to the speed at which the data is generated and processed.
Big data is used by many machine learning algorithms for applications such as language translation
and image recognition. Big data influences the growth of subjects like deep learning. Deep learning is
a branch of machine learning that deals with constructing models using neural networks.

Data Mining: Data mining has its genesis in business. Just as mining the earth yields precious
resources, it is often believed that unearthing the data produces hidden information
that otherwise would have eluded the attention of the management. Nowadays, many consider
data mining and machine learning to be the same. There is no difference between these fields except that
data mining aims to extract the hidden patterns that are present in the data, whereas machine learning
aims to use them for prediction.

Data Analytics: Another branch of data science is data analytics. It aims to extract useful knowledge
from raw data. There are different types of analytics. Predictive data analytics is used for making
predictions. Machine learning is closely related to this branch of analytics and shares almost all its
algorithms.
Pattern Recognition: It is an engineering field. It uses machine learning algorithms to extract
features for pattern analysis and pattern classification. One can view pattern recognition as a specific
application of machine learning.
These relations are summarized in Figure 1.4.

Figure 1.4: Relationship of Machine Learning with Other Major Fields

1.3.3 Machine Learning and Statistics


Statistics is a branch of mathematics that has a solid theoretical foundation regarding statistical
learning. Like machine learning (ML), it can learn from data. The difference lies in the approach:
statistical methods look for regularity in data, called patterns, by first setting a hypothesis and then
performing experiments to verify and validate the hypothesis in order to find relationships among data.
Statistics requires knowledge of the statistical procedures and the guidance of a good statistician.
It is mathematics intensive and models are often complicated equations and involve many
assumptions. Statistical methods are developed in relation to the data being analyzed. In addition,
statistical methods are coherent and rigorous. It has strong theoretical foundations and interpretations
that require a strong statistical knowledge.
Machine learning, comparatively, makes fewer assumptions and requires less statistical knowledge.
But it often requires interaction with various tools to automate the process of learning.
Nevertheless, there is a school of thought that machine learning is just the latest version of ‘old
Statistics’ and hence this relationship should be recognized.

1.4 TYPES OF MACHINE LEARNING


What does the word 'learn' mean? Learning, like adaptation, occurs as the result of the interaction of the
program with its environment. It can be compared with the interaction between a teacher and a
student. There are four types of machine learning, as shown in Figure 1.5: supervised, unsupervised,
semi-supervised, and reinforcement learning.

Figure 1.5: Types of Machine Learning


Before discussing the types of learning, it is necessary to discuss data.

Labelled and Unlabeled Data Data is a raw fact. Normally, data is represented in the form of a
table. Data can also be referred to as a data point, sample, or example. Each row of the table
represents a data point. Features are attributes or characteristics of an object. Normally, the columns
of the table are attributes. Out of all attributes, one attribute is important and is called a label. Label is
the feature that we aim to predict. Thus, there are two types of data – labelled and unlabeled.

Labelled Data To illustrate labelled data, let us take one example dataset called the Iris flower dataset,
or Fisher's Iris dataset. The dataset has 150 samples of Iris flowers – 50 from each class – with four
attributes: length and width of sepals and petals. The target variable is called class. There are three
classes – Iris setosa, Iris virginica, and Iris versicolor.
The partial data of Iris dataset is shown in Table 1.1.
Table 1.1: Iris Flower Dataset


S.No.   Length of Sepal   Width of Sepal   Length of Petal   Width of Petal   Class
1.      5.5               4.2              1.4               0.2              Setosa
2.      7.0               3.2              4.7               1.4              Versicolor
3.      7.3               2.9              6.3               1.8              Virginica
A dataset need not always be numbers. It can be images or video frames. Deep neural networks can
handle images with labels. In Figure 1.6, a deep neural network takes images of dogs and cats with
labels for classification.

Figure 1.6: (a) Labelled Dataset (b) Unlabeled Dataset


Unlabeled data, in contrast, has no labels attached to the samples in the dataset.

1.4.1 Supervised Learning


Supervised algorithms use a labelled dataset. As the name suggests, there is a supervisor or teacher
component in supervised learning. A supervisor provides labelled data with which the model is
constructed; the model can then generate predictions for test data.
In supervised learning algorithms, learning takes place in two stages. In layman's terms, during the
first stage, the teacher communicates the information to the student that the student is supposed to
master. The student receives the information and understands it. During this stage, the teacher has no
knowledge of whether the information is grasped by the student.
This leads to the second stage of learning. The teacher then asks the student a set of questions to
find out how much information has been grasped by the student. Based on these questions,
the student is tested, and the teacher informs the student about his assessment. This kind of learning is
typically called supervised learning.
Supervised learning has two methods:
1. Classification
2. Regression


Classification
Classification is a supervised learning method. The input attributes of classification algorithms are
called independent variables. The target attribute is called the label or dependent variable. The
relationship between the input and target variables is represented in the form of a structure called
a classification model. So, the focus of classification is to predict the 'label', which is in a discrete
form (a value from a set of finite values). An example is shown in Figure 1.7, where a classification
algorithm takes a set of labelled images such as dogs and cats to construct a model that can later
be used to classify an unknown test image.


In classification, learning takes place in two stages. During the first stage, called the training stage, the
learning algorithm takes a labelled dataset and starts learning. After the training samples are processed,
the model is generated. In the second stage, the constructed model is given a test or unknown sample
and assigns a label to it. This is the classification process.
This is illustrated in Figure 1.7. Initially, the classification learning algorithm learns with
the collection of labelled data and constructs the model. Then, a test case is selected, and the model
assigns a label.
Similarly, in the case of the Iris dataset, if the test sample is given as (6.3, 2.9, 5.6, 1.8, ?), the
classification model will generate the label for it. This is called classification. One example of
classification is image recognition, which includes classification of diseases like cancer, classification
of plants, etc.
The classification models can be categorized based on the implementation technology, like decision
trees, probabilistic methods, distance measures, and soft computing methods. Classification models
can also be classified as generative models and discriminative models. Generative models deal with
the process of data generation and its distribution. Probabilistic models are examples of
generative models. Discriminative models do not care about the generation of data; instead, they
simply concentrate on classifying the given data.
Some of the key algorithms of classification are listed below; a short worked sketch follows the list:
• Decision Tree
• Random Forest
• Support Vector Machines
• Naïve Bayes
• Artificial Neural Network and Deep Learning networks like CNN
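As an illustration of the classification workflow described above, the following is a minimal sketch that trains a decision tree on the Iris dataset and labels the unknown test sample (6.3, 2.9, 5.6, 1.8). It assumes the third-party scikit-learn library is available; any classification toolkit would serve equally well.

    # A minimal classification sketch (assumes scikit-learn is installed).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Stage 1 (training): learn a model from labelled data.
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=42)
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Stage 2 (testing): assign a label to an unknown sample.
    label = model.predict([[6.3, 2.9, 5.6, 1.8]])[0]
    print(iris.target_names[label])      # expected: 'virginica'
    print(model.score(X_test, y_test))   # accuracy on held-out test data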

Regression Models
Regression models, unlike classification algorithms, predict continuous variables like price; in
other words, the output is a number. A fitted regression model is shown in Figure 1.8 for a dataset that
represents weeks of input x and product sales y.
Figure 1.8: A Regression Model of the Form y = ax + b (regression line y = 0.66x + 0.54; x-axis: week data (x), y-axis: product sales data (y))


The regression model takes input x and generates a model in the form of a fitted line y = f(x). Here,
x is the independent variable, which may be one or more attributes, and y is the dependent variable.
In Figure 1.8, linear regression takes the training set and fits it with a line: product sales = 0.66 ×
week + 0.54. Here, 0.66 and 0.54 are regression coefficients that are learnt from the data. The
advantage of this model is that a prediction for product sales (y) can be made for unknown week


data (x). For example, the prediction for the unknown eighth week can be made by substituting x = 8
in the regression formula, giving y = 0.66 × 8 + 0.54 = 5.82.
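As a sketch of this idea, the fitted line can be recovered with NumPy's least-squares polynomial fit. NumPy is an assumed dependency, and the sales values below are illustrative numbers chosen to reproduce the coefficients 0.66 and 0.54 of Figure 1.8.

    import numpy as np

    weeks = np.array([1, 2, 3, 4, 5])
    sales = np.array([1.2, 1.8, 2.6, 3.2, 3.8])   # illustrative training data

    a, b = np.polyfit(weeks, sales, deg=1)        # least-squares line y = a*x + b
    print(round(a, 2), round(b, 2))               # 0.66 0.54
    print(round(a * 8 + b, 2))                    # prediction for week 8: 5.82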
One of the most important regression algorithms is linear regression that is explained in the next
section.
Both regression and classification models are supervised algorithms. Both have a supervisor, and the
concepts of training and testing are applicable to both. What, then, is the difference between classification
and regression models? The main difference is that regression models predict continuous variables such
as product price, while classification concentrates on assigning discrete labels such as a class.
1.4.2 Unsupervised Learning
The second kind of learning is by self-instruction. As the name suggests, there is no supervisor or
teacher component. In the absence of a supervisor or teacher, self-instruction is the most common kind
of learning process. This process of self-instruction is based on the concept of trial and error.
Here, the program is supplied with objects, but no labels are defined. The algorithm itself observes
the examples and recognizes patterns based on the principles of grouping. Grouping is done in such
a way that similar objects form the same group.
Cluster analysis and Dimensional reduction algorithms are examples of unsupervised algorithms.

Cluster Analysis
Cluster analysis is an example of unsupervised learning. It aims to group objects into disjoint clusters
or groups. Cluster analysis clusters objects based on their attributes. All the data objects of a
partition are similar in some aspect and vary significantly from the data objects in the other partitions.
Some examples of clustering processes are segmentation of a region of interest in an
image, detection of abnormal growth in a medical image, and determining clusters of signatures in a
gene database.
An example of a clustering scheme is shown in Figure 1.9, where the clustering algorithm takes a set
of dog and cat images and groups them into two clusters – dogs and cats. It can be observed that the
samples belonging to a cluster are similar, while samples differ radically across clusters.

Some of the key clustering algorithms are listed below; a small k-means sketch follows the list.

• k-means algorithm
• Hierarchical algorithms
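The following minimal sketch shows k-means grouping unlabelled points into two clusters. It assumes the scikit-learn library, and the points are made-up illustrative values.

    import numpy as np
    from sklearn.cluster import KMeans

    # Unlabelled 2-D points: two loose groups, with no labels supplied.
    points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                       [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(km.labels_)           # cluster ids per point, e.g. [0 0 0 1 1 1]
    print(km.cluster_centers_)  # one centre per discovered group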
Dimensionality Reduction
Dimensionality reduction algorithms are examples of unsupervised algorithms. They take higher-
dimensional data as input and output the data in a lower dimension by taking advantage of the variance
of the data. It is the task of reducing a dataset to a few features without losing generality.


The differences between supervised and unsupervised learning are listed in the following
Table 1.2.
Table 1.2: Differences between Supervised and Unsupervised Learning

S.No.   Supervised Learning                 Unsupervised Learning
1.      There is a supervisor component     No supervisor component
2.      Uses labelled data                  Uses unlabelled data
3.      Assigns categories or labels        Performs a grouping process such that similar objects fall in one cluster

1.4.3 Semi-supervised Learning


There are circumstances where the dataset has a huge collection of unlabelled data and only some
labelled data. Labelling is a costly process and difficult for humans to perform at scale. Semi-supervised
algorithms use the unlabelled data by assigning pseudo-labels. Then, the labelled and pseudo-labelled
datasets can be combined.
1.4.4 Reinforcement Learning
Reinforcement learning mimics human beings. Just as human beings use ears and eyes to perceive the
world and take actions, reinforcement learning allows an agent to interact with the environment to
get rewards. The agent can be a human, animal, robot, or any independent program. The rewards
enable the agent to gain experience. The agent aims to maximize the reward.
The reward can be positive or negative (punishment). When the rewards are more, the behavior gets
reinforced and learning becomes possible.
Consider the following example of a Grid game as shown in Figure 1.10.

Figure 1.10: A Grid Game (tiles marked Block, Goal, and Danger)


In this grid game, the gray tile indicates danger, black is a block, and the tile with diagonal lines
is the goal. The aim is to start, say from the bottom-left grid cell, and use the actions left, right, up and
down to reach the goal state.
To solve this sort of problem, there is no data. The agent interacts with the environment to get
experience. In the above case, the agent tries to create a model by simulating many paths and finding
rewarding paths. This experience helps in constructing a model.
In summary, compared to supervised learning, there is no supervisor or labelled
dataset, and many sequential decisions need to be taken to reach the final decision. Therefore,
reinforcement algorithms are reward-based, goal-oriented algorithms.
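To make the idea concrete, here is a toy tabular Q-learning sketch for a small grid game in plain Python. The grid layout, rewards, and learning parameters are all illustrative assumptions, not the exact game of Figure 1.10.

    import random

    ROWS, COLS = 3, 3
    GOAL, DANGER = (0, 2), (1, 1)           # assumed tile positions
    ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
    Q = {((r, c), a): 0.0 for r in range(ROWS)
         for c in range(COLS) for a in ACTIONS}
    alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

    def step(state, action):
        # Move within the grid; +10 at the goal, -10 at the danger tile,
        # and -1 per ordinary step so that shorter paths are preferred.
        dr, dc = ACTIONS[action]
        r = min(max(state[0] + dr, 0), ROWS - 1)
        c = min(max(state[1] + dc, 0), COLS - 1)
        if (r, c) == GOAL:
            return (r, c), 10, True
        if (r, c) == DANGER:
            return (r, c), -10, True
        return (r, c), -1, False

    for episode in range(500):              # learn by simulating many paths
        state, done = (2, 0), False         # start at the bottom-left tile
        while not done:
            a = (random.choice(list(ACTIONS)) if random.random() < epsilon
                 else max(ACTIONS, key=lambda x: Q[(state, x)]))
            nxt, reward, done = step(state, a)
            best_next = max(Q[(nxt, x)] for x in ACTIONS)
            Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])
            state = nxt

    print(max(ACTIONS, key=lambda x: Q[((2, 0), x)]))  # best first move learnt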

1.5 CHALLENGES OF MACHINE LEARNING


What are the challenges of machine learning? Let us discuss them now.

Problems that can be Dealt with by Machine Learning


Computers are better than humans at performing tasks like computation. For example, while calculating
the square root of large numbers, an average human may struggle, but a computer can display the result
in seconds. Computers can play games like chess and GO, and even beat professional players of those games.
However, humans are better than computers in many aspects, like recognition, although deep learning
systems now challenge human beings in this aspect as well. Machines can recognize human faces in a
second. Still, there are tasks where humans are better, as machine learning systems still require quality
data for model construction. The quality of a learning system depends on the quality of data. This is a
challenge. Some of the challenges are listed below:
1. Problems – Machine learning can deal with 'well-posed' problems, where the specifications are
complete and available. Computers cannot solve 'ill-posed' problems.
Consider one simple example (shown in Table 1.3):
Table 1.3: An Example

Input (x1, x2)   Output (y)
1, 1             1
2, 1             2
3, 1             3
4, 1             4
5, 1             5

Can a model for this data be multiplication, that is, y = x1 * x2? Well, it is true! But it is
equally true that y may be y = x1 / x2, or y = x1 ^ x2 (x1 raised to the power x2). So, there are three
functions that fit the data, which means the problem is ill-posed. To solve this problem, one needs more
examples to check the model. Puzzles and games that do not have sufficient specification may become
ill-posed problems, and scientific computation has many ill-posed problems.


2. Huge data – This is a primary requirement of machine learning. Availability of quality data is
a challenge. Quality data means data that is large and free of problems such as missing or
incorrect values.
3. High computation power – With the availability of Big Data, the computational resource
requirement has also increased. Systems with Graphics Processing Unit (GPU) or even Tensor
Processing Unit (TPU) are required to execute machine learning algorithms. Also, machine
learning tasks have become complex and hence time complexity has increased, and that can
be solved only with high computing power.
4. Complexity of the algorithms – The selection of algorithms, describing the algorithms,
application of algorithms to solve a machine learning task, and comparison of algorithms have
become necessary for machine learning or data scientists now. Algorithms have become a big
topic of discussion and it is a challenge for machine learning professionals to design, select, and
evaluate optimal algorithms.
5. Bias/Variance – Bias is the error due to overly simple assumptions in the model, while variance
is the error due to the model's excessive sensitivity to the training data. Balancing the two leads to
a problem called the bias/variance tradeoff. A model that fits the training data correctly but fails
on test data lacks generalization; this is called overfitting. The reverse problem is called
underfitting, where the model is too simple to fit even the training data well. Overfitting and
underfitting are great challenges for machine learning algorithms (a small sketch follows).
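The tradeoff in point 5 can be made concrete with a small sketch (NumPy assumed): fitting polynomials of degree 1 and degree 9 to a few noisy samples of a sine wave typically shows the degree-9 fit driving training error to nearly zero while its test error blows up, and the degree-1 fit underfitting both.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 10)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)  # noisy samples
    x_test = np.linspace(0, 1, 100)
    y_test = np.sin(2 * np.pi * x_test)                             # true signal

    for degree in (1, 9):                   # underfitting vs overfitting
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(train_err, 4), round(test_err, 4))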

1.6 MACHINE LEARNING PROCESS


The emerging process model for data mining solutions for business organizations is CRISP-DM. Since
machine learning is like data mining, except for the aim, this process can be used for machine learning
as well. CRISP-DM stands for CRoss Industry Standard Process for Data Mining. This process involves
six steps, which are shown in Figure 1.11.

Figure 1.11: A Machine Learning/Data Mining Process (understand the business, understand the data, data preprocessing, modelling, model evaluation, model deployment)


1. Understanding the business – This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm is enough
for giving the solution. This step also involves the formulation of the problem statement for
the data mining process.
2. Understanding the data – It involves steps like data collection, study of the characteristics
of the data, formulation of hypothesis, and matching of patterns to the selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by cleaning the raw data
and preparation of data for the data mining process. The missing values may cause problems
during both training and testing phases. Missing data forces classifiers to produce inaccurate
results. This is a perennial problem for the classification models. Hence, suitable strategies
should be adopted to handle the missing data.
4. Modelling – This step plays a role in the application of the data mining algorithm to the data to
obtain a model or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical analysis
and visualization methods. The performance of a classifier is determined by evaluating its
accuracy. Evaluating the classification process can be a fuzzy issue; for example,
classification of emails requires extensive domain knowledge and domain experts.
Hence, the performance of the classifier is very crucial.
6. Deployment – This step involves the deployment of results of the data mining algorithm to
improve the existing process or for a new situation.

1.7 MACHINE LEARNING APPLICATIONS


Machine Learning technologies are used widely now in different domains. Machine learning applications
are everywhere! One encounters many machine learning applications in the day-to-day life. Some
applications are listed below:
1. Sentiment analysis – This is an application of natural language processing (NLP) where the
words of documents are converted to sentiments like happy, sad, and angry, which are captured
effectively by emoticons. For movie or product reviews, ratings such as five stars or one star are
attached automatically using sentiment analysis programs.
2. Recommendation systems – These are systems that make personalized purchases possible. For
example, Amazon recommends related books or books bought by people who have the same
taste as you, and Netflix suggests shows or related movies of your taste. Recommendation
systems are based on machine learning.
3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and Google
Assistant are all examples of voice assistants. They take speech commands and perform tasks.
These chatbots are the result of machine learning technologies.
4. Technologies like Google Maps and those used by Uber are all examples of machine learning,
which help locate and navigate the shortest paths to reduce travel time.
The machine learning applications are enormous. The following Table 1.4 summarizes some of the
machine learning applications.


Table 1.4: Applications’ Survey Table


S.No.  Problem Domain                Applications
1.     Business                      Predicting the bankruptcy of a business firm
2.     Banking                       Prediction of bank loan defaulters and detecting credit card frauds
3.     Image Processing              Image search engines, object identification, image classification, and generating synthetic images
4.     Audio/Voice                   Chatbots like Alexa and Microsoft Cortana; developing chatbots for customer support, speech to text, and text to voice
5.     Telecommunication             Trend analysis and identification of bogus calls, fraudulent calls and their callers, churn analysis
6.     Marketing                     Retail sales analysis, market basket analysis, product performance analysis, market segmentation analysis, and study of travel patterns of customers for marketing tours
7.     Games                         Game programs for Chess, GO, and Atari video games
8.     Natural Language Translation  Google Translate, text summarization, and sentiment analysis
9.     Web Analysis and Services     Identification of access patterns, detection of e-mail spams and viruses, personalized web services, search engines like Google, detection of promotion of user websites, and finding loyalty of users after web page layout modification
10.    Medicine                      Prediction of diseases given disease symptoms, such as cancer or diabetes; prediction of the effectiveness of a treatment using patient history; chatbots to interact with patients, such as IBM Watson, which uses machine learning technologies
11.    Multimedia and Security       Face recognition/identification, biometric projects like identification of a person from a large image or video database, and applications involving multimedia retrieval
12.    Scientific Domain             Discovery of new galaxies, identification of groups of houses based on house type/geographical location, identification of earthquake epicenters, and identification of similar land use

Key Terms:
• Machine Learning – A branch of AI concerned with enabling machines to learn automatically without
being explicitly programmed.
• Data – A raw fact.
• Model – An explicit description of patterns in a data.
• Experience – A collection of knowledge and heuristics in humans and historical training data in case of
machines.
• Predictive Modelling – A technique of developing models and making a prediction of unseen data.
• Deep Learning – A branch of machine learning that deals with constructing models using neural
networks.
• Data Science – A field of study that encompasses everything from the capture of data to its analysis,
covering all stages of data management.
• Data Analytics – A field of study that deals with analysis of data.


• Big Data – A study of data that has characteristics of volume, variety, and velocity.
• Statistics – A branch of mathematics that deals with learning from data using statistical methods.
• Hypothesis – An initial assumption of an experiment.
• Learning – Adapting to the environment that happens because of interaction of an agent with the
environment.
• Label – A target attribute.
• Labelled Data – A data that is associated with a label.
• Unlabelled Data – A data without labels.
• Supervised Learning – A type of machine learning that uses labelled data and learns with the help of a
supervisor or teacher component.
• Classification – A supervised learning method that takes an unknown input and assigns a label to it.
In simple words, it finds the category or class of the input attributes.
• Regression Analysis – A supervised method that predicts continuous variables based on the input
variables.
• Unsupervised Learning – A type of machine learning that uses unlabelled data and groups objects into
clusters using a trial and error approach.
• Cluster Analysis – A type of unsupervised approach that groups objects based on attributes, so
that similar objects or data points form a cluster.
• Semi-supervised Learning – A type of machine learning that uses limited labelled and large unlabelled
data. It first labels the unlabelled data using the labelled data and combines both for learning purposes.
• Reinforcement Learning – A type of machine learning that uses agent and environment interaction for
creating labelled data for learning.
• Well-posed Problem – A problem that has well-defined specifications. Otherwise, the problem is called
ill-posed.
• Bias/Variance – Bias is the inability of a machine learning algorithm to predict correctly due to a lack
of generalization (overly simple assumptions). Variance is the error due to the model's excessive
sensitivity to the training data. Together these lead to the problems called underfitting and overfitting.
• Model Deployment – A method of deploying machine learning algorithms to improve the existing
business processes for a new situation.

Module 2

2.1 WHAT IS DATA?


All facts are data. In computer systems, bits encode facts present in numbers, text, images, audio,
and video. Data can be directly human interpretable (such as numbers or text) or diffused, such as
images or video, which can be interpreted only by a computer.
Data is available in different data sources like flat files, databases, or data warehouses. It can
be either operational or non-operational data. Operational data is the data encountered in normal
business procedures and processes; for example, daily sales data is operational data. Non-operational
data, on the other hand, is the kind of data that is used for decision making.
Data by itself is meaningless. It has to be processed to generate information. A string of
bytes is meaningless; only when a label is attached, like height of students of a class, does the data
become meaningful. Processed data is called information, and it includes patterns, associations, and
relationships among data. For example, sales data can be analyzed to extract information like which
product sold the most in the last quarter of the year.
Elements of Big Data
Data whose volume is small, and which can be stored and processed by a small-scale computer, is called
'small data'. Such data is collected from several sources, and integrated and processed by a small-scale
computer. Big data, on the other hand, is data whose volume is much larger than that of 'small data'
and is characterized as follows:
1. Volume – Since there has been a reduction in the cost of storage devices, there has been a
tremendous growth of data. Small traditional data is measured in terms of gigabytes (GB) and
terabytes (TB), but big data is measured in terms of petabytes (PB) and exabytes (EB). One exabyte
is 1 million terabytes.
2. Velocity – The fast arrival speed of data and the corresponding increase in data volume are noted
as velocity. The availability of IoT devices and Internet connectivity ensures that data arrives at a fast
rate. Velocity helps to understand the relative growth of big data and its accessibility by users, systems
and applications.
3. Variety – The variety of Big Data includes:
• Form – There are many forms of data. Data types range from text, graph, audio, video, to
maps. There can be composite data too, where one media can have many other sources of
data, for example, a video can have an audio song.
• Function – These are data from various sources like human conversations, transaction
records, and old archive data.
• Source of data – This is the third aspect of variety. There are many sources of data. Broadly,
the data source can be classified as open/public data, social media data and multimodal
data.

Some of the other forms of Vs that are often quoted in the literature as characteristics of
Big data are:
4. Veracity of data – Veracity deals with aspects like conformity to facts, truthfulness,
believability, and confidence in data. There may be many sources of error, such as technical
errors, typographical errors, and human errors. So, veracity is one of the most important
aspects of data.
5. Validity – Validity is the accuracy of the data for taking decisions or for any other goals that
are needed by the given problem.
6. Value – Value is the characteristic of big data that indicates the value of the information that
is extracted from the data and its influence on the decisions that are taken based on it.
Thus, these 6 Vs are helpful to characterize the big data. The data quality of the numeric
attributes is determined by factors like precision, bias, and accuracy.


• Precision is defined as the closeness of repeated measurements. Often, standard deviation is
used to measure the precision.
• Bias is a systematic result due to erroneous assumptions of the algorithms or procedures.
• Accuracy is the degree of measurement of errors that refers to the closeness of measurements
to the true value of the quantity. Normally, the significant digits used to store and manipulate
a value indicate the accuracy of the measurement.

2.1.1 Types of Data


In Big Data, there are three kinds of data. They are structured data, unstructured data, and semi-
structured data.

Structured Data
In structured data, data is stored in an organized manner such as a database where it is available in
the form of a table. The data can also be retrieved in an organized manner using tools like SQL. The
structured data frequently encountered in machine learning are listed below:

Record Data A dataset is a collection of measurements taken from a process. We have a collection
of objects in a dataset, and each object has a set of measurements. The measurements can be
arranged in the form of a matrix. Each row in the matrix represents an object and can be called an entity,
case, or record. The columns of the dataset are called attributes, features, or fields. The table is
filled with observed data. It is also useful to note the general jargon associated with the
dataset: label is the term used to describe an individual observation.

Data Matrix It is a variation of the record type because it consists of numeric attributes. The
standard matrix operations can be applied on these data. The data is thought of as points or vectors
in the multidimensional space where every attribute is a dimension describing the object.

Graph Data It involves the relationships among objects. For example, a web page can refer to
another web page. This can be modeled as a graph, where the nodes are web pages and the hyperlinks
are edges that connect the nodes.

Ordered Data Ordered data objects involve attributes that have an implicit order among them.
The examples of ordered data are:
 Temporal data – It is the data whose attributes are associated with time. For example, the
customer purchasing patterns during festival time is sequential data. Time series data is a
special type of sequence data where the data is a series of measurements over time.

 Sequence data – It is like sequential data but does not have time stamps. This data involves the
sequence of words or letters. For example, DNA data is a sequence of four characters – A T G C.

 Spatial data – It has attributes such as positions or areas. For example, maps are spatial data
where the points are related by location.

Unstructured Data
Unstructured data includes video, image, and audio. It also includes textual documents, programs,
and blog data. It is estimated that 80% of the data are unstructured data.

Semi-Structured Data
Semi-structured data are partially structured and partially unstructured. These include data like
XML/JSON data, RSS feeds, and hierarchical data.

2.1.2 Data Storage and Representation


Once the dataset is assembled, it must be stored in a structure that is suitable for data analysis. The
goal of data storage management is to make data available for analysis. There are different
approaches to organize and manage data in storage files and systems from flat file to data
warehouses. Some of them are listed below:

Flat Files These are the simplest and most commonly available data sources, and the cheapest
way of organizing data. Flat files are files where data is stored in plain ASCII or EBCDIC format.
Minor changes of data in flat files affect the results of the data mining algorithms.
Hence, a flat file is suitable only for storing small datasets and is not desirable if the dataset becomes
large.
Some of the popular spreadsheet formats are listed below:
• CSV files – CSV stands for comma-separated value files where the values are separated by
commas. These are used by spreadsheet and database applications. The first row may have
attributes and the rest of the rows represent the data.
• TSV files – TSV stands for Tab-separated values files, where values are separated by tabs. Both
CSV and TSV files are generic in nature and can be shared. There are many tools, like Google Sheets
and Microsoft Excel, to process these files; a small loading sketch follows.
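As a small illustrative sketch, such flat files can be loaded for analysis with the pandas library (an assumed third-party dependency; the file name 'students.csv' and its columns are hypothetical).

    import pandas as pd

    # The first row of the CSV holds the attribute names;
    # the remaining rows are the data records.
    df = pd.read_csv("students.csv")      # use sep="\t" for a TSV file
    print(df.columns.tolist())            # the attributes
    print(df.head())                      # the first few data rows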
Database System It normally consists of database files and a database management system
(DBMS). Database files contain original data and metadata. DBMS aims to manage data and improve
operator performance by including various tools like database administrator, query processing, and
transaction manager. A relational database consists of sets of tables. The tables have rows and
columns. The columns represent the attributes and rows represent tuples. A tuple corresponds to
either an object or a relationship between objects. A user can access and manipulate the data in the
database using SQL.

Different types of databases are listed below:


1. A transactional database is a collection of transactional records. Each record is a
transaction. A transaction may have a time stamp, an identifier and a set of items, which may have links
to other tables. Normally, transaction databases are created for performing associational analysis
that indicates the correlation among the items.
2. Time-series database stores time related information like log files where data is associated
with a time stamp. This data represents the sequences of data, which represent values or events
obtained over a period (for example, hourly, weekly or yearly) or repeated time span. Observing
sales of product continuously may yield a time-series data.
3. Spatial databases contain spatial information in a raster or vector format. Raster formats are
either bitmaps or pixel maps. For example, images can be stored as a raster data. On the other hand,
the vector format can be used to store maps as maps use basic geometric primitives like points, lines,
polygons and so forth.
World Wide Web (WWW) It provides a diverse, worldwide online information source.
The objective of data mining algorithms is to mine interesting patterns of information present in
WWW.
XML (eXtensible Markup Language) It is both human and machine interpretable data format that
can be used to represent data that needs to be shared across the platforms.
Data Stream It is dynamic data, which flows in and out of the observing environment. Typical
characteristics of data stream are huge volume of data, dynamic, fixed order movement, and real-
time constraints.
RSS (Really Simple Syndication) It is a format for sharing instant feeds across services.
JSON (JavaScript Object Notation) It is another useful data interchange format that is often used for
many machine learning algorithms.

2.2 BIG DATA ANALYTICS AND TYPES OF ANALYTICS


The primary aim of data analysis is to assist business organizations in taking decisions. For example,
a business organization may want to know which is its fastest selling product, so that it can focus its


marketing activities accordingly. Data analysis is an activity that takes the data and generates useful
information and insights for assisting the organizations.
Data analysis and data analytics are terms that are often used interchangeably to refer to the same
concept. However, there is a subtle difference. Data analytics is the more general term, and data analysis
is a part of it. Data analytics refers to the process of data collection, preprocessing and analysis; it
deals with the complete cycle of data management. Data analysis is just the analysis part: it takes
historical data and performs the analysis. Data analytics, in contrast, concentrates more on the future
and helps in prediction.
There are four types of data analytics:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics

Descriptive Analytics It is about describing the main features of the data. After data collection is
done, descriptive analytics deals with the collected data and quantifies it. It is often stated that
analytics is essentially statistics. There are two aspects of statistics – descriptive and inferential.
Descriptive analytics focuses only on the description part of the data, not the inference part.
Diagnostic Analytics It deals with the question – ‘Why?’. This is also known as causal analysis, as
it aims to find out the cause and effect of the events. For example, if a product is not selling,
diagnostic analytics aims to find out the reason. There may be multiple reasons and associated
effects are analyzed as part of it.
Predictive Analytics It deals with the future and with the question, 'What will happen in the
future, given this data?'. It involves the application of algorithms to identify patterns and predict
the future. Machine learning is mostly about predictive analytics, which forms the core of this
subject.
Prescriptive Analytics It is about finding the best course of action for business
organizations. Prescriptive analytics goes beyond prediction and helps in decision making by suggesting
a set of actions. It helps organizations to plan better for the future and to mitigate the risks that
are involved.

2.3 BIG DATA ANALYSIS FRAMEWORK


For performing data analytics, many frameworks have been proposed. All proposed analytics frameworks
have some common factors. A big data framework is a layered architecture; such an architecture has
many advantages, such as generality. A 4-layer architecture has the following layers:
1. Data connection layer
2. Data management layer
3. Data analytics layer
4. Presentation layer
Data Connection Layer It has data ingestion mechanisms and data connectors. Data ingestion
means taking raw data and importing it into appropriate data structures. This layer performs the tasks
of the ETL process, that is, the extract, transform and load operations.

Data Management Layer It performs preprocessing of data. The purpose of this layer is to
allow parallel execution of queries, and read, write and data management tasks. There may be
many schemes that can be implemented by this layer such as data-in-place, where the data is
not moved at all, or constructing data repositories such as data warehouses and pull data
on-demand mechanisms.
Data Analytics Layer It has many functionalities, such as statistical tests and machine learning
algorithms for understanding data and constructing machine learning models. This layer implements
many model validation mechanisms too. The processing is done as shown in Box 2.1.

Presentation Layer It has mechanisms such as dashboards and applications that display the


results of analytical engines and machine learning algorithms.


Thus, the Big Data processing cycle involves data management that consists of the following
steps.
1. Data collection
2. Data preprocessing
3. Applications of machine learning algorithm
4. Interpretation of results and visualization of machine learning algorithm
This is an iterative process and is carried out on a permanent basis to ensure that data is suitable
for data mining.
Application and interpretation of machine learning algorithms constitute the basis for the rest
of the book. So, primarily, data collection and data preprocessing are covered as part of this chapter.

2.3.1 Data Collection


The first task is the collection of data. It is often estimated that most of the time is spent on the
collection of good quality data, and good quality data yields a better result. It is often difficult
to characterize 'good data'. 'Good data' is data that has the following properties:
1. Timeliness – The data should be current and not stale or obsolete.
2. Relevancy – The data should be relevant and ready for the machine learning or data mining
algorithm. All the necessary information should be available and there should be no bias in
the data.
3. Knowledge about the data – The data should be understandable and interpretable, and should
be self-sufficient for the required application as desired by the domain knowledge engineer.

Broadly, the data source can be classified as open/public data, social media data and multimodal
data.
1. Open or public data source – It is a data source that does not have any stringent copyright
rules or restrictions. Its data can be primarily used for many purposes. Government census
data are good examples of open data:
• Digital libraries that have huge amount of text data as well as document images
• Scientific domains with a huge collection of experimental data like genomic data
and biological data
• Healthcare systems that use extensive databases like patient databases, health insurance
data, doctors’ information, and bioinformatics information
2. Social media – It is the data that is generated by various social media platforms like Twitter,
Facebook, YouTube, and Instagram. An enormous amount of data is generated by these
platforms.
3. Multimodal data – It includes data that involves many modes such as text, video, audio
and mixed types. Some of them are listed below:
• Image archives contain larger image databases along with numeric and text data
• The World Wide Web (WWW) has huge amount of data that is distributed on the Internet.
These data are heterogeneous in nature.

2.3.2 Data Preprocessing


In the real world, the available data is 'dirty'. By the word 'dirty', it means:
• Incomplete data
• Inaccurate data
• Outlier data
• Data with missing values
• Data with inconsistent values
• Duplicate data
Data preprocessing improves the quality of the data mining techniques. The raw data must
be preprocessed to give accurate results. The process of detection and removal of errors in data
is called data cleaning. Data wrangling means making the data processable for machine learning
algorithms. Some data errors include human errors, such as typographical errors or incorrect


measurements, and structural errors like improper data formats. Data errors can also arise from
omission and duplication of attributes. Noise is a random component and involves distortion of
a value or the introduction of spurious objects. The term noise is often used when the data has a
spatial or temporal component. Certain deterministic distortions in the form of a streak are known as
artifacts.
Consider, for example, the following patient Table 2.1. The ‘bad’ or ‘dirty’ data can be observed
in this table.

It can be observed that data like Salary = ' ' is incomplete data. The DoB of patients John, Andre, and
Raju is missing data. The age of David is recorded as '5', but his DoB indicates that he was born on
10/10/1980. This is called inconsistent data.

Inconsistent data occurs due to problems in conversions, inconsistent formats, and differences in
units. The salary of John is -1500. Salary cannot be less than '0'; it is an instance of noisy data. Outliers
are data that exhibit characteristics different from other data and have very unusual values. The age
of Raju cannot be 136; it might be a typographical error. It is often required to distinguish between
noise and outlier data.
Outliers may be legitimate data and are sometimes of interest to data mining algorithms. These
errors often arise during the data collection stage. They must be removed so that machine learning
algorithms yield better results, as the quality of results is determined by the quality of the input data.
This removal process is called data cleaning.

Missing Data Analysis


The primary data cleaning process is missing data analysis. Data cleaning routines attempt to fill
up the missing values, smoothen the noise while identifying the outliers and correct the
inconsistencies of the data. This enables data mining to avoid overfitting of the models.
The procedures that are given below can solve the problem of missing data:
1. Ignore the tuple – A tuple with missing data, especially the class label, is ignored. This
method is not effective when the percentage of the missing values increases.
2. Fill in the values manually – Here, the domain expert can analyse the data tables and carry
out the analysis and fill in the values manually. But, this is time consuming and may not
be feasible for larger sets.
3. A global constant can be used to fill in the missing attributes. The missing values may be
filled with a constant like 'Unknown' or 'Infinity'. But some data mining methods may give
spurious results by analysing these labels.
4. The missing value may be filled in with the attribute mean. Say, the average income can replace
a missing income value.
5. Use the attribute mean for all samples belonging to the same class. Here, the average value
replaces the missing values of all tuples that fall in this group.
6. Use the most possible value to fill in the missing value. The most probable value can be
obtained from other methods like classification and decision tree prediction.
Some of these methods introduce bias into the data. The filled value may not be correct and could
be just an estimated value; the difference between the estimated and the original value is
called an error or bias. A small sketch of the mean-based strategies follows.
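As an illustrative sketch of the mean-based strategies (4 and 5 above), the pandas library (an assumed dependency) can fill missing values with the overall attribute mean or with a per-class mean; the small table here is made-up data.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"class":  ["A", "A", "B", "B"],
                       "income": [1000, np.nan, 2000, 2400]})

    # Strategy 4: replace the missing value with the attribute mean.
    overall = df["income"].fillna(df["income"].mean())

    # Strategy 5: use the mean of samples belonging to the same class.
    by_class = df.groupby("class")["income"].transform(
        lambda s: s.fillna(s.mean()))

    print(overall.tolist())   # [1000.0, 1800.0, 2000.0, 2400.0]
    print(by_class.tolist())  # [1000.0, 1000.0, 2000.0, 2400.0]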


Removal of Noisy or Outlier Data


Noise is a random error or variance in a measured value. It can be removed by using binning,
a method where the given data values are sorted and distributed into equal-frequency
bins. The bins are also called buckets. The binning method then uses the neighboring values to
smooth the noisy data.
Some of the techniques commonly used are 'smoothing by bin means', where the mean of the
bin replaces the values of the bin; 'smoothing by bin medians', where the bin median replaces
the bin values; and 'smoothing by bin boundaries', where each bin value is replaced by the closest
bin boundary. The maximum and minimum values of a bin are called bin boundaries. Binning methods
may also be used as a discretization technique. Example 2.1 illustrates this principle.

Example 2.1: Consider the following set: S = {12, 14, 19, 22, 24, 26, 28, 31, 32}. Apply the various
binning techniques and show the result.
Solution: By the equal-frequency bin method, the data should be distributed across bins. Let us
assume bins of size 3; then the above data is distributed across the bins as shown below:
Bin 1: 12, 14, 19
Bin 2: 22, 24, 26
Bin 3: 28, 31, 32
By the smoothing by bin means method, the values are replaced by the bin means. This method results in:
Bin 1: 15, 15, 15
Bin 2: 24, 24, 24
Bin 3: 30.3, 30.3, 30.3
Using the smoothing by bin boundaries method, the bins' values would be:
Bin 1: 12, 12, 19
Bin 2: 22, 22, 26
Bin 3: 28, 32, 32
As per this method, the minimum and maximum values of each bin are determined; they serve
as the bin boundaries and do not change. The rest of the values are transformed to the nearest boundary
value. It can be observed that in Bin 1, the middle value 14 is compared with the boundary values 12
and 19 and changed to the closest value, that is 12. This process is repeated for all bins.
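The three smoothing variants of Example 2.1 can be sketched in a few lines of plain Python:

    # Bins of Example 2.1 after equal-frequency partitioning.
    bins = [[12, 14, 19], [22, 24, 26], [28, 31, 32]]

    # Smoothing by bin means: every value becomes the mean of its bin.
    by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

    # Smoothing by bin boundaries: each value moves to the nearer of min/max.
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)   # [[15.0, 15.0, 15.0], [24.0, 24.0, 24.0], [30.3, 30.3, 30.3]]
    print(by_bounds)  # [[12, 12, 19], [22, 22, 26], [28, 32, 32]]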

Data Integration and Data Transformations


Data integration involves routines that merge data from multiple sources into a single data source.
This may lead to redundant data; the main goal of data integration is to detect and remove the
redundancies that arise from such integration. Data transformation routines perform operations like
normalization to improve the performance of the data mining algorithms. It is necessary to
transform data so that it can be processed; this can be considered a preliminary stage of data
conditioning. Normalization is one such technique. In normalization, the attribute values are
scaled to fit into a range (say 0-1) to improve the performance of the data mining algorithm. These
techniques are often used in neural networks. Some of the normalization procedures used are:
1. Min-Max
2. z-Score
Min-Max Procedure It is a normalization technique where each variable V is normalized by its
difference from the minimum value, divided by the range, and mapped to a new range, say 0-1. Often,
neural networks require this kind of normalization. The formula to implement this normalization is
given as:

V' = ((V - min) / (max - min)) × (new_max - new_min) + new_min    (2.1)

Here, max - min is the range; min and max are the minimum and maximum of the given data, and
new_max and new_min are the maximum and minimum of the target range, say 1 and 0.

Example 2.2: Consider the set V = {88, 90, 92, 94}. Apply the Min-Max procedure and map the marks
to the new range 0-1.

Solution: The minimum of the list V is 88 and the maximum is 94. The new min and new max are
0 and 1, respectively. The mapping can be done using Eq. (2.1) as V' = (V - 88) / (94 - 88).
So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new range {0, 0.33, 0.67, 1}.
Thus, the Min-Max normalization range is between 0 and 1.

z-Score Normalization This procedure works by taking the difference between the field value
and the mean value, and scaling this difference by the standard deviation of the attribute:

V' = (V - m) / s    (2.2)

Here, s is the standard deviation of the list V and m is the mean of the list V.
Example 2.3: Consider the mark list V = {10, 20, 30}; convert the marks to z-scores.
Solution: The mean and sample standard deviation (s) of the list V are 20 and 10, respectively.
So the z-scores of the marks 10, 20, 30, calculated using Eq. (2.2), are -1, 0 and 1, respectively.
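Both procedures can be sketched with NumPy (an assumed dependency), reproducing Examples 2.2 and 2.3:

    import numpy as np

    marks = np.array([88, 90, 92, 94], dtype=float)
    # Min-Max normalization to the range 0-1, as in Eq. (2.1).
    min_max = (marks - marks.min()) / (marks.max() - marks.min())
    print(min_max.round(2))        # [0.   0.33 0.67 1.  ]

    v = np.array([10, 20, 30], dtype=float)
    # z-score as in Eq. (2.2); ddof=1 gives the sample standard deviation.
    z = (v - v.mean()) / v.std(ddof=1)
    print(z)                       # [-1.  0.  1.]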

Data Reduction
Data Reduction
Data reduction reduces the data size while producing essentially the same analytical results. There are
different ways in which data reduction can be carried out, such as data aggregation, feature selection,
and dimensionality reduction.

2.4 DESCRIPTIVE STATISTICS


Descriptive statistics is a branch of statistics that performs dataset summarization. It is used to
summarize and describe data. Descriptive statistics are just descriptive and do not go beyond that;
in other words, descriptive statistics do not bother much about machine learning algorithms and
their functioning.
Let us discuss descriptive statistics with the fundamental concepts of datatypes.
Dataset and Data Types
A dataset can be assumed to be a collection of data objects. The data objects may be records,
points, vectors, patterns, events, cases, samples or observations. These records contain many
attributes. An attribute can be defined as the property or characteristics of an object.


For example, consider the following database shown in sample Table 2.2.

Every attribute should be associated with a value. This process is called measurement.
The type of attribute determines the data types, often referred to as measurement scale types.
The data types are shown in Figure 2.1.

Broadly, data can be classified into two types:


1. Categorical or qualitative data
2. Numerical or quantitative data

Categorical or Qualitative Data The categorical data can be divided into two types. They are
nominal type and ordinal type.
• Nominal Data – In Table 2.2, patient ID is nominal data. Nominal data are symbols and
cannot be processed like numbers. For example, the average of patient IDs does not make
any statistical sense. The nominal data type provides only information and has no ordering
among data. Only operations like (=, ≠) are meaningful for these data. For example, a
patient ID can be checked for equality and nothing else.
• Ordinal Data – It provides enough information and has a natural order. For example, Fever
= {Low, Medium, High} is ordinal data. Certainly, low is less than medium and medium
is less than high, irrespective of the values assigned. Any order-preserving transformation can be
applied to these data to get new values.
Numeric or Quantitative Data It can be divided into two categories: interval type and
ratio type.
• Interval Data – Interval data is numeric data for which the differences between values
are meaningful. For example, there is a meaningful difference between 30 degrees and 40 degrees.
The permissible operations are only + and -.
• Ratio Data – For ratio data, both differences and ratios are meaningful. The difference
between ratio and interval data is the position of zero in the scale. For example,
consider the Centigrade and Fahrenheit temperature scales: the zeroes of the two scales do not
match, so temperatures on these scales are interval data, not ratio data.

Another way of classifying the data is as:
1. Discrete value data
2. Continuous data
Discrete Data This kind of data is recorded as integers. For example, the responses of a survey
can be discrete data. An employee identification number such as 10001 is discrete data.
Continuous Data It can take any value within a range and includes a decimal point. For example,
age is continuous data; though age appears to be discrete, one may be 12.5 years old and it makes
sense. Patient height and weight are continuous data.
A third way of classifying the data is based on the number of variables used in the dataset. Based
on this, the data can be classified as univariate data, bivariate data, and multivariate data. This is
shown in Figure 2.2.

2.5 UNIVARIATE DATA ANALYSIS AND VISUALIZATION


Univariate analysis is the simplest form of statistical analysis. As the name indicates, the dataset
has only one variable; a variable can also be called a category. Univariate analysis does not deal with
causes or relationships. Its aim is to describe the data and find patterns.
Univariate data description involves finding the frequency distribution, central tendency
measures, dispersion or variation, and the shape of the data.

2.5.1 Data Visualization


Let us consider some common forms of graphs used to visualize univariate data.

Bar Chart A bar chart (or bar graph) is used to display the frequency distribution of variables.
Bar charts are used to illustrate discrete data. The charts can also help to explain the counts of
nominal data, and they help in comparing the frequency of different groups.
The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown
in Figure 2.3.

Pie Chart Pie charts are equally helpful in illustrating univariate data. The percentage frequency
distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is shown in Figure 2.4.

It can be observed that the number of students with 22 marks is 2 and the total number of
students is 10. So, 2/10 × 100 = 20% of the space in a pie of 100% is allotted for marks 22 in Figure 2.4.


Histogram It plays an important role in data mining for showing frequency distributions.
The histogram for students' marks {45, 60, 60, 80, 85} in the group ranges 0-25, 26-50, 51-75 and
76-100 is given in Figure 2.5. One can visually inspect from Figure 2.5 that the number of
students in the range 76-100 is 2.

A histogram conveys useful information about the nature of the data and its mode. Mode indicates the
peak of the dataset. In other words, histograms can be used as charts to show frequency, the skewness
present in the data, and its shape.

Dot Plots These are similar to bar charts. They are less cluttered than bar charts,
as they illustrate the bars with single points only. The dot plot of English marks for five students
with IDs {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.6. The advantage
is that, by visual inspection, one can find out who got more marks.
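A small sketch of these univariate plots with the Matplotlib library (an assumed dependency; the data values are the ones used for Figures 2.3-2.5):

    import matplotlib.pyplot as plt

    ids, marks = [1, 2, 3, 4, 5], [45, 60, 60, 80, 85]
    pie_marks = [22, 22, 40, 40, 70, 70, 70, 85, 90, 90]

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].bar(ids, marks)                          # bar chart (Figure 2.3)
    axes[0].set(title="Bar chart", xlabel="Student ID", ylabel="Marks")

    axes[1].hist(marks, bins=[0, 25, 50, 75, 100])   # histogram (Figure 2.5)
    axes[1].set(title="Histogram", xlabel="Marks range")

    values = sorted(set(pie_marks))                  # pie chart (Figure 2.4)
    counts = [pie_marks.count(v) for v in values]
    axes[2].pie(counts, labels=values, autopct="%1.0f%%")
    axes[2].set(title="Pie chart")

    plt.tight_layout()
    plt.show()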

2.5.2 Central Tendency


Raw data in bulk is difficult to interpret directly. Therefore, a condensation or summary of the data is
necessary; this makes data analysis easy and simple. One such summary is called central tendency.
Central tendency can explain the characteristics of data, and that further helps in comparison. Mass
data have a tendency to concentrate around certain values, normally in a central location. This is
called the measure of central tendency (or average). Popular measures are mean, median and mode.
1. Mean – The arithmetic average (or mean) is a measure of central tendency that represents the
'center' of the dataset. Let x1, x2, ..., xN be a set of 'N' values or observations; then the arithmetic
mean, denoted x̄, is given as:

x̄ = (x1 + x2 + ... + xN) / N

For example, the mean of the three numbers 10, 20 and 30 is 20.


• Weighted mean – Unlike the arithmetic mean, which weights all items equally, the weighted
mean gives different importance to items, as item importance varies. Hence, different weightages
can be given to items. In the case of a frequency distribution, the mid values of the ranges are
taken for computation.
In the weighted mean, the mean is computed by adding the products of the proportions and
the group means. It is mostly used when the sample sizes are unequal.

• Geometric mean – Let x1, x2, … , xN be a set of 'N' values or observations. The geometric mean
is the Nth root of the product of the N items:

GM = (x1 × x2 × … × xN)^(1/N)

Here, N is the number of items and xi are the values. For example, if the values are 6 and 8, the
geometric mean is √(6 × 8) = √48 ≈ 6.93. For larger N, computing the product directly is
difficult. Hence, it is usually calculated through logarithms as:

GM = exp( (1/N) Σ log xi )

The problem with the mean is its extreme sensitivity to noise; even small changes in the input
can affect the mean drastically. Hence, for larger datasets, the top few per cent of values (say 2%)
is often chopped off before the mean is calculated (a trimmed mean).

2. Median – The middle value in the distribution is called the median. If the total number of items
in the distribution is odd, the middle value is the median; if it is even, the median is the average
of the two middle values. For grouped (continuous) data, the median class is the class where the
(N/2)th item is present, and the median is given by the formula:

Median = L1 + ((N/2 − cf) / f) × i

Here, i is the class interval of the median class, L1 is the lower limit of the median class, f is the
frequency of the median class, and cf is the cumulative frequency of all classes preceding the
median class.
3. Mode – The mode is the value that occurs most frequently in the dataset. In other words, the
value that has the highest frequency is called the mode.
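These measures can be checked with Python's standard statistics module. A small sketch (the marks list is the pie-chart data above; geometric_mean needs Python 3.8+):

import statistics

marks = [22, 22, 40, 40, 70, 70, 70, 85, 90, 90]
print(statistics.mean(marks))            # arithmetic mean: 59.9
print(statistics.median(marks))          # middle value: 70.0
print(statistics.mode(marks))            # most frequent value: 70
print(statistics.geometric_mean(marks))  # Nth root of the product of the N items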

2.5.3 Dispersion
The spread of a set of data around the central tendency (mean, median or mode) is called
dispersion. Dispersion is represented in various ways such as range, variance, standard deviation,
and standard error. These are second order measures. The most common measures of
dispersion are listed below:

Range Range is the difference between the maximum and minimum of values of the given list
of data.

Standard Deviation The mean does not convey much more than a middle point. For example,
the following datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20. The difference between
these two sets is the spread of data. Standard deviation is the average distance from the mean of
the dataset to each point.


The formula for the population standard deviation is:

σ = √( Σ (xi − μ)² / N )

Here, N is the size of the population, xi is an observation or value from the population and μ is
the population mean. Often, N − 1 is used instead of N in the denominator of Eq. (2.8), which
gives the sample standard deviation.

Quartiles and Inter Quartile Range It is sometimes convenient to subdivide the dataset using
percentiles. The kth percentile is the value Xi such that k% of the data lies at or below Xi. For
example, the median is the 50th percentile and can be denoted as Q0.50. The 25th percentile is
called the first quartile (Q1) and the 75th percentile is called the third quartile (Q3). Another
measure that is useful for dispersion is the Inter Quartile Range (IQR), the difference between
Q3 and Q1:
IQR = Q3 − Q1 (2.9)
Outliers are normally the values falling at least 1.5 × IQR above the third quartile or below the
first quartile.
Equivalently, the IQR is defined as Q0.75 − Q0.25. (2.10)

Example 2.4: For the patients' age list {12, 14, 19, 22, 24, 26, 28, 31, 34}, find the IQR.
Solution: The median is in the fifth position; here, 24 is the median. The first quartile is the
median of the scores below the median, i.e., {12, 14, 19, 22}. Its median is the average of the
second and third values, that is, Q0.25 = (14 + 19)/2 = 16.5. Similarly, the third quartile is the
median of the values above the median, that is, {26, 28, 31, 34}. So, Q0.75 is the average of the
seventh and eighth scores, which is (28 + 31)/2 = 59/2 = 29.5.
Hence, the IQR using Eq. (2.10) is:
IQR = Q0.75 − Q0.25 = 29.5 − 16.5 = 13
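The quartile convention of Example 2.4 (median of the lower and upper halves, excluding the overall median) can be coded directly, as in the plain-Python sketch below. Note that library routines such as numpy.percentile use interpolation conventions that may return slightly different quartiles.

def median(values):
    values = sorted(values)
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
m = median(ages)
q1 = median([v for v in ages if v < m])  # median of scores below the median -> 16.5
q3 = median([v for v in ages if v > m])  # median of scores above the median -> 29.5
print(q1, q3, q3 - q1)                   # 16.5 29.5 13.0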

Five-point Summary and Box Plots The median, quartiles Q1 and Q3, and minimum
and maximum written in the order < Minimum, Q1, Median, Q3, Maximum > is known as
five-point summary.

Example 2.5: Find the 5-point summary of the list {13, 11, 2, 3, 4, 8, 9}.
Solution: The minimum is 2 and the maximum is 13. The Q1, Q2 and Q3 are 3, 8 and 11, respectively.
Hence, 5-point summary is {2, 3, 8, 11, 13}, that is, {minimum, Q1, median, Q3, maximum}. Box plots
are useful for describing 5-point summary. The Box plot for the set is given in
Figure 2.7.

2.5.4 Shape
Skewness and Kurtosis (called moments) indicate the symmetry/asymmetry and peak location of
the dataset.
Skewness
The measures of direction and degree of symmetry are called measures of third order. Ideally,
skewness should be zero, as in an ideal normal distribution. More often, the given dataset may not
have perfect symmetry (consider the following Figure 2.8).

Generally, for a negatively skewed distribution, the median is more than the mean. The relationship
between skew and the relative size of the mean and median can be summarized by a convenient
numerical skew index known as the Pearson 2 skewness coefficient:

Pearson's coefficient of skewness = 3 × (mean − median) / s

Also, the following moment-based measure is more commonly used. Let x1, x2, …, xN be a set of
'N' values or observations; then the skewness is given as:

skewness = (1/N) Σ ( (xi − μ) / σ )³

Here, μ is the population mean and σ is the population standard deviation of the univariate
data. Sometimes, for bias correction, N − 1 is used instead of N.

Kurtosis
Kurtosis indicates the peakedness of the data. If the data has a high peak, it has higher kurtosis
and vice versa. Kurtosis is measured using the formula given below:

kurtosis = Σ (xi − x̄)⁴ / (N × s⁴)

Here, x̄ and s are the mean and standard deviation of the univariate data, respectively; for bias
correction, N − 1 can be used instead of N in Eq. (2.14).
Some of the other useful measures for finding the shape of the univariate dataset are mean
absolute deviation (MAD) and coefficient of variation (CV).

Mean Absolute Deviation (MAD)


MAD is another dispersion measure and is robust to outliers. Normally, an outlier point is
detected by computing its deviation from the median and dividing that by the MAD. Here, the
absolute deviation between the data and the mean is taken. Thus, the mean absolute deviation
is given as:

MAD = (1/N) Σ |xi − x̄|

Coefficient of Variation (CV)


Coefficient of variation is used to compare datasets with different units. CV is the ratio of the
standard deviation to the mean:

CV = s / x̄, and %CV = (s / x̄) × 100

%CV is the percentage form of the coefficient of variation.
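The shape and relative-dispersion measures of this subsection are available in numpy/scipy. A hedged sketch, assuming both libraries are installed and using the marks list from Example 2.5:

import numpy as np
from scipy import stats

x = np.array([13, 11, 2, 3, 4, 8, 9], dtype=float)
print(stats.skew(x))                    # third-moment skewness (0 for perfect symmetry)
print(stats.kurtosis(x, fisher=False))  # fourth-moment kurtosis (peakedness)
mad = np.mean(np.abs(x - x.mean()))     # mean absolute deviation about the mean
cv = x.std() / x.mean() * 100           # %CV: standard deviation as a % of the mean
print(mad, cv)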

2.5.5 Special Univariate Plots


An ideal way to check the shape of the dataset is a stem and leaf plot. A stem and leaf plot is a
display that helps us to know the shape and distribution of the data. In this method, each value is
split into a 'stem' and a 'leaf'; the last digit is usually the leaf and the digits to the left of the leaf
form the stem. For example, the mark 45 is divided into stem 4 and leaf 5 in Figure 2.9.
The stem and leaf plot for the English subject marks, say, {45, 60, 60, 80, 85} is given in Figure 2.9.

It can be seen from Figure 2.9 that the first column is the stem and the second column is the leaf.
For the given English marks, the two students with 60 marks appear in the plot as stem 6 with two
leaves of 0. The normal Q-Q plot for the marks x = [13 11 2 3 4 8 9] is given below in
Figure 2.10.

2.6 BIVARIATE DATA AND MULTIVARIATE DATA


Bivariate data involves two variables. Bivariate data deals with causes and relationships, and the
aim is to find relationships among the data. Consider the following Table 2.3, with data on the
temperature in a shop and the sales of sweaters.

Here, the aim of bivariate analysis is to find relationships among variables. The relationships can
then be used in comparisons, finding causes, and in further explorations. To do that, graphical
display of the data is necessary. One such graph method is called scatter plot.

A scatter plot is used to visualize bivariate data. It is useful to plot two variables with or without
nominal variables, to illustrate the trends, and also to show differences. It is a plot between the
explanatory and response variables: a 2D graph showing the relationship between two variables.

Line graphs are similar to scatter plots. The Line Chart for sales data is shown in Figure 2.12.

2.6.1 Bivariate Statistics


Covariance and correlation are examples of bivariate statistics. Covariance is a measure of the
joint variability of two random variables, say X and Y. Generally, random variables are represented
in capital letters. It is defined as covariance(X, Y) or COV(X, Y) and is used to measure the
variance between the two dimensions:

COV(X, Y) = (1/N) Σ (xi − E(X)) (yi − E(Y))

Here, xi and yi are data values from X and Y, E(X) and E(Y) are the mean values of xi and yi,
and N is the number of data values. Also, COV(X, Y) is the same as COV(Y, X).
Example 2.6: Find the covariance of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.

The covariance between X and Y is 12. It can be normalized to a value between −1 and +1 by
dividing it by the product of the standard deviations of the two variables. The result is called the
Pearson correlation coefficient.


Sometimes, N − 1 can also be used instead of N. In that case, the covariance is 60/4 = 15.

Correlation
The Pearson correlation coefficient is the most common test for determining any association
between two phenomena. It measures the strength and direction of a linear relationship between
the x and y variables.
1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension
decreases.
3. If the value is zero, then it indicates that both the dimensions are independent of each
other.
If the dimensions are correlated, then it is better to remove one dimension as it is a redundant
dimension.
If the given attributes are X = (x1, x2, … , xN) and Y = (y1, y2, … , yN), then the Pearson correlation
coefficient, denoted as r, is given as:

r = COV(X, Y) / (σX σY)

where σX and σY are the standard deviations of X and Y.
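A short numpy sketch (assuming numpy is installed) verifying Example 2.6 and the normalization step:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 4, 9, 16, 25], dtype=float)
cov = np.mean((x - x.mean()) * (y - y.mean()))  # divide by N -> 12.0
r = cov / (x.std() * y.std())                   # COV(X, Y) / (sX * sY)
print(cov, r)   # the same r is returned by np.corrcoef(x, y)[0, 1]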

2.7 MULTIVARIATE STATISTICS


In machine learning, almost all datasets are multivariable. Multivariate data is the analysis of
more than two observable variables, and often, thousands of multiple measurements need to be
conducted for one or more subjects.

Multivariate data has three or more variables. The aims of multivariate analysis are much broader
and include regression analysis, factor analysis and multivariate analysis of variance, which are
explained in the subsequent chapters of this book.

Heatmap
A heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours it:
darker colours indicate larger values and lighter colours indicate smaller values. The advantage
of this method is that humans perceive colours well, so larger values can be grasped quickly from
the colour shading. For example, in vehicle traffic data, heavy traffic regions can be differentiated
from low traffic regions through a heatmap.

In Figure 2.13, patient data highlighting weight and health status is plotted. Here, X-axis
is weights and Y-axis is patient counts. The dark colour regions highlight patients’ weights vs
patient counts in health status.


Pairplot
Pairplot or scatter matrix is a data visual technique for multivariate data. A scatter matrix consists
of several pair-wise scatter plots of variables of the multivariate data.
A random matrix of three columns is chosen and the relationships of the columns is plotted
as a pairplot (or scatter matrix) as shown below in Figure 2.14.

2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA

Machine learning involves many mathematical concepts from the domain of Linear algebra,
Statistics, Probability and Information theory. The subsequent sections discuss important aspects
of linear algebra and probability.

2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data


A linear system of equations is a group of equations with unknown variables.
Let Ax = y. For a single equation, the solution is x = y/A, which is true if A is not zero. The logic
can be extended to an N-set of equations with 'n' unknown variables: if A is the coefficient matrix
and y = (y1, y2, …, yn)^T, then the unknown vector x can be computed as x = A⁻¹y, provided A is
invertible.

If there is a unique solution, then the system is called consistent independent. If there are various
solutions, then the system is called consistent dependent. If there are no solutions and the
equations are contradictory, then the system is called inconsistent.
For solving a large system of equations, Gaussian elimination can be used. The procedure for
applying Gaussian elimination is as follows:
1. Write the given matrix A.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a11 as the pivot and eliminate the entries below it. The entry a21 in the
second row is eliminated using the row operation R2 ← R2 − (a21/a11) R1. The same logic can be
used to remove the corresponding entries in all the other rows.
4. Repeat the same logic for the remaining columns and reduce the matrix to row echelon form.
The last unknown variable is then obtained directly as xn = y'n / a'nn, where the primes denote
the reduced entries.
5. The remaining unknown variables are found by substituting the already-known values into the
rows above, one by one. This part is called backward substitution.

To facilitate the application of the Gaussian elimination method, the following row operations are
applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it
These concepts are illustrated in Example 2.8.
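In practice, elimination is rarely coded by hand; numpy.linalg.solve applies an LU-based elimination with pivoting internally. A sketch, with an assumed 3 × 3 example system:

import numpy as np

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
y = np.array([4.0, 10.0, 24.0])
x = np.linalg.solve(A, y)     # forward elimination + back substitution internally
print(x)
print(np.allclose(A @ x, y))  # True: the solution satisfies the system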


2.8.2 Matrix Decomposition


It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations
can be performed. One such method is eigen decomposition: a square symmetric matrix A can be
decomposed as:

A = Q Λ Q^T

where Q is the matrix of eigen vectors, Λ is the diagonal matrix of eigen values, and Q^T is the
transpose of matrix Q.

LU Decomposition
One of the simplest matrix decompositions is LU decomposition, where the matrix A can be
decomposed into two matrices: A = LU.
Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition
can be done using the Gaussian elimination method as discussed in the previous section. First, an
identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination
are applied to reduce the given matrix to obtain the matrices L and U.
Example 2.9 illustrates the application of Gaussian elimination to get LU.


Now, it can be observed that the first matrix is L, the lower triangular matrix, whose entries are
the multipliers used in the reduction of the equations above (such as 3, 3 and 2/3). The second
matrix is U, the upper triangular matrix, whose entries are the values of the matrix reduced by
Gaussian elimination.
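A hedged sketch using scipy (assuming it is installed) to obtain L and U for the same assumed matrix as above; scipy also returns a permutation matrix P, since partial pivoting may reorder rows during elimination:

import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
P, L, U = lu(A)                   # A = P @ L @ U
print(L)                          # lower triangular: holds the elimination multipliers
print(U)                          # upper triangular: the reduced matrix
print(np.allclose(P @ L @ U, A))  # True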

2.8.3 Machine Learning and Importance of Probability and Statistics


Machine learning is linked with statistics and probability. Like linear algebra, statistics is the heart
of machine learning. The importance of statistics needs to be stressed, as without statistics
meaningful conclusions cannot be drawn from the data.

Probability Distributions
A probability distribution of a variable, say X, summarizes the probability associated with X’s
events. Distribution is a parameterized mathematical function. In other words, distribution is a
function that describes the relationship between the observations in a sample space.
Consider a set of data. The data is said to follow a distribution if it obeys a mathematical
function that characterizes that distribution. The function can be used to calculate the probability
of individual observations.
Probability distributions are of two types:
1.Discrete probability distribution
2.Continuous probability distribution
The relationship between the events of a continuous random variable and their probabilities is
called a continuous probability distribution.

Continuous Probability Distributions Normal, rectangular, and exponential distributions fall
under this category.

1. Normal Distribution – Normal distribution is a continuous probability distribution. It is also
known as the Gaussian distribution or bell-shaped curve distribution. It is the most common
distribution function, and the shape of this distribution is a typical bell-shaped curve. In a normal
distribution, data tends to be around a central value with no bias to the left or right. The heights
of students, the blood pressure of a population, and the marks scored in a class can all be
approximated using a normal distribution.
The PDF of the normal distribution is given as:

f(x) = (1 / (σ√(2π))) e^( −(x − μ)² / (2σ²) )

Here, μ is the mean and σ is the standard deviation. A normal distribution is characterized by
two parameters – mean and variance.
One important concept associated with the normal distribution is the z-score. It can be
computed as:

z = (x − μ) / σ

This is useful to normalize the data.

2. Rectangular Distribution – This is also known as the uniform distribution. It has equal
probabilities for all values in the range [a, b]. The uniform distribution is given as follows:

f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise

3. Exponential Distribution – This is a continuous probability distribution used to describe the
time between events in a Poisson process. The exponential distribution is a special case of the
Gamma distribution with its shape parameter fixed at 1. This distribution is helpful in modelling
the time until an event occurs. The PDF is given as follows:

f(x; λ) = λ e^(−λx) for x ≥ 0, and 0 for x < 0

Here, λ is the rate parameter.

Discrete Distributions Binomial, Poisson, and Bernoulli distributions fall under this category.
1. Binomial Distribution – The binomial distribution is another distribution that is often
encountered in machine learning. Each trial has only two outcomes: success or failure (such a
trial is called a Bernoulli trial). The objective of this distribution is to find the probability of
getting k successes out of n trials. The number of ways of getting k successes out of n trials
is given as:

C(n, k) = n! / (k! (n − k)!)

If p is the probability of success, the probability of failure is (1 − p), and the probability of a
specific sequence of k successes in n trials is p^k (1 − p)^(n − k). Combining both, one gets the
PDF of the binomial distribution as:

P(X = k) = C(n, k) p^k (1 − p)^(n − k)

Here, p is the probability of success, k is the number of successes, and n is the total number of
trials. The mean of the binomial distribution is:

μ = np

And the variance is given as:

σ² = np(1 − p)


Hence, the standard deviation is given as σ = √( np(1 − p) ).

2. Poisson Distribution – This is another important and quite useful distribution. Given an
interval of time, it models the probability of a given number of events k occurring in that
interval. The mean rate λ is the expected number of events per interval. Some examples of the
Poisson distribution are the number of emails received, the number of customers visiting a shop,
and the number of phone calls received by an office. The PDF of the Poisson distribution is
given as follows:

P(X = k) = (λ^k e^(−λ)) / k!
3. Bernoulli Distribution – This distribution models an experiment whose outcome is binary.
The outcome is positive with probability p and negative with probability 1 − p = q. The PMF of
this distribution is given as:

P(X = x) = p^x (1 − p)^(1 − x), x ∈ {0, 1}

The mean is p and the variance is p(1 − p) = pq.
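All six distributions can be evaluated with scipy.stats. A short sketch, assuming scipy is available; the parameter values are illustrative:

from scipy import stats

print(stats.norm(loc=0, scale=1).pdf(0))     # normal PDF at the mean
print(stats.uniform(loc=2, scale=4).pdf(3))  # uniform over [2, 6]: 1/(b - a) = 0.25
print(stats.expon(scale=1/2).pdf(1))         # exponential with rate lambda = 2
print(stats.binom(n=10, p=0.5).pmf(4))       # P(4 successes in 10 trials)
print(stats.poisson(mu=3).pmf(2))            # P(k = 2 events) at mean rate 3
print(stats.bernoulli(p=0.7).pmf(1))         # P(positive outcome) = 0.7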

Density Estimation
Let there be a set of observed values x1, x2, … , xn from a larger set of data whose distribution is
not known. Density estimation is the problem of estimating the density function from such
observed data. There are two types of density estimation methods, namely parametric density
estimation and non-parametric density estimation.

Parametric Density Estimation This assumes that the data comes from a known probabilistic
distribution whose parameters can be estimated from the observations. The maximum likelihood
function is a parametric estimation method.

Maximum Likelihood Estimation For a sample of observations, one can estimate the probability
distribution. This is called density estimation. Maximum Likelihood Estimation (MLE) is a
probabilistic framework that can be used for density estimation. This involves formulating
a function called likelihood function which is the conditional probability of observing the
observed samples and distribution function with its parameters. For example, if the observations
are X = {x1, x2, … , xn}, then density estimation is the problem of choosing a PDF with suitable
parameters to describe the data. MLE treats this problem as a search or optimization problem
where the probability should be maximized for the joint probabilities of X and its parameter, theta.


If one assumes that the regression problem can be framed as predicting the output y given the
input x, then for p(y|x) the MLE framework can be applied, where h is the linear regression
model. If a Gaussian distribution is assumed for the noise (a common assumption, as much real
data is approximately Gaussian), then MLE amounts to maximizing the log-likelihood
Σ log p(yi | h(xi; β)), where β denotes the regression coefficients and xi is the given sample. One
can maximize this function, or equivalently minimize the negative log-likelihood, to provide a
solution for the linear regression problem. Eq. (2.37) yields the same answer as the least-squares
approach.

Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm In machine learning,


clustering is one of the important tasks; it is discussed in Chapter 13. The MLE framework is quite
useful for designing model-based methods for clustering data. A model is a statistical method, and
the data is assumed to be generated by a distribution model with its parameter θ. There may be
many distributions involved, and that is why it is called a mixture model.

Generally, there can be many unspecified distributions with different sets of parameters. The
EM algorithm has two stages:
1. Expectation (E) Stage – In this stage, the expected PDF and its parameters are estimated
for each latent variable.
2. Maximization (M) Stage – In this stage, the parameters are optimized using the MLE function.
This process is iterative, and the iteration is continued till all the latent variables are fitted
by probability distributions effectively along with the parameters.
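A hedged sketch (assuming scikit-learn is installed) of fitting a two-component Gaussian mixture by EM on synthetic 1-D data; the component centres 0 and 5 are assumptions for illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200),
                       rng.normal(5, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)  # E and M steps run internally
gmm.fit(data)
print(gmm.means_.ravel())  # estimated component means, close to 0 and 5
print(gmm.weights_)        # estimated mixing proportions, close to 0.5 each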

Non-parametric Density Estimation A non-parametric estimation can be generative or
discriminative. The Parzen window is a generative estimation method that estimates the
class-conditional density p(x | class), while discriminative methods directly compute the
posterior probability p(class | x). The Parzen window and the k-Nearest Neighbour (KNN) rule
are examples of non-parametric density estimation.
The window function can be replaced by any other function too. If a Gaussian function is used,
then it is called a Gaussian density function.
KNN Estimation The KNN estimation is another non-parametric density estimation method.
Here, the initial parameter k is determined and based on that k-neighbours are determined.
The probability density function estimate is the average of the values that are returned by
the neighbours.

2.9 OVERVIEW OF HYPOTHESIS


Data collection alone is not enough; data must be interpreted to give a conclusion, and the
conclusion should be a structured outcome. This assumed outcome is called a hypothesis.
Statistical methods are used to confirm or reject the hypothesis. The default assumption of a
statistical test is called the null hypothesis, also called hypothesis zero (H0); in other words, it is
the existing belief. The violation of this hypothesis is called the alternate hypothesis (H1), or
hypothesis one. This is the hypothesis the researcher is trying to establish.
There are two types of hypothesis tests: parametric and non-parametric. Parametric tests are
based on parameters such as the mean and standard deviation and assume the data follows a
certain distribution. Non-parametric tests make no such distributional assumption and depend
on characteristics such as the independence of events.
Statistical tests help to:
1. Define the null and alternate hypotheses
2. Describe the hypothesis using parameters
3. Identify the statistical test and test statistic
4. Decide the criterion called the significance value α
5. Compute the p-value (probability value)
6. Take the final decision of accepting or rejecting the hypothesis based on these parameters

Hypothesis testing is particularly important as it is an integral part of the learning algorithms.


Generally, the data size is small. So, one may have to know whether the hypothesis will work for
additional samples and how accurate it is. No matter how effective the statistical tests are, two
kinds of errors are involved, that are Type I and Type II.

A Type I error is the incorrect rejection of a true null hypothesis and is called a false positive.
A Type II error is the failure to reject a false null hypothesis and is called a false negative.
During these calculations, one must include the size of the data sample. The degrees of freedom
indicate the number of independent pieces of information used for the test, indicated as n.
The mean or variance can be used to indicate the degrees of freedom.

Hypothesis Testing
Let us define two important errors called sample error and true (or actual) error. Let us assume
that D is the unknown distribution, the target function is f: X → {0, 1}, x is an instance, h(x) is the
hypothesis, and S is the sample set drawn from X. Then, the actual error is defined as:

error_D(h) = Pr over x drawn from D [ f(x) ≠ h(x) ]

In other words, the true error is the probability that the hypothesis will misclassify an instance
drawn at random. The point is that the population is very large, so the true error cannot be
determined exactly and can only be estimated. The other error, therefore, is the sample error
(or estimator). The sample error is with respect to the sample S: it is the fraction of S that is
misclassified:

error_S(h) = (1/n) Σ over x in S of δ( f(x) ≠ h(x) )

where n is the number of instances in S and δ(·) is 1 when its argument is true and 0 otherwise.

p-value
Statistical tests can be performed to either accept or reject the null hypothesis. This is done using
a value called the p-value, or probability value, which is used to interpret or quantify the test.
For example, a statistical test may give a p-value of 0.03. One can compare it with the level 0.05;
as 0.03 < 0.05, the result is deemed significant, which means that the variables tested are not
independent. Here, 0.05 is called the significance level. In general, the significance level is
denoted by α and the p-value is compared with α: if p-value ≤ α, then the null hypothesis H0 is
rejected in favour of H1, and if p-value > α, then H0 is retained.

Confidence Intervals
The acceptance or rejection of the hypothesis can also be done using a confidence interval.
The confidence level is computed as:
Confidence level = 1 − significance level (2.44)
The confidence interval is the range of values that indicates the location of the true mean, and it
expresses the confidence of the result. If the confidence level is 90%, then it infers that there is a
90% chance that the true mean lies in this range, and the remaining 10% is the chance that it
does not. For finding this interval, one requires the mean and standard deviation; the interval
around x̄ can then be given as:

x̄ ± z × (s / √N)

Here, s is the standard deviation, N is the number of samples, and z is the value associated with
the chosen confidence level (for 90%, z ≈ 1.645). The term z × (s/√N) is also called the margin
of error.
The sample error is an unbiased estimate of the true error; if no other information is provided,
the two are taken to be the same. It is, however, often safer to attach a margin of confidence to
the hypothesis. The bound with 95% confidence on the true error in terms of the sample error
can be given as follows:

error_D(h) = error_S(h) ± 1.96 × √( error_S(h)(1 − error_S(h)) / n )

The value 1.96 corresponds to 95% confidence; it can be replaced by the value associated with
any other level of confidence.
The procedure to estimate the difference between two hypotheses, say h1 and h2, is as follows:
1. A parameter d is chosen to denote the difference between the true errors of the two
hypotheses:
d ≡ error_D(h1) − error_D(h2) (2.46)


Here, the two hypotheses h1 and h2 are tested on two sample sets S1 and S2, with n1 and n2
randomly drawn samples respectively.
2. The estimator d̂ is computed as the difference of the sample errors:
d̂ = error_S1(h1) − error_S2(h2)
3. A confidence interval can be placed around this estimator as well.
Sometimes, it is desirable to find an interval [L, U] such that N% of the probability mass falls
within it.

2.9.1 Comparing Learning Methods


Some of the methods for comparing learning algorithms are given below:
Z-test
The z-test assumes a normal distribution of data whose population variance is known, and the
sample size is assumed to be large. The focus is to test the population mean. The z-statistic is
given as:

z = (x̄ − μ) / (σ / √n)

where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation,
and n is the sample size.

t-test and Paired t-test
The t-test is a hypothesis test that checks whether the difference between two sample means is
real or by chance. Here, the data is continuous and randomly selected, there is only a small
number of samples, and the variance between groups is assumed real. The t-statistic follows a
t-distribution under the null hypothesis and is used when the number of samples is < 30. The
procedure is:
• Select a group
• Compute its average
• Compare it with the theoretical value and compute the t-statistic:

t = (x̄ − μ) / (s / √n)

Here, t is the t-statistic, x̄ is the mean of the group, μ is the theoretical value or population mean,
s is the standard deviation, and n is the group size or sample size.

Independent Two-Sample t-test The t-statistic for two groups A and B is computed as follows:

t = ( mean(A) − mean(B) ) / √( s² (1/N1 + 1/N2) )

Here, mean(A) and mean(B) are the means of the two samples, N1 and N2 are the sample sizes of
the two groups A and B, and s² is the pooled variance of the two samples. The degrees of freedom
are N1 + N2 − 2. The t-statistic is then compared with the t-critical value.

Paired t-test This is used to evaluate a hypothesis before and after an intervention, where the
samples are not independent. For example, consider the effect of medication on a diabetic
patient: first the sugar level is tested, then the medication is given, and the sugar test is conducted
again to study the effect of the medication. In short, in a paired t-test the data is taken from the
same subject twice, whereas in an unpaired t-test the samples are taken independently. Only one
group is involved, and the t-statistic is computed on the differences between the paired
observations:

t = d̄ / (s_d / √n)

Here, d̄ is the mean of the paired differences, s_d is the standard deviation of the differences,
and n is the number of pairs.

Chi-Square Test
The Chi-Square test is a non-parametric test. The goodness-of-fit test statistic follows a
Chi-Square distribution under the null hypothesis and measures the statistical significance of the
difference between observed frequencies and expected frequencies, assuming each observation
is independent of the others. The Chi-Square statistic is calculated as:

χ² = Σ (O − E)² / E

Here, E is the expected frequency, O is the observed frequency, and the degrees of freedom are
C − 1, where C is the number of categories. The Chi-Square test also allows us to detect the
duplication of data and helps to remove the redundancy of values.
Example 2.11: Consider the following Table 2.4, where the machine learning course registration
is done by both boys and girls. There are 50 boys and 50 girls in the class and the registration of
the course is given in the table. Apply Chi-Square test and find out whether any differences exist
between boys and girls for course registration.

Solution: Let the null hypothesis H0 be that there is no difference between boys and girls, and
H1 be the alternate hypothesis that there is a significant difference between boys and girls.
To apply the Chi-Square test, the expected frequency for each cell is obtained from the observed
counts as (row total × column total) / grand total, as shown in Table 2.5.


The degrees of freedom = number of categories − 1 = 2 − 1 = 1. The p-value for this statistic is
0.0412, which is less than 0.05. Therefore, the result is significant and H0 is rejected.
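The same style of test can be run with scipy (a sketch, assuming scipy is installed). The 2 × 2 counts below are illustrative stand-ins, not the actual entries of Table 2.4, which is not reproduced here:

import numpy as np
from scipy import stats

observed = np.array([[30, 20],   # boys: registered, not registered (assumed counts)
                     [20, 30]])  # girls: registered, not registered (assumed counts)

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, p, dof)  # reject H0 when p <= 0.05; dof = 1 for a 2 x 2 table
print(expected)      # expected frequencies: row total x column total / grand total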

2.10 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION TECHNIQUES

Features are attributes. Feature engineering is about determining the subset of features that form
an important part of the input that improves the performance of the model, be it classification or
any other model in machine learning.

Feature engineering deals with two problems – Feature Transformation and Feature Selection.
Feature transformation is extraction of features and creating new features that may be helpful in
increasing performance. For example, the height and weight may give a new attribute called Body
Mass Index (BMI).

Feature subset selection is another important aspect of feature engineering that focuses on
selecting a subset of features to reduce training time, but not at the cost of reliability.
Features can be removed based on two aspects:
1. Feature relevancy – Some features contribute more to classification than others. For example,
a mole on the face can help in face detection more than common features like the nose.
In simple words, the selected features should be relevant.
2. Feature redundancy – Some features are redundant. For example, when a database table
has a field called date of birth, then an age field is redundant, as age can be computed
easily from the date of birth.
So, the procedure is:
1. Generate all possible subsets
2. Evaluate the subsets and model performance
3. Evaluate the results for optimal feature selection

Filter-based selection uses statistical measures for assessing features. In this approach,
no learning algorithm is used. Correlation and information gain measures like mutual information
and entropy are all examples of this approach.

Wrapper-based methods use classifiers to identify the best features. These are selected
and evaluated by the learning algorithms. This procedure is computationally intensive but has
superior performance.

2.10.1 Stepwise Forward Selection


This procedure starts with an empty set of attributes. At each step, the attribute that is
statistically most significant (of best quality) is added to the reduced set. This process is
continued till a good reduced set of attributes is obtained.

2.10.2 Stepwise Backward Elimination


This procedure starts with a complete set of attributes. At every stage, the procedure removes the
worst attribute from the set, leading to the reduced set.

2.10.3 Principal Component Analysis


The idea of principal component analysis (PCA), or the KL transform, is to transform a given
set of measurements to a new set of features so that the features exhibit high information
packing properties. This leads to a reduced and compact set of features.
Consider a group of random vectors of the form:

x = (x1, x2, …, xn)^T


The mean vector of the set of random vectors is defined as:

m = E{x}

The operator E refers to the expected value of the population. This is calculated theoretically
using the probability density functions (PDF) of the elements xi and the joint probability density
functions between the elements xi and xj. From this, the covariance matrix can be calculated as:

C = E{ (x − m)(x − m)^T }

The mapping of the vectors x to y using the transformation can now be described as:

y = A(x − m)

where the rows of A are the eigen vectors of the covariance matrix. This transform is also called
the Karhunen-Loeve or Hotelling transform. The original vector x can now be reconstructed as:

x = A^T y + m

If only the K largest eigen values (and their eigen vectors, forming A_K) are used, the recovered
information would be:

x̂ = A_K^T y + m

The PCA algorithm is as follows:
1. The target dataset x is obtained.
2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is
x − m. The objective of this step is to transform the dataset to zero mean.
3. The covariance of the dataset x is obtained. Let it be C.
4. The eigen values and eigen vectors of the covariance matrix are calculated.
5. The eigen vector of the highest eigen value is the principal component of the dataset.
The eigen values are arranged in descending order, and the feature vector is formed with these
eigen vectors in its columns:
Feature vector = {eigen vector1, eigen vector2, … , eigen vectorn}
6. Obtain the transpose of the feature vector. Let it be A.
7. The PCA transform is y = A × (x − m), where x is the input dataset, m is the mean, and A is
the transpose of the feature vector.
The original data can be retrieved using the formula given below:

x = A^T × y + m

since for an orthonormal feature vector the inverse of A equals its transpose. The new data is a
dimensionally reduced matrix that represents the original data.
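A minimal numpy sketch of the algorithm above (random data is assumed purely for illustration): mean-centre, covariance, eigen decomposition, projection onto the top-K eigen vectors, and reconstruction.

import numpy as np

x = np.random.default_rng(0).normal(size=(100, 3))  # 100 samples, 3 features

m = x.mean(axis=0)               # step 2: mean vector
C = np.cov((x - m).T)            # step 3: covariance matrix
vals, vecs = np.linalg.eigh(C)   # step 4: eigen values/vectors (symmetric C)
order = np.argsort(vals)[::-1]   # step 5: descending eigen values
A = vecs[:, order[:2]].T         # step 6: transpose of the feature vector, K = 2
y = (x - m) @ A.T                # step 7: y = A(x - m), applied row-wise
x_recovered = y @ A + m          # approximate retrieval from the K components
print(y.shape, x_recovered.shape)  # (100, 2) (100, 3)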


The scree plot of eigen values is shown in Figure 2.15. From it, one can infer the relevance of the
attributes: only 6 out of 246 attributes are important, and the first attribute is more important
than all the others.

2.10.4 Linear Discriminant Analysis


Linear Discriminant Analysis (LDA) is also a feature reduction technique like PCA. The focus of
LDA is to project higher dimensional data onto a line (a lower dimensional space); LDA is also
used to classify the data. Let there be two classes, c1 and c2, and let m1 and m2 be the means of
the patterns of the two classes, computed as:

mi = (1/Ni) Σ over x in ci of x

where Ni is the number of patterns in class ci. The aim of LDA is to optimize the Fisher criterion:

J(w) = (w^T S_B w) / (w^T S_W w)

where S_B is the between-class scatter and S_W is the within-class scatter.

2.10.5 Singular Value Decomposition


Singular Value Decomposition (SVD) is another useful decomposition technique. Let A be a
given matrix; then A can be decomposed as:

A = U S V^T

Here, A is the given matrix of dimension m × n, U is the orthogonal matrix of dimension m × n,
S is the diagonal matrix of singular values of dimension n × n, and V is the orthogonal matrix of
dimension n × n. The procedure for finding the decomposition matrices is as follows:
1. For the given matrix, find AA^T.
2. Find the eigen values of AA^T.
3. Sort the eigen values in descending order. Pack the corresponding eigen vectors as a matrix U.
4. Arrange the square roots of the eigen values (the singular values) in a diagonal; this is the
diagonal matrix S.
5. Find the eigen values and eigen vectors of A^TA, and pack the eigen vectors as a matrix
called V.
Thus, A = U S V^T. Here, U and V are orthogonal matrices, and the columns of U and V are the
left and right singular vectors, respectively. SVD is useful in compression, as one can decide to
retain only the k largest singular components instead of the original matrix A:

A ≈ Σ from i = 1 to k of si ui vi^T

Based on the choice of retention k, the compression can be controlled.
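A sketch of SVD-based compression with numpy (the matrix is random, for illustration only): keep the k largest singular values and reconstruct an approximation of A.

import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt
k = 2                                             # the retention choice controls compression
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation of A
print(np.linalg.norm(A - A_k))                    # reconstruction error shrinks as k grows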


MODULE 3
CHAPTER 3
BASICS OF LEARNING THEORY
3.1 INTRODUCTION TO LEARNING AND ITS TYPES
Learning is a process by which one can acquire knowledge and construct new ideas or
concepts based on experiences.
The standard definition of learning proposed by Tom Mitchell is: a program is said to learn
from experience E with respect to a task T and performance measure P, if its performance at T,
as measured by P, improves with experience E.
There are two kinds of problems – well-posed and ill-posed. Computers can solve only well-
posed problems, as these have well-defined specifications and have the following
components inherent to them:
1. Class of learning tasks (T) 2. A measure of performance (P) 3. A source of experience (E)
Let x be the input and X the input space, the set of all possible inputs. Let Y be the output space,
the set of all possible outputs (e.g., yes/no). Let D be the dataset of n input examples. The
unknown target function f: X → Y maps each input to an output.
Objective: To pick a function g: X → Y that best approximates the target function f.

Fig: Learning Environment

Learning model= Hypothesis set + Learning algorithm


Classical and Adaptive ML systems.


Classical machines examine data inputs according to a predetermined set of rules, finding
patterns and relationships that can be used to generate predictions or choices. Support vector
machines, decision trees, and logistic regression are some of the most used classical machine
learning techniques.
Adaptive machines, commonly associated with adaptive or deep learning, are a class of machine
learning techniques created to learn automatically from data inputs without being explicitly
programmed. By learning hierarchical representations of the input, these algorithms are able to
handle more complex and unstructured data, such as photos, videos, and natural language.
Adaptive ML is the next generation of traditional ML – the new, the improved, the better – even
though traditional ML has witnessed significant progress.

Learning Types

3.2 INTRODUCTION TO COMPUTATION LEARNING THEORY

Questions such as whether a concept is learnable, how many training examples are needed, and
how much computational effort learning requires are the basis of a field called 'Computational
Learning Theory', or COLT in short.


3.3 DESIGN OF A LEARNING SYSTEM

3.4 INTRODUCTION TO CONCEPT LEARNING

3.4.1 Representation of a Hypothesis

3.4.2 Hypothesis Space


Hypothesis space is the set of all possible hypotheses that approximate the target function f.
The subset of the hypothesis space that is consistent with all observed training instances is
called the version space.

3.4.3 Heuristic Space Search


Heuristic search is a search strategy that finds an optimized hypothesis/solution to a problem
by iteratively improving the hypothesis/solution based on a given heuristic function or a cost
measure.


3.4.4 Generalization and Specialization


Searching the Hypothesis Space

There are two ways of learning the hypothesis, consistent with all training instances from the
large hypothesis space.

1. Specialization – General to Specific learning


2. Generalization – Specific to General learning
Generalization – Specific to General Learning This learning methodology will search
through the hypothesis space for an approximate hypothesis by generalizing the most specific
hypothesis.

Specialization – General to Specific Learning This learning methodology will search


through the hypothesis space for an approximate hypothesis by specializing the most general
hypothesis.

3.4.5 Hypothesis Space Search by Find-S Algorithm

Limitations of Find-S Algorithm


3.4.6 Version Spaces

List-Then-Eliminate Algorithm

Candidate Elimination Algorithm


The diagrammatic representation of deriving the version space is shown below:

Deriving the Version Space


MODULE 3
CHAPTER 4

SIMILARITY-BASED LEARNING
4.1 Similarity or Instance-based Learning

4.1.1 Difference between Instance-and Model-based Learning

Some examples of Instance-based Learning algorithms are:

a) KNN
b) Variants of KNN
c) Locally weighted regression
d) Learning vector quantization
e) Self-organizing maps
f) RBF networks

Nearest-Neighbor Learning
 A powerful classification algorithm used in pattern recognition.
 k-Nearest Neighbors stores all available cases and classifies new cases based on a
similarity measure (e.g., a distance function).
 One of the top data mining algorithms used today.
 A non-parametric, lazy learning algorithm (an instance-based learning method).

 Used for both classification and regression problems.

Here, there are two classes of objects, called C1 and C2. When given a test instance T, the
category of this test instance is determined by looking at the class of its k = 3 nearest neighbors.
Thus, the class of the test instance T is predicted as C2.


Algorithm 4.1: k-NN

4.3 Weighted k-Nearest-Neighbor Algorithm


The weighted kNN is an extension of k-NN. It chooses the neighbors by using the weighted
distance. In weighted kNN, the nearest k points are given a weight using a function called the
kernel function. The intuition behind weighted kNN is to give more weight to the points
which are nearby and less weight to the points which are farther away.
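A hedged scikit-learn sketch (assuming the library is installed, with a synthetic dataset) contrasting plain k-NN with weighted k-NN, where closer neighbours get larger, inverse-distance weights:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

plain = KNeighborsClassifier(n_neighbors=3)                         # majority vote of k = 3
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance")  # inverse-distance weights
plain.fit(X, y)
weighted.fit(X, y)
print(plain.predict(X[:1]), weighted.predict(X[:1]))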


4.4 Nearest Centroid Classifier


The Nearest Centroid algorithm assumes that the centroids in the input feature space are
different for each target label. The training data is split into groups by class label, and then the
centroid for each group is calculated. Each centroid is simply the mean value of each of the input
variables for that group, so the method is also called the Mean Difference classifier. If there are
two classes, then two centroids or points are calculated; three classes give three centroids, and
so on.
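A minimal sketch of the nearest centroid classifier with scikit-learn (synthetic data assumed for illustration); one centroid, the per-class mean of each input variable, is computed per label:

from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

clf = NearestCentroid()
clf.fit(X, y)              # computes one centroid per class label
print(clf.centroids_)      # the per-class mean vectors
print(clf.predict(X[:3]))  # each point gets the label of its nearest centroid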

4.5 Locally Weighted Regression (LWR)

In LWR, a local linear model is fitted around a query point x, with each training point xi weighted
by its closeness to x. A commonly used weighting function is the Gaussian kernel:

wi = exp( −(xi − x)² / (2τ²) )

where τ is called the bandwidth parameter and controls the rate at which wi reduces to zero
with distance from xi.


MODULE 3
CHAPTER 5
REGRESSION ANALYSIS
5.1 Introduction to Regression
Regression analysis is a fundamental concept that consists of a set of machine learning methods
that predict a continuous outcome variable (y) based on the value of one or multiple predictor
variables (x).
OR
Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables.
Regression is a supervised learning technique which helps in finding the correlation between
variables.
It is mainly used for prediction, forecasting, time series modelling, and determining the cause-
effect relationship between variables.
Regression shows a line or curve passing through the datapoints on a target-predictor graph in
such a way that the vertical distance between the datapoints and the regression line is minimum.
The distance between the datapoints and the line tells whether the model has captured a strong
relationship or not.
• Function of regression analysis is given by:
Y=f(x)
Here, y is called dependent variable and x is called independent variable.
Applications of Regression Analysis
 Sales of goods or services
 Value of bonds in portfolio management
 Premiums on insurance policies
 Yield of crop in agriculture
 Prices of real estate

5.2 INTRODUCTION TO LINEARITY, CORRELATION AND CAUSATION


A correlation is the statistical summary of the relationship between two sets of variables. It is
a core part of data exploratory analysis, and is a critical aspect of numerous advanced machine
learning techniques.
Correlation between two variables can be found using a scatter plot
There are different types of correlation:


Positive Correlation: Two variables are said to be positively correlated when their values
move in the same direction. For example, in the image below, as the value for X increases, so
does the value for Y at a constant rate.
Negative Correlation: Finally, variables X and Y will be negatively correlated when their
values change in opposite directions, so here as the value for X increases, the value for Y
decreases at a constant rate.
Neutral Correlation: No relationship in the change of variables X and Y. In this case, the
values are completely random and do not show any sign of correlation, as shown in the
following image:

Causation
Causation is about the relationship between two variables: x causes y. This is read as 'x implies y'.
Regression is different from causation. Causation indicates that one event is the result of the
occurrence of the other event, i.e., there is a causal relationship between the two events.
Linear and Non-Linear Relationships
The relationship between input features (variables) and the output (target) variable is
fundamental. These concepts have significant implications for the choice of algorithms, model
complexity, and predictive performance.
Linear relationship creates a straight line when plotted on a graph, a Non-Linear relationship
does not create a straight line but instead creates a curve.
Example:
Linear – the relationship between the hours spent studying and the grades obtained in a class.
Non-Linear – for instance, the growth of a population over time, which follows a curve rather
than a straight line.
Linearity:
Linear Relationship: A linear relationship between variables means that a change in one
variable is associated with a proportional change in another variable. Mathematically, it can be
represented as y = a * x + b, where y is the output, x is the input, and a and b are constants.


Linear Models: Goal is to find the best-fitting line (plane in higher dimensions) to the data
points. Linear models are interpretable and work well when the relationship between variables
is close to being linear.
Limitations: Linear models may perform poorly when the relationship between variables is
non-linear. In such cases, they may underfit the data, meaning they are too simple to capture
the underlying patterns.
Non-Linearity:
Non-Linear Relationship: A non-linear relationship implies that the change in one variable is
not proportional to the change in another variable. Non-linear relationships can take various
forms, such as quadratic, exponential, logarithmic, or arbitrary shapes.
Non-Linear Models: Machine learning models like decision trees, random forests, support
vector machines with non-linear kernels, and neural networks can capture non-linear
relationships. These models are more flexible and can fit complex data patterns.
Benefits: Non-linear models can perform well when the underlying relationships in the data
are complex or when interactions between variables are non-linear. They have the capacity to
capture intricate patterns.

Types of Regression


Linear Regression:
Single Independent Variable: Linear regression, also known as simple linear regression, is
used when there is a single independent variable (predictor) and one dependent variable
(target).
Equation: The linear regression equation takes the form: Y = β0 + β1X + ε, where Y is the
dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope
(coefficient), and ε is the error term.
Purpose: Linear regression is used to establish a linear relationship between two variables and
make predictions based on this relationship. It's suitable for simple scenarios where there's only
one predictor.
Multiple Regression:
Multiple Independent Variables: Multiple regression, as the name suggests, is used when there
are two or more independent variables (predictors) and one dependent variable (target).
Equation: The multiple regression equation extends the concept to multiple predictors: Y = β0
+ β1X1 + β2X2 + ... + βnXn + ε, where Y is the dependent variable, X1, X2, ..., Xn are the
independent variables, β0 is the intercept, β1, β2, ..., βn are the coefficients, and ε is the error
term.
Purpose: Multiple regression allows you to model the relationship between the dependent
variable and multiple predictors simultaneously. It's used when there are multiple factors that
may influence the target variable, and you want to understand their combined effect and make
predictions based on all these factors.
Polynomial Regression:
Use: Polynomial regression is an extension of multiple regression used when the relationship
between the independent and dependent variables is non-linear.
Equation: The polynomial regression equation allows for higher-order terms, such as quadratic
or cubic terms: Y = β0 + β1X + β2X^2 + ... + βnX^n + ε. This allows the model to fit a curve
rather than a straight line.
Logistic Regression:
Use: Logistic regression is used when the dependent variable is binary (0 or 1). It models the
probability of the dependent variable belonging to a particular class.
Equation: Logistic regression uses the logistic function (sigmoid function) to model
probabilities: P(Y=1) = 1 / (1 + e^(-z)), where z is a linear combination of the independent
variables: z = β0 + β1X1 + β2X2 + ... + βnXn. It transforms this probability into a binary
outcome.
Lasso Regression (L1 Regularization):
Use: Lasso regression is used for feature selection and regularization. It penalizes the absolute
values of the coefficients, which encourages sparsity in the model.


Objective Function: Lasso regression adds an L1 penalty to the linear regression loss function:
Lasso = RSS + λΣ|βi|, where RSS is the residual sum of squares, λ is the regularization strength,
and |βi| represents the absolute values of the coefficients.
Ridge Regression (L2 Regularization):
Use: Ridge regression is used for regularization to prevent overfitting in multiple regression. It
penalizes the square of the coefficients.
Objective Function: Ridge regression adds an L2 penalty to the linear regression loss function:
Ridge = RSS + λΣ(βi^2), where RSS is the residual sum of squares, λ is the regularization
strength, and (βi^2) represents the square of the coefficients.

Limitations of Regression

5.3 INTRODUCTION TO LINEAR REGRESSION


A linear regression model can be created by fitting a line among the scattered data points. The
line is of the form:

y = a0 + a1 x

where a0 is the intercept and a1 is the slope.


Ordinary Least Square Approach


The ordinary least squares (OLS) algorithm is a method for estimating the parameters of a
linear regression model. Aim: To find the values of the linear regression model's parameters
(i.e., the coefficients) that minimize the sum of the squared residuals.
In mathematical terms, this can be written as: Minimize ∑(yi – ŷi)^2

where yi is the actual value, ŷi is the predicted value.


A linear regression model used for determining the value of the response variable, ŷ, can be
represented as the following equation:
y = b0 + b1x1 + b2x2 + … + bnxn + e
 where y is the dependent variable, b0 is the intercept, and e is the error term
 b1, b2, …, bn are the regression coefficients of the independent variables x1, x2, …, xn
The goal of the OLS method is to estimate the unknown parameters (b0, b1, …, bn) by
minimizing the sum of squared residuals (RSS). The sum of squared residuals is also termed the
sum of squared errors (SSE).
This method is also known as the least-squares method for regression or linear regression.
Mathematically, the equations of the line for the points are:
y1 = (a0 + a1x1) + e1
y2 = (a0 + a1x2) + e2, and so on,
yn = (a0 + a1xn) + en.

In general, ei = yi − (a0 + a1xi).


Linear Regression Example


Linear Regression in Matrix Form


5.4 VALIDATION OF REGRESSION METHODS


The regression should be evaluated using some metrics for checking the correctness. The
following metrics are used to validate the results of regression.


Coefficient of Determination
The coefficient of determination (R² or r-squared) is a statistical measure in a regression model
that determines the proportion of variance in the dependent variable that can be explained by
the independent variable.
The sum of the squares of the differences between the y-value of a data pair and the average of
y is called the total variation. The following variations can then be defined:
The explained variation is given by Σ( ŷi − mean(yi) )²
The unexplained variation is given by Σ( yi − ŷi )²
Thus, the total variation is equal to the explained variation plus the unexplained variation.
The coefficient of determination r² is the ratio of the explained variation to the total variation:

r² = explained variation / total variation



CHAPTER 5

REGRESSION ANALYSIS

5. Consider the following dataset in Table 5.11 where the week and number of working hours per
week spent by a research scholar in a library are tabulated. Based on the dataset, predict the
number of hours that will be spent by the research scholar in the 7th and 9th week. Apply Linear
regression model.

Table 5.11

xi (week):        1    2    3    4    5
yi (hours spent): 12   18   22   28   35

Solution

The computation table is shown below:

xi	yi	xi · xi	xi · yi
1	12	1	12
2	18	4	36
3	22	9	66
4	28	16	112
5	35	25	175
Sum = 15	Sum = 115	Sum = 55	Sum = 401
avg(xi) = 15/5 = 3	avg(yi) = 115/5 = 23	avg(xi · xi) = 55/5 = 11	avg(xi · yi) = 401/5 = 80.2

The regression equations are

a1 = (avg(xi · yi) - avg(xi) · avg(yi)) / (avg(xi²) - (avg(xi))²)

a0 = avg(yi) - a1 · avg(xi)

a1 = (80.2 - 3(23)) / (11 - 3²) = (80.2 - 69) / (11 - 9) = 11.2 / 2 = 5.6

a0 = 23 - 5.6 × 3 = 23 - 16.8 = 6.2

Therefore, the regression equation is given as

y = 6.2 + 5.6 × x

The prediction for the hours spent by the research scholar in the 7th week will be

y = 6.2 + 5.6 × 7 = 45.4 ≈ 45 hours

The prediction for the hours spent by the research scholar in the 9th week will be

y = 6.2 + 5.6 × 9 = 56.6 ≈ 57 hours
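This computation is easy to check in a few lines of Python. The sketch below simply re-applies the average-based formulas for a1 and a0 to the data of Table 5.11 (variable names are illustrative):

x = [1, 2, 3, 4, 5]
y = [12, 18, 22, 28, 35]

avg = lambda v: sum(v) / len(v)
a1 = (avg([xi * yi for xi, yi in zip(x, y)]) - avg(x) * avg(y)) / \
     (avg([xi * xi for xi in x]) - avg(x) ** 2)      # slope = 5.6
a0 = avg(y) - a1 * avg(x)                            # intercept = 6.2

predict = lambda week: a0 + a1 * week
print(predict(7), predict(9))   # predicted hours for weeks 7 and 9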

6. The height of boys and girls is given in the following Table5.12.

Table 5.12: Sample Data

Height of Boys 65 70 75 78

Height of Girls 63 67 70 73

Fit a suitable line of best fit for the above data.

Solution

The computation table is shown below:

xi	yi	xi · xi	xi · yi
65	63	4225	4095
70	67	4900	4690
75	70	5625	5250
78	73	6084	5694
Sum = 288	Sum = 273	Sum = 20834	Sum = 19729
mean(xi) = 288/4 = 72	mean(yi) = 273/4 = 68.25	avg(xi · xi) = 20834/4 = 5208.5	avg(xi · yi) = 19729/4 = 4932.25

The regression equations are

a1 = (avg(xi · yi) - avg(xi) · avg(yi)) / (avg(xi²) - (avg(xi))²)

a0 = avg(yi) - a1 · avg(xi)

a1 = (4932.25 - 72(68.25)) / (5208.5 - 72²) = 18.25 / 24.5 = 0.7449

a0 = 68.25 - 0.7449 × 72 = 68.25 - 53.6328 = 14.6172

Therefore, the regression line of best fit is given as

y = 14.6172 + 0.7449 × x

7. Using multiple regression, fit a line for the following dataset shown in Table 5.13.
Here, Z is the equity, X is the net sales and Y is the asset. Z is the dependent variable
and X and Y are independent variables. All the data is in million dollars.

Table 5.13: Sample Data

Z X Y

4 12 8

6 18 12

7 22 16

8 28 36

11 35 42

Solution

The matrices X and Y are given as follows:

X = [1  12   8
     1  18  12
     1  22  16
     1  28  36
     1  35  42]

Y = [4
     6
     7
     8
     11]
The regression coefficients can be found as follows:

â = ((XᵀX)⁻¹ Xᵀ) Y

Substituting the values, one gets

XᵀX = [5    115   114
       115  2961  3142
       114  3142  3524]

XᵀY = [36
       919
       966]

â = (XᵀX)⁻¹ XᵀY = [-0.4135
                    0.39625
                   -0.0658]
Therefore, the regression line is given as

y = -0.4135 + 0.39625 x1 - 0.0658 x2
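The matrix solution can be verified numerically. A minimal NumPy sketch forming the normal equations â = (XᵀX)⁻¹XᵀY for the data of Table 5.13:

import numpy as np

X = np.array([[1, 12,  8],
              [1, 18, 12],
              [1, 22, 16],
              [1, 28, 36],
              [1, 35, 42]], dtype=float)   # ones column, net sales, assets
Y = np.array([4, 6, 7, 8, 11], dtype=float)

a_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # numerically safer than an explicit inverse
print(a_hat)   # approximately [-0.4135, 0.39625, -0.0658]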

***

MODULE 4
CHAPTER 6
DECISION TREE LEARNING
6.1 Introduction

 Why is it called a decision tree?

 Because it starts from a root node and, like a tree, branches out into a number of possible solutions.
 The benefits of having a decision tree are as follows:
 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.
 Example: a toll-free number menu, where each answer decides the next question.

6.1.1 Structure of a Decision Tree A decision tree is a structure that includes a root
node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each
branch denotes the outcome of a test, and each leaf node holds a class label. The topmost
node in the tree is the root node.

It applies to both classification and regression models.


The decision tree consists of 2 major procedures:

1) Building a tree and

2) Knowledge inference or classification.

Building the Tree

Knowledge Inference or Classification

Advantages of Decision Trees


Disadvantages of Decision Trees

6.1.2 Fundamentals of Entropy

 How to draw a decision tree ?


Entropy
Information gain
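The two quantities just listed drive the choice of the splitting attribute. A minimal sketch of how they can be computed (the function names and the toy data are illustrative, not from the text):

import math
from collections import Counter

def entropy(labels):
    """Entropy = -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(pairs):
    """pairs: list of (attribute_value, class_label) tuples.
    Gain = entropy(parent) - weighted entropy of the partitions."""
    parent = entropy([lab for _, lab in pairs])
    by_value = {}
    for val, lab in pairs:
        by_value.setdefault(val, []).append(lab)
    weighted = sum(len(part) / len(pairs) * entropy(part)
                   for part in by_value.values())
    return parent - weighted

# Toy example: the attribute separates the two classes perfectly
data = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "yes")]
print(information_gain(data))   # 1.0: a perfectly informative split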


Algorithm 6.1: General Algorithm for Decision Trees

6.2 DECISION TREE INDUCTION ALGORITHMS

6.2.1 ID3 Tree Construction(ID3 stands for Iterative Dichotomiser 3 )


A decision tree is one of the most powerful tools of supervised learning algorithms
used for both classification and regression tasks.
It builds a flowchart-like tree structure where each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node (terminal


node) holds a class label. It is constructed by recursively splitting the training data
into subsets based on the values of the attributes until a stopping criterion is met, such
as the maximum depth of the tree or the minimum number of samples required to split
a node.


6.2.2 C4.5 Construction


C4.5 is a widely used algorithm for constructing decision trees from a dataset.
The disadvantages of ID3 are: attributes must be nominal values, the dataset must not include
missing data, and the algorithm tends to overfit.
To overcome these disadvantages, Ross Quinlan, the inventor of ID3, made some
improvements for these bottlenecks and created a new algorithm named C4.5. The new
algorithm can create more generalized models, including continuous data, and can
handle missing data. It also works with discrete data and supports post-pruning.


Dealing with Continuous Attributes in C4.5


6.2.3 Classification and Regression Trees Construction


Classification and Regression Trees (CART) is a widely used algorithm for
constructing decision trees that can be applied to both classification and regression
tasks. CART is similar to C4.5 but has some differences in its construction and splitting
criteria.
The classification method CART is required to construct a decision tree based on Gini's
impurity index. It serves as an example of how the values of other variables can be used
to predict the values of a target variable. It functions as a fundamental machine-learning
method and provides a wide range of use cases


6.2.4 Regression Trees


6.3 VALIDATING AND PRUNING OF DECISION TREES

Validating and pruning decision trees is a crucial part of building accurate and robust
machine learning models. Decision trees are prone to overfitting, which means they can
learn to capture noise and details in the training data that do not generalize well to new,
unseen data.

Validation and pruning are techniques used to mitigate this issue and improve the
performance of decision tree models.

The pre-pruning technique of Decision Trees is tuning the hyperparameters prior to


the training pipeline. It involves the heuristic known as ‘early stopping’ which stops the
growth of the decision tree - preventing it from reaching its full depth. It stops the tree-
building process to avoid producing leaves with small samples. During each stage of
the splitting of the tree, the cross-validation error will be monitored. If the value of the
error does not decrease anymore - then we stop the growth of the decision tree.

The hyperparameters that can be tuned for early stopping and preventing overfitting

are: max_depth, min_samples_leaf, and min_samples_split

These same parameters can also be tuned to obtain a robust model

Post-pruning does the opposite of pre-pruning and allows the Decision Tree model to

grow to its full depth. Once the model grows to its full depth, tree branches are removed


to prevent the model from overfitting. The algorithm will continue to partition data into
smaller subsets until the final subsets produced are similar in terms of the outcome
variable. The final subsets of the tree will consist of only a few data points, allowing the
tree to learn the training data to a T. However, when a new data point is introduced
that differs from the learned data, it may not get predicted well.

The hyperparameter that can be tuned for post-pruning and preventing overfitting

is: ccp_alpha

ccp stands for Cost Complexity Pruning and can be used as another option to control

the size of a tree. A higher value of ccp_alpha will lead to an increase in the number of

nodes pruned.
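Since the hyperparameter names above come from scikit-learn's decision tree implementation, a short sketch of pre- and post-pruning with DecisionTreeClassifier might look as follows (the dataset here is illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: early stopping via hyperparameters set before training
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             min_samples_split=10, random_state=0)
pre.fit(X_tr, y_tr)

# Post-pruning: grow fully, then prune via the cost-complexity parameter
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0)
post.fit(X_tr, y_tr)

print(pre.score(X_te, y_te), post.score(X_te, y_te))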

Chapter 10
Artificial Neural Networks
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modelled after the brain.
An Artificial neural network is usually a computational network based on biological neural
networks that construct the structure of the human brain.
Just as the human brain has neurons interconnected with each other, artificial neural networks also
have neurons that are linked to each other in various layers of the network. These neurons are
known as nodes.

The biological neuron consists of main four parts:


• dendrites: nerve fibres carrying electrical signals to the cell .
• cell body: computes a non-linear function of its inputs
• axon: single long fiber that carries the electrical signal from the cell body to other neurons
• synapse: the point of contact between the axon of one cell and the dendrite of another,
regulating a chemical connection whose strength affects the input to the cell.


Dendrites are tree-like networks made of nerve fibre connected to the cell body.
An Axon is a single, long connection extending from the cell body and carrying signals from the
neuron. The end of the axon splits into fine strands, and each strand terminates in a small
bulb-like organ called a synapse. It is through the synapse that the neuron introduces its signals to
other nearby neurons. The receiving ends of these synapses on the nearby neurons can be found
both on the dendrites and on the cell body. There are approximately 10^4 synapses per neuron in the
human body. An electric impulse is passed between synapse and dendrites. It is a chemical process
which results in an increase or decrease in the electric potential inside the body of the receiving cell. If
the electric potential reaches a threshold value, the receiving cell fires and a pulse (action potential) of
fixed strength and duration is sent through the axon to the synaptic junctions of other cells. After that, the cell
has to wait for a period called the refractory period.

Difference between biological and Artificial Neuron


ARTIFICIAL NEURONS:
Artificial neurons are like biological neurons that are linked to each other in various layers of the
network. These neurons are known as nodes.
A node or a neuron can receive one or more inputs and process them. Artificial neurons are
connected by connection links to other neurons, and each connection link is associated with a synaptic
weight. The structure of a single neuron is shown below:
Fig: McCulloch-Pitts Neuron Mathematical model.

Simple Model of an ANN


The first mathematical model of a biological neuron was designed by McCulloch-Pitts in 1943.
It includes 2 steps:
1. It receives weighted inputs from other neurons.
2. It operates with a threshold function or activation function.

Basically, a neuron takes an input signal (dendrite), processes it like the CPU (soma), passes
the output through a cable like structure to other connected neurons (axon to synapse to
other neuron’s dendrite).

OR
Working:
The received inputs are computed as a weighted sum which is given to the activation function,
and if the sum exceeds the threshold value the neuron gets fired. The neuron is the basic
processing unit that receives a set of inputs x1, x2, x3, …, xn and their associated weights
w1, w2, w3, …, wn. The summation function computes the weighted sum of the inputs
received by the neuron:
Sum = Σ xi · wi
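As a minimal sketch of this weighted-sum-and-threshold neuron (the function name and the example weights are illustrative):

def neuron(inputs, weights, threshold):
    """Fire (output 1) if the weighted sum of inputs reaches the threshold."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1 if s >= threshold else 0

# A two-input neuron behaving like logical AND with these weights/threshold
print(neuron([1, 1], [0.6, 0.6], threshold=1.0))  # 1
print(neuron([1, 0], [0.6, 0.6], threshold=1.0))  # 0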

Activation functions:
• To make work more efficient and for exact output, some force or activation is given. Like
that, activation function is applied over the net input to calculate the output of an ANN.
Information processing of processing element has two major parts: input and output. An
integration function (f) is associated with input of processing element.

• Several activation functions are there.

1. Identity function or Linear Function: It is a linear function which is defined as
f(x) = x for all x
The output is the same as the input, i.e. the weighted sum. The function is useful when we do
not apply any threshold. The output value ranges between -∞ and +∞.
2. Binary step function: This function can be defined as
𝑓(𝑥) = { 1 𝑖𝑓 𝑥 ≥ 𝜃
0 𝑖𝑓 𝑥 < 𝜃
Where, θ represents threshhold value. It is used in single layer nets to convert
the net input to an output that is binary (0 or 1).
3. Bipolar step function: This function can be defined as
𝑓(𝑥) = { 1 𝑖𝑓 𝑥 ≥ 𝜃
−1 𝑖𝑓 𝑥 < 𝜃
Where, θ represents threshold value. It is used in single layer nets to convert
the net input to an output that is bipolar (+1 or -1).
4. Sigmoid function: It is used in Back propagation nets.
Two types:
a) Binary sigmoid function: It is also termed the logistic sigmoid function or unipolar
sigmoid function. It is defined as
f(x) = 1 / (1 + e^(-λx))
where λ represents the steepness parameter. The range of the sigmoid function is 0
to 1.
b) Bipolar sigmoid function: This function is defined as
f(x) = 2 / (1 + e^(-λx)) - 1
where λ represents the steepness parameter and the sigmoid range is between -1
and +1.
5. Ramp function: The ramp function is defined as
f(x) = 1 for x > 1;  f(x) = x for 0 ≤ x ≤ 1;  f(x) = 0 for x < 0
It is a linear function whose upper and lower limits are fixed.


6. Tanh-Hyperbolic tangent function : Tanh function is very similar to the sigmoid/logistic
activation function, and even has the same S-shape with the difference in output range of -1 to
1. In Tanh, the larger the input (more positive), the closer the output value will be to 1.0,
whereas the smaller the input (more negative), the closer the output will be to -1.0.

7. ReLU Function
ReLU stands for Rectified Linear Unit.
Although it gives an impression of a linear function, ReLU has a derivative function and
allows for backpropagation while simultaneously making it computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same
time.
The neurons will only be deactivated if the output of the linear transformation is less than 0
8. Softmax function: Softmax is an activation function that scales numbers/logits into

probabilities. The output of a Softmax is a vector (say v) with probabilities of each

possible outcome. The probabilities in vector v sums to one for all possible outcomes or

classes.
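The activation functions listed above are one-liners in code. This sketch collects them (λ is taken as 1 for the sigmoids; names are illustrative):

import numpy as np

identity = lambda x: x
binary_step = lambda x, theta=0.0: np.where(x >= theta, 1, 0)
bipolar_step = lambda x, theta=0.0: np.where(x >= theta, 1, -1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))               # range (0, 1)
bipolar_sigmoid = lambda x: 2.0 / (1.0 + np.exp(-x)) - 1   # range (-1, 1)
tanh = np.tanh                                             # range (-1, 1)
relu = lambda x: np.maximum(0.0, x)
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), relu(x), softmax(x))   # softmax outputs sum to 1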

Artificial Neural Network Structure


• Artificial Neural Networks Computational models inspired by the human brain: – Massively
parallel, distributed system, made up of simple processing units (neurons) – Synaptic
connection strengths among neurons are used to store the acquired knowledge.

• Knowledge is acquired by the network from its environment through a learning process.

• The Neural Network is constructed from 3 type of layers:


• Input layer — initial data for the neural network.
• Hidden layers — intermediate layer between input and output layer and place where all the
computation is done.

• Output layer — produce the result for given inputs.

PERCEPTRON AND LEARNING THEORY


• The perceptron is also a simplified model of a biological neuron.
• The perceptron is an algorithm for supervised learning of binary classifiers. It is a type of
linear classifier, i.e. a classification algorithm that makes all of its predictions based on a
linear predictor function combining a set of weights with the feature vector.
• One type of ANN system is based on a unit called a perceptron.

OR
• The perceptron can represent the primitive boolean functions AND, OR, NAND and NOR.
• Some boolean functions cannot be represented,
– e.g. the XOR function.

Major components of a perceptron


• Input
• Weight
• Bias
• Weighted summation
• Step/activation function
• output
WORKING:
• Feed the features of the model that is to be trained as input to the first layer. All
weights and inputs will be multiplied, and the multiplied results will be
added up. The bias value will be added to shift the output function. This value is
presented to the activation function (the type of activation function will depend on the need).
The value received after the last step is the output value.
The activation function is a binary step function which outputs a value 1 if f(x) is above the
threshold value Θ, and 0 if f(x) is below the threshold value Θ. Then the output of a neuron
is:
PROBLEM:
Design a 2 layer network of perceptron to implement NAND gate. Assume your own weights and
biases in the range of [-0.5 0.5]. Use learning rate as 0.4.

Solution:

Figure 1: Two-layer network for NAND gate. Inputs X1 and X2 feed unit X3 (the AND stage) through w13 and w23; X3 feeds unit X4 (the NOT stage) through w34; X0 supplies the biases Ɵ3 and Ɵ4.

Table 1: Weights and Biases

X1	X2	Odesired	w13	w23	w34	Ɵ3	Ɵ4	X0
0	1	1	0.1	-0.4	0.3	0.2	-0.3	1


Table 2: Truth Table of NAND Gate
𝑿𝟏 𝑿𝟐 𝑿𝟏 𝑨𝑵𝑫 𝑿𝟐 𝑵𝑨𝑵𝑫 = 𝑵𝑶𝑻(𝑿𝟏 𝑨𝑵𝑫 𝑿𝟐)

0 0 0 1
0 1 0 1
1 0 0 1
1 1 1 0

ITERATION 1:
Step 1: FORWARD PROPAGATION
1. Calculate net inputs and outputs in input layer as shown in Table 3.
Table 3: Net Input and Output Calculation
Input Layer 𝑰𝒋 𝑶𝒋

𝑿𝟏 0 0

𝑿𝟐 1 1

2. Calculate net inputs and outputs in hidden and output layer as shown in Table 4.
Table 4: Net Input and Output Calculation in Hidden and Output layer

Unit j	Net Input Ij	Net output Oj
X3	I3 = X1·W13 + X2·W23 + X0·Ɵ3 = 0(0.1) + 1(-0.4) + 1(0.2) = -0.2	O3 = 1/(1+e^-I3) = 1/(1+e^0.2) = 0.450

Unit k	Net Input Ik	Net output Ok
X4	I4 = O3·W34 + X0·Ɵ4 = (0.450 · 0.3) + 1(-0.3) = -0.165	O4 = 1/(1+e^-I4) = 1/(1+e^0.165) = 0.458

3. Calculate Error
𝑬𝒓𝒓𝒐𝒓 = 𝑶𝒅𝒆𝒔𝒊𝒓𝒆𝒅 − 𝑶𝒆𝒔𝒕𝒊𝒎𝒂𝒕𝒆𝒅
= 1 − 0.458
𝐸𝑟𝑟𝑜𝑟 = 0.542

Step 2: BACKWARD PROPAGATION

1. For each unit k in the output layer:
Errork = Ok · (1 - Ok) · (Odesired - Ok)
For each unit j in the hidden layer:
Errorj = Oj · (1 - Oj) · Σk (Errork · Wjk)

Table 5: Error Calculation

For each output layer unit k	Errork
X4	Error4 = O4 · (1 - O4) · (Odesired - O4) = 0.458 · (1 - 0.458) · (1 - 0.458) = 0.134

For each hidden layer unit j	Errorj
X3	Error3 = O3 · (1 - O3) · Error4 · W34 = 0.450 · (1 - 0.450) · 0.134 · 0.3 = 0.0099

2. Update Weights and biases

Table 6: Weight and Bias Calculation

wij	wij = wij + (α · Errorj · Oi)	Net Weight
w13	w13 = 0.1 + (0.4 · 0.0099 · 0)	0.1
w23	w23 = -0.4 + (0.4 · 0.0099 · 1)	-0.396
w34	w34 = 0.3 + (0.4 · 0.134 · 0.450)	0.324

Ɵj	Ɵj = Ɵj + (α · Errorj)	Net Bias
Ɵ3	Ɵ3 = 0.2 + (0.4 · 0.0099)	0.203
Ɵ4	Ɵ4 = -0.3 + (0.4 · 0.134)	-0.246

ITERATION 2:
Step 1: FORWARD PROPAGATION

1. Calculate net inputs and outputs in hidden and output layer


Table 7: Inputs and Outputs in Hidden and Output layer

Unit j	Net Input Ij	Net output Oj
X3	I3 = X1·W13 + X2·W23 + X0·Ɵ3 = 0(0.1) + 1(-0.396) + 1(0.203) = -0.193	O3 = 1/(1+e^-I3) = 1/(1+e^0.193) = 0.451

Unit k	Net Input Ik	Net output Ok
X4	I4 = O3·W34 + X0·Ɵ4 = (0.451 · 0.324) + 1(-0.246) = -0.099	O4 = 1/(1+e^-I4) = 1/(1+e^0.099) = 0.475

2. Calculate Error
𝑬𝒓𝒓𝒐𝒓 = 𝑶𝒅𝒆𝒔𝒊𝒓𝒆𝒅 − 𝑶𝒆𝒔𝒕𝒊𝒎𝒂𝒕𝒆𝒅
= 1 − 0.475
𝐸𝑟𝑟𝑜𝑟 = 0.525

Iteration	Error
1	0.542
2	0.525
Error reduction = 0.542 - 0.525 = 0.017

In iteration 2 the error is reduced to 0.525. This process continues until the desired output
is achieved.
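The two iterations above can be reproduced in a few lines. A minimal sketch (variable names are illustrative; the sigmoid-unit structure, initial weights and learning rate follow Table 1, and the printed values match the hand computation up to rounding):

import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Initial values from Table 1 (input X1=0, X2=1, target 1, learning rate 0.4)
x1, x2, target, alpha = 0.0, 1.0, 1.0, 0.4
w13, w23, w34, t3, t4 = 0.1, -0.4, 0.3, 0.2, -0.3

for it in range(2):
    # Forward pass
    o3 = sigmoid(x1 * w13 + x2 * w23 + t3)
    o4 = sigmoid(o3 * w34 + t4)
    print(f"iteration {it + 1}: output={o4:.3f}, error={target - o4:.3f}")

    # Backward pass (the sigmoid derivative is o * (1 - o))
    err4 = o4 * (1 - o4) * (target - o4)
    err3 = o3 * (1 - o3) * err4 * w34

    # Weight and bias updates
    w13 += alpha * err3 * x1
    w23 += alpha * err3 * x2
    w34 += alpha * err4 * o3
    t3 += alpha * err3
    t4 += alpha * err4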
How does a Multi-Layer Perceptron solve the XOR problem? Design an MLP with back
propagation to implement the XOR Boolean function.
Solution:

X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0

Figure 2: Multi Layer Perceptron for XOR. Inputs X1 and X2 feed hidden units X3 and X4, which feed the output unit X5; X0 supplies the biases (initial weights and biases as listed in Table 8).

Learning rate: =0.8


Table 8: Weights and Biases
X1 X2 W13 W14 W23 W24 W35 W45 𝜃3 𝜃4 𝜃5
1 0 -0.2 0.4 0.2 -0.3 0.2 -0.3 0.4 0.1 -0.3

Step 1: Forward Propagation


1. Calculate Input and Output in the Input Layer shown in Table 9.
Table 9: Net Input and Output Calculation
Input Layer Ij Oj
X1 1 1
X2 0 0
2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 10.
Table 10: Unit j at Hidden Layer and Output Layer – Net Input and Output Calculation

Unit j	Net Input Ij	Output Oj
X3	I3 = X1·W13 + X2·W23 + X0·θ3 = 1(-0.2) + 0(0.2) + 1(0.4) = 0.2	O3 = 1/(1+e^-I3) = 1/(1+e^-0.2) = 0.549
X4	I4 = X1·W14 + X2·W24 + X0·θ4 = 1(0.4) + 0(-0.3) + 1(0.1) = 0.5	O4 = 1/(1+e^-I4) = 1/(1+e^-0.5) = 0.622
X5	I5 = O3·W35 + O4·W45 + X0·θ5 = 0.549(0.2) + 0.622(-0.3) + 1(-0.3) = -0.376	O5 = 1/(1+e^-I5) = 1/(1+e^0.376) = 0.407

3. Calculate Error = Odesired – OEstimated

So the error for this network is,
Error = Odesired – O5 = 1 – 0.407 = 0.593

Step 2: Backward Propagation


1. Calculate Error at each node as shown in Table 11.
For each unit k in the output layer, calculate
Errork = Ok (1 - Ok) (Odesired – Ok)
For each unit j in the hidden layer, calculate
Errorj = Oj (1 - Oj) Σk Errork Wjk

Table 11: Error Calculation for each unit in the Output layer and Hidden layer

For Output Layer Unit k	Errork
X5	Error5 = O5 (1-O5) (1 – O5) = 0.407 · (1-0.407) · (1-0.407) = 0.143

For Hidden Layer Unit j	Errorj
X4	Error4 = O4 (1-O4) · Error5 · W45 = 0.622 · (1-0.622) · 0.143 · (-0.3) = -0.010
X3	Error3 = O3 (1-O3) · Error5 · W35 = 0.549 · (1-0.549) · 0.143 · 0.2 = 0.007

2. Update weight using the below formula,


Learning rate α = 0.8
∆Wij = ∝∗ Error j* Oi
Wij = Wij+ ∆Wij
The updated weight and bias is shown in Table 12 and Table 13.
Table 12: Weight Updation

Wij	Wij = Wij + α · Errorj · Oi	New Weight
W13	W13 = -0.2 + 0.8 · 0.007 · 1	-0.194
W14	W14 = 0.4 + 0.8 · (-0.010) · 1	0.392
W23	W23 = 0.2 + 0.8 · 0.007 · 0	0.2
W24	W24 = -0.3 + 0.8 · (-0.010) · 0	-0.3
W35	W35 = 0.2 + 0.8 · 0.143 · O3 = 0.2 + 0.8 · 0.143 · 0.549	0.263
W45	W45 = -0.3 + 0.8 · 0.143 · O4 = -0.3 + 0.8 · 0.143 · 0.622	-0.229

Update bias using the below formula,


∆θj = = ∝∗ Error j
θj = θj + ∆θj
Table 13: Bias Updation
θj θj = θj + ∝∗ Error j New Bias
𝜃3 Θ3 = θ3 + ∝∗ Error 3 0.405
= 0.4 + 0.8 * 0.007
𝜃4 θ 4 = θ4 + ∝∗ Error 4 0.092
= 0.1 + 0.8 *- 0.01
𝜃5 θ 5 = θ5 + ∝∗ Error 5 -0.185
= -0.3 + 0.8 * 0.143
Iteration 2
Now with the updated weights and biases,
1. Calculate Input and Output in the Input Layer shown in Table 14.
Table 14: Net Input and Output Calculation
Input Layer Ij Oj
X1 1 1
X2 0 0

2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 15.
Table 15: Net Input and Output Calculation in the Hidden Layer and Output Layer

Unit j	Net Input Ij	Output Oj
X3	I3 = 1(-0.194) + 0(0.2) + 1(0.405) = 0.211	O3 = 1/(1+e^-0.211) = 0.553
X4	I4 = 1(0.392) + 0(-0.3) + 1(0.092) = 0.484	O4 = 1/(1+e^-0.484) = 0.619
X5	I5 = 0.553(0.263) + 0.619(-0.229) + 1(-0.185) = -0.181	O5 = 1/(1+e^0.181) = 0.455

The output we receive in the network at node 5 is now 0.455.

Error = 1 - 0.455 = 0.545
Now when we compare the error we get in the previous iteration and in the current iteration,
the network has learnt, which reduces the error.
The error is reduced by 0.048: 0.593 - 0.545.

Consider the Network architecture with 4 input units and 2 output units. Consider four training
samples each vector of length 4.
Training samples
i1: (1, 1, 1, 0)
i2: (0, 0, 1, 1)
i3: (1, 0, 0, 1)
i4: (0, 0, 1, 0)
Output Units: Unit 1, Unit 2
Learning rate η(t) = 0.6
Initial Weight matrix
Unit 1: [0.2  0.8  0.5  0.1]
Unit 2: [0.3  0.5  0.4  0.6]
Identify an algorithm that can learn without supervision. How are the samples clustered as
expected?

Solution:
Use Self Organizing Feature Map (SOFM)

Iteration 1:
Training Sample X1: (1, 1, 1, 0)
Weight matrix
Unit 1: [0.2  0.8  0.5  0.1]
Unit 2: [0.3  0.5  0.4  0.6]

Compute Euclidean distance between X1: (1, 1, 1, 0) and Unit 1 weights.

d² = (0.2-1)² + (0.8-1)² + (0.5-1)² + (0.1-0)² = 0.94
Compute Euclidean distance between X1: (1, 1, 1, 0) and Unit 2 weights.

d² = (0.3-1)² + (0.5-1)² + (0.4-1)² + (0.6-0)² = 1.46
Unit 1 wins
Update the weights of the winning unit
New Unit 1 weights = [0.2 0.8 0.5 0.1] + 0.6 ([1 1 1 0] - [0.2 0.8 0.5 0.1])
= [0.2 0.8 0.5 0.1] + 0.6 [0.8 0.2 0.5 -0.1]
= [0.2 0.8 0.5 0.1] + [0.48 0.12 0.30 -0.06]
= [0.68 0.92 0.80 0.04]
Unit 1: [0.68  0.92  0.80  0.04]
Unit 2: [0.3  0.5  0.4  0.6]
Iteration 2:
Training Sample X2: (0, 0, 1, 1)
Weight matrix
Unit 1: [0.68  0.92  0.80  0.04]
Unit 2: [0.3  0.5  0.4  0.6]
Compute Euclidean distance between X2: (0, 0, 1, 1) and Unit 1 weights.

d² = (0.68-0)² + (0.92-0)² + (0.80-1)² + (0.04-1)² = 2.27

Compute Euclidean distance between X2: (0, 0, 1, 1) and Unit 2 weights.

d² = (0.3-0)² + (0.5-0)² + (0.4-1)² + (0.6-1)² = 0.86

Unit 2 wins
Update the weights of the winning unit
New Unit 2 weights = [0.3 0.5 0.4 0.6] + 0.6 ([0 0 1 1] - [0.3 0.5 0.4 0.6])
= [0.3 0.5 0.4 0.6] + 0.6 [-0.3 -0.5 0.6 0.4]
= [0.3 0.5 0.4 0.6] + [-0.18 -0.30 0.36 0.24]
= [0.12 0.2 0.76 0.84]
Unit 1: [0.68  0.92  0.80  0.04]
Unit 2: [0.12  0.2  0.76  0.84]

Iteration 3:
Training Sample X3: (1, 0, 0, 1)
Weight matrix
Unit 1: [0.68  0.92  0.80  0.04]
Unit 2: [0.12  0.2  0.76  0.84]

Compute Euclidean distance between X3: (1, 0, 0, 1) and Unit 1 weights.

d² = (0.68-1)² + (0.92-0)² + (0.80-0)² + (0.04-1)² = 2.51

Compute Euclidean distance between X3: (1, 0, 0, 1) and Unit 2 weights.

d² = (0.12-1)² + (0.2-0)² + (0.76-0)² + (0.84-1)² = 1.42

Unit 2 wins
Update the weights of the winning unit
New Unit 2 weights = [0.12 0.2 0.76 0.84] + 0.6 ([1 0 0 1] - [0.12 0.2 0.76 0.84])
= [0.12 0.2 0.76 0.84] + 0.6 [0.88 -0.2 -0.76 0.16]
= [0.12 0.2 0.76 0.84] + [0.53 -0.12 -0.46 0.096]
= [0.65 0.08 0.3 0.94]
Unit 1: [0.68  0.92  0.80  0.04]
Unit 2: [0.65  0.08  0.3  0.94]

Iteration 4:
Training Sample X4: (0, 0, 1, 0)
Weight matrix
Unit 1: [0.68  0.92  0.80  0.04]
Unit 2: [0.65  0.08  0.3  0.94]

Compute Euclidean distance between X4: (0, 0, 1, 0) and Unit 1 weights.

d² = (0.68-0)² + (0.92-0)² + (0.80-1)² + (0.04-0)² = 1.35

Compute Euclidean distance between X4: (0, 0, 1, 0) and Unit 2 weights.

d² = (0.65-0)² + (0.08-0)² + (0.3-1)² + (0.94-0)² = 1.80

Unit 1 wins
Update the weights of the winning unit
New Unit 1 weights = [0.68 0.92 0.80 0.04] + 0.6 ([0 0 1 0] - [0.68 0.92 0.80 0.04])
= [0.68 0.92 0.80 0.04] + 0.6 [-0.68 -0.92 0.2 -0.04]
= [0.68 0.92 0.80 0.04] + [-0.408 -0.552 0.12 -0.024]
= [0.27 0.37 0.92 0.016]
Unit 1: [0.27  0.37  0.92  0.016]
Unit 2: [0.65  0.08  0.3  0.94]

Best mapping unit for each of the sample taken are,


X1: (1, 1, 1, 0)  Unit 1
X2: (0, 0, 1, 1)  Unit 2
X3: (1, 0, 0, 1)  Unit 2
X4: (0, 0, 1, 0)  Unit 1

This process is continued for many epochs until the feature map doesn’t change.
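A compact sketch of this winner-take-all update (one epoch over the four samples, learning rate 0.6, and, as in the hand computation, no neighborhood shrinking) could be:

import numpy as np

samples = np.array([[1, 1, 1, 0],
                    [0, 0, 1, 1],
                    [1, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
weights = np.array([[0.2, 0.8, 0.5, 0.1],    # Unit 1
                    [0.3, 0.5, 0.4, 0.6]])   # Unit 2
eta = 0.6

for x in samples:
    d2 = ((weights - x) ** 2).sum(axis=1)    # squared Euclidean distances
    winner = int(d2.argmin())                # best matching unit
    weights[winner] += eta * (x - weights[winner])
    print(f"sample {x} -> Unit {winner + 1}")

print(weights)   # final weight matrix after one epoch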
Learning Rules
Learning in NN is performed by adjusting the network weights in order to minimize the
difference between the desired and estimated output.

Delta Learning Rule and Gradient Descent


 Developed by Widrow and Hoff, the delta rule is one of the most common learning rules.
 It is supervised learning.
 The delta rule is derived from the gradient descent method (back-propagation).
 It applies to continuous activations, which is why it is also called the continuous perceptron
learning rule, and it can be used on problems that are not linearly separable.
 It updates the connection weights with the difference between the target and the output
value. It is the least mean square (LMS) learning algorithm.
 The delta difference is measured as an error function, also called the cost function.
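As a sketch, one delta-rule (LMS) update for a single linear neuron looks like this (function name and data are illustrative):

def delta_rule_step(w, x, target, lr):
    """w, x: lists of weights/inputs; returns updated weights.
    Update: w_i <- w_i + lr * (target - output) * x_i."""
    output = sum(wi * xi for wi, xi in zip(w, x))   # linear output
    error = target - output                         # cost is 0.5 * error**2
    return [wi + lr * error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
for _ in range(50):                 # learn y = x1 + x2 from two examples
    w = delta_rule_step(w, [1, 0], 1.0, lr=0.2)
    w = delta_rule_step(w, [0, 1], 1.0, lr=0.2)
print(w)   # approaches [1.0, 1.0]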

TYPES OF ANN
1. Feed Forward Neural Network
2. Fully connected Neural Network
3. Multilayer Perceptron
4. Feedback Neural Network
Feed Forward Neural Network:
Feed-Forward Neural Network is a single layer perceptron. A sequence of inputs enters the layer and are
multiplied by the weights in this model. The weighted input values are then summed together to form a total.
If the sum of the values is more than a predetermined threshold, which is normally set at zero, the output
value is usually 1, and if the sum is less than the threshold, the output value is usually -1.
The single-layer perceptron is a popular feed-forward neural network model that is frequently used for
classification.
The model may or may not contain hidden layer and there is no backpropagation.
Based on the number of hidden layers they are further classified into single-layered and multilayered feed
forward network.

Fully connected Neural Network:

 A fully connected neural network consists of a series of fully connected layers that connect
every neuron in one layer to every neuron in the other layer.

 The major advantage of fully connected networks is that they are “structure agnostic” i.e. there
are no special assumptions needed to be made about the input.
Multilayer Perceptron:
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one
output layer with a single node for each output, and it can have any number of hidden layers,
each with any number of nodes.
Information flows forward through the network during prediction, while error signals flow
backward during training; the weight-adjustment training is done via backpropagation.
Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid activation
function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid
formula.

Feedback Neural Network:


Feedback networks also known as recurrent neural network or interactive neural network are
the deep learning models in which information flows in backward direction.
It allows feedback loops in the network. Feedback networks are dynamic in nature, powerful and
can get much complicated at some stage of execution
Neuronal connections can be made in any way.
RNNs may process input sequences of different lengths by using their internal state, which can
represent a form of memory.
They can therefore be used for applications like speech recognition or handwriting recognition.
LEARNING OF MULTI LAYER PERCEPTRON

WHY MULTI LAYER PERCEPTRON?


Imagine a group of 7-year-old students who are working on a math problem, imagine that each of
them can only do arithmetic with two numbers. But you are giving them an equation like this 5 x 3
+ 2 x 4 + 8 x 2, how can they solve it?

To solve this problem, we can break it down into smaller parts and give them to each of the
students. One student can solve the first part of the equation "5 x 3 = 15" and another student can
solve the second part of the equation "2 x 4 = 8". The third student can solve the third part "8 x 2 =
16".
Finally, we can simplify it to 15 + 8 + 16. Same way, one of the students in the group can solve "15
+ 8 = 23" and another one can solve "23 + 16 = 39", and that's the answer

So here we are breaking down the large math problem into different sections and giving them to
each of the students who are just doing really simple calculations, but as a result of the teamwork,
they can solve the problem efficiently.

This is exactly the idea of how a multi-layer perceptron (MLP) works. Each neuron in the MLP is
like a student in the group, and each neuron is only able to perform simple arithmetic operations.
However, when these neurons are connected and work together, they can solve complex problems.

The principle weakness of the perceptron was that it could only solve problems that were linearly
separable.

A multilayer perceptron (MLP) is a fully connected feed-forward artificial neural network with at
least three layers: input, output, and at least one hidden layer.
The mapping between inputs and output is non-linear (e.g., the XOR gate).
While in the perceptron the neuron must have an activation function that imposes a threshold, like ReLU or
sigmoid, neurons in a multilayer perceptron can use any arbitrary activation function.
MLP networks use backpropagation for supervised learning.
The activation functions used in the layers can be linear or non-linear depending on the type of
problem.
NOTE : In each iteration, after the weighted sums are forwarded through all layers, the gradient of
the Mean Squared Error is computed across all input and output pairs. Then, to propagate it back,
the weights of the first hidden layer are updated with the value of the gradient. That’s how the
weights are propagated back to the starting point of the neural network.

This process keeps going until gradient for each input-output pair has converged, meaning the
newly computed gradient hasn’t changed more than a specified convergence threshold, compared to
the previous iteration.
Works in 2 stages.
1. Forward phase
2. Backward phase

Starting from initial random weights, multi-layer perceptron (MLP) minimizes the loss function
by repeatedly updating these weights. After computing the loss, a backward pass propagates it
from the output layer to the previous layers, providing each weight parameter with an update
value meant to decrease the loss.

ALGORITHM
Radial Basis Function Neural Network
These networks have a fundamentally different architecture from most neural network architectures.
Most neural network architectures consist of many layers and introduce nonlinearity by repetitively
applying nonlinear activation functions.
An RBF network, on the other hand, only consists of an input layer, a single hidden layer, and an
output layer.
The input layer is not a computation layer; it just receives the input data and feeds it into the special
hidden layer of the RBF network. The computation that happens inside the hidden layer is very
different from most neural networks, and this is where the power of the RBF network comes from.
The output layer performs the prediction task such as classification or regression.
RBF neural networks are conceptually similar to K-Nearest Neighbour (k-NN) models.
They are useful for interpolation, function approximation, time series prediction and classification.

RBFNN Architecture :
Self-organizing Feature Map
SOM is trained using unsupervised learning.
SOM doesn't learn by backpropagation with Stochastic Gradient Descent (SGD); it uses competitive
learning to adjust the weights of its neurons. Artificial neural networks often utilize competitive
learning models to classify input without the use of labeled data.
Uses: dimension reduction, reducing data by creating a spatially organized representation, and
discovering the correlations between data.
Self-organizing maps have two layers: the first one is the input layer and the second one is the
output layer or the feature map.
SOM doesn't have an activation function in its neurons; weights are passed directly to the output layer
without further processing.
Network Architecture and operations
It consists of 2 layers:
1. Input layer
2. Output layer
No Hidden layer.

The initialization of the weight to vectors initiates the mapping processes of the Self-Organizing
Maps.

The mapped vectors are then examined to determine which weight most accurately represents a
randomly chosen sample vector. Neighboring weights that are near each weighted
vector are also considered. The chosen weight is allowed to move toward the random sample vector. This
encourages the map to develop and take on new forms. In a 2D feature space, the units typically form
hexagonal or square grids. This entire process is repeated many times, often more than 1,000
iterations.

To put it simply, learning takes place in the following ways:

 To determine whether appropriate weights are similar to the input vector, each node is analyzed.
The best matching unit is the term used to describe the appropriate node.

 The Best Matching Unit's neighborhood value is then determined. Over time, the neighbors tend
to decline in number.

The appropriate weight further evolves into something more resembling the sample vector. The
surrounding areas change similarly to the selected sample vector. A node's weights change more as
it gets closer to the Best Matching Unit (BMU), and less as it gets farther away from its neighbor.
For N iterations, repeat step two.
Advantages and Disadvantages of ANN

Limitations of ANN
Challenges of Artificial Neural Networks
Chapter 13
CLUSTERING ALGORITHMS
 Clustering: the process of grouping a set of objects into classes of similar objects
 Documents within a cluster should be similar.

 Documents from different clusters should be dissimilar

 Finding similarities between data according to the characteristics found in the data and
grouping similar data objects into clusters.

 Unsupervised learning: no predefined classes.

 Example: The figure below shows data points with two features, drawn as differently shaded
samples.
If there are only a few similarities, grouping can be done manually, but when examples have more
features, manual grouping is not feasible, so automatic clustering is required.

Clusters are represented by centroids.


Example:
For the points (3,3), (2,6) and (7,9):
Centroid: ((3+2+7)/3, (3+6+9)/3) = (4,6). The clusters should not overlap and every
cluster should represent only one class.
Difference between Clustering and Classification

Applications of Clustering
Advantages and Disadvantages

Challenges of Clustering Algorithms


1. Collection of data with higher dimensions.
2. Designing a proximity measure is another challenge.
3. The curse of dimensionality

PROXIMITY MEASURES
Clustering algorithms need a measure to find the similarity or dissimilarity among the
objects to group them. Similarity and Dissimilarity are collectively known as proximity
measures. This is used by a number of data mining techniques, such as clustering,
nearest neighbour classification, and anomaly detection.

Distance measures are known as dissimilarity measures, as these indicate how one
object is different from another.
Measures like cosine similarity indicate the similarity among objects.
Distance measures and similarity measures are two sides of the same coin: a larger
distance indicates less similarity, and vice-versa.

If all the conditions (non-negativity, symmetry and the triangle inequality) are satisfied, then the distance measure is called a metric.
Some of proximity measures:
1. Quantitative variables
a) Euclidean distance: It is one of the most important and common
distance measure. It is also called L2 norm.

Advantage: The distance between two objects does not change when new objects are added.
Disadvantages: i) If the unit of measurement changes, the resulting Euclidean or squared
Euclidean distance changes drastically.
ii) Computational complexity is high, because it involves square root and
square operations.
b) City Block Distance: Known as Manhattan Distance or L1 norm.

c) Chebyshev Distance: Also known as maximum value distance. This is


the absolute magnitude of the differences between the coordinates of a
pair of objects.This distance is called supremum distance or Lmax or
L∞ norm.

d) Minkowski Distance: In general, all the above distances measures


can be generalized as:
Binary Attributes: Binary attributes have only two values. The distance
measures discussed above cannot be applied to find the distance
between objects that have binary attributes. For finding the distance
among objects with binary attributes, the contingency table is used.
Hamming Distance: Hamming distance is a metric for comparing two binary data
strings. While comparing two binary strings of equal length, Hamming distance is the
number of bit positions in which the two bits are different. It is used for error detection
or error correction when data is transmitted over computer networks.

Example
Suppose there are two strings 1101 1001 and 1001 1101.

11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the Hamming distance,
d(11011001, 10011101) = 2.

2. Categorical variables

Ordinal Variables

Cosine Similarity
 Cosine similarity is a metric used to measure how similar the documents are
irrespective of their size.
 It measures the cosine of the angle between two vectors projected in a multi-
dimensional space.
 The cosine similarity is advantageous because even if the two similar
documents are far apart by the Euclidean distance (due to the size of the
document), chances are they may still be oriented closer together.
 The smaller the angle, higher the cosine similarity.
 Consider 2 documents P1 and P2.
◦ If distance is more, then less similar.
◦ If distance is less, then more similar.
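These proximity measures translate directly into code. The sketch below computes the same quantities used in the worked problems that follow (function names are illustrative):

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))   # count of differing positions

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(euclidean((2, 3, 4), (1, 5, 6)))                # 3.0
print(manhattan((2, 3, 4), (1, 5, 6)))                # 5
print(chebyshev((2, 3, 4), (1, 5, 6)))                # 2
print(hamming((1, 1, 1), (1, 0, 0)))                  # 2
print(cosine_similarity((1, 0, 1, 1), (1, 1, 0, 0)))  # ~0.41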

1. Consider the following data and, calculate the Euclidean, Manhattan and
Chebyshev distances.

a. (2 3 4) and (1 5 6)

Solution

Euclidean distance = √((2-1)² + (3-5)² + (4-6)²) = √9 = 3

Manhattan distance = |2-1| + |3-5| + |4-6| = 1 + 2 + 2 = 5

Chebyshev distance = max{|2-1|, |3-5|, |4-6|} = max{1, 2, 2} = 2

b. (2 2 9) and (7 8 9)

Euclidean distance = √((2-7)² + (2-8)² + (9-9)²) = √(25 + 36 + 0) = √61 ≈ 7.81

Manhattan distance = |2-7| + |2-8| + |9-9| = 5 + 6 + 0 = 11

Chebyshev distance = max{|2-7|, |2-8|, |9-9|} = max{5, 6, 0} = 6

2. Find cosine similarity, SMC and Jaccard coefficients for the following binary
data:

a. (1 0 1 1) and (1 1 0 0)

Solution
1 0 1 1
1 1 0 0

The contingency counts are: a (0,0 matches) = 0, b (0,1 mismatches) = 1, c (1,0 mismatches) = 2, d (1,1 matches) = 1

SMC = (a + d) / (a + b + c + d) = 1/4 = 0.25

Jaccard Coefficient = d / (b + c + d) = 1/4 = 0.25

Cosine Similarity = (1·1 + 0·1 + 1·0 + 1·0) / (√3 · √2) = 1/√6 ≈ 0.41

b. (1 0 0 0 1) and (1 0 0 0 0 1)

Solution
The vectors are of different lengths (5 and 6), so no match is possible.

Take instead (1 0 0 0 1) and (1 1 0 0 0):

1 0 0 0 1
1 1 0 0 0

a (0,0 matches) = 2, b = 1, c = 1, d (1,1 matches) = 1
SMC = (a + d) / (a + b + c + d) = 3/5 = 0.6
Jaccard Coefficient = d / (b + c + d) = 1/3 ≈ 0.33
Cosine Similarity = (1·1 + 0·1 + 0·0 + 0·0 + 1·0) / (√2 · √2) = 1/2 = 0.5
3. Find Hamming distance for the following binary data:
a. (1 1 1) and (1 0 0)

Solution
It differs in two positions; therefore Hamming distance is 2
b. (1 1 1 0 0) and (0 0 1 1 1)
Solution
It differs in four positions; therefore, Hamming distance is 4

4. Find the distance between:


a. Employee ID: 1000 and 1001
Solution
They are not equal. Therefore, distance is 0

b. Employee name – John & John and John & Joan


Solution
The distance between John and John is 1
The distance between John and Joan is 0

5. Find the distance between:


a. (Yellow, red, green) and (red, green, yellow)

Solution

Rank the values: yellow = 1, red = 2, green = 3. With n = 3 values, each distance is normalized by (n - 1) = 2.

Distance between (yellow, red) = |1 - 2| / 2 = 0.5

Distance between (red, green) = |2 - 3| / 2 = 0.5

Distance between (green, yellow) = |3 - 1| / 2 = 1

Therefore, the distance between (Yellow, red, green) and (red, green, yellow) is (0.5, 0.5, 1).

b. (bread, butter, milk) and (milk, sandwich, Tea)

Solution

Rank the values: bread = 1, butter = 2, milk = 3, sandwich = 4, tea = 5; normalize by (5 - 1) = 4.

Distance between (bread, milk) = |1 - 3| / 4 = 1/2

Distance between (butter, sandwich) = |2 - 4| / 4 = 1/2

Distance between (milk, tea) = |3 - 5| / 4 = 1/2

Therefore, the distance between (bread, butter, milk) and (milk, sandwich, Tea) = (1/2, 1/2, 1/2).
 Hierarchical Clustering Algorithms
Hierarchical clustering involves creating clusters that have a predetermined ordering
from top to bottom.
For example, all files and folders on the hard disk are organized in a hierarchy.
Hierarchical relationship is shown in the form of a dendrogram.
There are two types of hierarchical clustering.
◦ Divisive and Agglomerative.

 Divisive method : In divisive or top-down clustering method we assign all of the


observations to a single cluster and then partition the cluster to two least similar clusters.
Finally, we proceed recursively on each cluster until there is one cluster for each
observation. There is evidence that divisive algorithms produce more accurate
hierarchies than agglomerative algorithms in some circumstances but is conceptually
more complex.
 Agglomerative method: In agglomerative or bottom-up clustering method we assign
each observation to its own cluster. Then, compute the similarity (e.g., distance)
between each of the clusters and join the two most similar clusters. Finally, repeat steps
2 and 3 until there is only a single cluster left. The related algorithm is shown below.

 The following three methods differ in how the distance between each cluster is
measured.
1. Single Linkage
2. Average Linkage
3. Complete Linkage
Single Linkage or MIN algorithm
In single linkage hierarchical clustering, the distance between two clusters is
defined as the shortest distance between two points in each cluster. For example,
the distance between clusters “r” and “s” to the left is equal to the length of the
arrow between their two closest points.
 Complete Linkage : In complete linkage hierarchical clustering, the distance between
two clusters is defined as the longest distance between two points in each cluster. For
example, the distance between clusters “r” and “s” to the left is equal to the length of the
arrow between their two furthest points.

OR

 Average Linkage : In average linkage hierarchical clustering, the distance between two
clusters is defined as the average distance between each point in one cluster to every
point in the other cluster. For example, the distance between clusters “r” and “s” to the
left is equal to the average length each arrow between connecting the points of one
cluster to the other.
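These linkage strategies are available off the shelf. A sketch with SciPy, run on the sample data of the problem below (single linkage; "complete" and "average" are the other options mentioned above):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[3, 5], [7, 8], [12, 5], [16, 9], [20, 8]], dtype=float)

Z = linkage(points, method="single")   # also: "complete", "average"
print(Z)                               # each row: merged ids, distance, size

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree with matplotlib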
Mean-Shift Algorithm
Use the following dataset and apply the hierarchical method (single linkage). Show the dendrogram.

SNo. X Y

1. 3 5

2. 7 8

3. 12 5

4. 16 9

5. 20 8

Table Sample Data

Solution

The Euclidean distance between every pair of points is computed and is shown in the following
proximity matrix.

Table: Proximity Matrix

Objects	1	2	3	4	5
1	-	5	9	13.60	17.26
2		-	5.83	9.06	13
3			-	5.66	8.54
4				-	4.12
5					-

The minimum distance is 4.12, between objects 4 and 5. Therefore, the items 4 and 5 are clustered together. The resultant
table is given as shown in the following Table.

Table After Iteration 1

Clusters	{4,5}	1	2	3

{4,5}	-	13.60	9.06	5.66

1		-	5	9
2			-	5.83

3				-

The distance between the group {4,5} and items 1, 2, 3 is computed using the single-linkage (MIN) formula.
Thus, the distance between {4,5} and {1} is:
Minimum{(4,1),(5,1)} = Minimum{13.60, 17.26} = 13.60
The distance between {4,5} and {2} is:
Minimum{(4,2),(5,2)} = Minimum{9.06, 13} = 9.06
The distance between {4,5} and {3} is:
Minimum{(4,3),(5,3)} = Minimum{5.66, 8.54} = 5.66

The minimum distance in the above table is 5, between objects 1 and 2. Therefore, {1} and {2} are combined. This
results in the following Table.

Table After Iteration 2

Clusters	{1,2}	3	{4,5}
{1,2}	-	5.83	9.06
3		-	5.66

{4,5}			-

Thus, the distance between {1,2} and {3} is:

Minimum{(1,3),(2,3)} = Minimum{9, 5.83} = 5.83

The distance between {1,2} and {4,5} is:

Minimum{(1,4),(1,5),(2,4),(2,5)} = Minimum{13.60, 17.26, 9.06, 13} = 9.06

The minimum is now 5.66. Therefore {3} is combined with {4,5}, giving {3,4,5}. Finally, {1,2} is
merged with {3,4,5} at distance 5.83.

Therefore, the order of clustering is {4,5}, then {1,2}, then {3,4,5}, and finally all objects in one cluster.
Complete Linkage or MAX or Clique
Here also the closest pair from the proximity matrix, {4,5} at distance 4.12, is combined first. The distances
to the other items are then recomputed using the maximum:

Thus, the distance between {4,5} and {1} is:

Max{(4,1),(5,1)} = Max{13.60, 17.26} = 17.26
The distance between {4,5} and {2} is:
Max{(4,2),(5,2)} = Max{9.06, 13} = 13
The distance between {4,5} and {3} is:
Max{(4,3),(5,3)} = Max{5.66, 8.54} = 8.54

This results in a Table

Clusters	{4,5}	1	2	3

{4,5}	-	17.26	13	8.54

1		-	5	9

2			-	5.83

3				-

The minimum is 5, so {1,2} is combined. Recomputing with the maximum:

Max{(1,3),(2,3)} = Max{9, 5.83} = 9
Max{(1,4),(1,5),(2,4),(2,5)} = 17.26

The minimum is now 8.54, so {3} joins {4,5}; finally {1,2} and {3,4,5} merge at 17.26.
The order of clustering is {4,5}, then {1,2}, then {3,{4,5}}, and finally all objects.
Hint: The same procedure is used for the average-link algorithm, where the average distance of all pairs of
points across the clusters is used to form clusters.

Consider the data shown in the following table. Use the k-means algorithm with k
= 2 and show the result.
Table Sample Data
SNO X Y

1. 3 5

2. 7 8

3. 12 5

4. 16 9
Solution

Let us assume the seed points are (3,5) and (16,9). This is shown in the following table
as starting clusters.

Table Initial Cluster Table

Cluster 1 Cluster 2
(3,5) (16,9)

Centroid (3,5) Centroid (16,9)

Iteration 1: Compare all the data points or samples with the centroids and assign each to the
nearest centroid.

Take the sample object 2 and compare it with the two centroids as follows:

Dist(2, centroid 1) = √((7-3)² + (8-5)²) = √(16+9) = √25 = 5
Dist(2, centroid 2) = √((7-16)² + (8-9)²) = √(81+1) = √82 ≈ 9.06
Object 2 is closer to the centroid of cluster 1 and hence is assigned to cluster 1. This is shown in
the Table. For the object 3:

Dist(3, centroid 1) = √((12-3)² + (5-5)²) = √81 = 9
Dist(3, centroid 2) = √((12-16)² + (5-9)²) = √(16+16) = √32 ≈ 5.66

Object 3 is closer to the centroid of cluster 2 and hence is assigned to cluster 2.

This is shown in the following Table.


Table Cluster Table After Iteration 1

Cluster 1	Cluster 2
(3,5)	(12,5)
(7,8)	(16,9)

Centroid (10/2, 13/2) = (5, 6.5)	Centroid (28/2, 14/2) = (14, 7)

The second iteration is started again. Compute again,

Dist(2, centroid 1) = √((7-5)² + (8-6.5)²) = √6.25 = 2.5
Dist(2, centroid 2) = √((7-14)² + (8-7)²) = √(49+1) = √50 ≈ 7.07

Object 2 is closer to the centroid of cluster 1 and hence remains in the same cluster. Take the
sample object 3 and compute again,

Dist(3, centroid 1) = √((12-5)² + (5-6.5)²) = √51.25 ≈ 7.16

Dist(3, centroid 2) = √((12-14)² + (5-7)²) = √8 ≈ 2.83

Object 3 is closer to the centroid of cluster 2 and remains in the same cluster. Similarly,
objects 1 and 4 remain with their nearest centroids.

Therefore, the resultant clusters are
{(3,5), (7,8)} and {(12,5), (16,9)}.
PARTITIONAL CLUSTERING ALGORITHMS

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-
Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where K is used to define the number
of pre-defined groups. The cluster center is created in such a way that the distance between the
data points of one cluster is minimum as compared to another cluster centroid.

K-means can be viewed as a greedy algorithm, as it involves partitioning n samples into k
clusters so as to minimize the sum of squared errors (SSE). SSE is a metric that gives
the sum of the squared Euclidean distances of each data point to its closest centroid:

SSE = Σ (i = 1 to k) Σ (x ∈ Ci) dist(ci, x)²

Here ci = centroid of the i-th cluster Ci
x = sample data point
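A minimal k-means sketch built around this SSE objective, using the four sample points from the worked problem above (the function name is illustrative; empty clusters simply keep their old centroid):

import numpy as np

def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid for every point
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: centroid = mean of the assigned points
        for i in range(k):
            if (labels == i).any():
                centroids[i] = X[labels == i].mean(axis=0)
    sse = ((X - centroids[labels]) ** 2).sum()   # sum of squared errors
    return centroids, labels, sse

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9]], dtype=float)
print(kmeans(X, k=2))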
PROBLEM
Density-Based Clustering

A cluster is a dense region of points, which is separated by low-density regions, from other
regions of high density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region
of high point density, separated from other such clusters by contiguous regions of low point
density.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm


for density-based clustering. It can discover clusters of different shapes and sizes from a large
amount of data, which is containing noise and outliers.
The DBSCAN algorithm uses two parameters:
minPts: The minimum number of points (a threshold) clustered together for a region to be
considered dense.
eps (ε): A distance measure that will be used to locate the points in the neighborhood of any
point.
These parameters can be understood if we explore two concepts called Density Reachability
and Density Connectivity.
Reachability in terms of density establishes a point to be reachable from another if it lies within
a particular distance (eps) from it. Connectivity, on the other hand, involves a transitivity based
chaining-approach to determine whether points are located in a particular cluster. For example,
p and q points could be connected if p->r->s->t->q, where a->b means b is in the neighborhood
of a.

There are three types of points after the DBSCAN clustering is complete:

 Core — This is a point that has at least m points within distance n from itself.
 Border — This is a point that has at least one Core point at a distance n.
 Noise — This is a point that is neither a Core nor a Border. And it has less than m points
within distance n from itself.
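scikit-learn's DBSCAN exposes exactly these two parameters as eps and min_samples; a small sketch on illustrative data:

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away point that should become noise
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
              [50.0, 50.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; label -1 marks noise points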
Grid-Based Approaches
A grid-based clustering method takes a space-driven approach by partitioning the embedding
space into cells independent of the distribution of the input objects.

The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the
object space into a finite number of cells that form a grid structure on which all of the
operations for clustering are performed.

The main advantage of the approach is its fast processing time, which is typically independent
of the number of data objects, yet dependent on only the number of cells.

Subspace Clustering
CLIQUE is a density-based and grid-based subspace clustering algorithm, useful for finding
clusters in subspaces.
Concept of Dense cell
CLIQUE partitions each dimension into several overlapping intervals and thereby partitions
the data space into cells. Then, the algorithm determines whether a cell is dense or sparse. A cell is
considered dense if the number of points in it exceeds a threshold value.

Density is defined as the ratio of the number of points to the volume of the region. In one pass, the
algorithm finds the number of cells, the number of points, etc., and then combines the dense cells.
For that, the algorithm uses contiguous intervals and a set of dense cells.

MONOTONICITY Property

CLIQUE uses the anti-monotonicity (apriori) property: all the
subsets of a frequent itemset are frequent. Similarly, if a subset is infrequent, then all its
supersets are infrequent.

Algorithm works in 2 stages:


PROBABILITY MODEL BASED METHODS
Probability model-based methods in clustering are a class of techniques that use statistical models to
represent the underlying probability distributions of data points in a dataset.
These methods are used to group similar data points together into clusters based on their likelihood of
belonging to a particular cluster according to the assumed probability distribution.

Two popular probability model-based clustering methods are Gaussian Mixture Models (GMMs) and
Hidden Markov Models (HMMs). Apart from these, there are other model-based approaches:

1. Fuzzy Clustering
2. EM algorithm
Fuzzy Clustering :
Fuzzy Clustering is a type of clustering algorithm in machine learning that allows a data point to belong
to more than one cluster with different degrees of membership. Unlike traditional clustering algorithms,
such as k-means or hierarchical clustering, which assign each data point to a single cluster, fuzzy
clustering assigns a membership degree between 0 and 1 for each data point for each cluster.

Let us consider clusters ci and cj; an element, say x, can belong to both clusters. The strength of
the association of an object with cluster j is given as wij. The value of wij lies between 0
and 1, and the weights of an object across all clusters add up to 1.
Expectation Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a statistical method used for estimating
parameters in statistical models when you have incomplete or missing data. It's commonly used
in unsupervised machine learning tasks such as clustering and Gaussian Mixture Model (GMM)
fitting.

Given a mix of distributions, data can be generated by randomly picking a distribution and
generating the point. The Gaussian distribution is a bell-shaped curve.

The density function of the Gaussian distribution is given by:

f(x) = (1 / (σ√(2π))) e^(-(x-μ)² / (2σ²))

where μ is the mean and σ is the standard deviation.
The EM algorithm iteratively optimizes a likelihood function in two steps: the E-step
(Expectation) and the M-step (Maximization).
Here's a high-level overview of how the EM algorithm works:

1. Initialization: Start with initial estimates of the model parameters. These initial values can be
random or based on some prior knowledge.
2. E-step (Expectation):
 In this step, you compute the expected values (expectation) of the latent (unobserved)
variables given the observed data and the current parameter estimates.
 This involves calculating the posterior probabilities or likelihoods of the missing data or
latent variables.
 Essentially, you're estimating how likely each possible value of the latent variable is,
given the current model parameters.
3. M-step (Maximization):
 In this step, you update the model parameters to maximize the expected log-likelihood
found in the E-step.
 This involves finding the parameters that make the observed data most likely given the
estimated values of the latent variables.
 The M-step involves solving an optimization problem to find the new parameter values.
4. Iteration:
 Repeat the E-step and M-step alternately until convergence criteria are met. Common
convergence criteria include a maximum number of iterations, a small change in
parameter values, or a small change in the likelihood.
5. Termination:
 Once the EM algorithm converges, you have estimates of the model parameters that
maximize the likelihood of the observed data.
6. Result:
 The final parameter estimates can be used for various purposes, such as clustering,
density estimation, or imputing missing data.

The EM algorithm is widely used in various fields, including machine learning, image
processing, and bioinformatics.

One of its notable applications is in Gaussian Mixture Models (GMMs), where it's used to
estimate the means and covariances of Gaussian distributions that are mixed to model
complex data distributions.
It's important to note that the EM algorithm can sometimes get stuck in local optima, so the
choice of initial parameter values can affect the results. To mitigate this, you may run the
algorithm multiple times with different initializations and select the best result.
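scikit-learn's GaussianMixture runs exactly this E-step/M-step loop internally; a short sketch on illustrative data (n_init restarts EM from several initializations to mitigate local optima):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Data drawn from a mix of two 1-D Gaussians
X = np.concatenate([rng.normal(0, 1, 200),
                    rng.normal(5, 0.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
print(gmm.means_.ravel())      # close to [0, 5]
print(gmm.weights_)            # close to [0.5, 0.5]
print(gmm.predict(X[:5]))      # soft responsibilities hardened to labels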

CLUSTER EVALUATION METHODS


Evaluation of a clustering algorithm is a difficult task, as domain knowledge is absent most of the time.
So, validation of clustering algorithms is difficult as compared to the validation of classification
algorithms.
Evaluation of Clustering
1. Internal
2. External
3. Relative
Cohesion and separation

Here, N – No. of cluster,


C – set of centroids
Xi – centroid
Mj – samples.

Here, x – centroid of the entire dataset


Xi – centroid of the cluster
Ci – size of the cluster
DUNN Index
This metric measures the ratio between the distance between the clusters and the distance within
the clusters. A high Dunn index indicates that the clusters are well-separated and distinct.
DUNN index is calculated as:

Here, α and β are parameters. The DUNN index is a useful measure that can combine both cohesion and
separation.

Silhouette Coefficient
This metric measures how well each data point fits into its assigned cluster and ranges from -1 to
1. A high silhouette coefficient indicates that the data points are well-clustered, while a low
coefficient indicates that the data points may be assigned to the wrong cluster.
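Both evaluation ideas are available in scikit-learn. A sketch of scoring k-means results with the silhouette coefficient on illustrative data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher is better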
