AI and ML Final
Compiled by:
Dr. Girijamma H A, Dr. Sudhamani M J, Meenakshi S J
Department of CSE, RNSIT
VISION AND MISSION OF INSTITUTION
Vision
Building RNSIT into a World Class Institution
Mission
To impart high quality education in Engineering, Technology and Management
with a Difference, Enabling Students to Excel in their Career by
1. Attracting quality Students and preparing them with a strong foundation in fundamentals
so as to achieve distinctions in various walks of life leading to outstanding contributions
2. Imparting value based, need based, choice based and skill based professional education to
the aspiring youth and carving them into disciplined, World class Professionals with social
responsibility
3. Promoting excellence in Teaching, Research and Consultancy that galvanizes academic
consciousness among Faculty and Students
4. Exposing Students to emerging frontiers of knowledge in various domains and make them
suitable for Industry, Entrepreneurship, Higher studies, and Research & Development
5. Providing freedom of action and choice for all the Stakeholders with better visibility
VISION AND MISSION OF DEPARTMENT
Vision
Mission
The Department of CSE will make every effort to promote an intellectual and ethical
environment by
1. Imparting solid foundations and applied aspects in both Computer Science Theory and
Programming practices
2. Providing training and encouraging R&D and Consultancy Services in frontier areas of
Computer Science and Engineering with a Global outlook
3. Fostering the highest ideals of ethics, values and creating awareness of the role of
Computing in Global Environment
4. Educating and preparing the graduates, highly sought after, productive, and well-
respected for their work culture
5. Supporting and inducing lifelong learning
V Semester
These are sample strategies that teachers can use to accelerate the attainment of the various course
outcomes.
1. The lecture method (L) need not be only a traditional lecture; alternative effective
teaching methods may be adopted to attain the outcomes.
2. Use of video/animation to explain the functioning of various concepts.
3. Encourage collaborative learning (group learning) in the class.
4. Ask at least three HOT (Higher Order Thinking) questions in the class, which promote
critical thinking.
5. Adopt Problem Based Learning (PBL), which fosters students’ analytical skills and develops
design-thinking skills such as the ability to design, evaluate, generalize, and analyze
information rather than simply recall it.
6. Introduce topics in multiple representations.
7. Show the different ways to solve the same problem with different logic and encourage the
students to come up with their own creative ways to solve them.
8. Discuss how every concept can be applied to the real world; where possible, this
helps improve the students' understanding.
Module-1
Introduction: What is AI? Foundations and History of AI
Teaching-Learning Process: chalk and board, active learning, problem-based learning
Module-2
Informed Search Strategies: Greedy best-first search, A*search, Heuristic functions.
Introduction to Machine Learning, Understanding Data
The weightage of Continuous Internal Evaluation (CIE) is 50% and that of the Semester End Exam (SEE) is 50%.
The minimum passing mark for the CIE is 40% of the maximum marks (20 marks). A student shall be
deemed to have satisfied the academic requirements and earned the credits allotted to each subject/
course if the student secures not less than 35% (18 marks out of 50) in the semester-end examination
(SEE) and a minimum of 40% (40 marks out of 100) in the sum total of the CIE (Continuous Internal
Evaluation) and SEE (Semester End Examination) taken together.
CIE methods /question paper has to be designed to attain the different levels of Bloom’s
taxonomy as per the outcome defined for the course.
Theory SEE will be conducted by the University as per the scheduled timetable, with common question
papers for the subject (duration: 03 hours).
1. The question paper will have ten questions. Each question is set for 20 marks. Marks scored
shall be proportionally reduced to 50 marks.
2. There will be 2 questions from each module. Each of the two questions under a module (with a
maximum of 3 sub-questions), should have a mix of topics under that module.
The students have to answer 5 full questions, selecting one full question from each module.
Suggested Learning Resources:
Textbooks
1. Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd Edition, Pearson, 2015
2. S. Sridhar and M. Vijayalakshmi, Machine Learning, Oxford, 2021
Reference:
1. Elaine Rich and Kevin Knight, Artificial Intelligence, 3rd Edition, Tata McGraw Hill, 2013
2. George F. Luger, Artificial Intelligence: Structures and Strategies for Complex Problem Solving,
5th Edition, Pearson Education, 2011
3. Tom Mitchell, Machine Learning, McGraw Hill Publication.
Weblinks and Video Lectures (e-Resources):
1. https://www.kdnuggets.com/2019/11/10-free-must-read-books-ai.html
2. https://www.udacity.com/course/knowledge-based-ai-cognitive-systems--ud409
3. https://nptel.ac.in/courses/106/105/106105077/
4. https://www.javatpoint.com/history-of-artificial-intelligence
5. https://www.tutorialandexample.com/problem-solving-in-artificial-intelligence
6. https://techvidvan.com/tutorials/ai-heuristic-search/
7. https://www.analyticsvidhya.com/machine-learning/
8. https://www.javatpoint.com/decision-tree-induction
9. https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/ml-
decision-tree/tutorial/
10. https://www.javatpoint.com/unsupervised-artificial-neural-networks
Activity Based Learning (Suggested Activities in Class)/ Practical Based learning
Role play for strategies – DFS & BFS; outlier detection in banking and insurance transactions for
identifying fraudulent behaviour; uncertainty and reasoning problem – reliability of a sensor used to
detect pedestrians, using Bayes' rule.
COURSE OUTCOMES:
At the end of this course, students will be able to:
CO1 Apply the knowledge of searching and reasoning techniques for different applications.
CO2 Analyze issues and challenges in machine learning within a broader interdisciplinary context
CO3 Apply the knowledge of classification algorithms on various datasets and compare results
CO4 Model the neuron and Neural Network, and to analyze ANN learning and its applications.
CO5 Identify the suitable clustering algorithm for different patterns
CO-PO MATRIX
COURSE OUTCOMES PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3 PSO4
CO1 3 2 2 2 1 1 1 1 2 3 3 3
CO2 3 3 3 3 1 1 1 1 2 3 3 3
CO3 3 3 3 3 1 1 1 1 2 3 3 3
CO4 3 3 3 3 1 1 1 1 2 3 3 3
CO5 3 3 3 3 1 1 1 1 2 3 3 3
Module 1
Introduction to Artificial Intelligence
AI is one of the newest fields in science and engineering. Work started in earnest soon after World War
II, and the name itself was coined in 1956. AI currently encompasses a huge variety of subfields, ranging
from the general (learning and perception) to the specific, such as playing chess, proving mathematical
theorems, writing poetry, driving a car on a crowded street, and diagnosing diseases. AI is relevant to
any intellectual task; it is truly a universal field.
The Turing Test, proposed by Alan Turing (1950), was designed to provide a satisfactory operational
definition of intelligence. A computer passes the test if a human interrogator, after posing some written
questions, cannot tell whether the written responses come from a person or from a computer. The
computer would need to possess the following capabilities:
• natural language processing to enable it to communicate successfully in English;
• knowledge representation to store what it knows or hears;
• automated reasoning to use the stored information to answer questions and to draw new conclusions;
• machine learning to adapt to new circumstances and to detect and extrapolate patterns.
Turing’s test deliberately avoided direct physical interaction between the interrogator and the
computer, because physical simulation of a person is unnecessary for intelligence. However, the so-
called total Turing Test includes a video signal so that the interrogator can test the subject’s perceptual
abilities, as well as the opportunity for the interrogator to pass physical objects “through the hatch.” To
pass the total Turing Test, the computer will need
• computer vision to perceive objects, and
• robotics to manipulate objects and move about.
If we are going to say that a given program thinks like a human, we must have some way of
determining how humans think. We need to get inside the actual workings of human minds. There are
three ways to do this: through introspection—trying to catch our own thoughts as they go by; through
psychological experiments—observing a person in action; and through brain imaging—observing the
brain in action. Once we have a sufficiently precise theory of the mind, it becomes possible to express
the theory as a computer program. If the program’s input–output behavior matches corresponding
human behavior, that is evidence that some of the program’s mechanisms could also be operating in
humans.
The Greek philosopher Aristotle was one of the first to attempt to codify “right thinking,” that is, certain
reasoning processes. His syllogisms provided patterns for argument structures that always yielded
correct conclusions when given correct premises. Logicians in the 19th century developed a precise
notation for statements about all kinds of objects in the world and the relations among them. The so-
called logicist tradition within artificial intelligence hopes to build on such programs to create
intelligent systems. There are two main obstacles to this approach. First, it is not easy to take informal
knowledge and state it in the formal terms required by logical notation, particularly when the
knowledge is less than 100% certain. Second, there is a big difference between solving a problem “in
principle” and solving it in practice.
An agent is just something that acts (agent comes from the Latin agere, to do). Of course, all computer
programs do something, but computer agents are expected to do more: operate autonomously, perceive
their environment, persist over a prolonged time period, adapt to change, and create and pursue goals.
A rational agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best
expected outcome. In the “laws of thought” approach to AI, the emphasis was on correct inferences.
Making correct inferences is sometimes part of being a rational agent, because one way to act rationally
is to reason logically to the conclusion that a given action will achieve one’s goals and then to act on that
conclusion.
The rational-agent approach has two advantages over the other approaches. First, it is more general
than the “laws of thought” approach because correct inference is just one of several possible
mechanisms for achieving rationality. Second, it is more amenable to scientific development than are
approaches based on human behavior or human thought. The standard of rationality is mathematically
well defined and completely general, and can be “unpacked” to generate agent designs that provably
achieve it.
In this section, we provide a brief history of the disciplines that contributed ideas, viewpoints, and
techniques to AI.
1.2.1 Philosophy
Aristotle (384–322 B.C.), was the first to formulate a precise set of laws governing the rational part of
the mind. He developed an informal system of syllogisms for proper reasoning, which in principle
allowed one to generate conclusions mechanically, given initial premises. Much later, Ramon Lull (d.
1315) had the idea that useful reasoning could actually be carried out by a mechanical artifact. Thomas
Hobbes (1588–1679) proposed that reasoning was like numerical computation. Around 1500,
Leonardo da Vinci (1452–1519) designed but did not build a mechanical calculator. Gottfried Wilhelm
Leibniz (1646–1716) built a mechanical device intended to carry out operations on concepts rather
than numbers, but its scope was rather limited. René Descartes (1596–1650) gave the first clear
discussion of the distinction between mind and matter and of the problems that arise. Given a physical
mind that manipulates knowledge, the next problem is to establish the source of
knowledge. The empiricism movement, starting with Francis Bacon’s (1561–1626) Novum Organum,
is characterized by a dictum of John Locke (1632–1704): “Nothing is in the understanding, which was
not first in the senses.” David Hume’s (1711–1776) A Treatise of Human Nature (Hume, 1739)
proposed what is now known as the principle of induction: that general rules are acquired by exposure
to repeated associations between their elements. Building on the work of Ludwig Wittgenstein (1889–
1951) and Bertrand Russell (1872–1970), the famous Vienna Circle, led by Rudolf Carnap (1891–1970),
developed the doctrine of logical positivism. The confirmation theory of Carnap and Carl Hempel
(1905–1997) attempted to analyze the acquisition of knowledge from experience.
The final element in the philosophical picture of the mind is the connection between knowledge and
action. This question is vital to AI because intelligence requires action as well as reasoning. Moreover,
only by understanding how actions are justified can we understand how to build an agent whose
actions are justifiable (or rational).
1.2.2 Mathematics
Philosophers staked out some of the fundamental ideas of AI, but the leap to a formal science required
a level of mathematical formalization in three fundamental areas: logic, computation, and probability.
The idea of formal logic can be traced back to the philosophers of ancient Greece, but its mathematical
development really began with the work of George Boole (1815–1864), who worked out the details of
propositional, or Boolean, logic (Boole, 1847). In 1879, Gottlob Frege (1848–1925) extended Boole’s
logic to include objects and relations, creating the first-order logic that is used today. Alfred Tarski
(1902–1983) introduced a theory of reference that shows how to relate the objects in a logic to objects
in the real world.
The next step was to determine the limits of what could be done with logic and computation.
The first nontrivial algorithm is thought to be Euclid’s algorithm for computing greatest common
divisors. The word algorithm (and the idea of studying them) comes from al-Khowarazmi, a Persian
mathematician of the 9th century, whose writings also introduced Arabic numerals and algebra to
Europe. Boole and others discussed algorithms for logical deduction, and, by the late 19th century,
efforts were under way to formalize general mathematical reasoning as logical deduction. In 1930, Kurt
Gödel (1906–1978) showed that there exists an effective procedure to prove any true statement in the
first-order logic of Frege and Russell, but that first-order logic could not capture the principle of
mathematical induction needed to characterize the natural numbers.
Besides logic and computation, the third great contribution of mathematics to AI is the theory
of probability. The Italian Gerolamo Cardano (1501–1576) first framed the idea of probability,
describing it in terms of the possible outcomes of gambling events. In 1654, Blaise Pascal (1623–1662),
in a letter to Pierre Fermat (1601–1665), showed how to predict the future of an unfinished gambling
game and assign average payoffs to the gamblers. Probability quickly became an invaluable part of all
the quantitative sciences, helping to deal with uncertain measurements and incomplete theories. James
Bernoulli (1654–1705), Pierre Laplace (1749–1827), and others advanced the theory and introduced
new statistical methods. Thomas Bayes (1702–1761), proposed a rule for updating probabilities in the
light of new evidence. Bayes’ rule underlies most modern approaches to uncertain reasoning in AI
systems.
1.2.3 Economics
The science of economics got its start in 1776, when the Scottish philosopher Adam Smith (1723–1790)
published An Inquiry into the Nature and Causes of the Wealth of Nations. While the ancient Greeks and
others had made contributions to economic thought, Smith was the first to treat it as a science, using
the idea that economies can be thought of as consisting of individual agents maximizing their own
economic well-being. Most people think of economics as being about money, but economists will say
that they are really studying how people make choices that lead to preferred outcomes.
Decision theory, which combines probability theory with utility theory, provides a formal and complete
framework for decisions (economic or otherwise) made under uncertainty.
1.2.4 Neuroscience
How do brains process information? Neuroscience is the study of the nervous system, particularly the
brain. Although the exact way in which the brain enables thought is one of the great mysteries of
science, the fact that it does enable thought has been appreciated for thousands of years because of the
evidence that strong blows to the head can lead to mental incapacitation. Nicolas Rashevsky (1936,
1938) was the first to apply mathematical models to the study of the nervous system.
We now have some data on the mapping between areas of the brain and the parts of the body that they
control or from which they receive sensory input. Such mappings are able to change radically over the
course of a few weeks, and some animals seem to have multiple maps. Moreover, we do not fully
understand how other areas can take over functions when one area is damaged. There is almost no
theory on how an individual memory is stored. The measurement of intact brain activity began in 1929
with the invention by Hans Berger of the electroencephalograph (EEG). The recent development of
functional magnetic resonance imaging (fMRI) (Ogawa et al., 1990; Cabeza and Nyberg, 2001) is giving
neuroscientists unprecedentedly detailed images of brain activity, enabling measurements that
correspond in interesting ways to ongoing cognitive processes. These are augmented by advances in
single-cell recording of neuron activity. Individual neurons can be stimulated electrically, chemically, or
even optically (Han and Boyden, 2007), allowing neuronal input– output relationships to be mapped.
Despite these advances, we are still a long way from understanding how cognitive processes actually
work. The truly amazing conclusion is that a collection of simple cells can lead to thought, action, and
consciousness or, in the pithy words of John Searle (1992), brains cause minds.
1.2.5 Psychology
How do humans and animals think and act? The origins of scientific psychology are usually traced to
the work of the German physicist Hermann von Helmholtz (1821–1894) and his student Wilhelm
Wundt (1832–1920). Helmholtz applied the scientific method to the study of human vision, and his
Handbook of Physiological Optics is even now described as “the single most important treatise on the
physics and physiology of human vision” (Nalwa, 1993, p.15).
1.2.8 Linguistics
How does language relate to thought? In 1957, B. F. Skinner published Verbal Behavior. This was a
comprehensive, detailed account of the behaviorist approach to language learning, written by the
foremost expert in the field. But curiously, a review of the book became as well known as the book itself,
and served to almost kill off interest in behaviorism.
Modern linguistics and AI, then, were “born” at about the same time, and grew up together, intersecting
in a hybrid field called computational linguistics or natural language processing. The problem of
understanding language soon turned out to be considerably more complex than it seemed in 1957.
Understanding language requires an understanding of the subject matter and context, not just an
understanding of the structure of sentences. This might seem obvious, but it was not widely
appreciated until the 1960s. Much of the early work in knowledge representation (the study of how to
put knowledge into a form that a computer can reason with) was tied to language and informed by
research in linguistics, which was connected in turn to decades of work on the philosophical analysis of
language.
A problem can be defined formally by five components. The first is the initial state that the agent
starts in; for example, the initial state for our agent in Romania might be described as In(Arad). The
remaining components are:
• A description of the possible actions available to the agent. Given a particular state s, ACTIONS(s)
returns the set of actions that can be executed in s. We say that each of these actions is applicable in s.
For example, from the state In(Arad), the applicable actions are {Go(Sibiu), Go(Timisoara), Go(Zerind)}.
• A description of what each action does; the formal name for this is the transition model, specified by
a function RESULT(s, a) that returns the state that results from doing action a in state s. We also use the
term successor to refer to any state reachable from a given state by a single action. For example, we
have
RESULT(In(Arad),Go(Zerind)) = In(Zerind) .
Together, the initial state, actions, and transition model implicitly define the state space of the
problem—the set of all states reachable from the initial state by any sequence of actions. The state
space forms a directed network or graph in which the nodes are states and the links between nodes are
actions. (The map of Romania shown in Figure 3.2 can be interpreted as a state-space graph if we view
each road as standing for two driving actions, one in each direction.) A path in the state space is a
sequence of states connected by a sequence of actions.
• The goal test, which determines whether a given state is a goal state. Sometimes there is an explicit
set of possible goal states, and the test simply checks whether the given state is one of them. The agent’s
goal in Romania is the singleton set {In(Bucharest)}.
Sometimes the goal is specified by an abstract property rather than an explicitly enumerated set of
states. For example, in chess, the goal is to reach a state called “checkmate,” where the opponent’s king
is under attack and can’t escape.
• A path cost function that assigns a numeric cost to each path. The problem-solving agent chooses a
cost function that reflects its own performance measure. For the agent trying to get to Bucharest, time
is of the essence, so the cost of a path might be its length in kilometers. In this chapter, we assume that
the cost of a path can be described as the sum of the costs of the individual actions along the path. The
step cost of taking action a in state s to reach state s’ is denoted by c(s, a, s’ ). The step costs for Romania
are shown in Figure 3.2 as route distances. We assume that step costs are nonnegative. The preceding
elements define a problem and can be gathered into a single data structure that is given as input to a
problem-solving algorithm. A solution to a problem is an action sequence that leads from the initial
state to a goal state. Solution quality is measured by the path cost function, and an optimal solution has
the lowest path cost among all solutions.
3.1.2 Formulating problems
In the preceding section we proposed a formulation of the problem of getting to Bucharest in terms of
the initial state, actions, transition model, goal test, and path cost. This formulation seems reasonable,
but it is still a model—an abstract mathematical description—and not the real thing. Compare the
simple state description we have chosen, In(Arad), to an actual cross-country trip, where the state of the
world includes so many things: the traveling companions, the current radio program, the scenery out of
the window, the proximity of law enforcement officers, the distance to the next rest stop, the condition
of the road, the weather, and so on. All these considerations are left out of our state descriptions
because they are irrelevant to the problem of finding a route to Bucharest. The process of removing
detail from a representation is called abstraction. In addition to abstracting the state description, we
must abstract the actions themselves. A driving action has many effects. Besides changing the location
of the vehicle and its occupants, it takes up time, consumes fuel, generates pollution, and changes the
agent (as they say, travel is broadening). Our formulation takes into account only the change in location.
Also, there are many actions that we omit altogether: turning on the radio, looking out of the window,
slowing down for law enforcement officers, and so on. And of course, we don’t specify actions at the
level of “turn steering wheel to the left by one degree.” Can we be more precise about defining the
appropriate level of abstraction? Think of the abstract states and actions we have chosen as
corresponding to large sets of detailed world states and detailed action sequences. Now consider a
solution to the abstract problem: for example, the path from Arad to Sibiu to Rimnicu Vilcea to Pitesti to
Bucharest. This abstract solution corresponds to a large number of more detailed paths. For example,
we could drive with the radio on between Sibiu and Rimnicu Vilcea, and then switch it off for the rest of
the trip. The abstraction is valid if we can expand any abstract solution into a solution in the more
detailed world; a sufficient condition is that for every detailed state that is “in Arad,” there is a detailed
path to some state that is “in Sibiu,” and so on. The abstraction is useful if carrying out each of the
actions in the solution is easier than the original problem; in this case they are easy enough that they
can be carried out without further search or planning by an average driving agent. The choice of a good
abstraction thus involves removing as much detail as possible while retaining validity and ensuring that
the abstract actions are easy to carry out. Were it not for the ability to construct useful abstractions,
intelligent agents would be completely swamped by the real world.
3.2 EXAMPLE PROBLEMS
The problem-solving approach has been applied to a vast array of task environments. We list some of
the best known here, distinguishing between toy and real-world problems. A toy problem is intended to
illustrate or exercise various problem-solving methods. It can be given a concise, exact description and
hence is usable by different researchers to compare the performance of algorithms. A real-world
problem is one whose solutions people actually care about. Such problems tend not to have a single
agreed-upon description, but we can give the general flavor of their formulations.
The 8-puzzle, an instance of which is shown in Figure 3.4, consists of a 3×3 board with eight
numbered tiles and a blank space; a tile adjacent to the blank space can slide into the space. The
standard formulation is as follows:
• States: A state description specifies the location of each of the eight tiles and the blank in one of the
nine squares.
• Initial state: Any state can be designated as the initial state. Note that any given goal can be reached
from exactly half of the possible initial states (Exercise 3.4).
• Actions: The simplest formulation defines the actions as movements of the blank space Left, Right, Up,
or Down. Different subsets of these are possible depending on where the blank is.
• Transition model: Given a state and action, this returns the resulting state; for example, if we apply
Left to the start state in Figure 3.4, the resulting state has the 5 and the blank switched.
• Goal test: This checks whether the state matches the goal configuration shown in Figure 3.4. (Other
goal configurations are possible.)
• Path cost: Each step costs 1, so the path cost is the number of steps in the path.
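This formulation translates almost directly into code. Below is a minimal sketch, assuming states are 9-tuples in row-major order with 0 standing for the blank, and taking the goal configuration to be the ordered arrangement (the figure itself is not reproduced here, so the exact goal is an assumption):

```python
# Movements of the blank space on a 3x3 board, as index offsets
# into the row-major 9-tuple state.
MOVES = {"Up": -3, "Down": 3, "Left": -1, "Right": 1}

def actions(state):
    """The subset of {Up, Down, Left, Right} that keeps the blank on the board."""
    i = state.index(0)               # position of the blank
    acts = []
    if i >= 3:     acts.append("Up")
    if i < 6:      acts.append("Down")
    if i % 3 > 0:  acts.append("Left")
    if i % 3 < 2:  acts.append("Right")
    return acts

def result(state, action):
    """Transition model: swap the blank with the adjacent tile."""
    i = state.index(0)
    j = i + MOVES[action]
    s = list(state)
    s[i], s[j] = s[j], s[i]
    return tuple(s)

GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)   # assumed goal: blank first, tiles in order

def goal_test(state):
    return state == GOAL
```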
What abstractions have we included here? The actions are abstracted to their beginning and final
states, ignoring the intermediate locations where the block is sliding. We have abstracted away actions
such as shaking the board when pieces get stuck and ruled out extracting the pieces with a knife and
putting them back again. We are left with a description of the rules of the puzzle, avoiding all the details
of physical manipulations. The 8-puzzle belongs to the family of sliding-block puzzles, which are often
used as test problems for new search algorithms in AI. This family is known to be NP-complete, so one
does not expect to find methods significantly better in the worst case than the search algorithms
described in this chapter and the next. The 8-puzzle has 9!/2 = 181,440 reachable states and is easily
solved. The 15-puzzle (on a 4×4 board) has around 1.3 trillion states, and random instances can be
solved optimally in a few milliseconds by the best search algorithms. The 24-puzzle (on a 5 × 5 board)
has around 10^25 states, and random instances take several hours to solve optimally.
The goal of the 8-queens problem is to place eight queens on a chessboard such that no queen attacks
any other. (A queen attacks any piece in the same row, column or diagonal.) Figure 3.5 shows an
attempted solution that fails: the queen in the rightmost column is attacked by the queen at the top left.
Although efficient special-purpose algorithms exist for this problem and for the whole n-queens family,
it remains a useful test problem for search algorithms. There are two main kinds of formulation. An
incremental formulation involves operators that augment the state description, starting with an empty
state; for the 8-queens problem, this means that each action adds a queen to the state. A complete-state
formulation starts with all 8 queens on the board and moves them around. In either case, the path cost
is of no interest because only the final state counts. The first incremental formulation one might try is
the following:
• States: Any arrangement of 0 to 8 queens on the board is a state.
• Initial state: No queens on the board.
• Actions: Add a queen to any empty square.
• Transition model: Returns the board with a queen added to the specified square.
• Goal test: 8 queens are on the board, none attacked.
In this formulation, we have 64 · 63 ··· 57 ≈ 1.8 × 10^14 possible sequences to investigate. A better
formulation would prohibit placing a queen in any square that is already attacked:
square that is already attacked:
• States: All possible arrangements of n queens (0 ≤ n ≤ 8), one per column in the leftmost n columns,
with no queen attacking another.
• Actions: Add a queen to any square in the leftmost empty column such that it is not attacked by any
other queen.
This formulation reduces the 8-queens state space from 1.8 × 10^14 to just 2,057, and solutions are
easy to find. On the other hand, for 100 queens the reduction is from roughly 10^400 states to about
10^52 states (Exercise 3.5)—a big improvement, but not enough to make the problem tractable. Section
4.1 describes the complete-state formulation, and Chapter 6 gives a simple algorithm that solves even
the million-queens problem with ease.
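A minimal sketch of the improved incremental formulation, assuming states are tuples giving each placed queen's row, one per leftmost column. The depth-first enumeration at the end is only illustrative; it is not the complete-state method of Section 4.1.

```python
def attacked(rows, row):
    """Would a queen placed at (column len(rows), row) be attacked?"""
    col = len(rows)
    return any(r == row or abs(r - row) == abs(c - col)
               for c, r in enumerate(rows))

def actions(rows):
    """Add a queen to any non-attacked square in the leftmost empty column."""
    return [r for r in range(8) if not attacked(rows, r)]

def solutions(rows=()):
    """Enumerate all complete, non-attacking placements depth-first."""
    if len(rows) == 8:
        yield rows
        return
    for r in actions(rows):
        yield from solutions(rows + (r,))

print(sum(1 for _ in solutions()))   # 92 distinct solutions
```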
Our final toy problem was devised by Donald Knuth (1964) and illustrates how infinite state spaces can
arise. Knuth conjectured that, starting with the number 4, a sequence of factorial, square root, and floor
operations will reach any desired positive integer. For example, we can reach 5 from 4 as follows:
floor(sqrt(sqrt(sqrt(sqrt(sqrt((4!)!)))))) = 5.
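This is easy to verify computationally. The sketch below uses nested integer square roots, which give the same result as flooring the real-valued nested roots:

```python
import math

# (4!)! = 24!, then five floored square roots bring it down to 5.
x = math.factorial(math.factorial(4))
for _ in range(5):
    x = math.isqrt(x)   # floor(sqrt(x)), exact even for big integers
print(x)                # 5
```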
Route-finding problems are defined in terms of specified locations and transitions along links between
them; airline travel planning is a familiar example. Commercial travel advice systems use a problem formulation of this kind, with many additional
complications to handle the byzantine fare structures that airlines impose. Any seasoned traveler
knows, however, that not all air travel goes according to plan. A really good system should include
contingency plans—such as backup reservations on alternate flights— to the extent that these are
justified by the cost and likelihood of failure of the original plan.
Touring problems are closely related to route-finding problems, but with an important difference.
Consider, for example, the problem “Visit every city in Figure 3.2 at least once, starting and ending in
Bucharest.” As with route finding, the actions correspond to trips between adjacent cities. The state
space, however, is quite different. Each state must include not just the current location but also the set
of cities the agent has visited. So the initial state would be In(Bucharest), Visited({Bucharest}), a typical
intermediate state would be In(Vaslui), Visited({Bucharest, Urziceni, Vaslui}), and the goal test would
check whether the agent is in Bucharest and all 20 cities have been visited.
The traveling salesperson problem (TSP) is a touring problem in which each city must be visited
exactly once. The aim is to find the shortest tour. The problem is known to be NP-hard, but an enormous
amount of effort has been expended to improve the capabilities of TSP algorithms. In addition to
planning trips for traveling salespersons, these algorithms have been used for tasks such as planning
movements of automatic circuit-board drills and of stocking machines on shop floors.
A VLSI layout problem requires positioning millions of components and connections on a chip to
minimize area, minimize circuit delays, minimize stray capacitances, and maximize manufacturing yield.
The layout problem comes after the logical design phase and is usually split into two parts: cell layout
and channel routing. In cell layout, the primitive components of the circuit are grouped into cells, each
of which performs some recognized function. Each cell has a fixed footprint (size and shape) and
requires a certain number of connections to each of the other cells. The aim is to place the cells on the
chip so that they do not overlap and so that there is room for the connecting wires to be placed between
the cells. Channel routing finds a specific route for each wire through the gaps between the cells. These
search problems are extremely complex, but definitely worth solving. Later in this chapter, we present
some algorithms capable of solving them.
Robot navigation is a generalization of the route-finding problem described earlier. Rather than
following a discrete set of routes, a robot can move in a continuous space with (in principle) an infinite
set of possible actions and states. For a circular robot moving on a flat surface, the space is essentially
two-dimensional. When the robot has arms and legs or wheels that must also be controlled, the search
space becomes many-dimensional. Advanced techniques are required just to make the search space
finite. We examine some of these methods in Chapter 25. In addition to the complexity of the problem,
real robots must also deal with errors in their sensor readings and motor controls.
Automatic assembly sequencing of complex objects by a robot was first demonstrated by FREDDY
(Michie, 1972). Progress since then has been slow but sure, to the point where the assembly of intricate
objects such as electric motors is economically feasible. In assembly problems, the aim is to find an
order in which to assemble the parts of some object. If the wrong order is chosen, there will be no way
to add some part later in the sequence without undoing some of the work already done. Checking a step
in the sequence for feasibility is a difficult geometrical search problem closely related to robot
navigation. Thus, the generation of legal actions is the expensive part of assembly sequencing. Any
practical algorithm must avoid exploring all but a tiny fraction of the state space. Another important
assembly problem is protein design, in which the goal is to find a sequence of amino acids that will fold
into a three-dimensional protein with the right properties to cure some disease.
The process of expanding nodes on the frontier continues until either a solution is found or there are
no more states to expand. The general TREE-SEARCH algorithm is shown informally in Figure 3.7.
Search algorithms all share this basic structure; they vary primarily according to how they choose
which state to expand next—the so-called search strategy.
The eagle-eyed reader will notice one peculiar thing about the search tree shown in Figure 3.6: it
includes the path from Arad to Sibiu and back to Arad again! We say that In(Arad) is a repeated state in
the search tree, generated in this case by a loopy path. Considering such loopy paths means that the
complete search tree for Romania is infinite because there is no limit to how often one can traverse a
loop. On the other hand, the state space—the map shown in Figure 3.2—has only 20 states. As we
discuss in Section 3.4, loops can cause certain algorithms to fail, making otherwise solvable problems
unsolvable. Fortunately, there is no need to consider loopy paths. We can rely on more than intuition for
this: because path costs are additive and step costs are nonnegative, a loopy path to any given state is
never better than the same path with the loop removed. Loopy paths are a special case of the more
general concept of redundant paths, which exist whenever there is more than one way to get from one
state to another. Consider the paths Arad–Sibiu (140 km long) and Arad–Zerind–Oradea–Sibiu (297 km
long). Obviously, the second path is redundant—it’s just a worse way to get to the same state. If you are
concerned about reaching the goal, there’s never any reason to keep more than one path to any given
state, because any goal state that is reachable by extending one path is also reachable by extending the
other. In some cases, it is possible to define the problem itself so as to eliminate redundant paths. For
example, if we formulate the 8-queens problem (page 71) so that a queen can be placed in any column,
then each state with n queens can be reached by n! different paths; but if we reformulate the problem
so that each new queen is placed in the leftmost empty column, then each state can be reached only
through one path.
In other cases, redundant paths are unavoidable. This includes all problems where the actions are
reversible, such as route-finding problems and sliding-block puzzles. Route finding on a rectangular
grid (like the one used later for Figure 3.9) is a particularly important example in computer games. In
such a grid, each state has four successors, so a search tree of depth d that includes repeated states has
4^d leaves; but there are only about 2d^2 distinct states within d steps of any given state. For d = 20, this
means about a trillion nodes but only about 800 distinct states. Thus, following redundant paths can
cause a tractable problem to become intractable. This is true even for algorithms that know how to
avoid infinite loops. As the saying goes, algorithms that forget their history are doomed to repeat it. The
way to avoid exploring redundant paths is to remember where one has been. To do this, we augment
the TREE-SEARCH algorithm with a data structure called the explored set (also known as the closed
list), which remembers every expanded node. Newly generated nodes that match previously generated
nodes—ones in the explored set or the frontier—can be discarded instead of being added to the frontier.
The new algorithm, called GRAPH-SEARCH, is shown informally in Figure 3.7. The specific algorithms in
this chapter draw on this general design. Clearly, the search tree constructed by the GRAPH-SEARCH
algorithm contains at most one copy of each state, so we can think of it as growing a tree directly on the
state-space graph, as shown in Figure 3.8. The algorithm has another nice property: the frontier
separates the state-space graph into the explored region and the unexplored region, so that every path
from the initial state to an unexplored state has to pass through a state in the frontier. (If this seems
completely obvious, try Exercise 3.13 now.) This property is illustrated in Figure 3.9. As every step
moves a state from the frontier into the explored region while moving some states from the unexplored
region into the frontier, we see that the algorithm is systematically examining the states in the state
space, one by one, until it finds a solution.
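Since Figure 3.7 is not reproduced in these notes, the following Python sketch captures the informal GRAPH-SEARCH design just described. The FIFO queue discipline and the (state, action-list) node representation are simplifying assumptions of this sketch, not the book's exact pseudocode; other search strategies change only how the frontier is ordered.

```python
from collections import deque

def graph_search(problem):
    """TREE-SEARCH augmented with an explored set to discard redundant paths."""
    frontier = deque([(problem.initial, [])])    # (state, actions so far)
    explored = set()                             # the "closed list"
    while frontier:
        state, path = frontier.popleft()         # choose a node to expand
        if problem.goal_test(state):
            return path                          # solution: an action sequence
        explored.add(state)
        for a in problem.actions(state):
            s2 = problem.result(state, a)
            # discard states already expanded or already on the frontier
            if s2 not in explored and all(s2 != s for s, _ in frontier):
                frontier.append((s2, path + [a]))
    return None                                  # no more states to expand
```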
Given the components for a parent node, it is easy to see how to compute the necessary components for
a child node. The function CHILD-NODE takes a parent node and an action and returns the resulting
child node:
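The textbook's pseudocode for CHILD-NODE is not reproduced above, so here is a sketch of the four-component node structure and the child construction it describes, together with the SOLUTION function mentioned next; the step_cost method name follows the route-finding sketch earlier.

```python
class Node:
    def __init__(self, state, parent=None, action=None, path_cost=0.0):
        self.state = state          # the world configuration this node stands for
        self.parent = parent        # PARENT pointer stringing nodes into a tree
        self.action = action        # the action that generated this node
        self.path_cost = path_cost  # g(n): cost of the path from the root

def child_node(problem, parent, action):
    """CHILD-NODE(problem, parent, action): build the resulting child node."""
    s2 = problem.result(parent.state, action)
    cost = parent.path_cost + problem.step_cost(parent.state, action, s2)
    return Node(s2, parent, action, cost)

def solution(node):
    """SOLUTION: follow PARENT pointers back to the root, then reverse."""
    actions = []
    while node.parent is not None:
        actions.append(node.action)
        node = node.parent
    return list(reversed(actions))
```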
The node data structure is depicted in Figure 3.10. Notice how the PARENT pointers string the nodes
together into a tree structure. These pointers also allow the solution path to be extracted when a goal
node is found; we use the SOLUTION function to return the sequence of actions obtained by following
parent pointers back to the root. Up to now, we have not been very careful to distinguish between nodes
and states, but in writing detailed algorithms it’s important to make that distinction. A node is a
bookkeeping data structure used to represent the search tree. A state corresponds to a configuration of
the world. Thus, nodes are on particular paths, as defined by PARENT pointers, whereas states are not.
Furthermore, two different nodes can contain the same world state if that state is generated via two
different search paths. Now that we have nodes, we need somewhere to put them. The frontier needs to
be stored in such a way that the search algorithm can easily choose the next node to expand according
to its preferred strategy. The appropriate data structure for this is a queue. The operations on a queue
are as follows:
• EMPTY?(queue) returns true only if there are no more elements in the queue.
• POP(queue) removes the first element of the queue and returns it.
• INSERT(element, queue) inserts an element and returns the resulting queue.
Queues are characterized by the order in which they store the inserted nodes. Three common variants
are the first-in, first-out or FIFO queue, which pops the oldest element of the queue; the last-in, first-
out or LIFO queue (also known as a stack), which pops the newest element of the queue; and the
priority queue, which pops the element of the queue with the highest priority according to some
ordering function. The explored set can be implemented with a hash table to allow efficient checking for
repeated states. With a good implementation, insertion and lookup can be done in roughly constant
time no matter how many states are stored. One must take care to implement the hash table with the
right notion of equality between states. For example, in the traveling salesperson problem (page 74),
the hash table needs to know that the set of visited cities {Bucharest,Urziceni,Vaslui} is the same as
{Urziceni,Vaslui,Bucharest}. Sometimes this can be achieved most easily by insisting that the data
structures for states be in some canonical form; that is, logically equivalent states should map to the
same data structure. In the case of states described by sets, for example, a bit-vector representation or a
sorted list without repetition would be canonical, whereas an unsorted list would not.
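As an illustration of canonical form, Python's frozenset gives an order-insensitive, hashable representation of a set of visited cities, so the two logically equivalent TSP states mentioned above hash identically in the explored set:

```python
# Two descriptions of the same state: same city, same visited set.
state_a = ("Vaslui", frozenset({"Bucharest", "Urziceni", "Vaslui"}))
state_b = ("Vaslui", frozenset({"Urziceni", "Vaslui", "Bucharest"}))
assert state_a == state_b and hash(state_a) == hash(state_b)

explored = {state_a}
print(state_b in explored)   # True: the repeated state is detected
```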
3.3.2 Measuring problem-solving performance
Before we get into the design of specific search algorithms, we need to consider the criteria that might
be used to choose among them. We can evaluate an algorithm’s performance in four ways:
• Completeness: Is the algorithm guaranteed to find a solution when there is one?
• Optimality: Does the strategy find the optimal solution, as defined on page 68?
• Time complexity: How long does it take to find a solution?
• Space complexity: How much memory is needed to perform the search?
Time and space complexity are always considered with respect to some measure of the problem
difficulty. In theoretical computer science, the typical measure is the size of the state space graph, |V| +
|E|, where V is the set of vertices (nodes) of the graph and E is the set of edges (links). This is
appropriate when the graph is an explicit data structure that is input to the search program. (The map
of Romania is an example of this.) In AI, the graph is often represented implicitly by the initial state,
actions, and transition model and is frequently infinite. For these reasons, complexity is expressed in
terms of three quantities: b, the branching factor or maximum number of successors of any node; d, the
depth of the shallowest goal node (i.e., the number of steps along the path from the root); and m, the
maximum length of any path in the state space. Time is often measured in terms of the number of nodes
generated during the search, and space in terms of the maximum number of nodes stored in memory.
For the most part, we describe time and space complexity for search on a tree; for a graph, the answer
depends on how “redundant” the paths in the state space are. To assess the effectiveness of a search
algorithm, we can consider just the search cost— which typically depends on the time complexity but
can also include a term for memory usage—or we can use the total cost, which combines the search
cost and the path cost of the solution found. For the problem of finding a route from Arad to Bucharest,
the search cost is the amount of time taken by the search and the solution cost is the total length of the
path in kilometers. Thus, to compute the total cost, we have to add milliseconds and kilometers. There
is no “official exchange rate” between the two, but it might be reasonable in this case to convert
kilometers into milliseconds by using an estimate of the car’s average speed (because time is what the
agent cares about). This enables the agent to find an optimal tradeoff point at which further
computation to find a shorter path becomes counterproductive. The more general problem of tradeoffs
between different goods is taken up in Chapter 16.
Technically, breadth-first search is optimal if the path cost is a nondecreasing function of the depth of
the node. The most common such scenario is that all actions have the same cost. So far, the news about
breadth-first search has been good. The news about time and space is not so good. Imagine searching a
uniform tree where every state has b successors. The root of the search tree generates b nodes at the
first level, each of which generates b more nodes, for a total of b^2 at the second level. Each of these
generates b more nodes, yielding b^3 nodes at the third level, and so on. Now suppose that the solution
is at depth d. In the worst case, it is the last node generated at that level. Then the total number of nodes
generated is b + b^2 + b^3 + ··· + b^d = O(b^d). (If the algorithm were to apply the goal test to nodes
when selected for expansion, rather than when generated, the whole layer of nodes at depth d would be
expanded before the goal was detected and the time complexity would be O(b^(d+1)).)
As for space complexity: for any kind of graph search, which stores every expanded node in the
explored set, the space complexity is always within a factor of b of the time complexity. For breadth-first
graph search in particular, every node generated remains in memory. There will be O(b^(d−1)) nodes in the
explored set and O(b^d) nodes in the frontier,
so the space complexity is O(b^d), i.e., it is dominated by the size of the frontier. Switching to a tree
search would not save much space, and in a state space with many redundant paths, switching could
cost a great deal of time. An exponential complexity bound such as O(b^d) is scary. Figure 3.13 shows
why. It lists, for various values of the solution depth d, the time and memory required for a breadth-first
search with branching factor b = 10. The table assumes that 1 million nodes can be generated per
second and that a node requires 1000 bytes of storage. Many search problems fit roughly within these
assumptions (give or take a factor of 100) when run on a modern personal computer.
Two lessons can be learned from Figure 3.13. First, the memory requirements are a bigger problem for
breadth-first search than is the execution time. One might wait 13 days for the solution to an important
problem with search depth 12, but no personal computer has the petabyte of memory it would take.
Fortunately, other strategies require less memory. The second lesson is that time is still a major factor. If
your problem has a solution at depth 16, then (given our assumptions) it will take about 350 years for
breadth-first search (or indeed any uninformed search) to find it. In general, exponential-complexity
search problems cannot be solved by uninformed methods for any but the smallest instances.
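A minimal breadth-first graph search sketch, applying the goal test when nodes are generated, as discussed above. As a simplification, newly generated states are added to the explored set immediately, which also covers the frontier-membership check:

```python
from collections import deque

def breadth_first_search(problem):
    if problem.goal_test(problem.initial):
        return []                                # empty action sequence
    frontier = deque([(problem.initial, [])])    # FIFO queue of (state, actions)
    explored = {problem.initial}                 # covers frontier and expanded states
    while frontier:
        state, path = frontier.popleft()
        for a in problem.actions(state):
            s2 = problem.result(state, a)
            if s2 not in explored:
                if problem.goal_test(s2):        # test when generated, not expanded
                    return path + [a]
                explored.add(s2)
                frontier.append((s2, path + [a]))
    return None
```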
3.4.3 Depth-first search
Depth-first search always expands the deepest node in the current frontier of the search tree. The
search proceeds immediately to the deepest level of the tree, where the nodes have no successors. As
those nodes are expanded, they are dropped from the frontier, so then the search “backs up” to the next deepest node that still has
unexplored successors. The depth-first search algorithm is an instance of the graph-search algorithm in
Figure 3.7; whereas breadth-first-search uses a FIFO queue, depth-first search uses a LIFO queue. A
LIFO queue means that the most recently generated node is chosen for expansion. This must be the
deepest unexpanded node because it is one deeper than its parent—which, in turn, was the deepest
unexpanded node when it was selected. As an alternative to the GRAPH-SEARCH-style implementation,
it is common to implement depth-first search with a recursive function that calls itself on each of its
children in turn. (A recursive depth-first algorithm incorporating a depth limit is shown in Figure 3.17.)
The properties of depth-first search depend strongly on whether the graph-search or tree-search
version is used. The graph-search version, which avoids repeated states and redundant paths, is
complete in finite state spaces because it will eventually expand every node. The tree-search version, on
the other hand, is not complete.
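A sketch of the recursive, depth-limited variant the text mentions (in the spirit of Figure 3.17, which is not reproduced here). The "cutoff" return value distinguishes hitting the depth limit from genuine failure:

```python
def depth_limited_search(problem, limit):
    """Recursive depth-first search down to a fixed depth limit.
    Returns an action list, None on failure, or "cutoff" at the limit."""
    def recurse(state, path, depth):
        if problem.goal_test(state):
            return path
        if depth == limit:
            return "cutoff"
        cutoff = False
        for a in problem.actions(state):
            result = recurse(problem.result(state, a), path + [a], depth + 1)
            if result == "cutoff":
                cutoff = True
            elif result is not None:
                return result
        return "cutoff" if cutoff else None
    return recurse(problem.initial, [], 0)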
Module 2
INFORMED SEARCH STRATEGIES
3.5 INFORMED (HEURISTIC) SEARCH STRATEGIES
This section shows how an informed search strategy—one that uses problem-specific
knowledge beyond the definition of the problem itself—can find solutions more efficiently
than can an uninformed strategy. The general approach we consider is called best-first
search. Best-first search is an instance of the general TREE-SEARCH or GRAPH-SEARCH
algorithm in which a node is selected for expansion based on an evaluation function, f(n).
The evaluation function is construed as a cost estimate, so the node with the lowest
evaluation is expanded first. The implementation of best-first graph search is identical to
that for uniform-cost search (Figure 3.14), except for the use of f instead of g to order the
priority queue. The choice of f determines the search strategy. (For example, as Exercise
3.21 shows, best-first tree search includes depth-first search as a special case.)
Most best-first algorithms include as a component of f a heuristic function, denoted
h(n): h(n) = estimated cost of the cheapest path from the state at node n to a goal state.
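As a concrete illustration, here is a minimal sketch of best-first graph search ordering a priority queue by an evaluation function f. It reuses the RouteProblem interface from the problem-formulation sketch; re-opening a state whenever a cheaper path to it is found is one common implementation choice, not the book's exact pseudocode.

```python
import heapq

def best_first_search(problem, f):
    """Expand the frontier node with the lowest f-value first.
    Pass f = h for greedy best-first search, f = g + h for A*."""
    root = (problem.initial, [], 0.0)        # node = (state, actions, g)
    frontier = [(f(root), 0, root)]          # priority queue ordered by f
    best_g = {problem.initial: 0.0}          # cheapest g found per state
    tie = 1                                  # tie-breaker; avoids comparing nodes
    while frontier:
        _, _, (state, path, g) = heapq.heappop(frontier)
        if problem.goal_test(state):
            return path
        for a in problem.actions(state):
            s2 = problem.result(state, a)
            g2 = g + problem.step_cost(state, a, s2)
            if s2 not in best_g or g2 < best_g[s2]:
                best_g[s2] = g2
                node = (s2, path + [a], g2)
                heapq.heappush(frontier, (f(node), tie, node))
                tie += 1
    return None
```

With a heuristic table h, greedy best-first search is best_first_search(p, lambda n: h[n[0]]) and A* is best_first_search(p, lambda n: n[2] + h[n[0]]).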
3.5.1 Greedy best-first search
Greedy best-first search tries to expand the node that appears to be closest to the goal, on the
grounds that this is likely to lead to a solution quickly; it evaluates nodes by using just the heuristic
function, f(n) = h(n). The tree-search version of greedy best-first search is incomplete even in a finite
state space, because it can oscillate between two states and be led into an infinite loop. (The
graph-search version is complete in finite spaces, but not in infinite ones.) The worst-case time and
space complexity for the tree version is O(b^m), where m is the maximum depth of the search space.
With a good heuristic function, however, the complexity can be reduced substantially. The amount of
the reduction depends on the particular problem and on the quality of the heuristic.
3.5.2 A* search: Minimizing the total estimated solution cost
The most widely known form of best-first search is called A* search. It evaluates nodes by combining
g(n), the cost to reach the node, and h(n), the estimated cost to get from the node to the goal:
f(n) = g(n) + h(n). Thus f(n) is the estimated cost of the cheapest solution through n.
Conditions for optimality: Admissibility and consistency
The first condition we require for optimality is that h(n) be an admissible heuristic. An admissible
heuristic is one that never overestimates the cost to reach the goal. Because g(n) is the actual cost to
reach n along the current path, and f(n) = g(n) + h(n), we have as an immediate consequence that f(n)
never overestimates the true cost of a solution along the current path through n.
Admissible heuristics are by nature optimistic because they think the cost of solving the
problem is less than it actually is. An obvious example of an admissible heuristic is the
straight-line distance hSLD that we used in getting to Bucharest. Straight-line distance is
admissible because the shortest path between any two points is a straight line, so the
straight line cannot be an overestimate. In Figure 3.24, we show the progress of an A∗ tree
search for Bucharest. The values of g are computed from the step costs in Figure 3.2, and
the values of hSLD are given in Figure 3.22. Notice in particular that Bucharest first appears
on the frontier at step (e), but it is not selected for expansion because its f-cost (450) is
higher than that of Pitesti (417). Another way to say this is that there might be a solution
through Pitesti whose cost is as low as 417, so the algorithm will not settle for a solution
that costs 450. A second, slightly stronger condition called consistency (or sometimes
monotonicity) is required only for applications of A∗ to graph search. A heuristic h(n) is
consistent if, for every node n and every successor n’ of n generated by any action a, the
estimated cost of reaching the goal from n is no greater than the step cost of getting to
n’ plus the estimated cost of reaching the goal from n’:
h(n) ≤ c(n, a, n’ ) + h(n’ ) .
This is a form of the general triangle inequality, which stipulates that each side of a triangle
cannot be longer than the sum of the other two sides. Here, the triangle is formed by n, n’,
and the goal Gn closest to n. For an admissible heuristic, the inequality makes perfect
sense: if there were a route from n to Gn via n’ that was cheaper than h(n), that would
violate the property that h(n) is a lower bound on the cost to reach Gn. It is fairly easy to
show (Exercise 3.29) that every consistent heuristic is also admissible. Consistency is
therefore a stricter requirement than admissibility, but one has to work quite hard to
concoct heuristics that are admissible but not consistent. All the admissible heuristics we
discuss in this chapter are also consistent. Consider, for example, hSLD . We know that the
general triangle inequality is satisfied when each side is measured by the straight-line
distance and that the straight-line distance between n and n’ is no greater than c(n, a, n’ ).
Hence, hSLD is a consistent heuristic.
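Consistency is also easy to check mechanically over an explicit graph. The sketch below reuses the GRAPH fragment from the route-finding sketch earlier; the straight-line distances to Bucharest are taken from the textbook's Figure 3.22 for the cities included there:

```python
# Straight-line distances to Bucharest (Figure 3.22), for the cities
# in the GRAPH fragment defined in the route-finding sketch above.
H_SLD = {"Arad": 366, "Bucharest": 0, "Fagaras": 176, "Oradea": 380,
         "Pitesti": 100, "Rimnicu Vilcea": 193, "Sibiu": 253,
         "Timisoara": 329, "Zerind": 374}

def is_consistent(h, graph):
    """Check h(n) <= c(n, a, n') + h(n') over every edge of the graph."""
    return all(h[n] <= cost + h[n2]
               for n in graph for n2, cost in graph[n].items())

print(is_consistent(H_SLD, GRAPH))   # True: hSLD obeys the triangle inequality
```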
Optimality of A*
As we mentioned earlier, A∗ has the following properties: the tree-search
version of A∗ is optimal if h(n) is admissible, while the graph-search version is optimal if
h(n) is consistent. We show the second of these two claims since it is more useful. The
argument essentially mirrors the argument for the optimality of uniform-cost search, with
g replaced by f—just as in the A∗ algorithm itself. The first step is to establish the following:
if h(n) is consistent, then the values of f(n) along any path are nondecreasing. The proof
follows directly from the definition of consistency. Suppose n’ is a successor of n; then
g(n’) = g(n) + c(n, a, n’) for some action a, and we have
f(n’) = g(n’) + h(n’) = g(n) + c(n, a, n’) + h(n’) ≥ g(n) + h(n) = f(n) .
The next step is to prove that whenever A∗ selects a node n for expansion, the optimal
path to that node has been found. Were this not the case, there would have to be another
frontier node n’ on the optimal path from the start node to n, by the graph separation
property of
Figure 3.9; because f is nondecreasing along any path, n’ would have lower f-cost than n and
would have been selected first. From the two preceding observations, it follows that the
sequence of nodes expanded by A∗ using GRAPH-SEARCH is in nondecreasing order of f(n).
Hence, the first goal node selected for expansion must be an optimal solution because f is
the true cost for goal nodes (which have h = 0) and all later goal nodes will be at least as
expensive. The fact that f-costs are nondecreasing along any path also means that we can
draw contours in the state space, just like the contours in a topographic map. Figure 3.25
shows an example. Inside the contour labeled 400, all nodes have f(n) less than or equal to
400, and so on. Then, because A∗ expands the frontier node of lowest f-cost, we can see
that an A∗ search fans out from the start node, adding nodes in concentric bands of
increasing f-cost. With uniform-cost search (A∗ search using h(n)=0), the bands will be
“circular” around the start state. With more accurate heuristics, the bands will stretch
toward the goal state and become more narrowly focused around the optimal path. If C∗ is
the cost of the optimal solution path, then we can say the following:
• A∗ expands all nodes with f(n) < C∗.
• A∗ might then expand some of the nodes right on the “goal contour” (where f(n) = C∗)
before selecting a goal node. Completeness requires that there be only finitely many nodes
with cost less than or equal to C∗, a condition that is true if all step costs exceed some finite
ε and if b is finite. Notice that A∗ expands no nodes with f(n) > C∗—for example, Timisoara
is not expanded in Figure 3.24 even though it is a child of the root. We say that the subtree
below
Timisoara is pruned; because hSLD is admissible, the algorithm can safely ignore this
subtree while still guaranteeing optimality. The concept of pruning—eliminating
possibilities from consideration without having to examine them—is important for many
areas of AI. One final observation is that among optimal algorithms of this type—
algorithms that extend search paths from the root and use the same heuristic
information—A∗ is optimally efficient for any given consistent heuristic. That is, no other
optimal algorithm is guaranteed to expand fewer nodes than A∗ (except possibly through
tie-breaking among nodes with f(n) = C∗). This is because any algorithm that does not
expand all nodes with f(n) < C∗ runs the risk of missing the optimal solution. That A∗
search is complete, optimal, and optimally efficient among all such algorithms is rather
satisfying. Unfortunately, it does not mean that A∗ is the answer to all our searching needs.
The catch is that, for most problems, the number of states within the goal contour search
space is still exponential in the length of the solution. The details of the analysis are beyond
the scope of this book, but the basic results are as follows. For problems with constant step
costs, the growth in run time as a function of the optimal solution depth d is analyzed in
terms of the the absolute error or the relative error of the heuristic. The absolute error is
defined as Δ ≡ h∗ − h, where h∗ is the actual cost of getting from the root to the goal, and
the relative error is defined as ≡ (h∗ − h)/h∗. The complexity results depend very strongly
on the assumptions made about the state space. The simplest model studied is a state
space that has a single goal and is essentially a tree with reversible actions. (The 8-puzzle
satisfies the first and third of these assumptions.) In this case, the time complexity of A∗ is
exponential in the maximum absolute error, that is, O(b^Δ). For constant step costs, we can
write this as O(b^d), where d is the solution depth. For almost all heuristics in practical
use, the absolute error is at least proportional to the path cost h∗, so is constant or growing
and the time complexity is exponential in d. We can also see the effect of a more accurate
heuristic: O(b^d) = O((b^)^d), so the effective branching factor (defined more formally in
the next section) is b^. When the state space has many goal states—particularly near-
optimal goal states—the search process can be led astray from the optimal path and there
is an extra cost proportional to the number of goals whose cost is within a factor ε of the
optimal cost. Finally, in the general case of a graph, the situation is even worse. There can
be exponentially many states with f(n) < C∗ even if the absolute error is bounded by a
constant. For example, consider a version of the vacuum world where the agent can clean
up any square for unit cost without even having to visit it: in that case, squares can be
cleaned in any order. With N initially dirty squares, there are 2^N states where some subset
has been cleaned and all of them are on an optimal solution path—and hence satisfy f(n)
< C∗—even if the heuristic has an error of 1. The complexity of A∗ often makes it
impractical to insist on finding an optimal solution. One can use variants of A∗ that find
suboptimal solutions quickly, or one can sometimes design heuristics that are more
accurate but not strictly admissible. In any case, the use of a good heuristic still provides
enormous savings compared to the use of an uninformed search. In Section 3.6, we look at
the question of designing good heuristics. Computation time is not, however, A∗’s main
drawback. Because it keeps all generated nodes in memory (as do all GRAPH-SEARCH
algorithms), A∗ usually runs out of space long before it runs out of time. For this reason,
A∗ is not practical for many large-scale problems. There are, however, algorithms that
overcome the space problem without sacrificing optimality or completeness, at a small
cost in execution time. We discuss these next.
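To make the preceding discussion concrete, here is a minimal Python sketch of graph-search A∗. It is a sketch, not the book's pseudocode: the names astar, neighbors, and h are illustrative; the state space is assumed to be given as a successor function yielding (successor, step-cost) pairs; and h is assumed consistent, so the first goal popped from the frontier is optimal.

import heapq

def astar(start, goal, neighbors, h):
    # Frontier entries are ordered by f = g + h, so the popped node always
    # has the lowest f-cost, giving the contour-by-contour expansion above.
    frontier = [(h(start), 0, start, [start])]   # (f, g, node, path)
    best_g = {start: 0}                          # cheapest g found so far
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:                         # with consistent h, first goal popped is optimal
            return path, g
        if g > best_g.get(node, float("inf")):   # stale queue entry; skip it
            continue
        for succ, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(succ, float("inf")):
                best_g[succ] = g2
                heapq.heappush(frontier, (g2 + h(succ), g2, succ, path + [succ]))
    return None, float("inf")

Note that, as the text warns, the best_g dictionary keeps every generated node in memory, which is exactly the space problem discussed above.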
One might ask whether h2 (the sum of the Manhattan distances of the tiles from their goal
positions) is always better than h1 (the number of misplaced tiles). The answer is “Essentially,
yes.” It is easy to see from the definitions of the two heuristics that, for any node n, h2(n) ≥ h1(n).
We thus say that h2 dominates h1. Domination translates directly into efficiency: A∗ using
h2 will never expand more nodes than A∗ using h1 (except possibly for some nodes with
f(n) = C∗). The argument is simple. Recall the observation on page 97 that every node with
f(n) < C∗ will surely be expanded. This is the same as saying that every node with h(n) <
C∗ − g(n) will surely be expanded. But because h2 is at least as big as h1 for all nodes, every
node that is surely expanded by A∗ search with h2 will also surely be expanded with h1,
and h1 might cause other nodes to be expanded as well. Hence, it is generally better to use
a heuristic function with higher values, provided it is consistent and that the computation
time for the heuristic is not too long.
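As a concrete illustration (a sketch, not from the text), the two standard 8-puzzle heuristics can be written as follows. The GOAL tuple and the state encoding (a 9-tuple with 0 for the blank) are assumptions chosen for the example. Every misplaced tile needs at least one move, so h2(n) ≥ h1(n) on every state, which is exactly the domination property.

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)   # one common goal convention, blank last

def h1(state):
    # Number of misplaced tiles (the blank is not counted).
    return sum(1 for i, t in enumerate(state) if t != 0 and t != GOAL[i])

def h2(state):
    # Sum of Manhattan distances of the tiles from their goal squares.
    total = 0
    for i, t in enumerate(state):
        if t == 0:
            continue
        gi = GOAL.index(t)
        total += abs(i // 3 - gi // 3) + abs(i % 3 - gi % 3)
    return total

s = (7, 2, 4, 5, 0, 6, 8, 3, 1)
assert h2(s) >= h1(s)   # h2 dominates h1 on every state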
3.6.2 Generating admissible heuristics from relaxed problems
Because a heuristic derived in this way is an exact cost for the corresponding relaxed
problem, it must obey the triangle inequality and is therefore consistent (see page 95). If a
problem definition is written down in a formal language, it is possible to construct relaxed
problems automatically. For example, if the 8-puzzle actions are described as A
tile can move from square A to square B if A is horizontally or vertically adjacent to B and
B is blank, we can generate three relaxed problems by removing one or both of the
conditions:
(a) A tile can move from square A to square B if A is adjacent to B.
(b) A tile can move from square A to square B if B is blank.
(c) A tile can move from square A to square B.
From (a), we can derive h2 (Manhattan distance). The reasoning is that h2 would be the
proper score if we moved each tile in turn to its destination. The heuristic derived from (b)
is discussed in Exercise 3.31. From (c), we can derive h1 (misplaced tiles) because it would
be the proper score if tiles could move to their intended destination in one step. Notice
that it is crucial that the relaxed problems generated by this technique can be solved
essentially without search, because the relaxed rules allow the problem to be decomposed
into eight independent subproblems. If the relaxed problem is hard to solve, then the
values of the corresponding heuristic will be expensive to obtain. A program called
ABSOLVER can generate heuristics automatically from problem definitions, using the
“relaxed problem” method and various other techniques (Prieditis, 1993). ABSOLVER
generated a new heuristic for the 8-puzzle that was better than any preexisting heuristic
and found the first useful heuristic for the famous Rubik’s Cube puzzle. One problem with
generating new heuristic functions is that one often fails to get a single “clearly best”
heuristic. If a collection of admissible heuristics h1 ...hm is available for a problem and
none of them dominates any of the others, which should we choose? As it turns out, we
need not make a choice. We can have the best of all worlds, by defining
h(n) = max{h1(n),...,hm(n)} .
This composite heuristic uses whichever function is most accurate on the node in question.
Because the component heuristics are admissible, h is admissible; it is also easy to prove
that h is consistent. Furthermore, h dominates all of its component heuristics.
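A hedged one-line sketch of this composite heuristic (make_composite is an illustrative name; h1 and h2 are the 8-puzzle heuristics sketched earlier):

def make_composite(*heuristics):
    # h(n) = max of admissible components: still admissible, consistent if
    # every component is, and it dominates each component heuristic.
    return lambda state: max(h(state) for h in heuristics)

h = make_composite(h1, h2)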
3.6.3 Generating admissible heuristics from subproblems: Pattern databases
Admissible heuristics can also be derived from the solution cost of a subproblem of a given
problem. For example, Figure 3.30 shows a subproblem of the 8-puzzle instance in Figure
3.28. The subproblem involves getting tiles 1, 2, 3, 4 into their correct positions. Clearly,
the cost of the optimal solution of this subproblem is a lower bound on the cost of the
complete problem. It turns out to be more accurate than Manhattan distance in some cases.
The idea behind pattern databases is to store these exact solution costs for every possible
subproblem instance—in our example, every possible configuration of the four tiles and
the blank. (The locations of the other four tiles are irrelevant for the purposes of solving
the subproblem, but moves of those tiles do count toward the cost.) Then we compute an
admissible heuristic hDB for each complete state encountered during a search simply by
looking up the corresponding subproblem configuration in the database. The database
itself is constructed by searching backward from the goal and recording the cost of each new
pattern encountered; the expense of this search is amortized over many subsequent
problem instances. The choice of 1-2-3-4 is fairly arbitrary; we could also construct
databases for 5-6-7-8, for 2-4-6-8, and so on. Each database yields an admissible heuristic,
and these heuristics can be combined, as explained earlier, by taking the maximum value.
A combined heuristic of this kind is much more accurate than the Manhattan distance; the
number of nodes generated when solving random 15-puzzles can be reduced by a factor
of 1000. One might wonder whether the heuristics obtained from the 1-2-3-4 database
and the 5-6-7-8 could be added, since the two subproblems seem not to overlap. Would
this still give an admissible heuristic? The answer is no, because the solutions of the 1-2-
3-4 subproblem and the 5-6-7-8 subproblem for a given state will almost certainly share
some moves—it is unlikely that 1-2-3-4 can be moved into place without touching 5-6-7-
8, and vice versa. But what if we don’t count those moves? That is, we record not the total
cost of solving the 1-2- 3-4 subproblem, but just the number of moves involving 1-2-3-4.
Then it is easy to see that the sum of the two costs is still a lower bound on the cost of
solving the entire problem. This is the idea behind disjoint pattern databases. With such
databases, it is possible to solve random 15-puzzles in a few milliseconds—the number of
nodes generated is reduced by a factor of 10,000 compared with the use of Manhattan
distance. For 24-puzzles, a speedup of roughly a factor of a million can be obtained. Disjoint
pattern databases work for sliding-tile puzzles because the problem can be divided up in
such a way that each move affects only one subproblem—because only one tile is moved
at a time. For a problem such as Rubik’s Cube, this kind of subdivision is difficult because
each move affects 8 or 9 of the 26 cubies. More general ways of defining additive,
admissible heuristics have been proposed that do apply to Rubik’s cube (Yang et al., 2008),
but they have not yielded a heuristic better than the best nonadditive heuristic for the
problem.
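The database construction described above can be sketched as follows, under some assumptions: moves(s) is a hypothetical function yielding the states reachable from s in one move (sliding-tile moves are reversible, so these are also predecessors), and the reachable state space is small enough to enumerate, which holds for the 8-puzzle but not for larger problems.

from collections import deque

def build_pattern_db(goal, moves, pattern_tiles):
    # A pattern is the positions of the pattern tiles plus the blank; the
    # positions of all other tiles are irrelevant for the subproblem.
    def key(state):
        return tuple(state.index(t) for t in pattern_tiles) + (state.index(0),)
    db = {key(goal): 0}
    seen = {goal}
    frontier = deque([(goal, 0)])
    while frontier:                               # backward breadth-first search
        s, d = frontier.popleft()
        for s2 in moves(s):
            if s2 not in seen:
                seen.add(s2)
                db.setdefault(key(s2), d + 1)     # BFS reaches each pattern cheapest-first
                frontier.append((s2, d + 1))
    return db

def h_db(state, db, pattern_tiles):
    # Admissible heuristic: the exact cost of solving just the subproblem.
    k = tuple(state.index(t) for t in pattern_tiles) + (state.index(0),)
    return db.get(k, 0)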
3.6.4 Learning heuristics from experience
A heuristic function h(n) is supposed to estimate the cost of a solution beginning from the
state at node n. How could an agent construct such a function? One solution was given in
the preceding sections—namely, to devise relaxed problems for which an optimal solution
can be found easily. Another solution is to learn from experience. “Experience” here means
solving lots of 8-puzzles, for instance. Each optimal solution to an 8-puzzle problem
provides examples from which h(n) can be learned. Each example consists of a state from
the solution path and the actual cost of the solution from that point. From these examples,
a learning algorithm can be used to construct a function h(n) that can (with luck) predict
solution costs for other states that arise during search. Techniques for doing just this using
neural nets, decision trees, and other methods are demonstrated in Chapter 18. (The
reinforcement learning methods described in Chapter 21 are also applicable.) Inductive
learning methods work best when supplied with features of a state that are relevant to
predicting the state’s value, rather than with just the raw state description. For example,
the feature “number of misplaced tiles” might be helpful in predicting the actual distance
of a state from the goal. Let’s call this feature x1(n). We could take 100 randomly generated
8-puzzle configurations and gather statistics on their actual solution costs. We might find
that when x1(n) is 5, the average solution cost is around 14, and so on. Given these data,
the value of x1 can be used to predict h(n). Of course, we can use several features. A second
feature x2(n) might be “number of pairs of adjacent tiles that are not adjacent in the goal
state.” How should x1(n) and x2(n) be combined to predict h(n)? A common approach is
to use a linear combination:
h(n) = c1x1(n) + c2x2(n) .
The constants c1 and c2 are adjusted to give the best fit to the actual data on solution costs.
One expects both c1 and c2 to be positive because misplaced tiles and incorrect adjacent
pairs make the problem harder to solve. Notice that this heuristic does satisfy the condition
that h(n)=0 for goal states, but it is not necessarily admissible or consistent.
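A minimal sketch of this fit, assuming numpy is available; the feature values and solution costs below are hypothetical training data invented for illustration.

import numpy as np

# Each row is (x1, x2) for a state on some optimal solution path, paired
# with the true remaining solution cost from that state (made-up numbers).
X = np.array([[5, 3], [2, 1], [7, 4], [0, 0], [4, 2]], dtype=float)
y = np.array([14.0, 6.0, 19.0, 0.0, 11.0])

# Least-squares fit of h(n) = c1*x1(n) + c2*x2(n); with no intercept term,
# h = 0 whenever both features are 0, e.g., at a goal state.
(c1, c2), *_ = np.linalg.lstsq(X, y, rcond=None)
predict_h = lambda x1, x2: c1 * x1 + c2 * x2
print(predict_h(5, 3))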
INTRODUCTION TO MACHINE LEARNING
1.1 NEED FOR MACHINE LEARNING
Business organizations use huge amounts of data for their daily activities. They have now
started to use the latest technology, machine learning, to manage the data.
Machine learning has become so popular because of three reasons:
Figure 1.2: (a) A Learning System for Humans (b) A Learning System for Machine Learning
Often, the quality of data determines the quality of experience and, therefore, the quality of the
learning system. In statistical learning, the relationship between the input x and output y is
modeled as a function in the form y = f(x). Here, f is the learning function that maps the input x to
output y. Learning of the function f is the crucial aspect of forming a model in statistical learning. In
machine learning, this is simply called mapping of input to output.
The learning program summarizes the raw data in a model. Formally stated, a model is an explicit
description of patterns within the data in the form of:
1. Mathematical equations
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters
In summary, a model can be a formula, procedure, or representation that can generate decisions
from data. The difference between a pattern and a model is that the former is local and applicable only
to certain attributes, while the latter is global and fits the entire dataset. For example, a model can be
helpful to examine whether a given email is spam or not. The point is that the model is generated
automatically from the given data.
Another pioneer of AI, Tom Mitchell, defines machine learning as follows: “A computer
program is said to learn from experience E, with respect to task T and some performance measure
P, if its performance on T, measured by P, improves with experience E.” The important components of
this definition are experience E, task T, and performance measure P.
For example, the task T could be detecting an object in an image. The machine can gain the
knowledge of the object using a training dataset of thousands of images. This is called experience E. So,
the focus is to use this experience E for the task of object detection T. The ability of the system to detect
the object is measured by performance measures like precision and recall. Based on the performance
measures, course correction can be done to improve the performance of the system.
Models of computer systems are equivalent to human experience. Experience is based on data.
Humans gain experience by various means. They gain knowledge by rote learning. They observe
others and imitate them. Humans gain a lot of knowledge from teachers and books. We learn many
things by trial and error. Once knowledge is gained, when a new problem is encountered, humans
search for similar past situations, formulate heuristics, and use them for prediction. But, in systems,
experience is gathered by these steps:
1. Collection of data
2. Once data is gathered, abstract concepts are formed out of that data. Abstraction is used to
generate concepts. This is equivalent to humans' idea of objects; for example, we have some
idea about how an elephant looks.
3. Generalization converts the abstraction into an actionable form of intelligence. It can
be viewed as an ordering of all possible concepts. So, generalization involves ranking of concepts,
inferencing from them, and formation of heuristics, an actionable aspect of intelligence.
Heuristics are educated guesses for all tasks. For example, if one runs on encountering a danger,
it is the result of human experience, or heuristics formation. In machines, it happens
the same way.
4. Heuristics normally work! But, occasionally, they may fail too. It is not the fault of the
heuristics, as they are just ‘rules of thumb’. The course correction is done by taking
evaluation measures. Evaluation checks the thoroughness of the models and does course
correction, if necessary, to generate better formulations.
Deep learning is a subbranch of machine learning. In deep learning, the models are constructed
using neural network technology. Neural networks are based on models of the human neuron. Many
neurons form a network connected with activation functions that trigger further neurons to perform
tasks.
1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
Data science is an ‘Umbrella’ term that encompasses many fields. Machine learning starts with data.
Therefore, data science and machine learning are interlinked. Machine learning is a branch of data
science. Data science deals with gathering of data for analysis. It is a broad field that includes:
Big Data Data science concerns the collection of data. Big data is a field of data science that deals
with data that has the following characteristics:
1. Volume: Huge amounts of data are generated by big companies like Facebook, Twitter, and
YouTube.
2. Variety: Data is available in a variety of forms like images and videos, and in different formats.
3. Velocity: It refers to the speed at which the data is generated and processed.
Big data is used by many machine learning algorithms for applications such as language translation
and image recognition. Big data influences the growth of subjects like deep learning. Deep learning is
a branch of machine learning that deals with constructing models using neural networks.
Data Mining Data mining's original genesis is in business. Just as mining the earth yields
precious minerals, it is believed that unearthing the data produces hidden information
that otherwise would have eluded the attention of the management. Nowadays, many consider
data mining and machine learning to be the same. There is no difference between these fields except
that data mining aims to extract the hidden patterns that are present in the data, whereas machine
learning aims to use those patterns for prediction.
Data Analytics Another branch of data science is data analytics. It aims to extract useful knowledge
from crude data. There are different types of analytics. Predictive data analytics is used for making
predictions. Machine learning is closely related to this branch of analytics and shares almost all
algorithms.
Pattern Recognition It is an engineering field. It uses machine learning algorithms to extract the
features for pattern analysis and pattern classification. One can view pattern recognition as a specific
application of machine learning.
These relations are summarized in Figure 1.4.
Both statistics and machine learning are methods that look for regularity in data, called patterns.
Initially, statistics sets a hypothesis and performs experiments to verify and validate the hypothesis
in order to find relationships among data.
Statistics requires knowledge of statistical procedures and the guidance of a good statistician.
It is mathematics intensive, and models are often complicated equations involving many
assumptions. Statistical methods are developed in relation to the data being analyzed. In addition,
statistical methods are coherent and rigorous: statistics has strong theoretical foundations and
interpretations that require strong statistical knowledge.
Machine learning, comparatively, makes fewer assumptions and requires less statistical knowledge.
But it often requires interaction with various tools to automate the process of learning.
Nevertheless, there is a school of thought that machine learning is just the latest version of ‘old
statistics’, and hence this relationship should be recognized.
There are four types of machine learning, as shown in Figure 1.5.
Labelled and Unlabeled Data Data is a raw fact. Normally, data is represented in the form of a
table. Data can also be referred to as a data point, sample, or example. Each row of the table
represents a data point. Features are attributes or characteristics of an object. Normally, the columns
of the table are attributes. Out of all attributes, one attribute is important and is called a label. The
label is the feature that we aim to predict. Thus, there are two types of data: labelled and unlabeled.
Labelled Data To illustrate labelled data, let us take one example dataset called the Iris flower
dataset, or Fisher's Iris dataset. The dataset has 150 samples (50 per class) with four attributes: the
length and width of sepals and petals. The target variable is called class. There are three classes:
Iris setosa, Iris virginica, and Iris versicolor.
The partial data of Iris dataset is shown in Table 1.1.
Table 1.1: Iris Flower Dataset
Classification
Classification is a supervised learning method. The input attributes of the classification algorithms are
called independent variables. The target attribute is called label or dependent variable. The
relationship between the input and target variable is represented in the form of a structure which is
called a classification model. So, the focus of classification is to predict the ‘label’ that is in a discrete
form (a value from the set of finite values). An example is shown in Figure 1.7 where a classification
algorithm takes a set of labelled data images such as dogs and cats to construct a model that can later
be used to classify an unknown test image data.
In classification, learning takes place in two stages. During the first stage, called the training stage,
the learning algorithm takes a labelled dataset and starts learning. After the training samples are
processed, the model is generated. In the second stage, the constructed model is tested with a test or
unknown sample and assigns a label. This is the classification process.
This is illustrated in Figure 1.7. Initially, the classification learning algorithm learns with
the collection of labelled data and constructs the model. Then, a test case is selected, and the model
assigns a label.
Similarly, in the case of the Iris dataset, if the test sample is given as (6.3, 2.9, 5.6, 1.8, ?), the
classification model will generate the label for it. This is called classification. One example of
classification is image recognition, which includes classification of diseases like cancer, classification
of plants, etc.
The classification models can be categorized based on the implementation technology, like decision
trees, probabilistic methods, distance measures, and soft computing methods. Classification models
can also be classified as generative models and discriminative models. Generative models deal with
the process of data generation and its distribution. Probabilistic models are examples of
generative models. Discriminative models do not care about the generation of data. Instead, they
simply concentrate on classifying the given data.
Some of the key algorithms of classification are:
• Decision Tree
• Random Forest
• Support Vector Machines
• Naïve Bayes
• Artificial Neural Network and Deep Learning networks like CNN
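As an illustration of the two-stage training/testing process, here is a small sketch using the decision tree classifier from scikit-learn (assuming that library is available) on the Iris dataset discussed earlier; it also assigns a label to the test sample (6.3, 2.9, 5.6, 1.8) from the example above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # 150 samples, 4 attributes, 3 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_tr, y_tr)     # training stage
print("test accuracy:", model.score(X_te, y_te))     # testing stage
print("label for (6.3, 2.9, 5.6, 1.8):",
      model.predict([[6.3, 2.9, 5.6, 1.8]])[0])      # classify a new sample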
Regression Models
Regression models, unlike classification algorithms, predict continuous variables like price; in
other words, the output is a number. A fitted regression model is shown in Figure 1.8 for a dataset
that represents weeks of input x and product sales y.
Figure 1.8: A regression line, y = 0.66x + 0.54, fitted to product sales data (y) against week data (x).
For example, the prediction for the unknown eighth week can be made by substituting x = 8 in the
regression formula to get y.
One of the most important regression algorithms is linear regression that is explained in the next
section.
Both regression and classification models are supervised algorithms. Both have a supervisor, and
the concepts of training and testing are applicable to both. What, then, is the difference between
classification and regression models? The main difference is that regression models predict continuous
variables such as product price, while classification concentrates on assigning discrete labels such as a
class.
1.4.2 Unsupervised Learning
The second kind of learning is by self-instruction. As the name suggests, there are no supervisor or
teacher components. In the absence of a supervisor or teacher, self-instruction is the most common
kind of learning process. This process of self-instruction is based on the concept of trial and error.
Here, the program is supplied with objects, but no labels are defined. The algorithm itself observes
the examples and recognizes patterns based on the principles of grouping. Grouping is done so
that similar objects fall into the same group.
Cluster analysis and Dimensional reduction algorithms are examples of unsupervised algorithms.
Cluster Analysis
Cluster analysis is an example of unsupervised learning. It aims to group objects into disjoint clusters
or groups. Cluster analysis clusters objects based on their attributes. All the data objects of a
partition are similar in some aspect and vary significantly from the data objects in the other partitions.
Some examples of clustering processes are segmentation of a region of interest in an
image, detection of abnormal growth in a medical image, and determining clusters of signatures in a
gene database.
An example of a clustering scheme is shown in Figure 1.9, where the clustering algorithm takes a set
of dog and cat images and groups them into two clusters: dogs and cats. It can be observed that the
samples within a cluster are similar, while samples differ radically across clusters.
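A minimal clustering sketch using k-means from scikit-learn (an assumption; the text does not prescribe an algorithm). The 2-D points below are hypothetical feature vectors, standing in for features extracted from dog/cat images; note that no labels are supplied.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)    # e.g., [0 0 0 1 1 1]: two disjoint clusters by similarity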
The differences between supervised and unsupervised learning are listed in the following
Table 1.2.
Table 1.2: Differences between Supervised and Unsupervised Learning
(Figure: a grid game with a 'Goal' tile and a 'Danger' tile.)
In this grid game, the gray tile indicates the danger, black is a block, and the tile with diagonal lines
is the goal. The aim is to start, say from the bottom-left grid cell, and use the actions left, right, up,
and down to reach the goal state.
To solve this sort of problem, there is no data. The agent interacts with the environment to get
experience. In the above case, the agent tries to create a model by simulating many paths and finding
rewarding paths. This experience helps in constructing a model.
In summary, compared to supervised learning, there is no supervisor or labelled
dataset. Many sequential decisions need to be taken to reach the final decision. Therefore,
reinforcement algorithms are reward-based, goal-oriented algorithms.
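A minimal sketch of this trial-and-error process using Q-learning (one common reinforcement learning algorithm; the text does not name one). The grid layout, rewards, and learning parameters below are all hypothetical.

import random

# Toy 3x3 grid: states 0..8, start 0 (bottom-left), goal 8 (top-right),
# danger 4 (center). Rewards: +10 goal, -10 danger, -1 per step.
MOVES = {"left": -1, "right": 1, "up": 3, "down": -3}

def step(s, a):
    d = MOVES[a]
    s2 = s + d
    if s2 < 0 or s2 > 8:                     # off the top or bottom edge
        s2 = s
    if d in (-1, 1) and s // 3 != s2 // 3:   # wrapped around a side edge
        s2 = s
    if s2 == 8:  return s2, 10, True
    if s2 == 4:  return s2, -10, True
    return s2, -1, False

Q = {(s, a): 0.0 for s in range(9) for a in MOVES}
alpha, gamma, eps = 0.5, 0.9, 0.2
for _ in range(2000):                        # episodes of trial and error
    s, done = 0, False
    while not done:
        a = (random.choice(list(MOVES)) if random.random() < eps
             else max(MOVES, key=lambda a: Q[(s, a)]))   # epsilon-greedy action
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in MOVES)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
print(max(MOVES, key=lambda a: Q[(0, a)]))   # learned first move from the start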
Consider the following sample data, where y is to be predicted from x1 and x2:
x1  x2  y
1   1   1
2   1   2
3   1   3
4   1   4
5   1   5
Can a model for this data be multiplication, that is, y = x1 × x2? Well! It is true! But it is
equally true that y may be y = x1 ÷ x2, or y = x1^x2. So, there are three functions that fit the data.
This means that the problem is ill-posed. To solve this problem, one needs more examples to
check the model. Puzzles and games that do not have sufficient specification may become ill-
posed problems, and scientific computation has many ill-posed problems.
2. Huge data – This is a primary requirement of machine learning. Availability of quality data is
a challenge. Quality data means the data should be large and should not have problems such
as missing or incorrect values.
3. High computation power – With the availability of Big Data, the computational resource
requirement has also increased. Systems with a Graphics Processing Unit (GPU) or even a Tensor
Processing Unit (TPU) are required to execute machine learning algorithms. Also, machine
learning tasks have become complex, and hence time complexity has increased; that can
be addressed only with high computing power.
4. Complexity of the algorithms – The selection of algorithms, describing the algorithms,
application of algorithms to solve a machine learning task, and comparison of algorithms have
become necessary for machine learning and data scientists now. Algorithms have become a big
topic of discussion, and it is a challenge for machine learning professionals to design, select, and
evaluate optimal algorithms.
5. Bias/Variance – Bias is the error due to a lack of generalization, and variance is the error of
the model on the training data. Together they lead to a problem called the bias/variance tradeoff.
A model that fits the training data correctly but fails for test data, and in general lacks
generalization, is said to be overfitting. The reverse problem, where the model fails even on the
training data, is called underfitting. Overfitting and underfitting are great challenges for machine
learning algorithms.
(Figure: stages of the process: understanding the business, data preprocessing, modelling, model evaluation, and model deployment.)
1. Understanding the business – This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm is enough
for giving the solution. This step also involves the formulation of the problem statement for
the data mining process.
2. Understanding the data – It involves steps like data collection, study of the characteristics
of the data, formulation of hypotheses, and matching of patterns to the selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by cleaning the raw data
and preparation of data for the data mining process. The missing values may cause problems
during both training and testing phases. Missing data forces classifiers to produce inaccurate
results. This is a perennial problem for the classification models. Hence, suitable strategies
should be adopted to handle the missing data.
4. Modelling – This step plays a role in the application of a data mining algorithm to the data
to obtain a model or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical analysis
and visualization methods. The performance of the classifier is determined by evaluating the
accuracy of the classifier. The process of classification is a fuzzy issue. For example,
classification of emails requires extensive domain knowledge and requires domain experts.
Hence, performance of the classifier is very crucial.
6. Deployment – This step involves the deployment of results of the data mining algorithm to
improve the existing process or for a new situation.
7. Games – Game programs for Chess, GO, and Atari video games
8. Natural Language Translation – Google Translate, text summarization, and sentiment analysis
9. Web Analysis and Services – Identification of access patterns, detection of e-mail spam and
viruses, personalized web services, search engines like Google, detection of promotion of user
websites, and finding loyalty of users after web page layout modification
12. Scientific Domain – Discovery of new galaxies, identification of groups of houses based on
house type/geographical location, identification of earthquake epicenters, and identification of
similar land use
Key Terms:
• Machine Learning – A branch of AI concerned with making machines learn automatically without
being explicitly programmed.
• Data – A raw fact.
• Model – An explicit description of patterns in data.
• Experience – A collection of knowledge and heuristics in humans, and historical training data in the
case of machines.
• Predictive Modelling – A technique of developing models and making predictions on unseen data.
• Deep Learning – A branch of machine learning that deals with constructing models using neural
networks.
• Data Science – A field of study that encompasses everything from the capture of data to its analysis,
covering all stages of data management.
• Data Analytics – A field of study that deals with the analysis of data.
• Big Data – A study of data that has the characteristics of volume, variety, and velocity.
• Statistics – A branch of mathematics that deals with learning from data using statistical methods.
• Hypothesis – An initial assumption of an experiment.
• Learning – Adapting to the environment; this happens because of the interaction of an agent with the
environment.
• Label – A target attribute.
• Labelled Data – Data that is associated with a label.
• Unlabelled Data – Data without labels.
• Supervised Learning – A type of machine learning that uses labelled data and learns with the help of a
supervisor or teacher component.
• Classification Program – A supervised learning method that takes an unknown input and assigns a
label to it. In simple words, it finds the category or class of the input attributes.
• Regression Analysis – A supervised method that predicts continuous variables based on the input
variables.
• Unsupervised Learning – A type of machine learning that uses unlabelled data and groups the
attributes into clusters using a trial and error approach.
• Cluster Analysis – A type of unsupervised approach that groups objects based on attributes, so
that similar objects or data points form a cluster.
• Semi-supervised Learning – A type of machine learning that uses limited labelled and large unlabelled
data. It first labels the unlabelled data using the labelled data and combines them for learning purposes.
• Reinforcement Learning – A type of machine learning that uses agent and environment interaction for
creating labelled data for learning.
• Well-posed Problem – A problem that has well-defined specifications. Otherwise, the problem is called
ill-posed.
• Bias/Variance – The inability of a machine learning algorithm to predict correctly due to lack of
generalization is called bias. Variance is the error of the model for training data. These lead to the
problems called overfitting and underfitting.
• Model Deployment – A method of deploying machine learning algorithms to improve existing
business processes or for a new situation.
Some of the other forms of Vs that are often quoted in the literature as characteristics of
big data are:
4. Veracity of data – Veracity of data deals with aspects like conformity to the facts, truthfulness,
believability, and confidence in data. There may be many sources of error, such as technical
errors, typographical errors, and human errors. So, veracity is one of the most important
aspects of data.
5. Validity – Validity is the accuracy of the data for taking decisions or for any other goals that
are needed by the given problem.
6. Value – Value is the characteristic of big data that indicates the value of the information that
is extracted from the data and its influence on the decisions that are taken based on it.
Thus, these 6 Vs are helpful in characterizing big data. The data quality of numeric
attributes is determined by factors like precision, bias, and accuracy.
Structured Data
In structured data, data is stored in an organized manner such as a database where it is available in
the form of a table. The data can also be retrieved in an organized manner using tools like SQL. The
structured data frequently encountered in machine learning are listed below:
Record Data A dataset is a collection of measurements taken from a process. We have a collection
of objects in a dataset and each object has a set of measurements. The measurements can be
arranged in the form of a matrix. Rows in the matrix represent an object and can be called entities,
cases, or records. The columns of the dataset are called attributes, features, or fields. The table is
filled with observed data. Also, it is better to note the general jargon associated with the dataset.
Label is the term used to describe an individual observation.
Data Matrix It is a variation of the record type because it consists of numeric attributes. The
standard matrix operations can be applied on these data. The data is thought of as points or vectors
in the multidimensional space where every attribute is a dimension describing the object.
Graph Data It involves the relationships among objects. For example, a web page can refer to
another web page. This can be modeled as a graph. The nodes are web pages and a hyperlink is
an edge that connects the nodes.
Ordered Data Ordered data objects involve attributes that have an implicit order among them.
The examples of ordered data are:
Temporal data – It is the data whose attributes are associated with time. For example, the
customer purchasing patterns during festival time is sequential data. Time series data is a
special type of sequence data where the data is a series of measurements over time.
Sequence data – It is like sequential data but does not have time stamps. This data involves the
sequence of words or letters. For example, DNA data is a sequence of four characters – A T G C.
Spatial data – It has attributes such as positions or areas. For example, maps are spatial data
where the points are related by location.
Unstructured Data
Unstructured data includes video, image, and audio. It also includes textual documents, programs,
and blog data. It is estimated that 80% of the data are unstructured data.
Semi-Structured Data
Semi-structured data are partially structured and partially unstructured. These include data like
XML/JSON data, RSS feeds, and hierarchical data.
Once the dataset is assembled, it must be stored in a structure that is suitable for data analysis. The
goal of data storage management is to make data available for analysis. There are different
approaches to organize and manage data in storage files and systems from flat file to data
warehouses. Some of them are listed below:
Flat Files These are the simplest and most commonly available data sources. They are also the
cheapest way of organizing data. Flat files are files where data is stored in plain ASCII or EBCDIC
format. Minor changes of data in flat files affect the results of the data mining algorithms.
Hence, flat files are suitable only for storing small datasets and are not desirable if the dataset
becomes larger.
Some of the popular spreadsheet formats are listed below:
• CSV files – CSV stands for comma-separated value files where the values are separated by
commas. These are used by spreadsheet and database applications. The first row may have
attributes and the rest of the rows represent the data.
• TSV files – TSV stands for Tab separated values files where values are separated by Tab. Both
CSV and TSV files are generic in nature and can be shared. There are many tools like Google Sheets
and Microsoft Excel to process these files.
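A short sketch of loading such files with pandas (assuming that library is available; the file names are hypothetical):

import pandas as pd

df_csv = pd.read_csv("marks.csv")              # comma-separated values
df_tsv = pd.read_csv("marks.tsv", sep="\t")    # tab-separated values
print(df_csv.head())                           # first row supplied the attribute names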
Database System It normally consists of database files and a database management system
(DBMS). Database files contain original data and metadata. DBMS aims to manage data and improve
operator performance by including various tools like database administrator, query processing, and
transaction manager. A relational database consists of sets of tables. The tables have rows and
columns. The columns represent the attributes and rows represent tuples. A tuple corresponds to
either an object or a relationship between objects. A user can access and manipulate the data in the
database using SQL.
Data analysis is an activity that takes the data and generates useful information
and insights for assisting the organizations.
Data analysis and data analytics are terms that are used interchangeably to refer to the same
concept. However, there is a subtle difference. Data analytics is a general term and data analysis is a
part of it. Data analytics refers to the process of data collection, preprocessing and analysis. It deals
with the complete cycle of data management. Data analysis is just analysis and is a part of data
analytics. It takes historical data and does the analysis.
Data analytics, by contrast, concentrates more on the future and helps in prediction.
There are four types of data analytics:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
Descriptive Analytics It is about describing the main features of the data. After data collection is
done, descriptive analytics deals with the collected data and quantifies it. It is often stated that
analytics is essentially statistics. There are two aspects of statistics – Descriptive and Inference.
Descriptive analytics only focuses on the description part of the data and not the inference part.
Diagnostic Analytics It deals with the question – ‘Why?’. This is also known as causal analysis, as
it aims to find out the cause and effect of the events. For example, if a product is not selling,
diagnostic analytics aims to find out the reason. There may be multiple reasons and associated
effects are analyzed as part of it.
Predictive Analytics It deals with the future. It deals with the question – ‘What will happen in
future given this data?’. This involves the application of algorithms to identify the patterns to predict
the future. The entire course of machine learning is mostly about predictive analytics and forms the
core of this book.
Prescriptive Analytics It is about finding the best course of action for business
organizations. Prescriptive analytics goes beyond prediction and helps in decision making by giving
a set of actions. It helps organizations to plan better for the future and to mitigate the risks that
are involved.
Data Management Layer It performs preprocessing of data. The purpose of this layer is to
allow parallel execution of queries, and read, write and data management tasks. There may be
many schemes that can be implemented by this layer such as data-in-place, where the data is
not moved at all, or constructing data repositories such as data warehouses and pull data
on-demand mechanisms.
Data Analytic Layer It has many functionalities such as statistical tests, machine learning
algorithms to understand, and construction of machine learning models. This layer implements
many model validation mechanisms too. The processing is done as shown in Box 2.1.
Presentation Layer It has mechanisms such as dashboards, and applications that display the
results of the analysis.
Broadly, the data source can be classified as open/public data, social media data and multimodal
data.
1. Open or public data source – It is a data source that does not have any stringent copyright
rules or restrictions. Its data can be used freely for many purposes. Government census
data are good examples of open data:
• Digital libraries that have huge amount of text data as well as document images
• Scientific domains with a huge collection of experimental data like genomic data
and biological data
• Healthcare systems that use extensive databases like patient databases, health insurance
data, doctors’ information, and bioinformatics information
2. Social media – It is the data that is generated by various social media platforms like
Twitter,
Facebook, YouTube, and Instagram. An enormous amount of data is generated by these
platforms.
3. Multimodal data – It includes data that involves many modes such as text, video, audio
and mixed types. Some of them are listed below:
• Image archives contain larger image databases along with numeric and text data
• The World Wide Web (WWW) has huge amount of data that is distributed on the Internet.
These data are heterogeneous in nature.
measurement and structural errors like improper data formats. Data errors can also arise from
omission and duplication of attributes. Noise is a random component and involves distortion of
a value or the introduction of spurious objects. The term noise is often used when the data has a
spatial or temporal component. Certain deterministic distortions in the form of a streak are known
as artifacts.
Consider, for example, the following patient Table 2.1. The ‘bad’ or ‘dirty’ data can be observed
in this table.
It can be observed that data like Salary = ’ ’ is incomplete data. The DoB of patients, John, Andre, and
Raju, is the missing data. The age of David is recorded as ‘5’ but his DoB indicates it is 10/10/1980.
This is called inconsistent data.
Inconsistent data occurs due to problems in conversions, inconsistent formats, and difference in
units. Salary for John is -1500. It cannot be less than ‘0’. It is an instance of noisy data. Outliers are
data that exhibit the characteristics that are different from other data and have very unusual values.
The age of Raju cannot be 136. It might be a typographical error. It is often required to distinguish
between noise and outlier data.
Outliers may be legitimate data and sometimes are of interest to the data mining algorithms. These
errors often come during data collection stage. These must be removed so that machine learning
algorithms yield better results as the quality of results is determined by the quality of input data.
This removal process is called data cleaning.
Example 2.1: Consider the following set: S = {12, 14, 19, 22, 24, 26, 28, 31, 34}. Apply various
binning techniques and show the result.
Solution: By the equal-frequency binning method, the data should be distributed evenly across the
bins. Assuming bins of size 3, the above data is distributed across the bins as shown below:
Bin 1 : 12, 14, 19
Bin 2 : 22, 24, 26
Bin 3 : 28, 31, 34
By the smoothing-by-bin-means method, the values in each bin are replaced by the bin mean. This
method results in:
Bin 1 : 15, 15, 15
Bin 2 : 24, 24, 24
Bin 3 : 31, 31, 31
Using the smoothing-by-bin-boundaries method, the bins' values become:
Bin 1 : 12, 12, 19
Bin 2 : 22, 22, 26
Bin 3 : 28, 28, 34
As per the method, the minimum and maximum values of each bin are determined; they serve
as the bin boundaries and do not change. The rest of the values are transformed to the nearest
boundary. It can be observed that in Bin 1, the middle value 14 is compared with the boundary
values 12 and 19 and changed to the closest value, that is, 12. (Values equidistant from both
boundaries, such as 24 in Bin 2 and 31 in Bin 3, are mapped here to the lower boundary.) This
process is repeated for all bins.
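A small sketch of the three binning methods from Example 2.1 in Python (the function names are illustrative; ties in boundary smoothing go to the lower boundary, as in the worked example):

def equal_frequency_bins(values, size):
    vals = sorted(values)
    return [vals[i:i + size] for i in range(0, len(vals), size)]

def smooth_by_means(bins):
    # Replace every value in a bin by the bin mean.
    return [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by the nearest bin boundary (ties go to the lower one).
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

S = [12, 14, 19, 22, 24, 26, 28, 31, 34]
bins = equal_frequency_bins(S, 3)      # [[12,14,19],[22,24,26],[28,31,34]]
print(smooth_by_means(bins))           # [[15,15,15],[24,24,24],[31,31,31]]
print(smooth_by_boundaries(bins))      # [[12,12,19],[22,22,26],[28,28,34]]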
The min-max normalization of a value v is given by:
v' = ((v − min) / (max − min)) × (new max − new min) + new min
Here, max − min is the range; min and max are the minimum and maximum of the given data, and
new min and new max are the minimum and maximum of the target range, say 0 and 1.
Example 2.2: Consider the set: V = {88, 90, 92, 94}. Apply the Min-Max procedure and map the marks
to the range (0, 1).
So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new range {0, 0.33, 0.67,
1}. Thus, the Min-Max normalization range is between 0 and 1.
z-Score Normalization This procedure works by taking the difference between the field value
and the mean value, and scaling this difference by the standard deviation of the attribute:
z = (v − m) / s (2.2)
Here, s is the standard deviation of the list V and m is the mean of the list V.
Example 2.3: Consider the mark list V = {10, 20, 30}, convert the marks to z-score.
Solution: The mean and sample standard deviation (s) of the list V are 20 and 10, respectively.
So the z-scores of these marks, calculated using Eq. (2.2), are:
z = (10 − 20)/10 = −1, z = (20 − 20)/10 = 0, z = (30 − 20)/10 = 1
Hence, the z-scores of the marks 10, 20, 30 are −1, 0, and 1, respectively.
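Both normalizations can be sketched in a few lines of Python; the functions reproduce Examples 2.2 and 2.3 (z_score uses the sample standard deviation, as in Example 2.3):

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    n = len(values)
    m = sum(values) / n
    s = (sum((v - m) ** 2 for v in values) / (n - 1)) ** 0.5   # sample std dev
    return [(v - m) / s for v in values]

print([round(v, 2) for v in min_max([88, 90, 92, 94])])  # [0.0, 0.33, 0.67, 1.0]
print(z_score([10, 20, 30]))                              # [-1.0, 0.0, 1.0]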
Data Reduction
Data reduction reduces data size but produces the same results. There are different ways in which
data reduction can be carried out such as data aggregation, feature selection, and dimensionality
reduction.
For example, consider the following database shown in sample Table 2.2.
Every attribute should be associated with a value. This process is called measurement.
The type of attribute determines the data types, often referred to as measurement scale types.
The data types are shown in Figure 2.1.
Categorical or Qualitative Data The categorical data can be divided into two types. They are
nominal type and ordinal type.
•Nominal Data – In Table 2.2, patient ID is nominal data. Nominal data are symbols and
cannot be processed like a number. For example, the average of a patient ID does not make
any statistical sense. Nominal data type provides only information but has no ordering
among data. Only operations like (=, ≠) are meaningful for these data. For example, the
patient ID can be checked for equality and nothing else.
•Ordinal Data – It provides enough information and has a natural order. For example, Fever
= {Low, Medium, High} is ordinal data. Certainly, low is less than medium and medium
is less than high, irrespective of the underlying values. Any order-preserving transformation
can be applied to these data to get a new value.
Numeric or Quantitative Data It can be divided into two categories. They are interval type and
ratio type.
•Interval Data – Interval data is a numeric data for which the differences between values
are meaningful. For example, there is a difference between 30 degree and 40 degree. Only
the permissible operations are + and -.
•Ratio Data – For ratio data, both differences and ratios are meaningful. The difference
between ratio and interval data is the position of zero in the scale. For example,
take the Centigrade-Fahrenheit conversion: the zeroes of the two scales do not match.
Hence, temperatures on these scales are interval data, not ratio data.
Based on the number of variables involved, data can be classified as univariate data, bivariate
data, and multivariate data. This is shown in Figure 2.2.
Bar Chart A Bar chart (or Bar graph) is used to display the frequency distribution for variables.
Bar charts are used to illustrate discrete data. The charts can also help to explain the counts of
nominal data. It also helps in comparing the frequency of different groups.
The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown
below in Figure 2.3.
Pie Chart These are equally helpful in illustrating the univariate data. The percentage frequency
distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is below in Figure 2.4.
It can be observed that the number of students with 22 marks are 2. The total number of
students are 10. So, 2/10 × 100 = 20% space in a pie of 100% is allotted for marks 22 in Figure 2.4.
Histogram It plays an important role in data mining for showing frequency distributions.
The histogram for students’ marks {45, 60, 60, 80, 85} in the group range of 0-25, 26-50, 51-75,
76-100 is given below in Figure 2.5. One can visually inspect from Figure 2.5 that the number of
students in the range 76-100 is 2.
Histogram conveys useful information like nature of data and its mode. Mode indicates the
peak of dataset. In other words, histograms can be used as charts to show frequency, skewness
present in the data, and shape.
Dot Plots These are similar to bar charts. They are less clustered as compared to bar charts,
as they illustrate the bars only with single points. The dot plot of English marks for five students
with ID as {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.6. The advantage
is that by visual inspection one can find out who got more marks.
The arithmetic mean is the sum of all observations divided by the number of observations.
For example, the mean of the three numbers 10, 20, and 30 is (10 + 20 + 30)/3 = 20.
•Weighted mean – Unlike the arithmetic mean, which weights all items equally, the
weighted mean gives different importance to different items, as item importance varies.
Hence, different weightages can be given to items. In the case of a frequency distribution,
the mid values of the ranges are taken for computation. In the weighted mean, the mean is
computed by adding the products of the proportions and the group means. It is mostly used
when the sample sizes are unequal.
•Geometric mean – Let x1, x2, …, xN be a set of ‘N’ values or observations. The geometric
mean is the Nth root of the product of the N items. The formula for computing the geometric
mean is given as follows:
GM = (x1 × x2 × … × xN)^(1/N)
Here, N is the number of items and xi are the values. For example, if the values are 6 and 8, the
geometric mean is GM = sqrt(6 × 8) = sqrt(48) ≈ 6.93. For larger cases, computing the geometric
mean directly is difficult. Hence, it is usually calculated through logarithms:
log GM = (1/N) × (log x1 + log x2 + … + log xN)
The problem with the mean is its extreme sensitivity to noise. Even small changes in the input
affect the mean drastically. Hence, for a larger dataset, often the top 2% of values is chopped off
and then the mean is calculated.
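A sketch of the three means in Python (geometric_mean uses the logarithm form discussed above to avoid overflow on long lists):

import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    # Sum of value-weight products divided by the total weight.
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

def geometric_mean(xs):
    # exp of the mean of the logs: the Nth root of the product.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(arithmetic_mean([10, 20, 30]))            # 20.0
print(weighted_mean([10, 20, 30], [1, 2, 3]))   # 23.33...
print(round(geometric_mean([6, 8]), 2))         # 6.93 (= sqrt(48))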
2. Median – The middle value in the distribution is called the median. If the total number of items
in the distribution is odd, the middle value is the median; if it is even, the median is the average
of the two middle values. In the continuous (grouped) case, the median is given by the formula:
Median = L1 + ((N/2 − cf) / f) × i
The median class is the class where the (N/2)th item is present. Here, i is the class interval of the
median class, L1 is the lower limit of the median class, f is the frequency of the median class, and
cf is the cumulative frequency of all classes preceding the median class.
3. Mode – Mode is the value that occurs most frequently in the dataset. In other words, the
value that has the highest frequency is called the mode.
2.5.3 Dispersion
The spread of a set of data around the central tendency (mean, median or mode) is called
dispersion. Dispersion is represented in various ways, such as range, variance, standard deviation,
and standard error. These are second-order measures. The most common measures of dispersion
are listed below:
Range Range is the difference between the maximum and minimum of values of the given list
of data.
Standard Deviation The mean does not convey much more than a middle point. For example,
the following datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20. The difference between
these two sets is the spread of data. Standard deviation is the average distance from the mean of
the dataset to each point.
The population variance is given by:
σ² = (1/N) × Σ (xi − m)² (2.8)
and the standard deviation is its square root. Here, N is the size of the population, xi is an
observation or value from the population, and m is the population mean. Often, N − 1 is used
instead of N in the denominator of Eq. (2.8) to obtain the sample variance.
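A small sketch computing range, population variance, and standard deviation; the two example datasets show how sets with the same mean can have very different spreads:

def dispersion(xs):
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n      # population variance, Eq. (2.8)
    return {"range": max(xs) - min(xs), "variance": var, "std": var ** 0.5}

print(dispersion([10, 20, 30]))   # mean 20, std ~8.16
print(dispersion([10, 50, 0]))    # same mean 20, but std ~21.6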
Quartiles and Inter Quartile Range It is sometimes convenient to subdivide the dataset using
coordinates. Percentiles are about data that are less than the given coordinate by some percentage
of the total. The kth percentile is the value Xi such that k% of the data lies at or below Xi. For
example, the median is the 50th percentile and can be denoted as Q0.50. The 25th percentile is called
the first quartile (Q1) and the 75th percentile is called the third quartile (Q3). Another measure that
is useful for measuring dispersion is the Inter Quartile Range (IQR), the difference between Q3 and Q1:
IQR = Q3 − Q1 (2.9)
Outliers are normally taken to be values falling at least 1.5 × IQR above the third
quartile or below the first quartile.
Equivalently, the interquartile range is defined as Q0.75 − Q0.25. (2.10)
Example 2.4: For the patients' age list {12, 14, 19, 22, 24, 26, 28, 31, 34}, find the IQR.
Solution: The median is in the fifth position; in this case, 24 is the median. The first quartile is the
median of the scores below the median, i.e., {12, 14, 19, 22}. Here, the median is the average of the
second and third values, that is, Q0.25 = 16.5.
Similarly, the third quartile is the median of the values above the median, that is, {26, 28, 31, 34}.
So, Q0.75 is the average of the seventh and eighth scores, i.e., (28 + 31)/2 = 59/2 = 29.5.
Hence, the IQR using Eq. (2.10) is:
IQR = Q0.75 − Q0.25 = 29.5 − 16.5 = 13
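A sketch reproducing Example 2.4; quartiles are computed as medians of the lower and upper halves with the overall median excluded, matching the method used above (quartile conventions vary between texts):

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def iqr(xs):
    xs = sorted(xs)
    half = len(xs) // 2
    q1 = median(xs[:half])    # lower half, overall median excluded
    q3 = median(xs[-half:])   # upper half
    return q1, q3, q3 - q1

print(iqr([12, 14, 19, 22, 24, 26, 28, 31, 34]))   # (16.5, 29.5, 13)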
Five-point Summary and Box Plots The median, quartiles Q1 and Q3, and minimum
and maximum written in the order < Minimum, Q1, Median, Q3, Maximum > is known as
five-point summary.
Example 2.5: Find the 5-point summary of the list {13, 11, 2, 3, 4, 8, 9}.
Solution: The minimum is 2 and the maximum is 13. The Q1, Q2 and Q3 are 3, 8 and 11, respectively.
Hence, 5-point summary is {2, 3, 8, 11, 13}, that is, {minimum, Q1, median, Q3, maximum}. Box plots
are useful for describing 5-point summary. The Box plot for the set is given in
Figure 2.7.
2.5.4 Shape
Skewness and Kurtosis (called moments) indicate the symmetry/asymmetry and peak location of
the dataset.
Skewness
The measures of the direction and degree of symmetry are called measures of third order. Ideally,
skewness should be zero, as in an ideal normal distribution. More often, the given dataset may not
be perfectly symmetrical.
42
Module 2 AI&ML, 21CS54, 5th Semester
Generally, for a negatively skewed distribution, the median is more than the mean. The relationship
between skew and the relative size of the mean and median can be summarized by a convenient
numerical skew index known as Pearson's second skewness coefficient:
Pearson's coefficient = 3 × (mean − median) / standard deviation
Also, the following measure is more commonly used to measure skewness. Let X1, X2, …, XN
be a set of 'N' values or observations; then the skewness can be given as:
skewness = Σᵢ (Xᵢ − m)³ / (N × s³)
Here, m is the population mean and s is the population standard deviation of the univariate
data. Sometimes, for bias correction, N − 1 is used instead of N.
Kurtosis
Kurtosis indicates the peakedness of the data. If the data has a high peak, it indicates higher kurtosis
and vice versa. Kurtosis is measured using the formula given below:
kurtosis = Σᵢ (xᵢ − x̄)⁴ / ((N − 1) × s⁴) (2.14)
It can be observed that N − 1 is used instead of N in Eq. (2.14) for bias correction.
Here, x̄ and s are the mean and standard deviation of the univariate data, respectively.
Some of the other useful measures for finding the shape of the univariate dataset are mean
absolute deviation (MAD) and coefficient of variation (CV).
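A short SciPy sketch of these shape measures (an illustration, assuming NumPy/SciPy are available; note that scipy.stats.kurtosis reports excess kurtosis, so a normal distribution gives 0):

import numpy as np
from scipy import stats

x = np.array([13, 11, 2, 3, 4, 8, 9], dtype=float)
print("skewness:", stats.skew(x))              # third-order symmetry measure
print("kurtosis:", stats.kurtosis(x))          # excess kurtosis (normal -> 0)
print("MAD:", np.mean(np.abs(x - x.mean())))   # mean absolute deviation
print("CV:", x.std() / x.mean())               # coefficient of variation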
A stem-and-leaf plot of the English marks is shown in Figure 2.9.
It can be seen from Figure 2.9 that the first column is the stem and the second column is the leaf.
For the given English marks, two students with 60 marks appear in the stem-and-leaf plot as
stem 6 with two leaves of 0. The normal Q–Q plot for the marks x = [13 11 2 3 4 8 9] is given in
Figure 2.10.
Here, the aim of bivariate analysis is to find relationships among variables. The relationships can
then be used in comparisons, finding causes, and in further explorations. To do that, graphical
display of the data is necessary. One such graph method is called scatter plot.
Scatter plot is used to visualize bivariate data. It is useful to plot two variables with or without
nominal variables, to illustrate the trends, and also to show differences. It is a plot between
explanatory and response variables. It is a 2D graph showing the relationship between two
variables.
Line graphs are similar to scatter plots. The Line Chart for sales data is shown in Figure 2.12.
The covariance is given as:
COV(X, Y) = (1/N) Σᵢ (xᵢ − E(X)) (yᵢ − E(Y))
Here, xᵢ and yᵢ are data values from X and Y, E(X) and E(Y) are the mean values of xᵢ and yᵢ,
and N is the number of given data. Also, COV(X, Y) is the same as COV(Y, X).
Example 2.6: Find the covariance of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution: E(X) = 3 and E(Y) = 11, so COV(X, Y) = [(−2)(−10) + (−1)(−7) + 0 + (1)(5) + (2)(14)] / 5 = 60/5 = 12.
The covariance between X and Y is 12. It can be normalized to a value between −1 and +1 by
dividing it by the product of the standard deviations of X and Y; the result is called the Pearson
correlation coefficient.
Sometimes, N − 1 can also be used instead of N; in that case, the covariance is 60/4 = 15.
Correlation
The Pearson correlation coefficient is the most common test for determining any association
between two phenomena. It measures the strength and direction of a linear relationship between
the x and y variables.
1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension
decreases.
3. If the value is zero, then it indicates that there is no linear relationship between the
dimensions.
If the dimensions are correlated, then it is better to remove one of them, as it is a redundant
dimension.
If the given attributes are X = (x1, x2, …, xN) and Y = (y1, y2, …, yN), then the Pearson correlation
coefficient, denoted as r, is given as:
r = COV(X, Y) / (σX σY) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² × Σᵢ (yᵢ − ȳ)² )
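A quick NumPy check of Example 2.6 and the correlation formula above (a sketch, not part of the textbook):

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([1, 4, 9, 16, 25], dtype=float)
# np.cov defaults to N - 1 in the denominator; bias=True divides by N
print(np.cov(X, Y, bias=True)[0, 1])   # 12.0
print(np.cov(X, Y)[0, 1])              # 15.0 (the N - 1 version)
print(np.corrcoef(X, Y)[0, 1])         # Pearson r ~ 0.981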
Multivariate data has three or more variables. The aims of multivariate analysis are much
broader and include regression analysis, factor analysis and multivariate analysis of variance,
which are explained in the subsequent chapters of this book.
Heatmap
Heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours it. The
darker colours indicate very large values and lighter colours indicate smaller values. The
advantage of this method is that humans perceive colours well, so by colour shading, larger values
can be spotted quickly. For example, in vehicle traffic data, heavy traffic regions can be
differentiated from low traffic regions through a heatmap.
In Figure 2.13, patient data highlighting weight and health status is plotted. Here, the X-axis
shows weights and the Y-axis shows patient counts. The dark colour regions highlight patients'
weights versus patient counts for each health status.
Pairplot
Pairplot or scatter matrix is a data visual technique for multivariate data. A scatter matrix consists
of several pair-wise scatter plots of variables of the multivariate data.
A random matrix of three columns is chosen and the relationships of the columns are plotted
as a pairplot (or scatter matrix), as shown in Figure 2.14.
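A minimal sketch of both plots, assuming the seaborn, pandas and matplotlib libraries (not mentioned in the text) and a random three-column matrix as in Figure 2.14:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(100, 3), columns=["A", "B", "C"])
sns.pairplot(df)                      # pair-wise scatter plots (scatter matrix)
plt.show()
sns.heatmap(df.corr(), annot=True)    # heatmap of the correlation matrix
plt.show()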
Machine learning involves many mathematical concepts from the domain of Linear algebra,
Statistics, Probability and Information theory. The subsequent sections discuss important aspects
of linear algebra and probability.
This is true if y is not zero and A is not zero. The logic can be extended to a system of N equations
with 'n' unknown variables: if A is the coefficient matrix and y = (y1 y2 … yn)T, then the unknown
vector x can be computed as:
x = A⁻¹y
If there is a unique solution, then the system is called consistent independent. If there are various
solutions, then the system is called consistent dependent. If there are no solutions and the
equations are contradictory, then the system is called inconsistent.
For solving a large system of equations, Gaussian elimination can be used. The
procedure for applying Gaussian elimination is given as follows:
1. Write the given matrix A.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a11 as the pivot and eliminate the entries below it in the other rows using
row operations.
To facilitate the application of the Gaussian elimination method, the following row operations are
applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it
These concepts are illustrated in Example 2.8.
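The following NumPy sketch implements this procedure (partial pivoting is added for numerical stability, which goes slightly beyond the basic steps above; the example system is hypothetical):

import numpy as np

def gaussian_elimination(A, y):
    # Forward elimination on the augmented matrix, then back substitution
    M = np.hstack([A.astype(float), y.reshape(-1, 1).astype(float)])
    n = len(y)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(M[k:, k]))     # choose the largest pivot
        M[[k, p]] = M[[p, k]]                   # swap rows
        for i in range(k + 1, n):
            M[i] -= (M[i, k] / M[k, k]) * M[k]  # eliminate entries below the pivot
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):              # back substitution
        x[i] = (M[i, -1] - M[i, i + 1:n] @ x[i + 1:]) / M[i, i]
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
y = np.array([5.0, 10.0])
print(gaussian_elimination(A, y))   # [1. 3.]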
For a symmetric matrix, the decomposition can be written as A = QΛQT, where Q is the matrix of
eigen vectors, Λ is the diagonal matrix of eigen values, and QT is the transpose of matrix Q.
LU Decomposition
One of the simplest matrix decompositions is LU decomposition, where the matrix A can be
decomposed into two matrices: A = LU
Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition
can be done using the Gaussian elimination method as discussed in the previous section. First, an
identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination
are applied to reduce the given matrix to get the matrices L and U.
Example 2.9 illustrates the application of Gaussian elimination to get LU.
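In practice, LU decomposition can be checked with SciPy (a sketch; note that scipy.linalg.lu also applies row pivoting, so it returns a permutation matrix P with A = P L U, unlike the plain hand method of Example 2.9):

import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0], [6.0, 3.0]])
P, L, U = lu(A)
print(L)                           # unit lower triangular factor
print(U)                           # upper triangular factor
print(np.allclose(A, P @ L @ U))   # True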
Now, it can be observed that the first matrix is L, as it is the lower triangular matrix whose
values are the multipliers used in the reduction of equations above, such as 3, 3 and 2/3.
The second matrix is U, the upper triangular matrix whose values are the values of the reduced
matrix resulting from Gaussian elimination.
Probability Distributions
A probability distribution of a variable, say X, summarizes the probability associated with X’s
events. Distribution is a parameterized mathematical function. In other words, distribution is a
function that describes the relationship between the observations in a sample space.
Consider a set of data. The data is said to follow a distribution if it obeys a mathematical
function that characterizes that distribution. The function can be used to calculate the probability
of individual observations.
Probability distributions are of two types:
1. Discrete probability distribution
2. Continuous probability distribution
The relationships between the events for a continuous random variable and their probabilities
are given by a continuous probability distribution, which can be summarized by a probability
density curve. In a normal distribution, data tends to be around a central value with no bias on the
left or right. The heights of students, the blood pressure of a population, and marks scored in
a class can be approximated using a normal distribution.
The PDF of the normal distribution is given as:
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
Discrete Distributions Binomial, Poisson, and Bernoulli distributions fall under this category.
1. Binomial Distribution – The binomial distribution is often encountered
in machine learning. Each trial has only two outcomes, success or failure; such a trial is called
a Bernoulli trial.
The objective of this distribution is to find the probability of getting k successes out of n trials,
which is given as:
P(X = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ
Here, p is the probability of success in a trial, k is the number of successes, and n is the total
number of trials. The mean of the binomial distribution is np and its variance is np(1 − p).
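A small SciPy illustration of the binomial formulas above (the trial numbers are hypothetical):

from scipy.stats import binom

n, p = 10, 0.5                # 10 trials, success probability 0.5
print(binom.pmf(3, n, p))     # P(X = 3) = C(10, 3) p^3 (1 - p)^7 ~ 0.117
print(binom.mean(n, p))       # np = 5.0
print(binom.var(n, p))        # np(1 - p) = 2.5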
Density Estimation
Let there be a set of observed values x1, x2, …, xn from a larger set of data whose distribution is
not known. Density estimation is the problem of estimating the density function from the observed
data.
There are two types of density estimation methods, namely parametric density estimation and
non-parametric density estimation.
Parametric Density Estimation It assumes that the data is from a known probabilistic
distribution whose parameters can be estimated from the data. Maximum likelihood estimation
is a parametric estimation method.
Maximum Likelihood Estimation For a sample of observations, one can estimate the probability
distribution; this is called density estimation. Maximum Likelihood Estimation (MLE) is a
probabilistic framework that can be used for density estimation. It involves formulating
a likelihood function, which is the conditional probability of observing the
samples given the distribution function and its parameters. For example, if the observations
are X = {x1, x2, …, xn}, then density estimation is the problem of choosing a PDF with suitable
parameters to describe the data. MLE treats this as a search or optimization problem
in which the joint probability of X under the parameter, theta, is maximized.
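A minimal sketch of MLE as an optimization problem, assuming the data comes from a normal distribution (NumPy/SciPy assumed; the data is synthetic):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=1000)   # observed samples

def neg_log_likelihood(theta):
    # negative log-likelihood of X under N(mu, sigma^2)
    mu, sigma = theta
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (X - mu)**2 / sigma**2)

# Minimizing the negative log-likelihood == maximizing the likelihood
result = minimize(neg_log_likelihood, x0=[0.0, 1.0],
                  bounds=[(None, None), (1e-6, None)])
print(result.x)   # close to the true parameters [5.0, 2.0]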
If one assumes that the regression problem can be framed as predicting output y given input x,
then for p(y|x) the MLE framework can be applied as:
maximize Σᵢ log p(yᵢ | xᵢ; b) (2.37)
Here, b is the regression coefficient and xᵢ is the given sample. One can maximize this function
or, equivalently, minimize the negative log-likelihood function to provide a solution for the linear
regression problem. Eq. (2.37) yields the same answer as the least-squares approach.
Generally, there can be many unspecified distributions with different sets of parameters. The
EM algorithm has two stages:
1. Expectation (E) Stage – In this stage, the expected PDF and its parameters are estimated
for each latent variable.
2. Maximization (M) Stage – In this stage, the parameters are optimized using the MLE function.
This process is iterative, and the iteration continues till all the latent variables are fitted
by probability distributions effectively along with the parameters.
Parzen Window Estimation is a non-parametric method in which a window function centred on
each observation is used for density estimation.
This window can be replaced by any other function too. If a Gaussian function is used, then it
is called Gaussian density estimation.
KNN Estimation The KNN estimation is another non-parametric density estimation method.
Here, the initial parameter k is determined and, based on that, the k nearest neighbours are
determined. The probability density function estimate is the average of the values that are
returned by the neighbours.
Type I error is the incorrect rejection of a true null hypothesis and is called a false positive.
Type II error is the failure to reject a false null hypothesis and is called a false negative.
During these calculations, one must include the size of the data sample. Degree of freedom
indicates the number of independent pieces of information used for the test. It is indicated as n.
The mean or variance can be used to indicate the degree of freedom.
Hypothesis Testing
Let us define two important errors called sample error and true (or actual) error. Let us assume
that D is the unknown distribution, the target function is f: X → {0, 1}, x is an instance, h(x) is the
hypothesis, and S is the sample set drawn from the instances X. Then,
the actual error is denoted as:
errorD(h) = Pr[f(x) ≠ h(x)], for x drawn at random from D
In other words, the true error is the probability that the hypothesis will misclassify an instance
that is drawn at random. The point is that the population is very large, and hence the true error
cannot be determined exactly; it can only be estimated. So, another error, called the sample error,
is used as its estimator.
Sample error is defined with respect to the sample S. It is the fraction of S that is misclassified:
errorS(h) = (1/n) Σx∈S δ( f(x) ≠ h(x) )
where δ(·) equals 1 when its argument is true and 0 otherwise.
p-value
Statistical tests can be performed to either accept or reject the null hypothesis. This is done using
a value called the p-value or probability value. It indicates the probability of obtaining a result at
least as extreme as the one observed, assuming the null hypothesis is true, and is used to interpret
or quantify the test. For example, a statistical test result may give a value of 0.03. Then, one can
compare it with the level 0.05. As 0.03 < 0.05, the result is assumed to be significant. This means
that the variables tested are not independent. Here, 0.05 is called the significance level. In general,
the significance level is denoted α and the p-value is compared with α. If p-value ≤ α, then the null
hypothesis H0 is rejected, and if p-value > α, then H0 is not rejected.
Confidence Intervals
The acceptance or rejection of the hypothesis can also be done using a confidence interval.
The confidence level is computed as:
Confidence level = 1 – significance level (2.44)
The confidence interval is the range of values that indicates the location of the true mean. Confidence
intervals indicate the confidence of the result. If the confidence level is 90%, then it infers that there
is a 90% chance that the true mean lies in this range, and the remaining 10% indicates that the true
mean is not present. For finding this, one requires the mean and the standard deviation. Then, the
interval can be given as:
x̄ ± z × (s / √N)
Here, s is the standard deviation, N is the number of samples, and z is the value
associated with the chosen level of confidence (for 90%, z ≈ 1.645). The term z × (s/√N) is called the
margin of error.
Sample error is the unbiased estimate of the true error. If no information is provided, then both
errors are the same. It is, however, often safe to quote a margin of confidence associated with the
hypothesis. The interval with 95% confidence about the sample error can be given as follows:
errorS(h) ± 1.96 × √( errorS(h) (1 − errorS(h)) / n )
The value 1.96 indicates 95% confidence. The number 1.96 can be replaced by
the value associated with a different level of confidence.
The procedure to estimate the difference between two hypotheses, say h1 and h2, is as follows:
1. A parameter d can be chosen to estimate the difference in the errors of the two hypotheses:
d ≡ errorD(h1) − errorD(h2) (2.46)
2. Here, the two hypotheses h1 and h2 are tested on two sample sets S1 and S2, from which
n1 and n2 samples are randomly drawn.
3. The estimator d̂ is the difference of the sample errors:
d̂ = errorS1(h1) − errorS2(h2)
4. The confidence intervals can be used for the estimator as well:
d̂ ± z × √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
where z is chosen for the desired level of confidence.
Sometimes, it is desirable to find an interval L and U such that N% of the probability falls in this
interval.
One Sample t-test The t-statistic is computed as:
t = (x̄ − μ) / (s / √n)
Here, t is the t-statistic, x̄ is the mean of the group, μ is the theoretical value or population mean,
s is the standard deviation, and n is the group size or sample size.
Independent Two Sample t-test The t-statistic for two groups A and B is computed as follows:
t = ( mean(A) − mean(B) ) / √( s² × (1/N1 + 1/N2) )
Here, mean(A) and mean(B) are the means of the two different samples, N1 and N2 are the sample
sizes of the two groups A and B, and s² is the pooled variance of the two samples. The degree of
freedom is given as N1 + N2 − 2.
Paired t-test It is used to evaluate a hypothesis before and after an intervention, where the
samples are not independent. For example, consider the effect of medication on a diabetic patient:
first the sugar level is tested, then the medication is given, and the sugar test is conducted again
to study the effect of the medication. In short, in a paired t-test, the data is taken from the same
subject twice, whereas in an unpaired t-test the samples are taken independently. Only one group
is involved here. The t-statistic is computed as:
t = d̄ / (s_d / √n)
Here, t is the t-statistic, d̄ is the mean of the paired differences, s_d is the standard deviation of
the differences, and n is the number of pairs.
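All three t-tests are available in SciPy; a sketch with hypothetical before/after sugar readings:

import numpy as np
from scipy import stats

before = np.array([160, 150, 170, 155, 165], dtype=float)
after  = np.array([145, 140, 160, 150, 155], dtype=float)

print(stats.ttest_1samp(before, popmean=150))  # one sample t-test
print(stats.ttest_ind(before, after))          # independent two sample t-test
print(stats.ttest_rel(before, after))          # paired t-test (same subjects)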
Chi-Square Test
The Chi-Square test is a non-parametric test. The goodness-of-fit test statistic follows a Chi-Square
distribution under the null hypothesis and measures the statistical significance between the
observed frequency and the expected frequency; each observation is assumed to be independent of
the others. This comparison is used to calculate the value of the Chi-Square statistic as:
χ² = Σ (O − E)² / E
Here, E is the expected frequency, O is the observed frequency, and the degree of freedom is
C − 1, where C is the number of categories. The Chi-Square test also allows us to detect duplication
of data and helps to remove redundant values.
Example 2.11: Consider the following Table 2.4, where registration for the machine learning
course by boys and girls is recorded. There are 50 boys and 50 girls in the class, and the
registrations are given in the table. Apply the Chi-Square test and find out whether any difference
exists between boys and girls in course registration.
Solution: Let the null hypothesis H0 be that there is no difference between boys and girls, and
H1 be the alternate hypothesis that there is a significant difference between boys and girls.
For applying the Chi-Square test, the expected frequencies are obtained from the observations
as (row total × column total) / grand total, e.g., Total boys × Total registered / Total, as shown in
Table 2.5.
for degree of freedom = number of categories − 1 = 2 − 1 = 1. The p-value for this statistic is 0.0412,
which is less than 0.05. Therefore, the result is significant and the null hypothesis H0 is rejected.
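Since Table 2.4 is not reproduced here, the sketch below uses assumed registration counts, purely to illustrate the SciPy call:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 2 table: rows = boys/girls, columns = registered/not
observed = np.array([[30, 20],
                     [20, 30]])
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)   # compare p with the 0.05 significance level
print(expected)       # (row total x column total) / grand total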
Features are attributes. Feature engineering is about determining the subset of features that form
an important part of the input and improve the performance of the model, be it classification or
any other model in machine learning.
Feature engineering deals with two problems – Feature Transformation and Feature Selection.
Feature transformation is extraction of features and creating new features that may be helpful in
increasing performance. For example, height and weight may give a new attribute called Body
Mass Index (BMI).
Feature subset selection is another important aspect of feature engineering that focuses on
selecting a subset of features to reduce computation time, but not at the cost of reliability.
The features can be removed based on two aspects:
1. Feature relevancy – Some features contribute more to classification than other features.
For example, a mole on the face can help more in face detection than common features like
the nose. In simple words, the retained features should be relevant.
2. Feature redundancy – Some features are redundant. For example, when a database table
has a field called date of birth, the age field is redundant, as age can be computed
easily from the date of birth.
So, the procedure is:
1. Generate all possible subsets
2. Evaluate the subsets and model performance
3. Evaluate the results for optimal feature selection
Filter-based selection uses statistical measures for assessing features. In this approach,
no learning algorithm is used. Correlation and information gain measures like mutual information
and entropy are all examples of this approach.
Wrapper-based methods use classifiers to identify the best features. These are selected
and evaluated by the learning algorithms. This procedure is computationally intensive but has
superior performance.
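A filter-based selection sketch using scikit-learn's mutual information scorer (the library and dataset are assumptions, not from the text):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
# Score each feature with mutual information (no learning algorithm used)
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)   # information-gain-style score per feature
print(X_selected.shape)   # (150, 2)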
The operator E refers to the expected value of the population. This is calculated theoretically
using the probability density functions (PDF) of the elements xi and the joint probability density
functions between the elements xi and xj. From this, the covariance matrix can be calculated as:
C = E[(x − m)(x − m)T], where m is the mean vector
If A is the matrix whose rows are the eigen vectors of C, the mapping of the vectors x to y using
the transformation can now be described as:
y = A(x − m)
This transform is also called the Karhunen–Loeve or Hotelling transform. The original vector x
can now be reconstructed as follows:
x = ATy + m
If only the K largest eigen values are used, with AK holding the corresponding K eigen vectors,
the recovered information would be:
x̂ = AKT y + m
The new data is a dimensionally reduced matrix that represents the original data.
The scree plot of eigen values is shown in Figure 2.15; it indicates that only 6 out of 246 attributes
are important. From Figure 2.15, one can infer the relevance of the attributes: the first attribute
is more important than all the other attributes.
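The 246-attribute dataset behind Figure 2.15 is not available here, so the following scree-plot sketch uses synthetic data with scikit-learn's PCA:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] = 3 * X[:, 1] + X[:, 0]   # make one direction dominate the variance

pca = PCA().fit(X)
plt.plot(range(1, 11), pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.title("Scree plot")
plt.show()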
The Singular Value Decomposition (SVD) is A = USVT. Here, A is the given matrix of dimension
m × n, U is the orthogonal matrix whose dimension is m × n, S is the diagonal matrix of singular
values of dimension n × n, and V is the orthogonal matrix. The procedure for finding the
decomposition matrices is given as follows:
1. For the given matrix, find AAT
2. Find the eigen values and eigen vectors of AAT; the normalized eigen vectors form the columns of U
3. Similarly, the eigen vectors of ATA form the columns of V, and the square roots of the common
non-zero eigen values form the diagonal entries of S
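In code, the decomposition is available directly (a NumPy sketch with a hypothetical matrix):

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U diag(s) Vt
print(U.shape, s, Vt.shape)
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True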
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING (21CS54)
MODULE 3
CHAPTER 3
BASICS OF LEARNING THEORY
3.1 INTRODUCTION TO LEARNING AND ITS TYPES
Learning is a process by which one can acquire knowledge and construct new ideas or
concepts based on experiences.
The standard definition of learning proposed by Tom Mitchell is: a program is said to learn
from experience E with respect to a class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E.
There are two kinds of problems – well-posed and ill-posed. Computers can solve only well-
posed problems, as these have well-defined specifications and have the following
components inherent to them:
1. Class of learning tasks (T) 2. A measure of performance (P) 3. A source of experience (E)
Let x be an input and χ the input space, the set of all possible inputs. Let Y be the output space,
the set of all possible outputs, such as yes/no.
Let D be the dataset of n inputs. Consider the unknown target function f: χ → Y that maps input to output.
Objective: To pick a function g: χ → Y that approximates the target hypothesis f.
Learning Types
These questions are the basis of a field called 'Computational Learning Theory', or COLT in short.
There are two ways of learning a hypothesis consistent with all training instances from the
large hypothesis space:
List-Then-Eliminate Algorithm
MODULE 3
CHAPTER 4
SIMILARITY-BASED LEARNING
4.1 Similarity or Instance-based Learning
a) KNN
b) Variants of KNN
c) Locally weighted regression
d) Learning vector quantization
e) Self-organizing maps
f) RBF networks
Nearest-Neighbor Learning
A powerful classification algorithm used in pattern recognition.
K nearest neighbors stores all available cases and classifies new cases based on a
similarity measure (e.g., a distance function).
One of the top data mining algorithms used today.
A non-parametric, lazy learning algorithm (an instance-based learning method).
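A small scikit-learn sketch of nearest-neighbour classification (the dataset and library are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Lazy learner: fit() only stores the cases; distances are computed at
# prediction time against the k nearest stored neighbours
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.score(X_test, y_test))   # classification accuracy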
In locally weighted regression, each training instance xi receives a weight such as
wi = exp(−(x − xi)² / (2τ²)), where τ is called the bandwidth parameter and controls the rate at
which wi reduces to zero with distance from xi.
MODULE 3
CHAPTER 5
REGRESSION ANALYSIS
5.1 Introduction to Regression
Regression analysis is a fundamental concept that consists of a set of machine learning methods
that predict a continuous outcome variable (y) based on the value of one or multiple predictor
variables (x).
OR
Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables.
Regression is a supervised learning technique which helps in finding the correlation between
variables.
It is mainly used for prediction, forecasting, time series modelling, and determining the
cause-and-effect relationship between variables.
Regression shows a line or curve fitted to the datapoints on the target–predictor graph in such a
way that the vertical distance between the datapoints and the regression line is minimum. The
distance between the datapoints and the line tells whether the model has captured a strong
relationship or not.
• Function of regression analysis is given by:
Y=f(x)
Here, y is called dependent variable and x is called independent variable.
Applications of Regression Analysis
Sales of goods or services
Value of bonds in portfolio management
Premiums of insurance companies
Yield of crop in agriculture
Prices of real estate
Positive Correlation: Two variables are said to be positively correlated when their values
move in the same direction. For example, in the image below, as the value for X increases, so
does the value for Y at a constant rate.
Negative Correlation: Finally, variables X and Y will be negatively correlated when their
values change in opposite directions, so here as the value for X increases, the value for Y
decreases at a constant rate.
Neutral Correlation: No relationship in the change of variables X and Y. In this case, the
values are completely random and do not show any sign of correlation, as shown in the
following image:
Causation
Causation is about a relationship between two variables where x causes y. This is called 'x implies y'.
Regression is different from causation. Causation indicates that one event is the result of the
occurrence of the other event, i.e., there is a causal relationship between the two events.
Linear and Non-Linear Relationships
The relationship between input features (variables) and the output (target) variable is
fundamental. These concepts have significant implications for the choice of algorithms, model
complexity, and predictive performance.
Linear relationship creates a straight line when plotted on a graph, a Non-Linear relationship
does not create a straight line but instead creates a curve.
Example:
Linear-the relationship between the hours spent studying and the grades obtained in a class.
Non-Linear-
Linearity:
Linear Relationship: A linear relationship between variables means that a change in one
variable is associated with a proportional change in another variable. Mathematically, it can be
represented as y = a * x + b, where y is the output, x is the input, and a and b are constants.
Linear Models: Goal is to find the best-fitting line (plane in higher dimensions) to the data
points. Linear models are interpretable and work well when the relationship between variables
is close to being linear.
Limitations: Linear models may perform poorly when the relationship between variables is
non-linear. In such cases, they may underfit the data, meaning they are too simple to capture
the underlying patterns.
Non-Linearity:
Non-Linear Relationship: A non-linear relationship implies that the change in one variable is
not proportional to the change in another variable. Non-linear relationships can take various
forms, such as quadratic, exponential, logarithmic, or arbitrary shapes.
Non-Linear Models: Machine learning models like decision trees, random forests, support
vector machines with non-linear kernels, and neural networks can capture non-linear
relationships. These models are more flexible and can fit complex data patterns.
Benefits: Non-linear models can perform well when the underlying relationships in the data
are complex or when interactions between variables are non-linear. They have the capacity to
capture intricate patterns.
Types of Regression
Linear Regression:
Single Independent Variable: Linear regression, also known as simple linear regression, is
used when there is a single independent variable (predictor) and one dependent variable
(target).
Equation: The linear regression equation takes the form: Y = β0 + β1X + ε, where Y is the
dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope
(coefficient), and ε is the error term.
Purpose: Linear regression is used to establish a linear relationship between two variables and
make predictions based on this relationship. It's suitable for simple scenarios where there's only
one predictor.
Multiple Regression:
Multiple Independent Variables: Multiple regression, as the name suggests, is used when there
are two or more independent variables (predictors) and one dependent variable (target).
Equation: The multiple regression equation extends the concept to multiple predictors: Y = β0
+ β1X1 + β2X2 + ... + βnXn + ε, where Y is the dependent variable, X1, X2, ..., Xn are the
independent variables, β0 is the intercept, β1, β2, ..., βn are the coefficients, and ε is the error
term.
Purpose: Multiple regression allows you to model the relationship between the dependent
variable and multiple predictors simultaneously. It's used when there are multiple factors that
may influence the target variable, and you want to understand their combined effect and make
predictions based on all these factors.
Polynomial Regression:
Use: Polynomial regression is an extension of multiple regression used when the relationship
between the independent and dependent variables is non-linear.
Equation: The polynomial regression equation allows for higher-order terms, such as quadratic
or cubic terms: Y = β0 + β1X + β2X^2 + ... + βnX^n + ε. This allows the model to fit a curve
rather than a straight line.
Logistic Regression:
Use: Logistic regression is used when the dependent variable is binary (0 or 1). It models the
probability of the dependent variable belonging to a particular class.
Equation: Logistic regression uses the logistic function (sigmoid function) to model
probabilities: P(Y=1) = 1 / (1 + e^(-z)), where z is a linear combination of the independent
variables: z = β0 + β1X1 + β2X2 + ... + βnXn. It transforms this probability into a binary
outcome.
Lasso Regression (L1 Regularization):
Use: Lasso regression is used for feature selection and regularization. It penalizes the absolute
values of the coefficients, which encourages sparsity in the model.
Objective Function: Lasso regression adds an L1 penalty to the linear regression loss function:
Lasso = RSS + λΣ|βi|, where RSS is the residual sum of squares, λ is the regularization strength,
and |βi| represents the absolute values of the coefficients.
Ridge Regression (L2 Regularization):
Use: Ridge regression is used for regularization to prevent overfitting in multiple regression. It
penalizes the square of the coefficients.
Objective Function: Ridge regression adds an L2 penalty to the linear regression loss function:
Ridge = RSS + λΣ(βi^2), where RSS is the residual sum of squares, λ is the regularization
strength, and (βi^2) represents the square of the coefficients.
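The regression variants above map directly onto scikit-learn estimators; a sketch on synthetic data (sklearn's alpha plays the role of λ in the objectives):

import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

print(LinearRegression().fit(X, y).coef_)   # ordinary least squares
print(Lasso(alpha=0.1).fit(X, y).coef_)     # L1 penalty -> sparse coefficients
print(Ridge(alpha=1.0).fit(X, y).coef_)     # L2 penalty -> shrunk coefficients

y_bin = (y > 0).astype(int)                 # binary target for logistic regression
print(LogisticRegression().fit(X, y_bin).predict_proba(X[:2]))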
Limitations of Regression
Coefficient of Determination
The coefficient of determination (R² or r-squared) is a statistical measure in a regression model
that determines the proportion of variance in the dependent variable that can be explained by
the independent variable.
The sum of the squares of the differences between the y-value of each data pair and the average
of y is called the total variation. The following variations can then be defined:
The explained variation is given by ∑( Ŷi – mean(Yi) )²
The unexplained variation is given by ∑( Yi − Ŷi )²
Thus, the total variation is equal to the explained variation plus the unexplained variation.
The coefficient of determination r² is the ratio of the explained variation to the total variation.
5. Consider the following dataset in Table 5.11 where the week and number of working hours per
week spent by a research scholar in a library are tabulated. Based on the dataset, predict the
number of hours that will be spent by the research scholar in the 7th and 9th week. Apply Linear
regression model.
Table 5.11
xi (week):        1   2   3   4   5
yi (Hours Spent): 12  18  22  28  35
Solution
xi   yi   xi·xi   xi·yi
1    12   1       12
2    18   4       36
3    22   9       66
4    28   16      112
5    35   25      175
Sum = 15, Sum = 115, Sum = 55, Sum = 401
avg(xi) = 15/5 = 3, avg(yi) = 115/5 = 23, avg(xi·xi) = 55/5 = 11, avg(xi·yi) = 401/5 = 80.2
a1 = ( avg(xi·yi) − avg(xi) × avg(yi) ) / ( avg(xi·xi) − (avg(xi))² )
   = (80.2 − 3 × 23) / (11 − 3²) = 11.2 / 2 = 5.6
a0 = avg(yi) − a1 × avg(xi) = 23 − 5.6 × 3 = 6.2
Hence, the fitted line is y = 6.2 + 5.6x.
The prediction for the 7th week hours spent by the research scholar will be y = 6.2 + 5.6 × 7 = 45.4 hours.
The prediction for the 9th week hours spent by the research scholar will be y = 6.2 + 5.6 × 9 = 56.6 hours.
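The hand computation can be verified with a short NumPy sketch:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([12, 18, 22, 28, 35], dtype=float)
a1 = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a0 = y.mean() - a1 * x.mean()
print(a1, a0)                       # 5.6 6.2
for week in (7, 9):
    print(week, a0 + a1 * week)     # 45.4 and 56.6 hours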
6. Fit a regression line predicting the height of girls from the height of boys for the following data:
Height of Boys:  65  70  75  78
Height of Girls: 63  67  70  73
Solution
xi   yi   xi·xi   xi·yi
65   63   4225    4095
70   67   4900    4690
75   70   5625    5250
78   73   6084    5694
Sum = 288, Sum = 273, Sum = 20834, Sum = 19729
mean(xi) = 288/4 = 72, mean(yi) = 273/4 = 68.25, avg(xi·xi) = 20834/4 = 5208.5, avg(xi·yi) = 19729/4 = 4932.25
a1 = ( avg(xi·yi) − mean(xi) × mean(yi) ) / ( avg(xi·xi) − (mean(xi))² )
   = (4932.25 − 72 × 68.25) / (5208.5 − 72²) = 18.25 / 24.5 = 0.7449
a0 = mean(yi) − a1 × mean(xi) = 68.25 − 0.7449 × 72 = 14.6172
Hence, the fitted line is y = 14.6172 + 0.7449x.
7. Using multiple regression, fit a line for the following dataset shown in Table 5.13.
Here, Z is the equity, X is the net sales and Y is the asset. Z is the dependent variable
and X and Y are independent variables. All the data is in million dollars.
Z X Y
4 12 8
6 18 12
7 22 16
8 28 36
11 35 42
Solution
X = [1 12 8; 1 18 12; 1 22 16; 1 28 36; 1 35 42]   (a column of 1s is appended for the intercept term)
Y = [4, 6, 7, 8, 11]T
The regression coefficients can be found as follows:
â = ((XTX)⁻¹ XT) Y
  = [5 115 114; 115 2961 3142; 114 3142 3524]⁻¹ × [1 1 1 1 1; 12 18 22 28 35; 8 12 16 36 42] × [4, 6, 7, 8, 11]T
  = [−0.4135, 0.39625, −0.0658]T
Therefore, the regression line is given as Z = −0.4135 + 0.39625 X − 0.0658 Y.
***
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING (21CS54)
MODULE 4
CHAPTER 6
DECISION TREE LEARNING
6.1 Introduction
6.1.1 Structure of a Decision Tree A decision tree is a structure that includes a root
node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each
branch denotes the outcome of a test, and each leaf node holds a class label. The topmost
node in the tree is the root node.
A decision tree is constructed by recursively splitting the training data
into subsets based on the values of the attributes until a stopping criterion is met, such
as the maximum depth of the tree or the minimum number of samples required to split
a node.
Validating and pruning decision trees is a crucial part of building accurate and robust
machine learning models. Decision trees are prone to overfitting, which means they can
learn to capture noise and details in the training data that do not generalize well to new,
unseen data.
Validation and pruning are techniques used to mitigate this issue and improve the
performance of decision tree models.
The hyperparameters that can be tuned for early stopping and preventing overfitting include
the maximum tree depth and the minimum number of samples required to split a node.
These same parameters can also be tuned to get a robust model.
Post-pruning does the opposite of pre-pruning and allows the Decision Tree model to
grow to its full depth. Once the model grows to its full depth, tree branches are removed
to prevent the model from overfitting. The algorithm will continue to partition data into
smaller subsets until the final subsets produced are similar in terms of the outcome
variable. The final subsets of the tree will consist of only a few data points, allowing the
tree to learn the data to a T. However, when a new data point that differs from the learned
data is introduced, it may not be predicted well.
The hyperparameter that can be tuned for post-pruning and preventing overfitting
is: ccp_alpha
ccp stands for Cost Complexity Pruning and can be used as another option to control
the size of a tree. A higher value of ccp_alpha will lead to an increase in the number of
nodes pruned.
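A scikit-learn sketch of cost complexity pruning (the dataset is chosen arbitrarily for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# A higher ccp_alpha prunes more nodes from the fully grown tree
for alpha in (0.0, 0.01, 0.05):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    print(alpha, tree.get_n_leaves(), tree.score(X_test, y_test))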
Dendrites are tree-like networks of nerve fibre connected to the cell body.
An axon is a single, long connection extending from the cell body and carrying signals from the
neuron. The end of the axon splits into fine strands, and each strand terminates in a small
bulb-like organ called a synapse. It is through synapses that the neuron introduces its signals to
other nearby neurons. The receiving ends of these synapses on the nearby neurons can be found
both on the dendrites and on the cell body. There are approximately 10⁴ synapses per neuron in the
human body. An electric impulse is passed between synapse and dendrites. It is a chemical process
which results in an increase/decrease in the electric potential inside the body of the receiving cell. If
the electric potential reaches a threshold value, the receiving cell fires and a pulse/action potential of
fixed strength and duration is sent through the axon to the synaptic junctions of the cell. After that,
the cell has to wait for a period called the refractory period.
Basically, a neuron takes an input signal (dendrite), processes it like the CPU (soma), passes
the output through a cable like structure to other connected neurons (axon to synapse to
other neuron’s dendrite).
OR
Working:
The received inputs are computed as a weighted sum which is given to the activation function,
and if the sum exceeds the threshold value, the neuron gets fired. The neuron is the basic
processing unit that receives a set of inputs x1, x2, x3, …, xn and their associated weights
w1, w2, w3, …, wn. The summation function computes the weighted sum of the inputs
received by the neuron:
Sum = ∑ xi wi
Activation functions:
• To make the network more efficient and produce exact outputs, an activation function is
applied over the net input to calculate the output of an ANN. Information processing in a
processing element has two major parts: input and output. An integration function (f) is
associated with the input of the processing element.
1. Identity (linear) function: The output is the same as the input, i.e., the weighted sum. This
function is useful when we do not apply any threshold. The output value ranges between −∞ and +∞.
2. Binary step function: This function can be defined as
𝑓(𝑥) = { 1 𝑖𝑓 𝑥 ≥ 𝜃
0 𝑖𝑓 𝑥 < 𝜃
Where, θ represents threshhold value. It is used in single layer nets to convert
the net input to an output that is binary (0 or 1).
3. Bipolar step function: This function can be defined as
𝑓(𝑥) = { 1 𝑖𝑓 𝑥 ≥ 𝜃
−1 𝑖𝑓 𝑥 < 𝜃
Where, θ represents threshold value. It is used in single layer nets to convert
the net input to an output that is bipolar (+1 or -1).
4. Sigmoid function: It is used in Back propagation nets.
Two types:
a) Binary sigmoid function: It is also termed as logistic sigmoid function or unipolar
sigmoid function. It is defined as
7. ReLU Function
ReLU stands for Rectified Linear Unit.
Although it gives an impression of a linear function, ReLU has a derivative function and
allows for backpropagation while simultaneously making it computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same
time.
The neurons will only be deactivated if the output of the linear transformation is less than 0
8. Softmax function: Softmax is an activation function that scales numbers/logits into
probabilities of the possible outcomes. The probabilities in the output vector sum to one over
all possible outcomes or classes.
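The activation functions above can be written in a few lines of NumPy (a sketch for illustration):

import numpy as np

def binary_step(x, theta=0.0):
    return np.where(x >= theta, 1, 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))   # shift the logits for numerical stability
    return e / e.sum()          # probabilities sum to one

z = np.array([-0.2, 0.0, 1.5])
print(binary_step(z), sigmoid(z), relu(z), softmax(z), sep="\n")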
• Knowledge is acquired by the network from its environment through a learning process.
OR
• The perceptron can represent all primitive boolean functions: AND, OR, NAND, NOR.
• Some boolean functions cannot be represented,
– e.g., the XOR function.
Solution: The network has inputs X1 and X2 (with bias unit X0), a hidden unit X3 (weights
w13 = 0.1, w23 = −0.4, bias Ɵ3 = 0.2) and an output unit X4 (weight w34 = 0.3, bias Ɵ4 = −0.3),
as shown in the network diagram for the AND-NOT function. The truth table is:
X1  X2 | AND | NOT(AND)
0   0  | 0   | 1
0   1  | 0   | 1
1   0  | 0   | 1
1   1  | 1   | 0
ITERATION 1:
Step 1: FORWARD PROPAGATION
1. Calculate net inputs and outputs in input layer as shown in Table 3.
Table 3: Net Input and Output Calculation
Input Layer 𝑰𝒋 𝑶𝒋
𝑿𝟏 0 0
𝑿𝟐 1 1
2. Calculate net inputs and outputs in hidden and output layer as shown in Table 4.
Table 4: Net Input and Output Calculation in Hidden and Output layer
Unit j | Net Input Ij | Output Oj
X3: I3 = X1 W13 + X2 W23 + X0 Ɵ3 = 0(0.1) + 1(−0.4) + 1(0.2) = −0.2;  O3 = 1/(1 + e^(−I3)) = 1/(1 + e^(0.2)) = 0.450
X4: I4 = O3 W34 + X0 Ɵ4 = (0.450 × 0.3) + 1(−0.3) = −0.165;  O4 = 1/(1 + e^(−I4)) = 1/(1 + e^(0.165)) = 0.458
3. Calculate Error
𝑬𝒓𝒓𝒐𝒓 = 𝑶𝒅𝒆𝒔𝒊𝒓𝒆𝒅 − 𝑶𝒆𝒔𝒕𝒊𝒎𝒂𝒕𝒆𝒅
= 1 − 0.458
𝐸𝑟𝑟𝑜𝑟 = 0.542
ITERATION 2:
Step 1: FORWARD PROPAGATION
Unit j | Net Input Ij | Output Oj
X3: I3 = X1 W13 + X2 W23 + X0 Ɵ3 = 0(0.1) + 1(−0.396) + 1(0.203) = −0.193;  O3 = 1/(1 + e^(0.193)) = 0.451
X4: I4 = O3 W34 + X0 Ɵ4 = (0.451 × 0.324) + 1(−0.246) = −0.099;  O4 = 1/(1 + e^(0.099)) = 0.475
2. Calculate Error
𝑬𝒓𝒓𝒐𝒓 = 𝑶𝒅𝒆𝒔𝒊𝒓𝒆𝒅 − 𝑶𝒆𝒔𝒕𝒊𝒎𝒂𝒕𝒆𝒅
= 1 − 0.475
𝐸𝑟𝑟𝑜𝑟 = 0.525
Iteration | Error
1 | 0.542
2 | 0.525
Reduction in error = 0.542 − 0.525 = 0.017
In iteration 2 the error gets reduced to 0.525. This process will continue until desired output
is achieved.
How a Multi-Layer Perceptron does solves the XOR problem. Design an MLP with back
propagation to implement the XOR Boolean function.
Solution:
X1 X2 Y
0 0 1
0 1 0
1 0 0
1 1 1
The network has inputs X1 and X2 (with bias unit X0), hidden units X3 and X4, and an output
unit X5. The initial weights and biases (values 0.1, −0.3, −0.2, 0.4, 0.4, 0.2, 0.2, −0.3, −0.3) are
as shown in the network diagram.
Table 11: Error Calculation for each unit in the Output layer and Hidden layer
For Output Layer Errork
Unit k
X5: Error5 = O5 (1 − O5) (T − O5) = 0.407 × (1 − 0.407) × (1 − 0.407) = 0.143,
where the desired output T = 1
For Hidden layer Errorj
Unit j
X4: Error4 = O4 (1 − O4) ∑k Errork Wjk = O4 (1 − O4) Error5 W45
= 0.622 × (1 − 0.622) × 0.143 × (−0.3) = −0.010
X3: Error3 = O3 (1 − O3) ∑k Errork Wjk = O3 (1 − O3) Error5 W35
= 0.549 × (1 − 0.549) × 0.143 × 0.2 = 0.007
2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 15.
Table 15: Net Input and Output Calculation in the Hidden Layer and Output Layer
Unit j Net Input Ij Output Oj
X3: I3 = X1·W13 + X2·W23 + X0·θ3 = 1 × (−0.194) + 0 × 0.2 + 1 × 0.405 = 0.211;  O3 = 1/(1 + e^(−0.211)) = 0.552
X4: I4 = X1·W14 + X2·W24 + X0·θ4 = 1 × 0.392 + 0 × (−0.3) + 1 × 0.092 = 0.484;  O4 = 1/(1 + e^(−0.484)) = 0.618
X5: I5 = O3·W35 + O4·W45 + X0·θ5 = 0.552 × 0.154 + 0.618 × (−0.288) + 1 × (−0.185) = −0.282;  O5 = 1/(1 + e^(0.282)) = 0.429
Consider the Network architecture with 4 input units and 2 output units. Consider four training
samples each vector of length 4.
Training samples
i1: (1, 1, 1, 0)
i2: (0, 0, 1, 1)
i3: (1, 0, 0, 1)
i4: (0, 0, 1, 0)
Output Units: Unit 1, Unit 2
Learning rate η(t) = 0.6
Initial Weight matrix
Unit 1: [0.2  0.8  0.5  0.1]
Unit 2: [0.3  0.5  0.4  0.6]
Identify an algorithm to learn without supervision? How do you cluster them as we
expected?
Solution:
Use Self Organizing Feature Map (SOFM)
Iteration 1:
Training Sample X1: (1, 1, 1, 0)
Weight matrix
Unit 1: [0.2  0.8  0.5  0.1]
Unit 2: [0.3  0.5  0.4  0.6]
Iteration 3:
Training Sample X3: (1, 0, 0, 1)
Weight matrix
Unit 1: [0.68  0.92  0.80  0.08]
Unit 2: [0.12  0.2   0.76  0.84]
Iteration 4:
Training Sample X4: (0, 0, 1, 0)
Weight matrix
Unit 1: [0.68  0.92  0.80  0.08]
Unit 2: [0.65  0.08  0.3   0.94]
This process is continued for many epochs until the feature map doesn’t change.
Learning Rules
Learning in NN is performed by adjusting the network weights in order to minimize the
difference between the desired and estimated output.
TYPES OF ANN
1. Feed Forward Neural Network
2. Fully connected Neural Network
3. Multilayer Perceptron
4. Feedback Neural Network
Feed Forward Neural Network:
Feed-Forward Neural Network is a single layer perceptron. A sequence of inputs enters the layer and are
multiplied by the weights in this model. The weighted input values are then summed together to form a total.
If the sum of the values is more than a predetermined threshold, which is normally set at zero, the output
value is usually 1, and if the sum is less than the threshold, the output value is usually -1.
The single-layer perceptron is a popular feed-forward neural network model that is frequently used for
classification.
The model may or may not contain hidden layer and there is no backpropagation.
Based on the number of hidden layers they are further classified into single-layered and multilayered feed
forward network.
A fully connected neural network consists of a series of fully connected layers that connect
every neuron in one layer to every neuron in the other layer.
The major advantage of fully connected networks is that they are “structure agnostic” i.e. there
are no special assumptions needed to be made about the input.
Multilayer Perceptron:
A multi-layer perceptron has one input layer and for each input, there is one neuron (or node), it has
one output layer with a single node for each output and it can have any number of hidden layers and
each hidden layer can have any number of nodes.
The signals flow forward through the network, while errors propagate backward during training.
The weight adjustment training is done via backpropagation.
Every node in the multi-layer perception uses a sigmoid activation function. The sigmoid activation
function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid
formula.
To solve this problem, we can break it down into smaller parts and give them to each of the
students. One student can solve the first part of the equation "5 x 3 = 15" and another student can
solve the second part of the equation "2 x 4 = 8". The third student can solve the third part "8 x 2 =
16".
Finally, we can simplify it to 15 + 8 + 16. Same way, one of the students in the group can solve "15
+ 8 = 23" and another one can solve "23 + 16 = 39", and that's the answer
So here we are breaking down the large math problem into different sections and giving them to
each of the students who are just doing really simple calculations, but as a result of the teamwork,
they can solve the problem efficiently.
This is exactly the idea of how a multi-layer perceptron (MLP) works. Each neuron in the MLP is
like a student in the group, and each neuron is only able to perform simple arithmetic operations.
However, when these neurons are connected and work together, they can solve complex problems.
The principle weakness of the perceptron was that it could only solve problems that were linearly
separable.
A multilayer perceptron (MLP) is a fully connected feed-forward artificial neural network with at
least three layers input, output, and at least one hidden layer.
The mapping between inputs and output is non-linear. (Ex: XOR gate)
While in the perceptron the neuron must have an activation function that imposes a threshold,
like ReLU or sigmoid, neurons in a Multilayer Perceptron can use any arbitrary activation function.
MLP networks use backpropagation for supervised learning.
The activation functions used in the layers can be linear or Non-linear depending on the type of a
problem.
NOTE : In each iteration, after the weighted sums are forwarded through all layers, the gradient of
the Mean Squared Error is computed across all input and output pairs. Then, to propagate it back,
the weights of the first hidden layer are updated with the value of the gradient. That’s how the
weights are propagated back to the starting point of the neural network.
This process keeps going until gradient for each input-output pair has converged, meaning the
newly computed gradient hasn’t changed more than a specified convergence threshold, compared to
the previous iteration.
Works in 2 stages.
1. Forward phase
2. Backward phase
Starting from initial random weights, multi-layer perceptron (MLP) minimizes the loss function
by repeatedly updating these weights. After computing the loss, a backward pass propagates it
from the output layer to the previous layers, providing each weight parameter with an update
value meant to decrease the loss.
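A compact scikit-learn sketch of an MLP solving XOR with backpropagation-style training (the library and hyperparameters are assumptions):

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                     # XOR truth table

# One hidden layer makes XOR separable in the hidden representation
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="logistic",
                    solver="lbfgs", max_iter=2000, random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))   # expected [0 1 1 0]; a different random_state may be
                        # needed if the solver lands in a local minimum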
ALGORITHM
Radial Basis Function Neural Network
This networks have a fundamentally different architecture than most neural network architectures.
Most neural network architecture consists of many layers and introduces nonlinearity by repetitively
applying nonlinear activation functions.
RBF network on the other hand only consists of an input layer, a single hidden layer, and an
output layer.
The input layer is not a computation layer, it just receives the input data and feeds it into the special
hidden layer of the RBF network. The computation that is happened inside the hidden layer is very
different from most neural networks, and this is where the power of the RBF network comes from.
The output layer performs the prediction task such as classification or regression.
RBF Neural networks are conceptually similar to K-Nearest Neighbor (k-NN) models.
It is useful for interpolation, function approximation, time series prediction and classification.
RBFNN Architecture :
Self-organizing Feature Map
SOM is trained using unsupervised learning.
SOM doesn't learn by backpropagation with Stochastic Gradient Descent (SGD); it uses competitive
learning to adjust the weights in its neurons. Artificial neural networks often utilize competitive
learning models to classify input without the use of labeled data.
Used: In dimension reduction, to reduce the data by creating a spatially organized representation;
it also helps us discover the correlation between data.
Self organizing maps have two layers, the first one is the input layer and the second one is the
output layer or the feature map.
SOM doesn't have an activation function in its neurons; the weights are passed directly to the
output layer without further processing.
Network Architecture and operations
It consists of 2 layers:
1. Input layer
2. Output layer
No Hidden layer.
The initialization of the weight to vectors initiates the mapping processes of the Self-Organizing
Maps.
The mapped vectors are then examined to determine which weight most accurately represents the
chosen sample using a sample random vector. Neighboring weights that are near each weighted
vector are present. The chosen weight is allowed to turn into a vector for a random sample. This
encourages the map to develop and take on new forms. In a 2D feature space, the nodes typically
form hexagonal or square shapes. This entire process is repeated more than 1,000 times.
To determine whether appropriate weights are similar to the input vector, each node is analyzed.
The best matching unit is the term used to describe the appropriate node.
The Best Matching Unit's neighborhood value is then determined. Over time, the neighbors tend
to decline in number.
The appropriate weight further evolves into something more resembling the sample vector. The
surrounding areas change similarly to the selected sample vector. A node's weights change more as
it gets closer to the Best Matching Unit (BMU), and less as it gets farther away from its neighbor.
For N iterations, repeat step two.
Advantages and Disadvantages of ANN
Limitations of ANN
Challenges of Artificial Neural Networks
Chapter 13
CLUSTERING ALGORITHMS
Clustering: the process of grouping a set of objects into classes of similar objects
Documents within a cluster should be similar.
Finding similarities between data according to the characteristics found in the data and
grouping similar data objects into clusters.
Example: The figure below shows data points with two features, drawn as differently shaded
samples.
If there are only a few examples with few features, clustering can be done manually; but when
examples have many features, it cannot be done manually, so automatic clustering is required.
Applications of Clustering
Advantages and Disadvantages
PROXIMITY MEASURES
Clustering algorithms need a measure to find the similarity or dissimilarity among the
objects to group them. Similarity and Dissimilarity are collectively known as proximity
measures. This is used by a number of data mining techniques, such as clustering,
nearest neighbour classification, and anomaly detection.
Distance measures are known as dissimilarity measures, as these indicate how one
object is different from another.
Measures like cosine similarity indicate the similarity among objects.
Distance measures and similarity measures are two sides of the same coin: more distance
indicates less similarity, and vice versa.
If all the conditions are satisfied, then the distance measure is called metric.
Some of proximity measures:
1. Quantitative variables
a) Euclidean distance: It is one of the most important and common
distance measure. It is also called L2 norm.
Advantage: The distance does not change with the addition of new objects.
Disadvantage: (i) If the unit of measurement changes, the resulting Euclidean or squared
Euclidean distance changes drastically.
(ii) Computational complexity is high, because it involves square and
square root operations.
b) City Block Distance: Known as Manhattan Distance or L1 norm.
Example
Suppose there are two strings 1101 1001 and 1001 1101.
11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the Hamming distance,
d(11011001, 10011101) = 2.
2. Categorical variables
Ordinal Variables
Cosine Similarity
Cosine similarity is a metric used to measure how similar the documents are
irrespective of their size.
It measures the cosine of the angle between two vectors projected in a multi-
dimensional space.
The cosine similarity is advantageous because even if the two similar
documents are far apart by the Euclidean distance (due to the size of the
document), chances are they may still be oriented closer together.
The smaller the angle, the higher the cosine similarity.
Consider two documents P1 and P2:
◦ If the distance is more, then they are less similar.
◦ If the distance is less, then they are more similar.
1. Consider the following data and, calculate the Euclidean, Manhattan and
Chebyshev distances.
a. (2 3 4) and (1 5 6)
Solution
a. Euclidean distance = √((2 − 1)² + (3 − 5)² + (4 − 6)²) = √(1 + 4 + 4) = √9 = 3
Manhattan distance = |2 − 1| + |3 − 5| + |4 − 6| = 1 + 2 + 2 = 5
Chebyshev distance = max(1, 2, 2) = 2
b. (2 2 9) and (7 8 9)
Euclidean distance = √((2 − 7)² + (2 − 8)² + (9 − 9)²) = √(25 + 36 + 0) = √61 ≈ 7.81
Manhattan distance = |2 − 7| + |2 − 8| + |9 − 9| = 5 + 6 + 0 = 11
Chebyshev distance = max(5, 6, 0) = 6
2. Find cosine similarity, SMC and Jaccard coefficients for the following binary
data:
a. (1 0 1 1) and (1 1 0 0)
Solution
(1 0 1 1)
(1 1 0 0)
Here a (positions where both are 0) = 0, b = 1, c = 2, d (positions where both are 1) = 1.
SMC = (a + d) / (a + b + c + d) = 1/4 = 0.25
Jaccard coefficient = d / (b + c + d) = 1/4 = 0.25
Cosine similarity = 1 / (√3 × √2) ≈ 0.41
b. (1 0 0 0 1) and (1 1 0 0 0)
Solution
(1 0 0 0 1)
(1 1 0 0 0)
Here a = 2, b = 1, c = 1, d = 1.
SMC = (a + d) / (a + b + c + d) = 3/5 = 0.6
Jaccard coefficient = d / (b + c + d) = 1/3 ≈ 0.33
Cosine similarity = 1 / (√2 × √2) = 0.5
3. Find Hamming distance for the following binary data:
a. (1 1 1) and (1 0 0)
Solution
It differs in two positions; therefore Hamming distance is 2
b. (1 1 1 0 0) and (0 0 1 1 1)
Solution
It differs in four positions; therefore, Hamming distance is 4
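These proximity measures are all available in SciPy; a sketch reproducing the worked numbers:

import numpy as np
from scipy.spatial import distance

x, y = np.array([2, 3, 4]), np.array([1, 5, 6])
print(distance.euclidean(x, y))    # 3.0
print(distance.cityblock(x, y))    # 5 (Manhattan)
print(distance.chebyshev(x, y))    # 2

a, b = np.array([1, 0, 1, 1]), np.array([1, 1, 0, 0])
print(distance.hamming(a, b) * len(a))   # raw Hamming distance = 3
print(1 - distance.cosine(a, b))         # cosine similarity ~ 0.41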
4. a. Find the distance between the ordinal lists (yellow, red, green) and (red, green, yellow).
Solution: For ordinal variables, the distance between two values is |rank1 − rank2| / (n − 1),
with ranks red = 1, green = 2, yellow = 3 and n = 3 categories.
Distance between (red, green) = |1 − 2| / 2 = 0.5
Distance between (green, yellow) = |2 − 3| / 2 = 0.5
Distance between (yellow, red) = |3 − 1| / 2 = 1
Therefore, the distance between (yellow, red, green) and (red, green, yellow) is (0.5, 0.5, 1).
b. (bread, butter, milk) and (milk, sandwich, tea)
Solution: Rank the items as bread = 1, butter = 2, milk = 3, sandwich = 4, tea = 5 (n = 5).
The distance between (bread, milk) = |1 − 3| / (5 − 1) = 2/4 = 1/2
The distance between (butter, sandwich) = |2 − 4| / (5 − 1) = 2/4 = 1/2
The distance between (milk, tea) = |3 − 5| / (5 − 1) = 2/4 = 1/2
Therefore, the distance between (bread, butter, milk) and (milk, sandwich, tea) = (1/2, 1/2, 1/2)
Hierarchical Clustering Algorithms
Hierarchical clustering involves creating clusters that have a predetermined ordering
from top to bottom.
For example, all files and folders on the hard disk are organized in a hierarchy.
Hierarchical relationship is shown in the form of a dendrogram.
There are two types of hierarchical clustering.
◦ Divisive and Agglomerative.
The following three methods differ in how the distance between each cluster is
measured.
1. Single Linkage
2. Average Linkage
3. Complete Linkage
Single Linkage or MIN algorithm
In single linkage hierarchical clustering, the distance between two clusters is
defined as the shortest distance between two points in each cluster. For example,
the distance between clusters “r” and “s” to the left is equal to the length of the
arrow between their two closest points.
Complete Linkage : In complete linkage hierarchical clustering, the distance between
two clusters is defined as the longest distance between two points in each cluster. For
example, the distance between clusters “r” and “s” to the left is equal to the length of the
arrow between their two furthest points.
OR
Average Linkage: In average linkage hierarchical clustering, the distance between two
clusters is defined as the average distance between each point in one cluster and every
point in the other cluster. For example, the distance between clusters "r" and "s" to the
left is equal to the average length of the arrows connecting the points of one cluster to
the other.
Mean-Shift Algorithm
Use the following dataset and apply hierarchical methods. Show the dendrogram.
SNo. X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9
5. 20 8
Solution
The similarity table among the variables is computed using Euclidean distance and is shown in
the following table.
Objects 0 1 2 3 4
0 - 5 9 9.85 17.26
1 - 5.83 9.49 13
2 - 5.66 8.94
3 - 4.12
4 -
The minimum distance is 4.12. Therefore, the items 1 and 4 are clustered together. The resultant
table is given as shown in the following Table.
Clusters | {1,4} | 2 | 3 | 5
The distances between the group {1, 4} and items 2, 3, 5 are computed using the single-linkage
(minimum) formula.
The distance between {1,4} and {2} is: min{ d(1,2), d(4,2) } = 5
The distance between {1,4} and {3} is: min{ d(1,3), d(4,3) } = min{9, 5.66} = 5.66
The distance between {1,4} and {5} is: min{ d(1,5), d(4,5) } = min{17.26, 4.12} = 4.12
The minimum distance in the above table is 4.12. Therefore, {1,4} and {5} are combined. This
results in the following table:
Clusters | {1,4,5} | 2 | 3
{1,4,5}  | −       | 5 | 5.66
2        |         | − | 5.83
3        |         |   | −
The minimum is 5. Therefore {1,4,5} and {2} are combined, and finally that is combined with
{3}. Therefore, the order of clustering is {1,4}, then {5}, then {2} and finally {3}.
Complete Linkage or MAX or Clique
The first merge again uses the minimum of the original distance table, so {4, 5} is combined. Thereafter, the distance between clusters is the maximum over the member pairs:
dist({4,5}, {1}) = Maximum {13.60, 17.26} = 17.26
dist({4,5}, {2}) = Maximum {9.06, 13} = 13
dist({4,5}, {3}) = Maximum {5.66, 8.54} = 8.54
Clusters {4,5} 1 2 3
{4,5} - 17.26 13 8.54
1 - 5 9
2 - 5.83
3 -
The minimum entry is 5, so {1, 2} is combined. Next, dist({1,2}, {3}) = Maximum {9, 5.83} = 9 and dist({3}, {4,5}) = 8.54, so {3} joins {4, 5}. Finally, {1, 2} and {3, 4, 5} are merged at distance Maximum {17.26, 13} = 17.26.
The order of merging is {4,5}, then {1,2}, then {3,4,5}, and finally {1,2,3,4,5}.
Hint: The same procedure is used for the average-link algorithm, where the average distance over all pairs of points across the two clusters is used to form the clusters.
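In practice these steps are delegated to a library. A minimal sketch with SciPy (assuming scipy and matplotlib are installed) draws the dendrograms for the dataset above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9], [20, 8]])
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)          # Euclidean distances by default
    dendrogram(Z, labels=[1, 2, 3, 4, 5])  # leaf labels = item numbers
    plt.title(method + " linkage")
    plt.show()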
Consider the following data. Use the k-means algorithm with k = 2 and show the result.
Table Sample Data
SNO X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9
Solution
Let us assume the seed points are (3, 5) and (16, 9). This is shown in the following table
as the starting clusters.
Cluster 1 Cluster 2
(3,5) (16,9)
Iteration 1: Compare each data point with the two centroids and assign it to the nearest one.
Take sample object 2 = (7, 8) and compare it with the two centroids:
Dist(2, centroid 1) = sqrt((7-3)^2 + (8-5)^2) = sqrt(16 + 9) = sqrt(25) = 5
Dist(2, centroid 2) = sqrt((7-16)^2 + (8-9)^2) = sqrt(81 + 1) = sqrt(82) = 9.06
Object 2 is closer to the centroid of cluster 1 and hence is assigned to cluster 1. For object 3 = (12, 5):
Dist(3, centroid 1) = sqrt((12-3)^2 + (5-5)^2) = sqrt(81) = 9
Dist(3, centroid 2) = sqrt((12-16)^2 + (5-9)^2) = sqrt(16 + 16) = sqrt(32) = 5.66
Object 3 is closer to the centroid of cluster 2 and hence is assigned to cluster 2. The clusters after iteration 1 are:
Cluster 1 Cluster 2
(3,5) (12,5)
(7,8) (16,9)
The new centroids are (5, 6.5) and (14, 7).
Iteration 2: Compare each object with the new centroids. For object 2 = (7, 8):
Dist(2, centroid 1) = sqrt((7-5)^2 + (8-6.5)^2) = sqrt(6.25) = 2.5
Dist(2, centroid 2) = sqrt((7-14)^2 + (8-7)^2) = sqrt(49 + 1) = sqrt(50) = 7.07
Object 2 is closer to the centroid of cluster 1 and hence remains in the same cluster. Checking the remaining objects in the same way shows that no assignment changes, so the algorithm has converged with clusters {1, 2} and {3, 4}.
Partitioning clustering divides the data into non-hierarchical groups. It is also known as
the centroid-based method, and the most common example of partitioning clustering is the
k-means algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the number of
pre-defined groups. The cluster centroids are chosen so that each data point is nearer to its
own cluster centroid than to any other cluster centroid, and the quality of the partition is
measured by the sum of squared errors (SSE):
SSE = Σ (i = 1 to k) Σ (x ∈ Ci) dist(ci, x)^2
Here ci = centroid of the ith cluster Ci
x = a sample data point belonging to cluster Ci
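As a sketch of the whole procedure, the following plain-Python k-means (the function name is illustrative; empty clusters are not handled) reproduces the worked example above and reports its SSE:

import math

def kmeans(points, seeds, iters=10):
    centroids = [tuple(s) for s in seeds]
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # update step: each centroid becomes the mean of its cluster
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]
    # SSE: squared distance of every point to its final centroid
    sse = sum(math.dist(p, centroids[i]) ** 2
              for i, cl in enumerate(clusters) for p in cl)
    return clusters, centroids, sse

pts = [(3, 5), (7, 8), (12, 5), (16, 9)]
print(kmeans(pts, seeds=[(3, 5), (16, 9)]))
# clusters {1,2} and {3,4}, centroids (5, 6.5) and (14, 7), SSE = 28.5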
Density-Based Clustering
A cluster is a dense region of points, which is separated by low-density regions, from other
regions of high density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region
of high point density, separated from other such clusters by contiguous regions of low point
density.
There are three types of points after the DBSCAN clustering is complete:
Core — a point that has at least m points within distance n of itself.
Border — a point that has at least one core point within distance n but is not itself a core point.
Noise — a point that is neither a core point nor a border point; it has fewer than m points
within distance n and no core point within distance n.
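The three point types can be inspected with scikit-learn's DBSCAN (a sketch assuming scikit-learn is installed; eps plays the role of the distance n, min_samples of m, and the toy data are invented):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.0], [8, 8], [8.1, 8.2], [25, 80]])
db = DBSCAN(eps=0.5, min_samples=2).fit(X)

labels = db.labels_                 # -1 marks noise points
core = np.zeros_like(labels, dtype=bool)
core[db.core_sample_indices_] = True
print("labels:", labels)            # cluster id per point
print("core:", core)                # labeled non-core points are border points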
Grid-Based Approaches
A grid-based clustering method takes a space-driven approach by partitioning the embedding
space into cells, independent of the distribution of the input objects.
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the
object space into a finite number of cells that form a grid structure on which all of the
operations for clustering are performed.
The main advantage of the approach is its fast processing time, which is typically independent
of the number of data objects and depends only on the number of cells.
Subspace Clustering
CLIQUE is a density-based and grid-based subspace clustering algorithm, useful for finding
clusters in subspaces of high-dimensional data.
Concept of Dense cell
CLIQUE partitions each dimension into a number of non-overlapping intervals, thereby dividing
the data space into rectangular cells. The algorithm then determines whether each cell is dense
or sparse: a cell is considered dense if the number of points in it exceeds a threshold value.
Density is defined as the ratio of the number of points in a region to the volume of the region.
In one pass, the algorithm finds the cells and the number of points in each, and then combines
contiguous dense cells to form clusters.
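The dense-cell test can be sketched in a few lines of Python (NumPy assumed; the grid resolution and the threshold tau are invented parameters):

import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(loc=[5, 5], scale=0.5, size=(200, 2))  # synthetic 2-D data

# quantize the space into a 10 x 10 grid and count the points per cell
counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=10)
tau = 10                                  # density threshold per cell
dense_cells = np.argwhere(counts > tau)   # indices of the dense cells
print(dense_cells)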
MONOTONICITY Property
CLIQUE uses the anti-monotonicity (Apriori) property: all subsets of a frequent itemset are
frequent, and conversely, if a set is infrequent, then all of its supersets are infrequent.
In CLIQUE this means that if a cell is dense in a k-dimensional subspace, then its projections
onto every (k-1)-dimensional subspace are also dense.
Two popular probability model-based clustering methods are Gaussian Mixture Models (GMMs) and
Hidden Markov Models (HMMs). Besides these, there are other related methods:
1. Fuzzy clustering
2. The EM algorithm
Fuzzy Clustering :
Fuzzy Clustering is a type of clustering algorithm in machine learning that allows a data point to belong
to more than one cluster with different degrees of membership. Unlike traditional clustering algorithms,
such as k-means or hierarchical clustering, which assign each data point to a single cluster, fuzzy
clustering assigns a membership degree between 0 and 1 for each data point for each cluster.
Let us consider two clusters ci and cj; an element x can belong to both clusters. The strength
of the association of an object with cluster j is given by the weight wij. The value of wij lies
between 0 and 1, and the weights of an object across all clusters sum to 1.
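A sketch of how such weights arise, using the standard fuzzy c-means membership update with fuzzifier m = 2 (the helper name memberships is illustrative):

import numpy as np

def memberships(X, centroids, m=2):
    # distance of every point to every centroid, shape (n_points, n_clusters)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                # avoid division by zero at a centroid
    inv = d ** (-2 / (m - 1))
    return inv / inv.sum(axis=1, keepdims=True)   # each row sums to 1

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0]])
C = np.array([[1.0, 1.5], [8.0, 8.0]])
W = memberships(X, C)
print(W, W.sum(axis=1))   # every object's weights add up to 1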
Expectation Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a statistical method used for estimating
parameters in statistical models when you have incomplete or missing data. It's commonly used
in unsupervised machine learning tasks such as clustering and Gaussian Mixture Model (GMM)
fitting.
Given a mixture of distributions, data can be generated by randomly picking one of the
distributions and sampling a point from it. The Gaussian distribution is a bell-shaped curve.
1. Initialization: Start with initial estimates of the model parameters. These initial values can be
random or based on some prior knowledge.
2. E-step (Expectation):
In this step, you compute the expected values (expectation) of the latent (unobserved)
variables given the observed data and the current parameter estimates.
This involves calculating the posterior probabilities or likelihoods of the missing data or
latent variables.
Essentially, you're estimating how likely each possible value of the latent variable is,
given the current model parameters.
3. M-step (Maximization):
In this step, you update the model parameters to maximize the expected log-likelihood
found in the E-step.
This involves finding the parameters that make the observed data most likely given the
estimated values of the latent variables.
The M-step involves solving an optimization problem to find the new parameter values.
4. Iteration:
Repeat the E-step and M-step alternately until convergence criteria are met. Common
convergence criteria include a maximum number of iterations, a small change in
parameter values, or a small change in the likelihood.
5. Termination:
Once the EM algorithm converges, you have estimates of the model parameters that
maximize the likelihood of the observed data.
6. Result:
The final parameter estimates can be used for various purposes, such as clustering,
density estimation, or imputing missing data.
The EM algorithm is widely used in various fields, including machine learning, image
processing, and bioinformatics.
One of its notable applications is in Gaussian Mixture Models (GMMs), where it's used to
estimate the means and covariances of Gaussian distributions that are mixed to model
complex data distributions.
It's important to note that the EM algorithm can sometimes get stuck in local optima, so the
choice of initial parameter values can affect the results. To mitigate this, you may run the
algorithm multiple times with different initializations and select the best result.
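A minimal EM sketch for a two-component one-dimensional Gaussian mixture makes the E-step and M-step concrete (the data and initialization are invented for illustration):

import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])

# initialization: rough guesses for the weights, means and variances
w, mu, var = np.array([0.5, 0.5]), np.array([x.min(), x.max()]), np.array([1.0, 1.0])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibility of each component for each point
    r = w * gauss(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(w, mu, var)   # should recover weights ~0.5/0.5, means ~0/5, variances ~1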
Cluster quality can be scored as a weighted combination of the two criteria, α x cohesion + β x separation, where α and β are weighting parameters. The Dunn index is a useful measure that combines both cohesion and separation: it is the ratio of the minimum inter-cluster distance to the maximum cluster diameter, so higher values indicate better clustering.
Silhouette Coefficient
This metric measures how well each data point fits into its assigned cluster and ranges from -1 to 1.
For a point, let a be the average distance to the other points in its own cluster and b the average
distance to the points in the nearest neighbouring cluster; the silhouette value is s = (b - a) / max(a, b).
A high silhouette coefficient indicates that the data points are well clustered, while a low
coefficient indicates that data points may have been assigned to the wrong cluster.
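A sketch with scikit-learn (assuming it is installed), applied to the small k-means dataset used earlier:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9]], dtype=float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # closer to 1 means better-separated clusters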